1. 25 Aug, 2022 5 commits
    • selftests/net: Add test for timing a bind request to a port with a populated bhash entry · c35ecb95
      Joanne Koong authored
      This test populates the bhash table for a given port with
      MAX_THREADS * MAX_CONNECTIONS sockets, and then times how long
      a bind request on the port takes.
      
      When populating the bhash table, we create the sockets and then bind
      the sockets to the same address and port (SO_REUSEADDR and SO_REUSEPORT
      are set). When timing how long a bind on the port takes, we bind on a
      different address without SO_REUSEPORT set. We do not set SO_REUSEPORT
      because we are interested in the case where the bind request does not
      go through the tb->fastreuseport path, which is fragile (eg
      tb->fastreuseport path does not work if binding with a different uid).
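
The measurement described above can be approximated from user space. The following is a minimal sketch, not the selftest itself: it assumes Linux (where every 127.0.0.0/8 address is local and SO_REUSEPORT exists), uses a kernel-chosen port instead of 443 so it runs without privileges, and the function name and parameters are invented for illustration.

```python
import socket
import time


def time_bind_on_populated_port(n_sockets=64, host="127.0.0.1", other_host="127.0.0.2"):
    """Populate one port with n_sockets bound sockets, then time one more bind.

    Returns (port, elapsed_seconds). Hypothetical helper, Linux only.
    """
    opened = []
    port = 0  # let the kernel pick a free port for the first socket
    try:
        for _ in range(n_sockets):
            s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
            opened.append(s)
            s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
            s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
            s.bind((host, port))
            port = s.getsockname()[1]  # reuse the same port from now on

        # Timed bind: same port, different address, SO_REUSEPORT left unset.
        probe = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        opened.append(probe)
        start = time.perf_counter()
        probe.bind((other_host, port))
        elapsed = time.perf_counter() - start
        return port, elapsed
    finally:
        for s in opened:
            s.close()
```

Calling `time_bind_on_populated_port(n_sockets=1000)` gives a rough feel for how the bind time scales with the number of sockets on the port; absolute numbers will differ from the ones quoted in this commit.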
      
      To run the script:
          Usage: ./bind_bhash.sh [-6 | -4] [-p port] [-a address]
              6: use ipv6
              4: use ipv4
              port: Port number
              address: ip address
      
      Without any arguments, ./bind_bhash.sh defaults to ipv6 using ip address
      "2001:0db8:0:f101::1" on port 443.
      
      On my local machine, I see:
      ipv4:
      before - 0.002317 seconds
      with bhash2 - 0.000020 seconds
      
      ipv6:
      before - 0.002431 seconds
      with bhash2 - 0.000021 seconds
      Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: Add a bhash2 table hashed by port and address · 28044fc1
      Joanne Koong authored
      The current bind hashtable (bhash) is hashed by port only.
      In the socket bind path, we have to check for bind conflicts by
      traversing the specified port's inet_bind_bucket while holding the
      hashbucket's spinlock (see inet_csk_get_port() and
      inet_csk_bind_conflict()). In instances where there are tons of
      sockets hashed to the same port at different addresses, the bind
      conflict check is time-intensive and can cause softirq cpu lockups
      as well as stall new tcp connections, since __inet_inherit_port()
      also contends for the spinlock.
      
      This patch adds a second bind table, bhash2, that hashes by
      port and sk->sk_rcv_saddr (ipv4) and sk->sk_v6_rcv_saddr (ipv6).
      Searching the bhash2 table leads to significantly faster conflict
      resolution and less time holding the hashbucket spinlock.
      
      Please note a few things:
      * A socket's address can change after it has been bound. There are
      two cases where this happens:
      
        1) The case where there is a bind() call on INADDR_ANY (ipv4) or
        IPV6_ADDR_ANY (ipv6) and then a connect() call. The kernel will
        assign the socket an address when it handles the connect().
      
        2) In inet_sk_reselect_saddr(), which is called when rebuilding the
        sk header and a few pre-conditions are met (eg rerouting fails).
      
      In these two cases, we need to update the bhash2 table by removing the
      entry for the old address and adding a new entry reflecting the updated
      address.
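
The remove-then-add update can be pictured with a toy table keyed by (port, address). This is only an illustrative model: the class, its method names, and the string socket IDs are hypothetical and do not mirror the kernel's data structures.

```python
from collections import defaultdict

ANY = "0.0.0.0"  # stand-in for INADDR_ANY


class Bhash2:
    """Toy model of a table keyed by (port, address). Not the kernel's layout."""

    def __init__(self):
        self.buckets = defaultdict(set)

    def add(self, sock_id, port, addr):
        self.buckets[(port, addr)].add(sock_id)

    def update_addr(self, sock_id, port, old_addr, new_addr):
        # Remove the entry for the old address...
        old = self.buckets[(port, old_addr)]
        old.discard(sock_id)
        if not old:
            del self.buckets[(port, old_addr)]
        # ...and add one that reflects the address the socket now uses.
        self.buckets[(port, new_addr)].add(sock_id)


table = Bhash2()
table.add("sk1", 443, ANY)                      # bind() on the wildcard address
table.update_addr("sk1", 443, ANY, "10.0.0.5")  # connect() picked a source address
```

After the update, a lookup for (443, "10.0.0.5") finds the socket and the wildcard entry is gone.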
      
      * The bhash2 table must have its own lock, even though concurrent
      accesses on the same port are protected by the bhash lock. Bhash2 must
      have its own lock to protect against cases where sockets on different
      ports hash to different bhash hashbuckets but to the same bhash2
      hashbucket.
      
      This brings up a few stipulations:
        1) When acquiring both the bhash and the bhash2 lock, the bhash2 lock
        will always be acquired after the bhash lock and released before the
        bhash lock is released.
      
        2) There are no nested bhash2 hashbucket locks. A bhash2 lock is always
        acquired+released before another bhash2 lock is acquired+released.
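
The two stipulations amount to a fixed lock order. A minimal sketch using Python threading locks as stand-ins for the hashbucket spinlocks (the function names and the `events` trace are invented for illustration):

```python
import threading

bhash_lock = threading.Lock()    # stand-in for one bhash hashbucket lock
bhash2_lock = threading.Lock()   # stand-in for one bhash2 hashbucket lock
events = []


def update_both_tables(sock_id):
    # Outer lock first, inner lock second. Leaving the nested "with" blocks
    # releases bhash2 before bhash, matching stipulation 1.
    with bhash_lock:
        events.append("bhash acquired")
        with bhash2_lock:
            events.append("bhash2 acquired")
            # ... modify the bhash2 bucket for sock_id here ...
        events.append("bhash2 released")
    events.append("bhash released")


def move_between_bhash2_buckets(old_lock, new_lock):
    # Stipulation 2: never hold two bhash2 locks at once. Finish with the
    # old bucket before touching the new one.
    with old_lock:
        pass  # unlink from the old bucket
    with new_lock:
        pass  # link into the new bucket
```

Because every path takes the locks in the same order and never nests two bhash2 locks, no two threads can wait on each other in a cycle.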
      
      * The bhash table cannot be superseded by the bhash2 table because for
      bind requests on INADDR_ANY (ipv4) or IPV6_ADDR_ANY (ipv6), every socket
      bound to that port must be checked for a potential conflict. The bhash
      table is the only source of port->socket associations.
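
A small model shows why both tables are kept. It is a sketch under simplifying assumptions: plain dictionaries stand in for the hash tables, `bind` and `candidates` are hypothetical names, and the extra look at the wildcard bucket for a specific-address bind is an assumption of this model rather than something stated in the commit message.

```python
from collections import defaultdict

ANY = "0.0.0.0"  # stand-in for INADDR_ANY

bhash = defaultdict(list)    # port -> [(sock_id, addr), ...]
bhash2 = defaultdict(list)   # (port, addr) -> [sock_id, ...]


def bind(sock_id, port, addr):
    bhash[port].append((sock_id, addr))
    bhash2[(port, addr)].append(sock_id)


def candidates(port, addr):
    """Return the sockets a conflict check would have to inspect."""
    if addr == ANY:
        # A wildcard bind can clash with any address on the port, so every
        # socket on the port is a candidate. Only bhash can enumerate them.
        return [sock_id for sock_id, _ in bhash[port]]
    # A specific address only needs its own bucket plus the wildcard bucket.
    return bhash2.get((port, addr), []) + bhash2.get((port, ANY), [])


# 1000 sockets on port 443, each at a distinct address.
for i in range(1000):
    bind(f"sk{i}", 443, f"10.0.{i // 250}.{i % 250 + 1}")
```

Here `candidates(443, "10.0.0.1")` inspects a single socket, while `candidates(443, ANY)` has to walk all 1000 entries for the port.
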
      Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • netlink: fix some kernel-doc comments · 0bf73255
      Zhengchao Shao authored
      Fix the kernel-doc comments for the input parameters of the nlmsg_
      and nla_ functions.
      Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com>
      Link: https://lore.kernel.org/r/20220824013621.365103-1-shaozhengchao@huawei.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: ethernet: ti: davinci_mdio: fix build for mdio bitbang uses · 35bbe652
      Randy Dunlap authored
      davinci_mdio.c uses mdio bitbang APIs, so it should select
      MDIO_BITBANG to prevent build errors.
      
      arm-linux-gnueabi-ld: drivers/net/ethernet/ti/davinci_mdio.o: in function `davinci_mdio_remove':
      drivers/net/ethernet/ti/davinci_mdio.c:649: undefined reference to `free_mdio_bitbang'
      arm-linux-gnueabi-ld: drivers/net/ethernet/ti/davinci_mdio.o: in function `davinci_mdio_probe':
      drivers/net/ethernet/ti/davinci_mdio.c:545: undefined reference to `alloc_mdio_bitbang'
      arm-linux-gnueabi-ld: drivers/net/ethernet/ti/davinci_mdio.o: in function `davinci_mdiobb_read':
      drivers/net/ethernet/ti/davinci_mdio.c:236: undefined reference to `mdiobb_read'
      arm-linux-gnueabi-ld: drivers/net/ethernet/ti/davinci_mdio.o: in function `davinci_mdiobb_write':
      drivers/net/ethernet/ti/davinci_mdio.c:253: undefined reference to `mdiobb_write'
      
      Fixes: d04807b8 ("net: ethernet: ti: davinci_mdio: Add workaround for errata i2329")
      Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
      Cc: Grygorii Strashko <grygorii.strashko@ti.com>
      Cc: Ravi Gunasekaran <r-gunasekaran@ti.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Paolo Abeni <pabeni@redhat.com>
      Cc: Naresh Kamboju <naresh.kamboju@linaro.org>
      Cc: Sudip Mukherjee (Codethink) <sudipm.mukherjee@gmail.com>
      Link: https://lore.kernel.org/r/20220824024216.4939-1-rdunlap@infradead.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • Documentation: sysctl: align cells in second content column · 1faa3467
      Bagas Sanjaya authored
      Stephen Rothwell reported an htmldocs warning when merging the net-next tree:
      
      Documentation/admin-guide/sysctl/net.rst:37: WARNING: Malformed table.
      Text in column margin in table line 4.
      
      ========= =================== = ========== ==================
      Directory Content               Directory  Content
      ========= =================== = ========== ==================
      802       E802 protocol         mptcp     Multipath TCP
      appletalk Appletalk protocol    netfilter Network Filter
      ax25      AX25                  netrom     NET/ROM
      bridge    Bridging              rose      X.25 PLP layer
      core      General parameter     tipc      TIPC
      ethernet  Ethernet protocol     unix      Unix domain sockets
      ipv4      IP version 4          x25       X.25 protocol
      ipv6      IP version 6
      ========= =================== = ========== ==================
      
      The warning above is caused by cells in the second "Content" column of
      the /proc/sys/net subdirectory table, which fall in the column margin.
      
      Align these cells against the column header to fix the warning.
      
      Link: https://lore.kernel.org/linux-next/20220823134905.57ed08d5@canb.auug.org.au/
      Fixes: 1202cdd6 ("Remove DECnet support from kernel")
      Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: Bagas Sanjaya <bagasdotme@gmail.com>
      Link: https://lore.kernel.org/r/20220824035804.204322-1-bagasdotme@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  2. 24 Aug, 2022 27 commits
  3. 23 Aug, 2022 8 commits
    • docs: netlink: basic introduction to Netlink · 510156a7
      Jakub Kicinski authored
      Provide a bit of a brain dump of netlink related information
      as documentation. Hopefully this will be useful to people
      trying to navigate implementing YAML based parsing in languages
      we won't be able to help with.
      
      I started writing this doc while trying to figure out what
      it'd take to widen the applicability of YAML to good old rtnl,
      but the doc grew beyond that as it usually happens.
      
      In all honesty a lot of this information is new to me as I usually
      follow the "copy an existing example, drink to forget" process
      of writing netlink user space, so reviews will be much appreciated.
      Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
      Acked-by: Jonathan Corbet <corbet@lwn.net>
      Link: https://lore.kernel.org/r/20220819200221.422801-2-kuba@kernel.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: improve and fix netlink kdoc · 30b60554
      Jakub Kicinski authored
      A subsequent patch will render the kdoc from
      include/uapi/linux/netlink.h into Documentation.
      We need to fix the warnings. While at it, move
      the comments on struct nlmsghdr to a proper
      kdoc comment.
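
For readers unfamiliar with struct nlmsghdr, its five fields can be packed from user space as in the sketch below. The helper names are invented for illustration, and payload padding to a 4-byte boundary (NLMSG_ALIGN) is deliberately left out.

```python
import struct

# struct nlmsghdr from include/uapi/linux/netlink.h, in field order:
# nlmsg_len (u32), nlmsg_type (u16), nlmsg_flags (u16),
# nlmsg_seq (u32), nlmsg_pid (u32). Native byte order, no padding.
NLMSGHDR = struct.Struct("=IHHII")

NLM_F_REQUEST = 0x01


def pack_nlmsghdr(msg_type, flags, seq, pid, payload=b""):
    """Prepend a netlink header whose length field covers header plus payload."""
    length = NLMSGHDR.size + len(payload)
    return NLMSGHDR.pack(length, msg_type, flags, seq, pid) + payload


def unpack_nlmsghdr(buf):
    """Return (len, type, flags, seq, pid) from the start of buf."""
    return NLMSGHDR.unpack_from(buf)
```

The header is 16 bytes long, so a message with no payload has nlmsg_len equal to 16.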
      
      Link: https://lore.kernel.org/r/20220819200221.422801-1-kuba@kernel.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: ftmac100: set max_mtu to allow DSA overhead setting · 6c2c782f
      Sergei Antonov authored
      When ftmac100 is used with a DSA switch, Linux wants to set the MTU
      to 1504 to accommodate DSA overhead. With the default max_mtu,
      this fails with the error message:
       ftmac100 92000000.mac eth0: error -22 setting MTU to 1504 to include DSA overhead
      
      ftmac100 supports packet length 1518 (MAX_PKT_SIZE constant), so it is
      safe to report it in max_mtu.
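
The arithmetic behind the error can be sketched as follows. The 1500-byte default is the value ether_setup() assigns to max_mtu; the 4-byte overhead is inferred from the 1504 in the log line; `validate_mtu` is a hypothetical stand-in for the kernel's range check, not its actual code.

```python
import errno

ETH_DATA_LEN = 1500   # default max_mtu that ether_setup() assigns
MAX_PKT_SIZE = 1518   # packet length ftmac100 supports, per the commit message
DSA_OVERHEAD = 4      # inferred from the log line: 1504 - 1500


def validate_mtu(new_mtu, max_mtu):
    """Return 0 if the MTU is acceptable, else a negative errno."""
    if new_mtu > max_mtu:
        return -errno.EINVAL
    return 0


wanted = ETH_DATA_LEN + DSA_OVERHEAD
```

With the default limit, `validate_mtu(wanted, ETH_DATA_LEN)` returns -22, the error seen in the log line; with `max_mtu` raised to `MAX_PKT_SIZE` the same request returns 0.
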
      Signed-off-by: Sergei Antonov <saproj@gmail.com>
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Link: https://lore.kernel.org/r/20220821160844.474277-1-saproj@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • Merge branch 'dsa-changes-for-multiple-cpu-ports-part-3' · 52412f55
      Paolo Abeni authored
      Vladimir Oltean says:
      
      ====================
      DSA changes for multiple CPU ports (part 3)
      
      Those who have been following part 1:
      https://patchwork.kernel.org/project/netdevbpf/cover/20220511095020.562461-1-vladimir.oltean@nxp.com/
      and part 2:
      https://patchwork.kernel.org/project/netdevbpf/cover/20220521213743.2735445-1-vladimir.oltean@nxp.com/
      will know that I am trying to enable the second internal port pair from
      the NXP LS1028A Felix switch for DSA-tagged traffic via "ocelot-8021q".
      This series represents part 3 of that effort.
      
      Covered here are some preparations in DSA for handling multiple DSA
      masters:
      - when changing the tagging protocol via sysfs
      - when the masters go down
      as well as preparation for monitoring the upper devices of a DSA master
      (to support DSA masters under a LAG).
      
      There are also 2 small preparations for the ocelot driver, for the case
      where multiple tag_8021q CPU ports are used in a LAG. Both those changes
      have to do with PGID forwarding domains.
      
      Compared to v1, the patches were trimmed down to just another
      preparation stage, and the UAPI changes were pushed further out to part 4.
      https://patchwork.kernel.org/project/netdevbpf/cover/20220523104256.3556016-1-olteanv@gmail.com/
      
      Compared to v2, I had to export a symbol I forgot to
      (ocelot_port_teardown_dsa_8021q_cpu), to avoid a build breakage when the
      felix and seville drivers are built as modules.
      ====================
      
      Link: https://lore.kernel.org/r/20220819174820.3585002-1-vladimir.oltean@nxp.com
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    • net: mscc: ocelot: adjust forwarding domain for CPU ports in a LAG · 291ac151
      Vladimir Oltean authored
      Currently when we have 2 CPU ports configured for DSA tag_8021q mode and
      we put them in a LAG, a PGID dump looks like this:
      
      PGID_SRC[0] = ports 4,
      PGID_SRC[1] = ports 4,
      PGID_SRC[2] = ports 4,
      PGID_SRC[3] = ports 4,
      PGID_SRC[4] = ports 0, 1, 2, 3, 4, 5,
      PGID_SRC[5] = no ports
      
      (ports 0-3 are user ports, ports 4 and 5 are CPU ports)
      
      There are 2 problems with the configuration above:
      
      - user ports should enable forwarding towards both CPU ports, not just 4,
        and the aggregation PGIDs should prune one CPU port or the other from
        the destination port mask, based on a hash computed from packet headers.
      
      - CPU ports should not be allowed to forward towards themselves and also
        not towards other ports in the same LAG as themselves
      
      The first problem requires fixing up the PGID_SRC of user ports, when
      ocelot_port_assigned_dsa_8021q_cpu_mask() is called. We need to say that
      when a user port is assigned to a tag_8021q CPU port and that port is in
      a LAG, it should forward towards all ports in that LAG.
      
      The second problem requires fixing up the PGID_SRC of port 4, to remove
      ports 4 and 5 (in a LAG) from the allowed destinations.
      
      After this change, the PGID source masks look as follows:
      
      PGID_SRC[0] = ports 4, 5,
      PGID_SRC[1] = ports 4, 5,
      PGID_SRC[2] = ports 4, 5,
      PGID_SRC[3] = ports 4, 5,
      PGID_SRC[4] = ports 0, 1, 2, 3,
      PGID_SRC[5] = no ports
      
      Note that PGID_SRC[5] still looks weird (it should say "0, 1, 2, 3" just
      like PGID_SRC[4] does), but I've tested forwarding through this CPU port
      and it doesn't seem like anything is affected (it appears that PGID_SRC[4]
      is being looked up on forwarding from the CPU, since both ports 4 and 5
      have logical port ID 4). The reason why it looks weird is because
      we've never called ocelot_port_assign_dsa_8021q_cpu() for any user port
      towards port 5 (all user ports are assigned to port 4 which is in a LAG
      with 5).
      
      Since things aren't broken, I'm willing to leave it like that for now
      and just document the oddity.
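
The resulting source masks can be reproduced with a simplified model. This sketch is not the driver's logic: `pgid_src_masks` and its parameters are hypothetical, and it only encodes the two rules stated in this commit message.

```python
def pgid_src_masks(user_ports, cpu_ports, lag_of, assigned_cpu):
    """Compute the allowed destination ports for each source port.

    lag_of maps a port to the set of ports in its LAG (itself included).
    assigned_cpu maps each user port to the CPU port it is assigned to.
    """
    masks = {port: set() for port in user_ports + cpu_ports}
    for user, cpu in assigned_cpu.items():
        # Problem 1: a user port forwards towards every port in the LAG of
        # the CPU port it is assigned to.
        masks[user] |= lag_of[cpu]
        # The assigned CPU port may forward back to that user port.
        masks[cpu].add(user)
    for cpu in cpu_ports:
        # Problem 2: a CPU port never forwards to itself or its LAG peers.
        masks[cpu] -= lag_of[cpu]
    return masks


lag = {4, 5}
masks = pgid_src_masks(
    user_ports=[0, 1, 2, 3],
    cpu_ports=[4, 5],
    lag_of={4: lag, 5: lag},
    assigned_cpu={0: 4, 1: 4, 2: 4, 3: 4},
)
```

In this model port 5 ends up with an empty mask for the same reason given in the message: no user port is ever assigned to it directly.
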
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    • net: mscc: ocelot: set up tag_8021q CPU ports independent of user port affinity · 36a0bf44
      Vladimir Oltean authored
      This is a partial revert of commit c295f983 ("net: mscc: ocelot:
      switch from {,un}set to {,un}assign for tag_8021q CPU ports"), because
      as it turns out, this isn't how tag_8021q CPU ports under a LAG are
      supposed to work.
      
      Under that scenario, all user ports are "assigned" to the single
      tag_8021q CPU port represented by the logical port corresponding to the
      bonding interface. So one CPU port in a LAG would have is_dsa_8021q_cpu
      set to true (the one whose physical port ID is equal to the logical port
      ID), and the other one to false.
      
      In turn, this makes 2 undesirable things happen:
      
      (1) PGID_CPU contains only the first physical CPU port, rather than both
      (2) only the first CPU port will be added to the private VLANs used by
          ocelot for VLAN-unaware bridging
      
      To make the driver behave in the same way for both bonded CPU ports, we
      need to bring back the old concept of setting up a port as a tag_8021q
      CPU port, and this is what deals with VLAN membership and PGID_CPU
      updating. But we also need the CPU port "assignment" (the user to CPU
      port affinity), and this is what updates the PGID_SRC forwarding rules.
      
      All DSA CPU ports are statically configured for tag_8021q mode when the
      tagging protocol is changed to ocelot-8021q. User ports are "assigned"
      to one CPU port or the other dynamically (this will be handled by a
      future change).
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    • net: dsa: use dsa_tree_for_each_cpu_port in dsa_tree_{setup,teardown}_master · 5dc760d1
      Vladimir Oltean authored
      More logic will be added to dsa_tree_setup_master() and
      dsa_tree_teardown_master() in upcoming changes.
      
      Reduce the indentation by one level in these functions by introducing
      and using a dedicated iterator for CPU ports of a tree.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    • net: dsa: all DSA masters must be down when changing the tagging protocol · f41ec1fd
      Vladimir Oltean authored
      The fact that the tagging protocol is set and queried from the
      /sys/class/net/<dsa-master>/dsa/tagging file is a bit of a quirk from
      the single CPU port days which isn't aging very well now that DSA can
      have more than a single CPU port. This is because the tagging protocol
      is a switch property, yet in the presence of multiple CPU ports it can
      be queried and set from multiple sysfs files, all of which are handled
      by the same implementation.
      
      The current logic ensures that the net device whose sysfs file we're
      changing the tagging protocol through must be down. That net device is
      the DSA master, and this is fine for single DSA master / CPU port setups.
      
      But exactly because the tagging protocol is per switch [ tree, in fact ]
      and not per DSA master, this isn't fine any longer with multiple CPU
      ports, and we must iterate through the tree and find all DSA masters,
      and make sure that all of them are down.
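
The tree-wide check can be sketched as below. Everything here is illustrative: the `Port` class, the flat-list tree, and the function names are hypothetical, and returning -EBUSY is an assumption about the error code rather than something this commit message states.

```python
import errno
from dataclasses import dataclass


@dataclass
class Port:
    is_cpu: bool
    master_up: bool = False  # only meaningful for CPU ports


def cpu_ports(tree):
    """Yield only the CPU ports of a switch tree (a flat list in this model)."""
    return (port for port in tree if port.is_cpu)


def may_change_tag_protocol(tree):
    """Return 0 if every DSA master in the tree is down, else -EBUSY."""
    for port in cpu_ports(tree):
        if port.master_up:
            return -errno.EBUSY
    return 0
```

Checking only the master whose sysfs file was written would miss a second master that is still up, which is the gap this change closes.
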
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>