1. 25 Aug, 2022 8 commits
    • Documentation: devlink: fix the locking section · 77a70f9c
      Jiri Pirko authored
      As all callbacks are converted now, fix the text to reflect that change.
      Suggested-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: Jiri Pirko <jiri@nvidia.com>
      Link: https://lore.kernel.org/r/20220823070213.1008956-1-jiri@resnulli.us
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • Merge branch 'add-a-second-bind-table-hashed-by-port-and-address' · c21e1bf4
      Jakub Kicinski authored
      Joanne Koong says:
      
      ====================
      Add a second bind table hashed by port and address
      
      Currently, there is one bind hashtable (bhash) that hashes by port only.
      This patchset adds a second bind table (bhash2) that hashes by port and
      address.
      
      The motivation for adding bhash2 is to expedite bind requests when a port
      has many sockets in its bhash table entry (e.g. a large number of sockets
      bound to different addresses on the same port). In that situation, checking
      for bind conflicts is costly, especially because the table entry spinlock is
      held during the check, which can cause softirq CPU lockups and prevent new
      TCP connections.
      
      We ran into this problem at Meta, where the traffic team binds a large number
      of IPs to port 443. The bind() call took a significant amount of time, which
      led to softirq CPU lockups that caused packet drops and other failures on the
      machine.
      
      Testing this experimentally on a local server with ~24k sockets bound to
      the port gave these results:
      
      ipv4:
      before - 0.002317 seconds
      with bhash2 - 0.000020 seconds
      
      ipv6:
      before - 0.002431 seconds
      with bhash2 - 0.000021 seconds
      
      The additions to the initial bhash2 submission [0] are:
      * Updating bhash2 in the cases where a socket's rcv saddr changes after it has
        been bound
      * Adding locks for bhash2 hashbuckets
      
      [0] https://lore.kernel.org/netdev/20220520001834.2247810-1-kuba@kernel.org/
      
      v3: https://lore.kernel.org/netdev/20220722195406.1304948-2-joannelkoong@gmail.com/
      v2: https://lore.kernel.org/netdev/20220712235310.1935121-1-joannelkoong@gmail.com/
      v1: https://lore.kernel.org/netdev/20220623234242.2083895-2-joannelkoong@gmail.com/
      ====================
      
      Link: https://lore.kernel.org/r/20220822181023.3979645-1-joannelkoong@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • selftests/net: Add sk_bind_sendto_listen and sk_connect_zero_addr · 1be9ac87
      Joanne Koong authored
      This patch adds 2 new tests: sk_bind_sendto_listen and
      sk_connect_zero_addr.
      
      The sk_bind_sendto_listen test exercises the path where a socket's
      rcv saddr changes after it has been added to the binding tables,
      and then a listen() on the socket is invoked. The listen() should
      succeed.
      
      The sk_bind_sendto_listen test is copied over from one of syzbot's
      tests: https://syzkaller.appspot.com/x/repro.c?x=1673a38df00000
      
      The sk_connect_zero_addr test exercises the path where the socket was
      never previously added to the binding tables and it gets assigned a
      saddr upon a connect() to address 0.
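
The address change that both tests revolve around can be observed from user space. The sketch below is not one of the selftests. It is a minimal illustration that assumes an IPv4 loopback interface, and the helper name `observe_saddr_change` is hypothetical. It binds a TCP socket to the wildcard address, connects it to a local listener, and reports the socket's local address before and after the connect().

```python
import socket


def observe_saddr_change():
    """Bind to the wildcard address, connect, and return the local address before and after."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the kernel pick a free port
    listener.listen(1)

    client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    client.bind(("0.0.0.0", 0))  # bound to a port, but with no specific source address
    before = client.getsockname()

    # connect() forces the kernel to choose a concrete source address.
    client.connect(listener.getsockname())
    after = client.getsockname()

    client.close()
    listener.close()
    return before, after


if __name__ == "__main__":
    before, after = observe_saddr_change()
    print("before connect:", before)
    print("after connect: ", after)
```

The local port stays the same across the connect() while the address changes, which is the situation where a table keyed by port and address has to be updated.
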
      Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • selftests/net: Add test for timing a bind request to a port with a populated bhash entry · c35ecb95
      Joanne Koong authored
      This test populates the bhash table for a given port with
      MAX_THREADS * MAX_CONNECTIONS sockets, and then times how long
      a bind request on the port takes.
      
      When populating the bhash table, we create the sockets and then bind
      the sockets to the same address and port (SO_REUSEADDR and SO_REUSEPORT
      are set). When timing how long a bind on the port takes, we bind on a
      different address without SO_REUSEPORT set. We do not set SO_REUSEPORT
      because we are interested in the case where the bind request does not
      go through the tb->fastreuseport path, which is fragile (e.g. the
      tb->fastreuseport path does not work if binding with a different uid).
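
The measurement described above can be approximated at a much smaller scale from user space. This is a rough sketch, not the selftest itself. It assumes Linux (where SO_REUSEPORT is available and the whole 127.0.0.0/8 range is local), and the function name, socket count, and addresses are illustrative choices.

```python
import socket
import time


def time_bind_on_populated_port(n_sockets=200,
                                populated_addr="127.0.0.1",
                                timed_addr="127.0.0.2"):
    """Bind n_sockets to one address and port, then time one bind to another address."""
    populated = []
    port = 0  # the first bind lets the kernel choose; later binds reuse that port
    for _ in range(n_sockets):
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
        s.bind((populated_addr, port))
        port = s.getsockname()[1]
        populated.append(s)

    # The timed socket sets no reuse options, so the bind has to go through
    # the regular conflict check against the sockets already on the port.
    probe = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    start = time.perf_counter()
    probe.bind((timed_addr, port))
    elapsed = time.perf_counter() - start

    count = len(populated)
    probe.close()
    for s in populated:
        s.close()
    return count, elapsed


if __name__ == "__main__":
    count, elapsed = time_bind_on_populated_port()
    print(f"bind with {count} sockets on the port took {elapsed:.6f} seconds")
```

With only a few hundred sockets, the timing is dominated by noise. The point is the shape of the experiment, not the numbers.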
      
      To run the script:
          Usage: ./bind_bhash.sh [-6 | -4] [-p port] [-a address]
              6: use ipv6
              4: use ipv4
              port: Port number
              address: ip address
      
      Without any arguments, ./bind_bhash.sh defaults to ipv6 using ip address
      "2001:0db8:0:f101::1" on port 443.
      
      On my local machine, I see:
      ipv4:
      before - 0.002317 seconds
      with bhash2 - 0.000020 seconds
      
      ipv6:
      before - 0.002431 seconds
      with bhash2 - 0.000021 seconds
      Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: Add a bhash2 table hashed by port and address · 28044fc1
      Joanne Koong authored
      The current bind hashtable (bhash) is hashed by port only.
      In the socket bind path, we have to check for bind conflicts by
      traversing the specified port's inet_bind_bucket while holding the
      hashbucket's spinlock (see inet_csk_get_port() and
      inet_csk_bind_conflict()). When many sockets are hashed to the same
      port at different addresses, the bind conflict check is
      time-intensive and can cause softirq CPU lockups. It also stalls
      new TCP connections, since __inet_inherit_port() contends for the
      same spinlock.
      
      This patch adds a second bind table, bhash2, that hashes by
      port and sk->sk_rcv_saddr (ipv4) or sk->sk_v6_rcv_saddr (ipv6).
      Searching the bhash2 table leads to significantly faster conflict
      resolution and less time holding the hashbucket spinlock.
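
The difference in lookup cost can be illustrated with a toy model. This is a sketch only, not the kernel's data structures: the class and method names are hypothetical, it ignores SO_REUSEADDR, SO_REUSEPORT, and IPv6, and it treats any socket on the same port with the same or a wildcard address as a conflict.

```python
WILDCARD = "0.0.0.0"


class ToyBindTables:
    """Toy model: bhash is keyed by port, bhash2 by (port, address)."""

    def __init__(self):
        self.bhash = {}   # port -> list of bound addresses
        self.bhash2 = {}  # (port, addr) -> list of bound addresses

    def add(self, port, addr):
        self.bhash.setdefault(port, []).append(addr)
        self.bhash2.setdefault((port, addr), []).append(addr)

    def conflict_via_bhash(self, port, addr):
        """Walk every socket on the port; return (conflict, sockets examined)."""
        examined = 0
        for bound in self.bhash.get(port, []):
            examined += 1
            if bound == addr or WILDCARD in (bound, addr):
                return True, examined
        return False, examined

    def conflict_via_bhash2(self, port, addr):
        """Probe only the buckets that can hold a conflicting socket."""
        if addr == WILDCARD:
            # A wildcard bind can conflict with any address on the port,
            # so this toy model falls back to the port-only table.
            return self.conflict_via_bhash(port, addr)
        for key in ((port, addr), (port, WILDCARD)):
            if self.bhash2.get(key):
                return True, 1
        return False, 0


if __name__ == "__main__":
    tables = ToyBindTables()
    for i in range(1000):
        tables.add(443, f"10.0.{i // 250}.{i % 250 + 1}")
    print(tables.conflict_via_bhash(443, "192.0.2.1"))
    print(tables.conflict_via_bhash2(443, "192.0.2.1"))
```

In the no-conflict case, the port-only scan touches every socket on the port, while the (port, address) lookup touches none of them.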
      
      Please note a few things:
      * A socket's address can change after it has been bound. There are
      two cases where this happens:
      
        1) The case where there is a bind() call on INADDR_ANY (ipv4) or
        IPV6_ADDR_ANY (ipv6) and then a connect() call. The kernel will
        assign the socket an address when it handles the connect()
      
        2) In inet_sk_reselect_saddr(), which is called when rebuilding the
        sk header and a few pre-conditions are met (eg rerouting fails).
      
      In these two cases, we need to update the bhash2 table by removing the
      entry for the old address and adding a new entry reflecting the updated
      address.
      
      * The bhash2 table must have its own lock, even though concurrent
      accesses on the same port are protected by the bhash lock. Bhash2 must
      have its own lock to protect against cases where sockets on different
      ports hash to different bhash hashbuckets but to the same bhash2
      hashbucket.
      
      This brings up a few stipulations:
        1) When acquiring both the bhash and the bhash2 lock, the bhash2 lock
        will always be acquired after the bhash lock and released before the
        bhash lock is released.
      
        2) There are no nested bhash2 hashbucket locks. A bhash2 lock is always
        acquired+released before another bhash2 lock is acquired+released.
      
      * The bhash table cannot be superseded by the bhash2 table because for
      bind requests on INADDR_ANY (ipv4) or IPV6_ADDR_ANY (ipv6), every socket
      bound to that port must be checked for a potential conflict. The bhash
      table is the only source of port->socket associations.
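
The two locking stipulations can be sketched with ordinary user-space locks. This is one possible way to respect the ordering, not a transcription of the kernel code. The bucket layout, `N_BUCKETS`, and the function names are all assumptions made for the example.

```python
import threading

N_BUCKETS = 8


class Bucket:
    def __init__(self):
        self.lock = threading.Lock()
        self.entries = set()


bhash = [Bucket() for _ in range(N_BUCKETS)]   # indexed by hash(port)
bhash2 = [Bucket() for _ in range(N_BUCKETS)]  # indexed by hash((port, addr))


def bhash_bucket(port):
    return bhash[hash(port) % N_BUCKETS]


def bhash2_bucket(port, addr):
    return bhash2[hash((port, addr)) % N_BUCKETS]


def bind(port, addr):
    head = bhash_bucket(port)
    with head.lock:                      # bhash lock is taken first
        head.entries.add((port, addr))
        head2 = bhash2_bucket(port, addr)
        with head2.lock:                 # bhash2 lock is taken after it
            head2.entries.add((port, addr))
        # bhash2 lock is released here, before the bhash lock


def update_saddr(port, old_addr, new_addr):
    head = bhash_bucket(port)
    with head.lock:
        head.entries.discard((port, old_addr))
        head.entries.add((port, new_addr))
        old2 = bhash2_bucket(port, old_addr)
        with old2.lock:                  # first bhash2 lock: acquire, then release
            old2.entries.discard((port, old_addr))
        new2 = bhash2_bucket(port, new_addr)
        with new2.lock:                  # second bhash2 lock, never nested in the first
            new2.entries.add((port, new_addr))


if __name__ == "__main__":
    bind(443, "0.0.0.0")
    update_saddr(443, "0.0.0.0", "192.0.2.7")
    print(sorted(bhash2_bucket(443, "192.0.2.7").entries))
```

Because the two bhash2 locks are taken one after the other rather than nested, the update still works when the old and new addresses hash to the same bhash2 bucket.
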
      Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • netlink: fix some kernel-doc comments · 0bf73255
      Zhengchao Shao authored
      Fix the kernel-doc comments for the input parameters of the nlmsg_ and
      nla_ functions.
      Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com>
      Link: https://lore.kernel.org/r/20220824013621.365103-1-shaozhengchao@huawei.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: ethernet: ti: davinci_mdio: fix build for mdio bitbang uses · 35bbe652
      Randy Dunlap authored
      davinci_mdio.c uses mdio bitbang APIs, so it should select
      MDIO_BITBANG to prevent build errors.
      
      arm-linux-gnueabi-ld: drivers/net/ethernet/ti/davinci_mdio.o: in function `davinci_mdio_remove':
      drivers/net/ethernet/ti/davinci_mdio.c:649: undefined reference to `free_mdio_bitbang'
      arm-linux-gnueabi-ld: drivers/net/ethernet/ti/davinci_mdio.o: in function `davinci_mdio_probe':
      drivers/net/ethernet/ti/davinci_mdio.c:545: undefined reference to `alloc_mdio_bitbang'
      arm-linux-gnueabi-ld: drivers/net/ethernet/ti/davinci_mdio.o: in function `davinci_mdiobb_read':
      drivers/net/ethernet/ti/davinci_mdio.c:236: undefined reference to `mdiobb_read'
      arm-linux-gnueabi-ld: drivers/net/ethernet/ti/davinci_mdio.o: in function `davinci_mdiobb_write':
      drivers/net/ethernet/ti/davinci_mdio.c:253: undefined reference to `mdiobb_write'
      
      Fixes: d04807b8 ("net: ethernet: ti: davinci_mdio: Add workaround for errata i2329")
      Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
      Cc: Grygorii Strashko <grygorii.strashko@ti.com>
      Cc: Ravi Gunasekaran <r-gunasekaran@ti.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Paolo Abeni <pabeni@redhat.com>
      Cc: Naresh Kamboju <naresh.kamboju@linaro.org>
      Cc: Sudip Mukherjee (Codethink) <sudipm.mukherjee@gmail.com>
      Link: https://lore.kernel.org/r/20220824024216.4939-1-rdunlap@infradead.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • Documentation: sysctl: align cells in second content column · 1faa3467
      Bagas Sanjaya authored
      Stephen Rothwell reported an htmldocs warning when merging the net-next tree:
      
      Documentation/admin-guide/sysctl/net.rst:37: WARNING: Malformed table.
      Text in column margin in table line 4.
      
      ========= =================== = ========== ==================
      Directory Content               Directory  Content
      ========= =================== = ========== ==================
      802       E802 protocol         mptcp     Multipath TCP
      appletalk Appletalk protocol    netfilter Network Filter
      ax25      AX25                  netrom     NET/ROM
      bridge    Bridging              rose      X.25 PLP layer
      core      General parameter     tipc      TIPC
      ethernet  Ethernet protocol     unix      Unix domain sockets
      ipv4      IP version 4          x25       X.25 protocol
      ipv6      IP version 6
      ========= =================== = ========== ==================
      
      The warning above is caused by cells in the second "Content" column of
      the /proc/sys/net subdirectory table that sit in the column margin.
      
      Align these cells against the column header to fix the warning.
      
      Link: https://lore.kernel.org/linux-next/20220823134905.57ed08d5@canb.auug.org.au/
      Fixes: 1202cdd6 ("Remove DECnet support from kernel")
      Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: Bagas Sanjaya <bagasdotme@gmail.com>
      Link: https://lore.kernel.org/r/20220824035804.204322-1-bagasdotme@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  2. 24 Aug, 2022 27 commits
  3. 23 Aug, 2022 5 commits