1. 29 Nov, 2021 19 commits
  2. 27 Nov, 2021 21 commits
    • Merge branch 'af_unix-replace-unix_table_lock-with-per-hash-locks' · d40ce48c
      Jakub Kicinski authored
      Kuniyuki Iwashima says:
      
      ====================
      af_unix: Replace unix_table_lock with per-hash locks.
      
      The hash table of AF_UNIX sockets is protected by a single big lock,
      unix_table_lock.  This series replaces it with small per-hash locks.
      
      1st -  2nd : Misc refactoring
      3rd -  8th : Separate BSD/abstract address logics
      9th - 11th : Prep to save a hash in each socket
      12th       : Replace the big lock
      13th       : Speed up autobind()
      
      Note to maintainers:
      The 12th patch adds two kinds of Sparse warnings on patchwork:
      
        about unix_table_double_lock/unlock()
          We can avoid this by adding two apparent acquires/releases annotations,
          but there are the same kinds of warnings about unix_state_double_lock().
      
        about unix_next_socket() and unix_seq_stop() (/proc/net/unix)
          This is because Sparse does not understand logic in unix_next_socket(),
          which leaves a spin lock held until it returns NULL.
          Also, tcp_seq_stop() causes a warning for the same reason.
      
      These warnings seem reasonable, but let me know if there is any better way.
      Please see [0] for details.
      
      [0]: https://lore.kernel.org/netdev/20211117001611.74123-1-kuniyu@amazon.co.jp/
      ====================
      
      Link: https://lore.kernel.org/r/20211124021431.48956-1-kuniyu@amazon.co.jp
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • af_unix: Relax race in unix_autobind(). · 9acbc584
      Kuniyuki Iwashima authored
      When we bind an AF_UNIX socket without a name specified, the kernel selects
      an available one from 0x00000 to 0xFFFFF.  unix_autobind() starts searching
      from a number in the 'static' variable and increments it after acquiring
      two locks.
      
      If multiple processes try to autobind, they obtain the same lock and check
      whether a socket in the hash list has the same name.  If not, one process
      uses it, and all the others end up retrying with the _next_ number (or a
      later one, since other processes may have incremented it in the meantime).
      The more sockets we autobind in parallel, the longer the latency gets.  We
      can avoid such a race by starting the search from a random number.
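
The search described above can be sketched in userspace as follows. This is only an illustration under stated assumptions, not the kernel code: `pick_autobind_name()`, `name_in_use_fn` and `below_limit_in_use()` are hypothetical names, and the hash-table lookup is replaced by a callback. The caller is assumed to supply a random `start`, so concurrent callers rarely probe the same name first.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define AUTOBIND_SPACE 0x100000u	/* names 0x00000 .. 0xFFFFF */

/* Hypothetical stand-in for the lookup in the socket hash table. */
typedef bool (*name_in_use_fn)(uint32_t name, void *ctx);

/*
 * Scan the name space once, starting at `start` and wrapping around.
 * Returns the first free name, or -1 if every name is taken.
 */
static long pick_autobind_name(uint32_t start, name_in_use_fn in_use, void *ctx)
{
	uint32_t name = start & (AUTOBIND_SPACE - 1);
	uint32_t tried;

	assert(in_use != NULL);

	for (tried = 0; tried < AUTOBIND_SPACE; tried++) {
		if (!in_use(name, ctx))
			return (long)name;
		name = (name + 1) & (AUTOBIND_SPACE - 1);
	}
	return -1;
}

/* Toy "table": every name below *limit counts as taken. */
static bool below_limit_in_use(uint32_t name, void *ctx)
{
	return name < *(uint32_t *)ctx;
}
```

With a shared, sequential starting point every loser of the race probes the same next candidate; with a random `start` the first probes are spread over the whole range.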
      
      These show latency in unix_autobind() while 64 CPUs are simultaneously
      autobinding 1024 sockets each.
      
        Without this patch:
      
           usec          : count     distribution
              0          : 1176     |***                                     |
              2          : 3655     |***********                             |
              4          : 4094     |*************                           |
              6          : 3831     |************                            |
              8          : 3829     |************                            |
              10         : 3844     |************                            |
              12         : 3638     |***********                             |
              14         : 2992     |*********                               |
              16         : 2485     |*******                                 |
              18         : 2230     |*******                                 |
              20         : 2095     |******                                  |
              22         : 1853     |*****                                   |
              24         : 1827     |*****                                   |
              26         : 1677     |*****                                   |
              28         : 1473     |****                                    |
              30         : 1573     |*****                                   |
              32         : 1417     |****                                    |
              34         : 1385     |****                                    |
              36         : 1345     |****                                    |
              38         : 1344     |****                                    |
              40         : 1200     |***                                     |
      
        With this patch:
      
           usec          : count     distribution
              0          : 1855     |******                                  |
              2          : 6464     |*********************                   |
              4          : 9936     |********************************        |
              6          : 12107    |****************************************|
              8          : 10441    |**********************************      |
              10         : 7264     |***********************                 |
              12         : 4254     |**************                          |
              14         : 2538     |********                                |
              16         : 1596     |*****                                   |
              18         : 1088     |***                                     |
              20         : 800      |**                                      |
              22         : 670      |**                                      |
              24         : 601      |*                                       |
              26         : 562      |*                                       |
              28         : 525      |*                                       |
              30         : 446      |*                                       |
              32         : 378      |*                                       |
              34         : 337      |*                                       |
              36         : 317      |*                                       |
              38         : 314      |*                                       |
              40         : 298      |                                        |
      Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • af_unix: Replace the big lock with small locks. · afd20b92
      Kuniyuki Iwashima authored
      The hash table of AF_UNIX sockets is protected by a single lock.  This
      patch replaces it with per-hash locks.
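
A minimal userspace sketch of the idea, assuming one lock per hash bucket: the names (`bucket_lock()`, `bucket_double_lock()`, `TABLE_SIZE`) are hypothetical, and a toy `atomic_flag` spin lock stands in for the kernel's spinlock_t. The double-lock helper takes the two locks in ascending hash order, which is one common way to keep two concurrent callers from deadlocking; whether the patch uses exactly this ordering is an assumption here.

```c
#include <assert.h>
#include <stdatomic.h>

#define TABLE_SIZE 256	/* hypothetical number of hash buckets */

/* Toy spin lock; the kernel would use spinlock_t instead. */
struct bucket_lock {
	atomic_flag held;
};

static struct bucket_lock table_locks[TABLE_SIZE];

static void table_locks_init(void)
{
	unsigned int i;

	for (i = 0; i < TABLE_SIZE; i++)
		atomic_flag_clear(&table_locks[i].held);
}

static void bucket_lock(unsigned int hash)
{
	assert(hash < TABLE_SIZE);
	while (atomic_flag_test_and_set_explicit(&table_locks[hash].held,
						 memory_order_acquire))
		;	/* spin until the current holder releases it */
}

static void bucket_unlock(unsigned int hash)
{
	assert(hash < TABLE_SIZE);
	atomic_flag_clear_explicit(&table_locks[hash].held,
				   memory_order_release);
}

/* Lock two buckets in ascending order; lock only once if they are equal. */
static void bucket_double_lock(unsigned int h1, unsigned int h2)
{
	if (h1 > h2) {
		unsigned int tmp = h1;

		h1 = h2;
		h2 = tmp;
	}
	bucket_lock(h1);
	if (h1 != h2)
		bucket_lock(h2);
}

static void bucket_double_unlock(unsigned int h1, unsigned int h2)
{
	bucket_unlock(h1);
	if (h1 != h2)
		bucket_unlock(h2);
}
```

Operations on sockets that hash to different buckets no longer serialize on one lock, which is consistent with the lower latency shown in the measurements.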
      
      The effect is noticeable when we handle multiple sockets simultaneously.
      Here is a test result on an EC2 c5.24xlarge instance.  It shows latency
      (under 10us only) in unix_insert_unbound_socket() while 64 CPUs each
      create 1024 sockets in parallel.
      
        Without this patch:
      
           nsec          : count     distribution
              0          : 179      |                                        |
              500        : 3021     |*********                               |
              1000       : 6271     |*******************                     |
              1500       : 6318     |*******************                     |
              2000       : 5828     |*****************                       |
              2500       : 5124     |***************                         |
              3000       : 4426     |*************                           |
              3500       : 3672     |***********                             |
              4000       : 3138     |*********                               |
              4500       : 2811     |********                                |
              5000       : 2384     |*******                                 |
              5500       : 2023     |******                                  |
              6000       : 1954     |*****                                   |
              6500       : 1737     |*****                                   |
              7000       : 1749     |*****                                   |
              7500       : 1520     |****                                    |
              8000       : 1469     |****                                    |
              8500       : 1394     |****                                    |
              9000       : 1232     |***                                     |
              9500       : 1138     |***                                     |
              10000      : 994      |***                                     |
      
        With this patch:
      
           nsec          : count     distribution
              0          : 1634     |****                                    |
              500        : 13170    |****************************************|
              1000       : 13156    |*************************************** |
              1500       : 9010     |***************************             |
              2000       : 6363     |*******************                     |
              2500       : 4443     |*************                           |
              3000       : 3240     |*********                               |
              3500       : 2549     |*******                                 |
              4000       : 1872     |*****                                   |
              4500       : 1504     |****                                    |
              5000       : 1247     |***                                     |
              5500       : 1035     |***                                     |
              6000       : 889      |**                                      |
              6500       : 744      |**                                      |
              7000       : 634      |*                                       |
              7500       : 498      |*                                       |
              8000       : 433      |*                                       |
              8500       : 355      |*                                       |
              9000       : 336      |*                                       |
              9500       : 284      |                                        |
              10000      : 243      |                                        |
      Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • af_unix: Save hash in sk_hash. · e6b4b873
      Kuniyuki Iwashima authored
      To replace unix_table_lock with per-hash locks in the next patch, we need
      to save a hash in each socket because /proc/net/unix and BPF programs
      iterate over sockets while holding a hash table lock and release it later
      in a different function.

      Currently, we store a real/pseudo hash in struct unix_address.  However, we
      do not allocate it for unbound sockets, nor should we do so just for that.
      For this purpose, we can use sk_hash.  Then, we no longer use the hash
      field in struct unix_address and can remove it.
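
The reason the hash has to live in the socket can be shown with a toy iterator. This is a sketch under assumptions, not the kernel implementation: `struct toy_sock`, `first_socket()` and `iter_stop()` are hypothetical, and an int array stands in for the per-bucket spin locks. The point is that the function that releases the lock only has the socket in hand, so it must be able to find the bucket from it.

```c
#include <assert.h>
#include <stddef.h>

#define TABLE_SIZE 4	/* hypothetical number of hash buckets */

struct toy_sock {
	unsigned int sk_hash;	/* bucket index saved at insert time */
	struct toy_sock *next;
};

static struct toy_sock *table[TABLE_SIZE];
static int lock_held[TABLE_SIZE];	/* stand-in for per-bucket locks */

static void insert_sock(struct toy_sock *sk, unsigned int hash)
{
	assert(hash < TABLE_SIZE);
	sk->sk_hash = hash;
	sk->next = table[hash];
	table[hash] = sk;
}

/*
 * Return the first socket at or after bucket `from`.  The lock of the
 * bucket that holds the returned socket is left held on purpose.
 */
static struct toy_sock *first_socket(unsigned int from)
{
	unsigned int b;

	for (b = from; b < TABLE_SIZE; b++) {
		lock_held[b] = 1;
		if (table[b])
			return table[b];
		lock_held[b] = 0;
	}
	return NULL;
}

/* Called later from a different function: only the socket is known. */
static void iter_stop(struct toy_sock *sk)
{
	if (sk)
		lock_held[sk->sk_hash] = 0;
}
```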
      
      This patch also:
        - renames unix_insert_socket() to unix_insert_unbound_socket()
        - removes the redundant list argument from __unix_insert_socket() and
           unix_insert_unbound_socket()
        - uses 'unsigned int' instead of 'unsigned' in __unix_set_addr_hash()
        - removes 'inline' from unix_remove_socket() and
           unix_insert_unbound_socket().
      Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • af_unix: Add helpers to calculate hashes. · f452be49
      Kuniyuki Iwashima authored
      This patch adds three helper functions that calculate hashes for unbound
      sockets and bound sockets with BSD/abstract addresses.
      Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • af_unix: Remove UNIX_ABSTRACT() macro and test sun_path[0] instead. · 5ce7ab49
      Kuniyuki Iwashima authored
      In BSD and abstract address cases, we store sockets in the hash table with
      keys between 0 and UNIX_HASH_SIZE - 1.  However, the hash saved in a socket
      varies depending on its address type; sockets with BSD addresses always
      have UNIX_HASH_SIZE in their unix_sk(sk)->addr->hash.
      
      This is just for the UNIX_ABSTRACT() macro used to check the address type.
      The difference between the saved hashes comes from the first byte of the
      address in the first place, so we can test it directly.
      
      Then we can keep a real hash in each socket and replace unix_table_lock
      with per-hash locks in the later patch.
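
Testing the first byte directly could look like the following userspace sketch. `addr_is_abstract()` is a hypothetical helper name, and the sketch assumes the caller has already checked that the address length covers at least one byte of sun_path. Abstract addresses are a Linux feature.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>

/*
 * An abstract address starts with a NUL byte in sun_path; a BSD
 * (pathname) address starts with the first character of the path.
 */
static bool addr_is_abstract(const struct sockaddr_un *sunaddr)
{
	assert(sunaddr != NULL);
	return sunaddr->sun_path[0] == '\0';
}
```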
      Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • af_unix: Allocate unix_address in unix_bind_(bsd|abstract)(). · 12f21c49
      Kuniyuki Iwashima authored
      To terminate the address with '\0' in unix_bind_bsd(), we add
      unix_create_addr() and call it in unix_bind_bsd() and unix_bind_abstract().
      
      Also, unix_bind_abstract() does not return -EEXIST.  Only
      kern_path_create() and vfs_mknod() in unix_bind_bsd() can return it,
      so we move the last error check in unix_bind() to unix_bind_bsd().
      Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • af_unix: Remove unix_mkname(). · 5c32a3ed
      Kuniyuki Iwashima authored
      This patch removes unix_mkname() and postpones calculating a hash to
      unix_bind_abstract().  Some BSD-specific code still remains in
      unix_bind(), but the next patch moves it into unix_bind_bsd().
      Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • af_unix: Copy unix_mkname() into unix_find_(bsd|abstract)(). · d2d8c9fd
      Kuniyuki Iwashima authored
      We should not call unix_mkname() before unix_find_other() and instead do
      the same thing where necessary based on the address type:
      
        - terminating the address with '\0' in unix_find_bsd()
        - calculating the hash in unix_find_abstract().
      Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • af_unix: Cut unix_validate_addr() out of unix_mkname(). · b8a58aa6
      Kuniyuki Iwashima authored
      unix_mkname() tests the socket address length and family and does some
      processing based on the address type.  It is called at an early stage,
      so some of its work is redundant or can go to waste.

      The address length/family tests are done twice in unix_bind().  Also, the
      address type is rechecked later in unix_bind() and unix_find_other(), where
      we can do the same processing.  Moreover, in the BSD address case, the hash
      is set to 0 but never used, which is confusing.

      This patch moves the address tests out of unix_mkname(), and the following
      patches move the remaining parts into appropriate places and finally
      remove unix_mkname().
      Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • af_unix: Return an error as a pointer in unix_find_other(). · aed26f55
      Kuniyuki Iwashima authored
      We can return an error as a pointer and need not pass an additional
      argument to unix_find_other().
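
The kernel encodes small negative errno values in the top of the pointer range via ERR_PTR(), IS_ERR() and PTR_ERR(). The following userspace sketch re-implements that idea for illustration; `err_ptr()`, `is_err()`, `ptr_err()`, `struct toy_sock` and `find_other()` are hypothetical stand-ins, and the choice of -ECONNREFUSED is only an example.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_ERRNO 4095	/* errors occupy the top 4095 pointer values */

/* Encode a negative errno value as a pointer. */
static inline void *err_ptr(long error)
{
	assert(error < 0 && error >= -MAX_ERRNO);
	return (void *)(intptr_t)error;
}

/* Decode the errno value from an error pointer. */
static inline long ptr_err(const void *ptr)
{
	return (long)(intptr_t)ptr;
}

/* True if the pointer value lies in the reserved error range. */
static inline bool is_err(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

struct toy_sock {
	int id;
};

static struct toy_sock known = { .id = 1 };

/* One return value carries either the socket or the error code. */
static struct toy_sock *find_other(int id)
{
	if (id != known.id)
		return err_ptr(-ECONNREFUSED);
	return &known;
}
```

A caller then checks `is_err(p)` and reads the code with `ptr_err(p)`, so no separate `int *error` out-parameter is needed.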
      Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • af_unix: Factorise unix_find_other() based on address types. · fa39ef0e
      Kuniyuki Iwashima authored
      As done in commit fa42d910 ("unix_bind(): take BSD and abstract
      address cases into new helpers"), this patch moves BSD and abstract address
      cases from unix_find_other() into unix_find_bsd() and unix_find_abstract().
      Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • af_unix: Pass struct sock to unix_autobind(). · f7ed31f4
      Kuniyuki Iwashima authored
      We do not use struct socket in unix_autobind() and pass struct sock to
      unix_bind_bsd() and unix_bind_abstract().  Let's pass it to unix_autobind()
      as well.
      
      Also, this patch fixes these errors reported by checkpatch.pl.
      
        ERROR: do not use assignment in if condition
        #1795: FILE: net/unix/af_unix.c:1795:
        +	if (test_bit(SOCK_PASSCRED, &sock->flags) && !u->addr
      
        CHECK: Logical continuations should be on the previous line
        #1796: FILE: net/unix/af_unix.c:1796:
        +	if (test_bit(SOCK_PASSCRED, &sock->flags) && !u->addr
        +	    && (err = unix_autobind(sock)) != 0)
      Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • af_unix: Use offsetof() instead of sizeof(). · 755662ce
      Kuniyuki Iwashima authored
      The length of an AF_UNIX socket address includes the offset of the member
      sun_path within struct sockaddr_un.

      Currently, the only preceding member is sun_family, whose type,
      sa_family_t, resolves to short.  Therefore, the offset is represented by
      sizeof(short).  However, this is unclear and fragile to changes in struct
      sockaddr_storage or sockaddr_un.
      
      This commit makes it clear and robust by rewriting sizeof() with
      offsetof().
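
As a userspace illustration of the same idea, the length of a pathname address can be computed from the offset of sun_path rather than from the size of the members assumed to precede it. `unix_path_addr_len()` is a hypothetical helper, not a function from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>

/*
 * Address length for a NUL-terminated pathname: the bytes in front of
 * sun_path, plus the path itself, plus the terminating '\0'.
 */
static size_t unix_path_addr_len(const char *path)
{
	size_t n = strlen(path);

	assert(n < sizeof(((struct sockaddr_un *)0)->sun_path));
	return offsetof(struct sockaddr_un, sun_path) + n + 1;
}
```

If a member is ever added in front of sun_path, offsetof() follows the layout automatically, whereas a hard-coded sizeof(short) would silently become wrong.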
      Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • bridge: use __set_bit in __br_vlan_set_default_pvid · 442b03c3
      Xin Long authored
      The same optimization as the one in commit cc0be1ad ("net:
      bridge: Slightly optimize 'find_portno()'") is needed for the
      'changed' bitmap in __br_vlan_set_default_pvid().
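
In the kernel, set_bit() is an atomic read-modify-write while __set_bit() is the plain, non-atomic variant. The sketch below shows what a non-atomic bit set amounts to in userspace; `nonatomic_set_bit()`, `bit_is_set()` and `BITMAP_LONGS()` are hypothetical names, and the sketch assumes the bitmap is private to the calling function, so no other thread can modify it concurrently.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

#define BITS_PER_ULONG (sizeof(unsigned long) * CHAR_BIT)
#define BITMAP_LONGS(nbits) (((nbits) + BITS_PER_ULONG - 1) / BITS_PER_ULONG)

/*
 * Plain read-modify-write with no atomic instruction.  This is only
 * safe when nothing else can touch the bitmap at the same time.
 */
static void nonatomic_set_bit(unsigned int nr, unsigned long *map,
			      unsigned int nbits)
{
	assert(nr < nbits);
	map[nr / BITS_PER_ULONG] |= 1UL << (nr % BITS_PER_ULONG);
}

static bool bit_is_set(unsigned int nr, const unsigned long *map)
{
	return (map[nr / BITS_PER_ULONG] >> (nr % BITS_PER_ULONG)) & 1UL;
}
```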
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Link: https://lore.kernel.org/r/4e35f415226765e79c2a11d2c96fbf3061c486e2.1637782773.git.lucien.xin@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: ethtool: set a default driver name · bde3b0fd
      Tonghao Zhang authored
      For netdevs (e.g. ifb, bareudp) that do not support ethtool ops
      (e.g. .get_drvinfo), we can use the rtnl kind as a default driver name.

      An ifb netdev may be created under a name that does not follow the
      ifbX pattern.
      
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Hao Chen <chenhao288@hisilicon.com>
      Cc: Heiner Kallweit <hkallweit1@gmail.com>
      Cc: "Gustavo A. R. Silva" <gustavoars@kernel.org>
      Cc: Danielle Ratson <danieller@nvidia.com>
      Cc: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
      Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
      Link: https://lore.kernel.org/r/20211125163049.84970-1-xiangxia.m.yue@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • Merge branch 'selftests-net-bridge-vlan-multicast-tests' · c2e0cf08
      Jakub Kicinski authored
      Nikolay Aleksandrov says:
      
      ====================
      selftests: net: bridge: vlan multicast tests
      
      This patch-set adds selftests for the new vlan multicast options that
      were recently added. Most of the tests check default values, change
      options, and try to verify that the changes actually take effect. The
      last test checks whether the dependency between vlan_filtering and
      mcast_vlan_snooping holds. The rest are pretty self-explanatory.
      
      TEST: Vlan multicast snooping enable                                [ OK ]
      TEST: Vlan global options existence                                 [ OK ]
      TEST: Vlan mcast_snooping global option default value               [ OK ]
      TEST: Vlan 10 multicast snooping control                            [ OK ]
      TEST: Vlan mcast_querier global option default value                [ OK ]
      TEST: Vlan 10 multicast querier enable                              [ OK ]
      TEST: Vlan 10 tagged IGMPv2 general query sent                      [ OK ]
      TEST: Vlan 10 tagged MLD general query sent                         [ OK ]
      TEST: Vlan mcast_igmp_version global option default value           [ OK ]
      TEST: Vlan mcast_mld_version global option default value            [ OK ]
      TEST: Vlan 10 mcast_igmp_version option changed to 3                [ OK ]
      TEST: Vlan 10 tagged IGMPv3 general query sent                      [ OK ]
      TEST: Vlan 10 mcast_mld_version option changed to 2                 [ OK ]
      TEST: Vlan 10 tagged MLDv2 general query sent                       [ OK ]
      TEST: Vlan mcast_last_member_count global option default value      [ OK ]
      TEST: Vlan mcast_last_member_interval global option default value   [ OK ]
      TEST: Vlan 10 mcast_last_member_count option changed to 3           [ OK ]
      TEST: Vlan 10 mcast_last_member_interval option changed to 200      [ OK ]
      TEST: Vlan mcast_startup_query_interval global option default value   [ OK ]
      TEST: Vlan mcast_startup_query_count global option default value    [ OK ]
      TEST: Vlan 10 mcast_startup_query_interval option changed to 100    [ OK ]
      TEST: Vlan 10 mcast_startup_query_count option changed to 3         [ OK ]
      TEST: Vlan mcast_membership_interval global option default value    [ OK ]
      TEST: Vlan 10 mcast_membership_interval option changed to 200       [ OK ]
      TEST: Vlan 10 mcast_membership_interval mdb entry expire            [ OK ]
      TEST: Vlan mcast_querier_interval global option default value       [ OK ]
      TEST: Vlan 10 mcast_querier_interval option changed to 100          [ OK ]
      TEST: Vlan 10 mcast_querier_interval expire after outside query     [ OK ]
      TEST: Vlan mcast_query_interval global option default value         [ OK ]
      TEST: Vlan 10 mcast_query_interval option changed to 200            [ OK ]
      TEST: Vlan mcast_query_response_interval global option default value   [ OK ]
      TEST: Vlan 10 mcast_query_response_interval option changed to 200   [ OK ]
      TEST: Port vlan 10 option mcast_router default value                [ OK ]
      TEST: Port vlan 10 mcast_router option changed to 2                 [ OK ]
      TEST: Flood unknown vlan multicast packets to router port only      [ OK ]
      TEST: Disable multicast vlan snooping when vlan filtering is disabled   [ OK ]
      ====================
      
      Link: https://lore.kernel.org/r/20211125140858.3639139-1-razor@blackwall.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • selftests: net: bridge: add test for vlan_filtering dependency · f5a9dd58
      Nikolay Aleksandrov authored
      Add a test for dependency of mcast_vlan_snooping on vlan_filtering. If
      vlan_filtering gets disabled, then mcast_vlan_snooping must be
      automatically disabled as well.
      
      TEST: Disable multicast vlan snooping when vlan filtering is disabled   [ OK ]
      Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • selftests: net: bridge: add vlan mcast_router tests · 2cd67a4e
      Nikolay Aleksandrov authored
      Add tests for the new per-port/vlan mcast_router option and verify that
      unknown multicast packets are flooded only to router ports.
      
      TEST: Port vlan 10 option mcast_router default value                [ OK ]
      TEST: Port vlan 10 mcast_router option changed to 2                 [ OK ]
      TEST: Flood unknown vlan multicast packets to router port only      [ OK ]
      Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • selftests: net: bridge: add vlan mcast query and query response interval tests · b4ce7b95
      Nikolay Aleksandrov authored
      Add tests which change the new per-vlan mcast_query_interval and verify
      that the new value is in effect.  Also add a test to change
      mcast_query_response_interval's value.
      
      TEST: Vlan mcast_query_interval global option default value         [ OK ]
      TEST: Vlan 10 mcast_query_interval option changed to 200            [ OK ]
      TEST: Vlan mcast_query_response_interval global option default value   [ OK ]
      TEST: Vlan 10 mcast_query_response_interval option changed to 200   [ OK ]
      Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • selftests: net: bridge: add vlan mcast_querier_interval tests · 4d8610ee
      Nikolay Aleksandrov authored
      Add tests which change the new per-vlan mcast_querier_interval and
      verify that it is taken into account when an outside querier is present.
      
      TEST: Vlan mcast_querier_interval global option default value       [ OK ]
      TEST: Vlan 10 mcast_querier_interval option changed to 100          [ OK ]
      TEST: Vlan 10 mcast_querier_interval expire after outside query     [ OK ]
      Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>