1. 09 Aug, 2023 12 commits
    • Merge branch 'nexthop-nexthop-dump-fixes' · f8d3e0dc
      Jakub Kicinski authored
      Ido Schimmel says:
      
      ====================
      nexthop: Nexthop dump fixes
      
      Patches #1 and #3 fix two problems related to nexthops and nexthop
      buckets dump, respectively. Patch #2 is a preparation for the third
      patch.
      
      The pattern described in these patches of splitting the NLMSG_DONE into
      a separate response is prevalent in other rtnetlink dump callbacks. I
      don't know if it's because I'm missing something or if this was done
      intentionally to ensure the message is delivered to user space. After
      commit 0642840b ("af_netlink: ensure that NLMSG_DONE never fails in
      dumps") this is no longer necessary and I can improve these dump
      callbacks assuming this analysis is correct.
      
      No regressions in existing tests:
      
       # ./fib_nexthops.sh
       [...]
       Tests passed: 230
       Tests failed:   0
      ====================
      
      Link: https://lore.kernel.org/r/20230808075233.3337922-1-idosch@nvidia.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      f8d3e0dc
    • nexthop: Fix infinite nexthop bucket dump when using maximum nexthop ID · 8743aeff
      Ido Schimmel authored
      A netlink dump callback can return a positive number to signal that more
      information needs to be dumped or zero to signal that the dump is
      complete. In the second case, the core netlink code will append the
      NLMSG_DONE message to the skb in order to indicate to user space that
      the dump is complete.
      
      The nexthop bucket dump callback always returns a positive number if
      nexthop buckets were filled in the provided skb, even if the dump is
      complete. This means that a dump will span at least two recvmsg() calls
      as long as nexthop buckets are present. In the last recvmsg() call the
      dump callback will not fill in any nexthop buckets because the previous
      call indicated that the dump should restart from the last dumped nexthop
      ID plus one.
      
       # ip link add name dummy1 up type dummy
       # ip nexthop add id 1 dev dummy1
       # ip nexthop add id 10 group 1 type resilient buckets 2
       # strace -e sendto,recvmsg -s 5 ip nexthop bucket
       sendto(3, [[{nlmsg_len=24, nlmsg_type=RTM_GETNEXTHOPBUCKET, nlmsg_flags=NLM_F_REQUEST|NLM_F_DUMP, nlmsg_seq=1691396980, nlmsg_pid=0}, {family=AF_UNSPEC, data="\x00\x00\x00\x00\x00"...}], {nlmsg_len=0, nlmsg_type=0 /* NLMSG_??? */, nlmsg_flags=0, nlmsg_seq=0, nlmsg_pid=0}], 152, 0, NULL, 0) = 152
       recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base=NULL, iov_len=0}], msg_iovlen=1, msg_controllen=0, msg_flags=MSG_TRUNC}, MSG_PEEK|MSG_TRUNC) = 128
       recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base=[[{nlmsg_len=64, nlmsg_type=RTM_NEWNEXTHOPBUCKET, nlmsg_flags=NLM_F_MULTI, nlmsg_seq=1691396980, nlmsg_pid=347}, {family=AF_UNSPEC, data="\x00\x00\x00\x00\x00"...}], [{nlmsg_len=64, nlmsg_type=RTM_NEWNEXTHOPBUCKET, nlmsg_flags=NLM_F_MULTI, nlmsg_seq=1691396980, nlmsg_pid=347}, {family=AF_UNSPEC, data="\x00\x00\x00\x00\x00"...}]], iov_len=32768}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 128
       id 10 index 0 idle_time 6.66 nhid 1
       id 10 index 1 idle_time 6.66 nhid 1
       recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base=NULL, iov_len=0}], msg_iovlen=1, msg_controllen=0, msg_flags=MSG_TRUNC}, MSG_PEEK|MSG_TRUNC) = 20
       recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base=[{nlmsg_len=20, nlmsg_type=NLMSG_DONE, nlmsg_flags=NLM_F_MULTI, nlmsg_seq=1691396980, nlmsg_pid=347}, 0], iov_len=32768}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 20
       +++ exited with 0 +++
      
      This behavior is both inefficient and buggy. If the last nexthop to be
      dumped had the maximum ID of 0xffffffff, then the dump will restart from
      0 (0xffffffff + 1) and never end:
      
       # ip link add name dummy1 up type dummy
       # ip nexthop add id 1 dev dummy1
       # ip nexthop add id $((2**32-1)) group 1 type resilient buckets 2
       # ip nexthop bucket
       id 4294967295 index 0 idle_time 5.55 nhid 1
       id 4294967295 index 1 idle_time 5.55 nhid 1
       id 4294967295 index 0 idle_time 5.55 nhid 1
       id 4294967295 index 1 idle_time 5.55 nhid 1
       [...]
      
      Fix by adjusting the dump callback to return zero when the dump is
      complete. After the fix only one recvmsg() call is made and the
      NLMSG_DONE message is appended to the RTM_NEWNEXTHOPBUCKET responses:
      
       # ip link add name dummy1 up type dummy
       # ip nexthop add id 1 dev dummy1
       # ip nexthop add id $((2**32-1)) group 1 type resilient buckets 2
       # strace -e sendto,recvmsg -s 5 ip nexthop bucket
       sendto(3, [[{nlmsg_len=24, nlmsg_type=RTM_GETNEXTHOPBUCKET, nlmsg_flags=NLM_F_REQUEST|NLM_F_DUMP, nlmsg_seq=1691396737, nlmsg_pid=0}, {family=AF_UNSPEC, data="\x00\x00\x00\x00\x00"...}], {nlmsg_len=0, nlmsg_type=0 /* NLMSG_??? */, nlmsg_flags=0, nlmsg_seq=0, nlmsg_pid=0}], 152, 0, NULL, 0) = 152
       recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base=NULL, iov_len=0}], msg_iovlen=1, msg_controllen=0, msg_flags=MSG_TRUNC}, MSG_PEEK|MSG_TRUNC) = 148
       recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base=[[{nlmsg_len=64, nlmsg_type=RTM_NEWNEXTHOPBUCKET, nlmsg_flags=NLM_F_MULTI, nlmsg_seq=1691396737, nlmsg_pid=350}, {family=AF_UNSPEC, data="\x00\x00\x00\x00\x00"...}], [{nlmsg_len=64, nlmsg_type=RTM_NEWNEXTHOPBUCKET, nlmsg_flags=NLM_F_MULTI, nlmsg_seq=1691396737, nlmsg_pid=350}, {family=AF_UNSPEC, data="\x00\x00\x00\x00\x00"...}], [{nlmsg_len=20, nlmsg_type=NLMSG_DONE, nlmsg_flags=NLM_F_MULTI, nlmsg_seq=1691396737, nlmsg_pid=350}, 0]], iov_len=32768}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 148
       id 4294967295 index 0 idle_time 6.61 nhid 1
       id 4294967295 index 1 idle_time 6.61 nhid 1
       +++ exited with 0 +++
      
      Note that if the NLMSG_DONE message cannot be appended because of size
      limitations, then another recvmsg() will be needed, but the core netlink
      code will not invoke the dump callback and simply reply with a
      NLMSG_DONE message since it knows that the callback previously returned
      zero.
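      The return-value contract described above, and the wraparound that caused
      the infinite dump, can be sketched in a few lines of standalone C. This is
      a hypothetical illustration with made-up function names, not the kernel
      code:

```c
#include <stdint.h>

/* Hypothetical sketch of the dump resume logic, not the kernel code.
 * Resuming from "last dumped ID plus one" wraps to 0 when the last
 * nexthop ID is the maximum u32 value, which is what made the dump
 * restart forever. */
static uint32_t next_start_id(uint32_t last_dumped_id)
{
	return last_dumped_id + 1; /* u32 wraps: 0xffffffff + 1 == 0 */
}

/* After the fix the callback returns 0 once the dump is complete, so
 * the netlink core appends NLMSG_DONE instead of calling back again. */
static int dump_retval(int bytes_filled, int dump_complete)
{
	return dump_complete ? 0 : bytes_filled;
}
```

      Returning the filled length only while the dump is incomplete is what
      lets the core append NLMSG_DONE to the same skb.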
      
      Add a test that fails before the fix:
      
       # ./fib_nexthops.sh -t basic_res
       [...]
       TEST: Maximum nexthop ID dump                                       [FAIL]
       [...]
      
      And passes after it:
      
       # ./fib_nexthops.sh -t basic_res
       [...]
       TEST: Maximum nexthop ID dump                                       [ OK ]
       [...]
      
      Fixes: 8a1bbabb ("nexthop: Add netlink handlers for bucket dump")
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Reviewed-by: Petr Machata <petrm@nvidia.com>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Link: https://lore.kernel.org/r/20230808075233.3337922-4-idosch@nvidia.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      8743aeff
    • nexthop: Make nexthop bucket dump more efficient · f10d3d9d
      Ido Schimmel authored
      rtm_dump_nexthop_bucket_nh() is used to dump nexthop buckets belonging
      to a specific resilient nexthop group. The function returns a positive
      return code (the skb length) upon both success and failure.
      
      The above behavior is problematic. When a complete nexthop bucket dump
      is requested, the function that walks the different nexthops treats the
      non-zero return code as an error. This causes buckets belonging to
      different resilient nexthop groups to be dumped using different buffers
      even if they can all fit in the same buffer:
      
       # ip link add name dummy1 up type dummy
       # ip nexthop add id 1 dev dummy1
       # ip nexthop add id 10 group 1 type resilient buckets 1
       # ip nexthop add id 20 group 1 type resilient buckets 1
       # strace -e recvmsg -s 0 ip nexthop bucket
       [...]
       recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[...], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 64
       id 10 index 0 idle_time 10.27 nhid 1
       [...]
       recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[...], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 64
       id 20 index 0 idle_time 6.44 nhid 1
       [...]
      
      Fix by only returning a non-zero return code when an error occurred and
      restarting the dump from the bucket index we failed to fill in. This
      allows buckets belonging to different resilient nexthop groups to be
      dumped using the same buffer:
      
       # ip link add name dummy1 up type dummy
       # ip nexthop add id 1 dev dummy1
       # ip nexthop add id 10 group 1 type resilient buckets 1
       # ip nexthop add id 20 group 1 type resilient buckets 1
       # strace -e recvmsg -s 0 ip nexthop bucket
       [...]
       recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[...], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 128
       id 10 index 0 idle_time 30.21 nhid 1
       id 20 index 0 idle_time 26.7 nhid 1
       [...]
      
      While this change is more of a performance improvement than an actual
      bug fix, it is a prerequisite for a subsequent patch that does fix a
      bug.
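      The fixed convention of returning non-zero only on error, while
      remembering where to resume, can be modeled with a small standalone
      sketch (hypothetical names, not the actual rtm_dump_nexthop_bucket_nh()
      code):

```c
/* Simplified model of the fixed convention: return non-zero only on
 * error, and remember the bucket index we failed at so the next dump
 * call restarts from that bucket instead of treating a partially
 * filled buffer as an error. */
static int dump_buckets(int nbuckets, int room, int *restart_idx)
{
	int i;

	for (i = *restart_idx; i < nbuckets; i++) {
		if (room-- <= 0) {
			*restart_idx = i; /* resume here next time */
			return -1;        /* e.g. -EMSGSIZE */
		}
		/* bucket i fits in the buffer */
	}
	*restart_idx = 0;
	return 0; /* success: caller keeps filling the same buffer */
}
```

      With this convention the caller can keep appending buckets from several
      nexthop groups into one buffer, only stopping when it actually runs out
      of room.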
      
      Fixes: 8a1bbabb ("nexthop: Add netlink handlers for bucket dump")
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Reviewed-by: Petr Machata <petrm@nvidia.com>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Link: https://lore.kernel.org/r/20230808075233.3337922-3-idosch@nvidia.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      f10d3d9d
    • nexthop: Fix infinite nexthop dump when using maximum nexthop ID · 913f60ca
      Ido Schimmel authored
      A netlink dump callback can return a positive number to signal that more
      information needs to be dumped or zero to signal that the dump is
      complete. In the second case, the core netlink code will append the
      NLMSG_DONE message to the skb in order to indicate to user space that
      the dump is complete.
      
      The nexthop dump callback always returns a positive number if nexthops
      were filled in the provided skb, even if the dump is complete. This
      means that a dump will span at least two recvmsg() calls as long as
      nexthops are present. In the last recvmsg() call the dump callback will
      not fill in any nexthops because the previous call indicated that the
      dump should restart from the last dumped nexthop ID plus one.
      
       # ip nexthop add id 1 blackhole
       # strace -e sendto,recvmsg -s 5 ip nexthop
       sendto(3, [[{nlmsg_len=24, nlmsg_type=RTM_GETNEXTHOP, nlmsg_flags=NLM_F_REQUEST|NLM_F_DUMP, nlmsg_seq=1691394315, nlmsg_pid=0}, {nh_family=AF_UNSPEC, nh_scope=RT_SCOPE_UNIVERSE, nh_protocol=RTPROT_UNSPEC, nh_flags=0}], {nlmsg_len=0, nlmsg_type=0 /* NLMSG_??? */, nlmsg_flags=0, nlmsg_seq=0, nlmsg_pid=0}], 152, 0, NULL, 0) = 152
       recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base=NULL, iov_len=0}], msg_iovlen=1, msg_controllen=0, msg_flags=MSG_TRUNC}, MSG_PEEK|MSG_TRUNC) = 36
       recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base=[{nlmsg_len=36, nlmsg_type=RTM_NEWNEXTHOP, nlmsg_flags=NLM_F_MULTI, nlmsg_seq=1691394315, nlmsg_pid=343}, {nh_family=AF_INET, nh_scope=RT_SCOPE_UNIVERSE, nh_protocol=RTPROT_UNSPEC, nh_flags=0}, [[{nla_len=8, nla_type=NHA_ID}, 1], {nla_len=4, nla_type=NHA_BLACKHOLE}]], iov_len=32768}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 36
       id 1 blackhole
       recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base=NULL, iov_len=0}], msg_iovlen=1, msg_controllen=0, msg_flags=MSG_TRUNC}, MSG_PEEK|MSG_TRUNC) = 20
       recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base=[{nlmsg_len=20, nlmsg_type=NLMSG_DONE, nlmsg_flags=NLM_F_MULTI, nlmsg_seq=1691394315, nlmsg_pid=343}, 0], iov_len=32768}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 20
       +++ exited with 0 +++
      
      This behavior is both inefficient and buggy. If the last nexthop to be
      dumped had the maximum ID of 0xffffffff, then the dump will restart from
      0 (0xffffffff + 1) and never end:
      
       # ip nexthop add id $((2**32-1)) blackhole
       # ip nexthop
       id 4294967295 blackhole
       id 4294967295 blackhole
       [...]
      
      Fix by adjusting the dump callback to return zero when the dump is
      complete. After the fix only one recvmsg() call is made and the
      NLMSG_DONE message is appended to the RTM_NEWNEXTHOP response:
      
       # ip nexthop add id $((2**32-1)) blackhole
       # strace -e sendto,recvmsg -s 5 ip nexthop
       sendto(3, [[{nlmsg_len=24, nlmsg_type=RTM_GETNEXTHOP, nlmsg_flags=NLM_F_REQUEST|NLM_F_DUMP, nlmsg_seq=1691394080, nlmsg_pid=0}, {nh_family=AF_UNSPEC, nh_scope=RT_SCOPE_UNIVERSE, nh_protocol=RTPROT_UNSPEC, nh_flags=0}], {nlmsg_len=0, nlmsg_type=0 /* NLMSG_??? */, nlmsg_flags=0, nlmsg_seq=0, nlmsg_pid=0}], 152, 0, NULL, 0) = 152
       recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base=NULL, iov_len=0}], msg_iovlen=1, msg_controllen=0, msg_flags=MSG_TRUNC}, MSG_PEEK|MSG_TRUNC) = 56
       recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base=[[{nlmsg_len=36, nlmsg_type=RTM_NEWNEXTHOP, nlmsg_flags=NLM_F_MULTI, nlmsg_seq=1691394080, nlmsg_pid=342}, {nh_family=AF_INET, nh_scope=RT_SCOPE_UNIVERSE, nh_protocol=RTPROT_UNSPEC, nh_flags=0}, [[{nla_len=8, nla_type=NHA_ID}, 4294967295], {nla_len=4, nla_type=NHA_BLACKHOLE}]], [{nlmsg_len=20, nlmsg_type=NLMSG_DONE, nlmsg_flags=NLM_F_MULTI, nlmsg_seq=1691394080, nlmsg_pid=342}, 0]], iov_len=32768}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 56
       id 4294967295 blackhole
       +++ exited with 0 +++
      
      Note that if the NLMSG_DONE message cannot be appended because of size
      limitations, then another recvmsg() will be needed, but the core netlink
      code will not invoke the dump callback and simply reply with a
      NLMSG_DONE message since it knows that the callback previously returned
      zero.
      
      Add a test that fails before the fix:
      
       # ./fib_nexthops.sh -t basic
       [...]
       TEST: Maximum nexthop ID dump                                       [FAIL]
       [...]
      
      And passes after it:
      
       # ./fib_nexthops.sh -t basic
       [...]
       TEST: Maximum nexthop ID dump                                       [ OK ]
       [...]
      
      Fixes: ab84be7e ("net: Initial nexthop code")
      Reported-by: Petr Machata <petrm@nvidia.com>
      Closes: https://lore.kernel.org/netdev/87sf91enuf.fsf@nvidia.com/
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Reviewed-by: Petr Machata <petrm@nvidia.com>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Link: https://lore.kernel.org/r/20230808075233.3337922-2-idosch@nvidia.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      913f60ca
    • vlan: Fix VLAN 0 memory leak · 718cb09a
      Vlad Buslov authored
      The referenced commit intended to fix a memory leak of the VLAN 0 that
      is implicitly created on devices with the NETIF_F_HW_VLAN_CTAG_FILTER
      feature. However, it doesn't take into account that the feature can be
      re-set during the netdevice lifetime, which will cause a memory leak if
      the feature is disabled during device deletion, as illustrated by [0].
      Fix the leak by unconditionally deleting VLAN 0 on the NETDEV_DOWN
      event.
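      The toggle-induced leak can be modeled with a tiny standalone sketch.
      The struct and function names here are hypothetical stand-ins, not the
      8021q code:

```c
#include <stdbool.h>

/* Hypothetical model of the bug: VLAN 0 is added while the filter
 * feature is on, the feature is toggled off at runtime, and a teardown
 * that checks the *current* feature state then leaks the VLAN. */
struct fake_dev {
	bool filter_feature; /* stands in for NETIF_F_HW_VLAN_CTAG_FILTER */
	bool vlan0_present;
};

static void fake_dev_up(struct fake_dev *d)
{
	if (d->filter_feature)
		d->vlan0_present = true; /* implicit VLAN 0 added */
}

/* Buggy teardown: only frees VLAN 0 if the feature is still set. */
static void fake_dev_down_buggy(struct fake_dev *d)
{
	if (d->filter_feature)
		d->vlan0_present = false;
}

/* Fixed teardown: delete VLAN 0 unconditionally on NETDEV_DOWN. */
static void fake_dev_down_fixed(struct fake_dev *d)
{
	d->vlan0_present = false;
}
```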
      
      [0]:
      > modprobe 8021q
      > ip l set dev eth2 up
      > ethtool -K eth2 rx-vlan-filter off
      > modprobe -r mlx5_ib
      > modprobe -r mlx5_core
      > cat /sys/kernel/debug/kmemleak
      unreferenced object 0xffff888103dcd900 (size 256):
        comm "ip", pid 1490, jiffies 4294907305 (age 325.364s)
        hex dump (first 32 bytes):
          00 80 5d 03 81 88 ff ff 00 00 00 00 00 00 00 00  ..].............
          00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
        backtrace:
          [<00000000899f3bb9>] kmalloc_trace+0x25/0x80
          [<000000002889a7a2>] vlan_vid_add+0xa0/0x210
          [<000000007177800e>] vlan_device_event+0x374/0x760 [8021q]
          [<000000009a0716b1>] notifier_call_chain+0x35/0xb0
          [<00000000bbf3d162>] __dev_notify_flags+0x58/0xf0
          [<0000000053d2b05d>] dev_change_flags+0x4d/0x60
          [<00000000982807e9>] do_setlink+0x28d/0x10a0
          [<0000000058c1be00>] __rtnl_newlink+0x545/0x980
          [<00000000e66c3bd9>] rtnl_newlink+0x44/0x70
          [<00000000a2cc5970>] rtnetlink_rcv_msg+0x29c/0x390
          [<00000000d307d1e4>] netlink_rcv_skb+0x54/0x100
          [<00000000259d16f9>] netlink_unicast+0x1f6/0x2c0
          [<000000007ce2afa1>] netlink_sendmsg+0x232/0x4a0
          [<00000000f3f4bb39>] sock_sendmsg+0x38/0x60
          [<000000002f9c0624>] ____sys_sendmsg+0x1e3/0x200
          [<00000000d6ff5520>] ___sys_sendmsg+0x80/0xc0
      unreferenced object 0xffff88813354fde0 (size 32):
        comm "ip", pid 1490, jiffies 4294907305 (age 325.364s)
        hex dump (first 32 bytes):
          a0 d9 dc 03 81 88 ff ff a0 d9 dc 03 81 88 ff ff  ................
          81 00 00 00 01 00 00 00 00 00 00 00 00 00 00 00  ................
        backtrace:
          [<00000000899f3bb9>] kmalloc_trace+0x25/0x80
          [<000000002da64724>] vlan_vid_add+0xdf/0x210
          [<000000007177800e>] vlan_device_event+0x374/0x760 [8021q]
          [<000000009a0716b1>] notifier_call_chain+0x35/0xb0
          [<00000000bbf3d162>] __dev_notify_flags+0x58/0xf0
          [<0000000053d2b05d>] dev_change_flags+0x4d/0x60
          [<00000000982807e9>] do_setlink+0x28d/0x10a0
          [<0000000058c1be00>] __rtnl_newlink+0x545/0x980
          [<00000000e66c3bd9>] rtnl_newlink+0x44/0x70
          [<00000000a2cc5970>] rtnetlink_rcv_msg+0x29c/0x390
          [<00000000d307d1e4>] netlink_rcv_skb+0x54/0x100
          [<00000000259d16f9>] netlink_unicast+0x1f6/0x2c0
          [<000000007ce2afa1>] netlink_sendmsg+0x232/0x4a0
          [<00000000f3f4bb39>] sock_sendmsg+0x38/0x60
          [<000000002f9c0624>] ____sys_sendmsg+0x1e3/0x200
          [<00000000d6ff5520>] ___sys_sendmsg+0x80/0xc0
      
      Fixes: efc73f4b ("net: Fix memory leak - vlan_info struct")
      Reviewed-by: Ido Schimmel <idosch@nvidia.com>
      Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
      Link: https://lore.kernel.org/r/20230808093521.1468929-1-vladbu@nvidia.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      718cb09a
    • Merge branch 'smc-fixes' · c992fde9
      David S. Miller authored
      Gerd Bayer says:
      
      ====================
      net/smc: Fix effective buffer size
      
      commit 0227f058 ("net/smc: Unbind r/w buffer size from clcsock
      and make them tunable") started to derive the effective buffer size for
      SMC connections inconsistently in case a TCP fallback was used, and
      doubled the memory consumption of SMC with the default settings when a
      connection negotiated SMC. That was not what we wanted.
      
      This series consolidates the resulting effective buffer size that is
      used with SMC sockets, which is based on Jan Karcher's effort (see
      [1]). For all TCP exchanges (in particular in case of a fall back when
      no SMC connection was possible) the values from net.ipv4.tcp_[rw]mem
      are used. If SMC succeeds in establishing an SMC connection, the newly
      introduced values from net.smc.[rw]mem are used.
      
      net.smc.rmem and net.smc.wmem are each initialized to 64kB. Internal
      tests have shown this to be a good compromise between throughput/latency
      and memory consumption. Also, net.smc.[rw]mem is now decoupled completely
      from any tuning through net.ipv4.tcp_[rw]mem.
      
      If a user chose to tune a socket's receive or send buffer size with
      setsockopt, this tuning is now consistently applied to either fall-back
      TCP or proper SMC connections over the socket.
      
      Thanks,
      Gerd
      
      v2 - v3:
       - Rebase to and resolve conflict of second patch with latest net/master.
      v1 - v2:
       - In second patch, use sock_net() helper as suggested by Tony and demanded
         by kernel test robot.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c992fde9
    • net/smc: Use correct buffer sizes when switching between TCP and SMC · 30c3c4a4
      Gerd Bayer authored
      Tuning of the effective buffer size through setsockopts was working for
      SMC traffic only but not for TCP fall-back connections even before
      commit 0227f058 ("net/smc: Unbind r/w buffer size from clcsock and
      make them tunable"). That change made it apparent that TCP fall-back
      connections would use net.smc.[rw]mem as buffer size instead of
      net.ipv4_tcp_[rw]mem.
      
      Amend the code that copies attributes between the (TCP) clcsock and the
      SMC socket and adjust buffer sizes appropriately:
      - Copy over sk_userlocks so that both sockets agree on whether tuning
        via setsockopt is active.
      - When falling back to TCP use sk_sndbuf or sk_rcvbuf as specified with
        setsockopt. Otherwise, use the sysctl value for TCP/IPv4.
      - Likewise, use either values from setsockopt or from sysctl for SMC
        (duplicated) on successful SMC connect.
      
      In smc_tcp_listen_work() drop the explicit copy of buffer sizes as that
      is taken care of by the attribute copy.
      
      Fixes: 0227f058 ("net/smc: Unbind r/w buffer size from clcsock and make them tunable")
      Reviewed-by: Wenjia Zhang <wenjia@linux.ibm.com>
      Reviewed-by: Tony Lu <tonylu@linux.alibaba.com>
      Signed-off-by: Gerd Bayer <gbayer@linux.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      30c3c4a4
    • net/smc: Fix setsockopt and sysctl to specify same buffer size again · 833bac7e
      Gerd Bayer authored
      Commit 0227f058 ("net/smc: Unbind r/w buffer size from clcsock
      and make them tunable") introduced the net.smc.rmem and net.smc.wmem
      sysctls to specify the size of buffers to be used for SMC type
      connections. This created a regression for users that specified the
      buffer size via setsockopt() as the effective buffer size was now
      doubled.
      
      Re-introduce the division by 2 in the SMC buffer create code and level
      this out by duplicating the net.smc.[rw]mem values used for initializing
      sk_rcvbuf/sk_sndbuf at socket creation time. This gives users of both
      methods (setsockopt or sysctl) the effective buffer size that they
      expect.
      
      Initialize net.smc.rmem and net.smc.wmem each from a constant of 64kB.
      Internal performance tests show that this value is a good compromise
      between throughput/latency and memory consumption. Also, this decouples
      it from any tuning that was done to net.ipv4.tcp_[rw]mem[1] before the
      module for SMC protocol was loaded. Check that no more than INT_MAX / 2
      is assigned to net.smc.[rw]mem, in order to avoid any overflow condition
      when that is doubled for use in sk_sndbuf or sk_rcvbuf.
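      The overflow bound can be checked with a few lines of C. This is a
      standalone sketch of the arithmetic only, with hypothetical function
      names, not the SMC sysctl handler itself:

```c
#include <limits.h>

/* Sketch of the bound: values are doubled before landing in the signed
 * int sk_sndbuf/sk_rcvbuf, so anything above INT_MAX / 2 would
 * overflow when doubled. */
static int smc_buf_value_ok(int val)
{
	return val >= 0 && val <= INT_MAX / 2;
}

static int doubled_buf_size(int val)
{
	return 2 * val; /* safe only because val <= INT_MAX / 2 */
}
```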
      
      While at it, drop the confusing sk_buf_size variable from
      __smc_buf_create and name "compressed" buffer size variables more
      consistently.
      
      Background:
      
      Before the commit mentioned above, SMC's buffer allocator in
      __smc_buf_create() always used half of the sockets' sk_rcvbuf/sk_sndbuf
      value as the initial value to search for appropriate buffers. If the
      search resorted to using a bigger buffer because all buffers of the
      specified size were busy, double the effective buffer size that was
      actually used was stored back to sk_rcvbuf/sk_sndbuf.
      
      When available, buffers of exactly the size that a user had specified as
      input to setsockopt() were used, despite setsockopt()'s documentation in
      "man 7 socket" talking of a mandatory duplication:
      
      [...]
             SO_SNDBUF
                    Sets  or  gets the maximum socket send buffer in bytes.
                    The kernel doubles this value (to allow space for book‐
                    keeping  overhead)  when it is set using setsockopt(2),
                    and this doubled value is  returned  by  getsockopt(2).
                    The     default     value     is     set     by     the
                    /proc/sys/net/core/wmem_default file  and  the  maximum
                    allowed value is set by the /proc/sys/net/core/wmem_max
                    file.  The minimum (doubled) value for this  option  is
                    2048.
      [...]
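      The doubling described in the man page excerpt can be observed from
      user space with a minimal sketch (Linux-specific behavior; the exact
      effective value depends on net.core.wmem_max, and the helper name is
      made up for illustration):

```c
#include <sys/socket.h>
#include <unistd.h>

/* Returns the effective SO_SNDBUF after requesting `req` bytes, or -1
 * if no socket could be created. On Linux the kernel stores double the
 * requested value, clamped by net.core.wmem_max and a doubled floor of
 * 2048. */
static int effective_sndbuf(int req)
{
	int fd = socket(AF_INET, SOCK_DGRAM, 0);
	int eff = 0;
	socklen_t len = sizeof(eff);

	if (fd < 0)
		return -1;
	setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &req, sizeof(req));
	getsockopt(fd, SOL_SOCKET, SO_SNDBUF, &eff, &len);
	close(fd);
	return eff;
}
```

      With default sysctls, requesting 65536 bytes typically reads back as
      131072, matching the "doubled value is returned by getsockopt(2)"
      wording above.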
      
      Fixes: 0227f058 ("net/smc: Unbind r/w buffer size from clcsock and make them tunable")
      Co-developed-by: Jan Karcher <jaka@linux.ibm.com>
      Signed-off-by: Jan Karcher <jaka@linux.ibm.com>
      Reviewed-by: Wenjia Zhang <wenjia@linux.ibm.com>
      Reviewed-by: Tony Lu <tonylu@linux.alibaba.com>
      Signed-off-by: Gerd Bayer <gbayer@linux.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      833bac7e
    • Merge branch 'enetc-probe-fix' · d0378ae6
      David S. Miller authored
      Vladimir Oltean says:
      
      ====================
      Fix ENETC probing after 6fffbc7a ("PCI: Honor firmware's device disabled status")
      
      I'm not sure who should take this patch set (net maintainers or PCI
      maintainers). Everyone could pick up just their part, and that would
      work (no compile time dependencies). However, the entire series needs
      ACK from both sides and Rob for sure.
      
      v1 at:
      https://lore.kernel.org/netdev/20230521115141.2384444-1-vladimir.oltean@nxp.com/
      ====================
      Reviewed-by: Simon Horman <horms@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      d0378ae6
    • net: enetc: remove of_device_is_available() handling · bfce089d
      Vladimir Oltean authored
      Since commit 6fffbc7a ("PCI: Honor firmware's device disabled
      status"), this is redundant and does nothing, because enetc_pf_probe()
      no longer even gets called.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      bfce089d
    • net: enetc: reimplement RFS/RSS memory clearing as PCI quirk · f0168042
      Vladimir Oltean authored
      The workaround implemented in commit 3222b5b6 ("net: enetc:
      initialize RFS/RSS memories for unused ports too") is no longer
      effective after commit 6fffbc7a ("PCI: Honor firmware's device
      disabled status"). Thus, it has introduced a regression and we see AER
      errors being reported again:
      
      $ ip link set sw2p0 up && dhclient -i sw2p0 && ip addr show sw2p0
      fsl_enetc 0000:00:00.2 eno2: configuring for fixed/internal link mode
      fsl_enetc 0000:00:00.2 eno2: Link is Up - 2.5Gbps/Full - flow control rx/tx
      mscc_felix 0000:00:00.5 swp2: configuring for fixed/sgmii link mode
      mscc_felix 0000:00:00.5 swp2: Link is Up - 1Gbps/Full - flow control off
      sja1105 spi2.2 sw2p0: configuring for phy/rgmii-id link mode
      sja1105 spi2.2 sw2p0: Link is Up - 1Gbps/Full - flow control off
      pcieport 0000:00:1f.0: AER: Multiple Corrected error received: 0000:00:00.0
      pcieport 0000:00:1f.0: AER: can't find device of ID0000
      
      Rob's suggestion is to reimplement the enetc driver workaround as a
      PCI fixup, and to modify the PCI core to run the fixups for all PCI
      functions. This change handles the first part.
      
      We refactor the common code in enetc_psi_create() and enetc_psi_destroy(),
      and use the PCI fixup only for those functions for which enetc_pf_probe()
      won't get called. This avoids some work being done twice for the PFs
      which are enabled.
      
      Fixes: 6fffbc7a ("PCI: Honor firmware's device disabled status")
      Link: https://lore.kernel.org/netdev/CAL_JsqLsVYiPLx2kcHkDQ4t=hQVCR7NHziDwi9cCFUFhx48Qow@mail.gmail.com/
      Suggested-by: Rob Herring <robh@kernel.org>
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f0168042
    • PCI: move OF status = "disabled" detection to dev->match_driver · 1a8c251c
      Vladimir Oltean authored
      The blamed commit has broken probing on
      arch/arm64/boot/dts/freescale/fsl-ls1028a.dtsi when &enetc_port0
      (PCI function 0) has status = "disabled".
      
      Background: pci_scan_slot() has logic to say that if the function 0 of a
      device is absent, the entire device is absent and we can skip the other
      functions entirely. Traditionally, this has meant that
      pci_bus_read_dev_vendor_id() returns an error code for that function.
      
      However, since the blamed commit, there is an extra confounding
      condition: function 0 of the device exists and has a valid vendor id,
      but it is disabled in the device tree. In that case, pci_scan_slot()
      would incorrectly skip the entire device instead of just that function.
      
      In the case of NXP LS1028A, status = "disabled" does not mean that the
      PCI function's config space is not available for reading. It is, but the
      Ethernet port is just not functionally useful with a particular SerDes
      protocol configuration (0x9999) due to pinmuxing constraints of the SoC.
      So, pci_scan_slot() skips all other functions on the ENETC ECAM
      (enetc_port1, enetc_port2, enetc_mdio_pf3 etc) when just enetc_port0 had
      to not be probed.
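      The distinction between "function 0 absent" and "function 0 present but
      DT-disabled" can be sketched as a pair of predicates. This is a
      standalone model of the scan decision with invented names, not
      pci_scan_slot() itself:

```c
/* Model of the slot-scan decision. Before the fix, a readable but
 * DT-disabled function 0 was treated like an absent device, so the
 * remaining functions of the slot were never scanned. */
static int scan_other_fns_buggy(int fn0_readable, int fn0_dt_enabled)
{
	return fn0_readable && fn0_dt_enabled;
}

/* Fixed semantics: DT status only gates driver binding (via
 * dev->match_driver); a readable function 0 proves the device exists,
 * so the other functions are still scanned and fixups still run. */
static int scan_other_fns_fixed(int fn0_readable, int fn0_dt_enabled)
{
	(void)fn0_dt_enabled;
	return fn0_readable;
}
```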
      
      There is an additional regression introduced by the change, caused by
      its fundamental premise. The enetc driver needs to run code for all PCI
      functions, regardless of whether they're enabled or not in the device
      tree. That is no longer possible if the driver's probe function is no
      longer called. Rob recommends moving the of_device_is_available()
      detection to dev->match_driver instead; this keeps the PCI fixups
      running on all functions, while drivers are probed only for those
      functions that are enabled. So, a separate change in the enetc driver
      will have to move the workarounds to a PCI fixup.
      
      Fixes: 6fffbc7a ("PCI: Honor firmware's device disabled status")
      Link: https://lore.kernel.org/netdev/CAL_JsqLsVYiPLx2kcHkDQ4t=hQVCR7NHziDwi9cCFUFhx48Qow@mail.gmail.com/
      Suggested-by: Rob Herring <robh@kernel.org>
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      1a8c251c
  2. 08 Aug, 2023 13 commits
  3. 07 Aug, 2023 14 commits
    • Jakub Kicinski's avatar
      Merge branch 'wireguard-fixes-for-6-5-rc6' · fa41884c
      Jakub Kicinski authored
      Jason A. Donenfeld says:
      
      ====================
      wireguard fixes for 6.5-rc6
      
      Just one patch this time, somewhat late in the cycle:
      
      1) Fix an off-by-one calculation for the maximum node depth size in the
         allowedips trie data structure, and also adjust the self-tests to hit
         this case so it doesn't regress again in the future.
      ====================
      
      Link: https://lore.kernel.org/r/20230807132146.2191597-1-Jason@zx2c4.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      fa41884c
    • Jason A. Donenfeld's avatar
      wireguard: allowedips: expand maximum node depth · 46622219
      Jason A. Donenfeld authored
      In the allowedips self-test, nodes are inserted into the tree, but the
      test generated an even number of nodes, while the maximum node depth
      also counts the root node, which makes the total necessarily odd. With
      too few nodes added, the test never triggered the maximum depth check
      like it should have. So, add 129 nodes instead of
      128 nodes, and do so with a more straightforward scheme, starting with
      all the bits set, and shifting over one each time. Then increase the
      maximum depth to 129, and choose a better name for that variable to
      make it clear that it represents depth as opposed to bits.
      
      Cc: stable@vger.kernel.org
      Fixes: e7096c13 ("net: WireGuard secure network tunnel")
      Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
      Link: https://lore.kernel.org/r/20230807132146.2191597-2-Jason@zx2c4.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      46622219
    • Ziyang Xuan's avatar
      bonding: Fix incorrect deletion of ETH_P_8021AD protocol vid from slaves · 01f4fd27
      Ziyang Xuan authored
      BUG_ON(!vlan_info) is triggered in unregister_vlan_dev() with the
      following testcase:
      
        # ip netns add ns1
        # ip netns exec ns1 ip link add bond0 type bond mode 0
        # ip netns exec ns1 ip link add bond_slave_1 type veth peer veth2
        # ip netns exec ns1 ip link set bond_slave_1 master bond0
        # ip netns exec ns1 ip link add link bond_slave_1 name vlan10 type vlan id 10 protocol 802.1ad
        # ip netns exec ns1 ip link add link bond0 name bond0_vlan10 type vlan id 10 protocol 802.1ad
        # ip netns exec ns1 ip link set bond_slave_1 nomaster
        # ip netns del ns1
      
      The logical analysis of the problem is as follows:
      
      1. create ETH_P_8021AD protocol vlan10 for bond_slave_1:
      register_vlan_dev()
        vlan_vid_add()
          vlan_info_alloc()
          __vlan_vid_add() // add [ETH_P_8021AD, 10] vid to bond_slave_1
      
      2. create ETH_P_8021AD protocol bond0_vlan10 for bond0:
      register_vlan_dev()
        vlan_vid_add()
          __vlan_vid_add()
            vlan_add_rx_filter_info()
                if (!vlan_hw_filter_capable(dev, proto)) // condition established because bond0 without NETIF_F_HW_VLAN_STAG_FILTER
                    return 0;
      
                if (netif_device_present(dev))
                    return dev->netdev_ops->ndo_vlan_rx_add_vid(dev, proto, vid); // will be never called
                    // The slaves of bond0 will not refer to the [ETH_P_8021AD, 10] vid.
      
      3. detach bond_slave_1 from bond0:
      __bond_release_one()
        vlan_vids_del_by_dev()
          list_for_each_entry(vid_info, &vlan_info->vid_list, list)
              vlan_vid_del(dev, vid_info->proto, vid_info->vid);
              // bond_slave_1 [ETH_P_8021AD, 10] vid will be deleted.
              // bond_slave_1->vlan_info will be assigned NULL.
      
      4. delete vlan10 while deleting ns1:
      default_device_exit_batch()
        dev->rtnl_link_ops->dellink() // unregister_vlan_dev() for vlan10
          vlan_info = rtnl_dereference(real_dev->vlan_info); // real_dev of vlan10 is bond_slave_1
          BUG_ON(!vlan_info); // bond_slave_1->vlan_info is NULL now, bug is triggered!!!
      
      Add S-VLAN tag related feature support to the bond driver, so that the
      bond driver always propagates the VLAN info to its slaves.
      
      Fixes: 8ad227ff ("net: vlan: add 802.1ad support")
      Suggested-by: Ido Schimmel <idosch@idosch.org>
      Signed-off-by: Ziyang Xuan <william.xuanziyang@huawei.com>
      Reviewed-by: Ido Schimmel <idosch@nvidia.com>
      Link: https://lore.kernel.org/r/20230802114320.4156068-1-william.xuanziyang@huawei.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      01f4fd27
    • Lama Kayal's avatar
      net/mlx5e: Add capability check for vnic counters · 548ee049
      Lama Kayal authored
      Add missing capability check for each of the vnic counters exposed by
      devlink health reporter, and thus avoid unexpected behavior due to
      invalid access to registers.
      
      While at it, read only the exact number of bits for each counter,
      whether it is 32 bits or 64 bits.
      
      Fixes: b0bc615d ("net/mlx5: Add vnic devlink health reporter to PFs/VFs")
      Fixes: a33682e4 ("net/mlx5e: Expose catastrophic steering error counters")
      Signed-off-by: Lama Kayal <lkayal@nvidia.com>
      Reviewed-by: Gal Pressman <gal@nvidia.com>
      Reviewed-by: Rahul Rameshbabu <rrameshbabu@nvidia.com>
      Reviewed-by: Maher Sanalla <msanalla@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
      548ee049
    • Moshe Shemesh's avatar
      net/mlx5: Reload auxiliary devices in pci error handlers · aab8e1a2
      Moshe Shemesh authored
      Handling PCI errors should fully tear down and reload the auxiliary
      devices, the same as done through the mlx5 health recovery flow.
      
      Fixes: 72ed5d56 ("net/mlx5: Suspend auxiliary devices only in case of PCI device suspend")
      Signed-off-by: Moshe Shemesh <moshe@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
      aab8e1a2
    • Moshe Shemesh's avatar
      net/mlx5: Skip clock update work when device is in error state · d0062076
      Moshe Shemesh authored
      When the device is in an error state, marked by the flag
      MLX5_DEVICE_STATE_INTERNAL_ERROR, the HW and PCI may not be accessible,
      and so the clock update work should be skipped. Furthermore, such access
      through PCI in the error state, after calling mlx5_pci_disable_device(),
      can result in failing to recover from PCI errors.
      
      Fixes: ef9814de ("net/mlx5e: Add HW timestamping (TS) support")
      Reported-and-tested-by: Ganesh G R <ganeshgr@linux.ibm.com>
      Closes: https://lore.kernel.org/netdev/9bdb9b9d-140a-7a28-f0de-2e64e873c068@nvidia.com
      Signed-off-by: Moshe Shemesh <moshe@nvidia.com>
      Reviewed-by: Aya Levin <ayal@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
      d0062076
    • Shay Drory's avatar
      net/mlx5: LAG, Check correct bucket when modifying LAG · 86ed7b77
      Shay Drory authored
      The cited patch introduced buckets in hash mode, but missed updating
      the ports/bucket check when modifying LAG.
      Fix the check.
      
      Fixes: 352899f3 ("net/mlx5: Lag, use buckets in hash mode")
      Signed-off-by: Shay Drory <shayd@nvidia.com>
      Reviewed-by: Maor Gottlieb <maorg@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
      86ed7b77
    • Chris Mi's avatar
      net/mlx5e: Unoffload post act rule when handling FIB events · 6b5926eb
      Chris Mi authored
      Given the following tc rule on the stack device:
      
      filter parent ffff: protocol ip pref 3 flower chain 1
      filter parent ffff: protocol ip pref 3 flower chain 1 handle 0x1
        dst_mac 24:25:d0:e1:00:00
        src_mac 02:25:d0:25:01:02
        eth_type ipv4
        ct_state +trk+new
        in_hw in_hw_count 1
              action order 1: ct commit zone 0 pipe
               index 2 ref 1 bind 1 installed 3807 sec used 3779 sec firstused 3800 sec
              Action statistics:
              Sent 120 bytes 2 pkt (dropped 0, overlimits 0 requeues 0)
              backlog 0b 0p requeues 0
              used_hw_stats delayed
      
              action order 2: tunnel_key  set
              src_ip 192.168.1.25
              dst_ip 192.168.1.26
              key_id 4
              dst_port 4789
              csum pipe
               index 3 ref 1 bind 1 installed 3807 sec used 3779 sec firstused 3800 sec
              Action statistics:
              Sent 120 bytes 2 pkt (dropped 0, overlimits 0 requeues 0)
              backlog 0b 0p requeues 0
              used_hw_stats delayed
      
              action order 3: mirred (Egress Redirect to device vxlan1) stolen
              index 9 ref 1 bind 1 installed 3807 sec used 3779 sec firstused 3800 sec
              Action statistics:
              Sent 120 bytes 2 pkt (dropped 0, overlimits 0 requeues 0)
              backlog 0b 0p requeues 0
              used_hw_stats delayed
      
      When handling FIB events, the rule in the post act table is not deleted.
      And because the post act rule has packet reformat and modify header
      actions, it will also hit the following syndromes:
      
      mlx5_core 0000:08:00.0: mlx5_cmd_out_err:829:(pid 11613): DEALLOC_MODIFY_HEADER_CONTEXT(0x941) op_mod(0x0) failed, status bad resource state(0x9), syndrome (0x1ab444), err(-22)
      mlx5_core 0000:08:00.0: mlx5_cmd_out_err:829:(pid 11613): DEALLOC_PACKET_REFORMAT_CONTEXT(0x93e) op_mod(0x0) failed, status bad resource state(0x9), syndrome (0x179e84), err(-22)
      
      Fix it by unoffloading post act rule when handling FIB events.
      
      Fixes: 314e1105 ("net/mlx5e: Add post act offload/unoffload API")
      Signed-off-by: Chris Mi <cmi@nvidia.com>
      Reviewed-by: Vlad Buslov <vladbu@nvidia.com>
      Reviewed-by: Roi Dayan <roid@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
      6b5926eb
    • Daniel Jurgens's avatar
      net/mlx5: Fix devlink controller number for ECVF · 2d691c90
      Daniel Jurgens authored
      The controller number for ECVFs is always 0, because the ECPF must be
      the eswitch owner for EC VFs to be enabled.
      
      Fixes: dc131808 ("net/mlx5: Enable devlink port for embedded cpu VF vports")
      Signed-off-by: Daniel Jurgens <danielj@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
      2d691c90
    • Daniel Jurgens's avatar
      net/mlx5: Allow 0 for total host VFs · 2dc2b392
      Daniel Jurgens authored
      When querying eswitch functions, 0 is a valid number of host VFs. After
      introducing ARM SRIOV, falling through to getting the max value from PCI
      results in using the total VFs allowed on the ARM for the host.
      
      Fixes: 86eec50b ("net/mlx5: Support querying max VFs from device")
      Signed-off-by: Daniel Jurgens <danielj@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
      2dc2b392
    • Daniel Jurgens's avatar
      net/mlx5: Return correct EC_VF function ID · 06c868fd
      Daniel Jurgens authored
      The ECVF function ID range is 1..max_ec_vfs. Currently
      mlx5_vport_to_func_id returns 0..max_ec_vfs - 1. Which
      results in a syndrome when querying the caps with more
      recent firmware, or reading incorrect caps with older
      firmware that supports EC VFs.
      
      Fixes: 9ac0b128 ("net/mlx5: Update vport caps query/set for EC VFs")
      Signed-off-by: Daniel Jurgens <danielj@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
      06c868fd
    • Yevgeny Kliteynik's avatar
      net/mlx5: DR, Fix wrong allocation of modify hdr pattern · 8bfe1e19
      Yevgeny Kliteynik authored
      Fix a wrong calculation of the modify hdr pattern size, where the
      previously calculated number was not enough to accommodate the
      required number of actions.
      
      Fixes: da5d0027 ("net/mlx5: DR, Add cache for modify header pattern")
      Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
      Reviewed-by: Erez Shitrit <erezsh@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
      8bfe1e19
    • Jianbo Liu's avatar
      net/mlx5e: TC, Fix internal port memory leak · ac5da544
      Jianbo Liu authored
      The flow rule can be split, and the extra post_act rules are added
      to the post_act table. It's possible to trigger a memleak when the rule
      forwards packets from an internal port and over a tunnel, for example
      when CT 'new' state offload is allowed. The int_port object is
      assigned to the flow attribute of the post_act rule and its refcnt is
      incremented by mlx5e_tc_int_port_get(), but mlx5e_tc_int_port_put() is
      never called, so the refcnt is never decremented and int_port is never
      freed.
      
      kmemleak reports the following error:
      unreferenced object 0xffff888128204b80 (size 64):
        comm "handler20", pid 50121, jiffies 4296973009 (age 642.932s)
        hex dump (first 32 bytes):
          01 00 00 00 19 00 00 00 03 f0 00 00 04 00 00 00  ................
          98 77 67 41 81 88 ff ff 98 77 67 41 81 88 ff ff  .wgA.....wgA....
        backtrace:
          [<00000000e992680d>] kmalloc_trace+0x27/0x120
          [<000000009e945a98>] mlx5e_tc_int_port_get+0x3f3/0xe20 [mlx5_core]
          [<0000000035a537f0>] mlx5e_tc_add_fdb_flow+0x473/0xcf0 [mlx5_core]
          [<0000000070c2cec6>] __mlx5e_add_fdb_flow+0x7cf/0xe90 [mlx5_core]
          [<000000005cc84048>] mlx5e_configure_flower+0xd40/0x4c40 [mlx5_core]
          [<000000004f8a2031>] mlx5e_rep_indr_offload.isra.0+0x10e/0x1c0 [mlx5_core]
          [<000000007df797dc>] mlx5e_rep_indr_setup_tc_cb+0x90/0x130 [mlx5_core]
          [<0000000016c15cc3>] tc_setup_cb_add+0x1cf/0x410
          [<00000000a63305b4>] fl_hw_replace_filter+0x38f/0x670 [cls_flower]
          [<000000008bc9e77c>] fl_change+0x1fd5/0x4430 [cls_flower]
          [<00000000e7f766e4>] tc_new_tfilter+0x867/0x2010
          [<00000000e101c0ef>] rtnetlink_rcv_msg+0x6fc/0x9f0
          [<00000000e1111d44>] netlink_rcv_skb+0x12c/0x360
          [<0000000082dd6c8b>] netlink_unicast+0x438/0x710
          [<00000000fc568f70>] netlink_sendmsg+0x794/0xc50
          [<0000000016e92590>] sock_sendmsg+0xc5/0x190
      
      So fix this by moving int_port cleanup code to the flow attribute
      free helper, which is used by all the attribute free cases.
      
      Fixes: 8300f225 ("net/mlx5e: Create new flow attr for multi table actions")
      Signed-off-by: Jianbo Liu <jianbol@nvidia.com>
      Reviewed-by: Vlad Buslov <vladbu@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
      ac5da544
    • Gal Pressman's avatar
      net/mlx5e: Take RTNL lock when needed before calling xdp_set_features() · 72cc6549
      Gal Pressman authored
      Hold RTNL lock when calling xdp_set_features() with a registered netdev,
      as the call triggers the netdev notifiers. This could happen when
      switching from uplink rep to nic profile for example.
      
      This resolves the following call trace:
      
      RTNL: assertion failed at net/core/dev.c (1953)
      WARNING: CPU: 6 PID: 112670 at net/core/dev.c:1953 call_netdevice_notifiers_info+0x7c/0x80
      Modules linked in: sch_mqprio sch_mqprio_lib act_tunnel_key act_mirred act_skbedit cls_matchall nfnetlink_cttimeout act_gact cls_flower sch_ingress bonding ib_umad ip_gre rdma_ucm mlx5_vfio_pci ipip tunnel4 ip6_gre gre mlx5_ib vfio_pci vfio_pci_core vfio_iommu_type1 ib_uverbs vfio mlx5_core ib_ipoib geneve nf_tables ip6_tunnel tunnel6 iptable_raw openvswitch nsh rpcrdma ib_iser libiscsi scsi_transport_iscsi rdma_cm iw_cm ib_cm ib_core xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xt_addrtype iptable_nat nf_nat br_netfilter rpcsec_gss_krb5 auth_rpcgss oid_registry overlay zram zsmalloc fuse [last unloaded: ib_uverbs]
      CPU: 6 PID: 112670 Comm: devlink Not tainted 6.4.0-rc7_for_upstream_min_debug_2023_06_28_17_02 #1
      Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
      RIP: 0010:call_netdevice_notifiers_info+0x7c/0x80
      Code: 90 ff 80 3d 2d 6b f7 00 00 75 c5 ba a1 07 00 00 48 c7 c6 e4 ce 0b 82 48 c7 c7 c8 f4 04 82 c6 05 11 6b f7 00 01 e8 a4 7c 8e ff <0f> 0b eb a2 0f 1f 44 00 00 55 48 89 e5 41 54 48 83 e4 f0 48 83 ec
      RSP: 0018:ffff8882a21c3948 EFLAGS: 00010282
      RAX: 0000000000000000 RBX: ffffffff82e6f880 RCX: 0000000000000027
      RDX: ffff88885f99b5c8 RSI: 0000000000000001 RDI: ffff88885f99b5c0
      RBP: 0000000000000028 R08: ffff88887ffabaa8 R09: 0000000000000003
      R10: ffff88887fecbac0 R11: ffff88887ff7bac0 R12: ffff8882a21c3968
      R13: ffff88811c018940 R14: 0000000000000000 R15: ffff8881274401a0
      FS:  00007fe141c81800(0000) GS:ffff88885f980000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00007f787c28b948 CR3: 000000014bcf3005 CR4: 0000000000370ea0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      Call Trace:
       <TASK>
       ? __warn+0x79/0x120
       ? call_netdevice_notifiers_info+0x7c/0x80
       ? report_bug+0x17c/0x190
       ? handle_bug+0x3c/0x60
       ? exc_invalid_op+0x14/0x70
       ? asm_exc_invalid_op+0x16/0x20
       ? call_netdevice_notifiers_info+0x7c/0x80
       ? call_netdevice_notifiers_info+0x7c/0x80
       call_netdevice_notifiers+0x2e/0x50
       mlx5e_set_xdp_feature+0x21/0x50 [mlx5_core]
       mlx5e_nic_init+0xf1/0x1a0 [mlx5_core]
       mlx5e_netdev_init_profile+0x76/0x110 [mlx5_core]
       mlx5e_netdev_attach_profile+0x1f/0x90 [mlx5_core]
       mlx5e_netdev_change_profile+0x92/0x160 [mlx5_core]
       mlx5e_netdev_attach_nic_profile+0x1b/0x30 [mlx5_core]
       mlx5e_vport_rep_unload+0xaa/0xc0 [mlx5_core]
       __esw_offloads_unload_rep+0x52/0x60 [mlx5_core]
       mlx5_esw_offloads_rep_unload+0x52/0x70 [mlx5_core]
       esw_offloads_unload_rep+0x34/0x70 [mlx5_core]
       esw_offloads_disable+0x2b/0x90 [mlx5_core]
       mlx5_eswitch_disable_locked+0x1b9/0x210 [mlx5_core]
       mlx5_devlink_eswitch_mode_set+0xf5/0x630 [mlx5_core]
       ? devlink_get_from_attrs_lock+0x9e/0x110
       devlink_nl_cmd_eswitch_set_doit+0x60/0xe0
       genl_family_rcv_msg_doit.isra.0+0xc2/0x110
       genl_rcv_msg+0x17d/0x2b0
       ? devlink_get_from_attrs_lock+0x110/0x110
       ? devlink_nl_cmd_eswitch_get_doit+0x290/0x290
       ? devlink_pernet_pre_exit+0xf0/0xf0
       ? genl_family_rcv_msg_doit.isra.0+0x110/0x110
       netlink_rcv_skb+0x54/0x100
       genl_rcv+0x24/0x40
       netlink_unicast+0x1f6/0x2c0
       netlink_sendmsg+0x232/0x4a0
       sock_sendmsg+0x38/0x60
       ? _copy_from_user+0x2a/0x60
       __sys_sendto+0x110/0x160
       ? __count_memcg_events+0x48/0x90
       ? handle_mm_fault+0x161/0x260
       ? do_user_addr_fault+0x278/0x6e0
       __x64_sys_sendto+0x20/0x30
       do_syscall_64+0x3d/0x90
       entry_SYSCALL_64_after_hwframe+0x46/0xb0
      RIP: 0033:0x7fe141b1340a
      Code: d8 64 89 02 48 c7 c0 ff ff ff ff eb b8 0f 1f 00 f3 0f 1e fa 41 89 ca 64 8b 04 25 18 00 00 00 85 c0 75 15 b8 2c 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 7e c3 0f 1f 44 00 00 41 54 48 83 ec 30 44 89
      RSP: 002b:00007fff61d03de8 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
      RAX: ffffffffffffffda RBX: 0000000000afab00 RCX: 00007fe141b1340a
      RDX: 0000000000000038 RSI: 0000000000afab00 RDI: 0000000000000003
      RBP: 0000000000afa910 R08: 00007fe141d80200 R09: 000000000000000c
      R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
      R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000001
       </TASK>
      
      Fixes: 4d5ab0ad ("net/mlx5e: take into account device reconfiguration for xdp_features flag")
      Signed-off-by: Gal Pressman <gal@nvidia.com>
      Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
      72cc6549
  4. 06 Aug, 2023 1 commit