1. 26 Oct, 2021 40 commits
    • net: dsa: flush switchdev workqueue when leaving the bridge · d7d0d423
      Vladimir Oltean authored
      DSA is preparing to offer switch drivers an API through which they can
      associate each FDB entry with a struct net_device *bridge_dev. This can
      be used to perform FDB isolation (the FDB lookup performed on the
      ingress of a standalone, or bridged port, should not find an FDB entry
      that is present in the FDB of another bridge).
      
      In preparation of that work, DSA needs to ensure that by the time we
      call the switch .port_fdb_add and .port_fdb_del methods, the
      dp->bridge_dev pointer is still valid, i.e. the port is still a bridge
      port.
      
      This is not guaranteed because the SWITCHDEV_FDB_{ADD,DEL}_TO_DEVICE API
      requires drivers that must have sleepable context to handle those events
      to schedule the deferred work themselves. DSA does this through the
      dsa_owq.
      
      It can happen that a port leaves a bridge, del_nbp() flushes the FDB on
      that port, SWITCHDEV_FDB_DEL_TO_DEVICE is notified in atomic context,
      DSA schedules its deferred work, but del_nbp() finishes unlinking the
      bridge as a master from the port before DSA's deferred work is run.
      
      Fundamentally, the port must not be unlinked from the bridge until all
      FDB deletion deferred work items have been flushed. The bridge must wait
      for the completion of these hardware accesses.
      
      An attempt has been made to address this issue centrally in switchdev by
      making SWITCHDEV_FDB_DEL_TO_DEVICE deferred (=> blocking) at the switchdev
      level, which would offer implicit synchronization with del_nbp:
      
      https://patchwork.kernel.org/project/netdevbpf/cover/20210820115746.3701811-1-vladimir.oltean@nxp.com/
      
      but it seems that any attempt to modify switchdev's behavior and make
      the events blocking there would introduce undesirable side effects in
      other switchdev consumers.
      
      The most undesirable behavior seems to be that
      switchdev_deferred_process_work() takes the rtnl_mutex itself, which
      would be worse off than having the rtnl_mutex taken individually from
      drivers which is what we have now (except DSA which has removed that
      lock since commit 0faf890f ("net: dsa: drop rtnl_lock from
      dsa_slave_switchdev_event_work")).
      
      So to offer the needed guarantee to DSA switch drivers, I have come up
      with a compromise solution that does not require switchdev rework:
      we already have a hook at the last moment in time when the bridge is
      still an upper of ours: the NETDEV_PRECHANGEUPPER handler. We can flush
      the dsa_owq manually from there, which makes all FDB deletions
      synchronous.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
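The ordering problem described in this commit message can be modelled in user space. The sketch below is not DSA code: `struct port`, `queue_fdb_del()`, `flush_owq()` and `leave_bridge()` are hypothetical stand-ins for `dp->bridge_dev`, the deferred work scheduled on `dsa_owq`, a flush of that workqueue, and the handling around NETDEV_PRECHANGEUPPER. It only shows why flushing before the unlink lets every deferred deletion still see a valid bridge pointer.

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_WORK 16

/* Hypothetical stand-in for a DSA port; bridge is NULL when standalone. */
struct port {
	const char *bridge;
	int del_with_bridge;    /* deletions that ran while bridge was set */
	int del_without_bridge; /* deletions that ran after the unlink */
};

/* Hypothetical stand-in for the ordered workqueue (dsa_owq). */
static struct port *pending[MAX_WORK];
static size_t n_pending;

/* Atomic-context side: only record the deletion, do not touch hardware. */
static bool queue_fdb_del(struct port *p)
{
	if (n_pending == MAX_WORK)
		return false;
	pending[n_pending++] = p;
	return true;
}

/* Sleepable side: run every queued deletion in submission order. */
static void flush_owq(void)
{
	for (size_t i = 0; i < n_pending; i++) {
		struct port *p = pending[i];

		if (p->bridge)
			p->del_with_bridge++;
		else
			p->del_without_bridge++;
	}
	n_pending = 0;
}

/* Leave the bridge; flush_first models the flush done before the unlink. */
static void leave_bridge(struct port *p, bool flush_first)
{
	if (flush_first)
		flush_owq();
	p->bridge = NULL;
	flush_owq(); /* anything still queued runs after the unlink */
}
```

With `flush_first` set, all queued deletions are counted in `del_with_bridge`; without it, they run only after `bridge` has been cleared, which is the situation the commit wants to rule out.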
    • ifb: Depend on netfilter alternatively to tc · 046178e7
      Lukas Wunner authored
      IFB originally depended on NET_CLS_ACT for traffic redirection.
      But since v4.5, that may be achieved with NFT_FWD_NETDEV as well.
      
      Fixes: 39e6dea2 ("netfilter: nf_tables: add forward expression to the netdev family")
      Signed-off-by: Lukas Wunner <lukas@wunner.de>
      Cc: <stable@vger.kernel.org> # v4.5+: bcfabee1: netfilter: nft_fwd_netdev: allow to redirect to ifb via ingress
      Cc: <stable@vger.kernel.org> # v4.5+
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mctp: Implement extended addressing · 99ce45d5
      Jeremy Kerr authored
      This change allows an extended address struct - struct sockaddr_mctp_ext
      - to be passed to sendmsg/recvmsg. This allows userspace to specify
      output ifindex and physical address information (for sendmsg) or receive
      the input ifindex/physaddr for incoming messages (for recvmsg). This is
      typically used by userspace for MCTP address discovery and assignment
      operations.
      
      The extended addressing facility is conditional on a new sockopt:
      MCTP_OPT_ADDR_EXT; userspace must explicitly enable addressing before
      the kernel will consume/populate the extended address data.
      
      Includes a fix for an uninitialised var:
      Reported-by: kernel test robot <lkp@intel.com>
      Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: ax88796c: Remove pointless check in ax88796c_open() · 971f5c40
      Nathan Chancellor authored
      Clang warns:
      
      drivers/net/ethernet/asix/ax88796c_main.c:851:24: error: address of
      array 'ax_local->phydev->advertising' will always evaluate to 'true'
      [-Werror,-Wpointer-bool-conversion]
              if (ax_local->phydev->advertising &&
                  ~~~~~~~~~~~~~~~~~~^~~~~~~~~~~ ~~
      
      advertising cannot be NULL here if ax_local is not NULL, which cannot
      happen due to the check in ax88796c_probe(). Remove the check.
      
      Link: https://github.com/ClangBuiltLinux/linux/issues/1492
      Signed-off-by: Nathan Chancellor <nathan@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: ax88796c: Fix clang -Wimplicit-fallthrough in ax88796c_set_mac() · 3c554881
      Nathan Chancellor authored
      Clang warns:
      
      drivers/net/ethernet/asix/ax88796c_main.c:696:2: error: unannotated fall-through between switch labels [-Werror,-Wimplicit-fallthrough]
              case SPEED_10:
              ^
      drivers/net/ethernet/asix/ax88796c_main.c:696:2: note: insert 'break;' to avoid fall-through
              case SPEED_10:
              ^
              break;
      drivers/net/ethernet/asix/ax88796c_main.c:706:2: error: unannotated fall-through between switch labels [-Werror,-Wimplicit-fallthrough]
              case DUPLEX_HALF:
              ^
      drivers/net/ethernet/asix/ax88796c_main.c:706:2: note: insert 'break;' to avoid fall-through
              case DUPLEX_HALF:
              ^
              break;
      
      Clang is a little more pedantic than GCC, which permits implicit
      fallthroughs to cases that contain just break or return. Clang's version
      is more in line with the kernel's own stance in deprecated.rst, which
      states that all switch/case blocks must end in either break,
      fallthrough, continue, goto, or return. Add the missing breaks to fix
      the warning.
      
      Link: https://github.com/ClangBuiltLinux/linux/issues/1491
      Signed-off-by: Nathan Chancellor <nathan@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
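The rule quoted from deprecated.rst (every case ends in break, fallthrough, continue, goto, or return) can be shown with a small stand-alone function. This is a sketch only: `link_config()`, the enums and the `CFG_*` bits are hypothetical and are not taken from the ax88796c driver.

```c
#include <stdint.h>

/* Hypothetical register bits, for illustration only. */
#define CFG_SPEED_100   0x1u
#define CFG_FULL_DUPLEX 0x2u

enum link_speed { LINK_SPEED_10 = 10, LINK_SPEED_100 = 100 };
enum link_duplex { LINK_DUPLEX_HALF, LINK_DUPLEX_FULL };

static uint32_t link_config(enum link_speed speed, enum link_duplex duplex)
{
	uint32_t cfg = 0;

	switch (speed) {
	case LINK_SPEED_100:
		cfg |= CFG_SPEED_100;
		break; /* omitting this is what -Wimplicit-fallthrough flags */
	case LINK_SPEED_10:
		break;
	}

	switch (duplex) {
	case LINK_DUPLEX_FULL:
		cfg |= CFG_FULL_DUPLEX;
		break; /* explicit even though the next case only breaks */
	case LINK_DUPLEX_HALF:
		break;
	}

	return cfg;
}
```

Each case ends in an explicit `break`, so the function builds cleanly with the warning enabled under both GCC and Clang.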
    • net: mana: Allow setting the number of queues while the NIC is down · a137c069
      Haiyang Zhang authored
      The existing code doesn't allow setting the number of queues while the
      NIC is down.
      
      Update the ethtool handler functions to support setting the number of
      queues while the NIC is down.
      Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: hsr: Add support for redbox supervision frames · eafaa88b
      Andreas Oetken authored
      Add support for the redbox supervision frames
      as defined in IEC 62439-3:2018.
      Signed-off-by: Andreas Oetken <andreas.oetken@siemens-energy.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'tcp_stream_alloc_skb' · 3247e3ff
      David S. Miller authored
      Eric Dumazet says:
      
      ====================
      tcp: tcp_stream_alloc_skb() changes
      
      sk_stream_alloc_skb() is only used by TCP.
      
      Rename it to tcp_stream_alloc_skb() and apply small
      optimizations.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: remove unneeded code from tcp_stream_alloc_skb() · c4322884
      Eric Dumazet authored
      Aligning @size argument to 4 bytes is not needed.
      
      The header alignment has nothing to do with @size.
      
      It really depends on skb->head alignment and MAX_TCP_HEADER.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: use MAX_TCP_HEADER in tcp_stream_alloc_skb · 8a794df6
      Eric Dumazet authored
      Both IPv4 and IPv6 use the same reserve, so there is no need to risk
      cache line misses to fetch its value.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: rename sk_stream_alloc_skb · f8dd3b8d
      Eric Dumazet authored
      sk_stream_alloc_skb() is only used by TCP.
      
      Rename it to make this clear, and move its declaration
      to include/net/tcp.h
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: annotate data-race in neigh_output() · d18785e2
      Eric Dumazet authored
      neigh_output() reads n->nud_state and hh->hh_len locklessly.
      
      This is fine, but we need to add annotations and document this.
      
      We evaluate skip_cache first to avoid reading these fields
      if the cache has to be bypassed.
      
      syzbot report:
      
      BUG: KCSAN: data-race in __neigh_event_send / ip_finish_output2
      
      write to 0xffff88810798a885 of 1 bytes by interrupt on cpu 1:
       __neigh_event_send+0x40d/0xac0 net/core/neighbour.c:1128
       neigh_event_send include/net/neighbour.h:444 [inline]
       neigh_resolve_output+0x104/0x410 net/core/neighbour.c:1476
       neigh_output include/net/neighbour.h:510 [inline]
       ip_finish_output2+0x80a/0xaa0 net/ipv4/ip_output.c:221
       ip_finish_output+0x3b5/0x510 net/ipv4/ip_output.c:309
       NF_HOOK_COND include/linux/netfilter.h:296 [inline]
       ip_output+0xf3/0x1a0 net/ipv4/ip_output.c:423
       dst_output include/net/dst.h:450 [inline]
       ip_local_out+0x164/0x220 net/ipv4/ip_output.c:126
       __ip_queue_xmit+0x9d3/0xa20 net/ipv4/ip_output.c:525
       ip_queue_xmit+0x34/0x40 net/ipv4/ip_output.c:539
       __tcp_transmit_skb+0x142a/0x1a00 net/ipv4/tcp_output.c:1405
       tcp_transmit_skb net/ipv4/tcp_output.c:1423 [inline]
       tcp_xmit_probe_skb net/ipv4/tcp_output.c:4011 [inline]
       tcp_write_wakeup+0x4a9/0x810 net/ipv4/tcp_output.c:4064
       tcp_send_probe0+0x2c/0x2b0 net/ipv4/tcp_output.c:4079
       tcp_probe_timer net/ipv4/tcp_timer.c:398 [inline]
       tcp_write_timer_handler+0x394/0x520 net/ipv4/tcp_timer.c:626
       tcp_write_timer+0xb9/0x180 net/ipv4/tcp_timer.c:642
       call_timer_fn+0x2e/0x1d0 kernel/time/timer.c:1421
       expire_timers+0x135/0x240 kernel/time/timer.c:1466
       __run_timers+0x368/0x430 kernel/time/timer.c:1734
       run_timer_softirq+0x19/0x30 kernel/time/timer.c:1747
       __do_softirq+0x12c/0x26e kernel/softirq.c:558
       invoke_softirq kernel/softirq.c:432 [inline]
       __irq_exit_rcu kernel/softirq.c:636 [inline]
       irq_exit_rcu+0x4e/0xa0 kernel/softirq.c:648
       sysvec_apic_timer_interrupt+0x69/0x80 arch/x86/kernel/apic/apic.c:1097
       asm_sysvec_apic_timer_interrupt+0x12/0x20
       native_safe_halt arch/x86/include/asm/irqflags.h:51 [inline]
       arch_safe_halt arch/x86/include/asm/irqflags.h:89 [inline]
       acpi_safe_halt drivers/acpi/processor_idle.c:109 [inline]
       acpi_idle_do_entry drivers/acpi/processor_idle.c:553 [inline]
       acpi_idle_enter+0x258/0x2e0 drivers/acpi/processor_idle.c:688
       cpuidle_enter_state+0x2b4/0x760 drivers/cpuidle/cpuidle.c:237
       cpuidle_enter+0x3c/0x60 drivers/cpuidle/cpuidle.c:351
       call_cpuidle kernel/sched/idle.c:158 [inline]
       cpuidle_idle_call kernel/sched/idle.c:239 [inline]
       do_idle+0x1a3/0x250 kernel/sched/idle.c:306
       cpu_startup_entry+0x15/0x20 kernel/sched/idle.c:403
       secondary_startup_64_no_verify+0xb1/0xbb
      
      read to 0xffff88810798a885 of 1 bytes by interrupt on cpu 0:
       neigh_output include/net/neighbour.h:507 [inline]
       ip_finish_output2+0x79a/0xaa0 net/ipv4/ip_output.c:221
       ip_finish_output+0x3b5/0x510 net/ipv4/ip_output.c:309
       NF_HOOK_COND include/linux/netfilter.h:296 [inline]
       ip_output+0xf3/0x1a0 net/ipv4/ip_output.c:423
       dst_output include/net/dst.h:450 [inline]
       ip_local_out+0x164/0x220 net/ipv4/ip_output.c:126
       __ip_queue_xmit+0x9d3/0xa20 net/ipv4/ip_output.c:525
       ip_queue_xmit+0x34/0x40 net/ipv4/ip_output.c:539
       __tcp_transmit_skb+0x142a/0x1a00 net/ipv4/tcp_output.c:1405
       tcp_transmit_skb net/ipv4/tcp_output.c:1423 [inline]
       tcp_xmit_probe_skb net/ipv4/tcp_output.c:4011 [inline]
       tcp_write_wakeup+0x4a9/0x810 net/ipv4/tcp_output.c:4064
       tcp_send_probe0+0x2c/0x2b0 net/ipv4/tcp_output.c:4079
       tcp_probe_timer net/ipv4/tcp_timer.c:398 [inline]
       tcp_write_timer_handler+0x394/0x520 net/ipv4/tcp_timer.c:626
       tcp_write_timer+0xb9/0x180 net/ipv4/tcp_timer.c:642
       call_timer_fn+0x2e/0x1d0 kernel/time/timer.c:1421
       expire_timers+0x135/0x240 kernel/time/timer.c:1466
       __run_timers+0x368/0x430 kernel/time/timer.c:1734
       run_timer_softirq+0x19/0x30 kernel/time/timer.c:1747
       __do_softirq+0x12c/0x26e kernel/softirq.c:558
       invoke_softirq kernel/softirq.c:432 [inline]
       __irq_exit_rcu kernel/softirq.c:636 [inline]
       irq_exit_rcu+0x4e/0xa0 kernel/softirq.c:648
       sysvec_apic_timer_interrupt+0x69/0x80 arch/x86/kernel/apic/apic.c:1097
       asm_sysvec_apic_timer_interrupt+0x12/0x20
       native_safe_halt arch/x86/include/asm/irqflags.h:51 [inline]
       arch_safe_halt arch/x86/include/asm/irqflags.h:89 [inline]
       acpi_safe_halt drivers/acpi/processor_idle.c:109 [inline]
       acpi_idle_do_entry drivers/acpi/processor_idle.c:553 [inline]
       acpi_idle_enter+0x258/0x2e0 drivers/acpi/processor_idle.c:688
       cpuidle_enter_state+0x2b4/0x760 drivers/cpuidle/cpuidle.c:237
       cpuidle_enter+0x3c/0x60 drivers/cpuidle/cpuidle.c:351
       call_cpuidle kernel/sched/idle.c:158 [inline]
       cpuidle_idle_call kernel/sched/idle.c:239 [inline]
       do_idle+0x1a3/0x250 kernel/sched/idle.c:306
       cpu_startup_entry+0x15/0x20 kernel/sched/idle.c:403
       rest_init+0xee/0x100 init/main.c:734
       arch_call_rest_init+0xa/0xb
       start_kernel+0x5e4/0x669 init/main.c:1142
       secondary_startup_64_no_verify+0xb1/0xbb
      
      value changed: 0x20 -> 0x01
      
      Reported by Kernel Concurrency Sanitizer on:
      CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.15.0-rc6-syzkaller #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
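The annotation pattern this commit describes (lockless reads marked explicitly, with skip_cache tested first) can be sketched in user space. Everything here is an assumption made for illustration: `READ_ONCE()` is a volatile-cast approximation rather than the kernel macro, and `struct neigh_model`, `STATE_CONNECTED` and `use_cached_header()` are hypothetical names, not the contents of include/net/neighbour.h.

```c
#include <stdbool.h>
#include <stdint.h>

/* User-space approximation of READ_ONCE(); not the kernel definition. */
#define READ_ONCE(x) (*(const volatile __typeof__(x) *)&(x))

/* Hypothetical mask of states in which the cached header may be used. */
#define STATE_CONNECTED 0xc2u

struct neigh_model {
	uint8_t nud_state; /* may be written concurrently by another CPU */
	uint32_t hh_len;   /* cached hardware header length, 0 if none */
};

/* Returns true when the cached-header fast path may be taken. */
static bool use_cached_header(struct neigh_model *n, bool skip_cache)
{
	/*
	 * skip_cache is tested first, so the shared fields are not read at
	 * all when the cache must be bypassed.
	 */
	if (!skip_cache &&
	    (READ_ONCE(n->nud_state) & STATE_CONNECTED) &&
	    READ_ONCE(n->hh_len))
		return true;
	return false;
}
```

The marked reads document that the race is intentional and keep the compiler from tearing or re-reading the loads; they do not add any locking.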
    • Merge branch 'mlxsw-rif-mac-prefixes' · 72b93a86
      David S. Miller authored
      Ido Schimmel says:
      
      ====================
      mlxsw: Support multiple RIF MAC prefixes
      
      Currently, mlxsw enforces that all the netdevs used as router interfaces
      (RIFs) have the same MAC prefix (e.g., same 38 MSBs in Spectrum-1).
      Otherwise, an error is returned to user space with extack. This patchset
      relaxes the limitation through the use of RIF MAC profiles.
      
      A RIF MAC profile is a hardware entity that represents a particular MAC
      prefix which multiple RIFs can reference. Therefore, the number of
      possible MAC prefixes is no longer one, but the number of profiles
      supported by the device.
      
      The ability to change the MAC of a particular netdev is useful, for
      example, for users who use the netdev to connect to an upstream provider
      that performs MAC filtering. Currently, such users are either forced to
      negotiate with the provider or change the MAC address of all other
      netdevs so that they share the same prefix.
      
      Patchset overview:
      
      Patches #1-#3 are preparations.
      
      Patch #4 adds actual support for RIF MAC profiles.
      
      Patch #5 exposes RIF MAC profiles as a devlink resource, so that user
      space has visibility into the maximum number of profiles and current
      occupancy. Useful for debugging and testing (next 3 patches).
      
      Patches #6-#8 add both scale and functional tests.
      
      Patch #9 removes tests that validated the previous limitation. It is now
      covered by patch #6 for devices that support a single profile.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • selftests: mlxsw: Remove deprecated test cases · c24dbf3d
      Danielle Ratson authored
      After adding the previous patches, the constraint that all the router
      interface MAC addresses have the same prefix is no longer relevant.
      
      Remove the test cases that validated that this constraint is honored.
      Signed-off-by: Danielle Ratson <danieller@nvidia.com>
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • selftests: Add an occupancy test for RIF MAC profiles · 20d446db
      Danielle Ratson authored
      When all the RIF MAC profiles are in use, test that it is possible to
      change the MAC of a netdev (i.e., a RIF) when its MAC profile is not
      shared with other RIFs. Test that replacement fails when the MAC profile
      is shared.
      Signed-off-by: Danielle Ratson <danieller@nvidia.com>
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • selftests: mlxsw: Add forwarding test for RIF MAC profiles · a10b7bac
      Danielle Ratson authored
      Verify that MAC profile changes are indeed applied and that packets are
      forwarded with the correct source MAC.
      
      Output example:
      
      $ ./rif_mac_profiles.sh
      TEST: h1->h2: new mac profile                                       [ OK ]
      TEST: h2->h1: new mac profile                                       [ OK ]
      TEST: h1->h2: edit mac profile                                      [ OK ]
      TEST: h2->h1: edit mac profile                                      [ OK ]
      Signed-off-by: Danielle Ratson <danieller@nvidia.com>
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • selftests: mlxsw: Add a scale test for RIF MAC profiles · 152f98e7
      Danielle Ratson authored
      Query the maximum number of supported RIF MAC profiles using
      devlink-resource and verify that all available MAC profiles can be utilized
      and that an error is generated when user space tries to exceed this number.
      
      Output example in Spectrum-2:
      
      $ TESTS='rif_mac_profile' ./resource_scale.sh
      TEST: 'rif_mac_profile' 4                                           [ OK ]
      TEST: 'rif_mac_profile' overflow 5                                  [ OK ]
      Signed-off-by: Danielle Ratson <danieller@nvidia.com>
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mlxsw: spectrum_router: Expose RIF MAC profiles to devlink resource · 1c375ffb
      Danielle Ratson authored
      Expose via devlink-resource the maximum number of RIF MAC profiles and
      their current occupancy, so that it can be used for debugging and for
      writing generic tests, like in the next patch.
      
      Example for Spectrum-2 output:
      
      $ devlink resource show pci/0000:06:00.0
      ...
        name rif_mac_profiles size 4 occ 0 unit entry dpipe_tables none
      Signed-off-by: Danielle Ratson <danieller@nvidia.com>
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mlxsw: spectrum_router: Add RIF MAC profiles support · 605d25cd
      Danielle Ratson authored
      Currently, mlxsw enforces that all the router interfaces (RIFs) have the
      same MAC prefix.
      
      Relax this limitation by using RIF MAC profiles. Each profile is
      associated with a particular MAC prefix and multiple RIFs can use the
      same profile. Therefore, the number of possible MAC prefixes is no
      longer one, but the number of profiles supported by the device.
      
      Store the profiles in an IDR and reference count them according to the
      number of RIFs using them.
      
      Associate a RIF with a profile when the RIF is created and remove the
      association when the RIF is deleted.
      
      Change the association following 'NETDEV_CHANGEADDR' events, except when
      only one RIF is using the profile, in which case change the MAC prefix
      of the profile itself instead of associating the RIF with a new profile.
      Signed-off-by: Danielle Ratson <danieller@nvidia.com>
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
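The reference-counting scheme described in this commit message can be sketched as a small user-space model. It is not the mlxsw implementation: a fixed array stands in for the IDR, `MAX_PROFILES` is a hypothetical device limit, the 38-bit prefix follows the Spectrum-1 example in the cover letter, and all function names are invented for illustration. The in-place edit is additionally assumed to apply only when no profile for the new prefix already exists.

```c
#include <stdint.h>

#define MAX_PROFILES 4  /* hypothetical device limit */
#define PREFIX_SHIFT 10 /* 48-bit MAC, the 38 MSBs form the prefix */

struct mac_profile {
	uint64_t prefix;
	unsigned int refcount; /* number of RIFs using this profile */
};

struct rif {
	int profile; /* index into profiles[] */
};

static struct mac_profile profiles[MAX_PROFILES];

static int profile_find(uint64_t prefix)
{
	for (int i = 0; i < MAX_PROFILES; i++)
		if (profiles[i].refcount && profiles[i].prefix == prefix)
			return i;
	return -1;
}

/* Take a reference on the profile for this prefix, creating it if needed. */
static int profile_get(uint64_t prefix)
{
	int i = profile_find(prefix);

	if (i < 0) {
		for (i = 0; i < MAX_PROFILES; i++)
			if (!profiles[i].refcount)
				break;
		if (i == MAX_PROFILES)
			return -1; /* all profiles are in use */
		profiles[i].prefix = prefix;
	}
	profiles[i].refcount++;
	return i;
}

static void profile_put(int i)
{
	profiles[i].refcount--;
}

/* Associate a RIF with a profile when the RIF is created. */
static int rif_create(struct rif *r, uint64_t mac)
{
	r->profile = profile_get(mac >> PREFIX_SHIFT);
	return r->profile < 0 ? -1 : 0;
}

/* Remove the association when the RIF is deleted. */
static void rif_destroy(struct rif *r)
{
	profile_put(r->profile);
}

/* Model of handling a MAC address change on an existing RIF. */
static int rif_change_mac(struct rif *r, uint64_t mac)
{
	uint64_t prefix = mac >> PREFIX_SHIFT;
	struct mac_profile *old = &profiles[r->profile];
	int i;

	if (old->prefix == prefix)
		return 0;
	/* Sole user (and, by assumption, no profile for the new prefix). */
	if (old->refcount == 1 && profile_find(prefix) < 0) {
		old->prefix = prefix; /* edit the profile in place */
		return 0;
	}
	i = profile_get(prefix);
	if (i < 0)
		return -1;
	profile_put(r->profile);
	r->profile = i;
	return 0;
}
```

In this model, a RIF whose profile is not shared can always change its prefix, even when every profile is occupied, while a RIF on a shared profile needs a free slot and fails otherwise.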
    • mlxsw: spectrum_router: Propagate extack further · 26029225
      Danielle Ratson authored
      The next patch will set the MAC profile of a router interface (RIF) as
      part of its configure() callback. The operation can fail in case the
      maximum number of profiles was exceeded.
      
      Add extack to mlxsw_sp_rif_ops::configure() in order to communicate such
      failures to user space.
      
      In addition, the MAC profile of a RIF can change following a
      'NETDEV_CHANGEADDR' notification. Propagate extack to
      mlxsw_sp_router_port_change_event() so that failures could be
      communicated in this path as well.
      
      No functional changes intended.
      Signed-off-by: Danielle Ratson <danieller@nvidia.com>
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mlxsw: resources: Add resource identifier for RIF MAC profiles · a8428e50
      Danielle Ratson authored
      Add a resource identifier for maximum RIF MAC profiles so that it could
      be later used to query the information from firmware.
      Signed-off-by: Danielle Ratson <danieller@nvidia.com>
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mlxsw: reg: Add MAC profile ID field to RITR register · d25d7fc3
      Danielle Ratson authored
      Add MAC profile ID field to RITR register so that it could be used for
      associating a RIF with a MAC profile ID by a later patch.
      Signed-off-by: Danielle Ratson <danieller@nvidia.com>
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'netfilter-vrf-rework' · be348926
      David S. Miller authored
      Florian Westphal says:
      
      ====================
      vrf: rework interaction with netfilter/conntrack
      
      V2:
      - fix 'plain integer as null pointer' warning
      - reword commit message in patch 2 to clarify loss of 'ct set untracked'
      
      This patch series aims to solve the to-be-reverted change 09e856d5
      ("vrf: Reset skb conntrack connection on VRF rcv") in a different way.
      
      Rather than have skbs pass through conntrack and nat hooks twice, suppress
      conntrack invocation if the conntrack/nat hook is called from the vrf driver.
      
      First patch deals with 'incoming connection' case:
      1. suppress NAT transformations
      2. skip conntrack confirmation
      
      NAT and conntrack confirmation is done when ip/ipv6 stack calls
      the postrouting hook.
      
      Second patch deals with local packets:
      in vrf driver, mark the skbs as 'untracked', so conntrack output
      hook ignores them.  This skips all nat hooks as well.
      
      Afterwards, remove the untracked state again so the second
      round will pick them up.
      
      One alternative to the chosen implementation would be to add a 'caller
      id' field to 'struct nf_hook_state' and then use that; these patches
      use the more straightforward check of the VRF flag on the state->out device.
      
      The two patches apply to both net and net-next. I am targeting -next
      because SNAT did not work correctly for so long that I think
      we can take the longer route.  If you disagree, apply to net at your
      discretion.
      
      The patches apply both with 09e856d5 reverted or still
      in-place, but only with the revert in place ingress conntrack settings
      (zone, notrack etc) start working again.
      
      I've already submitted selftests for vrf+nfqueue and conntrack+vrf.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • vrf: run conntrack only in context of lower/physdev for locally generated packets · 8c9c296a
      Florian Westphal authored
      The VRF driver invokes netfilter for output+postrouting hooks so that users
      can create rules that check for 'oif $vrf' rather than lower device name.
      
      This is a problem when NAT rules are configured.
      
      To avoid any conntrack involvement in round 1, tag skbs as 'untracked'
      to prevent conntrack from picking them up.
      
      This gets cleared before the packet gets handed to the ip stack so
      conntrack will be active on the second iteration.
      
      One remaining issue is that a rule like
      
        output ... oif $vrfname notrack
      
      won't propagate to the second round because we can't tell
      'notrack set via ruleset' and 'notrack set by vrf driver' apart.
      However, this isn't a regression: the 'notrack' removal happens
      instead of unconditional nf_reset_ct().
      I'd also like to avoid leaking more vrf specific conditionals into the
      netfilter infra.
      
      For ingress, conntrack has already been done before the packet makes it
      to the vrf driver; with this patch, egress does connection tracking with
      the lower/physical device as well.
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Acked-by: David Ahern <dsahern@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • netfilter: conntrack: skip confirmation and nat hooks in postrouting for vrf · 8e0538d8
      Florian Westphal authored
      The VRF driver invokes netfilter for output+postrouting hooks so that users
      can create rules that check for 'oif $vrf' rather than lower device name.
      
      Afterwards, ip stack calls those hooks again.
      
      This is a problem when conntrack is used with IP masquerading.
      Masquerading has an internal check that re-validates the output
      interface to account for route changes.
      
      This check will trigger in the vrf case.
      
      If the -j MASQUERADE rule matched on the first iteration, then round 2
      finds state->out->ifindex != nat->masq_index: the latter is the vrf
      index, but out->ifindex is the lower device.
      
      The packet gets dropped and the conntrack entry is invalidated.
      
      This change makes conntrack postrouting skip the nat hooks.
      Also skip confirmation.  This allows the second round
      (postrouting invocation from ipv4/ipv6) to create nat bindings.
      
      This also prevents the second round from seeing packets that had their
      source address changed by the nat hook.
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
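The two-pass failure described in this commit message can be reproduced with a toy model. The sketch below is an illustration under stated assumptions, not netfilter code: `struct conn`, `postrouting()` and `send_via_vrf()` are hypothetical, and the only behaviour taken from the text is that masquerading records the output ifindex when it first matches and drops the packet if a later pass sees a different one.

```c
#include <stdbool.h>

/* Hypothetical per-connection NAT state; 0 means no masquerade binding. */
struct conn {
	int masq_index;
};

enum verdict { VERDICT_ACCEPT, VERDICT_DROP };

/* Model of one postrouting pass with a matching masquerade rule. */
static enum verdict postrouting(struct conn *ct, int out_ifindex, bool run_nat)
{
	if (!run_nat)
		return VERDICT_ACCEPT; /* nat hooks skipped for this pass */
	if (ct->masq_index == 0) {
		ct->masq_index = out_ifindex; /* first match creates the binding */
		return VERDICT_ACCEPT;
	}
	/* Re-validation: the output interface must still be the bound one. */
	return ct->masq_index == out_ifindex ? VERDICT_ACCEPT : VERDICT_DROP;
}

/*
 * Pass 1 is issued by the VRF driver (out = VRF device), pass 2 by the
 * IP stack (out = lower device).
 */
static enum verdict send_via_vrf(struct conn *ct, int vrf_ifindex,
				 int lower_ifindex, bool skip_nat_in_vrf)
{
	if (postrouting(ct, vrf_ifindex, !skip_nat_in_vrf) == VERDICT_DROP)
		return VERDICT_DROP;
	return postrouting(ct, lower_ifindex, true);
}
```

When the first pass runs the nat hook, the binding holds the VRF ifindex and the second pass drops the packet; when the first pass skips it, the binding is created against the lower device and the packet is accepted.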
    • Merge tag 'mlx5-updates-2021-10-25' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux · 4900a769
      David S. Miller authored
      Saeed Mahameed says:
      
      ====================
      mlx5-updates-2021-10-25
      
      Misc updates for mlx5 driver:
      
      1) Misc updates and cleanups:
       - Don't write directly to netdev->dev_addr, From Jakub Kicinski
       - Remove unnecessary checks for slow path flag in tc module
       - Fix unused function warning of mlx5i_flow_type_mask
       - Bridge, support replacing existing FDB entry
      
      2) Sub Functions, Reduction in memory usage:
       - Reduce flow counters bulk query buffer size
       - Implement max_macs devlink parameter
       - Add devlink vendor params to control Event Queue sizes
       - Added SF life cycle trace points by Parav
      
      3) From Aya, Firmware health buffer reporting improvements
       - Print health buffer by log level and more missing information
       - Periodic update of host time to firmware
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: don't free a FIN sk_buff in tcp_remove_empty_skb() · cf12e6f9
      Jon Maxwell authored
      v1: Implement a more general statement as recommended by Eric Dumazet. The
      sequence number will be advanced, so this check will fix the FIN case and
      other cases.
      
      A customer reported sockets stuck in the CLOSING state. A vmcore revealed that
      the write_queue was not empty as determined by tcp_write_queue_empty() but the
      sk_buff containing the FIN flag had been freed and the socket was zombied in
      that state. Corresponding pcaps show no FIN from the Linux kernel on the wire.
      
      Some instrumentation was added to the kernel and it was found that there is a
      timing window where tcp_sendmsg() can run after tcp_send_fin().
      
      tcp_sendmsg() will hit an error, for example:
      
              if (sk->sk_err || (sk->sk_shutdown & SEND_SHUTDOWN))
                      goto do_error;
      
      tcp_remove_empty_skb() will then free the FIN sk_buff as "skb->len == 0". The
      TCP socket is now wedged in the FIN-WAIT-1 state because the FIN is never sent.
      
      If the other side sends a FIN packet the socket will transition to CLOSING and
      remain that way until the system is rebooted.
      
      Fix this by checking for the FIN flag in the sk_buff and not freeing it if the
      flag is set. Testing confirmed that this fixes the issue.
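
The difference between the length-based test and the more general sequence-based test mentioned in the v1 note can be sketched in userspace. This is a minimal model, not the kernel code: `struct model_skb`, `make_fin()`, `removable_by_len()` and `removable_by_seq()` are hypothetical names, and the exact condition used in the kernel is an assumption here. It relies only on the fact that a FIN carries no payload but still consumes one sequence number.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for an sk_buff and its TCP control block. */
struct model_skb {
	uint32_t len;     /* payload bytes */
	uint32_t seq;     /* first sequence number covered */
	uint32_t end_seq; /* one past the last sequence number covered */
};

/* Build a FIN-only segment: no payload, but one sequence number consumed. */
static struct model_skb make_fin(uint32_t seq)
{
	struct model_skb skb = { .len = 0, .seq = seq, .end_seq = seq + 1 };
	return skb;
}

/* Length-based test: every zero-length skb looks removable, including a FIN. */
static bool removable_by_len(const struct model_skb *skb)
{
	assert(skb);
	return skb->len == 0;
}

/* Sequence-based test: removable only if the skb covers no sequence space. */
static bool removable_by_seq(const struct model_skb *skb)
{
	assert(skb);
	return skb->seq == skb->end_seq;
}
```

In this model a FIN-only segment passes the length-based test (and would be freed) but fails the sequence-based test (and is kept), while a truly empty segment is removable under both.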
      
      Fixes: fdfc5c85 ("tcp: remove empty skb from write queue in error cases")
      Signed-off-by: Jon Maxwell <jmaxwell37@gmail.com>
      Reported-by: Monir Zouaoui <Monir.Zouaoui@mail.schwarz>
      Reported-by: Simon Stier <simon.stier@mail.schwarz>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      cf12e6f9
    • Jakub Kicinski's avatar
      Merge branch 'small-fixes-for-true-expression-checks' · 36d935a0
      Jakub Kicinski authored
      Jean Sacren says:
      
      ====================
      Small fixes for true expression checks
      
      This series fixes checks of always-true !rc expressions.
      ====================
      
      Link: https://lore.kernel.org/r/cover.1634974124.git.sakiwit@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      36d935a0
    • Jean Sacren's avatar
      net: qed_dev: fix check of true !rc expression · 036f590f
      Jean Sacren authored
      Remove the check of !rc in (!rc && !resc_lock_params.b_granted) since it
      is always true.
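
Why such a check is redundant can be shown with a small sketch. It assumes, as the message implies but does not spell out, that an earlier branch already returns on any non-zero rc. `struct lock_params`, `request_lock()` and `acquire()` are hypothetical stand-ins, not the qed driver's functions.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for the resource lock parameters. */
struct lock_params {
	bool b_granted;
};

/* Hypothetical helper: records the grant result and returns the given status. */
static int request_lock(struct lock_params *p, int rc, bool granted)
{
	p->b_granted = granted;
	return rc;
}

static int acquire(int simulated_rc, bool simulated_grant)
{
	struct lock_params params;
	int rc = request_lock(&params, simulated_rc, simulated_grant);

	if (rc)
		return rc; /* every non-zero rc leaves here */

	/* Invariant: rc == 0, so writing "!rc && !params.b_granted" adds nothing. */
	assert(rc == 0);
	if (!params.b_granted)
		return -EBUSY;

	return 0;
}
```

Once the early return has filtered out failures, only the grant flag still needs testing.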
      Signed-off-by: Jean Sacren <sakiwit@gmail.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      036f590f
    • Jean Sacren's avatar
      net: qed_ptp: fix check of true !rc expression · 165f8e82
      Jean Sacren authored
      Remove the check of !rc in (!rc && !params.b_granted) since it is always
      true.
      
      We should also use constant 0 for return.
      Signed-off-by: Jean Sacren <sakiwit@gmail.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      165f8e82
    • Jakub Kicinski's avatar
      Merge branch 'tcp-receive-path-optimizations' · e43b76ab
      Jakub Kicinski authored
      Eric Dumazet says:
      
      ====================
      tcp: receive path optimizations
      
      This series aims to reduce cache line misses in RX path.
      
      I am still working on better cache locality in tcp_sock, but
      that will wait a few more weeks.
      ====================
      
      Link: https://lore.kernel.org/r/20211025164825.259415-1-eric.dumazet@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      e43b76ab
    • Eric Dumazet's avatar
      ipv6/tcp: small drop monitor changes · 12c8691d
      Eric Dumazet authored
      Two kfree_skb() calls must be replaced by consume_skb()
      for skbs that are not technically dropped.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      12c8691d
    • Eric Dumazet's avatar
      ipv4: guard IP_MINTTL with a static key · 020e71a3
      Eric Dumazet authored
      RFC 5082 IP_MINTTL option is rarely used on hosts.
      
      Add a static key to remove useless code from the TCP fast path,
      along with a potential cache line miss to fetch inet_sk(sk)->min_ttl.
      
      Note that once ip4_min_ttl static key has been enabled,
      it stays enabled until next boot.
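
The behaviour described above can be modelled in userspace with a plain one-way flag. This is only a sketch of the semantics: a real static key patches the branch in the kernel text rather than testing a variable, and `min_ttl_in_use`, `struct model_sock`, `set_min_ttl()` and `rx_accept()` are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Plain flag standing in for the static key; it is set once and never cleared. */
static bool min_ttl_in_use;

/* Hypothetical socket model holding only the per-socket minimum TTL. */
struct model_sock {
	uint8_t min_ttl;
};

/* Setter path: the first non-zero minimum TTL turns the guard on for good. */
static void set_min_ttl(struct model_sock *sk, uint8_t ttl)
{
	assert(sk);
	if (ttl && !min_ttl_in_use)
		min_ttl_in_use = true;
	sk->min_ttl = ttl;
}

/* Receive path: returns true if a packet with this TTL is accepted. */
static bool rx_accept(const struct model_sock *sk, uint8_t pkt_ttl)
{
	assert(sk);
	if (!min_ttl_in_use)
		return true; /* fast path: sk->min_ttl is never read */
	return pkt_ttl >= sk->min_ttl;
}
```

Until some socket sets a non-zero minimum TTL, the receive path never touches the per-socket field. After that, every socket pays for the check, including those that left the option at zero.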
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      020e71a3
    • Eric Dumazet's avatar
      ipv4: annotate data races around inet->min_ttl · 14834c4f
      Eric Dumazet authored
      No report yet from KCSAN, yet worth documenting the races.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      14834c4f
    • Eric Dumazet's avatar
      ipv6: guard IPV6_MINHOPCOUNT with a static key · 790eb673
      Eric Dumazet authored
      RFC 5082 IPV6_MINHOPCOUNT is rarely used on hosts.
      
      Add a static key to remove useless code from the TCP fast path,
      along with a potential cache line miss to fetch tcp_inet6_sk(sk)->min_hopcount.
      
      Note that once ip6_min_hopcount static key has been enabled,
      it stays enabled until next boot.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      790eb673
    • Eric Dumazet's avatar
      ipv6: annotate data races around np->min_hopcount · cc17c3c8
      Eric Dumazet authored
      No report yet from KCSAN, yet worth documenting the races.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      cc17c3c8
    • Eric Dumazet's avatar
      net: annotate accesses to sk->sk_rx_queue_mapping · 09b89846
      Eric Dumazet authored
      sk->sk_rx_queue_mapping can be modified locklessly,
      add a couple of READ_ONCE()/WRITE_ONCE() to document this fact.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      09b89846
    • Eric Dumazet's avatar
      net: avoid dirtying sk->sk_rx_queue_mapping · 342159ee
      Eric Dumazet authored
      sk_rx_queue_mapping is located in a cache line that should be kept read mostly.
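
One common way to keep such a cache line read mostly is to compare before storing, so that an unchanged value never triggers a write. The sketch below assumes that approach; the message itself does not say how the change is made. The macros are userspace approximations of READ_ONCE()/WRITE_ONCE(), and `struct model_sock` and `set_rx_queue()` are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Userspace approximations of READ_ONCE()/WRITE_ONCE() for a 16-bit field. */
#define READ_ONCE_U16(x)     (*(const volatile uint16_t *)&(x))
#define WRITE_ONCE_U16(x, v) (*(volatile uint16_t *)&(x) = (v))

/* Hypothetical socket model holding only the RX queue mapping. */
struct model_sock {
	uint16_t rx_queue_mapping;
};

/* Returns true if a store was actually issued. */
static bool set_rx_queue(struct model_sock *sk, uint16_t queue)
{
	assert(sk);
	if (READ_ONCE_U16(sk->rx_queue_mapping) == queue)
		return false; /* unchanged: the cache line stays clean */
	WRITE_ONCE_U16(sk->rx_queue_mapping, queue);
	return true;
}
```

When packets for a flow keep arriving on the same queue, the common case is a read and a compare, and the line is only dirtied when the mapping really changes.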
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      342159ee
    • Eric Dumazet's avatar
      net: avoid dirtying sk->sk_napi_id · 2b13af8a
      Eric Dumazet authored
      sk_napi_id is located in a cache line that can be kept read mostly.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      2b13af8a
    • Eric Dumazet's avatar
      ipv6: move inet6_sk(sk)->rx_dst_cookie to sk->sk_rx_dst_cookie · ef57c161
      Eric Dumazet authored
      Increase cache locality by moving rx_dst_cookie next to sk->sk_rx_dst.
      
      This removes one or two cache line misses in IPv6 early demux (TCP/UDP).
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      ef57c161