1. 16 Apr, 2024 11 commits
    • ice: tc: allow zero flags in parsing tc flower · 73278715
      Michal Swiatkowski authored
      The flags check exists to avoid passing empty lookups to the functions
      that add switch rules. Since metadata is always added to the lookups,
      there is no need to check the flags.
      
      It also fixes a problem with a rule such as:
      $ tc filter add dev gtp_dev ingress protocol ip prio 0 flower \
      	enc_dst_port 2123 action drop
      In the GTP case the switch block can't parse the destination port,
      because it should always be set to the GTP-specific value. The same
      applies to the ethertype. As a result there are no matching criteria
      other than the GTP tunnel. In this case flags is 0, and the rule can't
      be added only because of the defensive check against flags.
      
      Fixes: 9a225f81 ("ice: Support GTP-U and GTP-C offload in switchdev")
      Reviewed-by: Wojciech Drewek <wojciech.drewek@intel.com>
      Signed-off-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
      Reviewed-by: Simon Horman <horms@kernel.org>
      Tested-by: Sujai Buvaneswaran <sujai.buvaneswaran@intel.com>
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
    • ice: tc: check src_vsi in case of traffic from VF · 42805160
      Michal Swiatkowski authored
      For traffic coming from a VF (that is, ingress on the port representor),
      the source VSI should be considered during packet classification. This
      is needed so that the hardware does not match packets from one port
      against filters added on another port.
      
      This applies only to "from VF" traffic, because the other traffic
      direction doesn't have a source VSI.
      
      Set the correct ::src_vsi in rule_info to pass it to the hardware filter.
      
      For example, this rule should drop only IPv4 packets from eth10, not
      from the other VF PRs, so the source VSI has to be checked in this case:
      $ tc filter add dev eth10 ingress protocol ip flower skip_sw action drop
      
      Fixes: 0d08a441 ("ice: ndo_setup_tc implementation for PF")
      Reviewed-by: Jedrzej Jagielski <jedrzej.jagielski@intel.com>
      Reviewed-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
      Signed-off-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
      Reviewed-by: Simon Horman <horms@kernel.org>
      Tested-by: Sujai Buvaneswaran <sujai.buvaneswaran@intel.com>
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
    • Merge branch 'net-stmmac-fix-mac-capabilities-procedure' · e226eade
      Paolo Abeni authored
      Serge Semin says:
      
      ====================
      net: stmmac: Fix MAC-capabilities procedure
      
      This series grew out of the discussions around Yanteng's recent series
      adding support for the Loongson LS7A1000, LS2K1000, LS7A2000 and
      LS2K2000 MACs:
      Link: https://lore.kernel.org/netdev/fu3f6uoakylnb6eijllakeu5i4okcyqq7sfafhp5efaocbsrwe@w74xe7gb6x7p
      
      In particular, Yanteng's patchset needed to implement the Loongson
      MAC-specific constraints on the link speed and link duplex mode.
      The discussion with Russell resulted in the following preliminary
      patch:
      Link: https://lore.kernel.org/netdev/df31e8bcf74b3b4ddb7ddf5a1c371390f16a2ad5.1712917541.git.siyanteng@loongson.cn
      
      The patch above was a temporary solution used by Yanteng for further
      development and to move on with the ongoing review. This patchset is a
      refactored version of that single patch, formatted as required for
      fixes patches.
      
      In particular the series starts with fixing the half-duplex-less
      constraint currently applied for all IP-cores. In fact it's specific for
      the DW QoS Eth only (DW GMAC v4.x/v5.x).
      
      The next patch fixes the MAC-capabilities setting up during the active
      Tx/Rx queues re-initialization procedure. Particularly the procedure
      missed the max-speed limit thus possibly activating speeds prohibited on
      the respective platforms.
      
      Third patch fixes the incorrect MAC-capabilities initialization for DW
      MAC100, DW XGMAC and DW XLGMAC devices by moving the correct
      initialization to the IP-core specific setup() methods.
      
      That's it for now. Thanks for review and testing in advance.
      Signed-off-by: Serge Semin <fancer.lancer@gmail.com>
      Cc: Maxime Coquelin <mcoquelin.stm32@gmail.com>
      Cc: Simon Horman <horms@kernel.org>
      Cc: Huacai Chen <chenhuacai@kernel.org>
      Cc: Chen-Yu Tsai <wens@csie.org>
      Cc: Jernej Skrabec <jernej.skrabec@gmail.com>
      Cc: Samuel Holland <samuel@sholland.org>
      Cc: netdev@vger.kernel.org
      Cc: linux-stm32@st-md-mailman.stormreply.com
      Cc: linux-arm-kernel@lists.infradead.org
      Cc: linux-sunxi@lists.linux.dev
      Cc: linux-kernel@vger.kernel.org
      ====================
      
      Link: https://lore.kernel.org/r/20240412180340.7965-1-fancer.lancer@gmail.com
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    • net: stmmac: Fix IP-cores specific MAC capabilities · 9cb54af2
      Serge Semin authored
      Here is the list of the MAC capabilities specific to the particular DW MAC
      IP-cores currently supported by the driver:
      
      DW MAC100: MAC_ASYM_PAUSE | MAC_SYM_PAUSE |
      	   MAC_10 | MAC_100
      
      DW GMAC:  MAC_ASYM_PAUSE | MAC_SYM_PAUSE |
                MAC_10 | MAC_100 | MAC_1000
      
      Allwinner sun8i MAC: MAC_ASYM_PAUSE | MAC_SYM_PAUSE |
                           MAC_10 | MAC_100 | MAC_1000
      
      DW QoS Eth: MAC_ASYM_PAUSE | MAC_SYM_PAUSE |
                  MAC_10 | MAC_100 | MAC_1000 | MAC_2500FD
      if more than one Tx/Rx queue is active:
                  MAC_ASYM_PAUSE | MAC_SYM_PAUSE |
                  MAC_10FD | MAC_100FD | MAC_1000FD | MAC_2500FD
      
      DW XGMAC: MAC_ASYM_PAUSE | MAC_SYM_PAUSE |
                MAC_1000FD | MAC_2500FD | MAC_5000FD | MAC_10000FD
      
      DW XLGMAC: MAC_ASYM_PAUSE | MAC_SYM_PAUSE |
                MAC_1000FD | MAC_2500FD | MAC_5000FD | MAC_10000FD |
                MAC_25000FD | MAC_40000FD | MAC_50000FD | MAC_100000FD
      
      As you can see there are only two common capabilities:
      MAC_ASYM_PAUSE | MAC_SYM_PAUSE.
      Meanwhile what is currently implemented defines 10/100/1000 link speeds
      for all IP-cores, which is definitely incorrect for DW MAC100, DW XGMAC
      and DW XLGMAC devices.
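
The capability list above can be checked mechanically. The following is a minimal Python sketch, not driver code: the `MacCap` enum only imitates the kernel's MAC_* bit names, and it assumes that MAC_10, MAC_100 and MAC_1000 each stand for the half-duplex plus full-duplex pair of that speed. It computes the capabilities shared by all cores and lists the cores for which a blanket 10/100/1000 setting claims too much.

```python
from enum import Flag, auto
from functools import reduce
from operator import and_


class MacCap(Flag):
    """Illustrative stand-ins for the MAC_* capability bits named above."""
    MAC_ASYM_PAUSE = auto()
    MAC_SYM_PAUSE = auto()
    MAC_10HD = auto()
    MAC_10FD = auto()
    MAC_100HD = auto()
    MAC_100FD = auto()
    MAC_1000HD = auto()
    MAC_1000FD = auto()
    MAC_2500FD = auto()
    MAC_5000FD = auto()
    MAC_10000FD = auto()
    MAC_25000FD = auto()
    MAC_40000FD = auto()
    MAC_50000FD = auto()
    MAC_100000FD = auto()
    # Assumption: the duplex-less names cover both duplex modes.
    MAC_10 = MAC_10HD | MAC_10FD
    MAC_100 = MAC_100HD | MAC_100FD
    MAC_1000 = MAC_1000HD | MAC_1000FD


C = MacCap
PAUSE = C.MAC_ASYM_PAUSE | C.MAC_SYM_PAUSE
XGMAC_SPEEDS = C.MAC_1000FD | C.MAC_2500FD | C.MAC_5000FD | C.MAC_10000FD

# Per-core capabilities, transcribed from the list in the commit message.
CORE_CAPS = {
    "DW MAC100": PAUSE | C.MAC_10 | C.MAC_100,
    "DW GMAC": PAUSE | C.MAC_10 | C.MAC_100 | C.MAC_1000,
    "Allwinner sun8i MAC": PAUSE | C.MAC_10 | C.MAC_100 | C.MAC_1000,
    "DW QoS Eth": PAUSE | C.MAC_10 | C.MAC_100 | C.MAC_1000 | C.MAC_2500FD,
    "DW XGMAC": PAUSE | XGMAC_SPEEDS,
    "DW XLGMAC": PAUSE | XGMAC_SPEEDS | C.MAC_25000FD | C.MAC_40000FD
                 | C.MAC_50000FD | C.MAC_100000FD,
}

# Capabilities that every core supports.
common = reduce(and_, CORE_CAPS.values())

# Cores for which "10/100/1000 for everyone" advertises unsupported modes.
legacy_speeds = C.MAC_10 | C.MAC_100 | C.MAC_1000
overclaimed = [name for name, caps in CORE_CAPS.items()
               if (caps & legacy_speeds) != legacy_speeds]
```

Under these assumptions `common` reduces to the two pause bits and `overclaimed` names DW MAC100, DW XGMAC and DW XLGMAC, matching the statement above.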
      
      Seeing the flow-control is implemented as a callback for each MAC IP-core
      (see dwmac100_flow_ctrl(), dwmac1000_flow_ctrl(), sun8i_dwmac_flow_ctrl(),
      etc) and since the MAC-specific setup() method is supposed to be called
      for each available DW MAC-based device, the capabilities initialization
      can be freely moved to these setup() functions, thus correctly setting up
      the MAC-capabilities for each IP-core (including the Allwinner Sun8i). A
      new stmmac_link::caps field was specifically introduced for that so to
      have all link-specific info preserved in a single structure.
      
      Note the suggested change fixes three earlier commits at a time. The
      commit 5b0d7d7d ("net: stmmac: Add the missing speeds that XGMAC
      supports") permitted the 10-100 link speeds and 1G half-duplex mode for DW
      XGMAC IP-core even though it doesn't support them. The commit df7699c7
      ("net: stmmac: Do not cut down 1G modes") incorrectly added the MAC_1000
      capability to the DW MAC100 IP-core. Similarly to the DW XGMAC the commit
      8a880936 ("net: stmmac: Add XLGMII support") incorrectly permitted the
      10-100 link speeds and 1G half-duplex mode for DW XLGMAC IP-core.
      
      Fixes: 5b0d7d7d ("net: stmmac: Add the missing speeds that XGMAC supports")
      Fixes: df7699c7 ("net: stmmac: Do not cut down 1G modes")
      Fixes: 8a880936 ("net: stmmac: Add XLGMII support")
      Suggested-by: Russell King (Oracle) <linux@armlinux.org.uk>
      Signed-off-by: Serge Semin <fancer.lancer@gmail.com>
      Reviewed-by: Romain Gantois <romain.gantois@bootlin.com>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    • net: stmmac: Fix max-speed being ignored on queue re-init · 59c3d6ca
      Serge Semin authored
      The maximum link speed can be artificially limited on a
      platform-specific basis, either by setting the
      plat_stmmacenet_data::max_speed field or by specifying the "max-speed"
      DT-property. In such cases any re-initialization of the
      MAC-capabilities must take the limit into account. In particular, the
      link speed capabilities may change when the number of active Tx/Rx
      queues is re-initialized, but the currently implemented procedure
      doesn't take the speed limit into account.
      
      Fix that by calling phylink_limit_mac_speed() in the
      stmmac_reinit_queues() method if the speed limitation was required in the
      same way as it's done in the stmmac_phy_setup() function.
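
The effect of the missing limit can be modelled in a few lines. This is a rough Python sketch under stated assumptions, not the kernel implementation: `limit_mac_speed()` only imitates the idea of phylink_limit_mac_speed(), and `recompute_caps()` with its `apply_limit` switch is a hypothetical stand-in for the queue re-init path before and after the fix.

```python
# Nominal speed in Mbit/s for each illustrative capability name.
CAP_SPEED = {"MAC_10": 10, "MAC_100": 100, "MAC_1000": 1000, "MAC_2500FD": 2500}


def limit_mac_speed(caps, max_speed):
    """Drop speed capabilities above max_speed; pause bits have no speed and stay."""
    return {cap for cap in caps if CAP_SPEED.get(cap, 0) <= max_speed}


def recompute_caps(core_caps, max_speed=None, apply_limit=True):
    """Hypothetical model of recomputing the capabilities on queue re-init."""
    caps = set(core_caps)
    if apply_limit and max_speed:
        caps = limit_mac_speed(caps, max_speed)
    return caps


QOS_ETH = {"MAC_ASYM_PAUSE", "MAC_SYM_PAUSE",
           "MAC_10", "MAC_100", "MAC_1000", "MAC_2500FD"}

# Platform limited to 1000 Mbit/s: without the limit 2.5G reappears.
before_fix = recompute_caps(QOS_ETH, max_speed=1000, apply_limit=False)
after_fix = recompute_caps(QOS_ETH, max_speed=1000)
```

In this model `before_fix` still contains MAC_2500FD, which is the kind of prohibited speed the commit describes, while `after_fix` does not.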
      
      Fixes: 95201f36 ("net: stmmac: update MAC capabilities when tx queues are updated")
      Signed-off-by: Serge Semin <fancer.lancer@gmail.com>
      Reviewed-by: Romain Gantois <romain.gantois@bootlin.com>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    • net: stmmac: Apply half-duplex-less constraint for DW QoS Eth only · 0ebd96f5
      Serge Semin authored
      There are three DW MAC IP-cores which can have multiple Tx/Rx queues
      enabled:
      DW GMAC v3.7+ with AV feature,
      DW QoS Eth v4.x/v5.x,
      DW XGMAC/XLGMAC
      Based on the respective HW databooks, only the DW QoS Eth IP-core doesn't
      support the half-duplex link mode when more than one queue is enabled:
      
      "In multiple queue/channel configurations, for half-duplex operation,
      enable only the Q0/CH0 on Tx and Rx. For single queue/channel in
      full-duplex operation, any queue/channel can be enabled."
      
      The rest of the IP-cores don't have such a constraint. Thus, in order to
      have the constraint applied to the DW QoS Eth MACs only, let's move its
      implementation to the respective MAC-capabilities getter and make sure the
      getter is called in the queues re-init procedure.
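
A per-core capabilities getter of the kind described above might look roughly like the following Python sketch. The function names and the single `num_queues` parameter are hypothetical; how the driver actually counts active Tx/Rx queues is not modelled here.

```python
PAUSE = {"MAC_ASYM_PAUSE", "MAC_SYM_PAUSE"}
HALF_DUPLEX = {"MAC_10HD", "MAC_100HD", "MAC_1000HD"}
FULL_DUPLEX = {"MAC_10FD", "MAC_100FD", "MAC_1000FD"}


def qos_eth_caps(num_queues):
    """DW QoS Eth: half-duplex is only allowed with a single queue."""
    caps = PAUSE | HALF_DUPLEX | FULL_DUPLEX | {"MAC_2500FD"}
    if num_queues > 1:
        caps -= HALF_DUPLEX
    return caps


def gmac_caps(num_queues):
    """DW GMAC: no such constraint, so the queue count is ignored."""
    return PAUSE | HALF_DUPLEX | FULL_DUPLEX
```

Keeping the constraint inside the QoS Eth getter means the other cores never lose their half-duplex modes, and calling the getter again on queue re-init keeps the result consistent with the new queue count.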
      
      Fixes: b6cfffa7 ("stmmac: fix DMA channel hang in half-duplex mode")
      Signed-off-by: Serge Semin <fancer.lancer@gmail.com>
      Reviewed-by: Romain Gantois <romain.gantois@bootlin.com>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    • Merge branch 'selftests-net-tcp_ao-a-bunch-of-fixes-for-tcp-ao-selftests' · 24f4c99e
      Paolo Abeni authored
      Dmitry Safonov says:
      
      ====================
      selftests/net/tcp_ao: A bunch of fixes for TCP-AO selftests
      
      This started as a fix for the flakiness issues in rst_ipv*, which
      affect the netdev dashboard.
      Signed-off-by: Dmitry Safonov <0x7f454c46@gmail.com>
      ====================
      
      Link: https://lore.kernel.org/r/20240413-tcp-ao-selftests-fixes-v1-0-f9c41c96949d@gmail.com
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    • selftests/tcp_ao: Printing fixes to confirm with format-security · b476c936
      Dmitry Safonov authored
      On my new laptop with packages from nixos-unstable, gcc 12.3.0 produces
      > lib/setup.c: In function ‘__test_msg’:
      > lib/setup.c:20:9: error: format not a string literal and no format arguments [-Werror=format-security]
      >    20 |         ksft_print_msg(buf);
      >       |         ^~~~~~~~~~~~~~
      > lib/setup.c: In function ‘__test_ok’:
      > lib/setup.c:26:9: error: format not a string literal and no format arguments [-Werror=format-security]
      >    26 |         ksft_test_result_pass(buf);
      >       |         ^~~~~~~~~~~~~~~~~~~~~
      > lib/setup.c: In function ‘__test_fail’:
      > lib/setup.c:32:9: error: format not a string literal and no format arguments [-Werror=format-security]
      >    32 |         ksft_test_result_fail(buf);
      >       |         ^~~~~~~~~~~~~~~~~~~~~
      > lib/setup.c: In function ‘__test_xfail’:
      > lib/setup.c:38:9: error: format not a string literal and no format arguments [-Werror=format-security]
      >    38 |         ksft_test_result_xfail(buf);
      >       |         ^~~~~~~~~~~~~~~~~~~~~~
      > lib/setup.c: In function ‘__test_error’:
      > lib/setup.c:44:9: error: format not a string literal and no format arguments [-Werror=format-security]
      >    44 |         ksft_test_result_error(buf);
      >       |         ^~~~~~~~~~~~~~~~~~~~~~
      > lib/setup.c: In function ‘__test_skip’:
      > lib/setup.c:50:9: error: format not a string literal and no format arguments [-Werror=format-security]
      >    50 |         ksft_test_result_skip(buf);
      >       |         ^~~~~~~~~~~~~~~~~~~~~
      > cc1: some warnings being treated as errors
      
      As the buffer was already pre-printed into, print it as a string
      rather than a format-string.
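
The pitfall is not specific to C. As a loose Python analogue (not the selftest code, and with hypothetical helper names), the sketch below treats an already-formatted buffer as a %-format string and then shows the safer form that passes the buffer as data, which corresponds to printing it through a "%s" format.

```python
def emit_as_format(buf):
    """Risky: any '%' left in the pre-formatted text is parsed as a directive."""
    return buf % ()


def emit_as_string(buf):
    """Safe: the buffer is data, the format string is a fixed literal."""
    return "%s" % (buf,)


message = "copied 100% of data"

try:
    emit_as_format(message)
    misparsed = False
except (TypeError, ValueError):
    # "% o" is read as a conversion that has no argument to consume.
    misparsed = True

safe_output = emit_as_string(message)
```

Python raises an exception where C would read whatever happens to be on the stack, which is why the compiler flags the C form as a security issue.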
      
      Fixes: cfbab37b ("selftests/net: Add TCP-AO library")
      Signed-off-by: Dmitry Safonov <0x7f454c46@gmail.com>
      Reported-by: Muhammad Usama Anjum <usama.anjum@collabora.com>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    • selftests/tcp_ao: Fix fscanf() call for format-security · beb78cd1
      Dmitry Safonov authored
      On my new laptop with packages from nixos-unstable, gcc 12.3.0 produces:
      > lib/proc.c: In function ‘netstat_read_type’:
      > lib/proc.c:89:9: error: format not a string literal and no format arguments [-Werror=format-security]
      >    89 |         if (fscanf(fnetstat, type->header_name) == EOF)
      >       |         ^~
      > cc1: some warnings being treated as errors
      
      Here the selftests lib parses the header name, expecting a non-space
      word ending with a colon.
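
The parsing rule itself ("a non-space word ending with a colon") is easy to state outside of fscanf(). A minimal Python sketch, with a hypothetical helper name and an illustrative sample line in the style of /proc/net/netstat:

```python
import re

# One or more non-space characters followed by a colon, at the start of the line.
HEADER_RE = re.compile(r"(\S+):")


def read_header_name(line):
    """Return the header word before the colon, or None if the line has none."""
    match = HEADER_RE.match(line)
    return match.group(1) if match else None


name = read_header_name("TcpExt: SyncookiesSent SyncookiesRecv")
missing = read_header_name("no header here")
```

The pattern is a fixed literal and the input line is only ever treated as data, which is the property the format-security warning is asking for.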
      
      Fixes: cfbab37b ("selftests/net: Add TCP-AO library")
      Signed-off-by: Dmitry Safonov <0x7f454c46@gmail.com>
      Reported-by: Muhammad Usama Anjum <usama.anjum@collabora.com>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    • selftests/tcp_ao: Zero-init tcp_ao_info_opt · b089b3be
      Dmitry Safonov authored
      The structure is on the stack and has to be zero-initialized as
      the kernel checks for:
      >	if (in.reserved != 0 || in.reserved2 != 0)
      >		return -EINVAL;
      
      Fixes: b2666053 ("selftests/net: Add test for TCP-AO add setsockopt() command")
      Signed-off-by: Dmitry Safonov <0x7f454c46@gmail.com>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    • selftests/tcp_ao: Make RST tests less flaky · 4225dfa4
      Dmitry Safonov authored
      Currently, "active reset" cases are flaky, because select() is called
      for 3 sockets, while only 2 are expected to receive RST.
      The idea of the third socket was to get into request_sock_queue,
      but the test mistakenly attempted to connect() after the listener
      socket was shut down.
      
      Repair this test; it's important to check the different kernel
      code-paths for signing RST TCP-AO segments.
      
      Fixes: c6df7b23 ("selftests/net: Add TCP-AO RST test")
      Reported-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: Dmitry Safonov <0x7f454c46@gmail.com>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
  2. 15 Apr, 2024 2 commits
  3. 14 Apr, 2024 1 commit
    • net: change maximum number of UDP segments to 128 · 1382e3b6
      Yuri Benditovich authored
      The commit fc8b2a61
      ("net: more strict VIRTIO_NET_HDR_GSO_UDP_L4 validation")
      adds a check of the potential number of UDP segments against
      UDP_MAX_SEGMENTS in linux/virtio_net.h.
      After this change the certification test of USO guest-to-guest
      transmit on the Windows driver for the virtio-net device fails,
      for example with a packet size of ~64K and an mss of 536 bytes.
      In general USO should not be more restrictive than TSO.
      Indeed, with an unreasonably small mss a large number of segments
      can cause queue overflow and packet loss on the destination.
      A limit of 128 segments is good for any practical purpose:
      with the minimal meaningful mss of 536 the maximal UDP packet will
      be divided into ~120 segments.
      The number of segments for UDP packets is also validated against
      UDP_MAX_SEGMENTS in udp.c (v4, v6); this does not affect the
      guest-to-guest path but does affect packets sent to the host, for
      example.
      It is important to mention that UDP_MAX_SEGMENTS is a kernel-only
      define and is not available to user mode socket applications.
      In order to request an MSS smaller than the MTU, applications
      just use setsockopt with SOL_UDP and UDP_SEGMENT, and there are
      no limitations at the socket API level.
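
The segment arithmetic behind the "~120 segments" figure can be reproduced with a short Python sketch. The 128 limit and the 536-byte mss come from the text above; the previous limit of 64 is an assumption made here for illustration, since the commit message does not state the old value, and 65535 bytes is used as a stand-in for "~64K".

```python
def udp_segment_count(payload_len, mss):
    """Number of segments needed to carry payload_len bytes (ceiling division)."""
    return -(-payload_len // mss)


NEW_LIMIT = 128          # value introduced by this commit
ASSUMED_OLD_LIMIT = 64   # assumption: not stated in the commit message

segments = udp_segment_count(65535, 536)
fits_new_limit = segments <= NEW_LIMIT
fits_assumed_old_limit = segments <= ASSUMED_OLD_LIMIT
```

With these inputs the packet needs 123 segments, which fits under 128 but would exceed a limit of 64.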
      
      Fixes: fc8b2a61 ("net: more strict VIRTIO_NET_HDR_GSO_UDP_L4 validation")
      Signed-off-by: Yuri Benditovich <yuri.benditovich@daynix.com>
      Reviewed-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  4. 13 Apr, 2024 11 commits
    • Merge branch 'mlx5-fixes' · 72041e53
      Jakub Kicinski authored
      Tariq Toukan says:
      
      ====================
      mlx5 fixes
      
      This patchset provides bug fixes to mlx5 core and Eth drivers.
      ====================
      
      Link: https://lore.kernel.org/r/20240411115444.374475-1-tariqt@nvidia.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net/mlx5e: Prevent deadlock while disabling aRFS · fef96576
      Carolina Jubran authored
      When disabling aRFS under the `priv->state_lock`, any scheduled
      aRFS works are canceled using the `cancel_work_sync` function,
      which waits for the work to end if it has already started.
      However, while waiting for the work handler, the handler will
      try to acquire the `state_lock` which is already acquired.
      
      The worker acquires the lock to delete the rules if the state
      is down, which is not the worker's responsibility since
      disabling aRFS deletes the rules.
      
      Add an aRFS state variable, which indicates whether aRFS is
      enabled, and prevent adding rules when aRFS is disabled.
      
      Kernel log:
      
      ======================================================
      WARNING: possible circular locking dependency detected
      6.7.0-rc4_net_next_mlx5_5483eb2 #1 Tainted: G          I
      ------------------------------------------------------
      ethtool/386089 is trying to acquire lock:
      ffff88810f21ce68 ((work_completion)(&rule->arfs_work)){+.+.}-{0:0}, at: __flush_work+0x74/0x4e0
      
      but task is already holding lock:
      ffff8884a1808cc0 (&priv->state_lock){+.+.}-{3:3}, at: mlx5e_ethtool_set_channels+0x53/0x200 [mlx5_core]
      
      which lock already depends on the new lock.
      
      the existing dependency chain (in reverse order) is:
      
      -> #1 (&priv->state_lock){+.+.}-{3:3}:
             __mutex_lock+0x80/0xc90
             arfs_handle_work+0x4b/0x3b0 [mlx5_core]
             process_one_work+0x1dc/0x4a0
             worker_thread+0x1bf/0x3c0
             kthread+0xd7/0x100
             ret_from_fork+0x2d/0x50
             ret_from_fork_asm+0x11/0x20
      
      -> #0 ((work_completion)(&rule->arfs_work)){+.+.}-{0:0}:
             __lock_acquire+0x17b4/0x2c80
             lock_acquire+0xd0/0x2b0
             __flush_work+0x7a/0x4e0
             __cancel_work_timer+0x131/0x1c0
             arfs_del_rules+0x143/0x1e0 [mlx5_core]
             mlx5e_arfs_disable+0x1b/0x30 [mlx5_core]
             mlx5e_ethtool_set_channels+0xcb/0x200 [mlx5_core]
             ethnl_set_channels+0x28f/0x3b0
             ethnl_default_set_doit+0xec/0x240
             genl_family_rcv_msg_doit+0xd0/0x120
             genl_rcv_msg+0x188/0x2c0
             netlink_rcv_skb+0x54/0x100
             genl_rcv+0x24/0x40
             netlink_unicast+0x1a1/0x270
             netlink_sendmsg+0x214/0x460
             __sock_sendmsg+0x38/0x60
             __sys_sendto+0x113/0x170
             __x64_sys_sendto+0x20/0x30
             do_syscall_64+0x40/0xe0
             entry_SYSCALL_64_after_hwframe+0x46/0x4e
      
      other info that might help us debug this:
      
       Possible unsafe locking scenario:
      
             CPU0                    CPU1
             ----                    ----
        lock(&priv->state_lock);
                                     lock((work_completion)(&rule->arfs_work));
                                     lock(&priv->state_lock);
        lock((work_completion)(&rule->arfs_work));
      
       *** DEADLOCK ***
      
      3 locks held by ethtool/386089:
       #0: ffffffff82ea7210 (cb_lock){++++}-{3:3}, at: genl_rcv+0x15/0x40
       #1: ffffffff82e94c88 (rtnl_mutex){+.+.}-{3:3}, at: ethnl_default_set_doit+0xd3/0x240
       #2: ffff8884a1808cc0 (&priv->state_lock){+.+.}-{3:3}, at: mlx5e_ethtool_set_channels+0x53/0x200 [mlx5_core]
      
      stack backtrace:
      CPU: 15 PID: 386089 Comm: ethtool Tainted: G          I        6.7.0-rc4_net_next_mlx5_5483eb2 #1
      Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
      Call Trace:
       <TASK>
       dump_stack_lvl+0x60/0xa0
       check_noncircular+0x144/0x160
       __lock_acquire+0x17b4/0x2c80
       lock_acquire+0xd0/0x2b0
       ? __flush_work+0x74/0x4e0
       ? save_trace+0x3e/0x360
       ? __flush_work+0x74/0x4e0
       __flush_work+0x7a/0x4e0
       ? __flush_work+0x74/0x4e0
       ? __lock_acquire+0xa78/0x2c80
       ? lock_acquire+0xd0/0x2b0
       ? mark_held_locks+0x49/0x70
       __cancel_work_timer+0x131/0x1c0
       ? mark_held_locks+0x49/0x70
       arfs_del_rules+0x143/0x1e0 [mlx5_core]
       mlx5e_arfs_disable+0x1b/0x30 [mlx5_core]
       mlx5e_ethtool_set_channels+0xcb/0x200 [mlx5_core]
       ethnl_set_channels+0x28f/0x3b0
       ethnl_default_set_doit+0xec/0x240
       genl_family_rcv_msg_doit+0xd0/0x120
       genl_rcv_msg+0x188/0x2c0
       ? ethnl_ops_begin+0xb0/0xb0
       ? genl_family_rcv_msg_dumpit+0xf0/0xf0
       netlink_rcv_skb+0x54/0x100
       genl_rcv+0x24/0x40
       netlink_unicast+0x1a1/0x270
       netlink_sendmsg+0x214/0x460
       __sock_sendmsg+0x38/0x60
       __sys_sendto+0x113/0x170
       ? do_user_addr_fault+0x53f/0x8f0
       __x64_sys_sendto+0x20/0x30
       do_syscall_64+0x40/0xe0
       entry_SYSCALL_64_after_hwframe+0x46/0x4e
       </TASK>
      
      Fixes: 45bf454a ("net/mlx5e: Enabling aRFS mechanism")
      Signed-off-by: Carolina Jubran <cjubran@nvidia.com>
      Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
      Link: https://lore.kernel.org/r/20240411115444.374475-7-tariqt@nvidia.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net/mlx5e: Acquire RTNL lock before RQs/SQs activation/deactivation · fdce06bd
      Carolina Jubran authored
      netif_queue_set_napi asserts whether RTNL lock is held if
      the netdev is initialized.
      
      Acquire the RTNL lock before activating or deactivating
      RQs/SQs if the lock has not been held before in the flow.
      
      Fixes: f25e7b82 ("net/mlx5e: link NAPI instances to queues and IRQs")
      Cc: Joe Damato <jdamato@fastly.com>
      Signed-off-by: Carolina Jubran <cjubran@nvidia.com>
      Reviewed-by: Rahul Rameshbabu <rrameshbabu@nvidia.com>
      Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
      Link: https://lore.kernel.org/r/20240411115444.374475-6-tariqt@nvidia.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net/mlx5e: Use channel mdev reference instead of global mdev instance for coalescing · 6c685bdb
      Rahul Rameshbabu authored
      Channels can potentially have independent mdev instances. Do not refer to
      the global mdev instance in the mlx5e_priv instance for channel FW
      operations related to coalescing. CQ numbers that would be valid on the
      channel's mdev instance may not be correctly referenced if using the
      mlx5e_priv instance.
      
      Fixes: 67936e13 ("net/mlx5e: Let channels be SD-aware")
      Signed-off-by: Rahul Rameshbabu <rrameshbabu@nvidia.com>
      Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
      Link: https://lore.kernel.org/r/20240411115444.374475-5-tariqt@nvidia.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net/mlx5: Restore mistakenly dropped parts in register devlink flow · bf729988
      Shay Drory authored
      Code parts from cited commit were mistakenly dropped while rebasing
      before submission. Add them here.
      
      Fixes: c6e77aa9 ("net/mlx5: Register devlink first under devlink lock")
      Signed-off-by: Shay Drory <shayd@nvidia.com>
      Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
      Link: https://lore.kernel.org/r/20240411115444.374475-4-tariqt@nvidia.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net/mlx5: SD, Handle possible devcom ERR_PTR · aa4ac90d
      Tariq Toukan authored
      Check if devcom holds an error pointer and return immediately.
      
      This fixes Smatch static checker warning:
      drivers/net/ethernet/mellanox/mlx5/core/lib/sd.c:221 sd_register()
      error: 'devcom' dereferencing possible ERR_PTR()
      
      Enhance mlx5_devcom_register_component() so it stops returning NULL,
      making it easier for its callers.
      
      Fixes: d3d05766 ("net/mlx5: SD, Implement devcom communication and primary election")
      Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
      Link: https://lore.kernel.org/all/f09666c8-e604-41f6-958b-4cc55c73faf9@gmail.com/T/
      Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
      Reviewed-by: Gal Pressman <gal@nvidia.com>
      Link: https://lore.kernel.org/r/20240411115444.374475-3-tariqt@nvidia.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net/mlx5: Lag, restore buckets number to default after hash LAG deactivation · 37cc10da
      Shay Drory authored
      The cited patch introduces the concept of buckets in LAG in hash mode.
      However, the patch doesn't clear the number of buckets on LAG
      deactivation. This results in using the wrong number of buckets if
      the user creates a hash mode LAG and afterwards creates a non-hash
      mode LAG.
      
      Hence, restore buckets number to default after hash mode LAG
      deactivation.
      
      Fixes: 352899f3 ("net/mlx5: Lag, use buckets in hash mode")
      Signed-off-by: Shay Drory <shayd@nvidia.com>
      Reviewed-by: Maor Gottlieb <maorg@nvidia.com>
      Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
      Link: https://lore.kernel.org/r/20240411115444.374475-2-tariqt@nvidia.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: sparx5: flower: fix fragment flags handling · 68aba004
      Asbjørn Sloth Tønnesen authored
      I noticed that only 3 out of the 4 input bits were used,
      mt.key->flags & FLOW_DIS_IS_FRAGMENT was never checked.
      
      In order to avoid a complicated maze, I converted it to
      use a 16 byte mapping table.
      
      As shown in the table below, the old heuristics don't
      always do the right thing, i.e. when FLOW_DIS_IS_FRAGMENT=1/1
      they used to match only follow-up fragment packets.
      
      Here are all the combinations, and their resulting new/old
      VCAP key/mask filter:
      
        /- FLOW_DIS_IS_FRAGMENT (key/mask)
        |    /- FLOW_DIS_FIRST_FRAG (key/mask)
        |    |    /-- new VCAP fragment (key/mask)
        v    v    v    v- old VCAP fragment (key/mask)
      
       0/0  0/0  -/-  -/-     impossible (due to entry cond. on mask)
       0/0  0/1  -/-  0/3 !!  invalid (can't match non-fragment + follow-up frag)
       0/0  1/0  -/-  -/-     impossible (key > mask)
       0/0  1/1  1/3  1/3     first fragment
      
       0/1  0/0  0/3  3/3 !!  not fragmented
       0/1  0/1  0/3  3/3 !!  not fragmented (+ not first fragment)
       0/1  1/0  -/-  -/-     impossible (key > mask)
       0/1  1/1  -/-  1/3 !!  invalid (non-fragment and first frag)
      
       1/0  0/0  -/-  -/-     impossible (key > mask)
       1/0  0/1  -/-  -/-     impossible (key > mask)
       1/0  1/0  -/-  -/-     impossible (key > mask)
       1/0  1/1  -/-  -/-     impossible (key > mask)
      
       1/1  0/0  1/1  3/3 !!  some fragment
       1/1  0/1  3/3  3/3     follow-up fragment
       1/1  1/0  -/-  -/-     impossible (key > mask)
       1/1  1/1  1/3  1/3     first fragment
      
      In the datasheet the VCAP fragment values are documented as:
       0 = no fragment
       1 = initial fragment
       2 = suspicious fragment
       3 = valid follow-up fragment
      
      Result: 3 combinations match the old behavior,
              3 combinations have been corrected,
              2 combinations are now invalid, and fail,
              8 combinations are impossible.
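
The table can be expressed as a lookup and the result counts re-derived from it. The Python sketch below is only a userspace model of the table above, not the driver's mapping table: the `classify()` helper and its tuple layout are hypothetical, while the key/mask values in `NEW_VCAP` are transcribed from the "new VCAP fragment" column.

```python
from collections import Counter
from itertools import product

# ((is_fragment key, mask), (first_frag key, mask)) -> new VCAP (key, mask)
NEW_VCAP = {
    ((0, 0), (1, 1)): (1, 3),  # first fragment
    ((0, 1), (0, 0)): (0, 3),  # not fragmented
    ((0, 1), (0, 1)): (0, 3),  # not fragmented (+ not first fragment)
    ((1, 1), (0, 0)): (1, 1),  # some fragment
    ((1, 1), (0, 1)): (3, 3),  # follow-up fragment
    ((1, 1), (1, 1)): (1, 3),  # first fragment
}

# Combinations that are representable but contradictory.
INVALID = {((0, 0), (0, 1)), ((0, 1), (1, 1))}


def classify(is_frag, first_frag):
    """Return (category, new VCAP key/mask or None) for one table row."""
    (ik, im), (fk, fm) = is_frag, first_frag
    if ik > im or fk > fm or (im == 0 and fm == 0):
        return "impossible", None  # key > mask, or nothing is being matched
    if (is_frag, first_frag) in INVALID:
        return "invalid", None
    return "valid", NEW_VCAP[(is_frag, first_frag)]


pairs = [(0, 0), (0, 1), (1, 0), (1, 1)]
counts = Counter(classify(a, b)[0] for a, b in product(pairs, repeat=2))
some_fragment = classify((1, 1), (0, 0))
```

The counts come out as 6 valid, 2 invalid and 8 impossible rows, where the 6 valid rows are the 3 unchanged plus the 3 corrected combinations listed in the result.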
      
      It should now be aligned with how FLOW_DIS_IS_FRAGMENT
      and FLOW_DIS_FIRST_FRAG are set in __skb_flow_dissect() in
      net/core/flow_dissector.c
      
      Since the VCAP fragment values are not a bitfield, we have
      to ignore the suspicious fragment value, e.g. when matching
      on any kind of fragment with FLOW_DIS_IS_FRAGMENT=1/1.
      
      Only compile tested, and logic tested in userspace, as I
      unfortunately don't have access to this switch chip (yet).
      
      Fixes: d6c2964d ("net: microchip: sparx5: Adding more tc flower keys for the IS2 VCAP")
      Signed-off-by: Asbjørn Sloth Tønnesen <ast@fiberby.net>
      Reviewed-by: Steen Hegelund <Steen.Hegelund@microchip.com>
      Tested-by: Daniel Machon <daniel.machon@microchip.com>
      Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
      Link: https://lore.kernel.org/r/20240411111321.114095-1-ast@fiberby.net
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • Merge branch 'af_unix-fix-msg_oob-bugs-with-msg_peek' · 27f58f7f
      Jakub Kicinski authored
      Kuniyuki Iwashima says:
      
      ====================
      af_unix: Fix MSG_OOB bugs with MSG_PEEK.
      
      Currently, OOB data can be read without MSG_OOB accidentally
      in two cases, and this series fixes the bugs.
      
      v1: https://lore.kernel.org/netdev/20240409225209.58102-1-kuniyu@amazon.com/
      ====================
      
      Link: https://lore.kernel.org/r/20240410171016.7621-1-kuniyu@amazon.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      27f58f7f
    • Kuniyuki Iwashima's avatar
      af_unix: Don't peek OOB data without MSG_OOB. · 22dd70eb
      Kuniyuki Iwashima authored
      Currently, we can read OOB data without MSG_OOB by using MSG_PEEK
      when OOB data is sitting on the front row, which is apparently
      wrong.
      
        >>> from socket import *
        >>> c1, c2 = socketpair(AF_UNIX, SOCK_STREAM)
        >>> c1.send(b'a', MSG_OOB)
        1
        >>> c2.recv(1, MSG_PEEK | MSG_DONTWAIT)
        b'a'
      
      If manage_oob() is called when no data has been copied, we only
      check if the socket enables SO_OOBINLINE or MSG_PEEK is not used.
      Otherwise, the skb is returned as is.
      
      However, here we should return NULL if MSG_PEEK is set and no data
      has been copied.
      
      Also, in such a case, we should not jump to the redo label because
      we will be caught in the loop and hog the CPU until normal data
      comes in.
      
      Then, we need to handle skb == NULL case with the if-clause below
      the manage_oob() block.
      
      With this patch:
      
        >>> from socket import *
        >>> c1, c2 = socketpair(AF_UNIX, SOCK_STREAM)
        >>> c1.send(b'a', MSG_OOB)
        1
        >>> c2.recv(1, MSG_PEEK | MSG_DONTWAIT)
        Traceback (most recent call last):
          File "<stdin>", line 1, in <module>
        BlockingIOError: [Errno 11] Resource temporarily unavailable
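
The same check can be wrapped in a small script so it can be run on kernels with and without the fix. This is only a sketch: the function name and the three result strings are invented here, and the outcome depends entirely on the running kernel.

```python
import socket


def peek_front_oob():
    """Send one OOB byte, then peek for it without MSG_OOB.

    Returns "leaked" if the peek returned data, "blocked" if nothing was
    readable, or "unsupported" if the socket calls failed for another
    reason (for example a kernel built without AF_UNIX OOB support).
    """
    c1, c2 = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        c1.send(b"a", socket.MSG_OOB)
        data = c2.recv(1, socket.MSG_PEEK | socket.MSG_DONTWAIT)
        return "leaked" if data else "blocked"
    except BlockingIOError:
        return "blocked"
    except OSError:
        return "unsupported"
    finally:
        c1.close()
        c2.close()


print(peek_front_oob())
```

Going by the two transcripts above, a kernel without this patch would be expected to report "leaked" and a patched one "blocked".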
      
      Fixes: 314001f0 ("af_unix: Add OOB support")
      Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
      Link: https://lore.kernel.org/r/20240410171016.7621-3-kuniyu@amazon.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      22dd70eb
    • Kuniyuki Iwashima's avatar
      af_unix: Call manage_oob() for every skb in unix_stream_read_generic(). · 283454c8
      Kuniyuki Iwashima authored
      When we call recv() for an AF_UNIX socket, we first peek one skb and
      call manage_oob() to check if the skb is sent with MSG_OOB.
      
      However, when we fetch the next (and the following) skb, manage_oob()
      is not called now, leading to wrong behaviour.
      
      Let's say a socket send()s "hello" with MSG_OOB and the peer tries
      to recv() 5 bytes with MSG_PEEK.  Here, we should get only "hell"
      without 'o', but actually not:
      
        >>> from socket import *
        >>> c1, c2 = socketpair(AF_UNIX, SOCK_STREAM)
        >>> c1.send(b'hello', MSG_OOB)
        5
        >>> c2.recv(5, MSG_PEEK)
        b'hello'
      
      The first skb fills 4 bytes, and the next skb is peeked but not
      properly checked by manage_oob().
      
      Let's move up the again label to call manage_oob() for every skb.
      
      With this patch:
      
        >>> from socket import *
        >>> c1, c2 = socketpair(AF_UNIX, SOCK_STREAM)
        >>> c1.send(b'hello', MSG_OOB)
        5
        >>> c2.recv(5, MSG_PEEK)
        b'hell'
      
      Fixes: 314001f0 ("af_unix: Add OOB support")
      Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
      Link: https://lore.kernel.org/r/20240410171016.7621-2-kuniyu@amazon.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      283454c8
  5. 12 Apr, 2024 1 commit
    • David S. Miller's avatar
      Merge tag 'nf-24-04-11' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf · 90be7a5c
      David S. Miller authored
      netfilter pull request 24-04-11
      
      Pablo Neira Ayuso says:
      
      ====================
      Netfilter fixes for net
      
      The following patchset contains Netfilter fixes for net:
      
      Patches #1 and #2 add missing rcu read side lock when iterating over
      expression and object type list which could race with module removal.
      
      Patch #3 prevents promisc packet from visiting the bridge/input hook
      	 to amend a recent fix to address conntrack confirmation race
      	 in br_netfilter and nf_conntrack_bridge.
      
      Patch #4 adds and uses iterate decorator type to fetch the current
      	 pipapo set backend datastructure view when netlink dumps the
      	 set elements.
      
      Patch #5 fixes removal of duplicate elements in the pipapo set backend.
      
      Patch #6 flowtable validates pppoe header before accessing it.
      
      Patch #7 fixes flowtable datapath for pppoe packets, otherwise lookup
               fails and pppoe packets follow classic path.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      90be7a5c
  6. 11 Apr, 2024 14 commits
    • Linus Torvalds's avatar
      Merge tag 'net-6.9-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net · 2ae9a897
      Linus Torvalds authored
      Pull networking fixes from Paolo Abeni:
       "Including fixes from bluetooth.
      
        Current release - new code bugs:
      
         - netfilter: complete validation of user input
      
         - mlx5: disallow SRIOV switchdev mode when in multi-PF netdev
      
        Previous releases - regressions:
      
         - core: fix u64_stats_init() for lockdep when used repeatedly in one
           file
      
         - ipv6: fix race condition between ipv6_get_ifaddr and ipv6_del_addr
      
         - bluetooth: fix memory leak in hci_req_sync_complete()
      
         - batman-adv: avoid infinite loop trying to resize local TT
      
         - drv: geneve: fix header validation in geneve[6]_xmit_skb
      
         - drv: bnxt_en: fix possible memory leak in
           bnxt_rdma_aux_device_init()
      
         - drv: mlx5: offset comp irq index in name by one
      
         - drv: ena: avoid double-free clearing stale tx_info->xdpf value
      
         - drv: pds_core: fix pdsc_check_pci_health deadlock
      
        Previous releases - always broken:
      
         - xsk: validate user input for XDP_{UMEM|COMPLETION}_FILL_RING
      
         - bluetooth: fix setsockopt not validating user input
      
         - af_unix: clear stale u->oob_skb.
      
         - nfc: llcp: fix nfc_llcp_setsockopt() unsafe copies
      
         - drv: virtio_net: fix guest hangup on invalid RSS update
      
         - drv: mlx5e: Fix mlx5e_priv_init() cleanup flow
      
         - dsa: mt7530: trap link-local frames regardless of ST Port State"
      
      * tag 'net-6.9-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (59 commits)
        net: ena: Set tx_info->xdpf value to NULL
        net: ena: Fix incorrect descriptor free behavior
        net: ena: Wrong missing IO completions check order
        net: ena: Fix potential sign extension issue
        af_unix: Fix garbage collector racing against connect()
        net: dsa: mt7530: trap link-local frames regardless of ST Port State
        Revert "s390/ism: fix receive message buffer allocation"
        net: sparx5: fix wrong config being used when reconfiguring PCS
        net/mlx5: fix possible stack overflows
        net/mlx5: Disallow SRIOV switchdev mode when in multi-PF netdev
        net/mlx5e: RSS, Block XOR hash with over 128 channels
        net/mlx5e: Do not produce metadata freelist entries in Tx port ts WQE xmit
        net/mlx5e: HTB, Fix inconsistencies with QoS SQs number
        net/mlx5e: Fix mlx5e_priv_init() cleanup flow
        net/mlx5e: RSS, Block changing channels number when RXFH is configured
        net/mlx5: Correctly compare pkt reformat ids
        net/mlx5: Properly link new fs rules into the tree
        net/mlx5: offset comp irq index in name by one
        net/mlx5: Register devlink first under devlink lock
        net/mlx5: E-switch, store eswitch pointer before registering devlink_param
        ...
      2ae9a897
    • Linus Torvalds's avatar
      Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi · ab4319fd
      Linus Torvalds authored
      Pull SCSI fixes from James Bottomley:
       "The most important fix is the sg one because the regression it fixes
        (spurious warning and use after final put) is already backported to
        stable.
      
        The next biggest impact is the target fix for wrong credentials used
        to load a module because it's affecting new kernels installed on
        selinux based distributions.
      
        The other three fixes are an obvious off by one and SATA protocol
        issues"
      
      * tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
        scsi: qla2xxx: Fix off by one in qla_edif_app_getstats()
        scsi: hisi_sas: Modify the deadline for ata_wait_after_reset()
        scsi: hisi_sas: Handle the NCQ error returned by D2H frame
        scsi: target: Fix SELinux error when systemd-modules loads the target module
        scsi: sg: Avoid race in error handling & drop bogus warn
      ab4319fd
    • Linus Torvalds's avatar
      Merge tag 'loongarch-fixes-6.9-1' of... · 5de6b467
      Linus Torvalds authored
      Merge tag 'loongarch-fixes-6.9-1' of git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson
      
      Pull LoongArch fixes from Huacai Chen:
      
       - make {virt, phys, page, pfn} translation work with KFENCE for
         LoongArch (otherwise NVMe and virtio-blk cannot work with KFENCE
         enabled)
      
       - update dts files for Loongson-2K series to make devices work
         correctly
      
       - fix a build error
      
      * tag 'loongarch-fixes-6.9-1' of git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson:
        LoongArch: Include linux/sizes.h in addrspace.h to prevent build errors
        LoongArch: Update dts for Loongson-2K2000 to support GMAC/GNET
        LoongArch: Update dts for Loongson-2K2000 to support PCI-MSI
        LoongArch: Update dts for Loongson-2K2000 to support ISA/LPC
        LoongArch: Update dts for Loongson-2K1000 to support ISA/LPC
        LoongArch: Make virt_addr_valid()/__virt_addr_valid() work with KFENCE
        LoongArch: Make {virt, phys, page, pfn} translation work with KFENCE
        mm: Move lowmem_page_address() a little later
      5de6b467
    • Linus Torvalds's avatar
      Merge tag 'bcachefs-2024-04-10' of https://evilpiepirate.org/git/bcachefs · e1dc191d
      Linus Torvalds authored
      Pull more bcachefs fixes from Kent Overstreet:
       "Notable user impacting bugs
      
         - On multi device filesystems, recovery was looping in
           btree_trans_too_many_iters(). This checks if a transaction has
           touched too many btree paths (because of iteration over many keys),
           and issues a restart to drop unneeded paths.
      
           But it's now possible for some paths to exceed the previous limit
           without iteration in the interior btree update path, since the
           transaction commit will do alloc updates for every old and new
           btree node, and during journal replay we don't use the btree write
           buffer for locking reasons and thus those updates use btree paths
           when they wouldn't normally.
      
         - Fix a corner case in rebalance when moving extents on a
           durability=0 device. This wouldn't be hit when a device was
           formatted with durability=0 since in that case we'll only use it as
           a write through cache (only cached extents will live on it), but
           durability can now be changed on an existing device.
      
         - bch2_get_acl() could rarely forget to handle a transaction restart;
           this manifested as the occasional missing acl that came back after
           dropping caches.
      
         - Fix a major performance regression on high iops multithreaded write
           workloads (only since 6.9-rc1); a previous fix for a deadlock in
           the interior btree update path to check the journal watermark
           introduced a dependency on the state of btree write buffer flushing
           that we didn't want.
      
         - Assorted other repair paths and recovery fixes"
      
      * tag 'bcachefs-2024-04-10' of https://evilpiepirate.org/git/bcachefs: (25 commits)
        bcachefs: Fix __bch2_btree_and_journal_iter_init_node_iter()
        bcachefs: Kill read lock dropping in bch2_btree_node_lock_write_nofail()
        bcachefs: Fix a race in btree_update_nodes_written()
        bcachefs: btree_node_scan: Respect member.data_allowed
        bcachefs: Don't scan for btree nodes when we can reconstruct
        bcachefs: Fix check_topology() when using node scan
        bcachefs: fix eytzinger0_find_gt()
        bcachefs: fix bch2_get_acl() transaction restart handling
        bcachefs: fix the count of nr_freed_pcpu after changing bc->freed_nonpcpu list
        bcachefs: Fix gap buffer bug in bch2_journal_key_insert_take()
        bcachefs: Rename struct field swap to prevent macro naming collision
        MAINTAINERS: Add entry for bcachefs documentation
        Documentation: filesystems: Add bcachefs toctree
        bcachefs: JOURNAL_SPACE_LOW
        bcachefs: Disable errors=panic for BCH_IOCTL_FSCK_OFFLINE
        bcachefs: Fix BCH_IOCTL_FSCK_OFFLINE for encrypted filesystems
        bcachefs: fix rand_delete unit test
        bcachefs: fix ! vs ~ typo in __clear_bit_le64()
        bcachefs: Fix rebalance from durability=0 device
        bcachefs: Print shutdown journal sequence number
        ...
      e1dc191d
    • Linus Torvalds's avatar
      Merge tag 'tag-chrome-platform-fixes-for-v6.9-rc4' of... · 346668f0
      Linus Torvalds authored
      Merge tag 'tag-chrome-platform-fixes-for-v6.9-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/chrome-platform/linux
      
      Pull chrome platform fix from Tzung-Bi Shih:
       "Fix a NULL pointer dereference"
      
      * tag 'tag-chrome-platform-fixes-for-v6.9-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/chrome-platform/linux:
        platform/chrome: cros_ec_uart: properly fix race condition
      346668f0
    • Pablo Neira Ayuso's avatar
      netfilter: flowtable: incorrect pppoe tuple · 6db5dc7b
      Pablo Neira Ayuso authored
      pppoe traffic reaching ingress path does not match the flowtable entry
      because the pppoe header is expected to be at the network header offset.
      This bug causes a mismatch in the flow table lookup, so pppoe packets
      enter the classical forwarding path.
      
      Fixes: 72efd585 ("netfilter: flowtable: add pppoe support")
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      6db5dc7b
    • Pablo Neira Ayuso's avatar
      netfilter: flowtable: validate pppoe header · 87b3593b
      Pablo Neira Ayuso authored
      Ensure there is sufficient room to access the protocol field of the
      PPPoe header. Validate it once before the flowtable lookup, then use a
      helper function to access protocol field.
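
A userspace sketch of the idea, assuming the standard PPPoE session layout: a 6-byte PPPoE header followed by the 2-byte PPP protocol field. The function `pppoe_proto` is hypothetical and is not the helper this patch adds. The constant names are borrowed from the kernel headers for readability only.

```python
import struct

PPPOE_SES_HLEN = 8  # 6-byte PPPoE header + 2-byte PPP protocol field
PPP_IP = 0x0021     # PPP protocol number for IPv4
PPP_IPV6 = 0x0057   # PPP protocol number for IPv6


def pppoe_proto(frame: bytes):
    """Return the PPP protocol field, or None if the buffer is too short.

    The length check comes first so the 2-byte read at offset 6 can never
    run past the end of the buffer.
    """
    if len(frame) < PPPOE_SES_HLEN:
        return None
    (proto,) = struct.unpack_from("!H", frame, 6)
    return proto


# ver/type, code, session id, payload length, PPP protocol
header = struct.pack("!BBHHH", 0x11, 0x00, 1, 2, PPP_IP)
print(pppoe_proto(header) == PPP_IP)  # True
print(pppoe_proto(header[:7]))        # None
```

Doing the length check once, ahead of any lookup, keeps later accesses to the protocol field simple because they can rely on the header being present.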
      
      Reported-by: syzbot+b6f07e1c07ef40199081@syzkaller.appspotmail.com
      Fixes: 72efd585 ("netfilter: flowtable: add pppoe support")
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      87b3593b
    • Florian Westphal's avatar
      netfilter: nft_set_pipapo: do not free live element · 3cfc9ec0
      Florian Westphal authored
      Pablo reports a crash with large batches of elements with a
      back-to-back add/remove pattern.  Quoting Pablo:
      
        add_elem("00000000") timeout 100 ms
        ...
        add_elem("0000000X") timeout 100 ms
        del_elem("0000000X") <---------------- delete one that was just added
        ...
        add_elem("00005000") timeout 100 ms
      
        1) nft_pipapo_remove() removes element 0000000X
        Then, KASAN shows a splat.
      
      Looking at the remove function there is a chance that we will drop a
      rule that maps to a non-deactivated element.
      
      Removal happens in two steps, first we do a lookup for key k and return the
      to-be-removed element and mark it as inactive in the next generation.
      Then, in a second step, the element gets removed from the set/map.
      
      The _remove function does not work correctly if we have more than one
      element that share the same key.
      
      This can happen if we insert an element into a set when the set already
      holds an element with same key, but the element mapping to the existing
      key has timed out or is not active in the next generation.
      
      In such a case it's possible that removal will unmap the wrong element.
      If this happens, we will leak the non-deactivated element: it becomes
      unreachable.
      
      The element that got deactivated (and will be freed later) will
      remain reachable in the set data structure, this can result in
      a crash when such an element is retrieved during lookup (stale
      pointer).
      
      Add a check that the fully matching key does in fact map to the element
      that we have marked as inactive in the deactivation step.
      If not, we need to continue searching.
      
      Add a bug/warn trap at the end of the function as well, the remove
      function must not ever be called with an invisible/unreachable/non-existent
      element.
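
The difference between matching on the key alone and also checking element identity can be shown with a toy model. This is not the pipapo code: `Elem`, `remove_by_key` and `remove_exact` are invented for illustration, and a plain list stands in for the set's rule mapping.

```python
class Elem:
    """Toy set element; two elements may share the same key."""

    def __init__(self, key, name):
        self.key = key
        self.name = name


def remove_by_key(rules, target):
    """Model of the problem: drop the first rule whose key matches."""
    for i, elem in enumerate(rules):
        if elem.key == target.key:
            return rules.pop(i)
    raise AssertionError("element not found")


def remove_exact(rules, target):
    """Model of the fix: the key must match AND map to the same element."""
    for i, elem in enumerate(rules):
        if elem.key == target.key and elem is target:
            return rules.pop(i)
    # Stand-in for the bug/warn trap: callers must pass a reachable element.
    raise AssertionError("element not found")


other = Elem("k", "other")    # same key, not the one that was deactivated
target = Elem("k", "target")  # the element marked inactive in step one
rules = [other, target]

print(remove_by_key(list(rules), target).name)  # other
print(remove_exact(list(rules), target).name)   # target
```

In the key-only variant the wrong element is unmapped while the deactivated one stays reachable, which corresponds to the leak and stale pointer described above.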
      
      v2: avoid unneeded temporary variable (Stefano)
      
      Fixes: 3c4287f6 ("nf_tables: Add set type for arbitrary concatenation of ranges")
      Reported-by: Pablo Neira Ayuso <pablo@netfilter.org>
      Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      3cfc9ec0
    • Pablo Neira Ayuso's avatar
      netfilter: nft_set_pipapo: walk over current view on netlink dump · 29b359cf
      Pablo Neira Ayuso authored
      The generation mask can be updated while netlink dump is in progress.
      The pipapo set backend walk iterator cannot rely on it to infer what
      view of the datastructure is to be used. Add notation to specify if user
      wants to read/update the set.
      
      Based on patch from Florian Westphal.
      
      Fixes: 2b84e215 ("netfilter: nft_set_pipapo: .walk does not deal with generations")
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      29b359cf
    • Pablo Neira Ayuso's avatar
      netfilter: br_netfilter: skip conntrack input hook for promisc packets · 751de201
      Pablo Neira Ayuso authored
      For historical reasons, when bridge device is in promisc mode, packets
      that are directed to the taps follow bridge input hook path. This patch
      adds a workaround to reset conntrack for these packets.
      
      Jianbo Liu reports warning splats in their test infrastructure where
      cloned packets reach the br_netfilter input hook to confirm the
      conntrack object.
      
      Scratch one bit from BR_INPUT_SKB_CB to annotate that this packet has
      reached the input hook because it is passed up to the bridge device to
      reach the taps.
      
      [   57.571874] WARNING: CPU: 1 PID: 0 at net/bridge/br_netfilter_hooks.c:616 br_nf_local_in+0x157/0x180 [br_netfilter]
      [   57.572749] Modules linked in: xt_MASQUERADE nf_conntrack_netlink nfnetlink iptable_nat xt_addrtype xt_conntrack nf_nat br_netfilter rpcsec_gss_krb5 auth_rpcgss oid_registry overlay rpcrdma rdma_ucm ib_iser libiscsi scsi_transport_isc si ib_umad rdma_cm ib_ipoib iw_cm ib_cm mlx5_ib ib_uverbs ib_core mlx5ctl mlx5_core
      [   57.575158] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 6.8.0+ #19
      [   57.575700] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
      [   57.576662] RIP: 0010:br_nf_local_in+0x157/0x180 [br_netfilter]
      [   57.577195] Code: fe ff ff 41 bd 04 00 00 00 be 04 00 00 00 e9 4a ff ff ff be 04 00 00 00 48 89 ef e8 f3 a9 3c e1 66 83 ad b4 00 00 00 04 eb 91 <0f> 0b e9 f1 fe ff ff 0f 0b e9 df fe ff ff 48 89 df e8 b3 53 47 e1
      [   57.578722] RSP: 0018:ffff88885f845a08 EFLAGS: 00010202
      [   57.579207] RAX: 0000000000000002 RBX: ffff88812dfe8000 RCX: 0000000000000000
      [   57.579830] RDX: ffff88885f845a60 RSI: ffff8881022dc300 RDI: 0000000000000000
      [   57.580454] RBP: ffff88885f845a60 R08: 0000000000000001 R09: 0000000000000003
      [   57.581076] R10: 00000000ffff1300 R11: 0000000000000002 R12: 0000000000000000
      [   57.581695] R13: ffff8881047ffe00 R14: ffff888108dbee00 R15: ffff88814519b800
      [   57.582313] FS:  0000000000000000(0000) GS:ffff88885f840000(0000) knlGS:0000000000000000
      [   57.583040] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [   57.583564] CR2: 000000c4206aa000 CR3: 0000000103847001 CR4: 0000000000370eb0
      [   57.584194] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [   57.584820] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [   57.585440] Call Trace:
      [   57.585721]  <IRQ>
      [   57.585976]  ? __warn+0x7d/0x130
      [   57.586323]  ? br_nf_local_in+0x157/0x180 [br_netfilter]
      [   57.586811]  ? report_bug+0xf1/0x1c0
      [   57.587177]  ? handle_bug+0x3f/0x70
      [   57.587539]  ? exc_invalid_op+0x13/0x60
      [   57.587929]  ? asm_exc_invalid_op+0x16/0x20
      [   57.588336]  ? br_nf_local_in+0x157/0x180 [br_netfilter]
      [   57.588825]  nf_hook_slow+0x3d/0xd0
      [   57.589188]  ? br_handle_vlan+0x4b/0x110
      [   57.589579]  br_pass_frame_up+0xfc/0x150
      [   57.589970]  ? br_port_flags_change+0x40/0x40
      [   57.590396]  br_handle_frame_finish+0x346/0x5e0
      [   57.590837]  ? ipt_do_table+0x32e/0x430
      [   57.591221]  ? br_handle_local_finish+0x20/0x20
      [   57.591656]  br_nf_hook_thresh+0x4b/0xf0 [br_netfilter]
      [   57.592286]  ? br_handle_local_finish+0x20/0x20
      [   57.592802]  br_nf_pre_routing_finish+0x178/0x480 [br_netfilter]
      [   57.593348]  ? br_handle_local_finish+0x20/0x20
      [   57.593782]  ? nf_nat_ipv4_pre_routing+0x25/0x60 [nf_nat]
      [   57.594279]  br_nf_pre_routing+0x24c/0x550 [br_netfilter]
      [   57.594780]  ? br_nf_hook_thresh+0xf0/0xf0 [br_netfilter]
      [   57.595280]  br_handle_frame+0x1f3/0x3d0
      [   57.595676]  ? br_handle_local_finish+0x20/0x20
      [   57.596118]  ? br_handle_frame_finish+0x5e0/0x5e0
      [   57.596566]  __netif_receive_skb_core+0x25b/0xfc0
      [   57.597017]  ? __napi_build_skb+0x37/0x40
      [   57.597418]  __netif_receive_skb_list_core+0xfb/0x220
      
      Fixes: 62e7151a ("netfilter: bridge: confirm multicast packets before passing them up the stack")
      Reported-by: Jianbo Liu <jianbol@nvidia.com>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      751de201
    • Ziyang Xuan's avatar
      netfilter: nf_tables: Fix potential data-race in __nft_obj_type_get() · d78d867d
      Ziyang Xuan authored
      nft_unregister_obj() can run concurrently with __nft_obj_type_get(),
      and there is no protection when iterating over the nf_tables_objects
      list in __nft_obj_type_get(). Therefore, there is a potential data race
      on nf_tables_objects list entries.
      
      Use list_for_each_entry_rcu() to iterate over nf_tables_objects
      list in __nft_obj_type_get(), and use rcu_read_lock() in the caller
      nft_obj_type_get() to protect the entire type query process.
      
      Fixes: e5009240 ("netfilter: nf_tables: add stateful objects")
      Signed-off-by: Ziyang Xuan <william.xuanziyang@huawei.com>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      d78d867d
    • Ziyang Xuan's avatar
      netfilter: nf_tables: Fix potential data-race in __nft_expr_type_get() · f969eb84
      Ziyang Xuan authored
      nft_unregister_expr() can run concurrently with __nft_expr_type_get(),
      and there is no protection when iterating over the nf_tables_expressions
      list in __nft_expr_type_get(). Therefore, there is a potential data race
      on nf_tables_expressions list entries.
      
      Use list_for_each_entry_rcu() to iterate over nf_tables_expressions
      list in __nft_expr_type_get(), and use rcu_read_lock() in the caller
      nft_expr_type_get() to protect the entire type query process.
      
      Fixes: ef1f7df9 ("netfilter: nf_tables: expression ops overloading")
      Signed-off-by: Ziyang Xuan <william.xuanziyang@huawei.com>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      f969eb84
    • Paolo Abeni's avatar
      Merge branch 'ena-driver-bug-fixes' · 4e1ad31c
      Paolo Abeni authored
      David Arinzon says:
      
      ====================
      ENA driver bug fixes
      
      From: David Arinzon <darinzon@amazon.com>
      
      This patchset contains multiple bug fixes for the
      ENA driver.
      ====================
      
      Link: https://lore.kernel.org/r/20240410091358.16289-1-darinzon@amazon.com
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      4e1ad31c
    • David Arinzon's avatar
      net: ena: Set tx_info->xdpf value to NULL · 36a1ca01
      David Arinzon authored
      The patch mentioned in the `Fixes` tag removed the explicit assignment
      of tx_info->xdpf to NULL with the justification that there's no need
      to set tx_info->xdpf to NULL and tx_info->num_of_bufs to 0 in case
      of a mapping error. Both values won't be used once the mapping function
      returns an error, and their values would be overridden by the next
      transmitted packet.
      
      While both values do indeed get overridden in the next transmission
      call, the value of tx_info->xdpf is also used to check whether a TX
      descriptor's transmission has been completed (i.e. a completion for it
      was polled).
      
      An example scenario:
      1. Mapping failed, tx_info->xdpf wasn't set to NULL
      2. A VF reset occurred leading to IO resource destruction and
         a call to ena_free_tx_bufs() function
      3. Although the descriptor whose mapping failed was freed by the
         transmission function, it still passes the check
           if (!tx_info->skb)
      
         (skb and xdp_frame are in a union)
      4. The xdp_frame associated with the descriptor is freed twice
      
      This patch restores the assignment of NULL to tx_info->xdpf so that the
      cleaning function knows that the descriptor is already freed.
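
The scenario can be modelled in a few lines. This is a simplified sketch and not driver code: `TxInfo`, `simulate` and the single `buf` field (standing in for the skb/xdp_frame union) are assumptions made for illustration.

```python
class TxInfo:
    """Hypothetical stand-in for a TX descriptor's bookkeeping entry."""

    def __init__(self):
        self.buf = None  # models the skb/xdp_frame union


def simulate(clear_on_error):
    """Return how many times the frame is freed in the scenario above."""
    frees = 0
    tx_info = TxInfo()

    # Step 1: transmit records the frame, mapping fails, and the transmit
    # path frees the frame.
    tx_info.buf = "xdp_frame"
    frees += 1
    if clear_on_error:
        tx_info.buf = None  # the assignment this patch restores

    # Steps 2-4: reset-time cleanup frees every entry that still looks
    # in use, i.e. whose pointer is not NULL.
    if tx_info.buf is not None:
        frees += 1
        tx_info.buf = None
    return frees


print(simulate(clear_on_error=False))  # 2
print(simulate(clear_on_error=True))   # 1
```

Without the NULL assignment the model counts two frees for one frame. With it the cleanup skips the entry.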
      
      Fixes: 504fd6a5 ("net: ena: fix DMA mapping function issues in XDP")
      Signed-off-by: Shay Agroskin <shayagr@amazon.com>
      Signed-off-by: David Arinzon <darinzon@amazon.com>
      Reviewed-by: Shannon Nelson <shannon.nelson@amd.com>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      36a1ca01