1. 25 Jan, 2024 10 commits
    • i40e: set xdp_rxq_info::frag_size · a045d2f2
      Maciej Fijalkowski authored
      i40e supports XDP multi-buffer, so it is supposed to use
      __xdp_rxq_info_reg() instead of xdp_rxq_info_reg() and set the
      frag_size. It cannot simply be converted at the existing callsite because
      rx_buf_len could be uninitialized, so let us register xdp_rxq_info
      within i40e_configure_rx_ring(), which happens to be called with an
      already initialized rx_buf_len value.
      
      Commit 5180ff13 ("i40e: use int for i40e_status") converted 'err' to
      int, so two variables to deal with return codes are not needed within
      i40e_configure_rx_ring(). Remove 'ret' and use 'err' to handle status
      from xdp_rxq_info registration.
      
      Fixes: e213ced1 ("i40e: add support for XDP multi-buffer Rx")
      Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
      Link: https://lore.kernel.org/r/20240124191602.566724-11-maciej.fijalkowski@intel.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • xdp: reflect tail increase for MEM_TYPE_XSK_BUFF_POOL · fbadd83a
      Maciej Fijalkowski authored
      XSK ZC Rx path calculates the size of data that will be posted to XSK Rx
      queue by subtracting xdp_buff::data from xdp_buff::data_end.
      
      In bpf_xdp_frags_increase_tail(), when underlying memory type of
      xdp_rxq_info is MEM_TYPE_XSK_BUFF_POOL, add offset to data_end in tail
      fragment, so that later on user space will be able to take into account
      the amount of bytes added by XDP program.
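As a rough illustration of why data_end has to move, here is a minimal user-space sketch in C. The struct and function names are hypothetical stand-ins, not the kernel's xdp_buff or the actual helper; the only thing taken from the commit message is that the Rx length is derived from data and data_end.

```c
#include <stddef.h>

/* Hypothetical model of the two pointers that matter here. */
struct buf_model {
    unsigned char *data;     /* start of packet data */
    unsigned char *data_end; /* one past the last valid byte */
};

/* Length as the Rx path would see it: data_end minus data. */
static size_t rx_len_model(const struct buf_model *b)
{
    return (size_t)(b->data_end - b->data);
}

/* Growing the tail only becomes visible to rx_len_model() if data_end moves. */
static void grow_tail_model(struct buf_model *b, size_t offset)
{
    b->data_end += offset;
}
```

If data_end were left untouched after the tail grew, rx_len_model() would keep reporting the old length, which is the user-visible symptom described above.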
      
      Fixes: 24ea5012 ("xsk: support mbuf on ZC RX")
      Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
      Link: https://lore.kernel.org/r/20240124191602.566724-10-maciej.fijalkowski@intel.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • ice: update xdp_rxq_info::frag_size for ZC enabled Rx queue · 3de38c87
      Maciej Fijalkowski authored
      Now that ice driver correctly sets up frag_size in xdp_rxq_info, let us
      make it work for ZC multi-buffer as well. ice_rx_ring::rx_buf_len for ZC
      is being set via xsk_pool_get_rx_frame_size() and this needs to be
      propagated up to xdp_rxq_info.
      
      Use a bigger hammer and instead of unregistering only xdp_rxq_info's
      memory model, unregister it altogether and register it again and have
      xdp_rxq_info with correct frag_size value.
      
      Fixes: 1bbc04de ("ice: xsk: add RX multi-buffer support")
      Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
      Link: https://lore.kernel.org/r/20240124191602.566724-9-maciej.fijalkowski@intel.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • intel: xsk: initialize skb_frag_t::bv_offset in ZC drivers · 29077990
      Maciej Fijalkowski authored
      Ice and i40e ZC drivers currently set offset of a frag within
      skb_shared_info to 0, which is incorrect. xdp_buffs that come from
      xsk_buff_pool always have 256 bytes of a headroom, so they need to be
      taken into account to retrieve xdp_buff::data via skb_frag_address().
      Otherwise, bpf_xdp_frags_increase_tail() would be starting its job from
      xdp_buff::data_hard_start which would result in overwriting existing
      payload.
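The effect of a zero offset can be sketched in plain C. This is a simplified model under stated assumptions: frag_model and frag_address_model() are hypothetical names standing in for skb_frag_t and skb_frag_address(), and the 256-byte headroom is the figure quoted in the message above.

```c
#include <stddef.h>

#define HEADROOM_MODEL 256u /* headroom size quoted in the commit message */

/* Hypothetical stand-in for a fragment descriptor: base pointer plus offset. */
struct frag_model {
    unsigned char *base;  /* plays the role of data_hard_start */
    unsigned int offset;  /* plays the role of bv_offset */
};

/* Address of the fragment payload, computed as base + offset. */
static unsigned char *frag_address_model(const struct frag_model *f)
{
    return f->base + f->offset;
}
```

With offset 0 the computed address is the very start of the buffer, so anything writing "at the fragment" would land in the headroom region and then run into the payload; with offset 256 it lands on the payload start.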
      
      Fixes: 1c9ba9c1 ("i40e: xsk: add RX multi-buffer support")
      Fixes: 1bbc04de ("ice: xsk: add RX multi-buffer support")
      Acked-by: Magnus Karlsson <magnus.karlsson@intel.com>
      Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
      Link: https://lore.kernel.org/r/20240124191602.566724-8-maciej.fijalkowski@intel.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • ice: remove redundant xdp_rxq_info registration · 2ee788c0
      Maciej Fijalkowski authored
      xdp_rxq_info struct can be registered by drivers via two functions -
      xdp_rxq_info_reg() and __xdp_rxq_info_reg(). The latter one allows
      drivers that support XDP multi-buffer to set up xdp_rxq_info::frag_size
      which in turn will make it possible to grow the packet via
      bpf_xdp_adjust_tail() BPF helper.
      
      Currently, ice registers xdp_rxq_info in two spots:
      1) ice_setup_rx_ring() // via xdp_rxq_info_reg(), BUG
      2) ice_vsi_cfg_rxq()   // via __xdp_rxq_info_reg(), OK
      
      Cited commit under fixes tag took care of setting up frag_size and
      updated registration scheme in 2) but it did not help as
      1) is called before 2) and as shown above it uses old registration
      function. This means that 2) sees that xdp_rxq_info is already
      registered and never calls __xdp_rxq_info_reg() which leaves us with
      xdp_rxq_info::frag_size being set to 0.
      
      To fix this misbehavior, simply remove xdp_rxq_info_reg() call from
      ice_setup_rx_ring().
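The ordering problem can be modeled in a few lines of C. Everything below is a hypothetical user-space sketch: the struct and the *_model functions only mimic the "register once, skip if already registered" behavior described above and are not the driver's code.

```c
#include <stdbool.h>

/* Hypothetical model of the registration state kept per Rx queue. */
struct rxq_info_model {
    bool registered;
    unsigned int frag_size;
};

/* Models the older registration call: no frag_size is recorded. */
static void reg_legacy_model(struct rxq_info_model *q)
{
    q->registered = true;
    q->frag_size = 0;
}

/* Models the newer registration call that also records frag_size. */
static void reg_with_frag_size_model(struct rxq_info_model *q, unsigned int frag_size)
{
    q->registered = true;
    q->frag_size = frag_size;
}

/* Step 1): runs first; the early legacy registration is the bug. */
static void setup_rx_ring_model(struct rxq_info_model *q, bool early_legacy_reg)
{
    if (early_legacy_reg)
        reg_legacy_model(q);
}

/* Step 2): registers only if nothing has been registered yet. */
static void vsi_cfg_rxq_model(struct rxq_info_model *q, unsigned int frag_size)
{
    if (!q->registered)
        reg_with_frag_size_model(q, frag_size);
}

/* Runs both steps in order and reports the resulting frag_size. */
static unsigned int configure_queue_model(bool early_legacy_reg, unsigned int frag_size)
{
    struct rxq_info_model q = { .registered = false, .frag_size = 0 };

    setup_rx_ring_model(&q, early_legacy_reg);
    vsi_cfg_rxq_model(&q, frag_size);
    return q.frag_size;
}
```

With the early registration in place the model ends up with frag_size 0 regardless of what step 2) wanted to set; dropping it lets step 2) record the real value.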
      
      Fixes: 2fba7dc5 ("ice: Add support for XDP multi-buffer on Rx side")
      Acked-by: Magnus Karlsson <magnus.karlsson@intel.com>
      Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
      Link: https://lore.kernel.org/r/20240124191602.566724-7-maciej.fijalkowski@intel.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • i40e: handle multi-buffer packets that are shrunk by xdp prog · 83014323
      Tirthendu Sarkar authored
      XDP programs can shrink packets by calling the bpf_xdp_adjust_tail()
      helper function. For multi-buffer packets this may lead to reduction of
      frag count stored in skb_shared_info area of the xdp_buff struct. This
      results in issues with the current handling of XDP_PASS and XDP_DROP
      cases.
      
      For XDP_PASS, currently skb is being built using frag count of
      xdp_buffer before it was processed by XDP prog and thus will result in
      an inconsistent skb when frag count gets reduced by XDP prog. To fix
      this, get correct frag count while building the skb instead of using
      pre-obtained frag count.
      
      For XDP_DROP, current page recycling logic will not reuse the page but
      instead will adjust the pagecnt_bias so that the page can be freed. This
      again results in inconsistent behavior as the page refcnt has already
      been changed by the helper while freeing the frag(s) as part of
      shrinking the packet. To fix this, only adjust pagecnt_bias for buffers
      that are still part of the packet post-xdp prog run.
      
      Fixes: e213ced1 ("i40e: add support for XDP multi-buffer Rx")
      Reported-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
      Signed-off-by: Tirthendu Sarkar <tirthendu.sarkar@intel.com>
      Link: https://lore.kernel.org/r/20240124191602.566724-6-maciej.fijalkowski@intel.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • ice: work on pre-XDP prog frag count · ad2047cf
      Maciej Fijalkowski authored
      Fix an OOM panic in XDP_DRV mode when an XDP program shrinks a
      multi-buffer packet by 4k bytes and then redirects it to an AF_XDP
      socket.
      
      Since support for handling multi-buffer frames was added to XDP, usage
      of bpf_xdp_adjust_tail() helper within XDP program can free the page
      that given fragment occupies and in turn decrease the fragment count
      within skb_shared_info that is embedded in xdp_buff struct. In current
      ice driver codebase, it can become problematic when page recycling logic
      decides not to reuse the page. In such case, __page_frag_cache_drain()
      is used with ice_rx_buf::pagecnt_bias that was not adjusted after
      refcount of page was changed by XDP prog which in turn does not drain
      the refcount to 0 and page is never freed.
      
      To address this, let us store the count of frags before the XDP program
      was executed on Rx ring struct. This will be used to compare with
      current frag count from skb_shared_info embedded in xdp_buff. A smaller
      value in the latter indicates that XDP prog freed frag(s). Then, for
      given delta decrement pagecnt_bias for XDP_DROP verdict.
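The before/after comparison amounts to a small piece of arithmetic, sketched here in C. The function name is hypothetical and the sketch covers only the delta computation, not the pagecnt_bias bookkeeping done by the driver.

```c
/*
 * Hypothetical helper: number of fragments the XDP program released,
 * given the frag count saved before the program ran and the count
 * read back afterwards.
 */
static unsigned int frags_freed_by_prog_model(unsigned int nr_frags_before,
                                              unsigned int nr_frags_after)
{
    if (nr_frags_after < nr_frags_before)
        return nr_frags_before - nr_frags_after;
    return 0; /* nothing was freed (or the packet grew) */
}
```

A non-zero result identifies how many trailing buffers need the adjusted handling described above.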
      
      While at it, let us also handle the EOP frag within
      ice_set_rx_bufs_act() to make our life easier, so all of the adjustments
      needed to be applied against freed frags are performed in the single
      place.
      
      Fixes: 2fba7dc5 ("ice: Add support for XDP multi-buffer on Rx side")
      Acked-by: Magnus Karlsson <magnus.karlsson@intel.com>
      Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
      Link: https://lore.kernel.org/r/20240124191602.566724-5-maciej.fijalkowski@intel.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • xsk: fix usage of multi-buffer BPF helpers for ZC XDP · c5114710
      Maciej Fijalkowski authored
      Currently when packet is shrunk via bpf_xdp_adjust_tail() and memory
      type is set to MEM_TYPE_XSK_BUFF_POOL, null ptr dereference happens:
      
      [1136314.192256] BUG: kernel NULL pointer dereference, address:
      0000000000000034
      [1136314.203943] #PF: supervisor read access in kernel mode
      [1136314.213768] #PF: error_code(0x0000) - not-present page
      [1136314.223550] PGD 0 P4D 0
      [1136314.230684] Oops: 0000 [#1] PREEMPT SMP NOPTI
      [1136314.239621] CPU: 8 PID: 54203 Comm: xdpsock Not tainted 6.6.0+ #257
      [1136314.250469] Hardware name: Intel Corporation S2600WFT/S2600WFT,
      BIOS SE5C620.86B.02.01.0008.031920191559 03/19/2019
      [1136314.265615] RIP: 0010:__xdp_return+0x6c/0x210
      [1136314.274653] Code: ad 00 48 8b 47 08 49 89 f8 a8 01 0f 85 9b 01 00 00 0f 1f 44 00 00 f0 41 ff 48 34 75 32 4c 89 c7 e9 79 cd 80 ff 83 fe 03 75 17 <f6> 41 34 01 0f 85 02 01 00 00 48 89 cf e9 22 cc 1e 00 e9 3d d2 86
      [1136314.302907] RSP: 0018:ffffc900089f8db0 EFLAGS: 00010246
      [1136314.312967] RAX: ffffc9003168aed0 RBX: ffff8881c3300000 RCX:
      0000000000000000
      [1136314.324953] RDX: 0000000000000000 RSI: 0000000000000003 RDI:
      ffffc9003168c000
      [1136314.336929] RBP: 0000000000000ae0 R08: 0000000000000002 R09:
      0000000000010000
      [1136314.348844] R10: ffffc9000e495000 R11: 0000000000000040 R12:
      0000000000000001
      [1136314.360706] R13: 0000000000000524 R14: ffffc9003168aec0 R15:
      0000000000000001
      [1136314.373298] FS:  00007f8df8bbcb80(0000) GS:ffff8897e0e00000(0000)
      knlGS:0000000000000000
      [1136314.386105] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [1136314.396532] CR2: 0000000000000034 CR3: 00000001aa912002 CR4:
      00000000007706f0
      [1136314.408377] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
      0000000000000000
      [1136314.420173] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
      0000000000000400
      [1136314.431890] PKRU: 55555554
      [1136314.439143] Call Trace:
      [1136314.446058]  <IRQ>
      [1136314.452465]  ? __die+0x20/0x70
      [1136314.459881]  ? page_fault_oops+0x15b/0x440
      [1136314.468305]  ? exc_page_fault+0x6a/0x150
      [1136314.476491]  ? asm_exc_page_fault+0x22/0x30
      [1136314.484927]  ? __xdp_return+0x6c/0x210
      [1136314.492863]  bpf_xdp_adjust_tail+0x155/0x1d0
      [1136314.501269]  bpf_prog_ccc47ae29d3b6570_xdp_sock_prog+0x15/0x60
      [1136314.511263]  ice_clean_rx_irq_zc+0x206/0xc60 [ice]
      [1136314.520222]  ? ice_xmit_zc+0x6e/0x150 [ice]
      [1136314.528506]  ice_napi_poll+0x467/0x670 [ice]
      [1136314.536858]  ? ttwu_do_activate.constprop.0+0x8f/0x1a0
      [1136314.546010]  __napi_poll+0x29/0x1b0
      [1136314.553462]  net_rx_action+0x133/0x270
      [1136314.561619]  __do_softirq+0xbe/0x28e
      [1136314.569303]  do_softirq+0x3f/0x60
      
      This comes from __xdp_return() call with xdp_buff argument passed as
      NULL which is supposed to be consumed by xsk_buff_free() call.
      
      To address this properly, in ZC case, a node that represents the frag
      being removed has to be pulled out of xskb_list. Introduce
      appropriate xsk helpers to do such node operation and use them
      accordingly within bpf_xdp_adjust_tail().
      
      Fixes: 24ea5012 ("xsk: support mbuf on ZC RX")
      Acked-by: Magnus Karlsson <magnus.karlsson@intel.com> # For the xsk header part
      Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
      Link: https://lore.kernel.org/r/20240124191602.566724-4-maciej.fijalkowski@intel.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • xsk: make xsk_buff_pool responsible for clearing xdp_buff::flags · f7f6aa8e
      Maciej Fijalkowski authored
      XDP multi-buffer support introduced XDP_FLAGS_HAS_FRAGS flag that is
      used by drivers to notify data path whether xdp_buff contains fragments
      or not. Data path looks up mentioned flag on first buffer that occupies
      the linear part of xdp_buff, so drivers only modify it there. This is
      sufficient for SKB and XDP_DRV modes as usually xdp_buff is allocated on
      stack or it resides within struct representing driver's queue and
      fragments are carried via skb_frag_t structs. IOW, we are dealing with
      only one xdp_buff.
      
      ZC mode though relies on list of xdp_buff structs that is carried via
      xsk_buff_pool::xskb_list, so ZC data path has to make sure that
      fragments do *not* have XDP_FLAGS_HAS_FRAGS set. Otherwise,
      xsk_buff_free() could misbehave if it would be executed against xdp_buff
      that carries a frag with XDP_FLAGS_HAS_FRAGS flag set. Such scenario can
      take place when within supplied XDP program bpf_xdp_adjust_tail() is
      used with negative offset that would in turn release the tail fragment
      from multi-buffer frame.
      
      Calling xsk_buff_free() on tail fragment with XDP_FLAGS_HAS_FRAGS would
      result in releasing all the nodes from xskb_list that were produced by
      driver before XDP program execution, which is not what is intended -
      only tail fragment should be deleted from xskb_list and then it should
      be put onto xsk_buff_pool::free_list. Such multi-buffer frame will never
      make it up to user space, so from AF_XDP application POV there would be
      no traffic running, however due to free_list getting constantly new
      nodes, driver will be able to feed HW Rx queue with recycled buffers.
      Bottom line is that instead of traffic being redirected to user space,
      it would be continuously dropped.
      
      To fix this, let us clear the mentioned flag on xsk_buff_pool side
      during xdp_buff initialization, which is what should have been done
      right from the start of XSK multi-buffer support.
      
      Fixes: 1bbc04de ("ice: xsk: add RX multi-buffer support")
      Fixes: 1c9ba9c1 ("i40e: xsk: add RX multi-buffer support")
      Fixes: 24ea5012 ("xsk: support mbuf on ZC RX")
      Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
      Link: https://lore.kernel.org/r/20240124191602.566724-3-maciej.fijalkowski@intel.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • xsk: recycle buffer in case Rx queue was full · 26900989
      Maciej Fijalkowski authored
      Add missing xsk_buff_free() call when __xsk_rcv_zc() failed to produce
      descriptor to XSK Rx queue.
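The pattern being fixed, returning a buffer to the pool when the queue cannot take it, can be sketched as follows. All types and functions are hypothetical user-space models; they are not the xsk implementation.

```c
#include <stdbool.h>

/* Hypothetical model of an Rx queue with a fixed number of slots. */
struct rx_ring_model {
    unsigned int used;
    unsigned int size;
};

/* Hypothetical model of a buffer pool that counts recycled buffers. */
struct pool_model {
    unsigned int free_bufs;
};

/* Try to post one descriptor; fails when the ring is full. */
static bool rx_produce_model(struct rx_ring_model *rx)
{
    if (rx->used == rx->size)
        return false;
    rx->used++;
    return true;
}

/* Receive path: on a full ring the buffer goes back to the pool. */
static bool rcv_zc_model(struct rx_ring_model *rx, struct pool_model *pool)
{
    if (!rx_produce_model(rx)) {
        pool->free_bufs++; /* the recycle step that was missing */
        return false;
    }
    return true;
}
```

Without the recycle step in the failure branch, every rejected buffer would be lost to the pool in this model.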
      
      Fixes: 24ea5012 ("xsk: support mbuf on ZC RX")
      Acked-by: Magnus Karlsson <magnus.karlsson@intel.com>
      Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
      Link: https://lore.kernel.org/r/20240124191602.566724-2-maciej.fijalkowski@intel.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
  2. 23 Jan, 2024 6 commits
    • riscv, bpf: Fix unpredictable kernel crash about RV64 struct_ops · 1732ebc4
      Pu Lehui authored
      We encountered a kernel crash triggered by the bpf_tcp_ca testcase as
      shown below:
      
      Unable to handle kernel paging request at virtual address ff60000088554500
      Oops [#1]
      ...
      CPU: 3 PID: 458 Comm: test_progs Tainted: G           OE      6.8.0-rc1-kselftest_plain #1
      Hardware name: riscv-virtio,qemu (DT)
      epc : 0xff60000088554500
       ra : tcp_ack+0x288/0x1232
      epc : ff60000088554500 ra : ffffffff80cc7166 sp : ff2000000117ba50
       gp : ffffffff82587b60 tp : ff60000087be0040 t0 : ff60000088554500
       t1 : ffffffff801ed24e t2 : 0000000000000000 s0 : ff2000000117bbc0
       s1 : 0000000000000500 a0 : ff20000000691000 a1 : 0000000000000018
       a2 : 0000000000000001 a3 : ff60000087be03a0 a4 : 0000000000000000
       a5 : 0000000000000000 a6 : 0000000000000021 a7 : ffffffff8263f880
       s2 : 000000004ac3c13b s3 : 000000004ac3c13a s4 : 0000000000008200
       s5 : 0000000000000001 s6 : 0000000000000104 s7 : ff2000000117bb00
       s8 : ff600000885544c0 s9 : 0000000000000000 s10: ff60000086ff0b80
       s11: 000055557983a9c0 t3 : 0000000000000000 t4 : 000000000000ffc4
       t5 : ffffffff8154f170 t6 : 0000000000000030
      status: 0000000200000120 badaddr: ff60000088554500 cause: 000000000000000c
      Code: c796 67d7 0000 0000 0052 0002 c13b 4ac3 0000 0000 (0001) 0000
      ---[ end trace 0000000000000000 ]---
      
      The reason is that commit 2cd3e377 ("x86/cfi,bpf: Fix bpf_struct_ops
      CFI") changes the func_addr of arch_prepare_bpf_trampoline in struct_ops
      from NULL to non-NULL, while we use func_addr on RV64 to differentiate
      between struct_ops and regular trampoline. When the struct_ops testcase
      is triggered, it emits the wrong prologue and epilogue, leading to
      unpredictable issues. After commit 2cd3e377, we can use
      BPF_TRAMP_F_INDIRECT to distinguish them, as it is always set in
      struct_ops.
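The change of discriminator can be illustrated with a small C sketch. The function names and the flag value are assumptions made for this example only; the real flag is BPF_TRAMP_F_INDIRECT and its numeric value should be taken from the kernel headers, not from here.

```c
#include <stdbool.h>
#include <stddef.h>

/* Placeholder bit for the sketch; not necessarily the kernel's value. */
#define TRAMP_F_INDIRECT_MODEL (1u << 8)

/* Old heuristic: a NULL func_addr was taken to mean "struct_ops". */
static bool is_struct_ops_old_model(const void *func_addr)
{
    return func_addr == NULL;
}

/* New check: rely on the flag that is set for struct_ops trampolines. */
static bool is_struct_ops_new_model(unsigned int flags)
{
    return (flags & TRAMP_F_INDIRECT_MODEL) != 0;
}
```

Once func_addr became non-NULL for struct_ops, the old heuristic classified such a trampoline as a regular one, while the flag-based check does not depend on func_addr at all.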
      
      Fixes: 2cd3e377 ("x86/cfi,bpf: Fix bpf_struct_ops CFI")
      Signed-off-by: Pu Lehui <pulehui@huawei.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Tested-by: Björn Töpel <bjorn@rivosinc.com>
      Acked-by: Björn Töpel <bjorn@kernel.org>
      Link: https://lore.kernel.org/bpf/20240123023207.1917284-1-pulehui@huaweicloud.com
    • Merge tag 'wireless-2024-01-22' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless · 1347775d
      Jakub Kicinski authored
      Kalle Valo says:
      
      ====================
      wireless fixes for v6.8-rc2
      
      The most visible fix here is the ath11k crash fix which was introduced
      in v6.7. We also have a fix for iwlwifi memory corruption and few
      smaller fixes in the stack.
      
      * tag 'wireless-2024-01-22' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless:
        wifi: mac80211: fix race condition on enabling fast-xmit
        wifi: iwlwifi: fix a memory corruption
        wifi: mac80211: fix potential sta-link leak
        wifi: cfg80211/mac80211: remove dependency on non-existing option
        wifi: cfg80211: fix missing interfaces when dumping
        wifi: ath11k: rely on mac80211 debugfs handling for vif
        wifi: p54: fix GCC format truncation warning with wiphy->fw_version
      ====================
      
      Link: https://lore.kernel.org/r/20240122153434.E0254C433C7@smtp.kernel.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • ipv6: init the accept_queue's spinlocks in inet6_create · 435e202d
      Zhengchao Shao authored
      In commit 198bc90e ("tcp: make sure init the accept_queue's spinlocks
      once"), the spinlocks of accept_queue are initialized only when socket is
      created in the inet4 scenario. The locks are not initialized when socket
      is created in the inet6 scenario. The kernel reports the following error:
      INFO: trying to register non-static key.
      The code is fine but needs lockdep annotation, or maybe
      you didn't initialize this object before use?
      turning off the locking correctness validator.
      Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
      Call Trace:
      <TASK>
      	dump_stack_lvl (lib/dump_stack.c:107)
      	register_lock_class (kernel/locking/lockdep.c:1289)
      	__lock_acquire (kernel/locking/lockdep.c:5015)
      	lock_acquire.part.0 (kernel/locking/lockdep.c:5756)
      	_raw_spin_lock_bh (kernel/locking/spinlock.c:178)
      	inet_csk_listen_stop (net/ipv4/inet_connection_sock.c:1386)
      	tcp_disconnect (net/ipv4/tcp.c:2981)
      	inet_shutdown (net/ipv4/af_inet.c:935)
      	__sys_shutdown (./include/linux/file.h:32 net/socket.c:2438)
      	__x64_sys_shutdown (net/socket.c:2445)
      	do_syscall_64 (arch/x86/entry/common.c:52)
      	entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:129)
      RIP: 0033:0x7f52ecd05a3d
      Code: 5b 41 5c c3 66 0f 1f 84 00 00 00 00 00 f3 0f 1e fa 48 89 f8 48 89 f7
      48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff
      ff 73 01 c3 48 8b 0d ab a3 0e 00 f7 d8 64 89 01 48
      RSP: 002b:00007f52ecf5dde8 EFLAGS: 00000293 ORIG_RAX: 0000000000000030
      RAX: ffffffffffffffda RBX: 00007f52ecf5e640 RCX: 00007f52ecd05a3d
      RDX: 00007f52ecc8b188 RSI: 0000000000000000 RDI: 0000000000000004
      RBP: 00007f52ecf5de20 R08: 00007ffdae45c69f R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000293 R12: 00007f52ecf5e640
      R13: 0000000000000000 R14: 00007f52ecc8b060 R15: 00007ffdae45c6e0
      
      Fixes: 198bc90e ("tcp: make sure init the accept_queue's spinlocks once")
      Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Link: https://lore.kernel.org/r/20240122102001.2851701-1-shaozhengchao@huawei.com
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    • netlink: fix potential sleeping issue in mqueue_flush_file · 234ec0b6
      Zhengchao Shao authored
      I analyze the potential sleeping issue of the following processes:
      Thread A                                Thread B
      ...                                     netlink_create  //ref = 1
      do_mq_notify                            ...
        sock = netlink_getsockbyfilp          ...     //ref = 2
        info->notify_sock = sock;             ...
      ...                                     netlink_sendmsg
      ...                                       skb = netlink_alloc_large_skb  //skb->head is vmalloced
      ...                                       netlink_unicast
      ...                                         sk = netlink_getsockbyportid //ref = 3
      ...                                         netlink_sendskb
      ...                                           __netlink_sendskb
      ...                                             skb_queue_tail //put skb to sk_receive_queue
      ...                                         sock_put //ref = 2
      ...                                     ...
      ...                                     netlink_release
      ...                                       deferred_put_nlk_sk //ref = 1
      mqueue_flush_file
        spin_lock
        remove_notification
          netlink_sendskb
            sock_put  //ref = 0
              sk_free
                ...
                __sk_destruct
                  netlink_sock_destruct
                    skb_queue_purge  //get skb from sk_receive_queue
                      ...
                      __skb_queue_purge_reason
                        kfree_skb_reason
                          __kfree_skb
                          ...
                          skb_release_all
                            skb_release_head_state
                              netlink_skb_destructor
                                vfree(skb->head)  //sleeping while holding spinlock
      
      In netlink_sendmsg, if the memory pointed to by skb->head is allocated by
      vmalloc and the skb is queued on sk_receive_queue without being freed,
      the sleeping bug occurs when the mqueue executes flush. Use
      vfree_atomic instead of vfree in netlink_skb_destructor to solve the issue.
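The general idea behind an atomic-safe free is to record the pointer now and release it later from a context that may sleep. Below is a minimal user-space sketch of that deferral pattern; the struct, the capacity and both functions are hypothetical and do not reproduce how the kernel implements vfree_atomic().

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

#define MAX_DEFERRED_MODEL 8 /* arbitrary capacity for this sketch */

/* Hypothetical list of pointers waiting to be released. */
struct deferred_frees_model {
    void *ptrs[MAX_DEFERRED_MODEL];
    size_t count;
};

/* Safe to call "under a lock": only records the pointer, never blocks. */
static bool free_atomic_model(struct deferred_frees_model *d, void *p)
{
    if (d->count == MAX_DEFERRED_MODEL)
        return false;
    d->ptrs[d->count++] = p;
    return true;
}

/* Runs later, where blocking is allowed; releases everything recorded. */
static size_t drain_deferred_model(struct deferred_frees_model *d)
{
    size_t freed = d->count;

    for (size_t i = 0; i < d->count; i++)
        free(d->ptrs[i]);
    d->count = 0;
    return freed;
}
```

In this model the code path holding the spinlock only ever calls free_atomic_model(), so the potentially blocking release happens outside the locked region.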
      
      Fixes: c05cdb1b ("netlink: allow large data transfers from user-space")
      Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com>
      Link: https://lore.kernel.org/r/20240122011807.2110357-1-shaozhengchao@huawei.com
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    • selftest: Don't reuse port for SO_INCOMING_CPU test. · 97de5a15
      Kuniyuki Iwashima authored
      Jakub reported that ASSERT_EQ(cpu, i) in so_incoming_cpu.c seems to
      fire somewhat randomly.
      
        # #  RUN           so_incoming_cpu.before_reuseport.test3 ...
        # # so_incoming_cpu.c:191:test3:Expected cpu (32) == i (0)
        # # test3: Test terminated by assertion
        # #          FAIL  so_incoming_cpu.before_reuseport.test3
        # not ok 3 so_incoming_cpu.before_reuseport.test3
      
      When the test failed, not-yet-accepted CLOSE_WAIT sockets received
      SYN with a "challenging" SEQ number, which was sent from an unexpected
      CPU that did not create the receiver.
      
      The test basically does:
      
        1. for each cpu:
          1-1. create a server
          1-2. set SO_INCOMING_CPU
      
        2. for each cpu:
          2-1. set cpu affinity
          2-2. create some clients
          2-3. let clients connect() to the server on the same cpu
          2-4. close() clients
      
        3. for each server:
          3-1. accept() all child sockets
          3-2. check if all children have the same SO_INCOMING_CPU with the server
      
      The root cause was the close() in 2-4. and net.ipv4.tcp_tw_reuse.
      
      In a loop of 2., close() changed the client state to FIN_WAIT_2, and
      the peer transitioned to CLOSE_WAIT.
      
      In another loop of 2., connect() happened to select the same port of
      the FIN_WAIT_2 socket, and it was reused as the default value of
      net.ipv4.tcp_tw_reuse is 2.
      
      As a result, the new client sent SYN to the CLOSE_WAIT socket from
      a different CPU, and the receiver's sk_incoming_cpu was overwritten
      with unexpected CPU ID.
      
      Also, the SYN had a different SEQ number, so the CLOSE_WAIT socket
      responded with Challenge ACK.  The new client properly returned RST
      and effectively killed the CLOSE_WAIT socket.
      
      This way, all clients were created successfully, but the error was
      detected later by 3-2., ASSERT_EQ(cpu, i).
      
      To avoid the failure, let's make sure that (i) the number of clients
      is less than the number of available ports and (ii) such reuse never
      happens.
      
      Fixes: 6df96146 ("selftest: Add test for SO_INCOMING_CPU.")
      Reported-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
      Tested-by: Jakub Kicinski <kuba@kernel.org>
      Link: https://lore.kernel.org/r/20240120031642.67014-1-kuniyu@amazon.com
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    • tcp: Add memory barrier to tcp_push() · 7267e8dc
      Salvatore Dipietro authored
      On CPUs with weak memory models, reads and updates performed by tcp_push
      to the sk variables can get reordered leaving the socket throttled when
      it should not. The tasklet running tcp_wfree() may also not observe the
      memory updates in time and will skip flushing any packets throttled by
      tcp_push(), delaying the sending. This can pathologically cause 40ms
      extra latency due to bad interactions with delayed acks.
      
      Adding a memory barrier in tcp_push removes the bug, similarly to the
      previous commit bf06200e ("tcp: tsq: fix nonagle handling").
      smp_mb__after_atomic() is used to avoid unnecessary overhead on x86,
      which is not affected.
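The "set a flag, then re-check shared state" ordering can be sketched with C11 atomics. This is a user-space analogy under stated assumptions: sock_model, THROTTLED_BIT_MODEL and should_defer_push_model() are hypothetical names, and the sequentially consistent fence merely stands in for the kernel barrier; it is not the same primitive.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define THROTTLED_BIT_MODEL 1u /* hypothetical flag bit */

/* Hypothetical model of the two fields involved in the race. */
struct sock_model {
    atomic_uint flags;      /* carries the "throttled" bit */
    atomic_uint wmem_alloc; /* bytes still owned by in-flight packets */
};

/*
 * Mark the socket throttled, then decide whether to defer the push.
 * The fence keeps the flag store from being reordered after the load
 * of wmem_alloc, so a completion running on another CPU either sees
 * the flag or this path sees the updated counter.
 */
static bool should_defer_push_model(struct sock_model *sk, unsigned int skb_truesize)
{
    atomic_fetch_or_explicit(&sk->flags, THROTTLED_BIT_MODEL,
                             memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);
    return atomic_load_explicit(&sk->wmem_alloc,
                                memory_order_relaxed) > skb_truesize;
}
```

On a weakly ordered CPU, omitting the fence in this model would allow the load to be satisfied before the flag store becomes visible, which corresponds to the stuck-throttled case described above.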
      
      Patch has been tested using an AWS c7g.2xlarge instance with Ubuntu
      22.04 and Apache Tomcat 9.0.83 running the basic servlet below:
      
      import java.io.IOException;
      import java.io.OutputStreamWriter;
      import java.io.PrintWriter;
      import javax.servlet.ServletException;
      import javax.servlet.http.HttpServlet;
      import javax.servlet.http.HttpServletRequest;
      import javax.servlet.http.HttpServletResponse;
      
      public class HelloWorldServlet extends HttpServlet {
          @Override
          protected void doGet(HttpServletRequest request, HttpServletResponse response)
            throws ServletException, IOException {
              response.setContentType("text/html;charset=utf-8");
              OutputStreamWriter osw = new OutputStreamWriter(response.getOutputStream(),"UTF-8");
              String s = "a".repeat(3096);
              osw.write(s,0,s.length());
              osw.flush();
          }
      }
      
      Load was applied using wrk2 (https://github.com/kinvolk/wrk2) from an AWS
      c6i.8xlarge instance. Before the patch an additional 40ms latency from P99.99+
      values is observed while, with the patch, the extra latency disappears.
      
      No patch and tcp_autocorking=1
      ./wrk -t32 -c128 -d40s --latency -R10000  http://172.31.60.173:8080/hello/hello
        ...
       50.000%    0.91ms
       75.000%    1.13ms
       90.000%    1.46ms
       99.000%    1.74ms
       99.900%    1.89ms
       99.990%   41.95ms  <<< 40+ ms extra latency
       99.999%   48.32ms
      100.000%   48.96ms
      
      With patch and tcp_autocorking=1
      ./wrk -t32 -c128 -d40s --latency -R10000  http://172.31.60.173:8080/hello/hello
        ...
       50.000%    0.90ms
       75.000%    1.13ms
       90.000%    1.45ms
       99.000%    1.72ms
       99.900%    1.83ms
       99.990%    2.11ms  <<< no 40+ ms extra latency
       99.999%    2.53ms
      100.000%    2.62ms
      
      Patch has also been tested on x86 (m7i.2xlarge instance), which is not
      affected by this issue, and the patch doesn't introduce any additional
      delay.
      
      Fixes: 7aa5470c ("tcp: tsq: move tsq_flags close to sk_wmem_alloc")
      Signed-off-by: Salvatore Dipietro <dipiets@amazon.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Link: https://lore.kernel.org/r/20240119190133.43698-1-dipiets@amazon.com
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
  3. 22 Jan, 2024 11 commits
    • net/rds: Fix UBSAN: array-index-out-of-bounds in rds_cmsg_recv · 13e788de
      Sharath Srinivasan authored
      Syzkaller UBSAN crash occurs in rds_cmsg_recv(),
      which reads inc->i_rx_lat_trace[j + 1] with index 4 (3 + 1),
      but with array size of 4 (RDS_RX_MAX_TRACES).
      Here 'j' is assigned from rs->rs_rx_trace[i] and in-turn from
      trace.rx_trace_pos[i] in rds_recv_track_latency(),
      with both arrays sized 3 (RDS_MSG_RX_DGRAM_TRACE_MAX). So fix the
      off-by-one bounds check in rds_recv_track_latency() to prevent
      a potential crash in rds_cmsg_recv().
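The off-by-one can be reproduced with the numbers given above. The sketch below is a hypothetical model: the constants mirror the sizes quoted in the message (positions bounded by 3, a latency array of 4 entries, a read at index pos + 1), and the exact comparison operators are an assumption about what "off-by-one bounds check" means here.

```c
#include <stdbool.h>

#define TRACE_MAX_MODEL 3u       /* bound quoted for the position arrays */
#define LAT_TRACE_SLOTS_MODEL 4u /* size quoted for the latency array */

/* Assumed buggy check: lets pos == 3 through. */
static bool pos_accepted_buggy_model(unsigned int pos)
{
    return !(pos > TRACE_MAX_MODEL);
}

/* Assumed fixed check: rejects pos == 3. */
static bool pos_accepted_fixed_model(unsigned int pos)
{
    return !(pos >= TRACE_MAX_MODEL);
}

/* The later read uses index pos + 1, which must stay inside the array. */
static bool read_in_bounds_model(unsigned int pos)
{
    return pos + 1 < LAT_TRACE_SLOTS_MODEL;
}
```

Position 3 passes the assumed buggy check yet makes the later read touch index 4 of a 4-entry array; with the stricter check every accepted position keeps the read in bounds.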
      
      Found by syzkaller:
      =================================================================
      UBSAN: array-index-out-of-bounds in net/rds/recv.c:585:39
      index 4 is out of range for type 'u64 [4]'
      CPU: 1 PID: 8058 Comm: syz-executor228 Not tainted 6.6.0-gd2f51b35 #1
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
      BIOS 1.15.0-1 04/01/2014
      Call Trace:
       <TASK>
       __dump_stack lib/dump_stack.c:88 [inline]
       dump_stack_lvl+0x136/0x150 lib/dump_stack.c:106
       ubsan_epilogue lib/ubsan.c:217 [inline]
       __ubsan_handle_out_of_bounds+0xd5/0x130 lib/ubsan.c:348
       rds_cmsg_recv+0x60d/0x700 net/rds/recv.c:585
       rds_recvmsg+0x3fb/0x1610 net/rds/recv.c:716
       sock_recvmsg_nosec net/socket.c:1044 [inline]
       sock_recvmsg+0xe2/0x160 net/socket.c:1066
       __sys_recvfrom+0x1b6/0x2f0 net/socket.c:2246
       __do_sys_recvfrom net/socket.c:2264 [inline]
       __se_sys_recvfrom net/socket.c:2260 [inline]
       __x64_sys_recvfrom+0xe0/0x1b0 net/socket.c:2260
       do_syscall_x64 arch/x86/entry/common.c:51 [inline]
       do_syscall_64+0x40/0x110 arch/x86/entry/common.c:82
       entry_SYSCALL_64_after_hwframe+0x63/0x6b
      ==================================================================
      
      Fixes: 3289025a ("RDS: add receive message trace used by application")
      Reported-by: Chenyuan Yang <chenyuan0y@gmail.com>
      Closes: https://lore.kernel.org/linux-rdma/CALGdzuoVdq-wtQ4Az9iottBqC5cv9ZhcE5q8N7LfYFvkRsOVcw@mail.gmail.com/
      Signed-off-by: Sharath Srinivasan <sharath.srinivasan@oracle.com>
      Reviewed-by: Simon Horman <horms@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      13e788de
    • net: micrel: Fix PTP frame parsing for lan8814 · aaf632f7
      Horatiu Vultur authored
      The HW can check each frame for whether it is a PTP frame, which
      domain it belongs to, which PTP frame type it is, and for different IP
      addresses in the frame. If one of these checks fails, the frame is not
      timestamped. Most of these checks were disabled, except the check of the
      minorVersionPTP field inside the PTP header. This means that once a partner
      sends a frame compliant with 802.1AS, which has minorVersionPTP set to 1,
      the frame was not timestamped because the HW expected by default a value
      of 0 in minorVersionPTP. This is exactly the same issue as on lan8841.
      Fix this issue by removing this check so that userspace can decide on this.
      
      Fixes: ece19502 ("net: phy: micrel: 1588 support for LAN8814 phy")
      Signed-off-by: Horatiu Vultur <horatiu.vultur@microchip.com>
      Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
      Reviewed-by: Divya Koppera <divya.koppera@microchip.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      aaf632f7
    • Merge branch 'dpll-fixes' · 94fa82b0
      David S. Miller authored
      Arkadiusz Kubalewski says:
      
      ====================
      dpll: fix unordered unbind/bind registerer issues
      
      Fix issues when performing unordered unbind/bind of kernel modules
      which are using a dpll device with DPLL_PIN_TYPE_MUX pins.
      Currently only serialized bind/unbind works for such a use case; fix
      the issues and allow for an unserialized kernel module bind order.
      
      The issues are observed on the ice driver, i.e.,
      
      $ echo 0000:af:00.0 > /sys/bus/pci/drivers/ice/unbind
      $ echo 0000:af:00.1 > /sys/bus/pci/drivers/ice/unbind
      
      results in:
      
      ice 0000:af:00.0: Removed PTP clock
      BUG: kernel NULL pointer dereference, address: 0000000000000010
      PF: supervisor read access in kernel mode
      PF: error_code(0x0000) - not-present page
      PGD 0 P4D 0
      Oops: 0000 [#1] PREEMPT SMP PTI
      CPU: 7 PID: 71848 Comm: bash Kdump: loaded Not tainted 6.6.0-rc5_next-queue_19th-Oct-2023-01625-g039e5d15e451 #1
      Hardware name: Intel Corporation S2600STB/S2600STB, BIOS SE5C620.86B.02.01.0008.031920191559 03/19/2019
      RIP: 0010:ice_dpll_rclk_state_on_pin_get+0x2f/0x90 [ice]
      Code: 41 57 4d 89 cf 41 56 41 55 4d 89 c5 41 54 55 48 89 f5 53 4c 8b 66 08 48 89 cb 4d 8d b4 24 f0 49 00 00 4c 89 f7 e8 71 ec 1f c5 <0f> b6 5b 10 41 0f b6 84 24 30 4b 00 00 29 c3 41 0f b6 84 24 28 4b
      RSP: 0018:ffffc902b179fb60 EFLAGS: 00010246
      RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
      RDX: ffff8882c1398000 RSI: ffff888c7435cc60 RDI: ffff888c7435cb90
      RBP: ffff888c7435cc60 R08: ffffc902b179fbb0 R09: 0000000000000000
      R10: ffff888ef1fc8050 R11: fffffffffff82700 R12: ffff888c743581a0
      R13: ffffc902b179fbb0 R14: ffff888c7435cb90 R15: 0000000000000000
      FS:  00007fdc7dae0740(0000) GS:ffff888c105c0000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 0000000000000010 CR3: 0000000132c24002 CR4: 00000000007706e0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      PKRU: 55555554
      Call Trace:
       <TASK>
       ? __die+0x20/0x70
       ? page_fault_oops+0x76/0x170
       ? exc_page_fault+0x65/0x150
       ? asm_exc_page_fault+0x22/0x30
       ? ice_dpll_rclk_state_on_pin_get+0x2f/0x90 [ice]
       ? __pfx_ice_dpll_rclk_state_on_pin_get+0x10/0x10 [ice]
       dpll_msg_add_pin_parents+0x142/0x1d0
       dpll_pin_event_send+0x7d/0x150
       dpll_pin_on_pin_unregister+0x3f/0x100
       ice_dpll_deinit_pins+0xa1/0x230 [ice]
       ice_dpll_deinit+0x29/0xe0 [ice]
       ice_remove+0xcd/0x200 [ice]
       pci_device_remove+0x33/0xa0
       device_release_driver_internal+0x193/0x200
       unbind_store+0x9d/0xb0
       kernfs_fop_write_iter+0x128/0x1c0
       vfs_write+0x2bb/0x3e0
       ksys_write+0x5f/0xe0
       do_syscall_64+0x59/0x90
       ? filp_close+0x1b/0x30
       ? do_dup2+0x7d/0xd0
       ? syscall_exit_work+0x103/0x130
       ? syscall_exit_to_user_mode+0x22/0x40
       ? do_syscall_64+0x69/0x90
       ? syscall_exit_work+0x103/0x130
       ? syscall_exit_to_user_mode+0x22/0x40
       ? do_syscall_64+0x69/0x90
       entry_SYSCALL_64_after_hwframe+0x6e/0xd8
      RIP: 0033:0x7fdc7d93eb97
      Code: 0b 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
      RSP: 002b:00007fff2aa91028 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
      RAX: ffffffffffffffda RBX: 000000000000000d RCX: 00007fdc7d93eb97
      RDX: 000000000000000d RSI: 00005644814ec9b0 RDI: 0000000000000001
      RBP: 00005644814ec9b0 R08: 0000000000000000 R09: 00007fdc7d9b14e0
      R10: 00007fdc7d9b13e0 R11: 0000000000000246 R12: 000000000000000d
      R13: 00007fdc7d9fb780 R14: 000000000000000d R15: 00007fdc7d9f69e0
       </TASK>
      Modules linked in: uinput vfio_pci vfio_pci_core vfio_iommu_type1 vfio irqbypass ixgbevf snd_seq_dummy snd_hrtimer snd_seq snd_timer snd_seq_device snd soundcore overlay qrtr rfkill vfat fat xfs libcrc32c rpcrdma sunrpc rdma_ucm ib_srpt ib_isert iscsi_target_mod target_core_mod ib_iser libiscsi scsi_transport_iscsi rdma_cm iw_cm ib_cm intel_rapl_msr intel_rapl_common intel_uncore_frequency intel_uncore_frequency_common isst_if_common skx_edac nfit libnvdimm ipmi_ssif x86_pkg_temp_thermal intel_powerclamp coretemp irdma rapl intel_cstate ib_uverbs iTCO_wdt iTCO_vendor_support acpi_ipmi intel_uncore mei_me ipmi_si pcspkr i2c_i801 ib_core mei ipmi_devintf intel_pch_thermal ioatdma i2c_smbus ipmi_msghandler lpc_ich joydev acpi_power_meter acpi_pad ext4 mbcache jbd2 sd_mod t10_pi sg ast i2c_algo_bit drm_shmem_helper drm_kms_helper ice crct10dif_pclmul ixgbe crc32_pclmul drm crc32c_intel ahci i40e libahci ghash_clmulni_intel libata mdio dca gnss wmi fuse [last unloaded: iavf]
      CR2: 0000000000000010
      
      v6:
      - fix memory corruption on error path in patch [v5 2/4]
      ====================
      Acked-by: Vadim Fedorenko <vadim.fedorenko@linux.dev>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      94fa82b0
    • dpll: fix register pin with unregistered parent pin · 7dc5b18f
      Arkadiusz Kubalewski authored
      In case of multiple kernel module instances using the same dpll device:
      if only one registers the dpll device, then only that one can register
      directly connected pins with the dpll device. When an unregistered parent
      is responsible for determining whether the muxed pin can be registered
      with it, the drivers need to be loaded in serialized order to work
      correctly: first the driver instance which registers the direct pins
      needs to be loaded, then the other instances can register muxed type
      pins.
      
      Allow registration of a pin with a parent even if the parent was not
      yet registered, thus allowing an unserialized driver instance
      load order.
      Do not WARN_ON a notification for an unregistered pin, which can be
      invoked in the described case; just return an error instead.
      
      Fixes: 9431063a ("dpll: core: Add DPLL framework base functions")
      Fixes: 9d71b54b ("dpll: netlink: Add DPLL framework base functions")
      Reviewed-by: Jan Glaza <jan.glaza@intel.com>
      Reviewed-by: Jiri Pirko <jiri@nvidia.com>
      Signed-off-by: Arkadiusz Kubalewski <arkadiusz.kubalewski@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      7dc5b18f
    • dpll: fix userspace availability of pins · db2ec3c9
      Arkadiusz Kubalewski authored
      If a parent pin was unregistered but its child pin was not, userspace
      would see "zombie" pins: the ones that were registered with
      a parent pin (dpll_pin_on_pin_register(..)).
      Technically those are not available, as there is no dpll device in the
      system. Do not dump those pins and prevent userspace from any
      interaction with them. Provide a unified function to determine if the
      pin is available and use it before acting on or responding to user requests.
      
      Fixes: 9d71b54b ("dpll: netlink: Add DPLL framework base functions")
      Reviewed-by: Jan Glaza <jan.glaza@intel.com>
      Reviewed-by: Jiri Pirko <jiri@nvidia.com>
      Signed-off-by: Arkadiusz Kubalewski <arkadiusz.kubalewski@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      db2ec3c9
    • dpll: fix pin dump crash for rebound module · 830ead5f
      Arkadiusz Kubalewski authored
      When a kernel module is unbound but the pin resources are not entirely
      freed (another kernel module instance of the same PCI device has kept
      a reference to that pin), and the kernel module is bound again, the pin
      properties are not updated (the properties are only assigned when
      memory for the pin is allocated). The prop pointer still points to
      memory of the kernel module, which was deallocated on the
      unbind.
      
      If the pin dump is invoked in this state, the result is a kernel crash.
      Prevent the crash by storing persistent pin properties in the dpll
      subsystem: copy the content from the kernel module when the pin is
      allocated, instead of using memory of the kernel module.
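
The idea of keeping a private copy of the properties, rather than a pointer into memory owned by the driver, can be sketched in user-space C as follows. The structures and helper names are hypothetical and much simpler than the dpll ones; `malloc`/`free` stand in for kernel allocators.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, simplified pin properties owned by a driver module. */
struct pin_props {
    char *label;
    int type;
};

/* Hypothetical subsystem-side pin that keeps its own copy of the properties. */
struct pin {
    struct pin_props prop;
};

/* Deep-copy the driver's properties so the pin never points into driver memory. */
static int pin_props_dup(const struct pin_props *src, struct pin_props *dst)
{
    dst->type = src->type;
    dst->label = NULL;
    if (src->label) {
        size_t n = strlen(src->label) + 1;

        dst->label = malloc(n);
        if (!dst->label)
            return -1;
        memcpy(dst->label, src->label, n);
    }
    return 0;
}

/* Release the copy when the pin itself is finally freed. */
static void pin_props_free(struct pin_props *p)
{
    free(p->label);
    p->label = NULL;
}
```

Because the copy is owned by the pin, it stays valid after the original driver memory is released, which is the situation the unbind/rebind sequence creates.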
      
      Fixes: 9431063a ("dpll: core: Add DPLL framework base functions")
      Fixes: 9d71b54b ("dpll: netlink: Add DPLL framework base functions")
      Reviewed-by: Jan Glaza <jan.glaza@intel.com>
      Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
      Signed-off-by: Arkadiusz Kubalewski <arkadiusz.kubalewski@intel.com>
      Reviewed-by: Jiri Pirko <jiri@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      830ead5f
    • dpll: fix broken error path in dpll_pin_alloc(..) · b6a11a7f
      Arkadiusz Kubalewski authored
      If the pin type is not expected, or memory allocation for the pin
      properties fails, the unwind error path must not destroy the pin's
      xarrays, which have not yet been initialized.
      Add a new goto label and use it to fix the broken error path.
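
A minimal user-space sketch of this kind of goto-based unwinding is shown below. All names are hypothetical; the counters exist only so that the sketch can show which cleanup steps ran, and `tables_*` stands in for the xarray setup and teardown.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical counters recording how often setup and teardown ran. */
static int tables_initialized;
static int tables_destroyed;

struct pin {
    int type;
    int *props;        /* stand-in for the duplicated pin properties */
    int tables_ready;  /* stand-in for the initialized lookup tables */
};

static void tables_init(struct pin *p)
{
    p->tables_ready = 1;
    tables_initialized++;
}

static void tables_destroy(struct pin *p)
{
    p->tables_ready = 0;
    tables_destroyed++;
}

/* fail_late simulates an error that happens after the tables were set up. */
static struct pin *pin_alloc(int type, int fail_late)
{
    struct pin *p = calloc(1, sizeof(*p));

    if (!p)
        return NULL;
    if (type < 0)                 /* unexpected pin type */
        goto err_free_pin;        /* tables not initialized: skip destroy */

    p->type = type;
    p->props = malloc(sizeof(*p->props));
    if (!p->props)
        goto err_free_pin;        /* still nothing to destroy */

    tables_init(p);
    if (fail_late)
        goto err_destroy_tables;  /* only this path tears the tables down */

    return p;

err_destroy_tables:
    tables_destroy(p);
    free(p->props);
err_free_pin:
    free(p);
    return NULL;
}
```

Each label undoes only what had been set up before the jump, so an early failure never touches state that was not yet initialized.
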
      Reviewed-by: Jiri Pirko <jiri@nvidia.com>
      Signed-off-by: Arkadiusz Kubalewski <arkadiusz.kubalewski@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      b6a11a7f
    • Merge branch 'tun-fixes' · ef485216
      David S. Miller authored
      Yunjian Wang says:
      
      ====================
      fixes for tun
      
      There are a few places on the receive path where packet receives and
      packet drops were not accounted for. This patchset fixes that issue.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ef485216
    • tun: add missing rx stats accounting in tun_xdp_act · f1084c42
      Yunjian Wang authored
      The TUN can be used as vhost-net backend, and it is necessary to
      count the packets transmitted from TUN to vhost-net/virtio-net.
      However, there are some places in the receive path that were not
      taken into account when using XDP. It would be beneficial to also
      include new accounting for successfully received bytes using
      dev_sw_netstats_rx_add.
      
      Fixes: 761876c8 ("tap: XDP support")
      Signed-off-by: Yunjian Wang <wangyunjian@huawei.com>
      Reviewed-by: Willem de Bruijn <willemb@google.com>
      Acked-by: Jason Wang <jasowang@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f1084c42
    • tun: fix missing dropped counter in tun_xdp_act · 5744ba05
      Yunjian Wang authored
      Commit 8ae1aff0 ("tuntap: split out XDP logic") includes a
      dropped counter for XDP_DROP, XDP_ABORTED, and invalid XDP actions.
      Unfortunately, that commit missed the dropped counter when an error
      occurs during XDP_TX and XDP_REDIRECT actions. This patch fixes
      this issue.
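
The accounting rule described above can be sketched as a small verdict handler. The enum, the stats structure and the error parameters are hypothetical simplifications, not the tun driver's code; `tx_err` and `redirect_err` simulate the result of the transmit or redirect helper.

```c
#include <assert.h>

/* Hypothetical action codes mirroring the XDP verdict names in the text. */
enum act { ACT_ABORTED, ACT_DROP, ACT_PASS, ACT_TX, ACT_REDIRECT };

struct stats {
    unsigned long rx_dropped;
};

/*
 * Returns 0 unless the simulated transmit or redirect step failed, in which
 * case the error is returned and the packet is counted as dropped.
 */
static int handle_act(struct stats *st, unsigned int act,
                      int tx_err, int redirect_err)
{
    switch (act) {
    case ACT_REDIRECT:
        if (redirect_err) {
            st->rx_dropped++;   /* failure on redirect is a drop too */
            return redirect_err;
        }
        return 0;
    case ACT_TX:
        if (tx_err) {
            st->rx_dropped++;   /* failure on transmit is a drop too */
            return tx_err;
        }
        return 0;
    case ACT_PASS:
        return 0;
    case ACT_ABORTED:
    case ACT_DROP:
    default:                    /* invalid actions are treated as drops */
        st->rx_dropped++;
        return 0;
    }
}
```

The point of the sketch is that every path on which the packet is lost increments the counter, including the two error branches.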
      
      Fixes: 8ae1aff0 ("tuntap: split out XDP logic")
      Signed-off-by: Yunjian Wang <wangyunjian@huawei.com>
      Reviewed-by: Willem de Bruijn <willemb@google.com>
      Acked-by: Jason Wang <jasowang@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      5744ba05
    • net: fix removing a namespace with conflicting altnames · d09486a0
      Jakub Kicinski authored
      Mark reports a BUG() when a net namespace is removed.
      
          kernel BUG at net/core/dev.c:11520!
      
      Physical interfaces moved outside of init_net get "refunded"
      to init_net when that namespace disappears. The main interface
      name may get overwritten in the process if it would have
      conflicted. We need to also discard all conflicting altnames.
      Recent fixes addressed ensuring that altnames get moved
      with the main interface, which surfaced this problem.
      Reported-by: Марк Коренберг <socketpair@gmail.com>
      Link: https://lore.kernel.org/all/CAEmTpZFZ4Sv3KwqFOY2WKDHeZYdi0O7N5H1nTvcGp=SAEavtDg@mail.gmail.com/
      Fixes: 7663d522 ("net: check for altname conflicts when changing netdev's netns")
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Reviewed-by: Jiri Pirko <jiri@nvidia.com>
      Reviewed-by: Xin Long <lucien.xin@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      d09486a0
  4. 21 Jan, 2024 2 commits
    • idpf: distinguish vports by the dev_port attribute · 359724fa
      Michal Schmidt authored
      idpf registers multiple netdevs (virtual ports) for one PCI function,
      but it does not provide a way for userspace to distinguish them with
      sysfs attributes. Per Documentation/ABI/testing/sysfs-class-net, it is
      a bug not to set dev_port for independent ports on the same PCI bus,
      device and function.
      
      Without dev_port set, systemd-udevd's default naming policy attempts
      to assign the same name ("ens2f0") to all four idpf netdevs on my test
      system and obviously fails, leaving three of them with the initial
      eth<N> name.
      
      With this patch, systemd-udevd is able to assign unique names to the
      netdevs (e.g. "ens2f0", "ens2f0d1", "ens2f0d2", "ens2f0d3").
      
      The Intel-provided out-of-tree idpf driver already sets dev_port. In
      this patch I chose to do it in the same place in the idpf_cfg_netdev
      function.
      
      Fixes: 0fe45467 ("idpf: add create vport and netdev configuration")
      Signed-off-by: Michal Schmidt <mschmidt@redhat.com>
      Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      359724fa
    • udp: fix busy polling · a54d51fb
      Eric Dumazet authored
      Generic sk_busy_loop_end() only looks at sk->sk_receive_queue
      for presence of packets.
      
      The problem is that for UDP sockets, after the blamed commit, some packets
      could be present in another queue: udp_sk(sk)->reader_queue.
      
      In some cases, a busy poller could spin until timeout expiration,
      even if some packets are available in udp_sk(sk)->reader_queue.
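
The difference between checking one queue and checking both can be sketched as follows. The socket structure is a hypothetical simplification that stores queue lengths instead of lists, and the timeout handling of the real busy-poll loop is left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, heavily simplified socket: queue lengths instead of lists. */
struct sock {
    bool is_udp;
    int receive_queue_len;
    int reader_queue_len;   /* only meaningful for UDP sockets */
};

/* Generic check as described in the text: looks at one queue only. */
static bool busy_loop_end_generic(const struct sock *sk)
{
    return sk->receive_queue_len > 0;
}

/* Check that also considers the UDP reader queue. */
static bool busy_loop_end_fixed(const struct sock *sk)
{
    if (sk->receive_queue_len > 0)
        return true;
    return sk->is_udp && sk->reader_queue_len > 0;
}
```

For a UDP socket whose only pending packets sit in the reader queue, the generic check keeps reporting "nothing to read" and the poller spins, while the second check ends the loop.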
      
      v3: - make sk_busy_loop_end() nicer (Willem)
      
      v2: - add a READ_ONCE(sk->sk_family) in sk_is_inet() to avoid KCSAN splats.
          - add a sk_is_inet() check in sk_is_udp() (Willem feedback)
          - add a sk_is_inet() check in sk_is_tcp().
      
      Fixes: 2276f58a ("udp: use a separate rx queue for packet reception")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reviewed-by: Paolo Abeni <pabeni@redhat.com>
      Reviewed-by: Willem de Bruijn <willemb@google.com>
      Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a54d51fb
  5. 20 Jan, 2024 11 commits
    • llc: Drop support for ETH_P_TR_802_2. · e3f9bed9
      Kuniyuki Iwashima authored
      syzbot reported an uninit-value bug below. [0]
      
      llc supports ETH_P_802_2 (0x0004) and used to support ETH_P_TR_802_2
      (0x0011), and syzbot abused the latter to trigger the bug.
      
        write$tun(r0, &(0x7f0000000040)={@val={0x0, 0x11}, @val, @mpls={[], @llc={@snap={0xaa, 0x1, ')', "90e5dd"}}}}, 0x16)
      
      llc_conn_handler() initialises local variables {saddr,daddr}.mac
      based on skb in llc_pdu_decode_sa()/llc_pdu_decode_da() and passes
      them to __llc_lookup().
      
      However, the initialisation is done only when skb->protocol is
      htons(ETH_P_802_2), otherwise, __llc_lookup_established() and
      __llc_lookup_listener() will read garbage.
      
      The missing initialisation existed prior to commit 211ed865
      ("net: delete all instances of special processing for token ring").
      
      It removed the part to kick out the token ring stuff but forgot to
      close the door allowing ETH_P_TR_802_2 packets to sneak into llc_rcv().
      
      Let's remove llc_tr_packet_type and complete the deprecation.
      
      [0]:
      BUG: KMSAN: uninit-value in __llc_lookup_established+0xe9d/0xf90
       __llc_lookup_established+0xe9d/0xf90
       __llc_lookup net/llc/llc_conn.c:611 [inline]
       llc_conn_handler+0x4bd/0x1360 net/llc/llc_conn.c:791
       llc_rcv+0xfbb/0x14a0 net/llc/llc_input.c:206
       __netif_receive_skb_one_core net/core/dev.c:5527 [inline]
       __netif_receive_skb+0x1a6/0x5a0 net/core/dev.c:5641
       netif_receive_skb_internal net/core/dev.c:5727 [inline]
       netif_receive_skb+0x58/0x660 net/core/dev.c:5786
       tun_rx_batched+0x3ee/0x980 drivers/net/tun.c:1555
       tun_get_user+0x53af/0x66d0 drivers/net/tun.c:2002
       tun_chr_write_iter+0x3af/0x5d0 drivers/net/tun.c:2048
       call_write_iter include/linux/fs.h:2020 [inline]
       new_sync_write fs/read_write.c:491 [inline]
       vfs_write+0x8ef/0x1490 fs/read_write.c:584
       ksys_write+0x20f/0x4c0 fs/read_write.c:637
       __do_sys_write fs/read_write.c:649 [inline]
       __se_sys_write fs/read_write.c:646 [inline]
       __x64_sys_write+0x93/0xd0 fs/read_write.c:646
       do_syscall_x64 arch/x86/entry/common.c:51 [inline]
       do_syscall_64+0x44/0x110 arch/x86/entry/common.c:82
       entry_SYSCALL_64_after_hwframe+0x63/0x6b
      
      Local variable daddr created at:
       llc_conn_handler+0x53/0x1360 net/llc/llc_conn.c:783
       llc_rcv+0xfbb/0x14a0 net/llc/llc_input.c:206
      
      CPU: 1 PID: 5004 Comm: syz-executor994 Not tainted 6.6.0-syzkaller-14500-g1c410411 #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/09/2023
      
      Fixes: 211ed865 ("net: delete all instances of special processing for token ring")
      Reported-by: syzbot+b5ad66046b913bc04c6f@syzkaller.appspotmail.com
      Closes: https://syzkaller.appspot.com/bug?extid=b5ad66046b913bc04c6f
      Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Link: https://lore.kernel.org/r/20240119015515.61898-1-kuniyu@amazon.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      e3f9bed9
    • llc: make llc_ui_sendmsg() more robust against bonding changes · dad555c8
      Eric Dumazet authored
      syzbot was able to trick llc_ui_sendmsg(), allocating an skb with no
      headroom, but subsequently trying to push 14 bytes of Ethernet header [1]
      
      Like some others, llc_ui_sendmsg() releases the socket lock before
      calling sock_alloc_send_skb().
      Then it acquires it again, but does not redo all the sanity checks
      that were performed.
      
      This fix:
      
      - Uses LL_RESERVED_SPACE() to reserve space.
      - Checks all conditions again after the socket lock is held again.
      - Does not account for the Ethernet header in the MTU limitation.
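
Why reserved headroom matters for the later header push can be shown with a miniature buffer model. The `struct buf` type and its helpers are hypothetical and only mimic the reserve/push idea; they are not the skb API.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical miniature packet buffer: data must never start before storage. */
struct buf {
    unsigned char storage[64];
    size_t data_off;   /* offset of the current start of packet data */
    size_t len;        /* number of bytes of packet data */
};

static void buf_init(struct buf *b)
{
    b->data_off = 0;
    b->len = 0;
}

/* Reserve headroom on an empty buffer so headers can be prepended later. */
static int buf_reserve(struct buf *b, size_t headroom)
{
    if (b->len != 0 || b->data_off + headroom > sizeof(b->storage))
        return -1;
    b->data_off += headroom;
    return 0;
}

/* Prepend hdr_len bytes; fails when there is not enough headroom left. */
static int buf_push(struct buf *b, size_t hdr_len)
{
    if (hdr_len > b->data_off)
        return -1;     /* the data pointer would move before the buffer start */
    b->data_off -= hdr_len;
    b->len += hdr_len;
    return 0;
}
```

Pushing a 14-byte header onto a buffer with no reserved headroom fails in this model; in the report below the same situation shows up as a data pointer 14 bytes below head.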
      
      [1]
      
      skbuff: skb_under_panic: text:ffff800088baa334 len:1514 put:14 head:ffff0000c9c37000 data:ffff0000c9c36ff2 tail:0x5dc end:0x6c0 dev:bond0
      
       kernel BUG at net/core/skbuff.c:193 !
      Internal error: Oops - BUG: 00000000f2000800 [#1] PREEMPT SMP
      Modules linked in:
      CPU: 0 PID: 6875 Comm: syz-executor.0 Not tainted 6.7.0-rc8-syzkaller-00101-g0802e17d9aca-dirty #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 11/17/2023
      pstate: 60400005 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
       pc : skb_panic net/core/skbuff.c:189 [inline]
       pc : skb_under_panic+0x13c/0x140 net/core/skbuff.c:203
       lr : skb_panic net/core/skbuff.c:189 [inline]
       lr : skb_under_panic+0x13c/0x140 net/core/skbuff.c:203
      sp : ffff800096f97000
      x29: ffff800096f97010 x28: ffff80008cc8d668 x27: dfff800000000000
      x26: ffff0000cb970c90 x25: 00000000000005dc x24: ffff0000c9c36ff2
      x23: ffff0000c9c37000 x22: 00000000000005ea x21: 00000000000006c0
      x20: 000000000000000e x19: ffff800088baa334 x18: 1fffe000368261ce
      x17: ffff80008e4ed000 x16: ffff80008a8310f8 x15: 0000000000000001
      x14: 1ffff00012df2d58 x13: 0000000000000000 x12: 0000000000000000
      x11: 0000000000000001 x10: 0000000000ff0100 x9 : e28a51f1087e8400
      x8 : e28a51f1087e8400 x7 : ffff80008028f8d0 x6 : 0000000000000000
      x5 : 0000000000000001 x4 : 0000000000000001 x3 : ffff800082b78714
      x2 : 0000000000000001 x1 : 0000000100000000 x0 : 0000000000000089
      Call trace:
        skb_panic net/core/skbuff.c:189 [inline]
        skb_under_panic+0x13c/0x140 net/core/skbuff.c:203
        skb_push+0xf0/0x108 net/core/skbuff.c:2451
        eth_header+0x44/0x1f8 net/ethernet/eth.c:83
        dev_hard_header include/linux/netdevice.h:3188 [inline]
        llc_mac_hdr_init+0x110/0x17c net/llc/llc_output.c:33
        llc_sap_action_send_xid_c+0x170/0x344 net/llc/llc_s_ac.c:85
        llc_exec_sap_trans_actions net/llc/llc_sap.c:153 [inline]
        llc_sap_next_state net/llc/llc_sap.c:182 [inline]
        llc_sap_state_process+0x1ec/0x774 net/llc/llc_sap.c:209
        llc_build_and_send_xid_pkt+0x12c/0x1c0 net/llc/llc_sap.c:270
        llc_ui_sendmsg+0x7bc/0xb1c net/llc/af_llc.c:997
        sock_sendmsg_nosec net/socket.c:730 [inline]
        __sock_sendmsg net/socket.c:745 [inline]
        sock_sendmsg+0x194/0x274 net/socket.c:767
        splice_to_socket+0x7cc/0xd58 fs/splice.c:881
        do_splice_from fs/splice.c:933 [inline]
        direct_splice_actor+0xe4/0x1c0 fs/splice.c:1142
        splice_direct_to_actor+0x2a0/0x7e4 fs/splice.c:1088
        do_splice_direct+0x20c/0x348 fs/splice.c:1194
        do_sendfile+0x4bc/0xc70 fs/read_write.c:1254
        __do_sys_sendfile64 fs/read_write.c:1322 [inline]
        __se_sys_sendfile64 fs/read_write.c:1308 [inline]
        __arm64_sys_sendfile64+0x160/0x3b4 fs/read_write.c:1308
        __invoke_syscall arch/arm64/kernel/syscall.c:37 [inline]
        invoke_syscall+0x98/0x2b8 arch/arm64/kernel/syscall.c:51
        el0_svc_common+0x130/0x23c arch/arm64/kernel/syscall.c:136
        do_el0_svc+0x48/0x58 arch/arm64/kernel/syscall.c:155
        el0_svc+0x54/0x158 arch/arm64/kernel/entry-common.c:678
        el0t_64_sync_handler+0x84/0xfc arch/arm64/kernel/entry-common.c:696
        el0t_64_sync+0x190/0x194 arch/arm64/kernel/entry.S:595
      Code: aa1803e6 aa1903e7 a90023f5 94792f6a (d4210000)
      
      Fixes: 1da177e4 ("Linux-2.6.12-rc2")
      Reported-and-tested-by: syzbot+2a7024e9502df538e8ef@syzkaller.appspotmail.com
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
      Link: https://lore.kernel.org/r/20240118183625.4007013-1-edumazet@google.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      dad555c8
    • vlan: skip nested type that is not IFLA_VLAN_QOS_MAPPING · 6c21660f
      Lin Ma authored
      In the vlan_changelink function, a loop is used to parse the nested
      attributes IFLA_VLAN_EGRESS_QOS and IFLA_VLAN_INGRESS_QOS in order to
      obtain the struct ifla_vlan_qos_mapping. These two nested attributes are
      checked in the vlan_validate_qos_map function, which calls
      nla_validate_nested_deprecated with the vlan_map_policy.
      
      However, this deprecated validator applies a LIBERAL strictness, allowing
      the presence of an attribute with the type IFLA_VLAN_QOS_UNSPEC.
      Consequently, the loop in vlan_changelink may parse an attribute of type
      IFLA_VLAN_QOS_UNSPEC and believe it carries a payload of
      struct ifla_vlan_qos_mapping, which is not necessarily true.
      
      To address this issue and ensure compatibility, this patch introduces two
      type checks that skip attributes whose type is not IFLA_VLAN_QOS_MAPPING.
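
The effect of such a type check can be sketched with a simplified attribute list. The types below are hypothetical and pre-parsed; real netlink attributes are walked with the kernel's nla helpers, which are not modelled here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical attribute types; only QOS_MAPPING carries a from/to payload. */
enum { QOS_UNSPEC = 0, QOS_MAPPING = 1 };

struct qos_mapping {
    uint32_t from;
    uint32_t to;
};

/* Hypothetical pre-parsed attribute: a type tag plus a payload slot. */
struct attr {
    uint16_t type;
    struct qos_mapping payload;  /* meaningful only for QOS_MAPPING */
};

/* Apply every mapping attribute and skip all other types. Returns the count applied. */
static size_t apply_mappings(const struct attr *attrs, size_t n,
                             uint32_t prio_map[8])
{
    size_t applied = 0;

    for (size_t i = 0; i < n; i++) {
        if (attrs[i].type != QOS_MAPPING)
            continue;            /* do not interpret a foreign payload */
        if (attrs[i].payload.from >= 8)
            continue;            /* out of range for this toy table */
        prio_map[attrs[i].payload.from] = attrs[i].payload.to;
        applied++;
    }
    return applied;
}
```

An attribute tagged `QOS_UNSPEC` is ignored even if its payload bytes happen to look like a mapping, which is the behaviour the type check is meant to guarantee.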
      
      Fixes: 07b5b17e ("[VLAN]: Use rtnl_link API")
      Signed-off-by: Lin Ma <linma@zju.edu.cn>
      Reviewed-by: Simon Horman <horms@kernel.org>
      Link: https://lore.kernel.org/r/20240118130306.1644001-1-linma@zju.edu.cn
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      6c21660f
    • Merge branch 'bnxt_en-bug-fixes' · 9b697956
      Jakub Kicinski authored
      Michael Chan says:
      
      ====================
      bnxt_en: Bug fixes
      
      This series contains 5 miscellaneous fixes.  The fixes include adding
      delay for FLR, buffer memory leak, RSS table size calculation,
      ethtool self test kernel warning, and mqprio crash.
      ====================
      
      Link: https://lore.kernel.org/r/20240117234515.226944-1-michael.chan@broadcom.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      9b697956
    • bnxt_en: Fix possible crash after creating sw mqprio TCs · 467739ba
      Michael Chan authored
      The driver relies on netdev_get_num_tc() to get the number of HW
      offloaded mqprio TCs to allocate and free TX rings.  This won't
      work and can potentially crash the system if software mqprio or
      taprio TCs have been set up.  netdev_get_num_tc() will return the
      number of software TCs and it may cause the driver to allocate or
      free more TX rings than it should.  Fix it by adding a bp->num_tc
      field to store the number of HW offload mqprio TCs for the device.
      Use bp->num_tc instead of netdev_get_num_tc().
      
      This fixes a crash like this:
      
      BUG: kernel NULL pointer dereference, address: 0000000000000000
      PGD 42b8404067 P4D 0
      Oops: 0000 [#1] PREEMPT SMP NOPTI
      CPU: 120 PID: 8661 Comm: ifconfig Kdump: loaded Tainted: G           OE     5.18.16 #1
      Hardware name: Lenovo ThinkSystem SR650 V3/SB27A92818, BIOS ESE114N-2.12 04/25/2023
      RIP: 0010:bnxt_hwrm_cp_ring_alloc_p5+0x10/0x90 [bnxt_en]
      Code: 41 5c 41 5d 41 5e c3 cc cc cc cc 41 8b 44 24 08 66 89 03 eb c6 e8 b0 f1 7d db 0f 1f 44 00 00 41 56 41 55 41 54 55 48 89 fd 53 <48> 8b 06 48 89 f3 48 81 c6 28 01 00 00 0f b6 96 13 ff ff ff 44 8b
      RSP: 0018:ff65907660d1fa88 EFLAGS: 00010202
      RAX: 0000000000000010 RBX: ff4dde1d907e4980 RCX: f400000000000000
      RDX: 0000000000000010 RSI: 0000000000000000 RDI: ff4dde1d907e4980
      RBP: ff4dde1d907e4980 R08: 000000000000000f R09: 0000000000000000
      R10: ff4dde5f02671800 R11: 0000000000000008 R12: 0000000088888889
      R13: 0500000000000000 R14: 00f0000000000000 R15: ff4dde5f02671800
      FS:  00007f4b126b5740(0000) GS:ff4dde9bff600000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 0000000000000000 CR3: 000000416f9c6002 CR4: 0000000000771ee0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
      PKRU: 55555554
      Call Trace:
       <TASK>
       bnxt_hwrm_ring_alloc+0x204/0x770 [bnxt_en]
       bnxt_init_chip+0x4d/0x680 [bnxt_en]
       ? bnxt_poll+0x1a0/0x1a0 [bnxt_en]
       __bnxt_open_nic+0xd2/0x740 [bnxt_en]
       bnxt_open+0x10b/0x220 [bnxt_en]
       ? raw_notifier_call_chain+0x41/0x60
       __dev_open+0xf3/0x1b0
       __dev_change_flags+0x1db/0x250
       dev_change_flags+0x21/0x60
       devinet_ioctl+0x590/0x720
       ? avc_has_extended_perms+0x1b7/0x420
       ? _copy_from_user+0x3a/0x60
       inet_ioctl+0x189/0x1c0
       ? wp_page_copy+0x45a/0x6e0
       sock_do_ioctl+0x42/0xf0
       ? ioctl_has_perm.constprop.0.isra.0+0xbd/0x120
       sock_ioctl+0x1ce/0x2e0
       __x64_sys_ioctl+0x87/0xc0
       do_syscall_64+0x59/0x90
       ? syscall_exit_work+0x103/0x130
       ? syscall_exit_to_user_mode+0x12/0x30
       ? do_syscall_64+0x69/0x90
       ? exc_page_fault+0x62/0x150
      
      Fixes: c0c050c5 ("bnxt_en: New Broadcom ethernet driver.")
      Reviewed-by: Damodharam Ammepalli <damodharam.ammepalli@broadcom.com>
      Reviewed-by: Andy Gospodarek <andrew.gospodarek@broadcom.com>
      Signed-off-by: Michael Chan <michael.chan@broadcom.com>
      Link: https://lore.kernel.org/r/20240117234515.226944-6-michael.chan@broadcom.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      467739ba
    • bnxt_en: Prevent kernel warning when running offline self test · c20f4821
      Michael Chan authored
      We call bnxt_half_open_nic() to set up the chip partially to run
      loopback tests.  The rings and buffers are initialized normally
      so that we can transmit and receive packets in loopback mode.
      That means page pool buffers are allocated for the aggregation ring
      just like the normal case.  NAPI is not needed because we are just
      polling for the loopback packets.
      
      When we're done with the loopback tests, we call bnxt_half_close_nic()
      to clean up.  When freeing the page pools, we hit a WARN_ON()
      in page_pool_unlink_napi() because the NAPI state linked to the
      page pool is uninitialized.
      
      The simplest way to avoid this warning is just to initialize the
      NAPIs during half open and delete the NAPIs during half close.
      Trying to skip the page pool initialization or skip linking of
      NAPI during half open will be more complicated.
      
      This fix avoids this warning:
      
      WARNING: CPU: 4 PID: 46967 at net/core/page_pool.c:946 page_pool_unlink_napi+0x1f/0x30
      CPU: 4 PID: 46967 Comm: ethtool Tainted: G S      W          6.7.0-rc5+ #22
      Hardware name: Dell Inc. PowerEdge R750/06V45N, BIOS 1.3.8 08/31/2021
      RIP: 0010:page_pool_unlink_napi+0x1f/0x30
      Code: 90 90 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 48 8b 47 18 48 85 c0 74 1b 48 8b 50 10 83 e2 01 74 08 8b 40 34 83 f8 ff 74 02 <0f> 0b 48 c7 47 18 00 00 00 00 c3 cc cc cc cc 66 90 90 90 90 90 90
      RSP: 0018:ffa000003d0dfbe8 EFLAGS: 00010246
      RAX: ff110003607ce640 RBX: ff110010baf5d000 RCX: 0000000000000008
      RDX: 0000000000000000 RSI: ff110001e5e522c0 RDI: ff110010baf5d000
      RBP: ff11000145539b40 R08: 0000000000000001 R09: ffffffffc063f641
      R10: ff110001361eddb8 R11: 000000000040000f R12: 0000000000000001
      R13: 000000000000001c R14: ff1100014553a080 R15: 0000000000003fc0
      FS:  00007f9301c4f740(0000) GS:ff1100103fd00000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00007f91344fa8f0 CR3: 00000003527cc005 CR4: 0000000000771ef0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      PKRU: 55555554
      Call Trace:
       <TASK>
       ? __warn+0x81/0x140
       ? page_pool_unlink_napi+0x1f/0x30
       ? report_bug+0x102/0x200
       ? handle_bug+0x44/0x70
       ? exc_invalid_op+0x13/0x60
       ? asm_exc_invalid_op+0x16/0x20
       ? bnxt_free_ring.isra.123+0xb1/0xd0 [bnxt_en]
       ? page_pool_unlink_napi+0x1f/0x30
       page_pool_destroy+0x3e/0x150
       bnxt_free_mem+0x441/0x5e0 [bnxt_en]
       bnxt_half_close_nic+0x2a/0x40 [bnxt_en]
       bnxt_self_test+0x21d/0x450 [bnxt_en]
       __dev_ethtool+0xeda/0x2e30
       ? native_queued_spin_lock_slowpath+0x17f/0x2b0
       ? __link_object+0xa1/0x160
       ? _raw_spin_unlock_irqrestore+0x23/0x40
       ? __create_object+0x5f/0x90
       ? __kmem_cache_alloc_node+0x317/0x3c0
       ? dev_ethtool+0x59/0x170
       dev_ethtool+0xa7/0x170
       dev_ioctl+0xc3/0x530
       sock_do_ioctl+0xa8/0xf0
       sock_ioctl+0x270/0x310
       __x64_sys_ioctl+0x8c/0xc0
       do_syscall_64+0x3e/0xf0
       entry_SYSCALL_64_after_hwframe+0x6e/0x76
      
      Fixes: 294e39e0 ("bnxt: hook NAPIs to page pools")
      Reviewed-by: Andy Gospodarek <andrew.gospodarek@broadcom.com>
      Reviewed-by: Ajit Khaparde <ajit.khaparde@broadcom.com>
      Signed-off-by: Michael Chan <michael.chan@broadcom.com>
      Link: https://lore.kernel.org/r/20240117234515.226944-5-michael.chan@broadcom.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • bnxt_en: Fix RSS table entries calculation for P5_PLUS chips · 523384a6
      Michael Chan authored
      The existing formula used in the driver to calculate the number of RSS
      table entries is to round up the number of RX rings to the next integer
      multiple of 64 (e.g. 64, 128, 192, ...).  This is incorrect.  The valid
      values supported by the chip are 64, 128, 256, 512 only (powers of 2
      starting from 64).  When the number of RX rings is greater than 128, the
      entry size will likely be wrong.  Firmware will round down the invalid
      value (e.g. 192 rounded down to 128) provided by the driver, causing some
      RSS rings to not receive any packets.
      
      We already have an existing function bnxt_calc_nr_ring_pages() to
      do this calculation.  Use it in bnxt_get_nr_rss_ctxs() to calculate the
      number of RSS contexts correctly for P5_PLUS chips.
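
      The difference between the two rounding rules can be sketched as
      follows. This is a minimal illustration, not the driver's code: the
      helper names and macros are hypothetical, bnxt_calc_nr_ring_pages() is
      not reproduced here, and the firmware model only covers the "round
      down to a valid value" behaviour described above.

```c
#include <assert.h>

#define RSS_TBL_MIN_ENTRIES 64u   /* smallest valid table size per the text */
#define RSS_TBL_MAX_ENTRIES 512u  /* largest valid table size per the text */

/* Old rule as described: round up to the next multiple of 64. */
static unsigned int rss_entries_multiple_of_64(unsigned int rx_rings)
{
    assert(rx_rings > 0);
    return (rx_rings + 63u) / 64u * 64u;
}

/* Intended rule: smallest of 64, 128, 256, 512 that covers rx_rings. */
static unsigned int rss_entries_pow2(unsigned int rx_rings)
{
    unsigned int entries = RSS_TBL_MIN_ENTRIES;

    assert(rx_rings > 0 && rx_rings <= RSS_TBL_MAX_ENTRIES);
    while (entries < rx_rings)
        entries <<= 1;
    return entries;
}

/* Assumed firmware model: round a requested size down to a valid one. */
static unsigned int fw_round_down(unsigned int entries)
{
    unsigned int valid = RSS_TBL_MAX_ENTRIES;

    while (valid > RSS_TBL_MIN_ENTRIES && valid > entries)
        valid >>= 1;
    return valid;
}
```

      With 130 RX rings, the old rule asks for 192 entries, which this
      firmware model rounds down to 128, fewer than the number of rings. The
      power-of-2 rule asks for 256, which is already valid.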
      
      Reviewed-by: Andy Gospodarek <andrew.gospodarek@broadcom.com>
      Reviewed-by: Pavan Chebbi <pavan.chebbi@broadcom.com>
      Fixes: 7b3af4f7 ("bnxt_en: Add RSS support for 57500 chips.")
      Signed-off-by: Michael Chan <michael.chan@broadcom.com>
      Link: https://lore.kernel.org/r/20240117234515.226944-4-michael.chan@broadcom.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • bnxt_en: Fix memory leak in bnxt_hwrm_get_rings() · 2ad8e573
      Michael Chan authored
      bnxt_hwrm_get_rings() can abort and return an error when there are not
      enough ring resources.  It aborts without releasing the HWRM DMA buffer,
      causing a dma_pool_destroy warning when the driver is unloaded:
      
      bnxt_en 0000:99:00.0: dma_pool_destroy bnxt_hwrm, 000000005b089ba8 busy
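
      The class of bug described here, an early return that skips the
      release of a request buffer, is commonly avoided with a single exit
      label. The sketch below illustrates that pattern only. All names
      (toy_pool, toy_req_alloc, toy_get_rings) are hypothetical stand-ins
      and do not reflect the actual HWRM API or the actual patch.

```c
#include <stdlib.h>

/* Hypothetical pool that counts buffers still outstanding. */
struct toy_pool {
    int busy;
};

static void *toy_req_alloc(struct toy_pool *pool)
{
    void *buf = malloc(64);

    if (buf)
        pool->busy++;
    return buf;
}

static void toy_req_free(struct toy_pool *pool, void *buf)
{
    free(buf);
    pool->busy--;
}

/* Returns 0 on success, -1 when there are not enough rings. */
static int toy_get_rings(struct toy_pool *pool, int avail, int needed)
{
    int rc = 0;
    void *req = toy_req_alloc(pool);

    if (!req)
        return -1;

    if (avail < needed) {
        rc = -1;
        goto out;   /* the abort path still releases the buffer */
    }

    /* ... a real caller would consume the response here ... */
out:
    toy_req_free(pool, req);
    return rc;
}
```

      After either outcome the pool's busy count is back to zero, which is
      the condition a "busy" warning at pool destruction checks for.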
      
      Fixes: f1e50b27 ("bnxt_en: Fix trimming of P5 RX and TX rings")
      Reviewed-by: Somnath Kotur <somnath.kotur@broadcom.com>
      Signed-off-by: Michael Chan <michael.chan@broadcom.com>
      Link: https://lore.kernel.org/r/20240117234515.226944-3-michael.chan@broadcom.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • bnxt_en: Wait for FLR to complete during probe · 3c1069fa
      Michael Chan authored
      The first message to firmware may fail if the device is undergoing FLR.
      The driver has some recovery logic for this failure scenario but we must
      wait 100 msec for FLR to complete before proceeding.  Otherwise the
      recovery will always fail.
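
      The ordering matters more than the mechanism, so the sketch below
      models time virtually rather than sleeping. It is a toy simulation
      under stated assumptions: toy_dev, toy_send, toy_wait_ms and toy_probe
      are hypothetical names, and the only figure taken from the text is the
      100 msec wait.

```c
/* Hypothetical device: milliseconds left until a pending FLR completes. */
struct toy_dev {
    int flr_ms_left;
};

/* A message fails while the simulated FLR is still in progress. */
static int toy_send(const struct toy_dev *dev)
{
    return dev->flr_ms_left > 0 ? -1 : 0;
}

/* Advance simulated time by ms milliseconds. */
static void toy_wait_ms(struct toy_dev *dev, int ms)
{
    dev->flr_ms_left = dev->flr_ms_left > ms ? dev->flr_ms_left - ms : 0;
}

/* First message, then one recovery attempt after waiting wait_ms. */
static int toy_probe(struct toy_dev *dev, int wait_ms)
{
    if (toy_send(dev) == 0)
        return 0;

    toy_wait_ms(dev, wait_ms);  /* give the FLR time to finish */
    return toy_send(dev);       /* recovery attempt */
}
```

      With a device that needs the full 100 msec, toy_probe() fails when
      wait_ms is 0 and succeeds when wait_ms is 100.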
      
      Fixes: ba02629f ("bnxt_en: log firmware status on firmware init failure")
      Reviewed-by: Damodharam Ammepalli <damodharam.ammepalli@broadcom.com>
      Signed-off-by: Michael Chan <michael.chan@broadcom.com>
      Link: https://lore.kernel.org/r/20240117234515.226944-2-michael.chan@broadcom.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • tcp: make sure init the accept_queue's spinlocks once · 198bc90e
      Zhengchao Shao authored
      When I run syz's reproduction C program locally, it causes the following
      issue:
      pvqspinlock: lock 0xffff9d181cd5c660 has corrupted value 0x0!
      WARNING: CPU: 19 PID: 21160 at __pv_queued_spin_unlock_slowpath (kernel/locking/qspinlock_paravirt.h:508)
      Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
      RIP: 0010:__pv_queued_spin_unlock_slowpath (kernel/locking/qspinlock_paravirt.h:508)
      Code: 73 56 3a ff 90 c3 cc cc cc cc 8b 05 bb 1f 48 01 85 c0 74 05 c3 cc cc cc cc 8b 17 48 89 fe 48 c7 c7
      30 20 ce 8f e8 ad 56 42 ff <0f> 0b c3 cc cc cc cc 0f 0b 0f 1f 40 00 90 90 90 90 90 90 90 90 90
      RSP: 0018:ffffa8d200604cb8 EFLAGS: 00010282
      RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff9d1ef60e0908
      RDX: 00000000ffffffd8 RSI: 0000000000000027 RDI: ffff9d1ef60e0900
      RBP: ffff9d181cd5c280 R08: 0000000000000000 R09: 00000000ffff7fff
      R10: ffffa8d200604b68 R11: ffffffff907dcdc8 R12: 0000000000000000
      R13: ffff9d181cd5c660 R14: ffff9d1813a3f330 R15: 0000000000001000
      FS:  00007fa110184640(0000) GS:ffff9d1ef60c0000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 0000000020000000 CR3: 000000011f65e000 CR4: 00000000000006f0
      Call Trace:
      <IRQ>
        _raw_spin_unlock (kernel/locking/spinlock.c:186)
        inet_csk_reqsk_queue_add (net/ipv4/inet_connection_sock.c:1321)
        inet_csk_complete_hashdance (net/ipv4/inet_connection_sock.c:1358)
        tcp_check_req (net/ipv4/tcp_minisocks.c:868)
        tcp_v4_rcv (net/ipv4/tcp_ipv4.c:2260)
        ip_protocol_deliver_rcu (net/ipv4/ip_input.c:205)
        ip_local_deliver_finish (net/ipv4/ip_input.c:234)
        __netif_receive_skb_one_core (net/core/dev.c:5529)
        process_backlog (./include/linux/rcupdate.h:779)
        __napi_poll (net/core/dev.c:6533)
        net_rx_action (net/core/dev.c:6604)
        __do_softirq (./arch/x86/include/asm/jump_label.h:27)
        do_softirq (kernel/softirq.c:454 kernel/softirq.c:441)
      </IRQ>
      <TASK>
        __local_bh_enable_ip (kernel/softirq.c:381)
        __dev_queue_xmit (net/core/dev.c:4374)
        ip_finish_output2 (./include/net/neighbour.h:540 net/ipv4/ip_output.c:235)
        __ip_queue_xmit (net/ipv4/ip_output.c:535)
        __tcp_transmit_skb (net/ipv4/tcp_output.c:1462)
        tcp_rcv_synsent_state_process (net/ipv4/tcp_input.c:6469)
        tcp_rcv_state_process (net/ipv4/tcp_input.c:6657)
        tcp_v4_do_rcv (net/ipv4/tcp_ipv4.c:1929)
        __release_sock (./include/net/sock.h:1121 net/core/sock.c:2968)
        release_sock (net/core/sock.c:3536)
        inet_wait_for_connect (net/ipv4/af_inet.c:609)
        __inet_stream_connect (net/ipv4/af_inet.c:702)
        inet_stream_connect (net/ipv4/af_inet.c:748)
        __sys_connect (./include/linux/file.h:45 net/socket.c:2064)
        __x64_sys_connect (net/socket.c:2073 net/socket.c:2070 net/socket.c:2070)
        do_syscall_64 (arch/x86/entry/common.c:51 arch/x86/entry/common.c:82)
        entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:129)
        RIP: 0033:0x7fa10ff05a3d
        Code: 5b 41 5c c3 66 0f 1f 84 00 00 00 00 00 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89
        c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d ab a3 0e 00 f7 d8 64 89 01 48
        RSP: 002b:00007fa110183de8 EFLAGS: 00000202 ORIG_RAX: 000000000000002a
        RAX: ffffffffffffffda RBX: 0000000020000054 RCX: 00007fa10ff05a3d
        RDX: 000000000000001c RSI: 0000000020000040 RDI: 0000000000000003
        RBP: 00007fa110183e20 R08: 0000000000000000 R09: 0000000000000000
        R10: 0000000000000000 R11: 0000000000000202 R12: 00007fa110184640
        R13: 0000000000000000 R14: 00007fa10fe8b060 R15: 00007fff73e23b20
      </TASK>
      
      The issue triggering process is analyzed as follows:
      Thread A                                       Thread B
      tcp_v4_rcv //receive ack TCP packet            inet_shutdown
        tcp_check_req                                  tcp_disconnect //disconnect sock
        ...                                              tcp_set_state(sk, TCP_CLOSE)
          inet_csk_complete_hashdance                ...
            inet_csk_reqsk_queue_add                 inet_listen  //start listen
              spin_lock(&queue->rskq_lock)             inet_csk_listen_start
              ...                                        reqsk_queue_alloc
              ...                                          spin_lock_init
              spin_unlock(&queue->rskq_lock)  //warning
      
      When the socket receives the ACK packet during the three-way handshake,
      it takes the spinlock. If the user then shuts down the socket and
      immediately listens on it again, the spinlock is re-initialized while
      it is still held. When the first thread goes to release the spinlock,
      a warning is generated. The same issue applies to fastopenq.lock.
      
      Move the spinlock initialization to inet_create and inet_accept to
      make sure the accept_queue's spinlocks are initialized only once.
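
      The interleaving above can be replayed deterministically in a single
      thread with a toy lock word. This is only an illustration of why
      re-initializing a held lock is harmful. toy_lock and the replay_*
      functions are hypothetical and are not kernel spinlocks or the actual
      patch.

```c
#include <stdbool.h>

/* Toy lock word: 1 while held, 0 when free. Not a real spinlock. */
struct toy_lock {
    int val;
};

static void toy_lock_init(struct toy_lock *l)
{
    l->val = 0;
}

static void toy_lock_acquire(struct toy_lock *l)
{
    l->val = 1;
}

/* Returns false if the lock word no longer shows the lock as held. */
static bool toy_lock_release(struct toy_lock *l)
{
    bool ok = (l->val == 1);

    l->val = 0;
    return ok;
}

/* Buggy ordering: listen re-initializes the lock while it is held. */
static bool replay_init_in_listen(void)
{
    struct toy_lock rskq_lock;

    toy_lock_init(&rskq_lock);      /* first listen */
    toy_lock_acquire(&rskq_lock);   /* Thread A: queue add takes the lock */
    toy_lock_init(&rskq_lock);      /* Thread B: shutdown, then listen again */
    return toy_lock_release(&rskq_lock);    /* Thread A: finds value 0 */
}

/* Fixed ordering: the lock is initialized once, at socket creation. */
static bool replay_init_at_create(void)
{
    struct toy_lock rskq_lock;

    toy_lock_init(&rskq_lock);      /* socket creation */
    toy_lock_acquire(&rskq_lock);   /* Thread A: queue add takes the lock */
    /* Thread B: listen no longer touches the lock */
    return toy_lock_release(&rskq_lock);
}
```

      replay_init_in_listen() returns false, mirroring the "corrupted value
      0x0" report, while replay_init_at_create() returns true.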
      
      Fixes: fff1f300 ("tcp: add a spinlock to protect struct request_sock_queue")
      Fixes: 168a8f58 ("tcp: TCP Fast Open Server - main code path")
      Reported-by: Ming Shu <sming56@aliyun.com>
      Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Link: https://lore.kernel.org/r/20240118012019.1751966-1-shaozhengchao@huawei.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • selftests: bonding: Increase timeout to 1200s · b01f15a7
      Benjamin Poirier authored
      When tests are run by runner.sh, bond_options.sh gets killed before
      it can complete:
      
      make -C tools/testing/selftests run_tests TARGETS="drivers/net/bonding"
      	[...]
      	# timeout set to 120
      	# selftests: drivers/net/bonding: bond_options.sh
      	# TEST: prio (active-backup miimon primary_reselect 0)                [ OK ]
      	# TEST: prio (active-backup miimon primary_reselect 1)                [ OK ]
      	# TEST: prio (active-backup miimon primary_reselect 2)                [ OK ]
      	# TEST: prio (active-backup arp_ip_target primary_reselect 0)         [ OK ]
      	# TEST: prio (active-backup arp_ip_target primary_reselect 1)         [ OK ]
      	# TEST: prio (active-backup arp_ip_target primary_reselect 2)         [ OK ]
      	#
      	not ok 7 selftests: drivers/net/bonding: bond_options.sh # TIMEOUT 120 seconds
      
      This test includes many sleep statements, at least some of which are
      related to timers in the operation of the bonding driver itself. Increase
      the test timeout to allow the test to complete.
      
      I ran the test in slightly different VMs (including one without HW
      virtualization support) and got runtimes of 13m39.760s, 13m31.238s, and
      13m2.956s. Use a ~1.5x "safety factor" and set the timeout to 1200s.
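
      The arithmetic behind the "~1.5x" figure can be checked with a short
      sketch. The helper names are hypothetical. The runtimes and the 1200s
      timeout are the ones quoted above.

```c
/* Convert a "XmY.ZZZs" style runtime to seconds. */
static double to_seconds(int minutes, double seconds)
{
    return minutes * 60.0 + seconds;
}

/* Ratio of the chosen timeout to a measured runtime. */
static double safety_factor(double timeout_s, double runtime_s)
{
    return timeout_s / runtime_s;
}
```

      The slowest run, 13m39.760s, is 819.76 seconds, so a 1200s timeout
      gives a factor of about 1.46. The fastest run, 13m2.956s, is 782.956
      seconds, which gives about 1.53.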
      
      Fixes: 42a8d4aa ("selftests: bonding: add bonding prio option test")
      Reported-by: Jakub Kicinski <kuba@kernel.org>
      Closes: https://lore.kernel.org/netdev/20240116104402.1203850a@kernel.org/#t
      Suggested-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: Benjamin Poirier <bpoirier@nvidia.com>
      Reviewed-by: Hangbin Liu <liuhangbin@gmail.com>
      Link: https://lore.kernel.org/r/20240118001233.304759-1-bpoirier@nvidia.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>