1. 20 Mar, 2024 5 commits
    • octeontx2: Detect the mbox up or down message via register · a88e0f93
      Subbaraya Sundeep authored
      A single interrupt line is used to receive up notifications and down
      reply messages from AF to PF (and similarly from PF to its VFs). The PF
      acts as a bridge: it forwards VF messages to the AF and sends responses
      back from the AF to the VF. When an async event such as a link event
      arrives as an up message while the PF is in the middle of forwarding a
      VF message, mailbox errors occur because the PF state machine is
      corrupted. Since the VF is a separate driver, or the VF driver can be in
      a VM, it is not possible to serialize from the start of communication at
      the VF. Hence, to differentiate between message types at the PF, this
      patch makes the sender set the mbox data register to distinct values for
      up and down messages. The sender also checks that the previous interrupt
      has been received before triggering the current one, by waiting for the
      mailbox data register to become zero.
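
The handshake described above can be sketched in userspace C. This is only a rough model under stated assumptions: the register is a plain variable, and the names `MBOX_DOWN_MSG`, `MBOX_UP_MSG`, `mbox_trigger` and `mbox_irq_handler`, as well as the marker values, are hypothetical and not taken from the driver.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical marker values; the real driver's constants may differ. */
#define MBOX_DOWN_MSG 1u
#define MBOX_UP_MSG   2u

/* Stand-in for the memory-mapped mailbox data register. */
static volatile uint64_t mbox_data_reg;

/* Sender: proceed only once the previous interrupt has been consumed
 * (register reads zero), then tag the new message and "raise" the IRQ.
 * Returns false if the register is still busy after max_polls reads. */
static bool mbox_trigger(uint64_t msg_type, unsigned int max_polls)
{
	assert(msg_type == MBOX_DOWN_MSG || msg_type == MBOX_UP_MSG);
	while (mbox_data_reg != 0) {
		if (max_polls-- == 0)
			return false;
	}
	mbox_data_reg = msg_type;
	return true;
}

/* Receiver (interrupt handler): read the tag, clear the register so the
 * sender may trigger again, and report which kind of message arrived. */
static uint64_t mbox_irq_handler(void)
{
	uint64_t type = mbox_data_reg;

	mbox_data_reg = 0;
	return type;
}
```

In this model a second `mbox_trigger()` call fails until `mbox_irq_handler()` has run, which is the serialization the message describes.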
      
      Fixes: 5a6d7c9d ("octeontx2-pf: Mailbox communication with AF")
      Signed-off-by: Subbaraya Sundeep <sbhatta@marvell.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge tag 'ipsec-2024-03-19' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec · 94e3ca2f
      Jakub Kicinski authored
      Steffen Klassert says:
      
      ====================
      pull request (net): ipsec 2024-03-19
      
      1) Fix possible page_pool leak triggered by esp_output.
         From Dragos Tatulea.
      
      2) Fix UDP encapsulation in software GSO path.
         From Leon Romanovsky.
      
      * tag 'ipsec-2024-03-19' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec:
        xfrm: Allow UDP encapsulation only in offload modes
        net: esp: fix bad handling of pages from page_pool
      ====================
      
      Link: https://lore.kernel.org/r/20240319110151.409825-1-steffen.klassert@secunet.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • devlink: fix port new reply cmd type · 78a2f5e6
      Jiri Pirko authored
      Due to a copy-and-paste error, the port new reply fills cmd with the wrong
      value, unlike any other existing port command replies and notifications.

      Fix it by filling cmd with value DEVLINK_CMD_PORT_NEW.

      I skimmed through devlink userspace implementations; none of them care
      about this cmd value.
      Reported-by: Chenyuan Yang <chenyuan0y@gmail.com>
      Closes: https://lore.kernel.org/all/ZfZcDxGV3tSy4qsV@cy-server/
      Fixes: cd76dcd6 ("devlink: Support add and delete devlink port")
      Signed-off-by: Jiri Pirko <jiri@nvidia.com>
      Reviewed-by: Parav Pandit <parav@nvidia.com>
      Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
      Link: https://lore.kernel.org/r/20240318091908.2736542-1-jiri@resnulli.us
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • tcp: Clear req->syncookie in reqsk_alloc(). · 956c0d61
      Kuniyuki Iwashima authored
      syzkaller reported a read of uninit req->syncookie. [0]
      
      Originally, req->syncookie was used only in tcp_conn_request()
      to indicate if we need to encode SYN cookie in SYN+ACK, so the
      field remains uninitialised in other places.
      
      The commit 695751e3 ("bpf: tcp: Handle BPF SYN Cookie in
      cookie_v[46]_check().") added another meaning in ACK path;
      req->syncookie is set true if SYN cookie is validated by BPF
      kfunc.
      
      After the change, cookie_v[46]_check() always reads req->syncookie,
      but it is not initialised in the normal SYN cookie case, as reported
      by KMSAN.
      
      Let's make sure we always initialise req->syncookie in reqsk_alloc().
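
The idea of initialising the field at allocation time can be sketched as follows. This is a userspace model under assumptions, not the kernel code: `struct req_sketch`, `reqsk_alloc_sketch` and `cookie_check_sketch` are hypothetical names, and `malloc()` plus a junk fill stands in for a slab allocator that does not zero memory.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, heavily reduced stand-in for struct request_sock. */
struct req_sketch {
	unsigned int syncookie : 1;
	unsigned int num_timeout : 7;
};

/* Models an allocator that does not zero memory: fill the object with a
 * junk pattern, then set every field a later reader may look at. */
static struct req_sketch *reqsk_alloc_sketch(void)
{
	struct req_sketch *req = malloc(sizeof(*req));

	if (!req)
		return NULL;
	memset(req, 0xa5, sizeof(*req));	/* simulate stale slab contents */
	req->syncookie = 0;			/* always initialised here */
	req->num_timeout = 0;
	return req;
}

/* Reader on the ACK path: safe only because the allocator set the field. */
static int cookie_check_sketch(const struct req_sketch *req)
{
	assert(req != NULL);
	return req->syncookie;
}
```

Setting the field in the one allocation helper means every caller gets a defined value, rather than relying on each path to remember it.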
      
      [0]:
      BUG: KMSAN: uninit-value in cookie_v4_check+0x22b7/0x29e0
       net/ipv4/syncookies.c:477
       cookie_v4_check+0x22b7/0x29e0 net/ipv4/syncookies.c:477
       tcp_v4_cookie_check net/ipv4/tcp_ipv4.c:1855 [inline]
       tcp_v4_do_rcv+0xb17/0x10b0 net/ipv4/tcp_ipv4.c:1914
       tcp_v4_rcv+0x4ce4/0x5420 net/ipv4/tcp_ipv4.c:2322
       ip_protocol_deliver_rcu+0x2a3/0x13d0 net/ipv4/ip_input.c:205
       ip_local_deliver_finish+0x332/0x500 net/ipv4/ip_input.c:233
       NF_HOOK include/linux/netfilter.h:314 [inline]
       ip_local_deliver+0x21f/0x490 net/ipv4/ip_input.c:254
       dst_input include/net/dst.h:460 [inline]
       ip_rcv_finish+0x4a2/0x520 net/ipv4/ip_input.c:449
       NF_HOOK include/linux/netfilter.h:314 [inline]
       ip_rcv+0xcd/0x380 net/ipv4/ip_input.c:569
       __netif_receive_skb_one_core net/core/dev.c:5538 [inline]
       __netif_receive_skb+0x319/0x9e0 net/core/dev.c:5652
       process_backlog+0x480/0x8b0 net/core/dev.c:5981
       __napi_poll+0xe7/0x980 net/core/dev.c:6632
       napi_poll net/core/dev.c:6701 [inline]
       net_rx_action+0x89d/0x1820 net/core/dev.c:6813
       __do_softirq+0x1c0/0x7d7 kernel/softirq.c:554
       do_softirq+0x9a/0x100 kernel/softirq.c:455
       __local_bh_enable_ip+0x9f/0xb0 kernel/softirq.c:382
       local_bh_enable include/linux/bottom_half.h:33 [inline]
       rcu_read_unlock_bh include/linux/rcupdate.h:820 [inline]
       __dev_queue_xmit+0x2776/0x52c0 net/core/dev.c:4362
       dev_queue_xmit include/linux/netdevice.h:3091 [inline]
       neigh_hh_output include/net/neighbour.h:526 [inline]
       neigh_output include/net/neighbour.h:540 [inline]
       ip_finish_output2+0x187a/0x1b70 net/ipv4/ip_output.c:235
       __ip_finish_output+0x287/0x810
       ip_finish_output+0x4b/0x550 net/ipv4/ip_output.c:323
       NF_HOOK_COND include/linux/netfilter.h:303 [inline]
       ip_output+0x15f/0x3f0 net/ipv4/ip_output.c:433
       dst_output include/net/dst.h:450 [inline]
       ip_local_out net/ipv4/ip_output.c:129 [inline]
       __ip_queue_xmit+0x1e93/0x2030 net/ipv4/ip_output.c:535
       ip_queue_xmit+0x60/0x80 net/ipv4/ip_output.c:549
       __tcp_transmit_skb+0x3c70/0x4890 net/ipv4/tcp_output.c:1462
       tcp_transmit_skb net/ipv4/tcp_output.c:1480 [inline]
       tcp_write_xmit+0x3ee1/0x8900 net/ipv4/tcp_output.c:2792
       __tcp_push_pending_frames net/ipv4/tcp_output.c:2977 [inline]
       tcp_send_fin+0xa90/0x12e0 net/ipv4/tcp_output.c:3578
       tcp_shutdown+0x198/0x1f0 net/ipv4/tcp.c:2716
       inet_shutdown+0x33f/0x5b0 net/ipv4/af_inet.c:923
       __sys_shutdown_sock net/socket.c:2425 [inline]
       __sys_shutdown net/socket.c:2437 [inline]
       __do_sys_shutdown net/socket.c:2445 [inline]
       __se_sys_shutdown+0x2a4/0x440 net/socket.c:2443
       __x64_sys_shutdown+0x6c/0xa0 net/socket.c:2443
       do_syscall_64+0xd5/0x1f0
       entry_SYSCALL_64_after_hwframe+0x6d/0x75
      
      Uninit was stored to memory at:
       reqsk_alloc include/net/request_sock.h:148 [inline]
       inet_reqsk_alloc+0x651/0x7a0 net/ipv4/tcp_input.c:6978
       cookie_tcp_reqsk_alloc+0xd4/0x900 net/ipv4/syncookies.c:328
       cookie_tcp_check net/ipv4/syncookies.c:388 [inline]
       cookie_v4_check+0x289f/0x29e0 net/ipv4/syncookies.c:420
       tcp_v4_cookie_check net/ipv4/tcp_ipv4.c:1855 [inline]
       tcp_v4_do_rcv+0xb17/0x10b0 net/ipv4/tcp_ipv4.c:1914
       tcp_v4_rcv+0x4ce4/0x5420 net/ipv4/tcp_ipv4.c:2322
       ip_protocol_deliver_rcu+0x2a3/0x13d0 net/ipv4/ip_input.c:205
       ip_local_deliver_finish+0x332/0x500 net/ipv4/ip_input.c:233
       NF_HOOK include/linux/netfilter.h:314 [inline]
       ip_local_deliver+0x21f/0x490 net/ipv4/ip_input.c:254
       dst_input include/net/dst.h:460 [inline]
       ip_rcv_finish+0x4a2/0x520 net/ipv4/ip_input.c:449
       NF_HOOK include/linux/netfilter.h:314 [inline]
       ip_rcv+0xcd/0x380 net/ipv4/ip_input.c:569
       __netif_receive_skb_one_core net/core/dev.c:5538 [inline]
       __netif_receive_skb+0x319/0x9e0 net/core/dev.c:5652
       process_backlog+0x480/0x8b0 net/core/dev.c:5981
       __napi_poll+0xe7/0x980 net/core/dev.c:6632
       napi_poll net/core/dev.c:6701 [inline]
       net_rx_action+0x89d/0x1820 net/core/dev.c:6813
       __do_softirq+0x1c0/0x7d7 kernel/softirq.c:554
      
      Uninit was created at:
       __alloc_pages+0x9a7/0xe00 mm/page_alloc.c:4592
       __alloc_pages_node include/linux/gfp.h:238 [inline]
       alloc_pages_node include/linux/gfp.h:261 [inline]
       alloc_slab_page mm/slub.c:2175 [inline]
       allocate_slab mm/slub.c:2338 [inline]
       new_slab+0x2de/0x1400 mm/slub.c:2391
       ___slab_alloc+0x1184/0x33d0 mm/slub.c:3525
       __slab_alloc mm/slub.c:3610 [inline]
       __slab_alloc_node mm/slub.c:3663 [inline]
       slab_alloc_node mm/slub.c:3835 [inline]
       kmem_cache_alloc+0x6d3/0xbe0 mm/slub.c:3852
       reqsk_alloc include/net/request_sock.h:131 [inline]
       inet_reqsk_alloc+0x66/0x7a0 net/ipv4/tcp_input.c:6978
       tcp_conn_request+0x484/0x44e0 net/ipv4/tcp_input.c:7135
       tcp_v4_conn_request+0x16f/0x1d0 net/ipv4/tcp_ipv4.c:1716
       tcp_rcv_state_process+0x2e5/0x4bb0 net/ipv4/tcp_input.c:6655
       tcp_v4_do_rcv+0xbfd/0x10b0 net/ipv4/tcp_ipv4.c:1929
       tcp_v4_rcv+0x4ce4/0x5420 net/ipv4/tcp_ipv4.c:2322
       ip_protocol_deliver_rcu+0x2a3/0x13d0 net/ipv4/ip_input.c:205
       ip_local_deliver_finish+0x332/0x500 net/ipv4/ip_input.c:233
       NF_HOOK include/linux/netfilter.h:314 [inline]
       ip_local_deliver+0x21f/0x490 net/ipv4/ip_input.c:254
       dst_input include/net/dst.h:460 [inline]
       ip_sublist_rcv_finish net/ipv4/ip_input.c:580 [inline]
       ip_list_rcv_finish net/ipv4/ip_input.c:631 [inline]
       ip_sublist_rcv+0x15f3/0x17f0 net/ipv4/ip_input.c:639
       ip_list_rcv+0x9ef/0xa40 net/ipv4/ip_input.c:674
       __netif_receive_skb_list_ptype net/core/dev.c:5581 [inline]
       __netif_receive_skb_list_core+0x15c5/0x1670 net/core/dev.c:5629
       __netif_receive_skb_list net/core/dev.c:5681 [inline]
       netif_receive_skb_list_internal+0x106c/0x16f0 net/core/dev.c:5773
       gro_normal_list include/net/gro.h:438 [inline]
       napi_complete_done+0x425/0x880 net/core/dev.c:6113
       virtqueue_napi_complete drivers/net/virtio_net.c:465 [inline]
       virtnet_poll+0x149d/0x2240 drivers/net/virtio_net.c:2211
       __napi_poll+0xe7/0x980 net/core/dev.c:6632
       napi_poll net/core/dev.c:6701 [inline]
       net_rx_action+0x89d/0x1820 net/core/dev.c:6813
       __do_softirq+0x1c0/0x7d7 kernel/softirq.c:554
      
      CPU: 0 PID: 16792 Comm: syz-executor.2 Not tainted 6.8.0-syzkaller-05562-g61387b8d #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 02/29/2024
      
      Fixes: 695751e3 ("bpf: tcp: Handle BPF SYN Cookie in cookie_v[46]_check().")
      Reported-by: syzkaller <syzkaller@googlegroups.com>
      Reported-by: Eric Dumazet <edumazet@google.com>
      Closes: https://lore.kernel.org/bpf/CANn89iKdN9c+C_2JAUbc+VY3DDQjAQukMtiBbormAmAk9CdvQA@mail.gmail.com/
      Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Martin KaFai Lau <martin.lau@kernel.org>
      Link: https://lore.kernel.org/r/20240315224710.55209-1-kuniyu@amazon.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net/bnx2x: Prevent access to a freed page in page_pool · d27e2da9
      Thinh Tran authored
      Fix race condition leading to system crash during EEH error handling
      
      During EEH error recovery, the bnx2x driver's transmit timeout logic
      could cause a race condition when handling reset tasks. The
      bnx2x_tx_timeout() schedules reset tasks via bnx2x_sp_rtnl_task(),
      which ultimately leads to bnx2x_nic_unload(). In bnx2x_nic_unload()
      SGEs are freed using bnx2x_free_rx_sge_range(). However, this could
      overlap with the EEH driver's attempt to reset the device using
      bnx2x_io_slot_reset(), which also tries to free SGEs. This race
      condition can result in system crashes due to accessing freed memory
      locations in bnx2x_free_rx_sge():

      static inline void bnx2x_free_rx_sge(struct bnx2x *bp,
                                           struct bnx2x_fastpath *fp, u16 index)
      {
              struct sw_rx_page *sw_buf = &fp->rx_page_ring[index];
              struct page *page = sw_buf->page;
      ....
      where sw_buf was set to NULL after the call to dma_unmap_page()
      by the preceding thread.
      
          EEH: Beginning: 'slot_reset'
          PCI 0011:01:00.0#10000: EEH: Invoking bnx2x->slot_reset()
          bnx2x: [bnx2x_io_slot_reset:14228(eth1)]IO slot reset initializing...
          bnx2x 0011:01:00.0: enabling device (0140 -> 0142)
          bnx2x: [bnx2x_io_slot_reset:14244(eth1)]IO slot reset --> driver unload
          Kernel attempted to read user page (0) - exploit attempt? (uid: 0)
          BUG: Kernel NULL pointer dereference on read at 0x00000000
          Faulting instruction address: 0xc0080000025065fc
          Oops: Kernel access of bad area, sig: 11 [#1]
          .....
          Call Trace:
          [c000000003c67a20] [c00800000250658c] bnx2x_io_slot_reset+0x204/0x610 [bnx2x] (unreliable)
          [c000000003c67af0] [c0000000000518a8] eeh_report_reset+0xb8/0xf0
          [c000000003c67b60] [c000000000052130] eeh_pe_report+0x180/0x550
          [c000000003c67c70] [c00000000005318c] eeh_handle_normal_event+0x84c/0xa60
          [c000000003c67d50] [c000000000053a84] eeh_event_handler+0xf4/0x170
          [c000000003c67da0] [c000000000194c58] kthread+0x1c8/0x1d0
          [c000000003c67e10] [c00000000000cf64] ret_from_kernel_thread+0x5c/0x64
      
      To solve this issue, we need to verify page pool allocations before
      freeing.
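
A minimal sketch of the "verify before freeing" idea, assuming a reduced ring slot that only holds a page pointer. The names `sw_rx_page_sketch`, `free_rx_sge_sketch` and `pages_freed` are hypothetical, and the model covers only two release paths running one after the other; it does not by itself make truly concurrent callers safe.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical reduced ring slot; not the driver's real layout. */
struct sw_rx_page_sketch {
	void *page;
};

static int pages_freed;	/* counts real frees, for illustration only */

/* Free one slot. A NULL page means another path already released it,
 * so the function returns early instead of touching freed memory. */
static void free_rx_sge_sketch(struct sw_rx_page_sketch *ring, size_t index)
{
	struct sw_rx_page_sketch *sw_buf;

	assert(ring != NULL);
	sw_buf = &ring[index];
	if (!sw_buf->page)
		return;
	free(sw_buf->page);
	sw_buf->page = NULL;
	pages_freed++;
}
```

Calling the function twice on the same slot frees the page once; the second call is a no-op.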
      
      Fixes: 4cace675 ("bnx2x: Alloc 4k fragment for each rx ring buffer element")
      Signed-off-by: Thinh Tran <thinhtr@linux.ibm.com>
      Reviewed-by: Jiri Pirko <jiri@nvidia.com>
      Link: https://lore.kernel.org/r/20240315205535.1321-1-thinhtr@linux.ibm.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  2. 19 Mar, 2024 14 commits
  3. 18 Mar, 2024 9 commits
    • Revert "net: Re-use and set mono_delivery_time bit for userspace tstamp packets" · 35c3e279
      Abhishek Chauhan authored
      This reverts commit 885c36e5.
      
      That patch broke the bpf selftest test_tc_dtime: the uapi field
      __sk_buff->tstamp_type depends on skb->mono_delivery_time, which no
      longer necessarily means mono after the original fix, because the bit
      was re-used for userspace timestamps as well to avoid a tstamp reset in
      the forwarding path. To solve this, we need to keep mono_delivery_time
      as is, introduce another bit called user_delivery_time, and fall back
      to the initial proposal of setting the user_delivery_time bit based on
      the sk_clockid set from userspace.
      
      Fixes: 885c36e5 ("net: Re-use and set mono_delivery_time bit for userspace tstamp packets")
      Link: https://lore.kernel.org/netdev/bc037db4-58bb-4861-ac31-a361a93841d3@linux.dev/
      Signed-off-by: Abhishek Chauhan <quic_abchauha@quicinc.com>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Martin KaFai Lau <martin.lau@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dsa: mt7530: prevent possible incorrect XTAL frequency selection · f490c492
      Arınç ÜNAL authored
      On MT7530, the HT_XTAL_FSEL field of the HWTRAP register stores a 2-bit
      value that represents the frequency of the crystal oscillator connected to
      the switch IC. The field is populated by the state of the ESW_P4_LED_0 and
      ESW_P3_LED_0 pins, which is done right after reset is deasserted.
      
        ESW_P4_LED_0    ESW_P3_LED_0    Frequency
        -----------------------------------------
        0               0               Reserved
        0               1               20MHz
        1               0               40MHz
        1               1               25MHz
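
The table above can be read as a small decode function. This is a sketch under one assumption: ESW_P4_LED_0 is the high bit and ESW_P3_LED_0 the low bit of the 2-bit value, which matches the 0b11 = 25MHz and 0b10 = 40MHz values quoted in this message. The function name is hypothetical.

```c
#include <assert.h>

/* Decode the 2-bit strap value from the table.
 * Bit 1 is ESW_P4_LED_0 and bit 0 is ESW_P3_LED_0 (assumed ordering).
 * Returns the crystal frequency in MHz, or 0 for the reserved value. */
static unsigned int mt7530_xtal_mhz_sketch(unsigned int fsel)
{
	assert(fsel <= 3);
	switch (fsel) {
	case 1:
		return 20;
	case 2:
		return 40;
	case 3:
		return 25;
	default:
		return 0;	/* 0b00 is reserved */
	}
}
```

With a 40MHz crystal the strap should read 0b10; if ESW_P3_LED_0 is still high at sampling time the value becomes 0b11 and decodes as 25MHz.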
      
      On MT7531, the XTAL25 bit of the STRAP register stores this. The LAN0LED0
      pin is used to populate the bit. 25MHz when the pin is high, 40MHz when
      it's low.
      
      These pins are also used with LEDs, therefore, their state can be set to
      something other than the bootstrapping configuration. For example, a link
      may be established on port 3 before the DSA subdriver takes control of the
      switch which would set ESW_P3_LED_0 to high.
      
      Currently on mt7530_setup() and mt7531_setup(), 1000 - 1100 usec delay is
      described between reset assertion and deassertion. Some switch ICs in real
      life conditions cannot always have these pins set back to the bootstrapping
      configuration before reset deassertion in this amount of delay. This causes
      wrong crystal frequency to be selected which puts the switch in a
      nonfunctional state after reset deassertion.
      
      The tests below are conducted on an MT7530 with a 40MHz crystal oscillator
      by Justin Swartz.
      
      With a cable from an active peer connected to port 3 before reset, an
      incorrect crystal frequency (0b11 = 25MHz) is selected:
      
                            [1]                  [3]     [5]
                            :                    :       :
                    _____________________________         __________________
      ESW_P4_LED_0                               |_______|
                    _____________________________
      ESW_P3_LED_0                               |__________________________
      
                             :                  : :     :
                             :                  : [4]...:
                             :                  :
                             [2]................:
      
      [1] Reset is asserted.
      [2] Period of 1000 - 1100 usec.
      [3] Reset is deasserted.
      [4] Period of 315 usec. HWTRAP register is populated with incorrect
          XTAL frequency.
      [5] Signals reflect the bootstrapped configuration.
      
      Increase the delay between reset_control_assert() and
      reset_control_deassert(), and gpiod_set_value_cansleep(priv->reset, 0) and
      gpiod_set_value_cansleep(priv->reset, 1) to 5000 - 5100 usec. This amount
      ensures a higher possibility that the switch IC will have these pins back
      to the bootstrapping configuration before reset deassertion.
      
      With a cable from an active peer connected to port 3 before reset, the
      correct crystal frequency (0b10 = 40MHz) is selected:
      
                            [1]        [2-1]     [3]     [5]
                            :          :         :       :
                    _____________________________         __________________
      ESW_P4_LED_0                               |_______|
                    ___________________           _______
      ESW_P3_LED_0                     |_________|       |__________________
      
                             :          :       : :     :
                             :          [2-2]...: [4]...:
                             [2]................:
      
      [1] Reset is asserted.
      [2] Period of 5000 - 5100 usec.
      [2-1] ESW_P3_LED_0 goes low.
      [2-2] Remaining period of 5000 - 5100 usec.
      [3] Reset is deasserted.
      [4] Period of 310 usec. HWTRAP register is populated with bootstrapped
          XTAL frequency.
      [5] Signals reflect the bootstrapped configuration.
      
      ESW_P3_LED_0 low period before reset deassertion:
      
                    5000 usec
                  - 5100 usec
          TEST     RESET HOLD
             #         (usec)
        ---------------------
             1           5410
             2           5440
             3           4375
             4           5490
             5           5475
             6           4335
             7           4370
             8           5435
             9           4205
            10           4335
            11           3750
            12           3170
            13           4395
            14           4375
            15           3515
            16           4335
            17           4220
            18           4175
            19           4175
            20           4350
      
           Min           3170
           Max           5490
      
        Median       4342.500
           Avg       4466.500
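
The summary rows can be recomputed from the 20 samples. The sketch below copies the values from the table; the names `hold_usec`, `struct summary` and `summarize` are made up for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* The 20 measured values from the table, in usec. */
static int hold_usec[20] = {
	5410, 5440, 4375, 5490, 5475, 4335, 4370, 5435, 4205, 4335,
	3750, 3170, 4395, 4375, 3515, 4335, 4220, 4175, 4175, 4350,
};

struct summary {
	int min, max;
	double median, avg;
};

static int cmp_int(const void *a, const void *b)
{
	int x = *(const int *)a, y = *(const int *)b;

	return (x > y) - (x < y);
}

/* Sorts v in place, then derives min, max, median and mean. */
static struct summary summarize(int *v, size_t n)
{
	struct summary s;
	long sum = 0;
	size_t i;

	assert(v != NULL && n > 0);
	for (i = 0; i < n; i++)
		sum += v[i];
	qsort(v, n, sizeof(*v), cmp_int);
	s.min = v[0];
	s.max = v[n - 1];
	s.median = (n % 2) ? v[n / 2] : (v[n / 2 - 1] + v[n / 2]) / 2.0;
	s.avg = (double)sum / (double)n;
	return s;
}
```

Running `summarize(hold_usec, 20)` reproduces the Min, Max, Median and Avg rows of the table.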
      
      Revert commit 2920dd92 ("net: dsa: mt7530: disable LEDs before reset").
      Changing the state of pins via reset assertion is simpler and more
      efficient than doing so by setting the LED controller off.
      
      Fixes: b8f126a8 ("net-next: dsa: add dsa support for Mediatek MT7530 switch")
      Fixes: c288575f ("net: dsa: mt7530: Add the support of MT7531 switch")
      Co-developed-by: Justin Swartz <justin.swartz@risingedge.co.za>
      Signed-off-by: Justin Swartz <justin.swartz@risingedge.co.za>
      Signed-off-by: Arınç ÜNAL <arinc.unal@arinc9.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'veth-xdp-gro' · ba77f6e2
      David S. Miller authored
      Ignat Korchagin says:
      
      ====================
      net: veth: ability to toggle GRO and XDP independently
      
      It is rather confusing that GRO is automatically enabled, when an XDP program
      is attached to a veth interface. Moreover, it is not possible to disable GRO
      on a veth, if an XDP program is attached (which might be desirable in some use
      cases).
      
      Make GRO and XDP independent for a veth interface.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • selftests: net: veth: test the ability to independently manipulate GRO and XDP · ba5a6476
      Ignat Korchagin authored
      We should be able to independently flip either XDP or GRO states and toggling
      one should not affect the other.
      
      Adjust other tests as well that had implicit expectation that GRO would be
      automatically enabled.
      Signed-off-by: Ignat Korchagin <ignat@cloudflare.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: veth: do not manipulate GRO when using XDP · d7db7775
      Ignat Korchagin authored
      Commit d3256efd ("veth: allow enabling NAPI even without XDP") tried to fix
      the fact that GRO was not possible without XDP, because veth did not use NAPI
      without XDP. However, it also introduced the behaviour that GRO is always
      enabled, when XDP is enabled.
      
      While it might be desired for most cases, it is confusing for the user at best
      as the GRO flag suddenly changes, when an XDP program is attached. It also
      introduces some complexities in state management as was partially addressed in
      commit fe9f8013 ("net: veth: clear GRO when clearing XDP even when down").
      
      But the biggest problem is that it is not possible to disable GRO at all, when
      an XDP program is attached, which might be needed for some use cases.
      
      Fix this by not touching the GRO flag on XDP enable/disable as the code already
      supports switching to NAPI if either GRO or XDP is requested.
      
      Link: https://lore.kernel.org/lkml/20240311124015.38106-1-ignat@cloudflare.com/
      Fixes: d3256efd ("veth: allow enabling NAPI even without XDP")
      Fixes: fe9f8013 ("net: veth: clear GRO when clearing XDP even when down")
      Signed-off-by: Ignat Korchagin <ignat@cloudflare.com>
      Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • xfrm: Allow UDP encapsulation only in offload modes · 773bb766
      Leon Romanovsky authored
      The missing check of x->encap led to a situation where GSO packets
      were created with UDP encapsulation.

      As a solution, restore the encap check for non-offloaded SAs.
      
      Fixes: 983a73da ("xfrm: Pass UDP encapsulation in TX packet offload")
      Closes: https://lore.kernel.org/all/a650221ae500f0c7cf496c61c96c1b103dcb6f67.camel@redhat.com
      Reported-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
      Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
    • net: esp: fix bad handling of pages from page_pool · c3198822
      Dragos Tatulea authored
      When the skb is reorganized during esp_output (!esp->inline), the pages
      coming from the original skb fragments are supposed to be released back
      to the system through put_page. But if the skb fragment pages are
      originating from a page_pool, calling put_page on them will trigger a
      page_pool leak which will eventually result in a crash.
      
      This leak can be easily observed when using CONFIG_DEBUG_VM and doing
      ipsec + gre (non offloaded) forwarding:
      
        BUG: Bad page state in process ksoftirqd/16  pfn:1451b6
        page:00000000de2b8d32 refcount:0 mapcount:0 mapping:0000000000000000 index:0x1451b6000 pfn:0x1451b6
        flags: 0x200000000000000(node=0|zone=2)
        page_type: 0xffffffff()
        raw: 0200000000000000 dead000000000040 ffff88810d23c000 0000000000000000
        raw: 00000001451b6000 0000000000000001 00000000ffffffff 0000000000000000
        page dumped because: page_pool leak
        Modules linked in: ip_gre gre mlx5_ib mlx5_core xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink iptable_nat nf_nat xt_addrtype br_netfilter rpcrdma rdma_ucm ib_iser libiscsi scsi_transport_iscsi ib_umad rdma_cm ib_ipoib iw_cm ib_cm ib_uverbs ib_core overlay zram zsmalloc fuse [last unloaded: mlx5_core]
        CPU: 16 PID: 96 Comm: ksoftirqd/16 Not tainted 6.8.0-rc4+ #22
        Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
        Call Trace:
         <TASK>
         dump_stack_lvl+0x36/0x50
         bad_page+0x70/0xf0
         free_unref_page_prepare+0x27a/0x460
         free_unref_page+0x38/0x120
         esp_ssg_unref.isra.0+0x15f/0x200
         esp_output_tail+0x66d/0x780
         esp_xmit+0x2c5/0x360
         validate_xmit_xfrm+0x313/0x370
         ? validate_xmit_skb+0x1d/0x330
         validate_xmit_skb_list+0x4c/0x70
         sch_direct_xmit+0x23e/0x350
         __dev_queue_xmit+0x337/0xba0
         ? nf_hook_slow+0x3f/0xd0
         ip_finish_output2+0x25e/0x580
         iptunnel_xmit+0x19b/0x240
         ip_tunnel_xmit+0x5fb/0xb60
         ipgre_xmit+0x14d/0x280 [ip_gre]
         dev_hard_start_xmit+0xc3/0x1c0
         __dev_queue_xmit+0x208/0xba0
         ? nf_hook_slow+0x3f/0xd0
         ip_finish_output2+0x1ca/0x580
         ip_sublist_rcv_finish+0x32/0x40
         ip_sublist_rcv+0x1b2/0x1f0
         ? ip_rcv_finish_core.constprop.0+0x460/0x460
         ip_list_rcv+0x103/0x130
         __netif_receive_skb_list_core+0x181/0x1e0
         netif_receive_skb_list_internal+0x1b3/0x2c0
         napi_gro_receive+0xc8/0x200
         gro_cell_poll+0x52/0x90
         __napi_poll+0x25/0x1a0
         net_rx_action+0x28e/0x300
         __do_softirq+0xc3/0x276
         ? sort_range+0x20/0x20
         run_ksoftirqd+0x1e/0x30
         smpboot_thread_fn+0xa6/0x130
         kthread+0xcd/0x100
         ? kthread_complete_and_exit+0x20/0x20
         ret_from_fork+0x31/0x50
         ? kthread_complete_and_exit+0x20/0x20
         ret_from_fork_asm+0x11/0x20
         </TASK>
      
      The suggested fix is to introduce a new wrapper (skb_page_unref) that
      covers page refcounting for page_pool pages as well.
      
      Cc: stable@vger.kernel.org
      Fixes: 6a5bcd84 ("page_pool: Allow drivers to hint on SKB recycling")
      Reported-and-tested-by: Anatoli N.Chechelnickiy <Anatoli.Chechelnickiy@m.interpipe.biz>
      Reported-by: Ian Kumlien <ian.kumlien@gmail.com>
      Link: https://lore.kernel.org/netdev/CAA85sZvvHtrpTQRqdaOx6gd55zPAVsqMYk_Lwh4Md5knTq7AyA@mail.gmail.com
      Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
      Reviewed-by: Mina Almasry <almasrymina@google.com>
      Reviewed-by: Jakub Kicinski <kuba@kernel.org>
      Acked-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
      Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
    • packet: annotate data-races around ignore_outgoing · 6ebfad33
      Eric Dumazet authored
      ignore_outgoing is read locklessly from dev_queue_xmit_nit()
      and packet_getsockopt().
      
      Add appropriate READ_ONCE()/WRITE_ONCE() annotations.
      
      syzbot reported:
      
      BUG: KCSAN: data-race in dev_queue_xmit_nit / packet_setsockopt
      
      write to 0xffff888107804542 of 1 bytes by task 22618 on cpu 0:
       packet_setsockopt+0xd83/0xfd0 net/packet/af_packet.c:4003
       do_sock_setsockopt net/socket.c:2311 [inline]
       __sys_setsockopt+0x1d8/0x250 net/socket.c:2334
       __do_sys_setsockopt net/socket.c:2343 [inline]
       __se_sys_setsockopt net/socket.c:2340 [inline]
       __x64_sys_setsockopt+0x66/0x80 net/socket.c:2340
       do_syscall_64+0xd3/0x1d0
       entry_SYSCALL_64_after_hwframe+0x6d/0x75
      
      read to 0xffff888107804542 of 1 bytes by task 27 on cpu 1:
       dev_queue_xmit_nit+0x82/0x620 net/core/dev.c:2248
       xmit_one net/core/dev.c:3527 [inline]
       dev_hard_start_xmit+0xcc/0x3f0 net/core/dev.c:3547
       __dev_queue_xmit+0xf24/0x1dd0 net/core/dev.c:4335
       dev_queue_xmit include/linux/netdevice.h:3091 [inline]
       batadv_send_skb_packet+0x264/0x300 net/batman-adv/send.c:108
       batadv_send_broadcast_skb+0x24/0x30 net/batman-adv/send.c:127
       batadv_iv_ogm_send_to_if net/batman-adv/bat_iv_ogm.c:392 [inline]
       batadv_iv_ogm_emit net/batman-adv/bat_iv_ogm.c:420 [inline]
       batadv_iv_send_outstanding_bat_ogm_packet+0x3f0/0x4b0 net/batman-adv/bat_iv_ogm.c:1700
       process_one_work kernel/workqueue.c:3254 [inline]
       process_scheduled_works+0x465/0x990 kernel/workqueue.c:3335
       worker_thread+0x526/0x730 kernel/workqueue.c:3416
       kthread+0x1d1/0x210 kernel/kthread.c:388
       ret_from_fork+0x4b/0x60 arch/x86/kernel/process.c:147
       ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:243
      
      value changed: 0x00 -> 0x01
      
      Reported by Kernel Concurrency Sanitizer on:
      CPU: 1 PID: 27 Comm: kworker/u8:1 Tainted: G        W          6.8.0-syzkaller-08073-g480e035f #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 02/29/2024
      Workqueue: bat_events batadv_iv_send_outstanding_bat_ogm_packet
      
      Fixes: fa788d98 ("packet: add sockopt to ignore outgoing packets")
      Reported-by: syzbot+c669c1136495a2e7c31f@syzkaller.appspotmail.com
      Closes: https://lore.kernel.org/netdev/CANn89i+Z7MfbkBLOv=p7KZ7=K1rKHO4P1OL5LYDCtBiyqsa9oQ@mail.gmail.com/T/#t
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
      Reviewed-by: Willem de Bruijn <willemb@google.com>
      Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Herve Codina's avatar
      net: wan: fsl_qmc_hdlc: Fix module compilation · badc9e33
      Herve Codina authored
      The fsl_qmc_hdlc driver does not compile as a module:
        error: ‘qmc_hdlc_driver’ undeclared here (not in a function);
          405 | MODULE_DEVICE_TABLE(of, qmc_hdlc_driver);
              |                         ^~~~~~~~~~~~~~~
      
      Fix the typo.
      
      Fixes: b40f00ecd463 ("net: wan: Add support for QMC HDLC")
      Reported-by: Michael Ellerman <mpe@ellerman.id.au>
      Closes: https://lore.kernel.org/linux-kernel/87ttl93f7i.fsf@mail.lhotse/
      Signed-off-by: Herve Codina <herve.codina@bootlin.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      badc9e33
  4. 15 Mar, 2024 2 commits
  5. 14 Mar, 2024 9 commits
    • Jens Axboe's avatar
      net: remove {recv,send}msg_copy_msghdr() from exports · e54e09c0
      Jens Axboe authored
      The only user of these was io_uring, and it's not using them anymore.
      Make them static and remove them from the socket header file.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Link: https://lore.kernel.org/r/1b6089d3-c1cf-464a-abd3-b0f0b6bb2523@kernel.dk
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      e54e09c0
    • Duanqiang Wen's avatar
      net: txgbe: fix clk_name exceeding MAX_DEV_ID limits · e30cef00
      Duanqiang Wen authored
      txgbe registers a clock named i2c_designware.pci_dev_id(), and
      clk_name is stored in clk_lookup_alloc. If the PCIe bus number
      is larger than 0x39, clk_name is longer than 20 bytes, which
      exceeds the MAX_DEV_ID limit of clk_lookup_alloc. So shorten
      clk_name in the driver.
      
      Fixes: b63f2048 ("net: txgbe: Register fixed rate clock")
      Signed-off-by: Duanqiang Wen <duanqiangwen@net-swift.com>
      Reviewed-by: Michal Kubiak <michal.kubiak@intel.com>
      Link: https://lore.kernel.org/r/20240313080634.459523-1-duanqiangwen@net-swift.com
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      e30cef00
    • Jakub Kicinski's avatar
      docs: networking: fix indentation errors in multi-pf-netdev · 1c636867
      Jakub Kicinski authored
      Stephen reports new warnings in the docs:
      
      Documentation/networking/multi-pf-netdev.rst:94: ERROR: Unexpected indentation.
      Documentation/networking/multi-pf-netdev.rst:106: ERROR: Unexpected indentation.
      
      Fixes: 77d9ec3f ("Documentation: networking: Add description for multi-pf netdev")
      Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
      Link: https://lore.kernel.org/all/20240312153304.0ef1b78e@canb.auug.org.au/
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
      Link: https://lore.kernel.org/r/20240313032329.3919036-1-kuba@kernel.org
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      1c636867
    • Paolo Abeni's avatar
      Merge branch 'rxrpc-fixes-for-af_rxrpc' · 7278c70a
      Paolo Abeni authored
      David Howells says:
      
      ====================
      rxrpc: Fixes for AF_RXRPC
      
      Here are a couple of fixes for the AF_RXRPC changes[1] in net-next.
      
       (1) Fix a runtime warning introduced by a patch that changed how
           page_frag_alloc_align() works.
      
       (2) Fix an is-NULL vs IS_ERR error handling bug.
      
      The patches are tagged here:
      
      	git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git tags/rxrpc-iothread-20240312
      
      And can be found on this branch:
      
      	http://git.kernel.org/cgit/linux/kernel/git/dhowells/linux-fs.git/log/?h=rxrpc-iothread
      
      Link: https://lore.kernel.org/r/20240306000655.1100294-1-dhowells@redhat.com/ [1]
      ====================
      
      Link: https://lore.kernel.org/r/20240312233723.2984928-1-dhowells@redhat.com
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      7278c70a
    • David Howells's avatar
      rxrpc: Fix error check on ->alloc_txbuf() · 89e43541
      David Howells authored
      rxrpc_alloc_*_txbuf() and ->alloc_txbuf() return NULL to indicate no
      memory, but rxrpc_send_data() uses IS_ERR().
      
      Fix rxrpc_send_data() to check for NULL only and set -ENOMEM if it sees
      that.
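
The mismatch described above can be sketched in userspace C. This is only an illustration, not the rxrpc code: is_err() is a stand-in modelled on the kernel's IS_ERR(), and alloc_buf(), send_buggy() and send_fixed() are hypothetical names. It shows that an IS_ERR()-style test never fires for NULL, so an allocator that reports failure as NULL needs a plain NULL check.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Userspace stand-in for the kernel's IS_ERR(): true only for pointers in
 * the top MAX_ERRNO bytes of the address space. NULL is not in that range. */
#define MAX_ERRNO 4095

static int is_err(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/* Hypothetical allocator that reports failure as NULL, as the commit
 * message says ->alloc_txbuf() does. */
static void *alloc_buf(size_t size, int simulate_failure)
{
	return simulate_failure ? NULL : malloc(size);
}

/* Buggy pattern: the IS_ERR()-style check lets NULL through as success. */
static int send_buggy(int simulate_failure)
{
	void *buf = alloc_buf(64, simulate_failure);

	if (is_err(buf))
		return -ENOMEM;
	free(buf);	/* free(NULL) is a no-op; real code would use buf here */
	return 0;
}

/* Fixed pattern: check for NULL and map it to -ENOMEM. */
static int send_fixed(int simulate_failure)
{
	void *buf = alloc_buf(64, simulate_failure);

	if (!buf)
		return -ENOMEM;
	free(buf);
	return 0;
}
```

With a simulated allocation failure, send_buggy() reports success while send_fixed() returns -ENOMEM.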
      
      Fixes: 49489bb0 ("rxrpc: Do zerocopy using MSG_SPLICE_PAGES and page frags")
      Signed-off-by: David Howells <dhowells@redhat.com>
      Reported-by: Marc Dionne <marc.dionne@auristor.com>
      cc: "David S. Miller" <davem@davemloft.net>
      cc: Eric Dumazet <edumazet@google.com>
      cc: Jakub Kicinski <kuba@kernel.org>
      cc: Paolo Abeni <pabeni@redhat.com>
      cc: linux-afs@lists.infradead.org
      cc: netdev@vger.kernel.org
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      89e43541
    • David Howells's avatar
      rxrpc: Fix use of changed alignment param to page_frag_alloc_align() · 6b253646
      David Howells authored
      Commit 411c5f36 ("mm/page_alloc: modify page_frag_alloc_align() to
      accept align as an argument") changed the way page_frag_alloc_align()
      worked, but it didn't fix AF_RXRPC as that use of that allocator function
      hadn't been merged yet at the time.  Now, when the AFS filesystem is used,
      this results in:
      
        WARNING: CPU: 4 PID: 379 at include/linux/gfp.h:323 rxrpc_alloc_data_txbuf+0x9d/0x2b0 [rxrpc]
      
      Fix this by using __page_frag_alloc_align() instead.
      
      Note that it might be better to use an order-based alignment rather than a
      mask-based alignment.
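
The three ways of expressing an alignment mentioned above can be compared in a short sketch. It assumes the warning comes from a power-of-two check on the align argument, which the message does not spell out. The helper names are hypothetical and none of this is the kernel's allocator code.

```c
#include <stdbool.h>
#include <stddef.h>

/* Power-of-two test: exactly one bit set. */
static bool is_power_of_2(size_t n)
{
	return n != 0 && (n & (n - 1)) == 0;
}

/* Mask-based form: the caller passes ~(align - 1), which equals -align. */
static size_t align_down_mask(size_t offset, size_t align_mask)
{
	return offset & align_mask;
}

/* Align-based form: the caller passes the alignment and the mask is
 * derived internally. Passing a mask here by mistake would be caught by
 * an is_power_of_2() check, since a mask such as ~63 has many bits set. */
static size_t align_down(size_t offset, size_t align)
{
	return align_down_mask(offset, -align);
}

/* Order-based form: the alignment is given as log2, so every input maps
 * to a power of two by construction. */
static size_t align_down_order(size_t offset, unsigned int order)
{
	return align_down(offset, (size_t)1 << order);
}
```

All three forms round an offset of 200 down to 192 for a 64-byte alignment. Only the value handed over by the caller differs: 64, ~63, or 6.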
      
      Fixes: 49489bb0 ("rxrpc: Do zerocopy using MSG_SPLICE_PAGES and page frags")
      Signed-off-by: David Howells <dhowells@redhat.com>
      Reported-by: Marc Dionne <marc.dionne@auristor.com>
      cc: Yunsheng Lin <linyunsheng@huawei.com>
      cc: Alexander Duyck <alexander.duyck@gmail.com>
      cc: Michael S. Tsirkin <mst@redhat.com>
      cc: "David S. Miller" <davem@davemloft.net>
      cc: Eric Dumazet <edumazet@google.com>
      cc: Jakub Kicinski <kuba@kernel.org>
      cc: Paolo Abeni <pabeni@redhat.com>
      cc: linux-afs@lists.infradead.org
      cc: netdev@vger.kernel.org
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      6b253646
    • Shigeru Yoshida's avatar
      hsr: Fix uninit-value access in hsr_get_node() · ddbec99f
      Shigeru Yoshida authored
      KMSAN reported the following uninit-value access issue [1]:
      
      =====================================================
      BUG: KMSAN: uninit-value in hsr_get_node+0xa2e/0xa40 net/hsr/hsr_framereg.c:246
       hsr_get_node+0xa2e/0xa40 net/hsr/hsr_framereg.c:246
       fill_frame_info net/hsr/hsr_forward.c:577 [inline]
       hsr_forward_skb+0xe12/0x30e0 net/hsr/hsr_forward.c:615
       hsr_dev_xmit+0x1a1/0x270 net/hsr/hsr_device.c:223
       __netdev_start_xmit include/linux/netdevice.h:4940 [inline]
       netdev_start_xmit include/linux/netdevice.h:4954 [inline]
       xmit_one net/core/dev.c:3548 [inline]
       dev_hard_start_xmit+0x247/0xa10 net/core/dev.c:3564
       __dev_queue_xmit+0x33b8/0x5130 net/core/dev.c:4349
       dev_queue_xmit include/linux/netdevice.h:3134 [inline]
       packet_xmit+0x9c/0x6b0 net/packet/af_packet.c:276
       packet_snd net/packet/af_packet.c:3087 [inline]
       packet_sendmsg+0x8b1d/0x9f30 net/packet/af_packet.c:3119
       sock_sendmsg_nosec net/socket.c:730 [inline]
       __sock_sendmsg net/socket.c:745 [inline]
       __sys_sendto+0x735/0xa10 net/socket.c:2191
       __do_sys_sendto net/socket.c:2203 [inline]
       __se_sys_sendto net/socket.c:2199 [inline]
       __x64_sys_sendto+0x125/0x1c0 net/socket.c:2199
       do_syscall_x64 arch/x86/entry/common.c:52 [inline]
       do_syscall_64+0x6d/0x140 arch/x86/entry/common.c:83
       entry_SYSCALL_64_after_hwframe+0x63/0x6b
      
      Uninit was created at:
       slab_post_alloc_hook+0x129/0xa70 mm/slab.h:768
       slab_alloc_node mm/slub.c:3478 [inline]
       kmem_cache_alloc_node+0x5e9/0xb10 mm/slub.c:3523
       kmalloc_reserve+0x13d/0x4a0 net/core/skbuff.c:560
       __alloc_skb+0x318/0x740 net/core/skbuff.c:651
       alloc_skb include/linux/skbuff.h:1286 [inline]
       alloc_skb_with_frags+0xc8/0xbd0 net/core/skbuff.c:6334
       sock_alloc_send_pskb+0xa80/0xbf0 net/core/sock.c:2787
       packet_alloc_skb net/packet/af_packet.c:2936 [inline]
       packet_snd net/packet/af_packet.c:3030 [inline]
       packet_sendmsg+0x70e8/0x9f30 net/packet/af_packet.c:3119
       sock_sendmsg_nosec net/socket.c:730 [inline]
       __sock_sendmsg net/socket.c:745 [inline]
       __sys_sendto+0x735/0xa10 net/socket.c:2191
       __do_sys_sendto net/socket.c:2203 [inline]
       __se_sys_sendto net/socket.c:2199 [inline]
       __x64_sys_sendto+0x125/0x1c0 net/socket.c:2199
       do_syscall_x64 arch/x86/entry/common.c:52 [inline]
       do_syscall_64+0x6d/0x140 arch/x86/entry/common.c:83
       entry_SYSCALL_64_after_hwframe+0x63/0x6b
      
      CPU: 1 PID: 5033 Comm: syz-executor334 Not tainted 6.7.0-syzkaller-00562-g9f8413c4 #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 11/17/2023
      =====================================================
      
      If the packet type ID field in the Ethernet header is either ETH_P_PRP or
      ETH_P_HSR, but it is not followed by an HSR tag, hsr_get_skb_sequence_nr()
      reads an invalid value as a sequence number. This causes the above issue.
      
      This patch fixes the issue by returning NULL if the Ethernet header is not
      followed by an HSR tag.
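
The kind of check described above can be sketched on a raw byte buffer. This is a simplified userspace illustration, not the hsr driver code: get_hsr_sequence() is a hypothetical helper, the 6-byte tag layout is an assumption, and only the ETH_P_HSR case is modelled.

```c
#include <stddef.h>
#include <stdint.h>

#define ETH_HLEN    14		/* destination MAC + source MAC + EtherType */
#define HSR_TAG_LEN 6		/* assumed: path/size (2) + sequence (2) + proto (2) */
#define ETH_P_HSR   0x892F

/* Returns 0 and stores the sequence number when the frame is long enough to
 * hold an HSR tag after the Ethernet header. Returns -1 otherwise, without
 * reading bytes beyond the length the sender supplied. */
static int get_hsr_sequence(const uint8_t *frame, size_t len, uint16_t *seq)
{
	uint16_t ethertype;

	if (len < ETH_HLEN)
		return -1;
	ethertype = (uint16_t)((frame[12] << 8) | frame[13]);
	if (ethertype != ETH_P_HSR)
		return -1;
	if (len < ETH_HLEN + HSR_TAG_LEN)
		return -1;	/* EtherType claims HSR, but no tag follows */
	*seq = (uint16_t)((frame[ETH_HLEN + 2] << 8) | frame[ETH_HLEN + 3]);
	return 0;
}
```

A 14-byte frame whose EtherType is ETH_P_HSR is rejected, and the bytes past its end are never interpreted as a sequence number.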
      
      Fixes: f266a683 ("net/hsr: Better frame dispatch")
      Reported-and-tested-by: syzbot+2ef3a8ce8e91b5a50098@syzkaller.appspotmail.com
      Closes: https://syzkaller.appspot.com/bug?extid=2ef3a8ce8e91b5a50098 [1]
      Signed-off-by: Shigeru Yoshida <syoshida@redhat.com>
      Link: https://lore.kernel.org/r/20240312152719.724530-1-syoshida@redhat.com
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      ddbec99f
    • William Tu's avatar
      vmxnet3: Fix missing reserved tailroom · e127ce76
      William Tu authored
      Use rbi->len instead of rcd->len for non-dataring packets.
      
      Found issue:
        XDP_WARN: xdp_update_frame_from_buff(line:278): Driver BUG: missing reserved tailroom
        WARNING: CPU: 0 PID: 0 at net/core/xdp.c:586 xdp_warn+0xf/0x20
        CPU: 0 PID: 0 Comm: swapper/0 Tainted: G        W  O       6.5.1 #1
        RIP: 0010:xdp_warn+0xf/0x20
        ...
        ? xdp_warn+0xf/0x20
        xdp_do_redirect+0x15f/0x1c0
        vmxnet3_run_xdp+0x17a/0x400 [vmxnet3]
        vmxnet3_process_xdp+0xe4/0x760 [vmxnet3]
        ? vmxnet3_tq_tx_complete.isra.0+0x21e/0x2c0 [vmxnet3]
        vmxnet3_rq_rx_complete+0x7ad/0x1120 [vmxnet3]
        vmxnet3_poll_rx_only+0x2d/0xa0 [vmxnet3]
        __napi_poll+0x20/0x180
        net_rx_action+0x177/0x390
      Reported-by: Martin Zaharinov <micron10@gmail.com>
      Tested-by: Martin Zaharinov <micron10@gmail.com>
      Link: https://lore.kernel.org/netdev/74BF3CC8-2A3A-44FF-98C2-1E20F110A92E@gmail.com/
      Fixes: 54f00cce ("vmxnet3: Add XDP support.")
      Signed-off-by: William Tu <witu@nvidia.com>
      Link: https://lore.kernel.org/r/20240309183147.28222-1-witu@nvidia.com
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      e127ce76
    • Kuniyuki Iwashima's avatar
      tcp: Fix refcnt handling in __inet_hash_connect(). · 04d9d1fc
      Kuniyuki Iwashima authored
      syzbot reported a warning in sk_nulls_del_node_init_rcu().
      
      Commit 66b60b0c ("dccp/tcp: Unhash sk from ehash for tb2 alloc
      failure after check_estalblished().") tried to fix an issue where an
      unconnected socket occupies an ehash entry when bhash2 allocation fails.
      
      In such a case, we need to revert changes done by check_established(),
      which does not hold refcnt when inserting socket into ehash.
      
      So, to revert the change, we need to use __sk_nulls_add_node_rcu()
      instead of sk_nulls_add_node_rcu().
      
      Otherwise, sock_put() will cause refcnt underflow and leak the socket.
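
The pairing rule behind this fix can be shown with a toy model. Everything below is hypothetical (struct toy_sock and the toy_* helpers are not kernel APIs). The sketch only shows that an insertion which takes no reference must be undone by a removal which drops none.

```c
#include <stdbool.h>

/* Hypothetical stand-in for a socket: a reference count and a hashed flag. */
struct toy_sock {
	int refcnt;
	bool hashed;
};

/* Insert without taking a reference, mirroring the description of
 * check_established() above. */
static void toy_hash_noref(struct toy_sock *sk)
{
	sk->hashed = true;
}

/* Removal that also drops a reference. Only correct when the insertion
 * took one. */
static void toy_unhash_put(struct toy_sock *sk)
{
	sk->hashed = false;
	sk->refcnt--;
}

/* Removal that leaves the count alone: the matching undo for
 * toy_hash_noref(). */
static void toy_unhash_noref(struct toy_sock *sk)
{
	sk->hashed = false;
}
```

Starting from a count of 1 held by the caller, toy_hash_noref() followed by toy_unhash_put() leaves the count at 0 while the caller still believes it owns a reference. Pairing it with toy_unhash_noref() keeps the count at 1.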
      
      [0]:
      WARNING: CPU: 0 PID: 23948 at include/net/sock.h:799 sk_nulls_del_node_init_rcu+0x166/0x1a0 include/net/sock.h:799
      Modules linked in:
      CPU: 0 PID: 23948 Comm: syz-executor.2 Not tainted 6.8.0-rc6-syzkaller-00159-gc055fc00 #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/25/2024
      RIP: 0010:sk_nulls_del_node_init_rcu+0x166/0x1a0 include/net/sock.h:799
      Code: e8 7f 71 c6 f7 83 fb 02 7c 25 e8 35 6d c6 f7 4d 85 f6 0f 95 c0 5b 41 5c 41 5d 41 5e 41 5f 5d c3 cc cc cc cc e8 1b 6d c6 f7 90 <0f> 0b 90 eb b2 e8 10 6d c6 f7 4c 89 e7 be 04 00 00 00 e8 63 e7 d2
      RSP: 0018:ffffc900032d7848 EFLAGS: 00010246
      RAX: ffffffff89cd0035 RBX: 0000000000000001 RCX: 0000000000040000
      RDX: ffffc90004de1000 RSI: 000000000003ffff RDI: 0000000000040000
      RBP: 1ffff1100439ac26 R08: ffffffff89ccffe3 R09: 1ffff1100439ac28
      R10: dffffc0000000000 R11: ffffed100439ac29 R12: ffff888021cd6140
      R13: dffffc0000000000 R14: ffff88802a9bf5c0 R15: ffff888021cd6130
      FS:  00007f3b823f16c0(0000) GS:ffff8880b9400000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00007f3b823f0ff8 CR3: 000000004674a000 CR4: 00000000003506f0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      Call Trace:
       <TASK>
       __inet_hash_connect+0x140f/0x20b0 net/ipv4/inet_hashtables.c:1139
       dccp_v6_connect+0xcb9/0x1480 net/dccp/ipv6.c:956
       __inet_stream_connect+0x262/0xf30 net/ipv4/af_inet.c:678
       inet_stream_connect+0x65/0xa0 net/ipv4/af_inet.c:749
       __sys_connect_file net/socket.c:2048 [inline]
       __sys_connect+0x2df/0x310 net/socket.c:2065
       __do_sys_connect net/socket.c:2075 [inline]
       __se_sys_connect net/socket.c:2072 [inline]
       __x64_sys_connect+0x7a/0x90 net/socket.c:2072
       do_syscall_64+0xf9/0x240
       entry_SYSCALL_64_after_hwframe+0x6f/0x77
      RIP: 0033:0x7f3b8167dda9
      Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 e1 20 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b0 ff ff ff f7 d8 64 89 01 48
      RSP: 002b:00007f3b823f10c8 EFLAGS: 00000246 ORIG_RAX: 000000000000002a
      RAX: ffffffffffffffda RBX: 00007f3b817abf80 RCX: 00007f3b8167dda9
      RDX: 000000000000001c RSI: 0000000020000040 RDI: 0000000000000003
      RBP: 00007f3b823f1120 R08: 0000000000000000 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000001
      R13: 000000000000000b R14: 00007f3b817abf80 R15: 00007ffd3beb57b8
       </TASK>
      
      Reported-by: syzbot+12c506c1aae251e70449@syzkaller.appspotmail.com
      Closes: https://syzkaller.appspot.com/bug?extid=12c506c1aae251e70449
      Fixes: 66b60b0c ("dccp/tcp: Unhash sk from ehash for tb2 alloc failure after check_estalblished().")
      Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Link: https://lore.kernel.org/r/20240308201623.65448-1-kuniyu@amazon.com
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      04d9d1fc
  6. 13 Mar, 2024 1 commit
    • Shay Drory's avatar
      devlink: Fix devlink parallel commands processing · d7d75124
      Shay Drory authored
      Commit 870c7ad4 ("devlink: protect devlink->dev by the instance
      lock") added devlink instance locking inside a loop that iterates over
      all the registered devlink instances on the machine in the pre-doit
      phase. This can lead to serialization of devlink commands over
      different devlink instances.
      
      For example: While the first devlink instance is executing firmware
      flash, all commands to other devlink instances on the machine are
      forced to wait until the first devlink finishes.
      
      Therefore, in the pre-doit phase, take the devlink instance lock only
      for the devlink instance the command is targeting. The devlink layer
      takes a reference on the devlink instance, ensuring the devlink->dev
      pointer is valid. This reference taking was introduced by commit
      a3806872 ("devlink: take device reference for devlink object").
      Without that commit, it would not be safe to access devlink->dev
      without the lock.
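
The change in lookup shape can be sketched with a single-threaded toy model. The struct and function names are hypothetical, a boolean stands in for the instance lock, and a counter records how often each instance was locked. It is not the devlink code.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a devlink instance. */
struct toy_devlink {
	int id;
	int refcnt;
	bool locked;		/* stand-in for the instance lock */
	int lock_count;		/* number of times the lock was taken */
};

static void toy_lock(struct toy_devlink *d)
{
	d->locked = true;
	d->lock_count++;
}

static void toy_unlock(struct toy_devlink *d)
{
	d->locked = false;
}

/* Earlier shape: lock each instance in turn just to compare it, so a busy
 * instance would stall lookups aimed at any instance behind it. */
static struct toy_devlink *get_locked_scan(struct toy_devlink *all, size_t n, int id)
{
	for (size_t i = 0; i < n; i++) {
		all[i].refcnt++;
		toy_lock(&all[i]);
		if (all[i].id == id)
			return &all[i];	/* returned locked, reference held */
		toy_unlock(&all[i]);
		all[i].refcnt--;
	}
	return NULL;
}

/* Fixed shape: compare under a reference only, then lock the target. */
static struct toy_devlink *get_locked_target(struct toy_devlink *all, size_t n, int id)
{
	for (size_t i = 0; i < n; i++) {
		all[i].refcnt++;
		if (all[i].id == id) {
			toy_lock(&all[i]);
			return &all[i];	/* returned locked, reference held */
		}
		all[i].refcnt--;
	}
	return NULL;
}
```

Looking up the last of three instances touches the lock of all three in the first shape and only the target's lock in the second.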
      
      Fixes: 870c7ad4 ("devlink: protect devlink->dev by the instance lock")
      Signed-off-by: Shay Drory <shayd@nvidia.com>
      Reviewed-by: Jiri Pirko <jiri@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      d7d75124