  1. 13 Dec, 2023 7 commits
    • tcp: disable tcp_autocorking for socket when TCP_NODELAY flag is set · f3f32a35
      Salvatore Dipietro authored
      Based on the tcp man page, if TCP_NODELAY is set, it disables Nagle's algorithm
      and packets are sent as soon as possible. However in the `tcp_push` function
      where autocorking is evaluated the `nonagle` value set by TCP_NODELAY is not
      considered which can trigger unexpected corking of packets and induce delays.
      
      For example, if a server's reply is generated as two packets and the first
      one is not transmitted on the wire quickly enough, the second packet can
      trigger autocorking in `tcp_push` and be delayed instead of being sent as
      soon as possible. It then waits either for additional packets to coalesce
      with or for an ACK from the client before the corked packet is transmitted.
      This interacts badly with a receiver that has TCP delayed ACKs enabled,
      adding 40ms to completion times. It is not always possible to control
      whether the peer uses delayed ACKs, but it is possible to adjust when and
      how autocorking is triggered. This patch prevents autocorking if the
      TCP_NODELAY flag is set on the socket.
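      The decision described above can be sketched as a small user-space model.
      This is not the kernel's code: the struct, its fields, and both function
      names are hypothetical, and the conditions are a simplification of what
      the message describes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified user-space model; field names are hypothetical, not kernel API. */
struct push_state {
    bool autocorking_enabled; /* models tcp_autocorking=1 */
    bool nodelay;             /* models TCP_NODELAY set on the socket */
    bool earlier_data_unsent; /* an earlier packet has not left the host yet */
    size_t pending_len;       /* bytes in the packet about to be pushed */
    size_t size_goal;         /* size at which a packet counts as full */
};

/* Behaviour described as the problem: the TCP_NODELAY state is ignored. */
bool should_autocork_old(const struct push_state *s)
{
    return s->autocorking_enabled &&
           s->pending_len < s->size_goal &&
           s->earlier_data_unsent;
}

/* Behaviour described as the fix: never autocork when TCP_NODELAY is set. */
bool should_autocork_new(const struct push_state *s)
{
    return !s->nodelay && should_autocork_old(s);
}
```

      In this model a small second packet on a TCP_NODELAY socket is corked by
      the old check and sent immediately by the new one.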
      
      Patch has been tested using an AWS c7g.2xlarge instance with Ubuntu 22.04 and
      Apache Tomcat 9.0.83 running the basic servlet below:
      
      import java.io.IOException;
      import java.io.OutputStreamWriter;
      import java.io.PrintWriter;
      import javax.servlet.ServletException;
      import javax.servlet.http.HttpServlet;
      import javax.servlet.http.HttpServletRequest;
      import javax.servlet.http.HttpServletResponse;
      
      public class HelloWorldServlet extends HttpServlet {
          @Override
          protected void doGet(HttpServletRequest request, HttpServletResponse response)
            throws ServletException, IOException {
              response.setContentType("text/html;charset=utf-8");
              OutputStreamWriter osw = new OutputStreamWriter(response.getOutputStream(),"UTF-8");
              String s = "a".repeat(3096);
              osw.write(s,0,s.length());
              osw.flush();
          }
      }
      
      Load was applied using wrk2 (https://github.com/kinvolk/wrk2) from an AWS
      c6i.8xlarge instance. With the current autocorking behavior and TCP_NODELAY
      set, an additional 40ms of latency is observed at P99.99 and above. With the
      patch applied there are no occurrences of 40ms latencies. The patch has also
      been tested with the iperf and uperf benchmarks, and no regression was observed.
      
      # No patch with tcp_autocorking=1 and TCP_NODELAY set on all sockets
      ./wrk -t32 -c128 -d40s --latency -R10000 http://172.31.49.177:8080/hello/hello
        ...
       50.000%    0.91ms
       75.000%    1.12ms
       90.000%    1.46ms
       99.000%    1.73ms
       99.900%    1.96ms
       99.990%   43.62ms   <<< 40+ ms extra latency
       99.999%   48.32ms
      100.000%   49.34ms
      
      # With patch
      ./wrk -t32 -c128 -d40s --latency -R10000 http://172.31.49.177:8080/hello/hello
        ...
       50.000%    0.89ms
       75.000%    1.13ms
       90.000%    1.44ms
       99.000%    1.67ms
       99.900%    1.78ms
       99.990%    2.27ms   <<< no 40+ ms extra latency
       99.999%    3.71ms
      100.000%    4.57ms
      
      Fixes: f54b3111 ("tcp: auto corking")
      Signed-off-by: Salvatore Dipietro <dipiets@amazon.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • dpll: sanitize possible null pointer dereference in dpll_pin_parent_pin_set() · 65c95f78
      Jiri Pirko authored
      User may not pass DPLL_A_PIN_STATE attribute in the pin set operation
      message. Sanitize that by checking if the attr pointer is not null
      and process the passed state attribute value only in that case.

      Reported-by: Xingyuan Mo <hdthky0@gmail.com>
      Fixes: 9d71b54b ("dpll: netlink: Add DPLL framework base functions")
      Signed-off-by: Jiri Pirko <jiri@nvidia.com>
      Acked-by: Vadim Fedorenko <vadim.fedorenko@linux.dev>
      Link: https://lore.kernel.org/r/20231211083758.1082853-1-jiri@resnulli.us
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • Merge branch 'ena-driver-xdp-bug-fixes' · 154bb2fa
      Jakub Kicinski authored
      David Arinzon says:
      
      ====================
      ENA driver XDP bug fixes
      
      This patchset contains multiple XDP-related bug fixes
      in the ENA driver.
      ====================
      
      Link: https://lore.kernel.org/r/20231211062801.27891-1-darinzon@amazon.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: ena: Fix XDP redirection error · 4ab138ca
      David Arinzon authored
      When sending TX packets, the meta descriptor can be all zeroes
      as no meta information is required (as in XDP).
      
      This patch removes the validity check, as when
      `disable_meta_caching` is enabled, such TX packets will be
      dropped otherwise.
      
      Fixes: 0e3a3f6d ("net: ena: support new LLQ acceleration mode")
      Signed-off-by: Shay Agroskin <shayagr@amazon.com>
      Signed-off-by: David Arinzon <darinzon@amazon.com>
      Link: https://lore.kernel.org/r/20231211062801.27891-5-darinzon@amazon.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: ena: Fix DMA syncing in XDP path when SWIOTLB is on · d7601170
      David Arinzon authored
      This patch fixes two issues:
      
      Issue 1
      -------
      Description
      ```````````
      Current code does not call dma_sync_single_for_cpu() to sync data from
      the device side memory to the CPU side memory before the XDP code path
      uses the CPU side data.
      This causes the XDP code path to read the unset garbage data in the CPU
      side memory, resulting in incorrect handling of the packet by XDP.
      
      Solution
      ````````
      1. Add a call to dma_sync_single_for_cpu() before the XDP code starts to
         use the data in the CPU side memory.
      2. The XDP code verdict can be XDP_PASS, in which case there is a
         fallback to the non-XDP code, which also calls
         dma_sync_single_for_cpu().
         To avoid calling dma_sync_single_for_cpu() twice:
      2.1. Put the dma_sync_single_for_cpu() in the code in such a place where
           it happens before XDP and non-XDP code.
      2.2. Remove the calls to dma_sync_single_for_cpu() in the non-XDP code
           for the first buffer only (rx_copybreak and non-rx_copybreak
           cases), since the new call that was added covers these cases.
           The call to dma_sync_single_for_cpu() for the second buffer and on
           stays because only the first buffer is handled by the newly added
           dma_sync_single_for_cpu(). And there is no need for special
           handling of the second buffer and on for the XDP path since
           currently the driver supports only single buffer packets.
      
      Issue 2
      -------
      Description
      ```````````
      In case the XDP code forwarded the packet (ENA_XDP_FORWARDED),
      ena_unmap_rx_buff_attrs() is called with attrs set to 0.
      This means that before unmapping the buffer, the internal function
      dma_unmap_page_attrs() will also call dma_sync_single_for_cpu() on
      the whole buffer (not only on the data part of it).
      This sync is both wasteful (since a sync was already explicitly
      called before) and also causes a bug, which will be explained
      using the below diagram.
      
      The following diagram shows the flow of events causing the bug.
      The order of events is (1)-(4) as shown in the diagram.
      
      CPU side memory area
      
           (3)convert_to_xdp_frame() initializes the
              headroom with xdpf metadata
                            ||
                            \/
                ___________________________________
               |                                   |
       0       |                                   V                       4K
       ---------------------------------------------------------------------
       | xdpf->data      | other xdpf       |   < data >   | tailroom ||...|
       |                 | fields           |              | GARBAGE  ||   |
       ---------------------------------------------------------------------
      
                         /\                        /\
                         ||                        ||
         (4)ena_unmap_rx_buff_attrs() calls     (2)dma_sync_single_for_cpu()
            dma_sync_single_for_cpu() on the       copies data from device
            whole buffer page, overwriting         side to CPU side memory
            the xdpf->data with GARBAGE.           ||
       0                                                                   4K
       ---------------------------------------------------------------------
       | headroom                           |   < data >   | tailroom ||...|
       | GARBAGE                            |              | GARBAGE  ||   |
       ---------------------------------------------------------------------
      
      Device side memory area                      /\
                                                   ||
                                     (1) device writes RX packet data
      
      After the call to ena_unmap_rx_buff_attrs() in (4), the xdpf->data
      becomes corrupted, and so when it is later accessed in
      ena_clean_xdp_irq()->xdp_return_frame(), it causes a page fault,
      crashing the kernel.
      
      Solution
      ````````
      Explicitly tell ena_unmap_rx_buff_attrs() not to call
      dma_sync_single_for_cpu() by passing it the ENA_DMA_ATTR_SKIP_CPU_SYNC
      flag.
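
      Issue 2 can be reproduced in a small user-space model of a bounce-buffered
      page. Everything here is hypothetical (the struct, the sizes, the function
      names); the `skip_cpu_sync` argument only stands in for the role that
      ENA_DMA_ATTR_SKIP_CPU_SYNC plays in the description above.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

enum { BUF_SIZE = 64, HEADROOM = 16, DATA_LEN = 32 };

/* Hypothetical model of a bounce-buffered RX page: `dev` is the device-side
 * copy, `cpu` is the CPU-side copy that the driver and XDP read and write. */
struct rx_page_model {
    unsigned char dev[BUF_SIZE];
    unsigned char cpu[BUF_SIZE];
};

/* Models a sync for the CPU: copy [off, off + len) from device to CPU side. */
void model_sync_for_cpu(struct rx_page_model *p, size_t off, size_t len)
{
    memcpy(p->cpu + off, p->dev + off, len);
}

/* Models the unmap step; `skip_cpu_sync` plays the role of the skip flag. */
void model_unmap(struct rx_page_model *p, bool skip_cpu_sync)
{
    if (!skip_cpu_sync)
        model_sync_for_cpu(p, 0, BUF_SIZE); /* whole buffer, headroom included */
}

/* Runs events (1)-(4) from the diagram; returns true if the frame metadata
 * written into the headroom in step (3) survives the unmap in step (4). */
bool metadata_survives(bool skip_cpu_sync)
{
    struct rx_page_model p;

    memset(p.dev, 0xEE, sizeof(p.dev));          /* device side: garbage   */
    memset(p.cpu, 0x00, sizeof(p.cpu));
    memset(p.dev + HEADROOM, 0xAB, DATA_LEN);    /* (1) device writes data */
    model_sync_for_cpu(&p, HEADROOM, DATA_LEN);  /* (2) sync the data part */
    memset(p.cpu, 0x5A, HEADROOM);               /* (3) frame metadata     */
    model_unmap(&p, skip_cpu_sync);              /* (4) unmap              */
    return p.cpu[0] == 0x5A;
}
```

      Calling `metadata_survives(false)` models the bug: the whole-buffer sync
      overwrites the headroom. Calling it with `true` models the fix.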
      
      Fixes: f7d625ad ("net: ena: Add dynamic recycling mechanism for rx buffers")
      Signed-off-by: Arthur Kiyanovski <akiyano@amazon.com>
      Signed-off-by: David Arinzon <darinzon@amazon.com>
      Link: https://lore.kernel.org/r/20231211062801.27891-4-darinzon@amazon.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: ena: Fix xdp drops handling due to multibuf packets · 505b1a88
      David Arinzon authored
      The current xdp code drops packets larger than ENA_XDP_MAX_MTU.
      This condition is incorrect, since the problem is not the size
      of the packet but the number of buffers it contains.
      
      This commit:
      
      1. Identifies and drops XDP multi-buffer packets at the
         beginning of the function.
      2. Increases the xdp drop statistic when this drop occurs.
      3. Adds a one-time print that such drops are happening to
         give better indication to the user.
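
      The three steps above can be sketched as follows. This is a user-space
      illustration only: the struct, its fields, the function name, and the
      message text are all hypothetical rather than taken from the driver.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical per-ring state; names are illustrative, not the driver's. */
struct rx_ring_model {
    unsigned long xdp_drop;   /* drop statistic */
    bool warned_multibuf;     /* ensures the message is printed only once */
};

/* Returns true if the packet may be handed to the XDP program. The check
 * is on the number of buffers, not on the packet length. */
bool xdp_accept_packet(struct rx_ring_model *ring, unsigned int num_bufs)
{
    if (num_bufs > 1) {
        ring->xdp_drop++;
        if (!ring->warned_multibuf) {
            ring->warned_multibuf = true;
            fprintf(stderr, "xdp: dropping multi-buffer packets\n");
        }
        return false;
    }
    return true;
}
```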
      
      Fixes: 838c93dc ("net: ena: implement XDP drop support")
      Signed-off-by: Arthur Kiyanovski <akiyano@amazon.com>
      Signed-off-by: David Arinzon <darinzon@amazon.com>
      Link: https://lore.kernel.org/r/20231211062801.27891-3-darinzon@amazon.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: ena: Destroy correct number of xdp queues upon failure · 41db6f99
      David Arinzon authored
      The ena_setup_and_create_all_xdp_queues() function freed all the
      resources upon failure, after creating only xdp_num_queues queues,
      instead of freeing just the created ones.
      
      In this patch, the only resources that are freed, are the ones
      allocated right before the failure occurs.
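
      A common way to free only what was created is to unwind from the failing
      index. The sketch below assumes a hypothetical adapter model with
      hypothetical helper names; it shows the pattern, not the driver's code.

```c
#include <assert.h>
#include <stdbool.h>

enum { MAX_QUEUES = 8 };

/* Hypothetical model: created[i] tracks whether queue i currently exists. */
struct adapter_model {
    bool created[MAX_QUEUES];
    int fail_at;              /* index whose creation fails, or -1 for none */
};

int create_queue(struct adapter_model *a, int i)
{
    if (i == a->fail_at)
        return -1;
    a->created[i] = true;
    return 0;
}

void destroy_queue(struct adapter_model *a, int i)
{
    assert(a->created[i]);    /* destroying a never-created queue is a bug */
    a->created[i] = false;
}

/* Creates queues [first, first + count); on failure unwinds only the ones
 * that were actually created in this call. */
int create_all_queues(struct adapter_model *a, int first, int count)
{
    int i;

    for (i = first; i < first + count; i++) {
        if (create_queue(a, i) != 0)
            goto unwind;
    }
    return 0;

unwind:
    while (--i >= first)
        destroy_queue(a, i);
    return -1;
}
```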
      
      Fixes: 548c4940 ("net: ena: Implement XDP_TX action")
      Signed-off-by: Shahar Itzko <itzko@amazon.com>
      Signed-off-by: David Arinzon <darinzon@amazon.com>
      Link: https://lore.kernel.org/r/20231211062801.27891-2-darinzon@amazon.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  2. 12 Dec, 2023 4 commits
    • net: Remove acked SYN flag from packet in the transmit queue correctly · f99cd562
      Dong Chenchen authored
      syzkaller report:
      
       kernel BUG at net/core/skbuff.c:3452!
       invalid opcode: 0000 [#1] PREEMPT SMP KASAN PTI
       CPU: 0 PID: 0 Comm: swapper/0 Not tainted 6.7.0-rc4-00009-gbee0e776-dirty #135
       RIP: 0010:skb_copy_and_csum_bits (net/core/skbuff.c:3452)
       Call Trace:
       icmp_glue_bits (net/ipv4/icmp.c:357)
       __ip_append_data.isra.0 (net/ipv4/ip_output.c:1165)
       ip_append_data (net/ipv4/ip_output.c:1362 net/ipv4/ip_output.c:1341)
       icmp_push_reply (net/ipv4/icmp.c:370)
       __icmp_send (./include/net/route.h:252 net/ipv4/icmp.c:772)
       ip_fragment.constprop.0 (./include/linux/skbuff.h:1234 net/ipv4/ip_output.c:592 net/ipv4/ip_output.c:577)
       __ip_finish_output (net/ipv4/ip_output.c:311 net/ipv4/ip_output.c:295)
       ip_output (net/ipv4/ip_output.c:427)
       __ip_queue_xmit (net/ipv4/ip_output.c:535)
       __tcp_transmit_skb (net/ipv4/tcp_output.c:1462)
       __tcp_retransmit_skb (net/ipv4/tcp_output.c:3387)
       tcp_retransmit_skb (net/ipv4/tcp_output.c:3404)
       tcp_retransmit_timer (net/ipv4/tcp_timer.c:604)
       tcp_write_timer (./include/linux/spinlock.h:391 net/ipv4/tcp_timer.c:716)
      
      The panic was triggered by TCP simultaneous initiation.
      The initiation process is as follows:
      
            TCP A                                            TCP B
      
        1.  CLOSED                                           CLOSED
      
        2.  SYN-SENT     --> <SEQ=100><CTL=SYN>              ...
      
        3.  SYN-RECEIVED <-- <SEQ=300><CTL=SYN>              <-- SYN-SENT
      
        4.               ... <SEQ=100><CTL=SYN>              --> SYN-RECEIVED
      
        5.  SYN-RECEIVED --> <SEQ=100><ACK=301><CTL=SYN,ACK> ...
      
        // TCP B: does not send a challenge ack, due to the ack limit or packet loss
        // TCP A: close
      	tcp_close
      	   tcp_send_fin
                    if (!tskb && tcp_under_memory_pressure(sk))
                        tskb = skb_rb_last(&sk->tcp_rtx_queue); //pick SYN_ACK packet
                 TCP_SKB_CB(tskb)->tcp_flags |= TCPHDR_FIN;  // set FIN flag
      
        6.  FIN_WAIT_1  --> <SEQ=100><ACK=301><END_SEQ=102><CTL=SYN,FIN,ACK> ...
      
        // TCP B: send challenge ack to SYN_FIN_ACK
      
        7.               ... <SEQ=301><ACK=101><CTL=ACK>   <-- SYN-RECEIVED //challenge ack
      
        // TCP A:  <SND.UNA=101>
      
        8.  FIN_WAIT_1 --> <SEQ=101><ACK=301><END_SEQ=102><CTL=SYN,FIN,ACK> ... // retransmit panic
      
      	__tcp_retransmit_skb  //skb->len=0
      	    tcp_trim_head
      		len = tp->snd_una - TCP_SKB_CB(skb)->seq // len=101-100
      		    __pskb_trim_head
      			skb->data_len -= len // skb->len=-1, wrap around
      	    ... ...
      	    ip_fragment
      		icmp_glue_bits //BUG_ON
      
      If we use tcp_trim_head() to remove an acked SYN from a packet that contains
      data or other flags, skb->len will be incorrectly decremented. Removing the
      acked SYN flag from the packet in the rtx_queue before calling
      tcp_trim_head() fixes the problem described above.
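
      The wrap-around in step 8 comes from treating the SYN's sequence number as
      a payload byte. The model below uses a hypothetical segment struct and
      hypothetical function names; it sketches the idea in the paragraph above
      and is not the kernel's implementation.

```c
#include <assert.h>
#include <stdint.h>

#define FLAG_FIN 0x01u
#define FLAG_SYN 0x02u
#define FLAG_ACK 0x10u

/* Hypothetical model of a queued segment; not the kernel's sk_buff. */
struct seg_model {
    uint32_t seq;      /* first sequence number covered by the segment */
    uint32_t end_seq;  /* one past the last sequence number covered */
    uint32_t len;      /* payload bytes actually stored */
    uint8_t flags;
};

/* Naive trim: treats every acked sequence number as a payload byte. */
void trim_head_naive(struct seg_model *s, uint32_t snd_una)
{
    uint32_t acked = snd_una - s->seq;

    s->len -= acked;   /* wraps around when acked > len */
    s->seq = snd_una;
}

/* Drop an acked SYN first, then trim only what remains. */
void trim_head_syn_aware(struct seg_model *s, uint32_t snd_una)
{
    if ((s->flags & FLAG_SYN) && (int32_t)(snd_una - s->seq) > 0) {
        s->flags &= (uint8_t)~FLAG_SYN;
        s->seq++;      /* the SYN used one sequence number, zero bytes */
    }
    if ((int32_t)(snd_una - s->seq) > 0)
        trim_head_naive(s, snd_una);
}
```

      With SEQ=100, END_SEQ=102, no payload, and SND.UNA=101 as in the trace,
      the naive trim underflows the length while the SYN-aware variant leaves
      it at zero and keeps the FIN flag.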
      
      Fixes: 1da177e4 ("Linux-2.6.12-rc2")
      Co-developed-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Dong Chenchen <dongchenchen2@huawei.com>
      Link: https://lore.kernel.org/r/20231210020200.1539875-1-dongchenchen2@huawei.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • qed: Fix a potential use-after-free in qed_cxt_tables_alloc · b65d52ac
      Dinghao Liu authored
      qed_ilt_shadow_alloc() will call qed_ilt_shadow_free() to
      free p_hwfn->p_cxt_mngr->ilt_shadow on error. However,
      qed_cxt_tables_alloc() accesses the freed pointer on failure
      of qed_ilt_shadow_alloc() through calling qed_cxt_mngr_free(),
      which may lead to use-after-free. Fix this issue by setting
      p_mngr->ilt_shadow to NULL in qed_ilt_shadow_free().
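
      The fix follows the usual free-then-clear pattern. The sketch below uses a
      hypothetical manager struct and hypothetical function names to show why
      clearing the pointer makes the later cleanup call harmless.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical manager holding one owned allocation. */
struct mngr_model {
    int *shadow;
};

/* Frees the table and clears the pointer so later cleanup sees "nothing
 * to free" instead of a dangling address. */
void shadow_free(struct mngr_model *m)
{
    free(m->shadow);
    m->shadow = NULL;
}

/* Allocation step that cleans up after itself on error. */
int shadow_alloc(struct mngr_model *m, size_t n, int simulate_failure)
{
    m->shadow = calloc(n, sizeof(*m->shadow));
    if (!m->shadow || simulate_failure) {
        shadow_free(m);
        return -1;
    }
    return 0;
}

/* Outer cleanup, also reached on the error path of the caller. */
void mngr_free(struct mngr_model *m)
{
    shadow_free(m);   /* free(NULL) is a no-op, so this is safe */
}
```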
      
      Fixes: fe56b9e6 ("qed: Add module with basic common support")
      Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
      Signed-off-by: Dinghao Liu <dinghao.liu@zju.edu.cn>
      Link: https://lore.kernel.org/r/20231210045255.21383-1-dinghao.liu@zju.edu.cn
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net/rose: Fix Use-After-Free in rose_ioctl · 810c38a3
      Hyunwoo Kim authored
      Because rose_ioctl() accesses sk->sk_receive_queue
      without holding a sk->sk_receive_queue.lock, it can
      cause a race with rose_accept().
      A use-after-free for skb occurs with the following flow.
      ```
      rose_ioctl() -> skb_peek()
      rose_accept() -> skb_dequeue() -> kfree_skb()
      ```
      Add sk->sk_receive_queue.lock to rose_ioctl() to fix this issue.
      
      Fixes: 1da177e4 ("Linux-2.6.12-rc2")
      Signed-off-by: Hyunwoo Kim <v4bel@theori.io>
      Link: https://lore.kernel.org/r/20231209100538.GA407321@v4bel-B760M-AORUS-ELITE-AX
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    • atm: Fix Use-After-Free in do_vcc_ioctl · 24e90b9e
      Hyunwoo Kim authored
      Because do_vcc_ioctl() accesses sk->sk_receive_queue
      without holding a sk->sk_receive_queue.lock, it can
      cause a race with vcc_recvmsg().
      A use-after-free for skb occurs with the following flow.
      ```
      do_vcc_ioctl() -> skb_peek()
      vcc_recvmsg() -> skb_recv_datagram() -> skb_free_datagram()
      ```
      Add sk->sk_receive_queue.lock to do_vcc_ioctl() to fix this issue.
      
      Fixes: 1da177e4 ("Linux-2.6.12-rc2")
      Signed-off-by: Hyunwoo Kim <v4bel@theori.io>
      Link: https://lore.kernel.org/r/20231209094210.GA403126@v4bel-B760M-AORUS-ELITE-AX
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
  3. 11 Dec, 2023 6 commits
    • octeontx2-af: Fix pause frame configuration · e307b5a8
      Hariprasad Kelam authored
      The current implementation's default Pause Forward setting is causing
      unnecessary network traffic. This patch disables Pause Forward to
      address this issue.
      
      Fixes: 1121f6b0 ("octeontx2-af: Priority flow control configuration support")
      Signed-off-by: Hariprasad Kelam <hkelam@marvell.com>
      Signed-off-by: Sunil Kovvuri Goutham <sgoutham@marvell.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'octeontx2-fixes' · c3e04142
      David S. Miller authored
      Hariprasad Kelam says:
      
      ====================
      octeontx2: Fix issues with promisc/allmulti mode
      
      When an interface is configured in promisc/allmulti mode, low network
      performance is observed. This patch series addresses that.

      Patch1: Changes the promisc/allmulti mcam entry action to unicast if
      there are no trusted VFs associated with the PF.

      Patch2: Configures the RSS flow algorithm in promisc/allmulti mcam
      entries to address flow distribution issues.
      ====================

      Signed-off-by: David S. Miller <davem@davemloft.net>
    • octeontx2-af: Update RSS algorithm index · 570ba378
      Hariprasad Kelam authored
      The RSS flow algorithm is not set up correctly for promiscuous or all
      multi MCAM entries. This has an impact on flow distribution.
      
      This patch fixes the issue by updating flow algorithm index in above
      mentioned MCAM entries.
      
      Fixes: 967db352 ("octeontx2-af: add support for multicast/promisc packet replication feature")
      Signed-off-by: Hariprasad Kelam <hkelam@marvell.com>
      Signed-off-by: Sunil Kovvuri Goutham <sgoutham@marvell.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • octeontx2-pf: Fix promisc mcam entry action · dbda4368
      Hariprasad Kelam authored
      In the current implementation, the promisc mcam entry action is set
      to multicast even when there are no trusted VFs. The multicast
      action causes the hardware to copy packet data, which reduces
      performance.
      
      This patch fixes this issue by setting the promisc mcam entry action to
      unicast instead of multicast when there are no trusted VFs. The same
      change is made for the 'allmulti' mcam entry action.
      
      Fixes: ffd2f89a ("octeontx2-pf: Enable promisc/allmulti match MCAM entries.")
      Signed-off-by: Hariprasad Kelam <hkelam@marvell.com>
      Signed-off-by: Sunil Kovvuri Goutham <sgoutham@marvell.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • octeon_ep: explicitly test for firmware ready value · 284f7176
      Shinas Rasheed authored
      The firmware ready value is 1, and get firmware ready status
      function should explicitly test for that value. The firmware
      ready value read will be 2 after driver load, and on unbind
      till firmware rewrites the firmware ready back to 0, the value
      seen by driver will be 2, which should be regarded as not ready.
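
      The difference between the two checks can be shown in a few lines. The
      macro and function names are hypothetical, and the "non-zero means ready"
      variant is an assumption about what a non-explicit test might look like,
      not a quote of the earlier driver code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Value taken from the commit message: 1 means "ready". The macro name is
 * hypothetical. */
#define FW_STATUS_READY 1u

/* Truthiness check: wrongly reports ready for the value 2. */
bool fw_ready_nonzero(uint8_t status)
{
    return status != 0;
}

/* Explicit check: only the value 1 counts as ready. */
bool fw_ready_explicit(uint8_t status)
{
    return status == FW_STATUS_READY;
}
```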
      
      Fixes: 10c073e4 ("octeon_ep: defer probe if firmware not ready")
      Signed-off-by: Shinas Rasheed <srasheed@marvell.com>
      Reviewed-by: Simon Horman <horms@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/sched: act_ct: Take per-cb reference to tcf_ct_flow_table · 125f1c7f
      Vlad Buslov authored
      The referenced change added custom cleanup code to act_ct to delete any
      callbacks registered on the parent block when deleting the
      tcf_ct_flow_table instance. However, the underlying issue is that the
      drivers don't obtain a reference to the tcf_ct_flow_table instance when
      registering callbacks. This means not only that driver callbacks may still
      be on the table when it is deleted, but also that the driver can still hold
      pointers to its internal nf_flowtable and use it concurrently, which
      results in either a warning in netfilter[0] or a use-after-free.
      
      Fix the issue by taking a reference to the underlying struct
      tcf_ct_flow_table instance when registering the callback and release the
      reference when unregistering. Expose new API required for such reference
      counting by adding two new callbacks to nf_flowtable_type and implementing
      them for act_ct flowtable_ct type. This fixes the issue by extending the
      lifetime of nf_flowtable until all users have unregistered.
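
      The lifetime rule described above can be modelled with a plain reference
      count. The struct and function names below are hypothetical, and a plain
      int stands in for an atomic counter; this is a sketch of the pattern, not
      the new nf_flowtable_type callbacks themselves.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical refcounted table; a plain int stands in for an atomic
 * reference count. */
struct table_model {
    int refs;
    bool freed;
};

void table_get(struct table_model *t)
{
    t->refs++;
}

void table_put(struct table_model *t)
{
    if (--t->refs == 0)
        t->freed = true;   /* real code would release the memory here */
}

/* Registering a callback pins the table. */
void cb_register(struct table_model *t)
{
    table_get(t);
}

/* Unregistering the callback drops that pin. */
void cb_unregister(struct table_model *t)
{
    table_put(t);
}
```

      In this model the owner can drop its own reference while a callback is
      still registered, and the table is only released once the callback is
      unregistered.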
      
      [0]:
      [106170.938634] ------------[ cut here ]------------
      [106170.939111] WARNING: CPU: 21 PID: 3688 at include/net/netfilter/nf_flow_table.h:262 mlx5_tc_ct_del_ft_cb+0x267/0x2b0 [mlx5_core]
      [106170.940108] Modules linked in: act_ct nf_flow_table act_mirred act_skbedit act_tunnel_key vxlan cls_matchall nfnetlink_cttimeout act_gact cls_flower sch_ingress mlx5_vdpa vringh vhost_iotlb vdpa bonding openvswitch nsh rpcrdma rdma_ucm ib_iser libiscsi scsi_transport_iscsi ib_umad rdma_cm ib_ipoib iw_cm ib_cm mlx5_ib ib_uverbs ib_core xt_MASQUERADE nf_conntrack_netlink nfnetlink iptable_nat xt_addrtype xt_conntrack nf_nat br_netfilter rpcsec_gss_krb5 auth_rpcgss oid_registry overlay mlx5_core
      [106170.943496] CPU: 21 PID: 3688 Comm: kworker/u48:0 Not tainted 6.6.0-rc7_for_upstream_min_debug_2023_11_01_13_02 #1
      [106170.944361] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
      [106170.945292] Workqueue: mlx5e mlx5e_rep_neigh_update [mlx5_core]
      [106170.945846] RIP: 0010:mlx5_tc_ct_del_ft_cb+0x267/0x2b0 [mlx5_core]
      [106170.946413] Code: 89 ef 48 83 05 71 a4 14 00 01 e8 f4 06 04 e1 48 83 05 6c a4 14 00 01 48 83 c4 28 5b 5d 41 5c 41 5d c3 48 83 05 d1 8b 14 00 01 <0f> 0b 48 83 05 d7 8b 14 00 01 e9 96 fe ff ff 48 83 05 a2 90 14 00
      [106170.947924] RSP: 0018:ffff88813ff0fcb8 EFLAGS: 00010202
      [106170.948397] RAX: 0000000000000000 RBX: ffff88811eabac40 RCX: ffff88811eabad48
      [106170.949040] RDX: ffff88811eab8000 RSI: ffffffffa02cd560 RDI: 0000000000000000
      [106170.949679] RBP: ffff88811eab8000 R08: 0000000000000001 R09: ffffffffa0229700
      [106170.950317] R10: ffff888103538fc0 R11: 0000000000000001 R12: ffff88811eabad58
      [106170.950969] R13: ffff888110c01c00 R14: ffff888106b40000 R15: 0000000000000000
      [106170.951616] FS:  0000000000000000(0000) GS:ffff88885fd40000(0000) knlGS:0000000000000000
      [106170.952329] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [106170.952834] CR2: 00007f1cefd28cb0 CR3: 000000012181b006 CR4: 0000000000370ea0
      [106170.953482] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [106170.954121] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [106170.954766] Call Trace:
      [106170.955057]  <TASK>
      [106170.955315]  ? __warn+0x79/0x120
      [106170.955648]  ? mlx5_tc_ct_del_ft_cb+0x267/0x2b0 [mlx5_core]
      [106170.956172]  ? report_bug+0x17c/0x190
      [106170.956537]  ? handle_bug+0x3c/0x60
      [106170.956891]  ? exc_invalid_op+0x14/0x70
      [106170.957264]  ? asm_exc_invalid_op+0x16/0x20
      [106170.957666]  ? mlx5_del_flow_rules+0x10/0x310 [mlx5_core]
      [106170.958172]  ? mlx5_tc_ct_block_flow_offload_add+0x1240/0x1240 [mlx5_core]
      [106170.958788]  ? mlx5_tc_ct_del_ft_cb+0x267/0x2b0 [mlx5_core]
      [106170.959339]  ? mlx5_tc_ct_del_ft_cb+0xc6/0x2b0 [mlx5_core]
      [106170.959854]  ? mapping_remove+0x154/0x1d0 [mlx5_core]
      [106170.960342]  ? mlx5e_tc_action_miss_mapping_put+0x4f/0x80 [mlx5_core]
      [106170.960927]  mlx5_tc_ct_delete_flow+0x76/0xc0 [mlx5_core]
      [106170.961441]  mlx5_free_flow_attr_actions+0x13b/0x220 [mlx5_core]
      [106170.962001]  mlx5e_tc_del_fdb_flow+0x22c/0x3b0 [mlx5_core]
      [106170.962524]  mlx5e_tc_del_flow+0x95/0x3c0 [mlx5_core]
      [106170.963034]  mlx5e_flow_put+0x73/0xe0 [mlx5_core]
      [106170.963506]  mlx5e_put_flow_list+0x38/0x70 [mlx5_core]
      [106170.964002]  mlx5e_rep_update_flows+0xec/0x290 [mlx5_core]
      [106170.964525]  mlx5e_rep_neigh_update+0x1da/0x310 [mlx5_core]
      [106170.965056]  process_one_work+0x13a/0x2c0
      [106170.965443]  worker_thread+0x2e5/0x3f0
      [106170.965808]  ? rescuer_thread+0x410/0x410
      [106170.966192]  kthread+0xc6/0xf0
      [106170.966515]  ? kthread_complete_and_exit+0x20/0x20
      [106170.966970]  ret_from_fork+0x2d/0x50
      [106170.967332]  ? kthread_complete_and_exit+0x20/0x20
      [106170.967774]  ret_from_fork_asm+0x11/0x20
      [106170.970466]  </TASK>
      [106170.970726] ---[ end trace 0000000000000000 ]---
      
      Fixes: 77ac5e40 ("net/sched: act_ct: remove and free nf_table callbacks")
      Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
      Reviewed-by: Paul Blakey <paulb@nvidia.com>
      Acked-by: Pablo Neira Ayuso <pablo@netfilter.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  4. 10 Dec, 2023 2 commits
    • octeontx2-af: fix a use-after-free in rvu_nix_register_reporters · 28a7cb04
      Zhipeng Lu authored
      rvu_dl is freed in rvu_nix_health_reporters_destroy(rvu_dl) after
      create_workqueue fails. After that free, the stale rvu_dl is carried
      back up through the following call chain:
      
      rvu_nix_health_reporters_destroy
        |-> rvu_nix_health_reporters_create
             |-> rvu_health_reporters_create
                   |-> rvu_register_dl (label err_dl_health)
      
      Finally, at the err_dl_health label, rvu_dl is freed again in
      rvu_health_reporters_destroy(rvu) by rvu_nix_health_reporters_destroy.
      This second call to rvu_nix_health_reporters_destroy, however,
      uses rvu_dl->rvu_nix_health_reporter, which was already freed at
      the end of the first call.

      So this patch prevents the first destroy by immediately returning -ENOMEM
      when create_workqueue fails. In addition, since the failure of
      create_workqueue was the only entry to the err label, that label has been
      folded into the error-handling path of create_workqueue.
      
      Fixes: 5ed66306 ("octeontx2-af: Add devlink health reporters for NIX")
      Signed-off-by: Zhipeng Lu <alexious@zju.edu.cn>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: fec: correct queue selection · 9fc95fe9
      Radu Bulie authored
      The old implementation extracted VLAN TCI info from the payload
      before the VLAN tag has been pushed in the payload.
      
      Another problem was that the VLAN TCI was extracted even if the
      packet did not have VLAN protocol header.
      
      This resulted in invalid VLAN TCI and as a consequence a random
      queue was computed.
      
      This patch fixes the above issues and uses the VLAN TCI from the
      skb if it is present, or the VLAN TCI from the payload if present.
      If no VLAN header is present, queue 0 is selected.
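
      The selection order described above can be sketched in user space. The
      packet model, the function name, and the priority-to-queue table are all
      hypothetical; only the order of the three cases (tag held with the skb,
      tag in the payload, no tag) comes from the message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define ETHERTYPE_VLAN 0x8100u
#define VLAN_PRIO_SHIFT 13

/* Hypothetical priority-to-queue table for a three-queue device. */
static const uint8_t prio_to_queue[8] = { 0, 0, 1, 1, 1, 2, 2, 2 };

/* Hypothetical packet model: the tag may sit in metadata (not yet pushed
 * into the payload) or in the Ethernet header of the frame itself. */
struct pkt_model {
    bool tag_in_metadata;
    uint16_t metadata_tci;
    const uint8_t *frame;     /* starts at the destination MAC address */
    size_t frame_len;
};

unsigned int select_queue(const struct pkt_model *p)
{
    uint16_t tci;

    if (p->tag_in_metadata) {
        tci = p->metadata_tci;
    } else if (p->frame_len >= 16 &&
               (((unsigned int)p->frame[12] << 8) | p->frame[13]) == ETHERTYPE_VLAN) {
        /* Bytes 14-15 hold the TCI when the EtherType says "VLAN". */
        tci = (uint16_t)((p->frame[14] << 8) | p->frame[15]);
    } else {
        return 0;             /* no VLAN header: default queue */
    }
    return prio_to_queue[tci >> VLAN_PRIO_SHIFT];
}
```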
      
      Fixes: 52c4a1a8 ("net: fec: add ndo_select_queue to fix TX bandwidth fluctuations")
      Signed-off-by: Radu Bulie <radu-andrei.bulie@nxp.com>
      Signed-off-by: Wei Fang <wei.fang@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  5. 09 Dec, 2023 15 commits
  6. 08 Dec, 2023 5 commits
    • team: Fix use-after-free when an option instance allocation fails · c12296bb
      Florent Revest authored
      In __team_options_register, team_options are allocated and appended to
      the team's option_list.
      If one option instance allocation fails, the "inst_rollback" cleanup
      path frees the previously allocated options but doesn't remove them from
      the team's option_list.
      This leaves dangling pointers that can be dereferenced later by other
      parts of the team driver that iterate over options.
      
      This patch fixes the cleanup path to remove the dangling pointers from
      the list.
      
      As far as I can tell, this uaf doesn't have many security implications
      since it would be fairly hard to exploit (an attacker would need to make
      the allocation of that specific small object fail) but it's still nice
      to fix.
      
      Cc: stable@vger.kernel.org
      Fixes: 80f7c668 ("team: add support for per-port options")
      Signed-off-by: default avatarFlorent Revest <revest@chromium.org>
      Reviewed-by: default avatarJiri Pirko <jiri@nvidia.com>
      Reviewed-by: default avatarHangbin Liu <liuhangbin@gmail.com>
      Link: https://lore.kernel.org/r/20231206123719.1963153-1-revest@chromium.orgSigned-off-by: default avatarJakub Kicinski <kuba@kernel.org>
    • net: ipv6: support reporting otherwise unknown prefix flags in RTM_NEWPREFIX · bd4a8167
      Maciej Żenczykowski authored
      Lorenzo points out that we effectively clear all unknown
      flags from PIO when copying them to userspace in the netlink
      RTM_NEWPREFIX notification.
      
      We could fix this one at a time as new flags are defined,
      or in one fell swoop - I choose the latter.
      
      We could either define 6 new reserved flags (reserved1..6) and handle
      them individually (and rename them as new flags are defined), or we
      could simply copy the entire unmodified byte over - I choose the latter.
      
      This unfortunately requires some anonymous union/struct magic,
      so we add a static assert on the struct size for a little extra safety.
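
The "anonymous union/struct" approach can be sketched as follows. This is a simplified, hypothetical layout (`struct pio_sketch`, `exported_flags`), not the kernel's `struct prefix_info`. It assumes a C11 compiler that accepts `uint8_t` bit-fields, as GCC and Clang do, and it ignores the endian-dependent bit ordering that a real definition has to handle.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stdint.h>

/* Hypothetical, shortened prefix-information layout. */
struct pio_sketch {
    uint8_t type;
    uint8_t length;
    uint8_t prefix_len;
    union {
        uint8_t flags;              /* the whole byte, unknown bits included */
        struct {
            uint8_t reserved : 6;   /* bits with no name yet */
            uint8_t autoconf : 1;
            uint8_t onlink   : 1;
        };
    };
};

/* Catch any padding introduced by the union/struct nesting at build time. */
static_assert(sizeof(struct pio_sketch) == 4,
              "pio_sketch must stay exactly 4 bytes");

/* Report the flags byte unmodified instead of rebuilding it bit by bit. */
static uint8_t exported_flags(const struct pio_sketch *p)
{
    return p->flags;
}
```

Existing code can keep reading the named bits, while the reporting path copies `flags` as a single byte, so flags defined in the future pass through without further changes.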
      
      Cc: David Ahern <dsahern@kernel.org>
      Cc: Lorenzo Colitti <lorenzo@google.com>
      Fixes: 1da177e4 ("Linux-2.6.12-rc2")
      Signed-off-by: Maciej Żenczykowski <maze@google.com>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • neighbour: Don't let neigh_forced_gc() disable preemption for long · e5dc5aff
      Judy Hsiao authored
      We are seeing cases where neigh_cleanup_and_release() is called by
      neigh_forced_gc() many times in a row with preemption turned off.
      When running on a low-powered CPU at a low CPU frequency, this has
      been measured to keep preemption off for ~10 ms. That's not great on a
      system with HZ=1000, which expects tasks to be able to schedule in
      with ~1 ms latency.
      
      Suggested-by: Douglas Anderson <dianders@chromium.org>
      Signed-off-by: Judy Hsiao <judyhsiao@chromium.org>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Reviewed-by: Douglas Anderson <dianders@chromium.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge tag 'mlx5-fixes-2023-12-04' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux · 179a8b51
      David S. Miller authored
      Saeed Mahameed says:
      
      ====================
      mlx5 fixes 2023-12-04
      
      This series provides bug fixes to mlx5 driver.
      
      V1->V2:
        - Drop commit #9 ("net/mlx5e: Forbid devlink reload if IPSec rules are
          offloaded"), we are working on a better fix
      
      Please pull and let me know if there is any problem.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge tag 'net-6.7-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net · 5e3f5b81
      Linus Torvalds authored
      Pull networking fixes from Jakub Kicinski:
       "Including fixes from bpf and netfilter.
      
        Current release - regressions:
      
         - veth: fix packet segmentation in veth_convert_skb_to_xdp_buff
      
        Current release - new code bugs:
      
         - tcp: assorted fixes to the new Auth Option support
      
        Older releases - regressions:
      
         - tcp: fix mid stream window clamp
      
         - tls: fix incorrect splice handling
      
         - ipv4: ip_gre: handle skb_pull() failure in ipgre_xmit()
      
         - dsa: mv88e6xxx: restore USXGMII support for 6393X
      
         - arcnet: restore support for multiple Sohard Arcnet cards
      
        Older releases - always broken:
      
         - tcp: do not accept ACK of bytes we never sent
      
         - require admin privileges to receive packet traces via netlink
      
         - packet: move reference count in packet_sock to atomic_long_t
      
         - bpf:
            - fix incorrect branch offset comparison with cpu=v4
            - fix prog_array_map_poke_run map poke update
      
         - netfilter:
            - three fixes for crashes on bad admin commands
            - xt_owner: fix race accessing sk->sk_socket, TOCTOU null-deref
            - nf_tables: fix 'exist' matching on bigendian arches
      
         - leds: netdev: fix RTNL handling to prevent potential deadlock
      
         - eth: tg3: prevent races in error/reset handling
      
         - eth: r8169: fix rtl8125b PAUSE storm when suspended
      
         - eth: r8152: improve reset and surprise removal handling
      
         - eth: hns: fix race between changing features and sending
      
         - eth: nfp: fix sleep in atomic for bonding offload"
      
      * tag 'net-6.7-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (62 commits)
        vsock/virtio: fix "comparison of distinct pointer types lacks a cast" warning
        net/smc: fix missing byte order conversion in CLC handshake
        net: dsa: microchip: provide a list of valid protocols for xmit handler
        drop_monitor: Require 'CAP_SYS_ADMIN' when joining "events" group
        psample: Require 'CAP_NET_ADMIN' when joining "packets" group
        bpf: sockmap, updating the sg structure should also update curr
        net: tls, update curr on splice as well
        nfp: flower: fix for take a mutex lock in soft irq context and rcu lock
        net: dsa: mv88e6xxx: Restore USXGMII support for 6393X
        tcp: do not accept ACK of bytes we never sent
        selftests/bpf: Add test for early update in prog_array_map_poke_run
        bpf: Fix prog_array_map_poke_run map poke update
        netfilter: xt_owner: Fix for unsafe access of sk->sk_socket
        netfilter: nf_tables: validate family when identifying table via handle
        netfilter: nf_tables: bail out on mismatching dynset and set expressions
        netfilter: nf_tables: fix 'exist' matching on bigendian arches
        netfilter: nft_set_pipapo: skip inactive elements during set walk
        netfilter: bpf: fix bad registration on nf_defrag
        leds: trigger: netdev: fix RTNL handling to prevent potential deadlock
        octeontx2-af: Update Tx link register range
        ...
  7. 07 Dec, 2023 1 commit