08 Mar, 2024: 40 commits
    • net: hns3: fix reset timeout under full functions and queues · 216bc415
      Peiyang Wang authored
      The cmdq reset command times out when all VFs are enabled and the queue is
      full: the hardware processing time exceeds the timeout set by the driver.
      To avoid this extreme situation, the driver extends the reset timeout to
      1 second.
      Signed-off-by: Peiyang Wang <wangpeiyang1@huawei.com>
      Signed-off-by: Jijie Shao <shaojijie@huawei.com>
      Reviewed-by: Sunil Goutham <sgoutham@marvell.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: hns3: fix delete tc fail issue · 03f92287
      Jijie Shao authored
      When a TC is removed during reset, the hns3 driver returns an error code,
      but the kernel ignores it. As a result, the driver state becomes
      inconsistent with the kernel state.
      
      This patch retains the deletion state when the deletion fails and
      continues the deletion after the reset, to ensure that the driver state
      stays consistent with the kernel state.
      Signed-off-by: Jijie Shao <shaojijie@huawei.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: hns3: fix kernel crash when 1588 is received on HIP08 devices · 0fbcf236
      Yonglong Liu authored
      HIP08 devices do not register a PTP device, so hdev->ptp is NULL.
      The hardware can still receive 1588 messages and set the
      HNS3_RXD_TS_VLD_B bit, in which case accessing hdev->ptp->flags
      causes a kernel crash:
      
      [ 5888.946472] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000018
      [ 5888.946475] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000018
      ...
      [ 5889.266118] pc : hclge_ptp_get_rx_hwts+0x40/0x170 [hclge]
      [ 5889.272612] lr : hclge_ptp_get_rx_hwts+0x34/0x170 [hclge]
      [ 5889.279101] sp : ffff800012c3bc50
      [ 5889.283516] x29: ffff800012c3bc50 x28: ffff2040002be040
      [ 5889.289927] x27: ffff800009116484 x26: 0000000080007500
      [ 5889.296333] x25: 0000000000000000 x24: ffff204001c6f000
      [ 5889.302738] x23: ffff204144f53c00 x22: 0000000000000000
      [ 5889.309134] x21: 0000000000000000 x20: ffff204004220080
      [ 5889.315520] x19: ffff204144f53c00 x18: 0000000000000000
      [ 5889.321897] x17: 0000000000000000 x16: 0000000000000000
      [ 5889.328263] x15: 0000004000140ec8 x14: 0000000000000000
      [ 5889.334617] x13: 0000000000000000 x12: 00000000010011df
      [ 5889.340965] x11: bbfeff4d22000000 x10: 0000000000000000
      [ 5889.347303] x9 : ffff800009402124 x8 : 0200f78811dfbb4d
      [ 5889.353637] x7 : 2200000000191b01 x6 : ffff208002a7d480
      [ 5889.359959] x5 : 0000000000000000 x4 : 0000000000000000
      [ 5889.366271] x3 : 0000000000000000 x2 : 0000000000000000
      [ 5889.372567] x1 : 0000000000000000 x0 : ffff20400095c080
      [ 5889.378857] Call trace:
      [ 5889.382285] hclge_ptp_get_rx_hwts+0x40/0x170 [hclge]
      [ 5889.388304] hns3_handle_bdinfo+0x324/0x410 [hns3]
      [ 5889.394055] hns3_handle_rx_bd+0x60/0x150 [hns3]
      [ 5889.399624] hns3_clean_rx_ring+0x84/0x170 [hns3]
      [ 5889.405270] hns3_nic_common_poll+0xa8/0x220 [hns3]
      [ 5889.411084] napi_poll+0xcc/0x264
      [ 5889.415329] net_rx_action+0xd4/0x21c
      [ 5889.419911] __do_softirq+0x130/0x358
      [ 5889.424484] irq_exit+0x134/0x154
      [ 5889.428700] __handle_domain_irq+0x88/0xf0
      [ 5889.433684] gic_handle_irq+0x78/0x2c0
      [ 5889.438319] el1_irq+0xb8/0x140
      [ 5889.442354] arch_cpu_idle+0x18/0x40
      [ 5889.446816] default_idle_call+0x5c/0x1c0
      [ 5889.451714] cpuidle_idle_call+0x174/0x1b0
      [ 5889.456692] do_idle+0xc8/0x160
      [ 5889.460717] cpu_startup_entry+0x30/0xfc
      [ 5889.465523] secondary_start_kernel+0x158/0x1ec
      [ 5889.470936] Code: 97ffab78 f9411c14 91408294 f9457284 (f9400c80)
      [ 5889.477950] SMP: stopping secondary CPUs
      [ 5890.514626] SMP: failed to stop secondary CPUs 0-69,71-95
      [ 5890.522951] Starting crashdump kernel...
      
      Fixes: 0bf5eb78 ("net: hns3: add support for PTP")
      Signed-off-by: Yonglong Liu <liuyonglong@huawei.com>
      Signed-off-by: Jijie Shao <shaojijie@huawei.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: hns3: Disable SerDes serial loopback for HiLink H60 · 0448825b
      Hao Lan authored
      When the HiLink version is H60, the SerDes serial loopback test is not
      supported. This patch adds HiLink version detection; when the version
      is H60, the SerDes serial loopback test is disabled.
      Signed-off-by: Hao Lan <lanhao@huawei.com>
      Signed-off-by: Jijie Shao <shaojijie@huawei.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: hns3: add new 200G link modes for hisilicon device · dd1f65f0
      Hao Lan authored
      The HiSilicon device now supports a new 200G link interface, which is
      queried from firmware via a new bit. Therefore, the
      HCLGE_SUPPORT_200G_R4_BIT capability bit has been added.
      HCLGE_SUPPORT_200G_BIT has been renamed
      HCLGE_SUPPORT_200G_R4_EXT_BIT, and the firmware has extended
      support for this mode.
      
      Fixes: ae6f010c ("net: hns3: add support for 200G device")
      Signed-off-by: Hao Lan <lanhao@huawei.com>
      Signed-off-by: Jijie Shao <shaojijie@huawei.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: hns3: fix wrong judgment condition issue · 07a1d6dc
      Jijie Shao authored
      In hns3_dcbnl_ieee_delapp, we should check ieee_delapp, not ieee_setapp.
      This patch fixes the wrong judgment.
      
      Fixes: 0ba22bcb ("net: hns3: add support config dscp map to tc")
      Signed-off-by: Jijie Shao <shaojijie@huawei.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'ionic-diet' · 147a1c06
      David S. Miller authored
      Shannon Nelson says:
      
      ====================
      ionic: putting ionic on a diet
      
      Building on the performance work done in the previous patchset
          https://lore.kernel.org/netdev/20240229193935.14197-1-shannon.nelson@amd.com/
      this patchset puts the ionic driver on a diet, decreasing the memory
      requirements per queue, and simplifies a few more bits of logic.
      
      We trimmed the queue management structs and gained some ground, but
      the most savings came from trimming the individual buffer descriptors.
      The original design used a single generic buffer descriptor for Tx, Rx and
      Adminq needs, but the Rx and Adminq descriptors really don't need all the
      info that the Tx descriptors track.  By splitting up the descriptor types
      we can significantly reduce the descriptor sizes for Rx and Adminq use.
      
      There is a small reduction in the queue management structs, saving about
      3 cachelines per queuepair:
      
          ionic_qcq:
      	Before:	/* size: 2176, cachelines: 34, members: 23 */
      	After:	/* size: 2048, cachelines: 32, members: 23 */
      
      We also remove an array of completion descriptor pointers, or about
      8 Kbytes per queue.
      
      But the biggest savings came from splitting the desc_info struct into
      queue specific structs and trimming out what was unnecessary.
      
          Before:
      	ionic_desc_info:
      		/* size: 496, cachelines: 8, members: 10 */
          After:
      	ionic_tx_desc_info:
      		/* size: 496, cachelines: 8, members: 6 */
      	ionic_rx_desc_info:
      		/* size: 224, cachelines: 4, members: 2 */
      	ionic_admin_desc_info:
      		/* size: 8, cachelines: 1, members: 1 */
      
      In a 64 core host the ionic driver will default to 64 queuepairs of
      1024 descriptors for Rx, 1024 for Tx, and 80 for Adminq and Notifyq.
      
      The total memory usage for 64 queues:
          Before:
      	  65 * sizeof(ionic_qcq)			   141,440
      	+ 64 * 1024 * sizeof(ionic_desc_info)		32,505,856
      	+ 64 * 1024 * sizeof(ionic_desc_info)		32,505,856
      	+ 64 * 1024 * 2 * sizeof(ionic_qc_info)		    16,384
      	+  1 *   80 * sizeof(ionic_desc_info)		    39,680
      							----------
      							65,209,216
      
          After:
      	  65 * sizeof(ionic_qcq)			   133,120
      	+ 64 * 1024 * sizeof(ionic_tx_desc_info)	32,505,856
      	+ 64 * 1024 * sizeof(ionic_rx_desc_info)	14,680,064
      	+                           (removed)		         0
      	+  1 *   80 * sizeof(ionic_admin_desc_info)	       640
      							----------
      							47,319,680
      
      This saves us approximately 18 Mbytes per port in a 64 core machine,
      a 28% savings in our memory needs.
      
      In addition, this improves our simple single thread / single queue
      iperf case on a 9100 MTU connection from 86.7 to 95 Gbits/sec.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ionic: keep stats struct local to error handling · 2854242d
      Shannon Nelson authored
      When possible, keep the stats struct references strictly
      in the error handling blocks and out of the fastpath.
      Reviewed-by: Brett Creeley <brett.creeley@amd.com>
      Signed-off-by: Shannon Nelson <shannon.nelson@amd.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ionic: better dma-map error handling · 56e41ee1
      Shannon Nelson authored
      Fix up several small dma_addr handling issues:
        - don't double-count dma-map-err stat in ionic_tx_map_skb()
          or ionic_xdp_post_frame()
        - return 0 on error from both ionic_tx_map_single() and
          ionic_tx_map_frag() and check for !dma_addr in ionic_tx_map_skb()
          and ionic_xdp_post_frame()
        - be sure to unmap buf_info[0] in ionic_tx_map_skb() error path
        - don't assign rx buf->dma_addr until error checked in ionic_rx_page_alloc()
        - remove unnecessary dma_addr_t casts
      Reviewed-by: Brett Creeley <brett.creeley@amd.com>
      Signed-off-by: Shannon Nelson <shannon.nelson@amd.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ionic: remove unnecessary NULL test · a12c1e7a
      Shannon Nelson authored
      We call ionic_rx_page_alloc() only on existing buf_info structs from
      ionic_rx_fill().  There's no need for the additional NULL test.
      Reviewed-by: Brett Creeley <brett.creeley@amd.com>
      Signed-off-by: Shannon Nelson <shannon.nelson@amd.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ionic: rearrange ionic_queue for better layout · 4554341d
      Shannon Nelson authored
      A simple change to the struct ionic_queue layout removes some
      unnecessary padding and saves us a cacheline in the struct
      ionic_qcq layout.
      
          struct ionic_queue {
      	Before: /* size: 256, cachelines: 4, members: 29 */
      	After:  /* size: 192, cachelines: 3, members: 29 */
      
          struct ionic_qcq {
      	Before: /* size: 2112, cachelines: 33, members: 23 */
      	After:  /* size: 2048, cachelines: 32, members: 23 */
      Reviewed-by: Brett Creeley <brett.creeley@amd.com>
      Signed-off-by: Shannon Nelson <shannon.nelson@amd.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ionic: rearrange ionic_qcq · 453538c5
      Shannon Nelson authored
      Rearrange a few fields for better cache use and to put the
      flags field in the first cacheline rather than the last.
      
          struct ionic_qcq
      	Before: /* size: 2176, cachelines: 34, members: 23 */
      	After:  /* size: 2112, cachelines: 33, members: 23 */
      Reviewed-by: Brett Creeley <brett.creeley@amd.com>
      Signed-off-by: Shannon Nelson <shannon.nelson@amd.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ionic: carry idev in ionic_cq struct · 01658924
      Shannon Nelson authored
      Remove the idev field from ionic_queue, which saves us a
      bit of space, and add it into ionic_cq where there's room
      within some cacheline padding.  Use this pointer rather
      than doing a multi-level dereference through lif->ionic.
      Suggested-by: Neel Patel <npatel2@amd.com>
      Reviewed-by: Brett Creeley <brett.creeley@amd.com>
      Signed-off-by: Shannon Nelson <shannon.nelson@amd.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ionic: refactor skb building · 36a47c90
      Shannon Nelson authored
      The existing ionic_rx_frags() code is a bit of a mess and can
      be cleaned up by unrolling the first frag/header setup from
      the loop, then reworking the do-while-loop into a for-loop.  We
      rename the function to a more descriptive ionic_rx_build_skb().
      We also change a couple of related variable names for readability.
      Reviewed-by: Brett Creeley <brett.creeley@amd.com>
      Signed-off-by: Shannon Nelson <shannon.nelson@amd.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ionic: fold adminq clean into service routine · 8599bd4c
      Shannon Nelson authored
      Since the AdminQ clean is a simple action called from only
      one place, fold it back into the service routine.
      Reviewed-by: Brett Creeley <brett.creeley@amd.com>
      Signed-off-by: Shannon Nelson <shannon.nelson@amd.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ionic: use specialized desc info structs · 4dcd4575
      Shannon Nelson authored
      Make the desc_info structure specific to the queue type, which
      allows us to cut down the Rx and AdminQ descriptor sizes by
      not including all the fields needed for the Tx descriptors.
      
      Before:
          struct ionic_desc_info {
      	/* size: 464, cachelines: 8, members: 6 */
      
      After:
          struct ionic_tx_desc_info {
      	/* size: 464, cachelines: 8, members: 6 */
          struct ionic_rx_desc_info {
      	/* size: 224, cachelines: 4, members: 2 */
          struct ionic_admin_desc_info {
      	/* size: 8, cachelines: 1, members: 1 */
      Suggested-by: Neel Patel <npatel2@amd.com>
      Reviewed-by: Brett Creeley <brett.creeley@amd.com>
      Signed-off-by: Shannon Nelson <shannon.nelson@amd.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ionic: remove the cq_info to save more memory · 65e548f6
      Shannon Nelson authored
      With a little simple math we don't need another struct array to
      find the completion structs, so we can remove the ionic_cq_info
      altogether.  This doesn't really save anything in the ionic_cq
      since it gets padded out to the cacheline, but it does remove
      the parallel array allocation of 8 * num_descriptors, or about
      8 Kbytes per queue in a default configuration.
      Suggested-by: Neel Patel <npatel2@amd.com>
      Reviewed-by: Brett Creeley <brett.creeley@amd.com>
      Signed-off-by: Shannon Nelson <shannon.nelson@amd.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ionic: remove callback pointer from desc_info · ae24a8f8
      Shannon Nelson authored
      By reworking the queue service routines to have their own
      servicing loops, we can remove the cb pointer from desc_info
      to save another 8 bytes per descriptor.
      
      This simplifies some of the queue handling indirection and makes
      the code a little easier to follow, and keeps service code in
      one place rather than jumping between code files.
      
         struct ionic_desc_info
      	Before:  /* size: 472, cachelines: 8, members: 7 */
      	After:   /* size: 464, cachelines: 8, members: 6 */
      Suggested-by: Neel Patel <npatel2@amd.com>
      Reviewed-by: Brett Creeley <brett.creeley@amd.com>
      Signed-off-by: Shannon Nelson <shannon.nelson@amd.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ionic: move adminq-notifyq handling to main file · 05c94473
      Shannon Nelson authored
      Move the AdminQ and NotifyQ queue handling to ionic_main.c with
      the rest of the adminq code.
      Suggested-by: Neel Patel <npatel2@amd.com>
      Reviewed-by: Brett Creeley <brett.creeley@amd.com>
      Signed-off-by: Shannon Nelson <shannon.nelson@amd.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ionic: drop q mapping · 90c01ede
      Shannon Nelson authored
      Now that we're not using desc_info pointers mapped in every q
      we can simplify and drop the unnecessary utility functions.
      Reviewed-by: Brett Creeley <brett.creeley@amd.com>
      Signed-off-by: Shannon Nelson <shannon.nelson@amd.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ionic: remove desc, sg_desc and cmb_desc from desc_info · d60984d3
      Shannon Nelson authored
      Remove the struct pointers from desc_info to use less space.
      Instead of pointers in every desc_info to its descriptor,
      we can use the queue descriptor index to find the individual
      desc, desc_info, and sgl structs in their parallel arrays.
      
         struct ionic_desc_info
      	Before:  /* size: 496, cachelines: 8, members: 10 */
      	After:   /* size: 472, cachelines: 8, members: 7 */
      Suggested-by: Neel Patel <npatel2@amd.com>
      Reviewed-by: Brett Creeley <brett.creeley@amd.com>
      Signed-off-by: Shannon Nelson <shannon.nelson@amd.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch '40GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue · e3eec349
      David S. Miller authored
      Tony Nguyen says:
      
      ====================
      Intel Wired LAN Driver Updates 2024-03-06 (iavf, i40e, ixgbe)
      
      This series contains updates to iavf, i40e, and ixgbe drivers.
      
      Alexey Kodanev removes duplicate calls related to cloud filters on iavf
      and unnecessary null checks on i40e.
      
      Maciej adds helper functions for common code relating to updating
      statistics for ixgbe.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Add Jeff Kirsher to .get_maintainer.ignore · 7221fbe8
      Jakub Kicinski authored
      Jeff was retired as the Intel driver maintainer in
      commit 6667df91 ("MAINTAINERS: Update MAINTAINERS for
      Intel ethernet drivers"), and his address bounces.
      But he has signed-off a lot of patches over the years
      so get_maintainer insists on CCing him.
      
      We haven't heard from him since he left Intel, so remapping
      the address via mailmap is also pointless. Add to ignored
      addresses.
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Acked-by: Tony Nguyen <anthony.l.nguyen@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'ipv6-lockless-dump-addrs' · 570c86ed
      David S. Miller authored
      Eric Dumazet says:
      
      ====================
      ipv6: lockless inet6_dump_addr()
      
      This series removes RTNL locking to dump ipv6 addresses.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv6: remove RTNL protection from inet6_dump_addr() · 155549a6
      Eric Dumazet authored
      We can now remove RTNL acquisition while running
      inet6_dump_addr(), inet6_dump_ifmcaddr()
      and inet6_dump_ifacaddr().
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv6: use xa_array iterator to implement inet6_dump_addr() · 9cc4cc32
      Eric Dumazet authored
      inet6_dump_addr() can use the new xa_array iterator
      for better scalability.
      
      Make it ready for RCU-only protection.
      RTNL use is removed in the following patch.
      
      Also properly return 0 at the end of a dump to avoid an
      extra recvmsg() to get NLMSG_DONE.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv6: make in6_dump_addrs() lockless · 46f5182d
      Eric Dumazet authored
      in6_dump_addrs() is called with RCU protection.
      
      There is no need to hold idev->lock to iterate through unicast addresses.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv6: make inet6_fill_ifaddr() lockless · f0a7da70
      Eric Dumazet authored
      Make inet6_fill_ifaddr() lockless, and add appropriate annotations
      on ifa->tstamp, ifa->valid_lft, ifa->preferred_lft, ifa->ifa_proto
      and ifa->rt_priority.
      
      Also constify 2nd argument of inet6_fill_ifaddr(), inet6_fill_ifmcaddr()
      and inet6_fill_ifacaddr().
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge tag 'ipsec-next-2024-03-06' of... · 3dbf6d67
      David S. Miller authored
      Merge tag 'ipsec-next-2024-03-06' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next
      
      Steffen Klassert says:
      
      ====================
      1) Introduce forwarding of ICMP Error messages. That is specified
         in RFC 4301 but was never implemented. From Antony Antony.
      
      2) Use KMEM_CACHE instead of kmem_cache_create in xfrm6_tunnel_init()
         and xfrm_policy_init(). From Kunwu Chan.
      
      3) Do not allocate stats in the xfrm interface driver, this can be done
         on net core now. From Breno Leitao.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'nexthop-group-stats' · 7cf497e5
      David S. Miller authored
      Petr Machata says:
      
      ====================
      Support for nexthop group statistics
      
      ECMP is a fundamental component in L3 designs. However, it's fragile. Many
      factors influence whether an ECMP group will operate as intended: hash
      policy (i.e. the set of fields that contribute to ECMP hash calculation),
      neighbor validity, hash seed (which might lead to polarization) or the type
      of ECMP group used (hash-threshold or resilient).
      
      At the same time, collecting statistics that would help an operator
      determine whether the group performs as desired is difficult.
      
      A solution that we present in this patchset is to add counters to next hop
      group entries. For SW-datapath deployments, this will on its own allow
      collection and evaluation of relevant statistics. For HW-datapath
      deployments, we further add a way to request that HW counters be installed
      for a given group, in-kernel interfaces to collect the HW statistics, and
      netlink interfaces to query them.
      
      For example:
      
          # ip nexthop replace id 4000 group 4001/4002 hw_stats on
      
          # ip -s -d nexthop show id 4000
          id 4000 group 4001/4002 scope global proto unspec offload hw_stats on used on
            stats:
              id 4001 packets 5002 packets_hw 5000
              id 4002 packets 4999 packets_hw 4999
      
      The point of the patchset is visibility of ECMP balance, and that is
      influenced by packet headers, not their payload. Correspondingly, we only
      include packet counters in the statistics, not byte counters.
      
      We also decided to model HW statistics as a nexthop group attribute, not an
      arbitrary nexthop one. The latter would count any traffic going through a
      given nexthop, regardless of which ECMP group it is in, or any at all. The
      reason is again that the point of the patchset is ECMP balance visibility,
      not arbitrary inspection of how busy a particular nexthop is.
      Implementation of individual-nexthop statistics is certainly possible, and
      could well follow the general approach we are taking in this patchset.
      For resilient groups, per-bucket statistics could be done in a similar
      manner as well.
      
      This patchset contains the core code. mlxsw support will be sent in a
      follow-up patch set.
      
      This patchset progresses as follows:
      
      - Patches #1 and #2 add support for a new next-hop object attribute,
        NHA_OP_FLAGS. That is meant to carry various op-specific signaling, in
        particular whether SW- and HW-collected nexthop stats should be part of
        the get or dump response. The idea is to avoid wasting message space, and
        time for collection of HW statistics, when the values are not needed.
      
      - Patches #3 and #4 add SW-datapath stats and corresponding UAPI.
      
      - Patches #5, #6 and #7 add support for HW-datapath stats and UAPI.
        Individual drivers still need to contribute the appropriate HW-specific
        support code.
      
      v4:
      - Patch #2:
          - s/nla_get_bitfield32/nla_get_u32/ in __nh_valid_dump_req().
      
      v3:
      - Patch #3:
          - Convert to u64_stats_t
      - Patch #4:
          - Give a symbolic name to the set of all valid dump flags
            for the NHA_OP_FLAGS attribute.
          - Convert to u64_stats_t
      - Patch #6:
          - Use a named constant for the NHA_HW_STATS_ENABLE policy.
      
      v2:
      - Patch #2:
          - Change OP_FLAGS to u32, enforce through NLA_POLICY_MASK
      - Patch #3:
          - Set err on nexthop_create_group() error path
      - Patch #4:
          - Use uint to encode NHA_GROUP_STATS_ENTRY_PACKETS
          - Rename jump target in nla_put_nh_group_stats() to avoid
            having to rename further in the patchset.
      - Patch #7:
          - Use uint to encode NHA_GROUP_STATS_ENTRY_PACKETS_HW
          - Do not cancel outside of nesting in nla_put_nh_group_stats()
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: nexthop: Expose nexthop group HW stats to user space · 5072ae00
      Ido Schimmel authored
      Add netlink support for reading NH group hardware stats.
      
      Stats collection is done through a new notifier,
      NEXTHOP_EVENT_HW_STATS_REPORT_DELTA. Drivers that implement HW counters for
      a given NH group are thereby asked to collect the stats and report back to
      core by calling nh_grp_hw_stats_report_delta(). This is similar to what
      netdevice L3 stats do.
      
      Besides exposing number of packets that passed in the HW datapath, also
      include information on whether any driver actually realizes the counters.
      The core can tell based on whether it got any _report_delta() reports from
      the drivers. This allows enabling the statistics at the group at any time,
      with drivers opting into supporting them. This is also in line with what
      netdevice L3 stats are doing.
      
      So as not to waste time and space, tie the collection and reporting of HW
      stats with a new op flag, NHA_OP_FLAG_DUMP_HW_STATS.
      Co-developed-by: Petr Machata <petrm@nvidia.com>
      Signed-off-by: Petr Machata <petrm@nvidia.com>
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Reviewed-by: Kees Cook <keescook@chromium.org> # For the __counted_by bits
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: nexthop: Add ability to enable / disable hardware statistics · 746c19a5
      Ido Schimmel authored
      Add netlink support for enabling collection of HW statistics on nexthop
      groups.
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Signed-off-by: Petr Machata <petrm@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: nexthop: Add hardware statistics notifications · 5877786f
      Ido Schimmel authored
      Add a hw_stats field to several notifier structures to communicate to the
      drivers that HW statistics should be configured for nexthops within a
      given group.
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Signed-off-by: Petr Machata <petrm@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: nexthop: Expose nexthop group stats to user space · 95fedd76
      Ido Schimmel authored
      Add netlink support for reading NH group stats.
      
      This data is only for statistics of the traffic in the SW datapath. HW
      nexthop group statistics will be added in the following patches.
      
      Emission of the stats is keyed to a new op_stats flag to avoid cluttering
      the netlink message with stats if the user doesn't need them:
      NHA_OP_FLAG_DUMP_STATS.
      Co-developed-by: Petr Machata <petrm@nvidia.com>
      Signed-off-by: Petr Machata <petrm@nvidia.com>
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: nexthop: Add nexthop group entry stats · f4676ea7
      Ido Schimmel authored
      Add nexthop group entry stats to count the number of packets forwarded
      via each nexthop in the group. The stats will be exposed to user space
      for better data path observability in the next patch.
      
      The per-CPU stats pointer is placed at the beginning of 'struct
      nh_grp_entry', so that all the fields accessed for the data path reside
      on the same cache line:
      
      struct nh_grp_entry {
              struct nexthop *           nh;                   /*     0     8 */
              struct nh_grp_entry_stats * stats;               /*     8     8 */
              u8                         weight;               /*    16     1 */
      
              /* XXX 7 bytes hole, try to pack */
      
              union {
                      struct {
                              atomic_t   upper_bound;          /*    24     4 */
                      } hthr;                                  /*    24     4 */
                      struct {
                              struct list_head uw_nh_entry;    /*    24    16 */
                              u16        count_buckets;        /*    40     2 */
                              u16        wants_buckets;        /*    42     2 */
                      } res;                                   /*    24    24 */
              };                                               /*    24    24 */
              struct list_head           nh_list;              /*    48    16 */
              /* --- cacheline 1 boundary (64 bytes) --- */
              struct nexthop *           nh_parent;            /*    64     8 */
      
              /* size: 72, cachelines: 2, members: 6 */
              /* sum members: 65, holes: 1, sum holes: 7 */
              /* last cacheline: 8 bytes */
      };
      Co-developed-by: Petr Machata <petrm@nvidia.com>
      Signed-off-by: Petr Machata <petrm@nvidia.com>
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f4676ea7
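      The per-CPU counter scheme can be modelled in userspace roughly like
      this; NR_CPUS and all names are illustrative, not the kernel's
      per-CPU API:

      ```c
      #include <assert.h>
      #include <stdint.h>

      #define NR_CPUS 4

      /* Each CPU increments its own slot on the forwarding path, so the
       * hot path never bounces a shared cache line; readers sum all
       * slots to get the total. */
      struct nh_entry_stats_sim {
          uint64_t packets[NR_CPUS];
      };

      static void stats_inc(struct nh_entry_stats_sim *s, int cpu)
      {
          s->packets[cpu]++;    /* lock-free on the hot path */
      }

      static uint64_t stats_read(const struct nh_entry_stats_sim *s)
      {
          uint64_t total = 0;

          for (int cpu = 0; cpu < NR_CPUS; cpu++)
              total += s->packets[cpu];
          return total;
      }
      ```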
    • Petr Machata's avatar
      net: nexthop: Add NHA_OP_FLAGS · a207eab1
      Petr Machata authored
      In order to add per-nexthop statistics, but still not increase netlink
      message size for consumers that do not care about them, there needs to be a
      toggle through which the user indicates their desire to get the statistics.
      To that end, add a new attribute, NHA_OP_FLAGS. The idea is for the
      attribute to carry arbitrary operation-specific flags, i.e. not to make
      it specific to get / dump.
      
      Add the new attribute to get and dump policies, but do not actually allow
      any flags yet -- those will come later as the flags themselves are defined.
      Add the necessary parsing code.
      Signed-off-by: Petr Machata <petrm@nvidia.com>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Reviewed-by: Ido Schimmel <idosch@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a207eab1
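      The "parse the attribute but accept no flags yet" behavior might be
      sketched as follows; the function name and error code choice are
      assumptions, not the kernel code:

      ```c
      #include <assert.h>
      #include <errno.h>
      #include <stdint.h>

      /* The supported mask starts empty and grows as flags are defined;
       * any unknown operation-specific flag is rejected up front. */
      static int parse_op_flags(uint32_t attr_value, uint32_t supported,
                                uint32_t *out)
      {
          if (attr_value & ~supported)
              return -EOPNOTSUPP;   /* unknown op-specific flag */
          *out = attr_value;
          return 0;
      }
      ```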
    • Petr Machata's avatar
      net: nexthop: Adjust netlink policy parsing for a new attribute · 2118f939
      Petr Machata authored
      A following patch will introduce a new attribute carrying op-specific
      flags that adjust the behavior of an operation. Different operations
      will recognize different flags.
      
      - To make the differentiation possible, stop sharing the policies for get
        and del operations.
      
      - To allow querying for presence of the attribute, have all the attribute
        arrays sized to NHA_MAX, regardless of what is permitted by policy, and
        pass the corresponding value to nlmsg_parse() as well.
      Signed-off-by: Petr Machata <petrm@nvidia.com>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Reviewed-by: Ido Schimmel <idosch@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      2118f939
    • Sai Krishna's avatar
      octeontx2-pf: Add TC flower offload support for TCP flags · 3b43f19d
      Sai Krishna authored
      This patch adds TC offload support for matching TCP flags
      from the TCP header.
      
      Example usage:
      tc qdisc add dev eth0 ingress
      
      TC rule to drop the TCP SYN packets:
      tc filter add dev eth0 ingress protocol ip flower ip_proto tcp \
          tcp_flags 0x02/0x3f skip_sw action drop
      Signed-off-by: Sai Krishna <saikrishnag@marvell.com>
      Reviewed-by: Simon Horman <horms@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      3b43f19d
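      The value/mask pair in the rule above (0x02/0x3f, i.e. SYN set and the
      other five flag bits clear) follows the usual masked-match semantics,
      which a one-line predicate captures:

      ```c
      #include <assert.h>
      #include <stdint.h>

      /* A packet matches when its header flags, restricted to the mask,
       * equal the value. TCP flag bits: FIN=0x01 SYN=0x02 RST=0x04
       * PSH=0x08 ACK=0x10 URG=0x20. */
      static int tcp_flags_match(uint8_t hdr_flags, uint8_t value,
                                 uint8_t mask)
      {
          return (hdr_flags & mask) == value;
      }
      ```

      So a bare SYN matches the rule while a SYN+ACK does not, which is why
      0x02/0x3f drops only connection attempts.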
    • fuyuanli's avatar
      tcp: Add skb addr and sock addr to arguments of tracepoint tcp_probe. · caabd859
      fuyuanli authored
      It is useful to expose the skb and sock addresses to the user in the
      tcp_probe tracepoint, so that more information about the reception of
      TCP data can be gathered via eBPF or other means.
      
      For example, when calculating the transmit latency between layer 2 and
      layer 4 with eBPF, a packet needs to be identified by seq and end_seq,
      which are not available in tcp_probe, so the only option was a kprobe
      hooking tcp_rcv_established. With the skb and sock addresses exposed,
      tcp_probe can be used directly, which is more efficient.
      Signed-off-by: fuyuanli <fuyuanli@didiglobal.com>
      Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      caabd859
    • Jakub Kicinski's avatar
      net: dqs: add NIC stall detector based on BQL · 6025b913
      Jakub Kicinski authored
      softnet_data->time_squeeze is sometimes used as a proxy for
      host overload or indication of scheduling problems. In practice
      this statistic is very noisy and has hard-to-grasp units -
      e.g. is 10 squeezes a second to be expected, or high?
      
      Delaying network (NAPI) processing leads to drops on NIC queues
      but also RTT bloat, impacting pacing and CA decisions.
      Stalls are a little hard to detect on the Rx side, because
      there may simply not have been any packets received in a
      given period of time. Packet timestamps help a little bit, but
      again we don't know if packets are stale because we're
      not keeping up or because someone (*cough* cgroups)
      disabled IRQs for a long time.
      
      We can, however, use Tx as a proxy for Rx stalls. Most drivers
      use combined Rx+Tx NAPIs so if Tx gets starved so will Rx.
      On the Tx side we know exactly when packets get queued,
      and completed, so there is no uncertainty.
      
      This patch adds stall checks to BQL. Why BQL? Because
      it's a convenient place to add such checks, already
      called by most drivers, and it has copious free space
      in its structures (this patch adds no extra cache
      references or dirtying to the fast path).
      
      The algorithm takes one parameter - max delay AKA stall
      threshold and increments a counter whenever NAPI got delayed
      for at least that amount of time. It also records the length
      of the longest stall.
      
      To be precise, every time NAPI has not polled for at least
      stall_thrs, we check whether any Tx packets were queued
      between the last NAPI run and now - stall_thrs/2.
      
      Unlike the classic Tx watchdog, this mechanism does not
      ignore stalls caused by Tx being disabled or loss of link.
      I don't think such a check is worth the complexity; a stall
      is a stall, and whether it is due to host overload, flow
      control, or link down doesn't matter much to the application.
      
      We have been running this detector in production at Meta
      for 2 years, with a threshold of 8ms. It's the lowest
      value at which false positives become rare. There's still
      a constant stream of reported stalls (especially without
      the ksoftirqd deferral patches reverted); those who like
      their stall metrics to be 0 may prefer a higher value.
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: Breno Leitao <leitao@debian.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      6025b913
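      The check described above can be modelled in a small simulation;
      times are in arbitrary ms and the field names are invented, not the
      kernel's BQL internals:

      ```c
      #include <assert.h>
      #include <stdint.h>

      /* Simplified model of the stall check: a stall is counted when NAPI
       * has not polled for at least stall_thrs and work was already
       * pending before now - stall_thrs/2. */
      struct stall_sim {
          uint64_t last_poll;     /* last time NAPI completed Tx work */
          uint64_t last_enqueue;  /* last time a packet was queued */
          uint64_t stall_thrs;    /* max tolerated delay */
          uint64_t stall_cnt;     /* number of detected stalls */
          uint64_t stall_max;     /* longest observed stall */
      };

      static void sim_enqueue(struct stall_sim *s, uint64_t now)
      {
          s->last_enqueue = now;
      }

      static void sim_poll(struct stall_sim *s, uint64_t now)
      {
          uint64_t delay = now - s->last_poll;

          /* Packets were queued between the last poll and
           * now - stall_thrs/2, so the delay was a real stall. */
          if (delay >= s->stall_thrs &&
              s->last_enqueue > s->last_poll &&
              s->last_enqueue <= now - s->stall_thrs / 2) {
              s->stall_cnt++;
              if (delay > s->stall_max)
                  s->stall_max = delay;
          }
          s->last_poll = now;
      }
      ```

      With stall_thrs = 8, a packet queued at t=2 followed by a poll at
      t=10 registers as a stall, while a packet queued at t=11 and polled
      at t=12 does not.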