- 06 Jun, 2024 31 commits
-
-
Davide Caratti authored
We used to call it the 'master' socket in the early stages of MPTCP development, but the correct wording is 'MPTCP' socket, as opposed to 'TCP subflows': convert the last 3 comments to use the more appropriate term. Signed-off-by: Davide Caratti <dcaratti@redhat.com> Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
-
Geliang Tang authored
As a wrapper of __tcp_space_from_win(), this patch adds an MPTCP-dedicated space_from_win helper, mptcp_space_from_win(), in protocol.h to pair with mptcp_win_from_space(). Use it instead of __tcp_space_from_win() in both mptcp_rcv_space_adjust() and mptcp_set_rcvlowat(). Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn> Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
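A minimal sketch of what such a wrapper could look like, assuming __tcp_space_from_win() takes the scaling ratio and the window as arguments; the exact prototype and placement are assumptions, not the merged code:

```c
/* Hypothetical sketch: MPTCP wrapper pairing with mptcp_win_from_space().
 * Assumes __tcp_space_from_win(scaling_ratio, win) exists with this shape. */
static inline int mptcp_space_from_win(const struct sock *sk, int win)
{
	return __tcp_space_from_win(mptcp_sk(sk)->scaling_ratio, win);
}
```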
-
Geliang Tang authored
The MPTCP-dedicated win_from_space helper mptcp_win_from_space() is defined in protocol.h; use it in mptcp_rcv_space_adjust() instead of the TCP one. Here scaling_ratio is the same as msk->scaling_ratio. Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn> Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
-
Jason Xing authored
After John Sperbeck reported a compile error when CONFIG_RFS_ACCEL is off, I found that I cannot easily enable/disable the config because it lacks a prompt in 'make menuconfig'. Therefore, I decided to change the RPS/RFS related configs altogether. Signed-off-by: Jason Xing <kernelxing@tencent.com> Link: https://lore.kernel.org/r/20240605022932.33703-1-kerneljasonxing@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
-
Eric Dumazet authored
This list is used to transfer dsts that are handled by rt_flush_dev() and rt6_uncached_list_flush_dev() out of the per-cpu lists, but the quarantine list is not used later. If we simply use list_del_init(&rt->dst.rt_uncached), this also removes the dst from the per-cpu list. This patch also makes future calls to rt_del_uncached_list() and rt6_uncached_list_del() faster, because no spinlock acquisition is needed anymore. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://lore.kernel.org/r/20240604165150.726382-1-edumazet@google.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
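A hedged sketch of the list handling this relies on (generic list API only; the dst-specific fields are not shown):

```c
#include <linux/list.h>

/* list_del_init() detaches the entry from whatever per-cpu uncached list
 * it currently sits on and leaves it self-linked, so a later deletion of
 * the same entry short-circuits and no spinlock acquisition is needed. */
static void example_detach_uncached(struct list_head *entry)
{
	list_del_init(entry);
}
```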
-
Eric Dumazet authored
Toke mentioned the existence of unrcu_pointer(), which allows us to remove some of the ugly casts we have when using xchg() on RCU-protected pointers. Also make inet_rcv_compat const. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Toke Høiland-Jørgensen <toke@redhat.com> Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Link: https://lore.kernel.org/r/20240604111603.45871-1-edumazet@google.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
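A hedged sketch of the pattern being cleaned up (the slot and type names are illustrative):

```c
#include <linux/atomic.h>
#include <linux/rcupdate.h>

struct item;

/* Publish 'new' into an RCU-protected slot and get the old plain pointer
 * back. unrcu_pointer() strips the __rcu annotation from xchg()'s return
 * value, so no explicit cast is needed. */
static struct item *example_swap(struct item __rcu **slot, struct item *new)
{
	return unrcu_pointer(xchg(slot, RCU_INITIALIZER(new)));
}
```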
-
Paolo Abeni authored
Paul Barker says: ==================== Improve GbEth performance on Renesas RZ/G2L and related SoCs This series aims to improve performance of the GbEth IP in the Renesas RZ/G2L SoC family and the RZ/G3S SoC, which use the ravb driver. Along the way, we do some refactoring and ensure that napi_complete_done() is used in accordance with the NAPI documentation for both GbEth and R-Car code paths. Much of the performance improvement comes from enabling SW IRQ Coalescing for all SoCs using the GbEth IP, and NAPI Threaded mode for single core SoCs using the GbEth IP. These can be enabled/disabled at runtime via sysfs, but our goal is to set sensible defaults which get good performance on the affected SoCs. The rest of the performance improvement comes from using a page pool to allocate RX buffers, and reducing the allocation size from >8kB to 2kB. The overall performance impact of this patch series seen in testing with iperf3 is as follows (see patches 5-7 for more detailed results): * RZ/G2L: * TCP TX: +1.8% bandwidth * TCP RX: +1% bandwidth at 47% less CPU load * UDP RX: +1% bandwidth at 26% less CPU load * RZ/G2UL: * TCP TX: +37% bandwidth * TCP RX: +43% bandwidth * UDP TX: -8% bandwidth * UDP RX: +32500% bandwidth (!) * RZ/G3S: * TCP TX: +25% bandwidth * TCP RX: +76% bandwidth * UDP TX: -9% bandwidth * UDP RX: +37900% bandwidth (!) * RZ/Five: * TCP TX: +18% bandwidth * TCP RX: +212% bandwidth * UDP TX: +2% bandwidth * UDP RX: +inf bandwidth (test no longer crashes) There is no significant impact on bandwidth or CPU load in testing on RZ/G2H or R-Car M3N. Fixing the crash in UDP RX testing for RZ/Five is a cumulative effect of patches 1, 2, 5 & 6 so this is very difficult to break out as a bugfix for backporting. Changes v4->v5: * Added Sergey's Reviewed-by tags. * Improved the commit message for patch 2/7. * Re-wrapped to 80 cols, except where this would significantly impact readability. * Use lower case `skb` consistently in comments. * Included <net/page_pool/types.h> in ravb.h. * Moved rx_buffer_size so it is in the same place in ravb_hw_info as rx_max_desc_use was previously. * Used reverse xmas tree ordering in variable declarations. * Split lines after binary operators, instead of before. * Factor subtraction of sizeof(__sum16) out of the if condition in ravb_rx_csum_gbeth(). * Add blank lines after variable declarations where needed. * Used goto instead of break to handle napi_build_skb() failure in ravb_rx_gbeth(). Break was incorrectly scoped to the surrounding switch statement, when it's the outer loop we really want to break out of. * Used continue instead of break to handle NULL priv->rx_1st_skb in ravb_rx_gbeth() as we may still be able to process further descriptors. * Unconditionally set priv->rx_1st_skb = NULL after processing a packet in ravb_rx_gbeth(). We don't need to check die_dt as this will be a no-op for single descriptor packets. * Moved napi_build_skb() call after dma_sync_single_for_cpu() in ravb_rx_rcar() to align the order of operations with ravb_rx_gbeth() and ensure the data is sync'd before it is accessed. * Moved zeroing of rx_buff->page to the end of packet processing in ravb_rx_rcar() to align the order of operations with ravb_rx_gbeth(). Changes v3->v4: * Dependency patches have merged so this is no longer an RFC. * Fixed update of stats->rx_packets. * Simplified refactoring following feedback from Niklas and Sergey. * Renamed needs_irq_coalesce -> coalesce_irqs. * Used a separate page pool for each RX queue. 
* Passed struct ravb_rx_desc to ravb_alloc_rx_buffer() so that we can simplify the calling function. * Explained the calculation of rx_desc->ds_cc. * Added handling of nonlinear SKBs in ravb_rx_csum_gbeth(). * Used Niklas' suggested commit message for patch 2/7. * Added Sergey's Reviewed-by tags to patches 5/7 and 6/7. Changes v2->v3: * Incorporated feedback on RFC v2 from Sergey. * Split out bugfixes and rebased. This changed the order of what was the first 5 patches of v2 and things look a little different so I've not picked up Reviewed-by tags from v2. * Further refactoring and tidy up of RX ring refill and ravb_rx_gbeth(). * Switched to using a page pool to allocate RX buffers. * Re-tested and provided updated performance figures. Changes v1->v2: * Marked as RFC as the series depends on unmerged patches. * Refactored R-Car code paths as well as GbEth code paths. * Updated references to the patches this series depends on. ==================== Link: https://lore.kernel.org/r/20240604072825.7490-1-paul.barker.ct@bp.renesas.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
-
Paul Barker authored
This patch makes multiple changes that can't be separated: 1) Allocate plain RX buffers via a page pool instead of allocating SKBs, then use build_skb() when a packet is received. 2) For GbEth IP, reduce the RX buffer size to 2kB. 3) For GbEth IP, merge packets which span more than one RX descriptor as SKB fragments instead of copying data. Implementing (1) without (2) would require the use of an order-1 page pool (instead of an order-0 page pool split into page fragments) for GbEth. Implementing (2) without (3) would leave us no space to re-assemble packets which span more than one RX descriptor. Implementing (3) without (1) would not be possible as the network stack expects to use put_page() or page_pool_put_page() to free SKB fragments after an SKB is consumed. RX checksum offload support is adjusted to handle both linear and nonlinear (fragmented) packets. This patch gives the following improvements during testing with iperf3. * RZ/G2L: * TCP RX: same bandwidth at -43% CPU load (70% -> 40%) * UDP RX: same bandwidth at -17% CPU load (88% -> 74%) * RZ/G2UL: * TCP RX: +30% bandwidth (726Mbps -> 941Mbps) * UDP RX: +417% bandwidth (108Mbps -> 558Mbps) * RZ/G3S: * TCP RX: +64% bandwidth (562Mbps -> 920Mbps) * UDP RX: +420% bandwidth (90Mbps -> 468Mbps) * RZ/Five: * TCP RX: +217% bandwidth (145Mbps -> 459Mbps) * UDP RX: +470% bandwidth (20Mbps -> 114Mbps) There is no significant impact on bandwidth or CPU load in testing on RZ/G2H or R-Car M3N. Signed-off-by: Paul Barker <paul.barker.ct@bp.renesas.com> Reviewed-by: Sergey Shtylyov <s.shtylyov@omp.ru> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
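A hedged sketch of the buffer scheme described: carving 2kB fragments out of page-pool pages and wrapping them with napi_build_skb() instead of copying into freshly allocated skbs. Sizes, names and header paths are illustrative, not the driver's actual code:

```c
#include <linux/skbuff.h>
#include <net/page_pool/helpers.h>

#define EX_RX_BUF_SZ	2048U	/* assumed GbEth RX buffer size */

static struct sk_buff *ex_build_rx_skb(struct page_pool *pool,
				       unsigned int pkt_len)
{
	unsigned int offset;
	struct sk_buff *skb;
	struct page *page;

	/* a 2kB fragment of an order-0 page-pool page */
	page = page_pool_dev_alloc_frag(pool, &offset, EX_RX_BUF_SZ);
	if (!page)
		return NULL;

	skb = napi_build_skb(page_address(page) + offset, EX_RX_BUF_SZ);
	if (!skb) {
		page_pool_put_full_page(pool, page, false);
		return NULL;
	}

	skb_mark_for_recycle(skb);	/* return the page to the pool on free */
	skb_put(skb, pkt_len);
	return skb;
}
```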
-
Paul Barker authored
NAPI Threaded mode (along with the previously enabled SW IRQ Coalescing) is required to improve network stack performance for single core SoCs using the GbEth IP (currently the RZ/G2L SoC family and the RZ/G3S SoC). This patch gives the following improvements during testing with iperf3. * RZ/G2UL: * TCP TX: +32% bandwidth (638Mbps -> 841Mbps) * TCP RX: +8.8% bandwidth (667Mbps -> 726Mbps) * UDP RX: +104% bandwidth (53Mbps -> 108Mbps) * RZ/G3S: * TCP TX: +29% bandwidth (529Mbps -> 681Mbps) * UDP RX: +1290% bandwidth (6.46Mbps -> 90Mbps) * RZ/Five: * UDP RX: Test no longer crashes (0 -> 20 Mbps) This patch gives the following reductions in performance in the same testing: * RZ/G2UL: * UDP TX: -7.5% bandwidth (594Mbps -> 549Mbps) * RZ/G3S: * UDP TX: -5% bandwidth (625Mbps -> 594Mbps) These losses are considered acceptable given the benefits shown above. If UDP TX bandwidth must be maximised for a particular use case, NAPI threaded mode can be disabled at runtime via sysfs writes. The improvement of UDP RX bandwidth for the single core SoCs (RZ/G2UL & RZ/G3S) is particularly critical. Signed-off-by: Paul Barker <paul.barker.ct@bp.renesas.com> Reviewed-by: Sergey Shtylyov <s.shtylyov@omp.ru> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
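A hedged sketch of how a driver can default to threaded NAPI on the affected SoCs; the flag used to detect a single-core GbEth system is an assumption:

```c
#include <linux/netdevice.h>

static void ex_set_napi_defaults(struct net_device *ndev, bool single_core_gbeth)
{
	/* Default to threaded NAPI on single-core GbEth SoCs; users can
	 * still flip this at runtime via /sys/class/net/<dev>/threaded. */
	if (single_core_gbeth)
		dev_set_threaded(ndev, true);
}
```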
-
Paul Barker authored
Software IRQ Coalescing is required to improve network stack performance in the RZ/G2L SoC family and the RZ/G3S SoC, i.e. the SoCs which use the GbEth IP. This patch gives the following improvements during testing with iperf3: * RZ/G2L: * TCP RX: same bandwidth with -6% CPU load (76% -> 71%) * UDP RX: same bandwidth with -10% CPU load (99% -> 89%) * RZ/G2UL: * UDP RX: +4200% bandwidth (1.23Mbps -> 53Mbps) * RZ/G3S: * UDP RX: +425% bandwidth (1.23Mbps -> 6.46Mbps) The improvement of UDP RX bandwidth for the single core SoCs (RZ/G2UL & RZ/G3S) is particularly critical. Signed-off-by: Paul Barker <paul.barker.ct@bp.renesas.com> Reviewed-by: Sergey Shtylyov <s.shtylyov@omp.ru> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
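A hedged sketch of opting in to the software IRQ coalescing defaults for GbEth-based SoCs (the condition name is illustrative):

```c
#include <linux/netdevice.h>

static void ex_enable_sw_irq_coalesce(struct net_device *ndev, bool gbeth_ip)
{
	/* Sets default gro_flush_timeout/napi_defer_hard_irqs values;
	 * both remain tunable at runtime via sysfs. */
	if (gbeth_ip)
		netdev_sw_irq_coalesce_default_on(ndev);
}
```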
-
Paul Barker authored
We can reduce code duplication in ravb_rx_gbeth(). Signed-off-by: Paul Barker <paul.barker.ct@bp.renesas.com> Reviewed-by: Sergey Shtylyov <s.shtylyov@omp.ru> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
-
Paul Barker authored
To reduce code duplication, we add a new RX ring refill function which can handle both the initial RX ring population (which was split between ravb_ring_init() and ravb_ring_format()) and the RX ring refill after polling (in ravb_rx()). Signed-off-by: Paul Barker <paul.barker.ct@bp.renesas.com> Reviewed-by: Sergey Shtylyov <s.shtylyov@omp.ru> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
-
Paul Barker authored
Align ravb_poll() with the documentation in `Documentation/networking/kapi.rst` and `Documentation/networking/napi.rst`. The documentation says that we should prefer napi_complete_done() over napi_complete(), and using the former allows us to properly support busy polling. We should ensure that napi_complete_done() is only called if the work budget has not been exhausted, and we should only re-arm interrupts if it returns true. Signed-off-by: Paul Barker <paul.barker.ct@bp.renesas.com> Reviewed-by: Sergey Shtylyov <s.shtylyov@omp.ru> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
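A hedged sketch of the polling pattern described (the RX and IRQ helpers are hypothetical stand-ins for the driver's own functions):

```c
#include <linux/netdevice.h>

/* hypothetical driver helpers, stand-ins for ravb's own functions */
static int ex_rx(struct napi_struct *napi, int budget);
static void ex_enable_irqs(struct napi_struct *napi);

static int ex_poll(struct napi_struct *napi, int budget)
{
	int work_done = ex_rx(napi, budget);

	/* Only complete NAPI when the budget was not exhausted, and only
	 * re-arm interrupts when napi_complete_done() returns true. */
	if (work_done < budget && napi_complete_done(napi, work_done))
		ex_enable_irqs(napi);

	return work_done;
}
```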
-
Paul Barker authored
We don't need to pass the work budget to ravb_rx() by reference; it's cleaner to pass it by value and return the amount of work done. This allows us to simplify the ravb_poll() function and use the common `work_done` variable name seen in other network drivers for consistency and ease of understanding. This is a pure refactor and should not affect behaviour. Signed-off-by: Paul Barker <paul.barker.ct@bp.renesas.com> Reviewed-by: Sergey Shtylyov <s.shtylyov@omp.ru> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
-
Jakub Kicinski authored
Tariq Toukan says: ==================== net/mlx5e: SHAMPO, Enable HW GRO once more This series enables hardware GRO for ConnectX-7 and newer NICs. SHAMPO stands for Split Header And Merge Payload Offload. The first part of the series contains important fixes and improvements. The second part reworks the HW GRO counters. Lastly, HW GRO is perf optimized and enabled. Here are the bandwidth numbers for a simple iperf3 test over a single rq where the application and irq are pinned to the same CPU:
+---------+--------+--------+-----------+-------------+
| streams | SW GRO | HW GRO | Unit      | Improvement |
+---------+--------+--------+-----------+-------------+
| 1       | 36     | 57     | Gbits/sec | 1.6 x       |
| 4       | 34     | 50     | Gbits/sec | 1.5 x       |
| 8       | 31     | 43     | Gbits/sec | 1.4 x       |
+---------+--------+--------+-----------+-------------+
Benchmark details: VM based setup CPU: Intel(R) Xeon(R) Platinum 8380 CPU, 24 cores NIC: ConnectX-7 100GbE iperf3 and irq running on same CPU over a single receive queue ==================== Link: https://lore.kernel.org/r/20240603212219.1037656-1-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-
Dragos Tatulea authored
When doing hardware GRO (SHAMPO), the driver puts each data payload of a packet from the wire into one skb fragment. TCP Zero-Copy expects page sized skb fragments to be able to do its page-flipping magic. With the current way of arranging fragments by the driver, only specific MTUs (page sized multiple + header size) will yield such page sized fragments in a high percentage. This change improves payload arrangement in the skb for hardware GRO by coalescing payloads into a single skb fragment when possible. To demonstrate the fix, running tcp_mmap with an MTU of 1500 yields: - Before: 0 % bytes mmap'ed - After : 81 % bytes mmap'ed More importantly, coalescing considerably improves the HW GRO performance. Here are the results for an iperf3 bandwidth benchmark:
+---------+--------+--------+------------------------+-----------+
| streams | SW GRO | HW GRO | HW GRO with coalescing | Unit      |
|---------+--------+--------+------------------------+-----------|
| 1       | 36     | 42     | 57                     | Gbits/sec |
| 4       | 34     | 39     | 50                     | Gbits/sec |
| 8       | 31     | 35     | 43                     | Gbits/sec |
+---------+--------+--------+------------------------+-----------+
Benchmark details: VM based setup CPU: Intel(R) Xeon(R) Platinum 8380 CPU, 24 cores NIC: ConnectX-7 100GbE iperf3 and irq running on same CPU over a single receive queue Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://lore.kernel.org/r/20240603212219.1037656-15-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
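A hedged sketch of the coalescing idea: if the next payload lands in the same page immediately after the last fragment, grow that fragment instead of adding a new one, so fragments tend toward the page-sized chunks that TCP zero-copy can flip (names are illustrative, not mlx5's actual helper):

```c
#include <linux/skbuff.h>

static void ex_add_payload(struct sk_buff *skb, struct page *page,
			   unsigned int off, unsigned int len,
			   unsigned int truesize)
{
	int nr = skb_shinfo(skb)->nr_frags;

	if (nr) {
		skb_frag_t *last = &skb_shinfo(skb)->frags[nr - 1];

		/* contiguous with the previous fragment: extend it */
		if (skb_frag_page(last) == page &&
		    skb_frag_off(last) + skb_frag_size(last) == off) {
			skb_coalesce_rx_frag(skb, nr - 1, len, truesize);
			return;
		}
	}

	skb_add_rx_frag(skb, nr, page, off, len, truesize);
}
```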
-
Yoray Zack authored
Add back HW-GRO to the reported features. As the current implementation of HW-GRO uses KSMs with a specific fixed buffer size (256B) to map its headers buffer, we report the feature only if the NIC supports KSM and the minimum value for buffer size is below the requested one. iperf3 bandwidth comparison:
+---------+--------+--------+-----------+
| streams | SW GRO | HW GRO | Unit      |
|---------+--------+--------+-----------|
| 1       | 36     | 42     | Gbits/sec |
| 4       | 34     | 39     | Gbits/sec |
| 8       | 31     | 35     | Gbits/sec |
+---------+--------+--------+-----------+
A downstream patch will add skb fragment coalescing which will improve performance considerably. Benchmark details: VM based setup CPU: Intel(R) Xeon(R) Platinum 8380 CPU, 24 cores NIC: ConnectX-7 100GbE iperf3 and irq running on same CPU over a single receive queue Signed-off-by: Yoray Zack <yorayz@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://lore.kernel.org/r/20240603212219.1037656-14-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-
Yoray Zack authored
A KSM Mkey is a KLM Mkey with a fixed buffer size. Due to this fact, it is a faster mechanism than KLM. The SHAMPO feature used KLM Mkeys for memory mappings of its headers buffer. As it used KLMs with the same buffer size for each entry, we can use KSMs instead. This commit changes the Mkeys that map the SHAMPO headers buffer from KLMs to KSMs. Signed-off-by: Yoray Zack <yorayz@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://lore.kernel.org/r/20240603212219.1037656-13-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-
Tariq Toukan authored
Count the number of header-only packets and bytes from SHAMPO. Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://lore.kernel.org/r/20240603212219.1037656-12-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-
Dragos Tatulea authored
After modifying rx_gro_packets to be more accurate, the rx_gro_match_packets counter is redundant. Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://lore.kernel.org/r/20240603212219.1037656-11-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-
Dragos Tatulea authored
Don't count non GRO packets. A non GRO packet is a packet with a GRO cb count of 1. Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://lore.kernel.org/r/20240603212219.1037656-10-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-
Yoray Zack authored
A SHAMPO SKB can be flushed in mlx5e_shampo_complete_rx_cqe(). If the SKB was flushed, rq->hw_gro_data->skb was also set to NULL. We can skip flushing the SKB in mlx5e_shampo_flush_skb() if rq->hw_gro_data->skb == NULL. Signed-off-by: Yoray Zack <yorayz@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://lore.kernel.org/r/20240603212219.1037656-9-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-
Dragos Tatulea authored
mlx5e_fill_skb_data() used to have multiple callers. But after the XDP multibuf refactoring from commit 2cb0e27d ("net/mlx5e: RX, Prepare non-linear striding RQ for XDP multi-buffer support") the SHAMPO code path is the only caller. Take advantage of this and specialize the function: - Drop the redundant check. - Assume that data_bcnt is > 0. This is needed in a downstream patch. Rename the function as well to make things clear. Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Suggested-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://lore.kernel.org/r/20240603212219.1037656-8-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-
Dragos Tatulea authored
The function that releases SHAMPO header pages (mlx5e_shampo_dealloc_hd) has some complicated logic that comes from the fact that it is called twice during teardown: 1) To release the posted header pages that didn't get any completions. 2) To release all remaining header pages. This flow is not necessary: all header pages can be released from the driver side in one go. Furthermore, the above flow is buggy. Taking the 8 headers per page example: 1) Release fragments 5-7. The page will be released. 2) Release remaining fragments 0-4. The bits in the header will indicate that the page needs releasing. But this is incorrect: the page was already released in step 1. This patch releases all header pages in one go. This simplifies the header page cleanup function. For consistency, the datapath header page release API (mlx5e_free_rx_shampo_hd_entry()) is used. Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://lore.kernel.org/r/20240603212219.1037656-7-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-
Dragos Tatulea authored
When HW GRO is enabled, forwarding of packets is broken due to gso_size being set incorrectly on non GRO packets. Non GRO packets have a skb GRO count of 1. mlx5 always sets gso_size on the skb, even for non GRO packets. It leans on the fact that gso_size is normally reset in napi_gro_complete(). But this happens only for packets from GRO'able protocols (TCP/UDP) that have a gro_receive() handler. The problematic scenarios are: 1) Non GRO protocol packets are received, validate_xmit_skb() will drop them (see EPROTONOSUPPORT in skb_mac_gso_segment()). The fix for this case would be to not set gso_size at all for SHAMPO packets with header size 0. 2) Packets from a GRO'ed protocol (TCP) are received but immediately flushed because they are not GRO'able (TCP SYN for example). mlx5e_shampo_update_hdr(), which updates the remaining GRO state on the skb, is not called because skb GRO count is 1. The fix here would be to always call mlx5e_shampo_update_hdr(), regardless of skb GRO count. But this call is expensive. The unified fix for both cases is to reset gso_size before calling napi_gro_receive(). It is a change that is more effective (no call to mlx5e_shampo_update_hdr() necessary) and simple (smallest code footprint). Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://lore.kernel.org/r/20240603212219.1037656-6-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
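A hedged sketch of the unified fix described (the napi context and function name are illustrative, not the driver's actual code):

```c
#include <linux/netdevice.h>
#include <linux/skbuff.h>

static void ex_deliver(struct napi_struct *napi, struct sk_buff *skb)
{
	/* Reset gso_size before handing the skb to GRO, so packets that GRO
	 * ends up not merging (GRO cb count of 1) don't carry a stale
	 * gso_size into the forwarding path. */
	skb_shinfo(skb)->gso_size = 0;
	napi_gro_receive(napi, skb);
}
```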
-
Dragos Tatulea authored
For the following scenario: ethtool --features eth3 rx-gro-hw on ethtool --features eth3 rx-fcs on ethtool --features eth3 rx-fcs off ... there is a firmware error because the driver enables HW GRO first while FCS is still enabled. This patch fixes this by swapping the order of HW GRO and FCS for this specific case. Take LRO into consideration as well for consistency. Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://lore.kernel.org/r/20240603212219.1037656-5-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-
Dragos Tatulea authored
When all the strides in a WQE have been consumed, the WQE is unlinked from the WQ linked list (mlx5_wq_ll_pop()). For SHAMPO, it is possible to receive CQEs with 0 consumed strides for the same WQE even after the WQE is fully consumed and unlinked. This triggers an additional unlink for the same WQE, which corrupts the linked list. Fix this scenario by accepting 0-sized consumed strides without unlinking the WQE again. Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://lore.kernel.org/r/20240603212219.1037656-4-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-
Dragos Tatulea authored
Under the following conditions: 1) No skb created yet 2) header_size == 0 (no SHAMPO header) 3) (header_index + 1) % MLX5E_SHAMPO_WQ_HEADER_PER_PAGE == 0 (this is the last page fragment of a SHAMPO header page) a new skb is formed with a page that is NOT a SHAMPO header page (it is a regular data page). Further down in the same function (mlx5e_handle_rx_cqe_mpwrq_shampo()), a SHAMPO header page from header_index is released. This is wrong and it leads to SHAMPO header pages being released more than once. Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://lore.kernel.org/r/20240603212219.1037656-3-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-
Tariq Toukan authored
Let the SHAMPO functions use the net-specific prefetch API, similar to all other usages. Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://lore.kernel.org/r/20240603212219.1037656-2-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-
Lukasz Majewski authored
Fixed MAC addresses help with debugging, as the last four bytes identify the network namespace. Signed-off-by: Lukasz Majewski <lukma@denx.de> Link: https://lore.kernel.org/r/20240603093322.3150030-1-lukma@denx.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-
Lukasz Majewski authored
Fixed MAC addresses help with debugging, as the last four bytes identify the network namespace. Moreover, it allows mimicking a real-life setup where, for example, a bridge has the same MAC address on each port. Signed-off-by: Lukasz Majewski <lukma@denx.de> Link: https://lore.kernel.org/r/20240603093322.3150030-2-lukma@denx.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-
- 05 Jun, 2024 9 commits
-
-
Jakub Kicinski authored
Ronak Doshi says: ==================== vmxnet3: upgrade to version 9 vmxnet3 emulation has recently added a timestamping feature which allows the hypervisor (ESXi) to calculate latency from the guest virtual NIC driver all the way up to the physical NIC. This patch series extends the vmxnet3 driver to leverage this new feature. Compatibility is maintained using the existing vmxnet3 versioning mechanism as follows: - new features added to vmxnet3 emulation are associated with a new vmxnet3 version, viz. vmxnet3 version 9. - emulation advertises all the versions it supports to the driver. - during initialization, the vmxnet3 driver picks the highest version number supported by both the emulation and the driver and configures the emulation to run at that version. In particular, the following changes are introduced: Patch 1: This patch introduces utility macros for vmxnet3 version 9 comparison and updates Copyright information. Patch 2: This patch adds support to timestamp the packets so as to allow latency measurement in the ESXi. Patch 3: This patch adds support to disable certain offloads on the device based on the request specified by the user in the VM configuration. Patch 4: With all vmxnet3 version 9 changes incorporated in the vmxnet3 driver, with this patch, the driver can configure the emulation to run at vmxnet3 version 9. ==================== Link: https://lore.kernel.org/r/20240531193050.4132-1-ronak.doshi@broadcom.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
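A hedged sketch of the version negotiation described in the cover letter (the bitmask layout and helper name are assumptions for illustration, not the vmxnet3 code):

```c
#include <linux/bits.h>
#include <linux/types.h>

/* Pick the highest version supported by both the device (advertised as a
 * bitmask where bit N-1 means "version N") and the driver. */
static u32 ex_pick_version(u32 device_versions, u32 driver_max_version)
{
	u32 ver;

	for (ver = driver_max_version; ver >= 1; ver--)
		if (device_versions & BIT(ver - 1))
			return ver;

	return 0;	/* no common version */
}
```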
-
Ronak Doshi authored
With all vmxnet3 version 9 changes incorporated in the vmxnet3 driver, the driver can configure emulation to run at vmxnet3 version 9, provided the emulation advertises support for version 9. Signed-off-by: Ronak Doshi <ronak.doshi@broadcom.com> Acked-by: Guolin Yang <guolin.yang@broadcom.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://lore.kernel.org/r/20240531193050.4132-5-ronak.doshi@broadcom.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-
Ronak Doshi authored
This patch adds a new command to disable certain offloads. This allows the user to specify, via the VM configuration, whether certain offloads need to be disabled. Signed-off-by: Ronak Doshi <ronak.doshi@broadcom.com> Acked-by: Guolin Yang <guolin.yang@broadcom.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://lore.kernel.org/r/20240531193050.4132-4-ronak.doshi@broadcom.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-
Ronak Doshi authored
This patch enhances vmxnet3 to support latency measurement. This support helps to track the latency in packet processing between the guest virtual NIC driver and the host. For this purpose, we introduce a new timestamp ring in vmxnet3, one per Tx/Rx queue. This ring is used to carry timestamps of the packets, which are used to calculate the latency. The user can enable latency measurement using the realtime knob in the vNIC settings in vCenter. Signed-off-by: Ronak Doshi <ronak.doshi@broadcom.com> Acked-by: Guolin Yang <guolin.yang@broadcom.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://lore.kernel.org/r/20240531193050.4132-3-ronak.doshi@broadcom.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-
Ronak Doshi authored
vmxnet3 is currently at version 7 and this patch initiates the preparation to accommodate changes for up to version 9. Introduce utility macros for vmxnet3 version 9 comparison and update the Copyright information. Signed-off-by: Ronak Doshi <ronak.doshi@broadcom.com> Acked-by: Guolin Yang <guolin.yang@broadcom.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://lore.kernel.org/r/20240531193050.4132-2-ronak.doshi@broadcom.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-
David Christensen authored
Current ionic devices only support 52 internal physical address lines. This is sufficient for x86_64 systems which have similar limitations but does not apply to all other architectures, notably IBM POWER (ppc64). To ensure that MSI/MSI-X vectors are not set outside the physical address limits of the NIC, set the no_64bit_msi value of the pci_dev structure during device probe. Signed-off-by: David Christensen <drc@linux.ibm.com> Reviewed-by: Shannon Nelson <shannon.nelson@amd.com> Link: https://lore.kernel.org/r/20240603212747.1079134-1-drc@linux.ibm.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
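A hedged sketch of the probe-time change described; the function name is illustrative, while no_64bit_msi is the pci_dev field named in the commit:

```c
#include <linux/pci.h>

static int ex_probe(struct pci_dev *pdev, const struct pci_device_id *id)
{
	/* The NIC only decodes 52 physical address bits, so ask the PCI
	 * core to keep MSI/MSI-X message addresses within 32 bits. */
	pdev->no_64bit_msi = 1;

	/* rest of the probe routine omitted */
	return 0;
}
```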
-
Vadim Fedorenko authored
atomic_dec_if_positive() returns the new value regardless of whether it was updated or not. The commit referenced in the Fixes tag changed the behavior of the condition to one that differs from the original code. Restore the original condition to properly maintain the atomic counter. Fixes: 165f8769 ("bnxt_en: add timestamping statistics support") Reviewed-by: Michael Chan <michael.chan@broadcom.com> Signed-off-by: Vadim Fedorenko <vadfed@meta.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://lore.kernel.org/r/20240604091939.785535-1-vadfed@meta.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
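A hedged sketch of the semantics behind the fix (the counter name is illustrative):

```c
#include <linux/atomic.h>
#include <linux/types.h>

/* atomic_dec_if_positive() returns (old value - 1) even when the counter
 * was not actually decremented, so success must be tested as >= 0 rather
 * than as a plain truth value. */
static bool ex_reserve_slot(atomic_t *avail)
{
	return atomic_dec_if_positive(avail) >= 0;
}
```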
-
David S. Miller authored
Kevin Yang says: ==================== tcp: add sysctl_tcp_rto_min_us Adding a sysctl knob to allow the user to specify a default rto_min at socket init time. After this patch series, the rto_min will have multiple sources: the route option has the highest precedence, followed by the TCP_BPF_RTO_MIN socket option, followed by this new tcp_rto_min_us sysctl. v3: fix typo, simplify min/max_t to min/max v2: fit line width to 80 column. v2: https://lore.kernel.org/netdev/20240530153436.2202800-1-yyd@google.com/ v1: https://lore.kernel.org/netdev/20240528171320.1332292-1-yyd@google.com/ ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
-
Kevin Yang authored
Adding a sysctl knob to allow the user to specify a default rto_min at socket init time, rather than using the hard-coded 200ms default rto_min. Note that the rto_min route option has the highest precedence for configuring this setting, followed by the TCP_BPF_RTO_MIN socket option, followed by the tcp_rto_min_us sysctl. Signed-off-by: Kevin Yang <yyd@google.com> Reviewed-by: Neal Cardwell <ncardwell@google.com> Reviewed-by: Yuchung Cheng <ycheng@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Tony Lu <tonylu@linux.alibaba.com> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
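A hedged sketch of the precedence described above (struct and field names are illustrative, not the kernel's):

```c
struct ex_rto_config {
	unsigned long route_rto_min_us;		/* ip route ... rto_min, 0 if unset */
	unsigned long bpf_rto_min_us;		/* TCP_BPF_RTO_MIN, 0 if unset */
	unsigned long sysctl_rto_min_us;	/* net.ipv4.tcp_rto_min_us */
};

static unsigned long ex_effective_rto_min_us(const struct ex_rto_config *c)
{
	if (c->route_rto_min_us)	/* route option wins */
		return c->route_rto_min_us;
	if (c->bpf_rto_min_us)		/* then the per-socket BPF value */
		return c->bpf_rto_min_us;
	return c->sysctl_rto_min_us;	/* finally the sysctl default */
}
```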
-