Commits · d7df09f5e7b46af0eb927c065113faa57411d100 · Kirill Smelkov / linux

17 Nov, 2021 24 commits

net/mlx5: E-switch, Enable vport QoS on demand · d7df09f5

Dmytro Linkin authored Sep 21, 2021

Vports' QoS is not commonly used but consume SW/HW resources, which
becomes an issue on BlueField SoC systems.
Don't enable QoS on vports by default on eswitch mode change and enable
when it's going to be used by one of the top level users:
- configuring TC matchall filter with police action;
- setting rate with legacy NDO API;
- calling devlink ops->rate_leaf_*() callbacks.

Disable vport QoS on vport cleanup.
Signed-off-by: Dmytro Linkin <dlinkin@nvidia.com>
Reviewed-by: Parav Pandit <parav@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>

d7df09f5

net/mlx5: E-switch, move offloads mode callbacks to offloads file · e9d491a6

Parav Pandit authored Oct 21, 2021

eswitch.c is mainly for common code between legacy and offloads mode.
MAC address get and set via devlink is applicable only in offloads mode.

Hence, move it to eswitch_offloads.c file.
Signed-off-by: Parav Pandit <parav@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>

e9d491a6

net/mlx5: E-switch, Reuse mlx5_eswitch_set_vport_mac · b22fd438

Parav Pandit authored Oct 21, 2021

mlx5_eswitch_set_vport_mac() routine already does necessary checks which
are duplicated in implementation of
mlx5_devlink_port_function_hw_addr_set().

Hence, reuse mlx5_eswitch_set_vport_mac() and cut down the code.
Signed-off-by: Parav Pandit <parav@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>

b22fd438

net/mlx5: E-switch, Remove vport enabled check · fcf8ec54

Parav Pandit authored Oct 20, 2021

An eswitch vport of the devlink port is always enabled before a
devlink port is registered. And a eswitch vport is always disabled
after a devlink port is unregistered.
Hence avoid the vport enabled check in the devlink callback routine.
Such check is only applicable in the legacy SR-IOV callbacks.
Signed-off-by: Parav Pandit <parav@nvidia.com>
Reviewed-by: Sunil Sudhakar Rani <sunrani@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>

fcf8ec54

net/mlx5e: Specify out ifindex when looking up decap route · 819c319c

Chris Mi authored Oct 26, 2021

There is a use case that the local and remote VTEPs are in the same
host. Currently, the out ifindex is not specified when looking up the
decap route for offloads. So in this case, a local route is returned
and the route dev is lo.

Actual tunnel interface can be created with a parameter "dev" [1],
which specifies the physical device to use for tunnel endpoint
communication. Pass this parameter to driver when looking up decap
route for offloads. So that a unicast route will be returned.

[1] ip link add name vxlan1 type vxlan id 100 dev enp4s0f0 remote 1.1.1.1 dstport 4789
Signed-off-by: Chris Mi <cmi@nvidia.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>

819c319c

net/mlx5e: TC, Move comment about mod header flag to correct place · fc3a879a

Roi Dayan authored Nov 10, 2021

Move the comment to the correct place where the driver actually
removes the flag and not in the check that maybe pedit actions exists.
Signed-off-by: Roi Dayan <roid@nvidia.com>
Reviewed-by: Maor Dickman <maord@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>

fc3a879a

net/mlx5e: TC, Move kfree() calls after destroying all resources · 88d97486

Roi Dayan authored Nov 01, 2021

When deleting fdb/nic flow rules first release all resources
and then call the kfree() calls instead of sparse them around
the function.
Signed-off-by: Roi Dayan <roid@nvidia.com>
Reviewed-by: Vlad Buslov <vladbu@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>

88d97486

net/mlx5e: TC, Destroy nic flow counter if exists · 972fe492

Roi Dayan authored Nov 01, 2021

Counter is only added if counter flag exists.
So check the counter fag exists for deleting the counter.
This is the same as in add/del fdb flow.
Signed-off-by: Roi Dayan <roid@nvidia.com>
Reviewed-by: Vlad Buslov <vladbu@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>

972fe492

net/mlx5: TC, using swap() instead of tmp variable · 0164a9bd

Yihao Han authored Nov 02, 2021

swap() was used instead of the tmp variable to swap values
Signed-off-by: Yihao Han <hanyihao@vivo.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>

0164a9bd

net/mlx5: CT: Allow static allocation of mod headers · 1cfd3490

Paul Blakey authored Aug 25, 2021

As each CT rule uses at least 4 modify header actions, each rule
causes at least 3 reallocations by the mod header actions api.

Allow initial static allocation of the mod acts array, and use it for
CT rules. If the static allocation is exceeded go back to dynamic
allocation.
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Paul Blakey <paulb@nvidia.com>
Reviewed-by: Oz Shlomo <ozsh@nvidia.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>

1cfd3490

net/mlx5e: Refactor mod header management API · 2c0e5cf5

Paul Blakey authored Jul 05, 2021

For all mod hdr related functions to reside in a single self contained
component (mod_hdr.c), refactor alloc() and add get_id() so that user
won't rely on internal implementation, and move both to mod_hdr
component.

Rename the prefix to mlx5e_mod_hdr_* as other mod hdr functions.
Signed-off-by: Paul Blakey <paulb@nvidia.com>
Reviewed-by: Oz Shlomo <ozsh@nvidia.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>

2c0e5cf5

net/mlx5: Avoid printing health buffer when firmware is unavailable · f28a14c1

Aya Levin authored Nov 09, 2021

Use firmware version field as an indication to health buffer's sanity.
When firmware version is 0xFFFFFFFF, deduce that firmware is unavailable
and avoid printing the health buffer to dmesg as it doesn't provide
debug info.
Signed-off-by: Aya Levin <ayal@nvidia.com>
Reviewed-by: Gal Pressman <gal@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>

f28a14c1

net/mlx5: Fix format-security build warnings · aef0f8c6

Saeed Mahameed authored Nov 03, 2021

Treat the string as an argument to avoid this.

drivers/net/ethernet/mellanox/mlx5/core/pci_irq.c:482:5:
error: format string is not a string literal (potentially insecure)
                         name);
                         ^~~~
drivers/net/ethernet/mellanox/mlx5/core/en_stats.c:2079:4:
error: format string is not a string literal (potentially insecure)
                        ptp_ch_stats_desc[i].format);
                        ^~~~~~~~~~~~~~~~~~~~~~~~~~~
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Reviewed-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>

aef0f8c6

net/mlx5e: Support ethtool cq mode · bc541621

Saeed Mahameed authored Sep 14, 2021

Add support for ethtool coalesce cq mode set and get.
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>

bc541621

net: document SMII and correct phylink's new validation mechanism · b9241f54

Russell King (Oracle) authored Nov 15, 2021

SMII has not been documented in the kernel, but information on this PHY
interface mode has been recently found. Document it, and correct the
recently introduced phylink handling for this interface mode.
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://lore.kernel.org/r/E1mmfVl-0075nP-14@rmk-PC.armlinux.org.ukSigned-off-by: Jakub Kicinski <kuba@kernel.org>

b9241f54

Merge branch... · be0f6c41

Jakub Kicinski authored Nov 16, 2021

Merge branch 'r8169-disable-detection-of-further-chip-versions-that-didn-t-make-it-to-the-mass-market'

Heiner Kallweit says:

====================
r8169: disable detection of further chip versions that didn't make it to the mass market

There's no sign of life from further chip versions. Seems they didn't
make it to the mass market. Let's disable detection and if nobody
complains remove support a few kernel versions later.
====================

Link: https://lore.kernel.org/r/7708d13a-4a2b-090d-fadf-ecdd0fff5d2e@gmail.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

be0f6c41

r8169: disable detection of chip version 41 · 364ef1f3

Heiner Kallweit authored Nov 15, 2021

It seems this chip version never made it to the wild. Therefore
disable detection and if nobody complains remove support completely
later.
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

364ef1f3

r8169: disable detection of chip version 45 · 6c8a5cf9

Heiner Kallweit authored Nov 15, 2021

It seems this chip version never made it to the wild. Therefore
disable detection and if nobody complains remove support completely
later.
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

6c8a5cf9

r8169: disable detection of chip versions 49 and 50 · 2d6600c7

Heiner Kallweit authored Nov 15, 2021

It seems these chip versions never made it to the wild. Therefore
disable detection and if nobody complains remove support completely
later.
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

2d6600c7

r8169: enable ASPM L1/L1.1 from RTL8168h · 4b5f82f6

Heiner Kallweit authored Nov 15, 2021

With newer chip versions ASPM-related issues seem to occur only if
L1.2 is enabled. I have a test system with RTL8168h that gives a
number of rx_missed errors when running iperf and L1.2 is enabled.
With L1.2 disabled (and L1 + L1.1 active) everything is fine.
See also [0]. Can't test this, but L1 + L1.1 being active should be
sufficient to reach higher package power saving states.

[0] https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1942830Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Link: https://lore.kernel.org/r/36feb8c4-a0b6-422a-899c-e61f2e869dfe@gmail.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

4b5f82f6

Merge branch 'net-better-packing-of-global-vars' · c60c34a9

Jakub Kicinski authored Nov 16, 2021

Eric Dumazet says:

====================
net: better packing of global vars

First two patches avoid holes in data section,
and last patch makes sure some siphash keys are contained
in a single cache line.
====================

Link: https://lore.kernel.org/r/20211115172303.3732746-1-eric.dumazet@gmail.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

c60c34a9

net: align static siphash keys · 49ecc2e9

Eric Dumazet authored Nov 15, 2021

siphash keys use 16 bytes.

Define siphash_aligned_key_t macro so that we can make sure they
are not crossing a cache line boundary.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

49ecc2e9

net: use .data.once section in netdev_level_once() · 7071732c

Eric Dumazet authored Nov 15, 2021

Same rationale than prior patch : using the dedicated
section avoid holes and pack all these bool values.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

7071732c

once: use __section(".data.once") · c2c60ea3

Eric Dumazet authored Nov 15, 2021

.data.once contains nicely packed bool variables.
It is used already by DO_ONCE_LITE().

Using it also in DO_ONCE() removes holes in .data section.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

c2c60ea3

16 Nov, 2021 16 commits

Merge branch 'inuse-cleanups' · 62803fec

David S. Miller authored Nov 16, 2021

Eric Dumazet says:

====================
net: prot_inuse and sock_inuse cleanups

Small series cleaning and optimizing sock_prot_inuse_add()
and sock_inuse_add().
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

62803fec

net: drop nopreempt requirement on sock_prot_inuse_add() · b3cb764a

Eric Dumazet authored Nov 15, 2021

This is distracting really, let's make this simpler,
because many callers had to take care of this
by themselves, even if on x86 this adds more
code than really needed.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

b3cb764a

net: merge net->core.prot_inuse and net->core.sock_inuse · 4199bae1

Eric Dumazet authored Nov 15, 2021

net->core.sock_inuse is a per cpu variable (int),
while net->core.prot_inuse is another per cpu variable
of 64 integers.

per cpu allocator tend to place them in very different places.

Grouping them together makes sense, since it makes
updates potentially faster, if hitting the same
cache line.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

4199bae1

net: make sock_inuse_add() available · d477eb90

Eric Dumazet authored Nov 15, 2021

MPTCP hard codes it, let us instead provide this helper.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

d477eb90

net: inline sock_prot_inuse_add() · 2a12ae5d

Eric Dumazet authored Nov 15, 2021

sock_prot_inuse_add() is very small, we can inline it.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

2a12ae5d

Merge branch 'gro-out-of-core-files' · abc3342a

David S. Miller authored Nov 16, 2021

Eric Dumazet says:

====================
gro: get out of core files

Move GRO related content into net/core/gro.c
and include/net/gro.h.

This reduces GRO scope to where it is really needed,
and shrinks too big files (include/linux/netdevice.h
and net/core/dev.c)
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

abc3342a

net: gro: populate net/core/gro.c · 587652bb

Eric Dumazet authored Nov 15, 2021

Move gro code and data from net/core/dev.c to net/core/gro.c
to ease maintenance.

gro_normal_list() and gro_normal_one() are inlined
because they are called from both files.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

587652bb

net: gro: move skb_gro_receive into net/core/gro.c · e456a18a

Eric Dumazet authored Nov 15, 2021

net/core/gro.c will contain all core gro functions,
to shrink net/core/skbuff.c and net/core/dev.c
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

e456a18a

net: gro: move skb_gro_receive_list to udp_offload.c · 0b935d7f

Eric Dumazet authored Nov 15, 2021

This helper is used once, no need to keep it in fat net/core/skbuff.c
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

0b935d7f

net: move gro definitions to include/net/gro.h · 4721031c

Eric Dumazet authored Nov 15, 2021

include/linux/netdevice.h became too big, move gro stuff
into include/net/gro.h
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

4721031c

Merge branch 'tcp-optimizations' · 6fcc0620

David S. Miller authored Nov 16, 2021

Eric Dumazet says:

====================
tcp: optimizations for linux-5.17

Mostly small improvements in this series.

The notable change is in "defer skb freeing after
socket lock is released" in recvmsg() (and RX zerocopy)

The idea is to try to let skb freeing to BH handler,
whenever possible, or at least perform the freeing
outside of the socket lock section, for much improved
performance. This idea can probably be extended
to other protocols.

 Tests on a 100Gbit NIC
 Max throughput for one TCP_STREAM flow, over 10 runs.

 MTU : 1500  (1428 bytes of TCP payload per MSS)
 Before: 55 Gbit
 After:  66 Gbit

 MTU : 4096+ (4096 bytes of TCP payload, plus TCP/IPv6 headers)
 Before: 82 Gbit
 After:  95 Gbit
====================
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

6fcc0620

net: move early demux fields close to sk_refcnt · 43f51df4

Eric Dumazet authored Nov 15, 2021

sk_rx_dst/sk_rx_dst_ifindex/sk_rx_dst_cookie are read in early demux,
and currently spans two cache lines.

Moving them close to sk_refcnt makes more sense, as only one cache
line is needed.

New layout for this hot cache line is :

struct sock {
	struct sock_common         __sk_common;          /*     0  0x88 */
	/* --- cacheline 2 boundary (128 bytes) was 8 bytes ago --- */
	struct dst_entry *         sk_rx_dst;            /*  0x88   0x8 */
	int                        sk_rx_dst_ifindex;    /*  0x90   0x4 */
	u32                        sk_rx_dst_cookie;     /*  0x94   0x4 */
	socket_lock_t              sk_lock;              /*  0x98  0x20 */
	atomic_t                   sk_drops;             /*  0xb8   0x4 */
	int                        sk_rcvlowat;          /*  0xbc   0x4 */
	/* --- cacheline 3 boundary (192 bytes) --- */
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

43f51df4

tcp: do not call tcp_cleanup_rbuf() if we have a backlog · 29fbc26e

Eric Dumazet authored Nov 15, 2021

Under pressure, tcp recvmsg() has logic to process the socket backlog,
but calls tcp_cleanup_rbuf() right before.

Avoiding sending ACK right before processing new segments makes
a lot of sense, as this decrease the number of ACK packets,
with no impact on effective ACK clocking.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

29fbc26e

tcp: check local var (timeo) before socket fields in one test · 8bd172b7

Eric Dumazet authored Nov 15, 2021

Testing timeo before sk_err/sk_state/sk_shutdown makes more sense.

Modern applications use non-blocking IO, while a socket is terminated
only once during its life time.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

8bd172b7

tcp: defer skb freeing after socket lock is released · f35f8219

Eric Dumazet authored Nov 15, 2021

tcp recvmsg() (or rx zerocopy) spends a fair amount of time
freeing skbs after their payload has been consumed.

A typical ~64KB GRO packet has to release ~45 page
references, eventually going to page allocator
for each of them.

Currently, this freeing is performed while socket lock
is held, meaning that there is a high chance that
BH handler has to queue incoming packets to tcp socket backlog.

This can cause additional latencies, because the user
thread has to process the backlog at release_sock() time,
and while doing so, additional frames can be added
by BH handler.

This patch adds logic to defer these frees after socket
lock is released, or directly from BH handler if possible.

Being able to free these skbs from BH handler helps a lot,
because this avoids the usual alloc/free assymetry,
when BH handler and user thread do not run on same cpu or
NUMA node.

One cpu can now be fully utilized for the kernel->user copy,
and another cpu is handling BH processing and skb/page
allocs/frees (assuming RFS is not forcing use of a single CPU)

Tested:
 100Gbit NIC
 Max throughput for one TCP_STREAM flow, over 10 runs

MTU : 1500
Before: 55 Gbit
After:  66 Gbit

MTU : 4096+(headers)
Before: 82 Gbit
After:  95 Gbit
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

f35f8219

tcp: avoid indirect calls to sock_rfree · 3df684c1

Eric Dumazet authored Nov 15, 2021

TCP uses sk_eat_skb() when skbs can be removed from receive queue.
However, the call to skb_orphan() from __kfree_skb() incurs
an indirect call so sock_rfee(), which is more expensive than
a direct call, especially for CONFIG_RETPOLINE=y.

Add tcp_eat_recv_skb() function to make the call before
__kfree_skb().
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

3df684c1