Commits · 01488a0ccd9abe15565bed50a45afcddbb0fe199 · Kirill Smelkov / linux

13 Mar, 2021 8 commits

net: dsa: bcm_sf2: store PHY interface/mode in port structure · 01488a0c

Rafał Miłecki authored Mar 12, 2021

It's needed later for proper switch / crossbar setup.
Signed-off-by: Rafał Miłecki <rafal@milecki.pl>
Acked-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: ipv4: route.c: Fix indentation of multi line comment. · 6ad08600

Shubhankar Kuranagatti authored Mar 12, 2021

All comment lines inside the comment block have been aligned.
Every line of comment starts with a * (uniformity in code).
Signed-off-by: Shubhankar Kuranagatti <shubhankarvk@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: broadcom: bcm4908_enet: support TX interrupt · 12bb508b

Rafał Miłecki authored Mar 11, 2021

It appears that each DMA channel has its own interrupt and both rings
can be configured (the same way) to handle interrupts.

1. Make ring interrupts code generic (make it operate on given ring)
2. Move napi to ring (so each has its own)
3. Make IRQ handler generic (match ring against received IRQ number)
4. Add (optional) support for TX interrupt
Signed-off-by: Rafał Miłecki <rafal@milecki.pl>
Signed-off-by: David S. Miller <davem@davemloft.net>
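
The "match ring against received IRQ number" step can be pictured with a small userspace sketch. This is not the driver's code: the structure and function names below (ring_sketch, enet_sketch, ring_for_irq, handle_irq) are hypothetical, and the napi_scheduled flag merely stands in for scheduling the ring's own NAPI context.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the driver's structures; names are illustrative. */
struct ring_sketch {
	int irq;            /* IRQ number of this DMA channel, negative if none */
	int napi_scheduled; /* stands in for the ring's own NAPI context */
};

struct enet_sketch {
	struct ring_sketch rx_ring;
	struct ring_sketch tx_ring; /* TX interrupt is optional */
};

/* Return the ring whose IRQ number matches, or NULL if neither does. */
static struct ring_sketch *ring_for_irq(struct enet_sketch *enet, int irq)
{
	if (irq < 0)
		return NULL;
	if (irq == enet->rx_ring.irq)
		return &enet->rx_ring;
	if (irq == enet->tx_ring.irq)
		return &enet->tx_ring;
	return NULL;
}

/* Generic handler: look up the ring and mark its poll as scheduled.
 * Returns 1 if the interrupt belonged to one of the rings, 0 otherwise. */
static int handle_irq(struct enet_sketch *enet, int irq)
{
	struct ring_sketch *ring = ring_for_irq(enet, irq);

	if (!ring)
		return 0;
	ring->napi_scheduled = 1;
	return 1;
}
```

Because the handler only looks at the ring it was matched to, the same function can serve both the RX and the (optional) TX interrupt.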

dt-bindings: net: bcm4908-enet: add optional TX interrupt · ab4dda7a

Rafał Miłecki authored Mar 11, 2021

I discovered that hardware actually supports two interrupts, one per DMA
channel (RX and TX).
Signed-off-by: Rafał Miłecki <rafal@milecki.pl>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'macb-fixed-link-fixes' · 26d2e042

David S. Miller authored Mar 12, 2021

Robert Hancock says:

====================
macb SGMII fixed-link fixes

Some fixes to the macb driver for use in SGMII mode with a fixed-link (such as
for chip-to-chip connectivity).
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

net: macb: Disable PCS auto-negotiation for SGMII fixed-link mode · e276e5e4

Robert Hancock authored Mar 11, 2021

When using a fixed-link configuration in SGMII mode, it's not really
sensible to have auto-negotiation enabled since the link settings are
fixed by definition. In other configurations, such as an SGMII
connection to a PHY, it should generally be enabled.
Signed-off-by: Robert Hancock <robert.hancock@calian.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: macb: poll for fixed link state in SGMII mode · 8fab174b

Robert Hancock authored Mar 11, 2021

When using a fixed-link configuration with GEM in SGMII mode, such as
for a chip-to-chip interconnect, the link state was always showing as
established regardless of the actual connectivity state. We can monitor
the pcs_link_state bit in the Network Status register to determine
whether the PCS link state is actually up.
Signed-off-by: Robert Hancock <robert.hancock@calian.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
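
As a rough illustration of what "monitoring the pcs_link_state bit" amounts to, the sketch below tests a single bit in a register value. The bit position and the names NSR_PCS_LINK_STATE_BIT and pcs_link_is_up are assumptions made for this example; the real position must be taken from the GEM register map.

```c
#include <stdbool.h>
#include <stdint.h>

/* ASSUMPTION: bit position chosen for illustration only; the real position
 * of pcs_link_state comes from the hardware's Network Status register map. */
#define NSR_PCS_LINK_STATE_BIT 0u

/* Report whether the PCS link is up, given a Network Status register value. */
static bool pcs_link_is_up(uint32_t network_status)
{
	return ((network_status >> NSR_PCS_LINK_STATE_BIT) & 1u) != 0;
}
```

A poller would read the register periodically and report link up or down from this bit instead of assuming the fixed link is always established.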

Merge tag 'mlx5-updates-2021-03-12' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux · c232f81b

David S. Miller authored Mar 12, 2021

Saeed Mahameed says:

====================
mlx5-updates-2021-03-12

1) TC support for ICMP parameters
2) TC connection tracking with mirroring
3) A round of trivial fixups and cleanups
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

12 Mar, 2021 32 commits

net/mlx5e: Allow to match on ICMP parameters · a3222a2d

Maor Dickman authored Jan 24, 2021

Support matching on ICMPv4/6 type and code parameters using misc3
section of match parameters.
Signed-off-by: Maor Dickman <maord@nvidia.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>

net/mlx5: CT: Add support for mirroring · 69e2916e

Paul Blakey authored Sep 21, 2020

Add support for mirroring before the CT action by splitting the pre ct rule.
Mirror outputs are done first on the tc chain,prio table rule (the fwd
rule), which will then forward to a per port fwd table.
On this fwd table, we insert the original pre ct rule that forwards to
ct/ct nat table.
Signed-off-by: Paul Blakey <paulb@mellanox.com>
Signed-off-by: Maor Dickman <maord@nvidia.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>

net/mlx5: Display the command index in command mailbox dump · 287e0df0

Alaa Hleihel authored Dec 31, 2020

Multiple commands can be printed at the same time which can
lead to wrong order of their lines in dmesg output.
As a result, it's hard to match data dumps to the correct command
or which command was fully dumped at some point.

Fix this by displaying the corresponding command index, and also
indicate when a command was fully dumped.
Signed-off-by: Alaa Hleihel <alaa@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>

net/mlx5e: allocate 'indirection_rqt' buffer dynamically · 2119bda6

Arnd Bergmann authored Mar 08, 2021

Increasing the size of the indirection_rqt array from 128 to 256 bytes
pushed the stack usage of the mlx5e_hairpin_fill_rqt_rqns() function
over the warning limit when building with clang and CONFIG_KASAN:

drivers/net/ethernet/mellanox/mlx5/core/en_tc.c:970:1: error: stack frame size of 1180 bytes in function 'mlx5e_tc_add_nic_flow' [-Werror,-Wframe-larger-than=]

Using dynamic allocation here is safe because the caller does the
same, and it reduces the stack usage of the function to just a few
bytes.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>

net/mlx5e: Dump ICOSQ WQE descriptor on CQE with error events · e16cf9d7

Tariq Toukan authored Feb 17, 2021

Dump the ICOSQ's WQE descriptor when a completion with error is received.
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>

net/mlx5e: Use net_prefetchw instead of prefetchw in MPWQE TX datapath · 991b2654

Maxim Mikityanskiy authored Jan 29, 2021

Commit e20f0dbf ("net/mlx5e: RX, Add a prefetch command for small
L1_CACHE_BYTES") switched to using net_prefetchw at all places in mlx5e.
In the same time frame, commit 5af75c74 ("net/mlx5e: Enhanced TX
MPWQE for SKBs") added one more usage of prefetchw. When these two
changes were merged, this new occurrence of prefetchw wasn't replaced
with net_prefetchw.

This commit fixes this last occurrence of prefetchw in
mlx5e_tx_mpwqe_session_start, making the same change that was done in
mlx5e_xdp_mpwqe_session_start.
Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Reviewed-by: Saeed Mahameed <saeedm@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>

net/mlx5e: Remove redundant newline in NL_SET_ERR_MSG_MOD · bca08a91

Roi Dayan authored Mar 09, 2021

Fix the following coccicheck warnings:

drivers/net/ethernet/mellanox/mlx5/core/devlink.c:145:29-66: WARNING
avoid newline at end of message in NL_SET_ERR_MSG_MOD
drivers/net/ethernet/mellanox/mlx5/core/devlink.c:140:29-77: WARNING
avoid newline at end of message in NL_SET_ERR_MSG_MOD
Signed-off-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>

net/mlx5: Read congestion counters from all ports when lag is active · 093bd764

Mark Zhang authored Feb 02, 2021

Read congestion counters from all ports in any lag mode rather than
only in RoCE lag mode (e.g., VF lag).
Signed-off-by: Mark Zhang <markzhang@nvidia.com>
Reviewed-by: Saeed Mahameed <saeedm@nvidia.com>
Reviewed-by: Maor Gottlieb <maorg@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>

net/mlx5: remove unneeded semicolon · 79760922

Jiapeng Chong authored Feb 22, 2021

Fix the following coccicheck warnings:

./drivers/net/ethernet/mellanox/mlx5/core/sf/devlink.c:495:2-3: Unneeded
semicolon.
Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com>
Reviewed-by: Parav Pandit <parav@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>

net/mlx5: use kvfree() for memory allocated with kvzalloc() · ad2c99ca

Junlin Yang authored Mar 03, 2021

The buffer is allocated with kvzalloc(), so the corresponding release
function should be kvfree() rather than kfree().

Generated by: scripts/coccinelle/api/kfree_mismatch.cocci
Signed-off-by: Junlin Yang <yangjunlin@yulong.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>

net/mlx5: DR, Add missing vhca_id consume from STEv1 · cc82a2e6

Yevgeny Kliteynik authored Feb 06, 2021

The field source_eswitch_owner_vhca_id was not consumed
in the same way as in STEv0. Added the missing set.

Fixes: 10b69418 ("net/mlx5: DR, Add HW STEv1 match logic")
Signed-off-by: Alex Vesker <valex@mellanox.com>
Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Alex Vesker <valex@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>

net/mlx5: DR, Remove unneeded rx_decap_l3 function for STEv1 · 14124778

Yevgeny Kliteynik authored Feb 06, 2021

Remove the dr_ste_v1_set_rx_decap_l3 function that was
replaced by another function - fixing a rebase error.
Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Alex Vesker <valex@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>

net/mlx5: DR, Fixed typo in STE v0 · 0142f097

Yevgeny Kliteynik authored Feb 06, 2021

"reforamt" -> "reformat"
Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Alex Vesker <valex@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>

docs: networking: phy: Improve placement of parenthesis · bfdfe7fc

Jonathan Neuschäfer authored Mar 11, 2021

"either" is outside the parentheses, so the matching "or" should be too.
Signed-off-by: Jonathan Neuschäfer <j.neuschaefer@gmx.net>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'tcp-delayed-completions' · 5215206d

David S. Miller authored Mar 11, 2021

Eric Dumazet says:

====================
tcp: better deal with delayed TX completions

Jakub and Neil reported an increase of RTO timers whenever
TX completions are delayed a bit more (by increasing
NIC TX coalescing parameters)

While problems have been there forever, the second patch might
introduce some regressions, so I prefer not to backport
them to stable releases before things settle.

Many thanks to FB team for their help and tests.

Few packetdrill tests need to be changed to reflect
the improvements brought by this series.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

tcp: remove obsolete check in __tcp_retransmit_skb() · ac3959fd

Eric Dumazet authored Mar 11, 2021

TSQ provides a nice way to avoid bufferbloat on an individual socket,
including retransmit packets. We can get rid of the old
heuristic:

	/* Do not sent more than we queued. 1/4 is reserved for possible
	 * copying overhead: fragmentation, tunneling, mangling etc.
	 */
	if (refcount_read(&sk->sk_wmem_alloc) >
	    min_t(u32, sk->sk_wmem_queued + (sk->sk_wmem_queued >> 2),
		  sk->sk_sndbuf))
		return -EAGAIN;

This heuristic was giving false positives according to Jakub,
whenever TX completions are delayed above RTT. (Ack packets
are processed by TCP stack before clones are orphaned/freed)
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Jakub Kicinski <kuba@kernel.org>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
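
To make the removed heuristic concrete, here is a userspace restatement of the quoted check with plain integers in place of the socket fields. The function and parameter names are illustrative and not part of the kernel; only the arithmetic mirrors the snippet quoted in the commit message.

```c
#include <stdbool.h>
#include <stdint.h>

static uint32_t min_u32(uint32_t a, uint32_t b)
{
	return a < b ? a : b;
}

/* Returns true when the old check would have refused the retransmit
 * (the kernel code returned -EAGAIN in that case).
 *   wmem_alloc  - bytes still charged to the socket by in-flight clones
 *   wmem_queued - bytes queued on the socket
 *   sndbuf      - send buffer limit
 */
static bool old_check_blocks_retransmit(uint32_t wmem_alloc,
					uint32_t wmem_queued,
					uint32_t sndbuf)
{
	/* queued + 25% headroom, capped by the send buffer size */
	uint32_t limit = min_u32(wmem_queued + (wmem_queued >> 2), sndbuf);

	return wmem_alloc > limit;
}
```

For example, with 4000 bytes queued and a large send buffer the limit is 5000 bytes. If slow TX completions leave 6000 bytes still charged to the socket, the old check refuses the retransmit even though the data has already been acknowledged.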

tcp: consider using standard rtx logic in tcp_rcv_fastopen_synack() · a7abf3cd

Eric Dumazet authored Mar 11, 2021

Jakub reported that data included in a Fastopen SYN that had to be
retransmitted would have to wait for an RTO if TX completions are slow,
even with the prior fix.

This is because tcp_rcv_fastopen_synack() does not use standard
rtx logic, meaning TSQ handler exits early in tcp_tsq_write()
because tp->lost_out == tp->retrans_out

Let's make tcp_rcv_fastopen_synack() use standard rtx logic,
by using tcp_mark_skb_lost() on the skb that needs to be
sent again.

Note this raised a warning in tcp_fastretrans_alert() during my tests,
since we consider that data not being acknowledged
by the receiver does not mean the packet was lost on the network.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Jakub Kicinski <kuba@kernel.org>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

tcp: plug skb_still_in_host_queue() to TSQ · f4dae54e

Eric Dumazet authored Mar 11, 2021

Jakub and Neil reported an increase of RTO timers whenever
TX completions are delayed a bit more (by increasing
NIC TX coalescing parameters)

The main issue is that the TCP stack has logic preventing a packet
from being retransmitted if the prior clone has not yet been
orphaned or freed.

This logic came with commit 1f3279ae ("tcp: avoid
retransmits of TCP packets hanging in host queues")

Thankfully, in the case skb_still_in_host_queue() detects
the initial clone is still in flight, it can use TSQ logic
that will eventually retry later, at the moment the clone
is freed or orphaned.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Neil Spring <ntspring@fb.com>
Reported-by: Jakub Kicinski <kuba@kernel.org>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

isdn: remove extra spaces in the header file · 8176f8c0

Tong Zhang authored Mar 10, 2021

fix some coding style issues in the isdn header
Signed-off-by: Tong Zhang <ztong0001@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

tipc: clean up warnings detected by sparse · 97bc84bb

Hoang Huu Le authored Mar 11, 2021

This patch fixes the following warning from sparse:

net/tipc/monitor.c:263:35: warning: incorrect type in assignment (different base types)
net/tipc/monitor.c:263:35:    expected unsigned int
net/tipc/monitor.c:263:35:    got restricted __be32 [usertype]
[...]
net/tipc/node.c:374:13: warning: context imbalance in 'tipc_node_read_lock' - wrong count at exit
net/tipc/node.c:379:13: warning: context imbalance in 'tipc_node_read_unlock' - unexpected unlock
net/tipc/node.c:384:13: warning: context imbalance in 'tipc_node_write_lock' - wrong count at exit
net/tipc/node.c:389:13: warning: context imbalance in 'tipc_node_write_unlock_fast' - unexpected unlock
net/tipc/node.c:404:17: warning: context imbalance in 'tipc_node_write_unlock' - unexpected unlock
[...]
net/tipc/crypto.c:1201:9: warning: incorrect type in initializer (different address spaces)
net/tipc/crypto.c:1201:9:    expected struct tipc_aead [noderef] __rcu *__tmp
net/tipc/crypto.c:1201:9:    got struct tipc_aead *
[...]
Acked-by: Jon Maloy <jmaloy@redhat.com>
Signed-off-by: Hoang Huu Le <hoang.h.le@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>

tipc: convert dest node's address to network order · 1980d375

Hoang Le authored Mar 11, 2021

(struct tipc_link_info)->dest is in network order (__be32), so we must
convert the value to network order before assigning. The problem detected
by sparse:

net/tipc/netlink_compat.c:699:24: warning: incorrect type in assignment (different base types)
net/tipc/netlink_compat.c:699:24:    expected restricted __be32 [usertype] dest
net/tipc/netlink_compat.c:699:24:    got int
Acked-by: Jon Maloy <jmaloy@redhat.com>
Signed-off-by: Hoang Le <hoang.h.le@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
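
The kind of conversion described here can be sketched in userspace with htonl(). The struct below is a hypothetical stand-in for the real one, keeping only a dest field that is defined to hold a network-order value.

```c
#include <arpa/inet.h>
#include <stdint.h>

/* Hypothetical stand-in: only the field relevant to the fix is kept. */
struct link_info_sketch {
	uint32_t dest; /* must hold the address in network byte order */
};

/* Store a host-order node address into the network-order field. */
static void set_dest(struct link_info_sketch *info, uint32_t node_addr)
{
	info->dest = htonl(node_addr);
}
```

After the conversion the most significant byte of the address is first in memory regardless of host endianness, which is what a reader of the structure expects.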

Merge branch 'mlxsw-Implement-sampling-using-mirroring' · 1520929e

David S. Miller authored Mar 11, 2021

Ido Schimmel says:

====================
mlxsw: Implement sampling using mirroring

So far, sampling was implemented using a dedicated sampling mechanism
that is available on all Spectrum ASICs. Spectrum-2 and later ASICs
support sampling by mirroring packets to the CPU port with probability.
This method has a couple of advantages compared to the legacy method:

* Extra metadata per-packet: Egress port, egress traffic class, traffic
  class occupancy and end-to-end latency
* Ability to sample packets on egress / per-flow as opposed to only
  ingress

This series should not result in any user-visible changes and its aim is
to convert Spectrum-2 and later ASICs to perform sampling by mirroring
to the CPU port with probability. Future submissions will expose the
additional metadata and enable sampling using more triggers (e.g.,
egress).

Series overview:

Patches #1-#3 extend the SPAN (mirroring) module to accept new
parameters required for sampling. See individual commit messages for
detailed explanation.

Patches #4-#5 split sampling support between Spectrum-1 and later ASICs
while still using the legacy method for all ASIC generations.

Patch #6 converts Spectrum-2 and later ASICs to perform sampling by
mirroring to the CPU port with probability.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

mlxsw: spectrum_matchall: Implement sampling using mirroring · cf31190a

Ido Schimmel authored Mar 11, 2021

Spectrum-2 and later ASICs support sampling of packets by mirroring to
the CPU with probability. There are several advantages compared to the
legacy dedicated sampling mechanism:

* Extra metadata per-packet: Egress port, egress traffic class, traffic
  class occupancy and end-to-end latency
* Ability to sample packets on egress / per-flow

Convert Spectrum-2 and later ASICs to perform sampling by mirroring to
the CPU with probability.

Subsequent patches will add support for egress / per-flow sampling and
expose the extra metadata.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mlxsw: spectrum_trap: Split sampling traps between ASICs · 34a27721

Ido Schimmel authored Mar 11, 2021

Sampling of ingress packets is supported using a dedicated sampling
mechanism on all Spectrum ASICs. However, Spectrum-2 and later ASICs
support more sophisticated sampling by mirroring packets to the CPU.

As a preparation for more advanced sampling configurations, split the trap
configuration used for sampled packets between Spectrum-1 and later ASICs.

This is needed since packets that are mirrored to the CPU are trapped
via a different trap identifier compared to packets that are sampled
using the dedicated sampling mechanism.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mlxsw: spectrum_matchall: Split sampling support between ASICs · 20afb9bc

Ido Schimmel authored Mar 11, 2021

Sampling of ingress packets is supported using a dedicated sampling
mechanism on all Spectrum ASICs. However, Spectrum-2 and later ASICs
support more sophisticated sampling by mirroring packets to the CPU.

As a preparation for more advanced sampling configurations, split the
sampling operations between Spectrum-1 and later ASICs.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mlxsw: spectrum_span: Add SPAN probability rate support · 2dcbd920

Ido Schimmel authored Mar 11, 2021

Currently, every packet that matches a mirroring trigger (e.g., received
packets, buffer dropped packets) is mirrored. Spectrum-2 and later ASICs
support mirroring with probability, where every 1 in N matched packets
is mirrored.

Extend the API that creates the binding between the trigger and the SPAN
agent with a probability rate parameter, which is an attribute of the
trigger. Set it to '1' to maintain existing behavior.

Subsequent patches will use it to perform more sophisticated sampling,
by mirroring packets to the CPU with probability.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
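
A software analogy for "1 in N" mirroring, and for why a rate of 1 keeps the existing behavior, might look like the sketch below. How the ASIC actually picks packets is not described in the commit message; the use of rand() and the name should_mirror are assumptions made purely for illustration.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Decide whether one matched packet is mirrored, given a probability rate N
 * meaning "on average 1 in N matched packets". A rate of 1 (or 0, treated
 * the same here) mirrors every packet, which is the pre-existing behavior.
 * The modulo introduces a slight bias that is ignored in this sketch. */
static bool should_mirror(uint32_t rate)
{
	if (rate <= 1)
		return true;
	return (uint32_t)rand() % rate == 0;
}
```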

mlxsw: reg: Extend mirroring registers with probability rate field · fa3faeb7

Ido Schimmel authored Mar 11, 2021

The MPAR and MPAGR registers are used to configure the binding between
the mirroring trigger (e.g., received packet) and the SPAN agent. Add
probability rate field, which will allow us to support sampling by
mirroring to the CPU.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mlxsw: spectrum_span: Add SPAN session identifier support · 5c7659eb

Ido Schimmel authored Mar 11, 2021

When packets are mirrored to the CPU, the trap identifier with which the
packets are trapped is determined according to the session identifier of
the SPAN agent performing the mirroring. Packets that are trapped for
the same logical reason (e.g., buffer drops) should use the same session
identifier.

Currently, a single session is implicitly supported (identifier 0) and
is used for packets that are mirrored to the CPU due to buffer drops
(e.g., early drop).

Subsequent patches are going to mirror packets to the CPU due to
sampling, which will require a different session identifier.

Prepare for that by making the session identifier an attribute of the
SPAN agent.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge tag 'mlx5-updates-2021-03-11' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux · 1bc61c9d

David S. Miller authored Mar 11, 2021

Saeed Mahameed says:

====================
This series provides some cleanups to mlx5 driver
For more information please see tag log below.

Please pull and let me know if there is any problem.

mlx5-updates-2021-03-11

Cleanups for mlx5 driver

1) Fix build warnings from Arnd and Vlad
2) Leon improves locking for driver load/unload flows
3) From Roi, Lockdep false dependency warning
4) Other trivial cleanups
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'nexthop-Resilient-next-hop-groups' · 2a0186a3

David S. Miller authored Mar 11, 2021

Petr Machata says:

====================
nexthop: Resilient next-hop groups

At this moment, there is only one type of next-hop group: an mpath group.
Mpath groups implement the hash-threshold algorithm, described in RFC
2992[1].

To select a next hop, hash-threshold algorithm first assigns a range of
hashes to each next hop in the group, and then selects the next hop by
comparing the SKB hash with the individual ranges. When a next hop is
removed from the group, the ranges are recomputed, which leads to
reassignment of parts of hash space from one next hop to another. RFC 2992
illustrates it thus:

+-------+-------+-------+-------+-------+
|   1   |   2   |   3   |   4   |   5   |
+-------+-+-----+---+---+-----+-+-------+
|    1    |    2    |    4    |    5    |
+---------+---------+---------+---------+

Before and after deletion of next hop 3
under the hash-threshold algorithm.

Note how next hop 2 gave up part of the hash space in favor of next hop 1,
and 4 in favor of 5. While there will usually be some overlap between the
previous and the new distribution, some traffic flows change the next hop
that they resolve to.
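
The range recomputation can be sketched in a few lines of C. This is a simplified model with equal weights and a small hash space, not the kernel's implementation; the function names and the inclusive upper-bound representation are assumptions of the sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fill equal-width regions over the hash space [0, space).
 * upper_bound[i] is the highest hash (inclusive) that maps to slot i. */
static void equal_regions(uint32_t *upper_bound, size_t n, uint32_t space)
{
	assert(n > 0);
	for (size_t i = 0; i < n; i++)
		upper_bound[i] = (uint32_t)(((uint64_t)space * (i + 1)) / n) - 1;
}

/* Select a slot by comparing the hash with each region's upper bound. */
static size_t hash_threshold_select(const uint32_t *upper_bound, size_t n,
				    uint32_t hash)
{
	assert(n > 0);
	for (size_t i = 0; i < n; i++)
		if (hash <= upper_bound[i])
			return i;
	return n - 1; /* hash beyond the last bound: use the last slot */
}
```

With a hash space of 100 and five next hops the bounds are 19, 39, 59, 79, 99; with four they become 24, 49, 74, 99. A flow hashing to 22 lands in the second slot (next hop 2) before the deletion and in the first slot (next hop 1) after it, which is the kind of reassignment the figure shows.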

If a multipath group is used for load-balancing between multiple servers,
this hash space reassignment causes an issue that packets from a single
flow suddenly end up arriving at a server that does not expect them, which
may lead to TCP reset.

If a multipath group is used for load-balancing among available paths to
the same server, the issue is that different latencies and reordering along
the way cause the packets to arrive in the wrong order.

Resilient hashing is a technique to address the above problem. Resilient
next-hop group has another layer of indirection between the group itself
and its constituent next hops: a hash table. The selection algorithm uses a
straightforward modulo operation on the SKB hash to choose a hash table
bucket, then reads the next hop that this bucket contains, and forwards
traffic there.

This indirection brings an important feature. In the hash-threshold
algorithm, the range of hashes associated with a next hop must be
continuous. With a hash table, mapping between the hash table buckets and
the individual next hops is arbitrary. Therefore when a next hop is deleted
the buckets that held it are simply reassigned to other next hops:

+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|1|1|1|1|2|2|2|2|3|3|3|3|4|4|4|4|5|5|5|5|
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
                 v v v v
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|1|1|1|1|2|2|2|2|1|2|4|5|4|4|4|4|5|5|5|5|
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Before and after deletion of next hop 3
under the resilient hashing algorithm.
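
A minimal sketch of the bucket table, assuming next hops are identified by small integers. The round-robin reassignment over the surviving next hops is an assumption chosen to reproduce the figure; the text above only says the buckets are reassigned to other next hops. Function names are illustrative.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Selection: a modulo on the hash picks a bucket, and the bucket names
 * the next hop. */
static int select_next_hop(const int *buckets, size_t num_buckets,
			   uint32_t hash)
{
	assert(num_buckets > 0);
	return buckets[hash % num_buckets];
}

/* Deletion: only buckets that held the removed next hop are rewritten,
 * here round-robin over the survivors (an assumption of this sketch).
 * Returns the number of buckets that changed. */
static size_t remove_next_hop(int *buckets, size_t num_buckets, int removed,
			      const int *survivors, size_t num_survivors)
{
	size_t moved = 0;

	assert(num_survivors > 0);
	for (size_t i = 0; i < num_buckets; i++) {
		if (buckets[i] == removed) {
			buckets[i] = survivors[moved % num_survivors];
			moved++;
		}
	}
	return moved;
}
```

Applied to the 20-bucket table in the figure, removing next hop 3 changes exactly four buckets (indices 8 to 11), and every flow that hashed to any other bucket keeps its next hop.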

When weights of next hops in a group are altered, it may be possible to
choose a subset of buckets that are currently not used for forwarding
traffic, and use those to satisfy the new next-hop distribution demands,
keeping the "busy" buckets intact. This way, established flows are ideally
kept being forwarded to the same endpoints through the same paths as before
the next-hop group change.

This patch set adds the implementation of resilient next-hop groups.

In a nutshell, the algorithm works as follows. Each next hop has a number
of buckets that it wants to have, according to its weight and the number of
buckets in the hash table. In case of an event that might cause bucket
allocation change, the numbers for individual next hops are updated,
similarly to how ranges are updated for mpath group next hops. Following
that, a new "upkeep" algorithm runs, and for idle buckets that belong to a
next hop that is currently occupying more buckets than it wants (it is
"overweight"), it migrates the buckets to one of the next hops that has
fewer buckets than it wants (it is "underweight"). If, after this, there
are still underweight next hops, another upkeep run is scheduled to a
future time.

Chances are there are not enough "idle" buckets to satisfy the new demands.
The algorithm has knobs to select both what it means for a bucket to be
idle, and for whether and when to forcefully migrate buckets if there keeps
being an insufficient number of idle ones.

To illustrate the usage, consider the following commands:

# ip nexthop add id 1 via 192.0.2.2 dev dummy1
# ip nexthop add id 2 via 192.0.2.3 dev dummy1
# ip nexthop add id 10 group 1/2 type resilient \
      buckets 8 idle_timer 60 unbalanced_timer 300

The last command creates a resilient next-hop group. It will have 8
buckets, each bucket will be considered idle when no traffic hits it for at
least 60 seconds, and if the table remains out of balance for 300 seconds,
it will be forcefully brought into balance.

If not present in netlink message, the idle timer defaults to 120 seconds,
and there is no unbalanced timer, meaning the group may remain unbalanced
indefinitely. The value of 120 is the default in Cumulus implementation of
resilient next-hop groups. To a degree the default is arbitrary, the only
value that certainly does not make sense is 0. Therefore going with an
existing deployed implementation is reasonable.

Unbalanced time, i.e. how long since the last time that all nexthops had as
many buckets as they should according to their weights, is reported when
the group is dumped:

# ip nexthop show id 10
id 10 group 1/2 type resilient buckets 8 idle_timer 60 unbalanced_timer 300 unbalanced_time 0

When replacing next hops or changing weights, if one does not specify some
parameters, their value is left as it was:

# ip nexthop replace id 10 group 1,3/2 type resilient
# ip nexthop show id 10
id 10 group 1,3/2 type resilient buckets 8 idle_timer 60 unbalanced_timer 300 unbalanced_time 0

It is also possible to do a dump of individual buckets (and now you know
why there were only 8 of them in the example above):

# ip nexthop bucket show id 10
id 10 index 0 idle_time 5.59 nhid 1
id 10 index 1 idle_time 5.59 nhid 1
id 10 index 2 idle_time 8.74 nhid 2
id 10 index 3 idle_time 8.74 nhid 2
id 10 index 4 idle_time 8.74 nhid 1
id 10 index 5 idle_time 8.74 nhid 1
id 10 index 6 idle_time 8.74 nhid 1
id 10 index 7 idle_time 8.74 nhid 1

Note the two buckets that have a shorter idle time. Those are the ones that
were migrated after the nexthop replace command to satisfy the new demand
that nexthop 1 be given 6 buckets instead of 4.

The patchset proceeds as follows:

- Patches #1 and #2 are small refactoring patches.

- Patch #3 adds a new flag to struct nh_group, is_multipath. This flag is
meant to be set for all nexthop groups that in general have several
nexthops from which they choose, and avoids a more expensive dispatch
based on reading several flags, one for each nexthop group type.

- Patch #4 contains defines of new UAPI attributes and the new next-hop
group type. At this point, the nexthop code is made to bounce the new
type. As the resilient hashing code is gradually added in the following
patch sets, it will remain dead. The last patch will make it accessible.

This patch also adds a suite of new messages related to next hop buckets.
This approach was taken instead of overloading the information on the
existing RTM_{NEW,DEL,GET}NEXTHOP messages for the following reasons.

First, a next-hop group can contain a large number of next-hop buckets
(4k is not unheard of). This imposes limits on the amount of information
that can be encoded for each next-hop bucket given a netlink message is
limited to 64k bytes.

Second, while RTM_NEWNEXTHOPBUCKET is only used for notifications at this
point, in the future it can be extended to provide user space with
control over next-hop buckets configuration.

- Patch #5 contains the meat of the resilient next-hop group support.

- Patches #6 and #7 implement support for notifications towards the
drivers.

- Patch #8 adds an interface for the drivers to report resilient hash
table bucket activity. Drivers will be able to report through this
interface whether traffic is hitting a given bucket.

- Patch #9 adds an interface for the drivers to report whether a given
hash table bucket is offloaded or trapping traffic.

- In patches #10, #11, #12 and #13, UAPI is implemented. This includes all
the code necessary for creation of resilient groups, bucket dumping and
getting, and bucket migration notifications.

- In patch #14 the next-hop groups are finally made available.

The overall plan is to contribute approximately the following patchsets:

1) Nexthop policy refactoring (already pushed)
2) Preparations for resilient next-hop groups (already pushed)
3) Implementation of resilient next-hop groups (this patchset)
4) Netdevsim offload plus a suite of selftests
5) Preparations for mlxsw offload of resilient next-hop groups
6) mlxsw offload including selftests

Interested parties can look at the current state of the code at [2] and
[3].

[1] https://tools.ietf.org/html/rfc2992
[2] https://github.com/idosch/linux/commits/submit/res_integ_v1
[3] https://github.com/idosch/iproute2/commits/submit/res_v1
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

nexthop: Enable resilient next-hop groups · 15e1dd57

Petr Machata authored Mar 11, 2021

Now that all the code is in place, stop rejecting requests to create
resilient next-hop groups.
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

nexthop: Notify userspace about bucket migrations · 0b4818aa

Petr Machata authored Mar 11, 2021

Nexthop replacements et al. are notified through netlink, but if a delayed
work migrates buckets in the background, userspace will stay oblivious.
Notify these as RTM_NEWNEXTHOPBUCKET events.
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
