Commits · 28b24f90020fed8e8e3e8e20575f08c1cd06e54f · Kirill Smelkov / linux

01 Oct, 2023 23 commits

net: implement lockless SO_MAX_PACING_RATE · 28b24f90

Eric Dumazet authored Sep 21, 2023

SO_MAX_PACING_RATE setsockopt() does not need to hold
the socket lock, because sk->sk_pacing_rate readers
can run fine if the value is changed by other threads,
after adding READ_ONCE() accessors.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

28b24f90

net: lockless implementation of SO_BUSY_POLL, SO_PREFER_BUSY_POLL, SO_BUSY_POLL_BUDGET · 2a4319cf

Eric Dumazet authored Sep 21, 2023

Setting sk->sk_ll_usec, sk_prefer_busy_poll and sk_busy_poll_budget
do not require the socket lock, readers are lockless anyway.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

2a4319cf

net: lockless SO_{TYPE|PROTOCOL|DOMAIN|ERROR } setsockopt() · b1202515

Eric Dumazet authored Sep 21, 2023

This options can not be set and return -ENOPROTOOPT,
no need to acqure socket lock.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

b1202515

net: lockless SO_PASSCRED, SO_PASSPIDFD and SO_PASSSEC · 8ebfb6db

Eric Dumazet authored Sep 21, 2023

sock->flags are atomic, no need to hold the socket lock
in sk_setsockopt() for SO_PASSCRED, SO_PASSPIDFD and SO_PASSSEC.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

8ebfb6db

net: implement lockless SO_PRIORITY · 10bbf165

Eric Dumazet authored Sep 21, 2023

This is a followup of 8bf43be7 ("net: annotate data-races
around sk->sk_priority").

sk->sk_priority can be read and written without holding the socket lock.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Wenjia Zhang <wenjia@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

10bbf165

openvswitch: reduce stack usage in do_execute_actions · 06bc3668

Ilya Maximets authored Sep 21, 2023

do_execute_actions() function can be called recursively multiple
times while executing actions that require pipeline forking or
recirculations.  It may also be re-entered multiple times if the packet
leaves openvswitch module and re-enters it through a different port.

Currently, there is a 256-byte array allocated on stack in this
function that is supposed to hold NSH header.  Compilers tend to
pre-allocate that space right at the beginning of the function:

     a88:       48 81 ec b0 01 00 00    sub    $0x1b0,%rsp

NSH is not a very common protocol, but the space is allocated on every
recursive call or re-entry multiplying the wasted stack space.

Move the stack allocation to push_nsh() function that is only used
if NSH actions are actually present.  push_nsh() is also a simple
function without a possibility for re-entry, so the stack is returned
right away.

With this change the preallocated space is reduced by 256 B per call:

     b18:       48 81 ec b0 00 00 00    sub    $0xb0,%rsp
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Eelco Chaudron echaudro@redhat.com
Reviewed-by: Aaron Conole <aconole@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

06bc3668

Merge branch 'dev-stats-virtio-l2tp_eth' · c1157c11

David S. Miller authored Oct 01, 2023

Eric Dumazet says:

====================
net: use DEV_STATS_xxx() helpers in virtio_net and l2tp_eth

Inspired by another (minor) KCSAN syzbot report.
Both virtio_net and l2tp_eth can use DEV_STATS_xxx() helpers.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

c1157c11

net: l2tp_eth: use generic dev->stats fields · a56d9390

Eric Dumazet authored Sep 21, 2023

Core networking has opt-in atomic variant of dev->stats,
simply use DEV_STATS_INC(), DEV_STATS_ADD() and DEV_STATS_READ().

v2: removed @priv local var in l2tp_eth_dev_recv() (Simon)
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

a56d9390

virtio_net: avoid data-races on dev->stats fields · d12a26b7

Eric Dumazet authored Sep 21, 2023

Use DEV_STATS_INC() and DEV_STATS_READ() which provide
atomicity on paths that can be used concurrently.
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

d12a26b7

net: add DEV_STATS_READ() helper · 0b068c71

Eric Dumazet authored Sep 21, 2023

Companion of DEV_STATS_INC() & DEV_STATS_ADD().

This is going to be used in the series.

Use it in macsec_get_stats64().
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

0b068c71

octeontx2-pf: Tc flower offload support for MPLS · a63df366

Hariprasad Kelam authored Sep 21, 2023

This patch extends flower offload support for MPLS protocol.
Due to hardware limitation, currently driver supports lse
depth up to 4.
Signed-off-by: Hariprasad Kelam <hkelam@marvell.com>
Signed-off-by: Sunil Goutham <sgoutham@marvell.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

a63df366

net: ethernet: xilinx: Drop kernel doc comment about return value · f77e9f13

Uwe Kleine-König authored Sep 21, 2023

During review of the patch that became 2e0ec0af ("net: ethernet:
xilinx: Convert to platform remove callback returning void") in
net-next, Radhey Shyam Pandey pointed out that the change makes the
documentation about the return value obsolete. The patch was applied
without addressing this feedback, so here comes a fix in a separate
patch.

Fixes: 2e0ec0af ("net: ethernet: xilinx: Convert to platform remove callback returning void")
Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
Reviewed-by: Radhey Shyam Pandey <radhey.shyam.pandey@amd.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

f77e9f13

net: atl1c: switch to napi_consume_skb() · d87c59f2

Sieng-Piaw Liew authored Sep 21, 2023

Switch to napi_consume_skb() to take advantage of bulk free, and skb
reuse through skb cache in conjunction with napi_build_skb().

When parameter 'budget' = 0, indicating non-NAPI context,
dev_consume_skb_any() is called internally.
Signed-off-by: Sieng-Piaw Liew <liew.s.piaw@gmail.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

d87c59f2

Merge branch 'sch_fq-improvements' · b49a9485

David S. Miller authored Oct 01, 2023

Eric Dumazet says:

====================
net_sched: sch_fq: round of improvements

For FQ tenth anniversary, it was time for making it faster.

The FQ part (as in Fair Queue) is rather expensive, because
we have to classify packets and store them in a per-flow structure,
and add this per-flow structure in a hash table. Then the RR lists
also add cache line misses.

Most fq qdisc are almost idle. Trying to share NIC bandwidth has
no benefits, thus the qdisc could behave like a FIFO.

This series brings a 5 % throughput increase in intensive
tcp_rr workload, and 13 % increase for (unpaced) UDP packets.

v2: removed an extra label (build bot).
    Fix an accidental increase of stat_internal_packets counter
    in fast path.
    Added "constify qdisc_priv()" patch to allow fq_fastpath_check()
    first parameter to be const.
    typo on 'eligible' (Willem)
====================
Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

b49a9485

net_sched: sch_fq: always garbage collect · 8f6c4ff9

Eric Dumazet authored Sep 20, 2023

FQ performs garbage collection at enqueue time, and only
if number of flows is above a given threshold, which
is hit after the qdisc has been used a bit.

Since an RB-tree traversal is needed to locate a flow,
it makes sense to perform gc all the time, to keep
rb-trees smaller.

This reduces by 50 % average storage costs in FQ,
and avoids 1 cache line miss at enqueue time when
fast path added in prior patch can not be used.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

8f6c4ff9

net_sched: sch_fq: add fast path for mostly idle qdisc · 076433bd

Eric Dumazet authored Sep 20, 2023

TCQ_F_CAN_BYPASS can be used by few qdiscs.

Idea is that if we queue a packet to an empty qdisc,
following dequeue() would pick it immediately.

FQ can not use the generic TCQ_F_CAN_BYPASS code,
because some additional checks need to be performed.

This patch adds a similar fast path to FQ.

Most of the time, qdisc is not throttled,
and many packets can avoid bringing/touching
at least four cache lines, and consuming 128bytes
of memory to store the state of a flow.

After this patch, netperf can send UDP packets about 13 % faster,
and pktgen goes 30 % faster (when FQ is in the way), on a fast NIC.

TCP traffic is also improved, thanks to a reduction of cache line misses.
I have measured a 5 % increase of throughput on a tcp_rr intensive workload.

tc -s -d qd sh dev eth1
...
qdisc fq 8004: parent 1:2 limit 10000p flow_limit 100p buckets 1024
   orphan_mask 1023 quantum 3028b initial_quantum 15140b low_rate_threshold 550Kbit
   refill_delay 40ms timer_slack 10us horizon 10s horizon_drop
 Sent 5646784384 bytes 1985161 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
  flows 122 (inactive 122 throttled 0)
  gc 0 highprio 0 fastpath 659990 throttled 27762 latency 8.57us
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

076433bd

net_sched: sch_fq: change how @inactive is tracked · ee9af4e1

Eric Dumazet authored Sep 20, 2023

Currently, when one fq qdisc has no more packets to send, it can still
have some flows stored in its RR lists (q->new_flows & q->old_flows)

This was a design choice, but what is a bit disturbing is that
the inactive_flows counter does not include the count of empty flows
in RR lists.

As next patch needs to know better if there are active flows,
this change makes inactive_flows exact.

Before the patch, following command on an empty qdisc could have returned:

lpaa17:~# tc -s -d qd sh dev eth1 | grep inactive
  flows 1322 (inactive 1316 throttled 0)
  flows 1330 (inactive 1325 throttled 0)
  flows 1193 (inactive 1190 throttled 0)
  flows 1208 (inactive 1202 throttled 0)

After the patch, we now have:

lpaa17:~# tc -s -d qd sh dev eth1 | grep inactive
  flows 1322 (inactive 1322 throttled 0)
  flows 1330 (inactive 1330 throttled 0)
  flows 1193 (inactive 1193 throttled 0)
  flows 1208 (inactive 1208 throttled 0)
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ee9af4e1

net_sched: sch_fq: struct sched_data reorg · 54ff8ad6

Eric Dumazet authored Sep 20, 2023

q->flows can be often modified, and q->timer_slack is read mostly.

Exchange the two fields, so that cache line countaining
quantum, initial_quantum, and other critical parameters
stay clean (read-mostly).

Move q->watchdog next to q->stat_throttled

Add comments explaining how the structure is split in
three different parts.

pahole output before the patch:

struct fq_sched_data {
	struct fq_flow_head        new_flows;            /*     0  0x10 */
	struct fq_flow_head        old_flows;            /*  0x10  0x10 */
	struct rb_root             delayed;              /*  0x20   0x8 */
	u64                        time_next_delayed_flow; /*  0x28   0x8 */
	u64                        ktime_cache;          /*  0x30   0x8 */
	unsigned long              unthrottle_latency_ns; /*  0x38   0x8 */
	/* --- cacheline 1 boundary (64 bytes) --- */
	struct fq_flow             internal __attribute__((__aligned__(64))); /*  0x40  0x80 */

	/* XXX last struct has 16 bytes of padding */

	/* --- cacheline 3 boundary (192 bytes) --- */
	u32                        quantum;              /*  0xc0   0x4 */
	u32                        initial_quantum;      /*  0xc4   0x4 */
	u32                        flow_refill_delay;    /*  0xc8   0x4 */
	u32                        flow_plimit;          /*  0xcc   0x4 */
	unsigned long              flow_max_rate;        /*  0xd0   0x8 */
	u64                        ce_threshold;         /*  0xd8   0x8 */
	u64                        horizon;              /*  0xe0   0x8 */
	u32                        orphan_mask;          /*  0xe8   0x4 */
	u32                        low_rate_threshold;   /*  0xec   0x4 */
	struct rb_root *           fq_root;              /*  0xf0   0x8 */
	u8                         rate_enable;          /*  0xf8   0x1 */
	u8                         fq_trees_log;         /*  0xf9   0x1 */
	u8                         horizon_drop;         /*  0xfa   0x1 */

	/* XXX 1 byte hole, try to pack */

<bad>	u32                        flows;                /*  0xfc   0x4 */
	/* --- cacheline 4 boundary (256 bytes) --- */
	u32                        inactive_flows;       /* 0x100   0x4 */
	u32                        throttled_flows;      /* 0x104   0x4 */
	u64                        stat_gc_flows;        /* 0x108   0x8 */
	u64                        stat_internal_packets; /* 0x110   0x8 */
	u64                        stat_throttled;       /* 0x118   0x8 */
	u64                        stat_ce_mark;         /* 0x120   0x8 */
	u64                        stat_horizon_drops;   /* 0x128   0x8 */
	u64                        stat_horizon_caps;    /* 0x130   0x8 */
	u64                        stat_flows_plimit;    /* 0x138   0x8 */
	/* --- cacheline 5 boundary (320 bytes) --- */
	u64                        stat_pkts_too_long;   /* 0x140   0x8 */
	u64                        stat_allocation_errors; /* 0x148   0x8 */
<bad>	u32                        timer_slack;          /* 0x150   0x4 */

	/* XXX 4 bytes hole, try to pack */

	struct qdisc_watchdog      watchdog;             /* 0x158  0x48 */

	/* size: 448, cachelines: 7, members: 34 */
	/* sum members: 411, holes: 2, sum holes: 5 */
	/* padding: 32 */
	/* paddings: 1, sum paddings: 16 */
	/* forced alignments: 1 */
};

pahole output after the patch:

struct fq_sched_data {
	struct fq_flow_head        new_flows;            /*     0  0x10 */
	struct fq_flow_head        old_flows;            /*  0x10  0x10 */
	struct rb_root             delayed;              /*  0x20   0x8 */
	u64                        time_next_delayed_flow; /*  0x28   0x8 */
	u64                        ktime_cache;          /*  0x30   0x8 */
	unsigned long              unthrottle_latency_ns; /*  0x38   0x8 */
	/* --- cacheline 1 boundary (64 bytes) --- */
	struct fq_flow             internal __attribute__((__aligned__(64))); /*  0x40  0x80 */

	/* XXX last struct has 16 bytes of padding */

	/* --- cacheline 3 boundary (192 bytes) --- */
	u32                        quantum;              /*  0xc0   0x4 */
	u32                        initial_quantum;      /*  0xc4   0x4 */
	u32                        flow_refill_delay;    /*  0xc8   0x4 */
	u32                        flow_plimit;          /*  0xcc   0x4 */
	unsigned long              flow_max_rate;        /*  0xd0   0x8 */
	u64                        ce_threshold;         /*  0xd8   0x8 */
	u64                        horizon;              /*  0xe0   0x8 */
	u32                        orphan_mask;          /*  0xe8   0x4 */
	u32                        low_rate_threshold;   /*  0xec   0x4 */
	struct rb_root *           fq_root;              /*  0xf0   0x8 */
	u8                         rate_enable;          /*  0xf8   0x1 */
	u8                         fq_trees_log;         /*  0xf9   0x1 */
	u8                         horizon_drop;         /*  0xfa   0x1 */

	/* XXX 1 byte hole, try to pack */

<good>	u32                        timer_slack;          /*  0xfc   0x4 */
	/* --- cacheline 4 boundary (256 bytes) --- */
<good>	u32                        flows;                /* 0x100   0x4 */
	u32                        inactive_flows;       /* 0x104   0x4 */
	u32                        throttled_flows;      /* 0x108   0x4 */

	/* XXX 4 bytes hole, try to pack */

	u64                        stat_throttled;       /* 0x110   0x8 */
<better> struct qdisc_watchdog     watchdog;             /* 0x118  0x48 */
	/* --- cacheline 5 boundary (320 bytes) was 32 bytes ago --- */
	u64                        stat_gc_flows;        /* 0x160   0x8 */
	u64                        stat_internal_packets; /* 0x168   0x8 */
	u64                        stat_ce_mark;         /* 0x170   0x8 */
	u64                        stat_horizon_drops;   /* 0x178   0x8 */
	/* --- cacheline 6 boundary (384 bytes) --- */
	u64                        stat_horizon_caps;    /* 0x180   0x8 */
	u64                        stat_flows_plimit;    /* 0x188   0x8 */
	u64                        stat_pkts_too_long;   /* 0x190   0x8 */
	u64                        stat_allocation_errors; /* 0x198   0x8 */

	/* Force padding: */
	u64                        :64;
	u64                        :64;
	u64                        :64;
	u64                        :64;

	/* size: 448, cachelines: 7, members: 34 */
	/* sum members: 411, holes: 2, sum holes: 5 */
	/* padding: 32 */
	/* paddings: 1, sum paddings: 16 */
	/* forced alignments: 1 */
};
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

54ff8ad6

net_sched: constify qdisc_priv() · 1add9073

Eric Dumazet authored Sep 20, 2023

In order to propagate const qualifiers, we change qdisc_priv()
to accept a possibly const argument.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

1add9073

Merge branch 'tcp_delack_max' · 66ac08a7

David S. Miller authored Oct 01, 2023

Eric Dumazet says:

====================
tcp: add tcp_delack_max()

First patches are adding const qualifiers to four existing helpers.

Third patch adds a much needed companion feature to RTAX_RTO_MIN.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

66ac08a7

tcp: derive delack_max from rto_min · bbf80d71

Eric Dumazet authored Sep 20, 2023

While BPF allows to set icsk->->icsk_delack_max
and/or icsk->icsk_rto_min, we have an ip route
attribute (RTAX_RTO_MIN) to be able to tune rto_min,
but nothing to consequently adjust max delayed ack,
which vary from 40ms to 200 ms (TCP_DELACK_{MIN|MAX}).

This makes RTAX_RTO_MIN of almost no practical use,
unless customers are in big trouble.

Modern days datacenter communications want to set
rto_min to ~5 ms, and the max delayed ack one jiffie
smaller to avoid spurious retransmits.

After this patch, an "rto_min 5" route attribute will
effectively lower max delayed ack timers to 4 ms.

Note in the following ss output, "rto:6 ... ato:4"

$ ss -temoi dst XXXXXX
State Recv-Q Send-Q           Local Address:Port       Peer Address:Port  Process
ESTAB 0      0        [2002:a05:6608:295::]:52950   [2002:a05:6608:297::]:41597
     ino:255134 sk:1001 <->
         skmem:(r0,rb1707063,t872,tb262144,f0,w0,o0,bl0,d0) ts sack
 cubic wscale:8,8 rto:6 rtt:0.02/0.002 ato:4 mss:4096 pmtu:4500
 rcvmss:536 advmss:4096 cwnd:10 bytes_sent:54823160 bytes_acked:54823121
 bytes_received:54823120 segs_out:1370582 segs_in:1370580
 data_segs_out:1370579 data_segs_in:1370578 send 16.4Gbps
 pacing_rate 32.6Gbps delivery_rate 1.72Gbps delivered:1370579
 busy:26920ms unacked:1 rcv_rtt:34.615 rcv_space:65920
 rcv_ssthresh:65535 minrtt:0.015 snd_wnd:65536

While we could argue this patch fixes a bug with RTAX_RTO_MIN,
I do not add a Fixes: tag, so that we can soak it a bit before
asking backports to stable branches.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

bbf80d71

tcp: constify tcp_rto_min() and tcp_rto_min_us() argument · f68a181f

Eric Dumazet authored Sep 20, 2023

Make clear these functions do not change any field from TCP socket.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

f68a181f

net: constify sk_dst_get() and __sk_dst_get() argument · 5033f58d

Eric Dumazet authored Sep 20, 2023

Both helpers only read fields from their socket argument.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

5033f58d

30 Sep, 2023 1 commit

Merge branch '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue · 236f3873

David S. Miller authored Sep 30, 2023

Tony Nguyen says:

====================
ice: add PTP auxiliary bus support

Michal Michalik says:

Auxiliary bus allows exchanging information between PFs, which allows
both fixing problems and simplifying new features implementation.
The auxiliary bus is enabled for all devices supported by ice driver.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

236f3873

28 Sep, 2023 11 commits

pktgen: Introducing 'SHARED' flag for testing with non-shared skb · 7c7dd1d6

Liang Chen authored Sep 20, 2023

Currently, skbs generated by pktgen always have their reference count
incremented before transmission, causing their reference count to be
always greater than 1, leading to two issues:
1. Only the code paths for shared skbs can be tested.
2. In certain situations, skbs can only be released by pktgen.
To enhance testing comprehensiveness, we are introducing the "SHARED"
flag to indicate whether an SKB is shared. This flag is enabled by
default, aligning with the current behavior. However, disabling this
flag allows skbs with a reference count of 1 to be transmitted.
So we can test non-shared skbs and code paths where skbs are released
within the stack.
Signed-off-by: Liang Chen <liangchen.linux@gmail.com>
Reviewed-by: Benjamin Poirier <bpoirier@nvidia.com>
Link: https://lore.kernel.org/r/20230920125658.46978-2-liangchen.linux@gmail.comSigned-off-by: Paolo Abeni <pabeni@redhat.com>

7c7dd1d6

pktgen: Automate flag enumeration for unknown flag handling · 057708a9

Liang Chen authored Sep 20, 2023

When specifying an unknown flag, it will print all available flags.
Currently, these flags are provided as fixed strings, which requires
manual updates when flags change. Replacing it with automated flag
enumeration.
Signed-off-by: Liang Chen <liangchen.linux@gmail.com>
Signed-off-by: Benjamin Poirier <bpoirier@nvidia.com>
Link: https://lore.kernel.org/r/20230920125658.46978-1-liangchen.linux@gmail.comSigned-off-by: Paolo Abeni <pabeni@redhat.com>

057708a9

MAINTAINERS: Add an obsolete entry for LL TEMAC driver · 19f5eef8

Harini Katakam authored Sep 20, 2023

LL TEMAC IP is no longer supported. Hence add an entry marking the
driver as obsolete.
Signed-off-by: Harini Katakam <harini.katakam@amd.com>
Link: https://lore.kernel.org/r/20230920115047.31345-1-harini.katakam@amd.comSigned-off-by: Paolo Abeni <pabeni@redhat.com>

19f5eef8

Merge tag 'mlx5-updates-2023-09-19' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux · 416a01a4

Paolo Abeni authored Sep 28, 2023

Saeed Mahameed says:

====================
mlx5-updates-2023-09-19

Misc updates for mlx5 driver

1) From Erez, Add support for multicast forwarding to multi destination
   in bridge offloads with software steering mode (SMFS).

2) From Jianbo, Utilize the maximum aggregated link speed for police
   action rate.

3) From Moshe, Add a health error syndrome for pci data poisoned

4) From Shay, Enable 4 ports multiport E-switch

5) From Jiri, Trivial SF code cleanup

====================

Link: https://lore.kernel.org/r/20230920063552.296978-1-saeed@kernel.orgSigned-off-by: Paolo Abeni <pabeni@redhat.com>

416a01a4

ethernet/intel: Use list_for_each_entry() helper · c1fec890

Jinjie Ruan authored Sep 19, 2023

Convert list_for_each() to list_for_each_entry() where applicable.

No functional changed.
Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Rafal Romanowski <rafal.romanowski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/20230919170409.1581074-1-anthony.l.nguyen@intel.comSigned-off-by: Paolo Abeni <pabeni@redhat.com>

c1fec890

Merge branch 'selftests-tc-testing-parallel-tdc' · f940d704

Paolo Abeni authored Sep 28, 2023

Pedro Tammela says:

====================
selftests/tc-testing: parallel tdc

As the number of tdc tests is growing, so is our completion wall time.
One of the ideas to improve this is to run tests in parallel, as they
are self contained.

This series allows for tests to run in parallel, in batches of 32 tests.
Not all tests can run in parallel as they might conflict with each other.
The code will still honor this requirement even when trying to run the
tests over the worker pool.

In order to make this happen we had to localize the test resources
(patches 1 and 2), where instead of having all tests sharing one single
namespace and veths devices each test now gets it's own local namespace and devices.

Even though the tests serialize over rtnl_lock in the kernel, we
measured a speedup of about 3x in a test VM.
====================

Link: https://lore.kernel.org/r/20230919135404.1778595-1-pctammela@mojatatu.comSigned-off-by: Paolo Abeni <pabeni@redhat.com>

f940d704

selftests/tc-testing: update tdc documentation · d3fc4eea

Pedro Tammela authored Sep 19, 2023

Update the documentation to reflect the changes made to tdc with regards
to minimal requirements and test definitions expectations.
Tested-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: Pedro Tammela <pctammela@mojatatu.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

d3fc4eea

selftests/tc-testing: implement tdc parallel test run · ac9b8293

Pedro Tammela authored Sep 19, 2023

Use a Python process pool to run the tests in parallel.
Not all tests can run in parallel, for instance tests that are not
namespaced and tests that use netdevsim, as they can conflict with one
another.

The code logic will split the tests into serial and parallel.
For the parallel tests, we build batches of 32 tests and queue each
batch on the process pool. For the serial tests, they are queued as a
whole into the process pool, which in turn executes them concurrently
with the parallel tests.

Even though the tests serialize on rtnl_lock in the kernel, this feature
showed results with a ~3x speedup on the wall time for the entire test suite
running in a VM:
   Before - 4m32.502s
   After - 1m19.202s

Examples:
   In order to run tdc using 4 processes:
      ./tdc.py -J4 <...>
   In order to run tdc using 1 process:
      ./tdc.py -J1 <...> || ./tdc.py <...>

Note that the kernel configuration will affect the speed of the tests,
especially if such configuration slows down process creation and/or
fork().
Tested-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: Pedro Tammela <pctammela@mojatatu.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

ac9b8293

selftests/tc-testing: update test definitions for local resources · d227cc0b

Pedro Tammela authored Sep 19, 2023

With resources localized on a per test basis, some tests definitions
either contain redundant commands, were wrong or could be simplified.
Update all of them to match the new requirements.
Tested-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: Pedro Tammela <pctammela@mojatatu.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

d227cc0b

selftests/tc-testing: localize test resources · 98cfbe42

Pedro Tammela authored Sep 19, 2023

As of today, the current tdc architecture creates one netns and uses it
to run all tests. This assumption was embedded into the nsPlugin which
carried over as how the tests were written.

The tdc tests are by definition self contained and can,
theoretically, run in parallel. Even though in the kernel they will
serialize over the rtnl lock, we should expect a significant speedup of the
total wall time for the entire test suite, which is hitting close to
1100 tests at this point.

A first step to achieve this goal is to remove sharing of global resources like
veth/dummy interfaces and the netns. In this patch we 'localize' these
resources on a per test basis. Each test gets it's own netns, VETH/dummy interfaces.
The resources are spawned in the pre_suite phase, where tdc will prepare
all netns and interfaces for all tests. This is done in order to avoid
concurrency issues with netns / interfaces spawning and commands using
them. As tdc progresses, the resources are deleted after each test finishes
executing.

Tests that don't use the nsPlugin still run under the root namespace,
but are now required to manage any external resources like interfaces.
These cannot be parallelized as their definition doesn't allow it.
On the other hand, when using the nsPlugin, tests don't need to create
dummy/veth interfaces as these are handled already.
Tested-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: Pedro Tammela <pctammela@mojatatu.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

98cfbe42

net: sfp: add quirk for Fiberstone GPON-ONU-34-20BI · d387e34f

Christian Marangi authored Sep 19, 2023

Fiberstone GPON-ONU-34-20B can operate at 2500base-X, but report 1.2GBd
NRZ in their EEPROM.

The module also require the ignore tx fault fixup similar to Huawei MA5671A
as it gets disabled on error messages with serial redirection enabled.
Signed-off-by: Christian Marangi <ansuelsmth@gmail.com>
Link: https://lore.kernel.org/r/20230919124720.8210-1-ansuelsmth@gmail.comSigned-off-by: Paolo Abeni <pabeni@redhat.com>

d387e34f

22 Sep, 2023 5 commits

Merge branch 'mlxsw-multicast' · 5a1b322c

David S. Miller authored Sep 22, 2023

Petr Machata says:

====================
mlxsw: Improve blocks selection for IPv6 multicast forwarding

Amit Cohen writes:

The driver configures two ACL regions during initialization, these regions
are used for IPv4 and IPv6 multicast forwarding. Entries residing in these
two regions match on the {SIP, DIP, VRID} key elements.

Currently for IPv6 region, 9 key blocks are used. This can be improved by
reducing the amount key blocks needed for the IPv6 region to 8. It is
possible to use key blocks that mix subsets of the VRID element with
subsets of the DIP element.

To make this happen, we have to take in account the algorithm that chooses
which key blocks will be used. It is lazy and not the optimal one as it is
a complex task. It searches the block that contains the most elements that
are required, chooses it, removes the elements that appear in the chosen
block and starts again searching the block that contains the most elements.

To optimize the nubmber of the blocks for IPv6 multicast forwarding, handle
the following:

1. Add support for key blocks that mix subsets of the VRID element with
subsets of the DIP element.

2. Prevent the algorithm from chosing another blocks for VRID.
Currently, we have the block 'ipv4_4' which contains 2 sub-elements of
VRID. With the existing algorithm, this block might be chosen, then 8
blocks must be chosen for SIP and DIP and we will get 9 blocks to match on
{SIP, DIP, VRID}. Therefore, replace this block with a new block 'ipv4_5'
that contains 1 element for VRID, this will not be chosen for IPv6 as VRID
element will be broken to several sub-elements. In this way we can get 8
blocks for IPv6 multicast forwarding.

This improvement was tested and indeed 8 blocks are used instead of 9.

v2:
- Resending without changes.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

5a1b322c

mlxsw: Edit IPv6 key blocks to use one less block for multicast forwarding · 92953e7a

Amit Cohen authored Sep 19, 2023

Two ACL regions that are configured by the driver during initialization are
the ones used for IPv4 and IPv6 multicast forwarding. Entries residing
in these two regions match on the {SIP, DIP, VRID} key elements.

Currently for IPv6 region, 9 key blocks are used:
* 4 for SIP - 'ipv4_1', 'ipv6_{3,4,5}'
* 4 for DIP - 'ipv4_0', 'ipv6_{0,1,2/2b}'
* 1 for VRID - 'ipv4_4b'

This can be improved by reducing the amount key blocks needed for
the IPv6 region to 8. It is possible to use key blocks that mix subsets of
the VRID element with subsets of the DIP element.
The following key blocks can be used:
* 4 for SIP - 'ipv4_1', 'ipv6_{3,4,5}'
* 1 for subset of DIP - 'ipv4_0'
* 3 for the rest of DIP and subsets of VRID - 'ipv6_{0,1,2/2b}'

To make this happen, add VRID sub-elements as part of existing keys -
'ipv6_{0,1,2/2b}'. Note that one of the sub-elements is called
VRID_ROUTER_MSB and does not contain bit numbers like the rest, as for
Spectrum < 4 this element represents bits 8-10 and for Spectrum-4 it
represents bits 8-11.

Breaking VRID into 3 sub-elements makes the driver use one less block in
IPv6 region for multicast forwarding. The sub-elements can be filled in
blocks that are used for destination IP.

The algorithm in the driver that chooses which key blocks will be used is
lazy and not the optimal one. It searches the block that contains the most
elements that are required, chooses it, removes the elements that appear
in the chosen block and starts again searching the block that contains the
most elements.

When key block 'ipv4_4' is defined, the algorithm might choose it, as it
contains 2 sub-elements of VRID, then 8 blocks must be chosen for SIP and
DIP and we get 9 blocks to match on {SIP, DIP, VRID}. That is why we had to
remove key block 'ipv4_4' in a previous patch and use key block that
contains one field for VRID.

This improvement was tested and indeed 8 blocks are used instead of 9.
Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

92953e7a

mlxsw: spectrum_acl_flex_keys: Add 'ipv4_5b' flex key · c6caabdf

Amit Cohen authored Sep 19, 2023

The previous patch replaced the key block 'ipv4_4' with 'ipv4_5'. The
corresponding block for Spectrum-4 is 'ipv4_4b'. To be consistent, replace
key block 'ipv4_4b' with 'ipv4_5b'.
Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

c6caabdf

mlxsw: Add 'ipv4_5' flex key · c2f3e10a

Amit Cohen authored Sep 19, 2023

Currently virtual router ID element is broken to two sub-elements -
'VIRT_ROUTER_LSB' and 'VIRT_ROUTER_MSB'. It was broken as this field is
broken in 'ipv4_4' flex key which is used for IPv4 in Spectrum < 4.
For Spectrum-4, we use 'ipv4_4b' flex key which contains one field for
virtual router, this key is not supported in older ASICs.

Add 'ipv4_5' flex key which is supported in all ASICs and contains one
field for virtual router. Then there is no reason to use 'VIRT_ROUTER_LSB'
and 'VIRT_ROUTER_MSB', remove them and add one element 'VIRT_ROUTER' for
this field.

The motivation is to get rid of 'ipv4_4' flex key, as it might be chosen
for IPv6 multicast forwarding region. This will not allow the improvement
in a following patch. See more details in the cover letter and in a
following patch.
Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

c2f3e10a

hamradio: baycom: remove useless link in Kconfig · 84c19e65

Peter Lafreniere authored Sep 19, 2023

The Kconfig help text for baycom drivers suggests that more information
on the hardware can be found at <https://www.baycom.de>. The website now
includes no information on their ham radio products other than a mention
that they were once produced by the company, saying:
"The amateur radio equipment is now no longer part and business of BayCom GmbH"

As there is no information relavent to the baycom driver on the site,
remove the link.
Signed-off-by: Peter Lafreniere <peter@n8pjl.ca>
Signed-off-by: David S. Miller <davem@davemloft.net>

84c19e65