Commits · e61caf04b9f8a2626714ffd12e734c555c758af4 · Kirill Smelkov / linux

15 Apr, 2023 4 commits

Merge branch 'page_pool-allow-caching-from-safely-localized-napi' · e61caf04

Jakub Kicinski authored Apr 14, 2023

Jakub Kicinski says:

====================
page_pool: allow caching from safely localized NAPI

I went back to the explicit "are we in NAPI method", mostly
because I don't like having both around :( (even tho I maintain
that in_softirq() && !in_hardirq() is as safe, as softirqs do
not nest).

Still returning the skbs to a CPU, tho, not to the NAPI instance.
I reckon we could create a small refcounted struct per NAPI instance
which would allow sockets and other users so hold a persisent
and safe reference. But that's a bigger change, and I get 90+%
recycling thru the cache with just these patches (for RR and
streaming tests with 100% CPU use it's almost 100%).

Some numbers for streaming test with 100% CPU use (from previous version,
but really they perform the same):

		HW-GRO				page=page
		before		after		before		after
recycle:
cached:			0	138669686		0	150197505
cache_full:		0	   223391		0	    74582
ring:		138551933         9997191	149299454		0
ring_full: 		0             488	     3154	   127590
released_refcnt:	0		0		0		0

alloc:
fast:		136491361	148615710	146969587	150322859
slow:		     1772	     1799	      144	      105
slow_high_order:	0		0		0		0
empty:		     1772	     1799	      144	      105
refill:		  2165245	   156302	  2332880	     2128
waive:			0		0		0		0

v1: https://lore.kernel.org/all/20230411201800.596103-1-kuba@kernel.org/
rfcv2: https://lore.kernel.org/all/20230405232100.103392-1-kuba@kernel.org/
====================

Link: https://lore.kernel.org/r/20230413042605.895677-1-kuba@kernel.orgSigned-off-by: Jakub Kicinski <kuba@kernel.org>

e61caf04

bnxt: hook NAPIs to page pools · 294e39e0

Jakub Kicinski authored Apr 12, 2023

bnxt has 1:1 mapping of page pools and NAPIs, so it's safe
to hoook them up together.
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Tested-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

294e39e0

page_pool: allow caching from safely localized NAPI · 8c48eea3

Jakub Kicinski authored Apr 12, 2023

Recent patches to mlx5 mentioned a regression when moving from
driver local page pool to only using the generic page pool code.
Page pool has two recycling paths (1) direct one, which runs in
safe NAPI context (basically consumer context, so producing
can be lockless); and (2) via a ptr_ring, which takes a spin
lock because the freeing can happen from any CPU; producer
and consumer may run concurrently.

Since the page pool code was added, Eric introduced a revised version
of deferred skb freeing. TCP skbs are now usually returned to the CPU
which allocated them, and freed in softirq context. This places the
freeing (producing of pages back to the pool) enticingly close to
the allocation (consumer).

If we can prove that we're freeing in the same softirq context in which
the consumer NAPI will run - lockless use of the cache is perfectly fine,
no need for the lock.

Let drivers link the page pool to a NAPI instance. If the NAPI instance
is scheduled on the same CPU on which we're freeing - place the pages
in the direct cache.

With that and patched bnxt (XDP enabled to engage the page pool, sigh,
bnxt really needs page pool work :() I see a 2.6% perf boost with
a TCP stream test (app on a different physical core than softirq).

The CPU use of relevant functions decreases as expected:

  page_pool_refill_alloc_cache   1.17% -> 0%
  _raw_spin_lock                 2.41% -> 0.98%

Only consider lockless path to be safe when NAPI is scheduled
- in practice this should cover majority if not all of steady state
workloads. It's usually the NAPI kicking in that causes the skb flush.

The main case we'll miss out on is when application runs on the same
CPU as NAPI. In that case we don't use the deferred skb free path.
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Tested-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

8c48eea3

net: skb: plumb napi state thru skb freeing paths · b07a2d97

Jakub Kicinski authored Apr 12, 2023

We maintain a NAPI-local cache of skbs which is fed by napi_consume_skb().
Going forward we will also try to cache head and data pages.
Plumb the "are we in a normal NAPI context" information thru
deeper into the freeing path, up to skb_release_data() and
skb_free_head()/skb_pp_recycle(). The "not normal NAPI context"
comes from netpoll which passes budget of 0 to try to reap
the Tx completions but not perform any Rx.

Use "bool napi_safe" rather than bare "int budget",
the further we get from NAPI the more confusing the budget
argument may seem (particularly whether 0 or MAX is the
correct value to pass in when not in NAPI).
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Tested-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

b07a2d97

14 Apr, 2023 36 commits

Merge branch 'msg_control-split' · c11d2e71

David S. Miller authored Apr 14, 2023

Kevin Brodsky says:

====================
net: Finish up ->msg_control{,_user} split

Commit 1f466e1f ("net: cleanly handle kernel vs user buffers for
->msg_control") introduced the msg_control_user and
msg_control_is_user fields in struct msghdr, to ensure that user
pointers are represented as such. It also took care of converting most
users of struct msghdr::msg_control where user pointers are involved. It
did however miss a number of cases, and some code using msg_control
inappropriately has also appeared in the meantime.

This series is attempting to complete the split, by eliminating the
remaining cases where msg_control is used when in fact a user
pointer is stored in the union (patch 1).

It also addresses a couple of issues with msg_control_is_user: one where
it is not updated as it should (patch 2), and one where it is not
initialised (patch 3).

v1..v2:
* Split out the msg_control_is_user fixes into separate patches.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

c11d2e71

net/ipv6: Initialise msg_control_is_user · b6d85cf5

Kevin Brodsky authored Apr 13, 2023

do_ipv6_setsockopt() makes use of struct msghdr::msg_control in the
IPV6_2292PKTOPTIONS case. Make sure to initialise
msg_control_is_user accordingly.

Cc: Christoph Hellwig <hch@lst.de>
Cc: Eric Dumazet <edumazet@google.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

b6d85cf5

net/compat: Update msg_control_is_user when setting a kernel pointer · 60daf8d4

Kevin Brodsky authored Apr 13, 2023

cmsghdr_from_user_compat_to_kern() is an unusual case w.r.t. how
the kmsg->msg_control* fields are used. The input struct msghdr
holds a pointer to a user buffer, i.e. ksmg->msg_control_user is
active. However, upon success, a kernel pointer is stored in
kmsg->msg_control. kmsg->msg_control_is_user should therefore be
updated accordingly.

Cc: Christoph Hellwig <hch@lst.de>
Cc: Eric Dumazet <edumazet@google.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

60daf8d4

net: Ensure ->msg_control_user is used for user buffers · c39ef213

Kevin Brodsky authored Apr 13, 2023

Since commit 1f466e1f ("net: cleanly handle kernel vs user
buffers for ->msg_control"), pointers to user buffers should be
stored in struct msghdr::msg_control_user, instead of the
msg_control field.  Most users of msg_control have already been
converted (where user buffers are involved), but not all of them.

This patch attempts to address the remaining cases. An exception is
made for null checks, as it should be safe to use msg_control
unconditionally for that purpose.

Cc: Christoph Hellwig <hch@lst.de>
Cc: Eric Dumazet <edumazet@google.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

c39ef213

vsock/loopback: don't disable irqs for queue access · eaaa4e92

Arseniy Krasnov authored Apr 13, 2023

This replaces 'skb_queue_tail()' with 'virtio_vsock_skb_queue_tail()'.
The first one uses 'spin_lock_irqsave()', second uses 'spin_lock_bh()'.
There is no need to disable interrupts in the loopback transport as
there is no access to the queue with skbs from interrupt context. Both
virtio and vhost transports work in the same way.
Signed-off-by: Arseniy Krasnov <AVKrasnov@sberdevices.ru>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

eaaa4e92

Merge branch 'mana-jumbo-frames' · c61fcc09

David S. Miller authored Apr 14, 2023

Haiyang Zhang says:

====================
net: mana: Add support for jumbo frame

The set adds support for jumbo frame,
with some optimization for the RX path.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

c61fcc09

net: mana: Add support for jumbo frame · 80f6215b

Haiyang Zhang authored Apr 12, 2023

During probe, get the hardware-allowed max MTU by querying the device
configuration. Users can select MTU up to the device limit.
When XDP is in use, limit MTU settings so the buffer size is within
one page. And, when MTU is set to a too large value, XDP is not allowed
to run.
Also, to prevent changing MTU fails, and leaves the NIC in a bad state,
pre-allocate all buffers before starting the change. So in low memory
condition, it will return error, without affecting the NIC.
Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

80f6215b

net: mana: Enable RX path to handle various MTU sizes · 2fbbd712

Haiyang Zhang authored Apr 12, 2023

Update RX data path to allocate and use RX queue DMA buffers with
proper size based on potentially various MTU sizes.
Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

2fbbd712

net: mana: Refactor RX buffer allocation code to prepare for various MTU · a2917b23

Haiyang Zhang authored Apr 12, 2023

Move out common buffer allocation code from mana_process_rx_cqe() and
mana_alloc_rx_wqe() to helper functions.
Refactor related variables so they can be changed in one place, and buffer
sizes are in sync.
Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

a2917b23

net: mana: Use napi_build_skb in RX path · ce518bc3

Haiyang Zhang authored Apr 12, 2023

Use napi_build_skb() instead of build_skb() to take advantage of the
NAPI percpu caches to obtain skbuff_head.
Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ce518bc3

Merge tag 'mlx5-updates-2023-04-11' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux · e473ea81

Jakub Kicinski authored Apr 13, 2023

Saeed Mahameed says:

====================
mlx5-updates-2023-04-11

1) Vlad adds the support for linux bridge multicast offload support
   Patches #1 through #9
   Synopsis

Vlad Says:
==============
Implement support of bridge multicast offload in mlx5. Handle port object
attribute SWITCHDEV_ATTR_ID_BRIDGE_MC_DISABLED notification to toggle multicast
offload and bridge snooping support on bridge. Handle port object
SWITCHDEV_OBJ_ID_PORT_MDB notification to attach a bridge port to MDB.

Steering architecture

Existing offload infrastructure relies on two levels of flow tables - bridge
ingress and egress. For multicast offload the architecture is extended with
additional layer of per-port multicast replication tables. Such tables filter
loopback traffic (so packets are not replicated to their source port) and pop
VLAN headers for "untagged" VLANs. The tables are referenced by the MDB rules in
egress table. MDB egress rule can point to multiple per-port multicast tables,
which causes matching multicast traffic to be replicated to all of them, and,
consecutively, to several bridge ports:

                                                                                                                            +--------+--+
                                                                                    +---------------------------------------> Port 1 |  |
                                                                                    |                                       +-^------+--+
                                                                                    |                                         |
                                                                                    |                                         |
                                       +-----------------------------------------+  |     +---------------------------+       |
                                       | EGRESS table                            |  |  +--> PORT 1 multicast table    |       |
+----------------------------------+   +-----------------------------------------+  |  |  +---------------------------+       |
| INGRESS table                    |   |                                         |  |  |  |                           |       |
+----------------------------------+   | dst_mac=P1,vlan=X -> pop vlan, goto P1  +--+  |  | FG0:                      |       |
|                                  |   | dst_mac=P1,vlan=Y -> pop vlan, goto P1  |     |  | src_port=dst_port -> drop |       |
| src_mac=M1,vlan=X -> goto egress +---> dst_mac=P2,vlan=X -> pop vlan, goto P2  +--+  |  | FG1:                      |       |
| ...                              |   | dst_mac=P2,vlan=Y -> goto P2            |  |  |  | VLAN X -> pop, goto port  |       |
|                                  |   | dst_mac=MDB1,vlan=Y -> goto mcast P1,P2 +-----+  | ...                       |       |
+----------------------------------+   |                                         |  |  |  | VLAN Y -> pop, goto port  +-------+
                                       +-----------------------------------------+  |  |  | FG3:                      |
                                                                                    |  |  | matchall -> goto port     |
                                                                                    |  |  |                           |
                                                                                    |  |  +---------------------------+
                                                                                    |  |
                                                                                    |  |
                                                                                    |  |                                    +--------+--+
                                                                                    +---------------------------------------> Port 2 |  |
                                                                                       |                                    +-^------+--+
                                                                                       |                                      |
                                                                                       |                                      |
                                                                                       |  +---------------------------+       |
                                                                                       +--> PORT 2 multicast table    |       |
                                                                                          +---------------------------+       |
                                                                                          |                           |       |
                                                                                          | FG0:                      |       |
                                                                                          | src_port=dst_port -> drop |       |
                                                                                          | FG1:                      |       |
                                                                                          | VLAN X -> pop, goto port  |       |
                                                                                          | ...                       |       |
                                                                                          |                           |       |
                                                                                          | FG3:                      |       |
                                                                                          | matchall -> goto port     +-------+
                                                                                          |                           |
                                                                                          +---------------------------+

Patches overview:

- Patch 1 adds hardware definition bits for capabilities required to replicate
  multicast packets to multiple per-port tables. These bits are used by
  following patches to only attempt multicast offload if firmware and hardware
  provide necessary support.

- Pathces 2-4 patches are preparations and refactoring.

- Patch 5 implements necessary infrastructure to toggle multicast offload
  via SWITCHDEV_ATTR_ID_BRIDGE_MC_DISABLED port object attribute notification.
  This also enabled IGMP and MLD snooping.

- Patch 6 implements per-port multicast replication tables. It only supports
  filtering of loopback packets.

- Patch 7 extends per-port multicast tables with VLAN pop support for 'untagged'
  VLANs.

- Patch 8 handles SWITCHDEV_OBJ_ID_PORT_MDB port object notifications. It
  creates MDB replication rules in egress table that can replicate packets to
  multiple per-port multicast tables.

- Patch 9 adds tracepoints for MDB events.

==============

2) Parav Create a new allocation profile for SFs, to save on memory

3) Yevgeny provides some initial patches for upcoming software steering
   support new pattern/arguments type of modify_header actions.

Starting with ConnectX-6 DX, we use a new design of modify_header FW object.
The current modify_header object allows for having only limited number of
these FW objects, which means that we are limited in the number of offloaded
flows that require modify_header action.

As a preparation Yevgeny provides the following 4 patches:
 - Patch 1: Add required mlx5_ifc HW bits
 - Patch 2, 3: Add new WQE type and opcode that is required for pattern/arg
   support and adds appropriate support in dr_send.c
 - Patch 4: Add ICM pool for modify-header-pattern objects and implement
   patterns cache, allowing patterns reuse for different flows

* tag 'mlx5-updates-2023-04-11' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux:
  net/mlx5: DR, Add modify-header-pattern ICM pool
  net/mlx5: DR, Prepare sending new WQE type
  net/mlx5: Add new WQE for updating flow table
  net/mlx5: Add mlx5_ifc bits for modify header argument
  net/mlx5: DR, Set counter ID on the last STE for STEv1 TX
  net/mlx5: Create a new profile for SFs
  net/mlx5: Bridge, add tracepoints for multicast
  net/mlx5: Bridge, implement mdb offload
  net/mlx5: Bridge, support multicast VLAN pop
  net/mlx5: Bridge, add per-port multicast replication tables
  net/mlx5: Bridge, snoop igmp/mld packets
  net/mlx5: Bridge, extract code to lookup parent bridge of port
  net/mlx5: Bridge, move additional data structures to priv header
  net/mlx5: Bridge, increase bridge tables sizes
  net/mlx5: Add mlx5_ifc definitions for bridge multicast support
====================

Link: https://lore.kernel.org/r/20230412040752.14220-1-saeed@kernel.orgSigned-off-by: Jakub Kicinski <kuba@kernel.org>

e473ea81

Merge branch 'add-kernel-tc-mqprio-and-tc-taprio-support-for-preemptible-traffic-classes' · f7d29571

Jakub Kicinski authored Apr 13, 2023

Vladimir Oltean says:

====================
Add kernel tc-mqprio and tc-taprio support for preemptible traffic classes

The last RFC in August 2022 contained a proposal for the UAPI of both
TSN standards which together form Frame Preemption (802.1Q and 802.3):
https://lore.kernel.org/netdev/20220816222920.1952936-1-vladimir.oltean@nxp.com/

It wasn't clear at the time whether the 802.1Q portion of Frame Preemption
should be exposed via the tc qdisc (mqprio, taprio) or via some other
layer (perhaps also ethtool like the 802.3 portion, or dcbnl), even
though the options were discussed extensively, with pros and cons:
https://lore.kernel.org/netdev/20220816222920.1952936-3-vladimir.oltean@nxp.com/

So the 802.3 portion got submitted separately and finally was accepted:
https://lore.kernel.org/netdev/20230119122705.73054-1-vladimir.oltean@nxp.com/

leaving the only remaining question: how do we expose the 802.1Q bits?

This series proposes that we use the Qdisc layer, through separate
(albeit very similar) UAPI in mqprio and taprio, and that both these
Qdiscs pass the information down to the offloading device driver through
the common mqprio offload structure (which taprio also passes).

An implementation is provided for the NXP LS1028A on-board Ethernet
endpoint (enetc). Previous versions also contained support for its
embedded switch (felix), but this needs more work and will be submitted
separately.

v4: https://lore.kernel.org/netdev/20230403103440.2895683-1-vladimir.oltean@nxp.com/
v2: https://lore.kernel.org/netdev/20230219135309.594188-1-vladimir.oltean@nxp.com/
v1: https://lore.kernel.org/netdev/20230216232126.3402975-1-vladimir.oltean@nxp.com/
====================

Link: https://lore.kernel.org/r/20230411180157.1850527-1-vladimir.oltean@nxp.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

f7d29571

net: enetc: add support for preemptible traffic classes · 01e23b2b

Vladimir Oltean authored Apr 11, 2023

PFs which support the MAC Merge layer also have a set of 8 registers
called "Port traffic class N frame preemption register (PTC0FPR - PTC7FPR)".
Through these, a traffic class (group of TX rings of same dequeue
priority) can be mapped to the eMAC or to the pMAC.

There's nothing particularly spectacular here. We should probably only
commit the preemptible TCs to hardware once the MAC Merge layer became
active, but unlike Felix, we don't have an IRQ that notifies us of that.
We'd have to sleep for up to verifyTime (127 ms) to wait for a
resolution coming from the verification state machine; not only from the
ndo_setup_tc() code path, but also from enetc_mm_link_state_update().
Since it's relatively complicated and has a relatively small benefit,
I'm not doing it.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Ferenc Fejes <fejes@inf.elte.hu>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

01e23b2b

net: enetc: rename "mqprio" to "qopt" · 50764da3

Vladimir Oltean authored Apr 11, 2023

To gain access to the larger encapsulating structure which has the type
tc_mqprio_qopt_offload, rename just the "qopt" field as "qopt".
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Ferenc Fejes <fejes@inf.elte.hu>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

50764da3

net/sched: taprio: allow per-TC user input of FP adminStatus · a721c3e5

Vladimir Oltean authored Apr 11, 2023

This is a duplication of the FP adminStatus logic introduced for
tc-mqprio. Offloading is done through the tc_mqprio_qopt_offload
structure embedded within tc_taprio_qopt_offload. So practically, if a
device driver is written to treat the mqprio portion of taprio just like
standalone mqprio, it gets unified handling of frame preemption.

I would have reused more code with taprio, but this is mostly netlink
attribute parsing, which is hard to transform into generic code without
having something that stinks as a result. We have the same variables
with the same semantics, just different nlattr type values
(TCA_MQPRIO_TC_ENTRY=5 vs TCA_TAPRIO_ATTR_TC_ENTRY=12;
TCA_MQPRIO_TC_ENTRY_FP=2 vs TCA_TAPRIO_TC_ENTRY_FP=3, etc) and
consequently, different policies for the nest.

Every time nla_parse_nested() is called, an on-stack table "tb" of
nlattr pointers is allocated statically, up to the maximum understood
nlattr type. That array size is hardcoded as a constant, but when
transforming this into a common parsing function, it would become either
a VLA (which the Linux kernel rightfully doesn't like) or a call to the
allocator.

Having FP adminStatus in tc-taprio can be seen as addressing the 802.1Q
Annex S.3 "Scheduling and preemption used in combination, no HOLD/RELEASE"
and S.4 "Scheduling and preemption used in combination with HOLD/RELEASE"
use cases. HOLD and RELEASE events are emitted towards the underlying
MAC Merge layer when the schedule hits a Set-And-Hold-MAC or a
Set-And-Release-MAC gate operation. So within the tc-taprio UAPI space,
one can distinguish between the 2 use cases by choosing whether to use
the TC_TAPRIO_CMD_SET_AND_HOLD and TC_TAPRIO_CMD_SET_AND_RELEASE gate
operations within the schedule, or just TC_TAPRIO_CMD_SET_GATES.

A small part of the change is dedicated to refactoring the max_sdu
nlattr parsing to put all logic under the "if" that tests for presence
of that nlattr.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Ferenc Fejes <fejes@inf.elte.hu>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

a721c3e5

net/sched: mqprio: allow per-TC user input of FP adminStatus · f62af20b

Vladimir Oltean authored Apr 11, 2023

IEEE 802.1Q-2018 clause 6.7.2 Frame preemption specifies that each
packet priority can be assigned to a "frame preemption status" value of
either "express" or "preemptible". Express priorities are transmitted by
the local device through the eMAC, and preemptible priorities through
the pMAC (the concepts of eMAC and pMAC come from the 802.3 MAC Merge
layer).

The FP adminStatus is defined per packet priority, but 802.1Q clause
12.30.1.1.1 framePreemptionAdminStatus also says that:

| Priorities that all map to the same traffic class should be
| constrained to use the same value of preemption status.

It is impossible to ignore the cognitive dissonance in the standard
here, because it practically means that the FP adminStatus only takes
distinct values per traffic class, even though it is defined per
priority.

I can see no valid use case which is prevented by having the kernel take
the FP adminStatus as input per traffic class (what we do here).
In addition, this also enforces the above constraint by construction.
User space network managers which wish to expose FP adminStatus per
priority are free to do so; they must only observe the prio_tc_map of
the netdev (which presumably is also under their control, when
constructing the mqprio netlink attributes).

The reason for configuring frame preemption as a property of the Qdisc
layer is that the information about "preemptible TCs" is closest to the
place which handles the num_tc and prio_tc_map of the netdev. If the
UAPI would have been any other layer, it would be unclear what to do
with the FP information when num_tc collapses to 0. A key assumption is
that only mqprio/taprio change the num_tc and prio_tc_map of the netdev.
Not sure if that's a great assumption to make.

Having FP in tc-mqprio can be seen as an implementation of the use case
defined in 802.1Q Annex S.2 "Preemption used in isolation". There will
be a separate implementation of FP in tc-taprio, for the other use
cases.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Ferenc Fejes <fejes@inf.elte.hu>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

f62af20b

net/sched: pass netlink extack to mqprio and taprio offload · c54876cd

Vladimir Oltean authored Apr 11, 2023

With the multiplexed ndo_setup_tc() model which lacks a first-class
struct netlink_ext_ack * argument, the only way to pass the netlink
extended ACK message down to the device driver is to embed it within the
offload structure.

Do this for struct tc_mqprio_qopt_offload and struct tc_taprio_qopt_offload.

Since struct tc_taprio_qopt_offload also contains a tc_mqprio_qopt_offload
structure, and since device drivers might effectively reuse their mqprio
implementation for the mqprio portion of taprio, we make taprio set the
extack in both offload structures to point at the same netlink extack
message.

In fact, the taprio handling is a bit more tricky, for 2 reasons.

First is because the offload structure has a longer lifetime than the
extack structure. The driver is supposed to populate the extack
synchronously from ndo_setup_tc() and leave it alone afterwards.
To not have any use-after-free surprises, we zero out the extack pointer
when we leave taprio_enable_offload().

The second reason is because taprio does overwrite the extack message on
ndo_setup_tc() error. We need to switch to the weak form of setting an
extack message, which preserves a potential message set by the driver.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

c54876cd

net/sched: mqprio: add an extack message to mqprio_parse_opt() · ab277d20

Vladimir Oltean authored Apr 11, 2023

Ferenc reports that a combination of poor iproute2 defaults and obscure
cases where the kernel returns -EINVAL make it difficult to understand
what is wrong with this command:

$ ip link add veth0 numtxqueues 8 numrxqueues 8 type veth peer name veth1
$ tc qdisc add dev veth0 root mqprio num_tc 8 map 0 1 2 3 4 5 6 7 \
        queues 1@0 1@1 1@2 1@3 1@4 1@5 1@6 1@7
RTNETLINK answers: Invalid argument

Hopefully with this patch, the cause is clearer:

Error: Device does not support hardware offload.

The kernel was (and still is) rejecting this because iproute2 defaults
to "hw 1" if this command line option is not specified.

Link: https://lore.kernel.org/netdev/ede5e9a2f27bf83bfb86d3e8c4ca7b34093b99e2.camel@inf.elte.hu/Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Ferenc Fejes <fejes@inf.elte.hu>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ab277d20

net/sched: mqprio: add extack to mqprio_parse_nlattr() · 57f21bf8

Vladimir Oltean authored Apr 11, 2023

Netlink attribute parsing in mqprio is a minesweeper game, with many
options having the possibility of being passed incorrectly and the user
being none the wiser.

Try to make errors less sour by giving user space some information
regarding what went wrong.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Ferenc Fejes <fejes@inf.elte.hu>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

57f21bf8

net/sched: mqprio: simplify handling of nlattr portion of TCA_OPTIONS · 3dd0c16e

Vladimir Oltean authored Apr 11, 2023

In commit 4e8b86c0 ("mqprio: Introduce new hardware offload mode and
shaper in mqprio"), the TCA_OPTIONS format of mqprio was extended to
contain a fixed portion (of size NLA_ALIGN(sizeof struct tc_mqprio_qopt))
and a variable portion of other nlattrs (in the TCA_MQPRIO_* type space)
following immediately afterwards.

In commit feb2cf3d ("net/sched: mqprio: refactor nlattr parsing to a
separate function"), we've moved the nlattr handling to a smaller
function, but yet, a small parse_attr() still remains, and the larger
mqprio_parse_nlattr() still does not have access to the beginning, and
the length, of the TCA_OPTIONS region containing these other nlattrs.

In a future change, the mqprio qdisc will need to iterate through this
nlattr region to discover other attributes, so eliminate parse_attr()
and add 2 variables in mqprio_parse_nlattr() which hold the beginning
and the length of the nlattr range.

We avoid the need to memset when nlattr_opt_len has insufficient length
by pre-initializing the table "tb".
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Ferenc Fejes <fejes@inf.elte.hu>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

3dd0c16e

net: ethtool: create and export ethtool_dev_mm_supported() · d54151aa

Vladimir Oltean authored Apr 11, 2023

Create a wrapper over __ethtool_dev_mm_supported() which also calls
ethnl_ops_begin() and ethnl_ops_complete(). It can be used by other code
layers, such as tc, to make sure that preemptible TCs are supported
(this is true if an underlying MAC Merge layer exists).
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Ferenc Fejes <fejes@inf.elte.hu>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

d54151aa

tools: ynl: Rename ethtool to ethtool.py · 85a4abed

Rahul Rameshbabu authored Apr 12, 2023

Make it explicit that this tool is not a drop-in replacement for ethtool.
This tool is intended for testing ethtool functionality implemented in the
kernel and should use a name that differentiates it from the ethtool
utility.
Signed-off-by: Rahul Rameshbabu <rrameshbabu@nvidia.com>
Link: https://lore.kernel.org/r/20230413012252.184434-2-rrameshbabu@nvidia.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

85a4abed

tools: ynl: Remove absolute paths to yaml files from ethtool testing tool · 3ea31e66

Rahul Rameshbabu authored Apr 12, 2023

Absolute paths for the spec and schema files make the ethtool testing tool
unusable with freshly checked-out source trees. Replace absolute paths with
relative paths for both files in the Documentation/ directory.

Issue seen before the change

Traceback (most recent call last):
File "/home/binary-eater/Documents/mlx/linux/tools/net/ynl/./ethtool", line 424, in <module>
main()
File "/home/binary-eater/Documents/mlx/linux/tools/net/ynl/./ethtool", line 158, in main
ynl = YnlFamily(spec, schema)
File "/home/binary-eater/Documents/mlx/linux/tools/net/ynl/lib/ynl.py", line 342, in __init__
super().__init__(def_path, schema)
File "/home/binary-eater/Documents/mlx/linux/tools/net/ynl/lib/nlspec.py", line 333, in __init__
with open(spec_path, "r") as stream:
FileNotFoundError: [Errno 2] No such file or directory: '/usr/local/google/home/sdf/src/linux/Documentation/netlink/specs/ethtool.yaml'

Fixes: f3d07b02 ("tools: ynl: ethtool testing tool")
Signed-off-by: Rahul Rameshbabu <rrameshbabu@nvidia.com>
Link: https://lore.kernel.org/r/20230413012252.184434-1-rrameshbabu@nvidia.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

3ea31e66

Merge branch 'macb-ptp-minor-updates' · 916b15fb

Jakub Kicinski authored Apr 13, 2023

Harini Katakam says:

====================
Macb PTP minor updates

- Enable PTP unicast
- Optimize HW timestamp reading
====================

Link: https://lore.kernel.org/r/20230411123712.11459-1-harini.katakam@amd.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

916b15fb

net: macb: Optimize reading HW timestamp · 8c0d0fe0

Harini Katakam authored Apr 11, 2023

The seconds input from BD (6 bits) just needs to be ORed with the
upper bits from timer in this function. Avoid addition operation
every single time. Seconds rollover handling is left untouched.
Signed-off-by: Harini Katakam <harini.katakam@xilinx.com>
Signed-off-by: Michal Simek <michal.simek@xilinx.com>
Signed-off-by: Radhey Shyam Pandey <radhey.shyam.pandey@xilinx.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

8c0d0fe0

net: macb: Enable PTP unicast · ee4e92c2

Harini Katakam authored Apr 11, 2023

Enable transmission and reception of PTP unicast packets by
updating PTP unicast config bit and setting current HW mac
address as allowed address in PTP unicast filter registers.
Signed-off-by: Harini Katakam <harini.katakam@xilinx.com>
Signed-off-by: Michal Simek <michal.simek@xilinx.com>
Signed-off-by: Radhey Shyam Pandey <radhey.shyam.pandey@xilinx.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ee4e92c2

net: macb: Update gem PTP support check · adee474a

Harini Katakam authored Apr 11, 2023

There are currently two checks for PTP functionality - one on GEM
capability and another on the kernel config option. Combine them
into a single function as there's no use case where gem_has_ptp is
TRUE and MACB_USE_HWSTAMP is false.
Signed-off-by: Harini Katakam <harini.katakam@amd.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

adee474a

Merge branch 'ocelot-felix-driver-cleanup' · fb4be9a4

Jakub Kicinski authored Apr 13, 2023

Vladimir Oltean says:

====================
Ocelot/Felix driver cleanup

The cleanup mostly handles the statistics code path - some issues
regarding understandability became apparent after the series
"Fix trainwreck with Ocelot switch statistics counters":
https://lore.kernel.org/netdev/20230321010325.897817-1-vladimir.oltean@nxp.com/

There is also one patch which cleans up a misleading comment
in the DSA felix_setup().
====================

Link: https://lore.kernel.org/r/20230412124737.2243527-1-vladimir.oltean@nxp.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

fb4be9a4

net: mscc: ocelot: fix ineffective WARN_ON() in ocelot_stats.c · a291399e

Vladimir Oltean authored Apr 12, 2023

Since it is hopefully now clear that, since "last" and "layout[i].reg"
are enum types and not addresses, the existing WARN_ON() is ineffective
in checking that the _addresses_ are sorted in the proper order.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

a291399e

net: mscc: ocelot: strengthen type of "int i" in ocelot_stats.c · 6663c01e

Vladimir Oltean authored Apr 12, 2023

The "int i" used to index the struct ocelot_stat_layout array actually
has a specific type: enum ocelot_stat. Use it, so that the WARN()
comment from ocelot_prepare_stats_regions() makes more sense.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

6663c01e

net: mscc: ocelot: strengthen type of "u32 reg" and "u32 base" in ocelot_stats.c · eae0b9d1

Vladimir Oltean authored Apr 12, 2023

Use the specific enum ocelot_reg to make it clear that the region
registers are encoded and not plain addresses.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

eae0b9d1

net: dsa: felix: remove confusing/incorrect comment from felix_setup() · a9afc3e4

Vladimir Oltean authored Apr 12, 2023

That comment was written prior to knowing that what I was actually
seeing was a manifestation of the bug fixed in commit b4024c9e
("felix: Fix initialization of ioremap resources").

There isn't any particular reason now why the hardware initialization is
done in felix_setup(), so just delete that comment to avoid spreading
misinformation.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

a9afc3e4

net: mscc: ocelot: remove blank line at the end of ocelot_stats.c · 93f0f93b

Vladimir Oltean authored Apr 12, 2023

Commit a3bb8f52 ("net: mscc: ocelot: remove unnecessary exposure of
stats structures") made an unnecessary change which was to add a new
line at the end of ocelot_stats.c. Remove it.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Acked-by: Colin Foster <colin.foster@in-advantage.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

93f0f93b

net: mscc: ocelot: debugging print for statistics regions · 07de3265

Vladimir Oltean authored Apr 12, 2023

To make it easier to debug future issues with statistics counters not
getting aggregated properly into regions, like what happened in commit
6acc72a4 ("net: mscc: ocelot: fix stats region batching"), add some
dev_dbg() prints which show the regions that were dynamically
determined.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

07de3265

net: mscc: ocelot: refactor enum ocelot_reg decoding to helper · 40cd07cb

Vladimir Oltean authored Apr 12, 2023

ocelot_io.c duplicates the decoding of an enum ocelot_reg (which holds
an enum ocelot_target in the upper bits and an index into a regmap array
in the lower bits) 4 times.

We'd like to reuse that logic once more, from ocelot.c. In order to do
that, let's consolidate the existing 4 instances into a header
accessible both by ocelot.c as well as by ocelot_io.c.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

40cd07cb

net: mscc: ocelot: strengthen type of "u32 reg" in I/O accessors · 9ecd0579

Vladimir Oltean authored Apr 12, 2023

The "u32 reg" argument that is passed to these functions is not a plain
address, but rather a driver-specific encoding of another enum
ocelot_target target in the upper bits, and an index into the
u32 ocelot->map[target][] array in the lower bits. That encoded value
takes the type "enum ocelot_reg" and is what is passed to these I/O
functions, so let's actually use that to prevent type confusion.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

9ecd0579