Commits · 81ef35e7619a04b902ca457d60594f956154b9f2 · Kirill Smelkov / linux

23 Mar, 2021 13 commits

net: ocelot: call ocelot_netdevice_bridge_join when joining a bridged LAG · 81ef35e7

Vladimir Oltean authored Mar 23, 2021

Similar to the DSA situation, ocelot supports LAG offload but treats
this scenario improperly:

ip link add br0 type bridge
ip link add bond0 type bond
ip link set bond0 master br0
ip link set swp0 master bond0

We do the same thing as we do there, which is to simulate a 'bridge join'
on 'lag join', if we detect that the bonding upper has a bridge upper.

Again, same as DSA, ocelot supports software fallback for LAG, and in
that case, we should avoid calling ocelot_netdevice_changeupper.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

81ef35e7

net: dsa: sync up switchdev objects and port attributes when joining the bridge · 010e269f

Vladimir Oltean authored Mar 23, 2021

If we join an already-created bridge port, such as a bond master
interface, then we can miss the initial switchdev notifications emitted
by the bridge for this port, while it wasn't offloaded by anybody.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

010e269f

net: dsa: inherit the actual bridge port flags at join time · 5961d6a1

Vladimir Oltean authored Mar 23, 2021

DSA currently assumes that the bridge port starts off with this
constellation of bridge port flags:

- learning on
- unicast flooding on
- multicast flooding on
- broadcast flooding on

just by virtue of code copy-pasta from the bridge layer (new_nbp).
This was a simple enough strategy thus far, because the 'bridge join'
moment always coincided with the 'bridge port creation' moment.

But with sandwiched interfaces, such as:

 br0
  |
bond0
  |
 swp0

it may happen that the user has had time to change the bridge port flags
of bond0 before enslaving swp0 to it. In that case, swp0 will falsely
assume that the bridge port flags are those determined by new_nbp, when
in fact this can happen:

ip link add br0 type bridge
ip link add bond0 type bond
ip link set bond0 master br0
ip link set bond0 type bridge_slave learning off
ip link set swp0 master br0

Now swp0 has learning enabled, bond0 has learning disabled. Not nice.

Fix this by "dumpster diving" through the actual bridge port flags with
br_port_flag_is_set, at bridge join time.

We use this opportunity to split dsa_port_change_brport_flags into two
distinct functions called dsa_port_inherit_brport_flags and
dsa_port_clear_brport_flags, now that the implementation for the two
cases is no longer similar. This patch also creates two functions called
dsa_port_switchdev_sync and dsa_port_switchdev_unsync which collect what
we have so far, even if that's asymmetrical. More is going to be added
in the next patch.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

5961d6a1

net: dsa: pass extack to dsa_port_{bridge,lag}_join · 2afc526a

Vladimir Oltean authored Mar 23, 2021

This is a pretty noisy change that was broken out of the larger change
for replaying switchdev attributes and objects at bridge join time,
which is when these extack objects are actually used.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Tobias Waldekranz <tobias@waldekranz.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

2afc526a

net: dsa: call dsa_port_bridge_join when joining a LAG that is already in a bridge · 185c9a76

Vladimir Oltean authored Mar 23, 2021

DSA can properly detect and offload this sequence of operations:

ip link add br0 type bridge
ip link add bond0 type bond
ip link set swp0 master bond0
ip link set bond0 master br0

But not this one:

ip link add br0 type bridge
ip link add bond0 type bond
ip link set bond0 master br0
ip link set swp0 master bond0

Actually the second one is more complicated, due to the elapsed time
between the enslavement of bond0 and the offloading of it via swp0, a
lot of things could have happened to the bond0 bridge port in terms of
switchdev objects (host MDBs, VLANs, altered STP state etc). So this is
a bit of a can of worms, and making sure that the DSA port's state is in
sync with this already existing bridge port is handled in the next
patches.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Tobias Waldekranz <tobias@waldekranz.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

185c9a76

net: bridge: add helper to replay VLANs installed on port · 22f67cdf

Vladimir Oltean authored Mar 23, 2021

Currently this simple setup with DSA:

ip link add br0 type bridge vlan_filtering 1
ip link add bond0 type bond
ip link set bond0 master br0
ip link set swp0 master bond0

will not work because the bridge has created the PVID in br_add_if ->
nbp_vlan_init, and it has notified switchdev of the existence of VLAN 1,
but that was too early, since swp0 was not yet a lower of bond0, so it
had no reason to act upon that notification.

We need a helper in the bridge to replay the switchdev VLAN objects that
were notified since the bridge port creation, because some of them may
have been missed.

As opposed to the br_mdb_replay function, the vg->vlan_list write side
protection is offered by the rtnl_mutex which is sleepable, so we don't
need to queue up the objects in atomic context, we can replay them right
away.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

22f67cdf

net: bridge: add helper to replay port and local fdb entries · 04846f90

Vladimir Oltean authored Mar 23, 2021

When a switchdev port starts offloading a LAG that is already in a
bridge and has an FDB entry pointing to it:

ip link set bond0 master br0
bridge fdb add dev bond0 00:01:02:03:04:05 master static
ip link set swp0 master bond0

the switchdev driver will have no idea that this FDB entry is there,
because it missed the switchdev event emitted at its creation.

Ido Schimmel pointed this out during a discussion about challenges with
switchdev offloading of stacked interfaces between the physical port and
the bridge, and recommended to just catch that condition and deny the
CHANGEUPPER event:
https://lore.kernel.org/netdev/20210210105949.GB287766@shredder.lan/

But in fact, we might need to deal with the hard thing anyway, which is
to replay all FDB addresses relevant to this port, because it isn't just
static FDB entries, but also local addresses (ones that are not
forwarded but terminated by the bridge). There, we can't just say 'oh
yeah, there was an upper already so I'm not joining that'.

So, similar to the logic for replaying MDB entries, add a function that
must be called by individual switchdev drivers and replays local FDB
entries as well as ones pointing towards a bridge port. This time, we
use the atomic switchdev notifier block, since that's what FDB entries
expect for some reason.
Reported-by: Ido Schimmel <idosch@idosch.org>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

04846f90

net: bridge: add helper to replay port and host-joined mdb entries · 4f2673b3

Vladimir Oltean authored Mar 23, 2021

I have a system with DSA ports, and udhcpcd is configured to bring
interfaces up as soon as they are created.

I create a bridge as follows:

ip link add br0 type bridge

As soon as I create the bridge and udhcpcd brings it up, I also have
avahi which automatically starts sending IPv6 packets to advertise some
local services, and because of that, the br0 bridge joins the following
IPv6 groups due to the code path detailed below:

33:33:ff:6d:c1:9c vid 0
33:33:00:00:00:6a vid 0
33:33:00:00:00:fb vid 0

br_dev_xmit
-> br_multicast_rcv
   -> br_ip6_multicast_add_group
      -> __br_multicast_add_group
         -> br_multicast_host_join
            -> br_mdb_notify

This is all fine, but inside br_mdb_notify we have br_mdb_switchdev_host
hooked up, and switchdev will attempt to offload the host joined groups
to an empty list of ports. Of course nobody offloads them.

Then when we add a port to br0:

ip link set swp0 master br0

the bridge doesn't replay the host-joined MDB entries from br_add_if,
and eventually the host joined addresses expire, and a switchdev
notification for deleting it is emitted, but surprise, the original
addition was already completely missed.

The strategy to address this problem is to replay the MDB entries (both
the port ones and the host joined ones) when the new port joins the
bridge, similar to what vxlan_fdb_replay does (in that case, its FDB can
be populated and only then attached to a bridge that you offload).
However there are 2 possibilities: the addresses can be 'pushed' by the
bridge into the port, or the port can 'pull' them from the bridge.

Considering that in the general case, the new port can be really late to
the party, and there may have been many other switchdev ports that
already received the initial notification, we would like to avoid
delivering duplicate events to them, since they might misbehave. And
currently, the bridge calls the entire switchdev notifier chain, whereas
for replaying it should just call the notifier block of the new guy.
But the bridge doesn't know what is the new guy's notifier block, it
just knows where the switchdev notifier chain is. So for simplification,
we make this a driver-initiated pull for now, and the notifier block is
passed as an argument.

To emulate the calling context for mdb objects (deferred and put on the
blocking notifier chain), we must iterate under RCU protection through
the bridge's mdb entries, queue them, and only call them once we're out
of the RCU read-side critical section.

There was some opportunity for reuse between br_mdb_switchdev_host_port,
br_mdb_notify and the newly added br_mdb_queue_one in how the switchdev
mdb object is created, so a helper was created.
Suggested-by: Ido Schimmel <idosch@idosch.org>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

4f2673b3

net: bridge: add helper to retrieve the current ageing time · f1d42ea1

Vladimir Oltean authored Mar 23, 2021

The SWITCHDEV_ATTR_ID_BRIDGE_AGEING_TIME attribute is only emitted from:

sysfs/ioctl/netlink
-> br_set_ageing_time
   -> __set_ageing_time

therefore not at bridge port creation time, so:
(a) switchdev drivers have to hardcode the initial value for the address
    ageing time, because they didn't get any notification
(b) that hardcoded value can be out of sync, if the user changes the
    ageing time before enslaving the port to the bridge

We need a helper in the bridge, such that switchdev drivers can query
the current value of the bridge ageing time when they start offloading
it.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Tobias Waldekranz <tobias@waldekranz.com>
Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

f1d42ea1

net: bridge: add helper for retrieving the current bridge port STP state · c0e715bb

Vladimir Oltean authored Mar 23, 2021

It may happen that we have the following topology with DSA or any other
switchdev driver with LAG offload:

ip link add br0 type bridge stp_state 1
ip link add bond0 type bond
ip link set bond0 master br0
ip link set swp0 master bond0
ip link set swp1 master bond0

STP decides that it should put bond0 into the BLOCKING state, and
that's that. The ports that are actively listening for the switchdev
port attributes emitted for the bond0 bridge port (because they are
offloading it) and have the honor of seeing that switchdev port
attribute can react to it, so we can program swp0 and swp1 into the
BLOCKING state.

But if then we do:

ip link set swp2 master bond0

then as far as the bridge is concerned, nothing has changed: it still
has one bridge port. But this new bridge port will not see any STP state
change notification and will remain FORWARDING, which is how the
standalone code leaves it in.

We need a function in the bridge driver which retrieves the current STP
state, such that drivers can synchronize to it when they may have missed
switchdev events.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Tobias Waldekranz <tobias@waldekranz.com>
Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

c0e715bb

net: lapb: Make "lapb_t1timer_running" able to detect an already running timer · 65d2dbb3

Xie He authored Mar 21, 2021

Problem:

The "lapb_t1timer_running" function in "lapb_timer.c" is used in only
one place: in the "lapb_kick" function in "lapb_out.c". "lapb_kick" calls
"lapb_t1timer_running" to check if the timer is already pending, and if
it is not, schedule it to run.

However, if the timer has already fired and is running, and is waiting to
get the "lapb->lock" lock, "lapb_t1timer_running" will not detect this,
and "lapb_kick" will then schedule a new timer. The old timer will then
abort when it sees a new timer pending.

I think this is not right. The purpose of "lapb_kick" should be ensuring
that the actual work of the timer function is scheduled to be done.
If the timer function is already running but waiting for the lock,
"lapb_kick" should not abort and reschedule it.

Changes made:

I added a new field "t1timer_running" in "struct lapb_cb" for
"lapb_t1timer_running" to use. "t1timer_running" will accurately reflect
whether the actual work of the timer is pending. If the timer has fired
but is still waiting for the lock, "t1timer_running" will still correctly
reflect whether the actual work is waiting to be done.

The old "t1timer_stop" field, whose only responsibility is to ask a timer
(that is already running but waiting for the lock) to abort, is no longer
needed, because the new "t1timer_running" field can fully take over its
responsibility. Therefore "t1timer_stop" is deleted.

"t1timer_running" is not simply a negation of the old "t1timer_stop".
At the end of the timer function, if it does not reschedule itself,
"t1timer_running" is set to false to indicate that the timer is stopped.

For consistency of the code, I also added "t2timer_running" and deleted
"t2timer_stop".
Signed-off-by: Xie He <xie.he.0141@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

65d2dbb3

net: dsa: hellcreek: Report switch name and ID · 1ab568e9

Kurt Kanzenbach authored Mar 22, 2021

Report the driver name, ASIC ID and the switch name via devlink. This is a
useful information for user space tooling.
Signed-off-by: Kurt Kanzenbach <kurt@kmk-computers.de>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

1ab568e9

Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next · 9a255a06

David S. Miller authored Mar 22, 2021

Pablo Neira Ayuso says:

====================
Netfilter updates for net-next

The following batch contains Netfilter updates for net-next:

1) Split flowtable workqueues per events, from Oz Shlomo.

2) fall-through warnings for clang, from Gustavo A. R. Silva

3) Remove unused declaration in conntrack, from YueHaibing.

4) Consolidate skb_try_make_writable() in flowtable datapath,
   simplify some of the existing codebase.

5) Call dst_check() to fall back to static classic forwarding path.

6) Update table flags from commit phase.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

9a255a06

22 Mar, 2021 27 commits

net: set initial device refcount to 1 · add2d736

Eric Dumazet authored Mar 22, 2021

When adding CONFIG_PCPU_DEV_REFCNT, I forgot that the
initial net device refcount was 0.

When CONFIG_PCPU_DEV_REFCNT is not set, this means
the first dev_hold() triggers an illegal refcount
operation (addition on 0)

refcount_t: addition on 0; use-after-free.
WARNING: CPU: 0 PID: 1 at lib/refcount.c:25 refcount_warn_saturate+0x128/0x1a4

Fix is to change initial (and final) refcount to be 1.

Also add a missing kerneldoc piece, as reported by
Stephen Rothwell.

Fixes: 919067cc ("net: add CONFIG_PCPU_DEV_REFCNT")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Guenter Roeck <groeck@google.com>
Tested-by: Guenter Roeck <groeck@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

add2d736

Merge branch 'dpaa2-switch-offload-port-flags' · 0ca99c84

David S. Miller authored Mar 22, 2021

Ioana Ciornei says:

====================
dpaa2-switch: offload bridge port flags to device

Add support for offloading bridge port flags to the switch. With this
patch set, the learning, broadcast flooding and unknown ucast/mcast
flooding states will be user configurable.

Apart from that, the last patch is a small fix that configures the
offload_fwd_mark if the switch port is under a bridge or not.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

0ca99c84

dpaa2-switch: mark skbs with offload_fwd_mark · b175dfd7

Ioana Ciornei authored Mar 22, 2021

If a switch port is under a bridge, the offload_fwd_mark should be setup
before sending the skb towards the stack so that the bridge does not try
to flood the packet on the other switch ports.
Signed-off-by: Ioana Ciornei <ioana.ciornei@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

b175dfd7

dpaa2-switch: add support for configuring per port unknown flooding · 6253d5e3

Ioana Ciornei authored Mar 22, 2021

Add support for configuring per port unknown flooding by accepting both
BR_FLOOD and BR_MCAST_FLOOD as offloadable bridge port flags.

The DPAA2 switch does not support at the moment configuration of unknown
multicast flooding independently of unknown unicast flooding, therefore
check that both BR_FLOOD and BR_MCAST_FLOOD have the same state.
Signed-off-by: Ioana Ciornei <ioana.ciornei@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

6253d5e3

dpaa2-switch: add support for configuring per port broadcast flooding · b54eb093

Ioana Ciornei authored Mar 22, 2021

The BR_BCAST_FLOOD bridge port flag is now accepted by the driver and a
change in its state will determine a reconfiguration of the broadcast
egress flooding list on the FDB associated with the port.
Signed-off-by: Ioana Ciornei <ioana.ciornei@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

b54eb093

dpaa2-switch: add support for configuring learning state per port · 1e7cbabf

Ioana Ciornei authored Mar 22, 2021

Add support for configuring the learning state of a switch port.
When the user requests the HW learning to be disabled, a fast-age
procedure on that specific port is run so that previously learnt
addresses do not linger.

At device probe as well as on a bridge leave action, the ports are
configured with HW learning disabled since they are basically a
standalone port.

At the same time, at bridge join we inherit the bridge port BR_LEARNING
flag state and configure it on the switch port.

There were already some MC firmware ABI functions for changing the
learning state, but those were per FDB (bridging domain) and not per
port so we need to adjust those to use the new MC fw command which is
per port.
Signed-off-by: Ioana Ciornei <ioana.ciornei@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

1e7cbabf

dpaa2-switch: refactor the egress flooding domain setup · f054e3e2

Ioana Ciornei authored Mar 22, 2021

Extract the code that determines the list of egress flood interfaces for
a specific flood type into a new function -
dpaa2_switch_fdb_get_flood_cfg().

This will help us to not duplicate code when the broadcast and
unknown ucast/mcast flooding domains will be individually configurable.
Signed-off-by: Ioana Ciornei <ioana.ciornei@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

f054e3e2

dpaa2-switch: move the dpaa2_switch_fdb_set_egress_flood function · c7e856c8

Ioana Ciornei authored Mar 22, 2021

In order to avoid a forward declaration in the next patches, move the
dpaa2_switch_fdb_set_egress_flood() function to the top of the file.
Signed-off-by: Ioana Ciornei <ioana.ciornei@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

c7e856c8

Merge branch 'lantiq-xrx300-xrx330' · 3adffc76

David S. Miller authored Mar 22, 2021

Aleksander Jan Bajkowski says:

====================
net: dsa: lantiq: add support for xRX300 and xRX330

Changed since v3:
	* fixed last compilation warning

Changed since v2:
	* fixed compilation warnings
	* removed example bindings for xrx330
	* patches has been refactored due to upstream changes

Changed since v1:
	* gswip_mii_mask_cfg() can now change port 3 on xRX330
	* changed alowed modes on port 0 and 5 for xRX300 and xRX330
	* moved common part of phylink validation into gswip_phylink_set_capab()
	* verify the compatible string against the hardware
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

3adffc76

dt-bindings: net: dsa: lantiq: add xRx300 and xRX330 switch bindings · ee83d824

Aleksander Jan Bajkowski authored Mar 22, 2021

Add compatible string for xRX300 and xRX330 SoCs.
Signed-off-by: Aleksander Jan Bajkowski <olek2@wp.pl>
Signed-off-by: David S. Miller <davem@davemloft.net>

ee83d824

net: dsa: lantiq: verify compatible strings against hardware · 204c7614

Aleksander Jan Bajkowski authored Mar 22, 2021

Verify compatible string against hardware.
Signed-off-by: Aleksander Jan Bajkowski <olek2@wp.pl>
Acked-by: Hauke Mehrtens <hauke@hauke-m.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

204c7614

net: dsa: lantiq: allow to use all GPHYs on xRX300 and xRX330 · a09d042b

Aleksander Jan Bajkowski authored Mar 22, 2021

This patch allows to use all PHYs on GRX300 and GRX330. The ARX300
has 3 and the GRX330 has 4 integrated PHYs connected to different
ports compared to VRX200. Each integrated PHY can work as single
Gigabit Ethernet PHY (GMII) or as double Fast Ethernet PHY (MII).

Allowed port configurations:

xRX200:
GMAC0: RGMII, MII, REVMII or RMII port
GMAC1: RGMII, MII, REVMII or RMII port
GMAC2: GPHY0 (GMII)
GMAC3: GPHY0 (MII)
GMAC4: GPHY1 (GMII)
GMAC5: GPHY1 (MII) or RGMII port

xRX300:
GMAC0: RGMII port
GMAC1: GPHY2 (GMII)
GMAC2: GPHY0 (GMII)
GMAC3: GPHY0 (MII)
GMAC4: GPHY1 (GMII)
GMAC5: GPHY1 (MII) or RGMII port

xRX330:
GMAC0: RGMII, GMII or RMII port
GMAC1: GPHY2 (GMII)
GMAC2: GPHY0 (GMII)
GMAC3: GPHY0 (MII) or GPHY3 (GMII)
GMAC4: GPHY1 (GMII)
GMAC5: GPHY1 (MII), RGMII or RMII port

Tested on D-Link DWR966 (xRX330) with OpenWRT.
Signed-off-by: Aleksander Jan Bajkowski <olek2@wp.pl>
Acked-by: Hauke Mehrtens <hauke@hauke-m.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

a09d042b

Merge branch '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue · 853b0df9

David S. Miller authored Mar 22, 2021

Tony Nguyen says:

====================
100GbE Intel Wired LAN Driver Updates 2021-03-22

This series contains updates to ice and iavf drivers.

Haiyue Wang says:

The Intel E810 Series supports a programmable pipeline for a domain
specific protocols classification, for example GTP by Dynamic Device
Personalization (DDP) profile.

The E810 PF has introduced flex-bytes support by ethtool user-def option
allowing for packet deeper matching based on an offset and value for DDP
usage.

For making VF also benefit from this flexible protocol classification,
some new virtchnl messages are defined and handled by PF, so VF can
query this new flow director capability, and use ethtool with extending
the user-def option to configure Rx flow classification.

The new user-def 0xAAAABBBBCCCCDDDD: BBBB is the 2 byte pattern while
AAAA corresponds to its offset in the packet. Similarly DDDD is the 2
byte pattern with CCCC being the corresponding offset. The offset ranges
from 0x0 to 0x1F7 (up to 504 bytes into the packet). The offset starts
from the beginning of the packet.

This feature can be used to allow customers to set flow director rules
for protocols headers that are beyond standard ones supported by
ethtool (e.g. PFCP or GTP-U).

Like for matching GTP-U's TEID value 0x10203040:
ethtool -N ens787f0v0 flow-type udp4 dst-port 2152 \
    user-def 0x002e102000303040 action 13
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

853b0df9

Merge branch 'mlxsw-resil-nexthop-groups-prep' · ec8136cd

David S. Miller authored Mar 22, 2021

Ido Schimmel says:

====================
mlxsw: Preparations for resilient nexthop groups

This patchset contains preparations for resilient nexthop groups support in
mlxsw. A follow-up patchset will add support and selftests. Most of the
patches are trivial and small to make review easier.

Patchset overview:

Patch #1 removes RTNL assertion in nexthop notifier block since it is
not needed. The assertion will trigger when mlxsw starts processing
notifications related to resilient groups as not all are emitted with
RTNL held.

Patches #2-#9 gradually add support for nexthops with trap action. Up
until now mlxsw did not program nexthops whose neighbour entry was not
resolved. This will not work with resilient groups as their size is
fixed and the nexthop mapped to each bucket is determined by the nexthop
code. Therefore, nexthops whose neighbour entry is not resolved will be
programmed to trap packets to the CPU in order to trigger neighbour
resolution.

Patch #10 is a non-functional change to allow for code reuse between
regular nexthop groups and resilient ones.

Patch #11 avoids unnecessary neighbour updates in hardware. See the
commit message for a detailed explanation.

Patches #12-#14 add support for additional nexthop group sizes that are
supported by Spectrum-{2,3} ASICs.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

ec8136cd

mlxsw: spectrum_router: Add Spectrum-{2, 3} adjacency group size ranges · ea037b23

Ido Schimmel authored Mar 22, 2021

Spectrum-{2,3} support different adjacency group size ranges compared to
Spectrum-1. Add an array describing these ranges and change the common
code to use the array which was set during the per-ASIC initialization.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ea037b23

mlxsw: spectrum_router: Encode adjacency group size ranges in an array · 164fa130

Ido Schimmel authored Mar 22, 2021

The device supports a fixed set of adjacency group sizes. Encode these
sizes in an array, so that the next patch will be able to split it
between Spectrum-1 and Spectrum-{2,3}, which support different size
ranges.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

164fa130

mlxsw: spectrum_router: Create per-ASIC router operations · d354fdd9

Ido Schimmel authored Mar 22, 2021

There are several differences in the router module between Spectrum-1
and Spectrum-{2,3}. Currently, this is only apparent in the router
interface (RIF) operations that are split between these ASICs.

A subsequent patch is going to introduce another difference between
these ASICs.

Create per-ASIC router operations that will encapsulate all these
differences. For now, these operations are only used to set the per-ASIC
RIF operations in 'mlxsw_sp->router->rif_ops_arr'. Note that this fields
was unused since commit 1f5b2303 ("mlxsw: spectrum: Set RIF ops per
ASIC type").
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

d354fdd9

mlxsw: spectrum_router: Avoid unnecessary neighbour updates · c1efd500

Ido Schimmel authored Mar 22, 2021

Avoid updating neighbour and adjacency entries in hardware when the
neighbour is already connected and its MAC address did not change. This
can happen, for example, when neighbour transitions between valid states
such as 'NUD_REACHABLE' and 'NUD_DELAY'.

This is especially important for resilient hashing as these updates will
result in adjacency entries being marked as active.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

c1efd500

mlxsw: spectrum_router: Break nexthop group entry validation to a separate function · 40f5429f

Ido Schimmel authored Mar 22, 2021

The validation of a nexthop group entry is also necessary for resilient
nexthop groups, so break the validation to a separate function to allow
for code reuse in subsequent patches.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

40f5429f

mlxsw: spectrum_router: Encapsulate nexthop update in a function · 29017c64

Ido Schimmel authored Mar 22, 2021

Encapsulate this functionality in a separate function, so that it could
be invoked by follow-up patches, when replacing a nexthop bucket that is
part of a resilient nexthop group.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

29017c64

mlxsw: spectrum_router: Rename nexthop update function to reflect its type · 424603cc

Ido Schimmel authored Mar 22, 2021

mlxsw_sp_nexthop_update() is used to update the configuration of
Ethernet-type nexthops, as opposed to mlxsw_sp_nexthop_ipip_update(),
which is used to update IPinIP-type nexthops.

Rename the function to mlxsw_sp_nexthop_eth_update(), so that it is
consistent with mlxsw_sp_nexthop_ipip_update().

It will allow us to introduce mlxsw_sp_nexthop_update() in a follow-up
patch, which calls either of above mentioned function based on the
nexthop's type.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

424603cc

mlxsw: spectrum_router: Add nexthop trap action support · fc199d7c

Ido Schimmel authored Mar 22, 2021

Currently, nexthops are programmed with either forward or discard action
(for blackhole nexthops). Nexthops that do not have a valid MAC address
(neighbour) or router interface (RIF) are simply not written to the
adjacency table.

In resilient nexthop groups, the size of the group must remain fixed and
the kernel is in complete control of the layout of the adjacency table.
A nexthop without a valid MAC or RIF will therefore be written with a
trap action, to trigger neighbour resolution.

Allow such nexthops to be programmed to the adjacency table to enable
above mentioned use case.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

fc199d7c

mlxsw: spectrum_router: Prepare for nexthops with trap action · 1be2361e

Ido Schimmel authored Mar 22, 2021

Nexthops that need to be programmed with a trap action might not have a
valid router interface (RIF) associated with them. Therefore, use the
loopback RIF created during initialization to program them to the
device.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

1be2361e

mlxsw: spectrum_router: Introduce nexthop action field · 031d5c16

Ido Schimmel authored Mar 22, 2021

Currently, the action associated with the nexthop is assumed to be
'forward' unless the 'discard' bit is set.

Instead, simplify this by introducing a dedicated field to represent the
action of the nexthop. This will allow us to more easily introduce more
actions, such as trap.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

031d5c16

mlxsw: spectrum_router: Adjust comments on nexthop fields · 248136fa

Ido Schimmel authored Mar 22, 2021

The comments assume that nexthops are simple Ethernet nexthops
that are programmed to forward packets to the associated neighbour. This
is no longer the case, as both IPinIP and blackhole nexthops are now
supported.

Adjust the comments to reflect these changes.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

248136fa

mlxsw: spectrum_router: Only provide MAC address for valid nexthops · c6a5011b

Ido Schimmel authored Mar 22, 2021

The helper returns the MAC address associated with the nexthop. It is
only valid when the nexthop forwards packets and when it is an Ethernet
nexthop. Reflect this in the checks the helper is performing.

This is not an issue because the sole caller of the function only
invokes it for such nexthops.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

c6a5011b

mlxsw: spectrum_router: Consolidate nexthop helpers · 26df5acc

Ido Schimmel authored Mar 22, 2021

The helper mlxsw_sp_nexthop_offload() is actually interested in finding
out if the nexthop is both written to the adjacency table and forwarding
packets (as opposed to discarding them).

Rename it to mlxsw_sp_nexthop_is_forward() and remove
mlxsw_sp_nexthop_is_discard().
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

26df5acc