Commits · ec8cb4f617a23700d37018d249e3b05149d44a38 · Kirill Smelkov / linux

12 May, 2022 40 commits

net: selftests: Stress reuseport listen · ec8cb4f6

Martin KaFai Lau authored May 11, 2022

This patch adds a test that has 300 VIPs listening on port 443.
Each VIP:443 will have 80 listening socks by using SO_REUSEPORT.
Thus, it will have 24000 listening socks.

Before removing the port only listening_hash, all socks will be in the
same port 443 bucket and inet_reuseport_add_sock() spends much time to
walk through the bucket.  After removing the port only listening_hash
and move all usage to the port+addr lhash2, each bucket in the
ideal case has 80 sk which is much smaller than before.

Here is the test result from a qemu:
Before: listen 24000 socks took 210.210485362 (~210s)
 After: listen 24000 socks took 0.207173      (~210ms)
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ec8cb4f6

net: inet: Retire port only listening_hash · cae3873c

Martin KaFai Lau authored May 11, 2022

The listen sk is currently stored in two hash tables,
listening_hash (hashed by port) and lhash2 (hashed by port and address).

After commit 0ee58dad ("net: tcp6: prefer listeners bound to an address")
and commit d9fbc7f6 ("net: tcp: prefer listeners bound to an address"),
the TCP-SYN lookup fast path does not use listening_hash.

The commit 05c0b357 ("tcp: seq_file: Replace listening_hash with lhash2")
also moved the seq_file (/proc/net/tcp) iteration usage from
listening_hash to lhash2.

There are still a few listening_hash usages left.
One of them is inet_reuseport_add_sock() which uses the listening_hash
to search a listen sk during the listen() system call.  This turns
out to be very slow on use cases that listen on many different
VIPs at a popular port (e.g. 443).  [ On top of the slowness in
adding to the tail in the IPv6 case ].  The latter patch has a
selftest to demonstrate this case.

This patch takes this chance to move all remaining listening_hash
usages to lhash2 and then retire listening_hash.

Since most changes need to be done together, it is hard to cut
the listening_hash to lhash2 switch into small patches.  The
changes in this patch is highlighted here for the review
purpose.

1. Because of the listening_hash removal, lhash2 can use the
   sk->sk_nulls_node instead of the icsk->icsk_listen_portaddr_node.
   This will also keep the sk_unhashed() check to work as is
   after stop adding sk to listening_hash.

   The union is removed from inet_listen_hashbucket because
   only nulls_head is needed.

2. icsk->icsk_listen_portaddr_node and its helpers are removed.

3. The current lhash2 users needs to iterate with sk_nulls_node
   instead of icsk_listen_portaddr_node.

   One case is in the inet[6]_lhash2_lookup().

   Another case is the seq_file iterator in tcp_ipv4.c.
   One thing to note is sk_nulls_next() is needed
   because the old inet_lhash2_for_each_icsk_continue()
   does a "next" first before iterating.

4. Move the remaining listening_hash usage to lhash2

   inet_reuseport_add_sock() which this series is
   trying to improve.

   inet_diag.c and mptcp_diag.c are the final two
   remaining use cases and is moved to lhash2 now also.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

cae3873c

net: inet: Open code inet_hash2 and inet_unhash2 · e8d00590

Martin KaFai Lau authored May 11, 2022

This patch folds lhash2 related functions into __inet_hash and
inet_unhash.  This will make the removal of the listening_hash
in a latter patch easier to review.

First, this patch folds inet_hash2 into __inet_hash.

For unhash, the current call sequence is like
inet_unhash() => __inet_unhash() => inet_unhash2().
The specific testing cases in __inet_unhash() are mostly related
to TCP_LISTEN sk and its caller inet_unhash() already has
the TCP_LISTEN test, so this patch folds both __inet_unhash() and
inet_unhash2() into inet_unhash().

Note that all listening_hash users also have lhash2 initialized,
so the !h->lhash2 check is no longer needed.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

e8d00590

net: inet: Remove count from inet_listen_hashbucket · 8ea1eebb

Martin KaFai Lau authored May 11, 2022

After commit 0ee58dad ("net: tcp6: prefer listeners bound to an address")
and commit d9fbc7f6 ("net: tcp: prefer listeners bound to an address"),
the count is no longer used.  This patch removes it.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

8ea1eebb

Merge branch 'make-sfc-siena-ko-specific-to-siena' · 0c1822d9

Jakub Kicinski authored May 12, 2022

Martin Habets says:

====================
Make sfc-siena.ko specific to Siena

This series is a follow-up to the one titled "Move Siena into
a separate subdirectory".
It enhances the new sfc-siena.ko module to differentiate it from sfc.ko.

	Patches

Patches 1-5 create separate Kconfig options for Siena, and adjusts the
various names used for work items and directories.
Patch 6 reinstates SRIOV functionality in sfc-siena.ko.

	Testing

Various build tests were done such as allyesconfig, W=1 and sparse.
The new sfc-siena.ko and sfc.ko modules were tested on a machine with NICs
for both modules in them.
Inserting the updated sfc.ko and the new sfc-siena.ko modules at the same
time works, and no work items and directories exist with the same name.
====================

Link: https://lore.kernel.org/r/165228589518.696.7119477411428288875.stgit@palantir17.mph.netSigned-off-by: Jakub Kicinski <kuba@kernel.org>

0c1822d9

sfc/siena: Reinstate SRIOV init/fini function calls · c3743039

Martin Habets authored May 11, 2022

They were removed in the first series since they were not used for EF10.
Put that code back for Siena, with the prototypes in siena_sriov.h
since that file is a more applicable place for it.
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Martin Habets <habetsm.xilinx@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

c3743039

sfc/siena: Make PTP and reset support specific for Siena · ef9b5770

Martin Habets authored May 11, 2022

Change the clock name and work queue names to differentiate them from
the names used in sfc.ko.
Signed-off-by: Martin Habets <habetsm.xilinx@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ef9b5770

sfc/siena: Make MCDI logging support specific for Siena · 58b6b3d5

Martin Habets authored May 11, 2022

Add a Siena Kconfig option and use it in stead of the sfc one.
Rename the internal variable for the 'mcdi_logging_default' module
parameter to avoid a naming conflict with the one in sfc.ko.
Signed-off-by: Martin Habets <habetsm.xilinx@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

58b6b3d5

siena: Make HWMON support specific for Siena · f62a0745

Martin Habets authored May 11, 2022

Add a Siena Kconfig option and use it in stead of the sfc one.
Signed-off-by: Martin Habets <habetsm.xilinx@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

f62a0745

siena: Make SRIOV support specific for Siena · dfb1cfbd

Martin Habets authored May 11, 2022

Add a Siena Kconfig option and use it in stead of the sfc one.
Signed-off-by: Martin Habets <habetsm.xilinx@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

dfb1cfbd

siena: Make MTD support specific for Siena · 65d4b471

Martin Habets authored May 11, 2022

Add a Siena Kconfig option and use it in stead of the sfc one.
Signed-off-by: Martin Habets <habetsm.xilinx@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

65d4b471

Merge branch 'restructure-struct-ocelot_port' · 75db72de

Jakub Kicinski authored May 12, 2022

Vladimir Oltean says:

====================
Restructure struct ocelot_port

This patch set represents preparation for further work. It adds an
"index" field to struct ocelot_port, and populates it from the Felix DSA
driver and Ocelot switchdev driver.

The users of struct ocelot_port :: index are the same users as those of
struct ocelot_port_private :: chip_port.
====================

Link: https://lore.kernel.org/r/20220511100637.568950-1-vladimir.oltean@nxp.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

75db72de

net: mscc: ocelot: move ocelot_port_private :: chip_port to ocelot_port :: index · 7e708760

Vladimir Oltean authored May 11, 2022

Currently the ocelot switch lib is unaware of the index of a struct
ocelot_port, since that is kept in the encapsulating structures of outer
drivers (struct dsa_port :: index, struct ocelot_port_private :: chip_port).

With the upcoming increase in complexity associated with assigning DSA
tag_8021q CPU ports to certain user ports, it becomes necessary for the
switch lib to be able to retrieve the index of a certain ocelot_port.

Therefore, introduce a new u8 to ocelot_port (same size as the chip_port
used by the ocelot switchdev driver) and rework the existing code to
populate and use it.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

7e708760

net: mscc: ocelot: minimize holes in struct ocelot_port · 6d0be600

Vladimir Oltean authored May 11, 2022

Reorder members of struct ocelot_port to eliminate holes and reduce
structure size. Pahole says:

Before:

struct ocelot_port {
        struct ocelot *            ocelot;               /*     0     8 */
        struct regmap *            target;               /*     8     8 */
        bool                       vlan_aware;           /*    16     1 */

        /* XXX 7 bytes hole, try to pack */

        const struct ocelot_bridge_vlan  * pvid_vlan;    /*    24     8 */
        unsigned int               ptp_skbs_in_flight;   /*    32     4 */
        u8                         ptp_cmd;              /*    36     1 */

        /* XXX 3 bytes hole, try to pack */

        struct sk_buff_head        tx_skbs;              /*    40    96 */
        /* --- cacheline 2 boundary (128 bytes) was 8 bytes ago --- */
        u8                         ts_id;                /*   136     1 */

        /* XXX 3 bytes hole, try to pack */

        phy_interface_t            phy_mode;             /*   140     4 */
        bool                       is_dsa_8021q_cpu;     /*   144     1 */
        bool                       learn_ena;            /*   145     1 */

        /* XXX 6 bytes hole, try to pack */

        struct net_device *        bond;                 /*   152     8 */
        bool                       lag_tx_active;        /*   160     1 */

        /* XXX 1 byte hole, try to pack */

        u16                        mrp_ring_id;          /*   162     2 */

        /* XXX 4 bytes hole, try to pack */

        struct net_device *        bridge;               /*   168     8 */
        int                        bridge_num;           /*   176     4 */
        u8                         stp_state;            /*   180     1 */

        /* XXX 3 bytes hole, try to pack */

        int                        speed;                /*   184     4 */

        /* size: 192, cachelines: 3, members: 18 */
        /* sum members: 161, holes: 7, sum holes: 27 */
        /* padding: 4 */
};

After:

struct ocelot_port {
        struct ocelot *            ocelot;               /*     0     8 */
        struct regmap *            target;               /*     8     8 */
        struct net_device *        bond;                 /*    16     8 */
        struct net_device *        bridge;               /*    24     8 */
        const struct ocelot_bridge_vlan  * pvid_vlan;    /*    32     8 */
        phy_interface_t            phy_mode;             /*    40     4 */
        unsigned int               ptp_skbs_in_flight;   /*    44     4 */
        struct sk_buff_head        tx_skbs;              /*    48    96 */
        /* --- cacheline 2 boundary (128 bytes) was 16 bytes ago --- */
        u16                        mrp_ring_id;          /*   144     2 */
        u8                         ptp_cmd;              /*   146     1 */
        u8                         ts_id;                /*   147     1 */
        u8                         stp_state;            /*   148     1 */
        bool                       vlan_aware;           /*   149     1 */
        bool                       is_dsa_8021q_cpu;     /*   150     1 */
        bool                       learn_ena;            /*   151     1 */
        bool                       lag_tx_active;        /*   152     1 */

        /* XXX 3 bytes hole, try to pack */

        int                        bridge_num;           /*   156     4 */
        int                        speed;                /*   160     4 */

        /* size: 168, cachelines: 3, members: 18 */
        /* sum members: 161, holes: 1, sum holes: 3 */
        /* padding: 4 */
        /* last cacheline: 40 bytes */
};
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

6d0be600

net: mscc: ocelot: delete ocelot_port :: xmit_template · 15f6d01e

Vladimir Oltean authored May 11, 2022

This is no longer used since commit 7c4bb540 ("net: dsa: tag_ocelot:
create separate tagger for Seville").
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

15f6d01e

Merge branch 'dsa-changes-for-multiple-cpu-ports-part-1' · 879c610c

Jakub Kicinski authored May 12, 2022

Vladimir Oltean says:

====================
DSA changes for multiple CPU ports (part 1)

I am trying to enable the second internal port pair from the NXP LS1028A
Felix switch for DSA-tagged traffic via "ocelot-8021q". This series
represents part 1 (of an unknown number) of that effort.

It does some preparation work, like managing host flooding in DSA via a
dedicated method, and removing the CPU port as argument from the tagging
protocol change procedure.

In terms of driver-specific changes, it reworks the 2 tag protocol
implementations in the Felix driver to have a structured data format.
It enables host flooding towards all tag_8021q CPU ports. It dynamically
updates the tag_8021q CPU port used for traps. It also fixes a bug
introduced by a previous refactoring/oversimplification commit in
net-next.
====================

Link: https://lore.kernel.org/r/20220511095020.562461-1-vladimir.oltean@nxp.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

879c610c

net: dsa: felix: reimplement tagging protocol change with function pointers · 7a29d220

Vladimir Oltean authored May 11, 2022

The error handling for the current tagging protocol change procedure is
a bit brittle (we dismantle the previous tagging protocol entirely
before setting up the new one). By identifying which parts of a tagging
protocol are unique to itself and which parts are shared with the other,
we can implement a protocol change procedure where error handling is a
bit more robust, because we start setting up the new protocol first, and
tear down the old one only after the setup of the specific and shared
parts succeeded.

The protocol change is a bit too open-coded too, in the area of
migrating host flood settings and MDBs. By identifying what differs
between tagging protocols (the forwarding masks for host flooding) we
can implement a more straightforward migration procedure which is
handled in the shared portion of the protocol change, rather than
individually by each protocol.

Therefore, a more structured approach calls for the introduction of a
structure of function pointers per tagging protocol. This covers setup,
teardown and the host forwarding mask. In the future it will also cover
how to prepare for a new DSA master.

The initial tagging protocol setup (at driver probe time) and the final
teardown (at driver removal time) are also adapted to call into the
structured methods of the specific protocol in current use. This is
especially relevant for teardown, where we previously called
felix_del_tag_protocol() only for the first CPU port. But by not
specifying which CPU port this is for, we gain more flexibility to
support multiple CPU ports in the future.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

7a29d220

net: dsa: felix: dynamically determine tag_8021q CPU port for traps · c352e5e8

Vladimir Oltean authored May 11, 2022

Ocelot switches support a single active CPU port at a time (at least as
a trapping destination, i.e. for control traffic). This is true
regardless of whether we are using the native copy-to-CPU-port-module
functionality, or a redirect action towards the software-defined
tag_8021q CPU port.

Currently we assume that the trapping destination in tag_8021q mode is
the first CPU port, yet in the future we may want to migrate the user
ports to the second CPU port.

For that to work, we need to make sure that the tag_8021q trapping
destination is a CPU port that is active, i.e. is used by at least some
user port on which the trap was added. Otherwise, we may end up
redirecting the traffic to a CPU port which isn't even up.

Note that due to the current design where we simply choose the CPU port
of the first port from the trap's ingress port mask, it may be that a
CPU port absorbes control traffic from user ports which aren't affine to
it as per user space's request. This isn't ideal, but is the lesser of
two evils. Following the user-configured affinity for traps would mean
that we can no longer reuse a single TCAM entry for multiple traps,
which is what we actually do for e.g. PTP. Either we duplicate and
deduplicate TCAM entries on the fly when user-to-CPU-port mappings
change (which is unnecessarily complicated), or we redirect trapped
traffic to all tag_8021q CPU ports if multiple such ports are in use.
The latter would have actually been nice, if it actually worked, but it
doesn't, since a OCELOT_MASK_MODE_REDIRECT action towards multiple ports
would not take PGID_SRC into consideration, and it would just duplicate
the packet towards each (CPU) port, leading to duplicates in software.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

c352e5e8

net: dsa: remove port argument from ->change_tag_protocol() · bacf93b0

Vladimir Oltean authored May 11, 2022

DSA has not supported (and probably will not support in the future
either) independent tagging protocols per CPU port.

Different switch drivers have different requirements, some may need to
replicate some settings for each CPU port, some may need to apply some
settings on a single CPU port, while some may have to configure some
global settings and then some per-CPU-port settings.

In any case, the current model where DSA calls ->change_tag_protocol for
each CPU port turns out to be impractical for drivers where there are
global things to be done. For example, felix calls dsa_tag_8021q_register(),
which makes no sense per CPU port, so it suppresses the second call.

Let drivers deal with replication towards all CPU ports, and remove the
CPU port argument from the function prototype.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Acked-by: Luiz Angelo Daros de Luca <luizluca@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

bacf93b0

net: dsa: felix: manage host flooding using a specific driver callback · 72c3b0c7

Vladimir Oltean authored May 11, 2022

At the time - commit 7569459a ("net: dsa: manage flooding on the CPU
ports") - not introducing a dedicated switch callback for host flooding
made sense, because for the only user, the felix driver, there was
nothing different to do for the CPU port than set the flood flags on the
CPU port just like on any other bridge port.

There are 2 reasons why this approach is not good enough, however.

(1) Other drivers, like sja1105, support configuring flooding as a
    function of {ingress port, egress port}, whereas the DSA
    ->port_bridge_flags() function only operates on an egress port.
    So with that driver we'd have useless host flooding from user ports
    which don't need it.

(2) Even with the felix driver, support for multiple CPU ports makes it
    difficult to piggyback on ->port_bridge_flags(). The way in which
    the felix driver is going to support host-filtered addresses with
    multiple CPU ports is that it will direct these addresses towards
    both CPU ports (in a sort of multicast fashion), then restrict the
    forwarding to only one of the two using the forwarding masks.
    Consequently, flooding will also be enabled towards both CPU ports.
    However, ->port_bridge_flags() gets passed the index of a single CPU
    port, and that leaves the flood settings out of sync between the 2
    CPU ports.

This is to say, it's better to have a specific driver method for host
flooding, which takes the user port as argument. This solves problem (1)
by allowing the driver to do different things for different user ports,
and problem (2) by abstracting the operation and letting the driver do
whatever, rather than explicitly making the DSA core point to the CPU
port it thinks needs to be touched.

This new method also creates a problem, which is that cross-chip setups
are not handled. However I don't have hardware right now where I can
test what is the proper thing to do, and there isn't hardware compatible
with multi-switch trees that supports host flooding. So it remains a
problem to be tackled in the future.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

72c3b0c7

net: dsa: introduce the dsa_cpu_ports() helper · 465c3de4

Vladimir Oltean authored May 11, 2022

Similar to dsa_user_ports() which retrieves a port mask of all user
ports, introduce dsa_cpu_ports() which retrieves the mask of all CPU
ports of a switch.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

465c3de4

net: dsa: felix: bring the NPI port indirection for host flooding to surface · 910ee6cc

Vladimir Oltean authored May 11, 2022

For symmetry with host FDBs and MDBs where the indirection is now
handled outside the ocelot switch lib, do the same for bridge port
flags (unicast/multicast/broadcast flooding).

The only caller of the ocelot switch lib which uses the NPI port is the
Felix DSA driver.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

910ee6cc

net: dsa: felix: bring the NPI port indirection for host MDBs to surface · 0ddf83cd

Vladimir Oltean authored May 11, 2022

For symmetry with host FDBs where the indirection is now handled outside
the ocelot switch lib, do the same for host MDB entries. The only caller
of the ocelot switch lib which uses the NPI port is the Felix DSA driver.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

0ddf83cd

net: dsa: felix: program host FDB entries towards PGID_CPU for tag_8021q too · e9b3ba43

Vladimir Oltean authored May 11, 2022

I remembered why we had the host FDB migration procedure in place.

It is true that host FDB entry migration can be done by changing the
value of PGID_CPU, but the problem is that only host FDB entries learned
while operating in NPI mode go to PGID_CPU. When the CPU port operates
in tag_8021q mode, the FDB entries are learned towards the unicast PGID
equal to the physical port number of this CPU port, bypassing the
PGID_CPU indirection.

So host FDB entries learned in tag_8021q mode are not migrated any
longer towards the NPI port.

Fix this by extracting the NPI port -> PGID_CPU redirection from the
ocelot switch lib, moving it to the Felix DSA driver, and applying it
for any CPU port regardless of its kind (NPI or tag_8021q).

Fixes: a51c1c3f ("net: dsa: felix: stop migrating FDBs back and forth on tag proto change")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

e9b3ba43

net: lan966x: Fix use of pointer after being freed · f0a65f81

Horatiu Vultur authored May 11, 2022

The smatch found the following warning:

drivers/net/ethernet/microchip/lan966x/lan966x_fdma.c:736 lan966x_fdma_reload()
warn: 'rx_dcbs' was already freed.

This issue can happen when changing the MTU on one of the ports and once
the RX buffers are allocated and then the TX buffer allocation fails.
In that case the RX buffers should not be restore. This fix this issue
such that the RX buffers will not be restored if the TX buffers failed
to be allocated.

Fixes: 2ea1cbac ("net: lan966x: Update FDMA to change MTU.")
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Horatiu Vultur <horatiu.vultur@microchip.com>
Link: https://lore.kernel.org/r/20220511204059.2689199-1-horatiu.vultur@microchip.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

f0a65f81

net: update the register_netdevice() kdoc · fa926bb3

Jakub Kicinski authored May 11, 2022

The BUGS section looks quite dated, the registration
is under rtnl lock. Remove some obvious information
while at it.

Link: https://lore.kernel.org/r/20220511190720.1401356-1-kuba@kernel.orgSigned-off-by: Jakub Kicinski <kuba@kernel.org>

fa926bb3

skbuff: replace a BUG_ON() with the new DEBUG_NET_WARN_ON_ONCE() · 0df65743

Jakub Kicinski authored May 11, 2022

Very few drivers actually have Kconfig knobs for adding
-DDEBUG. 8 according to a quick grep, while there are
93 users of skb_checksum_none_assert(). Switch to the
new DEBUG_NET_WARN_ON_ONCE() to catch bad skbs.
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20220511172305.1382810-1-kuba@kernel.orgSigned-off-by: Jakub Kicinski <kuba@kernel.org>

0df65743

mlxbf_gige: remove driver-managed interrupt counts · f4826443

David Thompson authored May 11, 2022

The driver currently has three interrupt counters,
which are incremented every time each interrupt handler
executes.  These driver-managed counters are not
necessary as the kernel already has logic that manages
interrupt counts and exposes them via /proc/interrupts.
This patch removes the driver-managed counters.
Signed-off-by: David Thompson <davthompson@nvidia.com>
Signed-off-by: Asmaa Mnebhi <asmaa@nvidia.com>
Link: https://lore.kernel.org/r/20220511135251.2989-1-davthompson@nvidia.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

f4826443

Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net · 9b19e57a

Jakub Kicinski authored May 12, 2022

No conflicts.

Build issue in drivers/net/ethernet/sfc/ptp.c
  54fccfdd ("sfc: efx_default_channel_type APIs can be static")
  49e6123c ("net: sfc: fix memory leak due to ptp channel")
https://lore.kernel.org/all/20220510130556.52598fe2@canb.auug.org.au/Signed-off-by: Jakub Kicinski <kuba@kernel.org>

9b19e57a

Merge tag 'net-5.18-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net · f3f19f93

Linus Torvalds authored May 12, 2022

Pull networking fixes from Jakub Kicinski:
 "Including fixes from wireless, and bluetooth.

  No outstanding fires.

  Current release - regressions:

   - eth: atlantic: always deep reset on pm op, fix null-deref

  Current release - new code bugs:

   - rds: use maybe_get_net() when acquiring refcount on TCP sockets
     [refinement of a previous fix]

   - eth: ocelot: mark traps with a bool instead of guessing type based
     on list membership

  Previous releases - regressions:

   - net: fix skipping features in for_each_netdev_feature()

   - phy: micrel: fix null-derefs on suspend/resume and probe

   - bcmgenet: check for Wake-on-LAN interrupt probe deferral

  Previous releases - always broken:

   - ipv4: drop dst in multicast routing path, prevent leaks

   - ping: fix address binding wrt vrf

   - net: fix wrong network header length when BPF protocol translation
     is used on skbs with a fraglist

   - bluetooth: fix the creation of hdev->name

   - rfkill: uapi: fix RFKILL_IOCTL_MAX_SIZE ioctl request definition

   - wifi: iwlwifi: iwl-dbg: use del_timer_sync() before freeing

   - wifi: ath11k: reduce the wait time of 11d scan and hw scan while
     adding an interface

   - mac80211: fix rx reordering with non explicit / psmp ack policy

   - mac80211: reset MBSSID parameters upon connection

   - nl80211: fix races in nl80211_set_tx_bitrate_mask()

   - tls: fix context leak on tls_device_down

   - sched: act_pedit: really ensure the skb is writable

   - batman-adv: don't skb_split skbuffs with frag_list

   - eth: ocelot: fix various issues with TC actions (null-deref; bad
     stats; ineffective drops; ineffective filter removal)"

* tag 'net-5.18-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (61 commits)
  tls: Fix context leak on tls_device_down
  net: sfc: ef10: fix memory leak in efx_ef10_mtd_probe()
  net/smc: non blocking recvmsg() return -EAGAIN when no data and signal_pending
  net: dsa: bcm_sf2: Fix Wake-on-LAN with mac_link_down()
  mlxsw: Avoid warning during ip6gre device removal
  net: bcmgenet: Check for Wake-on-LAN interrupt probe deferral
  net: ethernet: mediatek: ppe: fix wrong size passed to memset()
  Bluetooth: Fix the creation of hdev->name
  i40e: i40e_main: fix a missing check on list iterator
  net/sched: act_pedit: really ensure the skb is writable
  s390/lcs: fix variable dereferenced before check
  s390/ctcm: fix potential memory leak
  s390/ctcm: fix variable dereferenced before check
  net: atlantic: verify hw_head_ lies within TX buffer ring
  net: atlantic: add check for MAX_SKB_FRAGS
  net: atlantic: reduce scope of is_rsc_complete
  net: atlantic: fix "frag[0] not initialized"
  net: stmmac: fix missing pci_disable_device() on error in stmmac_pci_probe()
  net: phy: micrel: Fix incorrect variable type in micrel
  decnet: Use container_of() for struct dn_neigh casts
  ...

f3f19f93

Merge branch 'for-5.18-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup · 0ac824f3

Linus Torvalds authored May 12, 2022

Pull cgroup fix from Tejun Heo:
 "Waiman's fix for a cgroup2 cpuset bug where it could miss nodes which
  were hot-added"

* 'for-5.18-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  cgroup/cpuset: Remove cpus_allowed/mems_allowed setup in cpuset_init_smp()

0ac824f3

Merge tag 'fixes_for_v5.18-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs · c37dba6a

Linus Torvalds authored May 12, 2022

Pull fs fixes from Jan Kara:
 "Three fixes that I'd still like to get to 5.18:

   - add a missing sanity check in the fanotify FAN_RENAME feature
     (added in 5.17, let's fix it before it gets wider usage in
     userspace)

   - udf fix for recently introduced filesystem corruption issue

   - writeback fix for a race in inode list handling that can lead to
     delayed writeback and possible dirty throttling stalls"

* tag 'fixes_for_v5.18-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
  udf: Avoid using stale lengthOfImpUse
  writeback: Avoid skipping inode writeback
  fanotify: do not allow setting dirent events in mask of non-dir

c37dba6a

tls: Fix context leak on tls_device_down · 3740651b

Maxim Mikityanskiy authored May 12, 2022

The commit cited below claims to fix a use-after-free condition after
tls_device_down. Apparently, the description wasn't fully accurate. The
context stayed alive, but ctx->netdev became NULL, and the offload was
torn down without a proper fallback, so a bug was present, but a
different kind of bug.

Due to misunderstanding of the issue, the original patch dropped the
refcount_dec_and_test line for the context to avoid the alleged
premature deallocation. That line has to be restored, because it matches
the refcount_inc_not_zero from the same function, otherwise the contexts
that survived tls_device_down are leaked.

This patch fixes the described issue by restoring refcount_dec_and_test.
After this change, there is no leak anymore, and the fallback to
software kTLS still works.

Fixes: c55dcdd4 ("net/tls: Fix use-after-free after the TLS device goes down and up")
Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://lore.kernel.org/r/20220512091830.678684-1-maximmi@nvidia.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

3740651b

net: sfc: ef10: fix memory leak in efx_ef10_mtd_probe() · 1fa89ffb

Taehee Yoo authored May 12, 2022

In the NIC ->probe() callback, ->mtd_probe() callback is called.
If NIC has 2 ports, ->probe() is called twice and ->mtd_probe() too.
In the ->mtd_probe(), which is efx_ef10_mtd_probe() it allocates and
initializes mtd partiion.
But mtd partition for sfc is shared data.
So that allocated mtd partition data from last called
efx_ef10_mtd_probe() will not be used.
Therefore it must be freed.
But it doesn't free a not used mtd partition data in efx_ef10_mtd_probe().

kmemleak reports:
unreferenced object 0xffff88811ddb0000 (size 63168):
  comm "systemd-udevd", pid 265, jiffies 4294681048 (age 348.586s)
  hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  backtrace:
    [<ffffffffa3767749>] kmalloc_order_trace+0x19/0x120
    [<ffffffffa3873f0e>] __kmalloc+0x20e/0x250
    [<ffffffffc041389f>] efx_ef10_mtd_probe+0x11f/0x270 [sfc]
    [<ffffffffc0484c8a>] efx_pci_probe.cold.17+0x3df/0x53d [sfc]
    [<ffffffffa414192c>] local_pci_probe+0xdc/0x170
    [<ffffffffa4145df5>] pci_device_probe+0x235/0x680
    [<ffffffffa443dd52>] really_probe+0x1c2/0x8f0
    [<ffffffffa443e72b>] __driver_probe_device+0x2ab/0x460
    [<ffffffffa443e92a>] driver_probe_device+0x4a/0x120
    [<ffffffffa443f2ae>] __driver_attach+0x16e/0x320
    [<ffffffffa4437a90>] bus_for_each_dev+0x110/0x190
    [<ffffffffa443b75e>] bus_add_driver+0x39e/0x560
    [<ffffffffa4440b1e>] driver_register+0x18e/0x310
    [<ffffffffc02e2055>] 0xffffffffc02e2055
    [<ffffffffa3001af3>] do_one_initcall+0xc3/0x450
    [<ffffffffa33ca574>] do_init_module+0x1b4/0x700
Acked-by: Martin Habets <habetsm.xilinx@gmail.com>
Fixes: 8127d661 ("sfc: Add support for Solarflare SFC9100 family")
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Link: https://lore.kernel.org/r/20220512054709.12513-1-ap420073@gmail.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

1fa89ffb

net/smc: non blocking recvmsg() return -EAGAIN when no data and signal_pending · f3c46e41

Guangguan Wang authored May 12, 2022

Non blocking sendmsg will return -EAGAIN when any signal pending
and no send space left, while non blocking recvmsg return -EINTR
when signal pending and no data received. This may makes confused.
As TCP returns -EAGAIN in the conditions described above. Align the
behavior of smc with TCP.

Fixes: 846e344e ("net/smc: add receive timeout check")
Signed-off-by: Guangguan Wang <guangguan.wang@linux.alibaba.com>
Reviewed-by: Tony Lu <tonylu@linux.alibaba.com>
Acked-by: Karsten Graul <kgraul@linux.ibm.com>
Link: https://lore.kernel.org/r/20220512030820.73848-1-guangguan.wang@linux.alibaba.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

f3c46e41

net: dsa: bcm_sf2: Fix Wake-on-LAN with mac_link_down() · b7be130c

Florian Fainelli authored May 11, 2022

After commit 2d1f90f9 ("net: dsa/bcm_sf2: fix incorrect usage of
state->link") the interface suspend path would call our mac_link_down()
call back which would forcibly set the link down, thus preventing
Wake-on-LAN packets from reaching our management port.

Fix this by looking at whether the port is enabled for Wake-on-LAN and
not clearing the link status in that case to let packets go through.

Fixes: 2d1f90f9 ("net: dsa/bcm_sf2: fix incorrect usage of state->link")
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Link: https://lore.kernel.org/r/20220512021731.2494261-1-f.fainelli@gmail.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

b7be130c

mlxsw: Avoid warning during ip6gre device removal · 810c2f0a

Amit Cohen authored May 11, 2022

IPv6 addresses which are used for tunnels are stored in a hash table
with reference counting. When a new GRE tunnel is configured, the driver
is notified and configures it in hardware.

Currently, any change in the tunnel is not applied in the driver. It
means that if the remote address is changed, the driver is not aware of
this change and the first address will be used.

This behavior results in a warning [1] in scenarios such as the
following:

 # ip link add name gre1 type ip6gre local 2000::3 remote 2000::fffe tos inherit ttl inherit
 # ip link set name gre1 type ip6gre local 2000::3 remote 2000::ffff ttl inherit
 # ip link delete gre1

The change of the address is not applied in the driver. Currently, the
driver uses the remote address which is stored in the 'parms' of the
overlay device. When the tunnel is removed, the new IPv6 address is
used, the driver tries to release it, but as it is not aware of the
change, this address is not configured and it warns about releasing non
existing IPv6 address.

Fix it by using the IPv6 address which is cached in the IPIP entry, this
address is the last one that the driver used, so even in cases such the
above, the first address will be released, without any warning.

[1]:

WARNING: CPU: 1 PID: 2197 at drivers/net/ethernet/mellanox/mlxsw/spectrum.c:2920 mlxsw_sp_ipv6_addr_put+0x146/0x220 [mlxsw_spectrum]
...
CPU: 1 PID: 2197 Comm: ip Not tainted 5.17.0-rc8-custom-95062-gc1e5ded51a9a #84
Hardware name: Mellanox Technologies Ltd. MSN4700/VMOD0010, BIOS 5.11 07/12/2021
RIP: 0010:mlxsw_sp_ipv6_addr_put+0x146/0x220 [mlxsw_spectrum]
...
Call Trace:
 <TASK>
 mlxsw_sp2_ipip_rem_addr_unset_gre6+0xf1/0x120 [mlxsw_spectrum]
 mlxsw_sp_netdevice_ipip_ol_event+0xdb/0x640 [mlxsw_spectrum]
 mlxsw_sp_netdevice_event+0xc4/0x850 [mlxsw_spectrum]
 raw_notifier_call_chain+0x3c/0x50
 call_netdevice_notifiers_info+0x2f/0x80
 unregister_netdevice_many+0x311/0x6d0
 rtnl_dellink+0x136/0x360
 rtnetlink_rcv_msg+0x12f/0x380
 netlink_rcv_skb+0x49/0xf0
 netlink_unicast+0x233/0x340
 netlink_sendmsg+0x202/0x440
 ____sys_sendmsg+0x1f3/0x220
 ___sys_sendmsg+0x70/0xb0
 __sys_sendmsg+0x54/0xa0
 do_syscall_64+0x35/0x80
 entry_SYSCALL_64_after_hwframe+0x44/0xae

Fixes: e846efe2 ("mlxsw: spectrum: Add hash table for IPv6 address mapping")
Reported-by: Maksym Yaremchuk <maksymy@nvidia.com>
Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Link: https://lore.kernel.org/r/20220511115747.238602-1-idosch@nvidia.comSigned-off-by: Paolo Abeni <pabeni@redhat.com>

810c2f0a

Merge branch 'nfp-vf-rate-limit-support' · b33177f1

Paolo Abeni authored May 12, 2022

Simon Horman says:

====================
*nfp: VF rate limit support

this short series adds VF rate limiting to the NFP driver.

The first patch, as suggested by Jakub Kicinski, adds a helper
to check that ndo_set_vf_rate() rate parameters are sane.
It also provides a place for further parameter checking to live,
if needed in future.

The second patch adds VF rate limit support to the NFP driver.
It addresses several comments made on v1, including removing
the parameter check that is now provided by the helper added
in the first patch.
====================

Link: https://lore.kernel.org/r/20220511113932.92114-1-simon.horman@corigine.comSigned-off-by: Paolo Abeni <pabeni@redhat.com>

b33177f1

nfp: VF rate limit support · e0d0e1fd

Bin Chen authored May 11, 2022

Add VF rate limit feature

This patch enhances the NFP driver to supports assignment of
both max_tx_rate and min_tx_rate to VFs

The template of configurations below is all supported.
e.g.
 # ip link set $DEV vf $VF_NUM max_tx_rate $RATE_VALUE
 # ip link set $DEV vf $VF_NUM min_tx_rate $RATE_VALUE
 # ip link set $DEV vf $VF_NUM max_tx_rate $RATE_VALUE \
			       min_tx_rate $RATE_VALUE
 # ip link set $DEV vf $VF_NUM min_tx_rate $RATE_VALUE \
			       max_tx_rate $RATE_VALUE

The max RATE_VALUE is limited to 0xFFFF which is about
63Gbps (using 1024 for 1G)
Signed-off-by: Bin Chen <bin.chen@corigine.com>
Signed-off-by: Louis Peens <louis.peens@corigine.com>
Signed-off-by: Baowen Zheng <baowen.zheng@corigine.com>
Signed-off-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

e0d0e1fd

rtnetlink: verify rate parameters for calls to ndo_set_vf_rate · a14857c2

Bin Chen authored May 11, 2022

When calling ndo_set_vf_rate() the max_tx_rate parameter may be zero,
in which case the setting is cleared, or it must be greater or equal to
min_tx_rate.

Enforce this requirement on all calls to ndo_set_vf_rate via a wrapper
which also only calls ndo_set_vf_rate() if defined by the driver.

Based on work by Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Bin Chen <bin.chen@corigine.com>
Signed-off-by: Baowen Zheng <baowen.zheng@corigine.com>
Signed-off-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

a14857c2