Commits · 887b1d1adb2e9318b9233bef648bd2b1cc1e66ff · Kirill Smelkov / linux

01 Aug, 2024 20 commits

net: ethernet: mtk_eth_soc: drop clocks unused by Ethernet driver · 887b1d1a

Daniel Golle authored Jul 30, 2024

Clocks for SerDes and PHY are going to be handled by standalone drivers
for each of those hardware components. Drop them from the Ethernet driver.

The clocks which are being removed for this patch are responsible for
the for the SerDes PCS and PHYs used for the 2nd and 3rd MAC which are
anyway not yet supported. Hence backwards compatibility is not an issue.
Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Reviewed-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com>
Link: https://patch.msgid.link/b5faaf69b5c6e3e155c64af03706c3c423c6a1c9.1722335682.git.daniel@makrotopia.orgSigned-off-by: Jakub Kicinski <kuba@kernel.org>

887b1d1a

net/mlx5: Reclaim max 50K pages at once · 501c3005

Anand Khoje authored Jul 30, 2024

In non FLR context, at times CX-5 requests release of ~8 million FW pages.
This needs humongous number of cmd mailboxes, which to be released once
the pages are reclaimed. Release of humongous number of cmd mailboxes is
consuming cpu time running into many seconds. Which with non preemptible
kernels is leading to critical process starving on that cpu’s RQ.
On top of it, the FW does not use all the mailbox messages as it has a
limit of releasing 50K pages at once per MLX5_CMD_OP_MANAGE_PAGES +
MLX5_PAGES_TAKE device command. Hence, the allocation of these many
mailboxes is extra and adds unnecessary overhead.
To alleviate this, this change restricts the total number of pages
a worker will try to reclaim to maximum 50K pages in one go.

Our tests have shown significant benefit of this change in terms of
time consumed by dma_pool_free().
During a test where an event was raised by HCA
to release 1.3 Million pages, following observations were made:

- Without this change:
Number of mailbox messages allocated was around 20K, to accommodate
the DMA addresses of 1.3 million pages.
The average time spent by dma_pool_free() to free the DMA pool is between
16 usec to 32 usec.
           value  ------------- Distribution ------------- count
             256 |                                         0
             512 |@                                        287
            1024 |@@@                                      1332
            2048 |@                                        656
            4096 |@@@@@                                    2599
            8192 |@@@@@@@@@@                               4755
           16384 |@@@@@@@@@@@@@@@                          7545
           32768 |@@@@@                                    2501
           65536 |                                         0

- With this change:
Number of mailbox messages allocated was around 800; this was to
accommodate DMA addresses of only 50K pages.
The average time spent by dma_pool_free() to free the DMA pool in this case
lies between 1 usec to 2 usec.
           value  ------------- Distribution ------------- count
             256 |                                         0
             512 |@@@@@@@@@@@@@@@@@@                       346
            1024 |@@@@@@@@@@@@@@@@@@@@@@                   435
            2048 |                                         0
            4096 |                                         0
            8192 |                                         1
           16384 |                                         0
Signed-off-by: Anand Khoje <anand.a.khoje@oracle.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev>
Acked-by: Saeed Mahameed <saeedm@nvidia.com>
Link: https://patch.msgid.link/20240730073634.114407-1-anand.a.khoje@oracle.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

501c3005

net: skbuff: Skip early return in skb_unref when debugging · c9c0ee5f

Breno Leitao authored Jul 29, 2024

This patch modifies the skb_unref function to skip the early return
optimization when CONFIG_DEBUG_NET is enabled. The change ensures that
the reference count decrement always occurs in debug builds, allowing
for more thorough checking of SKB reference counting.

Previously, when the SKB's reference count was 1 and CONFIG_DEBUG_NET
was not set, the function would return early after a memory barrier
(smp_rmb()) without decrementing the reference count. This optimization
assumes it's safe to proceed with freeing the SKB without the overhead
of an atomic decrement from 1 to 0.

With this change:
- In non-debug builds (CONFIG_DEBUG_NET not set), behavior remains
  unchanged, preserving the performance optimization.
- In debug builds (CONFIG_DEBUG_NET set), the reference count is always
  decremented, even when it's 1, allowing for consistent behavior and
  potentially catching subtle SKB management bugs.

This modification enhances debugging capabilities for networking code
without impacting performance in production kernels. It helps kernel
developers identify and diagnose issues related to SKB management and
reference counting in the network stack.

Cc: Chris Mason <clm@fb.com>
Suggested-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20240729104741.370327-1-leitao@debian.orgSigned-off-by: Paolo Abeni <pabeni@redhat.com>

c9c0ee5f

Merge branch 'ethernet-convert-from-tasklet-to-bh-workqueue' · 8e0c0ec9

Jakub Kicinski authored Jul 31, 2024

Allen Pais says:

====================
ethernet: Convert from tasklet to BH workqueue [part]

The only generic interface to execute asynchronously in the BH context is
tasklet; however, it's marked deprecated and has some design flaws. To
replace tasklets, BH workqueue support was recently added. A BH workqueue
behaves similarly to regular workqueues except that the queued work items
are executed in the BH context.

This patch converts a few drivers in drivers/ethernet/* from tasklet
to BH workqueue. The next set will be sent out after the next -rc is
out.

v2: https://lore.kernel.org/20240621183947.4105278-1-allen.lkml@gmail.com
v1: https://lore.kernel.org/20240507190111.16710-2-apais@linux.microsoft.com
====================

Link: https://patch.msgid.link/20240730183403.4176544-1-allen.lkml@gmail.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

8e0c0ec9

net: macb: Convert tasklet API to new bottom half workqueue mechanism · c5092ba3

Allen Pais authored Jul 30, 2024

Migrate tasklet APIs to the new bottom half workqueue mechanism. It
replaces all occurrences of tasklet usage with the appropriate workqueue
APIs throughout the macb driver. This transition ensures compatibility
with the latest design and enhances performance.
Signed-off-by: Allen Pais <allen.lkml@gmail.com>
Link: https://patch.msgid.link/20240730183403.4176544-5-allen.lkml@gmail.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

c5092ba3

net: cnic: Convert tasklet API to new bottom half workqueue mechanism · 8d3beb6b

Allen Pais authored Jul 30, 2024

Migrate tasklet APIs to the new bottom half workqueue mechanism. It
replaces all occurrences of tasklet usage with the appropriate workqueue
APIs throughout the cnic driver. This transition ensures compatibility
with the latest design and enhances performance.
Signed-off-by: Allen Pais <allen.lkml@gmail.com>
Link: https://patch.msgid.link/20240730183403.4176544-4-allen.lkml@gmail.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

8d3beb6b

net: xgbe: Convert tasklet API to new bottom half workqueue mechanism · 2d671dc6

Allen Pais authored Jul 30, 2024

Migrate tasklet APIs to the new bottom half workqueue mechanism. It
replaces all occurrences of tasklet usage with the appropriate workqueue
APIs throughout the xgbe driver. This transition ensures compatibility
with the latest design and enhances performance.
Signed-off-by: Allen Pais <allen.lkml@gmail.com>
Link: https://patch.msgid.link/20240730183403.4176544-3-allen.lkml@gmail.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

2d671dc6

net: alteon: Convert tasklet API to new bottom half workqueue mechanism · 20a3bcfe

Allen Pais authored Jul 30, 2024

Migrate tasklet APIs to the new bottom half workqueue mechanism. It
replaces all occurrences of tasklet usage with the appropriate workqueue
APIs throughout the alteon driver. This transition ensures compatibility
with the latest design and enhances performance.
Signed-off-by: Allen Pais <allen.lkml@gmail.com>
Link: https://patch.msgid.link/20240730183403.4176544-2-allen.lkml@gmail.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

20a3bcfe

Merge branch 'mlxsw-core_thermal-small-cleanups' · 9bb3ec18

Jakub Kicinski authored Jul 31, 2024

Petr Machata says:

====================
mlxsw: core_thermal: Small cleanups

Ido Schimmel says:

Clean up various issues which I noticed while addressing feedback on a
different patchset.
====================

Link: https://patch.msgid.link/cover.1722345311.git.petrm@nvidia.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

9bb3ec18

mlxsw: core_thermal: Fix -Wformat-truncation warning · b0d21321

Ido Schimmel authored Jul 30, 2024

The name of a thermal zone device cannot be longer than 19 characters
('THERMAL_NAME_LENGTH - 1'). The format string 'mlxsw-lc%d-module%d' can
exceed this limitation if the maximum number of line cards cannot be
represented using a single digit and the maximum number of transceiver
modules cannot be represented using two digits.

This is not the case with current systems nor future ones. Therefore,
increase the size of the result buffer beyond 'THERMAL_NAME_LENGTH' and
suppress the following build warning [1].

If this limitation is ever exceeded, we will know about it since the
thermal core validates the thermal device's name during registration.

[1]
drivers/net/ethernet/mellanox/mlxsw/core_thermal.c: In function ‘mlxsw_thermal_modules_init.part.0’:
drivers/net/ethernet/mellanox/mlxsw/core_thermal.c:418:70: error: ‘%d’ directive output may be truncated writing between 1 and 3 bytes into a region of size between 2 and 4 [-Werror=format-truncation=]
  418 |                 snprintf(tz_name, sizeof(tz_name), "mlxsw-lc%d-module%d",
      |                                                                      ^~
In function ‘mlxsw_thermal_module_tz_init’,
    inlined from ‘mlxsw_thermal_module_init’ at drivers/net/ethernet/mellanox/mlxsw/core_thermal.c:465:9,
    inlined from ‘mlxsw_thermal_modules_init.part.0’ at drivers/net/ethernet/mellanox/mlxsw/core_thermal.c:500:9:
drivers/net/ethernet/mellanox/mlxsw/core_thermal.c:418:52: note: directive argument in the range [1, 256]
  418 |                 snprintf(tz_name, sizeof(tz_name), "mlxsw-lc%d-module%d",
      |                                                    ^~~~~~~~~~~~~~~~~~~~~
drivers/net/ethernet/mellanox/mlxsw/core_thermal.c:418:17: note: ‘snprintf’ output between 18 and 22 bytes into a destination of size 20
  418 |                 snprintf(tz_name, sizeof(tz_name), "mlxsw-lc%d-module%d",
      |                 ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  419 |                          module_tz->slot_index, module_tz->module + 1);
      |                          ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Vadim Pasternak <vadimp@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Wojciech Drewek <wojciech.drewek@intel.com>
Link: https://patch.msgid.link/583a70c6dbe75e6bf0c2c58abbb3470a860d2dc3.1722345311.git.petrm@nvidia.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

b0d21321

mlxsw: core_thermal: Remove unnecessary assignments · ec672931

Ido Schimmel authored Jul 30, 2024

Setting both pointers to NULL is unnecessary since the code never checks
whether these pointers are NULL or not. Remove the assignments.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Vadim Pasternak <vadimp@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Wojciech Drewek <wojciech.drewek@intel.com>
Link: https://patch.msgid.link/ea646f5d7642fffd5393fa23650660ab8f77a511.1722345311.git.petrm@nvidia.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

ec672931

mlxsw: core_thermal: Remove unnecessary checks · e7e3a450

Ido Schimmel authored Jul 30, 2024

mlxsw_thermal_module_fini() cannot be invoked with a thermal module
which is NULL or which is not associated with a thermal zone, so remove
these checks.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Vadim Pasternak <vadimp@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Wojciech Drewek <wojciech.drewek@intel.com>
Link: https://patch.msgid.link/8db5fe0a3a28ba09a15d4102cc03f7e8ca7675be.1722345311.git.petrm@nvidia.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

e7e3a450

mlxsw: core_thermal: Simplify rollback · e25f3040

Ido Schimmel authored Jul 30, 2024

During rollback, instead of calling mlxsw_thermal_module_fini() for all
the modules, only call it for modules that were successfully
initialized. This is not a bug fix since mlxsw_thermal_module_fini()
first checks that the module was initialized.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Vadim Pasternak <vadimp@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Wojciech Drewek <wojciech.drewek@intel.com>
Link: https://patch.msgid.link/905bebc45f6e246031f0c5c177bba8efe11e05f5.1722345311.git.petrm@nvidia.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

e25f3040

mlxsw: core_thermal: Make mlxsw_thermal_module_{init, fini} symmetric · fb76ea1d

Ido Schimmel authored Jul 30, 2024

mlxsw_thermal_module_fini() de-initializes the module's thermal zone,
but mlxsw_thermal_module_init() does not initialize it. Make both
functions symmetric by moving the initialization of the module's thermal
zone to mlxsw_thermal_module_init().
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Vadim Pasternak <vadimp@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Wojciech Drewek <wojciech.drewek@intel.com>
Link: https://patch.msgid.link/a661ad468f8ad0d7d533d8334e4abf61dfe34342.1722345311.git.petrm@nvidia.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

fb76ea1d

mlxsw: core_thermal: Remove unused arguments · 73c18f99

Ido Schimmel authored Jul 30, 2024

'dev' and 'core' arguments are not used by mlxsw_thermal_module_init().
Remove them.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Vadim Pasternak <vadimp@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Wojciech Drewek <wojciech.drewek@intel.com>
Link: https://patch.msgid.link/563fc7383f61809a306b9954872219eaaf3c689b.1722345311.git.petrm@nvidia.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

73c18f99

mlxsw: core_thermal: Fold two loops into one · d81d7143

Ido Schimmel authored Jul 30, 2024

There is no need to traverse the same array twice. Do it once by folding
both loops into one.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Vadim Pasternak <vadimp@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Wojciech Drewek <wojciech.drewek@intel.com>
Link: https://patch.msgid.link/81756744ed532aaa9249a83fc08757accfe8b07c.1722345311.git.petrm@nvidia.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

d81d7143

mlxsw: core_thermal: Remove another unnecessary check · 2a1c9dcb

Ido Schimmel authored Jul 30, 2024

mlxsw_thermal_modules_init() allocates an array of modules and then
initializes each entry by calling mlxsw_thermal_module_init() which
among other things initializes the 'parent' pointer of the entry.

mlxsw_thermal_modules_init() then traverses over the array again, but
skips over entries that do not have their 'parent' pointer set which is
impossible given the above.

Therefore, remove the unnecessary check.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Vadim Pasternak <vadimp@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Wojciech Drewek <wojciech.drewek@intel.com>
Link: https://patch.msgid.link/fb3e8ded422a441436431d5785b900f11ffc9621.1722345311.git.petrm@nvidia.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

2a1c9dcb

mlxsw: core_thermal: Remove unnecessary check · 4be011d7

Ido Schimmel authored Jul 30, 2024

mlxsw_thermal_modules_init() allocates an array of modules and then
calls mlxsw_thermal_module_init() to initialize each entry in the array.
It is therefore impossible for mlxsw_thermal_module_init() to encounter
an entry that is already initialized and has its 'parent' pointer set.

Remove the unnecessary check.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Vadim Pasternak <vadimp@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Wojciech Drewek <wojciech.drewek@intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

4be011d7

mlxsw: core_thermal: Call thermal_zone_device_unregister() unconditionally · a1bb54b1

Ido Schimmel authored Jul 30, 2024

The function returns immediately if the thermal zone pointer is NULL so
there is no need to check it before calling the function.

Remove the check.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Vadim Pasternak <vadimp@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Wojciech Drewek <wojciech.drewek@intel.com>
Link: https://patch.msgid.link/0bd251aa8ce03d3c951983aa6b4300d8205b88a7.1722345311.git.petrm@nvidia.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

a1bb54b1

net/mlx4: Add support for EEPROM high pages query for QSFP/QSFP+/QSFP28 · 9c26a1d0

Krzysztof Olędzki authored Jul 30, 2024

Enable reading additional EEPROM information from high pages such as
thresholds and alarms on QSFP/QSFP+/QSFP28 modules.

"This is similar to commit a708fb7b ("net/mlx5e: ethtool, Add
support for EEPROM high pages query") but given all the required logic
already exists in mlx4_qsfp_eeprom_params_set() only s/_LEN/MAX_LEN/ is
needed.
Tested-by: Dan Merillat <git@dan.merillat.org>
Signed-off-by: Krzysztof Piotr Oledzki <ole@ans.pl>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/b17c5336-6dc3-41f2-afa6-f9e79231f224@ans.plSigned-off-by: Jakub Kicinski <kuba@kernel.org>

9c26a1d0

31 Jul, 2024 20 commits

Add support for PIO p flag · 990c3049

Patrick Rohr authored Jul 29, 2024

draft-ietf-6man-pio-pflag is adding a new flag to the Prefix Information
Option to signal that the network can allocate a unique IPv6 prefix per
client via DHCPv6-PD (see draft-ietf-v6ops-dhcp-pd-per-device).

When ra_honor_pio_pflag is enabled, the presence of a P-flag causes
SLAAC autoconfiguration to be disabled for that particular PIO.

An automated test has been added in Android (r.android.com/3195335) to
go along with this change.

Cc: Maciej Żenczykowski <maze@google.com>
Cc: Lorenzo Colitti <lorenzo@google.com>
Cc: David Lamparter <equinox@opensourcerouting.org>
Cc: Simon Horman <horms@kernel.org>
Signed-off-by: Patrick Rohr <prohr@google.com>
Reviewed-by: Maciej Żenczykowski <maze@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

990c3049

Merge branch 'smc-cleanups' into main · 59f72657

David S. Miller authored Jul 31, 2024

Zhengchao Shao says:

====================
net/smc: do some cleanups in smc module

Do some cleanups in smc module.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

59f72657

net/smc: remove unused input parameters in smcr_new_buf_create · 0908503a

Zhengchao Shao authored Jul 30, 2024

The input parameter "is_rmb" of the smcr_new_buf_create function
has never been used, remove it.
Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

0908503a

net/smc: remove redundant code in smc_connect_check_aclc · d37307ea

Zhengchao Shao authored Jul 30, 2024

When the SMC client perform CLC handshake, it will check whether
the clc header type is correct in receiving SMC_CLC_ACCEPT packet.
The specific invoking path is as follows:
__smc_connect
  smc_connect_clc
    smc_clc_wait_msg
      smc_clc_msg_hdr_valid
        smc_clc_msg_acc_conf_valid
Therefore, the smc_connect_check_aclc interface invoked by
__smc_connect does not need to check type again.
Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

d37307ea

net/smc: remove the fallback in __smc_connect · 5a795757

Zhengchao Shao authored Jul 30, 2024

When the SMC client begins to connect to server, smcd_version is set
to SMC_V1 + SMC_V2. If fail to get VLAN ID, only SMC_V2 information
is left in smcd_version. And smcd_version will not be changed to 0.
Therefore, remove the fallback caused by the failure to get VLAN ID.
Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

5a795757

net/smc: remove unreferenced header in smc_loopback.h file · 1018825a

Zhengchao Shao authored Jul 30, 2024

Because linux/err.h is unreferenced in smc_loopback.h file, so
remove it.
Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: D. Wythe <alibuda@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

1018825a

dt-bindings: net: dsa: vsc73xx: add {rx,tx}-internal-delay-ps · b735154a

Pawel Dembicki authored Jul 29, 2024

Add a schema validator to vitesse,vsc73xx.yaml for MAC-level RGMII delays
in the CPU port. Additionally, valid values for VSC73XX were defined,
and a common definition for the RX and TX valid range was created.
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Pawel Dembicki <paweldembicki@gmail.com>
Reviewed-by: Rob Herring (Arm) <robh@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

b735154a

net: dsa: vsc73xx: make RGMII delays configurable · 3b91b032

Pawel Dembicki authored Jul 29, 2024

This patch switches hardcoded RGMII transmit/receive delay to
a configurable value. Delay values are taken from the properties of
the CPU port: 'tx-internal-delay-ps' and 'rx-internal-delay-ps'.

The default value is configured to 2.0 ns to maintain backward
compatibility with existing code.
Signed-off-by: Pawel Dembicki <paweldembicki@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

3b91b032

Merge branch 'l2tp-session-cleanup' into main · 3fafd92e

David S. Miller authored Jul 31, 2024

James Chapman says:

====================
l2tp: simplify tunnel and session cleanup

This series simplifies and improves l2tp tunnel and session cleanup.

 * refactor l2tp management code to not use the tunnel socket's
   sk_user_data. This allows the tunnel and its socket to be closed
   and freed without sequencing the two using the socket's sk_destruct
   hook.

 * export ip_flush_pending_frames and use it when closing l2tp_ip
   sockets.

 * move the work of closing all sessions in the tunnel to the work
   queue so that sessions are deleted using the same codepath whether
   they are closed by user API request or their parent tunnel is
   closing.

 * refactor l2tp_ppp pppox socket / session relationship to have the
   session keep the socket alive, not the other way around. Previously
   the pppox socket held a ref on the session, which complicated
   session delete by having to go through the pppox socket destructor.

 * free sessions and pppox sockets by rcu.

 * fix a possible tunnel refcount underflow.

 * avoid using rcu_barrier in net exit handler.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

3fafd92e

l2tp: use pre_exit pernet hook to avoid rcu_barrier · 5dfa598b

James Chapman authored Jul 29, 2024

Move the work of closing all tunnels from the pernet exit hook to
pre_exit since the core does rcu synchronisation between these steps
and we can therefore remove rcu_barrier from l2tp code.
Signed-off-by: James Chapman <jchapman@katalix.com>
Signed-off-by: Tom Parkin <tparkin@katalix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

5dfa598b

l2tp: cleanup eth/ppp pseudowire setup code · d93b8a63

James Chapman authored Jul 29, 2024

l2tp eth/ppp pseudowire setup/cleanup uses kfree() in some error
paths. Drop the refcount instead such that the session object is
always freed when the refcount reaches 0.
Signed-off-by: James Chapman <jchapman@katalix.com>
Signed-off-by: Tom Parkin <tparkin@katalix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

d93b8a63

l2tp: add idr consistency check in session_register · 0aa45570

James Chapman authored Jul 29, 2024

l2tp_session_register uses an idr_alloc then idr_replace pattern to
insert sessions into the session IDR. To catch invalid locking, add a
WARN_ON_ONCE if the IDR entry is modified by another thread between
alloc and replace steps.

Also add comments to make expectations clear.
Signed-off-by: James Chapman <jchapman@katalix.com>
Signed-off-by: Tom Parkin <tparkin@katalix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

0aa45570

l2tp: use rcu list add/del when updating lists · 89b768ec

James Chapman authored Jul 29, 2024

l2tp_v3_session_htable and tunnel->session_list are read by lockless
getters using RCU. Use rcu list variants when adding or removing list
items.
Signed-off-by: James Chapman <jchapman@katalix.com>
Signed-off-by: Tom Parkin <tparkin@katalix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

89b768ec

l2tp: prevent possible tunnel refcount underflow · 24256415

James Chapman authored Jul 29, 2024

When a session is created, it sets a backpointer to its tunnel. When
the session refcount drops to 0, l2tp_session_free drops the tunnel
refcount if session->tunnel is non-NULL. However, session->tunnel is
set in l2tp_session_create, before the tunnel refcount is incremented
by l2tp_session_register, which leaves a small window where
session->tunnel is non-NULL when the tunnel refcount hasn't been
bumped.

Moving the assignment to l2tp_session_register is trivial but
l2tp_session_create calls l2tp_session_set_header_len which uses
session->tunnel to get the tunnel's encap. Add an encap arg to
l2tp_session_set_header_len to avoid using session->tunnel.

If l2tpv3 sessions have colliding IDs, it is possible for
l2tp_v3_session_get to race with l2tp_session_register and fetch a
session which doesn't yet have session->tunnel set. Add a check for
this case.
Signed-off-by: James Chapman <jchapman@katalix.com>
Signed-off-by: Tom Parkin <tparkin@katalix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

24256415

l2tp: refactor ppp socket/session relationship · c5cbaef9

James Chapman authored Jul 29, 2024

Each l2tp ppp session has an associated pppox socket. l2tp_ppp uses
the session's pppox socket refcount to manage session lifetimes; the
pppox socket holds a ref on the session which is dropped by the socket
destructor. This complicates session cleanup.

Given l2tp sessions are refcounted, it makes more sense to reverse
this relationship such that the session keeps the socket alive, not
the other way around. So refactor l2tp_ppp to have the session hold a
ref on its socket while it references it. When the session is closed,
it drops its socket ref when it detaches from its socket. If the
socket is closed first, it initiates the closing of its session, if
one is attached. The socket/session can then be freed asynchronously
when their refcounts drop to 0.

Use the session's session_close callback to detach the pppox socket
since this will be done on the work queue together with the rest of
the session cleanup via l2tp_session_delete.

Also, since l2tp_ppp uses the pppox socket's sk_user_data, use the rcu
sk_user_data access helpers when accessing it and set the socket's
SOCK_RCU_FREE flag to have pppox sockets freed by rcu.
Signed-off-by: James Chapman <jchapman@katalix.com>
Signed-off-by: Tom Parkin <tparkin@katalix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

c5cbaef9

l2tp: free sessions using rcu · d17e8999

James Chapman authored Jul 29, 2024

l2tp sessions may be accessed under an rcu read lock. Have them freed
via rcu and remove the now unneeded synchronize_rcu when a session is
removed.
Signed-off-by: James Chapman <jchapman@katalix.com>
Signed-off-by: Tom Parkin <tparkin@katalix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

d17e8999

l2tp: delete sessions using work queue · fc7ec7f5

James Chapman authored Jul 29, 2024

When a tunnel is closed, l2tp_tunnel_closeall closes all sessions in
the tunnel. Move the work of deleting each session to the work queue
so that sessions are deleted using the same codepath whether they are
closed by user API request or their parent tunnel is closing. This
also avoids the locking dance in l2tp_tunnel_closeall where the
tunnel's session list lock was unlocked and relocked in the loop.

In l2tp_exit_net, use drain_workqueue instead of flush_workqueue
because the processing of tunnel_delete work may queue session_delete
work items which must also be processed.
Signed-off-by: James Chapman <jchapman@katalix.com>
Signed-off-by: Tom Parkin <tparkin@katalix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

fc7ec7f5

l2tp: simplify tunnel and socket cleanup · 29717a4f

James Chapman authored Jul 29, 2024

When the l2tp tunnel socket used sk_user_data to point to its
associated l2tp tunnel, socket and tunnel cleanup had to make use of
the socket's destructor to free the tunnel only when the socket could
no longer be accessed.

Now that sk_user_data is no longer used, we can simplify socket and
tunnel cleanup:

  * If the tunnel closes first, it cleans up and drops its socket ref
    when the tunnel refcount drops to zero. If its socket was provided
    by userspace, the socket is closed and freed asynchronously, when
    userspace closes it. If its socket is a kernel socket, the tunnel
    closes the socket itself during cleanup and drops its socket ref
    when the tunnel's refcount drops to zero.

  * If the socket closes first, we initiate the closing of its
    associated tunnel. For UDP sockets, this is via the socket's
    encap_destroy hook. For L2TPIP sockets, this is via the socket's
    destroy callback. The tunnel holds a socket ref while it
    references the sock. When the tunnel is freed, it drops its socket
    ref and the socket will be cleaned up when its own refcount drops
    to zero, asynchronous to the tunnel free.

  * The tunnel socket destructor is no longer needed since the tunnel
    is no longer freed through the socket destructor.
Signed-off-by: James Chapman <jchapman@katalix.com>
Signed-off-by: Tom Parkin <tparkin@katalix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

29717a4f

l2tp: remove unused tunnel magic field · 0fa51a7c

James Chapman authored Jul 29, 2024

Since l2tp no longer derives tunnel pointers directly via
sk_user_data, it is no longer useful for l2tp to check tunnel pointers
using a magic feather. Drop the tunnel's magic field.
Signed-off-by: James Chapman <jchapman@katalix.com>
Signed-off-by: Tom Parkin <tparkin@katalix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

0fa51a7c

l2tp: don't set sk_user_data in tunnel socket · 4a4cd703

James Chapman authored Jul 29, 2024

l2tp no longer uses the tunnel socket's sk_user_data so drop the code
which sets it.

In l2tp_validate_socket use l2tp_sk_to_tunnel to check whether a given
socket is already attached to an l2tp tunnel since we can no longer
use non-null sk_user_data to indicate this.
Signed-off-by: James Chapman <jchapman@katalix.com>
Signed-off-by: Tom Parkin <tparkin@katalix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

4a4cd703