Commits · f05dfdaf567aaa482e6e4474bbf5993c5ffffc49 · Kirill Smelkov / linux

04 Oct, 2022 17 commits

dt-bindings: net: pse-dt: add bindings for regulator based PoDL PSE controller · f05dfdaf

Oleksij Rempel authored Oct 03, 2022

Add bindings for the regulator based Ethernet PoDL PSE controller and
generic bindings for all PSE controllers.
Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Rob Herring <robh@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

f05dfdaf

ethtool: add interface to interact with Ethernet Power Equipment · 18ff0bcd

Oleksij Rempel authored Oct 03, 2022

Add interface to support Power Sourcing Equipment. At current step it
provides generic way to address all variants of PSE devices as defined
in IEEE 802.3-2018 but support only objects specified for IEEE 802.3-2018 104.4
PoDL Power Sourcing Equipment (PSE).

Currently supported and mandatory objects are:
IEEE 802.3-2018 30.15.1.1.3 aPoDLPSEPowerDetectionStatus
IEEE 802.3-2018 30.15.1.1.2 aPoDLPSEAdminState
IEEE 802.3-2018 30.15.1.2.1 acPoDLPSEAdminControl

This is minimal interface needed to control PSE on each separate
ethernet port but it provides not all mandatory objects specified in
IEEE 802.3-2018.

Since "PoDL PSE" and "PSE" have similar names, but some different values
I decide to not merge them and keep separate naming schema. This should
allow as to be as close to IEEE 802.3 spec as possible and avoid name
conflicts in the future.

This implementation is connected to PHYs instead of MACs because PSE
auto classification can potentially interfere with PHY auto negotiation.
So, may be some extra PHY related initialization will be needed.

With WIP version of ethtools interaction with PSE capable link looks
as following:

$ ip l
...
5: t1l1@eth0: <BROADCAST,MULTICAST> ..
...

$ ethtool --show-pse t1l1
PSE attributs for t1l1:
PoDL PSE Admin State: disabled
PoDL PSE Power Detection Status: disabled

$ ethtool --set-pse t1l1 podl-pse-admin-control enable
$ ethtool --show-pse t1l1
PSE attributs for t1l1:
PoDL PSE Admin State: enabled
PoDL PSE Power Detection Status: delivering power
Signed-off-by: kernel test robot <lkp@intel.com>
Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
Reviewed-by: Bagas Sanjaya <bagasdotme@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

18ff0bcd

net: mdiobus: search for PSE nodes by parsing PHY nodes. · 5e82147d

Oleksij Rempel authored Oct 03, 2022

Some PHYs can be linked with PSE (Power Sourcing Equipment), so search
for related nodes and attach it to the phydev.
Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

5e82147d

net: mdiobus: fwnode_mdiobus_register_phy() rework error handling · cfaa202a

Oleksij Rempel authored Oct 03, 2022

Rework error handling as preparation for PSE patch. This patch should
make it easier to extend this function.
Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

cfaa202a

net: add framework to support Ethernet PSE and PDs devices · 3114b075

Oleksij Rempel authored Oct 03, 2022

This framework was create with intention to provide support for Ethernet PSE
(Power Sourcing Equipment) and PDs (Powered Device).

At current step this patch implements generic PSE support for PoDL (Power over
Data Lines 802.3bu) specification with reserving name space for PD devices as
well.

This framework can be extended to support 802.3af and 802.3at "Power via the
Media Dependent Interface" (or PoE/Power over Ethernet)
Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

3114b075

dt-bindings: net: phy: add PoDL PSE property · e9554b31

Oleksij Rempel authored Oct 03, 2022

Add property to reference node representing a PoDL Power Sourcing Equipment.
Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
Reviewed-by: Rob Herring <robh@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

e9554b31

Merge branch 'net-marvell-prestera-add-nexthop-routes-offloading' · 46a275a5

Jakub Kicinski authored Oct 03, 2022

Yevhen Orlov says:

====================
net: marvell: prestera: add nexthop routes offloading

Add support for nexthop routes for Marvell Prestera driver.
Subscribe on NEIGH_UPDATE events.

Add features:
 - Support connected route adding
   e.g.: "ip address add 1.1.1.1/24 dev sw1p1"
   e.g.: "ip route add 6.6.6/24 dev sw1p1"
 - Support nexthop route adding
   e.g.: "ip route add 5.5.5/24 via 1.1.1.2"
 - Support ECMP route adding
   e.g.: "ip route add 5.5.5/24 nexthop via 1.1.1.2 nexthop via 1.1.1.3"
 - Support "offload" and "trap" flags per each nexthop
 - Support "offload" flag for neighbours

Limitations:
 - Only "local" and "main" tables supported
 - Only generic interfaces supported for router (no bridges or vlans)

Flags meaning:
  ip route add 5.5.5/24 nexthop via 2.2.2.2 nexthop via 2.2.2.3
  ip route show
  ...
  5.5.5.0/24 rt_offload
        nexthop via 2.2.2.2 dev sw1p31 weight 1 trap
        nexthop via 2.2.2.3 dev sw1p31 weight 1 trap
  ...
  # When you just add route - lpm entry became occupied
  # in HW ("rt_offload" flag), but related to nexthops neighbours
  # still not resolved ("trap" flag).
  #
  # After some time...
  ip route show
  ...
  5.5.5.0/24 rt_offload
        nexthop via 2.2.2.2 dev sw1p31 weight 1 offload
        nexthop via 2.2.2.3 dev sw1p31 weight 1 offload
  ...
  # You will see, that appropriate neighbours was resolved and nexthop
  # entries occupied in HW too ("offload" flag)
Co-developed-by: Taras Chornyi <tchornyi@marvell.com>
Signed-off-by: Taras Chornyi <tchornyi@marvell.com>
Co-developed-by: Oleksandr Mazur <oleksandr.mazur@plvision.eu>
Signed-off-by: Oleksandr Mazur <oleksandr.mazur@plvision.eu>
Signed-off-by: Yevhen Orlov <yevhen.orlov@plvision.eu>

Changes for v2:
* Add more reviewers in CC
* Check if route nexthop or direct with fib_nh_gw_family instead of fib_nh_scope
  This is needed after,
  747c1430 ("ip: fix dflt addr selection for connected nexthop"),
  because direct route is now with the same scope as nexthop (RT_SCOPE_LINK)

Changes for v3:
* Resolve "unused functions" warnings, after
  patch ("net: marvell: prestera: Add heplers to interact ... "), and before
  patch ("net: marvell: prestera: Add neighbour cache accounting")

Changes for v4:
* Rebase to the latest master to resolve patch applying issues

Changes for v5:
* Repack structures to prevent holes
* Remove unused variables
* Fix misspeling issues

Changes for v6:
* Rebase on top of master
* Fix smatch warnings

Changes for v7:
* Rebase on top of master
* Refactor: use "fib_lookup" instead of "fib_new_table"+"fib_table_lookup",
  according to Paolo Abeni suggestion
* Refactor: use "rhashtable_free_and_destroy" instead of rhashtable
  walk, according to Paolo Abeni suggestion
====================

Link: https://lore.kernel.org/r/20221001093417.22388-1-yevhen.orlov@plvision.euSigned-off-by: Jakub Kicinski <kuba@kernel.org>

46a275a5

net: marvell: prestera: Propagate nh state from hw to kernel · ae15ed6e

Yevhen Orlov authored Oct 01, 2022

We poll nexthops in HW and call for each active nexthop appropriate
neighbour.

Also we provide implicity neighbour resolving.
For example, user have added nexthop route:
  # ip route add 5.5.5.5 via 1.1.1.2
But neighbour 1.1.1.2 doesn't exist. In this case we will try to call
neigh_event_send, even if there is no traffic.
This is useful, when you have add route, which will be used after some
time but with a lot of traffic (burst). So, we has prepared, offloaded
route in advance.
Co-developed-by: Taras Chornyi <tchornyi@marvell.com>
Signed-off-by: Taras Chornyi <tchornyi@marvell.com>
Co-developed-by: Oleksandr Mazur <oleksandr.mazur@plvision.eu>
Signed-off-by: Oleksandr Mazur <oleksandr.mazur@plvision.eu>
Signed-off-by: Yevhen Orlov <yevhen.orlov@plvision.eu>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ae15ed6e

net: marvell: prestera: Add neighbour cache accounting · 396b80cb

Yevhen Orlov authored Oct 01, 2022

Move forward and use new PRESTERA_FIB_TYPE_UC_NH to provide basic
nexthop routes support.
Provide deinitialization sequence for all created router objects.

Limitations:
- Only "local" and "main" tables supported
- Only generic interfaces supported for router (no bridges or vlans)
Co-developed-by: Taras Chornyi <tchornyi@marvell.com>
Signed-off-by: Taras Chornyi <tchornyi@marvell.com>
Co-developed-by: Oleksandr Mazur <oleksandr.mazur@plvision.eu>
Signed-off-by: Oleksandr Mazur <oleksandr.mazur@plvision.eu>
Signed-off-by: Yevhen Orlov <yevhen.orlov@plvision.eu>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

396b80cb

net: marvell: prestera: add stub handler neighbour events · 8b1ef491

Yevhen Orlov authored Oct 01, 2022

Actual handler will be added in next patches
Co-developed-by: Taras Chornyi <tchornyi@marvell.com>
Signed-off-by: Taras Chornyi <tchornyi@marvell.com>
Co-developed-by: Oleksandr Mazur <oleksandr.mazur@plvision.eu>
Signed-off-by: Oleksandr Mazur <oleksandr.mazur@plvision.eu>
Signed-off-by: Yevhen Orlov <yevhen.orlov@plvision.eu>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

8b1ef491

net: marvell: prestera: Add heplers to interact with fib_notifier_info · 04f24a1e

Yevhen Orlov authored Oct 01, 2022

This will be used to implement nexthops related logic in next patches.
Also try to keep ipv4/6 abstraction to be able to reuse helpers for ipv6
in the future.
Co-developed-by: Taras Chornyi <tchornyi@marvell.com>
Signed-off-by: Taras Chornyi <tchornyi@marvell.com>
Co-developed-by: Oleksandr Mazur <oleksandr.mazur@plvision.eu>
Signed-off-by: Oleksandr Mazur <oleksandr.mazur@plvision.eu>
Signed-off-by: Yevhen Orlov <yevhen.orlov@plvision.eu>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

04f24a1e

net: marvell: prestera: Add length macros for prestera_ip_addr · 59b44ea8

Yevhen Orlov authored Oct 01, 2022

Add macros to determine IP address length (internal driver types).
This will be used in next patches for nexthops logic.
Co-developed-by: Taras Chornyi <tchornyi@marvell.com>
Signed-off-by: Taras Chornyi <tchornyi@marvell.com>
Co-developed-by: Oleksandr Mazur <oleksandr.mazur@plvision.eu>
Signed-off-by: Oleksandr Mazur <oleksandr.mazur@plvision.eu>
Signed-off-by: Yevhen Orlov <yevhen.orlov@plvision.eu>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

59b44ea8

net: marvell: prestera: add delayed wq and flush wq on deinit · 90b6f9c0

Yevhen Orlov authored Oct 01, 2022

Flushing workqueues ensures, that no more pending works, related to just
unregistered or deinitialized notifiers. After that we can free memory.

Delayed wq will be used for neighbours in next patches.
Co-developed-by: Taras Chornyi <tchornyi@marvell.com>
Signed-off-by: Taras Chornyi <tchornyi@marvell.com>
Co-developed-by: Oleksandr Mazur <oleksandr.mazur@plvision.eu>
Signed-off-by: Oleksandr Mazur <oleksandr.mazur@plvision.eu>
Signed-off-by: Yevhen Orlov <yevhen.orlov@plvision.eu>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

90b6f9c0

net: marvell: prestera: Add strict cleanup of fib arbiter · 333fe4d0

Yevhen Orlov authored Oct 01, 2022

This will, ensure, that there is no more, preciously allocated fib_cache
entries left after deinit.
Will be used to free allocated resources of nexthop routes, that points
to "not our" port (e.g. eth0).
Signed-off-by: Yevhen Orlov <yevhen.orlov@plvision.eu>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

333fe4d0

net: marvell: prestera: Add cleanup of allocated fib_nodes · 1e7313e8

Yevhen Orlov authored Oct 01, 2022

Do explicity cleanup on router_hw_fini, to ensure, that all allocated
objects cleaned. This will be used in cases,
when upper layer (cache) is not mapped to router_hw layer.
Co-developed-by: Taras Chornyi <tchornyi@marvell.com>
Signed-off-by: Taras Chornyi <tchornyi@marvell.com>
Co-developed-by: Oleksandr Mazur <oleksandr.mazur@plvision.eu>
Signed-off-by: Oleksandr Mazur <oleksandr.mazur@plvision.eu>
Signed-off-by: Yevhen Orlov <yevhen.orlov@plvision.eu>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

1e7313e8

net: marvell: prestera: Add router nexthops ABI · 0a23ae23

Yevhen Orlov authored Oct 01, 2022

- Add functions to allocate/delete/set nexthop group
  - NOTE: non-ECMP nexthop is nexthop group with allocated size = 1
- Add function to read state of HW nh (if packets going through it)
Co-developed-by: Taras Chornyi <tchornyi@marvell.com>
Signed-off-by: Taras Chornyi <tchornyi@marvell.com>
Co-developed-by: Oleksandr Mazur <oleksandr.mazur@plvision.eu>
Signed-off-by: Oleksandr Mazur <oleksandr.mazur@plvision.eu>
Signed-off-by: Yevhen Orlov <yevhen.orlov@plvision.eu>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

0a23ae23

eth: octeon: fix build after netif_napi_add() changes · 899b8cd0

Jakub Kicinski authored Oct 02, 2022

Guenter reports I missed a netif_napi_add() call
in one of the platform-specific drivers:

drivers/net/ethernet/cavium/octeon/octeon_mgmt.c: In function 'octeon_mgmt_probe':
drivers/net/ethernet/cavium/octeon/octeon_mgmt.c:1399:9: error: too many arguments to function 'netif_napi_add'
 1399 |         netif_napi_add(netdev, &p->napi, octeon_mgmt_napi_poll,
      |         ^~~~~~~~~~~~~~
Reported-by: Guenter Roeck <linux@roeck-us.net>
Fixes: b48b89f9 ("net: drop the weight argument from netif_napi_add")
Link: https://lore.kernel.org/r/20221002175650.1491124-1-kuba@kernel.orgSigned-off-by: Jakub Kicinski <kuba@kernel.org>

899b8cd0

03 Oct, 2022 23 commits

Merge branch 'mlx5-xsk-updates-part4-and-more' · b89eced8

Jakub Kicinski authored Oct 03, 2022

Saeed Mahameed says:

====================
mlx5 xsk updates part4 and more

1) Final part of xsk improvements,
in this series Maxim continues to improve xsk implementation
 a) XSK Busy polling support
 b) Use KLM to avoid Frame overrun in unaligned mode
 c) Optimize unaligned more for certain frame sizes
 d) Other straight forward minor optimizations.

part 1: https://lore.kernel.org/netdev/20220927203611.244301-1-saeed@kernel.org/
part 2: https://lore.kernel.org/netdev/20220929072156.93299-1-saeed@kernel.org/
part 3: https://lore.kernel.org/netdev/20220930162903.62262-1-saeed@kernel.org/

2) Oversize packets firmware counter, from Gal.

3) Set default grace period for health reporters based on function type

4) Some minor E-Switch improvements
====================

Link: https://lore.kernel.org/r/20221002045632.291612-1-saeed@kernel.orgSigned-off-by: Jakub Kicinski <kuba@kernel.org>

b89eced8

net/mlx5: E-Switch, Return EBUSY if can't get mode lock · 794131c4

Jianbo Liu authored Oct 01, 2022

It is to avoid tc retrying during device mode change.
Signed-off-by: Jianbo Liu <jianbol@nvidia.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

794131c4

net/mlx5: E-switch, Don't update group if qos is not enabled · 909ffe46

Chris Mi authored Oct 01, 2022

Currently, qos group will be updated and qos will be enabled when
unregistering devlink port. Actually no need to update group if qos
is not enabled.

Add a check to prevent unnecessary enabling and disabling qos for
every port.
Signed-off-by: Chris Mi <cmi@nvidia.com>
Reviewed-by: Dmytro Linkin <dlinkin@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

909ffe46

net/mlx5: E-Switch, Allow offloading fwd dest flow table with vport · 8c9cc1eb

Roi Dayan authored Oct 01, 2022

Before this commit a fwd dest flow table resulted in ignoring vport dests
which is incorrect and is supported.
With this commit the dests can be a mix of flow table and vport dests.
There is still a limitation that there cannot be more than one flow table dest.
Signed-off-by: Roi Dayan <roid@nvidia.com>
Reviewed-by: Maor Dickman <maord@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

8c9cc1eb

net/mlx5: Set default grace period based on function type · 1330bd98

Maher Sanalla authored Oct 01, 2022

Currently, driver sets the same grace period for fw fatal health reporter
to any type of function.

Since the lower level functions are more vulnerable to fw fatal errors as a
result of parent function closure/reload, set a smaller grace period for
the lower level functions, as follows:

1. For ECPF: 180 seconds.
2. For PF: 60 seconds.
3. For VF/SF: 30 seconds.
Signed-off-by: Maher Sanalla <msanalla@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

1330bd98

net/mlx5: Start health poll at earlier stage of driver load · 9b98d395

Moshe Shemesh authored Oct 01, 2022

Start health poll at earlier stage, so if fw fatal issue occurred before
or during initialization commands such as init_hca or set_hca_cap the
poll health can detect and indicate that the driver is already in error
state.
Signed-off-by: Moshe Shemesh <moshe@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

9b98d395

net/mlx5e: Expose rx_oversize_pkts_buffer counter · 16ab85e7

Gal Pressman authored Oct 01, 2022

Add the rx_oversize_pkts_buffer counter to ethtool statistics.
This counter exposes the number of dropped received packets due to
length which arrived to RQ and exceed software buffer size allocated by
the device for incoming traffic. It might imply that the device MTU is
larger than the software buffers size.
Signed-off-by: Gal Pressman <gal@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

16ab85e7

net/mlx5e: xsk: Optimize for unaligned mode with 3072-byte frames · c2c9e31d

Maxim Mikityanskiy authored Oct 01, 2022

When XSK frame size is 3072 (or another power of two multiplied by 3),
KLM mechanism for NIC virtual memory page mapping can be optimized by
replacing it with KSM.

Before this change, two KLM entries were needed to map an XSK frame that
is not a power of two: one entry maps the UMEM memory up to the frame
length, the other maps the rest of the stride to the garbage page.

When the frame length divided by 3 is a power of two, it can be mapped
using 3 KSM entries, and the fourth will map the rest of the stride to
the garbage page. All 4 KSM entries are of the same size, which allows
for a much faster lookup.

Frame size 3072 is useful in certain use cases, because it allows
packing 4 frames into 3 pages. Generally speaking, other frame sizes
equal to PAGE_SIZE minus a power of two can be optimized in a similar
way, but it will require many more KSMs per frame, which slows down UMRs
a little bit, but more importantly may hit the limit for the maximum
number of KSM entries.
Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

c2c9e31d

net/mlx5e: xsk: Print a warning in slow configurations · c6f04204

Maxim Mikityanskiy authored Oct 01, 2022

On striding RQ, when the XSK frame size doesn't match the MKey page
size, KLM is used for memory mappings, which is a slower mechanism than
MTT or KSM. It may happen in two cases:

1. Frame size is not a power of two (only possible in the unaligned mode
of XSK).

2. Frame size is 2048 bytes, and the firmware doesn't support MKey pages
smaller than 4096 bytes.

Depending on the case, print a warning and recommend to disable striding
RQ or upgrade the firmware.
Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

c6f04204

net/mlx5e: xsk: Use KLM to protect frame overrun in unaligned mode · 13921345

Maxim Mikityanskiy authored Oct 01, 2022

XSK RQs support striding RQ linear mode, but the stride size may be
bigger than the XSK frame size, because:

1. The stride size must be a power of two.

2. The stride size must be equal to the UMR page size. Each XSK frame is
treated as a separate page, because they aren't necessarily adjacent in
physical memory, so the driver can't put more than one stride per page.

3. The minimal MTT page size is 4096 on older firmware.

That means that if XSK frame size is 2048 or not a power of two, the
strides may be bigger than XSK frames. Normally, it's not a problem if
the hardware enforces the MTU. However, traffic between vports skips the
hardware MTU check, and oversized packets may be received.

If an oversized packet is bigger than the XSK frame but not bigger than
the stride, it will cause overwriting of the adjacent UMEM region. If
the packet takes more than one stride, they can be recycled for reuse,
so it's not a problem when the XSK frame size matches the stride size.

Work around the above issue by leveraging KLM to make a more
fine-grained mapping. The beginning of each stride is mapped to the
frame memory, and the padding up to the closest power of two is mapped
to the overflow page that doesn't belong to UMEM. This way, application
data corruption won't happen upon receiving packets bigger than MTU.
Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

13921345

net/mlx5e: Improve MTT/KSM alignment · 9f123f74

Maxim Mikityanskiy authored Oct 01, 2022

Make mlx5e_mpwrq_mtts_per_wqe take into account that KSM requires
smaller alignment than MTT.

Ensure that there is always an even amount of MTTs in a UMR WQE, so that
complete octwords are formed, and no garbage is mapped.

Drop extra alignment in MLX5_MTT_OCTW that may cause setting too big
ucseg->xlt_octowords, also leading to mapping garbage.

Generalize some calculations by introducing the MLX5_OCTWORD constant.
Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

9f123f74

net/mlx5e: xsk: Use umr_mode to calculate striding RQ parameters · 168723c1

Maxim Mikityanskiy authored Oct 01, 2022

Instead of passing the unaligned flag, pass an enum that indicates the
UMR mode. The next commit will add the third mode (KLM for certain
configurations of XSK), which will be added to this enum instead of
adding another bool flag everywhere.
Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

168723c1

net/mlx5e: xsk: Improve need_wakeup logic · cfb4d09c

Maxim Mikityanskiy authored Oct 01, 2022

XSK need_wakeup mechanism allows the driver to stop busy waiting for
buffers when the fill ring is empty, yield to the application and signal
it that the driver needs to be waken up after the application refills
the fill ring.

Add protection against the race condition on the RX (refill) side: if
the application refills buffers after xskrq->post_wqes is called, but
before mlx5e_xsk_update_rx_wakeup, NAPI will exit, skipping taking these
buffers to the hardware WQ, and the application won't wake it up again.

Optimize the whole need_wakeup logic, removing unneeded flows, to
compensate for this new check.
Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

cfb4d09c

net/mlx5e: xsk: Include XSK skb_from_cqe callbacks in INDIRECT_CALL · 1ca6492e

Maxim Mikityanskiy authored Oct 01, 2022

XSK is a performance-critical data path. To avoid an indirect function
call with a retpoline, include XSK callbacks in the INDIRECT_CALL macro,
so that they are called directly in XSK flows.
Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

1ca6492e

net/mlx5e: xsk: Set napi_id to support busy polling · a2740f52

Maxim Mikityanskiy authored Oct 01, 2022

xdp_rxq_info_reg should get the actual napi_id, not 0, in order to
support socket busy polling properly.
Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

a2740f52

net/mlx5e: xsk: Flush RQ on XSK activation to save memory · 082a9edf

Maxim Mikityanskiy authored Oct 01, 2022

The regular RQ remains open after opening an XSK socket, in order to
guarantee that closing the XSK socket never fails due to an error when
reopening the regular RQ.

To save memory, the regular RQ can be deactivated and flushed, releasing
all pages, when an XSK socket is open.
Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

082a9edf

net: ipa: update copyrights · a4388da5

Alex Elder authored Sep 30, 2022

Some source files state copyright dates that are earlier than the
last modification of the file. Change the copyright year to 2022 in
all such cases.
Signed-off-by: Alex Elder <elder@linaro.org>
Link: https://lore.kernel.org/r/20220930224549.3503434-1-elder@linaro.orgSigned-off-by: Jakub Kicinski <kuba@kernel.org>

a4388da5

net: ipa: update comments · ace5dc61

Alex Elder authored Sep 30, 2022

This patch just updates comments throughout the IPA code.

Transaction state is now tracked using indexes into an array rather
than linked lists, and a few comments refer to the "old way" of
doing things.  The description of how transactions are used was
changed to refer to "operations" rather than "commands", to
(hopefully) remove a possible ambiguity.

IPA register offsets and fields are now handled differently as well,
and the register documentation is updated to better describe the
code.

A few minor updates to comments were made (e.g., adding a missing
word, fixing a typo or punctuation, etc.).

Finally, the local macro atomic_dec_not_zero() is no longer used, so
it is deleted.
Signed-off-by: Alex Elder <elder@linaro.org>
Link: https://lore.kernel.org/r/20220930224527.3503404-1-elder@linaro.orgSigned-off-by: Jakub Kicinski <kuba@kernel.org>

ace5dc61

net: lan966x: Fix return type of lan966x_port_xmit · 450a580f

Nathan Huckleberry authored Sep 29, 2022

The ndo_start_xmit field in net_device_ops is expected to be of type
netdev_tx_t (*ndo_start_xmit)(struct sk_buff *skb, struct net_device *dev).

The mismatched return type breaks forward edge kCFI since the underlying
function definition does not match the function hook definition.

The return type of lan966x_port_xmit should be changed from int to
netdev_tx_t.
Reported-by: Dan Carpenter <error27@gmail.com>
Link: https://github.com/ClangBuiltLinux/linux/issues/1703
Cc: llvm@lists.linux.dev
Signed-off-by: Nathan Huckleberry <nhuck@google.com>
Reviewed-by: Nathan Chancellor <nathan@kernel.org>
Link: https://lore.kernel.org/r/20220929182704.64438-1-nhuck@google.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

450a580f

Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next · a08d97a1

Jakub Kicinski authored Oct 03, 2022

Daniel Borkmann says:

====================
pull-request: bpf-next 2022-10-03

We've added 143 non-merge commits during the last 27 day(s) which contain
a total of 151 files changed, 8321 insertions(+), 1402 deletions(-).

The main changes are:

1) Add kfuncs for PKCS#7 signature verification from BPF programs, from Roberto Sassu.

2) Add support for struct-based arguments for trampoline based BPF programs,
   from Yonghong Song.

3) Fix entry IP for kprobe-multi and trampoline probes under IBT enabled, from Jiri Olsa.

4) Batch of improvements to veristat selftest tool in particular to add CSV output,
   a comparison mode for CSV outputs and filtering, from Andrii Nakryiko.

5) Add preparatory changes needed for the BPF core for upcoming BPF HID support,
   from Benjamin Tissoires.

6) Support for direct writes to nf_conn's mark field from tc and XDP BPF program
   types, from Daniel Xu.

7) Initial batch of documentation improvements for BPF insn set spec, from Dave Thaler.

8) Add a new BPF_MAP_TYPE_USER_RINGBUF map which provides single-user-space-producer /
   single-kernel-consumer semantics for BPF ring buffer, from David Vernet.

9) Follow-up fixes to BPF allocator under RT to always use raw spinlock for the BPF
   hashtab's bucket lock, from Hou Tao.

10) Allow creating an iterator that loops through only the resources of one
    task/thread instead of all, from Kui-Feng Lee.

11) Add support for kptrs in the per-CPU arraymap, from Kumar Kartikeya Dwivedi.

12) Add a new kfunc helper for nf to set src/dst NAT IP/port in a newly allocated CT
    entry which is not yet inserted, from Lorenzo Bianconi.

13) Remove invalid recursion check for struct_ops for TCP congestion control BPF
    programs, from Martin KaFai Lau.

14) Fix W^X issue with BPF trampoline and BPF dispatcher, from Song Liu.

15) Fix percpu_counter leakage in BPF hashtab allocation error path, from Tetsuo Handa.

16) Various cleanups in BPF selftests to use preferred ASSERT_* macros, from Wang Yufen.

17) Add invocation for cgroup/connect{4,6} BPF programs for ICMP pings, from YiFei Zhu.

18) Lift blinding decision under bpf_jit_harden = 1 to bpf_capable(), from Yauheni Kaliuta.

19) Various libbpf fixes and cleanups including a libbpf NULL pointer deref, from Xin Liu.

* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (143 commits)
  net: netfilter: move bpf_ct_set_nat_info kfunc in nf_nat_bpf.c
  Documentation: bpf: Add implementation notes documentations to table of contents
  bpf, docs: Delete misformatted table.
  selftests/xsk: Fix double free
  bpftool: Fix error message of strerror
  libbpf: Fix overrun in netlink attribute iteration
  selftests/bpf: Fix spelling mistake "unpriviledged" -> "unprivileged"
  samples/bpf: Fix typo in xdp_router_ipv4 sample
  bpftool: Remove unused struct event_ring_info
  bpftool: Remove unused struct btf_attach_point
  bpf, docs: Add TOC and fix formatting.
  bpf, docs: Add Clang note about BPF_ALU
  bpf, docs: Move Clang notes to a separate file
  bpf, docs: Linux byteswap note
  bpf, docs: Move legacy packet instructions to a separate file
  selftests/bpf: Check -EBUSY for the recurred bpf_setsockopt(TCP_CONGESTION)
  bpf: tcp: Stop bpf_setsockopt(TCP_CONGESTION) in init ops to recur itself
  bpf: Refactor bpf_setsockopt(TCP_CONGESTION) handling into another function
  bpf: Move the "cdg" tcp-cc check to the common sol_tcp_sockopt()
  bpf: Add __bpf_prog_{enter,exit}_struct_ops for struct_ops trampoline
  ...
====================

Link: https://lore.kernel.org/r/20221003194915.11847-1-daniel@iogearbox.netSigned-off-by: Jakub Kicinski <kuba@kernel.org>

a08d97a1

net: netfilter: move bpf_ct_set_nat_info kfunc in nf_nat_bpf.c · 820dc052

Lorenzo Bianconi authored Sep 30, 2022

Remove circular dependency between nf_nat module and nf_conntrack one
moving bpf_ct_set_nat_info kfunc in nf_nat_bpf.c

Fixes: 0fabd2aa ("net: netfilter: add bpf_ct_set_nat_info kfunc helper")
Suggested-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Tested-by: Nathan Chancellor <nathan@kernel.org>
Tested-by: Yauheni Kaliuta <ykaliuta@redhat.com>
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/r/51a65513d2cda3eeb0754842e8025ab3966068d8.1664490511.git.lorenzo@kernel.orgSigned-off-by: Alexei Starovoitov <ast@kernel.org>

820dc052

Documentation: bpf: Add implementation notes documentations to table of contents · 736baae6

Bagas Sanjaya authored Oct 02, 2022

Sphinx reported warnings on missing implementation notes documentations in the
table of contents:

Documentation/bpf/clang-notes.rst: WARNING: document isn't included in any toctree
Documentation/bpf/linux-notes.rst: WARNING: document isn't included in any toctree

Add these documentations to the table of contents (index.rst) of BPF
documentation to fix the warnings.

Link: https://lore.kernel.org/linux-doc/202210020749.yfgDZbRL-lkp@intel.com/
Fixes: 6c7aaffb ("bpf, docs: Move Clang notes to a separate file")
Fixes: 6166da0a ("bpf, docs: Move legacy packet instructions to a separate file")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Bagas Sanjaya <bagasdotme@gmail.com>
Link: https://lore.kernel.org/r/20221002032022.24693-1-bagasdotme@gmail.comSigned-off-by: Alexei Starovoitov <ast@kernel.org>

736baae6

once: add DO_ONCE_SLOW() for sleepable contexts · 62c07983

Eric Dumazet authored Oct 01, 2022

Christophe Leroy reported a ~80ms latency spike
happening at first TCP connect() time.

This is because __inet_hash_connect() uses get_random_once()
to populate a perturbation table which became quite big
after commit 4c2c8f03 ("tcp: increase source port perturb table to 2^16")

get_random_once() uses DO_ONCE(), which block hard irqs for the duration
of the operation.

This patch adds DO_ONCE_SLOW() which uses a mutex instead of a spinlock
for operations where we prefer to stay in process context.

Then __inet_hash_connect() can use get_random_slow_once()
to populate its perturbation table.

Fixes: 4c2c8f03 ("tcp: increase source port perturb table to 2^16")
Fixes: 190cc824 ("tcp: change source port randomizarion at connect() time")
Reported-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Link: https://lore.kernel.org/netdev/CANn89iLAEYBaoYajy0Y9UmGFff5GPxDUoG-ErVB2jDdRNQ5Tug@mail.gmail.com/T/#tSigned-off-by: Eric Dumazet <edumazet@google.com>
Cc: Willy Tarreau <w@1wt.eu>
Tested-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: David S. Miller <davem@davemloft.net>

62c07983