Commits · 63e3564507ea0f2e8eac5ebadf44c430d56d3fbc · Kirill Smelkov / linux

26 Jan, 2023 11 commits

net: microchip: sparx5: Add support for IS0 VCAP ethernet protocol types · 63e35645

Steen Hegelund authored Jan 24, 2023

This allows the IS0 VCAP to have its own list of supported ethernet
protocol types matching what is supported by the VCAPs port lookup
classification.
Signed-off-by: Steen Hegelund <steen.hegelund@microchip.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

63e35645

net: microchip: sparx5: Add automatic selection of VCAP rule actionset · 81e164c4

Steen Hegelund authored Jan 24, 2023

With more than one possible actionset in a VCAP instance, the VCAP API will
now use the actions in a VCAP rule to select the actionset that fits these
actions the best possible way.
Signed-off-by: Steen Hegelund <steen.hegelund@microchip.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

81e164c4

net: microchip: sparx5: Add TC filter chaining support for IS0 and IS2 VCAPs · 88bd9ea7

Steen Hegelund authored Jan 24, 2023

This allows rules to be chained between VCAP instances, e.g. from IS0
Lookup 0 to IS0 Lookup 1, or from one of the IS0 Lookups to one of the IS2
Lookups.

Chaining from an IS2 Lookup to another IS2 Lookup is not supported in the
hardware.
Signed-off-by: Steen Hegelund <steen.hegelund@microchip.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

88bd9ea7

net: microchip: sparx5: Add TC support for IS0 VCAP · 542e6e2c

Steen Hegelund authored Jan 24, 2023

This enables the TC command to use the Sparx5 IS0 VCAP
Signed-off-by: Steen Hegelund <steen.hegelund@microchip.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

542e6e2c

net: microchip: sparx5: Add actionset type id information to rule · 7306fcd1

Steen Hegelund authored Jan 24, 2023

This adds the actionset type id to the rule information.  This is needed as
we now have more than one actionset in a VCAP instance (IS0).
Signed-off-by: Steen Hegelund <steen.hegelund@microchip.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

7306fcd1

net: microchip: sparx5: Add IS0 VCAP keyset configuration for Sparx5 · 545609fd

Steen Hegelund authored Jan 24, 2023

This adds the IS0 VCAP port keyset configuration for Sparx5 and also
updates the debugFS support to show the keyset configuration.
Signed-off-by: Steen Hegelund <steen.hegelund@microchip.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

545609fd

net: microchip: sparx5: Add IS0 VCAP model and updated KUNIT VCAP model · f274a659

Steen Hegelund authored Jan 24, 2023

This provides the IS0 (Ingress Stage 0) or CLM VCAP model for Sparx5.
This VCAP provides classification actions for Sparx5.
Signed-off-by: Steen Hegelund <steen.hegelund@microchip.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

f274a659

Merge branch 'add-ip_local_port_range-socket-option' · 3f17e16f

Jakub Kicinski authored Jan 25, 2023

Jakub Sitnicki says:

====================
Add IP_LOCAL_PORT_RANGE socket option

This patch set is a follow up to the "How to share IPv4 addresses by
partitioning the port space" talk given at LPC 2022 [1].

Please see patch #1 for the motivation & the use case description.
Patch #2 adds tests exercising the new option in various scenarios.

Documentation
-------------

Proposed update to the ip(7) man-page:

       IP_LOCAL_PORT_RANGE (since Linux X.Y)
              Set or get the per-socket default local  port  range.  This
              option  can  be  used  to  clamp down the global local port
              range, defined by the ip_local_port_range  /proc  interface
              described below, for a given socket.

              The  option  takes  an uint32_t value with the high 16 bits
              set to the upper range bound, and the low 16  bits  set  to
              the  lower  range  bound.  Range  bounds are inclusive. The
              16-bit values should be in host byte order.

              The lower bound has to be less than the  upper  bound  when
              both  bounds  are  not  zero. Otherwise, setting the option
              fails with EINVAL.

              If either bound is outside of the global local port  range,
              or is zero, then that bound has no effect.

              To  reset  the setting, pass zero as both the upper and the
              lower bound.

Interaction with SELinux bind() hook
------------------------------------

SELinux bind() hook - selinux_socket_bind() - performs a permission check
if the requested local port number lies outside of the netns ephemeral port
range.

The proposed socket option cannot be used change the ephemeral port range
to extend beyond the per-netns port range, as set by
net.ipv4.ip_local_port_range.

Hence, there is no interaction with SELinux, AFAICT.

RFC -> v1
RFC: https://lore.kernel.org/netdev/20220912225308.93659-1-jakub@cloudflare.com/

 * Allow either the high bound or the low bound, or both, to be zero
 * Add getsockopt support
 * Add selftests

Links:
------

[1]: https://lpc.events/event/16/contributions/1349/
====================

Link: https://lore.kernel.org/r/20221221-sockopt-port-range-v6-0-be255cc0e51f@cloudflare.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

3f17e16f

selftests/net: Cover the IP_LOCAL_PORT_RANGE socket option · ae543965

Jakub Sitnicki authored Jan 24, 2023

Exercise IP_LOCAL_PORT_RANGE socket option in various scenarios:

1. pass invalid values to setsockopt
2. pass a range outside of the per-netns port range
3. configure a single-port range
4. exhaust a configured multi-port range
5. check interaction with late-bind (IP_BIND_ADDRESS_NO_PORT)
6. set then get the per-socket port range
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ae543965

inet: Add IP_LOCAL_PORT_RANGE socket option · 91d0b78c

Jakub Sitnicki authored Jan 24, 2023

Users who want to share a single public IP address for outgoing connections
between several hosts traditionally reach for SNAT. However, SNAT requires
state keeping on the node(s) performing the NAT.

A stateless alternative exists, where a single IP address used for egress
can be shared between several hosts by partitioning the available ephemeral
port range. In such a setup:

1. Each host gets assigned a disjoint range of ephemeral ports.
2. Applications open connections from the host-assigned port range.
3. Return traffic gets routed to the host based on both, the destination IP
   and the destination port.

An application which wants to open an outgoing connection (connect) from a
given port range today can choose between two solutions:

1. Manually pick the source port by bind()'ing to it before connect()'ing
   the socket.

   This approach has a couple of downsides:

   a) Search for a free port has to be implemented in the user-space. If
      the chosen 4-tuple happens to be busy, the application needs to retry
      from a different local port number.

      Detecting if 4-tuple is busy can be either easy (TCP) or hard
      (UDP). In TCP case, the application simply has to check if connect()
      returned an error (EADDRNOTAVAIL). That is assuming that the local
      port sharing was enabled (REUSEADDR) by all the sockets.

        # Assume desired local port range is 60_000-60_511
        s = socket(AF_INET, SOCK_STREAM)
        s.setsockopt(SOL_SOCKET, SO_REUSEADDR, 1)
        s.bind(("192.0.2.1", 60_000))
        s.connect(("1.1.1.1", 53))
        # Fails only if 192.0.2.1:60000 -> 1.1.1.1:53 is busy
        # Application must retry with another local port

      In case of UDP, the network stack allows binding more than one socket
      to the same 4-tuple, when local port sharing is enabled
      (REUSEADDR). Hence detecting the conflict is much harder and involves
      querying sock_diag and toggling the REUSEADDR flag [1].

   b) For TCP, bind()-ing to a port within the ephemeral port range means
      that no connecting sockets, that is those which leave it to the
      network stack to find a free local port at connect() time, can use
      the this port.

      IOW, the bind hash bucket tb->fastreuse will be 0 or 1, and the port
      will be skipped during the free port search at connect() time.

2. Isolate the app in a dedicated netns and use the use the per-netns
   ip_local_port_range sysctl to adjust the ephemeral port range bounds.

   The per-netns setting affects all sockets, so this approach can be used
   only if:

   - there is just one egress IP address, or
   - the desired egress port range is the same for all egress IP addresses
     used by the application.

   For TCP, this approach avoids the downsides of (1). Free port search and
   4-tuple conflict detection is done by the network stack:

     system("sysctl -w net.ipv4.ip_local_port_range='60000 60511'")

     s = socket(AF_INET, SOCK_STREAM)
     s.setsockopt(SOL_IP, IP_BIND_ADDRESS_NO_PORT, 1)
     s.bind(("192.0.2.1", 0))
     s.connect(("1.1.1.1", 53))
     # Fails if all 4-tuples 192.0.2.1:60000-60511 -> 1.1.1.1:53 are busy

  For UDP this approach has limited applicability. Setting the
  IP_BIND_ADDRESS_NO_PORT socket option does not result in local source
  port being shared with other connected UDP sockets.

  Hence relying on the network stack to find a free source port, limits the
  number of outgoing UDP flows from a single IP address down to the number
  of available ephemeral ports.

To put it another way, partitioning the ephemeral port range between hosts
using the existing Linux networking API is cumbersome.

To address this use case, add a new socket option at the SOL_IP level,
named IP_LOCAL_PORT_RANGE. The new option can be used to clamp down the
ephemeral port range for each socket individually.

The option can be used only to narrow down the per-netns local port
range. If the per-socket range lies outside of the per-netns range, the
latter takes precedence.

UAPI-wise, the low and high range bounds are passed to the kernel as a pair
of u16 values in host byte order packed into a u32. This avoids pointer
passing.

  PORT_LO = 40_000
  PORT_HI = 40_511

  s = socket(AF_INET, SOCK_STREAM)
  v = struct.pack("I", PORT_HI << 16 | PORT_LO)
  s.setsockopt(SOL_IP, IP_LOCAL_PORT_RANGE, v)
  s.bind(("127.0.0.1", 0))
  s.getsockname()
  # Local address between ("127.0.0.1", 40_000) and ("127.0.0.1", 40_511),
  # if there is a free port. EADDRINUSE otherwise.

[1] https://github.com/cloudflare/cloudflare-blog/blob/232b432c1d57/2022-02-connectx/connectx.py#L116Reviewed-by: Marek Majkowski <marek@cloudflare.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

91d0b78c

net: Kconfig: fix spellos · 6a7a2c18

Randy Dunlap authored Jan 24, 2023

Fix spelling in net/ Kconfig files.
(reported by codespell)
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Pablo Neira Ayuso <pablo@netfilter.org>
Cc: Jozsef Kadlecsik <kadlec@netfilter.org>
Cc: Florian Westphal <fw@strlen.de>
Cc: coreteam@netfilter.org
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Link: https://lore.kernel.org/r/20230124181724.18166-1-rdunlap@infradead.orgSigned-off-by: Jakub Kicinski <kuba@kernel.org>

6a7a2c18

25 Jan, 2023 17 commits

net: ethtool: fix NULL pointer dereference in pause_prepare_data() · f5be9caf

Vladimir Oltean authored Jan 24, 2023

In the following call path:

ethnl_default_dumpit
-> ethnl_default_dump_one
   -> ctx->ops->prepare_data
      -> pause_prepare_data

struct genl_info *info will be passed as NULL, and pause_prepare_data()
dereferences it while getting the extended ack pointer.

To avoid that, just set the extack to NULL if "info" is NULL, since the
netlink extack handling messages know how to deal with that.

The pattern "info ? info->extack : NULL" is present in quite a few other
"prepare_data" implementations, so it's clear that it's a more general
problem to be dealt with at a higher level, but the code should have at
least adhered to the current conventions to avoid the NULL dereference.

Fixes: 04692c90 ("net: ethtool: netlink: retrieve stats from multiple sources (eMAC, pMAC)")
Reported-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reported-by: syzbot+9d44aae2720fc40b8474@syzkaller.appspotmail.com
Signed-off-by: David S. Miller <davem@davemloft.net>

f5be9caf

net: ethtool: fix NULL pointer dereference in stats_prepare_data() · c96de136

Vladimir Oltean authored Jan 24, 2023

In the following call path:

ethnl_default_dumpit
-> ethnl_default_dump_one
   -> ctx->ops->prepare_data
      -> stats_prepare_data

struct genl_info *info will be passed as NULL, and stats_prepare_data()
dereferences it while getting the extended ack pointer.

To avoid that, just set the extack to NULL if "info" is NULL, since the
netlink extack handling messages know how to deal with that.

The pattern "info ? info->extack : NULL" is present in quite a few other
"prepare_data" implementations, so it's clear that it's a more general
problem to be dealt with at a higher level, but the code should have at
least adhered to the current conventions to avoid the NULL dereference.

Fixes: 04692c90 ("net: ethtool: netlink: retrieve stats from multiple sources (eMAC, pMAC)")
Reported-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

c96de136

Merge branch 's390-ism-generalized-interface' · 99db6fb0

David S. Miller authored Jan 25, 2023

Jan Karcher says:

====================
drivers/s390/net/ism: Add generalized interface

Previously, there was no clean separation between SMC-D code and the ISM
device driver.This patch series addresses the situation to make ISM available
for uses outside of SMC-D.
In detail: SMC-D offers an interface via struct smcd_ops, which only the
ISM module implements so far. However, there is no real separation between
the smcd and ism modules, which starts right with the ISM device
initialization, which calls directly into the SMC-D code.
This patch series introduces a new API in the ISM module, which allows
registration of arbitrary clients via include/linux/ism.h: struct ism_client.
Furthermore, it introduces a "pure" struct ism_dev (i.e. getting rid of
dependencies on SMC-D in the device structure), and adds a number of API
calls for data transfers via ISM (see ism_register_dmb() & friends).
Still, the ISM module implements the SMC-D API, and therefore has a number
of internal helper functions for that matter.
Note that the ISM API is consciously kept thin for now (as compared to the
SMC-D API calls), as a number of API calls are only used with SMC-D and
hardly have any meaningful usage beyond SMC-D, e.g. the VLAN-related calls.

v1 -> v2:
  Removed s390x dependency which broke config for other archs.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

99db6fb0

net/smc: De-tangle ism and smc device initialization · 8c81ba20

Stefan Raspl authored Jan 23, 2023

The struct device for ISM devices was part of struct smcd_dev. Move to
struct ism_dev, provide a new API call in struct smcd_ops, and convert
existing SMCD code accordingly.
Furthermore, remove struct smcd_dev from struct ism_dev.
This is the final part of a bigger overhaul of the interfaces between SMC
and ISM.
Signed-off-by: Stefan Raspl <raspl@linux.ibm.com>
Signed-off-by: Jan Karcher <jaka@linux.ibm.com>
Signed-off-by: Wenjia Zhang <wenjia@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

8c81ba20

s390/ism: Consolidate SMC-D-related code · 820f2100

Stefan Raspl authored Jan 23, 2023

The ism module had SMC-D-specific code sprinkled across the entire module.
We are now consolidating the SMC-D-specific parts into the latter parts
of the module, so it becomes more clear what code is intended for use with
ISM, and which parts are glue code for usage in the context of SMC-D.
This is the fourth part of a bigger overhaul of the interfaces between SMC
and ISM.
Signed-off-by: Stefan Raspl <raspl@linux.ibm.com>
Signed-off-by: Jan Karcher <jaka@linux.ibm.com>
Signed-off-by: Wenjia Zhang <wenjia@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

820f2100

net/smc: Separate SMC-D and ISM APIs · 9de4df7b

Stefan Raspl authored Jan 23, 2023

We separate the code implementing the struct smcd_ops API in the ISM
device driver from the functions that may be used by other exploiters of
ISM devices.
Note: We start out small, and don't offer the whole breadth of the ISM
device for public use, as many functions are specific to or likely only
ever used in the context of SMC-D.
This is the third part of a bigger overhaul of the interfaces between SMC
and ISM.
Signed-off-by: Stefan Raspl <raspl@linux.ibm.com>
Signed-off-by: Jan Karcher <jaka@linux.ibm.com>
Signed-off-by: Wenjia Zhang <wenjia@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

9de4df7b

net/smc: Register SMC-D as ISM client · 8747716f

Stefan Raspl authored Jan 23, 2023

Register the smc module with the new ism device driver API.
This is the second part of a bigger overhaul of the interfaces between SMC
and ISM.
Signed-off-by: Stefan Raspl <raspl@linux.ibm.com>
Signed-off-by: Jan Karcher <jaka@linux.ibm.com>
Signed-off-by: Wenjia Zhang <wenjia@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

8747716f

net/ism: Add new API for client registration · 89e7d2ba

Stefan Raspl authored Jan 23, 2023

Add a new API that allows other drivers to concurrently access ISM devices.
To do so, we introduce a new API that allows other modules to register for
ISM device usage. Furthermore, we move the GID to struct ism, where it
belongs conceptually, and rename and relocate struct smcd_event to struct
ism_event.
This is the first part of a bigger overhaul of the interfaces between SMC
and ISM.
Signed-off-by: Stefan Raspl <raspl@linux.ibm.com>
Signed-off-by: Jan Karcher <jaka@linux.ibm.com>
Signed-off-by: Wenjia Zhang <wenjia@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

89e7d2ba

s390/ism: Introduce struct ism_dmb · 1baedb13

Stefan Raspl authored Jan 23, 2023

Conceptually, a DMB is a structure that belongs to ISM devices. However,
SMC currently 'owns' this structure. So future exploiters of ISM devices
would be forced to include SMC headers to work - which is just weird.
Therefore, we switch ISM to struct ism_dmb, introduce a new public header
with the definition (will be populated with further API calls later on),
and, add a thin wrapper to please SMC. Since structs smcd_dmb and ism_dmb
are identical, we can simply convert between the two for now.
Signed-off-by: Stefan Raspl <raspl@linux.ibm.com>
Signed-off-by: Jan Karcher <jaka@linux.ibm.com>
Signed-off-by: Wenjia Zhang <wenjia@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

1baedb13

net/ism: Add missing calls to disable bus-mastering · 462502ff

Stefan Raspl authored Jan 23, 2023

Signed-off-by: Stefan Raspl <raspl@linux.ibm.com>
Signed-off-by: Jan Karcher <jaka@linux.ibm.com>
Signed-off-by: Wenjia Zhang <wenjia@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

462502ff

net/smc: Terminate connections prior to device removal · c40bff41

Stefan Raspl authored Jan 23, 2023

Removing an ISM device prior to terminating its associated connections
doesn't end well.
Signed-off-by: Stefan Raspl <raspl@linux.ibm.com>
Signed-off-by: Jan Karcher <jaka@linux.ibm.com>
Signed-off-by: Wenjia Zhang <wenjia@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

c40bff41

virtio-net: Reduce debug name field size to 16 bytes · d0671115

Parav Pandit authored Jan 23, 2023

virtio queue index can be maximum of 65535. 16 bytes are enough to store
the vq name with the existing string prefix.

With this change, send queue struct saves 24 bytes and receive
queue saves whole cache line worth 64 bytes per structure
due to saving in alignment bytes.

Pahole results before:

pahole -s drivers/net/virtio_net.o | \
    grep -e "send_queue" -e "receive_queue"
send_queue      1112    0
receive_queue   1280    1

Pahole results after:
pahole -s drivers/net/virtio_net.o | \
    grep -e "send_queue" -e "receive_queue"
send_queue      1088    0
receive_queue   1216    1
Signed-off-by: Parav Pandit <parav@nvidia.com>
Reviewed-by: Pavan Chebbi <pavan.chebbi@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

d0671115

devlink: remove a dubious assumption in fmsg dumping · 4373a023

Jakub Kicinski authored Jan 23, 2023

Build bot detects that err may be returned uninitialized in
devlink_fmsg_prepare_skb(). This is not really true because
all fmsgs users should create at least one outer nest, and
therefore fmsg can't be completely empty.

That said the assumption is not trivial to confirm, so let's
follow the bots advice, anyway.

This code does not seem to have changed since its inception in
commit 1db64e87 ("devlink: Add devlink formatted message (fmsg) API")
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/20230124035231.787381-1-kuba@kernel.orgSigned-off-by: Jakub Kicinski <kuba@kernel.org>

4373a023

net: mscc: ocelot: fix incorrect verify_enabled reporting in ethtool get_mm() · 28113cfa

Vladimir Oltean authored Jan 23, 2023

We don't read the verify_enabled variable from hardware in the MAC Merge
layer state GET operation, instead we always leave it set to "false".
The user may think something is wrong if they set verify_enabled to
true, then read it back and see it's still false, even though the
configuration took place.

Fixes: 6505b680 ("net: mscc: ocelot: add MAC Merge layer support for VSC9959")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Link: https://lore.kernel.org/r/20230123184538.3420098-1-vladimir.oltean@nxp.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

28113cfa

nfp: flower: change get/set_eeprom logic and enable for flower reps · 74b4f173

James Hershaw authored Jan 23, 2023

The changes in this patch are as follows:

- Alter the logic of get/set_eeprom functions to use the helper function
nfp_app_from_netdev() which handles differentiating between an nfp_net
and a nfp_repr. This allows us to get an agnostic backpointer to the
pdev.

- Enable the various eeprom commands by adding the 'get_eeprom_len',
'get_eeprom', 'set_eeprom' callbacks to the nfp_port_ethtool_ops struct.
This allows the eeprom commands to work on representor interfaces,
similar to a previous patch which added it to the vnics.
Currently these are being used to configure persistent MAC addresses for
the physical ports on the nfp.
Signed-off-by: James Hershaw <james.hershaw@corigine.com>
Reviewed-by: Louis Peens <louis.peens@corigine.com>
Signed-off-by: Simon Horman <simon.horman@corigine.com>
Link: https://lore.kernel.org/r/20230123134135.293278-1-simon.horman@corigine.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

74b4f173

ipv6: Make ip6_route_output_flags_noref() static. · 90317bcd

Guillaume Nault authored Jan 23, 2023

This function is only used in net/ipv6/route.c and has no reason to be
visible outside of it.
Signed-off-by: Guillaume Nault <gnault@redhat.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/50706db7f675e40b3594d62011d9363dce32b92e.1674495822.git.gnault@redhat.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

90317bcd

netlink: fix spelling mistake in dump size assert · ec8f7d49

Jakub Kicinski authored Jan 23, 2023

Commit 2c7bc10d ("netlink: add macro for checking dump ctx size")
misspelled the name of the assert as asset, missing an R.
Reported-by: Ido Schimmel <idosch@idosch.org>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://lore.kernel.org/r/20230123222224.732338-1-kuba@kernel.orgSigned-off-by: Jakub Kicinski <kuba@kernel.org>

ec8f7d49

24 Jan, 2023 12 commits

Merge branch 'netlink-protocol-specs' · c554520f

Paolo Abeni authored Jan 24, 2023

Jakub Kicinski says:

====================
Netlink protocol specs

I think the Netlink proto specs are far along enough to merge.
Filling in all attribute types and quirks will be an ongoing
effort but we have enough to cover FOU so it's somewhat complete.

I fully intend to continue polishing the code but at the same
time I'd like to start helping others base their work on the
specs (e.g. DPLL) and need to start working on some new families
myself.

That's the progress / motivation for merging. The RFC [1] has more
of a high level blurb, plus I created a lot of documentation, I'm
not going to repeat it here. There was also the talk at LPC [2].

[1] https://lore.kernel.org/all/20220811022304.583300-1-kuba@kernel.org/
[2] https://youtu.be/9QkXIQXkaQk?t=2562
v2: https://lore.kernel.org/all/20220930023418.1346263-1-kuba@kernel.org/
v3: https://lore.kernel.org/all/20230119003613.111778-1-kuba@kernel.org/1

v4:
 - spec improvements (patch 2)
 - Python cleanup (patch 3)
 - rename auto-gen files and use the right comment style
====================

Link: https://lore.kernel.org/r/20230120175041.342573-1-kuba@kernel.orgSigned-off-by: Paolo Abeni <pabeni@redhat.com>

c554520f

tools: ynl: add a completely generic client · e4b48ed4

Jakub Kicinski authored Jan 20, 2023

Add a CLI sample which can take in arbitrary request
in JSON format, convert it to Netlink and do the inverse
for output.

It's meant as a development tool primarily and perhaps
for selftests which need to tickle netlink in a special way.
Acked-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

e4b48ed4

net: fou: use policy and operation tables generated from the spec · 1d562c32

Jakub Kicinski authored Jan 20, 2023

Generate and plug in the spec-based tables.

A little bit of renaming is needed in the FOU code.
Acked-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

1d562c32

net: fou: rename the source for linking · 08d32323

Jakub Kicinski authored Jan 20, 2023

We'll need to link two objects together to form the fou module.
This means the source can't be called fou, the build system expects
fou.o to be the combined object.
Acked-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

08d32323

net: fou: regenerate the uAPI from the spec · 3a330496

Jakub Kicinski authored Jan 20, 2023

Regenerate the FOU uAPI header from the YAML spec.

The flags now come before attributes which use them,
and the comments for type disappear (coders should look
at the spec instead).
Acked-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

3a330496

netlink: add a proto specification for FOU · 4eb77b4e

Jakub Kicinski authored Jan 20, 2023

FOU has a reasonably modern Genetlink family. Add a spec.
Acked-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

4eb77b4e

net: add basic C code generators for Netlink · be5bea1c

Jakub Kicinski authored Jan 20, 2023

Code generators to turn Netlink specs into C code.
I'm definitely not proud of it.

The main generator is in Python, there's a bash script
to regen all code-gen'ed files in tree after making
spec changes.
Acked-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

be5bea1c

netlink: add schemas for YAML specs · e616c07c

Jakub Kicinski authored Jan 20, 2023

Add schemas for Netlink spec files. As described in the docs
we have 4 "protocols" or compatibility levels, and each one
comes with its own schema, but the more general / legacy
schemas are superset of more modern ones: genetlink is
the smallest followed by genetlink-c and genetlink-legacy.
There is no schema for raw netlink, yet, I haven't found the time..

I don't know enough jsonschema to do inheritance or something
but the repetition is not too bad. I hope.
Acked-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

e616c07c

docs: add more netlink docs (incl. spec docs) · 9d6a6507

Jakub Kicinski authored Jan 20, 2023

Add documentation about the upcoming Netlink protocol specs.
Reviewed-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Reviewed-by: Bagas Sanjaya <bagasdotme@gmail.com>
Acked-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

9d6a6507

Merge branch 'net-sched-use-the-backlog-for-nested-mirred-ingress' · d961bee4

Paolo Abeni authored Jan 24, 2023

Davide Caratti says:

====================
net/sched: use the backlog for nested mirred ingress

TC mirred has a protection against excessive stack growth, but that
protection doesn't really guarantee the absence of recursion, nor
it guards against loops. Patch 1/2 rewords "recursion" to "nesting" to
make this more clear.
We can leverage on this existing mechanism to prevent TCP / SCTP from doing
soft lock-up in some specific scenarios that uses mirred egress->ingress:
patch 2 changes mirred so that the networking backlog is used for nested
mirred ingress actions.
====================

Link: https://lore.kernel.org/r/cover.1674233458.git.dcaratti@redhat.comSigned-off-by: Paolo Abeni <pabeni@redhat.com>

d961bee4

act_mirred: use the backlog for nested calls to mirred ingress · ca22da2f

Davide Caratti authored Jan 20, 2023

William reports kernel soft-lockups on some OVS topologies when TC mirred
egress->ingress action is hit by local TCP traffic [1].
The same can also be reproduced with SCTP (thanks Xin for verifying), when
client and server reach themselves through mirred egress to ingress, and
one of the two peers sends a "heartbeat" packet (from within a timer).

Enqueueing to backlog proved to fix this soft lockup; however, as Cong
noticed [2], we should preserve - when possible - the current mirred
behavior that counts as "overlimits" any eventual packet drop subsequent to
the mirred forwarding action [3]. A compromise solution might use the
backlog only when tcf_mirred_act() has a nest level greater than one:
change tcf_mirred_forward() accordingly.

Also, add a kselftest that can reproduce the lockup and verifies TC mirred
ability to account for further packet drops after TC mirred egress->ingress
(when the nest level is 1).

 [1] https://lore.kernel.org/netdev/33dc43f587ec1388ba456b4915c75f02a8aae226.1663945716.git.dcaratti@redhat.com/
 [2] https://lore.kernel.org/netdev/Y0w%2FWWY60gqrtGLp@pop-os.localdomain/
 [3] such behavior is not guaranteed: for example, if RPS or skb RX
     timestamping is enabled on the mirred target device, the kernel
     can defer receiving the skb and return NET_RX_SUCCESS inside
     tcf_mirred_forward().
Reported-by: William Zhao <wizhao@redhat.com>
CC: Xin Long <lucien.xin@gmail.com>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

ca22da2f

net/sched: act_mirred: better wording on protection against excessive stack growth · 78dcdffe

Davide Caratti authored Jan 20, 2023

with commit e2ca070f ("net: sched: protect against stack overflow in
TC act_mirred"), act_mirred protected itself against excessive stack growth
using per_cpu counter of nested calls to tcf_mirred_act(), and capping it
to MIRRED_RECURSION_LIMIT. However, such protection does not detect
recursion/loops in case the packet is enqueued to the backlog (for example,
when the mirred target device has RPS or skb timestamping enabled). Change
the wording from "recursion" to "nesting" to make it more clear to readers.

CC: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

78dcdffe