Commits · 77890db10ef04cae55fb858cbb861414f33039a3 · Kirill Smelkov / linux

31 Mar, 2021 40 commits

David S. Miller authored Mar 31, 2021

Vladimir Oltean says:

====================
XDP for NXP ENETC

This series adds support to the enetc driver for the basic XDP primitives.
The ENETC is a network controller found inside the NXP LS1028A SoC,
which is a dual-core Cortex A72 device for industrial networking,
with the CPUs clocked at up to 1.3 GHz. On this platform, there are 4
ENETC ports and a 6-port embedded DSA switch, in a topology that looks
like this:

  +-------------------------------------------------------------------------+
  |                    +--------+ 1 Gbps (typically disabled)               |
  | ENETC PCI          |  ENETC |--------------------------+                |
  | Root Complex       | port 3 |-----------------------+  |                |
  | Integrated         +--------+                       |  |                |
  | Endpoint                                            |  |                |
  |                    +--------+ 2.5 Gbps              |  |                |
  |                    |  ENETC |--------------+        |  |                |
  |                    | port 2 |-----------+  |        |  |                |
  |                    +--------+           |  |        |  |                |
  |                                         |  |        |  |                |
  |                        +------------------------------------------------+
  |                        |             |  Felix |  |  Felix |             |
  |                        | Switch      | port 4 |  | port 5 |             |
  |                        |             +--------+  +--------+             |
  |                        |                                                |
  | +--------+  +--------+ | +--------+  +--------+  +--------+  +--------+ |
  | |  ENETC |  |  ENETC | | |  Felix |  |  Felix |  |  Felix |  |  Felix | |
  | | port 0 |  | port 1 | | | port 0 |  | port 1 |  | port 2 |  | port 3 | |
  +-------------------------------------------------------------------------+
         |          |             |           |            |          |
         v          v             v           v            v          v
       Up to      Up to                      Up to 4x 2.5Gbps
      2.5Gbps     1Gbps

The ENETC ports 2 and 3 can act as DSA masters for the embedded switch.
Because 4 out of the 6 externally-facing ports of the SoC are switch
ports, the most interesting use case for XDP on this device is in fact
XDP_TX on the 2.5Gbps DSA master.

Nonetheless, the results presented below are for IPv4 forwarding between
ENETC port 0 (eno0) and port 1 (eno1) both configured for 1Gbps.
There are two streams of IPv4/UDP datagrams with a frame length of 64
octets delivered at 100% port load to eno0 and to eno1. eno0 has a flow
steering rule to process the traffic on RX ring 0 (CPU 0), and eno1 has
a flow steering rule towards RX ring 1 (CPU 1).

For the IPFWD test, standard IP routing was enabled in the netns.
For the XDP_DROP test, the samples/bpf/xdp1 program was attached to both
eno0 and to eno1.
For the XDP_TX test, the samples/bpf/xdp2 program was attached to both
eno0 and to eno1.
For the XDP_REDIRECT test, the samples/bpf/xdp_redirect program was
attached once to the input of eno0/output of eno1, and twice to the
input of eno1/output of eno0.

Finally, the preliminary results are as follows:

        | IPFWD | XDP_TX | XDP_REDIRECT | XDP_DROP
--------+-------+--------+-------------------------
fps     | 761   | 2535   | 1735         | 2783
Gbps    | 0.51  | 1.71   | 1.17         | n/a

There is a strange phenomenon in my testing sistem where it appears that
one CPU is processing more than the other. I have not investigated this
too much. Also, the code might not be very well optimized (for example
dma_sync_for_device is called with the full ENETC_RXB_DMA_SIZE_XDP).

Design wise, the ENETC is a PCI device with BD rings, so it uses the
MEM_TYPE_PAGE_SHARED memory model, as can typically be seen in Intel
devices. The strategy was to build upon the existing model that the
driver uses, and not change it too much. So you will see things like a
separate NAPI poll function for XDP.

I have only tested with PAGE_SIZE=4096, and since we split pages in
half, it means that MTU-sized frames are scatter/gather (the XDP
headroom + skb_shared_info only leaves us 1476 bytes of data per
buffer). This is sub-optimal, but I would rather keep it this way and
help speed up Lorenzo's series for S/G support through testing, rather
than change the enetc driver to use some other memory model like page_pool.
My code is already structured for S/G, and that works fine for XDP_DROP
and XDP_TX, just not for XDP_REDIRECT, even between two enetc ports.
So the S/G XDP_REDIRECT is stubbed out (the frames are dropped), but
obviously I would like to remove that limitation soon.

Please note that I am rather new to this kind of stuff, I am more of a
control path person, so I would appreciate feedback.

Enough talking, on to the patches.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

77890db1

net: enetc: add support for XDP_REDIRECT · 9d2b68cc

Vladimir Oltean authored Mar 31, 2021

The driver implementation of the XDP_REDIRECT action reuses parts from
XDP_TX, most notably the enetc_xdp_tx function which transmits an array
of TX software BDs. Only this time, the buffers don't have DMA mappings,
we need to create them.

When a BPF program reaches the XDP_REDIRECT verdict for a frame, we can
employ the same buffer reuse strategy as for the normal processing path
and for XDP_PASS: we can flip to the other page half and seed that to
the RX ring.

Note that scatter/gather support is there, but disabled due to lack of
multi-buffer support in XDP (which is added by this series):
https://patchwork.kernel.org/project/netdevbpf/cover/cover.1616179034.git.lorenzo@kernel.org/Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

9d2b68cc

net: enetc: increase RX ring default size · d6a2829e

Vladimir Oltean authored Mar 31, 2021

As explained in the XDP_TX patch, when receiving a burst of frames with
the XDP_TX verdict, there is a momentary dip in the number of available
RX buffers. The system will eventually recover as TX completions will
start kicking in and refilling our RX BD ring again. But until that
happens, we need to survive with as few out-of-buffer discards as
possible.

This increases the memory footprint of the driver in order to avoid
discards at 2.5Gbps line rate 64B packet sizes, the maximum speed
available for testing on 1 port on NXP LS1028A.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

d6a2829e

net: enetc: add support for XDP_TX · 7ed2bc80

Vladimir Oltean authored Mar 31, 2021

For reflecting packets back into the interface they came from, we create
an array of TX software BDs derived from the RX software BDs. Therefore,
we need to extend the TX software BD structure to contain most of the
stuff that's already present in the RX software BD structure, for
reasons that will become evident in a moment.

For a frame with the XDP_TX verdict, we don't reuse any buffer right
away as we do for XDP_DROP (the same page half) or XDP_PASS (the other
page half, same as the skb code path).

Because the buffer transfers ownership from the RX ring to the TX ring,
reusing any page half right away is very dangerous. So what we can do is
we can recycle the same page half as soon as TX is complete.

The code path is:
enetc_poll
-> enetc_clean_rx_ring_xdp
   -> enetc_xdp_tx
   -> enetc_refill_rx_ring
(time passes, another MSI interrupt is raised)
enetc_poll
-> enetc_clean_tx_ring
   -> enetc_recycle_xdp_tx_buff

But that creates a problem, because there is a potentially large time
window between enetc_xdp_tx and enetc_recycle_xdp_tx_buff, period in
which we'll have less and less RX buffers.

Basically, when the ship starts sinking, the knee-jerk reaction is to
let enetc_refill_rx_ring do what it does for the standard skb code path
(refill every 16 consumed buffers), but that turns out to be very
inefficient. The problem is that we have no rx_swbd->page at our
disposal from the enetc_reuse_page path, so enetc_refill_rx_ring would
have to call enetc_new_page for every buffer that we refill (if we
choose to refill at this early stage). Very inefficient, it only makes
the problem worse, because page allocation is an expensive process, and
CPU time is exactly what we're lacking.

Additionally, there is an even bigger problem: if we let
enetc_refill_rx_ring top up the ring's buffers again from the RX path,
remember that the buffers sent to transmission haven't disappeared
anywhere. They will be eventually sent, and processed in
enetc_clean_tx_ring, and an attempt will be made to recycle them.
But surprise, the RX ring is already full of new buffers, because we
were premature in deciding that we should refill. So not only we took
the expensive decision of allocating new pages, but now we must throw
away perfectly good and reusable buffers.

So what we do is we implement an elastic refill mechanism, which keeps
track of the number of in-flight XDP_TX buffer descriptors. We top up
the RX ring only up to the total ring capacity minus the number of BDs
that are in flight (because we know that those BDs will return to us
eventually).

The enetc driver manages 1 RX ring per CPU, and the default TX ring
management is the same. So we do XDP_TX towards the TX ring of the same
index, because it is affined to the same CPU. This will probably not
produce great results when we have a tc-taprio/tc-mqprio qdisc on the
interface, because in that case, the number of TX rings might be
greater, but I didn't add any checks for that yet (mostly because I
didn't know what checks to add).

It should also be noted that we need to change the DMA mapping direction
for RX buffers, since they may now be reflected into the TX ring of the
same device. We choose to use DMA_BIDIRECTIONAL instead of unmapping and
remapping as DMA_TO_DEVICE, because performance is better this way.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

7ed2bc80

net: enetc: add support for XDP_DROP and XDP_PASS · d1b15102

Vladimir Oltean authored Mar 31, 2021

For the RX ring, enetc uses an allocation scheme based on pages split
into two buffers, which is already very efficient in terms of preventing
reallocations / maximizing reuse, so I see no reason why I would change
that.

 +--------+--------+--------+--------+--------+--------+--------+
 |        |        |        |        |        |        |        |
 | half B | half B | half B | half B | half B | half B | half B |
 |        |        |        |        |        |        |        |
 +--------+--------+--------+--------+--------+--------+--------+
 |        |        |        |        |        |        |        |
 | half A | half A | half A | half A | half A | half A | half A | RX ring
 |        |        |        |        |        |        |        |
 +--------+--------+--------+--------+--------+--------+--------+
     ^                                                     ^
     |                                                     |
 next_to_clean                                       next_to_alloc
                                                      next_to_use

                   +--------+--------+--------+--------+--------+
                   |        |        |        |        |        |
                   | half B | half B | half B | half B | half B |
                   |        |        |        |        |        |
 +--------+--------+--------+--------+--------+--------+--------+
 |        |        |        |        |        |        |        |
 | half B | half B | half A | half A | half A | half A | half A | RX ring
 |        |        |        |        |        |        |        |
 +--------+--------+--------+--------+--------+--------+--------+
 |        |        |   ^                                   ^
 | half A | half A |   |                                   |
 |        |        | next_to_clean                   next_to_use
 +--------+--------+
              ^
              |
         next_to_alloc

then when enetc_refill_rx_ring is called, whose purpose is to advance
next_to_use, it sees that it can take buffers up to next_to_alloc, and
it says "oh, hey, rx_swbd->page isn't NULL, I don't need to allocate
one!".

The only problem is that for default PAGE_SIZE values of 4096, buffer
sizes are 2048 bytes. While this is enough for normal skb allocations at
an MTU of 1500 bytes, for XDP it isn't, because the XDP headroom is 256
bytes, and including skb_shared_info and alignment, we end up being able
to make use of only 1472 bytes, which is insufficient for the default
MTU.

To solve that problem, we implement scatter/gather processing in the
driver, because we would really like to keep the existing allocation
scheme. A packet of 1500 bytes is received in a buffer of 1472 bytes and
another one of 28 bytes.

Because the headroom required by XDP is different (and much larger) than
the one required by the network stack, whenever a BPF program is added
or deleted on the port, we drain the existing RX buffers and seed new
ones with the required headroom. We also keep the required headroom in
rx_ring->buffer_offset.

The simplest way to implement XDP_PASS, where an skb must be created, is
to create an xdp_buff based on the next_to_clean RX BDs, but not clear
those BDs from the RX ring yet, just keep the original index at which
the BDs for this frame started. Then, if the verdict is XDP_PASS,
instead of converting the xdb_buff to an skb, we replay a call to
enetc_build_skb (just as in the normal enetc_clean_rx_ring case),
starting from the original BD index.

We would also like to be minimally invasive to the regular RX data path,
and not check whether there is a BPF program attached to the ring on
every packet. So we create a separate RX ring processing function for
XDP.

Because we only install/remove the BPF program while the interface is
down, we forgo the rcu_read_lock() in enetc_clean_rx_ring, since there
shouldn't be any circumstance in which we are processing packets and
there is a potentially freed BPF program attached to the RX ring.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

d1b15102

net: enetc: move up enetc_reuse_page and enetc_page_reusable · 65d0cbb4

Vladimir Oltean authored Mar 31, 2021

For XDP_TX, we need to call enetc_reuse_page from enetc_clean_tx_ring,
so we need to avoid a forward declaration.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

65d0cbb4

net: enetc: clean the TX software BD on the TX confirmation path · 1ee8d6f3

Vladimir Oltean authored Mar 31, 2021

With the future introduction of some new fields into enetc_tx_swbd such
as is_xdp_tx, is_xdp_redirect etc, we need not only to set these bits
to true from the XDP_TX/XDP_REDIRECT code path, but also to false from
the old code paths.

This is because TX software buffer descriptors are kept in a ring that
is shadow of the hardware TX ring, so these structures keep getting
reused, and there is always the possibility that when a software BD is
reused (after we ran a full circle through the TX ring), the old user of
the tx_swbd had set is_xdp_tx = true, and now we are sending a regular
skb, which would need to set is_xdp_tx = false.

To be minimally invasive to the old code paths, let's just scrub the
software TX BD in the TX confirmation path (enetc_clean_tx_ring), once
we know that nobody uses this software TX BD (tx_ring->next_to_clean
hasn't yet been updated, and the TX paths check enetc_bd_unused which
tells them if there's any more space in the TX ring for a new enqueue).
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

1ee8d6f3

net: enetc: add a dedicated is_eof bit in the TX software BD · d504498d

Vladimir Oltean authored Mar 31, 2021

In the transmit path, if we have a scatter/gather frame, it is put into
multiple software buffer descriptors, the last of which has the skb
pointer populated (which is necessary for rearming the TX MSI vector and
for collecting the two-step TX timestamp from the TX confirmation path).

At the moment, this is sufficient, but with XDP_TX, we'll need to
service TX software buffer descriptors that don't have an skb pointer,
however they might be final nonetheless. So add a dedicated bit for
final software BDs that we populate and check explicitly. Also, we keep
looking just for an skb when doing TX timestamping, because we don't
want/need that for XDP.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

d504498d

net: enetc: move skb creation into enetc_build_skb · a800abd3

Vladimir Oltean authored Mar 31, 2021

We need to build an skb from two code paths now: from the plain RX data
path and from the XDP data path when the verdict is XDP_PASS.

Create a new enetc_build_skb function which contains the essential steps
for building an skb based on the first and last positions of buffer
descriptors within the RX ring.

We also squash the enetc_process_skb function into enetc_build_skb,
because what that function did wasn't very meaningful on its own.

The "rx_frm_cnt++" instruction has been moved around napi_gro_receive
for cosmetic reasons, to be in the same spot as rx_byte_cnt++, which
itself must be before napi_gro_receive, because that's when we lose
ownership of the skb.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

a800abd3

net: enetc: consume the error RX buffer descriptors in a dedicated function · 2fa423f5

Vladimir Oltean authored Mar 31, 2021

We can and should check the RX BD errors before starting to build the
skb. The only apparent reason why things are done in this backwards
order is to spare one call to enetc_rxbd_next.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

2fa423f5

ipv6: remove extra dev_hold() for fallback tunnels · 0d7a7b20

Eric Dumazet authored Mar 31, 2021

My previous commits added a dev_hold() in tunnels ndo_init(),
but forgot to remove it from special functions setting up fallback tunnels.

Fallback tunnels do call their respective ndo_init()

This leads to various reports like :

unregister_netdevice: waiting for ip6gre0 to become free. Usage count = 2

Fixes: 48bb5697 ("ip6_tunnel: sit: proper dev_{hold|put} in ndo_[un]init methods")
Fixes: 6289a98f ("sit: proper dev_{hold|put} in ndo_[un]init methods")
Fixes: 40cb881b ("ip6_vti: proper dev_{hold|put} in ndo_[un]init methods")
Fixes: 7f700334 ("ip6_gre: proper dev_{hold|put} in ndo_[un]init methods")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

0d7a7b20

net/tipc: fix missing destroy_workqueue() on error in tipc_crypto_start() · ac1db7ac

Yang Yingliang authored Mar 31, 2021

Add the missing destroy_workqueue() before return from
tipc_crypto_start() in the error handling case.

Fixes: 1ef6f7c9 ("tipc: add automatic session key exchange")
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ac1db7ac

Merge branch 'inet-shrink-netns' · ab1b4f0a

David S. Miller authored Mar 31, 2021

Eric Dumazet says:

====================
inet: shrink netns_ipv{4|6}

This patch series work on reducing footprint of netns_ipv4
and netns_ipv6. Some sysctls are converted to bytes,
and some fields are moves to reduce number of holes
and paddings.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

ab1b4f0a

ipv6: move ip6_dst_ops first in netns_ipv6 · 0dd39d95

Eric Dumazet authored Mar 31, 2021

ip6_dst_ops have cache line alignement.

Moving it at beginning of netns_ipv6
removes a 48 byte hole, and shrinks netns_ipv6
from 12 to 11 cache lines.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

0dd39d95

ipv6: convert elligible sysctls to u8 · a6175633

Eric Dumazet authored Mar 31, 2021

Convert most sysctls that can fit in a byte.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

a6175633

tcp: convert tcp_comp_sack_nr sysctl to u8 · 1c3289c9

Eric Dumazet authored Mar 31, 2021

tcp_comp_sack_nr max value was already 255.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

1c3289c9

ipv4: convert igmp_link_local_mcast_reports sysctl to u8 · 7d4b37eb

Eric Dumazet authored Mar 31, 2021

This sysctl is a bool, can use less storage.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

7d4b37eb

ipv4: convert fib_multipath_{use_neigh|hash_policy} sysctls to u8 · be205fe6

Eric Dumazet authored Mar 31, 2021

Make room for better packing of netns_ipv4
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

be205fe6

ipv4: convert udp_l3mdev_accept sysctl to u8 · cd04bd02

Eric Dumazet authored Mar 31, 2021

Reduce footprint of sysctls.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

cd04bd02

ipv4: convert fib_notify_on_flag_change sysctl to u8 · b2908fac

Eric Dumazet authored Mar 31, 2021

Reduce footprint of sysctls.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

b2908fac

inet: shrink netns_ipv4 by another cache line · 490f33c4

Eric Dumazet authored Mar 31, 2021

By shuffling around some fields to remove 8 bytes of hole,
we can save one cache line.

pahole result before/after the patch :

/* size: 768, cachelines: 12, members: 139 */
/* sum members: 673, holes: 11, sum holes: 39 */
/* padding: 56 */
/* paddings: 2, sum paddings: 7 */
/* forced alignments: 1 */

->

/* size: 704, cachelines: 11, members: 139 */
/* sum members: 673, holes: 10, sum holes: 31 */
/* paddings: 2, sum paddings: 7 */
/* forced alignments: 1 */
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

490f33c4

inet: shrink inet_timewait_death_row by 48 bytes · 1caf8d39

Eric Dumazet authored Mar 31, 2021

struct inet_timewait_death_row uses two cache lines, because we want
tw_count to use a full cache line to avoid false sharing.

Rework its definition and placement in netns_ipv4 so that:

1) We add 60 bytes of padding after tw_count to avoid
  false sharing, knowing that tcp_death_row will
  have ____cacheline_aligned_in_smp attribute.

2) We do not risk padding before tcp_death_row, because
  we move it at the beginning of netns_ipv4, even if new
 fields are added later.

3) We do not waste 48 bytes of padding after it.

Note that I have not changed dccp.

pahole result for struct netns_ipv4 before/after the patch :

/* size: 832, cachelines: 13, members: 139 */
/* sum members: 721, holes: 12, sum holes: 95 */
/* padding: 16 */
/* paddings: 2, sum paddings: 55 */

->

/* size: 768, cachelines: 12, members: 139 */
/* sum members: 673, holes: 11, sum holes: 39 */
/* padding: 56 */
/* paddings: 2, sum paddings: 7 */
/* forced alignments: 1 */
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

1caf8d39

Merge branch 'net-coding-style' · 30b8817f

David S. Miller authored Mar 31, 2021

Weihang Li says:

====================
net: fix some coding style issues

Do some cleanups according to the coding style of kernel, including wrong
print type, redundant and missing spaces and so on.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

30b8817f

net: lpc_eth: fix format warnings of block comments · 44d043b5

Yangyang Li authored Mar 31, 2021

Fix the following format warning:
1. Block comments use * on subsequent lines
2. Block comments use a trailing */ on a separate line
Signed-off-by: Yangyang Li <liyangyang20@huawei.com>
Signed-off-by: Weihang Li <liweihang@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>