Commits · 695bc1f32c6bc806218b322096b23dfd601e59ca · Kirill Smelkov / linux

27 Apr, 2023 3 commits

SUNRPC: Clear rq_xid when receiving a new RPC Call · 695bc1f3

Chuck Lever authored Apr 17, 2023

This is an eye-catcher for tracepoints that record the XID: it means
svc_rqst() has not received a full RPC Call with an XID yet.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

695bc1f3

SUNRPC: Recognize control messages in server-side TCP socket code · 5e052dda

Chuck Lever authored Apr 17, 2023

To support kTLS, the server-side TCP socket receive path needs to
watch for CMSGs.
Acked-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

5e052dda

SUNRPC: Be even lazier about releasing pages · 6a0cdf56

Chuck Lever authored Apr 14, 2023

A single RPC transaction that touches only a couple of pages means
rq_pvec will not be even close to full in svc_xpt_release(). This is
a common case.

Instead, just leave the pages in rq_pvec until it is completely
full. This improves the efficiency of the batch release mechanism
on workloads that involve small RPC messages.

The rq_pvec is also fully emptied just before thread exit.
Reviewed-by: Calum Mackay <calum.mackay@oracle.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

6a0cdf56

26 Apr, 2023 26 commits

SUNRPC: Convert svc_xprt_release() to the release_pages() API · 647a2a64

Chuck Lever authored Apr 14, 2023

Instead of invoking put_page() one-at-a-time, pass the "response"
portion of rq_pages directly to release_pages() to reduce the number
of times each nfsd thread invokes a page allocator API.

Since svc_xprt_release() is not invoked while a client is waiting
for an RPC Reply, this is not expected to directly impact mean
request latencies on a lightly or moderately loaded server. However
as workload intensity increases, I expect somewhat better
scalability: the same number of server threads should be able to
handle more work.
Reviewed-by: Calum Mackay <calum.mackay@oracle.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

647a2a64

SUNRPC: Relocate svc_free_res_pages() · b20cb39d

Chuck Lever authored Apr 14, 2023

Clean-up: There doesn't seem to be a reason why this function is
stuck in a header. One thing it prevents is the convenient addition
of tracing. Moving it to a source file also makes the rq_respages
clean-up logic easier to find.
Reviewed-by: Calum Mackay <calum.mackay@oracle.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

b20cb39d

nfsd: simplify the delayed disposal list code · 92e4a673

Jeff Layton authored Apr 14, 2023

When queueing a dispose list to the appropriate "freeme" lists, it
pointlessly queues the objects one at a time to an intermediate list.

Remove a few helpers and just open code a list_move to make it more
clear and efficient. Better document the resulting functions with
kerneldoc comments.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

92e4a673

SUNRPC: Ignore return value of ->xpo_sendto · 55fcc7d9

Chuck Lever authored Apr 03, 2023

Clean up: All callers of svc_process() ignore its return value, so
svc_process() can safely be converted to return void. Ditto for
svc_send().

The return value of ->xpo_sendto() is now used only as part of a
trace event.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

55fcc7d9

SUNRPC: Ensure server-side sockets have a sock->file · ae0d7770

Chuck Lever authored Mar 07, 2023

The TLS handshake upcall mechanism requires a non-NULL sock->file on
the socket it hands to user space. svc_sock_free() already releases
sock->file properly if one exists.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

ae0d7770

NFSD: Watch for rq_pages bounds checking errors in nfsd_splice_actor() · 0f516248

Chuck Lever authored Mar 17, 2023

There have been several bugs over the years where the NFSD splice
actor has attempted to write outside the rq_pages array.

This is a "should never happen" condition, but if for some reason
the pipe splice actor should attempt to walk past the end of
rq_pages, it needs to terminate the READ operation to prevent
corruption of the pointer addresses in the fields just beyond the
array.

A server crash is thus prevented. Since the code is not behaving,
the READ operation returns -EIO to the client. None of the READ
payload data can be trusted if the splice actor isn't operating as
expected.
Suggested-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>

0f516248

sunrpc: simplify two-level sysctl registration for svcrdma_parm_table · 376bcd9b

Luis Chamberlain authored Mar 11, 2023

There is no need to declare two tables to just create directories,
this can be easily be done with a prefix path with register_sysctl().

Simplify this registration.
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

376bcd9b

SUNRPC: return proper error from get_expiry() · cf64b9bc

NeilBrown authored Mar 08, 2023

The get_expiry() function currently returns a timestamp, and uses the
special return value of 0 to indicate an error.

Unfortunately this causes a problem when 0 is the correct return value.

On a system with no RTC it is possible that the boot time will be seen
to be "3".  When exportfs probes to see if a particular filesystem
supports NFS export it tries to cache information with an expiry time of
"3".  The intention is for this to be "long in the past".  Even with no
RTC it will not be far in the future (at most a second or two) so this
is harmless.
But if the boot time happens to have been calculated to be "3", then
get_expiry will fail incorrectly as it converts the number to "seconds
since bootime" - 0.

To avoid this problem we change get_expiry() to report the error quite
separately from the expiry time.  The error is now the return value.
The expiry time is reported through a by-reference parameter.
Reported-by: Jerry Zhang <jerry@skydio.com>
Tested-by: Jerry Zhang <jerry@skydio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

cf64b9bc

lockd: add some client-side tracepoints · 2f90e18f

Jeff Layton authored Mar 03, 2023

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

2f90e18f

nfs: move nfs_fhandle_hash to common include file · e59fb674

Jeff Layton authored Mar 03, 2023

lockd needs to be able to hash filehandles for tracepoints. Move the
nfs_fhandle_hash() helper to a common nfs include file.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

e59fb674

lockd: server should unlock lock if client rejects the grant · 244cc191

Jeff Layton authored Mar 03, 2023

Currently lockd just dequeues the block and ignores it if the client
sends a GRANT_RES with a status of nlm_lck_denied. That status is an
indicator that the client has rejected the lock, so the right thing to
do is to unlock the lock we were trying to grant.
Reported-by: Yongcheng Yang <yoyang@redhat.com>
Link: https://bugzilla.redhat.com/show_bug.cgi?id=2063818Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

244cc191

lockd: fix races in client GRANTED_MSG wait logic · 2005f5b9

Jeff Layton authored Mar 03, 2023

After the wait for a grant is done (for whatever reason), nlmclnt_block
updates the status of the nlm_rqst with the status of the block. At the
point it does this, however, the block is still queued its status could
change at any time.

This is particularly a problem when the waiting task is signaled during
the wait. We can end up giving up on the lock just before the GRANTED_MSG
callback comes in, and accept it even though the lock request gets back
an error, leaving a dangling lock on the server.

Since the nlm_wait never lives beyond the end of nlmclnt_lock, put it on
the stack and add functions to allow us to enqueue and dequeue the
block. Enqueue it just before the lock/wait loop, and dequeue it
just after we exit the loop instead of waiting until the end of
the function. Also, scrape the status at the time that we dequeue it to
ensure that it's final.
Reported-by: Yongcheng Yang <yoyang@redhat.com>
Link: https://bugzilla.redhat.com/show_bug.cgi?id=2063818Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

2005f5b9

lockd: move struct nlm_wait to lockd.h · f0aa4852

Jeff Layton authored Mar 03, 2023

The next patch needs struct nlm_wait in fs/lockd/clntproc.c, so move
the definition to a shared header file. As an added clean-up, drop
the unused b_reclaim field.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

f0aa4852

lockd: remove 2 unused helper functions · c88c680c

Jeff Layton authored Mar 03, 2023

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

c88c680c

lockd: purge resources held on behalf of nlm clients when shutting down · bfca7a6f

Jeff Layton authored Mar 03, 2023

It's easily possible for the server to have an outstanding lock when we
go to shut down. When that happens, we often get a warning like this in
the kernel log:

    lockd: couldn't shutdown host module for net f0000000!

This is because the shutdown procedures skip removing any hosts that
still have outstanding resources (locks). Eventually, things seem to get
cleaned up anyway, but the log message is unsettling, and server
shutdown doesn't seem to be working the way it was intended.

Ensure that we tear down any resources held on behalf of a client when
tearing one down for server shutdown.
Reported-by: Yongcheng Yang <yoyang@redhat.com>
Link: https://bugzilla.redhat.com/show_bug.cgi?id=2063818Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

bfca7a6f

NFSD: Convert filecache to rhltable · c4c649ab

Chuck Lever authored Nov 24, 2022

While we were converting the nfs4_file hashtable to use the kernel's
resizable hashtable data structure, Neil Brown observed that the
list variant (rhltable) would be better for managing nfsd_file items
as well. The nfsd_file hash table will contain multiple entries for
the same inode -- these should be kept together on a list. And, it
could be possible for exotic or malicious client behavior to cause
the hash table to resize itself on every insertion.

A nice simplification is that rhltable_lookup() can return a list
that contains only nfsd_file items that match a given inode, which
enables us to eliminate specialized hash table helper functions and
use the default functions provided by the rhashtable implementation).

Since we are now storing nfsd_file items for the same inode on a
single list, that effectively reduces the number of hash entries
that have to be tracked in the hash table. The mininum bucket count
is therefore lowered.

Light testing with fstests generic/531 show no regressions.
Suggested-by: Neil Brown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

c4c649ab

nfsd: allow reaping files still under writeback · dcb779fc

Jeff Layton authored Feb 15, 2023

On most filesystems, there is no reason to delay reaping an nfsd_file
just because its underlying inode is still under writeback. nfsd just
relies on client activity or the local flusher threads to do writeback.

The main exception is NFS, which flushes all of its dirty data on last
close. Add a new EXPORT_OP_FLUSH_ON_CLOSE flag to allow filesystems to
signal that they do this, and only skip closing files under writeback on
such filesystems.

Also, remove a redundant NULL file pointer check in
nfsd_file_check_writeback, and clean up nfs's export op flag
definitions.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Acked-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

dcb779fc

nfsd: update comment over __nfsd_file_cache_purge · 972cc0e0

Jeff Layton authored Jan 26, 2023

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

972cc0e0

nfsd: don't take/put an extra reference when putting a file · b2ff1bd7

Jeff Layton authored Jan 18, 2023

The last thing that filp_close does is an fput, so don't bother taking
and putting the extra reference.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

b2ff1bd7

nfsd: add some comments to nfsd_file_do_acquire · b680cb9b

Jeff Layton authored Jan 05, 2023

David Howells mentioned that he found this bit of code confusing, so
sprinkle in some comments to clarify.
Reported-by: David Howells <dhowells@redhat.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

b680cb9b

nfsd: don't kill nfsd_files because of lease break error · c6593366

Jeff Layton authored Jan 05, 2023

An error from break_lease is non-fatal, so we needn't destroy the
nfsd_file in that case. Just put the reference like we normally would
and return the error.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

c6593366

nfsd: simplify test_bit return in NFSD_FILE_KEY_FULL comparator · d69b8dbf

Jeff Layton authored Jan 06, 2023

test_bit returns bool, so we can just compare the result of that to the
key->gc value without the "!!".
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

d69b8dbf

nfsd: NFSD_FILE_KEY_INODE only needs to find GC'ed entries · 6c31e4c9

Jeff Layton authored Jan 06, 2023

Since v4 files are expected to be long-lived, there's little value in
closing them out of the cache when there is conflicting access.

Change the comparator to also match the gc value in the key. Change both
of the current users of that key to set the gc value in the key to
"true".
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

6c31e4c9

nfsd: don't open-code clear_and_wake_up_bit · b8bea9f6

Jeff Layton authored Jan 05, 2023

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

b8bea9f6

net: phy: hide the PHYLIB_LEDS knob · 9b78d919

Paolo Abeni authored Apr 26, 2023

commit 4bb7aac7 ("net: phy: fix circular LEDS_CLASS dependencies")
solved a build failure, but introduces a new config knob with a default
'y' value: PHYLIB_LEDS.

The latter is against the current new config policy. The exception
was raised to allow the user to catch bad configurations without led
support.

Anyway the current definition of PHYLIB_LEDS does not fit the above
goal: if LEDS_CLASS is disabled, the new config will be available
only with PHYLIB disabled, too.

Hide the mentioned config, to preserve the randconfig testing done so
far, while respecting the mentioned policy.
Suggested-by: Andrew Lunn <andrew@lunn.ch>
Suggested-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Link: https://lore.kernel.org/r/d82489be8ed911c383c3447e9abf469995ccf39a.1682496488.git.pabeni@redhat.comSigned-off-by: Paolo Abeni <pabeni@redhat.com>

9b78d919

Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net · c248b27c
Paolo Abeni authored Apr 26, 2023
```
No conflicts.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
```
c248b27c

25 Apr, 2023 11 commits

net: phy: marvell-88x2222: remove unnecessary (void*) conversions · 28b17f62

wuych authored Apr 25, 2023

Pointer variables of void * type do not require type cast.
Signed-off-by: wuych <yunchuan@nfschina.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

28b17f62

tcp/udp: Fix memleaks of sk and zerocopy skbs with TX timestamp. · 50749f2d

Kuniyuki Iwashima authored Apr 24, 2023

syzkaller reported [0] memory leaks of an UDP socket and ZEROCOPY
skbs.  We can reproduce the problem with these sequences:

  sk = socket(AF_INET, SOCK_DGRAM, 0)
  sk.setsockopt(SOL_SOCKET, SO_TIMESTAMPING, SOF_TIMESTAMPING_TX_SOFTWARE)
  sk.setsockopt(SOL_SOCKET, SO_ZEROCOPY, 1)
  sk.sendto(b'', MSG_ZEROCOPY, ('127.0.0.1', 53))
  sk.close()

sendmsg() calls msg_zerocopy_alloc(), which allocates a skb, sets
skb->cb->ubuf.refcnt to 1, and calls sock_hold().  Here, struct
ubuf_info_msgzc indirectly holds a refcnt of the socket.  When the
skb is sent, __skb_tstamp_tx() clones it and puts the clone into
the socket's error queue with the TX timestamp.

When the original skb is received locally, skb_copy_ubufs() calls
skb_unclone(), and pskb_expand_head() increments skb->cb->ubuf.refcnt.
This additional count is decremented while freeing the skb, but struct
ubuf_info_msgzc still has a refcnt, so __msg_zerocopy_callback() is
not called.

The last refcnt is not released unless we retrieve the TX timestamped
skb by recvmsg().  Since we clear the error queue in inet_sock_destruct()
after the socket's refcnt reaches 0, there is a circular dependency.
If we close() the socket holding such skbs, we never call sock_put()
and leak the count, sk, and skb.

TCP has the same problem, and commit e0c8bccd ("net: stream:
purge sk_error_queue in sk_stream_kill_queues()") tried to fix it
by calling skb_queue_purge() during close().  However, there is a
small chance that skb queued in a qdisc or device could be put
into the error queue after the skb_queue_purge() call.

In __skb_tstamp_tx(), the cloned skb should not have a reference
to the ubuf to remove the circular dependency, but skb_clone() does
not call skb_copy_ubufs() for zerocopy skb.  So, we need to call
skb_orphan_frags_rx() for the cloned skb to call skb_copy_ubufs().

[0]:
BUG: memory leak
unreferenced object 0xffff88800c6d2d00 (size 1152):
  comm "syz-executor392", pid 264, jiffies 4294785440 (age 13.044s)
  hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 cd af e8 81 00 00 00 00  ................
    02 00 07 40 00 00 00 00 00 00 00 00 00 00 00 00  ...@............
  backtrace:
    [<0000000055636812>] sk_prot_alloc+0x64/0x2a0 net/core/sock.c:2024
    [<0000000054d77b7a>] sk_alloc+0x3b/0x800 net/core/sock.c:2083
    [<0000000066f3c7e0>] inet_create net/ipv4/af_inet.c:319 [inline]
    [<0000000066f3c7e0>] inet_create+0x31e/0xe40 net/ipv4/af_inet.c:245
    [<000000009b83af97>] __sock_create+0x2ab/0x550 net/socket.c:1515
    [<00000000b9b11231>] sock_create net/socket.c:1566 [inline]
    [<00000000b9b11231>] __sys_socket_create net/socket.c:1603 [inline]
    [<00000000b9b11231>] __sys_socket_create net/socket.c:1588 [inline]
    [<00000000b9b11231>] __sys_socket+0x138/0x250 net/socket.c:1636
    [<000000004fb45142>] __do_sys_socket net/socket.c:1649 [inline]
    [<000000004fb45142>] __se_sys_socket net/socket.c:1647 [inline]
    [<000000004fb45142>] __x64_sys_socket+0x73/0xb0 net/socket.c:1647
    [<0000000066999e0e>] do_syscall_x64 arch/x86/entry/common.c:50 [inline]
    [<0000000066999e0e>] do_syscall_64+0x38/0x90 arch/x86/entry/common.c:80
    [<0000000017f238c1>] entry_SYSCALL_64_after_hwframe+0x63/0xcd

BUG: memory leak
unreferenced object 0xffff888017633a00 (size 240):
  comm "syz-executor392", pid 264, jiffies 4294785440 (age 13.044s)
  hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    00 00 00 00 00 00 00 00 00 2d 6d 0c 80 88 ff ff  .........-m.....
  backtrace:
    [<000000002b1c4368>] __alloc_skb+0x229/0x320 net/core/skbuff.c:497
    [<00000000143579a6>] alloc_skb include/linux/skbuff.h:1265 [inline]
    [<00000000143579a6>] sock_omalloc+0xaa/0x190 net/core/sock.c:2596
    [<00000000be626478>] msg_zerocopy_alloc net/core/skbuff.c:1294 [inline]
    [<00000000be626478>] msg_zerocopy_realloc+0x1ce/0x7f0 net/core/skbuff.c:1370
    [<00000000cbfc9870>] __ip_append_data+0x2adf/0x3b30 net/ipv4/ip_output.c:1037
    [<0000000089869146>] ip_make_skb+0x26c/0x2e0 net/ipv4/ip_output.c:1652
    [<00000000098015c2>] udp_sendmsg+0x1bac/0x2390 net/ipv4/udp.c:1253
    [<0000000045e0e95e>] inet_sendmsg+0x10a/0x150 net/ipv4/af_inet.c:819
    [<000000008d31bfde>] sock_sendmsg_nosec net/socket.c:714 [inline]
    [<000000008d31bfde>] sock_sendmsg+0x141/0x190 net/socket.c:734
    [<0000000021e21aa4>] __sys_sendto+0x243/0x360 net/socket.c:2117
    [<00000000ac0af00c>] __do_sys_sendto net/socket.c:2129 [inline]
    [<00000000ac0af00c>] __se_sys_sendto net/socket.c:2125 [inline]
    [<00000000ac0af00c>] __x64_sys_sendto+0xe1/0x1c0 net/socket.c:2125
    [<0000000066999e0e>] do_syscall_x64 arch/x86/entry/common.c:50 [inline]
    [<0000000066999e0e>] do_syscall_64+0x38/0x90 arch/x86/entry/common.c:80
    [<0000000017f238c1>] entry_SYSCALL_64_after_hwframe+0x63/0xcd

Fixes: f214f915 ("tcp: enable MSG_ZEROCOPY")
Fixes: b5947e5d ("udp: msg_zerocopy")
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

50749f2d

net: amd: Fix link leak when verifying config failed · d325c34d

Gencen Gan authored Apr 24, 2023

After failing to verify configuration, it returns directly without
releasing link, which may cause memory leak.

Paolo Abeni thinks that the whole code of this driver is quite
"suboptimal" and looks unmainatained since at least ~15y, so he
suggests that we could simply remove the whole driver, please
take it into consideration.

Simon Horman suggests that the fix label should be set to
"Linux-2.6.12-rc2" considering that the problem has existed
since the driver was introduced and the commit above doesn't
seem to exist in net/net-next.

Fixes: 1da177e4 ("Linux-2.6.12-rc2")
Signed-off-by: Gan Gecen <gangecen@hust.edu.cn>
Reviewed-by: Dongliang Mu <dzm91@hust.edu.cn>
Signed-off-by: David S. Miller <davem@davemloft.net>

d325c34d

net: phy: marvell: Fix inconsistent indenting in led_blink_set · 4774ad84

Christian Marangi authored Apr 23, 2023

Fix inconsistent indeinting in m88e1318_led_blink_set reported by kernel
test robot, probably done by the presence of an if condition dropped in
later revision of the same code.
Reported-by: kernel test robot <lkp@intel.com>
Link: https://lore.kernel.org/oe-kbuild-all/202304240007.0VEX8QYG-lkp@intel.com/
Fixes: ea9e8648 ("net: phy: marvell: Implement led_blink_set()")
Signed-off-by: Christian Marangi <ansuelsmth@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://lore.kernel.org/r/20230423172800.3470-1-ansuelsmth@gmail.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

4774ad84

lan966x: Don't use xdp_frame when action is XDP_TX · 700f11eb

Horatiu Vultur authored Apr 22, 2023

When the action of an xdp program was XDP_TX, lan966x was creating
a xdp_frame and use this one to send the frame back. But it is also
possible to send back the frame without needing a xdp_frame, because
it is possible to send it back using the page.
And then once the frame is transmitted is possible to use directly
page_pool_recycle_direct as lan966x is using page pools.
This would save some CPU usage on this path, which results in higher
number of transmitted frames. Bellow are the statistics:
Frame size:    Improvement:
64                ~8%
256              ~11%
512               ~8%
1000              ~0%
1500              ~0%
Signed-off-by: Horatiu Vultur <horatiu.vultur@microchip.com>
Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Link: https://lore.kernel.org/r/20230422142344.3630602-1-horatiu.vultur@microchip.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

700f11eb

Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next · ee3392ed

Jakub Kicinski authored Apr 24, 2023

Alexei Starovoitov says:

====================
pull-request: bpf-next 2023-04-24

We've added 5 non-merge commits during the last 3 day(s) which contain
a total of 7 files changed, 87 insertions(+), 44 deletions(-).

The main changes are:

1) Workaround for bpf iter selftest due to lack of subprog support
   in precision tracking, from Andrii.

2) Disable bpf_refcount_acquire kfunc until races are fixed, from Dave.

3) One more test_verifier test converted from asm macro to asm in C,
   from Eduard.

4) Fix build with NETFILTER=y INET=n config, from Florian.

5) Add __rcu_read_{lock,unlock} into deny list, from Yafang.

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next:
  selftests/bpf: avoid mark_all_scalars_precise() trigger in one of iter tests
  bpf: Add __rcu_read_{lock,unlock} into btf id deny list
  bpf: Disable bpf_refcount_acquire kfunc calls until race conditions are fixed
  selftests/bpf: verifier/prevent_map_lookup converted to inline assembly
  bpf: fix link failure with NETFILTER=y INET=n
====================

Link: https://lore.kernel.org/r/20230425005648.86714-1-alexei.starovoitov@gmail.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

ee3392ed

Merge branch 'tsnep-xdp-socket-zero-copy-support' · 9610a8dc

Jakub Kicinski authored Apr 24, 2023

Gerhard Engleder says:

====================
tsnep: XDP socket zero-copy support

Implement XDP socket zero-copy support for tsnep driver. I tried to
follow existing drivers like igc as far as possible. But one main
difference is that tsnep does not need any reconfiguration for XDP BPF
program setup. So I decided to keep this behavior no matter if a XSK
pool is used or not. As a result, tsnep starts using the XSK pool even
if no XDP BPF program is available.

Another difference is that I tried to prevent potentially failing
allocations during XSK pool setup. E.g. both memory models for page pool
and XSK pool are registered all the time. Thus, XSK pool setup cannot
end up with not working queues.

Some prework is done to reduce the last two XSK commits to actual XSK
changes.
====================

Link: https://lore.kernel.org/r/20230421194656.48063-1-gerhard@engleder-embedded.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

9610a8dc

tsnep: Add XDP socket zero-copy TX support · cd275c23

Gerhard Engleder authored Apr 21, 2023

Send and complete XSK pool frames within TX NAPI context. NAPI context
is triggered by ndo_xsk_wakeup.

Test results with A53 1.2GHz:

xdpsock txonly copy mode, 64 byte frames:
                   pps            pkts           1.00
tx                 284,409        11,398,144
Two CPUs with 100% and 10% utilization.

xdpsock txonly zero-copy mode, 64 byte frames:
                   pps            pkts           1.00
tx                 511,929        5,890,368
Two CPUs with 100% and 1% utilization.

xdpsock l2fwd copy mode, 64 byte frames:
                   pps            pkts           1.00
rx                 248,985        7,315,885
tx                 248,921        7,315,885
Two CPUs with 100% and 10% utilization.

xdpsock l2fwd zero-copy mode, 64 byte frames:
                   pps            pkts           1.00
rx                 254,735        3,039,456
tx                 254,735        3,039,456
Two CPUs with 100% and 4% utilization.

Packet rate increases and CPU utilization is reduced in both cases.
Signed-off-by: Gerhard Engleder <gerhard@engleder-embedded.com>
Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

cd275c23

tsnep: Add XDP socket zero-copy RX support · 3fc23339

Gerhard Engleder authored Apr 21, 2023

Add support for XSK zero-copy to RX path. The setup of the XSK pool can
be done at runtime. If the netdev is running, then the queue must be
disabled and enabled during reconfiguration. This can be done easily
with functions introduced in previous commits.

A more important property is that, if the netdev is running, then the
setup of the XSK pool shall not stop the netdev in case of errors. A
broken netdev after a failed XSK pool setup is bad behavior. Therefore,
the allocation and setup of resources during XSK pool setup is done only
before any queue is disabled. Additionally, freeing and later allocation
of resources is eliminated in some cases. Page pool entries are kept for
later use. Two memory models are registered in parallel. As a result,
the XSK pool setup cannot fail during queue reconfiguration.

In contrast to other drivers, XSK pool setup and XDP BPF program setup
are separate actions. XSK pool setup can be done without any XDP BPF
program. The XDP BPF program can be added, removed or changed without
any reconfiguration of the XSK pool.

Test results with A53 1.2GHz:

xdpsock rxdrop copy mode, 64 byte frames:
                   pps            pkts           1.00
rx                 856,054        10,625,775
Two CPUs with both 100% utilization.

xdpsock rxdrop zero-copy mode, 64 byte frames:
                   pps            pkts           1.00
rx                 889,388        4,615,284
Two CPUs with 100% and 20% utilization.

Packet rate increases and CPU utilization is reduced.

100% CPU load seems to the base load. This load is consumed by ksoftirqd
just for dropping the generated packets without xdpsock running.

Using batch API reduced CPU utilization slightly, but measurements are
not stable enough to provide meaningful numbers.
Signed-off-by: Gerhard Engleder <gerhard@engleder-embedded.com>
Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

3fc23339

tsnep: Move skb receive action to separate function · c2d64697

Gerhard Engleder authored Apr 21, 2023

The function tsnep_rx_poll() is already pretty long and the skb receive
action can be reused for XSK zero-copy support. Move page based skb
receive to separate function.
Signed-off-by: Gerhard Engleder <gerhard@engleder-embedded.com>
Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

c2d64697

tsnep: Add functions for queue enable/disable · 2ea0a282

Gerhard Engleder authored Apr 21, 2023

Move queue enable and disable code to separate functions. This way the
activation and deactivation of the queues are defined actions, which can
be used in future execution paths.

This functions will be used for the queue reconfiguration at runtime,
which is necessary for XSK zero-copy support.
Signed-off-by: Gerhard Engleder <gerhard@engleder-embedded.com>
Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

2ea0a282