Commits · 972cc0e0924598cb293b919d39c848dc038b2c28 · Kirill Smelkov / linux

26 Apr, 2023 9 commits

nfsd: update comment over __nfsd_file_cache_purge · 972cc0e0

Jeff Layton authored Jan 26, 2023

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

972cc0e0

nfsd: don't take/put an extra reference when putting a file · b2ff1bd7

Jeff Layton authored Jan 18, 2023

The last thing that filp_close does is an fput, so don't bother taking
and putting the extra reference.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

b2ff1bd7

nfsd: add some comments to nfsd_file_do_acquire · b680cb9b

Jeff Layton authored Jan 05, 2023

David Howells mentioned that he found this bit of code confusing, so
sprinkle in some comments to clarify.
Reported-by: David Howells <dhowells@redhat.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

b680cb9b

nfsd: don't kill nfsd_files because of lease break error · c6593366

Jeff Layton authored Jan 05, 2023

An error from break_lease is non-fatal, so we needn't destroy the
nfsd_file in that case. Just put the reference like we normally would
and return the error.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

c6593366

nfsd: simplify test_bit return in NFSD_FILE_KEY_FULL comparator · d69b8dbf

Jeff Layton authored Jan 06, 2023

test_bit returns bool, so we can just compare the result of that to the
key->gc value without the "!!".
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

d69b8dbf

nfsd: NFSD_FILE_KEY_INODE only needs to find GC'ed entries · 6c31e4c9

Jeff Layton authored Jan 06, 2023

Since v4 files are expected to be long-lived, there's little value in
closing them out of the cache when there is conflicting access.

Change the comparator to also match the gc value in the key. Change both
of the current users of that key to set the gc value in the key to
"true".
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

6c31e4c9

nfsd: don't open-code clear_and_wake_up_bit · b8bea9f6

Jeff Layton authored Jan 05, 2023

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

b8bea9f6

net: phy: hide the PHYLIB_LEDS knob · 9b78d919

Paolo Abeni authored Apr 26, 2023

commit 4bb7aac7 ("net: phy: fix circular LEDS_CLASS dependencies")
solved a build failure, but introduces a new config knob with a default
'y' value: PHYLIB_LEDS.

The latter is against the current new config policy. The exception
was raised to allow the user to catch bad configurations without led
support.

Anyway the current definition of PHYLIB_LEDS does not fit the above
goal: if LEDS_CLASS is disabled, the new config will be available
only with PHYLIB disabled, too.

Hide the mentioned config, to preserve the randconfig testing done so
far, while respecting the mentioned policy.
Suggested-by: Andrew Lunn <andrew@lunn.ch>
Suggested-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Link: https://lore.kernel.org/r/d82489be8ed911c383c3447e9abf469995ccf39a.1682496488.git.pabeni@redhat.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

9b78d919

Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net · c248b27c

Paolo Abeni authored Apr 26, 2023

No conflicts.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

c248b27c

25 Apr, 2023 26 commits

net: phy: marvell-88x2222: remove unnecessary (void*) conversions · 28b17f62

wuych authored Apr 25, 2023

Pointer variables of void * type do not require type cast.
Signed-off-by: wuych <yunchuan@nfschina.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

28b17f62

tcp/udp: Fix memleaks of sk and zerocopy skbs with TX timestamp. · 50749f2d

Kuniyuki Iwashima authored Apr 24, 2023

syzkaller reported [0] memory leaks of a UDP socket and ZEROCOPY
skbs.  We can reproduce the problem with these sequences:

  sk = socket(AF_INET, SOCK_DGRAM, 0)
  sk.setsockopt(SOL_SOCKET, SO_TIMESTAMPING, SOF_TIMESTAMPING_TX_SOFTWARE)
  sk.setsockopt(SOL_SOCKET, SO_ZEROCOPY, 1)
  sk.sendto(b'', MSG_ZEROCOPY, ('127.0.0.1', 53))
  sk.close()

sendmsg() calls msg_zerocopy_alloc(), which allocates a skb, sets
skb->cb->ubuf.refcnt to 1, and calls sock_hold().  Here, struct
ubuf_info_msgzc indirectly holds a refcnt of the socket.  When the
skb is sent, __skb_tstamp_tx() clones it and puts the clone into
the socket's error queue with the TX timestamp.

When the original skb is received locally, skb_copy_ubufs() calls
skb_unclone(), and pskb_expand_head() increments skb->cb->ubuf.refcnt.
This additional count is decremented while freeing the skb, but struct
ubuf_info_msgzc still has a refcnt, so __msg_zerocopy_callback() is
not called.

The last refcnt is not released unless we retrieve the TX timestamped
skb by recvmsg().  Since we clear the error queue in inet_sock_destruct()
after the socket's refcnt reaches 0, there is a circular dependency.
If we close() the socket holding such skbs, we never call sock_put()
and leak the count, sk, and skb.

TCP has the same problem, and commit e0c8bccd ("net: stream:
purge sk_error_queue in sk_stream_kill_queues()") tried to fix it
by calling skb_queue_purge() during close().  However, there is a
small chance that skb queued in a qdisc or device could be put
into the error queue after the skb_queue_purge() call.

In __skb_tstamp_tx(), the cloned skb should not have a reference
to the ubuf to remove the circular dependency, but skb_clone() does
not call skb_copy_ubufs() for zerocopy skb.  So, we need to call
skb_orphan_frags_rx() for the cloned skb to call skb_copy_ubufs().

[0]:
BUG: memory leak
unreferenced object 0xffff88800c6d2d00 (size 1152):
  comm "syz-executor392", pid 264, jiffies 4294785440 (age 13.044s)
  hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 cd af e8 81 00 00 00 00  ................
    02 00 07 40 00 00 00 00 00 00 00 00 00 00 00 00  ...@............
  backtrace:
    [<0000000055636812>] sk_prot_alloc+0x64/0x2a0 net/core/sock.c:2024
    [<0000000054d77b7a>] sk_alloc+0x3b/0x800 net/core/sock.c:2083
    [<0000000066f3c7e0>] inet_create net/ipv4/af_inet.c:319 [inline]
    [<0000000066f3c7e0>] inet_create+0x31e/0xe40 net/ipv4/af_inet.c:245
    [<000000009b83af97>] __sock_create+0x2ab/0x550 net/socket.c:1515
    [<00000000b9b11231>] sock_create net/socket.c:1566 [inline]
    [<00000000b9b11231>] __sys_socket_create net/socket.c:1603 [inline]
    [<00000000b9b11231>] __sys_socket_create net/socket.c:1588 [inline]
    [<00000000b9b11231>] __sys_socket+0x138/0x250 net/socket.c:1636
    [<000000004fb45142>] __do_sys_socket net/socket.c:1649 [inline]
    [<000000004fb45142>] __se_sys_socket net/socket.c:1647 [inline]
    [<000000004fb45142>] __x64_sys_socket+0x73/0xb0 net/socket.c:1647
    [<0000000066999e0e>] do_syscall_x64 arch/x86/entry/common.c:50 [inline]
    [<0000000066999e0e>] do_syscall_64+0x38/0x90 arch/x86/entry/common.c:80
    [<0000000017f238c1>] entry_SYSCALL_64_after_hwframe+0x63/0xcd

BUG: memory leak
unreferenced object 0xffff888017633a00 (size 240):
  comm "syz-executor392", pid 264, jiffies 4294785440 (age 13.044s)
  hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    00 00 00 00 00 00 00 00 00 2d 6d 0c 80 88 ff ff  .........-m.....
  backtrace:
    [<000000002b1c4368>] __alloc_skb+0x229/0x320 net/core/skbuff.c:497
    [<00000000143579a6>] alloc_skb include/linux/skbuff.h:1265 [inline]
    [<00000000143579a6>] sock_omalloc+0xaa/0x190 net/core/sock.c:2596
    [<00000000be626478>] msg_zerocopy_alloc net/core/skbuff.c:1294 [inline]
    [<00000000be626478>] msg_zerocopy_realloc+0x1ce/0x7f0 net/core/skbuff.c:1370
    [<00000000cbfc9870>] __ip_append_data+0x2adf/0x3b30 net/ipv4/ip_output.c:1037
    [<0000000089869146>] ip_make_skb+0x26c/0x2e0 net/ipv4/ip_output.c:1652
    [<00000000098015c2>] udp_sendmsg+0x1bac/0x2390 net/ipv4/udp.c:1253
    [<0000000045e0e95e>] inet_sendmsg+0x10a/0x150 net/ipv4/af_inet.c:819
    [<000000008d31bfde>] sock_sendmsg_nosec net/socket.c:714 [inline]
    [<000000008d31bfde>] sock_sendmsg+0x141/0x190 net/socket.c:734
    [<0000000021e21aa4>] __sys_sendto+0x243/0x360 net/socket.c:2117
    [<00000000ac0af00c>] __do_sys_sendto net/socket.c:2129 [inline]
    [<00000000ac0af00c>] __se_sys_sendto net/socket.c:2125 [inline]
    [<00000000ac0af00c>] __x64_sys_sendto+0xe1/0x1c0 net/socket.c:2125
    [<0000000066999e0e>] do_syscall_x64 arch/x86/entry/common.c:50 [inline]
    [<0000000066999e0e>] do_syscall_64+0x38/0x90 arch/x86/entry/common.c:80
    [<0000000017f238c1>] entry_SYSCALL_64_after_hwframe+0x63/0xcd

Fixes: f214f915 ("tcp: enable MSG_ZEROCOPY")
Fixes: b5947e5d ("udp: msg_zerocopy")
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

50749f2d
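
The circular dependency described in 50749f2d can be sketched with a toy reference-counting model. This is a deliberately simplified Python illustration, not the kernel implementation: `Sock`, `Ubuf`, `Skb` and `send_and_close` are hypothetical names, and sharing of the ubuf between an skb and its clone is collapsed into a single counter. It only shows why a clone that still pins the ubuf while sitting in the error queue keeps the socket alive, and why dropping that reference breaks the cycle.

```python
# Toy model only: all class and function names are hypothetical.

class Sock:
    def __init__(self):
        self.refcnt = 1          # reference held by the open file descriptor
        self.error_queue = []    # clones carrying TX timestamps
        self.destroyed = False

    def put(self):
        self.refcnt -= 1
        if self.refcnt == 0:
            # Stands in for the destructor: the error queue is purged
            # only once the last socket reference is gone.
            for skb in self.error_queue:
                skb.free()
            self.error_queue.clear()
            self.destroyed = True

class Ubuf:
    def __init__(self, sk):
        self.sk = sk
        self.refcnt = 1
        sk.refcnt += 1           # the ubuf pins the socket

    def put(self):
        self.refcnt -= 1
        if self.refcnt == 0:
            self.sk.put()        # zerocopy completion drops the socket reference

class Skb:
    def __init__(self, ubuf=None):
        self.ubuf = ubuf

    def free(self):
        if self.ubuf is not None:
            self.ubuf.put()
            self.ubuf = None

def send_and_close(clone_keeps_ubuf: bool) -> bool:
    """Return True if the socket is destroyed after close()."""
    sk = Sock()
    ubuf = Ubuf(sk)
    skb = Skb(ubuf)                  # original skb owns the initial ubuf reference
    if clone_keeps_ubuf:
        ubuf.refcnt += 1             # the timestamp clone still pins the ubuf
        clone = Skb(ubuf)
    else:
        clone = Skb(None)            # frags were copied, so no ubuf reference
    sk.error_queue.append(clone)
    skb.free()                       # original skb is consumed
    sk.put()                         # close()
    return sk.destroyed

leaks_before_fix = not send_and_close(clone_keeps_ubuf=True)
freed_after_fix = send_and_close(clone_keeps_ubuf=False)
```

In the first case the socket can only be released by draining the error queue, and the error queue is only drained when the socket is released, so nothing is ever freed.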

net: amd: Fix link leak when verifying config failed · d325c34d

Gencen Gan authored Apr 24, 2023

After failing to verify configuration, it returns directly without
releasing link, which may cause memory leak.

Paolo Abeni thinks that the whole code of this driver is quite
"suboptimal" and looks unmaintained since at least ~15y, so he
suggests that we could simply remove the whole driver, please
take it into consideration.

Simon Horman suggests that the fix label should be set to
"Linux-2.6.12-rc2" considering that the problem has existed
since the driver was introduced and the commit above doesn't
seem to exist in net/net-next.

Fixes: 1da177e4 ("Linux-2.6.12-rc2")
Signed-off-by: Gan Gecen <gangecen@hust.edu.cn>
Reviewed-by: Dongliang Mu <dzm91@hust.edu.cn>
Signed-off-by: David S. Miller <davem@davemloft.net>

d325c34d

net: phy: marvell: Fix inconsistent indenting in led_blink_set · 4774ad84

Christian Marangi authored Apr 23, 2023

Fix inconsistent indenting in m88e1318_led_blink_set reported by kernel
test robot, probably caused by an if condition that was dropped in a
later revision of the same code.
Reported-by: kernel test robot <lkp@intel.com>
Link: https://lore.kernel.org/oe-kbuild-all/202304240007.0VEX8QYG-lkp@intel.com/
Fixes: ea9e8648 ("net: phy: marvell: Implement led_blink_set()")
Signed-off-by: Christian Marangi <ansuelsmth@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://lore.kernel.org/r/20230423172800.3470-1-ansuelsmth@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

4774ad84

lan966x: Don't use xdp_frame when action is XDP_TX · 700f11eb

Horatiu Vultur authored Apr 22, 2023

When the action of an XDP program was XDP_TX, lan966x created an
xdp_frame and used it to send the frame back. But the frame can also
be sent back without an xdp_frame, because it can be sent back
directly from the page.
Once the frame is transmitted, page_pool_recycle_direct can be used
directly, as lan966x is using page pools.
This saves some CPU usage on this path, which results in a higher
number of transmitted frames. Below are the statistics:
Frame size:    Improvement:
64                ~8%
256              ~11%
512               ~8%
1000              ~0%
1500              ~0%
Signed-off-by: Horatiu Vultur <horatiu.vultur@microchip.com>
Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Link: https://lore.kernel.org/r/20230422142344.3630602-1-horatiu.vultur@microchip.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

700f11eb

Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next · ee3392ed

Jakub Kicinski authored Apr 24, 2023

Alexei Starovoitov says:

====================
pull-request: bpf-next 2023-04-24

We've added 5 non-merge commits during the last 3 day(s) which contain
a total of 7 files changed, 87 insertions(+), 44 deletions(-).

The main changes are:

1) Workaround for bpf iter selftest due to lack of subprog support
   in precision tracking, from Andrii.

2) Disable bpf_refcount_acquire kfunc until races are fixed, from Dave.

3) One more test_verifier test converted from asm macro to asm in C,
   from Eduard.

4) Fix build with NETFILTER=y INET=n config, from Florian.

5) Add __rcu_read_{lock,unlock} into deny list, from Yafang.

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next:
  selftests/bpf: avoid mark_all_scalars_precise() trigger in one of iter tests
  bpf: Add __rcu_read_{lock,unlock} into btf id deny list
  bpf: Disable bpf_refcount_acquire kfunc calls until race conditions are fixed
  selftests/bpf: verifier/prevent_map_lookup converted to inline assembly
  bpf: fix link failure with NETFILTER=y INET=n
====================

Link: https://lore.kernel.org/r/20230425005648.86714-1-alexei.starovoitov@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ee3392ed

Merge branch 'tsnep-xdp-socket-zero-copy-support' · 9610a8dc

Jakub Kicinski authored Apr 24, 2023

Gerhard Engleder says:

====================
tsnep: XDP socket zero-copy support

Implement XDP socket zero-copy support for tsnep driver. I tried to
follow existing drivers like igc as far as possible. But one main
difference is that tsnep does not need any reconfiguration for XDP BPF
program setup. So I decided to keep this behavior no matter if an XSK
pool is used or not. As a result, tsnep starts using the XSK pool even
if no XDP BPF program is available.

Another difference is that I tried to prevent potentially failing
allocations during XSK pool setup. E.g. both memory models for page pool
and XSK pool are registered all the time. Thus, XSK pool setup cannot
end up with non-working queues.

Some prework is done to reduce the last two XSK commits to actual XSK
changes.
====================

Link: https://lore.kernel.org/r/20230421194656.48063-1-gerhard@engleder-embedded.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

9610a8dc

tsnep: Add XDP socket zero-copy TX support · cd275c23

Gerhard Engleder authored Apr 21, 2023

Send and complete XSK pool frames within TX NAPI context. NAPI context
is triggered by ndo_xsk_wakeup.

Test results with A53 1.2GHz:

xdpsock txonly copy mode, 64 byte frames:
                   pps            pkts           1.00
tx                 284,409        11,398,144
Two CPUs with 100% and 10% utilization.

xdpsock txonly zero-copy mode, 64 byte frames:
                   pps            pkts           1.00
tx                 511,929        5,890,368
Two CPUs with 100% and 1% utilization.

xdpsock l2fwd copy mode, 64 byte frames:
                   pps            pkts           1.00
rx                 248,985        7,315,885
tx                 248,921        7,315,885
Two CPUs with 100% and 10% utilization.

xdpsock l2fwd zero-copy mode, 64 byte frames:
                   pps            pkts           1.00
rx                 254,735        3,039,456
tx                 254,735        3,039,456
Two CPUs with 100% and 4% utilization.

Packet rate increases and CPU utilization is reduced in both cases.
Signed-off-by: Gerhard Engleder <gerhard@engleder-embedded.com>
Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

cd275c23

tsnep: Add XDP socket zero-copy RX support · 3fc23339

Gerhard Engleder authored Apr 21, 2023

Add support for XSK zero-copy to RX path. The setup of the XSK pool can
be done at runtime. If the netdev is running, then the queue must be
disabled and enabled during reconfiguration. This can be done easily
with functions introduced in previous commits.

A more important property is that, if the netdev is running, then the
setup of the XSK pool shall not stop the netdev in case of errors. A
broken netdev after a failed XSK pool setup is bad behavior. Therefore,
the allocation and setup of resources during XSK pool setup is done only
before any queue is disabled. Additionally, freeing and later allocation
of resources is eliminated in some cases. Page pool entries are kept for
later use. Two memory models are registered in parallel. As a result,
the XSK pool setup cannot fail during queue reconfiguration.

In contrast to other drivers, XSK pool setup and XDP BPF program setup
are separate actions. XSK pool setup can be done without any XDP BPF
program. The XDP BPF program can be added, removed or changed without
any reconfiguration of the XSK pool.

Test results with A53 1.2GHz:

xdpsock rxdrop copy mode, 64 byte frames:
                   pps            pkts           1.00
rx                 856,054        10,625,775
Two CPUs with both 100% utilization.

xdpsock rxdrop zero-copy mode, 64 byte frames:
                   pps            pkts           1.00
rx                 889,388        4,615,284
Two CPUs with 100% and 20% utilization.

Packet rate increases and CPU utilization is reduced.

100% CPU load seems to be the base load. This load is consumed by ksoftirqd
just for dropping the generated packets without xdpsock running.

Using batch API reduced CPU utilization slightly, but measurements are
not stable enough to provide meaningful numbers.
Signed-off-by: Gerhard Engleder <gerhard@engleder-embedded.com>
Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

3fc23339

tsnep: Move skb receive action to separate function · c2d64697

Gerhard Engleder authored Apr 21, 2023

The function tsnep_rx_poll() is already pretty long and the skb receive
action can be reused for XSK zero-copy support. Move page based skb
receive to separate function.
Signed-off-by: Gerhard Engleder <gerhard@engleder-embedded.com>
Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

c2d64697

tsnep: Add functions for queue enable/disable · 2ea0a282

Gerhard Engleder authored Apr 21, 2023

Move queue enable and disable code to separate functions. This way the
activation and deactivation of the queues are defined actions, which can
be used in future execution paths.

These functions will be used for the queue reconfiguration at runtime,
which is necessary for XSK zero-copy support.
Signed-off-by: Gerhard Engleder <gerhard@engleder-embedded.com>
Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

2ea0a282

tsnep: Rework TX/RX queue initialization · 33b0ee02

Gerhard Engleder authored Apr 21, 2023

Make initialization of TX and RX queues less dynamic by moving some
initialization from netdev open/close to device probing.

Additionally, move some initialization code to separate functions to
enable future use in other execution paths.

This is done as preparation for queue reconfigure at runtime, which is
necessary for XSK zero-copy support.
Signed-off-by: Gerhard Engleder <gerhard@engleder-embedded.com>
Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

33b0ee02

tsnep: Replace modulo operation with mask · 42fb2962

Gerhard Engleder authored Apr 21, 2023

The TX/RX ring size is static and a power of 2, which enables the compiler
to optimize the modulo operation into a mask operation. Make this
optimization explicit in the code instead of relying on the compiler.

CPU utilisation during high packet rate has not changed. So no
performance improvement has been measured. But it is best practice to
avoid modulo operations.
Suggested-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Gerhard Engleder <gerhard@engleder-embedded.com>
Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

42fb2962
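
The equivalence that 42fb2962 relies on can be checked with a short sketch. This is a Python illustration, not the driver code: `RING_SIZE` is an assumed value and the two helper names are hypothetical. For a ring size that is a power of two, advancing an index with a mask gives the same result as advancing it with a modulo.

```python
RING_SIZE = 256  # assumed value; must be a power of two for the mask to be valid

def next_index_mod(i: int) -> int:
    """Advance a ring index using the modulo operator."""
    return (i + 1) % RING_SIZE

def next_index_mask(i: int) -> int:
    """Advance a ring index using a bit mask instead of modulo."""
    return (i + 1) & (RING_SIZE - 1)

# A power of two has exactly one bit set, so size & (size - 1) is zero.
is_power_of_two = (RING_SIZE & (RING_SIZE - 1)) == 0

# Both forms agree for every index in the ring, including the wrap-around.
all_equal = all(next_index_mod(i) == next_index_mask(i) for i in range(RING_SIZE))

# Counter-example: with a size of 6 the mask form no longer wraps correctly.
bad_size = 6
mask_result = (5 + 1) & (bad_size - 1)   # 6 & 5
mod_result = (5 + 1) % bad_size
```

The counter-example shows why the mask form is only safe while the ring size stays a power of two.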

net: phy: dp83867: Add led_brightness_set support · 938f65ad

Alexander Stein authored Apr 24, 2023

Up to 4 LEDs can be attached to the PHY, add support for setting
brightness manually.
Signed-off-by: Alexander Stein <alexander.stein@ew.tq-group.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://lore.kernel.org/r/20230424134625.303957-1-alexander.stein@ew.tq-group.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

938f65ad

net: phy: Fix reading LED reg property · aed8fdad

Alexander Stein authored Apr 24, 2023

'reg' is always encoded in 32 bits, thus it has to be read using the
function with the corresponding bit width.

Fixes: 01e5b728 ("net: phy: Add a binding for PHY LEDs")
Signed-off-by: Alexander Stein <alexander.stein@ew.tq-group.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Link: https://lore.kernel.org/r/20230424141648.317944-1-alexander.stein@ew.tq-group.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

aed8fdad
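
Why the read width matters in aed8fdad can be seen from how a 32-bit cell is laid out. This is a Python sketch using `struct`, not the kernel's property accessors: `encode_reg`, `read_u8` and `read_u32` are hypothetical helpers, and the sketch assumes the usual device tree encoding of a cell as a 32-bit big-endian value. Reading only the first byte of such a cell returns the most significant byte, which is zero for small indices.

```python
import struct

def encode_reg(value: int) -> bytes:
    """Encode a value as one 32-bit big-endian cell."""
    return struct.pack(">I", value)

def read_u8(prop: bytes) -> int:
    """Model an 8-bit read: only the first byte of the property is used."""
    return prop[0]

def read_u32(prop: bytes) -> int:
    """Model a 32-bit read: the whole cell is decoded."""
    return struct.unpack(">I", prop[:4])[0]

prop = encode_reg(2)      # e.g. an LED with index 2
narrow = read_u8(prop)    # most significant byte only
wide = read_u32(prop)     # full cell
```

With the narrow read every small index collapses to the same value, so reading with the matching width is required to tell the LEDs apart.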

drivers: nfc: nfcsim: remove return value check of `dev_dir` · e515c330

Jianuo Kuang authored Apr 24, 2023

Smatch complains that:
nfcsim_debugfs_init_dev() warn: 'dev_dir' is an error pointer or valid

According to the documentation of the debugfs_create_dir() function,
there is no need to check the return value of this function.
Just delete the dead code.
Signed-off-by: Jianuo Kuang <u202110722@hust.edu.cn>
Reviewed-by: Dongliang Mu <dzm91@hust.edu.cn>
Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Link: https://lore.kernel.org/r/20230424024140.34607-1-u202110722@hust.edu.cn
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

e515c330

net: phy: dp83867: Remove unnecessary (void*) conversions · 86c2b51f

wuych authored Apr 24, 2023

Pointer variables of void * type do not require type cast.
Signed-off-by: wuych <yunchuan@nfschina.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://lore.kernel.org/r/20230424101550.664319-1-yunchuan@nfschina.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

86c2b51f

net: ethtool: coalesce: try to make user settings stick twice · 00d0f31a

Jakub Kicinski authored Apr 20, 2023

SET_COALESCE may change operation mode and parameters in one call.
Changing operation mode may cause the driver to reset the parameter
values to what is a reasonable default for new operation mode.

Since driver does not know which parameters come from user and which
are echoed back from ->get, driver may ignore the parameters when
switching operation modes.

This used to be inevitable for ioctl() but in netlink we know which
parameters are actually specified by the user.

We could inform which parameters were set by the user but this would
lead to a lot of code duplication in the drivers. Instead try to call
the drivers twice if both mode and params are changed. The set method
already checks if any params need updating so in case the driver did
the right thing the first time around - there will be no second call
to its ->set method (only an extra call to ->get()).

For mlx5 for example before this patch we'd see:

# ethtool -C eth0 adaptive-rx on adaptive-tx on
# ethtool -C eth0 adaptive-rx off adaptive-tx off \
tx-usecs 123 rx-usecs 123
Adaptive RX: off TX: off
rx-usecs: 3
rx-frames: 32
tx-usecs: 16
tx-frames: 32
[...]

After the change:

# ethtool -C eth0 adaptive-rx on adaptive-tx on
# ethtool -C eth0 adaptive-rx off adaptive-tx off \
tx-usecs 123 rx-usecs 123
Adaptive RX: off TX: off
rx-usecs: 123
rx-frames: 32
tx-usecs: 123
tx-frames: 32
[...]

This only works for netlink, so it's a small discrepancy between
netlink and ioctl(). Since we anticipate most users to move to
netlink I believe it's worth making their lives easier.

Link: https://lore.kernel.org/r/20230420233302.944382-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

00d0f31a
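
The retry idea from 00d0f31a can be modelled in a few lines. This is a Python sketch under stated assumptions, not the ethtool netlink code: `ToyDriver` and `set_coalesce` are hypothetical, the adaptive-mode defaults are invented for illustration, and the retry condition is simplified to "some user-specified value did not stick". The non-adaptive defaults (3 and 16) are taken from the mlx5 output quoted in the commit message.

```python
class ToyDriver:
    """Hypothetical driver that resets parameters whenever the mode changes."""
    DEFAULTS = {
        True: {"rx_usecs": 8, "tx_usecs": 8},     # assumed values
        False: {"rx_usecs": 3, "tx_usecs": 16},   # as in the quoted output
    }

    def __init__(self):
        self.state = {"adaptive": False, **self.DEFAULTS[False]}
        self.set_calls = 0

    def get(self) -> dict:
        return dict(self.state)

    def set(self, new: dict) -> None:
        self.set_calls += 1
        if new["adaptive"] != self.state["adaptive"]:
            # Mode switch: the driver cannot tell user values from echoed
            # ones, so it falls back to the defaults of the new mode.
            mode = new["adaptive"]
            self.state = {"adaptive": mode, **self.DEFAULTS[mode]}
        else:
            self.state = dict(new)

def set_coalesce(driver: ToyDriver, user: dict) -> None:
    """Apply only the fields the user specified, retrying once if needed."""
    driver.set({**driver.get(), **user})
    after = driver.get()
    if any(after[k] != v for k, v in user.items()):
        driver.set({**after, **user})   # second attempt, the mode is now settled

drv = ToyDriver()
set_coalesce(drv, {"adaptive": True})
set_coalesce(drv, {"adaptive": False, "rx_usecs": 123, "tx_usecs": 123})
result = drv.get()
```

The first request needs a single call to `set`, while the second needs two, because the mode switch discards the user's values on the first attempt.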

Merge branch 'update-coding-style-and-check-alloc_frag' · 086c1616

Jakub Kicinski authored Apr 24, 2023

Haiyang Zhang says:

====================
Update coding style and check alloc_frag

Follow up patches for the jumbo frame support.

As suggested by Jakub Kicinski, update coding style, and check napi_alloc_frag
for possible fallback to single pages.
====================

Link: https://lore.kernel.org/r/1682096818-30056-1-git-send-email-haiyangz@microsoft.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

086c1616

net: mana: Check if netdev/napi_alloc_frag returns single page · df18f2da

Haiyang Zhang authored Apr 21, 2023

netdev/napi_alloc_frag() may fall back to single page which is smaller
than the requested size.
Add error checking to avoid overwriting memory.
Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

df18f2da

net: mana: Rename mana_refill_rxoob and remove some empty lines · 5c74064f

Haiyang Zhang authored Apr 21, 2023

Rename mana_refill_rxoob for naming consistency.
And remove some empty lines between function call and error
checking.
Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

5c74064f

Merge branch 'add-page_pool-support-for-page-recycling-in-veth-driver' · 8e8e47d9

Jakub Kicinski authored Apr 24, 2023

Lorenzo Bianconi says:

====================
add page_pool support for page recycling in veth driver

Introduce page_pool support in veth driver in order to recycle pages in
veth_convert_skb_to_xdp_buff routine and avoid reallocating the skb through
the page allocator when we run a xdp program on the device and we receive
skbs from the stack.
====================

Link: https://lore.kernel.org/r/cover.1682188837.git.lorenzo@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

8e8e47d9

net: veth: add page_pool stats · 4fc41805

Lorenzo Bianconi authored Apr 22, 2023

Introduce page_pool stats support to report info about local page_pool
through ethtool.
Tested-by: Maryam Tahhan <mtahhan@redhat.com>
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

4fc41805

net: veth: add page_pool for page recycling · 0ebab78c

Lorenzo Bianconi authored Apr 22, 2023

Introduce page_pool support in veth driver in order to recycle pages
in veth_convert_skb_to_xdp_buff routine and avoid reallocating the skb
through the page allocator.
The patch has been tested sending tcp traffic to a veth pair where the
remote peer is running a simple xdp program just returning xdp_pass:

veth upstream codebase:
MTU 1500B: ~ 8Gbps
MTU 8000B: ~ 13.9Gbps

veth upstream codebase + pp support:
MTU 1500B: ~ 9.2Gbps
MTU 8000B: ~ 16.2Gbps
Tested-by: Maryam Tahhan <mtahhan@redhat.com>
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

0ebab78c

netlink: Use copy_to_user() for optval in netlink_getsockopt(). · d913d32c

Kuniyuki Iwashima authored Apr 21, 2023

Brad Spencer provided a detailed report [0] that when calling getsockopt()
for AF_NETLINK, some SOL_NETLINK options set only 1 byte even though such
options require at least sizeof(int) as length.

The options return a flag value that fits into 1 byte, but such behaviour
confuses users who do not initialise the variable before calling
getsockopt() and do not strictly check the returned value as char.

Currently, netlink_getsockopt() uses put_user() to copy data to optlen and
optval, but put_user() casts the data based on the pointer, char *optval.
As a result, only 1 byte is set to optval.

To avoid this behaviour, we need to use copy_to_user() or cast optval for
put_user().

Note that this changes the behaviour on big-endian systems, but we document
that the size of optval is int in the man page.

  $ man 7 netlink
  ...
  Socket options
       To set or get a netlink socket option, call getsockopt(2) to read
       or setsockopt(2) to write the option with the option level argument
       set to SOL_NETLINK.  Unless otherwise noted, optval is a pointer to
       an int.

Fixes: 9a4595bc ("[NETLINK]: Add set/getsockopt options to support more than 32 groups")
Fixes: be0c22a4 ("netlink: add NETLINK_BROADCAST_ERROR socket option")
Fixes: 38938bfe ("netlink: add NETLINK_NO_ENOBUFS socket flag")
Fixes: 0a6a3a23 ("netlink: add NETLINK_CAP_ACK socket option")
Fixes: 2d4bc933 ("netlink: extended ACK reporting")
Fixes: 89d35528 ("netlink: Add new socket option to enable strict checking on dumps")
Reported-by: Brad Spencer <bspencer@blackberry.com>
Link: https://lore.kernel.org/netdev/ZD7VkNWFfp22kTDt@datsun.rim.net/
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Johannes Berg <johannes@sipsolutions.net>
Link: https://lore.kernel.org/r/20230421185255.94606-1-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

d913d32c
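
The effect described in d913d32c can be reproduced on a plain byte buffer. This is a Python sketch, not the kernel's `put_user()` or `copy_to_user()`: `fill_one_byte` and `fill_int` are hypothetical helpers, and the `0xAA` bytes stand in for whatever an uninitialised `int` on the caller's stack happens to contain.

```python
def fill_one_byte(buf: bytearray, flag: int) -> None:
    """Model the old behaviour: only the first byte of optval is written."""
    buf[0] = flag & 0xFF

def fill_int(buf: bytearray, flag: int, byteorder: str) -> None:
    """Model the fixed behaviour: all four bytes of the int are written."""
    buf[:4] = flag.to_bytes(4, byteorder)

# The caller's int, left uninitialised; 0xAA stands in for stack garbage.
garbage = bytes([0xAA] * 4)

old_le = bytearray(garbage)
fill_one_byte(old_le, 1)
new_le = bytearray(garbage)
fill_int(new_le, 1, "little")
new_be = bytearray(garbage)
fill_int(new_be, 1, "big")

old_value = int.from_bytes(old_le, "little")     # three garbage bytes remain
new_value_le = int.from_bytes(new_le, "little")
new_value_be = int.from_bytes(new_be, "big")
```

On a big-endian layout the flag moves from the first byte to the last byte of the buffer, which is the behaviour change the commit message mentions for callers that read the result as a `char`.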

selftests/bpf: avoid mark_all_scalars_precise() trigger in one of iter tests · be7dbd27

Andrii Nakryiko authored Apr 24, 2023

iter_pass_iter_ptr_to_subprog subtest is relying on actual array size
being passed as subprog parameter. This combined with recent fixes to
precision tracking in conditional jumps ([0]) is now causing verifier to
backtrack all the way to the point where sum() and fill() subprogs are
called, at which point precision backtrack bails out and forces all the
states to have precise SCALAR registers. This in turn causes each
possible value of i within fill() and sum() subprogs to cause
a different non-equivalent state, preventing the iterator code from converging.

For now, change the test to assume fixed size of passed in array. Once
BPF verifier supports precision tracking across subprogram calls, these
changes will be reverted as unnecessary.

  [0] 71b547f5 ("bpf: Fix incorrect verifier pruning due to missing register precision taints")
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20230424235128.1941726-1-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

be7dbd27

24 Apr, 2023 5 commits

Merge tag 'nf-next-23-04-22' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next · ffcddcae

Jakub Kicinski authored Apr 24, 2023

Pablo Neira Ayuso says:

====================
Netfilter/IPVS updates for net-next

1) Reduce jumpstack footprint: Stash chain in last rule marker in blob for
   tracing. Remove last rule and chain from jumpstack. From Florian Westphal.

2) nf_tables validates all tables before committing the new rules.
   Unfortunately, this has two drawbacks:

   - Since the addition of the transaction mutex, pernet state gets written
     outside of the locked section from the cleanup callback. This is
     wrong, so do this cleanup directly after the table has passed all checks.

   - Revalidate tables that saw no changes. This can be avoided by
     keeping the validation state per table, not per netns.

   From Florian Westphal.

3) Get rid of a few redundant pointers in the traceinfo structure.
   The three removed pointers are used in the expression evaluation loop,
   so gcc keeps them in registers. Passing them to the (inlined) helpers
   thus doesn't increase nft_do_chain text size, while stack is reduced
   by another 24 bytes on 64bit arches. From Florian Westphal.

4) IPVS cleanups in several ways without implementing any functional
   changes, aside from removing some debugging output:

   - Update width of source for ip_vs_sync_conn_options
     The operation is safe, use an annotation to describe it properly.

   - Consistently use array_size() in ip_vs_conn_init()
     It seems better to use helpers consistently.

   - Remove {Enter,Leave}Function. These seem to be well past their
     use-by date.

   - Correct spelling in comments.

   From Simon Horman.

5) Extended netlink error report for netdevice in flowtables and
   netdev/chains. Allow incrementally adding/deleting devices to/from a
   netdev basechain. Allow creating a netdev chain without a device.

* tag 'nf-next-23-04-22' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next:
  netfilter: nf_tables: allow to create netdev chain without device
  netfilter: nf_tables: support for deleting devices in an existing netdev chain
  netfilter: nf_tables: support for adding new devices to an existing netdev chain
  netfilter: nf_tables: rename function to destroy hook list
  netfilter: nf_tables: do not send complete notification of deletions
  netfilter: nf_tables: extended netlink error reporting for netdevice
  ipvs: Correct spelling in comments
  ipvs: Remove {Enter,Leave}Function
  ipvs: Consistently use array_size() in ip_vs_conn_init()
  ipvs: Update width of source for ip_vs_sync_conn_options
  netfilter: nf_tables: do not store rule in traceinfo structure
  netfilter: nf_tables: do not store verdict in traceinfo structure
  netfilter: nf_tables: do not store pktinfo in traceinfo structure
  netfilter: nf_tables: remove unneeded conditional
  netfilter: nf_tables: make validation state per table
  netfilter: nf_tables: don't write table validation state without mutex
  netfilter: nf_tables: don't store chain address on jump
  netfilter: nf_tables: don't store address of last rule on jump
  netfilter: nf_tables: merge nft_rules_old structure and end of ruleblob marker
====================

Link: https://lore.kernel.org/r/20230421235021.216950-1-pablo@netfilter.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ffcddcae

bpf: Add __rcu_read_{lock,unlock} into btf id deny list · a0c109dc

Yafang Shao authored Apr 24, 2023

The tracing recursion prevention mechanism must be protected by rcu, that
leaves __rcu_read_{lock,unlock} unprotected by this mechanism. If we trace
them, the recursion will happen. Let's add them into the btf id deny list.

When CONFIG_PREEMPT_RCU is enabled, it can be reproduced with a simple bpf
program as such:
  SEC("fentry/__rcu_read_lock")
  int fentry_run()
  {
      return 0;
  }
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Link: https://lore.kernel.org/r/20230424161104.3737-2-laoar.shao@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

a0c109dc

bpf: Disable bpf_refcount_acquire kfunc calls until race conditions are fixed · 7deca5ea

Dave Marchevsky authored Apr 24, 2023

As reported by Kumar in [0], the shared ownership implementation for BPF
programs has some race conditions which need to be addressed before it
can safely be used. This patch does so in a minimal way instead of
ripping out shared ownership entirely, as proper fixes for the issues
raised will follow ASAP, at which point this patch's commit can be
reverted to re-enable shared ownership.

The patch removes the ability to call bpf_refcount_acquire_impl from BPF
programs. Programs can only bump refcount and obtain a new owning
reference using this kfunc, so removing the ability to call it
effectively disables shared ownership.

Instead of changing success / failure expectations for
bpf_refcount-related selftests, this patch just disables them from
running for now.

  [0]: https://lore.kernel.org/bpf/d7hyspcow5wtjcmw4fugdgyp3fwhljwuscp3xyut5qnwivyeru@ysdq543otzv2/
Reported-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Link: https://lore.kernel.org/r/20230424204321.2680232-1-davemarchevsky@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

7deca5ea

Merge tag 'for-net-next-2023-04-23' of... · 2efb07b5

David S. Miller authored Apr 24, 2023

Merge tag 'for-net-next-2023-04-23' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next

bluetooth-next pull request for net-next:

 - Introduce devcoredump support
 - Add support for Realtek RTL8821CS, RTL8851B, RTL8852BS
 - Add support for Mediatek MT7663, MT7922
 - Add support for NXP w8997
 - Add support for Actions Semi ATS2851
 - Add support for QTI WCN6855
 - Add support for Marvell 88W8997

2efb07b5

MAINTAINERS: Remove PPP maintainer · 60fd497c

Paul Mackerras authored Apr 24, 2023

I am not currently maintaining the kernel PPP code, so remove my
address from the MAINTAINERS entry for it.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

60fd497c