- 29 Aug, 2024 2 commits
-
-
Eric Dumazet authored
Using a volatile qualifier for a specific struct field is unusual. Use instead READ_ONCE()/WRITE_ONCE() where necessary. tcp_timewait_state_process() can change tw_substate while other threads are reading this field. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Jason Xing <kerneljasonxing@gmail.com> Link: https://patch.msgid.link/20240827015250.3509197-2-edumazet@google.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
Jeongjun Park authored
During the list_for_each_entry_rcu iteration call of xenvif_flush_hash, kfree_rcu does not exist inside the rcu read critical section, so if kfree_rcu is called when the rcu grace period ends during the iteration, UAF occurs when accessing head->next after the entry becomes free. Therefore, to solve this, you need to change it to list_for_each_entry_safe. Signed-off-by: Jeongjun Park <aha310510@gmail.com> Link: https://patch.msgid.link/20240822181109.2577354-1-aha310510@gmail.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
- 27 Aug, 2024 38 commits
-
-
git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queueJakub Kicinski authored
Tony Nguyen says: ==================== Intel Wired LAN Driver Updates 2024-08-26 (ice) This series contains updates to ice driver only. Jake implements and uses rd32_poll_timeout to replace a jiffies loop for calling ice_sq_done. The rd32_poll_timeout() function is designed to allow simplifying other places in the driver where we need to read a register until it matches a known value. Jake, Bruce, and Przemek update ice_debug_cq() to be more robust, and more useful for tracing control queue messages sent and received by the device driver. Jake rewords several commands in the ice_control.c file which previously referred to the "Admin queue" when they were actually generic functions usable on any control queue. Jake removes the unused and unnecessary cmd_buf array allocation for send queues. This logic originally was going to be useful if we ever implemented asynchronous completion of transmit messages. This support is unlikely to materialize, so the overhead of allocating a command buffer is unnecessary. Sergey improves the log messages when the ice driver reports that the NVM version on the device is not supported by the driver. Now, these messages include both the discovered NVM version and the requested/expected NVM version. Aleksandr Mishin corrects overallocation of memory related to adding scheduler nodes. * '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue: ice: Adjust over allocation of memory in ice_sched_add_root_node() and ice_sched_add_node() ice: Report NVM version numbers on mismatch during load ice: remove unnecessary control queue cmd_buf arrays ice: reword comments referring to control queues ice: stop intermixing AQ commands/responses debug dumps ice: do not clutter debug logs with unused data ice: improve debug print for control queue messages ice: implement and use rd32_poll_timeout for ice_sq_done timeout ==================== Link: https://patch.msgid.link/20240826224655.133847-1-anthony.l.nguyen@intel.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
Jakub Kicinski authored
Tristram Ha says: ==================== net: dsa: microchip: Add KSZ8895/KSZ8864 switch support This series of patches is to add KSZ8895/KSZ8864 switch support to the KSZ DSA driver. ==================== Link: https://patch.msgid.link/BYAPR11MB3558B8A089C88DFFFC09B067EC8B2@BYAPR11MB3558.namprd11.prod.outlook.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
Tristram Ha authored
KSZ8895/KSZ8864 is a switch family between KSZ8863/73 and KSZ8795, so it shares some registers and functions in those switches already implemented in the KSZ DSA driver. Signed-off-by: Tristram Ha <tristram.ha@microchip.com> Tested-by: Pieter Van Trappen <pieter.van.trappen@cern.ch> Reviewed-by: Oleksij Rempel <o.rempel@pengutronix.de> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-
Tristram Ha authored
KSZ8895/KSZ8864 is a switch family developed before KSZ8795 and after KSZ8863, so it shares some registers and functions in those switches. KSZ8895 has 5 ports and so is more similar to KSZ8795. KSZ8864 is a 4-port version of KSZ8895. The first port is removed while port 5 remains as a host port. Signed-off-by: Tristram Ha <tristram.ha@microchip.com> Acked-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> Link: https://patch.msgid.link/BYAPR11MB3558FD0717772263FAD86846EC8B2@BYAPR11MB3558.namprd11.prod.outlook.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
A K M Fazla Mehrab authored
Replace fput() with sockfd_put() in handshake_nl_done_doit(). Signed-off-by: A K M Fazla Mehrab <a.mehrab@bytedance.com> Reviewed-by: Chuck Lever <chuck.lever@oracle.com> Link: https://patch.msgid.link/20240826182652.2449359-1-a.mehrab@bytedance.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
Shradha Gupta authored
Currently the values of WQs for RX and TX queues for MANA devices are hardcoded to default sizes. Allow configuring these values for MANA devices as ringparam configuration(get/set) through ethtool_ops. Pre-allocate buffers at the beginning of this operation, to prevent complete network loss in low-memory conditions. Signed-off-by: Shradha Gupta <shradhagupta@linux.microsoft.com> Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com> Reviewed-by: Saurabh Sengar <ssengar@linux.microsoft.com> Link: https://patch.msgid.link/1724688461-12203-1-git-send-email-shradhagupta@linux.microsoft.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
Pawel Dembicki authored
This commit introduces MDI-X configuration support in vsc73xx phys. Vsc73xx supports only auto mode or forced MDI. Vsc73xx have auto MDI-X disabled by default in forced speed mode. This commit enables it. Signed-off-by: Pawel Dembicki <paweldembicki@gmail.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Link: https://patch.msgid.link/20240826093710.511837-1-paweldembicki@gmail.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
Jakub Kicinski authored
Liao Chen says: ==================== net: fix module autoloading This patchset aims to enable autoloading of some net modules. By registering MDT, the kernel is allowed to automatically bind modules to devices that match the specified compatible strings. ==================== Link: https://patch.msgid.link/20240826091858.369910-1-liaochen4@huawei.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
Liao Chen authored
Add MODULE_DEVICE_TABLE(), so modules could be properly autoloaded based on the alias from of_device_id table. Signed-off-by: Liao Chen <liaochen4@huawei.com> Acked-by: Lorenzo Bianconi <lorenzo@kernel.org> Link: https://patch.msgid.link/20240826091858.369910-4-liaochen4@huawei.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
Liao Chen authored
Add MODULE_DEVICE_TABLE(), so modules could be properly autoloaded based on the alias from of_device_id table. Signed-off-by: Liao Chen <liaochen4@huawei.com> Link: https://patch.msgid.link/20240826091858.369910-3-liaochen4@huawei.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
Liao Chen authored
Add MODULE_DEVICE_TABLE(), so modules could be properly autoloaded based on the alias from of_device_id table. Signed-off-by: Liao Chen <liaochen4@huawei.com> Link: https://patch.msgid.link/20240826091858.369910-2-liaochen4@huawei.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
Yu Liao authored
PCI core API pci_dev_id() can be used to get the BDF number for a PCI device. We don't need to compose it manually. Use pci_dev_id() to simplify the code a little bit. Signed-off-by: Yu Liao <liaoyu15@huawei.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20240826012100.3975175-1-liaoyu15@huawei.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
Rosen Penev authored
Currently, of_get_ethdev_address() return is checked for any return error code which means that trying to get the MAC from NVMEM cells that is backed by MTD will fail if it was not probed before ag71xx. So, lets check the return error code for EPROBE_DEFER and defer the ag71xx probe in that case until the underlying NVMEM device is live. Signed-off-by: Robert Marko <robimarko@gmail.com> Signed-off-by: Rosen Penev <rosenp@gmail.com> Reviewed-by: Oleksij Rempel <o.rempel@pengutronix.de> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Link: https://patch.msgid.link/20240824200249.137209-1-rosenp@gmail.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
Yue Haibing authored
Commit da15c78b ("liquidio CN23XX: VF register access") declared cn23xx_dump_vf_initialized_regs() but never implemented it. octeon_dump_soft_command() is never implemented and used since introduction in commit 35878618 ("liquidio: Added delayed work for periodically updating the link statistics."). And finally, a few other declarations were never implenmented since introduction in commit f21fb3ed ("Add support of Cavium Liquidio ethernet adapters"). Signed-off-by: Yue Haibing <yuehaibing@huawei.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20240824083107.3639602-1-yuehaibing@huawei.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
Yue Haibing authored
Commit 4863dea3 ("net: Adding support for Cavium ThunderX network controller") declared nicvf_qset_reg_{write,read}() but never implemented. Commit 4863dea3 ("net: Adding support for Cavium ThunderX network controller") declared bgx_add_dmac_addr() but no implementation. After commit 5fc7cf17 ("net: thunderx: Cleanup PHY probing code.") octeon_mdiobus_force_mod_depencency() is not used any more. Signed-off-by: Yue Haibing <yuehaibing@huawei.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20240824082754.3637963-1-yuehaibing@huawei.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
Jakub Kicinski authored
Dmitry Safonov via says: ==================== net/selftests: TCP-AO selftests updates First 3 patches are more-or-less cleanups/preparations. Patches 4/5 are fixes for netns file descriptors leaks/open. Patch 6 was sent to me/contributed off-list by Mohammad, who wants 32-bit kernels to run TCP-AO. Patch 7 is a workaround/fix for slow VMs. Albeit, I can't reproduce the issue, but I hope it will fix netdev flakes for connect-deny-* tests. And the biggest change is adding TCP-AO tracepoints to selftests. I think it's a good addition by the following reasons: - The related tracepoints are now tested; - It allows tcp-ao selftests to raise expectations on the kernel behavior - up from the syscalls exit statuses + net counters. - Provides tracepoints usage samples. As tracepoints are not a stable ABI, any kernel changes done to them will be reflected to the selftests, which also will allow users to see how to change their code. It's quite better than parsing dmesg (what BGP was doing pre-tracepoints, ugh). Somewhat arguably, the code parses trace_pipe, rather than uses libtraceevent (which any sane user should do). The reason behind that is the same as for rt-netlink macros instead of libmnl: I'm trying to minimize the library dependencies of the selftests. And the performance of formatting text in kernel and parsing it again in a test is not critical. Current output sample: > ok 73 Trace events matched expectations: 13 tcp_hash_md5_required[2] tcp_hash_md5_unexpected[4] tcp_hash_ao_required[3] tcp_ao_key_not_found[4] Previously, tracepoints selftests were part of kernel tcp tracepoints submission [1], but since then the code was quite changed: - Now generic tracing setup is in lib/ftrace.c, separate from lib/ftrace-tcp.c which utilizes TCP trace points. This separation allows future selftests to trace non-TCP events, i.e. to find out an skb's drop reason, which was useful in the creation of TCP-CLOSE stress-test (not in this patch set, but used in attempt to reproduce the issue from [2]). - Another change is that in the previous submission the trace events where used only to detect unexpected TCP-AO/TCP-MD5 events. In this version the selftests will fail if an expected trace event didn't appear. Let's see how reliable this is on the netdev bot - it obviously passes on my testing, but potentially may require a temporary XFAIL patch if it misbehaves on a slow VM. [1] https://lore.kernel.org/lkml/20240224-tcp-ao-tracepoints-v1-0-15f31b7f30a7@arista.com/ [2] https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net.git/commit/?id=33700a0c9b56 v3: https://lore.kernel.org/20240815-tcp-ao-selftests-upd-6-12-v3-0-7bd2e22bb81c@gmail.com v2: https://lore.kernel.org/20240802-tcp-ao-selftests-upd-6-12-v2-0-370c99358161@gmail.com v1: https://lore.kernel.org/20240730-tcp-ao-selftests-upd-6-12-v1-0-ffd4bf15d638@gmail.com ==================== Link: https://patch.msgid.link/20240823-tcp-ao-selftests-upd-6-12-v4-0-05623636fe8c@gmail.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
Dmitry Safonov authored
Setup trace points, add a new ftrace instance in order to not interfere with the rest of the system, filtering by net namespace cookies. Raise a new background thread that parses trace_pipe, matches them with the list of expected events. Wiring up trace events to selftests provides another insight if there is anything unexpected happining in the tcp-ao code (i.e. key rotation when it's not expected). Note: in real programs libtraceevent should be used instead of this manual labor of setting ftrace up and parsing. I'm not using it here as I don't want to have an .so library dependency that one would have to bring into VM or DUT (Device Under Test). Please, don't copy it over into any real world programs, that aren't tests. Signed-off-by: Dmitry Safonov <0x7f454c46@gmail.com> Link: https://patch.msgid.link/20240823-tcp-ao-selftests-upd-6-12-v4-8-05623636fe8c@gmail.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
Dmitry Safonov authored
On tests that are expecting failure the timeout value is TEST_RETRANSMIT_SEC == 1 second. Which is big enough for most of devices under tests. But on a particularly slow machine/VM, 1 second might be not enough for another thread to be scheduled and attempt to connect(). It is not a problem for tests that expect connect() to succeed as the timeout value for them (TEST_TIMEOUT_SEC) is intentionally bigger. One obvious way to solve this would be to increase TEST_RETRANSMIT_SEC. But as all tests would increase the timeouts, that's going to sum up. But here is less obvious way that keeps timeouts for expected connect() failures low: just synchronize the two threads, which will assure that before counter checks the other thread got a chance to run and timeout on connect(). The expected increase of the related counter for listen() socket will yet test the expected failure. Never happens on my machine, but I suppose the majority of netdev's connect-deny-* flakes [1] are caused by this. Prevents the following testing issue: > # selftests: net/tcp_ao: connect-deny_ipv6 > # 1..21 > # # 462[lib/setup.c:243] rand seed 1720905426 > # TAP version 13 > # ok 1 Non-AO server + AO client > # not ok 2 Non-AO server + AO client: TCPAOKeyNotFound counter did not increase: 0 <= 0 > # ok 3 AO server + Non-AO client > # ok 4 AO server + Non-AO client: counter TCPAORequired increased 0 => 1 ... [1]: https://netdev-3.bots.linux.dev/vmksft-tcp-ao/results/681741/6-connect-deny-ipv6/stdoutSigned-off-by: Dmitry Safonov <0x7f454c46@gmail.com> Link: https://patch.msgid.link/20240823-tcp-ao-selftests-upd-6-12-v4-7-05623636fe8c@gmail.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
Mohammad Nassiri authored
It's not safe to use '%zu' specifier for printing uint64_t on 32-bit systems. For uint64_t, we should use the 'PRIu64' macro from the inttypes.h library. This ensures that the uint64_t is printed correctly from the selftests regardless of the system architecture. Signed-off-by: Mohammad Nassiri <mnassiri@ciena.com> [Added missing spaces in fail/ok messages and uint64_t cast in setsockopt-closed, as otherwise it was giving warnings on 64bit. And carried it to netdev ml] Signed-off-by: Dmitry Safonov <0x7f454c46@gmail.com> Link: https://patch.msgid.link/20240823-tcp-ao-selftests-upd-6-12-v4-6-05623636fe8c@gmail.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
Dmitry Safonov authored
The switch_save_ns() helper suppose to help switching to another namespace for some action and to return back to original namespace. The fd should be closed. Signed-off-by: Dmitry Safonov <0x7f454c46@gmail.com> Link: https://patch.msgid.link/20240823-tcp-ao-selftests-upd-6-12-v4-5-05623636fe8c@gmail.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
Dmitry Safonov authored
It turns to be that open_netns() is called rarely from the child-thread and more often from parent-thread. Yet, on initialization of kconfig checks, either of threads may reach kconfig_lock mutex first. VRF-related checks do create a temporary ksft-check VRF in an unshare()'d namespace and than setns() back to the original. As original was opened from "/proc/self/ns/net", it's valid for thread-leader (parent), but it's invalid for the child, resulting in the following failure on tests that check has_vrfs() support: > # ok 54 TCP-AO required on socket + TCP-MD5 key: prefailed as expected: Key was rejected by service > # not ok 55 # error 381[unsigned-md5.c:24] Failed to add a VRF: -17 > # not ok 56 # error 383[unsigned-md5.c:33] Failed to add a route to VRF: -22: Key was rejected by service > not ok 1 selftests: net/tcp_ao: unsigned-md5_ipv6 # exit=1 Use "/proc/thread-self/ns/net" which is valid for any thread. Signed-off-by: Dmitry Safonov <0x7f454c46@gmail.com> Link: https://patch.msgid.link/20240823-tcp-ao-selftests-upd-6-12-v4-4-05623636fe8c@gmail.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
Dmitry Safonov authored
Most of the functions in tcp-ao lib/ return negative errno or -1 in case of a failure. That creates inconsistencies in lib/kconfig, which saves what was the error code. As well as the uninitialized kconfig value is -1, which also may be the result of a check. Define KCONFIG_UNKNOWN and save negative return code, rather than libc-style errno. Signed-off-by: Dmitry Safonov <0x7f454c46@gmail.com> Link: https://patch.msgid.link/20240823-tcp-ao-selftests-upd-6-12-v4-3-05623636fe8c@gmail.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
Dmitry Safonov authored
Instead of pre-allocating a fixed-sized buffer of TEST_MSG_BUFFER_SIZE and printing into it, call vsnprintf() with str = NULL, which will return the needed size of the buffer. This hack is documented in man 3 vsnprintf. Essentially, in C++ terms, it re-invents std::stringstream, which is going to be used to print different tracing paths and formatted strings. Use it straight away in __test_print() - which is thread-safe version of printing in selftests. Signed-off-by: Dmitry Safonov <0x7f454c46@gmail.com> Link: https://patch.msgid.link/20240823-tcp-ao-selftests-upd-6-12-v4-2-05623636fe8c@gmail.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
Dmitry Safonov authored
Correct copy'n'paste typo: the previous line already initialises get_all to 1. Reported-by: Nassiri, Mohammad <mnassiri@ciena.com> Closes: https://lore.kernel.org/all/DM6PR04MB4202BC58A9FD5BDD24A16E8EC56F2@DM6PR04MB4202.namprd04.prod.outlook.com/Signed-off-by: Dmitry Safonov <0x7f454c46@gmail.com> Link: https://patch.msgid.link/20240823-tcp-ao-selftests-upd-6-12-v4-1-05623636fe8c@gmail.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
James Chapman authored
Recent commit fc7ec7f5 ("l2tp: delete sessions using work queue") incorrectly uses drain_workqueue. The use of drain_workqueue in l2tp_pre_exit_net is flawed because the workqueue is shared by all nets and it is therefore possible for new work items to be queued for other nets while drain_workqueue runs. Instead of using drain_workqueue, use __flush_workqueue twice. The first one will run all tunnel delete work items and any work already queued. When tunnel delete work items are run, they may queue new session delete work items, which the second __flush_workqueue will run. In l2tp_exit_net, warn if any of the net's idr lists are not empty. Fixes: fc7ec7f5 ("l2tp: delete sessions using work queue") Signed-off-by: James Chapman <jchapman@katalix.com> Link: https://patch.msgid.link/20240823142257.692667-1-jchapman@katalix.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
Jakub Kicinski authored
Detlev Casanova says: ==================== Add GMAC support for rk3576 Add the necessary constants and functions to support the GMAC devices on the rk3576. ==================== Link: https://patch.msgid.link/20240823141318.51201-1-detlev.casanova@collabora.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
David Wu authored
Add constants and callback functions for the dwmac on RK3576 soc. Signed-off-by: David Wu <david.wu@rock-chips.com> [rebase, extracted bindings] Signed-off-by: Detlev Casanova <detlev.casanova@collabora.com> Reviewed-by: Heiko Stuebner <heiko@sntech.de> Link: https://patch.msgid.link/20240823141318.51201-4-detlev.casanova@collabora.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
Detlev Casanova authored
Add a rockchip,rk3576-gmac compatible for supporting the 2 gmac devices on the rk3576. Signed-off-by: Detlev Casanova <detlev.casanova@collabora.com> Acked-by: Rob Herring (Arm) <robh@kernel.org> Reviewed-by: Heiko Stuebner <heiko@sntech.de> Link: https://patch.msgid.link/20240823141318.51201-3-detlev.casanova@collabora.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
Detlev Casanova authored
Fix SELET -> SELECT in RK3588_GMAC_CLK_SELET_CRU and RK3588_GMAC_CLK_SELET_IO Signed-off-by: Detlev Casanova <detlev.casanova@collabora.com> Reviewed-by: Heiko Stuebner <heiko@sntech.de> Link: https://patch.msgid.link/20240823141318.51201-2-detlev.casanova@collabora.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
Andy Shevchenko authored
of_node_to_fwnode() is a IRQ domain specific implementation of of_fwnode_handle(). Replace the former with more suitable API. Signed-off-by: Andy Shevchenko <andy.shevchenko@gmail.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20240822230550.708112-1-andy.shevchenko@gmail.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
Diogo Jahchan Koike authored
fix an unreleased lock in out_dev_put path by removing the (now) unnecessary path. Reported-by: syzbot+c641161e97237326ea74@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=c641161e97237326ea74 Fixes: 3688ff30 ("net: ethtool: cable-test: Target the command to the requested PHY") Signed-off-by: Diogo Jahchan Koike <djahchankoike@gmail.com> Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Link: https://patch.msgid.link/20240826134656.94892-1-djahchankoike@gmail.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
Paolo Abeni authored
Boris Sukholitko says: ==================== tc: adjust network header after 2nd vlan push <tldr> skb network header of the single-tagged vlan packet continues to point the vlan payload (e.g. IP) after second vlan tag is pushed by tc act_vlan. This causes problem at the dissector which expects double-tagged packet network header to point to the inner vlan. The fix is to adjust network header in tcf_act_vlan.c but requires refactoring of skb_vlan_push function. </tldr> Consider the following shell script snippet configuring TC rules on the veth interface: ip link add veth0 type veth peer veth1 ip link set veth0 up ip link set veth1 up tc qdisc add dev veth0 clsact tc filter add dev veth0 ingress pref 10 chain 0 flower \ num_of_vlans 2 cvlan_ethtype 0x800 action goto chain 5 tc filter add dev veth0 ingress pref 20 chain 0 flower \ num_of_vlans 1 action vlan push id 100 \ protocol 0x8100 action goto chain 5 tc filter add dev veth0 ingress pref 30 chain 5 flower \ num_of_vlans 2 cvlan_ethtype 0x800 action simple sdata "success" Sending double-tagged vlan packet with the IP payload inside: cat <<ENDS | text2pcap - - | tcpreplay -i veth1 - 0000 00 00 00 00 00 11 00 00 00 00 00 22 81 00 00 64 ..........."...d 0010 81 00 00 14 08 00 45 04 00 26 04 d2 00 00 7f 11 ......E..&...... 0020 18 ef 0a 00 00 01 14 00 00 02 00 00 00 00 00 12 ................ 0030 e1 c7 00 00 00 00 00 00 00 00 00 00 ............ ENDS will match rule 10, goto rule 30 in chain 5 and correctly emit "success" to the dmesg. OTOH, sending single-tagged vlan packet: cat <<ENDS | text2pcap - - | tcpreplay -i veth1 - 0000 00 00 00 00 00 11 00 00 00 00 00 22 81 00 00 14 ...........".... 0010 08 00 45 04 00 2a 04 d2 00 00 7f 11 18 eb 0a 00 ..E..*.......... 0020 00 01 14 00 00 02 00 00 00 00 00 16 e1 bf 00 00 ................ 0030 00 00 00 00 00 00 00 00 00 00 00 00 ............ ENDS will match rule 20, will push the second vlan tag but will *not* match rule 30. IOW, the match at rule 30 fails if the second vlan was freshly pushed by the kernel. Lets look at __skb_flow_dissect working on the double-tagged vlan packet. Here is the relevant code from around net/core/flow_dissector.c:1277 copy-pasted here for convenience: if (dissector_vlan == FLOW_DISSECTOR_KEY_MAX && skb && skb_vlan_tag_present(skb)) { proto = skb->protocol; } else { vlan = __skb_header_pointer(skb, nhoff, sizeof(_vlan), data, hlen, &_vlan); if (!vlan) { fdret = FLOW_DISSECT_RET_OUT_BAD; break; } proto = vlan->h_vlan_encapsulated_proto; nhoff += sizeof(*vlan); } The "else" clause above gets the protocol of the encapsulated packet from the skb data at the network header location. printk debugging has showed that in the good double-tagged packet case proto is htons(0x800 == ETH_P_IP) as expected. However in the single-tagged packet case proto is garbage leading to the failure to match tc filter 30. proto is being set from the skb header pointed by nhoff parameter which is defined at the beginning of __skb_flow_dissect (net/core/flow_dissector.c:1055 in the current version): nhoff = skb_network_offset(skb); Therefore the culprit seems to be that the skb network offset is different between double-tagged packet received from the interface and single-tagged packet having its vlan tag pushed by TC. Lets look at the interesting points of the lifetime of the single/double tagged packets as they traverse our packet flow. Both of them will start at __netif_receive_skb_core where the first vlan tag will be stripped: if (eth_type_vlan(skb->protocol)) { skb = skb_vlan_untag(skb); if (unlikely(!skb)) goto out; } At this stage in double-tagged case skb->data points to the second vlan tag while in single-tagged case skb->data points to the network (eg. IP) header. Looking at TC vlan push action (net/sched/act_vlan.c) we have the following code at tcf_vlan_act (interesting points are in square brackets): if (skb_at_tc_ingress(skb)) [1] skb_push_rcsum(skb, skb->mac_len); .... case TCA_VLAN_ACT_PUSH: err = skb_vlan_push(skb, p->tcfv_push_proto, p->tcfv_push_vid | (p->tcfv_push_prio << VLAN_PRIO_SHIFT), 0); if (err) goto drop; break; .... out: if (skb_at_tc_ingress(skb)) [3] skb_pull_rcsum(skb, skb->mac_len); And skb_vlan_push (net/core/skbuff.c:6204) function does: err = __vlan_insert_tag(skb, skb->vlan_proto, skb_vlan_tag_get(skb)); if (err) return err; skb->protocol = skb->vlan_proto; [2] skb->mac_len += VLAN_HLEN; in the case of pushing the second tag. Lets look at what happens with skb->data of the single-tagged packet at each of the above points: 1. As a result of the skb_push_rcsum, skb->data is moved back to the start of the packet. 2. First VLAN tag is moved from the skb into packet buffer, skb->mac_len is incremented, skb->data still points to the start of the packet. 3. As a result of the skb_pull_rcsum, skb->data is moved forward by the modified skb->mac_len, thus pointing to the network header again. Then __skb_flow_dissect will get confused by having double-tagged vlan packet with the skb->data at the network header. The solution for the bug is to preserve "skb->data at second vlan header" semantics in the skb_vlan_push function. We do this by manipulating skb->network_header rather than skb->mac_len. skb_vlan_push callers are updated to do skb_reset_mac_len. More about the patch series: * patch 1 fixes skb_vlan_push and the callers * patch 2 adds ingress tc_actions test * patch 3 adds egress tc_actions test ==================== Link: https://patch.msgid.link/20240822103510.468293-1-boris.sukholitko@broadcom.comSigned-off-by: Paolo Abeni <pabeni@redhat.com>
-
Boris Sukholitko authored
Add new test checking the correctness of inner vlan flushing to the skb data when outer vlan tag is added through act_vlan on egress. Signed-off-by: Boris Sukholitko <boris.sukholitko@broadcom.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
-
Boris Sukholitko authored
Add new test checking the correctness of inner vlan flushing to the skb data when outer vlan tag is added through act_vlan on ingress. Signed-off-by: Boris Sukholitko <boris.sukholitko@broadcom.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
-
Boris Sukholitko authored
<tldr> skb network header of the single-tagged vlan packet continues to point the vlan payload (e.g. IP) after second vlan tag is pushed by tc act_vlan. This causes problem at the dissector which expects double-tagged packet network header to point to the inner vlan. The fix is to adjust network header in tcf_act_vlan.c but requires refactoring of skb_vlan_push function. </tldr> Consider the following shell script snippet configuring TC rules on the veth interface: ip link add veth0 type veth peer veth1 ip link set veth0 up ip link set veth1 up tc qdisc add dev veth0 clsact tc filter add dev veth0 ingress pref 10 chain 0 flower \ num_of_vlans 2 cvlan_ethtype 0x800 action goto chain 5 tc filter add dev veth0 ingress pref 20 chain 0 flower \ num_of_vlans 1 action vlan push id 100 \ protocol 0x8100 action goto chain 5 tc filter add dev veth0 ingress pref 30 chain 5 flower \ num_of_vlans 2 cvlan_ethtype 0x800 action simple sdata "success" Sending double-tagged vlan packet with the IP payload inside: cat <<ENDS | text2pcap - - | tcpreplay -i veth1 - 0000 00 00 00 00 00 11 00 00 00 00 00 22 81 00 00 64 ..........."...d 0010 81 00 00 14 08 00 45 04 00 26 04 d2 00 00 7f 11 ......E..&...... 0020 18 ef 0a 00 00 01 14 00 00 02 00 00 00 00 00 12 ................ 0030 e1 c7 00 00 00 00 00 00 00 00 00 00 ............ ENDS will match rule 10, goto rule 30 in chain 5 and correctly emit "success" to the dmesg. OTOH, sending single-tagged vlan packet: cat <<ENDS | text2pcap - - | tcpreplay -i veth1 - 0000 00 00 00 00 00 11 00 00 00 00 00 22 81 00 00 14 ...........".... 0010 08 00 45 04 00 2a 04 d2 00 00 7f 11 18 eb 0a 00 ..E..*.......... 0020 00 01 14 00 00 02 00 00 00 00 00 16 e1 bf 00 00 ................ 0030 00 00 00 00 00 00 00 00 00 00 00 00 ............ ENDS will match rule 20, will push the second vlan tag but will *not* match rule 30. IOW, the match at rule 30 fails if the second vlan was freshly pushed by the kernel. Lets look at __skb_flow_dissect working on the double-tagged vlan packet. Here is the relevant code from around net/core/flow_dissector.c:1277 copy-pasted here for convenience: if (dissector_vlan == FLOW_DISSECTOR_KEY_MAX && skb && skb_vlan_tag_present(skb)) { proto = skb->protocol; } else { vlan = __skb_header_pointer(skb, nhoff, sizeof(_vlan), data, hlen, &_vlan); if (!vlan) { fdret = FLOW_DISSECT_RET_OUT_BAD; break; } proto = vlan->h_vlan_encapsulated_proto; nhoff += sizeof(*vlan); } The "else" clause above gets the protocol of the encapsulated packet from the skb data at the network header location. printk debugging has showed that in the good double-tagged packet case proto is htons(0x800 == ETH_P_IP) as expected. However in the single-tagged packet case proto is garbage leading to the failure to match tc filter 30. proto is being set from the skb header pointed by nhoff parameter which is defined at the beginning of __skb_flow_dissect (net/core/flow_dissector.c:1055 in the current version): nhoff = skb_network_offset(skb); Therefore the culprit seems to be that the skb network offset is different between double-tagged packet received from the interface and single-tagged packet having its vlan tag pushed by TC. Lets look at the interesting points of the lifetime of the single/double tagged packets as they traverse our packet flow. Both of them will start at __netif_receive_skb_core where the first vlan tag will be stripped: if (eth_type_vlan(skb->protocol)) { skb = skb_vlan_untag(skb); if (unlikely(!skb)) goto out; } At this stage in double-tagged case skb->data points to the second vlan tag while in single-tagged case skb->data points to the network (eg. IP) header. Looking at TC vlan push action (net/sched/act_vlan.c) we have the following code at tcf_vlan_act (interesting points are in square brackets): if (skb_at_tc_ingress(skb)) [1] skb_push_rcsum(skb, skb->mac_len); .... case TCA_VLAN_ACT_PUSH: err = skb_vlan_push(skb, p->tcfv_push_proto, p->tcfv_push_vid | (p->tcfv_push_prio << VLAN_PRIO_SHIFT), 0); if (err) goto drop; break; .... out: if (skb_at_tc_ingress(skb)) [3] skb_pull_rcsum(skb, skb->mac_len); And skb_vlan_push (net/core/skbuff.c:6204) function does: err = __vlan_insert_tag(skb, skb->vlan_proto, skb_vlan_tag_get(skb)); if (err) return err; skb->protocol = skb->vlan_proto; [2] skb->mac_len += VLAN_HLEN; in the case of pushing the second tag. Lets look at what happens with skb->data of the single-tagged packet at each of the above points: 1. As a result of the skb_push_rcsum, skb->data is moved back to the start of the packet. 2. First VLAN tag is moved from the skb into packet buffer, skb->mac_len is incremented, skb->data still points to the start of the packet. 3. As a result of the skb_pull_rcsum, skb->data is moved forward by the modified skb->mac_len, thus pointing to the network header again. Then __skb_flow_dissect will get confused by having double-tagged vlan packet with the skb->data at the network header. The solution for the bug is to preserve "skb->data at second vlan header" semantics in the skb_vlan_push function. We do this by manipulating skb->network_header rather than skb->mac_len. skb_vlan_push callers are updated to do skb_reset_mac_len. Signed-off-by: Boris Sukholitko <boris.sukholitko@broadcom.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
-
Jakub Kicinski authored
Arkadiusz Kubalewski says: ==================== Add Embedded SYNC feature for a dpll's pin Introduce and allow DPLL subsystem users to get/set capabilities of Embedded SYNC on a dpll's pin. Signed-off-by: Arkadiusz Kubalewski <arkadiusz.kubalewski@intel.com> ==================== Link: https://patch.msgid.link/20240822222513.255179-1-arkadiusz.kubalewski@intel.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
Arkadiusz Kubalewski authored
Allow the user to get and set configuration of Embedded SYNC feature on the ice driver dpll pins. Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Signed-off-by: Arkadiusz Kubalewski <arkadiusz.kubalewski@intel.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Link: https://patch.msgid.link/20240822222513.255179-3-arkadiusz.kubalewski@intel.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
Arkadiusz Kubalewski authored
Implement and document new pin attributes for providing Embedded SYNC capabilities to the DPLL subsystem users through a netlink pin-get do/dump messages. Allow the user to set Embedded SYNC frequency with pin-set do netlink message. Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Signed-off-by: Arkadiusz Kubalewski <arkadiusz.kubalewski@intel.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Link: https://patch.msgid.link/20240822222513.255179-2-arkadiusz.kubalewski@intel.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-