- 02 May, 2024 12 commits
-
-
Jakub Kicinski authored
Florian Fainelli says: ==================== net: dsa: adjust_link removal Now that the last in-tree driver (b53) has been converted to PHYLINK, we can get rid of all of code that catered to working with drivers implementing only PHYLIB's adjust_link callback. ==================== Link: https://lore.kernel.org/r/20240430164816.2400606-1-florian.fainelli@broadcom.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
Florian Fainelli authored
Now that we no longer any drivers using PHYLIB's adjust_link callback, remove all paths that made use of adjust_link as well as the associated functions. Signed-off-by: Florian Fainelli <florian.fainelli@broadcom.com> Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Link: https://lore.kernel.org/r/20240430164816.2400606-3-florian.fainelli@broadcom.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
Florian Fainelli authored
We have not had a switch driver use a fixed_link_update callback since 58d56fcc ("net: dsa: bcm_sf2: Get rid of PHYLIB functions") remove this callback. Signed-off-by: Florian Fainelli <florian.fainelli@broadcom.com> Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Link: https://lore.kernel.org/r/20240430164816.2400606-2-florian.fainelli@broadcom.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
MD Danish Anwar authored
Add SW IRQ coalescing based on hrtimers for RX and TX data path for ICSSG driver, which can be enabled by ethtool commands: - RX coalescing ethtool -C eth1 rx-usecs 50 - TX coalescing can be enabled per TX queue - by default enables coalescing for TX0 ethtool -C eth1 tx-usecs 50 - configure TX0 ethtool -Q eth0 queue_mask 1 --coalesce tx-usecs 100 - configure TX1 ethtool -Q eth0 queue_mask 2 --coalesce tx-usecs 100 - configure TX0 and TX1 ethtool -Q eth0 queue_mask 3 --coalesce tx-usecs 100 --coalesce tx-usecs 100 Minimum value for both rx-usecs and tx-usecs is 20us. Compared to gro_flush_timeout and napi_defer_hard_irqs this patch allows to enable IRQ coalescing for RX path separately. Benchmarking numbers: =============================================================== | Method | Tput_TX | CPU_TX | Tput_RX | CPU_RX | | ============================================================== | Default Driver 943 Mbps 31% 517 Mbps 38% | | IRQ Coalescing (Patch) 943 Mbps 28% 518 Mbps 25% | =============================================================== Signed-off-by: MD Danish Anwar <danishanwar@ti.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://lore.kernel.org/r/20240430120634.1558998-1-danishanwar@ti.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
Jakub Kicinski authored
Kuniyuki Iwashima says: ==================== arp: Random clean up and RCU conversion for ioctl(SIOCGARP). arp_ioctl() holds rtnl_lock() regardless of cmd (SIOCDARP, SIOCSARP, and SIOCGARP) to get net_device by __dev_get_by_name() and copy dev->name safely. In the SIOCGARP path, arp_req_get() calls neigh_lookup(), which looks up a neighbour entry under RCU. This series cleans up ioctl() code a bit and extends the RCU section not to take rtnl_lock() and instead use dev_get_by_name_rcu() and netdev_copy_name() for SIOCGARP. v2: https://lore.kernel.org/netdev/20240425170002.68160-1-kuniyu@amazon.com/ v1: https://lore.kernel.org/netdev/20240422194755.4221-1-kuniyu@amazon.com/ ==================== Link: https://lore.kernel.org/r/20240430015813.71143-1-kuniyu@amazon.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
Kuniyuki Iwashima authored
ioctl(SIOCGARP) holds rtnl_lock() to get netdev by __dev_get_by_name() and copy dev->name safely and calls neigh_lookup() later, which looks up a neighbour entry under RCU. Let's replace __dev_get_by_name() with dev_get_by_name_rcu() and strscpy() with netdev_copy_name() to avoid locking rtnl_lock(). Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://lore.kernel.org/r/20240430015813.71143-8-kuniyu@amazon.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
Kuniyuki Iwashima authored
We will convert ioctl(SIOCGARP) to RCU, and then we need to copy dev->name which is currently protected by rtnl_lock(). This patch does the following: 1) Add seqlock netdev_rename_lock to protect dev->name 2) Add netdev_copy_name() that copies dev->name to buffer under netdev_rename_lock 3) Use netdev_copy_name() in netdev_get_name() and drop devnet_rename_sem Suggested-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/netdev/CANn89iJEWs7AYSJqGCUABeVqOCTkErponfZdT5kV-iD=-SajnQ@mail.gmail.com/Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://lore.kernel.org/r/20240430015813.71143-7-kuniyu@amazon.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
Kuniyuki Iwashima authored
arp_ioctl() holds rtnl_lock() first regardless of cmd (SIOCDARP, SIOCSARP, and SIOCGARP) to get net_device by __dev_get_by_name() and copy dev->name safely. In the SIOCGARP path, arp_req_get() calls neigh_lookup(), which looks up a neighbour entry under RCU. We will extend the RCU section not to take rtnl_lock() and instead use dev_get_by_name_rcu() for SIOCGARP. As a preparation, let's move __dev_get_by_name() into another function and call it from arp_req_delete(), arp_req_set(), and arp_req_get(). Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://lore.kernel.org/r/20240430015813.71143-6-kuniyu@amazon.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
Kuniyuki Iwashima authored
This is a prep patch to make the following changes tidy. No functional change intended. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://lore.kernel.org/r/20240430015813.71143-5-kuniyu@amazon.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
Kuniyuki Iwashima authored
When ioctl(SIOCDARP/SIOCSARP) is issued for non-proxy entry (no ATF_COM) without arpreq.arp_dev[] set, arp_req_set() and arp_req_delete() looks up dev based on IPv4 address by ip_route_output(). Let's factorise the same code as arp_req_dev(). Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://lore.kernel.org/r/20240430015813.71143-4-kuniyu@amazon.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
Kuniyuki Iwashima authored
When ioctl(SIOCDARP/SIOCSARP) is issued with ATF_PUBL, r.arp_netmask must be 0.0.0.0 or 255.255.255.255. Currently, the netmask is validated in arp_req_delete_public() or arp_req_set_public() under rtnl_lock(). We have ATF_NETMASK test in arp_ioctl() before holding rtnl_lock(), so let's move the netmask validation there. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://lore.kernel.org/r/20240430015813.71143-3-kuniyu@amazon.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
Kuniyuki Iwashima authored
In arp_req_set(), if ATF_PERM is set in arpreq.arp_flags, ATF_COM is set automatically. The flag will be used later for neigh_update() only when a neighbour entry is found. Let's set ATF_COM just before calling neigh_update(). Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://lore.kernel.org/r/20240430015813.71143-2-kuniyu@amazon.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
- 01 May, 2024 13 commits
-
-
Florian Westphal authored
Even a 1h timeout isn't enough for nft_concat_range.sh to complete on debug kernels. Reduce test complexity and only match on single entry if KSFT_MACHINE_SLOW is set. To spot 'slow' tests, print the subtest duration (in seconds) in addition to the status. Add new nft_concat_range_perf.sh script, not executed via kselftest, to run the performance (pps match rate) tests. Those need about 25m to complete which seems too much to run this via 'make run_tests'. Signed-off-by: Florian Westphal <fw@strlen.de> Link: https://lore.kernel.org/r/20240430145810.23447-1-fw@strlen.deSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
Eric Dumazet authored
This is a followup of commit b5327b9a ("ipv6: use call_rcu_hurry() in fib6_info_release()"). I had another pmtu.sh failure, and found another lazy call_rcu() causing this failure. aca_free_rcu() calls fib6_info_release() which releases devices references. We must not delay it too much or risk unregister_netdevice/ref_tracker traces because references to netdev are not released in time. This should speedup device/netns dismantles when CONFIG_RCU_LAZY=y Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Breno Leitao authored
KCSAN detected a race condition in netpoll: BUG: KCSAN: data-race in net_rx_action / netpoll_send_skb write (marked) to 0xffff8881164168b0 of 4 bytes by interrupt on cpu 10: net_rx_action (./include/linux/netpoll.h:90 net/core/dev.c:6712 net/core/dev.c:6822) <snip> read to 0xffff8881164168b0 of 4 bytes by task 1 on cpu 2: netpoll_send_skb (net/core/netpoll.c:319 net/core/netpoll.c:345 net/core/netpoll.c:393) netpoll_send_udp (net/core/netpoll.c:?) <snip> value changed: 0x0000000a -> 0xffffffff This happens because netpoll_owner_active() needs to check if the current CPU is the owner of the lock, touching napi->poll_owner non atomically. The ->poll_owner field contains the current CPU holding the lock. Use an atomic read to check if the poll owner is the current CPU. Signed-off-by: Breno Leitao <leitao@debian.org> Link: https://lore.kernel.org/r/20240429100437.3487432-1-leitao@debian.orgSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
Breno Leitao authored
With commit 34d21de9 ("net: Move {l,t,d}stats allocation to core and convert veth & vrf"), stats allocation could be done on net core instead of in this driver. With this new approach, the driver doesn't have to bother with error handling (allocation failure checking, making sure free happens in the right spot, etc). This is core responsibility now. Remove the allocation in the loopback driver and leverage the network core allocation instead. Signed-off-by: Breno Leitao <leitao@debian.org> Reviewed-by: Sabrina Dubroca <sd@queasysnail.net> Link: https://lore.kernel.org/r/20240429085559.2841918-1-leitao@debian.orgSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
Jakub Kicinski authored
Flavio Suligoi says: ==================== dt-bindings: net: snps, dwmac: remove tx-sched-sp property Strict priority for the tx scheduler is by default in Linux driver, so the tx-sched-sp property was removed in commit aed68640 ("net: stmmac: platform: Delete a redundant condition branch"). This property is still in use in the following DT (and it will be removed in a separate patch series): - arch/arm64/boot/dts/freescale/imx8mp-beacon-som.dtsi - arch/arm64/boot/dts/freescale/imx8mp-evk.dts - arch/arm64/boot/dts/freescale/imx8mp-verdin.dtsi - arch/arm64/boot/dts/qcom/sa8540p-ride.dts - arch/arm64/boot/dts/qcom/sa8775p-ride.dts There is no problem if that property is still used in the DTs above, since, as seen above, it is a default property of the driver. ==================== Link: https://lore.kernel.org/r/20240429092654.31390-1-f.suligoi@asem.itSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
Flavio Suligoi authored
Strict priority for the tx scheduler is by default in Linux driver, so the tx-sched-sp property was removed in commit aed68640 ("net: stmmac: platform: Delete a redundant condition branch"). This property is still in use in the following DT (and it will be removed in a separate patch series): - arch/arm64/boot/dts/freescale/imx8mp-beacon-som.dtsi - arch/arm64/boot/dts/freescale/imx8mp-evk.dts - arch/arm64/boot/dts/freescale/imx8mp-verdin.dtsi - arch/arm64/boot/dts/qcom/sa8540p-ride.dts - arch/arm64/boot/dts/qcom/sa8775p-ride.dts There is no problem if that property is still used in the DTs above, since, as seen above, it is a default property of the driver. Signed-off-by: Flavio Suligoi <f.suligoi@asem.it> Acked-by: Krzysztof Kozlowski <krzk@kernel.org> Acked-by: Adam Ford <aford173@gmail.com> Link: https://lore.kernel.org/r/20240429092654.31390-2-f.suligoi@asem.itSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
Jakub Kicinski authored
Eric Dumazet says: ==================== net: three additions to net_hotdata This series moves three fast path sysctls to net_hotdata. To avoid <net/hotdata.h> inclusion from <net/sock.h>, create <net/proto_memory.h> to hold proto memory definitions. ==================== Link: https://lore.kernel.org/r/20240429134025.1233626-1-edumazet@google.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
Eric Dumazet authored
sysctl_mem_pcpu_rsv is used in TCP fast path, move it to net_hodata for better cache locality. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20240429134025.1233626-6-edumazet@google.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
Eric Dumazet authored
Move some proto memory definitions out of <net/sock.h> Very few files need them, and following patch will include <net/hotdata.h> from <net/proto_memory.h> Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20240429134025.1233626-5-edumazet@google.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
Eric Dumazet authored
tcp_out_of_memory() has a single caller: tcp_check_oom(). Following patch will also make sk_memory_allocated() not anymore visible from <net/sock.h> and <net/tcp.h> Add const qualifier to sock argument of tcp_out_of_memory() and tcp_check_oom(). Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20240429134025.1233626-4-edumazet@google.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
Eric Dumazet authored
sysctl_skb_defer_max is used in TCP fast path, move it to net_hodata. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20240429134025.1233626-3-edumazet@google.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
Eric Dumazet authored
sysctl_max_skb_frags is used in TCP and MPTCP fast paths, move it to net_hodata for better cache locality. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20240429134025.1233626-2-edumazet@google.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
Eric Dumazet authored
I added dst_rt6_info() in commit e8dfd42c ("ipv6: introduce dst_rt6_info() helper") This patch does a similar change for IPv4. Instead of (struct rtable *)dst casts, we can use : #define dst_rtable(_ptr) \ container_of_const(_ptr, struct rtable, dst) Patch is smaller than IPv6 one, because IPv4 has skb_rtable() helper. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Reviewed-by: Sabrina Dubroca <sd@queasysnail.net> Link: https://lore.kernel.org/r/20240429133009.1227754-1-edumazet@google.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
- 30 Apr, 2024 15 commits
-
-
Jakub Kicinski authored
Jakub Kicinski says: ==================== selftests: net: page_poll allocation error injection Add a test for exercising driver memory allocation failure paths. page pool is a bit tricky to inject errors into at the page allocator level because of the bulk alloc and recycling, so add explicit error injection support "in front" of the caches. Add a test to exercise that using only the standard APIs. This is the first useful test for the new tests with an endpoint. There's no point testing netdevsim here, so this is also the first HW-only test in Python. I'm not super happy with the traffic generation using iperf3, my initial approach was to use mausezahn. But it turned out to be 5x slower in terms of PPS. Hopefully this is good enough for now. v1: https://lore.kernel.org/all/20240426232400.624864-1-kuba@kernel.org/ ==================== Link: https://lore.kernel.org/r/20240429144426.743476-1-kuba@kernel.orgSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
Jakub Kicinski authored
Bugs in memory allocation failure paths are quite common. Add a test exercising those paths based on qstat and page pool failure hook. Running on bnxt: # ./drivers/net/hw/pp_alloc_fail.py KTAP version 1 1..1 # ethtool -G change retval: success ok 1 pp_alloc_fail.test_pp_alloc # Totals: pass:1 fail:0 xfail:0 xpass:0 skip:0 error:0 I initially wrote this test to validate commit be43b748 ("net/mlx5e: RX, Fix page_pool allocation failure recovery for striding rq") but mlx5 still doesn't have qstat. So I run it on bnxt, and while bnxt survives I found the problem fixed in commit 73011773 ("eth: bnxt: fix counting packets discarded due to OOM and netpoll"). Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://lore.kernel.org/r/20240429144426.743476-7-kuba@kernel.orgSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
Jakub Kicinski authored
While we are not very interested in testing performance it's useful to be able to generate a lot of traffic. iperf is the simplest way of getting relatively high PPS. Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://lore.kernel.org/r/20240429144426.743476-6-kuba@kernel.orgSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
Jakub Kicinski authored
When picking TCP ports to use, avoid all below 10k. This should lower the chance of collision or running afoul whatever random policies may be on the host. Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://lore.kernel.org/r/20240429144426.743476-5-kuba@kernel.orgSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
Jakub Kicinski authored
The main use of the ip() wrapper over cmd() is that it can parse JSON. cmd("ip -j link show") will return stdout as a string, and test has to call json.loads(). With ip("link show", json=True) the return value will be already parsed. More tools (ethtool, bpftool etc.) support the --json switch. To avoid having to wrap all of them individually create a tool() helper. Switch from -j to --json (for ethtool). While at it consume the netns attribute at the ip() level. Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://lore.kernel.org/r/20240429144426.743476-4-kuba@kernel.orgSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
Jakub Kicinski authored
We created a separate directory for HW-only tests, recently. Glue in the Python test library there, Python is a bit annoying when it comes to using library code located "lower" in the directory structure. Reuse the Env class, but let tests require non-nsim setup. Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://lore.kernel.org/r/20240429144426.743476-3-kuba@kernel.orgSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
Jakub Kicinski authored
Because of caching / recycling using the general page allocation failures to induce errors in page pool allocation is very hard. Add direct error injection support to page_pool_alloc_pages(). Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://lore.kernel.org/r/20240429144426.743476-2-kuba@kernel.orgSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
Florian Westphal authored
Jakub reports that some tests fail on netdev CI when executed in a debug kernel. Increase test timeout to 30m, this should hopefully be enough. Also reduce test duration where possible for "slow" machines. Signed-off-by: Florian Westphal <fw@strlen.de> Link: https://lore.kernel.org/r/20240429105736.22677-1-fw@strlen.deSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
Russell King (Oracle) authored
sfp_select_interface() does not modify its link_modes argument, so make this a const pointer. Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Reviewed-by: Daniel Machon <daniel.machon@microchip.com> Link: https://lore.kernel.org/r/E1s15s0-00AHyq-8E@rmk-PC.armlinux.org.ukSigned-off-by: Paolo Abeni <pabeni@redhat.com>
-
Russell King (Oracle) authored
Allow use of 2500base-X interface mode for PHY modules that support 2500base-T. Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Reviewed-by: Daniel Machon <daniel.machon@microchip.com> Link: https://lore.kernel.org/r/E1s15rv-00AHyk-5S@rmk-PC.armlinux.org.ukSigned-off-by: Paolo Abeni <pabeni@redhat.com>
-
Russell King (Oracle) authored
Add a debugging print in phylink_validate_phy() when we detect that the PHY has not supplied a possible_interfaces bitmap. Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Reviewed-by: Daniel Machon <daniel.machon@microchip.com> Link: https://lore.kernel.org/r/E1s15rq-00AHye-22@rmk-PC.armlinux.org.ukSigned-off-by: Paolo Abeni <pabeni@redhat.com>
-
Russell King (Oracle) authored
Convert realtek to provide its own phylink MAC operations, thus avoiding the shim layer in DSA's port.c. We need to provide a stub for the mandatory mac_config() method for rtl8366rb. Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Reviewed-by: Linus Walleij <linus.walleij@linaro.org> Link: https://lore.kernel.org/r/E1s11qJ-00AHi0-Kk@rmk-PC.armlinux.org.ukSigned-off-by: Paolo Abeni <pabeni@redhat.com>
-
Arınç ÜNAL authored
DSA initalises the ds->num_ports amount of ports in dsa_switch_touch_ports(). When the PHY muxing feature is in use, port 5 won't be defined in the device tree. Because of this, the type member of the dsa_port structure for this port will be assigned DSA_PORT_TYPE_UNUSED. The dsa_port_setup() function calls ds->ops->port_disable() when the port type is DSA_PORT_TYPE_UNUSED. The MT7530_P5_DIS bit is unset in mt7530_setup() when PHY muxing is being used. mt7530_port_disable() which is assigned to ds->ops->port_disable() is called afterwards. Currently, mt7530_port_disable() sets MT7530_P5_DIS which breaks network connectivity when PHY muxing is being used. Therefore, do not set MT7530_P5_DIS when PHY muxing is being used. Fixes: 377174c5 ("net: dsa: mt7530: move MT753X_MTRAP operations for MT7530") Reported-by: Daniel Golle <daniel@makrotopia.org> Signed-off-by: Arınç ÜNAL <arinc.unal@arinc9.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Link: https://lore.kernel.org/r/20240428-for-netnext-mt7530-do-not-disable-port5-when-phy-muxing-v2-1-bb7c37d293f8@arinc9.comSigned-off-by: Paolo Abeni <pabeni@redhat.com>
-
Paolo Abeni authored
Wen Gu says: ==================== net/smc: SMC intra-OS shortcut with loopback-ism This patch set acts as the second part of the new version of [1] (The first part can be referred from [2]), the updated things of this version are listed at the end. - Background SMC-D is now used in IBM z with ISM function to optimize network interconnect for intra-CPC communications. Inspired by this, we try to make SMC-D available on the non-s390 architecture through a software-implemented Emulated-ISM device, that is the loopback-ism device here, to accelerate inter-process or inter-containers communication within the same OS instance. - Design This patch set includes 3 parts: - Patch #1: some prepare work for loopback-ism. - Patch #2-#7: implement loopback-ism device and adapt SMC-D for it. loopback-ism now serves only SMC and no userspace interfaces exposed. - Patch #8-#11: memory copy optimization for intra-OS scenario. The loopback-ism device is designed as an ISMv2 device and not be limited to a specific net namespace, ends of both inter-process connection (1/1' in diagram below) or inter-container connection (2/2' in diagram below) can find the same available loopback-ism and choose it during the CLC handshake. Container 1 (ns1) Container 2 (ns2) +-----------------------------------------+ +-------------------------+ | +-------+ +-------+ +-------+ | | +-------+ | | | App A | | App B | | App C | | | | App D |<-+ | | +-------+ +---^---+ +-------+ | | +-------+ |(2') | | |127.0.0.1 (1')| |192.168.0.11 192.168.0.12| | | (1)| +--------+ | +--------+ |(2) | | +--------+ +--------+ | | `-->| lo |-` | eth0 |<-` | | | lo | | eth0 | | +---------+--|---^-+---+-----|--+---------+ +-+--------+---+-^------+-+ | | | | Kernel | | | | +----+-------v---+-----------v----------------------------------+---+----+ | | TCP | | | | | | | +--------------------------------------------------------------+ | | | | +--------------+ | | | smc loopback | | +---------------------------+--------------+-----------------------------+ loopback-ism device creates DMBs (shared memory) for each connection peer. Since data transfer occurs within the same kernel, the sndbuf of each peer is only a descriptor and point to the same memory region as peer DMB, so that the data copy from sndbuf to peer DMB can be avoided in loopback-ism case. Container 1 (ns1) Container 2 (ns2) +-----------------------------------------+ +-------------------------+ | +-------+ | | +-------+ | | | App C |-----+ | | | App D | | | +-------+ | | | +-^-----+ | | | | | | | | (2) | | | (2') | | | | | | | | +---------------|-------------------------+ +----------|--------------+ | | Kernel | | +---------------|-----------------------------------------|--------------+ | +--------+ +--v-----+ +--------+ +--------+ | | |dmb_desc| |snd_desc| |dmb_desc| |snd_desc| | | +-----|--+ +--|-----+ +-----|--+ +--------+ | | +-----|--+ | +-----|--+ | | | DMB C | +---------------------------------| DMB D | | | +--------+ +--------+ | | | | +--------------+ | | | smc loopback | | +---------------------------+--------------+-----------------------------+ - Benchmark Test * Test environments: - VM with Intel Xeon Platinum 8 core 2.50GHz, 16 GiB mem. - SMC sndbuf/DMB size 1MB. * Test object: - TCP: run on TCP loopback. - SMC lo: run on SMC loopback-ism. 1. ipc-benchmark (see [3]) - ./<foo> -c 1000000 -s 100 TCP SMC-lo Message rate (msg/s) 84991 151293(+78.01%) 2. sockperf - serv: <smc_run> sockperf sr --tcp - clnt: <smc_run> sockperf { tp | pp } --tcp --msg-size={ 64000 for tp | 14 for pp } -i 127.0.0.1 -t 30 TCP SMC-lo Bandwidth(MBps) 5033.569 7987.732(+58.69%) Latency(us) 5.986 3.398(-43.23%) 3. nginx/wrk - serv: <smc_run> nginx - clnt: <smc_run> wrk -t 8 -c 1000 -d 30 http://127.0.0.1:80 TCP SMC-lo Requests/s 187951.76 267107.90(+42.12%) 4. redis-benchmark - serv: <smc_run> redis-server - clnt: <smc_run> redis-benchmark -h 127.0.0.1 -q -t set,get -n 400000 -c 200 -d 1024 TCP SMC-lo GET(Requests/s) 86132.64 118133.49(+37.15%) SET(Requests/s) 87374.40 122887.86(+40.65%) Change log: v7->v6 - Patch #2: minor: remove unnecessary 'return' of inline smc_loopback_exit(). - Patch #10: minor: directly return 0 instead of 'rc' in smcd_cdc_msg_send(). - all: collect the Reviewed-by tags. v6->RFC v5 Link: https://lore.kernel.org/netdev/20240414040304.54255-1-guwen@linux.alibaba.com/ - Patch #2: make the use of CONFIG_SMC_LO cleaner. - Patch #5: mark some smcd_ops that loopback-ism doesn't support as optional and check for the support when they are called. - Patch #7: keep loopback-ism at the beginning of the SMC-D device list. - Some expression changes in commit logs and comments. RFC v5->RFC v4: Link: https://lore.kernel.org/netdev/20240324135522.108564-1-guwen@linux.alibaba.com/ - Patch #2: minor changes in description of config SMC_LO and comments. - Patch #10: minor changes in comments and if(smc_ism_support_dmb_nocopy()) check in smcd_cdc_msg_send(). - Patch #3: change smc_lo_generate_id() to smc_lo_generate_ids() and SMC_LO_CHID to SMC_LO_RESERVED_CHID. - Patch #5: memcpy while holding the ldev->dmb_ht_lock. - Some expression changes in commit logs. RFC v4->v3: Link: https://lore.kernel.org/netdev/20240317100545.96663-1-guwen@linux.alibaba.com/ - The merge window of v6.9 is open, so post this series as an RFC. - Patch #6: since some information fed back by smc_nl_handle_smcd_dev() dose not apply to Emulated-ISM (including loopback-ism here), loopback-ism is not exposed through smc netlink for the time being. we may refactor this part when smc netlink interface is updated. v3->v2: Link: https://lore.kernel.org/netdev/20240312142743.41406-1-guwen@linux.alibaba.com/ - Patch #11: use tasklet_schedule(&conn->rx_tsklet) instead of smcd_cdc_rx_handler() to avoid possible recursive locking of conn->send_lock and use {read|write}_lock_bh() to acquire dmb_ht_lock. v2->v1: Link: https://lore.kernel.org/netdev/20240307095536.29648-1-guwen@linux.alibaba.com/ - All the patches: changed the term virtual-ISM to Emulated-ISM as defined by SMCv2.1. - Patch #3: optimized the description of SMC_LO config. Avoid exposing loopback-ism to sysfs and remove all the knobs until future definition clear. - Patch #3: try to make lockdep happy by using read_lock_bh() in smc_lo_move_data(). - Patch #6: defaultly use physical contiguous DMB buffers. - Patch #11: defaultly enable DMB no-copy for loopback-ism and free the DMB in unregister_dmb or detach_dmb when dmb_node->refcnt reaches 0, instead of using wait_event to keep waiting in unregister_dmb. v1->RFC: Link: https://lore.kernel.org/netdev/20240111120036.109903-1-guwen@linux.alibaba.com/ - Patch #9: merge rx_bytes and tx_bytes as xfer_bytes statistics: /sys/devices/virtual/smc/loopback-ism/xfer_bytes - Patch #10: add support_dmb_nocopy operation to check if SMC-D device supports merging sndbuf with peer DMB. - Patch #13 & #14: introduce loopback-ism device control of DMB memory type and control of whether to merge sndbuf and DMB. They can be respectively set by: /sys/devices/virtual/smc/loopback-ism/dmb_type /sys/devices/virtual/smc/loopback-ism/dmb_copy The motivation for these two control is that a performance bottleneck was found when using vzalloced DMB and sndbuf is merged with DMB, and there are many CPUs and CONFIG_HARDENED_USERCOPY is set [4]. The bottleneck is caused by the lock contention in vmap_area_lock [5] which is involved in memcpy_from_msg() or memcpy_to_msg(). Currently, Uladzislau Rezki is working on mitigating the vmap lock contention [6]. It has significant effects, but using virtual memory still has additional overhead compared to using physical memory. So this new version provides controls of dmb_type and dmb_copy to suit different scenarios. - Some minor changes and comments improvements. RFC->old version([1]): Link: https://lore.kernel.org/netdev/1702214654-32069-1-git-send-email-guwen@linux.alibaba.com/ - Patch #1: improve the loopback-ism dump, it shows as follows now: # smcd d FID Type PCI-ID PCHID InUse #LGs PNET-ID 0000 0 loopback-ism ffff No 0 - Patch #3: introduce the smc_ism_set_v2_capable() helper and set smc_ism_v2_capable when ISMv2 or virtual ISM is registered, regardless of whether there is already a device in smcd device list. - Patch #3: loopback-ism will be added into /sys/devices/virtual/smc/loopback-ism/. - Patch #8: introduce the runtime switch /sys/devices/virtual/smc/loopback-ism/active to activate or deactivate the loopback-ism. - Patch #9: introduce the statistics of loopback-ism by /sys/devices/virtual/smc/loopback-ism/{{tx|rx}_tytes|dmbs_cnt}. - Some minor changes and comments improvements. [1] https://lore.kernel.org/netdev/1695568613-125057-1-git-send-email-guwen@linux.alibaba.com/ [2] https://lore.kernel.org/netdev/20231219142616.80697-1-guwen@linux.alibaba.com/ [3] https://github.com/goldsborough/ipc-bench [4] https://lore.kernel.org/all/3189e342-c38f-6076-b730-19a6efd732a5@linux.alibaba.com/ [5] https://lore.kernel.org/all/238e63cd-e0e8-4fbf-852f-bc4d5bc35d5a@linux.alibaba.com/ [6] https://lore.kernel.org/all/20240102184633.748113-1-urezki@gmail.com/ ==================== Link: https://lore.kernel.org/r/20240428060738.60843-1-guwen@linux.alibaba.comSigned-off-by: Paolo Abeni <pabeni@redhat.com>
-
Wen Gu authored
This implements operations related to merging sndbuf with peer DMB in loopback-ism. The DMB won't be freed until no sndbuf is attached to it. Signed-off-by: Wen Gu <guwen@linux.alibaba.com> Reviewed-by: Wenjia Zhang <wenjia@linux.ibm.com> Reviewed-and-tested-by: Jan Karcher <jaka@linux.ibm.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
-