- 02 Feb, 2022 5 commits
-
-
Vladimir Oltean authored
In order for switch driver to be able to make simple and reliable use of the master tracking operations, they must also be notified of the initial state of the DSA master, not just of the changes. This is because they might enable certain features only during the time when they know that the DSA master is up and running. Therefore, this change explicitly checks the state of the DSA master under the same rtnl_mutex as we were holding during the dsa_master_setup() and dsa_master_teardown() call. The idea being that if the DSA master became operational in between the moment in which it became a DSA master (dsa_master_setup set dev->dsa_ptr) and the moment when we checked for the master being up, there is a chance that we would emit a ->master_state_change() call with no actual state change. We need to avoid that by serializing the concurrent netdevice event with us. If the netdevice event started before, we force it to finish before we begin, because we take rtnl_lock before making netdev_uses_dsa() return true. So we also handle that early event and do nothing on it. Similarly, if the dev_open() attempt is concurrent with us, it will attempt to take the rtnl_mutex, but we're holding it. We'll see that the master flag IFF_UP isn't set, then when we release the rtnl_mutex we'll process the NETDEV_UP notifier. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: Ansuel Smith <ansuelsmth@gmail.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Vladimir Oltean authored
Certain drivers may need to send management traffic to the switch for things like register access, FDB dump, etc, to accelerate what their slow bus (SPI, I2C, MDIO) can already do. Ethernet is faster (especially in bulk transactions) but is also more unreliable, since the user may decide to bring the DSA master down (or not bring it up), therefore severing the link between the host and the attached switch. Drivers needing Ethernet-based register access already should have fallback logic to the slow bus if the Ethernet method fails, but that fallback may be based on a timeout, and the I/O to the switch may slow down to a halt if the master is down, because every Ethernet packet will have to time out. The driver also doesn't have the option to turn off Ethernet-based I/O momentarily, because it wouldn't know when to turn it back on. Which is where this change comes in. By tracking NETDEV_CHANGE, NETDEV_UP and NETDEV_GOING_DOWN events on the DSA master, we should know the exact interval of time during which this interface is reliably available for traffic. Provide this information to switches so they can use it as they wish. An helper is added dsa_port_master_is_operational() to check if a master port is operational. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: Ansuel Smith <ansuelsmth@gmail.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Martin Habets authored
Ideally the size would depend on the link speed, but the recycle ring is created when the interface is brought up before the driver knows the link speed. So size it for the maximum speed of a given NIC. PowerPC is only supported on SFN7xxx and SFN8xxx NICs. With this patch on a 40G NIC the number of calls to alloc_pages and friends went down from about 18% to under 2%. On a 10G NIC the number of calls to alloc_pages and friends went down from about 15% to 0 (perf did not capture any calls during the 60 second test). On a 100G NIC the number of calls to alloc_pages and friends went down from about 23% to 4%. Reported-by: Íñigo Huguet <ihuguet@redhat.com> Signed-off-by: Martin Habets <habetsm.xilinx@gmail.com> Link: https://lore.kernel.org/r/20220131111054.cp4f6foyinaarwbn@gmail.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
Heiner Kallweit authored
According to Realtek RTL8168h supports the same L1.2 control as RTL8125. Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Link: https://lore.kernel.org/r/4784d5ce-38ac-046a-cbfa-5fdd9773f820@gmail.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
Jakub Kicinski authored
There's not reason SO_MARK would be allowed via setsockopt() and not via cmsg, let's keep the two consistent. See commit 079925cc ("net: allow SO_MARK with CAP_NET_RAW") for justification why NET_RAW -> SO_MARK is safe. Reviewed-by: Maciej Żenczykowski <maze@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20220131233357.52964-1-kuba@kernel.orgSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
- 01 Feb, 2022 16 commits
-
-
David S. Miller authored
Horatiu Vultur says: ==================== net: lan966x: Add PTP Hardward Clock support Add support for PTP Hardware Clock (PHC) for lan966x. The switch supports both PTP 1-step and 2-step modes. v1->v2: - fix commit messages - reduce the scope of the lock ptp_lock inside the function lan966x_ptp_hwtstamp_set - the rx timestamping is always enabled for all packages ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
-
Horatiu Vultur authored
Implement the function get_ts_info in ethtool_ops which is needed to get the HW capabilities for timestamping. Signed-off-by: Horatiu Vultur <horatiu.vultur@microchip.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Horatiu Vultur authored
When doing 2-step timestamping the HW will generate an interrupt when it managed to timestamp a frame. It is the SW responsibility to read it from the FIFO. Signed-off-by: Horatiu Vultur <horatiu.vultur@microchip.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Horatiu Vultur authored
Update both the extraction and injection to do timestamping of the frames. The extraction is always doing the timestamping while for injection is doing the timestamping only if it is configured. Signed-off-by: Horatiu Vultur <horatiu.vultur@microchip.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Horatiu Vultur authored
Implement the ioctl callbacks SIOCSHWTSTAMP and SIOCGHWTSTAMP to allow to configure the ports to enable/disable timestamping for TX. The RX timestamping is always enabled. The HW is capable to run both 1-step timestamping and 2-step timestamping. Signed-off-by: Horatiu Vultur <horatiu.vultur@microchip.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Horatiu Vultur authored
The lan966x has 3 PHC. Enable each of them, for now all the timestamping is happening on the first PHC. Acked-by: Richard Cochran <richardcochran@gmail.com> Signed-off-by: Horatiu Vultur <horatiu.vultur@microchip.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Horatiu Vultur authored
Add the registers that will be used to configure the PHC in the HW. Signed-off-by: Horatiu Vultur <horatiu.vultur@microchip.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Horatiu Vultur authored
Extend dt-bindings for lan966x with ptp interrupt. This is generated when doing 2-step timestamping and the timestamp can be read from the FIFO. Signed-off-by: Horatiu Vultur <horatiu.vultur@microchip.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Guillaume Nault authored
Run sysctl in quiet mode. Echoing the modified sysctl doesn't bring any useful information. Signed-off-by: Guillaume Nault <gnault@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Guillaume Nault authored
All callers of fib_rule6_test_match_n_redirect() and fib_rule4_test_match_n_redirect() pass a third argument containing a description of the test being run. Instead of ignoring this argument, let's use it for logging instead of printing a truncated version of the command. Signed-off-by: Guillaume Nault <gnault@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Guillaume Nault authored
The fib_rule6_del_by_pref() and fib_rule4_del_by_pref() functions use an uninitialised $TABLE variable. They should use $RTABLE instead. This doesn't alter the result of the test, as it just makes the grep command less specific (but since the script always uses the same table number, that doesn't really matter). Let's fix it anyway and, while there, specify the filtering parameters directly in 'ip -X rule show' to avoid the extra grep command entirely. Signed-off-by: Guillaume Nault <gnault@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Guillaume Nault authored
Let's restrict the scope of these variables to avoid possible interferences. Signed-off-by: Guillaume Nault <gnault@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/nextDavid S. Miller authored
-queue Tony Nguyen says: ==================== 10GbE Intel Wired LAN Driver Updates 2022-01-31 Alexander Lobakin says: This is an interpolation of [0] to other Intel Ethernet drivers (and is (re)based on its code). The main aim is to keep XDP metadata not only in case with build_skb(), but also when we do napi_alloc_skb() + memcpy(). All Intel drivers suffers from the same here: - metadata gets lost on XDP_PASS in legacy-rx; - excessive headroom allocation on XSK Rx to skbs; - metadata gets lost on XSK Rx to skbs. Those get especially actual in XDP Hints upcoming. I couldn't have addressed the first one for all Intel drivers due to that they don't reserve any headroom for now in legacy-rx mode even with XDP enabled. This is hugely wrong, but requires quite a bunch of work and a separate series. Luckily, ice doesn't suffer from that. igc has 1 and 3 already fixed in [0]. [0] https://lore.kernel.org/netdev/163700856423.565980.10162564921347693758.stgit@firesoul ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
-
Sergey Shtylyov authored
sh_eth_{suspend|resume}() initialize their local variable 'ret' to 0 but this value is never really used, thus we can kill those intializers... Signed-off-by: Sergey Shtylyov <s.shtylyov@omp.ru> Link: https://lore.kernel.org/r/f09d7c64-4a2b-6973-09a4-10d759ed0df4@omp.ruSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
Hyeonggon Yoo authored
By profiling, discovered that ena device driver allocates skb by build_skb() and frees by napi_skb_cache_put(). Because the driver does not use napi skb cache in allocation path, napi skb cache is periodically filled and flushed. This is waste of napi skb cache. As ena_alloc_skb() is called only in napi, Use napi_build_skb() and napi_alloc_skb() when allocating skb. This patch was tested on aws a1.metal instance. [ jwiedmann.dev@gmail.com: Use napi_alloc_skb() instead of netdev_alloc_skb_ip_align() to keep things consistent. ] Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Acked-by: Shay Agroskin <shayagr@amazon.com> Link: https://lore.kernel.org/r/YfUAkA9BhyOJRT4B@ip-172-31-19-208.ap-northeast-1.compute.internalSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
Venkata Sudheer Kumar Bhavaraju authored
Change qed_mcp_cmd() to use msleep() (by setting QED_MB_FLAG_CAN_SLEEP flag) and add new nosleep() version of the api. These api are used to issue cmds to management fw and the change affects how driver behaves while waiting for a response/resource. All sleepable callers of the existing api now use msleep() version. For non-sleepable callers, the new nosleep() version is explicitly used. Signed-off-by: Venkata Sudheer Kumar Bhavaraju <vbhavaraju@marvell.com> Signed-off-by: Alok Prasad <palok@marvell.com> Signed-off-by: Ariel Elior <aelior@marvell.com> Link: https://lore.kernel.org/r/20220131005235.1647881-1-vbhavaraju@marvell.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
- 31 Jan, 2022 19 commits
-
-
Alexander Lobakin authored
For now, if the XDP prog returns XDP_PASS on XSK, the metadata will be lost as it doesn't get copied to the skb. Copy it along with the frame headers. Account its size on skb allocation, and when copying just treat it as a part of the frame and do a pull after to "move" it to the "reserved" zone. net_prefetch() xdp->data_meta and align the copy size to speed-up memcpy() a little and better match ixgbe_construct_skb(). Fixes: d0bcacd0 ("ixgbe: add AF_XDP zero-copy Rx support") Suggested-by: Jesper Dangaard Brouer <brouer@redhat.com> Suggested-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Signed-off-by: Alexander Lobakin <alexandr.lobakin@intel.com> Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Tested-by: Sandeep Penigalapati <sandeep.penigalapati@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
-
Alexander Lobakin authored
{__,}napi_alloc_skb() allocates and reserves additional NET_SKB_PAD + NET_IP_ALIGN for any skb. OTOH, ixgbe_construct_skb_zc() currently allocates and reserves additional `xdp->data - xdp->data_hard_start`, which is XDP_PACKET_HEADROOM for XSK frames. There's no need for that at all as the frame is post-XDP and will go only to the networking stack core. Pass the size of the actual data only to __napi_alloc_skb() and don't reserve anything. This will give enough headroom for stack processing. Fixes: d0bcacd0 ("ixgbe: add AF_XDP zero-copy Rx support") Signed-off-by: Alexander Lobakin <alexandr.lobakin@intel.com> Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Tested-by: Sandeep Penigalapati <sandeep.penigalapati@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
-
Alexander Lobakin authored
To not dereference bi->xdp each time in ixgbe_construct_skb_zc(), pass bi->xdp as an argument instead of bi. We can also call xsk_buff_free() outside of the function as well as assign bi->xdp to NULL, which seems to make it closer to its name. Suggested-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Signed-off-by: Alexander Lobakin <alexandr.lobakin@intel.com> Tested-by: Sandeep Penigalapati <sandeep.penigalapati@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
-
Alexander Lobakin authored
{__,}napi_alloc_skb() allocates and reserves additional NET_SKB_PAD + NET_IP_ALIGN for any skb. OTOH, igc_construct_skb_zc() currently allocates and reserves additional `xdp->data_meta - xdp->data_hard_start`, which is about XDP_PACKET_HEADROOM for XSK frames. There's no need for that at all as the frame is post-XDP and will go only to the networking stack core. Pass the size of the actual data only (+ meta) to __napi_alloc_skb() and don't reserve anything. This will give enough headroom for stack processing. Also, net_prefetch() xdp->data_meta and align the copy size to speed-up memcpy() a little and better match igc_construct_skb(). Fixes: fc9df2a0 ("igc: Enable RX via AF_XDP zero-copy") Signed-off-by: Alexander Lobakin <alexandr.lobakin@intel.com> Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Tested-by: Nechama Kraus <nechamax.kraus@linux.intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
-
Alexander Lobakin authored
For now, if the XDP prog returns XDP_PASS on XSK, the metadata will be lost as it doesn't get copied to the skb. Copy it along with the frame headers. Account its size on skb allocation, and when copying just treat it as a part of the frame and do a pull after to "move" it to the "reserved" zone. net_prefetch() xdp->data_meta and align the copy size to speed-up memcpy() a little and better match ice_construct_skb(). Fixes: 2d4238f5 ("ice: Add support for AF_XDP") Suggested-by: Jesper Dangaard Brouer <brouer@redhat.com> Suggested-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Signed-off-by: Alexander Lobakin <alexandr.lobakin@intel.com> Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Tested-by: Kiran Bhandare <kiranx.bhandare@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
-
Alexander Lobakin authored
{__,}napi_alloc_skb() allocates and reserves additional NET_SKB_PAD + NET_IP_ALIGN for any skb. OTOH, ice_construct_skb_zc() currently allocates and reserves additional `xdp->data - xdp->data_hard_start`, which is XDP_PACKET_HEADROOM for XSK frames. There's no need for that at all as the frame is post-XDP and will go only to the networking stack core. Pass the size of the actual data only to __napi_alloc_skb() and don't reserve anything. This will give enough headroom for stack processing. Fixes: 2d4238f5 ("ice: Add support for AF_XDP") Signed-off-by: Alexander Lobakin <alexandr.lobakin@intel.com> Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Tested-by: Kiran Bhandare <kiranx.bhandare@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
-
Alexander Lobakin authored
In "legacy-rx" mode represented by ice_construct_skb(), we can still use XDP (and XDP metadata), but after XDP_PASS the metadata will be lost as it doesn't get copied to the skb. Copy it along with the frame headers. Account its size on skb allocation, and when copying just treat it as a part of the frame and do a pull after to "move" it to the "reserved" zone. Point net_prefetch() to xdp->data_meta instead of data. This won't change anything when the meta is not here, but will save some cache misses otherwise. Suggested-by: Jesper Dangaard Brouer <brouer@redhat.com> Suggested-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Signed-off-by: Alexander Lobakin <alexandr.lobakin@intel.com> Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Tested-by: Kiran Bhandare <kiranx.bhandare@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
-
Alexander Lobakin authored
For now, if the XDP prog returns XDP_PASS on XSK, the metadata will be lost as it doesn't get copied to the skb. Copy it along with the frame headers. Account its size on skb allocation, and when copying just treat it as a part of the frame and do a pull after to "move" it to the "reserved" zone. net_prefetch() xdp->data_meta and align the copy size to speed-up memcpy() a little and better match i40e_construct_skb(). Fixes: 0a714186 ("i40e: add AF_XDP zero-copy Rx support") Suggested-by: Jesper Dangaard Brouer <brouer@redhat.com> Suggested-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Signed-off-by: Alexander Lobakin <alexandr.lobakin@intel.com> Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Tested-by: Kiran Bhandare <kiranx.bhandare@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
-
Alexander Lobakin authored
{__,}napi_alloc_skb() allocates and reserves additional NET_SKB_PAD + NET_IP_ALIGN for any skb. OTOH, i40e_construct_skb_zc() currently allocates and reserves additional `xdp->data - xdp->data_hard_start`, which is XDP_PACKET_HEADROOM for XSK frames. There's no need for that at all as the frame is post-XDP and will go only to the networking stack core. Pass the size of the actual data only to __napi_alloc_skb() and don't reserve anything. This will give enough headroom for stack processing. Fixes: 0a714186 ("i40e: add AF_XDP zero-copy Rx support") Signed-off-by: Alexander Lobakin <alexandr.lobakin@intel.com> Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com> Tested-by: Kiran Bhandare <kiranx.bhandare@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
-
David S. Miller authored
Haiyang Zhang says: ==================== net: mana: Add XDP counters, reuse dropped pages Add drop, tx counters for XDP. Reuse dropped pages ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
-
Haiyang Zhang authored
Reuse the dropped page in RX path to save page allocation overhead. Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Haiyang Zhang authored
This counter will show up in ethtool stat. It is the number of packets received and forwarded by XDP program. Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Haiyang Zhang authored
This counter will show up in ethtool stat data. Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
David S. Miller authored
Tony Lu says: ==================== net/smc: Improvements for TCP_CORK and sendfile() Currently, SMC use default implement for syscall sendfile() [1], which is wildly used in nginx and big data sences. Usually, applications use sendfile() with TCP_CORK: fstat(20, {st_mode=S_IFREG|0644, st_size=4096, ...}) = 0 setsockopt(19, SOL_TCP, TCP_CORK, [1], 4) = 0 writev(19, [{iov_base="HTTP/1.1 200 OK\r\nServer: nginx/1"..., iov_len=240}], 1) = 240 sendfile(19, 20, [0] => [4096], 4096) = 4096 close(20) = 0 setsockopt(19, SOL_TCP, TCP_CORK, [0], 4) = 0 The above is an example of Nginx, when sendfile() on, Nginx first enables TCP_CORK, write headers, the data will not be sent. Then call sendfile(), it reads file and write to sndbuf. When TCP_CORK is cleared, all pending data is sent out. The performance of the default implement of sendfile is lower than when it is off. After investigation, it shows two parts to improve: - unnecessary lock contention of delayed work - less data per send than when sendfile off Patch #1 tries to reduce lock_sock() contention in smc_tx_work(). Patch #2 removes timed work for corking, and let applications control it. See TCP_CORK [2] MSG_MORE [3]. Patch #3 adds MSG_SENDPAGE_NOTLAST for corking more data when sendfile(). Test environments: - CPU Intel Xeon Platinum 8 core, mem 32 GiB, nic Mellanox CX4 - socket sndbuf / rcvbuf: 16384 / 131072 bytes - server: smc_run nginx - client: smc_run ./wrk -c 100 -t 2 -d 30 http://192.168.100.1:8080/4k.html - payload: 4KB local disk file Items QPS sendfile off 272477.10 sendfile on (orig) 223622.79 sendfile on (this) 395847.21 This benchmark shows +45.28% improvement compared with sendfile off, and +77.02% compared with original sendfile implement. [1] https://man7.org/linux/man-pages/man2/sendfile.2.html [2] https://linux.die.net/man/7/tcp [3] https://man7.org/linux/man-pages/man2/send.2.html ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
-
Tony Lu authored
This introduces a new corked flag, MSG_SENDPAGE_NOTLAST, which is involved in syscall sendfile() [1], it indicates this is not the last page. So we can cork the data until the page is not specify this flag. It has the same effect as MSG_MORE, but existed in sendfile() only. This patch adds a option MSG_SENDPAGE_NOTLAST for corking data, try to cork more data before sending when using sendfile(), which acts like TCP's behaviour. Also, this reimplements the default sendpage to inform that it is supported to some extent. [1] https://man7.org/linux/man-pages/man2/sendfile.2.htmlSigned-off-by: Tony Lu <tonylu@linux.alibaba.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Tony Lu authored
Based on the manual of TCP_CORK [1] and MSG_MORE [2], these two options have the same effect. Applications can set these options and informs the kernel to pend the data, and send them out only when the socket or syscall does not specify this flag. In other words, there's no need to send data out by a delayed work, which will queue a lot of work. This removes corked delayed work with SMC_TX_CORK_DELAY (250ms), and the applications control how/when to send them out. It improves the performance for sendfile and throughput, and remove unnecessary race of lock_sock(). This also unlocks the limitation of sndbuf, and try to fill it up before sending. [1] https://linux.die.net/man/7/tcp [2] https://man7.org/linux/man-pages/man2/send.2.htmlSigned-off-by: Tony Lu <tonylu@linux.alibaba.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Tony Lu authored
According to the man page of TCP_CORK [1], if set, don't send out partial frames. All queued partial frames are sent when option is cleared again. When applications call setsockopt to disable TCP_CORK, this call is protected by lock_sock(), and tries to mod_delayed_work() to 0, in order to send pending data right now. However, the delayed work smc_tx_work is also protected by lock_sock(). There introduces lock contention for sending data. To fix it, send pending data directly which acts like TCP, without lock_sock() protected in the context of setsockopt (already lock_sock()ed), and cancel unnecessary dealyed work, which is protected by lock. [1] https://linux.die.net/man/7/tcpSigned-off-by: Tony Lu <tonylu@linux.alibaba.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
David S. Miller authored
Akhmat Karakotov says: ==================== Make hash rethink configurable As it was shown in the report by Alexander Azimov, hash rethink at the client-side may lead to connection timeout toward stateful anycast services. Tom Herbert created a patchset to address this issue by applying hash rethink only after a negative routing event (3RTOs) [1]. This change also affects server-side behavior, which we found undesirable. This patchset changes defaults in a way to make them safe: hash rethink at the client-side is disabled and enabled at the server-side upon each RTO event or in case of duplicate acknowledgments. This patchset provides two options to change default behaviour. The hash rethink may be disabled at the server-side by the new sysctl option. Changes in the sysctl option don't affect default behavior at the client-side. Hash rethink can also be enabled/disabled with socket option or bpf syscalls which ovewrite both default and sysctl settings. This socket option is available on both client and server-side. This should provide mechanics to enable hash rethink inside administrative domain, such as DC, where hash rethink at the client-side can be desirable. [1] https://lore.kernel.org/netdev/20210809185314.38187-1-tom@herbertland.com/ v2: - Changed sysctl default to ENABLED in all patches. Reduced sysctl and socket option size to u8. Fixed netns bug reported by kernel test robot. v3: - Fixed bug with bad u8 comparison. Moved sk_txrehash to use less bytes in struct. Added WRITE_ONCE() in setsockopt in and READ_ONCE() in tcp_rtx_synack. v4: - Rebase and add documentation for sysctl option. v5: - Move sk_txrehash out of busy poll ifdef. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
-
Akhmat Karakotov authored
Disabling rehash behavior did not affect SYN ACK retransmits because hash was forcefully changed bypassing the sk_rethink_hash function. This patch adds a condition which checks for rehash mode before resetting hash. Signed-off-by: Akhmat Karakotov <hmukos@yandex-team.ru> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-