- 11 Aug, 2018 40 commits
-
-
Vlad Buslov authored
Extend gen_new_estimator() to also take stats_lock when re-assigning rate estimator statistics pointer. (to be used by unlocked actions) Rename 'stats_lock' to 'lock' and change argument description to explain that it is now also used for control path. Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Vlad Buslov authored
Re-introduce mirred list spinlock, that was removed some time ago, in order to protect it from concurrent modifications, instead of relying on rtnl lock. Use tcf spinlock to protect mirred action private data from concurrent modification in init and dump. Rearrange access to mirred data in order to be performed only while holding the lock. Rearrange net dev access to always hold reference while working with it, instead of relying on rntl lock. Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Vlad Buslov authored
As a preparation for removing dependency on rtnl lock from rules update path, all users of shared objects must take reference while working with them. Extend action ops with put_dev() API to be used on net device returned by get_dev(). Modify mirred action (only action that implements get_dev callback): - Take reference to net device in get_dev. - Implement put_dev API that releases reference to net device. Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Vlad Buslov authored
Use tcf spinlock to protect vlan action private data from concurrent modification during dump and init. Use rcu swap operation to reassign params pointer under protection of tcf lock. (old params value is not used by init, so there is no need of standalone rcu dereference step) Remove rtnl assertion that is no longer necessary. Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Vlad Buslov authored
Use tcf lock to protect tunnel key action struct private data from concurrent modification in init and dump. Use rcu swap operation to reassign params pointer under protection of tcf lock. (old params value is not used by init, so there is no need of standalone rcu dereference step) Remove rtnl lock assertion that is no longer required. Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Vlad Buslov authored
Move read of skbmod_p rcu pointer to be protected by tcf spinlock. Use tcf spinlock to protect private skbmod data from concurrent modification during dump. Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Vlad Buslov authored
Use tcf spinlock to protect private simple action data from concurrent modification during dump. (simple init already uses tcf spinlock when changing action state) Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Vlad Buslov authored
Use tcf spinlock to protect private sample action data from concurrent modification during dump and init. Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Vlad Buslov authored
Rearrange pedit init code to only access pedit action data while holding tcf spinlock. Change keys allocation type to atomic to allow it to execute while holding tcf spinlock. Take tcf spinlock in dump function when accessing pedit action data. Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Vlad Buslov authored
Use tcf spinlock to protect ipt action private data from concurrent modification during dump. Ipt init already takes tcf spinlock when modifying ipt state. Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Vlad Buslov authored
Use tcf spinlock and rcu to protect params pointer from concurrent modification during dump and init. Use rcu swap operation to reassign params pointer under protection of tcf lock. (old params value is not used by init, so there is no need of standalone rcu dereference step) Ife action has meta-actions that are compiled as standalone modules. Rtnl mutex must be released while loading a kernel module. In order to support execution without rtnl mutex, propagate 'rtnl_held' argument to meta action loading functions. When requesting meta action module, conditionally release rtnl lock depending on 'rtnl_held' argument. Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Vlad Buslov authored
Use tcf spinlock to protect gact action private state from concurrent modification during dump and init. Remove rtnl assertion that is no longer necessary. Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Vlad Buslov authored
Use tcf lock to protect csum action struct private data from concurrent modification in init and dump. Use rcu swap operation to reassign params pointer under protection of tcf lock. (old params value is not used by init, so there is no need of standalone rcu dereference step) Remove rtnl assertion that is no longer necessary. Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Vlad Buslov authored
Use tcf spinlock to protect bpf action private data from concurrent modification during dump and init. Remove rtnl lock assertion that is no longer necessary. Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
David S. Miller authored
Konstantin Khorenko says: ==================== net/sctp: Avoid allocating high order memory with kmalloc() Each SCTP association can have up to 65535 input and output streams. For each stream type an array of sctp_stream_in or sctp_stream_out structures is allocated using kmalloc_array() function. This function allocates physically contiguous memory regions, so this can lead to allocation of memory regions of very high order, i.e.: sizeof(struct sctp_stream_out) == 24, ((65535 * 24) / 4096) == 383 memory pages (4096 byte per page), which means 9th memory order. This can lead to a memory allocation failures on the systems under a memory stress. We actually do not need these arrays of memory to be physically contiguous. Possible simple solution would be to use kvmalloc() instread of kmalloc() as kvmalloc() can allocate physically scattered pages if contiguous pages are not available. But the problem is that the allocation can happed in a softirq context with GFP_ATOMIC flag set, and kvmalloc() cannot be used in this scenario. So the other possible solution is to use flexible arrays instead of contiguios arrays of memory so that the memory would be allocated on a per-page basis. This patchset replaces kvmalloc() with flex_array usage. It consists of two parts: * First patch is preparatory - it mechanically wraps all direct access to assoc->stream.out[] and assoc->stream.in[] arrays with SCTP_SO() and SCTP_SI() wrappers so that later a direct array access could be easily changed to an access to a flex_array (or any other possible alternative). * Second patch replaces kmalloc_array() with flex_array usage. v2 changes: sctp_stream_in() users are updated to provide stream as an argument, sctp_stream_{in,out}_ptr() are now just sctp_stream_{in,out}(). v3 changes: Move type chages struct sctp_stream_out -> flex_array to next patch. Make sctp_stream_{in,out}() static incline and move them to a header. Performance results (single stream): ==================================== * Kernel: v4.18-rc6 - stock and with 2 patches from Oleg (earlier in this thread) * Node: CPU (8 cores): Intel(R) Xeon(R) CPU E31230 @ 3.20GHz RAM: 32 Gb * netperf: taken from https://github.com/HewlettPackard/netperf.git, compiled from sources with sctp support * netperf server and client are run on the same node * ip link set lo mtu 1500 The script used to run tests: # cat run_tests.sh #!/bin/bash for test in SCTP_STREAM SCTP_STREAM_MANY SCTP_RR SCTP_RR_MANY; do echo "TEST: $test"; for i in `seq 1 3`; do echo "Iteration: $i"; set -x netperf -t $test -H localhost -p 22222 -S 200000,200000 -s 200000,200000 \ -l 60 -- -m 1452; set +x done done ================================================ Results (a bit reformatted to be more readable): Recv Send Send Socket Socket Message Elapsed Size Size Size Time Throughput bytes bytes bytes secs. 10^6bits/sec v4.18-rc7 v4.18-rc7 + fixes TEST: SCTP_STREAM 212992 212992 1452 60.21 1125.52 1247.04 212992 212992 1452 60.20 1376.38 1149.95 212992 212992 1452 60.20 1131.40 1163.85 TEST: SCTP_STREAM_MANY 212992 212992 1452 60.00 1111.00 1310.05 212992 212992 1452 60.00 1188.55 1130.50 212992 212992 1452 60.00 1108.06 1162.50 =========== Local /Remote Socket Size Request Resp. Elapsed Trans. Send Recv Size Size Time Rate bytes Bytes bytes bytes secs. per sec v4.18-rc7 v4.18-rc7 + fixes TEST: SCTP_RR 212992 212992 1 1 60.00 45486.98 46089.43 212992 212992 1 1 60.00 45584.18 45994.21 212992 212992 1 1 60.00 45703.86 45720.84 TEST: SCTP_RR_MANY 212992 212992 1 1 60.00 40.75 40.77 212992 212992 1 1 60.00 40.58 40.08 212992 212992 1 1 60.00 39.98 39.97 Performance results for many streams: ===================================== * Kernel: v4.18-rc8 - stock and with 2 patches v3 * Node: CPU (8 cores): Intel(R) Xeon(R) CPU E31230 @ 3.20GHz RAM: 32 Gb * sctp_test: https://github.com/sctp/lksctp-tools * both server and client are run on the same node * ip link set lo mtu 1500 * sysctl -w vm.max_map_count=65530000 (need it to make memory fragmented) The script used to run tests: ============================= # cat run_sctp_test.sh #!/bin/bash set -x uname -r ip link set lo mtu 1500 swapoff -a free cat /proc/buddyinfo ./src/apps/sctp_test -H 127.0.0.1 -P 22222 -l -d 0 & sleep 3 time ./src/apps/sctp_test -H 127.0.0.1 -P 22221 -h 127.0.0.1 -p 22222 \ -s -c 1 -M 65535 -T -t 1 -x 100000 -d 0 1>/dev/null killall -9 lt-sctp_test =============================== Results (a bit reformatted to be more readable): 1) ms stock kernel v4.18-rc8, no memory fragmentation test 1 test 2 test 3 real 0m14.715s 0m14.593s 0m15.954s user 0m0.954s 0m0.955s 0m0.854s sys 0m13.388s 0m12.537s 0m13.749s 2) kernel with fixes, no memory fragmentation test 1 test 2 test 3 real 0m14.959s 0m14.693s 0m14.762s user 0m0.948s 0m0.921s 0m0.929s sys 0m13.538s 0m13.225s 0m13.217s 3) kernel with fixes, memory fragmented 'free': total used free shared buff/cache available Mem: 32906008 30555200 302740 764 2048068 266452 Mem: 32906008 30379948 541436 764 1984624 442376 Mem: 32906008 30717312 262380 764 1926316 109908 /proc/buddyinfo: Node 0, zone Normal 40773 37 34 29 0 0 0 0 0 0 0 Node 0, zone Normal 100332 68 8 4 2 1 1 0 0 0 0 Node 0, zone Normal 31113 7 2 1 0 0 0 0 0 0 0 test 1 test 2 test 3 real 0m14.159s 0m15.252s 0m15.826s user 0m0.839s 0m1.004s 0m1.048s sys 0m11.827s 0m14.240s 0m14.778s ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
-
Konstantin Khorenko authored
This path replaces physically contiguous memory arrays allocated using kmalloc_array() with flexible arrays. This enables to avoid memory allocation failures on the systems under a memory stress. Signed-off-by: Oleg Babin <obabin@virtuozzo.com> Signed-off-by: Konstantin Khorenko <khorenko@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Konstantin Khorenko authored
This patch introduces wrappers for accessing in/out streams indirectly. This will enable to replace physically contiguous memory arrays of streams with flexible arrays (or maybe any other appropriate mechanism) which do memory allocation on a per-page basis. Signed-off-by: Oleg Babin <obabin@virtuozzo.com> Signed-off-by: Konstantin Khorenko <khorenko@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Keara Leibovitz authored
Updated README. Added config file that contains the minimum required features enabled to run the tests currently present in the kernel. This must be updated when new unittests are created and require their own modules. Signed-off-by: Keara Leibovitz <kleib@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
David S. Miller authored
Guillaume Nault says: ==================== l2tp: rework pppol2tp ioctl handling The current ioctl() handling code can be simplified. It tests for non-relevant conditions and uselessly holds sockets. Once useless code is removed, it becomes even simpler to let pppol2tp_ioctl() handle commands directly, rather than dispatch them to pppol2tp_tunnel_ioctl() or pppol2tp_session_ioctl(). That is the approach taken by this series. Patch #1 and #2 define helper functions aimed at simplifying the rest of the patch set. Patch #3 drops useless tests in pppol2p_ioctl() and avoid holding a refcount on the socket. Patches #4, #5 and #6 are the core of the series. They let pppol2tp_ioctl() handle all ioctls and drop the tunnel and session specific functions. Then patch #6 brings a little bit of consolidation. Finally, patch #7 takes advantage of the simplified code to make pppol2tp sockets compatible with dev_ioctl(). Certainly not a killer feature, but it is trivial and it is always nice to see l2tp getting better integration with the rest of the stack. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
-
Guillaume Nault authored
Return -ENOIOCTLCMD for unknown ioctl commands. This lets dev_ioctl() handle generic socket ioctls like SIOCGIFNAME or SIOCGIFINDEX. PF_PPPOX/PX_PROTO_OL2TP was one of the few socket types not honouring this mechanism. Signed-off-by: Guillaume Nault <g.nault@alphalink.fr> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Guillaume Nault authored
Integrate memset(0) in pppol2tp_copy_stats() to avoid calling it manually every time. While there, constify 'stats'. Signed-off-by: Guillaume Nault <g.nault@alphalink.fr> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Guillaume Nault authored
pppol2tp_ioctl() has everything in place for handling PPPIOCGL2TPSTATS on session sockets. We just need to copy the stats and set ->session_id. As a side effect of sharing session and tunnel code, ->using_ipsec is properly set even when the request was made using a session socket. Signed-off-by: Guillaume Nault <g.nault@alphalink.fr> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Guillaume Nault authored
Handle PPPIOCGL2TPSTATS in pppol2tp_ioctl() if the socket represents a tunnel. This one is a bit special because the caller may use the tunnel socket to retrieve statistics of one of its sessions. If the session_id is set, the corresponding session's statistics are returned, instead of those of the tunnel. This is handled by the new pppol2tp_tunnel_copy_stats() helper function. Set ->tunnel_id and ->using_ipsec out of the conditional, so that it can be used by the 'else' branch in the following patch. We cannot do that for ->session_id, because tunnel sockets have to report the value that was originally passed in 'stats.session_id', while session sockets have to report their own session_id. Signed-off-by: Guillaume Nault <g.nault@alphalink.fr> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Guillaume Nault authored
Let pppol2tp_ioctl() handle ioctl commands directly. It still relies on pppol2tp_{session,tunnel}_ioctl() for PPPIOCGL2TPSTATS. Signed-off-by: Guillaume Nault <g.nault@alphalink.fr> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Guillaume Nault authored
* Drop test on 'sk': sock->sk cannot be NULL, or pppox_ioctl() could not have called us. * Drop test on 'SOCK_DEAD' state: if this flag was set, the socket would be in the process of being released and no ioctl could be running anymore. * Drop test on 'PPPOX_*' state: we depend on ->sk_user_data to get the session structure. If it is non-NULL, then the socket is connected. Testing for PPPOX_* is redundant. * Retrieve session using ->sk_user_data directly, instead of going through pppol2tp_sock_to_session(). This avoids grabbing a useless reference on the socket. Signed-off-by: Guillaume Nault <g.nault@alphalink.fr> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Guillaume Nault authored
l2tp_session_get() is used for two different purposes. If 'tunnel' is NULL, the session is searched globally in the supplied network namespace. Otherwise it is searched exclusively in the tunnel context. Callers always know the context in which they need to search the session. But some of them do provide both a namespace and a tunnel, making the semantic of the call unclear. This patch defines l2tp_tunnel_get_session() for lookups done in a tunnel and restricts l2tp_session_get() to namespace searches. Signed-off-by: Guillaume Nault <g.nault@alphalink.fr> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Guillaume Nault authored
Use helper function to figure out if a tunnel is using ipsec. Also, avoid accessing ->sk_policy directly since it's RCU protected. Signed-off-by: Guillaume Nault <g.nault@alphalink.fr> Signed-off-by: David S. Miller <davem@davemloft.net>
-
David S. Miller authored
Ilias Apalodimas says: ==================== netsec driver improvements This patchset introduces some improvements on socionext netsec driver. - patch 1/2, avoids unneeded MMIO reads on the Rx path - patch 2/2, is adjusting the numbers of descriptors used Changes since v1: - Move dma_rmb() to protect descriptor accesses until the device has updated the NETSEC_RX_PKT_OWN_FIELD bit ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
-
Ilias Apalodimas authored
Increasing descriptors to 256 from 128 and adjusting the NAPI weight to 64 increases performace on Rx by ~20% on 64byte packets Signed-off-by: Ilias Apalodimas <ilias.apalodimas@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Ilias Apalodimas authored
MMIO reads for remaining packets in queue occur (at least)twice per invocation of netsec_process_rx(). We can use the packet descriptor to identify if it's owned by the hardware and break out, avoiding the more expensive MMIO read operations. This has a ~2% increase on the pps of the Rx path when tested with 64byte packets Signed-off-by: Ilias Apalodimas <ilias.apalodimas@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>
-
YueHaibing authored
Fixes gcc '-Wunused-but-set-variable' warning: drivers/net/ethernet/neterion/vxge/vxge-config.c:1097:6: warning: variable 'ret' set but not used [-Wunused-but-set-variable] drivers/net/ethernet/neterion/vxge/vxge-config.c:2263:6: warning: variable 'req_out' set but not used [-Wunused-but-set-variable] drivers/net/ethernet/neterion/vxge/vxge-config.c:2262:22: warning: variable 'status' set but not used [-Wunused-but-set-variable] drivers/net/ethernet/neterion/vxge/vxge-config.c:2360:22: warning: variable 'status' set but not used [-Wunused-but-set-variable] enum vxge_hw_status status = VXGE_HW_OK; Signed-off-by: YueHaibing <yuehaibing@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
David S. Miller authored
Caleb Raitto says: ==================== virtio_net: Expand affinity to arbitrary numbers of cpu and vq Virtio-net tries to pin each virtual queue rx and tx interrupt to a cpu if there are as many queues as cpus. Expand this heuristic to configure a reasonable affinity setting also when the number of cpus != the number of virtual queues. Patch 1 allows vqs to take an affinity mask with more than 1 cpu. Patch 2 generalizes the algorithm in virtnet_set_affinity beyond the case where #cpus == #vqs. v2 changes: Renamed "virtio_net: Make vp_set_vq_affinity() take a mask." to "virtio: Make vp_set_vq_affinity() take a mask." Tested: [InstanceSetup] set_multiqueue = false $ cd /proc/irq $ for i in `seq 24 60` ; do sudo grep ".*" $i/smp_affinity_list; done 0-15 0 0 1 1 2 2 3 3 4 4 5 5 6 6 7 7 8 8 9 9 10 10 11 11 12 12 13 13 14 14 15 15 0-15 0-15 0-15 0-15 $ cd /sys/class/net/eth0/queues/ $ for i in `seq 0 15` ; do sudo grep ".*" tx-$i/xps_cpus; done 0001 0002 0004 0008 0010 0020 0040 0080 0100 0200 0400 0800 1000 2000 4000 8000 $ sudo ethtool -L eth0 combined 15 $ cd /proc/irq $ for i in `seq 24 60` ; do sudo grep ".*" $i/smp_affinity_list; done 0-15 0-1 0-1 2 2 3 3 4 4 5 5 6 6 7 7 8 8 9 9 10 10 11 11 12 12 13 13 14 14 15 15 15 15 0-15 0-15 0-15 0-15 $ cd /sys/class/net/eth0/queues/ $ for i in `seq 0 14` ; do sudo grep ".*" tx-$i/xps_cpus; done 0003 0004 0008 0010 0020 0040 0080 0100 0200 0400 0800 1000 2000 4000 8000 $ sudo ethtool -L eth0 combined 8 $ cd /proc/irq $ for i in `seq 24 60` ; do sudo grep ".*" $i/smp_affinity_list; done 0-15 0-1 0-1 2-3 2-3 4-5 4-5 6-7 6-7 8-9 8-9 10-11 10-11 12-13 12-13 14-15 14-15 9 9 10 10 11 11 12 12 13 13 14 14 15 15 15 15 0-15 0-15 0-15 0-15 $ cd /sys/class/net/eth0/queues/ $ for i in `seq 0 7` ; do sudo grep ".*" tx-$i/xps_cpus; done 0003 000c 0030 00c0 0300 0c00 3000 c000 $ sudo ethtool -L eth0 combined 16 $ sudo sh -c "echo 0 > /sys/devices/system/cpu/cpu15/online" $ cd /proc/irq $ for i in `seq 24 60` ; do sudo grep ".*" $i/smp_affinity_list; done 0-15 0 0 1 1 2 2 3 3 4 4 5 5 6 6 7 7 8 8 9 9 10 10 11 11 12 12 13 13 14 14 0 0 0-15 0-15 0-15 0-15 $ cd /sys/class/net/eth0/queues/ $ for i in `seq 0 15` ; do sudo grep ".*" tx-$i/xps_cpus; done 0001 0002 0004 0008 0010 0020 0040 0080 0100 0200 0400 0800 1000 2000 4000 0001 $ for i in `seq 8 15`; \ do sudo sh -c "echo 0 > /sys/devices/system/cpu/cpu$i/online"; done $ cd /proc/irq $ for i in `seq 24 60` ; do sudo grep ".*" $i/smp_affinity_list; done 0-15 0 0 1 1 2 2 3 3 4 4 5 5 6 6 7 7 0 0 1 1 2 2 3 3 4 4 5 5 6 6 7 7 0-15 0-15 0-15 0-15 $ cd /sys/class/net/eth0/queues/ $ for i in `seq 0 15` ; do sudo grep ".*" tx-$i/xps_cpus; done 0001 0002 0004 0008 0010 0020 0040 0080 0001 0002 0004 0008 0010 0020 0040 0080 ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
-
Caleb Raitto authored
Always set the affinity hint, even if #cpu != #vq. Handle the case where #cpu > #vq (including when #cpu % #vq != 0) and when #vq > #cpu (including when #vq % #cpu != 0). Signed-off-by: Caleb Raitto <caraitto@google.com> Signed-off-by: Willem de Bruijn <willemb@google.com> Acked-by: Jon Olson <jonolson@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Caleb Raitto authored
Make vp_set_vq_affinity() take a cpumask instead of taking a single CPU. If there are fewer queues than cores, queue affinity should be able to map to multiple cores. Link: https://patchwork.ozlabs.org/patch/948149/Suggested-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Caleb Raitto <caraitto@google.com> Acked-by: Gonglei <arei.gonglei@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Bryan Whitehead authored
PTP support includes: Ingress, and egress timestamping. One step timestamping available. PTP clock support. Periodic output support. Signed-off-by: Bryan Whitehead <Bryan.Whitehead@microchip.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
David S. Miller authored
Yuchung Cheng says: ==================== tcp: new mechanism to ACK immediately This patch is a follow-up feature improvement to the recent fixes on the performance issues in ECN (delayed) ACKs. Many of the fixes use tcp_enter_quickack_mode routine to force immediate ACKs. However the routine also reset tracking interactive session. This is not ideal because these immediate ACKs are required by protocol specifics unrelated to the interactiveness nature of the application. This patch set introduces a new flag to send a one-time immediate ACK without changing the status of interactive session tracking. With this patch set the immediate ACKs are generated upon these protocol states: 1) When a hole is repaired 2) When CE status changes between subsequent data packets received 3) When a data packet carries CWR flag ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
-
Yuchung Cheng authored
Previously commit 9aee4000 ("tcp: ack immediately when a cwr packet arrives") calls tcp_enter_quickack_mode to force sending two immediate ACKs upon receiving a packet w/ CWR flag. The side effect is it'll also reset the delayed ACK timer and interactive session tracking. This patch removes that side effect by using the new ACK_NOW flag to force an immmediate ACK. Packetdrill to demonstrate: 0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3 +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0 +0 setsockopt(3, SOL_TCP, TCP_CONGESTION, "dctcp", 5) = 0 +0 bind(3, ..., ...) = 0 +0 listen(3, 1) = 0 +0 < [ect0] SEW 0:0(0) win 32792 <mss 1000,sackOK,nop,nop,nop,wscale 7> +0 > SE. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 8> +.1 < [ect0] . 1:1(0) ack 1 win 257 +0 accept(3, ..., ...) = 4 +0 < [ect0] . 1:1001(1000) ack 1 win 257 +0 > [ect01] . 1:1(0) ack 1001 +0 write(4, ..., 1) = 1 +0 > [ect01] P. 1:2(1) ack 1001 +0 < [ect0] . 1001:2001(1000) ack 2 win 257 +0 write(4, ..., 1) = 1 +0 > [ect01] P. 2:3(1) ack 2001 +0 < [ect0] . 2001:3001(1000) ack 3 win 257 +0 < [ect0] . 3001:4001(1000) ack 3 win 257 // Ack delayed ... +.01 < [ce] P. 4001:4501(500) ack 3 win 257 +0 > [ect01] . 3:3(0) ack 4001 +0 > [ect01] E. 3:3(0) ack 4501 +.001 read(4, ..., 4500) = 4500 +0 write(4, ..., 1) = 1 +0 > [ect01] PE. 3:4(1) ack 4501 win 100 +.01 < [ect0] W. 4501:5501(1000) ack 4 win 257 // No delayed ACK on CWR flag +0 > [ect01] . 4:4(0) ack 5501 +.31 < [ect0] . 5501:6501(1000) ack 4 win 257 +0 > [ect01] . 4:4(0) ack 6501 Fixes: 9aee4000 ("tcp: ack immediately when a cwr packet arrives") Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Yuchung Cheng authored
RFC 5681 sec 4.2: To provide feedback to senders recovering from losses, the receiver SHOULD send an immediate ACK when it receives a data segment that fills in all or part of a gap in the sequence space. When a gap is partially filled, __tcp_ack_snd_check already checks the out-of-order queue and correctly send an immediate ACK. However when a gap is fully filled, the previous implementation only resets pingpong mode which does not guarantee an immediate ACK because the quick ACK counter may be zero. This patch addresses this issue by marking the one-time immediate ACK flag instead. Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Wei Wang <weiwan@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Yuchung Cheng authored
The recent fix of acking immediately in DCTCP on CE status change has an undesirable side-effect: it also resets TCP ack timer and disables pingpong mode (interactive session). But the CE status change has nothing to do with them. This patch addresses that by using the new one-time immediate ACK flag instead of calling tcp_enter_quickack_mode(). Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Wei Wang <weiwan@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Yuchung Cheng authored
Add a new flag to indicate a one-time immediate ACK. This flag is occasionaly set under specific TCP protocol states in addition to the more common quickack mechanism for interactive application. In several cases in the TCP code we want to force an immediate ACK but do not want to call tcp_enter_quickack_mode() because we do not want to forget the icsk_ack.pingpong or icsk_ack.ato state. Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Wei Wang <weiwan@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-