- 07 Aug, 2017 8 commits
-
-
Anton Volkov authored
The synchronization type that was used earlier to guard the loop that deletes unused log buffers may lead to a situation that prevents any thread from going through the loop. The patch deletes previously used synchronization mechanism and moves the loop under the spin_lock so the similar cases won't be feasible in the future. Found by by Linux Driver Verification project (linuxtesting.org). Signed-off-by: Anton Volkov <avolkov@ispras.ru> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Julian Wiedmann authored
On L3, the qeth_hdr struct needs to be filled with the next-hop IP address. The current code accesses rtable->rt_gateway without checking that rtable is a valid address. The accidental access to a lowcore area results in a random next-hop address in the qeth_hdr. rtable (or more precisely, skb_dst(skb)) can be NULL in rare cases (for instance together with AF_PACKET sockets). This patch adds the missing NULL-ptr checks. Signed-off-by: Julian Wiedmann <jwi@linux.vnet.ibm.com> Signed-off-by: Ursula Braun <ubraun@linux.vnet.ibm.com> Fixes: 87e7597b qeth: Move away from using neighbour entries in qeth_l3_fill_header() Signed-off-by: David S. Miller <davem@davemloft.net>
-
David S. Miller authored
Dean Jenkins says: ==================== asix: Improve robustness Please consider taking these patches to improve the robustness of the ASIX USB to Ethernet driver. Failures prompting an ASIX driver code review ============================================= On an ARM i.MX6 embedded platform some strange one-off and two-off failures were observed in and around the ASIX USB to Ethernet driver. This was observed on a highly modified kernel 3.14 with the ASIX driver containing back-ported changes from kernel.org up to kernel 4.8 approximately. a) A one-off failure in asix_rx_fixup_internal(): There was an occurrence of an attempt to write off the end of the netdev buffer which was trapped by skb_over_panic() in skb_put(). [20030.846440] skbuff: skb_over_panic: text:7f2271c0 len:120 put:60 head:8366ecc0 data:8366ed02 tail:0x8366ed7a end:0x8366ed40 dev:eth0 [20030.863007] Kernel BUG at 8044ce38 [verbose debug info unavailable] [20031.215345] Backtrace: [20031.217884] [<8044cde0>] (skb_panic) from [<8044d50c>] (skb_put+0x50/0x5c) [20031.227408] [<8044d4bc>] (skb_put) from [<7f2271c0>] (asix_rx_fixup_internal+0x1c4/0x23c [asix]) [20031.242024] [<7f226ffc>] (asix_rx_fixup_internal [asix]) from [<7f22724c>] (asix_rx_fixup_common+0x14/0x18 [asix]) [20031.260309] [<7f227238>] (asix_rx_fixup_common [asix]) from [<7f21f7d4>] (usbnet_bh+0x74/0x224 [usbnet]) [20031.269879] [<7f21f760>] (usbnet_bh [usbnet]) from [<8002f834>] (call_timer_fn+0xa4/0x1f0) [20031.283961] [<8002f790>] (call_timer_fn) from [<80030834>] (run_timer_softirq+0x230/0x2a8) [20031.302782] [<80030604>] (run_timer_softirq) from [<80028780>] (__do_softirq+0x15c/0x37c) [20031.321511] [<80028624>] (__do_softirq) from [<80028c38>] (irq_exit+0x8c/0xe8) [20031.339298] [<80028bac>] (irq_exit) from [<8000e9c8>] (handle_IRQ+0x8c/0xc8) [20031.350038] [<8000e93c>] (handle_IRQ) from [<800085c8>] (gic_handle_irq+0xb8/0xf8) [20031.365528] [<80008510>] (gic_handle_irq) from [<8050de80>] (__irq_svc+0x40/0x70) Analysis of the logic of the ASIX driver (containing backported changes from kernel.org up to kernel 4.8 approximately) suggested that the software could not trigger skb_over_panic(). The analysis of the kernel BUG() crash information suggested that the netdev buffer was written with 2 minimal 60 octet length Ethernet frames (ASIX hardware drops the 4 octet FCS field) and the 2nd Ethernet frame attempted to write off the end of the netdev buffer. Note that the netdev buffer should only contain 1 Ethernet frame so if an attempt to write 2 Ethernet frames into the buffer is made then that is wrong. However, the logic of the asix_rx_fixup_internal() only allows 1 Ethernet frame to be written into the netdev buffer. Potentially this failure was due to memory corruption because it was only seen once. b) Two-off failures in the NAPI layer's backlog queue: There were 2 crashes in the NAPI layer's backlog queue presumably after asix_rx_fixup_internal() called usbnet_skb_return(). [24097.273945] Unable to handle kernel NULL pointer dereference at virtual address 00000004 [24097.398944] PC is at process_backlog+0x80/0x16c [24097.569466] Backtrace: [24097.572007] [<8045ad98>] (process_backlog) from [<8045b64c>] (net_rx_action+0xcc/0x248) [24097.591631] [<8045b580>] (net_rx_action) from [<80028780>] (__do_softirq+0x15c/0x37c) [24097.610022] [<80028624>] (__do_softirq) from [<800289cc>] (run_ksoftirqd+0x2c/0x84) and [ 1059.828452] Unable to handle kernel NULL pointer dereference at virtual address 00000000 [ 1059.953715] PC is at process_backlog+0x84/0x16c [ 1060.140896] Backtrace: [ 1060.143434] [<8045ad98>] (process_backlog) from [<8045b64c>] (net_rx_action+0xcc/0x248) [ 1060.163075] [<8045b580>] (net_rx_action) from [<80028780>] (__do_softirq+0x15c/0x37c) [ 1060.181474] [<80028624>] (__do_softirq) from [<80028c38>] (irq_exit+0x8c/0xe8) [ 1060.199256] [<80028bac>] (irq_exit) from [<8000e9c8>] (handle_IRQ+0x8c/0xc8) [ 1060.210006] [<8000e93c>] (handle_IRQ) from [<800085c8>] (gic_handle_irq+0xb8/0xf8) [ 1060.225492] [<80008510>] (gic_handle_irq) from [<8050de80>] (__irq_svc+0x40/0x70) The embedded board was only using an ASIX USB to Ethernet adaptor eth0. Analysis suggested that the doubly-linked list pointers of the backlog queue had been corrupted because one of the link pointers was NULL. Potentially this failure was due to memory corruption because it was only seen twice. Results of the ASIX driver code review ====================================== During the code review some weaknesses were observed in the ASIX driver and the following patches have been created to improve the robustness. Brief overview of the patches ----------------------------- 1. asix: Add rx->ax_skb = NULL after usbnet_skb_return() The current ASIX driver sends the received Ethernet frame to the NAPI layer of the network stack via the call to usbnet_skb_return() in asix_rx_fixup_internal() but retains the rx->ax_skb pointer to the netdev buffer. The driver no longer needs the rx->ax_skb pointer at this point because the NAPI layer now has the Ethernet frame. This means that asix_rx_fixup_internal() must not use rx->ax_skb after the call to usbnet_skb_return() because it could corrupt the handling of the Ethernet frame within the network layer. Therefore, to remove the risk of erroneous usage of rx->ax_skb, set rx->ax_skb to NULL after the call to usbnet_skb_return(). This avoids potential erroneous freeing of rx->ax_skb and erroneous writing to the netdev buffer. If the software now somehow inappropriately reused rx->ax_skb, then a NULL pointer dereference of rx->ax_skb would occur which makes investigation easier. 2. asix: Ensure asix_rx_fixup_info members are all reset This patch creates reset_asix_rx_fixup_info() to allow all the asix_rx_fixup_info structure members to be consistently reset to initial conditions. Call reset_asix_rx_fixup_info() upon each detectable error condition so that the next URB is processed from a known state. Otherwise, there is a risk that some members of the asix_rx_fixup_info structure may be incorrect after an error occurred so potentially leading to a malfunction. 3. asix: Fix small memory leak in ax88772_unbind() This patch creates asix_rx_fixup_common_free() to allow the rx->ax_skb to be freed when necessary. asix_rx_fixup_common_free() is called from ax88772_unbind() before the parent private data structure is freed. Without this patch, there is a risk of a small netdev buffer memory leak each time ax88772_unbind() is called during the reception of an Ethernet frame that spans across 2 URBs. Testing ======= The patches have been sanity tested on a 64-bit Linux laptop running kernel 4.13-rc2 with the 3 patches applied on top. The ASIX USB to Adaptor used for testing was (output of lsusb): ID 0b95:772b ASIX Electronics Corp. AX88772B Test #1 ------- The test ran a flood ping test script which slowly incremented the ICMP Echo Request's payload from 0 to 5000 octets. This eventually causes IPv4 fragmentation to occur which causes Ethernet frames to be sent very close to each other so increases the probability that an Ethernet frame will span 2 URBs. The test showed that all pings were successful. The test took about 15 minutes to complete. Test #2 ------- A script was run on the laptop to periodically run ifdown and ifup every second so that the ASIX USB to Adaptor was up for 1 second and down for 1 second. From a Linux PC connected to the laptop, the following ping command was used ping -f -s 5000 <ip address of laptop> The large ICMP payload causes IPv4 fragmentation resulting in multiple Ethernet frames per original IP packet. Kernel debug within the ASIX driver was enabled to see whether any ASIX errors were generated. The test was run for about 24 hours and no ASIX errors were seen. Patches ======= The 3 patches have been rebased off the net-next repo master branch with HEAD fbbeefdd net: fec: Allow reception of frames bigger than 1522 bytes ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
-
Dean Jenkins authored
When Ethernet frames span mulitple URBs, the netdev buffer memory pointed to by the asix_rx_fixup_info structure remains allocated during the time gap between the 2 executions of asix_rx_fixup_internal(). This means that if ax88772_unbind() is called within this time gap to free the memory of the parent private data structure then a memory leak of the part filled netdev buffer memory will occur. Therefore, create a new function asix_rx_fixup_common_free() to free the memory of the netdev buffer and add a call to asix_rx_fixup_common_free() from inside ax88772_unbind(). Consequently when an unbind occurs part way through receiving an Ethernet frame, the netdev buffer memory that is holding part of the received Ethernet frame will now be freed. Signed-off-by: Dean Jenkins <Dean_Jenkins@mentor.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Dean Jenkins authored
There is a risk that the members of the structure asix_rx_fixup_info become unsynchronised leading to the possibility of a malfunction. For example, rx->split_head was not being set to false after an error was detected so potentially could cause a malformed 32-bit Data header word to be formed. Therefore add function reset_asix_rx_fixup_info() to reset all the members of asix_rx_fixup_info so that future processing will start with known initial conditions. Also, if (skb->len != offset) becomes true then call reset_asix_rx_fixup_info() so that the processing of the next URB starts with known initial conditions. Without the call, the check does nothing which potentially could lead to a malfunction when the next URB is processed. In addition, for robustness, call reset_asix_rx_fixup_info() before every error path's "return 0". This ensures that the next URB is processed from known initial conditions. Signed-off-by: Dean Jenkins <Dean_Jenkins@mentor.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Dean Jenkins authored
In asix_rx_fixup_internal() there is a risk that rx->ax_skb gets reused after passing the Ethernet frame into the network stack via usbnet_skb_return(). The risks include: a) asynchronously freeing rx->ax_skb after passing the netdev buffer to the NAPI layer which might corrupt the backlog queue. b) erroneously reusing rx->ax_skb such as calling skb_put_data() multiple times which causes writing off the end of the netdev buffer. Therefore add a defensive rx->ax_skb = NULL after usbnet_skb_return() so that it is not possible to free rx->ax_skb or to apply skb_put_data() too many times. Signed-off-by: Dean Jenkins <Dean_Jenkins@mentor.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Thomas Richter authored
Commit 18f3d6be ("selftests/bpf: Add test cases to test narrower ctx field loads") introduced new eBPF test cases. One of them (test_pkt_md_access.c) fails on s390x. The BPF verifier error message is: [root@s8360046 bpf]# ./test_progs test_pkt_access:PASS:ipv4 349 nsec test_pkt_access:PASS:ipv6 212 nsec [....] libbpf: load bpf program failed: Permission denied libbpf: -- BEGIN DUMP LOG --- libbpf: 0: (71) r2 = *(u8 *)(r1 +0) invalid bpf_context access off=0 size=1 libbpf: -- END LOG -- libbpf: failed to load program 'test1' libbpf: failed to load object './test_pkt_md_access.o' Summary: 29 PASSED, 1 FAILED [root@s8360046 bpf]# This is caused by a byte endianness issue. S390x is a big endian architecture. Pointer access to the lowest byte or halfword of a four byte value need to add an offset. On little endian architectures this offset is not needed. Fix this and use the same approach as the originator used for other files (for example test_verifier.c) in his original commit. With this fix the test program test_progs succeeds on s390x: [root@s8360046 bpf]# ./test_progs test_pkt_access:PASS:ipv4 236 nsec test_pkt_access:PASS:ipv6 217 nsec test_xdp:PASS:ipv4 3624 nsec test_xdp:PASS:ipv6 1722 nsec test_l4lb:PASS:ipv4 926 nsec test_l4lb:PASS:ipv6 1322 nsec test_tcp_estats:PASS: 0 nsec test_bpf_obj_id:PASS:get-fd-by-notexist-prog-id 0 nsec test_bpf_obj_id:PASS:get-fd-by-notexist-map-id 0 nsec test_bpf_obj_id:PASS:get-prog-info(fd) 0 nsec test_bpf_obj_id:PASS:get-map-info(fd) 0 nsec test_bpf_obj_id:PASS:get-prog-info(fd) 0 nsec test_bpf_obj_id:PASS:get-map-info(fd) 0 nsec test_bpf_obj_id:PASS:get-prog-fd(next_id) 0 nsec test_bpf_obj_id:PASS:get-prog-info(next_id->fd) 0 nsec test_bpf_obj_id:PASS:get-prog-fd(next_id) 0 nsec test_bpf_obj_id:PASS:get-prog-info(next_id->fd) 0 nsec test_bpf_obj_id:PASS:check total prog id found by get_next_id 0 nsec test_bpf_obj_id:PASS:get-map-fd(next_id) 0 nsec test_bpf_obj_id:PASS:get-map-fd(next_id) 0 nsec test_bpf_obj_id:PASS:get-map-fd(next_id) 0 nsec test_bpf_obj_id:PASS:get-map-fd(next_id) 0 nsec test_bpf_obj_id:PASS:get-map-fd(next_id) 0 nsec test_bpf_obj_id:PASS:get-map-fd(next_id) 0 nsec test_bpf_obj_id:PASS:get-map-fd(next_id) 0 nsec test_bpf_obj_id:PASS:check get-map-info(next_id->fd) 0 nsec test_bpf_obj_id:PASS:get-map-fd(next_id) 0 nsec test_bpf_obj_id:PASS:check get-map-info(next_id->fd) 0 nsec test_bpf_obj_id:PASS:check total map id found by get_next_id 0 nsec test_pkt_md_access:PASS: 277 nsec Summary: 30 PASSED, 0 FAILED [root@s8360046 bpf]# Fixes: 18f3d6be ("selftests/bpf: Add test cases to test narrower ctx field loads") Signed-off-by: Thomas Richter <tmricht@linux.vnet.ibm.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
-
stephen hemminger authored
The existing sub channel code did not wait for all the sub-channels to completely initialize. This could lead to race causing crash in napi_netif_del() from bad list. The existing code would send an init message, then wait only for the initial response that the init message was received. It thought it was waiting for sub channels but really the init response did the wakeup. The new code keeps track of the number of open channels and waits until that many are open. Other issues here were: * host might return less sub-channels than was requested. * the new init status is not valid until after init was completed. Fixes: b3e6b82a ("hv_netvsc: Wait for sub-channels to be processed during probe") Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
- 04 Aug, 2017 9 commits
-
-
Daniel Borkmann authored
We really must check with #if __BYTE_ORDER == XYZ instead of just presence of #ifdef __LITTLE_ENDIAN. I noticed that when actually running this on big endian machine, the latter test resolves to true for user space, same for #ifdef __BIG_ENDIAN. E.g., looking at endian.h from libc, both are also defined there, so we really must test this against __BYTE_ORDER instead for proper insns selection. For the kernel, such checks are fine though e.g. see 13da9e20 ("Revert "endian: #define __BYTE_ORDER"") and 415586c9 ("UAPI: fix endianness conditionals in M32R's asm/stat.h") for some more context, but not for user space. Lets also make sure to properly include endian.h. After that, suite passes for me: ./test_verifier: ELF 64-bit MSB executable, [...] Linux foo 4.13.0-rc3+ #4 SMP Fri Aug 4 06:59:30 EDT 2017 s390x s390x s390x GNU/Linux Before fix: Summary: 505 PASSED, 11 FAILED After fix: Summary: 516 PASSED, 0 FAILED Fixes: 18f3d6be ("selftests/bpf: Add test cases to test narrower ctx field loads") Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Yonghong <yhs@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Thomas Bogendoerfer authored
Even the driver doesn't do anything with the clk source for SGMII ports it needs to be enabled by doing a devm_clk_get(), if there is a clk source in DT. Fixes: 0db01097 ('xgene: Don't fail probe, if there is no clk resource for SGMII interfaces') Signed-off-by: Thomas Bogendoerfer <tbogendoerfer@suse.de> Tested-by: Laura Abbott <labbott@redhat.com> Acked-by: Iyappan Subramanian <isubramanian@apm.com> Tested-by: Will Deacon <will.deacon@arm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
David Daney authored
Inexplicably, commit f381bf6d ("MIPS: Add support for eBPF JIT.") lost a file somewhere on its path to Linus' tree. Add back the missing ebpf_jit.c so that we can build with CONFIG_BPF_JIT selected. This version of ebpf_jit.c is identical to the original except for two minor change need to resolve conflicts with changes merged from the BPF branch: A) Set prog->jited_len = image_size; B) Use BPF_TAIL_CALL instead of BPF_CALL | BPF_X Fixes: f381bf6d ("MIPS: Add support for eBPF JIT.") Signed-off-by: David Daney <david.daney@cavium.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
-
David S. Miller authored
Daniel Borkmann says: ==================== Two BPF fixes for s390 Found while testing some other work touching JITs. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
-
Daniel Borkmann authored
The BPF feature test as well as libbpf is missing the __NR_bpf define for s390 and currently refuses to compile (selftest suite depends on libbpf as well). Similar issue was fixed some time ago via b0c47807 ("bpf: Add sparc support to tools and samples."), just do the same and add definitions. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Daniel Borkmann authored
While testing some other work that required JIT modifications, I run into test_bpf causing a hang when JIT enabled on s390. The problematic test case was the one from ddc665a4 (bpf, arm64: fix jit branch offset related to ldimm64), and turns out that we do have a similar issue on s390 as well. In bpf_jit_prog() we update next instruction address after returning from bpf_jit_insn() with an insn_count. bpf_jit_insn() returns either -1 in case of error (e.g. unsupported insn), 1 or 2. The latter is only the case for ldimm64 due to spanning 2 insns, however, next address is only set to i + 1 not taking actual insn_count into account, thus fix is to use insn_count instead of 1. bpf_jit_enable in mode 2 provides also disasm on s390: Before fix: 000003ff800349b6: a7f40003 brc 15,3ff800349bc ; target 000003ff800349ba: 0000 unknown 000003ff800349bc: e3b0f0700024 stg %r11,112(%r15) 000003ff800349c2: e3e0f0880024 stg %r14,136(%r15) 000003ff800349c8: 0db0 basr %r11,%r0 000003ff800349ca: c0ef00000000 llilf %r14,0 000003ff800349d0: e320b0360004 lg %r2,54(%r11) 000003ff800349d6: e330b03e0004 lg %r3,62(%r11) 000003ff800349dc: ec23ffeda065 clgrj %r2,%r3,10,3ff800349b6 ; jmp 000003ff800349e2: e3e0b0460004 lg %r14,70(%r11) 000003ff800349e8: e3e0b04e0004 lg %r14,78(%r11) 000003ff800349ee: b904002e lgr %r2,%r14 000003ff800349f2: e3b0f0700004 lg %r11,112(%r15) 000003ff800349f8: e3e0f0880004 lg %r14,136(%r15) 000003ff800349fe: 07fe bcr 15,%r14 After fix: 000003ff80ef3db4: a7f40003 brc 15,3ff80ef3dba 000003ff80ef3db8: 0000 unknown 000003ff80ef3dba: e3b0f0700024 stg %r11,112(%r15) 000003ff80ef3dc0: e3e0f0880024 stg %r14,136(%r15) 000003ff80ef3dc6: 0db0 basr %r11,%r0 000003ff80ef3dc8: c0ef00000000 llilf %r14,0 000003ff80ef3dce: e320b0360004 lg %r2,54(%r11) 000003ff80ef3dd4: e330b03e0004 lg %r3,62(%r11) 000003ff80ef3dda: ec230006a065 clgrj %r2,%r3,10,3ff80ef3de6 ; jmp 000003ff80ef3de0: e3e0b0460004 lg %r14,70(%r11) 000003ff80ef3de6: e3e0b04e0004 lg %r14,78(%r11) ; target 000003ff80ef3dec: b904002e lgr %r2,%r14 000003ff80ef3df0: e3b0f0700004 lg %r11,112(%r15) 000003ff80ef3df6: e3e0f0880004 lg %r14,136(%r15) 000003ff80ef3dfc: 07fe bcr 15,%r14 test_bpf.ko suite runs fine after the fix. Fixes: 05462310 ("s390/bpf: Add s390x eBPF JIT compiler backend") Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Tested-by: Michael Holzheu <holzheu@linux.vnet.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
David S. Miller authored
Jiri Pirko says: ==================== mlxsw: Couple of fixes Ido says: The first patch prevents us from warning about valid situations that can happen due to the fact that some operations in switchdev are deferred. Second patch fixes a long standing problem in which we didn't correctly free resources upon module removal, resulting in a memory leak. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
-
Ido Schimmel authored
Each multicast group (MID) stores a bitmap of ports to which a packet should be forwarded to in case an MDB entry associated with the MID is hit. Since the initial introduction of IGMP snooping in commit 3a49b4fd ("mlxsw: Adding layer 2 multicast support") the driver didn't correctly free these multicast groups upon ungraceful situations such as the removal of the upper bridge device or module removal. The correct way to fix this is to associate each MID with the bridge ports member in it and then drop the reference in case the bridge port is destroyed, but this will result in a lot more code and will be fixed in net-next. For now, upon module removal, traverse the MID list and release each one. Fixes: 3a49b4fd ("mlxsw: Adding layer 2 multicast support") Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Ido Schimmel authored
Some operations in the bridge driver such as MDB deletion are preformed in an atomic context and thus deferred to a process context by the switchdev infrastructure. Therefore, by the time the operation is performed by the underlying device driver it's possible the bridge port context is already gone. This is especially true for removal flows, but theoretically can also be invoked during addition. Remove the warnings in such situations and return normally. Fixes: c57529e1 ("mlxsw: spectrum: Replace vPorts with Port-VLAN") Fixes: 3922285d ("net: bridge: Add support for offloading port attributes") Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
- 03 Aug, 2017 7 commits
-
-
David S. Miller authored
Neal Cardwell says: ==================== tcp: fix xmit timer rearming to avoid stalls This patch series is a bug fix for a TCP loss recovery performance bug reported independently in recent netdev threads: (i) July 26, 2017: netdev thread "TCP fast retransmit issues" (ii) July 26, 2017: netdev thread: "[PATCH V2 net-next] TLP: Don't reschedule PTO when there's one outstanding TLP retransmission" Many thanks to Klavs Klavsen and Mao Wenan for the detailed reports, traces, and packetdrill test cases, which enabled us to root-cause this issue and verify the fix. - v1 -> v2: - In patch 2/3, changed an unclear comment in the pre-existing code in tcp_schedule_loss_probe() to be more clear (thanks to Eric Dumazet for suggesting we improve this). ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
-
Neal Cardwell authored
Fix a TCP loss recovery performance bug raised recently on the netdev list, in two threads: (i) July 26, 2017: netdev thread "TCP fast retransmit issues" (ii) July 26, 2017: netdev thread: "[PATCH V2 net-next] TLP: Don't reschedule PTO when there's one outstanding TLP retransmission" The basic problem is that incoming TCP packets that did not indicate forward progress could cause the xmit timer (TLP or RTO) to be rearmed and pushed back in time. In certain corner cases this could result in the following problems noted in these threads: - Repeated ACKs coming in with bogus SACKs corrupted by middleboxes could cause TCP to repeatedly schedule TLPs forever. We kept sending TLPs after every ~200ms, which elicited bogus SACKs, which caused more TLPs, ad infinitum; we never fired an RTO to fill in the holes. - Incoming data segments could, in some cases, cause us to reschedule our RTO or TLP timer further out in time, for no good reason. This could cause repeated inbound data to result in stalls in outbound data, in the presence of packet loss. This commit fixes these bugs by changing the TLP and RTO ACK processing to: (a) Only reschedule the xmit timer once per ACK. (b) Only reschedule the xmit timer if tcp_clean_rtx_queue() deems the ACK indicates sufficient forward progress (a packet was cumulatively ACKed, or we got a SACK for a packet that was sent before the most recent retransmit of the write queue head). This brings us back into closer compliance with the RFCs, since, as the comment for tcp_rearm_rto() notes, we should only restart the RTO timer after forward progress on the connection. Previously we were restarting the xmit timer even in these cases where there was no forward progress. As a side benefit, this commit simplifies and speeds up the TCP timer arming logic. We had been calling inet_csk_reset_xmit_timer() three times on normal ACKs that cumulatively acknowledged some data: 1) Once near the top of tcp_ack() to switch from TLP timer to RTO: if (icsk->icsk_pending == ICSK_TIME_LOSS_PROBE) tcp_rearm_rto(sk); 2) Once in tcp_clean_rtx_queue(), to update the RTO: if (flag & FLAG_ACKED) { tcp_rearm_rto(sk); 3) Once in tcp_ack() after tcp_fastretrans_alert() to switch from RTO to TLP: if (icsk->icsk_pending == ICSK_TIME_RETRANS) tcp_schedule_loss_probe(sk); This commit, by only rescheduling the xmit timer once per ACK, simplifies the code and reduces CPU overhead. This commit was tested in an A/B test with Google web server traffic. SNMP stats and request latency metrics were within noise levels, substantiating that for normal web traffic patterns this is a rare issue. This commit was also tested with packetdrill tests to verify that it fixes the timer behavior in the corner cases discussed in the netdev threads mentioned above. This patch is a bug fix patch intended to be queued for -stable relases. Fixes: 6ba8a3b1 ("tcp: Tail loss probe (TLP)") Reported-by: Klavs Klavsen <kl@vsen.dk> Reported-by: Mao Wenan <maowenan@huawei.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Nandita Dukkipati <nanditad@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Neal Cardwell authored
Have tcp_schedule_loss_probe() base the TLP scheduling decision based on when the RTO *should* fire. This is to enable the upcoming xmit timer fix in this series, where tcp_schedule_loss_probe() cannot assume that the last timer installed was an RTO timer (because we are no longer doing the "rearm RTO, rearm RTO, rearm TLP" dance on every ACK). So tcp_schedule_loss_probe() must independently figure out when an RTO would want to fire. In the new TLP implementation following in this series, we cannot assume that icsk_timeout was set based on an RTO; after processing a cumulative ACK the icsk_timeout we see can be from a previous TLP or RTO. So we need to independently recalculate the RTO time (instead of reading it out of icsk_timeout). Removing this dependency on the nature of icsk_timeout makes things a little easier to reason about anyway. Note that the old and new code should be equivalent, since they are both saying: "if the RTO is in the future, but at an earlier time than the normal TLP time, then set the TLP timer to fire when the RTO would have fired". Fixes: 6ba8a3b1 ("tcp: Tail loss probe (TLP)") Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Nandita Dukkipati <nanditad@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Neal Cardwell authored
Pure refactor. This helper will be required in the xmit timer fix later in the patch series. (Because the TLP logic will want to make this calculation.) Fixes: 6ba8a3b1 ("tcp: Tail loss probe (TLP)") Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Nandita Dukkipati <nanditad@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Xin Long authored
After commit c2ed1880 ("net: ipv6: check route protocol when deleting routes"), ipv6 route checks rt protocol when trying to remove a rt entry. It introduced a side effect causing 'ip -6 route flush cache' not to work well. When flushing caches with iproute, all route caches get dumped from kernel then removed one by one by sending DELROUTE requests to kernel for each cache. The thing is iproute sends the request with the cache whose proto is set with RTPROT_REDIRECT by rt6_fill_node() when kernel dumps it. But in kernel the rt_cache protocol is still 0, which causes the cache not to be matched and removed. So the real reason is rt6i_protocol in the route is not set when it is allocated. As David Ahern's suggestion, this patch is to set rt6i_protocol properly in the route when it is installed and remove the codes setting rtm_protocol according to rt6i_flags in rt6_fill_node. This is also an improvement to keep rt6i_protocol consistent with rtm_protocol. Fixes: c2ed1880 ("net: ipv6: check route protocol when deleting routes") Reported-by: Jianlin Shi <jishi@redhat.com> Suggested-by: David Ahern <dsahern@gmail.com> Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Eric Dumazet authored
syzkaller was able to trigger a divide by 0 in TCP stack [1] Issue here is that keepalive timer needs to be updated to not attempt to send a probe if the connection setup was deferred using TCP_FASTOPEN_CONNECT socket option added in linux-4.11 [1] divide error: 0000 [#1] SMP CPU: 18 PID: 0 Comm: swapper/18 Not tainted task: ffff986f62f4b040 ti: ffff986f62fa2000 task.ti: ffff986f62fa2000 RIP: 0010:[<ffffffff8409cc0d>] [<ffffffff8409cc0d>] __tcp_select_window+0x8d/0x160 Call Trace: <IRQ> [<ffffffff8409d951>] tcp_transmit_skb+0x11/0x20 [<ffffffff8409da21>] tcp_xmit_probe_skb+0xc1/0xe0 [<ffffffff840a0ee8>] tcp_write_wakeup+0x68/0x160 [<ffffffff840a151b>] tcp_keepalive_timer+0x17b/0x230 [<ffffffff83b3f799>] call_timer_fn+0x39/0xf0 [<ffffffff83b40797>] run_timer_softirq+0x1d7/0x280 [<ffffffff83a04ddb>] __do_softirq+0xcb/0x257 [<ffffffff83ae03ac>] irq_exit+0x9c/0xb0 [<ffffffff83a04c1a>] smp_apic_timer_interrupt+0x6a/0x80 [<ffffffff83a03eaf>] apic_timer_interrupt+0x7f/0x90 <EOI> [<ffffffff83fed2ea>] ? cpuidle_enter_state+0x13a/0x3b0 [<ffffffff83fed2cd>] ? cpuidle_enter_state+0x11d/0x3b0 Tested: Following packetdrill no longer crashes the kernel `echo 0 >/proc/sys/net/ipv4/tcp_timestamps` // Cache warmup: send a Fast Open cookie request 0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3 +0 fcntl(3, F_SETFL, O_RDWR|O_NONBLOCK) = 0 +0 setsockopt(3, SOL_TCP, TCP_FASTOPEN_CONNECT, [1], 4) = 0 +0 connect(3, ..., ...) = -1 EINPROGRESS (Operation is now in progress) +0 > S 0:0(0) <mss 1460,nop,nop,sackOK,nop,wscale 8,FO,nop,nop> +.01 < S. 123:123(0) ack 1 win 14600 <mss 1460,nop,nop,sackOK,nop,wscale 6,FO abcd1234,nop,nop> +0 > . 1:1(0) ack 1 +0 close(3) = 0 +0 > F. 1:1(0) ack 1 +0 < F. 1:1(0) ack 2 win 92 +0 > . 2:2(0) ack 2 +0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 4 +0 fcntl(4, F_SETFL, O_RDWR|O_NONBLOCK) = 0 +0 setsockopt(4, SOL_TCP, TCP_FASTOPEN_CONNECT, [1], 4) = 0 +0 setsockopt(4, SOL_SOCKET, SO_KEEPALIVE, [1], 4) = 0 +.01 connect(4, ..., ...) = 0 +0 setsockopt(4, SOL_TCP, TCP_KEEPIDLE, [5], 4) = 0 +10 close(4) = 0 `echo 1 >/proc/sys/net/ipv4/tcp_timestamps` Fixes: 19f6d3f3 ("net/tcp-fastopen: Add new API support") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Dmitry Vyukov <dvyukov@google.com> Cc: Wei Wang <weiwan@google.com> Cc: Yuchung Cheng <ycheng@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
git://git.open-mesh.org/linux-mergeDavid S. Miller authored
Simon Wunderlich says: ==================== Here is a batman-adv bugfix: - fix TT sync flag inconsistency problems, which can lead to excess packets, by Linus Luessing ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
-
- 02 Aug, 2017 16 commits
-
-
Yuchung Cheng authored
If the sender switches the congestion control during ECN-triggered cwnd-reduction state (CA_CWR), upon exiting recovery cwnd is set to the ssthresh value calculated by the previous congestion control. If the previous congestion control is BBR that always keep ssthresh to TCP_INIFINITE_SSTHRESH, cwnd ends up being infinite. The safe step is to avoid assigning invalid ssthresh value when recovery ends. Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Thomas Falcon authored
SCRQ resources are freed during renegotiation, but they are not re-allocated afterwards due to some changes in the initialization process. Fix that by re-allocating the memory after renegotation. SCRQ's can also be freed if a server capabilities request fails. If this were encountered during a device reset for example, SCRQ's may not be re-allocated. This operation is not necessary anymore so remove it. Signed-off-by: Thomas Falcon <tlfalcon@linux.vnet.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Hector Martin authored
Signed-off-by: Hector Martin <marcan@marcan.st> Signed-off-by: David S. Miller <davem@davemloft.net>
-
David S. Miller authored
Tariq Toukan says: ==================== mlx4 misc fixes This patchset contains misc bug fixes from the team to the mlx4 Core and Eth drivers. Patch 1 by Inbar fixes a wrong ethtool indication for Wake-on-LAN. The other 3 patches by Jack add a missing capability description, and fixes the off-by-1 misalignment for the following capabilities descriptions. Series generated against net commit: cc75f851 samples/bpf: fix bpf tunnel cleanup ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
-
Jack Morgenstein authored
The cited commit introduced the following new enum value in file include/linux/mlx4/device.h: QUERY_DEV_CAP_DIAG_RPRT_PER_PORT However, it failed to introduce a corresponding entry in function dump_dev_cap_flags2() for outputting a line in the message log when this capability bit is set. The change here fixes that omission. Fixes: c7c122ed ("net/mlx4: Add diagnostic counters capability bit") Reported-by: Mukesh Kacker <mukesh.kacker@oracle.com> Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il> Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Jack Morgenstein authored
The cited commit introduced the following new enum value in file include/linux/mlx4/device.h: MLX4_DEV_CAP_FLAG2_SVLAN_BY_QP However the value of MLX4_DEV_CAP_FLAG2_SVLAN_BY_QP needs to stay consistent with the value used in another namespace in function dump_dev_cap_flags2(), which is manually kept in sync. The change here restores that consistency. Fixes: 7c3d21c8 ("net/mlx4_core: Preparation for VF vlan protocol 802.1ad") Reported-by: Mukesh Kacker <mukesh.kacker@oracle.com> Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il> Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Jack Morgenstein authored
The index value in function dump_dev_cap_flags2() for outputting "sl to vl mapping table change event support" needs to be consistent with the value of the enumerated constant MLX4_DEV_CAP_FLAG2_SL_TO_VL_CHANGE_EVENT defined in file include/linux/mlx4_device.h The change here restores that consistency. Fixes: fd10ed8e ("IB/mlx4: Fix possible vl/sl field mismatch in LRH header in QP1 packets") Reported-by: Mukesh Kacker <mukesh.kacker@oracle.com> Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il> Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Inbar Karmy authored
Currently when WoL is supported but disabled, ethtool reports: "Supports Wake-on: d". Fix the indication of Wol support, so that the indication remains "g" all the time if the NIC supports WoL. Tested: As accepted, when NIC supports WoL- ethtool reports: Supports Wake-on: g Wake-on: d when NIC doesn't support WoL- ethtool reports: Supports Wake-on: d Wake-on: d Fixes: 14c07b13 ("mlx4: Wake on LAN support") Signed-off-by: Inbar Karmy <inbark@mellanox.com> Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
David S. Miller authored
Nisar Sayed says: ==================== lan78xx: Fixes to lan78xx driver This series of patches are for lan78xx driver. These patches fixes potential issues associated with lan78xx driver ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
-
Nisar Sayed authored
Fix to handle hard_header_len update When ifconfig up/down sequence is initiated hard_header_len get updated incrementally for each ifconfig up /down sequence, this leads invalid hard_header_len, moving to lan78xx_bind to have one time update of hard_header_len addresses the issue. Signed-off-by: Nisar Sayed <Nisar.Sayed@microchip.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Nisar Sayed authored
USB fast connect/disconnect crash fix When USB plugged/unplugged at fast rate, lan78xx_mdio_init() in lan78xx_bind() failing case is not handled. Whenever lan78xx_mdio_init() failed, dev->mdiobus will be freed, however since lan78xx_bind() not consider as error and try to proceed for further initialization in lan78xx_probe() which leads system hung/crash. Also when register_netdev() failed, netdev is freed without calling lan78xx_unbind(). Hence halting the failed cases right manner fixes the system crash/hung issue. Signed-off-by: Nisar Sayed <Nisar.Sayed@microchip.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
David S. Miller authored
Florian Fainelli says: ==================== drivers: net: Fix 64-bit statistics seqcount init This patch series fixes a bunch of drivers to have their 64-bit statistics seqcount cookie be initialized correctly. Most of these drivers (except b44, gtp) are probably used on 64-bit only hosts and so the lockdep splat might have never been seen. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
-
Florian Fainelli authored
On 32-bit hosts and with CONFIG_DEBUG_LOCK_ALLOC we should be seeing a lockdep splat indicating this seqcount is not correctly initialized, fix that by using the proper helper function: netdev_alloc_pcpu_stats(). Fixes: 2ad7bf36 ("ipvlan: Initial check-in of the IPVLAN driver.") Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Florian Fainelli authored
On 32-bit hosts and with CONFIG_DEBUG_LOCK_ALLOC we should be seeing a lockdep splat indicating this seqcount is not correctly initialized, fix that. In commit 6c80f3fc ("netvsc: report per-channel stats in ethtool statistics") netdev_alloc_pcpu_stats() was removed in favor of open-coding the 64-bits statistics, except that u64_stats_init() was missed. Fixes: 6c80f3fc ("netvsc: report per-channel stats in ethtool statistics") Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Florian Fainelli authored
On 32-bit hosts and with CONFIG_DEBUG_LOCK_ALLOC we should be seeing a lockdep splat indicating this seqcount is not correctly initialized, fix that by using netdev_alloc_pcpu_stats() instead of an open coded allocation. Fixes: 459aa660 ("gtp: add initial driver for datapath of GPRS Tunneling Protocol (GTP-U)") Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Florian Fainelli authored
On 32-bit hosts and with CONFIG_DEBUG_LOCK_ALLOC we should be seeing a lockdep splat indicating this seqcount is not correctly initialized, fix that. Fixes: 4c352362 ("net: add driver for Netronome NFP4000/NFP6000 NIC VFs") Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Reviewed-by: Simon Horman <simon.horman@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-