- 03 Oct, 2015 19 commits
-
-
Eric Dumazet authored
[ Upstream commit 675ee231 ] RST packets sent on behalf of TCP connections with TS option (RFC 7323 TCP timestamps) have incorrect TS val (set to 0), but correct TS ecr. A > B: Flags [S], seq 0, win 65535, options [mss 1000,nop,nop,TS val 100 ecr 0], length 0 B > A: Flags [S.], seq 2444755794, ack 1, win 28960, options [mss 1460,nop,nop,TS val 7264344 ecr 100], length 0 A > B: Flags [.], ack 1, win 65535, options [nop,nop,TS val 110 ecr 7264344], length 0 B > A: Flags [R.], seq 1, ack 1, win 28960, options [nop,nop,TS val 0 ecr 110], length 0 We need to call skb_mstamp_get() to get proper TS val, derived from skb->skb_mstamp Note that RFC 1323 was advocating to not send TS option in RST segment, but RFC 7323 recommends the opposite : Once TSopt has been successfully negotiated, that is both <SYN> and <SYN,ACK> contain TSopt, the TSopt MUST be sent in every non-<RST> segment for the duration of the connection, and SHOULD be sent in an <RST> segment (see Section 5.2 for details) Note this RFC recommends to send TS val = 0, but we believe it is premature : We do not know if all TCP stacks are properly handling the receive side : When an <RST> segment is received, it MUST NOT be subjected to the PAWS check by verifying an acceptable value in SEG.TSval, and information from the Timestamps option MUST NOT be used to update connection state information. SEG.TSecr MAY be used to provide stricter <RST> acceptance checks. In 5 years, if/when all TCP stack are RFC 7323 ready, we might consider to decide to send TS val = 0, if it buys something. Fixes: 7faee5c0 ("tcp: remove TCP_SKB_CB(skb)->when") Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Jesse Gross authored
[ Upstream commit ae5f2fb1 ] When support for megaflows was introduced, OVS needed to start installing flows with a mask applied to them. Since masking is an expensive operation, OVS also had an optimization that would only take the parts of the flow keys that were covered by a non-zero mask. The values stored in the remaining pieces should not matter because they are masked out. While this works fine for the purposes of matching (which must always look at the mask), serialization to netlink can be problematic. Since the flow and the mask are serialized separately, the uninitialized portions of the flow can be encoded with whatever values happen to be present. In terms of functionality, this has little effect since these fields will be masked out by definition. However, it leaks kernel memory to userspace, which is a potential security vulnerability. It is also possible that other code paths could look at the masked key and get uninitialized data, although this does not currently appear to be an issue in practice. This removes the mask optimization for flows that are being installed. This was always intended to be the case as the mask optimizations were really targetting per-packet flow operations. Fixes: 03f0d916 ("openvswitch: Mega flow implementation") Signed-off-by: Jesse Gross <jesse@nicira.com> Acked-by: Pravin B Shelar <pshelar@nicira.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Michael S. Tsirkin authored
[ Upstream commit 3ea79249 ] Upon TUNSETSNDBUF, macvtap reads the requested sndbuf size into a local variable u. commit 39ec7de7 ("macvtap: fix uninitialized access on TUNSETIFF") changed its type to u16 (which is the right thing to do for all other macvtap ioctls), breaking all values > 64k. The value of TUNSETSNDBUF is actually a signed 32 bit integer, so the right thing to do is to read it into an int. Cc: David S. Miller <davem@davemloft.net> Fixes: 39ec7de7 ("macvtap: fix uninitialized access on TUNSETIFF") Reported-by: Mark A. Peloquin Bisected-by: Matthew Rosato <mjrosato@linux.vnet.ibm.com> Reported-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Tested-by: Matthew Rosato <mjrosato@linux.vnet.ibm.com> Acked-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Eric Dumazet authored
[ Upsteam commit 4671fc6d ] When changing rss key, we do not want to overwrite user provided key by the one provided by netdev_rss_key_fill(), which is the host random key generated at boot time. Fixes: 947cbb0a ("net/mlx4_en: Support for configurable RSS hash function") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Eyal Perry <eyalpe@mellanox.com> CC: Amir Vadai <amirv@mellanox.com> Acked-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Linus Lüssing authored
[ Upstream commit c2d4fbd2 ] With the newly introduced helper functions the skb pulling is hidden in the checksumming function - and undone before returning to the caller. The IGMPv3 and MLDv2 report parsing functions in the bridge still assumed that the skb is pointing to the beginning of the IGMP/MLD message while it is now kept at the beginning of the IPv4/6 header, breaking the message parsing and creating packet loss. Fixing this by taking the offset between IP and IGMP/MLD header into account, too. Fixes: 9afd85c9 ("net: Export IGMP/MLD message validation code") Reported-by: Tobias Powalowski <tobias.powalowski@googlemail.com> Tested-by: Tobias Powalowski <tobias.powalowski@googlemail.com> Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Marcelo Ricardo Leitner authored
[ Upstream commit 8e2d61e0 ] Consider sctp module is unloaded and is being requested because an user is creating a sctp socket. During initialization, sctp will add the new protocol type and then initialize pernet subsys: status = sctp_v4_protosw_init(); if (status) goto err_protosw_init; status = sctp_v6_protosw_init(); if (status) goto err_v6_protosw_init; status = register_pernet_subsys(&sctp_net_ops); The problem is that after those calls to sctp_v{4,6}_protosw_init(), it is possible for userspace to create SCTP sockets like if the module is already fully loaded. If that happens, one of the possible effects is that we will have readers for net->sctp.local_addr_list list earlier than expected and sctp_net_init() does not take precautions while dealing with that list, leading to a potential panic but not limited to that, as sctp_sock_init() will copy a bunch of blank/partially initialized values from net->sctp. The race happens like this: CPU 0 | CPU 1 socket() | __sock_create | socket() inet_create | __sock_create list_for_each_entry_rcu( | answer, &inetsw[sock->type], | list) { | inet_create /* no hits */ | if (unlikely(err)) { | ... | request_module() | /* socket creation is blocked | * the module is fully loaded | */ | sctp_init | sctp_v4_protosw_init | inet_register_protosw | list_add_rcu(&p->list, | last_perm); | | list_for_each_entry_rcu( | answer, &inetsw[sock->type], sctp_v6_protosw_init | list) { | /* hit, so assumes protocol | * is already loaded | */ | /* socket creation continues | * before netns is initialized | */ register_pernet_subsys | Simply inverting the initialization order between register_pernet_subsys() and sctp_v4_protosw_init() is not possible because register_pernet_subsys() will create a control sctp socket, so the protocol must be already visible by then. Deferring the socket creation to a work-queue is not good specially because we loose the ability to handle its errors. So, as suggested by Vlad, the fix is to split netns initialization in two moments: defaults and control socket, so that the defaults are already loaded by when we register the protocol, while control socket initialization is kept at the same moment it is today. Fixes: 4db67e80 ("sctp: Make the address lists per network namespace") Signed-off-by: Vlad Yasevich <vyasevich@gmail.com> Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Daniel Borkmann authored
[ Upstream commit 1853c949 ] Ken-ichirou reported that running netlink in mmap mode for receive in combination with nlmon will throw a NULL pointer dereference in __kfree_skb() on nlmon_xmit(), in my case I can also trigger an "unable to handle kernel paging request". The problem is the skb_clone() in __netlink_deliver_tap_skb() for skbs that are mmaped. I.e. the cloned skb doesn't have a destructor, whereas the mmap netlink skb has it pointed to netlink_skb_destructor(), set in the handler netlink_ring_setup_skb(). There, skb->head is being set to NULL, so that in such cases, __kfree_skb() doesn't perform a skb_release_data() via skb_release_all(), where skb->head is possibly being freed through kfree(head) into slab allocator, although netlink mmap skb->head points to the mmap buffer. Similarly, the same has to be done also for large netlink skbs where the data area is vmalloced. Therefore, as discussed, make a copy for these rather rare cases for now. This fixes the issue on my and Ken-ichirou's test-cases. Reference: http://thread.gmane.org/gmane.linux.network/371129 Fixes: bcbde0d4 ("net: netlink: virtual tap device management") Reported-by: Ken-ichirou MATSUZAWA <chamaken@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Tested-by: Ken-ichirou MATSUZAWA <chamaken@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Florian Fainelli authored
[ Upstream commit 03679a14 ] The macro to write 64-bits quantities to the 32-bits register swapped the value and offsets arguments, we want to preserve the ordering of the arguments with respect to how writel() is implemented for instance: value first, offset/base second. Fixes: 246d7f77 ("net: dsa: add Broadcom SF2 switch driver") Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Reviewed-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Roopa Prabhu authored
[ Upstream commit 6b9ea5a6 ] Problem: The ecmp route replace support for ipv6 in the kernel, deletes the existing ecmp route too early, ie when it installs the first nexthop. If there is an error in installing the subsequent nexthops, its too late to recover the already deleted existing route leaving the fib in an inconsistent state. This patch reduces the possibility of this by doing the following: a) Changes the existing multipath route add code to a two stage process: build rt6_infos + insert them ip6_route_add rt6_info creation code is moved into ip6_route_info_create. b) This ensures that most errors are caught during building rt6_infos and we fail early c) Separates multipath add and del code. Because add needs the special two stage mode in a) and delete essentially does not care. d) In any event if the code fails during inserting a route again, a warning is printed (This should be unlikely) Before the patch: $ip -6 route show 3000:1000:1000:1000::2 via fe80::202:ff:fe00:b dev swp49s0 metric 1024 3000:1000:1000:1000::2 via fe80::202:ff:fe00:d dev swp49s1 metric 1024 3000:1000:1000:1000::2 via fe80::202:ff:fe00:f dev swp49s2 metric 1024 /* Try replacing the route with a duplicate nexthop */ $ip -6 route change 3000:1000:1000:1000::2/128 nexthop via fe80::202:ff:fe00:b dev swp49s0 nexthop via fe80::202:ff:fe00:d dev swp49s1 nexthop via fe80::202:ff:fe00:d dev swp49s1 RTNETLINK answers: File exists $ip -6 route show /* previously added ecmp route 3000:1000:1000:1000::2 dissappears from * kernel */ After the patch: $ip -6 route show 3000:1000:1000:1000::2 via fe80::202:ff:fe00:b dev swp49s0 metric 1024 3000:1000:1000:1000::2 via fe80::202:ff:fe00:d dev swp49s1 metric 1024 3000:1000:1000:1000::2 via fe80::202:ff:fe00:f dev swp49s2 metric 1024 /* Try replacing the route with a duplicate nexthop */ $ip -6 route change 3000:1000:1000:1000::2/128 nexthop via fe80::202:ff:fe00:b dev swp49s0 nexthop via fe80::202:ff:fe00:d dev swp49s1 nexthop via fe80::202:ff:fe00:d dev swp49s1 RTNETLINK answers: File exists $ip -6 route show 3000:1000:1000:1000::2 via fe80::202:ff:fe00:b dev swp49s0 metric 1024 3000:1000:1000:1000::2 via fe80::202:ff:fe00:d dev swp49s1 metric 1024 3000:1000:1000:1000::2 via fe80::202:ff:fe00:f dev swp49s2 metric 1024 Fixes: 27596472 ("ipv6: fix ECMP route replacement") Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com> Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Florian Fainelli authored
[ Upstream commit 39797a27 ] The comparison check between cur_hw_state and hw_state is currently invalid because cur_hw_state is right shifted by G_MISTP_SHIFT, while hw_state is not, so we end-up comparing bits 2:0 with bits 7:5, which is going to cause an additional aging to occur. Fix this by not shifting cur_hw_state while reading it, but instead, mask the value with the appropriately shitfted bitmask. The other problem with the fast-ageing process is that we did not set the EN_AGE_DYNAMIC bit to request the ageing to occur for dynamically learned MAC addresses. Finally, write back 0 to the FAST_AGE_CTRL register to avoid leaving spurious bits sets from one operation to the other. Fixes: 12f460f2 ("net: dsa: bcm_sf2: add HW bridging support") Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Richard Laing authored
[ Upstream commit 25b4a44c ] In the IPv6 multicast routing code the mrt_lock was not being released correctly in the MFC iterator, as a result adding or deleting a MIF would cause a hang because the mrt_lock could not be acquired. This fix is a copy of the code for the IPv4 case and ensures that the lock is released correctly. Signed-off-by: Richard Laing <richard.laing@alliedtelesis.co.nz> Acked-by: Cong Wang <cwang@twopensource.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Atsushi Nemoto authored
[ Upstream commit 4548a697 ] tse_poll() calls __napi_complete() with irq enabled. This leads napi poll_list corruption and may stop all napi drivers working. Use napi_complete() instead of __napi_complete(). Signed-off-by: Atsushi Nemoto <nemoto@toshiba-tops.co.jp> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Russell King authored
[ Upstream commit ed63f1dc ] The patch just to re-submit the patch "db3421c1" because the patch "4d494cdc" remove the change. Clear any pending receive interrupt before we process a pending packet. This helps to avoid any spurious interrupts being raised after we have fully cleaned the receive ring, while still allowing an interrupt to be raised if we receive another packet. The position of this is critical: we must do this prior to reading the next packet status to avoid potentially dropping an interrupt when a packet is still pending. Acked-by: Fugang Duan <B38611@freescale.com> Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Daniel Borkmann authored
[ Upstream commit e41b0bed ] We previously register IPPROTO_ROUTING offload under inet6_add_offload(), but in error path, we try to unregister it with inet_del_offload(). This doesn't seem correct, it should actually be inet6_del_offload(), also ipv6_exthdrs_offload_exit() from that commit seems rather incorrect (it also uses rthdr_offload twice), but it got removed entirely later on. Fixes: 3336288a ("ipv6: Switch to using new offload infrastructure.") Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Daniel Borkmann authored
[ Upstream commit b382c086 ] diag socket's sock_diag_put_filterinfo() dumps classic BPF programs upon request to user space (ss -0 -b). However, native eBPF programs attached to sockets (SO_ATTACH_BPF) cannot be dumped with this method: Their orig_prog is always NULL. However, sock_diag_put_filterinfo() unconditionally tries to access its filter length resp. wants to copy the filter insns from there. Internal cBPF to eBPF transformations attached to sockets don't have this issue, as orig_prog state is kept. It's currently only used by packet sockets. If we would want to add native eBPF support in the future, this needs to be done through a different attribute than PACKET_DIAG_FILTER to not confuse possible user space disassemblers that work on diag data. Fixes: 89aa0758 ("net: sock: allow eBPF programs to be attached to sockets") Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Acked-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Eugene Shatokhin authored
[ Upstream commit f50791ac ] It is needed to check EVENT_NO_RUNTIME_PM bit of dev->flags in usbnet_stop(), but its value should be read before it is cleared when dev->flags is set to 0. The problem was spotted and the fix was provided by Oliver Neukum <oneukum@suse.de>. Signed-off-by: Eugene Shatokhin <eugene.shatokhin@rosalab.ru> Acked-by: Oliver Neukum <oneukum@suse.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
WANG Cong authored
[ Upstream commit a6c1aea0 ] In commit 1e052be6 ("net_sched: destroy proto tp when all filters are gone") I added a check in u32_destroy() to see if all real filters are gone for each tp, however, that is only done for root_ht, same is needed for others. This can be reproduced by the following tc commands: tc filter add dev eth0 parent 1:0 prio 5 handle 15: protocol ip u32 divisor 256 tc filter add dev eth0 protocol ip parent 1: prio 5 handle 15:2:2 u32 ht 15:2: match ip src 10.0.0.2 flowid 1:10 tc filter add dev eth0 protocol ip parent 1: prio 5 handle 15:2:3 u32 ht 15:2: match ip src 10.0.0.3 flowid 1:10 Fixes: 1e052be6 ("net_sched: destroy proto tp when all filters are gone") Reported-by: Akshat Kakkar <akshat.1984@gmail.com> Cc: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: Cong Wang <cwang@twopensource.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Marcelo Ricardo Leitner authored
[ Upstream commit bef0057b ] Before 56ef9c90[1] it used to ignore all errors from igmp_join(). That commit enhanced that and made it error out whatever error happened with igmp_join(), but that's not good because when using multicast groups vxlan will try to join it multiple times if the socket is reused and then the 2nd and further attempts will fail with EADDRINUSE. As we don't track to which groups the socket is already subscribed, it's okay to just ignore that error. Fixes: 56ef9c90 ("vxlan: Move socket initialization to within rtnl scope") Reported-by: John Nielsen <lists@jnielsen.net> Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
huaibin Wang authored
[ Upstream commit d4257295 ] When a tunnel is deleted, the cached dst entry should be released. This problem may prevent the removal of a netns (seen with a x-netns IPv6 gre tunnel): unregister_netdevice: waiting for lo to become free. Usage count = 3 CC: Dmitry Kozlov <xeb@mail.ru> Fixes: c12b395a ("gre: Support GRE over IPv6") Signed-off-by: huaibin Wang <huaibin.wang@6wind.com> Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
- 29 Sep, 2015 21 commits
-
-
Greg Kroah-Hartman authored
-
Daniel Axtens authored
commit 4e1efb40 upstream. If the driver doesn't participate in EEH, the AFUs will be removed by cxl_remove, which will be invoked by EEH. If the driver does particpate in EEH, the vPHB needs to stick around so that the it can particpate. In both cases, we shouldn't remove the AFU/vPHB. Reviewed-by: Cyril Bur <cyrilbur@gmail.com> Signed-off-by: Daniel Axtens <dja@axtens.net> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Reported-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Sudip Mukherjee <sudip@vectorindia.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Andy Whitcroft authored
[ Upstream commit 25b97c01 ] When generating /proc/net/route we emit a header followed by a line for each route. When a short read is performed we will restart this process based on the open file descriptor. When calculating the start point we fail to take into account that the 0th entry is the header. This leads us to skip the first entry when doing a continuation read. This can be easily seen with the comparison below: while read l; do echo "$l"; done </proc/net/route >A cat /proc/net/route >B diff -bu A B | grep '^[+-]' On my example machine I have approximatly 10KB of route output. There we see the very first non-title element is lost in the while read case, and an entry around the 8K mark in the cat case: +wlan0 00000000 02021EAC 0003 0 0 400 00000000 0 0 0 -tun1 00C0AC0A 00000000 0001 0 0 950 00C0FFFF 0 0 0 Fix up the off-by-one when reaquiring position on continuation. Fixes: 8be33e95 ("fib_trie: Fib walk rcu should take a tnode and key instead of a trie and a leaf") BugLink: http://bugs.launchpad.net/bugs/1483440Acked-by: Alexander Duyck <alexander.h.duyck@redhat.com> Signed-off-by: Andy Whitcroft <apw@canonical.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Florian Fainelli authored
[ Upstream commit 211c504a ] In case we need to divert reads/writes using the slave MII bus, we may have already fetched a valid PHY interface property from Device Tree, and that mode is used by the PHY driver to make configuration decisions. If we could not fetch the "phy-mode" property, we will assign p->phy_interface to PHY_INTERFACE_MODE_NA, such that we can actually check for that condition as to whether or not we should override the interface value. Fixes: 19334920 ("net: dsa: Set valid phy interface type") Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Eric Dumazet authored
[ Upstream commit 2235f2ac ] reqsk_queue_destroy() and reqsk_queue_unlink() should use del_timer_sync() instead of del_timer() before calling reqsk_put(), otherwise we could free a req still used by another cpu. But before doing so, reqsk_queue_destroy() must release syn_wait_lock spinlock or risk a dead lock, as reqsk_timer_handler() might need to take this same spinlock from reqsk_queue_unlink() (called from inet_csk_reqsk_queue_drop()) Fixes: fa76ce73 ("inet: get rid of central tcp/dccp listener timer") Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Eric Dumazet authored
[ Upstream commit 3257d8b1 ] In commit b357a364 ("inet: fix possible panic in reqsk_queue_unlink()"), I missed fact that tcp_check_req() can return the listener socket in one case, and that we must release the request socket refcount or we leak it. Tested: Following packetdrill test template shows the issue 0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3 +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0 +0 bind(3, ..., ...) = 0 +0 listen(3, 1) = 0 +0 < S 0:0(0) win 2920 <mss 1460,sackOK,nop,nop> +0 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK> +.002 < . 1:1(0) ack 21 win 2920 +0 > R 21:21(0) Fixes: b357a364 ("inet: fix possible panic in reqsk_queue_unlink()") Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Daniel Borkmann authored
[ Upstream commit 4e7c1330 ] Linus reports the following deadlock on rtnl_mutex; triggered only once so far (extract): [12236.694209] NetworkManager D 0000000000013b80 0 1047 1 0x00000000 [12236.694218] ffff88003f902640 0000000000000000 ffffffff815d15a9 0000000000000018 [12236.694224] ffff880119538000 ffff88003f902640 ffffffff81a8ff84 00000000ffffffff [12236.694230] ffffffff81a8ff88 ffff880119c47f00 ffffffff815d133a ffffffff81a8ff80 [12236.694235] Call Trace: [12236.694250] [<ffffffff815d15a9>] ? schedule_preempt_disabled+0x9/0x10 [12236.694257] [<ffffffff815d133a>] ? schedule+0x2a/0x70 [12236.694263] [<ffffffff815d15a9>] ? schedule_preempt_disabled+0x9/0x10 [12236.694271] [<ffffffff815d2c3f>] ? __mutex_lock_slowpath+0x7f/0xf0 [12236.694280] [<ffffffff815d2cc6>] ? mutex_lock+0x16/0x30 [12236.694291] [<ffffffff814f1f90>] ? rtnetlink_rcv+0x10/0x30 [12236.694299] [<ffffffff8150ce3b>] ? netlink_unicast+0xfb/0x180 [12236.694309] [<ffffffff814f5ad3>] ? rtnl_getlink+0x113/0x190 [12236.694319] [<ffffffff814f202a>] ? rtnetlink_rcv_msg+0x7a/0x210 [12236.694331] [<ffffffff8124565c>] ? sock_has_perm+0x5c/0x70 [12236.694339] [<ffffffff814f1fb0>] ? rtnetlink_rcv+0x30/0x30 [12236.694346] [<ffffffff8150d62c>] ? netlink_rcv_skb+0x9c/0xc0 [12236.694354] [<ffffffff814f1f9f>] ? rtnetlink_rcv+0x1f/0x30 [12236.694360] [<ffffffff8150ce3b>] ? netlink_unicast+0xfb/0x180 [12236.694367] [<ffffffff8150d344>] ? netlink_sendmsg+0x484/0x5d0 [12236.694376] [<ffffffff810a236f>] ? __wake_up+0x2f/0x50 [12236.694387] [<ffffffff814cad23>] ? sock_sendmsg+0x33/0x40 [12236.694396] [<ffffffff814cb05e>] ? ___sys_sendmsg+0x22e/0x240 [12236.694405] [<ffffffff814cab75>] ? ___sys_recvmsg+0x135/0x1a0 [12236.694415] [<ffffffff811a9d12>] ? eventfd_write+0x82/0x210 [12236.694423] [<ffffffff811a0f9e>] ? fsnotify+0x32e/0x4c0 [12236.694429] [<ffffffff8108cb70>] ? wake_up_q+0x60/0x60 [12236.694434] [<ffffffff814cba09>] ? __sys_sendmsg+0x39/0x70 [12236.694440] [<ffffffff815d4797>] ? entry_SYSCALL_64_fastpath+0x12/0x6a It seems so far plausible that the recursive call into rtnetlink_rcv() looks suspicious. One way, where this could trigger is that the senders NETLINK_CB(skb).portid was wrongly 0 (which is rtnetlink socket), so the rtnl_getlink() request's answer would be sent to the kernel instead to the actual user process, thus grabbing rtnl_mutex() twice. One theory would be that netlink_autobind() triggered via netlink_sendmsg() internally overwrites the -EBUSY error to 0, but where it is wrongly originating from __netlink_insert() instead. That would reset the socket's portid to 0, which is then filled into NETLINK_CB(skb).portid later on. As commit d470e3b4 ("[NETLINK]: Fix two socket hashing bugs.") also puts it, -EBUSY should not be propagated from netlink_insert(). It looks like it's very unlikely to reproduce. We need to trigger the rhashtable_insert_rehash() handler under a situation where rehashing currently occurs (one /rare/ way would be to hit ht->elasticity limits while not filled enough to expand the hashtable, but that would rather require a specifically crafted bind() sequence with knowledge about destination slots, seems unlikely). It probably makes sense to guard __netlink_insert() in any case and remap that error. It was suggested that EOVERFLOW might be better than an already overloaded ENOMEM. Reference: http://thread.gmane.org/gmane.linux.network/372676Reported-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Acked-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Ivan Vecera authored
[ Upstream commit ade4dc3e ] The commit "e29aa339 bna: Enable Multi Buffer RX" moved packets counter increment from the beginning of the NAPI processing loop after the check for erroneous packets so they are never accounted. This counter is used to inform firmware about number of processed completions (packets). As these packets are never acked the firmware fires IRQs for them again and again. Fixes: e29aa339 ("bna: Enable Multi Buffer RX") Signed-off-by: Ivan Vecera <ivecera@redhat.com> Acked-by: Rasesh Mody <rasesh.mody@qlogic.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Nikolay Aleksandrov authored
[ Upstream commit 786c2077 ] The attribute size wasn't accounted for in the get_slave_size() callback (br_port_get_slave_size) when it was introduced, so fix it now. Also add a policy entry for it in br_port_policy. Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Fixes: 842a9ae0 ("bridge: Extend Proxy ARP design to allow optional rules for Wi-Fi") Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Nikolay Aleksandrov authored
[ Upstream commit 355b9f9d ] The attribute size wasn't accounted for in the get_slave_size() callback (br_port_get_slave_size) when it was introduced, so fix it now. Also add a policy entry for it in br_port_policy. Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Fixes: 95850116 ("bridge: Add support for IEEE 802.11 Proxy ARP") Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Eric Dumazet authored
[ Upstream commit 10e2eb87 ] Multicast dst are not cached. They carry DST_NOCACHE. As mentioned in commit f8864972 ("ipv4: fix dst race in sk_dst_get()"), these dst need special care before caching them into a socket. Caching them is allowed only if their refcnt was not 0, ie we must use atomic_inc_not_zero() Also, we must use READ_ONCE() to fetch sk->sk_rx_dst, as mentioned in commit d0c294c5 ("tcp: prevent fetching dst twice in early demux code") Fixes: 421b3885 ("udp: ipv4: Add udp early demux") Tested-by: Gregory Hoggarth <Gregory.Hoggarth@alliedtelesis.co.nz> Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Gregory Hoggarth <Gregory.Hoggarth@alliedtelesis.co.nz> Reported-by: Alex Gartrell <agartrell@fb.com> Cc: Michal Kubeček <mkubecek@suse.cz> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Dan Carpenter authored
[ Upstream commit 468b732b ] "len" is a signed integer. We check that len is not negative, so it goes from zero to INT_MAX. PAGE_SIZE is unsigned long so the comparison is type promoted to unsigned long. ULONG_MAX - 4095 is a higher than INT_MAX so the condition can never be true. I don't know if this is harmful but it seems safe to limit "len" to INT_MAX - 4095. Fixes: a8c879a7 ('RDS: Info and stats') Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Ido Schimmel authored
[ Upstream commit 1ebd47ef ] When removing a port's netdevice in 'rocker_remove_ports', we should also free the allocated 'net_device' structure. Do that by calling 'free_netdev' after unregistering it. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: Jiri Pirko <jiri@resnulli.us> Fixes: 4b8ac966 ("rocker: introduce rocker switch driver") Acked-by: Scott Feldman <sfeldma@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Daniel Borkmann authored
[ Upstream commit 28e6b67f ] Since commit 55334a5d ("net_sched: act: refuse to remove bound action outside"), we end up with a wrong reference count for a tc action. Test case 1: FOO="1,6 0 0 4294967295," BAR="1,6 0 0 4294967294," tc filter add dev foo parent 1: bpf bytecode "$FOO" flowid 1:1 \ action bpf bytecode "$FOO" tc actions show action bpf action order 0: bpf bytecode '1,6 0 0 4294967295' default-action pipe index 1 ref 1 bind 1 tc actions replace action bpf bytecode "$BAR" index 1 tc actions show action bpf action order 0: bpf bytecode '1,6 0 0 4294967294' default-action pipe index 1 ref 2 bind 1 tc actions replace action bpf bytecode "$FOO" index 1 tc actions show action bpf action order 0: bpf bytecode '1,6 0 0 4294967295' default-action pipe index 1 ref 3 bind 1 Test case 2: FOO="1,6 0 0 4294967295," tc filter add dev foo parent 1: bpf bytecode "$FOO" flowid 1:1 action ok tc actions show action gact action order 0: gact action pass random type none pass val 0 index 1 ref 1 bind 1 tc actions add action drop index 1 RTNETLINK answers: File exists [...] tc actions show action gact action order 0: gact action pass random type none pass val 0 index 1 ref 2 bind 1 tc actions add action drop index 1 RTNETLINK answers: File exists [...] tc actions show action gact action order 0: gact action pass random type none pass val 0 index 1 ref 3 bind 1 What happens is that in tcf_hash_check(), we check tcf_common for a given index and increase tcfc_refcnt and conditionally tcfc_bindcnt when we've found an existing action. Now there are the following cases: 1) We do a late binding of an action. In that case, we leave the tcfc_refcnt/tcfc_bindcnt increased and are done with the ->init() handler. This is correctly handeled. 2) We replace the given action, or we try to add one without replacing and find out that the action at a specific index already exists (thus, we go out with error in that case). In case of 2), we have to undo the reference count increase from tcf_hash_check() in the tcf_hash_check() function. Currently, we fail to do so because of the 'tcfc_bindcnt > 0' check which bails out early with an -EPERM error. Now, while commit 55334a5d prevents 'tc actions del action ...' on an already classifier-bound action to drop the reference count (which could then become negative, wrap around etc), this restriction only accounts for invocations outside a specific action's ->init() handler. One possible solution would be to add a flag thus we possibly trigger the -EPERM ony in situations where it is indeed relevant. After the patch, above test cases have correct reference count again. Fixes: 55334a5d ("net_sched: act: refuse to remove bound action outside") Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Cong Wang <cwang@twopensource.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Daniel Borkmann authored
[ Upstream commit f4eaed28 ] We currently trigger multiple memory leaks when replacing bpf actions, besides others: comm "tc", pid 1909, jiffies 4294851310 (age 1602.796s) hex dump (first 32 bytes): 01 00 00 00 03 00 00 00 00 00 00 00 00 00 00 00 ................ 18 b0 98 6d 00 88 ff ff 00 00 00 00 00 00 00 00 ...m............ backtrace: [<ffffffff817e623e>] kmemleak_alloc+0x4e/0xb0 [<ffffffff8120a22d>] __vmalloc_node_range+0x1bd/0x2c0 [<ffffffff8120a37a>] __vmalloc+0x4a/0x50 [<ffffffff811a8d0a>] bpf_prog_alloc+0x3a/0xa0 [<ffffffff816c0684>] bpf_prog_create+0x44/0xa0 [<ffffffffa09ba4eb>] tcf_bpf_init+0x28b/0x3c0 [act_bpf] [<ffffffff816d7001>] tcf_action_init_1+0x191/0x1b0 [<ffffffff816d70a2>] tcf_action_init+0x82/0xf0 [<ffffffff816d4d12>] tcf_exts_validate+0xb2/0xc0 [<ffffffffa09b5838>] cls_bpf_modify_existing+0x98/0x340 [cls_bpf] [<ffffffffa09b5cd6>] cls_bpf_change+0x1a6/0x274 [cls_bpf] [<ffffffff816d56e5>] tc_ctl_tfilter+0x335/0x910 [<ffffffff816b9145>] rtnetlink_rcv_msg+0x95/0x240 [<ffffffff816df34f>] netlink_rcv_skb+0xaf/0xc0 [<ffffffff816b909e>] rtnetlink_rcv+0x2e/0x40 [<ffffffff816deaaf>] netlink_unicast+0xef/0x1b0 Issue is that the old content from tcf_bpf is allocated and needs to be released when we replace it. We seem to do that since the beginning of act_bpf on the filter and insns, later on the name as well. Example test case, after patch: # FOO="1,6 0 0 4294967295," # BAR="1,6 0 0 4294967294," # tc actions add action bpf bytecode "$FOO" index 2 # tc actions show action bpf action order 0: bpf bytecode '1,6 0 0 4294967295' default-action pipe index 2 ref 1 bind 0 # tc actions replace action bpf bytecode "$BAR" index 2 # tc actions show action bpf action order 0: bpf bytecode '1,6 0 0 4294967294' default-action pipe index 2 ref 1 bind 0 # tc actions replace action bpf bytecode "$FOO" index 2 # tc actions show action bpf action order 0: bpf bytecode '1,6 0 0 4294967295' default-action pipe index 2 ref 1 bind 0 # tc actions del action bpf index 2 [...] # echo "scan" > /sys/kernel/debug/kmemleak # cat /sys/kernel/debug/kmemleak | grep "comm \"tc\"" | wc -l 0 Fixes: d23b8ad8 ("tc: add BPF based action") Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Alexander Drozdov authored
[ Upstream commit dbd46ab4 ] tpacket_fill_skb() can return a negative value (-errno) which is stored in tp_len variable. In that case the following condition will be (but shouldn't be) true: tp_len > dev->mtu + dev->hard_header_len as dev->mtu and dev->hard_header_len are both unsigned. That may lead to just returning an incorrect EMSGSIZE errno to the user. Fixes: 52f1454f ("packet: allow to transmit +4 byte in TX_RING slot for VLAN case") Signed-off-by: Alexander Drozdov <al.drozdov@gmail.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Lars Westerhoff authored
[ Upstream commit 158cd4af ] When binding a PF_PACKET socket, the use count of the bound interface is always increased with dev_hold in dev_get_by_{index,name}. However, when rebound with the same protocol and device as in the previous bind the use count of the interface was not decreased. Ultimately, this caused the deletion of the interface to fail with the following message: unregister_netdevice: waiting for dummy0 to become free. Usage count = 1 This patch moves the dev_put out of the conditional part that was only executed when either the protocol or device changed on a bind. Fixes: 902fefb8 ('packet: improve socket create/bind latency in some cases') Signed-off-by: Lars Westerhoff <lars.westerhoff@newtec.eu> Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: Daniel Borkmann <dborkman@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Alexander Duyck authored
[ Upstream commit 1513069e ] It was reported that update_suffix was taking a long time on systems where a large number of leaves were attached to a single node. As it turns out fib_table_flush was calling update_suffix for each leaf that didn't have all of the aliases stripped from it. As a result, on this large node removing one leaf would result in us calling update_suffix for every other leaf on the node. The fix is to just remove the calls to leaf_pull_suffix since they are redundant as we already have a call in resize that will go through and update the suffix length for the node before we exit out of fib_table_flush or fib_table_flush_external. Reported-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com> Tested-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Jack Morgenstein authored
[ Upstream commit 1c1bf349 ] The port-change event processing in procedure mlx4_eq_int() uses "slave" as the vf_oper array index. Since the value of "slave" is the PF function index, the result is that the PF link state is used for deciding to propagate the event for all the VFs. The VF link state should be used, so the VF function index should be used here. Fixes: 948e306d ('net/mlx4: Add VF link state support') Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il> Signed-off-by: Matan Barak <matanb@mellanox.com> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Nikolay Aleksandrov authored
[ Upstream commit 963ad948 ] Since slave_changelink support was added there have been a few race conditions when using br_setport() since some of the port functions it uses require the bridge lock. It is very easy to trigger a lockup due to some internal spin_lock() usage without bh disabled, also it's possible to get the bridge into an inconsistent state. Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Fixes: 3ac636b8 ("bridge: implement rtnl_link_ops->slave_changelink") Reviewed-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Michael S. Tsirkin authored
[ Upstream commit 75993300 ] ANY_LAYOUT is a compatibility feature. It's implied for VERSION_1 devices, and non-transitional devices might not offer it. Change code to behave accordingly. Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-