1. 13 Aug, 2017 16 commits
    • Linux 4.4.82 · 4e2e415f
      Greg Kroah-Hartman authored
      4e2e415f
    • net: account for current skb length when deciding about UFO · fab61468
      Michal Kubeček authored
      commit a5cb659b upstream.
      
      Our customer encountered stuck NFS writes for blocks starting at specific
      offsets relative to the page boundary, caused by the networking stack
      sending packets with a wrong checksum via a UFO-enabled device. The
      problem can be reproduced by composing a long UDP datagram from multiple
      parts using the MSG_MORE flag:
      
        sendto(sd, buff, 1000, MSG_MORE, ...);
        sendto(sd, buff, 1000, MSG_MORE, ...);
        sendto(sd, buff, 3000, 0, ...);
      
      Assume this packet is to be routed via a device with MTU 1500 and
      NETIF_F_UFO enabled. When the second sendto() gets into
      __ip_append_data(), this condition is tested (among others) to decide
      whether to call ip_ufo_append_data():
      
        ((length + fragheaderlen) > mtu) || (skb && skb_is_gso(skb))
      
      At that point we already have an skb with 1028 bytes of data which is not
      marked for GSO, so the test is false (fragheaderlen is usually 20). Thus
      we append the second 1000 bytes to this skb without invoking UFO. The
      third sendto(), however, has sufficient length to trigger the UFO path,
      so we end up with a non-UFO skb followed by a UFO one. Later on,
      udp_send_skb() uses udp_csum() to calculate the checksum, but that
      assumes all fragments have a correct checksum in skb->csum, which is not
      true for UFO fragments.
      
      When checking against the MTU, we need to add skb->len to the length of
      the new segment if we already have a partially filled skb, and
      fragheaderlen only if there isn't one.
      
      In the IPv6 case, skb can only be NULL if this is the first segment, so
      we have to use headersize (length of the first IPv6 header) rather than
      fragheaderlen (length of the IPv6 header of further fragments) when
      skb == NULL.
      
      Fixes: e89e9cf5 ("[IPv4/IPv6]: UFO Scatter-gather approach")
      Fixes: e4c5e13a ("ipv6: Should use consistent conditional judgement for ip6 fragment between __ip6_append_data and ip6_finish_output")
      Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
      Acked-by: Vlad Yasevich <vyasevic@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      fab61468
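
      To illustrate the corrected check described in the commit above, here is
      a minimal stand-alone sketch in C. The type and function names are
      simplified stand-ins chosen for this illustration, not the kernel's
      actual structures or the exact upstream diff:

        #include <stdbool.h>
        #include <stddef.h>

        struct queued_skb {              /* stand-in for the partially filled skb */
                size_t len;              /* bytes already queued in it */
                bool   gso;              /* already marked for GSO/UFO */
        };

        /* Should the new chunk of 'length' bytes take the UFO path? */
        static bool take_ufo_path(const struct queued_skb *skb, size_t length,
                                  size_t fragheaderlen, size_t mtu)
        {
                /* Count data already held in a partially filled skb; add the
                 * fragment header length only when starting a fresh skb. */
                size_t total = length + (skb ? skb->len : fragheaderlen);

                return total > mtu || (skb && skb->gso);
        }
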
    • ipv4: Should use consistent conditional judgement for ip fragment in __ip_append_data and ip_finish_output · 96cdeaa3
      zheng li authored
      
      commit 0a28cfd5 upstream.
      
      There is an inconsistent conditional judgement between the
      __ip_append_data and ip_finish_output functions: the variable length in
      __ip_append_data only includes the length of the application's payload
      and the UDP header and does not include the IP header length, while
      ip_finish_output uses (skb->len > ip_skb_dst_mtu(skb)) as its test, and
      skb->len does include the IP header length.
      
      As a result, UDP payloads whose length falls between (MTU - IP header)
      and MTU are fragmented by ip_fragment even though the route's output
      device supports the UFO feature.
      
      Add the IP header length to length in __ip_append_data to keep its
      fragmentation check consistent with ip_finish_output.
      Signed-off-by: Zheng Li <james.z.li@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      
      96cdeaa3
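
      A minimal stand-alone model of the mismatch described above, with
      hypothetical function names (the real checks live in __ip_append_data
      and ip_finish_output):

        #include <stdbool.h>
        #include <stddef.h>

        /* Output path: skb->len already includes the IP header. */
        static bool output_path_fragments(size_t skb_len, size_t mtu)
        {
                return skb_len > mtu;
        }

        /* Append path: 'length' covers only the UDP header plus payload, so
         * the IP header length (fragheaderlen) must be added to make this
         * consistent with the check above. */
        static bool append_path_fragments(size_t length, size_t fragheaderlen,
                                          size_t mtu)
        {
                return length + fragheaderlen > mtu;
        }
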
    • mm/mempool: avoid KASAN marking mempool poison checks as use-after-free · d45aabad
      Matthew Dawson authored
      commit 76401310 upstream.
      
      When removing an element from the mempool, mark it as unpoisoned in KASAN
      before verifying its contents for SLUB/SLAB debugging. Otherwise KASAN
      flags the reads that verify the element's use-after-free poison pattern
      as use-after-free reads.
      
      Signed-off-by: Matthew Dawson <matthew@mjdsystems.ca>
      Acked-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrii Bordunov <aborduno@cisco.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      d45aabad
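
      A hedged sketch of the ordering change described above. It is close in
      spirit to mm/mempool.c but simplified and reproduced from memory, so it
      should not be read as the exact diff:

        static void *remove_element(mempool_t *pool)
        {
                void *element = pool->elements[--pool->curr_nr];

                BUG_ON(pool->curr_nr < 0);
                /* First tell KASAN the object is live again ... */
                kasan_unpoison_element(pool, element);
                /* ... then let the SLUB/SLAB debug check read its poison
                 * pattern without triggering a use-after-free report. */
                check_element(pool, element);
                return element;
        }
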
    • KVM: arm/arm64: Handle hva aging while destroying the vm · 7e86f2d5
      Suzuki K Poulose authored
      commit 7e5a6722 upstream.
      
      The mmu_notifier_release() callback of KVM triggers cleanup of the stage2
      page table on kvm-arm. However, other notifier callbacks can run in
      parallel with mmu_notifier_release(), which can leave those callbacks
      operating on an already empty stage2 page table. Make sure we check for
      this in all the notifier callbacks.
      
      Fixes: commit 293f2936 ("kvm-arm: Unmap shadow pagetables properly")
      Reported-by: Alex Graf <agraf@suse.de>
      Reviewed-by: Christoffer Dall <cdall@linaro.org>
      Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
      Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      
      7e86f2d5
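
      A heavily hedged sketch of the guard described above. The function shape
      and name are hypothetical; only the idea of bailing out when the stage2
      table is gone comes from the commit message:

        /* Common entry point for the hva-range notifier callbacks (sketch). */
        static int handle_hva_range(struct kvm *kvm, unsigned long start,
                                    unsigned long end, void *data)
        {
                if (!kvm->arch.pgd)     /* stage2 tables already torn down */
                        return 0;       /* nothing to age, test or unmap */

                /* ... walk the memslots overlapping [start, end) ... */
                return 0;
        }
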
    • sparc64: Prevent perf from running during super critical sections · 6fe71ca3
      Rob Gardner authored
      commit fc290a11 upstream.
      
      This fixes another cause of random segfaults and bus errors that may
      occur while running perf with the callgraph option.
      
      Critical sections beginning with spin_lock_irqsave() raise the interrupt
      level to PIL_NORMAL_MAX (14) and intentionally do not block performance
      counter interrupts, which arrive at PIL_NMI (15).
      
      But some sections of code are "super critical" with respect to perf
      because the perf_callchain_user() path accesses user space and may cause
      TLB activity as well as faults as it unwinds the user stack.
      
      One particular critical section occurs in switch_mm:
      
              spin_lock_irqsave(&mm->context.lock, flags);
              ...
              load_secondary_context(mm);
              tsb_context_switch(mm);
              ...
              spin_unlock_irqrestore(&mm->context.lock, flags);
      
      If a perf interrupt arrives in between load_secondary_context() and
      tsb_context_switch(), then perf_callchain_user() could execute with
      the context ID of one process, but with an active TSB for a different
      process. When the user stack is accessed, it is very likely to
      incur a TLB miss, since the h/w context ID has been changed. The TLB
      will then be reloaded with a translation from the TSB for one process,
      but using a context ID for another process. This exposes memory from
      one process to another, and since it is a mapping for stack memory,
      this usually causes the new process to crash quickly.
      
      This super critical section needs more protection than is provided
      by spin_lock_irqsave() since perf interrupts must not be allowed in.
      
      Since __tsb_context_switch already goes through the trouble of
      disabling interrupts completely, we fix this by moving the secondary
      context load down into this better protected region.
      
      Orabug: 25577560
      Signed-off-by: Dave Aldridge <david.j.aldridge@oracle.com>
      Signed-off-by: Rob Gardner <rob.gardner@oracle.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      6fe71ca3
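
      A hedged sketch of the reordering described above. The combined helper
      name is hypothetical and the real change sits in the sparc64
      switch_mm()/TSB code, so treat this only as an illustration of the idea:

        static void switch_mm_sketch(struct mm_struct *mm)
        {
                unsigned long flags;

                spin_lock_irqsave(&mm->context.lock, flags);
                /* Previously: load_secondary_context(mm); here, leaving a
                 * window where a PIL 15 perf NMI could see the new context ID
                 * while the old process's TSB is still active. */
                tsb_context_switch_ctx(mm, CTX_HWBITS(mm->context));
                /* ^ hypothetical combined helper: loads the secondary context
                 *   inside the region where the low-level TSB switch already
                 *   has all interrupts disabled. */
                spin_unlock_irqrestore(&mm->context.lock, flags);
        }
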
    • udp: consistently apply ufo or fragmentation · 938990d2
      Willem de Bruijn authored
      
      [ Upstream commit 85f1bd9a ]
      
      When iteratively building a UDP datagram with MSG_MORE and that
      datagram exceeds MTU, consistently choose UFO or fragmentation.
      
      Once skb_is_gso, always apply ufo. Conversely, once a datagram is
      split across multiple skbs, do not consider ufo.
      
      Sendpage already maintains the first invariant, only add the second.
      IPv6 does not have a sendpage implementation to modify.
      
      A GSO skb must have a partial checksum, so do not follow sk_no_check_tx
      in udp_send_skb.
      
      Found by syzkaller.
      
      Fixes: e89e9cf5 ("[IPv4/IPv6]: UFO Scatter-gather approach")
      Reported-by: Andrey Konovalov <andreyknvl@google.com>
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      938990d2
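
      A stand-alone model of the invariant described above, with simplified,
      hypothetical names (not the kernel's actual data structures):

        #include <stdbool.h>
        #include <stddef.h>

        struct pending_datagram {
                bool gso;          /* corked skb already marked for UFO/GSO */
                int  nr_skbs;      /* skbs already queued for this datagram */
        };

        static bool use_ufo(const struct pending_datagram *d, size_t length,
                            size_t mtu)
        {
                if (d->gso)
                        return true;   /* once skb_is_gso, always apply UFO */
                if (d->nr_skbs > 1)
                        return false;  /* already split into plain fragments:
                                          never switch to UFO mid-datagram */
                return length > mtu;   /* fresh oversized datagram may use UFO */
        }
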
    • revert "ipv4: Should use consistent conditional judgement for ip fragment in __ip_append_data and ip_finish_output" · 98c1ad1e
      Greg Kroah-Hartman authored
      
      This reverts commit f102bb71, which is
      commit 0a28cfd5 upstream, as there is
      another patch that needs to be applied instead of this one.
      
      Cc: Zheng Li <james.z.li@ericsson.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Sasha Levin <alexander.levin@verizon.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      
      98c1ad1e
    • revert "net: account for current skb length when deciding about UFO" · 54fc0c32
      Greg Kroah-Hartman authored
      This reverts commit ef09c9ff, which is
      commit a5cb659b upstream, as it causes
      merge issues with later patches that are much more important...
      
      Cc: Michal Kubecek <mkubecek@suse.cz>
      Cc: Vlad Yasevich <vyasevic@redhat.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Sasha Levin <alexander.levin@verizon.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      54fc0c32
    • packet: fix tp_reserve race in packet_set_ring · 63364a50
      Willem de Bruijn authored
      
      [ Upstream commit c27927e3 ]
      
      Updates to tp_reserve can race with reads of the field in
      packet_set_ring. Avoid this by holding the socket lock during
      updates in setsockopt PACKET_RESERVE.
      
      This bug was discovered by syzkaller.
      
      Fixes: 8913336a ("packet: add PACKET_RESERVE sockopt")
      Reported-by: Andrey Konovalov <andreyknvl@google.com>
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      63364a50
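
      A hedged sketch of the serialization described above (simplified into a
      helper function; not the exact af_packet.c diff):

        static int set_tp_reserve(struct sock *sk, struct packet_sock *po,
                                  unsigned int val)
        {
                int ret = 0;

                lock_sock(sk);          /* excludes packet_set_ring() */
                if (po->rx_ring.pg_vec || po->tx_ring.pg_vec)
                        ret = -EBUSY;   /* ring already set up: too late */
                else
                        po->tp_reserve = val;
                release_sock(sk);
                return ret;
        }
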
    • net: avoid skb_warn_bad_offload false positives on UFO · 37d5c6e8
      Willem de Bruijn authored
      
      [ Upstream commit 8d63bee6 ]
      
      skb_warn_bad_offload triggers a warning when an skb enters the GSO
      stack at __skb_gso_segment that does not have CHECKSUM_PARTIAL
      checksum offload set.
      
      Commit b2504a5d ("net: reduce skb_warn_bad_offload() noise")
      observed that SKB_GSO_DODGY producers can trigger the check and
      that passing those packets through the GSO handlers will fix it
      up. But the software UFO handler sets ip_summed to CHECKSUM_NONE.
      
      When __skb_gso_segment is called from the receive path, this
      triggers the warning again.
      
      Make UFO set CHECKSUM_UNNECESSARY instead of CHECKSUM_NONE. On
      Tx these two are equivalent. On Rx, this better matches the
      skb state (checksum computed), as CHECKSUM_NONE here means no
      checksum computed.
      
      See also this thread for context:
      http://patchwork.ozlabs.org/patch/799015/
      
      Fixes: b2504a5d ("net: reduce skb_warn_bad_offload() noise")
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      37d5c6e8
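
      A hedged sketch of the one-line change described above (simplified; the
      real assignment sits in the software UFO fragmentation path):

        /* The UDP checksum has just been computed in software, so tell later
         * GSO code (including on the receive path) that it does not need to
         * be checked, instead of claiming that no checksum exists. */
        skb->ip_summed = CHECKSUM_UNNECESSARY;   /* was CHECKSUM_NONE */
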
    • tcp: fastopen: tcp_connect() must refresh the route · 8607d550
      Eric Dumazet authored
      
      [ Upstream commit 8ba60924 ]
      
      With new TCP_FASTOPEN_CONNECT socket option, there is a possibility
      to call tcp_connect() while socket sk_dst_cache is either NULL
      or invalid.
      
       +0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 4
       +0 fcntl(4, F_SETFL, O_RDWR|O_NONBLOCK) = 0
       +0 setsockopt(4, SOL_TCP, TCP_FASTOPEN_CONNECT, [1], 4) = 0
       +0 connect(4, ..., ...) = 0
      
      << sk->sk_dst_cache becomes obsolete, or even set to NULL >>
      
       +1 sendto(4, ..., 1000, MSG_FASTOPEN, ..., ...) = 1000
      
      We need to refresh the route, otherwise bad things can happen,
      especially when syzkaller is running on the host :/
      
      Fixes: 19f6d3f3 ("net/tcp-fastopen: Add new API support")
      Reported-by: Dmitry Vyukov <dvyukov@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Wei Wang <weiwan@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Acked-by: Wei Wang <weiwan@google.com>
      Acked-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      8607d550
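
      A hedged sketch of the refresh idea (this may not match the exact
      upstream diff): with TCP_FASTOPEN_CONNECT, the sendmsg() that finally
      triggers the SYN can run long after connect(), so the cached route has
      to be revalidated before building the SYN. The helper wrapper below is
      hypothetical:

        /* Returns 0 if a usable route is (re)installed on the socket. */
        static int refresh_route_before_syn(struct sock *sk)
        {
                /* rebuild_header() redoes the route lookup for the socket. */
                if (inet_csk(sk)->icsk_af_ops->rebuild_header(sk))
                        return -EHOSTUNREACH;   /* routing failure */
                return 0;
        }
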
    • net: sched: set xt_tgchk_param par.nft_compat as 0 in ipt_init_target · 40fc2b44
      Xin Long authored
      
      [ Upstream commit 96d97030 ]
      
      Commit 55917a21 ("netfilter: x_tables: add context to know if
      extension runs from nft_compat") introduced a member nft_compat in the
      xt_tgchk_param structure.
      
      But it didn't set its value in ipt_init_target. With an unexpected value
      in par.nft_compat, some targets' checkentry may return unexpected
      results.
      
      This patch sets all of par's fields to 0 and only initializes the
      non-zero fields in ipt_init_target.
      
      v1->v2:
        As per Wang Cong's suggestion, fix it by setting all the fields to
        0 and only initializing the non-zero ones.
      
      Fixes: 55917a21 ("netfilter: x_tables: add context to know if extension runs from nft_compat")
      Suggested-by: Cong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      40fc2b44
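
      A hedged sketch of the initialization described above. The wrapper
      function and its parameters are illustrative; the field names follow
      struct xt_tgchk_param, but this is not the exact act_ipt.c diff:

        static int check_target(struct xt_target *t, void *targinfo,
                                struct net *net, const char *table,
                                unsigned int hook_mask)
        {
                struct xt_tgchk_param par;

                memset(&par, 0, sizeof(par));   /* par.nft_compat is now 0,
                                                   not stack garbage */
                par.net       = net;
                par.table     = table;
                par.target    = t;
                par.targinfo  = targinfo;
                par.hook_mask = hook_mask;
                par.family    = NFPROTO_IPV4;
                return xt_check_target(&par, t->targetsize, 0, false);
        }
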
    • bpf, s390: fix jit branch offset related to ldimm64 · d0da2877
      Daniel Borkmann authored
      
      [ Upstream commit b0a0c256 ]
      
      While testing some other work that required JIT modifications, I
      ran into test_bpf causing a hang with the JIT enabled on s390. The
      problematic test case was the one from ddc665a4 (bpf, arm64:
      fix jit branch offset related to ldimm64), and it turns out that we
      have a similar issue on s390 as well. In bpf_jit_prog() we update the
      next instruction address after returning from bpf_jit_insn() with an
      insn_count. bpf_jit_insn() returns either -1 in case of error (e.g. an
      unsupported insn), 1, or 2. The latter is only the case for ldimm64,
      since it spans 2 insns; however, the next address is only set to i + 1,
      not taking the actual insn_count into account, so the fix is to use
      insn_count instead of 1. bpf_jit_enable in mode 2 also provides a
      disassembly on s390:
      
      Before fix:
      
        000003ff800349b6: a7f40003   brc     15,3ff800349bc                 ; target
        000003ff800349ba: 0000               unknown
        000003ff800349bc: e3b0f0700024       stg     %r11,112(%r15)
        000003ff800349c2: e3e0f0880024       stg     %r14,136(%r15)
        000003ff800349c8: 0db0               basr    %r11,%r0
        000003ff800349ca: c0ef00000000       llilf   %r14,0
        000003ff800349d0: e320b0360004       lg      %r2,54(%r11)
        000003ff800349d6: e330b03e0004       lg      %r3,62(%r11)
        000003ff800349dc: ec23ffeda065       clgrj   %r2,%r3,10,3ff800349b6 ; jmp
        000003ff800349e2: e3e0b0460004       lg      %r14,70(%r11)
        000003ff800349e8: e3e0b04e0004       lg      %r14,78(%r11)
        000003ff800349ee: b904002e   lgr     %r2,%r14
        000003ff800349f2: e3b0f0700004       lg      %r11,112(%r15)
        000003ff800349f8: e3e0f0880004       lg      %r14,136(%r15)
        000003ff800349fe: 07fe               bcr     15,%r14
      
      After fix:
      
        000003ff80ef3db4: a7f40003   brc     15,3ff80ef3dba
        000003ff80ef3db8: 0000               unknown
        000003ff80ef3dba: e3b0f0700024       stg     %r11,112(%r15)
        000003ff80ef3dc0: e3e0f0880024       stg     %r14,136(%r15)
        000003ff80ef3dc6: 0db0               basr    %r11,%r0
        000003ff80ef3dc8: c0ef00000000       llilf   %r14,0
        000003ff80ef3dce: e320b0360004       lg      %r2,54(%r11)
        000003ff80ef3dd4: e330b03e0004       lg      %r3,62(%r11)
        000003ff80ef3dda: ec230006a065       clgrj   %r2,%r3,10,3ff80ef3de6 ; jmp
        000003ff80ef3de0: e3e0b0460004       lg      %r14,70(%r11)
        000003ff80ef3de6: e3e0b04e0004       lg      %r14,78(%r11)          ; target
        000003ff80ef3dec: b904002e   lgr     %r2,%r14
        000003ff80ef3df0: e3b0f0700004       lg      %r11,112(%r15)
        000003ff80ef3df6: e3e0f0880004       lg      %r14,136(%r15)
        000003ff80ef3dfc: 07fe               bcr     15,%r14
      
      test_bpf.ko suite runs fine after the fix.
      
      Fixes: 05462310 ("s390/bpf: Add s390x eBPF JIT compiler backend")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Tested-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      d0da2877
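
      A hedged sketch of the index arithmetic described above, simplified from
      the shape of the generation loop (not the exact bpf_jit_prog() code):

        for (i = 0; i < fp->len; i += insn_count) {
                insn_count = bpf_jit_insn(jit, fp, i);  /* 1, or 2 for ldimm64;
                                                           < 0 on error */
                if (insn_count < 0)
                        return -1;
                /* Record where the next eBPF insn starts in the generated
                 * image: step over the whole ldimm64 pair, i.e. use
                 * insn_count rather than a hard-coded 1. */
                jit->addrs[i + insn_count] = jit->prg;
        }
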
    • net: fix keepalive code vs TCP_FASTOPEN_CONNECT · 4e0675f4
      Eric Dumazet authored
      
      [ Upstream commit 2dda6400 ]
      
      syzkaller was able to trigger a divide by 0 in TCP stack [1]
      
      The issue here is that the keepalive timer needs to be updated so it
      does not attempt to send a probe if the connection setup was deferred
      using the TCP_FASTOPEN_CONNECT socket option added in linux-4.11.
      
      [1]
       divide error: 0000 [#1] SMP
       CPU: 18 PID: 0 Comm: swapper/18 Not tainted
       task: ffff986f62f4b040 ti: ffff986f62fa2000 task.ti: ffff986f62fa2000
       RIP: 0010:[<ffffffff8409cc0d>]  [<ffffffff8409cc0d>] __tcp_select_window+0x8d/0x160
       Call Trace:
        <IRQ>
        [<ffffffff8409d951>] tcp_transmit_skb+0x11/0x20
        [<ffffffff8409da21>] tcp_xmit_probe_skb+0xc1/0xe0
        [<ffffffff840a0ee8>] tcp_write_wakeup+0x68/0x160
        [<ffffffff840a151b>] tcp_keepalive_timer+0x17b/0x230
        [<ffffffff83b3f799>] call_timer_fn+0x39/0xf0
        [<ffffffff83b40797>] run_timer_softirq+0x1d7/0x280
        [<ffffffff83a04ddb>] __do_softirq+0xcb/0x257
        [<ffffffff83ae03ac>] irq_exit+0x9c/0xb0
        [<ffffffff83a04c1a>] smp_apic_timer_interrupt+0x6a/0x80
        [<ffffffff83a03eaf>] apic_timer_interrupt+0x7f/0x90
        <EOI>
        [<ffffffff83fed2ea>] ? cpuidle_enter_state+0x13a/0x3b0
        [<ffffffff83fed2cd>] ? cpuidle_enter_state+0x11d/0x3b0
      
      Tested:
      
      The following packetdrill script no longer crashes the kernel:
      
      `echo 0 >/proc/sys/net/ipv4/tcp_timestamps`
      
      // Cache warmup: send a Fast Open cookie request
          0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
         +0 fcntl(3, F_SETFL, O_RDWR|O_NONBLOCK) = 0
         +0 setsockopt(3, SOL_TCP, TCP_FASTOPEN_CONNECT, [1], 4) = 0
         +0 connect(3, ..., ...) = -1 EINPROGRESS (Operation is now in progress)
         +0 > S 0:0(0) <mss 1460,nop,nop,sackOK,nop,wscale 8,FO,nop,nop>
       +.01 < S. 123:123(0) ack 1 win 14600 <mss 1460,nop,nop,sackOK,nop,wscale 6,FO abcd1234,nop,nop>
         +0 > . 1:1(0) ack 1
         +0 close(3) = 0
         +0 > F. 1:1(0) ack 1
         +0 < F. 1:1(0) ack 2 win 92
         +0 > .  2:2(0) ack 2
      
         +0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 4
         +0 fcntl(4, F_SETFL, O_RDWR|O_NONBLOCK) = 0
         +0 setsockopt(4, SOL_TCP, TCP_FASTOPEN_CONNECT, [1], 4) = 0
         +0 setsockopt(4, SOL_SOCKET, SO_KEEPALIVE, [1], 4) = 0
       +.01 connect(4, ..., ...) = 0
         +0 setsockopt(4, SOL_TCP, TCP_KEEPIDLE, [5], 4) = 0
         +10 close(4) = 0
      
      `echo 1 >/proc/sys/net/ipv4/tcp_timestamps`
      
      Fixes: 19f6d3f3 ("net/tcp-fastopen: Add new API support")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: Dmitry Vyukov <dvyukov@google.com>
      Cc: Wei Wang <weiwan@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      4e0675f4
    • tcp: avoid setting cwnd to invalid ssthresh after cwnd reduction states · 025bb7f7
      Yuchung Cheng authored
      
      [ Upstream commit ed254971 ]
      
      If the sender switches congestion control during the ECN-triggered
      cwnd-reduction state (CA_CWR), then upon exiting recovery cwnd is set to
      the ssthresh value calculated by the previous congestion control. If the
      previous congestion control is BBR, which always keeps ssthresh at
      TCP_INFINITE_SSTHRESH, cwnd ends up being infinite. The safe step is to
      avoid assigning an invalid ssthresh value when recovery ends.
      
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      025bb7f7
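
      A small stand-alone model of the guard described above. The helper is
      hypothetical and the constant value mirrors the kernel's
      TCP_INFINITE_SSTHRESH; the real check sits in the cwnd-reduction exit
      path:

        #include <stdint.h>

        #define TCP_INFINITE_SSTHRESH   0x7fffffff

        /* cwnd to use when leaving the CWR/Recovery state. */
        static uint32_t cwnd_after_reduction(uint32_t snd_cwnd,
                                             uint32_t snd_ssthresh)
        {
                /* BBR parks ssthresh at TCP_INFINITE_SSTHRESH; copying that
                 * into cwnd would make cwnd effectively unbounded. */
                if (snd_ssthresh < TCP_INFINITE_SSTHRESH)
                        return snd_ssthresh;
                return snd_cwnd;        /* keep the current cwnd instead */
        }
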
  2. 11 Aug, 2017 24 commits