1. 03 May, 2018 2 commits
    • ipv4: fix fnhe usage by non-cached routes · 94720e3a
      Julian Anastasov authored
      Allow some non-cached routes to use non-expired fnhe:
      
      1. ip_del_fnhe: moved above and now called by find_exception.
      The 4.5+ commit deed49df expires fnhe only when caching
      routes. Change that to:
      
      1.1. use fnhe for non-cached local output routes, with the help
      from (2)
      
      1.2. allow __mkroute_input to detect expired fnhe (outdated
      fnhe_gw, for example) when do_cache is false, e.g. when itag!=0
      for unicast destinations.
      
      2. __mkroute_output: keep fi to allow local routes with orig_oif != 0
      to use fnhe info even when the new route will not be cached into fnhe.
      After commit 839da4d9 ("net: ipv4: set orig_oif based on fib
      result for local traffic") it means all local routes will be affected
      because they are not cached. This change is used to solve a PMTU
      problem with IPVS (and probably Netfilter DNAT) setups that redirect
      local clients from target local IP (local route to Virtual IP)
      to new remote IP target, e.g. IPVS TUN real server. Loopback has
      64K MTU and we need to create fnhe on the local route that will
      keep the reduced PMTU for the Virtual IP. Without this change
      fnhe_pmtu is updated from ICMP but never exposed to non-cached
      local routes (this includes routes with flowi4_oif!=0 for 4.6+ and
      with flowi4_oif=any for 4.14+).
      
      3. update_or_create_fnhe: make sure fnhe_expires is not 0 for
      new entries
      
      Fixes: 839da4d9 ("net: ipv4: set orig_oif based on fib result for local traffic")
      Fixes: d6d5e999 ("route: do not cache fib route info on local routes with oif")
      Fixes: deed49df ("route: check and remove route cache when we get route")
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Xin Long <lucien.xin@gmail.com>
      Signed-off-by: Julian Anastasov <ja@ssi.bg>
      Acked-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      94720e3a
    • Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf · e002434e
      David S. Miller authored
      Daniel Borkmann says:
      
      ====================
      pull-request: bpf 2018-05-03
      
      The following pull-request contains BPF updates for your *net* tree.
      
      The main changes are:
      
      1) Several BPF sockmap fixes mostly related to bugs in error path
         handling, that is, a bug in updating the scatterlist length /
         offset accounting, a missing sk_mem_uncharge() in redirect
         error handling, and a bug where the outstanding bytes counter
         sg_size was not zeroed, from John.
      
      2) Fix two memory leaks in the x86-64 BPF JIT, one in an error
         path where we still don't converge after image was allocated
         and another one where BPF calls are used and JIT passes don't
         converge, from Daniel.
      
      3) Minor fix in BPF selftests where in test_stacktrace_build_id()
         we drop useless args in urandom_read and we need to add a missing
         newline in a CHECK() error message, from Song.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      e002434e
  2. 02 May, 2018 17 commits
    • Merge branch 'bpf-sockmap-fixes' · b5b6ff73
      Alexei Starovoitov authored
      John Fastabend says:
      
      ====================
      When I added the test_sockmap to selftests I mistakenly changed the
      test logic a bit. The result of this was on redirect cases we ended up
      choosing the wrong sock from the BPF program and ended up sending to a
      socket that had no receive handler. The result was the actual receive
      handler, running on a different socket, is timing out and closing the
      socket. This results in errors (-EPIPE to be specific) on the sending
      side. Typically happening if the sender does not complete the send
      before the receive side times out. So depending on timing and the size
      of the send we may get errors. This exposed some bugs in the sockmap
      error path handling.
      
      This series fixes the errors. The primary issue is we did not do proper
      memory accounting in these cases which resulted in missing a
      sk_mem_uncharge(). This happened in the redirect path and in one case
      on the normal send path. See the three patches for the details.
      
      The other take-away from this is we need to fix the test_sockmap and
      also add more negative test cases. That will happen in bpf-next.
      
      Finally, I tested this using the existing test_sockmap program, the
      older sockmap sample test script, and a few real use cases with
      Cilium. All of these seem to be working correctly.
      
      v2: fix compiler warning, drop iterator variable 'i' that is no longer
          used in patch 3.
      ====================
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      b5b6ff73
    • bpf: sockmap, fix error handling in redirect failures · abaeb096
      John Fastabend authored
      When a redirect failure happens we release the buffers in-flight
      without calling sk_mem_uncharge(). The uncharge is called before
      dropping the sock lock for the redirect; however, we missed updating
      the ring start index. When no apply actions are in progress this
      is OK because we uncharge the entire buffer before the redirect.
      But when we have apply logic running, it's possible that only a
      portion of the buffer is being redirected. In this case we only
      do memory accounting for the buffer slice being redirected and
      expect to be able to loop over the BPF program again and/or if
      a sock is closed uncharge the memory at sock destruct time.
      
      With an invalid start index however the program logic looks at
      the start pointer index, checks the length, and when seeing the
      length is zero (from the initial release and failure to update
      the pointer) aborts without uncharging/releasing the remaining
      memory.
      
      The fix for this is simply to update the start index. To avoid
      fixing this error in two locations we do a small refactor and
      remove one case where it is open-coded. Then fix it in the
      single function.
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      abaeb096
    • bpf: sockmap, zero sg_size on error when buffer is released · fec51d40
      John Fastabend authored
      When an error occurs during a redirect we have two cases that need
      to be handled (i) we have a cork'ed buffer (ii) we have a normal
      sendmsg buffer.
      
      In the cork'ed buffer case we don't currently support recovering from
      errors in a redirect action. So the buffer is released and the error
      should _not_ be pushed back to the caller of sendmsg/sendpage. The
      rationale here is the user will get an error that relates to old
      data that may have been sent by some arbitrary thread on that sock.
      Instead we simply consume the data and tell the user that the data
      has been consumed. We may add proper error recovery in the future.
      However, this patch fixes a bug where the bytes outstanding counter
      sg_size was not zeroed. This could result in a case where if the user
      has both a cork'ed action and apply action in progress we may
      incorrectly call into the BPF program when the user expected an
      old verdict to be applied via the apply action. I don't have a use
      case where using apply and cork at the same time is valid but we
      never explicitly reject it because it should work fine. This patch
      ensures the sg_size is zeroed so we don't have this case.
      
      In the normal sendmsg buffer case (no cork data) we also do not
      zero sg_size. Again this can confuse the apply logic when the logic
      calls into the BPF program when the BPF programmer expected the old
      verdict to remain. So ensure we set sg_size to zero here as well. And
      additionally to keep the psock state in-sync with the sk_msg_buff
      release all the memory as well. Previously we did this before
      returning to the user but this left a gap where psock and sk_msg_buff
      states were out of sync which seems fragile. No additional overhead
      is taken here except for a call to check the length and realize it's
      already been freed. This is in the error path as well, so in my
      opinion let's have robust code over optimized error paths.
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      fec51d40
    • bpf: sockmap, fix scatterlist update on error path in send with apply · 3cc9a472
      John Fastabend authored
      When the call to do_tcp_sendpage() fails to send the complete block
      requested we either retry if only a partial send was completed or
      abort if we receive an error less than or equal to zero. Before
      returning though we must update the scatterlist length/offset to
      account for any partial send completed.
      
      Before this patch we did this at the end of the retry loop, but
      this was buggy when used while applying a verdict to fewer bytes
      than in the scatterlist. When the scatterlist length was being set
      we forgot to account for the apply logic reducing the size variable.
      So the result was we chopped off some bytes in the scatterlist without
      doing proper cleanup on them. This results in a WARNING when the
      sock is torn down because the bytes have previously been charged to
      the socket but are never uncharged.
      
      The fix is simply to do the accounting inside the retry loop,
      subtracting from the absolute scatterlist values rather than trying
      to accumulate the totals and subtract at the end.
      Reported-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      3cc9a472
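The accounting change described in 3cc9a472 can be sketched in plain userspace C. This is only an illustration under stated assumptions, not the kernel code: `struct sg_entry`, `push_entry()` and `flaky_send()` are hypothetical names, and the sender merely stands in for a call such as do_tcp_sendpage(). The point it shows is that offset and length are adjusted right after every partial send, so an early error return leaves the entry consistent even when the verdict applies to fewer bytes than the entry holds.

```c
#include <errno.h>

/* Hypothetical stand-in for one scatterlist entry. */
struct sg_entry {
    int offset;
    int length;
};

/* Hypothetical sender: returns bytes sent (possibly fewer than asked) or <= 0 on error. */
typedef int (*send_fn)(int offset, int size);

static int flaky_calls;

/* Example sender: sends at most 100 bytes on the first call, then fails with -EPIPE. */
static int flaky_send(int offset, int size)
{
    (void)offset;
    if (flaky_calls++ > 0)
        return -EPIPE;
    return size < 100 ? size : 100;
}

/* Push one entry, limited to apply_bytes when a verdict covers only part of it. */
static int push_entry(struct sg_entry *sg, int apply_bytes, send_fn send)
{
    int size = sg->length;

    if (apply_bytes && apply_bytes < size)
        size = apply_bytes;

    while (size > 0) {
        int ret = send(sg->offset, size);

        if (ret <= 0)
            return ret;        /* sg already reflects everything that was sent */

        sg->offset += ret;     /* account inside the loop, on absolute values */
        sg->length -= ret;
        size -= ret;
    }
    return 0;
}
```

With a 1000-byte entry and `apply_bytes` of 400, the example sender delivers 100 bytes and then fails; the entry is left at offset 100 with 900 bytes remaining, so nothing is silently chopped off.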
    • net_sched: fq: take care of throttled flows before reuse · 7df40c26
      Eric Dumazet authored
      Normally, a socket can not be freed/reused unless all its TX packets
      left qdisc and were TX-completed. However connect(AF_UNSPEC) allows
      this to happen.
      
      With commit fc59d5bd ("pkt_sched: fq: clear time_next_packet for
      reused flows") we cleared f->time_next_packet but took no special
      action if the flow was still in the throttled rb-tree.
      
      Since f->time_next_packet is the key used in the rb-tree searches,
      blindly clearing it might break rb-tree integrity. We need to make
      sure the flow is no longer in the rb-tree to avoid this problem.
      
      Fixes: fc59d5bd ("pkt_sched: fq: clear time_next_packet for reused flows")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      7df40c26
    • ipv6: Revert "ipv6: Allow non-gateway ECMP for IPv6" · 30ca22e4
      Ido Schimmel authored
      This reverts commit edd7ceb7 ("ipv6: Allow non-gateway ECMP for
      IPv6").
      
      Eric reported a division by zero in rt6_multipath_rebalance() which is
      caused by above commit that considers identical local routes to be
      siblings. The division by zero happens because a nexthop weight is not
      set for local routes.
      
      Revert the commit as it does not fix a bug and has side effects.
      
      To reproduce:
      
      # ip -6 address add 2001:db8::1/64 dev dummy0
      # ip -6 address add 2001:db8::1/64 dev dummy1
      
      Fixes: edd7ceb7 ("ipv6: Allow non-gateway ECMP for IPv6")
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
      Tested-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      30ca22e4
    • Merge branch 'x86-bpf-jit-fixes' · 0f58e58e
      Alexei Starovoitov authored
      Daniel Borkmann says:
      
      ====================
      Fix two memory leaks in x86 JIT. For details, please see
      individual patches in this series. Thanks!
      ====================
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      0f58e58e
    • bpf, x64: fix memleak when not converging on calls · 39f56ca9
      Daniel Borkmann authored
      The JIT logic in jit_subprogs() is as follows: for all subprogs we
      allocate a bpf_prog_alloc(), populate it (prog->is_func = 1 here),
      and pass it to bpf_int_jit_compile(). If a failure occurred during
      JIT and prog->jited is not set, then we bail out from attempting to
      JIT the whole program, and punt to the interpreter instead. In case
      JITing was successful, we fix up BPF call offsets and do another
      pass to bpf_int_jit_compile() (extra_pass is true at that point) to
      complete JITing calls. Given that this requires passing JIT context
      around, addrs and jit_data from x86 JIT are freed in the extra_pass in
      bpf_int_jit_compile() when calls are involved (if not, they can
      be freed immediately). However, if in the original pass, the JIT
      image didn't converge then we leak addrs and jit_data since image
      itself is NULL, the prog->is_func is set and extra_pass is false
      in that case, meaning both will become unreachable and are never
      cleaned up, therefore we need to free as well on !image. Only x64
      JIT is affected.
      
      Fixes: 1c2a088a ("bpf: x64: add JIT support for multi-function programs")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      39f56ca9
    • bpf, x64: fix memleak when not converging after image · 3aab8884
      Daniel Borkmann authored
      While reviewing x64 JIT code, I noticed that we leak the prior allocated
      JIT image in the case where proglen != oldproglen during the JIT passes.
      Prior to the commit e0ee9c12 ("x86: bpf_jit: fix two bugs in eBPF JIT
      compiler") we would just break out of the loop, and use the image as the
      JITed prog since it could only shrink in size anyway. After e0ee9c12,
      we would bail out to out_addrs label where we free addrs and jit_data but
      not the image coming from bpf_jit_binary_alloc().
      
      Fixes: e0ee9c12 ("x86: bpf_jit: fix two bugs in eBPF JIT compiler")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      3aab8884
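The leak pattern in 3aab8884 is a common one in goto-based cleanup: an error label frees the early allocations but not a later one. The userspace sketch below illustrates it under stated assumptions. It is not the JIT code; `compile_sketch()`, `track_alloc()`, `track_free()` and the `converges` flag (standing in for proglen == oldproglen) are hypothetical, and the label name only mirrors the out_addrs label the message mentions.

```c
#include <stddef.h>
#include <stdlib.h>

/* Crude live-allocation counter so the sketch can show that nothing leaks. */
static int live_allocs;

static void *track_alloc(size_t n)
{
    void *p = malloc(n);

    if (p)
        live_allocs++;
    return p;
}

static void track_free(void *p)
{
    if (p) {
        free(p);
        live_allocs--;
    }
}

/* Returns the image on success, NULL on failure. */
static void *compile_sketch(int converges)
{
    void *addrs = track_alloc(64);
    void *image = NULL;

    if (!addrs)
        return NULL;

    image = track_alloc(128);
    if (!image)
        goto out_addrs;

    if (!converges) {
        track_free(image);     /* the release that the leaky path skipped */
        image = NULL;
        goto out_addrs;
    }

out_addrs:
    track_free(addrs);
    return image;
}
```

After a failing call the counter is back at zero; after a successful call exactly one allocation (the returned image) is still live and belongs to the caller.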
    • net/smc: restrict non-blocking connect finish · 784813ae
      Ursula Braun authored
      The smc_poll code tries to finish connect() if the socket is in
      state SMC_INIT and polling of the internal CLC-socket returns with
      EPOLLOUT. This makes sense for a select/poll call following a connect
      call, but not without preceding connect().
      With this patch smc_poll starts connect logic only if the CLC-socket
      is no longer in its initial state TCP_CLOSE.
      
      In addition, a poll error on the internal CLC-socket is always
      propagated to the SMC socket.
      
      With this patch the code path mentioned by syzbot
      https://syzkaller.appspot.com/bug?extid=03faa2dc16b8b64be396
      is no longer possible.
      Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
      Reported-by: syzbot+03faa2dc16b8b64be396@syzkaller.appspotmail.com
      Signed-off-by: David S. Miller <davem@davemloft.net>
      784813ae
    • 8139too: Use disable_irq_nosync() in rtl8139_poll_controller() · af3e0fcf
      Ingo Molnar authored
      Use disable_irq_nosync() instead of disable_irq() as this might be
      called in atomic context with netpoll.
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      af3e0fcf
    • sctp: fix the issue that the cookie-ack with auth can't get processed · ce402f04
      Xin Long authored
      When auth is enabled for cookie-ack chunk, in sctp_inq_pop, sctp
      processes auth chunk first, then continues to the next chunk in
      this packet if chunk_end + chunk_hdr size < skb_tail_pointer().
      Otherwise, it will go to the next packet or discard this chunk.
      
      However, it missed the fact that cookie-ack chunk's size is equal
      to chunk_hdr size, which couldn't match that check, and thus this
      chunk would not get processed.
      
      This patch fixes it by changing the check to chunk_end + chunk_hdr
      size <= skb_tail_pointer().
      
      Fixes: 26b87c78 ("net: sctp: fix remote memory pressure from excessive queueing")
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Acked-by: Neil Horman <nhorman@tuxdriver.com>
      Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ce402f04
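The off-by-one in ce402f04 is easy to see with plain integers. The sketch below is an assumption-laden illustration rather than the sctp_inq_pop code: offsets are modelled as size_t values, `CHUNK_HDR_SIZE` is a hypothetical constant for the 4-byte chunk header, and both function names are invented.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the size of an SCTP chunk header (4 bytes). */
#define CHUNK_HDR_SIZE 4

/* Old check: strict '<' rejects a trailing chunk that is exactly one header long. */
static bool next_chunk_fits_old(size_t chunk_end, size_t tail)
{
    return chunk_end + CHUNK_HDR_SIZE < tail;
}

/* Fixed check: '<=' accepts a header-only chunk that ends exactly at the tail. */
static bool next_chunk_fits_new(size_t chunk_end, size_t tail)
{
    return chunk_end + CHUNK_HDR_SIZE <= tail;
}
```

If the auth chunk ends at offset 16 and the buffer tail is at 20, only a header-sized chunk such as cookie-ack remains: the old check rejects it, the new one accepts it, and anything that would run past the tail is still rejected.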
    • sctp: use the old asoc when making the cookie-ack chunk in dupcook_d · 46e16d4b
      Xin Long authored
      When processing a duplicate cookie-echo chunk, for case 'D', sctp will
      not process the param from this chunk. It means old asoc has nothing
      to be updated, and the new temp asoc doesn't have the complete info.
      
      So there's no reason to use the new asoc when creating the cookie-ack
      chunk. Otherwise, like when auth is enabled for cookie-ack, the chunk
      can not be set with auth, and it will definitely be dropped by peer.
      
      This issue has been there since the very beginning, and we fix it by using the
      old asoc instead.
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Acked-by: Neil Horman <nhorman@tuxdriver.com>
      Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      46e16d4b
    • sctp: init active key for the new asoc in dupcook_a and dupcook_b · 4842a08f
      Xin Long authored
      When processing a duplicate cookie-echo chunk, for case 'A' and 'B',
      after sctp_process_init for the new asoc, if auth is enabled for the
      cookie-ack chunk, the active key should also be initialized.
      
      Otherwise, the cookie-ack chunk made later can not be set with auth
      shkey properly, and a crash can even be caused by this, as after
      commit 1b1e0bc9 ("sctp: add refcnt support for sh_key"), sctp
      needs to hold the shkey when making control chunks.
      
      Fixes: 1b1e0bc9 ("sctp: add refcnt support for sh_key")
      Reported-by: Jianwen Ji <jiji@redhat.com>
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Acked-by: Neil Horman <nhorman@tuxdriver.com>
      Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      4842a08f
    • tcp_bbr: fix to zero idle_restart only upon S/ACKed data · e6e6a278
      Neal Cardwell authored
      Previously the bbr->idle_restart tracking was zeroing out the
      bbr->idle_restart bit upon ACKs that did not SACK or ACK anything,
      e.g. receiving incoming data or receiver window updates. In such
      situations BBR would forget that this was a restart-from-idle
      situation, and if the min_rtt had expired it would unnecessarily enter
      PROBE_RTT (even though we were actually restarting from idle but had
      merely forgotten that fact).
      
      The fix is simple: we need to remember we are restarting from idle
      until we receive a S/ACK for some data (a S/ACK for the first flight
      of data we send as we are restarting).
      
      This commit is a stable candidate for kernels back as far as 4.9.
      
      Fixes: 0f8782ea ("tcp_bbr: add BBR congestion control")
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: Priyaranjan Jha <priyarjha@google.com>
      Signed-off-by: Yousuk Seung <ysseung@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      e6e6a278
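The behavioural difference described in e6e6a278 comes down to when a single flag is cleared. A minimal sketch, assuming a hypothetical `struct bbr_sketch` and an `acked_sacked` count of newly ACKed or SACKed packets (neither is taken from the BBR source):

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical reduction of the BBR state to the one flag under discussion. */
struct bbr_sketch {
    bool idle_restart;
};

/* Old behaviour: any incoming ACK clears the flag, even a pure window update. */
static void on_ack_old(struct bbr_sketch *b, uint32_t acked_sacked)
{
    (void)acked_sacked;
    b->idle_restart = false;
}

/* Fixed behaviour: clear only when the ACK newly ACKs or SACKs some data. */
static void on_ack_new(struct bbr_sketch *b, uint32_t acked_sacked)
{
    if (acked_sacked > 0)
        b->idle_restart = false;
}
```

With the old handler a window update that acknowledges nothing makes the flow forget it is restarting from idle; with the new handler the flag survives until the first flight of data is actually acknowledged.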
    • net: ethernet: ti: cpsw: fix packet leaking in dual_mac mode · 5e5add17
      Grygorii Strashko authored
      In dual_mac mode, packets arriving on one port should not be forwarded by
      switch hw to another port. Only the Linux Host can forward packets between
      ports. The test case below (reported in [1]) shows that a packet arriving on
      one port can be leaked to another (reproducible with dual port evms):
       - connect port 1 (eth0) to linux Host 0 and run tcpdump or Wireshark
       - connect port 2 (eth1) to linux Host 1 with vlan 1 configured
       - ping <IPx> from Host 1 through vlan 1 interface.
      ARP packets will be seen on Host 0.
      
      The issue happens because dual_mac mode is implemented using two vlans: 1 (Port
      1+Port 0) and 2 (Port 2+Port 0), so there are vlan records created for
      each vlan. By default, the ALE will find valid vlan record in its table
      when vlan 1 tagged packet arrived on Port 2 and so forwards packet to all
      ports which are vlan 1 members (like Port.
      
      To avoid such behavior the ALE VLAN ID Ingress Check needs to be enabled
      for each external CPSW port (ALE_PORTCTLn.VID_INGRESS_CHECK) so the ALE will
      drop ingress packets if the Rx port is not a VLAN member.
      Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      5e5add17
    • Revert "vhost: make msg padding explicit" · c818aa88
      Michael S. Tsirkin authored
      This reverts commit 93c0d549c4c5a7382ad70de6b86610b7aae57406.
      
      Unfortunately the padding will break 32 bit userspace.
      Ouch. Need to add some compat code, revert for now.
      Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c818aa88
  3. 01 May, 2018 9 commits
    • nfp: flower: set tunnel ttl value to net default · 50a5852a
      John Hurley authored
      Firmware requires that the ttl value for an encapsulating ipv4 tunnel
      header be included as an action field. Prior to the support of Geneve
      tunnel encap (when ttl set was removed completely), ttl value was
      extracted from the tunnel key. However, tests have shown that this can
      still produce a ttl of 0.
      
      Fix the issue by setting the namespace default value for each new tunnel.
      Follow up patch for net-next will do a full route lookup.
      
      Fixes: 3ca3059d ("nfp: flower: compile Geneve encap actions")
      Fixes: b27d6a95 ("nfp: compile flower vxlan tunnel set actions")
      Signed-off-by: John Hurley <john.hurley@netronome.com>
      Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      50a5852a
    • net/tls: Don't recursively call push_record during tls_write_space callbacks · c212d2c7
      Dave Watson authored
      It is reported that in some cases, write_space may be called in
      do_tcp_sendpages, such that we recursively invoke do_tcp_sendpages again:
      
      [  660.468802]  ? do_tcp_sendpages+0x8d/0x580
      [  660.468826]  ? tls_push_sg+0x74/0x130 [tls]
      [  660.468852]  ? tls_push_record+0x24a/0x390 [tls]
      [  660.468880]  ? tls_write_space+0x6a/0x80 [tls]
      ...
      
      tls_push_sg already does a loop over all sending sg's, so ignore
      any tls_write_space notifications until we are done sending.
      We then have to call the previous write_space to wake up
      poll() waiters after we are done with the send loop.
      Reported-by: Andre Tomt <andre@tomt.net>
      Signed-off-by: Dave Watson <davejwatson@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c212d2c7
    • bpf: minor fix to selftest test_stacktrace_build_id() · a4e21ff8
      Song Liu authored
      1. remove useless parameter list to ./urandom_read
      2. add missing "\n" to the end of an error message
      
      Fixes: 81f77fd0 ("bpf: add selftest for stackmap with BPF_F_STACK_BUILD_ID")
      Cc: Alexei Starovoitov <ast@kernel.org>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      a4e21ff8
    • ipv6: Allow non-gateway ECMP for IPv6 · edd7ceb7
      Thomas Winter authored
      It is valid to have static routes where the nexthop
      is an interface not an address such as tunnels.
      For IPv4 it was possible to use ECMP on these routes
      but not for IPv6.
      Signed-off-by: Thomas Winter <Thomas.Winter@alliedtelesis.co.nz>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
      Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
      Acked-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      edd7ceb7
    • ethtool: fix a potential missing-check bug · d656fe49
      Wenwen Wang authored
      In ethtool_get_rxnfc(), the object "info" is first copied from
      user-space. If the FLOW_RSS flag is set in the member field flow_type of
      "info" (and cmd is ETHTOOL_GRXFH), info needs to be copied again from
      user-space because FLOW_RSS is newer and has a new definition, as mentioned
      in the comment. However, given that the user data resides in user-space, a
      malicious user can race to change the data after the first copy. By doing
      so, the user can inject inconsistent data. For example, in the second
      copy, the FLOW_RSS flag could be cleared in the field flow_type of "info".
      In the following execution, "info" will be used in the function
      ops->get_rxnfc(). Such inconsistent data can potentially lead to unexpected
      information leakage since ops->get_rxnfc() will prepare various types of
      data according to flow_type, and the prepared data will be eventually
      copied to user-space. This inconsistent data may also cause undefined
      behaviors based on how ops->get_rxnfc() is implemented.
      
      This patch simply re-verifies the flow_type field of "info" after the
      second copy. If the value is not as expected, an error code will be
      returned.
      Signed-off-by: Wenwen Wang <wang6495@umn.edu>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      d656fe49
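The double-fetch problem in d656fe49 can be imitated in userspace. The sketch below is a simplified model and not the ethtool code: `struct rxnfc_info`, `FLOW_RSS_FLAG` and `second_copy_checked()` are hypothetical, and memcpy() stands in for the second copy from user-space. It assumes the re-verification amounts to checking that the flag which triggered the second copy is still set.

```c
#include <errno.h>
#include <stdint.h>
#include <string.h>

/* Illustrative flag bit, named after FLOW_RSS; the value is an assumption. */
#define FLOW_RSS_FLAG 0x20000000u

/* Hypothetical, reduced version of the structure copied from user-space. */
struct rxnfc_info {
    uint32_t cmd;
    uint32_t flow_type;
    uint32_t rss_context;
};

/*
 * info holds the result of the first copy, which had FLOW_RSS_FLAG set.
 * user is the memory re-read by the second copy and may have been changed
 * by a racing thread in the meantime.
 */
static int second_copy_checked(struct rxnfc_info *info,
                               const struct rxnfc_info *user)
{
    memcpy(info, user, sizeof(*info));       /* second fetch */

    if (!(info->flow_type & FLOW_RSS_FLAG))  /* re-verify after the fetch */
        return -EINVAL;
    return 0;
}
```

If the racing writer clears the flag between the two fetches, the caller gets -EINVAL instead of continuing with data that contradicts the earlier decision.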
    • net/mlx4: fix spelling mistake: "failedi" -> "failed" · 26ff7585
      Colin Ian King authored
      Trivial fix to spelling mistake in mlx4_warn message.
      Signed-off-by: Colin Ian King <colin.king@canonical.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      26ff7585
    • vhost: make msg padding explicit · de08481a
      Michael S. Tsirkin authored
      There's a 32 bit hole just after type. It's best to
      give it a name, this way the compiler is forced to initialize
      it with the rest of the structure.
      Reported-by: Kevin Easton <kevin@guarana.org>
      Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      de08481a
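The hole mentioned in de08481a, and the reason naming it can change the layout on some ABIs, can be inspected with offsetof(). The two structures below are hypothetical reductions with invented field names, not the vhost UAPI definitions. On ABIs that align 64-bit members to 8 bytes the implicit layout already places the payload at offset 8; on ABIs that align them to 4 bytes there is no hole, so adding a named member moves the payload.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout with an implicit hole after 'type'. */
struct msg_implicit {
    int32_t type;
    /* the compiler may insert padding here to align the 64-bit member */
    union {
        uint64_t iova;
        uint8_t raw[64];
    } u;
};

/* Hypothetical layout with the hole given a name. */
struct msg_explicit {
    int32_t type;
    int32_t padding;   /* named, so initializers cover it */
    union {
        uint64_t iova;
        uint8_t raw[64];
    } u;
};

static size_t implicit_payload_offset(void)
{
    return offsetof(struct msg_implicit, u);
}

static size_t explicit_payload_offset(void)
{
    return offsetof(struct msg_explicit, u);
}
```

The explicit variant always puts the payload at offset 8, while the implicit variant yields 4 or 8 depending on the ABI's alignment of uint64_t, which is one way the two definitions can disagree between 32-bit and 64-bit builds.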
    • tcp: fix TCP_REPAIR_QUEUE bound checking · bf2acc94
      Eric Dumazet authored
      syzbot is able to produce a nasty WARN_ON() in tcp_verify_left_out()
      with the following C-repro:
      
      socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 3
      setsockopt(3, SOL_TCP, TCP_REPAIR, [1], 4) = 0
      setsockopt(3, SOL_TCP, TCP_REPAIR_QUEUE, [-1], 4) = 0
      bind(3, {sa_family=AF_INET, sin_port=htons(20002), sin_addr=inet_addr("0.0.0.0")}, 16) = 0
      sendto(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
      	1242, MSG_FASTOPEN, {sa_family=AF_INET, sin_port=htons(20002), sin_addr=inet_addr("127.0.0.1")}, 16) = 1242
      setsockopt(3, SOL_TCP, TCP_REPAIR_WINDOW, "\4\0\0@+\205\0\0\377\377\0\0\377\377\377\177\0\0\0\0", 20) = 0
      writev(3, [{"\270", 1}], 1)             = 1
      setsockopt(3, SOL_TCP, TCP_REPAIR_OPTIONS, "\10\0\0\0\0\0\0\0\0\0\0\0|\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 386) = 0
      writev(3, [{"\210v\r[\226\320t\231qwQ\204\264l\254\t\1\20\245\214p\350H\223\254;\\\37\345\307p$"..., 3144}], 1) = 3144
      
      The 3rd system call looks odd:
      setsockopt(3, SOL_TCP, TCP_REPAIR_QUEUE, [-1], 4) = 0
      
      This patch makes sure bound checking is using an unsigned compare.
      
      Fixes: ee995283 ("tcp: Initial repair mode")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      bf2acc94
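Why an unsigned compare closes the hole in bf2acc94 can be shown with two one-line checks. This is a sketch only: `QUEUES_NR` is a hypothetical bound and both function names are invented; the actual check lives in the setsockopt() handling of TCP_REPAIR_QUEUE.

```c
#include <stdbool.h>

/* Hypothetical number of valid queue ids: 0, 1 and 2. */
#define QUEUES_NR 3

/* Signed compare: -1 < 3 is true, so a negative id slips through. */
static bool queue_id_ok_signed(int val)
{
    return val < QUEUES_NR;
}

/* Unsigned compare: -1 converts to UINT_MAX, which fails the bound. */
static bool queue_id_ok_unsigned(int val)
{
    return (unsigned int)val < QUEUES_NR;
}
```

The value -1 from the reproducer passes the signed check but is rejected by the unsigned one, while the valid ids 0 to 2 pass both.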
    • ipv6: fix uninit-value in ip6_multipath_l3_keys() · cea67a2d
      Eric Dumazet authored
      syzbot/KMSAN reported an uninit-value in ip6_multipath_l3_keys(),
      root caused to a bad assumption that the ICMP header was already
      pulled into skb->head.
      
      ip_multipath_l3_keys() does the correct thing, so it is an IPv6 only bug.
      
      BUG: KMSAN: uninit-value in ip6_multipath_l3_keys net/ipv6/route.c:1830 [inline]
      BUG: KMSAN: uninit-value in rt6_multipath_hash+0x5c4/0x640 net/ipv6/route.c:1858
      CPU: 0 PID: 4507 Comm: syz-executor661 Not tainted 4.16.0+ #87
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Call Trace:
       __dump_stack lib/dump_stack.c:17 [inline]
       dump_stack+0x185/0x1d0 lib/dump_stack.c:53
       kmsan_report+0x142/0x240 mm/kmsan/kmsan.c:1067
       __msan_warning_32+0x6c/0xb0 mm/kmsan/kmsan_instr.c:683
       ip6_multipath_l3_keys net/ipv6/route.c:1830 [inline]
       rt6_multipath_hash+0x5c4/0x640 net/ipv6/route.c:1858
       ip6_route_input+0x65a/0x920 net/ipv6/route.c:1884
       ip6_rcv_finish+0x413/0x6e0 net/ipv6/ip6_input.c:69
       NF_HOOK include/linux/netfilter.h:288 [inline]
       ipv6_rcv+0x1e16/0x2340 net/ipv6/ip6_input.c:208
       __netif_receive_skb_core+0x47df/0x4a90 net/core/dev.c:4562
       __netif_receive_skb net/core/dev.c:4627 [inline]
       netif_receive_skb_internal+0x49d/0x630 net/core/dev.c:4701
       netif_receive_skb+0x230/0x240 net/core/dev.c:4725
       tun_rx_batched drivers/net/tun.c:1555 [inline]
       tun_get_user+0x740f/0x7c60 drivers/net/tun.c:1962
       tun_chr_write_iter+0x1d4/0x330 drivers/net/tun.c:1990
       call_write_iter include/linux/fs.h:1782 [inline]
       new_sync_write fs/read_write.c:469 [inline]
       __vfs_write+0x7fb/0x9f0 fs/read_write.c:482
       vfs_write+0x463/0x8d0 fs/read_write.c:544
       SYSC_write+0x172/0x360 fs/read_write.c:589
       SyS_write+0x55/0x80 fs/read_write.c:581
       do_syscall_64+0x309/0x430 arch/x86/entry/common.c:287
       entry_SYSCALL_64_after_hwframe+0x3d/0xa2
      
      Fixes: 23aebdac ("ipv6: Compute multipath hash for ICMP errors from offending packet")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Cc: Jakub Sitnicki <jkbs@redhat.com>
      Acked-by: Jakub Sitnicki <jkbs@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  4. 30 Apr, 2018 5 commits
  5. 28 Apr, 2018 3 commits
  6. 27 Apr, 2018 4 commits
    • net: support compat 64-bit time in {s,g}etsockopt · 988bf724
      Lance Richardson authored
      For the x32 ABI, struct timeval has two 64-bit fields. However
      the kernel currently interprets the user-space values used for
      the SO_RCVTIMEO and SO_SNDTIMEO socket options as having a pair
      of 32-bit fields.
      
      When the seconds portion of the requested timeout is less than 2**32,
      the seconds portion of the effective timeout is correct but the
      microseconds portion is zero.  When the seconds portion of the
      requested timeout is zero and the microseconds portion is non-zero,
      the kernel interprets the timeout as zero (never time out).
      
      Fix by using 64-bit time for SO_RCVTIMEO/SO_SNDTIMEO as required
      for the ABI.
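
The misinterpretation can be modelled in userspace. The sketch below assumes a little-endian host (x32 is an x86-64 ABI) and uses hypothetical struct names, not kernel types: a timeval made of two 64-bit fields is read back through a layout of two 32-bit fields.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Layout used by x32 user space: two 64-bit fields (hypothetical name). */
struct timeval64 {
        int64_t tv_sec;
        int64_t tv_usec;
};

/* Layout assumed by the compat path: two 32-bit fields (hypothetical name). */
struct timeval32 {
        int32_t tv_sec;
        int32_t tv_usec;
};

/* Reinterpret the first 8 bytes of a 64-bit timeval as a 32-bit pair. */
struct timeval32 read_as_compat(const struct timeval64 *tv)
{
        struct timeval32 out;

        memcpy(&out, tv, sizeof out);
        return out;
}

/* Small self-check mirroring the two cases described in the text. */
void check_compat_read(void)
{
        struct timeval64 a = { 1, 999999 };  /* 1.999999 s requested */
        struct timeval64 b = { 0, 500000 };  /* 0.5 s requested */
        struct timeval32 ra = read_as_compat(&a);
        struct timeval32 rb = read_as_compat(&b);

        /* Seconds survive; "microseconds" is the upper half of tv_sec. */
        assert(ra.tv_sec == 1 && ra.tv_usec == 0);
        /* Both fields read as zero, which means no timeout at all. */
        assert(rb.tv_sec == 0 && rb.tv_usec == 0);
}
```

On a little-endian machine the low 32 bits of tv_sec land in the first field and its high 32 bits in the second, which is why the real microseconds value is never seen.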
      
      The code included below demonstrates the problem.
      
      Results before patch:
          $ gcc -m64 -Wall -O2 -o socktmo socktmo.c && ./socktmo
          recv time: 2.008181 seconds
          send time: 2.015985 seconds
      
          $ gcc -m32 -Wall -O2 -o socktmo socktmo.c && ./socktmo
          recv time: 2.016763 seconds
          send time: 2.016062 seconds
      
          $ gcc -mx32 -Wall -O2 -o socktmo socktmo.c && ./socktmo
          recv time: 1.007239 seconds
          send time: 1.023890 seconds
      
      Results after patch:
          $ gcc -m64 -O2 -Wall -o socktmo socktmo.c && ./socktmo
          recv time: 2.010062 seconds
          send time: 2.015836 seconds
      
          $ gcc -m32 -O2 -Wall -o socktmo socktmo.c && ./socktmo
          recv time: 2.013974 seconds
          send time: 2.015981 seconds
      
          $ gcc -mx32 -O2 -Wall -o socktmo socktmo.c && ./socktmo
          recv time: 2.030257 seconds
          send time: 2.013383 seconds
      
       #include <stdio.h>
       #include <stdlib.h>
       #include <sys/socket.h>
       #include <sys/types.h>
       #include <sys/time.h>
      
       void checkrc(char *str, int rc)
       {
               if (rc >= 0)
                       return;
      
               perror(str);
               exit(1);
       }
      
       static char buf[1024];
       int main(int argc, char **argv)
       {
               int rc;
               int socks[2];
               struct timeval tv;
               struct timeval start, end, delta;
      
               rc = socketpair(AF_UNIX, SOCK_STREAM, 0, socks);
               checkrc("socketpair", rc);
      
               /* set timeout to 1.999999 seconds */
               tv.tv_sec = 1;
               tv.tv_usec = 999999;
               rc = setsockopt(socks[0], SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof tv);
               checkrc("setsockopt", rc);
               rc = setsockopt(socks[0], SOL_SOCKET, SO_SNDTIMEO, &tv, sizeof tv);
               checkrc("setsockopt", rc);
      
               /* measure actual receive timeout */
               gettimeofday(&start, NULL);
               rc = recv(socks[0], buf, sizeof buf, 0);
               gettimeofday(&end, NULL);
               timersub(&end, &start, &delta);
      
               printf("recv time: %ld.%06ld seconds\n",
                      (long)delta.tv_sec, (long)delta.tv_usec);
      
               /* fill send buffer */
               do {
                       rc = send(socks[0], buf, sizeof buf, 0);
               } while (rc > 0);
      
               /* measure actual send timeout */
               gettimeofday(&start, NULL);
               rc = send(socks[0], buf, sizeof buf, 0);
               gettimeofday(&end, NULL);
               timersub(&end, &start, &delta);
      
               printf("send time: %ld.%06ld seconds\n",
                      (long)delta.tv_sec, (long)delta.tv_usec);
               exit(0);
       }
      
      Fixes: 515c7af8 ("x32: Use compat shims for {g,s}etsockopt")
      Reported-by: Gopal RajagopalSai <gopalsr83@gmail.com>
      Signed-off-by: Lance Richardson <lance.richardson.net@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • MAINTAINERS: add davem in NETWORKING DRIVERS · 0b21bca0
      Vivien Didelot authored
      "./scripts/get_maintainer.pl -f" does not actually show us David as the
      maintainer of drivers/net directories such as team, bonding, phy or dsa.
      Adding him in an M: entry of NETWORKING DRIVERS fixes this.
      Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge tag 'mlx5-fixes-2018-04-25' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux · e8e96081
      David S. Miller authored
      Saeed Mahameed says:
      
      ====================
      Mellanox, mlx5 fixes 2018-04-26
      
      This pull request includes fixes for mlx5 core and netdev driver.
      
      Please pull and let me know if there are any problems.
      
      For -stable v4.12
          net/mlx5e: TX, Use correct counter in dma_map error flow
      For -stable v4.13
          net/mlx5: Avoid cleaning flow steering table twice during error flow
      For -stable v4.14
          net/mlx5e: Allow offloading ipv4 header re-write for icmp
      For -stable v4.15
          net/mlx5e: DCBNL fix min inline header size for dscp
      For -stable v4.16
          net/mlx5: Fix mlx5_get_vector_affinity function
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge tag 'wireless-drivers-for-davem-2018-04-26' of... · 1da9a586
      David S. Miller authored
      Merge tag 'wireless-drivers-for-davem-2018-04-26' of git://git.kernel.org/pub/scm/linux/kernel/git/kvalo/wireless-drivers
      
      Kalle Valo says:
      
      ====================
      wireless-drivers fixes for 4.17
      
      A few fixes for 4.17 but nothing really special. The new ETSI WMM
      parameter support for iwlwifi is not technically a bugfix but
      important for regulatory compliance.
      
      iwlwifi
      
      * use new ETSI WMM parameters from regulatory database
      
      * fix a regression with the older firmware API 31 (e.g. 31.560484.0)
      
      brcmfmac
      
      * fix a double free when nvram loading fails
      
      rtlwifi
      
      * yet another fix for ant_sel module parameter
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>