1. 15 Sep, 2017 10 commits
    • bpf/verifier: reject BPF_ALU64|BPF_END · e67b8a68
      Edward Cree authored
      Neither ___bpf_prog_run nor the JITs accept it.
      Also adds a new test case.
      
      Fixes: 17a52670 ("bpf: verifier (add verifier core)")
      Signed-off-by: Edward Cree <ecree@solarflare.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
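As a rough illustration of the bpf/verifier change above: the rejected encoding is the byte-swap operation (BPF_END) placed in the 64-bit ALU class. The sketch below is not the verifier's code. The opcode constants mirror the Linux UAPI values but are redefined locally, and `is_alu64_end()` is a hypothetical helper name chosen for this example.

```c
#include <stdbool.h>
#include <stdint.h>

/* Opcode layout, redefined locally for illustration only. */
#define BPF_CLASS(code) ((code) & 0x07)
#define BPF_OP(code)    ((code) & 0xf0)
#define BPF_ALU   0x04  /* 32-bit ALU class */
#define BPF_ALU64 0x07  /* 64-bit ALU class */
#define BPF_END   0xd0  /* byte-swap operation */

/* Hypothetical helper: true when the opcode is BPF_ALU64|BPF_END,
 * the combination the commit says must be rejected. */
static bool is_alu64_end(uint8_t code)
{
    return BPF_CLASS(code) == BPF_ALU64 && BPF_OP(code) == BPF_END;
}
```

A verifier-style caller would return an error when `is_alu64_end(insn_code)` is true, while `BPF_ALU | BPF_END` remains acceptable.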
    • sctp: do not mark sk dumped when inet_sctp_diag_fill returns err · 8c7c19a5
      Xin Long authored
      sctp_diag would not actually dump out sk/asoc if inet_sctp_diag_fill
      returns err, in which case it shouldn't mark sk dumped by setting
      cb->args[3] as 1 in sctp_sock_dump().
      
      Otherwise, it could cause some asocs to have no parent's sk dumped
      in 'ss --sctp'.
      
      So this patch is to not set cb->args[3] when inet_sctp_diag_fill()
      returns err in sctp_sock_dump().
      
      Fixes: 8f840e47 ("sctp: add the sctp_diag.c file")
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      Acked-by: Neil Horman <nhorman@tuxdriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • sctp: fix an use-after-free issue in sctp_sock_dump · d25adbeb
      Xin Long authored
      Commit 86fdb344 ("sctp: ensure ep is not destroyed before doing the
      dump") tried to fix a use-after-free issue by checking !sctp_sk(sk)->ep
      while holding the sock and the sock lock.

      But Paolo noticed that the endpoint could be destroyed in sctp_rcv
      without sock lock protection. This means the use-after-free issue could
      still be triggered when sctp_rcv puts and destroys ep after
      sctp_sock_dump checks !ep, although it's pretty hard to reproduce.

      I could reproduce it by adding an mdelay in sctp_rcv and a long msleep
      in sctp_close and sctp_sock_dump.

      This patch adds another param, cb_done, to sctp_for_each_transport and
      dumps ep->assocs while holding tsp after leaving the transport
      traversal, which avoids this issue.

      It also makes the sctp diag dump run faster, as there is no longer any
      need to save sk into cb->args[5] and keep calling
      sctp_for_each_transport.

      This patch also uses int * instead of int for the pos argument of
      sctp_for_each_transport, so that the position is incremented only in
      sctp_for_each_transport and cb->args[2] no longer has to be updated in
      sctp_sock_filter and sctp_sock_dump.
      
      Fixes: 86fdb344 ("sctp: ensure ep is not destroyed before doing the dump")
      Reported-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      Acked-by: Neil Horman <nhorman@tuxdriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • netvsc: increase default receive buffer size · 5023a6db
      Stephen Hemminger authored
      The default receive buffer size was reduced by recent change
      to a value which was appropriate for 10G and Windows Server 2016.
      But the value is too small for full performance with 40G on Azure.
      Increase the default back to maximum supported by host.
      
      Fixes: 8b532797 ("netvsc: allow controlling send/recv buffer size")
      Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: update skb->skb_mstamp more carefully · 8c72c65b
      Eric Dumazet authored
      liujian reported a problem in TCP_USER_TIMEOUT processing with a patch
      in tcp_probe_timer() :
            https://www.spinics.net/lists/netdev/msg454496.html
      
      After investigations, the root cause of the problem is that we update
      skb->skb_mstamp of skbs in write queue, even if the attempt to send a
      clone or copy of it failed. One reason being a routing problem.
      
      This patch prevents this, solving liujian's issue.
      
      It also removes a potential RTT miscalculation, since
      __tcp_retransmit_skb() is not OR-ing TCP_SKB_CB(skb)->sacked with
      TCPCB_EVER_RETRANS if a failure happens, but skb->skb_mstamp has
      been changed.
      
      A future ACK would then lead to a very small RTT sample and min_rtt
      would then be lowered to this too small value.
      
      Tested:
      
      # cat user_timeout.pkt
      --local_ip=192.168.102.64
      
          0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
         +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
         +0 bind(3, ..., ...) = 0
         +0 listen(3, 1) = 0
      
         +0 `ifconfig tun0 192.168.102.64/16; ip ro add 192.0.2.1 dev tun0`
      
         +0 < S 0:0(0) win 0 <mss 1460>
         +0 > S. 0:0(0) ack 1 <mss 1460>
      
        +.1 < . 1:1(0) ack 1 win 65530
         +0 accept(3, ..., ...) = 4
      
         +0 setsockopt(4, SOL_TCP, TCP_USER_TIMEOUT, [3000], 4) = 0
         +0 write(4, ..., 24) = 24
         +0 > P. 1:25(24) ack 1 win 29200
         +.1 < . 1:1(0) ack 25 win 65530
      
      //change the ipaddress
         +1 `ifconfig tun0 192.168.0.10/16`
      
         +1 write(4, ..., 24) = 24
         +1 write(4, ..., 24) = 24
         +1 write(4, ..., 24) = 24
         +1 write(4, ..., 24) = 24
      
         +0 `ifconfig tun0 192.168.102.64/16`
         +0 < . 1:2(1) ack 25 win 65530
         +0 `ifconfig tun0 192.168.0.10/16`
      
         +3 write(4, ..., 24) = -1
      
      # ./packetdrill user_timeout.pkt
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: liujian <liujian56@huawei.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Acked-by: Yuchung Cheng <ycheng@google.com>
      Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
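The idea behind the tcp skb_mstamp change above, that a queued packet's timestamp should move only when the transmit attempt actually succeeded, can be sketched in plain C. This is an assumption-laden model, not the kernel code: `struct queued_pkt`, `send_and_stamp()`, and the two stub transmit functions are hypothetical names invented for the example.

```c
#include <stdint.h>

/* Hypothetical stand-in for an skb sitting in the write queue. */
struct queued_pkt {
    uint64_t last_sent_us;  /* plays the role of skb->skb_mstamp */
};

/* Transmit callback: returns 0 on success, non-zero on failure. */
typedef int (*xmit_fn)(const struct queued_pkt *pkt);

/* Update the timestamp only after a successful transmit, so a failed
 * attempt (e.g. a routing problem) leaves the old value in place. */
static int send_and_stamp(struct queued_pkt *pkt, uint64_t now_us, xmit_fn xmit)
{
    int err = xmit(pkt);

    if (err == 0)
        pkt->last_sent_us = now_us;
    return err;
}

/* Stub transmit functions used to exercise both paths. */
static int xmit_ok(const struct queued_pkt *pkt)       { (void)pkt; return 0; }
static int xmit_no_route(const struct queued_pkt *pkt) { (void)pkt; return -1; }
```

With this ordering, a later ACK is compared against the time of the last send that really left the host, which is what keeps an RTT sample from being computed against a send that never happened.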
    • net: ipv4: fix l3slave check for index returned in IP_PKTINFO · cbea8f02
      David Ahern authored
      rt_iif is only set to the actual egress device for the output path. The
      recent change to consider the l3slave flag when returning IP_PKTINFO
      works for local traffic (the correct device index is returned), but it
      broke the more typical use case of packets received from a remote host
      always returning the VRF index rather than the original ingress device.
      Update the fixup to consider l3slave and rt_iif actually getting set.
      
      Fixes: 1dfa7639 ("net: ipv4: add check for l3slave for index returned in IP_PKTINFO")
      Signed-off-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: smsc911x: Quieten netif during suspend · 2aa70f86
      Geert Uytterhoeven authored
      If the network interface is kept running during suspend, the net core
      may call net_device_ops.ndo_start_xmit() while the Ethernet device is
      still suspended, which may lead to a system crash.
      
      E.g. on sh73a0/kzm9g and r8a73a4/ape6evm, the external Ethernet chip is
      driven by a PM controlled clock.  If the Ethernet registers are accessed
      while the clock is not running, the system will crash with an imprecise
      external abort.
      
      As this is a race condition with a small time window, it is not so easy
      to trigger at will.  Using pm_test may increase your chances:
      
          # echo 0 > /sys/module/printk/parameters/console_suspend
          # echo platform > /sys/power/pm_test
          # echo mem > /sys/power/state
      
      To fix this, make sure the network interface is quietened during
      suspend.
      Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: systemport: Fix 64-bit stats deadlock · 7095c973
      Florian Fainelli authored
      We can deadlock because ndo_get_stats64(), which runs in process context,
      is not sufficiently protected against the RX or TX NAPI contexts running in
      softirq. This leads to the following lockdep splat, and an actual deadlock
      was also experienced with an iperf session in the background and a while
      loop doing ifconfig + ethtool.
      
      [    5.780350] ================================
      [    5.784679] WARNING: inconsistent lock state
      [    5.789011] 4.13.0-rc7-02179-g32fae27c725d #70 Not tainted
      [    5.794561] --------------------------------
      [    5.798890] inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.
      [    5.804971] swapper/0/0 [HC0[0]:SC1[1]:HE0:SE0] takes:
      [    5.810175]  (&syncp->seq#2){+.?...}, at: [<c0768a28>] bcm_sysport_tx_reclaim+0x30/0x54
      [    5.818327] {SOFTIRQ-ON-W} state was registered at:
      [    5.823278]   bcm_sysport_get_stats64+0x17c/0x258
      [    5.828053]   dev_get_stats+0x38/0xac
      [    5.831776]   rtnl_fill_stats+0x30/0x118
      [    5.835761]   rtnl_fill_ifinfo+0x538/0xe24
      [    5.839921]   rtmsg_ifinfo_build_skb+0x6c/0xd8
      [    5.844430]   rtmsg_ifinfo_event.part.5+0x14/0x44
      [    5.849201]   rtmsg_ifinfo+0x20/0x28
      [    5.852837]   register_netdevice+0x628/0x6b8
      [    5.857171]   register_netdev+0x14/0x24
      [    5.861051]   bcm_sysport_probe+0x30c/0x438
      [    5.865280]   platform_drv_probe+0x50/0xb0
      [    5.869418]   driver_probe_device+0x2e8/0x450
      [    5.873817]   __driver_attach+0x104/0x120
      [    5.877871]   bus_for_each_dev+0x7c/0xc0
      [    5.881834]   bus_add_driver+0x1b0/0x270
      [    5.885797]   driver_register+0x78/0xf4
      [    5.889675]   do_one_initcall+0x54/0x190
      [    5.893646]   kernel_init_freeable+0x144/0x1d0
      [    5.898135]   kernel_init+0x8/0x110
      [    5.901665]   ret_from_fork+0x14/0x2c
      [    5.905363] irq event stamp: 24263
      [    5.908804] hardirqs last  enabled at (24262): [<c08eecf0>] net_rx_action+0xc4/0x4e4
      [    5.916624] hardirqs last disabled at (24263): [<c0a7da00>] _raw_spin_lock_irqsave+0x1c/0x98
      [    5.925143] softirqs last  enabled at (24258): [<c022a7fc>] irq_enter+0x84/0x98
      [    5.932524] softirqs last disabled at (24259): [<c022a918>] irq_exit+0x108/0x16c
      [    5.939985]
      [    5.939985] other info that might help us debug this:
      [    5.946576]  Possible unsafe locking scenario:
      [    5.946576]
      [    5.952556]        CPU0
      [    5.955031]        ----
      [    5.957506]   lock(&syncp->seq#2);
      [    5.960955]   <Interrupt>
      [    5.963604]     lock(&syncp->seq#2);
      [    5.967227]
      [    5.967227]  *** DEADLOCK ***
      [    5.967227]
      [    5.973222] 1 lock held by swapper/0/0:
      [    5.977092]  #0:  (&(&ring->lock)->rlock){..-...}, at: [<c0768a18>] bcm_sysport_tx_reclaim+0x20/0x54
      
      So just remove the u64_stats_update_begin()/end() pair in ndo_get_stats64()
      since it does not appear to be useful for anything. No inconsistency was
      observed with either ifconfig or ethtool: global TX counts equal the sum of
      per-queue TX counts on a 32-bit architecture.
      
      Fixes: 10377ba7 ("net: systemport: Support 64bit statistics")
      Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: vrf: avoid gcc-4.6 warning · ecf09117
      Arnd Bergmann authored
      When building an allmodconfig kernel with gcc-4.6, we get a rather
      odd warning:
      
      drivers/net/vrf.c: In function ‘vrf_ip6_input_dst’:
      drivers/net/vrf.c:964:3: error: initialized field with side-effects overwritten [-Werror]
      drivers/net/vrf.c:964:3: error: (near initialization for ‘fl6’) [-Werror]
      
      I have no idea what this warning is even trying to say, but it does
      seem like a false positive. Reordering the initialization to match
      the structure definition gets rid of the warning, and might also avoid
      whatever gcc thinks is wrong here.
      
      Fixes: 9ff74384 ("net: vrf: Handle ipv6 multicast and link-local addresses")
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • qed: remove unnecessary call to memset · 4739df62
      Himanshu Jha authored
      The call to memset to assign 0 immediately after allocating
      memory with kzalloc is unnecessary, as kzalloc returns memory
      already filled with 0.
      
      Semantic patch used to resolve this issue:
      
      @@
      expression e,e2; constant c;
      statement S;
      @@
      
        e = kzalloc(e2, c);
        if(e == NULL) S
      - memset(e, 0, e2);
      Signed-off-by: Himanshu Jha <himanshujha199640@gmail.com>
      Acked-by: Sudarsana Kalluru <sudarsana.kalluru@cavium.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
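The reasoning in the qed commit above has a direct userspace analogue, shown here as a minimal sketch: `calloc` already returns zero-filled memory, just as the commit says of kzalloc, so a following memset to zero does no additional work. `struct ctx` and `ctx_alloc()` are hypothetical names used only for this illustration.

```c
#include <stdlib.h>

/* Hypothetical structure standing in for a driver context. */
struct ctx {
    int  id;
    char name[16];
};

/* Allocate a zero-initialized context. */
static struct ctx *ctx_alloc(void)
{
    /* calloc zero-fills the allocation, like kzalloc in the kernel. */
    struct ctx *c = calloc(1, sizeof(*c));

    if (c == NULL)
        return NULL;

    /* A memset(c, 0, sizeof(*c)) here would be redundant. */
    return c;
}
```

The caller can rely on every field being zero right after allocation and must release the object with `free()`.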
  2. 14 Sep, 2017 2 commits
  3. 13 Sep, 2017 21 commits
    • sctp: potential read out of bounds in sctp_ulpevent_type_enabled() · fa5f7b51
      Dan Carpenter authored
      This code causes a static checker warning because Smatch doesn't trust
      anything that comes from skb->data.  I've reviewed this code and I do
      think skb->data can be controlled by the user here.
      
      The sctp_event_subscribe struct has 13 __u8 fields and we want to see
      if ours is non-zero.  sn_type can be any value in the 0-USHRT_MAX range.
      We're subtracting SCTP_SN_TYPE_BASE which is 1 << 15 so we could read
      either before the start of the struct or after the end.
      
      This is a very old bug and it's surprising that it would go undetected
      for so long but my theory is that it just doesn't have a big impact so
      it would be hard to notice.
      Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
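The out-of-bounds read described in the sctp_ulpevent_type_enabled() commit above comes from indexing a 13-byte structure with `sn_type - SCTP_SN_TYPE_BASE`, where sn_type can be anything from 0 to USHRT_MAX. A minimal sketch of a range-checked lookup follows. It is not the kernel's fix. The names `SN_TYPE_BASE`, `NUM_EVENT_FLAGS`, and `event_type_enabled()` are hypothetical, and the flag array simply models the 13 __u8 fields mentioned in the commit.

```c
#include <stdint.h>

#define SN_TYPE_BASE    (1 << 15)  /* value stated in the commit message */
#define NUM_EVENT_FLAGS 13         /* number of __u8 fields in the struct */

/* Return 1 if the flag for sn_type is set, 0 otherwise. Values that
 * would index before the start or past the end count as "not enabled". */
static int event_type_enabled(uint16_t sn_type,
                              const uint8_t flags[NUM_EVENT_FLAGS])
{
    unsigned int idx;

    if (sn_type < SN_TYPE_BASE)
        return 0;               /* would read before the struct */

    idx = (unsigned int)sn_type - SN_TYPE_BASE;
    if (idx >= NUM_EVENT_FLAGS)
        return 0;               /* would read after the struct */

    return flags[idx] != 0;
}
```

Treating an out-of-range type as disabled is a design choice made for this sketch. Returning an error instead would work equally well as long as the array is never indexed with an unchecked value.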
    • MAINTAINERS: review Renesas DT bindings as well · 6fa9c623
      Sergei Shtylyov authored
      When adding myself as a reviewer for the Renesas Ethernet drivers
      I somehow forgot about the bindings -- I want to review them as well.
      
      Fixes: 8e6569af ("MAINTAINERS: add myself as Renesas Ethernet drivers reviewer")
      Signed-off-by: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net_sched: gen_estimator: fix scaling error in bytes/packets samples · ca558e18
      Eric Dumazet authored
      Denys reported wrong rate estimations with HTB classes.
      
      It appears the bug was added in linux-4.10, since my tests
      were using intervals of one second only.
      
      HTB using 4 sec default rate estimators, reported rates
      were 4x higher.
      
      We need to properly scale the bytes/packets samples before
      integrating them in EWMA.
      
      Tested:
       echo 1 >/sys/module/sch_htb/parameters/htb_rate_est
      
       Set up HTB with one class with a rate/ceil of 5Gbit
      
       Generate traffic on this class
      
       tc -s -d cl sh dev eth0 classid 7002:11
      class htb 7002:11 parent 7002:1 prio 5 quantum 200000 rate 5Gbit ceil
      5Gbit linklayer ethernet burst 80000b/1 mpu 0b cburst 80000b/1 mpu 0b
      level 0 rate_handle 1
       Sent 1488215421648 bytes 982969243 pkt (dropped 0, overlimits 0
      requeues 0)
       rate 5Gbit 412814pps backlog 136260b 2p requeues 0
       TCP pkts/rtx 982969327/45 bytes 1488215557414/68130
       lended: 22732826 borrowed: 0 giants: 0
       tokens: -1684 ctokens: -1684
      
      Fixes: 1c0d32fd ("net_sched: gen_estimator: complete rewrite of rate estimators")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: Denys Fedoryshchenko <nuclearcat@nuclearcat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
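The 4x error reported in the gen_estimator commit above is what one would expect when a byte count accumulated over a 4-second interval is fed into the moving average as if it were a per-second sample. The sketch below illustrates the scaling step in floating point. It is an assumption for illustration only and does not reproduce the kernel's arithmetic. `rate_sample()` and `ewma_update()` are hypothetical helper names.

```c
/* Convert a byte delta observed over interval_sec into bytes per second. */
static double rate_sample(double delta_bytes, double interval_sec)
{
    return delta_bytes / interval_sec;
}

/* Standard exponentially weighted moving average update:
 * move avg toward sample by the fraction weight (0 < weight <= 1). */
static double ewma_update(double avg, double sample, double weight)
{
    return avg + (sample - avg) * weight;
}
```

For example, 4000 bytes seen over a 4-second interval is a 1000 bytes/s sample. Passing the raw 4000 to `ewma_update()` instead would make the average converge to a rate four times too high, which matches the symptom described for HTB's 4 sec estimators.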
    • Merge branch 'nfp-card-init' · d371465e
      David S. Miller authored
      Jakub Kicinski says:
      
      ====================
      nfp: wait more carefully for card init
      
      The first patch is a small fix for flower offload, we need a whitelist
      of supported matches, otherwise the unsupported ones will be ignored.
      
      The second and the third patch are adding wait/polling to the probe path.
      We had reports of driver failing probe because it couldn't find the
      control process (NSP) on the card.  Turns out the NSP will only announce
      its existence after it's fully initialized.  Until now we assumed it
      will be reachable, just not processing commands (hence we wait for
      a NOOP command to execute successfully).
      
      v2:
       - fix a bad merge which resulted in a build warning and retest.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • nfp: wait for the NSP resource to appear on boot · 7dbd5b75
      Jakub Kicinski authored
      The control process (NSP) may take some time to complete its
      initialization.  This is not a problem on most servers, but
      on very fast-booting machines it may not be ready for operation
      when driver probes the device.  There is also a version of the
      flash in the wild where NSP tries to train the links as part
      of init.  To wait for NSP initialization we should make sure
      its resource has already been added to the resource table.
      NSP adds itself there as last step of init.
      Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • nfp: wait for board state before talking to the NSP · 4cbe94f2
      Jakub Kicinski authored
      Board state informs us which low-level initialization stages the card
      has completed.  We should wait for the card to be fully initialized
      before trying to communicate with it, not only before we configure
      passing traffic.
      Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • nfp: add whitelist of supported flow dissector · b95a2d83
      Pieter Jansen van Vuuren authored
      Previously we did not check the flow dissector against a list of allowed
      and supported flow key dissectors. This patch introduces such a list and
      correctly rejects unsupported flow keys.
      
      Fixes: 43f84b72 ("nfp: add metadata to each flow offload")
      Signed-off-by: Pieter Jansen van Vuuren <pieter.jansenvanvuuren@netronome.com>
      Reviewed-by: Simon Horman <simon.horman@netronome.com>
      Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
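The whitelist approach from the nfp flow dissector commit above amounts to a bitmask test: any requested key outside the supported set causes the whole request to be rejected rather than silently ignored. The following is a minimal sketch under that assumption. The key names, `SUPPORTED_KEYS`, and `check_flow_keys()` are hypothetical and do not correspond to the driver's actual identifiers.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical flow-key bits, one per dissector key. */
enum {
    KEY_ETH   = 1u << 0,
    KEY_IPV4  = 1u << 1,
    KEY_PORTS = 1u << 2,
    KEY_VLAN  = 1u << 3,   /* deliberately left out of the whitelist */
};

/* Whitelist of keys this illustrative offload path can handle. */
#define SUPPORTED_KEYS ((uint32_t)(KEY_ETH | KEY_IPV4 | KEY_PORTS))

/* Return 0 if every requested key is whitelisted, -EOPNOTSUPP otherwise. */
static int check_flow_keys(uint32_t used_keys)
{
    if (used_keys & ~SUPPORTED_KEYS)
        return -EOPNOTSUPP;
    return 0;
}
```

Checking against an allow-list instead of a deny-list means that any key added later is rejected by default until it is explicitly supported.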
    • net: sched: fix use-after-free in tcf_action_destroy and tcf_del_walker · 255cd50f
      Jiri Pirko authored
      Recent commit d7fb60b9 ("net_sched: get rid of tcfa_rcu") removed
      freeing in call_rcu, which changed an already existing hard-to-hit
      race condition into one that is hit 100% of the time:
      
      [  598.599825] BUG: unable to handle kernel NULL pointer dereference at 0000000000000030
      [  598.607782] IP: tcf_action_destroy+0xc0/0x140
      
      Or:
      
      [   40.858924] BUG: unable to handle kernel NULL pointer dereference at 0000000000000030
      [   40.862840] IP: tcf_generic_walker+0x534/0x820
      
      Fix this by storing the ops and using them directly for the module_put call.
      
      Fixes: a85a970a ("net_sched: move tc_action into tcf_common")
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • be2net: fix TSO6/GSO issue causing TX-stall on Lancer/BEx · 822f8565
      Suresh Reddy authored
      IPv6 TSO requests with extension hdrs are a problem for the
      Lancer and BEx chips. The workaround is to disable the TSO6 feature
      for such packets.

      Also, on Lancer chips an MSS of less than 256 resulted in a TX stall.
      Fix this by disabling GSO when the MSS is less than 256.
      Signed-off-by: Suresh Reddy <suresh.reddy@broadcom.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • w90p910_ether: include linux/interrupt.h · 854426ef
      Arnd Bergmann authored
      A randconfig build caused a compile failure:
      
      drivers/net/ethernet/nuvoton/w90p910_ether.c: In function 'w90p910_ether_close':
      drivers/net/ethernet/nuvoton/w90p910_ether.c:580:2: error: implicit declaration of function 'free_irq'; did you mean 'free_uid'? [-Werror=implicit-function-declaration]
      
      Adding the correct include fixes the problem.
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: bonding: fix tlb_dynamic_lb default value · f13ad104
      Nikolay Aleksandrov authored
      Commit 8b426dc5 ("bonding: remove hardcoded value") changed the
      default value for tlb_dynamic_lb which led to either broken ALB mode
      (since tlb_dynamic_lb can be changed only in TLB) or setting TLB mode
      with tlb_dynamic_lb equal to 0.
      The first issue was recently fixed by setting tlb_dynamic_lb to 1 always
      when switching to ALB mode, but the default value is still wrong and
      we'll enter TLB mode with tlb_dynamic_lb equal to 0 if the mode is
      changed via netlink or sysfs. In order to restore the previous behaviour
      and default value simply remove the mode check around the default param
      initialization for tlb_dynamic_lb which will always set it to 1 as
      before.
      
      Fixes: 8b426dc5 ("bonding: remove hardcoded value")
      Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Acked-by: Mahesh Bandewar <maheshb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ip6_tunnel: fix ip6 tunnel lookup in collect_md mode · 6c1cb439
      Haishuang Yan authored
      In collect_md mode, if the tun dev is down, it can still call
      __ip6_tnl_rcv to receive packets, and the rx statistics increase
      improperly.

      When the md tunnel is down, it's not necessary to increase RX drops
      for the tunnel device; packets would be received on the fallback tunnel,
      and the RX drops on the fallback device will be increased as expected.
      
      Fixes: 8d79266b ("ip6_tunnel: add collect_md mode to IPv6 tunnels")
      Cc: Alexei Starovoitov <ast@fb.com>
      Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ip_tunnel: fix ip tunnel lookup in collect_md mode · 833a8b40
      Haishuang Yan authored
      In collect_md mode, if the tun dev is down, it can still call
      ip_tunnel_rcv to receive packets, and the rx statistics increase
      improperly.

      When the md tunnel is down, it's not necessary to increase RX drops
      for the tunnel device; packets would be received on the fallback tunnel,
      and the RX drops on the fallback device will be increased as expected.
      
      Fixes: 2e15ea39 ("ip_gre: Add support to collect tunnel metadata.")
      Cc: Pravin B Shelar <pshelar@nicira.com>
      Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com>
      Acked-by: Pravin B Shelar <pshelar@ovn.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mlxsw: spectrum: Prevent mirred-related crash on removal · 6399ebcc
      Yuval Mintz authored
      When removing the offloading of mirred actions under
      matchall classifiers, mlxsw would find the destination port
      associated with the offloaded action and utilize it for undoing
      the configuration.
      
      Depending on the order by which ports are removed, it's possible that
      the destination port would get removed before the source port.
      In such a scenario, when actions would be flushed for the source port
      mlxsw would perform an illegal dereference as the destination port is
      no longer listed.
      
      Since the only item necessary for undoing the configuration on the
      destination side is the port-id and that in turn is already maintained
      by mlxsw on the source-port, simply stop trying to access the
      destination port and use the port-id directly instead.
      
      Fixes: 763b4b70 ("mlxsw: spectrum: Add support in matchall mirror TC offloading")
      Signed-off-by: Yuval Mintz <yuvalm@mellanox.com>
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'net_sched-fix-filter-chain-reference-counting' · 63428fb6
      David S. Miller authored
      Cong Wang says:
      
      ====================
      net_sched: fix filter chain reference counting
      
      This patchset fixes tc filter chain reference counting and nasty race
      conditions with RCU callbacks. Please see each patch for details.
      
      v3: Rebase on the latest -net
          Add code comment in patch 1
          Improve comment and changelog for patch 2
          Add patch 3
      
      v2: Add patch 1
          Get rid of more ugly code in patch 2
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net_sched: carefully handle tcf_block_put() · 1697c4bb
      Cong Wang authored
      As pointed out by Jiri, there is still a race condition between
      tcf_block_put() and tcf_chain_destroy() in a RCU callback. There
      is no way to make it correct without proper locking or synchronization,
      because both operate on a shared list.
      
      Locking is hard, because the only lock we can pick here is a spinlock,
      however, in tc_dump_tfilter() we iterate this list with a sleeping
      function called (tcf_chain_dump()), which makes using a lock to protect
      chain_list almost impossible.
      
      Jiri suggested the idea of holding a refcnt before flushing. This works
      because it guarantees that there is no parallel tcf_chain_destroy()
      during the loop, so the race condition is gone. But we have to
      be very careful with proper synchronization with RCU callbacks.
      Suggested-by: Jiri Pirko <jiri@mellanox.com>
      Cc: Jamal Hadi Salim <jhs@mojatatu.com>
      Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
      Acked-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net_sched: fix reference counting of tc filter chain · e2ef7544
      Cong Wang authored
      This patch fixes the following ugliness of tc filter chain refcnt:
      
      a) tp proto should hold a refcnt to the chain too. This significantly
         simplifies the logic.
      
      b) Chain 0 is no longer special, it is created with refcnt=1 like any
         other chains. All the ugliness in tcf_chain_put() can be gone!
      
      c) No need to handle the flushing oddly, because block still holds
         chain 0, it can not be released, this guarantees block is the last
         user.
      
      d) The race condition with RCU callbacks is easier to handle with just
         a rcu_barrier(). Much easier to understand, nothing to hide. Thanks
         to the previous patch. Please see also the comments in code.
      
      e) Make the code understandable by humans, much less error-prone.
      
      Fixes: 744a4cf6 ("net: sched: fix use after free when tcf_chain_destroy is called multiple times")
      Fixes: 5bc17018 ("net: sched: introduce multichain support for filters")
      Cc: Jiri Pirko <jiri@mellanox.com>
      Cc: Jamal Hadi Salim <jhs@mojatatu.com>
      Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
      Acked-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net_sched: get rid of tcfa_rcu · d7fb60b9
      Cong Wang authored
      The gen estimator was rewritten in commit 1c0d32fd
      ("net_sched: gen_estimator: complete rewrite of rate estimators"),
      so the caller no longer needs to wait for a grace period.
      This patch therefore gets rid of tcfa_rcu.
      
      This also completely closes a race condition between action free
      path and filter chain add/remove path for the following patch.
      Because otherwise the nested RCU callback can't be caught by
      rcu_barrier().
      
      Please see also the comments in code.
      
      Cc: Jiri Pirko <jiri@mellanox.com>
      Cc: Jamal Hadi Salim <jhs@mojatatu.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp/dccp: remove reqsk_put() from inet_child_forget() · da8ab578
      Eric Dumazet authored
      Back in linux-4.4, I inadvertently put a call to reqsk_put() in
      inet_child_forget(), forgetting it could be called from two different
      points.
      
      In the case it is called from inet_csk_reqsk_queue_add(), we want to
      keep the reference on the request socket, since it is released later by
      the caller (tcp_v{4|6}_rcv())
      
      This bug never showed up because atomic_dec_and_test() was not signaling
      the underflow, and SLAB_DESTROY_BY_RCU semantics for request sockets
      prevented the request from being put in quarantine.
      
      Recent conversion of socket refcount from atomic_t to refcount_t finally
      exposed the bug.
      
      So move the reqsk_put() to inet_csk_listen_stop() to fix this.
      
      Thanks to Shankara Pailoor for using syzkaller and providing
      a nice set of .config and C repro.
      
      WARNING: CPU: 2 PID: 4277 at lib/refcount.c:186
      refcount_sub_and_test+0x167/0x1b0 lib/refcount.c:186
      Kernel panic - not syncing: panic_on_warn set ...
      
      CPU: 2 PID: 4277 Comm: syz-executor0 Not tainted 4.13.0-rc7 #3
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
      Ubuntu-1.8.2-1ubuntu1 04/01/2014
      Call Trace:
       <IRQ>
       __dump_stack lib/dump_stack.c:16 [inline]
       dump_stack+0xf7/0x1aa lib/dump_stack.c:52
       panic+0x1ae/0x3a7 kernel/panic.c:180
       __warn+0x1c4/0x1d9 kernel/panic.c:541
       report_bug+0x211/0x2d0 lib/bug.c:183
       fixup_bug+0x40/0x90 arch/x86/kernel/traps.c:190
       do_trap_no_signal arch/x86/kernel/traps.c:224 [inline]
       do_trap+0x260/0x390 arch/x86/kernel/traps.c:273
       do_error_trap+0x118/0x340 arch/x86/kernel/traps.c:310
       do_invalid_op+0x1b/0x20 arch/x86/kernel/traps.c:323
       invalid_op+0x18/0x20 arch/x86/entry/entry_64.S:846
      RIP: 0010:refcount_sub_and_test+0x167/0x1b0 lib/refcount.c:186
      RSP: 0018:ffff88006e006b60 EFLAGS: 00010286
      RAX: 0000000000000026 RBX: 0000000000000000 RCX: 0000000000000000
      RDX: 0000000000000026 RSI: 1ffff1000dc00d2c RDI: ffffed000dc00d60
      RBP: ffff88006e006bf0 R08: 0000000000000001 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000000 R12: 1ffff1000dc00d6d
      R13: 00000000ffffffff R14: 0000000000000001 R15: ffff88006ce9d340
       refcount_dec_and_test+0x1a/0x20 lib/refcount.c:211
       reqsk_put+0x71/0x2b0 include/net/request_sock.h:123
       tcp_v4_rcv+0x259e/0x2e20 net/ipv4/tcp_ipv4.c:1729
       ip_local_deliver_finish+0x2e2/0xba0 net/ipv4/ip_input.c:216
       NF_HOOK include/linux/netfilter.h:248 [inline]
       ip_local_deliver+0x1ce/0x6d0 net/ipv4/ip_input.c:257
       dst_input include/net/dst.h:477 [inline]
       ip_rcv_finish+0x8db/0x19c0 net/ipv4/ip_input.c:397
       NF_HOOK include/linux/netfilter.h:248 [inline]
       ip_rcv+0xc3f/0x17d0 net/ipv4/ip_input.c:488
       __netif_receive_skb_core+0x1fb7/0x31f0 net/core/dev.c:4298
       __netif_receive_skb+0x2c/0x1b0 net/core/dev.c:4336
       process_backlog+0x1c5/0x6d0 net/core/dev.c:5102
       napi_poll net/core/dev.c:5499 [inline]
       net_rx_action+0x6d3/0x14a0 net/core/dev.c:5565
       __do_softirq+0x2cb/0xb2d kernel/softirq.c:284
       do_softirq_own_stack+0x1c/0x30 arch/x86/entry/entry_64.S:898
       </IRQ>
       do_softirq.part.16+0x63/0x80 kernel/softirq.c:328
       do_softirq kernel/softirq.c:176 [inline]
       __local_bh_enable_ip+0x84/0x90 kernel/softirq.c:181
       local_bh_enable include/linux/bottom_half.h:31 [inline]
       rcu_read_unlock_bh include/linux/rcupdate.h:705 [inline]
       ip_finish_output2+0x8ad/0x1360 net/ipv4/ip_output.c:231
       ip_finish_output+0x74e/0xb80 net/ipv4/ip_output.c:317
       NF_HOOK_COND include/linux/netfilter.h:237 [inline]
       ip_output+0x1cc/0x850 net/ipv4/ip_output.c:405
       dst_output include/net/dst.h:471 [inline]
       ip_local_out+0x95/0x160 net/ipv4/ip_output.c:124
       ip_queue_xmit+0x8c6/0x1810 net/ipv4/ip_output.c:504
       tcp_transmit_skb+0x1963/0x3320 net/ipv4/tcp_output.c:1123
       tcp_send_ack.part.35+0x38c/0x620 net/ipv4/tcp_output.c:3575
       tcp_send_ack+0x49/0x60 net/ipv4/tcp_output.c:3545
       tcp_rcv_synsent_state_process net/ipv4/tcp_input.c:5795 [inline]
       tcp_rcv_state_process+0x4876/0x4b60 net/ipv4/tcp_input.c:5930
       tcp_v4_do_rcv+0x58a/0x820 net/ipv4/tcp_ipv4.c:1483
       sk_backlog_rcv include/net/sock.h:907 [inline]
       __release_sock+0x124/0x360 net/core/sock.c:2223
       release_sock+0xa4/0x2a0 net/core/sock.c:2715
       inet_wait_for_connect net/ipv4/af_inet.c:557 [inline]
       __inet_stream_connect+0x671/0xf00 net/ipv4/af_inet.c:643
       inet_stream_connect+0x58/0xa0 net/ipv4/af_inet.c:682
       SYSC_connect+0x204/0x470 net/socket.c:1628
       SyS_connect+0x24/0x30 net/socket.c:1609
       entry_SYSCALL_64_fastpath+0x18/0xad
      RIP: 0033:0x451e59
      RSP: 002b:00007f474843fc08 EFLAGS: 00000216 ORIG_RAX: 000000000000002a
      RAX: ffffffffffffffda RBX: 0000000000718000 RCX: 0000000000451e59
      RDX: 0000000000000010 RSI: 0000000020002000 RDI: 0000000000000007
      RBP: 0000000000000046 R08: 0000000000000000 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000216 R12: 0000000000000000
      R13: 00007ffc040a0f8f R14: 00007f47484409c0 R15: 0000000000000000
      
      Fixes: ebb516af ("tcp/dccp: fix race at listener dismantle phase")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: Shankara Pailoor <sp3485@columbia.edu>
      Tested-by: Shankara Pailoor <sp3485@columbia.edu>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      da8ab578
    • Christophe JAILLET's avatar
      openvswitch: Fix an error handling path in 'ovs_nla_init_match_and_action()' · 5829e62a
      Christophe JAILLET authored
      All other error handling paths in this function go through the 'error'
      label. This one should do the same.
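
      A minimal sketch of the single-exit pattern this message refers to.
      The function name toy_init_match(), the fail_step parameter and the
      allocation counter are hypothetical and exist only to make the
      cleanup observable; this is not the openvswitch code.

```c
#include <assert.h>
#include <stdlib.h>

static int toy_live_allocs;     /* counts buffers not yet freed */

static void *toy_alloc(size_t n)
{
    void *p = malloc(n);

    if (p)
        toy_live_allocs++;
    return p;
}

static void toy_free(void *p)
{
    if (p) {
        toy_live_allocs--;
        free(p);
    }
}

/* fail_step selects which (simulated) step fails; 0 means success. */
static int toy_init_match(int fail_step)
{
    int err = 0;
    char *scratch = toy_alloc(64);

    if (!scratch)
        return -1;

    if (fail_step == 1) {
        err = -22;
        goto error;
    }
    if (fail_step == 2) {
        /* A bare "return -22;" here would leak scratch. */
        err = -22;
        goto error;
    }

error:
    /* In this toy the scratch buffer is released on every exit. */
    toy_free(scratch);
    return err;
}
```

      Routing every failure through one label keeps the cleanup in a
      single place, so a newly added step cannot skip it by accident.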
      
      Fixes: 9cc9a5cb ("datapath: Avoid using stack larger than 1024.")
      Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
      Acked-by: Pravin B Shelar <pshelar@ovn.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      5829e62a
    • Nisar Sayed's avatar
      smsc95xx: Configure pause time to 0xffff when tx flow control enabled · 9c082731
      Nisar Sayed authored
      Configure pause time to 0xffff when tx flow control enabled
      
      Set pause time to 0xffff in the pause frame to tell the link
      partner to stop sending packets. When the RX buffer frees up,
      the device sends a pause frame with pause time zero so that the
      partner resumes transmission.
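
      For readers unfamiliar with the frame layout, here is a sketch of an
      IEEE 802.3x PAUSE frame with a configurable pause time. The helper
      name toy_build_pause_frame() is hypothetical. On this hardware the
      device generates the frame itself and the driver only programs the
      pause time, so the code is purely illustrative.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_PAUSE_FRAME_LEN 60  /* minimum Ethernet frame size, FCS excluded */

/* Fill buf (at least TOY_PAUSE_FRAME_LEN bytes) with a PAUSE frame. */
static size_t toy_build_pause_frame(uint8_t *buf, const uint8_t src[6],
                                    uint16_t pause_time)
{
    /* Reserved multicast address used for MAC Control frames. */
    static const uint8_t dst[6] = { 0x01, 0x80, 0xC2, 0x00, 0x00, 0x01 };

    memset(buf, 0, TOY_PAUSE_FRAME_LEN);
    memcpy(buf, dst, 6);
    memcpy(buf + 6, src, 6);
    buf[12] = 0x88;                         /* EtherType 0x8808: MAC Control */
    buf[13] = 0x08;
    buf[14] = 0x00;                         /* opcode 0x0001: PAUSE */
    buf[15] = 0x01;
    buf[16] = (uint8_t)(pause_time >> 8);   /* pause time, big endian */
    buf[17] = (uint8_t)(pause_time & 0xff);
    return TOY_PAUSE_FRAME_LEN;
}
```

      A pause time of 0xffff asks the partner to stop for the longest
      expressible interval; a later frame with pause time 0 cancels it.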
      
      Fixes: 2f7ca802 ("Add SMSC LAN9500 USB2.0 10/100 ethernet adapter driver")
      Signed-off-by: Nisar Sayed <Nisar.Sayed@microchip.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      9c082731
  4. 11 Sep, 2017 7 commits
    • Josh Hunt's avatar
      net/sched: fix pointer check in gen_handle · 230cfd2d
      Josh Hunt authored
      Fixes sparse warning about pointer in gen_handle:
      net/sched/cls_rsvp.h:392:40: warning: Using plain integer as NULL pointer
      
      Fixes: 8113c095 ("net_sched: use void pointer for filter handle")
      Signed-off-by: Josh Hunt <johunt@akamai.com>
      Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      230cfd2d
    • David Lebrun's avatar
      ipv6: sr: remove duplicate routing header type check · 33e34e73
      David Lebrun authored
      As seg6_validate_srh() already checks that the Routing Header type is
      correct, it is not necessary to do it again in get_srh().
      
      Fixes: 5829d70b ("ipv6: sr: fix get_srh() to comply with IPv6 standard "RFC 8200"")
      Signed-off-by: David Lebrun <dlebrun@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      33e34e73
    • Jesper Dangaard Brouer's avatar
      xdp: implement xdp_redirect_map for generic XDP · 96c5508e
      Jesper Dangaard Brouer authored
      Using bpf_redirect_map is allowed for generic XDP programs, but the
      appropriate map lookup was never performed in xdp_do_generic_redirect().
      
      Instead the map-index is directly used as the ifindex.  For the
      xdp_redirect_map sample in SKB-mode '-S', this resulted in trying
      to send on ifindex 0, which isn't valid, so the SKB packets were
      dropped.  Thus, the reported performance numbers are wrong in
      commit 24251c26 ("samples/bpf: add option for native and skb mode
      for redirect apps") for the 'xdp_redirect_map -S' case.
      
      Before commit 109980b8 ("bpf: don't select potentially stale
      ri->map from buggy xdp progs") it could crash the kernel.  Like that
      commit, also check that the map_owner is correct before
      dereferencing the map pointer.  But make sure that this API misuse
      can be caught by a tracepoint, thus allowing userspace to detect
      misbehaving bpf_progs via tracepoints.
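
      The difference between using the map index directly and looking it
      up first can be sketched as follows. The toy_devmap structure and
      both functions are hypothetical stand-ins, not the kernel's devmap
      or redirect code; slot 0 of the toy map deliberately holds a
      non-zero ifindex to show the mismatch.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAP_SIZE 4

/* Hypothetical device map: slot index -> ifindex (0 marks an empty slot). */
struct toy_devmap {
    int ifindex[TOY_MAP_SIZE];
};

/* Buggy shape: the map index itself is treated as the ifindex. */
static int toy_redirect_no_lookup(const struct toy_devmap *map,
                                  unsigned int index)
{
    (void)map;
    return (int)index;
}

/* Fixed shape: resolve the slot first and reject invalid requests. */
static int toy_redirect_lookup(const struct toy_devmap *map,
                               unsigned int index)
{
    if (map == NULL || index >= TOY_MAP_SIZE || map->ifindex[index] == 0)
        return -1;      /* caller drops the packet and can report it */
    return map->ifindex[index];
}
```

      With ifindex 7 stored in slot 0, the first function yields the
      invalid ifindex 0, while the second yields 7 and returns -1 for
      empty or out-of-range slots.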
      
      Fixes: 6103aa96 ("net: implement XDP_REDIRECT for xdp generic")
      Fixes: 24251c26 ("samples/bpf: add option for native and skb mode for redirect apps")
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      96c5508e
    • Yonghong Song's avatar
      perf/bpf: fix a clang compilation issue · 609320c8
      Yonghong Song authored
      clang does not support a variable-length array as a structure member.
      It reports the following error during compilation:
      
      kernel/trace/trace_syscalls.c:568:17: error: fields must have a constant size:
      'variable length array in structure' extension will never be supported
                      unsigned long args[sys_data->nb_args];
                                    ^
      
      The fix is to use a fixed array length instead.
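
      A minimal sketch of the fixed-length approach. The bound
      TOY_MAX_ARGS, the structure and the helper are hypothetical, and
      the value 6 is only an assumed upper limit for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_MAX_ARGS 6  /* assumed upper bound on the argument count */

struct toy_syscall_record {
    int nr;
    /* Constant size, so both GCC and clang accept this member. */
    unsigned long args[TOY_MAX_ARGS];
};

/* Copy at most TOY_MAX_ARGS arguments; returns how many were stored. */
static size_t toy_fill_record(struct toy_syscall_record *rec, int nr,
                              const unsigned long *args, size_t nb_args)
{
    if (nb_args > TOY_MAX_ARGS)
        nb_args = TOY_MAX_ARGS;

    memset(rec, 0, sizeof(*rec));
    rec->nr = nr;
    memcpy(rec->args, args, nb_args * sizeof(args[0]));
    return nb_args;
}
```

      The trade-off is that the structure always reserves room for the
      maximum number of arguments, in exchange for a layout that is
      known at compile time.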
      Reported-by: Nick Desaulniers <ndesaulniers@google.com>
      Signed-off-by: Yonghong Song <yhs@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      609320c8
    • Kosuke Tatsukawa's avatar
      net: bonding: Fix transmit load balancing in balance-alb mode if specified by sysfs · c6644d07
      Kosuke Tatsukawa authored
      Commit cbf5ecb3 ("net: bonding: Fix transmit load balancing in
      balance-alb mode") tried to fix transmit dynamic load balancing in
      balance-alb mode, which wasn't working after commit 8b426dc5
      ("bonding: remove hardcoded value").
      
      It turned out that my previous patch only fixed the case when
      balance-alb was specified as bonding module parameter, and not when
      balance-alb mode was set using /sys/class/net/*/bonding/mode (the most
      common usage).  In the latter case, tlb_dynamic_lb was set up according
      to the default mode of the bonding interface, which happens to be
      balance-rr.
      
      This additional patch addresses this issue by setting up tlb_dynamic_lb
      to 1 if "mode" is set to balance-alb through the sysfs interface.
      
      I didn't add code to change tlb_dynamic_lb back to the default value for
      other modes, because "mode" is usually set up only once during
      initialization, and it's not worthwhile to change the static variable
      bonding_defaults in bond_main.c to a global variable just for this
      purpose.
      
      Commit 8b426dc5 also changes the value of tlb_dynamic_lb for
      balance-tlb mode if it is set up using the sysfs interface.  I didn't
      change that behavior, because the value of tlb_dynamic_lb can be changed
      using the sysfs interface for balance-tlb, and I didn't like changing
      the default value back and forth for balance-tlb.
      
      As for balance-alb, /sys/class/net/*/bonding/tlb_dynamic_lb cannot be
      written to.  However, I think balance-alb with tlb_dynamic_lb set to 0
      is not an intended usage, so there is little use making it writable at
      this moment.
      
      Fixes: 8b426dc5 ("bonding: remove hardcoded value")
      Reported-by: Reinis Rozitis <r@roze.lv>
      Signed-off-by: Kosuke Tatsukawa <tatsu@ab.jp.nec.com>
      Cc: stable@vger.kernel.org  # v4.12+
      Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Acked-by: Mahesh Bandewar <maheshb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c6644d07
    • Stephen Hemminger's avatar
      hv_netvsc: avoid unnecessary wakeups on subchannel creation · 8f2bb1de
      Stephen Hemminger authored
      The initiator only needs to be woken up after all sub-channels
      are opened.
      Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      8f2bb1de
    • Stephen Hemminger's avatar
      hv_netvsc: fix deadlock on hotplug · 8195b139
      Stephen Hemminger authored
      When a virtual device is added dynamically (via host console), then
      the vmbus sends an offer message for the primary channel. The processing
      of this message for networking causes the network device to then
      initialize the sub channels.
      
      The problem is that setting up the sub channels needs to wait until
      the subsequent subchannel offers have been processed. These offers
      come in on the same ring buffer and work queue where the primary
      offer is being processed, leading to a deadlock.
      
      This did not happen in older kernels, because the sub channel waiting
      logic was broken (it wasn't really waiting).
      
      The solution is to do the sub channel setup in its own work queue
      context, which is scheduled by the primary channel setup and
      therefore runs later.
      
      Fixes: 732e4985 ("netvsc: fix race on sub channel creation")
      Reported-by: Dexuan Cui <decui@microsoft.com>
      Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      8195b139