1. 29 Apr, 2016 22 commits
  2. 28 Apr, 2016 18 commits
    • ipvlan: Fix failure path in dev registration during link creation · 494e8489
      Mahesh Bandewar authored
      When newlink creation fails at device-registration, the port->count
      is decremented twice. Francesco Ruggeri (fruggeri@arista.com) found
      this issue in Macvlan and the same exists in IPvlan driver too.
      
      While fixing this issue I noticed another issue of missing unregister
      in case of failure, so adding it to the fix which is similar to the
      macvlan fix by Francesco in commit 30837960 ("macvlan: fix failure
      during registration v3")
      Reported-by: Francesco Ruggeri <fruggeri@arista.com>
      Signed-off-by: Mahesh Bandewar <maheshb@google.com>
      CC: Eric Dumazet <edumazet@google.com>
      CC: Eric W. Biederman <ebiederm@xmission.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
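The failure path described in this commit can be illustrated with a small user-space model of goto-based unwinding, where every step that succeeded is undone exactly once. This is only a sketch: `link_model`, `register_dev`, `link_upper` and `newlink_model` are hypothetical names, and the real driver's steps and ordering may differ.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the state touched during link creation. */
struct link_model {
    int count;        /* models port->count */
    bool registered;  /* models a live device registration */
};

/* Simulated registration step; fail_at == 1 makes it fail. */
static int register_dev(struct link_model *l, int fail_at)
{
    if (fail_at == 1)
        return -1;
    l->registered = true;
    return 0;
}

/* Simulated later step; fail_at == 2 makes it fail. */
static int link_upper(int fail_at)
{
    return fail_at == 2 ? -1 : 0;
}

/* Each label undoes one step, so nothing is undone twice or skipped. */
int newlink_model(struct link_model *l, int fail_at)
{
    int err;

    l->count++;
    err = register_dev(l, fail_at);
    if (err)
        goto drop_count;
    err = link_upper(fail_at);
    if (err)
        goto unregister;
    return 0;

unregister:
    l->registered = false;  /* the "missing unregister" case */
drop_count:
    l->count--;             /* decremented once, never twice */
    return err;
}
```

With this shape, a failure at either step leaves `count` and `registered` exactly as they were before the call.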
    • pch_gbe: replace private tx ring lock with common netif_tx_lock · 222e4d0b
      françois romieu authored
      pch_gbe_tx_ring.tx_lock is only used in the hard_xmit handler and
      in the transmit completion reaper called from NAPI context.
      
      Compile-tested only. Potential victims Cced.
      
      Someone more knowledgeable may check if pch_gbe_tx_queue could
      have some use for a mmiowb.
      Signed-off-by: Francois Romieu <romieu@fr.zoreil.com>
      Cc: Darren Hart <dvhart@infradead.org>
      Cc: Andy Cress <andy.cress@us.kontron.com>
      Cc: bryan@fossetcon.org
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dsa: Provide CPU port statistics to master netdev · badf3ada
      Florian Fainelli authored
      This patch overloads the DSA master netdev, aka CPU Ethernet MAC to also
      include switch-side statistics, which is useful for debugging purposes,
      when the switch is not properly connected to the Ethernet MAC (duplex
      mismatch, (RG)MII electrical issues etc.).
      
      We accomplish this by retaining the original copy of the master netdev's
      ethtool_ops, and just overload the 3 operations we care about:
      get_sset_count, get_strings and get_ethtool_stats so as to intercept
      these calls and call into the original master_netdev ethtool_ops, plus
      our own.
      
      We take this approach as opposed to providing a set of DSA helper
      functions that would retrieve the CPU port's statistics, because the
      entire purpose of DSA is to allow unmodified Ethernet MAC drivers to be
      used as CPU conduit interfaces; a statistics overlay in each such
      driver would simply not scale.
      
      The new ethtool -S <iface> output would therefore look like this now:
      <iface> statistics
      p<2 digits cpu port number>_<switch MIB counter names>
      Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
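The overlay idea in this commit (keep the original ops, intercept the calls, append switch counters) can be modelled in user space as below. This is a sketch under assumptions: `stats_ops` is a simplified, hypothetical stand-in for the relevant `ethtool_ops` callbacks, the counter names and CPU port number are invented, and only the `p<2 digits>_<name>` naming comes from the commit text.

```c
#include <stdio.h>
#include <string.h>

#define NAME_LEN 32  /* plays the role of ETH_GSTRING_LEN */

/* Hypothetical, simplified stand-in for two of the ethtool callbacks. */
struct stats_ops {
    int  (*get_sset_count)(void);
    void (*get_strings)(char names[][NAME_LEN]);
};

/* Pretend master-netdev driver with two counters of its own. */
static int mac_count(void) { return 2; }
static void mac_strings(char names[][NAME_LEN])
{
    strcpy(names[0], "rx_packets");
    strcpy(names[1], "tx_packets");
}

/* Invented switch-side MIB counter names. */
static const char *const switch_mib[] = { "in_octets", "out_octets" };
#define SWITCH_MIB_COUNT 2

static struct stats_ops orig_ops = { mac_count, mac_strings }; /* retained copy */
static int cpu_port = 5;                                       /* assumed value */

/* Overlay: the driver's own count plus the switch counters. */
static int overlay_count(void)
{
    return orig_ops.get_sset_count() + SWITCH_MIB_COUNT;
}

/* Overlay: the driver's names first, then prefixed switch names. */
static void overlay_strings(char names[][NAME_LEN])
{
    int base = orig_ops.get_sset_count();

    orig_ops.get_strings(names);
    for (int i = 0; i < SWITCH_MIB_COUNT; i++)
        snprintf(names[base + i], NAME_LEN, "p%02d_%s",
                 cpu_port, switch_mib[i]);
}

struct stats_ops overlay_ops = { overlay_count, overlay_strings };
```

The MAC driver itself is untouched here; only the ops table seen by callers changes, which is the property the commit message argues for.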
    • tcp: give prequeue mode some care · 0cef6a4c
      Eric Dumazet authored
      TCP prequeue goal is to defer processing of incoming packets
      to user space thread currently blocked in a recvmsg() system call.
      
      Intent is to spend less time processing these packets on behalf
      of softirq handler, as softirq handler is unfair to normal process
      scheduler decisions, as it might interrupt threads that do not
      even use networking.
      
      Current prequeue implementation has following issues :
      
      1) It only checks size of the prequeue against sk_rcvbuf
      
         It was fine 15 years ago when sk_rcvbuf was in the 64KB vicinity.
         But we now have ~8MB values to cope with modern networking needs.
         We have to add sk_rmem_alloc in the equation, since out of order
         packets can definitely use up to sk_rcvbuf memory themselves.
      
      2) Even with a fixed memory truesize check, prequeue can be filled
         by thousands of packets. When prequeue needs to be flushed, either
         from softirq context (in tcp_prequeue() or timer code), or process
         context (in tcp_prequeue_process()), this adds a latency spike
         which is often not desirable.
         I added a fixed limit of 32 packets, as this translated to a max
         flush time of 60 us on my test hosts.
      
         Also note that all packets in prequeue are not accounted for tcp_mem,
         since they are not charged against sk_forward_alloc at this point.
         This is probably not a big deal.
      
      Note that this might increase LINUX_MIB_TCPPREQUEUEDROPPED counts,
      which is misnamed, as packets are not dropped at all, but rather pushed
      to the stack (where they can be either consumed or dropped)
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
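The two limits described in this commit (memory checked against sk_rcvbuf with sk_rmem_alloc included, plus a fixed cap of 32 packets) can be modelled roughly as follows. This is a user-space sketch: `sock_model` and `prequeue_try_add` are hypothetical, and the exact comparison and the point at which the kernel performs the check may differ from what is shown.

```c
#include <stdbool.h>
#include <stddef.h>

#define PREQUEUE_MAX_PKTS 32  /* fixed packet limit from the commit text */

/* Hypothetical model of the fields involved in the decision. */
struct sock_model {
    size_t rcvbuf;        /* models sk_rcvbuf */
    size_t rmem_alloc;    /* models sk_rmem_alloc (e.g. out-of-order data) */
    size_t prequeue_mem;  /* truesize sum of prequeued packets */
    unsigned prequeue_len;
};

/*
 * Returns true if the packet was prequeued. A false return models the
 * case where the prequeue must be flushed to the regular stack instead.
 */
bool prequeue_try_add(struct sock_model *sk, size_t truesize)
{
    if (sk->prequeue_len >= PREQUEUE_MAX_PKTS)
        return false;
    /* Count memory already charged to the socket, not the prequeue alone. */
    if (sk->prequeue_mem + sk->rmem_alloc + truesize > sk->rcvbuf)
        return false;

    sk->prequeue_mem += truesize;
    sk->prequeue_len++;
    return true;
}
```

With a large receive buffer the packet cap is the limit that bites first, which matches the latency argument in point 2.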
    • fq: split out backlog update logic · b43e7199
      Michal Kazior authored
      mac80211 (which will be the first user of the
      fq.h) recently started to support software A-MSDU
      aggregation. It glues skbuffs together into a
      single one so the backlog accounting needs to be
      more fine-grained.
      
      To avoid backlog sorting logic duplication split
      it up for re-use.
      Signed-off-by: Michal Kazior <michal.kazior@tieto.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tipc: remove an unnecessary NULL check · b4358657
      Dan Carpenter authored
      This is never called with a NULL "buf" and anyway, we dereference 's' on
      the lines before so it would Oops before we reach the check.
      Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
      Acked-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/mlx5e: avoid stack overflow in mlx5e_open_channels · 6b87663f
      Arnd Bergmann authored
      struct mlx5e_channel_param is a large structure that is allocated
      on the stack of mlx5e_open_channels, and with a recent change
      it has grown beyond the warning size for the maximum stack
      that a single function should use:
      
      mellanox/mlx5/core/en_main.c: In function 'mlx5e_open_channels':
      mellanox/mlx5/core/en_main.c:1325:1: error: the frame size of 1072 bytes is larger than 1024 bytes [-Werror=frame-larger-than=]
      
      The function is already using dynamic allocation and is not in
      a fast path, so the easiest workaround is to use another kzalloc
      for allocating the channel parameters.
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Fixes: d3c9bc27 ("net/mlx5e: Added ICO SQs")
      Acked-by: Saeed Mahameed <saeedm@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
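The workaround in this commit, moving a large parameter structure off the stack and into a zeroed heap allocation, looks roughly like this in a user-space analogue. The struct size and all names are hypothetical; `calloc`/`free` stand in for the kernel's `kzalloc`/`kfree`.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical large parameter block; the size is illustrative only. */
struct channel_param_model {
    unsigned char blob[900];
};

/* Opens 'nch' pretend channels using one heap-allocated parameter block. */
int open_channels_model(int nch)
{
    struct channel_param_model *cparam;
    int opened = 0;

    /* Zeroed heap allocation instead of a large on-stack variable. */
    cparam = calloc(1, sizeof(*cparam));
    if (!cparam)
        return -ENOMEM;

    for (int i = 0; i < nch; i++) {
        cparam->blob[0] = (unsigned char)i;  /* pretend per-channel setup */
        opened++;
    }

    free(cparam);  /* parameters are only needed while opening */
    return opened;
}
```

The extra allocation is acceptable here for the reason the commit gives: the function is not in a fast path.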
    • tuntap: calculate rps hash only when needed · 3df97ba8
      Jason Wang authored
      There's no need to calculate the rps hash if rps is not enabled. So this
      patch exports rps_needed and checks it before trying to get the rps
      hash. Tests (using pktgen to inject packets into the guest) show this can
      improve pps by about 13% (when rps is disabled).
      
      Before:
      ~1150000 pps
      After:
      ~1300000 pps
      
      Cc: Michael S. Tsirkin <mst@redhat.com>
      Signed-off-by: Jason Wang <jasowang@redhat.com>
      ----
      Changes from V1:
      - Fix build when CONFIG_RPS is not set
      Signed-off-by: David S. Miller <davem@davemloft.net>
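The optimization in this commit amounts to testing a cheap flag before doing the hash work. A user-space sketch of that pattern follows; `rps_needed_model`, `flow_hash` and `rx_hash_if_needed` are hypothetical names, and the FNV-1a hash is only a placeholder for the kernel's flow hashing. The quoted figures are consistent with the stated gain: 1300000 / 1150000 is about 1.13.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

bool rps_needed_model;     /* hypothetical stand-in for rps_needed */
unsigned long hash_calls;  /* counts how often the hash is computed */

/* Placeholder per-packet hash (FNV-1a), not the kernel's flow hash. */
static uint32_t flow_hash(const uint8_t *pkt, size_t len)
{
    uint32_t h = 2166136261u;

    hash_calls++;
    for (size_t i = 0; i < len; i++) {
        h ^= pkt[i];
        h *= 16777619u;
    }
    return h;
}

/* Compute the hash only when RPS would actually use it. */
uint32_t rx_hash_if_needed(const uint8_t *pkt, size_t len)
{
    if (!rps_needed_model)
        return 0;
    return flow_hash(pkt, len);
}
```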
    • Merge branch 'tcp-eor' · f345c9a5
      David S. Miller authored
      Martin KaFai Lau says:
      
      ====================
      tcp: Make use of MSG_EOR in tcp_sendmsg
      
      v4:
      ~ Do not set eor bit in do_tcp_sendpages() since there is
        no way to pass MSG_EOR from the userland now.
      ~ Avoid rmw by testing MSG_EOR first in tcp_sendmsg().
      ~ Move TCP_SKB_CB(skb)->eor test to a new helper
        tcp_skb_can_collapse_to() (suggested by Soheil).
      ~ Add some packetdrill tests.
      
      v3:
      ~ Separate EOR marking from the SKBTX_ANY_TSTAMP logic.
      ~ Move the eor bit test back to the loop in tcp_sendmsg and
        tcp_sendpage because there could be >1 threads doing
        sendmsg.
      ~ Thanks to Eric Dumazet's suggestions on v2.
      ~ The TCP timestamp bug fixes are separated into other threads.
      
      v2:
      ~ Rework based on the recent work
        "add TX timestamping via cmsg" by
        Soheil Hassas Yeganeh <soheil.kdev@gmail.com>
      ~ This version takes the MSG_EOR bit as a signal of
        end-of-response-message and leave the selective
        timestamping job to the cmsg
      ~ Changes based on the v1 feedback (like avoid
        unlikely check in a loop and adding tcp_sendpage
        support)
      ~ The first 3 patches are bug fixes.  The fixes in this
        series depend on the newly introduced txstamp_ack in
        net-next.  I will make relevant patches against net after
        getting some feedback.
      ~ The test results are based on the recently posted net fix:
        "tcp: Fix SOF_TIMESTAMPING_TX_ACK when handling dup acks"
      
      One potential use case is to use MSG_EOR with
      SOF_TIMESTAMPING_TX_ACK to get a more accurate
      TCP ack timestamping on application protocol with
      multiple outgoing response messages (e.g. HTTP2).
      
      One of our use cases is at the webserver.  The webserver tracks
      the HTTP2 response latency by measuring when the webserver sends
      the first byte to the socket till the TCP ACK of the last byte
      is received.  In the cases where we don't have client side
      measurement, measuring from the server side is the only option.
      In the cases we have the client side measurement, the server side
      data can also be used to justify/cross-check-with the client
      side data.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: Handle eor bit when fragmenting a skb · a166140e
      Martin KaFai Lau authored
      When fragmenting a skb, the next_skb should carry
      the eor from prev_skb.  The eor of prev_skb should
      also be reset.
      
      Packetdrill script for testing:
      ~~~~~~
      +0 `sysctl -q -w net.ipv4.tcp_min_tso_segs=10`
      +0 `sysctl -q -w net.ipv4.tcp_no_metrics_save=1`
      +0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
      +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
      +0 bind(3, ..., ...) = 0
      +0 listen(3, 1) = 0
      
      0.100 < S 0:0(0) win 32792 <mss 1460,sackOK,nop,nop,nop,wscale 7>
      0.100 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 7>
      0.200 < . 1:1(0) ack 1 win 257
      0.200 accept(3, ..., ...) = 4
      +0 setsockopt(4, SOL_TCP, TCP_NODELAY, [1], 4) = 0
      
      0.200 sendto(4, ..., 15330, MSG_EOR, ..., ...) = 15330
      0.200 sendto(4, ..., 730, 0, ..., ...) = 730
      
      0.200 > .  1:7301(7300) ack 1
      0.200 > . 7301:14601(7300) ack 1
      
      0.300 < . 1:1(0) ack 14601 win 257
      0.300 > P. 14601:15331(730) ack 1
      0.300 > P. 15331:16061(730) ack 1
      
      0.400 < . 1:1(0) ack 16061 win 257
      0.400 close(4) = 0
      0.400 > F. 16061:16061(0) ack 1
      0.400 < F. 1:1(0) ack 16062 win 257
      0.400 > . 16062:16062(0) ack 2
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Soheil Hassas Yeganeh <soheil@google.com>
      Cc: Willem de Bruijn <willemb@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
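The rule stated in this commit (the tail fragment inherits eor, the head fragment loses it) can be shown with a tiny model. `seg_model` and `fragment_model` are hypothetical; real skbs carry far more state than a length and a flag.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of a queued segment: payload length plus eor bit. */
struct seg_model {
    unsigned len;
    bool eor;
};

/* Split 'prev' after 'at' bytes; the remainder becomes 'next'. */
void fragment_model(struct seg_model *prev, struct seg_model *next,
                    unsigned at)
{
    assert(at <= prev->len);

    next->len = prev->len - at;
    prev->len = at;

    /* The message now ends in 'next', so the marker moves there. */
    next->eor = prev->eor;
    prev->eor = false;
}
```

If the marker stayed on the head fragment, later data could no longer be told apart from the tail of the marked message, which is the case the packetdrill script in this commit exercises.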
    • tcp: Handle eor bit when coalescing skb · a643b5d4
      Martin KaFai Lau authored
      This patch:
      1. Prevent next_skb from coalescing to the prev_skb if
         TCP_SKB_CB(prev_skb)->eor is set
      2. Update the TCP_SKB_CB(prev_skb)->eor if coalescing is
         allowed
      
      Packetdrill script for testing:
      ~~~~~~
      +0 `sysctl -q -w net.ipv4.tcp_min_tso_segs=10`
      +0 `sysctl -q -w net.ipv4.tcp_no_metrics_save=1`
      +0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
      +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
      +0 bind(3, ..., ...) = 0
      +0 listen(3, 1) = 0
      
      0.100 < S 0:0(0) win 32792 <mss 1460,sackOK,nop,nop,nop,wscale 7>
      0.100 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 7>
      0.200 < . 1:1(0) ack 1 win 257
      0.200 accept(3, ..., ...) = 4
      +0 setsockopt(4, SOL_TCP, TCP_NODELAY, [1], 4) = 0
      
      0.200 sendto(4, ..., 730, MSG_EOR, ..., ...) = 730
      0.200 sendto(4, ..., 730, MSG_EOR, ..., ...) = 730
      0.200 write(4, ..., 11680) = 11680
      
      0.200 > P. 1:731(730) ack 1
      0.200 > P. 731:1461(730) ack 1
      0.200 > . 1461:8761(7300) ack 1
      0.200 > P. 8761:13141(4380) ack 1
      
      0.300 < . 1:1(0) ack 1 win 257 <sack 1461:13141,nop,nop>
      0.300 > P. 1:731(730) ack 1
      0.300 > P. 731:1461(730) ack 1
      0.400 < . 1:1(0) ack 13141 win 257
      
      0.400 close(4) = 0
      0.400 > F. 13141:13141(0) ack 1
      0.500 < F. 1:1(0) ack 13142 win 257
      0.500 > . 13142:13142(0) ack 2
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Soheil Hassas Yeganeh <soheil@google.com>
      Cc: Willem de Bruijn <willemb@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
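The two points of this commit can be modelled as a guard plus a flag update. This is a sketch only: `seg_model`, `can_collapse_to` and `coalesce_model` are hypothetical, and the assumption that the merged segment takes the eor value of the absorbed one is an interpretation of point 2, not something the commit text spells out.

```c
#include <stdbool.h>

/* Hypothetical model of a queued segment: payload length plus eor bit. */
struct seg_model {
    unsigned len;
    bool eor;
};

/* A segment that ends a message must not receive more data. */
bool can_collapse_to(const struct seg_model *to)
{
    return !to->eor;
}

/* Try to merge 'next' into 'prev'; returns false if refused. */
bool coalesce_model(struct seg_model *prev, const struct seg_model *next)
{
    if (!can_collapse_to(prev))
        return false;

    prev->len += next->len;
    prev->eor = next->eor;  /* assumed: merged segment ends where next ended */
    return true;
}
```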
    • tcp: Make use of MSG_EOR in tcp_sendmsg · c134ecb8
      Martin KaFai Lau authored
      This patch adds an eor bit to the TCP_SKB_CB.  When MSG_EOR
      is passed to tcp_sendmsg, the eor bit will be set at the skb
      containing the last byte of the userland's msg.  The eor bit
      will prevent data from appending to that skb in the future.
      
      The change in do_tcp_sendpages is to honor the eor set
      during the previous tcp_sendmsg(MSG_EOR) call.
      
      This patch handles the tcp_sendmsg case.  The followup patches
      will handle other skb coalescing and fragment cases.
      
      One potential use case is to use MSG_EOR with
      SOF_TIMESTAMPING_TX_ACK to get a more accurate
      TCP ack timestamping on application protocol with
      multiple outgoing response messages (e.g. HTTP2).
      
      Packetdrill script for testing:
      ~~~~~~
      +0 `sysctl -q -w net.ipv4.tcp_min_tso_segs=10`
      +0 `sysctl -q -w net.ipv4.tcp_no_metrics_save=1`
      +0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
      +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
      +0 bind(3, ..., ...) = 0
      +0 listen(3, 1) = 0
      
      0.100 < S 0:0(0) win 32792 <mss 1460,sackOK,nop,nop,nop,wscale 7>
      0.100 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 7>
      0.200 < . 1:1(0) ack 1 win 257
      0.200 accept(3, ..., ...) = 4
      +0 setsockopt(4, SOL_TCP, TCP_NODELAY, [1], 4) = 0
      
      0.200 write(4, ..., 14600) = 14600
      0.200 sendto(4, ..., 730, MSG_EOR, ..., ...) = 730
      0.200 sendto(4, ..., 730, MSG_EOR, ..., ...) = 730
      
      0.200 > .  1:7301(7300) ack 1
      0.200 > P. 7301:14601(7300) ack 1
      
      0.300 < . 1:1(0) ack 14601 win 257
      0.300 > P. 14601:15331(730) ack 1
      0.300 > P. 15331:16061(730) ack 1
      
      0.400 < . 1:1(0) ack 16061 win 257
      0.400 close(4) = 0
      0.400 > F. 16061:16061(0) ack 1
      0.400 < F. 1:1(0) ack 16062 win 257
      0.400 > . 16062:16062(0) ack 2
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Soheil Hassas Yeganeh <soheil@google.com>
      Cc: Willem de Bruijn <willemb@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Suggested-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
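The sendmsg behaviour described in this commit (mark the segment holding the last byte of the message, and never append to a marked segment) can be sketched with a small queue model. All names are hypothetical, and `SEG_CAP` is an assumed per-segment capacity chosen to match the 7300-byte segments in the packetdrill script; the kernel sizes segments differently.

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_SEGS 16
#define SEG_CAP  7300u  /* assumed segment capacity for this model */

struct seg_model {
    unsigned len;
    bool eor;
};

struct txq_model {
    struct seg_model seg[MAX_SEGS];
    size_t n;
};

/* Queue 'len' bytes; msg_eor models passing MSG_EOR to the send call. */
int sendmsg_model(struct txq_model *q, unsigned len, bool msg_eor)
{
    unsigned left = len;

    while (left > 0) {
        struct seg_model *tail = q->n ? &q->seg[q->n - 1] : NULL;

        /* Start a new segment if there is none, it is full, or it is
         * marked as the end of an earlier message. */
        if (!tail || tail->eor || tail->len == SEG_CAP) {
            if (q->n == MAX_SEGS)
                return -1;
            tail = &q->seg[q->n++];
            tail->len = 0;
            tail->eor = false;
        }

        unsigned room = SEG_CAP - tail->len;
        unsigned chunk = left < room ? left : room;

        tail->len += chunk;
        left -= chunk;
    }

    /* Mark the segment that holds the last byte of this message. */
    if (msg_eor && len > 0)
        q->seg[q->n - 1].eor = true;
    return 0;
}
```

Replaying the script's sends (14600 bytes, then two 730-byte sends with MSG_EOR) through this model yields segments of 7300, 7300, 730 and 730 bytes, in line with the packets the script expects; without the eor bit the second 730-byte send would have been appended to the first.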
    • Merge branch 'tcp-redundant-checks' · 2a9e8438
      David S. Miller authored
      Soheil Hassas Yeganeh says:
      
      ====================
      tcp: simplify ack tx timestamps
      
      v2:
      - Fully remove SKBTX_ACK_TSTAMP, as suggested by Willem de Bruijn.
      
      This patch series aims at removing redundant checks and fields
      for ack timestamps for TCP.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: remove SKBTX_ACK_TSTAMP since it is redundant · 0a2cf20c
      Soheil Hassas Yeganeh authored
      The SKBTX_ACK_TSTAMP flag is set in skb_shinfo->tx_flags when
      the timestamp of the TCP acknowledgement should be reported on
      error queue. Since accessing skb_shinfo is likely to incur a
      cache-line miss at the time of receiving the ack, the
      txstamp_ack bit was added in tcp_skb_cb, which is set iff
      the SKBTX_ACK_TSTAMP flag is set for an skb. This makes
      SKBTX_ACK_TSTAMP flag redundant.
      
      Remove the SKBTX_ACK_TSTAMP and instead use the txstamp_ack bit
      everywhere.
      
      Note that this frees one bit in shinfo->tx_flags.
      Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Suggested-by: Willem de Bruijn <willemb@google.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: remove an unnecessary check in tcp_tx_timestamp · 863c1fd9
      Soheil Hassas Yeganeh authored
      Remove the redundant check for sk->sk_tsflags in tcp_tx_timestamp.
      
      tcp_tx_timestamp() receives the tsflags as a parameter. As a
      result the "sk->sk_tsflags || tsflags" is redundant, since
      tsflags already includes sk->sk_tsflags plus overrides from
      control messages.
      Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: snmp: fix 64bit stats on 32bit arches · ba7863f4
      Eric Dumazet authored
      I accidentally replaced BH disabling by preemption disabling
      in SNMP_ADD_STATS64() and SNMP_UPD_PO_STATS64() on 32bit builds.
      
      For 64bit stats on 32bit arch, we really need to disable BH,
      since the "struct u64_stats_sync syncp" might be manipulated
      both from process and BH contexts.
      
      Fixes: 6aef70a8 ("net: snmp: kill various STATS_USER() helpers")
      Reported-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
      Tested-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'socket-space-optimizations' · 8be2748a
      David S. Miller authored
      Eric Dumazet says:
      
      ====================
      net: avoid some atomic ops when FASYNC is not used
      
      We can avoid some atomic operations on sockets not using FASYNC
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: SOCKWQ_ASYNC_WAITDATA optimizations · 4be73522
      Eric Dumazet authored
      SOCKWQ_ASYNC_WAITDATA is set/cleared in sk_wait_data()
      and equivalent functions, so that sock_wake_async() can send
      a SIGIO only when necessary.
      
      Since these atomic operations are really not needed unless
      socket expressed interest in FASYNC, we can omit them in most
      cases.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
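The idea in this commit, skipping the atomic flag updates when nobody registered for FASYNC, can be modelled with C11 atomics. This is a sketch under assumptions: `wq_model`, `has_fasync` and the helper names are hypothetical, and the kernel tests its own fasync state rather than a plain boolean.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define ASYNC_WAITDATA 1u  /* models the SOCKWQ_ASYNC_WAITDATA bit */

/* Hypothetical model of a socket wait queue. */
struct wq_model {
    atomic_uint flags;
    bool has_fasync;           /* true if some task asked for SIGIO */
    unsigned long atomic_ops;  /* counts atomic updates actually done */
};

/* Called before sleeping for data. */
void set_waitdata(struct wq_model *wq)
{
    if (!wq->has_fasync)
        return;  /* nobody would receive SIGIO, skip the atomic op */
    atomic_fetch_or(&wq->flags, ASYNC_WAITDATA);
    wq->atomic_ops++;
}

/* Called after waking up. */
void clear_waitdata(struct wq_model *wq)
{
    if (!wq->has_fasync)
        return;
    atomic_fetch_and(&wq->flags, ~ASYNC_WAITDATA);
    wq->atomic_ops++;
}
```

For the common socket that never enables FASYNC, both helpers reduce to a plain load and branch.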