  1. 14 May, 2017 1 commit
  2. 18 Feb, 2017 1 commit
  3. 15 Nov, 2016 2 commits
  4. 16 Aug, 2016 1 commit
  5. 19 May, 2016 1 commit
  6. 17 Dec, 2015 1 commit
  7. 02 Nov, 2015 1 commit
  8. 22 Oct, 2015 1 commit
    • tcp: remove improper preemption check in tcp_xmit_probe_skb() · e2e8009f
      Renato Westphal authored
      Commit e520af48 introduced the following bug when setting the
      TCP_REPAIR socket option:
      
      [ 2860.657036] BUG: using __this_cpu_add() in preemptible [00000000] code: daemon/12164
      [ 2860.657045] caller is __this_cpu_preempt_check+0x13/0x20
      [ 2860.657049] CPU: 1 PID: 12164 Comm: daemon Not tainted 4.2.3 #1
      [ 2860.657051] Hardware name: Dell Inc. PowerEdge R210 II/0JP7TR, BIOS 2.0.5 03/13/2012
      [ 2860.657054]  ffffffff81c7f071 ffff880231e9fdf8 ffffffff8185d765 0000000000000002
      [ 2860.657058]  0000000000000001 ffff880231e9fe28 ffffffff8146ed91 ffff880231e9fe18
      [ 2860.657062]  ffffffff81cd1a5d ffff88023534f200 ffff8800b9811000 ffff880231e9fe38
      [ 2860.657065] Call Trace:
      [ 2860.657072]  [<ffffffff8185d765>] dump_stack+0x4f/0x7b
      [ 2860.657075]  [<ffffffff8146ed91>] check_preemption_disabled+0xe1/0xf0
      [ 2860.657078]  [<ffffffff8146edd3>] __this_cpu_preempt_check+0x13/0x20
      [ 2860.657082]  [<ffffffff817e0bc7>] tcp_xmit_probe_skb+0xc7/0x100
      [ 2860.657085]  [<ffffffff817e1e2d>] tcp_send_window_probe+0x2d/0x30
      [ 2860.657089]  [<ffffffff817d1d8c>] do_tcp_setsockopt.isra.29+0x74c/0x830
      [ 2860.657093]  [<ffffffff817d1e9c>] tcp_setsockopt+0x2c/0x30
      [ 2860.657097]  [<ffffffff81767b74>] sock_common_setsockopt+0x14/0x20
      [ 2860.657100]  [<ffffffff817669e1>] SyS_setsockopt+0x71/0xc0
      [ 2860.657104]  [<ffffffff81865172>] entry_SYSCALL_64_fastpath+0x16/0x75
      
      Since tcp_xmit_probe_skb() can be called from process context, use
      NET_INC_STATS() instead of NET_INC_STATS_BH().
      
      Fixes: e520af48 ("tcp: add TCPWinProbe and TCPKeepAlive SNMP counters")
      Signed-off-by: Renato Westphal <renatow@taghos.com.br>
      Signed-off-by: David S. Miller <davem@davemloft.net>
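      As an illustration, a minimal kernel-side sketch of the kind of change
      described above (not the exact upstream diff; the mib argument and
      surrounding code are abbreviated):

          /* tcp_xmit_probe_skb() can run in process context via
           * setsockopt(TCP_REPAIR) -> tcp_send_window_probe(), so the SNMP
           * update must not assume BH/softirq context.
           *
           * before: NET_INC_STATS_BH(sock_net(sk), mib);
           *         -> __this_cpu_add() in preemptible context (splat above)
           */
          NET_INC_STATS(sock_net(sk), mib);   /* preemption-safe variant */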
  9. 21 Oct, 2015 1 commit
  10. 19 Oct, 2015 1 commit
  11. 13 Oct, 2015 1 commit
  12. 03 Oct, 2015 1 commit
    • tcp: attach SYNACK messages to request sockets instead of listener · ca6fb065
      Eric Dumazet authored
      If the listen backlog is very big (to avoid syncookies), then
      the listener sk->sk_wmem_alloc is the main source of false
      sharing, as we need to touch it twice per SYNACK retransmit
      and TX completion.
      
      (One SYN packet takes the listener lock once, but up to 6 SYNACKs
      are generated.)
      
      By attaching the skb to the request socket, we remove this
      source of contention.
      
      Tested:
      
       listen(fd, 10485760); // single listener (no SO_REUSEPORT)
       16 RX/TX queue NIC
       Sustain a SYNFLOOD attack of ~320,000 SYN per second,
       Sending ~1,400,000 SYNACK per second.
        Perf profiles now show the listener spinlock being the next bottleneck.
      
          20.29%  [kernel]  [k] queued_spin_lock_slowpath
          10.06%  [kernel]  [k] __inet_lookup_established
           5.12%  [kernel]  [k] reqsk_timer_handler
           3.22%  [kernel]  [k] get_next_timer_interrupt
           3.00%  [kernel]  [k] tcp_make_synack
           2.77%  [kernel]  [k] ipt_do_table
           2.70%  [kernel]  [k] run_timer_softirq
           2.50%  [kernel]  [k] ip_finish_output
           2.04%  [kernel]  [k] cascade
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  13. 29 Sep, 2015 1 commit
    • tcp: Fix CWV being too strict on thin streams · d2e1339f
      Bendik Rønning Opstad authored
      Application limited streams such as thin streams, that transmit small
      amounts of payload in relatively few packets per RTT, can be prevented
      from growing the CWND when in congestion avoidance. This leads to
      increased sojourn times for data segments in streams that often transmit
      time-dependent data.
      
      Currently, a connection is considered CWND limited only after having
      successfully transmitted at least one packet with new data, while at the
      same time failing to transmit some unsent data from the output queue
      because the CWND is full. Applications that produce small amounts of
      data may be left in a state where it is never considered to be CWND
      limited, because all unsent data is successfully transmitted each time
      an incoming ACK opens up for more data to be transmitted in the send
      window.
      
      Fix by always testing whether the CWND is fully used after successful
      packet transmissions, such that a connection is considered CWND limited
      whenever the CWND has been filled. This is the correct behavior as
      specified in RFC2861 (section 3.1).
      
      Cc: Andreas Petlund <apetlund@simula.no>
      Cc: Carsten Griwodz <griff@simula.no>
      Cc: Jonas Markussen <jonassm@ifi.uio.no>
      Cc: Kenneth Klette Jonassen <kennetkl@ifi.uio.no>
      Cc: Mads Johannessen <madsjoh@ifi.uio.no>
      Signed-off-by: Bendik Rønning Opstad <bro.devel+kernel@gmail.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Tested-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Tested-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
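      A hedged sketch of the idea (close to the change in tcp_write_xmit()
      after a round of successful transmissions; surrounding code abbreviated):

          /* Consider the connection CWND-limited whenever the window is fully
           * used, not only when unsent data was left queued because the CWND
           * was full. */
          is_cwnd_limited |= (tcp_packets_in_flight(tp) >= tp->snd_cwnd);
          tcp_cwnd_validate(sk, is_cwnd_limited);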
  14. 25 Sep, 2015 4 commits
  15. 23 Sep, 2015 1 commit
    • tcp: add proper TS val into RST packets · 675ee231
      Eric Dumazet authored
      RST packets sent on behalf of TCP connections with TS option (RFC 7323
      TCP timestamps) have incorrect TS val (set to 0), but correct TS ecr.
      
      A > B: Flags [S], seq 0, win 65535, options [mss 1000,nop,nop,TS val 100
      ecr 0], length 0
      B > A: Flags [S.], seq 2444755794, ack 1, win 28960, options [mss
      1460,nop,nop,TS val 7264344 ecr 100], length 0
      A > B: Flags [.], ack 1, win 65535, options [nop,nop,TS val 110 ecr
      7264344], length 0
      
      B > A: Flags [R.], seq 1, ack 1, win 28960, options [nop,nop,TS val 0
      ecr 110], length 0
      
      We need to call skb_mstamp_get() to get a proper TS val,
      derived from skb->skb_mstamp.
      
      Note that RFC 1323 advocated not sending the TS option in RST segments,
      but RFC 7323 recommends the opposite:
      
        Once TSopt has been successfully negotiated, that is both <SYN> and
        <SYN,ACK> contain TSopt, the TSopt MUST be sent in every non-<RST>
        segment for the duration of the connection, and SHOULD be sent in an
        <RST> segment (see Section 5.2 for details)
      
      Note this RFC recommends sending TS val = 0, but we believe it is
      premature: we do not know if all TCP stacks properly handle
      the receive side:
      
         When an <RST> segment is
         received, it MUST NOT be subjected to the PAWS check by verifying an
         acceptable value in SEG.TSval, and information from the Timestamps
         option MUST NOT be used to update connection state information.
         SEG.TSecr MAY be used to provide stricter <RST> acceptance checks.
      
      In 5 years, if/when all TCP stacks are RFC 7323 ready, we might
      decide to send TS val = 0, if it buys something.
      
      Fixes: 7faee5c0 ("tcp: remove TCP_SKB_CB(skb)->when")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
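      A minimal, hedged sketch of the fix in tcp_send_active_reset()
      (surrounding code omitted):

          /* Stamp the RST skb before transmit so tcp_options_write() derives
           * a real TS val from skb->skb_mstamp instead of emitting 0. */
          skb_mstamp_get(&skb->skb_mstamp);
          tcp_transmit_skb(sk, skb, 0, priority);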
  16. 21 Sep, 2015 1 commit
  17. 18 Sep, 2015 1 commit
    • tcp: provide skb->hash to synack packets · 58d607d3
      Eric Dumazet authored
      In commit b73c3d0e ("net: Save TX flow hash in sock and set in skbuf
      on xmit"), Tom provided a l4 hash to most outgoing TCP packets.
      
      We'd like to provide one as well for SYNACK packets, so that all packets
      of a given flow share the same txhash, to later enable the bonding driver
      to also use skb->hash to perform slave selection.
      
      Note that a SYNACK retransmit shuffles the tx hash, as Tom did
      in commit 265f94ff ("net: Recompute sk_txhash on negative routing
      advice") for established sockets.
      
      This has the nice effect of making TCP flows resilient to some kinds of
      black holes, even during the connection establishment phase.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Tom Herbert <tom@herbertland.com>
      Cc: Mahesh Bandewar <maheshb@google.com>
      Acked-by: Tom Herbert <tom@herbertland.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
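      A hedged sketch of the SYNACK side (roughly what tcp_make_synack()
      gains; tcp_rsk(req)->txhash is filled in at SYN reception):

          /* Give the SYNACK the flow's txhash so every packet of the flow,
           * including retransmitted SYNACKs, carries a usable skb->hash,
           * e.g. for bonding slave selection. */
          skb_set_hash(skb, tcp_rsk(req)->txhash, PKT_HASH_TYPE_L4);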
  18. 10 Sep, 2015 1 commit
  19. 25 Aug, 2015 1 commit
    • tcp: fix slow start after idle vs TSO/GSO · 6f021c62
      Eric Dumazet authored
      Slow start after idle might reduce cwnd, but we perform this check
      after the first packet was cooked and sent.
      
      With TSO/GSO, it means that we might send a full TSO packet
      even if cwnd should have been reduced to IW10.
      
      Moving the SSAI check into skb_entail() makes sense, because
      it slightly reduces the number of times this check is done,
      especially for large send() calls and TCP Small Queues callbacks
      from softirq context.
      
      As Neal pointed out, we also need to perform the check
      if/when receive window opens.
      
      Tested:
      
      Following packetdrill test demonstrates the problem
      // Test of slow start after idle
      
      `sysctl -q net.ipv4.tcp_slow_start_after_idle=1`
      
      0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
      +0    setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
      +0    bind(3, ..., ...) = 0
      +0    listen(3, 1) = 0
      
      +0    < S 0:0(0) win 65535 <mss 1000,sackOK,nop,nop,nop,wscale 7>
      +0    > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 6>
      +.100 < . 1:1(0) ack 1 win 511
      +0    accept(3, ..., ...) = 4
      +0    setsockopt(4, SOL_SOCKET, SO_SNDBUF, [200000], 4) = 0
      
      +0    write(4, ..., 26000) = 26000
      +0    > . 1:5001(5000) ack 1
      +0    > . 5001:10001(5000) ack 1
      +0    %{ assert tcpi_snd_cwnd == 10 }%
      
      +.100 < . 1:1(0) ack 10001 win 511
      +0    %{ assert tcpi_snd_cwnd == 20, tcpi_snd_cwnd }%
      +0    > . 10001:20001(10000) ack 1
      +0    > P. 20001:26001(6000) ack 1
      
      +.100 < . 1:1(0) ack 26001 win 511
      +0    %{ assert tcpi_snd_cwnd == 36, tcpi_snd_cwnd }%
      
      +4 write(4, ..., 20000) = 20000
      // If slow start after idle works properly, we should send 5 MSS here (cwnd/2)
      +0    > . 26001:31001(5000) ack 1
      +0    %{ assert tcpi_snd_cwnd == 10, tcpi_snd_cwnd }%
      +0    > . 31001:36001(5000) ack 1
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
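      A hedged sketch of the helper this change is built around (close to the
      upstream inline in include/net/tcp.h, called from skb_entail() and when
      the receive window opens):

          static inline void tcp_slow_start_after_idle_check(struct sock *sk)
          {
              const struct tcp_sock *tp = tcp_sk(sk);
              s32 delta;

              if (!sysctl_tcp_slow_start_after_idle || tp->packets_out)
                  return;
              delta = tcp_time_stamp - tp->lsndtime;
              if (delta > inet_csk(sk)->icsk_rto)
                  tcp_cwnd_restart(sk, delta);
          }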
  20. 13 Aug, 2015 2 commits
  21. 27 Jul, 2015 1 commit
  22. 09 Jul, 2015 1 commit
    • tcp: v1 always send a quick ack when quickacks are enabled · 2251ae46
      Jon Maxwell authored
      V1 of this patch contains Eric Dumazet's suggestion to move the per
      dst RTAX_QUICKACK check into tcp_in_quickack_mode(). Thanks Eric.
      
      I ran some tests and after setting the "ip route change quickack 1"
      knob there were still many delayed ACKs sent. This occurred
      because when icsk_ack.quick=0 the !icsk_ack.pingpong value is
      subsequently ignored as tcp_in_quickack_mode() checks both these
      values. The condition for a quick ack to trigger requires
      that both icsk_ack.quick != 0 and icsk_ack.pingpong=0. Currently
      only icsk_ack.pingpong is controlled by the knob. But the
      icsk_ack.quick value changes dynamically depending on heuristics.
      The crux of the matter is that delayed acks still cannot be entirely
      disabled even with the RTAX_QUICKACK per dst knob enabled. This
      patch ensures that a quick ack is always sent when the RTAX_QUICKACK
      per dst knob is turned on.
      
      The "ip route change quickack 1" knob was recently added to enable
      quickacks. It was modeled around the TCP_QUICKACK setsockopt() option.
      The issue is that even with "ip route change quickack 1" enabled
      we still see delayed ACKs under some conditions. It would be nice
      to be able to completely disable delayed ACKs.
      
      Here is an example:
      
      # netstat -s|grep dela
          3 delayed acks sent
      
      For all routes enable the knob
      
      # ip route change quickack 1
      
      Generate some traffic across a slow link and we still see the delayed
      acks.
      
      # netstat -s|grep dela
          106 delayed acks sent
          1 delayed acks further delayed because of locked socket
      
      The issue is that both the "ip route change quickack 1" knob and
      the TCP_QUICKACK option set the icsk_ack.pingpong variable to 0.
      However at the business end in the __tcp_ack_snd_check() routine,
      tcp_in_quickack_mode() checks that both icsk_ack.quick != 0
      and icsk_ack.pingpong=0 in order to trigger a quickack. As
      icsk_ack.quick is determined by heuristics it can be 0. When
      that occurs the icsk_ack.pingpong value is ignored and a delayed
      ACK is sent regardless.
      
      This patch moves the RTAX_QUICKACK per dst check into the
      tcp_in_quickack_mode() routine which ensures that a quickack is
      always sent when the quickack knob is enabled for that dst.
      Signed-off-by: Jon Maxwell <jmaxwell37@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
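      A hedged sketch of the resulting check (close to the upstream
      tcp_in_quickack_mode() after this change):

          static bool tcp_in_quickack_mode(struct sock *sk)
          {
              const struct inet_connection_sock *icsk = inet_csk(sk);
              const struct dst_entry *dst = __sk_dst_get(sk);

              /* A dst with RTAX_QUICKACK set now forces quick acks, regardless
               * of the dynamically computed icsk_ack.quick heuristic. */
              return (dst && dst_metric(dst, RTAX_QUICKACK)) ||
                     (icsk->icsk_ack.quick && !icsk->icsk_ack.pingpong);
          }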
  23. 11 Jun, 2015 4 commits
  24. 04 Jun, 2015 1 commit
  25. 27 May, 2015 1 commit
  26. 22 May, 2015 1 commit
    • tcp: add tcpi_segs_in and tcpi_segs_out to tcp_info · 2efd055c
      Marcelo Ricardo Leitner authored
      This patch tracks the total number of inbound and outbound segments on a
      TCP socket. One may use this number to have an idea on connection
      quality when compared against the retransmissions.
      
      RFC 4898 named these: tcpEStatsPerfSegsIn and tcpEStatsPerfSegsOut.
      
      These are 32-bit fields and can be fetched either from TCP_INFO
      getsockopt() if one has a handle on a TCP socket, or from the inet_diag
      netlink facility (an iproute2/ss patch will follow).
      
      Note that tp->segs_out was placed near tp->snd_nxt for good data
      locality and minimal performance impact, while tp->segs_in was placed
      near tp->bytes_received for the same reason.
      
      Join work with Eric Dumazet.
      
      Note that received SYNs are accounted on the listener, but sent SYNACKs
      are not accounted.
      Signed-off-by: Marcelo Ricardo Leitner <mleitner@redhat.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
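      A hedged userland sketch of reading the new counters on a connected
      socket fd (assumes a kernel and headers new enough to expose the fields;
      error handling trimmed):

          #include <stdio.h>
          #include <string.h>
          #include <sys/socket.h>
          #include <netinet/in.h>
          #include <linux/tcp.h>      /* struct tcp_info with tcpi_segs_in/out */

          static void print_seg_counters(int fd)
          {
              struct tcp_info ti;
              socklen_t len = sizeof(ti);

              memset(&ti, 0, sizeof(ti));
              /* Older kernels return a shorter struct; only trust the new
               * fields when the full length was filled in. */
              if (getsockopt(fd, IPPROTO_TCP, TCP_INFO, &ti, &len) == 0 &&
                  len >= sizeof(ti))
                  printf("segs_in=%u segs_out=%u total_retrans=%u\n",
                         ti.tcpi_segs_in, ti.tcpi_segs_out,
                         ti.tcpi_total_retrans);
          }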
  27. 21 May, 2015 1 commit
    • tcp: add a force_schedule argument to sk_stream_alloc_skb() · eb934478
      Eric Dumazet authored
      In commit 8e4d980a ("tcp: fix behavior for epoll edge trigger")
      we fixed a possible hang of TCP sockets under memory pressure,
      by allowing sk_stream_alloc_skb() to use sk_forced_mem_schedule()
      if no packet is in socket write queue.
      
      It turns out there are other cases where we want to force memory
      scheduling:
      
      tcp_fragment() & tso_fragment() need to split a big TSO packet into
      two smaller ones. If we block here because of TCP memory pressure,
      we can effectively prevent the TCP socket from sending new data.
      If no further ACK is coming, this hang would be permanent, and the
      socket would have no chance to effectively reduce its memory usage.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
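      A hedged sketch of the resulting interface (close to the upstream
      prototype; callers that must not fail under memory pressure, such as
      tcp_fragment()/tso_fragment(), pass true):

          /* force_schedule == true lets the allocation fall back to
           * sk_forced_mem_schedule() even under TCP memory pressure. */
          struct sk_buff *sk_stream_alloc_skb(struct sock *sk, int size,
                                              gfp_t gfp, bool force_schedule);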
  28. 19 May, 2015 1 commit
    • tcp: add rfc3168, section 6.1.1.1. fallback · 49213555
      Daniel Borkmann authored
      This work is a follow-up of commit f7b3bec6 ("net: allow setting ecn
      via routing table") and adds the RFC 3168, section 6.1.1.1 fallback for
      outgoing ECN connections. In other words, it adds a retry with a non-ECN
      setup SYN packet, as suggested by the RFC, on the first timeout:
      
        [...] A host that receives no reply to an ECN-setup SYN within the
        normal SYN retransmission timeout interval MAY resend the SYN and
        any subsequent SYN retransmissions with CWR and ECE cleared. [...]
      
      Schematic client-side view when assuming the server is in tcp_ecn=2 mode,
      that is, Linux default since 2009 via commit 255cac91 ("tcp: extend
      ECN sysctl to allow server-side only ECN"):
      
       1) Normal ECN-capable path:
      
          SYN ECE CWR ----->
                      <----- SYN ACK ECE
                  ACK ----->
      
       2) Path with broken middlebox, when client has fallback:
      
          SYN ECE CWR ----X crappy middlebox drops packet
                            (timeout, rtx)
                  SYN ----->
                      <----- SYN ACK
                  ACK ----->
      
      In case we would not have the fallback implemented, the middlebox drop
      point would basically end up as:
      
          SYN ECE CWR ----X crappy middlebox drops packet
                            (timeout, rtx)
          SYN ECE CWR ----X crappy middlebox drops packet
                            (timeout, rtx)
          SYN ECE CWR ----X crappy middlebox drops packet
                            (timeout, rtx)
      
      In any case, only a rather small percentage of sites would incur such
      additional setup latency: at the end of 2014 it was found that ~56%
      of IPv4 and 65% of IPv6 servers in the Alexa 1 million list would
      negotiate ECN (i.e. the tcp_ecn=2 default), while 0.42% of these
      webservers fail to connect when trying to negotiate with ECN (tcp_ecn=1)
      due to timeouts, which the fallback would mitigate with a slight latency
      trade-off. A recent related paper on this topic:
      
        Brian Trammell, Mirja Kühlewind, Damiano Boppart, Iain Learmonth,
        Gorry Fairhurst, and Richard Scheffenegger:
          "Enabling Internet-Wide Deployment of Explicit Congestion Notification."
          Proc. PAM 2015, New York.
        http://ecn.ethz.ch/ecn-pam15.pdf
      
      Thus, when net.ipv4.tcp_ecn=1 is set, the patch will perform the RFC 3168,
      section 6.1.1.1 fallback on timeout. For users explicitly not wanting this
      (which can be the case in data centers), we add a net.ipv4.tcp_ecn_fallback
      knob that allows disabling the fallback.
      
      tp->ecn_flags are not cleared in tcp_ecn_clear_syn() on output; rather,
      we let tcp_ecn_rcv_synack() take that over on the input path in case a
      SYN ACK ECE was delayed. Thus a spurious SYN retransmission will not
      prevent ECN from being negotiated eventually in that case.
      
      Reference: https://www.ietf.org/proceedings/92/slides/slides-92-iccrg-1.pdf
      Reference: https://www.ietf.org/proceedings/89/slides/slides-89-tsvarea-1.pdf
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Mirja Kühlewind <mirja.kuehlewind@tik.ee.ethz.ch>
      Signed-off-by: Brian Trammell <trammell@tik.ee.ethz.ch>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Dave Taht <dave.taht@gmail.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
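      A hedged sketch of the fallback hook applied when a SYN carrying ECE|CWR
      is retransmitted (close to the upstream helper; the call site lives in
      the SYN retransmit path):

          static void tcp_ecn_clear_syn(struct sock *sk, struct sk_buff *skb)
          {
              /* Retry without the ECN-setup bits; tp->ecn_flags are left alone
               * so a delayed SYN ACK ECE can still complete the negotiation. */
              if (sock_net(sk)->ipv4.sysctl_tcp_ecn_fallback)
                  TCP_SKB_CB(skb)->tcp_flags &= ~(TCPHDR_ECE | TCPHDR_CWR);
          }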
  29. 18 May, 2015 2 commits
  30. 09 May, 2015 2 commits
    • tcp: add TCPWinProbe and TCPKeepAlive SNMP counters · e520af48
      Eric Dumazet authored
      Diagnosing problems related to Window Probes has been hard because
      we lack a counter.
      
      TCPWinProbe counts the number of ACK packets a sender has to send
      at regular intervals to make sure a reverse ACK packet reopening the
      window has not been lost.
      
      TCPKeepAlive counts the number of ACK packets sent to keep TCP
      flows alive (SO_KEEPALIVE).
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Acked-by: Nandita Dukkipati <nanditad@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: adjust window probe timers to safer values · 21c8fe99
      Eric Dumazet authored
      With the advent of small rto timers in datacenter TCP
      (ip route ... rto_min x), the following can happen:
      
      1) Qdisc is full, transmit fails.
      
         TCP sets a timer based on icsk_rto to retry the transmit, without
         exponential backoff.
         With a low icsk_rto and a lot of sockets, all cpus are servicing
         timer interrupts like crazy.
         The intent of the code was to retry with a timer between 200ms
         (TCP_RTO_MIN) and 500ms (TCP_RESOURCE_PROBE_INTERVAL).
      
      2) Receivers can send zero windows if they don't drain their receive queue.
      
         TCP sends zero window probes, based on icsk_rto current value, with
         exponential backoff.
         With /proc/sys/net/ipv4/tcp_retries2 being 15 (or even smaller in
         some cases), the sender can abort in less than one or two minutes!
         If the receiver stops the sender, it obviously doesn't care about a
         very tight rto. The probability of dropping the ACK that reopens the
         window is not worth the risk.
      
      Let's change the base timer to be at least 200ms (TCP_RTO_MIN) for these
      events (but not for normal RTO-based retransmits).
      
      A followup patch adds a new SNMP counter, as it would have helped a lot
      diagnosing this issue.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
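      A hedged sketch of the helpers this change introduces (close to the
      upstream inlines in include/net/tcp.h):

          /* Base interval for window probes and probe-like retries:
           * never below TCP_RTO_MIN even if the route set a tiny rto_min. */
          static inline unsigned long tcp_probe0_base(const struct sock *sk)
          {
              return max_t(unsigned long, inet_csk(sk)->icsk_rto, TCP_RTO_MIN);
          }

          /* Exponential backoff of the base, clamped to max_when (TCP_RTO_MAX). */
          static inline unsigned long tcp_probe0_when(const struct sock *sk,
                                                      unsigned long max_when)
          {
              u64 when = (u64)tcp_probe0_base(sk) << inet_csk(sk)->icsk_backoff;

              return (unsigned long)min_t(u64, when, max_when);
          }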