1. 09 Feb, 2015 29 commits
  2. 08 Feb, 2015 11 commits
    • tipc: fix bug in socket reception function · 51a00daf
      Jon Paul Maloy authored
      In commit c637c103 ("tipc: resolve race
      problem at unicast message reception") we introduced a time limit
      for how long the function tipc_sk_enqueue() would be allowed to execute
      its loop. Unfortunately, the test for when this limit is passed was put
      in the wrong place, resulting in a lost message when the test is true.
      
      We fix this by moving the test to before we dequeue the next buffer
      from the input queue.
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
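
The ordering issue in the tipc fix can be sketched in user-space C. This is not the TIPC code: `struct queue`, `dequeue()`, `drain()` and the integer `budget` (standing in for the time limit) are hypothetical names chosen for illustration. The point is only that the limit is tested before a buffer is removed, so nothing is dequeued and then abandoned.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for an input queue of buffers (not the TIPC types). */
struct queue {
    int items[8];
    size_t head, count;
};

static int queue_empty(const struct queue *q)
{
    return q->count == 0;
}

static int dequeue(struct queue *q)
{
    assert(!queue_empty(q));
    q->count--;
    return q->items[q->head++];
}

/*
 * Process buffers until the queue is empty or the budget runs out.
 * The budget check comes BEFORE the dequeue, so a buffer is only removed
 * from the queue if it is also going to be processed.
 * Returns the number of buffers processed; *sum accumulates their values.
 */
static size_t drain(struct queue *q, size_t budget, int *sum)
{
    size_t done = 0;

    while (!queue_empty(q)) {
        if (done >= budget)     /* limit reached: leave the rest queued */
            break;
        *sum += dequeue(q);
        done++;
    }
    return done;
}
```

If the check were placed after `dequeue()` instead, the buffer just removed would be dropped whenever the limit was hit, which is the lost-message symptom the commit describes.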
    • rt6_probe_deferred: Do not depend on struct ordering · 662f5533
      Michael Büsch authored
      rt6_probe allocates a struct __rt6_probe_work and schedules a work handler rt6_probe_deferred.
      But rt6_probe_deferred kfree's the struct work_struct instead of struct __rt6_probe_work.
      This works, because struct work_struct is the first element of struct __rt6_probe_work.
      
      Change it to kfree struct __rt6_probe_work to not implicitly depend on
      struct work_struct being the first element.
      
      This does not affect the generated code.
      Signed-off-by: Michael Buesch <m@bues.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
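
A minimal user-space sketch of the pattern this commit relies on: recover the enclosing structure from a pointer to its embedded member, then free the enclosing structure. The struct names and the `container_of` macro below are illustrative stand-ins written for this example, not the kernel definitions; the embedded member is deliberately not first, to show that the approach does not depend on member order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* User-space stand-ins; these names are illustrative, not the kernel's. */
struct work {
    int pending;
};

struct probe_work {
    int target;          /* deliberately placed first */
    struct work work;    /* embedded member, not at offset 0 */
};

/* Same idea as the kernel's container_of(): member pointer -> enclosing struct. */
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* A handler is given only the embedded member, as a work handler would be. */
static struct probe_work *probe_from_work(struct work *w)
{
    assert(w != NULL);
    return container_of(w, struct probe_work, work);
}

static int run_and_free(struct work *w)
{
    struct probe_work *pw = probe_from_work(w);
    int target = pw->target;

    free(pw);   /* free the allocation itself, not the member pointer */
    return target;
}
```

Calling `free()` on the member pointer would only be valid while the member sits at offset 0; going through the enclosing structure removes that hidden assumption.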
    • Merge branch 'tcp_ack_loops' · f06535c5
      David S. Miller authored
      Neal Cardwell says:
      
      ====================
      tcp: mitigate TCP ACK loops due to out-of-window validation dupacks
      
      This patch series mitigates "ack loop" DoS scenarios by rate-limiting
      outgoing duplicate ACKs sent in response to incoming "out of window"
      segments.
      
      Background
      -----------
      
      There are several cases in which the TCP RFCs specify that a TCP
      endpoint should send a pure duplicate ACK in response to a pure
      duplicate ACK that appears to be invalid due to being "out of window":
      
      (1) RFC 793 (section 3.9, page 69) specifies that endpoints should
          send a duplicate ACK in response to an ACK when the incoming
          sequence number is invalid due to being outside the receive
          window: "If an incoming segment is not acceptable, an
          acknowledgment should be sent in reply".
      
      (2) RFC 793 (section 3.9, page 72) says: "If the ACK acknowledges
          something not yet sent (SEG.ACK > SND.NXT) then send an ACK".
      
      (3) RFC 1323 (section 4.2.1, page 18) specifies that endpoints should
          send a duplicate ACK in response to an ACK when the PAWS check for
          the incoming timestamp value fails: "If .... SEG.TSval < TS.Recent
          and if TS.Recent is valid ... Send an acknowledgement in reply"
      
      The problem
      ------------
      
      Normally, this is not a problem. However, a buggy middlebox or
      malicious man-in-the-middle can inject a few packets into the
      conversation that advance each endpoint's notion of the current window
      (sequence, ACK, or timestamp), without either side noticing. In this
      case, from then on each side can think the other is sending invalid
      segments. Thus an infinite feedback loop of duplicate ACKs can ensue,
      as each endpoint receives a duplicate ACK, decides that it is invalid
      (due to sequence number, ACK number, or timestamp), and then sends a
      dupack in reply, which the other side decides is invalid, responding
      with a dupack... ad infinitum. This ping-pong feedback loop can happen
      at a very high rate.
      
      This phenomenon can and does happen in practice. It has been seen in
      datacenter and Internet contexts at Google, and has been documented by
      Anil Agarwal in the Nov 2013 tcpm thread "TCP mismatched sequence
      numbers issue", and Avery Fay in the Feb 2015 Linux netdev thread
      "Invalid timestamp? causing tight ack loop (hundreds of thousands of
      packets / sec)".
      
      This patch series
      ------------------
      
      This patch series mitigates such ack loops by rate-limiting outgoing
      duplicate ACKs sent in response to incoming TCP packets that are for
      an existing connection but that are invalid due to any of the reasons
      mentioned above: sequence number (1), ACK field (2), or timestamp
      value (3). The rate limit for such duplicate ACKs is specified by a
      new sysctl, tcp_invalid_ratelimit, which specifies the minimal space
      between such outbound duplicate ACKs, in milliseconds. The default is
      500 (500ms), and 0 disables the mechanism.
      
      We rate-limit these duplicate ACK responses rather than blocking them
      entirely or resetting the connection, because legitimate connections
      can rely on dupacks in response to some out-of-window segments. For
      example, zero window probes are typically sent with a sequence number
      that is below the current window, and ZWPs thus expect to elicit
      a dupack in response.
      
      Testing: this approach has been in use at Google for a while.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: mitigate ACK loops for connections as tcp_timewait_sock · 4fb17a60
      Neal Cardwell authored
      Ensure that in state FIN_WAIT2 or TIME_WAIT, where the connection is
      represented by a tcp_timewait_sock, we rate limit dupacks in response
      to incoming packets (a) with TCP timestamps that fail PAWS checks, or
      (b) with sequence numbers that are out of the acceptable window.
      
      We do not send a dupack in response to out-of-window packets if it has
      been less than sysctl_tcp_invalid_ratelimit (default 500ms) since we
      last sent a dupack in response to an out-of-window packet.
      Reported-by: Avery Fay <avery@mixpanel.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: mitigate ACK loops for connections as tcp_sock · f2b2c582
      Neal Cardwell authored
      Ensure that in state ESTABLISHED, where the connection is represented
      by a tcp_sock, we rate limit dupacks in response to incoming packets
      (a) with TCP timestamps that fail PAWS checks, or (b) with sequence
      numbers or ACK numbers that are out of the acceptable window.
      
      We do not send a dupack in response to out-of-window packets if it has
      been less than sysctl_tcp_invalid_ratelimit (default 500ms) since we
      last sent a dupack in response to an out-of-window packet.
      
      There is already a similar (although global) rate-limiting mechanism
      for "challenge ACKs". When deciding whether to send a challenge ACK,
      we first consult the new per-connection rate limit, and then the
      global rate limit.
      Reported-by: Avery Fay <avery@mixpanel.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: mitigate ACK loops for connections as tcp_request_sock · a9b2c06d
      Neal Cardwell authored
      In the SYN_RECV state, where the TCP connection is represented by
      tcp_request_sock, we now rate-limit SYNACKs in response to a client's
      retransmitted SYNs: we do not send a SYNACK in response to client SYN
      if it has been less than sysctl_tcp_invalid_ratelimit (default 500ms)
      since we last sent a SYNACK in response to a client's retransmitted
      SYN.
      
      This allows the vast majority of legitimate client connections to
      proceed unimpeded, even for the most aggressive platforms, iOS and
      MacOS, which actually retransmit SYNs at 1-second intervals several
      times in a row. They use SYN RTO timeouts following the progression:
      1,1,1,1,1,2,4,8,16,32.
      Reported-by: Avery Fay <avery@mixpanel.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
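
To check the claim about aggressive clients against the 500ms default, the arithmetic can be written out. The sketch below assumes the progression 1,1,1,1,1,2,4,8,16,32 is in seconds and that SYNs arrive exactly at those intervals; `suppressed_synacks()` is a hypothetical helper for this example, not a kernel function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Count how many retransmitted SYNs would go unanswered.
 * gaps_ms[i] is the time between consecutive SYN arrivals; the first SYN
 * (the one before gaps_ms[0]) is assumed to have been answered.
 */
static size_t suppressed_synacks(const uint32_t *gaps_ms, size_t n,
                                 uint32_t ratelimit_ms)
{
    uint32_t since_last_synack = 0;
    size_t suppressed = 0;

    assert(gaps_ms != NULL || n == 0);

    for (size_t i = 0; i < n; i++) {
        since_last_synack += gaps_ms[i];
        if (since_last_synack < ratelimit_ms)
            suppressed++;            /* too soon after our last SYNACK */
        else
            since_last_synack = 0;   /* answered: restart the interval */
    }
    return suppressed;
}
```

With every gap at 1000ms or more, no retransmitted SYN falls inside the 500ms interval, so none is suppressed. Only arrivals compressed to less than 500ms apart would be.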
    • tcp: helpers to mitigate ACK loops by rate-limiting out-of-window dupacks · 032ee423
      Neal Cardwell authored
      Helpers for mitigating ACK loops by rate-limiting dupacks sent in
      response to incoming out-of-window packets.
      
      This patch includes:
      
      - rate-limiting logic
      - sysctl to control how often we allow dupacks to out-of-window packets
      - SNMP counter for cases where we rate-limited our dupack sending
      
      The rate-limiting logic in this patch decides to not send dupacks in
      response to out-of-window segments if (a) they are SYNs or pure ACKs
      and (b) the remote endpoint is sending them faster than the configured
      rate limit.
      
      We rate-limit our responses rather than blocking them entirely or
      resetting the connection, because legitimate connections can rely on
      dupacks in response to some out-of-window segments. For example, zero
      window probes are typically sent with a sequence number that is below
      the current window, and ZWPs thus expect to elicit a dupack in
      response.
      
      We allow dupacks in response to TCP segments with data, because these
      may be spurious retransmissions for which the remote endpoint wants to
      receive DSACKs. This is safe because segments with data can't
      realistically be part of ACK loops, which by their nature consist of
      each side sending pure/data-less ACKs to each other.
      
      The dupack interval is controlled by a new sysctl knob,
      tcp_invalid_ratelimit, given in milliseconds, in case an administrator
      needs to dial this upward in the face of a high-rate DoS attack. The
      name and units are chosen to be analogous to the existing knob for
      ICMP, icmp_ratelimit.
      
      The default value for tcp_invalid_ratelimit is 500ms, which allows at
      most one such dupack per 500ms. This is chosen to be 2x faster than
      the 1-second minimum RTO interval allowed by RFC 6298 (section 2, rule
      2.4). We allow the extra 2x factor because network delay variations
      can cause packets sent at 1 second intervals to be compressed and
      arrive much closer.
      Reported-by: Avery Fay <avery@mixpanel.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • openvswitch: Initialize unmasked key and uid len · ca539345
      Pravin B Shelar authored
      Flow allocation needs to initialize the unmasked key pointer. Otherwise
      the kernel can crash trying to free a random unmasked-key pointer.
      
      general protection fault: 0000 [#1] SMP
      3.19.0-rc6-net-next+ #457
      Hardware name: Supermicro X7DWU/X7DWU, BIOS  1.1 04/30/2008
      RIP: 0010:[<ffffffff8111df0e>] [<ffffffff8111df0e>] kfree+0xac/0x196
      Call Trace:
       [<ffffffffa060bd87>] flow_free+0x21/0x59 [openvswitch]
       [<ffffffffa060bde0>] ovs_flow_free+0x21/0x23 [openvswitch]
       [<ffffffffa0605b4a>] ovs_packet_cmd_execute+0x2f3/0x35f [openvswitch]
       [<ffffffffa0605995>] ? ovs_packet_cmd_execute+0x13e/0x35f [openvswitch]
       [<ffffffff811fe6fb>] ? nla_parse+0x4f/0xec
       [<ffffffff8139a2fc>] genl_family_rcv_msg+0x26d/0x2c9
       [<ffffffff8107620f>] ? __lock_acquire+0x90e/0x9aa
       [<ffffffff8139a3be>] genl_rcv_msg+0x66/0x89
       [<ffffffff8139a358>] ? genl_family_rcv_msg+0x2c9/0x2c9
       [<ffffffff81399591>] netlink_rcv_skb+0x3e/0x95
       [<ffffffff81399898>] ? genl_rcv+0x18/0x37
       [<ffffffff813998a7>] genl_rcv+0x27/0x37
       [<ffffffff81399033>] netlink_unicast+0x103/0x191
       [<ffffffff81399382>] netlink_sendmsg+0x2c1/0x310
       [<ffffffff811007ad>] ? might_fault+0x50/0xa0
       [<ffffffff8135c773>] do_sock_sendmsg+0x5f/0x7a
       [<ffffffff8135c799>] sock_sendmsg+0xb/0xd
       [<ffffffff8135cacf>] ___sys_sendmsg+0x1a3/0x218
       [<ffffffff8113e54b>] ? get_close_on_exec+0x86/0x86
       [<ffffffff8115a9d0>] ? fsnotify+0x32c/0x348
       [<ffffffff8115a720>] ? fsnotify+0x7c/0x348
       [<ffffffff8113e5f5>] ? __fget+0xaa/0xbf
       [<ffffffff8113e54b>] ? get_close_on_exec+0x86/0x86
       [<ffffffff8135cccd>] __sys_sendmsg+0x3d/0x5e
       [<ffffffff8135cd02>] SyS_sendmsg+0x14/0x16
       [<ffffffff81411852>] system_call_fastpath+0x12/0x17
      
      Fixes: 74ed7ab9 ("openvswitch: Add support for unique flow IDs.")
      CC: Joe Stringer <joestringer@nicira.com>
      Reported-by: Or Gerlitz <ogerlitz@mellanox.com>
      Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
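
The class of bug fixed here can be shown with a small user-space analogue. The types and functions below (`struct flow`, `flow_alloc()`, `flow_free()`) are hypothetical and only mirror the shape of the problem: an allocator that returns uninitialized memory, and a free path that unconditionally frees a pointer member.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical types for illustration; not the openvswitch structures. */
struct key {
    int fields[4];
};

struct flow {
    struct key *unmasked_key;   /* optional, separately allocated */
    size_t id_len;
};

static struct flow *flow_alloc(void)
{
    struct flow *f = malloc(sizeof(*f));   /* contents are uninitialized */

    if (!f)
        return NULL;
    f->unmasked_key = NULL;   /* so flow_free() is safe on early error paths */
    f->id_len = 0;
    return f;
}

static void flow_free(struct flow *f)
{
    if (!f)
        return;
    free(f->unmasked_key);   /* free(NULL) is a no-op */
    free(f);
}
```

Without the two initializing assignments, an error path that calls `flow_free()` right after `flow_alloc()` would hand whatever bytes happened to be in the allocation to `free()`, which matches the crash in `kfree()` shown in the trace above.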
    • Merge branch 'cxgb4' · 34afb4eb
      David S. Miller authored
      Hariprasad Shenai says:
      
      ====================
      Add support to dump some hw debug info
      
      This patch series adds support to dump sensor info, dump Transport Processor
      event trace, dump Upper Layer Protocol RX module command trace, dump mailbox
      contents and dump Transport Processor congestion control configuration.
      
      Will send a separate patch series for all the hw stats patches, by moving them
      to ethtool.
      
      The patch series is created against the 'net-next' tree
      and includes patches on the cxgb4 driver.
      
      We have included all the maintainers of respective drivers. Kindly review the
      change and let us know in case of any review comments.
      
      V2: Dropped all hw stats related patches. Added a new patch which adds support to
      dump congestion control table.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • cxgb4: Add support in debugfs to dump the congestion control table · bad43792
      Hariprasad Shenai authored
      Dump the Transport Processor module's congestion control configuration.
      Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • cxgb4: Add support to dump mailbox content in debugfs · bf7c781d
      Hariprasad Shenai authored
      Adds support to dump the current contents of the mailbox and the driver
      which owns it.
      Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>