1. 23 Oct, 2015 4 commits
    • Eric Dumazet's avatar
      tcp/dccp: fix hashdance race for passive sessions · 5e0724d0
      Eric Dumazet authored
      Multiple cpus can process duplicates of incoming ACK messages
      matching a SYN_RECV request socket. This is a rare event under
      normal operations, but definitely can happen.
      
      Only one must win the race, otherwise corruption would occur.
      
      To fix this without adding new atomic ops, we use logic in
      inet_ehash_nolisten() to detect the request was present in the same
      ehash bucket where we try to insert the new child.
      
      If request socket was not found, we have to undo the child creation.
      
      This actually removes a spin_lock()/spin_unlock() pair in
      reqsk_queue_unlink() for the fast path.
      
      Fixes: e994b2f0 ("tcp: do not lock listener to process SYN packets")
      Fixes: 079096f1 ("tcp/dccp: install syn_recv requests into ehash table")
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      5e0724d0
    • Paolo Abeni's avatar
      ipv4: implement support for NOPREFIXROUTE ifa flag for ipv4 address · 7b131180
      Paolo Abeni authored
      Currently adding a new ipv4 address always cause the creation of the
      related network route, with default metric. When a host has multiple
      interfaces on the same network, multiple routes with the same metric
      are created.
      
      If the userspace wants to set specific metric on each routes, i.e.
      giving better metric to ethernet links in respect to Wi-Fi ones,
      the network routes must be deleted and recreated, which is error-prone.
      
      This patch implements the support for IFA_F_NOPREFIXROUTE for ipv4
      address. When an address is added with such flag set, no associated
      network route is created, no network route is deleted when
      said IP is gone and it's up to the user space manage such route.
      Signed-off-by: default avatarPaolo Abeni <pabeni@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      7b131180
    • Michael Chan's avatar
      bnxt_en: New Broadcom ethernet driver. · c0c050c5
      Michael Chan authored
      Broadcom ethernet driver for the new family of NetXtreme-C/E
      ethernet devices.
      
      v5:
        - Removed empty blank lines at end of files (noted by David Miller).
        - Moved busy poll helper functions to bnxt.h to at least make the
          .c file look less cluttered with #ifdef (noted by Stephen Hemminger).
      
      v4:
        - Broke up 2 long message strings with "\n" (suggested by John Linville)
        - Constify an array of strings (suggested by Stephen Hemminger)
        - Improve bnxt_vf_pciid() (suggested by Stephen Hemminger)
        - Use PCI_VDEVICE() to populate pci_device_id table for more compact
          source.
      
      v3:
        - Fixed 2 more sparse warnings.
        - Removed some unused structures in .h files.
      
      v2:
        - Fixed all kbuild test robot reported warnings.
        - Fixed many of the checkpatch.pl errors and warnings.
        - Fixed the Kconfig description (noted by Dmitry Kravkov).
      Acked-by: default avatarEddie Wai <eddie.wai@broadcom.com>
      Acked-by: default avatarJeffrey Huang <huangjw@broadcom.com>
      Signed-off-by: default avatarPrashant Sreedharan <prashant@broadcom.com>
      Signed-off-by: default avatarMichael Chan <mchan@broadcom.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      c0c050c5
    • Vivien Didelot's avatar
      net: dsa: mv88e6xxx: remove debugfs interface · 0a31adae
      Vivien Didelot authored
      It is preferable to have a common debugfs interface for DSA or switchdev
      instead of a driver specific one. Thus remove the mv88e6xxx debug code.
      Signed-off-by: default avatarVivien Didelot <vivien.didelot@savoirfairelinux.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      0a31adae
  2. 22 Oct, 2015 32 commits
  3. 21 Oct, 2015 4 commits
    • Elad Raz's avatar
      Adding switchdev ageing notification on port bridged · 6ac311ae
      Elad Raz authored
      Configure ageing time to the HW for newly bridged device
      
      CC: Scott Feldman <sfeldma@gmail.com>
      CC: Jiri Pirko <jiri@resnulli.us>
      Signed-off-by: default avatarElad Raz <eladr@mellanox.com>
      Acked-by: default avatarJiri Pirko <jiri@mellanox.com>
      Acked-by: default avatarScott Feldman <sfeldma@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      6ac311ae
    • David S. Miller's avatar
      Merge branch 'tcp-rack' · eb9fae32
      David S. Miller authored
      Yuchung Cheng says:
      
      ====================
      RACK loss detection
      
      RACK (Recent ACK) loss recovery uses the notion of time instead of
      packet sequence (FACK) or counts (dupthresh).
      
      It's inspired by the FACK heuristic in tcp_mark_lost_retrans(): when a
      limited transmit (new data packet) is sacked in recovery, then any
      retransmission sent before that newly sacked packet was sent must have
      been lost, since at least one round trip time has elapsed.
      
      But that existing heuristic from tcp_mark_lost_retrans()
      has several limitations:
        1) it can't detect tail drops since it depends on limited transmit
        2) it's disabled upon reordering (assumes no reordering)
        3) it's only enabled in fast recovery but not timeout recovery
      
      RACK addresses these limitations with a core idea: an unacknowledged
      packet P1 is deemed lost if a packet P2 that was sent later is is
      s/acked, since at least one round trip has passed.
      
      Since RACK cares about the time sequence instead of the data sequence
      of packets, it can detect tail drops when a later retransmission is
      s/acked, while FACK or dupthresh can't. For reordering RACK uses a
      dynamically adjusted reordering window ("reo_wnd") to reduce false
      positives on ever (small) degree of reordering, similar to the delayed
      Early Retransmit.
      
      In the current patch set RACK is only a supplemental loss detection
      and does not trigger fast recovery. However we are developing RACK
      to replace or consolidate FACK/dupthresh, early retransmit, and
      thin-dupack. These heuristics all implicitly bear the time notion.
      For example, the delayed Early Retransmit is simply applying RACK
      to trigger the fast recovery with small inflight.
      
      RACK requires measuring the minimum RTT. Tracking a global min is less
      robust due to traffic engineering pathing changes. Therefore it uses a
      windowed filter by Kathleen Nichols. The min RTT can also be useful
      for various other purposes like congestion control or stat monitoring.
      
      This patch has been used on Google servers for well over 1 year. RACK
      has also been implemented in the QUIC protocol. We are submitting an
      IETF draft as well.
      ====================
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      eb9fae32
    • Yuchung Cheng's avatar
      tcp: use RACK to detect losses · 4f41b1c5
      Yuchung Cheng authored
      This patch implements the second half of RACK that uses the the most
      recent transmit time among all delivered packets to detect losses.
      
      tcp_rack_mark_lost() is called upon receiving a dubious ACK.
      It then checks if an not-yet-sacked packet was sent at least
      "reo_wnd" prior to the sent time of the most recently delivered.
      If so the packet is deemed lost.
      
      The "reo_wnd" reordering window starts with 1msec for fast loss
      detection and changes to min-RTT/4 when reordering is observed.
      We found 1msec accommodates well on tiny degree of reordering
      (<3 pkts) on faster links. We use min-RTT instead of SRTT because
      reordering is more of a path property but SRTT can be inflated by
      self-inflicated congestion. The factor of 4 is borrowed from the
      delayed early retransmit and seems to work reasonably well.
      
      Since RACK is still experimental, it is now used as a supplemental
      loss detection on top of existing algorithms. It is only effective
      after the fast recovery starts or after the timeout occurs. The
      fast recovery is still triggered by FACK and/or dupack threshold
      instead of RACK.
      
      We introduce a new sysctl net.ipv4.tcp_recovery for future
      experiments of loss recoveries. For now RACK can be disabled by
      setting it to 0.
      Signed-off-by: default avatarYuchung Cheng <ycheng@google.com>
      Signed-off-by: default avatarNeal Cardwell <ncardwell@google.com>
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      4f41b1c5
    • Yuchung Cheng's avatar
      tcp: track the packet timings in RACK · 659a8ad5
      Yuchung Cheng authored
      This patch is the first half of the RACK loss recovery.
      
      RACK loss recovery uses the notion of time instead
      of packet sequence (FACK) or counts (dupthresh). It's inspired by the
      previous FACK heuristic in tcp_mark_lost_retrans(): when a limited
      transmit (new data packet) is sacked, then current retransmitted
      sequence below the newly sacked sequence must been lost,
      since at least one round trip time has elapsed.
      
      But it has several limitations:
      1) can't detect tail drops since it depends on limited transmit
      2) is disabled upon reordering (assumes no reordering)
      3) only enabled in fast recovery ut not timeout recovery
      
      RACK (Recently ACK) addresses these limitations with the notion
      of time instead: a packet P1 is lost if a later packet P2 is s/acked,
      as at least one round trip has passed.
      
      Since RACK cares about the time sequence instead of the data sequence
      of packets, it can detect tail drops when later retransmission is
      s/acked while FACK or dupthresh can't. For reordering RACK uses a
      dynamically adjusted reordering window ("reo_wnd") to reduce false
      positives on ever (small) degree of reordering.
      
      This patch implements tcp_advanced_rack() which tracks the
      most recent transmission time among the packets that have been
      delivered (ACKed or SACKed) in tp->rack.mstamp. This timestamp
      is the key to determine which packet has been lost.
      
      Consider an example that the sender sends six packets:
      T1: P1 (lost)
      T2: P2
      T3: P3
      T4: P4
      T100: sack of P2. rack.mstamp = T2
      T101: retransmit P1
      T102: sack of P2,P3,P4. rack.mstamp = T4
      T205: ACK of P4 since the hole is repaired. rack.mstamp = T101
      
      We need to be careful about spurious retransmission because it may
      falsely advance tp->rack.mstamp by an RTT or an RTO, causing RACK
      to falsely mark all packets lost, just like a spurious timeout.
      
      We identify spurious retransmission by the ACK's TS echo value.
      If TS option is not applicable but the retransmission is acknowledged
      less than min-RTT ago, it is likely to be spurious. We refrain from
      using the transmission time of these spurious retransmissions.
      
      The second half is implemented in the next patch that marks packet
      lost using RACK timestamp.
      Signed-off-by: default avatarYuchung Cheng <ycheng@google.com>
      Signed-off-by: default avatarNeal Cardwell <ncardwell@google.com>
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      659a8ad5