1. 30 Sep, 2014 3 commits
    • net: sched: restrict use of qstats qlen · 64015853
      John Fastabend authored
      This removes the use of qstats->qlen variable from the classifiers
      and makes it an explicit argument to gnet_stats_copy_queue().
      
      The qlen represents the qdisc queue length and is packed into
      the qstats at the last moment before passing to user space. By
      handling it explicitly we avoid, in the percpu stats case, having
      to figure out which per_cpu variable to put it in.
      
      It would probably be best to remove it from qstats completely
      but qstats is a user space ABI and can't be broken. A future
      patch could make an internal only qstats structure that would
      avoid having to allocate an additional u32 variable on the
      Qdisc struct. This would make the qstats struct 128 bits instead
      of 128+32 bits.
      Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: sched: implement qstat helper routines · 25331d6c
      John Fastabend authored
      This adds helpers to manipulate qstats logic and replaces locations
      that touch the counters directly. This simplifies future patches
      to push qstats onto per cpu counters.
      Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: sched: make bstats per cpu and estimator RCU safe · 22e0f8b9
      John Fastabend authored
      In order to run qdiscs without locking, statistics and estimators
      need to be handled correctly.
      
      To resolve bstats, make the statistics per cpu. Because this is
      only needed for qdiscs that run without locks, which will not be
      the case for most qdiscs in the near future, only create percpu
      stats when qdiscs set the TCQ_F_CPUSTATS flag.
      
      Next, because estimators use the bstats to calculate packets per
      second and bytes per second, the estimator code paths are updated
      to use the per cpu statistics.
      Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  2. 29 Sep, 2014 25 commits
    • macvlan: add source mode · 79cf79ab
      Michael Braun authored
      This patch adds a new mode of operation to macvlan, called "source".
      It allows one to set a list of allowed MAC addresses, which is
      matched against the source MAC address of frames received on the
      underlying interface.
      This enables creating MAC-based VLAN associations instead of the
      standard port- or tag-based ones. The feature is useful for deploying
      802.1x MAC-based behavior where drivers of the underlying interfaces
      do not allow that.
      
      Configuration is done through the netlink interface using e.g.:
       ip link add link eth0 name macvlan0 type macvlan mode source
       ip link add link eth0 name macvlan1 type macvlan mode source
       ip link set link dev macvlan0 type macvlan macaddr add 00:11:11:11:11:11
       ip link set link dev macvlan0 type macvlan macaddr add 00:22:22:22:22:22
       ip link set link dev macvlan0 type macvlan macaddr add 00:33:33:33:33:33
       ip link set link dev macvlan1 type macvlan macaddr add 00:33:33:33:33:33
       ip link set link dev macvlan1 type macvlan macaddr add 00:44:44:44:44:44
      
      This allows clients with MAC addresses 00:11:11:11:11:11 and
      00:22:22:22:22:22 to be part of only the VLAN associated with the
      macvlan0 interface, and clients with MAC address 00:44:44:44:44:44
      to be part of only the VLAN associated with the macvlan1 interface.
      The client with MAC address 00:33:33:33:33:33 is associated with
      both VLANs.
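
      The resulting association can be sketched as a small lookup model.
      This is an illustrative userspace model only, not the kernel data
      structure; the names source_table, macaddr_add and receivers are
      hypothetical.

```python
from collections import defaultdict

# Hypothetical model: source MAC address -> set of macvlan devices in
# "source" mode that accept frames from that address.
source_table = defaultdict(set)


def macaddr_add(dev, mac):
    """Mimic 'ip link set link dev <dev> type macvlan macaddr add <mac>'."""
    source_table[mac.lower()].add(dev)


def receivers(src_mac):
    """Return the devices that would see a frame with this source MAC."""
    return sorted(source_table.get(src_mac.lower(), ()))


# Same configuration as in the commit message above.
for mac in ("00:11:11:11:11:11", "00:22:22:22:22:22", "00:33:33:33:33:33"):
    macaddr_add("macvlan0", mac)
for mac in ("00:33:33:33:33:33", "00:44:44:44:44:44"):
    macaddr_add("macvlan1", mac)
```

      A frame from 00:33:33:33:33:33 maps to both devices, while a frame
      from an unlisted address maps to none.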
      
      Based on work of Stefan Gula <steweg@gmail.com>
      
      v8: last version of Stefan Gula for Kernel 3.2.1
      v9: rework onto linux-next 2014-03-12 by Michael Braun
          add MACADDR_SET command, enable to configure mac for source mode
          while creating interface
      v10:
        - reduce indentation level
        - rename source_list to source_entry
        - use aligned 64bit ether address
        - use hash_64 instead of addr[5]
      v11:
        - rebase for 3.14 / linux-next 20.04.2014
      v12
        - rebase for linux-next 2014-09-25
      Signed-off-by: Michael Braun <michael-dev@fami-braun.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next · 85224844
      David S. Miller authored
      Pablo Neira Ayuso says:
      
      ====================
      pull request: netfilter/ipvs updates for net-next
      
      The following patchset contains Netfilter/IPVS updates for net-next,
      most relevantly they are:
      
      1) Four patches to make the new nf_tables masquerading support
         independent of the x_tables infrastructure. This also resolves a
         compilation breakage if the masquerade target is disabled but the
         nf_tables masq expression is enabled.
      
      2) ipset updates via Jozsef Kadlecsik. This includes the addition of the
         skbinfo extension that allows you to store packet metainformation in the
         elements. This can be used to fetch and restore this to the packets through
         the iptables SET target, patches from Anton Danilov.
      
      3) Add the hash:mac set type to ipset, from Jozsef Kadlecsik.
      
      4) Add simple weighted fail-over scheduler via Simon Horman. This provides
         a fail-over IPVS scheduler (unlike existing load balancing schedulers).
         Connections are directed to the appropriate server based solely on
         highest weight value and server availability, patch from Kenny Mathis.
      
      5) Support IPv6 real servers in IPv4 virtual-services and vice versa.
         Simon Horman informs that the motivation for this is to allow more
         flexibility in the choice of IP version offered by both virtual-servers
         and real-servers as they no longer need to match: An IPv4 connection
         from an end-user may be forwarded to a real-server using IPv6 and
         vice versa. No ip_vs_sync support yet though. Patches from Alex Gartrell
         and Julian Anastasov.
      
      6) Add global generation ID to the nf_tables ruleset. When dumping from
         several different object lists, we need a way to identify that an update
         has occurred so userspace knows that it needs to refresh its lists. This
         also includes a new command to obtain the 32-bit generation ID. The
         least significant 16 bits of this ID are also exposed through the res_id field
         in the nfnetlink header to quickly detect the interference and retry when
         there is no risk of ID wraparound.
      
      7) Move br_netfilter out of the bridge core. The br_netfilter code is
         built in the bridge core by default. This causes various problems
         for people who don't want this: Jesper reported a performance drop due
         to the unconditional hook registration, and I remember having read complaints
         on netdev from people regarding the unexpected behaviour of our bridging
         stack when br_netfilter is enabled (fragmentation handling, layer 3 and
         upper inspection). People that still need this should easily undo the
         damage by modprobing the new br_netfilter module.
      
      8) Dump the nf_tables set policy, which allows set parameterization, so
         userspace can keep user-defined preferences when saving the ruleset.
         From Arturo Borrero.
      
      9) Use __seq_open_private() helper function to reduce boilerplate code
         in x_tables, from Rob Jones.
      
      10) Safer default behaviour in case that you forget to load the protocol
         tracker. Daniel Borkmann and Florian Westphal detected that if your
         ruleset is stateful, you allow traffic to at least one single SCTP port
         and the SCTP protocol tracker is not loaded, then any SCTP traffic may
         pass through unfiltered. After this patch, the connection tracking
         classifies SCTP/DCCP/UDPlite/GRE packets as invalid if your kernel has
         been compiled with support for these modules.
      ====================
      
      Trivially resolved conflict in include/linux/skbuff.h, Eric moved some
      netfilter skbuff members around, and the netfilter tree adjusted the
      ifdef guards for the bridging info pointer.
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: change TCP_ECN prefixes to lower case · 735d3831
      Florian Westphal authored
      Suggested by Stephen. Also drop inline keyword and let compiler decide.
      
      gcc 4.7.3 decides to no longer inline tcp_ecn_check_ce, so split it up.
      The actual evaluation is not inlined anymore while the ECN_OK test is.
      Suggested-by: Stephen Hemminger <stephen@networkplumber.org>
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: move TCP_ECN_create_request out of header · d82bd122
      Florian Westphal authored
      After Octavian Purdila's TCP IPv4/IPv6 unification work, this helper
      has only a single call site.
      
      While at it, convert name to lowercase, suggested by Stephen.
      Suggested-by: Stephen Hemminger <stephen@networkplumber.org>
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'arcnet-EAE' · 2b7fc477
      David S. Miller authored
      Michael Grzeschik says:
      
      ====================
      ARCNET: add support for EAE multi interface card
      
      This series adds support for the PLX bridge based multi-interface
      PCI cards and adds support for changing the device address on com200xx
      chips at runtime.
      
      This series is based on v3.17-rc7.
      It is fixed to build against com20020_cs.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ARCNET: enable eae arcnet card support · 5b85bad2
      Michael Grzeschik authored
      This patch adds support for the EAE ARCNET cards,
      which have two interfaces.
      Signed-off-by: Michael Grzeschik <m.grzeschik@pengutronix.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ARCNET: add support for multi interfaces on com20020 · c51da42a
      Michael Grzeschik authored
      The com20020-pci driver is currently designed to instantiate
      one netdev per PCI device. This patch adds support for
      instantiating many cards from one PCI device, depending on the device
      data in the private data.
      Signed-off-by: Michael Grzeschik <m.grzeschik@pengutronix.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ARCNET: add com20020 PCI IDs with metadata · 8c14f9c7
      Michael Grzeschik authored
      This patch adds metadata for the com20020 to prepare for devices with
      multiple io address areas with multi card interfaces.
      Signed-off-by: Michael Grzeschik <m.grzeschik@pengutronix.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ARCNET: add com20020_set_hwddr to change address · a0d2e513
      Michael Grzeschik authored
      This patch adds com20020_set_hwaddr to make
      it possible to change the hwaddr at runtime.
      Signed-off-by: Michael Grzeschik <m.grzeschik@pengutronix.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ARCNET: return IRQ_NONE if the interface isn't running · 226ee675
      Michael Grzeschik authored
      The interrupt handler needs to return IRQ_NONE in case
      two devices are used with the shared interrupt handler.
      Otherwise it could steal interrupts from the other
      interface.
      Signed-off-by: Michael Grzeschik <m.grzeschik@pengutronix.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: remove unnecessary assignment. · 41c91996
      Li RongQing authored
      The variable i is overwritten with 0 by the following code.
      Signed-off-by: Li RongQing <roy.qing.li@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: reorganize sk_buff for faster __copy_skb_header() · b1937227
      Eric Dumazet authored
      With proliferation of bit fields in sk_buff, __copy_skb_header() became
      quite expensive, showing as the most expensive function in a GSO
      workload.
      
      __copy_skb_header() performance is also critical for non GSO TCP
      operations, as it is used from skb_clone()
      
      This patch carefully moves all the fields that were not copied into a
      separate zone: cloned, nohdr, fclone, peeked, head_frag, xmit_more
      
      Then I moved all other fields and all other copied fields into a
      section delimited by headers_start[0]/headers_end[0] so that we
      can use a single memcpy() call, inlined by the compiler using long
      word load/stores.
      
      I also tried to make all copies in the natural orders of sk_buff,
      to help hardware prefetching.
      
      I made sure sk_buff size did not change.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • netfilter: conntrack: disable generic tracking for known protocols · db29a950
      Florian Westphal authored
      Given the following iptables ruleset:
      
      -P FORWARD DROP
      -A FORWARD -p sctp -m sctp --dport 9 -j ACCEPT
      -A FORWARD -p tcp --dport 80 -j ACCEPT
      -A FORWARD -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT
      
      One would assume that this allows SCTP on port 9 and TCP on port 80.
      Unfortunately, if the SCTP conntrack module is not loaded, this allows
      *all* SCTP communication to pass through, i.e. -p sctp -j ACCEPT,
      which we think is a security issue.
      
      This is because on the first SCTP packet on port 9, we create a dummy
      "generic l4" conntrack entry without any port information (since
      conntrack doesn't know how to extract this information).
      
      All subsequent packets that are unknown will then be in established
      state since they will fall back to proto_generic and will match the
      'generic' entry.
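
      The effect can be illustrated with a toy model. This is a sketch
      under simplifying assumptions (one direction only, no timeouts, no
      real netfilter semantics); conntrack_key, forward and the
      tracker_loaded flag are hypothetical names used for illustration.

```python
def conntrack_key(proto, src, dst, sport, dport, tracker_loaded):
    """Build the key under which a flow is remembered.

    With a protocol-specific tracker the key includes the ports. The
    generic l4 fallback cannot parse ports, so every flow between the
    same two hosts collapses onto one entry.
    """
    if tracker_loaded:
        return (proto, src, dst, sport, dport)
    return (proto, src, dst)


def forward(packets, tracker_loaded):
    """Apply a toy version of the ruleset: allow SCTP to port 9 plus
    anything that matches an already established entry; drop the rest."""
    established = set()
    verdicts = []
    for proto, src, dst, sport, dport in packets:
        key = conntrack_key(proto, src, dst, sport, dport, tracker_loaded)
        if key in established:
            verdicts.append("ACCEPT")   # matches the ESTABLISHED rule
        elif proto == "sctp" and dport == 9:
            established.add(key)        # first packet creates the entry
            verdicts.append("ACCEPT")
        else:
            verdicts.append("DROP")     # default FORWARD policy
    return verdicts


packets = [
    ("sctp", "10.0.0.1", "10.0.0.2", 5000, 9),    # explicitly allowed
    ("sctp", "10.0.0.1", "10.0.0.2", 5000, 22),   # should be dropped
]
```

      In this model, the second packet is accepted only when the key
      carries no port information, which mirrors the problem described
      above.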
      
      Our originally proposed version [1] completely disabled generic protocol
      tracking, but Jozsef suggested not tracking protocols for which a more
      suitable helper is available. Hence we now mitigate the issue for
      in-tree known ct protocol helpers only, so that at least NAT and direction
      information will still be preserved for others.
      
       [1] http://www.spinics.net/lists/netfilter-devel/msg33430.html
      
      Joint work with Daniel Borkmann.
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Acked-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    • netfilter: nf_tables: store and dump set policy · 9363dc4b
      Arturo Borrero authored
      We want to know in which cases the user explicitly sets the policy
      options. In that case, we also want to dump back the info.
      Signed-off-by: Arturo Borrero Gonzalez <arturo.borrero.glez@gmail.com>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    • Merge branch 'qca7000_spi' · 842abe08
      David S. Miller authored
      Stefan Wahren says:
      
      ====================
      add Qualcomm QCA7000 ethernet driver
      
      This patch series adds support for the Qualcomm QCA7000 Homeplug GreenPHY.
      The QCA7000 is a serial-to-powerline bridge with two interfaces: UART and SPI.
      These patches handle only the latter, with an Ethernet over SPI protocol
      driver.
      
      This driver is based on the Qualcomm code [1], but contains a lot of changes
      since last year:
      
      * devicetree support
      * DebugFS support
      * ethtool support
      * better error handling
      * performance improvements
      * code cleanup
      * some bugfixes
      
      The code has been tested only on Freescale i.MX28 boards, but should work
      on other platforms.
      
      [1] - https://github.com/IoE/qca7000
      
      Changes in V3:
      - Use ether_addr_copy instead of memcpy
      - Remove qcaspi_set_mac_address
      - Improve DT parsing
      - replace OF_GPIO dependency with OF
      - fix compile error caused by SET_ETHTOOL_OPS
      - fix possible endless loop when spi read fails
      - fix DT documentation
      - fix coding style
      - fix sparse warnings
      
      Changes in V2:
      - replace in DT the SPI intr GPIO with pure interrupt
      - make legacy mode a boolean DT property and remove it as module parameter
      - make burst length a module parameter instead of DT property
      - make pluggable a module parameter instead of DT property
      - improve DT documentation
      - replace debugFS register dump with ethtool function
      - replace debugFS stats with ethtool function
      - implement function to get ring parameter via ethtool
      - implement function to set TX ring count via ethtool
      - fix TX ring state in debugFS
      - optimize tx ring flush
      - add byte limit for TX ring to avoid bufferbloat
      - fix TX queue full and write buffer miss counter
      - fix SPI clk speed module parameter
      - fix possible packet loss
      - fix possible race during transmit
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: qualcomm: new Ethernet over SPI driver for QCA7000 · 291ab06e
      Stefan Wahren authored
      This patch adds the Ethernet over SPI driver for the
      Qualcomm QCA7000 HomePlug GreenPHY.
      Signed-off-by: Stefan Wahren <stefan.wahren@i2se.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Documentation: add Device tree bindings for QCA7000 · 7d50df8f
      Stefan Wahren authored
      This patch adds the Device tree bindings for the
      Ethernet over SPI protocol driver of the Qualcomm
      QCA7000 HomePlug GreenPHY.
      Signed-off-by: Stefan Wahren <stefan.wahren@i2se.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'dctcp' · a11238ec
      David S. Miller authored
      Daniel Borkmann says:
      
      ====================
      net: tcp: DCTCP congestion control algorithm
      
      This patch series adds support for the DataCenter TCP (DCTCP) congestion
      control algorithm. Please see individual patches for the details.
      
      The last patch adds DCTCP as a congestion control module, and previous
      ones add needed infrastructure to extend the congestion control framework.
      
      Joint work between Florian Westphal, Daniel Borkmann and Glenn Judd.
      
      v3 -> v2:
       - No changes anywhere, just a resend as requested by Dave
       - Added Stephen's ACK
      v1 -> v2:
       - Rebased to latest net-next
       - Addressed Eric's feedback, thanks!
        - Update stale comment wrt. DCTCP ECN usage
        - Don't call INET_ECN_xmit for every packet
       - Add dctcp ss/inetdiag support to expose internal stats to userspace
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: tcp: add DCTCP congestion control algorithm · e3118e83
      Daniel Borkmann authored
      This work adds the DataCenter TCP (DCTCP) congestion control
      algorithm [1], which was first published at SIGCOMM 2010 [2],
      with a follow-up analysis at SIGMETRICS 2011 [3] (and, more
      recently, as an informational IETF draft available at [4]).
      
      DCTCP is an enhancement to the TCP congestion control algorithm for
      data center networks. Typical data center workloads are, e.g.,
      i) partition/aggregate (queries; bursty, delay sensitive), ii) short
      messages e.g. 50KB-1MB (for coordination and control state; delay
      sensitive), and iii) large flows e.g. 1MB-100MB (data update;
      throughput sensitive). DCTCP has therefore been designed for such
      environments to provide/achieve the following three requirements:
      
        * High burst tolerance (incast due to partition/aggregate)
        * Low latency (short flows, queries)
        * High throughput (continuous data updates, large file
          transfers) with commodity, shallow buffered switches
      
      The basic idea of its design consists of two fundamentals: i) on the
      switch side, packets are marked when the internal queue
      length > threshold K (K is chosen so that a large enough headroom
      for marked traffic is still available in the switch queue); ii) the
      sender/host side maintains a moving average of the fraction of marked
      packets, so each RTT, F is updated as follows:
      
       F := X / Y, where X is # of marked ACKs, Y is total # of ACKs
       alpha := (1 - g) * alpha + g * F, where g is a smoothing constant
      
      The resulting alpha (iow: probability that switch queue is congested)
      is then used to adaptively decrease the congestion
      window W:
      
       W := (1 - (alpha / 2)) * W
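
      The two update rules above can be sketched in floating point as
      follows. This is a minimal illustration, not the kernel code; the
      function names are hypothetical, and the default g = 1/16 is an
      assumption chosen for the example, since the text above only calls
      g a smoothing constant.

```python
def update_alpha(alpha, marked_acks, total_acks, g=1.0 / 16):
    """One per-RTT update: alpha := (1 - g) * alpha + g * F.

    F = X / Y is the fraction of marked ACKs in the last RTT.
    The default g is an assumed value for illustration.
    """
    f = marked_acks / total_acks if total_acks else 0.0
    return (1.0 - g) * alpha + g * f


def reduce_cwnd(cwnd, alpha):
    """Congestion response: W := (1 - alpha / 2) * W."""
    return (1.0 - alpha / 2.0) * cwnd
```

      With alpha = 0 the window is left unchanged, and with alpha = 1
      (every packet marked) it is halved, which matches the classic TCP
      reaction to congestion.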
      
      In DCTCP, ECN is the means both for marking packets on the switch
      side and for receiving those marks.
      
      RFC3168 describes a mechanism for using Explicit Congestion Notification
      from the switch for early detection of congestion, rather than waiting
      for segment loss to occur.
      
      However, this method only detects the presence of congestion, not
      the *extent*. In the presence of mild congestion, it reduces the TCP
      congestion window too aggressively and unnecessarily affects the
      throughput of long flows [4].
      
      DCTCP, as mentioned, enhances Explicit Congestion Notification (ECN)
      processing to estimate the fraction of bytes that encounter congestion,
      rather than simply detecting that some congestion has occurred. DCTCP
      then scales the TCP congestion window based on this estimate [4],
      thus it can derive multibit feedback from the information present in
      the single-bit sequence of marks in its control law, and thus act in
      *proportion* to the extent of congestion, not its *presence*.
      
      Switches therefore set the Congestion Experienced (CE) codepoint in
      packets when internal queue lengths exceed threshold K. As a result,
      DCTCP delivers the same or better throughput than normal TCP, while
      using 90% less buffer space.
      
      It was found in [2] that DCTCP enables the applications to handle 10x
      the current background traffic, without impacting foreground traffic.
      Moreover, a 10x increase in foreground traffic did not cause any
      timeouts, and thus largely eliminates TCP incast collapse problems.
      
      The algorithm itself has already seen deployments in large production
      data centers since then.
      
      We did a long-term stress test and analysis in a data center. A short
      summary of our TCP incast tests with iperf, compared to CUBIC:
      
      This test measured DCTCP throughput and latency and compared it with
      CUBIC throughput and latency for an incast scenario. In this test, 19
      senders sent at maximum rate to a single receiver. The receiver simply
      ran iperf -s.
      
      The senders ran iperf -c <receiver> -t 30. All senders started
      simultaneously (using local clocks synchronized by ntp).
      
      This test was repeated multiple times. The results below are from a
      single test; other tests were similar. (DCTCP results were extremely
      consistent, CUBIC results show some variance induced by the TCP timeouts
      that CUBIC encountered.)
      
      For this test, we report statistics on the number of TCP timeouts,
      flow throughput, and traffic latency.
      
      1) Timeouts (total over all flows, and per flow summaries):
      
                  CUBIC            DCTCP
        Total     3227             25
        Mean       169.842          1.316
        Median     183              1
        Max        207              5
        Min        123              0
        Stddev      28.991          1.600
      
      Timeout data is taken by measuring the net change in netstat -s
      "other TCP timeouts" reported. As a result, the timeout measurements
      above are not restricted to the test traffic, and we believe that it
      is likely that all of the "DCTCP timeouts" are actually timeouts for
      non-test traffic. We report them nevertheless. CUBIC will also include
      some non-test timeouts, but they are dwarfed by bona fide test traffic
      timeouts for CUBIC. Clearly DCTCP does an excellent job of preventing
      TCP timeouts. DCTCP reduces timeouts by at least two orders of
      magnitude and may well have eliminated them in this scenario.
      
      2) Throughput (per flow in Mbps):
      
                  CUBIC            DCTCP
        Mean      521.684          521.895
        Median    464              523
        Max       776              527
        Min       403              519
        Stddev    105.891            2.601
        Fairness    0.962            0.999
      
      Throughput data was simply the average throughput for each flow
      reported by iperf. By avoiding TCP timeouts, DCTCP is able to
      achieve much better per-flow results. In CUBIC, many flows
      experience TCP timeouts which makes flow throughput unpredictable and
      unfair. DCTCP, on the other hand, provides very clean predictable
      throughput without incurring TCP timeouts. Thus, the standard deviation
      of CUBIC throughput is dramatically higher than the standard deviation
      of DCTCP throughput.
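
      The commit message does not say how the "Fairness" row was
      computed. One common choice for per-flow throughput is Jain's
      fairness index, sketched below purely as an assumption; the
      function name is hypothetical.

```python
def jain_fairness(rates):
    """Jain's fairness index: (sum x)^2 / (n * sum x^2).

    Returns 1.0 when all flows get the same rate and 1/n when a single
    flow gets everything. Assumed metric; not confirmed by the source.
    """
    n = len(rates)
    total = sum(rates)
    squares = sum(r * r for r in rates)
    if n == 0 or squares == 0:
        raise ValueError("need at least one non-zero rate")
    return (total * total) / (n * squares)
```

      For example, four flows at equal rates give an index of 1.0, while
      two flows where one is starved give 0.5.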
      
      Mean throughput is nearly identical because even though CUBIC flows
      suffer TCP timeouts, other flows will step in and fill the unused
      bandwidth. Note that this test is something of a best case scenario
      for incast under CUBIC: it allows other flows to fill in for flows
      experiencing a timeout. Under situations where the receiver is issuing
      requests and then waiting for all flows to complete, flows cannot fill
      in for timed out flows and throughput will drop dramatically.
      
      3) Latency (in ms):
      
                  CUBIC            DCTCP
        Mean      4.0088           0.04219
        Median    4.055            0.0395
        Max       4.2              0.085
        Min       3.32             0.028
        Stddev    0.1666           0.01064
      
      Latency for each protocol was computed by running "ping -i 0.2
      <receiver>" from a single sender to the receiver during the incast
      test. For DCTCP, "ping -Q 0x6 -i 0.2 <receiver>" was used to ensure
      that traffic traversed the DCTCP queue and was not dropped when the
      queue size was greater than the marking threshold. The summary
      statistics above are over all ping metrics measured between the single
      sender, receiver pair.
      
      The latency results for this test show a dramatic difference between
      CUBIC and DCTCP. CUBIC intentionally overflows the switch buffer,
      which incurs the maximum queue latency (more buffer memory will lead
      to higher latency). DCTCP, on the other hand, deliberately attempts to
      keep queue occupancy low. The result is a two orders of magnitude
      reduction of latency with DCTCP - even with a switch with relatively
      little RAM. Switches with larger amounts of RAM will incur increasing
      amounts of latency for CUBIC, but not for DCTCP.
      
      4) Convergence and stability test:
      
      This test measured the time that DCTCP took to fairly redistribute
      bandwidth when a new flow commences. It also measured DCTCP's ability
      to remain stable at a fair bandwidth distribution. DCTCP is compared
      with CUBIC for this test.
      
      At the commencement of this test, a single flow is sending at maximum
      rate (near 10 Gbps) to a single receiver. One second after that first
      flow commences, a new flow from a distinct server begins sending to
      the same receiver as the first flow. After the second flow has sent
      data for 10 seconds, the second flow is terminated. The first flow
      sends for an additional second. Ideally, the bandwidth would be evenly
      shared as soon as the second flow starts, and recover as soon as it
      stops.
      
      The results of this test are shown below. Note that the flow bandwidth
      for the two flows was measured near the same time, but not
      simultaneously.
      
      DCTCP performs nearly perfectly within the measurement limitations
      of this test: bandwidth is quickly distributed fairly between the two
      flows, remains stable throughout the duration of the test, and
      recovers quickly. CUBIC, in contrast, is slow to divide the bandwidth
      fairly, and has trouble remaining stable.
      
        CUBIC                      DCTCP
      
        Seconds  Flow 1  Flow 2    Seconds  Flow 1  Flow 2
         0       9.93    0          0       9.92    0
         0.5     9.87    0          0.5     9.86    0
         1       8.73    2.25       1       6.46    4.88
         1.5     7.29    2.8        1.5     4.9     4.99
         2       6.96    3.1        2       4.92    4.94
         2.5     6.67    3.34       2.5     4.93    5
         3       6.39    3.57       3       4.92    4.99
         3.5     6.24    3.75       3.5     4.94    4.74
         4       6       3.94       4       5.34    4.71
         4.5     5.88    4.09       4.5     4.99    4.97
         5       5.27    4.98       5       4.83    5.01
         5.5     4.93    5.04       5.5     4.89    4.99
         6       4.9     4.99       6       4.92    5.04
         6.5     4.93    5.1        6.5     4.91    4.97
         7       4.28    5.8        7       4.97    4.97
         7.5     4.62    4.91       7.5     4.99    4.82
         8       5.05    4.45       8       5.16    4.76
         8.5     5.93    4.09       8.5     4.94    4.98
         9       5.73    4.2        9       4.92    5.02
         9.5     5.62    4.32       9.5     4.87    5.03
        10       6.12    3.2       10       4.91    5.01
        10.5     6.91    3.11      10.5     4.87    5.04
        11       8.48    0         11       8.49    4.94
        11.5     9.87    0         11.5     9.9     0
      
      SYN/ACK ECT test:
      
      This test demonstrates the importance of ECT on SYN and SYN-ACK packets
      by measuring the connection probability in the presence of competing
      flows for a DCTCP connection attempt *without* ECT in the SYN packet.
      The test was repeated five times for each number of competing flows.
      
                    Competing Flows  1 |    2 |    4 |    8 |   16
                                     ------------------------------
      Mean Connection Probability    1 | 0.67 | 0.45 | 0.28 |    0
      Median Connection Probability  1 | 0.65 | 0.45 | 0.25 |    0
      
      As the number of competing flows moves beyond 1, the connection
      probability drops rapidly.
      
      Enabling DCTCP with this patch requires the following steps:
      
      DCTCP must be running both on the sender and receiver side in your
      data center, i.e.:
      
        sysctl -w net.ipv4.tcp_congestion_control=dctcp
      
      ECN must also be enabled on all switches in your data center for
      DCTCP to work. A common heuristic for the ECN marking threshold (K)
      on the switch is, e.g., 20 packets (30KB) at 1Gbps and 65 packets
      (~100KB) at 10Gbps (K > 1/7 * C * RTT, [4]).
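
      The bound K > 1/7 * C * RTT can be evaluated with a few lines of C.
      This is only a sketch: the function name and the 100 us RTT used in
      the usage note are assumptions for illustration, not values from the
      tests above, and the result is a lower bound rather than a
      recommended setting.

```c
#include <stdint.h>

/*
 * Lower bound on the ECN marking threshold, in bytes, derived from
 * K > (C * RTT) / 7.
 *
 * link_bps: link capacity C in bits per second
 * rtt_us:   round-trip time in microseconds (assumed input; measure your own)
 */
static uint64_t dctcp_min_k_bytes(uint64_t link_bps, uint64_t rtt_us)
{
	/* Bandwidth-delay product in bytes: (bits/s / 8) * us / 1e6. */
	uint64_t bdp_bytes = link_bps / 8 * rtt_us / 1000000;

	return bdp_bytes / 7;
}
```

      For a 10Gbps link and an assumed RTT of 100 us,
      dctcp_min_k_bytes(10000000000ULL, 100) evaluates to 17857 bytes.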
      
      In the above tests, traffic on each switch port was segregated into
      two queues. Any packet with a DSCP of 0x01 - or equivalently a TOS of
      0x04 - was placed into the DCTCP queue. All other packets were placed
      into the default drop-tail queue. For the DCTCP queue, RED/ECN
      marking was enabled, here with a marking threshold of 75 KB. For more
      details, we refer you to section 3 of the paper [2].
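
      The equivalence between a DSCP of 0x01 and a TOS of 0x04 follows
      from the DSCP occupying the upper six bits of the TOS byte, with the
      lower two bits reserved for ECN. A minimal sketch, with hypothetical
      helper names:

```c
#include <stdint.h>

/*
 * The DSCP sits in the upper six bits of the former IPv4 TOS byte;
 * the lower two bits carry the ECN field. Helper names are illustrative.
 */
static uint8_t dscp_to_tos(uint8_t dscp)
{
	return (uint8_t)((dscp & 0x3f) << 2);
}

static uint8_t tos_to_dscp(uint8_t tos)
{
	return (uint8_t)(tos >> 2);
}
```

      An application that wants its traffic classified this way could
      pass the resulting value to setsockopt() with IPPROTO_IP and IP_TOS.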
      
      There are no code changes required to applications running in user
      space. DCTCP has been implemented in full *isolation* of the rest of
      the TCP code as its own congestion control module, so that it can run
      without a need to expose code to the core of the TCP stack, and thus
      nothing changes for non-DCTCP users.
      
      Changes in the CA framework code are minimal, and the DCTCP algorithm
      operates on mechanisms that are already available in most silicon.
      The gain (dctcp_shift_g) is currently a fixed constant (1/16) taken
      from the paper, but we leave open the option for the user to
      carefully choose a different value.
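
      To illustrate how a gain of 1/16 expressed as a shift enters the
      estimate, here is a user-space sketch of the moving average from the
      paper, alpha <- (1 - g) * alpha + g * F, where F is the fraction of
      ECN-marked bytes. The names, the 1024 scaling factor and the
      fixed-point layout are assumptions of this sketch and not
      necessarily what the module does.

```c
#include <stdint.h>

#define EX_MAX_ALPHA 1024u  /* alpha stored as a fraction scaled by 1024 (assumed) */
#define EX_SHIFT_G   4u     /* g = 1 / 2^4 = 1/16 */

/* One update step: alpha <- (1 - g) * alpha + g * (bytes_ecn / bytes_total). */
static uint32_t ex_update_alpha(uint32_t alpha, uint64_t bytes_ecn,
				uint64_t bytes_total)
{
	uint32_t frac;

	if (bytes_total == 0)
		bytes_total = 1;  /* avoid dividing by zero on an empty window */

	frac = (uint32_t)(bytes_ecn * EX_MAX_ALPHA / bytes_total);
	if (frac > EX_MAX_ALPHA)
		frac = EX_MAX_ALPHA;

	/*
	 * Integer truncation: once alpha drops below 16, alpha >> 4 is 0
	 * and this naive form stops decaying.
	 */
	alpha -= alpha >> EX_SHIFT_G;
	alpha += frac >> EX_SHIFT_G;

	return alpha > EX_MAX_ALPHA ? EX_MAX_ALPHA : alpha;
}
```

      With no marked bytes, a saturated alpha of 1024 decays to 960 in one
      step; with every byte marked, an alpha of 0 rises to 64.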
      
      If DCTCP is in use and ECN support on the peer side is off, DCTCP
      falls back after the 3WHS to operate in normal TCP Reno mode.
      
      ss {-4,-6} -t -i diag interface:
      
        ... dctcp wscale:7,7 rto:203 rtt:2.349/0.026 mss:1448 cwnd:2054
        ssthresh:1102 ce_state 0 alpha 15 ab_ecn 0 ab_tot 735584
        send 10129.2Mbps pacing_rate 20254.1Mbps unacked:1822 retrans:0/15
        reordering:101 rcv_space:29200
      
        ... dctcp-reno wscale:7,7 rto:201 rtt:0.711/1.327 ato:40 mss:1448
        cwnd:10 ssthresh:1102 fallback_mode send 162.9Mbps pacing_rate
        325.5Mbps rcv_rtt:1.5 rcv_space:29200
      
      More information about DCTCP can be found in [1-4].
      
        [1] http://simula.stanford.edu/~alizade/Site/DCTCP.html
        [2] http://simula.stanford.edu/~alizade/Site/DCTCP_files/dctcp-final.pdf
        [3] http://simula.stanford.edu/~alizade/Site/DCTCP_files/dctcp_analysis-full.pdf
        [4] http://tools.ietf.org/html/draft-bensley-tcpm-dctcp-00
      
      Joint work with Florian Westphal and Glenn Judd.
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Glenn Judd <glenn.judd@morganstanley.com>
      Acked-by: Stephen Hemminger <stephen@networkplumber.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      e3118e83
    • Florian Westphal's avatar
      net: tcp: more detailed ACK events and events for CE marked packets · 9890092e
      Florian Westphal authored
      DataCenter TCP (DCTCP) determines cwnd growth based on ECN information
      and ACK properties, e.g. an ACK that updates the window is treated
      differently than a DUPACK.
      
      DCTCP also needs to know whether an ACK was a delayed ACK.
      Furthermore, DCTCP implements a CE state machine that keeps track of
      CE markings of incoming packets.
      
      Therefore, extend the congestion control framework to provide these
      event types, so that DCTCP can be properly implemented as a normal
      congestion algorithm module outside of the core stack.
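
      As a rough illustration of what a CE state machine driven by such
      events could look like, here is a simplified user-space model. All
      names are hypothetical and this is not the kernel code; it only
      records whether the CE marking changed between consecutive packets,
      which is the point where a receiver would want to acknowledge
      immediately rather than delay the ACK.

```c
#include <stdbool.h>

/* Hypothetical event types: the incoming packet was or was not CE marked. */
enum ex_ce_event { EX_ECN_NO_CE, EX_ECN_IS_CE };

struct ex_ce_tracker {
	bool ce_state;  /* true if the previous packet carried a CE mark */
};

/*
 * Feed one event into the tracker. Returns true when the CE state
 * flipped, i.e. when the marking differs from the previous packet.
 */
static bool ex_ce_update(struct ex_ce_tracker *t, enum ex_ce_event ev)
{
	bool now_ce = (ev == EX_ECN_IS_CE);
	bool changed = (now_ce != t->ce_state);

	t->ce_state = now_ce;
	return changed;
}
```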
      
      Joint work with Daniel Borkmann and Glenn Judd.
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Signed-off-by: Glenn Judd <glenn.judd@morganstanley.com>
      Acked-by: Stephen Hemminger <stephen@networkplumber.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      9890092e
    • Florian Westphal's avatar
      net: tcp: split ack slow/fast events from cwnd_event · 7354c8c3
      Florian Westphal authored
      The congestion control ops "cwnd_event" currently supports
      CA_EVENT_FAST_ACK and CA_EVENT_SLOW_ACK events (among others).
      Both FAST and SLOW_ACK are only used by Westwood congestion
      control algorithm.
      
      This removes both flags from cwnd_event and adds a new
      in_ack_event callback for this. The goal is to be able to
      provide more detailed information about ACKs, such as whether
      ECE flag was set, or whether the ACK resulted in a window
      update.
      
      This is required for the DataCenter TCP (DCTCP) congestion control
      algorithm, which makes a different choice depending on whether ECE
      is set.
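
      The idea of a dedicated per-ACK callback that receives a set of
      flags can be modelled in user space as follows. The flag and struct
      names here are hypothetical stand-ins chosen for this sketch, not the
      kernel's definitions.

```c
#include <stdint.h>

/* Hypothetical per-ACK flags passed to the callback. */
enum ex_ack_flags {
	EX_ACK_SLOWPATH   = 1u << 0,  /* ACK took the slow path */
	EX_ACK_WIN_UPDATE = 1u << 1,  /* ACK updated the window */
	EX_ACK_ECE        = 1u << 2,  /* ACK had the ECE flag set */
};

struct ex_ack_state {
	uint32_t acks_total;
	uint32_t acks_ece;
};

/* Ops table with an optional per-ACK hook, separate from cwnd events. */
struct ex_ack_ops {
	void (*in_ack_event)(struct ex_ack_state *st, uint32_t flags);
};

/* Example hook: count all ACKs and those carrying ECE. */
static void ex_count_ece(struct ex_ack_state *st, uint32_t flags)
{
	st->acks_total++;
	if (flags & EX_ACK_ECE)
		st->acks_ece++;
}

/* Caller side: invoke the hook only if the module provides one. */
static void ex_deliver_ack(const struct ex_ack_ops *ops,
			   struct ex_ack_state *st, uint32_t flags)
{
	if (ops->in_ack_event)
		ops->in_ack_event(st, flags);
}
```

      Modules that do not care about per-ACK details simply leave the
      hook unset and pay no cost beyond the pointer check.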
      
      Joint work with Daniel Borkmann and Glenn Judd.
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Signed-off-by: Glenn Judd <glenn.judd@morganstanley.com>
      Acked-by: Stephen Hemminger <stephen@networkplumber.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      7354c8c3
    • Daniel Borkmann's avatar
      net: tcp: add flag for ca to indicate that ECN is required · 30e502a3
      Daniel Borkmann authored
      This patch adds a flag to TCP congestion algorithms that lets an
      algorithm request that IPv4/IPv6 sockets be marked as ECN-capable
      transport, that is, ECT(0), when the algorithm requires it.
      
      It is currently used and needed in DataCenter TCP (DCTCP), as it
      requires both peers to assert ECT on all IP packets sent - it
      uses ECN feedback (i.e. CE, Congestion Encountered information)
      from switches inside the data center to derive feedback to the
      end hosts.
      
      Therefore, simply add a new flag to icsk_ca_ops. Note that DCTCP's
      algorithm/behaviour slightly diverges from RFC3168, therefore this
      is only (!) enabled iff the assigned congestion control ops module
      has requested it. This way, the logic is tightly coupled to the
      provided congestion control ops only.
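
      A minimal user-space model of such a per-module flag is sketched
      below. The struct, flag and function names are assumptions of this
      sketch. It models only the forced marking requested through the
      flag and ignores ECN negotiated by other means.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define EX_CONG_NEEDS_ECN 0x2u  /* hypothetical flag value */

struct ex_cong_ops {
	const char *name;
	uint32_t flags;
};

/* True only if the assigned module explicitly asked for ECN. */
static bool ex_ca_needs_ecn(const struct ex_cong_ops *ops)
{
	return ops != NULL && (ops->flags & EX_CONG_NEEDS_ECN) != 0;
}

/*
 * ECN field to force on outgoing packets: ECT(0) is binary 10,
 * Not-ECT is binary 00.
 */
static uint8_t ex_forced_ecn_bits(const struct ex_cong_ops *ops)
{
	return ex_ca_needs_ecn(ops) ? 0x2 : 0x0;
}
```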
      
      Joint work with Florian Westphal and Glenn Judd.
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Glenn Judd <glenn.judd@morganstanley.com>
      Acked-by: Stephen Hemminger <stephen@networkplumber.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      30e502a3
    • Florian Westphal's avatar
      net: tcp: assign tcp cong_ops when tcp sk is created · 55d8694f
      Florian Westphal authored
      Split assignment and initialization from one into two functions.
      
      This is required by followup patches that add Datacenter TCP
      (DCTCP) congestion control algorithm - we need to be able to
      determine if the connection is moderated by DCTCP before the
      3WHS has finished.
      
      As we walk the list of available congestion control algorithms
      during the assignment, we are always guaranteed to have Reno
      present, as it is always compiled in. Therefore, since we're doing
      the early assignment, we no longer have a real use for the Reno
      alias tcp_init_congestion_ops and can thus remove it.
      
      Actual use of the congestion control operations only happens after
      the 3WHS has finished. In some cases, however, get_info() can be
      accessed via diag if implemented, therefore we need to zero out the
      private area for those modules.
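
      The split into an early assignment step and a later initialization
      step can be modelled as below. This is a user-space sketch with
      hypothetical names and an arbitrary private-area size. It shows why
      zeroing the private area at assignment time matters: anything that
      reads it before initialization sees zeros rather than stale data.

```c
#include <stddef.h>
#include <string.h>

#define EX_CA_PRIV_SIZE 64  /* arbitrary size for this sketch */

struct ex_sock;

struct ex_ca_module {
	const char *name;
	void (*init)(struct ex_sock *sk);
};

struct ex_sock {
	const struct ex_ca_module *ca_ops;
	unsigned char ca_priv[EX_CA_PRIV_SIZE];
	int ca_initialized;
};

/* Phase 1, at socket creation: pick the module and clear its private area. */
static void ex_assign_congestion_control(struct ex_sock *sk,
					 const struct ex_ca_module *ops)
{
	sk->ca_ops = ops;
	memset(sk->ca_priv, 0, sizeof(sk->ca_priv));
	sk->ca_initialized = 0;
}

/* Phase 2, after the handshake: let the module set up its state. */
static void ex_init_congestion_control(struct ex_sock *sk)
{
	if (sk->ca_ops && sk->ca_ops->init)
		sk->ca_ops->init(sk);
	sk->ca_initialized = 1;
}

/* Demo module whose init writes a marker into the private area. */
static void ex_demo_init(struct ex_sock *sk)
{
	sk->ca_priv[0] = 1;
}

static const struct ex_ca_module ex_demo = { "demo", ex_demo_init };
```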
      
      Joint work with Daniel Borkmann and Glenn Judd.
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Signed-off-by: Glenn Judd <glenn.judd@morganstanley.com>
      Acked-by: Stephen Hemminger <stephen@networkplumber.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      55d8694f
    • John Fastabend's avatar
      net: sched: cls_rcvp, complete rcu conversion · 53dfd501
      John Fastabend authored
      This completes the cls_rsvp conversion to RCU safe
      copy, update semantics.
      
      As a result all cases of tcf_exts_change occur on
      empty lists now.
      Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      53dfd501
    • Eric Dumazet's avatar
      dql: dql_queued() should write first to reduce bus transactions · 3d9a0d2f
      Eric Dumazet authored
      While running a high-throughput test on a BQL-enabled NIC,
      I found a very high cost in ndo_start_xmit() when accessing BQL data.
      
      It turned out the problem was caused by the compiler trying to be
      smart, which resulted in a bad MESI transaction:
      
        0.05 │  mov    0xc0(%rax),%edi    // LOAD dql->num_queued
        0.48 │  mov    %edx,0xc8(%rax)    // STORE dql->last_obj_cnt = count
       58.23 │  add    %edx,%edi
        0.58 │  cmp    %edi,0xc4(%rax)
        0.76 │  mov    %edi,0xc0(%rax)    // STORE dql->num_queued += count
        0.72 │  js     bd8
      
      I got an incredible 10 % gain [1] by making sure the CPU does not
      attempt to get the cache line in Shared mode, but directly requests
      ownership.
      
      New code :
      	mov    %edx,0xc8(%rax)  // STORE dql->last_obj_cnt = count
      	add    %edx,0xc0(%rax)  // RMW   dql->num_queued += count
      	mov    0xc4(%rax),%ecx  // LOAD dql->adj_limit
      	mov    0xc0(%rax),%edx  // LOAD dql->num_queued
      	cmp    %edx,%ecx
      
      The TX completion was running on another CPU, with a high interrupt
      rate.
      
      Note that I am using barrier() as a soft hint, as mb() here could be
      too costly.
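
      The write-first ordering described above can be sketched in user
      space as follows. The field names come from the assembly comments
      above. The struct layout, the function names and the GCC/Clang style
      compiler barrier are assumptions of this sketch rather than the
      kernel source.

```c
/* Fields named after the assembly comments; layout is illustrative. */
struct ex_dql {
	unsigned int num_queued;
	unsigned int adj_limit;
	unsigned int last_obj_cnt;
};

/* Compiler-only barrier (GCC/Clang syntax); emits no CPU fence. */
#define ex_barrier() __asm__ __volatile__("" ::: "memory")

static inline void ex_dql_queued(struct ex_dql *dql, unsigned int count)
{
	/* Store first, so the line is requested for ownership right away. */
	dql->last_obj_cnt = count;

	/* Keep the compiler from hoisting the num_queued load above the store. */
	ex_barrier();

	dql->num_queued += count;
}

/* Remaining room; negative once more is queued than the limit allows. */
static inline int ex_dql_avail(const struct ex_dql *dql)
{
	return (int)(dql->adj_limit - dql->num_queued);
}
```

      The barrier only constrains the compiler, which matches the "soft
      hint" intent: it costs nothing at run time but does not order memory
      accesses as seen by other CPUs.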
      
      [1] This was a netperf TCP_STREAM with TSO disabled, but GSO enabled.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      3d9a0d2f
  3. 28 Sep, 2014 12 commits