1. 17 May, 2017 28 commits
  2. 16 May, 2017 12 commits
    • net: phy: Remove residual magic from PHY drivers · 1b86f702
      Andrew Lunn authored
      commit fa8cddaf ("net phylib: Remove unnecessary condition check in phy")
      removed the only place where the PHY flag PHY_HAS_MAGICANEG was
      checked. But it left the flag being set in the drivers. Remove the flag.
      Signed-off-by: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      1b86f702
    • bnx2x: Remove open coded carrier check · 3fdd34c1
      Leon Romanovsky authored
      There is an inline function to test whether the carrier is present,
      which makes the open-coded solution redundant.
      Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
      Acked-by: Yuval Mintz <Yuval.Mintz@cavium.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      3fdd34c1
    • tcp: internal implementation for pacing · 218af599
      Eric Dumazet authored
      BBR congestion control depends on pacing, and pacing is
      currently handled by sch_fq packet scheduler for performance reasons,
      and also because implementing pacing with FQ was convenient to truly
      avoid bursts.
      
      However there are many cases where this packet scheduler constraint
      is not practical.
      - Many linux hosts are not focusing on handling thousands of TCP
        flows in the most efficient way.
      - Some routers use fq_codel or other AQM, but still would like
        to use BBR for the few TCP flows they initiate/terminate.
      
      This patch implements an automatic fallback to internal pacing.
      
      Pacing is requested either by BBR or use of SO_MAX_PACING_RATE option.
      
      If sch_fq happens to be in the egress path, pacing is delegated to
      the qdisc, otherwise pacing is done by TCP itself.
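
For the SO_MAX_PACING_RATE case, an application could request pacing roughly as follows. This is a minimal sketch, not code from the patch: the helper names are hypothetical, the rate is assumed to be passed as a 32-bit bytes-per-second value, and the fallback define assumes the asm-generic option number (some architectures use a different one).

```c
#include <assert.h>
#include <stdint.h>
#include <sys/socket.h>

#ifndef SO_MAX_PACING_RATE
#define SO_MAX_PACING_RATE 47	/* asm-generic value; assumption for old headers */
#endif

/* Convert Mbit/s to bytes per second, the unit the option expects.
 * The result is truncated to 32 bits, so very large rates do not fit. */
uint32_t mbit_to_bytes_per_sec(uint32_t mbit)
{
	return (uint32_t)(((uint64_t)mbit * 1000000ULL) / 8);
}

/* Hypothetical helper: cap the pacing rate of an already created socket.
 * Returns 0 on success, -1 with errno set on failure. */
int set_max_pacing_rate(int fd, uint32_t bytes_per_sec)
{
	assert(fd >= 0);
	return setsockopt(fd, SOL_SOCKET, SO_MAX_PACING_RATE,
			  &bytes_per_sec, sizeof(bytes_per_sec));
}
```

With these helpers, a flow limited to about 48 Mbit per second would call set_max_pacing_rate(fd, mbit_to_bytes_per_sec(48)), i.e. 6000000 bytes per second.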
      
      One advantage of pacing from TCP stack is to get more precise rtt
      estimations, and less work done from TX completion, since TCP Small
      queue limits are not generally hit. Setups with single TX queue but
      many cpus might even benefit from this.
      
      Note that unlike sch_fq, we do not take into account header sizes.
      Taking care of these headers would add additional complexity for
      no practical differences in behavior.
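
To illustrate what ignoring header sizes means, here is a minimal sketch of a per-packet pacing delay derived only from the payload length and the pacing rate. The function name and signature are hypothetical and are not taken from the kernel; the rate is assumed to be in bytes per second.

```c
#include <assert.h>
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

/* Time to wait before the next transmission after sending len payload
 * bytes at rate_bytes_per_sec. Headers are deliberately not counted. */
uint64_t pacing_delay_ns(uint32_t len, uint64_t rate_bytes_per_sec)
{
	assert(rate_bytes_per_sec > 0);
	return ((uint64_t)len * NSEC_PER_SEC) / rate_bytes_per_sec;
}
```

At 6000000 bytes per second (about 48 Mbit), a 1500 byte payload yields a delay of 250000 ns.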
      
      Some performance numbers using 800 TCP_STREAM flows rate limited to
      ~48 Mbit per second on 40Gbit NIC.
      
      If MQ+pfifo_fast is used on the NIC :
      
      $ sar -n DEV 1 5 | grep eth
      14:48:44         eth0 725743.00 2932134.00  46776.76 4335184.68      0.00      0.00      1.00
      14:48:45         eth0 725349.00 2932112.00  46751.86 4335158.90      0.00      0.00      0.00
      14:48:46         eth0 725101.00 2931153.00  46735.07 4333748.63      0.00      0.00      0.00
      14:48:47         eth0 725099.00 2931161.00  46735.11 4333760.44      0.00      0.00      1.00
      14:48:48         eth0 725160.00 2931731.00  46738.88 4334606.07      0.00      0.00      0.00
      Average:         eth0 725290.40 2931658.20  46747.54 4334491.74      0.00      0.00      0.40
      $ vmstat 1 5
      procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
       r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
       4  0      0 259825920  45644 2708324    0    0    21     2  247   98  0  0 100  0  0
       4  0      0 259823744  45644 2708356    0    0     0     0 2400825 159843  0 19 81  0  0
       0  0      0 259824208  45644 2708072    0    0     0     0 2407351 159929  0 19 81  0  0
       1  0      0 259824592  45644 2708128    0    0     0     0 2405183 160386  0 19 80  0  0
       1  0      0 259824272  45644 2707868    0    0     0    32 2396361 158037  0 19 81  0  0
      
      Now use MQ+FQ :
      
      lpaa23:~# echo fq >/proc/sys/net/core/default_qdisc
      lpaa23:~# tc qdisc replace dev eth0 root mq
      
      $ sar -n DEV 1 5 | grep eth
      14:49:57         eth0 678614.00 2727930.00  43739.13 4033279.14      0.00      0.00      0.00
      14:49:58         eth0 677620.00 2723971.00  43674.69 4027429.62      0.00      0.00      1.00
      14:49:59         eth0 676396.00 2719050.00  43596.83 4020125.02      0.00      0.00      0.00
      14:50:00         eth0 675197.00 2714173.00  43518.62 4012938.90      0.00      0.00      1.00
      14:50:01         eth0 676388.00 2719063.00  43595.47 4020171.64      0.00      0.00      0.00
      Average:         eth0 676843.00 2720837.40  43624.95 4022788.86      0.00      0.00      0.40
      $ vmstat 1 5
      procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
       r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
       2  0      0 259832240  46008 2710912    0    0    21     2  223  192  0  1 99  0  0
       1  0      0 259832896  46008 2710744    0    0     0     0 1702206 198078  0 17 82  0  0
       0  0      0 259830272  46008 2710596    0    0     0     0 1696340 197756  1 17 83  0  0
       4  0      0 259829168  46024 2710584    0    0    16     0 1688472 197158  1 17 82  0  0
       3  0      0 259830224  46024 2710408    0    0     0     0 1692450 197212  0 18 82  0  0
      
      As expected, the number of interrupts per second is very different:
      about 2.4 million with MQ+pfifo_fast versus about 1.7 million with MQ+FQ.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Van Jacobson <vanj@google.com>
      Cc: Jerry Chu <hkchu@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      218af599
    • Merge branch 'udp-scalability-improvements' · 8dfedc53
      David S. Miller authored
      Paolo Abeni says:
      
      ====================
      udp: scalability improvements
      
      This patch series implements an idea suggested by Eric Dumazet to
      reduce the contention of the udp sk_receive_queue lock when the socket is
      under flood.
      
      An ancillary queue is added to the udp socket, and the socket always
      tries first to read packets from such queue. If it's empty, we splice
      the content from sk_receive_queue into the ancillary queue.
      
      The first patch introduces some helpers to keep the udp code small, and the
      following two implement the ancillary queue strategy. The code is split
      to hopefully help the reviewing process.
      
      The measured overall gain under udp flood is up to 30%, depending on
      the NUMA layout and the number of ingress queues used by the relevant NIC.
      
      The performance numbers have been gathered using pktgen as sender, with
      64-byte packets and random source ports, on a host connected back-to-back
      to the DUT via a 10Gb/s link.
      
      The receiver used the udp_sink program by Jesper [1] and an h/w l4 rx hash on
      the ingress nic, so that the number of ingress nic rx queues hit by the udp
      traffic could be controlled via ethtool -L.
      
      The udp_sink program was bound to the first idle cpu, to get more
      stable numbers.
      
      On a single numa node receiver:
      
      nic rx queues           vanilla                 patched kernel
      1                       1820 kpps               1900 kpps
      2                       1950 kpps               2500 kpps
      16                      1670 kpps               2120 kpps
      
      When using a single NIC rx queue, busy polling was also enabled;
      otherwise, in that scenario, the bh processing becomes the bottleneck
      and produces large artifacts in the measured performance (e.g.
      improving the udp sink run time decreases the overall throughput, since
      more action from the scheduler comes into play).
      
      [1] https://github.com/netoptimizer/network-testing/blob/master/src/udp_sink.c
      
      v1 -> v2:
        Patches 1/3 and 2/3 are unchanged, in patch 3/3 the rx_queue_lock_held param
        of udp_rmem_release() is now a bool.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      8dfedc53
    • udp: keep the sk_receive_queue held when splicing · 6dfb4367
      Paolo Abeni authored
      On packet reception, when we are forced to splice the
      sk_receive_queue, we can keep the related lock held, so
      that we can avoid re-acquiring it, if fwd memory
      scheduling is required.
      
      v1 -> v2:
        the rx_queue_lock_held param in udp_rmem_release() is
        now a bool
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      6dfb4367
    • udp: use a separate rx queue for packet reception · 2276f58a
      Paolo Abeni authored
      Under udp flood the sk_receive_queue spinlock is heavily contended.
      This patch tries to reduce the contention on that lock by adding a
      second receive queue to the udp sockets; recvmsg() looks first
      in that queue and, only if it is empty, tries to fetch the data from
      sk_receive_queue. The latter is spliced into the newly added
      queue every time the receive path has to acquire the
      sk_receive_queue lock.
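
The two-queue strategy described above can be sketched in user space as follows. This is only an illustration under simplifying assumptions: all type and function names are hypothetical, a pthread mutex stands in for the sk_receive_queue spinlock, a single reader is assumed so the private queue needs no lock, and forward memory accounting is left out.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

struct pkt {
	struct pkt *next;
	int id;
};

struct pkt_queue {
	struct pkt *head;
	struct pkt *tail;
};

struct two_queue_sock {
	pthread_mutex_t rx_lock;	/* stands in for the sk_receive_queue lock */
	struct pkt_queue rx;		/* filled by the receive path */
	struct pkt_queue reader;	/* drained by a single reader, no lock */
};

void queue_push(struct pkt_queue *q, struct pkt *p)
{
	p->next = NULL;
	if (q->tail)
		q->tail->next = p;
	else
		q->head = p;
	q->tail = p;
}

struct pkt *queue_pop(struct pkt_queue *q)
{
	struct pkt *p = q->head;

	if (p) {
		q->head = p->next;
		if (!q->head)
			q->tail = NULL;
		p->next = NULL;
	}
	return p;
}

void sock_init(struct two_queue_sock *s)
{
	pthread_mutex_init(&s->rx_lock, NULL);
	s->rx.head = s->rx.tail = NULL;
	s->reader.head = s->reader.tail = NULL;
}

/* Producer side: every enqueue takes the contended lock. */
void sock_enqueue(struct two_queue_sock *s, struct pkt *p)
{
	assert(p != NULL);
	pthread_mutex_lock(&s->rx_lock);
	queue_push(&s->rx, p);
	pthread_mutex_unlock(&s->rx_lock);
}

/* Consumer side: try the private queue first; only when it is empty,
 * take the lock once and splice the whole rx list over in O(1). */
struct pkt *sock_dequeue(struct two_queue_sock *s)
{
	struct pkt *p = queue_pop(&s->reader);

	if (p)
		return p;

	pthread_mutex_lock(&s->rx_lock);
	s->reader = s->rx;
	s->rx.head = s->rx.tail = NULL;
	pthread_mutex_unlock(&s->rx_lock);

	return queue_pop(&s->reader);
}
```

The point of the design is that the reader takes the shared lock once per batch of packets rather than once per packet, which is where the reduced contention would come from.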
      
      The accounting of forward allocated memory is still protected with
      the sk_receive_queue lock, so udp_rmem_release() needs to acquire
      both locks when the forward deficit is flushed.
      
      In specific scenarios we can end up acquiring and releasing the
      sk_receive_queue lock multiple times; that will be covered by
      the next patch.
      Suggested-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      2276f58a
    • net/sock: factor out dequeue/peek with offset code · 65101aec
      Paolo Abeni authored
      And update __sk_queue_drop_skb() to work on the specified queue.
      This will help the udp protocol to use an additional private
      rx queue in a later patch.
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      65101aec
    • Merge branch 'nfp-LSO-checksum-and-XDP-datapath-updates' · 9dca599b
      David S. Miller authored
      Jakub Kicinski says:
      
      ====================
      nfp: LSO, checksum and XDP datapath updates
      
      This series introduces a number of refinements to standard features
      like LSO and checksum offload.  Three major features are support for
      CHECKSUM_COMPLETE, refinement of TSO handling and another small speed
      up for XDP TX.  This series also switches from depending on some
      app FW<>driver ABI versions to heavier use of capabilities.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      9dca599b
    • nfp: eliminate an if statement in calculation of completed frames · 730b3ab5
      Jakub Kicinski authored
      Given that our rings are always a power of 2, we can simplify the
      calculation of the number of completed TX descriptors by using masking
      instead of an if statement based on whether the index has wrapped
      or not.
      Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
      Reviewed-by: Simon Horman <simon.horman@netronome.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      730b3ab5
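
The equivalence behind this change can be sketched as below. The function and parameter names are hypothetical rather than the driver's; both read pointers are assumed to already lie in the range [0, ring_cnt), and ring_cnt is assumed to be a power of two.

```c
#include <assert.h>
#include <stdint.h>

/* Branching form: handle the wrapped and non-wrapped cases separately. */
uint32_t completed_with_branch(uint32_t hw_rd, uint32_t sw_rd,
			       uint32_t ring_cnt)
{
	if (hw_rd >= sw_rd)
		return hw_rd - sw_rd;
	return hw_rd + ring_cnt - sw_rd;
}

/* Masked form: unsigned subtraction wraps modulo 2^32, and masking with
 * ring_cnt - 1 reduces the result modulo the ring size. */
uint32_t completed_with_mask(uint32_t hw_rd, uint32_t sw_rd,
			     uint32_t ring_cnt)
{
	assert(ring_cnt != 0 && (ring_cnt & (ring_cnt - 1)) == 0);
	return (hw_rd - sw_rd) & (ring_cnt - 1);
}
```

The masked form only works because 2^32 is a multiple of any power-of-two ring size, so the wrapped difference keeps the correct low bits.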
    • nfp: add a helper for wrapping descriptor index · 4aa3b766
      Jakub Kicinski authored
      We have a number of places where we calculate the descriptor
      index based on a value which may have overflowed.  Create a
      macro for masking with the ring size.
      Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
      Reviewed-by: Simon Horman <simon.horman@netronome.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      4aa3b766
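
A helper of this kind might look like the following sketch. The macro name, the structure, and its fields are hypothetical stand-ins and may differ from what the driver defines; the ring size is assumed to be a power of two and the write pointer is assumed to be a free-running counter.

```c
#include <assert.h>
#include <stdint.h>

struct ring {
	uint32_t cnt;	/* number of descriptors, power of two */
	uint32_t wr_p;	/* free-running write counter, may overflow */
};

/* Map a possibly overflowed counter value onto a descriptor index. */
#define RING_IDX(ring, idx)	((idx) & ((ring)->cnt - 1))

/* Return the index of the next descriptor to fill and advance wr_p. */
uint32_t ring_next_slot(struct ring *r)
{
	assert(r->cnt != 0 && (r->cnt & (r->cnt - 1)) == 0);
	return RING_IDX(r, r->wr_p++);
}
```

Because the counter is masked only when it is used as an index, it can be left to overflow naturally without any explicit wrap handling.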
    • nfp: complete the XDP TX ring only when it's full · abeeec4a
      Jakub Kicinski authored
      Since XDP TX ring holds "spare" RX buffers anyway, we don't have to
      rush the completion.  We can wait until ring fills up completely
      before trying to reclaim buffers.  If RX poll has ended and no
      buffer has been queued for XDP TX we have no guarantee we will see
      another interrupt, so run the reclaim there as well, to make sure
      TX statistics won't become stale.
      
      This should help us reclaim more buffers per single queue controller
      register read.
      
      Note that the XDP completion is very trivial, it only adds up
      the sizes of transmitted frames for statistics so the latency
      spike should be acceptable.  In case user sets the ring sizes
      to something crazy, limit the completion to 2k entries.
      
      The check if the ring is empty at the beginning of xdp_complete()
      is no longer needed - the callers will perform it.
      Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      abeeec4a
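
The deferred completion policy in this commit message can be sketched as follows. This is an illustration only: every name is hypothetical, the number of entries the hardware has finished is passed in as hw_done instead of being read from a queue controller register, and the per-frame statistics are omitted.

```c
#include <assert.h>
#include <stdint.h>

#define XDP_COMPLETE_MAX 2048u	/* the "2k entries" cap from the message */

struct xdp_ring {
	uint32_t cnt;	/* ring size */
	uint32_t wr_p;	/* free-running count of queued entries */
	uint32_t rd_p;	/* free-running count of reclaimed entries */
};

int xdp_ring_full(const struct xdp_ring *r)
{
	return r->wr_p - r->rd_p == r->cnt;
}

int xdp_ring_empty(const struct xdp_ring *r)
{
	return r->wr_p == r->rd_p;
}

/* Reclaim finished entries, bounded by the cap. Callers check emptiness. */
uint32_t xdp_complete(struct xdp_ring *r, uint32_t hw_done)
{
	uint32_t pending = r->wr_p - r->rd_p;
	uint32_t todo = hw_done < pending ? hw_done : pending;

	assert(pending <= r->cnt);
	if (todo > XDP_COMPLETE_MAX)
		todo = XDP_COMPLETE_MAX;
	r->rd_p += todo;
	return todo;
}

/* TX path: do not rush completion, reclaim only once the ring is full. */
uint32_t xdp_tx_prepare(struct xdp_ring *r, uint32_t hw_done)
{
	if (!xdp_ring_full(r))
		return 0;
	return xdp_complete(r, hw_done);
}

/* End of RX poll: if nothing was queued for XDP TX in this poll, reclaim
 * anyway, since another interrupt is not guaranteed. */
uint32_t xdp_poll_end(struct xdp_ring *r, int queued_this_poll,
		      uint32_t hw_done)
{
	if (queued_this_poll || xdp_ring_empty(r))
		return 0;
	return xdp_complete(r, hw_done);
}
```

Waiting for a full ring means each reclaim pass can cover many entries at once, which matches the stated goal of reclaiming more buffers per register read.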