1. 24 Mar, 2019 10 commits
    • Merge branch 'tcp-rx-tx-cache' · bdaba895
      David S. Miller authored
      Eric Dumazet says:
      
      ====================
      tcp: add rx/tx cache to reduce lock contention
      
      On hosts with many cpus we can observe very serious contention
      on the spinlocks used in the mm slab layer.
      
      The following can happen quite often :
      
      1) TX path
        sendmsg() allocates one (fclone) skb on CPU A, sends a clone.
        ACK is received on CPU B, and consumes the skb that was in the retransmit
        queue.
      
      2) RX path
        network driver allocates skb on CPU C
        recvmsg() happens on CPU D, freeing the skb after it has been delivered
        to user space.
      
      In both cases, we are hitting the asymmetric alloc/free pattern
      for which slab has to drain alien caches. At 8 Mpps,
      this represents 16 M alloc/free operations per second and carries a huge penalty.
      
      In an interesting experiment, I tried to use a single kmem_cache for all the skbs
      (in skb_init() : skbuff_fclone_cache = skbuff_head_cache =
                        kmem_cache_create("skbuff_fclone_cache", sizeof(struct sk_buff_fclones),);
      and most of the contention disappeared, since cpus could better use
      their local slab per-cpu cache.
      
      But we can actually do better, in the following patches.
      
      TX : at ACK time, no longer free the skb but put it back in a tcp socket cache,
           so that next sendmsg() can reuse it immediately.
      
      RX : at recvmsg() time, do not free the skb but put it in a tcp socket cache
         so that it can be freed by the cpu feeding the incoming packets in BH.
      
      This increased the performance of a small RPC benchmark by about 10% on a host
      with 112 hyperthreads.
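
      A rough C sketch of the TX-side reuse (a minimal illustration under this
      series' naming; sk_tx_skb_cache follows the series, the helper name is
      hypothetical):

          /* Before allocating a fresh skb, sendmsg() first checks the
           * per-socket cache that the ACK path refilled instead of freeing.
           */
          static struct sk_buff *tcp_tx_cache_get(struct sock *sk, int size, gfp_t gfp)
          {
                  struct sk_buff *skb = sk->sk_tx_skb_cache;

                  if (skb) {
                          sk->sk_tx_skb_cache = NULL;     /* take ownership */
                          pskb_trim(skb, 0);              /* reset payload for reuse */
                          return skb;
                  }
                  return alloc_skb_fclone(size + MAX_TCP_HEADER, gfp);
          }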
      
      v2 : - Solved a race condition: sk_stream_alloc_skb() now makes sure the prior
             clone has been freed.
           - Really test rps_needed in sk_eat_skb() as claimed.
           - Fixed rps_needed use in drivers/net/tun.c
      
      v3: Added an #ifdef CONFIG_RPS to avoid a compile error (kbuild robot)
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      bdaba895
    • tcp: add one skb cache for rx · 8b27dae5
      Eric Dumazet authored
      Oftentimes, recvmsg() system calls and BH handling for a particular
      TCP socket are done on different cpus.

      This means the incoming skb had to be allocated on one cpu,
      but freed on another.
      
      This incurs high spinlock contention in the slab layer for small RPCs,
      but also a high number of cache line ping pongs for larger packets.

      A full size GRO packet might use 45 page fragments, meaning
      that up to 45 put_page() calls can be involved.
      
      Moreover, performing the __kfree_skb() in the recvmsg() context
      adds latency for user applications, and increases the probability
      of trapping them in backlog processing, since the BH handler
      might find the socket owned by the user.

      This patch, combined with the prior one, increases RPC
      performance by about 10% on servers with a large number of cores.

      (a tcp_rr workload with 10,000 flows and 112 threads reaches 9 Mpps
       instead of 8 Mpps)
      
      This also increases single bulk flow performance on 40Gbit+ links,
      since in this case there are often two cpus working in tandem :
      
       - CPU handling the NIC rx interrupts, feeding the receive queue,
        and (after this patch) freeing the skbs that were consumed.
      
       - CPU in recvmsg() system call, essentially 100 % busy copying out
        data to user space.
      
      Having at most one skb in a per-socket cache has very little risk
      of memory exhaustion, and since it is protected by socket lock,
      its management is essentially free.
      
      Note that if rps/rfs is used, we do not enable this feature, because
      there is a high chance that the same cpu is handling both the recvmsg()
      system call and the TCP rx path, but that another cpu did the skb
      allocations in the device driver right before the RPS/RFS logic.
      
      To properly handle this case, it seems we would need to record
      on which cpu skb was allocated, and use a different channel
      to give skbs back to this cpu.
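
      A minimal sketch of the idea as it would look in sk_eat_skb() (simplified;
      field and branch names follow this series, accounting details omitted):

          static inline void sk_eat_skb(struct sock *sk, struct sk_buff *skb)
          {
                  __skb_unlink(skb, &sk->sk_receive_queue);
          #ifdef CONFIG_RPS
                  /* With RPS/RFS the allocating cpu is likely not the BH cpu:
                   * skip the cache and free as before.
                   */
                  if (static_branch_unlikely(&rps_needed)) {
                          __kfree_skb(skb);
                          return;
                  }
          #endif
                  if (!sk->sk_rx_skb_cache) {
                          sk->sk_rx_skb_cache = skb;  /* freed later by the cpu running BH */
                          return;
                  }
                  __kfree_skb(skb);
          }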
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
      Acked-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      8b27dae5
    • tcp: add one skb cache for tx · 472c2e07
      Eric Dumazet authored
      On hosts with a lot of cores, RPC workloads suffer from heavy contention on slab spinlocks.
      
          20.69%  [kernel]       [k] queued_spin_lock_slowpath
           5.64%  [kernel]       [k] _raw_spin_lock
           3.83%  [kernel]       [k] syscall_return_via_sysret
           3.48%  [kernel]       [k] __entry_text_start
           1.76%  [kernel]       [k] __netif_receive_skb_core
           1.64%  [kernel]       [k] __fget
      
      For each sendmsg(), we allocate one skb, and free it when the ACK packet comes.

      In many cases, ACK packets are handled by other cpus, and this unfortunately
      incurs heavy costs for the slab layer.

      This patch uses an extra pointer in the socket structure, so that we try to reuse
      the same skb and avoid these expensive costs.
      
      We cache at most one skb per socket so this should be safe as far as
      memory pressure is concerned.
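
      A hedged sketch of the free-side idea (the hook placement and helper name
      are simplifications, not the exact patch; write-queue accounting is omitted):

          /* When an ACK releases an skb from the retransmit queue, park it in
           * the per-socket cache instead of returning it to slab, so the next
           * sendmsg() can reuse it.
           */
          static void tcp_tx_cache_or_free(struct sock *sk, struct sk_buff *skb)
          {
                  if (!sk->sk_tx_skb_cache) {
                          sk->sk_tx_skb_cache = skb;  /* at most one cached skb per socket */
                          return;
                  }
                  sk_wmem_free_skb(sk, skb);
          }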
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
      Acked-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      472c2e07
    • net: convert rps_needed and rfs_needed to new static branch api · dc05360f
      Eric Dumazet authored
      We prefer static_branch_unlikely() over static_key_false() these days.
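
      For reference, the conversion looks roughly like this (illustrative fragment
      only, using rps_needed as the example key):

          /* Old jump-label API */
          extern struct static_key rps_needed;
          ...
          if (static_key_false(&rps_needed))
                  /* RPS processing */;

          /* New static branch API */
          DECLARE_STATIC_KEY_FALSE(rps_needed);
          ...
          if (static_branch_unlikely(&rps_needed))
                  /* RPS processing */;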
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
      Acked-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      dc05360f
    • Merge branch 'net-dev-BYPASS-for-lockless-qdisc' · 7c1508e5
      David S. Miller authored
      Paolo Abeni says:
      
      ====================
      net: dev: BYPASS for lockless qdisc
      
      This patch series is aimed at improving the xmit performance of lockless qdiscs
      in the uncontended scenario.

      After the lockless refactor, pfifo_fast can't leverage the BYPASS optimization.
      Due to retpolines, the overhead of the avoidable enqueue and dequeue operations
      has increased and we see measurable regressions.
      
      The first patch introduces the BYPASS code path for lockless qdisc, and the
      second one optimizes such path further. Overall this avoids up to 3 indirect
      calls per xmit packet. Detailed performance figures are reported in the 2nd
      patch.
      
       v2 -> v3:
        - qdisc_is_empty() has a const argument (Eric)
      
       v1 -> v2:
        - use really an 'empty' flag instead of 'not_empty', as
          suggested by Eric
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      7c1508e5
    • net: dev: introduce support for sch BYPASS for lockless qdisc · ba27b4cd
      Paolo Abeni authored
      With commit c5ad119f ("net: sched: pfifo_fast use skb_array"),
      pfifo_fast no longer benefits from the TCQ_F_CAN_BYPASS optimization.
      Due to retpolines, the cost of the enqueue()/dequeue() pair has become
      relevant and we observe a measurable regression in the uncontended
      scenario when the packet rate is below line rate.
      
      After commit 46b1c18f ("net: sched: put back q.qlen into a
      single location") we can check for empty qdisc with a reasonably
      fast operation even for nolock qdiscs.
      
      This change extends TCQ_F_CAN_BYPASS support to nolock qdisc.
      The new chunk of code mirrors closely the existing one for traditional
      qdisc, leveraging a newly introduced helper to read atomically the
      qdisc length.
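
      A condensed sketch of the new bypass path in __dev_xmit_skb() (simplified;
      statistics updates and the contended fallback details are trimmed):

          /* rc, to_free, txq and dev come from the enclosing __dev_xmit_skb() */
          if (q->flags & TCQ_F_NOLOCK) {
                  if ((q->flags & TCQ_F_CAN_BYPASS) && qdisc_is_empty(q) &&
                      qdisc_run_begin(q)) {
                          /* Empty qdisc and uncontended seqlock: transmit directly,
                           * skipping the enqueue()/dequeue() indirect calls.
                           */
                          if (sch_direct_xmit(skb, q, dev, txq, NULL, true) &&
                              !qdisc_is_empty(q))
                                  __qdisc_run(q);

                          qdisc_run_end(q);
                          return NET_XMIT_SUCCESS;
                  }

                  rc = q->enqueue(skb, q, &to_free) & NET_XMIT_MASK;
                  qdisc_run(q);
          }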
      
      Tested with pktgen in queue xmit mode, with pfifo_fast, a MQ
      device, and MQ root qdisc:
      
      threads         vanilla         patched
                      kpps            kpps
      1               2465            2889
      2               4304            5188
      4               7898            9589
      
      Same as above, but with a single queue device:
      
      threads         vanilla         patched
                      kpps            kpps
      1               2556            2827
      2               2900            2900
      4               5000            5000
      8               4700            4700
      
      No measurable changes in the contended scenarios, and more than 10%
      improvement in the uncontended ones.
      
       v1 -> v2:
        - rebased after flag name change
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Tested-by: Ivan Vecera <ivecera@redhat.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Reviewed-by: Ivan Vecera <ivecera@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ba27b4cd
    • net: sched: add empty status flag for NOLOCK qdisc · 28cff537
      Paolo Abeni authored
      The queue is marked not empty after acquiring the seqlock,
      and it's up to the NOLOCK qdisc to clear such flag on dequeue.
      Since the empty status lies on the same cache line as the
      seqlock, it's always hot in cache during the updates.

      This makes the empty flag update a little bit loose. Given
      the lack of synchronization between enqueue and dequeue, this
      is unavoidable.
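
      A minimal sketch of the flag and its reader (simplified; the field and
      helper names follow this series):

          struct Qdisc {
                  ...
                  bool                    empty;  /* shares the seqlock cache line */
                  ...
          };

          static inline bool qdisc_is_empty(const struct Qdisc *qdisc)
          {
                  return READ_ONCE(qdisc->empty);
          }

          /* enqueue side, after taking the seqlock */
          WRITE_ONCE(qdisc->empty, false);

          /* NOLOCK dequeue side, once the queue drains */
          WRITE_ONCE(qdisc->empty, true);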
      
      v2 -> v3:
       - qdisc_is_empty() has a const argument (Eric)
      
      v1 -> v2:
       - use really an 'empty' flag instead of 'not_empty', as
         suggested by Eric
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Reviewed-by: Ivan Vecera <ivecera@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      28cff537
    • tcp: add documentation for tcp_ca_state · 576fd2f7
      Soheil Hassas Yeganeh authored
      Add documentation to the tcp_ca_state enum, since this enum is
      exposed in uapi.
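
      The flavor of the added documentation, as an excerpt-style illustration
      (the enum values are the existing ones; the comment text here is
      paraphrased, not the patch wording):

          enum tcp_ca_state {
                  TCP_CA_Open = 0,        /* normal state, no dubious events */
                  TCP_CA_Disorder = 1,    /* reordering or partial ACKs suspected */
                  TCP_CA_CWR = 2,         /* cwnd was reduced (ECN, local congestion) */
                  TCP_CA_Recovery = 3,    /* fast recovery after a retransmit */
                  TCP_CA_Loss = 4         /* RTO-based loss recovery */
          };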
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
      Cc: Sowmini Varadhan <sowmini05@gmail.com>
      Acked-by: Sowmini Varadhan <sowmini05@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      576fd2f7
    • tcp: remove conditional branches from tcp_mstamp_refresh() · e6d14070
      Eric Dumazet authored
      tcp_clock_ns() (aka ktime_get_ns()) uses a monotonic clock,
      so the checks we had in tcp_mstamp_refresh() are no longer
      relevant.

      This patch removes a cpu stall (when the cache line is not hot).
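
      The resulting helper is essentially branch-free; a sketch consistent with
      the description above (field names are existing tcp_sock members):

          static void tcp_mstamp_refresh(struct tcp_sock *tp)
          {
                  u64 val = tcp_clock_ns();   /* monotonic, never goes backwards */

                  tp->tcp_clock_cache = val;
                  tp->tcp_mstamp = div_u64(val, NSEC_PER_USEC);
          }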
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      e6d14070
    • net: phy: Correct Cygnus/Omega PHY driver prompt · a7a01ab3
      Florian Fainelli authored
      The tristate prompt should have been replaced rather than defined a second
      time a few lines below; this was a rebase mistake.
      
      Fixes: 17cc9821 ("net: phy: Move Omega PHY entry to Cygnus PHY driver")
      Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a7a01ab3
  2. 22 Mar, 2019 3 commits
    • genetlink: make policy common to family · 3b0f31f2
      Johannes Berg authored
      Since maxattr is common, the policy can't really differ sanely,
      so make it common as well.
      
      The only user that did in fact manage to make a non-common policy
      is taskstats, which has to be really careful about it (since it's
      still using a common maxattr!). This is no longer supported, but
      we can fake it using pre_doit.
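
      An illustrative before/after for a hypothetical family (names are made up,
      not taken from the patch):

          static const struct genl_ops foo_ops[] = {
                  {
                          .cmd  = FOO_CMD_GET,
                          .doit = foo_get_doit,
                          /* .policy = foo_policy,  <- removed from every op... */
                  },
          };

          static struct genl_family foo_family = {
                  .name    = "foo",
                  .maxattr = FOO_ATTR_MAX,
                  .policy  = foo_policy,  /* ...and set once on the family */
                  .ops     = foo_ops,
                  .n_ops   = ARRAY_SIZE(foo_ops),
          };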
      
      This reduces the size of e.g. nl80211.o (which has lots of commands):
      
         text	   data	    bss	    dec	    hex	filename
       398745	  14323	   2240	 415308	  6564c	net/wireless/nl80211.o (before)
       397913	  14331	   2240	 414484	  65314	net/wireless/nl80211.o (after)
      --------------------------------
         -832      +8       0    -824
      
      Which is obviously just 8 bytes for each command, and an added 8
      bytes for the new policy pointer. I'm not sure why the ops list is
      counted as .text though.
      
      Most of the code transformations were done using the following spatch:
          @ops@
          identifier OPS;
          expression POLICY;
          @@
          struct genl_ops OPS[] = {
          ...,
           {
          -	.policy = POLICY,
           },
          ...
          };
      
          @@
          identifier ops.OPS;
          expression ops.POLICY;
          identifier fam;
          expression M;
          @@
          struct genl_family fam = {
                  .ops = OPS,
                  .maxattr = M,
          +       .policy = POLICY,
                  ...
          };
      
      This also gets rid of devlink_nl_cmd_region_read_dumpit() accessing
      the cb->data as ops, which we want to change in a later genl patch.
      Signed-off-by: Johannes Berg <johannes.berg@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      3b0f31f2
    • r8169: use netif_start_queue instead of netif_wake_queue in rtl8169_start_xmit · 601ed4d6
      Heiner Kallweit authored
      Replace the call to netif_wake_queue in rtl8169_start_xmit with
      netif_start_queue: we don't need to actually wake up the queue since
      we are still mid-transmit; we just need to clear the bit so it
      doesn't prevent the next transmit.
      (Description shamelessly copied from a mail sent by Alex.)
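
      The relevant API difference, shown as an illustrative fragment (not driver
      code):

          netif_start_queue(dev);  /* only clears __QUEUE_STATE_DRV_XOFF */
          netif_wake_queue(dev);   /* clears the bit and also calls __netif_schedule(),
                                    * which is pointless while we are still inside
                                    * ndo_start_xmit()
                                    */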
      Suggested-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      601ed4d6
    • net: phy: aquantia: add downshift support · 110a2432
      Heiner Kallweit authored
      Aquantia PHYs of the AQR107 family support the downshift feature.
      Add support for it as a standard PHY tunable so that it can be controlled
      via ethtool.
      The AQCS109 supports a proprietary 2-pair 1Gbps mode. If two such PHYs
      are connected to each other with a 2-pair cable, they may not be able
      to establish a link if both advertise modes > 1Gbps.
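
      A minimal sketch of wiring downshift up as a standard PHY tunable (the
      phy_driver hook signatures are the standard ones; the aqr107_* helper names
      are assumptions, not necessarily the actual driver's):

          static int aqr107_get_tunable(struct phy_device *phydev,
                                        struct ethtool_tunable *tuna, void *data)
          {
                  switch (tuna->id) {
                  case ETHTOOL_PHY_DOWNSHIFT:
                          return aqr107_get_downshift(phydev, data);
                  default:
                          return -EOPNOTSUPP;
                  }
          }

          static struct phy_driver aqr_driver[] = { {
                  PHY_ID_MATCH_MODEL(PHY_ID_AQR107),
                  .name        = "Aquantia AQR107",
                  .get_tunable = aqr107_get_tunable,
                  .set_tunable = aqr107_set_tunable,
                  /* ... */
          } };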
      
      v2:
      - add downshift event detection
      - warn if downshift occurred
      - read downshifted rate from vendor register
      - enable downshift by default on all AQR107 family members
      Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      110a2432
  3. 21 Mar, 2019 27 commits