1. 20 Nov, 2015 21 commits
  2. 18 Nov, 2015 19 commits
    • David S. Miller's avatar
      Merge branch 'net-generic-busy-polling' · 85c72ba1
      David S. Miller authored
      Eric Dumazet says:
      
      ====================
      net: extend busy polling support
      
      This patch series extends busy polling range to tunnels devices,
      and adds busy polling generic support to all NAPI drivers.
      
      No need to provide ndo_busy_poll() method and extra synchronization
      between ndo_busy_poll() and normal napi->poll() method.
      This was proven very difficult and bug prone.
      
      mlx5 driver is changed to support busy polling using this new method,
      and a second mlx5 patch adds napi_complete_done() support and proper
      SNMP accounting.
      
      bnx2x and mlx4 drivers are converted to new infrastructure,
      reducing kernel bloat and improving performance.
      
      Latest patch, adding generic support, adds a new requirement :
      
       -free_netdev() and netif_napi_del() must be called from process context.
      
      Since this might not be the case in some drivers, we might have to
      either : fix the non conformant drivers (by disabling busy polling on them)
      or revert this last patch.
      ====================
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      85c72ba1
    • Eric Dumazet's avatar
      net: provide generic busy polling to all NAPI drivers · 93d05d4a
      Eric Dumazet authored
      NAPI drivers no longer need to observe a particular protocol
      to benefit from busy polling (CONFIG_NET_RX_BUSY_POLL=y)
      
      napi_hash_add() and napi_hash_del() are automatically called
      from core networking stack, respectively from
      netif_napi_add() and netif_napi_del()
      
      This patch depends on free_netdev() and netif_napi_del() being
      called from process context, which seems to be the norm.
      
      Drivers might still prefer to call napi_hash_del() on their
      own, since they might combine all the rcu grace periods into
      a single one, knowing their NAPI structures lifetime, while
      core networking stack has no idea of a possible combining.
      
      Once this patch proves to not bring serious regressions,
      we will cleanup drivers to either remove napi_hash_del()
      or provide appropriate rcu grace periods combining.
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      93d05d4a
    • Eric Dumazet's avatar
      net: napi_hash_del() returns a boolean status · 34cbe27e
      Eric Dumazet authored
      napi_hash_del() will soon be used from both drivers (if they want)
      or core networking stack.
      
      Callers are responsibles to ensure an RCU grace period is respected
      before freeing napi structure : napi_hash_del() can signal if
      this RCU grace period is needed or not.
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      34cbe27e
    • Eric Dumazet's avatar
      net: move napi_hash[] into read mostly section · 6180d9de
      Eric Dumazet authored
      We do not often add/delete a napi context.
      Moving napi_hash[] into read mostly section avoids potential false sharing.
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      6180d9de
    • Eric Dumazet's avatar
      net: add netif_tx_napi_add() · d64b5e85
      Eric Dumazet authored
      netif_tx_napi_add() is a variant of netif_napi_add()
      
      It should be used by drivers that use a napi structure
      to exclusively poll TX.
      
      We do not want to add this kind of napi in napi_hash[] in following
      patches, adding generic busy polling to all NAPI drivers.
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      d64b5e85
    • Eric Dumazet's avatar
      net: move skb_mark_napi_id() into core networking stack · 93f93a44
      Eric Dumazet authored
      We would like to automatically provide busy polling support
      to all NAPI drivers, without them having to implement anything.
      
      skb_mark_napi_id() can be called from napi_gro_receive() and
      napi_get_frags().
      
      Few drivers are still calling skb_mark_napi_id() because
      they use netif_receive_skb(). They should eventually call
      napi_gro_receive() instead. I will leave this to drivers
      maintainers.
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      93f93a44
    • Eric Dumazet's avatar
      mlx4: remove mlx4_en_low_latency_recv() · 868fdb06
      Eric Dumazet authored
      Busy polling can now be handled in generic NAPI poll infrastructure.
      This removes complexity and fast path overhead :
      
      mlx4 used two spin_lock()/spin_unlock() pair per napi->poll() call
      in mlx4_en_cq_lock_napi()/mlx4_en_cq_unlock_napi()
      
      Tested:
      
      Without busy polling :
      
      lpaa23:~# echo 0 >/proc/sys/net/core/busy_read
      lpaa24:~# echo 0 >/proc/sys/net/core/busy_read
      lpaa23:~# ./netperf -H lpaa24 -t TCP_RR
      MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to lpaa24.prod.google.com () port 0 AF_INET : first burst 0
      Local /Remote
      Socket Size   Request  Resp.   Elapsed  Trans.
      Send   Recv   Size     Size    Time     Rate
      bytes  Bytes  bytes    bytes   secs.    per sec
      
      16384  87380  1        1       10.00    47330.78
      
      With busy polling :
      
      lpaa23:~# echo 70 >/proc/sys/net/core/busy_read
      lpaa24:~# echo 70 >/proc/sys/net/core/busy_read
      lpaa23:~# ./netperf -H lpaa24 -t TCP_RR
      MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to lpaa24.prod.google.com () port 0 AF_INET : first burst 0
      Local /Remote
      Socket Size   Request  Resp.   Elapsed  Trans.
      Send   Recv   Size     Size    Time     Rate
      bytes  Bytes  bytes    bytes   secs.    per sec
      
      16384  87380  1        1       10.00    97643.55
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      868fdb06
    • Eric Dumazet's avatar
      bnx2x: remove bnx2x_low_latency_recv() support · b59768c6
      Eric Dumazet authored
      Switch to native NAPI polling, as this reduces overhead and complexity.
      
      Normal path is faster, since one cmpxchg() is not anymore requested,
      and busy polling with the NAPI polling has same performance.
      
      Tested:
      lpk50:~# cat /proc/sys/net/core/busy_read
      70
      lpk50:~# nstat >/dev/null;./netperf -H lpk55 -t TCP_RR;nstat
      MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to lpk55.prod.google.com () port 0 AF_INET : first burst 0
      Local /Remote
      Socket Size   Request  Resp.   Elapsed  Trans.
      Send   Recv   Size     Size    Time     Rate
      bytes  Bytes  bytes    bytes   secs.    per sec
      
      16384  87380  1        1       10.00    40095.07
      16384  87380
      IpInReceives                    401062             0.0
      IpInDelivers                    401062             0.0
      IpOutRequests                   401079             0.0
      TcpActiveOpens                  7                  0.0
      TcpPassiveOpens                 3                  0.0
      TcpAttemptFails                 3                  0.0
      TcpEstabResets                  5                  0.0
      TcpInSegs                       401036             0.0
      TcpOutSegs                      401052             0.0
      TcpOutRsts                      38                 0.0
      UdpInDatagrams                  26                 0.0
      UdpOutDatagrams                 27                 0.0
      Ip6OutNoRoutes                  1                  0.0
      TcpExtDelayedACKs               1                  0.0
      TcpExtTCPPrequeued              98                 0.0
      TcpExtTCPDirectCopyFromPrequeue 98                 0.0
      TcpExtTCPHPHits                 4                  0.0
      TcpExtTCPHPHitsToUser           98                 0.0
      TcpExtTCPPureAcks               5                  0.0
      TcpExtTCPHPAcks                 101                0.0
      TcpExtTCPAbortOnData            6                  0.0
      TcpExtBusyPollRxPackets         400832             0.0
      TcpExtTCPOrigDataSent           400983             0.0
      IpExtInOctets                   21273867           0.0
      IpExtOutOctets                  21261254           0.0
      IpExtInNoECTPkts                401064             0.0
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      b59768c6
    • Eric Dumazet's avatar
      mlx5: support napi_complete_done() · 44fb6fbb
      Eric Dumazet authored
      A NAPI poll handler should return number of RX packets processed,
      instead of 0 / budget.
      
      This allows proper busy poll accounting through LINUX_MIB_BUSYPOLLRXPACKETS
      SNMP counter.
      
      napi_complete_done() allows /sys/class/net/ethX/gro_flush_timeout
      to be used for finer GRO aggregation control.
      
      Tested:
      
      Enabled busy polling, and checked TcpExtBusyPollRxPackets counter is increasing.
      
      echo 70 >/proc/sys/net/core/busy_read
      nstat >/dev/null
      netperf -H target -t TCP_RR >/dev/null
      nstat | grep TcpExtBusyPollRxPackets
      TcpExtBusyPollRxPackets         490958             0.0
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Cc: Eli Cohen <eli@mellanox.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      44fb6fbb
    • Eric Dumazet's avatar
      mlx5: add busy polling support · 7ae92ae5
      Eric Dumazet authored
      It is now easy to add busy polling support to a NAPI driver,
      with very little impact on normal input path.
      
      This patch serves as a reference implementation.
      
      Note:
      
      A followup patch will add proper napi_complete_done() in mlx5,
      so that LINUX_MIB_BUSYPOLLRXPACKETS snmp counter is properly handled.
      
      Tested:
      
      Normal TCP_RR results without busy polling :
      
      lpk51:~# echo 0 >/proc/sys/net/core/busy_read
      lpk52:~# echo 0 >/proc/sys/net/core/busy_read
      
      lpk51:~# ./netperf -H 192.168.4.52 -t TCP_RR -l 10
      MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.4.52 () port 0 AF_INET : first burst 0
      Local /Remote
      Socket Size   Request  Resp.   Elapsed  Trans.
      Send   Recv   Size     Size    Time     Rate
      bytes  Bytes  bytes    bytes   secs.    per sec
      
      16384  87380  1        1       10.00    53509.49
      16384  87380
      
      Now enable busy polling :
      
      lpk51:~# echo 70 >/proc/sys/net/core/busy_read
      lpk52:~# echo 70 >/proc/sys/net/core/busy_read
      
      lpk51:~# ./netperf -H 192.168.4.52 -t TCP_RR -l 10
      MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.4.52 () port 0 AF_INET : first burst 0
      Local /Remote
      Socket Size   Request  Resp.   Elapsed  Trans.
      Send   Recv   Size     Size    Time     Rate
      bytes  Bytes  bytes    bytes   secs.    per sec
      
      16384  87380  1        1       10.00    97530.92
      16384  87380
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      7ae92ae5
    • Eric Dumazet's avatar
      net: network drivers no longer need to implement ndo_busy_poll() · ce6aea93
      Eric Dumazet authored
      Instead of having to implement complex ndo_busy_poll() method,
      drivers can simply rely on NAPI poll logic.
      
      Busy polling gains are mainly coming from polling itself,
      not on exact details on how we poll the device.
      
      ndo_busy_poll() if implemented can avoid touching
      napi state, but it adds extra synchronization between
      normal napi->poll() and busy poll handler, slowing down
      the common path (non busy polling) with extra atomic operations.
      In practice few drivers ever got busy poll because of the complexity.
      
      We could go one step further, and make busy polling
      available for all NAPI drivers, but this would require
      that all netif_napi_del() calls are done in process context
      so that we can call synchronize_rcu().
      Full audit would be required.
      
      Before this is done, a driver still needs to call :
      
      - skb_mark_napi_id() for each skb provided to the stack.
      - napi_hash_add() and napi_hash_del() to allocate a napi_id per napi struct.
      - Make sure RCU grace period is respected after napi_hash_del() before
        memory containing napi structure is freed.
      
      Followup patch implements busy poll for mlx5 driver as an example.
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      ce6aea93
    • Eric Dumazet's avatar
      net: allow BH servicing in sk_busy_loop() · 2a028ecb
      Eric Dumazet authored
      Instead of blocking BH in whole sk_busy_loop(), block them
      only around ->ndo_busy_poll() calls.
      
      This has many benefits.
      
      1) allow tunneled traffic to use busy poll as well as native traffic.
         Tunnels handlers usually call netif_rx() and depend on net_rx_action()
         being run (from sofirq handler)
      
      2) allow RFS/RPS being used (sending IPI to other cpus if needed)
      
      3) use the 'lets burn cpu cycles' budget to do useful work
         (like TX completions, timers, RCU callbacks...)
      
      4) reduce BH latencies, making busy poll a better citizen.
      
      Tested:
      
      Tested with SIT tunnel
      
      lpaa5:~# echo 0 >/proc/sys/net/core/busy_read
      lpaa5:~# ./netperf -H 2002:af6:786::1 -t TCP_RR
      MIGRATED TCP REQUEST/RESPONSE TEST from ::0 (::) port 0 AF_INET6 to 2002:af6:786::1 () port 0 AF_INET6 : first burst 0
      Local /Remote
      Socket Size   Request  Resp.   Elapsed  Trans.
      Send   Recv   Size     Size    Time     Rate
      bytes  Bytes  bytes    bytes   secs.    per sec
      
      16384  87380  1        1       10.00    37373.93
      16384  87380
      
      Now enable busy poll on both hosts
      
      lpaa5:~# echo 70 >/proc/sys/net/core/busy_read
      lpaa6:~# echo 70 >/proc/sys/net/core/busy_read
      
      lpaa5:~# ./netperf -H 2002:af6:786::1 -t TCP_RR
      MIGRATED TCP REQUEST/RESPONSE TEST from ::0 (::) port 0 AF_INET6 to 2002:af6:786::1 () port 0 AF_INET6 : first burst 0
      Local /Remote
      Socket Size   Request  Resp.   Elapsed  Trans.
      Send   Recv   Size     Size    Time     Rate
      bytes  Bytes  bytes    bytes   secs.    per sec
      
      16384  87380  1        1       10.00    58314.77
      16384  87380
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      2a028ecb
    • Eric Dumazet's avatar
      net: un-inline sk_busy_loop() · 02d62e86
      Eric Dumazet authored
      There is really little gain from inlining this big function.
      We'll soon make it even bigger in following patches.
      
      This means we no longer need to export napi_by_id()
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      02d62e86
    • Eric Dumazet's avatar
      mlx4: mlx4_en_low_latency_recv() called with BH disabled · 5865316c
      Eric Dumazet authored
      mlx4_en_low_latency_recv() is called with BH disabled,
      as other ndo_busy_poll() methods.
      
      No need for spin_lock_bh()/spin_unlock_bh()
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      5865316c
    • Eric Dumazet's avatar
      net: better skb->sender_cpu and skb->napi_id cohabitation · 52bd2d62
      Eric Dumazet authored
      skb->sender_cpu and skb->napi_id share a common storage,
      and we had various bugs about this.
      
      We had to call skb_sender_cpu_clear() in some places to
      not leave a prior skb->napi_id and fool netdev_pick_tx()
      
      As suggested by Alexei, we could split the space so that
      these errors can not happen.
      
      0 value being reserved as the common (not initialized) value,
      let's reserve [1 .. NR_CPUS] range for valid sender_cpu,
      and [NR_CPUS+1 .. ~0U] for valid napi_id.
      
      This will allow proper busy polling support over tunnels.
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Suggested-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Acked-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      52bd2d62
    • Ivan Vecera's avatar
      be2net: remove local variable 'status' · d37b4c0a
      Ivan Vecera authored
      The lancer_cmd_get_file_len() uses lancer_cmd_read_object() to get
      the current size of registers for ethtool registers dump. Returned status
      value is stored but not checked. The check itself is not necessary as
      the data_read output variable is initialized to 0 and status variable
      can be removed.
      Signed-off-by: default avatarIvan Vecera <ivecera@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      d37b4c0a
    • huangdaode's avatar
      net: hisilicon: fix binding document of mdio · 6fbaa570
      huangdaode authored
      This patch explains the occasion of "hisilcon,mdio" and
      "hisilicon,hns-mdio" according to Arnd's comments.
      and reformat it according to comments from Rob<robh@kernel.org>.
      Signed-off-by: default avatarhuangdaode <huangdaode@hisilicon.com>
      Reviewed-by: default avatarArnd Bergmann <arnd@arndb.de>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      6fbaa570
    • Bastian Stender's avatar
      net ipv4: use preferred log methods · 09605cc1
      Bastian Stender authored
      Replace printk calls with preferred unconditional log method calls to keep
      kernel messages clean.
      
      Added newline to "too small MTU" message.
      Signed-off-by: default avatarBastian Stender <bst@pengutronix.de>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      09605cc1
    • Linus Torvalds's avatar
      Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux · 34258a32
      Linus Torvalds authored
      Pull s390 fixes from Martin Schwidefsky:
       "Assorted bug fixes, the mlock2 system call gets added, and one
        improvement.  The boot from dasd devices is now possible from a wider
        range of devices"
      
      * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
        s390: remove SALIPL loader
        s390: wire up mlock2 system call
        s390: remove g5 elf platform support
        s390: avoid cache aliasing under z/VM and KVM
        s390/sclp: _sclp_wait_int(): retain full PSW mask
        s390/zcrypt: Fix initialisation when zcrypt is built-in
        s390/zcrypt: Fix kernel crash on systems without AP bus support
        s390: add support for ipl devices in subchannel sets > 0
        s390/ipl: fix out of bounds access in scpdata_write
        s390/pci_dma: improve debugging of errors during dma map
        s390/pci_dma: handle dma table failures
        s390/pci_dma: unify label of invalid translation table entries
        s390/syscalls: remove system call number calculation
        s390/cio: simplify css_generate_pgid
        s390/diag: add a s390 prefix to the diagnose trace point
        s390/head: fix error message on unsupported hardware
      34258a32