1. 06 Oct, 2014 6 commits
  2. 05 Oct, 2014 5 commits
    • net: sched: suspicious RCU usage in qdisc_watchdog · 1e203c1a
      John Fastabend authored
      The call in qdisc_watchdog() that triggers the suspicious RCU usage
      warning needs to be made inside rcu_read_lock()/rcu_read_unlock().
      Qdisc destroy operations then need to ensure the timer is cancelled
      before the qdisc structure is removed.
      
      [ 3992.191339] ===============================
      [ 3992.191340] [ INFO: suspicious RCU usage. ]
      [ 3992.191343] 3.17.0-rc6net-next+ #72 Not tainted
      [ 3992.191345] -------------------------------
      [ 3992.191347] include/net/sch_generic.h:272 suspicious rcu_dereference_check() usage!
      [ 3992.191348]
      [ 3992.191348] other info that might help us debug this:
      [ 3992.191348]
      [ 3992.191351]
      [ 3992.191351] rcu_scheduler_active = 1, debug_locks = 1
      [ 3992.191353] no locks held by swapper/1/0.
      [ 3992.191355]
      [ 3992.191355] stack backtrace:
      [ 3992.191358] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 3.17.0-rc6net-next+ #72
      [ 3992.191360] Hardware name:                  /DZ77RE-75K, BIOS GAZ7711H.86A.0060.2012.1115.1750 11/15/2012
      [ 3992.191362]  0000000000000001 ffff880235803e48 ffffffff8178f92c 0000000000000000
      [ 3992.191366]  ffff8802322224a0 ffff880235803e78 ffffffff810c9966 ffff8800a5fe3000
      [ 3992.191370]  ffff880235803f30 ffff8802359cd768 ffff8802359cd6e0 ffff880235803e98
      [ 3992.191374] Call Trace:
      [ 3992.191376]  <IRQ>  [<ffffffff8178f92c>] dump_stack+0x4e/0x68
      [ 3992.191387]  [<ffffffff810c9966>] lockdep_rcu_suspicious+0xe6/0x130
      [ 3992.191392]  [<ffffffff8167213a>] qdisc_watchdog+0x8a/0xb0
      [ 3992.191396]  [<ffffffff810f93f2>] __run_hrtimer+0x72/0x420
      [ 3992.191399]  [<ffffffff810f9bcd>] ? hrtimer_interrupt+0x7d/0x240
      [ 3992.191403]  [<ffffffff816720b0>] ? tc_classify+0xc0/0xc0
      [ 3992.191406]  [<ffffffff810f9c4f>] hrtimer_interrupt+0xff/0x240
      [ 3992.191410]  [<ffffffff8109e4a5>] ? __atomic_notifier_call_chain+0x5/0x140
      [ 3992.191415]  [<ffffffff8103577b>] local_apic_timer_interrupt+0x3b/0x60
      [ 3992.191419]  [<ffffffff8179c2b5>] smp_apic_timer_interrupt+0x45/0x60
      [ 3992.191422]  [<ffffffff8179a6bf>] apic_timer_interrupt+0x6f/0x80
      [ 3992.191424]  <EOI>  [<ffffffff815ed233>] ? cpuidle_enter_state+0x73/0x2e0
      [ 3992.191432]  [<ffffffff815ed22e>] ? cpuidle_enter_state+0x6e/0x2e0
      [ 3992.191437]  [<ffffffff815ed567>] cpuidle_enter+0x17/0x20
      [ 3992.191441]  [<ffffffff810c0741>] cpu_startup_entry+0x3d1/0x4a0
      [ 3992.191445]  [<ffffffff81106fc6>] ? clockevents_config_and_register+0x26/0x30
      [ 3992.191448]  [<ffffffff81033c16>] start_secondary+0x1b6/0x260
      
      Fixes: b26b0d1e ("net: qdisc: use rcu prefix and silence sparse warnings")
      Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
      Acked-by: Cong Wang <cwang@twopensource.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dsa: do not call phy_start_aneg · f7d6b96f
      Florian Fainelli authored
      Commit f7f1de51 ("net: dsa: start and stop the PHY state machine")
      added calls to phy_start() in dsa_slave_open() and to phy_stop() in
      dsa_slave_close().
      
      We also call phy_start_aneg() in dsa_slave_create(), and this call
      interferes with the PHY state machine, since we basically start the
      auto-negotiation, and later on restart it when calling phy_start().
      phy_start() does not currently handle the PHY_FORCING or PHY_AN states
      properly, but such a fix would be too invasive for this window.
      
      Fixes: f7f1de51 ("net: dsa: start and stop the PHY state machine")
      Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Removed unused inet6 address state · dd3619f2
      Sébastien Barré authored
      The inet6 state INET6_IFADDR_STATE_UP only appeared in its definition.
      
      Cc: Christoph Paasch <christoph.paasch@uclouvain.be>
      Cc: Herbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: Sébastien Barré <sebastien.barre@uclouvain.be>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: Cleanup skb cloning by adding SKB_FCLONE_FREE · c8753d55
      Vijay Subramanian authored
      SKB_FCLONE_UNAVAILABLE has an overloaded meaning that depends on the type of skb.
      1: If the skb is allocated from head_cache, it indicates that an fclone is not available.
      2: If the skb is a companion fclone skb (allocated from fclone_cache), it indicates
      that it is available to be used.
      
      To avoid confusion in case 2 above, this patch replaces
      SKB_FCLONE_UNAVAILABLE with SKB_FCLONE_FREE where appropriate. For fclone
      companion skbs, this indicates that the skb is free for use.
      
      SKB_FCLONE_UNAVAILABLE will now simply indicate skb is from head_cache and
      cannot / will not have a companion fclone.
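
The state split described above can be pictured with a small Python model. This is a sketch only, not kernel code. SKB_FCLONE_UNAVAILABLE and SKB_FCLONE_FREE come from the commit message; the ORIG and CLONE members, the Skb class, and both helper functions are assumptions added for illustration.

```python
from enum import Enum, auto


class FcloneState(Enum):
    UNAVAILABLE = auto()  # skb from head_cache: no companion fclone exists
    ORIG = auto()         # assumed: primary skb of an fclone_cache pair
    CLONE = auto()        # assumed: companion skb currently in use
    FREE = auto()         # companion skb that is free for use


class Skb:
    def __init__(self, state, companion=None):
        self.state = state
        self.companion = companion


def alloc_skb(from_fclone_cache):
    """Allocate a model skb; fclone_cache allocations carry a FREE companion."""
    if from_fclone_cache:
        return Skb(FcloneState.ORIG, Skb(FcloneState.FREE))
    return Skb(FcloneState.UNAVAILABLE)


def clone_skb(skb):
    """Use the companion when it is FREE, otherwise allocate a plain skb."""
    companion = skb.companion
    if companion is not None and companion.state is FcloneState.FREE:
        companion.state = FcloneState.CLONE
        return companion
    return Skb(FcloneState.UNAVAILABLE)
```

In this model a head_cache skb never reports FREE, so the two meanings that used to share SKB_FCLONE_UNAVAILABLE can no longer be confused.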
      Signed-off-by: Vijay Subramanian <subramanian.vijay@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mlx4: add a new xmit_more counter · 9fab426d
      Eric Dumazet authored
      ethtool -S reports a new counter, tracking the number of times the
      doorbell was not triggered because skb->xmit_more was set.
      
      $ ethtool -S eth0 | egrep "tx_packet|xmit_more"
           tx_packets: 2413288400
           xmit_more: 666121277
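
A minimal Python model of what the counter measures, assuming the driver rings the doorbell only for a packet whose xmit_more flag is clear. The function name, the "doorbells" key, and the flag list are hypothetical; this is not mlx4 driver code.

```python
def send_burst(xmit_more_flags):
    """Count doorbells rung and doorbells skipped for a burst of packets.

    Each entry in xmit_more_flags models skb->xmit_more for one packet:
    True means more packets follow, so the doorbell can be deferred.
    """
    stats = {"tx_packets": 0, "xmit_more": 0, "doorbells": 0}
    for more in xmit_more_flags:
        stats["tx_packets"] += 1
        if more:
            stats["xmit_more"] += 1   # doorbell not triggered for this packet
        else:
            stats["doorbells"] += 1   # nothing else pending: notify the HW
    return stats
```

Under this model's assumption, xmit_more divided by tx_packets is the fraction of packets sent without a doorbell; for the ethtool output above that ratio is roughly 27.6%.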
      
      I merged the tso_packet false sharing avoidance in this patch as well.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  3. 03 Oct, 2014 23 commits
    • Merge branch 'gudp' · 6106253e
      David S. Miller authored
      Tom Herbert says:
      
      ====================
      net: Generic UDP Encapsulation
      
      Generic UDP Encapsulation (GUE) is a UDP encapsulation protocol that
      encapsulates packets of various IP protocols. The GUE protocol is
      described in http://tools.ietf.org/html/draft-herbert-gue-01.
      
      The receive path of GUE is implemented in the Foo-over-UDP (FOU) module.
      This includes a UDP encap receive function for GUE as well as GUE
      specific GRO functions. Management and configuration of GUE ports shares
      most of the same code with FOU.
      
      For the transmit path, the previous FOU support for IPIP, sit, and GRE
      was simply extended for GUE (when GUE is enabled, the GUE header is
      inserted on transmit in addition to the UDP header inserted for FOU).
      
      Semantically GUE is the same as FOU in that the encapsulation (UDP
      and GUE headers) is inserted on transmission and removed on
      reception, so that the IP packet is processed with the inner header.
      
      This patch set includes:
       - Some fixes to FOU, removal of IPv4,v6 specific GRO functions
       - Support to configure a GUE receive port
       - Implementation of GUE receive path (normal and GRO)
       - Additions to ip_tunnel netlink to configure GUE
       - GUE header insertion in ip_tunnel transmit path
      
      v2:
       - Include net/gue.h in patch set
      
      Testing:
      
      I ran performance numbers using netperf TCP_RR with 200 streams,
      comparing encapsulation without GUE, encapsulation with GUE, and
      encapsulation with FOU.
      
        GRE
          TCP_STREAM
            IPv4, FOU, UDP checksum enabled
              14.04% TX CPU utilization
              13.17% RX CPU utilization
              9211 Mbps
            IPv4, GUE, UDP checksum enabled
              14.99% TX CPU utilization
              13.79% RX CPU utilization
              9185 Mbps
            IPv4, FOU, UDP checksum disabled
              13.14% TX CPU utilization
              23.18% RX CPU utilization
              9277 Mbps
            IPv4, GUE, UDP checksum disabled
              13.66% TX CPU utilization
              23.57% RX CPU utilization
              9184 Mbps
          TCP_RR
            IPv4, FOU, UDP checksum enabled
              94.2% CPU utilization
              155/249/460 90/95/99% latencies
              1.17018e+06 tps
            IPv4, GUE, UDP checksum enabled
              93.9% CPU utilization
              158/253/472 90/95/99% latencies
              1.15045e+06 tps
      
        IPIP
          TCP_STREAM
            FOU, UDP checksum enabled
              15.28% TX CPU utilization
              13.92% RX CPU utilization
              9342 Mbps
            GUE, UDP checksum enabled
              13.99% TX CPU utilization
              13.34% RX CPU utilization
              9210 Mbps
            FOU, UDP checksum disabled
              15.08% TX CPU utilization
              24.64% RX CPU utilization
              9226 Mbps
            GUE, UDP checksum disabled
              15.90% TX CPU utilization
              24.77% RX CPU utilization
              9197 Mbps
          TCP_RR
            FOU, UDP checksum enabled
              94.23% CPU utilization
              149/237/429 90/95/99% latencies
              1.19553e+06 tps
            GUE, UDP checksum enabled
              93.75% CPU utilization
              152/243/442 90/95/99% latencies
              1.17027e+06 tps
      
        SIT
          TCP_STREAM
            FOU, UDP checksum enabled
              14.47% TX CPU utilization
              14.58% RX CPU utilization
              9106 Mbps
            GUE, UDP checksum enabled
              15.09% TX CPU utilization
              14.84% RX CPU utilization
              9080 Mbps
            FOU, UDP checksum disabled
              15.70% TX CPU utilization
              27.93% RX CPU utilization
              9097 Mbps
            GUE, UDP checksum disabled
              15.04% TX CPU utilization
              27.54% RX CPU utilization
              9073 Mbps
          TCP_RR
            FOU, UDP checksum enabled
              96.9% CPU utilization
              170/281/581 90/95/99% latencies
              1.03372e+06 tps
            GUE, UDP checksum enabled
              97.16% CPU utilization
              172/286/576 90/95/99% latencies
              1.00469e+06 tps
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ip_tunnel: Add GUE support · bc1fc390
      Tom Herbert authored
      This patch allows configuring IPIP, sit, and GRE tunnels to use GUE.
      This is very similar to FOU except that we need to insert the GUE header
      in addition to the UDP header on transmit.
      Signed-off-by: Tom Herbert <therbert@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • gue: Receive side for Generic UDP Encapsulation · 37dd0247
      Tom Herbert authored
      This patch adds support for receiving GUE packets in the fou module. The
      fou module now supports direct foo-over-udp (no encapsulation header)
      and GUE. To support this, a type parameter is added to the fou netlink
      parameters.
      
      For a GUE socket we define gue_udp_recv, gue_gro_receive, and
      gue_gro_complete to handle the specifics of the GUE protocol. Most
      of the code to manage and configure sockets is common with the fou.
      Signed-off-by: Tom Herbert <therbert@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • fou: eliminate IPv4,v6 specific GRO functions · efc98d08
      Tom Herbert authored
      This patch removes the fou[46]_gro_receive and fou[46]_gro_complete
      functions. The v4 or v6 variants were chosen for the UDP offloads
      based on the address family of the socket; this is neither necessary
      nor correct. Instead, this patch adds is_ipv6 to napi_gro_cb.
      This is set in udp6_gro_receive and unset in udp4_gro_receive. In
      fou_gro_receive the value is used to select the correct inet_offloads
      for the protocol of the outer IP header.
      Signed-off-by: Tom Herbert <therbert@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ip_tunnel: Account for secondary encapsulation header in max_headroom · 7371e022
      Tom Herbert authored
      When adjusting max_header for the tunnel interface based on egress
      device we need to account for any extra bytes in secondary encapsulation
      (e.g. FOU).
      Signed-off-by: Tom Herbert <therbert@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: do not export skb_gro_receive() · 01291202
      Eric Dumazet authored
      skb_gro_receive() is only called from tcp_gro_receive() which is
      not in a module.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • drivers/net/irda/Kconfig: Let SH_IRDA depend on HAS_IOMEM · ad2a2a6d
      Chen Gang authored
      SH_IRDA needs HAS_IOMEM, so depend on it. The related error (with
      allmodconfig under um):
      
          CC [M]  drivers/net/irda/sh_irda.o
        drivers/net/irda/sh_irda.c: In function ‘sh_irda_probe’:
        drivers/net/irda/sh_irda.c:776:2: error: implicit declaration of function ‘ioremap_nocache’ [-Werror=implicit-function-declaration]
          self->membase = ioremap_nocache(res->start, resource_size(res));
          ^
        drivers/net/irda/sh_irda.c:776:16: warning: assignment makes pointer from integer without a cast [enabled by default]
          self->membase = ioremap_nocache(res->start, resource_size(res));
                        ^
        drivers/net/irda/sh_irda.c:821:2: error: implicit declaration of function ‘iounmap’ [-Werror=implicit-function-declaration]
          iounmap(self->membase);
          ^
      Signed-off-by: Chen Gang <gang.chen.5i5j@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • drivers/net/ethernet/marvell/Kconfig: Let PXA168_ETH depend on HAS_IOMEM · 65cb29a4
      Chen Gang authored
      PXA168_ETH needs HAS_IOMEM, so depend on it. The related error (with
      allmodconfig under um):
      
          CC [M]  drivers/net/ethernet/marvell/pxa168_eth.o
        drivers/net/ethernet/marvell/pxa168_eth.c: In function ‘pxa168_eth_probe’:
        drivers/net/ethernet/marvell/pxa168_eth.c:1605:2: error: implicit declaration of function ‘iounmap’ [-Werror=implicit-function-declaration]
          iounmap(pep->base);
          ^
      Signed-off-by: Chen Gang <gang.chen.5i5j@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • drivers/net/dsa/Kconfig: Let NET_DSA_BCM_SF2 depend on HAS_IOMEM · 28b5533a
      Chen Gang authored
      NET_DSA_BCM_SF2 needs HAS_IOMEM, so depend on it. The related error (with
      allmodconfig under um):
      
          CC [M]  drivers/net/dsa/bcm_sf2.o
        drivers/net/dsa/bcm_sf2.c: In function ‘bcm_sf2_sw_setup’:
        drivers/net/dsa/bcm_sf2.c:487:3: error: implicit declaration of function ‘iounmap’ [-Werror=implicit-function-declaration]
           iounmap(*base);
           ^
      Signed-off-by: Chen Gang <gang.chen.5i5j@gmail.com>
      Acked-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • drivers/net/can/Kconfig: Let CAN_AT91 depend on HAS_IOMEM · 9dc8be28
      Chen Gang authored
      CAN_AT91 needs HAS_IOMEM, so depend on it. The related error (with
      allmodconfig under um):
      
          CC [M]  drivers/net/can/at91_can.o
        drivers/net/can/at91_can.c: In function ‘at91_can_probe’:
        drivers/net/can/at91_can.c:1329:2: error: implicit declaration of function ‘ioremap_nocache’ [-Werror=implicit-function-declaration]
          addr = ioremap_nocache(res->start, resource_size(res));
          ^
        drivers/net/can/at91_can.c:1329:7: warning: assignment makes pointer from integer without a cast [enabled by default]
          addr = ioremap_nocache(res->start, resource_size(res));
               ^
        drivers/net/can/at91_can.c:1384:2: error: implicit declaration of function ‘iounmap’ [-Werror=implicit-function-declaration]
          iounmap(addr);
          ^
      Signed-off-by: Chen Gang <gang.chen.5i5j@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/net-next · 579899a9
      David S. Miller authored
      Jeff Kirsher says:
      
      ====================
      Intel Wired LAN Driver Updates 2014-10-02
      
      This series contains updates to fm10k, igb, ixgbe and i40e.
      
      Alex provides two updates to the fm10k driver.  The first reduces the buffer
      size to 2k for all page sizes: since most frames only have a 1500 MTU,
      supporting a larger buffer size is somewhat wasteful.  The second fixes
      an issue where the number of transmit queues was not being updated, by
      adding the lines necessary to update it.
      
      Rick Jones provides two patches to convert ixgbe, igb and i40e to use
      dev_consume_skb_any().
      
      Emil provides two patches for ixgbe.  The first cleans up a couple of wait
      loops on auto-negotiation that were not needed.  The second fixes an issue
      reported by Fujitsu/Red Hat by consolidating the logic behind the
      dynamic setting of TXDCTL.WTHRESH depending on the interrupt throttle
      rate (ITR) setting, regardless of BQL.
      
      Ethan Zhao provides a cleanup patch for ixgbe where he noticed a
      duplicate define.
      
      Bernhard Kaindl provides a patch for igb to remove a source of latency
      spikes by not calling code that uses mdelay() for feeding a PHY stat
      while being called with a spinlock held.
      
      Todd bumps the igb version based on the recent changes.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'mlx5-next' · 48fea861
      David S. Miller authored
      Eli Cohen says:
      
      ====================
      mlx5 update for 3.18
      
      This series integrates a new mechanism for populating and extracting field values
      used in the driver/firmware interaction around command mailboxes.
      
      Changes from V1:
       - Remove unused definition of memcpy_cpu_to_be32()
       - Remove definitions of non_existent_*() and use BUILD_BUG_ON() instead.
       - Added a one-line patch to add support for ConnectX-4 devices.
      
      Changes from V0:
       - trimmed the auto-generated file to a minimum, as required by the reviewers.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/mlx5_core: Add ConnectX-4 to list of supported devices · f832dc82
      Eli Cohen authored
      Add the upcoming ConnectX-4 device to the list of devices supported by the
      mlx5 driver.
      Signed-off-by: Eli Cohen <eli@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/mlx5_core: Identify resources by their type · 5903325a
      Eli Cohen authored
      This patch puts a common part as the first field of mlx5_core_qp. This field is
      used to identify which resource generated an event. This is required since upcoming
      new resource types such as DC targets are allocated from the same numerical space
      as regular QPs and may generate the same events. By looking up the resource in the
      same table we can then look at the common field to identify the resource.
      Signed-off-by: Eli Cohen <eli@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/mlx5_core: use set/get macros in device caps · b775516b
      Eli Cohen authored
      Transform device capabilities related commands to use set/get macros to
      manipulate command mailboxes.
      Signed-off-by: Eli Cohen <eli@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/mlx5_core: Use hardware registers description header file · d29b796a
      Eli Cohen authored
      Add an auto-generated header file that describes hardware registers along with
      a set of macros that set/get values. The macros do static checks to avoid
      overflow, handle endianness, and overall provide a clean way to code commands.
      Currently the header file is small and we will add structs as we make use of
      the macros.
      A few commands were removed from the commands enum since they are not supported
      currently and will be added when support is available.
      Signed-off-by: Eli Cohen <eli@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/mlx5_core: Update device capabilities handling · c7a08ac7
      Eli Cohen authored
      Rearrange struct mlx5_caps so it has a "gen" field to represent the current
      capabilities configured for the device. Max capabilities can also be queried
      from the device. Also update the capabilities struct to contain more fields as per
      the latest revision of the firmware specification.
      Signed-off-by: Eli Cohen <eli@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • qdisc: validate skb without holding lock · 55a93b3e
      Eric Dumazet authored
      Validation of an skb can be pretty expensive: GSO segmentation and/or
      checksum computations.
      
      We can do this without holding qdisc lock, so that other cpus
      can queue additional packets.
      
      The trick is that requeued packets were already validated, so we carry
      a boolean so that sch_direct_xmit() can either validate a fresh skb list
      or directly use an old one.
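
The idea can be sketched in Python as a userspace model (not the kernel implementation). The Qdisc class, its fields, and the function names are hypothetical; a threading.Lock stands in for the qdisc lock and the boolean returned by dequeue() plays the role of the "validate" flag described above.

```python
import threading
from collections import deque


class Qdisc:
    def __init__(self):
        self.lock = threading.Lock()  # stands in for the qdisc lock
        self.queue = deque()          # fresh, not yet validated packets
        self.requeued = None          # previously validated list, if any

    def dequeue(self):
        """Return (packets, needs_validation). Caller must hold self.lock."""
        if self.requeued is not None:
            packets, self.requeued = self.requeued, None
            return packets, False     # already validated before the requeue
        if self.queue:
            return [self.queue.popleft()], True
        return [], False


def validate(packet):
    """Stand-in for expensive work such as GSO segmentation or checksums."""
    return packet.upper()


def dequeue_and_validate(qdisc):
    with qdisc.lock:
        packets, needs_validation = qdisc.dequeue()
    # The lock is released here, so other CPUs could enqueue meanwhile.
    if needs_validation:
        packets = [validate(p) for p in packets]
    return packets
```

The expensive step runs after the `with` block has released the lock, and a requeued list skips it entirely.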
      
      Tested on a 40Gb NIC (8 TX queues) with 200 concurrent flows, on a
      48-thread host.
      
      Turning TSO on or off had no effect on throughput, only a few more CPU
      cycles. Lock contention on the qdisc lock disappeared.
      
      Same if disabling TX checksum offload.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: ethernet: Remove superfluous ether_setup after alloc_etherdev · 6a05880a
      Tobias Klauser authored
      There is no need to call ether_setup after alloc_etherdev since it was
      already called there.
      
      Follow commits c706471b ("net: axienet: remove unnecessary
      ether_setup after alloc_etherdev") and 3c87dcbf ("net: ll_temac:
      Remove unnecessary ether_setup after alloc_etherdev") and fix the
      pattern in all remaining ethernet drivers.
      Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
      Acked-by: Nicolas Ferre <nicolas.ferre@atmel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'qdisc_bulk_dequeue' · c2bf5ec2
      David S. Miller authored
      Jesper Dangaard Brouer says:
      
      ====================
      qdisc: bulk dequeue support
      
      This patchset uses DaveM's recent API changes to dev_hard_start_xmit(),
      from the qdisc layer, to implement dequeue bulking.
      
      Patch01: "qdisc: bulk dequeue support for qdiscs with TCQ_F_ONETXQUEUE"
       - Implement basic qdisc dequeue bulking
       - This time, 100% relying on BQL limits, no magic safe-guard constants
      
      Patch02: "qdisc: dequeue bulking also pickup GSO/TSO packets"
       - Extend bulking to bulk several GSO/TSO packets
       - Separate patch, as it introduces a small regression, see test section.
      
      We do have a patch03, which exports a userspace tunable as a BQL
      tunable, that can byte-cap or disable the bulking/bursting.  But we
      could not agree on it internally, thus not sending it now.  We
      basically strive to avoid adding any new userspace tunable.
      
      Testing patch01:
      ================
      Demonstrating the performance improvement of qdisc dequeue bulking is
      tricky because the effect only "kicks in" once the qdisc system has a
      backlog. Thus, for a backlog to form, we need either 1) to exceed the wirespeed
      of the link or 2) to exceed the capability of the device driver.
      
      For practical use-cases, the measurable effect of this will be a
      reduction in CPU usage.
      
      01-TCP_STREAM:
      --------------
      Testing the effect for TCP involves disabling TSO and GSO, because TCP
      already benefits from bulking, via TSO and especially for GSO segmented
      packets.  This patch views TSO/GSO as a separate kind of bulking, and
      avoids further bulking of these packet types.
      
      The measured perf diff benefit (at 10Gbit/s) for a single netperf
      TCP_STREAM was 9.24% less CPU used on calls to _raw_spin_lock()
      (mostly from sch_direct_xmit).
      
      If my E5-2695v2(ES) CPU is tuned according to:
       http://netoptimizer.blogspot.dk/2014/04/basic-tuning-for-network-overload.html
      Then it is possible that a single netperf TCP_STREAM, with GSO and TSO
      disabled, can utilize all bandwidth on a 10Gbit/s link.  This will
      then cause a standing backlog queue at the qdisc layer.
      
      Trying to pressure the system some more CPU util wise, I'm starting
      24x TCP_STREAMs and monitoring the overall CPU utilization.  This
      confirms bulking saves CPU cycles when it "kicks-in".
      
      Tool mpstat, while stressing the system with netperf 24x TCP_STREAM, shows:
       * Disabled bulking: sys:2.58%  soft:8.50%  idle:88.78%
       * Enabled  bulking: sys:2.43%  soft:7.66%  idle:89.79%
      
      02-UDP_STREAM
      -------------
      The measured perf diff benefit for UDP_STREAM was 6.41% less CPU used
      on calls to _raw_spin_lock(), with 24x UDP_STREAM and packet size -m 1472
      (to avoid sending UDP/IP fragments).
      
      03-trafgen driver test
      ----------------------
      The performance of the 10Gbit/s ixgbe driver is limited due to
      updating the HW ring-queue tail-pointer on every packet, as
      previously demonstrated with pktgen.
      
      Using trafgen to send RAW frames from userspace (via AF_PACKET), and
      forcing it through qdisc path (with option --qdisc-path and -t0),
      sending with 12 CPUs.
      
      I can demonstrate this driver layer limitation:
       * 12.8 Mpps with no qdisc bulking
       * 14.8 Mpps with qdisc bulking (full 10G-wirespeed)
      
      Testing patch02:
      ================
      Testing bulking of several GSO/TSO packets:
      
      Measuring HoL (Head-of-Line) blocking for TSO and GSO, with
      netperf-wrapper. Bulking several TSO packets shows no performance
      regressions (requeues were in the area of 32 requeues/sec for 10G while
      transmitting approx 813Kpps).
      
      Bulking several GSOs does show a small regression or a very small
      improvement (requeues were in the area of 8000 requeues/sec, for 10G
      while transmitting approx 813Kpps).
      
      Using ixgbe 10Gbit/s with GSO bulking, we can measure some additional
      latency. The base case, which is "normal" GSO bulking, sees a varying
      high-prio queue delay between 0.38ms and 0.47ms.  Bulking several GSOs
      together results in a stable high-prio queue delay of 0.50ms.
      
      Corresponding to:
       (10000*10^6)*((0.50-0.47)/10^3)/8 = 37500 bytes
       (10000*10^6)*((0.50-0.38)/10^3)/8 = 150000 bytes
       37500/1500  = 25 pkts
       150000/1500 = 100 pkts
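
The conversion above can be reproduced with a short Python helper. This is a sketch; the function names are illustrative, and delays are passed as integer microseconds to keep the arithmetic exact. It multiplies the link rate by the extra delay and divides by 8 to get bytes, then divides by a 1500-byte MTU to get packets.

```python
def delay_to_bytes(link_bps, delay_us):
    """Bytes a link of link_bps bit/s transmits in delay_us microseconds."""
    return link_bps * delay_us // (8 * 1_000_000)


def delay_to_packets(link_bps, delay_us, mtu=1500):
    """Same quantity expressed in MTU-sized packets."""
    return delay_to_bytes(link_bps, delay_us) // mtu
```

For the 10Gbit/s case, `delay_to_bytes(10_000_000_000, 30)` gives 37500 and `delay_to_packets(10_000_000_000, 120)` gives 100, matching the figures above. The same helper gives 1500 bytes for a 0.12ms difference at 100Mbit/s.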
      
      Using igb at 100Mbit/s with GSO bulking shows an improvement.
      The base case sees a varying high-prio queue delay between 2.23ms and 2.35ms,
      a diff of 0.12ms corresponding to 1500 bytes at 100Mbit/s. Bulking
      several GSOs together results in a stable high-prio queue delay of
      2.23ms.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • qdisc: dequeue bulking also pickup GSO/TSO packets · 808e7ac0
      Jesper Dangaard Brouer authored
      The TSO and GSO segmented packets already benefit from bulking
      on their own.
      
      The TSO packets have always taken advantage of updating the
      tailptr only once for a large packet.
      
      The GSO segmented packets have recently taken advantage of
      bulking xmit_more API, via merge commit 53fda7f7 ("Merge
      branch 'xmit_list'"), specifically via commit 7f2e870f ("net:
      Move main gso loop out of dev_hard_start_xmit() into helper.")
      allowing qdisc requeue of remaining list.  And via commit
      ce93718f ("net: Don't keep around original SKB when we
      software segment GSO frames.").
      
      This patch allows further bulking of TSO/GSO packets together,
      when dequeueing from the qdisc.
      
      Testing:
      Measuring HoL (Head-of-Line) blocking for TSO and GSO, with
      netperf-wrapper. Bulking several TSO packets shows no performance
      regressions (requeues were in the area of 32 requeues/sec).
      
      Bulking several GSOs does show a small regression or a very small
      improvement (requeues were in the area of 8000 requeues/sec).
      
      Using ixgbe 10Gbit/s with GSO bulking, we can measure some additional
      latency. The base case, which is "normal" GSO bulking, sees a varying
      high-prio queue delay between 0.38ms and 0.47ms.  Bulking several GSOs
      together results in a stable high-prio queue delay of 0.50ms.
      
       Using igb at 100Mbit/s with GSO bulking, shows an improvement.
      Base-case sees varying high-prio queue delay between 2.23ms to 2.35ms
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • qdisc: bulk dequeue support for qdiscs with TCQ_F_ONETXQUEUE · 5772e9a3
      Jesper Dangaard Brouer authored
      Based on DaveM's recent API work on dev_hard_start_xmit(), that allows
      sending/processing an entire skb list.
      
      This patch implements qdisc bulk dequeue, by allowing multiple packets
      to be dequeued in dequeue_skb().
      
      The optimization principle for this is twofold: (1) to amortize
      locking cost and (2) to avoid expensive tailptr updates for notifying HW.
       (1) Several packets are dequeued while holding the qdisc root_lock,
      amortizing locking cost over several packets.  The dequeued SKB list is
      processed under the TXQ lock in dev_hard_start_xmit(), thus also
      amortizing the cost of the TXQ lock.
       (2) Furthermore, dev_hard_start_xmit() will utilize the skb->xmit_more
      API to delay the HW tailptr update, which also reduces the cost per
      packet.
      
      One restriction of the new API is that every SKB must belong to the
      same TXQ.  This patch takes the easy way out, by restricting bulk
      dequeue to qdiscs with the TCQ_F_ONETXQUEUE flag, which specifies that the
      qdisc has only a single TXQ attached.
      
      Some detail about the flow; dev_hard_start_xmit() will process the skb
      list, and transmit packets individually towards the driver (see
      xmit_one()).  In case the driver stops midway in the list, the
      remaining skb list is returned by dev_hard_start_xmit().  In
      sch_direct_xmit() this returned list is requeued by dev_requeue_skb().
      
      To avoid overshooting the HW limits, which results in requeuing, the
      patch limits the number of bytes dequeued, based on the driver's BQL
      limits.  In effect, bulking will only happen for BQL-enabled drivers.
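
The byte-limited dequeue loop can be sketched in Python as a simplified model, not the patch's code. Packets are represented only by their lengths, `bulk_dequeue` and `byte_budget` are hypothetical names, and a driver without BQL is modeled as a budget of 0.

```python
from collections import deque


def bulk_dequeue(queue, byte_budget):
    """Dequeue the head packet, then keep dequeuing while budget remains.

    queue holds packet lengths in bytes; byte_budget models the bytes the
    driver's BQL limit still allows.
    """
    if not queue:
        return []
    batch = [queue.popleft()]
    remaining = byte_budget - batch[0]
    while remaining > 0 and queue:
        length = queue.popleft()
        batch.append(length)
        remaining -= length
    return batch
```

With five 1500-byte packets queued and a 3000-byte budget, the model returns two packets and leaves three queued; with a budget of 0 it returns only the head packet, so no bulking takes place.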
      
      A small amount of extra HoL blocking (2x MTU/0.24ms) was
      measured at 100Mbit/s, with bulking 8 packets, but the
      oscillating nature of the measurement indicates that something
      like sched latency might be causing this effect. More comparisons
      show that this oscillation goes away occasionally. Thus, we
      disregard this artifact completely and remove any "magic" bulking
      limit.
      
      For now, as a conservative approach, stop bulking when seeing TSO and
      segmented GSO packets.  They already benefit from bulking on their own.
      A followup patch adds this, to allow easier bisectability for finding
      regressions.
      
      Joint work with Hannes, Daniel and Florian.
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • et131x: Add PCIe gigabit ethernet driver et131x to drivers/net · 38df6492
      Mark Einon authored
      This adds the ethernet driver for Agere et131x devices to
      drivers/net/ethernet.
      
      The driver being added has been in the staging tree for some time, and will be
      removed from there in a separate patch. This one merely disables the staging
      version to prevent two instances being built.
      Signed-off-by: Mark Einon <mark.einon@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  4. 02 Oct, 2014 6 commits