1. 06 Feb, 2016 19 commits
    • Merge branch 'bpf-per-cpu-maps' · 8ac2c867
      David S. Miller authored
      Alexei Starovoitov says:
      
      ====================
      bpf: introduce per-cpu maps
      
      We've started to use bpf to trace every packet, and the atomic add
      instruction (even JITed) started to show up in the perf profile.
      The solution is to do per-cpu counters.
      For PERCPU_(HASH|ARRAY) map the existing bpf_map_lookup() helper
      returns per-cpu area which bpf programs can use to store and
      increment the counters. The BPF_MAP_LOOKUP_ELEM syscall command
      returns areas from all cpus and user process aggregates the counters.
      The usage example is in patch 6. The api turned out to be very
      easy to use from bpf program and from user space.
      Long term we were discussing to add 'bounded loop' instruction,
      so bpf programs can do aggregation within the program which may
      help some use cases. Right now user space aggregation of
      per-cpu counters fits the best.
      
      This patch set is a new approach for per-cpu hash and array maps.
      I've reused the map tests written by Martin and Ming, but the
      implementation and api are new. Old discussion here:
      http://thread.gmane.org/gmane.linux.kernel/2123800/focus=2126435
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • samples/bpf: unit test for BPF_MAP_TYPE_PERCPU_ARRAY · df570f57
      tom.leiming@gmail.com authored
      A sanity test for BPF_MAP_TYPE_PERCPU_ARRAY
      Signed-off-by: Ming Lei <tom.leiming@gmail.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • samples/bpf: unit test for BPF_MAP_TYPE_PERCPU_HASH · e1559671
      Martin KaFai Lau authored
      A sanity test for BPF_MAP_TYPE_PERCPU_HASH.
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: add lookup/update support for per-cpu hash and array maps · 15a07b33
      Alexei Starovoitov authored
      The functions bpf_map_lookup_elem(map, key, value) and
      bpf_map_update_elem(map, key, value, flags) need to get/set
      values from all-cpus for per-cpu hash and array maps,
      so that user space can aggregate/update them as necessary.
      
      Example of single counter aggregation in user space:
        unsigned int nr_cpus = sysconf(_SC_NPROCESSORS_CONF);
        long values[nr_cpus];
        long value = 0;
        unsigned int i;
      
        bpf_lookup_elem(fd, key, values);
        for (i = 0; i < nr_cpus; i++)
          value += values[i];
      
      User space must provide an array of round_up(value_size, 8) * nr_cpus
      bytes to get/set values, since the kernel uses 'long' copies
      of per-cpu values to try to copy good counters atomically.
      This is best-effort, since bpf programs and user space are racing
      to access the same memory.
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
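The buffer sizing rule from the commit message above, round_up(value_size, 8) * nr_cpus, can be sketched in plain C. This is a minimal user-space model, not the kernel or sample code: the helper names round_up8(), percpu_buf_size() and sum_percpu_u64() are hypothetical, and the sketch assumes each CPU's slot starts with a 64-bit counter.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Round x up to the next multiple of 8, as in round_up(value_size, 8). */
static size_t round_up8(size_t x)
{
    return (x + 7) & ~(size_t)7;
}

/* Bytes user space must supply for one per-cpu lookup or update. */
static size_t percpu_buf_size(size_t value_size, unsigned int nr_cpus)
{
    return round_up8(value_size) * nr_cpus;
}

/*
 * Sum the 64-bit counter at the start of each CPU's slot.
 * Assumption: the value begins with a uint64_t counter.
 */
static uint64_t sum_percpu_u64(const void *buf, size_t value_size,
                               unsigned int nr_cpus)
{
    const unsigned char *p = buf;
    size_t stride = round_up8(value_size); /* distance between CPU slots */
    uint64_t total = 0;
    unsigned int cpu;

    assert(value_size >= sizeof(uint64_t));
    for (cpu = 0; cpu < nr_cpus; cpu++) {
        uint64_t v;

        /* memcpy avoids any alignment assumption about buf */
        memcpy(&v, p + cpu * stride, sizeof(v));
        total += v;
    }
    return total;
}
```

With a 12-byte value and 4 CPUs, percpu_buf_size(12, 4) is 64: each slot is padded to 16 bytes, so the stride between CPUs is not the raw value size.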
    • bpf: introduce BPF_MAP_TYPE_PERCPU_ARRAY map · a10423b8
      Alexei Starovoitov authored
      Primary use case is a histogram array of latency
      where bpf program computes the latency of block requests or other
      events and stores histogram of latency into array of 64 elements.
      All cpus are constantly running, so a normal increment is not accurate
      and bpf_xadd causes cache ping-pong. This per-cpu approach allows the
      fastest collision-free counters.
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
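The 64-element latency histogram described above might look like the following sketch. The commit message only gives the array size; the power-of-two bucketing, the slot-major buffer layout and the names latency_slot() and merge_histograms() are assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define NR_SLOTS 64 /* the commit message mentions an array of 64 elements */

/*
 * Map a latency to a slot. Assumed log2 bucketing: slot 0 holds 0..1,
 * slot n holds values in [2^n, 2^(n+1)).
 */
static unsigned int latency_slot(uint64_t latency)
{
    unsigned int slot = 0;

    while (latency > 1 && slot < NR_SLOTS - 1) {
        latency >>= 1;
        slot++;
    }
    return slot;
}

/*
 * Fold per-cpu counters into one histogram. Assumed layout: for each
 * slot, nr_cpus consecutive counters, as if one lookup per key had been
 * concatenated.
 */
static void merge_histograms(const uint64_t *percpu, unsigned int nr_cpus,
                             uint64_t out[NR_SLOTS])
{
    unsigned int slot, cpu;

    for (slot = 0; slot < NR_SLOTS; slot++) {
        out[slot] = 0;
        for (cpu = 0; cpu < nr_cpus; cpu++)
            out[slot] += percpu[slot * nr_cpus + cpu];
    }
}
```

In this model each CPU increments only its own counters, so merging is the only step that touches data from more than one CPU.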
    • bpf: introduce BPF_MAP_TYPE_PERCPU_HASH map · 824bd0ce
      Alexei Starovoitov authored
      Introduce BPF_MAP_TYPE_PERCPU_HASH map type which is used to do
      accurate counters without need to use BPF_XADD instruction which turned
      out to be too costly for high-performance network monitoring.
      In the typical use case the 'key' is the flow tuple or other long
      living object that sees a lot of events per second.
      
      bpf_map_lookup_elem() returns per-cpu area.
      Example:
      struct {
        u32 packets;
        u32 bytes;
      } * ptr = bpf_map_lookup_elem(&map, &key);
      /* ptr points to this_cpu area of the value, so the following
       * increments will not collide with other cpus
       */
      if (ptr) {
        ptr->packets++;
        ptr->bytes += skb->len;
      }
      
      bpf_update_elem() atomically creates a new element where all per-cpu
      values are zero initialized and this_cpu value is populated with
      given 'value'.
      Note that non-per-cpu hash map always allocates new element
      and then deletes old after rcu grace period to maintain atomicity
      of update. Per-cpu hash map updates element values in-place.
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
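The element semantics described above (a new element has every per-cpu value zeroed except the creating CPU's, and later increments touch only the local slot) can be modelled in user space. This is a sketch only: MAX_CPUS, struct percpu_elem and the elem_*() helpers are hypothetical and do not reflect how the kernel stores elements.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MAX_CPUS 8 /* hypothetical fixed CPU count for this model */

struct flow_stats {
    uint32_t packets;
    uint32_t bytes;
};

/* One map element: a private copy of the value for each CPU. */
struct percpu_elem {
    struct flow_stats cpu[MAX_CPUS];
};

/* Model of element creation: all slots zeroed, only this_cpu populated. */
static void elem_create(struct percpu_elem *e, unsigned int this_cpu,
                        const struct flow_stats *value)
{
    memset(e, 0, sizeof(*e));
    e->cpu[this_cpu] = *value;
}

/* Model of the program-side fast path: only this CPU's slot is touched. */
static void elem_account(struct percpu_elem *e, unsigned int this_cpu,
                         uint32_t len)
{
    struct flow_stats *ptr = &e->cpu[this_cpu];

    ptr->packets++;
    ptr->bytes += len;
}

/* Model of user-space aggregation across all CPUs. */
static struct flow_stats elem_total(const struct percpu_elem *e)
{
    struct flow_stats total = { 0, 0 };
    unsigned int cpu;

    for (cpu = 0; cpu < MAX_CPUS; cpu++) {
        total.packets += e->cpu[cpu].packets;
        total.bytes += e->cpu[cpu].bytes;
    }
    return total;
}
```

Because no two CPUs write the same slot, the increments need no atomic instruction; the cost moves to the reader, which has to sum the slots.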
    • ethtool: Declare netdev_rss_key as __read_mostly. · ba905f5e
      Kim Jones authored
      netdev_rss_key is written to once and thereafter is read by
      drivers when they are initialising. The fact that it is mostly
      read and not written to makes it a candidate for a __read_mostly
      declaration.
      Signed-off-by: Kim Jones <kim-marie.jones@intel.com>
      Signed-off-by: Alan Carey <alan.carey@intel.com>
      Acked-by: Rami Rosen <rami.rosen@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'tcp_fast_open_synack_fin' · ef449678
      David S. Miller authored
      Eric Dumazet says:
      
      ====================
      tcp: fastopen: accept data/FIN present in SYNACK
      
      Implements RFC 7413 (TCP Fast Open) 4.2.2, accepting payload and/or FIN
      in SYNACK messages, and prepares removal of the SYN flag test in tcp_recvmsg()
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: do not enqueue skb with SYN flag · 9d691539
      Eric Dumazet authored
      If we remove the SYN flag from the skbs that tcp_fastopen_add_skb()
      places in socket receive queue, then we can remove the test that
      tcp_recvmsg() has to perform in fast path.
      
      All we have to do is to adjust SEQ in the slow path.
      
      For the moment, we place an unlikely() and output a message
      if we find an skb having SYN flag set.
      Goal would be to get rid of the test completely.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
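The "adjust SEQ" step mentioned above follows from the fact that a SYN occupies one sequence number: dropping the flag from a queued segment means advancing its starting sequence by one, so the payload range is unchanged. The sketch below is a user-space model; struct seg and the seg_*() helpers are hypothetical and are not the kernel's skb control block.

```c
#include <assert.h>
#include <stdint.h>

#define FLAG_FIN 0x01 /* TCP header FIN bit */
#define FLAG_SYN 0x02 /* TCP header SYN bit */

/* Hypothetical stand-in for the sequence bookkeeping of a queued segment. */
struct seg {
    uint32_t seq;     /* first sequence number covered */
    uint32_t end_seq; /* seq + SYN + payload length + FIN */
    uint8_t flags;
};

static struct seg seg_make(uint32_t seq, uint32_t payload_len, uint8_t flags)
{
    struct seg s;

    s.seq = seq;
    s.flags = flags;
    s.end_seq = seq + payload_len
              + ((flags & FLAG_SYN) ? 1u : 0u)
              + ((flags & FLAG_FIN) ? 1u : 0u);
    return s;
}

/* Drop the SYN flag and skip the sequence number it occupied. */
static void seg_strip_syn(struct seg *s)
{
    if (s->flags & FLAG_SYN) {
        s->seq++;
        s->flags &= (uint8_t)~FLAG_SYN;
    }
}

/* Payload bytes, excluding the sequence space used by SYN and FIN. */
static uint32_t seg_payload_len(const struct seg *s)
{
    return s->end_seq - s->seq
         - ((s->flags & FLAG_SYN) ? 1u : 0u)
         - ((s->flags & FLAG_FIN) ? 1u : 0u);
}
```

After seg_strip_syn(), a reader of the queue can treat every segment as plain data (plus an optional FIN) without a per-segment SYN test.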
    • tcp: fastopen: accept data/FIN present in SYNACK message · 61d2bcae
      Eric Dumazet authored
      RFC 7413 (TCP Fast Open) 4.2.2 states that the SYNACK message
      MAY include data and/or FIN
      
      This patch adds support for the client side :
      
      If we receive a SYNACK with payload or FIN, queue the skb instead
      of ignoring it.
      
      Since we already support the same for SYN, we refactor the existing
      code and reuse it. Note we need to clone the skb, so this operation
      might fail under memory pressure.
      
      Sara Dickinson pointed out FreeBSD server Fast Open implementation
      was planned to generate such SYNACK in the future.
      
      The server side might be implemented on linux later.
      Reported-by: Sara Dickinson <sara@sinodun.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'rx_nohandler' · df03288b
      David S. Miller authored
      Jarod Wilson says:
      
      ====================
      net: add and use rx_nohandler stat counter
      
      The network core tries to keep track of dropped packets, but some packets
      you wouldn't really call dropped, so much as intentionally ignored, under
      certain circumstances. One such case is that of bonding and team device
      slaves that are currently inactive. Their respective rx_handler functions
      return RX_HANDLER_EXACT (the only places in the kernel that return that),
      which ends up tracking into the network core's __netif_receive_skb_core()
      function's drop path, with no pt_prev set. On a noisy network, this can
      result in a very rapidly incrementing rx_dropped counter, not only on the
      inactive slave(s), but also on the master device, such as the following:
      
      $ cat /proc/net/dev
      Inter-|   Receive                                                |  Transmit
       face |bytes    packets errs drop fifo frame compressed multicast|bytes    packets errs drop fifo colls carrier compressed
        p7p1: 14783346  140430    0 140428    0     0          0      2040      680       8    0    0    0     0       0          0
        p7p2: 14805198  140648    0    0    0     0          0      2034        0       0    0    0    0     0       0          0
       bond0: 53365248  532798    0 421160    0     0          0    115151     2040      24    0    0    0     0       0          0
          lo:    5420      54    0    0    0     0          0         0     5420      54    0    0    0     0       0          0
        p5p1: 19292195  196197    0 140368    0     0          0     56564      680       8    0    0    0     0       0          0
        p5p2: 19289707  196171    0 140364    0     0          0     56547      680       8    0    0    0     0       0          0
         em3: 20996626  158214    0    0    0     0          0       383        0       0    0    0    0     0       0          0
         em2: 14065122  138462    0    0    0     0          0       310        0       0    0    0    0     0       0          0
         em1: 14063162  138440    0    0    0     0          0       308        0       0    0    0    0     0       0          0
         em4: 21050830  158729    0    0    0     0          0       385    71662     469    0    0    0     0       0          0
         ib0:       0       0    0    0    0     0          0         0        0       0    0    0    0     0       0          0
      
      In this scenario, p5p1, p5p2 and p7p1 are all inactive slaves in an
      active-backup bond0, and you can see that all three have high drop counts,
      with the master bond0 showing a tally of all three.
      
      I know that this was previously discussed some here:
      
          http://www.spinics.net/lists/netdev/msg226341.html
      
      It seems additional counters never came to fruition, so this is a first
      attempt at creating one of them, so that we stop calling these drops,
      which for users monitoring rx_dropped, causes great alarm, and renders the
      counter much less useful for them.
      
      This adds a sysfs statistics node and makes the counter available via
      netlink.
      
      Additionally, I'm not certain if this set qualifies for net, or if it
      should be put aside and resubmitted for net-next after 4.5 is put to
      bed, but I do have users who consider this an important bugfix.
      
      This has been tested quite a bit on x86_64, and now lightly on i686 as
      well, to verify functionality of updates to netdev_stats_to_stats64()
      on 32-bit arches.
      ====================
      Signed-off-by: Jarod Wilson <jarod@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bond: track sum of rx_nohandler for all slaves · f344b0d9
      Jarod Wilson authored
      Sample output with this set applied for an active-backup bond:
      
      $ cat /sys/devices/virtual/net/bond0/lower_p7p1/statistics/rx_nohandler
      16568
      $ cat /sys/devices/virtual/net/bond0/lower_p5p2/statistics/rx_nohandler
      16583
      $ cat /sys/devices/virtual/net/bond0/statistics/rx_nohandler
      33151
      
      CC: Jay Vosburgh <j.vosburgh@gmail.com>
      CC: Veaceslav Falico <vfalico@gmail.com>
      CC: Andy Gospodarek <gospo@cumulusnetworks.com>
      CC: netdev@vger.kernel.org
      Signed-off-by: Jarod Wilson <jarod@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • team: track sum of rx_nohandler for all slaves · bb63daf9
      Jarod Wilson authored
      CC: Jiri Pirko <jiri@resnulli.us>
      CC: netdev@vger.kernel.org
      Signed-off-by: Jarod Wilson <jarod@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: add rx_nohandler stat counter · 6e7333d3
      Jarod Wilson authored
      This adds an rx_nohandler stat counter, along with a sysfs statistics
      node, and copies the counter out via netlink as well.
      
      CC: "David S. Miller" <davem@davemloft.net>
      CC: Eric Dumazet <edumazet@google.com>
      CC: Jiri Pirko <jiri@mellanox.com>
      CC: Daniel Borkmann <daniel@iogearbox.net>
      CC: Tom Herbert <tom@herbertland.com>
      CC: Jay Vosburgh <j.vosburgh@gmail.com>
      CC: Veaceslav Falico <vfalico@gmail.com>
      CC: Andy Gospodarek <gospo@cumulusnetworks.com>
      CC: netdev@vger.kernel.org
      Signed-off-by: Jarod Wilson <jarod@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
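The accounting split this series introduces can be sketched as follows: a packet that reaches the end of the receive path undelivered is counted as rx_nohandler when the rx_handler returned RX_HANDLER_EXACT, and as rx_dropped otherwise. The enum values are the kernel's rx_handler results; struct dev_stats and account_undelivered() are hypothetical simplifications, and the real counters are updated atomically inside the network core.

```c
#include <assert.h>

/* Result codes an rx_handler can return (names as in the kernel). */
enum rx_handler_result {
    RX_HANDLER_CONSUMED,
    RX_HANDLER_ANOTHER,
    RX_HANDLER_EXACT,
    RX_HANDLER_PASS
};

/* Hypothetical per-device counters for this model. */
struct dev_stats {
    unsigned long rx_dropped;
    unsigned long rx_nohandler;
};

/* Account for a packet that no protocol handler accepted. */
static void account_undelivered(struct dev_stats *st,
                                enum rx_handler_result res)
{
    if (res == RX_HANDLER_EXACT)
        st->rx_nohandler++; /* intentionally ignored, e.g. inactive slave */
    else
        st->rx_dropped++;   /* a genuine drop */
}
```

Under this split, traffic arriving on an inactive bonding or team slave no longer inflates rx_dropped, which is the alarm the cover letter describes.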
    • net/core: relax BUILD_BUG_ON in netdev_stats_to_stats64 · 9256645a
      Jarod Wilson authored
      The netdev_stats_to_stats64 function copies the deprecated
      net_device_stats format stats into rtnl_link_stats64 for legacy support
      purposes, but with the BUILD_BUG_ON as it was, it wasn't possible to
      extend rtnl_link_stats64 without also extending net_device_stats. Relax
      the BUILD_BUG_ON to only require that rtnl_link_stats64 is larger, and
      zero out all the stat counters that aren't present in net_device_stats.
      
      CC: Eric Dumazet <edumazet@google.com>
      CC: netdev@vger.kernel.org
      Signed-off-by: Jarod Wilson <jarod@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
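The relaxed check described above can be illustrated with a reduced model: the build-time assertion only requires that the 64-bit structure has at least as many counters as the legacy one, and the conversion zeroes whatever the legacy structure lacks. The two structs here are hypothetical cut-down versions of net_device_stats and rtnl_link_stats64, and _Static_assert stands in for BUILD_BUG_ON.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical cut-down legacy stats: all fields are unsigned long. */
struct legacy_stats {
    unsigned long rx_packets;
    unsigned long rx_dropped;
};

/* Hypothetical cut-down 64-bit stats with one extra trailing counter. */
struct stats64 {
    uint64_t rx_packets;
    uint64_t rx_dropped;
    uint64_t rx_nohandler;
};

/* Relaxed check: the 64-bit struct may have more counters, never fewer. */
_Static_assert(sizeof(struct legacy_stats) / sizeof(unsigned long) <=
               sizeof(struct stats64) / sizeof(uint64_t),
               "stats64 must hold every legacy counter");

static void legacy_to_stats64(struct stats64 *dst,
                              const struct legacy_stats *src)
{
    const unsigned long *in = (const unsigned long *)src;
    uint64_t *out = (uint64_t *)dst;
    size_t n = sizeof(*src) / sizeof(unsigned long);
    size_t total = sizeof(*dst) / sizeof(uint64_t);
    size_t i;

    /* Copy the counters both layouts share, widening where needed. */
    for (i = 0; i < n; i++)
        out[i] = in[i];
    /* Zero the counters that exist only in the 64-bit layout. */
    for (; i < total; i++)
        out[i] = 0;
}
```

The field-by-field copy works on both 32-bit and 64-bit builds because each unsigned long is widened individually rather than copied as raw bytes.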
    • tipc: fix link priority propagation · 81729810
      Richard Alpe authored
      Currently link priority changes aren't handled for active links. In
      this patch we resolve this by changing our priority if the peer passes
      a valid priority in a state message.
      Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: Richard Alpe <richard.alpe@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tipc: fix link attribute propagation bug · d01332f1
      Richard Alpe authored
      Changing certain link attributes (link tolerance and link priority)
      from the TIPC management tool is supposed to automatically take
      effect at both endpoints of the affected link.
      
      Currently the media address is not instantiated for the link and is
      used uninstantiated when crafting protocol messages designated for the
      peer endpoint. This means that changing a link property currently
      results in the property being changed on the local machine but the
      protocol message designated for the peer gets lost, resulting in
      a property discrepancy between the endpoints.
      
      In this patch we resolve this by using the media address from the
      link entry and using the bearer transmit function to send it. Hence,
      we can now eliminate the redundant function tipc_link_prot_xmit() and
      the redundant field tipc_link::media_addr.
      
      Fixes: 2af5ae37 (tipc: clean up unused code and structures)
      Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
      Reported-by: Jason Hu <huzhijiang@gmail.com>
      Signed-off-by: Richard Alpe <richard.alpe@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch '40GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue · 6247fd9f
      David S. Miller authored
      Jeff Kirsher says:
      
      ====================
      40GbE Intel Wired LAN Driver Updates 2016-02-03
      
      This series contains updates to i40e and i40evf only.
      
      Kiran adds the MAC filter element to the end of the list instead of HEAD
      just in case there are ever any ordering issues in the future.
      
      Anjali fixes several RSS issues, first fixes the hash PCTYPE enable for
      X722 since it supports a broader selection of PCTYPES for TCP and UDP.
      Then fixes a bug in XL710, X710, and X722 support for RSS since we cannot
      reduce the 4-tuple for RSS for TCP/IPv4/IPv6 or UDP/IPv4/IPv6 packets
      since this requires a product feature change coming in a later release.
      Cleans up the reset code where the restart-autoneg workaround is
      applied, since X722 does not need the workaround, add a flag to indicate
      which MAC and firmware version require the workaround to be applied.
      Adds new device id's for X722 and code to add their support.  Also
      adds another way to access the RSS keys and lookup table using the admin
      queue for X722 devices.
      
      Catherine updates the driver to replace the MAC check with a feature
      flag check for 100M SGMII, since it is only supported on X722 devices
      currently.
      
      Mitch reworks the VF driver to allow channel bonding, which was not
      possible before this patch due to the asynchronous nature of the admin
      queue mechanism.  Also fixes a rare case which causes a panic if the
      VF driver is removed during reset recovery; this is resolved by setting
      the ring pointers to NULL after freeing them.
      
      Shannon cleans up the driver where device capabilities were defined in
      two different places, and neither had all the definitions, so he
      consolidates the definitions in the admin queue API.  Also adds the new
      proxy-wake-on-lan capability bit available with the new X722 device.
      Lastly, added the new External Device Power Ability field to the
      get_link_status data structure by using a reserved field at the end
      of the structure.
      
      Jesse mimics the ixgbe driver's use of a private work queue in the i40e
      and i40evf drivers to avoid blocking the system work queue.
      
      Greg cleans up the driver to limit the firmware revision checks to
      properly handle DCB configurations from the firmware to the older
      devices which need these checks (specifically X710 and XL710 devices
      only).
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
  2. 05 Feb, 2016 1 commit
    • ipvlan: inherit MTU from master device · 296d4856
      Mahesh Bandewar authored
      When we create an IPvlan slave, we use ether_setup(), which
      sets the default MTU to 1500, while the master device may have
      a lower / different MTU. Any subsequent changes to the master's
      MTU are reflected into the slaves' MTU setting. However, if those
      don't happen (the most likely scenario), the slaves' MTU stays at
      1500, which could be bad.
      
      This change adds code to inherit MTU from the master device
      instead of using the default value during the link initialization
      phase.
      Signed-off-by: Mahesh Bandewar <maheshb@google.com>
      CC: Eric Dumazet <eric.dumazet@gmail.com>
      CC: Tim Hockins <thockins@google.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  3. 04 Feb, 2016 20 commits