1. 08 Jan, 2018 2 commits
    • ipv6: Remove redundant route flushing during namespace dismantle · 9fcb0714
      Ido Schimmel authored
      By the time fib6_net_exit() is executed all the netdevs in the namespace
      have been either unregistered or pushed back to the default namespace.
      That is because pernet subsys operations are always ordered before
      pernet device operations and therefore invoked after them during
      namespace dismantle.
      
      Thus, all the routing tables in the namespace are empty by the time
      fib6_net_exit() is invoked and the call to rt6_ifdown() can be removed.
      
      This allows us to simplify the condition in fib6_ifdown() as it's only
      ever called with an actual netdev.
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Acked-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next · 7f0b8000
      David S. Miller authored
      Daniel Borkmann says:
      
      ====================
      pull-request: bpf-next 2018-01-07
      
      The following pull-request contains BPF updates for your *net-next* tree.
      
      The main changes are:
      
      1) Add a start of a framework for extending struct xdp_buff without
         having the overhead of populating all of the data at runtime. The
         idea is to have a new per-queue struct xdp_rxq_info that holds
         read-mostly data (currently, that is the queue number and a pointer
         to the corresponding netdev) which is set up during rxqueue config
         time. When an XDP program is invoked, struct xdp_buff holds a
         pointer to struct xdp_rxq_info that the BPF program can then
         walk. The user facing BPF program that uses struct xdp_md for
         context can use these members directly, and the verifier rewrites
         context access transparently by walking the xdp_rxq_info and
         net_device pointers to load the data, from Jesper.
      
      2) Redo the reporting of offload device information to user space
         such that it works in combination with network namespaces. The
         latter is reported through a device/inode tuple as similarly
         done in other subsystems as well (e.g. perf) in order to identify
         the namespace. For this to work, ns_get_path() has been generalized
         such that the namespace can be retrieved not only from a specific
         task (perf case), but also from a callback where we deduce the
         netns (ns_common) from a netdevice. bpftool support using the new
         uapi info and extensive test cases for test_offload.py in BPF
         selftests have been added as well, from Jakub.
      
      3) Add two bpftool improvements: i) properly report the bpftool
         version such that it corresponds to the version from the kernel
         source tree. So pick the right linux/version.h from the source
         tree instead of the installed one. ii) fix bpftool and also
         bpf_jit_disasm build with binutils >= 2.9. The reason for the
         build breakage is that binutils library changed the function
         signature to select the disassembler. Given this is needed in
         multiple tools, add a proper feature detection to the
         tools/build/features infrastructure, from Roman.
      
      4) Implement the BPF syscall command BPF_MAP_GET_NEXT_KEY for the
         stacktrace map. It is currently unimplemented, but there are
         use cases where user space needs to walk all stacktrace map
         entries e.g. for dumping or deleting map entries w/o having to
         close and recreate the map. Add BPF selftests along with it,
         from Yonghong.
      
      5) A few follow-up cleanups for the bpftool cgroup code: i) rename
         the cgroup 'list' command into 'show' as we have it for other
         subcommands as well, ii) then alias the 'show' command such that
         'list' is accepted which is also common practice in iproute2,
         and iii) remove a couple of newlines from error messages using
         p_err(), from Jakub.
      
      6) Two follow-up cleanups to sockmap code: i) remove the unused
         bpf_compute_data_end_sk_skb() function and ii) only build the
         sockmap infrastructure when CONFIG_INET is enabled since it's
         only aware of TCP sockets at this time, from John.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
  2. 06 Jan, 2018 3 commits
    • Merge branch 'bpf-stacktrace-map-next-key-support' · 9be99bad
      Daniel Borkmann authored
      Yonghong Song says:
      
      ====================
      The patch set implements bpf syscall command BPF_MAP_GET_NEXT_KEY
      for stacktrace map. Patch #1 is the core implementation
      and Patch #2 implements a bpf test at tools/testing/selftests/bpf
      directory. Please see individual patch comments for details.
      
      Changelog:
        v1 -> v2:
         - For invalid key (key pointer is non-NULL), sets next_key to be the first valid key.
      ====================
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • tools/bpf: add a bpf selftest for stacktrace · 3ced9b60
      Yonghong Song authored
      Added a bpf selftest in test_progs at tools directory for stacktrace.
      The test will populate a hashtable map and a stacktrace map
      at the same time with the same key, stackid.
      The user space will compare both maps, using BPF_MAP_LOOKUP_ELEM
      command and BPF_MAP_GET_NEXT_KEY command, to ensure that both have
      the same set of keys.
      Signed-off-by: Yonghong Song <yhs@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • bpf: implement syscall command BPF_MAP_GET_NEXT_KEY for stacktrace map · 16f07c55
      Yonghong Song authored
      Currently, bpf syscall command BPF_MAP_GET_NEXT_KEY is not
      supported for stacktrace map. However, there are use cases where
      user space wants to enumerate all stacktrace map entries where
      BPF_MAP_GET_NEXT_KEY command will be really helpful.
      In addition, if user space wants to delete all map entries
      in order to save memory and does not want to close the
      map file descriptor, BPF_MAP_GET_NEXT_KEY may help improve
      performance if map entries are sparsely populated.
      
      The implementation behaves similarly to the
      BPF_MAP_GET_NEXT_KEY implementation in hashtab. If the user provides
      a NULL key pointer or an invalid key, the first key is returned.
      Otherwise, the first valid key after the input parameter "key"
      is returned, or -ENOENT if no valid key can be found.
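The lookup rule described above can be sketched in plain user-space C. This is only a toy model, not the kernel implementation: `toy_stack_map`, `toy_get_next_key()` and `TOY_N_BUCKETS` are hypothetical names, and a slot counts as "valid" simply when it is marked occupied.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_N_BUCKETS 8

/* Hypothetical stand-in for a stacktrace map: slot i is valid when
 * occupied[i] is non-zero. */
struct toy_stack_map {
    int occupied[TOY_N_BUCKETS];
};

/* Returns 0 and stores the next valid key, or -ENOENT when none is left. */
static int toy_get_next_key(const struct toy_stack_map *map,
                            const uint32_t *key, uint32_t *next_key)
{
    uint32_t id = 0;

    /* A NULL or invalid key restarts from the beginning; a valid key
     * resumes the scan at the slot right after it. */
    if (key && *key < TOY_N_BUCKETS && map->occupied[*key])
        id = *key + 1;

    /* Skip empty slots, which matters when the map is sparse. */
    while (id < TOY_N_BUCKETS && !map->occupied[id])
        id++;

    if (id >= TOY_N_BUCKETS)
        return -ENOENT;

    *next_key = id;
    return 0;
}
```

A caller would start with a NULL key and feed each returned key back in until -ENOENT is reported, which enumerates every occupied slot once.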
      Signed-off-by: Yonghong Song <yhs@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
  3. 05 Jan, 2018 35 commits
    • Merge branch 'xdp_rxq_info' · 11d16edb
      Alexei Starovoitov authored
      Jesper Dangaard Brouer says:
      
      ====================
      V4:
      * Added reviewers/acks to patches
      * Fix patch desc in i40e that got out-of-sync with code
      * Add SPDX license headers for the two new files added in patch 14
      
      V3:
      * Fixed bug in virtio_net driver
      * Removed export of xdp_rxq_info_init()
      
      V2:
      * Changed API exposed to drivers
        - Removed invocation of "init" in drivers, and only call "reg"
          (Suggested by Saeed)
        - Allow "reg" to fail and handle this in drivers
          (Suggested by David Ahern)
      * Removed the SINKQ qtype, instead allow to register as "unused"
      * Also fixed some drivers during testing on actual HW (noted in patches)
      
      XDP needs to know more about the RX-queue that a given XDP frame
      arrived on, both for the XDP bpf-prog and for the kernel side.
      
      Instead of extending struct xdp_buff each time new info is needed,
      this patchset takes a different approach.  Struct xdp_buff is only
      extended with a pointer to a struct xdp_rxq_info (allowing for easier
      extending this later).  This xdp_rxq_info contains information related
      to how the driver has set up the individual RX-queues.  This is
      read-mostly information, and all xdp_buff frames (in drivers
      napi_poll) point to the same xdp_rxq_info (per RX-queue).
      
      We stress this data/cache-line is for read-mostly info.  This is NOT
      for dynamic per packet info, use the data_meta for such use-cases.
      
      This patchset starts out small, and only exposes ingress_ifindex and the
      RX-queue index to the XDP/BPF program. Access to tangible info like
      the ingress ifindex and RX queue index is fairly easy to comprehend.
      The other future use-cases could allow XDP frames to be recycled back
      to the originating device driver, by providing info on RX device and
      queue number.
      
      As XDP doesn't have driver feature flags, and eBPF code, due to
      bpf-tail-calls, cannot determine which XDP driver invokes it, this
      patchset has to update every driver that supports XDP.
      
      For driver developers (review individual driver patches!):
      
      The xdp_rxq_info is tied to the driver's RX-ring(s). Whenever an RX-ring
      modification requires (temporarily) stopping RX frames, then the
      xdp_rxq_info should (likely) also be unregistered and re-registered,
      especially if reallocating the pages in the ring. Make sure ethtool
      set_channels does the right thing. When replacing the XDP prog, if and
      only if the RX-ring needs to be changed, then also re-register the
      xdp_rxq_info.
      
      I'm Cc'ing the individual driver patches to the registered maintainers.
      
      Testing:
      
      I've only tested the NIC drivers I have hardware for.  The general
      test procedure is to (DUT = Device Under Test):
       (1) run pktgen script pktgen_sample04_many_flows.sh       (against DUT)
       (2) run samples/bpf program xdp_rxq_info --dev $DEV       (on DUT)
       (3) runtime modify number of NIC queues via ethtool -L    (on DUT)
       (4) runtime modify number of NIC ring-size via ethtool -G (on DUT)
      
      Patch based on git tree bpf-next (at commit fb982666):
       https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/
      ====================
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • samples/bpf: program demonstrating access to xdp_rxq_info · 0fca931a
      Jesper Dangaard Brouer authored
      This sample program can be used for monitoring and reporting how many
      packets per sec (pps) are received per NIC RX queue index and which
      CPU processed the packet. In itself it is a useful tool for quickly
      identifying RSS imbalance issues, see below.
      
      The default XDP action is XDP_PASS in order to provide a monitor
      mode. For benchmarking purposes it is possible to specify other XDP
      actions on the cmdline via --action.
      
      The output below shows an imbalanced RSS case where most RXQs deliver to
      CPU-0 while CPU-2 only gets packets from a single RXQ.  Looking at
      things from a CPU level the two CPUs are processing approx the same
      amount, BUT looking at the rx_queue_index levels it is clear that
      RXQ-2 receives much better service than the other RXQs, which all share CPU-0.
      
      Running XDP on dev:i40e1 (ifindex:3) action:XDP_PASS
      XDP stats       CPU     pps         issue-pps
      XDP-RX CPU      0       900,473     0
      XDP-RX CPU      2       906,921     0
      XDP-RX CPU      total   1,807,395
      
      RXQ stats       RXQ:CPU pps         issue-pps
      rx_queue_index    0:0   180,098     0
      rx_queue_index    0:sum 180,098
      rx_queue_index    1:0   180,098     0
      rx_queue_index    1:sum 180,098
      rx_queue_index    2:2   906,921     0
      rx_queue_index    2:sum 906,921
      rx_queue_index    3:0   180,098     0
      rx_queue_index    3:sum 180,098
      rx_queue_index    4:0   180,082     0
      rx_queue_index    4:sum 180,082
      rx_queue_index    5:0   180,093     0
      rx_queue_index    5:sum 180,093
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: finally expose xdp_rxq_info to XDP bpf-programs · 02dd3291
      Jesper Dangaard Brouer authored
      Now all XDP drivers have been updated to set up xdp_rxq_info and assign
      this to xdp_buff->rxq.  Thus, it is now safe to enable access to some
      of the xdp_rxq_info struct members.
      
      This patch extends xdp_md and exposes UAPI to userspace for
      ingress_ifindex and rx_queue_index.  Access happens via bpf
      instruction rewrite, which loads data directly from struct xdp_rxq_info.
      
      * ingress_ifindex maps to xdp_rxq_info->dev->ifindex
      * rx_queue_index  maps to xdp_rxq_info->queue_index
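The two mappings above amount to pointer walks starting at xdp_buff->rxq. The sketch below mimics them in user-space C with hypothetical `toy_` structs that carry only the members named in this message; it does not reproduce the kernel struct layouts or the verifier's instruction rewrite.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, trimmed-down stand-ins for the kernel structures. */
struct toy_net_device {
    int ifindex;
};

struct toy_xdp_rxq_info {
    struct toy_net_device *dev;
    uint32_t queue_index;
};

struct toy_xdp_buff {
    struct toy_xdp_rxq_info *rxq;
};

/* ingress_ifindex: two dereferences, rxq then dev. */
static uint32_t toy_ingress_ifindex(const struct toy_xdp_buff *xdp)
{
    return (uint32_t)xdp->rxq->dev->ifindex;
}

/* rx_queue_index: a single dereference through rxq. */
static uint32_t toy_rx_queue_index(const struct toy_xdp_buff *xdp)
{
    return xdp->rxq->queue_index;
}
```

Only a program that reads one of these fields pays for the extra loads; the per-packet setup cost stays at one pointer assignment.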
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • xdp: generic XDP handling of xdp_rxq_info · e817f856
      Jesper Dangaard Brouer authored
      Hook points for xdp_rxq_info:
       * reg  : netif_alloc_rx_queues
       * unreg: netif_free_rx_queues
      
      The net_device has some members (num_rx_queues + real_num_rx_queues)
      and a data-area (dev->_rx with struct netdev_rx_queue's) that were
      primarily used for exporting information about RPS (CONFIG_RPS) queues
      to sysfs (CONFIG_SYSFS).
      
      For generic XDP extend struct netdev_rx_queue with the xdp_rxq_info,
      and remove some of the CONFIG_SYSFS ifdefs.
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • virtio_net: setup xdp_rxq_info · 754b8a21
      Jesper Dangaard Brouer authored
      The virtio_net driver doesn't dynamically change the RX-ring queue
      layout and backing pages, but instead rejects XDP setup if not all the
      conditions for XDP are met.  Thus, the xdp_rxq_info also remains
      fairly static.  This allows us to simply add the reg/unreg to
      net_device open/close functions.
      
      Driver hook points for xdp_rxq_info:
       * reg  : virtnet_open
       * unreg: virtnet_close
      
      V3:
       - bugfix, also setup xdp.rxq in receive_mergeable()
       - Tested bpf-sample prog inside guest on a virtio_net device
      
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: Jason Wang <jasowang@redhat.com>
      Cc: virtualization@lists.linux-foundation.org
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Reviewed-by: Jason Wang <jasowang@redhat.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • tun: setup xdp_rxq_info · 8bf5c4ee
      Jesper Dangaard Brouer authored
      Driver hook points for xdp_rxq_info:
       * reg  : tun_attach
       * unreg: __tun_detach
      
      I've done some manual testing of this tun driver, but I would
      appreciate good review and someone else running their use-case tests,
      as I'm not 100% sure I understand the tfile->detached semantics.
      
      V2: Removed the skb_array_cleanup() call from V1 by request from Jason Wang.
      
      Cc: Jason Wang <jasowang@redhat.com>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: Willem de Bruijn <willemb@google.com>
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Reviewed-by: Jason Wang <jasowang@redhat.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • thunderx: setup xdp_rxq_info · 27e95e36
      Jesper Dangaard Brouer authored
      This driver uses a bool scheme for "enable"/"disable" when setting up
      different resources.  Thus, the hook points for xdp_rxq_info are placed
      in the same function call, nicvf_rcv_queue_config().  This is activated
      through enable/disable via nicvf_config_data_transfer(), which is tied
      into nicvf_stop()/nicvf_open().
      
      Extend the driver packet handler call-path nicvf_rcv_pkt_handler() with
      a pointer to the given struct rcv_queue, in order to access the
      xdp_rxq_info data area (in nicvf_xdp_rx()).
      
      V2: The driver has no proper error path for failed XDP RX-queue info reg,
      as nicvf_rcv_queue_config is a void function.
      
      Cc: linux-arm-kernel@lists.infradead.org
      Cc: Sunil Goutham <sgoutham@cavium.com>
      Cc: Robert Richter <rric@kernel.org>
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • nfp: setup xdp_rxq_info · 7f1c684a
      Jesper Dangaard Brouer authored
      Driver hook points for xdp_rxq_info:
       * reg  : nfp_net_rx_ring_alloc
       * unreg: nfp_net_rx_ring_free
      
      In struct nfp_net_rx_ring, moved member @size into a hole on 64-bit.
      Thus, the size remains the same after adding member @xdp_rxq.
      
      Cc: oss-drivers@netronome.com
      Cc: Jakub Kicinski <jakub.kicinski@netronome.com>
      Cc: Simon Horman <simon.horman@netronome.com>
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bnxt_en: setup xdp_rxq_info · 96a8604f
      Jesper Dangaard Brouer authored
      Driver hook points for xdp_rxq_info:
       * reg  : bnxt_alloc_rx_rings
       * unreg: bnxt_free_rx_rings
      
      This driver should be updated to re-register when changing
      allocation mode of RX rings.
      
      Tested on actual hardware.
      
      Cc: Andy Gospodarek <andy@greyhouse.net>
      Cc: Michael Chan <michael.chan@broadcom.com>
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • mlx4: setup xdp_rxq_info · ae75415d
      Jesper Dangaard Brouer authored
      Driver hook points for xdp_rxq_info:
       * reg  : mlx4_en_create_rx_ring
       * unreg: mlx4_en_destroy_rx_ring
      
      Tested on actual hardware.
      
      Cc: Tariq Toukan <tariqt@mellanox.com>
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • xdp/qede: setup xdp_rxq_info and intro xdp_rxq_info_is_reg · c0124f32
      Jesper Dangaard Brouer authored
      The driver code qede_free_fp_array() depends on the fact that kfree() can
      be called with a NULL pointer. This stems from the qede_alloc_fp_array()
      function, which (kz)allocs memory for either fp->txq or fp->rxq.
      This also simplifies error handling code in case of memory allocation
      failures, but xdp_rxq_info_unreg needs to know the difference.
      
      Introduce xdp_rxq_info_is_reg() to handle the case where a memory
      allocation fails, and detect that this is the failure path by seeing that
      xdp_rxq_info was not registered yet, which first happens after successful
      allocation in qede_init_fp().
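The guard described above can be illustrated with a small user-space sketch. All `toy_` names are hypothetical and free() stands in for kfree(); the point is only that the shared free path must tolerate both a missing RX queue and an RX queue that was allocated but never registered.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

enum toy_reg_state { TOY_UNREGISTERED = 0, TOY_REGISTERED };

struct toy_rxq_info {
    enum toy_reg_state reg_state;
};

struct toy_rxq {
    struct toy_rxq_info xdp_rxq;
};

/* A fastpath entry may carry no RX queue at all (rxq == NULL). */
struct toy_fastpath {
    struct toy_rxq *rxq;
};

static int toy_unreg_calls; /* counts real unregistrations, for inspection */

static bool toy_rxq_info_is_reg(const struct toy_rxq_info *info)
{
    return info->reg_state == TOY_REGISTERED;
}

static void toy_rxq_info_unreg(struct toy_rxq_info *info)
{
    toy_unreg_calls++;
    info->reg_state = TOY_UNREGISTERED;
}

/* Shared by the normal teardown and the allocation-failure path. */
static void toy_free_fastpath(struct toy_fastpath *fp)
{
    /* Only unregister what was actually registered. */
    if (fp->rxq && toy_rxq_info_is_reg(&fp->rxq->xdp_rxq))
        toy_rxq_info_unreg(&fp->rxq->xdp_rxq);

    free(fp->rxq); /* free(NULL) is a no-op, like kfree(NULL) */
    fp->rxq = NULL;
}
```

Zero-initialised memory reads as "unregistered" here, so a queue that was allocated but never reached the registration step is freed without an unreg call.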
      
      Driver hook points for xdp_rxq_info:
       * reg  : qede_init_fp
       * unreg: qede_free_fp_array
      
      Tested on actual hardware with samples/bpf program.
      
      V2: The driver has no proper error path for failed XDP RX-queue info reg, as
      qede_init_fp() is a void function.
      
      Cc: everest-linux-l2@cavium.com
      Cc: Ariel Elior <Ariel.Elior@cavium.com>
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • ixgbe: setup xdp_rxq_info · 99ffc5ad
      Jesper Dangaard Brouer authored
      Driver hook points for xdp_rxq_info:
       * reg  : ixgbe_setup_rx_resources()
       * unreg: ixgbe_free_rx_resources()
      
      Tested on actual hardware.
      
      V2: Fix ixgbe_set_ringparam, clear xdp_rxq_info in temp_ring
      
      Cc: intel-wired-lan@lists.osuosl.org
      Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
      Cc: Alexander Duyck <alexander.duyck@gmail.com>
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Acked-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • i40e: setup xdp_rxq_info · 87128824
      Jesper Dangaard Brouer authored
      The i40e driver has a special "FDIR" RX-ring (I40E_VSI_FDIR) which is
      a sideband channel for configuring/updating the flow director tables.
      This (i40e_vsi_)type does not invoke XDP-ebpf code.
      
      As suggested by Björn (V2): Instead of marking this I40E_VSI_FDIR RX-ring
      a special case, reverse the logic and only select RX-rings of type
      I40E_VSI_MAIN to register xdp_rxq_info's for.
      
      Driver hook points for xdp_rxq_info:
       * reg  : i40e_setup_rx_descriptors (via i40e_vsi_setup_rx_resources)
       * unreg: i40e_free_rx_resources    (via i40e_vsi_free_rx_resources)
      
      Tested on actual hardware with samples/bpf program.
      
      V2: Fixed bug in i40e_set_ringparam (memset zero) + match on I40E_VSI_MAIN.
      V4: Update patch desc that got out-of-sync with code.
      
      Cc: intel-wired-lan@lists.osuosl.org
      Cc: Björn Töpel <bjorn.topel@intel.com>
      Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
      Cc: Paul Menzel <pmenzel@molgen.mpg.de>
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de>
      Acked-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • xdp/mlx5: setup xdp_rxq_info · 0ddf5432
      Jesper Dangaard Brouer authored
      The mlx5 driver has a special drop-RQ queue (one per interface) that
      simply drops all incoming traffic. It helps the driver keep other HW
      objects (flow steering) alive upon down/up operations.  It is
      temporarily pointed to by flow steering objects during the interface
      setup, and when the interface is down. It lacks many fields that are set
      in a regular RQ (for example its state is never switched to
      MLX5_RQC_STATE_RDY). (Thanks to Tariq Toukan for the explanation).
      
      The XDP RX-queue info for this drop-RQ is marked as unused, which
      allows us to use the same takedown/free code path as other RX-queues.
      
      Driver hook points for xdp_rxq_info:
       * reg   : mlx5e_alloc_rq()
       * unused: mlx5e_alloc_drop_rq()
       * unreg : mlx5e_free_rq()
      
      Tested on actual hardware with samples/bpf program
      
      Cc: Saeed Mahameed <saeedm@mellanox.com>
      Cc: Matan Barak <matanb@mellanox.com>
      Cc: Tariq Toukan <tariqt@mellanox.com>
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • xdp: base API for new XDP rx-queue info concept · aecd67b6
      Jesper Dangaard Brouer authored
      This patch only introduces the core data structures and API functions.
      All XDP enabled drivers must use the API before this info can be used.
      
      XDP needs to know more about the RX-queue that a given XDP frame
      arrived on, both for the XDP bpf-prog and for the kernel side.
      
      Instead of extending xdp_buff each time new info is needed, the patch
      creates a separate read-mostly struct xdp_rxq_info, that contains this
      info.  We stress this data/cache-line is for read-mostly info.  This is
      NOT for dynamic per packet info, use the data_meta for such use-cases.
      
      The performance advantage is this info can be set up at RX-ring init
      time, instead of updating N-members in xdp_buff.  A possible (driver
      level) micro optimization is that xdp_buff->rxq assignment could be
      done once per XDP/NAPI loop.  The extra pointer deref only happens for
      program needing access to this info (thus, no slowdown to existing
      use-cases).
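The "set up once, point to it per frame" idea can be sketched in user-space C as follows. The `toy_` types and functions are hypothetical and heavily simplified; they are not the kernel API, and the "program" here is just a function that reads the queue index.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Read-mostly, per-RX-queue info: filled in once at ring init time. */
struct toy_rxq_info {
    int ifindex;
    uint32_t queue_index;
};

/* Per-packet context: carries only a pointer to the shared info. */
struct toy_xdp_buff {
    void *data;
    void *data_end;
    struct toy_rxq_info *rxq;
};

static void toy_ring_init(struct toy_rxq_info *info, int ifindex,
                          uint32_t queue_index)
{
    info->ifindex = ifindex;
    info->queue_index = queue_index;
}

/* Stand-in for an XDP program that wants to know its RX queue. */
static uint32_t toy_prog(const struct toy_xdp_buff *xdp)
{
    return xdp->rxq->queue_index;
}

/* Poll loop: rxq is assigned once, outside the per-packet loop. */
static uint64_t toy_napi_poll(struct toy_rxq_info *info, unsigned int budget)
{
    struct toy_xdp_buff xdp;
    uint64_t sum = 0;
    unsigned int i;

    xdp.rxq = info; /* the micro-optimization mentioned above */

    for (i = 0; i < budget; i++) {
        xdp.data = NULL; /* per-packet fields would be set here */
        xdp.data_end = NULL;
        sum += toy_prog(&xdp);
    }
    return sum;
}
```

With queue index 5 and a budget of 4 frames, the loop accumulates 20 while writing the rxq pointer exactly once.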
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • nfp: add basic multicast filtering · d0adb51e
      Jakub Kicinski authored
      We currently always pass all multicast traffic through.
      Only set L2MC when actually needed.  Since the driver
      was not making use of the capability to filter out mcast
      frames, some FW projects don't implement it any more.
      Don't warn users if capability is not present (like we
      do for promisc flag).  The lack of L2MC capability is
      assumed to mean all multicast traffic goes through.
      Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
      Signed-off-by: Simon Horman <simon.horman@netronome.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'rds-use-RCU-between-work-enqueue-and-connection-teardown' · ad521763
      David S. Miller authored
      Sowmini Varadhan says:
      
      ====================
      rds: use RCU between work-enqueue and connection teardown
      
      This patchset follows up on the root-cause mentioned in
      https://www.spinics.net/lists/netdev/msg472849.html
      
      Patch1 implements some code refactoring that was suggested
      as an enhancement in http://patchwork.ozlabs.org/patch/843157/
      It replaces the c_destroy_in_prog bit in rds_connection with
      an atomically managed flag in rds_conn_path.
      
      Patch2 builds on Patch1 and uses RCU to make sure that
      work is only enqueued if the connection destroy is not already
      in progress: the test-flag-and-enqueue is done under rcu_read_lock,
      while destroy first sets the flag, uses synchronize_rcu to
      wait for existing reader threads to complete, and then starts
      all the work-cancellation.
      
      Since I have not been able to reproduce the original stack traces
      reported by syzbot, and these are fixes for a race condition that
      are based on code inspection, I am not marking these as reported-by
      at this time.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • rds: use RCU to synchronize work-enqueue with connection teardown · 3db6e0d1
      Sowmini Varadhan authored
      rds_sendmsg() can enqueue work on cp_send_w from process context, but
      it should not enqueue this work if connection teardown has commenced
      (else we risk enqueuing work after rds_conn_path_destroy() has assumed that
      all work has been cancelled/flushed).
      
      Similarly some other functions like rds_cong_queue_updates
      and rds_tcp_data_ready are called in softirq context, and may end
      up enqueuing work on rds_wq after rds_conn_path_destroy() has assumed
      that all workqs are quiesced.
      
      Check the RDS_DESTROY_PENDING bit and use rcu synchronization to avoid
      all these races.
      Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
      Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • rds: Use atomic flag to track connections being destroyed · c90ecbfa
      Sowmini Varadhan authored
      Replace c_destroy_in_prog by using a bit in cp_flags that
      can be set/tested atomically.
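A rough user-space analogue of an atomically set/tested flag bit, using C11 atomics in place of the kernel's bit operations. The `toy_` names and the bit position are assumptions made for illustration, and the sketch covers only the flag itself, not any RCU synchronization around it.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define TOY_DESTROY_PENDING (1UL << 0) /* hypothetical bit position */

struct toy_conn_path {
    atomic_ulong cp_flags;
};

/* Atomically set the bit; returns true only for the caller that
 * flipped it from clear to set. */
static bool toy_mark_destroy(struct toy_conn_path *cp)
{
    unsigned long old = atomic_fetch_or(&cp->cp_flags, TOY_DESTROY_PENDING);

    return !(old & TOY_DESTROY_PENDING);
}

static bool toy_destroy_pending(struct toy_conn_path *cp)
{
    return (atomic_load(&cp->cp_flags) & TOY_DESTROY_PENDING) != 0;
}

/* Refuse to queue new work once teardown has been flagged. */
static bool toy_try_enqueue(struct toy_conn_path *cp, int *queued)
{
    if (toy_destroy_pending(cp))
        return false;
    (*queued)++;
    return true;
}
```

On its own, the check in toy_try_enqueue() can still race with a concurrent toy_mark_destroy(); closing that window needs additional synchronization on top of the flag.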
      Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
      Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'tipc-two-small-cleanups' · eb9aa1bf
      David S. Miller authored
      Jon Maloy says:
      
      ====================
      tipc: two small cleanups
      
      These two commits are based on commit f9c935db ("tipc: fix
      problems with multipoint-to-point flow control") which has been
      applied to 'net' but not yet to 'net-next'.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tipc: simplify small window members' sorting algorithm · d84d1b3b
      Jon Maloy authored
      We simplify the sorting algorithm in tipc_update_member(). We also make
      the remaining conditional call to this function unconditional, since the
      same condition now is tested for inside the said function.
      Acked-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tipc: some clarifying name changes · 38266ca1
      Jon Maloy authored
      We rename some functions and variables, to make their purpose clearer.
      
      - tipc_group::congested -> tipc_group::small_win. Members in this list
        are not necessarily (and typically are not) congested. Instead, they may
        *potentially* be subject to congestion because their send window is
        less than ADV_IDLE, and therefore need to be checked during message
        transmission.
      
      - tipc_group_is_receiver() -> tipc_group_is_sender(). This socket will
        accept messages coming from members fulfilling this condition, i.e.,
        they are senders from this member's viewpoint.
      
      - tipc_group_is_enabled() -> tipc_group_is_receiver(). Members
        fulfilling this condition will accept messages sent from the current
        socket, i.e., they are receivers from its viewpoint.
      
      There are no functional changes in this commit.
      Acked-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dsa: lan9303: Fix error return code in lan9303_check_device() · a31e795a
      Wei Yongjun authored
      Fix to return error code -ENODEV from the chip-not-found error handling
      case instead of 0 (ret has been overwritten to 0 by lan9303_read()), as
      done elsewhere in this function.
      Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
      Reviewed-by: Egil Hjelmeland <privat@egil-hjelmeland.no>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a31e795a
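The bug class fixed by the lan9303 commit above can be illustrated with a small userspace sketch. It is not the driver code: sketch_read(), sketch_check_device() and SKETCH_CHIP_ID are hypothetical stand-ins, and the position of the chip ID in the upper 16 bits of the register is an assumption made for the example.

```c
#include <errno.h>
#include <stdint.h>

#define SKETCH_CHIP_ID 0x9303u /* assumption: expected ID, upper 16 bits */

/* Hypothetical stand-in for a register read: returns 0 on success. */
static int sketch_read(uint32_t fake_reg, uint32_t *val)
{
	*val = fake_reg;
	return 0;
}

/* Returns 0 if the chip is recognised, a negative errno otherwise. */
static int sketch_check_device(uint32_t fake_reg)
{
	uint32_t reg;
	int ret;

	ret = sketch_read(fake_reg, &reg);
	if (ret)
		return ret;

	/*
	 * ret is 0 at this point because the read succeeded, so a plain
	 * "return ret" here would report success for an unknown chip.
	 * The error code has to be set explicitly.
	 */
	if ((reg >> 16) != SKETCH_CHIP_ID)
		return -ENODEV;

	return 0;
}
```

The point of the sketch is only that a successful helper call leaves ret at 0, so later failure paths must not reuse it as their return value.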
    • David S. Miller's avatar
      Merge branch 'dsa-Move-padding-into-Broadcom-tagger' · 7892ea23
      David S. Miller authored
      Florian Fainelli says:
      
      ====================
      net: dsa: Move padding into Broadcom tagger
      
      This patch series moves the padding of short packets to where it belongs:
      within the DSA Broadcom tagger code. I just found myself doing this for
      a third driver, which was a clear indication that this was wrong and did
      not scale.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      7892ea23
    • Florian Fainelli's avatar
      net: bgmac: Remove short packet padding for DSA · c979da77
      Florian Fainelli authored
      DSA now correctly pads short packets within net/dsa/tag_brcm.c, so
      it is no longer necessary to do this within bgmac.
      Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c979da77
    • Florian Fainelli's avatar
      net: systemport: Remove short packet padding · 398aff64
      Florian Fainelli authored
      Short packet padding added to the driver is only necessary when using
      Broadcom tags, but since this is now taken care of in
      net/dsa/tag_brcm.c, we are guaranteed to be given correctly padded packets.
      Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      398aff64
    • Florian Fainelli's avatar
      net: dsa: Move padding into Broadcom tagger · bf08c340
      Florian Fainelli authored
      Instead of having each master network device driver potentially used
      with DSA/Broadcom tags pad short packets itself, move the padding
      necessary for the switches to accept short packets to where it makes
      most sense: within tag_brcm.c. This avoids multiplying the number of
      similar commits to e.g. bgmac, bcmsysport, etc.
      Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      bf08c340
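As a rough illustration of what padding short packets in one central place involves, here is a userspace sketch. It does not reproduce tag_brcm.c: the function name, the flat buffer interface and the target length (minimum Ethernet frame length plus an assumed 4-byte tag) are assumptions chosen for the example.

```c
#include <stddef.h>
#include <string.h>

#define SKETCH_ETH_ZLEN     60 /* minimum Ethernet frame length, no FCS */
#define SKETCH_BRCM_TAG_LEN  4 /* assumed tag length in bytes */

/*
 * Zero-pad a frame in place so that it is still at least the minimum
 * Ethernet length after the switch strips the tag.
 * Returns the resulting length, or 0 if the buffer is too small.
 */
static size_t sketch_pad_for_tag(unsigned char *frame, size_t len, size_t cap)
{
	const size_t min_len = SKETCH_ETH_ZLEN + SKETCH_BRCM_TAG_LEN;

	if (len >= min_len)
		return len;        /* already long enough, nothing to do */
	if (cap < min_len)
		return 0;          /* no room to pad */

	memset(frame + len, 0, min_len - len);
	return min_len;
}
```

Doing this once in the tagger means a master driver can transmit whatever it is handed without a padding step of its own.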
    • Quentin Monnet's avatar
      net: sched: fix tcf_block_get_ext() in case CONFIG_NET_CLS is not set · 33c30a8b
      Quentin Monnet authored
      The definition of functions tcf_block_get() and tcf_block_get_ext()
      depends on CONFIG_NET_CLS being set. When those functions gained extack
      support, only one version of the declaration of those functions was
      updated. Function tcf_block_get() was later fixed with commit
      3c149091 ("net: sch: api: fix tcf_block_get").
      
      Change arguments of tcf_block_get_ext() for the case when CONFIG_NET_CLS
      is not set.
      
      Fixes: 8d1a77f9 ("net: sch: api: add extack support in tcf_block_get")
      Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
      Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      33c30a8b
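The general pattern behind the tcf_block_get_ext() fix above is a function with a real declaration when a config option is set and an inline stub when it is not. A minimal sketch follows, with every name (SKETCH_CONFIG_CLS, sketch_block_get_ext(), the struct types) invented for illustration rather than taken from the kernel headers:

```c
#include <stddef.h>

struct sketch_block;                        /* opaque, hypothetical */
struct sketch_extack { const char *msg; };  /* hypothetical error sink */

#ifdef SKETCH_CONFIG_CLS
/* Real implementation lives elsewhere when the feature is built in. */
int sketch_block_get_ext(struct sketch_block **p_block, void *q,
			 struct sketch_extack *extack);
#else
/*
 * Stub for builds without the feature. Its parameter list has to match
 * the real declaration exactly; if an argument such as extack is added
 * to only one of the two variants, callers stop compiling in the other
 * configuration.
 */
static inline int sketch_block_get_ext(struct sketch_block **p_block, void *q,
					struct sketch_extack *extack)
{
	(void)p_block;
	(void)q;
	(void)extack;
	return 0;
}
#endif

/* A caller that compiles the same way in both configurations. */
static int sketch_qdisc_init(struct sketch_extack *extack)
{
	struct sketch_block *block = NULL;

	return sketch_block_get_ext(&block, NULL, extack);
}
```

Because the stub is only compiled when the option is off, a mismatch of this kind tends to go unnoticed until someone builds that configuration.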
    • Soheil Hassas Yeganeh's avatar
      net: revert "Update RFS target at poll for tcp/udp" · 0a38806f
      Soheil Hassas Yeganeh authored
      In multi-threaded processes, one common architecture is to have
      one (or a small number of) threads polling sockets, and a
      considerably larger pool of threads reading from and writing to the
      sockets. When we set RPS core on tcp_poll() or udp_poll() we essentially
      steer all packets of all the polled FDs to one (or a small number of)
      cores, creating a bottleneck and/or RPS misprediction.
      
      Another common architecture is to shard FDs among threads pinned
      to cores. In such a setting, setting RPS core in tcp_poll() and
      udp_poll() is redundant because the RFS core is correctly
      set in recvmsg and sendmsg.
      
      Thus, revert the following commit:
      c3f1dbaf ("net: Update RFS target at poll for tcp/udp").
      Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      0a38806f
    • Soheil Hassas Yeganeh's avatar
      ip: do not set RFS core on error queue reads · e3f2c4a3
      Soheil Hassas Yeganeh authored
      We should only record RPS on normal reads and writes.
      In single threaded processes, all calls record the same state. In
      multi-threaded processes where a separate thread processes
      errors, the RFS table mispredicts.
      
      Note that, when CONFIG_RPS is disabled, sock_rps_record_flow
      is a noop and no branch is added as a result of this patch.
      Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      e3f2c4a3
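The idea in the error-queue commit above, recording the flow only on normal reads, can be sketched in userspace as follows. The struct, the counter and both function names are hypothetical; SKETCH_MSG_ERRQUEUE merely mirrors the value of the Linux MSG_ERRQUEUE flag and is defined locally so the sketch stays self-contained.

```c
#define SKETCH_MSG_ERRQUEUE 0x2000 /* local copy of the flag value */

/* Hypothetical socket with a counter standing in for the RFS table. */
struct sketch_sock {
	unsigned int rfs_records;
};

/* Stand-in for recording "this CPU consumes this flow". */
static void sketch_rps_record_flow(struct sketch_sock *sk)
{
	sk->rfs_records++;
}

/* Receive path: skip the flow record when reading the error queue. */
static void sketch_recvmsg(struct sketch_sock *sk, int flags)
{
	if (!(flags & SKETCH_MSG_ERRQUEUE))
		sketch_rps_record_flow(sk);

	/* ... the actual receive would follow here ... */
}
```

With this check, a dedicated error-handling thread reading the error queue no longer pulls the flow's steering target toward its own core.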
    • David S. Miller's avatar
      Merge branch 'l2tp-remove-configurable-offset-parameters' · 2e40b823
      David S. Miller authored
      James Chapman says:
      
      ====================
      l2tp: remove configurable offset parameters
      
      This patch series removes all code to support a configurable offset in
      transmitted l2tp packets. Code to handle this is incomplete and buggy
      and has been this way for years. If anyone tried to configure an
      offset, it would be ignored for L2TPv2 tunnels or, for L2TPv3 tunnels,
      could result in L2TPv3 packets being transmitted which are not
      compliant with L2TPv3 RFC3931. This patch series removes the support
      for configurable offsets.
      
      No known userspace l2tp daemon configures an offset. However,
      iproute2's "ip l2tp" command has an offset parameter and if set, the
      value is passed to the kernel. This is the most likely use case where
      offsets might be configured, e.g.
      
         ip l2tp add tunnel local 1.1.1.1 remote 1.1.1.2 tunnel_id 1 \
             peer_tunnel_id 2 encap ip
         ip l2tp add session name l2tp0 tunnel_id 1 session_id 1 \
             peer_session_id 2 offset 8
      
      The above would result in packets being transmitted to 1.1.1.2 with 8
      bytes padding between the L2TPv3 header and the payload. The peer
      would need to be configured with the same offset value. However, the
      packets are not compliant with the L2TPv3 RFC, hence I think it's
      unlikely that offset is being used. With this patch series applied,
      the offset would not be configured. The peer would need to be modified to
      remove its offset setting too.
      
      iproute2 should be modified to remove or ignore the ip l2tp offset
      parameter.
      
      This issue was discovered when reviewing a patch series from
      lorenzo.bianconi@redhat.com which adds another netlink attribute to
      configure the expected offset in received L2TPv3 packets. This change
      is reverted by this series because offsets do not exist in L2TPv3
      packets. These commits are:
      
        commit f15bc54e ("l2tp: add peer_offset parameter")
        commit 820da535 ("l2tp: fix missing print session offset info")
      
      In more detail:
      
      The L2TPv2 protocol supports a variable offset from the L2TPv2 header
      to the payload to give the sender implementation some flexibility for
      data alignment when adding L2TP headers on to payloads. The offset
      value is indicated by an optional field in the L2TP header.  Our L2TP
      implementation already detects the presence of the optional offset in
      received packets and skips those bytes when parsing packets. All
      transmitted L2TPv2 packets are always transmitted with no offset.
      
      L2TPv3 has no optional offset field in the L2TPv3 packet
      header. Instead, L2TPv3 defines optional fields in a "Layer-2 Specific
      Sublayer". At the time when the original L2TP code was written, there
      was talk at IETF of offset being implemented in a new Layer-2 Specific
      Sublayer. A L2TP_ATTR_OFFSET netlink attribute was added so that this
      offset could be configured and the intention was to allow it to be
      also used to set the tx offset for L2TPv2. However, no L2TPv3 offset
      was ever specified and the L2TP_ATTR_OFFSET parameter was forgotten
      about.
      
      Setting L2TP_ATTR_OFFSET results in L2TPv3 packets being transmitted
      with the specified number of bytes padding between L2TPv3 header and
      payload. This is not compliant with L2TPv3 RFC3931. So this change
      removes the configurable offset altogether while retaining
      L2TP_ATTR_OFFSET in the API for backwards compatibility. If
      L2TP_ATTR_OFFSET is given, its value is now silently ignored.
      ====================
      Reviewed-by: Guillaume Nault <g.nault@alphalink.fr>
      Tested-by: Guillaume Nault <g.nault@alphalink.fr>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      2e40b823
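The receive-side handling of the optional L2TPv2 offset described in the series above can be sketched as a small header walk. This is a simplified userspace illustration, not the kernel parser: the function and macro names are invented, the flag bit positions follow the L2TPv2 header layout in RFC 2661, and version checking is left out.

```c
#include <stddef.h>
#include <stdint.h>

/* Flag bits in the first 16-bit word of an L2TPv2 header (RFC 2661). */
#define SKETCH_HDRFLAG_T 0x8000 /* control message */
#define SKETCH_HDRFLAG_L 0x4000 /* Length field present */
#define SKETCH_HDRFLAG_S 0x0800 /* Ns and Nr fields present */
#define SKETCH_HDRFLAG_O 0x0200 /* Offset Size field present */

/* Read a big-endian 16-bit value. */
static uint16_t sketch_be16(const uint8_t *p)
{
	return (uint16_t)((p[0] << 8) | p[1]);
}

/*
 * Return the byte index at which the payload of an L2TPv2 data message
 * starts, or -1 if the packet is truncated or is a control message.
 */
static long sketch_l2tpv2_payload_start(const uint8_t *pkt, size_t len)
{
	size_t pos = 2; /* past the flags/version word */
	uint16_t flags;

	if (len < 2)
		return -1;
	flags = sketch_be16(pkt);
	if (flags & SKETCH_HDRFLAG_T)
		return -1;

	if (flags & SKETCH_HDRFLAG_L)
		pos += 2;       /* optional Length */
	pos += 4;               /* Tunnel ID + Session ID */
	if (flags & SKETCH_HDRFLAG_S)
		pos += 4;       /* optional Ns + Nr */

	if (flags & SKETCH_HDRFLAG_O) {
		if (len < pos + 2)
			return -1;
		/* Skip the Offset Size field and the pad bytes it announces. */
		pos += 2 + (size_t)sketch_be16(pkt + pos);
	}

	return pos <= len ? (long)pos : -1;
}
```

On transmit nothing comparable is needed once the series is applied, since packets are always sent with no offset.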
    • James Chapman's avatar
      l2tp: remove configurable payload offset · 900631ee
      James Chapman authored
      If L2TP_ATTR_OFFSET is set to a non-zero value in L2TPv3 tunnels, it
      results in L2TPv3 packets being transmitted which might not be
      compliant with the L2TPv3 RFC. This patch has l2tp ignore the offset
      setting and send all packets with no offset.
      
      In more detail:
      
      L2TPv2 supports a variable offset from the L2TPv2 header to the
      payload. The offset value is indicated by an optional field in the
      L2TP header.  Our L2TP implementation already detects the presence of
      the optional offset and skips that many bytes when handling received
      data packets. All transmitted packets are always transmitted with
      no offset.
      
      L2TPv3 has no optional offset field in the L2TPv3 packet
      header. Instead, L2TPv3 defines optional fields in a "Layer-2 Specific
      Sublayer". At the time when the original L2TP code was written, there
      was talk at IETF of offset being implemented in a new Layer-2 Specific
      Sublayer. A L2TP_ATTR_OFFSET netlink attribute was added so that this
      offset could be configured and the intention was to allow it to be
      also used to set the tx offset for L2TPv2. However, no L2TPv3 offset
      was ever specified and the L2TP_ATTR_OFFSET parameter was forgotten
      about.
      
      Setting L2TP_ATTR_OFFSET results in L2TPv3 packets being transmitted
      with the specified number of bytes padding between L2TPv3 header and
      payload. This is not compliant with L2TPv3 RFC3931. This change
      removes the configurable offset altogether while retaining
      L2TP_ATTR_OFFSET for backwards compatibility. Any L2TP_ATTR_OFFSET
      value is ignored.
      Signed-off-by: James Chapman <jchapman@katalix.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      900631ee
    • James Chapman's avatar
      l2tp: revert "l2tp: fix missing print session offset info" · de3b58bc
      James Chapman authored
      Revert commit 820da535 ("l2tp: fix missing print session offset
      info").  The peer_offset parameter is removed.
      Signed-off-by: James Chapman <jchapman@katalix.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      de3b58bc
    • James Chapman's avatar
      l2tp: revert "l2tp: add peer_offset parameter" · 863def15
      James Chapman authored
      Revert commit f15bc54e ("l2tp: add peer_offset parameter"). This
      is removed because it is adding another configurable offset and
      configurable offsets are being removed.
      Signed-off-by: James Chapman <jchapman@katalix.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      863def15