1. 27 Aug, 2019 7 commits
  2. 21 Aug, 2019 10 commits
  3. 20 Aug, 2019 10 commits
  4. 17 Aug, 2019 13 commits
    • Daniel Borkmann's avatar
      Merge branch 'bpf-af-xdp-xskmap-improvements' · 1f726723
      Daniel Borkmann authored
      Björn Töpel says:
      
      ====================
      This series (v5 and counting) add two improvements for the XSKMAP,
      used by AF_XDP sockets.
      
      1. Automatic cleanup when an AF_XDP socket goes out of scope/is
         released. Instead of require that the user manually clears the
         "released" state socket from the map, this is done
         automatically. Each socket tracks which maps it resides in, and
         remove itself from those maps at relase. A notable implementation
         change, is that the sockets references the map, instead of the map
         referencing the sockets. Which implies that when the XSKMAP is
         freed, it is by definition cleared of sockets.
      
      2. The XSKMAP did not honor the BPF_EXIST/BPF_NOEXIST flag on insert,
         which this patch addresses.
      
      v1->v2: Fixed deadlock and broken cleanup. (Daniel)
      v2->v3: Rebased onto bpf-next
      v3->v4: {READ, WRITE}_ONCE consistency. (Daniel)
              Socket release/map update race. (Daniel)
      v4->v5: Avoid use-after-free on XSKMAP self-assignment [1]. (Daniel)
              Removed redundant assignment in xsk_map_update_elem().
              Variable name consistency; Use map_entry everywhere.
      
      [1] https://lore.kernel.org/bpf/20190802081154.30962-1-bjorn.topel@gmail.com/T/#mc68439e97bc07fa301dad9fc4850ed5aa392f385
      ====================
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      1f726723
    • Björn Töpel's avatar
      xsk: support BPF_EXIST and BPF_NOEXIST flags in XSKMAP · 36cc3435
      Björn Töpel authored
      The XSKMAP did not honor the BPF_EXIST/BPF_NOEXIST flags when updating
      an entry. This patch addresses that.
      Signed-off-by: default avatarBjörn Töpel <bjorn.topel@intel.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      36cc3435
    • Björn Töpel's avatar
      xsk: remove AF_XDP socket from map when the socket is released · 0402acd6
      Björn Töpel authored
      When an AF_XDP socket is released/closed the XSKMAP still holds a
      reference to the socket in a "released" state. The socket will still
      use the netdev queue resource, and block newly created sockets from
      attaching to that queue, but no user application can access the
      fill/complete/rx/tx queues. This results in that all applications need
      to explicitly clear the map entry from the old "zombie state"
      socket. This should be done automatically.
      
      In this patch, the sockets tracks, and have a reference to, which maps
      it resides in. When the socket is released, it will remove itself from
      all maps.
      Suggested-by: default avatarBruce Richardson <bruce.richardson@intel.com>
      Signed-off-by: default avatarBjörn Töpel <bjorn.topel@intel.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      0402acd6
    • Daniel Borkmann's avatar
      Merge branch 'bpf-sk-storage-clone' · 8e46c353
      Daniel Borkmann authored
      Stanislav Fomichev says:
      
      ====================
      Currently there is no way to propagate sk storage from the listener
      socket to a newly accepted one. Consider the following use case:
      
              fd = socket();
              setsockopt(fd, SOL_IP, IP_TOS,...);
              /* ^^^ setsockopt BPF program triggers here and saves something
               * into sk storage of the listener.
               */
              listen(fd, ...);
              while (client = accept(fd)) {
                      /* At this point all association between listener
                       * socket and newly accepted one is gone. New
                       * socket will not have any sk storage attached.
                       */
              }
      
      Let's add new BPF_F_CLONE flag that can be specified when creating
      a socket storage map. This new flag indicates that map contents
      should be cloned when the socket is cloned.
      
      v4:
      * drop 'goto err' in bpf_sk_storage_clone (Yonghong Song)
      * add comment about race with bpf_sk_storage_map_free to the
        bpf_sk_storage_clone side as well (Daniel Borkmann)
      
      v3:
      * make sure BPF_F_NO_PREALLOC is always present when creating
        a map (Martin KaFai Lau)
      * don't call bpf_sk_storage_free explicitly, rely on
        sk_free_unlock_clone to do the cleanup (Martin KaFai Lau)
      
      v2:
      * remove spinlocks around selem_link_map/sk (Martin KaFai Lau)
      * BPF_F_CLONE on a map, not selem (Martin KaFai Lau)
      * hold a map while cloning (Martin KaFai Lau)
      * use BTF maps in selftests (Yonghong Song)
      * do proper cleanup selftests; don't call close(-1) (Yonghong Song)
      * export bpf_map_inc_not_zero
      ====================
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      8e46c353
    • Stanislav Fomichev's avatar
      selftests/bpf: add sockopt clone/inheritance test · c3bbf176
      Stanislav Fomichev authored
      Add a test that calls setsockopt on the listener socket which triggers
      BPF program. This BPF program writes to the sk storage and sets
      clone flag. Make sure that sk storage is cloned for a newly
      accepted connection.
      
      We have two cloned maps in the tests to make sure we hit both cases
      in bpf_sk_storage_clone: first element (sk_storage_alloc) and
      non-first element(s) (selem_link_map).
      
      Cc: Martin KaFai Lau <kafai@fb.com>
      Cc: Yonghong Song <yhs@fb.com>
      Acked-by: default avatarMartin KaFai Lau <kafai@fb.com>
      Acked-by: default avatarYonghong Song <yhs@fb.com>
      Signed-off-by: default avatarStanislav Fomichev <sdf@google.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      c3bbf176
    • Stanislav Fomichev's avatar
      bpf: sync bpf.h to tools/ · 9e819ffc
      Stanislav Fomichev authored
      Sync new sk storage clone flag.
      
      Cc: Martin KaFai Lau <kafai@fb.com>
      Cc: Yonghong Song <yhs@fb.com>
      Acked-by: default avatarMartin KaFai Lau <kafai@fb.com>
      Acked-by: default avatarYonghong Song <yhs@fb.com>
      Signed-off-by: default avatarStanislav Fomichev <sdf@google.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      9e819ffc
    • Stanislav Fomichev's avatar
      bpf: support cloning sk storage on accept() · 8f51dfc7
      Stanislav Fomichev authored
      Add new helper bpf_sk_storage_clone which optionally clones sk storage
      and call it from sk_clone_lock.
      
      Cc: Martin KaFai Lau <kafai@fb.com>
      Cc: Yonghong Song <yhs@fb.com>
      Acked-by: default avatarMartin KaFai Lau <kafai@fb.com>
      Acked-by: default avatarYonghong Song <yhs@fb.com>
      Signed-off-by: default avatarStanislav Fomichev <sdf@google.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      8f51dfc7
    • Stanislav Fomichev's avatar
      bpf: export bpf_map_inc_not_zero · b0e4701c
      Stanislav Fomichev authored
      Rename existing bpf_map_inc_not_zero to __bpf_map_inc_not_zero to
      indicate that it's caller's responsibility to do proper locking.
      Create and export bpf_map_inc_not_zero wrapper that properly
      locks map_idr_lock. Will be used in the next commit to
      hold a map while cloning a socket.
      
      Cc: Martin KaFai Lau <kafai@fb.com>
      Cc: Yonghong Song <yhs@fb.com>
      Acked-by: default avatarMartin KaFai Lau <kafai@fb.com>
      Acked-by: default avatarYonghong Song <yhs@fb.com>
      Signed-off-by: default avatarStanislav Fomichev <sdf@google.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      b0e4701c
    • Petar Penkov's avatar
      selftests/bpf: fix race in test_tcp_rtt test · fae55527
      Petar Penkov authored
      There is a race in this test between receiving the ACK for the
      single-byte packet sent in the test, and reading the values from the
      map.
      
      This patch fixes this by having the client wait until there are no more
      unacknowledged packets.
      
      Before:
      for i in {1..1000}; do ../net/in_netns.sh ./test_tcp_rtt; \
      done | grep -c PASSED
      < trimmed error messages >
      993
      
      After:
      for i in {1..10000}; do ../net/in_netns.sh ./test_tcp_rtt; \
      done | grep -c PASSED
      10000
      
      Fixes: b5587398 ("selftests/bpf: test BPF_SOCK_OPS_RTT_CB")
      Signed-off-by: default avatarPetar Penkov <ppenkov@google.com>
      Reviewed-by: default avatarStanislav Fomichev <sdf@google.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      fae55527
    • Andrii Nakryiko's avatar
      libbpf: relicense bpf_helpers.h and bpf_endian.h · 929ffa6e
      Andrii Nakryiko authored
      bpf_helpers.h and bpf_endian.h contain useful macros and BPF helper
      definitions essential to almost every BPF program. Which makes them
      useful not just for selftests. To be able to expose them as part of
      libbpf, though, we need them to be dual-licensed as LGPL-2.1 OR
      BSD-2-Clause. This patch updates licensing of those two files.
      Acked-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Acked-by: default avatarHechao Li <hechaol@fb.com>
      Acked-by: default avatarMartin KaFai Lau <kafai@fb.com>
      Acked-by: default avatarAndrey Ignatov <rdna@fb.com>
      Acked-by: default avatarYonghong Song <yhs@fb.com>
      Acked-by: default avatarLawrence Brakmo <brakmo@fb.com>
      Acked-by: default avatarAdam Barth <arb@fb.com>
      Acked-by: default avatarRoman Gushchin <guro@fb.com>
      Acked-by: default avatarJosef Bacik <jbacik@fb.com>
      Acked-by: default avatarJoe Stringer <joe@wand.net.nz>
      Acked-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: default avatarJoel Fernandes (Google) <joel@joelfernandes.org>
      Acked-by: default avatarDavid Ahern <dsahern@gmail.com>
      Acked-by: default avatarJesper Dangaard Brouer <brouer@redhat.com>
      Acked-by: default avatarIlya Leoshkevich <iii@linux.ibm.com>
      Acked-by: default avatarLorenz Bauer <lmb@cloudflare.com>
      Acked-by: default avatarAdrian Ratiu <adrian.ratiu@collabora.com>
      Acked-by: default avatarNikita V. Shirokov <tehnerd@tehnerd.com>
      Acked-by: default avatarWillem de Bruijn <willemb@google.com>
      Acked-by: default avatarPetar Penkov <ppenkov@google.com>
      Acked-by: default avatarTeng Qin <palmtenor@gmail.com>
      Cc: Michael Holzheu <holzheu@linux.vnet.ibm.com>
      Cc: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Michal Rostecki <mrostecki@opensuse.org>
      Cc: John Fastabend <john.fastabend@gmail.com>
      Cc: Sargun Dhillon <sargun@sargun.me>
      Signed-off-by: default avatarAndrii Nakryiko <andriin@fb.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      929ffa6e
    • Maxim Mikityanskiy's avatar
      net: Don't call XDP_SETUP_PROG when nothing is changed · c14a9f63
      Maxim Mikityanskiy authored
      Don't uninstall an XDP program when none is installed, and don't install
      an XDP program that has the same ID as the one already installed.
      
      dev_change_xdp_fd doesn't perform any checks in case it uninstalls an
      XDP program. It means that the driver's ndo_bpf can be called with
      XDP_SETUP_PROG asking to set it to NULL even if it's already NULL. This
      case happens if the user runs `ip link set eth0 xdp off` when there is
      no XDP program attached.
      
      The symmetrical case is possible when the user tries to set the program
      that is already set.
      
      The drivers typically perform some heavy operations on XDP_SETUP_PROG,
      so they all have to handle these cases internally to return early if
      they happen. This patch puts this check into the kernel code, so that
      all drivers will benefit from it.
      Signed-off-by: default avatarMaxim Mikityanskiy <maximmi@mellanox.com>
      Acked-by: default avatarJonathan Lemon <jonathan.lemon@gmail.com>
      Reviewed-by: default avatarJakub Kicinski <jakub.kicinski@netronome.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      c14a9f63
    • Daniel Borkmann's avatar
      Merge branch 'bpf-af-xdp-wakeup' · c8186c80
      Daniel Borkmann authored
      Magnus Karlsson says:
      
      ====================
      This patch set adds support for a new flag called need_wakeup in the
      AF_XDP Tx and fill rings. When this flag is set by the driver, it
      means that the application has to explicitly wake up the kernel Rx
      (for the bit in the fill ring) or kernel Tx (for bit in the Tx ring)
      processing by issuing a syscall. Poll() can wake up both and sendto()
      will wake up Tx processing only.
      
      The main reason for introducing this new flag is to be able to
      efficiently support the case when application and driver is executing
      on the same core. Previously, the driver was just busy-spinning on the
      fill ring if it ran out of buffers in the HW and there were none to
      get from the fill ring. This approach works when the application and
      driver is running on different cores as the application can replenish
      the fill ring while the driver is busy-spinning. Though, this is a
      lousy approach if both of them are running on the same core as the
      probability of the fill ring getting more entries when the driver is
      busy-spinning is zero. With this new feature the driver now sets the
      need_wakeup flag and returns to the application. The application can
      then replenish the fill queue and then explicitly wake up the Rx
      processing in the kernel using the syscall poll(). For Tx, the flag is
      only set to one if the driver has no outstanding Tx completion
      interrupts. If it has some, the flag is zero as it will be woken up by
      a completion interrupt anyway. This flag can also be used in other
      situations where the driver needs to be woken up explicitly.
      
      As a nice side effect, this new flag also improves the Tx performance
      of the case where application and driver are running on two different
      cores as it reduces the number of syscalls to the kernel. The kernel
      tells user space if it needs to be woken up by a syscall, and this
      eliminates many of the syscalls. The Rx performance of the 2-core case
      is on the other hand slightly worse, since there is a need to use a
      syscall now to wake up the driver, instead of the driver
      busy-spinning. It does waste less CPU cycles though, which might lead
      to better overall system performance.
      
      This new flag needs some simple driver support. If the driver does not
      support it, the Rx flag is always zero and the Tx flag is always
      one. This makes any application relying on this feature default to the
      old behavior of not requiring any syscalls in the Rx path and always
      having to call sendto() in the Tx path.
      
      For backwards compatibility reasons, this feature has to be explicitly
      turned on using a new bind flag (XDP_USE_NEED_WAKEUP). I recommend
      that you always turn it on as it has a large positive performance
      impact for the one core case and does not degrade 2 core performance
      and actually improves it for Tx heavy workloads.
      
      Here are some performance numbers measured on my local,
      non-performance optimized development system. That is why you are
      seeing numbers lower than the ones from Björn and Jesper. 64 byte
      packets at 40Gbit/s line rate. All results in Mpps. Cores == 1 means
      that both application and driver is executing on the same core. Cores
      == 2 that they are on different cores.
      
                                    Applications
      need_wakeup  cores    txpush    rxdrop      l2fwd
      ---------------------------------------------------------------
           n         1       0.07      0.06        0.03
           y         1       21.6      8.2         6.5
           n         2       32.3      11.7        8.7
           y         2       33.1      11.7        8.7
      
      Overall, the need_wakeup flag provides the same or better performance
      in all the micro-benchmarks. The reduction of sendto() calls in txpush
      is large. Only a few per second is needed. For l2fwd, the drop is 50%
      for the 1 core case and more than 99.9% for the 2 core case. Do not
      know why I am not seeing the same drop for the 1 core case yet.
      
      The name and inspiration of the flag has been taken from io_uring by
      Jens Axboe. Details about this feature in io_uring can be found in
      http://kernel.dk/io_uring.pdf, section 8.3. It also addresses most of
      the denial of service and sendto() concerns raised by Maxim
      Mikityanskiy in https://www.spinics.net/lists/netdev/msg554657.html.
      
      The typical Tx part of an application will have to change from:
      
      ret = sendto(fd,....)
      
      to:
      
      if (xsk_ring_prod__needs_wakeup(&xsk->tx))
             ret = sendto(fd,....)
      
      and th Rx part from:
      
      rcvd = xsk_ring_cons__peek(&xsk->rx, BATCH_SIZE, &idx_rx);
      if (!rcvd)
             return;
      
      to:
      
      rcvd = xsk_ring_cons__peek(&xsk->rx, BATCH_SIZE, &idx_rx);
      if (!rcvd) {
             if (xsk_ring_prod__needs_wakeup(&xsk->umem->fq))
                    ret = poll(fd,.....);
             return;
      }
      
      v3 -> v4:
      * Maxim found a possible race in the Tx part of the driver. The
        setting of the flag needs to happen before the sending, otherwise it
        might trigger this race. Fixed in ixgbe and i40e driver.
      * Mellanox support contributed by Maxim
      * Removed the XSK_DRV_CAN_SLEEP flag as it was not used
        anymore. Thanks to Sridhar for discovering this.
      * For consistency the feature is now always called need_wakeup. There
        were some places where it was referred to as might_sleep, but they
        have been removed. Thanks to Sridhar for spotting.
      * Fixed some typos in the commit messages
      
      v2 -> v3:
      * Converted the Mellanox driver to the new ndo in patch 1 as pointed
        out by Maxim
      * Fixed the compatibility code of XDP_MMAP_OFFSETS so it now works.
      
      v1 -> v2:
      * Fixed bisectability problem pointed out by Jakub
      * Added missing initiliztion of the Tx need_wakeup flag to 1
      
      This patch has been applied against commit b753c5a7 ("Merge branch 'r8152-RX-improve'")
      
      Structure of the patch set:
      
      Patch 1: Replaces the ndo_xsk_async_xmit with ndo_xsk_wakeup to
               support waking up both Rx and Tx processing
      Patch 2: Implements the need_wakeup functionality in common code
      Patch 3-4: Add need_wakeup support to the i40e and ixgbe drivers
      Patch 5: Add need_wakeup support to libbpf
      Patch 6: Add need_wakeup support to the xdpsock sample application
      Patch 7-8: Add need_wakeup support to the Mellanox mlx5 driver
      ====================
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      c8186c80
    • Maxim Mikityanskiy's avatar
      net/mlx5e: Add AF_XDP need_wakeup support · a7bd4018
      Maxim Mikityanskiy authored
      This commit adds support for the new need_wakeup feature of AF_XDP. The
      applications can opt-in by using the XDP_USE_NEED_WAKEUP bind() flag.
      When this feature is enabled, some behavior changes:
      
      RX side: If the Fill Ring is empty, instead of busy-polling, set the
      flag to tell the application to kick the driver when it refills the Fill
      Ring.
      
      TX side: If there are pending completions or packets queued for
      transmission, set the flag to tell the application that it can skip the
      sendto() syscall and save time.
      
      The performance testing was performed on a machine with the following
      configuration:
      
      - 24 cores of Intel Xeon E5-2620 v3 @ 2.40 GHz
      - Mellanox ConnectX-5 Ex with 100 Gbit/s link
      
      The results with retpoline disabled:
      
             | without need_wakeup  | with need_wakeup     |
             |----------------------|----------------------|
             | one core | two cores | one core | two cores |
      -------|----------|-----------|----------|-----------|
      txonly | 20.1     | 33.5      | 29.0     | 34.2      |
      rxdrop | 0.065    | 14.1      | 12.0     | 14.1      |
      l2fwd  | 0.032    | 7.3       | 6.6      | 7.2       |
      
      "One core" means the application and NAPI run on the same core. "Two
      cores" means they are pinned to different cores.
      Signed-off-by: default avatarMaxim Mikityanskiy <maximmi@mellanox.com>
      Reviewed-by: default avatarTariq Toukan <tariqt@mellanox.com>
      Reviewed-by: default avatarSaeed Mahameed <saeedm@mellanox.com>
      Acked-by: default avatarJonathan Lemon <jonathan.lemon@gmail.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      a7bd4018