1. 29 Sep, 2021 4 commits
  2. 28 Sep, 2021 30 commits
  3. 27 Sep, 2021 6 commits
    • Merge branch 'bpf-xsk-rx-batch' · 4c9f0937
      Daniel Borkmann authored
      Magnus Karlsson says:
      
      ====================
      This patch set introduces a batched interface for Rx buffer allocation
      in the AF_XDP buffer pool. Instead of using xsk_buff_alloc(*pool),
      drivers can now use xsk_buff_alloc_batch(*pool, **xdp_buff_array,
      max). Instead of returning a pointer to a single xdp_buff, it returns
      the number of xdp_buffs it managed to allocate, up to the maximum
      given by the max parameter in the call. Pointers to the allocated
      xdp_buffs are put in the xdp_buff_array supplied in the call. This
      could be an SW ring that already exists in the driver or a new
      structure that the driver has allocated.
      
        u32 xsk_buff_alloc_batch(struct xsk_buff_pool *pool,
                                 struct xdp_buff **xdp,
                                 u32 max);
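
      As an illustration (not part of the patch set), a minimal driver-style
      refill sketch; only xsk_buff_alloc_batch() is the real interface, the
      surrounding function and variable names are invented:

        #include <net/xdp_sock_drv.h>

        /* Hypothetical Rx-ring refill using the batched allocator. */
        static u32 refill_rx_ring(struct xsk_buff_pool *pool,
                                  struct xdp_buff **bufs, u32 wanted)
        {
                u32 got;

                /* One call replaces a per-buffer xsk_buff_alloc() loop. It
                 * may return fewer buffers than requested, so the driver has
                 * to cope with a partial allocation. */
                got = xsk_buff_alloc_batch(pool, bufs, wanted);

                /* bufs[0..got-1] now point to allocated xdp_buffs; post them
                 * to the HW ring here and retry the remainder later. */
                return got;
        }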
      
      When using this interface, the driver should also use the new
      interface below to set the relevant fields in the struct xdp_buff. The
      reason is that xsk_buff_alloc_batch() does not fill in the data and
      data_meta fields for you, as xsk_buff_alloc() does, so it is no longer
      sufficient to just set data_end (effectively the size) in the
      driver. This is done for performance reasons, as explained in detail
      in the commit message.
      
        void xsk_buff_set_size(struct xdp_buff *xdp, u32 size);
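
      A short sketch tying the two calls together (the completion function
      and the len array are invented; in a real driver the length would come
      from the NIC's Rx descriptors):

        /* After xsk_buff_alloc_batch(), data and data_meta are not
         * prefilled, so on completion the driver sets them through
         * xsk_buff_set_size() instead of only writing data_end. */
        static void complete_rx_buffers(struct xdp_buff **bufs, u32 n,
                                        const u32 *len)
        {
                u32 i;

                for (i = 0; i < n; i++)
                        xsk_buff_set_size(bufs[i], len[i]);
        }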
      
      Patch 6 also optimizes the buffer allocation in the aligned case. In
      this case, we can skip the reinitialization of most fields in the
      xdp_buff_xsk struct at allocation time. As the number of elements in
      the heads array is equal to the number of possible buffers in the
      umem, we can initialize them once and for all at bind time and then
      just point to the correct one in the xdp_buff_array that is returned
      to the driver; there is no need for a stack of free head entries. In
      the unaligned case, the buffers can reside anywhere in the umem, so
      this optimization is not possible: we still have to fill in the right
      information in the xdp_buff every time one is allocated.
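
      A purely conceptual sketch of the aligned-case idea (the names below
      are invented for illustration and are not the kernel code): chunk
      addresses map 1:1 to pre-initialized heads[] entries, so allocation
      reduces to an index lookup.

        #include <stdint.h>

        /* Toy model of the pool, for illustration only. */
        struct toy_head { void *buf; /* set up once at "bind" time */ };
        struct toy_pool {
                struct toy_head *heads; /* one entry per possible umem buffer */
                uint32_t chunk_shift;   /* log2 of the aligned chunk size */
        };

        /* Aligned case: no per-allocation re-initialization, just pick the
         * head entry that corresponds to the chunk-aligned address. */
        static uint32_t toy_alloc_batch_aligned(struct toy_pool *pool,
                                                const uint64_t *addrs,
                                                struct toy_head **out,
                                                uint32_t max)
        {
                uint32_t i;

                for (i = 0; i < max; i++)
                        out[i] = &pool->heads[addrs[i] >> pool->chunk_shift];
                return i;
        }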
      
      I have updated i40e and ice to use this new batched interface.
      
      These are the throughput results on my 2.1 GHz Cascade Lake system:
      
      Aligned mode:
      ice: +11% / -9 cycles/pkt
      i40e: +12% / -9 cycles/pkt
      
      Unaligned mode:
      ice: +1.5% / -1 cycle/pkt
      i40e: +1% / -1 cycle/pkt
      
      For the aligned case, batching provides around 40% of the performance
      improvement and the aligned optimization the remaining ~60%. Given
      this, I would have expected a ~4% boost for unaligned mode, but I only
      get around 1%; I do not know why. Note that memory consumption in
      aligned mode is also reduced by this patch set.
      ====================
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • selftests: xsk: Add frame_headroom test · e34087fc
      Magnus Karlsson authored
      Add a test for the frame_headroom feature that can be set on the
      umem. The logic added validates that all offsets in all tests and
      packets are valid, not just the ones that have a specifically
      configured frame_headroom. (See the sketch after this entry.)
      Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Link: https://lore.kernel.org/bpf/20210922075613.12186-14-magnus.karlsson@gmail.com
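
      As an illustration of the frame_headroom setting exercised above, a
      minimal sketch using libbpf's xsk.h (the wrapper function, the
      256-byte value and the default ring sizes are arbitrary choices for
      the sketch, not the selftest code):

        #include <bpf/xsk.h>

        /* Create a umem that reserves headroom in front of every frame. */
        static struct xsk_umem *create_umem_with_headroom(void *area, __u64 size,
                                                          struct xsk_ring_prod *fq,
                                                          struct xsk_ring_cons *cq)
        {
                struct xsk_umem *umem;
                struct xsk_umem_config cfg = {
                        .fill_size = XSK_RING_PROD__DEFAULT_NUM_DESCS,
                        .comp_size = XSK_RING_CONS__DEFAULT_NUM_DESCS,
                        .frame_size = XSK_UMEM__DEFAULT_FRAME_SIZE,
                        .frame_headroom = 256, /* feature under test */
                };

                return xsk_umem__create(&umem, area, size, fq, cq, &cfg) ?
                       NULL : umem;
        }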
    • selftests: xsk: Change interleaving of packets in unaligned mode · e4e9baf0
      Magnus Karlsson authored
      Change the interleaving of packets in unaligned mode. With the current
      buffer addresses in the packet stream, the last buffer in the umem
      could not be used, as a large packet could potentially write past the
      end of the umem. The kernel correctly threw this buffer address away
      and refused to use it. This is perfectly fine for all regular packet
      streams, but the streams used for unaligned mode place every other
      packet at a different offset. As checks for correct offsets are added
      in the next patch, this needs to be fixed. Start these page-boundary
      straddling buffers one page earlier so that the last one is not on the
      last page of the umem, making all buffers valid.
      Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Link: https://lore.kernel.org/bpf/20210922075613.12186-13-magnus.karlsson@gmail.com
    • selftests: xsk: Add single packet test · 96a40678
      Magnus Karlsson authored
      Add a test where a single packet is sent and received. This might
      sound like a silly test, but since many of the interfaces in xsk are
      batched, it is important to be able to validate that we did not break
      something as fundamental as receiving a single packet, as opposed to
      batches of packets at high speed.
      Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Link: https://lore.kernel.org/bpf/20210922075613.12186-12-magnus.karlsson@gmail.com
    • selftests: xsk: Introduce pacing of traffic · 1bf36496
      Magnus Karlsson authored
      Introduce pacing of traffic so that the Tx thread can never send more
      packets than the receiver has processed plus the number of packets it
      can have in its umem. So at any point in time, the number of in-flight
      packets (not processed by the Rx thread) is less than or equal to the
      number of packets that can be held in the Rx thread's umem. (See the
      sketch after this entry.)
      
      The batch size is also increased to improve running time.
      Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Link: https://lore.kernel.org/bpf/20210922075613.12186-11-magnus.karlsson@gmail.com
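
      As a rough illustration of the invariant above (invented bookkeeping
      names; not the selftest code), the Tx side only sends a batch when the
      outstanding packets plus that batch still fit in the Rx thread's umem:

        #include <stdint.h>

        struct pacing {
                uint64_t sent;           /* packets handed to Tx so far */
                uint64_t processed;      /* packets consumed by the Rx thread */
                uint32_t rx_umem_frames; /* capacity of the receiver's umem */
        };

        /* In-flight (sent - processed) packets must never exceed what the
         * Rx thread's umem can hold, so Tx waits before sending more. */
        static int can_send(const struct pacing *p, uint32_t batch)
        {
                return p->sent - p->processed + batch <= p->rx_umem_frames;
        }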
    • selftests: xsk: Fix socket creation retry · 89013b8a
      Magnus Karlsson authored
      The socket creation retry unnecessarily registered the umem once for
      every retry. There is no reason to do this: it wastes memory, and it
      might eventually lock too many pages and make a test fail. (See the
      sketch after this entry.)
      Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Link: https://lore.kernel.org/bpf/20210922075613.12186-10-magnus.karlsson@gmail.com
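
      A minimal sketch of the fixed pattern (the wrapper, retry count and
      EBUSY condition are illustrative; only xsk_socket__create() is the
      real libbpf call): the umem is registered once elsewhere, and only the
      socket creation is retried:

        #include <bpf/xsk.h>
        #include <errno.h>
        #include <unistd.h>

        /* Retry only the socket creation; re-registering the umem on every
         * retry would pin more and more pages for no benefit. */
        static int create_socket_with_retry(struct xsk_umem *umem,
                                            const char *ifname, __u32 queue,
                                            struct xsk_ring_cons *rx,
                                            struct xsk_ring_prod *tx,
                                            struct xsk_socket **xsk)
        {
                int i, ret = -EBUSY;

                for (i = 0; i < 3 && ret == -EBUSY; i++) {
                        ret = xsk_socket__create(xsk, ifname, queue, umem,
                                                 rx, tx, NULL);
                        if (ret == -EBUSY)
                                usleep(100000); /* wait for the old socket */
                }
                return ret;
        }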