1. 30 Nov, 2020 7 commits
  2. 27 Nov, 2020 10 commits
  3. 25 Nov, 2020 6 commits
  4. 24 Nov, 2020 4 commits
  5. 20 Nov, 2020 2 commits
  6. 19 Nov, 2020 4 commits
  7. 18 Nov, 2020 3 commits
  8. 17 Nov, 2020 4 commits
    • Merge branch 'af-xdp-tx-batch' · cbf398d7
      Daniel Borkmann authored
      Magnus Karlsson says:
      
      ====================
      This patch set mainly improves the performance of Tx processing for
      AF_XDP sockets, though patch 3 also improves the Rx path. All in all,
      this patch set improves the throughput of the l2fwd xdpsock application
      by around 11%. Looking at the Tx processing part alone, it is
      improved by 35% to 40%.
      
      Hopefully the new batched Tx interfaces will be of value to other
      drivers implementing AF_XDP zero-copy support. Patch #3, however, is
      generic and will improve the performance of all drivers when using
      AF_XDP sockets (under the premises explained in that patch).
      
      @Daniel. In patch 3, I apply all the padding required to prevent the
      adjacency prefetcher from prefetching the wrong things. After this patch
      set, I will submit another patch set that introduces
      ____cacheline_padding_in_smp in include/linux/cache.h according to your
      suggestions. The last patch in that patch set will then convert the
      explicit paddings that we have now to ____cacheline_padding_in_smp.
      
      v2 -> v3:
      * Fixed #pragma warning with clang and defined a loop_unrolled_for macro
        for easier readability [lkp, Nick]
      * Simplified invalid descriptor handling in xskq_cons_read_desc_batch()
      
      v1 -> v2:
      * Removed added parameter in i40e_setup_tx_descriptors and adopted a
        simpler solution [Maciej]
      * Added test for !xs in xsk_tx_peek_release_desc_batch() [John]
      * Simplified return path in xsk_tx_peek_release_desc_batch() [John]
      * Dropped patch #1 in v1 that introduced lazy completions. Hopefully
        this is not needed when we get busy poll [Jakub]
      * Iterate over local variable in xskq_prod_reserve_addr_batch() for
        improved performance
      * Fixed the fallback path in xsk_tx_peek_release_desc_batch() so that
        it also produces a batch of descriptors, albeit by using the slower
        (but more general) older code. This improves the performance of the
        case when multiple sockets are sharing the same device and queue id.
      ====================
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • i40e: Use batched xsk Tx interfaces to increase performance · 3106c580
      Magnus Karlsson authored
      Use the new batched xsk interfaces for the Tx path in the i40e driver
      to improve performance. On my machine, this yields a throughput
      increase of 4% for the l2fwd sample app in xdpsock. If we instead look
      at just the Tx part, this patch set increases throughput by more than
      20%.
      
      Note that I had to explicitly unroll the inner loop with a pragma to
      reach this performance level. The pragma is honored by both clang
      and gcc and should be ignored by versions that do not support
      it. Using the -funroll-loops compiler command line switch on the
      source file resulted in loop unrolling at a higher level that
      led to a performance decrease instead of an increase.
      Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: John Fastabend <john.fastabend@gmail.com>
      Link: https://lore.kernel.org/bpf/1605525167-14450-6-git-send-email-magnus.karlsson@gmail.com
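The explicit unrolling described in this commit can be illustrated with a small userspace sketch. The macro name loop_unrolled_for is taken from the v2 -> v3 changelog above; everything else here (the batch size of 4, the demo_* types and functions) is a hypothetical stand-in and not the driver's actual code. The idea is to unroll only the short inner loop over one batch, rather than letting -funroll-loops pick loops on its own.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PKTS_PER_BATCH 4 /* assumed batch size; must be a power of two */

/* Ask the compiler to unroll exactly one loop; plain "for" otherwise. */
#if defined(__clang__)
#define loop_unrolled_for _Pragma("clang loop unroll_count(4)") for
#elif defined(__GNUC__) && __GNUC__ >= 8
#define loop_unrolled_for _Pragma("GCC unroll 4") for
#else
#define loop_unrolled_for for
#endif

/* Hypothetical descriptor types, for illustration only. */
struct demo_desc { uint64_t addr; uint32_t len; };
struct demo_hw_desc { uint64_t buf_addr; uint32_t buf_len; };

static void demo_fill_one(struct demo_hw_desc *hw, const struct demo_desc *d,
			  uint32_t *bytes)
{
	hw->buf_addr = d->addr;
	hw->buf_len = d->len;
	*bytes += d->len;
}

/* Fill nb_pkts hardware descriptors and return the total byte count. */
static uint32_t demo_fill_ring(struct demo_hw_desc *hw,
			       const struct demo_desc *descs, uint32_t nb_pkts)
{
	/* Largest multiple of PKTS_PER_BATCH that fits in nb_pkts. */
	uint32_t batched = nb_pkts & ~(uint32_t)(PKTS_PER_BATCH - 1);
	uint32_t bytes = 0;
	uint32_t i, j;

	assert(hw != NULL && descs != NULL);

	for (i = 0; i < batched; i += PKTS_PER_BATCH) {
		/* Only this fixed-length inner loop is unrolled. */
		loop_unrolled_for (j = 0; j < PKTS_PER_BATCH; j++)
			demo_fill_one(&hw[i + j], &descs[i + j], &bytes);
	}

	/* Leftover descriptors that do not fill a whole batch. */
	for (; i < nb_pkts; i++)
		demo_fill_one(&hw[i], &descs[i], &bytes);

	return bytes;
}
```

With six descriptors, for example, the first four go through the unrolled inner loop and the remaining two through the plain leftover loop.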
    • xsk: Introduce batched Tx descriptor interfaces · 9349eb3a
      Magnus Karlsson authored
      Introduce batched descriptor interfaces in the xsk core code for the
      Tx path, so that drivers can implement a higher-performance code
      path. This interface will be used by the i40e driver in the next
      patch, though other drivers would likely benefit from it too.
      
      Note that batching is only implemented for the common case when
      there is only one socket bound to the same device and queue id. When
      this is not the case, we fall back to the old non-batched version of
      the function.
      Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: John Fastabend <john.fastabend@gmail.com>
      Link: https://lore.kernel.org/bpf/1605525167-14450-5-git-send-email-magnus.karlsson@gmail.com
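The difference between reading descriptors one at a time and reading them as a batch can be sketched with a simplified userspace ring. This is only a model under stated assumptions: the demo_* names, the ring size, and the nr_sockets parameter are hypothetical, there are no memory barriers or descriptor validity checks, and none of it is the kernel's xsk implementation. It shows the two ideas from the commit message: the batched path updates the consumer index once per batch, and the fallback path still returns a batch, only built with the slower single-descriptor read.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_RING_SIZE 8u /* assumed ring size; must be a power of two */

/* Hypothetical Tx descriptor and ring, for illustration only. */
struct demo_desc { uint64_t addr; uint32_t len; };

struct demo_ring {
	uint32_t producer; /* free-running index written by the producer */
	uint32_t consumer; /* free-running index written by the consumer */
	struct demo_desc descs[DEMO_RING_SIZE];
};

/* Older style: read one descriptor and update the consumer each time. */
static int demo_ring_read_one(struct demo_ring *r, struct demo_desc *out)
{
	if (r->consumer == r->producer)
		return 0;
	*out = r->descs[r->consumer & (DEMO_RING_SIZE - 1)];
	r->consumer++;
	return 1;
}

/* Batched style: copy up to max descriptors, update the consumer once. */
static uint32_t demo_ring_read_batch(struct demo_ring *r,
				     struct demo_desc *out, uint32_t max)
{
	uint32_t avail = r->producer - r->consumer;
	uint32_t n = avail < max ? avail : max;
	uint32_t cons = r->consumer; /* iterate over a local copy */
	uint32_t i;

	for (i = 0; i < n; i++)
		out[i] = r->descs[cons++ & (DEMO_RING_SIZE - 1)];
	r->consumer = cons;
	return n;
}

/*
 * Use the batched read when exactly one socket is bound to the queue;
 * otherwise fall back to the one-at-a-time read, still filling a batch.
 */
static uint32_t demo_tx_peek_batch(struct demo_ring *r, unsigned int nr_sockets,
				   struct demo_desc *out, uint32_t max)
{
	uint32_t n = 0;

	assert(r != NULL && out != NULL);

	if (nr_sockets == 1)
		return demo_ring_read_batch(r, out, max);

	while (n < max && demo_ring_read_one(r, &out[n]))
		n++;
	return n;
}
```

In the real multi-socket case each socket has its own Tx ring; the single ring here only keeps the sketch short.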
    • xsk: Introduce padding between more ring pointers · b8c7aece
      Magnus Karlsson authored
      Introduce one cache line worth of padding between the consumer pointer
      and the flags field as well as between the flags field and the start
      of the descriptors in all the lockless rings. This is so that the x86 HW
      adjacency prefetcher will not prefetch the adjacent pointer/field when
      only one pointer/field is going to be used. This improves the
      throughput of the l2fwd sample app by 1% on my machine with HW
      prefetching turned on in the BIOS.
      Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: John Fastabend <john.fastabend@gmail.com>
      Link: https://lore.kernel.org/bpf/1605525167-14450-4-git-send-email-magnus.karlsson@gmail.com
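One way to picture this kind of padding is a ring header in which every hot field starts on its own cache line and is followed by a full line of padding. The layout below is a userspace sketch assuming a 64-byte cache line and C11 alignas; the struct and field names are hypothetical and this is not claimed to be the kernel's actual ring definition. The static assertions check that a whole padding line sits between consumer and flags, and between flags and the point where descriptors would begin.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_CACHELINE 64 /* assumed cache line size in bytes */

/* Hypothetical ring header: each field begins a new cache line. */
struct demo_ring_hdr {
	alignas(DEMO_CACHELINE) uint32_t producer;
	alignas(DEMO_CACHELINE) uint32_t pad1; /* padding line */
	alignas(DEMO_CACHELINE) uint32_t consumer;
	alignas(DEMO_CACHELINE) uint32_t pad2; /* padding line */
	alignas(DEMO_CACHELINE) uint32_t flags;
	alignas(DEMO_CACHELINE) uint32_t pad3; /* padding line */
};

/* Descriptors would be placed directly after the padded header. */
static size_t demo_descs_offset(void)
{
	return sizeof(struct demo_ring_hdr);
}

/* One full padding line separates consumer from flags ... */
static_assert(offsetof(struct demo_ring_hdr, flags) -
	      offsetof(struct demo_ring_hdr, consumer) == 2 * DEMO_CACHELINE,
	      "expected a padding cache line between consumer and flags");

/* ... and flags from the start of the descriptors. */
static_assert(sizeof(struct demo_ring_hdr) -
	      offsetof(struct demo_ring_hdr, flags) == 2 * DEMO_CACHELINE,
	      "expected a padding cache line between flags and descriptors");
```

The cost of this layout is memory: the header grows to six cache lines for three 4-byte fields, which is the trade-off for keeping the adjacent-line prefetcher from pulling in a field that another core is writing.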