1. 04 Jun, 2018 11 commits
    • Merge branch 'bpf-af-xdp-fixes' · 64995362
      Daniel Borkmann authored
      Björn Töpel says:
      
      ====================
      An issue with the current AF_XDP uapi raised by Mykyta Iziumtsev (see
      https://www.spinics.net/lists/netdev/msg503664.html) is that it does
      not support NICs that have a "type-writer" model in an efficient
      way. In this model, a memory window is passed to the hardware and
      multiple frames might be filled into that window, instead of just one
      that we have in the current fixed frame-size model.
      
      This patch set fixes two bugs in the current implementation and then
      changes the uapi so that the type-writer model can be supported
      efficiently by a possible future extension of AF_XDP.
      
      These are the uapi changes in this patch:
      
      * Change the "u32 idx" in the descriptors to "u64 addr". The current
        idx-based format does NOT work for the type-writer model (as packets
        can start anywhere within a frame), whereas a relative address
        pointer (the u64 addr) works well for both models in the prototype
        code we have that supports both models. We increased it from u32 to
        u64 to support umems larger than 4G. We have also removed the u16
        offset when having a "u64 addr" since that information is already
        carried in the least significant bits of the address.
      
      * We want to use "u8 padding[5]" for something useful in the future
        (since we are not allowed to change its name), so we now call it
        just options so it can be extended for various purposes in the
        future. It is a u32, as that is what is left of the 16-byte
        descriptor.
      
      * We changed the name of frame_size in the UMEM_REG setsockopt to
        chunk_size since this naming also makes sense for the type-writer
        model.
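
      The descriptor change described in the list above can be sketched
      as follows. This is a minimal illustration and not the kernel
      header: the struct names are hypothetical, and the len and flags
      fields are assumptions that the text above does not name. The
      point is that both layouts occupy 16 bytes, which is why options
      ends up as a u32.

```c
#include <stddef.h>
#include <stdint.h>

/* Sketch of the descriptor before this series (hypothetical name).
 * idx, offset and padding[5] come from the text; len and flags are
 * assumed fields. */
struct old_desc_sketch {
	uint32_t idx;        /* frame index into the umem */
	uint32_t len;        /* packet length (assumed) */
	uint16_t offset;     /* packet offset within the frame */
	uint8_t  flags;      /* assumed */
	uint8_t  padding[5];
};

/* Sketch of the descriptor after this series (hypothetical name). */
struct new_desc_sketch {
	uint64_t addr;       /* relative address into the umem */
	uint32_t len;        /* packet length (assumed) */
	uint32_t options;    /* the 4 bytes left over, kept for future use */
};

/* Both layouts fill the same 16-byte slot. */
_Static_assert(sizeof(struct old_desc_sketch) == 16, "old layout is 16 bytes");
_Static_assert(sizeof(struct new_desc_sketch) == 16, "new layout is 16 bytes");
```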
      
      With these changes to the uapi, we believe the type-writer model can
      be supported without having to resort to a new descriptor format. The
      type-writer model could then be supported, from the uapi point of
      view, by setting a flag at bind time and providing a new flag bit in
      the options field of the descriptor that signals to user space that
      all packets have been written in a chunk. Or with a new chunk
      completion queue as suggested by Mykyta in his latest feedback mail on
      the list.
      
      We based this patch set on bpf-next commit bd3a08aa ("bpf:
      flowlabel in bpf_fib_lookup should be flowinfo").
      
      The structure of the patch set is as follows:
      
      Patches 1-2: Fixes two bugs in the current implementation.
      Patches 3-4: Prepares the uapi for a "type-writer" model and modifies
                   the sample application so that it works with the new
                   uapi.
      Patch 5: Small performance improvement patch for the sample application.
      
      Cheers: Magnus and Björn
      ====================
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • samples/bpf: adapted to new uapi · a412ef54
      Björn Töpel authored
      Here, the xdpsock sample application is adjusted to the new descriptor
      format.
      Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • xsk: new descriptor addressing scheme · bbff2f32
      Björn Töpel authored
      Currently, AF_XDP only supports a fixed frame-size memory scheme where
      each frame is referenced via an index (idx). A user passes the frame
      index to the kernel, and the kernel acts upon the data. Some NICs,
      however, do not have a fixed frame-size model; instead they have a
      model where a memory window is passed to the hardware and multiple
      frames are filled into that window (referred to as the "type-writer"
      model).
      
      By changing the descriptor format from the current frame index
      addressing scheme, AF_XDP can in the future be extended to support
      these kinds of NICs.
      
      In the index-based model, an idx refers to a frame of size
      frame_size. Addressing a frame in the UMEM is done by offsetting the
      UMEM starting address by a global offset, idx * frame_size + offset.
      Communicating via the fill- and completion-rings is done by means of
      idx.
      
      In this commit, the idx is removed in favor of an address (addr),
      which is a relative address ranging over the UMEM. Converting an
      idx-based address to the new addr is simple: addr = idx * frame_size +
      offset.
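
      As a minimal sketch of that conversion (the function name and
      parameter types are assumptions chosen for illustration, not
      kernel code), the multiplication is widened to 64 bits so that
      UMEMs larger than 4G are addressable:

```c
#include <stdint.h>

/* Hypothetical helper: convert an (idx, offset) pair from the old
 * index-based scheme into a relative UMEM address. The cast makes the
 * multiplication happen in 64 bits, so the result does not wrap at 4G. */
static inline uint64_t idx_to_addr(uint32_t idx, uint32_t frame_size,
				   uint16_t offset)
{
	return (uint64_t)idx * frame_size + offset;
}
```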
      
      We also stop referring to the UMEM "frame" as a frame. Instead it is
      simply called a chunk.
      
      To transfer ownership of a chunk to the kernel, the addr of the chunk
      is passed in the fill-ring. Note that the kernel will mask addr to
      make it chunk aligned, so there is no need for userspace to do
      that. E.g., for a chunk size of 2k, passing an addr of 2048, 2050 or
      3000 to the fill-ring will refer to the same chunk.
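
      The masking can be illustrated with a small sketch. It assumes
      the chunk size is a power of two, which is what makes a simple
      bit mask sufficient; the function name is hypothetical and this
      is not the kernel's implementation.

```c
#include <stdint.h>

/* Hypothetical helper: round addr down to the start of its chunk.
 * Assumes chunk_size is a power of two, so chunk_size - 1 is a mask of
 * the low bits that select a byte within the chunk. */
static inline uint64_t chunk_base(uint64_t addr, uint64_t chunk_size)
{
	return addr & ~(chunk_size - 1);
}
```

      With a chunk size of 2048, chunk_base() maps 2048, 2050 and 3000
      to the same chunk start, 2048.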
      
      On the completion-ring, the addr will match that of the Tx descriptor
      passed to the kernel.
      
      Changing the descriptor format to use chunks/addr will allow for
      future changes to move to a type-writer based model, where multiple
      frames can reside in one chunk. In this model, passing one single chunk
      into the fill-ring would potentially result in multiple Rx
      descriptors.
      
      This commit changes the uapi of AF_XDP sockets, and updates the
      documentation.
      Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • xsk: proper Rx drop statistics update · a509a955
      Björn Töpel authored
      Previously, rx_dropped could be updated incorrectly, e.g. if the XDP
      program redirected the frame to a socket bound to a different queue
      than where the XDP program was executing.
      Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • xsk: proper fill queue descriptor validation · 4e64c835
      Björn Töpel authored
      Previously, the fill queue descriptor was not copied to kernel space
      prior to validating it, making it possible for userland to change the
      descriptor post-kernel-validation.
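
      The general pattern behind this kind of fix can be sketched as
      below: read the shared entry once into a private copy, then
      validate and use only that copy. This is a user-space
      illustration under assumptions, not the kernel change itself;
      the function names, the u64 entry type and the bounds check
      against umem_size are all hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: read the shared slot exactly once. The volatile
 * qualifier keeps the compiler from re-reading the slot later. */
static inline uint64_t read_once_u64(const volatile uint64_t *slot)
{
	return *slot;
}

/* Fetch one fill-queue entry safely: copy first, then validate the copy.
 * Because later code only sees *out, a concurrent writer changing the
 * shared slot after validation cannot affect what gets used. */
static bool fill_entry_fetch(const volatile uint64_t *shared_slot,
			     uint64_t umem_size, uint64_t *out)
{
	uint64_t entry = read_once_u64(shared_slot); /* private copy */

	if (entry >= umem_size) /* validate the copy, not the slot */
		return false;

	*out = entry;
	return true;
}
```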
      Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • bpf: flowlabel in bpf_fib_lookup should be flowinfo · bd3a08aa
      David Ahern authored
      As Michal noted, the flow struct takes both the flow label and priority.
      Update the bpf_fib_lookup API to note that it is flowinfo and not just
      the flow label.
      
      Cc: Michal Kubecek <mkubecek@suse.cz>
      Signed-off-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • Merge branch 'bpf_get_current_cgroup_id' · 432bdb58
      Alexei Starovoitov authored
      Yonghong Song says:
      
      ====================
      bpf has been used extensively for tracing. For example, bcc
      contains an almost full set of bpf-based tools to trace kernel
      and user functions/events. Most tracing tools are currently
      either filtered based on pid or system-wide.
      
      Containers have been used quite extensively in industry and
      cgroup is often used together to provide resource isolation
      and protection. Several processes may run inside the same
      container. It is often desirable to get container-level tracing
      results as well, e.g. syscall count, function count, I/O
      activity, etc.
      
      This patch implements a new helper, bpf_get_current_cgroup_id(),
      which will return cgroup id based on the cgroup within which
      the current task is running.
      
      Patch #1 implements the new helper in the kernel.
      Patch #2 syncs the uapi bpf.h header and helper between tools
      and kernel.
      Patch #3 shows how to get the same cgroup id in user space,
      so a filter or policy could be configured in the bpf program
      based on current task cgroup.
      
      Changelog:
        v1 -> v2:
           . rebase to resolve merge conflict with latest bpf-next.
      ====================
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • tools/bpf: add a selftest for bpf_get_current_cgroup_id() helper · f269099a
      Yonghong Song authored
      Syscall name_to_handle_at() can be used to get the cgroup id
      for a particular cgroup path in user space. The selftest
      gets the cgroup id from both user space and the kernel, and
      compares them to ensure they are equal.
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Yonghong Song <yhs@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • tools/bpf: sync uapi bpf.h for bpf_get_current_cgroup_id() helper · c7ddbbaf
      Yonghong Song authored
      Sync kernel uapi/linux/bpf.h with tools uapi/linux/bpf.h.
      Also add the necessary helper define in bpf_helpers.h.
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Yonghong Song <yhs@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: implement bpf_get_current_cgroup_id() helper · bf6fa2c8
      Yonghong Song authored
      bpf has been used extensively for tracing. For example, bcc
      contains an almost full set of bpf-based tools to trace kernel
      and user functions/events. Most tracing tools are currently
      either filtered based on pid or system-wide.
      
      Containers have been used quite extensively in industry and
      cgroup is often used together to provide resource isolation
      and protection. Several processes may run inside the same
      container. It is often desirable to get container-level tracing
      results as well, e.g. syscall count, function count, I/O
      activity, etc.
      
      This patch implements a new helper, bpf_get_current_cgroup_id(),
      which will return cgroup id based on the cgroup within which
      the current task is running.
      
      A later patch will provide an example to show that
      userspace can get the same cgroup id so it could
      configure a filter or policy in the bpf program based on
      task cgroup id.
      
      The helper is currently implemented for tracing. It can
      be added to other program types as well when needed.
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Yonghong Song <yhs@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
  2. 03 Jun, 2018 21 commits
  3. 02 Jun, 2018 8 commits