  27 Jan, 2022 (14 commits)
    • Hou Tao
      bpf, x86: Remove unnecessary handling of BPF_SUB atomic op · b6ec7951
      Hou Tao authored
      According to the LLVM commit (https://reviews.llvm.org/D72184),
      sync_fetch_and_sub() is implemented as a negation followed by
      sync_fetch_and_add(), so no BPF_SUB op is ever emitted; just
      remove its handling. BPF_SUB is also rejected by the verifier anyway.
      Signed-off-by: Hou Tao <houtao1@huawei.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Brendan Jackman <jackmanb@google.com>
      Link: https://lore.kernel.org/bpf/20220127083240.1425481-1-houtao1@huawei.com
    • Alexei Starovoitov
      Merge branch 'bpf: add __user tagging support in vmlinux BTF' · 50fc9786
      Alexei Starovoitov authored
      Yonghong Song says:
      
      ====================
      
      The __user attribute is currently mainly used by sparse for type checking.
      The attribute indicates whether a memory access is in user memory address
      space or not. Such information is important during tracing kernel
      internal functions or data structures as accessing user memory often
      has different mechanisms compared to accessing kernel memory. For example,
      perf-probe needs an explicit command line specification to indicate that a
      particular argument or string is in user-space memory ([1], [2], [3]).
      Currently, vmlinux BTF is available in the kernel with many distributions.
      If __user attribute information is available in vmlinux BTF, this explicit
      user memory access information from users will no longer be necessary, as
      the kernel can figure it out by itself with vmlinux BTF.
      
      Besides the above possible use for perf/probe, another use case is
      the bpf verifier. Currently, for the BPF_PROG_TYPE_TRACING type of bpf
      programs, users can write direct code like
        p->m1->m2
      and "p" could be a function parameter. Without __user information in BTF,
      the verifier will assume p->m1 is accessing kernel memory and will generate
      normal loads. But suppose "p" is actually tagged with __user in the source
      code. In such cases, p->m1 is actually accessing user memory, a direct
      load is not right, and it may produce incorrect results. For such cases,
      bpf_probe_read_user() is the correct way to read p->m1.
      
      To support encoding __user information in BTF, a new attribute
        __attribute__((btf_type_tag("<arbitrary_string>")))
      is implemented in clang ([4]). For example, if we have
        #define __user __attribute__((btf_type_tag("user")))
      during kernel compilation, the "user" attribute information will
      be preserved in DWARF. After pahole converts DWARF to BTF, the __user
      information will be available in vmlinux BTF, and such information
      can be used by the bpf verifier, perf/probe, or other use cases.
      
      Currently btf_type_tag is only supported in clang (>= clang14) and
      pahole (>= 1.23). gcc support is also proposed and under development ([5]).
      
      In the rest of the patch set, Patch 1 added support for the __user
      btf_type_tag during compilation. Patch 2 added bpf verifier support to
      utilize __user tag information to reject bpf programs that do not use the
      proper helper to access user memory. Patches 3-5 are bpf selftests which
      demonstrate that the verifier can reject direct user memory accesses.
      
        [1] http://lkml.kernel.org/r/155789874562.26965.10836126971405890891.stgit@devnote2
        [2] http://lkml.kernel.org/r/155789872187.26965.4468456816590888687.stgit@devnote2
        [3] http://lkml.kernel.org/r/155789871009.26965.14167558859557329331.stgit@devnote2
        [4] https://reviews.llvm.org/D111199
        [5] https://lore.kernel.org/bpf/0cbeb2fb-1a18-f690-e360-24b1c90c2a91@fb.com/
      
      Changelog:
        v2 -> v3:
          - remove FLAG_DONTCARE enumerator and just use 0 as dontcare flag.
          - explain how btf type_tag is encoded in btf type chain.
        v1 -> v2:
          - use MEM_USER flag for PTR_TO_BTF_ID reg type instead of a separate
            field to encode __user tag.
          - add a test with kernel function __sys_getsockname which has __user tagged
            argument.
      ====================
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • Yonghong Song
      docs/bpf: clarify how btf_type_tag gets encoded in the type chain · b7290384
      Yonghong Song authored
      Clarify where the BTF_KIND_TYPE_TAG gets encoded in the type chain,
      so applications and the kernel can properly parse them.
      Signed-off-by: Yonghong Song <yhs@fb.com>
      Link: https://lore.kernel.org/r/20220127154627.665163-1-yhs@fb.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • Yonghong Song
      selftests/bpf: specify pahole version requirement for btf_tag test · 67ef7e1a
      Yonghong Song authored
      Specify pahole version requirement (1.23) for btf_tag subtests
      btf_type_tag_user_{mod1, mod2, vmlinux}.
      Signed-off-by: Yonghong Song <yhs@fb.com>
      Link: https://lore.kernel.org/r/20220127154622.663337-1-yhs@fb.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • Yonghong Song
      selftests/bpf: add a selftest with __user tag · 696c3901
      Yonghong Song authored
      Added a selftest with three __user usages: a __user pointer-type argument
      in bpf_testmod, a __user pointer-type struct member in bpf_testmod,
      and a __user pointer-type struct member in vmlinux. In all cases,
      directly accessing the user memory will result in verification failure.
      
        $ ./test_progs -v -n 22/3
        ...
        libbpf: prog 'test_user1': BPF program load failed: Permission denied
        libbpf: prog 'test_user1': -- BEGIN PROG LOAD LOG --
        R1 type=ctx expected=fp
        0: R1=ctx(id=0,off=0,imm=0) R10=fp0
        ; int BPF_PROG(test_user1, struct bpf_testmod_btf_type_tag_1 *arg)
        0: (79) r1 = *(u64 *)(r1 +0)
        func 'bpf_testmod_test_btf_type_tag_user_1' arg0 has btf_id 136561 type STRUCT 'bpf_testmod_btf_type_tag_1'
        1: R1_w=user_ptr_bpf_testmod_btf_type_tag_1(id=0,off=0,imm=0)
        ; g = arg->a;
        1: (61) r1 = *(u32 *)(r1 +0)
        R1 invalid mem access 'user_ptr_'
        ...
        #22/3 btf_tag/btf_type_tag_user_mod1:OK
      
        $ ./test_progs -v -n 22/4
        ...
        libbpf: prog 'test_user2': BPF program load failed: Permission denied
        libbpf: prog 'test_user2': -- BEGIN PROG LOAD LOG --
        R1 type=ctx expected=fp
        0: R1=ctx(id=0,off=0,imm=0) R10=fp0
        ; int BPF_PROG(test_user2, struct bpf_testmod_btf_type_tag_2 *arg)
        0: (79) r1 = *(u64 *)(r1 +0)
        func 'bpf_testmod_test_btf_type_tag_user_2' arg0 has btf_id 136563 type STRUCT 'bpf_testmod_btf_type_tag_2'
        1: R1_w=ptr_bpf_testmod_btf_type_tag_2(id=0,off=0,imm=0)
        ; g = arg->p->a;
        1: (79) r1 = *(u64 *)(r1 +0)          ; R1_w=user_ptr_bpf_testmod_btf_type_tag_1(id=0,off=0,imm=0)
        ; g = arg->p->a;
        2: (61) r1 = *(u32 *)(r1 +0)
        R1 invalid mem access 'user_ptr_'
        ...
        #22/4 btf_tag/btf_type_tag_user_mod2:OK
      
        $ ./test_progs -v -n 22/5
        ...
        libbpf: prog 'test_sys_getsockname': BPF program load failed: Permission denied
        libbpf: prog 'test_sys_getsockname': -- BEGIN PROG LOAD LOG --
        R1 type=ctx expected=fp
        0: R1=ctx(id=0,off=0,imm=0) R10=fp0
        ; int BPF_PROG(test_sys_getsockname, int fd, struct sockaddr *usockaddr,
        0: (79) r1 = *(u64 *)(r1 +8)
        func '__sys_getsockname' arg1 has btf_id 2319 type STRUCT 'sockaddr'
        1: R1_w=user_ptr_sockaddr(id=0,off=0,imm=0)
        ; g = usockaddr->sa_family;
        1: (69) r1 = *(u16 *)(r1 +0)
        R1 invalid mem access 'user_ptr_'
        ...
        #22/5 btf_tag/btf_type_tag_user_vmlinux:OK
      Signed-off-by: Yonghong Song <yhs@fb.com>
      Link: https://lore.kernel.org/r/20220127154616.659314-1-yhs@fb.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • Yonghong Song
      selftests/bpf: rename btf_decl_tag.c to test_btf_decl_tag.c · 571d01a9
      Yonghong Song authored
      The uapi btf.h contains the following declaration:
        struct btf_decl_tag {
             __s32   component_idx;
        };
      
      The skeleton will also generate a struct with name
      "btf_decl_tag" for bpf program btf_decl_tag.c.
      
      Rename btf_decl_tag.c to test_btf_decl_tag.c so
      the corresponding skeleton struct name becomes
      "test_btf_decl_tag". This way, we can include
      uapi btf.h in prog_tests/btf_tag.c.
      There is no functionality change in this patch.
      Signed-off-by: Yonghong Song <yhs@fb.com>
      Link: https://lore.kernel.org/r/20220127154611.656699-1-yhs@fb.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • Yonghong Song
      bpf: reject program if a __user tagged memory accessed in kernel way · c6f1bfe8
      Yonghong Song authored
      The BPF verifier supports direct memory access for the BPF_PROG_TYPE_TRACING
      type of bpf programs, e.g., a->b. If "a" is a pointer
      pointing to kernel memory, the bpf verifier will allow the user to write
      code in C like a->b and will translate it to a kernel
      load properly. If "a" is a pointer to user memory, the bpf
      developer is expected to use the bpf_probe_read_user() helper to
      get the value a->b. Without utilizing BTF __user tagging information,
      the current verifier will assume that a->b is a kernel memory access
      and this may generate incorrect results.
      
      Now that BTF contains __user information, the verifier can check whether
      a pointer points to user memory or not. If it does, the verifier
      can reject the program and force users to use the bpf_probe_read_user()
      helper explicitly.
      
      In the future, we can easily extend btf_add_space for other
      address space tagging, for example, rcu/percpu etc.
      Signed-off-by: Yonghong Song <yhs@fb.com>
      Link: https://lore.kernel.org/r/20220127154606.654961-1-yhs@fb.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • Yonghong Song
      compiler_types: define __user as __attribute__((btf_type_tag("user"))) · 7472d5a6
      Yonghong Song authored
      The __user attribute is currently mainly used by sparse for type checking.
      The attribute indicates whether a memory access is in user memory address
      space or not. Such information is important during tracing kernel
      internal functions or data structures as accessing user memory often
      has different mechanisms compared to accessing kernel memory. For example,
      perf-probe needs an explicit command line specification to indicate that a
      particular argument or string is in user-space memory ([1], [2], [3]).
      Currently, vmlinux BTF is available in the kernel with many distributions.
      If __user attribute information is available in vmlinux BTF, this explicit
      user memory access information from users will no longer be necessary, as
      the kernel can figure it out by itself with vmlinux BTF.
      
      Besides the above possible use for perf/probe, another use case is
      the bpf verifier. Currently, for the BPF_PROG_TYPE_TRACING type of bpf
      programs, users can write direct code like
        p->m1->m2
      and "p" could be a function parameter. Without __user information in BTF,
      the verifier will assume p->m1 is accessing kernel memory and will generate
      normal loads. But suppose "p" is actually tagged with __user in the source
      code. In such cases, p->m1 is actually accessing user memory, a direct
      load is not right, and it may produce incorrect results. For such cases,
      bpf_probe_read_user() is the correct way to read p->m1.
      
      To support encoding __user information in BTF, a new attribute
        __attribute__((btf_type_tag("<arbitrary_string>")))
      is implemented in clang ([4]). For example, if we have
        #define __user __attribute__((btf_type_tag("user")))
      during kernel compilation, the "user" attribute information will
      be preserved in DWARF. After pahole converts DWARF to BTF, the __user
      information will be available in vmlinux BTF.
      
      The following is an example with latest upstream clang (clang14) and
      pahole 1.23:
      
        [$ ~] cat test.c
        #define __user __attribute__((btf_type_tag("user")))
        int foo(int __user *arg) {
                return *arg;
        }
        [$ ~] clang -O2 -g -c test.c
        [$ ~] pahole -JV test.o
        ...
        [1] INT int size=4 nr_bits=32 encoding=SIGNED
        [2] TYPE_TAG user type_id=1
        [3] PTR (anon) type_id=2
        [4] FUNC_PROTO (anon) return=1 args=(3 arg)
        [5] FUNC foo type_id=4
        [$ ~]
      
      You can see for the function argument "int __user *arg", its type is
      described as
        PTR -> TYPE_TAG(user) -> INT
      The kernel can use this information for bpf verification or other
      use cases.
      
      Currently, btf_type_tag is only supported in clang (>= clang14) and
      pahole (>= 1.23). gcc support is also proposed and under development ([5]).
      
        [1] http://lkml.kernel.org/r/155789874562.26965.10836126971405890891.stgit@devnote2
        [2] http://lkml.kernel.org/r/155789872187.26965.4468456816590888687.stgit@devnote2
        [3] http://lkml.kernel.org/r/155789871009.26965.14167558859557329331.stgit@devnote2
        [4] https://reviews.llvm.org/D111199
        [5] https://lore.kernel.org/bpf/0cbeb2fb-1a18-f690-e360-24b1c90c2a91@fb.com/
      Signed-off-by: Yonghong Song <yhs@fb.com>
      Link: https://lore.kernel.org/r/20220127154600.652613-1-yhs@fb.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • Pavel Begunkov
      cgroup/bpf: fast path skb BPF filtering · 46531a30
      Pavel Begunkov authored
      Even though there is a static key protecting from the overhead of
      cgroup-bpf skb filtering when there is nothing attached, in many cases
      it's not enough, as registering a filter for one type will ruin the fast
      path for all others. This is observed in production servers I've looked
      at, but also in laptops, where registration is done during init by
      systemd or something else.
      
      Add a per-socket fast path check guarding against such overhead. This
      affects both receive and transmit paths of TCP, UDP and other
      protocols. It showed a ~1% tx/s improvement in small-payload UDP
      send benchmarks using a real NIC in a server environment, and the
      number jumps to 2-3% for preemptible kernels.
      Reviewed-by: Stanislav Fomichev <sdf@google.com>
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Link: https://lore.kernel.org/r/d8c58857113185a764927a46f4b5a058d36d3ec3.1643292455.git.asml.silence@gmail.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • Yonghong Song
      selftests/bpf: fix a clang compilation error · cdb5ed97
      Yonghong Song authored
      When building selftests/bpf with clang
        make -j LLVM=1
        make -C tools/testing/selftests/bpf -j LLVM=1
      I hit the following compilation error:
      
        trace_helpers.c:152:9: error: variable 'found' is used uninitialized whenever 'while' loop exits because its condition is false [-Werror,-Wsometimes-uninitialized]
                while (fscanf(f, "%zx-%zx %s %zx %*[^\n]\n", &start, &end, buf, &base) == 4) {
                       ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
        trace_helpers.c:161:7: note: uninitialized use occurs here
                if (!found)
                     ^~~~~
        trace_helpers.c:152:9: note: remove the condition if it is always true
                while (fscanf(f, "%zx-%zx %s %zx %*[^\n]\n", &start, &end, buf, &base) == 4) {
                       ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                       1
        trace_helpers.c:145:12: note: initialize the variable 'found' to silence this warning
                bool found;
                          ^
                           = false
      
      It is possible that for a sane /proc/self/maps we may never hit the above
      issue in practice. But let us initialize the variable 'found' properly to
      silence the compilation error.
      Signed-off-by: Yonghong Song <yhs@fb.com>
      Link: https://lore.kernel.org/r/20220127163726.1442032-1-yhs@fb.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • Magnus Karlsson
      selftests, xsk: Fix bpf_res cleanup test · 3b22523b
      Magnus Karlsson authored
      After commit 710ad98c ("veth: Do not record rx queue hint in veth_xmit"),
      veth no longer receives traffic on the same queue as it was sent on. This
      breaks the bpf_res test for the AF_XDP selftests as the socket tied to
      queue 1 will not receive traffic anymore.
      
      Modify the test so that two sockets are tied to queue id 0 using a shared
      umem instead. When the first socket is killed, enter the second socket into
      the xskmap so that traffic will flow to it. This still tests that the
      resources are not cleaned up until after the second socket dies, without
      having to rely on veth supporting rx_queue hints.
      Reported-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
      Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
      Acked-by: John Fastabend <john.fastabend@gmail.com>
      Link: https://lore.kernel.org/bpf/20220125082945.26179-1-magnus.karlsson@gmail.com
    • Daniel Borkmann
      Merge branch 'xsk-batching' · 33372bc2
      Daniel Borkmann authored
      Maciej Fijalkowski says:
      
      ====================
      Unfortunately, similar scalability issues that were addressed for XDP
      processing in ice, exist for XDP in the zero-copy driver used by AF_XDP.
      Let's resolve them in mostly the same way as we did in [0] and utilize
      the Tx batching API from XSK buffer pool.
      
      Move the array of Tx descriptors that is used with the batching approach to
      the XSK buffer pool. This means that future users of this API will not
      have to carry the array on their own side; they can simply refer to the
      pool's tx_desc array.
      
      We also improve the Rx side, where we extend ice_alloc_rx_buf_zc() to
      handle the ring wrap and bump the Rx tail more frequently. By doing so,
      the Rx side is adjusted to the Tx side, which was needed for the l2fwd
      scenario.
      
      Here are the improvements of performance numbers that this set brings
      measured with xdpsock app in busy poll mode for 1 and 2 core modes.
      Both Tx and Rx rings were sized to 1k length and busy poll budget was
      256.
      
      ----------------------------------------------------------------
           |      txonly       |      l2fwd      |      rxdrop
      ----------------------------------------------------------------
      1C   |       149%        |       14%       |        3%
      ----------------------------------------------------------------
      2C   |       134%        |       20%       |        5%
      ----------------------------------------------------------------
      
      Next step will be to introduce batching onto Rx side.
      
      v5:
      * collect acks
      * fix typos
      * correct comments showing cache line boundaries in ice_tx_ring struct
      v4 - address Alexandr's review:
      * new patch (2) for making sure ring size is pow(2) when attaching
        xsk socket
      * don't open code ALIGN_DOWN (patch 3)
      * resign from storing tx_thresh in ice_tx_ring (patch 4)
      * scope variables in a better way for Tx batching (patch 7)
      v3:
      * drop likely() that was wrapping napi_complete_done (patch 1)
      * introduce configurable Tx threshold (patch 2)
      * handle ring wrap on Rx side when allocating buffers (patch 3)
      * respect NAPI budget when cleaning Tx descriptors in ZC (patch 6)
      v2:
      * introduce new patch that resets @next_dd and @next_rs fields
      * use batching API for AF_XDP Tx on ice side
      
        [0]: https://lore.kernel.org/bpf/20211015162908.145341-8-anthony.l.nguyen@intel.com/
      ====================
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • Maciej Fijalkowski
      ice: xsk: Borrow xdp_tx_active logic from i40e · 59e92bfe
      Maciej Fijalkowski authored
      One of the things that commit 5574ff7b ("i40e: optimize AF_XDP Tx
      completion path") introduced was the @xdp_tx_active field. Its usage
      from i40e can be adjusted to ice driver and give us positive performance
      results.
      
      If the descriptor that @next_dd points to has been sent by HW (its DD
      bit is set), then we are sure that at least a quarter of the ring is ready
      to be cleaned. If @xdp_tx_active is 0, which means that the related
      xdp_ring is not used for XDP_{TX, REDIRECT} workloads, then we know how
      many XSK entries should be placed in the completion queue; IOW, walking
      through the ring can be skipped.
      Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Reviewed-by: Alexander Lobakin <alexandr.lobakin@intel.com>
      Acked-by: Magnus Karlsson <magnus.karlsson@intel.com>
      Link: https://lore.kernel.org/bpf/20220125160446.78976-9-maciej.fijalkowski@intel.com
    • Maciej Fijalkowski
      ice: xsk: Improve AF_XDP ZC Tx and use batching API · 126cdfe1
      Maciej Fijalkowski authored
      Apply the logic that was done for regular XDP in commit 9610bd98
      ("ice: optimize XDP_TX workloads") to the ZC side of the driver. On top
      of that, introduce batching to Tx that is inspired by i40e's
      implementation, with adjustments to the cleaning logic - take into
      account the NAPI budget in ice_clean_xdp_irq_zc().
      
      Placing the stats structs on separate cache lines also seemed to improve
      the performance.
      Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Reviewed-by: Alexander Lobakin <alexandr.lobakin@intel.com>
      Link: https://lore.kernel.org/bpf/20220125160446.78976-8-maciej.fijalkowski@intel.com