Commits · 7effe3fdc049a34c56a68671100b5570e53e8f0a · Kirill Smelkov / linux

03 Apr, 2024 4 commits

tools: Add ethtool.h header to tooling infra · 7effe3fd

Tushar Vyavahare authored Apr 02, 2024

This commit duplicates the ethtool.h file from the include/uapi/linux
directory in the kernel source to the tools/include/uapi/linux directory.

This action ensures that the ethtool.h file used in the tools directory
is in sync with the kernel's version, maintaining consistency across the
codebase.

There are some checkpatch warnings in this file that could be cleaned up,
but I preferred to move it over as-is for now to avoid disrupting the code.
Signed-off-by: Tushar Vyavahare <tushar.vyavahare@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Magnus Karlsson <magnus.karlsson@intel.com>
Link: https://lore.kernel.org/bpf/20240402114529.545475-2-tushar.vyavahare@intel.com

7effe3fd

Merge branch 'bpf-arm64-add-support-for-bpf-arena' · 49b73fa6

Alexei Starovoitov authored Apr 02, 2024

Puranjay Mohan says:

====================
bpf,arm64: Add support for BPF Arena

Changes in V4
V3: https://lore.kernel.org/bpf/20240323103057.26499-1-puranjay12@gmail.com/
- Use more descriptive variable names.
- Use insn_is_cast_user() helper.

Changes in V3
V2: https://lore.kernel.org/bpf/20240321153102.103832-1-puranjay12@gmail.com/
- Optimize bpf_addr_space_cast as suggested by Xu Kuohai

Changes in V2
V1: https://lore.kernel.org/bpf/20240314150003.123020-1-puranjay12@gmail.com/
- Fix build warnings by using 5 in place of 32 as DONT_CLEAR marker.
  R5 is not mapped to any BPF register so it can safely be used here.

This series adds the support for PROBE_MEM32 and bpf_addr_space_cast
instructions to the ARM64 BPF JIT. These two instructions allow the
enablement of BPF Arena.

All arena related selftests are passing.

  [root@ip-172-31-6-62 bpf]# ./test_progs -a "*arena*"
  #3/1     arena_htab/arena_htab_llvm:OK
  #3/2     arena_htab/arena_htab_asm:OK
  #3       arena_htab:OK
  #4/1     arena_list/arena_list_1:OK
  #4/2     arena_list/arena_list_1000:OK
  #4       arena_list:OK
  #434/1   verifier_arena/basic_alloc1:OK
  #434/2   verifier_arena/basic_alloc2:OK
  #434/3   verifier_arena/basic_alloc3:OK
  #434/4   verifier_arena/iter_maps1:OK
  #434/5   verifier_arena/iter_maps2:OK
  #434/6   verifier_arena/iter_maps3:OK
  #434     verifier_arena:OK
  Summary: 3/10 PASSED, 0 SKIPPED, 0 FAILED

This will need the patch [1] that introduced insn_is_cast_user() helper to
build.

The verifier_arena selftest could fail in the CI because the following
commit[2] is missing from bpf-next:

[1] https://lore.kernel.org/bpf/20240324183226.29674-1-puranjay12@gmail.com/
[2] https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git/commit/?id=fa3550dca8f02ec312727653a94115ef3ab68445

Here is a CI run with all dependencies added: https://github.com/kernel-patches/bpf/pull/6641
====================

Link: https://lore.kernel.org/r/20240325150716.4387-1-puranjay12@gmail.comSigned-off-by: Alexei Starovoitov <ast@kernel.org>

49b73fa6

bpf: Add arm64 JIT support for bpf_addr_space_cast instruction. · 4dd31243

Puranjay Mohan authored Mar 25, 2024

LLVM generates bpf_addr_space_cast instruction while translating
pointers between native (zero) address space and
__attribute__((address_space(N))). The addr_space=0 is reserved as
bpf_arena address space.

rY = addr_space_cast(rX, 0, 1) is processed by the verifier and
converted to normal 32-bit move: wX = wY.

rY = addr_space_cast(rX, 1, 0) : used to convert a bpf arena pointer to
a pointer in the userspace vma. This has to be converted by the JIT.
Signed-off-by: Puranjay Mohan <puranjay12@gmail.com>
Link: https://lore.kernel.org/r/20240325150716.4387-3-puranjay12@gmail.comSigned-off-by: Alexei Starovoitov <ast@kernel.org>

4dd31243

bpf: Add arm64 JIT support for PROBE_MEM32 pseudo instructions. · 339af577

Puranjay Mohan authored Mar 25, 2024

Add support for [LDX | STX | ST], PROBE_MEM32, [B | H | W | DW]
instructions.  They are similar to PROBE_MEM instructions with the
following differences:
- PROBE_MEM32 supports store.
- PROBE_MEM32 relies on the verifier to clear upper 32-bit of the
  src/dst register
- PROBE_MEM32 adds 64-bit kern_vm_start address (which is stored in R28
  in the prologue). Due to bpf_arena constructions such R28 + reg +
  off16 access is guaranteed to be within arena virtual range, so no
  address check at run-time.
- PROBE_MEM32 allows STX and ST. If they fault the store is a nop. When
  LDX faults the destination register is zeroed.

To support these on arm64, we do tmp2 = R28 + src/dst reg and then use
tmp2 as the new src/dst register. This allows us to reuse most of the
code for normal [LDX | STX | ST].
Signed-off-by: Puranjay Mohan <puranjay12@gmail.com>
Link: https://lore.kernel.org/r/20240325150716.4387-2-puranjay12@gmail.comSigned-off-by: Alexei Starovoitov <ast@kernel.org>

339af577

02 Apr, 2024 11 commits

selftests/bpf: Add pid limit for mptcpify prog · c07b4bcd

Geliang Tang authored Apr 02, 2024

In order to prevent mptcpify prog from affecting the running results
of other BPF tests, a pid limit was added to restrict it from only
modifying its own program.
Suggested-by: Martin KaFai Lau <martin.lau@kernel.org>
Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/8987e2938e15e8ec390b85b5dcbee704751359dc.1712054986.git.tanggeliang@kylinos.cn

c07b4bcd

libbpf: Use local bpf_helpers.h include · 15ea39ad

Tobias Böhm authored Apr 02, 2024

Commit 20d59ee5 ("libbpf: add bpf_core_cast() macro") added a
bpf_helpers include in bpf_core_read.h as a system include. Usually, the
includes are local, though, like in bpf_tracing.h. This commit adjusts
the include to be local as well.
Signed-off-by: Tobias Böhm <tobias@aibor.de>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/q5d5bgc6vty2fmaazd5e73efd6f5bhiru2le6fxn43vkw45bls@fhlw2s5ootdb

15ea39ad

bpf: Improve program stats run-time calculation · ce09cbdd

Jose Fernandez authored Apr 01, 2024

This patch improves the run-time calculation for program stats by
capturing the duration as soon as possible after the program returns.

Previously, the duration included u64_stats_t operations. While the
instrumentation overhead is part of the total time spent when stats are
enabled, distinguishing between the program's native execution time and
the time spent due to instrumentation is crucial for accurate
performance analysis.

By making this change, the patch facilitates more precise optimization
of BPF programs, enabling users to understand their performance in
environments without stats enabled.

I used a virtualized environment to measure the run-time over one minute
for a basic raw_tracepoint/sys_enter program, which just increments a
local counter. Although the virtualization introduced some performance
degradation that could affect the results, I observed approximately a
16% decrease in average run-time reported by stats with this change
(310 -> 260 nsec).
Signed-off-by: Jose Fernandez <josef@netflix.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20240402034010.25060-1-josef@netflix.com

ce09cbdd

selftests/bpf: Skip test when perf_event_open returns EOPNOTSUPP · c186ed12

Pu Lehui authored Apr 02, 2024

When testing send_signal and stacktrace_build_id_nmi using the riscv sbi
pmu driver without the sscofpmf extension or the riscv legacy pmu driver,
then failures as follows are encountered:

    test_send_signal_common:FAIL:perf_event_open unexpected perf_event_open: actual -1 < expected 0
    #272/3   send_signal/send_signal_nmi:FAIL

    test_stacktrace_build_id_nmi:FAIL:perf_event_open err -1 errno 95
    #304     stacktrace_build_id_nmi:FAIL

The reason is that the above pmu driver or hardware does not support
sampling events, that is, PERF_PMU_CAP_NO_INTERRUPT is set to pmu
capabilities, and then perf_event_open returns EOPNOTSUPP. Since
PERF_PMU_CAP_NO_INTERRUPT is not only set in the riscv-related pmu driver,
it is better to skip testing when this capability is set.
Signed-off-by: Pu Lehui <pulehui@huawei.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20240402073029.1299085-1-pulehui@huaweicloud.com

c186ed12

bpftool: Use __typeof__() instead of typeof() in BPF skeleton · 2a24e248

Andrii Nakryiko authored Apr 01, 2024

When generated BPF skeleton header is included in C++ code base, some
compiler setups will emit warning about using language extensions due to
typeof() usage, resulting in something like:

  error: extension used [-Werror,-Wlanguage-extension-token]
  obj->struct_ops.empty_tcp_ca = (typeof(obj->struct_ops.empty_tcp_ca))
                                  ^

It looks like __typeof__() is a preferred way to do typeof() with better
C++ compatibility behavior, so switch to that. With __typeof__() we get
no such warning.

Fixes: c2a0257c ("bpftool: Cast pointers for shadow types explicitly.")
Fixes: 00389c58 ("bpftool: Add support for subskeletons")
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Kui-Feng Lee <thinker.li@gmail.com>
Acked-by: Quentin Monnet <qmo@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20240401170713.2081368-1-andrii@kernel.org

2a24e248

selftests/bpf: Using llvm may_goto inline asm for cond_break macro · 965c6167

Yonghong Song authored Apr 01, 2024

Currently, cond_break macro uses bytes to encode the may_goto insn.
Patch [1] in llvm implemented may_goto insn in BPF backend.
Replace byte-level encoding with llvm inline asm for better usability.
Using llvm may_goto insn is controlled by macro __BPF_FEATURE_MAY_GOTO.

  [1] https://github.com/llvm/llvm-project/commit/0e0bfacff71859d1f9212205f8f873d47029d3fbSigned-off-by: Yonghong Song <yonghong.song@linux.dev>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20240402025446.3215182-1-yonghong.song@linux.dev

965c6167

bpf: Add a verbose message if map limit is reached · 9dc182c5

Anton Protopopov authored Apr 02, 2024

When more than 64 maps are used by a program and its subprograms the
verifier returns -E2BIG. Add a verbose message which highlights the
source of the error and also print the actual limit.
Signed-off-by: Anton Protopopov <aspsk@isovalent.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20240402073347.195920-1-aspsk@isovalent.com

9dc182c5

bpf: Fix typo in uapi doc comments · ca4ddc26

David Lechner authored Mar 29, 2024

In a few places in the bpf uapi headers, EOPNOTSUPP is missing a "P" in
the doc comments. This adds the missing "P".
Signed-off-by: David Lechner <dlechner@baylibre.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20240329152900.398260-2-dlechner@baylibre.com

ca4ddc26

bpftool: Clean-up typos, punctuation, list formatting in docs · a70f5d84

Rameez Rehman authored Mar 31, 2024

Improve the formatting of the attach flags for cgroup programs in the
relevant man page, and fix typos ("can be on of", "an userspace inet
socket") when introducing that list. Also fix a couple of other trivial
issues in docs.

[ Quentin: Fixed trival issues in bpftool-gen.rst and bpftool-iter.rst ]
Signed-off-by: Rameez Rehman <rameezrehman408@hotmail.com>
Signed-off-by: Quentin Monnet <qmo@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20240331200346.29118-4-qmo@kernel.org

a70f5d84

bpftool: Remove useless emphasis on command description in man pages · ea379b3c

Rameez Rehman authored Mar 31, 2024

As it turns out, the terms in definition lists in the rST file are
already rendered with bold-ish formatting when generating the man pages;
all double-star sequences we have in the commands for the command
description are unnecessary, and can be removed to make the
documentation easier to read.

The rST files were automatically processed with:

    sed -i '/DESCRIPTION/,/OPTIONS/ { /^\*/ s/\*\*//g }' b*.rst
Signed-off-by: Rameez Rehman <rameezrehman408@hotmail.com>
Signed-off-by: Quentin Monnet <qmo@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20240331200346.29118-3-qmo@kernel.org

ea379b3c

bpftool: Use simpler indentation in source rST for documentation · f7b68543

Rameez Rehman authored Mar 31, 2024

The rST manual pages for bpftool would use a mix of tabs and spaces for
indentation. While this is the norm in C code, this is rather unusual
for rST documents, and over time we've seen many contributors use a
wrong level of indentation for documentation update.

Let's fix bpftool's indentation in docs once and for all:

- Let's use spaces, that are more common in rST files.
- Remove one level of indentation for the synopsis, the command
  description, and the "see also" section. As a result, all sections
  start with the same indentation level in the generated man page.
- Rewrap the paragraphs after the changes.

There is no content change in this patch, only indentation and
rewrapping changes. The wrapping in the generated source files for the
manual pages is changed, but the pages displayed with "man" remain the
same, apart from the adjusted indentation level on relevant sections.

[ Quentin: rebased on bpf-next, removed indent level for command
  description and options, updated synopsis, command summary, and "see
  also" sections. ]
Signed-off-by: Rameez Rehman <rameezrehman408@hotmail.com>
Signed-off-by: Quentin Monnet <qmo@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20240331200346.29118-2-qmo@kernel.org

f7b68543

30 Mar, 2024 1 commit

selftests/bpf: make multi-uprobe tests work in RELEASE=1 mode · 623bdd58

Andrii Nakryiko authored Mar 29, 2024

When BPF selftests are built in RELEASE=1 mode with -O2 optimization
level, uprobe_multi binary, called from multi-uprobe tests is optimized
to the point that all the thousands of target uprobe_multi_func_XXX
functions are eliminated, breaking tests.

So ensure they are preserved by using weak attribute.

But, actually, compiling uprobe_multi binary with -O2 takes a really
long time, and is quite useless (it's not a benchmark). So in addition
to ensuring that uprobe_multi_func_XXX functions are preserved, opt-out
of -O2 explicitly in Makefile and stick to -O0. This saves a lot of
compilation time.

With -O2, just recompiling uprobe_multi:

  $ touch uprobe_multi.c
  $ time make RELEASE=1 -j90
  make RELEASE=1 -j90  291.66s user 2.54s system 99% cpu 4:55.52 total

With -O0:
  $ touch uprobe_multi.c
  $ time make RELEASE=1 -j90
  make RELEASE=1 -j90  22.40s user 1.91s system 99% cpu 24.355 total

5 minutes vs (still slow, but...) 24 seconds.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20240329190410.4191353-1-andrii@kernel.orgSigned-off-by: Alexei Starovoitov <ast@kernel.org>

623bdd58

29 Mar, 2024 24 commits

bpf: Avoid kfree_rcu() under lock in bpf_lpm_trie. · 59f2f841

Alexei Starovoitov authored Mar 29, 2024

syzbot reported the following lock sequence:
cpu 2:
  grabs timer_base lock
    spins on bpf_lpm lock

cpu 1:
  grab rcu krcp lock
    spins on timer_base lock

cpu 0:
  grab bpf_lpm lock
    spins on rcu krcp lock

bpf_lpm lock can be the same.
timer_base lock can also be the same due to timer migration.
but rcu krcp lock is always per-cpu, so it cannot be the same lock.
Hence it's a false positive.
To avoid lockdep complaining move kfree_rcu() after spin_unlock.

Reported-by: syzbot+1fa663a2100308ab6eab@syzkaller.appspotmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20240329171439.37813-1-alexei.starovoitov@gmail.com

59f2f841

Merge branch 'Use start_server and connect_fd_to_fd' · 201874fc

Martin KaFai Lau authored Mar 28, 2024

Geliang Tang says:

====================
Simplify bpf_tcp_ca test by using connect_fd_to_fd and start_server
helpers.

v4:
 - Matt reminded me that I shouldn't send a square-to patch to BPF (thanks),
   so I update them into two patches in v4.

v3:
 - split v2 as two patches as Daniel suggested.
 - The patch "selftests/bpf: Use start_server in bpf_tcp_ca" is merged
   by Daniel (thanks), but I forgot to drop 'settimeo(lfd, 0)' in it, so
   I send a squash-to patch to fix this.
====================
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>

201874fc

selftests/bpf: Drop settimeo in do_test · 42667092

Geliang Tang authored Mar 26, 2024

settimeo is invoked in start_server() and in connect_fd_to_fd() already,
no need to invoke settimeo(lfd, 0) and settimeo(fd, 0) in do_test()
anymore. This patch drops them.
Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn>
Link: https://lore.kernel.org/r/dbc3613bee3b1c78f95ac9ff468bf47c92f106ea.1711447102.git.tanggeliang@kylinos.cnSigned-off-by: Martin KaFai Lau <martin.lau@kernel.org>

42667092

selftests/bpf: Use connect_fd_to_fd in bpf_tcp_ca · e5e1a3aa

Geliang Tang authored Mar 26, 2024

To simplify the code, use BPF selftests helper connect_fd_to_fd() in
bpf_tcp_ca.c instead of open-coding it. This helper is defined in
network_helpers.c, and exported in network_helpers.h, which is already
included in bpf_tcp_ca.c.
Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn>
Link: https://lore.kernel.org/r/e105d1f225c643bee838409378dd90fd9aabb6dc.1711447102.git.tanggeliang@kylinos.cnSigned-off-by: Martin KaFai Lau <martin.lau@kernel.org>

e5e1a3aa

bpf: Mark bpf prog stack with kmsan_unposion_memory in interpreter mode · e8742081

Martin KaFai Lau authored Mar 28, 2024

syzbot reported uninit memory usages during map_{lookup,delete}_elem.

==========
BUG: KMSAN: uninit-value in __dev_map_lookup_elem kernel/bpf/devmap.c:441 [inline]
BUG: KMSAN: uninit-value in dev_map_lookup_elem+0xf3/0x170 kernel/bpf/devmap.c:796
__dev_map_lookup_elem kernel/bpf/devmap.c:441 [inline]
dev_map_lookup_elem+0xf3/0x170 kernel/bpf/devmap.c:796
____bpf_map_lookup_elem kernel/bpf/helpers.c:42 [inline]
bpf_map_lookup_elem+0x5c/0x80 kernel/bpf/helpers.c:38
___bpf_prog_run+0x13fe/0xe0f0 kernel/bpf/core.c:1997
__bpf_prog_run256+0xb5/0xe0 kernel/bpf/core.c:2237
==========

The reproducer should be in the interpreter mode.

The C reproducer is trying to run the following bpf prog:

    0: (18) r0 = 0x0
    2: (18) r1 = map[id:49]
    4: (b7) r8 = 16777216
    5: (7b) *(u64 *)(r10 -8) = r8
    6: (bf) r2 = r10
    7: (07) r2 += -229
            ^^^^^^^^^^

    8: (b7) r3 = 8
    9: (b7) r4 = 0
   10: (85) call dev_map_lookup_elem#1543472
   11: (95) exit

It is due to the "void *key" (r2) passed to the helper. bpf allows uninit
stack memory access for bpf prog with the right privileges. This patch
uses kmsan_unpoison_memory() to mark the stack as initialized.

This should address different syzbot reports on the uninit "void *key"
argument during map_{lookup,delete}_elem.

Reported-by: syzbot+603bcd9b0bf1d94dbb9b@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/bpf/000000000000f9ce6d061494e694@google.com/
Reported-by: syzbot+eb02dc7f03dce0ef39f3@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/bpf/000000000000a5c69c06147c2238@google.com/
Reported-by: syzbot+b4e65ca24fd4d0c734c3@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/bpf/000000000000ac56fb06143b6cfa@google.com/
Reported-by: syzbot+d2b113dc9fea5e1d2848@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/bpf/0000000000000d69b206142d1ff7@google.com/
Reported-by: syzbot+1a3cf6f08d68868f9db3@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/bpf/0000000000006f876b061478e878@google.com/
Tested-by: syzbot+1a3cf6f08d68868f9db3@syzkaller.appspotmail.com
Suggested-by: Yonghong Song <yonghong.song@linux.dev>
Suggested-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20240328185801.1843078-1-martin.lau@linux.devSigned-off-by: Alexei Starovoitov <ast@kernel.org>

e8742081

Merge branch 'bpf-fix-a-couple-of-test-failures-with-lto-kernel' · e478cf26

Alexei Starovoitov authored Mar 27, 2024

Yonghong Song says:

====================
bpf: Fix a couple of test failures with LTO kernel

With a LTO kernel built with clang, with one of earlier version of kernel,
I encountered two test failures, ksyms and kprobe_multi_bench_attach/kernel.
Now with latest bpf-next, only kprobe_multi_bench_attach/kernel failed.
But it is possible in the future ksyms selftest may fail again.

Both test failures are due to static variable/function renaming
due to cross-file inlining. For Ksyms failure, the solution is
to strip .llvm.<hash> suffixes for symbols in /proc/kallsyms before
comparing against the ksym in bpf program.
For kprobe_multi_bench_attach/kernel failure, the solution is
to either provide names in /proc/kallsyms to the kernel or
ignore those names who have .llvm.<hash> suffix since the kernel
sym name comparison is against /proc/kallsyms.

Please see each individual patches for details.

Changelogs:
  v2 -> v3:
    - no need to check config file, directly so strstr with '.llvm.'.
    - for kprobe_multi_bench with syms, instead of skipping the syms,
      consult /proc/kallyms to find corresponding names.
    - add a test with populating addrs to the kernel for kprobe
      multi attach.
  v1 -> v2:
    - Let libbpf handle .llvm.<hash suffixes since it may impact
      bpf program ksym.
====================

Link: https://lore.kernel.org/r/20240326041443.1197498-1-yonghong.song@linux.devSigned-off-by: Alexei Starovoitov <ast@kernel.org>

e478cf26

selftests/bpf: Add a kprobe_multi subtest to use addrs instead of syms · 6302bdeb

Yonghong Song authored Mar 25, 2024

Get addrs directly from available_filter_functions_addrs and
send to the kernel during kprobe_multi_attach. This avoids
consultation of /proc/kallsyms. But available_filter_functions_addrs
is introduced in 6.5, i.e., it is introduced recently,
so I skip the test if the kernel does not support it.
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20240326041523.1200301-1-yonghong.song@linux.devSigned-off-by: Alexei Starovoitov <ast@kernel.org>

6302bdeb

selftests/bpf: Fix kprobe_multi_bench_attach test failure with LTO kernel · 9edaafad

Yonghong Song authored Mar 25, 2024

In my locally build clang LTO kernel (enabling CONFIG_LTO and
CONFIG_LTO_CLANG_THIN), kprobe_multi_bench_attach/kernel subtest
failed like:
  test_kprobe_multi_bench_attach:PASS:get_syms 0 nsec
  test_kprobe_multi_bench_attach:PASS:kprobe_multi_empty__open_and_load 0 nsec
  libbpf: prog 'test_kprobe_empty': failed to attach: No such process
  test_kprobe_multi_bench_attach:FAIL:bpf_program__attach_kprobe_multi_opts unexpected error: -3
  #117/1   kprobe_multi_bench_attach/kernel:FAIL

There are multiple symbols in /sys/kernel/debug/tracing/available_filter_functions
are renamed in /proc/kallsyms due to cross file inlining. One example is for
  static function __access_remote_vm in mm/memory.c.
In a non-LTO kernel, we have the following call stack:
  ptrace_access_vm (global, kernel/ptrace.c)
    access_remote_vm (global, mm/memory.c)
      __access_remote_vm (static, mm/memory.c)

With LTO kernel, it is possible that access_remote_vm() is inlined by
ptrace_access_vm(). So we end up with the following call stack:
  ptrace_access_vm (global, kernel/ptrace.c)
    __access_remote_vm (static, mm/memory.c)
The compiler renames __access_remote_vm to __access_remote_vm.llvm.<hash>
to prevent potential name collision.

The kernel bpf_kprobe_multi_link_attach() and ftrace_lookup_symbols() try
to find addresses based on /proc/kallsyms, hence the current test failed
with LTO kenrel.

This patch consulted /proc/kallsyms to find the corresponding entries
for the ksym and this solved the issue.
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20240326041518.1199758-1-yonghong.song@linux.devSigned-off-by: Alexei Starovoitov <ast@kernel.org>

9edaafad

selftests/bpf: Add {load,search}_kallsyms_custom_local() · d1f02581

Yonghong Song authored Mar 25, 2024

These two functions allow selftests to do loading/searching
kallsyms based on their specific compare functions.
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20240326041513.1199440-1-yonghong.song@linux.devSigned-off-by: Alexei Starovoitov <ast@kernel.org>

d1f02581

selftests/bpf: Refactor trace helper func load_kallsyms_local() · 9475dacb

Yonghong Song authored Mar 25, 2024

Refactor trace helper function load_kallsyms_local() such that
it invokes a common function with a compare function as input.
The common function will be used later for other local functions.
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20240326041508.1199239-1-yonghong.song@linux.devSigned-off-by: Alexei Starovoitov <ast@kernel.org>

9475dacb

selftests/bpf: Refactor some functions for kprobe_multi_test · d1320649

Yonghong Song authored Mar 25, 2024

Refactor some functions in kprobe_multi_test.c to extract
some helper functions who will be used in later patches
to avoid code duplication.
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20240326041503.1198982-1-yonghong.song@linux.devSigned-off-by: Alexei Starovoitov <ast@kernel.org>

d1320649

libbpf: Handle <orig_name>.llvm.<hash> symbol properly · c56e5977

Yonghong Song authored Mar 25, 2024

With CONFIG_LTO_CLANG_THIN enabled, with some of previous
version of kernel code base ([1]), I hit the following
error:
   test_ksyms:PASS:kallsyms_fopen 0 nsec
   test_ksyms:FAIL:ksym_find symbol 'bpf_link_fops' not found
   #118     ksyms:FAIL

The reason is that 'bpf_link_fops' is renamed to
   bpf_link_fops.llvm.8325593422554671469
Due to cross-file inlining, the static variable 'bpf_link_fops'
in syscall.c is used by a function in another file. To avoid
potential duplicated names, the llvm added suffix
'.llvm.<hash>' ([2]) to 'bpf_link_fops' variable.
Such renaming caused a problem in libbpf if 'bpf_link_fops'
is used in bpf prog as a ksym but 'bpf_link_fops' does not
match any symbol in /proc/kallsyms.

To fix this issue, libbpf needs to understand that suffix '.llvm.<hash>'
is caused by clang lto kernel and to process such symbols properly.

With latest bpf-next code base built with CONFIG_LTO_CLANG_THIN,
I cannot reproduce the above failure any more. But such an issue
could happen with other symbols or in the future for bpf_link_fops symbol.

For example, with my current kernel, I got the following from
/proc/kallsyms:
  ffffffff84782154 d __func__.net_ratelimit.llvm.6135436931166841955
  ffffffff85f0a500 d tk_core.llvm.726630847145216431
  ffffffff85fdb960 d __fs_reclaim_map.llvm.10487989720912350772
  ffffffff864c7300 d fake_dst_ops.llvm.54750082607048300

I could not easily create a selftest to test newly-added
libbpf functionality with a static C test since I do not know
which symbol is cross-file inlined. But based on my particular kernel,
the following test change can run successfully.

>  diff --git a/tools/testing/selftests/bpf/prog_tests/ksyms.c b/tools/testing/selftests/bpf/prog_tests/ksyms.c
>  index 6a86d1f07800..904a103f7b1d 100644
>  --- a/tools/testing/selftests/bpf/prog_tests/ksyms.c
>  +++ b/tools/testing/selftests/bpf/prog_tests/ksyms.c
>  @@ -42,6 +42,7 @@ void test_ksyms(void)
>          ASSERT_EQ(data->out__bpf_link_fops, link_fops_addr, "bpf_link_fops");
>          ASSERT_EQ(data->out__bpf_link_fops1, 0, "bpf_link_fops1");
>          ASSERT_EQ(data->out__btf_size, btf_size, "btf_size");
>  +       ASSERT_NEQ(data->out__fake_dst_ops, 0, "fake_dst_ops");
>          ASSERT_EQ(data->out__per_cpu_start, per_cpu_start_addr, "__per_cpu_start");
>
>   cleanup:
>  diff --git a/tools/testing/selftests/bpf/progs/test_ksyms.c b/tools/testing/selftests/bpf/progs/test_ksyms.c
>  index 6c9cbb5a3bdf..fe91eef54b66 100644
>  --- a/tools/testing/selftests/bpf/progs/test_ksyms.c
>  +++ b/tools/testing/selftests/bpf/progs/test_ksyms.c
>  @@ -9,11 +9,13 @@ __u64 out__bpf_link_fops = -1;
>   __u64 out__bpf_link_fops1 = -1;
>   __u64 out__btf_size = -1;
>   __u64 out__per_cpu_start = -1;
>  +__u64 out__fake_dst_ops = -1;
>
>   extern const void bpf_link_fops __ksym;
>   extern const void __start_BTF __ksym;
>   extern const void __stop_BTF __ksym;
>   extern const void __per_cpu_start __ksym;
>  +extern const void fake_dst_ops __ksym;
>   /* non-existing symbol, weak, default to zero */
>   extern const void bpf_link_fops1 __ksym __weak;
>
>  @@ -23,6 +25,7 @@ int handler(const void *ctx)
>          out__bpf_link_fops = (__u64)&bpf_link_fops;
>          out__btf_size = (__u64)(&__stop_BTF - &__start_BTF);
>          out__per_cpu_start = (__u64)&__per_cpu_start;
>  +       out__fake_dst_ops = (__u64)&fake_dst_ops;
>
>          out__bpf_link_fops1 = (__u64)&bpf_link_fops1;

This patch fixed the issue in libbpf such that
the suffix '.llvm.<hash>' will be ignored during comparison of
bpf prog ksym vs. symbols in /proc/kallsyms, this resolved the issue.
Currently, only static variables in /proc/kallsyms are checked
with '.llvm.<hash>' suffix since in bpf programs function ksyms
with '.llvm.<hash>' suffix are most likely kfunc's and unlikely
to be cross-file inlined.

Note that currently kernel does not support gcc build with lto.

  [1] https://lore.kernel.org/bpf/20240302165017.1627295-1-yonghong.song@linux.dev/
  [2] https://github.com/llvm/llvm-project/blob/release/18.x/llvm/include/llvm/IR/ModuleSummaryIndex.h#L1714-L1719Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20240326041458.1198161-1-yonghong.song@linux.devSigned-off-by: Alexei Starovoitov <ast@kernel.org>

c56e5977

libbpf: Mark libbpf_kallsyms_parse static function · ad2b0528

Yonghong Song authored Mar 25, 2024

Currently libbpf_kallsyms_parse() function is declared as a global
function but actually it is not a API and there is no external
users in bpftool/bpf-selftests. So let us mark the function as
static.
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20240326041453.1197949-1-yonghong.song@linux.devSigned-off-by: Alexei Starovoitov <ast@kernel.org>

ad2b0528

selftests/bpf: Replace CHECK with ASSERT macros for ksyms test · cdfd9cc3

Yonghong Song authored Mar 25, 2024

Replace CHECK with ASSERT macros for ksyms tests.
This test failed earlier with clang lto kernel, but the
issue is gone with latest code base. But replacing
CHECK with ASSERT still improves code as ASSERT is
preferred in selftests.
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20240326041448.1197812-1-yonghong.song@linux.devSigned-off-by: Alexei Starovoitov <ast@kernel.org>

cdfd9cc3

selftests/bpf: Test loading bpf-tcp-cc prog calling the kernel tcp-cc kfuncs · 5da7fb04

Martin KaFai Lau authored Mar 22, 2024

This patch adds a test to ensure all static tcp-cc kfuncs is visible to
the struct_ops bpf programs. It is checked by successfully loading
the struct_ops programs calling these tcp-cc kfuncs.

This patch needs to enable the CONFIG_TCP_CONG_DCTCP and
the CONFIG_TCP_CONG_BBR.
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20240322191433.4133280-2-martin.lau@linux.devSigned-off-by: Alexei Starovoitov <ast@kernel.org>

5da7fb04

bpf: Remove CONFIG_X86 and CONFIG_DYNAMIC_FTRACE guard from the tcp-cc kfuncs · 42e4ebd3

Martin KaFai Lau authored Mar 22, 2024

The commit 7aae231a ("bpf: tcp: Limit calling some tcp cc functions to CONFIG_DYNAMIC_FTRACE")
added CONFIG_DYNAMIC_FTRACE guard because pahole was only generating
btf for ftrace-able functions. The ftrace filter had already been
removed from pahole, so the CONFIG_DYNAMIC_FTRACE guard can be
removed.

The commit 569c484f ("bpf: Limit static tcp-cc functions in the .BTF_ids list to x86")
has added CONFIG_X86 guard because it failed the powerpc arch which
prepended a "." to the local static function, so "cubictcp_init" becomes
".cubictcp_init". "__bpf_kfunc" has been added to kfunc
since then and it uses the __unused compiler attribute.
There is an existing
"__bpf_kfunc static u32 bpf_kfunc_call_test_static_unused_arg(u32 arg, u32 unused)"
test in bpf_testmod.c to cover the static kfunc case.

cross compile on ppc64 with CONFIG_DYNAMIC_FTRACE disabled:
> readelf -s vmlinux | grep cubictcp_
56938: c00000000144fd00   184 FUNC    LOCAL  DEFAULT    2 cubictcp_cwnd_event 	    [<localentry>: 8]
56939: c00000000144fdb8   200 FUNC    LOCAL  DEFAULT    2 cubictcp_recalc_[...]   [<localentry>: 8]
56940: c00000000144fe80   296 FUNC    LOCAL  DEFAULT    2 cubictcp_init 	    [<localentry>: 8]
56941: c00000000144ffa8   228 FUNC    LOCAL  DEFAULT    2 cubictcp_state 	    [<localentry>: 8]
56942: c00000000145008c  1908 FUNC    LOCAL  DEFAULT    2 cubictcp_cong_avoid  [<localentry>: 8]
56943: c000000001450800  1644 FUNC    LOCAL  DEFAULT    2 cubictcp_acked 	    [<localentry>: 8]

> bpftool btf dump file vmlinux | grep cubictcp_
[51540] FUNC 'cubictcp_acked' type_id=38137 linkage=static
[51541] FUNC 'cubictcp_cong_avoid' type_id=38122 linkage=static
[51543] FUNC 'cubictcp_cwnd_event' type_id=51542 linkage=static
[51544] FUNC 'cubictcp_init' type_id=9186 linkage=static
[51545] FUNC 'cubictcp_recalc_ssthresh' type_id=35021 linkage=static
[51547] FUNC 'cubictcp_state' type_id=38141 linkage=static

The patch removed both config guards.

Cc: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20240322191433.4133280-1-martin.lau@linux.devSigned-off-by: Alexei Starovoitov <ast@kernel.org>

42e4ebd3

bpf: Mitigate latency spikes associated with freeing non-preallocated htab · ee3bad03

Yafang Shao authored Mar 27, 2024

Following the recent upgrade of one of our BPF programs, we encountered
significant latency spikes affecting other applications running on the same
host. After thorough investigation, we identified that these spikes were
primarily caused by the prolonged duration required to free a
non-preallocated htab with approximately 2 million keys.

Notably, our kernel configuration lacks the presence of CONFIG_PREEMPT. In
scenarios where kernel execution extends excessively, other threads might
be starved of CPU time, resulting in latency issues across the system. To
mitigate this, we've adopted a proactive approach by incorporating
cond_resched() calls within the kernel code. This ensures that during
lengthy kernel operations, the scheduler is invoked periodically to provide
opportunities for other threads to execute.
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20240327032022.78391-1-laoar.shao@gmail.comSigned-off-by: Alexei Starovoitov <ast@kernel.org>

ee3bad03

Merge branch 'bench-fast-in-kernel-triggering-benchmarks' · a461a51e

Alexei Starovoitov authored Mar 27, 2024

Andrii Nakryiko says:

====================
bench: fast in-kernel triggering benchmarks

Remove "legacy" triggering benchmarks which rely on syscalls (and thus syscall
overhead is a noticeable part of benchmark, unfortunately). Replace them with
faster versions that rely on triggering BPF programs in-kernel through another
simple "driver" BPF program. See patch #2 with comparison results.

raw_tp/tp/fmodret benchmarks required adding a simple kfunc in kernel to be
able to trigger a simple tracepoint from BPF program (plus it is also allowed
to be replaced by fmod_ret programs). This limits raw_tp/tp/fmodret benchmarks
to new kernels only, but it keeps bench tool itself very portable and most of
other benchmarks will still work on wide variety of kernels without the need
to worry about building and deploying custom kernel module. See patches #5
and #6 for details.

v1->v2:
  - move new TP closer to BPF test run code;
  - rename/move kfunc and register it for fmod_rets (Alexei);
  - limit --trig-batch-iters param to [1, 1000] (Alexei).
====================

Link: https://lore.kernel.org/r/20240326162151.3981687-1-andrii@kernel.orgSigned-off-by: Alexei Starovoitov <ast@kernel.org>

a461a51e

selftests/bpf: add batched tp/raw_tp/fmodret tests · 985d0681

Andrii Nakryiko authored Mar 26, 2024

Utilize bpf_modify_return_test_tp() kfunc to have a fast way to trigger
tp/raw_tp/fmodret programs from another BPF program, which gives us
comparable batched benchmarks to (batched) kprobe/fentry benchmarks.

We don't switch kprobe/fentry batched benchmarks to this kfunc to make
bench tool usable on older kernels as well.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20240326162151.3981687-7-andrii@kernel.orgSigned-off-by: Alexei Starovoitov <ast@kernel.org>

985d0681

bpf: add bpf_modify_return_test_tp() kfunc triggering tracepoint · 3124591f

Andrii Nakryiko authored Mar 26, 2024

Add a simple bpf_modify_return_test_tp() kfunc, available to all program
types, that is useful for various testing and benchmarking scenarios, as
it allows to trigger most tracing BPF program types from BPF side,
allowing to do complex testing and benchmarking scenarios.

It is also attachable to for fmod_ret programs, making it a good and
simple way to trigger fmod_ret program under test/benchmark.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20240326162151.3981687-6-andrii@kernel.orgSigned-off-by: Alexei Starovoitov <ast@kernel.org>

3124591f

selftests/bpf: lazy-load trigger bench BPF programs · b4ccf915

Andrii Nakryiko authored Mar 26, 2024

Instead of front-loading all possible benchmarking BPF programs for
trigger benchmarks, explicitly specify which BPF programs are used by
specific benchmark and load only it.

This allows to be more flexible in supporting older kernels, where some
program types might not be possible to load (e.g., those that rely on
newly added kfunc).
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20240326162151.3981687-5-andrii@kernel.orgSigned-off-by: Alexei Starovoitov <ast@kernel.org>

b4ccf915

selftests/bpf: remove syscall-driven benchs, keep syscall-count only · 208c4391

Andrii Nakryiko authored Mar 26, 2024

Remove "legacy" benchmarks triggered by syscalls in favor of newly added
in-kernel/batched benchmarks. Drop -batched suffix now as well.
Next patch will restore "feature parity" by adding back
tp/raw_tp/fmodret benchmarks based on in-kernel kfunc approach.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20240326162151.3981687-4-andrii@kernel.orgSigned-off-by: Alexei Starovoitov <ast@kernel.org>

208c4391

selftests/bpf: add batched, mostly in-kernel BPF triggering benchmarks · 7df4e597

Andrii Nakryiko authored Mar 26, 2024

Existing kprobe/fentry triggering benchmarks have 1-to-1 mapping between
one syscall execution and BPF program run. While we use a fast
get_pgid() syscall, syscall overhead can still be non-trivial.

This patch adds kprobe/fentry set of benchmarks significantly amortizing
the cost of syscall vs actual BPF triggering overhead. We do this by
employing BPF_PROG_TEST_RUN command to trigger "driver" raw_tp program
which does a tight parameterized loop calling cheap BPF helper
(bpf_get_numa_node_id()), to which kprobe/fentry programs are
attached for benchmarking.

This way 1 bpf() syscall causes N executions of BPF program being
benchmarked. N defaults to 100, but can be adjusted with
--trig-batch-iters CLI argument.

For comparison we also implement a new baseline program that instead of
triggering another BPF program just does N atomic per-CPU counter
increments, establishing the limit for all other types of program within
this batched benchmarking setup.

Taking the final set of benchmarks added in this patch set (including
tp/raw_tp/fmodret, added in later patch), and keeping for now "legacy"
syscall-driven benchmarks, we can capture all triggering benchmarks in
one place for comparison, before we remove the legacy ones (and rename
xxx-batched into just xxx).

$ benchs/run_bench_trigger.sh
usermode-count       :   79.500 ± 0.024M/s
kernel-count         :   49.949 ± 0.081M/s
syscall-count        :    9.009 ± 0.007M/s

fentry-batch         :   31.002 ± 0.015M/s
fexit-batch          :   20.372 ± 0.028M/s
fmodret-batch        :   21.651 ± 0.659M/s
rawtp-batch          :   36.775 ± 0.264M/s
tp-batch             :   19.411 ± 0.248M/s
kprobe-batch         :   12.949 ± 0.220M/s
kprobe-multi-batch   :   15.400 ± 0.007M/s
kretprobe-batch      :    5.559 ± 0.011M/s
kretprobe-multi-batch:    5.861 ± 0.003M/s

fentry-legacy        :    8.329 ± 0.004M/s
fexit-legacy         :    6.239 ± 0.003M/s
fmodret-legacy       :    6.595 ± 0.001M/s
rawtp-legacy         :    8.305 ± 0.004M/s
tp-legacy            :    6.382 ± 0.001M/s
kprobe-legacy        :    5.528 ± 0.003M/s
kprobe-multi-legacy  :    5.864 ± 0.022M/s
kretprobe-legacy     :    3.081 ± 0.001M/s
kretprobe-multi-legacy:   3.193 ± 0.001M/s

Note how xxx-batch variants are measured with significantly higher
throughput, even though it's exactly the same in-kernel overhead. As
such, results can be compared only between benchmarks of the same kind
(syscall vs batched):

fentry-legacy        :    8.329 ± 0.004M/s
fentry-batch         :   31.002 ± 0.015M/s

kprobe-multi-legacy  :    5.864 ± 0.022M/s
kprobe-multi-batch   :   15.400 ± 0.007M/s

Note also that syscall-count is setting a theoretical limit for
syscall-triggered benchmarks, while kernel-count is setting similar
limits for batch variants. usermode-count is a happy and unachievable
case of user space counting without doing any syscalls, and is mostly
the measure of CPU speed for such a trivial benchmark.

As was mentioned, tp/raw_tp/fmodret require kernel-side kfunc to produce
similar benchmark, which we address in a separate patch.

Note that run_bench_trigger.sh allows to override a list of benchmarks
to run, which is very useful for performance work.

Cc: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20240326162151.3981687-3-andrii@kernel.orgSigned-off-by: Alexei Starovoitov <ast@kernel.org>

7df4e597

selftests/bpf: rename and clean up userspace-triggered benchmarks · 1175f8de

Andrii Nakryiko authored Mar 26, 2024

Rename uprobe-base to more precise usermode-count (it will match other
baseline-like benchmarks, kernel-count and syscall-count). Also use
BENCH_TRIG_USERMODE() macro to define all usermode-based triggering
benchmarks, which include usermode-count and uprobe/uretprobe benchmarks.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20240326162151.3981687-2-andrii@kernel.orgSigned-off-by: Alexei Starovoitov <ast@kernel.org>

1175f8de