Commits · f0dce1d9b7c81fc3dc9d0cc0bc7ef9b3eae22584 · Kirill Smelkov / linux

19 Aug, 2021 4 commits

bpf: Use kvmalloc for map values in syscall · f0dce1d9

Stanislav Fomichev authored Aug 18, 2021

Use kvmalloc/kvfree for temporary value when manipulating a map via
syscall. kmalloc might not be sufficient for percpu maps where the value
is big (and further multiplied by hundreds of CPUs).

Can be reproduced with netcnt test on qemu with "-smp 255".
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20210818235216.1159202-1-sdf@google.com

f0dce1d9

selftests/bpf: Adding delay in socketmap_listen to reduce flakyness · 3666b167

Yucong Sun authored Aug 19, 2021

This patch adds a 1ms delay to reduce flakyness of the test.
Signed-off-by: Yucong Sun <fallentree@fb.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210819163609.2583758-1-fallentree@fb.com

3666b167

bpf: Fix NULL event->prog pointer access in bpf_overflow_handler · 594286b7

Yonghong Song authored Aug 19, 2021

Andrii reported that libbpf CI hit the following oops when
running selftest send_signal:
  [ 1243.160719] BUG: kernel NULL pointer dereference, address: 0000000000000030
  [ 1243.161066] #PF: supervisor read access in kernel mode
  [ 1243.161066] #PF: error_code(0x0000) - not-present page
  [ 1243.161066] PGD 0 P4D 0
  [ 1243.161066] Oops: 0000 [#1] PREEMPT SMP NOPTI
  [ 1243.161066] CPU: 1 PID: 882 Comm: new_name Tainted: G           O      5.14.0-rc5 #1
  [ 1243.161066] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1.1 04/01/2014
  [ 1243.161066] RIP: 0010:bpf_overflow_handler+0x9a/0x1e0
  [ 1243.161066] Code: 5a 84 c0 0f 84 06 01 00 00 be 66 02 00 00 48 c7 c7 6d 96 07 82 48 8b ab 18 05 00 00 e8 df 55 eb ff 66 90 48 8d 75 48 48 89 e7 <ff> 55 30 41 89 c4 e8 fb c1 f0 ff 84 c0 0f 84 94 00 00 00 e8 6e 0f
  [ 1243.161066] RSP: 0018:ffffc900000c0d80 EFLAGS: 00000046
  [ 1243.161066] RAX: 0000000000000002 RBX: ffff8881002e0dd0 RCX: 00000000b4b47cf8
  [ 1243.161066] RDX: ffffffff811dcb06 RSI: 0000000000000048 RDI: ffffc900000c0d80
  [ 1243.161066] RBP: 0000000000000000 R08: 0000000000000000 R09: 1a9d56bb00000000
  [ 1243.161066] R10: 0000000000000001 R11: 0000000000080000 R12: 0000000000000000
  [ 1243.161066] R13: ffffc900000c0e00 R14: ffffc900001c3c68 R15: 0000000000000082
  [ 1243.161066] FS:  00007fc0be2d3380(0000) GS:ffff88813bd00000(0000) knlGS:0000000000000000
  [ 1243.161066] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  [ 1243.161066] CR2: 0000000000000030 CR3: 0000000104f8e000 CR4: 00000000000006e0
  [ 1243.161066] Call Trace:
  [ 1243.161066]  <IRQ>
  [ 1243.161066]  __perf_event_overflow+0x4f/0xf0
  [ 1243.161066]  perf_swevent_hrtimer+0x116/0x130
  [ 1243.161066]  ? __lock_acquire+0x378/0x2730
  [ 1243.161066]  ? __lock_acquire+0x372/0x2730
  [ 1243.161066]  ? lock_is_held_type+0xd5/0x130
  [ 1243.161066]  ? find_held_lock+0x2b/0x80
  [ 1243.161066]  ? lock_is_held_type+0xd5/0x130
  [ 1243.161066]  ? perf_event_groups_first+0x80/0x80
  [ 1243.161066]  ? perf_event_groups_first+0x80/0x80
  [ 1243.161066]  __hrtimer_run_queues+0x1a3/0x460
  [ 1243.161066]  hrtimer_interrupt+0x110/0x220
  [ 1243.161066]  __sysvec_apic_timer_interrupt+0x8a/0x260
  [ 1243.161066]  sysvec_apic_timer_interrupt+0x89/0xc0
  [ 1243.161066]  </IRQ>
  [ 1243.161066]  asm_sysvec_apic_timer_interrupt+0x12/0x20
  [ 1243.161066] RIP: 0010:finish_task_switch+0xaf/0x250
  [ 1243.161066] Code: 31 f6 68 90 2a 09 81 49 8d 7c 24 18 e8 aa d6 03 00 4c 89 e7 e8 12 ff ff ff 4c 89 e7 e8 ca 9c 80 00 e8 35 af 0d 00 fb 4d 85 f6 <58> 74 1d 65 48 8b 04 25 c0 6d 01 00 4c 3b b0 a0 04 00 00 74 37 f0
  [ 1243.161066] RSP: 0018:ffffc900001c3d18 EFLAGS: 00000282
  [ 1243.161066] RAX: 000000000000031f RBX: ffff888104cf4980 RCX: 0000000000000000
  [ 1243.161066] RDX: 0000000000000000 RSI: ffffffff82095460 RDI: ffffffff820adc4e
  [ 1243.161066] RBP: ffffc900001c3d58 R08: 0000000000000001 R09: 0000000000000001
  [ 1243.161066] R10: 0000000000000001 R11: 0000000000080000 R12: ffff88813bd2bc80
  [ 1243.161066] R13: ffff8881002e8000 R14: ffff88810022ad80 R15: 0000000000000000
  [ 1243.161066]  ? finish_task_switch+0xab/0x250
  [ 1243.161066]  ? finish_task_switch+0x70/0x250
  [ 1243.161066]  __schedule+0x36b/0xbb0
  [ 1243.161066]  ? _raw_spin_unlock_irqrestore+0x2d/0x50
  [ 1243.161066]  ? lockdep_hardirqs_on+0x79/0x100
  [ 1243.161066]  schedule+0x43/0xe0
  [ 1243.161066]  pipe_read+0x30b/0x450
  [ 1243.161066]  ? wait_woken+0x80/0x80
  [ 1243.161066]  new_sync_read+0x164/0x170
  [ 1243.161066]  vfs_read+0x122/0x1b0
  [ 1243.161066]  ksys_read+0x93/0xd0
  [ 1243.161066]  do_syscall_64+0x35/0x80
  [ 1243.161066]  entry_SYSCALL_64_after_hwframe+0x44/0xae

The oops can also be reproduced with the following steps:
  ./vmtest.sh -s
  # at qemu shell
  cd /root/bpf && while true; do ./test_progs -t send_signal

Further analysis showed that the failure is introduced with
commit b89fbfbb ("bpf: Implement minimal BPF perf link").
With the above commit, the following scenario becomes possible:
    cpu1                        cpu2
                                hrtimer_interrupt -> bpf_overflow_handler
    (due to closing link_fd)
    bpf_perf_link_release ->
    perf_event_free_bpf_prog ->
    perf_event_free_bpf_handler ->
      WRITE_ONCE(event->overflow_handler, event->orig_overflow_handler)
      event->prog = NULL
                                bpf_prog_run(event->prog, &ctx)

In the above case, the event->prog is NULL for bpf_prog_run, hence
causing oops.

To fix the issue, check whether event->prog is NULL or not. If it
is, do not call bpf_prog_run. This seems working as the above
reproducible step runs more than one hour and I didn't see any
failures.

Fixes: b89fbfbb ("bpf: Implement minimal BPF perf link")
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210819155209.1927994-1-yhs@fb.com

594286b7

bpf: Undo off-by-one in interpreter tail call count limit · f9dabe01

Daniel Borkmann authored Aug 19, 2021

The BPF interpreter as well as x86-64 BPF JIT were both in line by allowing
up to 33 tail calls (however odd that number may be!). Recently, this was
changed for the interpreter to reduce it down to 32 with the assumption that
this should have been the actual limit "which is in line with the behavior of
the x86 JITs" according to b61a28cf ("bpf: Fix off-by-one in tail call
count limiting").

Paul recently reported:

  I'm a bit surprised by this because I had previously tested the tail call
  limit of several JIT compilers and found it to be 33 (i.e., allowing chains
  of up to 34 programs). I've just extended a test program I had to validate
  this again on the x86-64 JIT, and found a limit of 33 tail calls again [1].

  Also note we had previously changed the RISC-V and MIPS JITs to allow up to
  33 tail calls [2, 3], for consistency with other JITs and with the interpreter.
  We had decided to increase these two to 33 rather than decrease the other
  JITs to 32 for backward compatibility, though that probably doesn't matter
  much as I'd expect few people to actually use 33 tail calls.

  [1] https://github.com/pchaigno/tail-call-bench/commit/ae7887482985b4b1745c9b2ef7ff9ae506c82886
  [2] 96bc4432 ("bpf, riscv: Limit to 33 tail calls")
  [3] e49e6f6d ("bpf, mips: Limit to 33 tail calls")

Therefore, revert b61a28cf to re-align interpreter to limit a maximum of
33 tail calls. While it is unlikely to hit the limit for the vast majority,
programs in the wild could one way or another depend on this, so lets rather
be a bit more conservative, and lets align the small remainder of JITs to 33.
If needed in future, this limit could be slightly increased, but not decreased.

Fixes: b61a28cf ("bpf: Fix off-by-one in tail call count limiting")
Reported-by: Paul Chaignon <paul@cilium.io>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Johan Almbladh <johan.almbladh@anyfinetworks.com>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/CAO5pjwTWrC0_dzTbTHFPSqDwA56aVH+4KFGVqdq8=ASs0MqZGQ@mail.gmail.com

f9dabe01

18 Aug, 2021 3 commits

selftests/bpf: Test for get_netns_cookie · 374e74de

Xu Liu authored Aug 18, 2021

Add test to use get_netns_cookie() from BPF_PROG_TYPE_SOCK_OPS.
Signed-off-by: Xu Liu <liuxu623@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20210818105820.91894-3-liuxu623@gmail.com

374e74de

bpf: Allow bpf_get_netns_cookie in BPF_PROG_TYPE_SOCK_OPS · 6cf1770d

Xu Liu authored Aug 18, 2021

We'd like to be able to identify netns from sockops hooks to
accelerate local process communication form different netns.
Signed-off-by: Xu Liu <liuxu623@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20210818105820.91894-2-liuxu623@gmail.com

6cf1770d

libbpf: Rename libbpf documentation index file · d20b4111

Grant Seltzer authored Aug 18, 2021

This patch renames a documentation libbpf.rst to index.rst. In order
for readthedocs.org to pick this file up and properly build the
documentation site.

It also changes the title type of the ABI subsection in the
naming convention doc. This is so that readthedocs.org doesn't treat this
section as a separate document.
Signed-off-by: Grant Seltzer <grantseltzer@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210818151313.49992-1-grantseltzer@gmail.com

d20b4111

17 Aug, 2021 19 commits

bpf: Remove redundant initialization of variable allow · 8cacfc85

Colin Ian King authored Aug 17, 2021

The variable allow is being initialized with a value that is never read, it
is being updated later on. The assignment is redundant and can be removed.

Addresses-Coverity: ("Unused value")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210817170842.495440-1-colin.king@canonical.com

8cacfc85

Merge branch 'selftests/bpf: fix flaky send_signal test' · 04d23194

Andrii Nakryiko authored Aug 17, 2021

Yonghong Song says:

====================

The bpf selftest send_signal() is flaky for its subtests trying to
send signals in softirq/nmi context. To reduce flakiness, the
signal-targetted process priority is boosted, which should minimize
preemption of that process and improve the possibility that
the underlying task in softirq/nmi context is the bpf_send_signal()
wanted task.

Patch #1 did a refactoring to use ASSERT_* instead of old CHECK macros.
Patch #2 did actual change of boosting priority.

Changelog:
  v1 -> v2:
    remove skip logic where the underlying task in interrupt context
    is not the intended one.
====================
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>

04d23194

selftests/bpf: Fix flaky send_signal test · b16ac5bf

Yonghong Song authored Aug 17, 2021

libbpf CI has reported send_signal test is flaky although
I am not able to reproduce it in my local environment.
But I am able to reproduce with on-demand libbpf CI ([1]).

Through code analysis, the following is possible reason.
The failed subtest runs bpf program in softirq environment.
Since bpf_send_signal() only sends to a fork of "test_progs"
process. If the underlying current task is
not "test_progs", bpf_send_signal() will not be triggered
and the subtest will fail.

To reduce the chances where the underlying process is not
the intended one, this patch boosted scheduling priority to
-20 (highest allowed by setpriority() call). And I did
10 runs with on-demand libbpf CI with this patch and I
didn't observe any failures.

 [1] https://github.com/libbpf/libbpf/actions/workflows/ondemand.ymlSigned-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210817190923.3186725-1-yhs@fb.com

b16ac5bf

selftests/bpf: Replace CHECK with ASSERT_* macros in send_signal.c · 6f6cc426

Yonghong Song authored Aug 17, 2021

Replace CHECK in send_signal.c with ASSERT_* macros as
ASSERT_* macros are generally preferred. There is no
funcitonality change.
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210817190918.3186400-1-yhs@fb.com

6f6cc426

Merge branch 'selftests/bpf: Improve the usability of test_progs' · 87bb11cc

Andrii Nakryiko authored Aug 17, 2021

Yucong Sun says:

====================

This short series adds two new switches to test_progs, "-a" and "-d",
adding support for both exact string matching, as well as '*' wildcards.
It also cleans up the output to make it possible to generate
allowlist/denylist using common cli tools.
====================
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>

87bb11cc

selftests/bpf: Support glob matching for test selector. · 74339a8f

Yucong Sun authored Aug 16, 2021

This patch adds '-a' and '-d' arguments supporting both exact string match as
well as using '*' wildcard in test/subtests selection. '-a' and '-t' can
co-exists, same as '-d' and '-b', in which case they just add to the list of
allowed or denied test selectors.

Caveat: Same as the current substring matching mechanism, test and subtest
selector applies independently, 'a*/b*' will execute all tests matching "a*",
and with subtest name matching "b*", but tests matching "a*" that has no
subtests will also be executed.
Signed-off-by: Yucong Sun <fallentree@fb.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210817044732.3263066-5-fallentree@fb.com

74339a8f

selftests/bpf: Also print test name in subtest status message · 99c4fd8b

Yucong Sun authored Aug 16, 2021

This patch add test name in subtest status message line, making it possible to
grep ':OK' in the output to generate a list of passed test+subtest names, which
can be processed to generate argument list to be used with "-a", "-d" exact
string matching.

Example:

 #1/1 align/mov:OK
 ..
 #1/12 align/pointer variable subtraction:OK
 #1 align:OK
Signed-off-by: Yucong Sun <fallentree@fb.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210817044732.3263066-4-fallentree@fb.com

99c4fd8b

selftests/bpf: Correctly display subtest skip status · f667d1d6

Yucong Sun authored Aug 16, 2021

In skip_account(), test->skip_cnt is set to 0 at the end, this makes next print
statement never display SKIP status for the subtest. This patch moves the
accounting logic after the print statement, fixing the issue.

This patch also added SKIP status display for normal tests.
Signed-off-by: Yucong Sun <fallentree@fb.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210817044732.3263066-3-fallentree@fb.com

f667d1d6

selftests/bpf: Skip loading bpf_testmod when using -l to list tests. · 26d82640

Yucong Sun authored Aug 16, 2021

When using "-l", test_progs often is executed as non-root user,
load_bpf_testmod() will fail and output errors. This patch skips loading bpf
testmod when "-l" is specified, making output cleaner.
Signed-off-by: Yucong Sun <fallentree@fb.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210817044732.3263066-2-fallentree@fb.com

26d82640

selftests/bpf: Add exponential backoff to map_delete_retriable in test_maps · 857f75ea

Yucong Sun authored Aug 16, 2021

Using a fixed delay of 1 microsecond has proven flaky in slow CPU environment,
e.g. Github Actions CI system. This patch adds exponential backoff with a cap
of 50ms to reduce the flakiness of the test. Initial delay is chosen at random
in the range [0ms, 5ms).
Signed-off-by: Yucong Sun <fallentree@fb.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210817045713.3307985-1-fallentree@fb.com

857f75ea

selftests/bpf: Add exponential backoff to map_update_retriable in test_maps · 3c3bd542

Yucong Sun authored Aug 16, 2021

Using a fixed delay of 1 microsecond has proven flaky in slow CPU environment,
e.g. Github Actions CI system. This patch adds exponential backoff with a cap
of 50ms to reduce the flakiness of the test. Initial delay is chosen at random
in the range [0ms, 5ms).
Signed-off-by: Yucong Sun <fallentree@fb.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20210816175250.296110-1-fallentree@fb.com

3c3bd542

Merge branch 'sockmap: add sockmap support for unix stream socket' · 1e1e49df

Andrii Nakryiko authored Aug 16, 2021

Jiang Wang says:

====================

This patch series add support for unix stream type
for sockmap. Sockmap already supports TCP, UDP,
unix dgram types. The unix stream support is similar
to unix dgram.

Also add selftests for unix stream type in sockmap tests.
====================
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>

1e1e49df

selftest/bpf: Add new tests in sockmap for unix stream to tcp. · 31c50aee

Jiang Wang authored Aug 16, 2021

Add two new test cases in sockmap tests, where unix stream is
redirected to tcp and vice versa.
Signed-off-by: Jiang Wang <jiang.wang@bytedance.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Reviewed-by: Cong Wang <cong.wang@bytedance.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/bpf/20210816190327.2739291-6-jiang.wang@bytedance.com

31c50aee

selftest/bpf: Change udp to inet in some function names · 75e0e27d

Jiang Wang authored Aug 16, 2021

This is to prepare for adding new unix stream tests.
Mostly renames, also pass the socket types as an argument.
Signed-off-by: Jiang Wang <jiang.wang@bytedance.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Reviewed-by: Cong Wang <cong.wang@bytedance.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/bpf/20210816190327.2739291-5-jiang.wang@bytedance.com

75e0e27d

selftest/bpf: Add tests for sockmap with unix stream type. · 9b03152b

Jiang Wang authored Aug 16, 2021

Add two tests for unix stream to unix stream redirection
in sockmap tests.
Signed-off-by: Jiang Wang <jiang.wang@bytedance.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Reviewed-by: Cong Wang <cong.wang@bytedance.com>
Acked-by: Jakub Sitnicki <jakub@cloudflare.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20210816190327.2739291-4-jiang.wang@bytedance.com

9b03152b

af_unix: Add unix_stream_proto for sockmap · 94531cfc

Jiang Wang authored Aug 16, 2021

Previously, sockmap for AF_UNIX protocol only supports
dgram type. This patch add unix stream type support, which
is similar to unix_dgram_proto. To support sockmap, dgram
and stream cannot share the same unix_proto anymore, because
they have different implementations, such as unhash for stream
type (which will remove closed or disconnected sockets from the map),
so rename unix_proto to unix_dgram_proto and add a new
unix_stream_proto.

Also implement stream related sockmap functions.
And add dgram key words to those dgram specific functions.
Signed-off-by: Jiang Wang <jiang.wang@bytedance.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Reviewed-by: Cong Wang <cong.wang@bytedance.com>
Acked-by: Jakub Sitnicki <jakub@cloudflare.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20210816190327.2739291-3-jiang.wang@bytedance.com

94531cfc

af_unix: Add read_sock for stream socket types · 77462de1

Jiang Wang authored Aug 16, 2021

To support sockmap for af_unix stream type, implement
read_sock, which is similar to the read_sock for unix
dgram sockets.
Signed-off-by: Jiang Wang <jiang.wang@bytedance.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Reviewed-by: Cong Wang <cong.wang@bytedance.com>
Acked-by: Jakub Sitnicki <jakub@cloudflare.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20210816190327.2739291-2-jiang.wang@bytedance.com

77462de1

selftests/bpf: Test btf__load_vmlinux_btf/btf__load_module_btf APIs · edce1a24

Hengqi Chen authored Aug 15, 2021

Add test for btf__load_vmlinux_btf/btf__load_module_btf APIs. The test
loads bpf_testmod module BTF and check existence of a symbol which is
known to exist.
Signed-off-by: Hengqi Chen <hengqi.chen@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210815081035.205879-1-hengqi.chen@gmail.com

edce1a24

bpf: Reconfigure libbpf docs to remove unversioned API · bb571649

grantseltzer authored Aug 09, 2021

This removes the libbpf_api.rst file from the kernel documentation.
The intention for this file was to pull documentation from comments
above API functions in libbpf. However, due to limitations of the
kernel documentation system, this API documentation could not be
versioned, which is counterintuative to how users expect to use it.
There is also currently no doc comments, making this a blank page.

Once the kernel comment documentation is actually contributed, it
will still exist in the kernel repository, just in the code itself.

A seperate site is being spun up to generate documentaiton from those
comments in a way in which it can be versioned properly.

This also reconfigures the bpf documentation index page to make it
easier to sync to the previously mentioned documentaiton site.
Signed-off-by: Grant Seltzer <grantseltzer@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210810020508.280639-1-grantseltzer@gmail.com

bb571649

16 Aug, 2021 14 commits

Merge branch 'bpf-perf-link' · 3a4ce01b

Daniel Borkmann authored Aug 17, 2021

Andrii Nakryiko says:

====================
This patch set implements an ability for users to specify custom black box u64
value for each BPF program attachment, bpf_cookie, which is available to BPF
program at runtime. This is a feature that's critically missing for cases when
some sort of generic processing needs to be done by the common BPF program
logic (or even exactly the same BPF program) across multiple BPF hooks (e.g.,
many uniformly handled kprobes) and it's important to be able to distinguish
between each BPF hook at runtime (e.g., for additional configuration lookup).

The choice of restricting this to a fixed-size 8-byte u64 value is an explicit
design decision. Making this configurable by users adds unnecessary complexity
(extra memory allocations, extra complications on the verifier side to validate
accesses to variable-sized data area) while not really opening up new
possibilities. If user's use case requires storing more data per attachment,
it's possible to use either global array, or ARRAY/HASHMAP BPF maps, where
bpf_cookie would be used as an index into respective storage, populated by
user-space code before creating BPF link. This gives user all the flexibility
and control while keeping BPF verifier and BPF helper API simple.

Currently, similar functionality can only be achieved through:

  - code-generation and BPF program cloning, which is very complicated and
    unmaintainable;
  - on-the-fly C code generation and further runtime compilation, which is
    what BCC uses and allows to do pretty simply. The big downside is a very
    heavy-weight Clang/LLVM dependency and inefficient memory usage (due to
    many BPF program clones and the compilation process itself);
  - in some cases (kprobes and sometimes uprobes) it's possible to do function
    IP lookup to get function-specific configuration. This doesn't work for
    all the cases (e.g., when attaching uprobes to shared libraries) and has
    higher runtime overhead and additional programming complexity due to
    BPF_MAP_TYPE_HASHMAP lookups. Up until recently, before bpf_get_func_ip()
    BPF helper was added, it was also very complicated and unstable (API-wise)
    to get traced function's IP from fentry/fexit and kretprobe.

With libbpf and BPF CO-RE, runtime compilation is not an option, so to be able
to build generic tracing tooling simply and efficiently, ability to provide
additional bpf_cookie value for each *attachment* (as opposed to each BPF
program) is extremely important. Two immediate users of this functionality are
going to be libbpf-based USDT library (currently in development) and retsnoop
([0]), but I'm sure more applications will come once users get this feature in
their kernels.

To achieve above described, all perf_event-based BPF hooks are made available
through a new BPF_LINK_TYPE_PERF_EVENT BPF link, which allows to use common
LINK_CREATE command for program attachments and generally brings
perf_event-based attachments into a common BPF link infrastructure.

With that, LINK_CREATE gets ability to pass throught bpf_cookie value during
link creation (BPF program attachment) time. bpf_get_attach_cookie() BPF
helper is added to allow fetching this value at runtime from BPF program side.
BPF cookie is stored either on struct perf_event itself and fetched from the
BPF program context, or is passed through ambient BPF run context, added in
c7603cfa ("bpf: Add ambient BPF runtime context stored in current").

On the libbpf side of things, BPF perf link is utilized whenever is supported
by the kernel instead of using PERF_EVENT_IOC_SET_BPF ioctl on perf_event FD.
All the tracing attach APIs are extended with OPTS and bpf_cookie is passed
through corresponding opts structs.

Last part of the patch set adds few self-tests utilizing new APIs.

There are also a few refactorings along the way to make things cleaner and
easier to work with, both in kernel (BPF_PROG_RUN and BPF_PROG_RUN_ARRAY), and
throughout libbpf and selftests.

Follow-up patches will extend bpf_cookie to fentry/fexit programs.

While adding uprobe_opts, also extend it with ref_ctr_offset for specifying
USDT semaphore (reference counter) offset. Update attach_probe selftests to
validate its functionality. This is another feature (along with bpf_cookie)
required for implementing libbpf-based USDT solution.

  [0] https://github.com/anakryiko/retsnoop

v4->v5:
  - rebase on latest bpf-next to resolve merge conflict;
  - add ref_ctr_offset to uprobe_opts and corresponding selftest;
v3->v4:
  - get rid of BPF_PROG_RUN macro in favor of bpf_prog_run() (Daniel);
  - move #ifdef CONFIG_BPF_SYSCALL check into bpf_set_run_ctx (Daniel);
v2->v3:
  - user_ctx -> bpf_cookie, bpf_get_user_ctx -> bpf_get_attach_cookie (Peter);
  - fix BPF_LINK_TYPE_PERF_EVENT value fix (Jiri);
  - use bpf_prog_run() from bpf_prog_run_pin_on_cpu() (Yonghong);
v1->v2:
  - fix build failures on non-x86 arches by gating on CONFIG_PERF_EVENTS.
====================
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

3a4ce01b

selftests/bpf: Add ref_ctr_offset selftests · 4bd11e08

Andrii Nakryiko authored Aug 15, 2021

Extend attach_probe selftests to specify ref_ctr_offset for uprobe/uretprobe
and validate that its value is incremented from zero.

Turns out that once uprobe is attached with ref_ctr_offset, uretprobe for the
same location/function *has* to use ref_ctr_offset as well, otherwise
perf_event_open() fails with -EINVAL. So this test uses ref_ctr_offset for
both uprobe and uretprobe, even though for the purpose of test uprobe would be
enough.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20210815070609.987780-17-andrii@kernel.org

4bd11e08

libbpf: Add uprobe ref counter offset support for USDT semaphores · 5e3b8356

Andrii Nakryiko authored Aug 15, 2021

When attaching to uprobes through perf subsystem, it's possible to specify
offset of a so-called USDT semaphore, which is just a reference counted u16,
used by kernel to keep track of how many tracers are attached to a given
location. Support for this feature was added in [0], so just wire this through
uprobe_opts. This is important to enable implementing USDT attachment and
tracing through libbpf's bpf_program__attach_uprobe_opts() API.

  [0] a6ca88b2 ("trace_uprobe: support reference counter in fd-based uprobe")
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20210815070609.987780-16-andrii@kernel.org

5e3b8356

selftests/bpf: Add bpf_cookie selftests for high-level APIs · 0a80cf67

Andrii Nakryiko authored Aug 15, 2021

Add selftest with few subtests testing proper bpf_cookie usage.

Kprobe and uprobe subtests are pretty straightforward and just validate that
the same BPF program attached with different bpf_cookie will be triggered with
those different bpf_cookie values.

Tracepoint subtest is a bit more interesting, as it is the only
perf_event-based BPF hook that shares bpf_prog_array between multiple
perf_events internally. This means that the same BPF program can't be attached
to the same tracepoint multiple times. So we have 3 identical copies. This
arrangement allows to test bpf_prog_array_copy()'s handling of bpf_prog_array
list manipulation logic when programs are attached and detached. The test
validates that bpf_cookie isn't mixed up and isn't lost during such list
manipulations.

Perf_event subtest validates that two BPF links can be created against the
same perf_event (but not at the same time, only one BPF program can be
attached to perf_event itself), and that for each we can specify different
bpf_cookie value.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20210815070609.987780-15-andrii@kernel.org

0a80cf67

selftests/bpf: Extract uprobe-related helpers into trace_helpers.{c,h} · a549aaa6

Andrii Nakryiko authored Aug 15, 2021

Extract two helpers used for working with uprobes into trace_helpers.{c,h} to
be re-used between multiple uprobe-using selftests. Also rename get_offset()
into more appropriate get_uprobe_offset().
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20210815070609.987780-14-andrii@kernel.org

a549aaa6

selftests/bpf: Test low-level perf BPF link API · f36d3557

Andrii Nakryiko authored Aug 15, 2021

Add tests utilizing low-level bpf_link_create() API to create perf BPF link.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20210815070609.987780-13-andrii@kernel.org

f36d3557

libbpf: Add bpf_cookie to perf_event, kprobe, uprobe, and tp attach APIs · 47faff37

Andrii Nakryiko authored Aug 15, 2021

Wire through bpf_cookie for all attach APIs that use perf_event_open under the
hood:
  - for kprobes, extend existing bpf_kprobe_opts with bpf_cookie field;
  - for perf_event, uprobe, and tracepoint APIs, add their _opts variants and
    pass bpf_cookie through opts.

For kernel that don't support BPF_LINK_CREATE for perf_events, and thus
bpf_cookie is not supported either, return error and log warning for user.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20210815070609.987780-12-andrii@kernel.org

47faff37

libbpf: Add bpf_cookie support to bpf_link_create() API · 3ec84f4b

Andrii Nakryiko authored Aug 15, 2021

Add ability to specify bpf_cookie value when creating BPF perf link with
bpf_link_create() low-level API.

Given BPF_LINK_CREATE command is growing and keeps getting new fields that are
specific to the type of BPF_LINK, extend libbpf side of bpf_link_create() API
and corresponding OPTS struct to accomodate such changes. Add extra checks to
prevent using incompatible/unexpected combinations of fields.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20210815070609.987780-11-andrii@kernel.org

3ec84f4b

libbpf: Use BPF perf link when supported by kernel · 668ace0e

Andrii Nakryiko authored Aug 15, 2021

Detect kernel support for BPF perf link and prefer it when attaching to
perf_event, tracepoint, kprobe/uprobe. Underlying perf_event FD will be kept
open until BPF link is destroyed, at which point both perf_event FD and BPF
link FD will be closed.

This preserves current behavior in which perf_event FD is open for the
duration of bpf_link's lifetime and user is able to "disconnect" bpf_link from
underlying FD (with bpf_link__disconnect()), so that bpf_link__destroy()
doesn't close underlying perf_event FD.When BPF perf link is used, disconnect
will keep both perf_event and bpf_link FDs open, so it will be up to
(advanced) user to close them. This approach is demonstrated in bpf_cookie.c
selftests, added in this patch set.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20210815070609.987780-10-andrii@kernel.org

668ace0e

libbpf: Remove unused bpf_link's destroy operation, but add dealloc · d88b71d4

Andrii Nakryiko authored Aug 15, 2021

bpf_link->destroy() isn't used by any code, so remove it. Instead, add ability
to override deallocation procedure, with default doing plain free(link). This
is necessary for cases when we want to "subclass" struct bpf_link to keep
extra information, as is the case in the next patch adding struct
bpf_link_perf.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20210815070609.987780-9-andrii@kernel.org

d88b71d4

libbpf: Re-build libbpf.so when libbpf.map changes · 61c7aa50

Andrii Nakryiko authored Aug 15, 2021

Ensure libbpf.so is re-built whenever libbpf.map is modified.  Without this,
changes to libbpf.map are not detected and versioned symbols mismatch error
will be reported until `make clean && make` is used, which is a suboptimal
developer experience.

Fixes: 306b267c ("libbpf: Verify versioned symbols")
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20210815070609.987780-8-andrii@kernel.org

61c7aa50

bpf: Add bpf_get_attach_cookie() BPF helper to access bpf_cookie value · 7adfc6c9

Andrii Nakryiko authored Aug 15, 2021

Add new BPF helper, bpf_get_attach_cookie(), which can be used by BPF programs
to get access to a user-provided bpf_cookie value, specified during BPF
program attachment (BPF link creation) time.

Naming is hard, though. With the concept being named "BPF cookie", I've
considered calling the helper:
  - bpf_get_cookie() -- seems too unspecific and easily mistaken with socket
    cookie;
  - bpf_get_bpf_cookie() -- too much tautology;
  - bpf_get_link_cookie() -- would be ok, but while we create a BPF link to
    attach BPF program to BPF hook, it's still an "attachment" and the
    bpf_cookie is associated with BPF program attachment to a hook, not a BPF
    link itself. Technically, we could support bpf_cookie with old-style
    cgroup programs.So I ultimately rejected it in favor of
    bpf_get_attach_cookie().

Currently all perf_event-backed BPF program types support
bpf_get_attach_cookie() helper. Follow-up patches will add support for
fentry/fexit programs as well.

While at it, mark bpf_tracing_func_proto() as static to make it obvious that
it's only used from within the kernel/trace/bpf_trace.c.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20210815070609.987780-7-andrii@kernel.org

7adfc6c9

bpf: Allow to specify user-provided bpf_cookie for BPF perf links · 82e6b1ee

Andrii Nakryiko authored Aug 15, 2021

Add ability for users to specify custom u64 value (bpf_cookie) when creating
BPF link for perf_event-backed BPF programs (kprobe/uprobe, perf_event,
tracepoints).

This is useful for cases when the same BPF program is used for attaching and
processing invocation of different tracepoints/kprobes/uprobes in a generic
fashion, but such that each invocation is distinguished from each other (e.g.,
BPF program can look up additional information associated with a specific
kernel function without having to rely on function IP lookups). This enables
new use cases to be implemented simply and efficiently that previously were
possible only through code generation (and thus multiple instances of almost
identical BPF program) or compilation at runtime (BCC-style) on target hosts
(even more expensive resource-wise). For uprobes it is not even possible in
some cases to know function IP before hand (e.g., when attaching to shared
library without PID filtering, in which case base load address is not known
for a library).

This is done by storing u64 bpf_cookie in struct bpf_prog_array_item,
corresponding to each attached and run BPF program. Given cgroup BPF programs
already use two 8-byte pointers for their needs and cgroup BPF programs don't
have (yet?) support for bpf_cookie, reuse that space through union of
cgroup_storage and new bpf_cookie field.

Make it available to kprobe/tracepoint BPF programs through bpf_trace_run_ctx.
This is set by BPF_PROG_RUN_ARRAY, used by kprobe/uprobe/tracepoint BPF
program execution code, which luckily is now also split from
BPF_PROG_RUN_ARRAY_CG. This run context will be utilized by a new BPF helper
giving access to this user-provided cookie value from inside a BPF program.
Generic perf_event BPF programs will access this value from perf_event itself
through passed in BPF program context.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yonghong Song <yhs@fb.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/bpf/20210815070609.987780-6-andrii@kernel.org

82e6b1ee

bpf: Implement minimal BPF perf link · b89fbfbb

Andrii Nakryiko authored Aug 15, 2021

Introduce a new type of BPF link - BPF perf link. This brings perf_event-based
BPF program attachments (perf_event, tracepoints, kprobes, and uprobes) into
the common BPF link infrastructure, allowing to list all active perf_event
based attachments, auto-detaching BPF program from perf_event when link's FD
is closed, get generic BPF link fdinfo/get_info functionality.

BPF_LINK_CREATE command expects perf_event's FD as target_fd. No extra flags
are currently supported.

Force-detaching and atomic BPF program updates are not yet implemented, but
with perf_event-based BPF links we now have common framework for this without
the need to extend ioctl()-based perf_event interface.

One interesting consideration is a new value for bpf_attach_type, which
BPF_LINK_CREATE command expects. Generally, it's either 1-to-1 mapping from
bpf_attach_type to bpf_prog_type, or many-to-1 mapping from a subset of
bpf_attach_types to one bpf_prog_type (e.g., see BPF_PROG_TYPE_SK_SKB or
BPF_PROG_TYPE_CGROUP_SOCK). In this case, though, we have three different
program types (KPROBE, TRACEPOINT, PERF_EVENT) using the same perf_event-based
mechanism, so it's many bpf_prog_types to one bpf_attach_type. I chose to
define a single BPF_PERF_EVENT attach type for all of them and adjust
link_create()'s logic for checking correspondence between attach type and
program type.

The alternative would be to define three new attach types (e.g., BPF_KPROBE,
BPF_TRACEPOINT, and BPF_PERF_EVENT), but that seemed like unnecessary overkill
and BPF_KPROBE will cause naming conflicts with BPF_KPROBE() macro, defined by
libbpf. I chose to not do this to avoid unnecessary proliferation of
bpf_attach_type enum values and not have to deal with naming conflicts.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yonghong Song <yhs@fb.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/bpf/20210815070609.987780-5-andrii@kernel.org

b89fbfbb