Commits · fd9c663b9ad67dedfc9a3fd3429ddd3e83782b4d · Kirill Smelkov / linux

21 Apr, 2023 9 commits

bpf: minimal support for programs hooked into netfilter framework · fd9c663b

Florian Westphal authored Apr 21, 2023

This adds minimal support for BPF_PROG_TYPE_NETFILTER bpf programs
that will be invoked via the NF_HOOK() points in the ip stack.

Invocation incurs an indirect call. This is not a necessity: Its
possible to add 'DEFINE_BPF_DISPATCHER(nf_progs)' and handle the
program invocation with the same method already done for xdp progs.

This isn't done here to keep the size of this chunk down.

Verifier restricts verdicts to either DROP or ACCEPT.
Signed-off-by: Florian Westphal <fw@strlen.de>
Link: https://lore.kernel.org/r/20230421170300.24115-3-fw@strlen.deSigned-off-by: Alexei Starovoitov <ast@kernel.org>

fd9c663b

bpf: add bpf_link support for BPF_NETFILTER programs · 84601d6e

Florian Westphal authored Apr 21, 2023

Add bpf_link support skeleton.  To keep this reviewable, no bpf program
can be invoked yet, if a program is attached only a c-stub is called and
not the actual bpf program.

Defaults to 'y' if both netfilter and bpf syscall are enabled in kconfig.

Uapi example usage:
	union bpf_attr attr = { };

	attr.link_create.prog_fd = progfd;
	attr.link_create.attach_type = 0; /* unused */
	attr.link_create.netfilter.pf = PF_INET;
	attr.link_create.netfilter.hooknum = NF_INET_LOCAL_IN;
	attr.link_create.netfilter.priority = -128;

	err = bpf(BPF_LINK_CREATE, &attr, sizeof(attr));

... this would attach progfd to ipv4:input hook.

Such hook gets removed automatically if the calling program exits.

BPF_NETFILTER program invocation is added in followup change.

NF_HOOK_OP_BPF enum will eventually be read from nfnetlink_hook, it
allows to tell userspace which program is attached at the given hook
when user runs 'nft hook list' command rather than just the priority
and not-very-helpful 'this hook runs a bpf prog but I can't tell which
one'.

Will also be used to disallow registration of two bpf programs with
same priority in a followup patch.

v4: arm32 cmpxchg only supports 32bit operand
    s/prio/priority/
v3: restrict prog attachment to ip/ip6 for now, lets lift restrictions if
    more use cases pop up (arptables, ebtables, netdev ingress/egress etc).
Signed-off-by: Florian Westphal <fw@strlen.de>
Link: https://lore.kernel.org/r/20230421170300.24115-2-fw@strlen.deSigned-off-by: Alexei Starovoitov <ast@kernel.org>

84601d6e

bpftool: Update doc to explain struct_ops register subcommand. · 45cea721

Kui-Feng Lee authored Apr 19, 2023

The "struct_ops register" subcommand now allows for an optional *LINK_DIR*
to be included. This specifies the directory path where bpftool will pin
struct_ops links with the same name as their corresponding map names.
Signed-off-by: Kui-Feng Lee <kuifeng@meta.com>
Reviewed-by: Quentin Monnet <quentin@isovalent.com>
Link: https://lore.kernel.org/r/20230420002822.345222-2-kuifeng@meta.comSigned-off-by: Alexei Starovoitov <ast@kernel.org>

45cea721

bpftool: Register struct_ops with a link. · 0232b788

Kui-Feng Lee authored Apr 19, 2023

You can include an optional path after specifying the object name for the
'struct_ops register' subcommand.

Since the commit 226bc6ae ("Merge branch 'Transit between BPF TCP
congestion controls.'") has been accepted, it is now possible to create a
link for a struct_ops. This can be done by defining a struct_ops in
SEC(".struct_ops.link") to make libbpf returns a real link. If we don't pin
the links before leaving bpftool, they will disappear. To instruct bpftool
to pin the links in a directory with the names of the maps, we need to
provide the path of that directory.
Signed-off-by: Kui-Feng Lee <kuifeng@meta.com>
Reviewed-by: Quentin Monnet <quentin@isovalent.com>
Link: https://lore.kernel.org/r/20230420002822.345222-1-kuifeng@meta.comSigned-off-by: Alexei Starovoitov <ast@kernel.org>

0232b788

selftests/bpf: Verify optval=NULL case · 833d67ec

Stanislav Fomichev authored Apr 18, 2023

Make sure we get optlen exported instead of getting EFAULT.
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20230418225343.553806-3-sdf@google.com

833d67ec

bpf: Don't EFAULT for getsockopt with optval=NULL · 00e74ae0

Stanislav Fomichev authored Apr 18, 2023

Some socket options do getsockopt with optval=NULL to estimate the size
of the final buffer (which is returned via optlen). This breaks BPF
getsockopt assumptions about permitted optval buffer size. Let's enforce
these assumptions only when non-NULL optval is provided.

Fixes: 0d01da6a ("bpf: implement getsockopt and setsockopt hooks")
Reported-by: Martin KaFai Lau <martin.lau@kernel.org>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/ZD7Js4fj5YyI2oLd@google.com/T/#mb68daf700f87a9244a15d01d00c3f0e5b08f49f7
Link: https://lore.kernel.org/bpf/20230418225343.553806-2-sdf@google.com

00e74ae0

selftests/xsk: Put MAP_HUGE_2MB in correct argument · 02e93e04

Magnus Karlsson authored Apr 21, 2023

Put the flag MAP_HUGE_2MB in the correct flags argument instead of the
wrong offset argument.

Fixes: 2ddade32 ("selftests/xsk: Fix munmap for hugepage allocated umem")
Reported-by: Kal Cutter Conley <kal.conley@dectris.com>
Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20230421062208.3772-1-magnus.karlsson@gmail.com

02e93e04

bpf: Fix bpf_refcount_acquire's refcount_t address calculation · 4ab07209

Dave Marchevsky authored Apr 21, 2023

When calculating the address of the refcount_t struct within a local
kptr, bpf_refcount_acquire_impl should add refcount_off bytes to the
address of the local kptr. Due to some missing parens, the function is
incorrectly adding sizeof(refcount_t) * refcount_off bytes. This patch
fixes the calculation.

Due to the incorrect calculation, bpf_refcount_acquire_impl was trying
to refcount_inc some memory well past the end of local kptrs, resulting
in kasan and refcount complaints, as reported in [0]. In that thread,
Florian and Eduard discovered that bpf selftests written in the new
style - with __success and an expected __retval, specifically - were
not actually being run. As a result, selftests added in bpf_refcount
series weren't really exercising this behavior, and thus didn't unearth
the bug.

With this fixed behavior it's safe to revert commit 7c4b96c0
("selftests/bpf: disable program test run for progs/refcounted_kptr.c"),
this patch does so.

  [0] https://lore.kernel.org/bpf/ZEEp+j22imoN6rn9@strlen.de/

Fixes: 7c50b1cb ("bpf: Add bpf_refcount_acquire kfunc")
Reported-by: Florian Westphal <fw@strlen.de>
Reported-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/bpf/20230421074431.3548349-1-davemarchevsky@fb.com

4ab07209

bpf: Fix race between btf_put and btf_idr walk. · acf1c3d6

Alexei Starovoitov authored Apr 20, 2023

Florian and Eduard reported hard dead lock:
[   58.433327]  _raw_spin_lock_irqsave+0x40/0x50
[   58.433334]  btf_put+0x43/0x90
[   58.433338]  bpf_find_btf_id+0x157/0x240
[   58.433353]  btf_parse_fields+0x921/0x11c0

This happens since btf->refcount can be 1 at the time of btf_put() and
btf_put() will call btf_free_id() which will try to grab btf_idr_lock
and will dead lock.
Avoid the issue by doing btf_put() without locking.

Fixes: 3d78417b ("bpf: Add bpf_btf_find_by_name_kind() helper.")
Fixes: 1e89106d ("bpf: Add bpf_core_add_cands() and wire it into bpf_core_apply_relo_insn().")
Reported-by: Florian Westphal <fw@strlen.de>
Reported-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/bpf/20230421014901.70908-1-alexei.starovoitov@gmail.com

acf1c3d6

20 Apr, 2023 9 commits

Merge branch 'fix __retval() being always ignored' · 267a6e4e

Alexei Starovoitov authored Apr 20, 2023

Eduard Zingerman says:

====================

Florian Westphal found a bug in test_loader.c processing of __retval tag.
Because of this bug the function test_loader.c:do_prog_test_run()
never executed and all __retval test tags were ignored. See [1].

Fix for this bug uncovers two additional bugs:
- During test_verifier tests migration to inline assembly (see [2])
  I missed the fact that some tests require maps to contain mock values;
- Some issue with a new refcounted_kptr test, which causes kernel to
  produce dead lock and refcount saturation warnings when subject to
  libbpf's bpf_test_run_opts().

This series fixes the bug in __retval() processing, and address the
issue with test maps not being populated. The issue in refcounted_kptr
is not addressed, __retval tags in those tests are commented out.

I found that the following tests depend on test maps being populated:
- progs/verifier_array_access.c
- verifier/value_ptr_arith.c (planned for migration to inline assembly)

Given the small amount of these tests I decided to opt for simple
non-generic solution (see patch #4).

[1] https://lore.kernel.org/bpf/f4c4aee644425842ee6aa8edf1da68f0a8260e7c.camel@gmail.com/T/
[2] https://lore.kernel.org/bpf/20230325025524.144043-1-eddyz87@gmail.com/
====================
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

267a6e4e

selftests/bpf: populate map_array_ro map for verifier_array_access test · cbb110bc

Eduard Zingerman authored Apr 21, 2023

Two test cases:
- "valid read map access into a read-only array 1" and
- "valid read map access into a read-only array 2"

Expect that map_array_ro map is filled with mock data. This logic was
not taken into acount during initial test conversion.

This commit modifies prog_tests/verifier.c entry point for this test
to fill the map.

Fixes: a3c830ae ("selftests/bpf: verifier/array_access.c converted to inline assembly")
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20230420232317.2181776-5-eddyz87@gmail.comSigned-off-by: Alexei Starovoitov <ast@kernel.org>

cbb110bc

selftests/bpf: add pre bpf_prog_test_run_opts() callback for test_loader · 5b22f4d1

Eduard Zingerman authored Apr 21, 2023

When a test case is annotated with __retval tag the test_loader engine
would use libbpf's bpf_prog_test_run_opts() to do a test run of the
program and compare retvals.

This commit allows to perform arbitrary actions on bpf object right
before test loader invokes bpf_prog_test_run_opts(). This could be
used to setup some state for program execution, e.g. fill some maps.
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20230420232317.2181776-4-eddyz87@gmail.comSigned-off-by: Alexei Starovoitov <ast@kernel.org>

5b22f4d1

selftests/bpf: fix __retval() being always ignored · 7cdddb99

Eduard Zingerman authored Apr 21, 2023

Florian Westphal found a bug in and suggested a fix for test_loader.c
processing of __retval tag. Because of this bug the function
test_loader.c:do_prog_test_run() never executed and all __retval test
tags were ignored.

If this bug is fixed a number of test cases from
progs/verifier_array_access.c fail with retval not matching the
expected value. This test was recently converted to use test_loader.c
and inline assembly in [1]. When doing the conversion I missed the
important detail of test_verifier.c operation: when it creates
fixup_map_array_ro, fixup_map_array_wo and fixup_map_array_small it
populates these maps with a dummy record.

Disabling the __retval checks for the affected verifier_array_access
in this commit to avoid false-postivies in any potential bisects.
The issue is addressed in the next patch.

I verified that the __retval tags are now respected by changing
expected return values for all tests annotated with __retval, and
checking that these tests started to fail.

[1] https://lore.kernel.org/bpf/20230325025524.144043-1-eddyz87@gmail.com/

Fixes: 19a8e06f ("selftests/bpf: Tests execution support for test_loader.c")
Reported-by: Florian Westphal <fw@strlen.de>
Link: https://lore.kernel.org/bpf/f4c4aee644425842ee6aa8edf1da68f0a8260e7c.camel@gmail.com/T/Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20230420232317.2181776-3-eddyz87@gmail.comSigned-off-by: Alexei Starovoitov <ast@kernel.org>

7cdddb99

selftests/bpf: disable program test run for progs/refcounted_kptr.c · 7c4b96c0

Eduard Zingerman authored Apr 21, 2023

Florian Westphal found a bug in test_loader.c processing of __retval
tag. Because of this bug the function test_loader.c:do_prog_test_run()
never executed and all __retval test tags were ignored. This hid an
issue with progs/refcounted_kptr.c tests.

When __retval tag bug is fixed and refcounted_kptr.c tests are run
kernel reports various issues and eventually hangs. Shortest reproducer
is the following command run a few times:

  $ for i in $(seq 1 4); do (./test_progs --allow=refcounted_kptr &); done

Commenting out __retval tags for these tests until this issue is resolved.
Reported-by: Florian Westphal <fw@strlen.de>
Link: https://lore.kernel.org/bpf/f4c4aee644425842ee6aa8edf1da68f0a8260e7c.camel@gmail.com/T/Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20230420232317.2181776-2-eddyz87@gmail.comSigned-off-by: Alexei Starovoitov <ast@kernel.org>

7c4b96c0

bpftool: Replace "__fallthrough" by a comment to address merge conflict · 4b7ef71a

Quentin Monnet authored Apr 20, 2023

The recent support for inline annotations in control flow graphs
generated by bpftool introduced the usage of the "__fallthrough" macro
in a switch/case block in btf_dumper.c. This change went through the
bpf-next tree, but resulted in a merge conflict in linux-next, because
this macro has been renamed "fallthrough" (no underscores) in the
meantime.

To address the conflict, we temporarily switch to a simple comment
instead of a macro.

Related: commit f7a858bf ("tools: Rename __fallthrough to fallthrough")

Fixes: 9fd49684 ("bpftool: Support inline annotations when dumping the CFG of a program")
Reported-by: Sven Schnelle <svens@linux.ibm.com>
Reported-by: Thomas Richter <tmricht@linux.ibm.com>
Suggested-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Quentin Monnet <quentin@isovalent.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/all/yt9dttxlwal7.fsf@linux.ibm.com/
Link: https://lore.kernel.org/bpf/20230412123636.2358949-1-tmricht@linux.ibm.com/
Link: https://lore.kernel.org/bpf/20230420003333.90901-1-quentin@isovalent.com

4b7ef71a

Merge branch 'Access variable length array relaxed for integer type' · 780c6983

Alexei Starovoitov authored Apr 19, 2023

Feng zhou says:

====================

From: Feng Zhou <zhoufeng.zf@bytedance.com>

Add support for integer type of accessing variable length array.
Add a selftest to check it.
====================
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

780c6983

selftests/bpf: Add test to access integer type of variable array · 5ff54ded

Feng Zhou authored Apr 20, 2023

Add prog test for accessing integer type of variable array in tracing
program.
In addition, hook load_balance function to access sd->span[0], only
to confirm whether the load is successful. Because there is no direct
way to trigger load_balance call.
Co-developed-by: Chengming Zhou <zhouchengming@bytedance.com>
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Signed-off-by: Feng Zhou <zhoufeng.zf@bytedance.com>
Link: https://lore.kernel.org/r/20230420032735.27760-3-zhoufeng.zf@bytedance.comSigned-off-by: Alexei Starovoitov <ast@kernel.org>

5ff54ded

bpf: support access variable length array of integer type · 2569c7b8

Feng Zhou authored Apr 20, 2023

After this commit:
bpf: Support variable length array in tracing programs (9c5f8a10)
Trace programs can access variable length array, but for structure
type. This patch adds support for integer type.

Example:
Hook load_balance
struct sched_domain {
	...
	unsigned long span[];
}

The access: sd->span[0].
Co-developed-by: Chengming Zhou <zhouchengming@bytedance.com>
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Signed-off-by: Feng Zhou <zhoufeng.zf@bytedance.com>
Link: https://lore.kernel.org/r/20230420032735.27760-2-zhoufeng.zf@bytedance.comSigned-off-by: Alexei Starovoitov <ast@kernel.org>

2569c7b8

19 Apr, 2023 1 commit

selftests/xsk: Fix munmap for hugepage allocated umem · 2ddade32

Magnus Karlsson authored Apr 18, 2023

Fix the unmapping of hugepage allocated umems so that they are
properly unmapped. The new test referred to in the fixes label,
introduced a test that allocated a umem that is not a multiple of a 2M
hugepage size. This is fine for mmap() that rounds the size up the
nearest multiple of 2M. But munmap() requires the size to be a
multiple of the hugepage size in order for it to unmap the region. The
current behaviour of not properly unmapping the umem, was discovered
when further additions of tests that require hugepages (unaligned mode
tests only) started failing as the system was running out of
hugepages.

Fixes: c0801598 ("selftests: xsk: Add test UNALIGNED_INV_DESC_4K1_FRAME_SIZE")
Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20230418143617.27762-1-magnus.karlsson@gmail.com

2ddade32

18 Apr, 2023 7 commits

Merge branch 'Provide bpf_for() and bpf_for_each() by libbpf' · 276dcdd1

Alexei Starovoitov authored Apr 18, 2023

Andrii Nakryiko says:

====================

This patch set moves bpf_for(), bpf_for_each(), and bpf_repeat() macros from
selftests-internal bpf_misc.h header to libbpf-provided bpf_helpers.h header.
To do this in a way to allow users to feature-detect and guard such
bpf_for()/bpf_for_each() uses on old kernels we also extend libbpf to improve
unresolved kfunc calls handling and reporting. This lets us mark
bpf_iter_num_{new,next,destroy}() declarations as __weak, and thus not fail
program loading outright if such kfuncs are missing on the host kernel.

Patches #1 and #2 do some simple clean ups and logging improvements. Patch #3
adds kfunc call poisoning and log fixup logic and is the hear of this patch
set, effectively. Patch #4 adds selftest for this logic. Patches #4 and #5
move bpf_for()/bpf_for_each()/bpf_repeat() into bpf_helpers.h header and mark
kfuncs as __weak to allow users to feature-detect and guard their uses.
====================
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

276dcdd1

libbpf: mark bpf_iter_num_{new,next,destroy} as __weak · 94dccba7

Andrii Nakryiko authored Apr 17, 2023

Mark bpf_iter_num_{new,next,destroy}() kfuncs declared for
bpf_for()/bpf_repeat() macros as __weak to allow users to feature-detect
their presence and guard bpf_for()/bpf_repeat() loops accordingly for
backwards compatibility with old kernels.

Now that libbpf supports kfunc calls poisoning and better reporting of
unresolved (but called) kfuncs, declaring number iterator kfuncs in
bpf_helpers.h won't degrade user experience and won't cause unnecessary
kernel feature dependencies.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20230418002148.3255690-7-andrii@kernel.orgSigned-off-by: Alexei Starovoitov <ast@kernel.org>

94dccba7

libbpf: move bpf_for(), bpf_for_each(), and bpf_repeat() into bpf_helpers.h · c5e64741

Andrii Nakryiko authored Apr 17, 2023

To make it easier for bleeding-edge BPF applications, such as sched_ext,
to utilize open-coded iterators, move bpf_for(), bpf_for_each(), and
bpf_repeat() macros from selftests/bpf-internal bpf_misc.h helper, to
libbpf-provided bpf_helpers.h header.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20230418002148.3255690-6-andrii@kernel.orgSigned-off-by: Alexei Starovoitov <ast@kernel.org>

c5e64741

selftests/bpf: add missing __weak kfunc log fixup test · 30bbfe32

Andrii Nakryiko authored Apr 17, 2023

Add test validating that libbpf correctly poisons and reports __weak
unresolved kfuncs in post-processed verifier log.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20230418002148.3255690-5-andrii@kernel.orgSigned-off-by: Alexei Starovoitov <ast@kernel.org>

30bbfe32

libbpf: improve handling of unresolved kfuncs · 05b6f766

Andrii Nakryiko authored Apr 17, 2023

Currently, libbpf leaves `call #0` instruction for __weak unresolved
kfuncs, which might lead to a confusing verifier log situations, where
invalid `call #0` will be treated as successfully validated.

We can do better. Libbpf already has an established mechanism of
poisoning instructions that failed some form of resolution (e.g., CO-RE
relocation and BPF map set to not be auto-created). Libbpf doesn't fail
them outright to allow users to guard them through other means, and as
long as BPF verifier can prove that such poisoned instructions cannot be
ever reached, this doesn't consistute an invalid BPF program. If user
didn't guard such code, libbpf will extract few pieces of information to
tie such poisoned instructions back to additional information about what
entitity wasn't resolved (e.g., BPF map name, or CO-RE relocation
information).

__weak unresolved kfuncs fit this model well, so this patch extends
libbpf with poisioning and log fixup logic for kfunc calls.

Note, this poisoning is done only for kfunc *calls*, not kfunc address
resolution (ldimm64 instructions). The former cannot be ever valid, if
reached, so it's safe to poison them. The latter is a valid mechanism to
check if __weak kfunc ksym was resolved, and do necessary guarding and
work arounds based on this result, supported in most recent kernels. As
such, libbpf keeps such ldimm64 instructions as loading zero, never
poisoning them.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20230418002148.3255690-4-andrii@kernel.orgSigned-off-by: Alexei Starovoitov <ast@kernel.org>

05b6f766

libbpf: report vmlinux vs module name when dealing with ksyms · f709160d

Andrii Nakryiko authored Apr 17, 2023

Currently libbpf always reports "kernel" as a source of ksym BTF type,
which is ambiguous given ksym's BTF can come from either vmlinux or
kernel module BTFs. Make this explicit and log module name, if used BTF
is from kernel module.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20230418002148.3255690-3-andrii@kernel.orgSigned-off-by: Alexei Starovoitov <ast@kernel.org>

f709160d

libbpf: misc internal libbpf clean ups around log fixup · 3055ddd6

Andrii Nakryiko authored Apr 17, 2023

Normalize internal constants, field names, and comments related to log
fixup. Also add explicit `ext_idx` alias for relocation where relocation
is pointing to extern description for additional information.

No functional changes, just a clean up before subsequent additions.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20230418002148.3255690-2-andrii@kernel.orgSigned-off-by: Alexei Starovoitov <ast@kernel.org>

3055ddd6

17 Apr, 2023 4 commits

selftests/bpf: Add a selftest for checking subreg equality · 49859de9

Yonghong Song authored Apr 17, 2023

Add a selftest to ensure subreg equality if source register
upper 32bit is 0. Without previous patch, the test will
fail verification.
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/r/20230417222139.360607-1-yhs@fb.comSigned-off-by: Alexei Starovoitov <ast@kernel.org>

49859de9

bpf: Improve verifier u32 scalar equality checking · 3be49f79

Yonghong Song authored Apr 17, 2023

In [1], I tried to remove bpf-specific codes to prevent certain
llvm optimizations, and add llvm TTI (target transform info) hooks
to prevent those optimizations. During this process, I found
if I enable llvm SimplifyCFG:shouldFoldTwoEntryPHINode
transformation, I will hit the following verification failure with selftests:

  ...
  8: (18) r1 = 0xffffc900001b2230       ; R1_w=map_value(off=560,ks=4,vs=564,imm=0)
  10: (61) r1 = *(u32 *)(r1 +0)         ; R1_w=scalar(umax=4294967295,var_off=(0x0; 0xffffffff))
  ; if (skb->tstamp == EGRESS_ENDHOST_MAGIC)
  11: (79) r2 = *(u64 *)(r6 +152)       ; R2_w=scalar() R6=ctx(off=0,imm=0)
  ; if (skb->tstamp == EGRESS_ENDHOST_MAGIC)
  12: (55) if r2 != 0xb9fbeef goto pc+10        ; R2_w=195018479
  13: (bc) w2 = w1                      ; R1_w=scalar(umax=4294967295,var_off=(0x0; 0xffffffff)) R2_w=scalar(umax=4294967295,var_off=(0x0; 0xffffffff))
  ; if (test < __NR_TESTS)
  14: (a6) if w1 < 0x9 goto pc+1 16: R0=2 R1_w=scalar(umax=8,var_off=(0x0; 0xf)) R2_w=scalar(umax=4294967295,var_off=(0x0; 0xffffffff)) R6=ctx(off=0,imm=0) R10=fp0
  ;
  16: (27) r2 *= 28                     ; R2_w=scalar(umax=120259084260,var_off=(0x0; 0x1ffffffffc),s32_max=2147483644,u32_max=-4)
  17: (18) r3 = 0xffffc900001b2118      ; R3_w=map_value(off=280,ks=4,vs=564,imm=0)
  19: (0f) r3 += r2                     ; R2_w=scalar(umax=120259084260,var_off=(0x0; 0x1ffffffffc),s32_max=2147483644,u32_max=-4) R3_w=map_value(off=280,ks=4,vs=564,umax=120259084260,var_off=(0x0; 0x1ffffffffc),s32_max=2147483644,u32_max=-4)
  20: (61) r2 = *(u32 *)(r3 +0)
  R3 unbounded memory access, make sure to bounds check any such access
  processed 97 insns (limit 1000000) max_states_per_insn 1 total_states 10 peak_states 10 mark_read 6
  -- END PROG LOAD LOG --
  libbpf: prog 'ingress_fwdns_prio100': failed to load: -13
  libbpf: failed to load object 'test_tc_dtime'
  libbpf: failed to load BPF skeleton 'test_tc_dtime': -13
  ...

At insn 14, with condition 'w1 < 9', register r1 is changed from an arbitrary
u32 value to `scalar(umax=8,var_off=(0x0; 0xf))`. Register r2, however, remains
as an arbitrary u32 value. Current verifier won't claim r1/r2 equality if
the previous mov is alu32 ('w2 = w1').

If r1 upper 32bit value is not 0, we indeed cannot clamin r1/r2 equality
after 'w2 = w1'. But in this particular case, we know r1 upper 32bit value
is 0, so it is safe to claim r1/r2 equality. This patch exactly did this.
For a 32bit subreg mov, if the src register upper 32bit is 0,
it is okay to claim equality between src and dst registers.

With this patch, the above verification sequence becomes

  ...
  8: (18) r1 = 0xffffc9000048e230       ; R1_w=map_value(off=560,ks=4,vs=564,imm=0)
  10: (61) r1 = *(u32 *)(r1 +0)         ; R1_w=scalar(umax=4294967295,var_off=(0x0; 0xffffffff))
  ; if (skb->tstamp == EGRESS_ENDHOST_MAGIC)
  11: (79) r2 = *(u64 *)(r6 +152)       ; R2_w=scalar() R6=ctx(off=0,imm=0)
  ; if (skb->tstamp == EGRESS_ENDHOST_MAGIC)
  12: (55) if r2 != 0xb9fbeef goto pc+10        ; R2_w=195018479
  13: (bc) w2 = w1                      ; R1_w=scalar(id=6,umax=4294967295,var_off=(0x0; 0xffffffff)) R2_w=scalar(id=6,umax=4294967295,var_off=(0x0; 0xffffffff))
  ; if (test < __NR_TESTS)
  14: (a6) if w1 < 0x9 goto pc+1        ; R1_w=scalar(id=6,umin=9,umax=4294967295,var_off=(0x0; 0xffffffff))
  ...
  from 14 to 16: R0=2 R1_w=scalar(id=6,umax=8,var_off=(0x0; 0xf)) R2_w=scalar(id=6,umax=8,var_off=(0x0; 0xf)) R6=ctx(off=0,imm=0) R10=fp0
  16: (27) r2 *= 28                     ; R2_w=scalar(umax=224,var_off=(0x0; 0xfc))
  17: (18) r3 = 0xffffc9000048e118      ; R3_w=map_value(off=280,ks=4,vs=564,imm=0)
  19: (0f) r3 += r2
  20: (61) r2 = *(u32 *)(r3 +0)         ; R2_w=scalar(umax=4294967295,var_off=(0x0; 0xffffffff)) R3_w=map_value(off=280,ks=4,vs=564,umax=224,var_off=(0x0; 0xfc),s32_max=252,u32_max=252)
  ...

and eventually the bpf program can be verified successfully.

  [1] https://reviews.llvm.org/D147968Signed-off-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/r/20230417222134.359714-1-yhs@fb.comSigned-off-by: Alexei Starovoitov <ast@kernel.org>

3be49f79

bpf: lirc program type should not require SYS_CAP_ADMIN · 69a8c792

Sean Young authored Apr 17, 2023

Make it possible to load lirc program type with just CAP_BPF. There is
nothing exceptional about lirc programs that means they require
SYS_CAP_ADMIN.

In order to attach or detach a lirc program type you need permission to
open /dev/lirc0; if you have permission to do that, you can alter all
sorts of lirc receiving options. Changing the IR protocol decoder is no
different.

Right now on a typical distribution /dev/lirc devices are only
read/write by root. Ideally we would make them group read/write like
other devices so that local users can use them without becoming root.
Signed-off-by: Sean Young <sean@mess.org>
Link: https://lore.kernel.org/r/ZD0ArKpwnDBJZsrE@gofer.mess.orgSigned-off-by: Alexei Starovoitov <ast@kernel.org>

69a8c792

bpf: Set skb redirect and from_ingress info in __bpf_tx_skb · 59e498a3

Daniel Borkmann authored Apr 17, 2023

There are some use-cases where it is desirable to use bpf_redirect()
in combination with ifb device, which currently is not supported, for
example, around filtering inbound traffic with BPF to then push it to
ifb which holds the qdisc for shaping in contrast to doing that on the
egress device.

Toke mentions the following case related to OpenWrt:

Because there's not always a single egress on the other side. These are
mainly home routers, which tend to have one or more WiFi devices bridged
to one or more ethernet ports on the LAN side, and a single upstream WAN
port. And the objective is to control the total amount of traffic going
over the WAN link (in both directions), to deal with bufferbloat in the
ISP network (which is sadly still all too prevalent).

In this setup, the traffic can be split arbitrarily between the links
on the LAN side, and the only "single bottleneck" is the WAN link. So we
install both egress and ingress shapers on this, configured to something
like 95-98% of the true link bandwidth, thus moving the queues into the
qdisc layer in the router. It's usually necessary to set the ingress
bandwidth shaper a bit lower than the egress due to being "downstream"
of the bottleneck link, but it does work surprisingly well.

We usually use something like a matchall filter to put all ingress
traffic on the ifb, so doing the redirect from BPF has not been an
immediate requirement thus far. However, it does seem a bit odd that
this is not possible, and we do have a BPF-based filter that layers on
top of this kind of setup, which currently uses u32 as the ingress
filter and so it could presumably be improved to use BPF instead if
that was available.
Reported-by: Toke Høiland-Jørgensen <toke@redhat.com>
Reported-by: Yafang Shao <laoar.shao@gmail.com>
Reported-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yafang Shao <laoar.shao@gmail.com>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://git.openwrt.org/?p=project/qosify.git;a=blob;f=README
Link: https://lore.kernel.org/bpf/875y9yzbuy.fsf@toke.dk
Link: https://lore.kernel.org/r/8cebc8b2b6e967e10cbafe2ffd6795050e74accd.1681739137.git.daniel@iogearbox.netSigned-off-by: Alexei Starovoitov <ast@kernel.org>

59e498a3

16 Apr, 2023 10 commits

Merge branch 'Remove KF_KPTR_GET kfunc flag' · d40f4f68

Alexei Starovoitov authored Apr 16, 2023

David Vernet says:

====================

We've managed to improve the UX for kptrs significantly over the last 9
months. All of the existing use cases which previously had KF_KPTR_GET
kfuncs (struct bpf_cpumask *, struct task_struct *, and struct cgroup *)
have all been updated to be synchronized using RCU. In other words,
their KF_KPTR_GET kfuncs have been removed in favor of KF_RCU |
KF_ACQUIRE kfuncs, with the pointers themselves also being readable from
maps in an RCU read region thanks to the types being RCU safe.

While KF_KPTR_GET was a logical starting point for kptrs, it's become
clear that they're not the correct abstraction. KF_KPTR_GET is a flag
that essentially does nothing other than enforcing that the argument to
a function is a pointer to a referenced kptr map value. At first glance,
that's a useful thing to guarantee to a kfunc. It gives kfuncs the
ability to try and acquire a reference on that kptr without requiring
the BPF prog to do something like this:

struct kptr_type *in_map, *new = NULL;

in_map = bpf_kptr_xchg(&map->value, NULL);
if (in_map) {
	new = bpf_kptr_type_acquire(in_map);
	in_map = bpf_kptr_xchg(&map->value, in_map);
	if (in_map)
		bpf_kptr_type_release(in_map);
}

That's clearly a pretty ugly (and racy) UX, and if using KF_KPTR_GET is
the only alternative, it's better than nothing. However, the problem
with any KF_KPTR_GET kfunc lies in the fact that it always requires some
kind of synchronization in order to safely do an opportunistic acquire
of the kptr in the map. This is because a BPF program running on another
CPU could do a bpf_kptr_xchg() on that map value, and free the kptr
after it's been read by the KF_KPTR_GET kfunc. For example, the
now-removed bpf_task_kptr_get() kfunc did the following:

struct task_struct *bpf_task_kptr_get(struct task_struct **pp)
{
	    struct task_struct *p;

	rcu_read_lock();
	p = READ_ONCE(*pp);
	/* If p is non-NULL, it could still be freed by another CPU,
 	 * so we have to do an opportunistic refcount_inc_not_zero()
	 * and return NULL if the task will be freed after the
	 * current RCU read region.
	 */
	|f (p && !refcount_inc_not_zero(&p->rcu_users))
		p = NULL;
	rcu_read_unlock();

	return p;
}

In other words, the kfunc uses RCU to ensure that the task remains valid
after it's been peeked from the map. However, this is completely
redundant with just defining a KF_RCU kfunc that itself does a
refcount_inc_not_zero(), which is exactly what bpf_task_acquire() now
does.

So, the question of whether KF_KPTR_GET is useful is actually, "Are
there any synchronization mechanisms / safety flags that are required by
certain kptrs, but which are not provided by the verifier to kfuncs?"
The answer to that question today is "No", because every kptr we
currently care about is RCU protected.

Even if the answer ever became "yes", the proper way to support that
referenced kptr type would be to add support for whatever
synchronization mechanism it requires in the verifier, rather than
giving kfuncs a flag that says, "Here's a pointer to a referenced kptr
in a map, do whatever you need to do."

With all that said -- so as to allow us to consolidate the kfunc API,
and simplify the verifier, this patchset removes the KF_KPTR_GET kfunc
flag.
---

This is v2 of this patchset

v1: https://lore.kernel.org/all/20230415103231.236063-1-void@manifault.com/

Changelog:
----------

v1 -> v2:
- Fix KF_RU -> KF_RCU typo in commit summary for patch 2/3, and in cover
  letter (Alexei)
- In order to reduce churn, don't shift all KF_* flags down by 1. We'll
  just fill the now-empty slot the next time we add a flag (Alexei)
====================
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

d40f4f68

bpf,docs: Remove KF_KPTR_GET from documentation · 530474e6

David Vernet authored Apr 16, 2023

A prior patch removed KF_KPTR_GET from the kernel. Now that it's no
longer accessible to kfunc authors, this patch removes it from the BPF
kfunc documentation.
Signed-off-by: David Vernet <void@manifault.com>
Link: https://lore.kernel.org/r/20230416084928.326135-4-void@manifault.comSigned-off-by: Alexei Starovoitov <ast@kernel.org>

530474e6

bpf: Remove KF_KPTR_GET kfunc flag · 7b4ddf39

David Vernet authored Apr 16, 2023

We've managed to improve the UX for kptrs significantly over the last 9
months. All of the existing use cases which previously had KF_KPTR_GET
kfuncs (struct bpf_cpumask *, struct task_struct *, and struct cgroup *)
have all been updated to be synchronized using RCU. In other words,
their KF_KPTR_GET kfuncs have been removed in favor of KF_RCU |
KF_ACQUIRE kfuncs, with the pointers themselves also being readable from
maps in an RCU read region thanks to the types being RCU safe.

While KF_KPTR_GET was a logical starting point for kptrs, it's become
clear that they're not the correct abstraction. KF_KPTR_GET is a flag
that essentially does nothing other than enforcing that the argument to
a function is a pointer to a referenced kptr map value. At first glance,
that's a useful thing to guarantee to a kfunc. It gives kfuncs the
ability to try and acquire a reference on that kptr without requiring
the BPF prog to do something like this:

struct kptr_type *in_map, *new = NULL;

in_map = bpf_kptr_xchg(&map->value, NULL);
if (in_map) {
        new = bpf_kptr_type_acquire(in_map);
        in_map = bpf_kptr_xchg(&map->value, in_map);
        if (in_map)
                bpf_kptr_type_release(in_map);
}

That's clearly a pretty ugly (and racy) UX, and if using KF_KPTR_GET is
the only alternative, it's better than nothing. However, the problem
with any KF_KPTR_GET kfunc lies in the fact that it always requires some
kind of synchronization in order to safely do an opportunistic acquire
of the kptr in the map. This is because a BPF program running on another
CPU could do a bpf_kptr_xchg() on that map value, and free the kptr
after it's been read by the KF_KPTR_GET kfunc. For example, the
now-removed bpf_task_kptr_get() kfunc did the following:

struct task_struct *bpf_task_kptr_get(struct task_struct **pp)
{
            struct task_struct *p;

        rcu_read_lock();
        p = READ_ONCE(*pp);
        /* If p is non-NULL, it could still be freed by another CPU,
         * so we have to do an opportunistic refcount_inc_not_zero()
         * and return NULL if the task will be freed after the
         * current RCU read region.
         */
        |f (p && !refcount_inc_not_zero(&p->rcu_users))
                p = NULL;
        rcu_read_unlock();

        return p;
}

In other words, the kfunc uses RCU to ensure that the task remains valid
after it's been peeked from the map. However, this is completely
redundant with just defining a KF_RCU kfunc that itself does a
refcount_inc_not_zero(), which is exactly what bpf_task_acquire() now
does.

So, the question of whether KF_KPTR_GET is useful is actually, "Are
there any synchronization mechanisms / safety flags that are required by
certain kptrs, but which are not provided by the verifier to kfuncs?"
The answer to that question today is "No", because every kptr we
currently care about is RCU protected.

Even if the answer ever became "yes", the proper way to support that
referenced kptr type would be to add support for whatever
synchronization mechanism it requires in the verifier, rather than
giving kfuncs a flag that says, "Here's a pointer to a referenced kptr
in a map, do whatever you need to do."

With all that said -- so as to allow us to consolidate the kfunc API,
and simplify the verifier a bit, this patch removes KF_KPTR_GET, and all
relevant logic from the verifier.
Signed-off-by: David Vernet <void@manifault.com>
Link: https://lore.kernel.org/r/20230416084928.326135-3-void@manifault.comSigned-off-by: Alexei Starovoitov <ast@kernel.org>

7b4ddf39

bpf: Remove bpf_kfunc_call_test_kptr_get() test kfunc · 09b501d9

David Vernet authored Apr 16, 2023

We've managed to improve the UX for kptrs significantly over the last 9
months. All of the prior main use cases, struct bpf_cpumask *, struct
task_struct *, and struct cgroup *, have all been updated to be
synchronized mainly using RCU. In other words, their KF_ACQUIRE kfunc
calls are all KF_RCU, and the pointers themselves are MEM_RCU and can be
accessed in an RCU read region in BPF.

In a follow-on change, we'll be removing the KF_KPTR_GET kfunc flag.
This patch prepares for that by removing the
bpf_kfunc_call_test_kptr_get() kfunc, and all associated selftests.
Signed-off-by: David Vernet <void@manifault.com>
Link: https://lore.kernel.org/r/20230416084928.326135-2-void@manifault.comSigned-off-by: Alexei Starovoitov <ast@kernel.org>

09b501d9

Merge branch 'Shared ownership for local kptrs' · 7a0788fe

Alexei Starovoitov authored Apr 15, 2023

Dave Marchevsky says:

====================

This series adds support for refcounted local kptrs to the verifier. A local
kptr is 'refcounted' if its type contains a struct bpf_refcount field:

  struct refcounted_node {
    long data;
    struct bpf_list_node ll;
    struct bpf_refcount ref;
  };

bpf_refcount is used to implement shared ownership for local kptrs.

Motivating usecase
==================

If a struct has two collection node fields, e.g.:

  struct node {
    long key;
    long val;
    struct bpf_rb_node rb;
    struct bpf_list_node ll;
  };

It's not currently possible to add a node to both the list and rbtree:

  long bpf_prog(void *ctx)
  {
    struct node *n = bpf_obj_new(typeof(*n));
    if (!n) { /* ... */ }

    bpf_spin_lock(&lock);

    bpf_list_push_back(&head, &n->ll);
    bpf_rbtree_add(&root, &n->rb, less); /* Assume a resonable less() */
    bpf_spin_unlock(&lock);
  }

The above program will fail verification due to current owning / non-owning ref
logic: after bpf_list_push_back, n is a non-owning reference and thus cannot be
passed to bpf_rbtree_add. The only way to get an owning reference for the node
that was added is to bpf_list_pop_{front,back} it.

More generally, verifier ownership semantics expect that a node has one
owner (program, collection, or stashed in map) with exclusive ownership
of the node's lifetime. The owner free's the node's underlying memory when it
itself goes away.

Without a shared ownership concept it's impossible to express many real-world
usecases such that they pass verification.

Semantic Changes
================

Before this series, the verifier could make this statement: "whoever has the
owning reference has exclusive ownership of the referent's lifetime". As
demonstrated in the previous section, this implies that a BPF program can't
have an owning reference to some node if that node is in a collection. If
such a state were possible, the node would have multiple owners, each thinking
they have exclusive ownership. In order to support shared ownership it's
necessary to modify the exclusive ownership semantic.

After this series' changes, an owning reference has ownership of the referent's
lifetime, but it's not necessarily exclusive. The referent's underlying memory
is guaranteed to be valid (i.e. not free'd) until the reference is dropped or
used for collection insert.

This change doesn't affect UX of owning or non-owning references much:

  * insert kfuncs (bpf_rbtree_add, bpf_list_push_{front,back}) still require
    an owning reference arg, as ownership still must be passed to the
    collection in a shared-ownership world.

  * non-owning references still refer to valid memory without claiming
    any ownership.

One important conclusion that followed from "exclusive ownership" statement
is no longer valid, though. In exclusive-ownership world, if a BPF prog has
an owning reference to a node, the verifier can conclude that no collection has
ownership of it. This conclusion was used to avoid runtime checking in the
implementations of insert and remove operations (""has the node already been
{inserted, removed}?").

In a shared-ownership world the aforementioned conclusion is no longer valid,
which necessitates doing runtime checking in insert and remove operation
kfuncs, and those functions possibly failing to insert or remove anything.

Luckily the verifier changes necessary to go from exclusive to shared ownership
were fairly minimal. Patches in this series which do change verifier semantics
generally have some summary dedicated to explaining why certain usecases
Just Work for shared ownership without verifier changes.

Implementation
==============

The changes in this series can be categorized as follows:

  * struct bpf_refcount opaque field + plumbing
  * support for refcounted kptrs in bpf_obj_new and bpf_obj_drop
  * bpf_refcount_acquire kfunc
    * enables shared ownershp by bumping refcount + acquiring owning ref
  * support for possibly-failing collection insertion and removal
    * insertion changes are more complex

If a patch's changes have some nuance to their effect - or lack of effect - on
verifier behavior, the patch summary talks about it at length.

Patch contents:
  * Patch 1 removes btf_field_offs struct
  * Patch 2 adds struct bpf_refcount and associated plumbing
  * Patch 3 modifies semantics of bpf_obj_drop and bpf_obj_new to handle
    refcounted kptrs
  * Patch 4 adds bpf_refcount_acquire
  * Patches 5-7 add support for possibly-failing collection insert and remove
  * Patch 8 centralizes constructor-like functionality for local kptr types
  * Patch 9 adds tests for new functionality

base-commit: 4a1e885c

Changelog:

v1 -> v2: lore.kernel.org/bpf/20230410190753.2012798-1-davemarchevsky@fb.com

Patch #s used below refer to the patch's position in v1 unless otherwise
specified.

  * General
    * Rebase onto latest bpf-next (base-commit updated above)

  * Patch 4 - "bpf: Add bpf_refcount_acquire kfunc"
    * Fix typo in summary (Alexei)
  * Patch 7 - "Migrate bpf_rbtree_remove to possibly fail"
    * Modify a paragraph in patch summary to more clearly state that only
      bpf_rbtree_remove's non-owning ref clobbering behavior is changed by the
      patch (Alexei)
    * refcount_off == -1 -> refcount_off < 0  in "node type w/ both list
      and rb_node fields" check, since any negative value means "no
      bpf_refcount field found", and furthermore refcount_off is never
      explicitly set to -1, but rather -EINVAL. (Alexei)
    * Instead of just changing "btf: list_node and rb_node in same struct" test
      expectation to pass instead of fail, do some refactoring to test both
      "list_node, rb_node, and bpf_refcount" (success) and "list_node, rb_node,
      _no_ bpf_refcount" (failure) cases. This ensures that logic change in
      previous bullet point is correct.
      * v1's "btf: list_node and rb_node in same struct" test changes didn't
        add bpf_refcount, so the fact that btf load succeeded w/ list and
        rb_nodes but no bpf_refcount field is further proof that this logic
        was incorrect in v1.
  * Patch 8 - "bpf: Centralize btf_field-specific initialization logic"
    * Instead of doing __init_field_infer_size in kfuncs when taking
      bpf_list_head type input which might've been 0-initialized in map, go
      back to simple oneliner initialization. Add short comment explaining why
      this is necessary. (Alexei)
  * Patch 9 - "selftests/bpf: Add refcounted_kptr tests"
    * Don't __always_inline helper fns in progs/refcounted_kptr.c (Alexei)
====================
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

7a0788fe

selftests/bpf: Add refcounted_kptr tests · 6147f151

Dave Marchevsky authored Apr 15, 2023

Test refcounted local kptr functionality added in previous patches in
the series.

Usecases which pass verification:

* Add refcounted local kptr to both tree and list. Then, read and -
  possibly, depending on test variant - delete from tree, then list.
  * Also test doing read-and-maybe-delete in opposite order
* Stash a refcounted local kptr in a map_value, then add it to a
  rbtree. Read from both, possibly deleting after tree read.
* Add refcounted local kptr to both tree and list. Then, try reading and
  deleting twice from one of the collections.
* bpf_refcount_acquire of just-added non-owning ref should work, as
  should bpf_refcount_acquire of owning ref just out of bpf_obj_new

Usecases which fail verification:

* The simple successful bpf_refcount_acquire cases from above should
  both fail to verify if the newly-acquired owning ref is not dropped
Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Link: https://lore.kernel.org/r/20230415201811.343116-10-davemarchevsky@fb.comSigned-off-by: Alexei Starovoitov <ast@kernel.org>

6147f151

bpf: Centralize btf_field-specific initialization logic · 3e81740a

Dave Marchevsky authored Apr 15, 2023

All btf_fields in an object are 0-initialized by memset in
bpf_obj_init. This might not be a valid initial state for some field
types, in which case kfuncs that use the type will properly initialize
their input if it's been 0-initialized. Some BPF graph collection types
and kfuncs do this: bpf_list_{head,node} and bpf_rb_node.

An earlier patch in this series added the bpf_refcount field, for which
the 0 state indicates that the refcounted object should be free'd.
bpf_obj_init treats this field specially, setting refcount to 1 instead
of relying on scattered "refcount is 0? Must have just been initialized,
let's set to 1" logic in kfuncs.

This patch extends this treatment to list and rbtree field types,
allowing most scattered initialization logic in kfuncs to be removed.

Note that bpf_{list_head,rb_root} may be inside a BPF map, in which case
they'll be 0-initialized without passing through the newly-added logic,
so scattered initialization logic must remain for these collection root
types.
Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Link: https://lore.kernel.org/r/20230415201811.343116-9-davemarchevsky@fb.comSigned-off-by: Alexei Starovoitov <ast@kernel.org>

3e81740a

bpf: Migrate bpf_rbtree_remove to possibly fail · 404ad75a

Dave Marchevsky authored Apr 15, 2023

This patch modifies bpf_rbtree_remove to account for possible failure
due to the input rb_node already not being in any collection.
The function can now return NULL, and does when the aforementioned
scenario occurs. As before, on successful removal an owning reference to
the removed node is returned.

Adding KF_RET_NULL to bpf_rbtree_remove's kfunc flags - now KF_RET_NULL |
KF_ACQUIRE - provides the desired verifier semantics:

  * retval must be checked for NULL before use
  * if NULL, retval's ref_obj_id is released
  * retval is a "maybe acquired" owning ref, not a non-owning ref,
    so it will live past end of critical section (bpf_spin_unlock), and
    thus can be checked for NULL after the end of the CS

BPF programs must add checks
============================

This does change bpf_rbtree_remove's verifier behavior. BPF program
writers will need to add NULL checks to their programs, but the
resulting UX looks natural:

  bpf_spin_lock(&glock);

  n = bpf_rbtree_first(&ghead);
  if (!n) { /* ... */}
  res = bpf_rbtree_remove(&ghead, &n->node);

  bpf_spin_unlock(&glock);

  if (!res)  /* Newly-added check after this patch */
    return 1;

  n = container_of(res, /* ... */);
  /* Do something else with n */
  bpf_obj_drop(n);
  return 0;

The "if (!res)" check above is the only addition necessary for the above
program to pass verification after this patch.

bpf_rbtree_remove no longer clobbers non-owning refs
====================================================

An issue arises when bpf_rbtree_remove fails, though. Consider this
example:

  struct node_data {
    long key;
    struct bpf_list_node l;
    struct bpf_rb_node r;
    struct bpf_refcount ref;
  };

  long failed_sum;

  void bpf_prog()
  {
    struct node_data *n = bpf_obj_new(/* ... */);
    struct bpf_rb_node *res;
    n->key = 10;

    bpf_spin_lock(&glock);

    bpf_list_push_back(&some_list, &n->l); /* n is now a non-owning ref */
    res = bpf_rbtree_remove(&some_tree, &n->r, /* ... */);
    if (!res)
      failed_sum += n->key;  /* not possible */

    bpf_spin_unlock(&glock);
    /* if (res) { do something useful and drop } ... */
  }

The bpf_rbtree_remove in this example will always fail. Similarly to
bpf_spin_unlock, bpf_rbtree_remove is a non-owning reference
invalidation point. The verifier clobbers all non-owning refs after a
bpf_rbtree_remove call, so the "failed_sum += n->key" line will fail
verification, and in fact there's no good way to get information about
the node which failed to add after the invalidation. This patch removes
non-owning reference invalidation from bpf_rbtree_remove to allow the
above usecase to pass verification. The logic for why this is now
possible is as follows:

Before this series, bpf_rbtree_add couldn't fail and thus assumed that
its input, a non-owning reference, was in the tree. But it's easy to
construct an example where two non-owning references pointing to the same
underlying memory are acquired and passed to rbtree_remove one after
another (see rbtree_api_release_aliasing in
selftests/bpf/progs/rbtree_fail.c).

So it was necessary to clobber non-owning refs to prevent this
case and, more generally, to enforce "non-owning ref is definitely
in some collection" invariant. This series removes that invariant and
the failure / runtime checking added in this patch provide a clean way
to deal with the aliasing issue - just fail to remove.

Because the aliasing issue prevented by clobbering non-owning refs is no
longer an issue, this patch removes the invalidate_non_owning_refs
call from verifier handling of bpf_rbtree_remove. Note that
bpf_spin_unlock - the other caller of invalidate_non_owning_refs -
clobbers non-owning refs for a different reason, so its clobbering
behavior remains unchanged.

No BPF program changes are necessary for programs to remain valid as a
result of this clobbering change. A valid program before this patch
passed verification with its non-owning refs having shorter (or equal)
lifetimes due to more aggressive clobbering.

Also, update existing tests to check bpf_rbtree_remove retval for NULL
where necessary, and move rbtree_api_release_aliasing from
progs/rbtree_fail.c to progs/rbtree.c since it's now expected to pass
verification.
Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Link: https://lore.kernel.org/r/20230415201811.343116-8-davemarchevsky@fb.comSigned-off-by: Alexei Starovoitov <ast@kernel.org>

404ad75a

selftests/bpf: Modify linked_list tests to work with macro-ified inserts · de67ba39

Dave Marchevsky authored Apr 15, 2023

The linked_list tests use macros and function pointers to reduce code
duplication. Earlier in the series, bpf_list_push_{front,back} were
modified to be macros, expanding to invoke actual kfuncs
bpf_list_push_{front,back}_impl. Due to this change, a code snippet
like:

  void (*p)(void *, void *) = (void *)&bpf_list_##op;
  p(hexpr, nexpr);

meant to do bpf_list_push_{front,back}(hexpr, nexpr), will no longer
work as it's no longer valid to do &bpf_list_push_{front,back} since
they're no longer functions.

This patch fixes issues of this type, along with two other minor changes
- one improvement and one fix - both related to the node argument to
list_push_{front,back}.

  * The fix: migration of list_push tests away from (void *, void *)
    func ptr uncovered that some tests were incorrectly passing pointer
    to node, not pointer to struct bpf_list_node within the node. This
    patch fixes such issues (CHECK(..., f) -> CHECK(..., &f->node))

  * The improvement: In linked_list tests, the struct foo type has two
    list_node fields: node and node2, at byte offsets 0 and 40 within
    the struct, respectively. Currently node is used in ~all tests
    involving struct foo and lists. The verifier needs to do some work
    to account for the offset of bpf_list_node within the node type, so
    using node2 instead of node exercises that logic more in the tests.
    This patch migrates linked_list tests to use node2 instead of node.
Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Link: https://lore.kernel.org/r/20230415201811.343116-7-davemarchevsky@fb.comSigned-off-by: Alexei Starovoitov <ast@kernel.org>

de67ba39

bpf: Migrate bpf_rbtree_add and bpf_list_push_{front,back} to possibly fail · d2dcc67d

Dave Marchevsky authored Apr 15, 2023

Consider this code snippet:

  struct node {
    long key;
    bpf_list_node l;
    bpf_rb_node r;
    bpf_refcount ref;
  }

  int some_bpf_prog(void *ctx)
  {
    struct node *n = bpf_obj_new(/*...*/), *m;

    bpf_spin_lock(&glock);

    bpf_rbtree_add(&some_tree, &n->r, /* ... */);
    m = bpf_refcount_acquire(n);
    bpf_rbtree_add(&other_tree, &m->r, /* ... */);

    bpf_spin_unlock(&glock);

    /* ... */
  }

After bpf_refcount_acquire, n and m point to the same underlying memory,
and that node's bpf_rb_node field is being used by the some_tree insert,
so overwriting it as a result of the second insert is an error. In order
to properly support refcounted nodes, the rbtree and list insert
functions must be allowed to fail. This patch adds such support.

The kfuncs bpf_rbtree_add, bpf_list_push_{front,back} are modified to
return an int indicating success/failure, with 0 -> success, nonzero ->
failure.

bpf_obj_drop on failure
=======================

Currently the only reason an insert can fail is the example above: the
bpf_{list,rb}_node is already in use. When such a failure occurs, the
insert kfuncs will bpf_obj_drop the input node. This allows the insert
operations to logically fail without changing their verifier owning ref
behavior, namely the unconditional release_reference of the input
owning ref.

With insert that always succeeds, ownership of the node is always passed
to the collection, since the node always ends up in the collection.

With a possibly-failed insert w/ bpf_obj_drop, ownership of the node
is always passed either to the collection (success), or to bpf_obj_drop
(failure). Regardless, it's correct to continue unconditionally
releasing the input owning ref, as something is always taking ownership
from the calling program on insert.

Keeping owning ref behavior unchanged results in a nice default UX for
insert functions that can fail. If the program's reaction to a failed
insert is "fine, just get rid of this owning ref for me and let me go
on with my business", then there's no reason to check for failure since
that's default behavior. e.g.:

  long important_failures = 0;

  int some_bpf_prog(void *ctx)
  {
    struct node *n, *m, *o; /* all bpf_obj_new'd */

    bpf_spin_lock(&glock);
    bpf_rbtree_add(&some_tree, &n->node, /* ... */);
    bpf_rbtree_add(&some_tree, &m->node, /* ... */);
    if (bpf_rbtree_add(&some_tree, &o->node, /* ... */)) {
      important_failures++;
    }
    bpf_spin_unlock(&glock);
  }

If we instead chose to pass ownership back to the program on failed
insert - by returning NULL on success or an owning ref on failure -
programs would always have to do something with the returned ref on
failure. The most likely action is probably "I'll just get rid of this
owning ref and go about my business", which ideally would look like:

  if (n = bpf_rbtree_add(&some_tree, &n->node, /* ... */))
    bpf_obj_drop(n);

But bpf_obj_drop isn't allowed in a critical section and inserts must
occur within one, so in reality error handling would become a
hard-to-parse mess.

For refcounted nodes, we can replicate the "pass ownership back to
program on failure" logic with this patch's semantics, albeit in an ugly
way:

  struct node *n = bpf_obj_new(/* ... */), *m;

  bpf_spin_lock(&glock);

  m = bpf_refcount_acquire(n);
  if (bpf_rbtree_add(&some_tree, &n->node, /* ... */)) {
    /* Do something with m */
  }

  bpf_spin_unlock(&glock);
  bpf_obj_drop(m);

bpf_refcount_acquire is used to simulate "return owning ref on failure".
This should be an uncommon occurrence, though.

Addition of two verifier-fixup'd args to collection inserts
===========================================================

The actual bpf_obj_drop kfunc is
bpf_obj_drop_impl(void *, struct btf_struct_meta *), with bpf_obj_drop
macro populating the second arg with 0 and the verifier later filling in
the arg during insn fixup.

Because bpf_rbtree_add and bpf_list_push_{front,back} now might do
bpf_obj_drop, these kfuncs need a btf_struct_meta parameter that can be
passed to bpf_obj_drop_impl.

Similarly, because the 'node' param to those insert functions is the
bpf_{list,rb}_node within the node type, and bpf_obj_drop expects a
pointer to the beginning of the node, the insert functions need to be
able to find the beginning of the node struct. A second
verifier-populated param is necessary: the offset of {list,rb}_node within the
node type.

These two new params allow the insert kfuncs to correctly call
__bpf_obj_drop_impl:

  beginning_of_node = bpf_rb_node_ptr - offset
  if (already_inserted)
    __bpf_obj_drop_impl(beginning_of_node, btf_struct_meta->record);

Similarly to other kfuncs with "hidden" verifier-populated params, the
insert functions are renamed with _impl prefix and a macro is provided
for common usage. For example, bpf_rbtree_add kfunc is now
bpf_rbtree_add_impl and bpf_rbtree_add is now a macro which sets
"hidden" args to 0.

Due to the two new args BPF progs will need to be recompiled to work
with the new _impl kfuncs.

This patch also rewrites the "hidden argument" explanation to more
directly say why the BPF program writer doesn't need to populate the
arguments with anything meaningful.

How does this new logic affect non-owning references?
=====================================================

Currently, non-owning refs are valid until the end of the critical
section in which they're created. We can make this guarantee because, if
a non-owning ref exists, the referent was added to some collection. The
collection will drop() its nodes when it goes away, but it can't go away
while our program is accessing it, so that's not a problem. If the
referent is removed from the collection in the same CS that it was added
in, it can't be bpf_obj_drop'd until after CS end. Those are the only
two ways to free the referent's memory and neither can happen until
after the non-owning ref's lifetime ends.

On first glance, having these collection insert functions potentially
bpf_obj_drop their input seems like it breaks the "can't be
bpf_obj_drop'd until after CS end" line of reasoning. But we care about
the memory not being _freed_ until end of CS end, and a previous patch
in the series modified bpf_obj_drop such that it doesn't free refcounted
nodes until refcount == 0. So the statement can be more accurately
rewritten as "can't be free'd until after CS end".

We can prove that this rewritten statement holds for any non-owning
reference produced by collection insert functions:

* If the input to the insert function is _not_ refcounted
  * We have an owning reference to the input, and can conclude it isn't
    in any collection
    * Inserting a node in a collection turns owning refs into
      non-owning, and since our input type isn't refcounted, there's no
      way to obtain additional owning refs to the same underlying
      memory
  * Because our node isn't in any collection, the insert operation
    cannot fail, so bpf_obj_drop will not execute
  * If bpf_obj_drop is guaranteed not to execute, there's no risk of
    memory being free'd

* Otherwise, the input to the insert function is refcounted
  * If the insert operation fails due to the node's list_head or rb_root
    already being in some collection, there was some previous successful
    insert which passed refcount to the collection
  * We have an owning reference to the input, it must have been
    acquired via bpf_refcount_acquire, which bumped the refcount
  * refcount must be >= 2 since there's a valid owning reference and the
    node is already in a collection
  * Insert triggering bpf_obj_drop will decr refcount to >= 1, never
    resulting in a free

So although we may do bpf_obj_drop during the critical section, this
will never result in memory being free'd, and no changes to non-owning
ref logic are needed in this patch.
Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Link: https://lore.kernel.org/r/20230415201811.343116-6-davemarchevsky@fb.comSigned-off-by: Alexei Starovoitov <ast@kernel.org>

d2dcc67d