1. 11 Mar, 2023 4 commits
    • Merge branch 'Support stashing local kptrs with bpf_kptr_xchg' · 49b5300f
      Alexei Starovoitov authored
      Dave Marchevsky says:
      
      ====================
      
      Local kptrs are kptrs allocated via bpf_obj_new with a type specified in program
      BTF. A BPF program which creates a local kptr has exclusive control of the
      lifetime of the kptr, and, prior to terminating, must:
      
        * free the kptr via bpf_obj_drop
        * If the kptr is a {list,rbtree} node, add the node to a {list, rbtree},
          thereby passing control of the lifetime to the collection
      
      This series adds a third option:
      
        * stash the kptr in a map value using bpf_kptr_xchg
      
      As indicated by the use of "stash" to describe this behavior, the intended use
      of this feature is temporary storage of local kptrs. For example, a sched_ext
      ([0]) scheduler may want to create an rbtree node for each new cgroup on cgroup
      init, but to add that node to the rbtree as part of a separate program which
      runs on enqueue. Stashing the node in a map_value allows its lifetime to outlive
      the execution of the cgroup_init program.
      
      Behavior:
      
      There is no semantic difference between adding a kptr to a graph collection and
      "stashing" it in a map. In both cases exclusive ownership of the kptr's lifetime
      is passed to some containing data structure, which is responsible for
      bpf_obj_drop'ing it when the container goes away.
      
      Since graph collections also expect exclusive ownership of the nodes they
      contain, graph nodes cannot be both stashed in a map_value and contained by
      their corresponding collection.
      
      Implementation:
      
      Two observations simplify the verifier changes for this feature. First, kptrs
      ("referenced kptrs" until a recent renaming) require registration of a
      dtor function as part of their acquire/release semantics, so that a referenced
      kptr which is placed in a map_value is properly released when the map goes away.
      We want this exact behavior for local kptrs, but with bpf_obj_drop as the dtor
      instead of a per-btf_id dtor.
      
      The second observation is that, in terms of identification, "referenced kptr"
      and "local kptr" already don't interfere with one another. Consider the
      following example:
      
        struct node_data {
                long key;
                long data;
                struct bpf_rb_node node;
        };
      
        struct map_value {
                struct node_data __kptr *node;
        };
      
        struct {
                __uint(type, BPF_MAP_TYPE_ARRAY);
                __type(key, int);
                __type(value, struct map_value);
                __uint(max_entries, 1);
        } some_nodes SEC(".maps");
      
        struct map_value *mapval;
        struct node_data *res;
        int key = 0;
      
        res = bpf_obj_new(typeof(*res));
        if (!res) { /* err handling */ }
      
        mapval = bpf_map_lookup_elem(&some_nodes, &key);
        if (!mapval) { /* err handling */ }
      
        res = bpf_kptr_xchg(&mapval->node, res);
        if (res)
                bpf_obj_drop(res);
      
      The __kptr tag identifies map_value's node as a referenced kptr, while the
      PTR_TO_BTF_ID which bpf_obj_new returns - a type in some non-vmlinux,
      non-module BTF - identifies res as a local kptr. The type tag on the
      pointer indicates a referenced kptr, while the type of the pointee
      indicates a local kptr. So, using existing facilities, we can tell the
      verifier about a "referenced kptr" pointer to a "local kptr" pointee.
      
      When kptr_xchg'ing a kptr into a map_value, the verifier can recognize local
      kptr types and treat them like referenced kptrs with a properly-typed
      bpf_obj_drop as a dtor.
      
      Other implementation notes:
        * We don't need to do anything special to enforce "graph nodes cannot be
          both stashed in a map_value and contained by their corresponding collection"
          * bpf_kptr_xchg both returns and takes as input a (possibly-null) owning
            reference. It does not accept non-owning references as input by virtue
            of requiring a ref_obj_id. By definition, if a program has an owning
            ref to a node, the node isn't in a collection, so it's safe to pass
            ownership via bpf_kptr_xchg.
      
      Summary of patches:
      
        * Patch 1 modifies BTF plumbing to support using bpf_obj_drop as a dtor
        * Patch 2 adds verifier plumbing to support MEM_ALLOC-flagged param for
          bpf_kptr_xchg
        * Patch 3 adds selftests exercising the new behavior
      
      Changelog:
      
      v1 -> v2: https://lore.kernel.org/bpf/20230309180111.1618459-1-davemarchevsky@fb.com/
      
      Patch #s used below refer to the patch's position in v1 unless otherwise
      specified.
      
      Patches 1-3 were applied and are not included in v2.
      Rebase onto latest bpf-next: "libbpf: Revert poisoning of strlcpy"
      
      Patch 4: "bpf: Support __kptr to local kptrs"
        * Remove !btf_is_kernel(btf) check, WARN_ON_ONCE instead (Alexei)
      
      Patch 6: "selftests/bpf: Add local kptr stashing test"
        * Add test which stashes 2 nodes and later unstashes one of them using a
          separate BPF program (Alexei)
        * Fix incorrect runner subtest name for original test (was
          "rbtree_add_nodes")
      ====================
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • selftests/bpf: Add local kptr stashing test · 5d8d6634
      Dave Marchevsky authored
      Add a new selftest, local_kptr_stash, which uses bpf_kptr_xchg to stash
      a bpf_obj_new-allocated object in a map. Test the following scenarios:
      
        * Stash two rb_nodes in an arraymap, don't unstash them, rely on map
          free to destruct them
        * Stash two rb_nodes in an arraymap, unstash the second one in a
          separate program, rely on map free to destruct first
      Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
      Link: https://lore.kernel.org/r/20230310230743.2320707-4-davemarchevsky@fb.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: Allow local kptrs to be exchanged via bpf_kptr_xchg · 738c96d5
      Dave Marchevsky authored
      The previous patch added necessary plumbing for verifier and runtime to
      know what to do with non-kernel PTR_TO_BTF_IDs in map values, but didn't
      provide any way to get such local kptrs into a map value. This patch
      modifies verifier handling of bpf_kptr_xchg to allow MEM_ALLOC kptr
      types.
      
      check_reg_type is modified to accept MEM_ALLOC-flagged input to
      bpf_kptr_xchg despite such types not being in btf_ptr_types. This could
      have been done with a MAYBE_MEM_ALLOC equivalent to MAYBE_NULL, but
      bpf_kptr_xchg is the only helper that I can foresee using
      MAYBE_MEM_ALLOC, so keep it special-cased for now.
      
      The verifier tags bpf_kptr_xchg retval MEM_ALLOC if and only if the BTF
      associated with the retval is not kernel BTF.
      Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
      Link: https://lore.kernel.org/r/20230310230743.2320707-3-davemarchevsky@fb.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: Support __kptr to local kptrs · c8e18754
      Dave Marchevsky authored
      If a PTR_TO_BTF_ID type comes from program BTF - not vmlinux or module
      BTF - it must have been allocated by bpf_obj_new and therefore must be
      free'd with bpf_obj_drop. Such a PTR_TO_BTF_ID is considered a "local
      kptr" and is tagged with MEM_ALLOC type tag by bpf_obj_new.
      
      This patch adds support for treating __kptr-tagged pointers to "local
      kptrs" as having an implicit bpf_obj_drop destructor for referenced kptr
      acquire / release semantics. Consider the following example:
      
        struct node_data {
                long key;
                long data;
                struct bpf_rb_node node;
        };
      
        struct map_value {
                struct node_data __kptr *node;
        };
      
        struct {
                __uint(type, BPF_MAP_TYPE_ARRAY);
                __type(key, int);
                __type(value, struct map_value);
                __uint(max_entries, 1);
        } some_nodes SEC(".maps");
      
      If struct node_data had a matching definition in kernel BTF, the verifier would
      expect a destructor for the type to be registered. Since struct node_data does
      not match any type in kernel BTF, the verifier knows that there is no kfunc
      that provides a PTR_TO_BTF_ID to this type, and that such a PTR_TO_BTF_ID can
      only come from bpf_obj_new. So instead of searching for a registered dtor,
      a bpf_obj_drop dtor can be assumed.
      
      This allows the runtime to properly destruct such kptrs in
      bpf_obj_free_fields, which enables maps to clean up map_vals w/ such
      kptrs when going away.
      
      Implementation notes:
        * "kernel_btf" variable is renamed to "kptr_btf" in btf_parse_kptr.
          Before this patch, the variable would only ever point to vmlinux or
          module BTFs, but now it can point to some program BTF for local kptr
          type. It's later used to populate the (btf, btf_id) pair in kptr btf
          field.
        * It's necessary to btf_get the program BTF when populating btf_field
          for local kptr. btf_record_free later does a btf_put.
        * Behavior for non-local referenced kptrs is not modified, as
          bpf_find_btf_id helper only searches vmlinux and module BTFs for
          matching BTF type. If such a type is found, btf_field_kptr's btf will
          pass btf_is_kernel check, and the associated release function is
          some one-argument dtor. If btf_is_kernel check fails, associated
          release function is two-arg bpf_obj_drop_impl. Before this patch
          only btf_field_kptr's w/ kernel or module BTFs were created.
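
      The dtor-selection rule in the last note can be sketched as a small
      userspace simulation (a schematic only, not the actual kernel code;
      `struct btf`, the dtor names, and the counters below are stand-ins):

      ```c
      #include <assert.h>
      #include <stdbool.h>

      /* Stand-in for a BTF handle; only tracks whether it is kernel BTF. */
      struct btf { bool is_kernel; };

      static bool btf_is_kernel(const struct btf *btf) { return btf->is_kernel; }

      /* The two dtor shapes described above: kernel types register a
       * one-argument dtor, while local kptrs get the two-argument
       * bpf_obj_drop-style dtor. */
      static int one_arg_calls, two_arg_calls;
      static void kernel_dtor(void *obj) { (void)obj; one_arg_calls++; }
      static void obj_drop_dtor(void *obj, const struct btf *btf)
      { (void)obj; (void)btf; two_arg_calls++; }

      /* Release a kptr field the way bpf_obj_free_fields would pick a dtor. */
      static void release_kptr(void *obj, const struct btf *field_btf)
      {
          if (btf_is_kernel(field_btf))
              kernel_dtor(obj);               /* registered per-btf_id dtor */
          else
              obj_drop_dtor(obj, field_btf);  /* implicit local-kptr dtor */
      }

      int main(void)
      {
          struct btf vmlinux_btf = { .is_kernel = true };
          struct btf prog_btf = { .is_kernel = false };
          int dummy;

          release_kptr(&dummy, &vmlinux_btf);
          release_kptr(&dummy, &prog_btf);
          assert(one_arg_calls == 1 && two_arg_calls == 1);
          return 0;
      }
      ```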
      Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
      Link: https://lore.kernel.org/r/20230310230743.2320707-2-davemarchevsky@fb.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
  2. 10 Mar, 2023 27 commits
  3. 09 Mar, 2023 9 commits
    • selftests/bpf: Use ifname instead of ifindex in XDP compliance test tool · 27a36bc3
      Lorenzo Bianconi authored
      Rely on interface name instead of interface index in error messages or
      logs from XDP compliance test tool.
      Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Link: https://lore.kernel.org/bpf/7dc5a8ff56c252b1a7ae29b059d0b2b1543c8b5d.1678382940.git.lorenzo@kernel.org
    • bpf: Fix a typo for BPF_F_ANY_ALIGNMENT in bpf.h · 5a70f4a6
      Michael Weiß authored
      Fix s/BPF_PROF_LOAD/BPF_PROG_LOAD/ typo in the documentation comment
      for BPF_F_ANY_ALIGNMENT in bpf.h.
      Signed-off-by: Michael Weiß <michael.weiss@aisec.fraunhofer.de>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Link: https://lore.kernel.org/bpf/20230309133823.944097-1-michael.weiss@aisec.fraunhofer.de
    • selftests/bpf: Fix flaky fib_lookup test · a6865576
      Martin KaFai Lau authored
      There is a report that the fib_lookup test is flaky when run in parallel,
      a symptom of slowness or delay. An example:
      
      Testing IPv6 stale neigh
      set_lookup_params:PASS:inet_pton(IPV6_IFACE_ADDR) 0 nsec
      test_fib_lookup:PASS:bpf_prog_test_run_opts 0 nsec
      test_fib_lookup:FAIL:fib_lookup_ret unexpected fib_lookup_ret: actual 0 != expected 7
      test_fib_lookup:FAIL:dmac not match unexpected dmac not match: actual 1 != expected 0
      dmac expected 11:11:11:11:11:11 actual 00:00:00:00:00:00
      
      [ Note that the "fib_lookup_ret unexpected fib_lookup_ret: actual 0 ..."
        message has its expected and actual values reversed. That is also
        fixed in this patch. ]
      
      One possibility is that the stale neigh entry under test was marked dead
      by the gc (in neigh_periodic_work). The default gc_stale_time sysctl is
      60s. This patch increases it to 15 mins.
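
      For reference, the sysctl in question can be raised like this
      (illustrative values; 900 seconds = 15 mins, and the selftest may apply
      it differently, e.g. inside a dedicated netns):

      ```shell
      # Default is 60s; raise it so gc (neigh_periodic_work) does not reap
      # the stale neigh entry mid-test. Requires root.
      sysctl -w net.ipv4.neigh.default.gc_stale_time=900
      sysctl -w net.ipv6.neigh.default.gc_stale_time=900
      ```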
      
      It also:
      
      - fixes the reversed args (actual vs expected) in one of the
        ASSERT_EQ tests
      - removes the nodad command arg when adding the v4 neigh entry, which
        currently triggers a warning.
      
      Fixes: 168de023 ("selftests/bpf: Add bpf_fib_lookup test")
      Reported-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Link: https://lore.kernel.org/bpf/20230309060244.3242491-1-martin.lau@linux.dev
    • Merge branch 'BPF open-coded iterators' · 23e403b3
      Alexei Starovoitov authored
      Andrii Nakryiko says:
      
      ====================
      
      Add support for open-coded (aka inline) iterators in the BPF world. This is
      the next evolution of gradually allowing more powerful and less restrictive
      looping and iteration capabilities in BPF programs.
      
      We set up a framework for implementing all kinds of iterators (e.g., cgroup,
      task, file, etc, iterators), but this patch set only implements numbers
      iterator, which is used to implement ergonomic bpf_for() for-like construct
      (see patches #4-#5). We also add bpf_for_each(), which is a generic
      foreach-like construct that will work with any kind of open-coded iterator
      implementation, as long as we stick with bpf_iter_<type>_{new,next,destroy}()
      naming pattern (which we now enforce on the kernel side).
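
      The new/next/destroy contract can be illustrated with a plain userspace
      C analogue of the numbers iterator (a sketch only; the real kfuncs live
      in the kernel, and the struct layout and names here are invented):

      ```c
      #include <assert.h>
      #include <stddef.h>

      /* Userspace stand-in for `struct bpf_iter_num`: iterator state lives
       * on the caller's stack, mirroring the design in this series. */
      struct iter_num { int cur, end, val; };

      /* bpf_iter_<type>_new(): initialize the iterator state. */
      static int iter_num_new(struct iter_num *it, int start, int end)
      {
          if (start > end)
              return -1;
          it->cur = start;
          it->end = end;
          return 0;
      }

      /* bpf_iter_<type>_next(): return a pointer to the next element,
       * or NULL once the iterator is drained. */
      static int *iter_num_next(struct iter_num *it)
      {
          if (it->cur >= it->end)
              return NULL;
          it->val = it->cur++;
          return &it->val;
      }

      /* bpf_iter_<type>_destroy(): release resources; a no-op here. */
      static void iter_num_destroy(struct iter_num *it) { (void)it; }

      int main(void)
      {
          struct iter_num it;
          int *v, sum = 0;

          assert(iter_num_new(&it, 0, 5) == 0);
          while ((v = iter_num_next(&it)))
              sum += *v;
          iter_num_destroy(&it);
          assert(sum == 0 + 1 + 2 + 3 + 4);
          return 0;
      }
      ```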
      
      Patch #1 is preparatory refactoring for easier way to check for special kfunc
      calls. Patch #2 is adding iterator kfunc registration and validation logic,
      which is mostly independent from the rest of open-coded iterator logic, so is
      separated out for easier reviewing.
      
      The meat of verifier-side logic is in patch #3. Patch #4 implements numbers
      iterator. I kept them separate to have clean reference for how to integrate
      new iterator types (now even simpler to do than in v1 of this patch set).
      Patch #5 adds bpf_for(), bpf_for_each(), and bpf_repeat() macros to
      bpf_misc.h, and also adds yet another pyperf test variant, now with bpf_for()
      loop. Patch #6 is verification tests, based on numbers iterator (as the only
      available right now). Patch #7 actually tests runtime behavior of numbers
      iterator.
      
      Finally, with changes in v2, it's possible and trivial to implement custom
      iterators completely in kernel modules, which we showcase and test by adding
      a simple iterator returning the same number a given number of times to
      bpf_testmod. Patch #8 is where all this happens and is tested.
      
      Most of the relevant details are in corresponding commit messages or code
      comments.
      
      v4->v5:
        - fixing missed inner for() in is_iter_reg_valid_uninit and the
          erroneous return false (kernel test robot);
        - typo fixes and comment/commit description improvements throughout the
          patch set;
      v3->v4:
        - remove unused variable from is_iter_reg_valid_init (kernel test robot);
      v2->v3:
        - remove special kfunc leftovers for bpf_iter_num_{new,next,destroy};
        - add iters/testmod_seq* to DENYLIST.s390x, it doesn't support kfuncs in
          modules yet (CI);
      v1->v2:
        - rebased on latest, dropping previously landed preparatory patches;
        - each iterator type now has its own `struct bpf_iter_<type>`, which
          allows each iterator implementation to use exactly as much stack space
          as necessary, avoiding runtime allocations (Alexei);
        - reworked how iterator kfuncs are defined; no verifier changes are
          required when adding a new iterator type;
        - added bpf_testmod-based iterator implementation;
        - address the rest of feedback, comments, commit message adjustment, etc.
      
      Cc: Tejun Heo <tj@kernel.org>
      ====================
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • selftests/bpf: implement and test custom testmod_seq iterator · 7e86a8c4
      Andrii Nakryiko authored
      Implement a trivial iterator returning the same specified integer value
      N times as part of the bpf_testmod kernel module. Add selftests to
      validate that everything works end to end.
      
      We also reuse these tests as "verification-only" tests to validate that the
      kernel prints the state of a custom kernel module-defined iterator correctly:
      
        fp-16=iter_testmod_seq(ref_id=1,state=drained,depth=0)
      
      The "testmod_seq" part is the iterator type name, and comes from the
      module's BTF data dynamically at runtime.
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/r/20230308184121.1165081-9-andrii@kernel.org
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • selftests/bpf: add number iterator tests · f59b1460
      Andrii Nakryiko authored
      Add number iterator (bpf_iter_num_{new,next,destroy}()) tests,
      validating the correct handling of various corner and common cases
      *at runtime*.
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/r/20230308184121.1165081-8-andrii@kernel.org
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • selftests/bpf: add iterators tests · 57400dcc
      Andrii Nakryiko authored
      Add various tests for open-coded iterators. Some of them exercise
      various possible coding patterns in C; some go down to low-level
      assembly for more control over various conditions, especially invalid
      ones.
      
      We also make use of bpf_for(), bpf_for_each(), bpf_repeat() macros in
      some of these tests.
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/r/20230308184121.1165081-7-andrii@kernel.org
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • selftests/bpf: add bpf_for_each(), bpf_for(), and bpf_repeat() macros · 8c2b5e90
      Andrii Nakryiko authored
      Add bpf_for_each(), bpf_for(), and bpf_repeat() macros that make writing
      open-coded iterator-based loops much more convenient and natural. These
      macros utilize the cleanup attribute to ensure proper destruction of the
      iterator and, thanks to that, manage to provide ergonomics very close to
      C's for() construct. A typical loop would look like:
      
        int i;
        int arr[N];
      
        bpf_for(i, 0, N) {
            /* verifier will know that i >= 0 && i < N, so could be used to
             * directly access array elements with no extra checks
             */
             arr[i] = i;
        }
      
      bpf_repeat() is very similar, but it doesn't expose the iteration number
      and is meant as a simple "repeat action N times" loop:
      
        bpf_repeat(N) { /* whatever, N times */ }
      
      Note that `break` and `continue` statements inside the {} block work as
      expected.
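
      The cleanup-attribute trick these macros rely on can be shown in plain C
      (a GCC/Clang extension; the names and the simplified loop macro below are
      invented for illustration, not the actual bpf_for() definition):

      ```c
      #include <assert.h>

      static int destroy_calls;

      struct iter { int cur, end; };

      static void iter_destroy(struct iter *it) { (void)it; destroy_calls++; }

      /* __attribute__((cleanup(f))) runs f(&var) whenever the variable goes
       * out of scope -- including via break or early return, which is how
       * bpf_for()/bpf_repeat() guarantee the iterator is always destroyed. */
      #define my_for(i, start, end)                                          \
          for (struct iter __it __attribute__((cleanup(iter_destroy))) =     \
                   { (start), (end) };                                       \
               __it.cur < __it.end && ((i) = __it.cur, 1);                   \
               __it.cur++)

      int main(void)
      {
          int i, sum = 0;

          my_for(i, 0, 10) {
              if (i == 5)
                  break;      /* iter_destroy still runs */
              sum += i;
          }
          assert(sum == 0 + 1 + 2 + 3 + 4);
          assert(destroy_calls == 1);
          return 0;
      }
      ```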
      
      bpf_for_each() is a generalization over any kind of BPF open-coded
      iterator, allowing a for-each-like approach instead of calling the
      low-level bpf_iter_<type>_{new,next,destroy}() APIs explicitly. E.g.:
      
        struct cgroup *cg;
      
        bpf_for_each(cgroup, cg, some, input, args) {
            /* do something with each cg */
        }
      
      would call (not-yet-implemented) bpf_iter_cgroup_{new,next,destroy}()
      functions to form a loop over cgroups, where `some, input, args` are
      passed verbatim into the constructor as
      
        bpf_iter_cgroup_new(&it, some, input, args).
      
      As a first demonstration, add pyperf variant based on the bpf_for() loop.
      
      Also clean up a few tests that either included the bpf_misc.h header
      unnecessarily from user-space, which is unsupported, or included it
      before any common types were defined (potentially leading to unnecessary
      compilation warnings).
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/r/20230308184121.1165081-6-andrii@kernel.org
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>