Commits · 79e104400fc3b94a2d04c65d222a0726b3ae2382 · Kirill Smelkov / linux

18 Jun, 2024 12 commits

sched_ext: Add sysrq-S which disables the BPF scheduler · 79e10440

Tejun Heo authored Jun 18, 2024

This enables the admin to abort the BPF scheduler and revert to CFS anytime.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: David Vernet <dvernet@meta.com>
Acked-by: Josh Don <joshdon@google.com>
Acked-by: Hao Luo <haoluo@google.com>
Acked-by: Barret Rhoden <brho@google.com>

79e10440

sched_ext: Add scx_simple and scx_example_qmap example schedulers · 2a52ca7c

Tejun Heo authored Jun 18, 2024

Add two simple example BPF schedulers - simple and qmap.

* simple: In terms of scheduling, it behaves identical to not having any
  operation implemented at all. The two operations it implements are only to
  improve visibility and exit handling. On certain homogeneous
  configurations, this actually can perform pretty well.

* qmap: A fixed five level priority scheduler to demonstrate queueing PIDs
  on BPF maps for scheduling. While not very practical, this is useful as a
  simple example and will be used to demonstrate different features.

v7: - Compat helpers stripped out in prepartion of upstreaming as the
      upstreamed patchset will be the baselinfe. Utility macros that can be
      used to implement compat features are kept.

    - Explicitly disable map autoattach on struct_ops to avoid trying to
      attach twice while maintaining compatbility with older libbpf.

v6: - Common header files reorganized and cleaned up. Compat helpers are
      added to demonstrate how schedulers can maintain backward
      compatibility with older kernels while making use of newly added
      features.

    - simple_select_cpu() added to keep track of the number of local
      dispatches. This is needed because the default ops.select_cpu()
      implementation is updated to dispatch directly and won't call
      ops.enqueue().

    - Updated to reflect the sched_ext API changes. Switching all tasks is
      the default behavior now and scx_qmap supports partial switching when
      `-p` is specified.

    - tools/sched_ext/Kconfig dropped. This will be included in the doc
      instead.

v5: - Improve Makefile. Build artifects are now collected into a separate
      dir which change be changed. Install and help targets are added and
      clean actually cleans everything.

    - MEMBER_VPTR() improved to improve access to structs. ARRAY_ELEM_PTR()
      and RESIZEABLE_ARRAY() are added to support resizable arrays in .bss.

    - Add scx_common.h which provides common utilities to user code such as
      SCX_BUG[_ON]() and RESIZE_ARRAY().

    - Use SCX_BUG[_ON]() to simplify error handling.

v4: - Dropped _example prefix from scheduler names.

v3: - Rename scx_example_dummy to scx_example_simple and restructure a bit
      to ease later additions. Comment updates.

    - Added declarations for BPF inline iterators. In the future, hopefully,
      these will be consolidated into a generic BPF header so that they
      don't need to be replicated here.

v2: - Updated with the generic BPF cpumask helpers.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: David Vernet <dvernet@meta.com>
Acked-by: Josh Don <joshdon@google.com>
Acked-by: Hao Luo <haoluo@google.com>
Acked-by: Barret Rhoden <brho@google.com>

2a52ca7c

sched_ext: Implement BPF extensible scheduler class · f0e1a064

Tejun Heo authored Jun 18, 2024

Implement a new scheduler class sched_ext (SCX), which allows scheduling
policies to be implemented as BPF programs to achieve the following:

1. Ease of experimentation and exploration: Enabling rapid iteration of new
   scheduling policies.

2. Customization: Building application-specific schedulers which implement
   policies that are not applicable to general-purpose schedulers.

3. Rapid scheduler deployments: Non-disruptive swap outs of scheduling
   policies in production environments.

sched_ext leverages BPF’s struct_ops feature to define a structure which
exports function callbacks and flags to BPF programs that wish to implement
scheduling policies. The struct_ops structure exported by sched_ext is
struct sched_ext_ops, and is conceptually similar to struct sched_class. The
role of sched_ext is to map the complex sched_class callbacks to the more
simple and ergonomic struct sched_ext_ops callbacks.

For more detailed discussion on the motivations and overview, please refer
to the cover letter.

Later patches will also add several example schedulers and documentation.

This patch implements the minimum core framework to enable implementation of
BPF schedulers. Subsequent patches will gradually add functionalities
including safety guarantee mechanisms, nohz and cgroup support.

include/linux/sched/ext.h defines struct sched_ext_ops. With the comment on
top, each operation should be self-explanatory. The followings are worth
noting:

- Both "sched_ext" and its shorthand "scx" are used. If the identifier
  already has "sched" in it, "ext" is used; otherwise, "scx".

- In sched_ext_ops, only .name is mandatory. Every operation is optional and
  if omitted a simple but functional default behavior is provided.

- A new policy constant SCHED_EXT is added and a task can select sched_ext
  by invoking sched_setscheduler(2) with the new policy constant. However,
  if the BPF scheduler is not loaded, SCHED_EXT is the same as SCHED_NORMAL
  and the task is scheduled by CFS. When the BPF scheduler is loaded, all
  tasks which have the SCHED_EXT policy are switched to sched_ext.

- To bridge the workflow imbalance between the scheduler core and
  sched_ext_ops callbacks, sched_ext uses simple FIFOs called dispatch
  queues (dsq's). By default, there is one global dsq (SCX_DSQ_GLOBAL), and
  one local per-CPU dsq (SCX_DSQ_LOCAL). SCX_DSQ_GLOBAL is provided for
  convenience and need not be used by a scheduler that doesn't require it.
  SCX_DSQ_LOCAL is the per-CPU FIFO that sched_ext pulls from when putting
  the next task on the CPU. The BPF scheduler can manage an arbitrary number
  of dsq's using scx_bpf_create_dsq() and scx_bpf_destroy_dsq().

- sched_ext guarantees system integrity no matter what the BPF scheduler
  does. To enable this, each task's ownership is tracked through
  p->scx.ops_state and all tasks are put on scx_tasks list. The disable path
  can always recover and revert all tasks back to CFS. See p->scx.ops_state
  and scx_tasks.

- A task is not tied to its rq while enqueued. This decouples CPU selection
  from queueing and allows sharing a scheduling queue across an arbitrary
  subset of CPUs. This adds some complexities as a task may need to be
  bounced between rq's right before it starts executing. See
  dispatch_to_local_dsq() and move_task_to_local_dsq().

- One complication that arises from the above weak association between task
  and rq is that synchronizing with dequeue() gets complicated as dequeue()
  may happen anytime while the task is enqueued and the dispatch path might
  need to release the rq lock to transfer the task. Solving this requires a
  bit of complexity. See the logic around p->scx.sticky_cpu and
  p->scx.ops_qseq.

- Both enable and disable paths are a bit complicated. The enable path
  switches all tasks without blocking to avoid issues which can arise from
  partially switched states (e.g. the switching task itself being starved).
  The disable path can't trust the BPF scheduler at all, so it also has to
  guarantee forward progress without blocking. See scx_ops_enable() and
  scx_ops_disable_workfn().

- When sched_ext is disabled, static_branches are used to shut down the
  entry points from hot paths.

v7: - scx_ops_bypass() was incorrectly and unnecessarily trying to grab
      scx_ops_enable_mutex which can lead to deadlocks in the disable path.
      Fixed.

    - Fixed TASK_DEAD handling bug in scx_ops_enable() path which could lead
      to use-after-free.

    - Consolidated per-cpu variable usages and other cleanups.

v6: - SCX_NR_ONLINE_OPS replaced with SCX_OPI_*_BEGIN/END so that multiple
      groups can be expressed. Later CPU hotplug operations are put into
      their own group.

    - SCX_OPS_DISABLING state is replaced with the new bypass mechanism
      which allows temporarily putting the system into simple FIFO
      scheduling mode bypassing the BPF scheduler. In addition to the shut
      down path, this will also be used to isolate the BPF scheduler across
      PM events. Enabling and disabling the bypass mode requires iterating
      all runnable tasks. rq->scx.runnable_list addition is moved from the
      later watchdog patch.

    - ops.prep_enable() is replaced with ops.init_task() and
      ops.enable/disable() are now called whenever the task enters and
      leaves sched_ext instead of when the task becomes schedulable on
      sched_ext and stops being so. A new operation - ops.exit_task() - is
      called when the task stops being schedulable on sched_ext.

    - scx_bpf_dispatch() can now be called from ops.select_cpu() too. This
      removes the need for communicating local dispatch decision made by
      ops.select_cpu() to ops.enqueue() via per-task storage.
      SCX_KF_SELECT_CPU is added to support the change.

    - SCX_TASK_ENQ_LOCAL which told the BPF scheudler that
      scx_select_cpu_dfl() wants the task to be dispatched to the local DSQ
      was removed. Instead, scx_bpf_select_cpu_dfl() now dispatches directly
      if it finds a suitable idle CPU. If such behavior is not desired,
      users can use scx_bpf_select_cpu_dfl() which returns the verdict in a
      bool out param.

    - scx_select_cpu_dfl() was mishandling WAKE_SYNC and could end up
      queueing many tasks on a local DSQ which makes tasks to execute in
      order while other CPUs stay idle which made some hackbench numbers
      really bad. Fixed.

    - The current state of sched_ext can now be monitored through files
      under /sys/sched_ext instead of /sys/kernel/debug/sched/ext. This is
      to enable monitoring on kernels which don't enable debugfs.

    - sched_ext wasn't telling BPF that ops.dispatch()'s @prev argument may
      be NULL and a BPF scheduler which derefs the pointer without checking
      could crash the kernel. Tell BPF. This is currently a bit ugly. A
      better way to annotate this is expected in the future.

    - scx_exit_info updated to carry pointers to message buffers instead of
      embedding them directly. This decouples buffer sizes from API so that
      they can be changed without breaking compatibility.

    - exit_code added to scx_exit_info. This is used to indicate different
      exit conditions on non-error exits and will be used to handle e.g. CPU
      hotplugs.

    - The patch "sched_ext: Allow BPF schedulers to switch all eligible
      tasks into sched_ext" is folded in and the interface is changed so
      that partial switching is indicated with a new ops flag
      %SCX_OPS_SWITCH_PARTIAL. This makes scx_bpf_switch_all() unnecessasry
      and in turn SCX_KF_INIT. ops.init() is now called with
      SCX_KF_SLEEPABLE.

    - Code reorganized so that only the parts necessary to integrate with
      the rest of the kernel are in the header files.

    - Changes to reflect the BPF and other kernel changes including the
      addition of bpf_sched_ext_ops.cfi_stubs.

v5: - To accommodate 32bit configs, p->scx.ops_state is now atomic_long_t
      instead of atomic64_t and scx_dsp_buf_ent.qseq which uses
      load_acquire/store_release is now unsigned long instead of u64.

    - Fix the bug where bpf_scx_btf_struct_access() was allowing write
      access to arbitrary fields.

    - Distinguish kfuncs which can be called from any sched_ext ops and from
      anywhere. e.g. scx_bpf_pick_idle_cpu() can now be called only from
      sched_ext ops.

    - Rename "type" to "kind" in scx_exit_info to make it easier to use on
      languages in which "type" is a reserved keyword.

    - Since cff9b233 ("kernel/sched: Modify initial boot task idle
      setup"), PF_IDLE is not set on idle tasks which haven't been online
      yet which made scx_task_iter_next_filtered() include those idle tasks
      in iterations leading to oopses. Update scx_task_iter_next_filtered()
      to directly test p->sched_class against idle_sched_class instead of
      using is_idle_task() which tests PF_IDLE.

    - Other updates to match upstream changes such as adding const to
      set_cpumask() param and renaming check_preempt_curr() to
      wakeup_preempt().

v4: - SCHED_CHANGE_BLOCK replaced with the previous
      sched_deq_and_put_task()/sched_enq_and_set_tsak() pair. This is
      because upstream is adaopting a different generic cleanup mechanism.
      Once that lands, the code will be adapted accordingly.

    - task_on_scx() used to test whether a task should be switched into SCX,
      which is confusing. Renamed to task_should_scx(). task_on_scx() now
      tests whether a task is currently on SCX.

    - scx_has_idle_cpus is barely used anymore and replaced with direct
      check on the idle cpumask.

    - SCX_PICK_IDLE_CORE added and scx_pick_idle_cpu() improved to prefer
      fully idle cores.

    - ops.enable() now sees up-to-date p->scx.weight value.

    - ttwu_queue path is disabled for tasks on SCX to avoid confusing BPF
      schedulers expecting ->select_cpu() call.

    - Use cpu_smt_mask() instead of topology_sibling_cpumask() like the rest
      of the scheduler.

v3: - ops.set_weight() added to allow BPF schedulers to track weight changes
      without polling p->scx.weight.

    - move_task_to_local_dsq() was losing SCX-specific enq_flags when
      enqueueing the task on the target dsq because it goes through
      activate_task() which loses the upper 32bit of the flags. Carry the
      flags through rq->scx.extra_enq_flags.

    - scx_bpf_dispatch(), scx_bpf_pick_idle_cpu(), scx_bpf_task_running()
      and scx_bpf_task_cpu() now use the new KF_RCU instead of
      KF_TRUSTED_ARGS to make it easier for BPF schedulers to call them.

    - The kfunc helper access control mechanism implemented through
      sched_ext_entity.kf_mask is improved. Now SCX_CALL_OP*() is always
      used when invoking scx_ops operations.

v2: - balance_scx_on_up() is dropped. Instead, on UP, balance_scx() is
      called from put_prev_taks_scx() and pick_next_task_scx() as necessary.
      To determine whether balance_scx() should be called from
      put_prev_task_scx(), SCX_TASK_DEQD_FOR_SLEEP flag is added. See the
      comment in put_prev_task_scx() for details.

    - sched_deq_and_put_task() / sched_enq_and_set_task() sequences replaced
      with SCHED_CHANGE_BLOCK().

    - Unused all_dsqs list removed. This was a left-over from previous
      iterations.

    - p->scx.kf_mask is added to track and enforce which kfunc helpers are
      allowed. Also, init/exit sequences are updated to make some kfuncs
      always safe to call regardless of the current BPF scheduler state.
      Combined, this should make all the kfuncs safe.

    - BPF now supports sleepable struct_ops operations. Hacky workaround
      removed and operations and kfunc helpers are tagged appropriately.

    - BPF now supports bitmask / cpumask helpers. scx_bpf_get_idle_cpumask()
      and friends are added so that BPF schedulers can use the idle masks
      with the generic helpers. This replaces the hacky kfunc helpers added
      by a separate patch in V1.

    - CONFIG_SCHED_CLASS_EXT can no longer be enabled if SCHED_CORE is
      enabled. This restriction will be removed by a later patch which adds
      core-sched support.

    - Add MAINTAINERS entries and other misc changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Co-authored-by: David Vernet <dvernet@meta.com>
Acked-by: Josh Don <joshdon@google.com>
Acked-by: Hao Luo <haoluo@google.com>
Acked-by: Barret Rhoden <brho@google.com>
Cc: Andrea Righi <andrea.righi@canonical.com>

f0e1a064

sched_ext: Add boilerplate for extensible scheduler class · a7a9fc54

Tejun Heo authored Jun 18, 2024

This adds dummy implementations of sched_ext interfaces which interact with
the scheduler core and hook them in the correct places. As they're all
dummies, this doesn't cause any behavior changes. This is split out to help
reviewing.

v2: balance_scx_on_up() dropped. This will be handled in sched_ext proper.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: David Vernet <dvernet@meta.com>
Acked-by: Josh Don <joshdon@google.com>
Acked-by: Hao Luo <haoluo@google.com>
Acked-by: Barret Rhoden <brho@google.com>

a7a9fc54

sched: Add normal_policy() · 2c8d046d

Tejun Heo authored Jun 18, 2024

A new BPF extensible sched_class will need to dynamically change how a task
picks its sched_class. For example, if the loaded BPF scheduler progs fail,
the tasks will be forced back on CFS even if the task's policy is set to the
new sched_class. To support such mapping, add normal_policy() which wraps
testing for %SCHED_NORMAL. This doesn't cause any behavior changes.

v2: Update the description with more details on the expected use.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: David Vernet <dvernet@meta.com>
Acked-by: Josh Don <joshdon@google.com>
Acked-by: Hao Luo <haoluo@google.com>
Acked-by: Barret Rhoden <brho@google.com>

2c8d046d

sched: Factor out update_other_load_avgs() from __update_blocked_others() · 96fd6c65

Tejun Heo authored Jun 18, 2024

RT, DL, thermal and irq load and utilization metrics need to be decayed and
updated periodically and before consumption to keep the numbers reasonable.
This is currently done from __update_blocked_others() as a part of the fair
class load balance path. Let's factor it out to update_other_load_avgs().
Pure refactor. No functional changes.

This will be used by the new BPF extensible scheduling class to ensure that
the above metrics are properly maintained.

v2: Refreshed on top of tip:sched/core.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: David Vernet <dvernet@meta.com>

96fd6c65

sched: Factor out cgroup weight conversion functions · 4f9c7ca8

Tejun Heo authored Jun 18, 2024

Factor out sched_weight_from/to_cgroup() which convert between scheduler
shares and cgroup weight. No functional change. The factored out functions
will be used by a new BPF extensible sched_class so that the weights can be
exposed to the BPF programs in a way which is consistent cgroup weights and
easier to interpret.

The weight conversions will be used regardless of cgroup usage. It's just
borrowing the cgroup weight range as it's more intuitive.
CGROUP_WEIGHT_MIN/DFL/MAX constants are moved outside CONFIG_CGROUPS so that
the conversion helpers can always be defined.

v2: The helpers are now defined regardless of COFNIG_CGROUPS.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: David Vernet <dvernet@meta.com>
Acked-by: Josh Don <joshdon@google.com>
Acked-by: Hao Luo <haoluo@google.com>
Acked-by: Barret Rhoden <brho@google.com>

4f9c7ca8

sched: Add sched_class->switching_to() and expose check_class_changing/changed() · d8c7bc2e

Tejun Heo authored Jun 18, 2024

When a task switches to a new sched_class, the prev and new classes are
notified through ->switched_from() and ->switched_to(), respectively, after
the switching is done.

A new BPF extensible sched_class will have callbacks that allow the BPF
scheduler to keep track of relevant task states (like priority and cpumask).
Those callbacks aren't called while a task is on a different sched_class.
When a task comes back, we wanna tell the BPF progs the up-to-date state
before the task gets enqueued, so we need a hook which is called before the
switching is committed.

This patch adds ->switching_to() which is called during sched_class switch
through check_class_changing() before the task is restored. Also, this patch
exposes check_class_changing/changed() in kernel/sched/sched.h. They will be
used by the new BPF extensible sched_class to implement implicit sched_class
switching which is used e.g. when falling back to CFS when the BPF scheduler
fails or unloads.

This is a prep patch and doesn't cause any behavior changes. The new
operation and exposed functions aren't used yet.

v3: Refreshed on top of tip:sched/core.

v2: Improve patch description w/ details on planned use.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: David Vernet <dvernet@meta.com>
Acked-by: Josh Don <joshdon@google.com>
Acked-by: Hao Luo <haoluo@google.com>
Acked-by: Barret Rhoden <brho@google.com>

d8c7bc2e

sched: Add sched_class->reweight_task() · e83edbf8

Tejun Heo authored Jun 18, 2024

Currently, during a task weight change, sched core directly calls
reweight_task() defined in fair.c if @p is on CFS. Let's make it a proper
sched_class operation instead. CFS's reweight_task() is renamed to
reweight_task_fair() and now called through sched_class.

While it turns a direct call into an indirect one, set_load_weight() isn't
called from a hot path and this change shouldn't cause any noticeable
difference. This will be used to implement reweight_task for a new BPF
extensible sched_class so that it can keep its cached task weight
up-to-date.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: David Vernet <dvernet@meta.com>
Acked-by: Josh Don <joshdon@google.com>
Acked-by: Hao Luo <haoluo@google.com>
Acked-by: Barret Rhoden <brho@google.com>

e83edbf8

sched: Allow sched_cgroup_fork() to fail and introduce sched_cancel_fork() · 304b3f2b

Tejun Heo authored Jun 18, 2024

A new BPF extensible sched_class will need more control over the forking
process. It wants to be able to fail from sched_cgroup_fork() after the new
task's sched_task_group is initialized so that the loaded BPF program can
prepare the task with its cgroup association is established and reject fork
if e.g. allocation fails.

Allow sched_cgroup_fork() to fail by making it return int instead of void
and adding sched_cancel_fork() to undo sched_fork() in the error path.

sched_cgroup_fork() doesn't fail yet and this patch shouldn't cause any
behavior changes.

v2: Patch description updated to detail the expected use.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: David Vernet <dvernet@meta.com>
Acked-by: Josh Don <joshdon@google.com>
Acked-by: Hao Luo <haoluo@google.com>
Acked-by: Barret Rhoden <brho@google.com>

304b3f2b

sched: Restructure sched_class order sanity checks in sched_init() · df268382

Tejun Heo authored Jun 18, 2024

Currently, sched_init() checks that the sched_class'es are in the expected
order by testing each adjacency which is a bit brittle and makes it
cumbersome to add optional sched_class'es. Instead, let's verify whether
they're in the expected order using sched_class_above() which is what
matters.
Signed-off-by: Tejun Heo <tj@kernel.org>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Reviewed-by: David Vernet <dvernet@meta.com>

df268382

Merge branch 'bpf/for-next' into sched_ext-base · 8cce4759
Tejun Heo authored Jun 18, 2024

8cce4759

17 Jun, 2024 7 commits

Merge branch 'bpf-support-resilient-split-btf' · f6afdaf7

Andrii Nakryiko authored Jun 17, 2024

Alan Maguire says:

====================
bpf: support resilient split BTF

Split BPF Type Format (BTF) provides huge advantages in that kernel
modules only have to provide type information for types that they do not
share with the core kernel; for core kernel types, split BTF refers to
core kernel BTF type ids.  So for a STRUCT sk_buff, a module that
uses that structure (or a pointer to it) simply needs to refer to the
core kernel type id, saving the need to define the structure and its many
dependents.  This cuts down on duplication and makes BTF as compact
as possible.

However, there is a downside.  This scheme requires the references from
split BTF to base BTF to be valid not just at encoding time, but at use
time (when the module is loaded).  Even a small change in kernel types
can perturb the type ids in core kernel BTF, and - if the new reproducible
BTF option is not used - pahole's parallel processing of compilation units
can lead to different type ids for the same kernel if the BTF is
regenerated.

So we have a robustness problem for split BTF for cases where a module is
not always compiled at the same time as the kernel.  This problem is
particularly acute for distros which generally want module builders to be
able to compile a module for the lifetime of a Linux stable-based release,
and have it continue to be valid over the lifetime of that release, even
as changes in data structures (and hence BTF types) accrue.  Today it's not
possible to generate BTF for modules that works beyond the initial
kernel it is compiled against - kernel bugfixes etc invalidate the split
BTF references to vmlinux BTF, and BTF is no longer usable for the
module.

The goal of this series is to provide options to provide additional
context for cases like this.  That context comes in the form of
distilled base BTF; it stands in for the base BTF, and contains
information about the types referenced from split BTF, but not their
full descriptions.  The modified split BTF will refer to type ids in
this .BTF.base section, and when the kernel loads such modules it
will use that .BTF.base to map references from split BTF to the
equivalent current vmlinux base BTF types.  Once this relocation
process has succeeded, the module BTF available in /sys/kernel/btf
will look exactly as if it was built with the current vmlinux;
references to base types will be fixed up etc.

A module builder - using this series along with the pahole changes -
can then build a module with distilled base BTF via an out-of-tree
module build, i.e.

make -C . M=path/2/module

The module will have a .BTF section (the split BTF) and a
.BTF.base section.  The latter is small in size - distilled base
BTF does not need full struct/union/enum information for named
types for example.  For 2667 modules built with distilled base BTF,
the average size observed was 1556 bytes (stddev 1563).  The overall
size added to this 2667 modules was 5.3Mb.

Note that for the in-tree modules, this approach is not needed as
split and base BTF in the case of in-tree modules are always built
and re-built together.

The series first focuses on generating split BTF with distilled base
BTF; then relocation support is added to allow split BTF with
an associated distlled base to be relocated with a new base BTF.

Next Eduard's patch allows BTF ELF parsing to work with both
.BTF and .BTF.base sections; this ensures that bpftool will be
able to dump BTF for a module with a .BTF.base section for example,
or indeed dump relocated BTF where a module and a "-B vmlinux"
is supplied.

Then we add support to resolve_btfids to ignore base BTF - i.e.
to avoid relocation - if a .BTF.base section is found.  This ensures
the .BTF.ids section is populated with ids relative to the distilled
base (these will be relocated as part of module load).

Finally the series supports storage of .BTF.base data/size in modules
and supports sharing of relocation code with the kernel to allow
relocation of module BTF.  For the kernel, this relocation
process happens at module load time, and we relocate split BTF
references to point at types in the current vmlinux BTF.  As part of
this, .BTF.ids references need to be mapped also.

So concretely, what happens is

- we generate split BTF in the .BTF section of a module that refers to
  types in the .BTF.base section as base types; the latter are not full
  type descriptions but provide information about the base type.  So
  a STRUCT sk_buff would be represented as a FWD struct sk_buff in
  distilled base BTF for example.
- when the module is loaded, the split BTF is relocated with vmlinux
  BTF; in the case of the FWD struct sk_buff, we find the STRUCT sk_buff
  in vmlinux BTF and map all split BTF references to the distilled base
  FWD sk_buff, replacing them with references to the vmlinux BTF
  STRUCT sk_buff.

A previous approach to this problem [1] utilized standalone BTF for such
cases - where the BTF is not defined relative to base BTF so there is no
relocation required.  The problem with that approach is that from
the verifier perspective, some types are special, and having a custom
representation of a core kernel type that did not necessarily match the
current representation is not tenable.  So the approach taken here was
to preserve the split BTF model while minimizing the representation of
the context needed to relocate split and current vmlinux BTF.

To generate distilled .BTF.base sections the associated dwarves
patch (to be applied on the "next" branch there) is needed [3]
Without it, things will still work but modules will not be built
with a .BTF.base section.

Changes since v5[4]:

- Update search of distilled types to return the first occurrence
  of a string (or a string+size pair); this allows us to iterate
  over all matches in distilled base BTF (Andrii, patch 3)
- Update to use BTF field iterators (Andrii, patches 1, 3 and 8)
- Update tests to cover multiple match and associated error cases
  (Eduard, patch 4)
- Rename elf_sections_info to btf_elf_secs, remove use of
  libbpf_get_error(), reset btf->owns_base when relocation
  succeeds (Andrii, patch 5)

Changes since v4[5]:

- Moved embeddedness, duplicate name checks to relocation time
  and record struct/union size for all distilled struct/unions
  instead of using forwards.  This allows us to carry out
  type compatibility checks based on the base BTF we want to
  relocate with (Eduard, patches 1, 3)
- Moved to using qsort() instead of qsort_r() as support for
  qsort_r() appears to be missing in Android libc (Andrii, patch 3)
- Sorting/searching now incorporates size matching depending
  on BTF kind and embeddedness of struct/union (Eduard, Andrii,
  patch 3)
- Improved naming of various types during relocation to avoid
  confusion (Andrii, patch 3)
- Incorporated Eduard's patch (patch 5) which handles .BTF.base
  sections internally in btf_parse_elf().  This makes ELF parsing
  work with split BTF, split BTF with a distilled base, split
  BTF with a distilled base _and_ base BTF (by relocating) etc.
  Having this avoids the need for bpftool changes; it will work
  as-is with .BTF.base sections (Eduard, patch 4)
- Updated resolve_btfids to _not_ relocate BTF for modules
  where a .BTF.base section is present; in that one case we
  do not want to relocate BTF as the .BTF.ids section should
  reflect ids in .BTF.base which will later be relocated on
  module load (Eduard, Andrii, patch 5)

Changes since v3[6]:

- distill now checks for duplicate-named struct/unions and records
  them as a sized struct/union to help identify which of the
  multiple base BTF structs/unions it refers to (Eduard, patch 1)
- added test support for multiple name handling (Eduard, patch 2)
- simplified the string mapping when updating split BTF to use
  base BTF instead of distilled base.  Since the only string
  references split BTF can make to base BTF are the names of
  the base types, create a string map from distilled string
  offset -> base BTF string offset and update string offsets
  by visiting all strings in split BTF; this saves having to
  do costly searches of base BTF (Eduard, patch 7,10)
- fixed bpftool manpage and indentation issues (Quentin, patch 11)

Also explored Eduard's suggestion of doing an implicit fallback
to checking for .BTF.base section in btf__parse() when it is
called to get base BTF.  However while it is doable, it turned
out to be difficult operationally.  Since fallback is implicit
we do not know the source of the BTF - was it from .BTF or
.BTF.base? In bpftool, we want to try first standalone BTF,
then split, then split with distilled base.  Having a way
to explicitly request .BTF.base via btf__parse_opts() fits
that model better.

Changes since v2[7]:

- submitted patch to use --btf_features in Makefile.btf for pahole
  v1.26 and later separately (Andrii).  That has landed in bpf-next
  now.
- distilled base now encodes ENUM64 as fwd ENUM (size 8), eliminating
  the need for support for ENUM64 in btf__add_fwd (patch 1, Andrii)
- moved to distilling only named types, augmenting split BTF with
  associated reference types; this simplifies greatly the distilled
  base BTF and the mapping operation between distilled and base
  BTF when relocating (most of the series changes, Andrii)
- relocation now iterates over base BTF, looking for matches based
  on name in distilled BTF.  Distilled BTF is pre-sorted by name
  (Andrii, patch 8)
- removed most redundant compabitiliby checks aside from struct
  size for base types/embedded structs and kind compatibility
  (since we only match on name) (Andrii, patch 8)
- btf__parse_opts() now replaces btf_parse() internally in libbpf
  (Eduard, patch 3)

Changes since RFC [8]:

- updated terminology; we replace clunky "base reference" BTF with
  distilling base BTF into a .BTF.base section. Similarly BTF
  reconcilation becomes BTF relocation (Andrii, most patches)
- add distilled base BTF by default for out-of-tree modules
  (Alexei, patch 8)
- distill algorithm updated to record size of embedded struct/union
  by recording it as a 0-vlen STRUCT/UNION with size preserved
  (Andrii, patch 2)
- verify size match on relocation for such STRUCT/UNIONs (Andrii,
  patch 9)
- with embedded STRUCT/UNION recording size, we can have bpftool
  dump a header representation using .BTF.base + .BTF sections
  rather than special-casing and refusing to use "format c" for
  that case (patch 5)
- match enum with enum64 and vice versa (Andrii, patch 9)
- ensure that resolve_btfids works with BTF without .BTF.base
  section (patch 7)
- update tests to cover embedded types, arrays and function
  prototypes (patches 3, 12)

[1] https://lore.kernel.org/bpf/20231112124834.388735-14-alan.maguire@oracle.com/
[2] https://lore.kernel.org/bpf/20240501175035.2476830-1-alan.maguire@oracle.com/
[3] https://lore.kernel.org/bpf/20240517102714.4072080-1-alan.maguire@oracle.com/
[4] https://lore.kernel.org/bpf/20240528122408.3154936-1-alan.maguire@oracle.com/
[5] https://lore.kernel.org/bpf/20240517102246.4070184-1-alan.maguire@oracle.com/
[6] https://lore.kernel.org/bpf/20240510103052.850012-1-alan.maguire@oracle.com/
[7] https://lore.kernel.org/bpf/20240424154806.3417662-1-alan.maguire@oracle.com/
[8] https://lore.kernel.org/bpf/20240322102455.98558-1-alan.maguire@oracle.com/
====================

Link: https://lore.kernel.org/r/20240613095014.357981-1-alan.maguire@oracle.comSigned-off-by: Andrii Nakryiko <andrii@kernel.org>

f6afdaf7

resolve_btfids: Handle presence of .BTF.base section · 6ba77385

Alan Maguire authored Jun 13, 2024

Now that btf_parse_elf() handles .BTF.base section presence,
we need to ensure that resolve_btfids uses .BTF.base when present
rather than the vmlinux base BTF passed in via the -B option.
Detect .BTF.base section presence and unset the base BTF path
to ensure that BTF ELF parsing will do the right thing.
Signed-off-by: Alan Maguire <alan.maguire@oracle.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Reviewed-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/bpf/20240613095014.357981-7-alan.maguire@oracle.com

6ba77385

libbpf: Make btf_parse_elf process .BTF.base transparently · c86f180f

Eduard Zingerman authored Jun 13, 2024

Update btf_parse_elf() to check if .BTF.base section is present.
The logic is as follows:

  if .BTF.base section exists:
     distilled_base := btf_new(.BTF.base)
  if distilled_base:
     btf := btf_new(.BTF, .base_btf=distilled_base)
     if base_btf:
        btf_relocate(btf, base_btf)
  else:
     btf := btf_new(.BTF)
  return btf

In other words:
- if .BTF.base section exists, load BTF from it and use it as a base
  for .BTF load;
- if base_btf is specified and .BTF.base section exist, relocate newly
  loaded .BTF against base_btf.
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Alan Maguire <alan.maguire@oracle.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20240613095014.357981-6-alan.maguire@oracle.com

c86f180f

selftests/bpf: Extend distilled BTF tests to cover BTF relocation · affdeb50

Alan Maguire authored Jun 13, 2024

Ensure relocated BTF looks as expected; in this case identical to
original split BTF, with a few duplicate anonymous types added to
split BTF by the relocation process.  Also add relocation tests
for edge cases like missing type in base BTF and multiple types
of the same name.
Signed-off-by: Alan Maguire <alan.maguire@oracle.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/bpf/20240613095014.357981-5-alan.maguire@oracle.com

affdeb50

libbpf: Split BTF relocation · 19e00c89

Alan Maguire authored Jun 13, 2024

Map distilled base BTF type ids referenced in split BTF and their
references to the base BTF passed in, and if the mapping succeeds,
reparent the split BTF to the base BTF.

Relocation is done by first verifying that distilled base BTF
only consists of named INT, FLOAT, ENUM, FWD, STRUCT and
UNION kinds; then we sort these to speed lookups.  Once sorted,
the base BTF is iterated, and for each relevant kind we check
for an equivalent in distilled base BTF.  When found, the
mapping from distilled -> base BTF id and string offset is recorded.
In establishing mappings, we need to ensure we check STRUCT/UNION
size when the STRUCT/UNION is embedded in a split BTF STRUCT/UNION,
and when duplicate names exist for the same STRUCT/UNION.  Otherwise
size is ignored in matching STRUCT/UNIONs.

Once all mappings are established, we can update type ids
and string offsets in split BTF and reparent it to the new base.
Signed-off-by: Alan Maguire <alan.maguire@oracle.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/bpf/20240613095014.357981-4-alan.maguire@oracle.com

19e00c89

selftests/bpf: Test distilled base, split BTF generation · eb20e727

Alan Maguire authored Jun 13, 2024

Test generation of split+distilled base BTF, ensuring that

- named base BTF STRUCTs and UNIONs are represented as 0-vlen sized
  STRUCT/UNIONs
- named ENUM[64]s are represented as 0-vlen named ENUM[64]s
- anonymous struct/unions are represented in full in split BTF
- anonymous enums are represented in full in split BTF
- types unreferenced from split BTF are not present in distilled
  base BTF

Also test that with vmlinux BTF and split BTF based upon it,
we only represent needed base types referenced from split BTF
in distilled base.
Signed-off-by: Alan Maguire <alan.maguire@oracle.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/bpf/20240613095014.357981-3-alan.maguire@oracle.com

eb20e727

libbpf: Add btf__distill_base() creating split BTF with distilled base BTF · 58e185a0

Alan Maguire authored Jun 13, 2024

To support more robust split BTF, adding supplemental context for the
base BTF type ids that split BTF refers to is required.  Without such
references, a simple shuffling of base BTF type ids (without any other
significant change) invalidates the split BTF.  Here the attempt is made
to store additional context to make split BTF more robust.

This context comes in the form of distilled base BTF providing minimal
information (name and - in some cases - size) for base INTs, FLOATs,
STRUCTs, UNIONs, ENUMs and ENUM64s along with modified split BTF that
points at that base and contains any additional types needed (such as
TYPEDEF, PTR and anonymous STRUCT/UNION declarations).  This
information constitutes the minimal BTF representation needed to
disambiguate or remove split BTF references to base BTF.  The rules
are as follows:

- INT, FLOAT, FWD are recorded in full.
- if a named base BTF STRUCT or UNION is referred to from split BTF, it
  will be encoded as a zero-member sized STRUCT/UNION (preserving
  size for later relocation checks).  Only base BTF STRUCT/UNIONs
  that are either embedded in split BTF STRUCT/UNIONs or that have
  multiple STRUCT/UNION instances of the same name will _need_ size
  checks at relocation time, but as it is possible a different set of
  types will be duplicates in the later to-be-resolved base BTF,
  we preserve size information for all named STRUCT/UNIONs.
- if an ENUM[64] is named, a ENUM forward representation (an ENUM
  with no values) of the same size is used.
- in all other cases, the type is added to the new split BTF.

Avoiding struct/union/enum/enum64 expansion is important to keep the
distilled base BTF representation to a minimum size.

When successful, new representations of the distilled base BTF and new
split BTF that refers to it are returned.  Both need to be freed by the
caller.

So to take a simple example, with split BTF with a type referring
to "struct sk_buff", we will generate distilled base BTF with a
0-member STRUCT sk_buff of the appropriate size, and the split BTF
will refer to it instead.

Tools like pahole can utilize such split BTF to populate the .BTF
section (split BTF) and an additional .BTF.base section.  Then
when the split BTF is loaded, the distilled base BTF can be used
to relocate split BTF to reference the current (and possibly changed)
base BTF.

So for example if "struct sk_buff" was id 502 when the split BTF was
originally generated,  we can use the distilled base BTF to see that
id 502 refers to a "struct sk_buff" and replace instances of id 502
with the current (relocated) base BTF sk_buff type id.

Distilled base BTF is small; when building a kernel with all modules
using distilled base BTF as a test, overall module size grew by only
5.3Mb total across ~2700 modules.
Signed-off-by: Alan Maguire <alan.maguire@oracle.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/bpf/20240613095014.357981-2-alan.maguire@oracle.com

58e185a0

14 Jun, 2024 4 commits

selftests/bpf: Add tests for add_const · dedf56d7

Alexei Starovoitov authored Jun 12, 2024

Improve arena based tests and add several C and asm tests
with specific pattern.
These tests would have failed without add_const verifier support.

Also add several loop_inside_iter*() tests that are not related to add_const,
but nice to have.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20240613013815.953-5-alexei.starovoitov@gmail.com

dedf56d7

bpf: Support can_loop/cond_break on big endian · 6870bdb3

Alexei Starovoitov authored Jun 12, 2024

Add big endian support for can_loop/cond_break macros.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/bpf/20240613013815.953-4-alexei.starovoitov@gmail.com

6870bdb3

bpf: Track delta between "linked" registers. · 98d7ca37

Alexei Starovoitov authored Jun 12, 2024

Compilers can generate the code
  r1 = r2
  r1 += 0x1
  if r2 < 1000 goto ...
  use knowledge of r2 range in subsequent r1 operations

So remember constant delta between r2 and r1 and update r1 after 'if' condition.

Unfortunately LLVM still uses this pattern for loops with 'can_loop' construct:
for (i = 0; i < 1000 && can_loop; i++)

The "undo" pass was introduced in LLVM
https://reviews.llvm.org/D121937
to prevent this optimization, but it cannot cover all cases.
Instead of fighting middle end optimizer in BPF backend teach the verifier
about this pattern.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/bpf/20240613013815.953-3-alexei.starovoitov@gmail.com

98d7ca37

bpf: Relax tuple len requirement for sk helpers. · 124e8c2b

Alexei Starovoitov authored Jun 12, 2024

__bpf_skc_lookup() safely handles incorrect values of tuple len,
hence we can allow zero to be passed as tuple len.
This patch alone doesn't make an observable verifier difference.
It's a trivial improvement that might simplify bpf programs.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/bpf/20240613013815.953-2-alexei.starovoitov@gmail.com

124e8c2b

13 Jun, 2024 11 commits

Merge branch 'bpf-make-trusted-args-nullable' · cdbde084

Alexei Starovoitov authored Jun 13, 2024

Vadim Fedorenko says:

====================
bpf: make trusted args nullable

Current verifier checks for the arg to be nullable after checking for
certain pointer types. It prevents programs to pass NULL to kfunc args
even if they are marked as nullable. This patchset adjusts verifier and
changes bpf crypto kfuncs to allow null for IV parameter which is
optional for some ciphers. Benchmark shows ~4% improvements when there
is no need to initialise 0-sized dynptr.

v3:
- add special selftest for nullable parameters
v2:
- adjust kdoc accordingly
====================

Link: https://lore.kernel.org/r/20240613211817.1551967-1-vadfed@meta.comSigned-off-by: Alexei Starovoitov <ast@kernel.org>

cdbde084

selftests: bpf: add testmod kfunc for nullable params · 2d45ab1e

Vadim Fedorenko authored Jun 13, 2024

Add special test to be sure that only __nullable BTF params can be
replaced by NULL. This patch adds fake kfuncs in bpf_testmod to
properly test different params.
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Vadim Fedorenko <vadfed@meta.com>
Link: https://lore.kernel.org/r/20240613211817.1551967-6-vadfed@meta.comSigned-off-by: Alexei Starovoitov <ast@kernel.org>

2d45ab1e

selftests: bpf: crypto: adjust bench to use nullable IV · 9b560751

Vadim Fedorenko authored Jun 13, 2024

The bench shows some improvements, around 4% faster on decrypt.

Before:

Benchmark 'crypto-decrypt' started.
Iter 0 (325.719us): hits 5.105M/s ( 5.105M/prod), drops 0.000M/s, total operations 5.105M/s
Iter 1 (-17.295us): hits 5.224M/s ( 5.224M/prod), drops 0.000M/s, total operations 5.224M/s
Iter 2 ( 5.504us): hits 4.630M/s ( 4.630M/prod), drops 0.000M/s, total operations 4.630M/s
Iter 3 ( 9.239us): hits 5.148M/s ( 5.148M/prod), drops 0.000M/s, total operations 5.148M/s
Iter 4 ( 37.885us): hits 5.198M/s ( 5.198M/prod), drops 0.000M/s, total operations 5.198M/s
Iter 5 (-53.282us): hits 5.167M/s ( 5.167M/prod), drops 0.000M/s, total operations 5.167M/s
Iter 6 (-17.809us): hits 5.186M/s ( 5.186M/prod), drops 0.000M/s, total operations 5.186M/s
Summary: hits 5.092 ± 0.228M/s ( 5.092M/prod), drops 0.000 ±0.000M/s, total operations 5.092 ± 0.228M/s

After:

Benchmark 'crypto-decrypt' started.
Iter 0 (268.912us): hits 5.312M/s ( 5.312M/prod), drops 0.000M/s, total operations 5.312M/s
Iter 1 (124.869us): hits 5.354M/s ( 5.354M/prod), drops 0.000M/s, total operations 5.354M/s
Iter 2 (-36.801us): hits 5.334M/s ( 5.334M/prod), drops 0.000M/s, total operations 5.334M/s
Iter 3 (254.628us): hits 5.334M/s ( 5.334M/prod), drops 0.000M/s, total operations 5.334M/s
Iter 4 (-77.691us): hits 5.275M/s ( 5.275M/prod), drops 0.000M/s, total operations 5.275M/s
Iter 5 (-164.510us): hits 5.313M/s ( 5.313M/prod), drops 0.000M/s, total operations 5.313M/s
Iter 6 (-81.376us): hits 5.346M/s ( 5.346M/prod), drops 0.000M/s, total operations 5.346M/s
Summary: hits 5.326 ± 0.029M/s ( 5.326M/prod), drops 0.000 ±0.000M/s, total operations 5.326 ± 0.029M/s
Reviewed-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Vadim Fedorenko <vadfed@meta.com>
Link: https://lore.kernel.org/r/20240613211817.1551967-5-vadfed@meta.comSigned-off-by: Alexei Starovoitov <ast@kernel.org>

9b560751

selftests: bpf: crypto: use NULL instead of 0-sized dynptr · 9363dc8d

Vadim Fedorenko authored Jun 13, 2024

Adjust selftests to use nullable option for state and IV arg.
Reviewed-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Vadim Fedorenko <vadfed@meta.com>
Link: https://lore.kernel.org/r/20240613211817.1551967-4-vadfed@meta.comSigned-off-by: Alexei Starovoitov <ast@kernel.org>

9363dc8d

bpf: crypto: make state and IV dynptr nullable · 65d6d61d

Vadim Fedorenko authored Jun 13, 2024

Some ciphers do not require state and IV buffer, but with current
implementation 0-sized dynptr is always needed. With adjustment to
verifier we can provide NULL instead of 0-sized dynptr. Make crypto
kfuncs ready for this.
Reviewed-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Vadim Fedorenko <vadfed@meta.com>
Link: https://lore.kernel.org/r/20240613211817.1551967-3-vadfed@meta.comSigned-off-by: Alexei Starovoitov <ast@kernel.org>

65d6d61d

bpf: verifier: make kfuncs args nullalble · a9079799

Vadim Fedorenko authored Jun 13, 2024

Some arguments to kfuncs might be NULL in some cases. But currently it's
not possible to pass NULL to any BTF structures because the check for
the suffix is located after all type checks. Move it to earlier place
to allow nullable args.
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Vadim Fedorenko <vadfed@meta.com>
Link: https://lore.kernel.org/r/20240613211817.1551967-2-vadfed@meta.comSigned-off-by: Alexei Starovoitov <ast@kernel.org>

a9079799

Merge branch 'fixes-for-kfunc-prototype-generation' · 373a4e13

Alexei Starovoitov authored Jun 13, 2024

Daniel Xu says:

====================
Fixes for kfunc prototype generation

This patchset fixes new warnings and errors that kfunc prototype
generation caused.
====================

Link: https://lore.kernel.org/r/cover.1718295425.git.dxu@dxuuu.xyzSigned-off-by: Alexei Starovoitov <ast@kernel.org>

373a4e13

bpf: selftests: Do not use generated kfunc prototypes for arena progs · 6a826014

Daniel Xu authored Jun 13, 2024

When selftests are built with a new enough clang, the arena selftests
opt-in to use LLVM address_space attribute annotations for arena
pointers.

These annotations are not emitted by kfunc prototype generation. This
causes compilation errors when clang sees conflicting prototypes.

Fix by opting arena selftests out of using generated kfunc prototypes.

Fixes: 770abbb5 ("bpftool: Support dumping kfunc prototypes from BTF")
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/r/202406131810.c1B8hTm8-lkp@intel.com/Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
Link: https://lore.kernel.org/r/fc59a617439ceea9ad8dfbb4786843c2169496ae.1718295425.git.dxu@dxuuu.xyzSigned-off-by: Alexei Starovoitov <ast@kernel.org>

6a826014

bpf: Fix bpf_dynptr documentation comments · 78746f93

Daniel Xu authored Jun 13, 2024

The function argument names were changed but the doc comment was not.
Fix htmldocs build warning by updating doc comments.

Fixes: cce4c40b ("bpf: treewide: Align kfunc signatures to prog point-of-view")
Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
Link: https://lore.kernel.org/r/d0b0eb05f91e12e5795966153b11998d3fc1d433.1718295425.git.dxu@dxuuu.xyzSigned-off-by: Alexei Starovoitov <ast@kernel.org>

78746f93

selftests/bpf: Validate CHECKSUM_COMPLETE option · 041c1dc9

Vadim Fedorenko authored Jun 06, 2024

Adjust skb program test to run with checksum validation.
Signed-off-by: Vadim Fedorenko <vadfed@meta.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20240606145851.229116-2-vadfed@meta.com

041c1dc9

bpf: Add CHECKSUM_COMPLETE to bpf test progs · a3cfe84c

Vadim Fedorenko authored Jun 06, 2024

Add special flag to validate that TC BPF program properly updates
checksum information in skb.
Signed-off-by: Vadim Fedorenko <vadfed@meta.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20240606145851.229116-1-vadfed@meta.com

a3cfe84c

12 Jun, 2024 6 commits

Merge branch 'bpf-support-dumping-kfunc-prototypes-from-btf' · 4ff57471

Alexei Starovoitov authored Jun 12, 2024

Daniel Xu says:

====================
bpf: Support dumping kfunc prototypes from BTF

This patchset enables both detecting as well as dumping compilable
prototypes for kfuncs.

The first commit instructs pahole to DECL_TAG kfuncs when available.
This requires v1.27 which was released on 6/11/24. With it, users will
be able to look at BTF inside vmlinux (or modules) and check if the
kfunc they want is available.

The final commit teaches bpftool how to dump kfunc prototypes. This
is done for developer convenience.

The rest of the commits are fixups to enable selftests to use the
newly dumped kfunc prototypes. With these, selftests will regularly
exercise the newly added codepaths.

Tested with and without the required pahole changes:

  * https://github.com/kernel-patches/bpf/pull/7186
  * https://github.com/kernel-patches/bpf/pull/7187

=== Changelog ===
From v4:
* Change bpf_session_cookie() return type
* Only fixup used fentry test kfunc prototypes
* Extract out projection detection into shared btf_is_projection_of()
* Fix kernel test robot build warnings about doc comments

From v3:
* Teach selftests to use dumped prototypes

From v2:
* Update Makefile.btf with pahole flag
* More error checking
* Output formatting changes
* Drop already-merged commit

From v1:
* Add __weak annotation
* Use btf_dump for kfunc prototypes
* Update kernel bpf_rdonly_cast() signature
====================

Link: https://lore.kernel.org/r/cover.1718207789.git.dxu@dxuuu.xyzSigned-off-by: Alexei Starovoitov <ast@kernel.org>

4ff57471

bpftool: Support dumping kfunc prototypes from BTF · 770abbb5

Daniel Xu authored Jun 12, 2024

This patch enables dumping kfunc prototypes from bpftool. This is useful
b/c with this patch, end users will no longer have to manually define
kfunc prototypes. For the kernel tree, this also means we can optionally
drop kfunc prototypes from:

        tools/testing/selftests/bpf/bpf_kfuncs.h
        tools/testing/selftests/bpf/bpf_experimental.h

Example usage:

        $ make PAHOLE=/home/dxu/dev/pahole/build/pahole -j30 vmlinux

        $ ./tools/bpf/bpftool/bpftool btf dump file ./vmlinux format c | rg "__ksym;" | head -3
        extern void cgroup_rstat_updated(struct cgroup *cgrp, int cpu) __weak __ksym;
        extern void cgroup_rstat_flush(struct cgroup *cgrp) __weak __ksym;
        extern struct bpf_key *bpf_lookup_user_key(u32 serial, u64 flags) __weak __ksym;
Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
Link: https://lore.kernel.org/r/bf6c08f9263c4bd9d10a717de95199d766a13f61.1718207789.git.dxu@dxuuu.xyzSigned-off-by: Alexei Starovoitov <ast@kernel.org>

770abbb5

bpf: selftests: xfrm: Opt out of using generated kfunc prototypes · c567cba3

Daniel Xu authored Jun 12, 2024

The xfrm_info selftest locally defines an aliased type such that folks
with CONFIG_XFRM_INTERFACE=m/n configs can still build the selftests.
See commit aa67961f ("selftests/bpf: Allow building bpf tests with CONFIG_XFRM_INTERFACE=[m|n]").

Thus, it is simpler if this selftest opts out of using enerated kfunc
prototypes. The preprocessor macro this commit uses will be introduced
in the final commit.
Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
Link: https://lore.kernel.org/r/afe0bb1c50487f52542cdd5230c4aef9e36ce250.1718207789.git.dxu@dxuuu.xyzSigned-off-by: Alexei Starovoitov <ast@kernel.org>

c567cba3

bpf: selftests: nf: Opt out of using generated kfunc prototypes · f709124d

Daniel Xu authored Jun 12, 2024

The bpf-nf selftests play various games with aliased types such that
folks with CONFIG_NF_CONNTRACK=m/n configs can still build the
selftests. See commits:

1058b6a7 ("selftests/bpf: Do not fail build if CONFIG_NF_CONNTRACK=m/n")
92afc532 ("selftests/bpf: Fix build errors if CONFIG_NF_CONNTRACK=m")

Thus, it is simpler if these selftests opt out of using generated kfunc
prototypes. The preprocessor macro this commit uses will be introduced
in the final commit.
Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
Link: https://lore.kernel.org/r/044a5b10cb3abd0d71cb1c818ee0bfc4a2239332.1718207789.git.dxu@dxuuu.xyzSigned-off-by: Alexei Starovoitov <ast@kernel.org>

f709124d

bpf: treewide: Align kfunc signatures to prog point-of-view · cce4c40b

Daniel Xu authored Jun 12, 2024

Previously, kfunc declarations in bpf_kfuncs.h (and others) used "user
facing" types for kfuncs prototypes while the actual kfunc definitions
used "kernel facing" types. More specifically: bpf_dynptr vs
bpf_dynptr_kern, __sk_buff vs sk_buff, and xdp_md vs xdp_buff.

It wasn't an issue before, as the verifier allows aliased types.
However, since we are now generating kfunc prototypes in vmlinux.h (in
addition to keeping bpf_kfuncs.h around), this conflict creates
compilation errors.

Fix this conflict by using "user facing" types in kfunc definitions.
This results in more casts, but otherwise has no additional runtime
cost.

Note, similar to 5b268d1e ("bpf: Have bpf_rdonly_cast() take a const
pointer"), we also make kfuncs take const arguments where appropriate in
order to make the kfunc more permissive.
Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
Link: https://lore.kernel.org/r/b58346a63a0e66bc9b7504da751b526b0b189a67.1718207789.git.dxu@dxuuu.xyzSigned-off-by: Alexei Starovoitov <ast@kernel.org>

cce4c40b

bpf: verifier: Relax caller requirements for kfunc projection type args · ec209ad8

Daniel Xu authored Jun 12, 2024

Currently, if a kfunc accepts a projection type as an argument (eg
struct __sk_buff *), the caller must exactly provide exactly the same
type with provable provenance.

However in practice, kfuncs that accept projection types _must_ cast to
the underlying type before use b/c projection type layouts are
completely made up. Thus, it is ok to relax the verifier rules around
implicit conversions.

We will use this functionality in the next commit when we align kfuncs
to user-facing types.
Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
Link: https://lore.kernel.org/r/e2c025cb09ccfd4af1ec9e18284dc3cecff7514d.1718207789.git.dxu@dxuuu.xyzSigned-off-by: Alexei Starovoitov <ast@kernel.org>

ec209ad8