    selftests/bpf: add batched, mostly in-kernel BPF triggering benchmarks · 7df4e597
    Andrii Nakryiko authored
    Existing kprobe/fentry triggering benchmarks have a 1-to-1 mapping
    between one syscall execution and one BPF program run. While we use a
    fast getpgid() syscall, syscall overhead can still be non-trivial.
    
    This patch adds a set of kprobe/fentry benchmarks that significantly
    amortize the cost of the syscall relative to the actual BPF triggering
    overhead. We do this by employing the BPF_PROG_TEST_RUN command to
    trigger a "driver" raw_tp program, which runs a tight parameterized loop
    calling a cheap BPF helper (bpf_get_numa_node_id()), to which the
    kprobe/fentry programs are attached for benchmarking.
    
    This way, one bpf() syscall causes N executions of the BPF program being
    benchmarked. N defaults to 100, but can be adjusted with the
    --trig-batch-iters CLI argument.
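
The amortization argument can be made concrete with a small cost model. This is only an illustrative sketch, not code from the patch: the function names and the split into a fixed per-syscall cost and a per-trigger cost are assumptions, and real measurements will also include driver-loop and helper-call overhead.

```c
#include <assert.h>

/*
 * Hypothetical cost model (not part of the patch): one bpf() syscall costs
 * syscall_ns and causes batch_iters runs of the benchmarked program, each
 * costing trigger_ns. Returns the average time attributed to one run.
 */
double ns_per_trigger(double syscall_ns, double trigger_ns,
                      unsigned int batch_iters)
{
    assert(batch_iters > 0);
    return syscall_ns / batch_iters + trigger_ns;
}

/* Fraction of the per-run time that is amortized syscall overhead. */
double syscall_share(double syscall_ns, double trigger_ns,
                     unsigned int batch_iters)
{
    double total = ns_per_trigger(syscall_ns, trigger_ns, batch_iters);

    return (syscall_ns / batch_iters) / total;
}
```

With made-up inputs of 100 ns per syscall and 10 ns per trigger, the syscall accounts for about 91% of the per-run time at N = 1 and about 9% at N = 100, which is why batching lets the measured number track the triggering cost far more closely.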
    
    For comparison, we also implement a new baseline program that, instead
    of triggering another BPF program, just does N atomic per-CPU counter
    increments, establishing the upper limit for all other types of programs
    within this batched benchmarking setup.
    
    Taking the final set of benchmarks added in this patch set (including
    tp/raw_tp/fmodret, added in a later patch), and keeping the "legacy"
    syscall-driven benchmarks for now, we can capture all triggering
    benchmarks in one place for comparison, before we remove the legacy ones
    (and rename xxx-batch into just xxx).
    
    $ benchs/run_bench_trigger.sh
    usermode-count       :   79.500 ± 0.024M/s
    kernel-count         :   49.949 ± 0.081M/s
    syscall-count        :    9.009 ± 0.007M/s
    
    fentry-batch         :   31.002 ± 0.015M/s
    fexit-batch          :   20.372 ± 0.028M/s
    fmodret-batch        :   21.651 ± 0.659M/s
    rawtp-batch          :   36.775 ± 0.264M/s
    tp-batch             :   19.411 ± 0.248M/s
    kprobe-batch         :   12.949 ± 0.220M/s
    kprobe-multi-batch   :   15.400 ± 0.007M/s
    kretprobe-batch      :    5.559 ± 0.011M/s
    kretprobe-multi-batch:    5.861 ± 0.003M/s
    
    fentry-legacy        :    8.329 ± 0.004M/s
    fexit-legacy         :    6.239 ± 0.003M/s
    fmodret-legacy       :    6.595 ± 0.001M/s
    rawtp-legacy         :    8.305 ± 0.004M/s
    tp-legacy            :    6.382 ± 0.001M/s
    kprobe-legacy        :    5.528 ± 0.003M/s
    kprobe-multi-legacy  :    5.864 ± 0.022M/s
    kretprobe-legacy     :    3.081 ± 0.001M/s
    kretprobe-multi-legacy:   3.193 ± 0.001M/s
    
    Note how the xxx-batch variants measure significantly higher throughput,
    even though the in-kernel overhead is exactly the same. As such, results
    can be compared only between benchmarks of the same kind (syscall-driven
    or batched), not across kinds:
    
    fentry-legacy        :    8.329 ± 0.004M/s
    fentry-batch         :   31.002 ± 0.015M/s
    
    kprobe-multi-legacy  :    5.864 ± 0.022M/s
    kprobe-multi-batch   :   15.400 ± 0.007M/s
    
    Note also that syscall-count sets a theoretical limit for
    syscall-triggered benchmarks, while kernel-count sets a similar limit
    for the batch variants. usermode-count is a happy and unachievable case
    of user space counting without doing any syscalls, and is mostly a
    measure of CPU speed for such a trivial benchmark.
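
One way to read these numbers against their baselines is to convert throughput to time per operation and subtract. The helpers below are a hypothetical sketch for doing that arithmetic; they are not part of the benchmark suite, and the subtraction is only a rough estimate because each figure comes from a separate run.

```c
#include <assert.h>

/* Convert throughput in millions of ops/sec (the "M/s" column) to ns/op. */
double mops_to_ns(double mops)
{
    assert(mops > 0.0);
    return 1000.0 / mops;
}

/*
 * Estimated per-trigger cost above a baseline, in ns. Compare batch
 * results with kernel-count and legacy results with syscall-count.
 */
double overhead_ns(double bench_mops, double baseline_mops)
{
    return mops_to_ns(bench_mops) - mops_to_ns(baseline_mops);
}
```

For example, overhead_ns(31.002, 49.949) puts fentry-batch at roughly 12.2 ns per trigger above kernel-count, while overhead_ns(8.329, 9.009) puts fentry-legacy at roughly 9.1 ns above syscall-count.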
    
    As mentioned, tp/raw_tp/fmodret require a kernel-side kfunc to produce a
    similar benchmark, which we address in a separate patch.
    
    Note that run_bench_trigger.sh allows overriding the list of benchmarks
    to run, which is very useful for performance work.
    
    Cc: Jiri Olsa <jolsa@kernel.org>
    Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
    Link: https://lore.kernel.org/r/20240326162151.3981687-3-andrii@kernel.org
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>