- 24 Jun, 2022 4 commits
-
-
Jiri Olsa authored
Some functions we use for bpf prologue generation are going to be deprecated. This change reworks the current code not to use them. We need to replace the following functions/struct:

- bpf_program__set_prep
- bpf_program__nth_fd
- struct bpf_prog_prep_result

Currently we use bpf_program__set_prep to hook a perf callback before the program is loaded and provide new instructions with the prologue. We replace this functionality by taking the instructions for a specific program, attaching the prologue to them and loading such new eBPF programs with the prologue using separate bpf_prog_load calls (outside the libbpf load machinery).

Before we can take and use program instructions, we need libbpf to actually load the program. This way we get the final shape of its instructions (with all relocations and verifier adjustments). There's one glitch though.. a perf kprobe program already assumes generated prologue code with proper values in the argument registers, so loading such a program directly will fail in the verifier. That's where the fallback pre-load handler fits in and prepends the initialization code to the program. Once such a program is loaded, we take its instructions, cut off the initialization code and prepend the prologue. I know.. sorry ;-)

To have access to the program when loading, this patch adds support to register a 'fallback' section handler to take care of perf kprobe programs. The fallback means that it handles any section definition besides the ones that libbpf handles. The handler serves two purposes:

- allows perf programs to have special arguments in the section name
- allows perf to use a pre-load callback where we can attach init code (zeroing all argument registers) to each perf program

Suggested-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Link: https://lore.kernel.org/bpf/20220616202214.70359-2-jolsa@kernel.org
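A hedged sketch of the replacement flow, using only public libbpf APIs (the helper name and the prologue buffer are hypothetical; the real perf change additionally registers the fallback section handler and strips the pre-load init code before prepending the prologue):

    #include <errno.h>
    #include <stdlib.h>
    #include <string.h>
    #include <bpf/bpf.h>
    #include <bpf/libbpf.h>

    /* Hypothetical helper: reload an already-loaded kprobe program with a
     * generated prologue prepended, bypassing the libbpf load machinery.
     */
    static int load_with_prologue(struct bpf_program *prog,
                                  const struct bpf_insn *prologue,
                                  size_t prologue_cnt)
    {
            const struct bpf_insn *body = bpf_program__insns(prog);
            size_t body_cnt = bpf_program__insn_cnt(prog);
            struct bpf_insn *insns;
            int fd;

            insns = calloc(prologue_cnt + body_cnt, sizeof(*insns));
            if (!insns)
                    return -ENOMEM;

            /* prologue first, then the relocated/adjusted program body */
            memcpy(insns, prologue, prologue_cnt * sizeof(*insns));
            memcpy(insns + prologue_cnt, body, body_cnt * sizeof(*insns));

            fd = bpf_prog_load(BPF_PROG_TYPE_KPROBE, bpf_program__name(prog),
                               "GPL", insns, prologue_cnt + body_cnt, NULL);
            free(insns);
            return fd;
    }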
-
Eduard Zingerman authored
This test verifies that bpf_loop() inlining works as expected when address of `env->prog` is updated. This address is updated upon BPF program reallocation. Reallocation is handled by bpf_prog_realloc(), which reuses old memory if page boundary is not crossed. The value of `len` in the test is chosen to cross this boundary on bpf_loop() patching. Verify that the use-after-free bug in inline_bpf_loop() reported by Dan Carpenter is fixed. Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20220624020613.548108-3-eddyz87@gmail.com
-
Eduard Zingerman authored
As reported by Dan Carpenter, the following statements in inline_bpf_loop() might cause a use-after-free bug:

  struct bpf_prog *new_prog;
  // ...
  new_prog = bpf_patch_insn_data(env, position, insn_buf, *cnt);
  // ...
  env->prog->insnsi[call_insn_offset].imm = callback_offset;

The bpf_patch_insn_data() might free the memory used by env->prog.

Fixes: 1ade2371 ("bpf: Inline calls to bpf_loop when callback is known")
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20220624020613.548108-2-eddyz87@gmail.com
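A sketch of the corrected access pattern (surrounding verifier code elided); the point is that the patched program must be reached through the bpf_patch_insn_data() return value rather than through the possibly stale env->prog pointer:

    struct bpf_prog *new_prog;

    new_prog = bpf_patch_insn_data(env, position, insn_buf, *cnt);
    if (!new_prog)
            return NULL;

    /* bpf_patch_insn_data() may have freed and reallocated the program,
     * so write through the returned pointer instead of env->prog.
     */
    new_prog->insnsi[call_insn_offset].imm = callback_offset;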
-
Simon Wang authored
Enhance readability a bit. Signed-off-by: Simon Wang <wangchuanguo@inspur.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/bpf/20220622031923.65692-1-wangchuanguo@inspur.com
-
- 23 Jun, 2022 9 commits
-
-
Jörn-Thorben Hinz authored
test_sock_fields__detach() got called with a null pointer here when one of the CHECKs or ASSERTs up to the test_sock_fields__open_and_load() call resulted in a jump to the "done" label. A skeleton's *__detach() is not safe to call with a null pointer, though. This led to a segfault. Go the easy route and only call test_sock_fields__destroy(), which is null-pointer safe and includes detaching. Came across this while looking[1] to introduce the usage of bpf_tcp_helpers.h (included in progs/test_sock_fields.c) together with vmlinux.h. [1] https://lore.kernel.org/bpf/629bc069dd807d7ac646f836e9dca28bbc1108e2.camel@mailbox.tu-berlin.de/ Fixes: 8f50f16f ("selftests/bpf: Extend verifier and bpf_sock tests for dst_port loads") Signed-off-by: Jörn-Thorben Hinz <jthinz@mailbox.tu-berlin.de> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com> Reviewed-by: Martin KaFai Lau <kafai@fb.com> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/bpf/20220621070116.307221-1-jthinz@mailbox.tu-berlin.de
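A minimal sketch of the resulting cleanup pattern (test body trimmed; the skeleton and label names follow the description above):

    void test_sock_fields(void)
    {
            struct test_sock_fields *skel;

            skel = test_sock_fields__open_and_load();
            if (!ASSERT_OK_PTR(skel, "open_and_load"))
                    goto done;

            /* ... CHECKs/ASSERTs that may jump to "done" ... */

    done:
            /* NULL-safe and detaches internally; no separate __detach()
             * call, which would crash on a NULL skeleton.
             */
            test_sock_fields__destroy(skel);
    }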
-
Alexei Starovoitov authored
Jörn-Thorben Hinz says:

====================

This series corrects some inconveniences for a BPF TCP CC that implements and uses tcp_congestion_ops.cong_control(). Until now, such a CC did not have all necessary write access to struct sock and unnecessarily needed to implement cong_avoid().

v4:
 - Remove braces around single statements after if
 - Don't check pointer passed to bpf_link__destroy()
v3:
 - Add a selftest writing sk_pacing_*
 - Add a selftest with incomplete tcp_congestion_ops
 - Add a selftest with unsupported get_info()
 - Remove an unused variable
 - Reword a comment about reg() in bpf_struct_ops_map_update_elem()
v2:
 - Drop redundant check for required functions and just rely on tcp_register_congestion_control() (Martin KaFai Lau)

====================

Reviewed-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Jörn-Thorben Hinz authored
Test whether a TCP CC implemented in BPF providing get_info() is rejected correctly. get_info() is unsupported in a BPF CC. The check for required functions in a BPF CC has moved; this test ensures unsupported functions are still rejected correctly. Signed-off-by: Jörn-Thorben Hinz <jthinz@mailbox.tu-berlin.de> Reviewed-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/r/20220622191227.898118-6-jthinz@mailbox.tu-berlin.de
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Jörn-Thorben Hinz authored
Test whether a TCP CC implemented in BPF providing neither cong_avoid() nor cong_control() is correctly rejected. This check solely depends on tcp_register_congestion_control() now, which is invoked during bpf_map__attach_struct_ops(). Signed-off-by: Jörn-Thorben Hinz <jthinz@mailbox.tu-berlin.de> Link: https://lore.kernel.org/r/20220622191227.898118-5-jthinz@mailbox.tu-berlin.de
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Jörn-Thorben Hinz authored
Test whether a TCP CC implemented in BPF is allowed to write sk_pacing_rate and sk_pacing_status in struct sock. This is needed when cong_control() is implemented and used. Signed-off-by: Jörn-Thorben Hinz <jthinz@mailbox.tu-berlin.de> Link: https://lore.kernel.org/r/20220622191227.898118-4-jthinz@mailbox.tu-berlin.de
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Jörn-Thorben Hinz authored
Remove the check for required and optional functions in a struct tcp_congestion_ops from bpf_tcp_ca.c. Rely on tcp_register_congestion_control() to reject a BPF CC that does not implement all required functions, as it will do for a non-BPF CC. When a CC implements tcp_congestion_ops.cong_control(), the alternate cong_avoid() is not in use in the TCP stack. Previously, a BPF CC was still forced to implement cong_avoid() as a no-op since it was non-optional in bpf_tcp_ca.c. Signed-off-by: Jörn-Thorben Hinz <jthinz@mailbox.tu-berlin.de> Reviewed-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/r/20220622191227.898118-3-jthinz@mailbox.tu-berlin.de
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Jörn-Thorben Hinz authored
A CC that implements tcp_congestion_ops.cong_control() should be able to control sk_pacing_rate and set sk_pacing_status, since tcp_update_pacing_rate() is never called in this case. A built-in CC or one from a kernel module is already able to write to both struct sock members. For a BPF program, write access has not been allowed yet. Signed-off-by: Jörn-Thorben Hinz <jthinz@mailbox.tu-berlin.de> Reviewed-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/r/20220622191227.898118-2-jthinz@mailbox.tu-berlin.de
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
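A hedged sketch of what this enables on the BPF side: a struct_ops congestion control whose cong_control() writes the pacing fields directly. The CC name, the rate value and the set of other ops are illustrative, and the wiring follows the usual selftests pattern rather than the exact test added in this series:

    #include "vmlinux.h"
    #include <bpf/bpf_helpers.h>
    #include <bpf/bpf_tracing.h>

    char _license[] SEC("license") = "GPL";

    SEC("struct_ops/ca_cong_control")
    void BPF_PROG(ca_cong_control, struct sock *sk, const struct rate_sample *rs)
    {
            /* Direct writes permitted by the verifier change above. */
            sk->sk_pacing_status = 1;        /* SK_PACING_NEEDED */
            sk->sk_pacing_rate = 125000000;  /* illustrative, bytes/sec */
    }

    SEC(".struct_ops")
    struct tcp_congestion_ops ca_sketch = {
            .cong_control = (void *)ca_cong_control,
            /* ssthresh and undo_cwnd remain mandatory and are omitted here;
             * cong_avoid() is no longer required when cong_control() is set.
             */
            .name = "bpf_ca_sketch",
    };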
-
Jian Shen authored
The prototype of .features is netdev_features_t, so it should use NETIF_F_LLTX and NETIF_F_HW_VLAN_STAG_TX, not NETIF_F_LLTX_BIT and NETIF_F_HW_VLAN_STAG_TX_BIT. Fixes: cf204a71 ("bpf, testing: Introduce 'gso_linear_no_head_frag' skb_segment test") Signed-off-by: Jian Shen <shenjian15@huawei.com> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/r/20220622135002.8263-1-shenjian15@huawei.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
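For reference, the distinction (NETIF_F_*_BIT are bit positions, while NETIF_F_* are the netdev_features_t masks built from them), sketched on a hypothetical device setup function:

    #include <linux/netdevice.h>

    static void dummy_setup_features(struct net_device *dev)
    {
            /* Wrong: assigns small bit indices to a feature mask. */
            /* dev->features = NETIF_F_LLTX_BIT | NETIF_F_HW_VLAN_STAG_TX_BIT; */

            /* Right: use the mask macros, i.e. 1 << NETIF_F_*_BIT. */
            dev->features = NETIF_F_LLTX | NETIF_F_HW_VLAN_STAG_TX;
    }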
-
Dave Marchevsky authored
Add benchmarks to demonstrate the performance cliff for local_storage get as the number of local_storage maps increases beyond the current local_storage implementation's cache size.

"sequential get" and "interleaved get" benchmarks are added, both of which do many bpf_task_storage_get calls on sets of task local_storage maps of various counts, while considering a single specific map to be 'important' and counting task_storage_gets to the important map separately in addition to the normal 'hits' count of all gets. The goal here is to mimic a scenario where a particular program using one map - the important one - is running on a system where many other local_storage maps exist and are accessed often.

While the "sequential get" benchmark does bpf_task_storage_get for map 0, 1, ..., {9, 99, 999} in order, the "interleaved" benchmark interleaves 4 bpf_task_storage_gets for the important map for every 10 map gets. This is meant to highlight performance differences when the important map is accessed far more frequently than non-important maps.

A "hashmap control" benchmark is also included for easy comparison of standard bpf hashmap lookup vs local_storage get. The benchmark is similar to "sequential get", but creates and uses BPF_MAP_TYPE_HASH instead of local storage. Only one inner map is created - a hashmap meant to hold a tid -> data mapping for all tasks. The size of the hashmap is hardcoded to my system's PID_MAX_LIMIT (4,194,304). The number of these keys which are actually fetched as part of the benchmark is configurable.

Addition of this benchmark is inspired by conversation with Alexei in a previous patchset's thread [0], which highlighted the need for such a benchmark to motivate and validate improvements to the local_storage implementation. My approach in that series focused on improving performance for explicitly-marked 'important' maps and was rejected with feedback to make more generally-applicable improvements while avoiding explicitly marking maps as important. Thus the benchmark reports both general and important-map-focused metrics, so the effect of future work on both is clear.

Regarding the benchmark results.
On a powerful system (Skylake, 20 cores, 256gb ram):

Hashmap Control
===============
num keys: 10
hashmap (control) sequential get: hits throughput: 20.900 ± 0.334 M ops/s, hits latency: 47.847 ns/op, important_hits throughput: 20.900 ± 0.334 M ops/s
num keys: 1000
hashmap (control) sequential get: hits throughput: 13.758 ± 0.219 M ops/s, hits latency: 72.683 ns/op, important_hits throughput: 13.758 ± 0.219 M ops/s
num keys: 10000
hashmap (control) sequential get: hits throughput: 6.995 ± 0.034 M ops/s, hits latency: 142.959 ns/op, important_hits throughput: 6.995 ± 0.034 M ops/s
num keys: 100000
hashmap (control) sequential get: hits throughput: 4.452 ± 0.371 M ops/s, hits latency: 224.635 ns/op, important_hits throughput: 4.452 ± 0.371 M ops/s
num keys: 4194304
hashmap (control) sequential get: hits throughput: 3.043 ± 0.033 M ops/s, hits latency: 328.587 ns/op, important_hits throughput: 3.043 ± 0.033 M ops/s

Local Storage
=============
num_maps: 1
local_storage cache sequential get: hits throughput: 47.298 ± 0.180 M ops/s, hits latency: 21.142 ns/op, important_hits throughput: 47.298 ± 0.180 M ops/s
local_storage cache interleaved get: hits throughput: 55.277 ± 0.888 M ops/s, hits latency: 18.091 ns/op, important_hits throughput: 55.277 ± 0.888 M ops/s
num_maps: 10
local_storage cache sequential get: hits throughput: 40.240 ± 0.802 M ops/s, hits latency: 24.851 ns/op, important_hits throughput: 4.024 ± 0.080 M ops/s
local_storage cache interleaved get: hits throughput: 48.701 ± 0.722 M ops/s, hits latency: 20.533 ns/op, important_hits throughput: 17.393 ± 0.258 M ops/s
num_maps: 16
local_storage cache sequential get: hits throughput: 44.515 ± 0.708 M ops/s, hits latency: 22.464 ns/op, important_hits throughput: 2.782 ± 0.044 M ops/s
local_storage cache interleaved get: hits throughput: 49.553 ± 2.260 M ops/s, hits latency: 20.181 ns/op, important_hits throughput: 15.767 ± 0.719 M ops/s
num_maps: 17
local_storage cache sequential get: hits throughput: 38.778 ± 0.302 M ops/s, hits latency: 25.788 ns/op, important_hits throughput: 2.284 ± 0.018 M ops/s
local_storage cache interleaved get: hits throughput: 43.848 ± 1.023 M ops/s, hits latency: 22.806 ns/op, important_hits throughput: 13.349 ± 0.311 M ops/s
num_maps: 24
local_storage cache sequential get: hits throughput: 19.317 ± 0.568 M ops/s, hits latency: 51.769 ns/op, important_hits throughput: 0.806 ± 0.024 M ops/s
local_storage cache interleaved get: hits throughput: 24.397 ± 0.272 M ops/s, hits latency: 40.989 ns/op, important_hits throughput: 6.863 ± 0.077 M ops/s
num_maps: 32
local_storage cache sequential get: hits throughput: 13.333 ± 0.135 M ops/s, hits latency: 75.000 ns/op, important_hits throughput: 0.417 ± 0.004 M ops/s
local_storage cache interleaved get: hits throughput: 16.898 ± 0.383 M ops/s, hits latency: 59.178 ns/op, important_hits throughput: 4.717 ± 0.107 M ops/s
num_maps: 100
local_storage cache sequential get: hits throughput: 6.360 ± 0.107 M ops/s, hits latency: 157.233 ns/op, important_hits throughput: 0.064 ± 0.001 M ops/s
local_storage cache interleaved get: hits throughput: 7.303 ± 0.362 M ops/s, hits latency: 136.930 ns/op, important_hits throughput: 1.907 ± 0.094 M ops/s
num_maps: 1000
local_storage cache sequential get: hits throughput: 0.452 ± 0.010 M ops/s, hits latency: 2214.022 ns/op, important_hits throughput: 0.000 ± 0.000 M ops/s
local_storage cache interleaved get: hits throughput: 0.542 ± 0.007 M ops/s, hits latency: 1843.341 ns/op, important_hits throughput: 0.136 ± 0.002 M ops/s

Looking at the "sequential get" results, it's clear that as the number of task local_storage maps grows beyond the current cache size (16), there's a significant reduction in hits throughput. Note that current local_storage implementation assigns a cache_idx to maps as they are created. Since "sequential get" is creating maps 0..n in order and then doing bpf_task_storage_get calls in the same order, the benchmark is effectively ensuring that a map will not be in cache when the program tries to access it.

For "interleaved get" results, important-map hits throughput is greatly increased as the important map is more likely to be in cache by virtue of being accessed far more frequently. Throughput still reduces as # maps increases, though.

To get a sense of the overhead of the benchmark program, I commented out bpf_task_storage_get/bpf_map_lookup_elem in local_storage_bench.c and ran the benchmark on the same host as the 'real' run. Results:

Hashmap Control
===============
num keys: 10
hashmap (control) sequential get: hits throughput: 54.288 ± 0.655 M ops/s, hits latency: 18.420 ns/op, important_hits throughput: 54.288 ± 0.655 M ops/s
num keys: 1000
hashmap (control) sequential get: hits throughput: 52.913 ± 0.519 M ops/s, hits latency: 18.899 ns/op, important_hits throughput: 52.913 ± 0.519 M ops/s
num keys: 10000
hashmap (control) sequential get: hits throughput: 53.480 ± 1.235 M ops/s, hits latency: 18.699 ns/op, important_hits throughput: 53.480 ± 1.235 M ops/s
num keys: 100000
hashmap (control) sequential get: hits throughput: 54.982 ± 1.902 M ops/s, hits latency: 18.188 ns/op, important_hits throughput: 54.982 ± 1.902 M ops/s
num keys: 4194304
hashmap (control) sequential get: hits throughput: 50.858 ± 0.707 M ops/s, hits latency: 19.662 ns/op, important_hits throughput: 50.858 ± 0.707 M ops/s

Local Storage
=============
num_maps: 1
local_storage cache sequential get: hits throughput: 110.990 ± 4.828 M ops/s, hits latency: 9.010 ns/op, important_hits throughput: 110.990 ± 4.828 M ops/s
local_storage cache interleaved get: hits throughput: 161.057 ± 4.090 M ops/s, hits latency: 6.209 ns/op, important_hits throughput: 161.057 ± 4.090 M ops/s
num_maps: 10
local_storage cache sequential get: hits throughput: 112.930 ± 1.079 M ops/s, hits latency: 8.855 ns/op, important_hits throughput: 11.293 ± 0.108 M ops/s
local_storage cache interleaved get: hits throughput: 115.841 ± 2.088 M ops/s, hits latency: 8.633 ns/op, important_hits throughput: 41.372 ± 0.746 M ops/s
num_maps: 16
local_storage cache sequential get: hits throughput: 115.653 ± 0.416 M ops/s, hits latency: 8.647 ns/op, important_hits throughput: 7.228 ± 0.026 M ops/s
local_storage cache interleaved get: hits throughput: 138.717 ± 1.649 M ops/s, hits latency: 7.209 ns/op, important_hits throughput: 44.137 ± 0.525 M ops/s
num_maps: 17
local_storage cache sequential get: hits throughput: 112.020 ± 1.649 M ops/s, hits latency: 8.927 ns/op, important_hits throughput: 6.598 ± 0.097 M ops/s
local_storage cache interleaved get: hits throughput: 128.089 ± 1.960 M ops/s, hits latency: 7.807 ns/op, important_hits throughput: 38.995 ± 0.597 M ops/s
num_maps: 24
local_storage cache sequential get: hits throughput: 92.447 ± 5.170 M ops/s, hits latency: 10.817 ns/op, important_hits throughput: 3.855 ± 0.216 M ops/s
local_storage cache interleaved get: hits throughput: 128.844 ± 2.808 M ops/s, hits latency: 7.761 ns/op, important_hits throughput: 36.245 ± 0.790 M ops/s
num_maps: 32
local_storage cache sequential get: hits throughput: 102.042 ± 1.462 M ops/s, hits latency: 9.800 ns/op, important_hits throughput: 3.194 ± 0.046 M ops/s
local_storage cache interleaved get: hits throughput: 126.577 ± 1.818 M ops/s, hits latency: 7.900 ns/op, important_hits throughput: 35.332 ± 0.507 M ops/s
num_maps: 100
local_storage cache sequential get: hits throughput: 111.327 ± 1.401 M ops/s, hits latency: 8.983 ns/op, important_hits throughput: 1.113 ± 0.014 M ops/s
local_storage cache interleaved get: hits throughput: 131.327 ± 1.339 M ops/s, hits latency: 7.615 ns/op, important_hits throughput: 34.302 ± 0.350 M ops/s
num_maps: 1000
local_storage cache sequential get: hits throughput: 101.978 ± 0.563 M ops/s, hits latency: 9.806 ns/op, important_hits throughput: 0.102 ± 0.001 M ops/s
local_storage cache interleaved get: hits throughput: 141.084 ± 1.098 M ops/s, hits latency: 7.088 ns/op, important_hits throughput: 35.430 ± 0.276 M ops/s

Adjusting for overhead, latency numbers for "hashmap control" and "sequential get" are:

hashmap_control_1k: ~53.8ns
hashmap_control_10k: ~124.2ns
hashmap_control_100k: ~206.5ns
sequential_get_1: ~12.1ns
sequential_get_10: ~16.0ns
sequential_get_16: ~13.8ns
sequential_get_17: ~16.8ns
sequential_get_24: ~40.9ns
sequential_get_32: ~65.2ns
sequential_get_100: ~148.2ns
sequential_get_1000: ~2204ns

Clearly demonstrating a cliff.

In the discussion for v1 of this patch, Alexei noted that local_storage was 2.5x faster than a large hashmap when initially implemented [1]. The benchmark results show that local_storage is 5-10x faster: a long-running BPF application putting some pid-specific info into a hashmap for each pid it sees will probably see on the order of 10-100k pids. Bench numbers for hashmaps of this size are ~10x slower than sequential_get_16, but as the number of local_storage maps grows far past local_storage cache size the performance advantage shrinks and eventually reverses.

When running the benchmarks it may be necessary to bump 'open files' ulimit for a successful run.

[0]: https://lore.kernel.org/all/20220420002143.1096548-1-davemarchevsky@fb.com
[1]: https://lore.kernel.org/bpf/20220511173305.ftldpn23m4ski3d3@MBP-98dd607d3435.dhcp.thefacebook.com/

Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Link: https://lore.kernel.org/r/20220620222554.270578-1-davemarchevsky@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
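For context, a minimal sketch of the BPF-side access pattern the benchmark measures (hook point, map and counter names are illustrative; the real program is progs/local_storage_bench.c and iterates over many such maps):

    #include "vmlinux.h"
    #include <bpf/bpf_helpers.h>

    char _license[] SEC("license") = "GPL";

    struct {
            __uint(type, BPF_MAP_TYPE_TASK_STORAGE);
            __uint(map_flags, BPF_F_NO_PREALLOC);
            __type(key, int);
            __type(value, __u64);
    } important_map SEC(".maps");

    __u64 hits;

    SEC("tp/syscalls/sys_enter_getpgid")
    int measure_get(void *ctx)
    {
            struct task_struct *task = bpf_get_current_task_btf();
            __u64 *data;

            data = bpf_task_storage_get(&important_map, task, NULL,
                                        BPF_LOCAL_STORAGE_GET_F_CREATE);
            if (data)
                    __sync_fetch_and_add(&hits, 1);
            return 0;
    }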
-
- 22 Jun, 2022 1 commit
-
-
Andy Gospodarek authored
This changes the section name for the bpf program embedded in these files to "xdp.frags" to allow the programs to be loaded on drivers that are using an MTU greater than PAGE_SIZE. Rather than directly accessing the buffers, the packet data is now accessed via xdp helper functions to provide an example for those who may need to write more complex programs. v2: remove new unnecessary variable Signed-off-by: Andy Gospodarek <gospo@broadcom.com> Acked-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Lorenzo Bianconi <lorenzo@kernel.org> Link: https://lore.kernel.org/r/20220621175402.35327-1-gospo@broadcom.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
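A hedged sketch of the pattern the samples now follow: an "xdp.frags" section name plus bpf_xdp_load_bytes() instead of direct data/data_end pointer walks (program and variable names are illustrative):

    #include <linux/bpf.h>
    #include <linux/if_ether.h>
    #include <bpf/bpf_helpers.h>
    #include <bpf/bpf_endian.h>

    char _license[] SEC("license") = "GPL";

    SEC("xdp.frags")
    int xdp_pass_ipv4(struct xdp_md *ctx)
    {
            struct ethhdr eth;

            /* Copy from the (possibly multi-buffer) frame instead of
             * walking ctx->data / ctx->data_end directly.
             */
            if (bpf_xdp_load_bytes(ctx, 0, &eth, sizeof(eth)))
                    return XDP_DROP;

            if (eth.h_proto != bpf_htons(ETH_P_IP))
                    return XDP_PASS;

            /* ... further parsing via bpf_xdp_load_bytes()/bpf_xdp_store_bytes() ... */
            return XDP_PASS;
    }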
-
- 21 Jun, 2022 9 commits
-
-
Jakub Sitnicki authored
Today doing a BPF tail call after a BPF to BPF call, that is from a subprogram, is allowed only by the x86-64 BPF JIT. Mixing these features requires support from the JIT. The tail call count has to be tracked through BPF to BPF calls, as well as through BPF tail calls, to prevent unbounded chains of tail calls.

The arm64 BPF JIT stores the tail call count (TCC) in a dedicated register (X26). This makes it easier to support bpf2bpf calls mixed with tail calls than on the x86 platform. In order to keep the tail call count intact throughout bpf2bpf calls, all we need to do is tweak the program prologue generator. When emitting the prologue for a subprogram, we skip the block that initializes the tail call count and emits a jump pad for the tail call.

With this change, a sample execution flow where a bpf2bpf call is followed by a tail call would look like so:

int entry(struct __sk_buff *skb):
0xffffffc0090151d4: paciasp
0xffffffc0090151d8: stp x29, x30, [sp, #-16]!
0xffffffc0090151dc: mov x29, sp
0xffffffc0090151e0: stp x19, x20, [sp, #-16]!
0xffffffc0090151e4: stp x21, x22, [sp, #-16]!
0xffffffc0090151e8: stp x25, x26, [sp, #-16]!
0xffffffc0090151ec: stp x27, x28, [sp, #-16]!
0xffffffc0090151f0: mov x25, sp
0xffffffc0090151f4: mov x26, #0x0 // <- init TCC only
0xffffffc0090151f8: bti j         // in main prog
0xffffffc0090151fc: sub x27, x25, #0x0
0xffffffc009015200: sub sp, sp, #0x10
0xffffffc009015204: mov w1, #0x0
0xffffffc009015208: mov x10, #0xffffffffffffffff
0xffffffc00901520c: strb w1, [x25, x10]
0xffffffc009015210: mov x10, #0xffffffffffffd25c
0xffffffc009015214: movk x10, #0x902, lsl #16
0xffffffc009015218: movk x10, #0xffc0, lsl #32
0xffffffc00901521c: blr x10 -------------------. // bpf2bpf call
0xffffffc009015220: add x7, x0, #0x0 <-------------.
0xffffffc009015224: add sp, sp, #0x10 | |
0xffffffc009015228: ldp x27, x28, [sp], #16 | |
0xffffffc00901522c: ldp x25, x26, [sp], #16 | |
0xffffffc009015230: ldp x21, x22, [sp], #16 | |
0xffffffc009015234: ldp x19, x20, [sp], #16 | |
0xffffffc009015238: ldp x29, x30, [sp], #16 | |
0xffffffc00901523c: add x0, x7, #0x0 | |
0xffffffc009015240: autiasp | |
0xffffffc009015244: ret | |
                        | |
int subprog_tail(struct __sk_buff *skb): | |
0xffffffc00902d25c: paciasp <----------------------' |
0xffffffc00902d260: stp x29, x30, [sp, #-16]! |
0xffffffc00902d264: mov x29, sp |
0xffffffc00902d268: stp x19, x20, [sp, #-16]! |
0xffffffc00902d26c: stp x21, x22, [sp, #-16]! |
0xffffffc00902d270: stp x25, x26, [sp, #-16]! |
0xffffffc00902d274: stp x27, x28, [sp, #-16]! |
0xffffffc00902d278: mov x25, sp |
0xffffffc00902d27c: sub x27, x25, #0x0 |
0xffffffc00902d280: sub sp, sp, #0x10 | // <- end of prologue, notice:
0xffffffc00902d284: add x19, x0, #0x0 | // 1) TCC not touched, and
0xffffffc00902d288: mov w0, #0x1 |      // 2) no tail call jump pad
0xffffffc00902d28c: mov x10, #0xfffffffffffffffc |
0xffffffc00902d290: str w0, [x25, x10] |
0xffffffc00902d294: mov x20, #0xffffff80ffffffff |
0xffffffc00902d298: movk x20, #0xc033, lsl #16 |
0xffffffc00902d29c: movk x20, #0x4e00 |
0xffffffc00902d2a0: add x0, x19, #0x0 |
0xffffffc00902d2a4: add x1, x20, #0x0 |
0xffffffc00902d2a8: mov x2, #0x0 |
0xffffffc00902d2ac: mov w10, #0x24 |
0xffffffc00902d2b0: ldr w10, [x1, x10] |
0xffffffc00902d2b4: add w2, w2, #0x0 |
0xffffffc00902d2b8: cmp w2, w10 |
0xffffffc00902d2bc: b.cs 0xffffffc00902d2f8 |
0xffffffc00902d2c0: mov w10, #0x21 |
0xffffffc00902d2c4: cmp x26, x10 | // TCC >= MAX_TAIL_CALL_CNT?
0xffffffc00902d2c8: b.cs 0xffffffc00902d2f8 |
0xffffffc00902d2cc: add x26, x26, #0x1 | // TCC++
0xffffffc00902d2d0: mov w10, #0x110 |
0xffffffc00902d2d4: add x10, x1, x10 |
0xffffffc00902d2d8: lsl x11, x2, #3 |
0xffffffc00902d2dc: ldr x11, [x10, x11] |
0xffffffc00902d2e0: cbz x11, 0xffffffc00902d2f8 |
0xffffffc00902d2e4: mov w10, #0x30 |
0xffffffc00902d2e8: ldr x10, [x11, x10] |
0xffffffc00902d2ec: add x10, x10, #0x24 |
0xffffffc00902d2f0: add sp, sp, #0x10 | // <- destroy just current
0xffffffc00902d2f4: br x10 ---------------------. | // BPF stack frame
0xffffffc00902d2f8: mov x10, #0xfffffffffffffffc | | // before the tail call
0xffffffc00902d2fc: ldr w7, [x25, x10] | |
0xffffffc00902d300: add sp, sp, #0x10 | |
0xffffffc00902d304: ldp x27, x28, [sp], #16 | |
0xffffffc00902d308: ldp x25, x26, [sp], #16 | |
0xffffffc00902d30c: ldp x21, x22, [sp], #16 | |
0xffffffc00902d310: ldp x19, x20, [sp], #16 | |
0xffffffc00902d314: ldp x29, x30, [sp], #16 | |
0xffffffc00902d318: add x0, x7, #0x0 | |
0xffffffc00902d31c: autiasp | |
0xffffffc00902d320: ret | |
                        | |
int classifier_0(struct __sk_buff *skb): | |
0xffffffc008ff5874: paciasp | |
0xffffffc008ff5878: stp x29, x30, [sp, #-16]! | |
0xffffffc008ff587c: mov x29, sp | |
0xffffffc008ff5880: stp x19, x20, [sp, #-16]! | |
0xffffffc008ff5884: stp x21, x22, [sp, #-16]! | |
0xffffffc008ff5888: stp x25, x26, [sp, #-16]! | |
0xffffffc008ff588c: stp x27, x28, [sp, #-16]! | |
0xffffffc008ff5890: mov x25, sp | |
0xffffffc008ff5894: mov x26, #0x0 | |
0xffffffc008ff5898: bti j <----------------------' |
0xffffffc008ff589c: sub x27, x25, #0x0 |
0xffffffc008ff58a0: sub sp, sp, #0x0 |
0xffffffc008ff58a4: mov x0, #0xffffffc0ffffffff |
0xffffffc008ff58a8: movk x0, #0x8fc, lsl #16 |
0xffffffc008ff58ac: movk x0, #0x6000 |
0xffffffc008ff58b0: mov w1, #0x1 |
0xffffffc008ff58b4: str w1, [x0] |
0xffffffc008ff58b8: mov w7, #0x0 |
0xffffffc008ff58bc: mov sp, sp |
0xffffffc008ff58c0: ldp x27, x28, [sp], #16 |
0xffffffc008ff58c4: ldp x25, x26, [sp], #16 |
0xffffffc008ff58c8: ldp x21, x22, [sp], #16 |
0xffffffc008ff58cc: ldp x19, x20, [sp], #16 |
0xffffffc008ff58d0: ldp x29, x30, [sp], #16 |
0xffffffc008ff58d4: add x0, x7, #0x0 |
0xffffffc008ff58d8: autiasp |
0xffffffc008ff58dc: ret -------------------------------'

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20220617105735.733938-3-jakub@cloudflare.com
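The BPF shape being JITed in the listing above, as a C sketch (section names and the map are illustrative and mirror the selftests' tailcall_bpf2bpf programs; user space still has to populate jmp_table with the fd of classifier_0):

    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>

    char _license[] SEC("license") = "GPL";

    struct {
            __uint(type, BPF_MAP_TYPE_PROG_ARRAY);
            __uint(max_entries, 1);
            __uint(key_size, sizeof(__u32));
            __uint(value_size, sizeof(__u32));
    } jmp_table SEC(".maps");

    SEC("tc")
    int classifier_0(struct __sk_buff *skb)
    {
            return 0;
    }

    static __noinline int subprog_tail(struct __sk_buff *skb)
    {
            /* Tail call issued from a subprogram (bpf2bpf callee); the JIT
             * must carry the tail call count across the bpf2bpf call.
             */
            bpf_tail_call_static(skb, &jmp_table, 0);
            return 1;
    }

    SEC("tc")
    int entry(struct __sk_buff *skb)
    {
            return subprog_tail(skb);
    }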
-
Tony Ambardar authored
The BPF core/verifier is hard-coded to permit mixing bpf2bpf and tail calls for only x86-64. Change the logic to instead rely on a new weak function 'bool bpf_jit_supports_subprog_tailcalls(void)', which a capable JIT backend can override. Update the x86-64 eBPF JIT to reflect this. Signed-off-by: Tony Ambardar <Tony.Ambardar@gmail.com> [jakub: drop MIPS bits and tweak patch subject] Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20220617105735.733938-2-jakub@cloudflare.com
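A sketch of the resulting hook (file placement shown in comments, details simplified): the core provides a weak default returning false, the verifier consults it instead of a hard-coded x86-64 check, and a capable JIT overrides the symbol:

    /* kernel/bpf/core.c: default for JITs without support */
    bool __weak bpf_jit_supports_subprog_tailcalls(void)
    {
            return false;
    }

    /* arch/x86/net/bpf_jit_comp.c: the x86-64 JIT opts in */
    bool bpf_jit_supports_subprog_tailcalls(void)
    {
            return true;
    }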
-
Alexei Starovoitov authored
Eduard Zingerman says:

====================

Hi Everyone,

This is the next iteration of the patch. It includes changes suggested by Song, Joanne and Alexei. Please find the updated intro message and change log below.

This patch implements inlining of calls to the bpf_loop helper function when bpf_loop's callback is statically known. E.g. the rewrite does the following transformation during BPF program processing:

  bpf_loop(10, foo, NULL, 0);

->

  for (int i = 0; i < 10; ++i)
    foo(i, NULL);

The transformation leads to measurable latency change for simple loops. Measurements using `benchs/run_bench_bpf_loop.sh` inside QEMU / KVM on i7-4710HQ CPU show a drop in latency from 14 ns/op to 2 ns/op.

The change is split in five parts:

* Update to test_verifier.c to specify expected and unexpected instruction sequences. This allows to check BPF program rewrites applied by e.g. the do_misc_fixups function.
* Update to test_verifier.c to specify BTF function infos and types per test case. This is necessary for tests that load sub-program addresses to a variable because of the checks applied by the check_ld_imm function.
* The update to verifier.c that tracks state of the parameters for each bpf_loop call in a program and decides whether it could be replaced by a loop.
* A set of test cases for `test_verifier` that use capabilities added by the first two patches to verify instructions produced by inlining logic.
* Two test cases for `test_prog` to check that possible corner cases behave as expected.

Additional details are available in commit messages for each patch.

Changes since v7:
 - Call to `mark_chain_precision` is added in `loop_flag_is_zero` to avoid potential issues with state pruning and precision tracking.
 - `flags non-zero` test_verifier test case is updated to have two execution paths reaching `bpf_loop` call, one with flags = 0, another with flags = 1. Potentially this test case should be able to show that call to `mark_chain_precision` is necessary in `loop_flag_is_zero` but not at the moment. Please refer to discussion for [PATCH bpf-next v7 3/5] for additional details.
 - `stack_depth_extra` computation is updated to guarantee that R6, R7 and R8 offsets are always aligned on 8 byte boundary.
 - `stack locations for loop vars` test_verifier test case updated to show that R6, R7, R8 offsets are indeed aligned when function stack depth is not a multiple of 8.
 - I removed Song Liu's ACK from commit message for [PATCH bpf-next v8 4/5] because I updated the patch. (Please let me know if I had to keep the ACK tag).

Changes since v6:
 - Return value of the `optimize_bpf_loop` function is no longer ignored. This is necessary to properly propagate -ENOMEM error.

Changes since v5:
 - Added function `loop_flag_is_zero` to skip a few checks in `update_loop_inline_state` when loop instruction is not fit for inline.

Changes since v4:
 - Added missing `static` modifier for `update_loop_inline_state` and `inline_bpf_loop` functions.
 - `update_loop_inline_state` updated for better readability.
 - Fields `initialized` and `fit_for_inline` of `struct bpf_loop_inline_state` are changed back from `bool` to bitfields.
 - Acks from Song Liu added to comments for patches 1/5, 2/5, 4/5, 5/5.

Changes since v3:
 - Function `adjust_stack_depth_for_loop_inlining` is replaced by function `optimize_bpf_loop`. Function `optimize_bpf_loop` is responsible for both stack depth adjustment and call instruction replacement.
 - Changes in `do_misc_fixups` are reverted.
 - Changes in `adjust_subprog_starts_after_remove` are reverted and function `adjust_loop_inline_subprogno` is removed. This is possible because call to `optimize_bpf_loop` is placed before the dead code removal in `opt_remove_dead_code` (in contrast to the position of `do_misc_fixups` where inlining was done in v3).
 - Field `bpf_insn_aux_data.loop_inline_state` is now a part of anonymous union at the start of the `bpf_insn_aux_data`.
 - Data structure `bpf_loop_inline_state` is simplified to use single flag field `fit_for_inline` instead of separate fields `flags_is_zero` & `callback_is_constant`.
 - Macro definition `BPF_MAX_LOOPS` is moved from `include/linux/bpf_verifier.h` to `include/linux/bpf.h` to avoid include of `include/linux/bpf_verifier.h` in `bpf_iter.c`.
 - `inline_bpf_loop` changed back to use array initialization and hard coded offsets as in v2.
 - Style / formatting updates.

Changes since v2:
 - fix for `stack_check` test case in `test_progs-no_alu32`, all tests are passing now;
 - v2 3/3 patch is split in three parts:
   - kernel changes
   - test_verifier changes
   - test_prog changes
 - updated `inline_bpf_loop` in `verifier.c` to calculate each offset used in instructions to avoid "magic" numbers;
 - removed newline handling logic in `fail_log` branch of `do_single_test` in `test_verifier.c` to simplify the patch set;
 - styling fixes suggested in review for v2 of this patch set.

Changes since v1:
 - allow to use SKIP_INSNS in instruction pattern specification in test_verifier tests;
 - fix for a bug in spill offset assignment for loop vars when bpf_loop is located in a non-main function.

====================

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Eduard Zingerman authored
Two new test BPF programs for test_prog selftests checking bpf_loop behavior. Both are corner cases for the bpf_loop inlining transformation:
- check that bpf_loop behaves correctly when the callback function is not a compile time constant
- check that local function variables are not affected by allocating additional stack storage for registers spilled by loop inlining

Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/r/20220620235344.569325-6-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Eduard Zingerman authored
A number of test cases for BPF selftests test_verifier to check how bpf_loop inline transformation rewrites the BPF program. The following cases are covered:
- happy path
- no-rewrite when flags is non-zero
- no-rewrite when callback is non-constant
- subprogno in insn_aux is updated correctly when dead sub-programs are removed
- check that correct stack offsets are assigned for spilling of R6-R8 registers

Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/r/20220620235344.569325-5-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Eduard Zingerman authored
Calls to `bpf_loop` are replaced with direct loops to avoid indirection. E.g. the following:

  bpf_loop(10, foo, NULL, 0);

Is replaced by equivalent of the following:

  for (int i = 0; i < 10; ++i)
    foo(i, NULL);

This transformation could be applied when:
- callback is known and does not change during program execution;
- flags passed to `bpf_loop` are always zero.

Inlining logic works as follows:
- During execution simulation function `update_loop_inline_state` tracks the following information for each `bpf_loop` call instruction:
  - is callback known and constant?
  - are flags constant and zero?
- Function `optimize_bpf_loop` increases stack depth for functions where `bpf_loop` calls can be inlined and invokes `inline_bpf_loop` to apply the inlining. The additional stack space is used to spill registers R6, R7 and R8. These registers are used as loop counter, loop maximal bound and callback context parameter.

Measurements using `benchs/run_bench_bpf_loop.sh` inside QEMU / KVM on i7-4710HQ CPU show a drop in latency from 14 ns/op to 2 ns/op.

Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/r/20220620235344.569325-4-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
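For reference, a call that qualifies for this rewrite looks as follows on the BPF side (names and hook point are illustrative; the callback must be a statically known function and the flags argument must always be zero):

    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>

    char _license[] SEC("license") = "GPL";

    __u64 result;

    static int add_one(__u32 index, void *ctx)
    {
            *(long *)ctx += 1;
            return 0; /* 0 = continue, 1 = break */
    }

    SEC("tp/syscalls/sys_enter_getpgid")
    int count_to_ten(void *ctx)
    {
            long sum = 0;

            /* Constant callback + zero flags: the verifier can inline this
             * into a direct loop that calls add_one() per iteration.
             */
            bpf_loop(10, add_one, &sum, 0);
            result = sum;
            return 0;
    }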
-
Eduard Zingerman authored
The BTF and func_info specification for test_verifier tests follows the same notation as in prog_tests/btf.c tests. E.g.:

  ...
  .func_info = { { 0, 6 }, { 8, 7 } },
  .func_info_cnt = 2,
  .btf_strings = "\0int\0",
  .btf_types = {
    BTF_TYPE_INT_ENC(1, BTF_INT_SIGNED, 0, 32, 4),
    BTF_PTR_ENC(1),
  },
  ...

The BTF specification is loaded only when specified.

Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/r/20220620235344.569325-3-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Eduard Zingerman authored
Allows to specify expected and unexpected instruction sequences in test_verifier test cases. The instructions are requested from the kernel after BPF program loading, thus allowing to check some of the transformations applied by the BPF verifier.

- `expected_insns` field specifies a sequence of instructions expected to be found in the program;
- `unexpected_insns` field specifies a sequence of instructions that are not expected to be found in the program;
- `INSN_OFF_MASK` and `INSN_IMM_MASK` values could be used to mask `off` and `imm` fields.
- `SKIP_INSNS` could be used to specify that some instructions in the (un)expected pattern are not important (behavior similar to usage of `\t` in `errstr` field).

The intended usage is as follows:

  {
    "inline simple bpf_loop call",
    .insns = {
      /* main */
      BPF_ALU64_IMM(BPF_MOV, BPF_REG_1, 1),
      BPF_RAW_INSN(BPF_LD | BPF_IMM | BPF_DW, BPF_REG_2, BPF_PSEUDO_FUNC, 0, 6),
      ...
      BPF_EXIT_INSN(),
      /* callback */
      BPF_ALU64_IMM(BPF_MOV, BPF_REG_0, 1),
      BPF_EXIT_INSN(),
    },
    .expected_insns = {
      BPF_ALU64_IMM(BPF_MOV, BPF_REG_1, 1),
      SKIP_INSNS(),
      BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, BPF_PSEUDO_CALL, 8, 1)
    },
    .unexpected_insns = {
      BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, INSN_OFF_MASK, INSN_IMM_MASK),
    },
    .prog_type = BPF_PROG_TYPE_TRACEPOINT,
    .result = ACCEPT,
    .runs = 0,
  },

Here it is expected that the move of 1 to register 1 would remain in place and the helper function call instruction would be replaced by a relative call instruction.

Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/r/20220620235344.569325-2-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Delyan Kratunov authored
The call into bpf from uprobes needs to be gated now that it doesn't use the trace_events.h helpers. Randy found this as a randconfig build failure on linux-next [1]. [1]: https://lore.kernel.org/linux-next/2de99180-7d55-2fdf-134d-33198c27cc58@infradead.org/
Reported-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Delyan Kratunov <delyank@fb.com> Tested-by: Randy Dunlap <rdunlap@infradead.org> Acked-by: Randy Dunlap <rdunlap@infradead.org> Link: https://lore.kernel.org/r/cb8bfbbcde87ed5d811227a393ef4925f2aadb7b.camel@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
- 20 Jun, 2022 17 commits
-
-
Maxim Mikityanskiy authored
This commit adds the kernel config options needed to run the recently added xdp_synproxy test. Users without these options will hit errors like this:

  test_synproxy:FAIL:iptables -t raw -I PREROUTING -i tmp1 -p tcp -m tcp --syn --dport 8080 -j CT --notrack unexpected error: 256 (errno 22)

Suggested-by: Alexei Starovoitov <alexei.starovoitov@gmail.com> Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20220620104939.4094104-1-maximmi@nvidia.com
-
Cong Wang authored
We always allocate skmsg with kzalloc(), so there is no need to call memset(0) on it; the only thing we need from sk_msg_init() is sg_init_marker(). So introduce a new helper which is just kzalloc()+sg_init_marker(). This saves an unnecessary memset(0) for skmsg on the fast path. Signed-off-by: Cong Wang <cong.wang@bytedance.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/bpf/20220615162014.89193-5-xiyou.wangcong@gmail.com
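A sketch of such a helper under the description above (the actual name, GFP handling and call sites in the patch may differ):

    /* Sketch: return an already-zeroed sk_msg with only the scatterlist
     * end marker initialized, instead of kzalloc() + full sk_msg_init().
     */
    static struct sk_msg *alloc_sk_msg(gfp_t gfp)
    {
            struct sk_msg *msg;

            msg = kzalloc(sizeof(*msg), gfp | __GFP_NOWARN);
            if (unlikely(!msg))
                    return NULL;
            sg_init_marker(msg->sg.data, NR_MSG_FRAG_IDS);
            return msg;
    }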
-
Cong Wang authored
With ->read_skb() now we have an entire skb dequeued from the receive queue, so now we just need to grab an additional refcnt before passing its ownership to recv actors. And we should not touch them any more, particularly for skb->sk. Fortunately, skb->sk is already set for most of the protocols except UDP where skb->sk has been stolen, so we have to fix it up for the UDP case. Signed-off-by: Cong Wang <cong.wang@bytedance.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/bpf/20220615162014.89193-4-xiyou.wangcong@gmail.com
-
Cong Wang authored
Currently both splice() and sockmap use ->read_sock() to read skb from the receive queue, but for sockmap we only read one entire skb at a time, so ->read_sock() is too conservative to use. Introduce a new proto_ops ->read_skb() which supports this semantic; with this we can finally pass the ownership of the skb to recv actors. For non-TCP protocols, all ->read_sock() can be simply converted to ->read_skb(). Signed-off-by: Cong Wang <cong.wang@bytedance.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/bpf/20220615162014.89193-3-xiyou.wangcong@gmail.com
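The shape of the new hook as described, abridged (the typedef and signature are written from the description above and may differ in detail from the final patch):

    /* The actor receives one whole skb dequeued from the receive queue
     * and takes ownership of it.
     */
    typedef int (*skb_read_actor_t)(struct sock *sk, struct sk_buff *skb);

    struct proto_ops {
            /* ... existing ops ... */
            int (*read_skb)(struct sock *sk, skb_read_actor_t recv_actor);
            /* ... */
    };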
-
Cong Wang authored
This patch introduces tcp_read_skb() based on tcp_read_sock(), a preparation for the next patch which actually introduces a new sock op. TCP is special here, because it has tcp_read_sock(), which is mainly used by splice(). tcp_read_sock() supports partial read and arbitrary offset, neither of which is needed for sockmap. Signed-off-by: Cong Wang <cong.wang@bytedance.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/bpf/20220615162014.89193-2-xiyou.wangcong@gmail.com
-
David S. Miller authored
Ido Schimmel says:

====================

mlxsw: Unified bridge conversion - part 1/6

This set starts converting mlxsw to the unified bridge model and mainly adds new device registers and extends existing ones that will be used in follow-up patchsets.

High-level summary
==================

The unified bridge model is a new way of managing low-level device objects such as filtering identifiers (FIDs). The conversion moves a lot of logic out of the device's firmware towards the driver, but its main selling point is that it allows to overcome various scalability issues related to the amount of entries that need to be programmed to the device.

The only (intended) user visible changes of the conversion are improvement in resource utilization and ability to support more router interfaces (RIFs) in Spectrum-{2,3}.

Details
=======

Commit 50853808 ("Merge branch 'mlxsw-Prepare-for-VLAN-aware-bridge-w-VxLAN'") converted mlxsw to emulate 802.1Q FIDs (represent VLANs in a VLAN-aware bridge) using 802.1D FIDs (represent VLAN-unaware bridges). This was necessary because at that time VNI could not be assigned to 802.1Q FIDs, which effectively meant that mlxsw could not support VXLAN with VLAN-aware bridges.

The downside of this approach is that multiple {Port,VID}->FID entries are required in order to classify incoming traffic to a FID, as opposed to a single VID->FID entry that can be used with actual 802.1Q FIDs.

For example, if 10 ports are members in the same VLAN-aware bridge and the same 100 VLANs are configured on each port, then only 100 VID->FID entries are required with 802.1Q FIDs, whereas 1000 {Port,VID}->FID entries are required with emulated 802.1Q FIDs.

The above limitation is the result of various assumptions that were made in the design of the API that was exposed to software. In the unified bridge model the API is much more "raw" and therefore avoids these assumptions, allowing software to configure the device in a more efficient manner.

====================

Signed-off-by: David S. Miller <davem@davemloft.net>
-
Amit Cohen authored
Router interfaces (RIFs) constructed on top of VLAN-aware bridges are of "VLAN" type, whereas RIFs constructed on top of VLAN-unaware bridges of "FID" type. In other words, the RIF type is derived from the underlying FID type. VLAN RIFs are used on top of 802.1Q FIDs, whereas FID RIFs are used on top of 802.1D FIDs. Currently 802.1Q FIDs are emulated using 802.1D FIDs, and therefore VLAN RIFs are emulated using FID RIFs. As part of converting the driver to use unified bridge, 802.1Q FIDs and VLAN RIFs will be used. Add the relevant fields to RITR register, add pack() function for VLAN RIF and rename one field to fit the internal name. Signed-off-by: Amit Cohen <amcohen@nvidia.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Amit Cohen authored
As preparation for unified bridge model, add support for VNI->FID mapping via SVFA register. When performing VXLAN encapsulation, the VXLAN header needs to contain a VNI. This VNI is derived from the FID classification performed on ingress, through which the ingress RIF is also determined. Similarly, when performing VXLAN decapsulation, the FID of the packet needs to be determined. This FID is derived from VNI classification performed during decapsulation. In the old model, both entries (i.e., FID->VNI and VNI->FID) were configured via SFMR.vni. In the new model, where ingress is separated from egress, ingress configuration (VNI->FID) is performed via SVFA, while SFMR only configures egress (FID->VNI). Add 'vni' field to SVFA, add new mapping table - VNI to FID, add new pack() function for VNI mapping and edit the comment in SFMR. Signed-off-by: Amit Cohen <amcohen@nvidia.com> Reviewed-by: Danielle Ratson <danieller@nvidia.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Amit Cohen authored
RITR configures the router interface table. As preparation for unified bridge model, add egress FID field to RITR. After routing, a packet has to perform a layer-2 lookup using the destination MAC it got from the routing and a FID. In the new model, the egress FID is configured by RITR for both sub-port and FID RIFs. Add 'efid' field to sub-port router interface and update FID router interface related comment. Signed-off-by: Amit Cohen <amcohen@nvidia.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Amit Cohen authored
The REIV maps {egress router interface (eRIF), egress_port} -> {vlan ID}. As preparation for unified bridge model, add REIV register for future use. In the past, firmware would take care of the above mentioned mapping, but in the new model this should be done by software using REIV register. REIV register supports a simultaneous update of 256 ports using 'port_page' field. When 'port_page'=0 the records represent ports 0-255, when 'port_page'=1 the records represent ports 256-511 and so on. The register is reserved while using the legacy model. Signed-off-by: Amit Cohen <amcohen@nvidia.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Amit Cohen authored
SFGC register maps {packet type, bridge type} -> {MID base, table type}. As preparation for unified bridge model, remove 'mid' field and add 'mid_base' field. The MID index (index to PGT table which maps MID to local port list and SMPE index) is a result of 'mid_base' + 'fid_offset'. Using the legacy bridge model, firmware configures 'mid_base'. However, using the new model, software is responsible for configuring it via SFGC register. The 'mid_base' is configured per {packet type, bridge type}, for example, for {Unicast, .1Q}, {Broadcast, .1D}. Add the field 'mid_base' to SFGC register and increase the length of the register accordingly. Remove the field 'mid' as it is currently ignored by the device; its use is an old leftover. Signed-off-by: Amit Cohen <amcohen@nvidia.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Amit Cohen authored
SFMR register creates and configures FIDs. As preparation for unified bridge model, add a required field for future use. The PGT (Port Group) table maps multicast ID (MID) to {local port list, SMPE index} on Spectrum-1 and to {local port list} on the other ASICs. In the legacy model, software did not interact with this table directly. Instead, it was accessed by firmware in response to registers such as SFTR and SMID. In the new model, the SFTR register is deprecated and software has full control over the PGT table using the SMID register. The configuration of MDB entries (using SFD) is unchanged, but flooding configuration is completely different. SFGC register maps {packet type, bridge type} -> {MID base, table type}; then, with FID and FID-offset which are configured via SFMR, the MID index is obtained. Add the field 'flood_bridge_type' to SFMR; software can separate between 802.1q FIDs and vFIDs using the two types which are supported. Signed-off-by: Amit Cohen <amcohen@nvidia.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Amit Cohen authored
SFD register configures FDB table. As preparation for unified bridge model, add some required fields for future use. In the new model, firmware no longer configures the egress VID, this responsibility is moved to software. For layer 2 this means that software needs to determine the egress VID for both unicast and multicast. For unicast FDB records and unicast LAG FDB records, the VID needs to be set via new fields in SFD - 'set_vid' and 'vid'. Add the two mentioned fields for future use. Signed-off-by: Amit Cohen <amcohen@nvidia.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Amit Cohen authored
SFMR register creates and configures FIDs. As preparation for unified bridge model, add some required fields for future use. The device includes two main tables to support layer 2 multicast (i.e., MDB and flooding). These are the PGT (Port Group Table) and the MPE (Multicast Port Egress) table.
- PGT is {MID -> (bitmap of local_port, SMPE index)}
- MPE is {(Local port, SMPE index) -> eVID}
In Spectrum-2 and later ASICs, the SMPE index is an attribute of the FID and programmed via new fields in SFMR register - 'smpe_valid' and 'smpe'. Add the two mentioned fields for future use and increase the length of the register accordingly. Signed-off-by: Amit Cohen <amcohen@nvidia.com> Reviewed-by: Danielle Ratson <danieller@nvidia.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Amit Cohen authored
SMID register maps multicast ID (MID) into a list of local ports. As preparation for unified bridge model, add some required fields for future use. The device includes two main tables to support layer 2 multicast (i.e., MDB and flooding). These are the PGT (Port Group Table) and the MPE (Multicast Port Egress) table.
- PGT is {MID -> (bitmap of local_port, SMPE index)}
- MPE is {(Local port, SMPE index) -> eVID}
In Spectrum-1, both indexes into the MPE table (local port and SMPE) are derived from the PGT table. Therefore, the SMPE index needs to be programmed as part of the PGT entry via new fields in SMID - 'smpe_valid' and 'smpe'. Add the two mentioned fields for future use and align the callers of mlxsw_reg_smid2_pack() to pass zeros for SMPE fields. Signed-off-by: Amit Cohen <amcohen@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Amit Cohen authored
The SMPE register maps {egress_port, SMPE index} -> VID. The device includes two main tables to support layer 2 multicast (i.e., MDB and flooding). These are the PGT (Port Group Table) and the MPE (Multicast Port Egress) table.
- PGT is {MID -> (bitmap of local_port, SMPE index)}
- MPE is {(Local port, SMPE index) -> eVID}
In Spectrum-1, the index into the MPE table - called switch multicast to port egress VID (SMPE) - is derived from the PGT entry, whereas in Spectrum-2 and later ASICs it is derived from the FID. In the legacy model, software did not interact with this table as it was completely hidden in firmware. In the new model, software needs to populate the table itself in order to map from {Local port, SMPE index} to an egress VID. This is done using the SMPE register. Add the register for future use. Signed-off-by: Amit Cohen <amcohen@nvidia.com> Reviewed-by: Danielle Ratson <danieller@nvidia.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Amit Cohen authored
SVFA register controls the VID to FID mapping and {Port, VID} to FID mapping for virtualized ports. As preparation for unified bridge model, add some required fields for future use. On ingress, after ingress ACL, a packet needs to be classified to a FID. The key for this lookup can be one of:
1. VID. When port is not in virtual mode.
2. {RQ, VID}. When port is in virtual mode.
3. FID. When FID was set by ingress ACL.
Since RITR no longer performs ingress configuration, the ingress RIF for the first two entry types needs to be set via new fields in SVFA - 'irif_v' and 'irif'. Add the two mentioned fields for future use and increase the length of the register accordingly. Signed-off-by: Amit Cohen <amcohen@nvidia.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-