- 23 Jan, 2024 39 commits
-
-
Kuniyuki Iwashima authored
We will support arbitrary SYN Cookie with BPF. If BPF prog validates ACK and kfunc allocates a reqsk, it will be carried to TCP stack as skb->sk with req->syncookie 1. In skb_steal_sock(), we need to check inet_reqsk(sk)->syncookie to see if the reqsk is created by kfunc. However, inet_reqsk() is not available in sock.h. Let's move skb_steal_sock() to request_sock.h. While at it, we refactor skb_steal_sock() so it returns early if skb->sk is NULL to minimise the following patch. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20240115205514.68364-3-kuniyu@amazon.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Kuniyuki Iwashima authored
We will support arbitrary SYN Cookie with BPF. When BPF prog validates ACK and kfunc allocates a reqsk, we need to call tcp_ns_to_ts() to calculate an offset of TSval for later use:

  time
  t0 : Send SYN+ACK
       -> tsval = Initial TSval (Random Number)

  t1 : Recv ACK of 3WHS
       -> tsoff = TSecr - tcp_ns_to_ts(usec_ts_ok, tcp_clock_ns())
                = Initial TSval - t1

  t2 : Send ACK
       -> tsval = t2 + tsoff
                = Initial TSval + (t2 - t1)
                = Initial TSval + Time Delta (x)

  (x) Note that the time delta does not include the initial RTT from t0 to t1.

Let's move tcp_ns_to_ts() to tcp.h. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20240115205514.68364-2-kuniyu@amazon.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
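The offset arithmetic in this message can be checked with a small userspace sketch. The helper names below (toy_ns_to_ts, toy_ts_offset, toy_ts_value) are hypothetical stand-ins, not the kernel's functions; the sketch assumes timestamp ticks are microseconds when usec_ts is set and milliseconds otherwise, and relies on unsigned 32-bit wraparound.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for tcp_ns_to_ts(): convert a nanosecond clock
 * to timestamp ticks (assumed: usec ticks if usec_ts, else msec ticks). */
static uint32_t toy_ns_to_ts(int usec_ts, uint64_t ns)
{
	return (uint32_t)(ns / (usec_ts ? 1000ULL : 1000000ULL));
}

/* t1: on receiving the ACK of the 3WHS, derive the offset from TSecr. */
static uint32_t toy_ts_offset(uint32_t tsecr, int usec_ts, uint64_t now_ns)
{
	return tsecr - toy_ns_to_ts(usec_ts, now_ns);
}

/* t2: TSval for a later segment is the current tick count plus the offset. */
static uint32_t toy_ts_value(uint32_t tsoff, int usec_ts, uint64_t now_ns)
{
	return toy_ns_to_ts(usec_ts, now_ns) + tsoff;
}
```

With an initial TSval echoed at t1 = 5.000 s and a segment sent at t2 = 5.250 s, the millisecond-tick result is the initial TSval plus 250, matching "Initial TSval + Time Delta".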
-
Artem Savkov authored
It is possible for bpf_kfunc_call_test_release() to be called from bpf_map_free_deferred() when bpf_testmod is already unloaded and perf_test_stuct.cnt which it tries to decrease is no longer in memory. This patch tries to fix the issue by waiting for all references to be dropped in bpf_testmod_exit(). The issue can be triggered by running 'test_progs -t map_kptr' in 6.5, but is obscured in 6.6 by d119357d ("rcu-tasks: Treat only synchronous grace periods urgently"). Fixes: 65eb006d ("bpf: Move kernel test kfuncs to bpf_testmod") Signed-off-by: Artem Savkov <asavkov@redhat.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Yonghong Song <yonghong.song@linux.dev> Cc: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/bpf/82f55c0e-0ec8-4fe1-8d8c-b1de07558ad9@linux.dev Link: https://lore.kernel.org/bpf/20240110085737.8895-1-asavkov@redhat.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Tiezhu Yang authored
There exists the following warning when building bpftool:

  CC      prog.o
  prog.c: In function ‘profile_open_perf_events’:
  prog.c:2301:24: warning: ‘calloc’ sizes specified with ‘sizeof’ in the earlier argument and not in the later argument [-Wcalloc-transposed-args]
   2301 |                 sizeof(int), obj->rodata->num_cpu * obj->rodata->num_metric);
        |                        ^~~
  prog.c:2301:24: note: earlier argument should specify number of elements, later size of each element

Tested with the latest upstream GCC which contains a new warning option -Wcalloc-transposed-args. The first argument to calloc is documented to be number of elements in array, while the second argument is size of each element, just switch the first and second arguments of calloc() to silence the build warning, compile tested only. Fixes: 47c09d6a ("bpftool: Introduce "prog profile" command") Signed-off-by: Tiezhu Yang <yangtiezhu@loongson.cn> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Quentin Monnet <quentin@isovalent.com> Link: https://lore.kernel.org/bpf/20240116061920.31172-1-yangtiezhu@loongson.cn Signed-off-by: Alexei Starovoitov <ast@kernel.org>
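For reference, the argument order the warning asks for looks roughly like this. The function name toy_alloc_fd_table is hypothetical and only mirrors the shape of the allocation described above; it is not the bpftool code.

```c
#include <assert.h>
#include <stdlib.h>

/* Allocate a zeroed num_cpu x num_metric table of ints.
 * calloc(nmemb, size): element count first, element size second. */
static int *toy_alloc_fd_table(size_t num_cpu, size_t num_metric)
{
	return calloc(num_cpu * num_metric, sizeof(int));
}
```

Both argument orders request the same number of bytes; the swap only matches the documented parameter roles, which is why the commit is marked compile tested only.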
-
Alexei Starovoitov authored
A few minor improvements for bpf_cmp() macro:
. reduce number of args in __bpf_cmp()
. rename NOFLIP to UNLIKELY
. add a comment about 64-bit truncation in "i" constraint
. use "ri" constraint for sizeof(rhs) <= 4
. improve error message for bpf_cmp_likely()

Before:
progs/iters_task_vma.c:31:7: error: variable 'ret' is uninitialized when used here [-Werror,-Wuninitialized]
   31 |                 if (bpf_cmp_likely(seen, <==, 1000))
      |                     ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
../bpf/bpf_experimental.h:325:3: note: expanded from macro 'bpf_cmp_likely'
  325 |                 ret;
      |                 ^~~
progs/iters_task_vma.c:31:7: note: variable 'ret' is declared here
../bpf/bpf_experimental.h:310:3: note: expanded from macro 'bpf_cmp_likely'
  310 |                 bool ret;
      |                 ^

After:
progs/iters_task_vma.c:31:7: error: invalid operand for instruction
   31 |                 if (bpf_cmp_likely(seen, <==, 1000))
      |                     ^
../bpf/bpf_experimental.h:324:17: note: expanded from macro 'bpf_cmp_likely'
  324 |                         asm volatile("r0 " #OP " invalid compare");
      |                                      ^
<inline asm>:1:5: note: instantiated into assembly here
    1 |         r0 <== invalid compare
      |            ^

Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/bpf/20240112220134.71209-1-alexei.starovoitov@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Yonghong Song authored
In verifier.rst, I found an incorrect statement (maybe a typo) in section 'Liveness marks tracking'. Basically, the wrong register is attributed to have a read mark. This may confuse the user. Signed-off-by: Yonghong Song <yonghong.song@linux.dev> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20240111052136.3440417-1-yonghong.song@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Yonghong Song authored
Add a selftest with a 4-byte BPF_ST of 0 where the store is not 8-byte aligned. The goal is to ensure that STACK_ZERO is properly marked in stack slots and the STACK_ZERO value can propagate properly during the load. Acked-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20240110051355.2737232-1-yonghong.song@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Yonghong Song authored
With patch set [1], precision backtracking supports register spill/fill to/from the stack. The patch [2] allows initial imprecise register spill with content 0. This is a common case for cpuv3 and lower for initializing the stack variables with pattern

  r1 = 0
  *(u64 *)(r10 - 8) = r1

and [2] has demonstrated good verification improvement.

For cpuv4, the initialization could be

  *(u64 *)(r10 - 8) = 0

The current verifier marks the r10-8 contents with STACK_ZERO. Similar to [2], let us permit the above insn to behave like imprecise register spill which can reduce number of verified states. The change is in function check_stack_write_fixed_off().

Before this patch, spilled zero will be marked as STACK_ZERO which can provide precise values. In check_stack_write_var_off(), STACK_ZERO will be maintained if writing a const zero so later it can provide precise values if needed.

The above handling of '*(u64 *)(r10 - 8) = 0' as a spill will have issues in check_stack_write_var_off() as the spill will be converted to STACK_MISC and the precise value 0 is lost. To fix this issue, if the spill slots hold const zero and the BPF_ST write is also const zero, the spill slots are preserved, which can later provide precise values if needed. Without the change in check_stack_write_var_off(), the test_verifier subtest 'BPF_ST_MEM stack imm zero, variable offset' will fail.

I checked cpuv3 and cpuv4 with and without this patch with veristat. There is no state change for cpuv3 since '*(u64 *)(r10 - 8) = 0' is only generated with cpuv4.
For cpuv4:

$ ../veristat -C old.cpuv4.csv new.cpuv4.csv -e file,prog,insns,states -f 'insns_diff!=0'
File                                        Program              Insns (A)  Insns (B)  Insns    (DIFF)  States (A)  States (B)  States (DIFF)
------------------------------------------  -------------------  ---------  ---------  ---------------  ----------  ----------  -------------
local_storage_bench.bpf.linked3.o           get_local                  228        168    -60 (-26.32%)          17          14   -3 (-17.65%)
pyperf600_bpf_loop.bpf.linked3.o            on_event                  6066       4889  -1177 (-19.40%)         403         321  -82 (-20.35%)
test_cls_redirect.bpf.linked3.o             cls_redirect             35483      35387     -96 (-0.27%)        2179        2177    -2 (-0.09%)
test_l4lb_noinline.bpf.linked3.o            balancer_ingress          4494       4522     +28 (+0.62%)         217         219    +2 (+0.92%)
test_l4lb_noinline_dynptr.bpf.linked3.o     balancer_ingress          1432       1455     +23 (+1.61%)          92          94    +2 (+2.17%)
test_xdp_noinline.bpf.linked3.o             balancer_ingress_v6       3462       3458      -4 (-0.12%)         216         216    +0 (+0.00%)
verifier_iterating_callbacks.bpf.linked3.o  widening                    52         41    -11 (-21.15%)           4           3   -1 (-25.00%)
xdp_synproxy_kern.bpf.linked3.o             syncookie_tc             12412      11719    -693 (-5.58%)         345         330   -15 (-4.35%)
xdp_synproxy_kern.bpf.linked3.o             syncookie_xdp            12478      11794    -684 (-5.48%)         346         331   -15 (-4.34%)

test_l4lb_noinline and test_l4lb_noinline_dynptr have minor regressions, but pyperf600_bpf_loop and local_storage_bench get pretty good improvements.

[1] https://lore.kernel.org/all/20231205184248.1502704-1-andrii@kernel.org/
[2] https://lore.kernel.org/all/20231205184248.1502704-9-andrii@kernel.org/

Cc: Kuniyuki Iwashima <kuniyu@amazon.com> Cc: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Yonghong Song <yonghong.song@linux.dev> Tested-by: Kuniyuki Iwashima <kuniyu@amazon.com> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20240110051348.2737007-1-yonghong.song@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Maxim Mikityanskiy authored
The previous commit implemented assigning IDs to registers holding scalars before spill. Add the test cases to check the new functionality. Signed-off-by: Maxim Mikityanskiy <maxim@isovalent.com> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20240108205209.838365-10-maxtram95@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Maxim Mikityanskiy authored
Currently, when a scalar bounded register is spilled to the stack, its ID is preserved, but only if it was already assigned, i.e. if this register was MOVed before. Assign an ID on spill if none is set, so that equal scalars could be tracked if a register is spilled to the stack and filled into another register. One test is adjusted to reflect the change in register IDs. Signed-off-by: Maxim Mikityanskiy <maxim@isovalent.com> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20240108205209.838365-9-maxtram95@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Maxim Mikityanskiy authored
Put calculation of the register value width into a dedicated function. This function will also be used in a following commit. Signed-off-by: Maxim Mikityanskiy <maxim@isovalent.com> Link: https://lore.kernel.org/r/20240108205209.838365-8-maxtram95@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Maxim Mikityanskiy authored
Extract the common code that generates a register ID for src_reg before MOV if needed into a new function. This function will also be used in a following commit. Signed-off-by: Maxim Mikityanskiy <maxim@isovalent.com> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20240108205209.838365-7-maxtram95@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Maxim Mikityanskiy authored
When a range check is performed on a register that was 32-bit spilled to the stack, the IDs of the two instances of the register are the same, so the range should also be the same. Signed-off-by: Maxim Mikityanskiy <maxim@isovalent.com> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20240108205209.838365-6-maxtram95@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Maxim Mikityanskiy authored
Adjust the check in bpf_get_spilled_reg to take into account spilled registers narrower than 64 bits. That allows find_equal_scalars to properly adjust the range of all spilled registers that have the same ID. Before this change, it was possible for a register and a spilled register to have the same IDs but different ranges if the spill was narrower than 64 bits and a range check was performed on the register. Signed-off-by: Maxim Mikityanskiy <maxim@isovalent.com> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20240108205209.838365-5-maxtram95@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Eduard Zingerman authored
Verify that infinite loop detection logic separates states with identical register states but different imprecise scalars spilled to stack. Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20240108205209.838365-4-maxtram95@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Eduard Zingerman authored
The current infinite loop detection mechanism is speculative:
- first, states_maybe_looping() check is done which simply does memcmp for R1-R10 in current frame;
- second, states_equal(..., exact=false) is called. With exact=false states_equal() would compare scalars for equality only if in old state scalar has precision mark.

Such logic might be problematic if compiler makes some unlucky stack spill/fill decisions. An artificial example of a false positive looks as follows:

        r0 = ... unknown scalar ...
        r0 &= 0xff;
        *(u64 *)(r10 - 8) = r0;
        r0 = 0;
    loop:
        r0 = *(u64 *)(r10 - 8);
        if r0 > 10 goto exit_;
        r0 += 1;
        *(u64 *)(r10 - 8) = r0;
        r0 = 0;
        goto loop;

This commit updates call to states_equal to use exact=true, forcing all scalar comparisons to be exact. Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20240108205209.838365-3-maxtram95@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
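The false positive can be illustrated with a toy model that uses concrete values instead of the verifier's abstract scalar ranges. Everything below (struct toy_state, toy_regs_match, toy_exact_match, toy_loop_iter) is a hypothetical simplification: it only shows that a register-only comparison sees two loop iterations as identical while a comparison that includes the stack slot does not.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

struct toy_state {
	uint64_t regs[11];   /* r0..r10 */
	uint64_t stack_slot; /* value stored at r10 - 8 */
};

/* Register-only comparison, in the spirit of the memcmp pre-check. */
static bool toy_regs_match(const struct toy_state *a, const struct toy_state *b)
{
	return memcmp(a->regs, b->regs, sizeof(a->regs)) == 0;
}

/* Exact comparison: the spilled scalar must match as well. */
static bool toy_exact_match(const struct toy_state *a, const struct toy_state *b)
{
	return toy_regs_match(a, b) && a->stack_slot == b->stack_slot;
}

/* One pass through the loop body from the example; false once the loop exits. */
static bool toy_loop_iter(struct toy_state *s)
{
	s->regs[0] = s->stack_slot;  /* r0 = *(u64 *)(r10 - 8) */
	if (s->regs[0] > 10)         /* if r0 > 10 goto exit_  */
		return false;
	s->regs[0] += 1;             /* r0 += 1                */
	s->stack_slot = s->regs[0];  /* *(u64 *)(r10 - 8) = r0 */
	s->regs[0] = 0;              /* r0 = 0                 */
	return true;
}
```

At the loop head every register looks the same on each iteration because r0 is reset to 0; only the spilled counter advances, so the loop does terminate even though the register-only check cannot tell the iterations apart.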
-
Maxim Mikityanskiy authored
The u64_offset_to_skb_data test is supposed to make a 64-bit fill, but instead makes a 16-bit one. Fix the test according to its intention and update the comments accordingly (umax is no longer 0xffff). The 16-bit fill is covered by u16_offset_to_skb_data. Signed-off-by: Maxim Mikityanskiy <maxim@isovalent.com> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20240108205209.838365-2-maxtram95@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Nathan Chancellor authored
reviews.llvm.org was LLVM's Phabricator instance for code review. It has been abandoned in favor of GitHub pull requests. While the majority of links in the kernel sources still work because of the work Fangrui has done turning the dynamic Phabricator instance into a static archive, there are some issues with that work, so preemptively convert all the links in the kernel sources to point to the commit on GitHub. Most of the commits have the corresponding differential review link in the commit message itself so there should not be any loss of fidelity in the relevant information. Additionally, fix a typo in the xdpwall.c print ("LLMV" -> "LLVM") while in the area. Link: https://discourse.llvm.org/t/update-on-github-pull-requests/71540/172 Acked-by: Yonghong Song <yonghong.song@linux.dev> Signed-off-by: Nathan Chancellor <nathan@kernel.org> Link: https://lore.kernel.org/r/20240111-bpf-update-llvm-phabricator-links-v2-1-9a7ae976bd64@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Andrii Nakryiko authored
Various tests specify extra testing prog_flags when loading BPF programs, like BPF_F_TEST_RND_HI32, and more recently also BPF_F_TEST_REG_INVARIANTS. While BPF_F_TEST_RND_HI32 is old enough to not cause much problem on older kernels, BPF_F_TEST_REG_INVARIANTS is very fresh and unconditionally specifying it causes selftests to fail on even slightly outdated kernels. This breaks libbpf CI test against 4.9 and 5.15 kernels, it can break some local development (done outside of VM), etc. To prevent this, and guard against similar problems in the future, do runtime detection of supported "testing flags", and only provide those that host kernel recognizes. Acked-by: Song Liu <song@kernel.org> Acked-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20240109231738.575844-1-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
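The general shape of such runtime detection can be sketched as a probe loop. This is only an illustration under assumptions: the flag bits, toy_detect_flags() and the toy_old_kernel_load() test double are all hypothetical, and the selftests' actual probing code may differ.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative flag bits; these are not the real BPF_F_* values. */
#define TOY_F_OLD_TEST_FLAG (1u << 0)
#define TOY_F_NEW_TEST_FLAG (1u << 1)

/* Hypothetical probe: tries to load a trivial program with prog_flags,
 * returns 0 on success and a negative value if the kernel rejects it. */
typedef int (*toy_load_fn)(unsigned int prog_flags);

/* Probe each candidate flag on its own and keep only the accepted ones. */
static unsigned int toy_detect_flags(const unsigned int *candidates, size_t n,
				     toy_load_fn try_load)
{
	unsigned int supported = 0;

	for (size_t i = 0; i < n; i++)
		if (try_load(candidates[i]) == 0)
			supported |= candidates[i];
	return supported;
}

/* Test double for a slightly outdated kernel that rejects the new flag. */
static int toy_old_kernel_load(unsigned int prog_flags)
{
	return (prog_flags & ~TOY_F_OLD_TEST_FLAG) ? -1 : 0;
}
```

Probing flags one at a time keeps an unknown flag from masking support for the others; the resulting mask would then be ANDed with whatever flags a test requests.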
-
Dave Thaler authored
The discussion of what the actual conformance groups should be is still in progress, so this is just part 1 which only uses "legacy" for deprecated instructions and "basic" for everything else. Subsequent patches will add more groups as discussion continues. Signed-off-by: Dave Thaler <dthaler1968@gmail.com> Acked-by: David Vernet <void@manifault.com> Link: https://lore.kernel.org/r/20240108214231.5280-1-dthaler1968@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Randy Dunlap authored
Fix spelling errors as reported by codespell. Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Andrii Nakryiko <andrii@kernel.org> Cc: bpf@vger.kernel.org Cc: "David S. Miller" <davem@davemloft.net> Cc: Eric Dumazet <edumazet@google.com> Cc: Jakub Kicinski <kuba@kernel.org> Cc: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://lore.kernel.org/r/20240106065545.16855-1-rdunlap@infradead.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Andrii Nakryiko authored
Add ability to iterate multiple decl_tag types pointed to the same function argument. Use this to support multiple __arg_xxx tags per global subprog argument. We leave btf_find_decl_tag_value() intact, but change its implementation to use a new btf_find_next_decl_tag() which can be straightforwardly used to find next BTF type ID of a matching btf_decl_tag type. btf_prepare_func_args() is switched from btf_find_decl_tag_value() to btf_find_next_decl_tag() to gain multiple tags per argument support. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20240105000909.2818934-5-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
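The "find next" pattern described here, with the first-match lookup kept as a thin wrapper over the iterator, can be sketched on a toy tag table. The types and functions below (struct toy_tag, toy_find_next_tag, toy_find_tag_value) are hypothetical and do not reproduce the kernel's BTF structures or signatures.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy stand-in for a decl_tag record attached to a function argument. */
struct toy_tag {
	int func_id;     /* ID of the tagged function */
	int arg_idx;     /* argument the tag is attached to */
	const char *val; /* tag string, e.g. "arg:ctx" */
};

/* Return the index of the next matching tag after last_idx, or -1.
 * Passing last_idx = -1 starts the search from the beginning. */
static int toy_find_next_tag(const struct toy_tag *tags, int n, int func_id,
			     int arg_idx, const char *prefix, int last_idx)
{
	size_t len = strlen(prefix);

	for (int i = last_idx + 1; i < n; i++)
		if (tags[i].func_id == func_id && tags[i].arg_idx == arg_idx &&
		    strncmp(tags[i].val, prefix, len) == 0)
			return i;
	return -1;
}

/* First-match lookup kept as a wrapper on top of the iterator. */
static const char *toy_find_tag_value(const struct toy_tag *tags, int n,
				      int func_id, int arg_idx,
				      const char *prefix)
{
	int i = toy_find_next_tag(tags, n, func_id, arg_idx, prefix, -1);

	return i < 0 ? NULL : tags[i].val + strlen(prefix);
}
```

A caller that wants every tag on an argument feeds the returned index back in as last_idx until -1 comes back, while existing callers of the single-value lookup keep working unchanged.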
-
Andrii Nakryiko authored
Add btf_arg_tag flags enum to be able to record multiple tags per argument. Also streamline pointer argument processing some more. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20240105000909.2818934-4-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Andrii Nakryiko authored
Move scalar arg processing in btf_prepare_func_args() after all pointer arg processing is done. This makes it easier to do validation. One example of unintended behavior right now is ability to specify __arg_nonnull for integer/enum arguments. This patch fixes this. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20240105000909.2818934-3-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Andrii Nakryiko authored
Seeing:

  process_subtest:PASS:Can't alloc specs array 0 nsec

... in verbose successful test log is very confusing. Use smaller identifier-like test tag to denote that we are asserting specs array allocation success. Now it's much less distracting:

  process_subtest:PASS:specs_alloc 0 nsec

Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20240105000909.2818934-2-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Alexei Starovoitov authored
Hou Tao says:

====================
The motivation of inlining bpf_kptr_xchg() comes from the performance profiling of bpf memory allocator benchmark [1]. The benchmark uses bpf_kptr_xchg() to stash the allocated objects and to pop the stashed objects for free. After inlining bpf_kptr_xchg(), the performance for object free on 8-CPUs VM increases about 2%~10%. However the performance gain comes with costs: both the kasan and kcsan checks on the pointer will be unavailable. Initially the inline is implemented in do_jit() for x86-64 directly, but I think it will be more portable to implement the inline in verifier.

Patch #1 supports inlining bpf_kptr_xchg() helper and enables it on x86-64. Patch #2 factors out a helper for newly-added test in patch #3. Patch #3 tests whether the inlining of bpf_kptr_xchg() is expected.

Please see individual patches for more details. And comments are always welcome.

Change Log:
v3:
  * rebased on bpf-next tree
  * patch 1 & 2: Add Rvb-by and Ack-by tags from Eduard
  * patch 3: use inline assembly and naked function instead of c code (suggested by Eduard)

v2: https://lore.kernel.org/bpf/20231223104042.1432300-1-houtao@huaweicloud.com/
  * rebased on bpf-next tree
  * drop patch #1 in v1 due to discussion in [2]
  * patch #1: add the motivation in the commit message, merge patch #1 and #3 into the new patch in v2. (Daniel)
  * patch #2/#3: newly-added patch to test the inlining of bpf_kptr_xchg() (Eduard)

v1: https://lore.kernel.org/bpf/95b8c2cd-44d5-5fe1-60b5-7e8218779566@huaweicloud.com/

[1]: https://lore.kernel.org/bpf/20231221141501.3588586-1-houtao@huaweicloud.com/
[2]: https://lore.kernel.org/bpf/fd94efb9-4a56-c982-dc2e-c66be5202cb7@huaweicloud.com/
====================

Link: https://lore.kernel.org/r/20240105104819.3916743-1-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Hou Tao authored
The test uses bpf_prog_get_info_by_fd() to obtain the xlated instructions of the program first. Since these instructions have already been rewritten by the verifier, the test then checks whether the rewritten instructions are as expected. And to ensure LLVM generates code exactly as expected, use inline assembly and a naked function. Suggested-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Hou Tao <houtao1@huawei.com> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20240105104819.3916743-4-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Hou Tao authored
Both test_verifier and test_progs use get_xlated_program(), so move the helper into testing_helpers.h to reuse it. Acked-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20240105104819.3916743-3-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Hou Tao authored
The motivation of inlining bpf_kptr_xchg() comes from the performance profiling of bpf memory allocator benchmark. The benchmark uses bpf_kptr_xchg() to stash the allocated objects and to pop the stashed objects for free. After inlining bpf_kptr_xchg(), the performance for object free on 8-CPUs VM increases about 2%~10%. The inline also has a downside: both the kasan and kcsan checks on the pointer will be unavailable. bpf_kptr_xchg() can be inlined by converting the calling of bpf_kptr_xchg() into an atomic_xchg() instruction. But the conversion depends on two conditions: 1) JIT backend supports atomic_xchg() on pointer-sized word 2) For the specific arch, the implementation of xchg is the same as atomic_xchg() on pointer-sized words. It seems most 64-bit JIT backends satisfy these two conditions. But as a precaution, defining a weak function bpf_jit_supports_ptr_xchg() to state whether such conversion is safe and only supporting inline for 64-bit host. For x86-64, it supports BPF_XCHG atomic operation and both xchg() and atomic_xchg() use arch_xchg() to implement the exchange, so enabling the inline of bpf_kptr_xchg() on x86-64 first. Reviewed-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20240105104819.3916743-2-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
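The stash-and-pop usage that makes a plain pointer exchange sufficient can be modelled in userspace with C11 atomics. This is only an analogy: toy_slot and toy_kptr_xchg() are hypothetical names, and atomic_exchange() here stands in for the kernel's exchange primitive rather than reproducing the verifier rewrite.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct toy_obj {
	int id;
};

/* Hypothetical slot standing in for a kptr field inside a map value. */
static _Atomic(struct toy_obj *) toy_slot;

/* Swap in newp and return whatever was stashed before, as one atomic step. */
static struct toy_obj *toy_kptr_xchg(_Atomic(struct toy_obj *) *slot,
				     struct toy_obj *newp)
{
	return atomic_exchange(slot, newp);
}
```

Stashing an object returns the previous occupant (NULL for an empty slot), and exchanging with NULL pops it, so ownership moves with a single atomic instruction and no function call.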
-
Paolo Abeni authored
Eric Dumazet says:

====================
inet_diag: remove three mutexes in diag dumps

Surprisingly, inet_diag operations are serialized over a stack of three mutexes, giving legacy /proc based files an unfair advantage on modern hosts. This series removes all of them, making inet_diag operations (eg iproute2/ss) fully parallel.

1-2) Two first patches are adding data-race annotations and can be backported to stable kernels.

3-4) inet_diag_table_mutex can be replaced with RCU protection, if we add corresponding protection against module unload.

5-7) sock_diag_table_mutex can be replaced with RCU protection, if we add corresponding protection against module unload.

8) sock_diag_mutex is removed, as the old bug it was working around has been fixed more elegantly.

9) inet_diag_dump_icsk() can skip over empty buckets to reduce spinlock contention.
====================

Link: https://lore.kernel.org/r/20240122112603.3270097-1-edumazet@google.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
-
Eric Dumazet authored
After the removal of inet_diag_table_mutex, sock_diag_table_mutex and sock_diag_mutex, I was able to see spinlock contention from inet_diag_dump_icsk() when running 100 parallel invocations. It is time to skip over empty buckets. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Guillaume Nault <gnault@redhat.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
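The idea of skipping empty buckets before touching their lock can be shown with a single-threaded model. All names below (struct toy_sock, struct toy_bucket, toy_dump) are hypothetical, and the lock is replaced by a counter so the effect is observable; the kernel's hash table layout and lockless emptiness check are not reproduced here.

```c
#include <assert.h>
#include <stddef.h>

struct toy_sock {
	struct toy_sock *next;
};

struct toy_bucket {
	struct toy_sock *head;
	unsigned long lock_taken; /* counts simulated lock acquisitions */
};

/* Walk all buckets and count sockets, locking only non-empty buckets. */
static size_t toy_dump(struct toy_bucket *tbl, size_t nbuckets)
{
	size_t seen = 0;

	for (size_t i = 0; i < nbuckets; i++) {
		struct toy_bucket *b = &tbl[i];

		/* Cheap unlocked peek: nothing to dump, so skip the lock. */
		if (!b->head)
			continue;

		b->lock_taken++; /* stand-in for taking the bucket spinlock */
		for (struct toy_sock *sk = b->head; sk; sk = sk->next)
			seen++;
		/* stand-in for releasing the bucket spinlock */
	}
	return seen;
}
```

With a sparsely populated table most iterations take the early continue, so parallel dumpers contend only on buckets that actually hold sockets.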
-
Eric Dumazet authored
sock_diag_rcv() is still serializing its operations using a mutex, for no good reason. This came with commit 0a9c7301 ("[INET_DIAG]: Fix oops in netlink_rcv_skb"), but the root cause has been fixed with commit cd40b7d3 ("[NET]: make netlink user -> kernel interface synchronious"). Remove this mutex to let multiple threads run concurrently. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Guillaume Nault <gnault@redhat.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
-
Eric Dumazet authored
TCPDIAG_GETSOCK and DCCPDIAG_GETSOCK diag are serialized on sock_diag_table_mutex. This is to make sure inet_diag module is not unloaded while diag was ongoing. It is time to get rid of this mutex and use RCU protection, allowing full parallelism. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Guillaume Nault <gnault@redhat.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
-
Eric Dumazet authored
sock_diag_broadcast_destroy_work() and __sock_diag_cmd() are currently using sock_diag_table_mutex to protect against concurrent sock_diag_handlers[] changes. This makes inet_diag dump serialized, thus less scalable than legacy /proc files. It is time to switch to full RCU protection. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Guillaume Nault <gnault@redhat.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
-
Eric Dumazet authored
Following patch is going to use RCU instead of sock_diag_table_mutex acquisition. This patch is a preparation, no change of behavior yet. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Guillaume Nault <gnault@redhat.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
-
Eric Dumazet authored
inet_diag_lock_handler() current implementation uses a mutex to protect inet_diag_table[] array against concurrent changes. This makes inet_diag dump serialized, thus less scalable than legacy /proc files. It is time to switch to full RCU protection. As a bonus, if a target is statically linked instead of being modular, inet_diag_lock_handler() & inet_diag_unlock_handler() reduce to reads only. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Guillaume Nault <gnault@redhat.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
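A rough userspace analogy for a handler table that is read without a mutex is sketched below with C11 atomics. It is not RCU: there is no grace period and no module reference counting, both of which the kernel needs before a handler can safely go away. The names toy_table, toy_register and toy_lookup are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stddef.h>

struct toy_handler {
	unsigned char proto;
};

/* One slot per protocol number; readers never take a lock. */
static _Atomic(const struct toy_handler *) toy_table[256];

/* Publish a handler only if the slot is still empty. */
static int toy_register(const struct toy_handler *h)
{
	const struct toy_handler *expected = NULL;

	return atomic_compare_exchange_strong(&toy_table[h->proto],
					      &expected, h) ? 0 : -EEXIST;
}

/* Lockless read side: a single acquire load of the slot. */
static const struct toy_handler *toy_lookup(unsigned char proto)
{
	return atomic_load_explicit(&toy_table[proto], memory_order_acquire);
}
```

Lookups from many threads proceed in parallel because the read side is one load; only registration has to resolve races, which is the property that lets dumps scale once the mutex is gone.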
-
Eric Dumazet authored
Following patch is going to use RCU instead of inet_diag_table_mutex acquisition. This patch is a preparation, no change of behavior yet. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Guillaume Nault <gnault@redhat.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
-
Eric Dumazet authored
inet_diag_lock_handler() reads inet_diag_table[proto] locklessly. Use READ_ONCE()/WRITE_ONCE() annotations to avoid potential issues. Fixes: d523a328 ("[INET]: Fix inet_diag dead-lock regression") Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Guillaume Nault <gnault@redhat.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
-
Eric Dumazet authored
__sock_diag_cmd() and sock_diag_bind() read sock_diag_handlers[family] without a lock held. Use READ_ONCE()/WRITE_ONCE() annotations to avoid potential issues. Fixes: 8ef874bf ("sock_diag: Move the sock_ code to net/core/") Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Guillaume Nault <gnault@redhat.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
-
- 19 Jan, 2024 1 commit
-
-
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Linus Torvalds authored
Pull networking fixes from Jakub Kicinski:
 "Including fixes from bpf and netfilter.

  Previous releases - regressions:

   - Revert "net: rtnetlink: Enslave device before bringing it up", breaks the case inverse to the one it was trying to fix

   - net: dsa: fix oob access in DSA's netdevice event handler dereference netdev_priv() before check its a DSA port

   - sched: track device in tcf_block_get/put_ext() only for clsact binder types

   - net: tls, fix WARNING in __sk_msg_free when record becomes full during splice and MORE hint set

   - sfp-bus: fix SFP mode detect from bitrate

   - drv: stmmac: prevent DSA tags from breaking COE

  Previous releases - always broken:

   - bpf: fix no forward progress in bpf_iter_udp if output buffer is too small

   - bpf: reject variable offset alu on registers with a type of PTR_TO_FLOW_KEYS to prevent oob access

   - netfilter: tighten input validation

   - net: add more sanity check in virtio_net_hdr_to_skb()

   - rxrpc: fix use of Don't Fragment flag on RESPONSE packets, avoid infinite loop

   - amt: do not use the portion of skb->cb area which may get clobbered

   - mptcp: improve validation of the MPTCPOPT_MP_JOIN MPTCP option

  Misc:

   - spring cleanup of inactive maintainers"

* tag 'net-6.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (88 commits)
  i40e: Include types.h to some headers
  ipv6: mcast: fix data-race in ipv6_mc_down / mld_ifc_work
  selftests: mlxsw: qos_pfc: Adjust the test to support 8 lanes
  selftests: mlxsw: qos_pfc: Remove wrong description
  mlxsw: spectrum_router: Register netdevice notifier before nexthop
  mlxsw: spectrum_acl_tcam: Fix stack corruption
  mlxsw: spectrum_acl_tcam: Fix NULL pointer dereference in error path
  mlxsw: spectrum_acl_erp: Fix error flow of pool allocation failure
  ethtool: netlink: Add missing ethnl_ops_begin/complete
  selftests: bonding: Add more missing config options
  selftests: netdevsim: add a config file
  libbpf: warn on unexpected __arg_ctx type when rewriting BTF
  selftests/bpf: add tests confirming type logic in kernel for __arg_ctx
  bpf: enforce types for __arg_ctx-tagged arguments in global subprogs
  bpf: extract bpf_ctx_convert_map logic and make it more reusable
  libbpf: feature-detect arg:ctx tag support in kernel
  ipvs: avoid stat macros calls from preemptible context
  netfilter: nf_tables: reject NFT_SET_CONCAT with not field length description
  netfilter: nf_tables: skip dead set elements in netlink dump
  netfilter: nf_tables: do not allow mismatch field size and set key length
  ...
-