- 24 Oct, 2023 19 commits
-
-
Daniel Borkmann authored
Add a bigger batch of test coverage to assert correct operation of netkit devices and their BPF program management:

  # ./test_progs -t tc_netkit
  [...]
  [ 1.166267] bpf_testmod: loading out-of-tree module taints kernel.
  [ 1.166831] bpf_testmod: module verification failed: signature and/or required key missing - tainting kernel
  [ 1.270957] tsc: Refined TSC clocksource calibration: 3407.988 MHz
  [ 1.272579] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x311fc932722, max_idle_ns: 440795381586 ns
  [ 1.275336] clocksource: Switched to clocksource tsc
  #257     tc_netkit_basic:OK
  #258     tc_netkit_device:OK
  #259     tc_netkit_multi_links:OK
  #260     tc_netkit_multi_opts:OK
  #261     tc_netkit_neigh_links:OK
  Summary: 5/0 PASSED, 0 SKIPPED, 0 FAILED
  [...]

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20231024214904.29825-8-daniel@iogearbox.net
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
-
Daniel Borkmann authored
Add a minimal netlink helper library for the BPF selftests. This has been taken from iproute2, then cut down and cleaned up. It covers basics such as netdevice creation, which we need for BPF selftests / BPF CI given the iproute2 package cannot cover it yet. Stanislav Fomichev suggested that this could be replaced in the future by ynl-tool-generated C code once it has RTNL support to create devices. Once we get to that point, the BPF CI would also need to add libmnl. If no further extensions are needed, a second option could be to remove this code again once the iproute2 package has support.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20231024214904.29825-7-daniel@iogearbox.net
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
-
Daniel Borkmann authored
Add support to dump BPF programs on netkit via bpftool. This includes both the BPF link and attach ops programs. The dumped information contains the attach location, function entry name, program ID and link ID when applicable.

Example with tc BPF link:

  # ./bpftool net
  xdp:
  tc:
  nk1(22) netkit/peer tc1 prog_id 43 link_id 12
  [...]

Example with json dump:

  # ./bpftool net --json | jq
  [
    {
      "xdp": [],
      "tc": [
        {
          "devname": "nk1",
          "ifindex": 18,
          "kind": "netkit/primary",
          "name": "tc1",
          "prog_id": 29,
          "prog_flags": [],
          "link_id": 8,
          "link_flags": []
        }
      ],
      "flow_dissector": [],
      "netfilter": []
    }
  ]

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Quentin Monnet <quentin@isovalent.com>
Acked-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20231024214904.29825-6-daniel@iogearbox.net
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
-
Daniel Borkmann authored
Add support to dump netkit link information to bpftool in a similar way as we have for XDP. The netkit link info only exposes the ifindex and the attach_type.

Below shows an example link dump output, and a cgroup link is included for comparison, too:

  # bpftool link
  [...]
  10: cgroup  prog 2466
          cgroup_id 1  attach_type cgroup_inet6_post_bind
  [...]
  8: netkit  prog 35
          ifindex nk1(18)  attach_type netkit_primary
  [...]

Equivalent json output:

  # bpftool link --json
  [...]
  {
    "id": 10,
    "type": "cgroup",
    "prog_id": 2466,
    "cgroup_id": 1,
    "attach_type": "cgroup_inet6_post_bind"
  },
  [...]
  {
    "id": 12,
    "type": "netkit",
    "prog_id": 61,
    "devname": "nk1",
    "ifindex": 21,
    "attach_type": "netkit_primary"
  }
  [...]

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Quentin Monnet <quentin@isovalent.com>
Acked-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20231024214904.29825-5-daniel@iogearbox.net
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
-
Daniel Borkmann authored
This adds a bpf_program__attach_netkit() API to libbpf. Overall it is very similar to tcx. The API looks as follows:

  LIBBPF_API struct bpf_link *
  bpf_program__attach_netkit(const struct bpf_program *prog, int ifindex,
                             const struct bpf_netkit_opts *opts);

The struct bpf_netkit_opts is done in a similar way as struct bpf_tcx_opts for supporting bpf_mprog control parameters. The attach location for the primary and peer device is derived from the program section "netkit/primary" and "netkit/peer", respectively.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20231024214904.29825-4-daniel@iogearbox.net
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
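A minimal usage sketch of the new API (illustrative only; the helper name attach_to_netkit, the program and the ifindex are placeholders, and the opts are left at their defaults rather than showing the full bpf_mprog control surface):

```
#include <bpf/libbpf.h>

/* Sketch: attach an already-loaded program to the primary netkit device
 * identified by ifindex. Real callers may fill bpf_netkit_opts with
 * bpf_mprog ordering parameters, analogous to bpf_tcx_opts.
 */
static struct bpf_link *attach_to_netkit(struct bpf_program *prog, int ifindex)
{
	LIBBPF_OPTS(bpf_netkit_opts, optl);

	/* returns NULL and sets errno on failure */
	return bpf_program__attach_netkit(prog, ifindex, &optl);
}
```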
-
Daniel Borkmann authored
Sync the if_link uapi header to the latest version as we need the refresher in tooling for the netkit device. Given it's been a while since the last sync and the diff is fairly big, it has been done as its own commit.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20231024214904.29825-3-daniel@iogearbox.net
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
-
Daniel Borkmann authored
This work adds a new, minimal BPF-programmable device called "netkit" (former PoC code-name "meta") which we recently presented at LSF/MM/BPF. The core idea is that BPF programs are executed within the driver's xmit routine and therefore, e.g. in the case of containers/Pods, BPF processing is moved closer to the source.

One of the goals was that in case of Pod egress traffic, this allows moving BPF programs from hostns tcx ingress into the device itself, providing earlier drop or forward mechanisms. For example, if the BPF program determines that the skb must be sent out of the node, then a redirect to the physical device can take place directly without going through the per-CPU backlog queue. This helps to shift processing for such traffic from softirq to process context, leading to better scheduling decisions/performance (see measurements in the slides).

In this initial version, the netkit device ships as a pair, but we plan to extend this further so it can also operate in single device mode. The pair comes with a primary and a peer device. Only the primary device, typically residing in hostns, can manage BPF programs for itself and its peer. The peer device is designated for containers/Pods and cannot attach/detach BPF programs. Upon device creation, the user can set the default policy to 'pass' or 'drop' for the case when no BPF program is attached. Additionally, the device can be operated in L3 (default) or L2 mode.

The management of BPF programs is done via bpf_mprog, so that multi-attach is supported right from the beginning with similar API and dependency controls as tcx. For details on the latter see commit 053c8e1f ("bpf: Add generic attach/detach/query API for multi-progs"). tc BPF compatibility is provided, so that existing programs can be easily migrated.

Going forward, we plan to use netkit devices in Cilium as the main device type for connecting Pods. They will be operated in L3 mode in order to simplify a Pod's neighbor management, and the peer will operate in default drop mode, so that no traffic leaves between the time a Pod is brought up by the CNI plugin and the time programs are attached by the agent. Additionally, the programs we attach via tcx on the physical devices use bpf_redirect_peer() for inbound traffic into the netkit device, hence the latter also supports the ndo_get_peer_dev callback. Similarly, we use bpf_redirect_neigh() for the way out, pushing from the netkit peer to the phys device directly. Also, BIG TCP is supported on the netkit device. For the follow-up work in single device mode, we plan to convert Cilium's cilium_host/_net devices into a single one.

An extensive test suite for checking device operations and the BPF program and link management API comes as BPF selftests in this series.

Co-developed-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Acked-by: Stanislav Fomichev <sdf@google.com>
Acked-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://github.com/borkmann/iproute2/tree/pr/netkit
Link: http://vger.kernel.org/bpfconf2023_material/tcx_meta_netdev_borkmann.pdf (24ff.)
Link: https://lore.kernel.org/r/20231024214904.29825-2-daniel@iogearbox.net
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
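For orientation, a minimal sketch of what a BPF program pair for such a device could look like (assumptions: the "netkit/primary"/"netkit/peer" section names come from the libbpf patch in this series, NETKIT_PASS is the 'pass' action value from the synced if_link uapi header, and the program names tc1/tc2 are placeholders):

```
// SPDX-License-Identifier: GPL-2.0
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

#ifndef NETKIT_PASS
#define NETKIT_PASS 0	/* assumed value of enum netkit_action's 'pass' */
#endif

SEC("netkit/primary")
int tc1(struct __sk_buff *skb)
{
	/* Runs in the primary device's xmit path (hostns side); a real
	 * program would filter or redirect here instead of passing all.
	 */
	return NETKIT_PASS;
}

SEC("netkit/peer")
int tc2(struct __sk_buff *skb)
{
	/* Runs for traffic sent out of the peer (Pod side). */
	return NETKIT_PASS;
}

char LICENSE[] SEC("license") = "GPL";
```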
-
Andrii Nakryiko authored
When determining if an if/else branch will always or never be taken, use signed range knowledge in addition to the currently used unsigned range knowledge. If either the signed or the unsigned range suggests that the condition is always/never taken, return the corresponding branch_taken verdict.

The current use of only the unsigned range for this seems arbitrary and unnecessarily incomplete. It is possible for *signed* operations to be performed on a register, which could "invalidate" the unsigned range for that register. In such a case branch_taken will be artificially useless, even if we can still tell that some constant is outside of the register value range based on its signed bounds.

veristat-based validation shows zero differences across selftests, Cilium, and Meta-internal BPF object files.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>
Link: https://lore.kernel.org/bpf/20231022205743.72352-2-andrii@kernel.org
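For a concrete feel of the gap being closed, a small illustrative C fragment (not taken from the patch; the function and values are made up for the example):

```
/* If the checks below pin x's signed range to [-16, 16], that range
 * crosses zero, so nothing useful can be said about x's unsigned range
 * (it stays [0, U64_MAX]) and unsigned bounds alone cannot decide the
 * equality test -- but the signed bounds prove it is never true,
 * letting the verifier prune the dead branch instead of exploring it.
 */
static long example(long x)
{
	if (x < -16 || x > 16)
		return 0;
	/* here: smin_value = -16, smax_value = 16; umin/umax unhelpful */
	if (x == 100)		/* never taken, provable from signed bounds */
		return 1;	/* dead code */
	return 2;
}
```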
-
Paul E. McKenney authored
The bpf_user_ringbuf_drain() BPF_CALL function uses an atomic_set() immediately preceded by smp_mb__before_atomic() so as to order storing of ring-buffer consumer and producer positions prior to the atomic_set() call's clearing of the ->busy flag, as follows:

  smp_mb__before_atomic();
  atomic_set(&rb->busy, 0);

Although this works given current architectures and implementations, and given that this only needs to order prior writes against a later write, it does so by accident: smp_mb__before_atomic() is only guaranteed to work with read-modify-write atomic operations, and not at all with things like atomic_set() and atomic_read(). Note especially that smp_mb__before_atomic() will not, repeat *not*, order the prior write to "a" before the subsequent non-read-modify-write atomic read from "b", even on strongly ordered systems such as x86:

  WRITE_ONCE(a, 1);
  smp_mb__before_atomic();
  r1 = atomic_read(&b);

Therefore, replace the smp_mb__before_atomic() and atomic_set() with atomic_set_release() as follows:

  atomic_set_release(&rb->busy, 0);

This is no slower (and sometimes is faster) than the original, and also provides a formal guarantee of ordering that the original lacks.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: David Vernet <void@manifault.com>
Link: https://lore.kernel.org/bpf/ec86d38e-cfb4-44aa-8fdb-6c925922d93c@paulmck-laptop
-
Song Liu authored
htab_lock_bucket uses the following logic to avoid recursion:

  1. preempt_disable();
  2. check percpu counter htab->map_locked[hash] for recursion;
     2.1. if map_locked[hash] is already taken, return -EBUSY;
  3. raw_spin_lock_irqsave();

However, if an IRQ hits between 2 and 3, BPF programs attached to the IRQ logic will not be able to access the same hash of the hashtab and will get -EBUSY. This -EBUSY is not really necessary. Fix it by disabling IRQs before checking map_locked:

  1. preempt_disable();
  2. local_irq_save();
  3. check percpu counter htab->map_locked[hash] for recursion;
     3.1. if map_locked[hash] is already taken, return -EBUSY;
  4. raw_spin_lock().

Similarly, use raw_spin_unlock() and local_irq_restore() in htab_unlock_bucket().

Fixes: 20b6cc34 ("bpf: Avoid hashtab deadlock with map_locked")
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/7a9576222aa40b1c84ad3a9ba3e64011d1a04d41.camel@linux.ibm.com
Link: https://lore.kernel.org/bpf/20231012055741.3375999-1-song@kernel.org
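A simplified kernel-style sketch of the new ordering (not the actual hashtab code; the function name and the reduced bucket/map_locked handling are placeholders for illustration):

```
/* Sketch only: disable IRQs before consulting the per-cpu recursion
 * counter, so an IRQ cannot sneak in between the check and the lock.
 */
static inline int htab_lock_bucket_sketch(raw_spinlock_t *lock,
					  int __percpu *map_locked,
					  unsigned long *pflags)
{
	unsigned long flags;

	preempt_disable();
	local_irq_save(flags);				/* step 2: IRQs off first   */
	if (unlikely(__this_cpu_inc_return(*map_locked) != 1)) {
		__this_cpu_dec(*map_locked);		/* step 3.1: recursion seen */
		local_irq_restore(flags);
		preempt_enable();
		return -EBUSY;
	}
	raw_spin_lock(lock);				/* step 4: plain raw lock    */
	*pflags = flags;
	return 0;
}
```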
-
Albert Huang authored
In the previous implementation, when multiple xsk sockets were associated with a single xsk_buff_pool, a situation could arise where the xsk socket at the front of the xsk_tx_list was always served first, starving the xsk sockets at the back of the list. This could result in issues such as the inability to transmit packets, increased latency, and jitter. To address this problem, we introduce a new variable called tx_budget_spent, which limits each xsk to transmitting at most MAX_PER_SOCKET_BUDGET tx descriptors. This budget ensures equitable opportunities for subsequent xsk sockets to send tx descriptors. The value of MAX_PER_SOCKET_BUDGET is set to 32.

Signed-off-by: Albert Huang <huangjie.albert@bytedance.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Magnus Karlsson <magnus.karlsson@intel.com>
Link: https://lore.kernel.org/bpf/20231023125732.82261-1-huangjie.albert@bytedance.com
-
Alexei Starovoitov authored
Eduard Zingerman says:

====================
exact states comparison for iterator convergence checks

Iterator convergence logic in is_state_visited() uses states_equal() for states with branches counter > 0 to check if an iterator based loop converges. This is not fully correct because states_equal() relies on the presence of read and precision marks on registers. These marks are not guaranteed to be finalized while a state has branches. The commit message for patch #3 describes a program that exhibits such behavior.

This patch-set aims to fix iterator convergence logic by adding a notion of exact states comparison. Exact comparison does not rely on the presence of read or precision marks and thus is more strict. As explained in the commit message for patch #3, exact comparisons require the addition of speculative register bounds widening. The end result for BPF verifier users could be summarized as follows:

(!) After this update the verifier would reject programs that conjure an imprecise value on the first loop iteration and use it as precise on the second (for iterator based loops).

I urge people to at least skim over the commit message for patch #3.

Patches are organized as follows:
- patches #1,2: moving/extracting utility functions;
- patch #3: introduces exact mode for states comparison and adds widening heuristic;
- patch #4: adds test-cases that demonstrate why the series is necessary;
- patch #5: extends patch #3 with a notion of state loop entries, these entries have to be tracked to correctly identify that different verifier states belong to the same states loop;
- patch #6: adds a test-case that demonstrates a program which requires loop entry tracking for correct verification;
- patch #7: just adds a few debug prints.

The following actions are planned as a followup for this patch-set:
- the implementation has to be adapted for callbacks handling logic as a part of a fix for [1];
- it is necessary to explore ways to improve the widening heuristic to handle the iters_task_vma test w/o the need to insert barrier_var() calls;
- explored states eviction logic on cache miss has to be extended to either:
  - allow eviction of checkpoint states -or-
  - be sped up in case there are many active checkpoints associated with the same instruction.

The patch-set is a followup for the mailing list discussion [1].

Changelog:
- V2 [3] -> V3:
  - correct check for stack spills in widen_imprecise_scalars(), added test case progs/iters.c:widen_spill to check the behavior (suggested by Andrii);
  - allow eviction of checkpoint states in is_state_visited() to avoid pathological verifier performance when an iterator based loop does not converge (discussion with Alexei).
- V1 [2] -> V2, applied changes suggested by Alexei offlist:
  - __explored_state() function removed;
  - same_callsites() function is now used in clean_live_states();
  - patches #1,2 are added as preparatory code movement;
  - in process_iter_next_call() a safeguard is added to verify that cur_st->parent exists and has expected insn index / call sites.

[1] https://lore.kernel.org/bpf/97a90da09404c65c8e810cf83c94ac703705dc0e.camel@gmail.com/
[2] https://lore.kernel.org/bpf/20231021005939.1041-1-eddyz87@gmail.com/
[3] https://lore.kernel.org/bpf/20231022010812.9201-1-eddyz87@gmail.com/
====================

Link: https://lore.kernel.org/r/20231024000917.12153-1-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Eduard Zingerman authored
Additional logging in is_state_visited(): if an infinite loop is detected, print the full verifier state for both the current and the equivalent states.

Acked-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20231024000917.12153-8-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Eduard Zingerman authored
A convoluted test case for the iterators convergence logic that demonstrates that states with branch count equal to 0 might still be a part of a not completely explored loop.

E.g. consider the following state diagram:

               initial     Here state 'succ' was processed first,
                 |         it was eventually tracked to produce a
                 V         state identical to 'hdr'.
    .---------> hdr        All branches from 'succ' had been explored
    |            |         and thus 'succ' has its .branches == 0.
    |            V
    |    .------...        Suppose states 'cur' and 'succ' correspond
    |    |       |         to the same instruction + callsites.
    |    V       V         In such case it is necessary to check
    |   ...     ...        whether 'succ' and 'cur' are identical.
    |    |       |         If 'succ' and 'cur' are a part of the same loop
    |    V       V         they have to be compared exactly.
    |   succ <- cur
    |    |
    |    V
    |   ...
    |    |
    '----'

Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20231024000917.12153-7-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Eduard Zingerman authored
It turns out that .branches > 0 in is_state_visited() is not a sufficient condition to identify if two verifier states form a loop when iterators convergence is computed. This commit adds logic to distinguish situations like below:

 (I)          initial                 (II)          initial
                |                                     |
                V                                     V
    .---------> hdr                                  ..
    |            |                                    |
    |            V                                    V
    |    .------...                           .------..
    |    |       |                            |       |
    |    V       V                            V       V
    |   ...     ...                      .-> hdr     ..
    |    |       |                       |    |       |
    |    V       V                       |    V       V
    |   succ <- cur                      |   succ <- cur
    |    |                               |    |
    |    V                               |    V
    |   ...                              |   ...
    |    |                               |    |
    '----'                               '----'

For both (I) and (II) successor 'succ' of the current state 'cur' was previously explored and has branches count at 0. However, loop entry 'hdr' corresponding to 'succ' might be a part of the current DFS path. If that is the case 'succ' and 'cur' are members of the same loop and have to be compared exactly.

Co-developed-by: Andrii Nakryiko <andrii.nakryiko@gmail.com>
Co-developed-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Reviewed-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20231024000917.12153-6-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Eduard Zingerman authored
These test cases try to hide read and precision marks from the loop convergence logic: marks would only be assigned on subsequent loop iterations or after exploring states pushed to the env->head stack first. Without the verifier fix to use exact states comparison logic for iterators convergence, these tests (except 'triple_continue') would be erroneously marked as safe.

Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20231024000917.12153-5-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Eduard Zingerman authored
Convergence for open coded iterators is computed in is_state_visited() by examining states with branches count > 1 and using states_equal(). states_equal() computes the sub-state relation using read and precision marks. Read and precision marks are propagated from children states, and thus are not guaranteed to be complete inside a loop when branches count > 1. This can be demonstrated using the following unsafe program:

     1. r7 = -16
     2. r6 = bpf_get_prandom_u32()
     3. while (bpf_iter_num_next(&fp[-8])) {
     4.   if (r6 != 42) {
     5.     r7 = -32
     6.     r6 = bpf_get_prandom_u32()
     7.     continue
     8.   }
     9.   r0 = r10
    10.   r0 += r7
    11.   r8 = *(u64 *)(r0 + 0)
    12.   r6 = bpf_get_prandom_u32()
    13. }

Here the verifier would first visit path 1-3, create a checkpoint at 3 with r7=-16, then continue to 4-7,3 with r7=-32. Because instructions at 9-12 had not been visited yet, the existing checkpoint at 3 does not have a read or precision mark for r7. Thus states_equal() would return true and the verifier would discard the current state, so the unsafe memory access at 11 would not be caught.

This commit fixes this loophole by introducing exact state comparisons for iterator convergence logic:
- registers are compared using regs_exact() regardless of read or precision marks;
- stack slots have to have identical type.

Unfortunately, this is too strict even for simple programs like below:

    i = 0;
    while (iter_next(&it))
      i++;

At each iteration step i++ would produce a new distinct state and eventually the instruction processing limit would be reached. To avoid such behavior, speculatively forget (widen) the range for imprecise scalar registers, if those registers were not precise at the end of the previous iteration and do not match exactly.

This is a conservative heuristic that allows verifying a wide range of programs; however, it precludes verification of programs that conjure an imprecise value on the first loop iteration and use it as precise on the second. Test case iter_task_vma_for_each() presents one such case:

    unsigned int seen = 0;
    ...
    bpf_for_each(task_vma, vma, task, 0) {
      if (seen >= 1000)
        break;
      ...
      seen++;
    }

Here clang generates the following code:

    <LBB0_4>:
        24:  r8 = r6                          ; stash current value of
        ... body ...                          ; 'seen'
        29:  r1 = r10
        30:  r1 += -0x8
        31:  call bpf_iter_task_vma_next
        32:  r6 += 0x1                        ; seen++;
        33:  if r0 == 0x0 goto +0x2 <LBB0_6>  ; exit on next() == NULL
        34:  r7 += 0x10
        35:  if r8 < 0x3e7 goto -0xc <LBB0_4> ; loop on seen < 1000
    <LBB0_6>:
        ... exit ...

Note that the counter in r6 is copied to r8 and then incremented, and the conditional jump is done using r8. Because of this the precision mark for r6 lags one state behind the precision mark on r8 and the widening logic kicks in. Adding barrier_var(seen) after the conditional is sufficient to force clang to use the same register for both counting and the conditional jump.

This issue was discussed in the thread [1] which was started by Andrew Werner <awerner32@gmail.com> demonstrating a similar bug in callback functions handling. The callbacks would be addressed in a followup patch.

[1] https://lore.kernel.org/bpf/97a90da09404c65c8e810cf83c94ac703705dc0e.camel@gmail.com/

Co-developed-by: Andrii Nakryiko <andrii.nakryiko@gmail.com>
Co-developed-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20231024000917.12153-4-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Eduard Zingerman authored
Extract same_callsites() from clean_live_states() as a utility function. This function would be used by the next patch in the set.

Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20231024000917.12153-3-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Eduard Zingerman authored
Subsequent patches would make use of the explored_state() function. Move it up to avoid adding an unnecessary prototype.

Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20231024000917.12153-2-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
- 23 Oct, 2023 2 commits
-
-
Daniel Borkmann authored
Small cleanup to get rid of the extra tcx_link_const() and only retain tcx_link().

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/r/20231023185015.21152-1-daniel@iogearbox.net
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
-
Denys Zagorui authored
This modification doesn't change the behaviour of syscall_tp, but such code is often used as a reference, so it should be correct anyway.

Signed-off-by: Denys Zagorui <dzagorui@cisco.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20231019113521.4103825-1-dzagorui@cisco.com
-
- 20 Oct, 2023 18 commits
-
-
Alexei Starovoitov authored
Hou Tao says:

====================
bpf: Fixes for per-cpu kptr

From: Hou Tao <houtao1@huawei.com>

Hi,

The patchset aims to fix the problems found in the review of the per-cpu kptr patch-set [0]. Patch #1 moves pcpu_lock after the invocation of pcpu_chunk_addr_search() and is a micro-optimization for free_percpu(). The reason for including it in the patchset is that the same logic is used in the newly-added API pcpu_alloc_size(). Patch #2 introduces pcpu_alloc_size() for dynamic per-cpu areas. Patch #2 and #3 use pcpu_alloc_size() to check whether or not unit_size matches the size of the underlying per-cpu area and to select a matching bpf_mem_cache. Patch #4 fixes the freeing of per-cpu kptrs when these kptrs are freed by map destruction. The last patch adds test cases for these problems. Please see the individual patches for details. Comments are always welcome.

Change Log:

v3:
 * rebased on bpf-next
 * patch 2: update API document to note that pcpu_alloc_size() doesn't support statically allocated per-cpu area. (Dennis)
 * patch 1 & 2: add Acked-by from Dennis

v2: https://lore.kernel.org/bpf/20231018113343.2446300-1-houtao@huaweicloud.com/
 * add a new patch "don't acquire pcpu_lock for pcpu_chunk_addr_search()"
 * patch 2: change type of bit_off and end to unsigned long (Andrew)
 * patch 2: rename the new API as pcpu_alloc_size and follow 80-column convention (Dennis)
 * patch 5: move the common declaration into bpf.h (Stanislav, Alexei)

v1: https://lore.kernel.org/bpf/20231007135106.3031284-1-houtao@huaweicloud.com/

[0]: https://lore.kernel.org/bpf/20230827152729.1995219-1-yonghong.song@linux.dev
====================

Link: https://lore.kernel.org/r/20231020133202.4043247-1-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Hou Tao authored
Add the following 3 test cases for the bpf memory allocator:

1) Do allocation in a bpf program and free through map free
2) Do batch per-cpu allocation and per-cpu free in a bpf program
3) Do per-cpu allocation in a bpf program and free through map free

For per-cpu allocation, because per-cpu allocation can not always refill in time, tests 2) and 3) consider it OK for bpf_percpu_obj_new_impl() to return NULL.

Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20231020133202.4043247-8-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Hou Tao authored
The following warning was reported when running "./test_progs -t test_bpf_ma/percpu_free_through_map_free":

  ------------[ cut here ]------------
  WARNING: CPU: 1 PID: 68 at kernel/bpf/memalloc.c:342
  CPU: 1 PID: 68 Comm: kworker/u16:2 Not tainted 6.6.0-rc2+ #222
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)
  Workqueue: events_unbound bpf_map_free_deferred
  RIP: 0010:bpf_mem_refill+0x21c/0x2a0
  ......
  Call Trace:
   <IRQ>
   ? bpf_mem_refill+0x21c/0x2a0
   irq_work_single+0x27/0x70
   irq_work_run_list+0x2a/0x40
   irq_work_run+0x18/0x40
   __sysvec_irq_work+0x1c/0xc0
   sysvec_irq_work+0x73/0x90
   </IRQ>
   <TASK>
   asm_sysvec_irq_work+0x1b/0x20
  RIP: 0010:unit_free+0x50/0x80
  ......
   bpf_mem_free+0x46/0x60
   __bpf_obj_drop_impl+0x40/0x90
   bpf_obj_free_fields+0x17d/0x1a0
   array_map_free+0x6b/0x170
   bpf_map_free_deferred+0x54/0xa0
   process_scheduled_works+0xba/0x370
   worker_thread+0x16d/0x2e0
   kthread+0x105/0x140
   ret_from_fork+0x39/0x60
   ret_from_fork_asm+0x1b/0x30
   </TASK>
  ---[ end trace 0000000000000000 ]---

The reason is simple: __bpf_obj_drop_impl() does not know that the field being freed is a per-cpu pointer and it uses bpf_global_ma to free the pointer. Because bpf_global_ma is not a per-cpu allocator, ksize() is used to select the corresponding cache. The bpf_mem_cache with 16-byte unit_size will always be selected to do the unmatched free and it will eventually trigger the warning in free_bulk().

Because per-cpu kptr doesn't support list or rb-tree now, fix the problem by only checking whether or not the type of the kptr is per-cpu in bpf_obj_free_fields(), and using bpf_global_percpu_ma to free these kptrs.

Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20231020133202.4043247-7-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Hou Tao authored
Both syscall.c and helpers.c have a declaration of __bpf_obj_drop_impl(), so just move it to a common header file.

Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20231020133202.4043247-6-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Hou Tao authored
For bpf_global_percpu_ma, the pointer passed to bpf_mem_free_rcu() is allocated by kmalloc() and its size is fixed (16 bytes on x86-64). So no matter which cache allocates the dynamic per-cpu area, on x86-64 cache[2] will always be used to free the per-cpu area. Fix the imbalance by checking whether the bpf memory allocator is per-cpu or not and using pcpu_alloc_size() instead of ksize() to find the correct cache for the per-cpu free.

Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20231020133202.4043247-5-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Hou Tao authored
With pcpu_alloc_size() in place, check whether or not the size of the dynamic per-cpu area is matched with unit_size.

Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20231020133202.4043247-4-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Hou Tao authored
Introduce pcpu_alloc_size() to get the size of a dynamic per-cpu area. It will be used by the bpf memory allocator in the following patches. The BPF memory allocator maintains per-cpu area caches for multiple area sizes, and its free API only receives the to-be-freed per-cpu pointer, so it needs the size of the dynamic per-cpu area to select the corresponding cache when a bpf program frees the dynamic per-cpu pointer.

Acked-by: Dennis Zhou <dennis@kernel.org>
Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20231020133202.4043247-3-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
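A kernel-style sketch of the intended usage pattern (illustrative only; the function name is a placeholder and the cache-lookup step stands in for the real bpf_mem_alloc selection logic from the follow-up patches):

```
#include <linux/percpu.h>

/* Sketch: given only the per-cpu pointer, recover the size of its
 * dynamic area so the matching free-side cache can be picked.
 */
static size_t sketch_percpu_obj_size(void __percpu *pptr)
{
	/* The caller would then select the bpf_mem_cache whose
	 * unit_size equals this value and free pptr into it.
	 */
	return pcpu_alloc_size(pptr);
}
```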
-
Hou Tao authored
There is no need to acquire pcpu_lock for pcpu_chunk_addr_search():
1) both pcpu_first_chunk & pcpu_reserved_chunk must have been initialized before the invocation of free_percpu();
2) the dynamically-created chunk must be valid before the per-cpu pointers allocated from it are freed.

So acquire pcpu_lock after the invocation of pcpu_chunk_addr_search().

Acked-by: Dennis Zhou <dennis@kernel.org>
Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20231020133202.4043247-2-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Kumar Kartikeya Dwivedi authored
The linked list failure tests 'pop_front_off' and 'pop_back_off' currently rely on matching exact instruction and register values. The purpose of the tests is to ensure the offset is correctly incremented for the pointers returned from the list pop helpers, which can then be used with container_of to obtain the real object. Hence, somehow obtaining the information that the offset is 48 will work for us. Make the tests more robust by relying on the verifier error string for bpf_spin_lock and remove the dependence on the fragile instruction index or register number, which can be affected by different clang versions used to build the selftests.

Fixes: 300f19dc ("selftests/bpf: Add BPF linked list API tests")
Reported-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20231020144839.2734006-1-memxor@gmail.com
-
Alexei Starovoitov authored
Chuyi Zhou says:

====================
Add Open-coded task, css_task and css iters

This is version 6 of task, css_task and css iters support.

--- Changelog ---

v5 -> v6:

Patch #3:
 * In bpf_iter_task_next, return pos rather than goto out. (Andrii)

Patch #2, #3, #4:
 * Add the missing __diag_ignore_all to avoid a kernel build warning

Patch #5, #6, #7:
 * Add Andrii's ack

Patch #8:
 * In BPF prog iter_css_task_for_each, return -EPERM rather than 0, and ensure stack_mprotect() in iters.c does not succeed. If not, it would cause the subsequent 'test_lsm' to fail, since the 'is_stack' check in test_int_hook(lsm.c) would not be guaranteed.
   (https://github.com/kernel-patches/bpf/actions/runs/6489662214/job/17624665086?pr=5790)

v4 -> v5: https://lore.kernel.org/lkml/20231007124522.34834-1-zhouchuyi@bytedance.com/

Patch 3~4:
 * Relax the BUILD_BUG_ON check in bpf_iter_task_new and bpf_iter_css_new to avoid the netdev/build_32bit CI error.
   (https://netdev.bots.linux.dev/static/nipa/790929/13412333/build_32bit/stderr)

Patch 8:
 * Initialize the skel pointer to fix the LLVM-16 build CI error
   (https://github.com/kernel-patches/bpf/actions/runs/6462875618/job/17545170863)

v3 -> v4: https://lore.kernel.org/all/20230925105552.817513-1-zhouchuyi@bytedance.com/
 * Address all the comments from Andrii in patch-3 ~ patch-6
 * Collect Tejun's ack
 * Add an extra patch to rename bpf_iter_task.c to bpf_iter_tasks.c
 * Separate three BPF program files for selftests (iters_task.c iters_css_task.c iters_css.c)

v2 -> v3: https://lore.kernel.org/lkml/20230912070149.969939-1-zhouchuyi@bytedance.com/

Patch 1 (cgroup: Prepare for using css_task_iter_*() in BPF)
 * Add tj's ack and Alexei's suggest-by.

Patch 2 (bpf: Introduce css_task open-coded iterator kfuncs)
 * Use bpf_mem_alloc/bpf_mem_free rather than kzalloc()
 * Add KF_TRUSTED_ARGS for bpf_iter_css_task_new (Alexei)
 * Move bpf_iter_css_task's definition from uapi/linux/bpf.h to kernel/bpf/task_iter.c and we can use it from vmlinux.h
 * Move bpf_iter_css_task_XXX's declaration from bpf_helpers.h to bpf_experimental.h

Patch 3 (Introduce task open coded iterator kfuncs)
 * Change the API design to keep consistent with SEC("iter/task"), supporting iterating all threads (BPF_TASK_ITERATE_ALL) and threads of a specific task (BPF_TASK_ITERATE_THREAD). (Andrii)
 * Move bpf_iter_task's definition from uapi/linux/bpf.h to kernel/bpf/task_iter.c and we can use it from vmlinux.h
 * Move bpf_iter_task_XXX's declaration from bpf_helpers.h to bpf_experimental.h

Patch 4 (Introduce css open-coded iterator kfuncs)
 * Change the API design to keep consistent with cgroup_iters, reuse BPF_CGROUP_ITER_DESCENDANTS_PRE/BPF_CGROUP_ITER_DESCENDANTS_POST/BPF_CGROUP_ITER_ANCESTORS_UP (Andrii)
 * Add KF_TRUSTED_ARGS for bpf_iter_css_new
 * Move bpf_iter_css's definition from uapi/linux/bpf.h to kernel/bpf/task_iter.c and we can use it from vmlinux.h
 * Move bpf_iter_css_XXX's declaration from bpf_helpers.h to bpf_experimental.h

Patch 5 (teach the verifier to enforce css_iter and task_iter in RCU CS)
 * Add KF flag KF_RCU_PROTECTED to maintain kfuncs which need RCU CS. (Andrii)
 * Consider STACK_ITER when using bpf_for_each_spilled_reg.

Patch 6 (Let bpf_iter_task_new accept null task ptr)
 * Add this extra patch to let bpf_iter_task_new accept a 'nullable' task pointer (Andrii)

Patch 7 (selftests/bpf: Add tests for open-coded task and css iter)
 * Add failure testcase (Alexei)

Changes from v1 (https://lore.kernel.org/lkml/20230827072057.1591929-1-zhouchuyi@bytedance.com/):
 - Add a pre-patch to make some preparations before supporting css_task iters. (Alexei)
 - Add an allowlist for css_task iters. (Alexei)
 - Let bpf progs do explicit bpf_rcu_read_lock() when using process iters and css_descendant iters. (Alexei)

---------------------

In some BPF usage scenarios, it will be useful to iterate the processes and css directly in the BPF program. One of the expected scenarios is customizable OOM victim selection via BPF [1].

Inspired by Dave's task_vma iter [2], this patchset adds three types of open-coded iterator kfuncs:

1. bpf_task_iters. It can be used to 1) iterate all processes in the system, like for_each_process() in the kernel; 2) iterate all threads in the system; 3) iterate all threads of a specific task.

2. bpf_css_iters. It works like css_task_iter_{start, next, end} and would be used for iterating tasks/threads under a css.

3. css_iters. It works like css_next_descendant_{pre, post} for iterating all descendant css.

BPF programs can use these kfuncs directly or through the bpf_for_each macro.

link[1]: https://lore.kernel.org/lkml/20230810081319.65668-1-zhouchuyi@bytedance.com/
link[2]: https://lore.kernel.org/all/20230810183513.684836-1-davemarchevsky@fb.com/
====================

Link: https://lore.kernel.org/r/20231018061746.111364-1-zhouchuyi@bytedance.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Chuyi Zhou authored
This patch adds 4 subtests to demonstrate these patterns and validate correctness.

subtest1:
 1) We use task_iter to iterate all processes in the system and search for the current process with a given pid.
 2) We create some threads in the current process context, and use BPF_TASK_ITER_PROC_THREADS to iterate all threads of the current process. As expected, we would find all the threads of the current process.
 3) We create some threads and use BPF_TASK_ITER_ALL_THREADS to iterate all threads in the system. As expected, we would find all the threads which were created.

subtest2:
 We create a cgroup and add the current task to the cgroup. In the BPF program, we would use bpf_for_each(css_task, task, css) to iterate all tasks under the cgroup. As expected, we would find the current process.

subtest3:
 1) We create a cgroup tree. In the BPF program, we use bpf_for_each(css, pos, root, XXX) to iterate all descendants under the root in pre and post order. As expected, we would find all descendants; the last iterated cgroup in post-order is the root cgroup, and the first iterated cgroup in pre-order is the root cgroup.
 2) We use BPF_CGROUP_ITER_ANCESTORS_UP to traverse the cgroup tree starting from the leaf and the root separately, and record the height. The difference of the heights would be the total tree height - 1.

subtest4:
 Add some failure testcases when using css_task, task and css iters, e.g., unlock when using task-iters to iterate tasks.

Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com>
Link: https://lore.kernel.org/r/20231018061746.111364-9-zhouchuyi@bytedance.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Chuyi Zhou authored
The newly-added struct bpf_iter_task has a name collision with a selftest for the seq_file task iter's bpf skel, so the selftests/bpf/progs file is renamed in order to avoid the collision.

Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231018061746.111364-8-zhouchuyi@bytedance.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Chuyi Zhou authored
When using task_iter to iterate all threads of a specific task, we enforce that the user must pass a valid task pointer to ensure safety. However, when iterating all threads/processes in the system, the BPF verifier still requires a valid ptr instead of a "nullable" pointer, even though it's pointless, which is kind of surprising from a usability standpoint. It would be nice if we could let that kfunc accept an explicit null pointer when we are using BPF_TASK_ITER_ALL_{PROCS, THREADS} and a valid pointer when using BPF_TASK_ITER_THREAD.

Given a trivial kfunc:

  __bpf_kfunc void FN(struct TYPE_A *obj);

the verifier would reject a nullptr for obj. The error info is: "arg#x pointer type xx xx must point to scalar, or struct with scalar" reported by get_kfunc_ptr_arg_type(). The reg->type is SCALAR_VALUE and the btf type of ref_t is not scalar or scalar_struct, which leads to the rejection in get_kfunc_ptr_arg_type().

This patch adds a "__nullable" annotation:

  __bpf_kfunc void FN(struct TYPE_A *obj__nullable);

Here __nullable indicates obj can be optional; the user can pass an explicit nullptr or a normal TYPE_A pointer. In get_kfunc_ptr_arg_type(), we detect whether the current arg is optional and the register is null. If so, return a new kfunc_ptr_arg_type KF_ARG_PTR_TO_NULL and skip to the next arg in check_kfunc_args().

Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231018061746.111364-7-zhouchuyi@bytedance.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Chuyi Zhou authored
css_iter and task_iter should be used in an RCU section. Specifically, in sleepable progs an explicit bpf_rcu_read_lock() is needed before using these iters. In normal bpf progs that have an implicit rcu_read_lock(), it's OK to use them directly.

This patch adds a new KF flag KF_RCU_PROTECTED for bpf_iter_task_new and bpf_iter_css_new. It means the kfunc should be used in an RCU CS. We check whether we are in an RCU CS before invoking this kfunc. If the RCU protection is guaranteed, we would let st->type = PTR_TO_STACK | MEM_RCU. Once the user does rcu_read_unlock() during the iteration, the MEM_RCU state of the regs would be cleared. is_iter_reg_valid_init() will reject the prog if reg->type is UNTRUSTED.

It is worth noting that currently bpf_rcu_read_unlock does not clear the state of the STACK_ITER reg, since bpf_for_each_spilled_reg only considers STACK_SPILL. This patch also lets bpf_for_each_spilled_reg search STACK_ITER.

Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231018061746.111364-6-zhouchuyi@bytedance.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Chuyi Zhou authored
This patch adds kfuncs bpf_iter_css_{new,next,destroy} which allow creation and manipulation of struct bpf_iter_css in open-coded iterator style. These kfuncs actually wrap css_next_descendant_{pre, post}.

css_iter can be used to:
1) iterate a specific cgroup tree in pre/post/up order
2) iterate a cgroup subsystem in a BPF prog, like for_each_mem_cgroup_tree/cpuset_for_each_descendant_pre in the kernel.

The API design is consistent with cgroup_iter. bpf_iter_css_new accepts parameters defining the iteration order and the starting css. Here we also reuse the BPF_CGROUP_ITER_DESCENDANTS_PRE, BPF_CGROUP_ITER_DESCENDANTS_POST, BPF_CGROUP_ITER_ANCESTORS_UP enums.

Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20231018061746.111364-5-zhouchuyi@bytedance.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
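A hedged BPF-side sketch of walking a cgroup subtree with these kfuncs, loosely modeled on the selftests in this series (assumptions: vmlinux.h and bpf_experimental.h provide the bpf_for_each() macro and iterator declarations, bpf_cgroup_from_id()/bpf_cgroup_release() are used to obtain a trusted starting css, the explicit RCU lock reflects the requirement described in the KF_RCU_PROTECTED entry above, and the hook choice and cgroup id are placeholders):

```
// SPDX-License-Identifier: GPL-2.0
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>
#include "bpf_experimental.h"

struct cgroup *bpf_cgroup_from_id(u64 cgid) __ksym;
void bpf_cgroup_release(struct cgroup *cgrp) __ksym;
void bpf_rcu_read_lock(void) __ksym;
void bpf_rcu_read_unlock(void) __ksym;

const volatile u64 root_cg_id;	/* set by userspace before load */
u64 css_seen;

SEC("lsm.s/file_open")
int BPF_PROG(walk_subtree, struct file *file)
{
	struct cgroup_subsys_state *pos;
	struct cgroup *cgrp;

	cgrp = bpf_cgroup_from_id(root_cg_id);
	if (!cgrp)
		return 0;

	bpf_rcu_read_lock();
	/* pre-order walk of the subtree rooted at cgrp's own css */
	bpf_for_each(css, pos, &cgrp->self, BPF_CGROUP_ITER_DESCENDANTS_PRE) {
		css_seen++;
	}
	bpf_rcu_read_unlock();

	bpf_cgroup_release(cgrp);
	return 0;
}

char LICENSE[] SEC("license") = "GPL";
```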
-
Chuyi Zhou authored
This patch adds kfuncs bpf_iter_task_{new,next,destroy} which allow creation and manipulation of struct bpf_iter_task in open-coded iterator style. BPF programs can use these kfuncs directly or through the bpf_for_each macro to iterate all processes in the system.

The API design is kept consistent with SEC("iter/task"). bpf_iter_task_new() accepts a specific task and an iterating type which allows:
1. iterating all processes in the system (BPF_TASK_ITER_ALL_PROCS)
2. iterating all threads in the system (BPF_TASK_ITER_ALL_THREADS)
3. iterating all threads of a specific task (BPF_TASK_ITER_PROC_THREADS)

Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com>
Link: https://lore.kernel.org/r/20231018061746.111364-4-zhouchuyi@bytedance.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
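A hedged sketch of the resulting usage (assumptions: vmlinux.h and bpf_experimental.h provide the bpf_for_each() macro and iterator declarations, the sleepable LSM hook is just a convenient attach point, and passing NULL relies on the nullable-argument patch later in this series):

```
// SPDX-License-Identifier: GPL-2.0
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>
#include "bpf_experimental.h"

void bpf_rcu_read_lock(void) __ksym;
void bpf_rcu_read_unlock(void) __ksym;

u64 procs_cnt;

SEC("lsm.s/file_open")
int BPF_PROG(count_procs, struct file *file)
{
	struct task_struct *pos;

	/* Sleepable prog: the iterator kfuncs are KF_RCU_PROTECTED,
	 * so wrap the walk in an explicit RCU read section.
	 */
	bpf_rcu_read_lock();
	bpf_for_each(task, pos, NULL, BPF_TASK_ITER_ALL_PROCS) {
		procs_cnt++;
	}
	bpf_rcu_read_unlock();
	return 0;
}

char LICENSE[] SEC("license") = "GPL";
```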
-
Chuyi Zhou authored
This patch adds kfuncs bpf_iter_css_task_{new,next,destroy} which allow creation and manipulation of struct bpf_iter_css_task in open-coded iterator style. These kfuncs actually wrap css_task_iter_{start,next,end}. BPF programs can use these kfuncs through the bpf_for_each macro to iterate all tasks under a css. css_task_iter_*() would try to take the global spin-lock *css_set_lock*, so the bpf side has to be careful about where it allows this iter to be used. Currently we only allow it in bpf_lsm and bpf iter-s.

Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20231018061746.111364-3-zhouchuyi@bytedance.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
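A hedged sketch of using these kfuncs from an allowed context (assumptions: the LSM hook, the cgroup-id lookup via bpf_cgroup_from_id()/bpf_cgroup_release(), and the use of the cgroup's own css as the trusted starting point mirror the selftest added later in this series; cg_id is a placeholder filled in by userspace):

```
// SPDX-License-Identifier: GPL-2.0
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>
#include "bpf_experimental.h"

struct cgroup *bpf_cgroup_from_id(u64 cgid) __ksym;
void bpf_cgroup_release(struct cgroup *cgrp) __ksym;

const volatile u64 cg_id;	/* set by userspace before load */
u64 tasks_in_cg;

SEC("lsm/file_mprotect")
int BPF_PROG(count_css_tasks, struct vm_area_struct *vma,
	     unsigned long reqprot, unsigned long prot, int ret)
{
	struct cgroup_subsys_state *css;
	struct task_struct *task;
	struct cgroup *cgrp;

	cgrp = bpf_cgroup_from_id(cg_id);
	if (!cgrp)
		return 0;
	css = &cgrp->self;

	/* css_task iteration is only allowed in LSM programs and bpf
	 * iters; it takes css_set_lock internally.
	 */
	bpf_for_each(css_task, task, css, CSS_TASK_ITER_PROCS) {
		tasks_in_cg++;
	}

	bpf_cgroup_release(cgrp);
	return 0;
}

char LICENSE[] SEC("license") = "GPL";
```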
-
Chuyi Zhou authored
This patch makes some preparations for using css_task_iter_*() in BPF programs:

1. The CSS_TASK_ITER_* flags are #define-s and it's not easy for a bpf prog to use them. Convert them to an enum so a bpf prog can take them from vmlinux.h.

2. In the next patch we will add css_task_iter_*() to common kfuncs, which is not safe as-is: css_task_iter_*() does spin_unlock_irq(), which might screw up irq flags depending on the context where the bpf prog is running. So use irqsave/irqrestore here; the switch is harmless.

Suggested-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20231018061746.111364-2-zhouchuyi@bytedance.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
- 19 Oct, 2023 1 commit
-
-
Manu Bretelle authored
When dumping a struct_ops, 2 dictionaries are emitted. When using `name`, they were already wrapped in an array, but not when using `id`, causing `jq` to fail at parsing the payload as it reached the comma following the first dict. This change wraps those dictionaries in an array so valid JSON is emitted.

Before, jq fails to parse the output:

```
$ sudo bpftool struct_ops dump id 1523612 | jq . > /dev/null
parse error: Expected value before ',' at line 19, column 2
```

After, no error parsing the output:

```
sudo ./bpftool struct_ops dump id 1523612 | jq . > /dev/null
```

Signed-off-by: Manu Bretelle <chantr4@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Quentin Monnet <quentin@isovalent.com>
Link: https://lore.kernel.org/bpf/20231018230133.1593152-3-chantr4@gmail.com
-