- 19 Jul, 2022 7 commits
-
-
Stanislav Fomichev authored
This particular ones is about having the following: CONFIG_BPF_LSM=y # CONFIG_CGROUP_BPF is not set Also, add __maybe_unused to the args for the !CONFIG_NET cases. Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Stanislav Fomichev <sdf@google.com> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/r/20220714185404.3647772-1-sdf@google.comSigned-off-by: Alexei Starovoitov <ast@kernel.org>
-
Alexei Starovoitov authored
Andrii Nakryiko says: ==================== Add SEC("ksyscall")/SEC("kretsyscall") sections and corresponding bpf_program__attach_ksyscall() API that simplifies tracing kernel syscalls through kprobe mechanism. Kprobing syscalls isn't trivial due to varying syscall handler names in the kernel and various ways syscall argument are passed, depending on kernel architecture and configuration. SEC("ksyscall") allows user to not care about such details and just get access to syscall input arguments, while libbpf takes care of necessary feature detection logic. There are still more quirks that are not straightforward to hide completely (see comments about mmap(), clone() and compat syscalls), so in such more advanced scenarios user might need to fall back to plain SEC("kprobe") approach, but for absolute majority of users SEC("ksyscall") is a big improvement. As part of this patch set libbpf adds two more virtual __kconfig externs, in addition to existing LINUX_KERNEL_VERSION: LINUX_HAS_BPF_COOKIE and LINUX_HAS_SYSCALL_WRAPPER, which let's libbpf-provided BPF-side code minimize external dependencies and assumptions and let's user-space part of libbpf to perform all the feature detection logic. This benefits USDT support code, which now doesn't depend on BPF CO-RE for its functionality. v1->v2: - normalize extern variable-related warn and debug message formats (Alan); rfc->v1: - drop dependency on kallsyms and speed up SYSCALL_WRAPPER detection (Alexei); - drop dependency on /proc/config.gz in bpf_tracing.h (Yaniv); - add doc comment and ephasize mmap(), clone() and compat quirks that are not supported (Ilya); - use mechanism similar to LINUX_KERNEL_VERSION to also improve USDT code. ==================== Reviewed-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Andrii Nakryiko authored
Convert few selftest that used plain SEC("kprobe") with arch-specific syscall wrapper prefix to ksyscall/kretsyscall and corresponding BPF_KSYSCALL macro. test_probe_user.c is especially benefiting from this simplification. Tested-by: Alan Maguire <alan.maguire@oracle.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20220714070755.3235561-6-andrii@kernel.orgSigned-off-by: Alexei Starovoitov <ast@kernel.org>
-
Andrii Nakryiko authored
Add SEC("ksyscall")/SEC("ksyscall/<syscall_name>") and corresponding kretsyscall variants (for return kprobes) to allow users to kprobe syscall functions in kernel. These special sections allow to ignore complexities and differences between kernel versions and host architectures when it comes to syscall wrapper and corresponding __<arch>_sys_<syscall> vs __se_sys_<syscall> differences, depending on whether host kernel has CONFIG_ARCH_HAS_SYSCALL_WRAPPER (though libbpf itself doesn't rely on /proc/config.gz for detecting this, see BPF_KSYSCALL patch for how it's done internally). Combined with the use of BPF_KSYSCALL() macro, this allows to just specify intended syscall name and expected input arguments and leave dealing with all the variations to libbpf. In addition to SEC("ksyscall+") and SEC("kretsyscall+") add bpf_program__attach_ksyscall() API which allows to specify syscall name at runtime and provide associated BPF cookie value. At the moment SEC("ksyscall") and bpf_program__attach_ksyscall() do not handle all the calling convention quirks for mmap(), clone() and compat syscalls. It also only attaches to "native" syscall interfaces. If host system supports compat syscalls or defines 32-bit syscalls in 64-bit kernel, such syscall interfaces won't be attached to by libbpf. These limitations may or may not change in the future. Therefore it is recommended to use SEC("kprobe") for these syscalls or if working with compat and 32-bit interfaces is required. Tested-by: Alan Maguire <alan.maguire@oracle.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20220714070755.3235561-5-andrii@kernel.orgSigned-off-by: Alexei Starovoitov <ast@kernel.org>
-
Andrii Nakryiko authored
Improve BPF_KPROBE_SYSCALL (and rename it to shorter BPF_KSYSCALL to match libbpf's SEC("ksyscall") section name, added in next patch) to use __kconfig variable to determine how to properly fetch syscall arguments. Instead of relying on hard-coded knowledge of whether kernel's architecture uses syscall wrapper or not (which only reflects the latest kernel versions, but is not necessarily true for older kernels and won't necessarily hold for later kernel versions on some particular host architecture), determine this at runtime by attempting to create perf_event (with fallback to kprobe event creation through tracefs on legacy kernels, just like kprobe attachment code is doing) for kernel function that would correspond to bpf() syscall on a system that has CONFIG_ARCH_HAS_SYSCALL_WRAPPER set (e.g., for x86-64 it would try '__x64_sys_bpf'). If host kernel uses syscall wrapper, syscall kernel function's first argument is a pointer to struct pt_regs that then contains syscall arguments. In such case we need to use bpf_probe_read_kernel() to fetch actual arguments (which we do through BPF_CORE_READ() macro) from inner pt_regs. But if the kernel doesn't use syscall wrapper approach, input arguments can be read from struct pt_regs directly with no probe reading. All this feature detection is done without requiring /proc/config.gz existence and parsing, and BPF-side helper code uses newly added LINUX_HAS_SYSCALL_WRAPPER virtual __kconfig extern to keep in sync with user-side feature detection of libbpf. BPF_KSYSCALL() macro can be used both with SEC("kprobe") programs that define syscall function explicitly (e.g., SEC("kprobe/__x64_sys_bpf")) and SEC("ksyscall") program added in the next patch (which are the same kprobe program with added benefit of libbpf determining correct kernel function name automatically). Kretprobe and kretsyscall (added in next patch) programs don't need BPF_KSYSCALL as they don't provide access to input arguments. Normal BPF_KRETPROBE is completely sufficient and is recommended. Tested-by: Alan Maguire <alan.maguire@oracle.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20220714070755.3235561-4-andrii@kernel.orgSigned-off-by: Alexei Starovoitov <ast@kernel.org>
-
Andrii Nakryiko authored
Exercise libbpf's logic for unknown __weak virtual __kconfig externs. USDT selftests are already excercising non-weak known virtual extern already (LINUX_HAS_BPF_COOKIE), so no need to add explicit tests for it. Tested-by: Alan Maguire <alan.maguire@oracle.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20220714070755.3235561-3-andrii@kernel.orgSigned-off-by: Alexei Starovoitov <ast@kernel.org>
-
Andrii Nakryiko authored
Libbpf supports single virtual __kconfig extern currently: LINUX_KERNEL_VERSION. LINUX_KERNEL_VERSION isn't coming from /proc/kconfig.gz and is intead customly filled out by libbpf. This patch generalizes this approach to support more such virtual __kconfig externs. One such extern added in this patch is LINUX_HAS_BPF_COOKIE which is used for BPF-side USDT supporting code in usdt.bpf.h instead of using CO-RE-based enum detection approach for detecting bpf_get_attach_cookie() BPF helper. This allows to remove otherwise not needed CO-RE dependency and keeps user-space and BPF-side parts of libbpf's USDT support strictly in sync in terms of their feature detection. We'll use similar approach for syscall wrapper detection for BPF_KSYSCALL() BPF-side macro in follow up patch. Generally, currently libbpf reserves CONFIG_ prefix for Kconfig values and LINUX_ for virtual libbpf-backed externs. In the future we might extend the set of prefixes that are supported. This can be done without any breaking changes, as currently any __kconfig extern with unrecognized name is rejected. For LINUX_xxx externs we support the normal "weak rule": if libbpf doesn't recognize given LINUX_xxx extern but such extern is marked as __weak, it is not rejected and defaults to zero. This follows CONFIG_xxx handling logic and will allow BPF applications to opportunistically use newer libbpf virtual externs without breaking on older libbpf versions unnecessarily. Tested-by: Alan Maguire <alan.maguire@oracle.com> Reviewed-by: Alan Maguire <alan.maguire@oracle.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20220714070755.3235561-2-andrii@kernel.orgSigned-off-by: Alexei Starovoitov <ast@kernel.org>
-
- 15 Jul, 2022 8 commits
-
-
Jon Doron authored
Add support for writing a custom event reader, by exposing the ring buffer. With the new API perf_buffer__buffer() you will get access to the raw mmaped()'ed per-cpu underlying memory of the ring buffer. This region contains both the perf buffer data and header (struct perf_event_mmap_page), which manages the ring buffer state (head/tail positions, when accessing the head/tail position it's important to take into consideration SMP). With this type of low level access one can implement different types of consumers here are few simple examples where this API helps with: 1. perf_event_read_simple is allocating using malloc, perhaps you want to handle the wrap-around in some other way. 2. Since perf buf is per-cpu then the order of the events is not guarnteed, for example: Given 3 events where each event has a timestamp t0 < t1 < t2, and the events are spread on more than 1 CPU, then we can end up with the following state in the ring buf: CPU[0] => [t0, t2] CPU[1] => [t1] When you consume the events from CPU[0], you could know there is a t1 missing, (assuming there are no drops, and your event data contains a sequential index). So now one can simply do the following, for CPU[0], you can store the address of t0 and t2 in an array (without moving the tail, so there data is not perished) then move on the CPU[1] and set the address of t1 in the same array. So you end up with something like: void **arr[] = [&t0, &t1, &t2], now you can consume it orderely and move the tails as you process in order. 3. Assuming there are multiple CPUs and we want to start draining the messages from them, then we can "pick" with which one to start with according to the remaining free space in the ring buffer. Signed-off-by: Jon Doron <jond@wiz.io> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20220715181122.149224-1-arilou@gmail.com
-
Andrii Nakryiko authored
Pu Lehui says: ==================== Currently, samples/bpf, tools/runqslower and bpf/iterators use bpftool for vmlinux.h, skeleton, and static linking only. We can use lightweight bootstrap version of bpftool to handle these, and it will be faster. v2: - make libbpf and bootstrap bpftool independent. and make it simple. v1: https://lore.kernel.org/bpf/20220712030813.865410-1-pulehui@huawei.com ==================== Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
-
Pu Lehui authored
kernel/bpf/preload/iterators use bpftool for vmlinux.h, skeleton, and static linking only. So we can use lightweight bootstrap version of bpftool to handle these, and it will be faster. Suggested-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Pu Lehui <pulehui@huawei.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20220714024612.944071-4-pulehui@huawei.com
-
Pu Lehui authored
tools/runqslower use bpftool for vmlinux.h, skeleton, and static linking only. So we can use lightweight bootstrap version of bpftool to handle these, and it will be faster. Suggested-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Pu Lehui <pulehui@huawei.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20220714024612.944071-3-pulehui@huawei.com
-
Pu Lehui authored
Currently, when cross compiling bpf samples, the host side cannot use arch-specific bpftool to generate vmlinux.h or skeleton. Since samples/bpf use bpftool for vmlinux.h, skeleton, and static linking only, we can use lightweight bootstrap version of bpftool to handle these, and it's always host-native. Suggested-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Pu Lehui <pulehui@huawei.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20220714024612.944071-2-pulehui@huawei.com
-
Ben Dooks authored
When checking with sparse, btf_show_type_value() is causing a warning about checking integer vs NULL when the macro is passed a pointer, due to the 'value != 0' check. Stop sparse complaining about any type-casting by adding a cast to the typeof(value). This fixes the following sparse warnings: kernel/bpf/btf.c:2579:17: warning: Using plain integer as NULL pointer kernel/bpf/btf.c:2581:17: warning: Using plain integer as NULL pointer kernel/bpf/btf.c:3407:17: warning: Using plain integer as NULL pointer kernel/bpf/btf.c:3758:9: warning: Using plain integer as NULL pointer Signed-off-by: Ben Dooks <ben.dooks@sifive.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/bpf/20220714100322.260467-1-ben.dooks@sifive.com
-
Alexei Starovoitov authored
The commit 7337224f ("bpf: Improve the info.func_info and info.func_info_rec_size behavior") accidently made bpf_prog_ksym_set_name() conservative for bpf subprograms. Fixed it so instead of "bpf_prog_tag_F" the stack traces print "bpf_prog_tag_full_subprog_name". Fixes: 7337224f ("bpf: Improve the info.func_info and info.func_info_rec_size behavior") Reported-by: Tejun Heo <tj@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Martin KaFai Lau <kafai@fb.com> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/bpf/20220714211637.17150-1-alexei.starovoitov@gmail.com
-
Jiri Olsa authored
Alexei reported crash by running test_progs -j on system with 32 cpus. It turned out the kprobe_multi bench test that attaches all ftrace-able functions will race with bpf_dispatcher_update, that calls bpf_arch_text_poke on bpf_dispatcher_xdp_func, which is ftrace-able function. Ftrace is not aware of this update so this will cause ftrace_bug with: WARNING: CPU: 6 PID: 1985 at arch/x86/kernel/ftrace.c:94 ftrace_verify_code+0x27/0x50 ... ftrace_replace_code+0xa3/0x170 ftrace_modify_all_code+0xbd/0x150 ftrace_startup_enable+0x3f/0x50 ftrace_startup+0x98/0xf0 register_ftrace_function+0x20/0x60 register_fprobe_ips+0xbb/0xd0 bpf_kprobe_multi_link_attach+0x179/0x430 __sys_bpf+0x18a1/0x2440 ... ------------[ ftrace bug ]------------ ftrace failed to modify [<ffffffff818d9380>] bpf_dispatcher_xdp_func+0x0/0x10 actual: ffffffe9:7b:ffffff9c:77:1e Setting ftrace call site to call ftrace function It looks like we need some way to hide some functions from ftrace, but meanwhile we workaround this by skipping bpf_dispatcher_xdp_func from kprobe_multi bench test. Reported-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/bpf/20220714082316.479181-1-jolsa@kernel.org
-
- 14 Jul, 2022 7 commits
-
-
Ben Dooks authored
A couple of the syscalls which load values (bpf_skb_load_helper_16() and bpf_skb_load_helper_32()) are using u16/u32 types which are triggering warnings as they are then converted from big-endian to CPU-endian. Fix these by making the types __be instead. Fixes the following sparse warnings: net/core/filter.c:246:32: warning: cast to restricted __be16 net/core/filter.c:246:32: warning: cast to restricted __be16 net/core/filter.c:246:32: warning: cast to restricted __be16 net/core/filter.c:246:32: warning: cast to restricted __be16 net/core/filter.c:273:32: warning: cast to restricted __be32 net/core/filter.c:273:32: warning: cast to restricted __be32 net/core/filter.c:273:32: warning: cast to restricted __be32 net/core/filter.c:273:32: warning: cast to restricted __be32 net/core/filter.c:273:32: warning: cast to restricted __be32 net/core/filter.c:273:32: warning: cast to restricted __be32 Signed-off-by: Ben Dooks <ben.dooks@sifive.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20220714105101.297304-1-ben.dooks@sifive.com
-
Yafang Shao authored
BPF_PROG_TYPE_RAW_TRACEPOINT_WRITABLE is also tracing type, which may cause unexpected memory allocation if we set BPF_F_NO_PREALLOC. Let's also warn on it similar as we do in case of BPF_PROG_TYPE_RAW_TRACEPOINT. Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/bpf/20220713160936.57488-1-laoar.shao@gmail.com
-
Maciej Fijalkowski authored
When application runs in busy poll mode and does not receive a single packet but only sends them, it is currently impossible to get into napi_busy_loop() as napi_id is only marked on Rx side in xsk_rcv_check(). In there, napi_id is being taken from xdp_rxq_info carried by xdp_buff. From Tx perspective, we do not have access to it. What we have handy is the xsk pool. Xsk pool works on a pool of internal xdp_buff wrappers called xdp_buff_xsk. AF_XDP ZC enabled drivers call xp_set_rxq_info() so each of xdp_buff_xsk has a valid pointer to xdp_rxq_info of underlying queue. Therefore, on Tx side, napi_id can be pulled from xs->pool->heads[0].xdp.rxq->napi_id. Hide this pointer chase under helper function, xsk_pool_get_napi_id(). Do this only for sockets working in ZC mode as otherwise rxq pointers would not be initialized. Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Magnus Karlsson <magnus.karlsson@intel.com> Link: https://lore.kernel.org/bpf/20220707130842.49408-1-maciej.fijalkowski@intel.com
-
Nathan Chancellor authored
When building with clang + CONFIG_CFI_CLANG=y, the following error occurs at link time: ld.lld: error: undefined symbol: dummy_tramp dummy_tramp is declared globally in C but its definition in inline assembly does not use .global, which prevents clang from properly resolving the references to it when creating the CFI jump tables. Mark dummy_tramp as global so that the reference can be properly resolved. Fixes: b2ad54e1 ("bpf, arm64: Implement bpf_arch_text_poke() for arm64") Suggested-by: Sami Tolvanen <samitolvanen@google.com> Signed-off-by: Nathan Chancellor <nathan@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Stanislav Fomichev <sdf@google.com> Link: https://github.com/ClangBuiltLinux/linux/issues/1661 Link: https://lore.kernel.org/bpf/20220713173503.3889486-1-nathan@kernel.org
-
Linkui Xiao authored
Return boolean values ("true" or "false") instead of 1 or 0 from bool functions. This fixes the following warnings from coccicheck: tools/testing/selftests/bpf/progs/test_xdp_noinline.c:407:9-10: WARNING: return of 0/1 in function 'decap_v4' with return type bool tools/testing/selftests/bpf/progs/test_xdp_noinline.c:389:9-10: WARNING: return of 0/1 in function 'decap_v6' with return type bool tools/testing/selftests/bpf/progs/test_xdp_noinline.c:290:9-10: WARNING: return of 0/1 in function 'encap_v6' with return type bool tools/testing/selftests/bpf/progs/test_xdp_noinline.c:264:9-10: WARNING: return of 0/1 in function 'parse_tcp' with return type bool tools/testing/selftests/bpf/progs/test_xdp_noinline.c:242:9-10: WARNING: return of 0/1 in function 'parse_udp' with return type bool Generated by: scripts/coccinelle/misc/boolreturn.cocci Suggested-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Linkui Xiao <xiaolinkui@kylinos.cn> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Reviewed-by: Stanislav Fomichev <sdf@google.com> Link: https://lore.kernel.org/bpf/20220714015647.25074-1-xiaolinkui@kylinos.cn
-
Anquan Wu authored
BPF map name is limited to BPF_OBJ_NAME_LEN. A map name is defined as being longer than BPF_OBJ_NAME_LEN, it will be truncated to BPF_OBJ_NAME_LEN when a userspace program calls libbpf to create the map. A pinned map also generates a path in the /sys. If the previous program wanted to reuse the map, it can not get bpf_map by name, because the name of the map is only partially the same as the name which get from pinned path. The syscall information below show that map name "process_pinned_map" is truncated to "process_pinned_". bpf(BPF_OBJ_GET, {pathname="/sys/fs/bpf/process_pinned_map", bpf_fd=0, file_flags=0}, 144) = -1 ENOENT (No such file or directory) bpf(BPF_MAP_CREATE, {map_type=BPF_MAP_TYPE_HASH, key_size=4, value_size=4,max_entries=1024, map_flags=0, inner_map_fd=0, map_name="process_pinned_",map_ifindex=0, btf_fd=3, btf_key_type_id=6, btf_value_type_id=10,btf_vmlinux_value_type_id=0}, 72) = 4 This patch check that if the name of pinned map are the same as the actual name for the first (BPF_OBJ_NAME_LEN - 1), bpf map still uses the name which is included in bpf object. Fixes: 26736eb9 ("tools: libbpf: allow map reuse") Signed-off-by: Anquan Wu <leiqi96@hotmail.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/OSZP286MB1725CEA1C95C5CB8E7CCC53FB8869@OSZP286MB1725.JPNP286.PROD.OUTLOOK.COM
-
Linkui Xiao authored
The ARRAY_SIZE macro is more compact and more formal in linux source. Signed-off-by: Linkui Xiao <xiaolinkui@kylinos.cn> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/bpf/20220712072302.13761-1-xiaolinkui@kylinos.cn
-
- 13 Jul, 2022 4 commits
-
-
Joanne Koong authored
This patch does two things: 1. For matching against the arg type, the match should be against the base type of the arg type, since the arg type can have different bpf_type_flags set on it. 2. Uses switch casing to improve readability + efficiency. Signed-off-by: Joanne Koong <joannelkoong@gmail.com> Acked-by: Hao Luo <haoluo@google.com> Link: https://lore.kernel.org/r/20220712210603.123791-1-joannelkoong@gmail.comSigned-off-by: Alexei Starovoitov <ast@kernel.org>
-
Hengqi Chen authored
binary_path is a required non-null parameter for bpf_program__attach_usdt and bpf_program__attach_uprobe_opts. Check it against NULL to prevent coredump on strchr. Signed-off-by: Hengqi Chen <hengqi.chen@gmail.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20220712025745.2703995-1-hengqi.chen@gmail.com
-
Yafang Shao authored
GFP_ATOMIC doesn't cooperate well with memcg pressure so far, especially if we allocate too much GFP_ATOMIC memory. For example, when we set the memcg limit to limit a non-preallocated bpf memory, the GFP_ATOMIC can easily break the memcg limit by force charge. So it is very dangerous to use GFP_ATOMIC in non-preallocated case. One way to make it safe is to remove __GFP_HIGH from GFP_ATOMIC, IOW, use (__GFP_ATOMIC | __GFP_KSWAPD_RECLAIM) instead, then it will be limited if we allocate too much memory. There's a plan to completely remove __GFP_ATOMIC in the mm side[1], so let's use GFP_NOWAIT instead. We introduced BPF_F_NO_PREALLOC is because full map pre-allocation is too memory expensive for some cases. That means removing __GFP_HIGH doesn't break the rule of BPF_F_NO_PREALLOC, but has the same goal with it-avoiding issues caused by too much memory. So let's remove it. This fix can also apply to other run-time allocations, for example, the allocation in lpm trie, local storage and devmap. So let fix it consistently over the bpf code It also fixes a typo in the comment. [1]. https://lore.kernel.org/linux-mm/163712397076.13692.4727608274002939094@noble.neil.brown.name/ Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeelb@google.com> Cc: NeilBrown <neilb@suse.de> Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Link: https://lore.kernel.org/r/20220709154457.57379-2-laoar.shao@gmail.comSigned-off-by: Alexei Starovoitov <ast@kernel.org>
-
Song Liu authored
syzbot reported a few issues with bpf_prog_pack [1], [2]. This only happens with multiple subprogs. In jit_subprogs(), we first call bpf_int_jit_compile() on each sub program. And then, we call it on each sub program again. jit_data is not freed in the first call of bpf_int_jit_compile(). Similarly we don't call bpf_jit_binary_pack_finalize() in the first call of bpf_int_jit_compile(). If bpf_int_jit_compile() failed for one sub program, we will call bpf_jit_binary_pack_finalize() for this sub program. However, we don't have a chance to call it for other sub programs. Then we will hit "goto out_free" in jit_subprogs(), and call bpf_jit_free on some subprograms that haven't got bpf_jit_binary_pack_finalize() yet. At this point, bpf_jit_binary_pack_free() is called and the whole 2MB page is freed erroneously. Fix this with a custom bpf_jit_free() for x86_64, which calls bpf_jit_binary_pack_finalize() if necessary. Also, with custom bpf_jit_free(), bpf_prog_aux->use_bpf_prog_pack is not needed any more, remove it. Fixes: 1022a549 ("bpf, x86_64: Use bpf_jit_binary_pack_alloc") [1] https://syzkaller.appspot.com/bug?extid=2f649ec6d2eea1495a8f [2] https://syzkaller.appspot.com/bug?extid=87f65c75f4a72db05445 Reported-by: syzbot+2f649ec6d2eea1495a8f@syzkaller.appspotmail.com Reported-by: syzbot+87f65c75f4a72db05445@syzkaller.appspotmail.com Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20220706002612.4013790-1-song@kernel.orgSigned-off-by: Alexei Starovoitov <ast@kernel.org>
-
- 12 Jul, 2022 6 commits
-
-
Roman Gushchin authored
The memory consumed by a bpf map is always accounted to the memory cgroup of the process which created the map. The map can outlive the memory cgroup if it's used by processes in other cgroups or is pinned on bpffs. In this case the map pins the original cgroup in the dying state. For other types of objects (slab objects, non-slab kernel allocations, percpu objects and recently LRU pages) there is a reparenting process implemented: on cgroup offlining charged objects are getting reassigned to the parent cgroup. Because all charges and statistics are fully recursive it's a fairly cheap operation. For efficiency and consistency with other types of objects, let's do the same for bpf maps. Fortunately thanks to the objcg API, the required changes are minimal. Please, note that individual allocations (slabs, percpu and large kmallocs) already have the reparenting mechanism. This commit adds it to the saved map->memcg pointer by replacing it to map->objcg. Because dying cgroups are not visible for a user and all charges are recursive, this commit doesn't bring any behavior changes for a user. v2: added a missing const qualifier Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev> Reviewed-by: Shakeel Butt <shakeelb@google.com> Link: https://lore.kernel.org/r/20220711162827.184743-1-roman.gushchin@linux.devSigned-off-by: Alexei Starovoitov <ast@kernel.org>
-
Alexei Starovoitov authored
Alan Maguire says: ==================== a ksym BPF iterator would be useful as it would allow more flexible interactions with kernel symbols than are currently supported; it could for example create more efficient map representations for lookup, speed up symbol resolution etc. The idea was initially discussed here [1]. Changes since v5 [2]: - no need to add kallsym_iter to bpf_iter.h as it has existed in kernels for a long time so will by in vmlinux.h for older kernels too, unlike struct bpf_iter__ksym (Yonghong, patch 2) Changes since v4 [3]: - add BPF_ITER_RESCHED to improve responsiveness (Hao, patch 1) - remove pr_warn to be consistent with other iterators (Andrii, patch 1) - add definitions to bpf_iter.h to ensure iter tests build on older kernels (Andrii, patch 2) Changes since v3 [4]: - use late_initcall() to register iter; means we are both consistent with other iters and can encapsulate all iter-specific code in kallsyms.c in CONFIG_BPF_SYSCALL (Alexei, Yonghong, patch 1). Changes since v2 [5]: - set iter->show_value on initialization based on current creds and use it in selftest to determine if we show values (Yonghong, patches 1/2) - inline iter registration into kallsyms_init (Yonghong, patch 1) Changes since RFC [6]: - change name of iterator (and associated structures/fields) to "ksym" (Andrii, patches 1, 2) - remove dependency on CONFIG_PROC_FS; it was used for other BPF iterators, and I assumed it was needed because of seq ops but I don't think it is required on digging futher (Andrii, patch 1) [1] https://lore.kernel.org/all/YjRPZj6Z8vuLeEZo@krava/ [2] https://lore.kernel.org/bpf/1657490998-31468-1-git-send-email-alan.maguire@oracle.com/ [3] https://lore.kernel.org/bpf/1657113391-5624-1-git-send-email-alan.maguire@oracle.com/ [4] https://lore.kernel.org/bpf/1656942916-13491-1-git-send-email-alan.maguire@oracle.com [5] https://lore.kernel.org/bpf/1656667620-18718-1-git-send-email-alan.maguire@oracle.com/ [6] https://lore.kernel.org/all/1656089118-577-1-git-send-email-alan.maguire@oracle.com/ ==================== Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Alan Maguire authored
add subtest verifying BPF ksym iter behaviour. The BPF ksym iter program shows an example of dumping a format different to /proc/kallsyms. It adds KIND and MAX_SIZE fields which represent the kind of symbol (core kernel, module, ftrace, bpf, or kprobe) and the maximum size the symbol can be. The latter is calculated from the difference between current symbol value and the next symbol value. The key benefit for this iterator will likely be supporting in-kernel data-gathering rather than dumping symbol details to userspace and parsing the results. Signed-off-by: Alan Maguire <alan.maguire@oracle.com> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/r/1657629105-7812-3-git-send-email-alan.maguire@oracle.comSigned-off-by: Alexei Starovoitov <ast@kernel.org>
-
Alan Maguire authored
add a "ksym" iterator which provides access to a "struct kallsym_iter" for each symbol. Intent is to support more flexible symbol parsing as discussed in [1]. [1] https://lore.kernel.org/all/YjRPZj6Z8vuLeEZo@krava/Suggested-by: Alexei Starovoitov <alexei.starovoitov@gmail.com> Signed-off-by: Alan Maguire <alan.maguire@oracle.com> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/r/1657629105-7812-2-git-send-email-alan.maguire@oracle.comSigned-off-by: Alexei Starovoitov <ast@kernel.org>
-
Matthieu Baerts authored
Our CI[1] reported these warnings when using Sparse: $ touch net/mptcp/bpf.c $ make C=1 net/mptcp/bpf.o net/mptcp/bpf.c: note: in included file: include/linux/bpf_verifier.h:348:26: error: dubious one-bit signed bitfield include/linux/bpf_verifier.h:349:29: error: dubious one-bit signed bitfield Set them as 'unsigned' to avoid warnings. [1] https://github.com/multipath-tcp/mptcp_net-next/actions/runs/2643588487 Fixes: 1ade2371 ("bpf: Inline calls to bpf_loop when callback is known") Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/bpf/20220711081200.2081262-1-matthieu.baerts@tessares.net
-
Jesper Dangaard Brouer authored
LLVM compiler optimized out the memcpy in xdp_redirect_map_egress, which caused the Ethernet source MAC-addr to always be zero when enabling the devmap egress prog via cmdline --load-egress. Issue observed with LLVM version 14.0.0 - Shipped with Fedora 36 on target: x86_64-redhat-linux-gnu. In verbose mode print the source MAC-addr in case xdp_devmap_attached mode is used. Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/165754826292.575614.5636444052787717159.stgit@firesoul
-
- 11 Jul, 2022 8 commits
-
-
Xu Kuohai authored
This is arm64 version of commit fec56f58 ("bpf: Introduce BPF trampoline"). A bpf trampoline converts native calling convention to bpf calling convention and is used to implement various bpf features, such as fentry, fexit, fmod_ret and struct_ops. This patch does essentially the same thing that bpf trampoline does on x86. Tested on Raspberry Pi 4B and qemu: #18 /1 bpf_tcp_ca/dctcp:OK #18 /2 bpf_tcp_ca/cubic:OK #18 /3 bpf_tcp_ca/invalid_license:OK #18 /4 bpf_tcp_ca/dctcp_fallback:OK #18 /5 bpf_tcp_ca/rel_setsockopt:OK #18 bpf_tcp_ca:OK #51 /1 dummy_st_ops/dummy_st_ops_attach:OK #51 /2 dummy_st_ops/dummy_init_ret_value:OK #51 /3 dummy_st_ops/dummy_init_ptr_arg:OK #51 /4 dummy_st_ops/dummy_multiple_args:OK #51 dummy_st_ops:OK #57 /1 fexit_bpf2bpf/target_no_callees:OK #57 /2 fexit_bpf2bpf/target_yes_callees:OK #57 /3 fexit_bpf2bpf/func_replace:OK #57 /4 fexit_bpf2bpf/func_replace_verify:OK #57 /5 fexit_bpf2bpf/func_sockmap_update:OK #57 /6 fexit_bpf2bpf/func_replace_return_code:OK #57 /7 fexit_bpf2bpf/func_map_prog_compatibility:OK #57 /8 fexit_bpf2bpf/func_replace_multi:OK #57 /9 fexit_bpf2bpf/fmod_ret_freplace:OK #57 fexit_bpf2bpf:OK #237 xdp_bpf2bpf:OK Signed-off-by: Xu Kuohai <xukuohai@huawei.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Jean-Philippe Brucker <jean-philippe@linaro.org> Acked-by: Song Liu <songliubraving@fb.com> Acked-by: KP Singh <kpsingh@kernel.org> Link: https://lore.kernel.org/bpf/20220711150823.2128542-5-xukuohai@huawei.com
-
Xu Kuohai authored
Implement bpf_arch_text_poke() for arm64, so bpf prog or bpf trampoline can be patched with it. When the target address is NULL, the original instruction is patched to a NOP. When the target address and the source address are within the branch range, the original instruction is patched to a bl instruction to the target address directly. To support attaching bpf trampoline to both regular kernel function and bpf prog, we follow the ftrace patchsite way for bpf prog. That is, two instructions are inserted at the beginning of bpf prog, the first one saves the return address to x9, and the second is a nop which will be patched to a bl instruction when a bpf trampoline is attached. However, when a bpf trampoline is attached to bpf prog, the distance between target address and source address may exceed 128MB, the maximum branch range, because bpf trampoline and bpf prog are allocated separately with vmalloc. So long jump should be handled. When a bpf prog is constructed, a plt pointing to empty trampoline dummy_tramp is placed at the end: bpf_prog: mov x9, lr nop // patchsite ... ret plt: ldr x10, target br x10 target: .quad dummy_tramp // plt target This is also the state when no trampoline is attached. When a short-jump bpf trampoline is attached, the patchsite is patched to a bl instruction to the trampoline directly: bpf_prog: mov x9, lr bl <short-jump bpf trampoline address> // patchsite ... ret plt: ldr x10, target br x10 target: .quad dummy_tramp // plt target When a long-jump bpf trampoline is attached, the plt target is filled with the trampoline address and the patchsite is patched to a bl instruction to the plt: bpf_prog: mov x9, lr bl plt // patchsite ... ret plt: ldr x10, target br x10 target: .quad <long-jump bpf trampoline address> dummy_tramp is used to prevent another CPU from jumping to an unknown location during the patching process, making the patching process easier. The patching process is as follows: 1. when neither the old address or the new address is a long jump, the patchsite is replaced with a bl to the new address, or nop if the new address is NULL; 2. when the old address is not long jump but the new one is, the branch target address is written to plt first, then the patchsite is replaced with a bl instruction to the plt; 3. when the old address is long jump but the new one is not, the address of dummy_tramp is written to plt first, then the patchsite is replaced with a bl to the new address, or a nop if the new address is NULL; 4. when both the old address and the new address are long jump, the new address is written to plt and the patchsite is not changed. Signed-off-by: Xu Kuohai <xukuohai@huawei.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com> Reviewed-by: KP Singh <kpsingh@kernel.org> Reviewed-by: Jean-Philippe Brucker <jean-philippe@linaro.org> Acked-by: Song Liu <songliubraving@fb.com> Link: https://lore.kernel.org/bpf/20220711150823.2128542-4-xukuohai@huawei.com
-
Xu Kuohai authored
Add LDR (literal) instruction to load data from address relative to PC. This instruction will be used to implement long jump from bpf prog to bpf trampoline in the follow-up patch. The instruction encoding: 3 2 2 2 0 0 0 7 6 4 5 0 +-----+-------+---+-----+-------------------------------------+--------+ | 0 x | 0 1 1 | 0 | 0 0 | imm19 | Rt | +-----+-------+---+-----+-------------------------------------+--------+ for 32-bit, variant x == 0; for 64-bit, x == 1. branch_imm_common() is used to check the distance between pc and target address, since it's reused by this patch and LDR (literal) is not a branch instruction, rename it to label_imm_common(). Signed-off-by: Xu Kuohai <xukuohai@huawei.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Jean-Philippe Brucker <jean-philippe@linaro.org> Acked-by: Will Deacon <will@kernel.org> Link: https://lore.kernel.org/bpf/20220711150823.2128542-3-xukuohai@huawei.com
-
Xu Kuohai authored
Before generating bpf trampoline, x86 calls is_valid_bpf_tramp_flags() to check the input flags. This check is architecture independent. So, to be consistent with x86, arm64 should also do this check before generating bpf trampoline. However, the BPF_TRAMP_F_XXX flags are not used by user code and the flags argument is almost constant at compile time, so this run time check is a bit redundant. Remove is_valid_bpf_tramp_flags() and add some comments to the usage of BPF_TRAMP_F_XXX flags, as suggested by Alexei. Signed-off-by: Xu Kuohai <xukuohai@huawei.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Jean-Philippe Brucker <jean-philippe@linaro.org> Acked-by: Song Liu <songliubraving@fb.com> Link: https://lore.kernel.org/bpf/20220711150823.2128542-2-xukuohai@huawei.com
-
Liu Jian authored
In sk_psock_skb_ingress_enqueue function, if the linear area + nr_frags + frag_list of the SKB has NR_MSG_FRAG_IDS blocks in total, skb_to_sgvec will return NR_MSG_FRAG_IDS, then msg->sg.end will be set to NR_MSG_FRAG_IDS, and in addition, (NR_MSG_FRAG_IDS - 1) is set to the last SG of msg. Recv the msg in sk_msg_recvmsg, when i is (NR_MSG_FRAG_IDS - 1), the sk_msg_iter_var_next(i) will change i to 0 (not NR_MSG_FRAG_IDS), the judgment condition "msg_rx->sg.start==msg_rx->sg.end" and "i != msg_rx->sg.end" can not work. As a result, the processed msg cannot be deleted from ingress_msg list. But the length of all the sge of the msg has changed to 0. Then the next recvmsg syscall will process the msg repeatedly, because the length of sge is 0, the -EFAULT error is always returned. Fixes: 604326b4 ("bpf, sockmap: convert to generic sk_msg interface") Signed-off-by: Liu Jian <liujian56@huawei.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/bpf/20220628123616.186950-1-liujian56@huawei.com
-
Jilin Yuan authored
Delete the redundant word 'test'. Signed-off-by: Jilin Yuan <yuanjilin@cdjrlc.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Jilin Yuan authored
Delete the redundant word 'driver'. Signed-off-by: Jilin Yuan <yuanjilin@cdjrlc.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
sewookseo authored
If we set XFRM security policy by calling setsockopt with option IPV6_XFRM_POLICY, the policy will be stored in 'sock_policy' in 'sock' struct. However tcp_v6_send_response doesn't look up dst_entry with the actual socket but looks up with tcp control socket. This may cause a problem that a RST packet is sent without ESP encryption & peer's TCP socket can't receive it. This patch will make the function look up dest_entry with actual socket, if the socket has XFRM policy(sock_policy), so that the TCP response packet via this function can be encrypted, & aligned on the encrypted TCP socket. Tested: We encountered this problem when a TCP socket which is encrypted in ESP transport mode encryption, receives challenge ACK at SYN_SENT state. After receiving challenge ACK, TCP needs to send RST to establish the socket at next SYN try. But the RST was not encrypted & peer TCP socket still remains on ESTABLISHED state. So we verified this with test step as below. [Test step] 1. Making a TCP state mismatch between client(IDLE) & server(ESTABLISHED). 2. Client tries a new connection on the same TCP ports(src & dst). 3. Server will return challenge ACK instead of SYN,ACK. 4. Client will send RST to server to clear the SOCKET. 5. Client will retransmit SYN to server on the same TCP ports. [Expected result] The TCP connection should be established. Cc: Maciej Żenczykowski <maze@google.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Steffen Klassert <steffen.klassert@secunet.com> Cc: Sehee Lee <seheele@google.com> Signed-off-by: Sewook Seo <sewookseo@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-