- 03 Sep, 2022 14 commits
-
-
Martin KaFai Lau authored
This patch changes bpf_getsockopt(SOL_TCP) to reuse do_tcp_getsockopt(). It removes the duplicated code from bpf_getsockopt(SOL_TCP). Before this patch, some optnames were available to bpf_setsockopt(SOL_TCP) but missing from bpf_getsockopt(SOL_TCP), for example TCP_NODELAY, TCP_MAXSEG, TCP_KEEPIDLE, TCP_KEEPINTVL, and a few more. This surprises users from time to time. This patch automatically closes that gap without duplicating more code. bpf_getsockopt(TCP_SAVED_SYN) does not free the saved_syn, so it stays in sol_tcp_sockopt(). For a string value like TCP_CONGESTION, bpf expects it to always be null terminated, so sol_tcp_sockopt() decrements optlen by one before calling do_tcp_getsockopt(), and the 'if (optlen < saved_optlen) memset(..,0,..);' in __bpf_getsockopt() will always do the null termination. Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://lore.kernel.org/r/20220902002918.2894511-1-kafai@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
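As a rough illustration of what this change enables, a cgroup sockops program can now read these TCP options directly; the sketch below is illustrative only (section name, attach point, and the hand-defined option constants are assumptions, not taken from the patch):

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

#define SOL_TCP		6	/* assumed numeric values; normally taken from */
#define TCP_NODELAY	1	/* <netinet/tcp.h> in userspace builds        */
#define TCP_CONGESTION	13

SEC("sockops")
int read_tcp_opts(struct bpf_sock_ops *skops)
{
	char cc[16] = {};	/* bpf_getsockopt() keeps it null terminated */
	int nodelay = 0;

	if (!bpf_getsockopt(skops, SOL_TCP, TCP_CONGESTION, cc, sizeof(cc)))
		bpf_printk("cc=%s", cc);
	if (!bpf_getsockopt(skops, SOL_TCP, TCP_NODELAY, &nodelay, sizeof(nodelay)))
		bpf_printk("nodelay=%d", nodelay);
	return 1;
}

char _license[] SEC("license") = "GPL";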
-
Martin KaFai Lau authored
This patch changes bpf_getsockopt(SOL_SOCKET) to reuse sk_getsockopt(). It removes all duplicated code from bpf_getsockopt(SOL_SOCKET). Before this patch, some optnames were available to bpf_setsockopt(SOL_SOCKET) but missing from bpf_getsockopt(SOL_SOCKET), for example SO_REUSEADDR, SO_KEEPALIVE, SO_RCVLOWAT, and SO_MAX_PACING_RATE. This surprises users from time to time. This patch automatically closes that gap without duplicating more code. The only exception is SO_BINDTODEVICE because it needs to acquire a blocking lock. Thus, SO_BINDTODEVICE is not supported. Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://lore.kernel.org/r/20220902002912.2894040-1-kafai@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Martin KaFai Lau authored
This patch moves the "#ifdef CONFIG_XXX" check into the "if/else" statement itself. The change is done for the bpf_getsockopt() function only. It will make the later patches easier to follow without the surrounding ifdef macro. Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://lore.kernel.org/r/20220902002906.2893572-1-kafai@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Martin KaFai Lau authored
Similar to the earlier patch that changed sk_getsockopt() to use sockopt_{lock,release}_sock() so that it can avoid taking the lock when called from bpf, this patch changes do_ipv6_getsockopt() to use sockopt_{lock,release}_sock() so that bpf_getsockopt(SOL_IPV6) can reuse do_ipv6_getsockopt(). Although bpf_getsockopt(SOL_IPV6) currently does not support any optname that requires lock_sock(), using sockopt_{lock,release}_sock() consistently across *_getsockopt() makes it harder for a future optname addition to miss the sockopt_{lock,release}_sock() usage, e.g. when a new optname that requires a lock is also needed in bpf_getsockopt(). Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://lore.kernel.org/r/20220902002859.2893064-1-kafai@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Martin KaFai Lau authored
Similar to the earlier patch that changed sk_getsockopt() to take the sockptr_t argument, this patch changes do_ipv6_getsockopt() to take the sockptr_t argument so that a later patch can make bpf_getsockopt(SOL_IPV6) reuse do_ipv6_getsockopt(). Note on the change in ip6_mc_msfget(): this function returns an array of sockaddr_storage in optval and is shared between ipv6_get_msfilter() and compat_ipv6_get_msfilter(). However, the sockaddr_storage is stored at a different offset in optval because of the difference between group_filter and compat_group_filter. Thus, a new 'ss_offset' argument is added to ip6_mc_msfget(). Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://lore.kernel.org/r/20220902002853.2892532-1-kafai@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Martin KaFai Lau authored
Pass the len to compat_ipv6_get_msfilter() instead of having compat_ipv6_get_msfilter() get it again from optlen. Its counterpart ipv6_get_msfilter() also takes the len from do_ipv6_getsockopt(). Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://lore.kernel.org/r/20220902002846.2892091-1-kafai@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Martin KaFai Lau authored
The 'unsigned int flags' argument is always 0, so it can be removed. Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://lore.kernel.org/r/20220902002840.2891763-1-kafai@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Martin KaFai Lau authored
Similar to the earlier commit that changed sk_setsockopt() to use sockopt_{lock,release}_sock() so that it can avoid taking the lock when called from bpf, this patch changes do_ip_getsockopt() to use sockopt_{lock,release}_sock() so that a later patch can make bpf_getsockopt(SOL_IP) reuse do_ip_getsockopt(). Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://lore.kernel.org/r/20220902002834.2891514-1-kafai@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Martin KaFai Lau authored
Similar to the earlier patch that changed sk_getsockopt() to take the sockptr_t argument, this patch changes do_ip_getsockopt() to take the sockptr_t argument so that a later patch can make bpf_getsockopt(SOL_IP) reuse do_ip_getsockopt(). Note on the change in ip_mc_gsfget(): this function returns an array of sockaddr_storage in optval and is shared between ip_get_mcast_msfilter() and compat_ip_get_mcast_msfilter(). However, the sockaddr_storage is stored at a different offset in optval because of the difference between group_filter and compat_group_filter. Thus, a new 'ss_offset' argument is added to ip_mc_gsfget(). Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://lore.kernel.org/r/20220902002828.2890585-1-kafai@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Martin KaFai Lau authored
Similar to the earlier commit that changed sk_setsockopt() to use sockopt_{lock,release}_sock() so that it can avoid taking the lock when called from bpf, this patch changes do_tcp_getsockopt() to use sockopt_{lock,release}_sock() so that a later patch can make bpf_getsockopt(SOL_TCP) reuse do_tcp_getsockopt(). Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://lore.kernel.org/r/20220902002821.2889765-1-kafai@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Martin KaFai Lau authored
Similar to the earlier patch that changed sk_getsockopt() to take the sockptr_t argument, this patch changes do_tcp_getsockopt() to take the sockptr_t argument so that a later patch can make bpf_getsockopt(SOL_TCP) reuse do_tcp_getsockopt(). Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://lore.kernel.org/r/20220902002815.2889332-1-kafai@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Martin KaFai Lau authored
Similar to the earlier commit that changed sk_setsockopt() to use sockopt_{lock,release}_sock() so that it can avoid taking the lock when called from bpf, this patch changes sk_getsockopt() to use sockopt_{lock,release}_sock() so that a later patch can make bpf_getsockopt(SOL_SOCKET) reuse sk_getsockopt(). Only sk_get_filter() requires this change; it is used by the optname SO_GET_FILTER. The '.getname' implementations in sock->ops->getname() are also not changed, since bpf does not always have the sk->sk_socket pointer and cannot support SO_PEERNAME. Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://lore.kernel.org/r/20220902002809.2888981-1-kafai@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Martin KaFai Lau authored
This patch changes sk_getsockopt() to take the sockptr_t argument so that it can be used by bpf_getsockopt(SOL_SOCKET) in a later patch. security_socket_getpeersec_stream() is not changed. It stays with the __user ptr (optval.user and optlen.user) to avoid changes to other security hooks. bpf_getsockopt(SOL_SOCKET) also does not support SO_PEERSEC. Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://lore.kernel.org/r/20220902002802.2888419-1-kafai@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Martin KaFai Lau authored
A later patch refactors bpf_getsockopt(SOL_SOCKET) to use sock_getsockopt() in order to avoid code duplication and code drift between the two copies. The current sock_getsockopt() takes a sock ptr as the argument, and the very first thing this function does is get back the sk ptr with 'sk = sock->sk'. bpf_getsockopt() can be called when the sk does not have the sock ptr created yet, meaning sk->sk_socket is NULL; for example, when a passive tcp connection has just been established but has not yet been accept()-ed. Thus, it cannot use sock_getsockopt(sk->sk_socket) or it would pass a NULL ptr. This patch moves the whole sock_getsockopt() implementation into the newly added sk_getsockopt(). The new sk_getsockopt() takes a sk ptr and immediately gets the sock ptr with 'sock = sk->sk_socket'. The existing sock_getsockopt(sock) is changed to call sk_getsockopt(sock->sk). All existing callers have both the sock->sk and sk->sk_socket pointers. The later patch will make bpf_getsockopt(SOL_SOCKET) call sk_getsockopt(sk) directly. bpf_getsockopt(SOL_SOCKET) does not use the optnames that require sk->sk_socket, so it will be safe. Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://lore.kernel.org/r/20220902002756.2887884-1-kafai@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
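The shape of the refactor described above is roughly the following (a simplified sketch, not the exact kernel diff; the large optname switch body is elided):

/* All of the existing sock_getsockopt() logic moves into sk_getsockopt(),
 * which takes the sk ptr directly so bpf can call it even when
 * sk->sk_socket is NULL. */
static int sk_getsockopt(struct sock *sk, int level, int optname,
			 char __user *optval, int __user *optlen)
{
	struct socket *sock = sk->sk_socket;	/* only used by the few optnames
						 * that need it; may be NULL for
						 * bpf callers */

	/* ... former sock_getsockopt() body, now keyed off sk ... */
	return 0;
}

/* The old entry point becomes a thin wrapper; all existing callers have a
 * valid sock->sk, so the deref is safe here. */
int sock_getsockopt(struct socket *sock, int level, int optname,
		    char __user *optval, int __user *optlen)
{
	return sk_getsockopt(sock->sk, level, optname, optval, optlen);
}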
-
- 02 Sep, 2022 11 commits
-
-
Ian Rogers authored
The put lowers the reference count to 0 and frees ctx; reading it afterwards is invalid. Move the put after the uses and determine the last use by the reference count being 1. Fixes: 39e940d4 ("selftests/xsk: Destroy BPF resources only when ctx refcount drops to 0") Signed-off-by: Ian Rogers <irogers@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Magnus Karlsson <magnus.karlsson@intel.com> Link: https://lore.kernel.org/bpf/20220901202645.1463552-1-irogers@google.com
-
Daniel Müller authored
BPF object files are, in a way, the final artifact produced as part of the ahead-of-time compilation process. That makes them somewhat special compared to "regular" object files, which are a intermediate build artifacts that can typically be removed safely. As such, it can make sense to name them differently to make it easier to spot this difference at a glance. Among others, libbpf-bootstrap [0] has established the extension .bpf.o for BPF object files. It seems reasonable to follow this example and establish the same denomination for selftest build artifacts. To that end, this change adjusts the corresponding part of the build system and the test programs loading BPF object files to work with .bpf.o files. [0] https://github.com/libbpf/libbpf-bootstrapSuggested-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Daniel Müller <deso@posteo.net> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20220901222253.1199242-1-deso@posteo.net
-
Maciej Fijalkowski authored
Introduce a new mode to xdpxceiver responsible for testing AF_XDP zero-copy support of the driver that serves the underlying physical device. When setting up the test suite, determine whether the driver has ZC support by trying to bind an XSK ZC socket to the interface. If that succeeds, interpret it as ZC support being in place and do the softirq and busy poll tests for zero-copy mode. Note that the Rx dropped tests are skipped since the ZC path does not touch the rx_dropped stat at all. Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Magnus Karlsson <magnus.karlsson@intel.com> Link: https://lore.kernel.org/bpf/20220901114813.16275-7-maciej.fijalkowski@intel.com
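The zero-copy probe described above can be pictured roughly as follows (a hedged sketch using the AF_XDP helpers the selftests already ship; the function name, queue id, and error handling are illustrative):

#include <linux/if_xdp.h>
#include "xsk.h"	/* selftests' copy of the AF_XDP socket helpers */

/* Try to bind an XSK socket with the zero-copy flag; failure is taken to
 * mean the driver has no ZC support, so the ZC tests are skipped. */
static bool probe_zc_support(const char *ifname, struct xsk_umem *umem,
			     struct xsk_ring_cons *rx, struct xsk_ring_prod *tx)
{
	struct xsk_socket_config cfg = {
		.rx_size = XSK_RING_CONS__DEFAULT_NUM_DESCS,
		.tx_size = XSK_RING_PROD__DEFAULT_NUM_DESCS,
		.bind_flags = XDP_ZEROCOPY,
	};
	struct xsk_socket *xsk = NULL;

	if (xsk_socket__create(&xsk, ifname, 0 /* queue */, umem, rx, tx, &cfg))
		return false;	/* bind failed -> no zero-copy support */

	xsk_socket__delete(xsk);
	return true;
}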
-
Maciej Fijalkowski authored
For single-threaded poll tests, call pthread_kill() from the main thread so that we are sure the worker thread has finished its job and it is possible to proceed with the next test types from the test suite. It was observed that on some platforms it takes a bit longer for the worker thread to exit, and in that case the next test case sees the device as busy. Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Magnus Karlsson <magnus.karlsson@intel.com> Link: https://lore.kernel.org/bpf/20220901114813.16275-6-maciej.fijalkowski@intel.com
-
Maciej Fijalkowski authored
Currently, the architecture of xdpxceiver is designed strictly for conducting veth-based tests. A veth pair is created together with a network namespace, and one of the veth interfaces is moved to the mentioned netns. Then, separate threads for Tx and Rx are spawned which utilize the described setup. The infrastructure described in the paragraph above cannot be used for testing AF_XDP support on physical devices. That testing will be conducted on a single network interface and the same queue. xdpxceiver needs to be extended to distinguish between veth tests and physical interface tests. Since the same iface/queue id pair will be used by both Tx/Rx threads for physical device testing, the Tx thread, which happens to run after the Rx thread, is going to create the XSK socket with the shared umem flag. In order to track this setting throughout the lifetime of the spawned threads, introduce a 'shared_umem' boolean variable to struct ifobject and set it to true when xdpxceiver is run against a physical device. In such a case, the UMEM size needs to be doubled, so half of it will be used by the Rx thread and the other half by the Tx thread. For two-step based test types, the value of the XSKMAP element under key 0 has to be updated as there is now another socket for the second step. Also, to avoid race conditions when destroying XSK resources, move this activity to the main thread after the spawned Rx and Tx threads have finished their job. This way it is possible to gracefully remove the shared umem without introducing synchronization mechanisms. To run the xsk selftests suite on a physical device, append "-i $IFACE" when invoking test_xsk.sh. For veth-based tests, simply skip it. When "-i $IFACE" is in place, under the hood test_xsk.sh will use $IFACE for both interfaces supplied to xdpxceiver, which in turn will interpret that this execution of the test suite is for a physical device. Note that currently this makes it possible only to test SKB and DRV mode (in case the underlying device has native XDP support). ZC testing support is added in a later patch. Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Magnus Karlsson <magnus.karlsson@intel.com> Link: https://lore.kernel.org/bpf/20220901114813.16275-5-maciej.fijalkowski@intel.com
-
Maciej Fijalkowski authored
So that a name such as "enp240s0f0" can be used with xskxceiver. While at it, also extend the character count for the netns name. Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Magnus Karlsson <magnus.karlsson@intel.com> Link: https://lore.kernel.org/bpf/20220901114813.16275-4-maciej.fijalkowski@intel.com
-
Maciej Fijalkowski authored
In order to prepare xdpxceiver for physical device testing, let us introduce a default Rx pkt stream. The reason for doing this is that physical device testing will use a UMEM with a doubled size, where half of it will be used by Tx and the other half by Rx. This means that pkt addresses will differ for the Tx and Rx streams. The Rx thread will initialize the xsk_umem_info::base_addr that is added here so that pkt_set(), when working on the Rx UMEM, will add this offset and the second half of the UMEM space will be used. Note that currently base_addr is 0 on both sides. A future commit will do the mentioned initialization. Previously, veth-based testing worked on separate UMEMs, so a single default stream was fine. Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Magnus Karlsson <magnus.karlsson@intel.com> Link: https://lore.kernel.org/bpf/20220901114813.16275-3-maciej.fijalkowski@intel.com
-
Maciej Fijalkowski authored
Currently, xdpxceiver assumes that the underlying device supports XDP in native mode, which is fine for now since tests can run only on a veth pair. A future commit is going to allow running the test suite against physical devices, so let us query whether the device is capable of running XDP programs in native mode. This way xdpxceiver will not try to run TEST_MODE_DRV if the device being tested does not support it. Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Magnus Karlsson <magnus.karlsson@intel.com> Link: https://lore.kernel.org/bpf/20220901114813.16275-2-maciej.fijalkowski@intel.com
-
Shmulik Ladkani authored
Get the tunnel flags in {ipv6}vxlan_get_tunnel_src and ensure they are aligned with tunnel params set at {ipv6}vxlan_set_tunnel_dst. Signed-off-by: Shmulik Ladkani <shmulik.ladkani@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20220831144010.174110-2-shmulik.ladkani@gmail.com
-
Shmulik Ladkani authored
Existing 'bpf_skb_get_tunnel_key' extracts various tunnel parameters (id, ttl, tos, local and remote) but does not expose ip_tunnel_info's tun_flags to the BPF program. It makes sense to expose tun_flags to the BPF program. Assume for example multiple GRE tunnels maintained on a single GRE interface in collect_md mode. The program expects origins to initiate over GRE, however different origins use different GRE characteristics (e.g. some prefer to use GRE checksum, some do not; some pass a GRE key, some do not, etc..). A BPF program getting tun_flags can therefore remember the relevant flags (e.g. TUNNEL_CSUM, TUNNEL_SEQ...) for each initiating remote. In the reply path, the program can use 'bpf_skb_set_tunnel_key' in order to correctly reply to the remote, using similar characteristics, based on the stored tunnel flags. Introduce BPF_F_TUNINFO_FLAGS flag for bpf_skb_get_tunnel_key. If specified, 'bpf_tunnel_key->tunnel_flags' is set with the tun_flags. Decided to use the existing unused 'tunnel_ext' as the storage for the 'tunnel_flags' in order to avoid changing bpf_tunnel_key's layout. Also, the following has been considered during the design: 1. Convert the "interesting" internal TUNNEL_xxx flags back to BPF_F_yyy and place into the new 'tunnel_flags' field. This has 2 drawbacks: - The BPF_F_yyy flags are from *set_tunnel_key* enumeration space, e.g. BPF_F_ZERO_CSUM_TX. It is awkward that it is "returned" into tunnel_flags from a *get_tunnel_key* call. - Not all "interesting" TUNNEL_xxx flags can be mapped to existing BPF_F_yyy flags, and it doesn't make sense to create new BPF_F_yyy flags just for purposes of the returned tunnel_flags. 2. Place key.tun_flags into 'tunnel_flags' but mask them, keeping only "interesting" flags. That's ok, but the drawback is that what's "interesting" for my usecase might be limiting for other usecases. Therefore I decided to expose what's in key.tun_flags *as is*, which seems most flexible. The BPF user can just choose to ignore bits he's not interested in. The TUNNEL_xxx are also UAPI, so no harm exposing them back in the get_tunnel_key call. Signed-off-by: Shmulik Ladkani <shmulik.ladkani@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20220831144010.174110-1-shmulik.ladkani@gmail.com
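A hedged sketch of how a tc BPF program might use the new flag; the tunnel_flags field placement and TUNNEL_CSUM come from the UAPI described above, but treat the exact includes and section name as assumptions:

#include <linux/bpf.h>
#include <linux/if_tunnel.h>	/* TUNNEL_CSUM, TUNNEL_SEQ, ... (UAPI) */
#include <linux/pkt_cls.h>
#include <bpf/bpf_helpers.h>

SEC("tc")
int remember_gre_flags(struct __sk_buff *skb)
{
	struct bpf_tunnel_key key = {};

	/* With BPF_F_TUNINFO_FLAGS, key.tunnel_flags carries the raw
	 * TUNNEL_xxx bits (stored in the former tunnel_ext slot). */
	if (bpf_skb_get_tunnel_key(skb, &key, sizeof(key), BPF_F_TUNINFO_FLAGS))
		return TC_ACT_SHOT;

	if (key.tunnel_flags & TUNNEL_CSUM)
		bpf_printk("remote uses GRE checksum");	/* remember for replies */

	return TC_ACT_OK;
}

char _license[] SEC("license") = "GPL";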
-
Shung-Hsi Yu authored
Commit a657182a ("bpf: Don't use tnum_range on array range checking for poke descriptors") has shown that using tnum_range() as an argument to tnum_in() can lead to misleading code that looks like a tight bound check when in fact the actual allowed range is much wider. Document such behavior to warn against its usage in general, and suggest some scenarios where the result can be trusted. Signed-off-by: Shung-Hsi Yu <shung-hsi.yu@suse.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/984b37f9fdf7ac36831d2137415a4a915744c1b6.1661462653.git.daniel@iogearbox.net Link: https://www.openwall.com/lists/oss-security/2022/08/26/1 Link: https://lore.kernel.org/bpf/20220831031907.16133-3-shung-hsi.yu@suse.com Link: https://lore.kernel.org/bpf/20220831031907.16133-2-shung-hsi.yu@suse.com
-
- 01 Sep, 2022 7 commits
-
-
Hou Tao authored
When CONFIG_SECURITY_NETWORK is disabled, there will be build warnings from resolve_btfids: WARN: resolve_btfids: unresolved symbol bpf_lsm_socket_socketpair ...... WARN: resolve_btfids: unresolved symbol bpf_lsm_inet_conn_established Fix it by wrapping these BTF ID definitions in CONFIG_SECURITY_NETWORK. Fixes: 69fd337a ("bpf: per-cgroup lsm flavor") Fixes: 9113d7e4 ("bpf: expose bpf_{g,s}etsockopt to lsm cgroup") Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20220901065126.3856297-1-houtao@huaweicloud.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
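The fix has roughly this shape (a sketch of the pattern only; the surrounding BTF ID set definitions in bpf_lsm.c are not reproduced here):

/* Only emit BTF IDs for LSM hooks that exist in this configuration, so
 * resolve_btfids does not warn about unresolved symbols. */
#ifdef CONFIG_SECURITY_NETWORK
BTF_ID(func, bpf_lsm_socket_socketpair)
/* ... the other network LSM hooks ... */
BTF_ID(func, bpf_lsm_inet_conn_established)
#endif /* CONFIG_SECURITY_NETWORK */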
-
Jiapeng Chong authored
The assignment in the else and else if branches is the same, so the else if here is redundant; remove it and add a comment to keep the code readable. ./kernel/bpf/cgroup_iter.c:81:6-8: WARNING: possible condition with no effect (if == else). Link: https://bugzilla.openanolis.cn/show_bug.cgi?id=2016 Reported-by: Abaci Robot <abaci@linux.alibaba.com> Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com> Link: https://lore.kernel.org/r/20220831021618.86770-1-jiapeng.chong@linux.alibaba.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
-
Martin KaFai Lau authored
Hou Tao says: ==================== From: Hou Tao <houtao1@huawei.com> Hi, The patchset aims to make the update of per-cpu prog->active and per-cpu bpf_task_storage_busy preemption-safe. The problem is that on some architectures (e.g. arm64), __this_cpu_{inc|dec|inc_return} are neither preemption-safe nor IRQ-safe, so under a fully preemptible kernel the concurrent updates on these per-cpu variables may be interleaved and the final values of these variables may not be zero. Patch 1 & 2 use the preemption-safe per-cpu helpers to manipulate prog->active and bpf_task_storage_busy. Patch 3 & 4 add a test case in map_tests to show that the concurrent updates on the per-cpu bpf_task_storage_busy by using __this_cpu_{inc|dec} are not atomic. Comments are always welcome. Regards, Tao Change Log: v2: * Patch 1: update commit message to indicate the problem is only possible for fully preemptible kernel * Patch 2: a new patch which fixes the problem for prog->active * Patch 3 & 4: move it to test_maps and make it depend on CONFIG_PREEMPT v1: https://lore.kernel.org/bpf/20220829142752.330094-1-houtao@huaweicloud.com/ ==================== Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
-
Hou Tao authored
Under a fully preemptible kernel, task local storage lookup operations on the same CPU may update the per-cpu bpf_task_storage_busy concurrently. If the update of bpf_task_storage_busy is not preemption safe, the final value of bpf_task_storage_busy may become non-zero forever and bpf_task_storage_trylock() will always fail. So add a test case to ensure the update of bpf_task_storage_busy is preemption safe. The test case is skipped when CONFIG_PREEMPT is disabled, and it can only reproduce the problem probabilistically. By increasing TASK_STORAGE_MAP_NR_LOOP and running it under an ARM64 VM with 4 CPUs, it takes about four rounds to reproduce: > test_maps is modified to only run test_task_storage_map_stress_lookup() $ export TASK_STORAGE_MAP_NR_THREAD=256 $ export TASK_STORAGE_MAP_NR_LOOP=81920 $ export TASK_STORAGE_MAP_PIN_CPU=1 $ time ./test_maps test_task_storage_map_stress_lookup(135):FAIL:bad bpf_task_storage_busy got -2 real 0m24.743s user 0m6.772s sys 0m17.966s Signed-off-by: Hou Tao <houtao1@huawei.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/r/20220901061938.3789460-5-houtao@huaweicloud.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
-
Hou Tao authored
sys_pidfd_open() is defined in both test_bprm_opts.c and test_local_storage.c, so move it to a common header file. It will be used in map_tests as well. Signed-off-by: Hou Tao <houtao1@huawei.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/r/20220901061938.3789460-4-houtao@huaweicloud.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
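The helper being consolidated is the usual raw-syscall wrapper; a minimal sketch of what such a shared header provides (the fallback syscall number and header layout are assumptions):

#include <unistd.h>
#include <sys/syscall.h>
#include <sys/types.h>

#ifndef __NR_pidfd_open
#define __NR_pidfd_open 434	/* assumed: the unified number used by most archs */
#endif

/* glibc does not provide a pidfd_open() wrapper everywhere, so invoke the
 * syscall directly. */
static inline int sys_pidfd_open(pid_t pid, unsigned int flags)
{
	return syscall(__NR_pidfd_open, pid, flags);
}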
-
Hou Tao authored
Both __this_cpu_inc_return() and __this_cpu_dec() are not preemption safe, and migrate_disable() no longer disables preemption, so the update of prog->active is not atomic and in theory, under a fully preemptible kernel, recursion prevention may not work. Fix this by using the preemption-safe and IRQ-safe variants. Fixes: ca06f55b ("bpf: Add per-program recursion prevention mechanism") Signed-off-by: Hou Tao <houtao1@huawei.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/r/20220901061938.3789460-3-houtao@huaweicloud.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
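The change amounts to swapping the per-cpu primitives in the recursion guard; roughly (a sketch based on the description above, with the surrounding trampoline enter/exit code elided and helper names assumed):

/* __this_cpu_inc_return() is a plain read-modify-write that can be
 * interleaved when the task is preempted; this_cpu_inc_return() and
 * this_cpu_dec() are preemption- and IRQ-safe. */
if (unlikely(this_cpu_inc_return(*(prog->active)) != 1)) {
	/* recursion on this CPU: count a miss and skip the program */
	bpf_prog_inc_misses_counter(prog);
	return 0;
}
/* ... run the bpf program ... */
this_cpu_dec(*(prog->active));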
-
Hou Tao authored
Now migrate_disable() does not disable preemption, and on some architectures (e.g. arm64) __this_cpu_{inc|dec|inc_return} are neither preemption-safe nor IRQ-safe, so on a fully preemptible kernel, concurrent lookups or updates on the same task local storage and on the same CPU may leave bpf_task_storage_busy imbalanced, and bpf_task_storage_trylock() on that CPU will always fail. Fix it by using this_cpu_{inc|dec|inc_return} when manipulating bpf_task_storage_busy. Fixes: bc235cdb ("bpf: Prevent deadlock from recursive bpf_task_storage_[get|delete]") Signed-off-by: Hou Tao <houtao1@huawei.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/r/20220901061938.3789460-2-houtao@huaweicloud.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
-
- 31 Aug, 2022 7 commits
-
-
Martin KaFai Lau authored
Hou Tao says: ==================== From: Hou Tao <houtao1@huawei.com> Hi, The patchset aims to fix the issues found while investigating the syzkaller problem reported in [0]. It seems that concurrent updates to the same hash-table bucket may fail, as shown in patch 1. Patch 1 uses preempt_disable() to fix the problem for the htab_use_raw_lock() case. For the !htab_use_raw_lock() case, the problem is left to the "BPF specific memory allocator" patchset [1], in which !htab_use_raw_lock() will be removed. Patch 2 fixes the out-of-bounds memory read problem reported in [0]. The problem has the same root cause as patch 1 and is fixed by handling -EBUSY from htab_lock_bucket() correctly. Patch 3 adds two cases for hash-table update: one for the reentrancy of bpf_map_update_elem(), and another for concurrent updates of the same hash-table bucket. Comments are always welcome. Regards, Tao [0]: https://lore.kernel.org/bpf/CACkBjsbuxaR6cv0kXJoVnBfL9ZJXjjoUcMpw_Ogc313jSrg14A@mail.gmail.com/ [1]: https://lore.kernel.org/bpf/20220819214232.18784-1-alexei.starovoitov@gmail.com/ Change Log: v4: * rebased on bpf-next * add htab_update to DENYLIST.s390x v3: https://lore.kernel.org/bpf/20220829023709.1958204-1-houtao@huaweicloud.com/ * patch 1: update commit message and add Fixes tag * patch 2: add Fixes tag * patch 3: elaborate the description of test cases v2: https://lore.kernel.org/bpf/bd60ef93-1c6a-2db2-557d-b09b92ad22bd@huaweicloud.com/ * Note the fix is for CONFIG_PREEMPT case in commit message and add Reviewed-by tag for patch 1 * Drop patch "bpf: Allow normally concurrent map updates for !htab_use_raw_lock() case" v1: https://lore.kernel.org/bpf/20220821033223.2598791-1-houtao@huaweicloud.com/ ==================== Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
-
Hou Tao authored
One test demonstrates that a reentrant hash map update on the same bucket should fail, and another one shows that concurrent updates of the same hash map bucket should succeed and not fail due to the reentrancy check for the bucket lock. There is no trampoline support on s390x, so move htab_update to the denylist. Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20220831042629.130006-4-houtao@huaweicloud.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
-
Hou Tao authored
In __htab_map_lookup_and_delete_batch(), if htab_lock_bucket() returns -EBUSY, it will go to the next bucket. Going to the next bucket may not only skip the elements in the current bucket silently, but also incur an out-of-bounds memory access or expose kernel memory to userspace if the current bucket_cnt is greater than bucket_size or zero. Fix it by stopping the batch operation and returning -EBUSY when htab_lock_bucket() fails, so the application can retry or skip the busy batch as needed. Fixes: 20b6cc34 ("bpf: Avoid hashtab deadlock with map_locked") Reported-by: Hao Sun <sunhao.th@gmail.com> Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20220831042629.130006-3-houtao@huaweicloud.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
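The control-flow change is roughly the following (a sketch; the label and the surrounding loop structure of __htab_map_lookup_and_delete_batch() are simplified):

ret = htab_lock_bucket(htab, b, batch, &flags);
if (ret) {
	/* Previously: move on to the next bucket, silently skipping this one
	 * (and, with a stale bucket_cnt, risking an out-of-bounds copy).
	 * Now: stop the batch and report -EBUSY so userspace can retry or
	 * skip the busy batch itself. */
	goto after_loop;	/* illustrative label */
}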
-
Hou Tao authored
Per-cpu htab->map_locked is used to prohibit concurrent accesses from both NMI and non-NMI contexts. But since commit 74d862b6 ("sched: Make migrate_disable/enable() independent of RT"), migrate_disable() is also preemptible under the CONFIG_PREEMPT case, so now map_locked also unexpectedly disallows concurrent updates from normal contexts (e.g. userspace processes), as shown in the following interleaving: process A calls htab_map_update_elem() -> htab_lock_bucket() -> migrate_disable(), and __this_cpu_inc_return() returns 1; A is then preempted by process B, which updates the same bucket: htab_map_update_elem() -> htab_lock_bucket() -> migrate_disable(), __this_cpu_inc_return() returns 2, so the lock fails and -EBUSY is returned. A fix that seems feasible is using in_nmi() in htab_lock_bucket() and only checking the value of map_locked for the nmi context. But it would re-introduce deadlock on the bucket lock if htab_lock_bucket() is re-entered through a non-tracing program (e.g. a fentry program). One cannot use preempt_disable() to fix this issue, as htab_use_raw_lock being false causes the bucket lock to be a spin lock which can sleep and does not work with preempt_disable(). Therefore, use migrate_disable() when using the spinlock instead of preempt_disable(), and defer fixing concurrent updates to when the kernel has its own BPF memory allocator. Fixes: 74d862b6 ("sched: Make migrate_disable/enable() independent of RT") Reviewed-by: Hao Luo <haoluo@google.com> Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20220831042629.130006-2-houtao@huaweicloud.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
-
Martin KaFai Lau authored
This patch adds a test to ensure bpf_setsockopt(TCP_CONGESTION, "not_exist") will not trigger the kernel module autoload. Before the fix: [ 40.535829] BUG: sleeping function called from invalid context at include/linux/sched/mm.h:274 [...] [ 40.552134] tcp_ca_find_autoload.constprop.0+0xcb/0x200 [ 40.552689] tcp_set_congestion_control+0x99/0x7b0 [ 40.553203] do_tcp_setsockopt+0x3ed/0x2240 [...] [ 40.556041] __bpf_setsockopt+0x124/0x640 Signed-off-by: Martin KaFai Lau <martin.lau@linux.dev> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20220830231953.792412-1-martin.lau@linux.dev
-
Martin KaFai Lau authored
When a bpf prog changes tcp-cc by calling bpf_setsockopt(TCP_CONGESTION), it should not try to load a module, which may be a blocking operation. This detail was correct in v1 [0] but missed by mistake in the later revision in commit cb388e7e ("bpf: net: Change do_tcp_setsockopt() to use the sockopt's lock_sock() and capable()"). This patch fixes it by checking has_current_bpf_ctx(). [0] https://lore.kernel.org/bpf/20220727060921.2373314-1-kafai@fb.com/ Fixes: cb388e7e ("bpf: net: Change do_tcp_setsockopt() to use the sockopt's lock_sock() and capable()") Signed-off-by: Martin KaFai Lau <martin.lau@linux.dev> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20220830231946.791504-1-martin.lau@linux.dev
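The guard boils down to passing "do not load a module" when running from bpf; a hedged sketch of the TCP_CONGESTION branch in do_tcp_setsockopt() (the exact tcp_set_congestion_control() parameter list is an assumption, not quoted from the patch):

/* request_module() may block, so only allow tcp-cc module autoload for
 * non-bpf callers; bpf_setsockopt() runs in a context that must not sleep. */
err = tcp_set_congestion_control(sk, name, !has_current_bpf_ctx(),
				 sockopt_ns_capable(sock_net(sk)->user_ns,
						    CAP_NET_ADMIN));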
-
James Hilliard authored
The bpf_tail_call_static() function is currently not defined unless using clang >= 8. To support bpf_tail_call_static() on GCC, we can enable it when __clang__ is not defined. We need to use GCC assembly syntax when the compiler does not define __clang__, as LLVM inline assembly is not fully compatible with GCC. Signed-off-by: James Hilliard <james.hilliard1@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20220829210546.755377-1-james.hilliard1@gmail.com
-
- 30 Aug, 2022 1 commit
-
-
Hao Luo authored
Support dumping info of a cgroup_iter link. This includes showing the cgroup's id and the order for walking the cgroup hierarchy. Example output is as follows: > bpftool link show 1: iter prog 2 target_name bpf_map 2: iter prog 3 target_name bpf_prog 3: iter prog 12 target_name cgroup cgroup_id 72 order self_only > bpftool -p link show [{ "id": 1, "type": "iter", "prog_id": 2, "target_name": "bpf_map" },{ "id": 2, "type": "iter", "prog_id": 3, "target_name": "bpf_prog" },{ "id": 3, "type": "iter", "prog_id": 12, "target_name": "cgroup", "cgroup_id": 72, "order": "self_only" } ] Signed-off-by: Hao Luo <haoluo@google.com> Reviewed-by: Quentin Monnet <quentin@isovalent.com> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/r/20220829231828.1016835-1-haoluo@google.com Signed-off-by: Martin KaFai Lau <martin.lau@linux.dev>
-