- 03 Dec, 2020 40 commits
-
-
Alexei Starovoitov authored
Mariusz Dudek says: ==================== From: Mariusz Dudek <mariuszx.dudek@intel.com> This patch series adds support for separation of eBPF program load and xsk socket creation. In for example a Kubernetes environment you can have an AF_XDP CNI or daemonset that is responsible for launching pods that execute an application using AF_XDP sockets. It is desirable that the pod runs with as low privileges as possible, CAP_NET_RAW in this case, and that all operations that require privileges are contained in the CNI or daemonset. In this case, you have to be able separate ePBF program load from xsk socket creation. Currently, this will not work with the xsk_socket__create APIs because you need to have CAP_NET_ADMIN privileges to load eBPF program and CAP_SYS_ADMIN privileges to create update xsk_bpf_maps. To be exact xsk_set_bpf_maps does not need those privileges but it takes the prog_fd and xsks_map_fd and those are known only to process that was loading eBPF program. The api bpf_prog_get_fd_by_id that looks up the fd of the prog using an prog_id and bpf_map_get_fd_by_id that looks for xsks_map_fd usinb map_id both requires CAP_SYS_ADMIN. With this patch, the pod can be run with CAP_NET_RAW capability only. In case your umem is larger or equal process limit for MEMLOCK you need either increase the limit or CAP_IPC_LOCK capability. Without this patch in case of insufficient rights ENOPERM is returned by xsk_socket__create. To resolve this privileges issue two new APIs are introduced: - xsk_setup_xdp_prog - loads the built in XDP program. It can also return xsks_map_fd which is needed by unprivileged process to update xsks_map with AF_XDP socket "fd" - xsk_sokcet__update_xskmap - inserts an AF_XDP socket into an xskmap for a particular xsk_socket Usage example: int xsk_setup_xdp_prog(int ifindex, int *xsks_map_fd) int xsk_socket__update_xskmap(struct xsk_socket *xsk, int xsks_map_fd); Inserts AF_XDP socket "fd" into the xskmap. The first patch introduces the new APIs. The second patch provides a new sample applications working as control and modification to existing xdpsock application to work with less privileges. This patch set is based on bpf-next commit 97306be4 (Merge branch 'switch to memcg-based memory accounting') Since v6 - rebase on 97306be4 to resolve RLIMIT conflicts Since v5 - fixed sample/bpf/xdpsock_user.c to resolve merge conflicts Since v4 - sample/bpf/Makefile issues fixed Since v3: - force_set_map flag removed - leaking of xsk struct fixed - unified function error returning policy implemented Since v2: - new APIs moved itto LIBBPF_0.3.0 section - struct bpf_prog_cfg_opts removed - loading own eBPF program via xsk_setup_xdp_prog functionality removed Since v1: - struct bpf_prog_cfg improved for backward/forward compatibility - API xsk_update_xskmap renamed to xsk_socket__update_xskmap - commit message formatting fixed ==================== Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Mariusz Dudek authored
Introduce a sample program to demonstrate the control and data plane split. For the control plane part a new program called xdpsock_ctrl_proc is introduced. For the data plane part, some code was added to xdpsock_user.c to act as the data plane entity. Application xdpsock_ctrl_proc works as control entity with sudo privileges (CAP_SYS_ADMIN and CAP_NET_ADMIN are sufficient) and the extended xdpsock as data plane entity with CAP_NET_RAW capability only. Usage example: sudo ./samples/bpf/xdpsock_ctrl_proc -i <interface> sudo ./samples/bpf/xdpsock -i <interface> -q <queue_id> -n <interval> -N -l -R Signed-off-by: Mariusz Dudek <mariuszx.dudek@intel.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Magnus Karlsson <magnus.karlsson@intel.com> Link: https://lore.kernel.org/bpf/20201203090546.11976-3-mariuszx.dudek@intel.com
-
Mariusz Dudek authored
Add support for separation of eBPF program load and xsk socket creation. This is needed for use-case when you want to privide as little privileges as possible to the data plane application that will handle xsk socket creation and incoming traffic. With this patch the data entity container can be run with only CAP_NET_RAW capability to fulfill its purpose of creating xsk socket and handling packages. In case your umem is larger or equal process limit for MEMLOCK you need either increase the limit or CAP_IPC_LOCK capability. To resolve privileges issue two APIs are introduced: - xsk_setup_xdp_prog - loads the built in XDP program. It can also return xsks_map_fd which is needed by unprivileged process to update xsks_map with AF_XDP socket "fd" - xsk_socket__update_xskmap - inserts an AF_XDP socket into an xskmap for a particular xsk_socket Signed-off-by: Mariusz Dudek <mariuszx.dudek@intel.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Magnus Karlsson <magnus.karlsson@intel.com> Link: https://lore.kernel.org/bpf/20201203090546.11976-2-mariuszx.dudek@intel.com
-
Brendan Jackman authored
Add missing newlines and fix polarity of strerror argument. Signed-off-by: Brendan Jackman <jackmanb@google.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Jiri Olsa <jolsa@redhat.com> Link: https://lore.kernel.org/bpf/20201203102234.648540-1-jackmanb@google.com
-
Stanislav Fomichev authored
Splice (copy_file_range) doesn't work on all filesystems. I'm running test kernels on top of my read-only disk image and it uses plan9 under the hood. This prevents test_local_storage from successfully passing. There is really no technical reason to use splice, so lets do old-school read/write to copy file; this should work in all environments. Signed-off-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20201202174947.3621989-1-sdf@google.com
-
Alexei Starovoitov authored
Andrii Nakryiko says: ==================== Few follow up improvements to bpftool for split BTF support: - emit "name <anon>" for non-named BTFs in `bpftool btf show` command; - when dumping /sys/kernel/btf/<module> use /sys/kernel/btf/vmlinux as the base BTF, unless base BTF is explicitly specified with -B flag. This patch set also adds btf__base_btf() getter to access base BTF of the struct btf. ==================== Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Andrii Nakryiko authored
In case of working with module's split BTF from /sys/kernel/btf/*, auto-substitute /sys/kernel/btf/vmlinux as the base BTF. This makes using bpftool with module BTFs faster and simpler. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20201202065244.530571-4-andrii@kernel.org
-
Andrii Nakryiko authored
Add ability to get base BTF. It can be also used to check if BTF is split BTF. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20201202065244.530571-3-andrii@kernel.org
-
Andrii Nakryiko authored
For consistency of output, emit "name <anon>" for BTFs without the name. This keeps output more consistent and obvious. Suggested-by: Song Liu <songliubraving@fb.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20201202065244.530571-2-andrii@kernel.org
-
Alexei Starovoitov authored
Roman Gushchin says: ==================== Currently bpf is using the memlock rlimit for the memory accounting. This approach has its downsides and over time has created a significant amount of problems: 1) The limit is per-user, but because most bpf operations are performed as root, the limit has a little value. 2) It's hard to come up with a specific maximum value. Especially because the counter is shared with non-bpf use cases (e.g. memlock()). Any specific value is either too low and creates false failures or is too high and useless. 3) Charging is not connected to the actual memory allocation. Bpf code should manually calculate the estimated cost and charge the counter, and then take care of uncharging, including all fail paths. It adds to the code complexity and makes it easy to leak a charge. 4) There is no simple way of getting the current value of the counter. We've used drgn for it, but it's far from being convenient. 5) Cryptic -EPERM is returned on exceeding the limit. Libbpf even had a function to "explain" this case for users. 6) rlimits are generally considered as (at least partially) obsolete. They do not provide a comprehensive system for the control of physical resources: memory, cpu, io etc. All resource control developments in the recent years were related to cgroups. In order to overcome these problems let's switch to the memory cgroup-based memory accounting of bpf objects. With the recent addition of the percpu memory accounting, now it's possible to provide a comprehensive accounting of the memory used by bpf programs and maps. This approach has the following advantages: 1) The limit is per-cgroup and hierarchical. It's way more flexible and allows a better control over memory usage by different workloads. 2) The actual memory consumption is taken into account. It happens automatically on the allocation time if __GFP_ACCOUNT flags is passed. Uncharging is also performed automatically on releasing the memory. So the code on the bpf side becomes simpler and safer. 3) There is a simple way to get the current value and statistics. Cgroup-based accounting adds new requirements: 1) The kernel config should have CONFIG_CGROUPS and CONFIG_MEMCG_KMEM enabled. These options are usually enabled, maybe excluding tiny builds for embedded devices. 2) The system should have a configured cgroup hierarchy, including reasonable memory limits and/or guarantees. Modern systems usually delegate this task to systemd or similar task managers. Without meeting these requirements there are no limits on how much memory bpf can use and a non-root user is able to hurt the system by allocating too much. But because per-user rlimits do not provide a functional system to protect and manage physical resources anyway, anyone who seriously depends on it, should use cgroups. When a bpf map is created, the memory cgroup of the process which creates the map is recorded. Subsequently all memory allocation related to the bpf map are charged to the same cgroup. It includes allocations made from interrupts and by any processes. Bpf program memory is charged to the memory cgroup of a process which loads the program. The patchset consists of the following parts: 1) 4 mm patches are required on the mm side, otherwise vmallocs cannot be mapped to userspace 2) memcg-based accounting for various bpf objects: progs and maps 3) removal of the rlimit-based accounting 4) removal of rlimit adjustments in userspace samples v9: - always charge the saved memory cgroup, by Daniel, Toke and Alexei - added bpf_map_kzalloc() - rebase and minor fixes v8: - extended the cover letter to be more clear on new requirements, by Daniel - an approximate value is provided by map memlock info, by Alexei v7: - introduced bpf_map_kmalloc_node() and bpf_map_alloc_percpu(), by Alexei - switched allocations made from an interrupt context to new helpers, by Daniel - rebase and minor fixes v6: - rebased to the latest version of the remote charging API - fixed signatures, added acks v5: - rebased to the latest version of the remote charging API - implemented kmem accounting from an interrupt context, by Shakeel - rebased to latest changes in mm allowed to map vmallocs to userspace - fixed a build issue in kselftests, by Alexei - fixed a use-after-free bug in bpf_map_free_deferred() - added bpf line info coverage, by Shakeel - split bpf map charging preparations into a separate patch v4: - covered allocations made from an interrupt context, by Daniel - added some clarifications to the cover letter v3: - droped the userspace part for further discussions/refinements, by Andrii and Song v2: - fixed build issue, caused by the remaining rlimit-based accounting for sockhash maps ==================== Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Roman Gushchin authored
Since bpf is not using rlimit memlock for the memory accounting and control, do not change the limit in sample applications. Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Song Liu <songliubraving@fb.com> Link: https://lore.kernel.org/bpf/20201201215900.3569844-35-guro@fb.com
-
Roman Gushchin authored
Do not use rlimit-based memory accounting for bpf progs. It has been replaced with memcg-based memory accounting. Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Song Liu <songliubraving@fb.com> Link: https://lore.kernel.org/bpf/20201201215900.3569844-34-guro@fb.com
-
Roman Gushchin authored
Remove rlimit-based accounting infrastructure code, which is not used anymore. To provide a backward compatibility, use an approximation of the bpf map memory footprint as a "memlock" value, available to a user via map info. The approximation is based on the maximal number of elements and key and value sizes. Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Song Liu <songliubraving@fb.com> Link: https://lore.kernel.org/bpf/20201201215900.3569844-33-guro@fb.com
-
Roman Gushchin authored
Do not use rlimit-based memory accounting for bpf local storage maps. It has been replaced with the memcg-based memory accounting. Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Song Liu <songliubraving@fb.com> Link: https://lore.kernel.org/bpf/20201201215900.3569844-32-guro@fb.com
-
Roman Gushchin authored
Do not use rlimit-based memory accounting for xskmap maps. It has been replaced with the memcg-based memory accounting. Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Song Liu <songliubraving@fb.com> Link: https://lore.kernel.org/bpf/20201201215900.3569844-31-guro@fb.com
-
Roman Gushchin authored
Do not use rlimit-based memory accounting for stackmap maps. It has been replaced with the memcg-based memory accounting. Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Song Liu <songliubraving@fb.com> Link: https://lore.kernel.org/bpf/20201201215900.3569844-30-guro@fb.com
-
Roman Gushchin authored
Do not use rlimit-based memory accounting for sockmap and sockhash maps. It has been replaced with the memcg-based memory accounting. Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Song Liu <songliubraving@fb.com> Link: https://lore.kernel.org/bpf/20201201215900.3569844-29-guro@fb.com
-
Roman Gushchin authored
Do not use rlimit-based memory accounting for bpf ringbuffer. It has been replaced with the memcg-based memory accounting. bpf_ringbuf_alloc() can't return anything except ERR_PTR(-ENOMEM) and a valid pointer, so to simplify the code make it return NULL in the first case. This allows to drop a couple of lines in ringbuf_map_alloc() and also makes it look similar to other memory allocating function like kmalloc(). Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Song Liu <songliubraving@fb.com> Acked-by: Andrii Nakryiko <andriin@fb.com> Link: https://lore.kernel.org/bpf/20201201215900.3569844-28-guro@fb.com
-
Roman Gushchin authored
Do not use rlimit-based memory accounting for reuseport_array maps. It has been replaced with the memcg-based memory accounting. Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Song Liu <songliubraving@fb.com> Link: https://lore.kernel.org/bpf/20201201215900.3569844-27-guro@fb.com
-
Roman Gushchin authored
Do not use rlimit-based memory accounting for queue_stack maps. It has been replaced with the memcg-based memory accounting. Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Song Liu <songliubraving@fb.com> Link: https://lore.kernel.org/bpf/20201201215900.3569844-26-guro@fb.com
-
Roman Gushchin authored
Do not use rlimit-based memory accounting for lpm_trie maps. It has been replaced with the memcg-based memory accounting. Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Song Liu <songliubraving@fb.com> Link: https://lore.kernel.org/bpf/20201201215900.3569844-25-guro@fb.com
-
Roman Gushchin authored
Do not use rlimit-based memory accounting for hashtab maps. It has been replaced with the memcg-based memory accounting. Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Song Liu <songliubraving@fb.com> Link: https://lore.kernel.org/bpf/20201201215900.3569844-24-guro@fb.com
-
Roman Gushchin authored
Do not use rlimit-based memory accounting for devmap maps. It has been replaced with the memcg-based memory accounting. Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Song Liu <songliubraving@fb.com> Link: https://lore.kernel.org/bpf/20201201215900.3569844-23-guro@fb.com
-
Roman Gushchin authored
Do not use rlimit-based memory accounting for cgroup storage maps. It has been replaced with the memcg-based memory accounting. Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Song Liu <songliubraving@fb.com> Link: https://lore.kernel.org/bpf/20201201215900.3569844-22-guro@fb.com
-
Roman Gushchin authored
Do not use rlimit-based memory accounting for cpumap maps. It has been replaced with the memcg-based memory accounting. Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Song Liu <songliubraving@fb.com> Link: https://lore.kernel.org/bpf/20201201215900.3569844-21-guro@fb.com
-
Roman Gushchin authored
Do not use rlimit-based memory accounting for bpf_struct_ops maps. It has been replaced with the memcg-based memory accounting. Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Song Liu <songliubraving@fb.com> Link: https://lore.kernel.org/bpf/20201201215900.3569844-20-guro@fb.com
-
Roman Gushchin authored
Do not use rlimit-based memory accounting for arraymap maps. It has been replaced with the memcg-based memory accounting. Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Song Liu <songliubraving@fb.com> Link: https://lore.kernel.org/bpf/20201201215900.3569844-19-guro@fb.com
-
Roman Gushchin authored
Extend xskmap memory accounting to include the memory taken by the xsk_map_node structure. Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20201201215900.3569844-18-guro@fb.com
-
Roman Gushchin authored
Include internal metadata into the memcg-based memory accounting. Also include the memory allocated on updating an element. Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20201201215900.3569844-17-guro@fb.com
-
Roman Gushchin authored
Account memory used by bpf local storage maps: per-socket, per-inode and per-task storages. Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20201201215900.3569844-16-guro@fb.com
-
Roman Gushchin authored
Enable the memcg-based memory accounting for the memory used by the bpf ringbuffer. Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20201201215900.3569844-15-guro@fb.com
-
Roman Gushchin authored
Include lpm trie and lpm trie node objects into the memcg-based memory accounting. Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20201201215900.3569844-14-guro@fb.com
-
Roman Gushchin authored
Include percpu objects and the size of map metadata into the accounting. Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20201201215900.3569844-13-guro@fb.com
-
Roman Gushchin authored
Include map metadata and the node size (struct bpf_dtab_netdev) into the accounting. Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20201201215900.3569844-12-guro@fb.com
-
Roman Gushchin authored
Account memory used by cgroup storage maps including metadata structures. Account the percpu memory for the percpu flavor of cgroup storage. Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20201201215900.3569844-11-guro@fb.com
-
Roman Gushchin authored
Include metadata and percpu data into the memcg-based memory accounting. Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20201201215900.3569844-10-guro@fb.com
-
Roman Gushchin authored
Include percpu arrays and auxiliary data into the memcg-based memory accounting. Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20201201215900.3569844-9-guro@fb.com
-
Roman Gushchin authored
This patch enables memcg-based memory accounting for memory allocated by __bpf_map_area_alloc(), which is used by many types of bpf maps for large initial memory allocations. Please note, that __bpf_map_area_alloc() should not be used outside of map creation paths without setting the active memory cgroup to the map's memory cgroup. Following patches in the series will refine the accounting for some of the map types. Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20201201215900.3569844-8-guro@fb.com
-
Roman Gushchin authored
Bpf maps can be updated from an interrupt context and in such case there is no process which can be charged. It makes the memory accounting of bpf maps non-trivial. Fortunately, after commit 4127c650 ("mm: kmem: enable kernel memcg accounting from interrupt contexts") and commit b87d8cef ("mm, memcg: rework remote charging API to support nesting") it's finally possible. To make the ownership model simple and consistent, when the map is created, the memory cgroup of the current process is recorded. All subsequent allocations related to the bpf map are charged to the same memory cgroup. It includes allocations made by any processes (even if they do belong to a different cgroup) and from interrupts. This commit introduces 3 new helpers, which will be used by following commits to enable the accounting of bpf maps memory: - bpf_map_kmalloc_node() - bpf_map_kzalloc() - bpf_map_alloc_percpu() They are wrapping popular memory allocation functions. They set the active memory cgroup to the map's memory cgroup and add __GFP_ACCOUNT to the passed gfp flags. Then they call into the corresponding memory allocation function and restore the original active memory cgroup. These helpers are supposed to use everywhere except the map creation path. During the map creation when the map structure is allocated by itself, it cannot be passed to those helpers. In those cases default memory allocation function will be used with the __GFP_ACCOUNT flag. Signed-off-by: Roman Gushchin <guro@fb.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20201201215900.3569844-7-guro@fb.com
-
Roman Gushchin authored
Include memory used by bpf programs into the memcg-based accounting. This includes the memory used by programs itself, auxiliary data, statistics and bpf line info. A memory cgroup containing the process which loads the program is getting charged. Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Song Liu <songliubraving@fb.com> Link: https://lore.kernel.org/bpf/20201201215900.3569844-6-guro@fb.com
-