- 31 Mar, 2018 10 commits
-
Daniel Borkmann authored
Andrey Ignatov says:

====================
v2->v3:
- rebase due to conflicts
- fix ipv6=m build

v1->v2:
- support expected_attach_type at prog load time so that the prog (incl.
  context accesses and calls to helpers) can be validated with regard to
  the specific attach point it is supposed to be attached to. Later, at
  attach time, the attach type is checked so that it must be the same as
  at load time if it was provided
- reworked hooks to rely on expected_attach_type, and reduced the number
  of new prog types from 6 to just 1: BPF_PROG_TYPE_CGROUP_SOCK_ADDR
- reused BPF_PROG_TYPE_CGROUP_SOCK for sys_bind post-hooks
- add selftests for the post-sys_bind hook

For our container management we've been using a complicated and fragile
setup consisting of an LD_PRELOAD wrapper intercepting bind and connect
calls from all containerized applications. Unfortunately it doesn't work
for apps that don't use glibc, and changing all applications that run in
the datacenter is not possible due to 3rd-party code and libraries
(despite being open source) and the sheer amount of legacy code that
would have to be rewritten (we're rewriting what we can in parallel).

These applications are written without containers in mind and have
built-in assumptions about network services. Like an application X
expects to connect to localhost:special_port and find service Y there.
To move application X and service Y into two different containers, the
LD_PRELOAD approach is used to help one service connect to another
without rewriting them.

Moving these two applications into different L2 (netns) or L3 (vrf)
network isolation scopes doesn't help to solve the problem, since the
applications need to see each other as if they were running on the host
without containers. So if app X and app Y ran in different netns,
something would need to punch a connectivity hole in those namespaces.
That would be a real layering violation (with corresponding network
debugging pains), since a clean l2, l3 abstraction would suddenly
support something that breaks through the layers.

Instead we used LD_PRELOAD (and now bpf programs) at bind/connect time
to help applications discover and connect to each other. All
applications are running in init_netns and there are no vrfs. After
bind/connect the normal fib/neighbor core networking logic works as it
always should, and the whole system is clean from a network point of
view and can be debugged with standard tools.

We also considered resurrecting Hannes's afnetns work, but hierarchical
namespace abstractions don't work due to these built-in networking
assumptions inside the apps. To run an application inside a cgroup
container that was not written with containers in mind we have to create
the illusion of running in a non-containerized environment.

In some cases we remember the port and container id in the post-bind
hook in a bpf map, and when some other task in a different container is
trying to connect to a service we need to know where this service is
running. It can be remote or local. Both client and service may or may
not be written with containers in mind, and this sockaddr rewrite
provides connectivity and load balancing.

BPF+cgroup looks to be the best solution for this problem. Hence we
introduce 3 hooks:
- at entry into sys_bind and sys_connect, to let a bpf prog inspect and
  modify the 'struct sockaddr' provided by user space and fail
  bind/connect when appropriate
- post sys_bind, after the port is allocated

The approach works great and has zero overhead for anyone who doesn't
use it and very low overhead when deployed.

A different use case for this feature is a low-overhead firewall that
doesn't need to inspect all packets and works at bind/connect time.
====================

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
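The hooks are attached per cgroup. As a hedged sketch (our illustration,
not code from this series), loading a sockaddr program and attaching it
to a cgroup could look like the following, assuming libbpf's
bpf_prog_attach() and an illustrative cgroup-v2 path:

  #include <fcntl.h>
  #include "bpf/bpf.h"    /* tools/lib/bpf */

  /* prog_fd: a BPF_PROG_TYPE_CGROUP_SOCK_ADDR program loaded with
   * expected_attach_type = BPF_CGROUP_INET4_CONNECT.
   */
  static int attach_connect4(int prog_fd)
  {
      /* /sys/fs/cgroup/test is an illustrative cgroup path */
      int cg_fd = open("/sys/fs/cgroup/test", O_RDONLY | O_DIRECTORY);

      if (cg_fd < 0)
          return -1;
      /* attach type must match the expected_attach_type given at
       * load time, or the kernel rejects the attach with EINVAL */
      return bpf_prog_attach(prog_fd, cg_fd, BPF_CGROUP_INET4_CONNECT, 0);
  }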
-
Andrey Ignatov authored
Add selftest for attach types `BPF_CGROUP_INET4_POST_BIND` and
`BPF_CGROUP_INET6_POST_BIND`.

The main things tested are:
* prog load behaves as expected (valid/invalid accesses in prog);
* prog attach behaves as expected (load- vs attach-time attach types);
* `BPF_CGROUP_INET_SOCK_CREATE` can be attached in a backward compatible
  way;
* post-hooks return expected result and errno.

Example:
  # ./test_sock
  Test case: bind4 load with invalid access: src_ip6 .. [PASS]
  Test case: bind4 load with invalid access: mark .. [PASS]
  Test case: bind6 load with invalid access: src_ip4 .. [PASS]
  Test case: sock_create load with invalid access: src_port .. [PASS]
  Test case: sock_create load w/o expected_attach_type (compat mode) .. [PASS]
  Test case: sock_create load w/ expected_attach_type .. [PASS]
  Test case: attach type mismatch bind4 vs bind6 .. [PASS]
  Test case: attach type mismatch bind6 vs bind4 .. [PASS]
  Test case: attach type mismatch default vs bind4 .. [PASS]
  Test case: attach type mismatch bind6 vs sock_create .. [PASS]
  Test case: bind4 reject all .. [PASS]
  Test case: bind6 reject all .. [PASS]
  Test case: bind6 deny specific IP & port .. [PASS]
  Test case: bind4 allow specific IP & port .. [PASS]
  Test case: bind4 allow all .. [PASS]
  Test case: bind6 allow all .. [PASS]
  Summary: 16 PASSED, 0 FAILED

Signed-off-by: Andrey Ignatov <rdna@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
-
Andrey Ignatov authored
"Post-hooks" are hooks that are called right before returning from sys_bind. At this time IP and port are already allocated and no further changes to `struct sock` can happen before returning from sys_bind but BPF program has a chance to inspect the socket and change sys_bind result. Specifically it can e.g. inspect what port was allocated and if it doesn't satisfy some policy, BPF program can force sys_bind to fail and return EPERM to user. Another example of usage is recording the IP:port pair to some map to use it in later calls to sys_connect. E.g. if some TCP server inside cgroup was bound to some IP:port_n, it can be recorded to a map. And later when some TCP client inside same cgroup is trying to connect to 127.0.0.1:port_n, BPF hook for sys_connect can override the destination and connect application to IP:port_n instead of 127.0.0.1:port_n. That helps forcing all applications inside a cgroup to use desired IP and not break those applications if they e.g. use localhost to communicate between each other. == Implementation details == Post-hooks are implemented as two new attach types `BPF_CGROUP_INET4_POST_BIND` and `BPF_CGROUP_INET6_POST_BIND` for existing prog type `BPF_PROG_TYPE_CGROUP_SOCK`. Separate attach types for IPv4 and IPv6 are introduced to avoid access to IPv6 field in `struct sock` from `inet_bind()` and to IPv4 field from `inet6_bind()` since those fields might not make sense in such cases. Signed-off-by: Andrey Ignatov <rdna@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
-
Andrey Ignatov authored
Add selftest for `BPF_CGROUP_INET4_CONNECT` and
`BPF_CGROUP_INET6_CONNECT` attach types.

Try to connect(2) to specified IP:port and test that:
* remote IP:port pair is overridden;
* local end of connection is bound to specified IP.

All combinations of IPv4/IPv6 and TCP/UDP are tested.

Example:
  # tcpdump -pn -i lo -w connect.pcap 2>/dev/null &
  [1] 478
  # strace -qqf -e connect -o connect.trace ./test_sock_addr.sh
  Wait for testing IPv4/IPv6 to become available ... OK
  Load bind4 with invalid type (can pollute stderr) ... REJECTED
  Load bind4 with valid type ... OK
  Attach bind4 with invalid type ... REJECTED
  Attach bind4 with valid type ... OK
  Load connect4 with invalid type (can pollute stderr)
  libbpf: load bpf program failed: Permission denied
  libbpf: -- BEGIN DUMP LOG ---
  libbpf:
  0: (b7) r2 = 23569
  1: (63) *(u32 *)(r1 +24) = r2
  2: (b7) r2 = 16777343
  3: (63) *(u32 *)(r1 +4) = r2
  invalid bpf_context access off=4 size=4
  [ 1518.404609] random: crng init done
  libbpf: -- END LOG --
  libbpf: failed to load program 'cgroup/connect4'
  libbpf: failed to load object './connect4_prog.o'
  ... REJECTED
  Load connect4 with valid type ... OK
  Attach connect4 with invalid type ... REJECTED
  Attach connect4 with valid type ... OK
  Test case #1 (IPv4/TCP):
          Requested: bind(192.168.1.254, 4040) ..
             Actual: bind(127.0.0.1, 4444)
          Requested: connect(192.168.1.254, 4040) from (*, *) ..
             Actual: connect(127.0.0.1, 4444) from (127.0.0.4, 56068)
  Test case #2 (IPv4/UDP):
          Requested: bind(192.168.1.254, 4040) ..
             Actual: bind(127.0.0.1, 4444)
          Requested: connect(192.168.1.254, 4040) from (*, *) ..
             Actual: connect(127.0.0.1, 4444) from (127.0.0.4, 56447)
  Load bind6 with invalid type (can pollute stderr) ... REJECTED
  Load bind6 with valid type ... OK
  Attach bind6 with invalid type ... REJECTED
  Attach bind6 with valid type ... OK
  Load connect6 with invalid type (can pollute stderr)
  libbpf: load bpf program failed: Permission denied
  libbpf: -- BEGIN DUMP LOG ---
  libbpf:
  0: (b7) r6 = 0
  1: (63) *(u32 *)(r1 +12) = r6
  invalid bpf_context access off=12 size=4
  libbpf: -- END LOG --
  libbpf: failed to load program 'cgroup/connect6'
  libbpf: failed to load object './connect6_prog.o'
  ... REJECTED
  Load connect6 with valid type ... OK
  Attach connect6 with invalid type ... REJECTED
  Attach connect6 with valid type ... OK
  Test case #3 (IPv6/TCP):
          Requested: bind(face:b00c:1234:5678::abcd, 6060) ..
             Actual: bind(::1, 6666)
          Requested: connect(face:b00c:1234:5678::abcd, 6060) from (*, *)
             Actual: connect(::1, 6666) from (::6, 37458)
  Test case #4 (IPv6/UDP):
          Requested: bind(face:b00c:1234:5678::abcd, 6060) ..
             Actual: bind(::1, 6666)
          Requested: connect(face:b00c:1234:5678::abcd, 6060) from (*, *)
             Actual: connect(::1, 6666) from (::6, 39315)
  ### SUCCESS
  # egrep 'connect\(.*AF_INET' connect.trace | \
  > egrep -vw 'htons\(1025\)' | fold -b -s -w 72
  502   connect(7, {sa_family=AF_INET, sin_port=htons(4040),
  sin_addr=inet_addr("192.168.1.254")}, 128) = 0
  502   connect(8, {sa_family=AF_INET, sin_port=htons(4040),
  sin_addr=inet_addr("192.168.1.254")}, 128) = 0
  502   connect(9, {sa_family=AF_INET6, sin6_port=htons(6060),
  inet_pton(AF_INET6, "face:b00c:1234:5678::abcd", &sin6_addr),
  sin6_flowinfo=0, sin6_scope_id=0}, 128) = 0
  502   connect(10, {sa_family=AF_INET6, sin6_port=htons(6060),
  inet_pton(AF_INET6, "face:b00c:1234:5678::abcd", &sin6_addr),
  sin6_flowinfo=0, sin6_scope_id=0}, 128) = 0
  # fg
  tcpdump -pn -i lo -w connect.pcap 2> /dev/null
  # tcpdump -r connect.pcap -n tcp | cut -c 1-72
  reading from file connect.pcap, link-type EN10MB (Ethernet)
  17:57:40.383533 IP 127.0.0.4.56068 > 127.0.0.1.4444: Flags [S], seq 1333
  17:57:40.383566 IP 127.0.0.1.4444 > 127.0.0.4.56068: Flags [S.], seq 112
  17:57:40.383589 IP 127.0.0.4.56068 > 127.0.0.1.4444: Flags [.], ack 1, w
  17:57:40.384578 IP 127.0.0.1.4444 > 127.0.0.4.56068: Flags [R.], seq 1,
  17:57:40.403327 IP6 ::6.37458 > ::1.6666: Flags [S], seq 406513443, win
  17:57:40.403357 IP6 ::1.6666 > ::6.37458: Flags [S.], seq 2448389240, ac
  17:57:40.403376 IP6 ::6.37458 > ::1.6666: Flags [.], ack 1, win 342, opt
  17:57:40.404263 IP6 ::1.6666 > ::6.37458: Flags [R.], seq 1, ack 1, win

Signed-off-by: Andrey Ignatov <rdna@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
-
Andrey Ignatov authored
== The problem ==

See description of the problem in the initial patch of this patch set.

== The solution ==

The patch provides a much more reliable in-kernel solution for the 2nd
part of the problem: making outgoing connections from a desired IP.

It adds new attach types `BPF_CGROUP_INET4_CONNECT` and
`BPF_CGROUP_INET6_CONNECT` for program type
`BPF_PROG_TYPE_CGROUP_SOCK_ADDR` that can be used to override both
source and destination of a connection at connect(2) time.

The local end of the connection can be bound to a desired IP using the
newly introduced BPF helper `bpf_bind()`. It only allows binding to an
IP, though, and doesn't support binding to a port, i.e. it leverages the
`IP_BIND_ADDRESS_NO_PORT` socket option. There are two reasons for this:
* looking for a free port is expensive and can affect performance
  significantly;
* there is no use-case for port.

As for the remote end (`struct sockaddr *` passed by user), both parts
of it can be overridden, remote IP and remote port. It's useful if an
application inside a cgroup wants to connect to another application
inside the same cgroup or to itself, but knows nothing about the IP
assigned to the cgroup.

Support is added for IPv4 and IPv6, for TCP and UDP. IPv4 and IPv6 have
separate attach types for the same reason as the sys_bind hooks, i.e. to
prevent reading from / writing to e.g. user_ip6 fields when the user
passes a sockaddr_in, since it'd be out-of-bounds.

== Implementation notes ==

The patch introduces a new field in `struct proto`: `pre_connect`, a
pointer to a function with the same signature as `connect` that is
called before it. The reason is that in some cases BPF hooks should be
called way before control is passed to `sk->sk_prot->connect`.
Specifically `inet_dgram_connect` autobinds the socket before calling
`sk->sk_prot->connect`, and there is no way to call `bpf_bind()` from
hooks in e.g. `ip4_datagram_connect` or `ip6_datagram_connect` since
it'd cause a double-bind. On the other hand, `proto.pre_connect`
provides a flexible way to add BPF hooks for connect only for the
`proto`s that need them, and to call them at the desired time before
`connect`. Since `bpf_bind()` is allowed to bind only to an IP and
autobind in `inet_dgram_connect` binds only the port, there is no chance
of a double-bind.

bpf_bind() sets `force_bind_address_no_port` to bind only to the IP,
regardless of the value of the `bind_address_no_port` socket field.

bpf_bind() sets `with_lock` to `false` when calling __inet_bind() and
__inet6_bind() since all call-sites where bpf_bind() is called already
hold the socket lock.

Signed-off-by: Andrey Ignatov <rdna@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
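A hedged illustration of the new helper, along the lines of the selftest
in this series. Addresses and ports are illustrative, and
bpf_htonl/bpf_htons here stand in for the selftests' endian helpers on a
little-endian target:

  #include <sys/socket.h>
  #include <linux/in.h>
  #include <linux/bpf.h>

  #define SEC(name) __attribute__((section(name), used))
  #define bpf_htonl(x) __builtin_bswap32(x)  /* little-endian target assumed */
  #define bpf_htons(x) __builtin_bswap16(x)

  static int (*bpf_bind)(struct bpf_sock_addr *ctx, struct sockaddr *addr,
                         int addr_len) = (void *) BPF_FUNC_bind;

  SEC("cgroup/connect4")
  int connect_v4_prog(struct bpf_sock_addr *ctx)
  {
      struct sockaddr_in sa = {};

      /* Pin the source IP; the port stays 0 thanks to the
       * IP_BIND_ADDRESS_NO_PORT semantics described above. */
      sa.sin_family = AF_INET;
      sa.sin_addr.s_addr = bpf_htonl(0x7f000004);    /* 127.0.0.4 */
      if (bpf_bind(ctx, (struct sockaddr *)&sa, sizeof(sa)))
          return 0;    /* deny connect(2) */

      /* Rewrite the destination to 127.0.0.1:4444. */
      ctx->user_ip4 = bpf_htonl(0x7f000001);
      ctx->user_port = bpf_htons(4444);
      return 1;    /* allow */
  }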
-
Andrey Ignatov authored
Refactor `bind()` code to make it ready to be called from the BPF helper
function `bpf_bind()` (will be added soon). The implementations of
`inet_bind()` and `inet6_bind()` are separated into `__inet_bind()` and
`__inet6_bind()` correspondingly. These functions can be used from both
`sk_prot->bind` and `bpf_bind()` contexts.

The new functions have two additional arguments.

`force_bind_address_no_port` forces binding to IP only, without checking
the `inet_sock.bind_address_no_port` field. It allows binding the local
end of a connection to a desired IP in `bpf_bind()` without changing the
`bind_address_no_port` field of the socket. That's useful since
`bpf_bind()` can return an error, and we'd need to restore the original
value of `bind_address_no_port` in that case if we had changed it before
calling the helper.

`with_lock` specifies whether to lock the socket when working with
`struct sock` or not. The argument is set to `true` for `sk_prot->bind`,
i.e. the old behavior is preserved. But it will be set to `false` for
the `bpf_bind()` use-case. The reason is that all call-sites where
`bpf_bind()` will be called already hold the socket lock.

Signed-off-by: Andrey Ignatov <rdna@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
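From the description, the refactored entry points carry signatures along
these lines (a sketch reconstructed from the text above, not copied from
the diff):

  /* Used by inet_bind() with with_lock = true and, later, by
   * bpf_bind() with with_lock = false, since bpf_bind()'s call-sites
   * already hold the socket lock. */
  int __inet_bind(struct sock *sk, struct sockaddr *uaddr, int addr_len,
                  bool force_bind_address_no_port, bool with_lock);

  int __inet6_bind(struct sock *sk, struct sockaddr *uaddr, int addr_len,
                   bool force_bind_address_no_port, bool with_lock);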
-
Andrey Ignatov authored
Add selftest to work with the bpf_sock_addr context from
`BPF_PROG_TYPE_CGROUP_SOCK_ADDR` programs.

Try to bind(2) on IP:port and apply:
* loads to make sure the context can be read correctly, including narrow
  loads (byte, half) for IP and full-size loads (word) for all fields;
* stores to those fields allowed by the verifier.

All combinations of IPv4/IPv6 and TCP/UDP are tested.

Both scenarios are tested:
* valid programs can be loaded and attached;
* invalid programs can be neither loaded nor attached.

The test passes when the expected data can be read from the context in
the BPF program, and when, after the call to bind(2), the socket is
bound to the IP:port pair that the BPF program wrote to the context.

Example:
  # ./test_sock_addr
  Attached bind4 program.
  Test case #1 (IPv4/TCP):
          Requested: bind(192.168.1.254, 4040) ..
             Actual: bind(127.0.0.1, 4444)
  Test case #2 (IPv4/UDP):
          Requested: bind(192.168.1.254, 4040) ..
             Actual: bind(127.0.0.1, 4444)
  Attached bind6 program.
  Test case #3 (IPv6/TCP):
          Requested: bind(face:b00c:1234:5678::abcd, 6060) ..
             Actual: bind(::1, 6666)
  Test case #4 (IPv6/UDP):
          Requested: bind(face:b00c:1234:5678::abcd, 6060) ..
             Actual: bind(::1, 6666)
  ### SUCCESS

Signed-off-by: Andrey Ignatov <rdna@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
-
Andrey Ignatov authored
== The problem ==

There is a use-case when all processes inside a cgroup should use one
single IP address on a host that has multiple IPs configured. Those
processes should use the IP for both ingress and egress, for TCP and UDP
traffic. So TCP/UDP servers should be bound to that IP to accept
incoming connections on it, and TCP/UDP clients should make outgoing
connections from that IP. It should not require changing application
code since that's often not possible.

Currently it's solved by intercepting glibc wrappers around syscalls
such as `bind(2)` and `connect(2)`. It's done by a shared library that
is preloaded for every process in a cgroup so that whenever a TCP/UDP
server calls `bind(2)`, the library replaces the IP in sockaddr before
passing arguments to the syscall. When an application calls `connect(2)`
the library transparently binds the local end of the connection to that
IP (`bind(2)` with `IP_BIND_ADDRESS_NO_PORT` to avoid a performance
penalty).

The shared library approach is fragile though, e.g.:
* some applications clear env vars (incl. `LD_PRELOAD`);
* `/etc/ld.so.preload` doesn't help since some applications are linked
  with option `-z nodefaultlib`;
* other applications don't use glibc and there is nothing to intercept.

== The solution ==

The patch provides a much more reliable in-kernel solution for the 1st
part of the problem: binding TCP/UDP servers on a desired IP. It does
not depend on application environment and implementation details
(whether glibc is used or not).

It adds the new eBPF program type `BPF_PROG_TYPE_CGROUP_SOCK_ADDR` and
attach types `BPF_CGROUP_INET4_BIND` and `BPF_CGROUP_INET6_BIND`
(similar to the already existing `BPF_CGROUP_INET_SOCK_CREATE`).

The new program type is intended to be used with sockets (`struct sock`)
in a cgroup and the user-provided `struct sockaddr`. Pointers to both of
them are parts of the context passed to programs of the newly added
type.

The new attach types provide hooks in the `bind(2)` system call for both
IPv4 and IPv6 so that one can write a program to override the IP
addresses and ports a user program tries to bind to, and apply such a
program to a whole cgroup.

== Implementation notes ==

[1] Separate attach types for `AF_INET` and `AF_INET6` are added
intentionally to prevent reading/writing to offsets that don't make
sense for the corresponding socket family. E.g. if a user passes
`sockaddr_in` it doesn't make sense to read from / write to the
`user_ip6[]` context fields.

[2] The write access to `struct bpf_sock_addr_kern` is implemented using
a special field as an additional "register". There are just two
registers in `sock_addr_convert_ctx_access`: `src` with the value to
write and `dst` with the pointer to the context that can't be changed,
so as not to break later instructions. But the fields allowed to be
written to are not available directly; to access them, the address of
the corresponding pointer has to be loaded first. To get an additional
register, the first one not used by `src` and `dst` is taken, its
content is saved to `bpf_sock_addr_kern.tmp_reg`, then the register is
used to load the address of the pointer field, and finally the
register's content is restored from the temporary field after writing
the `src` value.

Signed-off-by: Andrey Ignatov <rdna@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
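For orientation, the user-visible context looks roughly like the
following. This is a sketch assembled from the description and the
offsets visible in the selftest verifier logs (user_ip4 at +4, user_port
at +24); the exact field set is whatever the uapi header of this series
defines, and the comments are ours:

  struct bpf_sock_addr {
      __u32 user_family;    /* read-only: AF_INET, AF_INET6 */
      __u32 user_ip4;       /* read/write, network byte order */
      __u32 user_ip6[4];    /* read/write, network byte order */
      __u32 user_port;      /* read/write, network byte order */
      __u32 family;         /* read-only: socket family */
      __u32 type;           /* read-only: socket type */
      __u32 protocol;       /* read-only: socket protocol */
  };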
-
Andrey Ignatov authored
Support setting `expected_attach_type` at prog load time in both
`bpf/bpf.h` and `bpf/libbpf.h`.

Since both headers already have APIs to load programs, new functions are
added so as not to break backward compatibility for existing ones:
* `bpf_load_program_xattr()` is added to `bpf/bpf.h`;
* `bpf_prog_load_xattr()` is added to `bpf/libbpf.h`.

Both new functions accept structures, `struct bpf_load_program_attr` and
`struct bpf_prog_load_attr` correspondingly, where new fields can be
added in the future w/o changing the API.

The standard `_xattr` suffix is used to name the new API functions.

Since `bpf_load_program_name()` is not used as heavily as
`bpf_load_program()`, it was removed in favor of the more generic
`bpf_load_program_xattr()`.

Signed-off-by: Andrey Ignatov <rdna@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
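A minimal usage sketch of the lower-level variant, assuming the field
names listed in this patch; the insns/log-buffer plumbing is
illustrative:

  #include "bpf/bpf.h"    /* tools/lib/bpf */

  static int load_bind4_prog(const struct bpf_insn *insns, size_t insns_cnt,
                             char *log_buf, size_t log_sz)
  {
      struct bpf_load_program_attr attr = {};

      attr.prog_type = BPF_PROG_TYPE_CGROUP_SOCK_ADDR;
      attr.expected_attach_type = BPF_CGROUP_INET4_BIND;
      attr.insns = insns;
      attr.insns_cnt = insns_cnt;
      attr.license = "GPL";

      /* returns a prog fd on success, -1 on error */
      return bpf_load_program_xattr(&attr, log_buf, log_sz);
  }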
-
Andrey Ignatov authored
== The problem ==

There are use-cases when a program of some type can be attached to
multiple attach points and those attach points must have different
permissions to access the context or to call helpers.

E.g. a context structure may have fields for both IPv4 and IPv6, but it
doesn't make sense to read from / write to the IPv6 field when the
attach point is somewhere in the IPv4 stack. The same applies to BPF
helpers: it may make sense to call some helper from one attach point,
but not from another, for the same prog type.

== The solution ==

Introduce the `expected_attach_type` field in `struct bpf_attr` for the
`BPF_PROG_LOAD` command. If the scenario described in "The problem"
section is the case for some prog type, the field will be checked twice:

1) At load time the prog type is checked to see if the attach type for
it must be known to validate program permissions correctly. The prog
will be rejected with EINVAL if that's the case and
`expected_attach_type` is not specified or has an invalid value.

2) At attach time `attach_type` is compared with `expected_attach_type`,
if the prog type requires one, and, if they differ, the attach will be
rejected with EINVAL.

The `expected_attach_type` is now available as part of `struct bpf_prog`
in both `bpf_verifier_ops->is_valid_access()` and
`bpf_verifier_ops->get_func_proto()` and can be used to check context
accesses and calls to helpers correspondingly.

Initially the idea was discussed by Alexei Starovoitov <ast@fb.com> and
Daniel Borkmann <daniel@iogearbox.net> here:
https://marc.info/?l=linux-netdev&m=152107378717201&w=2

Signed-off-by: Andrey Ignatov <rdna@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
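A hedged sketch of the load-time side from user space, using the raw
syscall (the helper names are illustrative; the attr fields are the ones
this commit describes):

  #include <linux/bpf.h>
  #include <sys/syscall.h>
  #include <unistd.h>

  static __u64 ptr_to_u64(const void *p)
  {
      return (__u64)(unsigned long)p;
  }

  static int prog_load(const struct bpf_insn *insns, __u32 insn_cnt)
  {
      union bpf_attr attr = {};

      attr.prog_type = BPF_PROG_TYPE_CGROUP_SOCK_ADDR;
      /* validated twice: at BPF_PROG_LOAD and again at attach time */
      attr.expected_attach_type = BPF_CGROUP_INET4_BIND;
      attr.insns = ptr_to_u64(insns);
      attr.insn_cnt = insn_cnt;
      attr.license = ptr_to_u64("GPL");

      return syscall(__NR_bpf, BPF_PROG_LOAD, &attr, sizeof(attr));
  }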
-
- 30 Mar, 2018 3 commits
-
Daniel Borkmann authored
Prashant Bhole says:

====================
These patches fix sg api usage in sockmap. Previously sockmap didn't use
sg_init_table(), which caused hitting BUG_ON in the sg api when
CONFIG_DEBUG_SG is enabled.

v1: added sg_init_table() calls wherever needed.

v2:
- Patch 1 adds a new helper function in the sg api: sg_init_marker()
- Patch 2 uses sg_init_marker() and sg_init_table() in appropriate
  places

Background: While reviewing v1, John Fastabend raised a valid point
about the unnecessary memset in sg_init_table(), because sockmap uses an
sg table which is embedded in a struct. As the enclosing struct is
zeroed out, the memset in sg_init_table() is unnecessary. So Daniel
Borkmann suggested defining another static inline function in
scatterlist.h which only initializes sg_magic, and calling it from
sg_init_table(). From this suggestion I defined a function
sg_init_marker() which sets sg_magic and calls sg_mark_end().
====================

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
-
Prashant Bhole authored
When CONFIG_DEBUG_SG is set, sg->sg_magic is initialized in
sg_init_table() and is verified by the sg api during traversal. We hit a
BUG_ON when the magic check fails.

In the functions bpf_tcp_sendpage and bpf_tcp_sendmsg, the struct
containing the scatterlist is already zeroed out. So, to avoid an extra
memset, we use sg_init_marker() to initialize sg_magic.

Fixed the following:
- In bpf_tcp_sendpage: initialize sg using sg_init_marker
- In bpf_tcp_sendmsg: replace sg_init_table with sg_init_marker
- In bpf_tcp_push: replace memset with sg_init_table where a consumed sg
  entry needs to be re-initialized

Signed-off-by: Prashant Bhole <bhole_prashant_q7@lab.ntt.co.jp>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
-
Prashant Bhole authored
sg_init_marker() initializes sg_magic in the sg table and calls
sg_mark_end() on the last entry of the table. This can be useful to
avoid the memset in sg_init_table() when the scatterlist is already
zeroed out, for example when the scatterlist is embedded inside another
struct and that container struct is zeroed out.

Suggested-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Prashant Bhole <bhole_prashant_q7@lab.ntt.co.jp>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
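From the description, the helper amounts to the following (a sketch of
what the patch adds to scatterlist.h; the SG_MAGIC loop exists only
under CONFIG_DEBUG_SG):

  static inline void sg_init_marker(struct scatterlist *sgl,
                                    unsigned int nents)
  {
  #ifdef CONFIG_DEBUG_SG
      unsigned int i;

      /* the table is already zeroed by the caller; only the debug
       * magic needs setting, sparing sg_init_table()'s full memset */
      for (i = 0; i < nents; i++)
          sgl[i].sg_magic = SG_MAGIC;
  #endif
      sg_mark_end(&sgl[nents - 1]);
  }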
-
- 29 Mar, 2018 20 commits
-
Daniel Borkmann authored
John Fastabend says:

====================
This series adds BPF_F_INGRESS flag support to the redirect APIs,
bringing the sockmap API in line with the cls_bpf redirect APIs. We add
it to both variants of sockmap programs: the first patch adds support
for the tx ulp hooks and the third patch adds support for the recv skb
hooks. Patches two and four add tests for the corresponding ingress
redirect hooks.

Follow-on patches can address busy polling support, but my next series
will move the sockmap sample program into selftests.

v2: added static to function definition caught by kbuild bot
v3: fixed an error branch with missing mem_uncharge in recvmsg op;
    moved receive_queue check outside of RCU region
====================

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
-
John Fastabend authored
Add BPF_SK_SKB_STREAM_VERDICT tests for the ingress hook. While we do
this, also bring the stream tests in line with the MSG-based testing. A
map for skb options is added so that userland can push options to BPF
programs.

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
-
John Fastabend authored
Add support for the BPF_F_INGRESS flag in the skb redirect helper. To do
this we convert the skb into a scatterlist and push it into the ingress
queue. This is the same logic that is used in the sk_msg redirect
helper, so it should feel familiar.

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
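A hedged sketch of how a datapath program might use the flag. The map
definition and helper prototype follow the selftests' conventions of
this era, and the map/program names are illustrative:

  #include <linux/bpf.h>

  #define SEC(name) __attribute__((section(name), used))

  static int (*bpf_sk_redirect_map)(struct __sk_buff *skb, void *map,
                                    __u32 key, __u64 flags) =
      (void *) BPF_FUNC_sk_redirect_map;

  struct bpf_map_def {
      unsigned int type;
      unsigned int key_size;
      unsigned int value_size;
      unsigned int max_entries;
      unsigned int map_flags;
  };

  struct bpf_map_def SEC("maps") sock_map = {
      .type = BPF_MAP_TYPE_SOCKMAP,
      .key_size = sizeof(int),
      .value_size = sizeof(int),
      .max_entries = 2,
  };

  SEC("sk_skb/stream_verdict")
  int prog_verdict(struct __sk_buff *skb)
  {
      /* deliver to the ingress queue of the socket at index 0,
       * instead of its egress path */
      return bpf_sk_redirect_map(skb, &sock_map, 0, BPF_F_INGRESS);
  }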
-
John Fastabend authored
Add a set of tests to verify that the ingress flag in the redirect
helpers works correctly with various msg sizes.

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
-
John Fastabend authored
Add support for the BPF_F_INGRESS flag in the sk_msg redirect helper.

To do this, add a scatterlist ring for receiving socks to check before
calling into the regular recvmsg call path. Additionally, because the
poll wakeup logic only checked the skb recv queue, we need to add a hook
in the TCP stack (similar to the write side) so that we have a way to
wake up polling socks when a scatterlist is redirected to that sock.

After this, all that is needed is for the redirect helper to push the
scatterlist into the psock receive queue.

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
-
Alexei Starovoitov authored
Jakub Kicinski says:

====================
This set adds support for update and delete calls from the datapath, as
well as XADD instructions (32 and 64 bit) and pseudo random numbers. The
XADD support depends on the verifier enforcing alignment which Daniel
recently added. XADD uses NFP's atomic engine which requires values to
be in big endian, therefore we need to keep track of which parts of the
values are used as atomics and byte swap them accordingly. Pseudo random
numbers are generated using NFP's HW pseudo random number generator.

Jiong tackles the initial implementation of the packet cache, which he
describes as follows:

    Memory reads on NFP would first fetch data from memory to
    transfer-in registers, then move them from transfer-in to general
    registers. Given NFP is rich in transfer-in registers, they could
    serve as a memory cache.

    This patch tries to identify a sequence of packet data reads
    (BPF_LDX) that are executed sequentially, then the total access
    range of the sequence is calculated and attached to each read
    instruction. The first instruction in this sequence is marked with
    a cache init flag so that its execution brings in the whole range
    of packet data for the sequence. All later packet reads in this
    sequence fetch data from transfer-in registers directly, with no
    need to JIT NFP memory access.

    Function call, non-packet-data memory read, packet write and memcpy
    will invalidate the cache and start a new cache range. Cache
    invalidation could be improved in the future; for example a packet
    write doesn't need to invalidate the cache if the write destination
    won't be read again.
====================

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Jakub Kicinski authored
When the FW responds with a message of the wrong size or type, make sure
the type is checked first and included in the wrong-size message. This
makes it easier to figure out which FW command failed.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Reviewed-by: Jiong Wang <jiong.wang@netronome.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Jakub Kicinski authored
NFP has a prng register, which we can read to obtain a u32 worth of
pseudo-random data. Generate code for it.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Reviewed-by: Jiong Wang <jiong.wang@netronome.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
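On the program side this corresponds to the existing
bpf_get_prandom_u32() helper; a minimal, hedged sketch of a use the
offload can now JIT (the sampling ratio is illustrative):

  #include <linux/bpf.h>

  static __u32 (*bpf_get_prandom_u32)(void) =
      (void *) BPF_FUNC_get_prandom_u32;

  /* Sample roughly 1 in 16 packets. */
  static inline int should_sample(void)
  {
      return (bpf_get_prandom_u32() & 0xf) == 0;
  }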
-
Jakub Kicinski authored
Allow atomic add to be used even when the value is not guaranteed to fit
into a 16 bit immediate. This requires the value to be pulled as data,
and therefore use of a transfer register and a context swap.

Track the information about possible lengths of the value; if it's
guaranteed to be larger than 16 bits, don't generate the code for the
optimized case at all.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Reviewed-by: Jiong Wang <jiong.wang@netronome.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Jakub Kicinski authored
Allow callers to control the delay slots of commands, instead of giving them just a wait/nowait choice. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com> Reviewed-by: Jiong Wang <jiong.wang@netronome.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Jakub Kicinski authored
Implement atomic add operation for 32 and 64 bit values. Depend on the verifier to ensure alignment. Values have to be kept in big endian and swapped upon read/write. For now only support atomic add of a constant. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com> Reviewed-by: Jiong Wang <jiong.wang@netronome.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
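On the BPF C side, the construct this enables is the compiler-generated
BPF_XADD instruction. A hedged sketch (map/program names are
illustrative; the map-def struct follows the selftests' conventions of
this era):

  #include <linux/bpf.h>

  #define SEC(name) __attribute__((section(name), used))

  static void *(*bpf_map_lookup_elem)(void *map, const void *key) =
      (void *) BPF_FUNC_map_lookup_elem;

  struct bpf_map_def {
      unsigned int type;
      unsigned int key_size;
      unsigned int value_size;
      unsigned int max_entries;
      unsigned int map_flags;
  };

  struct bpf_map_def SEC("maps") counters = {
      .type = BPF_MAP_TYPE_ARRAY,
      .key_size = sizeof(__u32),
      .value_size = sizeof(__u64),
      .max_entries = 1,
  };

  SEC("xdp")
  int count_packet(struct xdp_md *ctx)
  {
      __u32 key = 0;
      __u64 *val = bpf_map_lookup_elem(&counters, &key);

      if (val)
          /* clang emits BPF_XADD | BPF_DW here; with this patch the
           * NFP JIT maps the constant add onto the atomic engine */
          __sync_fetch_and_add(val, 1);
      return XDP_PASS;
  }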
-
Jakub Kicinski authored
BPF_LDST_BYTES() does not put its argument in parentheses when
referencing it. This makes it impossible to pass pointers obtained by
the address-of operator (e.g. BPF_LDST_BYTES(&insn)). Add the
parentheses.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
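The failure mode in miniature (hypothetical macros to show the
precedence problem; the real macro lives in the kernel's BPF headers):

  struct bpf_insn_demo { unsigned char code; };  /* reduced stand-in */

  #define CODE_OF_OLD(insn) (insn->code)   /* argument not parenthesized */
  #define CODE_OF_NEW(insn) ((insn)->code) /* argument parenthesized */

  int example(void)
  {
      struct bpf_insn_demo insn = { .code = 0x61 };

      /* CODE_OF_OLD(&insn) would expand to (&insn->code): '->' binds
       * tighter than unary '&', so it reads &(insn->code) and fails
       * to compile on the non-pointer 'insn'. */
      return CODE_OF_NEW(&insn);    /* ((&insn)->code) -- works */
  }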
-
Jakub Kicinski authored
Support calling the map_delete_elem() FW helper from the datapath
programs. For the JIT, checks and code generation are basically
equivalent to map lookups. Similarly to other map helpers, the key must
be on the stack. Different pointer types are left for future extension.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Reviewed-by: Jiong Wang <jiong.wang@netronome.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Jakub Kicinski authored
Support calling map_update_elem() from the datapath programs by calling
into a FW-provided helper. The value pointer is passed in LM pointer #2.
Keeping track of old state for arg3 is not necessary, since LM pointer
#2 will always be loaded in this case; the trivial optimization for a
value at the bottom of the stack can't be done here.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Reviewed-by: Jiong Wang <jiong.wang@netronome.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Jakub Kicinski authored
Add a verifier helper for performing the basic state checks before a call to a map helper. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com> Reviewed-by: Jiong Wang <jiong.wang@netronome.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Jakub Kicinski authored
Our implementation has restrictions on stack pointers for function
calls. Move the common checks into a helper for reuse. The state has to
be encapsulated into a structure to support parameters other than
BPF_REG_2.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Reviewed-by: Jiong Wang <jiong.wang@netronome.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Jakub Kicinski authored
We will reuse most of the map call code gen for other map calls. Rename
the lookup gen function and use meta->func_id instead of hard-coding
lookup.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Reviewed-by: Jiong Wang <jiong.wang@netronome.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Jiong Wang authored
This patch is the front end of this optimisation: it detects and marks
those packet reads that could be cached. The optimisation "backend" is
then activated automatically.

Signed-off-by: Jiong Wang <jiong.wang@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Jiong Wang authored
This patch adds support for unaligned read offsets, i.e. when the read
offset from the start of the packet cache area is not aligned to
REG_WIDTH. In this case, the read area might span a maximum of three
transfer-in registers.

Signed-off-by: Jiong Wang <jiong.wang@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Jiong Wang authored
This patch assumes there is a packet data cache, and tries to read
packet data from the cache instead of from memory. It only implements
the optimisation "backend"; it doesn't build the packet data cache, so
the optimisation is not yet enabled.

This patch only enables aligned packet data reads, i.e. when the read
offset from the start of the cache is REG_WIDTH aligned.

Signed-off-by: Jiong Wang <jiong.wang@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
- 28 Mar, 2018 7 commits
-
Daniel Borkmann authored
Alexei Starovoitov says:

====================
v7->v8:
- moved 'u32 num_args' from 'struct tracepoint' into 'struct
  bpf_raw_event_map'; that increases memory overhead, but can be
  optimized/compressed later. Now there are zero changes in
  tracepoint.[ch]

v6->v7:
- adopted Steven's bpf_raw_tp_map section approach to find the
  tracepoint and corresponding bpf probe function instead of the
  kallsyms approach. Dropped the kernel_tracepoint_find_by_name() patch

v5->v6:
- avoid changing semantics of the for_each_kernel_tracepoint()
  function; instead introduce a kernel_tracepoint_find_by_name() helper

v4->v5:
- adopted Daniel's fancy REPEAT macro in bpf_trace.c in patch 6

v3->v4:
- adopted Linus's CAST_TO_U64 macro to cast any integer, pointer, or
  small struct to u64. That nicely reduced the size of patch 1

v2->v3:
- with Linus's suggestion introduced generic COUNT_ARGS and CONCATENATE
  macros (or rather moved them from apparmor) that cleaned up patch 6
- added patch 4 to refactor trace_iwlwifi_dev_ucode_error() from 17
  args to 4. Now any tracepoint with >12 args will have a build error

v1->v2:
- simplified the api by combining bpf_raw_tp_open(name) +
  bpf_attach(prog_fd) into bpf_raw_tp_open(name, prog_fd) as suggested
  by Daniel. That simplifies bpf_detach as well, which is now a simple
  close() of the fd
- fixed a memory leak in the error path which was spotted by Daniel
- fixed bpf_get_stackid() and bpf_perf_event_output() called from raw
  tracepoints
- added more tests
- fixed allyesconfig build caught by buildbot

v1:
This patch set is a different way to address the pressing need to
access task_struct pointers in sched tracepoints from bpf programs.

The first approach simply added these pointers to sched tracepoints:
https://lkml.org/lkml/2017/12/14/753
which Peter nacked. A few options were discussed and eventually the
discussion converged on doing bpf-specific
tracepoint_probe_register() probe functions. Details here:
https://lkml.org/lkml/2017/12/20/929

Patch 1 is a kernel-wide cleanup of pass-struct-by-value into
pass-struct-by-reference in tracepoints.
Patches 2 and 3 are minor cleanups to address the allyesconfig build.
Patch 4 refactors trace_iwlwifi_dev_ucode_error from 17 to 4 args.
Patch 5 introduces the COUNT_ARGS macro.
Patch 6 introduces the BPF_RAW_TRACEPOINT api. Auto-cleanup and
multiple concurrent users are must-have features of a tracing api. For
bpf raw tracepoints it looks like:

  // load bpf prog with BPF_PROG_TYPE_RAW_TRACEPOINT type
  prog_fd = bpf_prog_load(...);

  // receive anon_inode fd for given bpf_raw_tracepoint
  // and attach bpf program to it
  raw_tp_fd = bpf_raw_tracepoint_open("xdp_exception", prog_fd);

Ctrl-C of the tracing daemon or cmdline tool will automatically detach
the bpf program, unload it and unregister the tracepoint probe. More
details in patch 6.
Patch 7 - trivial support in libbpf
Patches 8, 9 - user space tests

samples/bpf/test_overhead performance on 1 cpu:

  tracepoint      base   kprobe+bpf  tracepoint+bpf  raw_tracepoint+bpf
  task_rename     1.1M   769K        947K            1.0M
  urandom_read    789K   697K        750K            755K
====================

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
-
Alexei Starovoitov authored
Similar to the traditional tracepoint test, add a bpf_get_stackid() test
from raw tracepoints and reduce the verbosity of the existing stackmap
test.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
-
Alexei Starovoitov authored
Add an empty raw_tracepoint bpf program to test overhead, similar to the
kprobe and traditional tracepoint tests.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
-
Alexei Starovoitov authored
Add a bpf_raw_tracepoint_open(const char *name, int prog_fd) API to
libbpf.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
-
Alexei Starovoitov authored
Introduce the BPF_PROG_TYPE_RAW_TRACEPOINT bpf program type to access
kernel internal arguments of the tracepoints in their raw form.

From a bpf program's point of view, access to the arguments looks like:

  struct bpf_raw_tracepoint_args {
         __u64 args[0];
  };

  int bpf_prog(struct bpf_raw_tracepoint_args *ctx)
  {
    // program can read args[N] where N depends on tracepoint
    // and statically verified at program load+attach time
  }

The kprobe+bpf infrastructure allows programs to access function
arguments. This feature allows programs to access raw tracepoint
arguments.

Similar to the proposed 'dynamic ftrace events', there are no abi
guarantees about what the tracepoint arguments are and what their
meaning is. The program needs to type-cast args properly and use the
bpf_probe_read() helper to access struct fields when an argument is a
pointer.

For every tracepoint, a __bpf_trace_##call function is prepared. In
assembler it looks like:

  (gdb) disassemble __bpf_trace_xdp_exception
  Dump of assembler code for function __bpf_trace_xdp_exception:
     0xffffffff81132080 <+0>:  mov    %ecx,%ecx
     0xffffffff81132082 <+2>:  jmpq   0xffffffff811231f0 <bpf_trace_run3>

where

  TRACE_EVENT(xdp_exception,
          TP_PROTO(const struct net_device *dev,
                   const struct bpf_prog *xdp, u32 act),

The above assembler snippet is casting the 32-bit 'act' field into 'u64'
to pass into bpf_trace_run3(), while the 'dev' and 'xdp' args are passed
as-is. All of the ~500 __bpf_trace_*() functions are only 5-10 bytes
long and in total this approach adds 7k bytes to .text.

This approach gives the lowest possible overhead when calling
trace_xdp_exception() from kernel C code and transitioning into bpf
land. Since tracepoint+bpf are used at speeds of 1M+ events per second,
this is a valuable optimization.

The new BPF_RAW_TRACEPOINT_OPEN sys_bpf command is introduced; it
returns an anon_inode FD of a 'bpf-raw-tracepoint' object.

The user space looks like:

  // load bpf prog with BPF_PROG_TYPE_RAW_TRACEPOINT type
  prog_fd = bpf_prog_load(...);

  // receive anon_inode fd for given bpf_raw_tracepoint with prog attached
  raw_tp_fd = bpf_raw_tracepoint_open("xdp_exception", prog_fd);

Ctrl-C of a tracing daemon or cmdline tool that uses this feature will
automatically detach the bpf program, unload it and unregister the
tracepoint probe.

On the kernel side, the __bpf_raw_tp_map section of pointers to the
tracepoint definition and to the __bpf_trace_*() probe function is used
to find a tracepoint with the "xdp_exception" name and the corresponding
__bpf_trace_xdp_exception() probe function, which are passed to
tracepoint_probe_register() to connect the probe with the tracepoint.

The addition of bpf_raw_tracepoint doesn't interfere with the ftrace and
perf tracepoint mechanisms. perf_event_open() can be used in parallel on
the same tracepoint. Multiple bpf_raw_tracepoint_open("xdp_exception",
prog_fd) calls are permitted, each with its own bpf program. The kernel
will execute all tracepoint probes and all attached bpf programs.

In the future bpf_raw_tracepoints can be extended with
query/introspection logic.

The __bpf_raw_tp_map section logic was contributed by Steven Rostedt.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
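A hedged sketch of a consumer that mirrors the xdp_exception prototype
quoted above. The args struct is redefined locally for self-containment
(a new enough uapi header already provides it), and the section name
follows the samples' loader convention in this series:

  #include <linux/bpf.h>

  #define SEC(name) __attribute__((section(name), used))

  static int (*bpf_trace_printk)(const char *fmt, int fmt_size, ...) =
      (void *) BPF_FUNC_trace_printk;

  struct bpf_raw_tracepoint_args {
      __u64 args[0];
  };

  /* args[0] = dev, args[1] = xdp prog, args[2] = u32 act (extended to
   * u64). Pointer args would need bpf_probe_read() to dereference. */
  SEC("raw_tracepoint/xdp_exception")
  int trace_xdp_exception(struct bpf_raw_tracepoint_args *ctx)
  {
      char fmt[] = "xdp_exception: act=%llu\n";

      bpf_trace_printk(fmt, sizeof(fmt), ctx->args[2]);
      return 0;
  }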
-
Alexei Starovoitov authored
Move the COUNT_ARGS() macro from apparmor to a generic header and extend
it to count up to twelve. COUNT() was an alternative name for this
logic, but it's used for a different purpose in many other places.
Similarly for the CONCATENATE() macro.

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
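The mechanism in miniature (a reduced sketch counting only to three; the
real versions count to twelve):

  #define __COUNT_ARGS(_0, _1, _2, _3, _n, X...) _n
  #define COUNT_ARGS(X...) __COUNT_ARGS(, ##X, 3, 2, 1, 0)

  #define __CONCAT(a, b) a ## b
  #define CONCATENATE(a, b) __CONCAT(a, b)

  /* COUNT_ARGS(a, b) expands to 2; the extra args shift the numbered
   * list so that _n lands on the count. CONCATENATE(foo, 2) -> foo2,
   * which is how an arity-specific function name can be formed. */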
-
Alexei Starovoitov authored
Fix the iwlwifi_dev_ucode_error tracepoint to pass a pointer to a table
instead of all 17 arguments by value. dvm/main.c and mvm/utils.c have
'struct iwl_error_event_table' defined with very similar yet subtly
different fields and offsets. The tracepoint is still common and uses
the definition of 'struct iwl_error_event_table' from dvm/commands.h
while copying fields. Long term this tracepoint probably should be split
into two.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
-