- 19 Mar, 2018 13 commits
-
-
John Fastabend authored
To verify data is not being dropped or corrupted this adds an option to verify test-patterns on recv. Signed-off-by: John Fastabend <john.fastabend@gmail.com> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
-
John Fastabend authored
To exercise TX ULP sendpage implementation we need a test that does a sendfile. Add sendfile test option here. Signed-off-by: John Fastabend <john.fastabend@gmail.com> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
-
John Fastabend authored
Add sockmap option to use SK_MSG program types. Signed-off-by: John Fastabend <john.fastabend@gmail.com> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
-
John Fastabend authored
Test read and writes for BPF_PROG_TYPE_SK_MSG. Signed-off-by: John Fastabend <john.fastabend@gmail.com> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
-
John Fastabend authored
Add map tests to attach BPF_PROG_TYPE_SK_MSG types to a sockmap. Signed-off-by: John Fastabend <john.fastabend@gmail.com> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
-
John Fastabend authored
Currently, if a bpf sk msg program is run the program can only parse data that the (start,end) pointers already consumed. For sendmsg hooks this is likely the first scatterlist element. For sendpage this will be the range (0,0) because the data is shared with userspace and by default we want to avoid allowing userspace to modify data while (or after) BPF verdict is being decided. To support pulling in additional bytes for parsing use a new helper bpf_sk_msg_pull(start, end, flags) which works similar to cls tc logic. This helper will attempt to point the data start pointer at 'start' bytes offest into msg and data end pointer at 'end' bytes offset into message. After basic sanity checks to ensure 'start' <= 'end' and 'end' <= msg_length there are a few cases we need to handle. First the sendmsg hook has already copied the data from userspace and has exclusive access to it. Therefor, it is not necessesary to copy the data. However, it may be required. After finding the scatterlist element with 'start' offset byte in it there are two cases. One the range (start,end) is entirely contained in the sg element and is already linear. All that is needed is to update the data pointers, no allocate/copy is needed. The other case is (start, end) crosses sg element boundaries. In this case we allocate a block of size 'end - start' and copy the data to linearize it. Next sendpage hook has not copied any data in initial state so that data pointers are (0,0). In this case we handle it similar to the above sendmsg case except the allocation/copy must always happen. Then when sending the data we have possibly three memory regions that need to be sent, (0, start - 1), (start, end), and (end + 1, msg_length). This is required to ensure any writes by the BPF program are correctly transmitted. Lastly this operation will invalidate any previous data checks so BPF programs will have to revalidate pointers after making this BPF call. Signed-off-by: John Fastabend <john.fastabend@gmail.com> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
-
John Fastabend authored
In the case where we need a specific number of bytes before a verdict can be assigned, even if the data spans multiple sendmsg or sendfile calls. The BPF program may use msg_cork_bytes(). The extreme case is a user can call sendmsg repeatedly with 1-byte msg segments. Obviously, this is bad for performance but is still valid. If the BPF program needs N bytes to validate a header it can use msg_cork_bytes to specify N bytes and the BPF program will not be called again until N bytes have been accumulated. The infrastructure will attempt to coalesce data if possible so in many cases (most my use cases at least) the data will be in a single scatterlist element with data pointers pointing to start/end of the element. However, this is dependent on available memory so is not guaranteed. So BPF programs must validate data pointer ranges, but this is the case anyways to convince the verifier the accesses are valid. Signed-off-by: John Fastabend <john.fastabend@gmail.com> Acked-by: David S. Miller <davem@davemloft.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
-
John Fastabend authored
A single sendmsg or sendfile system call can contain multiple logical messages that a BPF program may want to read and apply a verdict. But, without an apply_bytes helper any verdict on the data applies to all bytes in the sendmsg/sendfile. Alternatively, a BPF program may only care to read the first N bytes of a msg. If the payload is large say MB or even GB setting up and calling the BPF program repeatedly for all bytes, even though the verdict is already known, creates unnecessary overhead. To allow BPF programs to control how many bytes a given verdict applies to we implement a bpf_msg_apply_bytes() helper. When called from within a BPF program this sets a counter, internal to the BPF infrastructure, that applies the last verdict to the next N bytes. If the N is smaller than the current data being processed from a sendmsg/sendfile call, the first N bytes will be sent and the BPF program will be re-run with start_data pointing to the N+1 byte. If N is larger than the current data being processed the BPF verdict will be applied to multiple sendmsg/sendfile calls until N bytes are consumed. Note1 if a socket closes with apply_bytes counter non-zero this is not a problem because data is not being buffered for N bytes and is sent as its received. Signed-off-by: John Fastabend <john.fastabend@gmail.com> Acked-by: David S. Miller <davem@davemloft.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
-
John Fastabend authored
This implements a BPF ULP layer to allow policy enforcement and monitoring at the socket layer. In order to support this a new program type BPF_PROG_TYPE_SK_MSG is used to run the policy at the sendmsg/sendpage hook. To attach the policy to sockets a sockmap is used with a new program attach type BPF_SK_MSG_VERDICT. Similar to previous sockmap usages when a sock is added to a sockmap, via a map update, if the map contains a BPF_SK_MSG_VERDICT program type attached then the BPF ULP layer is created on the socket and the attached BPF_PROG_TYPE_SK_MSG program is run for every msg in sendmsg case and page/offset in sendpage case. BPF_PROG_TYPE_SK_MSG Semantics/API: BPF_PROG_TYPE_SK_MSG supports only two return codes SK_PASS and SK_DROP. Returning SK_DROP free's the copied data in the sendmsg case and in the sendpage case leaves the data untouched. Both cases return -EACESS to the user. Returning SK_PASS will allow the msg to be sent. In the sendmsg case data is copied into kernel space buffers before running the BPF program. The kernel space buffers are stored in a scatterlist object where each element is a kernel memory buffer. Some effort is made to coalesce data from the sendmsg call here. For example a sendmsg call with many one byte iov entries will likely be pushed into a single entry. The BPF program is run with data pointers (start/end) pointing to the first sg element. In the sendpage case data is not copied. We opt not to copy the data by default here, because the BPF infrastructure does not know what bytes will be needed nor when they will be needed. So copying all bytes may be wasteful. Because of this the initial start/end data pointers are (0,0). Meaning no data can be read or written. This avoids reading data that may be modified by the user. A new helper is added later in this series if reading and writing the data is needed. The helper call will do a copy by default so that the page is exclusively owned by the BPF call. The verdict from the BPF_PROG_TYPE_SK_MSG applies to the entire msg in the sendmsg() case and the entire page/offset in the sendpage case. This avoids ambiguity on how to handle mixed return codes in the sendmsg case. Again a helper is added later in the series if a verdict needs to apply to multiple system calls and/or only a subpart of the currently being processed message. The helper msg_redirect_map() can be used to select the socket to send the data on. This is used similar to existing redirect use cases. This allows policy to redirect msgs. Pseudo code simple example: The basic logic to attach a program to a socket is as follows, // load the programs bpf_prog_load(SOCKMAP_TCP_MSG_PROG, BPF_PROG_TYPE_SK_MSG, &obj, &msg_prog); // lookup the sockmap bpf_map_msg = bpf_object__find_map_by_name(obj, "my_sock_map"); // get fd for sockmap map_fd_msg = bpf_map__fd(bpf_map_msg); // attach program to sockmap bpf_prog_attach(msg_prog, map_fd_msg, BPF_SK_MSG_VERDICT, 0); Adding sockets to the map is done in the normal way, // Add a socket 'fd' to sockmap at location 'i' bpf_map_update_elem(map_fd_msg, &i, fd, BPF_ANY); After the above any socket attached to "my_sock_map", in this case 'fd', will run the BPF msg verdict program (msg_prog) on every sendmsg and sendpage system call. For a complete example see BPF selftests or sockmap samples. Implementation notes: It seemed the simplest, to me at least, to use a refcnt to ensure psock is not lost across the sendmsg copy into the sg, the bpf program running on the data in sg_data, and the final pass to the TCP stack. Some performance testing may show a better method to do this and avoid the refcnt cost, but for now use the simpler method. Another item that will come after basic support is in place is supporting MSG_MORE flag. At the moment we call sendpages even if the MSG_MORE flag is set. An enhancement would be to collect the pages into a larger scatterlist and pass down the stack. Notice that bpf_tcp_sendmsg() could support this with some additional state saved across sendmsg calls. I built the code to support this without having to do refactoring work. Other features TBD include ZEROCOPY and the TCP_RECV_QUEUE/TCP_NO_QUEUE support. This will follow initial series shortly. Future work could improve size limits on the scatterlist rings used here. Currently, we use MAX_SKB_FRAGS simply because this was being used already in the TLS case. Future work could extend the kernel sk APIs to tune this depending on workload. This is a trade-off between memory usage and throughput performance. Signed-off-by: John Fastabend <john.fastabend@gmail.com> Acked-by: David S. Miller <davem@davemloft.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
-
John Fastabend authored
The current implementation of sk_alloc_sg expects scatterlist to always start at entry 0 and complete at entry MAX_SKB_FRAGS. Future patches will want to support starting at arbitrary offset into scatterlist so add an additional sg_start parameters and then default to the current values in TLS code paths. Signed-off-by: John Fastabend <john.fastabend@gmail.com> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
-
John Fastabend authored
When calling do_tcp_sendpages() from in kernel and we know the data has no references from user side we can omit SKBTX_SHARED_FRAG flag. This patch adds an internal flag, NO_SKBTX_SHARED_FRAG that can be used to omit setting SKBTX_SHARED_FRAG. The flag is not exposed to userspace because the sendpage call from the splice logic masks out all bits except MSG_MORE. Signed-off-by: John Fastabend <john.fastabend@gmail.com> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
-
John Fastabend authored
The sockmap refcnt up until now has been wrapped in the sk_callback_lock(). So its not actually needed any locking of its own. The counter itself tracks the lifetime of the psock object. Sockets in a sockmap have a lifetime that is independent of the map they are part of. This is possible because a single socket may be in multiple maps. When this happens we can only release the psock data associated with the socket when the refcnt reaches zero. There are three possible delete sock reference decrement paths first through the normal sockmap process, the user deletes the socket from the map. Second the map is removed and all sockets in the map are removed, delete path is similar to case 1. The third case is an asyncronous socket event such as a closing the socket. The last case handles removing sockets that are no longer available. For completeness, although inc does not pose any problems in this patch series, the inc case only happens when a psock is added to a map. Next we plan to add another socket prog type to handle policy and monitoring on the TX path. When we do this however we will need to keep a reference count open across the sendmsg/sendpage call and holding the sk_callback_lock() here (on every send) seems less than ideal, also it may sleep in cases where we hit memory pressure. Instead of dealing with these issues in some clever way simply make the reference counting a refcnt_t type and do proper atomic ops. Signed-off-by: John Fastabend <john.fastabend@gmail.com> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
-
John Fastabend authored
The TLS ULP module builds scatterlists from a sock using page_frag_refill(). This is going to be useful for other ULPs so move it into sock file for more general use. In the process remove useless goto at end of while loop. Signed-off-by: John Fastabend <john.fastabend@gmail.com> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
-
- 16 Mar, 2018 5 commits
-
-
Daniel Borkmann authored
Jakub Kicinski says: ==================== As promised this series addresses nits and minor issues in tools/bpf build infra. One GCC-7 warning which is nice to get rid of. Dependencies when built with OUTPUT are fixed. make clean will now remove the FEATURE-DUMP.* files. PHONY target is also updated to match reality. ==================== Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
-
Jakub Kicinski authored
bpf tools use feature detection for libbfd dependency, clean up the output files on make clean. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
-
Jakub Kicinski authored
There is no FORCE target in the Makefile and some of the PHONY targets are missing, update the list. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
-
Jakub Kicinski authored
GCC 7 complains: xlated_dumper.c: In function ‘print_callâ€
™ : xlated_dumper.c:179:10: warning: ‘%s’ directive output may be truncated writing up to 255 bytes into a region of size between 249 and 253 [-Wformat-truncation=] "%+d#%s", insn->off, sym->name); Add a bit more space to the buffer so it can handle the entire string and integer without truncation. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> -
Jakub Kicinski authored
Auto-generated dependency files are in the OUTPUT directory, we need to include them from there. This fixes object files not being rebuilt after header changes. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
-
- 15 Mar, 2018 3 commits
-
-
Daniel Borkmann authored
Song Liu says: ==================== This work follows up discussion at Plumbers'17 on improving addr->sym resolution of user stack traces. The following links have more information of the discussion: http://www.linuxplumbersconf.org/2017/ocw/proposals/4764 https://lwn.net/Articles/734453/ Section "Stack traces and kprobes" Currently, bpf stackmap store address for each entry in the call trace. To map these addresses to user space files, it is necessary to maintain the mapping from these virtual address to symbols in the binary. Usually, the user space profiler (such as perf) has to scan /proc/pid/maps at the beginning of profiling, and monitor mmap2() calls afterwards. Given the cost of maintaining the address map, this solution is not practical for system wide profiling that is always on. This patch tries to address this with a variation to stackmap. Instead of storing addresses, the variation stores ELF file build_id + offset. After profiling, a user space tool will look up these functions with build_id (to find the binary or shared library) and the offset. I also updated bcc/cc library for the stackmap (no python/lua support yet). You can find the work at: https://github.com/liu-song-6/bcc/commits/bpf_get_stackid_v02 Changes v5 -> v6: 1. When kernel stack is added to stackmap with build_id, use fallback mechanism to store ip (status == BPF_STACK_BUILD_ID_IP). Changes v4 -> v5: 1. Only allow build_id lookup in non-nmi context. Added comment and commit message to highlight this limitation. 2. Minor fix reported by kbuild test robot. Changes v3 -> v4: 1. Add fallback when build_id lookup failed. In this case, status is set to BPF_STACK_BUILD_ID_IP, and ip of this entry is saved. 2. Handle cases where vma is only part of the file (vma->vm_pgoff != 0). Thanks to Teng for helping me identify this issue! 3. Address feedbacks for previous versions. ==================== Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
-
Song Liu authored
test_stacktrace_build_id() is added. It accesses tracepoint urandom_read with "dd" and "urandom_read" and gathers stack traces. Then it reads the stack traces from the stackmap. urandom_read is a statically link binary that reads from /dev/urandom. test_stacktrace_build_id() calls readelf to read build ID of urandom_read and compares it with build ID from the stackmap. Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
-
Song Liu authored
Currently, bpf stackmap store address for each entry in the call trace. To map these addresses to user space files, it is necessary to maintain the mapping from these virtual address to symbols in the binary. Usually, the user space profiler (such as perf) has to scan /proc/pid/maps at the beginning of profiling, and monitor mmap2() calls afterwards. Given the cost of maintaining the address map, this solution is not practical for system wide profiling that is always on. This patch tries to solve this problem with a variation of stackmap. This variation is enabled by flag BPF_F_STACK_BUILD_ID. Instead of storing addresses, the variation stores ELF file build_id + offset. Build ID is a 20-byte unique identifier for ELF files. The following command shows the Build ID of /bin/bash: [user@]$ readelf -n /bin/bash ... Build ID: XXXXXXXXXX ... With BPF_F_STACK_BUILD_ID, bpf_get_stackid() tries to parse Build ID for each entry in the call trace, and translate it into the following struct: struct bpf_stack_build_id_offset { __s32 status; unsigned char build_id[BPF_BUILD_ID_SIZE]; union { __u64 offset; __u64 ip; }; }; The search of build_id is limited to the first page of the file, and this page should be in page cache. Otherwise, we fallback to store ip for this entry (ip field in struct bpf_stack_build_id_offset). This requires the build_id to be stored in the first page. A quick survey of binary and dynamic library files in a few different systems shows that almost all binary and dynamic library files have build_id in the first page. Build_id is only meaningful for user stack. If a kernel stack is added to a stackmap with BPF_F_STACK_BUILD_ID, it will automatically fallback to only store ip (status == BPF_STACK_BUILD_ID_IP). Similarly, if build_id lookup failed for some reason, it will also fallback to store ip. User space can access struct bpf_stack_build_id_offset with bpf syscall BPF_MAP_LOOKUP_ELEM. It is necessary for user space to maintain mapping from build id to binary files. This mostly static mapping is much easier to maintain than per process address maps. Note: Stackmap with build_id only works in non-nmi context at this time. This is because we need to take mm->mmap_sem for find_vma(). If this changes, we would like to allow build_id lookup in nmi context. Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
-
- 09 Mar, 2018 9 commits
-
-
Quentin Monnet authored
When pinning a file under the BPF virtual file system (traditionally /sys/fs/bpf), using a dot in the name of the location to pin at is not allowed. For example, trying to pin at "/sys/fs/bpf/foo.bar" will be rejected with -EPERM. This check was introduced at the same time as the BPF file system itself, with commit b2197755 ("bpf: add support for persistent maps/progs"). At this time, it was checked in a function called "bpf_dname_reserved()", which made clear that using a dot was reserved for future extensions. This function disappeared and the check was moved elsewhere with commit 0c93b7d8 ("bpf: reject invalid names right in ->lookup()"), and the meaning of the dot ban was lost. The present commit simply adds a comment in the source to explain to the reader that the usage of dots is reserved for future usage. Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
-
Daniel Borkmann authored
Jiri Benc says: ==================== Currently, 'make bpf' in the tools/ directory does not provide the standard quiet output except for bpftool (which is however listed with a wrong directory). Worse, it does not respect the build output directory. The 'make bpf_install' does not work as one would expect, either. It installs unconditionally to /usr/bin without respecting DESTDIR and prefix. This patchset improves that behavior. ==================== Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
-
Jiri Benc authored
Even in quiet mode, make finishes with rm tools/bpf/bpf_exp.lex.c That's because it considers the file to be intermediate. Silence that by mentioning the lex.c file instead of the lex.o file; the dependency still stays. Signed-off-by: Jiri Benc <jbenc@redhat.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
-
Jiri Benc authored
Default to quiet build, with V=1 enabling verbose build as is usual. Signed-off-by: Jiri Benc <jbenc@redhat.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
-
Jiri Benc authored
Use the descend macro to properly propagate $(subdir) to bpftool. Signed-off-by: Jiri Benc <jbenc@redhat.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
-
Jiri Benc authored
Make the 'install' target depend on the 'all' target to build the binaries first. Signed-off-by: Jiri Benc <jbenc@redhat.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
-
Jiri Benc authored
Currently, make bpf_install in tools/ does not respect DESTDIR. Moreover, it installs to /usr/bin/ unconditionally. Let it respect DESTDIR and allow prefix to be specified. Also, to be more consistent with bpftool and with the usual customs, default the prefix to /usr/local instead of /usr. Signed-off-by: Jiri Benc <jbenc@redhat.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
-
Jiri Benc authored
Currently, the programs under tools/bpf (with the notable exception of bpftool) do not respect the output directory (make O=dir). Fix that. Signed-off-by: Jiri Benc <jbenc@redhat.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
-
Jiri Benc authored
When building bpf tool, gcc emits piles of warnings: prog.c: In function ‘prog_fd_by_tag’: prog.c:101:9: warning: missing initializer for field ‘type’ of ‘struct bpf_prog_info’ [-Wmissing-field-initializers] struct bpf_prog_info info = {}; ^ In file included from /home/storage/jbenc/git/net-next/tools/lib/bpf/bpf.h:26:0, from prog.c:47: /home/storage/jbenc/git/net-next/tools/include/uapi/linux/bpf.h:925:8: note: ‘type’ declared here __u32 type; ^ As these warnings are not useful, switch them off. Signed-off-by: Jiri Benc <jbenc@redhat.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
-
- 08 Mar, 2018 3 commits
-
-
Daniel Borkmann authored
Teng Qin says: ==================== These patches add support that allows bpf programs attached to perf events to read the address values recorded with the perf events. These values are requested by specifying sample_type with PERF_SAMPLE_ADDR when calling perf_event_open(). The main motivation for these changes is to support building memory or lock access profiling and tracing tools. For example on Intel CPUs, the recorded address values for supported memory or lock access perf events would be the access or lock target addresses from PEBS buffer. Such information would be very valuable for building tools that help understand memory access or lock acquire pattern. ==================== Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
-
Teng Qin authored
This commit adds additional test in the trace_event example, by attaching the bpf program to MEM_UOPS_RETIRED.LOCK_LOADS event with PERF_SAMPLE_ADDR requested, and print the lock address value read from the bpf program to trace_pipe. Signed-off-by: Teng Qin <qinteng@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
-
Teng Qin authored
This commit adds new field "addr" to bpf_perf_event_data which could be read and used by bpf programs attached to perf events. The value of the field is copied from bpf_perf_event_data_kern.addr and contains the address value recorded by specifying sample_type with PERF_SAMPLE_ADDR when calling perf_event_open. Signed-off-by: Teng Qin <qinteng@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
-
- 07 Mar, 2018 7 commits
-
-
Eric Dumazet authored
Kirill found that recently added synchronize_rcu() call in ip6mr_sk_done() was slowing down netns dismantle and posted a patch to use it only if the socket was found. I instead suggested to get rid of this call, and use instead SOCK_RCU_FREE We might later change IPv4 side to use the same technique and unify both stacks. IPv4 does not use synchronize_rcu() but has a call_rcu() that could be replaced by SOCK_RCU_FREE. Tested: time for i in {1..1000}; do unshare -n /bin/false;done Before : real 7m18.911s After : real 10.187s Fixes: 8571ab47 ("ip6mr: Make mroute_sk rcu-based") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Kirill Tkhai <ktkhai@virtuozzo.com> Cc: Yuval Mintz <yuvalm@mellanox.com> Reviewed-by: Kirill Tkhai <ktkhai@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
David S. Miller authored
Sowmini Varadhan says: ==================== RDS: zerocopy code enhancements A couple of enhancements to the rds zerocop code - patch 1 refactors rds_message_copy_from_user to pull the zcopy logic into its own function - patch 2 drops the usage sk_buff to track MSG_ZEROCOPY cookies and uses a simple linked list (enhancement suggested by willemb during code review) ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
-
Sowmini Varadhan authored
Commit 401910db ("rds: deliver zerocopy completion notification with data") removes support fo r zerocopy completion notification on the sk_error_queue, thus we no longer need to track the cookie information in sk_buff structures. This commit removes the struct sk_buff_head rs_zcookie_queue by a simpler list that results in a smaller memory footprint as well as more efficient memory_allocation time. Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Acked-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Sowmini Varadhan authored
Move the large block of code predicated on zcopy from rds_message_copy_from_user into a new function, rds_message_zcopy_from_user() Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Acked-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Gustavo A. R. Silva authored
Remove VLA usage and change the 'len' argument to a u8 and use a 256 byte buffer on the stack. Notice that these lengths are limited by the encoding field in the VPD structure, which is a u8 [1]. [1] https://marc.info/?l=linux-netdev&m=152044354814024&w=2Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Jesus Sanchez-Palencia authored
Fix the SO_ZEROCOPY switch case on sock_setsockopt() avoiding the ret values to be overwritten by the one set on the default case. Fixes: 28190752 ("sock: permit SO_ZEROCOPY on PF_RDS socket") Signed-off-by: Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com> Acked-by: Willem de Bruijn <willemb@google.com> Acked-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
David S. Miller authored
Maxime Chevallier says: ==================== net: mvpp2: Add Unicast filtering capabilities This series adds unicast filtering support to the Marvell PPv2 controller. This is implemented using the header parser cababilities of the PPv2, which allows for generic packet filtering based on matching patterns in the packet headers. PPv2 controller only has 256 of these entries, and we need to share them with other features, such as VLAN filtering. For each interface, we have 5 entries dedicated to unicast filtering (the controller's own address, and 4 other), and 21 to multicast filtering. When this number is reached, the controller switches to unicast or multicast promiscuous mode. The first patch reworks the function that adds and removes addresses to the filter. This is preparatory work to ease UC filter implementation. The second patch adds the UC filtering feature. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
-