- 22 Mar, 2018 5 commits
-
-
Christian Brauner authored
This patch adds a receive method to NETLINK_KOBJECT_UEVENT netlink sockets to allow sending uevent messages into the network namespace the socket belongs to. Currently non-initial network namespaces are already isolated and don't receive uevents. There are a number of cases where it is beneficial for a sufficiently privileged userspace process to send a uevent into a network namespace. One such use case would be debugging and fuzzing of a piece of software which listens and reacts to uevents. By running a copy of that software inside a network namespace, specific uevents could then be presented to it. More concretely, this would allow for easy testing of udevd/ueventd. This will also allow some piece of software to run components inside a separate network namespace and then effectively filter what that software can receive. Some examples of software that do directly listen to uevents and that we have in the past attempted to run inside a network namespace are rbd (CEPH client) or the X server. Implementation: The implementation has been kept as simple as possible from the kernel's perspective. Specifically, a simple input method uevent_net_rcv() is added to NETLINK_KOBJECT_UEVENT sockets which completely reuses existing af_netlink infrastructure and does neither add an additional netlink family nor requires any user-visible changes. For example, by using netlink_rcv_skb() we can make use of existing netlink infrastructure to report back informative error messages to userspace. Furthermore, this implementation does not introduce any overhead for existing uevent generating codepaths. The struct netns got a new uevent socket member that records the uevent socket associated with that network namespace including its position in the uevent socket list. Since we record the uevent socket for each network namespace in struct net we don't have to walk the whole uevent socket list. Instead we can directly retrieve the relevant uevent socket and send the message. At exit time we can now also trivially remove the uevent socket from the uevent socket list. This keeps the codepath very performant without introducing needless overhead and even makes older codepaths faster. Uevent sequence numbers are kept global. When a uevent message is sent to another network namespace the implementation will simply increment the global uevent sequence number and append it to the received uevent. This has the advantage that the kernel will never need to parse the received uevent message to replace any existing uevent sequence numbers. Instead it is up to the userspace process to remove any existing uevent sequence numbers in case the uevent message to be sent contains any. Security: In order for a caller to send uevent messages to a target network namespace the caller must have CAP_SYS_ADMIN in the owning user namespace of the target network namespace. Additionally, any received uevent message is verified to not exceed size UEVENT_BUFFER_SIZE. This includes the space needed to append the uevent sequence number. Testing: This patch has been tested and verified to work with the following udev implementations: 1. CentOS 6 with udevd version 147 2. Debian Sid with systemd-udevd version 237 3. Android 7.1.1 with ueventd Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Christian Brauner authored
This commit adds struct uevent_sock to struct net. Since struct uevent_sock records the position of the uevent socket in the uevent socket list we can trivially remove it from the uevent socket list during cleanup. This speeds up the old removal codepath. Note, list_del() will hit __list_del_entry_valid() in its call chain which will validate that the element is a member of the list. If it isn't it will take care that the list is not modified. Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Kirill Tkhai authored
These pernet_operations register and unregister sysctl. Also, there is inet_frags_exit_net() called in exit method, which has to be safe after a5600024 "net: Fix hlist corruptions in inet_evict_bucket()". Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Kirill Tkhai authored
These pernet_operations register and unregister sysctl. Also, there is inet_frags_exit_net() called in exit method, which has to be safe after a5600024 "net: Fix hlist corruptions in inet_evict_bucket()". Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Kirill Tkhai authored
These pernet_operations create and destroy /proc entries and cancel per-net timer. Also, there are unneed iterations over empty list of net devices, since all net devices must be already moved to init_net or unregistered by default_device_ops. This already was mentioned here: https://marc.info/?l=linux-can&m=150169589119335&w=2 So, it looks safe to make them async. Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
- 21 Mar, 2018 1 commit
-
-
git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextDavid S. Miller authored
Daniel Borkmann says: ==================== pull-request: bpf-next 2018-03-21 The following pull-request contains BPF updates for your *net-next* tree. The main changes are: 1) Add a BPF hook for sendmsg and sendfile by reusing the ULP infrastructure and sockmap. Three helpers are added along with this, bpf_msg_apply_bytes(), bpf_msg_cork_bytes(), and bpf_msg_pull_data(). The first is used to tell for how many bytes the verdict should be applied to, the second to tell that x bytes need to be queued first to retrigger the BPF program for a verdict, and the third helper is mainly for the sendfile case to pull in data for making it private for reading and/or writing, from John. 2) Improve address to symbol resolution of user stack traces in BPF stackmap. Currently, the latter stores the address for each entry in the call trace, however to map these addresses to user space files, it is necessary to maintain the mapping from these virtual addresses to symbols in the binary which is not practical for system-wide profiling. Instead, this option for the stackmap rather stores the ELF build id and offset for the call trace entries, from Song. 3) Add support that allows BPF programs attached to perf events to read the address values recorded with the perf events. They are requested through PERF_SAMPLE_ADDR via perf_event_open(). Main motivation behind it is to support building memory or lock access profiling and tracing tools with the help of BPF, from Teng. 4) Several improvements to the tools/bpf/ Makefiles. The 'make bpf' in the tools directory does not provide the standard quiet output except for bpftool and it also does not respect specifying a build output directory. 'make bpf_install' command neither respects specified destination nor prefix, all from Jiri. In addition, Jakub fixes several other minor issues in the Makefiles on top of that, e.g. fixing dependency paths, phony targets and more. 5) Various doc updates e.g. add a comment for BPF fs about reserved names to make the dentry lookup from there a bit more obvious, and a comment to the bpf_devel_QA file in order to explain the diff between native and bpf target clang usage with regards to pointer size, from Quentin and Daniel. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
-
- 20 Mar, 2018 11 commits
-
-
Daniel Borkmann authored
As this recently came up on netdev [0], lets add it to the BPF devel doc. [0] https://www.spinics.net/lists/netdev/msg489612.htmlSigned-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
David S. Miller authored
Uwe Kleine-König says: ==================== net: dsa: mv88e6xxx: some fixes these patches target net-next and got approved by Andrew Lunn. Compared to (implicit) v1, I dropped the patch that I didn't know if it was right because of missing documentation on my side. But Andrew already cared for that in a patch that is now adfccf11 in net-next. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
-
Uwe Kleine-König authored
This changes the respective line in /proc/interrupts from 49: x x mv88e6xxx-g1 7 Edge mv88e6xxx-g1 to 49: x x mv88e6xxx-g1 7 Edge mv88e6xxx-g2 which makes more sense. Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Uwe Kleine-König authored
Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Uwe Kleine-König authored
The switch name is emitted in the kernel log, so having the right name there is nice. Fixes: 1558727a ("net: dsa: mv88e6xxx: Add support for ethernet switch 88E6141") Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de> Signed-off-by: David S. Miller <davem@davemloft.net>
-
David S. Miller authored
Ido Schimmel says: ==================== mlxsw: Adapt driver to upcoming firmware versions The first two patches make sure that reserved fields are set to zero, as required by the device's programmer's reference manual (PRM). Last two patches prevent the driver from performing an invalid operation that is going to be denied by upcoming firmware versions. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
-
Ido Schimmel authored
When a new ACL group is created its region (ACL) list is initially empty. Thus, the call to mlxsw_sp_acl_tcam_group_update() would basically invalidate an already invalid (non-existent) group. Remove the unnecessary call and make the function symmetric to its del() counterpart. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Ido Schimmel authored
The driver currently creates empty ACL groups, binds them to the requested port and then fills them with actual ACLs that point to TCAM regions. However, empty ACL groups are considered invalid and upcoming firmware versions are going to forbid their binding. Work around this limitation by only performing the binding after the first ACL was added to the group. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Tal Bar authored
There is no need to set some of the fields within 'mbox_config_profile', since they are reserved and capability mask should be set to zero. Signed-off-by: Tal Bar <talb@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Shalom Toledo authored
Some of the opcodes don't use in, out or both mboxes. In such cases, the mbox address is a reserved field and FW expects it to be zero. Signed-off-by: Shalom Toledo <shalomt@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Matthew Wilcox authored
The mlx5 driver calls ida_pre_get() in a loop for no readily apparent reason. The driver uses ida_simple_get() which will call ida_pre_get() by itself and there's no need to use ida_pre_get() unless using ida_get_new(). Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
- 19 Mar, 2018 23 commits
-
-
Daniel Borkmann authored
John Fastabend says: ==================== This series adds a BPF hook for sendmsg and senfile by using the ULP infrastructure and sockmap. A simple pseudocode example would be, // load the programs bpf_prog_load(SOCKMAP_TCP_MSG_PROG, BPF_PROG_TYPE_SK_MSG, &obj, &msg_prog); // lookup the sockmap bpf_map_msg = bpf_object__find_map_by_name(obj, "my_sock_map"); // get fd for sockmap map_fd_msg = bpf_map__fd(bpf_map_msg); // attach program to sockmap bpf_prog_attach(msg_prog, map_fd_msg, BPF_SK_MSG_VERDICT, 0); // Add a socket 'fd' to sockmap at location 'i' bpf_map_update_elem(map_fd_msg, &i, fd, BPF_ANY); After the above snippet any socket attached to the map would run msg_prog on sendmsg and sendfile system calls. Three additional helpers are added bpf_msg_apply_bytes(), bpf_msg_cork_bytes(), and bpf_msg_pull_data(). With bpf_msg_apply_bytes BPF programs can tell the infrastructure how many bytes the given verdict should apply to. This has two cases. First, a BPF program applies verdict to fewer bytes than in the current sendmsg/sendfile msg this will apply the verdict to the first N bytes of the message then run the BPF program again with data pointers recalculated to the N+1 byte. The second case is the BPF program applies a verdict to more bytes than the current sendmsg or sendfile system call. In this case the infrastructure will cache the verdict and apply it to future sendmsg/sendfile calls until the byte limit is reached. This avoids the overhead of running BPF programs on large payloads. The helper bpf_msg_cork_bytes() handles a different case where a BPF program can not reach a verdict on a msg until it receives more bytes AND the program doesn't want to forward the packet until it is known to be "good". The example case being a user (albeit a dumb one probably) sends a N byte header in 1B system calls. The BPF program can call bpf_msg_cork_bytes with the required byte limit to reach a verdict and then the program will only be called again once N bytes are received. The last helper added in this series is bpf_msg_pull_data(). It is used to pull data in for modification or reading. Similar to how sk_pull_data() works msg_pull_data can be used to access data not in the initial (data_start, data_end) range. For sendpage() calls this is needed if any data is accessed because the BPF sendpage hook initializes the data_start and data_end pointers to zero. We do this because sendpage data is shared with the user and can be modified during or after the BPF verdict possibly invalidating any verdict the BPF program decides. For sendmsg the data is already copied by the sendmsg bpf infrastructure so we only copy the data if the user request a data range that is not already linearized. This happens if the user requests larger blocks of data that are not in a single scatterlist element. The common case seems to be accessing headers which normally are in the first scatterlist element and already linearized. For more examples please review the sample program. There are examples for all the actions and helpers there. Patches 1-8 implement the above sockmap/BPF infrastructure. The remaining patches flush out some minimal selftests and the sample sockmap program. The sockmap sample program is the main vehicle for testing this infrastructure and will be moved into selftests shortly. The final patch in this series is a simple shell script to run a set of tests. These are the tests I run after any changes to sockmap. The next task on the list after this series is to push those into selftests so we can avoid manually testing. Couple notes on future items in the pipeline, 0. move sample sockmap programs into selftests (noted above) 1. add additional support for tcp flags, most are ignored now. 2. add a Documentation/bpf/sockmap file with these details 3. support stacked ULP types to allow this and ktls to cooperate 4. Ingress flag support, redirect only supports egress here. The other redirect helpers support ingress and egress flags. 5. add optimizations, I cut a few optimizations here in the first iteration of the code for later study/implementation -v3 updates : u32 data pointers in msg_md changed to void * : page_address NULL check and flag verification in msg_pull_data : remove old note in commit msg that is no longer relevant : remove enum sk_msg_action its not used anywhere : fixup test_verifier W -> DW insn to account for data pointers : unintentionally dropped a smap_stop_tx() call in sockmap.c I propagated the ACKs forward because above changes were small one/two line changes. -v2 updates (discussion): Dave noticed that sendpage call was previously (in v1) running on the data directly. This allowed users to potentially modify the data after or during the BPF program. However doing a copy automatically even if the data is not accessed has measurable performance impact. So we added another helper modeled after the existing skb_pull_data() helper to allow users to selectively pull data from the msg. This is also useful in the sendmsg case when users need to access data outside the first scatterlist element or across scatterlist boundaries. While doing this I also unified the sendmsg and sendfile handlers a bit. Originally the sendfile call was optimized for never touching the data. I've decided for a first submission to drop this optimization and we can add it back later. It introduced unnecessary complexity, at least for a first posting, for a use case I have not entirely flushed out yet. When the use case is deployed we can add it back if needed. Then we can review concrete performance deltas as well on real-world use-cases/applications. Lastly, I reorganized the patches a bit. Now all sockmap changes are in a single patch and each helper gets its own patch. This, at least IMO, makes it easier to review because sockmap changes are not spread across the patch series. On the other hand now apply_bytes, cork_bytes logic is only activated later in the series. But that should be OK. ==================== Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
-
John Fastabend authored
This adds the test script I am currently using to validate the latest sockmap changes. Shortly sockmap will be ported to selftests and these will be run from the infrastructure there. Until then add the script here so we have a coverage checklist when porting into selftests. Signed-off-by: John Fastabend <john.fastabend@gmail.com> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
-
John Fastabend authored
This adds an option to test the msg_pull_data helper. This uses two options txmsg_start and txmsg_end to let the user specify start and end bytes to pull. The options can be used with txmsg_apply, txmsg_cork options as well as with any of the basic tests, txmsg, txmsg_redir and txmsg_drop (plus noisy variants) to run pull_data inline with those tests. By giving user direct control over the variables we can easily do negative testing as well as positive tests. Signed-off-by: John Fastabend <john.fastabend@gmail.com> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
-
John Fastabend authored
Add tests for SK_DROP. Signed-off-by: John Fastabend <john.fastabend@gmail.com> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
-
John Fastabend authored
Add sample application support for the bpf_msg_cork_bytes helper. This lets the user specify how many bytes each verdict should apply to. Similar to apply_bytes() tests these can be run as a stand-alone test when used without other options or inline with other tests by using the txmsg_cork option along with any of the basic tests txmsg, txmsg_redir, txmsg_drop. Signed-off-by: John Fastabend <john.fastabend@gmail.com> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
-
John Fastabend authored
This adds an option to test the apply_bytes helper. This option lets the user specify an int on the command line specifying how much data each verdict should apply to. When this is set a map entry is set with the bytes input by the user and then the specified program --txmsg or --txmsg_redir will use the value and set the applied data. If no other option is set then a default --txmsg_apply program is run. This program will drop pkts if an error is detected on the bytes map lookup. Useful to verify the map lookup and apply helper are working and causing a hard error if it is not. Signed-off-by: John Fastabend <john.fastabend@gmail.com> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
-
John Fastabend authored
To verify data is not being dropped or corrupted this adds an option to verify test-patterns on recv. Signed-off-by: John Fastabend <john.fastabend@gmail.com> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
-
John Fastabend authored
To exercise TX ULP sendpage implementation we need a test that does a sendfile. Add sendfile test option here. Signed-off-by: John Fastabend <john.fastabend@gmail.com> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
-
John Fastabend authored
Add sockmap option to use SK_MSG program types. Signed-off-by: John Fastabend <john.fastabend@gmail.com> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
-
John Fastabend authored
Test read and writes for BPF_PROG_TYPE_SK_MSG. Signed-off-by: John Fastabend <john.fastabend@gmail.com> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
-
John Fastabend authored
Add map tests to attach BPF_PROG_TYPE_SK_MSG types to a sockmap. Signed-off-by: John Fastabend <john.fastabend@gmail.com> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
-
John Fastabend authored
Currently, if a bpf sk msg program is run the program can only parse data that the (start,end) pointers already consumed. For sendmsg hooks this is likely the first scatterlist element. For sendpage this will be the range (0,0) because the data is shared with userspace and by default we want to avoid allowing userspace to modify data while (or after) BPF verdict is being decided. To support pulling in additional bytes for parsing use a new helper bpf_sk_msg_pull(start, end, flags) which works similar to cls tc logic. This helper will attempt to point the data start pointer at 'start' bytes offest into msg and data end pointer at 'end' bytes offset into message. After basic sanity checks to ensure 'start' <= 'end' and 'end' <= msg_length there are a few cases we need to handle. First the sendmsg hook has already copied the data from userspace and has exclusive access to it. Therefor, it is not necessesary to copy the data. However, it may be required. After finding the scatterlist element with 'start' offset byte in it there are two cases. One the range (start,end) is entirely contained in the sg element and is already linear. All that is needed is to update the data pointers, no allocate/copy is needed. The other case is (start, end) crosses sg element boundaries. In this case we allocate a block of size 'end - start' and copy the data to linearize it. Next sendpage hook has not copied any data in initial state so that data pointers are (0,0). In this case we handle it similar to the above sendmsg case except the allocation/copy must always happen. Then when sending the data we have possibly three memory regions that need to be sent, (0, start - 1), (start, end), and (end + 1, msg_length). This is required to ensure any writes by the BPF program are correctly transmitted. Lastly this operation will invalidate any previous data checks so BPF programs will have to revalidate pointers after making this BPF call. Signed-off-by: John Fastabend <john.fastabend@gmail.com> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
-
John Fastabend authored
In the case where we need a specific number of bytes before a verdict can be assigned, even if the data spans multiple sendmsg or sendfile calls. The BPF program may use msg_cork_bytes(). The extreme case is a user can call sendmsg repeatedly with 1-byte msg segments. Obviously, this is bad for performance but is still valid. If the BPF program needs N bytes to validate a header it can use msg_cork_bytes to specify N bytes and the BPF program will not be called again until N bytes have been accumulated. The infrastructure will attempt to coalesce data if possible so in many cases (most my use cases at least) the data will be in a single scatterlist element with data pointers pointing to start/end of the element. However, this is dependent on available memory so is not guaranteed. So BPF programs must validate data pointer ranges, but this is the case anyways to convince the verifier the accesses are valid. Signed-off-by: John Fastabend <john.fastabend@gmail.com> Acked-by: David S. Miller <davem@davemloft.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
-
John Fastabend authored
A single sendmsg or sendfile system call can contain multiple logical messages that a BPF program may want to read and apply a verdict. But, without an apply_bytes helper any verdict on the data applies to all bytes in the sendmsg/sendfile. Alternatively, a BPF program may only care to read the first N bytes of a msg. If the payload is large say MB or even GB setting up and calling the BPF program repeatedly for all bytes, even though the verdict is already known, creates unnecessary overhead. To allow BPF programs to control how many bytes a given verdict applies to we implement a bpf_msg_apply_bytes() helper. When called from within a BPF program this sets a counter, internal to the BPF infrastructure, that applies the last verdict to the next N bytes. If the N is smaller than the current data being processed from a sendmsg/sendfile call, the first N bytes will be sent and the BPF program will be re-run with start_data pointing to the N+1 byte. If N is larger than the current data being processed the BPF verdict will be applied to multiple sendmsg/sendfile calls until N bytes are consumed. Note1 if a socket closes with apply_bytes counter non-zero this is not a problem because data is not being buffered for N bytes and is sent as its received. Signed-off-by: John Fastabend <john.fastabend@gmail.com> Acked-by: David S. Miller <davem@davemloft.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
-
John Fastabend authored
This implements a BPF ULP layer to allow policy enforcement and monitoring at the socket layer. In order to support this a new program type BPF_PROG_TYPE_SK_MSG is used to run the policy at the sendmsg/sendpage hook. To attach the policy to sockets a sockmap is used with a new program attach type BPF_SK_MSG_VERDICT. Similar to previous sockmap usages when a sock is added to a sockmap, via a map update, if the map contains a BPF_SK_MSG_VERDICT program type attached then the BPF ULP layer is created on the socket and the attached BPF_PROG_TYPE_SK_MSG program is run for every msg in sendmsg case and page/offset in sendpage case. BPF_PROG_TYPE_SK_MSG Semantics/API: BPF_PROG_TYPE_SK_MSG supports only two return codes SK_PASS and SK_DROP. Returning SK_DROP free's the copied data in the sendmsg case and in the sendpage case leaves the data untouched. Both cases return -EACESS to the user. Returning SK_PASS will allow the msg to be sent. In the sendmsg case data is copied into kernel space buffers before running the BPF program. The kernel space buffers are stored in a scatterlist object where each element is a kernel memory buffer. Some effort is made to coalesce data from the sendmsg call here. For example a sendmsg call with many one byte iov entries will likely be pushed into a single entry. The BPF program is run with data pointers (start/end) pointing to the first sg element. In the sendpage case data is not copied. We opt not to copy the data by default here, because the BPF infrastructure does not know what bytes will be needed nor when they will be needed. So copying all bytes may be wasteful. Because of this the initial start/end data pointers are (0,0). Meaning no data can be read or written. This avoids reading data that may be modified by the user. A new helper is added later in this series if reading and writing the data is needed. The helper call will do a copy by default so that the page is exclusively owned by the BPF call. The verdict from the BPF_PROG_TYPE_SK_MSG applies to the entire msg in the sendmsg() case and the entire page/offset in the sendpage case. This avoids ambiguity on how to handle mixed return codes in the sendmsg case. Again a helper is added later in the series if a verdict needs to apply to multiple system calls and/or only a subpart of the currently being processed message. The helper msg_redirect_map() can be used to select the socket to send the data on. This is used similar to existing redirect use cases. This allows policy to redirect msgs. Pseudo code simple example: The basic logic to attach a program to a socket is as follows, // load the programs bpf_prog_load(SOCKMAP_TCP_MSG_PROG, BPF_PROG_TYPE_SK_MSG, &obj, &msg_prog); // lookup the sockmap bpf_map_msg = bpf_object__find_map_by_name(obj, "my_sock_map"); // get fd for sockmap map_fd_msg = bpf_map__fd(bpf_map_msg); // attach program to sockmap bpf_prog_attach(msg_prog, map_fd_msg, BPF_SK_MSG_VERDICT, 0); Adding sockets to the map is done in the normal way, // Add a socket 'fd' to sockmap at location 'i' bpf_map_update_elem(map_fd_msg, &i, fd, BPF_ANY); After the above any socket attached to "my_sock_map", in this case 'fd', will run the BPF msg verdict program (msg_prog) on every sendmsg and sendpage system call. For a complete example see BPF selftests or sockmap samples. Implementation notes: It seemed the simplest, to me at least, to use a refcnt to ensure psock is not lost across the sendmsg copy into the sg, the bpf program running on the data in sg_data, and the final pass to the TCP stack. Some performance testing may show a better method to do this and avoid the refcnt cost, but for now use the simpler method. Another item that will come after basic support is in place is supporting MSG_MORE flag. At the moment we call sendpages even if the MSG_MORE flag is set. An enhancement would be to collect the pages into a larger scatterlist and pass down the stack. Notice that bpf_tcp_sendmsg() could support this with some additional state saved across sendmsg calls. I built the code to support this without having to do refactoring work. Other features TBD include ZEROCOPY and the TCP_RECV_QUEUE/TCP_NO_QUEUE support. This will follow initial series shortly. Future work could improve size limits on the scatterlist rings used here. Currently, we use MAX_SKB_FRAGS simply because this was being used already in the TLS case. Future work could extend the kernel sk APIs to tune this depending on workload. This is a trade-off between memory usage and throughput performance. Signed-off-by: John Fastabend <john.fastabend@gmail.com> Acked-by: David S. Miller <davem@davemloft.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
-
John Fastabend authored
The current implementation of sk_alloc_sg expects scatterlist to always start at entry 0 and complete at entry MAX_SKB_FRAGS. Future patches will want to support starting at arbitrary offset into scatterlist so add an additional sg_start parameters and then default to the current values in TLS code paths. Signed-off-by: John Fastabend <john.fastabend@gmail.com> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
-
John Fastabend authored
When calling do_tcp_sendpages() from in kernel and we know the data has no references from user side we can omit SKBTX_SHARED_FRAG flag. This patch adds an internal flag, NO_SKBTX_SHARED_FRAG that can be used to omit setting SKBTX_SHARED_FRAG. The flag is not exposed to userspace because the sendpage call from the splice logic masks out all bits except MSG_MORE. Signed-off-by: John Fastabend <john.fastabend@gmail.com> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
-
John Fastabend authored
The sockmap refcnt up until now has been wrapped in the sk_callback_lock(). So its not actually needed any locking of its own. The counter itself tracks the lifetime of the psock object. Sockets in a sockmap have a lifetime that is independent of the map they are part of. This is possible because a single socket may be in multiple maps. When this happens we can only release the psock data associated with the socket when the refcnt reaches zero. There are three possible delete sock reference decrement paths first through the normal sockmap process, the user deletes the socket from the map. Second the map is removed and all sockets in the map are removed, delete path is similar to case 1. The third case is an asyncronous socket event such as a closing the socket. The last case handles removing sockets that are no longer available. For completeness, although inc does not pose any problems in this patch series, the inc case only happens when a psock is added to a map. Next we plan to add another socket prog type to handle policy and monitoring on the TX path. When we do this however we will need to keep a reference count open across the sendmsg/sendpage call and holding the sk_callback_lock() here (on every send) seems less than ideal, also it may sleep in cases where we hit memory pressure. Instead of dealing with these issues in some clever way simply make the reference counting a refcnt_t type and do proper atomic ops. Signed-off-by: John Fastabend <john.fastabend@gmail.com> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
-
John Fastabend authored
The TLS ULP module builds scatterlists from a sock using page_frag_refill(). This is going to be useful for other ULPs so move it into sock file for more general use. In the process remove useless goto at end of while loop. Signed-off-by: John Fastabend <john.fastabend@gmail.com> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
-
git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queueDavid S. Miller authored
Jeff Kirsher says: ==================== 40GbE Intel Wired LAN Driver Updates 2018-03-19 This series contains updates to i40e and i40evf only. Alex fixes a potential deadlock in the configure_clsflower function in i40evf, where we exit with the "IN_CRITICAL_TASK" bit set while notifying the PF of flower filters. Jan fixed an issue where it was possible to set a mode that is not allowed which resulted in link being down, so fixed the parity between i40e_set_link_ksettings() and i40e_get_link_ksettings(). Patryk fixes a bug where a backplane device was allowing the setting of link settings, which is not allowed. Shiraz fixes a crash when entering S3 because the client interface was freeing the MSIx vectors while they are still in use. Jake fixes up a function header comment to document a newly added parameter. Also cleaned up flags that were never used. Doug fixes the incorrect return type for i40e_aq_add_cloud_filters(). ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
-
Paweł Jabłoński authored
This fixes the polling mechanism of GLGEN_RSTAT.DEVSTATE in the PF Reset path when Global Reset is in progress. While the driver is polling for the end of the PF Reset and the Global Reset is triggered, abandon the PF Reset path and prepare for the upcoming Global Reset. Signed-off-by: Paweł Jabłoński <pawel.jablonski@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
-
Jacob Keller authored
These flags were defined, but there is no use within the driver code, so we don't need to keep them. Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
-
Patryk Małek authored
Setting link settings on backplane devices shouldn't be allowed. This patch adds one more device id to the list which we check that against. Signed-off-by: Patryk Małek <patryk.malek@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
-