- 02 Apr, 2021 3 commits
-
-
Shannon Nelson authored
In preparation for hardware timestamping, we need to support large Tx and Rx completion descriptors. Here we add the new queue feature ids and handling for the completion descriptor sizes. We are only adding support for the 2x-sized Rx completion descriptors in the general Rx queues for now, since we will be using them for PTP Rx support, and we don't have an immediate use for the large descriptors in the general Tx queues yet; they will be used in a special Tx queue added in one of the next few patches. Signed-off-by: Allen Hubbe <allenbh@pensando.io> Signed-off-by: Shannon Nelson <snelson@pensando.io> Signed-off-by: David S. Miller <davem@davemloft.net>
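As a rough illustration of what the completion-descriptor-size handling amounts to (the EXAMPLE_* identifiers below are invented for the sketch, not the ionic driver's actual ones), a queue pair can record a 2x-completion-descriptor flag and derive its completion ring stride from it:

/* Minimal sketch; names are illustrative, not the driver's. */
#define EXAMPLE_QCQ_F_CQ_2X	0x08	/* hypothetical "double-sized CQ descriptors" flag */

struct example_cq {
	unsigned int flags;
	size_t desc_size;	/* stride used to walk the completion ring */
};

static void example_cq_init(struct example_cq *cq, unsigned int flags,
			    size_t base_desc_size)
{
	cq->flags = flags;
	/* A 2x completion descriptor simply doubles the stride. */
	cq->desc_size = (flags & EXAMPLE_QCQ_F_CQ_2X) ?
			2 * base_desc_size : base_desc_size;
}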
-
Shannon Nelson authored
Add queue feature extensions to prepare for features that can be queue specific, in addition to the general queue features already defined. While we're here, change the existing feature ids from #defines to enum. Signed-off-by: Allen Hubbe <allenbh@pensando.io> Signed-off-by: Shannon Nelson <snelson@pensando.io> Signed-off-by: David S. Miller <davem@davemloft.net>
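The shape of the #define-to-enum change, with made-up names (a sketch of the idea, not the driver's actual identifiers):

/* Before: loose preprocessor defines. */
#define EXAMPLE_Q_F_SG		0x01
#define EXAMPLE_Q_F_CSUM	0x02

/* After: one enum, easier to extend with queue-specific features. */
enum example_q_feature {
	EXAMPLE_QF_SG = 0,
	EXAMPLE_QF_CSUM,
	EXAMPLE_QF_RX_CMPL_2X,	/* room for new queue-specific entries */
	EXAMPLE_QF_MAX,
};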
-
git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
David S. Miller authored
Alexei Starovoitov says: ==================== pull-request: bpf-next 2021-04-01 The following pull-request contains BPF updates for your *net-next* tree. We've added 68 non-merge commits during the last 7 day(s) which contain a total of 70 files changed, 2944 insertions(+), 1139 deletions(-). The main changes are: 1) UDP support for sockmap, from Cong. 2) Verifier merge conflict resolution fix, from Daniel. 3) xsk selftests enhancements, from Maciej. 4) Unstable helpers aka kernel func calling, from Martin. 5) Batched ops for LPM map, from Pedro. 6) Fix race in bpf_get_local_storage, from Yonghong. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
-
- 01 Apr, 2021 30 commits
-
-
Phillip Potter authored
Use memset to initialize a local array in drivers/net/usb/ax88179_178a.c, and also set a local u16 and a u32 variable to 0. Fixes a KMSAN-detected uninit-value bug reported by syzbot at: https://syzkaller.appspot.com/bug?id=00371c73c72f72487c1d0bfe0cc9d00de339d5aa Reported-by: syzbot+4993e4a0e237f1b53747@syzkaller.appspotmail.com Signed-off-by: Phillip Potter <phil@philpotter.co.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
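The pattern being applied is roughly the following (a simplified sketch; example_dev and example_read_cmd are hypothetical stand-ins, not the ax88179_178a code):

static int example_read_mac(struct example_dev *dev, u8 mac[6])
{
	u8 buf[6];
	int ret;

	memset(buf, 0, sizeof(buf));	/* no uninitialized bytes even on short or failed reads */
	ret = example_read_cmd(dev, buf, sizeof(buf));	/* hypothetical read helper */
	if (ret < 0)
		return ret;

	memcpy(mac, buf, sizeof(buf));
	return 0;
}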
-
Florian Fainelli authored
All Gigabit PHYs use the same register layout as far as fetching statistics goes. Fast Ethernet PHYs do not all support statistics, and the BCM54616S would require some switching between the copper and fiber modes to fetch the appropriate statistics, which is not supported yet. Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Otto Hollmann authored
If there is overlap between ip_local_port_range and ip_local_reserved_ports with a huge reserved block, it will affect the probability of selecting ephemeral ports; see net/ipv4/inet_hashtables.c:723 int __inet_hash_connect( ... for (i = 0; i < remaining; i += 2, port += 2) { if (unlikely(port >= high)) port -= remaining; if (inet_is_local_reserved_port(net, port)) continue; E.g. if there is a reserved block of 10000 ports, the two ports right after this block will be about 5000 times more likely to be selected than others. If this was intended, we can/should add a note to the documentation as proposed in this commit; otherwise we should think about a different solution. One option could be a mapping table of continuous port ranges. A second option could be letting the user modify the step (port += 2) in the above loop, e.g. via a new sysctl parameter. Signed-off-by: Otto Hollmann <otto.hollmann@suse.com> Signed-off-by: David S. Miller <davem@davemloft.net>
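The loop quoted above, reflowed for readability with comments on where the bias comes from (a trimmed excerpt of __inet_hash_connect, not the complete function):

	for (i = 0; i < remaining; i += 2, port += 2) {
		if (unlikely(port >= high))
			port -= remaining;
		if (inet_is_local_reserved_port(net, port))
			continue;	/* reserved: skip and keep walking forward */
		/* ... otherwise try to use this port ... */
	}
	/*
	 * The starting port is pseudo-random. If it lands anywhere inside a
	 * reserved block of 10000 ports, the walk exits the block at the first
	 * non-reserved port of the same parity, so that single port absorbs
	 * roughly 5000 starting positions instead of one, which is the ~5000x
	 * bias described above.
	 */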
-
Yang Yingliang authored
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Lu Wei authored
Fix some typos. Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Lu Wei <luwei32@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Wan Jiabing authored
struct smc_clc_msg_local is declared twice. One declaration is at line 301; the one below is not needed. Remove the duplicate. Signed-off-by: Wan Jiabing <wanjiabing@vivo.com> Acked-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Wan Jiabing authored
struct ctl_table_header is declared twice. One declaration is at line 46; the one below is not needed. Remove the duplicate. Signed-off-by: Wan Jiabing <wanjiabing@vivo.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Wong Vee Khee authored
The commit d2a029bd ("stmmac: pci: add MSI support for Intel Quark X1000") introduced a pci_enable_msi() call in stmmac_pci.c. With commit 58da0cfa ("net: stmmac: create dwmac-intel.c to contain all Intel platform"), the Intel Quark platform related code has been moved to the newly created driver. Remove this now-unnecessary pci_enable_msi() call, as there are no other devices that use stmmac-pci and need MSI to be enabled. Signed-off-by: Wong Vee Khee <vee.khee.wong@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Wong Vee Khee authored
Update dwmac-intel to use the managed function pcim_enable_device(). This allows the devres framework to call the resource-release functions for us. Signed-off-by: Wong Vee Khee <vee.khee.wong@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
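A minimal sketch of the managed pattern (the probe function and driver names are invented; pcim_enable_device() is the devres-managed counterpart of pci_enable_device()):

#include <linux/pci.h>

static int example_probe(struct pci_dev *pdev, const struct pci_device_id *id)
{
	int ret;

	ret = pcim_enable_device(pdev);	/* managed: devres disables it on detach or probe failure */
	if (ret)
		return ret;

	/* ... further setup; no explicit pci_disable_device() needed in error paths ... */
	return 0;
}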
-
Xu Jia authored
The logic in rt6_age_examine_exception() is confusing. This commit refactors the code to make it clearer. Signed-off-by: Xu Jia <xujia39@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Hoang Le authored
When enabling a bearer by name, we don't sanity-check its name against bearers in higher slots of the bearer list. This may have the effect that the name of an already enabled bearer bypasses the check. To fix the above issue, we just perform an extra check against all existing bearers. Fixes: cb30a633 ("tipc: refactor function tipc_enable_bearer()") Cc: stable@vger.kernel.org Acked-by: Jon Maloy <jmaloy@redhat.com> Signed-off-by: Hoang Le <hoang.h.le@dektech.com.au> Signed-off-by: David S. Miller <davem@davemloft.net>
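A hedged sketch of the extra check (the helper below is illustrative, not the actual tipc code): scan every slot of the bearer list for a name clash instead of stopping at the first free slot:

static bool example_bearer_name_in_use(struct tipc_bearer __rcu **bearer_list,
				       int max_bearers, const char *name)
{
	int i;

	for (i = 0; i < max_bearers; i++) {
		struct tipc_bearer *b = rtnl_dereference(bearer_list[i]);

		if (b && !strcmp(b->name, name))
			return true;	/* name already taken in some slot */
	}
	return false;
}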
-
git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue
David S. Miller authored
Tony Nguyen says: ==================== 100GbE Intel Wired LAN Driver Updates 2021-03-31 This series contains updates to ice driver only. Benita adds support for XPS. Ani moves netdev registration to the end of probe to prevent use before the interface is ready and moves up an error check to possibly avoid an unneeded call. He also consolidates the VSI state and flag fields to a single field. Dan changes the segment where package information is pulled. Paul S ensures correct ITR values are set when increasing ring size. Paul G rewords a link misconfiguration message as this could be expected. Bruce removes setting an unnecessary AQ flag and corrects a memory allocation call. Also fixes checkpatch issues for 'COMPLEX_MACRO'. Qi aligns PTYPE bitmap naming by adding 'ptype' prefix to the bitmaps missing it. Brett removes limiting Rx queue mapping to RSS size as there is not a dependency on this. He also refactors RSS configuration by introducing individual functions for LUT and key configuration and by passing a structure containing pertinent information instead of individual arguments. Tony corrects a comment block to follow netdev style. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
-
Alexei Starovoitov authored
Cong Wang says: ==================== From: Cong Wang <cong.wang@bytedance.com> We have thousands of services connected to a daemon on every host via AF_UNIX dgram sockets; after they are moved into a VM, we have to add a proxy to forward these communications from the VM to the host, because rewriting thousands of them is not practical. This proxy uses an AF_UNIX socket connected to the services and a UDP socket to connect to the host. It is inefficient because data is copied between kernel space and user space twice, and we can not use splice(), which only supports TCP. Therefore, we want to use sockmap to do the splicing without going to user-space at all (after the initial setup). Currently sockmap only fully supports TCP; UDP is partially supported, as it is only allowed to be added into sockmap. This patchset, as the second part of the original large patchset, extends sockmap with: 1) cross-protocol support with BPF_SK_SKB_VERDICT; 2) full UDP support. At a high level, ->read_sock() is required for each protocol to support sockmap redirection, and in order to do the sock proto update, a new ops ->psock_update_sk_prot() is introduced, which is also required. And the BPF ->recvmsg() is also needed to replace the original ->recvmsg() to retrieve skmsg. To make life easier, we have to get rid of lock_sock() in sk_psock_handle_skb(), otherwise we would have to implement ->sendmsg_locked() on top of ->sendmsg(), which is ugly. Please see each patch for more details. To see the big picture, the original patchset is available here: https://github.com/congwang/linux/tree/sockmap and this patchset is also available here: https://github.com/congwang/linux/tree/sockmap2
---
v8: get rid of 'offset' in udp_read_sock(); add checks for skb_verdict/stream_verdict conflict; add two cleanup patches for sock_map_link(); add a new test case
v7: use work_mutex to protect psock->work; return err in udp_read_sock(); add patch 6/13; clean up test case
v6: get rid of sk_psock_zap_ingress(); add rcu work patch
v5: use INDIRECT_CALL_2() for function pointers; use ingress_lock to fix a race condition found by Jakub; rename two helper functions
v4: get rid of lock_sock() in sk_psock_handle_skb(); get rid of udp_sendmsg_locked(); remove an empty line; update cover letter
v3: export tcp/udp_update_proto(); rename sk->sk_prot->psock_update_sk_prot(); improve changelogs
v2: separate from the original large patchset; rebase to the latest bpf-next; split UDP test case; move inet_csk_has_ulp() check to tcp_bpf.c; clean up udp_read_sock()
==================== Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Cong Wang authored
This adds a test case to ensure BPF_SK_SKB_VERDICT and BPF_SK_STREAM_VERDICT will never be attached at the same time. Signed-off-by: Cong Wang <cong.wang@bytedance.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20210331023237.41094-17-xiyou.wangcong@gmail.com
-
Cong Wang authored
Add a test case to ensure redirection between two UDP sockets work. Signed-off-by: Cong Wang <cong.wang@bytedance.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20210331023237.41094-16-xiyou.wangcong@gmail.com
-
Cong Wang authored
Now that UDP supports sockmap and redirection, we can safely update the sock type checks for it accordingly. Signed-off-by: Cong Wang <cong.wang@bytedance.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/bpf/20210331023237.41094-15-xiyou.wangcong@gmail.com
-
Cong Wang authored
We have to implement udp_bpf_recvmsg() and use it to replace the original ->recvmsg(), so that skmsg can be retrieved from ingress_msg. Signed-off-by: Cong Wang <cong.wang@bytedance.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/bpf/20210331023237.41094-14-xiyou.wangcong@gmail.com
-
Cong Wang authored
Although these two functions are only used by TCP, they are not specific to TCP at all; both operate on skmsg and ingress_msg, so they fit in net/core/skmsg.c very well. And we will need them for non-TCP, so rename and move them to skmsg.c and export them to modules. Signed-off-by: Cong Wang <cong.wang@bytedance.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20210331023237.41094-13-xiyou.wangcong@gmail.com
-
Cong Wang authored
This is similar to tcp_read_sock(), except we do not need to worry about connections, we just need to retrieve skb from UDP receive queue. Note, the return value of ->read_sock() is unused in sk_psock_verdict_data_ready(), and UDP still does not support splice() due to lack of ->splice_read(), so users can not reach udp_read_sock() directly. Signed-off-by: Cong Wang <cong.wang@bytedance.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/bpf/20210331023237.41094-12-xiyou.wangcong@gmail.com
-
Cong Wang authored
Currently sockmap calls into each protocol to update the struct proto and replace it. This certainly won't work when the protocol is implemented as a module, for example, AF_UNIX. Introduce a new ops sk->sk_prot->psock_update_sk_prot(), so each protocol can implement its own way to replace the struct proto. This also helps get rid of symbol dependencies on CONFIG_INET. Signed-off-by: Cong Wang <cong.wang@bytedance.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20210331023237.41094-11-xiyou.wangcong@gmail.com
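Conceptually, the new hook looks something like the sketch below (the exact upstream signature and surrounding code may differ; this only shows the indirection that removes the hard-coded protocol knowledge):

struct example_proto {
	/* ... existing protocol ops ... */
	int (*psock_update_sk_prot)(struct sock *sk, bool restore);	/* signature here is an assumption */
};

static int example_sockmap_update_proto(struct sock *sk,
					const struct example_proto *prot,
					bool restore)
{
	if (!prot->psock_update_sk_prot)
		return -EOPNOTSUPP;	/* protocol has no sockmap support */
	return prot->psock_update_sk_prot(sk, restore);	/* TCP, UDP, ... each supply their own */
}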
-
Cong Wang authored
Reusing BPF_SK_SKB_STREAM_VERDICT is possible but its name is confusing and more importantly we still want to distinguish them from user-space. So we can just reuse the stream verdict code but introduce a new type of eBPF program, skb_verdict. Users are not allowed to attach stream_verdict and skb_verdict programs to the same map. Signed-off-by: Cong Wang <cong.wang@bytedance.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/bpf/20210331023237.41094-10-xiyou.wangcong@gmail.com
-
Cong Wang authored
Now we can fold sock_map_link_no_progs() into sock_map_link() and get rid of sock_map_link_no_progs(). Signed-off-by: Cong Wang <cong.wang@bytedance.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20210331023237.41094-9-xiyou.wangcong@gmail.com
-
Cong Wang authored
sock_map_link() passes down map progs, but it is confusing to see both map progs and psock progs. Make the map progs more obvious by retrieving it directly with sock_map_progs() inside sock_map_link(). Now it is aligned with sock_map_link_no_progs() too. Signed-off-by: Cong Wang <cong.wang@bytedance.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/bpf/20210331023237.41094-8-xiyou.wangcong@gmail.com
-
Cong Wang authored
This function is only called in process context. Signed-off-by: Cong Wang <cong.wang@bytedance.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/bpf/20210331023237.41094-7-xiyou.wangcong@gmail.com
-
Cong Wang authored
The RCU callback sk_psock_destroy() only queues work psock->gc, so we can just switch to rcu work to simplify the code. Signed-off-by: Cong Wang <cong.wang@bytedance.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/bpf/20210331023237.41094-6-xiyou.wangcong@gmail.com
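The simplification follows the standard rcu_work pattern, roughly as sketched below (struct and function names are illustrative, not the actual skmsg code):

#include <linux/slab.h>
#include <linux/workqueue.h>

struct example_psock {
	struct rcu_work rwork;
	/* ... */
};

static void example_psock_destroy(struct work_struct *work)
{
	struct example_psock *psock =
		container_of(to_rcu_work(work), struct example_psock, rwork);

	kfree(psock);	/* runs after an RCU grace period has elapsed */
}

static void example_psock_drop(struct example_psock *psock)
{
	/* replaces a call_rcu() callback whose only job was to schedule_work() */
	INIT_RCU_WORK(&psock->rwork, example_psock_destroy);
	queue_rcu_work(system_wq, &psock->rwork);
}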
-
Cong Wang authored
We do not have to lock the sock to avoid losing sk_socket; instead we can purge all the ingress queues when we close the socket. Sending or receiving packets after orphaning the socket makes no sense. We do purge these queues when psock refcnt reaches zero, but here we want to purge them explicitly in sock_map_close(). There are also some nasty race conditions between testing the SK_PSOCK_TX_ENABLED bit and queuing/canceling the psock work; we can expand psock->ingress_lock a bit to protect them too. As noticed by John, we still have to lock psock->work, because the same work item could be running concurrently on different CPUs. Signed-off-by: Cong Wang <cong.wang@bytedance.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/bpf/20210331023237.41094-5-xiyou.wangcong@gmail.com
-
Cong Wang authored
We only have skb_send_sock_locked(), which requires callers to use lock_sock(). Introduce a variant skb_send_sock() which locks the sock on its own, so callers do not need to lock it any more. This will save us from adding a ->sendmsg_locked for each protocol. To reuse the code, pass function pointers to __skb_send_sock() and build skb_send_sock() and skb_send_sock_locked() on top. Signed-off-by: Cong Wang <cong.wang@bytedance.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com> Link: https://lore.kernel.org/bpf/20210331023237.41094-4-xiyou.wangcong@gmail.com
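The refactoring pattern is roughly the one sketched below (illustrative names, not the actual net/core/skbuff.c code): one internal worker takes a sendmsg-style callback, and the locked/unlocked entry points just pass different callbacks:

typedef int (*example_sendmsg_t)(struct sock *sk, struct msghdr *msg);

static int example_sendmsg_locked(struct sock *sk, struct msghdr *msg)
{
	/* caller already holds the socket lock; send directly */
	return 0;
}

static int example_sendmsg_unlocked(struct sock *sk, struct msghdr *msg)
{
	/* takes and releases the socket lock around the actual send */
	return 0;
}

static int __example_send_sock(struct sock *sk, struct sk_buff *skb,
			       int offset, int len, example_sendmsg_t sendmsg)
{
	/* ... walk the skb, build a msghdr and call sendmsg(sk, &msg) ... */
	return 0;
}

int example_send_sock_locked(struct sock *sk, struct sk_buff *skb,
			     int offset, int len)
{
	return __example_send_sock(sk, skb, offset, len, example_sendmsg_locked);
}

int example_send_sock(struct sock *sk, struct sk_buff *skb, int offset, int len)
{
	return __example_send_sock(sk, skb, offset, len, example_sendmsg_unlocked);
}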
-
Cong Wang authored
Currently we rely on lock_sock to protect ingress_msg; it is too heavyweight for this, and we can actually just use a spinlock to protect this list, like we protect other skb queues. __tcp_bpf_recvmsg() is still special because of peeking; it still has to use lock_sock. Signed-off-by: Cong Wang <cong.wang@bytedance.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Jakub Sitnicki <jakub@cloudflare.com> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/bpf/20210331023237.41094-3-xiyou.wangcong@gmail.com
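A minimal sketch of the locking change (simplified names, not the actual skmsg code): the ingress message list gets its own spinlock so queue/dequeue no longer need the socket lock:

#include <linux/list.h>
#include <linux/spinlock.h>

struct example_psock {
	struct list_head ingress_msg;
	spinlock_t	 ingress_lock;
};

static void example_queue_msg(struct example_psock *psock, struct list_head *msg)
{
	spin_lock_bh(&psock->ingress_lock);
	list_add_tail(msg, &psock->ingress_msg);
	spin_unlock_bh(&psock->ingress_lock);
}

static struct list_head *example_dequeue_msg(struct example_psock *psock)
{
	struct list_head *msg = NULL;

	spin_lock_bh(&psock->ingress_lock);
	if (!list_empty(&psock->ingress_msg)) {
		msg = psock->ingress_msg.next;
		list_del_init(msg);
	}
	spin_unlock_bh(&psock->ingress_lock);
	return msg;
}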
-
Cong Wang authored
Currently we purge the ingress_skb queue only when psock refcnt goes down to 0, so locking the queue is not necessary, but in order to be called during ->close, we have to lock it here. Signed-off-by: Cong Wang <cong.wang@bytedance.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Jakub Sitnicki <jakub@cloudflare.com> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/bpf/20210331023237.41094-2-xiyou.wangcong@gmail.com
-
Carlos Llamas authored
SO_TXTIME hardware offload requires testing across devices, either between machines or between separate network namespaces. Split the SO_TXTIME test into tx and rx modes, so traffic can be sent from one process to another. Create a veth pair across different namespaces and bind each process to an endpoint via the [-S]ource and [-D]estination parameters. An optional start [-t]ime parameter can be passed to synchronize the test across the hosts (with synchronized clocks). Signed-off-by: Carlos Llamas <cmllamas@google.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
- 31 Mar, 2021 7 commits
-
-
Frank Wunderlich authored
mt7623 uses offload version 2 too. Tested on Bananapi-R2. Signed-off-by: Frank Wunderlich <frank-w@public-files.de> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Voon Weifeng authored
Turn on the MEEAO field of MTL_ECC_Control_Register by default. With the MTL ECC Error Address Status Override (MEEAO) set by default, the following error address fields will hold the last valid address where the error is detected. Signed-off-by: Voon Weifeng <weifeng.voon@intel.com> Signed-off-by: Tan Tee Min <tee.min.tan@intel.com> Co-developed-by: Wong Vee Khee <vee.khee.wong@linux.intel.com> Signed-off-by: Wong Vee Khee <vee.khee.wong@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
David S. Miller authored
Vladimir Oltean says: ==================== XDP for NXP ENETC This series adds support to the enetc driver for the basic XDP primitives. The ENETC is a network controller found inside the NXP LS1028A SoC, which is a dual-core Cortex A72 device for industrial networking, with the CPUs clocked at up to 1.3 GHz. On this platform, there are 4 ENETC ports and a 6-port embedded DSA switch, in a topology that looks like this: [flattened ASCII topology diagram: the ENETC PCI Integrated Endpoint exposes ENETC ports 0-3; port 3 connects internally at 1 Gbps (typically disabled) to Felix switch port 5 and port 2 connects internally at 2.5 Gbps to Felix switch port 4, while ENETC ports 0-1 (up to 2.5 Gbps each) and Felix ports 0-3 (up to 4x 1 Gbps) face externally] The ENETC ports 2 and 3 can act as DSA masters for the embedded switch. Because 4 out of the 6 externally-facing ports of the SoC are switch ports, the most interesting use case for XDP on this device is in fact XDP_TX on the 2.5Gbps DSA master. Nonetheless, the results presented below are for IPv4 forwarding between ENETC port 0 (eno0) and port 1 (eno1), both configured for 1Gbps. There are two streams of IPv4/UDP datagrams with a frame length of 64 octets delivered at 100% port load to eno0 and to eno1. eno0 has a flow steering rule to process the traffic on RX ring 0 (CPU 0), and eno1 has a flow steering rule towards RX ring 1 (CPU 1). For the IPFWD test, standard IP routing was enabled in the netns. For the XDP_DROP test, the samples/bpf/xdp1 program was attached to both eno0 and eno1. For the XDP_TX test, the samples/bpf/xdp2 program was attached to both eno0 and eno1. For the XDP_REDIRECT test, the samples/bpf/xdp_redirect program was attached once to the input of eno0/output of eno1, and twice to the input of eno1/output of eno0. Finally, the preliminary results are as follows:

     | IPFWD | XDP_TX | XDP_REDIRECT | XDP_DROP
-----+-------+--------+--------------+---------
fps  |   761 |   2535 |         1735 |     2783
Gbps |  0.51 |   1.71 |         1.17 |      n/a

There is a strange phenomenon in my testing system where it appears that one CPU is processing more than the other. I have not investigated this too much. Also, the code might not be very well optimized (for example dma_sync_for_device is called with the full ENETC_RXB_DMA_SIZE_XDP). Design-wise, the ENETC is a PCI device with BD rings, so it uses the MEM_TYPE_PAGE_SHARED memory model, as can typically be seen in Intel devices. The strategy was to build upon the existing model that the driver uses, and not change it too much. So you will see things like a separate NAPI poll function for XDP. I have only tested with PAGE_SIZE=4096, and since we split pages in half, it means that MTU-sized frames are scatter/gather (the XDP headroom + skb_shared_info only leaves us 1476 bytes of data per buffer). This is sub-optimal, but I would rather keep it this way and help speed up Lorenzo's series for S/G support through testing, rather than change the enetc driver to use some other memory model like page_pool. My code is already structured for S/G, and that works fine for XDP_DROP and XDP_TX, just not for XDP_REDIRECT, even between two enetc ports. So the S/G XDP_REDIRECT is stubbed out (the frames are dropped), but obviously I would like to remove that limitation soon. Please note that I am rather new to this kind of stuff, I am more of a control path person, so I would appreciate feedback. Enough talking, on to the patches. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
-
Vladimir Oltean authored
The driver implementation of the XDP_REDIRECT action reuses parts from XDP_TX, most notably the enetc_xdp_tx function which transmits an array of TX software BDs. Only this time, the buffers don't have DMA mappings, so we need to create them. When a BPF program reaches the XDP_REDIRECT verdict for a frame, we can employ the same buffer reuse strategy as for the normal processing path and for XDP_PASS: we can flip to the other page half and seed that to the RX ring. Note that scatter/gather support is there, but disabled due to lack of multi-buffer support in XDP (which is added by this series): https://patchwork.kernel.org/project/netdevbpf/cover/cover.1616179034.git.lorenzo@kernel.org/ Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Vladimir Oltean authored
As explained in the XDP_TX patch, when receiving a burst of frames with the XDP_TX verdict, there is a momentary dip in the number of available RX buffers. The system will eventually recover as TX completions start kicking in and refilling our RX BD ring again. But until that happens, we need to survive with as few out-of-buffer discards as possible. This increases the memory footprint of the driver in order to avoid discards at 2.5 Gbps line rate with 64B packet sizes, the maximum speed available for testing on one port of the NXP LS1028A. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Vladimir Oltean authored
For reflecting packets back into the interface they came from, we create an array of TX software BDs derived from the RX software BDs. Therefore, we need to extend the TX software BD structure to contain most of the stuff that's already present in the RX software BD structure, for reasons that will become evident in a moment. For a frame with the XDP_TX verdict, we don't reuse any buffer right away as we do for XDP_DROP (the same page half) or XDP_PASS (the other page half, same as the skb code path). Because the buffer transfers ownership from the RX ring to the TX ring, reusing any page half right away is very dangerous. So what we can do is we can recycle the same page half as soon as TX is complete. The code path is: enetc_poll -> enetc_clean_rx_ring_xdp -> enetc_xdp_tx -> enetc_refill_rx_ring (time passes, another MSI interrupt is raised) enetc_poll -> enetc_clean_tx_ring -> enetc_recycle_xdp_tx_buff But that creates a problem, because there is a potentially large time window between enetc_xdp_tx and enetc_recycle_xdp_tx_buff, a period in which we'll have fewer and fewer RX buffers. Basically, when the ship starts sinking, the knee-jerk reaction is to let enetc_refill_rx_ring do what it does for the standard skb code path (refill every 16 consumed buffers), but that turns out to be very inefficient. The problem is that we have no rx_swbd->page at our disposal from the enetc_reuse_page path, so enetc_refill_rx_ring would have to call enetc_new_page for every buffer that we refill (if we choose to refill at this early stage). Very inefficient, it only makes the problem worse, because page allocation is an expensive process, and CPU time is exactly what we're lacking. Additionally, there is an even bigger problem: if we let enetc_refill_rx_ring top up the ring's buffers again from the RX path, remember that the buffers sent to transmission haven't disappeared anywhere. They will be eventually sent, and processed in enetc_clean_tx_ring, and an attempt will be made to recycle them. But surprise, the RX ring is already full of new buffers, because we were premature in deciding that we should refill. So not only did we take the expensive decision of allocating new pages, but now we must throw away perfectly good and reusable buffers. So what we do is we implement an elastic refill mechanism, which keeps track of the number of in-flight XDP_TX buffer descriptors. We top up the RX ring only up to the total ring capacity minus the number of BDs that are in flight (because we know that those BDs will return to us eventually). The enetc driver manages 1 RX ring per CPU, and the default TX ring management is the same. So we do XDP_TX towards the TX ring of the same index, because it is affined to the same CPU. This will probably not produce great results when we have a tc-taprio/tc-mqprio qdisc on the interface, because in that case, the number of TX rings might be greater, but I didn't add any checks for that yet (mostly because I didn't know what checks to add). It should also be noted that we need to change the DMA mapping direction for RX buffers, since they may now be reflected into the TX ring of the same device. We choose to use DMA_BIDIRECTIONAL instead of unmapping and remapping as DMA_TO_DEVICE, because performance is better this way. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
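The elastic refill idea reduces to a budget calculation along the lines of the sketch below (field and function names are illustrative, not the enetc driver's): buffers lent out to the TX ring via XDP_TX are counted as if they were still ours, because they come back on TX completion:

struct example_rx_ring {
	int bd_count;		/* total ring size */
	int missing_bds;	/* BDs currently empty and eligible for refill */
	int xdp_tx_in_flight;	/* RX buffers lent out to the TX ring via XDP_TX */
};

static int example_refill_budget(const struct example_rx_ring *rx)
{
	/* Only top up what the returning XDP_TX buffers will not cover. */
	int budget = rx->missing_bds - rx->xdp_tx_in_flight;

	return budget > 0 ? budget : 0;
}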
-
Vladimir Oltean authored
For the RX ring, enetc uses an allocation scheme based on pages split into two buffers, which is already very efficient in terms of preventing reallocations / maximizing reuse, so I see no reason why I would change that. [flattened ASCII diagrams: the RX ring drawn as a row of page halves A and B with next_to_clean, next_to_alloc and next_to_use pointers, shown before and after some buffers have been consumed while their reusable page halves are kept in place] Then, when enetc_refill_rx_ring is called, whose purpose is to advance next_to_use, it sees that it can take buffers up to next_to_alloc, and it says "oh, hey, rx_swbd->page isn't NULL, I don't need to allocate one!". The only problem is that for the default PAGE_SIZE value of 4096, buffer sizes are 2048 bytes. While this is enough for normal skb allocations at an MTU of 1500 bytes, for XDP it isn't, because the XDP headroom is 256 bytes, and including skb_shared_info and alignment, we end up being able to make use of only 1472 bytes, which is insufficient for the default MTU. To solve that problem, we implement scatter/gather processing in the driver, because we would really like to keep the existing allocation scheme. A packet of 1500 bytes is received in a buffer of 1472 bytes and another one of 28 bytes. Because the headroom required by XDP is different (and much larger) than the one required by the network stack, whenever a BPF program is added or deleted on the port, we drain the existing RX buffers and seed new ones with the required headroom. We also keep the required headroom in rx_ring->buffer_offset. The simplest way to implement XDP_PASS, where an skb must be created, is to create an xdp_buff based on the next_to_clean RX BDs, but not clear those BDs from the RX ring yet, just keep the original index at which the BDs for this frame started. Then, if the verdict is XDP_PASS, instead of converting the xdp_buff to an skb, we replay a call to enetc_build_skb (just as in the normal enetc_clean_rx_ring case), starting from the original BD index. We would also like to be minimally invasive to the regular RX data path, and not check whether there is a BPF program attached to the ring on every packet. So we create a separate RX ring processing function for XDP. Because we only install/remove the BPF program while the interface is down, we forgo the rcu_read_lock() in enetc_clean_rx_ring, since there shouldn't be any circumstance in which we are processing packets and there is a potentially freed BPF program attached to the RX ring. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
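The arithmetic behind the 1472-byte figure, assuming a 4096-byte page split in half and a 64-bit kernel where the aligned struct skb_shared_info occupies 320 bytes (the EXAMPLE_* names are just for the sketch):

#define EXAMPLE_HALF_PAGE	2048	/* PAGE_SIZE 4096 split into two buffers */
#define EXAMPLE_XDP_HEADROOM	256
#define EXAMPLE_SHINFO		320	/* SKB_DATA_ALIGN(sizeof(struct skb_shared_info)) on 64-bit */

/* 2048 - 256 - 320 = 1472: a 1500-byte frame therefore needs a second, 28-byte buffer */
#define EXAMPLE_XDP_DATA_MAX \
	(EXAMPLE_HALF_PAGE - EXAMPLE_XDP_HEADROOM - EXAMPLE_SHINFO)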
-