- 29 Jan, 2018 34 commits
-
-
David Ahern authored
Unsolicited IPv6 neighbor advertisements should be sent after DAD completes. Update ndisc_send_unsol_na to skip tentative, non-optimistic addresses and have those sent by addrconf_dad_completed after DAD. Fixes: 4a6e3c5d ("net: ipv6: send unsolicited NA on admin up") Reported-by: Vivek Venkatraman <vivek@cumulusnetworks.com> Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Andy Spencer authored
When the frame check sequence (FCS) is split across the last two frames of a fragmented packet, part of the FCS gets counted twice, once when subtracting the FCS, and again when subtracting the previously received data. For example, if 1602 bytes are received, and the first fragment contains the first 1600 bytes (including the first two bytes of the FCS), and the second fragment contains the last two bytes of the FCS: 'skb->len == 1600' from the first fragment size = lstatus & BD_LENGTH_MASK; # 1602 size -= ETH_FCS_LEN; # 1598 size -= skb->len; # -2 Since the size is unsigned, it wraps around and causes a BUG later in the packet handling, as shown below: kernel BUG at ./include/linux/skbuff.h:2068! Oops: Exception in kernel mode, sig: 5 [#1] ... NIP [c021ec60] skb_pull+0x24/0x44 LR [c01e2fbc] gfar_clean_rx_ring+0x498/0x690 Call Trace: [df7edeb0] [c01e2c1c] gfar_clean_rx_ring+0xf8/0x690 (unreliable) [df7edf20] [c01e33a8] gfar_poll_rx_sq+0x3c/0x9c [df7edf40] [c023352c] net_rx_action+0x21c/0x274 [df7edf90] [c0329000] __do_softirq+0xd8/0x240 [df7edff0] [c000c108] call_do_irq+0x24/0x3c [c0597e90] [c00041dc] do_IRQ+0x64/0xc4 [c0597eb0] [c000d920] ret_from_except+0x0/0x18 --- interrupt: 501 at arch_cpu_idle+0x24/0x5c Change the size to a signed integer and then trim off any part of the FCS that was received prior to the last fragment. Fixes: 6c389fc9 ("gianfar: fix size of scatter-gathered frames") Signed-off-by: Andy Spencer <aspencer@spacex.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
David S. Miller authored
Cong Wang says: ==================== net_sched: reflect tx_queue_len change for pfifo_fast This pathcset restores the pfifo_fast qdisc behavior of dropping packets based on latest dev->tx_queue_len. Patch 1 introduces a helper, patch 2 introduces a new Qdisc ops which is called when we modify tx_queue_len, patch 3 implements this ops for pfifo_fast. Please see each patch for details. --- v3: use skb_array_resize_multiple() v2: handle error case for ->change_tx_queue_len() ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
-
Cong Wang authored
pfifo_fast used to drop based on qdisc_dev(qdisc)->tx_queue_len, so we have to resize skb array when we change tx_queue_len. Other qdiscs which read tx_queue_len are fine because they all save it to sch->limit or somewhere else in qdisc during init. They don't have to implement this, it is nicer if they do so that users don't have to re-configure qdisc after changing tx_queue_len. Cc: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Cong Wang authored
Introduce a new qdisc ops ->change_tx_queue_len() so that each qdisc could decide how to implement this if it wants. Previously we simply read dev->tx_queue_len, after pfifo_fast switches to skb array, we need this API to resize the skb array when we change dev->tx_queue_len. To avoid handling race conditions with TX BH, we need to deactivate all TX queues before change the value and bring them back after we are done, this also makes implementation easier. Cc: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Cong Wang authored
This patch promotes the local change_tx_queue_len() to a core helper function, dev_change_tx_queue_len(), so that rtnetlink and net-sysfs could share the code. This also prepares for the following patch. Note, the -EFAULT in the original code doesn't make sense, we should propagate the errno from notifiers. Cc: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Jason Wang authored
We don't stop device before reset owner, this means we could try to serve any virtqueue kick before reset dev->worker. This will result a warn since the work was pending at llist during owner resetting. Fix this by stopping device during owner reset. Reported-by: syzbot+eb17c6162478cc50632c@syzkaller.appspotmail.com Fixes: 3a4d5c94 ("vhost_net: a kernel-level virtio server") Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
David S. Miller authored
Nicolas Dichtel says: ==================== net: Ease to follow an interface that moves to another netns The goal of this series is to ease the user to follow an interface that moves to another netns. After this series, with a patched iproute2: $ ip netns bar foo $ ip monitor link & $ ip link set dummy0 netns foo Deleted 5: dummy0: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN group default link/ether 6e:a7:82:35:96:46 brd ff:ff:ff:ff:ff:ff new-nsid 0 new-ifindex 6 => new nsid: 0, new ifindex: 6 (was 5 in the previous netns) $ ip link set eth1 netns bar Deleted 3: eth1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default link/ether 52:54:01:12:34:57 brd ff:ff:ff:ff:ff:ff new-nsid 1 new-ifindex 3 => new nsid: 1, new ifindex: 3 (same ifindex) $ ip netns bar (id: 1) foo (id: 0) ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
-
Nicolas Dichtel authored
The goal is to let the user follow an interface that moves to another netns. CC: Jiri Benc <jbenc@redhat.com> CC: Christian Brauner <christian.brauner@ubuntu.com> Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Reviewed-by: Jiri Benc <jbenc@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Nicolas Dichtel authored
The user should be able to follow any interface that moves to another netns. There is no reason to hide physical interfaces. CC: Jiri Benc <jbenc@redhat.com> CC: Christian Brauner <christian.brauner@ubuntu.com> Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Reviewed-by: Jiri Benc <jbenc@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Vadim Lomovtsev authored
It was found that ethtool provides unexisting module name while it queries the specified network device for associated driver information. Then user tries to unload that module by provided module name and fails. This happens because ethtool reads value of DRV_NAME macro, while module name is defined at the driver's Makefile. This patch is to correct Cavium CN88xx Thunder NIC driver names (DRV_NAME macro) 'thunder-nicvf' to 'nicvf' and 'thunder-nic' to 'nicpf', sync bgx and xcv driver names accordingly to their module names. Signed-off-by: Vadim Lomovtsev <Vadim.Lomovtsev@cavium.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
David S. Miller authored
Michael S. Tsirkin says: ==================== ptr_ring fixes This fixes a bunch of issues around ptr_ring use in net core. One of these: "tap: fix use-after-free" is also needed on net, but can't be backported cleanly. I will post a net patch separately. Lightly tested - Jason, could you pls confirm this addresses the security issue you saw with ptr_ring? Testing reports would be appreciated too. ==================== Signed-off-by: David S. Miller <davem@davemloft.net> Tested-by: Jason Wang <jasowang@redhat.com> Acked-by: Jason Wang <jasowang@redhat.com>
-
Michael S. Tsirkin authored
Offset 128 overlaps the last word of the redzone. Use 132 which is always beyond that. Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Michael S. Tsirkin authored
This is to make ptr_ring test build again. Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Michael S. Tsirkin authored
Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Michael S. Tsirkin authored
We don't rely on lockless guarantees, but it seems cleaner than inverting __ptr_ring_peek. Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Michael S. Tsirkin authored
In theory compiler could tear queue loads or stores in two. It does not seem to be happening in practice but it seems easier to convert the cases where this would be a problem to READ/WRITE_ONCE than worry about it. Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Michael S. Tsirkin authored
__skb_array_empty should use __ptr_ring_empty since that's the only legal lockless function. Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Michael S. Tsirkin authored
This reverts commit bcecb4bb. If we try to allocate an extra entry as the above commit did, and when the requested size is UINT_MAX, addition overflows causing zero size to be passed to kmalloc(). kmalloc then returns ZERO_SIZE_PTR with a subsequent crash. Reported-by: syzbot+87678bcf753b44c39b67@syzkaller.appspotmail.com Cc: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Acked-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Michael S. Tsirkin authored
Similar to bcecb4bb ("net: ptr_ring: otherwise safe empty checks can overrun array bounds") a lockless use of __ptr_ring_full might cause an out of bounds access. We can fix this, but it's easier to just disallow lockless __ptr_ring_full for now. Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Michael S. Tsirkin authored
Lockless access to __ptr_ring_full is only legal if ring is never resized, otherwise it might cause use-after free errors. Simply drop the lockless test, we'll drop the packet a bit later when produce fails. Fixes: 362899b8 ("macvtap: switch to use skb array") Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Michael S. Tsirkin authored
Lockless __ptr_ring_empty requires that consumer head is read and written at once, atomically. Annotate accordingly to make sure compiler does it correctly. Switch locked callers to __ptr_ring_peek which does not support the lockless operation. Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Michael S. Tsirkin authored
The only function safe to call without locks is __ptr_ring_empty. Move documentation about lockless use there to make sure people do not try to use __ptr_ring_peek outside locks. Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Michael S. Tsirkin authored
The comment near __ptr_ring_peek says: * If ring is never resized, and if the pointer is merely * tested, there's no need to take the lock - see e.g. __ptr_ring_empty. but this was in fact never possible since consumer_head would sometimes point outside the ring. Refactor the code so that it's always pointing within a ring. Fixes: c5ad119f ("net: sched: pfifo_fast use skb_array") Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Acked-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Martin KaFai Lau authored
If a sk_v6_rcv_saddr is !IPV6_ADDR_ANY and !IPV6_ADDR_MAPPED, it implicitly implies it is an ipv6only socket. However, in inet6_bind(), this addr_type checking and setting sk->sk_ipv6only to 1 are only done after sk->sk_prot->get_port(sk, snum) has been completed successfully. This inconsistency between sk_v6_rcv_saddr and sk_ipv6only confuses the 'get_port()'. In particular, when binding SO_REUSEPORT UDP sockets, udp_reuseport_add_sock(sk,...) is called. udp_reuseport_add_sock() checks "ipv6_only_sock(sk2) == ipv6_only_sock(sk)" before adding sk to sk2->sk_reuseport_cb. In this case, ipv6_only_sock(sk2) could be 1 while ipv6_only_sock(sk) is still 0 here. The end result is, reuseport_alloc(sk) is called instead of adding sk to the existing sk2->sk_reuseport_cb. It can be reproduced by binding two SO_REUSEPORT UDP sockets on an IPv6 address (!ANY and !MAPPED). Only one of the socket will receive packet. The fix is to set the implicit sk_ipv6only before calling get_port(). The original sk_ipv6only has to be saved such that it can be restored in case get_port() failed. The situation is similar to the inet_reset_saddr(sk) after get_port() has failed. Thanks to Calvin Owens <calvinowens@fb.com> who created an easy reproduction which leads to a fix. Fixes: e32ea7e7 ("soreuseport: fast reuseport UDP socket selection") Signed-off-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
David S. Miller authored
Christian Brauner says: ==================== rtnetlink: enable IFLA_IF_NETNSID for RTM_{DEL,SET}LINK Based on the previous discussion this enables passing a IFLA_IF_NETNSID property along with RTM_SETLINK and RTM_DELLINK requests. The patch for RTM_NEWLINK will be sent out in a separate patch since there are more corner-cases to think about. Changelog 2018-01-24: * Preserve old behavior and report -ENODEV when either ifindex or ifname is provided and IFLA_GROUP is set. Spotted by Wolfgang Bumiller. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
-
Christian Brauner authored
- Backwards Compatibility: If userspace wants to determine whether RTM_DELLINK supports the IFLA_IF_NETNSID property they should first send an RTM_GETLINK request with IFLA_IF_NETNSID on lo. If either EACCESS is returned or the reply does not include IFLA_IF_NETNSID userspace should assume that IFLA_IF_NETNSID is not supported on this kernel. If the reply does contain an IFLA_IF_NETNSID property userspace can send an RTM_DELLINK with a IFLA_IF_NETNSID property. If they receive EOPNOTSUPP then the kernel does not support the IFLA_IF_NETNSID property with RTM_DELLINK. Userpace should then fallback to other means. - Security: Callers must have CAP_NET_ADMIN in the owning user namespace of the target network namespace. Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Christian Brauner authored
- Backwards Compatibility: If userspace wants to determine whether RTM_SETLINK supports the IFLA_IF_NETNSID property they should first send an RTM_GETLINK request with IFLA_IF_NETNSID on lo. If either EACCESS is returned or the reply does not include IFLA_IF_NETNSID userspace should assume that IFLA_IF_NETNSID is not supported on this kernel. If the reply does contain an IFLA_IF_NETNSID property userspace can send an RTM_SETLINK with a IFLA_IF_NETNSID property. If they receive EOPNOTSUPP then the kernel does not support the IFLA_IF_NETNSID property with RTM_SETLINK. Userpace should then fallback to other means. To retain backwards compatibility the kernel will first check whether a IFLA_NET_NS_PID or IFLA_NET_NS_FD property has been passed. If either one is found it will be used to identify the target network namespace. This implies that users who do not care whether their running kernel supports IFLA_IF_NETNSID with RTM_SETLINK can pass both IFLA_NET_NS_{FD,PID} and IFLA_IF_NETNSID referring to the same network namespace. - Security: Callers must have CAP_NET_ADMIN in the owning user namespace of the target network namespace. Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Christian Brauner authored
RTM_{NEW,SET}LINK already allow operations on other network namespaces by identifying the target network namespace through IFLA_NET_NS_{FD,PID} properties. This is done by looking for the corresponding properties in do_setlink(). Extend do_setlink() to also look for the IFLA_IF_NETNSID property. This introduces no functional changes since all callers of do_setlink() currently block IFLA_IF_NETNSID by reporting an error before they reach do_setlink(). This introduces the helpers: static struct net *rtnl_link_get_net_by_nlattr(struct net *src_net, struct nlattr *tb[]) static struct net *rtnl_link_get_net_capable(const struct sk_buff *skb, struct net *src_net, struct nlattr *tb[], int cap) to simplify permission checks and target network namespace retrieval for RTM_* requests that already support IFLA_NET_NS_{FD,PID} but get extended to IFLA_IF_NETNSID. To perserve backwards compatibility the helpers look for IFLA_NET_NS_{FD,PID} properties first before checking for IFLA_IF_NETNSID. Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller authored
Signed-off-by: David S. Miller <davem@davemloft.net>
-
David S. Miller authored
Merge tag 'wireless-drivers-next-for-davem-2018-01-26' of git://git.kernel.org/pub/scm/linux/kernel/git/kvalo/wireless-drivers-next Kalle Valo says: ==================== wireless-drivers-next patches for 4.16 Major changes: wil6210 * add PCI device id for Talyn * support flashless device ath9k * improve RSSI/signal accuracy on AR9003 series mt76 * validate CCMP PN from received frames to avoid replay attacks qtnfmac * support 64-bit network stats * report more hardware information to kernel log and some via ethtool ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
-
kbuild test robot authored
efx_default_channel_want_txqs() is only used in efx.c, while efx_ptp_want_txqs() and efx_ptp_channel_type (a struct) are only used in ptp.c. In all cases these symbols should be static. Fixes: 2935e3c3 ("sfc: on 8000 series use TX queues for TX timestamps") Signed-off-by: Fengguang Wu <fengguang.wu@intel.com> [ecree@solarflare.com: rewrote commit message] Signed-off-by: Edward Cree <ecree@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queueDavid S. Miller authored
Jeff Kirsher says: ==================== 40GbE Intel Wired LAN Driver Updates 2018-01-26 This series contains updates to i40e and i40evf. Michal updates the driver to pass critical errors from the firmware to the caller. Patryk fixes an issue of creating multiple identical filters with the same location, by simply moving the functions so that we remove the existing filter and then add the new filter. Paweł adds back in the ability to turn off offloads when VLAN is set for the VF driver. Fixed an issue where the number of TC queue pairs was exceeding MSI-X vectors count, causing messages about invalid TC mapping and wrong selected Tx queue. Alex cleans up the i40e/i40evf_set_itr_per_queue() by dropping all the unneeded pointer chases. Puts to use the reg_idx value, which was going unused, so that we can avoid having to compute the vector every time throughout the driver. Upasana enable the driver to display LLDP information on the vSphere Web Client by exposing DCB parameters. Alice converts our flags from 32 to 64 bit size, since we have added more flags. Dave implements a private ethtool flag to disable the processing of LLDP packets by the firmware, so that the firmware will not consume LLDPDU and cause them to be sent up the stack. Alan adds a mechanism for detecting/storing the flag for processing of LLDP packets by the firmware, so that its current state is persistent across reboots/reloads of the driver. Avinash fixes kdump with i40e due to resource constraints. We were enabling VMDq and iWARP when we just have a single CPU, which was starving kdump for the lack of IRQs. Jake adds support to program the fragmented IPv4 input set PCTYPE. Fixed the reported masks to properly report that the entire field is masked, since we had accidentally swapped the mask values for the IPv4 addresses with the L4 port numbers. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
-
git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextDavid S. Miller authored
Alexei Starovoitov says: ==================== pull-request: bpf-next 2018-01-26 The following pull-request contains BPF updates for your *net-next* tree. The main changes are: 1) A number of extensions to tcp-bpf, from Lawrence. - direct R or R/W access to many tcp_sock fields via bpf_sock_ops - passing up to 3 arguments to bpf_sock_ops functions - tcp_sock field bpf_sock_ops_cb_flags for controlling callbacks - optionally calling bpf_sock_ops program when RTO fires - optionally calling bpf_sock_ops program when packet is retransmitted - optionally calling bpf_sock_ops program when TCP state changes - access to tclass and sk_txhash - new selftest 2) div/mod exception handling, from Daniel. One of the ugly leftovers from the early eBPF days is that div/mod operations based on registers have a hard-coded src_reg == 0 test in the interpreter as well as in JIT code generators that would return from the BPF program with exit code 0. This was basically adopted from cBPF interpreter for historical reasons. There are multiple reasons why this is very suboptimal and prone to bugs. To name one: the return code mapping for such abnormal program exit of 0 does not always match with a suitable program type's exit code mapping. For example, '0' in tc means action 'ok' where the packet gets passed further up the stack, which is just undesirable for such cases (e.g. when implementing policy) and also does not match with other program types. After considering _four_ different ways to address the problem, we adapt the same behavior as on some major archs like ARMv8: X div 0 results in 0, and X mod 0 results in X. aarch64 and aarch32 ISA do not generate any traps or otherwise aborts of program execution for unsigned divides. Given the options, it seems the most suitable from all of them, also since major archs have similar schemes in place. Given this is all in the realm of undefined behavior, we still have the option to adapt if deemed necessary. 3) sockmap sample refactoring, from John. 4) lpm map get_next_key fixes, from Yonghong. 5) test cleanups, from Alexei and Prashant. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
-
- 28 Jan, 2018 2 commits
-
-
git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queueDavid S. Miller authored
Jeff Kirsher says: ==================== 10GbE Intel Wired LAN Driver Updates 2018-01-26 This series contains updates to ixgbe and ixgbevf. Emil updates ixgbevf to match ixgbe functionality, starting with the consolidating of functions that represent logical steps in the receive process so we can later update them more easily. Updated ixgbevf to only synchronize the length of the frame, which will typically be the MTU or smaller. Updated the VF driver to use the length of the packet instead of the DD status bit to determine if a new descriptor is ready to be processed, which saves on reads and we can save time on initialization. Added support for DMA_ATTR_SKIP_CPU_SYNC/WEAK_ORDERING to help improve performance on some platforms. Updated the VF driver to do bulk updates of the page reference count instead of just incrementing it by one reference at a time. Updated the VF driver to only go through the region of the receive ring that was designated to be cleaned up, rather than process the entire ring. Colin Ian King adds the use of ARRAY_SIZE() on various arrays. Miroslav Lichvar fixes an issue where ethtool was reporting timestamping filters unsupported for X550, which is incorrect. Paul adds support for reporting 5G link speed for some devices. Dan Carpenter fixes a typo where && was used when it should have been ||. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
-
Leon Romanovsky authored
The "return 0" instruction follows other return instruction and it makes it impossible to execute, hence remove it. Fixes: 00fc0c51 ("rocker: Change world_ops API and implementation to be switchdev independant") Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
- 27 Jan, 2018 4 commits
-
-
Alexei Starovoitov authored
Yonghong Song says: ==================== A kernel page fault which happens in lpm map trie_get_next_key is reported by syzbot and Eric. The issue was introduced by commit b471f2f1 ("bpf: implement MAP_GET_NEXT_KEY command for LPM_TRIE map"). Patch #1 fixed the issue in the kernel and patch #2 adds a multithreaded test case in tools/testing/selftests/bpf/test_lpm_map. ==================== Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Yonghong Song authored
The new test will spawn four threads, doing map update, delete, lookup and get_next_key in parallel. It is able to reproduce the issue in the previous commit found by syzbot and Eric Dumazet. Signed-off-by: Yonghong Song <yhs@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Yonghong Song authored
Commit b471f2f1 ("bpf: implement MAP_GET_NEXT_KEY command for LPM_TRIE map") introduces a bug likes below: if (!rcu_dereference(trie->root)) return -ENOENT; if (!key || key->prefixlen > trie->max_prefixlen) { root = &trie->root; goto find_leftmost; } ...... find_leftmost: for (node = rcu_dereference(*root); node;) { In the code after label find_leftmost, it is assumed that *root should not be NULL, but it is not true as it is possbile trie->root is changed to NULL by an asynchronous delete operation. The issue is reported by syzbot and Eric Dumazet with the below error log: ...... kasan: CONFIG_KASAN_INLINE enabled kasan: GPF could be caused by NULL-ptr deref or user memory access general protection fault: 0000 [#1] SMP KASAN Dumping ftrace buffer: (ftrace buffer empty) Modules linked in: CPU: 1 PID: 8033 Comm: syz-executor3 Not tainted 4.15.0-rc8+ #4 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 RIP: 0010:trie_get_next_key+0x3c2/0xf10 kernel/bpf/lpm_trie.c:682 ...... This patch fixed the issue by use local rcu_dereferenced pointer instead of *(&trie->root) later on. Fixes: b471f2f1 ("bpf: implement MAP_GET_NEXT_KEY command or LPM_TRIE map") Reported-by: syzbot <syzkaller@googlegroups.com> Reported-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Yonghong Song <yhs@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
Alexei Starovoitov authored
Daniel Borkmann says: ==================== This set contains a small cleanup in cBPF prologue generation and otherwise fixes an outstanding issue related to BPF to BPF calls and exception handling. For details please see related patches. Last but not least, BPF selftests is extended with several new test cases. Thanks! ==================== Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-