- 13 Sep, 2019 3 commits
-
-
Mao Wenan authored
There are more parentheses in if clause when call sctp_get_port_local in sctp_do_bind, and redundant assignment to 'ret'. This patch is to do cleanup. Signed-off-by: Mao Wenan <maowenan@huawei.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Mao Wenan authored
Currently sctp_get_port_local() returns a long which is either 0,1 or a pointer casted to long. It's neither of the callers use the return value since commit 62208f12 ("net: sctp: simplify sctp_get_port"). Now two callers are sctp_get_port and sctp_do_bind, they actually assumend a casted to an int was the same as a pointer casted to a long, and they don't save the return value just check whether it is zero or non-zero, so it would better change return type from long to int for sctp_get_port_local. Signed-off-by: Mao Wenan <maowenan@huawei.com> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Jeff Kirsher authored
Port the same fix for ixgbe to ixgbevf. The ixgbevf driver currently does IPsec Tx offloading based on an existing secpath. However, the secpath can also come from the Rx side, in this case it is misinterpreted for Tx offload and the packets are dropped with a "bad sa_idx" error. Fix this by using the xfrm_offload() function to test for Tx offload. CC: Shannon Nelson <snelson@pensando.io> Fixes: 7f68d430 ("ixgbevf: enable VF IPsec offload operations") Reported-by: Jonathan Tooker <jonathan@reliablehosting.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Acked-by: Shannon Nelson <snelson@pensando.io> Signed-off-by: David S. Miller <davem@davemloft.net>
-
- 12 Sep, 2019 6 commits
-
-
Christophe JAILLET authored
The '.exit' functions from 'pernet_operations' structure should be marked as __net_exit, not __net_init. Fixes: 8e2d61e0 ("sctp: fix race on protocol/netns initialization") Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Steffen Klassert authored
The ixgbe driver currently does IPsec TX offloading based on an existing secpath. However, the secpath can also come from the RX side, in this case it is misinterpreted for TX offload and the packets are dropped with a "bad sa_idx" error. Fix this by using the xfrm_offload() function to test for TX offload. Fixes: 59259470 ("ixgbe: process the Tx ipsec offload") Reported-by: Michael Marley <michael@michaelmarley.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Navid Emamdoost authored
In qrtr_tun_write_iter the allocated kbuf should be release in case of error or success return. v2 Update: Thanks to David Miller for pointing out the release on success path as well. Signed-off-by: Navid Emamdoost <navid.emamdoost@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Subash Abhinov Kasiviswanathan authored
In event of failure during register_netdevice, free_netdev is invoked immediately. free_netdev assumes that all the netdevice refcounts have been dropped prior to it being called and as a result frees and clears out the refcount pointer. However, this is not necessarily true as some of the operations in the NETDEV_UNREGISTER notifier handlers queue RCU callbacks for invocation after a grace period. The IPv4 callback in_dev_rcu_put tries to access the refcount after free_netdev is called which leads to a null de-reference- 44837.761523: <6> Unable to handle kernel paging request at virtual address 0000004a88287000 44837.761651: <2> pc : in_dev_finish_destroy+0x4c/0xc8 44837.761654: <2> lr : in_dev_finish_destroy+0x2c/0xc8 44837.762393: <2> Call trace: 44837.762398: <2> in_dev_finish_destroy+0x4c/0xc8 44837.762404: <2> in_dev_rcu_put+0x24/0x30 44837.762412: <2> rcu_nocb_kthread+0x43c/0x468 44837.762418: <2> kthread+0x118/0x128 44837.762424: <2> ret_from_fork+0x10/0x1c Fix this by waiting for the completion of the call_rcu() in case of register_netdevice errors. Fixes: 93ee31f1 ("[NET]: Fix free_netdev on register_netdev failure.") Cc: Sean Tranchetti <stranche@codeaurora.org> Signed-off-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Christophe JAILLET authored
The '.exit' functions from 'pernet_operations' structure should be marked as __net_exit, not __net_init. Fixes: d862e546 ("net: ipv6: Implement /proc/net/icmp6.") Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Yang Yingliang authored
I got a UAF repport in tun driver when doing fuzzy test: [ 466.269490] ================================================================== [ 466.271792] BUG: KASAN: use-after-free in tun_chr_read_iter+0x2ca/0x2d0 [ 466.271806] Read of size 8 at addr ffff888372139250 by task tun-test/2699 [ 466.271810] [ 466.271824] CPU: 1 PID: 2699 Comm: tun-test Not tainted 5.3.0-rc1-00001-g5a9433db2614-dirty #427 [ 466.271833] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.org 04/01/2014 [ 466.271838] Call Trace: [ 466.271858] dump_stack+0xca/0x13e [ 466.271871] ? tun_chr_read_iter+0x2ca/0x2d0 [ 466.271890] print_address_description+0x79/0x440 [ 466.271906] ? vprintk_func+0x5e/0xf0 [ 466.271920] ? tun_chr_read_iter+0x2ca/0x2d0 [ 466.271935] __kasan_report+0x15c/0x1df [ 466.271958] ? tun_chr_read_iter+0x2ca/0x2d0 [ 466.271976] kasan_report+0xe/0x20 [ 466.271987] tun_chr_read_iter+0x2ca/0x2d0 [ 466.272013] do_iter_readv_writev+0x4b7/0x740 [ 466.272032] ? default_llseek+0x2d0/0x2d0 [ 466.272072] do_iter_read+0x1c5/0x5e0 [ 466.272110] vfs_readv+0x108/0x180 [ 466.299007] ? compat_rw_copy_check_uvector+0x440/0x440 [ 466.299020] ? fsnotify+0x888/0xd50 [ 466.299040] ? __fsnotify_parent+0xd0/0x350 [ 466.299064] ? fsnotify_first_mark+0x1e0/0x1e0 [ 466.304548] ? vfs_write+0x264/0x510 [ 466.304569] ? ksys_write+0x101/0x210 [ 466.304591] ? do_preadv+0x116/0x1a0 [ 466.304609] do_preadv+0x116/0x1a0 [ 466.309829] do_syscall_64+0xc8/0x600 [ 466.309849] entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 466.309861] RIP: 0033:0x4560f9 [ 466.309875] Code: 00 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48 [ 466.309889] RSP: 002b:00007ffffa5166e8 EFLAGS: 00000206 ORIG_RAX: 0000000000000127 [ 466.322992] RAX: ffffffffffffffda RBX: 0000000000400460 RCX: 00000000004560f9 [ 466.322999] RDX: 0000000000000003 RSI: 00000000200008c0 RDI: 0000000000000003 [ 466.323007] RBP: 00007ffffa516700 R08: 0000000000000004 R09: 0000000000000000 [ 466.323014] R10: 0000000000000000 R11: 0000000000000206 R12: 000000000040cb10 [ 466.323021] R13: 0000000000000000 R14: 00000000006d7018 R15: 0000000000000000 [ 466.323057] [ 466.323064] Allocated by task 2605: [ 466.335165] save_stack+0x19/0x80 [ 466.336240] __kasan_kmalloc.constprop.8+0xa0/0xd0 [ 466.337755] kmem_cache_alloc+0xe8/0x320 [ 466.339050] getname_flags+0xca/0x560 [ 466.340229] user_path_at_empty+0x2c/0x50 [ 466.341508] vfs_statx+0xe6/0x190 [ 466.342619] __do_sys_newstat+0x81/0x100 [ 466.343908] do_syscall_64+0xc8/0x600 [ 466.345303] entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 466.347034] [ 466.347517] Freed by task 2605: [ 466.348471] save_stack+0x19/0x80 [ 466.349476] __kasan_slab_free+0x12e/0x180 [ 466.350726] kmem_cache_free+0xc8/0x430 [ 466.351874] putname+0xe2/0x120 [ 466.352921] filename_lookup+0x257/0x3e0 [ 466.354319] vfs_statx+0xe6/0x190 [ 466.355498] __do_sys_newstat+0x81/0x100 [ 466.356889] do_syscall_64+0xc8/0x600 [ 466.358037] entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 466.359567] [ 466.360050] The buggy address belongs to the object at ffff888372139100 [ 466.360050] which belongs to the cache names_cache of size 4096 [ 466.363735] The buggy address is located 336 bytes inside of [ 466.363735] 4096-byte region [ffff888372139100, ffff88837213a100) [ 466.367179] The buggy address belongs to the page: [ 466.368604] page:ffffea000dc84e00 refcount:1 mapcount:0 mapping:ffff8883df1b4f00 index:0x0 compound_mapcount: 0 [ 466.371582] flags: 0x2fffff80010200(slab|head) [ 466.372910] raw: 002fffff80010200 dead000000000100 dead000000000122 ffff8883df1b4f00 [ 466.375209] raw: 0000000000000000 0000000000070007 00000001ffffffff 0000000000000000 [ 466.377778] page dumped because: kasan: bad access detected [ 466.379730] [ 466.380288] Memory state around the buggy address: [ 466.381844] ffff888372139100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [ 466.384009] ffff888372139180: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [ 466.386131] >ffff888372139200: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [ 466.388257] ^ [ 466.390234] ffff888372139280: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [ 466.392512] ffff888372139300: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [ 466.394667] ================================================================== tun_chr_read_iter() accessed the memory which freed by free_netdev() called by tun_set_iff(): CPUA CPUB tun_set_iff() alloc_netdev_mqs() tun_attach() tun_chr_read_iter() tun_get() tun_do_read() tun_ring_recv() register_netdevice() <-- inject error goto err_detach tun_detach_all() <-- set RCV_SHUTDOWN free_netdev() <-- called from err_free_dev path netdev_freemem() <-- free the memory without check refcount (In this path, the refcount cannot prevent freeing the memory of dev, and the memory will be used by dev_put() called by tun_chr_read_iter() on CPUB.) (Break from tun_ring_recv(), because RCV_SHUTDOWN is set) tun_put() dev_put() <-- use the memory freed by netdev_freemem() Put the publishing of tfile->tun after register_netdevice(), so tun_get() won't get the tun pointer that freed by err_detach path if register_netdevice() failed. Fixes: eb0fb363 ("tuntap: attach queue 0 before registering netdevice") Reported-by: Hulk Robot <hulkci@huawei.com> Suggested-by: Jason Wang <jasowang@redhat.com> Signed-off-by: Yang Yingliang <yangyingliang@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
- 11 Sep, 2019 13 commits
-
-
git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/net-queueDavid S. Miller authored
Jeff Kirsher says: ==================== Intel Wired LAN Driver Updates 2019-09-11 This series contains fixes to ixgbe. Alex fixes up the adaptive ITR scheme for ixgbe which could result in a value that was either 0 or something less than 10 which was causing issues with hardware features, like RSC, that do not function well with ITR values that low. Ilya Maximets fixes the ixgbe driver to limit the number of transmit descriptors to clean by the number of transmit descriptors used in the transmit ring, so that the driver does not try to "double" clean the same descriptors. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
-
Neal Cardwell authored
Fix tcp_ecn_withdraw_cwr() to clear the correct bit: TCP_ECN_QUEUE_CWR. Rationale: basically, TCP_ECN_DEMAND_CWR is a bit that is purely about the behavior of data receivers, and deciding whether to reflect incoming IP ECN CE marks as outgoing TCP th->ece marks. The TCP_ECN_QUEUE_CWR bit is purely about the behavior of data senders, and deciding whether to send CWR. The tcp_ecn_withdraw_cwr() function is only called from tcp_undo_cwnd_reduction() by data senders during an undo, so it should zero the sender-side state, TCP_ECN_QUEUE_CWR. It does not make sense to stop the reflection of incoming CE bits on incoming data packets just because outgoing packets were spuriously retransmitted. The bug has been reproduced with packetdrill to manifest in a scenario with RFC3168 ECN, with an incoming data packet with CE bit set and carrying a TCP timestamp value that causes cwnd undo. Before this fix, the IP CE bit was ignored and not reflected in the TCP ECE header bit, and sender sent a TCP CWR ('W') bit on the next outgoing data packet, even though the cwnd reduction had been undone. After this fix, the sender properly reflects the CE bit and does not set the W bit. Note: the bug actually predates 2005 git history; this Fixes footer is chosen to be the oldest SHA1 I have tested (from Sep 2007) for which the patch applies cleanly (since before this commit the code was in a .h file). Fixes: bdf1ee5d ("[TCP]: Move code from tcp_ecn.h to tcp*.c and tcp.h & remove it") Signed-off-by: Neal Cardwell <ncardwell@google.com> Acked-by: Yuchung Cheng <ycheng@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Cc: Eric Dumazet <edumazet@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Ilya Maximets authored
Tx code doesn't clear the descriptors' status after cleaning. So, if the budget is larger than number of used elems in a ring, some descriptors will be accounted twice and xsk_umem_complete_tx will move prod_tail far beyond the prod_head breaking the completion queue ring. Fix that by limiting the number of descriptors to clean by the number of used descriptors in the Tx ring. 'ixgbe_clean_xdp_tx_irq()' function refactored to look more like 'ixgbe_xsk_clean_tx_ring()' since we're allowed to directly use 'next_to_clean' and 'next_to_use' indexes. CC: stable@vger.kernel.org Fixes: 8221c5eb ("ixgbe: add AF_XDP zero-copy Tx support") Signed-off-by: Ilya Maximets <i.maximets@samsung.com> Tested-by: William Tu <u9012063@gmail.com> Tested-by: Eelco Chaudron <echaudro@redhat.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
-
Alexander Duyck authored
There were a couple cases where the ITR value generated via the adaptive ITR scheme could exceed 126. This resulted in the value becoming either 0 or something less than 10. Switching back and forth between a value less than 10 and a value greater than 10 can cause issues as certain hardware features such as RSC to not function well when the ITR value has dropped that low. CC: stable@vger.kernel.org Fixes: b4ded832 ("ixgbe: Update adaptive ITR algorithm") Reported-by: Gregg Leventhal <gleventhal@janestreet.com> Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
-
Colin Ian King authored
There is a spelling mistake in a mlx4_err error message. Fix it. Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Colin Ian King authored
There is a spelling mistake in a .msg literal string. Fix it. Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Colin Ian King authored
There is a spelling mistake in the lmc_trace message. Fix it. Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Colin Ian King authored
There is a spelling mistake in a dev_err message. Fix it. Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Ka-Cheong Poon authored
In rds_bind(), an rds_sock is added to the RDS bind hash table before rs_transport is set. This means that the socket can be found by the receive code path when rs_transport is NULL. And the receive code path de-references rs_transport for congestion update check. This can cause a panic. An rds_sock should not be added to the bind hash table before all the needed fields are set. Reported-by: syzbot+4b4f8163c2e246df3c4c@syzkaller.appspotmail.com Signed-off-by: Ka-Cheong Poon <ka-cheong.poon@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Jouni Malinen authored
The Layer 2 Update frame is used to update bridges when a station roams to another AP even if that STA does not transmit any frames after the reassociation. This behavior was described in IEEE Std 802.11F-2003 as something that would happen based on MLME-ASSOCIATE.indication, i.e., before completing 4-way handshake. However, this IEEE trial-use recommended practice document was published before RSN (IEEE Std 802.11i-2004) and as such, did not consider RSN use cases. Furthermore, IEEE Std 802.11F-2003 was withdrawn in 2006 and as such, has not been maintained amd should not be used anymore. Sending out the Layer 2 Update frame immediately after association is fine for open networks (and also when using SAE, FT protocol, or FILS authentication when the station is actually authenticated by the time association completes). However, it is not appropriate for cases where RSN is used with PSK or EAP authentication since the station is actually fully authenticated only once the 4-way handshake completes after authentication and attackers might be able to use the unauthenticated triggering of Layer 2 Update frame transmission to disrupt bridge behavior. Fix this by postponing transmission of the Layer 2 Update frame from station entry addition to the point when the station entry is marked authorized. Similarly, send out the VLAN binding update only if the STA entry has already been authorized. Signed-off-by: Jouni Malinen <jouni@codeaurora.org> Reviewed-by: Johannes Berg <johannes@sipsolutions.net> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Randy Dunlap authored
Keep the "Library routines" menu intact by moving OBJAGG into it. Otherwise OBJAGG is displayed/presented as an orphan in the various config menus. Fixes: 0a020d41 ("lib: introduce initial implementation of object aggregation manager") Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Cc: Jiri Pirko <jiri@mellanox.com> Cc: Ido Schimmel <idosch@mellanox.com> Cc: David S. Miller <davem@davemloft.net> Tested-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Mao Wenan authored
sonic_send_packet will be processed in irq or non-irq context, so it would better use dev_kfree_skb_any instead of dev_kfree_skb. Fixes: d9fb9f38 ("*sonic/natsemi/ns83829: Move the National Semi-conductor drivers") Signed-off-by: Mao Wenan <maowenan@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Navid Emamdoost authored
In i2400m_op_rfkill_sw_toggle cmd buffer should be released along with skb response. Signed-off-by: Navid Emamdoost <navid.emamdoost@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
- 10 Sep, 2019 5 commits
-
-
Xin Long authored
This issue causes SCTP_PEER_ADDR_THLDS sockopt not to be able to dump a transport thresholds info. Fix it by adding 'goto' put_user in sctp_getsockopt_paddr_thresholds. Fixes: 8add543e ("sctp: add SCTP_FUTURE_ASSOC for SCTP_PEER_ADDR_THLDS sockopt") Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Cong Wang authored
In case of TCA_HHF_NON_HH_WEIGHT or TCA_HHF_QUANTUM is zero, it would make no progress inside the loop in hhf_dequeue() thus kernel would get stuck. Fix this by checking this corner case in hhf_change(). Fixes: 10239edf ("net-qdisc-hhf: Heavy-Hitter Filter (HHF) qdisc") Reported-by: syzbot+bc6297c11f19ee807dc2@syzkaller.appspotmail.com Reported-by: syzbot+041483004a7f45f1f20a@syzkaller.appspotmail.com Reported-by: syzbot+55be5f513bed37fc4367@syzkaller.appspotmail.com Cc: Jamal Hadi Salim <jhs@mojatatu.com> Cc: Jiri Pirko <jiri@resnulli.us> Cc: Terry Lam <vtlam@google.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Cong Wang authored
At least sch_red and sch_tbf don't implement ->tcf_block() while still have a non-zero tc "class". Instead of adding nop implementations to each of such qdisc's, we can just relax the check of cops->tcf_block() in tc_bind_tclass(). They don't support TC filter anyway. Reported-by: syzbot+21b29db13c065852f64b@syzkaller.appspotmail.com Cc: Jamal Hadi Salim <jhs@mojatatu.com> Cc: Jiri Pirko <jiri@resnulli.us> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Nicolas Dichtel authored
NLM_F_MULTI must be used only when a NLMSG_DONE message is sent at the end. In fact, NLMSG_DONE is sent only at the end of a dump. Libraries like libnl will wait forever for NLMSG_DONE. Fixes: 949f1e39 ("bridge: mdb: notify on router port add and del") CC: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Michal Suchanek authored
Commit 1c2977c0 ("net/ibmvnic: free reset work of removed device from queue") adds a } without corresponding { causing build break. Fixes: 1c2977c0 ("net/ibmvnic: free reset work of removed device from queue") Signed-off-by: Michal Suchanek <msuchanek@suse.de> Reviewed-by: Tyrel Datwyler <tyreld@linux.ibm.com> Reviewed-by: Juliet Kim <julietk@linux.vnet.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
- 07 Sep, 2019 8 commits
-
-
Fred Lotter authored
Flower control message replies are handled in different locations. The truly high priority replies are handled in the BH (tasklet) context, while the remaining replies are handled in a predefined Linux work queue. The work queue handler orders replies into high and low priority groups, and always start servicing the high priority replies within the received batch first. Reply Type: Rtnl Lock: Handler: CMSG_TYPE_PORT_MOD no BH tasklet (mtu) CMSG_TYPE_TUN_NEIGH no BH tasklet CMSG_TYPE_FLOW_STATS no BH tasklet CMSG_TYPE_PORT_REIFY no WQ high CMSG_TYPE_PORT_MOD yes WQ high (link/mtu) CMSG_TYPE_MERGE_HINT yes WQ low CMSG_TYPE_NO_NEIGH no WQ low CMSG_TYPE_ACTIVE_TUNS no WQ low CMSG_TYPE_QOS_STATS no WQ low CMSG_TYPE_LAG_CONFIG no WQ low A subset of control messages can block waiting for an rtnl lock (from both work queue priority groups). The rtnl lock is heavily contended for by external processes such as systemd-udevd, systemd-network and libvirtd, especially during netdev creation, such as when flower VFs and representors are instantiated. Kernel netlink instrumentation shows that external processes (such as systemd-udevd) often use successive rtnl_trylock() sequences, which can result in an rtnl_lock() blocked control message to starve for longer periods of time during rtnl lock contention, i.e. netdev creation. In the current design a single blocked control message will block the entire work queue (both priorities), and introduce a latency which is nondeterministic and dependent on system wide rtnl lock usage. In some extreme cases, one blocked control message at exactly the wrong time, just before the maximum number of VFs are instantiated, can block the work queue for long enough to prevent VF representor REIFY replies from getting handled in time for the 40ms timeout. The firmware will deliver the total maximum number of REIFY message replies in around 300us. Only REIFY and MTU update messages require replies within a timeout period (of 40ms). The MTU-only updates are already done directly in the BH (tasklet) handler. Move the REIFY handler down into the BH (tasklet) in order to resolve timeouts caused by a blocked work queue waiting on rtnl locks. Signed-off-by: Fred Lotter <frederik.lotter@netronome.com> Signed-off-by: Simon Horman <simon.horman@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Shmulik Ladkani authored
Historically, support for frag_list packets entering skb_segment() was limited to frag_list members terminating on exact same gso_size boundaries. This is verified with a BUG_ON since commit 89319d38 ("net: Add frag_list support to skb_segment"), quote: As such we require all frag_list members terminate on exact MSS boundaries. This is checked using BUG_ON. As there should only be one producer in the kernel of such packets, namely GRO, this requirement should not be difficult to maintain. However, since commit 6578171a ("bpf: add bpf_skb_change_proto helper"), the "exact MSS boundaries" assumption no longer holds: An eBPF program using bpf_skb_change_proto() DOES modify 'gso_size', but leaves the frag_list members as originally merged by GRO with the original 'gso_size'. Example of such programs are bpf-based NAT46 or NAT64. This lead to a kernel BUG_ON for flows involving: - GRO generating a frag_list skb - bpf program performing bpf_skb_change_proto() or bpf_skb_adjust_room() - skb_segment() of the skb See example BUG_ON reports in [0]. In commit 13acc94e ("net: permit skb_segment on head_frag frag_list skb"), skb_segment() was modified to support the "gso_size mangling" case of a frag_list GRO'ed skb, but *only* for frag_list members having head_frag==true (having a page-fragment head). Alas, GRO packets having frag_list members with a linear kmalloced head (head_frag==false) still hit the BUG_ON. This commit adds support to skb_segment() for a 'head_skb' packet having a frag_list whose members are *non* head_frag, with gso_size mangled, by disabling SG and thus falling-back to copying the data from the given 'head_skb' into the generated segmented skbs - as suggested by Willem de Bruijn [1]. Since this approach involves the penalty of skb_copy_and_csum_bits() when building the segments, care was taken in order to enable this solution only when required: - untrusted gso_size, by testing SKB_GSO_DODGY is set (SKB_GSO_DODGY is set by any gso_size mangling functions in net/core/filter.c) - the frag_list is non empty, its item is a non head_frag, *and* the headlen of the given 'head_skb' does not match the gso_size. [0] https://lore.kernel.org/netdev/20190826170724.25ff616f@pixies/ https://lore.kernel.org/netdev/9265b93f-253d-6b8c-f2b8-4b54eff1835c@fb.com/ [1] https://lore.kernel.org/netdev/CA+FuTSfVsgNDi7c=GUU8nMg2hWxF2SjCNLXetHeVPdnxAW5K-w@mail.gmail.com/ Fixes: 6578171a ("bpf: add bpf_skb_change_proto helper") Suggested-by: Willem de Bruijn <willemdebruijn.kernel@gmail.com> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Alexander Duyck <alexander.duyck@gmail.com> Signed-off-by: Shmulik Ladkani <shmulik.ladkani@gmail.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Reviewed-by: Alexander Duyck <alexander.h.duyck@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Maciej Żenczykowski authored
Fixes a stupid bug I recently introduced... ip6_route_info_create() returns an ERR_PTR(err) and not a NULL on error. Fixes: d55a2e37 ("net-ipv6: fix excessive RTF_ADDRCONF flag on ::1/128 local route (and others)'") Cc: David Ahern <dsahern@gmail.com> Cc: Lorenzo Colitti <lorenzo@google.com> Cc: Eric Dumazet <edumazet@google.com> Signed-off-by: Maciej Żenczykowski <maze@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Eric Biggers authored
syzbot reported: BUG: KMSAN: uninit-value in capi_write+0x791/0xa90 drivers/isdn/capi/capi.c:700 CPU: 0 PID: 10025 Comm: syz-executor379 Not tainted 4.20.0-rc7+ #2 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x173/0x1d0 lib/dump_stack.c:113 kmsan_report+0x12e/0x2a0 mm/kmsan/kmsan.c:613 __msan_warning+0x82/0xf0 mm/kmsan/kmsan_instr.c:313 capi_write+0x791/0xa90 drivers/isdn/capi/capi.c:700 do_loop_readv_writev fs/read_write.c:703 [inline] do_iter_write+0x83e/0xd80 fs/read_write.c:961 vfs_writev fs/read_write.c:1004 [inline] do_writev+0x397/0x840 fs/read_write.c:1039 __do_sys_writev fs/read_write.c:1112 [inline] __se_sys_writev+0x9b/0xb0 fs/read_write.c:1109 __x64_sys_writev+0x4a/0x70 fs/read_write.c:1109 do_syscall_64+0xbc/0xf0 arch/x86/entry/common.c:291 entry_SYSCALL_64_after_hwframe+0x63/0xe7 [...] The problem is that capi_write() is reading past the end of the message. Fix it by checking the message's length in the needed places. Reported-and-tested-by: syzbot+0849c524d9c634f5ae66@syzkaller.appspotmail.com Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Juliet Kim authored
Commit 36f1031c ("ibmvnic: Do not process reset during or after device removal") made the change to exit reset if the driver has been removed, but does not free reset work items of the adapter from queue. Ensure all reset work items are freed when breaking out of the loop early. Fixes: 36f1031c ("ibmnvic: Do not process reset during or after device removal”) Signed-off-by: Juliet Kim <julietk@linux.vnet.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Stefan Chulski authored
Regarding to IEEE 802.3-2015 standard section 2 28B.3 Priority resolution - Table 28-3 - Pause resolution In case of Local device Pause=1 AsymDir=0, Link partner Pause=1 AsymDir=1, Local device resolution should be enable PAUSE transmit, disable PAUSE receive. And in case of Local device Pause=1 AsymDir=1, Link partner Pause=1 AsymDir=0, Local device resolution should be enable PAUSE receive, disable PAUSE transmit. Fixes: 9525ae83 ("phylink: add phylink infrastructure") Signed-off-by: Stefan Chulski <stefanc@marvell.com> Reported-by: Shaul Ben-Mayor <shaulb@marvell.com> Acked-by: Russell King <rmk+kernel@armlinux.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Christophe JAILLET authored
We 'allocate' 'count' bytes here. In fact, 'dev_alloc_skb' already add some extra space for padding, so a bit more is allocated. However, we use 1 byte for the KISS command, then copy 'count' bytes, so count+1 bytes. Explicitly allocate and use 1 more byte to be safe. Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Signed-off-by: David S. Miller <davem@davemloft.net>
-
git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfDavid S. Miller authored
Alexei Starovoitov says: ==================== pull-request: bpf 2019-09-06 The following pull-request contains BPF updates for your *net* tree. The main changes are: 1) verifier precision tracking fix, from Alexei. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
-
- 06 Sep, 2019 5 commits
-
-
David S. Miller authored
Merge tag 'wireless-drivers-for-davem-2019-09-05' of git://git.kernel.org/pub/scm/linux/kernel/git/kvalo/wireless-drivers Kalle Valo says: ==================== wireless-drivers fixes for 5.3 Fourth set of fixes for 5.3, and hopefully really the last one. Quite a few CVE fixes this time but at least to my knowledge none of them have a known exploit. mt76 * workaround firmware hang by disabling hardware encryption on MT7630E * disable 5GHz band for MT7630E as it's not working properly mwifiex * fix IE parsing to avoid a heap buffer overflow iwlwifi * fix for QuZ device initialisation rt2x00 * another fix for rekeying * revert a commit causing degradation in rx signal levels rsi * fix a double free ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
-
Radhey Shyam Pandey authored
I am maintaining xilinx axiethernet driver in xilinx tree and would like to maintain it in the mainline kernel as well. Hence adding myself as a maintainer. Also Anirudha and John has moved to new roles, so based on request removing them from the maintainer list. Signed-off-by: Radhey Shyam Pandey <radhey.shyam.pandey@xilinx.com> Acked-by: John Linn <john.linn@xilinx.com> Acked-by: Michal Simek <michal.simek@xilinx.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Eric Dumazet authored
Whenever MQ is not used on a multiqueue device, we experience serious reordering problems. Bisection found the cited commit. The issue can be described this way : - A single qdisc hierarchy is shared by all transmit queues. (eg : tc qdisc replace dev eth0 root fq_codel) - When/if try_bulk_dequeue_skb_slow() dequeues a packet targetting a different transmit queue than the one used to build a packet train, we stop building the current list and save the 'bad' skb (P1) in a special queue. (bad_txq) - When dequeue_skb() calls qdisc_dequeue_skb_bad_txq() and finds this skb (P1), it checks if the associated transmit queues is still in frozen state. If the queue is still blocked (by BQL or NIC tx ring full), we leave the skb in bad_txq and return NULL. - dequeue_skb() calls q->dequeue() to get another packet (P2) The other packet can target the problematic queue (that we found in frozen state for the bad_txq packet), but another cpu just ran TX completion and made room in the txq that is now ready to accept new packets. - Packet P2 is sent while P1 is still held in bad_txq, P1 might be sent at next round. In practice P2 is the lead of a big packet train (P2,P3,P4 ...) filling the BQL budget and delaying P1 by many packets :/ To solve this problem, we have to block the dequeue process as long as the first packet in bad_txq can not be sent. Reordering issues disappear and no side effects have been seen. Fixes: a53851e2 ("net: sched: explicit locking in gso_cpu fallback") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: John Fastabend <john.fastabend@gmail.com> Acked-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsecDavid S. Miller authored
Steffen Klassert says: ==================== pull request (net): ipsec 2019-09-05 1) Several xfrm interface fixes from Nicolas Dichtel: - Avoid an interface ID corruption on changelink. - Fix wrong intterface names in the logs. - Fix a list corruption when changing network namespaces. - Fix unregistation of the underying phydev. 2) Fix a potential warning when merging xfrm_plocy nodes. From Florian Westphal. Please pull or let me know if there are problems. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
-
Zhu Yanjun authored
When testing with a background iperf pushing 1Gbit/sec traffic and running both ifconfig and netstat to collect statistics, some deadlocks occurred. Ifconfig and netstat will call nv_get_stats64 to get software xmit/recv statistics. In the commit f5d827ae ("forcedeth: implement ndo_get_stats64() API"), the normal tx/rx variables is to collect tx/rx statistics. The fix is to replace normal tx/rx variables with per cpu 64-bit variable to collect xmit/recv statistics. The per cpu variable will avoid deadlocks and provide fast efficient statistics updates. In nv_probe, the per cpu variable is initialized. In nv_remove, this per cpu variable is freed. In xmit/recv process, this per cpu variable will be updated. In nv_get_stats64, this per cpu variable on each cpu is added up. Then the driver can get xmit/recv packets statistics. A test runs for several days with this commit, the deadlocks disappear and the performance is better. Tested: - iperf SMP x86_64 -> Client connecting to 1.1.1.108, TCP port 5001 TCP window size: 85.0 KByte (default) ------------------------------------------------------------ [ 3] local 1.1.1.105 port 38888 connected with 1.1.1.108 port 5001 [ ID] Interval Transfer Bandwidth [ 3] 0.0-10.0 sec 1.10 GBytes 943 Mbits/sec ifconfig results: enp0s9 Link encap:Ethernet HWaddr 00:21:28:6f:de:0f inet addr:1.1.1.105 Bcast:0.0.0.0 Mask:255.255.255.0 UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1 RX packets:5774764531 errors:0 dropped:0 overruns:0 frame:0 TX packets:633534193 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:1000 RX bytes:7646159340904 (7.6 TB) TX bytes:11425340407722 (11.4 TB) netstat results: Kernel Interface table Iface MTU Met RX-OK RX-ERR RX-DRP RX-OVR TX-OK TX-ERR TX-DRP TX-OVR Flg ... enp0s9 1500 0 5774764531 0 0 0 633534193 0 0 0 BMRU ... Fixes: f5d827ae ("forcedeth: implement ndo_get_stats64() API") CC: Joe Jin <joe.jin@oracle.com> CC: JUNXIAO_BI <junxiao.bi@oracle.com> Reported-and-tested-by: Nan san <nan.1986san@gmail.com> Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-