- 23 Feb, 2021 33 commits
-
-
Jason A. Donenfeld authored
Having two ring buffers per-peer means that every peer results in two massive ring allocations. On an 8-core x86_64 machine, this commit reduces the per-peer allocation from 18,688 bytes to 1,856 bytes, which is an 90% reduction. Ninety percent! With some single-machine deployments approaching 500,000 peers, we're talking about a reduction from 7 gigs of memory down to 700 megs of memory. In order to get rid of these per-peer allocations, this commit switches to using a list-based queueing approach. Currently GSO fragments are chained together using the skb->next pointer (the skb_list_* singly linked list approach), so we form the per-peer queue around the unused skb->prev pointer (which sort of makes sense because the links are pointing backwards). Use of skb_queue_* is not possible here, because that is based on doubly linked lists and spinlocks. Multiple cores can write into the queue at any given time, because its writes occur in the start_xmit path or in the udp_recv path. But reads happen in a single workqueue item per-peer, amounting to a multi-producer, single-consumer paradigm. The MPSC queue is implemented locklessly and never blocks. However, it is not linearizable (though it is serializable), with a very tight and unlikely race on writes, which, when hit (some tiny fraction of the 0.15% of partial adds on a fully loaded 16-core x86_64 system), causes the queue reader to terminate early. However, because every packet sent queues up the same workqueue item after it is fully added, the worker resumes again, and stopping early isn't actually a problem, since at that point the packet wouldn't have yet been added to the encryption queue. These properties allow us to avoid disabling interrupts or spinning. The design is based on Dmitry Vyukov's algorithm [1]. Performance-wise, ordinarily list-based queues aren't preferable to ringbuffers, because of cache misses when following pointers around. However, we *already* have to follow the adjacent pointers when working through fragments, so there shouldn't actually be any change there. A potential downside is that dequeueing is a bit more complicated, but the ptr_ring structure used prior had a spinlock when dequeueing, so all and all the difference appears to be a wash. Actually, from profiling, the biggest performance hit, by far, of this commit winds up being atomic_add_unless(count, 1, max) and atomic_ dec(count), which account for the majority of CPU time, according to perf. In that sense, the previous ring buffer was superior in that it could check if it was full by head==tail, which the list-based approach cannot do. But all and all, this enables us to get massive memory savings, allowing WireGuard to scale for real world deployments, without taking much of a performance hit. [1] http://www.1024cores.net/home/lock-free-algorithms/queues/intrusive-mpsc-node-based-queueReviewed-by: Dmitry Vyukov <dvyukov@google.com> Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Fixes: e7096c13 ("net: WireGuard secure network tunnel") Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-
Jason A. Donenfeld authored
If skb->protocol doesn't match the actual skb->data header, it's probably not a good idea to pass it off to icmp{,v6}_ndo_send, which is expecting to reply to a valid IP packet. So this commit has that early mismatch case jump to a later error label. Fixes: e7096c13 ("net: WireGuard secure network tunnel") Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-
Jason A. Donenfeld authored
The is_dead boolean is checked for every single packet, while the internal_id member is used basically only for pr_debug messages. So it makes sense to hoist up is_dead into some space formerly unused by a struct hole, while demoting internal_api to below the lowest struct cache line. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-
Jason A. Donenfeld authored
In order to test ndo_start_xmit being called in parallel, explicitly add separate tests, which should all run on different cores. This should help tease out bugs associated with queueing up packets from different cores in parallel. Currently, it hasn't found those types of bugs, but given future planned work, this is a useful regression to avoid. Fixes: e7096c13 ("net: WireGuard secure network tunnel") Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-
Jann Horn authored
The endpoint->src_if4 has nothing to do with fixed-endian numbers; remove the bogus annotation. This was introduced in https://git.zx2c4.com/wireguard-monolithic-historical/commit?id=14e7d0a499a676ec55176c0de2f9fcbd34074a82 in the historical WireGuard repo because the old code used to zero-initialize multiple members as follows: endpoint->src4.s_addr = endpoint->src_if4 = fl.saddr = 0; Because fl.saddr is fixed-endian and an assignment returns a value with the type of its left operand, this meant that sparse detected an assignment between values of different endianness. Since then, this assignment was already split up into separate statements; just the cast survived. Signed-off-by: Jann Horn <jannh@google.com> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-
Antonio Quartulli authored
The definition of IS_ERR() already applies the unlikely() notation when checking the error status of the passed pointer. For this reason there is no need to have the same notation outside of IS_ERR() itself. Clean up code by removing redundant notation. Signed-off-by: Antonio Quartulli <a@unstable.cc> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-
Takeshi Misawa authored
If qrtr_endpoint_register() failed, tun is leaked. Fix this, by freeing tun in error path. syzbot report: BUG: memory leak unreferenced object 0xffff88811848d680 (size 64): comm "syz-executor684", pid 10171, jiffies 4294951561 (age 26.070s) hex dump (first 32 bytes): 80 dd 0a 84 ff ff ff ff 00 00 00 00 00 00 00 00 ................ 90 d6 48 18 81 88 ff ff 90 d6 48 18 81 88 ff ff ..H.......H..... backtrace: [<0000000018992a50>] kmalloc include/linux/slab.h:552 [inline] [<0000000018992a50>] kzalloc include/linux/slab.h:682 [inline] [<0000000018992a50>] qrtr_tun_open+0x22/0x90 net/qrtr/tun.c:35 [<0000000003a453ef>] misc_open+0x19c/0x1e0 drivers/char/misc.c:141 [<00000000dec38ac8>] chrdev_open+0x10d/0x340 fs/char_dev.c:414 [<0000000079094996>] do_dentry_open+0x1e6/0x620 fs/open.c:817 [<000000004096d290>] do_open fs/namei.c:3252 [inline] [<000000004096d290>] path_openat+0x74a/0x1b00 fs/namei.c:3369 [<00000000b8e64241>] do_filp_open+0xa0/0x190 fs/namei.c:3396 [<00000000a3299422>] do_sys_openat2+0xed/0x230 fs/open.c:1172 [<000000002c1bdcef>] do_sys_open fs/open.c:1188 [inline] [<000000002c1bdcef>] __do_sys_openat fs/open.c:1204 [inline] [<000000002c1bdcef>] __se_sys_openat fs/open.c:1199 [inline] [<000000002c1bdcef>] __x64_sys_openat+0x7f/0xe0 fs/open.c:1199 [<00000000f3a5728f>] do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46 [<000000004b38b7ec>] entry_SYSCALL_64_after_hwframe+0x44/0xa9 Fixes: 28fb4e59 ("net: qrtr: Expose tunneling endpoint to user space") Reported-by: syzbot+5d6e4af21385f5cfc56a@syzkaller.appspotmail.com Signed-off-by: Takeshi Misawa <jeliantsurux@gmail.com> Link: https://lore.kernel.org/r/20210221234427.GA2140@DESKTOPSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
Taehee Yoo authored
The debug check must be done after unregister_netdevice_many() call -- the hlist_del_rcu() for this is done inside .ndo_stop. This is the same with commit 0fda7600 ("geneve: move debug check after netdev unregister") Test commands: ip netns del A ip netns add A ip netns add B ip netns exec B ip link add vxlan0 type vxlan vni 100 local 10.0.0.1 \ remote 10.0.0.2 dstport 4789 srcport 4789 4789 ip netns exec B ip link set vxlan0 netns A ip netns exec A ip link set vxlan0 up ip netns del B Splat looks like: [ 73.176249][ T7] ------------[ cut here ]------------ [ 73.178662][ T7] WARNING: CPU: 4 PID: 7 at drivers/net/vxlan.c:4743 vxlan_exit_batch_net+0x52e/0x720 [vxlan] [ 73.182597][ T7] Modules linked in: vxlan openvswitch nsh nf_conncount nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 mlx5_core nfp mlxfw ixgbevf tls sch_fq_codel nf_tables nfnetlink ip_tables x_tables unix [ 73.190113][ T7] CPU: 4 PID: 7 Comm: kworker/u16:0 Not tainted 5.11.0-rc7+ #838 [ 73.193037][ T7] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014 [ 73.196986][ T7] Workqueue: netns cleanup_net [ 73.198946][ T7] RIP: 0010:vxlan_exit_batch_net+0x52e/0x720 [vxlan] [ 73.201509][ T7] Code: 00 01 00 00 0f 84 39 fd ff ff 48 89 ca 48 c1 ea 03 80 3c 1a 00 0f 85 a6 00 00 00 89 c2 48 83 c2 02 49 8b 14 d4 48 85 d2 74 ce <0f> 0b eb ca e8 b9 51 db dd 84 c0 0f 85 4a fe ff ff 48 c7 c2 80 bc [ 73.208813][ T7] RSP: 0018:ffff888100907c10 EFLAGS: 00010286 [ 73.211027][ T7] RAX: 000000000000003c RBX: dffffc0000000000 RCX: ffff88800ec411f0 [ 73.213702][ T7] RDX: ffff88800a278000 RSI: ffff88800fc78c70 RDI: ffff88800fc78070 [ 73.216169][ T7] RBP: ffff88800b5cbdc0 R08: fffffbfff424de61 R09: fffffbfff424de61 [ 73.218463][ T7] R10: ffffffffa126f307 R11: fffffbfff424de60 R12: ffff88800ec41000 [ 73.220794][ T7] R13: ffff888100907d08 R14: ffff888100907c50 R15: ffff88800fc78c40 [ 73.223337][ T7] FS: 0000000000000000(0000) GS:ffff888114800000(0000) knlGS:0000000000000000 [ 73.225814][ T7] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 73.227616][ T7] CR2: 0000562b5cb4f4d0 CR3: 0000000105fbe001 CR4: 00000000003706e0 [ 73.229700][ T7] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 73.231820][ T7] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 73.233844][ T7] Call Trace: [ 73.234698][ T7] ? vxlan_err_lookup+0x3c0/0x3c0 [vxlan] [ 73.235962][ T7] ? ops_exit_list.isra.11+0x93/0x140 [ 73.237134][ T7] cleanup_net+0x45e/0x8a0 [ ... ] Fixes: 57b61127 ("vxlan: speedup vxlan tunnels dismantle") Signed-off-by: Taehee Yoo <ap420073@gmail.com> Link: https://lore.kernel.org/r/20210221154552.11749-1-ap420073@gmail.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
Jakub Kicinski authored
Hayes Wang says: ==================== r8152: minor adjustments These patches are used to adjust the code. ==================== Link: https://lore.kernel.org/r/1394712342-15778-341-Taiwan-albertk@realtek.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
Hayes Wang authored
Add rtl_eee_plus_en() and rtl_green_en(). Signed-off-by: Hayes Wang <hayeswang@realtek.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-
Hayes Wang authored
Some messages are before calling register_netdev(), so replace netif_err() with dev_err(). Signed-off-by: Hayes Wang <hayeswang@realtek.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-
Hayes Wang authored
Return error code if autosuspend_en, eee_get, or eee_set don't exist. Signed-off-by: Hayes Wang <hayeswang@realtek.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-
Hayes Wang authored
U1/U2 shoued be enabled for USB 3.0 or later. The USB 2.0 doesn't support it. Signed-off-by: Hayes Wang <hayeswang@realtek.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-
wenxu authored
Add invalid and reply flags validate in the fl_validate_ct_state. This makes the checking complete if compared to ovs' validate_ct_state(). Signed-off-by: wenxu <wenxu@ucloud.cn> Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Link: https://lore.kernel.org/r/1614064315-364-1-git-send-email-wenxu@ucloud.cnSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
Jakub Kicinski authored
Florian Fainelli says: ==================== net: dsa: Learning fixes for b53/bcm_sf2 This patch series contains a couple of fixes for the b53/bcm_sf2 drivers with respect to configuring learning. The first patch is wiring-up the necessary dsa_switch_ops operations in order to support the offloading of bridge flags. The second patch corrects the switch driver's default learning behavior which was unfortunately wrong from day one. This is submitted against "net" because this is technically a bug fix since ports should not have had learning enabled by default but given this is dependent upon Vladimir's recent br_flags series, there is no Fixes tag provided. I will be providing targeted stable backports that look a bit different. Changes in v2: - added first patch - updated second patch to include BR_LEARNING check in br_flags_pre as a support bridge flag to offload ==================== Link: https://lore.kernel.org/r/20210222223010.2907234-1-f.fainelli@gmail.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
Florian Fainelli authored
Add support for being able to set the learning attribute on port, and make sure that the standalone ports start up with learning disabled. We can remove the code in bcm_sf2 that configured the ports learning attribute because we want the standalone ports to have learning disabled by default and port 7 cannot be bridged, so its learning attribute will not change past its initial configuration. Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Reviewed-by: Vladimir Oltean <olteanv@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-
Florian Fainelli authored
Because bcm_sf2 implements its own dsa_switch_ops we need to export the b53_br_flags_pre(), b53_br_flags() and b53_set_mrouter so we can wire-up them up like they used to be with the former b53_br_egress_floods(). Fixes: a8b659e7 ("net: dsa: act as passthrough for bridge port flags") Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Reviewed-by: Vladimir Oltean <olteanv@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-
Krzysztof Halasa authored
sky2.c driver uses netdev_warn() before the net device is initialized. Fix it by using dev_warn() instead. Signed-off-by: Krzysztof Halasa <khalasa@piap.pl> Link: https://lore.kernel.org/r/m3a6s1r1ul.fsf@t19.piap.plSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
Sieng Piaw Liew authored
In ndo_stop functions, netdev_completed_queue() is called during forced tx reclaim, after netdev_reset_queue(). This may trigger kernel panic if there is any tx skb left. This patch moves netdev_reset_queue() to after tx reclaim, so BQL can complete successfully then reset. Signed-off-by: Sieng Piaw Liew <liew.s.piaw@gmail.com> Fixes: 4c59b0f5 ("bcm63xx_enet: add BQL support") Acked-by: Florian Fainelli <f.fainelli@gmail.com> Link: https://lore.kernel.org/r/20210222013530.1356-1-liew.s.piaw@gmail.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
Jason A. Donenfeld authored
The icmp{,v6}_send functions make all sorts of use of skb->cb, casting it with IPCB or IP6CB, assuming the skb to have come directly from the inet layer. But when the packet comes from the ndo layer, especially when forwarded, there's no telling what might be in skb->cb at that point. As a result, the icmp sending code risks reading bogus memory contents, which can result in nasty stack overflows such as this one reported by a user: panic+0x108/0x2ea __stack_chk_fail+0x14/0x20 __icmp_send+0x5bd/0x5c0 icmp_ndo_send+0x148/0x160 In icmp_send, skb->cb is cast with IPCB and an ip_options struct is read from it. The optlen parameter there is of particular note, as it can induce writes beyond bounds. There are quite a few ways that can happen in __ip_options_echo. For example: // sptr/skb are attacker-controlled skb bytes sptr = skb_network_header(skb); // dptr/dopt points to stack memory allocated by __icmp_send dptr = dopt->__data; // sopt is the corrupt skb->cb in question if (sopt->rr) { optlen = sptr[sopt->rr+1]; // corrupt skb->cb + skb->data soffset = sptr[sopt->rr+2]; // corrupt skb->cb + skb->data // this now writes potentially attacker-controlled data, over // flowing the stack: memcpy(dptr, sptr+sopt->rr, optlen); } In the icmpv6_send case, the story is similar, but not as dire, as only IP6CB(skb)->iif and IP6CB(skb)->dsthao are used. The dsthao case is worse than the iif case, but it is passed to ipv6_find_tlv, which does a bit of bounds checking on the value. This is easy to simulate by doing a `memset(skb->cb, 0x41, sizeof(skb->cb));` before calling icmp{,v6}_ndo_send, and it's only by good fortune and the rarity of icmp sending from that context that we've avoided reports like this until now. For example, in KASAN: BUG: KASAN: stack-out-of-bounds in __ip_options_echo+0xa0e/0x12b0 Write of size 38 at addr ffff888006f1f80e by task ping/89 CPU: 2 PID: 89 Comm: ping Not tainted 5.10.0-rc7-debug+ #5 Call Trace: dump_stack+0x9a/0xcc print_address_description.constprop.0+0x1a/0x160 __kasan_report.cold+0x20/0x38 kasan_report+0x32/0x40 check_memory_region+0x145/0x1a0 memcpy+0x39/0x60 __ip_options_echo+0xa0e/0x12b0 __icmp_send+0x744/0x1700 Actually, out of the 4 drivers that do this, only gtp zeroed the cb for the v4 case, while the rest did not. So this commit actually removes the gtp-specific zeroing, while putting the code where it belongs in the shared infrastructure of icmp{,v6}_ndo_send. This commit fixes the issue by passing an empty IPCB or IP6CB along to the functions that actually do the work. For the icmp_send, this was already trivial, thanks to __icmp_send providing the plumbing function. For icmpv6_send, this required a tiny bit of refactoring to make it behave like the v4 case, after which it was straight forward. Fixes: a2b78e9b ("sunvnet: generate ICMP PTMUD messages for smaller port MTUs") Reported-by: SinYu <liuxyon@gmail.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://lore.kernel.org/netdev/CAF=yD-LOF116aHub6RMe8vB8ZpnrrnoTdqhobEx+bvoA8AsP0w@mail.gmail.com/T/Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Link: https://lore.kernel.org/r/20210223131858.72082-1-Jason@zx2c4.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queueJakub Kicinski authored
Tony Nguyen says: ==================== Intel Wired LAN Driver Updates 2021-02-19 This series contains updates to i40e driver only. Slawomir resolves an issue with the IPv6 extension headers being processed incorrectly. Keita Suzuki fixes a memory leak on probe failure. Mateusz initializes AQ command structures to zero to comply with spec, fixes FW flow control settings being overwritten and resolves an issue with adding VLAN filters after enabling FW LLDP. He also adds an additional check when adding TC filter as the current check doesn't properly distinguish between IPv4 and IPv6. Sylwester removes setting disabled bit when syncing filters as this prevents VFs from completing setup. Norbert cleans up sparse warnings. v2: - Fix fixes tag on patch 7 * '40GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue: i40e: Fix endianness conversions i40e: Fix add TC filter for IPv6 i40e: Fix VFs not created i40e: Fix addition of RX filters after enabling FW LLDP agent i40e: Fix overwriting flow control settings during driver loading i40e: Add zero-initialization of AQ command structures i40e: Fix memory leak in i40e_probe i40e: Fix flow for IPv6 next header (extension header) ==================== Link: https://lore.kernel.org/r/20210219213606.2567536-1-anthony.l.nguyen@intel.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
Chuhong Yuan authored
mlx4_do_mirror_rule() forgets to call mlx4_free_cmd_mailbox() to free the memory region allocated by mlx4_alloc_cmd_mailbox() before an exit. Add the missed call to fix it. Fixes: 78efed27 ("net/mlx4_core: Support mirroring VF DMFS rules on both ports") Signed-off-by: Chuhong Yuan <hslester96@gmail.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Link: https://lore.kernel.org/r/20210221143559.390277-1-hslester96@gmail.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
Song, Yoong Siang authored
When link speed is not 100 Mbps, port transmit rate and speed divider are set to 8 and 1000000 respectively. These values are incorrect for CBS idleslope and sendslope HW values calculation if the link speed is not 1 Gbps. This patch adds switch statement to set the values of port transmit rate and speed divider for 10 Gbps, 5 Gbps, 2.5 Gbps, 1 Gbps, and 100 Mbps. Note that CBS is not supported at 10 Mbps. Fixes: bc41a668 ("net: stmmac: tc: Remove the speed dependency") Fixes: 1f705bc6 ("net: stmmac: Add support for CBS QDISC") Signed-off-by: Song, Yoong Siang <yoong.siang.song@intel.com> Link: https://lore.kernel.org/r/1613655653-11755-1-git-send-email-yoong.siang.song@intel.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
Jakub Kicinski authored
Paolo Abeni says: ==================== mptcp: a bunch of fixes This series bundle a few MPTCP fixes for the current net tree. They have been detected via syzkaller and packetdrill Patch 1 fixes a slow close for orphaned sockets Patch 2 fixes another hangup at close time, when no data was actually transmitted before close Patch 3 fixes a memory leak with unusual sockopts Patch 4 fixes stray wake-ups on listener sockets ==================== Link: https://lore.kernel.org/r/cover.1613755058.git.pabeni@redhat.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
Paolo Abeni authored
MPJ subflows are not exposed as fds to user spaces. As such, incoming MPJ subflows are removed from the accept queue by tcp_check_req()/tcp_get_cookie_sock(). Later tcp_child_process() invokes subflow_data_ready() on the parent socket regardless of the subflow kind, leading to poll wakeups even if the later accept will block. Address the issue by double-checking the queue state before waking the user-space. Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/164Reported-by: Dr. David Alan Gilbert <dgilbert@redhat.com> Fixes: f296234c ("mptcp: Add handling of incoming MP_JOIN requests") Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-
Florian Westphal authored
mptcp re-used inet(6)_release, so the subflow sockets are ignored. Need to invoke ip(v6)_mc_drop_socket function to ensure mcast join resources get free'd. Fixes: 717e79c8 ("mptcp: Add setsockopt()/getsockopt() socket operations") Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/110Acked-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-
Paolo Abeni authored
If the msk is closed before sending or receiving any data, no DATA_FIN is generated, instead an MPC ack packet is crafted out. In the above scenario, the MPTCP protocol creates and sends a pure ack and such packets matches also the criteria for an MPC ack and the protocol tries first to insert MPC options, leading to the described error. This change addresses the issue by avoiding the insertion of an MPC option for DATA_FIN packets or if the sub-flow is not established. To avoid doing multiple times the same test, fetch the data_fin flag in a bool variable and pass it to both the interested helpers. Fixes: 6d0060f6 ("mptcp: Write MPTCP DSS headers to outgoing data packets") Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-
Paolo Abeni authored
Currently we move orphaned msk sockets directly from FIN_WAIT2 state to CLOSE, with the rationale that incoming additional data could be just dropped by the TCP stack/TW sockets. Anyhow we miss sending MPTCP-level ack on incoming DATA_FIN, and that may hang the peers. Fixes: e16163b6 ("mptcp: refactor shutdown and close") Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-
Florian Fainelli authored
The core DSA framework uses hsr_is_master() which would not resolve to a valid symbol if HSR is built-into the kernel and DSA is a module. Fixes: 18596f50 ("net: dsa: add support for offloading HSR") Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Reviewed-by: George McCollister <george.mccollister@gmail.com> Reviewed-by: Vladimir Oltean <olteanv@gmail.com> Tested-by: Vladimir Oltean <olteanv@gmail.com> Link: https://lore.kernel.org/r/20210220051222.15672-1-f.fainelli@gmail.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
Dan Carpenter authored
The comments to phy_select_page() say that "phy_restore_page() must always be called after this, irrespective of success or failure of this call." If we don't call phy_restore_page() then we are still holding the phy_lock_mdio_bus() so it eventually leads to a dead lock. Fixes: 32ab60e5 ("net: phy: icplus: add MDI/MDIX support for IP101A/G") Fixes: f9bc51e6 ("net: phy: icplus: fix paged register access") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: Michael Walle <michael@walle.cc> Reviewed-by: Russell King <rmk+kernel@armlinux.org.uk> Link: https://lore.kernel.org/r/YC+OpFGsDPXPnXM5@mwandaSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
Stefan Chulski authored
PPv2 loopback port doesn't support RSS, so we should skip RSS configurations for this port. Signed-off-by: Stefan Chulski <stefanc@marvell.com> Reviewed-by: Marcin Wojtas <mw@semihalf.com> Link: https://lore.kernel.org/r/1613652123-19021-1-git-send-email-stefanc@marvell.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
Camelia Groza authored
The current use of container_of is flawed and unnecessary. Obtain the dpaa_napi_portal reference from the private percpu data instead. Fixes: a1e031ff ("dpaa_eth: add XDP_REDIRECT support") Reported-by: Sascha Hauer <s.hauer@pengutronix.de> Signed-off-by: Camelia Groza <camelia.groza@nxp.com> Acked-by: Madalin Bucur <madalin.bucur@oss.nxp.com> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com> Link: https://lore.kernel.org/r/20210218182106.22613-1-camelia.groza@nxp.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
DENG Qingfang authored
2 bytes of the MTU are reserved for Atheros DSA tag, but DSA core has already handled that since commit dc0fe7d4. Remove the unnecessary reservation. Fixes: d51b6ce4 ("net: ethernet: add ag71xx driver") Signed-off-by: DENG Qingfang <dqfext@gmail.com> Reviewed-by: Oleksij Rempel <o.rempel@pengutronix.de> Link: https://lore.kernel.org/r/20210218034514.3421-1-dqfext@gmail.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>
-
- 21 Feb, 2021 7 commits
-
-
Dan Carpenter authored
This code does not allocate enough memory for the NUL terminator so it ends up putting it one character beyond the end of the buffer. Fixes: 8756828a ("octeontx2-af: Add NPA aura and pool contexts to debugfs") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds authored
Pull performance event updates from Ingo Molnar: - Add CPU-PMU support for Intel Sapphire Rapids CPUs - Extend the perf ABI with PERF_SAMPLE_WEIGHT_STRUCT, to offer two-parameter sampling event feedback. Not used yet, but is intended for Golden Cove CPU-PMU, which can provide both the instruction latency and the cache latency information for memory profiling events. - Remove experimental, default-disabled perfmon-v4 counter_freezing support that could only be enabled via a boot option. The hardware is hopelessly broken, we'd like to make sure nobody starts relying on this, as it would only end in tears. - Fix energy/power events on Intel SPR platforms - Simplify the uprobes resume_execution() logic - Misc smaller fixes. * tag 'perf-core-2021-02-17' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf/x86/rapl: Fix psys-energy event on Intel SPR platform perf/x86/rapl: Only check lower 32bits for RAPL energy counters perf/x86/rapl: Add msr mask support perf/x86/kvm: Add Cascade Lake Xeon steppings to isolation_ucodes[] perf/x86/intel: Support CPUID 10.ECX to disable fixed counters perf/x86/intel: Add perf core PMU support for Sapphire Rapids perf/x86/intel: Filter unsupported Topdown metrics event perf/x86/intel: Factor out intel_update_topdown_event() perf/core: Add PERF_SAMPLE_WEIGHT_STRUCT perf/intel: Remove Perfmon-v4 counter_freezing support x86/perf: Use static_call for x86_pmu.guest_get_msrs perf/x86/intel/uncore: With > 8 nodes, get pci bus die id from NUMA info perf/x86/intel/uncore: Store the logical die id instead of the physical die id. x86/kprobes: Do not decode opcode in resume_execution()
-
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds authored
Pull scheduler updates from Ingo Molnar: "Core scheduler updates: - Add CONFIG_PREEMPT_DYNAMIC: this in its current form adds the preempt=none/voluntary/full boot options (default: full), to allow distros to build a PREEMPT kernel but fall back to close to PREEMPT_VOLUNTARY (or PREEMPT_NONE) runtime scheduling behavior via a boot time selection. There's also the /debug/sched_debug switch to do this runtime. This feature is implemented via runtime patching (a new variant of static calls). The scope of the runtime patching can be best reviewed by looking at the sched_dynamic_update() function in kernel/sched/core.c. ( Note that the dynamic none/voluntary mode isn't 100% identical, for example preempt-RCU is available in all cases, plus the preempt count is maintained in all models, which has runtime overhead even with the code patching. ) The PREEMPT_VOLUNTARY/PREEMPT_NONE models, used by the vast majority of distributions, are supposed to be unaffected. - Fix ignored rescheduling after rcu_eqs_enter(). This is a bug that was found via rcutorture triggering a hang. The bug is that rcu_idle_enter() may wake up a NOCB kthread, but this happens after the last generic need_resched() check. Some cpuidle drivers fix it by chance but many others don't. In true 2020 fashion the original bug fix has grown into a 5-patch scheduler/RCU fix series plus another 16 RCU patches to address the underlying issue of missed preemption events. These are the initial fixes that should fix current incarnations of the bug. - Clean up rbtree usage in the scheduler, by providing & using the following consistent set of rbtree APIs: partial-order; less() based: - rb_add(): add a new entry to the rbtree - rb_add_cached(): like rb_add(), but for a rb_root_cached total-order; cmp() based: - rb_find(): find an entry in an rbtree - rb_find_add(): find an entry, and add if not found - rb_find_first(): find the first (leftmost) matching entry - rb_next_match(): continue from rb_find_first() - rb_for_each(): iterate a sub-tree using the previous two - Improve the SMP/NUMA load-balancer: scan for an idle sibling in a single pass. This is a 4-commit series where each commit improves one aspect of the idle sibling scan logic. - Improve the cpufreq cooling driver by getting the effective CPU utilization metrics from the scheduler - Improve the fair scheduler's active load-balancing logic by reducing the number of active LB attempts & lengthen the load-balancing interval. This improves stress-ng mmapfork performance. - Fix CFS's estimated utilization (util_est) calculation bug that can result in too high utilization values Misc updates & fixes: - Fix the HRTICK reprogramming & optimization feature - Fix SCHED_SOFTIRQ raising race & warning in the CPU offlining code - Reduce dl_add_task_root_domain() overhead - Fix uprobes refcount bug - Process pending softirqs in flush_smp_call_function_from_idle() - Clean up task priority related defines, remove *USER_*PRIO and USER_PRIO() - Simplify the sched_init_numa() deduplication sort - Documentation updates - Fix EAS bug in update_misfit_status(), which degraded the quality of energy-balancing - Smaller cleanups" * tag 'sched-core-2021-02-17' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (51 commits) sched,x86: Allow !PREEMPT_DYNAMIC entry/kvm: Explicitly flush pending rcuog wakeup before last rescheduling point entry: Explicitly flush pending rcuog wakeup before last rescheduling point rcu/nocb: Trigger self-IPI on late deferred wake up before user resume rcu/nocb: Perform deferred wake up before last idle's need_resched() check rcu: Pull deferred rcuog wake up to rcu_eqs_enter() callers sched/features: Distinguish between NORMAL and DEADLINE hrtick sched/features: Fix hrtick reprogramming sched/deadline: Reduce rq lock contention in dl_add_task_root_domain() uprobes: (Re)add missing get_uprobe() in __find_uprobe() smp: Process pending softirqs in flush_smp_call_function_from_idle() sched: Harden PREEMPT_DYNAMIC static_call: Allow module use without exposing static_call_key sched: Add /debug/sched_preempt preempt/dynamic: Support dynamic preempt with preempt= boot option preempt/dynamic: Provide irqentry_exit_cond_resched() static call preempt/dynamic: Provide preempt_schedule[_notrace]() static calls preempt/dynamic: Provide cond_resched() and might_resched() static calls preempt: Introduce CONFIG_PREEMPT_DYNAMIC static_call: Provide DEFINE_STATIC_CALL_RET0() ...
-
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds authored
Pull tlb gather updates from Ingo Molnar: "Theses fix MM (soft-)dirty bit management in the procfs code & clean up the TLB gather API" * tag 'core-mm-2021-02-17' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/ldt: Use tlb_gather_mmu_fullmm() when freeing LDT page-tables tlb: arch: Remove empty __tlb_remove_tlb_entry() stubs tlb: mmu_gather: Remove start/end arguments from tlb_gather_mmu() tlb: mmu_gather: Introduce tlb_gather_mmu_fullmm() tlb: mmu_gather: Remove unused start/end arguments from tlb_finish_mmu() mm: proc: Invalidate TLB after clearing soft-dirty page state
-
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds authored
Pull locking updates from Ingo Molnar: "Core locking primitives updates: - Remove mutex_trylock_recursive() from the API - no users left - Simplify + constify the futex code a bit Lockdep updates: - Teach lockdep about local_lock_t - Add CONFIG_DEBUG_IRQFLAGS=y debug config option to check for potentially unsafe IRQ mask restoration patterns. (I.e. calling raw_local_irq_restore() with IRQs enabled.) - Add wait context self-tests - Fix graph lock corner case corrupting internal data structures - Fix noinstr annotations LKMM updates: - Simplify the litmus tests - Documentation fixes KCSAN updates: - Re-enable KCSAN instrumentation in lib/random32.c Misc fixes: - Don't branch-trace static label APIs - DocBook fix - Remove stale leftover empty file" * tag 'locking-core-2021-02-17' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (24 commits) checkpatch: Don't check for mutex_trylock_recursive() locking/mutex: Kill mutex_trylock_recursive() s390: Use arch_local_irq_{save,restore}() in early boot code lockdep: Noinstr annotate warn_bogus_irq_restore() locking/lockdep: Avoid unmatched unlock locking/rwsem: Remove empty rwsem.h locking/rtmutex: Add missing kernel-doc markup futex: Remove unneeded gotos futex: Change utime parameter to be 'const ... *' lockdep: report broken irq restoration jump_label: Do not profile branch annotations locking: Add Reviewers locking/selftests: Add local_lock inversion tests locking/lockdep: Exclude local_lock_t from IRQ inversions locking/lockdep: Clean up check_redundant() a bit locking/lockdep: Add a skip() function to __bfs() locking/lockdep: Mark local_lock_t locking/selftests: More granular debug_locks_verbose lockdep/selftest: Add wait context selftests tools/memory-model: Fix typo in klitmus7 compatibility table ...
-
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds authored
Pull RCU updates from Ingo Molnar: "These are the latest RCU updates for v5.12: - Documentation updates. - Miscellaneous fixes. - kfree_rcu() updates: Addition of mem_dump_obj() to provide allocator return addresses to more easily locate bugs. This has a couple of RCU-related commits, but is mostly MM. Was pulled in with akpm's agreement. - Per-callback-batch tracking of numbers of callbacks, which enables better debugging information and smarter reactions to large numbers of callbacks. - The first round of changes to allow CPUs to be runtime switched from and to callback-offloaded state. - CONFIG_PREEMPT_RT-related changes. - RCU CPU stall warning updates. - Addition of polling grace-period APIs for SRCU. - Torture-test and torture-test scripting updates, including a "torture everything" script that runs rcutorture, locktorture, scftorture, rcuscale, and refscale. Plus does an allmodconfig build. - nolibc fixes for the torture tests" * tag 'core-rcu-2021-02-17' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (130 commits) percpu_ref: Dump mem_dump_obj() info upon reference-count underflow rcu: Make call_rcu() print mem_dump_obj() info for double-freed callback mm: Make mem_obj_dump() vmalloc() dumps include start and length mm: Make mem_dump_obj() handle vmalloc() memory mm: Make mem_dump_obj() handle NULL and zero-sized pointers mm: Add mem_dump_obj() to print source of memory block tools/rcutorture: Fix position of -lgcc in mkinitrd.sh tools/nolibc: Fix position of -lgcc in the documented example tools/nolibc: Emit detailed error for missing alternate syscall number definitions tools/nolibc: Remove incorrect definitions of __ARCH_WANT_* tools/nolibc: Get timeval, timespec and timezone from linux/time.h tools/nolibc: Implement poll() based on ppoll() tools/nolibc: Implement fork() based on clone() tools/nolibc: Make getpgrp() fall back to getpgid(0) tools/nolibc: Make dup2() rely on dup3() when available tools/nolibc: Add the definition for dup() rcutorture: Add rcutree.use_softirq=0 to RUDE01 and TASKS01 torture: Maintain torture-specific set of CPUs-online books torture: Clean up after torture-test CPU hotplugging rcutorture: Make object_debug also double call_rcu() heap object ...
-
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds authored
Pull timer updates from Thomas Gleixner: "Time and timer updates: - Instead of new drivers remove tango, sirf, u300 and atlas drivers - Add suspend/resume support for microchip pit64b - The usual fixes, improvements and cleanups here and there" * tag 'timers-core-2021-02-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: timens: Delete no-op time_ns_init() alarmtimer: Update kerneldoc clocksource/drivers/timer-microchip-pit64b: Add clocksource suspend/resume clocksource/drivers/prima: Remove sirf prima driver clocksource/drivers/atlas: Remove sirf atlas driver clocksource/drivers/tango: Remove tango driver clocksource/drivers/u300: Remove the u300 driver dt-bindings: timer: nuvoton: Clarify that interrupt of timer 0 should be specified clocksource/drivers/davinci: Move pr_fmt() before the includes clocksource/drivers/efm32: Drop unused timer code
-