- 24 Sep, 2015 1 commit
-
-
Herbert Xu authored
On Mon, Sep 21, 2015 at 02:20:22PM -0400, Tejun Heo wrote: > > store_release and load_acquire are different from the usual memory > barriers and can't be paired this way. You have to pair store_release > and load_acquire. Besides, it isn't a particularly good idea to OK I've decided to drop the acquire/release helpers as they don't help us at all and simply pessimises the code by using full memory barriers (on some architectures) where only a write or read barrier is needed. > depend on memory barriers embedded in other data structures like the > above. Here, especially, rhashtable_insert() would have write barrier > *before* the entry is hashed not necessarily *after*, which means that > in the above case, a socket which appears to have set bound to a > reader might not visible when the reader tries to look up the socket > on the hashtable. But you are right we do need an explicit write barrier here to ensure that the hashing is visible. > There's no reason to be overly smart here. This isn't a crazy hot > path, write barriers tend to be very cheap, store_release more so. > Please just do smp_store_release() and note what it's paired with. It's not about being overly smart. It's about actually understanding what's going on with the code. I've seen too many instances of people simply sprinkling synchronisation primitives around without any knowledge of what is happening underneath, which is just a recipe for creating hard-to-debug races. > > @@ -1539,7 +1546,7 @@ static int netlink_bind(struct socket *sock, struct sockaddr *addr, > > } > > } > > > > - if (!nlk->portid) { > > + if (!nlk->bound) { > > I don't think you can skip load_acquire here just because this is the > second deref of the variable. That doesn't change anything. Race > condition could still happen between the first and second tests and > skipping the second would lead to the same kind of bug. The reason this one is OK is because we do not use nlk->portid or try to get nlk from the hash table before we return to user-space. However, there is a real bug here that none of these acquire/release helpers discovered. The two bound tests here used to be a single one. Now that they are separate it is entirely possible for another thread to come in the middle and bind the socket. So we need to repeat the portid check in order to maintain consistency. > > @@ -1587,7 +1594,7 @@ static int netlink_connect(struct socket *sock, struct sockaddr *addr, > > !netlink_allowed(sock, NL_CFG_F_NONROOT_SEND)) > > return -EPERM; > > > > - if (!nlk->portid) > > + if (!nlk->bound) > > Don't we need load_acquire here too? Is this path holding a lock > which makes that unnecessary? Ditto. ---8<--- The commit 1f770c0a ("netlink: Fix autobind race condition that leads to zero port ID") created some new races that can occur due to inconcsistencies between the two port IDs. Tejun is right that a barrier is unavoidable. Therefore I am reverting to the original patch that used a boolean to indicate that a user netlink socket has been bound. Barriers have been added where necessary to ensure that a valid portid and the hashed socket is visible. I have also changed netlink_insert to only return EBUSY if the socket is bound to a portid different to the requested one. This combined with only reading nlk->bound once in netlink_bind fixes a race where two threads that bind the socket at the same time with different port IDs may both succeed. Fixes: 1f770c0a ("netlink: Fix autobind race condition that leads to zero port ID") Reported-by: Tejun Heo <tj@kernel.org> Reported-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Nacked-by: Tejun Heo <tj@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
-
- 23 Sep, 2015 23 commits
-
-
John W. Linville authored
This is primarily for consistancy with vxlan and other tunnels which use network byte order for similar parameters. Signed-off-by: John W. Linville <linville@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
David Woodhouse authored
We are seeing unexplained TX timeouts under heavy load. Let's try to get a better idea of what's going on. Signed-off-by: David Woodhouse <David.Woodhouse@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
David Woodhouse authored
The low 16 bits of the 'opts1' field in the TX descriptor are supposed to still contain the buffer length when the descriptor is handed back to us. In practice, at least on my hardware, they don't. So stash the original value of the opts1 field and get the length to unmap from there. There are other ways we could have worked out the length, but I actually want a stash of the opts1 field anyway so that I can dump it alongside the contents of the descriptor ring when we suffer a TX timeout. Signed-off-by: David Woodhouse <David.Woodhouse@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
David Woodhouse authored
We calculate the value of the opts1 descriptor field in three different places. With two different behaviours when given an invalid packet to be checksummed — none of them correct. Sort that out. Signed-off-by: David Woodhouse <David.Woodhouse@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
David Woodhouse authored
When sending a TSO frame in multiple buffers, we were neglecting to set the first descriptor up in TSO mode. Signed-off-by: David Woodhouse <David.Woodhouse@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
David Woodhouse authored
After a certain amount of staring at the debug output of this driver, I realised it was lying to me. Signed-off-by: David Woodhouse <David.Woodhouse@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
David Woodhouse authored
If an RX interrupt was already received but NAPI has not yet run when the RX timeout happens, we end up in cp_tx_timeout() with RX interrupts already disabled. Blindly re-enabling them will cause an IRQ storm. (This is made particularly horrid by the fact that cp_interrupt() always returns that it's handled the interrupt, even when it hasn't actually done anything. If it didn't do that, the core IRQ code would have detected the storm and handled it, I'd have had a clear smoking gun backtrace instead of just a spontaneously resetting router, and I'd have at *least* two days of my life back. Changing the return value of cp_interrupt() will be argued about under separate cover.) Unconditionally leave RX interrupts disabled after the reset, and schedule NAPI to check the receive ring and re-enable them. Signed-off-by: David Woodhouse <David.Woodhouse@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
David S. Miller authored
Murali Karicheri says: ==================== net: netcp: a set of bug fixes This patch series fixes a set of issues in netcp driver seen during internal testing of the driver. While at it, do some clean up as well. The fixes are tested on K2HK, K2L and K2E EVMs and the boot up logs can be seen at http://pastebin.ubuntu.com/12533100/ ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
-
Karicheri, Muralidharan authored
A deadlock trace is seen in netcp driver with lockup detector enabled. The trace log is provided below for reference. This patch fixes the bug by removing the usage of netcp_modules_lock within ndo_ops functions. ndo_{open/close/ioctl)() is already called with rtnl_lock held. So there is no need to hold another mutex for serialization across processes on multiple cores. So remove use of netcp_modules_lock mutex from these ndo ops functions. ndo_set_rx_mode() shouldn't be using a mutex as it is called from atomic context. In the case of ndo_set_rx_mode(), there can be call to this API without rtnl_lock held from an atomic context. As the underlying modules are expected to add address to a hardware table, it is to be protected across concurrent updates and hence a spin lock is used to synchronize the access. Same with ndo_vlan_rx_add_vid() & ndo_vlan_rx_kill_vid(). Probably the netcp_modules_lock is used to protect the module not being removed as part of rmmod. Currently this is not fully implemented and assumes the interface is brought down before doing rmmod of modules. The support for rmmmod while interface is up is expected in a future patch set when additional modules such as pa, qos are added. For now all of the tests such as if up/down, reboot, iperf works fine with this patch applied. Deadlock trace seen with lockup detector enabled is shown below for reference. [ 16.863014] ====================================================== [ 16.869183] [ INFO: possible circular locking dependency detected ] [ 16.875441] 4.1.6-01265-gfb1e101 #1 Tainted: G W [ 16.881176] ------------------------------------------------------- [ 16.887432] ifconfig/1662 is trying to acquire lock: [ 16.892386] (netcp_modules_lock){+.+.+.}, at: [<c03e8110>] netcp_ndo_open+0x168/0x518 [ 16.900321] [ 16.900321] but task is already holding lock: [ 16.906144] (rtnl_mutex){+.+.+.}, at: [<c053a418>] devinet_ioctl+0xf8/0x7e4 [ 16.913206] [ 16.913206] which lock already depends on the new lock. [ 16.913206] [ 16.921372] [ 16.921372] the existing dependency chain (in reverse order) is: [ 16.928844] -> #1 (rtnl_mutex){+.+.+.}: [ 16.932865] [<c06023f0>] mutex_lock_nested+0x68/0x4a8 [ 16.938521] [<c04c5758>] register_netdev+0xc/0x24 [ 16.943831] [<c03e65c0>] netcp_module_probe+0x214/0x2ec [ 16.949660] [<c03e8a54>] netcp_register_module+0xd4/0x140 [ 16.955663] [<c089654c>] keystone_gbe_init+0x10/0x28 [ 16.961233] [<c000977c>] do_one_initcall+0xb8/0x1f8 [ 16.966714] [<c0867e04>] kernel_init_freeable+0x148/0x1e8 [ 16.972720] [<c05f9994>] kernel_init+0xc/0xe8 [ 16.977682] [<c0010038>] ret_from_fork+0x14/0x3c [ 16.982905] -> #0 (netcp_modules_lock){+.+.+.}: [ 16.987619] [<c006eab0>] lock_acquire+0x118/0x320 [ 16.992928] [<c06023f0>] mutex_lock_nested+0x68/0x4a8 [ 16.998582] [<c03e8110>] netcp_ndo_open+0x168/0x518 [ 17.004064] [<c04c48f0>] __dev_open+0xa8/0x10c [ 17.009112] [<c04c4b74>] __dev_change_flags+0x94/0x144 [ 17.014853] [<c04c4c3c>] dev_change_flags+0x18/0x48 [ 17.020334] [<c053a9fc>] devinet_ioctl+0x6dc/0x7e4 [ 17.025729] [<c04a59ec>] sock_ioctl+0x1d0/0x2a8 [ 17.030865] [<c0142844>] do_vfs_ioctl+0x41c/0x688 [ 17.036173] [<c0142ae4>] SyS_ioctl+0x34/0x5c [ 17.041046] [<c000ff60>] ret_fast_syscall+0x0/0x54 [ 17.046441] [ 17.046441] other info that might help us debug this: [ 17.046441] [ 17.054434] Possible unsafe locking scenario: [ 17.054434] [ 17.060343] CPU0 CPU1 [ 17.064862] ---- ---- [ 17.069381] lock(rtnl_mutex); [ 17.072522] lock(netcp_modules_lock); [ 17.078875] lock(rtnl_mutex); [ 17.084532] lock(netcp_modules_lock); [ 17.088366] [ 17.088366] *** DEADLOCK *** [ 17.088366] [ 17.094279] 1 lock held by ifconfig/1662: [ 17.098278] #0: (rtnl_mutex){+.+.+.}, at: [<c053a418>] devinet_ioctl+0xf8/0x7e4 [ 17.105774] [ 17.105774] stack backtrace: [ 17.110124] CPU: 1 PID: 1662 Comm: ifconfig Tainted: G W 4.1.6-01265-gfb1e101 #1 [ 17.118637] Hardware name: Keystone [ 17.122123] [<c00178e4>] (unwind_backtrace) from [<c0013cbc>] (show_stack+0x10/0x14) [ 17.129862] [<c0013cbc>] (show_stack) from [<c05ff450>] (dump_stack+0x84/0xc4) [ 17.137079] [<c05ff450>] (dump_stack) from [<c0068e34>] (print_circular_bug+0x210/0x330) [ 17.145161] [<c0068e34>] (print_circular_bug) from [<c006ab7c>] (validate_chain.isra.35+0xf98/0x13ac) [ 17.154372] [<c006ab7c>] (validate_chain.isra.35) from [<c006da60>] (__lock_acquire+0x52c/0xcc0) [ 17.163149] [<c006da60>] (__lock_acquire) from [<c006eab0>] (lock_acquire+0x118/0x320) [ 17.171058] [<c006eab0>] (lock_acquire) from [<c06023f0>] (mutex_lock_nested+0x68/0x4a8) [ 17.179140] [<c06023f0>] (mutex_lock_nested) from [<c03e8110>] (netcp_ndo_open+0x168/0x518) [ 17.187484] [<c03e8110>] (netcp_ndo_open) from [<c04c48f0>] (__dev_open+0xa8/0x10c) [ 17.195133] [<c04c48f0>] (__dev_open) from [<c04c4b74>] (__dev_change_flags+0x94/0x144) [ 17.203129] [<c04c4b74>] (__dev_change_flags) from [<c04c4c3c>] (dev_change_flags+0x18/0x48) [ 17.211560] [<c04c4c3c>] (dev_change_flags) from [<c053a9fc>] (devinet_ioctl+0x6dc/0x7e4) [ 17.219729] [<c053a9fc>] (devinet_ioctl) from [<c04a59ec>] (sock_ioctl+0x1d0/0x2a8) [ 17.227378] [<c04a59ec>] (sock_ioctl) from [<c0142844>] (do_vfs_ioctl+0x41c/0x688) [ 17.234939] [<c0142844>] (do_vfs_ioctl) from [<c0142ae4>] (SyS_ioctl+0x34/0x5c) [ 17.242242] [<c0142ae4>] (SyS_ioctl) from [<c000ff60>] (ret_fast_syscall+0x0/0x54) [ 17.258855] netcp-1.0 2620110.netcp eth0: Link is Up - 1Gbps/Full - flow control off [ 17.271282] BUG: sleeping function called from invalid context at kernel/locking/mutex.c:616 [ 17.279712] in_atomic(): 1, irqs_disabled(): 0, pid: 1662, name: ifconfig [ 17.286500] INFO: lockdep is turned off. [ 17.290413] Preemption disabled at:[< (null)>] (null) [ 17.295728] [ 17.297214] CPU: 1 PID: 1662 Comm: ifconfig Tainted: G W 4.1.6-01265-gfb1e101 #1 [ 17.305735] Hardware name: Keystone [ 17.309223] [<c00178e4>] (unwind_backtrace) from [<c0013cbc>] (show_stack+0x10/0x14) [ 17.316970] [<c0013cbc>] (show_stack) from [<c05ff450>] (dump_stack+0x84/0xc4) [ 17.324194] [<c05ff450>] (dump_stack) from [<c06023b0>] (mutex_lock_nested+0x28/0x4a8) [ 17.332112] [<c06023b0>] (mutex_lock_nested) from [<c03e9840>] (netcp_set_rx_mode+0x160/0x210) [ 17.340724] [<c03e9840>] (netcp_set_rx_mode) from [<c04c483c>] (dev_set_rx_mode+0x1c/0x28) [ 17.348982] [<c04c483c>] (dev_set_rx_mode) from [<c04c490c>] (__dev_open+0xc4/0x10c) [ 17.356724] [<c04c490c>] (__dev_open) from [<c04c4b74>] (__dev_change_flags+0x94/0x144) [ 17.364729] [<c04c4b74>] (__dev_change_flags) from [<c04c4c3c>] (dev_change_flags+0x18/0x48) [ 17.373166] [<c04c4c3c>] (dev_change_flags) from [<c053a9fc>] (devinet_ioctl+0x6dc/0x7e4) [ 17.381344] [<c053a9fc>] (devinet_ioctl) from [<c04a59ec>] (sock_ioctl+0x1d0/0x2a8) [ 17.388994] [<c04a59ec>] (sock_ioctl) from [<c0142844>] (do_vfs_ioctl+0x41c/0x688) [ 17.396563] [<c0142844>] (do_vfs_ioctl) from [<c0142ae4>] (SyS_ioctl+0x34/0x5c) [ 17.403873] [<c0142ae4>] (SyS_ioctl) from [<c000ff60>] (ret_fast_syscall+0x0/0x54) [ 17.413772] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready udhcpc (v1.20.2) started Sending discover... [ 18.690666] netcp-1.0 2620110.netcp eth0: Link is Up - 1Gbps/Full - flow control off Sending discover... [ 22.250972] netcp-1.0 2620110.netcp eth0: Link is Up - 1Gbps/Full - flow control off [ 22.258721] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready [ 22.265458] BUG: sleeping function called from invalid context at kernel/locking/mutex.c:616 [ 22.273896] in_atomic(): 1, irqs_disabled(): 0, pid: 342, name: kworker/1:1 [ 22.280854] INFO: lockdep is turned off. [ 22.284767] Preemption disabled at:[< (null)>] (null) [ 22.290074] [ 22.291568] CPU: 1 PID: 342 Comm: kworker/1:1 Tainted: G W 4.1.6-01265-gfb1e101 #1 [ 22.300255] Hardware name: Keystone [ 22.303750] Workqueue: ipv6_addrconf addrconf_dad_work [ 22.308895] [<c00178e4>] (unwind_backtrace) from [<c0013cbc>] (show_stack+0x10/0x14) [ 22.316643] [<c0013cbc>] (show_stack) from [<c05ff450>] (dump_stack+0x84/0xc4) [ 22.323867] [<c05ff450>] (dump_stack) from [<c06023b0>] (mutex_lock_nested+0x28/0x4a8) [ 22.331786] [<c06023b0>] (mutex_lock_nested) from [<c03e9840>] (netcp_set_rx_mode+0x160/0x210) [ 22.340394] [<c03e9840>] (netcp_set_rx_mode) from [<c04c9d18>] (__dev_mc_add+0x54/0x68) [ 22.348401] [<c04c9d18>] (__dev_mc_add) from [<c05ab358>] (igmp6_group_added+0x168/0x1b4) [ 22.356580] [<c05ab358>] (igmp6_group_added) from [<c05ad2cc>] (ipv6_dev_mc_inc+0x4f0/0x5a8) [ 22.365019] [<c05ad2cc>] (ipv6_dev_mc_inc) from [<c058f0d0>] (addrconf_dad_work+0x21c/0x33c) [ 22.373460] [<c058f0d0>] (addrconf_dad_work) from [<c0042850>] (process_one_work+0x214/0x8d0) [ 22.381986] [<c0042850>] (process_one_work) from [<c0042f54>] (worker_thread+0x48/0x4bc) [ 22.390071] [<c0042f54>] (worker_thread) from [<c004868c>] (kthread+0xf0/0x108) [ 22.397381] [<c004868c>] (kthread) from [<c0010038>] Trace related to incorrect usage of mutex inside ndo_set_rx_mode [ 24.086066] BUG: sleeping function called from invalid context at kernel/locking/mutex.c:616 [ 24.094506] in_atomic(): 1, irqs_disabled(): 0, pid: 1682, name: ifconfig [ 24.101291] INFO: lockdep is turned off. [ 24.105203] Preemption disabled at:[< (null)>] (null) [ 24.110511] [ 24.112005] CPU: 2 PID: 1682 Comm: ifconfig Tainted: G W 4.1.6-01265-gfb1e101 #1 [ 24.120518] Hardware name: Keystone [ 24.124018] [<c00178e4>] (unwind_backtrace) from [<c0013cbc>] (show_stack+0x10/0x14) [ 24.131772] [<c0013cbc>] (show_stack) from [<c05ff450>] (dump_stack+0x84/0xc4) [ 24.138989] [<c05ff450>] (dump_stack) from [<c06023b0>] (mutex_lock_nested+0x28/0x4a8) [ 24.146908] [<c06023b0>] (mutex_lock_nested) from [<c03e9840>] (netcp_set_rx_mode+0x160/0x210) [ 24.155523] [<c03e9840>] (netcp_set_rx_mode) from [<c04c483c>] (dev_set_rx_mode+0x1c/0x28) [ 24.163787] [<c04c483c>] (dev_set_rx_mode) from [<c04c490c>] (__dev_open+0xc4/0x10c) [ 24.171531] [<c04c490c>] (__dev_open) from [<c04c4b74>] (__dev_change_flags+0x94/0x144) [ 24.179528] [<c04c4b74>] (__dev_change_flags) from [<c04c4c3c>] (dev_change_flags+0x18/0x48) [ 24.187966] [<c04c4c3c>] (dev_change_flags) from [<c053a9fc>] (devinet_ioctl+0x6dc/0x7e4) [ 24.196145] [<c053a9fc>] (devinet_ioctl) from [<c04a59ec>] (sock_ioctl+0x1d0/0x2a8) [ 24.203803] [<c04a59ec>] (sock_ioctl) from [<c0142844>] (do_vfs_ioctl+0x41c/0x688) [ 24.211373] [<c0142844>] (do_vfs_ioctl) from [<c0142ae4>] (SyS_ioctl+0x34/0x5c) [ 24.218676] [<c0142ae4>] (SyS_ioctl) from [<c000ff60>] (ret_fast_syscall+0x0/0x54) [ 24.227156] IPv6: ADDRCONF(NETDEV_UP): eth1: link is not ready Signed-off-by: Murali Karicheri <m-karicheri2@ti.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Karicheri, Muralidharan authored
Currently netcp_rxpool_refill() that refill descriptors and attached buffers to fdq while interrupt is enabled as part of NAPI poll. Doing it while interrupt is disabled could be beneficial as hardware will not be starved when CPU is busy with processing interrupt. Signed-off-by: Murali Karicheri <m-karicheri2@ti.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Karicheri, Muralidharan authored
Currently netcp_module_probe() doesn't check the return value of of_parse_phandle() that points to the interface data for the module and then pass the node ptr to the module which is incorrect. Check for return value and free the intf_modpriv if there is error. Signed-off-by: Murali Karicheri <m-karicheri2@ti.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Karicheri, Muralidharan authored
Currently, if netcp_allocate_rx_buf() fails due no descriptors in the rx free descriptor queue, inside the netcp_rxpool_refill() function the iterative loop to fill buffers doesn't terminate right away. So modify the netcp_allocate_rx_buf() to return an error code and use it break the loop when there is error. Signed-off-by: Murali Karicheri <m-karicheri2@ti.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Karicheri, Muralidharan authored
The netcp interface is not fully initialized before attach the module to the interface. For example, the tx pipe/rx pipe is initialized in ethss module as part of attach(). So until this is complete, the interface can't be registered. So move registration of interface to net device outside the current loop that attaches the modules to the interface. Signed-off-by: Murali Karicheri <m-karicheri2@ti.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Karicheri, Muralidharan authored
netcp_core is the first driver that will get initialized and the modules (ethss, pa etc) will then get initialized. So the code at the end of netcp_probe() that iterate over the modules is a dead code as the module list will be always be empty. So remove this code. Signed-off-by: Murali Karicheri <m-karicheri2@ti.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
WingMan Kwok authored
On K2HK, sgmii module registers of slave 0 and 1 are mem mapped to one contiguous block, while those of slave 2 and 3 are mapped to another contiguous block. However, on K2E and K2L, sgmii module registers of all slaves are mem mapped to one contiguous block. SGMII APIs expect slave 0 sgmii base when API is invoked for slave 0 and 1, and slave 2 sgmii base when invoked for other slaves. Before this patch, slave 0 sgmii base is always passed to sgmii API for K2E regardless which slave is the API invoked for. This patch fixes the problem. Signed-off-by: WingMan Kwok <w-kwok2@ti.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
David Woodhouse authored
Commit 7d824109 ("virtio: add explicit big-endian support to memory accessors") accidentally changed the virtio_net header used by AF_PACKET with PACKET_VNET_HDR from host-endian to big-endian. Since virtio_legacy_is_little_endian() is a very long identifier, define a vio_le macro and use that throughout the code instead of the hard-coded 'false' for little-endian. This restores the ABI to match 4.1 and earlier kernels, and makes my test program work again. Signed-off-by: David Woodhouse <David.Woodhouse@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Neil Horman authored
Drivers might call napi_disable while not holding the napi instance poll_lock. In those instances, its possible for a race condition to exist between poll_one_napi and napi_disable. That is to say, poll_one_napi only tests the NAPI_STATE_SCHED bit to see if there is work to do during a poll, and as such the following may happen: CPU0 CPU1 ndo_tx_timeout napi_poll_dev napi_disable poll_one_napi test_and_set_bit (ret 0) test_bit (ret 1) reset adapter napi_poll_routine If the adapter gets a tx timeout without a napi instance scheduled, its possible for the adapter to think it has exclusive access to the hardware (as the napi instance is now scheduled via the napi_disable call), while the netpoll code thinks there is simply work to do. The result is parallel hardware access leading to corrupt data structures in the driver, and a crash. Additionaly, there is another, more critical race between netpoll and napi_disable. The disabled napi state is actually identical to the scheduled state for a given napi instance. The implication being that, if a napi instance is disabled, a netconsole instance would see the napi state of the device as having been scheduled, and poll it, likely while the driver was dong something requiring exclusive access. In the case above, its fairly clear that not having the rings in a state ready to be polled will cause any number of crashes. The fix should be pretty easy. netpoll uses its own bit to indicate that that the napi instance is in a state of being serviced by netpoll (NAPI_STATE_NPSVC). We can just gate disabling on that bit as well as the sched bit. That should prevent netpoll from conducting a napi poll if we convert its set bit to a test_and_set_bit operation to provide mutual exclusion Change notes: V2) Remove a trailing whtiespace Resubmit with proper subject prefix V3) Clean up spacing nits Signed-off-by: Neil Horman <nhorman@tuxdriver.com> CC: "David S. Miller" <davem@davemloft.net> CC: jmaxwell@redhat.com Tested-by: jmaxwell@redhat.com Signed-off-by: David S. Miller <davem@davemloft.net>
-
Eric Dumazet authored
RST packets sent on behalf of TCP connections with TS option (RFC 7323 TCP timestamps) have incorrect TS val (set to 0), but correct TS ecr. A > B: Flags [S], seq 0, win 65535, options [mss 1000,nop,nop,TS val 100 ecr 0], length 0 B > A: Flags [S.], seq 2444755794, ack 1, win 28960, options [mss 1460,nop,nop,TS val 7264344 ecr 100], length 0 A > B: Flags [.], ack 1, win 65535, options [nop,nop,TS val 110 ecr 7264344], length 0 B > A: Flags [R.], seq 1, ack 1, win 28960, options [nop,nop,TS val 0 ecr 110], length 0 We need to call skb_mstamp_get() to get proper TS val, derived from skb->skb_mstamp Note that RFC 1323 was advocating to not send TS option in RST segment, but RFC 7323 recommends the opposite : Once TSopt has been successfully negotiated, that is both <SYN> and <SYN,ACK> contain TSopt, the TSopt MUST be sent in every non-<RST> segment for the duration of the connection, and SHOULD be sent in an <RST> segment (see Section 5.2 for details) Note this RFC recommends to send TS val = 0, but we believe it is premature : We do not know if all TCP stacks are properly handling the receive side : When an <RST> segment is received, it MUST NOT be subjected to the PAWS check by verifying an acceptable value in SEG.TSval, and information from the Timestamps option MUST NOT be used to update connection state information. SEG.TSecr MAY be used to provide stricter <RST> acceptance checks. In 5 years, if/when all TCP stack are RFC 7323 ready, we might consider to decide to send TS val = 0, if it buys something. Fixes: 7faee5c0 ("tcp: remove TCP_SKB_CB(skb)->when") Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Neil Armstrong authored
The Marvell Egress rx trailer check must be fixed to correctly detect bad bits in the third byte of the Eggress trailer as described in the Table 28 of the 88E6060 datasheet. The current code incorrectly omits to check the third byte and checks the fourth byte twice. Signed-off-by: Neil Armstrong <narmstrong@baylibre.com> Acked-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Dmitriy Vyukov authored
rhashtable_rehash_one() uses complex logic to update entry->next field, after INIT_RHT_NULLS_HEAD and NULLS_MARKER expansion: entry->next = 1 | ((base + off) << 1) This can be compiled along the lines of: entry->next = base + off entry->next <<= 1 entry->next |= 1 Which will break concurrent readers. NULLS value recomputation is not needed here, so just remove the complex logic. The data race was found with KernelThreadSanitizer (KTSAN). Signed-off-by: Dmitry Vyukov <dvyukov@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Acked-by: Thomas Graf <tgraf@suug.ch> Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Tobias Klauser authored
Converts the ch9200 driver to use the module_usb_driver() macro which makes the code smaller and a bit simpler. Signed-off-by: Tobias Klauser <tklauser@distanz.ch> Acked-by: Matthew Garrett <mjg59@srcf.ucam.org> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Jesse Gross authored
When support for megaflows was introduced, OVS needed to start installing flows with a mask applied to them. Since masking is an expensive operation, OVS also had an optimization that would only take the parts of the flow keys that were covered by a non-zero mask. The values stored in the remaining pieces should not matter because they are masked out. While this works fine for the purposes of matching (which must always look at the mask), serialization to netlink can be problematic. Since the flow and the mask are serialized separately, the uninitialized portions of the flow can be encoded with whatever values happen to be present. In terms of functionality, this has little effect since these fields will be masked out by definition. However, it leaks kernel memory to userspace, which is a potential security vulnerability. It is also possible that other code paths could look at the masked key and get uninitialized data, although this does not currently appear to be an issue in practice. This removes the mask optimization for flows that are being installed. This was always intended to be the case as the mask optimizations were really targetting per-packet flow operations. Fixes: 03f0d916 ("openvswitch: Mega flow implementation") Signed-off-by: Jesse Gross <jesse@nicira.com> Acked-by: Pravin B Shelar <pshelar@nicira.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Russell King authored
Commit 54d792f2 ("net: dsa: Centralise global and port setup code into mv88e6xxx.") merged in the 4.2 merge window broke the link speed forcing for the CPU port of Marvell DSA switches. The original code was: /* MAC Forcing register: don't force link, speed, duplex * or flow control state to any particular values on physical * ports, but force the CPU port and all DSA ports to 1000 Mb/s * full duplex. */ if (dsa_is_cpu_port(ds, p) || ds->dsa_port_mask & (1 << p)) REG_WRITE(addr, 0x01, 0x003e); else REG_WRITE(addr, 0x01, 0x0003); but the new code does a read-modify-write: reg = _mv88e6xxx_reg_read(ds, REG_PORT(port), PORT_PCS_CTRL); if (dsa_is_cpu_port(ds, port) || ds->dsa_port_mask & (1 << port)) { reg |= PORT_PCS_CTRL_FORCE_LINK | PORT_PCS_CTRL_LINK_UP | PORT_PCS_CTRL_DUPLEX_FULL | PORT_PCS_CTRL_FORCE_DUPLEX; if (mv88e6xxx_6065_family(ds)) reg |= PORT_PCS_CTRL_100; else reg |= PORT_PCS_CTRL_1000; The link speed in the PCS control register is a two bit field. Forcing the link speed in this way doesn't ensure that the bit field is set to the correct value - on the hardware I have here, the speed bitfield remains set to 0x03, resulting in the speed not being forced to gigabit. We must clear both bits before forcing the link speed. Fixes: 54d792f2 ("net: dsa: Centralise global and port setup code into mv88e6xxx.") Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk> Acked-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
-
- 22 Sep, 2015 1 commit
-
-
John W. Linville authored
Partially due to a pre-exising "thinko", the new metadata-based tx/rx paths were handling ECN propagation differently than the traditional tx/rx paths. This patch removes the "thinko" (involving multiple ip_hdr assignments) on the rx path and corrects the ECN handling on both the rx and tx paths. Signed-off-by: John W. Linville <linville@tuxdriver.com> Reviewed-by: Jesse Gross <jesse@nicira.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
- 21 Sep, 2015 15 commits
-
-
Eric Dumazet authored
Before allowing lockless LISTEN processing, we need to make sure to arm the SYN_RECV timer before the req socket is visible in hash tables. Also, req->rsk_hash should be written before we set rsk_refcnt to a non zero value. Fixes: fa76ce73 ("inet: get rid of central tcp/dccp listener timer") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Ying Cai <ycai@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Eric Dumazet authored
When creating a timewait socket, we need to arm the timer before allowing other cpus to find it. The signal allowing cpus to find the socket is setting tw_refcnt to non zero value. As we set tw_refcnt in __inet_twsk_hashdance(), we therefore need to call inet_twsk_schedule() first. This also means we need to remove tw_refcnt changes from inet_twsk_schedule() and let the caller handle it. Note that because we use mod_timer_pinned(), we have the guarantee the timer wont expire before we set tw_refcnt as we run in BH context. To make things more readable I introduced inet_twsk_reschedule() helper. When rearming the timer, we can use mod_timer_pending() to make sure we do not rearm a canceled timer. Note: This bug can possibly trigger if packets of a flow can hit multiple cpus. This does not normally happen, unless flow steering is broken somehow. This explains this bug was spotted ~5 months after its introduction. A similar fix is needed for SYN_RECV sockets in reqsk_queue_hash_req(), but will be provided in a separate patch for proper tracking. Fixes: 789f558c ("tcp/dccp: get rid of central timewait timer") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Ying Cai <ycai@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Sowmini Varadhan authored
`ls /sys/devices/channel-devices/vnet-port-0-0/net' is missing without this change, and applications like NetworkManager are looking in sysfs for the information. Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
John W. Linville authored
The code handling vlan tag insertion was dropped in commit 371bd106 ("geneve: Consolidate Geneve functionality in single module."). Now we need to drop the related vlan feature bits in the netdev structure. Signed-off-by: John W. Linville <linville@tuxdriver.com> Reviewed-by: Jesse Gross <jesse@nicira.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Matthew Garrett authored
There's a bunch of cheap USB 10/100 devices based on QinHeng chipsets. The vendor driver supports the CH9100 and CH9200 devices, but the majority of the code is of the if (ch9100) {} else {} form, with the most significant difference being that CH9200 provides a real MII interface but CH9100 fakes one with a bunch of global variables and magic commands. I don't have a CH9100, so it's probably better if someone who does provides an independent driver for it. In any case, this is a lightly cleaned up version of the vendor driver with all the CH9100 code dropped. Signed-off-by: Matthew Garrett <mjg59@srcf.ucam.org> Signed-off-by: David S. Miller <davem@davemloft.net>
-
David S. Miller authored
Luis de Bethencourt says: ==================== net: phy: Fix module autoload for OF platform drivers These patches add the missing MODULE_DEVICE_TABLE() for OF to export the information so modules have the correct aliases built-in and autoloading works correctly. A longer explanation by Javier Canillas can be found here: https://lkml.org/lkml/2015/7/30/519 ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
-
Luis de Bethencourt authored
This platform driver has a OF device ID table but the OF module alias information is not created so module autoloading won't work. Signed-off-by: Luis de Bethencourt <luisbg@osg.samsung.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Luis de Bethencourt authored
This platform driver has a OF device ID table but the OF module alias information is not created so module autoloading won't work. Signed-off-by: Luis de Bethencourt <luisbg@osg.samsung.com> Acked-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
David S. Miller authored
Luis de Bethencourt says: ==================== net: Fix module autoload for OF platform drivers These patches add the missing MODULE_DEVICE_TABLE() for OF to export the information so modules have the correct aliases built-in and autoloading works correctly. A longer explanation by Javier Canillas can be found here: https://lkml.org/lkml/2015/7/30/519 ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
-
Luis de Bethencourt authored
This platform driver has a OF device ID table but the OF module alias information is not created so module autoloading won't work. Signed-off-by: Luis de Bethencourt <luisbg@osg.samsung.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Luis de Bethencourt authored
This platform driver has a OF device ID table but the OF module alias information is not created so module autoloading won't work. Signed-off-by: Luis de Bethencourt <luisbg@osg.samsung.com> Acked-by: Richard Cochran <richardcochran@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Luis de Bethencourt authored
This platform driver has a OF device ID table but the OF module alias information is not created so module autoloading won't work. Signed-off-by: Luis de Bethencourt <luis@osg.samsung.com> Acked-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Luis de Bethencourt authored
This platform driver has a OF device ID table but the OF module alias information is not created so module autoloading won't work. Signed-off-by: Luis de Bethencourt <luisbg@osg.samsung.com> Acked-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Luis de Bethencourt authored
This platform driver has a OF device ID table but the OF module alias information is not created so module autoloading won't work. Signed-off-by: Luis de Bethencourt <luisbg@osg.samsung.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Herbert Xu authored
The commit c0bb07df ("netlink: Reset portid after netlink_insert failure") introduced a race condition where if two threads try to autobind the same socket one of them may end up with a zero port ID. This led to kernel deadlocks that were observed by multiple people. This patch reverts that commit and instead fixes it by introducing a separte rhash_portid variable so that the real portid is only set after the socket has been successfully hashed. Fixes: c0bb07df ("netlink: Reset portid after netlink_insert failure") Reported-by: Tejun Heo <tj@kernel.org> Reported-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
-