- 20 Feb, 2014 22 commits
-
-
Asias He authored
commit 2c95a329 upstream. Block layer will allocate a spinlock for the queue if the driver does not provide one in blk_init_queue(). The reason to use the internal spinlock is that blk_cleanup_queue() will switch to use the internal spinlock in the cleanup code path. if (q->queue_lock != &q->__queue_lock) q->queue_lock = &q->__queue_lock; However, processes which are in D state might have taken the driver provided spinlock, when the processes wake up, they would release the block provided spinlock. ===================================== [ BUG: bad unlock balance detected! ] 3.4.0-rc7+ #238 Not tainted ------------------------------------- fio/3587 is trying to release lock (&(&q->__queue_lock)->rlock) at: [<ffffffff813274d2>] blk_queue_bio+0x2a2/0x380 but there are no more locks to release! other info that might help us debug this: 1 lock held by fio/3587: #0: (&(&vblk->lock)->rlock){......}, at: [<ffffffff8132661a>] get_request_wait+0x19a/0x250 Other drivers use block layer provided spinlock as well, e.g. SCSI. Switching to the block layer provided spinlock saves a bit of memory and does not increase lock contention. Performance test shows no real difference is observed before and after this patch. Changes in v2: Improve commit log as Michael suggested. Cc: virtualization@lists.linux-foundation.org Cc: kvm@vger.kernel.org Signed-off-by: Asias He <asias@redhat.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> [bwh: Backported to 3.2: adjust context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk> Cc: Li Zefan <lizefan@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Seth Forshee authored
commit c0394506 upstream. The touchpad on the Acer Aspire One D250 will report out of range values in the extreme lower portion of the touchpad. These appear as abrupt changes in the values reported by the hardware from very low values to very high values, which can cause unexpected vertical jumps in the position of the mouse pointer. What seems to be happening is that the value is wrapping to a two's compliment negative value of higher resolution than the 13-bit value reported by the hardware, with the high-order bits being truncated. This patch adds handling for these values by converting them to the appropriate negative values. The only tricky part about this is deciding when to treat a number as negative. It stands to reason that if out of range values can be reported on the low end then it could also happen on the high end, so not all out of range values should be treated as negative. The approach taken here is to split the difference between the maximum legitimate value for the axis and the maximum possible value that the hardware can report, treating values greater than this number as negative and all other values as positive. This can be tweaked later if hardware is found that operates outside of these parameters. BugLink: http://bugs.launchpad.net/bugs/1001251Signed-off-by: Seth Forshee <seth.forshee@canonical.com> Reviewed-by: Daniel Kurtz <djkurtz@chromium.org> Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com> [bwh: Backported to 3.2: adjust context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk> Cc: Li Zefan <lizefan@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Bojan Smojver authored
commit 5a21d489 upstream. 1. Do not allocate memory for buffers from emergency pools, unless absolutely required. Do not warn about and do not retry non-essential failed allocations. 2. Do not check the amount of free pages left on every single page write, but wait until one map is completely populated and then check. 3. Set maximum number of pages for read buffering consistently, instead of inadvertently depending on the size of the sector type. 4. Fix copyright line, which I missed when I submitted the hibernation threading patch. 5. Dispense with bit shifting arithmetic to improve readability. 6. Really recalculate the number of pages required to be free after all allocations have been done. 7. Fix calculation of pages required for read buffering. Only count in pages that do not belong to high memory. Signed-off-by: Bojan Smojver <bojan@rexursive.com> Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Signed-off-by: Ben Hutchings <ben@decadent.org.uk> Cc: Li Zefan <lizefan@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Avi Kivity authored
commit f2ebd422 upstream. kvm_set_irq() has an internal buffer of three irq routing entries, allowing connecting a GSI to three IRQ chips or on MSI. However setup_routing_entry() does not properly enforce this, allowing three irqchip routes followed by an MSI route to overflow the buffer. Fix by ensuring that an MSI entry is added to an empty list. Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk> Cc: Li Zefan <lizefan@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Nicholas Bellinger authored
commit b32f4c7e upstream. This patch re-adds the ability to optionally run in buffered FILEIO mode (eg: w/o O_DSYNC) for device backends in order to once again use the Linux buffered cache as a write-back storage mechanism. This logic was originally dropped with mainline v3.5-rc commit: commit a4dff304 Author: Nicholas Bellinger <nab@linux-iscsi.org> Date: Wed May 30 16:25:41 2012 -0700 target/file: Use O_DSYNC by default for FILEIO backends This difference with this patch is that fd_create_virtdevice() now forces the explicit setting of emulate_write_cache=1 when buffered FILEIO operation has been enabled. (v2: Switch to FDBD_HAS_BUFFERED_IO_WCE + add more detailed comment as requested by hch) Reported-by: Ferry <iscsitmp@bananateam.nl> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk> Cc: Li Zefan <lizefan@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Nicholas Bellinger authored
commit a4dff304 upstream. Convert to use O_DSYNC for all cases at FILEIO backend creation time to avoid the extra syncing of pure timestamp updates with legacy O_SYNC during default operation as recommended by hch. Continue to do this independently of Write Cache Enable (WCE) bit, as WCE=0 is currently the default for all backend devices and enabled by user on per device basis via attrib/emulate_write_cache. This patch drops the now unnecessary fd_buffered_io= token usage that was originally signalling when to explictly disable O_SYNC at backend creation time for buffered I/O operation. This can end up being dangerous for a number of reasons during physical node failure, so go ahead and drop this option for now when O_DSYNC is used as the default. Also allow explict FUA WRITEs -> vfs_fsync_range() call to function in fd_execute_cmd() independently of WCE bit setting. Reported-by: Christoph Hellwig <hch@lst.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk> [bwh: Backported to 3.2: - We have fd_do_task() and not fd_execute_cmd() - Various fields are in struct se_task rather than struct se_cmd - fd_create_virtdevice() flags initialisation hasn't been cleaned up] Cc: Li Zefan <lizefan@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Jan Kara authored
commit 603e7729 upstream. qib_user_sdma_queue_pkts() gets called with mmap_sem held for writing. Except for get_user_pages() deep down in qib_user_sdma_pin_pages() we don't seem to need mmap_sem at all. Even more interestingly the function qib_user_sdma_queue_pkts() (and also qib_user_sdma_coalesce() called somewhat later) call copy_from_user() which can hit a page fault and we deadlock on trying to get mmap_sem when handling that fault. So just make qib_user_sdma_pin_pages() use get_user_pages_fast() and leave mmap_sem locking for mm. This deadlock has actually been observed in the wild when the node is under memory pressure. Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Roland Dreier <roland@purestorage.com> [Backported to 3.4: (Thank to Ben Hutchings) - Adjust context - Adjust indentation and nr_pages argument in qib_user_sdma_pin_pages()] Signed-off-by: Ben Hutchings <ben@decadent.org.uk> Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Peter Zijlstra authored
commit 5aaa0b7a upstream. Follow up on commit 556061b0 ("sched/nohz: Fix rq->cpu_load[] calculations") since while that fixed the busy case it regressed the mostly idle case. Add a callback from the nohz exit to also age the rq->cpu_load[] array. This closes the hole where either there was no nohz load balance pass during the nohz, or there was a 'significant' amount of idle time between the last nohz balance and the nohz exit. So we'll update unconditionally from the tick to not insert any accidental 0 load periods while busy, and we try and catch up from nohz idle balance and nohz exit. Both these are still prone to missing a jiffy, but that has always been the case. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: pjt@google.com Cc: Venkatesh Pallipadi <venki@google.com> Link: http://lkml.kernel.org/n/tip-kt0trz0apodbf84ucjfdbr1a@git.kernel.orgSigned-off-by: Ingo Molnar <mingo@kernel.org> Cc: Li Zefan <lizefan@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Peter Zijlstra authored
commit 556061b0 upstream. While investigating why the load-balancer did funny I found that the rq->cpu_load[] tables were completely screwy.. a bit more digging revealed that the updates that got through were missing ticks followed by a catchup of 2 ticks. The catchup assumes the cpu was idle during that time (since only nohz can cause missed ticks and the machine is idle etc..) this means that esp. the higher indices were significantly lower than they ought to be. The reason for this is that its not correct to compare against jiffies on every jiffy on any other cpu than the cpu that updates jiffies. This patch cludges around it by only doing the catch-up stuff from nohz_idle_balance() and doing the regular stuff unconditionally from the tick. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: pjt@google.com Cc: Venkatesh Pallipadi <venki@google.com> Link: http://lkml.kernel.org/n/tip-tp4kj18xdd5aj4vvj0qg55s2@git.kernel.orgSigned-off-by: Ingo Molnar <mingo@kernel.org> Cc: Li Zefan <lizefan@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Steven Rostedt authored
commit 23a8e844 upstream. Doing some different tests, I discovered that function graph tracing, when filtered via the set_ftrace_filter and set_ftrace_notrace files, does not always keep with them if another function ftrace_ops is registered to trace functions. The reason is that function graph just happens to trace all functions that the function tracer enables. When there was only one user of function tracing, the function graph tracer did not need to worry about being called by functions that it did not want to trace. But now that there are other users, this becomes a problem. For example, one just needs to do the following: # cd /sys/kernel/debug/tracing # echo schedule > set_ftrace_filter # echo function_graph > current_tracer # cat trace [..] 0) | schedule() { ------------------------------------------ 0) <idle>-0 => rcu_pre-7 ------------------------------------------ 0) ! 2980.314 us | } 0) | schedule() { ------------------------------------------ 0) rcu_pre-7 => <idle>-0 ------------------------------------------ 0) + 20.701 us | } # echo 1 > /proc/sys/kernel/stack_tracer_enabled # cat trace [..] 1) + 20.825 us | } 1) + 21.651 us | } 1) + 30.924 us | } /* SyS_ioctl */ 1) | do_page_fault() { 1) | __do_page_fault() { 1) 0.274 us | down_read_trylock(); 1) 0.098 us | find_vma(); 1) | handle_mm_fault() { 1) | _raw_spin_lock() { 1) 0.102 us | preempt_count_add(); 1) 0.097 us | do_raw_spin_lock(); 1) 2.173 us | } 1) | do_wp_page() { 1) 0.079 us | vm_normal_page(); 1) 0.086 us | reuse_swap_page(); 1) 0.076 us | page_move_anon_rmap(); 1) | unlock_page() { 1) 0.082 us | page_waitqueue(); 1) 0.086 us | __wake_up_bit(); 1) 1.801 us | } 1) 0.075 us | ptep_set_access_flags(); 1) | _raw_spin_unlock() { 1) 0.098 us | do_raw_spin_unlock(); 1) 0.105 us | preempt_count_sub(); 1) 1.884 us | } 1) 9.149 us | } 1) + 13.083 us | } 1) 0.146 us | up_read(); When the stack tracer was enabled, it enabled all functions to be traced, which now the function graph tracer also traces. This is a side effect that should not occur. To fix this a test is added when the function tracing is changed, as well as when the graph tracer is enabled, to see if anything other than the ftrace global_ops function tracer is enabled. If so, then the graph tracer calls a test trampoline that will look at the function that is being traced and compare it with the filters defined by the global_ops. As an optimization, if there's no other function tracers registered, or if the only registered function tracers also use the global ops, the function graph infrastructure will call the registered function graph callback directly and not go through the test trampoline. Fixes: d2d45c7a "tracing: Have stack_tracer use a separate list of functions" Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Steven Rostedt authored
commit a4c35ed2 upstream. The synchronization needed after ftrace_ops are unregistered must happen after the callback is disabled from becing called by functions. The current location happens after the function is being removed from the internal lists, but not after the function callbacks were disabled, leaving the functions susceptible of being called after their callbacks are freed. This affects perf and any externel users of function tracing (LTTng and SystemTap). Fixes: cdbe61bf "ftrace: Allow dynamically allocated function tracers" Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Steven Rostedt authored
commit 405e1d83 upstream. [ Partial commit backported to 3.4. The ftrace_sync() code by this is required for other fixes that 3.4 needs. ] ftrace_trace_function is a variable that holds what function will be called directly by the assembly code (mcount). If just a single function is registered and it handles recursion itself, then the assembly will call that function directly without any helper function. It also passes in the ftrace_op that was registered with the callback. The ftrace_op to send is stored in the function_trace_op variable. The ftrace_trace_function and function_trace_op needs to be coordinated such that the called callback wont be called with the wrong ftrace_op, otherwise bad things can happen if it expected a different op. Luckily, there's no callback that doesn't use the helper functions that requires this. But there soon will be and this needs to be fixed. Use a set_function_trace_op to store the ftrace_op to set the function_trace_op to when it is safe to do so (during the update function within the breakpoint or stop machine calls). Or if dynamic ftrace is not being used (static tracing) then we have to do a bit more synchronization when the ftrace_trace_function is set as that takes affect immediately (as oppose to dynamic ftrace doing it with the modification of the trampoline). Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Mikulas Patocka authored
commit 2995fa78 upstream. This reverts commit be35f486 ("dm: wait until embedded kobject is released before destroying a device") and provides an improved fix. The kobject release code that calls the completion must be placed in a non-module file, otherwise there is a module unload race (if the process calling dm_kobject_release is preempted and the DM module unloaded after the completion is triggered, but before dm_kobject_release returns). To fix this race, this patch moves the completion code to dm-builtin.c which is always compiled directly into the kernel if BLK_DEV_DM is selected. The patch introduces a new dm_kobject_holder structure, its purpose is to keep the completion and kobject in one place, so that it can be accessed from non-module code without the need to export the layout of struct mapped_device to that code. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Xishi Qiu authored
commit ca57df79 upstream. On architectures with CONFIG_HUGETLB_PAGE_SIZE_VARIABLE set, such as Itanium, pageblock_order is a variable with default value of 0. It's set to the right value by set_pageblock_order() in function free_area_init_core(). But pageblock_order may be used by sparse_init() before free_area_init_core() is called along path: sparse_init() ->sparse_early_usemaps_alloc_node() ->usemap_size() ->SECTION_BLOCKFLAGS_BITS ->((1UL << (PFN_SECTION_SHIFT - pageblock_order)) * NR_PAGEBLOCK_BITS) The uninitialized pageblock_size will cause memory wasting because usemap_size() returns a much bigger value then it's really needed. For example, on an Itanium platform, sparse_init() pageblock_order=0 usemap_size=24576 free_area_init_core() before pageblock_order=0, usemap_size=24576 free_area_init_core() after pageblock_order=12, usemap_size=8 That means 24K memory has been wasted for each section, so fix it by calling set_pageblock_order() from sparse_init(). Signed-off-by: Xishi Qiu <qiuxishi@huawei.com> Signed-off-by: Jiang Liu <liuj97@gmail.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Yinghai Lu <yinghai@kernel.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: David Rientjes <rientjes@google.com> Cc: Keping Chen <chenkeping@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> [lizf: Backported to 3.4: adjust context] Signed-off-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Andrew Morton authored
commit 955c1cd7 upstream. This has always been broken: one version takes an unsigned int and the other version takes no arguments. This bug was hidden because one version of set_pageblock_order() was a macro which doesn't evaluate its argument. Simplify it all and remove pageblock_default_order() altogether. Reported-by: rajman mekaco <rajman.mekaco@gmail.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Tejun Heo <tj@kernel.org> Cc: Minchan Kim <minchan.kim@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> [lizf: Backported to 3.4: adjust context] Signed-off-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Daniel Vetter authored
commit 9f846a16 upstream. Especially vesafb likes to map everything as uc- (yikes), and if that mapping hangs around still while we try to map the gtt as wc the kernel will downgrade our request to uc-, resulting in abyssal performance. Unfortunately we can't do this as early as readon does (i.e. as the first thing we do when initializing the hw) because our fb/mmio space region moves around on a per-gen basis. So I've had to move it below the gtt initialization, but that seems to work, too. The important thing is that we do this before we set up the gtt wc mapping. Now an altogether different question is why people compile their kernels with vesafb enabled, but I guess making things just work isn't bad per se ... v2: - s/radeondrmfb/inteldrmfb/ - fix up error handling v3: Kill #ifdef X86, this is Intel after all. Noticed by Ben Widawsky. v4: Jani Nikula complained about the pointless bool primary initialization. v5: Don't oops if we can't allocate, noticed by Chris Wilson. v6: Resolve conflicts with agp rework and fixup whitespace. This is commit e188719a in drm-next. Backport to 3.5 -fixes queue requested by Dave Airlie - due to grub using vesa on fedora their initrd seems to load vesafb before loading the real kms driver. So tons more people actually experience a dead-slow gpu. Hence also the Cc: stable. Reported-and-tested-by: "Kilarski, Bernard R" <bernard.r.kilarski@intel.com> Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk> Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch> Signed-off-by: Dave Airlie <airlied@redhat.com> [lizf: Backported to 3.4: adjust context] Signed-off-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Tao Ma authored
commit 6f2e9f0e upstream. Now when we set the group inode free count, we don't have a proper group lock so that multiple threads may decrease the inode free count at the same time. And e2fsck will complain something like: Free inodes count wrong for group #1 (1, counted=0). Fix? no Free inodes count wrong for group #2 (3, counted=0). Fix? no Directories count wrong for group #2 (780, counted=779). Fix? no Free inodes count wrong for group #3 (2272, counted=2273). Fix? no So this patch try to protect it with the ext4_lock_group. btw, it is found by xfstests test case 269 and the volume is mkfsed with the parameter "-O ^resize_inode,^uninit_bg,extent,meta_bg,flex_bg,ext_attr" and I have run it 100 times and the error in e2fsck doesn't show up again. Signed-off-by: Tao Ma <boyu.mt@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Benjamin LaHaise <bcrl@kvack.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Paul E. McKenney authored
commit 85eae82a upstream. The console_cpu_notify() function runs with interrupts disabled in the CPU_DYING case. It therefore cannot block, for example, as will happen when it calls console_lock(). Therefore, remove the CPU_DYING leg of the switch statement to avoid this problem. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Guillaume Morin <guillaume@morinfr.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Peter Oberparleiter authored
commit 6583327c upstream. Commit d61931d8, "x86: Add optimized popcnt variants" introduced compile flag -fcall-saved-rdi for lib/hweight.c. When combined with options -fprofile-arcs and -O2, this flag causes gcc to generate broken constructor code. As a result, a 64 bit x86 kernel compiled with CONFIG_GCOV_PROFILE_ALL=y prints message "gcov: could not create file" and runs into sproadic BUGs during boot. The gcc people indicate that these kinds of problems are endemic when using ad hoc calling conventions. It is therefore best to treat any file compiled with ad hoc calling conventions as an isolated environment and avoid things like profiling or coverage analysis, since those subsystems assume a "normal" calling conventions. This patch avoids the bug by excluding lib/hweight.o from coverage profiling. Reported-by: Meelis Roos <mroos@linux.ee> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Peter Oberparleiter <oberpar@linux.vnet.ibm.com> Link: http://lkml.kernel.org/r/52F3A30C.7050205@linux.vnet.ibm.comSigned-off-by: H. Peter Anvin <hpa@zytor.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
KOSAKI Motohiro authored
commit 227d53b3 upstream. To use spin_{un}lock_irq is dangerous if caller disabled interrupt. During aio buffer migration, we have a possibility to see the following call stack. aio_migratepage [disable interrupt] migrate_page_copy clear_page_dirty_for_io set_page_dirty __set_page_dirty_buffers __set_page_dirty spin_lock_irq This mean, current aio migration is a deadlockable. spin_lock_irqsave is a safer alternative and we should use it. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reported-by: David Rientjes rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
KOSAKI Motohiro authored
commit a85d9df1 upstream. During aio stress test, we observed the following lockdep warning. This mean AIO+numa_balancing is currently deadlockable. The problem is, aio_migratepage disable interrupt, but __set_page_dirty_nobuffers unintentionally enable it again. Generally, all helper function should use spin_lock_irqsave() instead of spin_lock_irq() because they don't know caller at all. other info that might help us debug this: Possible unsafe locking scenario: CPU0 ---- lock(&(&ctx->completion_lock)->rlock); <Interrupt> lock(&(&ctx->completion_lock)->rlock); *** DEADLOCK *** dump_stack+0x19/0x1b print_usage_bug+0x1f7/0x208 mark_lock+0x21d/0x2a0 mark_held_locks+0xb9/0x140 trace_hardirqs_on_caller+0x105/0x1d0 trace_hardirqs_on+0xd/0x10 _raw_spin_unlock_irq+0x2c/0x50 __set_page_dirty_nobuffers+0x8c/0xf0 migrate_page_copy+0x434/0x540 aio_migratepage+0xb1/0x140 move_to_new_page+0x7d/0x230 migrate_pages+0x5e5/0x700 migrate_misplaced_page+0xbc/0xf0 do_numa_page+0x102/0x190 handle_pte_fault+0x241/0x970 handle_mm_fault+0x265/0x370 __do_page_fault+0x172/0x5a0 do_page_fault+0x1a/0x70 page_fault+0x28/0x30 Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Larry Woodman <lwoodman@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Johannes Weiner <jweiner@redhat.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Stephen Smalley authored
commit 2172fa70 upstream. Setting an empty security context (length=0) on a file will lead to incorrectly dereferencing the type and other fields of the security context structure, yielding a kernel BUG. As a zero-length security context is never valid, just reject all such security contexts whether coming from userspace via setxattr or coming from the filesystem upon a getxattr request by SELinux. Setting a security context value (empty or otherwise) unknown to SELinux in the first place is only possible for a root process (CAP_MAC_ADMIN), and, if running SELinux in enforcing mode, only if the corresponding SELinux mac_admin permission is also granted to the domain by policy. In Fedora policies, this is only allowed for specific domains such as livecd for setting down security contexts that are not defined in the build host policy. Reproducer: su setenforce 0 touch foo setfattr -n security.selinux foo Caveat: Relabeling or removing foo after doing the above may not be possible without booting with SELinux disabled. Any subsequent access to foo after doing the above will also trigger the BUG. BUG output from Matthew Thode: [ 473.893141] ------------[ cut here ]------------ [ 473.962110] kernel BUG at security/selinux/ss/services.c:654! [ 473.995314] invalid opcode: 0000 [#6] SMP [ 474.027196] Modules linked in: [ 474.058118] CPU: 0 PID: 8138 Comm: ls Tainted: G D I 3.13.0-grsec #1 [ 474.116637] Hardware name: Supermicro X8ST3/X8ST3, BIOS 2.0 07/29/10 [ 474.149768] task: ffff8805f50cd010 ti: ffff8805f50cd488 task.ti: ffff8805f50cd488 [ 474.183707] RIP: 0010:[<ffffffff814681c7>] [<ffffffff814681c7>] context_struct_compute_av+0xce/0x308 [ 474.219954] RSP: 0018:ffff8805c0ac3c38 EFLAGS: 00010246 [ 474.252253] RAX: 0000000000000000 RBX: ffff8805c0ac3d94 RCX: 0000000000000100 [ 474.287018] RDX: ffff8805e8aac000 RSI: 00000000ffffffff RDI: ffff8805e8aaa000 [ 474.321199] RBP: ffff8805c0ac3cb8 R08: 0000000000000010 R09: 0000000000000006 [ 474.357446] R10: 0000000000000000 R11: ffff8805c567a000 R12: 0000000000000006 [ 474.419191] R13: ffff8805c2b74e88 R14: 00000000000001da R15: 0000000000000000 [ 474.453816] FS: 00007f2e75220800(0000) GS:ffff88061fc00000(0000) knlGS:0000000000000000 [ 474.489254] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 474.522215] CR2: 00007f2e74716090 CR3: 00000005c085e000 CR4: 00000000000207f0 [ 474.556058] Stack: [ 474.584325] ffff8805c0ac3c98 ffffffff811b549b ffff8805c0ac3c98 ffff8805f1190a40 [ 474.618913] ffff8805a6202f08 ffff8805c2b74e88 00068800d0464990 ffff8805e8aac860 [ 474.653955] ffff8805c0ac3cb8 000700068113833a ffff880606c75060 ffff8805c0ac3d94 [ 474.690461] Call Trace: [ 474.723779] [<ffffffff811b549b>] ? lookup_fast+0x1cd/0x22a [ 474.778049] [<ffffffff81468824>] security_compute_av+0xf4/0x20b [ 474.811398] [<ffffffff8196f419>] avc_compute_av+0x2a/0x179 [ 474.843813] [<ffffffff8145727b>] avc_has_perm+0x45/0xf4 [ 474.875694] [<ffffffff81457d0e>] inode_has_perm+0x2a/0x31 [ 474.907370] [<ffffffff81457e76>] selinux_inode_getattr+0x3c/0x3e [ 474.938726] [<ffffffff81455cf6>] security_inode_getattr+0x1b/0x22 [ 474.970036] [<ffffffff811b057d>] vfs_getattr+0x19/0x2d [ 475.000618] [<ffffffff811b05e5>] vfs_fstatat+0x54/0x91 [ 475.030402] [<ffffffff811b063b>] vfs_lstat+0x19/0x1b [ 475.061097] [<ffffffff811b077e>] SyS_newlstat+0x15/0x30 [ 475.094595] [<ffffffff8113c5c1>] ? __audit_syscall_entry+0xa1/0xc3 [ 475.148405] [<ffffffff8197791e>] system_call_fastpath+0x16/0x1b [ 475.179201] Code: 00 48 85 c0 48 89 45 b8 75 02 0f 0b 48 8b 45 a0 48 8b 3d 45 d0 b6 00 8b 40 08 89 c6 ff ce e8 d1 b0 06 00 48 85 c0 49 89 c7 75 02 <0f> 0b 48 8b 45 b8 4c 8b 28 eb 1e 49 8d 7d 08 be 80 01 00 00 e8 [ 475.255884] RIP [<ffffffff814681c7>] context_struct_compute_av+0xce/0x308 [ 475.296120] RSP <ffff8805c0ac3c38> [ 475.328734] ---[ end trace f076482e9d754adc ]--- Reported-by: Matthew Thode <mthode@mthode.org> Signed-off-by: Stephen Smalley <sds@tycho.nsa.gov> Signed-off-by: Paul Moore <pmoore@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
- 13 Feb, 2014 18 commits
-
-
Greg Kroah-Hartman authored
-
Colin Cross authored
fixed upstream in v3.6 by ec145bab get_monotonic_boottime adds three nanonsecond values stored in longs, followed by an s64. If the long values are all close to 1e9 the first three additions can overflow and become negative when added to the s64. Cast the first value to s64 so that all additions are 64 bit. Signed-off-by: Colin Cross <ccross@android.com> [jstultz: Fished this out of the AOSP commong.git tree. This was fixed upstream in v3.6 by ec145bab] Signed-off-by: John Stultz <john.stultz@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
John Stultz authored
commit 6fdda9a9 upstream. As part of normal operaions, the hrtimer subsystem frequently calls into the timekeeping code, creating a locking order of hrtimer locks -> timekeeping locks clock_was_set_delayed() was suppoed to allow us to avoid deadlocks between the timekeeping the hrtimer subsystem, so that we could notify the hrtimer subsytem the time had changed while holding the timekeeping locks. This was done by scheduling delayed work that would run later once we were out of the timekeeing code. But unfortunately the lock chains are complex enoguh that in scheduling delayed work, we end up eventually trying to grab an hrtimer lock. Sasha Levin noticed this in testing when the new seqlock lockdep enablement triggered the following (somewhat abrieviated) message: [ 251.100221] ====================================================== [ 251.100221] [ INFO: possible circular locking dependency detected ] [ 251.100221] 3.13.0-rc2-next-20131206-sasha-00005-g8be2375-dirty #4053 Not tainted [ 251.101967] ------------------------------------------------------- [ 251.101967] kworker/10:1/4506 is trying to acquire lock: [ 251.101967] (timekeeper_seq){----..}, at: [<ffffffff81160e96>] retrigger_next_event+0x56/0x70 [ 251.101967] [ 251.101967] but task is already holding lock: [ 251.101967] (hrtimer_bases.lock#11){-.-...}, at: [<ffffffff81160e7c>] retrigger_next_event+0x3c/0x70 [ 251.101967] [ 251.101967] which lock already depends on the new lock. [ 251.101967] [ 251.101967] [ 251.101967] the existing dependency chain (in reverse order) is: [ 251.101967] -> #5 (hrtimer_bases.lock#11){-.-...}: [snipped] -> #4 (&rt_b->rt_runtime_lock){-.-...}: [snipped] -> #3 (&rq->lock){-.-.-.}: [snipped] -> #2 (&p->pi_lock){-.-.-.}: [snipped] -> #1 (&(&pool->lock)->rlock){-.-...}: [ 251.101967] [<ffffffff81194803>] validate_chain+0x6c3/0x7b0 [ 251.101967] [<ffffffff81194d9d>] __lock_acquire+0x4ad/0x580 [ 251.101967] [<ffffffff81194ff2>] lock_acquire+0x182/0x1d0 [ 251.101967] [<ffffffff84398500>] _raw_spin_lock+0x40/0x80 [ 251.101967] [<ffffffff81153e69>] __queue_work+0x1a9/0x3f0 [ 251.101967] [<ffffffff81154168>] queue_work_on+0x98/0x120 [ 251.101967] [<ffffffff81161351>] clock_was_set_delayed+0x21/0x30 [ 251.101967] [<ffffffff811c4bd1>] do_adjtimex+0x111/0x160 [ 251.101967] [<ffffffff811e2711>] compat_sys_adjtimex+0x41/0x70 [ 251.101967] [<ffffffff843a4b49>] ia32_sysret+0x0/0x5 [ 251.101967] -> #0 (timekeeper_seq){----..}: [snipped] [ 251.101967] other info that might help us debug this: [ 251.101967] [ 251.101967] Chain exists of: timekeeper_seq --> &rt_b->rt_runtime_lock --> hrtimer_bases.lock#11 [ 251.101967] Possible unsafe locking scenario: [ 251.101967] [ 251.101967] CPU0 CPU1 [ 251.101967] ---- ---- [ 251.101967] lock(hrtimer_bases.lock#11); [ 251.101967] lock(&rt_b->rt_runtime_lock); [ 251.101967] lock(hrtimer_bases.lock#11); [ 251.101967] lock(timekeeper_seq); [ 251.101967] [ 251.101967] *** DEADLOCK *** [ 251.101967] [ 251.101967] 3 locks held by kworker/10:1/4506: [ 251.101967] #0: (events){.+.+.+}, at: [<ffffffff81154960>] process_one_work+0x200/0x530 [ 251.101967] #1: (hrtimer_work){+.+...}, at: [<ffffffff81154960>] process_one_work+0x200/0x530 [ 251.101967] #2: (hrtimer_bases.lock#11){-.-...}, at: [<ffffffff81160e7c>] retrigger_next_event+0x3c/0x70 [ 251.101967] [ 251.101967] stack backtrace: [ 251.101967] CPU: 10 PID: 4506 Comm: kworker/10:1 Not tainted 3.13.0-rc2-next-20131206-sasha-00005-g8be2375-dirty #4053 [ 251.101967] Workqueue: events clock_was_set_work So the best solution is to avoid calling clock_was_set_delayed() while holding the timekeeping lock, and instead using a flag variable to decide if we should call clock_was_set() once we've released the locks. This works for the case here, where the do_adjtimex() was the deadlock trigger point. Unfortuantely, in update_wall_time() we still hold the jiffies lock, which would deadlock with the ipi triggered by clock_was_set(), preventing us from calling it even after we drop the timekeeping lock. So instead call clock_was_set_delayed() at that point. Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Prarit Bhargava <prarit@redhat.com> Cc: Richard Cochran <richardcochran@gmail.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Sasha Levin <sasha.levin@oracle.com> Reported-by: Sasha Levin <sasha.levin@oracle.com> Tested-by: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: John Stultz <john.stultz@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Borislav Petkov authored
commit d5a1c7e3 upstream. 41c7f742 ("rtc: Disable the alarm in the hardware (v2)") added the functionality to disable the RTC wake alarm when shutting down the box. However, there are at least two b0rked BIOSes we know about: https://bugzilla.novell.com/show_bug.cgi?id=812592 https://bugzilla.novell.com/show_bug.cgi?id=805740 where, when wakeup alarm is enabled in the BIOS, the machine reboots automatically right after shutdown, regardless of what wakeup time is programmed. Bisecting the issue lead to this patch so disable its functionality with a DMI quirk only for those boxes. Cc: Brecht Machiels <brecht@mos6581.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: John Stultz <john.stultz@linaro.org> Cc: Rabin Vincent <rabin.vincent@stericsson.com> Signed-off-by: Borislav Petkov <bp@suse.de> [jstultz: Changed variable name for clarity, added extra dmi entry] Tested-by: Brecht Machiels <brecht@mos6581.org> Tested-by: Borislav Petkov <bp@suse.de> Signed-off-by: John Stultz <john.stultz@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Ying Xue authored
commit 57d2aa00 upstream. The issue below was found in 2.6.34-rt rather than mainline rt kernel, but the issue still exists upstream as well. So please let me describe how it was noticed on 2.6.34-rt: On this version, each softirq has its own thread, it means there is at least one RT FIFO task per cpu. The priority of these tasks is set to 49 by default. If user launches an RT FIFO task with priority lower than 49 of softirq RT tasks, it's possible there are two RT FIFO tasks enqueued one cpu runqueue at one moment. By current strategy of balancing RT tasks, when it comes to RT tasks, we really need to put them off to a CPU that they can run on as soon as possible. Even if it means a bit of cache line flushing, we want RT tasks to be run with the least latency. When the user RT FIFO task which just launched before is running, the sched timer tick of the current cpu happens. In this tick period, the timeout value of the user RT task will be updated once. Subsequently, we try to wake up one softirq RT task on its local cpu. As the priority of current user RT task is lower than the softirq RT task, the current task will be preempted by the higher priority softirq RT task. Before preemption, we check to see if current can readily move to a different cpu. If so, we will reschedule to allow the RT push logic to try to move current somewhere else. Whenever the woken softirq RT task runs, it first tries to migrate the user FIFO RT task over to a cpu that is running a task of lesser priority. If migration is done, it will send a reschedule request to the found cpu by IPI interrupt. Once the target cpu responds the IPI interrupt, it will pick the migrated user RT task to preempt its current task. When the user RT task is running on the new cpu, the sched timer tick of the cpu fires. So it will tick the user RT task again. This also means the RT task timeout value will be updated again. As the migration may be done in one tick period, it means the user RT task timeout value will be updated twice within one tick. If we set a limit on the amount of cpu time for the user RT task by setrlimit(RLIMIT_RTTIME), the SIGXCPU signal should be posted upon reaching the soft limit. But exactly when the SIGXCPU signal should be sent depends on the RT task timeout value. In fact the timeout mechanism of sending the SIGXCPU signal assumes the RT task timeout is increased once every tick. However, currently the timeout value may be added twice per tick. So it results in the SIGXCPU signal being sent earlier than expected. To solve this issue, we prevent the timeout value from increasing twice within one tick time by remembering the jiffies value of last updating the timeout. As long as the RT task's jiffies is different with the global jiffies value, we allow its timeout to be updated. Signed-off-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Fan Du <fan.du@windriver.com> Reviewed-by: Yong Zhang <yong.zhang0@gmail.com> Acked-by: Steven Rostedt <rostedt@goodmis.org> Cc: <peterz@infradead.org> Link: http://lkml.kernel.org/r/1342508623-2887-1-git-send-email-ying.xue@windriver.comSigned-off-by: Ingo Molnar <mingo@kernel.org> [ lizf: backported to 3.4: adjust context ] Signed-off-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Peter Boonstoppel authored
commit a4c96ae3 upstream. migrate_tasks() uses _pick_next_task_rt() to get tasks from the real-time runqueues to be migrated. When rt_rq is throttled _pick_next_task_rt() won't return anything, in which case migrate_tasks() can't move all threads over and gets stuck in an infinite loop. Instead unthrottle rt runqueues before migrating tasks. Additionally: move unthrottle_offline_cfs_rqs() to rq_offline_fair() Signed-off-by: Peter Boonstoppel <pboonstoppel@nvidia.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Turner <pjt@google.com> Link: http://lkml.kernel.org/r/5FBF8E85CA34454794F0F7ECBA79798F379D3648B7@HQMAIL04.nvidia.comSigned-off-by: Ingo Molnar <mingo@kernel.org> [ lizf: backported to 3.4: adjust context ] Signed-off-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Mike Galbraith authored
commit e221d028 upstream. Root task group bandwidth replenishment must service all CPUs, regardless of where the timer was last started, and regardless of the isolation mechanism, lest 'Quoth the Raven, "Nevermore"' become rt scheduling policy. Signed-off-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1344326558.6968.25.camel@marge.simpson.netSigned-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Li Zefan <lizefan@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Colin Cross authored
commit 454c7999 upstream. task_tick_rt() has an optimization to only reschedule SCHED_RR tasks if they were the only element on their rq. However, with cgroups a SCHED_RR task could be the only element on its per-cgroup rq but still be competing with other SCHED_RR tasks in its parent's cgroup. In this case, the SCHED_RR task in the child cgroup would never yield at the end of its timeslice. If the child cgroup rt_runtime_us was the same as the parent cgroup rt_runtime_us, the task in the parent cgroup would starve completely. Modify task_tick_rt() to check that the task is the only task on its rq, and that the each of the scheduling entities of its ancestors is also the only entity on its rq. Signed-off-by: Colin Cross <ccross@android.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1337229266-15798-1-git-send-email-ccross@android.comSigned-off-by: Ingo Molnar <mingo@kernel.org> Cc: Li Zefan <lizefan@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Al Viro authored
commit 31abdab9 upstream. For one thing, there's an ABBA deadlock on hpfs fs-wide lock and i_mutex in hpfs_dir_lseek() - there's a lot of methods that grab the former with the caller already holding the latter, so it must take i_mutex first. For another, locking the damn thing, carefully validating the offset, then dropping locks and assigning the offset is obviously racy. Moreover, we _must_ do hpfs_add_pos(), or the machinery in dnode.c won't modify the sucker on B-tree surgeries. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Cc: Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Yijing Wang authored
commit b0cc6020 upstream. Currently, we enable ARI in a device's upstream bridge if the bridge and the device support it. But we never disable ARI, even if the device is removed and replaced with a device that doesn't support ARI. This means that if we hot-remove an ARI device and replace it with a non-ARI multi-function device, we find only function 0 of the new device because the upstream bridge still has ARI enabled, and next_ari_fn() only returns function 0 for the new non-ARI device. This patch disables ARI in the upstream bridge if the device doesn't support ARI. See the PCIe spec, r3.0, sec 6.13. [bhelgaas: changelog, function comment] [yijing: replace PCIe Cap accessor with legacy PCI accessor] Signed-off-by: Yijing Wang <wangyijing@huawei.com> Signed-off-by: Jiang Liu <jiang.liu@huawei.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Alex Deucher authored
commit 6802d4ba upstream. The BlankCrtc table in some DCE8 boards has some logic shortcuts for the vbios when this bit is set. Clear it for driver use. v2: fix typo Bug: https://bugs.freedesktop.org/show_bug.cgi?id=73420Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Alex Deucher authored
commit d45b964a upstream. Needed to properly flush the read caches for fences. Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Marek Olšák authored
commit 56492e0f upstream. This fixes a bug which was causing rejections of valid GPU commands from userspace. Signed-off-by: Marek Olšák <marek.olsak@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Michel Dänzer authored
commit 37016951 upstream. It's never allocated on systems without an ATOMBIOS or COMBIOS ROM. Should fix an oops I encountered while resetting the GPU after a lockup on my PowerBook with an RV350. Signed-off-by: Michel Dänzer <michel.daenzer@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Alex Deucher authored
commit d1951782 upstream. The hw i2c engines are disabled by default as the current implementation is still experimental. Print a warning when users enable it so that it's obvious when the option is enabled. v2: check for non-0 rather than 1 Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Joe Thornber authored
commit 12c91a5c upstream. When extending a low level space map we should update nr_blocks at the start so the new space is used for the index entries. Otherwise extend can fail, e.g.: sm_metadata_extend call sequence that fails: -> sm_ll_extend -> dm_tm_new_block -> dm_sm_new_block -> sm_bootstrap_new_block => returns -ENOSPC because smm->begin == smm->ll.nr_blocks Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Mikulas Patocka authored
commit be35f486 upstream. There may be other parts of the kernel holding a reference on the dm kobject. We must wait until all references are dropped before deallocating the mapped_device structure. The dm_kobject_release method signals that all references are dropped via completion. But dm_kobject_release doesn't free the kobject (which is embedded in the mapped_device structure). This is the sequence of operations: * when destroying a DM device, call kobject_put from dm_sysfs_exit * wait until all users stop using the kobject, when it happens the release method is called * the release method signals the completion and should return without delay * the dm device removal code that waits on the completion continues * the dm device removal code drops the dm_mod reference the device had * the dm device removal code frees the mapped_device structure that contains the kobject Using kobject this way should avoid the module unload race that was mentioned at the beginning of this thread: https://lkml.org/lkml/2014/1/4/83Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Weston Andros Adamson authored
commit 6ff33b7d upstream. When a task enters call_refreshresult with status 0 from call_refresh and !rpcauth_uptodatecred(task) it enters call_refresh again with no rate-limiting or max number of retries. Instead of trying forever, make use of the retry path that other errors use. This only seems to be possible when the crrefresh callback is gss_refresh_null, which only happens when destroying the context. To reproduce: 1) mount with sec=krb5 (or sec=sys with krb5 negotiated for non FSID specific operations). 2) reboot - the client will be stuck and will need to be hard rebooted BUG: soft lockup - CPU#0 stuck for 22s! [kworker/0:2:46] Modules linked in: rpcsec_gss_krb5 nfsv4 nfs fscache ppdev crc32c_intel aesni_intel aes_x86_64 glue_helper lrw gf128mul ablk_helper cryptd serio_raw i2c_piix4 i2c_core e1000 parport_pc parport shpchp nfsd auth_rpcgss oid_registry exportfs nfs_acl lockd sunrpc autofs4 mptspi scsi_transport_spi mptscsih mptbase ata_generic floppy irq event stamp: 195724 hardirqs last enabled at (195723): [<ffffffff814a925c>] restore_args+0x0/0x30 hardirqs last disabled at (195724): [<ffffffff814b0a6a>] apic_timer_interrupt+0x6a/0x80 softirqs last enabled at (195722): [<ffffffff8103f583>] __do_softirq+0x1df/0x276 softirqs last disabled at (195717): [<ffffffff8103f852>] irq_exit+0x53/0x9a CPU: 0 PID: 46 Comm: kworker/0:2 Not tainted 3.13.0-rc3-branch-dros_testing+ #4 Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/31/2013 Workqueue: rpciod rpc_async_schedule [sunrpc] task: ffff8800799c4260 ti: ffff880079002000 task.ti: ffff880079002000 RIP: 0010:[<ffffffffa0064fd4>] [<ffffffffa0064fd4>] __rpc_execute+0x8a/0x362 [sunrpc] RSP: 0018:ffff880079003d18 EFLAGS: 00000246 RAX: 0000000000000005 RBX: 0000000000000007 RCX: 0000000000000007 RDX: 0000000000000007 RSI: ffff88007aecbae8 RDI: ffff8800783d8900 RBP: ffff880079003d78 R08: ffff88006e30e9f8 R09: ffffffffa005a3d7 R10: ffff88006e30e7b0 R11: ffff8800783d8900 R12: ffffffffa006675e R13: ffff880079003ce8 R14: ffff88006e30e7b0 R15: ffff8800783d8900 FS: 0000000000000000(0000) GS:ffff88007f200000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f3072333000 CR3: 0000000001a0b000 CR4: 00000000001407f0 Stack: ffff880079003d98 0000000000000246 0000000000000000 ffff88007a9a4830 ffff880000000000 ffffffff81073f47 ffff88007f212b00 ffff8800799c4260 ffff8800783d8988 ffff88007f212b00 ffffe8ffff604800 0000000000000000 Call Trace: [<ffffffff81073f47>] ? trace_hardirqs_on_caller+0x145/0x1a1 [<ffffffffa00652d3>] rpc_async_schedule+0x27/0x32 [sunrpc] [<ffffffff81052974>] process_one_work+0x211/0x3a5 [<ffffffff810528d5>] ? process_one_work+0x172/0x3a5 [<ffffffff81052eeb>] worker_thread+0x134/0x202 [<ffffffff81052db7>] ? rescuer_thread+0x280/0x280 [<ffffffff81052db7>] ? rescuer_thread+0x280/0x280 [<ffffffff810584a0>] kthread+0xc9/0xd1 [<ffffffff810583d7>] ? __kthread_parkme+0x61/0x61 [<ffffffff814afd6c>] ret_from_fork+0x7c/0xb0 [<ffffffff810583d7>] ? __kthread_parkme+0x61/0x61 Code: e8 87 63 fd e0 c6 05 10 dd 01 00 01 48 8b 43 70 4c 8d 6b 70 45 31 e4 a8 02 0f 85 d5 02 00 00 4c 8b 7b 48 48 c7 43 48 00 00 00 00 <4c> 8b 4b 50 4d 85 ff 75 0c 4d 85 c9 4d 89 cf 0f 84 32 01 00 00 And the output of "rpcdebug -m rpc -s all": RPC: 61 call_refresh (status 0) RPC: 61 call_refresh (status 0) RPC: 61 refreshing RPCSEC_GSS cred ffff88007a413cf0 RPC: 61 refreshing RPCSEC_GSS cred ffff88007a413cf0 RPC: 61 call_refreshresult (status 0) RPC: 61 refreshing RPCSEC_GSS cred ffff88007a413cf0 RPC: 61 call_refreshresult (status 0) RPC: 61 refreshing RPCSEC_GSS cred ffff88007a413cf0 RPC: 61 call_refresh (status 0) RPC: 61 call_refreshresult (status 0) RPC: 61 call_refresh (status 0) RPC: 61 call_refresh (status 0) RPC: 61 refreshing RPCSEC_GSS cred ffff88007a413cf0 RPC: 61 call_refreshresult (status 0) RPC: 61 call_refresh (status 0) RPC: 61 refreshing RPCSEC_GSS cred ffff88007a413cf0 RPC: 61 call_refresh (status 0) RPC: 61 refreshing RPCSEC_GSS cred ffff88007a413cf0 RPC: 61 refreshing RPCSEC_GSS cred ffff88007a413cf0 RPC: 61 call_refreshresult (status 0) RPC: 61 call_refresh (status 0) RPC: 61 call_refresh (status 0) RPC: 61 call_refresh (status 0) RPC: 61 call_refresh (status 0) RPC: 61 call_refreshresult (status 0) RPC: 61 refreshing RPCSEC_GSS cred ffff88007a413cf0 Signed-off-by: Weston Andros Adamson <dros@netapp.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-