- 14 Jan, 2011 25 commits
-
-
Mel Gorman authored
Lumpy reclaim is disruptive. It reclaims a large number of pages and ignores the age of the pages it reclaims. This can incur significant stalls and potentially increase the number of major faults. Compaction has reached the point where it is considered reasonably stable (meaning it has passed a lot of testing) and is a potential candidate for displacing lumpy reclaim. This patch introduces an alternative to lumpy reclaim whe compaction is available called reclaim/compaction. The basic operation is very simple - instead of selecting a contiguous range of pages to reclaim, a number of order-0 pages are reclaimed and then compaction is later by either kswapd (compact_zone_order()) or direct compaction (__alloc_pages_direct_compact()). [akpm@linux-foundation.org: fix build] [akpm@linux-foundation.org: use conventional task_struct naming] Signed-off-by: Mel Gorman <mel@csn.ul.ie> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Rik van Riel <riel@redhat.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Andy Whitcroft <apw@shadowen.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Mel Gorman authored
Currently lumpy_mode is an enum and determines if lumpy reclaim is off, syncronous or asyncronous. In preparation for using compaction instead of lumpy reclaim, this patch converts the flags into a bitmap. Signed-off-by: Mel Gorman <mel@csn.ul.ie> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Rik van Riel <riel@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Andy Whitcroft <apw@shadowen.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Mel Gorman authored
In preparation for a patches promoting the use of memory compaction over lumpy reclaim, this patch adds trace points for memory compaction activity. Using them, we can monitor the scanning activity of the migration and free page scanners as well as the number and success rates of pages passed to page migration. Signed-off-by: Mel Gorman <mel@csn.ul.ie> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Rik van Riel <riel@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Andy Whitcroft <apw@shadowen.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Nikanth Karthikesan authored
Currently there is no way to find whether a process has locked its pages in memory or not. And which of the memory regions are locked in memory. Add a new field "Locked" to export this information via the smaps file. Signed-off-by: Nikanth Karthikesan <knikanth@suse.de> Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com> Acked-by: Wu Fengguang <fengguang.wu@intel.com> Cc: Matt Mackall <mpm@selenic.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Joe Perches authored
Signed-off-by: Joe Perches <joe@perches.com> Acked-by: Pekka Enberg <penberg@kernel.org> Cc: Jiri Kosina <trivial@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Hai Shan authored
Merge mpage_end_io_read() and mpage_end_io_write() into mpage_end_io() to eliminate code duplication. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Hai Shan <shan.hai@windriver.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Nick Piggin authored
Testing ->mapping and ->index without a ref is not stable as the page may have been reused at this point. Signed-off-by: Nick Piggin <npiggin@kernel.dk> Reviewed-by: Wu Fengguang <fengguang.wu@intel.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
KOSAKI Motohiro authored
Currently, kswapd() has deep nesting and is slightly hard to read. Clean this up. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Bob Liu authored
__set_page_dirty_no_writeback() should return true if it actually transitioned the page from a clean to dirty state although it seems nobody uses its return value at present. Signed-off-by: Bob Liu <lliubbo@gmail.com> Acked-by: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Andrew Morton authored
Use correct function name, remove incorrect apostrophe Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Jan Kara authored
When wb_writeback() is called in WB_SYNC_ALL mode, work->nr_to_write is usually set to LONG_MAX. The logic in wb_writeback() then calls __writeback_inodes_sb() with nr_to_write == MAX_WRITEBACK_PAGES and we easily end up with non-positive nr_to_write after the function returns, if the inode has more than MAX_WRITEBACK_PAGES dirty pages at the moment. When nr_to_write is <= 0 wb_writeback() decides we need another round of writeback but this is wrong in some cases! For example when a single large file is continuously dirtied, we would never finish syncing it because each pass would be able to write MAX_WRITEBACK_PAGES and inode dirty timestamp never gets updated (as inode is never completely clean). Thus __writeback_inodes_sb() would write the redirtied inode again and again. Fix the issue by setting nr_to_write to LONG_MAX in WB_SYNC_ALL mode. We do not need nr_to_write in WB_SYNC_ALL mode anyway since write_cache_pages() does livelock avoidance using page tagging in WB_SYNC_ALL mode. This makes wb_writeback() call __writeback_inodes_sb() only once on WB_SYNC_ALL. The latter function won't livelock because it works on - a finite set of files by doing queue_io() once at the beginning - a finite set of pages by PAGECACHE_TAG_TOWRITE page tagging After this patch, program from http://lkml.org/lkml/2010/10/24/154 is no longer able to stall sync forever. [fengguang.wu@intel.com: fix locking comment] Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Jan Engelhardt <jengelh@medozas.de> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Jan Kara authored
Background writeback is easily livelockable in a loop in wb_writeback() by a process continuously re-dirtying pages (or continuously appending to a file). This is in fact intended as the target of background writeback is to write dirty pages it can find as long as we are over dirty_background_threshold. But the above behavior gets inconvenient at times because no other work queued in the flusher thread's queue gets processed. In particular, since e.g. sync(1) relies on flusher thread to do all the IO for it, sync(1) can hang forever waiting for flusher thread to do the work. Generally, when a flusher thread has some work queued, someone submitted the work to achieve a goal more specific than what background writeback does. Moreover by working on the specific work, we also reduce amount of dirty pages which is exactly the target of background writeout. So it makes sense to give specific work a priority over a generic page cleaning. Thus we interrupt background writeback if there is some other work to do. We return to the background writeback after completing all the queued work. This may delay the writeback of expired inodes for a while, however the expired inodes will eventually be flushed to disk as long as the other works won't livelock. [fengguang.wu@intel.com: update comment] Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Jan Engelhardt <jengelh@medozas.de> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Wu Fengguang authored
This tracks when balance_dirty_pages() tries to wakeup the flusher thread for background writeback (if it was not started already). Suggested-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Cc: Jan Kara <jack@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Jan Engelhardt <jengelh@medozas.de> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Jan Kara authored
Check whether background writeback is needed after finishing each work. When bdi flusher thread finishes doing some work check whether any kind of background writeback needs to be done (either because dirty_background_ratio is exceeded or because we need to start flushing old inodes). If so, just do background write back. This way, bdi_start_background_writeback() just needs to wake up the flusher thread. It will do background writeback as soon as there is no other work. This is a preparatory patch for the next patch which stops background writeback as soon as there is other work to do. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Jan Engelhardt <jengelh@medozas.de> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Mel Gorman authored
reduce_pgdat_percpu_threshold() and restore_pgdat_percpu_threshold() exist to adjust the per-cpu vmstat thresholds while kswapd is awake to avoid errors due to counter drift. The functions duplicate some code so this patch replaces them with a single set_pgdat_percpu_threshold() that takes a callback function to calculate the desired threshold as a parameter. [akpm@linux-foundation.org: readability tweak] [kosaki.motohiro@jp.fujitsu.com: set_pgdat_percpu_threshold(): don't use for_each_online_cpu] Signed-off-by: Mel Gorman <mel@csn.ul.ie> Reviewed-by: Christoph Lameter <cl@linux.com> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Mel Gorman authored
Commit aa454840 ("calculate a better estimate of NR_FREE_PAGES when memory is low") noted that watermarks were based on the vmstat NR_FREE_PAGES. To avoid synchronization overhead, these counters are maintained on a per-cpu basis and drained both periodically and when a threshold is above a threshold. On large CPU systems, the difference between the estimate and real value of NR_FREE_PAGES can be very high. The system can get into a case where pages are allocated far below the min watermark potentially causing livelock issues. The commit solved the problem by taking a better reading of NR_FREE_PAGES when memory was low. Unfortately, as reported by Shaohua Li this accurate reading can consume a large amount of CPU time on systems with many sockets due to cache line bouncing. This patch takes a different approach. For large machines where counter drift might be unsafe and while kswapd is awake, the per-cpu thresholds for the target pgdat are reduced to limit the level of drift to what should be a safe level. This incurs a performance penalty in heavy memory pressure by a factor that depends on the workload and the machine but the machine should function correctly without accidentally exhausting all memory on a node. There is an additional cost when kswapd wakes and sleeps but the event is not expected to be frequent - in Shaohua's test case, there was one recorded sleep and wake event at least. To ensure that kswapd wakes up, a safe version of zone_watermark_ok() is introduced that takes a more accurate reading of NR_FREE_PAGES when called from wakeup_kswapd, when deciding whether it is really safe to go back to sleep in sleeping_prematurely() and when deciding if a zone is really balanced or not in balance_pgdat(). We are still using an expensive function but limiting how often it is called. When the test case is reproduced, the time spent in the watermark functions is reduced. The following report is on the percentage of time spent cumulatively spent in the functions zone_nr_free_pages(), zone_watermark_ok(), __zone_watermark_ok(), zone_watermark_ok_safe(), zone_page_state_snapshot(), zone_page_state(). vanilla 11.6615% disable-threshold 0.2584% David said: : We had to pull aa454840 "mm: page allocator: calculate a better estimate : of NR_FREE_PAGES when memory is low and kswapd is awake" from 2.6.36 : internally because tests showed that it would cause the machine to stall : as the result of heavy kswapd activity. I merged it back with this fix as : it is pending in the -mm tree and it solves the issue we were seeing, so I : definitely think this should be pushed to -stable (and I would seriously : consider it for 2.6.37 inclusion even at this late date). Signed-off-by: Mel Gorman <mel@csn.ul.ie> Reported-by: Shaohua Li <shaohua.li@intel.com> Reviewed-by: Christoph Lameter <cl@linux.com> Tested-by: Nicolas Bareil <nico@chdir.org> Cc: David Rientjes <rientjes@google.com> Cc: Kyle McMartin <kyle@mcmartin.ca> Cc: <stable@kernel.org> [2.6.37.1, 2.6.36.x] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Dave Jones authored
This warning was added in commit bdff746a ("clone: prepare to recycle CLONE_STOPPED") three years ago. 2.6.26 came and went. As far as I know, no-one is actually using CLONE_STOPPED. Signed-off-by: Dave Jones <davej@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Claudio Scordino authored
When working in RS485 mode, the atmel_serial driver keeps RTS high after the initialization of the serial port. It goes low only after the first character has been sent. [akpm@linux-foundation.org: simplify code] Signed-off-by: Claudio Scordino <claudio@evidence.eu.com> Signed-off-by: Arkadiusz Bubala <arkadiusz.bubala@gmail.com> Tested-by: Arkadiusz Bubala <arkadiusz.bubala@gmail.com> Cc: Nicolas Ferre <nicolas.ferre@atmel.com> Cc: Greg KH <greg@kroah.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Eric Dumazet authored
Use modern per_cpu API to increment {soft|hard}irq counters, and use per_cpu allocation for (struct irq_desc)->kstats_irq instead of an array. This gives better SMP/NUMA locality and saves few instructions per irq. With small nr_cpuids values (8 for example), kstats_irq was a small array (less than L1_CACHE_BYTES), potentially source of false sharing. In the !CONFIG_SPARSE_IRQ case, remove the huge, NUMA/cache unfriendly kstat_irqs_all[NR_IRQS][NR_CPUS] array. Note: we still populate kstats_irq for all possible irqs in early_irq_init(). We probably could use on-demand allocations. (Code included in alloc_descs()). Problem is not all IRQS are used with a prior alloc_descs() call. kstat_irqs_this_cpu() is not used anymore, remove it. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Reviewed-by: Christoph Lameter <cl@linux.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Andi Kleen <andi@firstfloor.org> Cc: Tejun Heo <tj@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Bruce Chang authored
Since the original maintainer-Joseph Chan (josephchan@via.com.tw) doesn't handle the Linux driver for VIA now, I would like to request to update the maintainer for the SD/MMC CARD CONTROLLER DRIVER and VIA UNICHROME(PRO)/CHROME9 FRAMEBUFFER DRIVER before we find a better one. Signed-off-by: Bruce Chang <brucechang@via.com.tw> Signed-off-by: Florian Tobias Schandinat <FlorianSchandinat@gmx.de> Cc: Joseph Chan <JosephChan@via.com.tw> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Harald Welte <HaraldWelte@viatech.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
git://git.kernel.org/pub/scm/linux/kernel/git/agk/linux-2.6-dmLinus Torvalds authored
* git://git.kernel.org/pub/scm/linux/kernel/git/agk/linux-2.6-dm: (32 commits) dm: raid456 basic support dm: per target unplug callback support dm: introduce target callbacks and congestion callback dm mpath: delay activate_path retry on SCSI_DH_RETRY dm: remove superfluous irq disablement in dm_request_fn dm log: use PTR_ERR value instead of ENOMEM dm snapshot: avoid storing private suspended state dm snapshot: persistent make metadata_wq multithreaded dm: use non reentrant workqueues if equivalent dm: convert workqueues to alloc_ordered dm stripe: switch from local workqueue to system_wq dm: dont use flush_scheduled_work dm snapshot: remove unused dm_snapshot queued_bios_work dm ioctl: suppress needless warning messages dm crypt: add loop aes iv generator dm crypt: add multi key capability dm crypt: add post iv call to iv generator dm crypt: use io thread for reads only if mempool exhausted dm crypt: scale to multiple cpus dm crypt: simplify compatible table output ...
-
git://neil.brown.name/mdLinus Torvalds authored
* 'for-linus' of git://neil.brown.name/md: md: Fix removal of extra drives when converting RAID6 to RAID5 md: range check slot number when manually adding a spare. md/raid5: handle manually-added spares in start_reshape. md: fix sync_completed reporting for very large drives (>2TB) md: allow suspend_lo and suspend_hi to decrease as well as increase. md: Don't let implementation detail of curr_resync leak out through sysfs. md: separate meta and data devs md-new-param-to_sync_page_io md-new-param-to-calc_dev_sboffset md: Be more careful about clearing flags bit in ->recovery md: md_stop_writes requires mddev_lock. md/raid5: use sysfs_notify_dirent_safe to avoid NULL pointer md: Ensure no IO request to get md device before it is properly initialised. md: Fix single printks with multiple KERN_<level>s md: fix regression resulting in delays in clearing bits in a bitmap md: fix regression with re-adding devices to arrays with no metadata
-
git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux-2.6Linus Torvalds authored
* 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux-2.6: [IA64] fix build error - arch/ia64/kernel/perfmon.c
-
Linus Torvalds authored
This reverts commit 0fdae42d, which wasn't really supposed to go in, and causes lots of annoying warnings. Quoth Andrew: "Complete brainfart - I meant to drop that patch ages ago." Quoth Greg: "Ick, yeah, that patch isn't ok to go in as-is, all of the callers need to be fixed up first, which is what I thought we had agreed on..." Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Acked-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Greg KH <greg@kroah.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Linus Torvalds authored
Stephen Rothwell reports that the vfs merge broke the build of ecryptfs. The breakage comes from commit 66cb7666 ("sanitize ecryptfs ->mount()") which was obviously not even build tested. Tssk, tssk, Al. This is the minimal build fixup for the situation, although I don't have a filesystem to actually test it with. Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
- 13 Jan, 2011 15 commits
-
-
Tony Luck authored
arch/ia64/kernel/perfmon.c:621: error: duplicate 'static' Introduced by commit c74a1cbb pass default dentry_operations to mount_pseudo() Signed-off-by: Tony Luck <tony.luck@intel.com>
-
NeilBrown authored
When a RAID6 is converted to a RAID5, the extra drive should be discarded. However it isn't due to a typo in a comparison. This bug was introduced in commit e93f68a1 in 2.6.35-rc4 and is suitable for any -stable since than. As the extra drive is not removed, the 'degraded' counter is wrong and so the RAID5 will not respond correctly to a subsequent failure. Cc: stable@kernel.org Signed-off-by: NeilBrown <neilb@suse.de>
-
NeilBrown authored
When adding a spare to an active array, we should check the slot number, but allow it to be larger than raid_disks if a reshape is being prepared. Apply the same test when adding a device to an array-under-construction. It already had most of the test in place, but not quite all. Signed-off-by: NeilBrown <neilb@suse.de>
-
NeilBrown authored
It is possible to manually add spares to specific slots before starting a reshape. raid5_start_reshape should recognised this possibility and include it in the accounting. Signed-off-by: NeilBrown <neilb@suse.de>
-
Rémi Rérolle authored
The values exported in the sync_completed file are unsigned long, which overflows with very large drives, resulting in wrong values reported. Since sync_completed uses sectors as unit, we'll start getting wrong values with components larger than 2TB. This patch simply replaces the use of unsigned long by unsigned long long. Signed-off-by: Rémi Rérolle <rrerolle@lacie.com> Signed-off-by: NeilBrown <neilb@suse.de>
-
NeilBrown authored
The sysfs attributes 'suspend_lo' and 'suspend_hi' describe a region to which read/writes are suspended so that the under lying data can be manipulated without user-space noticing. Currently the window they describe can only move forwards along the device. However this is an unnecessary restriction which will cause problems with planned developments. So relax this restriction and allow these endpoints to move arbitrarily. Signed-off-by: NeilBrown <neilb@suse.de>
-
NeilBrown authored
mddev->curr_resync has artificial values of '1' and '2' which are used by the code which ensures only one resync is happening at a time on any given device. These values are internal and should never be exposed to user-space (except when translated appropriately as in the 'pending' status in /proc/mdstat). Unfortunately they are as ->curr_resync is assigned to ->curr_resync_completed and that value is directly visible through sysfs. So change the assignments to ->curr_resync_completed to get the same valued from elsewhere in a form that doesn't have the magic '1' or '2' values. Signed-off-by: NeilBrown <neilb@suse.de>
-
Jonathan Brassow authored
Allow the metadata to be on a separate device from the data. This doesn't mean the data and metadata will by on separate physical devices - it simply gives device-mapper and userspace tools more flexibility. Signed-off-by: NeilBrown <neilb@suse.de>
-
Jonathan Brassow authored
Add new parameter to 'sync_page_io'. The new parameter allows us to distinguish between metadata and data operations. This becomes important later when we add the ability to use separate devices for data and metadata. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
-
Jonathan Brassow authored
When we allow for separate devices for data and metadata in a later patch, we will need to be able to calculate the superblock offset based on more than the bdev. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
-
NeilBrown authored
Setting ->recovery to 0 is generally not a good idea as it could clear bits that shouldn't be cleared. In particular, MD_RECOVERY_FROZEN should only be cleared on explicit request from user-space. So when we need to clear things, just clear the bits that need clearing. As there are a few different places which reap a resync process - and some do an incomplte job - factor out the code for doing the from md_check_recovery and call that function instead of open coding part of it. Signed-off-by: NeilBrown <neilb@suse.de> Reported-by: Jonathan Brassow <jbrassow@redhat.com>
-
NeilBrown authored
As md_stop_writes manipulates the sync_thread and calls md_update_sb, it need to be called with mddev_lock held. In all internal cases it is, but the symbol is exported for dm-raid to call and in that case the lock won't be help. Do make an exported version which takes the lock, and an internal version which does not. Signed-off-by: NeilBrown <neilb@suse.de>
-
Jonathan Brassow authored
With the module parameter 'start_dirty_degraded' set, raid5_spare_active() previously called sysfs_notify_dirent() with a NULL argument (rdev->sysfs_state) when a rebuild finished. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
-
NeilBrown authored
When an md device is in the process of coming on line it is possible for an IO request (typically a partition table probe) to get through before the array is fully initialised, which can cause unexpected behaviour (e.g. a crash). So explicitly record when the array is ready for IO and don't allow IO through until then. There is no possibility for a similar problem when the array is going off-line as there must only be one 'open' at that time, and it is busy off-lining the array and so cannot send IO requests. So no memory barrier is needed in md_stop() This has been a bug since commit 409c57f3 in 2.6.30 which introduced md_make_request. Before then, each personality would register its own make_request_fn when it was ready. This is suitable for any stable kernel from 2.6.30.y onwards. Cc: <stable@kernel.org> Signed-off-by: NeilBrown <neilb@suse.de> Reported-by: "Hawrylewicz Czarnowski, Przemyslaw" <przemyslaw.hawrylewicz.czarnowski@intel.com>
-
Joe Perches authored
Noticed-by: Russell King <linux@arm.linux.org.uk> Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: NeilBrown <neilb@suse.de>
-