1. 14 Jan, 2011 15 commits
    • writeback: avoid livelocking WB_SYNC_ALL writeback · b9543dac
      Jan Kara authored
      When wb_writeback() is called in WB_SYNC_ALL mode, work->nr_to_write is
      usually set to LONG_MAX.  The logic in wb_writeback() then calls
      __writeback_inodes_sb() with nr_to_write == MAX_WRITEBACK_PAGES and we
      easily end up with non-positive nr_to_write after the function returns, if
      the inode has more than MAX_WRITEBACK_PAGES dirty pages at the moment.
      
      When nr_to_write is <= 0 wb_writeback() decides we need another round of
      writeback but this is wrong in some cases!  For example when a single
      large file is continuously dirtied, we would never finish syncing it
      because each pass would be able to write MAX_WRITEBACK_PAGES and the inode
      dirty timestamp never gets updated (as the inode is never completely clean).
      Thus __writeback_inodes_sb() would write the redirtied inode again and
      again.
      
      Fix the issue by setting nr_to_write to LONG_MAX in WB_SYNC_ALL mode.  We
      do not need nr_to_write in WB_SYNC_ALL mode anyway since
      write_cache_pages() does livelock avoidance using page tagging in
      WB_SYNC_ALL mode.
      
      This makes wb_writeback() call __writeback_inodes_sb() only once on
      WB_SYNC_ALL.  The latter function won't livelock because it works on
      
      - a finite set of files by doing queue_io() once at the beginning
      - a finite set of pages by PAGECACHE_TAG_TOWRITE page tagging
      
      After this patch, program from http://lkml.org/lkml/2010/10/24/154 is no
      longer able to stall sync forever.
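The difference between the two behaviours can be illustrated with a small userspace model. This is only a sketch under stated assumptions: `toy_inode`, `sync_chunked()`, `sync_tagged()` and the `CHUNK` value are hypothetical names invented here, and the real logic lives in wb_writeback() and write_cache_pages().

```c
#include <assert.h>

#define CHUNK 1024L /* stand-in for MAX_WRITEBACK_PAGES; the value is illustrative */

/* Toy inode: only tracks how many pages are dirty. */
struct toy_inode {
    long dirty;
};

/*
 * Old behaviour (simplified): write at most CHUNK pages per pass and start
 * another pass while the inode is still dirty.  `redirty` pages are dirtied
 * again between passes.  Returns the number of passes needed, or -1 if
 * max_passes was reached without the inode becoming clean.
 */
long sync_chunked(struct toy_inode *inode, long redirty, long max_passes)
{
    assert(inode && redirty >= 0 && max_passes > 0);

    for (long pass = 1; pass <= max_passes; pass++) {
        long written = inode->dirty < CHUNK ? inode->dirty : CHUNK;

        inode->dirty -= written;
        if (inode->dirty == 0)
            return pass;
        inode->dirty += redirty; /* the file keeps being dirtied */
    }
    return -1;
}

/*
 * New behaviour (simplified): fix the set of pages to write at the start of
 * a single pass (the role PAGECACHE_TAG_TOWRITE plays) and write only those.
 * Pages dirtied during the pass are left for a later sync.
 * Returns the number of pages written.
 */
long sync_tagged(struct toy_inode *inode, long redirty)
{
    assert(inode && redirty >= 0);

    long tagged = inode->dirty; /* finite set, fixed up front */
    long written = 0;

    while (tagged > 0) {
        long n = tagged < CHUNK ? tagged : CHUNK;

        tagged -= n;
        written += n;
        inode->dirty -= n;
        inode->dirty += redirty; /* redirtying no longer extends the pass */
    }
    return written;
}
```

With 4096 dirty pages and 1024 pages redirtied per chunk, `sync_chunked()` never reaches a clean inode, while `sync_tagged()` stops after writing the 4096 pages it tagged.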
      
      [fengguang.wu@intel.com: fix locking comment]
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Jan Engelhardt <jengelh@medozas.de>
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • writeback: stop background/kupdate works from livelocking other works · aa373cf5
      Jan Kara authored
      Background writeback is easily livelockable in a loop in wb_writeback() by
      a process continuously re-dirtying pages (or continuously appending to a
      file).  This is in fact intended as the target of background writeback is
      to write dirty pages it can find as long as we are over
      dirty_background_threshold.
      
      But the above behavior gets inconvenient at times because no other work
      queued in the flusher thread's queue gets processed.  In particular, since
      e.g.  sync(1) relies on flusher thread to do all the IO for it, sync(1)
      can hang forever waiting for flusher thread to do the work.
      
      Generally, when a flusher thread has some work queued, someone submitted
      the work to achieve a goal more specific than what background writeback
      does.  Moreover, by working on the specific work, we also reduce the amount
      of dirty pages, which is exactly the target of background writeout.  So it
      makes sense to give specific work a priority over a generic page cleaning.
      
      Thus we interrupt background writeback if there is some other work to do.
      We return to the background writeback after completing all the queued
      work.
      
      This may delay the writeback of expired inodes for a while, however the
      expired inodes will eventually be flushed to disk as long as the other
      works won't livelock.
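The idea of interrupting background writeback can be sketched as a toy loop in userspace C. Everything here is an assumption made for illustration: `toy_flusher` and `background_writeback()` are hypothetical, and the real check sits inside wb_writeback().

```c
#include <assert.h>

/* Hypothetical stand-in for the flusher thread's state. */
struct toy_flusher {
    int queued_works;       /* pending explicit works (e.g. queued by sync) */
    long dirty_pages;       /* pages currently dirty */
    long background_thresh; /* background writeback runs while above this */
};

/*
 * One run of background writeback.  `redirty` pages are dirtied again after
 * every chunk, and one explicit work item arrives once `work_arrives_after`
 * chunks have been written (pass a negative value for "never").
 * Returns the number of chunks written before the loop stopped, or -1 if
 * max_chunks was reached.
 */
long background_writeback(struct toy_flusher *f, long chunk, long redirty,
                          long work_arrives_after, int yield_to_work,
                          long max_chunks)
{
    assert(f && chunk > 0 && redirty >= 0 && max_chunks > 0);

    for (long done = 0; done < max_chunks; done++) {
        if (done == work_arrives_after)
            f->queued_works++;
        /* The fix: stop background writeback when other work is queued. */
        if (yield_to_work && f->queued_works > 0)
            return done;
        if (f->dirty_pages <= f->background_thresh)
            return done;
        f->dirty_pages -= chunk < f->dirty_pages ? chunk : f->dirty_pages;
        f->dirty_pages += redirty;
    }
    return -1;
}
```

With a writer that redirties as fast as the flusher cleans, the loop only ends when `yield_to_work` is set; otherwise the queued work is never reached.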
      
      [fengguang.wu@intel.com: update comment]
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Jan Engelhardt <jengelh@medozas.de>
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • writeback: trace wakeup event for background writeback · 71927e84
      Wu Fengguang authored
      This tracks when balance_dirty_pages() tries to wake up the flusher thread
      for background writeback (if it was not started already).

      Suggested-by: Christoph Hellwig <hch@infradead.org>
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Jan Engelhardt <jengelh@medozas.de>
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • writeback: integrated background writeback work · 6585027a
      Jan Kara authored
      Check whether background writeback is needed after finishing each work.
      
      When the bdi flusher thread finishes doing some work, check whether any kind of
      background writeback needs to be done (either because
      dirty_background_ratio is exceeded or because we need to start flushing
      old inodes).  If so, just do background writeback.
      
      This way, bdi_start_background_writeback() just needs to wake up the
      flusher thread.  It will do background writeback as soon as there is no
      other work.
      
      This is a preparatory patch for the next patch which stops background
      writeback as soon as there is other work to do.

      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Jan Engelhardt <jengelh@medozas.de>
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: vmstat: use a single setter function and callback for adjusting percpu thresholds · b44129b3
      Mel Gorman authored
      reduce_pgdat_percpu_threshold() and restore_pgdat_percpu_threshold() exist
      to adjust the per-cpu vmstat thresholds while kswapd is awake to avoid
      errors due to counter drift.  The functions duplicate some code so this
      patch replaces them with a single set_pgdat_percpu_threshold() that takes
      a callback function to calculate the desired threshold as a parameter.
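The "single setter plus callback" shape can be sketched in plain userspace C. The struct layouts, the `toy_` names and the threshold values below are assumptions made for illustration only; the real formulas and data structures live in mm/vmstat.c.

```c
#include <assert.h>

#define TOY_NR_ZONES 2
#define TOY_NR_CPUS  4

struct toy_zone {
    long managed_pages;
    int stat_threshold[TOY_NR_CPUS]; /* per-cpu copies of the threshold */
};

struct toy_pgdat {
    struct toy_zone zones[TOY_NR_ZONES];
};

/* Hypothetical policy used while kswapd sleeps. */
int toy_normal_threshold(const struct toy_zone *zone)
{
    return zone->managed_pages > 100000 ? 125 : 32;
}

/* Hypothetical policy used while kswapd is awake. */
int toy_pressure_threshold(const struct toy_zone *zone)
{
    (void)zone;
    return 4; /* a small threshold limits counter drift */
}

/* Single setter: the callback picks the value, the loops are shared. */
void toy_set_pgdat_percpu_threshold(struct toy_pgdat *pgdat,
                                    int (*calculate)(const struct toy_zone *))
{
    assert(pgdat && calculate);

    for (int z = 0; z < TOY_NR_ZONES; z++) {
        struct toy_zone *zone = &pgdat->zones[z];
        int threshold = calculate(zone);

        for (int cpu = 0; cpu < TOY_NR_CPUS; cpu++)
            zone->stat_threshold[cpu] = threshold;
    }
}
```

A caller would pass `toy_pressure_threshold` when kswapd wakes and `toy_normal_threshold` when it goes back to sleep, so the traversal code exists only once.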
      
      [akpm@linux-foundation.org: readability tweak]
      [kosaki.motohiro@jp.fujitsu.com: set_pgdat_percpu_threshold(): don't use for_each_online_cpu]
      Signed-off-by: Mel Gorman <mel@csn.ul.ie>
      Reviewed-by: Christoph Lameter <cl@linux.com>
      Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: page allocator: adjust the per-cpu counter threshold when memory is low · 88f5acf8
      Mel Gorman authored
      Commit aa454840 ("calculate a better estimate of NR_FREE_PAGES when memory
      is low") noted that watermarks were based on the vmstat NR_FREE_PAGES.  To
      avoid synchronization overhead, these counters are maintained on a per-cpu
      basis and drained both periodically and when a per-cpu delta exceeds a
      threshold.  On systems with many CPUs, the difference between the estimate and
      real value of NR_FREE_PAGES can be very high.  The system can get into a
      case where pages are allocated far below the min watermark potentially
      causing livelock issues.  The commit solved the problem by taking a better
      reading of NR_FREE_PAGES when memory was low.
      
      Unfortunately, as reported by Shaohua Li, this accurate reading can consume a
      large amount of CPU time on systems with many sockets due to cache line
      bouncing.  This patch takes a different approach.  For large machines
      where counter drift might be unsafe and while kswapd is awake, the per-cpu
      thresholds for the target pgdat are reduced to limit the level of drift to
      what should be a safe level.  This incurs a performance penalty in heavy
      memory pressure by a factor that depends on the workload and the machine
      but the machine should function correctly without accidentally exhausting
      all memory on a node.  There is an additional cost when kswapd wakes and
      sleeps but the event is not expected to be frequent - in Shaohua's test
      case, there was at least one recorded sleep and wake event.
      
      To ensure that kswapd wakes up, a safe version of zone_watermark_ok() is
      introduced that takes a more accurate reading of NR_FREE_PAGES when called
      from wakeup_kswapd, when deciding whether it is really safe to go back to
      sleep in sleeping_prematurely() and when deciding if a zone is really
      balanced or not in balance_pgdat().  We are still using an expensive
      function but limiting how often it is called.
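The trade-off between the cheap and the accurate reading can be modelled with a toy per-cpu counter. This is a userspace sketch under assumptions: `toy_counter` and its helpers are hypothetical and only mimic the roles of zone_page_state() and zone_page_state_snapshot().

```c
#include <assert.h>

#define TOY_NR_CPUS 4

struct toy_counter {
    long global;            /* what a cheap read sees */
    int delta[TOY_NR_CPUS]; /* per-cpu updates not yet folded in */
    int threshold;          /* fold a delta into global beyond this */
};

/* Update from one CPU; shared state is touched only when the delta is large. */
void toy_counter_add(struct toy_counter *c, int cpu, int n)
{
    assert(c && cpu >= 0 && cpu < TOY_NR_CPUS);

    c->delta[cpu] += n;
    if (c->delta[cpu] > c->threshold || c->delta[cpu] < -c->threshold) {
        c->global += c->delta[cpu];
        c->delta[cpu] = 0;
    }
}

/* Cheap read: may be off by up to TOY_NR_CPUS * threshold. */
long toy_counter_read(const struct toy_counter *c)
{
    return c->global;
}

/* Accurate but more expensive read: also sums every CPU's pending delta. */
long toy_counter_read_snapshot(const struct toy_counter *c)
{
    long sum = c->global;

    for (int cpu = 0; cpu < TOY_NR_CPUS; cpu++)
        sum += c->delta[cpu];
    return sum;
}
```

Lowering `threshold` shrinks the worst-case drift of the cheap read at the cost of more frequent updates to the shared value, which is the lever the patch pulls while kswapd is awake.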
      
      When the test case is reproduced, the time spent in the watermark
      functions is reduced.  The following report is on the cumulative percentage
      of time spent in the functions zone_nr_free_pages(),
      zone_watermark_ok(), __zone_watermark_ok(), zone_watermark_ok_safe(),
      zone_page_state_snapshot(), zone_page_state().
      
      vanilla                      11.6615%
      disable-threshold            0.2584%
      
      David said:
      
      : We had to pull aa454840 "mm: page allocator: calculate a better estimate
      : of NR_FREE_PAGES when memory is low and kswapd is awake" from 2.6.36
      : internally because tests showed that it would cause the machine to stall
      : as the result of heavy kswapd activity.  I merged it back with this fix as
      : it is pending in the -mm tree and it solves the issue we were seeing, so I
      : definitely think this should be pushed to -stable (and I would seriously
      : consider it for 2.6.37 inclusion even at this late date).

      Signed-off-by: Mel Gorman <mel@csn.ul.ie>
      Reported-by: Shaohua Li <shaohua.li@intel.com>
      Reviewed-by: Christoph Lameter <cl@linux.com>
      Tested-by: Nicolas Bareil <nico@chdir.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Kyle McMartin <kyle@mcmartin.ca>
      Cc: <stable@kernel.org>  [2.6.37.1, 2.6.36.x]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • sched: remove long deprecated CLONE_STOPPED flag · 43bb40c9
      Dave Jones authored
      This warning was added in commit bdff746a ("clone: prepare to recycle
      CLONE_STOPPED") three years ago.  2.6.26 came and went.  As far as I know,
      no-one is actually using CLONE_STOPPED.

      Signed-off-by: Dave Jones <davej@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • atmel_serial: fix RTS high after initialization in RS485 mode · 5dfbd1d7
      Claudio Scordino authored
      When working in RS485 mode, the atmel_serial driver keeps RTS high after
      the initialization of the serial port.  It goes low only after the first
      character has been sent.
      
      [akpm@linux-foundation.org: simplify code]
      Signed-off-by: Claudio Scordino <claudio@evidence.eu.com>
      Signed-off-by: Arkadiusz Bubala <arkadiusz.bubala@gmail.com>
      Tested-by: Arkadiusz Bubala <arkadiusz.bubala@gmail.com>
      Cc: Nicolas Ferre <nicolas.ferre@atmel.com>
      Cc: Greg KH <greg@kroah.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • irq: use per_cpu kstat_irqs · 6c9ae009
      Eric Dumazet authored
      Use the modern per_cpu API to increment {soft|hard}irq counters, and use
      per_cpu allocation for (struct irq_desc)->kstat_irqs instead of an array.
      
      This gives better SMP/NUMA locality and saves a few instructions per irq.
      
      With small nr_cpuids values (8 for example), kstat_irqs was a small array
      (less than L1_CACHE_BYTES), potentially a source of false sharing.
      
      In the !CONFIG_SPARSE_IRQ case, remove the huge, NUMA/cache unfriendly
      kstat_irqs_all[NR_IRQS][NR_CPUS] array.
      
      Note: we still populate kstat_irqs for all possible irqs in
      early_irq_init().  We probably could use on-demand allocations.  (Code
      included in alloc_descs()).  The problem is that not all IRQs are used with
      a prior alloc_descs() call.
      
      kstat_irqs_this_cpu() is not used anymore, remove it.

      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Reviewed-by: Christoph Lameter <cl@linux.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • MAINTAINERS: update entries affecting VIA Technologies · 558bbb2f
      Bruce Chang authored
      Since the original maintainer, Joseph Chan (josephchan@via.com.tw), no longer
      handles the Linux driver for VIA, I would like to request an update of the
      maintainer for the SD/MMC CARD CONTROLLER DRIVER and VIA
      UNICHROME(PRO)/CHROME9 FRAMEBUFFER DRIVER until we find a better one.

      Signed-off-by: Bruce Chang <brucechang@via.com.tw>
      Signed-off-by: Florian Tobias Schandinat <FlorianSchandinat@gmx.de>
      Cc: Joseph Chan <JosephChan@via.com.tw>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Harald Welte <HaraldWelte@viatech.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Merge git://git.kernel.org/pub/scm/linux/kernel/git/agk/linux-2.6-dm · f6bcfd94
      Linus Torvalds authored
      * git://git.kernel.org/pub/scm/linux/kernel/git/agk/linux-2.6-dm: (32 commits)
        dm: raid456 basic support
        dm: per target unplug callback support
        dm: introduce target callbacks and congestion callback
        dm mpath: delay activate_path retry on SCSI_DH_RETRY
        dm: remove superfluous irq disablement in dm_request_fn
        dm log: use PTR_ERR value instead of ENOMEM
        dm snapshot: avoid storing private suspended state
        dm snapshot: persistent make metadata_wq multithreaded
        dm: use non reentrant workqueues if equivalent
        dm: convert workqueues to alloc_ordered
        dm stripe: switch from local workqueue to system_wq
        dm: dont use flush_scheduled_work
        dm snapshot: remove unused dm_snapshot queued_bios_work
        dm ioctl: suppress needless warning messages
        dm crypt: add loop aes iv generator
        dm crypt: add multi key capability
        dm crypt: add post iv call to iv generator
        dm crypt: use io thread for reads only if mempool exhausted
        dm crypt: scale to multiple cpus
        dm crypt: simplify compatible table output
        ...
    • Merge branch 'for-linus' of git://neil.brown.name/md · 509e4aef
      Linus Torvalds authored
      * 'for-linus' of git://neil.brown.name/md:
        md: Fix removal of extra drives when converting RAID6 to RAID5
        md: range check slot number when manually adding a spare.
        md/raid5: handle manually-added spares in start_reshape.
        md: fix sync_completed reporting for very large drives (>2TB)
        md: allow suspend_lo and suspend_hi to decrease as well as increase.
        md: Don't let implementation detail of curr_resync leak out through sysfs.
        md: separate meta and data devs
        md-new-param-to_sync_page_io
        md-new-param-to-calc_dev_sboffset
        md: Be more careful about clearing flags bit in ->recovery
        md: md_stop_writes requires mddev_lock.
        md/raid5: use sysfs_notify_dirent_safe to avoid NULL pointer
        md: Ensure no IO request to get md device before it is properly initialised.
        md: Fix single printks with multiple KERN_<level>s
        md: fix regression resulting in delays in clearing bits in a bitmap
        md: fix regression with re-adding devices to arrays with no metadata
    • Linus Torvalds's avatar
      375b6f5a
    • Revert "gpiolib: annotate gpio-intialization with __must_check" · d8a3515e
      Linus Torvalds authored
      This reverts commit 0fdae42d, which
      wasn't really supposed to go in, and causes lots of annoying warnings.
      
      Quoth Andrew:
        "Complete brainfart - I meant to drop that patch ages ago."
      
      Quoth Greg:
        "Ick, yeah, that patch isn't ok to go in as-is, all of the callers
         need to be fixed up first, which is what I thought we had agreed on..."

      Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
      Acked-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Greg KH <greg@kroah.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ecryptfs: fix broken build · 6254b32b
      Linus Torvalds authored
      Stephen Rothwell reports that the vfs merge broke the build of ecryptfs.
      The breakage comes from commit 66cb7666 ("sanitize ecryptfs
      ->mount()") which was obviously not even build tested. Tssk, tssk, Al.
      
      This is the minimal build fixup for the situation, although I don't have
      a filesystem to actually test it with.

      Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  2. 13 Jan, 2011 25 commits
    • [IA64] fix build error - arch/ia64/kernel/perfmon.c · 09579770
      Tony Luck authored
      arch/ia64/kernel/perfmon.c:621: error: duplicate 'static'
      
      Introduced by commit c74a1cbb
      
          pass default dentry_operations to mount_pseudo()

      Signed-off-by: Tony Luck <tony.luck@intel.com>
    • md: Fix removal of extra drives when converting RAID6 to RAID5 · bf2cb0da
      NeilBrown authored
      When a RAID6 is converted to a RAID5, the extra drive should
      be discarded.  However it isn't due to a typo in a comparison.
      
      This bug was introduced in commit e93f68a1 in 2.6.35-rc4,
      so this fix is suitable for any -stable since then.
      
      As the extra drive is not removed, the 'degraded' counter is wrong and
      so the RAID5 will not respond correctly to a subsequent failure.
      
      Cc: stable@kernel.org
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: range check slot number when manually adding a spare. · ba1b41b6
      NeilBrown authored
      When adding a spare to an active array, we should check the slot
      number, but allow it to be larger than raid_disks if a reshape
      is being prepared.
      
      Apply the same test when adding a device to an
      array-under-construction.  It already had most of the test in place,
      but not quite all.

      Signed-off-by: NeilBrown <neilb@suse.de>
    • md/raid5: handle manually-added spares in start_reshape. · 1a940fce
      NeilBrown authored
      It is possible to manually add spares to specific slots before
      starting a reshape.
      raid5_start_reshape should recognise this possibility and include
      it in the accounting.

      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: fix sync_completed reporting for very large drives (>2TB) · 13ae864b
      Rémi Rérolle authored
      The values exported in the sync_completed file are unsigned long, which
      overflows with very large drives, resulting in wrong values reported.
      
      Since sync_completed uses sectors as unit, we'll start getting wrong
      values with components larger than 2TB.
      
      This patch simply replaces the use of unsigned long by unsigned long long.
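The 2TB limit follows from simple arithmetic: a 32-bit counter of 512-byte sectors wraps at 2^32 * 512 bytes = 2 TiB. The sketch below models this in userspace C, assuming a platform where unsigned long is 32 bits wide and sectors are 512 bytes; the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define SECTOR_SIZE 512ULL

/* Sector count as it would be stored in a 32-bit unsigned long. */
uint32_t sectors_as_ulong32(uint64_t bytes)
{
    assert(bytes % SECTOR_SIZE == 0);
    return (uint32_t)(bytes / SECTOR_SIZE); /* wraps once the count reaches 2^32 */
}

/* Sector count stored in an unsigned long long (at least 64 bits). */
unsigned long long sectors_as_ull(uint64_t bytes)
{
    assert(bytes % SECTOR_SIZE == 0);
    return bytes / SECTOR_SIZE;
}
```

For a 3 TiB component, the 64-bit value is 6442450944 sectors, while the 32-bit value wraps to 2147483648, which is the kind of wrong figure sync_completed would have reported.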

      Signed-off-by: Rémi Rérolle <rrerolle@lacie.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: allow suspend_lo and suspend_hi to decrease as well as increase. · 23ddff37
      NeilBrown authored
      The sysfs attributes 'suspend_lo' and 'suspend_hi' describe a region
      to which reads/writes are suspended so that the underlying data can be
      manipulated without user-space noticing.
      Currently the window they describe can only move forwards along the
      device.  However this is an unnecessary restriction which will cause
      problems with planned developments.
      So relax this restriction and allow these endpoints to move
      arbitrarily.

      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: Don't let implementation detail of curr_resync leak out through sysfs. · 75d3da43
      NeilBrown authored
      mddev->curr_resync has artificial values of '1' and '2' which are used
      by the code which ensures only one resync is happening at a time on
      any given device.
      
      These values are internal and should never be exposed to user-space
      (except when translated appropriately as in the 'pending' status in
      /proc/mdstat).
      
      Unfortunately they are exposed, as ->curr_resync is assigned to
      ->curr_resync_completed and that value is directly visible through
      sysfs.
      
      So change the assignments to ->curr_resync_completed to get the same
      value from elsewhere in a form that doesn't have the magic '1' or '2'
      values.

      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: separate meta and data devs · a6ff7e08
      Jonathan Brassow authored
      Allow the metadata to be on a separate device from the
      data.
      
      This doesn't mean the data and metadata will be on separate
      physical devices - it simply gives device-mapper and userspace
      tools more flexibility.

      Signed-off-by: NeilBrown <neilb@suse.de>
    • md-new-param-to_sync_page_io · ccebd4c4
      Jonathan Brassow authored
      Add new parameter to 'sync_page_io'.
      
      The new parameter allows us to distinguish between metadata and data
      operations.  This becomes important later when we add the ability to
      use separate devices for data and metadata.

      Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
    • md-new-param-to-calc_dev_sboffset · 57b2caa3
      Jonathan Brassow authored
      When we allow for separate devices for data and metadata
      in a later patch, we will need to be able to calculate
      the superblock offset based on more than the bdev.

      Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
    • md: Be more careful about clearing flags bit in ->recovery · 7ebc0be7
      NeilBrown authored
      Setting ->recovery to 0 is generally not a good idea as it could clear
      bits that shouldn't be cleared.  In particular, MD_RECOVERY_FROZEN
      should only be cleared on explicit request from user-space.
      
      So when we need to clear things, just clear the bits that need
      clearing.
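The difference between assigning 0 and clearing selected bits can be shown with a small mask example. The bit numbers and `TOY_` names below are hypothetical; only the MD_RECOVERY_FROZEN name comes from the text, and this sketch does not reproduce the real flag layout.

```c
#include <assert.h>

/* Hypothetical bit numbers, chosen only for this illustration. */
enum {
    TOY_RECOVERY_RUNNING = 0,
    TOY_RECOVERY_SYNC    = 1,
    TOY_RECOVERY_DONE    = 2,
    TOY_RECOVERY_FROZEN  = 3, /* models MD_RECOVERY_FROZEN */
};

#define TOY_BIT(nr) (1UL << (nr))

/* Old style: wipes every flag, including ones owned by user-space. */
unsigned long toy_reap_clear_all(unsigned long recovery)
{
    (void)recovery;
    return 0;
}

/* New style: clear only the bits that describe the finished resync. */
unsigned long toy_reap_clear_bits(unsigned long recovery)
{
    const unsigned long done_mask = TOY_BIT(TOY_RECOVERY_RUNNING) |
                                    TOY_BIT(TOY_RECOVERY_SYNC) |
                                    TOY_BIT(TOY_RECOVERY_DONE);

    /* The mask must never cover the user-controlled FROZEN bit. */
    assert((done_mask & TOY_BIT(TOY_RECOVERY_FROZEN)) == 0);
    return recovery & ~done_mask;
}
```

Starting from RUNNING, SYNC and FROZEN set, the first helper returns 0 and silently unfreezes the array, while the second leaves only the FROZEN bit set.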
      
      As there are a few different places which reap a resync process - and
      some do an incomplete job - factor out the code for doing this from
      md_check_recovery and call that function instead of open coding part
      of it.

      Signed-off-by: NeilBrown <neilb@suse.de>
      Reported-by: Jonathan Brassow <jbrassow@redhat.com>
    • md: md_stop_writes requires mddev_lock. · defad61a
      NeilBrown authored
      As md_stop_writes manipulates the sync_thread and calls md_update_sb,
      it needs to be called with mddev_lock held.
      
      In all internal cases it is, but the symbol is exported for dm-raid to
      call and in that case the lock won't be held.
      So make an exported version which takes the lock, and an internal
      version which does not.
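The split into a locking wrapper and a lock-assuming internal function is a common pattern. Here is a minimal userspace sketch of it, assuming a toy flag in place of the real mddev_lock; all `toy_` names are hypothetical and the real functions do considerably more.

```c
#include <assert.h>
#include <stdbool.h>

struct toy_mddev {
    bool locked;         /* models mddev_lock being held */
    bool writes_stopped;
};

void toy_mddev_lock(struct toy_mddev *m)
{
    assert(!m->locked);
    m->locked = true;
}

void toy_mddev_unlock(struct toy_mddev *m)
{
    assert(m->locked);
    m->locked = false;
}

/* Internal version: the caller must already hold the lock. */
void toy_md_stop_writes_locked(struct toy_mddev *m)
{
    assert(m->locked);
    m->writes_stopped = true;
}

/* Exported version for outside callers such as dm-raid: takes the lock. */
void toy_md_stop_writes(struct toy_mddev *m)
{
    toy_mddev_lock(m);
    toy_md_stop_writes_locked(m);
    toy_mddev_unlock(m);
}
```

Internal callers that already hold the lock use the `_locked` variant directly, so the lock is never taken twice.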

      Signed-off-by: NeilBrown <neilb@suse.de>
    • md/raid5: use sysfs_notify_dirent_safe to avoid NULL pointer · 43c73ca4
      Jonathan Brassow authored
      With the module parameter 'start_dirty_degraded' set,
      raid5_spare_active() previously called sysfs_notify_dirent() with a NULL
      argument (rdev->sysfs_state) when a rebuild finished.

      Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    • md: Ensure no IO request to get md device before it is properly initialised. · 0ca69886
      NeilBrown authored
      When an md device is in the process of coming on line it is possible
      for an IO request (typically a partition table probe) to get through
      before the array is fully initialised, which can cause unexpected
      behaviour (e.g. a crash).
      
      So explicitly record when the array is ready for IO and don't allow IO
      through until then.
      
      There is no possibility for a similar problem when the array is going
      off-line as there must only be one 'open' at that time, and it is busy
      off-lining the array and so cannot send IO requests.  So no memory
      barrier is needed in md_stop()
      
      This has been a bug since commit 409c57f3 in 2.6.30 which
      introduced md_make_request.  Before then, each personality would
      register its own make_request_fn when it was ready.
      This is suitable for any stable kernel from 2.6.30.y onwards.
      
      Cc: <stable@kernel.org>
      Signed-off-by: NeilBrown <neilb@suse.de>
      Reported-by: "Hawrylewicz Czarnowski, Przemyslaw" <przemyslaw.hawrylewicz.czarnowski@intel.com>
    • Joe Perches's avatar
      067032bc
    • md: fix regression resulting in delays in clearing bits in a bitmap · 6c987910
      NeilBrown authored
      Commit 589a594b (2.6.37-rc4) fixed a problem where md_thread would
      sometimes call the ->run function at a bad time.
      
      If an error is detected during array start up after the md_thread has
      been started, the md_thread is killed.  This resulted in the ->run
      function being called once.  However the array may not be in a state
      that it is safe to call ->run.
      
      However, the fix imposed meant that ->run was not called on a timeout.
      This means that when an array goes idle, bitmap bits do not get
      cleared promptly.  While the array is busy the bits will still be
      cleared when appropriate so this is not very serious.  There is no
      risk to data.
      
      Change the test so that we only avoid calling ->run when the thread
      is being stopped.  This more explicitly addresses the problem situation.
      
      This is suitable for 2.6.37-stable and any -stable kernel to which
      589a594b was applied.
      
      Cc: stable@kernel.org
      Signed-off-by: NeilBrown <neilb@suse.de>
    • Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/egtvedt/avr32-2.6 · 2a86cb7c
      Linus Torvalds authored
      * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/egtvedt/avr32-2.6:
        avr32: update default configuration files for Atmel boards
        avr32: Convert to clocksource_register_hz
        avr32: make architecture sys_clone prototype match asm-generic prototype
        avr32: use syscall prototypes from asm-generic instead of arch
        avr32: disable kprobes for all default configurations
        avr32: boards: setup: use IS_ERR() instead of NULL check
    • NFS: Fix NFSv3 exclusive open semantics · 8a0eebf6
      Trond Myklebust authored
      Commit c0204fd2 (NFS: Clean up
      nfs4_proc_create()) broke NFSv3 exclusive open by removing the code
      that passes the O_EXCL flag down to nfs3_proc_create(). This patch
      reverts that offending hunk from the original commit.

      Reported-by: Nick Bowler <nbowler@elliptictech.com>
      Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
      Cc: stable@kernel.org    [2.6.37]
      Tested-by: Nick Bowler <nbowler@elliptictech.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • dm: raid456 basic support · 9d09e663
      NeilBrown authored
      This patch is the skeleton for the DM target that will be
      the bridge from DM to MD (initially RAID456 and later RAID1).  It
      provides a way to use device-mapper interfaces to the MD RAID456
      drivers.
      
      As with all device-mapper targets, the nominal public interfaces are the
      constructor (CTR) tables and the status outputs (both STATUSTYPE_INFO
      and STATUSTYPE_TABLE).  The CTR table looks like the following:
      
      1: <s> <l> raid \
      2:	<raid_type> <#raid_params> <raid_params> \
      3:	<#raid_devs> <meta_dev1> <dev1> .. <meta_devN> <devN>
      
      Line 1 contains the standard first three arguments to any device-mapper
      target - the start, length, and target type fields.  The target type in
      this case is "raid".
      
      Line 2 contains the arguments that define the particular raid
      type/personality/level, the required arguments for that raid type, and
      any optional arguments.  Possible raid types include: raid4, raid5_la,
      raid5_ls, raid5_rs, raid6_zr, raid6_nr, and raid6_nc.  (again, raid1 is
      planned for the future.)  The list of required and optional parameters
      is the same for all the current raid types.  The required parameters are
      positional, while the optional parameters are given as key/value pairs.
      The possible parameters are as follows:
       <chunk_size>		Chunk size in sectors.
       [[no]sync]		Force/Prevent RAID initialization
       [rebuild <idx>]	Rebuild the drive indicated by the index
       [daemon_sleep <ms>]	Time between bitmap daemon work to clear bits
       [min_recovery_rate <kB/sec/disk>]	Throttle RAID initialization
       [max_recovery_rate <kB/sec/disk>]	Throttle RAID initialization
       [max_write_behind <value>]		See '--write-behind=' (man mdadm)
       [stripe_cache <sectors>]		Stripe cache size for higher RAIDs
      
      Line 3 contains the list of devices that compose the array in
      metadata/data device pairs.  If the metadata is stored separately, a '-'
      is given for the metadata device position.  If a drive has failed or is
      missing at creation time, a '-' can be given for both the metadata and
      data drives for a given position.
      
      Examples:
      # RAID4 - 4 data drives, 1 parity
      # No metadata devices specified to hold superblock/bitmap info
      # Chunk size of 1MiB
      # (Lines separated for easy reading)
      0 1960893648 raid \
      	raid4 1 2048 \
      	5 - 8:17 - 8:33 - 8:49 - 8:65 - 8:81
      
      # RAID4 - 4 data drives, 1 parity (no metadata devices)
      # Chunk size of 1MiB, force RAID initialization,
      #	min recovery rate at 20 kiB/sec/disk
      0 1960893648 raid \
              raid4 4 2048 min_recovery_rate 20 sync\
              5 - 8:17 - 8:33 - 8:49 - 8:65 - 8:81
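
The table layout described above can be sketched as a small string builder. This is only an illustration: the function name build_raid_table and its arguments are hypothetical, and it assumes (as the two examples suggest) that <#raid_params> counts every word after it on line 2, including the chunk size.

```python
SECTOR_BYTES = 512  # device-mapper sectors are 512 bytes


def build_raid_table(start, length, raid_type, chunk_sectors, devices, optional=()):
    """Assemble a one-line CTR table for the "raid" target.

    devices:  list of (meta_dev, data_dev) pairs; "-" marks an absent device.
    optional: flat sequence of optional words,
              e.g. ("min_recovery_rate", 20, "sync").
    """
    # Assumption: the parameter count covers the chunk size plus every optional word.
    raid_params = [str(chunk_sectors)] + [str(word) for word in optional]
    dev_words = [word for pair in devices for word in pair]
    fields = [
        str(start), str(length), "raid",                  # line 1
        raid_type, str(len(raid_params)), *raid_params,   # line 2
        str(len(devices)), *dev_words,                    # line 3
    ]
    return " ".join(fields)


# Five data devices with no separate metadata devices, as in the examples.
devs = [("-", f"8:{minor}") for minor in (17, 33, 49, 65, 81)]
basic = build_raid_table(0, 1960893648, "raid4", 2048, devs)
tuned = build_raid_table(0, 1960893648, "raid4", 2048, devs,
                         optional=("min_recovery_rate", 20, "sync"))
chunk_bytes = 2048 * SECTOR_BYTES  # 2048 sectors is 1 MiB
```

The resulting strings are the two example tables joined onto a single line, which is the form that could be passed to 'dmsetup create --table'.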
      
      Performing a 'dmsetup table' should display the CTR table used to
      construct the mapping (with possible reordering of optional
      parameters).
      
      Performing a 'dmsetup status' will yield information on the state and
      health of the array.  The output is as follows:
      1: <s> <l> raid \
      2:	<raid_type> <#devices> <1 health char for each dev> <resync_ratio>
      
      Line 1 is standard DM output.  Line 2 is best shown by example:
      	0 1960893648 raid raid4 5 AAAAA 2/490221568
      Here we can see the RAID type is raid4, there are 5 devices - all of
      which are 'A'live, and the array is 2/490221568 complete with recovery.
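
A monitoring script could split that status line into fields along these lines. This is a minimal sketch that assumes exactly the seven whitespace-separated fields shown in the example; the names RaidStatus and parse_raid_status are hypothetical, and only the 'A' health character is interpreted because it is the only one described here.

```python
from typing import NamedTuple


class RaidStatus(NamedTuple):
    start: int
    length: int
    raid_type: str
    num_devices: int
    health: str        # one character per device
    resync_done: int
    resync_total: int


def parse_raid_status(line):
    """Parse '<s> <l> raid <raid_type> <#devices> <health chars> <resync_ratio>'."""
    start, length, target, raid_type, ndevs, health, ratio = line.split()
    if target != "raid":
        raise ValueError(f"not a raid target: {target!r}")
    if len(health) != int(ndevs):
        raise ValueError("expected one health character per device")
    done, total = ratio.split("/")
    return RaidStatus(int(start), int(length), raid_type, int(ndevs),
                      health, int(done), int(total))


status = parse_raid_status("0 1960893648 raid raid4 5 AAAAA 2/490221568")
all_alive = set(status.health) == {"A"}
```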
      
      Cc: linux-raid@vger.kernel.org
      Signed-off-by: NeilBrown <neilb@suse.de>
      Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
    • NeilBrown's avatar
      dm: per target unplug callback support · 99d03c14
      NeilBrown authored
      Add per-target unplug callback support.
      
      Cc: linux-raid@vger.kernel.org
      Signed-off-by: NeilBrown <neilb@suse.de>
      Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
    • NeilBrown's avatar
      dm: introduce target callbacks and congestion callback · 9d357b07
      NeilBrown authored
      DM currently implements congestion checking by checking on congestion
      in each component device.  For raid456 we need to also check if the
      stripe cache is congested.
      
      Add per-target congestion checker callback support.
      
      Extending the target_callbacks structure with additional callback
      functions allows for establishing multiple callbacks per-target (a
      callback is also needed for unplug).
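
The idea can be modelled in a few lines: the device counts as congested if any component device is congested or if the target's own callback reports congestion. This is a simplified user-space model, not the kernel code; the function name and the use of plain booleans and a zero-argument callable are assumptions made for illustration.

```python
def device_congested(component_devices, target_congested=None):
    """Model of a combined congestion check.

    component_devices: iterable of booleans, True if that device is congested.
    target_congested:  optional zero-argument callable standing in for a
                       per-target callback (e.g. "is the stripe cache full?").
    """
    # Existing behaviour: congestion in any component device counts.
    if any(component_devices):
        return True
    # Added behaviour: the target may report congestion of its own.
    if target_congested is not None and target_congested():
        return True
    return False


idle_devices = [False, False, False]
without_callback = device_congested(idle_devices)
with_full_cache = device_congested(idle_devices, target_congested=lambda: True)
```

With only the per-device check, idle component devices hide a full stripe cache; the callback lets the target surface that state.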
      
      Cc: linux-raid@vger.kernel.org
      Signed-off-by: NeilBrown <neilb@suse.de>
      Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
    • Chandra Seetharaman's avatar
      dm mpath: delay activate_path retry on SCSI_DH_RETRY · 4e2d19e4
      Chandra Seetharaman authored
      This patch adds a user-configurable 'pg_init_delay_msecs' feature.  Use
      this feature to specify the number of milliseconds to delay before
      retrying scsi_dh_activate, when SCSI_DH_RETRY is returned.
      
      SCSI Device Handlers return SCSI_DH_IMM_RETRY if activation can be
      retried immediately and SCSI_DH_RETRY in cases where it is better to
      retry after some delay.
      
      Currently we immediately retry scsi_dh_activate irrespective of
      SCSI_DH_IMM_RETRY and SCSI_DH_RETRY.
      
      The 'pg_init_delay_msecs' feature may be provided during table create or
      load, e.g.:
          dmsetup create --table "0 20971520 multipath 3 queue_if_no_path \
      	pg_init_delay_msecs 2500 ..." mpatha
      
      The default for 'pg_init_delay_msecs' is 2000 milliseconds.
      Maximum configurable delay is 60000 milliseconds.  Specifying a
      'pg_init_delay_msecs' of 0 will cause immediate retry.
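
The retry rules above can be summarised in a short sketch. The function name retry_delay_msecs is hypothetical and the return codes are represented as strings rather than the kernel constants; the default, the maximum, and the meaning of 0 are taken from the description above.

```python
DEFAULT_DELAY_MSECS = 2000   # default stated in the commit message
MAX_DELAY_MSECS = 60000      # maximum configurable delay


def retry_delay_msecs(result, configured=None):
    """Return how long to wait before retrying path activation.

    result:     "SCSI_DH_IMM_RETRY" or "SCSI_DH_RETRY" (strings stand in
                for the kernel return codes).
    configured: the pg_init_delay_msecs value, or None to use the default.
    """
    if configured is None:
        configured = DEFAULT_DELAY_MSECS
    if not 0 <= configured <= MAX_DELAY_MSECS:
        raise ValueError("pg_init_delay_msecs must be between 0 and 60000")
    if result == "SCSI_DH_IMM_RETRY":
        return 0            # the handler says an immediate retry is fine
    if result == "SCSI_DH_RETRY":
        return configured   # a configured value of 0 also means retry immediately
    raise ValueError(f"not a retry result: {result!r}")


default_delay = retry_delay_msecs("SCSI_DH_RETRY")
custom_delay = retry_delay_msecs("SCSI_DH_RETRY", configured=2500)
immediate = retry_delay_msecs("SCSI_DH_IMM_RETRY", configured=2500)
```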

      Signed-off-by: Nikanth Karthikesan <knikanth@suse.de>
      Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
      Acked-by: Mike Christie <michaelc@cs.wisc.edu>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
    • Kiyoshi Ueda's avatar
      dm: remove superfluous irq disablement in dm_request_fn · 052189a2
      Kiyoshi Ueda authored
      This patch changes spin_lock_irq() to spin_lock() in dm_request_fn().
      It is just a clean-up with no functional change.
      
      The spin_lock_irq() was leftover from the early request-based dm code,
      where map_request() used to enable interrupts.
      Since current map_request() never enables interrupts, we can change it
      to spin_lock() to match the prior spin_unlock().
      
      Auditing the dm and block-layer code called from map_request(), I
      confirmed that all functions save and restore interrupt status, so
      no function returns with interrupts enabled.
      Also, I have not observed any problems in my test environment, which
      uses the scsi and lpfc drivers, after heavy I/O testing with
      occasional path down/up.
      
      Added a BUG_ON() to detect future breakage.

      Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
      Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
    • Dan Carpenter's avatar
      dm log: use PTR_ERR value instead of ENOMEM · dbc883f1
      Dan Carpenter authored
      It's nicer to return the PTR_ERR() value instead of just returning
      -ENOMEM.  In the current code the PTR_ERR() value is always equal to
      -ENOMEM so this doesn't actually affect anything, but still...
      
      In addition, dm_dirty_log_create() doesn't check for a specific -ENOMEM
      return, so this change remains safe if a non-ENOMEM error is returned
      in the future.

      Signed-off-by: Dan Carpenter <error27@gmail.com>
      Acked-by: Jonathan Brassow <jbrassow@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
    • Mike Snitzer's avatar
      dm snapshot: avoid storing private suspended state · b83b2f29
      Mike Snitzer authored
      Use dm_suspended() rather than having each snapshot target maintain a
      private 'suspended' flag in struct dm_snapshot.

      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>