1. 28 Aug, 2013 3 commits
    • raid5: offload stripe handle to workqueue · 851c30c9
      Shaohua Li authored
      This is another attempt to create multiple threads to handle raid5 stripes.
      This time I use a workqueue.
      
      raid5 handles requests (especially writes) in stripe units. A stripe is one
      page in size, page-aligned, and spans all disks. For a write to any disk
      sector, raid5 runs a state machine for the corresponding stripe, which
      includes reading some disks of the stripe, calculating parity, and writing
      some disks of the stripe. The state machine currently runs in the raid5d
      thread; since there is only one such thread, it doesn't scale well for
      high-speed storage. The obvious solution is multi-threading.
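
      For reference, the parity step for a partial-stripe write is the classic
      read-modify-write update, P_new = P_old ^ D_old ^ D_new. A scalar sketch,
      for illustration only (the real raid5 code drives this through the
      xor/async_tx engines):

        #include <stddef.h>
        #include <stdint.h>

        /* RAID-5 read-modify-write parity: P_new = P_old ^ D_old ^ D_new */
        static void rmw_parity(uint8_t *p, const uint8_t *d_old,
                               const uint8_t *d_new, size_t len)
        {
                size_t i;

                for (i = 0; i < len; i++)
                        p[i] ^= d_old[i] ^ d_new[i];
        }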
      
      To get better performance, we have some requirements:
      a. Locality. A stripe corresponding to a request submitted from one CPU is
      best handled by a thread on the local CPU or the local node. The local CPU
      is preferred, but it can sometimes become a bottleneck, for example when
      parity calculation is too heavy; running on the local node adapts to a
      wider range of workloads.
      b. Configurability. Different raid5 array setups might need different
      configurations, especially the thread count. More threads don't always mean
      better performance, because of lock contention.
      
      My original implementation created several kernel threads, with interfaces
      to control which CPUs' stripes each thread handles, and let userspace set
      the threads' affinity. That provides the greatest flexibility and
      configurability, but it is hard to use, and a new thread-pool
      implementation is clearly disfavoured.
      
      Recent workqueue improvements are quite promising: an unbound workqueue is
      bound to a NUMA node, and if WQ_SYSFS is set there are sysfs options for
      affinity tuning, for example restricting affinity to one HT sibling. Work
      items are non-reentrant by default, and we can control the number of
      running threads by limiting the number of dispatched work_structs.
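
      A minimal sketch of creating such a workqueue (the exact flag set here is
      an assumption for illustration):

        #include <linux/workqueue.h>

        static struct workqueue_struct *raid5_wq;

        /* WQ_UNBOUND: workers are node-bound rather than per-cpu;
         * WQ_SYSFS: expose cpumask/max_active knobs under sysfs */
        raid5_wq = alloc_workqueue("raid5wq",
                                   WQ_UNBOUND | WQ_MEM_RECLAIM | WQ_SYSFS,
                                   0);  /* 0 = default max_active limit */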
      
      In this patch, I create several stripe worker groups; a group corresponds
      to a NUMA node. Stripes from the CPUs of one node are added to that node's
      group list, and the workqueue threads of that node handle only the stripes
      of that node's worker group. In this way stripe handling has NUMA-node
      locality and, as said above, we can control the thread count by limiting
      the number of dispatched work_structs.
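
      A rough sketch of the per-node grouping, continuing the names from the
      sketch above (struct and field names are assumptions for illustration):

        struct r5worker {
                struct work_struct work;
                struct r5worker_group *group;
        };

        struct r5worker_group {                 /* one per NUMA node */
                struct r5conf *conf;
                struct list_head handle_list;   /* stripes queued to this node */
                struct r5worker *workers;       /* this node's work_structs */
        };

        /* queue a stripe on its node's group, then dispatch a bounded
         * number of work_structs to the unbound workqueue */
        static void raid5_wakeup_stripe_thread(struct stripe_head *sh)
        {
                struct r5worker_group *group =
                        &sh->raid_conf->worker_groups[sh->group_num];

                list_add_tail(&sh->lru, &group->handle_list);
                queue_work_on(sh->cpu, raid5_wq, &group->workers[0].work);
        }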
      
      The work_struct callback function handles several stripes in one run.
      Typical workqueue usage runs one unit of work per work_struct; in the raid5
      case the unit would be a stripe. But we can't do that (a sketch of the
      batched callback follows this list):
      a. Although handling a stripe needs no lock, thanks to reference accounting
      and the stripe not being on any list, queuing one work_struct per stripe
      would contend the workqueue lock very heavily.
      b. blk_start_plug()/blk_finish_plug() should surround stripe handling,
      since we might dispatch requests. If each work_struct handled only one
      stripe, such block plugging would be meaningless.
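
      The batched callback, sketched with the illustrative names from above:

        static void raid5_do_work(struct work_struct *work)
        {
                struct r5worker *worker =
                        container_of(work, struct r5worker, work);
                struct r5worker_group *group = worker->group;
                struct r5conf *conf = group->conf;
                struct stripe_head *sh;
                struct blk_plug plug;

                /* one plug around the whole batch so the requests we
                 * dispatch for many stripes can be merged */
                blk_start_plug(&plug);
                spin_lock_irq(&conf->device_lock);
                while (!list_empty(&group->handle_list)) {
                        sh = list_first_entry(&group->handle_list,
                                              struct stripe_head, lru);
                        list_del_init(&sh->lru);
                        spin_unlock_irq(&conf->device_lock);

                        handle_stripe(sh);      /* run the state machine */

                        spin_lock_irq(&conf->device_lock);
                }
                spin_unlock_irq(&conf->device_lock);
                blk_finish_plug(&plug);
        }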
      
      This implementation can't do very fine-grained configuration, but NUMA
      binding is the most popular usage model and should be enough for most
      workloads.
      
      Note: since we have only one stripe queue, switching to multi-threading
      might reduce the size of the requests dispatched down to the lower layers.
      The impact depends on thread count, raid configuration and workload, so
      multi-threaded raid5 might not be right for every setup.
      
      Changes V1 -> V2:
      1. Remove WQ_NON_REENTRANT.
      2. Disable multi-threading by default.
      3. Add more description to the changelog.
      Signed-off-by: Shaohua Li <shli@fusionio.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
    • raid5: fix stripe release order · d265d9dc
      Shaohua Li authored
      patch "make release_stripe lockless" changes the order stripes are released.
      Originally I thought block layer can take care of request merge, but it appears
      there are still some requests not merged. It's easy to fix the order.
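
      A minimal sketch of the fix, assuming the lock-free release list from the
      patch below (names are illustrative): llist_del_all() hands entries back
      newest-first, so reverse the chain before releasing.

        struct llist_node *node = llist_del_all(&conf->released_stripes);
        struct llist_node *rev = NULL;

        /* relink the LIFO chain into submission (FIFO) order so the
         * block layer sees adjacent requests it can merge */
        while (node) {
                struct llist_node *next = node->next;

                node->next = rev;
                rev = node;
                node = next;
        }
        /* now release each stripe on 'rev' in the original order */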
      Signed-off-by: Shaohua Li <shli@fusionio.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
    • raid5: make release_stripe lockless · 773ca82f
      Shaohua Li authored
      release_stripe still has big lock contention. We now just add the stripe to
      an llist without taking device_lock, and let the raid5d thread do the real
      stripe release, which must hold device_lock anyway. This way release_stripe
      itself doesn't take any locks.
      
      The side effect is that the order in which stripes are released changes.
      But that sounds like no big deal: stripes are never handled in order
      anyway, and I thought the block layer could already do nice request
      merging, which would mean the order isn't that important.
      
      I kept the unplug release batch. From a lock-contention point of view it is
      unnecessary with this patch, and if we deleted it the stripe_head
      release_list and lru could actually share storage; but the unplug release
      batch is also helpful for request merging. We could probably delay waking
      raid5d until unplug, but I'm still wary of the case where raid5d is already
      running.
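
      A minimal sketch of the two halves (field names are assumptions for
      illustration):

        /* release_stripe() fast path: lock-free push, no device_lock */
        if (llist_add(&sh->release_list, &conf->released_stripes))
                /* llist_add() returns true if the list was empty */
                md_wakeup_thread(conf->mddev->thread);

        /* raid5d: the real release still runs under device_lock */
        struct llist_node *head;

        spin_lock_irq(&conf->device_lock);
        head = llist_del_all(&conf->released_stripes);
        while (head) {
                struct stripe_head *sh = llist_entry(head,
                                struct stripe_head, release_list);

                head = head->next;
                __release_stripe(conf, sh);
        }
        spin_unlock_irq(&conf->device_lock);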
      Signed-off-by: Shaohua Li <shli@fusionio.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
  2. 27 Aug, 2013 7 commits
    • md: avoid deadlock when dirty buffers during md_stop. · 260fa034
      NeilBrown authored
      When the last process closes /dev/mdX, sync_blockdev() is called so that
      all buffers get flushed. So if the device is then opened for the STOP_ARRAY
      ioctl to be sent, there will be nothing to flush.
      
      However, if we open /dev/mdX in order to send the STOP_ARRAY ioctl just
      moments before some other process that was writing closes its file
      descriptor, then there won't be a 'last close' and the buffers might not
      get flushed.
      
      So do_md_stop() calls sync_blockdev(). However, at this point it is holding
      ->reconfig_mutex. So if the array is currently 'clean', the writes from
      sync_blockdev() will not complete until the array can be marked dirty, and
      that won't happen until some other thread can get ->reconfig_mutex. So we
      deadlock.
      
      We need to move the sync_blockdev() call to before we take
      ->reconfig_mutex. However, some other thread could then open /dev/mdX and
      write to it after we call sync_blockdev() and before we actually stop the
      array. This can leave dirty data in the page cache, which is awkward.
      
      So introduce a new flag, MD_STILL_CLOSED. Set it before calling
      sync_blockdev(), clear it if anyone does open the file, and abort the
      STOP_ARRAY attempt if it has been cleared before we lock against further
      opens.
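
      A sketch of the resulting flow (surrounding details are illustrative):

        /* md_ioctl(STOP_ARRAY): flush before taking ->reconfig_mutex */
        set_bit(MD_STILL_CLOSED, &mddev->flags);
        sync_blockdev(bdev);            /* no locks held: cannot deadlock */
        mddev_lock(mddev);

        /* md_open(): any opener invalidates the flushed state */
        clear_bit(MD_STILL_CLOSED, &mddev->flags);

        /* do_md_stop(): under the mutex, before blocking further opens */
        if (!test_bit(MD_STILL_CLOSED, &mddev->flags))
                return -EBUSY;          /* device was opened: abort STOP */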
      
      It is still possible to get problems if you open /dev/mdX, write to
      it, then issue the STOP_ARRAY ioctl.  Just don't do that.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: Don't test all of mddev->flags at once. · 7a0a5355
      NeilBrown authored
      mddev->flags is mostly used to record if an update of the
      metadata is needed.  Sometimes the whole field is tested
      instead of just the important bits.  This makes it difficult
      to introduce more state bits.
      
      So replace all bare tests of mddev->flags with tests for the bits
      that actually need testing.
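
      The shape of the change, sketched (MD_UPDATE_SB_FLAGS matches the patch;
      the exact bit composition here is an approximation from md.h):

        /* before: fires on *any* set bit, even unrelated state bits */
        if (mddev->flags)
                md_update_sb(mddev, 0);

        /* after: test only the bits that mean "superblock needs writing" */
        #define MD_UPDATE_SB_FLAGS      (BIT(MD_CHANGE_DEVS) | \
                                         BIT(MD_CHANGE_CLEAN) | \
                                         BIT(MD_CHANGE_PENDING))

        if (mddev->flags & MD_UPDATE_SB_FLAGS)
                md_update_sb(mddev, 0);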
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: Fix apparent cut-and-paste error in super_90_validate · c9ad020f
      Dave Jones authored
      Setting a variable to itself probably wasn't the intention here.
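
      The pattern being fixed, sketched with illustrative field names (not
      necessarily the exact line in super_90_validate):

        /* a copy-and-paste leftover: assigning a field to itself is a no-op */
        mddev->bitmap_info.space = mddev->bitmap_info.space;

        /* presumably intended: copy from the corresponding default */
        mddev->bitmap_info.space = mddev->bitmap_info.default_space;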
      Signed-off-by: Dave Jones <davej@fedoraproject.org>
      Signed-off-by: NeilBrown <neilb@suse.de>
    • raid6/test: replace echo -e with printf · c28399b5
      Max Filippov authored
      -e is a non-standard echo option, and echo's output is
      implementation-defined when it is used. Replace echo -e with printf, as
      suggested by the POSIX echo specification.
      
      Cc: NeilBrown <neilb@suse.de>
      Cc: Jim Kukunas <james.t.kukunas@linux.intel.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Yuanhan Liu <yuanhan.liu@linux.intel.com>
      Acked-by: H. Peter Anvin <hpa@zytor.com>
      Signed-off-by: Max Filippov <jcmvbkbc@gmail.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
    • RAID: add tilegx SIMD implementation of raid6 · ae77cbc1
      Ken Steele authored
      This change adds TILE-Gx SIMD instructions to the software raid (md),
      modeling the Altivec implementation. This covers only syndrome generation;
      there is more that could be done to improve recovery, as in the recent
      Intel SSSE3 recovery implementation.
      
      The code unrolls 8 times; this turns out to be the best on tilegx hardware
      among the set 1, 2, 4, 8 and 16. The code reads one cache line of data from
      each disk, stores P and Q, then moves to the next cache line.
      
      The test code in sys/linux/lib/raid6/test reports a data read rate of
      2008 MB/s for syndrome generation using 18 disks (16 data and 2 parity); it
      was 1512 MB/s before this SIMD optimization. This is running on one core
      with all the data in cache.
      
      This is based on the paper The Mathematics of RAID-6
      (http://kernel.org/pub/linux/kernel/people/hpa/raid6.pdf).
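
      For reference, the syndrome being generated is P = sum of d_i and
      Q = sum of g^i * d_i over GF(2^8), per the paper, evaluated by Horner's
      rule from the highest-numbered disk down. A scalar sketch of the per-byte
      computation (illustrative; the TILE-Gx code vectorizes this across a whole
      cache line):

        #include <stdint.h>

        /* multiply a GF(2^8) element by g = 2 (polynomial 0x11d) */
        static inline uint8_t gf_mul2(uint8_t v)
        {
                return (uint8_t)((v << 1) ^ ((v & 0x80) ? 0x1d : 0));
        }

        /* P/Q for one byte column across data disks d[0..n-1] */
        static void raid6_syndrome_byte(int n, const uint8_t *d,
                                        uint8_t *p, uint8_t *q)
        {
                uint8_t wp = d[n - 1];  /* start from the highest disk */
                uint8_t wq = d[n - 1];
                int z;

                for (z = n - 2; z >= 0; z--) {
                        wp ^= d[z];                     /* P: plain XOR  */
                        wq = gf_mul2(wq) ^ d[z];        /* Q: Horner step */
                }
                *p = wp;
                *q = wq;
        }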
      Signed-off-by: Ken Steele <ken@tilera.com>
      Signed-off-by: Chris Metcalf <cmetcalf@tilera.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: fix safe_mode buglet. · 275c51c4
      NeilBrown authored
      When we set the safe_mode timeout to a smaller value, we trigger a timeout
      immediately; otherwise the smaller value might not be honoured. However, if
      the previous timeout was 0, meaning "no timeout", we didn't. That would
      mean no timeout happens until the next write completes, which could be a
      long time.
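
      A sketch of the fix in the safe_mode_delay store path (the surrounding
      code is an assumption):

        /* trigger the timeout now if the window shrank, including the
         * previously-missed case of "no timeout at all" (old delay 0) */
        if (mddev->safemode_delay < old_delay || old_delay == 0)
                md_safemode_timeout((unsigned long)mddev);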
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: don't call md_allow_write in get_bitmap_file. · 60559da4
      NeilBrown authored
      There is no real need: GFP_NOIO is very likely sufficient, and failure is
      not catastrophic.
      
      Calling md_allow_write() here will convert a read-auto array to
      read/write, which could be confusing when you are just performing a read
      operation.
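
      A sketch of the allocation in get_bitmap_file() without md_allow_write()
      (details are illustrative):

        /* GFP_NOIO cannot recurse into I/O on this array, so there is
         * no need to force the array read/write first */
        file = kzalloc(sizeof(*file), GFP_NOIO);
        if (!file)
                return -ENOMEM;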
      Signed-off-by: NeilBrown <neilb@suse.de>