  1. 01 Aug, 2012 1 commit
  2. 31 Jul, 2012 13 commits
    • md/RAID1: Add missing case for attempting to repair known bad blocks. · d57368af
      Alexander Lyakas authored
      When doing resync or repair, attempt to correct bad blocks according
      to the WriteErrorSeen policy.
      Signed-off-by: Alex Lyakas <alex.bolshoy@gmail.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md/raid5: For odirect-write performance, do not set STRIPE_PREREAD_ACTIVE. · 895e3c5c
      majianpeng authored
      'sync' writes set both REQ_SYNC and REQ_NOIDLE.
      O_DIRECT writes set REQ_SYNC but not REQ_NOIDLE.
      
      We currently assume that a REQ_SYNC request will not be followed by
      more requests and so set STRIPE_PREREAD_ACTIVE to expedite the
      request.
      This is appropriate for sync requests, but not for O_DIRECT requests.
      
      So make the setting of STRIPE_PREREAD_ACTIVE conditional on REQ_NOIDLE
      rather than REQ_SYNC.  This is consistent with the documented meaning
      of REQ_NOIDLE:
      
              __REQ_NOIDLE,           /* don't anticipate more IO after this one */
      Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
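
The change of condition described in commit 895e3c5c can be sketched as two small predicates. This is only an illustration: the flag names and bit values below are hypothetical stand-ins, not the kernel's definitions, and the real decision is made inside the raid5 request path rather than in standalone helpers.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag bits for illustration only; the real REQ_SYNC and
 * REQ_NOIDLE values are defined by the kernel block layer. */
enum {
    SKETCH_REQ_SYNC   = 1 << 0,
    SKETCH_REQ_NOIDLE = 1 << 1
};

/* Old rule: every REQ_SYNC request expedites the stripe. */
static bool expedite_old(unsigned int rw_flags)
{
    return (rw_flags & SKETCH_REQ_SYNC) != 0;
}

/* New rule: only requests that say "don't anticipate more IO after
 * this one" expedite the stripe. */
static bool expedite_new(unsigned int rw_flags)
{
    return (rw_flags & SKETCH_REQ_NOIDLE) != 0;
}
```

Under this sketch an O_DIRECT-style write (sync bit only) is expedited by the old rule but not by the new one, while a 'sync' write (both bits) is expedited by both.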
    • md/raid1: don't abort a resync on the first badblock. · b7219ccb
      NeilBrown authored
      If a resync of a RAID1 array with 2 devices finds a known bad block
      on one device, it will neither read from nor write to that device for
      this block offset.
      So there will be one read_target (the other device) and zero write
      targets.
      This condition causes md/raid1 to abort the resync, assuming that it
      has finished; without known bad blocks this would be true.
      
      When there are no write targets because of the presence of bad blocks
      we should only skip over the area covered by the bad block.
      RAID10 already gets this right, raid1 doesn't.  Or didn't.
      
      As this can cause a 'sync' to abort early and appear to have succeeded,
      it could lead to some data corruption, so it is suitable for -stable.
      
      Cc: stable@vger.kernel.org
      Reported-by: Alexander Lyakas <alex.bolshoy@gmail.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
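
The distinction drawn in commit b7219ccb (no write targets because the resync is done, versus no write targets because of bad blocks) can be sketched as follows. The type and function names are hypothetical and the logic is a simplified model of the description above, not the raid1 code itself.

```c
#include <assert.h>

/* Hypothetical sector type for this sketch. */
typedef unsigned long long sector_sketch_t;

struct resync_step {
    int finished;            /* 1: nothing left to do, end the resync */
    sector_sketch_t skip;    /* sectors to step over without doing I/O */
};

/* Decide what to do for one resync window, given how many devices can
 * be written and how many sectors at this offset are known bad. */
static struct resync_step
resync_step_sketch(int write_targets, sector_sketch_t bad_sectors)
{
    struct resync_step s = { 0, 0 };

    if (write_targets > 0)
        return s;                /* normal case: issue the sync I/O */
    if (bad_sectors > 0) {
        s.skip = bad_sectors;    /* no targets only because of bad blocks */
        return s;
    }
    s.finished = 1;              /* no targets and no bad blocks: done */
    return s;
}
```

The old behaviour corresponds to treating every "zero write targets" window as finished; the sketch only ends the resync when no bad blocks explain the missing targets.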
    • md: remove duplicated test on ->openers when calling do_md_stop() · 90cf195d
      NeilBrown authored
      do_md_stop tests mddev->openers while holding ->open_mutex,
      and fails if this count is too high.
      So callers do not need to check mddev->openers and doing so isn't
      very meaningful as they don't hold ->open_mutex so the number could
      change.
      
      So remove the unnecessary tests on mddev->openers.
      These are not called often enough for there to be any gain in
      an early test on ->openers to avoid the need for a slightly more
      costly mutex_lock call.
      Signed-off-by: NeilBrown <neilb@suse.de>
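
The reasoning in commit 90cf195d (a count is only meaningful when checked under the lock that protects it) can be illustrated in user space. This is a sketch with a POSIX mutex and hypothetical names; the `max_openers` parameter is an assumption standing in for whatever limit the real do_md_stop() applies.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>

/* Hypothetical stand-in for the relevant part of an md device. */
struct mddev_sketch {
    pthread_mutex_t open_mutex;  /* protects openers */
    int openers;                 /* current number of openers */
    int running;                 /* 1 while the array is active */
};

/* Stop the device unless too many openers remain. The check is made
 * while holding open_mutex, so callers need no unlocked pre-check. */
static int stop_sketch(struct mddev_sketch *m, int max_openers)
{
    int err = 0;

    pthread_mutex_lock(&m->open_mutex);
    if (m->openers > max_openers)
        err = -EBUSY;            /* count is too high: refuse to stop */
    else
        m->running = 0;
    pthread_mutex_unlock(&m->open_mutex);
    return err;
}
```

A caller that tested `openers` before calling `stop_sketch()` would be reading a value that can change before the lock is taken, which is why the duplicated test adds nothing.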
    • raid5: Add R5_ReadNoMerge flag which prevent bio from merging at block layer · 3f9e7c14
      majianpeng authored
      Because bios can be merged at the block layer, a bio error may be caused
      by another bio that was merged into the same request.
      With this flag set, the exact error sector can be found without redundant
      operations such as re-write and re-read.

      V0->V1: Use REQ_FLUSH instead of REQ_NOMERGE to avoid bio merging at the
      block layer.
      Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md/raid1: prevent merging too large request · 12cee5a8
      Shaohua Li authored
      For an SSD, once the request size exceeds a specific value (the optimal io
      size), making requests bigger no longer improves bandwidth. In that case,
      if a bigger request leaves some disks idle, the total throughput actually
      drops. A good example is readahead in a two-disk raid1 setup.

      So when should we split big requests? We absolutely don't want to split a
      big request into very small requests. Even on an SSD, a big request
      transfer is more efficient. This patch only considers requests with a size
      above the optimal io size.
      
      If all disks are busy, is it worth doing a split? Say optimal io size is 16k,
      two requests 32k and two disks. We can let each disk run one 32k request, or
      split the requests to 4 16k requests and each disk runs two. It's hard to say
      which case is better, depending on hardware.
      
      So only consider the case where there are idle disks. For readahead, a
      split is always better in this case. In my test, the patch below improves
      throughput by more than 30%. Hmm, not 100%, because the disk isn't 100% busy.
      
      This case can happen outside readahead too, for example in directio. But I
      suppose directio usually has a bigger IO depth and keeps all disks busy, so
      I ignored it.

      Note: if the raid uses any hard disk, we don't prevent merging, because
      that would make performance worse.
      Signed-off-by: Shaohua Li <shli@fusionio.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
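
The conditions listed in commit 12cee5a8 (all-SSD array, request above the optimal io size, at least one idle disk) can be combined into a single predicate. The structure and field names below are hypothetical; this is a sketch of the stated policy, not the raid1 merge callback.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical summary of the array state used by the policy. */
struct array_sketch {
    bool all_ssd;               /* false if any member is a hard disk */
    int idle_disks;             /* members with no I/O in flight */
    unsigned int opt_io_bytes;  /* optimal io size, 0 if unknown */
};

/* Return true when growing a request to req_bytes should be refused
 * so that the extra data can go to an idle disk instead. */
static bool should_prevent_merge(const struct array_sketch *a,
                                 unsigned int req_bytes)
{
    if (!a->all_ssd)
        return false;   /* any hard disk: never prevent merging */
    if (a->opt_io_bytes == 0 || req_bytes <= a->opt_io_bytes)
        return false;   /* only requests above the optimal io size */
    return a->idle_disks > 0;   /* only when a disk is actually idle */
}
```

With a 16k optimal io size, a 32k request is held back only while some disk is idle; once every disk is busy the sketch leaves merging alone, matching the "hard to say which case is better" paragraph above.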
    • md/raid1: read balance chooses idlest disk for SSD · 9dedf603
      Shaohua Li authored
      An SSD has no spindle, so the distance between requests means nothing. The
      original distance-based algorithm can sometimes cause severe performance
      issues for SSD raid.

      Consider two thread groups, one accessing file A and the other accessing
      file B. The first group will access one disk and the second will access the
      other disk, because requests are near within a group and far between groups.
      In this case, read balance might keep one disk very busy but the other
      relatively idle. For SSD, we should try our best to distribute requests to
      as many disks as possible. There isn't a spindle move penalty anyway.
      
      With the patch below, I sometimes see more than 50% throughput
      improvement, depending on the workload.

      The only exception is small requests that can be merged into a big request,
      which typically drives higher throughput for SSD too. Such small requests
      are sequential reads. Unlike on a hard disk, a sequential read that can't be
      merged (for example direct IO, or a read without readahead) can be ignored
      for SSD. Again, there is no spindle move penalty. Readahead dispatches small
      requests, and such requests can be merged.

      The previous patch helps detect sequential reads well, at least if the
      number of concurrent reads isn't greater than the number of raid disks. In
      that case, the distance-based algorithm doesn't work well either.
      
      V2: For a raid that mixes hard disks and SSDs, don't use the distance-based
      algorithm for random IO either. This makes the algorithm generic for any
      raid with an SSD.
      Signed-off-by: Shaohua Li <shli@fusionio.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
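
The two selection rules contrasted in commit 9dedf603 can be sketched side by side. All names are hypothetical, and the sketch deliberately leaves out the sequential-read exception described above; it only shows "fewest pending requests" versus "shortest head distance".

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-disk state for the sketch. */
struct disk_sketch {
    unsigned int pending;         /* requests currently in flight */
    unsigned long long head_pos;  /* last sector accessed */
};

static unsigned long long dist(unsigned long long a, unsigned long long b)
{
    return a > b ? a - b : b - a;
}

/* Pick a disk for a read at 'sector'. If the array contains an SSD,
 * choose the disk with the fewest pending requests; otherwise choose
 * the disk whose head is closest. Ties go to the lowest index. */
static int pick_read_disk(const struct disk_sketch *d, int n,
                          bool has_ssd, unsigned long long sector)
{
    int best = 0;

    assert(n > 0);
    for (int i = 1; i < n; i++) {
        if (has_ssd) {
            if (d[i].pending < d[best].pending)
                best = i;
        } else if (dist(d[i].head_pos, sector) <
                   dist(d[best].head_pos, sector)) {
            best = i;
        }
    }
    return best;
}
```

In the two-thread-group scenario above, the distance rule keeps sending a group's reads to the disk it last used, while the pending-count rule moves work to whichever disk is idler.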
    • md/raid1: make sequential read detection per disk based · be4d3280
      Shaohua Li authored
      Currently the sequential read detection is global. It's natural to make it
      per disk, which improves the detection of multiple concurrent sequential
      reads. The next patch will make SSD read balance not use the distance-based
      algorithm, where this change helps detect truly sequential reads for SSD.
      Signed-off-by: Shaohua Li <shli@fusionio.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
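
The benefit of per-disk tracking described in commit be4d3280 can be shown with a minimal model in which each disk remembers where its last read ended. The structure and helper names are hypothetical and only illustrate the idea.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-disk record: the sector right after the last read. */
struct disk_seq_sketch {
    unsigned long long next_seq_sect;
};

/* A read is sequential for this disk if it starts exactly where the
 * previous read on the same disk ended. */
static bool is_sequential(const struct disk_seq_sketch *d,
                          unsigned long long sector)
{
    return d->next_seq_sect == sector;
}

/* Remember where this read ends so the next one can be classified. */
static void record_read(struct disk_seq_sketch *d,
                        unsigned long long sector,
                        unsigned int nr_sectors)
{
    d->next_seq_sect = sector + nr_sectors;
}
```

With a single global record, two interleaved sequential streams would overwrite each other's position and both would look random; with one record per disk, each stream stays recognisable as long as it keeps hitting the same disk.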
    • MD RAID10: Export md_raid10_congested · cc4d1efd
      Jonathan Brassow authored
      md/raid10: Export is_congested test.
      
      In similar fashion to commits
      	11d8a6e3
      	1ed7242e
      we export the RAID10 congestion checking function so that dm-raid.c can
      make use of it and make use of the personality.  The 'queue' and 'gendisk'
      structures will not be available to the MD code when device-mapper sets
      up the device, so we conditionalize access to these fields also.
      Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
    • MD: Move macros from raid1*.h to raid1*.c · 473e87ce
      Jonathan Brassow authored
      MD RAID1/RAID10: Move some macros from .h file to .c file
      
      There are three macros (IO_BLOCKED, IO_MADE_GOOD, BIO_SPECIAL) which are defined
      in both raid1.h and raid10.h.  They are only used in their respective .c files.
      However, if we wish to make RAID10 accessible to the device-mapper RAID
      target (dm-raid.c), then we need to move these macros into the .c files where
      they are used so that they do not conflict with each other.
      
      The macros from the two files are identical and could be moved into md.h, but
      I chose to leave the duplication and have them remain in the personality
      files.
      Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
    • MD RAID1: rename mirror_info structure · 0eaf822c
      Jonathan Brassow authored
      MD RAID1: Rename the structure 'mirror_info' to 'raid1_info'
      
      The same structure name ('mirror_info') is used by raid10.  Each of these
      structures is defined in its respective header file.  If dm-raid is
      to support both RAID1 and RAID10, the header files will be included and
      the structure names must not collide.  While only one of these structure
      names needs to change, this patch adds consistency to the naming of the
      structure.
      Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
    • MD RAID10: rename mirror_info structure · dc280d98
      Jonathan Brassow authored
      MD RAID10: Rename the structure 'mirror_info' to 'raid10_info'
      
      The same structure name ('mirror_info') is used by raid1.  Each of these
      structures is defined in its respective header file.  If dm-raid is
      to support both RAID1 and RAID10, the header files will be included and
      the structure names must not collide.
      Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
    • MD RAID10: Fix compiler warning. · 3bbae04b
      Jonathan Brassow authored
      MD RAID10:  Fix compiler warning.
      
      Initialize variable to prevent compiler warning.
      Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
  3. 27 Jul, 2012 22 commits
  4. 25 Jul, 2012 4 commits
    • Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi · bdc0077a
      Linus Torvalds authored
      Pull first round of SCSI updates from James Bottomley:
       "The most important feature of this patch set is the new async
        infrastructure that makes sure async_synchronize_full() synchronizes
        all domains and allows us to remove all the hacks (like having
        scsi_complete_async_scans() in the device base code) and means that
        the async infrastructure will "just work" in future.
      
        The rest is assorted driver updates (aacraid, bnx2fc, virtio-scsi,
        megaraid, bfa, lpfc, qla2xxx, qla4xxx) plus a lot of infrastructure
        work in sas and FC.
      
        Signed-off-by: James Bottomley <JBottomley@Parallels.com>"
      
      * tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (97 commits)
        [SCSI] Revert "[SCSI] fix async probe regression"
        [SCSI] cleanup usages of scsi_complete_async_scans
        [SCSI] queue async scan work to an async_schedule domain
        [SCSI] async: make async_synchronize_full() flush all work regardless of domain
        [SCSI] async: introduce 'async_domain' type
        [SCSI] bfa: Fix to set correct return error codes and misc cleanup.
        [SCSI] aacraid: Series 7 Async. (performance) mode support
        [SCSI] aha152x: Allow use on 64bit systems
        [SCSI] virtio-scsi: Add vdrv->scan for post VIRTIO_CONFIG_S_DRIVER_OK LUN scanning
        [SCSI] bfa: squelch lockdep complaint with a spin_lock_init
        [SCSI] qla2xxx: remove unnecessary reads of PCI_CAP_ID_EXP
        [SCSI] qla4xxx: remove unnecessary read of PCI_CAP_ID_EXP
        [SCSI] ufs: fix incorrect return value about SUCCESS and FAILED
        [SCSI] ufs: reverse the ufshcd_is_device_present logic
        [SCSI] ufs: use module_pci_driver
        [SCSI] usb-storage: update usb devices for write cache quirk in quirk list.
        [SCSI] usb-storage: add support for write cache quirk
        [SCSI] set to WCE if usb cache quirk is present.
        [SCSI] virtio-scsi: hotplug support for virtio-scsi
        [SCSI] virtio-scsi: split scatterlist per target
        ...
    • Merge git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-3.0-nmw · 801b0365
      Linus Torvalds authored
      Pull GFS2 updates from Steven Whitehouse.
      
      * git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-3.0-nmw:
        GFS2: Eliminate 64-bit divides
        GFS2: Reduce file fragmentation
        GFS2: kernel panic with small gfs2 filesystems - 1 RG
        GFS2: Fixing double brelse'ing bh allocated in gfs2_meta_read when EIO occurs
        GFS2: Combine functions get_local_rgrp and gfs2_inplace_reserve
        GFS2: Add kobject release method
        GFS2: Size seq_file buffer more carefully
        GFS2: Use seq_vprintf for glocks debugfs file
        seq_file: Add seq_vprintf function and export it
        GFS2: Use lvbs for storing rgrp information with mount option
        GFS2: Cache last hash bucket for glock seq_files
        GFS2: Increase buffer size for glocks and glstats debugfs files
        GFS2: Fix error handling when reading an invalid block from the journal
        GFS2: Add "top dir" flag support
        GFS2: Fold quota data into the reservations struct
        GFS2: Extend the life of the reservations
    • Merge branch 'for-3.6' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup · 614a6d43
      Linus Torvalds authored
      Pull cgroup changes from Tejun Heo:
       "Nothing too interesting.  A minor bug fix and some cleanups."
      
      * 'for-3.6' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
        cgroup: Update remount documentation
        cgroup: cgroup_rm_files() was calling simple_unlink() with the wrong inode
        cgroup: Remove populate() documentation
        cgroup: remove hierarchy_mutex
    • Merge branch 'for-3.6' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq · a08489c5
      Linus Torvalds authored
      Pull workqueue changes from Tejun Heo:
       "There are three major changes.
      
         - WQ_HIGHPRI has been reimplemented so that high priority work items
           are served by worker threads with -20 nice value from dedicated
           highpri worker pools.
      
         - CPU hotplug support has been reimplemented such that idle workers
           are kept across CPU hotplug events.  This makes CPU hotplug cheaper
           (for PM) and makes the code simpler.
      
         - flush_kthread_work() has been reimplemented so that a work item can
           be freed while executing.  This removes an annoying behavior
           difference between kthread_worker and workqueue."
      
      * 'for-3.6' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
        workqueue: fix spurious CPU locality WARN from process_one_work()
        kthread_worker: reimplement flush_kthread_work() to allow freeing the work item being executed
        kthread_worker: reorganize to prepare for flush_kthread_work() reimplementation
        workqueue: simplify CPU hotplug code
        workqueue: remove CPU offline trustee
        workqueue: don't butcher idle workers on an offline CPU
        workqueue: reimplement CPU online rebinding to handle idle workers
        workqueue: drop @bind from create_worker()
        workqueue: use mutex for global_cwq manager exclusion
        workqueue: ROGUE workers are UNBOUND workers
        workqueue: drop CPU_DYING notifier operation
        workqueue: perform cpu down operations from low priority cpu_notifier()
        workqueue: reimplement WQ_HIGHPRI using a separate worker_pool
        workqueue: introduce NR_WORKER_POOLS and for_each_worker_pool()
        workqueue: separate out worker_pool flags
        workqueue: use @pool instead of @gcwq or @cpu where applicable
        workqueue: factor out worker_pool from global_cwq
        workqueue: don't use WQ_HIGHPRI for unbound workqueues