1. 23 Aug, 2017 22 commits
    • Christoph Hellwig's avatar
      block: replace bi_bdev with a gendisk pointer and partitions index · 74d46992
      Christoph Hellwig authored
      This way we don't need a block_device structure to submit I/O.  The
      block_device has different lifetime rules from the gendisk and
      request_queue and is usually only available when the block device node
      is open.  Other callers need to explicitly create one (e.g. the lightnvm
      passthrough code, or the new nvme multipathing code).
      
      For the actual I/O path all that we need is the gendisk, which exists
      once per block device.  But given that the block layer also does
      partition remapping we additionally need a partition index, which is
      used for said remapping in generic_make_request.
      
      Note that all the block drivers generally want request_queue or
      sometimes the gendisk, so this removes a layer of indirection all
      over the stack.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      74d46992
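The remapping described in this commit can be modeled in a few lines. This is a minimal userspace sketch, not the kernel code: the `Partition`, `Disk`, `Bio`, and `remap` names are hypothetical, and it assumes partition index 0 means "the whole disk".

```python
from dataclasses import dataclass, field


@dataclass
class Partition:
    start_sect: int  # first sector of the partition on the whole disk
    nr_sects: int    # partition length in sectors


@dataclass
class Disk:
    name: str
    # Maps a partition index (1, 2, ...) to its Partition.
    parts: dict = field(default_factory=dict)


@dataclass
class Bio:
    disk: Disk     # stands in for the gendisk pointer
    partno: int    # partition index used for remapping (0 = whole disk)
    sector: int    # start sector, relative to the partition
    nr_sects: int  # length of the I/O in sectors


def remap(bio: Bio) -> Bio:
    """Translate a partition-relative bio into a whole-disk bio."""
    if bio.partno == 0:
        return bio  # already addressed against the whole disk
    part = bio.disk.parts[bio.partno]
    if bio.sector + bio.nr_sects > part.nr_sects:
        raise ValueError("I/O beyond end of partition")
    return Bio(bio.disk, 0, bio.sector + part.start_sect, bio.nr_sects)
```

After remapping, a driver only ever sees whole-disk sectors, which is why the disk plus a partition index is enough to describe the target of an I/O.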
    • Christoph Hellwig's avatar
      c2ee070f
    • Christoph Hellwig's avatar
      block: add a __disk_get_part helper · 807d4af2
      Christoph Hellwig authored
      This helper allows looking up a partition under RCU protection without
      grabbing a reference to it.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      807d4af2
    • Christoph Hellwig's avatar
    • Christoph Hellwig's avatar
      raid5: remove a call to get_start_sect · 10433d04
      Christoph Hellwig authored
      The block layer always remaps partitions before calling into the
      ->make_request methods of drivers.  Thus the call to get_start_sect in
      in_chunk_boundary will always return 0 and can be removed.
      Reviewed-by: Shaohua Li <shli@fb.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      10433d04
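For illustration, a chunk-boundary check of the kind this commit touches might look like the sketch below. It is an assumption-laden model rather than the raid5 source: the function signature is hypothetical, and it assumes the chunk size is a power of two so that a mask yields the offset within the chunk. Because the sector is already whole-device relative by the time the driver sees it, no partition start offset is added.

```python
def in_chunk_boundary(sector: int, bio_sectors: int, chunk_sectors: int) -> bool:
    """Return True if the I/O fits entirely inside one chunk.

    Assumes chunk_sectors is a power of two, so the mask below gives the
    offset of the start sector within its chunk.
    """
    offset_in_chunk = sector & (chunk_sectors - 1)
    return chunk_sectors >= offset_in_chunk + bio_sectors
```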
    • Christoph Hellwig's avatar
      btrfs: index check-integrity state hash by a dev_t · f8f84b2d
      Christoph Hellwig authored
      We won't have the struct block_device available in the bio soon, so switch
      to the numerical dev_t instead of the block_device pointer for looking up
      the check-integrity state.
      Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      f8f84b2d
    • Bart Van Assche's avatar
      skd: Change default interrupt mode to MSI-X · 744353b6
      Bart Van Assche authored
      Since MSI support on some motherboards is unreliable, change the
      default interrupt mode from MSI to MSI-X. This patch prevents the
      following message from appearing sporadically in the kernel logs of
      my test setup:
      
      do_IRQ: 3.193 No irq handler for vector
      Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Hannes Reinecke <hare@suse.de>
      Cc: Johannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      744353b6
    • Bart Van Assche's avatar
      skd: Avoid double completions in case of a timeout · f2fe4459
      Bart Van Assche authored
      Avoid that normal request completion and the timeout handler can
      run concurrently by calling blk_mq_complete_request() instead of
      blk_mq_end_request() from skd_end_request(). Avoid that the block
      layer can reuse a request while the firmware is still processing
      it. Convert skd_softirq_done() to blk-mq. Pass the pointer to
      skd_softirq_done() to the block layer core through
      blk_mq_ops.complete instead of by calling blk_queue_softirq_done().
      Pass the pointer to skd_timed_out() to the block layer core
      through blk_mq_ops.timeout instead of by calling
      blk_queue_timed_out(). The timeout handler has been tested as
      follows:
      
          echo 1 > /sys/block/skd0/io-timeout-fail &&
          (cd /sys/kernel/debug/fail_io_timeout &&
            echo 100 > probability &&
            echo N > task-filter &&
            echo 1 > times)
      
      Fixes: commit a74d5b76 ("skd: Switch to block layer timeout mechanism")
      Reported-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Hannes Reinecke <hare@suse.de>
      Cc: Johannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      f2fe4459
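The double-completion problem this commit addresses comes down to two paths (normal completion and the timeout handler) racing to finish the same request. The following is only a userspace model of the "claim the request once" idea; the `Request` class and `complete` function are hypothetical and do not reflect how blk-mq implements it internally.

```python
import threading


class Request:
    def __init__(self, tag):
        self.tag = tag
        self._lock = threading.Lock()
        self._completed = False
        self.finished_by = None

    def mark_complete(self) -> bool:
        """Atomically claim the request; only the first caller gets True."""
        with self._lock:
            if self._completed:
                return False
            self._completed = True
            return True


def complete(req: Request, who: str) -> bool:
    """Finish the request unless another path already claimed it."""
    if req.mark_complete():
        req.finished_by = who
        return True
    return False
```

Whichever path loses the race simply backs off, so the request is ended exactly once.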
    • Bart Van Assche's avatar
      skd: Inline skd_process_request() · c39c6c77
      Bart Van Assche authored
      This patch does not change any functionality but makes the skd
      driver code more similar to that of other blk-mq kernel drivers.
      Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Hannes Reinecke <hare@suse.de>
      Cc: Johannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      c39c6c77
    • Bart Van Assche's avatar
      skd: Report completion mismatches once · 49f16e2f
      Bart Van Assche authored
      This patch removes one debug statement but otherwise does not change
      any functionality.
      Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Hannes Reinecke <hare@suse.de>
      Cc: Johannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      49f16e2f
    • Bart Van Assche's avatar
      block: Warn if blk_queue_rq_timed_out() is called for a blk-mq queue · 130d733a
      Bart Van Assche authored
      The timeout handler set by blk_queue_rq_timed_out() is only used
      in single queue mode. Calling this function for blk-mq drivers is
      wrong. Hence issue a warning if this function is called by a blk-mq
      driver.
      Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Hannes Reinecke <hare@suse.de>
      Cc: Johannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      130d733a
    • Shaohua Li's avatar
      nullb: badbblocks support · 2f54a613
      Shaohua Li authored
      Sometimes a disk has broken tracks whose data is inaccessible, while
      data in other parts can still be accessed normally. MD RAID supports
      such disks, but we don't have a good way to test this because we can't
      control which part of a physical disk is bad. For a virtual disk, this
      is easy to control.

      This patch adds a new 'badblock' attribute. To mark sectors [1-100] as
      bad blocks:

          echo "+1-100" > xxx/badblock

      To mark sectors [20-30] as good again:

          echo "-20-30" > xxx/badblock

      If bad blocks are accessed, the nullb disk returns an I/O error. Other
      parts of the disk can be accessed normally.
      Signed-off-by: Shaohua Li <shli@fb.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      2f54a613
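The `+first-last` / `-first-last` syntax above can be modeled as follows. This is a small sketch under stated assumptions: the `BadBlocks` class is hypothetical, ranges are treated as inclusive, and a plain set of sectors stands in for whatever range structure the driver really uses.

```python
class BadBlocks:
    def __init__(self):
        # One entry per bad sector; adequate for a small model.
        self._bad = set()

    def store(self, cmd: str) -> None:
        """Apply "+first-last" (mark bad) or "-first-last" (mark good)."""
        if not cmd or cmd[0] not in ("+", "-"):
            raise ValueError("command must start with '+' or '-'")
        first_s, last_s = cmd[1:].split("-", 1)
        first, last = int(first_s), int(last_s)
        if first > last:
            raise ValueError("range start is after range end")
        sectors = range(first, last + 1)  # inclusive range
        if cmd[0] == "+":
            self._bad.update(sectors)
        else:
            self._bad.difference_update(sectors)

    def check(self, sector: int, nr_sects: int) -> bool:
        """Return True if the I/O touches no bad sector."""
        return not any(s in self._bad for s in range(sector, sector + nr_sects))
```

An I/O for which `check` returns False is the case where the emulated disk would report an I/O error.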
    • Shaohua Li's avatar
      nullb: emulate cache · deb78b41
      Shaohua Li authored
      Software must flush the disk cache to guarantee data safety. To check
      whether software flushes the cache correctly, we must know how the disk
      behaves, but the behavior of a physical disk is uncontrollable: even if
      software doesn't issue the flush, the disk probably flushes on its own.
      This patch emulates a cache in the test disk.

      All writes go to the cache first; when the cache is full, some data is
      flushed to disk storage. A flush request flushes all data in the cache
      to disk storage. A FUA write writes to the memory store directly and
      revalidates the data in the cache. A power failure (triggered by
      writing to the power attribute, 'echo 0 > disk_name/power') discards all
      data in the cache but preserves the data in disk storage. The disk can
      then be powered on again as usual (write 1 to the 'power' attribute) to
      check data integrity and verify that software does everything correctly.
      
      A new attribute 'cache_size' (in MB) is added to configure cache size.
      
      Based on original patch from Kyungchan Koh
      Signed-off-by: Kyungchan Koh <kkc6196@fb.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      deb78b41
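The cache behavior described in this commit can be sketched as a toy model. Everything here is illustrative: the `CachedDisk` class is hypothetical, the cache is sized in sectors rather than MB, eviction picks the oldest entry, and the FUA path is assumed to drop any stale cached copy of the sector.

```python
class CachedDisk:
    def __init__(self, cache_sectors: int):
        self.cache_sectors = cache_sectors
        self.cache = {}   # volatile; lost on power failure (insertion ordered)
        self.media = {}   # stable storage; survives power failure
        self.powered = True

    def write(self, sector: int, data, fua: bool = False) -> None:
        if fua:
            # FUA bypasses the cache and lands on stable storage.
            self.media[sector] = data
            self.cache.pop(sector, None)  # assumption: drop the stale copy
            return
        self.cache.pop(sector, None)      # re-insert so it becomes newest
        self.cache[sector] = data
        while len(self.cache) > self.cache_sectors:
            oldest = next(iter(self.cache))
            self.media[oldest] = self.cache.pop(oldest)

    def flush(self) -> None:
        """Move everything in the cache to stable storage."""
        self.media.update(self.cache)
        self.cache.clear()

    def read(self, sector: int):
        if sector in self.cache:
            return self.cache[sector]
        return self.media.get(sector)

    def power(self, on: bool) -> None:
        if not on:
            self.cache.clear()  # power failure: unflushed data is gone
        self.powered = on
```

A test built on this model writes data, cuts power, powers back on, and then checks that only flushed or FUA writes are still readable.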
    • Shaohua Li's avatar
      nullb: bandwidth control · eff2c4f1
      Shaohua Li authored
      In tests, we usually want controllable disk speed. For example, in a
      RAID array, we'd like some disks to be fast and others slow. MD RAID
      actually has a feature for this. To test that feature, we'd like to make
      the disk run at a specific speed.

      Block throttling could probably be used for this purpose, but it requires
      cgroup setup. Here we just implement a simple throttling mechanism in
      the driver. The mechanism fluctuates slightly, but it's good enough for
      testing.

      To configure the bandwidth cap, the user sets the 'mbps' attribute, which
      is in MB/s.
      
      Based on original patch from Kyungchan Koh
      Signed-off-by: Kyungchan Koh <kkc6196@fb.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      eff2c4f1
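One simple way to implement a cap like 'mbps' is a per-tick byte budget that a periodic timer refills. The sketch below shows that general technique only; the `Throttle` class, the tick rate, and the decision to let the last request in a tick overshoot the budget are all assumptions made for illustration, not details taken from the driver.

```python
class Throttle:
    TICKS_PER_SEC = 50  # hypothetical timer rate chosen for this sketch

    def __init__(self, mbps: int):
        self.bytes_per_tick = mbps * (1 << 20) // self.TICKS_PER_SEC
        self.budget = self.bytes_per_tick
        self.waiting = []  # sizes of requests held back until the next tick

    def submit(self, nbytes: int) -> bool:
        """Return True if the request may run now, else queue it."""
        if self.budget <= 0:
            self.waiting.append(nbytes)
            return False
        self.budget -= nbytes  # may overshoot slightly within a tick
        return True

    def tick(self) -> int:
        """Timer callback: refill the budget and restart queued requests."""
        self.budget = self.bytes_per_tick
        started = 0
        while self.waiting and self.budget > 0:
            self.budget -= self.waiting.pop(0)
            started += 1
        return started
```

The overshoot allowed in `submit` is one source of the slight fluctuation a scheme like this exhibits.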
    • Shaohua Li's avatar
      nullb: support discard · 306eb6b4
      Shaohua Li authored
      Discard makes sense for a memory-backed disk. It is also useful for
      testing whether the upper layer supports discard correctly.

      The user configures the 'discard' attribute to enable or disable discard
      support.
      
      Based on original patch from Kyungchan Koh
      Signed-off-by: Kyungchan Koh <kkc6196@fb.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      306eb6b4
    • Shaohua Li's avatar
      nullb: support memory backed store · 5bcd0e0c
      Shaohua Li authored
      This adds a memory-backed store to nullb.

      The user configures the 'memory_backed' attribute for this. By default,
      a nullb disk doesn't use the memory-backed store.
      
      Based on original patch from Kyungchan Koh
      Signed-off-by: Kyungchan Koh <kkc6196@fb.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      5bcd0e0c
    • Shaohua Li's avatar
      nullb: use ida to manage index · 94bc02e3
      Shaohua Li authored
      We now create disks dynamically. Manage the disk index with an ida to
      avoid bumping up the index too much.
      Signed-off-by: Shaohua Li <shli@fb.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      94bc02e3
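The benefit of an ID allocator over a plain counter is that freed indexes get reused. Here is a minimal model of that behavior; the `Ida` class is a hypothetical stand-in that hands out the smallest unused index, not the kernel's data structure.

```python
class Ida:
    """Hand out the smallest unused non-negative index."""

    def __init__(self):
        self._used = set()

    def get(self) -> int:
        idx = 0
        while idx in self._used:
            idx += 1
        self._used.add(idx)
        return idx

    def put(self, idx: int) -> None:
        """Release an index so a later get() can reuse it."""
        self._used.discard(idx)
```

With a monotonically increasing counter, repeatedly creating and removing disks would keep pushing the index upward; here a removed disk's index is available again for the next one.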
    • Shaohua Li's avatar
      nullb: add interface to power on disk · cedcafad
      Shaohua Li authored
      A device created through the nullb configfs interface isn't powered on
      by default. After configuring the device, the user can run 'echo 1 >
      xxx/nullb/device_name/power' to power on the device, which creates a
      disk. xxx/nullb/device_name/index is the disk index, so if the index
      is 2, the newly created disk is named /dev/nullb2. Note that 'index'
      is only valid after the disk is powered on.

      'echo 0 > xxx/nullb/device_name/power' removes the disk. Note that this
      doesn't remove the device. To remove the device, the user should run
      'rmdir xxx/nullb/device_name'. Removing the device removes the disk too.
      Signed-off-by: Shaohua Li <shli@fb.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      cedcafad
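The power-on steps above could be scripted roughly as follows. This is a sketch, assuming the caller passes the device's configfs directory (the 'xxx/nullb/device_name' path in the message); the `power_on` and `power_off` helper names are hypothetical.

```python
from pathlib import Path


def power_on(device_dir) -> str:
    """Write 1 to 'power', then derive the disk name from 'index'."""
    dev = Path(device_dir)
    (dev / "power").write_text("1\n")
    # 'index' is only meaningful once the device has been powered on.
    index = int((dev / "index").read_text().strip())
    return f"/dev/nullb{index}"


def power_off(device_dir) -> None:
    """Write 0 to 'power'; this removes the disk but keeps the device."""
    (Path(device_dir) / "power").write_text("0\n")
```

Because the helpers only read and write files under the given directory, they can be exercised against a scratch directory before being pointed at a real configfs mount.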
    • Shaohua Li's avatar
      nullb: add configfs interface · 3bf2bd20
      Shaohua Li authored
      Add a configfs interface for nullb. The configfs interface is more
      flexible and makes it easy to configure each disk individually.

      Configuration looks something like this:

          mount -t configfs none /mnt

      Check which features the driver supports:

          cat /mnt/nullb/features

      The 'features' attribute is for future extension. We will probably add
      new features to the driver; userspace can check this attribute to find
      the supported features.

      Create or remove a device:

          mkdir /mnt/nullb/a
          rmdir /mnt/nullb/a

      Then configure the device by setting attributes under /mnt/nullb/a. Most
      of the module parameters nullb supports are converted to attributes:
      size; /* device size in MB */
      completion_nsec; /* time in ns to complete a request */
      submit_queues; /* number of submission queues */
      home_node; /* home node for the device */
      queue_mode; /* block interface */
      blocksize; /* block size */
      irqmode; /* IRQ completion handler */
      hw_queue_depth; /* queue depth */
      use_lightnvm; /* register as a LightNVM device */
      blocking; /* blocking blk-mq device */
      use_per_node_hctx; /* use per-node allocation for hardware context */
      
      Note that creating a device doesn't create a disk immediately. Creating a
      disk is done in two phases: create a device, then power on the
      device. The next patch introduces device power-on.
      
      Based on original patch from Kyungchan Koh
      Signed-off-by: Kyungchan Koh <kkc6196@fb.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      3bf2bd20
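The create-and-configure workflow from this commit can be wrapped in a small helper. This is a sketch under assumptions: `create_device` is a hypothetical name, `root` is wherever configfs is mounted (/mnt in the message), and the `nullb` directory is assumed to exist already because the driver provides it.

```python
from pathlib import Path


def create_device(root, name: str, **attrs) -> Path:
    """Create <root>/nullb/<name> and write each attribute file."""
    dev = Path(root) / "nullb" / name
    dev.mkdir()  # on configfs, this is what creates the device
    for attr, value in attrs.items():
        (dev / attr).write_text(f"{value}\n")
    return dev
```

For example, `create_device("/mnt", "a", size=1024, blocksize=4096)` would correspond to `mkdir /mnt/nullb/a` followed by writing the 'size' and 'blocksize' attributes. As the message notes, this alone does not create a disk.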
    • Shaohua Li's avatar
      nullb: factor disk parameters · 2984c868
      Shaohua Li authored
      When we switch to the configfs interface, each disk can have a different
      configuration. To prepare for the change, we move most disk settings to a
      separate data structure. The existing module parameter interface is
      kept. 'nr_devices' and 'shared_tags' don't make sense as per-disk
      settings, so they remain global settings.
      Signed-off-by: Shaohua Li <shli@fb.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      2984c868
    • Dan Carpenter's avatar
      skd: error pointer dereference in skd_cons_disk() · 92d499d4
      Dan Carpenter authored
      My initial impulse was to check for IS_ERR_OR_NULL() but when I looked
      at this code a bit more closely, we should only need to check for
      IS_ERR().
      
      The blk_mq_alloc_tag_set() returns negative error codes and zero on
      success so we can just do an "if (rc) goto err_out;".  It's better to
      preserve the error code anyhow.  The blk_mq_init_queue() returns error
      pointers on failure, it never returns NULL.  We can also remove the
      "q = NULL;" at the start because that's no longer needed.
      
      Fixes: ca33dd92 ("skd: Convert to blk-mq")
      Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      92d499d4
    • Dan Carpenter's avatar
      skd: Uninitialized variable in skd_isr_completion_posted() · c0b3dda7
      Dan Carpenter authored
      Someone got too aggressive about removing initializations and
      accidentally removed the "rc = 0;" which is required.

      Fixes: c830da8c ("skd: Remove superfluous initializations from skd_isr_completion_posted()")
      Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      c0b3dda7
  2. 18 Aug, 2017 18 commits