1. 15 Jan, 2018 5 commits
    • Mike Snitzer's avatar
      dm: fix incomplete request_queue initialization · c100ec49
      Mike Snitzer authored
      DM is no longer prone to having its request_queue be improperly
      initialized.
      
      Summary of changes:
      
      - defer DM's blk_register_queue() from add_disk()-time until
        dm_setup_md_queue() by using add_disk_no_queue_reg() in alloc_dev().
      
      - dm_setup_md_queue() is updated to fully initialize DM's request_queue
        (_after_ all table loads have occurred and the request_queue's type,
        features and limits are known).
      
      A very welcome side-effect of these changes is DM no longer needs to:
      1) backfill the "mq" sysfs entry (because historically DM didn't
      initialize the request_queue to use blk-mq until _after_
      blk_register_queue() was called via add_disk()).
      2) call elv_register_queue() to get .request_fn request-based DM
      device's "iosched" exposed in syfs.
      
      In addition, blk-mq debugfs support is now made available because
      request-based DM's blk-mq request_queue is now properly initialized
      before dm_setup_md_queue() calls blk_register_queue().
      
      These changes also stave off the need to introduce new DM-specific
      workarounds in block core, e.g. this proposal:
      https://patchwork.kernel.org/patch/10067961/
      
      In the end DM devices should be less unicorn in nature (relative to
      initialization and availability of block core infrastructure provided by
      the request_queue).
      Signed-off-by: default avatarMike Snitzer <snitzer@redhat.com>
      Tested-by: default avatarMing Lei <ming.lei@redhat.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      c100ec49
    • Douglas Gilbert's avatar
      blk_rq_map_user_iov: fix error override · 69e0927b
      Douglas Gilbert authored
      During stress tests by syzkaller on the sg driver the block layer
      infrequently returns EINVAL. Closer inspection shows the block
      layer was trying to return ENOMEM (which is much more
      understandable) but for some reason overroad that useful error.
      
      Patch below does not show this (unchanged) line:
         ret =__blk_rq_map_user_iov(rq, map_data, &i, gfp_mask, copy);
      That 'ret' was being overridden when that function failed.
      Signed-off-by: default avatarDouglas Gilbert <dgilbert@interlog.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      69e0927b
    • Mike Snitzer's avatar
      block: allow gendisk's request_queue registration to be deferred · fa70d2e2
      Mike Snitzer authored
      Since I can remember DM has forced the block layer to allow the
      allocation and initialization of the request_queue to be distinct
      operations.  Reason for this is block/genhd.c:add_disk() has requires
      that the request_queue (and associated bdi) be tied to the gendisk
      before add_disk() is called -- because add_disk() also deals with
      exposing the request_queue via blk_register_queue().
      
      DM's dynamic creation of arbitrary device types (and associated
      request_queue types) requires the DM device's gendisk be available so
      that DM table loads can establish a master/slave relationship with
      subordinate devices that are referenced by loaded DM tables -- using
      bd_link_disk_holder().  But until these DM tables, and their associated
      subordinate devices, are known DM cannot know what type of request_queue
      it needs -- nor what its queue_limits should be.
      
      This chicken and egg scenario has created all manner of problems for DM
      and, at times, the block layer.
      
      Summary of changes:
      
      - Add device_add_disk_no_queue_reg() and add_disk_no_queue_reg() variant
        that drivers may use to add a disk without also calling
        blk_register_queue().  Driver must call blk_register_queue() once its
        request_queue is fully initialized.
      
      - Return early from blk_unregister_queue() if QUEUE_FLAG_REGISTERED
        is not set.  It won't be set if driver used add_disk_no_queue_reg()
        but driver encounters an error and must del_gendisk() before calling
        blk_register_queue().
      
      - Export blk_register_queue().
      
      These changes allow DM to use add_disk_no_queue_reg() to anchor its
      gendisk as the "master" for master/slave relationships DM must establish
      with subordinate devices referenced in DM tables that get loaded.  Once
      all "slave" devices for a DM device are known its request_queue can be
      properly initialized and then advertised via sysfs -- important
      improvement being that no request_queue resource initialization
      performed by blk_register_queue() is missed for DM devices anymore.
      Signed-off-by: default avatarMike Snitzer <snitzer@redhat.com>
      Reviewed-by: default avatarMing Lei <ming.lei@redhat.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      fa70d2e2
    • Mike Snitzer's avatar
      block: properly protect the 'queue' kobj in blk_unregister_queue · 667257e8
      Mike Snitzer authored
      The original commit e9a823fb (block: fix warning when I/O elevator
      is changed as request_queue is being removed) is pretty conflated.
      "conflated" because the resource being protected by q->sysfs_lock isn't
      the queue_flags (it is the 'queue' kobj).
      
      q->sysfs_lock serializes __elevator_change() (via elv_iosched_store)
      from racing with blk_unregister_queue():
      1) By holding q->sysfs_lock first, __elevator_change() can complete
      before a racing blk_unregister_queue().
      2) Conversely, __elevator_change() is testing for QUEUE_FLAG_REGISTERED
      in case elv_iosched_store() loses the race with blk_unregister_queue(),
      it needs a way to know the 'queue' kobj isn't there.
      
      Expand the scope of blk_unregister_queue()'s q->sysfs_lock use so it is
      held until after the 'queue' kobj is removed.
      
      To do so blk_mq_unregister_dev() must not also take q->sysfs_lock.  So
      rename __blk_mq_unregister_dev() to blk_mq_unregister_dev().
      
      Also, blk_unregister_queue() should use q->queue_lock to protect against
      any concurrent writes to q->queue_flags -- even though chances are the
      queue is being cleaned up so no concurrent writes are likely.
      
      Fixes: e9a823fb ("block: fix warning when I/O elevator is changed as request_queue is being removed")
      Signed-off-by: default avatarMike Snitzer <snitzer@redhat.com>
      Reviewed-by: default avatarMing Lei <ming.lei@redhat.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      667257e8
    • Mike Snitzer's avatar
      block: only bdi_unregister() in del_gendisk() if !GENHD_FL_HIDDEN · bc8d062c
      Mike Snitzer authored
      device_add_disk() will only call bdi_register_owner() if
      !GENHD_FL_HIDDEN, so it follows that del_gendisk() should only call
      bdi_unregister() if !GENHD_FL_HIDDEN.
      
      Found with code inspection.  bdi_unregister() won't do any harm if
      bdi_register_owner() wasn't used but best to avoid the unnecessary
      call to bdi_unregister().
      
      Fixes: 8ddcd653 ("block: introduce GENHD_FL_HIDDEN")
      Signed-off-by: default avatarMike Snitzer <snitzer@redhat.com>
      Reviewed-by: default avatarMing Lei <ming.lei@redhat.com>
      Reviewed-by: default avatarHannes Reinecke <hare@suse.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      bc8d062c
  2. 14 Jan, 2018 1 commit
  3. 12 Jan, 2018 3 commits
  4. 11 Jan, 2018 2 commits
  5. 10 Jan, 2018 17 commits
  6. 09 Jan, 2018 12 commits
    • Jens Axboe's avatar
      null_blk: wire up timeouts · 5448aca4
      Jens Axboe authored
      This is needed to ensure that we actually handle timeouts.
      Without it, the queue_mode=1 path will never call blk_add_timer(),
      and the queue_mode=2 path will continually just return
      EH_RESET_TIMER and we never actually complete the offending request.
      
      This was used to test the new timeout code, and the changes around
      killing off REQ_ATOM_COMPLETE.
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      5448aca4
    • Jens Axboe's avatar
      bfq-iosched: don't call bfqg_and_blkg_put for !CONFIG_BFQ_GROUP_IOSCHED · 8abef10b
      Jens Axboe authored
      It's not available if we don't have group io scheduling set, and
      there's no need to call it.
      
      Fixes: 0d52af59 ("block, bfq: release oom-queue ref to root group on exit")
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      8abef10b
    • Michael Lyle's avatar
      bcache: closures: move control bits one bit right · 3609c471
      Michael Lyle authored
      Otherwise, architectures that do negated adds of atomics (e.g. s390)
      to do atomic_sub fail in closure_set_stopped.
      Signed-off-by: default avatarMichael Lyle <mlyle@lyle.org>
      Cc: Kent Overstreet <kent.overstreet@gmail.com>
      Reported-by: default avatarkbuild test robot <lkp@intel.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      3609c471
    • Bart Van Assche's avatar
      block: Fix kernel-doc warnings reported when building with W=1 · aa98192d
      Bart Van Assche authored
      Commit 3a025e1d ("Add optional check for bad kernel-doc comments")
      causes W=1 the kernel-doc script to be run and thereby causes several
      new warnings to appear when building the kernel with W=1. Fix the
      block layer kernel-doc headers such that the block layer again builds
      cleanly with W=1.
      Signed-off-by: default avatarBart Van Assche <bart.vanassche@wdc.com>
      Cc: Martin K. Petersen <martin.petersen@oracle.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Hannes Reinecke <hare@suse.de>
      Cc: Johannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      aa98192d
    • Bart Van Assche's avatar
      blk-mq: Fix spelling in a source code comment · ee3e4de5
      Bart Van Assche authored
      Change "nedeing" into "needing" and "caes" into "cases".
      
      Fixes: commit f906a6a0 ("blk-mq: improve tag waiting setup for non-shared tags")
      Signed-off-by: default avatarBart Van Assche <bart.vanassche@wdc.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Omar Sandoval <osandov@fb.com>
      Cc: Hannes Reinecke <hare@suse.de>
      Cc: Johannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      ee3e4de5
    • Jens Axboe's avatar
      blk-mq: silence false positive warnings in hctx_unlock() · 08b5a6e2
      Jens Axboe authored
      In some stupider versions of gcc, it complains:
      
      block/blk-mq.c: In function ‘blk_mq_complete_request’:
      ./include/linux/srcu.h:175:2: warning: ‘srcu_idx’ may be used uninitialized in this function [-Wmaybe-uninitialized]
        __srcu_read_unlock(sp, idx);
        ^
      block/blk-mq.c:620:6: note: ‘srcu_idx’ was declared here
        int srcu_idx;
            ^
      
      which is completely bogus, since we only use srcu_idx when
      hctx->flags & BLK_MQ_F_BLOCKING is set, and that's the case where
      hctx_lock() has initialized it.
      
      Just set it to '0' in the normal path in hctx_lock() to silence
      this annoying warning.
      
      Fixes: 04ced159 ("blk-mq: move hctx lock/unlock into a helper")
      Fixes: 5197c05e ("blk-mq: protect completion path with RCU")
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      08b5a6e2
    • Tejun Heo's avatar
      blk-mq: rename blk_mq_hw_ctx->queue_rq_srcu to ->srcu · 05707b64
      Tejun Heo authored
      The RCU protection has been expanded to cover both queueing and
      completion paths making ->queue_rq_srcu a misnomer.  Rename it to
      ->srcu as suggested by Bart.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Cc: Bart Van Assche <Bart.VanAssche@wdc.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      05707b64
    • Tejun Heo's avatar
      blk-mq: remove REQ_ATOM_STARTED · 5a61c363
      Tejun Heo authored
      After the recent updates to use generation number and state based
      synchronization, we can easily replace REQ_ATOM_STARTED usages by
      adding an extra state to distinguish completed but not yet freed
      state.
      
      Add MQ_RQ_COMPLETE and replace REQ_ATOM_STARTED usages with
      blk_mq_rq_state() tests.  REQ_ATOM_STARTED no longer has any users
      left and is removed.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      5a61c363
    • Tejun Heo's avatar
      blk-mq: remove REQ_ATOM_COMPLETE usages from blk-mq · 634f9e46
      Tejun Heo authored
      After the recent updates to use generation number and state based
      synchronization, blk-mq no longer depends on REQ_ATOM_COMPLETE except
      to avoid firing the same timeout multiple times.
      
      Remove all REQ_ATOM_COMPLETE usages and use a new rq_flags flag
      RQF_MQ_TIMEOUT_EXPIRED to avoid firing the same timeout multiple
      times.  This removes atomic bitops from hot paths too.
      
      v2: Removed blk_clear_rq_complete() from blk_mq_rq_timed_out().
      
      v3: Added RQF_MQ_TIMEOUT_EXPIRED flag.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Cc: "jianchao.wang" <jianchao.w.wang@oracle.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      634f9e46
    • Tejun Heo's avatar
      blk-mq: make blk_abort_request() trigger timeout path · 358f70da
      Tejun Heo authored
      With issue/complete and timeout paths now using the generation number
      and state based synchronization, blk_abort_request() is the only one
      which depends on REQ_ATOM_COMPLETE for arbitrating completion.
      
      There's no reason for blk_abort_request() to be a completely separate
      path.  This patch makes blk_abort_request() piggyback on the timeout
      path instead of trying to terminate the request directly.
      
      This removes the last dependency on REQ_ATOM_COMPLETE in blk-mq.
      
      Note that this makes blk_abort_request() asynchronous - it initiates
      abortion but the actual termination will happen after a short while,
      even when the caller owns the request.  AFAICS, SCSI and ATA should be
      fine with that and I think mtip32xx and dasd should be safe but not
      completely sure.  It'd be great if people who know the drivers take a
      look.
      
      v2: - Add comment explaining the lack of synchronization around
            ->deadline update as requested by Bart.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Cc: Asai Thambi SP <asamymuthupa@micron.com>
      Cc: Stefan Haberland <sth@linux.vnet.ibm.com>
      Cc: Jan Hoeppner <hoeppner@linux.vnet.ibm.com>
      Cc: Bart Van Assche <Bart.VanAssche@wdc.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      358f70da
    • Tejun Heo's avatar
      blk-mq: use blk_mq_rq_state() instead of testing REQ_ATOM_COMPLETE · 67818d25
      Tejun Heo authored
      blk_mq_check_inflight() and blk_mq_poll_hybrid_sleep() test
      REQ_ATOM_COMPLETE to determine the request state.  Both uses are
      speculative and we can test REQ_ATOM_STARTED and blk_mq_rq_state() for
      equivalent results.  Replace the tests.  This will allow removing
      REQ_ATOM_COMPLETE usages from blk-mq.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      67818d25
    • Tejun Heo's avatar
      blk-mq: replace timeout synchronization with a RCU and generation based scheme · 1d9bd516
      Tejun Heo authored
      Currently, blk-mq timeout path synchronizes against the usual
      issue/completion path using a complex scheme involving atomic
      bitflags, REQ_ATOM_*, memory barriers and subtle memory coherence
      rules.  Unfortunately, it contains quite a few holes.
      
      There's a complex dancing around REQ_ATOM_STARTED and
      REQ_ATOM_COMPLETE between issue/completion and timeout paths; however,
      they don't have a synchronization point across request recycle
      instances and it isn't clear what the barriers add.
      blk_mq_check_expired() can easily read STARTED from N-2'th iteration,
      deadline from N-1'th, blk_mark_rq_complete() against Nth instance.
      
      In fact, it's pretty easy to make blk_mq_check_expired() terminate a
      later instance of a request.  If we induce 5 sec delay before
      time_after_eq() test in blk_mq_check_expired(), shorten the timeout to
      2s, and issue back-to-back large IOs, blk-mq starts timing out
      requests spuriously pretty quickly.  Nothing actually timed out.  It
      just made the call on a recycle instance of a request and then
      terminated a later instance long after the original instance finished.
      The scenario isn't theoretical either.
      
      This patch replaces the broken synchronization mechanism with a RCU
      and generation number based one.
      
      1. Each request has a u64 generation + state value, which can be
         updated only by the request owner.  Whenever a request becomes
         in-flight, the generation number gets bumped up too.  This provides
         the basis for the timeout path to distinguish different recycle
         instances of the request.
      
         Also, marking a request in-flight and setting its deadline are
         protected with a seqcount so that the timeout path can fetch both
         values coherently.
      
      2. The timeout path fetches the generation, state and deadline.  If
         the verdict is timeout, it records the generation into a dedicated
         request abortion field and does RCU wait.
      
      3. The completion path is also protected by RCU (from the previous
         patch) and checks whether the current generation number and state
         match the abortion field.  If so, it skips completion.
      
      4. The timeout path, after RCU wait, scans requests again and
         terminates the ones whose generation and state still match the ones
         requested for abortion.
      
         By now, the timeout path knows that either the generation number
         and state changed if it lost the race or the completion will yield
         to it and can safely timeout the request.
      
      While it's more lines of code, it's conceptually simpler, doesn't
      depend on direct use of subtle memory ordering or coherence, and
      hopefully doesn't terminate the wrong instance.
      
      While this change makes REQ_ATOM_COMPLETE synchronization unnecessary
      between issue/complete and timeout paths, REQ_ATOM_COMPLETE isn't
      removed yet as it's still used in other places.  Future patches will
      move all state tracking to the new mechanism and remove all bitops in
      the hot paths.
      
      Note that this patch adds a comment explaining a race condition in
      BLK_EH_RESET_TIMER path.  The race has always been there and this
      patch doesn't change it.  It's just documenting the existing race.
      
      v2: - Fixed BLK_EH_RESET_TIMER handling as pointed out by Jianchao.
          - s/request->gstate_seqc/request->gstate_seq/ as suggested by Peter.
          - READ_ONCE() added in blk_mq_rq_update_state() as suggested by Peter.
      
      v3: - Fixed possible extended seqcount / u64_stats_sync read looping
            spotted by Peter.
          - MQ_RQ_IDLE was incorrectly being set in complete_request instead
            of free_request.  Fixed.
      
      v4: - Rebased on top of hctx_lock() refactoring patch.
          - Added comment explaining the use of hctx_lock() in completion path.
      
      v5: - Added comments requested by Bart.
          - Note the addition of BLK_EH_RESET_TIMER race condition in the
            commit message.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Cc: "jianchao.wang" <jianchao.w.wang@oracle.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Bart Van Assche <Bart.VanAssche@wdc.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      1d9bd516