1. 04 Oct, 2017 1 commit
    • bsg-lib: fix use-after-free under memory-pressure · eab40cf3
      Benjamin Block authored
      When under memory pressure it is possible that the mempool which backs
      the 'struct request_queue' will fall back to up to BLKDEV_MIN_RQ
      preallocated emergency buffers, in case it can't get a regular
      allocation. Once these buffers are also in use, the pool is re-supplied
      with old, finished requests from the same request_queue (see
      mempool_free()).
      
      The bug is that when the emergency pool is re-supplied, the old requests
      are not run through the callback mempool_t->alloc() again, and thus not
      through the callback bsg_init_rq() either. So initialization is skipped,
      and while the sense-buffer should still be good, scsi_request->cmd may
      have become an invalid pointer in the meantime: when the request is
      initialized in bsg.c and the user's CDB is larger than BLK_MAX_CDB, bsg
      replaces it with a custom allocated buffer, which is freed once the
      user's command is finished, so the pointer dangles afterwards. When the
      user next sends a command whose CDB is no larger than BLK_MAX_CDB, bsg
      assumes that scsi_request->cmd is backed by scsi_request->__cmd, makes
      no custom allocation, and writes into undefined memory.
      
      Fix this by splitting bsg_init_rq() into two functions (a sketch of the
      result follows the list):
       - bsg_init_rq() is changed to only do the allocation of the
         sense-buffer, which is used to back the bsg job's reply buffer. This
         pointer should never change during the lifetime of a scsi_request, so
         it doesn't need re-initialization.
       - bsg_initialize_rq() is a new function that makes use of
         'struct request_queue's initialize_rq_fn callback (which was
         introduced in v4.12). This is always called before the request is
         given out via blk_get_request(). This function does the remaining
         initialization that was previously done in bsg_init_rq(), and will
         also do it when the request is taken from the emergency-pool of the
         backing mempool.
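
      The resulting callbacks then look roughly like this (a simplified
      sketch; the field and helper names follow the description above and
      the bsg-lib code of that era, so treat the details as approximate):

          /* mempool_t->alloc() path: runs once per backing buffer, so it
           * only sets up what never changes - the sense/reply buffer. */
          static int bsg_init_rq(struct request_queue *q, struct request *req,
                                 gfp_t gfp)
          {
                  struct bsg_job *job = blk_mq_rq_to_pdu(req);

                  job->reply = kzalloc(SCSI_SENSE_BUFFERSIZE, gfp);
                  if (!job->reply)
                          return -ENOMEM;
                  return 0;
          }

          /* initialize_rq_fn: runs on every blk_get_request(), including
           * for requests handed back out of the mempool's emergency
           * reserve. */
          static void bsg_initialize_rq(struct request *req)
          {
                  struct bsg_job *job = blk_mq_rq_to_pdu(req);
                  struct scsi_request *sreq = &job->sreq;
                  void *reply = job->reply;  /* keep the preallocated buffer */

                  memset(job, 0, sizeof(*job));
                  scsi_req_init(sreq);  /* points sreq->cmd back at sreq->__cmd */
                  sreq->sense = reply;
                  sreq->sense_len = SCSI_SENSE_BUFFERSIZE;

                  job->reply = reply;
                  job->reply_len = sreq->sense_len;
                  job->dd_data = job + 1;
          }

      bsg_setup_queue() would then point the queue's initialize_rq_fn at
      bsg_initialize_rq() (again, an assumption about the exact hook-up).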
      
      Fixes: 50b4d485 ("bsg-lib: fix kernel panic resulting from missing allocation of reply-buffer")
      Cc: <stable@vger.kernel.org> # 4.11+
      Reviewed-by: Hannes Reinecke <hare@suse.com>
      Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Benjamin Block <bblock@linux.vnet.ibm.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  2. 03 Oct, 2017 4 commits
    • blk-mq-debugfs: fix device sched directory for default scheduler · 70e62f4b
      Omar Sandoval authored
      In blk_mq_debugfs_register(), I remembered to set up the per-hctx sched
      directories if a default scheduler was already configured by
      blk_mq_sched_init() from blk_mq_init_allocated_queue(), but I didn't do
      the same for the device-wide sched directory. Fix it.
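
      The device-wide directory needs the same "is a scheduler already set
      up?" handling that the per-hctx directories get; a hedged sketch of
      the relevant part of blk_mq_debugfs_register() (helper names assumed
      to mirror the existing per-hctx helpers):

          /*
           * A default scheduler may already have been set up by
           * blk_mq_sched_init() before debugfs registration ran, so create
           * the device-wide sched directory here as well, not only the
           * per-hctx ones.
           */
          if (q->elevator && !q->sched_debugfs_dir &&
              blk_mq_debugfs_register_sched(q))
                  goto err;

          queue_for_each_hw_ctx(q, hctx, i) {
                  if (!hctx->debugfs_dir &&
                      blk_mq_debugfs_register_hctx(q, hctx))
                          goto err;
                  if (q->elevator && !hctx->sched_debugfs_dir &&
                      blk_mq_debugfs_register_sched_hctx(q, hctx))
                          goto err;
          }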
      
      Fixes: d332ce09 ("blk-mq-debugfs: allow schedulers to register debugfs attributes")
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • null_blk: change configfs dependency to select · 6cd1a6fe
      Jens Axboe authored
      A recent commit made null_blk depend on configfs, which is kind of
      annoying since you now have to find that dependency and enable it as
      well. I discovered this when I no longer had null_blk available on a
      box I needed to debug, because the option got dropped when the config
      was updated after the configfs change was merged.
      
      Fixes: 3bf2bd20 ("nullb: add configfs interface")
      Reviewed-by: Shaohua Li <shli@fb.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blk-throttle: fix possible io stall when upgrade to max · 4f02fb76
      Joseph Qi authored
      There is a case which can lead to an io stall, described as follows.
      /test1
        |-subtest1
      /test2
        |-subtest2
      subtest1 and subtest2 each already have 32 queued bios.
      
      Now upgrade to max. In throtl_upgrade_state, it will try to dispatch
      bios as follows:
      1) tg=subtest1, do nothing;
      2) tg=test1, transfer 32 queued bios from subtest1 to test1; no pending
      left, no need to schedule next dispatch;
      3) tg=subtest2, do nothing;
      4) tg=test2, transfer 32 queued bios from subtest2 to test2; no pending
      left, no need to schedule next dispatch;
      5) tg=/, transfer 8 queued bios from test1 to /, 8 queued bios from
      test2 to /, 8 queued bios from test1 to /, and 8 queued bios from test2
      to /; note that test1 and test2 each still has 16 queued bios left;
      6) tg=/, try to schedule the next dispatch, but since disptime is now
      (updated in tg_update_disptime with wait=0), the pending timer is not
      actually scheduled;
      7) so throtl_upgrade_state dispatches 32 queued bios in total, with 32
      left over; test1 and test2 each have 16 queued bios;
      8) throtl_pending_timer_fn sees the left-over bios but can do nothing,
      because throtl_select_dispatch returns 0 and test1/test2 have no
      pending tg.
      
      The blktrace shows the following:
      8,32   0        0     2.539007641     0  m   N throtl upgrade to max
      8,32   0        0     2.539072267     0  m   N throtl /test2 dispatch nr_queued=16 read=0 write=16
      8,32   7        0     2.539077142     0  m   N throtl /test1 dispatch nr_queued=16 read=0 write=16
      
      So force scheduling of the next dispatch if there are pending children.
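
      Roughly, the per-group loop in throtl_upgrade_state() then arms the
      pending timer unconditionally; a hedged sketch, assuming the second
      argument of throtl_schedule_next_dispatch() is the "force" flag:

          blkg_for_each_descendant_post(blkg, pos_css, td->queue->root_blkg) {
                  struct throtl_grp *tg = blkg_to_tg(blkg);
                  struct throtl_service_queue *sq = &tg->service_queue;

                  tg->disptime = jiffies - 1;
                  throtl_select_dispatch(sq);
                  /* force: arm the pending timer even when disptime is
                   * "now", so bios still queued on the children (step 6
                   * above) are not stranded */
                  throtl_schedule_next_dispatch(sq, true);
          }
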
      Reviewed-by: Shaohua Li <shli@fb.com>
      Signed-off-by: Joseph Qi <qijiang.qj@alibaba-inc.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • MAINTAINERS: update list for NBD · 38b249bc
      Wouter Verhelst authored
      nbd-general@sourceforge.net becomes nbd@other.debian.org, because
      sourceforge is just a spamtrap these days.
      Signed-off-by: Wouter Verhelst <w@uter.be>
      Reviewed-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  3. 02 Oct, 2017 1 commit
    • nbd: fix -ERESTARTSYS handling · 6e60a3bb
      Josef Bacik authored
      After Christoph's change, whenever we got -ERESTARTSYS from sending our
      packets and wanted to return BLK_STS_RESOURCE, we'd end up returning
      BLK_STS_OK instead, which means we'd never requeue and would just hang.
      We really need to return the right value from the upper layer.
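
      The intent, roughly, is that ->queue_rq() passes through the status the
      send path decided on instead of collapsing every non-negative result to
      BLK_STS_OK; a hedged sketch, not the exact upstream diff:

          static blk_status_t nbd_queue_rq(struct blk_mq_hw_ctx *hctx,
                                           const struct blk_mq_queue_data *bd)
          {
                  struct nbd_cmd *cmd = blk_mq_rq_to_pdu(bd->rq);
                  /* assumed: negative errno on hard failure, otherwise a
                   * blk_status_t such as BLK_STS_OK or BLK_STS_RESOURCE
                   * (the latter after -ERESTARTSYS on the socket send) */
                  int ret = nbd_handle_cmd(cmd, hctx->queue_num);

                  if (ret < 0)
                          return BLK_STS_IOERR;

                  /* returning BLK_STS_RESOURCE here makes blk-mq requeue
                   * the request instead of treating an interrupted send as
                   * done */
                  return ret;
          }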
      
      Fixes: fc17b653 ("blk-mq: switch ->queue_rq return value to blk_status_t")
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  4. 27 Sep, 2017 1 commit
    • bcache: use llist_for_each_entry_safe() in __closure_wake_up() · a5f3d8a5
      Coly Li authored
      Commit 09b3efec ("bcache: Don't reinvent the wheel but use existing
      llist API") replaced the following while loop with
      llist_for_each_entry(),
      -
      -	while (reverse) {
      -		cl = container_of(reverse, struct closure, list);
      -		reverse = llist_next(reverse);
      -
      +	llist_for_each_entry(cl, reverse, list) {
       		closure_set_waiting(cl, 0);
       		closure_sub(cl, CLOSURE_WAITING + 1);
       	}
      
      This modification introduces a potential race by iterating a corrupted
      list. Here is how it happens.
      
      In the above modification, closure_sub() may wake up a process which is
      waiting on the reverse list. If this process decides to wait again by
      calling closure_wait(), its cl->list will be added to another wait
      list. When llist_for_each_entry() then moves on to the next node, it
      ends up walking the new wait list it was added to in closure_wait(),
      not the original reverse list in __closure_wake_up(). This is more
      likely to happen on a UP machine, because the woken process may preempt
      the process that woke it.
      
      Using llist_for_each_entry_safe() fixes the issue: the safe version
      fetches the next node before waking up a process, and this saved copy
      of the next node keeps the list iteration on the original reverse list.
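
      With the safe variant, __closure_wake_up() then looks roughly like
      this (simplified sketch):

          void __closure_wake_up(struct closure_waitlist *wait_list)
          {
                  struct llist_node *list, *reverse;
                  struct closure *cl, *t;

                  list = llist_del_all(&wait_list->list);

                  /* reverse to preserve FIFO ordering of the waiters */
                  reverse = llist_reverse_order(list);

                  /* 't' caches the next node before the wakeup, so a woken
                   * process that immediately re-waits (relinking cl->list
                   * onto another wait list) cannot redirect this iteration */
                  llist_for_each_entry_safe(cl, t, reverse, list) {
                          closure_set_waiting(cl, 0);
                          closure_sub(cl, CLOSURE_WAITING + 1);
                  }
          }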
      
      Fixes: 09b3efec ("bcache: Don't reinvent the wheel but use existing llist API")
      Signed-off-by: Coly Li <colyli@suse.de>
      Reported-by: Michael Lyle <mlyle@lyle.org>
      Reviewed-by: Byungchul Park <byungchul.park@lge.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>