1. 15 Jan, 2018 2 commits
    • Sagi Grimberg's avatar
      nvme-pci: allocate device queues storage space at probe · 147b27e4
      Sagi Grimberg authored
      It may cause race by setting 'nvmeq' in nvme_init_request()
      because .init_request is called inside switching io scheduler, which
      may happen when the NVMe device is being resetted and its nvme queues
      are being freed and created. We don't have any sync between the two
      pathes.
      
      This patch changes the nvmeq allocation to occur at probe time so
      there is no way we can dereference it at init_request.
      
      [   93.268391] kernel BUG at drivers/nvme/host/pci.c:408!
      [   93.274146] invalid opcode: 0000 [#1] SMP
      [   93.278618] Modules linked in: nfsv3 nfs_acl rpcsec_gss_krb5 auth_rpcgss
      nfsv4 dns_resolver nfs lockd grace fscache sunrpc ipmi_ssif vfat fat
      intel_rapl sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel
      kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel iTCO_wdt
      intel_cstate ipmi_si iTCO_vendor_support intel_uncore mxm_wmi mei_me
      ipmi_devintf intel_rapl_perf pcspkr sg ipmi_msghandler lpc_ich dcdbas mei
      shpchp acpi_power_meter wmi dm_multipath ip_tables xfs libcrc32c sd_mod
      mgag200 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt
      fb_sys_fops ttm drm ahci libahci nvme libata crc32c_intel nvme_core tg3
      megaraid_sas ptp i2c_core pps_core dm_mirror dm_region_hash dm_log dm_mod
      [   93.349071] CPU: 5 PID: 1842 Comm: sh Not tainted 4.15.0-rc2.ming+ #4
      [   93.356256] Hardware name: Dell Inc. PowerEdge R730xd/072T6D, BIOS 2.5.5 08/16/2017
      [   93.364801] task: 00000000fb8abf2a task.stack: 0000000028bd82d1
      [   93.371408] RIP: 0010:nvme_init_request+0x36/0x40 [nvme]
      [   93.377333] RSP: 0018:ffffc90002537ca8 EFLAGS: 00010246
      [   93.383161] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000008
      [   93.391122] RDX: 0000000000000000 RSI: ffff880276ae0000 RDI: ffff88047bae9008
      [   93.399084] RBP: ffff88047bae9008 R08: ffff88047bae9008 R09: 0000000009dabc00
      [   93.407045] R10: 0000000000000004 R11: 000000000000299c R12: ffff880186bc1f00
      [   93.415007] R13: ffff880276ae0000 R14: 0000000000000000 R15: 0000000000000071
      [   93.422969] FS:  00007f33cf288740(0000) GS:ffff88047ba80000(0000) knlGS:0000000000000000
      [   93.431996] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [   93.438407] CR2: 00007f33cf28e000 CR3: 000000047e5bb006 CR4: 00000000001606e0
      [   93.446368] Call Trace:
      [   93.449103]  blk_mq_alloc_rqs+0x231/0x2a0
      [   93.453579]  blk_mq_sched_alloc_tags.isra.8+0x42/0x80
      [   93.459214]  blk_mq_init_sched+0x7e/0x140
      [   93.463687]  elevator_switch+0x5a/0x1f0
      [   93.467966]  ? elevator_get.isra.17+0x52/0xc0
      [   93.472826]  elv_iosched_store+0xde/0x150
      [   93.477299]  queue_attr_store+0x4e/0x90
      [   93.481580]  kernfs_fop_write+0xfa/0x180
      [   93.485958]  __vfs_write+0x33/0x170
      [   93.489851]  ? __inode_security_revalidate+0x4c/0x60
      [   93.495390]  ? selinux_file_permission+0xda/0x130
      [   93.500641]  ? _cond_resched+0x15/0x30
      [   93.504815]  vfs_write+0xad/0x1a0
      [   93.508512]  SyS_write+0x52/0xc0
      [   93.512113]  do_syscall_64+0x61/0x1a0
      [   93.516199]  entry_SYSCALL64_slow_path+0x25/0x25
      [   93.521351] RIP: 0033:0x7f33ce96aab0
      [   93.525337] RSP: 002b:00007ffe57570238 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
      [   93.533785] RAX: ffffffffffffffda RBX: 0000000000000006 RCX: 00007f33ce96aab0
      [   93.541746] RDX: 0000000000000006 RSI: 00007f33cf28e000 RDI: 0000000000000001
      [   93.549707] RBP: 00007f33cf28e000 R08: 000000000000000a R09: 00007f33cf288740
      [   93.557669] R10: 00007f33cf288740 R11: 0000000000000246 R12: 00007f33cec42400
      [   93.565630] R13: 0000000000000006 R14: 0000000000000001 R15: 0000000000000000
      [   93.573592] Code: 4c 8d 40 08 4c 39 c7 74 16 48 8b 00 48 8b 04 08 48 85 c0
      74 16 48 89 86 78 01 00 00 31 c0 c3 8d 4a 01 48 63 c9 48 c1 e1 03 eb de <0f>
      0b 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 85 f6 53 48 89
      [   93.594676] RIP: nvme_init_request+0x36/0x40 [nvme] RSP: ffffc90002537ca8
      [   93.602273] ---[ end trace 810dde3993e5f14e ]---
      Reported-by: default avatarYi Zhang <yi.zhang@redhat.com>
      Signed-off-by: default avatarSagi Grimberg <sagi@grimberg.me>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      147b27e4
    • Sagi Grimberg's avatar
      nvme-pci: serialize pci resets · 79c48ccf
      Sagi Grimberg authored
      Signed-off-by: default avatarSagi Grimberg <sagi@grimberg.me>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      79c48ccf
  2. 14 Jan, 2018 1 commit
  3. 12 Jan, 2018 3 commits
  4. 11 Jan, 2018 2 commits
  5. 10 Jan, 2018 17 commits
  6. 09 Jan, 2018 15 commits
    • Jens Axboe's avatar
      null_blk: wire up timeouts · 5448aca4
      Jens Axboe authored
      This is needed to ensure that we actually handle timeouts.
      Without it, the queue_mode=1 path will never call blk_add_timer(),
      and the queue_mode=2 path will continually just return
      EH_RESET_TIMER and we never actually complete the offending request.
      
      This was used to test the new timeout code, and the changes around
      killing off REQ_ATOM_COMPLETE.
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      5448aca4
    • Jens Axboe's avatar
      bfq-iosched: don't call bfqg_and_blkg_put for !CONFIG_BFQ_GROUP_IOSCHED · 8abef10b
      Jens Axboe authored
      It's not available if we don't have group io scheduling set, and
      there's no need to call it.
      
      Fixes: 0d52af59 ("block, bfq: release oom-queue ref to root group on exit")
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      8abef10b
    • Michael Lyle's avatar
      bcache: closures: move control bits one bit right · 3609c471
      Michael Lyle authored
      Otherwise, architectures that do negated adds of atomics (e.g. s390)
      to do atomic_sub fail in closure_set_stopped.
      Signed-off-by: default avatarMichael Lyle <mlyle@lyle.org>
      Cc: Kent Overstreet <kent.overstreet@gmail.com>
      Reported-by: default avatarkbuild test robot <lkp@intel.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      3609c471
    • Bart Van Assche's avatar
      block: Fix kernel-doc warnings reported when building with W=1 · aa98192d
      Bart Van Assche authored
      Commit 3a025e1d ("Add optional check for bad kernel-doc comments")
      causes W=1 the kernel-doc script to be run and thereby causes several
      new warnings to appear when building the kernel with W=1. Fix the
      block layer kernel-doc headers such that the block layer again builds
      cleanly with W=1.
      Signed-off-by: default avatarBart Van Assche <bart.vanassche@wdc.com>
      Cc: Martin K. Petersen <martin.petersen@oracle.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Hannes Reinecke <hare@suse.de>
      Cc: Johannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      aa98192d
    • Bart Van Assche's avatar
      blk-mq: Fix spelling in a source code comment · ee3e4de5
      Bart Van Assche authored
      Change "nedeing" into "needing" and "caes" into "cases".
      
      Fixes: commit f906a6a0 ("blk-mq: improve tag waiting setup for non-shared tags")
      Signed-off-by: default avatarBart Van Assche <bart.vanassche@wdc.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Omar Sandoval <osandov@fb.com>
      Cc: Hannes Reinecke <hare@suse.de>
      Cc: Johannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      ee3e4de5
    • Jens Axboe's avatar
      blk-mq: silence false positive warnings in hctx_unlock() · 08b5a6e2
      Jens Axboe authored
      In some stupider versions of gcc, it complains:
      
      block/blk-mq.c: In function ‘blk_mq_complete_request’:
      ./include/linux/srcu.h:175:2: warning: ‘srcu_idx’ may be used uninitialized in this function [-Wmaybe-uninitialized]
        __srcu_read_unlock(sp, idx);
        ^
      block/blk-mq.c:620:6: note: ‘srcu_idx’ was declared here
        int srcu_idx;
            ^
      
      which is completely bogus, since we only use srcu_idx when
      hctx->flags & BLK_MQ_F_BLOCKING is set, and that's the case where
      hctx_lock() has initialized it.
      
      Just set it to '0' in the normal path in hctx_lock() to silence
      this annoying warning.
      
      Fixes: 04ced159 ("blk-mq: move hctx lock/unlock into a helper")
      Fixes: 5197c05e ("blk-mq: protect completion path with RCU")
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      08b5a6e2
    • Tejun Heo's avatar
      blk-mq: rename blk_mq_hw_ctx->queue_rq_srcu to ->srcu · 05707b64
      Tejun Heo authored
      The RCU protection has been expanded to cover both queueing and
      completion paths making ->queue_rq_srcu a misnomer.  Rename it to
      ->srcu as suggested by Bart.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Cc: Bart Van Assche <Bart.VanAssche@wdc.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      05707b64
    • Tejun Heo's avatar
      blk-mq: remove REQ_ATOM_STARTED · 5a61c363
      Tejun Heo authored
      After the recent updates to use generation number and state based
      synchronization, we can easily replace REQ_ATOM_STARTED usages by
      adding an extra state to distinguish completed but not yet freed
      state.
      
      Add MQ_RQ_COMPLETE and replace REQ_ATOM_STARTED usages with
      blk_mq_rq_state() tests.  REQ_ATOM_STARTED no longer has any users
      left and is removed.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      5a61c363
    • Tejun Heo's avatar
      blk-mq: remove REQ_ATOM_COMPLETE usages from blk-mq · 634f9e46
      Tejun Heo authored
      After the recent updates to use generation number and state based
      synchronization, blk-mq no longer depends on REQ_ATOM_COMPLETE except
      to avoid firing the same timeout multiple times.
      
      Remove all REQ_ATOM_COMPLETE usages and use a new rq_flags flag
      RQF_MQ_TIMEOUT_EXPIRED to avoid firing the same timeout multiple
      times.  This removes atomic bitops from hot paths too.
      
      v2: Removed blk_clear_rq_complete() from blk_mq_rq_timed_out().
      
      v3: Added RQF_MQ_TIMEOUT_EXPIRED flag.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Cc: "jianchao.wang" <jianchao.w.wang@oracle.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      634f9e46
    • Tejun Heo's avatar
      blk-mq: make blk_abort_request() trigger timeout path · 358f70da
      Tejun Heo authored
      With issue/complete and timeout paths now using the generation number
      and state based synchronization, blk_abort_request() is the only one
      which depends on REQ_ATOM_COMPLETE for arbitrating completion.
      
      There's no reason for blk_abort_request() to be a completely separate
      path.  This patch makes blk_abort_request() piggyback on the timeout
      path instead of trying to terminate the request directly.
      
      This removes the last dependency on REQ_ATOM_COMPLETE in blk-mq.
      
      Note that this makes blk_abort_request() asynchronous - it initiates
      abortion but the actual termination will happen after a short while,
      even when the caller owns the request.  AFAICS, SCSI and ATA should be
      fine with that and I think mtip32xx and dasd should be safe but not
      completely sure.  It'd be great if people who know the drivers take a
      look.
      
      v2: - Add comment explaining the lack of synchronization around
            ->deadline update as requested by Bart.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Cc: Asai Thambi SP <asamymuthupa@micron.com>
      Cc: Stefan Haberland <sth@linux.vnet.ibm.com>
      Cc: Jan Hoeppner <hoeppner@linux.vnet.ibm.com>
      Cc: Bart Van Assche <Bart.VanAssche@wdc.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      358f70da
    • Tejun Heo's avatar
      blk-mq: use blk_mq_rq_state() instead of testing REQ_ATOM_COMPLETE · 67818d25
      Tejun Heo authored
      blk_mq_check_inflight() and blk_mq_poll_hybrid_sleep() test
      REQ_ATOM_COMPLETE to determine the request state.  Both uses are
      speculative and we can test REQ_ATOM_STARTED and blk_mq_rq_state() for
      equivalent results.  Replace the tests.  This will allow removing
      REQ_ATOM_COMPLETE usages from blk-mq.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      67818d25
    • Tejun Heo's avatar
      blk-mq: replace timeout synchronization with a RCU and generation based scheme · 1d9bd516
      Tejun Heo authored
      Currently, blk-mq timeout path synchronizes against the usual
      issue/completion path using a complex scheme involving atomic
      bitflags, REQ_ATOM_*, memory barriers and subtle memory coherence
      rules.  Unfortunately, it contains quite a few holes.
      
      There's a complex dancing around REQ_ATOM_STARTED and
      REQ_ATOM_COMPLETE between issue/completion and timeout paths; however,
      they don't have a synchronization point across request recycle
      instances and it isn't clear what the barriers add.
      blk_mq_check_expired() can easily read STARTED from N-2'th iteration,
      deadline from N-1'th, blk_mark_rq_complete() against Nth instance.
      
      In fact, it's pretty easy to make blk_mq_check_expired() terminate a
      later instance of a request.  If we induce 5 sec delay before
      time_after_eq() test in blk_mq_check_expired(), shorten the timeout to
      2s, and issue back-to-back large IOs, blk-mq starts timing out
      requests spuriously pretty quickly.  Nothing actually timed out.  It
      just made the call on a recycle instance of a request and then
      terminated a later instance long after the original instance finished.
      The scenario isn't theoretical either.
      
      This patch replaces the broken synchronization mechanism with a RCU
      and generation number based one.
      
      1. Each request has a u64 generation + state value, which can be
         updated only by the request owner.  Whenever a request becomes
         in-flight, the generation number gets bumped up too.  This provides
         the basis for the timeout path to distinguish different recycle
         instances of the request.
      
         Also, marking a request in-flight and setting its deadline are
         protected with a seqcount so that the timeout path can fetch both
         values coherently.
      
      2. The timeout path fetches the generation, state and deadline.  If
         the verdict is timeout, it records the generation into a dedicated
         request abortion field and does RCU wait.
      
      3. The completion path is also protected by RCU (from the previous
         patch) and checks whether the current generation number and state
         match the abortion field.  If so, it skips completion.
      
      4. The timeout path, after RCU wait, scans requests again and
         terminates the ones whose generation and state still match the ones
         requested for abortion.
      
         By now, the timeout path knows that either the generation number
         and state changed if it lost the race or the completion will yield
         to it and can safely timeout the request.
      
      While it's more lines of code, it's conceptually simpler, doesn't
      depend on direct use of subtle memory ordering or coherence, and
      hopefully doesn't terminate the wrong instance.
      
      While this change makes REQ_ATOM_COMPLETE synchronization unnecessary
      between issue/complete and timeout paths, REQ_ATOM_COMPLETE isn't
      removed yet as it's still used in other places.  Future patches will
      move all state tracking to the new mechanism and remove all bitops in
      the hot paths.
      
      Note that this patch adds a comment explaining a race condition in
      BLK_EH_RESET_TIMER path.  The race has always been there and this
      patch doesn't change it.  It's just documenting the existing race.
      
      v2: - Fixed BLK_EH_RESET_TIMER handling as pointed out by Jianchao.
          - s/request->gstate_seqc/request->gstate_seq/ as suggested by Peter.
          - READ_ONCE() added in blk_mq_rq_update_state() as suggested by Peter.
      
      v3: - Fixed possible extended seqcount / u64_stats_sync read looping
            spotted by Peter.
          - MQ_RQ_IDLE was incorrectly being set in complete_request instead
            of free_request.  Fixed.
      
      v4: - Rebased on top of hctx_lock() refactoring patch.
          - Added comment explaining the use of hctx_lock() in completion path.
      
      v5: - Added comments requested by Bart.
          - Note the addition of BLK_EH_RESET_TIMER race condition in the
            commit message.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Cc: "jianchao.wang" <jianchao.w.wang@oracle.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Bart Van Assche <Bart.VanAssche@wdc.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      1d9bd516
    • Tejun Heo's avatar
      blk-mq: protect completion path with RCU · 5197c05e
      Tejun Heo authored
      Currently, blk-mq protects only the issue path with RCU.  This patch
      puts the completion path under the same RCU protection.  This will be
      used to synchronize issue/completion against timeout by later patches,
      which will also add the comments.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      5197c05e
    • Jens Axboe's avatar
      blk-mq: move hctx lock/unlock into a helper · 04ced159
      Jens Axboe authored
      Move the RCU vs SRCU logic into lock/unlock helpers, which makes
      the actual functional bits within the locked region much easier
      to read.
      
      tj: Reordered in front of timeout revamp patches and added the missing
          blk_mq_run_hw_queue() conversion.
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      04ced159
    • Paolo Valente's avatar
      block, bfq: release oom-queue ref to root group on exit · 0d52af59
      Paolo Valente authored
      On scheduler init, a reference to the root group, and a reference to
      its corresponding blkg are taken for the oom queue. Yet these
      references are not released on scheduler exit, which prevents these
      objects from be freed. This commit adds the missing reference
      releases.
      Reported-by: default avatarDavide Ferrari <davideferrari8@gmail.com>
      Tested-by: default avatarHolger Hoffstätte <holger@applied-asynchrony.com>
      Signed-off-by: default avatarPaolo Valente <paolo.valente@linaro.org>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      0d52af59