1. 08 Sep, 2020 2 commits
    • James Smart's avatar
      nvme: Revert: Fix controller creation races with teardown flow · b63de840
      James Smart authored
      The indicated patch introduced a barrier in the sysfs_delete attribute
      for the controller that rejects the request if the controller isn't
      created. "Created" is defined as at least 1 call to nvme_start_ctrl().
      
      This is problematic in error-injection testing.  If an error occurs on
      the initial attempt to create an association and the controller enters
      reconnect(s) attempts, the admin cannot delete the controller until
      either there is a successful association created or ctrl_loss_tmo
      times out.
      
      Where this issue is particularly hurtful is when the "admin" is the
      nvme-cli, it is performing a connection to a discovery controller, and
      it is initiated via auto-connect scripts.  With the FC transport, if the
      first connection attempt fails, the controller enters a normal reconnect
      state but returns control to the cli thread that created the controller.
      In this scenario, the cli attempts to read the discovery log via ioctl,
      which fails, causing the cli to see it as an empty log and then proceeds
      to delete the discovery controller. The delete is rejected and the
      controller is left live. If the discovery controller reconnect then
      succeeds, there is no action to delete it, and it sits live doing nothing.
      
      Cc: <stable@vger.kernel.org> # v5.7+
      Fixes: ce151813 ("nvme: Fix controller creation races with teardown flow")
      Signed-off-by: default avatarJames Smart <james.smart@broadcom.com>
      CC: Israel Rukshin <israelr@mellanox.com>
      CC: Max Gurtovoy <maxg@mellanox.com>
      CC: Christoph Hellwig <hch@lst.de>
      CC: Keith Busch <kbusch@kernel.org>
      CC: Sagi Grimberg <sagi@grimberg.me>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      b63de840
    • Christoph Hellwig's avatar
      block: restore a specific error code in bdev_del_partition · 88ce2a53
      Christoph Hellwig authored
      mdadm relies on the fact that deleting an invalid partition returns
      -ENXIO or -ENOTTY to detect if a block device is a partition or a
      whole device.
      
      Fixes: 08fc1ab6 ("block: fix locking in bdev_del_partition")
      Reported-by: default avatarkernel test robot <rong.a.chen@intel.com>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      88ce2a53
  2. 03 Sep, 2020 1 commit
  3. 01 Sep, 2020 5 commits
  4. 29 Aug, 2020 1 commit
    • Jens Axboe's avatar
      Merge branch 'nvme-5.9-rc' of git://git.infradead.org/nvme into block-5.9 · 5d220bcd
      Jens Axboe authored
      Pull NVMe fixes from Sagi:
      
      "- instance leak and io boundary fixes from Keith
       - fc locking fix from Christophe
       - various tcp/rdma reset during traffic fixes from Me
       - pci use-after-free fix from Tong
       - tcp target null deref fix from Ziye"
      
      * 'nvme-5.9-rc' of git://git.infradead.org/nvme:
        nvme-pci: cancel nvme device request before disabling
        nvme: only use power of two io boundaries
        nvme: fix controller instance leak
        nvmet-fc: Fix a missed _irqsave version of spin_lock in 'nvmet_fc_fod_op_done()'
        nvme: Fix NULL dereference for pci nvme controllers
        nvme-rdma: fix reset hang if controller died in the middle of a reset
        nvme-rdma: fix timeout handler
        nvme-rdma: serialize controller teardown sequences
        nvme-tcp: fix reset hang if controller died in the middle of a reset
        nvme-tcp: fix timeout handler
        nvme-tcp: serialize controller teardown sequences
        nvme: have nvme_wait_freeze_timeout return if it timed out
        nvme-fabrics: don't check state NVME_CTRL_NEW for request acceptance
        nvmet-tcp: Fix NULL dereference when a connect data comes in h2cdata pdu
      5d220bcd
  5. 28 Aug, 2020 16 commits
    • Tong Zhang's avatar
      nvme-pci: cancel nvme device request before disabling · 7ad92f65
      Tong Zhang authored
      This patch addresses an irq free warning and null pointer dereference
      error problem when nvme devices got timeout error during initialization.
      This problem happens when nvme_timeout() function is called while
      nvme_reset_work() is still in execution. This patch fixed the problem by
      setting flag of the problematic request to NVME_REQ_CANCELLED before
      calling nvme_dev_disable() to make sure __nvme_submit_sync_cmd() returns
      an error code and let nvme_submit_sync_cmd() fail gracefully.
      The following is console output.
      
      [   62.472097] nvme nvme0: I/O 13 QID 0 timeout, disable controller
      [   62.488796] nvme nvme0: could not set timestamp (881)
      [   62.494888] ------------[ cut here ]------------
      [   62.495142] Trying to free already-free IRQ 11
      [   62.495366] WARNING: CPU: 0 PID: 7 at kernel/irq/manage.c:1751 free_irq+0x1f7/0x370
      [   62.495742] Modules linked in:
      [   62.495902] CPU: 0 PID: 7 Comm: kworker/u4:0 Not tainted 5.8.0+ #8
      [   62.496206] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-48-gd9c812dda519-p4
      [   62.496772] Workqueue: nvme-reset-wq nvme_reset_work
      [   62.497019] RIP: 0010:free_irq+0x1f7/0x370
      [   62.497223] Code: e8 ce 49 11 00 48 83 c4 08 4c 89 e0 5b 5d 41 5c 41 5d 41 5e 41 5f c3 44 89 f6 48 c70
      [   62.498133] RSP: 0000:ffffa96800043d40 EFLAGS: 00010086
      [   62.498391] RAX: 0000000000000000 RBX: ffff9b87fc458400 RCX: 0000000000000000
      [   62.498741] RDX: 0000000000000001 RSI: 0000000000000096 RDI: ffffffff9693d72c
      [   62.499091] RBP: ffff9b87fd4c8f60 R08: ffffa96800043bfd R09: 0000000000000163
      [   62.499440] R10: ffffa96800043bf8 R11: ffffa96800043bfd R12: ffff9b87fd4c8e00
      [   62.499790] R13: ffff9b87fd4c8ea4 R14: 000000000000000b R15: ffff9b87fd76b000
      [   62.500140] FS:  0000000000000000(0000) GS:ffff9b87fdc00000(0000) knlGS:0000000000000000
      [   62.500534] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [   62.500816] CR2: 0000000000000000 CR3: 000000003aa0a000 CR4: 00000000000006f0
      [   62.501165] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [   62.501515] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [   62.501864] Call Trace:
      [   62.501993]  pci_free_irq+0x13/0x20
      [   62.502167]  nvme_reset_work+0x5d0/0x12a0
      [   62.502369]  ? update_load_avg+0x59/0x580
      [   62.502569]  ? ttwu_queue_wakelist+0xa8/0xc0
      [   62.502780]  ? try_to_wake_up+0x1a2/0x450
      [   62.502979]  process_one_work+0x1d2/0x390
      [   62.503179]  worker_thread+0x45/0x3b0
      [   62.503361]  ? process_one_work+0x390/0x390
      [   62.503568]  kthread+0xf9/0x130
      [   62.503726]  ? kthread_park+0x80/0x80
      [   62.503911]  ret_from_fork+0x22/0x30
      [   62.504090] ---[ end trace de9ed4a70f8d71e2 ]---
      [  123.912275] nvme nvme0: I/O 12 QID 0 timeout, disable controller
      [  123.914670] nvme nvme0: 1/0/0 default/read/poll queues
      [  123.916310] BUG: kernel NULL pointer dereference, address: 0000000000000000
      [  123.917469] #PF: supervisor write access in kernel mode
      [  123.917725] #PF: error_code(0x0002) - not-present page
      [  123.917976] PGD 0 P4D 0
      [  123.918109] Oops: 0002 [#1] SMP PTI
      [  123.918283] CPU: 0 PID: 7 Comm: kworker/u4:0 Tainted: G        W         5.8.0+ #8
      [  123.918650] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-48-gd9c812dda519-p4
      [  123.919219] Workqueue: nvme-reset-wq nvme_reset_work
      [  123.919469] RIP: 0010:__blk_mq_alloc_map_and_request+0x21/0x80
      [  123.919757] Code: 66 0f 1f 84 00 00 00 00 00 41 55 41 54 55 48 63 ee 53 48 8b 47 68 89 ee 48 89 fb 8b4
      [  123.920657] RSP: 0000:ffffa96800043d40 EFLAGS: 00010286
      [  123.920912] RAX: ffff9b87fc4fee40 RBX: ffff9b87fc8cb008 RCX: 0000000000000000
      [  123.921258] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff9b87fc618000
      [  123.921602] RBP: 0000000000000000 R08: ffff9b87fdc2c4a0 R09: ffff9b87fc616000
      [  123.921949] R10: 0000000000000000 R11: ffff9b87fffd1500 R12: 0000000000000000
      [  123.922295] R13: 0000000000000000 R14: ffff9b87fc8cb200 R15: ffff9b87fc8cb000
      [  123.922641] FS:  0000000000000000(0000) GS:ffff9b87fdc00000(0000) knlGS:0000000000000000
      [  123.923032] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  123.923312] CR2: 0000000000000000 CR3: 000000003aa0a000 CR4: 00000000000006f0
      [  123.923660] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [  123.924007] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [  123.924353] Call Trace:
      [  123.924479]  blk_mq_alloc_tag_set+0x137/0x2a0
      [  123.924694]  nvme_reset_work+0xed6/0x12a0
      [  123.924898]  process_one_work+0x1d2/0x390
      [  123.925099]  worker_thread+0x45/0x3b0
      [  123.925280]  ? process_one_work+0x390/0x390
      [  123.925486]  kthread+0xf9/0x130
      [  123.925642]  ? kthread_park+0x80/0x80
      [  123.925825]  ret_from_fork+0x22/0x30
      [  123.926004] Modules linked in:
      [  123.926158] CR2: 0000000000000000
      [  123.926322] ---[ end trace de9ed4a70f8d71e3 ]---
      [  123.926549] RIP: 0010:__blk_mq_alloc_map_and_request+0x21/0x80
      [  123.926832] Code: 66 0f 1f 84 00 00 00 00 00 41 55 41 54 55 48 63 ee 53 48 8b 47 68 89 ee 48 89 fb 8b4
      [  123.927734] RSP: 0000:ffffa96800043d40 EFLAGS: 00010286
      [  123.927989] RAX: ffff9b87fc4fee40 RBX: ffff9b87fc8cb008 RCX: 0000000000000000
      [  123.928336] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff9b87fc618000
      [  123.928679] RBP: 0000000000000000 R08: ffff9b87fdc2c4a0 R09: ffff9b87fc616000
      [  123.929025] R10: 0000000000000000 R11: ffff9b87fffd1500 R12: 0000000000000000
      [  123.929370] R13: 0000000000000000 R14: ffff9b87fc8cb200 R15: ffff9b87fc8cb000
      [  123.929715] FS:  0000000000000000(0000) GS:ffff9b87fdc00000(0000) knlGS:0000000000000000
      [  123.930106] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  123.930384] CR2: 0000000000000000 CR3: 000000003aa0a000 CR4: 00000000000006f0
      [  123.930731] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [  123.931077] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      Co-developed-by: default avatarKeith Busch <kbusch@kernel.org>
      Signed-off-by: default avatarTong Zhang <ztong0001@gmail.com>
      Reviewed-by: default avatarKeith Busch <kbusch@kernel.org>
      Signed-off-by: default avatarSagi Grimberg <sagi@grimberg.me>
      7ad92f65
    • Keith Busch's avatar
      nvme: only use power of two io boundaries · e83d776f
      Keith Busch authored
      The kernel requires a power of two for boundaries because that's the
      only way it can efficiently split commands that cross them. A
      controller, however, may report a non-power of two boundary.
      
      The driver had been rounding the controller's value to one the kernel
      can use, but splitting on the wrong boundary provides no benefit on the
      device side, and incurs additional submission overhead from non-optimal
      splits.
      
      Don't provide any boundary hint if the controller's value can't be used
      and log a warning when first scanning a disk's unreported IO boundary.
      Since the chunk sector logic has grown, move it to a separate function.
      
      Cc: Martin K. Petersen <martin.petersen@oracle.com>
      Signed-off-by: default avatarKeith Busch <kbusch@kernel.org>
      Reviewed-by: default avatarMartin K. Petersen <martin.petersen@oracle.com>
      Signed-off-by: default avatarSagi Grimberg <sagi@grimberg.me>
      e83d776f
    • Keith Busch's avatar
      nvme: fix controller instance leak · 192f6c29
      Keith Busch authored
      If the driver has to unbind from the controller for an early failure
      before the subsystem has been set up, there won't be a subsystem holding
      the controller's instance, so the controller needs to free its own
      instance in this case.
      
      Fixes: 733e4b69 ("nvme: Assign subsys instance from first ctrl")
      Signed-off-by: default avatarKeith Busch <kbusch@kernel.org>
      Reviewed-by: default avatarChaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarSagi Grimberg <sagi@grimberg.me>
      192f6c29
    • Christophe JAILLET's avatar
      nvmet-fc: Fix a missed _irqsave version of spin_lock in 'nvmet_fc_fod_op_done()' · 70e37988
      Christophe JAILLET authored
      The way 'spin_lock()' and 'spin_lock_irqsave()' are used is not consistent
      in this function.
      
      Use 'spin_lock_irqsave()' also here, as there is no guarantee that
      interruptions are disabled at that point, according to surrounding code.
      
      Fixes: a97ec51b ("nvmet_fc: Rework target side abort handling")
      Signed-off-by: default avatarChristophe JAILLET <christophe.jaillet@wanadoo.fr>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarSagi Grimberg <sagi@grimberg.me>
      70e37988
    • Sagi Grimberg's avatar
      nvme: Fix NULL dereference for pci nvme controllers · 7cd49f75
      Sagi Grimberg authored
      PCIe controllers do not have fabric opts, verify they exist before
      showing ctrl_loss_tmo or reconnect_delay attributes.
      
      Fixes: 764075fd ("nvme: expose reconnect_delay and ctrl_loss_tmo via sysfs")
      Reported-by: default avatarTobias Markus <tobias@markus-regensburg.de>
      Reviewed-by: default avatarKeith Busch <kbusch@kernel.org>
      Signed-off-by: default avatarSagi Grimberg <sagi@grimberg.me>
      7cd49f75
    • Sagi Grimberg's avatar
      nvme-rdma: fix reset hang if controller died in the middle of a reset · 2362acb6
      Sagi Grimberg authored
      If the controller becomes unresponsive in the middle of a reset, we
      will hang because we are waiting for the freeze to complete, but that
      cannot happen since we have commands that are inflight holding the
      q_usage_counter, and we can't blindly fail requests that times out.
      
      So give a timeout and if we cannot wait for queue freeze before
      unfreezing, fail and have the error handling take care how to
      proceed (either schedule a reconnect of remove the controller).
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarSagi Grimberg <sagi@grimberg.me>
      2362acb6
    • Sagi Grimberg's avatar
      nvme-rdma: fix timeout handler · 0475a8dc
      Sagi Grimberg authored
      When a request times out in a LIVE state, we simply trigger error
      recovery and let the error recovery handle the request cancellation,
      however when a request times out in a non LIVE state, we make sure to
      complete it immediately as it might block controller setup or teardown
      and prevent forward progress.
      
      However tearing down the entire set of I/O and admin queues causes
      freeze/unfreeze imbalance (q->mq_freeze_depth) because and is really
      an overkill to what we actually need, which is to just fence controller
      teardown that may be running, stop the queue, and cancel the request if
      it is not already completed.
      
      Now that we have the controller teardown_lock, we can safely serialize
      request cancellation. This addresses a hang caused by calling extra
      queue freeze on controller namespaces, causing unfreeze to not complete
      correctly.
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarJames Smart <james.smart@broadcom.com>
      Signed-off-by: default avatarSagi Grimberg <sagi@grimberg.me>
      0475a8dc
    • Sagi Grimberg's avatar
      nvme-rdma: serialize controller teardown sequences · 5110f402
      Sagi Grimberg authored
      In the timeout handler we may need to complete a request because the
      request that timed out may be an I/O that is a part of a serial sequence
      of controller teardown or initialization. In order to complete the
      request, we need to fence any other context that may compete with us
      and complete the request that is timing out.
      
      In this case, we could have a potential double completion in case
      a hard-irq or a different competing context triggered error recovery
      and is running inflight request cancellation concurrently with the
      timeout handler.
      
      Protect using a ctrl teardown_lock to serialize contexts that may
      complete a cancelled request due to error recovery or a reset.
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarJames Smart <james.smart@broadcom.com>
      Signed-off-by: default avatarSagi Grimberg <sagi@grimberg.me>
      5110f402
    • Sagi Grimberg's avatar
      nvme-tcp: fix reset hang if controller died in the middle of a reset · e5c01f4f
      Sagi Grimberg authored
      If the controller becomes unresponsive in the middle of a reset, we will
      hang because we are waiting for the freeze to complete, but that cannot
      happen since we have commands that are inflight holding the
      q_usage_counter, and we can't blindly fail requests that times out.
      
      So give a timeout and if we cannot wait for queue freeze before
      unfreezing, fail and have the error handling take care how to proceed
      (either schedule a reconnect of remove the controller).
      Signed-off-by: default avatarSagi Grimberg <sagi@grimberg.me>
      e5c01f4f
    • Sagi Grimberg's avatar
      nvme-tcp: fix timeout handler · 236187c4
      Sagi Grimberg authored
      When a request times out in a LIVE state, we simply trigger error
      recovery and let the error recovery handle the request cancellation,
      however when a request times out in a non LIVE state, we make sure to
      complete it immediately as it might block controller setup or teardown
      and prevent forward progress.
      
      However tearing down the entire set of I/O and admin queues causes
      freeze/unfreeze imbalance (q->mq_freeze_depth) because and is really
      an overkill to what we actually need, which is to just fence controller
      teardown that may be running, stop the queue, and cancel the request if
      it is not already completed.
      
      Now that we have the controller teardown_lock, we can safely serialize
      request cancellation. This addresses a hang caused by calling extra
      queue freeze on controller namespaces, causing unfreeze to not complete
      correctly.
      Signed-off-by: default avatarSagi Grimberg <sagi@grimberg.me>
      236187c4
    • Sagi Grimberg's avatar
      nvme-tcp: serialize controller teardown sequences · d4d61470
      Sagi Grimberg authored
      In the timeout handler we may need to complete a request because the
      request that timed out may be an I/O that is a part of a serial sequence
      of controller teardown or initialization. In order to complete the
      request, we need to fence any other context that may compete with us
      and complete the request that is timing out.
      
      In this case, we could have a potential double completion in case
      a hard-irq or a different competing context triggered error recovery
      and is running inflight request cancellation concurrently with the
      timeout handler.
      
      Protect using a ctrl teardown_lock to serialize contexts that may
      complete a cancelled request due to error recovery or a reset.
      Signed-off-by: default avatarSagi Grimberg <sagi@grimberg.me>
      d4d61470
    • Sagi Grimberg's avatar
      nvme: have nvme_wait_freeze_timeout return if it timed out · 7cf0d7c0
      Sagi Grimberg authored
      Users can detect if the wait has completed or not and take appropriate
      actions based on this information (e.g. weather to continue
      initialization or rather fail and schedule another initialization
      attempt).
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarSagi Grimberg <sagi@grimberg.me>
      7cf0d7c0
    • Sagi Grimberg's avatar
      nvme-fabrics: don't check state NVME_CTRL_NEW for request acceptance · d7144f5c
      Sagi Grimberg authored
      NVME_CTRL_NEW should never see any I/O, because in order to start
      initialization it has to transition to NVME_CTRL_CONNECTING and from
      there it will never return to this state.
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarSagi Grimberg <sagi@grimberg.me>
      d7144f5c
    • Ziye Yang's avatar
      nvmet-tcp: Fix NULL dereference when a connect data comes in h2cdata pdu · a6ce7d7b
      Ziye Yang authored
      When handling commands without in-capsule data, we assign the ttag
      assuming we already have the queue commands array allocated (based
      on the queue size information in the connect data payload). However
      if the connect itself did not send the connect data in-capsule we
      have yet to allocate the queue commands,and we will assign a bogus
      ttag and suffer a NULL dereference when we receive the corresponding
      h2cdata pdu.
      
      Fix this by checking if we already allocated commands before
      dereferencing it when handling h2cdata, if we didn't, its for sure a
      connect and we should use the preallocated connect command.
      Signed-off-by: default avatarZiye Yang <ziye.yang@intel.com>
      Signed-off-by: default avatarSagi Grimberg <sagi@grimberg.me>
      a6ce7d7b
    • Jens Axboe's avatar
      Merge branch 'md-fixes' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md into block-5.9 · a433d721
      Jens Axboe authored
      Pull MD fix from Song.
      
      * 'md-fixes' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md:
        md/raid5: make sure stripe_size as power of two
      a433d721
    • Yufen Yu's avatar
      md/raid5: make sure stripe_size as power of two · 6af10a33
      Yufen Yu authored
      Commit 3b5408b9 ("md/raid5: support config stripe_size by sysfs
      entry") make stripe_size as a configurable value. It just requires
      stripe_size as multiple of 4KB.
      
      In fact, we should make sure stripe_size as power of two. Otherwise,
      stripe_shift which is the result of ilog2 can not represent the real
      stripe_size. Then, stripe_hash() and stripe_hash_locks_hash() may
      get unexpected value.
      
      Fixes: 3b5408b9 ("md/raid5: support config stripe_size by sysfs entry")
      Signed-off-by: default avatarYufen Yu <yuyufen@huawei.com>
      Signed-off-by: default avatarSong Liu <songliubraving@fb.com>
      6af10a33
  6. 26 Aug, 2020 2 commits
  7. 21 Aug, 2020 13 commits
    • Hou Pu's avatar
      null_blk: fix passing of REQ_FUA flag in null_handle_rq · 2d62e6b0
      Hou Pu authored
      REQ_FUA should be checked using rq->cmd_flags instead of req_op().
      
      Fixes: deb78b41 ("nullb: emulate cache")
      Signed-off-by: default avatarHou Pu <houpu@bytedance.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      2d62e6b0
    • Amit Engel's avatar
      nvmet: Disable keep-alive timer when kato is cleared to 0h · 0d3b6a8d
      Amit Engel authored
      Based on nvme spec, when keep alive timeout is set to zero
      the keep-alive timer should be disabled.
      Signed-off-by: default avatarAmit Engel <amit.engel@dell.com>
      Signed-off-by: default avatarSagi Grimberg <sagi@grimberg.me>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      0d3b6a8d
    • Chao Leng's avatar
      nvme: redirect commands on dying queue · 5eac5f33
      Chao Leng authored
      If a command send through nvme-multipath failed on a dying queue, resend it
      on another path.
      Signed-off-by: default avatarChao Leng <lengchao@huawei.com>
      [hch: rebased on top of the completion refactoring]
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarSagi Grimberg <sagi@grimberg.me>
      Reviewed-by: default avatarMike Snitzer <snitzer@redhat.com>
      Signed-off-by: default avatarSagi Grimberg <sagi@grimberg.me>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      5eac5f33
    • Christoph Hellwig's avatar
      nvme: just check the status code type in nvme_is_path_error · 1e41f3bd
      Christoph Hellwig authored
      Check the SCT sub-field for a path related status instead of enumerating
      invididual status code.  As of NVMe 1.4 this adds "Internal Path Error"
      and "Controller Pathing Error" to the list, but it also future proofs for
      additional status codes added to the category.
      Suggested-by: default avatarChao Leng <lengchao@huawei.com>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarMike Snitzer <snitzer@redhat.com>
      Signed-off-by: default avatarSagi Grimberg <sagi@grimberg.me>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      1e41f3bd
    • Christoph Hellwig's avatar
      nvme: refactor command completion · 5ddaabe8
      Christoph Hellwig authored
      Lift all the code to decide the dispostition of a completed command
      from nvme_complete_rq and nvme_failover_req into a new helper, which
      returns an emum of the potential actions.  nvme_complete_rq then
      just switches on those and calls the proper helper for the action.
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarMike Snitzer <snitzer@redhat.com>
      Signed-off-by: default avatarSagi Grimberg <sagi@grimberg.me>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      5ddaabe8
    • Christoph Hellwig's avatar
      nvme: rename and document nvme_end_request · 2eb81a33
      Christoph Hellwig authored
      nvme_end_request is a bit misnamed, as it wraps around the
      blk_mq_complete_* API.  It's semantics also are non-trivial, so give it
      a more descriptive name and add a comment explaining the semantics.
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarSagi Grimberg <sagi@grimberg.me>
      Reviewed-by: default avatarMike Snitzer <snitzer@redhat.com>
      Signed-off-by: default avatarSagi Grimberg <sagi@grimberg.me>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      2eb81a33
    • Keith Busch's avatar
      nvme: skip noiob for zoned devices · c41ad98b
      Keith Busch authored
      Zoned block devices reuse the chunk_sectors queue limit to define zone
      boundaries. If a such a device happens to also report an optimal
      boundary, do not use that to define the chunk_sectors as that may
      intermittently interfere with io splitting and zone size queries.
      Signed-off-by: default avatarKeith Busch <kbusch@kernel.org>
      Signed-off-by: default avatarSagi Grimberg <sagi@grimberg.me>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      c41ad98b
    • Christoph Hellwig's avatar
      nvme-pci: fix PRP pool size · c61b82c7
      Christoph Hellwig authored
      All operations are based on the controller, not the host page size.
      Switch the dma pool to use the controller page size as well to avoid
      massive overallocations on large page size systems.
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarKeith Busch <kbusch@kernel.org>
      Signed-off-by: default avatarSagi Grimberg <sagi@grimberg.me>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      c61b82c7
    • John Garry's avatar
      nvme-pci: Use u32 for nvme_dev.q_depth and nvme_queue.q_depth · 7442ddce
      John Garry authored
      Recently nvme_dev.q_depth was changed from an int to u16 type.
      
      This falls over for the queue depth calculation in nvme_pci_enable(),
      where NVME_CAP_MQES(dev->ctrl.cap) + 1 may overflow as a u16, as
      NVME_CAP_MQES() is a 16b number also. That happens for me, and this is the
      result:
      
      root@ubuntu:/home/john# [148.272996] Unable to handle kernel NULL pointer
      dereference at virtual address 0000000000000010
      Mem abort info:
      ESR = 0x96000004
      EC = 0x25: DABT (current EL), IL = 32 bits
      SET = 0, FnV = 0
      EA = 0, S1PTW = 0
      Data abort info:
      ISV = 0, ISS = 0x00000004
      CM = 0, WnR = 0
      user pgtable: 4k pages, 48-bit VAs, pgdp=00000a27bf3c9000
      [0000000000000010] pgd=0000000000000000, p4d=0000000000000000
      Internal error: Oops: 96000004 [#1] PREEMPT SMP
      Modules linked in: nvme nvme_core
      CPU: 56 PID: 256 Comm: kworker/u195:0 Not tainted
      5.8.0-next-20200812 #27
      Hardware name: Huawei D06 /D06, BIOS Hisilicon D06 UEFI RC0 -
      V1.16.01 03/15/2019
      Workqueue: nvme-reset-wq nvme_reset_work [nvme]
      pstate: 80c00009 (Nzcv daif +PAN +UAO BTYPE=--)
      pc : __sg_alloc_table_from_pages+0xec/0x238
      lr : __sg_alloc_table_from_pages+0xc8/0x238
      sp : ffff800013ccbad0
      x29: ffff800013ccbad0 x28: ffff0a27b3d380a8
      x27: 0000000000000000 x26: 0000000000002dc2
      x25: 0000000000000dc0 x24: 0000000000000000
      x23: 0000000000000000 x22: ffff800013ccbbe8
      x21: 0000000000000010 x20: 0000000000000000
      x19: 00000000fffff000 x18: ffffffffffffffff
      x17: 00000000000000c0 x16: fffffe289eaf6380
      x15: ffff800011b59948 x14: ffff002bc8fe98f8
      x13: ff00000000000000 x12: ffff8000114ca000
      x11: 0000000000000000 x10: ffffffffffffffff
      x9 : ffffffffffffffc0 x8 : ffff0a27b5f9b6a0
      x7 : 0000000000000000 x6 : 0000000000000001
      x5 : ffff0a27b5f9b680 x4 : 0000000000000000
      x3 : ffff0a27b5f9b680 x2 : 0000000000000000
       x1 : 0000000000000001 x0 : 0000000000000000
       Call trace:
      __sg_alloc_table_from_pages+0xec/0x238
      sg_alloc_table_from_pages+0x18/0x28
      iommu_dma_alloc+0x474/0x678
      dma_alloc_attrs+0xd8/0xf0
      nvme_alloc_queue+0x114/0x160 [nvme]
      nvme_reset_work+0xb34/0x14b4 [nvme]
      process_one_work+0x1e8/0x360
      worker_thread+0x44/0x478
      kthread+0x150/0x158
      ret_from_fork+0x10/0x34
       Code: f94002c3 6b01017f 540007c2 11000486 (f8645aa5)
      ---[ end trace 89bb2b72d59bf925 ]---
      
      Fix by making onto a u32.
      
      Also use u32 for nvme_dev.q_depth, as we assign this value from
      nvme_dev.q_depth, and nvme_dev.q_depth will possibly hold 65536 - this
      avoids the same crash as above.
      
      Fixes: 61f3b896 ("nvme-pci: use unsigned for io queue depth")
      Signed-off-by: default avatarJohn Garry <john.garry@huawei.com>
      Reviewed-by: default avatarKeith Busch <kbusch@kernel.org>
      Signed-off-by: default avatarSagi Grimberg <sagi@grimberg.me>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      7442ddce
    • Logan Gunthorpe's avatar
      nvme: Use spin_lock_irq() when taking the ctrl->lock · ecbcdf0c
      Logan Gunthorpe authored
      When locking the ctrl->lock spinlock IRQs need to be disabled to avoid a
      dead lock. The new spin_lock() calls recently added produce the
      following lockdep warning when running the blktest nvme/003:
      
          ================================
          WARNING: inconsistent lock state
          --------------------------------
          inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.
          ksoftirqd/2/22 [HC0[0]:SC1[1]:HE0:SE0] takes:
          ffff888276a8c4c0 (&ctrl->lock){+.?.}-{2:2}, at: nvme_keep_alive_end_io+0x50/0xc0
          {SOFTIRQ-ON-W} state was registered at:
            lock_acquire+0x164/0x500
            _raw_spin_lock+0x28/0x40
            nvme_get_effects_log+0x37/0x1c0
            nvme_init_identify+0x9e4/0x14f0
            nvme_reset_work+0xadd/0x2360
            process_one_work+0x66b/0xb70
            worker_thread+0x6e/0x6c0
            kthread+0x1e7/0x210
            ret_from_fork+0x22/0x30
          irq event stamp: 1449221
          hardirqs last  enabled at (1449220): [<ffffffff81c58e69>] ktime_get+0xf9/0x140
          hardirqs last disabled at (1449221): [<ffffffff83129665>] _raw_spin_lock_irqsave+0x25/0x60
          softirqs last  enabled at (1449210): [<ffffffff83400447>] __do_softirq+0x447/0x595
          softirqs last disabled at (1449215): [<ffffffff81b489b5>] run_ksoftirqd+0x35/0x50
      
          other info that might help us debug this:
           Possible unsafe locking scenario:
      
                 CPU0
                 ----
            lock(&ctrl->lock);
            <Interrupt>
              lock(&ctrl->lock);
      
           *** DEADLOCK ***
      
          no locks held by ksoftirqd/2/22.
      
          stack backtrace:
          CPU: 2 PID: 22 Comm: ksoftirqd/2 Not tainted 5.8.0-rc4-eid-vmlocalyes-dbg-00157-g7236657c #1450
          Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.12.0-1 04/01/2014
          Call Trace:
           dump_stack+0xc8/0x11a
           print_usage_bug.cold.63+0x235/0x23e
           mark_lock+0xa9c/0xcf0
           __lock_acquire+0xd9a/0x2b50
           lock_acquire+0x164/0x500
           _raw_spin_lock_irqsave+0x40/0x60
           nvme_keep_alive_end_io+0x50/0xc0
           blk_mq_end_request+0x158/0x210
           nvme_complete_rq+0x146/0x500
           nvme_loop_complete_rq+0x26/0x30 [nvme_loop]
           blk_done_softirq+0x187/0x1e0
           __do_softirq+0x118/0x595
           run_ksoftirqd+0x35/0x50
           smpboot_thread_fn+0x1d3/0x310
           kthread+0x1e7/0x210
           ret_from_fork+0x22/0x30
      
      Fixes: be93e87e ("nvme: support for multiple Command Sets Supported and Effects log pages")
      Signed-off-by: default avatarLogan Gunthorpe <logang@deltatee.com>
      Reviewed-by: default avatarKeith Busch <kbusch@kernel.org>
      Tested-by: default avatarChaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
      Reviewed-by: default avatarChaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
      Signed-off-by: default avatarSagi Grimberg <sagi@grimberg.me>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      ecbcdf0c
    • Chaitanya Kulkarni's avatar
      nvmet: call blk_mq_free_request() directly · 7ee51cf6
      Chaitanya Kulkarni authored
      Instead of calling blk_put_request() which calls blk_mq_free_request(),
      call blk_mq_free_request() directly for NVMeOF passthru. This is to
      mainly avoid an extra function call in the completion path
      nvmet_passthru_req_done().
      Signed-off-by: default avatarChaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
      Reviewed-by: default avatarKeith Busch <kbusch@kernel.org>
      Reviewed-by: default avatarLogan Gunthorpe <logang@deltatee.com>
      Signed-off-by: default avatarSagi Grimberg <sagi@grimberg.me>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      7ee51cf6
    • Chaitanya Kulkarni's avatar
      nvmet: fix oops in pt cmd execution · a2138fd4
      Chaitanya Kulkarni authored
      In the existing NVMeOF Passthru core command handling on failure of
      nvme_alloc_request() it errors out with rq value set to NULL. In the
      error handling path it calls blk_put_request() without checking if
      rq is set to NULL or not which produces following Oops:-
      
      [ 1457.346861] BUG: kernel NULL pointer dereference, address: 0000000000000000
      [ 1457.347838] #PF: supervisor read access in kernel mode
      [ 1457.348464] #PF: error_code(0x0000) - not-present page
      [ 1457.349085] PGD 0 P4D 0
      [ 1457.349402] Oops: 0000 [#1] SMP NOPTI
      [ 1457.349851] CPU: 18 PID: 10782 Comm: kworker/18:2 Tainted: G           OE     5.8.0-rc4nvme-5.9+ #35
      [ 1457.350951] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e3214
      [ 1457.352347] Workqueue: events nvme_loop_execute_work [nvme_loop]
      [ 1457.353062] RIP: 0010:blk_mq_free_request+0xe/0x110
      [ 1457.353651] Code: 3f ff ff ff 83 f8 01 75 0d 4c 89 e7 e8 1b db ff ff e9 2d ff ff ff 0f 0b eb ef 66 8
      [ 1457.355975] RSP: 0018:ffffc900035b7de0 EFLAGS: 00010282
      [ 1457.356636] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000002
      [ 1457.357526] RDX: ffffffffa060bd05 RSI: 0000000000000000 RDI: 0000000000000000
      [ 1457.358416] RBP: 0000000000000037 R08: 0000000000000000 R09: 0000000000000000
      [ 1457.359317] R10: 0000000000000000 R11: 000000000000006d R12: 0000000000000000
      [ 1457.360424] R13: ffff8887ffa68600 R14: 0000000000000000 R15: ffff8888150564c8
      [ 1457.361322] FS:  0000000000000000(0000) GS:ffff888814600000(0000) knlGS:0000000000000000
      [ 1457.362337] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [ 1457.363058] CR2: 0000000000000000 CR3: 000000081c0ac000 CR4: 00000000003406e0
      [ 1457.363973] Call Trace:
      [ 1457.364296]  nvmet_passthru_execute_cmd+0x150/0x2c0 [nvmet]
      [ 1457.364990]  process_one_work+0x24e/0x5a0
      [ 1457.365493]  ? __schedule+0x353/0x840
      [ 1457.365957]  worker_thread+0x3c/0x380
      [ 1457.366426]  ? process_one_work+0x5a0/0x5a0
      [ 1457.366948]  kthread+0x135/0x150
      [ 1457.367362]  ? kthread_create_on_node+0x60/0x60
      [ 1457.367934]  ret_from_fork+0x22/0x30
      [ 1457.368388] Modules linked in: nvme_loop(OE) nvmet(OE) nvme_fabrics(OE) null_blk nvme(OE) nvme_corer
      [ 1457.368414]  ata_piix crc32c_intel virtio_pci libata virtio_ring serio_raw t10_pi virtio floppy dm_]
      [ 1457.380849] CR2: 0000000000000000
      [ 1457.381288] ---[ end trace c6cab61bfd1f68fd ]---
      [ 1457.381861] RIP: 0010:blk_mq_free_request+0xe/0x110
      [ 1457.382469] Code: 3f ff ff ff 83 f8 01 75 0d 4c 89 e7 e8 1b db ff ff e9 2d ff ff ff 0f 0b eb ef 66 8
      [ 1457.384749] RSP: 0018:ffffc900035b7de0 EFLAGS: 00010282
      [ 1457.385393] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000002
      [ 1457.386264] RDX: ffffffffa060bd05 RSI: 0000000000000000 RDI: 0000000000000000
      [ 1457.387142] RBP: 0000000000000037 R08: 0000000000000000 R09: 0000000000000000
      [ 1457.388029] R10: 0000000000000000 R11: 000000000000006d R12: 0000000000000000
      [ 1457.388914] R13: ffff8887ffa68600 R14: 0000000000000000 R15: ffff8888150564c8
      [ 1457.389798] FS:  0000000000000000(0000) GS:ffff888814600000(0000) knlGS:0000000000000000
      [ 1457.390796] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [ 1457.391508] CR2: 0000000000000000 CR3: 000000081c0ac000 CR4: 00000000003406e0
      [ 1457.392525] Kernel panic - not syncing: Fatal exception
      [ 1457.394138] Kernel Offset: disabled
      [ 1457.394677] ---[ end Kernel panic - not syncing: Fatal exception ]---
      
      We fix this Oops by adding a new goto label out_put_req and reordering
      the blk_put_request call to avoid calling blk_put_request() with rq
      value is set to NULL. Here we also update the rest of the code
      accordingly.
      
      Fixes: 06b7164dfdc0 ("nvmet: add passthru code to process commands")
      Signed-off-by: default avatarChaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
      Reviewed-by: default avatarLogan Gunthorpe <logang@deltatee.com>
      Signed-off-by: default avatarSagi Grimberg <sagi@grimberg.me>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      a2138fd4
    • Chaitanya Kulkarni's avatar
      nvmet: add ns tear down label for pt-cmd handling · 4db69a3d
      Chaitanya Kulkarni authored
      In the current implementation before submitting the passthru cmd we
      may come across error e.g. getting ns from passthru controller,
      allocating a request from passthru controller, etc. For all the failure
      cases it only uses single goto label fail_out.
      
      In the target code, we follow the pattern to have a separate label for
      each error out the case when setting up multiple things before the actual
      action.
      
      This patch follows the same pattern and renames generic fail_out label
      to out_put_ns and updates the error out cases in the
      nvmet_passthru_execute_cmd() where it is needed.
      Signed-off-by: default avatarChaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
      Reviewed-by: default avatarLogan Gunthorpe <logang@deltatee.com>
      Signed-off-by: default avatarSagi Grimberg <sagi@grimberg.me>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      4db69a3d