Commits · f4524cc45626e16264aabb930d0635eff19c7f73 · nexedi / linux

13 May, 2019 2 commits

nvme-pci: add known admin effects to augment admin effects log page · f4524cc4

Maxim Levitsky authored May 02, 2019

Add the known admin effects even if the hardware provides an admin
effects log page, since hardware can never be trusted to report sane
values (my Intel DC P3700, for example, reports no side effects for
namespace format).
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
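
The idea of augmenting the reported effects can be sketched as a toy userspace model. This is not the driver code: the opcode values, bit positions, and the function names known_admin_effects() and command_effects() are assumptions chosen for illustration.

```python
from typing import Dict, Optional

# Illustrative opcode values and effect bits (assumptions, not kernel headers).
FORMAT_NVM = 0x80
SANITIZE = 0x84

CSUPP = 1 << 0          # command supported
LBCC = 1 << 1           # logical block content change
CSE_MASK = 0x3 << 16    # command submission/execution restrictions


def known_admin_effects(opcode: int) -> int:
    """Effects the driver assumes no matter what the device reports."""
    if opcode == FORMAT_NVM:
        return CSUPP | LBCC | CSE_MASK
    if opcode == SANITIZE:
        return CSE_MASK
    return 0


def command_effects(opcode: int, effects_log: Optional[Dict[int, int]]) -> int:
    """OR the device-reported effects (if any) with the known ones."""
    reported = effects_log.get(opcode, 0) if effects_log is not None else 0
    return reported | known_admin_effects(opcode)


# A device that claims Format NVM is supported but has no side effects.
log = {FORMAT_NVM: CSUPP}
effects = command_effects(FORMAT_NVM, log)
```

With this model the format command carries the content-change bit even though the hypothetical device log omitted it.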


nvme-pci: init shadow doorbell after each reset · e8fd41bb

Maxim Levitsky authored May 02, 2019

The spec states:

  "The settings are not retained across a Controller Level Reset"

Therefore the driver must enable the shadow doorbell after each reset.

This was caught while testing the nvme driver over the upcoming
nvme-mdev device.
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Minwoo Im <minwoo.im@samsung.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>


09 May, 2019 3 commits

brd: add cond_resched to brd_free_pages · 936b33f7

Mikulas Patocka authored May 09, 2019

The loop that frees all the pages can take an unbounded amount of time,
so add cond_resched() to it.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


sata_rcar: Remove ata_host_alloc() error printing · cf12c672

Geert Uytterhoeven authored Apr 29, 2019

ata_host_alloc() can only fail due to memory allocation failures.
Hence there is no need to print a message, as the memory allocation core
code already takes care of that.
Reviewed-by: Simon Horman <horms+renesas@verge.net.au>
Reviewed-by: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com>
Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


s390/dasd: fix build warning in dasd_eckd_build_cp_raw · e78c21d1

Ming Lei authored May 09, 2019

Commit 72deb455 ("block: remove CONFIG_LBDAF") changes
sector_t to u64 unconditionally, so use '%llu' to print
sector_t variables.

Fixes: 72deb455 ("block: remove CONFIG_LBDAF")
Cc: linux-s390@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


06 May, 2019 26 commits

lightnvm: pblk: use nvm_rq_to_ppa_list() · 45c5fcbb

Igor Konopko authored May 04, 2019

This patch replaces the few remaining usages of rqd->ppa_list[] with
the existing nvm_rq_to_ppa_list() helper. This is needed for theoretical
devices with ws_min/ws_opt equal to 1.
Signed-off-by: Igor Konopko <igor.j.konopko@intel.com>
Reviewed-by: Javier González <javier@javigon.com>
Signed-off-by: Matias Bjørling <mb@lightnvm.io>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


lightnvm: pblk: simplify partial read path · a96de64a

Igor Konopko authored May 04, 2019

This patch changes the approach to handling the partial read path.

In the old approach, merging data from the ring buffer and the drive was
done entirely by the driver. This had some disadvantages: the code was
complex and relied on bio internals, so it was hard to maintain and was
strongly dependent on bio changes.

In the new approach, most of the handling is done by block layer
functions such as bio_split(), bio_chain() and generic_make_request(),
and it is generally less complex and easier to maintain. Some more
details of the new approach follow.

When a read bio arrives, it is cloned for pblk internal purposes. All
the L2P mapping, which includes copying data from the ring buffer to the
bio and thus bio_advance() calls, is done on the cloned bio, so the
original bio is untouched. If we find that we have a partial read case,
the original bio is still untouched, so we can split it and continue to
process only the first part of it in the current context, while the rest
is issued as a separate bio request which is passed to
generic_make_request() for further processing.
Signed-off-by: Igor Konopko <igor.j.konopko@intel.com>
Reviewed-by: Heiner Litz <hlitz@ucsc.edu>
Reviewed-by: Javier González <javier@javigon.com>
Signed-off-by: Matias Bjørling <mb@lightnvm.io>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


lightnvm: do not remove instance under global lock · 843f2edb

Igor Konopko authored May 04, 2019

Currently all the target instances are removed under the global nvm_lock.
This was needed to ensure that the nvm_dev struct will not be freed by a
hot unplug event during target removal. However, the current
implementation has some drawbacks: since the same lock is used when a
new nvme subsystem is registered, a long target removal on drive A can
make registration (and listing in the OS) of drive B take a lot of time,
since it will wait for that lock.

Now that we have a kref which ensures that nvm_dev will not be freed in
the meantime, we can easily drop this lock while we are removing nvm
targets.
Signed-off-by: Igor Konopko <igor.j.konopko@intel.com>
Reviewed-by: Javier González <javier@javigon.com>
Signed-off-by: Matias Bjørling <mb@lightnvm.io>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


lightnvm: track inflight target creations · e69397ea

Igor Konopko authored May 04, 2019

While the creation process is still in progress, the target is not yet
on the targets list. This makes it possible to remove the whole lightnvm
subsystem by calling nvm_unregister() in the meantime, which finally
causes a kernel panic inside the target init function.

This patch changes the behaviour by adding a kref variable which tracks
all the users of the nvm_dev structure. When nvm_dev is allocated, the
kref value is set to 1. Then the value is increased before every target
creation and decreased after target removal. The extra reference
is dropped when the nvm subsystem is unregistered.
Signed-off-by: Igor Konopko <igor.j.konopko@intel.com>
Reviewed-by: Javier González <javier@javigon.com>
Signed-off-by: Matias Bjørling <mb@lightnvm.io>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
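
The reference counting scheme described above can be modelled with a small toy class. This is a sketch, not the lightnvm code: NvmDev, create_target(), remove_target() and unregister() are hypothetical stand-ins, and a plain integer replaces the kernel's kref.

```python
class NvmDev:
    """Toy stand-in for nvm_dev with a kref-style reference count."""

    def __init__(self):
        self.refs = 1          # initial reference, owned by the subsystem
        self.freed = False

    def get(self):
        assert not self.freed, "use after free"
        self.refs += 1

    def put(self):
        self.refs -= 1
        if self.refs == 0:
            self.freed = True  # release callback: the structure goes away


def create_target(dev):
    dev.get()                  # taken before target creation starts


def remove_target(dev):
    dev.put()                  # dropped after the target is removed


def unregister(dev):
    dev.put()                  # drops the initial reference


dev = NvmDev()
create_target(dev)             # creation is still in flight
unregister(dev)                # hot unplug races with the creation
freed_while_in_use = dev.freed
remove_target(dev)             # the last user goes away
```

In this model the device survives the unregister call because the in-flight target still holds a reference, and it is freed only when that target is removed.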


lightnvm: pblk: recover only written metadata · a24eab59

Igor Konopko authored May 04, 2019

This patch ensures that smeta was fully written before even
trying to read it based on chunk table state and write pointer.
Signed-off-by: Igor Konopko <igor.j.konopko@intel.com>
Signed-off-by: Matias Bjørling <mb@lightnvm.io>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


lightnvm: pblk: IO path reorganization · 3e03f632

Igor Konopko authored May 04, 2019

This patch prepares the read path for the new approach to partial read
handling, which is simpler than the previous one.

The most important change is moving the handling of completed and
failed bios from pblk_make_rq() to the particular read and write
functions. This is needed because, after the partial read path changes,
the completed/failed bio will sometimes be different from the original
one, so we can no longer do this in pblk_make_rq().

The other changes are a small read path refactor in order to reduce the
size of the following patch with the partial read changes.

Generally the goal of this patch is not to change the functionality,
but just to prepare the code for the following changes.
Signed-off-by: Igor Konopko <igor.j.konopko@intel.com>
Reviewed-by: Javier González <javier@javigon.com>
Signed-off-by: Matias Bjørling <mb@lightnvm.io>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


lightnvm: pblk: GC error handling · f2e02457

Igor Konopko authored May 04, 2019

Currently when there is an IO error (or similar) on the GC read path,
pblk still moves the line that was under GC to the free state. Such
behaviour can lead to silent data mismatch issues.

With this patch, a line under GC on which some IO errors occurred will
be put back into the closed state (instead of the free state, as it was
without this patch), and the L2P mapping for the failed sectors will not
be updated.

Then, in case of any user IOs to such failed sectors, pblk will at least
be able to return a real IO error instead of stale data, as it does
right now.
Signed-off-by: Igor Konopko <igor.j.konopko@intel.com>
Reviewed-by: Javier González <javier@javigon.com>
Reviewed-by: Hans Holmberg <hans.holmberg@cnexlabs.com>
Signed-off-by: Matias Bjørling <mb@lightnvm.io>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
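
The intended behaviour can be illustrated with a toy garbage collector. This is a sketch under stated assumptions, not pblk code: gc_line(), ReadError, and the dictionaries used for the media and the L2P table are all hypothetical.

```python
class ReadError(Exception):
    """Hypothetical stand-in for an IO error on the GC read path."""


def gc_line(line_map, l2p, read, write):
    """Move valid sectors out of a line and return the line's new state."""
    failed = False
    for lba, ppa in line_map.items():
        try:
            data = read(ppa)
        except ReadError:
            failed = True      # leave l2p[lba] pointing at the old sector
            continue
        l2p[lba] = write(data)
    # A line with failed sectors goes back to closed instead of free.
    return "closed" if failed else "free"


media = {10: b"a", 11: b"b"}
bad_ppas = {11}
next_ppa = [100]


def read(ppa):
    if ppa in bad_ppas:
        raise ReadError(ppa)
    return media[ppa]


def write(data):
    ppa = next_ppa[0]
    next_ppa[0] += 1
    media[ppa] = data
    return ppa


l2p = {0: 10, 1: 11}
state = gc_line({0: 10, 1: 11}, l2p, read, write)
```

Here the readable sector is remapped, while the failed one keeps its old mapping, so a later user read of it would hit the real error rather than stale data.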


lightnvm: pblk: remove internal IO timeout · 32ac0fa3

Igor Konopko authored May 04, 2019

Currently during pblk padding, an internal IO timeout is used which is
smaller than the default NVMe timeout. This can lead to various
use-after-free issues. Since in case of any IO timeout the NVMe and
block layers will handle the timeout by themselves and report it back to
us, there is no need to keep this internal timeout in pblk.
Signed-off-by: Igor Konopko <igor.j.konopko@intel.com>
Signed-off-by: Matias Bjørling <mb@lightnvm.io>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


lightnvm: pblk: wait for inflight IOs in recovery · 1fc3b305

Igor Konopko authored May 04, 2019

This patch changes the behaviour of recovery padding in order to
support the case where some IOs were already submitted to the drive and
some subsequent ones are not submitted because an error was returned.

Currently in case of errors we simply exit the pad function without
waiting for inflight IOs, which leads to a panic on inflight IO
completion.

After the changes we always wait for all the inflight IOs before
exiting the function.
Signed-off-by: Igor Konopko <igor.j.konopko@intel.com>
Signed-off-by: Matias Bjørling <mb@lightnvm.io>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
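
The "always drain before returning" pattern can be sketched in userspace with a thread pool standing in for the drive. The names pad_line() and submit_fails_at are hypothetical and exist only for this illustration.

```python
from concurrent.futures import ThreadPoolExecutor, wait


def pad_line(pool, n_ios, submit_fails_at, completed):
    """Submit padding IOs; on a submit error, still wait for inflight ones."""
    inflight = []
    err = None
    for i in range(n_ios):
        if i == submit_fails_at:
            err = IOError("submit failed for IO %d" % i)
            break              # stop submitting, but do not return yet
        inflight.append(pool.submit(completed.append, i))
    wait(inflight)             # drain everything that was already submitted
    return err


completed = []
with ThreadPoolExecutor(max_workers=2) as pool:
    err = pad_line(pool, 5, 3, completed)
    done_at_return = sorted(completed)
```

When pad_line() returns, every IO submitted before the failure has completed, so no completion can run against state that the caller has already torn down.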


lightnvm: pblk: propagate errors when reading meta · d165a7a6

Igor Konopko authored May 04, 2019

Read errors are not correctly propagated. Errors are cleared before
returning control to the io submitter. Change the behaviour such that
all read errors except the high ecc read warning status are returned
appropriately.
Signed-off-by: Igor Konopko <igor.j.konopko@intel.com>
Reviewed-by: Javier González <javier@javigon.com>
Reviewed-by: Hans Holmberg <hans.holmberg@cnexlabs.com>
Signed-off-by: Matias Bjørling <mb@lightnvm.io>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


lightnvm: pblk: fix update line wp in OOB recovery · 2b0ae81e

Igor Konopko authored May 04, 2019

In case of OOB recovery, we can hit the scenario where all the data in
the line was written and some part of emeta was written too. In such
a case the pblk_update_line_wp() function will call the pblk_alloc_page()
function, which will cause left_msecs to be set to a value below zero
(since this field does not track the emeta region) and thus lead to
multiple kernel warnings. This patch fixes that issue.
Signed-off-by: Igor Konopko <igor.j.konopko@intel.com>
Reviewed-by: Javier González <javier@javigon.com>
Signed-off-by: Matias Bjørling <mb@lightnvm.io>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


lightnvm: pblk: kick writer on write recovery path · 74a37fbb

Igor Konopko authored May 04, 2019

On the write recovery path, there is a chance that the writer thread
is not active, so kick it immediately instead of waiting for the timer.
Signed-off-by: Igor Konopko <igor.j.konopko@intel.com>
Reviewed-by: Javier González <javier@javigon.com>
Reviewed-by: Hans Holmberg <hans.holmberg@cnexlabs.com>
Signed-off-by: Matias Bjørling <mb@lightnvm.io>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


lightnvm: pblk: fix lock order in pblk_rb_tear_down_check · 486b5aac

Igor Konopko authored May 04, 2019

In pblk_rb_tear_down_check() the spinlock functions are not
called in proper order.

Fixes: a4bd217b ("lightnvm: physical block device (pblk) target")
Signed-off-by: Igor Konopko <igor.j.konopko@intel.com>
Reviewed-by: Javier González <javier@javigon.com>
Reviewed-by: Hans Holmberg <hans.holmberg@cnexlabs.com>
Signed-off-by: Matias Bjørling <mb@lightnvm.io>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


lightnvm: prevent race condition on pblk remove · f41d427c

Marcin Dziegielewski authored May 04, 2019

When we trigger nvm target removal during device hot unplug, there is
a chance of hitting a general protection fault. This is caused by use
of an nvm_dev that may be freed from another (hot unplug) thread
(in the nvm_unregister function).

Introduce a lock in the nvme_ioctl_dev_remove function to prevent this
situation.
Signed-off-by: Marcin Dziegielewski <marcin.dziegielewski@intel.com>
Reviewed-by: Javier González <javier@javigon.com>
Signed-off-by: Matias Bjørling <mb@lightnvm.io>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


lightnvm: pblk: set proper line as data_line after gc · 4bbae699

Marcin Dziegielewski authored May 04, 2019

In the current implementation of l2p recovery, when we are after gc and
have an open line, we do not set the current data line properly (we set
the last line from the device instead of the last line ordered by
seq_nr), which results in a kernel panic and data corruption.
Signed-off-by: Marcin Dziegielewski <marcin.dziegielewski@intel.com>
Reviewed-by: Javier González <javier@javigon.com>
Signed-off-by: Matias Bjørling <mb@lightnvm.io>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


lightnvm: pblk: fix bio leak when bio is split · 05038712

Chansol Kim authored May 04, 2019

For large IOs where blk_queue_split needs to be called inside
pblk_rw_io, a bio leak results, as bio_endio is not called on the
newly allocated bio. One way to observe this is to mount an ext4
filesystem on the target and issue 1MB IO with dd, e.g., dd bs=1MB
if=/dev/null of=/mount/myvolume. kmemleak reports:

unreferenced object 0xffff88803d7d0100 (size 256):
  comm "kworker/u16:1", pid 68, jiffies 4294899333 (age 284.120s)
  hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 00 60 e8 31 81 88 ff ff  .........`.1....
    01 40 00 00 06 06 00 00 00 00 00 00 05 00 00 00  .@..............
  backtrace:
    [<000000001f5aa04f>] kmem_cache_alloc+0x204/0x3c0
    [<0000000040945aab>] mempool_alloc_slab+0x1d/0x30
    [<00000000b4959ab4>] mempool_alloc+0x83/0x220
    [<00000000646bad9b>] bio_alloc_bioset+0x229/0x320
    [<000000009264b251>] bio_clone_fast+0x26/0xc0
    [<0000000008250252>] bio_split+0x41/0x110
    [<00000000e365cad0>] blk_queue_split+0x349/0x930
    [<00000000eb5426bc>] pblk_make_rq+0x1b5/0x1f0
    [<00000000eea09cec>] generic_make_request+0x2f9/0x690
    [<00000000ae6acede>] submit_bio+0x12e/0x1f0
    [<00000000f9b8b82a>] ext4_io_submit+0x64/0x80
    [<000000009e4f817d>] ext4_bio_write_page+0x32e/0x890
    [<00000000cbd0d106>] mpage_submit_page+0x65/0xc0
    [<000000000eec7359>] mpage_map_and_submit_buffers+0x171/0x330
    [<000000009a7afcb6>] ext4_writepages+0xd5e/0x1650
    [<000000004476b096>] do_writepages+0x39/0xc0

In case there is a need for a split, blk_queue_split returns the newly
allocated bio to the caller by changing the value of the pointer passed
by reference, while the original is passed to generic_make_request.

pblk_rw_io's local bio* variable is changed and passed to
pblk_submit_read and pblk_write_to_cache, so the work is done on this
new bio* and pblk_rw_io returns NVM_IO_DONE. However, pblk_make_rq then
calls bio_endio on the old bio*, because it passed the bio pointer by
value to pblk_rw_io.

pblk_rw_io is unfolded into pblk_make_rq so that there is no copying
of bio* and bio_endio is called on the correct bio*.
Signed-off-by: Chansol Kim <chansol.kim@samsung.com>
Reviewed-by: Javier González <javier@javigon.com>
Signed-off-by: Matias Bjørling <mb@lightnvm.io>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
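
The pass-by-value problem has a close analogue in Python, where rebinding a parameter inside a helper does not affect the caller's variable. The sketch below is a toy model, not pblk code: the Bio class and all function names are hypothetical, and setting an "ended" flag stands in for bio_endio.

```python
class Bio:
    def __init__(self, name):
        self.name = name
        self.ended = False


def queue_split(bio, allocated):
    """Stand-in for the split: allocate a new bio for the front part."""
    front = Bio(bio.name + "/front")
    allocated.append(front)
    return front


def rw_io(bio, allocated):
    bio = queue_split(bio, allocated)   # rebinds only this function's local
    return "DONE"                       # the work was done on the new bio


def make_rq_buggy(bio, allocated):
    if rw_io(bio, allocated) == "DONE":
        bio.ended = True                # still the old bio: the new one leaks


def make_rq_fixed(bio, allocated):
    bio = queue_split(bio, allocated)   # helper unfolded into the caller
    bio.ended = True                    # completion hits the bio that was used


buggy_allocs, fixed_allocs = [], []
make_rq_buggy(Bio("a"), buggy_allocs)
make_rq_fixed(Bio("b"), fixed_allocs)
leaked = [b.name for b in buggy_allocs if not b.ended]
```

In the buggy variant the newly allocated bio is never completed, which corresponds to the leak that kmemleak reports above.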


lightnvm: Inherit mdts from the parent nvme device · a14669eb

Igor Konopko authored May 04, 2019

The current lightnvm and pblk implementation does not care about the NVMe
max data transfer size, which can be smaller than 64*4K=256K. There are
existing NVMe controllers, compliant with the NVMe spec, whose max data
transfer size is lower than 256K (for example 128K). Such controllers
are not able to handle a command which contains 64 PPAs, since the size
of the DMAed buffer will be above the capabilities of such a controller.
Signed-off-by: Igor Konopko <igor.j.konopko@intel.com>
Reviewed-by: Hans Holmberg <hans.holmberg@cnexlabs.com>
Reviewed-by: Javier González <javier@javigon.com>
Signed-off-by: Matias Bjørling <mb@lightnvm.io>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
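
The arithmetic behind the limit can be written out as a short sketch. It assumes 4K sectors and a lightnvm maximum of 64 PPAs per command; the function name max_ppas_per_cmd() is hypothetical.

```python
MAX_PPAS = 64            # PPAs per command assumed by lightnvm
SECTOR_SIZE = 4096       # assumed 4K sectors


def max_ppas_per_cmd(mdts_bytes, sector_size=SECTOR_SIZE):
    """Largest PPA count whose data buffer fits in the controller's MDTS."""
    return min(MAX_PPAS, mdts_bytes // sector_size)


ppas_256k = max_ppas_per_cmd(256 * 1024)
ppas_128k = max_ppas_per_cmd(128 * 1024)
```

A controller with a 128K limit can therefore take at most 32 PPAs per command under these assumptions, half of what a 64-PPA command would need.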


lightnvm: pblk: set proper read status in bio · d38954ed

Igor Konopko authored May 04, 2019

Currently in case of read errors, bi_status is not set properly, which
leads to returning improper data to the layers above. This patch fixes
that by setting the proper status in case of read errors.

Also remove the unnecessary warn_once(), which does not make sense
in that place, since the user bio is not used for interaction with the
drive and thus bi_status will not be set here.
Signed-off-by: Igor Konopko <igor.j.konopko@intel.com>
Reviewed-by: Javier González <javier@javigon.com>
Reviewed-by: Hans Holmberg <hans.holmberg@cnexlabs.com>
Signed-off-by: Matias Bjørling <mb@lightnvm.io>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


lightnvm: pblk: cleanly fail when there is not enough memory · 6e46b8b2

Igor Konopko authored May 04, 2019

The L2P table can be huge in many cases, since it typically requires 1GB
of DRAM for 1TB of drive. When there is not enough memory available,
the OOM killer turns on and kills random processes, which can be very
annoying for users.

This patch changes the flags for L2P table allocation in order to handle
this situation in a more user-friendly way.

GFP_KERNEL and __GFP_HIGHMEM are the default flags used in parameterless
vmalloc() calls, so they are also kept in this patch. Additionally, the
__GFP_NOWARN flag is added in order to hide a very long dmesg warning in
case of allocation failure. The most important flag introduced
in this patch is __GFP_RETRY_MAYFAIL, which causes the allocator
to try to use free memory and, if none is available, to drop caches, but
not to run the OOM killer.
Signed-off-by: Igor Konopko <igor.j.konopko@intel.com>
Reviewed-by: Hans Holmberg <hans.holmberg@cnexlabs.com>
Reviewed-by: Javier González <javier@javigon.com>
Signed-off-by: Matias Bjørling <mb@lightnvm.io>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


lightnvm: pblk: ensure that erase is chunk aligned · 75c89bef

Igor Konopko authored May 04, 2019

The sector bits in the erase command may be uninitialized, causing the
erase LBA to be unaligned to the chunk size.

This is an unexpected situation, since erase shall always be chunk
aligned according to the OCSSD 2.0 specification.
Signed-off-by: Igor Konopko <igor.j.konopko@intel.com>
Reviewed-by: Javier González <javier@javigon.com>
Reviewed-by: Hans Holmberg <hans.holmberg@cnexlabs.com>
Signed-off-by: Matias Bjørling <mb@lightnvm.io>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


lightnvm: pblk: fix race during put line · 4ca88524

Igor Konopko authored May 04, 2019

In the pblk_put_line_back function, a race condition with
__pblk_map_invalidate can leave a line not part of any list.

Resetting gc_list to null fixes the above issue.

Fixes: a4bd217b ("lightnvm: physical block device (pblk) target")
Signed-off-by: Igor Konopko <igor.j.konopko@intel.com>
Reviewed-by: Javier González <javier@javigon.com>
Reviewed-by: Hans Holmberg <hans.holmberg@cnexlabs.com>
Signed-off-by: Matias Bjørling <mb@lightnvm.io>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


lightnvm: pblk: gracefully handle GC vmalloc fail · d378561b

Igor Konopko authored May 04, 2019

Currently when we fail on rq data allocation in gc, it skips moving
active data and moves the line straight to its free state, losing user
data in the process.

Move the data allocation to an earlier phase of GC, where we can still
fail gracefully by moving the line back to the closed state.
Signed-off-by: Igor Konopko <igor.j.konopko@intel.com>
Reviewed-by: Javier González <javier@javigon.com>
Reviewed-by: Hans Holmberg <hans.holmberg@cnexlabs.com>
Signed-off-by: Matias Bjørling <mb@lightnvm.io>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


lightnvm: pblk: remove unused smeta_ssec field · 605bcef7

Igor Konopko authored May 04, 2019

smeta_ssec field in pblk_line is never used after it was replaced by
the function pblk_line_smeta_start().
Signed-off-by: Igor Konopko <igor.j.konopko@intel.com>
Reviewed-by: Hans Holmberg <hans.holmberg@cnexlabs.com>
Reviewed-by: Javier González <javier@javigon.com>
Signed-off-by: Matias Bjørling <mb@lightnvm.io>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


lightnvm: pblk: reduce L2P memory footprint · 847a3a27

Igor Konopko authored May 04, 2019

Currently L2P map size is calculated based on the total number of
available sectors, which is redundant, since it contains mapping for
overprovisioning as well (11% by default).

Change this size to the real capacity and thus reduce the memory
footprint significantly - with default op value it is approx.
110MB of DRAM less for every 1TB of media.
Signed-off-by: Igor Konopko <igor.j.konopko@intel.com>
Reviewed-by: Hans Holmberg <hans.holmberg@cnexlabs.com>
Reviewed-by: Javier González <javier@javigon.com>
Signed-off-by: Matias Bjørling <mb@lightnvm.io>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
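
The saving quoted above can be checked with a short calculation. The sketch assumes 4K sectors and 4-byte L2P entries, which is consistent with the roughly 1GB per 1TB figure given elsewhere in this series; the helper name l2p_bytes() is hypothetical.

```python
def l2p_bytes(capacity_bytes, sector_size=4096, entry_size=4):
    """Size of an L2P table with one entry per sector (assumed layout)."""
    return capacity_bytes // sector_size * entry_size


TIB = 1 << 40
OP_PERCENT = 11                              # default overprovisioning

full = l2p_bytes(TIB)                        # map sized for all sectors
user = l2p_bytes(TIB * (100 - OP_PERCENT) // 100)
saved_mib = (full - user) / (1 << 20)
```

Under these assumptions the full map takes 1 GiB per 1 TiB of media, and sizing it for the user-visible capacity saves a little over 112 MiB, in line with the approximate figure in the commit message.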


lightnvm: pblk: rollback on error during gc read · 8935ebfc

Igor Konopko authored May 04, 2019

A line is left unassigned to the block lists in case pblk_gc_line
returns an error.

This moves the line back to the appropriate list, which can then be
picked up by the garbage collector.
Signed-off-by: Igor Konopko <igor.j.konopko@intel.com>
Reviewed-by: Hans Holmberg <hans.holmberg@cnexlabs.com>
Reviewed-by: Javier González <javier@javigon.com>
Signed-off-by: Matias Bjørling <mb@lightnvm.io>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


lightnvm: pblk: line reference fix in GC · 7e5434ee

Igor Konopko authored May 04, 2019

Fixes the GC error case when moving a line back to closed state
while releasing additional references.
Signed-off-by: Igor Konopko <igor.j.konopko@intel.com>
Reviewed-by: Hans Holmberg <hans.holmberg@cnexlabs.com>
Reviewed-by: Javier González <javier@javigon.com>
Signed-off-by: Matias Bjørling <mb@lightnvm.io>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


04 May, 2019 7 commits

block: don't drain in-progress dispatch in blk_cleanup_queue() · 66215664

Ming Lei authored Apr 30, 2019

Now that freeing hw queue resources has been moved to hctx's release
handler, we don't need to worry about the race between blk_cleanup_queue
and run queue any more.

So don't drain in-progress dispatch in blk_cleanup_queue().

This is basically a revert of c2856ae2 ("blk-mq: quiesce queue before
freeing queue").

Cc: Dongli Zhang <dongli.zhang@oracle.com>
Cc: James Smart <james.smart@broadcom.com>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: linux-scsi@vger.kernel.org
Cc: Martin K . Petersen <martin.petersen@oracle.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: James E . J . Bottomley <jejb@linux.vnet.ibm.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Tested-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


blk-mq: move cancel of hctx->run_work into blk_mq_hw_sysfs_release · 1b97871b

Ming Lei authored Apr 30, 2019

hctx is always released after the request queue is freed.

While holding the queue's kobject refcount, it is safe for a driver to
run the queue, so a queue run might be scheduled after blk_sync_queue()
is done.

So move the cancel of hctx->run_work into blk_mq_hw_sysfs_release()
to avoid running a released queue.

Cc: Dongli Zhang <dongli.zhang@oracle.com>
Cc: James Smart <james.smart@broadcom.com>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: linux-scsi@vger.kernel.org
Cc: Martin K . Petersen <martin.petersen@oracle.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: James E . J . Bottomley <jejb@linux.vnet.ibm.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Tested-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


blk-mq: always free hctx after request queue is freed · 2f8f1336

Ming Lei authored Apr 30, 2019

In the normal queue cleanup path, hctx is released after the request
queue is freed, see blk_mq_release().

However, in __blk_mq_update_nr_hw_queues(), hctx may be freed because
of hw queues shrinking. This can easily cause a use-after-free,
because one implicit rule is that it is safe to call almost all block
layer APIs if the request queue is alive. So one hctx may be retrieved
by one API, then the hctx can be freed by blk_mq_update_nr_hw_queues(),
and finally a use-after-free is triggered.

Fix this issue by always freeing hctx after releasing the request queue.
If some hctxs are removed in blk_mq_update_nr_hw_queues(), introduce
a per-queue list to hold them, then try to reuse these hctxs if the numa
node matches.

Cc: Dongli Zhang <dongli.zhang@oracle.com>
Cc: James Smart <james.smart@broadcom.com>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: linux-scsi@vger.kernel.org
Cc: Martin K . Petersen <martin.petersen@oracle.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: James E . J . Bottomley <jejb@linux.vnet.ibm.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Tested-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
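
The per-queue list of parked hctxs can be modelled in a few lines. This is a toy sketch, not the blk-mq implementation: the Queue and Hctx classes, their method names, and the list-of-NUMA-nodes argument are all assumptions made for illustration.

```python
class Hctx:
    def __init__(self, numa_node):
        self.numa_node = numa_node


class Queue:
    def __init__(self):
        self.hctxs = []
        self.unused_hctx_list = []     # parked hctxs, freed only on release

    def alloc_hctx(self, node):
        # Prefer a parked hctx from the same NUMA node over a new allocation.
        for h in self.unused_hctx_list:
            if h.numa_node == node:
                self.unused_hctx_list.remove(h)
                return h
        return Hctx(node)

    def update_nr_hw_queues(self, nodes):
        while len(self.hctxs) > len(nodes):          # shrinking: park, don't free
            self.unused_hctx_list.append(self.hctxs.pop())
        for node in nodes[len(self.hctxs):]:         # growing: reuse if possible
            self.hctxs.append(self.alloc_hctx(node))

    def release(self):
        # Everything is freed together, once the queue itself goes away.
        freed = self.hctxs + self.unused_hctx_list
        self.hctxs, self.unused_hctx_list = [], []
        return freed


q = Queue()
q.update_nr_hw_queues([0, 0, 1, 1])
stale = q.hctxs[3]                     # a pointer some other path still holds
q.update_nr_hw_queues([0, 0])
parked = stale in q.unused_hctx_list
q.update_nr_hw_queues([0, 0, 1])
reused = q.hctxs[2] is stale
freed = q.release()
```

In this model a reference obtained before the shrink stays valid for as long as the queue is alive, which is the property the commit message asks for.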


blk-mq: split blk_mq_alloc_and_init_hctx into two parts · 7c6c5b7c

Ming Lei authored Apr 30, 2019

Split blk_mq_alloc_and_init_hctx into two parts: one is
blk_mq_alloc_hctx() for allocating all hctx resources, the other
is blk_mq_init_hctx() for initializing hctx, which serves as the
counterpart of blk_mq_exit_hctx().

Cc: Dongli Zhang <dongli.zhang@oracle.com>
Cc: James Smart <james.smart@broadcom.com>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: linux-scsi@vger.kernel.org
Cc: Martin K . Petersen <martin.petersen@oracle.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: James E . J . Bottomley <jejb@linux.vnet.ibm.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Tested-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


blk-mq: free hw queue's resource in hctx's release handler · c7e2d94b

Ming Lei authored Apr 30, 2019

Once blk_cleanup_queue() returns, tags shouldn't be used any more,
because blk_mq_free_tag_set() may be called. Commit 45a9c9d9
("blk-mq: Fix a use-after-free") fixes exactly this issue.

However, that commit introduces another issue. Before 45a9c9d9,
we were allowed to run the queue during queue cleanup if the queue's
kobj refcount was held. After that commit, the queue can't be run during
queue cleanup, otherwise an oops can be triggered easily because
some fields of hctx are freed by blk_mq_free_queue() in blk_cleanup_queue().

We have invented ways of addressing this kind of issue before, such as:

	8dc765d4 ("SCSI: fix queue cleanup race before queue initialization is done")
	c2856ae2 ("blk-mq: quiesce queue before freeing queue")

But these still can't cover all cases; recently James reported another
issue of this kind:

	https://marc.info/?l=linux-scsi&m=155389088124782&w=2

This issue can be quite hard to address in the previous way, given that
scsi_run_queue() may run requeues for other LUNs.

Fix the above issue by freeing hctx's resources in its release handler;
this is safe because tags aren't needed for freeing such hctx resources.

This approach follows the typical design pattern wrt. kobject's release handler.

Cc: Dongli Zhang <dongli.zhang@oracle.com>
Cc: James Smart <james.smart@broadcom.com>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: linux-scsi@vger.kernel.org
Cc: Martin K . Petersen <martin.petersen@oracle.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: James E . J . Bottomley <jejb@linux.vnet.ibm.com>
Reported-by: James Smart <james.smart@broadcom.com>
Fixes: 45a9c9d9 ("blk-mq: Fix a use-after-free")
Cc: stable@vger.kernel.org
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Tested-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


blk-mq: move cancel of requeue_work into blk_mq_release · fbc2a15e

Ming Lei authored Apr 30, 2019

While holding the queue's kobject refcount, it is safe for a driver
to schedule requeue. However, blk_mq_kick_requeue_list() may
be called after blk_sync_queue() is done because of concurrent
requeue activities; the requeue work may then not be completed when
the queue is freed, and a kernel oops is triggered.

So move the cancel of requeue_work into blk_mq_release() to
avoid the race between requeue and freeing the queue.

Cc: Dongli Zhang <dongli.zhang@oracle.com>
Cc: James Smart <james.smart@broadcom.com>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: linux-scsi@vger.kernel.org
Cc: Martin K . Petersen <martin.petersen@oracle.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: James E . J . Bottomley <jejb@linux.vnet.ibm.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Tested-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


blk-mq: grab .q_usage_counter when queuing request from plug code path · e87eb301

Ming Lei authored Apr 30, 2019

Just like aio/io_uring, we need to grab 2 refcounts for queuing one
request: one is for submission, the other is for completion.

If the request isn't queued from the plug code path, the refcount grabbed
in generic_make_request() serves for submission. In theory, this
refcount should have been released after the submission (async run queue)
is done. blk_freeze_queue() works together with blk_sync_queue()
to avoid the race between queue cleanup and IO submission, given that
async run queue activities are canceled because hctx->run_work is
scheduled with the refcount held, so it is fine not to hold the refcount
when running the run queue work function to dispatch IO.

However, if the request is staggered into the plug list and finally queued
from the plug code path, the refcount on the submission side is actually
missed. We may then start to run the queue after the queue is removed,
because the queue's kobject refcount isn't guaranteed to be grabbed in
the plug list flushing context, and a kernel oops is triggered; see the
following race:

blk_mq_flush_plug_list():
        blk_mq_sched_insert_requests()
                insert requests to sw queue or scheduler queue
                blk_mq_run_hw_queue

Because of concurrent run queue, all requests inserted above may be
completed before the above blk_mq_run_hw_queue is called. The queue can
then be freed during the above blk_mq_run_hw_queue().

Fix the issue by grabbing .q_usage_counter before calling
blk_mq_sched_insert_requests() in blk_mq_flush_plug_list(). This is
safe because the queue is absolutely alive before inserting requests.

Cc: Dongli Zhang <dongli.zhang@oracle.com>
Cc: James Smart <james.smart@broadcom.com>
Cc: linux-scsi@vger.kernel.org
Cc: Martin K . Petersen <martin.petersen@oracle.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: James E . J . Bottomley <jejb@linux.vnet.ibm.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Tested-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


02 May, 2019 2 commits

Merge branch 'nvme-5.2' of git://git.infradead.org/nvme into for-5.2/block · 6143393c

Jens Axboe authored May 02, 2019

Pull NVMe updates from Christoph.

* 'nvme-5.2' of git://git.infradead.org/nvme:
  nvmet: protect discovery change log event list iteration
  nvme: mark nvme_core_init and nvme_core_exit static
  nvme: move command size checks to the core
  nvme-fabrics: check more command sizes
  nvme-pci: check more command sizes
  nvme-pci: remove an unneeded variable initialization
  nvme-pci: unquiesce admin queue on shutdown
  nvme-pci: shutdown on timeout during deletion
  nvme-pci: fix psdt field for single segment sgls
  nvme-multipath: don't print ANA group state by default
  nvme-multipath: split bios with the ns_head bio_set before submitting
  nvme-tcp: fix possible null deref on a timed out io queue connect


block: fix function name in comment · 273938bf

Raul E Rangel authored May 02, 2019

The comment was out of date.
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Raul E Rangel <rrangel@chromium.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
