Commits · 287922eb0b186e2a5bf54fdd04b734c25c90035c · nexedi / linux

22 Dec, 2015 1 commit

block: defer timeouts to a workqueue · 287922eb

Christoph Hellwig authored Oct 30, 2015

Timer context is not very useful for drivers to perform any meaningful abort
action from.  So instead of calling the driver from this useless context
defer it to a workqueue as soon as possible.

Note that while a delayed_work item would seem the right thing here I didn't
dare to use it due to the magic in blk_add_timer that pokes deep into timer
internals.  But maybe this encourages Tejun to add a sensible API for that to
the workqueue API and we'll all be fine in the end :)

Contains a major update from Keith Busch:

"This patch removes synchronizing the timeout work so that the timer can
 start a freeze on its own queue. The timer enters the queue, so timer
 context can only start a freeze, but not wait for frozen."
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>

287922eb

09 Dec, 2015 2 commits

nvme: precedence bug in nvme_pr_clear() · 8c0b3915

Dan Carpenter authored Dec 09, 2015

The "|" operator has higher precedence than "?:" so this didn't work as
intended. I had previously fixed this bug, but we copied the older
unfixed version when we moved the function between files.

Fixes: 1673f1f0 ('nvme: move block_device_operations and ns/ctrl freeing to common code')
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Jens Axboe <axboe@fb.com>

8c0b3915
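
The precedence pitfall fixed in 8c0b3915 can be reproduced outside the
kernel. The sketch below is not the nvme_pr_clear() code; FLAG_A, FLAG_B and
both helper functions are hypothetical names chosen for illustration. It only
shows that `a | b ? c : d` groups as `(a | b) ? c : d`.

```c
#include <assert.h>

/* Hypothetical flag bits, for illustration only. */
#define FLAG_A 0x1u
#define FLAG_B 0x2u

/*
 * Buggy form: parses as (FLAG_A | cond) ? FLAG_B : 0.
 * FLAG_A | cond is always non-zero, so FLAG_A never reaches the result.
 */
static unsigned int flags_buggy(int cond)
{
	return FLAG_A | cond ? FLAG_B : 0;
}

/* Fixed form: the conditional is parenthesized, then OR-ed with FLAG_A. */
static unsigned int flags_fixed(int cond)
{
	return FLAG_A | (cond ? FLAG_B : 0);
}
```

With these definitions flags_buggy() yields FLAG_B for either value of cond,
while flags_fixed() yields FLAG_A or FLAG_A | FLAG_B as intended.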

blk-integrity: checking for NULL instead of IS_ERR · 7b6c0f80

Dan Carpenter authored Dec 09, 2015

We recently changed bio_integrity_alloc() to return ERR_PTRs instead of
NULL but these calls were missed.

Fixes: 06c1e390 ('blk-integrity: empty implementation when disabled')
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@fb.com>

7b6c0f80
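
Why a NULL check misses an error pointer, as a userspace sketch. The helpers
below imitate the kernel's ERR_PTR()/IS_ERR()/PTR_ERR() but are re-implemented
here so the example compiles on its own; alloc_payload() is a hypothetical
stand-in and not bio_integrity_alloc().

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ERRNO 4095

/* Encode a negative errno value in the top page of the address space. */
static inline void *ERR_PTR(long error)
{
	return (void *)(intptr_t)error;
}

static inline long PTR_ERR(const void *ptr)
{
	return (long)(intptr_t)ptr;
}

static inline int IS_ERR(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/* Hypothetical allocator that reports failure as an error pointer. */
static void *alloc_payload(int fail)
{
	static int payload;

	return fail ? ERR_PTR(-ENOMEM) : (void *)&payload;
}
```

A failed alloc_payload() call returns a non-NULL pointer, so a caller that
only tests for NULL would go on to dereference it; testing with IS_ERR() and
propagating PTR_ERR() avoids that.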

08 Dec, 2015 1 commit

nvme: fix another 32-bit build warning · d1ea7be5

Arnd Bergmann authored Dec 08, 2015

The nvme_user_cmd function was recently moved around from one file
to another, which made a warning reappear that I had fixed before
at some point:

drivers/nvme/host/core.c: In function 'nvme_user_cmd':
drivers/nvme/host/core.c:424:4: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]

This applies the same workaround that we have elsewhere in the
driver with an extra type cast to uintptr_t.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Fixes: 1673f1f0 ("nvme: move block_device_operations and ns/ctrl freeing to common code")
Link: https://lkml.org/lkml/2015/10/9/611
Signed-off-by: Jens Axboe <axboe@fb.com>

d1ea7be5
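
A minimal sketch of the uintptr_t workaround mentioned in d1ea7be5, assuming
a user-visible structure that carries a buffer address in a fixed 64-bit
field. struct user_cmd and both functions are hypothetical; the real driver
deals with `__user` pointers, which are omitted here.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical command layout with a fixed-width address field. */
struct user_cmd {
	uint64_t addr;
};

static void *cmd_buffer(const struct user_cmd *cmd)
{
	/*
	 * Casting cmd->addr straight to a pointer triggers
	 * -Wint-to-pointer-cast where pointers are 32 bits wide.
	 * Going through uintptr_t makes the narrowing explicit.
	 */
	return (void *)(uintptr_t)cmd->addr;
}

/* Store a pointer in the 64-bit field and read it back. */
static int roundtrip_ok(void)
{
	char buf[8];
	struct user_cmd cmd = { .addr = (uint64_t)(uintptr_t)buf };

	return cmd_buffer(&cmd) == (void *)buf;
}
```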

03 Dec, 2015 2 commits

NVMe: fix build with CONFIG_NVM enabled · ac02ddde

Christoph Hellwig authored Dec 03, 2015

Looks like I didn't test with CONFIG_NVM enabled, and neither did
the build bot.

Most of this is really weird crazy shit in the lightnvm support, though.

Struct nvme_ns is a structure for the NVM I/O command set, and lightnvm has
no business poking into it.  Second this commit:

commit 47b3115a
Author: Wenwei Tao <ww.tao0320@gmail.com>
Date:   Fri Nov 20 13:47:55 2015 +0100

    nvme: lightnvm: use admin queues for admin cmds

Does even more crazy stuff.  If a function gets a request_queue parameter
passed it'd better use that and not look for another one.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>

ac02ddde

blk-integrity: empty implementation when disabled · 06c1e390

Keith Busch authored Dec 03, 2015

This patch moves the blk_integrity_payload definition outside the
CONFIG_BLK_DEV_INTEGRITY dependency and provides empty function
implementations when the kernel configuration disables integrity
extensions. This simplifies drivers that make use of these to map user
data so they don't need to repeat the same configuration checks.
Signed-off-by: Keith Busch <keith.busch@intel.com>

Updated by Jens to pass an error pointer return from
bio_integrity_alloc(), otherwise if CONFIG_BLK_DEV_INTEGRITY isn't
set, we return a weird ENOMEM from __nvme_submit_user_cmd()
if a meta buffer is set.
Signed-off-by: Jens Axboe <axboe@fb.com>

06c1e390

01 Dec, 2015 23 commits

nvme: refactor set_queue_count · 9a0be7ab

Christoph Hellwig authored Nov 26, 2015

Split out a helper that just issues the Set Features and interprets the
result which can go to common code, and document why we are ignoring
non-timeout error returns in the PCIe driver.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>

9a0be7ab

nvme: move chardev and sysfs interface to common code · f3ca80fc

Christoph Hellwig authored Nov 28, 2015

For this we need to add a proper controller init routine and a list of
all controllers that is in addition to the list of PCIe controllers,
which stays in pci.c.  Note that we remove the sysfs device when the
last reference to a controller is dropped now - the old code would have
kept it around longer, which doesn't make much sense.

This requires a new ->reset_ctrl operation to implement controller
resets, and a new ->write_reg32 operation that is required to implement
subsystem resets.  We also now store cached copies of the NVMe compliance
version and the flag if a controller is attached to a subsystem or not in
the generic controller structure.
Signed-off-by: Christoph Hellwig <hch@lst.de>
[Fixes for pr merge]
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>

f3ca80fc

nvme: move namespace scanning to common code · 5bae7f73

Christoph Hellwig authored Nov 28, 2015

The namespace scanning code has been mostly generic already, we just
need to store a pointer to the tagset in the nvme_ctrl structure, and
add a method to check if a controller is I/O incapable.  The latter
will hopefully be replaced by a proper controller state machine soon.
Signed-off-by: Christoph Hellwig <hch@lst.de>
[Fixed pr conflicts]
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>

5bae7f73

nvme: move the call to nvme_init_identify earlier · ce4541f4

Christoph Hellwig authored Oct 16, 2015

We want to record the identify and CAP values even if no I/O queue
is available.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>

ce4541f4

nvme: add a common helper to read Identify Controller data · 7fd8930f

Christoph Hellwig authored Nov 28, 2015

And add the 64-bit register read operation for it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>

7fd8930f

nvme: move nvme_{enable,disable,shutdown}_ctrl to common code · 5fd4ce1b

Christoph Hellwig authored Nov 28, 2015

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>

5fd4ce1b

nvme: move remaining CC setup into nvme_enable_ctrl · 1b2eb374

Christoph Hellwig authored Nov 28, 2015

Move the calculation of all the bits written into the CC register into
nvme_enable_ctrl, so that they can be moved into the core NVMe driver in
the future.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>

1b2eb374

nvme: add explicit quirk handling · 106198ed

Christoph Hellwig authored Nov 26, 2015

Add an enum for all workarounds not in the spec and identify the affected
controllers at probe time.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>

106198ed

nvme: move block_device_operations and ns/ctrl freeing to common code · 1673f1f0

Christoph Hellwig authored Nov 26, 2015

This moves the block_device_operations over to common code mostly
as-is.  The only change is that the ns and ctrl refcounting got some
small refactoring to have wrappers around the kref_put operations.

A new free_ctrl operation is added to allow the PCI driver to free
its resources on the final drop.
Signed-off-by: Christoph Hellwig <hch@lst.de>
[Moved the integrity and pr changes due to merge conflict]
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>

1673f1f0

nvme: use the block layer for userspace passthrough metadata · 0b7f1f26

Keith Busch authored Oct 23, 2015

Use the integrity API to pass through metadata from userspace.  For PI
enabled devices this means that we now validate the reftag, which seems
like an unintentional omission in the old code.

Thanks to Keith Busch for testing and fixes.
Signed-off-by: Christoph Hellwig <hch@lst.de>
[Skip metadata setup on admin commands]
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>

0b7f1f26

nvme: split __nvme_submit_sync_cmd · 4160982e

Christoph Hellwig authored Nov 20, 2015

Add a separate nvme_submit_user_cmd for commands that directly DMA
to or from userspace.  We'll add metadata support to that soon and
the common version would become too messy.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>

4160982e

nvme: move nvme_setup_flush and nvme_setup_rw to common code · 22944e99

Christoph Hellwig authored Oct 16, 2015

And mark them inline so that we don't slow down the I/O submission path by
having to turn it into a forced out of line call.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>

22944e99

nvme: move nvme_error_status to common code · 15a190f7

Christoph Hellwig authored Oct 16, 2015

And mark it inline so that we don't slow down the completion path by
having to turn it into a forced out of line call.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>

15a190f7

nvme: factor out a nvme_unmap_data helper · d4f6c3ab

Christoph Hellwig authored Nov 26, 2015

This is the counterpart to nvme_map_data.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>

d4f6c3ab

nvme: refactor nvme_queue_rq · ba1ca37e

Christoph Hellwig authored Oct 16, 2015

This "backports" the structure I've used for the fabrics driver.  It
mostly started out as a cleanup so that I could actually understand
the code, but I think it also qualifies as a micro-optimization due
to the reduced time we hold q_lock and disable interrupts.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>

ba1ca37e

nvme: simplify nvme_setup_prps calling convention · 69d2b571

Christoph Hellwig authored Oct 16, 2015

Pass back a true/false value instead of the length which needs a compare
with the bytes in the request and drop the pointless gfp_t argument.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>

69d2b571

nvme: split a new struct nvme_ctrl out of struct nvme_dev · 1c63dc66

Christoph Hellwig authored Nov 26, 2015

The new struct nvme_ctrl will be used by the common NVMe code that sits
on top of struct request_queue and the new nvme_ctrl_ops abstraction.
It only contains the bare minimum required, which consists of values
sampled during controller probe, the admin queue pointer and a second
struct device pointer at the moment, but more will follow later.  Only
values that are not used in the I/O fast path should be moved to
struct nvme_ctrl so that drivers can optimize their cache line usage
easily.  That's also the reason why we have two device pointers as
the struct device is used for DMA mapping purposes.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>

1c63dc66

nvme: use vendor ID from identify · 01fec28a

Christoph Hellwig authored Nov 26, 2015

Use the vendor ID from the identify data instead of the PCI device to
make the SCSI translation layer independent from the PCI driver.  The NVMe
spec defines them as having the same value for current PCIe devices.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>

01fec28a

nvme: split nvme_trans_device_id_page · bf7d3ebb

Christoph Hellwig authored Nov 26, 2015

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>

bf7d3ebb

nvme: use offset instead of a struct for registers · 7a67cbea

Christoph Hellwig authored Nov 20, 2015

This makes life easier for future non-PCI drivers where access to the
registers might be more complicated.  Note that Linux drivers are
pretty evenly split between the two versions, and in fact the NVMe
driver already uses offsets for the doorbells.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Keith Busch <keith.busch@intel.com>
[Fixed CMBSZ offset]
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>

7a67cbea
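
A rough sketch of the offset-based style described in 7a67cbea. The four
offsets are the ones the NVMe specification assigns to these registers;
everything else (fake_bar, reg_read32(), reg_write32(), write_then_read())
is hypothetical and only simulates a register window in ordinary memory. A
real PCIe driver would go through readl()/writel() on a mapped BAR, and a
non-PCI transport could implement the same two accessors differently.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Register offsets as defined by the NVMe specification. */
enum {
	REG_CAP  = 0x00,	/* Controller Capabilities (64-bit) */
	REG_VS   = 0x08,	/* Version */
	REG_CC   = 0x14,	/* Controller Configuration */
	REG_CSTS = 0x1c,	/* Controller Status */
};

/* Hypothetical stand-in for a mapped register window. */
static uint8_t fake_bar[0x40];

static uint32_t reg_read32(uint32_t off)
{
	uint32_t val;

	memcpy(&val, fake_bar + off, sizeof(val));
	return val;
}

static void reg_write32(uint32_t off, uint32_t val)
{
	memcpy(fake_bar + off, &val, sizeof(val));
}

/* Callers name a register by offset only; no struct layout is exposed. */
static uint32_t write_then_read(uint32_t off, uint32_t val)
{
	reg_write32(off, val);
	return reg_read32(off);
}
```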

nvme: split command submission helpers out of pci.c · 21d34711

Christoph Hellwig authored Nov 26, 2015

Create a new core.c and start by adding the command submission helpers
to it, which are already abstracted away from the actual hardware queues
by the block layer.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>

21d34711

nvme: move struct nvme_iod to pci.c · 71bd150c

Christoph Hellwig authored Oct 16, 2015

This structure is specific to the PCIe driver internals and should be moved
to pci.c.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>

71bd150c

blk-mq: add a flags parameter to blk_mq_alloc_request · 6f3b0e8b

Christoph Hellwig authored Nov 26, 2015

We already have the reserved flag, and a nowait flag awkwardly encoded as
a gfp_t.  Add a real flags argument to make the scheme more extensible and
allow for a nicer calling convention.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>

6f3b0e8b

25 Nov, 2015 1 commit

Revert "blk-flush: Queue through IO scheduler when flush not required" · d7cf931d

Jens Axboe authored Nov 25, 2015

This reverts commit 1b2ff19e.

Jan writes:

--

Thanks for report! After some investigation I found out we allocate
elevator specific data in __get_request() only for non-flush requests. And
this is actually required since the flush machinery uses the space in
struct request for something else. Doh. So my patch is just wrong and not
easy to fix since at the time __get_request() is called we are not sure
whether the flush machinery will be used in the end. Jens, please revert
1b2ff19e. Thanks!

I'm somewhat surprised that you can reliably hit the race where flushing
gets disabled for the device just while the request is in flight. But I
guess during boot it makes some sense.

--

So let's just revert it, we can fix the queue run manually after the
fact. This race is rare enough that it didn't trigger in testing, it
requires the specific disable-while-in-flight scenario to trigger.

d7cf931d

24 Nov, 2015 10 commits

block: clarify blk_add_timer() use case for blk-mq · 3b627a3f

Jens Axboe authored Nov 24, 2015

Just a comment update on not needing queue_lock, and that we aren't
really adding the request to a timeout list for !mq.
Signed-off-by: Jens Axboe <axboe@fb.com>

3b627a3f

bio: use offset_in_page macro · bd5cecea

Geliang Tang authored Nov 21, 2015

Use offset_in_page macro instead of (addr & ~PAGE_MASK).
Signed-off-by: Geliang Tang <geliangtang@163.com>
Signed-off-by: Jens Axboe <axboe@fb.com>

bd5cecea
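
For reference, the equivalence behind bd5cecea as a standalone sketch. The
DEMO_ prefixed names are hypothetical and a 4 KiB page is assumed; in the
kernel the macro is offset_in_page() and the constants come from PAGE_SHIFT.

```c
#include <assert.h>

/* Assumed 4 KiB pages for this sketch. */
#define DEMO_PAGE_SIZE 4096UL
#define DEMO_PAGE_MASK (~(DEMO_PAGE_SIZE - 1))

/* Same shape as the kernel macro: keep only the bits below the page size. */
#define demo_offset_in_page(p) ((unsigned long)(p) & ~DEMO_PAGE_MASK)
```

Under that assumption, demo_offset_in_page(0x12345) and the open-coded
`0x12345UL & ~DEMO_PAGE_MASK` both evaluate to 0x345.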

block: do not initialise statics to 0 or NULL · 1fe8f348

Wei Tang authored Nov 24, 2015

This patch fixes the checkpatch.pl error in genhd.c:

ERROR: do not initialise statics to 0 or NULL
Signed-off-by: Wei Tang <tangwei@cmss.chinamobile.com>
Signed-off-by: Jens Axboe <axboe@fb.com>

1fe8f348

block: do not initialise globals to 0 or NULL · d674d414

Wei Tang authored Nov 24, 2015

This patch fixes the checkpatch.pl error in blk-exec.c:

ERROR: do not initialise globals to 0 or NULL
Signed-off-by: Wei Tang <tangwei@cmss.chinamobile.com>
Signed-off-by: Jens Axboe <axboe@fb.com>

d674d414

block: rename request_queue slab cache · c2789bd4

Ilya Dryomov authored Nov 20, 2015

Name the cache after the actual name of the struct.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>

c2789bd4

block: fix blk_abort_request for blk-mq drivers · 55ce0da1

Christoph Hellwig authored Oct 30, 2015

We only added the request to the request list for the !blk-mq case,
so we should only delete it in that case as well.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>

55ce0da1

nvme: add missing unmaps in nvme_queue_rq · bf508e91

Christoph Hellwig authored Oct 16, 2015

When we fail various metadata related operations in nvme_queue_rq we
need to unmap the data SGL.

Cc: stable@vger.kernel.org
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>

bf508e91

NVMe: default to 4k device page size · c5c9f25b

Nishanth Aravamudan authored Nov 24, 2015

We received a bug report recently when DDW (64-bit direct DMA on Power)
is not enabled for NVMe devices. In that case, we fall back to 32-bit
DMA via the IOMMU, which is always done via 4K TCEs (Translation Control
Entries).

The NVMe device driver, though, assumes that the DMA alignment for the
PRP entries will match the device's page size, and that the DMA alignment
matches the kernel's page alignment. On Power, the IOMMU page size,
as mentioned above, can be 4K, while the device can have a page size of
8K, while the kernel has a page size of 64K. This eventually trips the
BUG_ON in nvme_setup_prps(), as we have a 'dma_len' that is a multiple
of 4K but not 8K (e.g., 0xF000).

In this particular case of page sizes, we clearly want to use the
IOMMU's page size in the driver. And generally, the NVMe driver in this
function should be using the IOMMU's page size for the default device
page size, rather than the kernel's page size. There is not currently an
API to obtain the IOMMU's page size across all architectures and in the
interest of a stop-gap fix to this functional issue, default the NVMe
device page size to 4K, with the intent of adding such an API and
implementation across all architectures in the next merge window.

With the functionally equivalent v3 of this patch, our hardware test
exerciser survives when using 32-bit DMA; without the patch, the kernel
will BUG within a few minutes.

Signed-off-by: Nishanth Aravamudan <nacc at linux.vnet.ibm.com>
Signed-off-by: Jens Axboe <axboe@fb.com>

c5c9f25b
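
The arithmetic behind the 0xF000 example in c5c9f25b, as a small sketch.
is_multiple_of() is a hypothetical helper, not a function from the driver,
and it assumes the page size is a power of two.

```c
#include <assert.h>
#include <stdint.h>

/* True if len is a whole number of pages; page_size must be a power of two. */
static int is_multiple_of(uint32_t len, uint32_t page_size)
{
	return (len & (page_size - 1)) == 0;
}
```

0xF000 is 15 * 4K, so it passes the check for a 4K page, but it leaves a
remainder of 0x1000 against an 8K page, which is the mismatch the commit
message describes.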

Merge tag 'dm-4.4-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm · 6ffeba96

Linus Torvalds authored Nov 24, 2015

Pull device mapper fixes from Mike Snitzer:
 "Two fixes for 4.4-rc1's DM ioctl changes that introduced the potential
  for infinite recursion on ioctl (with DM multipath).

  And four stable fixes:

   - A DM thin-provisioning fix to restore 'error_if_no_space' setting
     when a thin-pool is made writable again (after having been out of
     space).

   - A DM thin-provisioning fix to properly advertise discard support
     for thin volumes that are stacked on a thin-pool whose underlying
     data device doesn't support discards.

   - A DM ioctl fix to allow ctrl-c to break out of an ioctl retry loop
     when DM multipath is configured to 'queue_if_no_path'.

   - A DM crypt fix for a possible hang on dm-crypt device removal"

* tag 'dm-4.4-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
  dm thin: fix regression in advertised discard limits
  dm crypt: fix a possible hang due to race condition on exit
  dm mpath: fix infinite recursion in ioctl when no paths and !queue_if_no_path
  dm: do not reuse dm_blk_ioctl block_device input as local variable
  dm: fix ioctl retry termination with signal
  dm thin: restore requested 'error_if_no_space' setting on OODS to WRITE transition

6ffeba96

pidns: fix NULL dereference in __task_pid_nr_ns() · 81b1a832

Eric Dumazet authored Nov 24, 2015

I got a crash during a "perf top" session that was caused by a race in
__task_pid_nr_ns() :

pid_nr_ns() was inlined, but apparently the compiler chose to read
task->pids[type].pid twice, and the pid->level dereference crashed
because we got a NULL pointer at the second read :

    if (pid && ns->level <= pid->level) { // CRASH

Just use RCU API properly to solve this race, and not worry about "perf
top" crashing hosts :(

get_task_pid() can benefit from same fix.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

81b1a832
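
A userspace sketch of the double-read problem described in 81b1a832. The
commit itself says it uses the RCU API; the code below does not use RCU and
only imitates one aspect of it, namely loading the pointer exactly once so
that the NULL test and the dereference see the same value. struct pid_sketch
and level_or_minus_one() are hypothetical names, and object lifetime (which
RCU also covers) is outside the scope of this example.

```c
#include <assert.h>
#include <stddef.h>

struct pid_sketch {
	int level;
};

static int level_or_minus_one(struct pid_sketch **slot)
{
	/*
	 * One volatile load: the compiler may not re-read *slot between
	 * the NULL check and the dereference below.
	 */
	struct pid_sketch *pid = *(struct pid_sketch *volatile *)slot;

	return pid ? pid->level : -1;
}
```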