- 13 Mar, 2019 16 commits
-
-
Christoph Hellwig authored
Add a gendisk argument to nvme_config_write_zeroes so that the call to nvme_update_disk_info for the multipath device node updates the proper request_queue. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Tested-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Christoph Hellwig authored
Add a gendisk argument to nvme_config_discard so that the call to nvme_update_disk_info for the multipath device node updates the proper request_queue. Signed-off-by: Christoph Hellwig <hch@lst.de> Reported-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Tested-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Christoph Hellwig authored
Just opencode the two function calls in the caller. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Tested-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Christoph Hellwig authored
Qemu started out with a broken implementation of Write Zeroes written by yours truly. Disable Write Zeroes on qemu for now, eventually we need to go back and make all the qemu quirks version specific, but that is left for another time. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Tested-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
James Smart authored
The FC-NVME spec, when finally approved, modified the disconnect LS such that the only scope available is the association. Rework the Disconnect LS processing to be in accordance with the change. Signed-off-by: Nigel Kirkland <nigel.kirkland@broadcom.com> Signed-off-by: James Smart <jsmart2021@gmail.com> Reviewed-by: Ewan D. Milne <emilne@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
James Smart authored
There are two changes: 1) The logic in the __nvmet_fc_free_assoc() routine is bad. It uses "safe" routines assuming pointers will come back valid. However, the intervening next structure being linked can be removed from the list and the resulting safe pointers are bad, resulting in NULL ptrs being hit. Correct by scheduling a work element to perform the association delete, which can be done while under lock. 2) Prior patch that added the work element scheduling left a possible reference on the object if the work element couldn't be scheduled. Correct by doing the put on a failing schedule_work() call. Signed-off-by: Nigel Kirkland <nigel.kirkland@broadcom.com> Signed-off-by: James Smart <jsmart2021@gmail.com> Reviewed-by: Ewan D. Milne <emilne@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
James Smart authored
If: - A successful connect has occurred with an io queue count greater than zero and namespaces detected and running. - An error or something occurs which causes a termination of the prior association and then starts a reconnect, - The reconnect then creates a new controller, but for whatever reason, nvme_set_queue_count() results in io queue count set to zero. This will skip io queue and tag set changes. - But... the controller will transition to live, calling nvme_start_ctrl, which calls nvme_start_queues(), which then releases I/Os into the transport which then sends them to the driver. As there are no queues, things eventually hit the driver looking for a handle, which was cleared when the original controller was reset, and it can't proceed. Worst case, things progress, but everything fails. In the failing scenario, the nvme_set_features(NVME_FEAT_NUM_QUEUES) command actually failed with a NVME_SC_INTERNAL error. For some reason, although nvme_set_queue_count() saw the error and set io queue count to zero, it doesn't return a failure status to the transport, which allows the transport to continue using the controller. Fix the problem by simply rejecting the new association if at least 1 I/O queue can't be created. The association reject will fail the reconnect attempt and fall into the reconnect retry policy. Signed-off-by: James Smart <jsmart2021@gmail.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
James Smart authored
A recent change added a numa_node field to the nvme controller and has the transport assign the node using dev_to_node(). However, fcloop registers with a NULL device struct, so the dev_to_node() call oops. Revise the assignment to assign no node when device struct is null. Fixes: 103e515e ("nvme: add a numa_node field to struct nvme_ctrl") Reported-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: James Smart <jsmart2021@gmail.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Mike Snitzer <snitzer@redhat.com> [hch: small coding style fixup] Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
James Smart authored
For some nvme command, when issued by the nvme core layer, there is an internal buffer which can cause blk_rq_payload_bytes() to return a non-zero value yet there is no actual/real command payload and sg list. An example is the WRITE ZEROES command. To address this, when making choices on whether to dma map an sgl, use blk_rq_nr_phys_segments() instead of blk_rq_payload_bytes(). When there is a sgl, blk_rq_payload_bytes() will return the amount of data to be transferred by the sgl. Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: James Smart <jsmart2021@gmail.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Yufen Yu authored
After commit 4d43d395 (workqueue: Try to catch flush_work() without INIT_WORK()), it can cause warning when delete nvme-loop device, trace like: [ 76.601272] Call Trace: [ 76.601646] ? del_timer+0x72/0xa0 [ 76.602156] __cancel_work_timer+0x1ae/0x270 [ 76.602791] cancel_work_sync+0x14/0x20 [ 76.603407] nvmet_ctrl_free+0x1b7/0x2f0 [nvmet] [ 76.604091] ? free_percpu+0x168/0x300 [ 76.604652] nvmet_sq_destroy+0x106/0x240 [nvmet] [ 76.605346] nvme_loop_destroy_admin_queue+0x30/0x60 [nvme_loop] [ 76.606220] nvme_loop_shutdown_ctrl+0xc3/0xf0 [nvme_loop] [ 76.607026] nvme_loop_delete_ctrl_host+0x19/0x30 [nvme_loop] [ 76.607871] nvme_do_delete_ctrl+0x75/0xb0 [ 76.608477] nvme_sysfs_delete+0x7d/0xc0 [ 76.609057] dev_attr_store+0x24/0x40 [ 76.609603] sysfs_kf_write+0x4c/0x60 [ 76.610144] kernfs_fop_write+0x19a/0x260 [ 76.610742] __vfs_write+0x1c/0x60 [ 76.611246] vfs_write+0xfa/0x280 [ 76.611739] ksys_write+0x6e/0x120 [ 76.612238] __x64_sys_write+0x1e/0x30 [ 76.612787] do_syscall_64+0xbf/0x3a0 [ 76.613329] entry_SYSCALL_64_after_hwframe+0x44/0xa9 We fix it by moving fatal_err_work init to nvmet_alloc_ctrl(), which may more reasonable. Signed-off-by: Yufen Yu <yuyufen@huawei.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Yufen Yu authored
After commit a686ed75 ("nvme: introduce a helper function for controller deletion), nvme_delete_ctrl_sync no longer use flush_work. Update comment, accordingly. Signed-off-by: Yufen Yu <yuyufen@huawei.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Sagi Grimberg authored
In case nvme_alloc_ns fails after we initialize ns_head but before we add the ns to the controller namespaces list we need to explicitly put the ns_head reference because when we teardown the controller we won't find it, causing us to leak a dangling subsystem eventually. Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Keith Busch authored
The field is defined to be a 24 byte array, we don't need to multiply the sizeof() that field by the number of dwords it covers. Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Keith Busch authored
A write or flush IO passthrough command is expected to change the logical block content, so don't warn on these as no additional handling is necessary. Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Max Gurtovoy authored
This will print get-feature cmd in more informative way. For example, run "nvme get-feature /dev/nvme0 -n 1 -f 0x9 -c 10" will trace: nvme-3907 [008] .... 1763.635054: nvme_setup_cmd: nvme0: qid=0, cmdid=6, nsid=1, flags=0x0, meta=0x0, cmd=(nvme_admin_get_features fid=0x9 sel=0x0 cdw11=0xa) <idle>-0 [001] d.h. 1763.635112: nvme_sq: nvme0: qid=0, head=27, tail=27 <idle>-0 [008] ..s. 1763.635121: nvme_complete_rq: nvme0: qid=0, cmdid=6, res=10, retries=0, flags=0x2, status=0 Signed-off-by: Max Gurtovoy <maxg@mellanox.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
https://github.com/liu-song-6/linuxJens Axboe authored
Pull MD fixes from Song. * 'for-5.1/md-post' of https://github.com/liu-song-6/linux: md: Fix failed allocation of md_register_thread It's wrong to add len to sector_nr in raid10 reshape twice raid5: set write hint for PPL
-
- 12 Mar, 2019 3 commits
-
-
Aditya Pakki authored
mddev->sync_thread can be set to NULL on kzalloc failure downstream. The patch checks for such a scenario and frees allocated resources. Committer node: Added similar fix to raid5.c, as suggested by Guoqing. Cc: stable@vger.kernel.org # v3.16+ Acked-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Aditya Pakki <pakki001@umn.edu> Signed-off-by: Song Liu <songliubraving@fb.com>
-
Xiao Ni authored
In reshape_request it already adds len to sector_nr already. It's wrong to add len to sector_nr again after adding pages to bio. If there is bad block it can't copy one chunk at a time, it needs to goto read_more. Now the sector_nr is wrong. It can cause data corruption. Cc: stable@vger.kernel.org # v3.16+ Signed-off-by: Xiao Ni <xni@redhat.com> Signed-off-by: Song Liu <songliubraving@fb.com>
-
Mariusz Dabrowski authored
When the Partial Parity Log is enabled, circular buffer is used to store PPL data. Each write to RAID device causes overwrite of data in this buffer so some write_hint can be set to those request to help drives handle garbage collection. This patch adds new sysfs attribute which can be used to specify which write_hint should be assigned to PPL. Acked-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Mariusz Dabrowski <mariusz.dabrowski@intel.com> Signed-off-by: Song Liu <songliubraving@fb.com>
-
- 07 Mar, 2019 1 commit
-
-
Javier González authored
When calculating the maximun I/O size allowed into the buffer, consider the write size (ws_opt) used by the write thread in order to cover the case in which, due to flushes, the mem and subm pointers are disaligned by (ws_opt - 1). This case currently translates into a stall when an I/O of the largest possible size is submitted. Fixes: f9f9d1ae2c66 ("lightnvm: pblk: prevent stall due to wb threshold") Signed-off-by: Javier González <javier@javigon.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 06 Mar, 2019 2 commits
-
-
Ming Lei authored
blk_recount_segments() can be called in bio_add_pc_page() for calculating how many segments this bio will has after one page is added to this bio. If the resulted segment number is beyond the queue limit, the added page will be removed. The try-and-fix policy requires blk_recount_segments(__blk_recalc_rq_segments) to not consider the segment number limit. Unfortunately bvec_split_segs() does check this limit, and causes small segment number returned to bio_add_pc_page(), then page still may be added to the bio even though segment number limit becomes broken. Fixes this issue by not considering segment number limit when calcualting bio's segment number. Fixes: dcebd755 ("block: use bio_for_each_bvec() to compute multi-page bvec count") Cc: Christoph Hellwig <hch@lst.de> Cc: Omar Sandoval <osandov@fb.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Jens Axboe authored
Merge branch 'stable/for-jens-5.1' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen into for-5.1/block-post Pull two xen blkback fixes from Konrad. * 'stable/for-jens-5.1' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen: xen/blkback: rework connect_ring() to avoid inconsistent xenstore 'ring-page-order' set by malicious blkfront xen/blkback: add stack variable 'blkif' in connect_ring()
-
- 02 Mar, 2019 1 commit
-
-
Ming Lei authored
When the current bvec can be merged to the 1st segment, the bio's front segment size has to be updated. However, dcebd755 doesn't consider that case, then bio's front segment size may not be correct. This patch fixes this issue. Cc: Christoph Hellwig <hch@lst.de> Cc: Omar Sandoval <osandov@fb.com> Fixes: dcebd755 ("block: use bio_for_each_bvec() to compute multi-page bvec count") Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 28 Feb, 2019 9 commits
-
-
Keyur Patel authored
Replace hard coded function name register_blkdev with __func__, to improve robustness and to conform to the Linux kernel coding style. Issue found using checkpatch. Signed-off-by: Keyur Patel <iamkeyur96@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Li RongQing authored
genlmsg_reply can fail, so propagate its return code Signed-off-by: Li RongQing <lirongqing@baidu.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
YueHaibing authored
Fixes gcc '-Wunused-but-set-variable' warning: drivers/block/floppy.c: In function 'request_done': drivers/block/floppy.c:2233:24: warning: variable 'q' set but not used [-Wunused-but-set-variable] It's never used and can be removed. Acked-by: Jiri Kosina <jkosina@suse.cz> Signed-off-by: YueHaibing <yuehaibing@huawei.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Heinz Mauelshagen authored
null_handle_bio() erroneously uses the bio_op macro which masks respective request flag bits including REQ_FUA out thus failing the check. Fix by checking bio->bi_opf directly. Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
zhengbin authored
If __device_add_disk-->bdi_register_owner-->bdi_register--> bdi_register_va-->device_create_vargs fails, bdi->dev is still NULL, __device_add_disk-->register_disk will visit bdi->dev->kobj. This patch fixes that. Signed-off-by: zhengbin <zhengbin13@huawei.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Carlos Maiolino authored
guard_bio_eod() can truncate a segment in bio to allow it to do IO on odd last sectors of a device. It already checks if the IO starts past EOD, but it does not consider the possibility of an IO request starting within device boundaries can contain more than one segment past EOD. In such cases, truncated_bytes can be bigger than PAGE_SIZE, and will underflow bvec->bv_len. Fix this by checking if truncated_bytes is lower than PAGE_SIZE. This situation has been found on filesystems such as isofs and vfat, which doesn't check the device size before mount, if the device is smaller than the filesystem itself, a readahead on such filesystem, which spans EOD, can trigger this situation, leading a call to zero_user() with a wrong size possibly corrupting memory. I didn't see any crash, or didn't let the system run long enough to check if memory corruption will be hit somewhere, but adding instrumentation to guard_bio_end() to check truncated_bytes size, was enough to see the error. The following script can trigger the error. MNT=/mnt IMG=./DISK.img DEV=/dev/loop0 mkfs.vfat $IMG mount $IMG $MNT cp -R /etc $MNT &> /dev/null umount $MNT losetup -D losetup --find --show --sizelimit 16247280 $IMG mount $DEV $MNT find $MNT -type f -exec cat {} + >/dev/null Kudos to Eric Sandeen for coming up with the reproducer above Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Dongli Zhang authored
Replace set->map[0] with set->map[HCTX_TYPE_DEFAULT] to avoid hardcoding. Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Christoph Hellwig authored
There is no need to only iterate in chunks of PAGE_SIZE or less in bvec_iter_advance, given that the callers pass in the chunk length that they are operating on - either that already is less than PAGE_SIZE because they do classic page-based iteration, or it is larger because the caller operates on multi-page bvecs. This should help shaving off a few cycles of the I/O hot path. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Ming Lei authored
mp_bvec_for_each_segment() is a bit big for the iteration, so introduce a light-weight helper for iterating over pages, then 32bytes stack space can be saved. Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 27 Feb, 2019 3 commits
-
-
Ming Lei authored
Introduce a fast path for single-page bvec IO, then we can avoid to call bvec_split_segs() unnecessarily. Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Ming Lei authored
Introduce a fast path for single-page bvec IO, then blk_bvec_map_sg() can be avoided. Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Ming Lei authored
Single-page bvec can often be seen in small BS workloads, so introduce bvec_nth_page() for avoiding to call nth_page() unnecessarily, which looks not cheap. Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 24 Feb, 2019 5 commits
-
-
Christoph Hellwig authored
Store the request queue the last bio was submitted to in the iocb private data in addition to the cookie so that we find the right block device. Also refactor the common direct I/O bio submission code into a nice little helper. Signed-off-by: Christoph Hellwig <hch@lst.de> Modified to use bio_set_polled(). Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Jens Axboe authored
For the upcoming async polled IO, we can't sleep allocating requests. If we do, then we introduce a deadlock where the submitter already has async polled IO in-flight, but can't wait for them to complete since polled requests must be active found and reaped. Utilize the helper in the blockdev DIRECT_IO code. Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Christoph Hellwig authored
Just call blk_poll on the iocb cookie, we can derive the block device from the inode trivially. Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Christoph Hellwig authored
This new methods is used to explicitly poll for I/O completion for an iocb. It must be called for any iocb submitted asynchronously (that is with a non-null ki_complete) which has the IOCB_HIPRI flag set. The method is assisted by a new ki_cookie field in struct iocb to store the polling cookie. Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Dongli Zhang authored
xen/blkback: rework connect_ring() to avoid inconsistent xenstore 'ring-page-order' set by malicious blkfront The xenstore 'ring-page-order' is used globally for each blkback queue and therefore should be read from xenstore only once. However, it is obtained in read_per_ring_refs() which might be called multiple times during the initialization of each blkback queue. If the blkfront is malicious and the 'ring-page-order' is set in different value by blkfront every time before blkback reads it, this may end up at the "WARN_ON(i != (XEN_BLKIF_REQS_PER_PAGE * blkif->nr_ring_pages));" in xen_blkif_disconnect() when frontend is destroyed. This patch reworks connect_ring() to read xenstore 'ring-page-order' only once. Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
-