- 30 Nov, 2018 3 commits
-
-
Jens Axboe authored
Even if we have no waiters on any of the sbitmap_queue wait states, we still have to loop every entry to check. We do this for every IO, so the cost adds up. Shift a bit of the cost to the slow path, when we actually have waiters. Wrap prepare_to_wait_exclusive() and finish_wait(), so we can maintain an internal count of how many are currently active. Then we can simply check this count in sbq_wake_ptr() and not have to loop if we don't have any sleepers. Convert the two users of sbitmap with waiting, blk-mq-tag and iSCSI. Reviewed-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Jens Axboe authored
sbitmap maintains a set of words that we use to set and clear bits, with each bit representing a tag for blk-mq. Even though we spread the bits out and maintain a hint cache, one particular bit allocated will end up being cleared in the exact same spot. This introduces batched clearing of bits. Instead of clearing a given bit, the same bit is set in a cleared/free mask instead. If we fail allocating a bit from a given word, then we check the free mask, and batch move those cleared bits at that time. This trades 64 atomic bitops for 2 cmpxchg(). In a threaded poll test case, half the overhead of getting and clearing tags is removed with this change. On another poll test case with a single thread, performance is unchanged. Reviewed-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Christoph Hellwig authored
The bio referencing has a trick that doesn't do any actual atomic inc/dec on the reference count until we have to elevator to > 1. For the async IO O_DIRECT case, we can't use the simple DIO variants, so we use __blkdev_direct_IO(). It always grabs an extra reference to the bio after allocation, which means we then enter the slower path of actually having to do atomic_inc/dec on the count. We don't need to do that for the async case, unless we end up going multi-bio, in which case we're already doing huge amounts of IO. For the smaller IO case (< BIO_MAX_PAGES), we can do without the extra ref. Based on an earlier patch (and commit log) from Jens Axboe. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 29 Nov, 2018 10 commits
-
-
Jens Axboe authored
If we aren't forced to do round robin tag allocation, just use the allocation hint to find the index for the tag word, don't use it for the offset inside the word. This avoids a potential extra round trip in the bit looping, and since we're fetching this cacheline, we may as well check the whole word from the start. Reviewed-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Jens Axboe authored
If we have that hook, we know the driver handles bd->last == true in a smart fashion. If it does, even for multiple hardware queues, it's a good idea to flush batches of requests to the device, if we have batches of requests from the submitter. Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Jens Axboe authored
If we are issuing a list of requests, we know if we're at the last one. If we fail issuing, ensure that we call ->commits_rqs() to flush any potential previous requests. Reviewed-by: Omar Sandoval <osandov@fb.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Jens Axboe authored
We need this for blk-mq to kick things into gear, if we told it that we had more IO coming, but then failed to deliver on that promise. Reviewed-by: Omar Sandoval <osandov@fb.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Jens Axboe authored
We need this for blk-mq to kick things into gear, if we told it that we had more IO coming, but then failed to deliver on that promise. Reviewed-by: Omar Sandoval <osandov@fb.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Jens Axboe authored
Split the command submission and the SQ doorbell ring, and add the doorbell ring as our ->commit_rqs() hook. This allows a list of requests to be issued, with nvme only writing the SQ update when it's necessary. This is more efficient if we have lists of requests to issue, particularly on virtualized hardware, where writing the SQ doorbell is more expensive than on real hardware. For those cases, performance increases of 2-3x have been observed. The use case for this is plugged IO, where blk-mq flushes a batch of requests at the time. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Jens Axboe authored
blk-mq passes information to the hardware about any given request being the last that we will issue in this sequence. The point is that hardware can defer costly doorbell type writes to the last request. But if we run into errors issuing a sequence of requests, we may never send the request with bd->last == true set. For that case, we need a hook that tells the hardware that nothing else is coming right now. For failures returned by the drivers ->queue_rq() hook, the driver is responsible for flushing pending requests, if it uses bd->last to optimize that part. This works like before, no changes there. Reviewed-by: Omar Sandoval <osandov@fb.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Jens Axboe authored
Only do it if we have requests for multiple queues in the same plug. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Dan Carpenter authored
I recently found some code which called blk_mq_free_map_and_requests() with a NULL set->tags pointer. I fixed the caller, but it seems like a good idea to add a NULL check here as well. Now we can call: blk_mq_free_tag_set(set); blk_mq_free_tag_set(set); twice in a row and it's harmless. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Dan Carpenter authored
Smatch complains that there is an off by one if the allocation fails in: DMABuffer = atari_stram_alloc(BUFFER_SIZE+512, "ataflop"); In that situation, "i" would be point to one element beyond the end of the unit[] array. There is a second bug because the error handling calls blk_mq_free_tag_set(&unit[i].tag_set); regardless of whether "disk->queue" is NULL or non-NULL. So if blk_mq_init_sq_queue() fails, then that means unit[i].tag_set->tags is NULL and it leads to an Oops. It's easiest to call put_disk() before the goto to clean up the partial iteration. Then the earlier unit[] elements are fully allocated so we can remove the checks whether "disk->queue" is NULL and the code is simpler. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 28 Nov, 2018 4 commits
-
-
Weiping Zhang authored
Give a interface to adjust io timeout(ms) by device. Signed-off-by: Weiping Zhang <zhangweiping@didiglobal.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Yufen Yu authored
We recently got a stack by syzkaller like this: BUG: sleeping function called from invalid context at mm/slab.h:361 in_atomic(): 1, irqs_disabled(): 0, pid: 6644, name: blkid INFO: lockdep is turned off. CPU: 1 PID: 6644 Comm: blkid Not tainted 4.4.163-514.55.6.9.x86_64+ #76 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014 0000000000000000 5ba6a6b879e50c00 ffff8801f6b07b10 ffffffff81cb2194 0000000041b58ab3 ffffffff833c7745 ffffffff81cb2080 5ba6a6b879e50c00 0000000000000000 0000000000000001 0000000000000004 0000000000000000 Call Trace: <IRQ> [<ffffffff81cb2194>] __dump_stack lib/dump_stack.c:15 [inline] <IRQ> [<ffffffff81cb2194>] dump_stack+0x114/0x1a0 lib/dump_stack.c:51 [<ffffffff8129a981>] ___might_sleep+0x291/0x490 kernel/sched/core.c:7675 [<ffffffff8129ac33>] __might_sleep+0xb3/0x270 kernel/sched/core.c:7637 [<ffffffff81794c13>] slab_pre_alloc_hook mm/slab.h:361 [inline] [<ffffffff81794c13>] slab_alloc_node mm/slub.c:2610 [inline] [<ffffffff81794c13>] slab_alloc mm/slub.c:2692 [inline] [<ffffffff81794c13>] kmem_cache_alloc_trace+0x2c3/0x5c0 mm/slub.c:2709 [<ffffffff81cbe9a7>] kmalloc include/linux/slab.h:479 [inline] [<ffffffff81cbe9a7>] kzalloc include/linux/slab.h:623 [inline] [<ffffffff81cbe9a7>] kobject_uevent_env+0x2c7/0x1150 lib/kobject_uevent.c:227 [<ffffffff81cbf84f>] kobject_uevent+0x1f/0x30 lib/kobject_uevent.c:374 [<ffffffff81cbb5b9>] kobject_cleanup lib/kobject.c:633 [inline] [<ffffffff81cbb5b9>] kobject_release+0x229/0x440 lib/kobject.c:675 [<ffffffff81cbb0a2>] kref_sub include/linux/kref.h:73 [inline] [<ffffffff81cbb0a2>] kref_put include/linux/kref.h:98 [inline] [<ffffffff81cbb0a2>] kobject_put+0x72/0xd0 lib/kobject.c:692 [<ffffffff8216f095>] put_device+0x25/0x30 drivers/base/core.c:1237 [<ffffffff81c4cc34>] delete_partition_rcu_cb+0x1d4/0x2f0 block/partition-generic.c:232 [<ffffffff813c08bc>] __rcu_reclaim kernel/rcu/rcu.h:118 [inline] [<ffffffff813c08bc>] rcu_do_batch kernel/rcu/tree.c:2705 [inline] [<ffffffff813c08bc>] invoke_rcu_callbacks kernel/rcu/tree.c:2973 [inline] [<ffffffff813c08bc>] __rcu_process_callbacks kernel/rcu/tree.c:2940 [inline] [<ffffffff813c08bc>] rcu_process_callbacks+0x59c/0x1c70 kernel/rcu/tree.c:2957 [<ffffffff8120f509>] __do_softirq+0x299/0xe20 kernel/softirq.c:273 [<ffffffff81210496>] invoke_softirq kernel/softirq.c:350 [inline] [<ffffffff81210496>] irq_exit+0x216/0x2c0 kernel/softirq.c:391 [<ffffffff82c2cd7b>] exiting_irq arch/x86/include/asm/apic.h:652 [inline] [<ffffffff82c2cd7b>] smp_apic_timer_interrupt+0x8b/0xc0 arch/x86/kernel/apic/apic.c:926 [<ffffffff82c2bc25>] apic_timer_interrupt+0xa5/0xb0 arch/x86/entry/entry_64.S:746 <EOI> [<ffffffff814cbf40>] ? audit_kill_trees+0x180/0x180 [<ffffffff8187d2f7>] fd_install+0x57/0x80 fs/file.c:626 [<ffffffff8180989e>] do_sys_open+0x45e/0x550 fs/open.c:1043 [<ffffffff818099c2>] SYSC_open fs/open.c:1055 [inline] [<ffffffff818099c2>] SyS_open+0x32/0x40 fs/open.c:1050 [<ffffffff82c299e1>] entry_SYSCALL_64_fastpath+0x1e/0x9a In softirq context, we call rcu callback function delete_partition_rcu_cb(), which may allocate memory by kzalloc with GFP_KERNEL flag. If the allocation cannot be satisfied, it may sleep. However, That is not allowed in softirq contex. Although we found this problem on linux 4.4, the latest kernel version seems to have this problem as well. And it is very similar to the previous one: https://lkml.org/lkml/2018/7/9/391 Fix it by using RCU workqueue, which allows sleep. Reviewed-by: Paul E. McKenney <paulmck@linux.ibm.com> Signed-off-by: Yufen Yu <yuyufen@huawei.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Jens Axboe authored
If we yank a 'same_queue_rq' request off the plug list, we should also decrement the cached request count. Fixes: 5f0ed774 ("block: sum requests in the plug structure") Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Young Xiao authored
__vdc_tx_trigger should only loop on EAGAIN a finite number of times. See commit adddc32d ("sunvnet: Do not spin in an infinite loop when vio_ldc_send() returns EAGAIN") for detail. Signed-off-by: Young Xiao <YangX92@hotmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 26 Nov, 2018 10 commits
-
-
Jens Axboe authored
This isn't exactly the same as the previous count, as it includes requests for all devices. But that really doesn't matter, if we have more than the threshold (16) queued up, flush it. It's not worth it to have an expensive list loop for this. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Keith Busch authored
There are no more users relying on blk-mq request states to prevent double completions, so replace the relatively expensive cmpxchg operation with WRITE_ONCE. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Keith Busch authored
The scsi timeout error handling had been directly updating the block layer's request state to prevent a error handling and a natural completion from completing the same request twice. Fix this layering violation by having scsi control the fate of its commands with scsi owned flags rather than use blk-mq's. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Keith Busch authored
A driver may have internal state to cleanup if we're pretending a request didn't complete. Return 'false' if the command wasn't actually completed due to the timeout error injection, and true otherwise. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Jens Axboe authored
It's pointless to do so, we are by definition on the CPU we want/need to be, as that's the one waiting for a completion event. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Jens Axboe authored
Right now we immediately bail if need_resched() is true, but we need to do at least one loop in case we have entries waiting. So just invert the need_resched() check, putting it at the bottom of the loop. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Jens Axboe authored
blk_poll() has always kept spinning until it found an IO. This is fine for SYNC polling, since we need to find one request we have pending, but in preparation for ASYNC polling it can be beneficial to just check if we have any entries available or not. Existing callers are converted to pass in 'spin == true', to retain the old behavior. Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Jens Axboe authored
It doesn't set HIPRI on the bio, so polling for it is pretty silly. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Jens Axboe authored
We always pass in -1 now and none of the callers use the tag value, remove the parameter. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Jens Axboe authored
If we want to support async IO polling, then we have to allow finding completions that aren't just for the one we are looking for. Always pass in -1 to the mq_ops->poll() helper, and have that return how many events were found in this poll loop. Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 21 Nov, 2018 2 commits
-
-
Ming Lei authored
Even though .mq_kobj, ctx->kobj and q->kobj share same lifetime from block layer's view, actually they don't because userspace may grab one kobject anytime via sysfs. This patch fixes the issue by the following approach: 1) introduce 'struct blk_mq_ctxs' for holding .mq_kobj and managing all ctxs 2) free all allocated ctxs and the 'blk_mq_ctxs' instance in release handler of .mq_kobj 3) grab one ref of .mq_kobj before initializing each ctx->kobj, so that .mq_kobj is always released after all ctxs are freed. This patch fixes kernel panic issue during booting when DEBUG_KOBJECT_RELEASE is enabled. Reported-by: Guenter Roeck <linux@roeck-us.net> Cc: "jianchao.wang" <jianchao.w.wang@oracle.com> Tested-by: Guenter Roeck <linux@roeck-us.net> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Jens Axboe authored
If the first request allocated and issued by a process is a passhthrough request, we don't set up an IO context for it. Ensure that blk_mq_sched_assign_ioc() ignores a NULL io_context. Fixes: e2b3fa5a ("block: Remove bio->bi_ioc") Reported-by: Ming Lei <ming.lei@redhat.com> Tested-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 20 Nov, 2018 6 commits
-
-
Damien Le Moal authored
For the synchronous I/O path case (read(), write() etc system calls), a BIO I/O priority is not initialized until the execution of blk_init_request_from_bio() when the BIO is submitted and a request initialized for the BIO execution. This is due to the ki_ioprio field of the struct kiocb defined on stack being always initialized to IOPRIO_CLASS_NONE, regardless of the calling process I/O context ioprio value set with ioprio_set(). This late initialization can result in the BIO being merged to pending requests even when the I/O priorities differ. Fix this by initializing the ki_iopriority field of on stack struct kiocb using the get_current_ioprio() helper, ensuring that all BIOs allocated and submitted for the system call execution see the correct intended I/O priority early. With this, since a BIO I/O priority is always set to the intended effective value for both the sync and async path, blk_init_request_from_bio() can be simplified. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Adam Manzanares <adam.manzanares@wdc.com> Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Damien Le Moal authored
Growing in size a high priority request by merging it with a lower priority BIO or request will increase the request execution time. This is the opposite result of the desired effect of high I/O priorities, namely getting low I/O latencies. Prevent merging of requests and BIOs that have different I/O priorities to fix this. Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Damien Le Moal authored
For cases when the application does not specify aio_reqprio for an aio, fallback to use get_current_ioprio() to obtain the task I/O priority last set using ioprio_set() rather than the hardcoded IOPRIO_CLASS_NONE value. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Adam Manzanares <adam.manzanares@wdc.com> Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Damien Le Moal authored
Define get_current_ioprio() as an inline helper to obtain the caller I/O priority from its task I/O context. Use this helper in blk_init_request_from_bio() to set a request ioprio. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Damien Le Moal authored
bio->bi_ioc is never set so always NULL. Remove references to it in bio_disassociate_task() and in rq_ioc() and delete this field from struct bio. With this change, rq_ioc() always returns current->io_context without the need for a bio argument. Further simplify the code and make it more readable by also removing this helper, which also allows to simplify blk_mq_sched_assign_ioc() by removing its bio argument. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Adam Manzanares <adam.manzanares@wdc.com> Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Damien Le Moal authored
Comment the use of the IOCB_FLAG_IOPRIO aio flag similarly to the IOCB_FLAG_RESFD flag. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 19 Nov, 2018 4 commits
-
-
Jens Axboe authored
It's specifically looking for a given request, which we will not be supporting going forward. Also kill the qla2xxx poll implementation as that's the only user of the nvme-fc poll, and the now unused ->poll_queue() hook. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: James Smart <jsmart2021@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Jens Axboe authored
We currently only really support sync poll, ie poll with 1 IO in flight. This prepares us for supporting async poll. Note that the returned value isn't necessarily 100% accurate. If poll races with IRQ completion, we assume that the fact that the task is now runnable means we found at least one entry. In reality it could be more than 1, or not even 1. This is fine, the caller will just need to take this into account. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Jens Axboe authored
For the core poll helper, the task state setting don't need to imply any atomics, as it's the current task itself that is being modified and we're not going to sleep. For IRQ driven, the wakeup path have the necessary barriers to not need us using the heavy handed version of the task state setting. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Jens Axboe authored
We need a better way of configuring this, and given that polling is (still) a bit niche, let's default to using 0 poll queues. That way we'll have the same read/write/poll behavior as 4.20, and users that want to test/use polling are required to do manual configuration of the number of poll queues. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 18 Nov, 2018 1 commit
-
-
Jens Axboe authored
Merge in -rc3 to resolve a few conflicts, but also to get a few important fixes that have gone into mainline since the block 4.21 branch was forked off (most notably the SCSI queue issue, which is both a conflict AND needed fix). Signed-off-by: Jens Axboe <axboe@kernel.dk>
-