Commits · d4b7a5ef2b9c06def90d12db9b99bd12d75758fb · Kirill Smelkov / linux

19 Oct, 2021 25 commits

io_uring: inline completion batching helpers · d4b7a5ef

Pavel Begunkov authored Sep 24, 2021

We now have a single function for batched put of requests, just inline
struct req_batch and all related helpers into it.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/595a2917f80dd94288cd7203052c7934f5446580.1632516769.git.asml.silence@gmail.comSigned-off-by: Jens Axboe <axboe@kernel.dk>

d4b7a5ef

io_uring: optimise batch completion · f5ed3bcd

Pavel Begunkov authored Sep 24, 2021

First, convert rest of iopoll bits to single linked lists, and also
replace per-request list_add_tail() with splicing a part of slist.

With that, use io_free_batch_list() to put/free requests. The main
advantage of it is that it's now the only user of struct req_batch and
friends, and so they can be inlined. The main overhead there was
per-request call to not-inlined io_req_free_batch(), which is expensive
enough.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/b37fc6d5954b241e025eead7ab92c6f44a42f229.1632516769.git.asml.silence@gmail.comSigned-off-by: Jens Axboe <axboe@kernel.dk>

f5ed3bcd

io_uring: convert iopoll_completed to store_release · b3fa03fd

Pavel Begunkov authored Sep 24, 2021

Convert explicit barrier around iopoll_completed to smp_load_acquire()
and smp_store_release(). Similar on the callback side, but replaces a
single smp_rmb() with per-request smp_load_acquire(), neither imply any
extra CPU ordering for x86. Use READ_ONCE as usual where it doesn't
matter.

Use it to move filling CQEs by iopoll earlier, that will be necessary
to avoid traversing the list one extra time in the future.
Suggested-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/8bd663cb15efdc72d6247c38ee810964e744a450.1632516769.git.asml.silence@gmail.comSigned-off-by: Jens Axboe <axboe@kernel.dk>

b3fa03fd

io_uring: add a helper for batch free · 3aa83bfb

Pavel Begunkov authored Sep 24, 2021

Add a helper io_free_batch_list(), which takes a single linked list and
puts/frees all requests from it in an efficient manner. Will be reused
later.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/4fc8306b542c6b1dd1d08e8021ef3bdb0ad15010.1632516769.git.asml.silence@gmail.comSigned-off-by: Jens Axboe <axboe@kernel.dk>

3aa83bfb

io_uring: use single linked list for iopoll · 5eef4e87

Pavel Begunkov authored Sep 24, 2021

Use single linked lists for keeping iopoll requests, takes less space,
may be faster, but mostly will be of benefit for further patches.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/314033676b100cd485518c3bc55e1b95a0dcd71f.1632516769.git.asml.silence@gmail.comSigned-off-by: Jens Axboe <axboe@kernel.dk>

5eef4e87

io_uring: split iopoll loop · e3f721e6

Pavel Begunkov authored Sep 24, 2021

The main loop of io_do_iopoll() iterates and does ->iopoll() until it
meets a first completed request, then it continues from that position
and splices requests to pass them through io_iopoll_complete().

Split the loop in two for clearness, iopolling and reaping completed
requests from the list.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/a7f6fd27a94845e5dc925a47a4a9765a92e514fb.1632516769.git.asml.silence@gmail.comSigned-off-by: Jens Axboe <axboe@kernel.dk>

e3f721e6

io_uring: replace list with stack for req caches · c2b6c6bc

Pavel Begunkov authored Sep 24, 2021

Replace struct list_head free_list serving for caching requests with
singly linked stack, which is faster.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/1bc942b82422fb2624b8353bd93aca183a022846.1632516769.git.asml.silence@gmail.comSigned-off-by: Jens Axboe <axboe@kernel.dk>

c2b6c6bc

io-wq: add io_wq_work_node based stack · 0d9521b9

Pavel Begunkov authored Sep 24, 2021

Apart from just using lists (i.e. io_wq_work_list), we also want to have
stacks, which are a bit faster, and have some interoperability between
them. Add a stack implementation based on io_wq_work_node and some
helpers.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/5d3a412a5ac0d47e0f0499d70d2207d70a68925e.1632516769.git.asml.silence@gmail.comSigned-off-by: Jens Axboe <axboe@kernel.dk>

0d9521b9

io_uring: remove allocation cache array · 3ab665b7

Pavel Begunkov authored Sep 24, 2021

We have several of request allocation layers, remove the last one, which
is the submit->reqs array, and always use submit->free_reqs instead.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/8547095c35f7a87bab14f6447ecd30a273ed7500.1632516769.git.asml.silence@gmail.comSigned-off-by: Jens Axboe <axboe@kernel.dk>

3ab665b7

io_uring: use slist for completion batching · 6f33b0bc

Pavel Begunkov authored Sep 24, 2021

Currently we collect requests for completion batching in an array.
Replace them with a singly linked list. It's as fast as arrays but
doesn't take some much space in ctx, and will be used in future patches.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/a666826f2854d17e9fb9417fb302edfeb750f425.1632516769.git.asml.silence@gmail.comSigned-off-by: Jens Axboe <axboe@kernel.dk>

6f33b0bc

io_uring: make io_do_iopoll return number of reqs · 5ba3c874

Pavel Begunkov authored Sep 24, 2021

Don't pass nr_events pointer around but return directly, it's less
expensive than pointer increments.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/f771a8153a86f16f12ff4272524e9e549c5de40b.1632516769.git.asml.silence@gmail.comSigned-off-by: Jens Axboe <axboe@kernel.dk>

5ba3c874

io_uring: force_nonspin · 87a115fb

Pavel Begunkov authored Sep 24, 2021

We don't really need to pass the number of requests to complete into
io_do_iopoll(), a flag whether to enforce non-spin mode is enough.

Should be straightforward, maybe except io_iopoll_check(). We pass !min
there, because we do never enter with the number of already reaped
requests is larger than the specified @min, apart from the first
iteration, where nr_events is 0 and so the final check should be
identical.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/782b39d1d8ec584eae15bca0a1feb6f0571fe5b8.1632516769.git.asml.silence@gmail.comSigned-off-by: Jens Axboe <axboe@kernel.dk>

87a115fb

io_uring: mark having different creds unlikely · 6878b40e

Pavel Begunkov authored Sep 24, 2021

Hint the compiler that it's not as likely to have creds different from
current attached to a request. The current code generation is far from
ideal, hopefully it can help to some compilers to remove duplicated jump
tables and so.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/e7815251ac4bf5a4a23d298c752f029ae19f3837.1632516769.git.asml.silence@gmail.comSigned-off-by: Jens Axboe <axboe@kernel.dk>

6878b40e

io_uring: return boolean value for io_alloc_async_data · 8d4af685

Hao Xu authored Sep 22, 2021

boolean value is good enough for io_alloc_async_data.
Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
Link: https://lore.kernel.org/r/20210922101522.9179-1-haoxu@linux.alibaba.comSigned-off-by: Jens Axboe <axboe@kernel.dk>

8d4af685

io_uring: optimise io_req_init() sqe flags checks · 68fe256a

Pavel Begunkov authored Sep 15, 2021

IOSQE_IO_DRAIN is quite marginal and we don't care too much about
IOSQE_BUFFER_SELECT. Save to ifs and hide both of them under
SQE_VALID_FLAGS check. Now we first check whether it uses a "safe"
subset, i.e. without DRAIN and BUFFER_SELECT, and only if it's not
true we test the rest of the flags.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/dccfb9ab2ab0969a2d8dc59af88fa0ce44eeb1d5.1631703764.git.asml.silence@gmail.comSigned-off-by: Jens Axboe <axboe@kernel.dk>

68fe256a

io_uring: remove ctx referencing from complete_post · a3f34907

Pavel Begunkov authored Sep 15, 2021

Now completions are done from task context, that means that it's either
the task itself, task_work or io-wq worker. In all those cases the ctx
will be staying alive by mutexing, explicit referencing or req references
by iowq. Remove extra ctx pinning from io_req_complete_post().
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/60a0e96434c16ab4fe587651448290d61ec9a113.1631703756.git.asml.silence@gmail.comSigned-off-by: Jens Axboe <axboe@kernel.dk>

a3f34907

io_uring: add more uring info to fdinfo for debug · 83f84356

Hao Xu authored Sep 13, 2021

Developers may need some uring info to help themselves debug and address
issues in production. This includes sqring/cqring head/tail and the
detailed sqe/cqe info, which is very useful when an application is hung
on a ring.
Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
Link: https://lore.kernel.org/r/20210913130854.38542-1-haoxu@linux.alibaba.comSigned-off-by: Jens Axboe <axboe@kernel.dk>

83f84356

io_uring: kill extra wake_up_process in tw add · d97ec623

Pavel Begunkov authored Sep 08, 2021

TWA_SIGNAL already wakes the thread, no need in wake_up_process() after
it.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/7e90cf643f633e857443e0c9e72471b221735c50.1631115443.git.asml.silence@gmail.comSigned-off-by: Jens Axboe <axboe@kernel.dk>

d97ec623

io_uring: dedup CQE flushing non-empty checks · c450178d

Pavel Begunkov authored Sep 08, 2021

We don't do io_submit_flush_completions() when there is no requests
enqueued, and every single caller checks for it. Hide that check into
the function not forgetting about inlining. That will make it much
easier for changing the empty check condition in the future.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/d7ff8cef5da1b38e8ea648f5aad9a315ddfc7b57.1631115443.git.asml.silence@gmail.comSigned-off-by: Jens Axboe <axboe@kernel.dk>

c450178d

io_uring: inline linked part of io_req_find_next · d81499bf

Pavel Begunkov authored Sep 08, 2021

Inline part of __io_req_find_next() that returns a request but doesn't
need io_disarm_next(). It's just two places, but makes links a bit
faster.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/4126d13f23d0e91b39b3558e16bd86cafa7fcef2.1631115443.git.asml.silence@gmail.comSigned-off-by: Jens Axboe <axboe@kernel.dk>

d81499bf

io_uring: inline io_dismantle_req · 6b639522

Pavel Begunkov authored Sep 08, 2021

io_dismantle_req() is hot, and not _too_ huge. Inline it, there are 3
call sites, which hopefully will turn into 2 in the future.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/bdd2dc30716cac270c2403e99bccd6286e4ae201.1631115443.git.asml.silence@gmail.comSigned-off-by: Jens Axboe <axboe@kernel.dk>

6b639522

io_uring: kill off ios_left · 4b628aeb

Pavel Begunkov authored Sep 08, 2021

->ios_left is only used to decide whether to plug or not, kill it to
avoid this extra accounting, just use the initial submission number.
There is no much difference in regards of enabling plugging, where this
one does it in a few more cases, but all major ones should be covered
well.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/f13993bcf5b477f9a7d52881fc49f9457ea9870a.1631115443.git.asml.silence@gmail.comSigned-off-by: Jens Axboe <axboe@kernel.dk>

4b628aeb

io-wq: Remove duplicate code in io_workqueue_create() · 71e1cef2

Bixuan Cui authored Sep 11, 2021

While task_work_add() in io_workqueue_create() is true,
then duplicate code is executed:

  -> clear_bit_unlock(0, &worker->create_state);
  -> io_worker_release(worker);
  -> atomic_dec(&acct->nr_running);
  -> io_worker_ref_put(wq);
  -> return false;

  -> clear_bit_unlock(0, &worker->create_state); // back to io_workqueue_create()
  -> io_worker_release(worker);
  -> kfree(worker);

The io_worker_release() and clear_bit_unlock() are executed twice.

Fixes: 3146cba9 ("io-wq: make worker creation resilient against signals")
Signed-off-by: Bixuan Cui <cuibixuan@huawei.com>
Link: https://lore.kernel.org/r/20210911085847.34849-1-cuibixuan@huawei.comReviwed-by: Hao Xu <haoxu@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

71e1cef2

io_uring: dump sqe contents if issue fails · a87acfde

Jens Axboe authored Sep 11, 2021

I recently had to look at a production problem where a request ended
up getting the dreaded -EINVAL error on submit. The most used and
hence useless of error codes, as it just tells you that something
was wrong with your request, but not more than that.

Let's dump the full sqe contents if we run into an issue failure,
that'll allow easier diagnosing of a wide variety of issues.
Signed-off-by: Jens Axboe <axboe@kernel.dk>

a87acfde

block: fix too broad elevator check in blk_mq_free_request() · e0d78afe

Jens Axboe authored Oct 18, 2021

We added RQF_ELV to tell whether there's an IO scheduler attached, and
RQF_ELVPRIV tells us whether there's an IO scheduler with private data
attached. Don't check RQF_ELV in blk_mq_free_request(), what we care
about here is just if we have scheduler private data attached.

This fixes a boot crash

Fixes: 2ff0682d ("block: store elevator state in request")
Reported-by: Yi Zhang <yi.zhang@redhat.com>
Reported-by: syzbot+eb8104072aeab6cc1195@syzkaller.appspotmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

e0d78afe

18 Oct, 2021 15 commits

nvme: wire up completion batching for the IRQ path · 4f502245

Jens Axboe authored Oct 18, 2021

Trivial to do now, just need our own io_comp_batch on the stack and pass
that in to the usual command completion handling.

I pondered making this dependent on how many entries we had to process,
but even for a single entry there's no discernable difference in
performance or latency. Running a sync workload over io_uring:

t/io_uring -b512 -d1 -s1 -c1 -p0 -F1 -B1 -n2 /dev/nvme1n1 /dev/nvme2n1

yields the below performance before the patch:

IOPS=254820, BW=124MiB/s, IOS/call=1/1, inflight=(1 1)
IOPS=251174, BW=122MiB/s, IOS/call=1/1, inflight=(1 1)
IOPS=250806, BW=122MiB/s, IOS/call=1/1, inflight=(1 1)

and the following after:

IOPS=255972, BW=124MiB/s, IOS/call=1/1, inflight=(1 1)
IOPS=251920, BW=123MiB/s, IOS/call=1/1, inflight=(1 1)
IOPS=251794, BW=122MiB/s, IOS/call=1/1, inflight=(1 1)

which definitely isn't slower, about the same if you factor in a bit of
variance. For peak performance workloads, benchmarking shows a 2%
improvement.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

4f502245

io_uring: utilize the io batching infrastructure for more efficient polled IO · b688f11e

Jens Axboe authored Oct 12, 2021

Wire up using an io_comp_batch for f_op->iopoll(). If the lower stack
supports it, we can handle high rates of polled IO more efficiently.

This raises the single core efficiency on my system from ~6.1M IOPS to
~6.6M IOPS running a random read workload at depth 128 on two gen2
Optane drives.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

b688f11e

nvme: add support for batched completion of polled IO · c234a653

Jens Axboe authored Oct 08, 2021

Take advantage of struct io_comp_batch, if passed in to the nvme poll
handler. If it's set, rather than complete each request individually
inline, store them in the io_comp_batch list. We only do so for requests
that will complete successfully, anything else will be completed inline as
before.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

c234a653

block: add support for blk_mq_end_request_batch() · f794f335

Jens Axboe authored Oct 08, 2021

Instead of calling blk_mq_end_request() on a single request, add a helper
that takes the new struct io_comp_batch and completes any request stored
in there.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

f794f335

sbitmap: add helper to clear a batch of tags · 1aec5e4a

Jens Axboe authored Oct 08, 2021

sbitmap currently only supports clearing tags one-by-one, add a helper
that allows the caller to pass in an array of tags to clear.
Signed-off-by: Jens Axboe <axboe@kernel.dk>

1aec5e4a

block: add a struct io_comp_batch argument to fops->iopoll() · 5a72e899

Jens Axboe authored Oct 12, 2021

struct io_comp_batch contains a list head and a completion handler, which
will allow completions to more effciently completed batches of IO.

For now, no functional changes in this patch, we just define the
io_comp_batch structure and add the argument to the file_operations iopoll
handler.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

5a72e899

block: provide helpers for rq_list manipulation · 013a7f95

Jens Axboe authored Oct 13, 2021

Instead of open-coding the list additions, traversal, and removal,
provide a basic set of helpers.
Suggested-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

013a7f95

block: remove some blk_mq_hw_ctx debugfs entries · afd7de03

Jens Axboe authored Oct 18, 2021

Just like the blk_mq_ctx counterparts, we've got a bunch of counters
in here that are only for debugfs and are of questionnable value. They
are:

- dispatched, index of how many requests were dispatched in one go

- poll_{considered,invoked,success}, which track poll sucess rates. We're
  confident in the iopoll implementation at this point, don't bother
  tracking these.

As a bonus, this shrinks each hardware queue from 576 bytes to 512 bytes,
dropping a whole cacheline.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

afd7de03

block: remove debugfs blk_mq_ctx dispatched/merged/completed attributes · 9a14d6ce

Jens Axboe authored Oct 16, 2021

These were added as part of early days debugging for blk-mq, and they
are not really useful anymore. Rather than spend cycles updating them,
just get rid of them.

As a bonus, this shrinks the per-cpu software queue size from 256b
to 192b. That's a whole cacheline less.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

9a14d6ce

block: cache rq_flags inside blk_mq_rq_ctx_init() · 12845906

Pavel Begunkov authored Oct 18, 2021

Add a local variable for rq_flags, it helps to compile out some of
rq_flags reloads.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

12845906

block: blk_mq_rq_ctx_init cache ctx/q/hctx · 605f784e

Pavel Begunkov authored Oct 18, 2021

We should have enough of registers in blk_mq_rq_ctx_init(), store them
in local vars, so we don't keep reloading them.

note: keeping q->elevator may look unnecessary, but it's also used
inside inlined blk_mq_tags_from_data().
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

605f784e

block: skip elevator fields init for non-elv queue · 4f266f2b

Pavel Begunkov authored Oct 18, 2021

Don't init rq->hash and rq->rb_node in blk_mq_rq_ctx_init() if there is
no elevator. Also, move some other initialisers that imply barriers to
the end, so the compiler is free to rearrange and optimise other the
rest of them.

note: fold in a change from Jens leaving queue_list unconditional, as
it might lead to problems otherwise.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

4f266f2b

block: store elevator state in request · 2ff0682d

Jens Axboe authored Oct 15, 2021

Add an rq private RQF_ELV flag, which tells the block layer that this
request was initialized on a queue that has an IO scheduler attached.
This allows for faster checking in the fast path, rather than having to
deference rq->q later on.

Elevator switching does full quiesce of the queue before detaching an
IO scheduler, so it's safe to cache this in the request itself.
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2ff0682d

block: only mark bio as tracked if it really is tracked · 90b8faa0

Jens Axboe authored Oct 15, 2021

We set BIO_TRACKED unconditionally when rq_qos_throttle() is called, even
though we may not even have an rq_qos handler. Only mark it as TRACKED if
it really is potentially tracked.

This saves considerable time for the case where the bio isn't tracked:

     2.64%     -1.65%  [kernel.vmlinux]  [k] bio_endio
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

90b8faa0

block: improve layout of struct request · b6087629

Jens Axboe authored Oct 15, 2021

It's been a while since this was analyzed, move some members around to
better flow with the use case. Initial state up top, and queued state
after that. This improves my peak case by about 1.5%, from 7750K to
7900K IOPS.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

b6087629