Commits · 889fca73287b0ae21c9d8712379c9ae5a3b27d08 · Kirill Smelkov / linux

10 Feb, 2021 3 commits

io_uring: don't propagate io_comp_state · 889fca73

Pavel Begunkov authored Feb 10, 2021

There is no reason to drag io_comp_state into opcode handlers, we just
need a flag and the actual work will be done in __io_queue_sqe().
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

889fca73

io_uring: make op handlers always take issue flags · 61e98203

Pavel Begunkov authored Feb 10, 2021

Make opcode handler interfaces a bit more consistent by always passing
in issue flags. Bulky but pretty easy and mechanical change.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

61e98203

io_uring: replace force_nonblock with flags · 45d189c6

Pavel Begunkov authored Feb 10, 2021

Replace bool force_nonblock with flags. It has a long standing goal of
differentiating context from which we execute. Currently we have some
subtle places where some invariants, like holding of uring_lock, are
subtly inferred.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

45d189c6

08 Feb, 2021 1 commit

io_uring: cleanup up cancel SQPOLL reqs across exec · 0e9ddb39

Pavel Begunkov authored Feb 07, 2021

For SQPOLL rings tctx_inflight() always returns zero, so it might skip
doing full cancelation. It's fine because we jam all sqpoll submissions
in any case and do go through files cancel for them, but not nice.

Do the intended full cancellation, by mimicking __io_uring_task_cancel()
waiting but impersonating SQPOLL task.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

0e9ddb39

05 Feb, 2021 3 commits

io_uring: refactor sendmsg/recvmsg iov managing · 257e84a5

Pavel Begunkov authored Feb 05, 2021

Current iov handling with recvmsg/sendmsg may be confusing. First make a
rule for msg->iov: either it points to an allocated iov that have to be
kfree()'d later, or it's NULL and we use fast_iov. That's much better
than current 3-state (also can point to fast_iov). And rename it into
free_iov for uniformity with read/write.

Also, instead of after struct io_async_msghdr copy fixing up of
msg.msg_iter.iov has been happening in io_recvmsg()/io_sendmsg(). Move
it into io_setup_async_msg(), that's the right place.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
[axboe: add comment on NULL check before kfree()]
Signed-off-by: Jens Axboe <axboe@kernel.dk>

257e84a5

io_uring: clean iov usage for recvmsg buf select · 5476dfed

Pavel Begunkov authored Feb 05, 2021

Don't pretend we don't know that REQ_F_BUFFER_SELECT for recvmsg always
uses fast_iov -- clean up confusing intermixing kmsg->iov and
kmsg->fast_iov for buffer select.

Also don't init iter with garbage in __io_recvmsg_copy_hdr() only for it
to be set shortly after in io_recvmsg().
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

5476dfed

io_uring: set msg_name on msg fixup · 2a780802

Pavel Begunkov authored Feb 05, 2021

io_setup_async_msg() should fully prepare io_async_msghdr, let it also
handle assigning msg_name and don't hand code it in [send,recv]msg().
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2a780802

04 Feb, 2021 13 commits

io_uring/io-wq: return 2-step work swap scheme · 5280f7e5

Pavel Begunkov authored Feb 04, 2021

Saving one lock/unlock for io-wq is not super important, but adds some
ugliness in the code. More important, atomic decs not turning it to zero
for some archs won't give the right ordering/barriers so the
io_steal_work() may pretty easily get subtly and completely broken.

Return back 2-step io-wq work exchange and clean it up.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

5280f7e5

io_uring: deduplicate file table slot calculation · ea64ec02

Pavel Begunkov authored Feb 04, 2021

Extract a helper io_fixed_file_slot() returning a place in our fixed
files table, so we don't hand-code it three times in the code.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

ea64ec02

io_uring: io_import_iovec return type cleanup · 847595de

Pavel Begunkov authored Feb 04, 2021

io_import_iovec() doesn't return IO size anymore, only error code. Make
it more apparent by returning int instead of ssize and clean up
leftovers.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

847595de

io_uring: treat NONBLOCK and RWF_NOWAIT similarly · 75c668cd

Pavel Begunkov authored Feb 04, 2021

Make decision making of whether we need to retry read/write similar for
O_NONBLOCK and RWF_NOWAIT. Set REQ_F_NOWAIT when either is specified and
use it for all relevant checks. Also fix resubmitting NOWAIT requests
via io_rw_reissue().
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

75c668cd

io_uring: highlight read-retry loop · b23df91b

Pavel Begunkov authored Feb 04, 2021

We already have implicit do-while for read-retries but with goto in the
end. Convert it to an actual do-while, it highlights it so making a
bit more understandable and is cleaner in general.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

b23df91b

io_uring: inline io_read()'s iovec freeing · 5ea5dd45

Pavel Begunkov authored Feb 04, 2021

io_read() has not the simpliest control flow with a lot of jumps and
it's hard to read. One of those is a out_free: label, which frees iovec.
However, from the middle of io_read() iovec is NULL'ed and so
kfree(iovec) is no-op, it leaves us with two place where we can inline
it and further clean up the code.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

5ea5dd45

io_uring: don't forget to adjust io_size · 7335e3bf

Pavel Begunkov authored Feb 04, 2021

We have invariant in io_read() of how much we're trying to read spilled
into an iter and io_size variable. The last one controls decision making
about whether to do read-retries. However, io_size is modified only
after the first read attempt, so if we happen to go for a third retry in
a single call to io_read(), we will get io_size greater than in the
iterator, so may lead to various side effects up to live-locking.

Modify io_size each time.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

7335e3bf

io_uring: let io_setup_async_rw take care of iovec · 6bf985dc

Pavel Begunkov authored Feb 04, 2021

Now we give out ownership of iovec into io_setup_async_rw(), so it
either sets request's context right or frees the iovec on error itself.
Makes our life a bit easier at call sites.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

6bf985dc

io_uring: further simplify do_read error parsing · 1a2cc0ce

Pavel Begunkov authored Feb 04, 2021

First, instead of checking iov_iter_count(iter) for 0 to find out that
all needed bytes were read, just compare returned code against io_size.
It's more reliable and arguably cleaner.

Also, place the half-read case into an else branch and delete an extra
label.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

1a2cc0ce

io_uring: refactor io_read for unsupported nowait · 6713e7a6

Pavel Begunkov authored Feb 04, 2021

!io_file_supports_async() case of io_read() is hard to read, it jumps
somewhere in the middle of the function just to do async setup and fail
on a similar check. Call io_setup_async_rw() directly for this case,
it's much easier to follow.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

6713e7a6

io_uring: refactor io_cqring_wait · eeb60b9a

Pavel Begunkov authored Feb 04, 2021

It's easy to make a mistake in io_cqring_wait() because for all
break/continue clauses we need to watch for prepare/finish_wait to be
used correctly. Extract all those into a new helper
io_cqring_wait_schedule(), and transforming the loop into simple series
of func calls: prepare(); check_and_schedule(); finish();
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

eeb60b9a

io_uring: refactor scheduling in io_cqring_wait · c1d5a224

Pavel Begunkov authored Feb 04, 2021

schedule_timeout() with timeout=MAX_SCHEDULE_TIMEOUT is guaranteed to
work just as schedule(), so instead of hand-coding it based on arguments
always use the timeout version and simplify code.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

c1d5a224

io_uring: deduplicate core cancellations sequence · 9936c7c2

Pavel Begunkov authored Feb 04, 2021

Files and task cancellations go over same steps trying to cancel
requests in io-wq, poll, etc. Deduplicate it with a helper.

note: new io_uring_try_cancel_requests() is former
__io_uring_cancel_task_requests() with files passed as an agrument and
flushing overflowed requests.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

9936c7c2

01 Feb, 2021 20 commits

io_uring: simplify do_read return parsing · 57cd657b

Pavel Begunkov authored Feb 01, 2021

do_read() returning 0 bytes read (not -EAGAIN/etc.) is not an important
enough of a case to prioritise it. Fold it into ret < 0 check, so we get
rid of an extra if and make it a bit more readable.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

57cd657b

io_uring: deduplicate adding to REQ_F_INFLIGHT · ce3d5aae

Pavel Begunkov authored Feb 01, 2021

We don't know for how long REQ_F_INFLIGHT is going to stay, cleaner to
extract a helper for marking requests as so.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

ce3d5aae

io_uring: remove work flags after cleanup · e86d0047

Pavel Begunkov authored Feb 01, 2021

Shouldn't be a problem now, but it's better to clean
REQ_F_WORK_INITIALIZED and work->flags only after relevant resources are
killed, so cancellation see them.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

e86d0047

io_uring: inline io_req_drop_files() · 34e08fed

Pavel Begunkov authored Feb 01, 2021

req->files now have same lifetime as all other iowq-work resources,
inline io_req_drop_files() for consistency. Moreover, since
REQ_F_INFLIGHT is no more files specific, the function name became
very confusing.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

34e08fed

io_uring: kill not used needs_file_no_error · ba13e23f

Pavel Begunkov authored Feb 01, 2021

We have no request types left using needs_file_no_error, remove it.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

ba13e23f

io_uring: fix inconsistent lock state · 9ae1f8dd

Pavel Begunkov authored Feb 01, 2021

WARNING: inconsistent lock state

inconsistent {HARDIRQ-ON-W} -> {IN-HARDIRQ-W} usage.
syz-executor217/8450 [HC1[1]:SC0[0]:HE0:SE1] takes:
ffff888023d6e620 (&fs->lock){?.+.}-{2:2}, at: spin_lock include/linux/spinlock.h:354 [inline]
ffff888023d6e620 (&fs->lock){?.+.}-{2:2}, at: io_req_clean_work fs/io_uring.c:1398 [inline]
ffff888023d6e620 (&fs->lock){?.+.}-{2:2}, at: io_dismantle_req+0x66f/0xf60 fs/io_uring.c:2029

other info that might help us debug this:
 Possible unsafe locking scenario:

       CPU0
       ----
  lock(&fs->lock);
  <Interrupt>
    lock(&fs->lock);

 *** DEADLOCK ***

1 lock held by syz-executor217/8450:
 #0: ffff88802417c3e8 (&ctx->uring_lock){+.+.}-{3:3}, at: __do_sys_io_uring_enter+0x1071/0x1f30 fs/io_uring.c:9442

stack backtrace:
CPU: 1 PID: 8450 Comm: syz-executor217 Not tainted 5.11.0-rc5-next-20210129-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 <IRQ>
[...]
 _raw_spin_lock+0x2a/0x40 kernel/locking/spinlock.c:151
 spin_lock include/linux/spinlock.h:354 [inline]
 io_req_clean_work fs/io_uring.c:1398 [inline]
 io_dismantle_req+0x66f/0xf60 fs/io_uring.c:2029
 __io_free_req+0x3d/0x2e0 fs/io_uring.c:2046
 io_free_req fs/io_uring.c:2269 [inline]
 io_double_put_req fs/io_uring.c:2392 [inline]
 io_put_req+0xf9/0x570 fs/io_uring.c:2388
 io_link_timeout_fn+0x30c/0x480 fs/io_uring.c:6497
 __run_hrtimer kernel/time/hrtimer.c:1519 [inline]
 __hrtimer_run_queues+0x609/0xe40 kernel/time/hrtimer.c:1583
 hrtimer_interrupt+0x334/0x940 kernel/time/hrtimer.c:1645
 local_apic_timer_interrupt arch/x86/kernel/apic/apic.c:1085 [inline]
 __sysvec_apic_timer_interrupt+0x146/0x540 arch/x86/kernel/apic/apic.c:1102
 asm_call_irq_on_stack+0xf/0x20
 </IRQ>
 __run_sysvec_on_irqstack arch/x86/include/asm/irq_stack.h:37 [inline]
 run_sysvec_on_irqstack_cond arch/x86/include/asm/irq_stack.h:89 [inline]
 sysvec_apic_timer_interrupt+0xbd/0x100 arch/x86/kernel/apic/apic.c:1096
 asm_sysvec_apic_timer_interrupt+0x12/0x20 arch/x86/include/asm/idtentry.h:629
RIP: 0010:__raw_spin_unlock_irq include/linux/spinlock_api_smp.h:169 [inline]
RIP: 0010:_raw_spin_unlock_irq+0x25/0x40 kernel/locking/spinlock.c:199
 spin_unlock_irq include/linux/spinlock.h:404 [inline]
 io_queue_linked_timeout+0x194/0x1f0 fs/io_uring.c:6525
 __io_queue_sqe+0x328/0x1290 fs/io_uring.c:6594
 io_queue_sqe+0x631/0x10d0 fs/io_uring.c:6639
 io_queue_link_head fs/io_uring.c:6650 [inline]
 io_submit_sqe fs/io_uring.c:6697 [inline]
 io_submit_sqes+0x19b5/0x2720 fs/io_uring.c:6960
 __do_sys_io_uring_enter+0x107d/0x1f30 fs/io_uring.c:9443
 do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
 entry_SYSCALL_64_after_hwframe+0x44/0xa9

Don't free requests from under hrtimer context (softirq) as it may sleep
or take spinlocks improperly (e.g. non-irq versions).

Cc: stable@vger.kernel.org # 5.6+
Reported-by: syzbot+81d17233a2b02eafba33@syzkaller.appspotmail.com
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

9ae1f8dd

io_uring: Fix NULL dereference in error in io_sqe_files_register() · 13770a71

Dan Carpenter authored Feb 01, 2021

If we hit a "goto out_free;" before the "ctx->file_data" pointer has
been assigned then it leads to a NULL derefence when we call:

	free_fixed_rsrc_data(ctx->file_data);

We can fix this by moving the assignment earlier.

Fixes: 1ad555c6 ("io_uring: create common fixed_rsrc_data allocation routines")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

13770a71

io_uring: check kthread parked flag before sqthread goes to sleep · 8b28fdf2

Hao Xu authored Jan 31, 2021

Abaci reported this issue:

#[  605.170872] INFO: task kworker/u4:1:53 blocked for more than 143 seconds.
[  605.172123]       Not tainted 5.10.0+ #1
[  605.172811] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  605.173915] task:kworker/u4:1    state:D stack:    0 pid:   53 ppid:     2 flags:0x00004000
[  605.175130] Workqueue: events_unbound io_ring_exit_work
[  605.175931] Call Trace:
[  605.176334]  __schedule+0xe0e/0x25a0
[  605.176971]  ? firmware_map_remove+0x1a1/0x1a1
[  605.177631]  ? write_comp_data+0x2a/0x80
[  605.178272]  schedule+0xd0/0x270
[  605.178811]  schedule_timeout+0x6b6/0x940
[  605.179415]  ? mark_lock.part.0+0xca/0x1420
[  605.180062]  ? usleep_range+0x170/0x170
[  605.180684]  ? wait_for_completion+0x16d/0x280
[  605.181392]  ? mark_held_locks+0x9e/0xe0
[  605.182079]  ? rwlock_bug.part.0+0x90/0x90
[  605.182853]  ? lockdep_hardirqs_on_prepare+0x286/0x400
[  605.183817]  wait_for_completion+0x175/0x280
[  605.184713]  ? wait_for_completion_interruptible+0x340/0x340
[  605.185611]  ? _raw_spin_unlock_irq+0x24/0x30
[  605.186307]  ? migrate_swap_stop+0x9c0/0x9c0
[  605.187046]  kthread_park+0x127/0x1c0
[  605.187738]  io_sq_thread_stop+0xd5/0x530
[  605.188459]  io_ring_exit_work+0xb1/0x970
[  605.189207]  process_one_work+0x92c/0x1510
[  605.189947]  ? pwq_dec_nr_in_flight+0x360/0x360
[  605.190682]  ? rwlock_bug.part.0+0x90/0x90
[  605.191430]  ? write_comp_data+0x2a/0x80
[  605.192207]  worker_thread+0x9b/0xe20
[  605.192900]  ? process_one_work+0x1510/0x1510
[  605.193599]  kthread+0x353/0x460
[  605.194154]  ? _raw_spin_unlock_irq+0x24/0x30
[  605.194910]  ? kthread_create_on_node+0x100/0x100
[  605.195821]  ret_from_fork+0x1f/0x30
[  605.196605]
[  605.196605] Showing all locks held in the system:
[  605.197598] 1 lock held by khungtaskd/25:
[  605.198301]  #0: ffffffff8b5f76a0 (rcu_read_lock){....}-{1:2}, at: rcu_lock_acquire.constprop.0+0x0/0x30
[  605.199914] 3 locks held by kworker/u4:1/53:
[  605.200609]  #0: ffff888100109938 ((wq_completion)events_unbound){+.+.}-{0:0}, at: process_one_work+0x82a/0x1510
[  605.202108]  #1: ffff888100e47dc0 ((work_completion)(&ctx->exit_work)){+.+.}-{0:0}, at: process_one_work+0x85e/0x1510
[  605.203681]  #2: ffff888116931870 (&sqd->lock){+.+.}-{3:3}, at: io_sq_thread_park.part.0+0x19/0x50
[  605.205183] 3 locks held by systemd-journal/161:
[  605.206037] 1 lock held by syslog-ng/254:
[  605.206674] 2 locks held by agetty/311:
[  605.207292]  #0: ffff888101097098 (&tty->ldisc_sem){++++}-{0:0}, at: tty_ldisc_ref_wait+0x27/0x80
[  605.208715]  #1: ffffc900000332e8 (&ldata->atomic_read_lock){+.+.}-{3:3}, at: n_tty_read+0x222/0x1bb0
[  605.210131] 2 locks held by bash/677:
[  605.210723]  #0: ffff88810419a098 (&tty->ldisc_sem){++++}-{0:0}, at: tty_ldisc_ref_wait+0x27/0x80
[  605.212105]  #1: ffffc900000512e8 (&ldata->atomic_read_lock){+.+.}-{3:3}, at: n_tty_read+0x222/0x1bb0
[  605.213777]
[  605.214151] =============================================

I believe this is caused by the follow race:

(ctx_list is empty now)
=> io_put_sq_data               |
==> kthread_park(sqd->thread);  |
====> set KTHREAD_SHOULD_PARK	|
====> wake_up_process(k)        | sq thread is running
				|
				|
				| needs_sched is true since no ctx,
				| so TASK_INTERRUPTIBLE set and schedule
				| out then never wake up again
				|
====> wait_for_completion	|
	(stuck here)

So check if sqthread gets park flag right before schedule().
since ctx_list is always empty when this problem happens, here I put
kthread_should_park() before setting the wakeup flag(ctx_list is empty
so this for loop is fast), where is close enough to schedule(). The
problem doesn't show again in my repro testing after this fix.
Reported-by: Abaci <abaci@linux.alibaba.com>
Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

8b28fdf2

MAINTAINERS: update io_uring section · 090da7d5

Pavel Begunkov authored Jan 27, 2021

Add the missing kernel io_uring header, add Pavel as a reviewer, and
exclude io_uring from the FILESYSTEMS section to avoid keep spamming Al
(mainly) with bug reports, patches, etc.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

090da7d5

io_uring: Add skip option for __io_sqe_files_update · 4e0377a1

noah authored Jan 26, 2021

This patch adds support for skipping a file descriptor when using
IORING_REGISTER_FILES_UPDATE.  __io_sqe_files_update will skip fds set
to IORING_REGISTER_FILES_SKIP. IORING_REGISTER_FILES_SKIP is inturn
added as a #define in io_uring.h
Signed-off-by: noah <goldstein.w.n@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

4e0377a1

io_uring: cleanup files_update looping · 67973b93

Pavel Begunkov authored Jan 26, 2021

Replace a while with a simple for loop, that looks way more natural, and
enables us to use "continue" as indexes are no more updated by hand in
the end of the loop.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

67973b93

io_uring: consolidate putting reqs task · 7c660731

Pavel Begunkov authored Jan 25, 2021

We grab a task for each request and while putting it it also have to do
extra work like inflight accounting and waking up that task. This
sequence is duplicated several time, it's good time to add a helper.
More to that, the helper generates better code due to better locality
and so not failing alias analysis.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

7c660731

io_uring: ensure only sqo_task has file notes · ecfc8492

Pavel Begunkov authored Jan 25, 2021

For SQPOLL io_uring we want to have only one file note held by
sqo_task. Add a warning to make sure it holds. It's deep in
io_uring_add_task_file() out of hot path, so shouldn't hurt.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

ecfc8492

io_uring: simplify io_remove_personalities() · 0bead8cd

Yejune Deng authored Dec 24, 2020

The function io_remove_personalities() is very similar to
io_unregister_personality(),so implement io_remove_personalities()
calling io_unregister_personality().
Signed-off-by: Yejune Deng <yejune.deng@gmail.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

0bead8cd

io_uring/io-wq: kill off now unused IO_WQ_WORK_NO_CANCEL · 4014d943

Jens Axboe authored Jan 19, 2021

It's no longer used as IORING_OP_CLOSE got rid for the need of flagging
it as uncancelable, kill it of.
Signed-off-by: Jens Axboe <axboe@kernel.dk>

4014d943

io_uring: get rid of intermediate IORING_OP_CLOSE stage · 9eac1904

Jens Axboe authored Jan 19, 2021

We currently split the close into two, in case we have a ->flush op
that we can't safely handle from non-blocking context. This requires
us to flag the op as uncancelable if we do need to punt it async, and
that means special handling for just this op type.

Use __close_fd_get_file() and grab the files lock so we can get the file
and check if we need to go async in one atomic operation. That gets rid
of the need for splitting this into two steps, and hence the need for
IO_WQ_WORK_NO_CANCEL.
Signed-off-by: Jens Axboe <axboe@kernel.dk>

9eac1904

fs: provide locked helper variant of close_fd_get_file() · 53dec2ea

Jens Axboe authored Jan 19, 2021

Assumes current->files->file_lock is already held on invocation. Helps
the caller check the file before removing the fd, if it needs to.
Signed-off-by: Jens Axboe <axboe@kernel.dk>

53dec2ea

io_uring: save atomic dec for inline executed reqs · e342c807

Pavel Begunkov authored Jan 19, 2021

When a request is completed with comp_state, its completion reference
put is deferred to io_submit_flush_completions(), but the submission
is put not far from there, so do it together to save one atomic dec per
request. That targets requests that complete inline, e.g. buffered rw,
send/recv.

Proper benchmarking haven't been conducted but for nops(batch=32) it was
around 7901 vs 8117 KIOPS (~2.7%), or ~4% per perf profiling.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

e342c807

io_uring: don't flush CQEs deep down the stack · 9affd664

Pavel Begunkov authored Jan 19, 2021

io_submit_flush_completions() is called down the stack in the _state
version of io_req_complete(), that's ok because is only called by
io_uring opcode handler functions directly. Move it up to
__io_queue_sqe() as preparation.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

9affd664

io_uring: help inlining of io_req_complete() · a38d68db

Pavel Begunkov authored Jan 19, 2021

__io_req_complete() inlining is a bit weird, some compilers don't
optimise out the non-NULL branch of it even when called as
io_req_complete(). Help it a bit by extracting state and stateless
helpers out of __io_req_complete().
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

a38d68db