Commits · 5d70904367b45b74dab9da5c023b6629f511e48f · Kirill Smelkov / linux

We cache all the references to task + tctx, so if io_put_task() is
called by the corresponding task itself, we can save on atomics and
return the refs right back into the cache.

It's beneficial for all inline completions, and also iopolling, when
polling and submissions are done by the same task, including
SQPOLL|IOPOLL.

Note: io_uring_cancel_generic() can return refs to the cache as well,
so those should be flushed in the loop for tctx_inflight() to work
right.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/6fe9646b3cb70e46aca1f58426776e368c8926b3.1628471125.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

e9dbe221
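
The reference caching described in the commit above can be illustrated with a small userspace sketch. It assumes a simplified model in which the shared counter includes the cached references; `struct task_ctx`, `put_task_refs()` and `flush_and_read_inflight()` are hypothetical names for illustration, not the kernel's definitions.

```c
#include <stdatomic.h>

/* Hypothetical stand-in for the per-task io_uring context. */
struct task_ctx {
	atomic_long inflight;	/* shared counter, updated with atomics */
	long cached_refs;	/* owner-only cache, plain arithmetic */
	int owner_id;		/* id of the task that owns this context */
};

/*
 * Drop nr references. The owning task returns them to its private
 * cache without atomics; any other caller updates the shared counter.
 */
static void put_task_refs(struct task_ctx *tctx, int caller_id, long nr)
{
	if (caller_id == tctx->owner_id)
		tctx->cached_refs += nr;
	else
		atomic_fetch_sub(&tctx->inflight, nr);
}

/*
 * Flush the cache before reading, so the shared counter reflects only
 * the references that are really in flight.
 */
static long flush_and_read_inflight(struct task_ctx *tctx)
{
	long cached = tctx->cached_refs;

	tctx->cached_refs = 0;
	atomic_fetch_sub(&tctx->inflight, cached);
	return atomic_load(&tctx->inflight);
}
```

In this model, reading the counter without flushing first would overcount by whatever sits in the cache, which is why the note above asks for a flush inside the cancellation loop.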

io_uring: drop exec checks from io_req_task_submit · af066f31

Pavel Begunkov authored Aug 09, 2021

In case of on-exec io_uring cancellations, tasks already wait for all
submitted requests to get completed/cancelled, so we don't need to check
for ->in_execve separately.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/be8707049f10df9d20ca03dc4ca3316239b5e8e0.1628471125.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

af066f31

io_uring: kill unused IO_IOPOLL_BATCH · bbbca094

Pavel Begunkov authored Aug 09, 2021

IO_IOPOLL_BATCH is not used, delete it.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/b2bdf19dbee2c9fc8865bbab9412135a14e24a64.1628471125.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

bbbca094

io_uring: improve ctx hang handling · 58d3be2c

Pavel Begunkov authored Aug 09, 2021

If io_ring_exit_work() can't get it done in 5 minutes, something is
going very wrong. Don't keep spinning at HZ / 20 rate: it doesn't help,
and it may take a lot of CPU time if there are a lot of workers stuck
like that.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/9e2d1ca81d569f6bc628af1a42ff6663bff7ce9c.1628471125.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

58d3be2c
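
The back-off described in the commit above might look roughly like the following sketch. `TICKS_PER_SEC` is a hypothetical stand-in for HZ, and the 60-second slow interval is an assumption chosen for illustration; the commit message only says to stop retrying at the HZ / 20 rate after 5 minutes.

```c
#define TICKS_PER_SEC 1000UL	/* hypothetical stand-in for HZ */

/*
 * Pick how long to wait before the next exit attempt. Until the
 * 5 minute mark we retry every 1/20 of a second; after that we assume
 * a hang and retry far less often (assumed here: once a minute).
 */
static unsigned long exit_retry_interval(unsigned long elapsed_ticks)
{
	const unsigned long hang_threshold = 5UL * 60UL * TICKS_PER_SEC;

	if (elapsed_ticks > hang_threshold)
		return 60UL * TICKS_PER_SEC;
	return TICKS_PER_SEC / 20;
}
```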

io_uring: deduplicate open iopoll check · d3fddf6d

Pavel Begunkov authored Aug 09, 2021

Move IORING_SETUP_IOPOLL check into __io_openat_prep(), so both openat
and openat2 reuse it.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/9a73ce83e4ee60d011180ef177eecef8e87ff2a2.1628471125.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

d3fddf6d

io_uring: inline io_free_req_deferred · 543af3a1

Pavel Begunkov authored Aug 09, 2021

Inline io_free_req_deferred(), there is no reason to keep it separated.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/ce04b7180d4eac0d69dd00677b227eefe80c2cc5.1628471125.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

543af3a1

io_uring: move io_rsrc_node_alloc() definition · b9bd2bea

Pavel Begunkov authored Aug 09, 2021

Move the function together with io_rsrc_node_ref_zero() in the source
file as it is to get rid of forward declarations.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/4d81f6f833e7d017860b24463a9a68b14a8a5ed2.1628471125.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

b9bd2bea

io_uring: move io_put_task() definition · 6a290a14

Pavel Begunkov authored Aug 09, 2021

Move the function in the source file as it is to get rid of forward
declarations.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/33d917d69e4206557c75a5b98fe22bcdf77ce47d.1628471125.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

6a290a14

io_uring: extract a helper for ctx quiesce · e73c5c7c

Pavel Begunkov authored Aug 09, 2021

Refactor __io_uring_register() by extracting a helper responsible for
ctx quiesce. Looks better and will make it easier to add more
optimisations.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/0339e0027504176be09237eefa7945bf9a6f153d.1628471125.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

e73c5c7c

io_uring: optimise io_cqring_wait() hot path · 90291099

Pavel Begunkov authored Aug 09, 2021

Turns out we always init struct io_wait_queue in io_cqring_wait(), even
if it's not used afterwards, i.e. there are already enough CQEs. And often
it's exactly what happens, for instance, requests may have been
completed inline, or in case of io_uring_enter(submit=N, wait=1).

It shows up in my profiler, so optimise it by delaying the struct init.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/6f1b81c60b947d165583dc333947869c3d85d037.1628471125.git.asml.silence@gmail.com
[axboe: fixed up for new cqring wait]
Signed-off-by: Jens Axboe <axboe@kernel.dk>

90291099

io_uring: add more locking annotations for submit · 282cdc86

Pavel Begunkov authored Aug 09, 2021

Add more annotations for submission path functions holding ->uring_lock.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/128ec4185e26fbd661dd3a424aa66108ee8ff951.1628471125.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

282cdc86

io_uring: don't halt iopoll too early · a2416e1e

Pavel Begunkov authored Aug 09, 2021

IOPOLL users should care about getting completions for the requests
they submitted, not about "device did/completed something". Currently,
io_do_iopoll() may return a positive number, which will instruct
io_iopoll_check() to break the loop and end the syscall, even if there
are not enough CQEs or none at all.

Don't return positive numbers, so io_iopoll_check() exits only when it
gets an actual error, needs to reschedule, or has enough CQEs.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/641a88f751623b6758303b3171f0a4141f06726e.1628471125.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

a2416e1e
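
The loop exit rule from the commit above can be sketched in userspace as follows. `struct fake_dev`, `fake_do_iopoll()` and `fake_iopoll_check()` are hypothetical names modelling the behaviour described in the message, and the reschedule check is left out for brevity.

```c
#include <errno.h>

/*
 * Hypothetical device model: each poll reaps per_poll completions and
 * polling starts failing once polls_left reaches zero.
 */
struct fake_dev {
	unsigned per_poll;
	unsigned polls_left;
};

/* Returns 0 on success or a negative errno, never a positive count. */
static int fake_do_iopoll(struct fake_dev *dev, unsigned *nr_events)
{
	if (dev->polls_left == 0)
		return -EIO;
	dev->polls_left--;
	*nr_events += dev->per_poll;
	return 0;
}

/*
 * Keep polling until enough completions were reaped or a real error
 * shows up. Device activity alone does not end the loop.
 */
static int fake_iopoll_check(struct fake_dev *dev, unsigned min_events,
			     unsigned *nr_events)
{
	*nr_events = 0;
	while (*nr_events < min_events) {
		int ret = fake_do_iopoll(dev, nr_events);

		if (ret < 0)
			return ret;
	}
	return 0;
}
```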

io_uring: refactor io_alloc_req · 864ea921

Pavel Begunkov authored Aug 09, 2021

Replace the main if of io_flush_cached_reqs() with inverted condition +
goto, so all the cases are handled in the same way. And also extract
io_preinit_req() to make it cleaner and easier to refer to.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/1abcba1f7b55dc53bf1dbe95036e345ffb1d5b01.1628471125.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

864ea921

io-wq: improve wq_list_add_tail() · 8724dd8c

Pavel Begunkov authored Aug 09, 2021

Prepare nodes that we're going to add before actually linking them, it's
always safer and costs us nothing.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/f7e53f0c84c02ed6748c488ed0789b98f8cc6185.1628471125.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

8724dd8c
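
The "prepare before linking" idea from the commit above, shown as a minimal userspace sketch of a singly linked tail list. `struct work_node`, `struct work_list` and `list_add_tail()` are hypothetical names and are not claimed to match the io-wq definitions.

```c
#include <stddef.h>

struct work_node {
	struct work_node *next;
};

struct work_list {
	struct work_node *first;
	struct work_node *last;
};

static void list_add_tail(struct work_node *node, struct work_list *list)
{
	/*
	 * Fully initialise the node first, so nobody walking the list
	 * can ever follow a stale next pointer.
	 */
	node->next = NULL;

	if (!list->first) {
		list->first = node;
		list->last = node;
	} else {
		list->last->next = node;
		list->last = node;
	}
}
```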

io_uring: remove unnecessary PF_EXITING check · 2215bed9

Pavel Begunkov authored Aug 09, 2021

We prefer normal task_works even if it would fail requests inside. Kill
a PF_EXITING check in io_req_task_work_add(); task_work_add() handles
dying tasks well, i.e. it returns an error when it can't enqueue due to
late stages of do_exit().
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/fc14297e8441cd8f5d1743a2488cf0df09bf48ac.1628471125.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2215bed9

io_uring: clean io-wq callbacks · ebc11b6c

Pavel Begunkov authored Aug 09, 2021

Move io-wq callbacks closer to each other, so it's easier to work with
them, and rename io_free_work() into io_wq_free_work() for consistency.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/851bbc7f0f86f206d8c1333efee8bcb9c26e419f.1628471125.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

ebc11b6c

io_uring: avoid touching inode in rw prep · c97d8a0f

Pavel Begunkov authored Aug 09, 2021

If we use fixed files, we can be sure (almost) that REQ_F_ISREG is set.
However, for non-reg files io_prep_rw() will still look into inode to
double check, and that's expensive and can be avoided.

The only caveat is that it only currently works with 64+ bit
architectures, see FFS_ISREG, so we should consider that.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/0a62780c491ca2522cd52db4ae3f16e03aafed0f.1628471125.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

c97d8a0f

io_uring: rename io_file_supports_async() · b191e2df

Pavel Begunkov authored Aug 09, 2021

io_file_supports_async() checks whether a file supports nowait
operations, so "async" in the name is misleading. Rename it.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/33d55b5ce43aa1884c637c1957f1e30d30dc3bec.1628471125.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

b191e2df

io_uring: inline fixed part of io_file_get() · ac177053

Pavel Begunkov authored Aug 09, 2021

Optimise io_file_get() with registered files, which is in a hot path,
by inlining parts of the function. Saves a function call, and
inefficiencies of passing arguments, e.g. evaluating
(sqe_flags & IOSQE_FIXED_FILE).

It couldn't have been done before as compilers were refusing to inline
it because of the function size.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/52115cd6ce28f33bd0923149c0e6cb611084a0b1.1628471125.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

ac177053

io_uring: use kvmalloc for fixed files · 042b0d85

Pavel Begunkov authored Aug 09, 2021

Instead of hand-coded two-level tables for registered files, allocate
them with kvmalloc(). In many cases small enough tables are enough, and
so can be kmalloc()'ed removing an extra memory load and a bunch of bit
logic instructions from the hot path. If the table is larger, we trade
off all the pros with a TLB-assisted memory lookup.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/280421d3b48775dabab773006bb5588c7b2dabc0.1628471125.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

042b0d85
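
To illustrate what the commit above removes from the hot path, here is a rough comparison of the two lookup styles. `TABLE_SHIFT`, `lookup_two_level()` and `lookup_flat()` are hypothetical names and values for illustration, not the kernel's.

```c
#include <stddef.h>

#define TABLE_SHIFT 9	/* hypothetical: 512 slots per second-level table */
#define TABLE_MASK ((1u << TABLE_SHIFT) - 1)

/*
 * Hand-coded two-level table: a shift, a mask and an extra dependent
 * memory load on every lookup.
 */
static void *lookup_two_level(void ***tables, unsigned index)
{
	return tables[index >> TABLE_SHIFT][index & TABLE_MASK];
}

/*
 * Single flat array, as one contiguous allocation would provide:
 * one load, no bit logic.
 */
static void *lookup_flat(void **files, unsigned index)
{
	return files[index];
}
```

Both functions return the same slot for the same index; the flat version simply gets there with fewer instructions and one memory access fewer.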

io_uring: be smarter about waking multiple CQ ring waiters · 5fd46178

Jens Axboe authored Aug 06, 2021

Currently we only wake the first waiter, even if we have enough entries
posted to satisfy multiple waiters. Improve that situation so that
every waiter knows how much the CQ tail has to advance before they can
be safely woken up.

With this change, if we have N waiters each asking for 1 event and we get
4 completions, then we wake up 4 waiters. If we have N waiters asking
for 2 completions and we get 4 completions, then we wake up the first
two. Previously, only the first waiter would've been woken up.
Signed-off-by: Jens Axboe <axboe@kernel.dk>

5fd46178
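
The wake-up arithmetic in the commit above can be reproduced with one simple model: waiters are served in queue order and each needs the CQ tail to advance by its own requested count on top of what earlier waiters need. This is an assumption made to match the numbers in the message; `count_woken()` is a hypothetical helper, not kernel code.

```c
#include <stddef.h>

/*
 * Count how many queued waiters can be woken after `completions` new
 * CQEs, walking the queue in order and stopping at the first waiter
 * whose cumulative demand is not yet satisfied.
 */
static size_t count_woken(const unsigned *wanted, size_t nr_waiters,
			  unsigned completions)
{
	size_t woken = 0;
	unsigned needed = 0;

	for (size_t i = 0; i < nr_waiters; i++) {
		needed += wanted[i];
		if (needed > completions)
			break;
		woken++;
	}
	return woken;
}
```

With six waiters asking for 1 event each and 4 completions, this model wakes 4 of them; with six waiters asking for 2 each, it wakes the first two.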

io-wq: remove GFP_ATOMIC allocation off schedule out path · d3e9f732

Jens Axboe authored Aug 04, 2021

Daniel reports that the v5.14-rc4-rt4 kernel throws a BUG when running
stress-ng:

| [   90.202543] BUG: sleeping function called from invalid context at kernel/locking/spinlock_rt.c:35
| [   90.202549] in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 2047, name: iou-wrk-2041
| [   90.202555] CPU: 5 PID: 2047 Comm: iou-wrk-2041 Tainted: G        W         5.14.0-rc4-rt4+ #89
| [   90.202559] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.14.0-2 04/01/2014
| [   90.202561] Call Trace:
| [   90.202577]  dump_stack_lvl+0x34/0x44
| [   90.202584]  ___might_sleep.cold+0x87/0x94
| [   90.202588]  rt_spin_lock+0x19/0x70
| [   90.202593]  ___slab_alloc+0xcb/0x7d0
| [   90.202598]  ? newidle_balance.constprop.0+0xf5/0x3b0
| [   90.202603]  ? dequeue_entity+0xc3/0x290
| [   90.202605]  ? io_wqe_dec_running.isra.0+0x98/0xe0
| [   90.202610]  ? pick_next_task_fair+0xb9/0x330
| [   90.202612]  ? __schedule+0x670/0x1410
| [   90.202615]  ? io_wqe_dec_running.isra.0+0x98/0xe0
| [   90.202618]  kmem_cache_alloc_trace+0x79/0x1f0
| [   90.202621]  io_wqe_dec_running.isra.0+0x98/0xe0
| [   90.202625]  io_wq_worker_sleeping+0x37/0x50
| [   90.202628]  schedule+0x30/0xd0
| [   90.202630]  schedule_timeout+0x8f/0x1a0
| [   90.202634]  ? __bpf_trace_tick_stop+0x10/0x10
| [   90.202637]  io_wqe_worker+0xfd/0x320
| [   90.202641]  ? finish_task_switch.isra.0+0xd3/0x290
| [   90.202644]  ? io_worker_handle_work+0x670/0x670
| [   90.202646]  ? io_worker_handle_work+0x670/0x670
| [   90.202649]  ret_from_fork+0x22/0x30

which is due to the RT kernel not liking a GFP_ATOMIC allocation inside
a raw spinlock. Besides that not working on RT, doing any kind of
allocation from inside schedule() is kind of nasty and should be avoided
if at all possible.

This particular path happens when an io-wq worker goes to sleep, and we
need a new worker to handle pending work. We currently allocate a small
data item to hold the information we need to create a new worker, but we
can instead include this data in the io_worker struct itself and just
protect it with a single bit lock. We only really need one per worker
anyway, as we will have run pending work between two sleep cycles.

https://lore.kernel.org/lkml/20210804082418.fbibprcwtzyt5qax@beryllium.lan/
Reported-by: Daniel Wagner <dwagner@suse.de>
Tested-by: Daniel Wagner <dwagner@suse.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

d3e9f732
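
The approach in the commit above, embedding the creation data in the worker and guarding it with a one-bit lock instead of allocating, can be sketched in userspace with C11 atomics. `struct worker`, `queue_create()` and `finish_create()` are hypothetical names, and the integer payload stands in for whatever data the real code needs.

```c
#include <stdatomic.h>
#include <stdbool.h>

struct worker {
	atomic_flag create_busy;	/* single-bit lock guarding create_req */
	int create_req;			/* hypothetical embedded payload */
};

/*
 * Try to claim the embedded slot. No allocation happens, so this is
 * safe in contexts where sleeping is not allowed. Returns false if a
 * creation request for this worker is already pending.
 */
static bool queue_create(struct worker *w, int req)
{
	if (atomic_flag_test_and_set_explicit(&w->create_busy,
					      memory_order_acquire))
		return false;
	w->create_req = req;
	return true;
}

/* Consume the pending request and release the slot for reuse. */
static int finish_create(struct worker *w)
{
	int req = w->create_req;

	atomic_flag_clear_explicit(&w->create_busy, memory_order_release);
	return req;
}
```

One slot per worker is enough in this model for the reason given in the message: the pending request is consumed before the same worker can go to sleep again.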

22 Aug, 2021 2 commits

Linux 5.14-rc7 · e22ce8eb
Linus Torvalds authored Aug 22, 2021

e22ce8eb

Merge tag 'powerpc-5.14-6' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux · 1bdc3d5b

Linus Torvalds authored Aug 22, 2021

Pull powerpc fixes from Michael Ellerman:

 - Fix random crashes on some 32-bit CPUs by adding isync() after
   locking/unlocking KUEP

 - Fix intermittent crashes when loading modules with strict module RWX

 - Fix a section mismatch introduced by a previous fix.

Thanks to Christophe Leroy, Fabiano Rosas, Laurent Vivier, Murilo
Opsfelder Araújo, Nathan Chancellor, and Stan Johnson.

* tag 'powerpc-5.14-6' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
  powerpc/mm: Fix set_memory_*() against concurrent accesses
  powerpc/32s: Fix random crashes by adding isync() after locking/unlocking KUEP
  powerpc/xive: Do not mark xive_request_ipi() as __init

1bdc3d5b

21 Aug, 2021 9 commits

Merge tag 'clk-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux · 9ff50bf2

Linus Torvalds authored Aug 21, 2021

Pull clk driver fixes from Stephen Boyd:

 - Make the regulator state match the GDSC power domain state at boot on
   Qualcomm SoCs so that the regulator isn't turned off inadvertently.

 - Fix earlycon on i.MX6Q SoCs

* tag 'clk-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux:
  clk: qcom: gdsc: Ensure regulator init state matches GDSC state
  clk: imx6q: fix uart earlycon unwork

9ff50bf2

Merge tag 'char-misc-5.14-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc · 9085423f

Linus Torvalds authored Aug 21, 2021

Pull char/misc driver fixes from Greg KH:
 "Here are some small driver fixes for 5.14-rc7.

  They consist of:

   - revert for an interconnect patch that was found to have problems

   - ipack tpci200 driver fixes for reported problems

   - slimbus messaging and ngd fixes for reported problems

  All are small and have been in linux-next for a while with no reported
  issues"

* tag 'char-misc-5.14-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc:
  ipack: tpci200: fix memory leak in the tpci200_register
  ipack: tpci200: fix many double free issues in tpci200_pci_probe
  slimbus: ngd: reset dma setup during runtime pm
  slimbus: ngd: set correct device for pm
  slimbus: messaging: check for valid transaction id
  slimbus: messaging: start transaction ids from 1 instead of zero
  Revert "interconnect: qcom: icc-rpmh: Add BCMs to commit list in pre_aggregate"

9085423f

Merge tag 'usb-5.14-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb · f4ff9e6b

Linus Torvalds authored Aug 21, 2021

Pull USB fix from Greg KH:
 "Here is a single USB typec tcpm fix for a reported problem for
  5.14-rc7. It showed up in 5.13 and resolves an issue that Hans found.
  It has been in linux-next this week with no reported problems"

* tag 'usb-5.14-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb:
  usb: typec: tcpm: Fix VDMs sometimes not being forwarded to alt-mode drivers

f4ff9e6b

Merge tag 'riscv-for-linus-5.14-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux · a09434f1

Linus Torvalds authored Aug 21, 2021

Pull RISC-V fixes from Palmer Dabbelt:

 - fix the sifive-l2-cache device tree bindings for json-schema
   compatibility. This does not change the intended behavior of the
   binding.

 - avoid improperly freeing necessary resources during early boot.

* tag 'riscv-for-linus-5.14-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux:
  riscv: Fix a number of free'd resources in init_resources()
  dt-bindings: sifive-l2-cache: Fix 'select' matching

a09434f1

Merge tag 's390-5.14-5' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux · 5479a7fe

Linus Torvalds authored Aug 21, 2021

Pull s390 fix from Vasily Gorbik:

 - fix use after free of zpci_dev in pci code

* tag 's390-5.14-5' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
  s390/pci: fix use after free of zpci_dev

5479a7fe

Merge tag 'locks-v5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux · 15517c72

Linus Torvalds authored Aug 21, 2021

Pull mandatory file locking deprecation warning from Jeff Layton:
 "As discussed on the list, this patch just adds a new warning for folks
  who still have mandatory locking enabled and actually mount with '-o
  mand'. I'd like to get this in for v5.14 so we can push this out into
  stable kernels and hopefully reach folks who have mounts with -o mand.

  For now, I'm operating under the assumption that we'll fully remove
  this support in v5.15, but we can move that out if any legitimate
  users of this facility speak up between now and then"

* tag 'locks-v5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux:
  fs: warn about impending deprecation of mandatory locks

15517c72

Merge tag 'block-5.14-2021-08-20' of git://git.kernel.dk/linux-block · 002c0aef

Linus Torvalds authored Aug 21, 2021

Pull block fixes from Jens Axboe:
 "Three fixes from Ming Lei that should go into 5.14:

   - Fix for a kernel panic when iterating over tags for some cases
     where a flush request is present, a regression in this cycle.

   - Request timeout fix

   - Fix flush request checking"

* tag 'block-5.14-2021-08-20' of git://git.kernel.dk/linux-block:
  blk-mq: fix is_flush_rq
  blk-mq: fix kernel panic during iterating over flush request
  blk-mq: don't grab rq's refcount in blk_mq_check_expired()

002c0aef

Merge tag 'io_uring-5.14-2021-08-20' of git://git.kernel.dk/linux-block · 1e6907d5

Linus Torvalds authored Aug 21, 2021

Pull io_uring fixes from Jens Axboe:
 "A few small fixes that should go into this release:

   - Fix never re-assigning an initial error value for io_uring_enter()
     for SQPOLL, if asked to do nothing

   - Fix xa_alloc_cycle() return value checking, for cases where we have
     wrapped around

   - Fix for a ctx pin issue introduced in this cycle (Pavel)"

* tag 'io_uring-5.14-2021-08-20' of git://git.kernel.dk/linux-block:
  io_uring: fix xa_alloc_cycle() error return value check
  io_uring: pin ctx on fallback execution
  io_uring: only assign io_uring_enter() SQPOLL error in actual error case

1e6907d5

fs: warn about impending deprecation of mandatory locks · fdd92b64

Jeff Layton authored Aug 20, 2021

We've had CONFIG_MANDATORY_FILE_LOCKING since 2015 and a lot of distros
have disabled it. Warn the stragglers that still use "-o mand" that
we'll be dropping support for that mount option.

Cc: stable@vger.kernel.org
Signed-off-by: Jeff Layton <jlayton@kernel.org>

fdd92b64