1. 19 Sep, 2024 1 commit
    • io_uring: check for presence of task_work rather than TIF_NOTIFY_SIGNAL · 04beb6e0
      Jens Axboe authored
      If some part of the kernel adds task_work that needs executing, in terms
      of signaling it'll generally use TWA_SIGNAL or TWA_RESUME. Those two
      directly translate to TIF_NOTIFY_SIGNAL or TIF_NOTIFY_RESUME, and can
      be used for a variety of use cases outside of task_work.
      
      However, io_cqring_wait_schedule() only tests explicitly for
      TIF_NOTIFY_SIGNAL. This means it can miss if task_work got added for
      the task, but used a different kind of signaling mechanism (or none at
      all). Normally this doesn't matter as any task_work will be run once
      the task exits to userspace, except if:
      
      1) The ring is set up with DEFER_TASKRUN
      2) The local work item may generate normal task_work
      
      For condition 2, this can happen when closing a file and it's the final
      put of that file, for example. This can cause stalls where a task is
      waiting to make progress inside io_cqring_wait(), but there's nothing else
      that will wake it up. Hence change the "should we schedule or loop around"
      check to check for the presence of task_work explicitly, rather than just
      TIF_NOTIFY_SIGNAL as the mechanism. While in there, also change the
      order in which the two types of task_work are run, both to make it
      consistent with other task_work runs in io_uring and to better handle
      the case of deferred task_work generating normal task_work, like in
      the above example.
      Reported-by: Jan Hendrik Farr <kernel@jfarr.cc>
      Link: https://github.com/axboe/liburing/issues/1235
      Cc: stable@vger.kernel.org
      Fixes: 846072f1 ("io_uring: mimimise io_cqring_wait_schedule")
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
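The difference between the two checks can be illustrated with a small userspace model. This is only a sketch, not the kernel code: `struct toy_task`, `toy_task_work_add()` and the two `toy_should_loop_*()` helpers are hypothetical names invented for the illustration, and the flags are plain booleans standing in for TIF_NOTIFY_SIGNAL and TIF_NOTIFY_RESUME.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for the parts of a task that matter here (hypothetical). */
struct toy_task {
	bool notify_signal;   /* models TIF_NOTIFY_SIGNAL */
	bool notify_resume;   /* models TIF_NOTIFY_RESUME */
	size_t pending_work;  /* number of queued task_work items */
};

/* Models the notification mode passed when task_work is added. */
enum toy_notify { TOY_TWA_NONE, TOY_TWA_RESUME, TOY_TWA_SIGNAL };

/* Queue one work item and raise the flag matching the notify mode. */
static void toy_task_work_add(struct toy_task *t, enum toy_notify mode)
{
	t->pending_work++;
	if (mode == TOY_TWA_SIGNAL)
		t->notify_signal = true;
	else if (mode == TOY_TWA_RESUME)
		t->notify_resume = true;
}

/* Flag-only check: notices work only if it was signaled via "signal". */
static bool toy_should_loop_flag_only(const struct toy_task *t)
{
	return t->notify_signal;
}

/* Presence check: notices any queued work, however it was signaled. */
static bool toy_should_loop_any_work(const struct toy_task *t)
{
	return t->pending_work != 0;
}
```

In this model, work added with TOY_TWA_RESUME (or TOY_TWA_NONE) is invisible to the flag-only check, so a waiter relying on it would go to sleep, while the presence check tells it to loop around and run the work.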
  2. 17 Sep, 2024 1 commit
  3. 16 Sep, 2024 3 commits
    • io_uring: clean up a type in io_uring_register_get_file() · 2f6a55e4
      Dan Carpenter authored
      Originally "fd" was unsigned int but it was changed to int when we pulled
      this code into a separate function in commit 0b6d253e
      ("io_uring/register: provide helper to get io_ring_ctx from 'fd'").  This
      doesn't really cause a runtime problem because the call to
      array_index_nospec() will clamp negative fds to 0 and nothing else uses
      the negative values.
      Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
      Link: https://lore.kernel.org/r/6f6cb630-079f-4fdf-bf95-1082e0a3fc6e@stanley.mountain
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
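Why a negative fd ends up as index 0 can be seen in a userspace model of the mask-based clamp. The sketch below follows the shape of the kernel's generic array_index_nospec() fallback, but the `toy_*` names are hypothetical, the speculation barrier is omitted, and it assumes a compiler (such as gcc or clang) where converting a large unsigned value to long wraps and right-shifting a negative long is an arithmetic shift.

```c
#include <assert.h>
#include <limits.h>

#define TOY_BITS_PER_LONG (sizeof(long) * CHAR_BIT)

/*
 * All-ones mask if index < size, zero otherwise. The top bit of
 * (index | (size - 1 - index)) is set exactly when index is out of range.
 */
static unsigned long toy_index_mask(unsigned long index, unsigned long size)
{
	long v = ~(long)(index | (size - 1UL - index));

	/* Assumes an arithmetic right shift for negative values. */
	return (unsigned long)(v >> (TOY_BITS_PER_LONG - 1));
}

/* In-range indexes pass through unchanged; anything else becomes 0. */
static unsigned long toy_index_nospec(unsigned long index, unsigned long size)
{
	return index & toy_index_mask(index, size);
}
```

A negative int converted to unsigned long becomes a very large index, so the mask is zero and the result is 0, which matches the clamping behavior the message describes.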
    • io_uring/sqpoll: do not put cpumask on stack · 7f44bead
      Felix Moessbauer authored
      Putting the cpumask on the stack has been deprecated for a long time (since
      2d3854a3), as these can be big. Given that, change the on-stack
      allocation of allowed_mask to be dynamically allocated.
      
      Fixes: f011c9cf ("io_uring/sqpoll: do not allow pinning outside of cpuset")
      Signed-off-by: Felix Moessbauer <felix.moessbauer@siemens.com>
      Link: https://lore.kernel.org/r/20240916111150.1266191-1-felix.moessbauer@siemens.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
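To give a feel for why a CPU mask "can be big", here is a userspace sketch that sizes a one-bit-per-CPU mask and allocates it on the heap. It is an analogy only: the kernel uses its own cpumask allocation helpers, and the `toy_*` names and the 8192-CPU figure used in the note below are assumptions made for the illustration.

```c
#include <assert.h>
#include <limits.h>
#include <stdlib.h>

#define TOY_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/* Number of unsigned longs needed to hold one bit per CPU. */
static size_t toy_mask_longs(size_t nr_cpus)
{
	return (nr_cpus + TOY_BITS_PER_LONG - 1) / TOY_BITS_PER_LONG;
}

/* Heap-allocate a zeroed mask for nr_cpus CPUs; returns NULL on failure. */
static unsigned long *toy_alloc_mask(size_t nr_cpus)
{
	return calloc(toy_mask_longs(nr_cpus), sizeof(unsigned long));
}

/* Release a mask obtained from toy_alloc_mask(). */
static void toy_free_mask(unsigned long *mask)
{
	free(mask);
}
```

For a hypothetical build supporting 8192 CPUs, one mask takes 1024 bytes, which is a noticeable chunk of a small kernel stack and a natural candidate for dynamic allocation.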
    • io_uring/sqpoll: retain test for whether the CPU is valid · a09c1724
      Jens Axboe authored
      A recent commit ensured that SQPOLL cannot be set up with a CPU that
      isn't in the current task's cpuset, but it also dropped testing whether
      the CPU is valid in the first place. Without that, if a task passes in
      a CPU value that is too high, the following KASAN splat can get
      triggered:
      
      BUG: KASAN: stack-out-of-bounds in io_sq_offload_create+0x858/0xaa4
      Read of size 8 at addr ffff800089bc7b90 by task wq-aff.t/1391
      
      CPU: 4 UID: 1000 PID: 1391 Comm: wq-aff.t Not tainted 6.11.0-rc7-00227-g371c468f4db6 #7080
      Hardware name: linux,dummy-virt (DT)
      Call trace:
       dump_backtrace.part.0+0xcc/0xe0
       show_stack+0x14/0x1c
       dump_stack_lvl+0x58/0x74
       print_report+0x16c/0x4c8
       kasan_report+0x9c/0xe4
       __asan_report_load8_noabort+0x1c/0x24
       io_sq_offload_create+0x858/0xaa4
       io_uring_setup+0x1394/0x17c4
       __arm64_sys_io_uring_setup+0x6c/0x180
       invoke_syscall+0x6c/0x260
       el0_svc_common.constprop.0+0x158/0x224
       do_el0_svc+0x3c/0x5c
       el0_svc+0x34/0x70
       el0t_64_sync_handler+0x118/0x124
       el0t_64_sync+0x168/0x16c
      
      The buggy address belongs to stack of task wq-aff.t/1391
       and is located at offset 48 in frame:
       io_sq_offload_create+0x0/0xaa4
      
      This frame has 1 object:
       [32, 40) 'allowed_mask'
      
      The buggy address belongs to the virtual mapping at
       [ffff800089bc0000, ffff800089bc9000) created by:
       kernel_clone+0x124/0x7e0
      
      The buggy address belongs to the physical page:
      page: refcount:1 mapcount:0 mapping:0000000000000000 index:0xffff0000d740af80 pfn:0x11740a
      memcg:ffff0000c2706f02
      flags: 0xbffe00000000000(node=0|zone=2|lastcpupid=0x1fff)
      raw: 0bffe00000000000 0000000000000000 dead000000000122 0000000000000000
      raw: ffff0000d740af80 0000000000000000 00000001ffffffff ffff0000c2706f02
      page dumped because: kasan: bad access detected
      
      Memory state around the buggy address:
       ffff800089bc7a80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
       ffff800089bc7b00: 00 00 00 00 00 00 00 00 00 00 00 00 f1 f1 f1 f1
      >ffff800089bc7b80: 00 f3 f3 f3 00 00 00 00 00 00 00 00 00 00 00 00
                               ^
       ffff800089bc7c00: 00 00 00 00 00 00 00 00 00 00 00 00 f1 f1 f1 f1
       ffff800089bc7c80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 f3
      Reported-by: kernel test robot <oliver.sang@intel.com>
      Closes: https://lore.kernel.org/oe-lkp/202409161632.cbeeca0d-lkp@intel.com
      Fixes: f011c9cf ("io_uring/sqpoll: do not allow pinning outside of cpuset")
      Tested-by: Felix Moessbauer <felix.moessbauer@siemens.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
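The out-of-bounds read in the splat is the classic result of testing a bit in a mask without first checking that the bit index is within the mask. A minimal userspace sketch of the "validate, then test membership" order follows; `struct toy_cpumask` and `toy_cpu_allowed()` are hypothetical names, not the kernel's API.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/* A bitmap of allowed CPUs plus the number of valid bits in it. */
struct toy_cpumask {
	const unsigned long *bits;
	size_t nr_cpus;
};

/* Returns true only if cpu is a valid index AND set in the mask. */
static bool toy_cpu_allowed(const struct toy_cpumask *mask, long cpu)
{
	size_t idx;

	/* Validate first: an out-of-range cpu must never index the bitmap. */
	if (cpu < 0 || (size_t)cpu >= mask->nr_cpus)
		return false;

	idx = (size_t)cpu;
	return (mask->bits[idx / TOY_BITS_PER_LONG] >>
		(idx % TOY_BITS_PER_LONG)) & 1UL;
}
```

With the range check in place, a CPU value that is too high is rejected before any memory beyond the mask is touched; without it, the bitmap read would land past the end of the mask, which is the kind of access KASAN reports above.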
  4. 15 Sep, 2024 2 commits
  5. 14 Sep, 2024 1 commit
    • io_uring: rename "copy buffers" to "clone buffers" · 636119af
      Jens Axboe authored
      A recent commit added support for copying registered buffers from one
      ring to another. But that term is a bit confusing, as no copying of
      buffer data is done here. What is being done is simply cloning the
      buffer registrations from one ring to another.
      
      Rename it while we still can, so that it's more descriptive. No
      functional changes in this patch.
      
      Fixes: 7cc2a6ea ("io_uring: add IORING_REGISTER_COPY_BUFFERS method")
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  6. 12 Sep, 2024 2 commits
    • io_uring: add IORING_REGISTER_COPY_BUFFERS method · 7cc2a6ea
      Jens Axboe authored
      Buffers can get registered with io_uring, which allows skipping the
      repeated pin_pages, unpin/unref pages for each O_DIRECT operation. This
      reduces the overhead of O_DIRECT IO.
      
      However, registering buffers can take some time. Normally this isn't an
      issue as it's done at initialization time (and hence less critical), but
      for cases where rings can be created and destroyed as part of an IO
      thread pool, registering the same buffers for multiple rings becomes a
      more time sensitive proposition. As an example, let's say an application
      has an IO memory pool of 500G. Initial registration takes:
      
      Got 500 huge pages (each 1024MB)
      Registered 500 pages in 409 msec
      
      or about 0.4 seconds. If we go higher to 900 1GB huge pages being
      registered:
      
      Registered 900 pages in 738 msec
      
      which is, as expected, a fully linear scaling.
      
      Rather than have each ring pin/map/register the same buffer pool,
      provide an io_uring_register(2) opcode to simply duplicate the buffers
      that are registered with another ring. Adding the same 900GB of
      registered buffers to the target ring can then be accomplished in:
      
      Copied 900 pages in 17 usec
      
      While timing differs a bit, this provides around a 25,000-40,000x
      speedup for this use case.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
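The timings quoted in the message can be checked with a little integer arithmetic. The helpers below are hypothetical and only restate the numbers given above (409 msec for 500 pages, 738 msec for 900 pages, 17 usec to duplicate 900 pages), using truncating division.

```c
#include <assert.h>
#include <stdint.h>

/* Whole-number speedup of a fast path over a slow path (truncated). */
static uint64_t toy_speedup(uint64_t slow_usec, uint64_t fast_usec)
{
	assert(fast_usec != 0);
	return slow_usec / fast_usec;
}

/* Average registration cost per page in microseconds (truncated). */
static uint64_t toy_usec_per_page(uint64_t total_usec, uint64_t pages)
{
	assert(pages != 0);
	return total_usec / pages;
}
```

Taking the figures at face value, registration costs about 818 usec per page at 500 pages and 820 usec per page at 900 pages, which is the linear scaling the message mentions. Comparing the 900-page cases, 738,000 usec against 17 usec works out to a speedup of roughly 43,000x.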
    • io_uring/register: provide helper to get io_ring_ctx from 'fd' · 0b6d253e
      Jens Axboe authored
      Can be done in one of two ways:
      
      1) Regular file descriptor, just fget()
      2) Registered ring, index our own table for that
      
      In preparation for adding another register operation that needs to get
      a ctx from a file descriptor, abstract out this helper and use it in the
      main register syscall as well.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  7. 11 Sep, 2024 4 commits
  8. 10 Sep, 2024 2 commits
  9. 09 Sep, 2024 1 commit
    • io_uring/sqpoll: do not allow pinning outside of cpuset · f011c9cf
      Felix Moessbauer authored
      The submit queue polling threads are userland threads that just never
      exit to the userland. When creating the thread with IORING_SETUP_SQ_AFF,
      the affinity of the poller thread is set to the cpu specified in
      sq_thread_cpu. However, this CPU can be outside of the cpuset defined
      by the cgroup cpuset controller. This violates the rules defined by the
      cpuset controller and is a potential issue for realtime applications.
      
      In b7ed6d8ffd6 we fixed the default affinity of the poller thread, in
      case no explicit pinning is required, by inheriting the one of the
      creating task. In case of explicit pinning, the check is more
      complicated, as a cpu outside of the parent cpumask is also allowed.
      We implemented this by using cpuset_cpus_allowed (that has support for
      cgroup cpusets) and testing if the requested cpu is in the set.
      
      Fixes: 37d1e2e3 ("io_uring: move SQPOLL thread io-wq forked worker")
      Cc: stable@vger.kernel.org # 6.1+
      Signed-off-by: Felix Moessbauer <felix.moessbauer@siemens.com>
      Link: https://lore.kernel.org/r/20240909150036.55921-1-felix.moessbauer@siemens.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  10. 08 Sep, 2024 1 commit
  11. 02 Sep, 2024 2 commits
  12. 30 Aug, 2024 1 commit
  13. 29 Aug, 2024 5 commits
    • io_uring/kbuf: add support for incremental buffer consumption · ae98dbf4
      Jens Axboe authored
      By default, any recv/read operation that uses provided buffers will
      consume at least 1 buffer fully (and maybe more, in case of bundles).
      This adds support for incremental consumption, meaning that an
      application may add large buffers, and each read/recv will just consume
      the part of the buffer that it needs.
      
      For example, let's say an application registers 1MB buffers in a
      provided buffer ring, for streaming receives. If it gets a short recv,
      then the full 1MB buffer will be consumed and passed back to the
      application. With incremental consumption, only the part that was
      actually used is consumed, and the buffer remains the current one.
      
      This means that both the application and the kernel need to keep track
      of what the current receive point is. Each recv will still pass back a
      buffer ID and the size consumed. The only difference is that before,
      the next receive would always be the next buffer in the ring. Now the same
      buffer ID may return multiple receives, each at an offset into that
      buffer from where the previous receive left off. Example:
      
      Application registers a provided buffer ring, and adds two 32K buffers
      to the ring.
      
      Buffer1 address: 0x1000000 (buffer ID 0)
      Buffer2 address: 0x2000000 (buffer ID 1)
      
      A recv completion is received with the following values:
      
      cqe->res	0x1000	(4k bytes received)
      cqe->flags	0x11	(CQE_F_BUFFER|CQE_F_BUF_MORE set, buffer ID 0)
      
      and the application now knows that 4096b of data is available at
      0x1000000, the start of that buffer, and that more data from this buffer
      will be coming. Now the next receive comes in:
      
      cqe->res	0x2000	(8k bytes received)
      cqe->flags	0x11	(CQE_F_BUFFER|CQE_F_BUF_MORE set, buffer ID 0)
      
      which tells the application that 8k is available where the last
      completion left off, at 0x1001000. Next completion is:
      
      cqe->res	0x5000	(20k bytes received)
      cqe->flags	0x1	(CQE_F_BUFFER set, buffer ID 0)
      
      and the application now knows that 20k of data is available at
      0x1003000, which is where the previous receive ended. CQE_F_BUF_MORE
      isn't set, as no more data is available in this buffer ID. The next
      completion is then:
      
      cqe->res	0x1000	(4k bytes received)
      cqe->flags	0x10011	(CQE_F_BUFFER|CQE_F_BUF_MORE set, buffer ID 1)
      
      which tells the application that buffer ID 1 is now the current one,
      hence there's 4k of valid data at 0x2000000. 0x2001000 will be the next
      receive point for this buffer ID.
      
      When a buffer will be reused by future CQE completions,
      IORING_CQE_F_BUF_MORE will be set in cqe->flags. This tells the application
      that the kernel isn't done with the buffer yet, and that it should expect
      more completions for this buffer ID. It will only be set by provided buffer
      rings set up with IOU_PBUF_RING_INC, as that's the only type of buffer
      that will see multiple consecutive completions for the same buffer ID.
      For any other provided buffer type, any completion that passes back
      a buffer to the application is final.
      
      Once a buffer has been fully consumed, the buffer ring head is
      incremented and the next receive will indicate the next buffer ID in the
      CQE cflags.
      
      On the send side, the application can manage how much data is sent from
      an existing buffer by setting sqe->len to the desired send length.
      
      An application can request incremental consumption by setting
      IOU_PBUF_RING_INC in the provided buffer ring registration. Outside of
      that, any provided buffer ring setup and buffer additions are done like
      before, no changes there. The only change is in how an application may
      see multiple completions for the same buffer ID, hence needing to know
      where the next receive will happen.
      
      Note that like existing provided buffer rings, this should not be used
      with IOSQE_ASYNC, as both really require the ring to remain locked over
      the duration of the buffer selection and the operation completion. It
      will consume a buffer otherwise regardless of the size of the IO done.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
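The application-side bookkeeping described in the message can be sketched as a small offset tracker. This is an illustrative model, not liburing code: `struct toy_buf_tracker` and `toy_consume()` are hypothetical, and the `TOY_*` constants are local copies that are assumed to mirror the uapi values (buffer flag in bit 0, "more" flag in bit 4, buffer ID in the upper 16 bits of cqe->flags).

```c
#include <assert.h>
#include <stdint.h>

/* Local copies of the assumed uapi values, so the sketch is standalone. */
#define TOY_CQE_F_BUFFER     (1U << 0)
#define TOY_CQE_F_BUF_MORE   (1U << 4)
#define TOY_CQE_BUFFER_SHIFT 16

#define TOY_NR_BUFS 2

struct toy_buf_tracker {
	uint64_t base[TOY_NR_BUFS];   /* start address of each buffer ID */
	uint64_t offset[TOY_NR_BUFS]; /* next receive offset in each buffer */
};

/*
 * Account for one completion. Returns the address where this completion's
 * data starts, or 0 if the CQE carries no usable buffer information.
 */
static uint64_t toy_consume(struct toy_buf_tracker *t, int32_t res,
			    uint32_t flags)
{
	uint32_t bid = flags >> TOY_CQE_BUFFER_SHIFT;
	uint64_t addr;

	if (!(flags & TOY_CQE_F_BUFFER) || res < 0 || bid >= TOY_NR_BUFS)
		return 0;

	addr = t->base[bid] + t->offset[bid];
	if (flags & TOY_CQE_F_BUF_MORE)
		t->offset[bid] += (uint32_t)res; /* buffer stays current */
	else
		t->offset[bid] = 0;              /* buffer is done; reset */
	return addr;
}
```

Feeding it the four completions from the walkthrough (4k, 8k and 20k on buffer ID 0, then 4k on buffer ID 1) yields data addresses 0x1000000, 0x1001000, 0x1003000 and 0x2000000, and leaves 0x1000 as the next offset for buffer ID 1.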
    • io_uring/kbuf: pass in 'len' argument for buffer commit · 6733e678
      Jens Axboe authored
      In preparation for needing the consumed length, pass in the length being
      completed. Unused right now, but will be used when it is possible to
      partially consume a buffer.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • Revert "io_uring: Require zeroed sqe->len on provided-buffers send" · 641a6816
      Jens Axboe authored
      This reverts commit 79996b45.
      
      Revert the change that requires sqe->len to be zero for a send that uses
      provided buffers, which forces the send to always consume the whole
      buffer. The revert is strictly needed for partial consumption, as the
      send may very well be a subset of the current buffer. In fact, that's
      the intended use case.
      
      For non-incremental provided buffer rings, an application should set
      sqe->len carefully to avoid the potential issue described in the
      reverted commit. It is recommended that '0' still be set for len for
      that case, if the application is set on maintaining more than 1 send
      inflight for the same socket. This is somewhat of a nonsensical thing
      to do.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring/kbuf: move io_ring_head_to_buf() to kbuf.h · 2c8fa70b
      Jens Axboe authored
      In preparation for using this helper in kbuf.h as well, move it there and
      turn it into a macro.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring/kbuf: add io_kbuf_commit() helper · ecd5c9b2
      Jens Axboe authored
      Committing the selected ring buffer is currently done in three different
      spots, combine it into a helper and just call that.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  14. 25 Aug, 2024 14 commits