1. 24 Jul, 2020 9 commits
  2. 23 Jul, 2020 1 commit
  3. 18 Jul, 2020 2 commits
    • Daniele Albano's avatar
      io_uring: always allow drain/link/hardlink/async sqe flags · 61710e43
      Daniele Albano authored
      We currently filter these for timeout_remove/async_cancel/files_update,
      but we only should be filtering for fixed file and buffer select. This
      also causes a second read of sqe->flags, which isn't needed.
      
      Just check req->flags for the relevant bits. This then allows these
      commands to be used in links, for example, like everything else.
      Signed-off-by: default avatarDaniele Albano <d.albano@gmail.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      61710e43
    • Jens Axboe's avatar
      io_uring: ensure double poll additions work with both request types · 807abcb0
      Jens Axboe authored
      The double poll additions were centered around doing POLL_ADD on file
      descriptors that use more than one waitqueue (typically one for read,
      one for write) when being polled. However, it can also end up being
      triggered for when we use poll triggered retry. For that case, we cannot
      safely use req->io, as that could be used by the request type itself.
      
      Add a second io_poll_iocb pointer in the structure we allocate for poll
      based retry, and ensure we use the right one from the two paths.
      
      Fixes: 18bceab1 ("io_uring: allow POLL_ADD with double poll_wait() users")
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      807abcb0
  4. 15 Jul, 2020 1 commit
  5. 12 Jul, 2020 2 commits
  6. 10 Jul, 2020 2 commits
    • Jens Axboe's avatar
      io_uring: account user memory freed when exit has been queued · 309fc03a
      Jens Axboe authored
      We currently account the memory after the exit work has been run, but
      that leaves a gap where a process has closed its ring and until the
      memory has been accounted as freed. If the memlocked ulimit is
      borderline, then that can introduce spurious setup errors returning
      -ENOMEM because the free work hasn't been run yet.
      
      Account this as freed when we close the ring, as not to expose a tiny
      gap where setting up a new ring can fail.
      
      Fixes: 85faa7b8 ("io_uring: punt final io_ring_ctx wait-and-free to workqueue")
      Cc: stable@vger.kernel.org # v5.7
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      309fc03a
    • Yang Yingliang's avatar
      io_uring: fix memleak in io_sqe_files_register() · 667e57da
      Yang Yingliang authored
      I got a memleak report when doing some fuzz test:
      
      BUG: memory leak
      unreferenced object 0x607eeac06e78 (size 8):
        comm "test", pid 295, jiffies 4294735835 (age 31.745s)
        hex dump (first 8 bytes):
          00 00 00 00 00 00 00 00                          ........
        backtrace:
          [<00000000932632e6>] percpu_ref_init+0x2a/0x1b0
          [<0000000092ddb796>] __io_uring_register+0x111d/0x22a0
          [<00000000eadd6c77>] __x64_sys_io_uring_register+0x17b/0x480
          [<00000000591b89a6>] do_syscall_64+0x56/0xa0
          [<00000000864a281d>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      Call percpu_ref_exit() on error path to avoid
      refcount memleak.
      
      Fixes: 05f3fb3c ("io_uring: avoid ring quiesce for fixed file set unregister and update")
      Cc: stable@vger.kernel.org
      Reported-by: default avatarHulk Robot <hulkci@huawei.com>
      Signed-off-by: default avatarYang Yingliang <yangyingliang@huawei.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      667e57da
  7. 09 Jul, 2020 4 commits
    • Jens Axboe's avatar
      io_uring: remove dead 'ctx' argument and move forward declaration · 4349f30e
      Jens Axboe authored
      We don't use 'ctx' at all in io_sq_thread_drop_mm(), it just works
      on the mm of the current task. Drop the argument.
      
      Move io_file_put_work() to where we have the other forward declarations
      of functions.
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      4349f30e
    • Jens Axboe's avatar
      io_uring: get rid of __req_need_defer() · 2bc9930e
      Jens Axboe authored
      We just have one caller of this, req_need_defer(), just inline the
      code in there instead.
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      2bc9930e
    • Yang Yingliang's avatar
      io_uring: fix memleak in __io_sqe_files_update() · f3bd9dae
      Yang Yingliang authored
      I got a memleak report when doing some fuzz test:
      
      BUG: memory leak
      unreferenced object 0xffff888113e02300 (size 488):
      comm "syz-executor401", pid 356, jiffies 4294809529 (age 11.954s)
      hex dump (first 32 bytes):
      00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
      a0 a4 ce 19 81 88 ff ff 60 ce 09 0d 81 88 ff ff ........`.......
      backtrace:
      [<00000000129a84ec>] kmem_cache_zalloc include/linux/slab.h:659 [inline]
      [<00000000129a84ec>] __alloc_file+0x25/0x310 fs/file_table.c:101
      [<000000003050ad84>] alloc_empty_file+0x4f/0x120 fs/file_table.c:151
      [<000000004d0a41a3>] alloc_file+0x5e/0x550 fs/file_table.c:193
      [<000000002cb242f0>] alloc_file_pseudo+0x16a/0x240 fs/file_table.c:233
      [<00000000046a4baa>] anon_inode_getfile fs/anon_inodes.c:91 [inline]
      [<00000000046a4baa>] anon_inode_getfile+0xac/0x1c0 fs/anon_inodes.c:74
      [<0000000035beb745>] __do_sys_perf_event_open+0xd4a/0x2680 kernel/events/core.c:11720
      [<0000000049009dc7>] do_syscall_64+0x56/0xa0 arch/x86/entry/common.c:359
      [<00000000353731ca>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      BUG: memory leak
      unreferenced object 0xffff8881152dd5e0 (size 16):
      comm "syz-executor401", pid 356, jiffies 4294809529 (age 11.954s)
      hex dump (first 16 bytes):
      01 00 00 00 01 00 00 00 00 00 00 00 00 00 00 00 ................
      backtrace:
      [<0000000074caa794>] kmem_cache_zalloc include/linux/slab.h:659 [inline]
      [<0000000074caa794>] lsm_file_alloc security/security.c:567 [inline]
      [<0000000074caa794>] security_file_alloc+0x32/0x160 security/security.c:1440
      [<00000000c6745ea3>] __alloc_file+0xba/0x310 fs/file_table.c:106
      [<000000003050ad84>] alloc_empty_file+0x4f/0x120 fs/file_table.c:151
      [<000000004d0a41a3>] alloc_file+0x5e/0x550 fs/file_table.c:193
      [<000000002cb242f0>] alloc_file_pseudo+0x16a/0x240 fs/file_table.c:233
      [<00000000046a4baa>] anon_inode_getfile fs/anon_inodes.c:91 [inline]
      [<00000000046a4baa>] anon_inode_getfile+0xac/0x1c0 fs/anon_inodes.c:74
      [<0000000035beb745>] __do_sys_perf_event_open+0xd4a/0x2680 kernel/events/core.c:11720
      [<0000000049009dc7>] do_syscall_64+0x56/0xa0 arch/x86/entry/common.c:359
      [<00000000353731ca>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      If io_sqe_file_register() failed, we need put the file that get by fget()
      to avoid the memleak.
      
      Fixes: c3a31e60 ("io_uring: add support for IORING_REGISTER_FILES_UPDATE")
      Cc: stable@vger.kernel.org
      Reported-by: default avatarHulk Robot <hulkci@huawei.com>
      Signed-off-by: default avatarYang Yingliang <yangyingliang@huawei.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      f3bd9dae
    • Xiaoguang Wang's avatar
      io_uring: export cq overflow status to userspace · 6d5f9049
      Xiaoguang Wang authored
      For those applications which are not willing to use io_uring_enter()
      to reap and handle cqes, they may completely rely on liburing's
      io_uring_peek_cqe(), but if cq ring has overflowed, currently because
      io_uring_peek_cqe() is not aware of this overflow, it won't enter
      kernel to flush cqes, below test program can reveal this bug:
      
      static void test_cq_overflow(struct io_uring *ring)
      {
              struct io_uring_cqe *cqe;
              struct io_uring_sqe *sqe;
              int issued = 0;
              int ret = 0;
      
              do {
                      sqe = io_uring_get_sqe(ring);
                      if (!sqe) {
                              fprintf(stderr, "get sqe failed\n");
                              break;;
                      }
                      ret = io_uring_submit(ring);
                      if (ret <= 0) {
                              if (ret != -EBUSY)
                                      fprintf(stderr, "sqe submit failed: %d\n", ret);
                              break;
                      }
                      issued++;
              } while (ret > 0);
              assert(ret == -EBUSY);
      
              printf("issued requests: %d\n", issued);
      
              while (issued) {
                      ret = io_uring_peek_cqe(ring, &cqe);
                      if (ret) {
                              if (ret != -EAGAIN) {
                                      fprintf(stderr, "peek completion failed: %s\n",
                                              strerror(ret));
                                      break;
                              }
                              printf("left requets: %d\n", issued);
                              continue;
                      }
                      io_uring_cqe_seen(ring, cqe);
                      issued--;
                      printf("left requets: %d\n", issued);
              }
      }
      
      int main(int argc, char *argv[])
      {
              int ret;
              struct io_uring ring;
      
              ret = io_uring_queue_init(16, &ring, 0);
              if (ret) {
                      fprintf(stderr, "ring setup failed: %d\n", ret);
                      return 1;
              }
      
              test_cq_overflow(&ring);
              return 0;
      }
      
      To fix this issue, export cq overflow status to userspace by adding new
      IORING_SQ_CQ_OVERFLOW flag, then helper functions() in liburing, such as
      io_uring_peek_cqe, can be aware of this cq overflow and do flush accordingly.
      Signed-off-by: default avatarXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      6d5f9049
  8. 08 Jul, 2020 2 commits
  9. 07 Jul, 2020 3 commits
  10. 06 Jul, 2020 3 commits
  11. 05 Jul, 2020 6 commits
  12. 04 Jul, 2020 1 commit
    • Jens Axboe's avatar
      io_uring: fix regression with always ignoring signals in io_cqring_wait() · b7db41c9
      Jens Axboe authored
      When switching to TWA_SIGNAL for task_work notifications, we also made
      any signal based condition in io_cqring_wait() return -ERESTARTSYS.
      This breaks applications that rely on using signals to abort someone
      waiting for events.
      
      Check if we have a signal pending because of queued task_work, and
      repeat the signal check once we've run the task_work. This provides a
      reliable way of telling the two apart.
      
      Additionally, only use TWA_SIGNAL if we are using an eventfd. If not,
      we don't have the dependency situation described in the original commit,
      and we can get by with just using TWA_RESUME like we previously did.
      
      Fixes: ce593a6c ("io_uring: use signal based task_work running")
      Cc: stable@vger.kernel.org # v5.7
      Reported-by: default avatarAndres Freund <andres@anarazel.de>
      Tested-by: default avatarAndres Freund <andres@anarazel.de>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      b7db41c9
  13. 30 Jun, 2020 4 commits
    • Jens Axboe's avatar
      io_uring: use signal based task_work running · ce593a6c
      Jens Axboe authored
      Since 5.7, we've been using task_work to trigger async running of
      requests in the context of the original task. This generally works
      great, but there's a case where if the task is currently blocked
      in the kernel waiting on a condition to become true, it won't process
      task_work. Even though the task is woken, it just checks whatever
      condition it's waiting on, and goes back to sleep if it's still false.
      
      This is a problem if that very condition only becomes true when that
      task_work is run. An example of that is the task registering an eventfd
      with io_uring, and it's now blocked waiting on an eventfd read. That
      read could depend on a completion event, and that completion event
      won't get trigged until task_work has been run.
      
      Use the TWA_SIGNAL notification for task_work, so that we ensure that
      the task always runs the work when queued.
      
      Cc: stable@vger.kernel.org # v5.7
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      ce593a6c
    • Oleg Nesterov's avatar
      task_work: teach task_work_add() to do signal_wake_up() · e91b4816
      Oleg Nesterov authored
      So that the target task will exit the wait_event_interruptible-like
      loop and call task_work_run() asap.
      
      The patch turns "bool notify" into 0,TWA_RESUME,TWA_SIGNAL enum, the
      new TWA_SIGNAL flag implies signal_wake_up().  However, it needs to
      avoid the race with recalc_sigpending(), so the patch also adds the
      new JOBCTL_TASK_WORK bit included in JOBCTL_PENDING_MASK.
      
      TODO: once this patch is merged we need to change all current users
      of task_work_add(notify = true) to use TWA_RESUME.
      
      Cc: stable@vger.kernel.org # v5.7
      Acked-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: default avatarOleg Nesterov <oleg@redhat.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      e91b4816
    • Pavel Begunkov's avatar
      io_uring: fix missing ->mm on exit · 8eb06d7e
      Pavel Begunkov authored
      There is a fancy bug, where exiting user task may not have ->mm,
      that makes task_works to try to do kthread_use_mm(ctx->sqo_mm).
      
      Don't do that if sqo_mm is NULL.
      
      [  290.460558] WARNING: CPU: 6 PID: 150933 at kernel/kthread.c:1238
      	kthread_use_mm+0xf3/0x110
      [  290.460579] CPU: 6 PID: 150933 Comm: read-write2 Tainted: G
      	I E     5.8.0-rc2-00066-g9b21720607cf #531
      [  290.460580] RIP: 0010:kthread_use_mm+0xf3/0x110
      ...
      [  290.460584] Call Trace:
      [  290.460584]  __io_sq_thread_acquire_mm.isra.0.part.0+0x25/0x30
      [  290.460584]  __io_req_task_submit+0x64/0x80
      [  290.460584]  io_req_task_submit+0x15/0x20
      [  290.460585]  task_work_run+0x67/0xa0
      [  290.460585]  do_exit+0x35d/0xb70
      [  290.460585]  do_group_exit+0x43/0xa0
      [  290.460585]  get_signal+0x140/0x900
      [  290.460586]  do_signal+0x37/0x780
      [  290.460586]  __prepare_exit_to_usermode+0x126/0x1c0
      [  290.460586]  __syscall_return_slowpath+0x3b/0x1c0
      [  290.460587]  do_syscall_64+0x5f/0xa0
      [  290.460587]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      following with faults.
      Signed-off-by: default avatarPavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      8eb06d7e
    • Pavel Begunkov's avatar
      io_uring: optimise io_req_find_next() fast check · 3fa5e0f3
      Pavel Begunkov authored
      gcc 9.2.0 compiles io_req_find_next() as a separate function leaving
      the first REQ_F_LINK_HEAD fast check not inlined. Help it by splitting
      out the check from the function.
      Signed-off-by: default avatarPavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      3fa5e0f3