12 Apr, 2021 (22 commits)
11 Apr, 2021 (18 commits)
    • io_uring: reg buffer overflow checks hardening · 50e96989
      Pavel Begunkov authored
      We are safe with overflows in io_sqe_buffer_register() because an overflow
      will just yield an allocation failure, but it's nicer to check explicitly.
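      As a rough illustration of the kind of check being added (a sketch of the
      general pattern, not the exact hunks; check_add_overflow() is the standard
      helper from <linux/overflow.h>):

          #include <linux/overflow.h>
          #include <linux/errno.h>
          #include <linux/mm.h>

          /*
           * Sketch only: explicitly reject address arithmetic overflow when
           * computing the page span of a user buffer, instead of relying on a
           * huge page count to make the allocation fail.
           */
          static int buf_nr_pages(unsigned long ubuf, size_t len,
                                  unsigned long *nr_pages)
          {
                  unsigned long end;

                  if (check_add_overflow(ubuf, (unsigned long)len, &end))
                          return -EOVERFLOW;
                  if (check_add_overflow(end, PAGE_SIZE - 1, &end))
                          return -EOVERFLOW;

                  end >>= PAGE_SHIFT;
                  *nr_pages = end - (ubuf >> PAGE_SHIFT);
                  return 0;
          }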
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Link: https://lore.kernel.org/r/2b0625551be3d97b80a5fd21c8cd79dc1c91f0b5.1616624589.git.asml.silence@gmail.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: allow SQPOLL without CAP_SYS_ADMIN or CAP_SYS_NICE · 548d819d
      Jens Axboe authored
      Now that all workers are attached to the original task as threads, CPU time
      is accounted directly to the original task as well. This means that we no
      longer have to restrict SQPOLL to elevated privileges, as it's really no
      different from the task spawning a busy-looping thread in userspace.
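      In userspace terms, a plain unprivileged process can now set up an SQPOLL
      ring. A minimal liburing sketch (ring size and idle timeout are arbitrary):

          #include <liburing.h>
          #include <string.h>

          static int setup_sqpoll_ring(struct io_uring *ring)
          {
                  struct io_uring_params p;

                  memset(&p, 0, sizeof(p));
                  p.flags = IORING_SETUP_SQPOLL;
                  p.sq_thread_idle = 2000;  /* ms before the SQ thread goes idle */

                  /*
                   * With this change the setup no longer requires CAP_SYS_ADMIN
                   * or CAP_SYS_NICE; the SQ thread's CPU time is charged to this
                   * task like any other thread it spawned.
                   */
                  return io_uring_queue_init_params(8, ring, &p);
          }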
      Reported-by: Stefano Garzarella <sgarzare@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io-wq: eliminate the need for a manager thread · 685fe7fe
      Jens Axboe authored
      io-wq relies on a manager thread to create/fork new workers, as needed.
      But there's really no strong need for it anymore. We have the following
      cases that fork a new worker:
      
      1) Work queue. This is always done from the task itself, and it's trivial
         to create a worker off that path, if needed.
      
      2) All workers have gone to sleep, and we have more work. This is called
         off the sched-out path. For this case, use a task_work item to queue
         a fork-worker operation (a sketch follows below).
      
      3) Hashed work completion. We don't think anything needs to be done for
         this case; if need be, it could just use approach 2 as well.
      
      Part of this change is incrementing the running worker count before the
      fork, to avoid cases where we observe that we need a worker, queue creation
      of one, and then fork yet another when new work comes in. That last queue
      operation should have waited for the previous worker to come up; it's quite
      possible we don't even need it. Hence, mark the worker as running before we
      fork it off, to handle that case more efficiently.
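      A minimal sketch of the task_work approach used for case 2 above. All names
      here (io_wq_queue_worker_create(), create_io_worker(), struct
      worker_create_req) are illustrative placeholders, not the actual io-wq code:

          #include <linux/task_work.h>
          #include <linux/slab.h>
          #include <linux/sched.h>
          #include <linux/errno.h>

          struct worker_create_req {
                  struct callback_head cb;   /* task_work node */
                  struct io_wq *wq;          /* io-wq to add a worker to */
          };

          static void io_wq_create_worker_cb(struct callback_head *cb)
          {
                  struct worker_create_req *req =
                          container_of(cb, struct worker_create_req, cb);

                  /* Runs in the original task's context, so the new worker is
                   * forked off the task itself rather than a manager thread. */
                  create_io_worker(req->wq);   /* placeholder */
                  kfree(req);
          }

          static int io_wq_queue_worker_create(struct io_wq *wq,
                                               struct task_struct *task)
          {
                  struct worker_create_req *req = kzalloc(sizeof(*req), GFP_ATOMIC);

                  if (!req)
                          return -ENOMEM;
                  req->wq = wq;
                  init_task_work(&req->cb, io_wq_create_worker_cb);
                  /* TWA_SIGNAL makes the task notice and run this promptly */
                  return task_work_add(task, &req->cb, TWA_SIGNAL);
          }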
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • kernel: allow fork with TIF_NOTIFY_SIGNAL pending · 66ae0d1e
      Jens Axboe authored
      fork() fails if signal_pending() is true, but there are two conditions
      that can lead to that:
      
      1) An actual signal is pending. We want fork to fail for that one, like
         we always have.
      
      2) TIF_NOTIFY_SIGNAL is pending, because the task has pending task_work.
         We don't need to make it fail for that case.
      
      Allow fork() to proceed if just task_work is pending, by changing the
      signal_pending() check to task_sigpending().
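      For reference, the distinction between the two helpers is roughly as
      follows (paraphrased from <linux/sched/signal.h>, not a verbatim copy):

          static inline int task_sigpending(struct task_struct *p)
          {
                  /* True only if a real signal is queued */
                  return unlikely(test_tsk_thread_flag(p, TIF_SIGPENDING));
          }

          static inline int signal_pending(struct task_struct *p)
          {
                  /*
                   * Also true for TIF_NOTIFY_SIGNAL, which just means task_work
                   * is pending and the task should break out of wait loops.
                   */
                  if (unlikely(test_tsk_thread_flag(p, TIF_NOTIFY_SIGNAL)))
                          return 1;
                  return task_sigpending(p);
          }

      So checking task_sigpending(current) in the fork path still fails fork()
      when a real signal is queued, while letting it proceed when only task_work
      is pending.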
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: allow events and user_data update of running poll requests · b69de288
      Jens Axboe authored
      This adds two new POLL_ADD flags, IORING_POLL_UPDATE_EVENTS and
      IORING_POLL_UPDATE_USER_DATA. As with the other POLL_ADD flag, these are
      masked into sqe->len. If set, the POLL_ADD will have the following
      behavior:
      
      - sqe->addr must contain the user_data of the poll request that
        needs to be modified. This field is otherwise invalid for a POLL_ADD
        command.
      
      - If IORING_POLL_UPDATE_EVENTS is set, sqe->poll_events must contain the
        new mask for the existing poll request. There is no check for whether
        the new and old masks are identical; if a matching poll request is found,
        it is re-armed with the new mask.
      
      - If IORING_POLL_UPDATE_USER_DATA is set, sqe->off must contain the new
        user_data for the existing poll request.
      
      A POLL_ADD with any of these flags set may complete with any of the
      following results:
      
      1) 0, which means that we successfully found the existing poll request
         specified, and performed the re-arm procedure. Any error from that
         re-arm will be exposed as a completion event for that original poll
         request, not for the update request.
      2) -ENOENT, if no existing poll request was found with the given
         user_data.
      3) -EALREADY, if the existing poll request was already in the process of
         being removed/canceled/completing.
      4) -EACCES, if an attempt was made to modify an internal poll request
         (e.g. not one originally issued as IORING_OP_POLL_ADD).
      
      The usual -EINVAL cases apply as well, if any invalid fields are set
      in the sqe for this command type.
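      A userspace sketch of filling such an update SQE, following the field
      layout described above (raw sqe fields are used; the helper name, fd, and
      user_data values are arbitrary, and the final uAPI may differ from this
      commit's description):

          #include <liburing.h>
          #include <poll.h>
          #include <string.h>

          static void prep_poll_update(struct io_uring *ring, int fd,
                                       __u64 old_user_data, __u64 new_user_data)
          {
                  struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

                  memset(sqe, 0, sizeof(*sqe));
                  sqe->opcode = IORING_OP_POLL_ADD;
                  sqe->fd = fd;                /* POLL_ADD expects a valid fd */
                  sqe->addr = old_user_data;   /* which poll request to update */
                  sqe->len = IORING_POLL_UPDATE_EVENTS |
                             IORING_POLL_UPDATE_USER_DATA;
                  sqe->poll_events = POLLIN;   /* new event mask */
                  sqe->off = new_user_data;    /* new user_data for that request */
                  sqe->user_data = 0x1234;     /* user_data of this update SQE */
          }

      The completion for this SQE then carries one of the result codes listed
      above; any error from the re-arm itself shows up on the original poll
      request's CQE instead.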
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: abstract out a io_poll_find_helper() · b2cb805f
      Jens Axboe authored
      We'll need this helper for another purpose, for now just abstract it
      out and have io_poll_cancel() use it for lookups.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: terminate multishot poll for CQ ring overflow · 5082620f
      Jens Axboe authored
      If we hit overflow and fail to allocate an overflow entry for the
      completion, terminate the multishot poll mode.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: abstract out helper for removing poll waitqs/hashes · b2c3f7e1
      Jens Axboe authored
      No functional changes in this patch, just preparation for killing multishot
      poll on CQ overflow.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: add multishot mode for IORING_OP_POLL_ADD · 88e41cf9
      Jens Axboe authored
      The default io_uring poll mode is one-shot, where once the event triggers,
      the poll command is completed and won't trigger any further events. If
      we're doing repeated polling on the same file or socket, then it can be
      more efficient to do multishot, where we keep triggering whenever the
      event becomes true.
      
      This deviates from the usual norm of having one CQE per SQE submitted. Add
      a CQE flag, IORING_CQE_F_MORE, which tells the application to expect
      further completion events from the submitted SQE. Right now the only user
      of this is POLL_ADD in multishot mode.
      
      Since sqe->poll_events is using the space that we normally use for adding
      flags to commands, use sqe->len for the flag space for POLL_ADD. Multishot
      mode is selected by setting IORING_POLL_ADD_MULTI in sqe->len. An
      application should expect more CQEs for the specified SQE if the CQE is
      flagged with IORING_CQE_F_MORE. In multishot mode, only cancelation or an
      error will terminate the poll request, in which case the flag will be
      cleared.
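      A liburing sketch of issuing a multishot poll and consuming its CQEs,
      relying only on the IORING_POLL_ADD_MULTI and IORING_CQE_F_MORE flags
      described above (the helper name and fd are placeholders):

          #include <liburing.h>
          #include <poll.h>
          #include <stdbool.h>
          #include <stdio.h>

          static int poll_multishot(struct io_uring *ring, int fd)
          {
                  struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
                  struct io_uring_cqe *cqe;

                  io_uring_prep_poll_add(sqe, fd, POLLIN);
                  sqe->len |= IORING_POLL_ADD_MULTI;   /* request multishot mode */
                  sqe->user_data = 1;
                  io_uring_submit(ring);

                  for (;;) {
                          int ret = io_uring_wait_cqe(ring, &cqe);
                          bool more;

                          if (ret < 0)
                                  return ret;
                          more = cqe->flags & IORING_CQE_F_MORE;
                          printf("poll fired, res mask 0x%x\n", (unsigned)cqe->res);
                          io_uring_cqe_seen(ring, cqe);
                          if (!more)
                                  break;   /* poll terminated (error or cancel) */
                  }
                  return 0;
          }

      Recent liburing versions also ship a dedicated multishot prep helper, but
      setting the flag in sqe->len directly matches the description in this
      commit.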
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: include cflags in completion trace event · 7471e1af
      Jens Axboe authored
      We should be including the completion flags for better introspection on
      exactly what completion event was logged.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: allocate memory for overflowed CQEs · 6c2450ae
      Pavel Begunkov authored
      Instead of using a request itself for overflowed CQE stashing, allocate a
      separate entry. The disadvantage is that the allocation may fail and it
      will be accounted as lost (see rings->cq_overflow), so we lose reliability
      under memory pressure if the application is driving the CQ ring into
      overflow. However, it opens the way for multiple CQEs per SQE and even
      for generating SQE-less CQEs.
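      Roughly the shape of the change (the struct layout and helper name are
      illustrative, not a copy of the fs/io_uring.c code):

          #include <linux/slab.h>
          #include <linux/list.h>
          #include <linux/types.h>
          #include <uapi/linux/io_uring.h>

          struct io_overflow_cqe {
                  struct list_head        list;
                  struct io_uring_cqe     cqe;
          };

          static bool stash_overflow_cqe(struct list_head *overflow_list,
                                         u64 user_data, s32 res, u32 cflags)
          {
                  struct io_overflow_cqe *ocqe;

                  /*
                   * GFP_ATOMIC: called in atomic context; __GFP_ACCOUNT charges
                   * the allocation to the task's memory cgroup.
                   */
                  ocqe = kmalloc(sizeof(*ocqe), GFP_ATOMIC | __GFP_ACCOUNT);
                  if (!ocqe)
                          return false;   /* counted as lost via rings->cq_overflow */

                  ocqe->cqe.user_data = user_data;
                  ocqe->cqe.res = res;
                  ocqe->cqe.flags = cflags;
                  list_add_tail(&ocqe->list, overflow_list);
                  return true;
          }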
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      [axboe: use GFP_ATOMIC | __GFP_ACCOUNT]
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: mask in error/nval/hangup consistently for poll · 464dca61
      Jens Axboe authored
      Instead of masking these in as part of regular POLL_ADD prep, do it in
      io_init_poll_iocb(), and include NVAL as that's generally unmaskable,
      and RDHUP alongside the HUP that is already set.
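      As a sketch of the idea (the macro name below is invented, not the one in
      the tree):

          #include <linux/poll.h>

          /* Events the application can never mask out, OR'ed in once at init */
          #define IO_POLL_UNMASKABLE (EPOLLERR | EPOLLHUP | EPOLLNVAL | EPOLLRDHUP)

          static __poll_t io_poll_effective_mask(__poll_t requested)
          {
                  return requested | IO_POLL_UNMASKABLE;
          }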
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: optimise rw complete error handling · 9532b99b
      Pavel Begunkov authored
      Expect read/write to succeed and create a hot path for this case; in
      particular, hide all error handling with resubmission behind a single
      check for the expected result.
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: hide iter revert in resubmit_prep · ab454438
      Pavel Begunkov authored
      Move the iov_iter_revert() that resets the iterator in the -EIOCBQUEUED
      case into io_resubmit_prep(), so we don't do the heavy revert in the hot
      path; this also saves a couple of checks.
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: don't alter iopoll reissue fail ret code · 8c130827
      Pavel Begunkov authored
      When reissue_prep fails in io_complete_rw_iopoll(), we currently change the
      return code to -EIO to prevent io_iopoll_complete() from doing resubmission.
      Mark requests with a new flag (i.e. REQ_F_DONT_REISSUE) instead and
      retain the original return value.
      
      It also removes io_rw_reissue() from io_iopoll_complete(), which will be
      used later.
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: optimise kiocb_end_write for !ISREG · 1c98679d
      Pavel Begunkov authored
      file_end_write() is only for regular files, so the function does a couple
      of dereferences to get the inode and check for that. However, we already
      have REQ_F_ISREG at hand, so just use it and inline file_end_write().
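      Open-coded, the end-write path then looks roughly like this (a sketch, not
      necessarily the exact patched function; struct io_kiocb and REQ_F_ISREG are
      io_uring-internal):

          #include <linux/fs.h>

          static void kiocb_end_write_sketch(struct io_kiocb *req)
          {
                  /* The flag test replaces file_end_write()'s inode lookup */
                  if (req->flags & REQ_F_ISREG) {
                          struct super_block *sb = file_inode(req->file)->i_sb;

                          /* We inherited freeze protection from the submitter;
                           * tell lockdep, then drop it. */
                          __sb_writers_acquired(sb, SB_FREEZE_WRITE);
                          sb_end_write(sb);
                  }
          }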
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: kill unused REQ_F_NO_FILE_TABLE · 59d70013
      Pavel Begunkov authored
      current->files is always valid now, even for io-wq threads, so kill the
      no-longer-used REQ_F_NO_FILE_TABLE.
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: don't init req->work fully in advance · e1d675df
      Pavel Begunkov authored
      req->work is mostly unused unless the request is punted, and io_init_req()
      is too hot to fully initialise it. Fortunately, we can skip initialising
      work.next, as it's controlled by io-wq, and we can avoid touching
      work.flags by moving everything related into io_prep_async_work(). The
      only field left is req->work.creds; there is nothing that can be done
      about it, so keep maintaining it.
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>