Commit 9961a785 authored by Linus Torvalds

Merge tag 'for-6.10/io_uring-20240511' of git://git.kernel.dk/linux

Pull io_uring updates from Jens Axboe:

 - Greatly improve send zerocopy performance, by enabling coalescing of
   sent buffers.

   MSG_ZEROCOPY already does this with send(2) and sendmsg(2), but the
   io_uring side did not. In local testing, the crossover point at which
   send zerocopy becomes faster is now around 3000-byte packets, and it
   also outperforms the synchronous syscall variants.

   This feature relies on a shared branch with net-next, which was
   pulled into both branches.
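
   A minimal userspace sketch of queueing a zerocopy send with liburing
   (assuming the io_uring_prep_send_zc() helper; error handling omitted):

      #include <liburing.h>

      static void queue_send_zc(struct io_uring *ring, int sockfd,
                                const void *buf, size_t len)
      {
              struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

              /* The first CQE carries the send result (with
               * IORING_CQE_F_MORE set); a second CQE with
               * IORING_CQE_F_NOTIF is posted once the buffer
               * may be reused. */
              io_uring_prep_send_zc(sqe, sockfd, buf, len, 0, 0);
              io_uring_submit(ring);
      }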

 - Unification of how async preparation is done across opcodes.

   Previously, opcodes that required extra memory for async retry would
   only allocate it once it was actually needed, using on-stack state up
   until that point. When an async retry was needed, the on-stack state
   was adjusted for the retry and then copied to the allocated memory.

   This led to some fragile and ugly code, particularly for read/write
   handling, and made storage retries more difficult than they needed to
   be. Allocate the memory upfront, as it's cheap from our pools, and
   use that state consistently both for the initial issue and for any
   retries.
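
   A rough sketch of the new pattern, using the per-ctx alloc cache
   helpers reworked in this series (io_async_foo and foo_cache are
   placeholder names, not real opcodes):

      static struct io_async_foo *io_foo_alloc_async(struct io_kiocb *req)
      {
              struct io_ring_ctx *ctx = req->ctx;
              struct io_async_foo *hdr;

              /* cheap: usually comes straight from the cache */
              hdr = io_alloc_cache_get(&ctx->foo_cache);
              if (!hdr)
                      hdr = kmalloc(sizeof(*hdr), GFP_KERNEL);
              /* the same state is used for the initial issue and retries */
              req->async_data = hdr;
              return hdr;
      }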

 - Move away from using remap_pfn_range() for mapping the rings.

   This is really not the right interface to use and can cause lifetime
   issues or leaks. Additionally, it means the ring sq/cq arrays need to
   be physically contiguous, which can cause problems in production with
   larger rings when services are restarted, as memory can be very
   fragmented at that point.

   Move to using vm_insert_page(s) for the ring sq/cq arrays, and apply
   the same treatment to mapped ring provided buffers. This also helps
   unify the code we have dealing with allocating and mapping memory.

   It's hard to see in the diffstat since a few features are being added
   as well, but this change also removes roughly 400 lines of code.
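
   The core of the new mapping path (see io_uring/memmap.c below) boils
   down to inserting the ring's page array into the VMA, roughly:

      static int map_ring_pages(struct vm_area_struct *vma,
                                struct page **pages, int npages)
      {
              unsigned long nr = npages;

              /* the backing pages no longer need to be physically
               * contiguous */
              vm_flags_set(vma, VM_DONTEXPAND);
              return vm_insert_pages(vma, vma->vm_start, pages, &nr);
      }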

 - Add support for bundles for send/recv.

   When used with provided buffers, bundles support sending or receiving
   more than one buffer at a time, improving efficiency by only needing
   to call into the networking stack once for multiple sends or
   receives.
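
   A rough sketch of a bundled receive from a provided buffer group
   (assuming the flag is passed in sqe->ioprio like the other
   IORING_RECVSEND_* flags; see the uapi changes below):

      struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

      /* buffers are picked from provided buffer group 'bgid' */
      io_uring_prep_recv(sqe, sockfd, NULL, 0, 0);
      sqe->flags |= IOSQE_BUFFER_SELECT;
      sqe->buf_group = bgid;
      sqe->ioprio |= IORING_RECVSEND_BUNDLE;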

 - Tweaks for our accept operations, supporting both a DONTWAIT flag for
   skipping poll arm and retry if we can, and a POLLFIRST flag that the
   application can use to skip the initial accept attempt and rely
   purely on poll for triggering the operation. Both of these have
   identical flags on the receive side already.
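
   For example, a multishot accept that relies purely on poll (the
   accept flags live in sqe->ioprio, per the uapi change below):

      struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

      io_uring_prep_multishot_accept(sqe, listen_fd, NULL, NULL, 0);
      /* skip the initial accept attempt, let poll trigger it */
      sqe->ioprio |= IORING_ACCEPT_POLL_FIRST;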

 - Make the task_work ctx locking unconditional.

   We had various code paths here that would do a mix of lock/trylock
   and record in the task_work state whether or not the lock was held.
   All of that goes away: we lock unconditionally and get rid of the
   state flag indicating whether it's locked or not.

   The state struct still exists as an empty type and can go away in the
   future.

 - Add support for specifying NOP completion values, allowing NOP to be
   used for error handling testing.
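
   For example (field names as per the uapi additions below; the
   injected value is read from sqe->len):

      struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

      io_uring_prep_nop(sqe);
      sqe->nop_flags = IORING_NOP_INJECT_RESULT;
      sqe->len = (__u32)-EFAULT;      /* cqe->res will be -EFAULT */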

 - Use set/test bit for io-wq worker flags. Not strictly needed, but
   also doesn't hurt and helps silence a KCSAN warning.
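
   The pattern change is simply (with the flags moving from mask values
   in an unsigned int to bit numbers in an unsigned long):

      /* before: plain read-modify-write, flagged by KCSAN */
      worker->flags |= IO_WORKER_F_RUNNING;

      /* after: atomic bitop, IO_WORKER_F_RUNNING is now a bit number */
      set_bit(IO_WORKER_F_RUNNING, &worker->flags);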

 - Cleanups for io-wq locking and work assignments, closing a tiny race
   where cancelations would not be able to find the work item reliably.

 - Misc fixes, cleanups, and improvements

* tag 'for-6.10/io_uring-20240511' of git://git.kernel.dk/linux: (97 commits)
  io_uring: support to inject result for NOP
  io_uring: fail NOP if non-zero op flags is passed in
  io_uring/net: add IORING_ACCEPT_POLL_FIRST flag
  io_uring/net: add IORING_ACCEPT_DONTWAIT flag
  io_uring/filetable: don't unnecessarily clear/reset bitmap
  io_uring/io-wq: Use set_bit() and test_bit() at worker->flags
  io_uring/msg_ring: cleanup posting to IOPOLL vs !IOPOLL ring
  io_uring: Require zeroed sqe->len on provided-buffers send
  io_uring/notif: disable LAZY_WAKE for linked notifs
  io_uring/net: fix sendzc lazy wake polling
  io_uring/msg_ring: reuse ctx->submitter_task read using READ_ONCE instead of re-reading it
  io_uring/rw: reinstate thread check for retries
  io_uring/notif: implement notification stacking
  io_uring/notif: simplify io_notif_flush()
  net: add callback for setting a ubuf_info to skb
  net: extend ubuf_info callback to ops structure
  io_uring/net: support bundles for recv
  io_uring/net: support bundles for send
  io_uring/kbuf: add helpers for getting/peeking multiple buffers
  io_uring/net: add provided buffer support for IORING_OP_SEND
  ...
parents f4e8d802 deb1e496
......@@ -754,7 +754,7 @@ static ssize_t tap_get_user(struct tap_queue *q, void *msg_control,
skb_zcopy_init(skb, msg_control);
} else if (msg_control) {
struct ubuf_info *uarg = msg_control;
uarg->callback(NULL, uarg, false);
uarg->ops->complete(NULL, uarg, false);
}
dev_queue_xmit(skb);
......
......@@ -1906,7 +1906,7 @@ static ssize_t tun_get_user(struct tun_struct *tun, struct tun_file *tfile,
skb_zcopy_init(skb, msg_control);
} else if (msg_control) {
struct ubuf_info *uarg = msg_control;
uarg->callback(NULL, uarg, false);
uarg->ops->complete(NULL, uarg, false);
}
skb_reset_network_header(skb);
......
......@@ -390,9 +390,8 @@ bool xenvif_rx_queue_tail(struct xenvif_queue *queue, struct sk_buff *skb);
void xenvif_carrier_on(struct xenvif *vif);
/* Callback from stack when TX packet can be released */
void xenvif_zerocopy_callback(struct sk_buff *skb, struct ubuf_info *ubuf,
bool zerocopy_success);
/* Callbacks from stack when TX packet can be released */
extern const struct ubuf_info_ops xenvif_ubuf_ops;
static inline pending_ring_idx_t nr_pending_reqs(struct xenvif_queue *queue)
{
......
......@@ -593,7 +593,7 @@ int xenvif_init_queue(struct xenvif_queue *queue)
for (i = 0; i < MAX_PENDING_REQS; i++) {
queue->pending_tx_info[i].callback_struct = (struct ubuf_info_msgzc)
{ { .callback = xenvif_zerocopy_callback },
{ { .ops = &xenvif_ubuf_ops },
{ { .ctx = NULL,
.desc = i } } };
queue->grant_tx_handle[i] = NETBACK_INVALID_HANDLE;
......
......@@ -1156,7 +1156,7 @@ static int xenvif_handle_frag_list(struct xenvif_queue *queue, struct sk_buff *s
uarg = skb_shinfo(skb)->destructor_arg;
/* increase inflight counter to offset decrement in callback */
atomic_inc(&queue->inflight_packets);
uarg->callback(NULL, uarg, true);
uarg->ops->complete(NULL, uarg, true);
skb_shinfo(skb)->destructor_arg = NULL;
/* Fill the skb with the new (local) frags. */
......@@ -1278,8 +1278,9 @@ static int xenvif_tx_submit(struct xenvif_queue *queue)
return work_done;
}
void xenvif_zerocopy_callback(struct sk_buff *skb, struct ubuf_info *ubuf_base,
bool zerocopy_success)
static void xenvif_zerocopy_callback(struct sk_buff *skb,
struct ubuf_info *ubuf_base,
bool zerocopy_success)
{
unsigned long flags;
pending_ring_idx_t index;
......@@ -1312,6 +1313,10 @@ void xenvif_zerocopy_callback(struct sk_buff *skb, struct ubuf_info *ubuf_base,
xenvif_skb_zerocopy_complete(queue);
}
const struct ubuf_info_ops xenvif_ubuf_ops = {
.complete = xenvif_zerocopy_callback,
};
static inline void xenvif_tx_dealloc_action(struct xenvif_queue *queue)
{
struct gnttab_unmap_grant_ref *gop;
......
......@@ -423,13 +423,20 @@ static enum rq_end_io_ret nvme_uring_cmd_end_io(struct request *req,
pdu->result = le64_to_cpu(nvme_req(req)->result.u64);
/*
* For iopoll, complete it directly.
* For iopoll, complete it directly. Note that using the uring_cmd
* helper for this is safe only because we check blk_rq_is_poll().
* As that returns false if we're NOT on a polled queue, then it's
* safe to use the polled completion helper.
*
* Otherwise, move the completion to task work.
*/
if (blk_rq_is_poll(req))
nvme_uring_task_cb(ioucmd, IO_URING_F_UNLOCKED);
else
if (blk_rq_is_poll(req)) {
if (pdu->bio)
blk_rq_unmap_user(pdu->bio);
io_uring_cmd_iopoll_done(ioucmd, pdu->result, pdu->status);
} else {
io_uring_cmd_do_in_task_lazy(ioucmd, nvme_uring_task_cb);
}
return RQ_END_IO_FREE;
}
......
......@@ -380,7 +380,7 @@ static void vhost_zerocopy_signal_used(struct vhost_net *net,
}
}
static void vhost_zerocopy_callback(struct sk_buff *skb,
static void vhost_zerocopy_complete(struct sk_buff *skb,
struct ubuf_info *ubuf_base, bool success)
{
struct ubuf_info_msgzc *ubuf = uarg_to_msgzc(ubuf_base);
......@@ -408,6 +408,10 @@ static void vhost_zerocopy_callback(struct sk_buff *skb,
rcu_read_unlock_bh();
}
static const struct ubuf_info_ops vhost_ubuf_ops = {
.complete = vhost_zerocopy_complete,
};
static inline unsigned long busy_clock(void)
{
return local_clock() >> 10;
......@@ -879,7 +883,7 @@ static void handle_tx_zerocopy(struct vhost_net *net, struct socket *sock)
vq->heads[nvq->upend_idx].len = VHOST_DMA_IN_PROGRESS;
ubuf->ctx = nvq->ubufs;
ubuf->desc = nvq->upend_idx;
ubuf->ubuf.callback = vhost_zerocopy_callback;
ubuf->ubuf.ops = &vhost_ubuf_ops;
ubuf->ubuf.flags = SKBFL_ZEROCOPY_FRAG;
refcount_set(&ubuf->ubuf.refcnt, 1);
msg.msg_control = &ctl;
......
......@@ -11,7 +11,6 @@ void __io_uring_cancel(bool cancel_all);
void __io_uring_free(struct task_struct *tsk);
void io_uring_unreg_ringfd(void);
const char *io_uring_get_opcode(u8 opcode);
int io_uring_cmd_sock(struct io_uring_cmd *cmd, unsigned int issue_flags);
bool io_is_uring_fops(struct file *file);
static inline void io_uring_files_cancel(void)
......@@ -45,11 +44,6 @@ static inline const char *io_uring_get_opcode(u8 opcode)
{
return "";
}
static inline int io_uring_cmd_sock(struct io_uring_cmd *cmd,
unsigned int issue_flags)
{
return -EOPNOTSUPP;
}
static inline bool io_is_uring_fops(struct file *file)
{
return false;
......
......@@ -26,12 +26,25 @@ static inline const void *io_uring_sqe_cmd(const struct io_uring_sqe *sqe)
#if defined(CONFIG_IO_URING)
int io_uring_cmd_import_fixed(u64 ubuf, unsigned long len, int rw,
struct iov_iter *iter, void *ioucmd);
/*
* Completes the request, i.e. posts an io_uring CQE and deallocates @ioucmd
* and the corresponding io_uring request.
*
* Note: the caller should never hard code @issue_flags and is only allowed
* to pass the mask provided by the core io_uring code.
*/
void io_uring_cmd_done(struct io_uring_cmd *cmd, ssize_t ret, ssize_t res2,
unsigned issue_flags);
void __io_uring_cmd_do_in_task(struct io_uring_cmd *ioucmd,
void (*task_work_cb)(struct io_uring_cmd *, unsigned),
unsigned flags);
/*
* Note: the caller should never hard code @issue_flags and only use the
* mask provided by the core io_uring code.
*/
void io_uring_cmd_mark_cancelable(struct io_uring_cmd *cmd,
unsigned int issue_flags);
......@@ -56,6 +69,17 @@ static inline void io_uring_cmd_mark_cancelable(struct io_uring_cmd *cmd,
}
#endif
/*
* Polled completions must ensure they are coming from a poll queue, and
* hence are completed inside the usual poll handling loops.
*/
static inline void io_uring_cmd_iopoll_done(struct io_uring_cmd *ioucmd,
ssize_t ret, ssize_t res2)
{
lockdep_assert(in_task());
io_uring_cmd_done(ioucmd, ret, res2, 0);
}
/* users must follow the IOU_F_TWQ_LAZY_WAKE semantics */
static inline void io_uring_cmd_do_in_task_lazy(struct io_uring_cmd *ioucmd,
void (*task_work_cb)(struct io_uring_cmd *, unsigned))
......
/* SPDX-License-Identifier: GPL-2.0-or-later */
#ifndef _LINUX_IO_URING_NET_H
#define _LINUX_IO_URING_NET_H
struct io_uring_cmd;
#if defined(CONFIG_IO_URING)
int io_uring_cmd_sock(struct io_uring_cmd *cmd, unsigned int issue_flags);
#else
static inline int io_uring_cmd_sock(struct io_uring_cmd *cmd,
unsigned int issue_flags)
{
return -EOPNOTSUPP;
}
#endif
#endif
......@@ -205,6 +205,7 @@ struct io_submit_state {
bool plug_started;
bool need_plug;
bool cq_flush;
unsigned short submit_nr;
unsigned int cqes_count;
struct blk_plug plug;
......@@ -219,7 +220,7 @@ struct io_ev_fd {
};
struct io_alloc_cache {
struct io_wq_work_node list;
void **entries;
unsigned int nr_cached;
unsigned int max_cached;
size_t elem_size;
......@@ -299,6 +300,8 @@ struct io_ring_ctx {
struct io_hash_table cancel_table_locked;
struct io_alloc_cache apoll_cache;
struct io_alloc_cache netmsg_cache;
struct io_alloc_cache rw_cache;
struct io_alloc_cache uring_cache;
/*
* Any cancelable uring_cmd is added to this list in
......@@ -341,14 +344,8 @@ struct io_ring_ctx {
unsigned cq_last_tm_flush;
} ____cacheline_aligned_in_smp;
struct io_uring_cqe completion_cqes[16];
spinlock_t completion_lock;
/* IRQ completion list, under ->completion_lock */
unsigned int locked_free_nr;
struct io_wq_work_list locked_free_list;
struct list_head io_buffers_comp;
struct list_head cq_overflow_list;
struct io_hash_table cancel_table;
......@@ -371,9 +368,6 @@ struct io_ring_ctx {
struct list_head io_buffers_cache;
/* deferred free list, protected by ->uring_lock */
struct hlist_head io_buf_list;
/* Keep this last, we don't need it for the fast path */
struct wait_queue_head poll_wq;
struct io_restriction restrictions;
......@@ -438,8 +432,6 @@ struct io_ring_ctx {
};
struct io_tw_state {
/* ->uring_lock is taken, callbacks can use io_tw_lock to lock it */
bool locked;
};
enum {
......@@ -480,6 +472,7 @@ enum {
REQ_F_CAN_POLL_BIT,
REQ_F_BL_EMPTY_BIT,
REQ_F_BL_NO_RECYCLE_BIT,
REQ_F_BUFFERS_COMMIT_BIT,
/* not a real bit, just to check we're not overflowing the space */
__REQ_F_LAST_BIT,
......@@ -558,6 +551,8 @@ enum {
REQ_F_BL_EMPTY = IO_REQ_FLAG(REQ_F_BL_EMPTY_BIT),
/* don't recycle provided buffers for this request */
REQ_F_BL_NO_RECYCLE = IO_REQ_FLAG(REQ_F_BL_NO_RECYCLE_BIT),
/* buffer ring head needs incrementing on put */
REQ_F_BUFFERS_COMMIT = IO_REQ_FLAG(REQ_F_BUFFERS_COMMIT_BIT),
};
typedef void (*io_req_tw_func_t)(struct io_kiocb *req, struct io_tw_state *ts);
......
......@@ -527,6 +527,13 @@ enum {
#define SKBFL_ALL_ZEROCOPY (SKBFL_ZEROCOPY_FRAG | SKBFL_PURE_ZEROCOPY | \
SKBFL_DONT_ORPHAN | SKBFL_MANAGED_FRAG_REFS)
struct ubuf_info_ops {
void (*complete)(struct sk_buff *, struct ubuf_info *,
bool zerocopy_success);
/* has to be compatible with skb_zcopy_set() */
int (*link_skb)(struct sk_buff *skb, struct ubuf_info *uarg);
};
/*
* The callback notifies userspace to release buffers when skb DMA is done in
* lower device, the skb last reference should be 0 when calling this.
......@@ -536,8 +543,7 @@ enum {
* The desc field is used to track userspace buffer index.
*/
struct ubuf_info {
void (*callback)(struct sk_buff *, struct ubuf_info *,
bool zerocopy_success);
const struct ubuf_info_ops *ops;
refcount_t refcnt;
u8 flags;
};
......@@ -1662,14 +1668,13 @@ static inline void skb_set_end_offset(struct sk_buff *skb, unsigned int offset)
}
#endif
extern const struct ubuf_info_ops msg_zerocopy_ubuf_ops;
struct ubuf_info *msg_zerocopy_realloc(struct sock *sk, size_t size,
struct ubuf_info *uarg);
void msg_zerocopy_put_abort(struct ubuf_info *uarg, bool have_uref);
void msg_zerocopy_callback(struct sk_buff *skb, struct ubuf_info *uarg,
bool success);
int __zerocopy_sg_from_iter(struct msghdr *msg, struct sock *sk,
struct sk_buff *skb, struct iov_iter *from,
size_t length);
......@@ -1757,13 +1762,13 @@ static inline void *skb_zcopy_get_nouarg(struct sk_buff *skb)
static inline void net_zcopy_put(struct ubuf_info *uarg)
{
if (uarg)
uarg->callback(NULL, uarg, true);
uarg->ops->complete(NULL, uarg, true);
}
static inline void net_zcopy_put_abort(struct ubuf_info *uarg, bool have_uref)
{
if (uarg) {
if (uarg->callback == msg_zerocopy_callback)
if (uarg->ops == &msg_zerocopy_ubuf_ops)
msg_zerocopy_put_abort(uarg, have_uref);
else if (have_uref)
net_zcopy_put(uarg);
......@@ -1777,7 +1782,7 @@ static inline void skb_zcopy_clear(struct sk_buff *skb, bool zerocopy_success)
if (uarg) {
if (!skb_zcopy_is_nouarg(skb))
uarg->callback(skb, uarg, zerocopy_success);
uarg->ops->complete(skb, uarg, zerocopy_success);
skb_shinfo(skb)->flags &= ~SKBFL_ALL_ZEROCOPY;
}
......
......@@ -72,6 +72,7 @@ struct io_uring_sqe {
__u32 waitid_flags;
__u32 futex_flags;
__u32 install_fd_flags;
__u32 nop_flags;
};
__u64 user_data; /* data to be passed back at completion time */
/* pack this to avoid bogus arm OABI complaints */
......@@ -115,7 +116,7 @@ struct io_uring_sqe {
*/
#define IORING_FILE_INDEX_ALLOC (~0U)
enum {
enum io_uring_sqe_flags_bit {
IOSQE_FIXED_FILE_BIT,
IOSQE_IO_DRAIN_BIT,
IOSQE_IO_LINK_BIT,
......@@ -351,11 +352,20 @@ enum io_uring_op {
* 0 is reported if zerocopy was actually possible.
* IORING_NOTIF_USAGE_ZC_COPIED if data was copied
* (at least partially).
*
* IORING_RECVSEND_BUNDLE Used with IOSQE_BUFFER_SELECT. If set, send or
* recv will grab as many buffers from the buffer
* group ID given and send them all. The completion
* result will be the number of buffers send, with
* the starting buffer ID in cqe->flags as per
* usual for provided buffer usage. The buffers
* will be contigious from the starting buffer ID.
*/
#define IORING_RECVSEND_POLL_FIRST (1U << 0)
#define IORING_RECV_MULTISHOT (1U << 1)
#define IORING_RECVSEND_FIXED_BUF (1U << 2)
#define IORING_SEND_ZC_REPORT_USAGE (1U << 3)
#define IORING_RECVSEND_BUNDLE (1U << 4)
/*
* cqe.res for IORING_CQE_F_NOTIF if
......@@ -370,11 +380,13 @@ enum io_uring_op {
* accept flags stored in sqe->ioprio
*/
#define IORING_ACCEPT_MULTISHOT (1U << 0)
#define IORING_ACCEPT_DONTWAIT (1U << 1)
#define IORING_ACCEPT_POLL_FIRST (1U << 2)
/*
* IORING_OP_MSG_RING command types, stored in sqe->addr
*/
enum {
enum io_uring_msg_ring_flags {
IORING_MSG_DATA, /* pass sqe->len as 'res' and off as user_data */
IORING_MSG_SEND_FD, /* send a registered fd to another ring */
};
......@@ -396,6 +408,13 @@ enum {
*/
#define IORING_FIXED_FD_NO_CLOEXEC (1U << 0)
/*
* IORING_OP_NOP flags (sqe->nop_flags)
*
* IORING_NOP_INJECT_RESULT Inject result from sqe->result
*/
#define IORING_NOP_INJECT_RESULT (1U << 0)
/*
* IO completion data structure (Completion Queue Entry)
*/
......@@ -425,9 +444,7 @@ struct io_uring_cqe {
#define IORING_CQE_F_SOCK_NONEMPTY (1U << 2)
#define IORING_CQE_F_NOTIF (1U << 3)
enum {
IORING_CQE_BUFFER_SHIFT = 16,
};
#define IORING_CQE_BUFFER_SHIFT 16
/*
* Magic offsets for the application to mmap the data it needs
......@@ -522,11 +539,12 @@ struct io_uring_params {
#define IORING_FEAT_CQE_SKIP (1U << 11)
#define IORING_FEAT_LINKED_FILE (1U << 12)
#define IORING_FEAT_REG_REG_RING (1U << 13)
#define IORING_FEAT_RECVSEND_BUNDLE (1U << 14)
/*
* io_uring_register(2) opcodes and arguments
*/
enum {
enum io_uring_register_op {
IORING_REGISTER_BUFFERS = 0,
IORING_UNREGISTER_BUFFERS = 1,
IORING_REGISTER_FILES = 2,
......@@ -583,7 +601,7 @@ enum {
};
/* io-wq worker categories */
enum {
enum io_wq_type {
IO_WQ_BOUND,
IO_WQ_UNBOUND,
};
......@@ -688,7 +706,7 @@ struct io_uring_buf_ring {
* IORING_OFF_PBUF_RING | (bgid << IORING_OFF_PBUF_SHIFT)
* to get a virtual mapping for the ring.
*/
enum {
enum io_uring_register_pbuf_ring_flags {
IOU_PBUF_RING_MMAP = 1,
};
......@@ -719,7 +737,7 @@ struct io_uring_napi {
/*
* io_uring_restriction->opcode values
*/
enum {
enum io_uring_register_restriction_op {
/* Allow an io_uring_register(2) opcode */
IORING_RESTRICTION_REGISTER_OP = 0,
......@@ -775,7 +793,7 @@ struct io_uring_recvmsg_out {
/*
* Argument for IORING_OP_URING_CMD when file is a socket
*/
enum {
enum io_uring_socket_op {
SOCKET_URING_OP_SIOCINQ = 0,
SOCKET_URING_OP_SIOCOUTQ,
SOCKET_URING_OP_GETSOCKOPT,
......
......@@ -2,13 +2,14 @@
#
# Makefile for io_uring
obj-$(CONFIG_IO_URING) += io_uring.o xattr.o nop.o fs.o splice.o \
sync.o advise.o filetable.o \
openclose.o uring_cmd.o epoll.o \
statx.o net.o msg_ring.o timeout.o \
sqpoll.o fdinfo.o tctx.o poll.o \
cancel.o kbuf.o rsrc.o rw.o opdef.o \
notif.o waitid.o register.o truncate.o
obj-$(CONFIG_IO_URING) += io_uring.o opdef.o kbuf.o rsrc.o notif.o \
tctx.o filetable.o rw.o net.o poll.o \
uring_cmd.o openclose.o sqpoll.o \
xattr.o nop.o fs.o splice.o sync.o \
msg_ring.o advise.o openclose.o \
epoll.o statx.o timeout.o fdinfo.o \
cancel.o waitid.o register.o \
truncate.o memmap.o
obj-$(CONFIG_IO_WQ) += io-wq.o
obj-$(CONFIG_FUTEX) += futex.o
obj-$(CONFIG_NET_RX_BUSY_POLL) += napi.o
......@@ -4,63 +4,58 @@
/*
* Don't allow the cache to grow beyond this size.
*/
#define IO_ALLOC_CACHE_MAX 512
struct io_cache_entry {
struct io_wq_work_node node;
};
#define IO_ALLOC_CACHE_MAX 128
static inline bool io_alloc_cache_put(struct io_alloc_cache *cache,
struct io_cache_entry *entry)
void *entry)
{
if (cache->nr_cached < cache->max_cached) {
cache->nr_cached++;
wq_stack_add_head(&entry->node, &cache->list);
kasan_mempool_poison_object(entry);
if (!kasan_mempool_poison_object(entry))
return false;
cache->entries[cache->nr_cached++] = entry;
return true;
}
return false;
}
static inline bool io_alloc_cache_empty(struct io_alloc_cache *cache)
{
return !cache->list.next;
}
static inline struct io_cache_entry *io_alloc_cache_get(struct io_alloc_cache *cache)
static inline void *io_alloc_cache_get(struct io_alloc_cache *cache)
{
if (cache->list.next) {
struct io_cache_entry *entry;
if (cache->nr_cached) {
void *entry = cache->entries[--cache->nr_cached];
entry = container_of(cache->list.next, struct io_cache_entry, node);
kasan_mempool_unpoison_object(entry, cache->elem_size);
cache->list.next = cache->list.next->next;
cache->nr_cached--;
return entry;
}
return NULL;
}
static inline void io_alloc_cache_init(struct io_alloc_cache *cache,
/* returns false if the cache was initialized properly */
static inline bool io_alloc_cache_init(struct io_alloc_cache *cache,
unsigned max_nr, size_t size)
{
cache->list.next = NULL;
cache->nr_cached = 0;
cache->max_cached = max_nr;
cache->elem_size = size;
cache->entries = kvmalloc_array(max_nr, sizeof(void *), GFP_KERNEL);
if (cache->entries) {
cache->nr_cached = 0;
cache->max_cached = max_nr;
cache->elem_size = size;
return false;
}
return true;
}
static inline void io_alloc_cache_free(struct io_alloc_cache *cache,
void (*free)(struct io_cache_entry *))
void (*free)(const void *))
{
while (1) {
struct io_cache_entry *entry = io_alloc_cache_get(cache);
void *entry;
if (!cache->entries)
return;
if (!entry)
break;
while ((entry = io_alloc_cache_get(cache)) != NULL)
free(entry);
}
cache->nr_cached = 0;
kvfree(cache->entries);
cache->entries = NULL;
}
#endif
......@@ -184,9 +184,7 @@ static int __io_async_cancel(struct io_cancel_data *cd,
io_ring_submit_lock(ctx, issue_flags);
ret = -ENOENT;
list_for_each_entry(node, &ctx->tctx_list, ctx_node) {
struct io_uring_task *tctx = node->task->io_uring;
ret = io_async_cancel_one(tctx, cd);
ret = io_async_cancel_one(node->task->io_uring, cd);
if (ret != -ENOENT) {
if (!all)
break;
......
......@@ -50,9 +50,9 @@ static __cold int io_uring_show_cred(struct seq_file *m, unsigned int id,
* Caller holds a reference to the file already, we don't need to do
* anything else to get an extra reference.
*/
__cold void io_uring_show_fdinfo(struct seq_file *m, struct file *f)
__cold void io_uring_show_fdinfo(struct seq_file *m, struct file *file)
{
struct io_ring_ctx *ctx = f->private_data;
struct io_ring_ctx *ctx = file->private_data;
struct io_overflow_cqe *ocqe;
struct io_rings *r = ctx->rings;
struct rusage sq_usage;
......
......@@ -84,12 +84,12 @@ static int io_install_fixed_file(struct io_ring_ctx *ctx, struct file *file,
return ret;
file_slot->file_ptr = 0;
io_file_bitmap_clear(&ctx->file_table, slot_index);
} else {
io_file_bitmap_set(&ctx->file_table, slot_index);
}
*io_get_tag_slot(ctx->file_data, slot_index) = 0;
io_fixed_file_set(file_slot, file);
io_file_bitmap_set(&ctx->file_table, slot_index);
return 0;
}
......
......@@ -9,7 +9,7 @@
#include "../kernel/futex/futex.h"
#include "io_uring.h"
#include "rsrc.h"
#include "alloc_cache.h"
#include "futex.h"
struct io_futex {
......@@ -27,27 +27,21 @@ struct io_futex {
};
struct io_futex_data {
union {
struct futex_q q;
struct io_cache_entry cache;
};
struct futex_q q;
struct io_kiocb *req;
};
void io_futex_cache_init(struct io_ring_ctx *ctx)
{
io_alloc_cache_init(&ctx->futex_cache, IO_NODE_ALLOC_CACHE_MAX,
sizeof(struct io_futex_data));
}
#define IO_FUTEX_ALLOC_CACHE_MAX 32
static void io_futex_cache_entry_free(struct io_cache_entry *entry)
bool io_futex_cache_init(struct io_ring_ctx *ctx)
{
kfree(container_of(entry, struct io_futex_data, cache));
return io_alloc_cache_init(&ctx->futex_cache, IO_FUTEX_ALLOC_CACHE_MAX,
sizeof(struct io_futex_data));
}
void io_futex_cache_free(struct io_ring_ctx *ctx)
{
io_alloc_cache_free(&ctx->futex_cache, io_futex_cache_entry_free);
io_alloc_cache_free(&ctx->futex_cache, kfree);
}
static void __io_futex_complete(struct io_kiocb *req, struct io_tw_state *ts)
......@@ -63,7 +57,7 @@ static void io_futex_complete(struct io_kiocb *req, struct io_tw_state *ts)
struct io_ring_ctx *ctx = req->ctx;
io_tw_lock(ctx, ts);
if (!io_alloc_cache_put(&ctx->futex_cache, &ifd->cache))
if (!io_alloc_cache_put(&ctx->futex_cache, ifd))
kfree(ifd);
__io_futex_complete(req, ts);
}
......@@ -259,11 +253,11 @@ static void io_futex_wake_fn(struct wake_q_head *wake_q, struct futex_q *q)
static struct io_futex_data *io_alloc_ifd(struct io_ring_ctx *ctx)
{
struct io_cache_entry *entry;
struct io_futex_data *ifd;
entry = io_alloc_cache_get(&ctx->futex_cache);
if (entry)
return container_of(entry, struct io_futex_data, cache);
ifd = io_alloc_cache_get(&ctx->futex_cache);
if (ifd)
return ifd;
return kmalloc(sizeof(struct io_futex_data), GFP_NOWAIT);
}
......
......@@ -13,7 +13,7 @@ int io_futex_cancel(struct io_ring_ctx *ctx, struct io_cancel_data *cd,
unsigned int issue_flags);
bool io_futex_remove_all(struct io_ring_ctx *ctx, struct task_struct *task,
bool cancel_all);
void io_futex_cache_init(struct io_ring_ctx *ctx);
bool io_futex_cache_init(struct io_ring_ctx *ctx);
void io_futex_cache_free(struct io_ring_ctx *ctx);
#else
static inline int io_futex_cancel(struct io_ring_ctx *ctx,
......@@ -27,8 +27,9 @@ static inline bool io_futex_remove_all(struct io_ring_ctx *ctx,
{
return false;
}
static inline void io_futex_cache_init(struct io_ring_ctx *ctx)
static inline bool io_futex_cache_init(struct io_ring_ctx *ctx)
{
return false;
}
static inline void io_futex_cache_free(struct io_ring_ctx *ctx)
{
......
......@@ -25,10 +25,10 @@
#define WORKER_IDLE_TIMEOUT (5 * HZ)
enum {
IO_WORKER_F_UP = 1, /* up and active */
IO_WORKER_F_RUNNING = 2, /* account as running */
IO_WORKER_F_FREE = 4, /* worker on free list */
IO_WORKER_F_BOUND = 8, /* is doing bounded work */
IO_WORKER_F_UP = 0, /* up and active */
IO_WORKER_F_RUNNING = 1, /* account as running */
IO_WORKER_F_FREE = 2, /* worker on free list */
IO_WORKER_F_BOUND = 3, /* is doing bounded work */
};
enum {
......@@ -44,21 +44,20 @@ enum {
*/
struct io_worker {
refcount_t ref;
unsigned flags;
int create_index;
unsigned long flags;
struct hlist_nulls_node nulls_node;
struct list_head all_list;
struct task_struct *task;
struct io_wq *wq;
struct io_wq_work *cur_work;
struct io_wq_work *next_work;
raw_spinlock_t lock;
struct completion ref_done;
unsigned long create_state;
struct callback_head create_work;
int create_index;
union {
struct rcu_head rcu;
......@@ -165,7 +164,7 @@ static inline struct io_wq_acct *io_work_get_acct(struct io_wq *wq,
static inline struct io_wq_acct *io_wq_get_acct(struct io_worker *worker)
{
return io_get_acct(worker->wq, worker->flags & IO_WORKER_F_BOUND);
return io_get_acct(worker->wq, test_bit(IO_WORKER_F_BOUND, &worker->flags));
}
static void io_worker_ref_put(struct io_wq *wq)
......@@ -225,7 +224,7 @@ static void io_worker_exit(struct io_worker *worker)
wait_for_completion(&worker->ref_done);
raw_spin_lock(&wq->lock);
if (worker->flags & IO_WORKER_F_FREE)
if (test_bit(IO_WORKER_F_FREE, &worker->flags))
hlist_nulls_del_rcu(&worker->nulls_node);
list_del_rcu(&worker->all_list);
raw_spin_unlock(&wq->lock);
......@@ -410,7 +409,7 @@ static void io_wq_dec_running(struct io_worker *worker)
struct io_wq_acct *acct = io_wq_get_acct(worker);
struct io_wq *wq = worker->wq;
if (!(worker->flags & IO_WORKER_F_UP))
if (!test_bit(IO_WORKER_F_UP, &worker->flags))
return;
if (!atomic_dec_and_test(&acct->nr_running))
......@@ -430,8 +429,8 @@ static void io_wq_dec_running(struct io_worker *worker)
*/
static void __io_worker_busy(struct io_wq *wq, struct io_worker *worker)
{
if (worker->flags & IO_WORKER_F_FREE) {
worker->flags &= ~IO_WORKER_F_FREE;
if (test_bit(IO_WORKER_F_FREE, &worker->flags)) {
clear_bit(IO_WORKER_F_FREE, &worker->flags);
raw_spin_lock(&wq->lock);
hlist_nulls_del_init_rcu(&worker->nulls_node);
raw_spin_unlock(&wq->lock);
......@@ -444,8 +443,8 @@ static void __io_worker_busy(struct io_wq *wq, struct io_worker *worker)
static void __io_worker_idle(struct io_wq *wq, struct io_worker *worker)
__must_hold(wq->lock)
{
if (!(worker->flags & IO_WORKER_F_FREE)) {
worker->flags |= IO_WORKER_F_FREE;
if (!test_bit(IO_WORKER_F_FREE, &worker->flags)) {
set_bit(IO_WORKER_F_FREE, &worker->flags);
hlist_nulls_add_head_rcu(&worker->nulls_node, &wq->free_list);
}
}
......@@ -539,7 +538,6 @@ static void io_assign_current_work(struct io_worker *worker,
raw_spin_lock(&worker->lock);
worker->cur_work = work;
worker->next_work = NULL;
raw_spin_unlock(&worker->lock);
}
......@@ -564,10 +562,7 @@ static void io_worker_handle_work(struct io_wq_acct *acct,
* clear the stalled flag.
*/
work = io_get_next_work(acct, worker);
raw_spin_unlock(&acct->lock);
if (work) {
__io_worker_busy(wq, worker);
/*
* Make sure cancelation can find this, even before
* it becomes the active work. That avoids a window
......@@ -576,11 +571,17 @@ static void io_worker_handle_work(struct io_wq_acct *acct,
* current work item for this worker.
*/
raw_spin_lock(&worker->lock);
worker->next_work = work;
worker->cur_work = work;
raw_spin_unlock(&worker->lock);
} else {
break;
}
raw_spin_unlock(&acct->lock);
if (!work)
break;
__io_worker_busy(wq, worker);
io_assign_current_work(worker, work);
__set_current_state(TASK_RUNNING);
......@@ -631,7 +632,8 @@ static int io_wq_worker(void *data)
bool exit_mask = false, last_timeout = false;
char buf[TASK_COMM_LEN];
worker->flags |= (IO_WORKER_F_UP | IO_WORKER_F_RUNNING);
set_mask_bits(&worker->flags, 0,
BIT(IO_WORKER_F_UP) | BIT(IO_WORKER_F_RUNNING));
snprintf(buf, sizeof(buf), "iou-wrk-%d", wq->task->pid);
set_task_comm(current, buf);
......@@ -695,11 +697,11 @@ void io_wq_worker_running(struct task_struct *tsk)
if (!worker)
return;
if (!(worker->flags & IO_WORKER_F_UP))
if (!test_bit(IO_WORKER_F_UP, &worker->flags))
return;
if (worker->flags & IO_WORKER_F_RUNNING)
if (test_bit(IO_WORKER_F_RUNNING, &worker->flags))
return;
worker->flags |= IO_WORKER_F_RUNNING;
set_bit(IO_WORKER_F_RUNNING, &worker->flags);
io_wq_inc_running(worker);
}
......@@ -713,12 +715,12 @@ void io_wq_worker_sleeping(struct task_struct *tsk)
if (!worker)
return;
if (!(worker->flags & IO_WORKER_F_UP))
if (!test_bit(IO_WORKER_F_UP, &worker->flags))
return;
if (!(worker->flags & IO_WORKER_F_RUNNING))
if (!test_bit(IO_WORKER_F_RUNNING, &worker->flags))
return;
worker->flags &= ~IO_WORKER_F_RUNNING;
clear_bit(IO_WORKER_F_RUNNING, &worker->flags);
io_wq_dec_running(worker);
}
......@@ -732,7 +734,7 @@ static void io_init_new_worker(struct io_wq *wq, struct io_worker *worker,
raw_spin_lock(&wq->lock);
hlist_nulls_add_head_rcu(&worker->nulls_node, &wq->free_list);
list_add_tail_rcu(&worker->all_list, &wq->all_list);
worker->flags |= IO_WORKER_F_FREE;
set_bit(IO_WORKER_F_FREE, &worker->flags);
raw_spin_unlock(&wq->lock);
wake_up_new_task(tsk);
}
......@@ -838,7 +840,7 @@ static bool create_io_worker(struct io_wq *wq, int index)
init_completion(&worker->ref_done);
if (index == IO_WQ_ACCT_BOUND)
worker->flags |= IO_WORKER_F_BOUND;
set_bit(IO_WORKER_F_BOUND, &worker->flags);
tsk = create_io_thread(io_wq_worker, worker, NUMA_NO_NODE);
if (!IS_ERR(tsk)) {
......@@ -924,8 +926,8 @@ static bool io_wq_work_match_item(struct io_wq_work *work, void *data)
void io_wq_enqueue(struct io_wq *wq, struct io_wq_work *work)
{
struct io_wq_acct *acct = io_work_get_acct(wq, work);
unsigned long work_flags = work->flags;
struct io_cb_cancel_data match;
unsigned work_flags = work->flags;
bool do_create;
/*
......@@ -1005,8 +1007,7 @@ static bool io_wq_worker_cancel(struct io_worker *worker, void *data)
* may dereference the passed in work.
*/
raw_spin_lock(&worker->lock);
if (__io_wq_worker_cancel(worker, match, worker->cur_work) ||
__io_wq_worker_cancel(worker, match, worker->next_work))
if (__io_wq_worker_cancel(worker, match, worker->cur_work))
match->nr_running++;
raw_spin_unlock(&worker->lock);
......
......@@ -62,16 +62,12 @@ static inline bool io_should_wake(struct io_wait_queue *iowq)
}
bool io_cqe_cache_refill(struct io_ring_ctx *ctx, bool overflow);
void io_req_cqe_overflow(struct io_kiocb *req);
int io_run_task_work_sig(struct io_ring_ctx *ctx);
void io_req_defer_failed(struct io_kiocb *req, s32 res);
void io_req_complete_post(struct io_kiocb *req, unsigned issue_flags);
bool io_post_aux_cqe(struct io_ring_ctx *ctx, u64 user_data, s32 res, u32 cflags);
bool io_fill_cqe_req_aux(struct io_kiocb *req, bool defer, s32 res, u32 cflags);
bool io_req_post_cqe(struct io_kiocb *req, s32 res, u32 cflags);
void __io_commit_cqring_flush(struct io_ring_ctx *ctx);
struct page **io_pin_pages(unsigned long ubuf, unsigned long len, int *npages);
struct file *io_file_get_normal(struct io_kiocb *req, int fd);
struct file *io_file_get_fixed(struct io_kiocb *req, int fd,
unsigned issue_flags);
......@@ -79,7 +75,6 @@ struct file *io_file_get_fixed(struct io_kiocb *req, int fd,
void __io_req_task_work_add(struct io_kiocb *req, unsigned flags);
bool io_alloc_async_data(struct io_kiocb *req);
void io_req_task_queue(struct io_kiocb *req);
void io_queue_iowq(struct io_kiocb *req, struct io_tw_state *ts_dont_use);
void io_req_task_complete(struct io_kiocb *req, struct io_tw_state *ts);
void io_req_task_queue_fail(struct io_kiocb *req, int ret);
void io_req_task_submit(struct io_kiocb *req, struct io_tw_state *ts);
......@@ -97,7 +92,6 @@ int io_poll_issue(struct io_kiocb *req, struct io_tw_state *ts);
int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr);
int io_do_iopoll(struct io_ring_ctx *ctx, bool force_nonspin);
void __io_submit_flush_completions(struct io_ring_ctx *ctx);
int io_req_prep_async(struct io_kiocb *req);
struct io_wq_work *io_wq_free_work(struct io_wq_work *work);
void io_wq_submit_work(struct io_wq_work *work);
......@@ -110,9 +104,6 @@ bool __io_alloc_req_refill(struct io_ring_ctx *ctx);
bool io_match_task_safe(struct io_kiocb *head, struct task_struct *task,
bool cancel_all);
void *io_mem_alloc(size_t size);
void io_mem_free(void *ptr);
enum {
IO_EVENTFD_OP_SIGNAL_BIT,
IO_EVENTFD_OP_FREE_BIT,
......@@ -121,9 +112,9 @@ enum {
void io_eventfd_ops(struct rcu_head *rcu);
void io_activate_pollwq(struct io_ring_ctx *ctx);
#if defined(CONFIG_PROVE_LOCKING)
static inline void io_lockdep_assert_cq_locked(struct io_ring_ctx *ctx)
{
#if defined(CONFIG_PROVE_LOCKING)
lockdep_assert(in_task());
if (ctx->flags & IORING_SETUP_IOPOLL) {
......@@ -142,18 +133,21 @@ static inline void io_lockdep_assert_cq_locked(struct io_ring_ctx *ctx)
else
lockdep_assert(current == ctx->submitter_task);
}
}
#else
static inline void io_lockdep_assert_cq_locked(struct io_ring_ctx *ctx)
{
}
#endif
}
static inline void io_req_task_work_add(struct io_kiocb *req)
{
__io_req_task_work_add(req, 0);
}
static inline void io_submit_flush_completions(struct io_ring_ctx *ctx)
{
if (!wq_list_empty(&ctx->submit_state.compl_reqs) ||
ctx->submit_state.cq_flush)
__io_submit_flush_completions(ctx);
}
#define io_for_each_link(pos, head) \
for (pos = (head); pos; pos = pos->link)
......@@ -340,15 +334,12 @@ static inline int io_run_task_work(void)
static inline bool io_task_work_pending(struct io_ring_ctx *ctx)
{
return task_work_pending(current) || !wq_list_empty(&ctx->work_llist);
return task_work_pending(current) || !llist_empty(&ctx->work_llist);
}
static inline void io_tw_lock(struct io_ring_ctx *ctx, struct io_tw_state *ts)
{
if (!ts->locked) {
mutex_lock(&ctx->uring_lock);
ts->locked = true;
}
lockdep_assert_held(&ctx->uring_lock);
}
/*
......
......@@ -41,8 +41,26 @@ struct io_buffer {
__u16 bgid;
};
enum {
/* can alloc a bigger vec */
KBUF_MODE_EXPAND = 1,
/* if bigger vec allocated, free old one */
KBUF_MODE_FREE = 2,
};
struct buf_sel_arg {
struct iovec *iovs;
size_t out_len;
size_t max_len;
int nr_iovs;
int mode;
};
void __user *io_buffer_select(struct io_kiocb *req, size_t *len,
unsigned int issue_flags);
int io_buffers_select(struct io_kiocb *req, struct buf_sel_arg *arg,
unsigned int issue_flags);
int io_buffers_peek(struct io_kiocb *req, struct buf_sel_arg *arg);
void io_destroy_buffers(struct io_ring_ctx *ctx);
int io_remove_buffers_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe);
......@@ -55,8 +73,6 @@ int io_register_pbuf_ring(struct io_ring_ctx *ctx, void __user *arg);
int io_unregister_pbuf_ring(struct io_ring_ctx *ctx, void __user *arg);
int io_register_pbuf_status(struct io_ring_ctx *ctx, void __user *arg);
void io_kbuf_mmap_list_free(struct io_ring_ctx *ctx);
void __io_put_kbuf(struct io_kiocb *req, unsigned issue_flags);
bool io_kbuf_recycle_legacy(struct io_kiocb *req, unsigned issue_flags);
......@@ -64,6 +80,7 @@ bool io_kbuf_recycle_legacy(struct io_kiocb *req, unsigned issue_flags);
void io_put_bl(struct io_ring_ctx *ctx, struct io_buffer_list *bl);
struct io_buffer_list *io_pbuf_get_bl(struct io_ring_ctx *ctx,
unsigned long bgid);
int io_pbuf_mmap(struct file *file, struct vm_area_struct *vma);
static inline bool io_kbuf_recycle_ring(struct io_kiocb *req)
{
......@@ -76,7 +93,7 @@ static inline bool io_kbuf_recycle_ring(struct io_kiocb *req)
*/
if (req->buf_list) {
req->buf_index = req->buf_list->bgid;
req->flags &= ~REQ_F_BUFFER_RING;
req->flags &= ~(REQ_F_BUFFER_RING|REQ_F_BUFFERS_COMMIT);
return true;
}
return false;
......@@ -100,11 +117,16 @@ static inline bool io_kbuf_recycle(struct io_kiocb *req, unsigned issue_flags)
return false;
}
static inline void __io_put_kbuf_ring(struct io_kiocb *req)
static inline void __io_put_kbuf_ring(struct io_kiocb *req, int nr)
{
if (req->buf_list) {
req->buf_index = req->buf_list->bgid;
req->buf_list->head++;
struct io_buffer_list *bl = req->buf_list;
if (bl) {
if (req->flags & REQ_F_BUFFERS_COMMIT) {
bl->head += nr;
req->flags &= ~REQ_F_BUFFERS_COMMIT;
}
req->buf_index = bl->bgid;
}
req->flags &= ~REQ_F_BUFFER_RING;
}
......@@ -113,7 +135,7 @@ static inline void __io_put_kbuf_list(struct io_kiocb *req,
struct list_head *list)
{
if (req->flags & REQ_F_BUFFER_RING) {
__io_put_kbuf_ring(req);
__io_put_kbuf_ring(req, 1);
} else {
req->buf_index = req->kbuf->bgid;
list_add(&req->kbuf->list, list);
......@@ -121,22 +143,18 @@ static inline void __io_put_kbuf_list(struct io_kiocb *req,
}
}
static inline unsigned int io_put_kbuf_comp(struct io_kiocb *req)
static inline void io_kbuf_drop(struct io_kiocb *req)
{
unsigned int ret;
lockdep_assert_held(&req->ctx->completion_lock);
if (!(req->flags & (REQ_F_BUFFER_SELECTED|REQ_F_BUFFER_RING)))
return 0;
return;
ret = IORING_CQE_F_BUFFER | (req->buf_index << IORING_CQE_BUFFER_SHIFT);
__io_put_kbuf_list(req, &req->ctx->io_buffers_comp);
return ret;
}
static inline unsigned int io_put_kbuf(struct io_kiocb *req,
unsigned issue_flags)
static inline unsigned int __io_put_kbufs(struct io_kiocb *req, int nbufs,
unsigned issue_flags)
{
unsigned int ret;
......@@ -145,9 +163,21 @@ static inline unsigned int io_put_kbuf(struct io_kiocb *req,
ret = IORING_CQE_F_BUFFER | (req->buf_index << IORING_CQE_BUFFER_SHIFT);
if (req->flags & REQ_F_BUFFER_RING)
__io_put_kbuf_ring(req);
__io_put_kbuf_ring(req, nbufs);
else
__io_put_kbuf(req, issue_flags);
return ret;
}
static inline unsigned int io_put_kbuf(struct io_kiocb *req,
unsigned issue_flags)
{
return __io_put_kbufs(req, 1, issue_flags);
}
static inline unsigned int io_put_kbufs(struct io_kiocb *req, int nbufs,
unsigned issue_flags)
{
return __io_put_kbufs(req, nbufs, issue_flags);
}
#endif
// SPDX-License-Identifier: GPL-2.0
#include <linux/kernel.h>
#include <linux/init.h>
#include <linux/errno.h>
#include <linux/mm.h>
#include <linux/mman.h>
#include <linux/slab.h>
#include <linux/vmalloc.h>
#include <linux/io_uring.h>
#include <linux/io_uring_types.h>
#include <asm/shmparam.h>
#include "memmap.h"
#include "kbuf.h"
static void *io_mem_alloc_compound(struct page **pages, int nr_pages,
size_t size, gfp_t gfp)
{
struct page *page;
int i, order;
order = get_order(size);
if (order > MAX_PAGE_ORDER)
return ERR_PTR(-ENOMEM);
else if (order)
gfp |= __GFP_COMP;
page = alloc_pages(gfp, order);
if (!page)
return ERR_PTR(-ENOMEM);
for (i = 0; i < nr_pages; i++)
pages[i] = page + i;
return page_address(page);
}
static void *io_mem_alloc_single(struct page **pages, int nr_pages, size_t size,
gfp_t gfp)
{
void *ret;
int i;
for (i = 0; i < nr_pages; i++) {
pages[i] = alloc_page(gfp);
if (!pages[i])
goto err;
}
ret = vmap(pages, nr_pages, VM_MAP, PAGE_KERNEL);
if (ret)
return ret;
err:
while (i--)
put_page(pages[i]);
return ERR_PTR(-ENOMEM);
}
void *io_pages_map(struct page ***out_pages, unsigned short *npages,
size_t size)
{
gfp_t gfp = GFP_KERNEL_ACCOUNT | __GFP_ZERO | __GFP_NOWARN;
struct page **pages;
int nr_pages;
void *ret;
nr_pages = (size + PAGE_SIZE - 1) >> PAGE_SHIFT;
pages = kvmalloc_array(nr_pages, sizeof(struct page *), gfp);
if (!pages)
return ERR_PTR(-ENOMEM);
ret = io_mem_alloc_compound(pages, nr_pages, size, gfp);
if (!IS_ERR(ret))
goto done;
ret = io_mem_alloc_single(pages, nr_pages, size, gfp);
if (!IS_ERR(ret)) {
done:
*out_pages = pages;
*npages = nr_pages;
return ret;
}
kvfree(pages);
*out_pages = NULL;
*npages = 0;
return ret;
}
void io_pages_unmap(void *ptr, struct page ***pages, unsigned short *npages,
bool put_pages)
{
bool do_vunmap = false;
if (!ptr)
return;
if (put_pages && *npages) {
struct page **to_free = *pages;
int i;
/*
* Only did vmap for the non-compound multiple page case.
* For the compound page, we just need to put the head.
*/
if (PageCompound(to_free[0]))
*npages = 1;
else if (*npages > 1)
do_vunmap = true;
for (i = 0; i < *npages; i++)
put_page(to_free[i]);
}
if (do_vunmap)
vunmap(ptr);
kvfree(*pages);
*pages = NULL;
*npages = 0;
}
void io_pages_free(struct page ***pages, int npages)
{
struct page **page_array = *pages;
if (!page_array)
return;
unpin_user_pages(page_array, npages);
kvfree(page_array);
*pages = NULL;
}
struct page **io_pin_pages(unsigned long uaddr, unsigned long len, int *npages)
{
unsigned long start, end, nr_pages;
struct page **pages;
int ret;
end = (uaddr + len + PAGE_SIZE - 1) >> PAGE_SHIFT;
start = uaddr >> PAGE_SHIFT;
nr_pages = end - start;
if (WARN_ON_ONCE(!nr_pages))
return ERR_PTR(-EINVAL);
pages = kvmalloc_array(nr_pages, sizeof(struct page *), GFP_KERNEL);
if (!pages)
return ERR_PTR(-ENOMEM);
ret = pin_user_pages_fast(uaddr, nr_pages, FOLL_WRITE | FOLL_LONGTERM,
pages);
/* success, mapped all pages */
if (ret == nr_pages) {
*npages = nr_pages;
return pages;
}
/* partial map, or didn't map anything */
if (ret >= 0) {
/* if we did partial map, release any pages we did get */
if (ret)
unpin_user_pages(pages, ret);
ret = -EFAULT;
}
kvfree(pages);
return ERR_PTR(ret);
}
void *__io_uaddr_map(struct page ***pages, unsigned short *npages,
unsigned long uaddr, size_t size)
{
struct page **page_array;
unsigned int nr_pages;
void *page_addr;
*npages = 0;
if (uaddr & (PAGE_SIZE - 1) || !size)
return ERR_PTR(-EINVAL);
nr_pages = 0;
page_array = io_pin_pages(uaddr, size, &nr_pages);
if (IS_ERR(page_array))
return page_array;
page_addr = vmap(page_array, nr_pages, VM_MAP, PAGE_KERNEL);
if (page_addr) {
*pages = page_array;
*npages = nr_pages;
return page_addr;
}
io_pages_free(&page_array, nr_pages);
return ERR_PTR(-ENOMEM);
}
static void *io_uring_validate_mmap_request(struct file *file, loff_t pgoff,
size_t sz)
{
struct io_ring_ctx *ctx = file->private_data;
loff_t offset = pgoff << PAGE_SHIFT;
switch ((pgoff << PAGE_SHIFT) & IORING_OFF_MMAP_MASK) {
case IORING_OFF_SQ_RING:
case IORING_OFF_CQ_RING:
/* Don't allow mmap if the ring was setup without it */
if (ctx->flags & IORING_SETUP_NO_MMAP)
return ERR_PTR(-EINVAL);
return ctx->rings;
case IORING_OFF_SQES:
/* Don't allow mmap if the ring was setup without it */
if (ctx->flags & IORING_SETUP_NO_MMAP)
return ERR_PTR(-EINVAL);
return ctx->sq_sqes;
case IORING_OFF_PBUF_RING: {
struct io_buffer_list *bl;
unsigned int bgid;
void *ptr;
bgid = (offset & ~IORING_OFF_MMAP_MASK) >> IORING_OFF_PBUF_SHIFT;
bl = io_pbuf_get_bl(ctx, bgid);
if (IS_ERR(bl))
return bl;
ptr = bl->buf_ring;
io_put_bl(ctx, bl);
return ptr;
}
}
return ERR_PTR(-EINVAL);
}
int io_uring_mmap_pages(struct io_ring_ctx *ctx, struct vm_area_struct *vma,
struct page **pages, int npages)
{
unsigned long nr_pages = npages;
vm_flags_set(vma, VM_DONTEXPAND);
return vm_insert_pages(vma, vma->vm_start, pages, &nr_pages);
}
#ifdef CONFIG_MMU
__cold int io_uring_mmap(struct file *file, struct vm_area_struct *vma)
{
struct io_ring_ctx *ctx = file->private_data;
size_t sz = vma->vm_end - vma->vm_start;
long offset = vma->vm_pgoff << PAGE_SHIFT;
void *ptr;
ptr = io_uring_validate_mmap_request(file, vma->vm_pgoff, sz);
if (IS_ERR(ptr))
return PTR_ERR(ptr);
switch (offset & IORING_OFF_MMAP_MASK) {
case IORING_OFF_SQ_RING:
case IORING_OFF_CQ_RING:
return io_uring_mmap_pages(ctx, vma, ctx->ring_pages,
ctx->n_ring_pages);
case IORING_OFF_SQES:
return io_uring_mmap_pages(ctx, vma, ctx->sqe_pages,
ctx->n_sqe_pages);
case IORING_OFF_PBUF_RING:
return io_pbuf_mmap(file, vma);
}
return -EINVAL;
}
unsigned long io_uring_get_unmapped_area(struct file *filp, unsigned long addr,
unsigned long len, unsigned long pgoff,
unsigned long flags)
{
void *ptr;
/*
* Do not allow to map to user-provided address to avoid breaking the
* aliasing rules. Userspace is not able to guess the offset address of
* kernel kmalloc()ed memory area.
*/
if (addr)
return -EINVAL;
ptr = io_uring_validate_mmap_request(filp, pgoff, len);
if (IS_ERR(ptr))
return -ENOMEM;
/*
* Some architectures have strong cache aliasing requirements.
* For such architectures we need a coherent mapping which aliases
* kernel memory *and* userspace memory. To achieve that:
* - use a NULL file pointer to reference physical memory, and
* - use the kernel virtual address of the shared io_uring context
* (instead of the userspace-provided address, which has to be 0UL
* anyway).
* - use the same pgoff which the get_unmapped_area() uses to
* calculate the page colouring.
* For architectures without such aliasing requirements, the
* architecture will return any suitable mapping because addr is 0.
*/
filp = NULL;
flags |= MAP_SHARED;
pgoff = 0; /* has been translated to ptr above */
#ifdef SHM_COLOUR
addr = (uintptr_t) ptr;
pgoff = addr >> PAGE_SHIFT;
#else
addr = 0UL;
#endif
return current->mm->get_unmapped_area(filp, addr, len, pgoff, flags);
}
#else /* !CONFIG_MMU */
int io_uring_mmap(struct file *file, struct vm_area_struct *vma)
{
return is_nommu_shared_mapping(vma->vm_flags) ? 0 : -EINVAL;
}
unsigned int io_uring_nommu_mmap_capabilities(struct file *file)
{
return NOMMU_MAP_DIRECT | NOMMU_MAP_READ | NOMMU_MAP_WRITE;
}
unsigned long io_uring_get_unmapped_area(struct file *file, unsigned long addr,
unsigned long len, unsigned long pgoff,
unsigned long flags)
{
void *ptr;
ptr = io_uring_validate_mmap_request(file, pgoff, len);
if (IS_ERR(ptr))
return PTR_ERR(ptr);
return (unsigned long) ptr;
}
#endif /* !CONFIG_MMU */
#ifndef IO_URING_MEMMAP_H
#define IO_URING_MEMMAP_H
struct page **io_pin_pages(unsigned long ubuf, unsigned long len, int *npages);
void io_pages_free(struct page ***pages, int npages);
int io_uring_mmap_pages(struct io_ring_ctx *ctx, struct vm_area_struct *vma,
struct page **pages, int npages);
void *io_pages_map(struct page ***out_pages, unsigned short *npages,
size_t size);
void io_pages_unmap(void *ptr, struct page ***pages, unsigned short *npages,
bool put_pages);
void *__io_uaddr_map(struct page ***pages, unsigned short *npages,
unsigned long uaddr, size_t size);
#ifndef CONFIG_MMU
unsigned int io_uring_nommu_mmap_capabilities(struct file *file);
#endif
unsigned long io_uring_get_unmapped_area(struct file *file, unsigned long addr,
unsigned long len, unsigned long pgoff,
unsigned long flags);
int io_uring_mmap(struct file *file, struct vm_area_struct *vma);
#endif
......@@ -83,7 +83,7 @@ static int io_msg_exec_remote(struct io_kiocb *req, task_work_func_t func)
return -EOWNERDEAD;
init_task_work(&msg->tw, func);
if (task_work_add(ctx->submitter_task, &msg->tw, TWA_SIGNAL))
if (task_work_add(task, &msg->tw, TWA_SIGNAL))
return -EOWNERDEAD;
return IOU_ISSUE_SKIP_COMPLETE;
......@@ -147,13 +147,11 @@ static int io_msg_ring_data(struct io_kiocb *req, unsigned int issue_flags)
if (target_ctx->flags & IORING_SETUP_IOPOLL) {
if (unlikely(io_double_lock_ctx(target_ctx, issue_flags)))
return -EAGAIN;
if (io_post_aux_cqe(target_ctx, msg->user_data, msg->len, flags))
ret = 0;
io_double_unlock_ctx(target_ctx);
} else {
if (io_post_aux_cqe(target_ctx, msg->user_data, msg->len, flags))
ret = 0;
}
if (io_post_aux_cqe(target_ctx, msg->user_data, msg->len, flags))
ret = 0;
if (target_ctx->flags & IORING_SETUP_IOPOLL)
io_double_unlock_ctx(target_ctx);
return ret;
}
......
......@@ -3,22 +3,15 @@
#include <linux/net.h>
#include <linux/uio.h>
#include "alloc_cache.h"
struct io_async_msghdr {
#if defined(CONFIG_NET)
union {
struct iovec fast_iov[UIO_FASTIOV];
struct {
struct iovec fast_iov_one;
__kernel_size_t controllen;
int namelen;
__kernel_size_t payloadlen;
};
struct io_cache_entry cache;
};
struct iovec fast_iov;
/* points to an allocated iov, if NULL we use fast_iov instead */
struct iovec *free_iov;
int free_iov_nr;
int namelen;
__kernel_size_t controllen;
__kernel_size_t payloadlen;
struct sockaddr __user *uaddr;
struct msghdr msg;
struct sockaddr_storage addr;
......@@ -27,22 +20,15 @@ struct io_async_msghdr {
#if defined(CONFIG_NET)
struct io_async_connect {
struct sockaddr_storage address;
};
int io_shutdown_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe);
int io_shutdown(struct io_kiocb *req, unsigned int issue_flags);
int io_sendmsg_prep_async(struct io_kiocb *req);
void io_sendmsg_recvmsg_cleanup(struct io_kiocb *req);
int io_sendmsg_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe);
int io_sendmsg(struct io_kiocb *req, unsigned int issue_flags);
int io_send(struct io_kiocb *req, unsigned int issue_flags);
int io_send_prep_async(struct io_kiocb *req);
int io_recvmsg_prep_async(struct io_kiocb *req);
int io_recvmsg_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe);
int io_recvmsg(struct io_kiocb *req, unsigned int issue_flags);
int io_recv(struct io_kiocb *req, unsigned int issue_flags);
......@@ -55,7 +41,6 @@ int io_accept(struct io_kiocb *req, unsigned int issue_flags);
int io_socket_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe);
int io_socket(struct io_kiocb *req, unsigned int issue_flags);
int io_connect_prep_async(struct io_kiocb *req);
int io_connect_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe);
int io_connect(struct io_kiocb *req, unsigned int issue_flags);
......@@ -64,9 +49,9 @@ int io_sendmsg_zc(struct io_kiocb *req, unsigned int issue_flags);
int io_send_zc_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe);
void io_send_zc_cleanup(struct io_kiocb *req);
void io_netmsg_cache_free(struct io_cache_entry *entry);
void io_netmsg_cache_free(const void *entry);
#else
static inline void io_netmsg_cache_free(struct io_cache_entry *entry)
static inline void io_netmsg_cache_free(const void *entry)
{
}
#endif
......@@ -10,16 +10,34 @@
#include "io_uring.h"
#include "nop.h"
struct io_nop {
/* NOTE: kiocb has the file as the first member, so don't do it here */
struct file *file;
int result;
};
int io_nop_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
{
unsigned int flags;
struct io_nop *nop = io_kiocb_to_cmd(req, struct io_nop);
flags = READ_ONCE(sqe->nop_flags);
if (flags & ~IORING_NOP_INJECT_RESULT)
return -EINVAL;
if (flags & IORING_NOP_INJECT_RESULT)
nop->result = READ_ONCE(sqe->len);
else
nop->result = 0;
return 0;
}
/*
* IORING_OP_NOP just posts a completion event, nothing else.
*/
int io_nop(struct io_kiocb *req, unsigned int issue_flags)
{
io_req_set_res(req, 0, 0);
struct io_nop *nop = io_kiocb_to_cmd(req, struct io_nop);
if (nop->result < 0)
req_set_fail(req);
io_req_set_res(req, nop->result, 0);
return IOU_OK;
}
......@@ -9,35 +9,36 @@
#include "notif.h"
#include "rsrc.h"
static void io_notif_complete_tw_ext(struct io_kiocb *notif, struct io_tw_state *ts)
static const struct ubuf_info_ops io_ubuf_ops;
static void io_notif_tw_complete(struct io_kiocb *notif, struct io_tw_state *ts)
{
struct io_notif_data *nd = io_notif_to_data(notif);
struct io_ring_ctx *ctx = notif->ctx;
if (nd->zc_report && (nd->zc_copied || !nd->zc_used))
notif->cqe.res |= IORING_NOTIF_USAGE_ZC_COPIED;
do {
notif = cmd_to_io_kiocb(nd);
if (nd->account_pages && ctx->user) {
__io_unaccount_mem(ctx->user, nd->account_pages);
nd->account_pages = 0;
}
io_req_task_complete(notif, ts);
}
lockdep_assert(refcount_read(&nd->uarg.refcnt) == 0);
static void io_tx_ubuf_callback(struct sk_buff *skb, struct ubuf_info *uarg,
bool success)
{
struct io_notif_data *nd = container_of(uarg, struct io_notif_data, uarg);
struct io_kiocb *notif = cmd_to_io_kiocb(nd);
if (unlikely(nd->zc_report) && (nd->zc_copied || !nd->zc_used))
notif->cqe.res |= IORING_NOTIF_USAGE_ZC_COPIED;
if (nd->account_pages && notif->ctx->user) {
__io_unaccount_mem(notif->ctx->user, nd->account_pages);
nd->account_pages = 0;
}
if (refcount_dec_and_test(&uarg->refcnt))
__io_req_task_work_add(notif, IOU_F_TWQ_LAZY_WAKE);
nd = nd->next;
io_req_task_complete(notif, ts);
} while (nd);
}
static void io_tx_ubuf_callback_ext(struct sk_buff *skb, struct ubuf_info *uarg,
bool success)
void io_tx_ubuf_complete(struct sk_buff *skb, struct ubuf_info *uarg,
bool success)
{
struct io_notif_data *nd = container_of(uarg, struct io_notif_data, uarg);
struct io_kiocb *notif = cmd_to_io_kiocb(nd);
unsigned tw_flags;
if (nd->zc_report) {
if (success && !nd->zc_used && skb)
......@@ -45,23 +46,64 @@ static void io_tx_ubuf_callback_ext(struct sk_buff *skb, struct ubuf_info *uarg,
else if (!success && !nd->zc_copied)
WRITE_ONCE(nd->zc_copied, true);
}
io_tx_ubuf_callback(skb, uarg, success);
if (!refcount_dec_and_test(&uarg->refcnt))
return;
if (nd->head != nd) {
io_tx_ubuf_complete(skb, &nd->head->uarg, success);
return;
}
tw_flags = nd->next ? 0 : IOU_F_TWQ_LAZY_WAKE;
notif->io_task_work.func = io_notif_tw_complete;
__io_req_task_work_add(notif, tw_flags);
}
void io_notif_set_extended(struct io_kiocb *notif)
static int io_link_skb(struct sk_buff *skb, struct ubuf_info *uarg)
{
struct io_notif_data *nd = io_notif_to_data(notif);
struct io_notif_data *nd, *prev_nd;
struct io_kiocb *prev_notif, *notif;
struct ubuf_info *prev_uarg = skb_zcopy(skb);
if (nd->uarg.callback != io_tx_ubuf_callback_ext) {
nd->account_pages = 0;
nd->zc_report = false;
nd->zc_used = false;
nd->zc_copied = false;
nd->uarg.callback = io_tx_ubuf_callback_ext;
notif->io_task_work.func = io_notif_complete_tw_ext;
nd = container_of(uarg, struct io_notif_data, uarg);
notif = cmd_to_io_kiocb(nd);
if (!prev_uarg) {
net_zcopy_get(&nd->uarg);
skb_zcopy_init(skb, &nd->uarg);
return 0;
}
/* handle it separately as we can't link a notif to itself */
if (unlikely(prev_uarg == &nd->uarg))
return 0;
/* we can't join two links together, just request a fresh skb */
if (unlikely(nd->head != nd || nd->next))
return -EEXIST;
/* don't mix zc providers */
if (unlikely(prev_uarg->ops != &io_ubuf_ops))
return -EEXIST;
prev_nd = container_of(prev_uarg, struct io_notif_data, uarg);
prev_notif = cmd_to_io_kiocb(nd);
/* make sure all noifications can be finished in the same task_work */
if (unlikely(notif->ctx != prev_notif->ctx ||
notif->task != prev_notif->task))
return -EEXIST;
nd->head = prev_nd->head;
nd->next = prev_nd->next;
prev_nd->next = nd;
net_zcopy_get(&nd->head->uarg);
return 0;
}
static const struct ubuf_info_ops io_ubuf_ops = {
.complete = io_tx_ubuf_complete,
.link_skb = io_link_skb,
};
struct io_kiocb *io_alloc_notif(struct io_ring_ctx *ctx)
__must_hold(&ctx->uring_lock)
{
......@@ -76,11 +118,15 @@ struct io_kiocb *io_alloc_notif(struct io_ring_ctx *ctx)
notif->task = current;
io_get_task_refs(1);
notif->rsrc_node = NULL;
notif->io_task_work.func = io_req_task_complete;
nd = io_notif_to_data(notif);
nd->zc_report = false;
nd->account_pages = 0;
nd->next = NULL;
nd->head = nd;
nd->uarg.flags = IO_NOTIF_UBUF_FLAGS;
nd->uarg.callback = io_tx_ubuf_callback;
nd->uarg.ops = &io_ubuf_ops;
refcount_set(&nd->uarg.refcnt, 1);
return notif;
}
......@@ -13,14 +13,19 @@
struct io_notif_data {
struct file *file;
struct ubuf_info uarg;
unsigned long account_pages;
struct io_notif_data *next;
struct io_notif_data *head;
unsigned account_pages;
bool zc_report;
bool zc_used;
bool zc_copied;
};
struct io_kiocb *io_alloc_notif(struct io_ring_ctx *ctx);
void io_notif_set_extended(struct io_kiocb *notif);
void io_tx_ubuf_complete(struct sk_buff *skb, struct ubuf_info *uarg,
bool success);
static inline struct io_notif_data *io_notif_to_data(struct io_kiocb *notif)
{
......@@ -32,9 +37,7 @@ static inline void io_notif_flush(struct io_kiocb *notif)
{
struct io_notif_data *nd = io_notif_to_data(notif);
io_tx_ubuf_complete(NULL, &nd->uarg, true);
}
static inline int io_notif_account_mem(struct io_kiocb *notif, unsigned len)
......
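From userspace, the visible effect of this notification machinery is the extra CQE a zero-copy send produces. A rough liburing sketch follows, assuming an already-connected socket and a caller-provided buffer; error handling is trimmed and the function name is arbitrary.

/*
 * A zero-copy send posts two CQEs: the usual completion (flagged
 * IORING_CQE_F_MORE) and, later, an IORING_CQE_F_NOTIF CQE once the
 * kernel is done with the buffer.
 */
#include <liburing.h>
#include <stdio.h>

int send_zc_once(int sockfd, const void *buf, size_t len)
{
	struct io_uring ring;
	struct io_uring_sqe *sqe;
	struct io_uring_cqe *cqe;
	int ret;

	ret = io_uring_queue_init(8, &ring, 0);
	if (ret < 0)
		return ret;

	sqe = io_uring_get_sqe(&ring);
	io_uring_prep_send_zc(sqe, sockfd, buf, len, 0, 0);
	io_uring_submit(&ring);

	/* first CQE: send result, IORING_CQE_F_MORE says a notif will follow */
	ret = io_uring_wait_cqe(&ring, &cqe);
	if (ret == 0) {
		printf("send res=%d more=%d\n", cqe->res,
		       !!(cqe->flags & IORING_CQE_F_MORE));
		io_uring_cqe_seen(&ring, cqe);
	}

	/* second CQE: the notification; the buffer may be reused after this */
	ret = io_uring_wait_cqe(&ring, &cqe);
	if (ret == 0) {
		printf("notif=%d\n", !!(cqe->flags & IORING_CQE_F_NOTIF));
		io_uring_cqe_seen(&ring, cqe);
	}

	io_uring_queue_exit(&ring);
	return 0;
}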
......@@ -67,7 +67,8 @@ const struct io_issue_def io_issue_defs[] = {
.iopoll = 1,
.iopoll_queue = 1,
.vectored = 1,
.async_size = sizeof(struct io_async_rw),
.prep = io_prep_readv,
.issue = io_read,
},
[IORING_OP_WRITEV] = {
......@@ -81,7 +82,8 @@ const struct io_issue_def io_issue_defs[] = {
.iopoll = 1,
.iopoll_queue = 1,
.vectored = 1,
.async_size = sizeof(struct io_async_rw),
.prep = io_prep_writev,
.issue = io_write,
},
[IORING_OP_FSYNC] = {
......@@ -99,7 +101,8 @@ const struct io_issue_def io_issue_defs[] = {
.ioprio = 1,
.iopoll = 1,
.iopoll_queue = 1,
.async_size = sizeof(struct io_async_rw),
.prep = io_prep_read_fixed,
.issue = io_read,
},
[IORING_OP_WRITE_FIXED] = {
......@@ -112,7 +115,8 @@ const struct io_issue_def io_issue_defs[] = {
.ioprio = 1,
.iopoll = 1,
.iopoll_queue = 1,
.async_size = sizeof(struct io_async_rw),
.prep = io_prep_write_fixed,
.issue = io_write,
},
[IORING_OP_POLL_ADD] = {
......@@ -138,8 +142,8 @@ const struct io_issue_def io_issue_defs[] = {
.unbound_nonreg_file = 1,
.pollout = 1,
.ioprio = 1,
#if defined(CONFIG_NET)
.async_size = sizeof(struct io_async_msghdr),
.prep = io_sendmsg_prep,
.issue = io_sendmsg,
#else
......@@ -152,8 +156,8 @@ const struct io_issue_def io_issue_defs[] = {
.pollin = 1,
.buffer_select = 1,
.ioprio = 1,
#if defined(CONFIG_NET)
.async_size = sizeof(struct io_async_msghdr),
.prep = io_recvmsg_prep,
.issue = io_recvmsg,
#else
......@@ -162,6 +166,7 @@ const struct io_issue_def io_issue_defs[] = {
},
[IORING_OP_TIMEOUT] = {
.audit_skip = 1,
.async_size = sizeof(struct io_timeout_data),
.prep = io_timeout_prep,
.issue = io_timeout,
},
......@@ -191,6 +196,7 @@ const struct io_issue_def io_issue_defs[] = {
},
[IORING_OP_LINK_TIMEOUT] = {
.audit_skip = 1,
.async_size = sizeof(struct io_timeout_data),
.prep = io_link_timeout_prep,
.issue = io_no_issue,
},
......@@ -199,6 +205,7 @@ const struct io_issue_def io_issue_defs[] = {
.unbound_nonreg_file = 1,
.pollout = 1,
#if defined(CONFIG_NET)
.async_size = sizeof(struct io_async_msghdr),
.prep = io_connect_prep,
.issue = io_connect,
#else
......@@ -239,7 +246,8 @@ const struct io_issue_def io_issue_defs[] = {
.ioprio = 1,
.iopoll = 1,
.iopoll_queue = 1,
.async_size = sizeof(struct io_async_rw),
.prep = io_prep_read,
.issue = io_read,
},
[IORING_OP_WRITE] = {
......@@ -252,7 +260,8 @@ const struct io_issue_def io_issue_defs[] = {
.ioprio = 1,
.iopoll = 1,
.iopoll_queue = 1,
.async_size = sizeof(struct io_async_rw),
.prep = io_prep_write,
.issue = io_write,
},
[IORING_OP_FADVISE] = {
......@@ -272,8 +281,9 @@ const struct io_issue_def io_issue_defs[] = {
.pollout = 1,
.audit_skip = 1,
.ioprio = 1,
.buffer_select = 1,
#if defined(CONFIG_NET)
.async_size = sizeof(struct io_async_msghdr),
.prep = io_sendmsg_prep,
.issue = io_send,
#else
......@@ -288,6 +298,7 @@ const struct io_issue_def io_issue_defs[] = {
.audit_skip = 1,
.ioprio = 1,
#if defined(CONFIG_NET)
.async_size = sizeof(struct io_async_msghdr),
.prep = io_recvmsg_prep,
.issue = io_recv,
#else
......@@ -403,6 +414,7 @@ const struct io_issue_def io_issue_defs[] = {
.plug = 1,
.iopoll = 1,
.iopoll_queue = 1,
.async_size = 2 * sizeof(struct io_uring_sqe),
.prep = io_uring_cmd_prep,
.issue = io_uring_cmd,
},
......@@ -412,8 +424,8 @@ const struct io_issue_def io_issue_defs[] = {
.pollout = 1,
.audit_skip = 1,
.ioprio = 1,
#if defined(CONFIG_NET)
.async_size = sizeof(struct io_async_msghdr),
.prep = io_send_zc_prep,
.issue = io_send_zc,
#else
......@@ -425,8 +437,8 @@ const struct io_issue_def io_issue_defs[] = {
.unbound_nonreg_file = 1,
.pollout = 1,
.ioprio = 1,
#if defined(CONFIG_NET)
.async_size = sizeof(struct io_async_msghdr),
.prep = io_send_zc_prep,
.issue = io_sendmsg_zc,
#else
......@@ -439,10 +451,12 @@ const struct io_issue_def io_issue_defs[] = {
.pollin = 1,
.buffer_select = 1,
.audit_skip = 1,
.async_size = sizeof(struct io_async_rw),
.prep = io_read_mshot_prep,
.issue = io_read_mshot,
},
[IORING_OP_WAITID] = {
.async_size = sizeof(struct io_waitid_async),
.prep = io_waitid_prep,
.issue = io_waitid,
},
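As a userspace counterpoint to the READ_FIXED/WRITE_FIXED entries in the table above, here is a rough liburing sketch of registering a buffer and issuing a fixed read. The path, buffer size, and most error handling are arbitrary example choices, not anything mandated by the kernel side.

#include <liburing.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/uio.h>

int read_fixed_example(const char *path)
{
	struct io_uring ring;
	struct io_uring_sqe *sqe;
	struct io_uring_cqe *cqe;
	struct iovec iov;
	int fd, ret;

	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;

	iov.iov_len = 4096;
	iov.iov_base = malloc(iov.iov_len);

	io_uring_queue_init(8, &ring, 0);
	/* register the buffer so the kernel can map and pin it once, up front */
	io_uring_register_buffers(&ring, &iov, 1);

	sqe = io_uring_get_sqe(&ring);
	/* buf_index 0 refers to the buffer registered above */
	io_uring_prep_read_fixed(sqe, fd, iov.iov_base, iov.iov_len, 0, 0);
	io_uring_submit(&ring);

	ret = io_uring_wait_cqe(&ring, &cqe);
	if (ret == 0) {
		printf("read %d bytes\n", cqe->res);
		io_uring_cqe_seen(&ring, cqe);
	}

	io_uring_unregister_buffers(&ring);
	io_uring_queue_exit(&ring);
	free(iov.iov_base);
	close(fd);
	return ret;
}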
......@@ -488,16 +502,12 @@ const struct io_cold_def io_cold_defs[] = {
.name = "NOP",
},
[IORING_OP_READV] = {
.name = "READV",
.cleanup = io_readv_writev_cleanup,
.fail = io_rw_fail,
},
[IORING_OP_WRITEV] = {
.name = "WRITEV",
.cleanup = io_readv_writev_cleanup,
.fail = io_rw_fail,
},
......@@ -505,12 +515,10 @@ const struct io_cold_def io_cold_defs[] = {
.name = "FSYNC",
},
[IORING_OP_READ_FIXED] = {
.name = "READ_FIXED",
.fail = io_rw_fail,
},
[IORING_OP_WRITE_FIXED] = {
.name = "WRITE_FIXED",
.fail = io_rw_fail,
},
......@@ -526,8 +534,6 @@ const struct io_cold_def io_cold_defs[] = {
[IORING_OP_SENDMSG] = {
.name = "SENDMSG",
#if defined(CONFIG_NET)
.cleanup = io_sendmsg_recvmsg_cleanup,
.fail = io_sendrecv_fail,
#endif
......@@ -535,14 +541,11 @@ const struct io_cold_def io_cold_defs[] = {
[IORING_OP_RECVMSG] = {
.name = "RECVMSG",
#if defined(CONFIG_NET)
.cleanup = io_sendmsg_recvmsg_cleanup,
.fail = io_sendrecv_fail,
#endif
},
[IORING_OP_TIMEOUT] = {
.name = "TIMEOUT",
},
[IORING_OP_TIMEOUT_REMOVE] = {
......@@ -555,15 +558,10 @@ const struct io_cold_def io_cold_defs[] = {
.name = "ASYNC_CANCEL",
},
[IORING_OP_LINK_TIMEOUT] = {
.name = "LINK_TIMEOUT",
},
[IORING_OP_CONNECT] = {
.name = "CONNECT",
},
[IORING_OP_FALLOCATE] = {
.name = "FALLOCATE",
......@@ -583,12 +581,10 @@ const struct io_cold_def io_cold_defs[] = {
.cleanup = io_statx_cleanup,
},
[IORING_OP_READ] = {
.name = "READ",
.fail = io_rw_fail,
},
[IORING_OP_WRITE] = {
.name = "WRITE",
.fail = io_rw_fail,
},
......@@ -601,14 +597,14 @@ const struct io_cold_def io_cold_defs[] = {
[IORING_OP_SEND] = {
.name = "SEND",
#if defined(CONFIG_NET)
.cleanup = io_sendmsg_recvmsg_cleanup,
.fail = io_sendrecv_fail,
#endif
},
[IORING_OP_RECV] = {
.name = "RECV",
#if defined(CONFIG_NET)
.cleanup = io_sendmsg_recvmsg_cleanup,
.fail = io_sendrecv_fail,
#endif
},
......@@ -679,14 +675,10 @@ const struct io_cold_def io_cold_defs[] = {
},
[IORING_OP_URING_CMD] = {
.name = "URING_CMD",
},
[IORING_OP_SEND_ZC] = {
.name = "SEND_ZC",
#if defined(CONFIG_NET)
.cleanup = io_send_zc_cleanup,
.fail = io_sendrecv_fail,
#endif
......@@ -694,8 +686,6 @@ const struct io_cold_def io_cold_defs[] = {
[IORING_OP_SENDMSG_ZC] = {
.name = "SENDMSG_ZC",
#if defined(CONFIG_NET)
.cleanup = io_send_zc_cleanup,
.fail = io_sendrecv_fail,
#endif
......@@ -705,7 +695,6 @@ const struct io_cold_def io_cold_defs[] = {
},
[IORING_OP_WAITID] = {
.name = "WAITID",
},
[IORING_OP_FUTEX_WAIT] = {
.name = "FUTEX_WAIT",
......
......@@ -27,22 +27,19 @@ struct io_issue_def {
unsigned iopoll : 1;
/* have to be put into the iopoll list */
unsigned iopoll_queue : 1;
/* vectored opcode, set if 1) vectored, and 2) handler needs to know */
unsigned vectored : 1;
/* size of async data needed, if any */
unsigned short async_size;
int (*issue)(struct io_kiocb *, unsigned int);
int (*prep)(struct io_kiocb *, const struct io_uring_sqe *);
};
struct io_cold_def {
const char *name;
void (*cleanup)(struct io_kiocb *);
void (*fail)(struct io_kiocb *);
};
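A minimal model of how a table like io_issue_defs can drive upfront allocation of per-request async state: the core looks up async_size for the opcode and allocates it before calling ->prep(), rather than each handler allocating on demand. The types and names below (struct req, struct op_def, prep_request) are simplified stand-ins invented for illustration, not the kernel's.

#include <stdio.h>
#include <stdlib.h>

struct req {
	int opcode;
	void *async_data;             /* models req->async_data */
};

struct op_def {
	const char *name;
	unsigned short async_size;    /* 0 means no async state needed */
	int (*prep)(struct req *);
};

static int prep_nop(struct req *r)  { (void)r; return 0; }
static int prep_read(struct req *r) { (void)r; return 0; }

static const struct op_def op_defs[] = {
	{ "NOP",  0,  prep_nop  },
	{ "READ", 64 /* e.g. size of an async rw state */, prep_read },
};

static int prep_request(struct req *r)
{
	const struct op_def *def = &op_defs[r->opcode];

	/* allocate async state unconditionally and up front, if the op needs any */
	if (def->async_size) {
		r->async_data = malloc(def->async_size);
		if (!r->async_data)
			return -1;
	}
	return def->prep(r);
}

int main(void)
{
	struct req r = { .opcode = 1, .async_data = NULL };

	if (prep_request(&r) == 0)
		printf("%s: async_data %sallocated\n", op_defs[r.opcode].name,
		       r.async_data ? "" : "not ");
	free(r.async_data);
	return 0;
}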
......
......@@ -14,6 +14,7 @@
#include <uapi/linux/io_uring.h>
#include "io_uring.h"
#include "alloc_cache.h"
#include "refs.h"
#include "napi.h"
#include "opdef.h"
......@@ -322,8 +323,7 @@ static int io_poll_check_events(struct io_kiocb *req, struct io_tw_state *ts)
__poll_t mask = mangle_poll(req->cqe.res &
req->apoll_events);
if (!io_req_post_cqe(req, mask, IORING_CQE_F_MORE)) {
io_req_set_res(req, mask, 0);
return IOU_POLL_REMOVE_POLL_USE_RES;
}
......@@ -687,17 +687,15 @@ static struct async_poll *io_req_alloc_apoll(struct io_kiocb *req,
unsigned issue_flags)
{
struct io_ring_ctx *ctx = req->ctx;
struct async_poll *apoll;
if (req->flags & REQ_F_POLLED) {
apoll = req->apoll;
kfree(apoll->double_poll);
} else if (!(issue_flags & IO_URING_F_UNLOCKED)) {
apoll = io_alloc_cache_get(&ctx->apoll_cache);
if (!apoll)
goto alloc_apoll;
apoll->poll.retries = APOLL_MAX_RETRY;
} else {
alloc_apoll:
......@@ -1056,8 +1054,3 @@ int io_poll_remove(struct io_kiocb *req, unsigned int issue_flags)
io_req_set_res(req, ret, 0);
return IOU_OK;
}
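The apoll path above follows a cache-first allocation pattern: take an entry from a small per-context cache when one is available, otherwise fall back to a fresh allocation. A small standalone model of that pattern follows, using a trivial array-based cache of my own rather than the kernel's io_alloc_cache.

#include <stdio.h>
#include <stdlib.h>

#define CACHE_MAX 8

struct obj_cache {
	void *entries[CACHE_MAX];
	unsigned nr;
};

static void *cache_get(struct obj_cache *c)
{
	return c->nr ? c->entries[--c->nr] : NULL;
}

static int cache_put(struct obj_cache *c, void *obj)
{
	if (c->nr == CACHE_MAX)
		return 0;                      /* cache full, caller frees */
	c->entries[c->nr++] = obj;
	return 1;
}

/* allocation path: cache hit if possible, otherwise allocate fresh */
static void *alloc_obj(struct obj_cache *c, size_t size)
{
	void *obj = cache_get(c);

	return obj ? obj : malloc(size);
}

static void free_obj(struct obj_cache *c, void *obj)
{
	if (!cache_put(c, obj))
		free(obj);
}

int main(void)
{
	struct obj_cache cache = { .nr = 0 };
	void *a = alloc_obj(&cache, 128);

	free_obj(&cache, a);                   /* goes back into the cache */
	void *b = alloc_obj(&cache, 128);      /* served from the cache */
	printf("cache hit: %s\n", b == a ? "yes" : "no");
	free_obj(&cache, b);
	return 0;
}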
......@@ -33,6 +33,13 @@ static inline void req_ref_get(struct io_kiocb *req)
atomic_inc(&req->refs);
}
static inline void req_ref_put(struct io_kiocb *req)
{
WARN_ON_ONCE(!(req->flags & REQ_F_REFCOUNT));
WARN_ON_ONCE(req_ref_zero_or_close_to_overflow(req));
atomic_dec(&req->refs);
}
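req_ref_put() pairs with req_ref_get() above: both assume the caller already holds a reference, so they use plain atomic inc/dec guarded only by sanity checks instead of a full decrement-and-test. A standalone model with C11 atomics; the names here are local stand-ins, not the kernel helpers.

#include <assert.h>
#include <stdatomic.h>
#include <stdio.h>

struct fake_req {
	atomic_int refs;
};

static void ref_get(struct fake_req *req)
{
	assert(atomic_load(&req->refs) > 0);   /* models the WARN_ON_ONCE checks */
	atomic_fetch_add(&req->refs, 1);
}

static void ref_put(struct fake_req *req)
{
	assert(atomic_load(&req->refs) > 0);
	atomic_fetch_sub(&req->refs, 1);       /* no test: caller knows it isn't the last ref */
}

int main(void)
{
	struct fake_req req;

	atomic_init(&req.refs, 2);
	ref_get(&req);
	ref_put(&req);
	printf("refs now %d\n", atomic_load(&req.refs));
	return 0;
}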
static inline void __io_req_set_refcount(struct io_kiocb *req, int nr)
{
if (!(req->flags & REQ_F_REFCOUNT)) {
......
......@@ -368,8 +368,7 @@ static __cold int io_register_iowq_max_workers(struct io_ring_ctx *ctx,
/* now propagate the restriction to all registered users */
list_for_each_entry(node, &ctx->tctx_list, ctx_node) {
struct io_uring_task *tctx = node->task->io_uring;
if (WARN_ON_ONCE(!tctx->io_wq))
continue;
......