1. 21 Jul, 2022 16 commits
  2. 20 Jul, 2022 22 commits
  3. 19 Jul, 2022 2 commits
    • Merge branch 'io_uring-zerocopy-send' of git://git.kernel.org/pub/scm/linux/kernel/git/kuba/linux · 7f9eee19
      Jakub Kicinski authored
      Pavel Begunkov says:
      
      ====================
      io_uring zerocopy send
      
      The patchset implements io_uring zerocopy send. It works with both registered
      and normal buffers; mixing is allowed but not recommended. Apart from the usual
      request completions, just as with MSG_ZEROCOPY, io_uring separately notifies
      userspace when buffers are freed and can be reused (see API design below).
      These notifications are delivered into io_uring's Completion Queue. The
      "buffer-free" notifications are not necessarily per request: userspace has
      control over this and should explicitly attach a number of requests to a
      single notification. The series also adds some internal optimisations for
      registered buffers, such as removing page referencing.
      
      From the kernel networking perspective there are two main changes. The first
      one is passing ubuf_info into the network layer from io_uring (inside an
      in-kernel struct msghdr). This allows extra optimisations, e.g. ubuf_info
      caching on the io_uring side, but also helps to avoid cross-referencing
      and synchronisation problems. The second part is an optional optimisation
      removing page referencing for requests with registered buffers.
      
      UDP was benchmarked with an optimised version of the selftest (see [1]), which
      sends a bunch of requests, waits for completions and repeats. The "zc + flush"
      column posts one additional "buffer-free" notification per request, while plain
      "zc" doesn't post buffer notifications at all.
      
      NIC (requests / second):
      IO size | non-zc    | zc             | zc + flush
      4000    | 495134    | 606420 (+22%)  | 558971 (+12%)
      1500    | 551808    | 577116 (+4.5%) | 565803 (+2.5%)
      1000    | 584677    | 592088 (+1.2%) | 560885 (-4%)
      600     | 596292    | 598550 (+0.4%) | 555366 (-6.7%)
      
      dummy (requests / second):
      IO size | non-zc    | zc             | zc + flush
      8000    | 1299916   | 2396600 (+84%) | 2224219 (+71%)
      4000    | 1869230   | 2344146 (+25%) | 2170069 (+16%)
      1200    | 2071617   | 2361960 (+14%) | 2203052 (+6%)
      600     | 2106794   | 2381527 (+13%) | 2195295 (+4%)
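
      The percentages in the tables above appear to be the relative change in
      requests per second against the non-zc column. A minimal sketch of that
      arithmetic, assuming this reading is correct (the helper name
      relative_change_percent is hypothetical and not part of the series):

```c
#include <assert.h>

/*
 * Relative change of `value` against `base`, in percent.
 * Hypothetical helper for illustration only; positive means faster.
 */
static inline double relative_change_percent(double base, double value)
{
    assert(base > 0.0);
    return (value - base) / base * 100.0;
}
```

      For the 4000-byte NIC row, relative_change_percent(495134, 606420) comes
      out at roughly 22.5, in line with the quoted +22%.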
      
      Previously it also showed a massive speedup compared to the msg_zerocopy
      tool (see [3]), which is probably not super interesting. A further batch of
      refcounting optimisations was omitted from the series for simplicity and
      because it doesn't change the picture drastically. It will be sent as a
      follow-up, along with flushing optimisations that close the performance gap
      between the last two columns.
      
      For TCP on localhost (with hacks enabling localhost zerocopy) and including
      additional overhead for receive:
      
      IO size | non-zc    | zc
      1200    | 4174      | 4148
      4096    | 7597      | 11228
      
      With a real NIC and 1200-byte sends, zc is ~5-10% worse than non-zc. The
      omitted optimisations may help somewhat, and it should look better for 4000
      bytes, but that couldn't be tested properly because of setup problems.
      
      Links:
      
        liburing (benchmark + tests):
        [1] https://github.com/isilence/liburing/tree/zc_v4
      
        kernel repo:
        [2] https://github.com/isilence/linux/tree/zc_v4
      
        RFC v1:
        [3] https://lore.kernel.org/io-uring/cover.1638282789.git.asml.silence@gmail.com/
      
        RFC v2:
        https://lore.kernel.org/io-uring/cover.1640029579.git.asml.silence@gmail.com/
      
        Net patches based:
        git@github.com:isilence/linux.git zc_v4-net-base
        or
        https://github.com/isilence/linux/tree/zc_v4-net-base
      
      API design overview:
      
        The series introduces an io_uring concept of notifiers. From the userspace
        perspective, a notifier is an entity to which it can bind one or more requests
        and then request a flush. Flushing a notifier makes it impossible to attach new
        requests to it, and instructs the notifier to post a completion once all
        requests attached to it are completed and the kernel doesn't need the buffers
        anymore.
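
        The notifier lifecycle described above can be sketched as a small
        userspace model. This is only an illustration of the stated semantics
        (attach, flush, post once everything attached is done), not the kernel
        implementation; the struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy model of one notifier; NOT the kernel data structure. */
struct notifier_model {
    uint32_t pending; /* attached requests whose buffers are still in use */
    bool flushed;     /* once set, no new requests may be attached */
    bool posted;      /* the "buffer-free" completion has been posted */
};

/* Post the completion only when flushed and nothing is pending. */
static inline void notifier_try_post(struct notifier_model *n)
{
    if (n->flushed && n->pending == 0)
        n->posted = true;
}

/* Attach a request; returns false if the notifier was already flushed. */
static inline bool notifier_attach(struct notifier_model *n)
{
    if (n->flushed)
        return false;
    n->pending++;
    return true;
}

/* Flush: close the notifier to new requests, post if already idle. */
static inline void notifier_flush(struct notifier_model *n)
{
    n->flushed = true;
    notifier_try_post(n);
}

/* One attached request completed and its buffers are no longer needed. */
static inline void notifier_request_done(struct notifier_model *n)
{
    assert(n->pending > 0);
    n->pending--;
    notifier_try_post(n);
}
```

        In this model a notifier with two attached requests posts nothing at
        flush time; the completion appears only after the second call to
        notifier_request_done().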
      
        Notifiers are stored in notification slots, which should be registered as
        an array in io_uring. Each slot stores only one notifier at any particular
        moment. Flushing removes it from the slot and the slot automatically replaces
        it with a new notifier. All operations with notifiers are done by specifying
        an index of a slot it's currently in.
      
        When registering notification slots, userspace specifies a u64 tag for each
        slot, which is copied into notification completion entries as
        cqe::user_data. cqe::res is 0 and cqe::flags holds a wrap-around u32
        sequence number counting the notifiers of a slot.
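
        As a rough sketch of how a slot's tag and sequence number could end up
        in a notification completion, the model below fills in the three CQE
        fields named above. The types and slot_flush() are hypothetical and
        are not the io_uring uAPI; whether the first notifier of a slot
        carries sequence number 0 is an assumption the text does not confirm.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical mirror of the three CQE fields mentioned in the text. */
struct cqe_model {
    uint64_t user_data;
    int32_t res;
    uint32_t flags;
};

/* Toy model of one notification slot; NOT a kernel structure. */
struct slot_model {
    uint64_t tag; /* u64 tag supplied at registration */
    uint32_t seq; /* sequence number of the notifier now in the slot */
};

/*
 * Flush the notifier held in slots[idx]: build the completion it would
 * eventually post, then let the slot move on to a fresh notifier.
 */
static inline struct cqe_model slot_flush(struct slot_model *slots,
                                          unsigned int nr, unsigned int idx)
{
    assert(idx < nr);
    struct cqe_model cqe = {
        .user_data = slots[idx].tag, /* tag copied into cqe::user_data */
        .res = 0,                    /* cqe::res is always 0 */
        .flags = slots[idx].seq,     /* per-slot sequence number */
    };
    slots[idx].seq++; /* u32 arithmetic wraps around to 0 */
    return cqe;
}
```

        Since the tag is per slot, userspace would tell notifiers of the same
        slot apart by the sequence number in cqe::flags.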
      
      ====================
      
      Link: https://lore.kernel.org/r/cover.1657643355.git.asml.silence@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • tcp: support externally provided ubufs · eb315a7d
      Pavel Begunkov authored
      Teach TCP how to use an external ubuf_info provided in msghdr, and
      also prepare it for managed frags by sprinkling
      skb_zcopy_downgrade_managed() where it could mix managed and unmanaged
      frags.
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>