    net: generalize skb freeing deferral to per-cpu lists · 68822bdf
    Eric Dumazet authored
    Logic added in commit f35f8219 ("tcp: defer skb freeing after socket
    lock is released") helped bulk TCP flows move the cost of skb frees
    outside of the critical section where the socket lock was held.
    
    But for RPC traffic, or hosts with RFS enabled, the solution is far
    from ideal.
    
    For RPC traffic, recvmsg() has to return to user space right after
    the skb payload has been consumed, meaning that the BH handler has no
    chance to pick up the skb before the recvmsg() thread does. This issue
    is more visible with BIG TCP, as more RPC messages fit in one skb.
    
    For RFS, even if the BH handler picks up the skbs, they are still
    picked up on the cpu on which the user thread is running.
    
    Ideally, skbs (and their associated page frags) should be freed on
    the cpu that originally allocated them.
    
    This patch removes the per-socket anchor (sk->defer_list) and
    instead uses a per-cpu list, which will hold more skbs per round.
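
    For illustration, the per-cpu state could look like the following
    sketch (field names such as defer_lock, defer_count, defer_list and
    defer_csd are illustrative assumptions, not a quote of the patch):

        struct softnet_data {
                /* ... existing fields ... */

                /* Deferred skb freeing (sketch): */
                spinlock_t              defer_lock;
                int                     defer_count;
                struct sk_buff          *defer_list;    /* single-linked list of skbs */
                call_single_data_t      defer_csd;      /* used to IPI the owning cpu */
        };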
    
    This new per-cpu list is drained at the end of net_rx_action(),
    after incoming packets have been processed, to lower latencies.
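
    A sketch of the drain helper called from the end of net_rx_action()
    (the name skb_defer_free_flush() is assumed here for illustration):

        static void skb_defer_free_flush(struct softnet_data *sd)
        {
                struct sk_buff *skb, *next;

                /* Lockless peek, paired with the producers' WRITE_ONCE(),
                 * so we do not grab the lock when nothing is queued.
                 */
                if (!READ_ONCE(sd->defer_list))
                        return;

                spin_lock_irq(&sd->defer_lock);
                skb = sd->defer_list;
                sd->defer_list = NULL;
                sd->defer_count = 0;
                spin_unlock_irq(&sd->defer_lock);

                while (skb != NULL) {
                        next = skb->next;
                        __kfree_skb(skb);
                        skb = next;
                }
        }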
    
    In normal conditions, skbs are added to the per-cpu list with
    no further action. In the (unlikely) cases where the cpu does not
    run the net_rx_action() handler fast enough, we use an IPI to raise
    NET_RX_SOFTIRQ on the remote cpu.
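
    The producer side (called from the recvmsg() path) could then look
    like the sketch below; skb_attempt_defer_free() and the kick
    threshold are illustrative assumptions, with skb->alloc_cpu recording
    the cpu that allocated the skb:

        static void skb_attempt_defer_free(struct sk_buff *skb)
        {
                int cpu = skb->alloc_cpu;       /* cpu that allocated the skb */
                struct softnet_data *sd;
                unsigned long flags;
                bool kick;

                /* Do not defer if we already run on the allocating cpu,
                 * or if that cpu went offline.
                 */
                if (cpu == raw_smp_processor_id() || !cpu_online(cpu)) {
                        __kfree_skb(skb);
                        return;
                }

                sd = &per_cpu(softnet_data, cpu);
                spin_lock_irqsave(&sd->defer_lock, flags);
                skb->next = sd->defer_list;
                /* Paired with the READ_ONCE() in the drain helper. */
                WRITE_ONCE(sd->defer_list, skb);
                kick = (++sd->defer_count == 64);       /* arbitrary threshold */
                spin_unlock_irqrestore(&sd->defer_lock, flags);

                /* The remote cpu is not draining fast enough: raise
                 * NET_RX_SOFTIRQ there through an IPI.
                 */
                if (unlikely(kick))
                        smp_call_function_single_async(cpu, &sd->defer_csd);
        }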
    
    Also, we do not bother draining the per-cpu list from dev_cpu_dead().
    This is because skbs on this list have no requirement on how fast
    they should be freed.
    
    Note that we can add a small per-cpu cache in the future
    if we see any contention on sd->defer_lock.
    
    Tested on a pair of hosts with 100Gbit NICs, RFS enabled,
    and /proc/sys/net/ipv4/tcp_rmem[2] tuned to 16MB to work around the
    page recycling strategy used by the NIC driver (its page pool capacity
    being too small compared to the number of skbs/pages held in socket
    receive queues).
    
    Note that this tuning was only done to demonstrate worse
    conditions for skb freeing for this particular test.
    These conditions can happen in more general production workloads.
    
    10 runs of one TCP_STREAM flow
    
    Before:
    Average throughput: 49685 Mbit.
    
    Kernel profiles on the cpu running the recvmsg() user thread show a
    high cost for skb freeing related functions (*):
    
        57.81%  [kernel]       [k] copy_user_enhanced_fast_string
    (*) 12.87%  [kernel]       [k] skb_release_data
    (*)  4.25%  [kernel]       [k] __free_one_page
    (*)  3.57%  [kernel]       [k] __list_del_entry_valid
         1.85%  [kernel]       [k] __netif_receive_skb_core
         1.60%  [kernel]       [k] __skb_datagram_iter
    (*)  1.59%  [kernel]       [k] free_unref_page_commit
    (*)  1.16%  [kernel]       [k] __slab_free
         1.16%  [kernel]       [k] _copy_to_iter
    (*)  1.01%  [kernel]       [k] kfree
    (*)  0.88%  [kernel]       [k] free_unref_page
         0.57%  [kernel]       [k] ip6_rcv_core
         0.55%  [kernel]       [k] ip6t_do_table
         0.54%  [kernel]       [k] flush_smp_call_function_queue
    (*)  0.54%  [kernel]       [k] free_pcppages_bulk
         0.51%  [kernel]       [k] llist_reverse_order
         0.38%  [kernel]       [k] process_backlog
    (*)  0.38%  [kernel]       [k] free_pcp_prepare
         0.37%  [kernel]       [k] tcp_recvmsg_locked
    (*)  0.37%  [kernel]       [k] __list_add_valid
         0.34%  [kernel]       [k] sock_rfree
         0.34%  [kernel]       [k] _raw_spin_lock_irq
    (*)  0.33%  [kernel]       [k] __page_cache_release
         0.33%  [kernel]       [k] tcp_v6_rcv
    (*)  0.33%  [kernel]       [k] __put_page
    (*)  0.29%  [kernel]       [k] __mod_zone_page_state
         0.27%  [kernel]       [k] _raw_spin_lock
    
    After patch:
    Average throughput: 73076 Mbit.
    
    Kernel profiles on the cpu running the recvmsg() user thread look better:
    
        81.35%  [kernel]       [k] copy_user_enhanced_fast_string
         1.95%  [kernel]       [k] _copy_to_iter
         1.95%  [kernel]       [k] __skb_datagram_iter
         1.27%  [kernel]       [k] __netif_receive_skb_core
         1.03%  [kernel]       [k] ip6t_do_table
         0.60%  [kernel]       [k] sock_rfree
         0.50%  [kernel]       [k] tcp_v6_rcv
         0.47%  [kernel]       [k] ip6_rcv_core
         0.45%  [kernel]       [k] read_tsc
         0.44%  [kernel]       [k] _raw_spin_lock_irqsave
         0.37%  [kernel]       [k] _raw_spin_lock
         0.37%  [kernel]       [k] native_irq_return_iret
         0.33%  [kernel]       [k] __inet6_lookup_established
         0.31%  [kernel]       [k] ip6_protocol_deliver_rcu
         0.29%  [kernel]       [k] tcp_rcv_established
         0.29%  [kernel]       [k] llist_reverse_order
    
    v2: kdoc issue (kernel bots)
        do not defer if (alloc_cpu == smp_processor_id()) (Paolo)
        replace the sk_buff_head with a single-linked list (Jakub)
        add a READ_ONCE()/WRITE_ONCE() for the lockless read of sd->defer_list
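
    To complete the picture, the IPI mentioned above only needs to raise
    NET_RX_SOFTIRQ so that the remote cpu runs net_rx_action() and drains
    its list. A sketch (handler name, csd setup and init location are
    assumptions):

        static void trigger_rx_softirq(void *data)
        {
                __raise_softirq_irqoff(NET_RX_SOFTIRQ);
        }

        /* At boot, for each cpu (e.g. from net_dev_init()):
         *      INIT_CSD(&sd->defer_csd, trigger_rx_softirq, NULL);
         *      spin_lock_init(&sd->defer_lock);
         */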
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Acked-by: Paolo Abeni <pabeni@redhat.com>
    Link: https://lore.kernel.org/r/20220422201237.416238-1-eric.dumazet@gmail.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>