    net: napi: add hard irqs deferral feature · 6f8b12d6
    Eric Dumazet authored
    Back in commit 3b47d303 ("net: gro: add a per device gro flush timer")
    we added the ability to arm one high-resolution timer, which we used
    to keep incomplete packets in the GRO engine a bit longer, hoping that
    further frames might be added to them.
    
    Since then, we added the napi_complete_done() interface, and commit
    364b6055 ("net: busy-poll: return busypolling status to drivers")
    allowed drivers to avoid re-arming NIC interrupts if we made a promise
    that their NAPI poll() handler would be called in the near future.
    
    This infrastructure can be leveraged, thanks to a new device parameter,
    which allows arming the napi hrtimer instead of re-arming the device
    hard IRQ.
    
    We have noticed that on some servers with 32 RX queues or more, the chit-chat
    between the NIC and the host caused by IRQ delivery and re-arming could hurt
    throughput by ~20% on a 100Gbit NIC.
    
    In contrast, hrtimers use local (per-CPU) resources and might have lower
    cost.
    
    The new tunable, named napi_defer_hard_irqs, is placed in the same hierarchy
    as gro_flush_timeout (/sys/class/net/ethX/).
    
    By default, both gro_flush_timeout and napi_defer_hard_irqs are zero.
    
    This patch does not change the prior behavior of gro_flush_timeout
    if used alone: NIC hard IRQs should be re-armed as before.
    
    One concrete usage can be:
    
    echo 20000 >/sys/class/net/eth1/gro_flush_timeout
    echo 10 >/sys/class/net/eth1/napi_defer_hard_irqs
    
    If at least one packet is retired, we reset the napi counter
    to 10 (napi_defer_hard_irqs), ensuring at least 10 periodic scans
    of the queue.
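    
    The counter behavior described above can be sketched as a small
    userspace model. This is an illustration under stated assumptions, not
    the kernel code: the struct names, the poll_complete() helper and the
    gro_pending flag are hypothetical, and gro_flush_timeout is taken to be
    expressed in nanoseconds.

```c
#include <stdbool.h>

/* Hypothetical per-device tunables, mirroring the two sysfs files. */
struct dev_params {
    unsigned long gro_flush_timeout_ns; /* gro_flush_timeout, in nanoseconds */
    int napi_defer_hard_irqs;           /* polls during which hard IRQs stay off */
};

/* Hypothetical per-NAPI state: polls left before the IRQ is re-armed. */
struct napi_model {
    int defer_count;
};

/*
 * Model of the decision taken when a poll completes without exhausting
 * its budget. Returns true if the NIC hard IRQ should be re-armed.
 * *timer_ns receives the hrtimer delay to arm, or 0 for no timer.
 */
bool poll_complete(struct napi_model *n, const struct dev_params *dev,
                   int work_done, bool gro_pending, unsigned long *timer_ns)
{
    bool rearm_irq = true;

    *timer_ns = 0;
    if (work_done > 0) {
        /* Prior behavior: a flush timer only if GRO still holds packets. */
        if (gro_pending)
            *timer_ns = dev->gro_flush_timeout_ns;
        /* At least one packet retired: reset the deferral counter. */
        n->defer_count = dev->napi_defer_hard_irqs;
    }
    if (n->defer_count > 0) {
        /* Spend one deferral: arm the timer instead of the hard IRQ. */
        n->defer_count--;
        *timer_ns = dev->gro_flush_timeout_ns;
        if (*timer_ns)
            rearm_irq = false;
    }
    return rearm_irq;
}
```

    In this model, with gro_flush_timeout set to 20000 and
    napi_defer_hard_irqs set to 10, a poll that retires packets and the nine
    empty polls that follow it each arm a 20000 ns timer; the next empty poll
    re-arms the hard IRQ. With napi_defer_hard_irqs left at zero, the function
    always returns true, matching the unchanged gro_flush_timeout behavior.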
    
    On busy queues, this should avoid NIC hard IRQs, whereas before this patch
    IRQ avoidance was only possible if napi->poll() exhausted its budget
    and did not call napi_complete_done().
    
    This feature can also be used to work around some non-optimal NIC IRQ
    coalescing strategies.
    
    Having the ability to insert XX usec delays between each napi->poll()
    can increase cache efficiency, since we increase batch sizes.
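    
    As a rough back-of-the-envelope illustration of the batching effect, the
    hypothetical helper below estimates how many packets accumulate on one
    queue during a single delay, assuming a steady arrival rate. The rate
    used in the note that follows is an illustrative value, not a measurement.

```c
/*
 * Estimate of packets arriving on one queue during one inter-poll delay,
 * assuming a constant arrival rate (integer arithmetic, rounds down).
 */
unsigned long long expected_batch(unsigned long long packets_per_sec,
                                  unsigned long long delay_ns)
{
    return packets_per_sec * delay_ns / 1000000000ULL;
}
```

    For example, a queue receiving 1,000,000 packets per second with a
    20000 ns delay between polls would see about 20 packets per poll under
    this assumption.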
    
    It also keeps serving CPUs from staying idle too long, reducing tail latencies.
    
    Co-developed-by: Luigi Rizzo <lrizzo@google.com>
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
net-sysfs.c 45.1 KB