    iommu/dma: avoid expensive indirect calls for sync operations · ea01fa70
    Alexander Lobakin authored
    
    
    When the IOMMU is on, the actual synchronization happens in the same
    cases as with direct DMA. Advertise %DMA_F_CAN_SKIP_SYNC in IOMMU
    DMA to skip the (indirect) sync ops calls for non-SWIOTLB buffers.
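
    The patch itself is not inlined on this page; as a minimal sketch
    (not the verbatim hunk), advertising the flag amounts to extending
    the .flags field of iommu_dma_ops in drivers/iommu/dma-iommu.c,
    assuming the %DMA_F_CAN_SKIP_SYNC flag added earlier in this series:

        static const struct dma_map_ops iommu_dma_ops = {
                .flags                  = DMA_F_PCI_P2PDMA_SUPPORTED |
                                          DMA_F_CAN_SKIP_SYNC,
                /* ...mapping and sync callbacks stay unchanged... */
                .sync_sg_for_cpu        = iommu_dma_sync_sg_for_cpu,
                .sync_sg_for_device     = iommu_dma_sync_sg_for_device,
        };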
    
    perf profile before the patch:
    
        18.53%  [kernel]       [k] gq_rx_skb
        14.77%  [kernel]       [k] napi_reuse_skb
         8.95%  [kernel]       [k] skb_release_data
         5.42%  [kernel]       [k] dev_gro_receive
         5.37%  [kernel]       [k] memcpy
    <*>  5.26%  [kernel]       [k] iommu_dma_sync_sg_for_cpu
         4.78%  [kernel]       [k] tcp_gro_receive
    <*>  4.42%  [kernel]       [k] iommu_dma_sync_sg_for_device
         4.12%  [kernel]       [k] ipv6_gro_receive
         3.65%  [kernel]       [k] gq_pool_get
         3.25%  [kernel]       [k] skb_gro_receive
         2.07%  [kernel]       [k] napi_gro_frags
         1.98%  [kernel]       [k] tcp6_gro_receive
         1.27%  [kernel]       [k] gq_rx_prep_buffers
         1.18%  [kernel]       [k] gq_rx_napi_handler
         0.99%  [kernel]       [k] csum_partial
         0.74%  [kernel]       [k] csum_ipv6_magic
         0.72%  [kernel]       [k] free_pcp_prepare
         0.60%  [kernel]       [k] __napi_poll
         0.58%  [kernel]       [k] net_rx_action
         0.56%  [kernel]       [k] read_tsc
    <*>  0.50%  [kernel]       [k] __x86_indirect_thunk_r11
         0.45%  [kernel]       [k] memset
    
    After the patch, the lines marked <*> no longer show up, and overall
    CPU usage looks much better (~60% instead of ~72%):
    
        25.56%  [kernel]       [k] gq_rx_skb
         9.90%  [kernel]       [k] napi_reuse_skb
         7.39%  [kernel]       [k] dev_gro_receive
         6.78%  [kernel]       [k] memcpy
         6.53%  [kernel]       [k] skb_release_data
         6.39%  [kernel]       [k] tcp_gro_receive
         5.71%  [kernel]       [k] ipv6_gro_receive
         4.35%  [kernel]       [k] napi_gro_frags
         4.34%  [kernel]       [k] skb_gro_receive
         3.50%  [kernel]       [k] gq_pool_get
         3.08%  [kernel]       [k] gq_rx_napi_handler
         2.35%  [kernel]       [k] tcp6_gro_receive
         2.06%  [kernel]       [k] gq_rx_prep_buffers
         1.32%  [kernel]       [k] csum_partial
         0.93%  [kernel]       [k] csum_ipv6_magic
         0.65%  [kernel]       [k] net_rx_action
    
    iavf gains +10% Mpps on Rx. This also unblocks batched allocations
    of XSk buffers when the IOMMU is active.
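
    For context on why the indirect calls disappear: with this series,
    the generic dma_sync_*() wrappers test a per-device skip-sync bit
    before taking the indirect branch through dma_map_ops, and SWIOTLB
    clears that bit the first time it bounces a buffer for a device.
    A rough sketch of that fast path (paraphrased; helper names
    approximate, not the verbatim kernel code):

        /* include/linux/dma-mapping.h -- shape of the fast path */
        static inline bool dma_dev_need_sync(const struct device *dev)
        {
                /* always sync when DMA API debugging is enabled */
                return !dev->dma_skip_sync ||
                       IS_ENABLED(CONFIG_DMA_API_DEBUG);
        }

        static inline void dma_sync_single_for_cpu(struct device *dev,
                dma_addr_t addr, size_t size, enum dma_data_direction dir)
        {
                if (dma_dev_need_sync(dev))
                        __dma_sync_single_for_cpu(dev, addr, size, dir);
        }

    With %DMA_F_CAN_SKIP_SYNC advertised, dev->dma_skip_sync stays set
    for devices that never hit SWIOTLB, so the retpoline-guarded
    indirect call (the __x86_indirect_thunk_r11 line above) is skipped
    entirely.
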
    Co-developed-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Acked-by: Robin Murphy <robin.murphy@arm.com>
    Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
    Signed-off-by: Christoph Hellwig <hch@lst.de>