Commit 86d23145 authored by Jesper Dangaard Brouer, committed by Alexei Starovoitov

bpf: cpumap memory prefetchw optimizations for struct page

A lot of the performance gain comes from this patch.

While analysing performance overhead it was found that the largest CPU
stalls were caused when touching the struct page area. It is first read with
a READ_ONCE from build_skb_around() via page_is_pfmemalloc(), and later
written by page_frag_free() when the frame is freed.

Measurements show that the prefetchw (W) variant is needed to
achieve the performance gain. We believe the benefit is twofold:
first, the W-variant saves one step in the cache-coherency protocol, and
second, it helps us avoid the non-temporal prefetch HW optimizations and
brings the data into all cache levels. It might be worth investigating
whether a prefetch into L2 would have the same benefit.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
parent 8f0504a9
@@ -280,6 +280,18 @@ static int cpu_map_kthread_run(void *data)
		 * consume side valid as no-resize allowed of queue.
		 */
		n = ptr_ring_consume_batched(rcpu->queue, frames, CPUMAP_BATCH);
		for (i = 0; i < n; i++) {
			void *f = frames[i];
			struct page *page = virt_to_page(f);
			/* Bring struct page memory area to curr CPU. Read by
			 * build_skb_around via page_is_pfmemalloc(), and when
			 * freed written by page_frag_free call.
			 */
			prefetchw(page);
		}
		m = kmem_cache_alloc_bulk(skbuff_head_cache, gfp, n, skbs);
		if (unlikely(m == 0)) {
			for (i = 0; i < n; i++)
......
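
The hunk above follows a two-pass pattern: issue a write-intent prefetch for every item in the batch, then do the work that reads and writes the prefetched metadata. A minimal userspace sketch of that pattern, assuming GCC or Clang and using the compiler builtin `__builtin_prefetch` rather than the kernel's `prefetchw()`, might look like the following. `struct meta`, `BATCH`, and `process_batch()` are hypothetical names invented for illustration and are not part of the patch.

```c
#include <stddef.h>

#define BATCH 8 /* hypothetical batch size, not CPUMAP_BATCH */

/* Hypothetical stand-in for per-frame metadata such as struct page. */
struct meta {
	unsigned long flags;
	int refcount;
};

/*
 * Pass 1 prefetches every item with write intent; pass 2 touches it.
 * Returns the number of items whose lowest flag bit is set.
 */
static int process_batch(struct meta **items, size_t n)
{
	int flagged = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		/* rw = 1 requests write intent; locality = 3 asks to keep
		 * the line in all cache levels (not a non-temporal hint).
		 */
		__builtin_prefetch(items[i], 1, 3);
	}

	for (i = 0; i < n; i++) {
		/* Read first, loosely analogous to page_is_pfmemalloc(). */
		flagged += (int)(items[i]->flags & 1UL);
		/* Write later, loosely analogous to the free path. */
		items[i]->refcount--;
	}

	return flagged;
}
```

A caller would fill an array of up to `BATCH` pointers and pass it with its length. Whether the write-intent hint helps on a given machine is something to measure, as the commit message did; the builtin is only a hint and the compiler or CPU may ignore it.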