    virtmem: Benchmarks for pagefault handling · 3cfc2728
    Kirill Smelkov authored
    Benchmark the time it takes for virtmem to handle a pagefault with a
    no-op loadblk, with loadblk implemented both in C and in Python.
    
    On my computer it is:
    
    	name          µs/op
    	PagefaultC    269 ± 0%
    	pagefault_py  291 ± 0%
    
    In other words, quite a lot of time per fault.
    
    It turned out to be mostly spent fallocate'ing pages on tmpfs from
    /dev/shm. Part of the above 269 µs/op goes to freeing (reclaiming) pages
    back when the benchmark working set exceeds the /dev/shm size, and part
    to allocating them.
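
    To keep the working set below the tmpfs capacity one can query /dev/shm
    directly; a minimal sketch (not part of the benchmark, just plain statvfs):

    	#include <stdio.h>
    	#include <sys/statvfs.h>

    	int main(void) {
    		struct statvfs sv;
    		if (statvfs("/dev/shm", &sv)) { perror("statvfs"); return 1; }

    		/* total and currently available space on the tmpfs, in bytes */
    		unsigned long long total = (unsigned long long)sv.f_blocks * sv.f_frsize;
    		unsigned long long avail = (unsigned long long)sv.f_bavail * sv.f_frsize;
    		printf("/dev/shm: total %llu MB, avail %llu MB\n", total >> 20, avail >> 20);
    		return 0;
    	}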
    
    If I limit the work size (via npage in benchmem.c) to be less than the
    whole of /dev/shm, it drops to ~170 µs/op, and with additional tracing it
    shows up as something like this:
    
        	.. on_pagefault_start   0.954 µs
        	.. vma_on_pagefault_pre 0.954 µs
        	.. ramh_alloc_page_pre  0.954 µs
        	.. ramh_alloc_page      169.992 µs
        	.. vma_on_pagefault     172.853 µs
        	.. vma_on_pagefault_pre 172.853 µs
        	.. vma_on_pagefault     174.046 µs
        	.. on_pagefault_end     174.046 µs
        	.. whole:               171.900 µs
    
    so almost all of the time is spent in ramh_alloc_page, which is doing the
    fallocate:
    
    	https://lab.nexedi.com/nexedi/wendelin.core/blob/f11386a4/bigfile/ram_shmfs.c#L125
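
    In isolation that step is easy to reproduce. A stripped-down sketch
    (illustrative only, not the actual ramh_alloc_page): open an unnamed file
    on the /dev/shm tmpfs and fallocate one 2M page into it:

    	#define _GNU_SOURCE
    	#include <stdio.h>
    	#include <unistd.h>
    	#include <fcntl.h>

    	int main(void) {
    		size_t pagesize = 2*1024*1024;	/* one RAM page, 2M as in the numbers above */

    		/* unnamed file on the /dev/shm tmpfs, standing in for the ram handle */
    		int fd = open("/dev/shm", O_TMPFILE | O_RDWR, 0600);
    		if (fd < 0) { perror("open /dev/shm"); return 1; }

    		/* mode=0 -> allocate the range; on tmpfs this ends up in shmem_fallocate,
    		 * which is where the ~170 µs above is spent */
    		if (fallocate(fd, 0, /*offset=*/0, pagesize)) { perror("fallocate"); return 1; }

    		close(fd);
    		return 0;
    	}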
    
    A simple benchmark[1] confirmed that fallocate on tmpfs is indeed
    relatively slow[2] (and that on recent kernels it has regressed somewhat
    compared to Linux 3.16). The profile flamegraph for that benchmark[3]
    shows the inside of shmem_fallocate: for 1 hardware page it is not that
    slow (e.g. <1 µs), but when a request comes for a whole region it
    performs it internally page by page, and so accumulates that ~170 µs
    for 2M.
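
    The effect is easy to see with a trivial timing loop. A sketch in the
    spirit of t_fallocate.c (not the actual benchmark; file and sizes are
    arbitrary) that times one fallocate call covering 2M against the same 2M
    allocated one hardware page at a time:

    	#define _GNU_SOURCE
    	#include <stdio.h>
    	#include <unistd.h>
    	#include <fcntl.h>
    	#include <time.h>

    	static double now_us(void) {
    		struct timespec ts;
    		clock_gettime(CLOCK_MONOTONIC, &ts);
    		return ts.tv_sec * 1e6 + ts.tv_nsec / 1e3;
    	}

    	int main(void) {
    		size_t page = sysconf(_SC_PAGESIZE);	/* hardware page, typically 4K */
    		size_t blk  = 2*1024*1024;		/* 2M region, as in the flamegraph */
    		double t0, t1;

    		int fd = open("/dev/shm", O_TMPFILE | O_RDWR, 0600);
    		if (fd < 0) { perror("open /dev/shm"); return 1; }

    		/* one fallocate call covering the whole 2M */
    		t0 = now_us();
    		if (fallocate(fd, 0, 0, blk)) { perror("fallocate"); return 1; }
    		t1 = now_us();
    		printf("fallocate 2M, one call : %8.3f µs\n", t1 - t0);

    		/* free the pages again, then allocate the same 2M one hardware page at a time */
    		if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, 0, blk)) {
    			perror("punch hole"); return 1;
    		}
    		t0 = now_us();
    		for (size_t off = 0; off < blk; off += page)
    			if (fallocate(fd, 0, off, page)) { perror("fallocate"); return 1; }
    		t1 = now_us();
    		printf("fallocate 2M, per page : %8.3f µs\n", t1 - t0);

    		close(fd);
    		return 0;
    	}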
    
    I briefly tried to rerun the benchmark with huge pages activated on /dev/shm via
    
    	mount /dev/shm -o huge=always,remount
    
    both as a regular user and as root, but it executed several times slower.
    Probably something to investigate more later.
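
    One thing to verify when coming back to this: whether tmpfs really hands
    out huge pages after the remount. A quick check (assumes Linux >= 4.8,
    which exports ShmemHugePages in /proc/meminfo):

    	#include <stdio.h>
    	#include <string.h>

    	int main(void) {
    		char line[256];
    		FILE *f = fopen("/proc/meminfo", "r");
    		if (!f) { perror("fopen"); return 1; }

    		/* ShmemHugePages counts tmpfs/shmem memory backed by huge pages */
    		while (fgets(line, sizeof(line), f))
    			if (!strncmp(line, "ShmemHugePages:", 15))
    				fputs(line, stdout);

    		fclose(f);
    		return 0;
    	}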
    
    [1] https://lab.nexedi.com/kirr/misc/blob/4f84a06e/tmpfs/t_fallocate.c
    [2] https://lab.nexedi.com/kirr/misc/blob/4f84a06e/tmpfs/1.txt
    [3] https://lab.nexedi.com/kirr/misc/raw/4f84a06e/tmpfs/fallocate-2M-nohuge.svg