• Yang Shi's avatar
    mm: swap: make page_evictable() inline · 1eb6234e
    Yang Shi authored
    When backporting commit 9c4e6b1a ("mm, mlock, vmscan: no more skipping
    pagevecs") to our 4.9 kernel, our test bench noticed around 10% down with
    a couple of vm-scalability's test cases (lru-file-readonce,
    lru-file-readtwice and lru-file-mmap-read).  I didn't see that much down
    on my VM (32c-64g-2nodes).  It might be caused by the test configuration,
    which is 32c-256g with NUMA disabled and the tests were run in root memcg,
    so the tests actually stress only one inactive and active lru.  It sounds
    not very usual in mordern production environment.
    
    That commit did two major changes:
    1. Call page_evictable()
    2. Use smp_mb to force the PG_lru set visible
    
    It looks they contribute the most overhead.  The page_evictable() is a
    function which does function prologue and epilogue, and that was used by
    page reclaim path only.  However, lru add is a very hot path, so it sounds
    better to make it inline.  However, it calls page_mapping() which is not
    inlined either, but the disassemble shows it doesn't do push and pop
    operations and it sounds not very straightforward to inline it.
    
    Other than this, it sounds smp_mb() is not necessary for x86 since
    SetPageLRU is atomic which enforces memory barrier already, replace it
    with smp_mb__after_atomic() in the following patch.
    
    With the two fixes applied, the tests can get back around 5% on that test
    bench and get back normal on my VM.  Since the test bench configuration is
    not that usual and I also saw around 6% up on the latest upstream, so it
    sounds good enough IMHO.
    
    The below is test data (lru-file-readtwice throughput) against the v5.6-rc4:
    	mainline	w/ inline fix
              150MB            154MB
    
    With this patch the throughput gets 2.67% up.  The data with using
    smp_mb__after_atomic() is showed in the following patch.
    
    Shakeel Butt did the below test:
    
    On a real machine with limiting the 'dd' on a single node and reading 100
    GiB sparse file (less than a single node).  Just ran a single instance to
    not cause the lru lock contention.  The cmdline used is "dd if=file-100GiB
    of=/dev/null bs=4k".  Ran the cmd 10 times with drop_caches in between and
    measured the time it took.
    
    Without patch: 56.64143 +- 0.672 sec
    
    With patches: 56.10 +- 0.21 sec
    
    [akpm@linux-foundation.org: move page_evictable() to internal.h]
    Fixes: 9c4e6b1a ("mm, mlock, vmscan: no more skipping pagevecs")
    Signed-off-by: default avatarYang Shi <yang.shi@linux.alibaba.com>
    Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
    Tested-by: default avatarShakeel Butt <shakeelb@google.com>
    Reviewed-by: default avatarShakeel Butt <shakeelb@google.com>
    Reviewed-by: default avatarMatthew Wilcox (Oracle) <willy@infradead.org>
    Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
    Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
    Link: http://lkml.kernel.org/r/1584500541-46817-1-git-send-email-yang.shi@linux.alibaba.comSigned-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
    1eb6234e
swap.h 21.3 KB