    workingset: refactor LRU refault to expose refault recency check · ffcb5f52
    Nhat Pham authored
    Patch series "cachestat: a new syscall for page cache state of files",
    v13.
    
    There is currently no good way to query the page cache statistics of large
    files and directory trees.  There is mincore(), but it scales poorly: the
    kernel writes out a lot of bitmap data that userspace has to aggregate,
    when the user really does not care about per-page information in that
    case.  The user also needs to mmap and unmap each file as it goes along,
    which can be quite slow as well.
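
    The mincore() pattern described above can be sketched as follows.  This
    is a minimal illustration, not code from the series: the helper name
    count_resident_mincore() is hypothetical, and error handling is kept
    short.  Note the mmap/munmap pair per file and the per-page output
    vector that userspace has to walk just to get one number.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Count the resident page cache pages of a file using mincore().
 * Returns the number of resident pages, or -1 on error; *total_pages
 * receives the file size in pages.  Hypothetical helper for illustration.
 */
long count_resident_mincore(const char *path, long *total_pages)
{
	long page = sysconf(_SC_PAGESIZE);
	struct stat st;
	long resident = 0;
	int fd;

	assert(page > 0);
	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;
	if (fstat(fd, &st) < 0) {
		close(fd);
		return -1;
	}
	*total_pages = (long)((st.st_size + page - 1) / page);
	if (st.st_size == 0) {
		close(fd);
		return 0;
	}

	/* The file has to be mapped before mincore() can inspect it. */
	void *map = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_SHARED, fd, 0);
	close(fd);
	if (map == MAP_FAILED)
		return -1;

	/* mincore() fills in one byte per page; bit 0 means "resident". */
	unsigned char *vec = malloc((size_t)*total_pages);
	if (!vec || mincore(map, (size_t)st.st_size, vec) < 0) {
		free(vec);
		munmap(map, (size_t)st.st_size);
		return -1;
	}
	for (long i = 0; i < *total_pages; i++)
		resident += vec[i] & 1;

	free(vec);
	munmap(map, (size_t)st.st_size);
	return resident;
}
```

    For a multi-terabyte file the output vector alone has hundreds of
    millions of entries, which is the scaling problem described above.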
    
    Some use cases where this information could come in handy:
      * Allowing a database to decide whether to perform an index scan or direct
        table queries based on the in-memory cache state of the index.
      * Visibility into the writeback algorithm, for diagnosing performance
        issues.
      * Workload-aware writeback pacing: estimating IO fulfilled by page cache
        (and IO to be done) within a range of a file, allowing for more
        frequent syncing when and where there is IO capacity, and batching
        when there is not.
      * Computing memory usage of large files/directory trees, analogous to
        the du tool for disk usage.
    
    More information about these use cases can be found in this thread:
    https://lore.kernel.org/lkml/20230315170934.GA97793@cmpxchg.org/
    
    This series of patches introduces a new system call, cachestat, that
    summarizes the page cache statistics (number of cached pages, dirty pages,
    pages marked for writeback, evicted pages etc.) of a file, in a specified
    range of bytes.  It also includes a selftest suite that tests some typical
    usage.  Currently, the syscall is only wired in for the x86 architecture.
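
    A minimal sketch of how userspace might invoke the syscall through
    syscall(2).  Several things here are assumptions rather than part of
    this message: the two structures are local mirrors (suffixed _sketch)
    of the uapi layout as it was eventually merged, the fallback syscall
    number 451 is the x86-64 value, a length of 0 is taken to mean "to the
    end of the file", and cachestat_file() is a hypothetical helper name.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Assumed x86-64 syscall number; prefer the value from the system headers. */
#ifndef __NR_cachestat
#define __NR_cachestat 451
#endif

/* Local mirror of the byte range argument (assumed layout). */
struct cachestat_range_sketch {
	uint64_t off;	/* start offset in bytes */
	uint64_t len;	/* length in bytes; 0 is assumed to mean "to EOF" */
};

/* Local mirror of the result structure (assumed layout). */
struct cachestat_sketch {
	uint64_t nr_cache;		/* pages in the page cache */
	uint64_t nr_dirty;		/* dirty pages */
	uint64_t nr_writeback;		/* pages under writeback */
	uint64_t nr_evicted;		/* evicted pages */
	uint64_t nr_recently_evicted;	/* recently evicted pages */
};

/*
 * Query the page cache state of a whole file.
 * Returns 0 on success, -1 on failure with errno set (for example
 * ENOSYS on kernels that do not provide the syscall).
 */
int cachestat_file(const char *path, struct cachestat_sketch *cs)
{
	struct cachestat_range_sketch range = { .off = 0, .len = 0 };
	int fd, rc, saved_errno;

	assert(cs != NULL);
	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;

	/* No mmap and no per-page vector: the kernel does the aggregation. */
	rc = (int)syscall(__NR_cachestat, fd, &range, cs, 0U);

	saved_errno = errno;
	close(fd);
	errno = saved_errno;
	return rc;
}
```

    A caller should be prepared for ENOSYS and fall back to another method
    when the running kernel does not implement the syscall.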
    
    This interface is inspired by past discussion of, and concerns with,
    fincore, which has a design similar to mincore's (and, as a result,
    similar issues).  Relevant links:
    
    https://lkml.indiana.edu/hypermail/linux/kernel/1302.1/04207.html
    https://lkml.indiana.edu/hypermail/linux/kernel/1302.1/04209.html
    
    
    I have also developed a small tool that computes the memory usage of files
    and directories, analogous to the du utility.  The user can choose between
    mincore and cachestat (with cachestat exporting more information than
    mincore).  To compare the performance of these two options, I benchmarked
    the tool on the root directory of a Meta server machine, five runs
    each:
    
    Using cachestat:
    real -- Median: 33.377s, Average: 33.475s, Standard Deviation: 0.3602
    user -- Median: 4.08s, Average: 4.1078s, Standard Deviation: 0.0742
    sys -- Median: 28.823s, Average: 28.8866s, Standard Deviation: 0.2689
    
    Using mincore:
    real -- Median: 102.352s, Average: 102.3442s, Standard Deviation: 0.2059
    user -- Median: 10.149s, Average: 10.1482s, Standard Deviation: 0.0162
    sys -- Median: 91.186s, Average: 91.2084s, Standard Deviation: 0.2046
    
    I also ran both syscalls on a 2TB sparse file:
    
    Using cachestat:
    real    0m0.009s
    user    0m0.000s
    sys     0m0.009s
    
    Using mincore:
    real    0m37.510s
    user    0m2.934s
    sys     0m34.558s
    
    Very large files like this are the pathological case for mincore.  In
    fact, to compute the stats for a single 2TB file, mincore takes as long as
    cachestat takes to compute the stats for the entire tree!  This could
    easily happen inadvertently when we run it on subdirectories.  Mincore is
    clearly not suitable for a general-purpose command line tool.
    
    Regarding security concerns, cachestat() should not pose any additional
    issues.  The caller already has read permission to the file itself (since
    they need an fd to that file to call cachestat).  This means that the
    caller can access the underlying data in its entirety, which is a much
    greater source of information (and as a result, a much greater security
    risk) than the cache status itself.
    
    The latest API change (in v13 of the patch series) was suggested by Jens
    Axboe.  It allows for a 64-bit length argument, even on 32-bit
    architectures (which was previously not possible due to the limit on the
    number of syscall arguments).  Furthermore, it eliminates the need for
    compatibility handling - every user can use the same ABI.
    
    
    This patch (of 4):
    
    In preparation for computing recently evicted pages in cachestat, refactor
    workingset_refault and lru_gen_refault to expose a helper function that
    tests whether an evicted page was evicted recently.
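
    As a rough illustration of what such a recency test can look like, the
    sketch below models a refault-distance comparison: a page counts as
    recently evicted if the number of evictions since it was dropped does
    not exceed the workingset size.  This is a simplified userspace model
    under stated assumptions, not the helper added by this patch; the
    function name, the parameters, and the fixed-width timestamp are all
    hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Simplified model of a recency check.
 *
 * eviction:        value of an eviction counter when the page was evicted
 * now:             current value of the same counter
 * workingset_size: number of pages considered part of the workingset
 * bits:            width of the stored timestamp; the counter wraps
 *                  around at 2^bits (hypothetical parameter)
 */
bool evicted_recently(uint64_t eviction, uint64_t now,
		      uint64_t workingset_size, unsigned int bits)
{
	assert(bits > 0 && bits < 64);

	uint64_t mask = (UINT64_C(1) << bits) - 1;

	/* Masked unsigned subtraction stays correct across counter wraparound. */
	uint64_t distance = (now - eviction) & mask;

	return distance <= workingset_size;
}
```

    Exposing the test as a standalone predicate is what lets another
    caller, such as cachestat, classify an evicted page without going
    through the full refault path.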
    
    [penguin-kernel@I-love.SAKURA.ne.jp: add missing rcu_read_unlock() in lru_gen_refault()]
      Link: https://lkml.kernel.org/r/610781bc-cf11-fc89-a46f-87cb8235d439@I-love.SAKURA.ne.jp
    Link: https://lkml.kernel.org/r/20230503013608.2431726-1-nphamcs@gmail.com
    Link: https://lkml.kernel.org/r/20230503013608.2431726-2-nphamcs@gmail.com
    Signed-off-by: Nhat Pham <nphamcs@gmail.com>
    Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
    Acked-by: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Brian Foster <bfoster@redhat.com>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: Michael Kerrisk <mtk.manpages@gmail.com>
    Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>