    vfs: add rcu-based find_inode variants for iget ops · 7180f8d9
    Mateusz Guzik authored
    This avoids one inode hash lock acquire in the common case on inode
    creation, in effect significantly reducing contention.
    
    On the stock kernel said lock is typically taken twice:
    1. once to check if the inode happens to already be present
    2. once to add it to the hash
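
    For illustration, a heavily simplified sketch of the stock iget_locked()
    flow (I_NEW handling, error paths and the wait-on-freeing logic omitted):

    struct inode *iget_locked(struct super_block *sb, unsigned long ino)
    {
            struct hlist_head *head = inode_hashtable + hash(sb, ino);
            struct inode *inode, *old;

            /* acquire 1: check whether the inode is already hashed */
            spin_lock(&inode_hash_lock);
            inode = find_inode_fast(sb, head, ino);
            spin_unlock(&inode_hash_lock);
            if (inode)
                    return inode;

            inode = alloc_inode(sb);

            /* acquire 2: recheck (the lock was dropped) and insert */
            spin_lock(&inode_hash_lock);
            old = find_inode_fast(sb, head, ino);
            if (!old) {
                    inode->i_ino = ino;
                    hlist_add_head_rcu(&inode->i_hash, head);
            }
            spin_unlock(&inode_hash_lock);

            return old ? old : inode;
    }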
    
    The back-to-back lock/unlock pattern is known to degrade performance
    significantly, which is further exacerbated if the hash is heavily
    populated (long chains to walk, extending hold time). Arguably hash
    sizing and hashing algo need to be revisited, but that's beyond the
    scope of this patch.
    
    With the acquire from step 1 eliminated thanks to an RCU-based lookup,
    throughput increases significantly at the scale of 20 cores (benchmark
    results at the bottom).
    
    As it happens, the hash already supports RCU-based operation, but lookups
    on inode insertion did not take advantage of it.
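
    Schematically, the step 1 probe now looks like this (not the exact code):

    rcu_read_lock();
    hlist_for_each_entry_rcu(inode, head, i_hash) {
            if (inode->i_ino == ino && inode->i_sb == sb) {
                    /* match: take inode->i_lock, check I_FREEING etc. */
                    break;
            }
    }
    rcu_read_unlock();

    /* only on a miss: allocate, then take inode_hash_lock once to recheck
     * and insert */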
    
    This of course has its limits, as the global lock is still a bottleneck.
    A patchset introducing fine-grained locking was posted[1], but it appears
    to have stalled. Apart from that, doubt was expressed whether a handrolled
    hash implementation is appropriate to begin with, with rhashtables
    suggested as a replacement. Nobody committed to carrying [1] across the
    finish line or implementing anything better, hence the bandaid below.
    
    iget_locked consumers (notably ext4) get away without any changes because
    the inode comparison method is built-in.
    
    iget5_locked consumers pass a custom test callback. Since removing the
    locking introduces new problems (the inode can be changing under the
    callback), it is not safe to assume all filesystems can cope. Thus
    iget5_locked_rcu gets added, requiring manual conversion of interested
    filesystems.
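
    For a filesystem whose test callback can tolerate being run against an
    inode which may be concurrently modified (or on its way out), conversion
    is just a matter of switching the call. A hypothetical example (the foo_*
    names are made up):

    static int foo_test(struct inode *inode, void *data)
    {
            /* must only rely on fields which stay valid under RCU */
            return FOO_I(inode)->objectid == *(u64 *)data;
    }

    /* before */
    inode = iget5_locked(sb, hashval, foo_test, foo_set, &objectid);
    /* after */
    inode = iget5_locked_rcu(sb, hashval, foo_test, foo_set, &objectid);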
    
    In order to reduce code duplication, find_inode and find_inode_fast grow
    an argument indicating whether the inode hash lock is held, which is
    passed down in case sleeping is necessary. They always take
    rcu_read_lock(), which is redundant when the lock is held but harmless;
    doing it conditionally would reduce readability for no real gain that I
    can see. RCU-like restrictions were already placed on the callbacks due
    to the hash spinlock being held.
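
    The resulting shape is roughly as follows (a sketch; details and the
    argument name are approximate, see the actual diff):

    static struct inode *find_inode_fast(struct super_block *sb,
                                         struct hlist_head *head,
                                         unsigned long ino,
                                         bool is_inode_hash_locked)
    {
            struct inode *inode;

            rcu_read_lock();        /* unconditional, harmless if redundant */
    repeat:
            hlist_for_each_entry_rcu(inode, head, i_hash) {
                    if (inode->i_ino != ino || inode->i_sb != sb)
                            continue;
                    spin_lock(&inode->i_lock);
                    if (inode->i_state & (I_FREEING | I_WILL_FREE)) {
                            /* the wait helper may sleep; the flag tells it
                             * whether the hash lock has to be dropped (and
                             * retaken) around the wait */
                            __wait_on_freeing_inode(inode, is_inode_hash_locked);
                            goto repeat;
                    }
                    __iget(inode);
                    spin_unlock(&inode->i_lock);
                    rcu_read_unlock();
                    return inode;
            }
            rcu_read_unlock();
            return NULL;
    }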
    
    Benchmarking:
    There is a real cache-busting workload scanning millions of files in
    parallel (it's a backup appliance), where the initial lookup is guaranteed
    to fail, resulting in the two lock acquires on a stock kernel (and one
    with the patch at hand).
    
    Implemented below is a synthetic benchmark providing the same behavior.
    [I shall note the real workload does not run on Linux; it was causing
    trouble elsewhere. The benchmark below was used while addressing said
    problems and was found to adequately represent the real workload.]
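
    The per-worker loop boils down to walking a directory tree and
    stat(2)-ing every entry, along the lines of the following (an
    illustration only, not the actual fstree/walktrees code):

    #define _XOPEN_SOURCE 500
    #include <ftw.h>
    #include <stdio.h>

    static int visit(const char *path, const struct stat *sb, int type,
                     struct FTW *ftw)
    {
            /* nftw() already (l)stat(2)s every entry for us */
            return 0;
    }

    int main(int argc, char **argv)
    {
            if (argc != 2) {
                    fprintf(stderr, "usage: %s <tree>\n", argv[0]);
                    return 1;
            }
            /* one such process per dedicated tree, many running in parallel */
            if (nftw(argv[1], visit, 128, FTW_PHYS) == -1) {
                    perror("nftw");
                    return 1;
            }
            return 0;
    }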
    
    Total real time fluctuates by 1-2s.
    
    With 20 threads each walking a dedicated 1000 dirs * 1000 files
    directory tree to stat(2) on a 32 core + 24GB RAM vm:
    
    ext4 (needed mkfs.ext4 -N 24000000):
    before: 3.77s user 890.90s system 1939% cpu 46.118 total
    after:  3.24s user 397.73s system 1858% cpu 21.581 total (-53%)
    
    That's 20 million files to visit, while the machine can only cache about
    15 million at a time (obtained from the ext4_inode_cache object count in
    /proc/slabinfo). Since each terminal inode is only visited once per run,
    this amounts to a 0% hit ratio for the dentry cache and the inode hash
    table (there are, however, hits for the intermediate directories).
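
    For reference, the object count can be read with something like:
    grep ext4_inode_cache /proc/slabinfo
    (the second column being the active object count).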
    
    On repeated runs the kernel caches the last ~15 million inodes, meaning
    there are ~5 million uncached inodes which are going to be visited first,
    evicting the previously cached state as they go.
    
    Lack of hits can be trivially verified with bpftrace, like so:
    bpftrace -e 'kretprobe:find_inode_fast { @[kstack(), retval != 0] = count(); }'\
    -c "/bin/sh walktrees /testfs 20"
    
    Best run more than once.
    
    Expected results after "warmup":
    [snip]
    @[
        __ext4_iget+275
        ext4_lookup+224
        __lookup_slow+130
        walk_component+219
        link_path_walk.part.0.constprop.0+614
        path_lookupat+62
        filename_lookup+204
        vfs_statx+128
        vfs_fstatat+131
        __do_sys_newfstatat+38
        do_syscall_64+87
        entry_SYSCALL_64_after_hwframe+118
    , 1]: 20000
    @[
        __ext4_iget+275
        ext4_lookup+224
        __lookup_slow+130
        walk_component+219
        path_lookupat+106
        filename_lookup+204
        vfs_statx+128
        vfs_fstatat+131
        __do_sys_newfstatat+38
        do_syscall_64+87
        entry_SYSCALL_64_after_hwframe+118
    , 1]: 20000000
    
    That is 20 million calls for the initial lookup and 20 million after
    allocating a new inode, all of them failing to return a value != 0
    (i.e., they are returning NULL -- no match found).
    
    Of course aborting the benchmark in the middle and starting it again (or
    messing with the state in other ways) is going to alter these results.
    
    Benchmark can be found here: https://people.freebsd.org/~mjg/fstree.tgz
    
    [1] https://lore.kernel.org/all/20231206060629.2827226-1-david@fromorbit.com/

    Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
    Link: https://lore.kernel.org/r/20240611173824.535995-2-mjguzik@gmail.com
    Reviewed-by: Jan Kara <jack@suse.cz>
    Signed-off-by: Christian Brauner <brauner@kernel.org>