• Nick Piggin's avatar
    fs: scale files_lock · 6416ccb7
    Nick Piggin authored
    fs: scale files_lock
    
    Improve scalability of files_lock by adding per-cpu, per-sb files lists,
    protected with an lglock. The lglock provides fast access to the per-cpu lists
    to add and remove files. It also provides a snapshot of all the per-cpu lists
    (although this is very slow).
    
    One difficulty with this approach is that a file can be removed from the list
    by another CPU. We must track which per-cpu list the file is on with a new
    variale in the file struct (packed into a hole on 64-bit archs). Scalability
    could suffer if files are frequently removed from different cpu's list.
    
    However loads with frequent removal of files imply short interval between
    adding and removing the files, and the scheduler attempts to avoid moving
    processes too far away. Also, even in the case of cross-CPU removal, the
    hardware has much more opportunity to parallelise cacheline transfers with N
    cachelines than with 1.
    
    A worst-case test of 1 CPU allocating files subsequently being freed by N CPUs
    degenerates to contending on a single lock, which is no worse than before. When
    more than one CPU are allocating files, even if they are always freed by
    different CPUs, there will be more parallelism than the single-lock case.
    
    Testing results:
    
    On a 2 socket, 8 core opteron, I measure the number of times the lock is taken
    to remove the file, the number of times it is removed by the same CPU that
    added it, and the number of times it is removed by the same node that added it.
    
    Booting:    locks=  25049 cpu-hits=  23174 (92.5%) node-hits=  23945 (95.6%)
    kbuild -j16 locks=2281913 cpu-hits=2208126 (96.8%) node-hits=2252674 (98.7%)
    dbench 64   locks=4306582 cpu-hits=4287247 (99.6%) node-hits=4299527 (99.8%)
    
    So a file is removed from the same CPU it was added by over 90% of the time.
    It remains within the same node 95% of the time.
    
    Tim Chen ran some numbers for a 64 thread Nehalem system performing a compile.
    
                    throughput
    2.6.34-rc2      24.5
    +patch          24.9
    
                    us      sys     idle    IO wait (in %)
    2.6.34-rc2      51.25   28.25   17.25   3.25
    +patch          53.75   18.5    19      8.75
    
    So significantly less CPU time spent in kernel code, higher idle time and
    slightly higher throughput.
    
    Single threaded performance difference was within the noise of microbenchmarks.
    That is not to say penalty does not exist, the code is larger and more memory
    accesses required so it will be slightly slower.
    
    Cc: linux-kernel@vger.kernel.org
    Cc: Tim Chen <tim.c.chen@linux.intel.com>
    Cc: Andi Kleen <ak@linux.intel.com>
    Signed-off-by: default avatarNick Piggin <npiggin@kernel.dk>
    Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
    6416ccb7
super.c 25.5 KB