    rcu: Support kfree_bulk() interface in kfree_rcu() · 34c88174
    Uladzislau Rezki (Sony) authored
    The kfree_rcu() logic can be improved further by using the
    kfree_bulk() interface along with the "basic batching support"
    introduced earlier.
    
    There are at least two advantages of using the "bulk" interface:
    - with a large number of kfree_rcu() requests, kfree_bulk()
      reduces the per-object overhead of calling kfree() once
      per object.
    
    - it reduces the number of cache misses caused by "pointer
      chasing" between objects that can be spread far apart in memory.
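The two advantages above can be illustrated with a small userspace model. This is only a sketch, not the kernel code: `model_kfree()` and `model_kfree_bulk()` are hypothetical stand-ins for kfree() and kfree_bulk(size, p), and `struct obj` is an assumed object layout with an embedded link. It shows one free call per object on the list path versus one call per pointer array on the bulk path.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical object with an embedded link, standing in for a
 * structure that carries an rcu_head. */
struct obj {
	struct obj *next;
	char payload[56];
};

/* Counters that record how often each free entry point is called. */
static unsigned long free_calls;
static unsigned long bulk_calls;

/* Stand-in for kfree(): invoked once per object. */
static void model_kfree(void *p)
{
	free_calls++;
	free(p);
}

/* Stand-in for kfree_bulk(size, p): invoked once per pointer array. */
static void model_kfree_bulk(size_t size, void **p)
{
	bulk_calls++;
	for (size_t i = 0; i < size; i++)
		free(p[i]);
}

/* Per-object path: every step reads ->next from the object itself,
 * so objects scattered in memory are touched one by one. */
static void drain_list(struct obj *head)
{
	while (head) {
		struct obj *next = head->next;

		model_kfree(head);
		head = next;
	}
}

/* Bulk path: the pointers sit in one contiguous array, so the walk
 * is sequential and the objects are not dereferenced here. */
static void drain_array(void **records, size_t nr)
{
	model_kfree_bulk(nr, records);
}
```

Draining N objects through `drain_list()` costs N calls to the free routine, while `drain_array()` hands the same N pointers over in a single call.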
    
    This approach defines a new kfree_rcu_bulk_data structure that
    stores pointers in a fixed-size array. The number of entries in
    that array depends on PAGE_SIZE, so that the kfree_rcu_bulk_data
    structure is exactly one page in size.
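A minimal userspace sketch of such a page-sized block follows. The names (`model_bulk_data`, `MODEL_BULK_MAX_ENTR`, `model_bulk_add()`), the 4 KiB page size, and the two bookkeeping fields are assumptions made for illustration. The structure in the patch may carry different fields. The point is only that the array length is derived from the page size minus the bookkeeping slots.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumption: 4 KiB pages. The kernel uses PAGE_SIZE instead. */
#define MODEL_PAGE_SIZE 4096UL

/* Two pointer-sized slots are reserved for bookkeeping
 * (nr_records and next); the rest of the page holds pointers. */
#define MODEL_BULK_MAX_ENTR ((MODEL_PAGE_SIZE / sizeof(void *)) - 2)

struct model_bulk_data {
	unsigned long nr_records;              /* used slots in records[] */
	struct model_bulk_data *next;          /* next block in the chain */
	void *records[MODEL_BULK_MAX_ENTR];    /* pointers to be freed */
};

/* The whole block is meant to occupy exactly one page. */
static_assert(sizeof(struct model_bulk_data) == MODEL_PAGE_SIZE,
	      "block must be exactly one page");

/* Store one pointer; returns false when the block is already full. */
static bool model_bulk_add(struct model_bulk_data *b, void *ptr)
{
	if (b->nr_records == MODEL_BULK_MAX_ENTR)
		return false;

	b->records[b->nr_records++] = ptr;
	return true;
}
```

Sizing the block to one page keeps each allocation to a single page and lets the full `records[]` array be passed to a bulk free routine in one call.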
    
    Since this is a "block-chain" technique, a new block has to be
    allocated dynamically whenever one is required. Memory is
    allocated with GFP_NOWAIT | __GFP_NOWARN flags, i.e. the
    allocation skips direct reclaim under low memory conditions to
    prevent stalling, and it fails silently under high memory pressure.
    
    The "emergency path" is kept for the case when the system runs
    out of memory. In that case objects are linked into a regular list.
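The block-chain allocation and the emergency path can be sketched together in userspace. All names here (`model_queue`, `model_alloc_block()`, `model_queue_add()`, the `simulate_oom` knob) are hypothetical, and `calloc()` merely stands in for a non-blocking allocation that is allowed to fail. This is a sketch of the idea, not the patch's implementation.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

#define MODEL_PAGE_SIZE 4096UL	/* assumption: 4 KiB pages */
#define MODEL_BULK_MAX_ENTR ((MODEL_PAGE_SIZE / sizeof(void *)) - 2)

struct model_bulk_data {
	unsigned long nr_records;
	struct model_bulk_data *next;
	void *records[MODEL_BULK_MAX_ENTR];
};

/* Hypothetical object with an embedded link for the emergency path. */
struct model_obj {
	struct model_obj *next;
};

struct model_queue {
	struct model_bulk_data *bhead;	/* newest block in the chain */
	struct model_obj *head;		/* regular list (emergency path) */
	bool simulate_oom;		/* test knob: force allocation failure */
};

/* Stand-in for a non-blocking allocation that may fail silently. */
static struct model_bulk_data *model_alloc_block(struct model_queue *q)
{
	if (q->simulate_oom)
		return NULL;

	return calloc(1, sizeof(struct model_bulk_data));
}

/*
 * Queue one object. Returns true if its pointer was stored in a
 * block, false if it had to be linked into the regular list.
 */
static bool model_queue_add(struct model_queue *q, struct model_obj *obj)
{
	struct model_bulk_data *b = q->bhead;

	/* A new block is required: none yet, or the current one is full. */
	if (!b || b->nr_records == MODEL_BULK_MAX_ENTR) {
		b = model_alloc_block(q);
		if (!b) {
			/* Emergency path: fall back to the linked list. */
			obj->next = q->head;
			q->head = obj;
			return false;
		}
		b->next = q->bhead;
		q->bhead = b;
	}

	b->records[b->nr_records++] = obj;
	return true;
}
```

Because the fallback only needs the link embedded in the object, queuing never depends on the allocation succeeding.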
    
    The rcuperf test was run to analyze this change in terms of
    memory consumption and kfree_bulk() throughput.
    
    1) Testing on the Intel(R) Xeon(R) W-2135 CPU @ 3.70GHz, 12xCPUs
    with the following parameters:
    
    kfree_loops=200000 kfree_alloc_num=1000 kfree_rcu_test=1 kfree_vary_obj_size=1
    dev.2020.01.10a branch
    
    Default / CONFIG_SLAB
    53607352517 ns, loops: 200000, batches: 1885, memory footprint: 1248MB
    53529637912 ns, loops: 200000, batches: 1921, memory footprint: 1193MB
    53570175705 ns, loops: 200000, batches: 1929, memory footprint: 1250MB
    
    Patch / CONFIG_SLAB
    23981587315 ns, loops: 200000, batches: 810, memory footprint: 1219MB
    23879375281 ns, loops: 200000, batches: 822, memory footprint: 1190MB
    24086841707 ns, loops: 200000, batches: 794, memory footprint: 1380MB
    
    Default / CONFIG_SLUB
    51291025022 ns, loops: 200000, batches: 1713, memory footprint: 741MB
    51278911477 ns, loops: 200000, batches: 1671, memory footprint: 719MB
    51256183045 ns, loops: 200000, batches: 1719, memory footprint: 647MB
    
    Patch / CONFIG_SLUB
    50709919132 ns, loops: 200000, batches: 1618, memory footprint: 456MB
    50736297452 ns, loops: 200000, batches: 1633, memory footprint: 507MB
    50660403893 ns, loops: 200000, batches: 1628, memory footprint: 429MB
    
    In the case of CONFIG_SLAB, performance roughly doubles (the run
    time drops from about 53.6 s to about 24.0 s) with slightly higher
    memory usage. As for CONFIG_SLUB, the run time is marginally
    better (about 1%) together with lower memory usage.
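The ratios behind this summary can be recomputed from the run times listed above. The helpers `mean3()` and `speedup()` are hypothetical names introduced for this check. The inputs are the six CONFIG_SLAB and six CONFIG_SLUB timings from the Xeon runs, averaged per configuration.

```c
#include <stdint.h>

/* Run times in nanoseconds, copied from the Xeon W-2135 results. */
static const uint64_t slab_default[3] = {
	53607352517ULL, 53529637912ULL, 53570175705ULL
};
static const uint64_t slab_patch[3] = {
	23981587315ULL, 23879375281ULL, 24086841707ULL
};
static const uint64_t slub_default[3] = {
	51291025022ULL, 51278911477ULL, 51256183045ULL
};
static const uint64_t slub_patch[3] = {
	50709919132ULL, 50736297452ULL, 50660403893ULL
};

/* Arithmetic mean of three run times. */
static double mean3(const uint64_t ns[3])
{
	return ((double)ns[0] + (double)ns[1] + (double)ns[2]) / 3.0;
}

/* Ratio of mean baseline time to mean patched time (>1 is faster). */
static double speedup(const uint64_t base[3], const uint64_t patched[3])
{
	return mean3(base) / mean3(patched);
}
```

With these inputs, `speedup(slab_default, slab_patch)` comes out a little above 2.2, and `speedup(slub_default, slub_patch)` comes out just above 1.01.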
    
    2) Testing on the HiKey-960, arm64, 8xCPUs with below parameters:
    
    CONFIG_SLAB=y
    kfree_loops=200000 kfree_alloc_num=1000 kfree_rcu_test=1
    
    102898760401 ns, loops: 200000, batches: 5822, memory footprint: 158MB
    89947009882  ns, loops: 200000, batches: 6715, memory footprint: 115MB
    
    rcuperf shows approximately 12% better throughput when using the
    "bulk" interface. The "drain logic", or its RCU callback, does
    the work faster, which leads to better throughput.
    
    Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
    Tested-by: Joel Fernandes (Google) <joel@joelfernandes.org>
    Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
tree.c 125 KB