    mm: vmalloc: add va_alloc() helper · 38f6b9af
    Uladzislau Rezki (Sony) authored
    Patch series "Mitigate a vmap lock contention", v3.
    
    1. Motivation
    
    - Offload the global vmap lock, making it scale with the number of CPUs;

    - If possible and there is an agreement, we can remove the "Per cpu kva
      allocator" to make the vmap code simpler;

    - There were complaints from XFS folks that vmalloc might be contended
      on their workloads.
    
    2. Design(high level overview)
    
    We introduce an effective vmap node logic.  A node behaves as an
    independent entity and serves an allocation request directly (if
    possible) from its pool.  That way it bypasses the global vmap space
    that is protected by its own lock.

    Access to the pools is serialized per CPU.  The number of nodes is
    equal to the number of CPUs in the system.  Please note the upper
    threshold is bound to 128 nodes.
    
    Pools are size-segregated and populated based on system demand.  The
    maximum allocation request that can be stored into the segregated
    storage is 256 pages.  The lazy drain path decays a pool by 25% as a
    first step and, as a second step, populates it with freshly freed VAs
    for reuse instead of returning them to the global space.
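
    Below is a minimal sketch, in plain C, of what such a size-segregated
    per-node pool and its 25% decay step could look like.  The structure
    names, the per-size-class layout and the values are assumptions for
    illustration only, not the actual kernel data structures:

    #include <stdio.h>

    #define MAX_VA_SIZE_PAGES	256

    struct pool {
    	unsigned long len;			/* number of cached VAs of this size */
    };

    struct node {
    	struct pool pool[MAX_VA_SIZE_PAGES];	/* pool[i] caches (i + 1)-page VAs */
    };

    /* First step of the drain path: drop 25% of a pool's cached entries. */
    static void pool_decay(struct pool *p)
    {
    	unsigned long nr_to_drop = p->len >> 2;

    	p->len -= nr_to_drop;
    }

    int main(void)
    {
    	struct node n = { 0 };

    	n.pool[3].len = 100;	/* pretend 100 four-page VAs are cached */
    	pool_decay(&n.pool[3]);
    	printf("%lu\n", n.pool[3].len);	/* prints 75 */

    	return 0;
    }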
    
    When a VA is obtained (alloc path), it is stored in a separate node.  A
    va->va_start address is converted to the correct node where it should
    be placed and reside.  Doing so balances VAs across the nodes, and as a
    result access becomes scalable.  The addr_to_node() function does the
    proper address conversion to the correct node.
    
    The vmap space is divided into segments of a fixed size of 16 pages, so
    any address can be associated with a segment number.  The number of
    segments is equal to num_possible_cpus() but is not greater than 128.
    The numeration starts from 0.  See below how an address is converted:
    
    static inline unsigned int
    addr_to_node_id(unsigned long addr)
    {
    	return (addr / zone_size) % nr_nodes;
    }
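
    For illustration only, assuming a 4 KiB page size, a zone_size of 16
    pages and 64 nodes (all values hypothetical), the conversion behaves
    as in this self-contained example:

    #include <stdio.h>

    /* Illustrative values only: 16 pages of 4 KiB each, 64 nodes. */
    static const unsigned long zone_size = 16UL * 4096;
    static const unsigned int nr_nodes = 64;

    static inline unsigned int
    addr_to_node_id(unsigned long addr)
    {
    	return (addr / zone_size) % nr_nodes;
    }

    int main(void)
    {
    	unsigned long base = 0xffffc90000000000UL;

    	/* Two addresses within the same 16-page zone -> same node. */
    	printf("%u %u\n", addr_to_node_id(base), addr_to_node_id(base + 4096));

    	/* The next zone maps to the next node. */
    	printf("%u\n", addr_to_node_id(base + zone_size));

    	return 0;
    }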
    
    On the free path, a VA can easily be found by converting its "va_start"
    address to the node it resides in.  It is moved from the "busy" data
    structure to the "lazy" one.  Later on, as noted earlier, the lazy
    kworker decays each node's pool and populates it with fresh incoming
    VAs.  Please note that a VA is returned to the node that made the
    allocation request.
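
    A rough, self-contained sketch of that free-path flow follows; toy
    types and counters stand in for the real busy/lazy trees and lists,
    and all names here are assumptions, not the actual vmalloc.c ones:

    #include <stdio.h>

    #define NR_NODES	64
    #define ZONE_SIZE	(16UL * 4096)	/* 16 pages of 4 KiB, illustrative */

    struct node {
    	unsigned long nr_busy;	/* VAs currently allocated from this node */
    	unsigned long nr_lazy;	/* VAs queued for the lazy drain kworker */
    };

    static struct node nodes[NR_NODES];

    static unsigned int addr_to_node_id(unsigned long addr)
    {
    	return (addr / ZONE_SIZE) % NR_NODES;
    }

    /* Free path: va_start selects the owning node, and the VA moves from
     * that node's "busy" data to its "lazy" data for later processing. */
    static void free_va(unsigned long va_start)
    {
    	struct node *vn = &nodes[addr_to_node_id(va_start)];

    	vn->nr_busy--;
    	vn->nr_lazy++;
    }

    int main(void)
    {
    	unsigned long va_start = 0xffffc90000000000UL;

    	nodes[addr_to_node_id(va_start)].nr_busy = 1;
    	free_va(va_start);
    	printf("busy=%lu lazy=%lu\n",
    	       nodes[addr_to_node_id(va_start)].nr_busy,
    	       nodes[addr_to_node_id(va_start)].nr_lazy);

    	return 0;
    }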
    
    3. Test on AMD Ryzen Threadripper 3970X 32-Core Processor
    
    sudo ./test_vmalloc.sh run_test_mask=7 nr_threads=64
    
    <default perf>
     94.41%     0.89%  [kernel]        [k] _raw_spin_lock
     93.35%    93.07%  [kernel]        [k] native_queued_spin_lock_slowpath
     76.13%     0.28%  [kernel]        [k] __vmalloc_node_range
     72.96%     0.81%  [kernel]        [k] alloc_vmap_area
     56.94%     0.00%  [kernel]        [k] __get_vm_area_node
     41.95%     0.00%  [kernel]        [k] vmalloc
     37.15%     0.01%  [test_vmalloc]  [k] full_fit_alloc_test
     35.17%     0.00%  [kernel]        [k] ret_from_fork_asm
     35.17%     0.00%  [kernel]        [k] ret_from_fork
     35.17%     0.00%  [kernel]        [k] kthread
     35.08%     0.00%  [test_vmalloc]  [k] test_func
     34.45%     0.00%  [test_vmalloc]  [k] fix_size_alloc_test
     28.09%     0.01%  [test_vmalloc]  [k] long_busy_list_alloc_test
     23.53%     0.25%  [kernel]        [k] vfree.part.0
     21.72%     0.00%  [kernel]        [k] remove_vm_area
     20.08%     0.21%  [kernel]        [k] find_unlink_vmap_area
      2.34%     0.61%  [kernel]        [k] free_vmap_area_noflush
    <default perf>
       vs
    <patch-series perf>
     82.32%     0.22%  [test_vmalloc]  [k] long_busy_list_alloc_test
     63.36%     0.02%  [kernel]        [k] vmalloc
     63.34%     2.64%  [kernel]        [k] __vmalloc_node_range
     30.42%     4.46%  [kernel]        [k] vfree.part.0
     28.98%     2.51%  [kernel]        [k] __alloc_pages_bulk
     27.28%     0.19%  [kernel]        [k] __get_vm_area_node
     26.13%     1.50%  [kernel]        [k] alloc_vmap_area
     21.72%    21.67%  [kernel]        [k] clear_page_rep
     19.51%     2.43%  [kernel]        [k] _raw_spin_lock
     16.61%    16.51%  [kernel]        [k] native_queued_spin_lock_slowpath
     13.40%     2.07%  [kernel]        [k] free_unref_page
     10.62%     0.01%  [kernel]        [k] remove_vm_area
      9.02%     8.73%  [kernel]        [k] insert_vmap_area
      8.94%     0.00%  [kernel]        [k] ret_from_fork_asm
      8.94%     0.00%  [kernel]        [k] ret_from_fork
      8.94%     0.00%  [kernel]        [k] kthread
      8.29%     0.00%  [test_vmalloc]  [k] test_func
      7.81%     0.05%  [test_vmalloc]  [k] full_fit_alloc_test
      5.30%     4.73%  [kernel]        [k] purge_vmap_node
      4.47%     2.65%  [kernel]        [k] free_vmap_area_noflush
    <patch-series perf>
    
    confirms that native_queued_spin_lock_slowpath goes down to 16.51%
    from 93.07%.
    
    The throughput is ~12x higher:
    
    urezki@pc638:~$ time sudo ./test_vmalloc.sh run_test_mask=7 nr_threads=64
    Run the test with following parameters: run_test_mask=7 nr_threads=64
    Done.
    Check the kernel ring buffer to see the summary.
    
    real    10m51.271s
    user    0m0.013s
    sys     0m0.187s
    urezki@pc638:~$
    
    urezki@pc638:~$ time sudo ./test_vmalloc.sh run_test_mask=7 nr_threads=64
    Run the test with following parameters: run_test_mask=7 nr_threads=64
    Done.
    Check the kernel ring buffer to see the summary.
    
    real    0m51.301s
    user    0m0.015s
    sys     0m0.040s
    urezki@pc638:~$
    
    
    This patch (of 11):
    
    Currently the __alloc_vmap_area() function contains open-coded logic
    that finds and adjusts a VA based on an allocation request.

    Introduce a va_alloc() helper that adjusts the found VA only.  There is
    no functional change as a result of this patch.
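
    For orientation, a sketch of the refactor's shape: the "find" step
    stays in __alloc_vmap_area(), while the clamp/align/adjust of an
    already found VA moves into the new helper.  The body below follows
    the pre-existing vmalloc.c logic (ALIGN(), adjust_va_to_fit_type())
    but is only a sketch, not necessarily the literal patch code:

    static unsigned long
    va_alloc(struct vmap_area *va,
    		struct rb_root *root, struct list_head *head,
    		unsigned long size, unsigned long align,
    		unsigned long vstart, unsigned long vend)
    {
    	unsigned long nva_start_addr;
    	int ret;

    	if (va->va_start > vstart)
    		nva_start_addr = ALIGN(va->va_start, align);
    	else
    		nva_start_addr = ALIGN(vstart, align);

    	/* Check the "vend" restriction. */
    	if (nva_start_addr + size > vend)
    		return vend;

    	/* Update the free vmap_area. */
    	ret = adjust_va_to_fit_type(root, head, va, nva_start_addr, size);
    	if (WARN_ON_ONCE(ret))
    		return vend;

    	return nva_start_addr;
    }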
    
    Link: https://lkml.kernel.org/r/20240102184633.748113-1-urezki@gmail.com
    Link: https://lkml.kernel.org/r/20240102184633.748113-2-urezki@gmail.com
    Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
    Reviewed-by: Baoquan He <bhe@redhat.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com>
    Cc: Dave Chinner <david@fromorbit.com>
    Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
    Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
    Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sony.com>
    Cc: Paul E. McKenney <paulmck@kernel.org>
    Cc: Kazuhito Hagio <k-hagio-ab@nec.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>