    mm, pcp: avoid to drain PCP when process exit · ca71fe1a
    Huang Ying authored
    Patch series "mm: PCP high auto-tuning", v3.
    
    Different workloads often have different page allocation performance
    requirements.  So, we need to tune the PCP (Per-CPU Pageset) high mark
    on each CPU automatically to optimize page allocation performance.
    
    The list of patches in the series is as follows,
    
    [1/9] mm, pcp: avoid to drain PCP when process exit
    [2/9] cacheinfo: calculate per-CPU data cache size
    [3/9] mm, pcp: reduce lock contention for draining high-order pages
    [4/9] mm: restrict the pcp batch scale factor to avoid too long latency
    [5/9] mm, page_alloc: scale the number of pages that are batch allocated
    [6/9] mm: add framework for PCP high auto-tuning
    [7/9] mm: tune PCP high automatically
    [8/9] mm, pcp: decrease PCP high if free pages < high watermark
    [9/9] mm, pcp: reduce detecting time of consecutive high order page freeing
    
    Patches [1/9], [2/9], and [3/9] optimize PCP draining for consecutive
    high-order page freeing.
    
    Patches [4/9] and [5/9] optimize batch freeing and allocation.
    
    Patches [6/9], [7/9], and [8/9] implement and optimize a PCP high
    auto-tuning method.
    
    Patch [9/9] optimizes PCP draining for consecutive high-order page
    freeing based on PCP high auto-tuning.
    
    The test results for patches with performance impact are as follows,
    
    kbuild
    ======
    
    On a 2-socket Intel server with 224 logical CPUs, we run 8 kbuild
    instances in parallel (each with `make -j 28`) in 8 cgroups.  This
    simulates the kbuild server used by the 0-Day kbuild service.
    
    	build time   lock contend%	free_high	alloc_zone
    	----------	----------	---------	----------
    base	     100.0	      14.0          100.0            100.0
    patch1	      99.5	      12.8	     19.5	      95.6
    patch3	      99.4	      12.6	      7.1	      95.6
    patch5	      98.6	      11.0	      8.1	      97.1
    patch7	      95.1	       0.5	      2.8	      15.6
    patch9	      95.0	       1.0	      8.8	      20.0
    
    The PCP draining optimizations (patches [1/9] and [3/9]) and the PCP
    batch allocation optimization (patch [5/9]) reduce zone lock contention
    a little.  The PCP high auto-tuning (patches [7/9] and [9/9]) reduces
    build time visibly: the tuning target, the number of pages allocated
    from the zone, decreases greatly, so the zone lock contention cycles%
    decreases greatly too.
    
    With the PCP tuning patches (patches [7/9] and [9/9]), the average
    memory used during the test increases by up to 18.4% because more pages
    are cached in the PCP.  But at the end of the test, the amount of used
    memory decreases to the same level as with the base kernel.  That is,
    pages cached in the PCP are released back to the zone once they are no
    longer actively used.
    
    netperf SCTP_STREAM_MANY
    ========================
    
    On a 2-socket Intel server with 128 logical CPUs, we tested the
    SCTP_STREAM_MANY test case of the netperf test suite with 64 pairs of
    processes.
    
    	     score   lock contend%	free_high	alloc_zone  cache miss rate%
    	     -----	----------	---------	----------  ----------------
    base	     100.0	       2.1          100.0            100.0	         1.3
    patch1	      99.4	       2.1	     99.4	      99.4		 1.3
    patch3	     106.4	       1.3	     13.3	     106.3		 1.3
    patch5	     106.0	       1.2	     13.2	     105.9		 1.3
    patch7	     103.4	       1.9	      6.7	      90.3		 7.6
    patch9	     108.6	       1.3	     13.7	     108.6		 1.3
    
    The PCP draining optimizations (patches [1/9] and [3/9]) improve
    performance.  The PCP high auto-tuning (patch [7/9]) reduces performance
    a little because PCP draining sometimes cannot be triggered in time, so
    the cache miss rate% increases.  The further PCP draining optimization
    (patch [9/9]) based on PCP tuning restores the performance.
    
    lmbench3 UNIX (AF_UNIX)
    =======================
    
    On a 2-socket Intel server with 128 logical CPUs, we tested the UNIX
    (AF_UNIX socket) test case of the lmbench3 test suite with 16 pairs of
    processes.
    
    	     score   lock contend%	free_high	alloc_zone  cache miss rate%
    	     -----	----------	---------	----------  ----------------
    base	     100.0	      51.4          100.0            100.0	         0.2
    patch1	     116.8	      46.1           69.5	     104.3	         0.2
    patch3	     199.1	      21.3            7.0	     104.9	         0.2
    patch5	     200.0	      20.8            7.1	     106.9	         0.3
    patch7	     191.6	      19.9            6.8	     103.8	         2.8
    patch9	     193.4	      21.7            7.0	     104.7	         2.1
    
    The PCP draining optimizations (patches [1/9] and [3/9]) improve
    performance greatly.  The PCP tuning (patch [7/9]) reduces performance a
    little because PCP draining sometimes cannot be triggered in time.  The
    further PCP draining optimization (patch [9/9]) based on PCP tuning
    partly restores the performance.
    
    The patchset adds several fields to struct per_cpu_pages.  The struct
    layout before and after the patchset is as follows,
    
    base
    ====
    
    struct per_cpu_pages {
    	spinlock_t                 lock;                 /*     0     4 */
    	int                        count;                /*     4     4 */
    	int                        high;                 /*     8     4 */
    	int                        batch;                /*    12     4 */
    	short int                  free_factor;          /*    16     2 */
    	short int                  expire;               /*    18     2 */
    
    	/* XXX 4 bytes hole, try to pack */
    
    	struct list_head           lists[13];            /*    24   208 */
    
    	/* size: 256, cachelines: 4, members: 7 */
    	/* sum members: 228, holes: 1, sum holes: 4 */
    	/* padding: 24 */
    } __attribute__((__aligned__(64)));
    
    patched
    =======
    
    struct per_cpu_pages {
    	spinlock_t                 lock;                 /*     0     4 */
    	int                        count;                /*     4     4 */
    	int                        high;                 /*     8     4 */
    	int                        high_min;             /*    12     4 */
    	int                        high_max;             /*    16     4 */
    	int                        batch;                /*    20     4 */
    	u8                         flags;                /*    24     1 */
    	u8                         alloc_factor;         /*    25     1 */
    	u8                         expire;               /*    26     1 */
    
    	/* XXX 1 byte hole, try to pack */
    
    	short int                  free_count;           /*    28     2 */
    
    	/* XXX 2 bytes hole, try to pack */
    
    	struct list_head           lists[13];            /*    32   208 */
    
    	/* size: 256, cachelines: 4, members: 11 */
    	/* sum members: 237, holes: 2, sum holes: 3 */
    	/* padding: 16 */
    } __attribute__((__aligned__(64)));
    
    The size of the struct doesn't change with the patchset.
    
    
    This patch (of 9):
    
    In commit f26b3fa0 ("mm/page_alloc: limit number of high-order pages
    on PCP during bulk free"), the PCP (Per-CPU Pageset) is drained when the
    PCP is mostly used for freeing high-order pages, to improve the reuse of
    cache-hot pages between the page allocating and freeing CPUs.
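    That pre-existing heuristic can be sketched as a simplified user-space
    model.  The identifiers below (pcp_model, free_triggers_drain) are
    illustrative, not the kernel's actual names; the point is only the
    shape of the condition: while frees dominate allocations, any single
    high-order free marks the PCP for draining.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Simplified user-space model of the f26b3fa0 heuristic; names and
 * fields are illustrative, not the exact kernel code.
 */
struct pcp_model {
	int free_factor;	/* > 0 when frees dominate allocations */
};

/* Returns true when this free should drain the PCP back to the zone. */
static bool free_triggers_drain(struct pcp_model *pcp, unsigned int order)
{
	/* A single high-order free is enough to trigger a drain. */
	return order > 0 && pcp->free_factor > 0;
}
```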
    
    But the PCP draining mechanism may be triggered unexpectedly when a
    process exits.  With a customized tracepoint, it was found that PCP
    draining (free_high == true) was triggered by an order-1 page free with
    the following call stack,
    
     => free_unref_page_commit
     => free_unref_page
     => __mmdrop
     => exit_mm
     => do_exit
     => do_group_exit
     => __x64_sys_exit_group
     => do_syscall_64
    
    Checking the source code, this is the freeing of the page table PGD
    (mm_free_pgd()).  It is an order-1 page free if
    CONFIG_PAGE_TABLE_ISOLATION=y, which is a common configuration for
    security.
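    The x86 PGD allocation order depends on that config option, roughly as
    in the following fragment (paraphrased from arch/x86 for context):

```c
/*
 * With page table isolation (PTI), the kernel and user PGDs are kept
 * together, so the PGD becomes an order-1 allocation; order-0 otherwise.
 */
#ifdef CONFIG_PAGE_TABLE_ISOLATION
#define PGD_ALLOCATION_ORDER 1
#else
#define PGD_ALLOCATION_ORDER 0
#endif
```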
    
    Just before that, page freeing with the following call stack was found,
    
     => free_unref_page_commit
     => free_unref_page_list
     => release_pages
     => tlb_batch_pages_flush
     => tlb_finish_mmu
     => exit_mmap
     => __mmput
     => exit_mm
     => do_exit
     => do_group_exit
     => __x64_sys_exit_group
     => do_syscall_64
    
    So, when a process exits,
    
    - a large number of the process's user pages are freed without page
      allocation; it is highly likely that pcp->free_factor becomes > 0.
      In fact, this is expected behavior that improves process exit
      performance.
    
    - after all user pages are freed, the PGD is freed, which is an
      order-1 page free, so the PCP is drained.
    
    All in all, when a process exits, it is highly likely that the PCP
    will be drained.  This is an unexpected behavior.
    
    To avoid this, with this patch, PCP draining is only triggered by 2
    consecutive high-order page frees.
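    The patched behavior can be sketched with the same simplified
    user-space model (identifiers remain illustrative; the kernel records
    whether the previous free was high-order in a per-PCP flag).  A lone
    high-order free, such as the PGD freed at process exit, no longer
    drains the PCP:

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified user-space model of the patched behavior; illustrative names. */
struct pcp_model {
	int free_factor;	/* > 0 when frees dominate allocations */
	bool last_free_high;	/* previous free was high-order */
};

static bool free_triggers_drain(struct pcp_model *pcp, unsigned int order)
{
	bool drain = false;

	if (order > 0) {
		/* Drain only on the second consecutive high-order free. */
		drain = pcp->last_free_high && pcp->free_factor > 0;
		pcp->last_free_high = true;
	} else {
		/* Any order-0 free breaks the consecutive run. */
		pcp->last_free_high = false;
	}
	return drain;
}
```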
    
    On a 2-socket Intel server with 224 logical CPUs, we run 8 kbuild
    instances in parallel (each with `make -j 28`) in 8 cgroups.  This
    simulates the kbuild server used by the 0-Day kbuild service.  With the
    patch, the cycles% of spinlock contention (mostly on the zone lock)
    decreases from 14.0% to 12.8% (with PCP size == 367).  The number of PCP
    drains for high-order page freeing (free_high) decreases by 80.5%.
    
    The reduced zone lock contention helps network workloads too.  On a
    2-socket Intel server with 128 logical CPUs, with the patch, the network
    bandwidth of the UNIX (AF_UNIX) test case of the lmbench test suite with
    16 pairs of processes increases by 16.8%.  The cycles% of spinlock
    contention (mostly on the zone lock) decreases from 51.4% to 46.1%.  The
    number of PCP drains for high-order page freeing (free_high) decreases
    by 30.5%.  The cache miss rate stays at 0.2%.
    
    Link: https://lkml.kernel.org/r/20231016053002.756205-1-ying.huang@intel.com
    Link: https://lkml.kernel.org/r/20231016053002.756205-2-ying.huang@intel.com
    Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
    Acked-by: Mel Gorman <mgorman@techsingularity.net>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Johannes Weiner <jweiner@redhat.com>
    Cc: Dave Hansen <dave.hansen@linux.intel.com>
    Cc: Michal Hocko <mhocko@suse.com>
    Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Christoph Lameter <cl@linux.com>
    Cc: Arjan van de Ven <arjan@linux.intel.com>
    Cc: Sudeep Holla <sudeep.holla@arm.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>