• Yujie Liu's avatar
    sched/numa: Fix the vma scan starving issue · f22cde43
    Yujie Liu authored
    Problem statement:
    Since commit fc137c0d ("sched/numa: enhance vma scanning logic"), the
    Numa vma scan overhead has been reduced a lot.  Meanwhile, the reducing of
    the vma scan might create less Numa page fault information.  The
    insufficient information makes it harder for the Numa balancer to make
    decision.  Later, commit b7a5b537 ("sched/numa: Complete scanning of
    partial VMAs regardless of PID activity") and commit 84db47ca
    ("sched/numa: Fix mm numa_scan_seq based unconditional scan") are found to
    bring back part of the performance.
    
    Recently when running SPECcpu omnetpp_r on a 320 CPUs/2 Sockets system, a
    long duration of remote Numa node read was observed by PMU events: A few
    cores having ~500MB/s remote memory access for ~20 seconds.  It causes
    high core-to-core variance and performance penalty.  After the
    investigation, it is found that many vmas are skipped due to the active
    PID check.  According to the trace events, in most cases,
    vma_is_accessed() returns false because the history access info stored in
    pids_active array has been cleared.
    
    Proposal:
    The main idea is to adjust vma_is_accessed() to let it return true easier.
    Thus compare the diff between mm->numa_scan_seq and
    vma->numab_state->prev_scan_seq.  If the diff has exceeded the threshold,
    scan the vma.
    
    This patch especially helps the cases where there are small number of
    threads, like the process-based SPECcpu.  Without this patch, if the
    SPECcpu process access the vma at the beginning, then sleeps for a long
    time, the pid_active array will be cleared.  A a result, if this process
    is woken up again, it never has a chance to set prot_none anymore. 
    Because only the first 2 times of access is granted for vma scan:
    (current->mm->numa_scan_seq) - vma->numab_state->start_scan_seq) < 2 to be
    worse, no other threads within the task can help set the prot_none.  This
    causes information lost.
    
    Raghavendra helped test current patch and got the positive result
    on the AMD platform:
    
    autonumabench NUMA01
                                base                  patched
    Amean     syst-NUMA01      194.05 (   0.00%)      165.11 *  14.92%*
    Amean     elsp-NUMA01      324.86 (   0.00%)      315.58 *   2.86%*
    
    Duration User      380345.36   368252.04
    Duration System      1358.89     1156.23
    Duration Elapsed     2277.45     2213.25
    
    autonumabench NUMA02
    
    Amean     syst-NUMA02        1.12 (   0.00%)        1.09 *   2.93%*
    Amean     elsp-NUMA02        3.50 (   0.00%)        3.56 *  -1.84%*
    
    Duration User        1513.23     1575.48
    Duration System         8.33        8.13
    Duration Elapsed       28.59       29.71
    
    kernbench
    
    Amean     user-256    22935.42 (   0.00%)    22535.19 *   1.75%*
    Amean     syst-256     7284.16 (   0.00%)     7608.72 *  -4.46%*
    Amean     elsp-256      159.01 (   0.00%)      158.17 *   0.53%*
    
    Duration User       68816.41    67615.74
    Duration System     21873.94    22848.08
    Duration Elapsed      506.66      504.55
    
    Intel 256 CPUs/2 Sockets:
    autonuma benchmark also shows improvements:
    
                                                   v6.10-rc5              v6.10-rc5
                                                                             +patch
    Amean     syst-NUMA01                  245.85 (   0.00%)      230.84 *   6.11%*
    Amean     syst-NUMA01_THREADLOCAL      205.27 (   0.00%)      191.86 *   6.53%*
    Amean     syst-NUMA02                   18.57 (   0.00%)       18.09 *   2.58%*
    Amean     syst-NUMA02_SMT                2.63 (   0.00%)        2.54 *   3.47%*
    Amean     elsp-NUMA01                  517.17 (   0.00%)      526.34 *  -1.77%*
    Amean     elsp-NUMA01_THREADLOCAL       99.92 (   0.00%)      100.59 *  -0.67%*
    Amean     elsp-NUMA02                   15.81 (   0.00%)       15.72 *   0.59%*
    Amean     elsp-NUMA02_SMT               13.23 (   0.00%)       12.89 *   2.53%*
    
                       v6.10-rc5   v6.10-rc5
                                      +patch
    Duration User     1064010.16  1075416.23
    Duration System      3307.64     3104.66
    Duration Elapsed     4537.54     4604.73
    
    The SPECcpu remote node access issue disappears with the patch applied.
    
    Link: https://lkml.kernel.org/r/20240827112958.181388-1-yu.c.chen@intel.com
    Fixes: fc137c0d ("sched/numa: enhance vma scanning logic")
    Signed-off-by: default avatarChen Yu <yu.c.chen@intel.com>
    Co-developed-by: default avatarChen Yu <yu.c.chen@intel.com>
    Signed-off-by: default avatarYujie Liu <yujie.liu@intel.com>
    Reported-by: default avatarXiaoping Zhou <xiaoping.zhou@intel.com>
    Reviewed-and-tested-by: default avatarRaghavendra K T <raghavendra.kt@amd.com>
    Acked-by: default avatarMel Gorman <mgorman@techsingularity.net>
    Cc: "Chen, Tim C" <tim.c.chen@intel.com>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Juri Lelli <juri.lelli@redhat.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Raghavendra K T <raghavendra.kt@amd.com>
    Cc: Vincent Guittot <vincent.guittot@linaro.org>
    Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
    f22cde43
fair.c 355 KB