    sched/fair: Consider SD_NUMA when selecting the most idle group to schedule on
    find_idlest_group() compares a local group with each other group to select
    the one that is most idle. When comparing groups in different NUMA domains,
    a very slight imbalance is enough to select a remote NUMA node even if the
    runnable load on both groups is 0 or close to 0. This ignores the cost of
    remote accesses entirely and is a problem when selecting the CPU a newly
    forked task will run on. On a forking server, each child is almost
    guaranteed to start on a remote node, incurring numerous remote accesses
    and potentially causing automatic NUMA balancing to try to migrate the
    task back or migrate its data to another node. Similar weirdness is
    observed if a basic shell command pipes output to another, as each process
    in the pipeline is likely to start on a different node and then get
    adjusted later by wake_affine().
    
    This patch adds imbalance to the remote domain's load when considering
    whether to select CPUs from remote domains. If the local domain is
    selected, imbalance will still be used to try to select a CPU from a
    lower scheduler domain's group instead of stacking tasks on the same CPU.
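
    As a rough, self-contained illustration of that decision, the toy C
    program below models the comparison with plain integers. It is not the
    kernel code: the struct, the helper name keep_local(), and the 1024 load
    unit are assumptions made for this sketch; the real logic lives in
    find_idlest_group() in kernel/sched/fair.c and works on scaled load
    metrics.

    #include <stdbool.h>
    #include <stdio.h>

    /* Toy model only: names, fields and the 1024 load unit are assumptions. */
    struct group {
            unsigned long runnable_load;
    };

    /* Return true to stay on the local group, false to pick the remote one. */
    static bool keep_local(const struct group *local, const struct group *idlest,
                           unsigned int imbalance_pct, bool numa_domain)
    {
            /* Allowed slack before a remote group is considered better. */
            unsigned long imbalance = 1024 * (imbalance_pct - 100) / 100;

            /*
             * Old behaviour: stay local only if the remote group is clearly
             * more loaded, so with both groups near idle the remote group
             * wins and a freshly forked task lands on another NUMA node.
             */
            if (idlest->runnable_load > local->runnable_load + imbalance)
                    return true;

            /*
             * The idea described above: across NUMA nodes, add the imbalance
             * to the remote group's load as well, so a (nearly) idle local
             * node is preferred over a marginally less loaded remote one.
             */
            if (numa_domain &&
                idlest->runnable_load + imbalance >= local->runnable_load)
                    return true;

            return false;
    }

    int main(void)
    {
            struct group local = { .runnable_load = 10 };
            struct group remote = { .runnable_load = 0 };

            /* Both groups nearly idle; only the NUMA-aware check stays local. */
            printf("UMA domain:  stay local? %d\n",
                   keep_local(&local, &remote, 125, false));
            printf("NUMA domain: stay local? %d\n",
                   keep_local(&local, &remote, 125, true));
            return 0;
    }

    With both groups nearly idle, the plain comparison picks the marginally
    less loaded remote group, while the NUMA-aware check keeps the task on the
    local node; the 125 above merely stands in for a domain's imbalance_pct.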
    
    A variety of workloads and machines were tested and, as expected, there is
    no difference on UMA. The difference on NUMA can be dramatic. This is a
    comparison of elapsed times running the git regression test suite, which
    is fork-intensive with short-lived processes:
    
                                      4.15.0                 4.15.0
                                noexit-v1r23           sdnuma-v1r23
     Elapsed min          1706.06 (   0.00%)     1435.94 (  15.83%)
     Elapsed mean         1709.53 (   0.00%)     1436.98 (  15.94%)
     Elapsed stddev          2.16 (   0.00%)        1.01 (  53.38%)
     Elapsed coeffvar        0.13 (   0.00%)        0.07 (  44.54%)
     Elapsed max          1711.59 (   0.00%)     1438.01 (  15.98%)
    
                   4.15.0      4.15.0
             noexit-v1r23 sdnuma-v1r23
     User         5434.12     5188.41
     System       4878.77     3467.09
     Elapsed     10259.06     8624.21
    
    That shows a considerable reduction in elapsed times. It's important to
    note that automatic NUMA balancing does not affect this load as processes
    are too short-lived.
    
    There is also a noticeable impact on hackbench, such as this example using
    processes and pipes:
    
     hackbench-process-pipes
                                   4.15.0                 4.15.0
                             noexit-v1r23           sdnuma-v1r23
     Amean     1        1.0973 (   0.00%)      0.9393 (  14.40%)
     Amean     4        1.3427 (   0.00%)      1.3730 (  -2.26%)
     Amean     7        1.4233 (   0.00%)      1.6670 ( -17.12%)
     Amean     12       3.0250 (   0.00%)      3.3013 (  -9.13%)
     Amean     21       9.0860 (   0.00%)      9.5343 (  -4.93%)
     Amean     30      14.6547 (   0.00%)     13.2433 (   9.63%)
     Amean     48      22.5447 (   0.00%)     20.4303 (   9.38%)
     Amean     79      29.2010 (   0.00%)     26.7853 (   8.27%)
     Amean     110     36.7443 (   0.00%)     35.8453 (   2.45%)
     Amean     141     45.8533 (   0.00%)     42.6223 (   7.05%)
     Amean     172     55.1317 (   0.00%)     50.6473 (   8.13%)
     Amean     203     64.4420 (   0.00%)     58.3957 (   9.38%)
     Amean     234     73.2293 (   0.00%)     67.1047 (   8.36%)
     Amean     265     80.5220 (   0.00%)     75.7330 (   5.95%)
     Amean     296     88.7567 (   0.00%)     82.1533 (   7.44%)
    
    It's not a universal win, as there are occasions when spreading wide and
    quickly is a benefit, but it's more of a win than a loss. For other
    workloads there is little difference, but netperf is interesting. Without
    the patch, the server and client start on different nodes but quickly get
    migrated due to wake_affine(). Hence, the difference in overall
    performance is marginal but detectable:
    
                                          4.15.0                 4.15.0
                                    noexit-v1r23           sdnuma-v1r23
     Hmean     send-64         349.09 (   0.00%)      354.67 (   1.60%)
     Hmean     send-128        699.16 (   0.00%)      702.91 (   0.54%)
     Hmean     send-256       1316.34 (   0.00%)     1350.07 (   2.56%)
     Hmean     send-1024      5063.99 (   0.00%)     5124.38 (   1.19%)
     Hmean     send-2048      9705.19 (   0.00%)     9687.44 (  -0.18%)
     Hmean     send-3312     14359.48 (   0.00%)    14577.64 (   1.52%)
     Hmean     send-4096     16324.20 (   0.00%)    16393.62 (   0.43%)
     Hmean     send-8192     26112.61 (   0.00%)    26877.26 (   2.93%)
     Hmean     send-16384    37208.44 (   0.00%)    38683.43 (   3.96%)
     Hmean     recv-64         349.09 (   0.00%)      354.67 (   1.60%)
     Hmean     recv-128        699.16 (   0.00%)      702.91 (   0.54%)
     Hmean     recv-256       1316.34 (   0.00%)     1350.07 (   2.56%)
     Hmean     recv-1024      5063.99 (   0.00%)     5124.38 (   1.19%)
     Hmean     recv-2048      9705.16 (   0.00%)     9687.43 (  -0.18%)
     Hmean     recv-3312     14359.42 (   0.00%)    14577.59 (   1.52%)
     Hmean     recv-4096     16323.98 (   0.00%)    16393.55 (   0.43%)
     Hmean     recv-8192     26111.85 (   0.00%)    26876.96 (   2.93%)
     Hmean     recv-16384    37206.99 (   0.00%)    38682.41 (   3.97%)
    
    However, what is very interesting is how automatic NUMA balancing behaves.
    Each netperf instance runs long enough for balancing to activate:
    
                                          4.15.0       4.15.0
                                    noexit-v1r23 sdnuma-v1r23
     NUMA base PTE updates             4620        1473
     NUMA huge PMD updates                0           0
     NUMA page range updates           4620        1473
     NUMA hint faults                  4301        1383
     NUMA hint local faults            1309         451
     NUMA hint local percent             30          32
     NUMA pages migrated               1335         491
     AutoNUMA cost                      21%          6%
    
    There is an unfortunate number of remote faults although tracing indicated
    that the vast majority are in shared libraries. However, the tendency to
    start tasks on the same node if there is capacity means that there were
    far fewer PTE updates and faults incurred overall.
    Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Cc: Giovanni Gherdovich <ggherdovich@suse.cz>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Matt Fleming <matt@codeblueprint.co.uk>
    Cc: Mike Galbraith <efault@gmx.de>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Link: http://lkml.kernel.org/r/20180213133730.24064-6-mgorman@techsingularity.net
    Signed-off-by: Ingo Molnar <mingo@kernel.org>