    sched/fair: Check for idle core in wake_affine · d8fcb81f
    Julia Lawall authored
    In the case of a thread wakeup, wake_affine determines whether a core
    will be chosen for the thread on the socket where the thread ran
    previously or on the socket of the waker.  This is done primarily by
    comparing the load of the core where the thread ran previously (prev)
    and the load of the waker (this).
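
    As a sketch (assuming the v5.9 structure of kernel/sched/fair.c,
    with statistics updates elided), wake_affine first tries an
    idleness-based decision and only falls back to comparing loads when
    that is inconclusive:

    	static int wake_affine(struct sched_domain *sd, struct task_struct *p,
    			       int this_cpu, int prev_cpu, int sync)
    	{
    		int target = nr_cpumask_bits;

    		/* First try to decide based on idleness alone. */
    		if (sched_feat(WA_IDLE))
    			target = wake_affine_idle(this_cpu, prev_cpu, sync);

    		/* Otherwise compare the load of prev and this. */
    		if (sched_feat(WA_WEIGHT) && target == nr_cpumask_bits)
    			target = wake_affine_weight(sd, p, this_cpu, prev_cpu, sync);

    		/* nr_cpumask_bits means "no opinion": stay with prev. */
    		if (target == nr_cpumask_bits)
    			return prev_cpu;

    		return target;
    	}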
    
    commit 11f10e54 ("sched/fair: Use load instead of runnable load
    in wakeup path") changed the load computation from the runnable load
    to the load average, where the latter includes the load of threads
    that have already blocked on the core.
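
    As a sketch of the difference (assuming the helper names used in
    fair.c around that commit):

    	/* Before 11f10e54: only tasks currently runnable counted. */
    	static unsigned long cpu_runnable_load(struct rq *rq)
    	{
    		return cfs_rq_runnable_load_avg(&rq->cfs);
    	}

    	/* After 11f10e54: the load average also carries the decaying
    	 * contribution of tasks that recently ran and then blocked. */
    	static unsigned long cpu_load(struct rq *rq)
    	{
    		return cfs_rq_load_avg(&rq->cfs);
    	}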
    
    When a short-running daemon process happens to run on prev, this
    change can make prev appear to have a greater load than this, even
    when prev is actually idle.  When prev and this
    are on the same socket, the idle prev is detected later, in
    select_idle_sibling.  But if that does not hold, prev is completely
    ignored, causing the waking thread to move to the socket of the waker.
    In the case of N mostly active threads on N cores, this triggers other
    migrations and hurts performance.
    
    In contrast, before commit 11f10e54, the load on an idle core
    was 0, and so, when the waker core was not idle, the effect of
    wake_affine was to select prev as the starting point of the search
    for a core for the waking thread.
    
    To avoid unnecessary migrations, extend wake_affine_idle to check
    whether the core where the thread previously ran is currently idle,
    and if so simply return that core as the target.
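
    As a sketch (assuming the v5.9 code), the fix amounts to one more
    check at the end of wake_affine_idle, just before it reports that it
    has no opinion:

    	static int
    	wake_affine_idle(int this_cpu, int prev_cpu, int sync)
    	{
    		/* Wakeup from interrupt context: this_cpu is idle. */
    		if (available_idle_cpu(this_cpu) &&
    		    cpus_share_cache(this_cpu, prev_cpu))
    			return available_idle_cpu(prev_cpu) ? prev_cpu : this_cpu;

    		/* Sync wakeup: the waker is about to sleep. */
    		if (sync && cpu_rq(this_cpu)->nr_running == 1)
    			return this_cpu;

    		/* New: an idle prev is always a good enough target. */
    		if (available_idle_cpu(prev_cpu))
    			return prev_cpu;

    		return nr_cpumask_bits;
    	}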
    
    This particularly has an impact when using the ondemand power manager,
    where kworkers run every 0.004 seconds on all cores, increasing the
    likelihood that an idle core will be considered to have a load.
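
    The numbers involved: PELT halves a blocked task's load contribution
    roughly every 32 ms, so between two kworker runs only 4 ms apart
    most of it survives.  A back-of-the-envelope check, assuming the
    standard PELT decay factor:

    	#include <math.h>
    	#include <stdio.h>

    	int main(void)
    	{
    		/* PELT decay factor per ~1 ms period: y^32 = 0.5 */
    		double y = pow(0.5, 1.0 / 32.0);

    		/* Fraction of blocked load remaining after the 4 ms
    		 * between two kworker activations: about 0.917, so an
    		 * idle core never decays back to a load of 0. */
    		printf("remaining after 4 ms: %.3f\n", pow(y, 4.0));
    		return 0;
    	}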
    
    The following numbers were obtained with the benchmarking tool
    hyperfine (https://github.com/sharkdp/hyperfine) on the NAS parallel
    benchmarks (https://www.nas.nasa.gov/publications/npb.html).  The
    tests were run on an 80-core Intel(R) Xeon(R) CPU E7-8870 v4 @
    2.10GHz.  Active (intel_pstate) and passive (intel_cpufreq) power
    management were used.  Times are in seconds.  All experiments use all
    160 hardware threads.
    
    	v5.9/intel-pstate	v5.9+patch/intel-pstate
bt.C.x	24.725724+-0.962340	23.349608+-1.607214
    lu.C.x	29.105952+-4.804203	25.249052+-5.561617
    sp.C.x	31.220696+-1.831335	30.227760+-2.429792
    ua.C.x	26.606118+-1.767384	25.778367+-1.263850
    
    	v5.9/ondemand		v5.9+patch/ondemand
bt.C.x	25.330360+-1.028316	23.544036+-1.020189
    lu.C.x	35.872659+-4.872090	23.719295+-3.883848
    sp.C.x	32.141310+-2.289541	29.125363+-0.872300
    ua.C.x	29.024597+-1.667049	25.728888+-1.539772
    
    On the smaller data sets (A and B) and on the other NAS benchmarks
    there is no impact on performance.
    
    This also has a major impact on the splash2x.volrend benchmark of
    the PARSEC benchmark suite, which goes from 1m25 without this patch
    to 0m45 with it, in active (intel_pstate) mode.
    
    Fixes: 11f10e54 ("sched/fair: Use load instead of runnable load in wakeup path")
    Signed-off-by: Julia Lawall <Julia.Lawall@inria.fr>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
    Acked-by: Mel Gorman <mgorman@suse.de>
    Link: https://lkml.kernel.org/r/1603372550-14680-1-git-send-email-Julia.Lawall@inria.fr