    sched/balance: Skip unnecessary updates to idle load balancer's flags · f90cc919
    Tim Chen authored
    We observed that the overhead of trigger_load_balance(), since renamed
    sched_balance_trigger(), rises with a system's core count.
    
    For an OLTP workload running a 6.8 kernel on a 2-socket x86 system
    with 96 cores/socket, we saw that 0.7% of cpu cycles were spent in
    trigger_load_balance().  On older systems with fewer cores/socket, this
    function's overhead was less than 0.1%.
    
    The cause of this overhead is that multiple cpus call kick_ilb(flags)
    to post the balancing work needed to a common idle load balancer cpu.
    The ilb_cpu's flags field gets updated unconditionally with
    atomic_fetch_or().  These atomic reads and writes to ilb_cpu's flags
    cause heavy cache bouncing and cpu cycle overhead, as seen in the
    annotated profile below.
    
                 kick_ilb():
                 if (ilb_cpu < 0)
                   test   %r14d,%r14d
                 ↑ js     6c
                 flags = atomic_fetch_or(flags, nohz_flags(ilb_cpu));
                   mov    $0x2d600,%rdi
                   movslq %r14d,%r8
                   mov    %rdi,%rdx
                   add    -0x7dd0c3e0(,%r8,8),%rdx
                 arch_atomic_read():
      0.01         mov    0x64(%rdx),%esi
     35.58         add    $0x64,%rdx
                 arch_atomic_fetch_or():
    
                 static __always_inline int arch_atomic_fetch_or(int i, atomic_t *v)
                 {
                 int val = arch_atomic_read(v);
    
                 do { } while (!arch_atomic_try_cmpxchg(v, &val, val | i));
      0.03  157:   mov    %r12d,%ecx
                 arch_atomic_try_cmpxchg():
                 return arch_try_cmpxchg(&v->counter, old, new);
      0.00         mov    %esi,%eax
                 arch_atomic_fetch_or():
                 do { } while (!arch_atomic_try_cmpxchg(v, &val, val | i));
                   or     %esi,%ecx
                 arch_atomic_try_cmpxchg():
                 return arch_try_cmpxchg(&v->counter, old, new);
      0.01         lock   cmpxchg %ecx,(%rdx)
     42.96       ↓ jne    2d2
                 kick_ilb():
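
    For reference, the flag propagation path that the profile annotates
    looks roughly like the sketch below (simplified from kernel/sched/fair.c,
    not the verbatim source):

        static void kick_ilb(unsigned int flags)
        {
                int ilb_cpu = find_new_ilb();

                /* No idle cpu available to take over NOHZ balancing. */
                if (ilb_cpu < 0)
                        return;

                /*
                 * Unconditional atomic read-modify-write on the chosen
                 * ilb_cpu's NOHZ flags word.  Every kicking cpu takes the
                 * cache line exclusive, even when the bits it wants to set
                 * are already set.
                 */
                flags = atomic_fetch_or(flags, nohz_flags(ilb_cpu));
                if (flags & NOHZ_KICK_MASK)
                        return;

                /* Kick the idle cpu so it runs the NOHZ balance work. */
                smp_call_function_single_async(ilb_cpu, &cpu_rq(ilb_cpu)->nohz_csd);
        }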
    
    With instrumentation, we found that 81% of the updates do not result in
    any change to the ilb_cpu's flags.  That is, multiple cpus are asking
    the ilb_cpu to do the same things over and over again before the ilb_cpu
    has had a chance to run the NOHZ load balance.
    
    Skip the update to ilb_cpu's flags when no new work needs to be done,
    i.e. when the update would not change ilb_cpu's NOHZ flags.  This
    requires an extra atomic read, but that is much cheaper than frequent
    unnecessary atomic updates that generate cache bounces.
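
    A minimal sketch of that check, assuming it sits in kick_ilb() right
    after an ilb_cpu has been picked (simplified; not the verbatim patch
    hunk):

        /*
         * Don't bother updating if every bit we want to set in flags is
         * already set in ilb_cpu's NOHZ flags: the atomic_fetch_or()
         * would be a no-op, and the plain atomic_read() avoids taking
         * the cache line exclusive.
         */
        if ((atomic_read(nohz_flags(ilb_cpu)) & flags) == flags)
                return;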
    
    With this change, cpu cycles spent in trigger_load_balance() (now
    sched_balance_trigger()) on the OLTP workload dropped from 0.7% to 0.2%.
    Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Chen Yu <yu.c.chen@intel.com>
    Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
    Link: https://lore.kernel.org/r/20240531205452.65781-1-tim.c.chen@linux.intel.com