    sched/psi: Stop relying on timer_pending() for poll_work rescheduling · 710ffe67
    Suren Baghdasaryan authored
    The PSI polling mechanism tries to minimize the number of wakeups needed
    to run psi_poll_work and currently relies on timer_pending() to detect
    when this work is already scheduled. This leaves a window in which
    psi_group_change can schedule an immediate psi_poll_work after
    poll_timer_fn has been called but before psi_poll_work has rescheduled
    itself. The entire window is depicted below:
    
    poll_timer_fn
      wake_up_interruptible(&group->poll_wait);
    
    psi_poll_worker
      wait_event_interruptible(group->poll_wait, ...)
      psi_poll_work
        psi_schedule_poll_work
          if (timer_pending(&group->poll_timer)) return;
          ...
          mod_timer(&group->poll_timer, jiffies + delay);
    
    Prior to 461daba0 we relied on the poll_scheduled atomic, which was
    reset and set back inside psi_poll_work, so this race window was much
    smaller.
    The larger window causes an increased number of wakeups, and our partners
    report a visible power regression of ~10mA after applying 461daba0.
    Bring back the poll_scheduled atomic and make this race window even
    narrower by resetting poll_scheduled only when we reach the polling
    expiration time. This does not completely eliminate the possibility of
    extra wakeups caused by a race with psi_group_change; however, it limits
    them to at most one extra wakeup per tracking window (0.5s in the worst
    case).
    This patch also ensures correct ordering between clearing the
    poll_scheduled flag and obtaining changed_states by using a memory
    barrier. Correct ordering between updating changed_states and setting
    poll_scheduled is ensured by the atomic_xchg operation.
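    The gating described above can be illustrated with a minimal userspace
    sketch using C11 atomics. This is not the kernel code: struct poll_gate,
    gate_schedule(), gate_poll_work() and the timer_arms counter are
    hypothetical names introduced only for illustration, atomic_exchange()
    stands in for atomic_xchg(), and the sequentially consistent fence
    stands in for the kernel memory barrier.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for the relevant part of a psi group. */
struct poll_gate {
    atomic_int poll_scheduled;        /* 1 while a poll counts as scheduled */
    unsigned long long polling_until; /* end of the current polling window */
    int timer_arms;                   /* counts simulated timer (re)arms */
};

/*
 * Returns true if the simulated timer was armed. The exchange is performed
 * even when force is set, so every caller goes through the same atomic
 * read-modify-write on the flag.
 */
static bool gate_schedule(struct poll_gate *g, bool force)
{
    if (atomic_exchange(&g->poll_scheduled, 1) && !force)
        return false;   /* already scheduled: this request is blocked */
    g->timer_arms++;    /* stands in for arming the poll timer */
    return true;
}

/*
 * Simulated poll worker step. Returns true when the polling window has not
 * expired yet, meaning the worker should force its own reschedule. The flag
 * is cleared only once the window has expired, which is the only point at
 * which a concurrent gate_schedule(g, false) can slip through.
 */
static bool gate_poll_work(struct poll_gate *g, unsigned long long now)
{
    if (now > g->polling_until) {
        atomic_store(&g->poll_scheduled, 0);
        /* Order the clear before any later reads of shared state. */
        atomic_thread_fence(memory_order_seq_cst);
        return false;
    }
    return true;
}
```

    In this sketch, a non-forced gate_schedule() call made while the window
    is still open is always blocked, whereas the worker itself passes
    force = true to keep rescheduling until the window expires.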
    By tracing the number of immediate rescheduling attempts performed by
    psi_group_change and the number of these attempts that are blocked
    because the psi monitor is already active, we can assess the effects of
    this change:
    
    Before the patch:
                                               Run#1    Run#2      Run#3
    Immediate reschedules attempted:           684365   1385156    1261240
    Immediate reschedules blocked:             682846   1381654    1258682
    Immediate reschedules (delta):             1519     3502       2558
    Immediate reschedules (% of attempted):    0.22%    0.25%      0.20%
    
    After the patch:
                                               Run#1    Run#2     Run#3
    Immediate reschedules attempted:           882244   770298    426218
    Immediate reschedules blocked:             881996   769796    426074
    Immediate reschedules (delta):             248      502       144
    Immediate reschedules (% of attempted):    0.03%    0.07%     0.03%
    
    The share of non-blocked immediate reschedules dropped from 0.20-0.25%
    to 0.03-0.07%. The drop is attributed to the smaller race window and to
    the fact that we allow this race only when psi monitors reach the
    polling window expiration time.
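    The delta and percentage rows in the tables above follow directly from
    the attempted and blocked counts. A small sketch of that arithmetic,
    where unblocked_pct() is a hypothetical helper name and not part of the
    patch:

```c
/*
 * Percentage of attempted immediate reschedules that were not blocked,
 * i.e. 100 * (attempted - blocked) / attempted.
 * Returns 0 when nothing was attempted, or when the counts are
 * inconsistent (blocked > attempted), to avoid unsigned wraparound.
 */
static double unblocked_pct(unsigned long attempted, unsigned long blocked)
{
    if (attempted == 0 || blocked > attempted)
        return 0.0;
    return 100.0 * (double)(attempted - blocked) / (double)attempted;
}
```

    For example, Run#1 before the patch gives (684365 - 682846) / 684365,
    about 0.22%, and Run#1 after the patch gives (882244 - 881996) / 882244,
    about 0.03%.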
    
    Fixes: 461daba0 ("psi: eliminate kthread_worker from psi trigger scheduling mechanism")
    Reported-by: Kathleen Chang <yt.chang@mediatek.com>
    Reported-by: Wenju Xu <wenju.xu@mediatek.com>
    Reported-by: Jonathan Chen <jonathan.jmchen@mediatek.com>
    Signed-off-by: Suren Baghdasaryan <surenb@google.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Chengming Zhou <zhouchengming@bytedance.com>
    Acked-by: Johannes Weiner <hannes@cmpxchg.org>
    Tested-by: SH Chen <show-hong.chen@mediatek.com>
    Link: https://lore.kernel.org/r/20221028194541.813985-1-surenb@google.com