    mm: vmscan: restore incremental cgroup iteration · b82b5307
    Johannes Weiner authored
    Currently, reclaim always walks the entire cgroup tree in order to ensure
    fairness between groups.  While overreclaim is limited in shrink_lruvec(),
    many of our systems have a sizable number of active groups, and an even
    bigger number of idle cgroups with cache left behind by previous jobs; the
    mere act of walking all these cgroups can impose significant latency on
    direct reclaimers.
    
    In the past, we've used a save-and-restore iterator that enabled
    incremental tree walks over multiple reclaim invocations.  This ensured
    fairness, while keeping the work of individual reclaimers small.
    
    However, in edge cases with a lot of reclaim concurrency, individual
    reclaimers would sometimes not see enough of the cgroup tree to make
    forward progress and (prematurely) declare OOM.  Consequently we switched
    to comprehensive walks in 1ba6fc9a ("mm: vmscan: do not share cgroup
    iteration between reclaimers").
    
    To address the latency problem without bringing back the premature OOM
    issue, reinstate the shared iteration, but with a restart condition to do
    the full walk in the OOM case - similar to what we do for memory.low
    enforcement and active page protection.
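The scheme above can be modeled in a few lines of user-space C. This is only an illustrative sketch, not the kernel code: the flat `reclaimable[]` array, `shared_cursor`, `scan_slice()`, and `try_reclaim()` are all made-up stand-ins for the cgroup tree, the per-target `mem_cgroup_iter()` cursor, and the reclaim loop.

```c
/* User-space model of the shared-iterator-with-restart scheme described
 * above.  All names here are illustrative; the kernel walks a cgroup
 * tree via mem_cgroup_iter() with a cursor cached in the target cgroup. */
#include <assert.h>
#include <stdbool.h>

#define NGROUPS 8

static int shared_cursor;                  /* position shared by all reclaimers */
static int reclaimable[NGROUPS] = { 1, 1, 1, 1, 1, 1, 1, 1 };

/* Scan up to 'budget' groups starting at the shared cursor, and save
 * the cursor so the next reclaimer resumes where this one stopped. */
static int scan_slice(int budget)
{
	int freed = 0;

	while (budget-- > 0) {
		int g = shared_cursor;

		shared_cursor = (shared_cursor + 1) % NGROUPS;
		if (reclaimable[g] > 0) {
			reclaimable[g]--;
			freed++;
		}
	}
	return freed;
}

/* Incremental scan first; only if that frees nothing, restart from the
 * top and walk the entire tree once before reporting failure (OOM). */
static bool try_reclaim(void)
{
	if (scan_slice(2) > 0)          /* fast path: small shared slice */
		return true;

	shared_cursor = 0;              /* restart condition: one full walk */
	return scan_slice(NGROUPS) > 0;
}
```

Successive calls to `try_reclaim()` each scan a different small slice of the tree, so fairness emerges across invocations; only when an incremental slice frees nothing does a caller pay for one full walk, and only if that full walk also fails is OOM declared.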
    
    In the worst case, we do one more full tree walk before declaring
    OOM. But the vast majority of direct reclaim scans can then finish
    much quicker, while fairness across the tree is maintained:
    
    - Before this patch, we observed that direct reclaim always takes more
      than 100us and most direct reclaim time is spent in reclaim cycles
      lasting between 1ms and 1 second. Almost 40% of direct reclaim time
      was spent on reclaim cycles exceeding 100ms.
    
    - With this patch, almost all page reclaim cycles last less than 10ms,
      and a good amount of direct page reclaim finishes in under 100us. No
      page reclaim cycles lasting over 100ms were observed anymore.
    
    The shared iterator state is maintained inside the target cgroup, so
    fair and incremental walks are performed during both global reclaim
    and cgroup limit reclaim of complex subtrees.
    
    Link: https://lkml.kernel.org/r/20240514202641.2821494-1-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
    Signed-off-by: Rik van Riel <riel@surriel.com>
    Reported-by: Rik van Riel <riel@surriel.com>
    Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
    Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev>
    Cc: Facebook Kernel Team <kernel-team@fb.com>
    Cc: Michal Hocko <mhocko@kernel.org>
    Cc: Rik van Riel <riel@surriel.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>