    mm: memcontrol: fix cpuhotplug statistics flushing · a3d4c05a
    Johannes Weiner authored
    Patch series "mm: memcontrol: switch to rstat", v3.
    
    This series converts memcg stats tracking to the streamlined rstat
    infrastructure provided by the cgroup core code.  rstat is already used by
    the CPU controller and the IO controller.  This change is motivated by
    recent accuracy problems in memcg's custom stats code, as well as the
    benefits of sharing common infra with other controllers.
    
    The current memcg implementation does batched tree aggregation on the
    write side: local stat changes are cached in per-cpu counters, which are
    then propagated upward in batches when a threshold (32 pages) is exceeded.
    This is cheap, but the error introduced by the lazy upward propagation
    adds up: 32 pages times CPUs times cgroups in the subtree.  We've had
    complaints from service owners that the stats do not reliably track and
    react to allocation behavior as expected, sometimes swallowing the results
    of entire test applications.
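
    The drift described above can be reproduced with a small userspace
    model.  This is only a sketch under stated assumptions, not the memcg
    code: struct batched_counter, batched_mod(), batched_exact() and
    worst_case_drift() are hypothetical names, a single cgroup stands in
    for the hierarchy, and NR_CPUS is fixed at 4 for illustration.

```c
#include <assert.h>
#include <stdlib.h>

#define NR_CPUS 4    /* assumed CPU count for this model */
#define BATCH   32   /* pages; the threshold named in the text */

/* Hypothetical model of one counter with write-side batching. */
struct batched_counter {
	long shared;           /* value that readers see */
	long percpu[NR_CPUS];  /* cached deltas not yet propagated */
};

/* Cache the delta on one CPU; propagate only once the batch is exceeded. */
static void batched_mod(struct batched_counter *c, int cpu, long delta)
{
	assert(cpu >= 0 && cpu < NR_CPUS);

	long x = c->percpu[cpu] + delta;

	if (labs(x) > BATCH) {
		c->shared += x;
		x = 0;
	}
	c->percpu[cpu] = x;
}

/* The exact value: what readers see plus everything still cached. */
static long batched_exact(const struct batched_counter *c)
{
	long sum = c->shared;

	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		sum += c->percpu[cpu];
	return sum;
}

/* Upper bound on the error: batch size times CPUs times cgroups. */
static long worst_case_drift(long nr_cpus, long nr_cgroups)
{
	return BATCH * nr_cpus * nr_cgroups;
}
```

    In this model, charging 33 pages and then uncharging 32 on the same
    CPU leaves the reader-visible value at 33 while the exact value is 1,
    because the negative delta stays below the threshold and remains
    cached.  With 4 CPUs and 100 cgroups, worst_case_drift() gives 12800
    pages.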
    
    The original memcg stat implementation used to do tree aggregation
    exclusively on the read side: local stats would only ever be tracked in
    per-cpu counters, and a memory.stat read would iterate the entire subtree
    and sum those counters up.  This didn't keep up with the times:
    
     - Cgroup trees are much bigger now. We switched to lazily-freed
       cgroups, where deleted groups would hang around until their remaining
       page cache has been reclaimed. This can result in large subtrees that
       are expensive to walk, while most of the groups are idle and their
       statistics don't change much anymore.
    
     - Automated monitoring increased. With the proliferation of userspace
       oom killing, proactive reclaim, and higher-resolution logging of
       workload trends in general, top-level stat files are polled at least
       once a second in many deployments.
    
     - The lifetime of cgroups got shorter. Where most cgroup setups in the
       past would have a few large policy-oriented cgroups for everything
       running on the system, newer cgroup deployments tend to create one
       group per application - which gets deleted again as the processes
       exit. An aggregation scheme that doesn't retain child data inside the
       parents loses event history of the subtree.
    
    Rstat addresses all three of those concerns through intelligent,
    persistent read-side aggregation.  As statistics change at the local
    level, rstat tracks - on a per-cpu basis - only those parts of a subtree
    that have changes pending and require aggregation.  The actual
    aggregation occurs on the colder read side - which can now skip over
    (potentially large) numbers of recently idle cgroups.
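
    The idea of tracking only the groups with pending changes can be
    sketched in userspace as follows.  This is a deliberately simplified
    model and not the rstat data structure: it assumes a flat per-cpu
    "updated" list rather than whatever the cgroup core uses, it always
    flushes the whole tree, and struct group, tree_init(), stat_mod() and
    stat_flush() are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define NR_CPUS    4   /* assumed CPU count for this model */
#define MAX_GROUPS 16  /* assumed upper bound on groups */

/* Hypothetical group: parent link, aggregated total, per-cpu pending state. */
struct group {
	int  parent;            /* index of the parent, -1 for the root */
	long total;             /* aggregated subtree value seen by readers */
	long pending[NR_CPUS];  /* local deltas not yet aggregated */
	bool queued[NR_CPUS];   /* already on that CPU's updated list? */
	int  next[NR_CPUS];     /* next group on that CPU's updated list */
};

static struct group groups[MAX_GROUPS];
static int updated_head[NR_CPUS];

static void tree_init(const int *parents, int nr_groups)
{
	assert(nr_groups > 0 && nr_groups <= MAX_GROUPS);
	memset(groups, 0, sizeof(groups));
	for (int i = 0; i < nr_groups; i++)
		groups[i].parent = parents[i];
	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		updated_head[cpu] = -1;
}

/* Write side: record the delta and queue the group once per CPU. */
static void stat_mod(int g, int cpu, long delta)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	groups[g].pending[cpu] += delta;
	if (!groups[g].queued[cpu]) {
		groups[g].queued[cpu] = true;
		groups[g].next[cpu] = updated_head[cpu];
		updated_head[cpu] = g;
	}
}

/*
 * Read side: visit only queued groups and fold their deltas into the
 * group and all of its ancestors. Returns how many entries were visited.
 */
static int stat_flush(void)
{
	int visited = 0;

	for (int cpu = 0; cpu < NR_CPUS; cpu++) {
		int g = updated_head[cpu];

		while (g != -1) {
			long delta = groups[g].pending[cpu];
			int next = groups[g].next[cpu];

			groups[g].pending[cpu] = 0;
			groups[g].queued[cpu] = false;
			for (int a = g; a != -1; a = groups[a].parent)
				groups[a].total += delta;
			visited++;
			g = next;
		}
		updated_head[cpu] = -1;
	}
	return visited;
}
```

    In this model the cost of a read scales with the number of groups
    that changed since the last flush, not with the size of the tree, and
    a second flush with no intervening updates visits nothing.  Because
    the totals are stored in the parents, they also survive the deletion
    of a child.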
    
    ===
    
    The test_kmem cgroup selftest is currently failing due to excessive
    cumulative vmstat drift from 100 subgroups:
    
        ok 1 test_kmem_basic
        memory.current = 8810496
        slab + anon + file + kernel_stack = 17074568
        slab = 6101384
        anon = 946176
        file = 0
        kernel_stack = 10027008
        not ok 2 test_kmem_memcg_deletion
        ok 3 test_kmem_proc_kpagecgroup
        ok 4 test_kmem_kernel_stacks
        ok 5 test_kmem_dead_cgroups
        ok 6 test_percpu_basic
    
    As you can see, memory.stat items far exceed memory.current.  The kernel
    stack alone is bigger than all of charged memory.  That's because the
    memory of the test has been uncharged from memory.current, but the
    negative vmstat deltas are still sitting in the percpu caches.
    
    The test at this time isn't even counting percpu, pagetables etc. yet,
    which would further contribute to the error.  The last patch in the series
    updates the test to include them and reduces the vmstat tolerances in
    general to only expect page_counter batching.
    
    With all patches applied, the (now more stringent) test succeeds:
    
        ok 1 test_kmem_basic
        ok 2 test_kmem_memcg_deletion
        ok 3 test_kmem_proc_kpagecgroup
        ok 4 test_kmem_kernel_stacks
        ok 5 test_kmem_dead_cgroups
        ok 6 test_percpu_basic
    
    ===
    
    A kernel build test confirms that overhead is comparable.  Two kernels are
    built simultaneously in a nested tree with several idle siblings:
    
    root - kernelbuild - one - two - three - four - build-a (defconfig, make -j16)
                                                 `- build-b (defconfig, make -j16)
                                                 `- idle-1
                                                 `- ...
                                                 `- idle-9
    
    During the builds, kernelbuild/memory.stat is read once a second.
    
    A perf diff shows that the change in cycle distribution is
    minimal. Top 10 kernel symbols:
    
         0.09%     +0.08%  [kernel.kallsyms]                       [k] __mod_memcg_lruvec_state
         0.00%     +0.06%  [kernel.kallsyms]                       [k] cgroup_rstat_updated
         0.08%     -0.05%  [kernel.kallsyms]                       [k] __mod_memcg_state.part.0
         0.16%     -0.04%  [kernel.kallsyms]                       [k] release_pages
         0.00%     +0.03%  [kernel.kallsyms]                       [k] __count_memcg_events
         0.01%     +0.03%  [kernel.kallsyms]                       [k] mem_cgroup_charge_statistics.constprop.0
         0.10%     -0.02%  [kernel.kallsyms]                       [k] get_mem_cgroup_from_mm
         0.05%     -0.02%  [kernel.kallsyms]                       [k] mem_cgroup_update_lru_size
         0.57%     +0.01%  [kernel.kallsyms]                       [k] asm_exc_page_fault
    
    ===
    
    The on-demand aggregated stats are now fully accurate:
    
    $ grep -e nr_inactive_file /proc/vmstat | awk '{print($1,$2*4096)}'; \
      grep -e inactive_file /sys/fs/cgroup/memory.stat
    
    vanilla:                              patched:
    nr_inactive_file 1574105088           nr_inactive_file 1027801088
       inactive_file 1577410560              inactive_file 1027801088
    
    ===
    
    This patch (of 8):
    
    The memcg hotunplug callback erroneously flushes counts on the local CPU,
    not the counts of the CPU going away; those counts will be lost.
    
    Flush the CPU that is actually going away.
    
    Also simplify the code a bit by using mod_memcg_state() and
    count_memcg_events() instead of open-coding the upward flush - this is
    comparable to how vmstat.c handles hotunplug flushing.
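
    The difference between the two behaviours can be illustrated with a
    userspace model.  This is a sketch only and does not reproduce the
    kernel callback: struct counter, cpu_dead_buggy(), cpu_dead_fixed()
    and reachable_after_offline() are hypothetical names, and the CPU
    running the callback is passed in explicitly as local_cpu.

```c
#include <assert.h>

#define NR_CPUS 4  /* assumed CPU count for this model */

/* Hypothetical counter with a shared value and per-cpu cached counts. */
struct counter {
	long shared;
	long percpu[NR_CPUS];
};

/* Buggy variant: drains the CPU running the callback, not the dead one. */
static void cpu_dead_buggy(struct counter *c, int dead_cpu, int local_cpu)
{
	assert(dead_cpu >= 0 && dead_cpu < NR_CPUS);
	assert(local_cpu >= 0 && local_cpu < NR_CPUS);
	c->shared += c->percpu[local_cpu];
	c->percpu[local_cpu] = 0;
}

/* Fixed variant: drains the counts of the CPU that is going away. */
static void cpu_dead_fixed(struct counter *c, int dead_cpu, int local_cpu)
{
	assert(dead_cpu >= 0 && dead_cpu < NR_CPUS);
	assert(local_cpu >= 0 && local_cpu < NR_CPUS);
	c->shared += c->percpu[dead_cpu];
	c->percpu[dead_cpu] = 0;
}

/* What can still reach readers once dead_cpu is gone for good. */
static long reachable_after_offline(const struct counter *c, int dead_cpu)
{
	long sum = c->shared;

	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		if (cpu != dead_cpu)
			sum += c->percpu[cpu];
	return sum;
}
```

    With 5 cached on CPU 0 and 9 cached on CPU 3, taking CPU 3 offline
    from CPU 0 leaves 5 reachable in the buggy variant (the 9 are
    stranded on the dead CPU) and 14 in the fixed variant.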
    
    Link: https://lkml.kernel.org/r/20210209163304.77088-1-hannes@cmpxchg.org
    Link: https://lkml.kernel.org/r/20210209163304.77088-2-hannes@cmpxchg.org
    Fixes: a983b5eb ("mm: memcontrol: fix excessive complexity in memory.stat reporting")
    Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
    Reviewed-by: Shakeel Butt <shakeelb@google.com>
    Reviewed-by: Roman Gushchin <guro@fb.com>
    Reviewed-by: Michal Koutný <mkoutny@suse.com>
    Acked-by: Michal Hocko <mhocko@suse.com>
    Cc: Tejun Heo <tj@kernel.org>
    Cc: Roman Gushchin <guro@fb.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>