    mm, memcg: throttle allocators when failing reclaim over memory.high · 0e4b01df
    Chris Down authored
    We're trying to use memory.high to limit workloads, but have found that
    containment can frequently fail completely and cause OOM situations
    outside of the cgroup.  This happens especially with swap space -- either
    when none is configured, or swap is full.  These failures often also don't
    have enough warning to allow one to react, whether for a human or for a
    daemon monitoring PSI.
    
    Here is output from a simple program showing how long it takes in usec
    (column 2) to allocate each successive megabyte of anonymous memory
    (column 1) when a cgroup is already beyond its memory.high setting and no
    swap is available:
    
        [root@ktst ~]# systemd-run -p MemoryHigh=100M -p MemorySwapMax=1 \
        > --wait -t timeout 300 /root/mdf
        [...]
        95  1035
        96  1038
        97  1000
        98  1036
        99  1048
        100 1590
        101 1968
        102 1776
        103 1863
        104 1757
        105 1921
        106 1893
        107 1760
        108 1748
        109 1843
        110 1716
        111 1924
        112 1776
        113 1831
        114 1766
        115 1836
        116 1588
        117 1912
        118 1802
        119 1857
        120 1731
        [...]
        [System OOM in 2-3 seconds]
    
    The delay does go up marginally past the 100MB memory.high threshold,
    since we now spend time scanning for reclaimable pages before returning
    to usermode, but it's nowhere near enough to contain growth.  Nor does
    the delay get worse as the overage grows, since reclaim only considers
    nr_pages for the current allocation, not how far over the threshold we
    already are.
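    For reference, the measurement program itself is not part of the commit,
    but a minimal reconstruction might look like the sketch below: allocate
    and touch one megabyte of anonymous memory at a time, and print how long
    each megabyte took in usec.  (This is an assumption about what mdf does,
    inferred only from the output format above.)
    
        /* mdf-like sketch: time 1MB anonymous allocations.  cc -O2 -o mdf mdf.c */
        #include <stdio.h>
        #include <stdlib.h>
        #include <string.h>
        #include <time.h>
    
        #define MB (1024UL * 1024UL)
    
        int main(void)
        {
                struct timespec start, end;
    
                for (unsigned long mb = 1; ; mb++) {
                        clock_gettime(CLOCK_MONOTONIC, &start);
    
                        char *buf = malloc(MB);
                        if (!buf)
                                return 1;
                        /* Touch every page so the kernel actually charges it. */
                        memset(buf, 1, MB);
                        /* Deliberately never freed: usage must keep growing. */
    
                        clock_gettime(CLOCK_MONOTONIC, &end);
                        printf("%lu %ld\n", mb,
                               (end.tv_sec - start.tv_sec) * 1000000L +
                               (end.tv_nsec - start.tv_nsec) / 1000L);
                        fflush(stdout);
                }
        }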
    
    The current situation goes against both the expectations of users of
    memory.high and our intentions as cgroup v2 developers.  In
    cgroup-v2.txt, we claim that we will throttle, and that only under
    "extreme conditions" will memory.high protection be breached.  Likewise,
    cgroup v2 users generally expect that memory.high should throttle
    workloads as they exceed their high threshold.  However, as seen above,
    this isn't always how it works in practice -- even on banal setups like
    those with no swap, or where swap has become exhausted, we can end up
    breaching memory.high with no weapons left in our arsenal to combat
    runaway growth, since reclaim is futile.
    
    It's also hard for system monitoring software or users to tell how bad
    the situation is, as "high" events for the memcg may in some cases be
    benign, and in others be catastrophic.  The status quo is that we fail
    containment in a way that doesn't provide any advance warning that things
    are about to go horribly wrong (for example, that we are about to invoke
    the kernel OOM killer).
    
    This patch introduces explicit throttling when reclaim is failing to keep
    memcg size contained at the memory.high setting.  It does so by applying
    an exponential delay curve derived from the memcg's overage compared to
    memory.high.  In the normal case where the memcg is either below or only
    marginally over its memory.high setting, no throttling will be performed.
    
    This composes well with system health monitoring and remediation, as
    these allocator delays are factored into PSI's memory pressure
    calculations.  This both creates a mechanism for system administrators or
    applications consuming the PSI interface to see trivially that the memcg
    in question is struggling, so they can make more reasonable decisions,
    and permits them enough time to act.  Either can respond with
    significantly more nuance than we can provide using the system OOM
    killer.
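    For example, a monitor can poll the unit's memory.pressure file and watch
    the stall time climb well before the OOM killer would become involved.
    The unit path and the numbers below are hypothetical, but the format is
    the standard PSI output:
    
        [root@ktst ~]# cat /sys/fs/cgroup/system.slice/run-u1234.service/memory.pressure
        some avg10=72.33 avg60=31.07 avg300=8.92 total=9813422
        full avg10=70.12 avg60=29.84 avg300=8.51 total=9321094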
    
    This is a similar idea to memory.oom_control in cgroup v1, which would
    put the cgroup to sleep if the threshold was violated, but it's also
    significantly improved: it results in visible memory pressure, and it
    doesn't schedule indefinitely, which previously made tracing and other
    introspection difficult (i.e., it's clamped at 2*HZ per allocation
    through MEMCG_MAX_HIGH_DELAY_JIFFIES).
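    Putting the curve and the clamp together, here is a compilable userspace
    sketch of the delay calculation's general shape.  The quadratic exponent,
    the fixed-point shifts, and HZ=1000 are illustrative assumptions chosen
    to reproduce the behaviour described here (no delay at or below
    memory.high, a steep ramp above it, a hard 2*HZ ceiling); the
    authoritative version is mem_cgroup_handle_over_high() in the patch
    itself:
    
        #include <stdio.h>
    
        #define HZ 1000ULL                        /* assumption: CONFIG_HZ=1000 */
        #define MAX_HIGH_DELAY_JIFFIES (2 * HZ)   /* the 2*HZ clamp */
    
        /* usage and high are in pages; returns a penalty in jiffies */
        static unsigned long long high_delay(unsigned long long usage,
                                             unsigned long long high)
        {
                unsigned long long overage, penalty;
    
                if (usage <= high)
                        return 0;
    
                /* fixed-point overage ratio, scaled up for precision */
                overage = ((usage - high) << 20) / (high ? high : 1);
    
                /* quadratic ramp: lenient on small overages, harsh on large */
                penalty = (overage * overage * HZ) >> (20 + 14);
    
                return penalty < MAX_HIGH_DELAY_JIFFIES ?
                        penalty : MAX_HIGH_DELAY_JIFFIES;
        }
    
        int main(void)
        {
                unsigned long long high = 100 << 8; /* 100MB in 4KB pages */
    
                for (unsigned long long mb = 100; mb <= 120; mb++)
                        printf("%llu MB -> %llu jiffies\n",
                               mb, high_delay(mb << 8, high));
                return 0;
        }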
    
    Contrast the previous results with a kernel with this patch:
    
        [root@ktst ~]# systemd-run -p MemoryHigh=100M -p MemorySwapMax=1 \
        > --wait -t timeout 300 /root/mdf
        [...]
        95  1002
        96  1000
        97  1002
        98  1003
        99  1000
        100 1043
        101 84724
        102 330628
        103 610511
        104 1016265
        105 1503969
        106 2391692
        107 2872061
        108 3248003
        109 4791904
        110 5759832
        111 6912509
        112 8127818
        113 9472203
        114 12287622
        115 12480079
        116 14144008
        117 15808029
        118 16384500
        119 16383242
        120 16384979
        [...]
    
    As you can see, in the normal case memory allocation takes around 1000
    usec.  However, as we exceed memory.high, things start to increase
    exponentially, though fairly leniently at first.  Our first megabyte over
    memory.high takes about 0.08 seconds, the next 0.33 seconds, the next
    0.61 seconds, and the one after that roughly a full second.  This keeps
    getting worse until we reach our eventual 2*HZ clamp per batch, resulting
    in 16 seconds per megabyte.  However, this is still making forward
    progress, so it permits tracing or further analysis with programs like
    GDB.
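    The 16 seconds per megabyte falls out of the charge batch size: assuming
    the usual 32-page charge batch (MEMCG_CHARGE_BATCH) and 4KB pages, a
    megabyte is charged in 8 batches, and each batch can be delayed by at
    most 2*HZ:
    
        1MB / (32 pages * 4KB) = 8 charge batches
        8 batches * 2 seconds  = 16 seconds per megabyte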
    
    We use an exponential curve for our delay penalty for a few reasons:
    
    1. We run mem_cgroup_handle_over_high to potentially do reclaim after
       we've already performed allocations, which means that temporarily
       going over memory.high by a small amount may be perfectly legitimate,
       even for compliant workloads. We don't want to unduly penalise such
       cases.
    2. An exponential curve (as opposed to a static or linear delay) allows
       ramping up memory pressure stats more gradually, which can be useful
       for working out that you have set memory.high too low without
       destroying application performance entirely (see the comparison
       after this list).
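    To make the second point concrete, normalise the overage to the ratio
    r = (usage - high) / high and compare a linear ramp against a quadratic
    realisation of the curve (as in the sketch above):
    
        linear:     penalty ~ r    -> 1% overage = 1/100 of the full ramp
        quadratic:  penalty ~ r^2  -> 1% overage = 1/10000 of the full ramp
    
    A compliant workload that momentarily strays 1% over memory.high is
    penalised two orders of magnitude less, relatively, than a linear delay
    would penalise it, while sustained runaway growth still saturates at the
    2*HZ clamp.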
    
    This patch expands on earlier work by Johannes Weiner. Thanks!
    
    [akpm@linux-foundation.org: fix max() warning]
    [akpm@linux-foundation.org: fix __udivdi3 ref on 32-bit]
    [akpm@linux-foundation.org: fix it even more]
    [chris@chrisdown.name: fix 64-bit divide even more]
    Link: http://lkml.kernel.org/r/20190723180700.GA29459@chrisdown.name
    Signed-off-by: Chris Down <chris@chrisdown.name>
    Acked-by: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Tejun Heo <tj@kernel.org>
    Cc: Roman Gushchin <guro@fb.com>
    Cc: Michal Hocko <mhocko@kernel.org>
    Cc: Nathan Chancellor <natechancellor@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>