    mm, oom: introduce oom reaper · aac45363
    Michal Hocko authored
    This patch (of 5):
    
    This is based on the idea from Mel Gorman discussed during LSFMM 2015
    and independently brought up by Oleg Nesterov.
    
    The OOM killer currently kills only a single task, in the hope that the
    task will terminate in a reasonable time and free up its memory.  Such a
    task (the OOM victim) gets access to memory reserves via
    mark_oom_victim so that it can make forward progress should it need
    additional memory during the exit path.
    
    It has been shown (e.g.  by Tetsuo Handa) that it is not that hard to
    construct workloads which break the core assumption mentioned above.
    The OOM victim might take an unbounded amount of time to exit because
    it might be blocked in the uninterruptible state waiting for an event
    (e.g.  a lock) which is held up by another task looping in the page
    allocator.
    
    This patch reduces the probability of such a lockup by introducing a
    specialized kernel thread (oom_reaper) which tries to reclaim additional
    memory by preemptively reaping the anonymous or swapped out memory owned
    by the oom victim, under the assumption that such memory won't be needed
    when its owner is killed and kicked from the userspace anyway.  There is
    one notable exception to this, though: if the OOM victim was in the
    process of coredumping, the result would be incomplete.  This is
    considered a reasonable constraint because the overall system health is
    more important than debuggability of a particular application.
    
    A kernel thread has been chosen because we need a reliable way of
    invocation.  Workqueue context is not appropriate because all the
    workers might be busy (e.g.  allocating memory).  Kswapd, which sounds
    like another good fit, is not appropriate either because it might get
    blocked on locks during reclaim as well.
    
    oom_reaper has to take mmap_sem on the target task for reading, so the
    solution is not 100% reliable: the semaphore might be held or blocked
    for write.  The probability of a lockup is still reduced considerably
    compared with basically any lock blocking forward progress as described
    above.  To avoid blocking on the lock without any forward progress we
    use only a trylock and retry 10 times with a short sleep in between.
    Users of mmap_sem which need it for write should be carefully reviewed
    to use _killable waiting as much as possible and to reduce allocation
    requests done with the lock held to the absolute minimum, which reduces
    the risk even further.
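
    The bounded trylock-and-retry pattern described above can be sketched in
    userspace C.  This is only an analogue, not the kernel code: a POSIX
    rwlock stands in for mmap_sem, and try_reap_with_retries, count_reap,
    MAX_REAP_ATTEMPTS and the 1 ms sleep are hypothetical names and values
    chosen for illustration (the text only says "10 times with a short
    sleep").

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <time.h>

/* Retry count taken from the text; the name is hypothetical. */
#define MAX_REAP_ATTEMPTS 10

/* Hypothetical callback standing in for the work done under the read lock. */
typedef void (*reap_fn)(void *ctx);

/* Trivial example callback: counts how many times reaping ran. */
static void count_reap(void *ctx)
{
    ++*(int *)ctx;
}

/*
 * Try to take the lock for reading without ever blocking on it.
 * Returns true if the callback ran, false if every attempt found the
 * lock unavailable (e.g. held for write).
 */
static bool try_reap_with_retries(pthread_rwlock_t *lock, reap_fn reap, void *ctx)
{
    /* Assumed "short sleep": 1 ms between attempts. */
    const struct timespec pause = { .tv_sec = 0, .tv_nsec = 1000000 };

    assert(lock != NULL && reap != NULL);

    for (int attempt = 0; attempt < MAX_REAP_ATTEMPTS; attempt++) {
        if (pthread_rwlock_tryrdlock(lock) == 0) {
            reap(ctx);
            pthread_rwlock_unlock(lock);
            return true;
        }
        /* Lock busy: back off briefly instead of waiting on the lock. */
        nanosleep(&pause, NULL);
    }
    return false; /* give up; the caller never blocked on the lock */
}
```

    Giving up after a fixed number of attempts keeps the reaper itself from
    becoming one more task stuck behind the lock it is trying to work
    around.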
    
    The API between the oom killer and the oom reaper is quite trivial.
    wake_oom_reaper updates mm_to_reap with cmpxchg to guarantee only a
    NULL->mm transition, and oom_reaper clears this atomically once it is
    done with the work.  This means that only a single mm_struct can be
    reaped at a time.  As the operation is potentially disruptive we are
    trying to limit it to the necessary minimum, and the reaper blocks any
    updates while it operates on an mm.  mm_struct is pinned by mm_count to
    allow parallel exit_mmap, and a race is detected by
    atomic_inc_not_zero(mm_users).
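
    The single-slot handoff can be illustrated with C11 atomics.  This is a
    userspace sketch under stated assumptions, not the patch itself: struct
    mm is a toy type with only the two counters, and wake_reaper,
    reaper_run_once and inc_not_zero are hypothetical stand-ins for
    wake_oom_reaper, the oom_reaper loop body and atomic_inc_not_zero.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for mm_struct: only the two reference counts. */
struct mm {
    atomic_int mm_users; /* users of the address space */
    atomic_int mm_count; /* references pinning the structure itself */
};

/* The single slot shared by the killer and the reaper. */
static _Atomic(struct mm *) mm_to_reap = NULL;

/* Increment *v unless it is zero; returns true if it was incremented. */
static bool inc_not_zero(atomic_int *v)
{
    int cur = atomic_load(v);
    while (cur != 0) {
        if (atomic_compare_exchange_weak(v, &cur, cur + 1))
            return true;
        /* cur was reloaded by the failed exchange; try again. */
    }
    return false;
}

/* Killer side: queue mm only if the slot is empty (NULL -> mm). */
static bool wake_reaper(struct mm *mm)
{
    struct mm *expected = NULL;

    assert(mm != NULL);
    atomic_fetch_add(&mm->mm_count, 1); /* pin before publishing */
    if (atomic_compare_exchange_strong(&mm_to_reap, &expected, mm))
        return true;
    atomic_fetch_sub(&mm->mm_count, 1); /* slot busy: drop the pin */
    return false;
}

/* Reaper side: returns true if the queued mm was actually reaped. */
static bool reaper_run_once(void)
{
    struct mm *mm = atomic_load(&mm_to_reap);
    bool reaped = false;

    if (mm == NULL)
        return false;
    if (inc_not_zero(&mm->mm_users)) {
        /* Address space still alive; reaping would happen here. */
        reaped = true;
        atomic_fetch_sub(&mm->mm_users, 1);
    }
    atomic_fetch_sub(&mm->mm_count, 1); /* unpin */
    atomic_store(&mm_to_reap, NULL);    /* accept new requests again */
    return reaped;
}
```

    A second wake_reaper call made while the slot is occupied fails and
    leaves the counters unchanged; an mm whose mm_users has already dropped
    to zero is skipped, which models the race with a parallel exit.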
    
    Signed-off-by: Michal Hocko <mhocko@suse.com>
    Suggested-by: Oleg Nesterov <oleg@redhat.com>
    Suggested-by: Mel Gorman <mgorman@suse.de>
    Acked-by: Mel Gorman <mgorman@suse.de>
    Acked-by: David Rientjes <rientjes@google.com>
    Cc: Mel Gorman <mgorman@suse.de>
    Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
    Cc: Oleg Nesterov <oleg@redhat.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Andrea Argangeli <andrea@kernel.org>
    Cc: Rik van Riel <riel@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>