Commit c73322d0 authored by Johannes Weiner, committed by Linus Torvalds

mm: fix 100% CPU kswapd busyloop on unreclaimable nodes

Patch series "mm: kswapd spinning on unreclaimable nodes - fixes and
cleanups".

Jia reported a scenario in which the kswapd of a node indefinitely spins
at 100% CPU usage.  We have seen similar cases at Facebook.

The kernel's current method of judging its ability to reclaim a node (or
whether to back off and sleep) is based on the amount of scanned pages
in proportion to the amount of reclaimable pages.  In Jia's and our
scenarios, there are no reclaimable pages in the node, however, and the
condition for backing off is never met.  Kswapd busyloops in an attempt
to restore the watermarks while having nothing to work with.
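
For reference, the scan-based test in question is pgdat_reclaimable(), whose
remaining callers this series removes.  Roughly (a simplified sketch, not the
verbatim pre-4.11 source):

	/*
	 * Sketch: a node counts as reclaimable as long as the pages scanned
	 * stay below six times its theoretically reclaimable pages.
	 */
	static bool pgdat_reclaimable(struct pglist_data *pgdat)
	{
		return node_page_state_snapshot(pgdat, NR_PAGES_SCANNED) <
			pgdat_reclaimable_pages(pgdat) * 6;
	}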

This series reworks the definition of an unreclaimable node based not on
scanning but on whether kswapd is able to actually reclaim pages in
MAX_RECLAIM_RETRIES (16) consecutive runs.  This is the same criterion
the page allocator uses for giving up on direct reclaim and invoking the
OOM killer.  If it cannot free any pages, kswapd will go to sleep and
leave further attempts to direct reclaim invocations, which will either
make progress and re-enable kswapd, or invoke the OOM killer.

Patch #1 fixes the immediate problem Jia reported, the remainder are
smaller fixlets, cleanups, and overall phasing out of the old method.

Patch #6 is the odd one out.  It's a nice cleanup to get_scan_count(),
and directly related to #5, but in itself not relevant to the series.

If the whole series is too ambitious for 4.11, I would consider the
first three patches fixes, the rest cleanups.

This patch (of 9):

Jia He reports a problem with kswapd spinning at 100% CPU when
requesting more hugepages than memory available in the system:

$ echo 4000 >/proc/sys/vm/nr_hugepages

top - 13:42:59 up  3:37,  1 user,  load average: 1.09, 1.03, 1.01
Tasks:   1 total,   1 running,   0 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.0 us, 12.5 sy,  0.0 ni, 85.5 id,  2.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem:  31371520 total, 30915136 used,   456384 free,      320 buffers
KiB Swap:  6284224 total,   115712 used,  6168512 free.    48192 cached Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
   76 root      20   0       0      0      0 R 100.0 0.000 217:17.29 kswapd3

At that time, there are no reclaimable pages left in the node, but as
kswapd fails to restore the high watermarks it refuses to go to sleep.

Kswapd needs to back away from nodes that fail to balance.  Up until
commit 1d82de61 ("mm, vmscan: make kswapd reclaim in terms of
nodes") kswapd had such a mechanism.  It considered zones whose
theoretically reclaimable pages it had scanned six times over as
unreclaimable and backed away from them.  This guard was erroneously
removed as the patch changed the definition of a balanced node.

However, simply restoring this code wouldn't help in the case reported
here: there *are* no reclaimable pages that could be scanned until the
threshold is met.  Kswapd would stay awake anyway.

Introduce a new and much simpler way of backing off.  If kswapd runs
through MAX_RECLAIM_RETRIES (16) cycles without reclaiming a single
page, make it back off from the node.  This is the same number of shots
direct reclaim takes before declaring OOM.  Kswapd will go to sleep on
that node until a direct reclaimer manages to reclaim some pages, thus
proving the node reclaimable again.
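
Condensed, the mechanism added by the hunks below amounts to the following
(a sketch of the key touch points, not the literal patch):

	/* mm/internal.h: same limit the allocator uses before invoking OOM */
	#define MAX_RECLAIM_RETRIES 16

	/* balance_pgdat(): a full run that reclaims nothing is a failure */
	if (!sc.nr_reclaimed)
		pgdat->kswapd_failures++;

	/* shrink_node(): any reclaim progress resets the counter */
	if (reclaimable)
		pgdat->kswapd_failures = 0;

	/* wakeup_kswapd() / prepare_kswapd_sleep(): hopeless node, leave it
	 * to direct reclaim and keep kswapd asleep */
	if (pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES)
		return;

	/* allow_direct_reclaim(): never throttle direct reclaim on such a node */
	if (pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES)
		return true;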

[hannes@cmpxchg.org: check kswapd failure against the cumulative nr_reclaimed count]
  Link: http://lkml.kernel.org/r/20170306162410.GB2090@cmpxchg.org
[shakeelb@google.com: fix condition for throttle_direct_reclaim]
  Link: http://lkml.kernel.org/r/20170314183228.20152-1-shakeelb@google.com
Link: http://lkml.kernel.org/r/20170228214007.5621-2-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Reported-by: Jia He <hejianet@gmail.com>
Tested-by: Jia He <hejianet@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
parent a87c75fb
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -630,6 +630,8 @@ typedef struct pglist_data {
 	int kswapd_order;
 	enum zone_type kswapd_classzone_idx;
 
+	int kswapd_failures;		/* Number of 'reclaimed == 0' runs */
+
 #ifdef CONFIG_COMPACTION
 	int kcompactd_max_order;
 	enum zone_type kcompactd_classzone_idx;
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -80,6 +80,12 @@ static inline void set_page_refcounted(struct page *page)
 extern unsigned long highest_memmap_pfn;
 
+/*
+ * Maximum number of reclaim retries without progress before the OOM
+ * killer is consider the only way forward.
+ */
+#define MAX_RECLAIM_RETRIES 16
+
 /*
  * in mm/vmscan.c:
  */
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3521,12 +3521,6 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
 	return false;
 }
 
-/*
- * Maximum number of reclaim retries without any progress before OOM killer
- * is consider as the only way to move forward.
- */
-#define MAX_RECLAIM_RETRIES 16
-
 /*
  * Checks whether it makes sense to retry the reclaim to make a forward progress
  * for the given allocation request.
@@ -4534,7 +4528,8 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask)
 			K(node_page_state(pgdat, NR_WRITEBACK_TEMP)),
 			K(node_page_state(pgdat, NR_UNSTABLE_NFS)),
 			node_page_state(pgdat, NR_PAGES_SCANNED),
-			!pgdat_reclaimable(pgdat) ? "yes" : "no");
+			pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES ?
+				"yes" : "no");
 	}
 
 	for_each_populated_zone(zone) {
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2620,6 +2620,15 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
 	} while (should_continue_reclaim(pgdat, sc->nr_reclaimed - nr_reclaimed,
 					 sc->nr_scanned - nr_scanned, sc));
 
+	/*
+	 * Kswapd gives up on balancing particular nodes after too
+	 * many failures to reclaim anything from them and goes to
+	 * sleep. On reclaim progress, reset the failure counter. A
+	 * successful direct reclaim run will revive a dormant kswapd.
+	 */
+	if (reclaimable)
+		pgdat->kswapd_failures = 0;
+
 	return reclaimable;
 }
 
@@ -2694,10 +2703,6 @@ static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
 						 GFP_KERNEL | __GFP_HARDWALL))
 				continue;
 
-			if (sc->priority != DEF_PRIORITY &&
-			    !pgdat_reclaimable(zone->zone_pgdat))
-				continue;	/* Let kswapd poll it */
-
 			/*
 			 * If we already have plenty of memory free for
 			 * compaction in this zone, don't free any more.
@@ -2817,7 +2822,7 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
 	return 0;
 }
 
-static bool pfmemalloc_watermark_ok(pg_data_t *pgdat)
+static bool allow_direct_reclaim(pg_data_t *pgdat)
 {
 	struct zone *zone;
 	unsigned long pfmemalloc_reserve = 0;
@@ -2825,6 +2830,9 @@ static bool pfmemalloc_watermark_ok(pg_data_t *pgdat)
 	int i;
 	bool wmark_ok;
 
+	if (pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES)
+		return true;
+
 	for (i = 0; i <= ZONE_NORMAL; i++) {
 		zone = &pgdat->node_zones[i];
 		if (!managed_zone(zone) ||
@@ -2905,7 +2913,7 @@ static bool throttle_direct_reclaim(gfp_t gfp_mask, struct zonelist *zonelist,
 		/* Throttle based on the first usable node */
 		pgdat = zone->zone_pgdat;
-		if (pfmemalloc_watermark_ok(pgdat))
+		if (allow_direct_reclaim(pgdat))
 			goto out;
 		break;
 	}
@@ -2927,14 +2935,14 @@ static bool throttle_direct_reclaim(gfp_t gfp_mask, struct zonelist *zonelist,
 	 */
 	if (!(gfp_mask & __GFP_FS)) {
 		wait_event_interruptible_timeout(pgdat->pfmemalloc_wait,
-			pfmemalloc_watermark_ok(pgdat), HZ);
+			allow_direct_reclaim(pgdat), HZ);
 
 		goto check_pending;
 	}
 
 	/* Throttle until kswapd wakes the process */
 	wait_event_killable(zone->zone_pgdat->pfmemalloc_wait,
-		pfmemalloc_watermark_ok(pgdat));
+		allow_direct_reclaim(pgdat));
 
 check_pending:
 	if (fatal_signal_pending(current))
@@ -3114,7 +3122,7 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, int classzone_idx)
 	/*
 	 * The throttled processes are normally woken up in balance_pgdat() as
-	 * soon as pfmemalloc_watermark_ok() is true. But there is a potential
+	 * soon as allow_direct_reclaim() is true. But there is a potential
 	 * race between when kswapd checks the watermarks and a process gets
 	 * throttled. There is also a potential race if processes get
 	 * throttled, kswapd wakes, a large process exits thereby balancing the
@@ -3128,6 +3136,10 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, int classzone_idx)
 	if (waitqueue_active(&pgdat->pfmemalloc_wait))
 		wake_up_all(&pgdat->pfmemalloc_wait);
 
+	/* Hopeless node, leave it to direct reclaim */
+	if (pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES)
+		return true;
+
 	for (i = 0; i <= classzone_idx; i++) {
 		struct zone *zone = pgdat->node_zones + i;
 
@@ -3214,9 +3226,9 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
 	count_vm_event(PAGEOUTRUN);
 
 	do {
+		unsigned long nr_reclaimed = sc.nr_reclaimed;
 		bool raise_priority = true;
 
-		sc.nr_reclaimed = 0;
 		sc.reclaim_idx = classzone_idx;
 
 		/*
@@ -3295,7 +3307,7 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
 		 * able to safely make forward progress. Wake them
 		 */
 		if (waitqueue_active(&pgdat->pfmemalloc_wait) &&
-				pfmemalloc_watermark_ok(pgdat))
+				allow_direct_reclaim(pgdat))
 			wake_up_all(&pgdat->pfmemalloc_wait);
 
 		/* Check if kswapd should be suspending */
@@ -3306,10 +3318,14 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
 		 * Raise priority if scanning rate is too low or there was no
 		 * progress in reclaiming pages
 		 */
-		if (raise_priority || !sc.nr_reclaimed)
+		nr_reclaimed = sc.nr_reclaimed - nr_reclaimed;
+		if (raise_priority || !nr_reclaimed)
 			sc.priority--;
 	} while (sc.priority >= 1);
 
+	if (!sc.nr_reclaimed)
+		pgdat->kswapd_failures++;
+
 out:
 	/*
 	 * Return the order kswapd stopped reclaiming at as
@@ -3509,6 +3525,10 @@ void wakeup_kswapd(struct zone *zone, int order, enum zone_type classzone_idx)
 	if (!waitqueue_active(&pgdat->kswapd_wait))
 		return;
 
+	/* Hopeless node, leave it to direct reclaim */
+	if (pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES)
+		return;
+
 	/* Only wake kswapd if all zones are unbalanced */
 	for (z = 0; z <= classzone_idx; z++) {
 		zone = pgdat->node_zones + z;
@@ -3779,9 +3799,6 @@ int node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned int order)
 	    sum_zone_node_page_state(pgdat->node_id, NR_SLAB_RECLAIMABLE) <= pgdat->min_slab_pages)
 		return NODE_RECLAIM_FULL;
 
-	if (!pgdat_reclaimable(pgdat))
-		return NODE_RECLAIM_FULL;
-
 	/*
 	 * Do not scan if the allocation should not be delayed.
 	 */
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1425,7 +1425,7 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
 		   "\n  node_unreclaimable:  %u"
 		   "\n  start_pfn:           %lu"
 		   "\n  node_inactive_ratio: %u",
-		   !pgdat_reclaimable(zone->zone_pgdat),
+		   pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES,
 		   zone->zone_start_pfn,
 		   zone->zone_pgdat->inactive_ratio);
 	seq_putc(m, '\n');