    mm: vmscan: do not throttle based on pfmemalloc reserves if node has no reclaimable pages · f012a84a
    Nishanth Aravamudan authored
    Based upon 675becce ("mm: vmscan: do not throttle based on pfmemalloc
    reserves if node has no ZONE_NORMAL") from Mel.
    
    We have a system with the following topology:
    
    # numactl -H
    available: 3 nodes (0,2-3)
    node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22
    23 24 25 26 27 28 29 30 31
    node 0 size: 28273 MB
    node 0 free: 27323 MB
    node 2 cpus:
    node 2 size: 16384 MB
    node 2 free: 0 MB
    node 3 cpus: 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
    node 3 size: 30533 MB
    node 3 free: 13273 MB
    node distances:
    node   0   2   3
      0:  10  20  20
      2:  20  10  20
      3:  20  20  10
    
    Node 2 has no free memory because its entire 16384 MB is consumed by a
    single 16 GB gigantic hugepage:
    # cat /sys/devices/system/node/node2/hugepages/hugepages-16777216kB/nr_hugepages
    1
    
    This leads to the following zoneinfo:
    
    Node 2, zone      DMA
      pages free     0
            min      1840
            low      2300
            high     2760
            scanned  0
            spanned  262144
            present  262144
            managed  262144
    ...
      all_unreclaimable: 1
    
    If one then attempts to allocate some normal 16M hugepages via

    echo 37 > /proc/sys/vm/nr_hugepages

    the echo never returns and kswapd2 spins, consuming CPU cycles.
    
    This is because throttle_direct_reclaim() ends up calling
    wait_event(pfmemalloc_wait, pfmemalloc_watermark_ok...).
    pfmemalloc_watermark_ok() in turn checks whether any of the node's
    zones hold pfmemalloc reserves and, if they do, only declares the
    watermarks ok when the node has sufficient free pages.
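
    As a reference, here is a minimal sketch of the check described above,
    paraphrasing the vmscan.c logic of this era rather than quoting the
    exact upstream code (the kswapd wakeup details are omitted and the
    function name is only illustrative; populated_zone(), min_wmark_pages()
    and zone_page_state() are the existing mm helpers):

    static bool pfmemalloc_watermark_ok_sketch(pg_data_t *pgdat)
    {
            unsigned long pfmemalloc_reserve = 0;
            unsigned long free_pages = 0;
            struct zone *zone;
            int i;

            /* Sum min watermarks and free pages over the node's lower zones */
            for (i = 0; i <= ZONE_NORMAL; i++) {
                    zone = &pgdat->node_zones[i];
                    if (!populated_zone(zone))  /* 675becce: skip memoryless zones */
                            continue;

                    pfmemalloc_reserve += min_wmark_pages(zone);
                    free_pages += zone_page_state(zone, NR_FREE_PAGES);
            }

            /* No reserves at all (e.g. memoryless node): do not throttle */
            if (!pfmemalloc_reserve)
                    return true;

            /* Throttle until free pages exceed half of the summed reserves */
            return free_pages > pfmemalloc_reserve / 2;
    }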
    
    Commit 675becce already added a condition for memoryless nodes.  In
    this case, though, the node has memory; it is just all consumed (and
    not reclaimable).  Effectively, the result of this call to
    pfmemalloc_watermark_ok() is the same, so skipping zones with no
    reclaimable pages seems like a reasonable additional condition.
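
    A hedged sketch of what that additional condition amounts to in the
    loop from the sketch above, namely also skipping zones whose pages are
    all unreclaimable when summing up the reserves (zone_reclaimable_pages()
    is the existing helper in vmscan.c; this is not necessarily the exact
    upstream diff):

            for (i = 0; i <= ZONE_NORMAL; i++) {
                    zone = &pgdat->node_zones[i];
                    /*
                     * Skip zones that are memoryless (675becce) or whose
                     * memory is entirely unreclaimable, such as Node 2's
                     * DMA zone above, fully consumed by a 16GB hugepage.
                     */
                    if (!populated_zone(zone) ||
                        zone_reclaimable_pages(zone) == 0)
                            continue;

                    pfmemalloc_reserve += min_wmark_pages(zone);
                    free_pages += zone_page_state(zone, NR_FREE_PAGES);
            }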
    
    With this change, the aforementioned 16M hugepage allocation attempt
    succeeds and correctly round-robins between Nodes 0 and 3.
    Signed-off-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
    Reviewed-by: Michal Hocko <mhocko@suse.cz>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Cc: Dave Hansen <dave.hansen@intel.com>
    Cc: Mel Gorman <mgorman@suse.de>
    Cc: Anton Blanchard <anton@samba.org>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Michal Hocko <mhocko@suse.cz>
    Cc: Rik van Riel <riel@redhat.com>
    Cc: Dan Streetman <ddstreet@ieee.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>