    mm/page_alloc: use accumulated load when building node fallback list · 54d032ce
    Krupa Ramakrishnan authored
    In build_zonelists(), when the fallback list is built for each node, the
    node load gets reinitialized during each iteration.  As a result, nodes
    at the same distance occupy the same slot in different node fallback
    lists rather than appearing in the intended round-robin manner, so one
    node gets picked for allocation more often than other nodes at the same
    distance.
    
    As an example, consider a 4 node system with the following distance
    matrix.
    
      Node 0  1  2  3
      ----------------
      0    10 12 32 32
      1    12 10 32 32
      2    32 32 10 12
      3    32 32 12 10
    
    For this case, the node fallback list gets built like this:
    
      Node  Fallback list
      ---------------------
      0     0 1 2 3
      1     1 0 3 2
      2     2 3 0 1
      3     3 2 0 1 <-- Unexpected fallback order
    
    In the fallback lists for nodes 2 and 3, nodes 0 and 1 appear in the
    same order, which results in more allocations being satisfied from node
    0 than from node 1.
    
    The effect of this on remote memory bandwidth, as measured by the
    stream benchmark, is shown below:
    
      Case 1: Bandwidth from cores on nodes 2 & 3 to memory on nodes 0 & 1
    	(numactl -m 0,1 ./stream_lowOverhead ... --cores <from 2, 3>)
      Case 2: Bandwidth from cores on nodes 0 & 1 to memory on nodes 2 & 3
    	(numactl -m 2,3 ./stream_lowOverhead ... --cores <from 0, 1>)
    
      ----------------------------------------
    		BANDWIDTH (MB/s)
          TEST	Case 1		Case 2
      ----------------------------------------
          COPY	57479.6		110791.8
         SCALE	55372.9		105685.9
           ADD	50460.6		96734.2
        TRIADD	50397.6		97119.1
      ----------------------------------------
    
    The bandwidth drop in Case 1 occurs because most of the allocations get
    satisfied by node 0 as it appears first in the fallback order for both
    nodes 2 and 3.
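
    The skew described above can be illustrated with a small tally over the
    pre-fix fallback table.  This is only a sketch: it assumes each
    allocation goes to the first allowed memory node in the requesting
    node's fallback list, which simplifies how the allocator behaves under
    a memory policy.  The helper name first_choice_among() is hypothetical.

```python
from collections import Counter

# Fallback lists copied from the pre-fix table in this changelog.
FALLBACK = {
    0: [0, 1, 2, 3],
    1: [1, 0, 3, 2],
    2: [2, 3, 0, 1],
    3: [3, 2, 0, 1],
}


def first_choice_among(fallback, cpu_nodes, mem_nodes):
    """Count, per memory node, how many CPU nodes try it first.

    Assumption: an allocation is served by the first node in the fallback
    list that is also in the allowed memory node set.
    """
    picks = Counter()
    for cpu_node in cpu_nodes:
        first = next(n for n in fallback[cpu_node] if n in mem_nodes)
        picks[first] += 1
    return picks


if __name__ == "__main__":
    # Case 1: cores on nodes 2 and 3, memory restricted to nodes 0 and 1.
    print(first_choice_among(FALLBACK, cpu_nodes=[2, 3], mem_nodes={0, 1}))
    # Case 2: cores on nodes 0 and 1, memory restricted to nodes 2 and 3.
    print(first_choice_among(FALLBACK, cpu_nodes=[0, 1], mem_nodes={2, 3}))
```

    Under this assumption, both nodes 2 and 3 try node 0 first in Case 1,
    while in Case 2 nodes 0 and 1 try nodes 2 and 3 respectively.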
    
    This can be fixed by accumulating the node load in build_zonelists()
    rather than reinitializing it during each iteration.  With this change,
    nodes at the same distance are correctly assigned in a round-robin
    manner.
    
    In fact, this is how it originally worked until commit f0c0b2b8
    ("change zonelist order: zonelist order selection logic") dropped the
    load accumulation and switched to initializing the load during each
    iteration.
    
    While zonelist ordering was removed by commit c9bff3ee ("mm,
    page_alloc: rip out ZONELIST_ORDER_ZONE"), the change to the node load
    accumulation in build_zonelists() remained.  So essentially this patch
    restores the accumulated node load logic.
    
    After this fix, the fallback order gets built like this:
    
      Node Fallback list
      ------------------
      0    0 1 2 3
      1    1 0 3 2
      2    2 3 0 1
      3    3 2 1 0 <-- Note the change here
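
    The difference between reinitializing and accumulating the load can be
    reproduced with a small user-space model.  This is a sketch, not the
    kernel code: the scoring details (the penalty for lower-numbered nodes,
    the per-node CPU penalty, the scale factor and the MAX_NUMNODES value)
    are assumptions that are not stated in this changelog, and every node is
    assumed to have both CPUs and memory.

```python
# Simplified model of per-node fallback list construction.
# Assumptions (not from the changelog): scoring details, MAX_NUMNODES = 64,
# all nodes have CPUs and memory, lists are built in ascending node order.

DISTANCE = [
    [10, 12, 32, 32],
    [12, 10, 32, 32],
    [32, 32, 10, 12],
    [32, 32, 12, 10],
]
MAX_NUMNODES = 64               # assumed; keeps distance dominant over load
PENALTY_FOR_NODE_WITH_CPUS = 1  # assumed; applied to every node here


def find_next_best_node(local, used, node_load, distance):
    """Return the closest unused node; ties go to the less loaded node."""
    nr_nodes = len(distance)
    if local not in used:       # the local node always comes first
        used.add(local)
        return local
    best, best_val = None, None
    for n in range(nr_nodes):
        if n in used:
            continue
        val = distance[local][n]
        val += 1 if n < local else 0       # prefer the next node
        val += PENALTY_FOR_NODE_WITH_CPUS
        val = val * (nr_nodes * MAX_NUMNODES) + node_load[n]
        if best_val is None or val < best_val:
            best, best_val = n, val
    if best is not None:
        used.add(best)
    return best


def build_fallback_lists(distance, accumulate):
    """Build one fallback list per node, sharing node_load across builds."""
    nr_nodes = len(distance)
    node_load = [0] * nr_nodes
    lists = {}
    for local in range(nr_nodes):
        used, order = set(), []
        load, prev = nr_nodes, local
        while True:
            node = find_next_best_node(local, used, node_load, distance)
            if node is None:
                break
            # Penalize the first node of each new distance group.
            if distance[local][node] != distance[local][prev]:
                if accumulate:
                    node_load[node] += load   # accumulated load (the fix)
                else:
                    node_load[node] = load    # reinitialized load (the bug)
            order.append(node)
            prev = node
            load -= 1
        lists[local] = order
    return lists


if __name__ == "__main__":
    for accumulate in (False, True):
        print("accumulate =", accumulate)
        for node, order in build_fallback_lists(DISTANCE, accumulate).items():
            print(" ", node, order)
```

    Under these assumptions, accumulate=False yields 3 2 0 1 for node 3,
    and accumulate=True yields 3 2 1 0, matching the two tables in this
    changelog.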
    
    The bandwidth in Case 1 improves and matches Case 2 as shown below.
    
      ----------------------------------------
    		BANDWIDTH (MB/s)
          TEST	Case 1		Case 2
      ----------------------------------------
          COPY	110438.9	110107.2
         SCALE	105930.5	105817.5
           ADD	97005.1		96159.8
        TRIADD	97441.5		96757.1
      ----------------------------------------
    
    The correctness of the fallback list generation has been verified for
    the above node configuration with node 3 starting as a memory-less node
    and coming online only during memory hotplug.
    
    [bharata@amd.com: Added changelog, review, test validation]
    
    Link: https://lkml.kernel.org/r/20210830121603.1081-3-bharata@amd.com
    Fixes: f0c0b2b8 ("change zonelist order: zonelist order selection logic")
    Signed-off-by: Krupa Ramakrishnan <krupa.ramakrishnan@amd.com>
    Co-developed-by: Sadagopan Srinivasan <Sadagopan.Srinivasan@amd.com>
    Signed-off-by: Sadagopan Srinivasan <Sadagopan.Srinivasan@amd.com>
    Signed-off-by: Bharata B Rao <bharata@amd.com>
    Acked-by: Mel Gorman <mgorman@suse.de>
    Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
    Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
    Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>