    swap: choose swap device according to numa node · a2468cc9
    Aaron Lu authored
    If the system has more than one swap device and the swap devices have
    node information, we can use this information in get_swap_pages() to
    decide which swap device to use and get better performance.
    
    The current code uses a priority based list, swap_avail_list, to decide
    which swap device to use; if multiple swap devices share the same
    priority, they are used round robin.  This patch changes the previous
    single global swap_avail_list into a per-numa-node list, i.e. each
    numa node sees its own priority based list of available swap
    devices.  A swap device's priority can be promoted on its matching
    node's swap_avail_list.
    
    Currently a swap device's priority is set as follows: the user can set
    a >=0 value, or the system picks one starting from -1 and going
    downwards.  The priority value in the swap_avail_list is the negated
    value of the swap device's priority, because plist is sorted from low
    to high.  The new policy doesn't change the semantics for the priority
    >=0 cases.  The automatic values, which previously started from -1 and
    went downwards, now start from -2 and go downwards, and -1 is reserved
    as the promoted value.
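
    The priority assignment described above can be sketched as follows.
    This is a minimal illustration, not the patch itself: the function
    assign_swap_priority() and its parameters are hypothetical names, and
    the counter is assumed to start at -1 so that the first automatic
    value handed out is -2.

```c
#include <stdbool.h>

/*
 * Hypothetical counter: the lowest automatically assigned priority so far.
 * Starting at -1 means the first automatic priority is -2, which leaves
 * -1 free as the reserved "promoted" value.
 */
static int least_priority = -1;

/*
 * Return the priority a newly enabled swap device would get.
 * user_set:  true if the user supplied a priority (expected to be >= 0).
 * user_prio: the user supplied value, ignored when user_set is false.
 */
int assign_swap_priority(bool user_set, int user_prio)
{
    if (user_set)
        return user_prio;       /* semantics unchanged for >= 0 values */
    return --least_priority;    /* yields -2, -3, -4, ... */
}
```

    Calling assign_swap_priority(false, 0) four times in a row gives -2,
    -3, -4 and -5, which are the new priorities of swapA to swapD in the
    example below.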
    
    Take a 4-node EX machine as an example, and suppose 4 swap devices are
    available, each sitting on a different node:
    swapA on node 0
    swapB on node 1
    swapC on node 2
    swapD on node 3
    
    They are all swapped on in the order A, B, C, D.
    
    Current behaviour:
    their priorities will be:
    swapA: -1
    swapB: -2
    swapC: -3
    swapD: -4
    And their position in the global swap_avail_list will be:
    swapA   -> swapB   -> swapC   -> swapD
    prio:1     prio:2     prio:3     prio:4
    
    New behaviour:
    their priorities will be (note that -1 is skipped):
    swapA: -2
    swapB: -3
    swapC: -4
    swapD: -5
    And their positions in the 4 swap_avail_lists[nid] will be:
    swap_avail_lists[0]: /* node 0's available swap device list */
    swapA   -> swapB   -> swapC   -> swapD
    prio:1     prio:3     prio:4     prio:5
    swap_avail_lists[1]: /* node 1's available swap device list */
    swapB   -> swapA   -> swapC   -> swapD
    prio:1     prio:2     prio:4     prio:5
    swap_avail_lists[2]: /* node 2's available swap device list */
    swapC   -> swapA   -> swapB   -> swapD
    prio:1     prio:2     prio:3     prio:5
    swap_avail_lists[3]: /* node 3's available swap device list */
    swapD   -> swapA   -> swapB   -> swapC
    prio:1     prio:2     prio:3     prio:4
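
    The per-node ordering above can be reproduced with a small userspace
    sketch.  It is an illustration under stated assumptions, not kernel
    code: struct swap_dev, list_key() and node_swap_order() are
    hypothetical names, qsort() stands in for the kernel's plist, and
    promotion is assumed to apply only to automatically assigned
    (negative) priorities, since the >=0 semantics are unchanged.

```c
#include <stdlib.h>

#define NR_DEVS 4

/* Hypothetical stand-in for a swap device in the 4-node example. */
struct swap_dev {
    char name;  /* 'A' .. 'D' */
    int node;   /* NUMA node the device sits on */
    int prio;   /* new-style automatic priority: -2, -3, ... */
};

/*
 * Key of a device on node nid's list.  Keys sort from low to high, so the
 * key is the negated priority, except that a device on its own node is
 * promoted to key 1 (the reserved priority -1).
 */
static int list_key(const struct swap_dev *d, int nid)
{
    if (d->prio < 0 && d->node == nid)
        return 1;
    return -d->prio;
}

/* qsort() has no context argument, so the node is passed via a static. */
static int sort_nid;

static int cmp_key(const void *a, const void *b)
{
    int ka = list_key(a, sort_nid);
    int kb = list_key(b, sort_nid);

    return (ka > kb) - (ka < kb);
}

/* Return the device names in the order node nid would try them. */
const char *node_swap_order(int nid)
{
    static char order[NR_DEVS + 1];
    struct swap_dev devs[NR_DEVS] = {
        { 'A', 0, -2 }, { 'B', 1, -3 }, { 'C', 2, -4 }, { 'D', 3, -5 },
    };
    int i;

    sort_nid = nid;
    qsort(devs, NR_DEVS, sizeof(devs[0]), cmp_key);
    for (i = 0; i < NR_DEVS; i++)
        order[i] = devs[i].name;
    order[NR_DEVS] = '\0';
    return order;
}
```

    For nid 1, the keys are A:2, B:1, C:4 and D:5, so the order is
    B, A, C, D, matching swap_avail_lists[1] above.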
    
    To see the effect of the patch, a test is used that starts N processes;
    each process mmaps a region of anonymous memory and then continually
    writes to it at random positions to trigger both swap in and swap out.
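
    The per-process part of such a workload might look like the sketch
    below.  The actual test program is not part of this commit, so
    random_write_worker() and its parameters are hypothetical; a real run
    would fork N such workers, loop for a fixed runtime instead of a
    fixed write count, and use regions large enough to exceed RAM.

```c
#define _DEFAULT_SOURCE     /* for MAP_ANONYMOUS under strict C modes */
#include <stddef.h>
#include <stdlib.h>
#include <sys/mman.h>

/*
 * Map size bytes of anonymous memory and write one byte at each of
 * `writes` pseudo-random offsets.  Returns the number of writes done,
 * or -1 if size is 0 or the mapping fails.
 *
 * Note: rand() only covers offsets up to RAND_MAX, so a multi-GiB region
 * would need a wider random source than this sketch uses.
 */
long random_write_worker(size_t size, long writes, unsigned int seed)
{
    unsigned char *region;
    long done;

    if (size == 0)
        return -1;

    region = mmap(NULL, size, PROT_READ | PROT_WRITE,
                  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (region == MAP_FAILED)
        return -1;

    srand(seed);
    for (done = 0; done < writes; done++) {
        size_t off = (size_t)rand() % size;

        region[off] = (unsigned char)done;  /* dirty the page */
    }

    munmap(region, size);
    return done;
}
```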
    
    The test ran on a 2-node Skylake EP machine with 64GiB memory, using two
    170GB SSD drives as swap devices, each attached to a different node.  The
    result is:
    
    runtime=30m/processes=32/total test size=128G/each process mmap region=4G
    kernel         throughput
    vanilla        13306
    auto-binding   15169 +14%
    
    runtime=30m/processes=64/total test size=128G/each process mmap region=2G
    kernel         throughput
    vanilla        11885
    auto-binding   14879 +25%
    
    [aaron.lu@intel.com: v2]
      Link: http://lkml.kernel.org/r/20170814053130.GD2369@aaronlu.sh.intel.com
      Link: http://lkml.kernel.org/r/20170816024439.GA10925@aaronlu.sh.intel.com
    [akpm@linux-foundation.org: use kmalloc_array()]
    Link: http://lkml.kernel.org/r/20170814053130.GD2369@aaronlu.sh.intel.com
    Link: http://lkml.kernel.org/r/20170816024439.GA10925@aaronlu.sh.intel.com
    Signed-off-by: Aaron Lu <aaron.lu@intel.com>
    Cc: "Chen, Tim C" <tim.c.chen@intel.com>
    Cc: Huang Ying <ying.huang@intel.com>
    Cc: Andi Kleen <andi@firstfloor.org>
    Cc: Michal Hocko <mhocko@suse.com>
    Cc: Minchan Kim <minchan@kernel.org>
    Cc: Hugh Dickins <hughd@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>