    swap: add a simple detector for inappropriate swapin readahead · 579f8290
    Shaohua Li authored
    This is a patch to improve the swap readahead algorithm.  It's from Hugh,
    and I slightly changed it.
    
    Hugh's original changelog:
    
    swapin readahead does a blind readahead, whether or not the swapin is
    sequential.  This may be ok on a hard disk, because large reads have
    relatively small costs, and if the readahead pages are unneeded they can
    be reclaimed easily - though, what if their allocation forced reclaim of
    useful pages?  But on SSD devices large reads are more expensive than
    small ones: if the readahead pages are unneeded, reading them in causes
    significant overhead.
    
    This patch adds very simplistic random read detection.  Stealing the
    PageReadahead technique from Konstantin Khlebnikov's patch, avoiding the
    vma/anon_vma sophistications of Shaohua Li's patch, swapin_nr_pages()
    simply looks at readahead's current success rate, and narrows or widens
    its readahead window accordingly.  There is little science to its
    heuristic: it's about as stupid as can be whilst remaining effective.
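    
    To make the shape of that heuristic concrete, here is a minimal user-space
    sketch (not the kernel code; next_readahead_window() and readahead_hits are
    hypothetical stand-ins for swapin_nr_pages() and its hit counter): widen
    the window as hits accumulate, drop back to a single page when there were
    none, and cap it at 1 << page_cluster.
    
    #include <stdio.h>
    
    /*
     * Minimal user-space sketch of a hit-rate-driven readahead window
     * (hypothetical names, not the kernel's swapin_nr_pages()).
     */
    static unsigned int page_cluster = 3;   /* default: 1 << 3 = 8-page reads */
    static unsigned int readahead_hits;     /* pages of the last window actually used */
    
    static unsigned int next_readahead_window(void)
    {
        unsigned int max_pages = 1u << page_cluster;
        unsigned int hits = readahead_hits;
        unsigned int pages;
    
        readahead_hits = 0;                 /* count the next window afresh */
    
        if (hits == 0)
            return 1;                       /* no hits: read only the faulting page */
    
        /* Widen with success: smallest power of two greater than the hit count. */
        for (pages = 2; pages <= hits; pages <<= 1)
            ;
    
        return pages < max_pages ? pages : max_pages;
    }
    
    int main(void)
    {
        unsigned int hit_pattern[] = { 0, 1, 3, 7, 7, 0 };
        unsigned int i;
    
        /* Simulate a few rounds with varying numbers of readahead hits. */
        for (i = 0; i < sizeof(hit_pattern) / sizeof(hit_pattern[0]); i++) {
            readahead_hits = hit_pattern[i];
            printf("hits=%u -> next window = %u pages\n",
                   hit_pattern[i], next_readahead_window());
        }
        return 0;
    }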
    
    The table below shows elapsed times (in centiseconds) when running a
    single repetitive swapping load across a 1000MB mapping in 900MB RAM
    with 1GB swap (the hard disk tests took painfully long when I used
    mem=500M, but SSD shows similar results for that).
    
    Vanilla is the 3.6-rc7 kernel on which I started; Shaohua denotes his
    Sep 3 patch in mmotm and linux-next; HughOld denotes my Oct 1 patch,
    which Shaohua showed to be defective; HughNew denotes this Nov 14 patch,
    with page_cluster as usual at its default of 3 (8-page reads); HughPC4 the
    same patch with page_cluster 4 (16-page reads); HughPC0 with page_cluster 0
    (1-page reads: no readahead).
    
    HDD for swapping to hard disk, SSD for swapping to VertexII SSD.  Seq for
    sequential access to the mapping, cycling five times around; Rand for
    the same number of random touches.  Anon for a MAP_PRIVATE anon mapping;
    Shmem for a MAP_SHARED anon mapping, equivalent to tmpfs.
    
    One weakness of Shaohua's vma/anon_vma approach was that it did not
    optimize Shmem, as seen below.  Konstantin's approach was perhaps mistuned,
    50% slower on Seq: it did not compete and is not shown below.
    
    HDD        Vanilla Shaohua HughOld HughNew HughPC4 HughPC0
    Seq Anon     73921   76210   75611   76904   78191  121542
    Seq Shmem    73601   73176   73855   72947   74543  118322
    Rand Anon   895392  831243  871569  845197  846496  841680
    Rand Shmem 1058375 1053486  827935  764955  764376  756489
    
    SSD        Vanilla Shaohua HughOld HughNew HughPC4 HughPC0
    Seq Anon     24634   24198   24673   25107   21614   70018
    Seq Shmem    24959   24932   25052   25703   22030   69678
    Rand Anon    43014   26146   28075   25989   26935   25901
    Rand Shmem   45349   45215   28249   24268   24138   24332
    
    These tests are, of course, two extremes of a very simple case: under
    heavier mixed loads I've not yet observed any consistent improvement or
    degradation, and wider testing would be welcome.
    
    Shaohua Li:
    
    Tests show Vanilla is slightly better than Hugh's patch on the sequential
    workload.  I observed that with Hugh's patch the readahead size sometimes
    shrinks too fast (from 8 to 1 immediately) in the sequential workload when
    there is no hit, and in that case continuing to do readahead is actually
    beneficial.
    
    I didn't prepare a sophisticated algorithm for the sequential workload
    because so far we can't guarantee sequentially accessed pages are swapped
    out sequentially.  So I slightly changed Hugh's heuristic - don't shrink
    the readahead size too fast.
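    
    A minimal sketch of that adjustment, with the same caveat that the names
    are hypothetical and this is not the kernel code: remember the previous
    window and never let the new one drop below half of it, so one round with
    no hits decays 8 -> 4 rather than collapsing straight to 1.
    
    #include <stdio.h>
    
    /* Hypothetical "don't shrink too fast" clamp on the readahead window. */
    static unsigned int clamp_shrink(unsigned int pages, unsigned int *last_pages)
    {
        unsigned int floor = *last_pages / 2;
    
        if (pages < floor)
            pages = floor;              /* shrink by at most half per round */
        *last_pages = pages;
        return pages;
    }
    
    int main(void)
    {
        unsigned int last = 8;          /* previous window: 8 pages */
        unsigned int round;
    
        /* The raw heuristic keeps asking for 1 page; the clamp decays 8 -> 4 -> 2 -> 1. */
        for (round = 0; round < 4; round++)
            printf("round %u: window = %u pages\n", round, clamp_shrink(1, &last));
        return 0;
    }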
    
    Here is my test result (in seconds, averaged over 3 runs):
    	Vanilla		Hugh		New
    Seq	356		370		360
    Random	4525		2447		2444
    
    The attached graph shows the swapin/swapout throughput I collected with
    'vmstat 2'.  The first part runs a random workload (until around 1200 on
    the x-axis) and the second part runs a sequential workload.  Swapin and
    swapout throughput are almost identical in steady state in both workloads,
    which is the expected behavior; in Vanilla, by contrast, swapin is much
    larger than swapout, especially in the random workload (because of wrong
    readahead).
    
    Original patches by: Shaohua Li and Konstantin Khlebnikov.
    
    [fengguang.wu@intel.com: swapin_nr_pages() can be static]
    Signed-off-by: Hugh Dickins <hughd@google.com>
    Signed-off-by: Shaohua Li <shli@fusionio.com>
    Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
    Cc: Rik van Riel <riel@redhat.com>
    Cc: Wu Fengguang <fengguang.wu@intel.com>
    Cc: Minchan Kim <minchan@kernel.org>
    Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>