• Rik van Riel's avatar
    mm: change anon_vma linking to fix multi-process server scalability issue · 5beb4930
    Rik van Riel authored
    The old anon_vma code can lead to scalability issues with heavily forking
    workloads.  Specifically, each anon_vma will be shared between the parent
    process and all its child processes.
    
    In a workload with 1000 child processes and a VMA with 1000 anonymous
    pages per process that get COWed, this leads to a system with a million
    anonymous pages in the same anon_vma, each of which is mapped in just one
    of the 1000 processes.  However, the current rmap code needs to walk them
    all, leading to O(N) scanning complexity for each page.
    
    This can result in systems where one CPU is walking the page tables of
    1000 processes in page_referenced_one, while all other CPUs are stuck on
    the anon_vma lock.  This leads to catastrophic failure for a benchmark
    like AIM7, where the total number of processes can reach in the tens of
    thousands.  Real workloads are still a factor 10 less process intensive
    than AIM7, but they are catching up.
    
    This patch changes the way anon_vmas and VMAs are linked, which allows us
    to associate multiple anon_vmas with a VMA.  At fork time, each child
    process gets its own anon_vmas, in which its COWed pages will be
    instantiated.  The parents' anon_vma is also linked to the VMA, because
    non-COWed pages could be present in any of the children.
    
    This reduces rmap scanning complexity to O(1) for the pages of the 1000
    child processes, with O(N) complexity for at most 1/N pages in the system.
     This reduces the average scanning cost in heavily forking workloads from
    O(N) to 2.
    
    The only real complexity in this patch stems from the fact that linking a
    VMA to anon_vmas now involves memory allocations.  This means vma_adjust
    can fail, if it needs to attach a VMA to anon_vma structures.  This in
    turn means error handling needs to be added to the calling functions.
    
    A second source of complexity is that, because there can be multiple
    anon_vmas, the anon_vma linking in vma_adjust can no longer be done under
    "the" anon_vma lock.  To prevent the rmap code from walking up an
    incomplete VMA, this patch introduces the VM_LOCK_RMAP VMA flag.  This bit
    flag uses the same slot as the NOMMU VM_MAPPED_COPY, with an ifdef in mm.h
    to make sure it is impossible to compile a kernel that needs both symbolic
    values for the same bitflag.
    
    Some test results:
    
    Without the anon_vma changes, when AIM7 hits around 9.7k users (on a test
    box with 16GB RAM and not quite enough IO), the system ends up running
    >99% in system time, with every CPU on the same anon_vma lock in the
    pageout code.
    
    With these changes, AIM7 hits the cross-over point around 29.7k users.
    This happens with ~99% IO wait time, there never seems to be any spike in
    system time.  The anon_vma lock contention appears to be resolved.
    
    [akpm@linux-foundation.org: cleanups]
    Signed-off-by: default avatarRik van Riel <riel@redhat.com>
    Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
    Cc: Larry Woodman <lwoodman@redhat.com>
    Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
    Cc: Minchan Kim <minchan.kim@gmail.com>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
    Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
    Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
    5beb4930
rmap.c 39.5 KB