1. 09 Oct, 2012 40 commits
    • Michel Lespinasse's avatar
      mm anon rmap: replace same_anon_vma linked list with an interval tree. · bf181b9f
      Michel Lespinasse authored
      When a large VMA (anon or private file mapping) is first touched, which
      will populate its anon_vma field, and then split into many regions through
      the use of mprotect(), the original anon_vma ends up linking all of the
      vmas on a linked list.  This can cause rmap to become inefficient, as we
      have to walk potentially thousands of irrelevent vmas before finding the
      one a given anon page might fall into.
      
      By replacing the same_anon_vma linked list with an interval tree (where
      each avc's interval is determined by its vma's start and last pgoffs), we
      can make rmap efficient for this use case again.
      
      While the change is large, all of its pieces are fairly simple.
      
      Most places that were walking the same_anon_vma list were looking for a
      known pgoff, so they can just use the anon_vma_interval_tree_foreach()
      interval tree iterator instead.  The exception here is ksm, where the
      page's index is not known.  It would probably be possible to rework ksm so
      that the index would be known, but for now I have decided to keep things
      simple and just walk the entirety of the interval tree there.
      
      When updating vma's that already have an anon_vma assigned, we must take
      care to re-index the corresponding avc's on their interval tree.  This is
      done through the use of anon_vma_interval_tree_pre_update_vma() and
      anon_vma_interval_tree_post_update_vma(), which remove the avc's from
      their interval tree before the update and re-insert them after the update.
       The anon_vma stays locked during the update, so there is no chance that
      rmap would miss the vmas that are being updated.
      Signed-off-by: default avatarMichel Lespinasse <walken@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Daniel Santos <daniel.santos@pobox.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      bf181b9f
    • Michel Lespinasse's avatar
      mm anon rmap: remove anon_vma_moveto_tail · 108d6642
      Michel Lespinasse authored
      mremap() had a clever optimization where move_ptes() did not take the
      anon_vma lock to avoid a race with anon rmap users such as page migration.
       Instead, the avc's were ordered in such a way that the origin vma was
      always visited by rmap before the destination.  This ordering and the use
      of page table locks rmap usage safe.  However, we want to replace the use
      of linked lists in anon rmap with an interval tree, and this will make it
      harder to impose such ordering as the interval tree will always be sorted
      by the avc->vma->vm_pgoff value.  For now, let's replace the
      anon_vma_moveto_tail() ordering function with proper anon_vma locking in
      move_ptes().  Once we have the anon interval tree in place, we will
      re-introduce an optimization to avoid taking these locks in the most
      common cases.
      Signed-off-by: default avatarMichel Lespinasse <walken@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Daniel Santos <daniel.santos@pobox.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      108d6642
    • Michel Lespinasse's avatar
      mm: interval tree updates · 9826a516
      Michel Lespinasse authored
      Update the generic interval tree code that was introduced in "mm: replace
      vma prio_tree with an interval tree".
      
      Changes:
      
      - fixed 'endpoing' typo noticed by Andrew Morton
      
      - replaced include/linux/interval_tree_tmpl.h, which was used as a
        template (including it automatically defined the interval tree
        functions) with include/linux/interval_tree_generic.h, which only
        defines a preprocessor macro INTERVAL_TREE_DEFINE(), which itself
        defines the interval tree functions when invoked. Now that is a very
        long macro which is unfortunate, but it does make the usage sites
        (lib/interval_tree.c and mm/interval_tree.c) a bit nicer than previously.
      
      - make use of RB_DECLARE_CALLBACKS() in the INTERVAL_TREE_DEFINE() macro,
        instead of duplicating that code in the interval tree template.
      
      - replaced vma_interval_tree_add(), which was actually handling the
        nonlinear and interval tree cases, with vma_interval_tree_insert_after()
        which handles only the interval tree case and has an API that is more
        consistent with the other interval tree handling functions.
        The nonlinear case is now handled explicitly in kernel/fork.c dup_mmap().
      Signed-off-by: default avatarMichel Lespinasse <walken@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Daniel Santos <daniel.santos@pobox.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      9826a516
    • Michel Lespinasse's avatar
      rbtree: move augmented rbtree functionality to rbtree_augmented.h · 9c079add
      Michel Lespinasse authored
      Provide rb_insert_augmented() and rb_erase_augmented() through a new
      rbtree_augmented.h include file.  rb_erase_augmented() is defined there as
      an __always_inline function, in order to allow inlining of augmented
      rbtree callbacks into it.  Since this generates a relatively large
      function, each augmented rbtree user should make sure to have a single
      call site.
      Signed-off-by: default avatarMichel Lespinasse <walken@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Hillf Danton <dhillf@gmail.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      9c079add
    • Michel Lespinasse's avatar
      prio_tree: remove · 147e615f
      Michel Lespinasse authored
      After both prio_tree users have been converted to use red-black trees,
      there is no need to keep around the prio tree library anymore.
      Signed-off-by: default avatarMichel Lespinasse <walken@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Hillf Danton <dhillf@gmail.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      147e615f
    • Michel Lespinasse's avatar
      kmemleak: use rbtree instead of prio tree · 85d3a316
      Michel Lespinasse authored
      kmemleak uses a tree where each node represents an allocated memory object
      in order to quickly find out what object a given address is part of.
      However, the objects don't overlap, so rbtrees are a better choice than
      prio tree for this use.  They are both faster and have lower memory
      overhead.
      
      Tested by booting a kernel with kmemleak enabled, loading the
      kmemleak_test module, and looking for the expected messages.
      Signed-off-by: default avatarMichel Lespinasse <walken@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Hillf Danton <dhillf@gmail.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Acked-by: default avatarCatalin Marinas <catalin.marinas@arm.com>
      Tested-by: default avatarCatalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      85d3a316
    • Michel Lespinasse's avatar
      mm: replace vma prio_tree with an interval tree · 6b2dbba8
      Michel Lespinasse authored
      Implement an interval tree as a replacement for the VMA prio_tree.  The
      algorithms are similar to lib/interval_tree.c; however that code can't be
      directly reused as the interval endpoints are not explicitly stored in the
      VMA.  So instead, the common algorithm is moved into a template and the
      details (node type, how to get interval endpoints from the node, etc) are
      filled in using the C preprocessor.
      
      Once the interval tree functions are available, using them as a
      replacement to the VMA prio tree is a relatively simple, mechanical job.
      Signed-off-by: default avatarMichel Lespinasse <walken@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Hillf Danton <dhillf@gmail.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      6b2dbba8
    • Michel Lespinasse's avatar
      rbtree: add prio tree and interval tree tests · fff3fd8a
      Michel Lespinasse authored
      Patch 1 implements support for interval trees, on top of the augmented
      rbtree API. It also adds synthetic tests to compare the performance of
      interval trees vs prio trees. Short answers is that interval trees are
      slightly faster (~25%) on insert/erase, and much faster (~2.4 - 3x)
      on search. It is debatable how realistic the synthetic test is, and I have
      not made such measurements yet, but my impression is that interval trees
      would still come out faster.
      
      Patch 2 uses a preprocessor template to make the interval tree generic,
      and uses it as a replacement for the vma prio_tree.
      
      Patch 3 takes the other prio_tree user, kmemleak, and converts it to use
      a basic rbtree. We don't actually need the augmented rbtree support here
      because the intervals are always non-overlapping.
      
      Patch 4 removes the now-unused prio tree library.
      
      Patch 5 proposes an additional optimization to rb_erase_augmented, now
      providing it as an inline function so that the augmented callbacks can be
      inlined in. This provides an additional 5-10% performance improvement
      for the interval tree insert/erase benchmark. There is a maintainance cost
      as it exposes augmented rbtree users to some of the rbtree library internals;
      however I think this cost shouldn't be too high as I expect the augmented
      rbtree will always have much less users than the base rbtree.
      
      I should probably add a quick summary of why I think it makes sense to
      replace prio trees with augmented rbtree based interval trees now.  One of
      the drivers is that we need augmented rbtrees for Rik's vma gap finding
      code, and once you have them, it just makes sense to use them for interval
      trees as well, as this is the simpler and more well known algorithm.  prio
      trees, in comparison, seem *too* clever: they impose an additional 'heap'
      constraint on the tree, which they use to guarantee a faster worst-case
      complexity of O(k+log N) for stabbing queries in a well-balanced prio
      tree, vs O(k*log N) for interval trees (where k=number of matches,
      N=number of intervals).  Now this sounds great, but in practice prio trees
      don't realize this theorical benefit.  First, the additional constraint
      makes them harder to update, so that the kernel implementation has to
      simplify things by balancing them like a radix tree, which is not always
      ideal.  Second, the fact that there are both index and heap properties
      makes both tree manipulation and search more complex, which results in a
      higher multiplicative time constant.  As it turns out, the simple interval
      tree algorithm ends up running faster than the more clever prio tree.
      
      This patch:
      
      Add two test modules:
      
      - prio_tree_test measures the performance of lib/prio_tree.c, both for
        insertion/removal and for stabbing searches
      
      - interval_tree_test measures the performance of a library of equivalent
        functionality, built using the augmented rbtree support.
      
      In order to support the second test module, lib/interval_tree.c is
      introduced. It is kept separate from the interval_tree_test main file
      for two reasons: first we don't want to provide an unfair advantage
      over prio_tree_test by having everything in a single compilation unit,
      and second there is the possibility that the interval tree functionality
      could get some non-test users in kernel over time.
      Signed-off-by: default avatarMichel Lespinasse <walken@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Hillf Danton <dhillf@gmail.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      fff3fd8a
    • Michel Lespinasse's avatar
      rbtree: add RB_DECLARE_CALLBACKS() macro · 3908836a
      Michel Lespinasse authored
      As proposed by Peter Zijlstra, this makes it easier to define the augmented
      rbtree callbacks.
      Signed-off-by: default avatarMichel Lespinasse <walken@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      3908836a
    • Michel Lespinasse's avatar
      rbtree: remove prior augmented rbtree implementation · 9d9e6f97
      Michel Lespinasse authored
      convert arch/x86/mm/pat_rbtree.c to the proposed augmented rbtree api
      and remove the old augmented rbtree implementation.
      Signed-off-by: default avatarMichel Lespinasse <walken@google.com>
      Acked-by: default avatarRik van Riel <riel@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      9d9e6f97
    • Michel Lespinasse's avatar
      rbtree: faster augmented rbtree manipulation · 14b94af0
      Michel Lespinasse authored
      Introduce new augmented rbtree APIs that allow minimal recalculation of
      augmented node information.
      
      A new callback is added to the rbtree insertion and erase rebalancing
      functions, to be called on each tree rotations. Such rotations preserve
      the subtree's root augmented value, but require recalculation of the one
      child that was previously located at the subtree root.
      
      In the insertion case, the handcoded search phase must be updated to
      maintain the augmented information on insertion, and then the rbtree
      coloring/rebalancing algorithms keep it up to date.
      
      In the erase case, things are more complicated since it is library
      code that manipulates the rbtree in order to remove internal nodes.
      This requires a couple additional callbacks to copy a subtree's
      augmented value when a new root is stitched in, and to recompute
      augmented values down the ancestry path when a node is removed from
      the tree.
      
      In order to preserve maximum speed for the non-augmented case,
      we provide two versions of each tree manipulation function.
      rb_insert_augmented() is the augmented equivalent of rb_insert_color(),
      and rb_erase_augmented() is the augmented equivalent of rb_erase().
      Signed-off-by: default avatarMichel Lespinasse <walken@google.com>
      Acked-by: default avatarRik van Riel <riel@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      14b94af0
    • Michel Lespinasse's avatar
      rbtree: augmented rbtree test · dadf9353
      Michel Lespinasse authored
      Small test to measure the performance of augmented rbtrees.
      Signed-off-by: default avatarMichel Lespinasse <walken@google.com>
      Acked-by: default avatarRik van Riel <riel@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      dadf9353
    • Michel Lespinasse's avatar
      rbtree: low level optimizations in rb_erase() · 4f035ad6
      Michel Lespinasse authored
      Various minor optimizations in rb_erase():
      - Avoid multiple loading of node->__rb_parent_color when computing parent
        and color information (possibly not in close sequence, as there might
        be further branches in the algorithm)
      - In the 1-child subcase of case 1, copy the __rb_parent_color field from
        the erased node to the child instead of recomputing it from the desired
        parent and color
      - When searching for the erased node's successor, differentiate between
        cases 2 and 3 based on whether any left links were followed. This avoids
        a condition later down.
      - In case 3, keep a pointer to the erased node's right child so we don't
        have to refetch it later to adjust its parent.
      - In the no-childs subcase of cases 2 and 3, place the rebalance assigment
        last so that the compiler can remove the following if(rebalance) test.
      
      Also, added some comments to illustrate cases 2 and 3.
      Signed-off-by: default avatarMichel Lespinasse <walken@google.com>
      Acked-by: default avatarRik van Riel <riel@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      4f035ad6
    • Michel Lespinasse's avatar
      rbtree: handle 1-child recoloring in rb_erase() instead of rb_erase_color() · 46b6135a
      Michel Lespinasse authored
      An interesting observation for rb_erase() is that when a node has
      exactly one child, the node must be black and the child must be red.
      An interesting consequence is that removing such a node can be done by
      simply replacing it with its child and making the child black,
      which we can do efficiently in rb_erase(). __rb_erase_color() then
      only needs to handle the no-childs case and can be modified accordingly.
      Signed-off-by: default avatarMichel Lespinasse <walken@google.com>
      Acked-by: default avatarRik van Riel <riel@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      46b6135a
    • Michel Lespinasse's avatar
      rbtree: place easiest case first in rb_erase() · 60670b80
      Michel Lespinasse authored
      In rb_erase, move the easy case (node to erase has no more than
      1 child) first. I feel the code reads easier that way.
      Signed-off-by: default avatarMichel Lespinasse <walken@google.com>
      Reviewed-by: default avatarRik van Riel <riel@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      60670b80
    • Michel Lespinasse's avatar
      rbtree: add __rb_change_child() helper function · 7abc704a
      Michel Lespinasse authored
      Add __rb_change_child() as an inline helper function to replace code that
      would otherwise be duplicated 4 times in the source.
      
      No changes to binary size or speed.
      Signed-off-by: default avatarMichel Lespinasse <walken@google.com>
      Reviewed-by: default avatarRik van Riel <riel@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      7abc704a
    • Michel Lespinasse's avatar
      rbtree test: fix sparse warning about 64-bit constant · 28d75309
      Michel Lespinasse authored
      Just a small fix to make sparse happy.
      Signed-off-by: default avatarMichel Lespinasse <walken@google.com>
      Reported-by: default avatarFengguang Wu <wfg@linux.intel.com>
      Acked-by: default avatarRik van Riel <riel@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      28d75309
    • Michel Lespinasse's avatar
      rbtree: optimize fetching of sibling node · 59633abf
      Michel Lespinasse authored
      When looking to fetch a node's sibling, we went through a sequence of:
      - check if node is the parent's left child
      - if it is, then fetch the parent's right child
      
      This can be replaced with:
      - fetch the parent's right child as an assumed sibling
      - check that node is NOT the fetched child
      
      This avoids fetching the parent's left child when node is actually
      that child. Saves a bit on code size, though it doesn't seem to make
      a large difference in speed.
      Signed-off-by: default avatarMichel Lespinasse <walken@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Woodhouse <David.Woodhouse@intel.com>
      Acked-by: default avatarRik van Riel <riel@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Daniel Santos <daniel.santos@pobox.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      59633abf
    • Michel Lespinasse's avatar
      rbtree: coding style adjustments · 7ce6ff9e
      Michel Lespinasse authored
      Set comment and indentation style to be consistent with linux coding style
      and the rest of the file, as suggested by Peter Zijlstra
      Signed-off-by: default avatarMichel Lespinasse <walken@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Acked-by: default avatarDavid Woodhouse <David.Woodhouse@intel.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Daniel Santos <daniel.santos@pobox.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      7ce6ff9e
    • Michel Lespinasse's avatar
      rbtree: low level optimizations in __rb_erase_color() · 6280d235
      Michel Lespinasse authored
      In __rb_erase_color(), we often already have pointers to the nodes being
      rotated and/or know what their colors must be, so we can generate more
      efficient code than the generic __rb_rotate_left() and __rb_rotate_right()
      functions.
      
      Also when the current node is red or when flipping the sibling's color,
      the parent is already known so we can use the more efficient
      rb_set_parent_color() function to set the desired color.
      Signed-off-by: default avatarMichel Lespinasse <walken@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Acked-by: default avatarDavid Woodhouse <David.Woodhouse@intel.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Daniel Santos <daniel.santos@pobox.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      6280d235
    • Michel Lespinasse's avatar
      rbtree: optimize case selection logic in __rb_erase_color() · e125d147
      Michel Lespinasse authored
      In __rb_erase_color(), we have to select one of 3 cases depending on the
      color on the 'other' node children.  If both children are black, we flip a
      few node colors and iterate.  Otherwise, we do either one or two tree
      rotations, depending on the color of the 'other' child opposite to 'node',
      and then we are done.
      
      The corresponding logic had duplicate checks for the color of the 'other'
      child opposite to 'node'.  It was checking it first to determine if both
      children are black, and then to determine how many tree rotations are
      required.  Rearrange the logic to avoid that extra check.
      Signed-off-by: default avatarMichel Lespinasse <walken@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Acked-by: default avatarDavid Woodhouse <David.Woodhouse@intel.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Daniel Santos <daniel.santos@pobox.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      e125d147
    • Michel Lespinasse's avatar
      rbtree: adjust node color in __rb_erase_color() only when necessary · d6ff1273
      Michel Lespinasse authored
      In __rb_erase_color(), we were always setting a node to black after
      exiting the main loop.  And in one case, after fixing up the tree to
      satisfy all rbtree invariants, we were setting the current node to root
      just to guarantee a loop exit, at which point the root would be set to
      black.  However this is not necessary, as the root of an rbtree is already
      known to be black.  The only case where the color flip is required is when
      we exit the loop due to the current node being red, and it's easiest to
      just do the flip at that point instead of doing it after the loop.
      
      [adrian.hunter@intel.com: perf tools: fix build for another rbtree.c change]
      Signed-off-by: default avatarMichel Lespinasse <walken@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Acked-by: default avatarDavid Woodhouse <David.Woodhouse@intel.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Daniel Santos <daniel.santos@pobox.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: default avatarAdrian Hunter <adrian.hunter@intel.com>
      Cc: Alexander Shishkin <alexander.shishkin@intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      d6ff1273
    • Michel Lespinasse's avatar
      rbtree: low level optimizations in rb_insert_color() · 5bc9188a
      Michel Lespinasse authored
      - Use the newly introduced rb_set_parent_color() function to flip the color
        of nodes whose parent is already known.
      - Optimize rb_parent() when the node is known to be red - there is no need
        to mask out the color in that case.
      - Flipping gparent's color to red requires us to fetch its rb_parent_color
        field, so we can reuse it as the parent value for the next loop iteration.
      - Do not use __rb_rotate_left() and __rb_rotate_right() to handle tree
        rotations: we already have pointers to all relevant nodes, and know their
        colors (either because we want to adjust it, or because we've tested it,
        or we can deduce it as black due to the node proximity to a known red node).
        So we can generate more efficient code by making use of the node pointers
        we already have, and setting both the parent and color attributes for
        nodes all at once. Also in Case 2, some node attributes don't have to
        be set because we know another tree rotation (Case 3) will always follow
        and override them.
      Signed-off-by: default avatarMichel Lespinasse <walken@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Acked-by: default avatarDavid Woodhouse <David.Woodhouse@intel.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Daniel Santos <daniel.santos@pobox.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      5bc9188a
    • Michel Lespinasse's avatar
      rbtree: adjust root color in rb_insert_color() only when necessary · 6d58452d
      Michel Lespinasse authored
      The root node of an rbtree must always be black.  However,
      rb_insert_color() only needs to maintain this invariant when it has been
      broken - that is, when it exits the loop due to the current (red) node
      being the root.  In all other cases (exiting after tree rotations, or
      exiting due to an existing black parent) the invariant is already
      satisfied, so there is no need to adjust the root node color.
      Signed-off-by: default avatarMichel Lespinasse <walken@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Acked-by: default avatarDavid Woodhouse <David.Woodhouse@intel.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Daniel Santos <daniel.santos@pobox.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      6d58452d
    • Michel Lespinasse's avatar
      rbtree: break out of rb_insert_color loop after tree rotation · 1f052865
      Michel Lespinasse authored
      It is a well known property of rbtrees that insertion never requires more
      than two tree rotations.  In our implementation, after one loop iteration
      identified one or two necessary tree rotations, we would iterate and look
      for more.  However at that point the node's parent would always be black,
      which would cause us to exit the loop.
      
      We can make the code flow more obvious by just adding a break statement
      after the tree rotations, where we know we are done.  Additionally, in the
      cases where two tree rotations are necessary, we don't have to update the
      'node' pointer as it wouldn't be used until the next loop iteration, which
      we now avoid due to this break statement.
      Signed-off-by: default avatarMichel Lespinasse <walken@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Acked-by: default avatarDavid Woodhouse <David.Woodhouse@intel.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Daniel Santos <daniel.santos@pobox.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      1f052865
    • Michel Lespinasse's avatar
      rbtree: performance and correctness test · 910a742d
      Michel Lespinasse authored
      This small module helps measure the performance of rbtree insert and
      erase.
      
      Additionally, we run a few correctness tests to check that the rbtrees
      have all desired properties:
      
      - contains the right number of nodes in the order desired,
      - never two consecutive red nodes on any path,
      - all paths to leaf nodes have the same number of black nodes,
      - root node is black
      
      [akpm@linux-foundation.org: fix printk warning: sparc64 cycles_t is unsigned long]
      Signed-off-by: default avatarMichel Lespinasse <walken@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Acked-by: default avatarDavid Woodhouse <David.Woodhouse@intel.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Daniel Santos <daniel.santos@pobox.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      910a742d
    • Michel Lespinasse's avatar
      rbtree: move some implementation details from rbtree.h to rbtree.c · bf7ad8ee
      Michel Lespinasse authored
      rbtree users must use the documented APIs to manipulate the tree
      structure.  Low-level helpers to manipulate node colors and parenthood are
      not part of that API, so move them to lib/rbtree.c
      
      [dwmw2@infradead.org: fix jffs2 build issue due to renamed __rb_parent_color field]
      Signed-off-by: default avatarMichel Lespinasse <walken@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Acked-by: default avatarDavid Woodhouse <David.Woodhouse@intel.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Daniel Santos <daniel.santos@pobox.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: default avatarDavid Woodhouse <David.Woodhouse@intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      bf7ad8ee
    • Michel Lespinasse's avatar
      rbtree: fix incorrect rbtree node insertion in fs/proc/proc_sysctl.c · ea5272f5
      Michel Lespinasse authored
      The recently added code to use rbtrees in sysctl did not follow the proper
      rbtree interface on insertion - it was calling rb_link_node() which
      inserts a new node into the binary tree, but missed the call to
      rb_insert_color() which properly balances the rbtree and establishes all
      expected rbtree invariants.
      
      I found out about this only because faulty commit also used
      rb_init_node(), which I am removing within this patchset.  But I think
      it's an easy mistake to make, and it makes me wonder if we should change
      the rbtree API so that insertions would be done with a single rb_insert()
      call (even if its implementation could still inline the rb_link_node()
      part and call a private __rb_insert_color function to do the rebalancing).
      Signed-off-by: default avatarMichel Lespinasse <walken@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Acked-by: default avatarDavid Woodhouse <David.Woodhouse@intel.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Daniel Santos <daniel.santos@pobox.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      ea5272f5
    • Michel Lespinasse's avatar
      rbtree: empty nodes have no color · 4c199a93
      Michel Lespinasse authored
      Empty nodes have no color.  We can make use of this property to simplify
      the code emitted by the RB_EMPTY_NODE and RB_CLEAR_NODE macros.  Also,
      we can get rid of the rb_init_node function which had been introduced by
      commit 88d19cf3 ("timers: Add rb_init_node() to allow for stack
      allocated rb nodes") to avoid some issue with the empty node's color not
      being initialized.
      
      I'm not sure what the RB_EMPTY_NODE checks in rb_prev() / rb_next() are
      doing there, though.  axboe introduced them in commit 10fd48f2
      ("rbtree: fixed reversed RB_EMPTY_NODE and rb_next/prev").  The way I
      see it, the 'empty node' abstraction is only used by rbtree users to
      flag nodes that they haven't inserted in any rbtree, so asking the
      predecessor or successor of such nodes doesn't make any sense.
      
      One final rb_init_node() caller was recently added in sysctl code to
      implement faster sysctl name lookups.  This code doesn't make use of
      RB_EMPTY_NODE at all, and from what I could see it only called
      rb_init_node() under the mistaken assumption that such initialization was
      required before node insertion.
      
      [sfr@canb.auug.org.au: fix net/ceph/osd_client.c build]
      Signed-off-by: default avatarMichel Lespinasse <walken@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Acked-by: default avatarDavid Woodhouse <David.Woodhouse@intel.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Daniel Santos <daniel.santos@pobox.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: John Stultz <john.stultz@linaro.org>
      Signed-off-by: default avatarStephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      4c199a93
    • Michel Lespinasse's avatar
      rbtree: reference Documentation/rbtree.txt for usage instructions · 1457d287
      Michel Lespinasse authored
      I recently started looking at the rbtree code (with an eye towards
      improving the augmented rbtree support, but I haven't gotten there yet).
      I noticed a lot of possible speed improvements, which I am now proposing
      in this patch set.
      
      Patches 1-4 are preparatory: remove internal functions from rbtree.h so
      that users won't be tempted to use them instead of the documented APIs,
      clean up some incorrect usages I've noticed (in particular, with the
      recently added fs/proc/proc_sysctl.c rbtree usage), reference the
      documentation so that people have one less excuse to miss it, etc.
      
      Patch 5 is a small module I wrote to check the rbtree performance.  It
      creates 100 nodes with random keys and repeatedly inserts and erases them
      from an rbtree.  Additionally, it has code to check for rbtree invariants
      after each insert or erase operation.
      
      Patches 6-12 is where the rbtree optimizations are done, and they touch
      only that one file, lib/rbtree.c .  I am getting good results out of these
      - in my small benchmark doing rbtree insertion (including search) and
      erase, I'm seeing a 30% runtime reduction on Sandybridge E5, which is more
      than I initially thought would be possible.  (the results aren't as
      impressive on my two other test hosts though, AMD barcelona and Intel
      Westmere, where I am seeing 14% runtime reduction only).  The code size -
      both source (ommiting comments) and compiled - is also shorter after these
      changes.  However, I do admit that the updated code is more arduous to
      read - one big reason for that is the removal of the tree rotation
      helpers, which added some overhead but also made it easier to reason about
      things locally.  Overall, I believe this is an acceptable compromise,
      given that this code doesn't get modified very often, and that I have good
      tests for it.
      
      Upon Peter's suggestion, I added comments showing the rtree configuration
      before every rotation.  I think they help; however it's still best to have
      a copy of the cormen/leiserson/rivest book when digging into this code.
      
      This patch: reference Documentation/rbtree.txt for usage instructions
      
      include/linux/rbtree.h included some basic usage instructions, while
      Documentation/rbtree.txt had some more complete and easier to follow
      instructions.  Replacing the former with a reference to the latter.
      Signed-off-by: default avatarMichel Lespinasse <walken@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Acked-by: default avatarDavid Woodhouse <David.Woodhouse@intel.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Daniel Santos <daniel.santos@pobox.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      1457d287
    • Michel Lespinasse's avatar
      ipc/mqueue: remove unnecessary rb_init_node() calls · 1638113d
      Michel Lespinasse authored
      Commit d6629859 ("ipc/mqueue: improve performance of send/recv") and
      ce2d52cc ("ipc/mqueue: add rbtree node caching support") introduced an
      rbtree of message priorities, and usage of rb_init_node() to initialize
      the corresponding nodes.  As it turns out, rb_init_node() is unnecessary
      here, as the nodes are fully initialized on insertion by rb_link_node()
      and the code doesn't access nodes that aren't inserted on the rbtree.
      
      Removing the rb_init_node() calls as I removed that function during
      rbtree API cleanups (the only other use of it was in a place that
      similarly didn't require it).
      Signed-off-by: default avatarMichel Lespinasse <walken@google.com>
      Acked-by: default avatarDoug Ledford <dledford@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      1638113d
    • Gerald Schaefer's avatar
      thp, s390: architecture backend for thp on s390 · 1ae1c1d0
      Gerald Schaefer authored
      This implements the architecture backend for transparent hugepages
      on s390.
      Signed-off-by: default avatarGerald Schaefer <gerald.schaefer@de.ibm.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Hillf Danton <dhillf@gmail.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      1ae1c1d0
    • Gerald Schaefer's avatar
      thp, s390: disable thp for kvm host on s390 · 274023da
      Gerald Schaefer authored
      This patch is part of the architecture backend for thp on s390.  It
      disables thp for kvm hosts, because there is no kvm host hugepage support
      so far.  Existing thp mappings are split by follow_page() with FOLL_SPLIT,
      and future thp mappings are prevented by setting VM_NOHUGEPAGE in
      mm->def_flags.
      Signed-off-by: default avatarGerald Schaefer <gerald.schaefer@de.ibm.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Hillf Danton <dhillf@gmail.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      274023da
    • Gerald Schaefer's avatar
      thp, s390: thp pagetable pre-allocation for s390 · 9501d09f
      Gerald Schaefer authored
      This patch is part of the architecture backend for thp on s390.  It
      provides the pagetable pre-allocation functions
      pgtable_trans_huge_deposit() and pgtable_trans_huge_withdraw().  Unlike
      other archs, s390 has no struct page * as pgtable_t, but rather a pointer
      to the page table.  So instead of saving the pagetable pre- allocation
      list info inside the struct page, it is being saved within the pagetable
      itself.
      Signed-off-by: default avatarGerald Schaefer <gerald.schaefer@de.ibm.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Hillf Danton <dhillf@gmail.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      9501d09f
    • Gerald Schaefer's avatar
      thp, s390: thp splitting backend for s390 · 75077afb
      Gerald Schaefer authored
      This patch is part of the architecture backend for thp on s390.  It
      provides the functions related to thp splitting, including serialization
      against gup.  Unlike other archs, pmdp_splitting_flush() cannot use a tlb
      flushing operation to serialize against gup on s390, because that wouldn't
      be stopped by the disabled IRQs.  So instead, smp_call_function() is
      called with an empty function, which will have the expected effect.
      Signed-off-by: default avatarGerald Schaefer <gerald.schaefer@de.ibm.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Hillf Danton <dhillf@gmail.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      75077afb
    • Gerald Schaefer's avatar
      thp: make MADV_HUGEPAGE check for mm->def_flags · 8e72033f
      Gerald Schaefer authored
      This adds a check to hugepage_madvise(), to refuse MADV_HUGEPAGE if
      VM_NOHUGEPAGE is set in mm->def_flags.  On s390, the VM_NOHUGEPAGE flag
      will be set in mm->def_flags for kvm processes, to prevent any future thp
      mappings.  In order to also prevent MADV_HUGEPAGE on such an mm,
      hugepage_madvise() should check mm->def_flags.
      Signed-off-by: default avatarGerald Schaefer <gerald.schaefer@de.ibm.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Hillf Danton <dhillf@gmail.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      8e72033f
    • Gerald Schaefer's avatar
      thp: introduce pmdp_invalidate() · 46dcde73
      Gerald Schaefer authored
      On s390, a valid page table entry must not be changed while it is attached
      to any CPU.  So instead of pmd_mknotpresent() and set_pmd_at(), an IDTE
      operation would be necessary there.  This patch introduces the
      pmdp_invalidate() function, to allow architecture-specific
      implementations.
      Signed-off-by: default avatarGerald Schaefer <gerald.schaefer@de.ibm.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Hillf Danton <dhillf@gmail.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      46dcde73
    • Gerald Schaefer's avatar
      thp: remove assumptions on pgtable_t type · e3ebcf64
      Gerald Schaefer authored
      The thp page table pre-allocation code currently assumes that pgtable_t is
      of type "struct page *".  This may not be true for all architectures, so
      this patch removes that assumption by replacing the functions
      prepare_pmd_huge_pte() and get_pmd_huge_pte() with two new functions that
      can be defined architecture-specific.
      
      It also removes two VM_BUG_ON checks for page_count() and page_mapcount()
      operating on a pgtable_t.  Apart from the VM_BUG_ON removal, there will be
      no functional change introduced by this patch.
      Signed-off-by: default avatarGerald Schaefer <gerald.schaefer@de.ibm.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Hillf Danton <dhillf@gmail.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      e3ebcf64
    • Gerald Schaefer's avatar
      thp, x86: introduce HAVE_ARCH_TRANSPARENT_HUGEPAGE · 15626062
      Gerald Schaefer authored
      Cleanup patch in preparation for transparent hugepage support on s390.
      Adding new architectures to the TRANSPARENT_HUGEPAGE config option can
      make the "depends" line rather ugly, like "depends on (X86 || (S390 &&
      64BIT)) && MMU".
      
      This patch adds a HAVE_ARCH_TRANSPARENT_HUGEPAGE instead.  x86 already has
      MMU "def_bool y", so the MMU check is superfluous there and
      HAVE_ARCH_TRANSPARENT_HUGEPAGE can be selected in arch/x86/Kconfig.
      Signed-off-by: default avatarGerald Schaefer <gerald.schaefer@de.ibm.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Hillf Danton <dhillf@gmail.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      15626062
    • Michel Lespinasse's avatar
      mm: fix potential anon_vma locking issue in mprotect() · ca42b26a
      Michel Lespinasse authored
      Fix an anon_vma locking issue in the following situation:
      
      - vma has no anon_vma
      - next has an anon_vma
      - vma is being shrunk / next is being expanded, due to an mprotect call
      
      We need to take next's anon_vma lock to avoid races with rmap users (such
      as page migration) while next is being expanded.
      Signed-off-by: default avatarMichel Lespinasse <walken@google.com>
      Reviewed-by: default avatarAndrea Arcangeli <aarcange@redhat.com>
      Acked-by: default avatarHugh Dickins <hughd@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      ca42b26a