mm/mmu_notifier: add an interval tree notifier · 99cb252f
    Jason Gunthorpe authored
    Of the 13 users of mmu_notifiers, 8 use only
    invalidate_range_start/end() and immediately intersect the
    mmu_notifier_range with some kind of internal list of VAs: 4 use an
    interval tree (i915_gem, radeon_mn, umem_odp, hfi1) and 4 use a linked
    list of some kind (scif_dma, vhost, gntdev, hmm).
    
    The remaining 5 either don't use invalidate_range_start() or do
    something special with it.
    
    It turns out that building a correct scheme with an interval tree is
    pretty complicated, particularly if the use case is synchronizing against
    another thread doing get_user_pages().  Many of these implementations
    have various subtle and difficult-to-fix races.
    
    This approach puts the interval tree as common code at the top of the mmu
    notifier call tree and implements a shareable locking scheme.
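
    As a sketch of what "common code" means for a driver, registration
    boils down to the following (the mmu_interval_* names are the API this
    patch adds; drv, its notifier member, and drv_invalidate are
    hypothetical placeholders):

        static const struct mmu_interval_notifier_ops drv_mni_ops = {
                .invalidate = drv_invalidate,   /* write side, sketched below */
        };

        /* Subscribe drv to invalidations of [start, start + length) in mm */
        ret = mmu_interval_notifier_insert(&drv->notifier, mm, start,
                                           length, &drv_mni_ops);
        if (ret)
                return ret;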
    
    The scheme includes:
     - An interval tree tracking VA ranges, with per-range callbacks
     - A read/write locking scheme for the interval tree that avoids
       sleeping in the notifier path (for the OOM killer)
     - A sequence counter based collision-retry locking scheme to tell the
       device page fault handler that a VA range is being concurrently
       invalidated (see the fault-path sketch after this list).
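
    For instance, a device page fault handler built on this scheme follows
    a collision-retry loop along these lines (a minimal sketch; drv->lock
    and the mapping update are hypothetical driver details):

        unsigned long seq;

    again:
        seq = mmu_interval_read_begin(&drv->notifier);

        /* faultable work, e.g. get_user_pages(), done with no locks held */

        mutex_lock(&drv->lock);
        if (mmu_interval_read_retry(&drv->notifier, seq)) {
                /* an invalidation ran concurrently; discard and retry */
                mutex_unlock(&drv->lock);
                goto again;
        }
        /* seq still current: safe to establish the device mapping */
        mutex_unlock(&drv->lock);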
    
    This is based on various ideas:
    - hmm accumulates invalidated VA ranges and releases them when all
      invalidates are done, via an active_invalidate_ranges count.
      This approach avoids having to intersect the interval tree twice (as
      umem_odp does) at the potential cost of a longer device page fault.
    
    - kvm/umem_odp use a sequence counter to drive the collision retry,
      via invalidate_seq.
    
    - a deferred-work todo list processed on unlock, as RTNL does, via
      deferred_list.  This makes adding/removing interval tree members
      more deterministic.
    
    - seqlock, except this version makes the seqlock idea multi-holder on
      the write side by protecting it with active_invalidate_ranges and a
      spinlock (the callback sketch below shows the write side).
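
    Put together, each invalidation bumps the counter via
    mmu_interval_set_seq() from the subscriber's invalidate callback, under
    the same driver lock the fault path takes (a sketch; drv_object and
    drv->lock are hypothetical):

        static bool drv_invalidate(struct mmu_interval_notifier *mni,
                                   const struct mmu_notifier_range *range,
                                   unsigned long cur_seq)
        {
                struct drv_object *drv =
                        container_of(mni, struct drv_object, notifier);

                /* must not sleep in non-blockable (e.g. OOM) contexts */
                if (!mmu_notifier_range_blockable(range)) {
                        if (!mutex_trylock(&drv->lock))
                                return false;
                } else {
                        mutex_lock(&drv->lock);
                }

                /* push concurrent faults into the collision-retry path */
                mmu_interval_set_seq(mni, cur_seq);
                /* ... zap device mappings for range->start..range->end ... */
                mutex_unlock(&drv->lock);
                return true;
        }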
    
    To minimize MM overhead when only the interval tree is being used, the
    entire SRCU and hlist overheads are dropped using some simple branches.
    Similarly, the interval tree overhead is dropped when in hlist mode.
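
    Illustratively, the dispatch amounts to something like the following
    (a hypothetical sketch of the branch structure, not the literal code
    from this patch; the field and helper names are assumptions):

        /* interval tree users: no SRCU grace period, no hlist walk */
        if (mmn_mm->has_itree)
                ret = mn_itree_invalidate(mmn_mm, range);

        /* classic hlist users: no interval tree lookup */
        if (!hlist_empty(&mmn_mm->list))
                ret = mn_hlist_invalidate_range_start(mmn_mm, range);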
    
    The overhead from the mandatory spinlock is broadly the same as for
    most of the existing users, which already had a lock (or two) of some
    sort on the invalidation path.
    
    Link: https://lore.kernel.org/r/20191112202231.3856-3-jgg@ziepe.ca
    Acked-by: Christian König <christian.koenig@amd.com>
    Tested-by: Philip Yang <Philip.Yang@amd.com>
    Tested-by: Ralph Campbell <rcampbell@nvidia.com>
    Reviewed-by: John Hubbard <jhubbard@nvidia.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>