    mm/mremap: hold the rmap lock in write mode when moving page table entries.
    To avoid a race between the rmap walk and mremap, mremap takes the
    rmap locks via take_rmap_locks().  The locks ensure that the rmap
    walk doesn't miss a page table entry while PTEs are being moved by
    move_page_tables().  The kernel further optimizes this locking: if
    the rmap walk will find the newly added vma after the old vma, the
    rmap locks are not taken.  This works because the rmap walk visits
    the vmas in the same order, so a page table entry that is no longer
    attached to the old vma will be found attached to the new vma, which
    is iterated later.
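
    A minimal sketch of those locking helpers, modelled on the
    take_rmap_locks()/drop_rmap_locks() pair in mm/mremap.c (exact
    details may vary between kernel versions):

        /*
         * Take the rmap locks in write mode to exclude a concurrent
         * rmap walk: the i_mmap tree covers file-backed vmas, the
         * anon_vma chain covers anonymous memory.
         */
        static void take_rmap_locks(struct vm_area_struct *vma)
        {
                if (vma->vm_file)
                        i_mmap_lock_write(vma->vm_file->f_mapping);
                if (vma->anon_vma)
                        anon_vma_lock_write(vma->anon_vma);
        }

        static void drop_rmap_locks(struct vm_area_struct *vma)
        {
                if (vma->anon_vma)
                        anon_vma_unlock_write(vma->anon_vma);
                if (vma->vm_file)
                        i_mmap_unlock_write(vma->vm_file->f_mapping);
        }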
    
    As explained in commit eb66ae03 ("mremap: properly flush TLB before
    releasing the page"), mremap is special in that it doesn't take
    ownership of the page.  The optimized PUD/PMD-aligned mremap path
    also doesn't hold the pte ptl; it only takes the pmd/pud ptl.  This
    can result in stale TLB entries, as shown below.
    
    This patch updates the rmap locking requirement in mremap to handle
    the following race condition in the optimized mremap paths:
    
    Optimized PMD move
    
        CPU 1                           CPU 2                                   CPU 3
    
        mremap(old_addr, new_addr)      page_shrinker/try_to_unmap_one
    
        mmap_write_lock_killable()
    
                                        addr = old_addr
                                        lock(pte_ptl)
        lock(pmd_ptl)
        pmd = *old_pmd
        pmd_clear(old_pmd)
        flush_tlb_range(old_addr)
    
        *new_pmd = pmd
                                                                                *new_addr = 10; and fills
                                                                                TLB with new addr
                                                                                and old pfn
    
        unlock(pmd_ptl)
                                        ptep_clear_flush()
                                        old pfn is free.
                                                                                Stale TLB entry
    
    The optimized PUD move suffers from a similar race.  Both of the
    above race conditions can be fixed by forcing the mremap path to
    take the rmap locks.
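
    The fix, sketched below against the move_page_tables() path in
    mm/mremap.c (context trimmed, details may vary by kernel version),
    is to pass true for need_rmap_locks on the optimized PMD/PUD paths,
    so that move_pgt_entry() always wraps the move in take_rmap_locks()
    and drop_rmap_locks():

        /*
         * In move_page_tables(): the optimized PMD (and, analogously,
         * PUD) path used to forward the caller's need_rmap_locks.
         * Passing true unconditionally excludes a concurrent rmap walk
         * (CPU 2 above) for the whole duration of the page table move
         * and TLB flush, closing the stale-TLB window seen by CPU 3.
         */
        if (move_pgt_entry(NORMAL_PMD, vma, old_addr, new_addr,
                           old_pmd, new_pmd, true /* was: need_rmap_locks */))
                continue;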
    
    Link: https://lkml.kernel.org/r/20210616045239.370802-7-aneesh.kumar@linux.ibm.com
    Fixes: 2c91bd4a ("mm: speed up mremap by 20x on large regions")
    Fixes: c49dd340 ("mm: speedup mremap on 1GB or larger regions")
    Link: https://lore.kernel.org/linux-mm/CAHk-=wgXVR04eBNtxQfevontWnP6FDm+oj5vauQXP3S-huwbPw@mail.gmail.com
    
    Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
    Acked-by: Hugh Dickins <hughd@google.com>
    Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
    Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
    Cc: Joel Fernandes <joel@joelfernandes.org>
    Cc: Kalesh Singh <kaleshsingh@google.com>
    Cc: Kirill A. Shutemov <kirill@shutemov.name>
    Cc: Michael Ellerman <mpe@ellerman.id.au>
    Cc: Nicholas Piggin <npiggin@gmail.com>
    Cc: Stephen Rothwell <sfr@canb.auug.org.au>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>