    powerpc/mm/book3s64: Fix MADV_DONTNEED and parallel page fault race · 75358ea3
    Aneesh Kumar K.V authored
    MADV_DONTNEED holds mmap_sem in read mode, so a parallel page fault
    is possible, and the kernel can end up with a level 1 PTE entry
    (THP entry) converted to a level 0 PTE entry without flushing
    the THP TLB entry.
    
    Most architectures, including POWER, have issues with the kernel
    instantiating a level 0 PTE entry while holding level 1 TLB entries.
    
    The code sequence I am looking at is
    
    down_read(mmap_sem)                         down_read(mmap_sem)
    
    zap_pmd_range()
     zap_huge_pmd()
      pmd lock held
      pmd_cleared
      table details added to mmu_gather
      pmd_unlock()
                                             insert a level 0 PTE entry()
    
    tlb_finish_mmu().
    
    Fix this by forcing a TLB flush before releasing the pmd lock if this
    is not a fullmm invalidate. We can safely skip this invalidate for
    the task exit case (fullmm invalidate) because in that case we are
    sure there can be no parallel fault handlers.
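
The rule above can be illustrated with a small userspace model. This is only a sketch under stated assumptions, not the kernel change itself: every name in it (`model_mm`, `model_huge_get_and_clear_full`, and so on) is hypothetical, the TLB is reduced to a single cached entry, and locking is omitted. It shows the one decision described here: flush while clearing the huge entry unless this is a fullmm invalidate.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model; none of these names are kernel APIs. */
enum entry_kind { ENTRY_NONE, ENTRY_LEVEL0, ENTRY_LEVEL1_THP };

struct model_mm {
    enum entry_kind page_table; /* what the page table currently maps */
    enum entry_kind tlb;        /* what the single modelled TLB entry caches */
    int flush_count;            /* number of TLB flushes issued so far */
};

/* Drop the cached translation, as a TLB invalidate would. */
static void model_flush_tlb(struct model_mm *mm)
{
    mm->tlb = ENTRY_NONE;
    mm->flush_count++;
}

/*
 * Clear a huge (level 1) entry and return the old value. In the sequence
 * from the commit message this runs with the pmd lock held. If this is
 * not a fullmm invalidate, flush right away so that no stale level 1
 * translation survives past the unlock.
 */
static enum entry_kind model_huge_get_and_clear_full(struct model_mm *mm,
                                                     bool fullmm)
{
    assert(mm != NULL);
    enum entry_kind old = mm->page_table;

    mm->page_table = ENTRY_NONE;
    if (!fullmm)
        model_flush_tlb(mm);
    return old;
}

/* A parallel page fault inserting a level 0 PTE after the pmd unlock. */
static void model_fault_insert_level0(struct model_mm *mm)
{
    mm->page_table = ENTRY_LEVEL0;
}

/* The hazard: a level 0 PTE mapped while the TLB still caches a THP entry. */
static bool model_has_stale_thp_tlb(const struct model_mm *mm)
{
    return mm->page_table == ENTRY_LEVEL0 && mm->tlb == ENTRY_LEVEL1_THP;
}
```

In this model, clearing with `fullmm == false` and then running the fault leaves no stale THP translation. With `fullmm == true` the flush is skipped, which is acceptable only because no fault handler can run in the task exit case.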
    
    This does change the QEMU guest RAM del/unplug time, as shown below:
    
    128 core, 496GB guest:
    
    Without patch:
    munmap start: timer = 196449 ms, PID=6681
    munmap finish: timer = 196488 ms, PID=6681 - delta = 39ms
    
    With patch:
    munmap start: timer = 196345 ms, PID=6879
    munmap finish: timer = 196714 ms, PID=6879 - delta = 369ms
    
    Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
    Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    Link: https://lore.kernel.org/r/20200505071729.54912-23-aneesh.kumar@linux.ibm.com
pgtable.h 40.1 KB