    mm: don't drop VMA locks in mm_drop_all_locks()
    Despite its name, mm_drop_all_locks() does not drop _all_ locks; the mmap
    lock is held write-locked by the caller, and the caller is responsible for
    dropping the mmap lock at a later point (which will also release the VMA
    locks).
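
    For reference, the unlock paths look roughly like this (a simplified
    sketch of the helpers in include/linux/mmap_lock.h; tracepoints and
    config dependencies are omitted):

        /*
         * Simplified sketch: VMA write locks are only meant to go away as a
         * side effect of these two mmap-lock release paths.
         */
        static inline void mmap_write_unlock(struct mm_struct *mm)
        {
            vma_end_write_all(mm);          /* releases all VMA write locks */
            up_write(&mm->mmap_lock);
        }

        static inline void mmap_write_downgrade(struct mm_struct *mm)
        {
            vma_end_write_all(mm);
            downgrade_write(&mm->mmap_lock);
        }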
    
    Calling vma_end_write_all() here is dangerous because the caller might
    have write-locked a VMA with the expectation that it will stay
    write-locked until the mmap_lock is released, as usual.
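
    The reason a premature vma_end_write_all() amounts to dropping those VMA
    write locks is the sequence scheme they are built on; roughly (a
    simplified model, ignoring memory ordering):

        /*
         * A VMA counts as write-locked while its vm_lock_seq matches
         * mm->mm_lock_seq.
         */
        static void vma_start_write(struct vm_area_struct *vma)
        {
            if (vma->vm_lock_seq == vma->vm_mm->mm_lock_seq)
                return;                 /* already write-locked by us */
            down_write(&vma->vm_lock->lock);
            vma->vm_lock_seq = vma->vm_mm->mm_lock_seq;
            up_write(&vma->vm_lock->lock);
        }

        static void vma_end_write_all(struct mm_struct *mm)
        {
            /* Bumping the sequence "unlocks" every write-locked VMA at once. */
            mm->mm_lock_seq++;
        }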
    
    This _almost_ becomes a problem in the following scenario:
    
    An anonymous VMA A and an SGX VMA B are mapped adjacent to each other. 
    Userspace calls munmap() on a range starting at the start address of A and
    ending in the middle of B.
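
    Purely as an illustration of that layout (addresses and sizes are made
    up, sgx_fd is a hypothetical already-initialized enclave fd, and the
    enclave setup via ioctls on /dev/sgx_enclave is not shown):

        #include <sys/mman.h>
        #include <unistd.h>

        /* Illustration only; error handling is skipped. */
        static void set_up_layout(int sgx_fd)
        {
            long page = sysconf(_SC_PAGESIZE);

            /* VMA A: anonymous mapping */
            char *a = mmap(NULL, 2 * page, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            /* VMA B: SGX mapping placed directly behind A */
            mmap(a + 2 * page, 2 * page, PROT_READ,
                 MAP_SHARED | MAP_FIXED, sgx_fd, 0);

            /* Unmap all of A plus the first half of B, so B must be split. */
            munmap(a, 3 * page);
        }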
    
    Hypothetical call graph with additional notes in brackets:
    
    do_vmi_align_munmap
      [begin first for_each_vma_range loop]
      vma_start_write [on VMA A]
      vma_mark_detached [on VMA A]
      __split_vma [on VMA B]
        sgx_vma_open [== new->vm_ops->open]
          sgx_encl_mm_add
            __mmu_notifier_register [luckily THIS CAN'T ACTUALLY HAPPEN]
              mm_take_all_locks
              mm_drop_all_locks
                vma_end_write_all [drops VMA lock taken on VMA A before]
      vma_start_write [on VMA B]
      vma_mark_detached [on VMA B]
      [end first for_each_vma_range loop]
      vma_iter_clear_gfp [removes VMAs from maple tree]
      mmap_write_downgrade
      unmap_region
      mmap_read_unlock
    
    In this hypothetical scenario, while do_vmi_align_munmap() thinks it still
    holds a VMA write lock on VMA A, the VMA write lock has actually been
    invalidated inside __split_vma().
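
    Concretely, "invalidated" means the lockless page fault path would no
    longer see VMA A as write-locked; its reader-side check looks roughly
    like this (simplified sketch of vma_start_read(), without the
    READ_ONCE/smp_rmb annotations of the real code):

        static bool vma_start_read(struct vm_area_struct *vma)
        {
            /* Seq match means "write-locked under the current mmap_lock". */
            if (vma->vm_lock_seq == vma->vm_mm->mm_lock_seq)
                return false;
            if (!down_read_trylock(&vma->vm_lock->lock))
                return false;
            /* Recheck: a writer may have locked the VMA in the meantime. */
            if (vma->vm_lock_seq == vma->vm_mm->mm_lock_seq) {
                up_read(&vma->vm_lock->lock);
                return false;
            }
            return true;
        }

    Once mm_lock_seq has been bumped by the premature vma_end_write_all(),
    the first check no longer recognizes VMA A as write-locked.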
    
    The call from sgx_encl_mm_add() to __mmu_notifier_register() can't
    actually happen here, as far as I understand, because we are duplicating
    an existing SGX VMA, but sgx_encl_mm_add() only calls
    __mmu_notifier_register() for the first SGX VMA created in a given
    process.  So this could only happen in fork(), not on munmap().  But in my
    view it is just pure luck that this can't happen.
    
    Also, we wouldn't actually have any bad consequences from this in
    do_vmi_align_munmap(), because by the time the bug drops the lock on VMA
    A, we've already marked VMA A as detached, which makes it completely
    ineligible for any VMA-locked page faults.  But again, that's just pure
    luck.
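
    For context, the VMA-locked fault path rejects detached VMAs during
    lookup; roughly (a trimmed sketch based on lock_vma_under_rcu(), with
    the retry loop and the vm_start/vm_end recheck dropped):

        struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
                                                  unsigned long address)
        {
            MA_STATE(mas, &mm->mm_mt, address, address);
            struct vm_area_struct *vma;

            rcu_read_lock();
            vma = mas_walk(&mas);
            if (!vma || !vma_start_read(vma))
                goto inval;

            /* VMA A is already marked detached here, so it is skipped. */
            if (vma->detached) {
                vma_end_read(vma);
                goto inval;
            }

            rcu_read_unlock();
            return vma;
        inval:
            rcu_read_unlock();
            return NULL;
        }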
    
    So remove the vma_end_write_all(), so that VMA write locks are only ever
    released on mmap_write_unlock() or mmap_write_downgrade().
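
    After this change, mm_drop_all_locks() is roughly shaped like this
    (sketch only, iteration details elided): it still drops the anon_vma and
    address_space locks taken by mm_take_all_locks(), but leaves the mmap
    lock and all VMA write locks to the caller:

        void mm_drop_all_locks(struct mm_struct *mm)
        {
            struct vm_area_struct *vma;
            struct anon_vma_chain *avc;
            VMA_ITERATOR(vmi, mm, 0);

            mmap_assert_write_locked(mm);
            BUG_ON(!mutex_is_locked(&mm_all_locks_mutex));

            for_each_vma(vmi, vma) {
                if (vma->anon_vma)
                    list_for_each_entry(avc, &vma->anon_vma_chain, same_vma)
                        vm_unlock_anon_vma(avc->anon_vma);
                if (vma->vm_file && vma->vm_file->f_mapping)
                    vm_unlock_mapping(vma->vm_file->f_mapping);
            }
            /* No vma_end_write_all() here anymore. */

            mutex_unlock(&mm_all_locks_mutex);
        }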
    
    Also add comments to document the locking rules established by this patch.
    
    Link: https://lkml.kernel.org/r/20230720193436.454247-1-jannh@google.com
    Fixes: eeff9a5d ("mm/mmap: prevent pagefault handler from racing with mmu_notifier registration")
    Signed-off-by: Jann Horn <jannh@google.com>
    Reviewed-by: Suren Baghdasaryan <surenb@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>