    mm/memory: cleanly support zeropage in vm_insert_page*(), vm_map_pages*() and vmf_insert_mixed() · fce831c9
    David Hildenbrand authored
    For now we only get the (small) zeropage mapped to user space in four
    cases (excluding VM_PFNMAP mappings, such as /dev/mem):
    
    (1) Read page faults in anonymous VMAs (MAP_PRIVATE|MAP_ANON):
        do_anonymous_page() will not refcount it and map it pte_mkspecial().
    (2) UFFDIO_ZEROPAGE on anonymous VMA or COW mapping of shmem
        (MAP_PRIVATE). mfill_atomic_pte_zeropage() will not refcount it and
        map it pte_mkspecial().
    (3) KSM in mergeable VMA (anonymous VMA or COW mapping).
        cmp_and_merge_page() will not refcount it and map it
        pte_mkspecial().
    (4) FSDAX as an optimization for holes.
        vmf_insert_mixed()->__vm_insert_mixed() might end up calling
        insert_page() without CONFIG_ARCH_HAS_PTE_SPECIAL, refcounting the
        zeropage and not mapping it pte_mkspecial(). With
        CONFIG_ARCH_HAS_PTE_SPECIAL, we'll call insert_pfn() where we will
        not refcount it and map it pte_mkspecial().
    
    In case (4), we might not have VM_MIXEDMAP set: while fs/fuse/dax.c sets
    VM_MIXEDMAP, we removed it for both ext4 and XFS fsdax in commit e1fb4a08
    ("dax: remove VM_MIXEDMAP for fsdax and device dax").
    
    Without CONFIG_ARCH_HAS_PTE_SPECIAL and with VM_MIXEDMAP, vm_normal_page()
    would currently return the zeropage.  We'll refcount the zeropage when
    mapping and when unmapping.
    
    Without CONFIG_ARCH_HAS_PTE_SPECIAL and without VM_MIXEDMAP,
    vm_normal_page() would currently refuse to return the zeropage.  So we'd
    refcount it when mapping but not when unmapping it ...  do we have fsdax
    without CONFIG_ARCH_HAS_PTE_SPECIAL in practice?  Hard to tell.
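
    The imbalance described above can be illustrated with a small userspace
    toy model.  This is only a sketch, not kernel code: all toy_* names are
    made up for illustration, and the lookup helper merely mimics the
    described behavior of vm_normal_page() without
    CONFIG_ARCH_HAS_PTE_SPECIAL.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy model, not kernel code: a "page" is just a reference counter. */
struct toy_page {
	long refcount;
};

static struct toy_page toy_zeropage = { .refcount = 1 };

/* Old insert_page()-like path: takes a reference on whatever it maps. */
static void toy_map_refcounted(struct toy_page *page)
{
	page->refcount++;
}

/*
 * Old lookup without pte_special support (assumption of this sketch): the
 * zeropage is only handed out as a "normal" page in a VM_MIXEDMAP mapping.
 */
static struct toy_page *toy_normal_page(struct toy_page *page, bool mixedmap)
{
	if (page == &toy_zeropage && !mixedmap)
		return NULL;
	return page;
}

/* Unmap path: drops a reference only for pages considered "normal". */
static void toy_unmap(struct toy_page *page, bool mixedmap)
{
	struct toy_page *normal = toy_normal_page(page, mixedmap);

	if (normal)
		normal->refcount--;
}

/* Map and unmap the zeropage n times; return the number of leaked refs. */
static long toy_leaked_refs(bool mixedmap, int n)
{
	long before = toy_zeropage.refcount;

	for (int i = 0; i < n; i++) {
		toy_map_refcounted(&toy_zeropage);
		toy_unmap(&toy_zeropage, mixedmap);
	}
	return toy_zeropage.refcount - before;
}
```

    In this model, toy_leaked_refs(true, n) stays balanced, while
    toy_leaked_refs(false, n) leaks one reference per map/unmap cycle.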
    
    Independent of that, we should never refcount the zeropage when we might
    be holding that reference for a long time, because even without an
    accounting imbalance we might overflow the refcount.  As there is interest
    in using the zeropage also in other VM_MIXEDMAP mappings, let's add clean
    support for that in the cases where it makes sense:
    
    (A) Never refcount the zeropage when mapping it:
    
    In insert_page(), special-case the zeropage, do not refcount it, and use
    pte_mkspecial().  Don't involve insert_pfn(): adjusting insert_page()
    looks cleaner than branching off to insert_pfn().
    
    (B) Never refcount the zeropage when unmapping it:
    
    In vm_normal_page(), also don't return the zeropage in a VM_MIXEDMAP
    mapping without CONFIG_ARCH_HAS_PTE_SPECIAL.  Add a VM_WARN_ON_ONCE()
    sanity check if we'd ever return the zeropage, which could happen if
    someone forgets to set pte_mkspecial() when mapping the zeropage. 
    Document that.
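
    Rules (A) and (B) together can be sketched in a userspace toy model.
    The sketch_* types and helpers are hypothetical and only mirror the
    intent stated above; the real insert_page() and vm_normal_page() deal
    with page tables, locking and architecture details that are left out
    here.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy model, not kernel code. */
struct sketch_page {
	long refcount;
};

struct sketch_pte {
	struct sketch_page *page;
	bool special;		/* stands in for a pte_mkspecial() PTE */
};

static struct sketch_page sketch_zeropage = { .refcount = 1 };

/* (A): the zeropage gets a "special" PTE and no reference when mapped. */
static struct sketch_pte sketch_insert_page(struct sketch_page *page)
{
	struct sketch_pte pte = { .page = page, .special = false };

	if (page == &sketch_zeropage)
		pte.special = true;
	else
		page->refcount++;
	return pte;
}

/* (B): the lookup never hands out the zeropage, whatever the VMA flags. */
static struct sketch_page *sketch_normal_page(const struct sketch_pte *pte)
{
	if (pte->special || pte->page == &sketch_zeropage)
		return NULL;
	return pte->page;
}

/* Unmap: only pages returned by the lookup lose a reference. */
static void sketch_unmap(struct sketch_pte *pte)
{
	struct sketch_page *page = sketch_normal_page(pte);

	if (page)
		page->refcount--;
	pte->page = NULL;
	pte->special = false;
}
```

    With both rules in place, the zeropage's counter is never touched on
    either path, so neither an imbalance nor an overflow can build up in
    this model.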
    
    (C) Allow the zeropage only where reasonable:
    
    s390x never wants the zeropage in some processes running legacy KVM guests
    that make use of storage keys.  So disallow that.
    
    Further, using the zeropage in COW mappings is unproblematic (just what we
    do for other COW mappings), because FAULT_FLAG_UNSHARE can just unshare it
    and GUP with FOLL_LONGTERM would work as expected.
    
    Similarly, mappings that can never have writable PTEs (implying no write
    faults) are also not problematic, because nothing could end up mapping the
    PTE writable by mistake later.  But in case we could have writable PTEs,
    we'll only allow the zeropage in FSDAX VMAs, which are incompatible with
    GUP and are blocked there completely.
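
    The policy in (C) can be written down as a small decision function.
    The following is a userspace sketch under stated assumptions: the SK_*
    flag bits, struct sk_vma and its boolean fields are hypothetical
    stand-ins for the VM_* flags, the s390x per-process restriction and the
    FSDAX check, and the actual kernel helper may consider further
    conditions.

```c
#include <stdbool.h>

/* Hypothetical flag bits for this sketch; not the kernel's VM_* values. */
enum {
	SK_WRITE    = 1 << 0,
	SK_MAYWRITE = 1 << 1,
	SK_SHARED   = 1 << 2,
};

/* Toy stand-in for the parts of a VMA that the policy looks at. */
struct sk_vma {
	unsigned int flags;
	bool mm_forbids_zeropage;	/* e.g. the s390x storage-key case */
	bool is_fsdax;
};

/* COW mapping: private (not shared), but may become writable. */
static bool sk_is_cow(const struct sk_vma *vma)
{
	return (vma->flags & (SK_SHARED | SK_MAYWRITE)) == SK_MAYWRITE;
}

static bool sk_zeropage_allowed(const struct sk_vma *vma)
{
	/* Some processes must never see the zeropage at all. */
	if (vma->mm_forbids_zeropage)
		return false;
	/* COW mappings: the zeropage can simply be unshared later. */
	if (sk_is_cow(vma))
		return true;
	/* No writable PTEs possible: nothing can map it writable later. */
	if (!(vma->flags & (SK_WRITE | SK_MAYWRITE)))
		return true;
	/* Writable PTEs possible: only allowed for FSDAX in this sketch. */
	return vma->is_fsdax;
}
```

    The order of the checks matters: the per-process restriction wins over
    everything else, and the FSDAX exception is only consulted for shared
    mappings that could have writable PTEs.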
    
    We'll always require the zeropage to be mapped with pte_special(). 
    GUP-fast will reject the zeropage that way, but GUP-slow will allow it. 
    (Note that GUP does not refcount the zeropage with FOLL_PIN, because there
    were issues with overflowing the refcount in the past).
    
    Add sanity checks to can_change_pte_writable() and wp_page_reuse(), to
    catch early during testing if we'd ever find a zeropage unexpectedly in
    code that wants to upgrade write permissions.
    
    Convert the BUG_ON in vm_mixed_ok() to an ordinary check and simply fail
    with VM_FAULT_SIGBUS, like we do for other sanity checks.  Drop the stale
    comment regarding reserved pages from insert_page().
    
    Note that:
    * we won't mess with VM_PFNMAP mappings for now. remap_pfn_range() and
      vmf_insert_pfn() would allow the zeropage in some cases and
      not refcount it.
    * vmf_insert_pfn*() will reject the zeropage in VM_MIXEDMAP
      mappings and we'll leave that alone for now. People can simply use
      one of the other interfaces.
    * we won't bother with the huge zeropage for now. It's never
      PTE-mapped and also GUP does not special-case it yet.
    
    Link: https://lkml.kernel.org/r/20240522125713.775114-3-david@redhat.com
    Signed-off-by: David Hildenbrand <david@redhat.com>
    Cc: Dan Williams <dan.j.williams@intel.com>
    Cc: Vincent Donnefort <vdonnefort@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>