    userfaultfd: UFFD_FEATURE_WP_ASYNC · d61ea1cb
    Peter Xu authored
    Patch series "Implement IOCTL to get and optionally clear info about
    PTEs", v33.
    
    *Motivation*
    The real motivation for adding the PAGEMAP_SCAN IOCTL is to emulate the
    Windows GetWriteWatch() and ResetWriteWatch() syscalls [1].
    GetWriteWatch() retrieves the addresses of the pages that are written to
    in a region of virtual memory.
    
    This syscall is used in Windows applications, games, etc.  It is
    currently emulated in a rather slow manner in userspace.  Our purpose is
    to enhance the kernel so that it can be emulated efficiently.  Currently
    some out-of-tree hack patches are used to emulate it efficiently in some
    kernels.  We intend to replace those with these patches, so that gaming
    on Linux as a whole can benefit.  It means there would be many users of
    this code.
    
    CRIU use case [2] was mentioned by Andrei and Danylo:
    > Use cases for migrating sparse VMAs are binaries sanitized with ASAN,
    > MSAN or TSAN [3]. All of these sanitizers produce sparse mappings of
    > shadow memory [4]. Being able to migrate such binaries allows to highly
    > reduce the amount of work needed to identify and fix post-migration
    > crashes, which happen constantly.
    
    Andrei defines the following uses of this code:
    * it is more granular and allows us to track changed pages more
      effectively. The current interface can clear dirty bits for the entire
      process only. In addition, reading info about pages is a separate
      operation. It means we must freeze the process to read information
      about all its pages, reset dirty bits, only then we can start dumping
      pages. The information about pages becomes more and more outdated,
      while we are processing pages. The new interface solves both these
      downsides. First, it allows us to read pte bits and clear the
      soft-dirty bit atomically. It means that CRIU will not need to freeze
      processes to pre-dump their memory. Second, it clears soft-dirty bits
      for a specified region of memory. It means CRIU will have up-to-date
      info about pages at the moment of dumping them.
    * The new interface has to be much faster because basic page filtering
      is happening in the kernel. With the old interface, we have to read
      pagemap for each page.
    
    *Implementation Evolution (Short Summary)*
    From the definition of GetWriteWatch(), we feel like kernel's soft-dirty
    feature can be used under the hood with some additions like:
    * reset soft-dirty flag for only a specific region of memory instead of
    clearing the flag for the entire process
    * get and clear soft-dirty flag for a specific region atomically
    
    So we decided to use an ioctl on the pagemap file to read and/or reset
    the soft-dirty flag. But with the soft-dirty flag, we sometimes get extra
    pages which weren't even written. They had become soft-dirty because of
    VMA merging and the VM_SOFTDIRTY flag. This breaks the definition of
    GetWriteWatch(). We were able to bypass this shortcoming by ignoring
    VM_SOFTDIRTY until David reported that mprotect etc. messes up the
    soft-dirty flag while ignoring VM_SOFTDIRTY [5]. This wasn't happening
    until [6] got introduced. We discussed whether we could revert these
    patches, but we could not reach any conclusion. So at this point, I made
    a couple of attempts to solve this whole VM_SOFTDIRTY issue by correcting
    the soft-dirty implementation:
    * [7] Correct the bug fixed wrongly back in 2014. It had the potential to
    cause a regression. We left it behind.
    * [8] Keep a list of the soft-dirty parts of a VMA across splits and
    merges. The reply was not to increase the size of the VMA by 8 bytes.
    
    At this point, we left soft-dirty behind, considering it too delicate,
    and userfaultfd [9] seemed like the only way forward. From there onward,
    we have been basing soft-dirty emulation on the userfaultfd wp feature,
    where the kernel resolves the faults itself when the WP_ASYNC feature is
    used. It was straightforward to add the WP_ASYNC feature to userfaultfd.
    Now we get only those pages reported as dirty or written-to which were
    really written. (PS: another userfaultfd feature, WP_UNPOPULATED, is
    required to avoid pre-faulting memory before write-protecting [9].)
    
    All the different masks were added at the request of CRIU devs to make
    the interface more generic and better.
    
    [1] https://learn.microsoft.com/en-us/windows/win32/api/memoryapi/nf-memoryapi-getwritewatch
    [2] https://lore.kernel.org/all/20221014134802.1361436-1-mdanylo@google.com
    [3] https://github.com/google/sanitizers
    [4] https://github.com/google/sanitizers/wiki/AddressSanitizerAlgorithm#64-bit
    [5] https://lore.kernel.org/all/bfcae708-db21-04b4-0bbe-712badd03071@redhat.com
    [6] https://lore.kernel.org/all/20220725142048.30450-1-peterx@redhat.com/
    [7] https://lore.kernel.org/all/20221122115007.2787017-1-usama.anjum@collabora.com
    [8] https://lore.kernel.org/all/20221220162606.1595355-1-usama.anjum@collabora.com
    [9] https://lore.kernel.org/all/20230306213925.617814-1-peterx@redhat.com
    [10] https://lore.kernel.org/all/20230125144529.1630917-1-mdanylo@google.com
    
    
    This patch (of 6):
    
    Add a new userfaultfd-wp feature UFFD_FEATURE_WP_ASYNC, that allows
    userfaultfd wr-protect faults to be resolved by the kernel directly.
    
    It can be used like a high-accuracy version of soft-dirty, without vma
    modifications during tracking, and with ranged support by default
    (rather than for a whole mm) when resetting the protections, thanks to
    the existence of ioctl(UFFDIO_WRITEPROTECT).
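    
    From userspace, the flow described above could look roughly like the
    following.  This is only a minimal sketch, not code from this patch: the
    helper names wp_async_start() and wp_async_reset() are hypothetical, the
    fallback value of UFFD_FEATURE_WP_ASYNC is an assumption for builds
    against older uapi headers, and opening a userfaultfd may additionally
    require privileges or vm.unprivileged_userfaultfd=1.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <stddef.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Assumed fallback for uapi headers that predate this feature bit. */
#ifndef UFFD_FEATURE_WP_ASYNC
#define UFFD_FEATURE_WP_ASYNC (1 << 15)
#endif

/*
 * Hypothetical helper: (re)apply write protection to [addr, addr + len).
 * Used both to start tracking and to reset the "written" state of a range.
 */
int wp_async_reset(int uffd, void *addr, size_t len)
{
	struct uffdio_writeprotect wp = {
		.range = { .start = (unsigned long)addr, .len = len },
		.mode = UFFDIO_WRITEPROTECT_MODE_WP,
	};

	return ioctl(uffd, UFFDIO_WRITEPROTECT, &wp);
}

/*
 * Hypothetical helper: enable WP_ASYNC, register the range in wp mode and
 * write-protect it.  Returns the userfaultfd on success, -1 on failure
 * (for example when the kernel lacks the feature).
 */
int wp_async_start(void *addr, size_t len)
{
	struct uffdio_api api = {
		.api = UFFD_API,
		.features = UFFD_FEATURE_WP_ASYNC,
	};
	struct uffdio_register reg = {
		.range = { .start = (unsigned long)addr, .len = len },
		.mode = UFFDIO_REGISTER_MODE_WP,
	};
	int uffd = syscall(SYS_userfaultfd, O_CLOEXEC);

	if (uffd < 0)
		return -1;
	if (ioctl(uffd, UFFDIO_API, &api) < 0 ||
	    ioctl(uffd, UFFDIO_REGISTER, &reg) < 0 ||
	    wp_async_reset(uffd, addr, len) < 0) {
		close(uffd);
		return -1;
	}
	return uffd;
}
```

    With WP_ASYNC enabled, a later write to the range is not expected to
    block waiting for a userspace fault handler; the kernel drops the
    uffd-wp bit itself.  How the written state is then read back (for
    example via the PAGEMAP_SCAN ioctl that the series introduces) is
    outside this sketch.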
    
    Several goals of such a dirty tracking interface:
    
    1. All types of memory should be supported and traceable. This is natural
       for soft-dirty but worth mentioning in the context of userfaultfd,
       because it used to only support anon/shmem/hugetlb. The problem is
       that for dirty tracking purposes these three types may not be enough,
       and it's legal to track anything, e.g. any page cache writes from
       mmap.
    
    2. Protections can be applied to part of a memory range, without vma
       split/merge fuss.  The hope is that the tracking itself should not
       cause any vma layout change.  It also helps when a reset happens,
       because the reset will not need the mmap write lock, which can block
       the tracee.
    
    3. Accuracy needs to be maintained.  This means we need pte markers to work
       on any type of VMA.
    
    One could argue that the whole concept of async dirty tracking is not
    really close to what userfaultfd fundamentally used to be: it's not "a
    fault to be serviced by userspace" anymore. However, using userfaultfd-wp
    here as a framework is convenient for us in at least:
    
    1. VM_UFFD_WP vma flag, which has a very good name to suit something like
       this, so we don't need VM_YET_ANOTHER_SOFT_DIRTY. Just use a new
       feature bit to distinguish it from a sync version of uffd-wp
       registration.
    
    2. PTE markers logic can be leveraged across the whole kernel to maintain
       the uffd-wp bit as long as an arch supports it. This also applies to
       this case, where the uffd-wp bit will be a hint to dirty information
       and it will not get lost easily (e.g. when some page cache ptes got
       zapped).
    
    3. Reuse the ioctl(UFFDIO_WRITEPROTECT) interface for either starting or
       resetting a range of memory. There's no counterpart in the old
       soft-dirty world, hence if this is wanted in a new design we'd need a
       new interface otherwise.
    
    We can somehow understand that commonality because uffd-wp was
    fundamentally a similar idea of write-protecting pages just like
    soft-dirty.
    
    This implementation allows WP_ASYNC to imply WP_UNPOPULATED, because so
    far WP_ASYNC does not seem usable without WP_UNPOPULATED.  This also
    gives us a chance to modify the impl of WP_ASYNC just in case it no
    longer depends on WP_UNPOPULATED in future kernels.  It's also fine to
    imply that because both features rely on the PTE_MARKER_UFFD_WP config
    option, so they'll show up together (or both be missing) in a UFFDIO_API
    probe.
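    
    Such a probe could be sketched as below.  This is an illustration
    only: uffd_has_wp_async() is a hypothetical helper name, and the
    fallback feature bit values are assumptions for builds against older
    uapi headers.  It uses a throwaway userfaultfd with features set to
    zero, so the kernel reports everything it supports.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Assumed fallbacks for uapi headers that predate these feature bits. */
#ifndef UFFD_FEATURE_WP_UNPOPULATED
#define UFFD_FEATURE_WP_UNPOPULATED (1 << 13)
#endif
#ifndef UFFD_FEATURE_WP_ASYNC
#define UFFD_FEATURE_WP_ASYNC (1 << 15)
#endif

/*
 * Hypothetical helper.  Returns 1 if the kernel reports both WP_ASYNC and
 * WP_UNPOPULATED, 0 if at least one is missing, and -1 if userfaultfd
 * cannot be opened or the API handshake fails.
 */
int uffd_has_wp_async(void)
{
	struct uffdio_api api = { .api = UFFD_API, .features = 0 };
	int uffd = syscall(SYS_userfaultfd, O_CLOEXEC);
	int ret;

	if (uffd < 0)
		return -1;

	/* With features == 0 the kernel fills in all supported features. */
	ret = ioctl(uffd, UFFDIO_API, &api);
	close(uffd);
	if (ret < 0)
		return -1;

	return (api.features & UFFD_FEATURE_WP_ASYNC) &&
	       (api.features & UFFD_FEATURE_WP_UNPOPULATED);
}
```

    The probe fd is closed right away; a caller that wants to enable the
    feature would open a fresh userfaultfd and request WP_ASYNC in its own
    UFFDIO_API call.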
    
    vma_can_userfault() now allows any VMA if the userfaultfd registration is
    only about async uffd-wp.  So we can track dirty for all kinds of memory
    including generic file systems (like XFS, EXT4 or BTRFS).
    
    One trick worth mentioning in do_wp_page() is that we need to manually update
    vmf->orig_pte here because it can be used later with a pte_same() check -
    this path always has FAULT_FLAG_ORIG_PTE_VALID set in the flags.
    
    The major defect of this approach to dirty tracking is that we need to
    populate the pgtables when tracking starts.  Soft-dirty doesn't do it
    like that.  It's unwanted in the case where the range of memory to track
    is huge and unpopulated (e.g., tracking updates on a 10G file with mmap()
    on top, without having any page cache installed yet).  One way to improve
    this is to allow pte markers to exist above the PTE level, i.e. for PMD+.
    That will not change the interface if implemented, so we can leave that
    for later.
    
    Link: https://lkml.kernel.org/r/20230821141518.870589-1-usama.anjum@collabora.com
    Link: https://lkml.kernel.org/r/20230821141518.870589-2-usama.anjum@collabora.com
    Signed-off-by: Peter Xu <peterx@redhat.com>
    Co-developed-by: Muhammad Usama Anjum <usama.anjum@collabora.com>
    Signed-off-by: Muhammad Usama Anjum <usama.anjum@collabora.com>
    Cc: Alex Sierra <alex.sierra@amd.com>
    Cc: Al Viro <viro@zeniv.linux.org.uk>
    Cc: Andrei Vagin <avagin@gmail.com>
    Cc: Axel Rasmussen <axelrasmussen@google.com>
    Cc: Christian Brauner <brauner@kernel.org>
    Cc: Cyrill Gorcunov <gorcunov@gmail.com>
    Cc: Dan Williams <dan.j.williams@intel.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    Cc: Gustavo A. R. Silva <gustavoars@kernel.org>
    Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Michal Miroslaw <emmir@google.com>
    Cc: Mike Rapoport (IBM) <rppt@kernel.org>
    Cc: Nadav Amit <namit@vmware.com>
    Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
    Cc: Paul Gofman <pgofman@codeweavers.com>
    Cc: Shuah Khan <shuah@kernel.org>
    Cc: Suren Baghdasaryan <surenb@google.com>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: Yang Shi <shy828301@gmail.com>
    Cc: Yun Zhou <yun.zhou@windriver.com>
    Cc: Michał Mirosław <mirq-linux@rere.qmqm.pl>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>