1. 29 Dec, 2023 10 commits
    • selftests/mm: call uffd_test_ctx_clear at the end of the test · 1c8d39fa
      Suren Baghdasaryan authored
      uffd_test_ctx_clear() is being called from uffd_test_ctx_init() to unmap
      areas used in the previous test run.  This approach is problematic because
      while unmapping areas uffd_test_ctx_clear() uses page_size and nr_pages
      which might differ from one test run to another.  Fix this by calling
      uffd_test_ctx_clear() after each test is done.
      
      Link: https://lkml.kernel.org/r/20231206103702.3873743-4-surenb@google.com
      Signed-off-by: Suren Baghdasaryan <surenb@google.com>
      Reviewed-by: Peter Xu <peterx@redhat.com>
      Reviewed-by: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Kalesh Singh <kaleshsingh@google.com>
      Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
      Cc: Lokesh Gidra <lokeshgidra@google.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: Nicolas Geoffray <ngeoffray@google.com>
      Cc: Ryan Roberts <ryan.roberts@arm.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: ZhangPeng <zhangpeng362@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • userfaultfd: UFFDIO_MOVE uABI · adef4406
      Andrea Arcangeli authored
      Implement the uABI of UFFDIO_MOVE ioctl.
      UFFDIO_COPY performs ~20% better than UFFDIO_MOVE when the application
      needs pages to be allocated [1]. However, with UFFDIO_MOVE, if pages are
      available (in userspace) for recycling, as is usually the case in heap
      compaction algorithms, then we can avoid the page allocation and memcpy
      (done by UFFDIO_COPY). Also, since the pages are recycled in the
      userspace, we avoid the need to release (via madvise) the pages back to
      the kernel [2].
      
      We see over 40% reduction (on a Google Pixel 6 device) in the compacting
      thread's completion time by using UFFDIO_MOVE vs. UFFDIO_COPY.  This was
      measured using a benchmark that emulates a heap compaction implementation
      using userfaultfd (to allow concurrent accesses by application threads).
      More details of the use case are explained in [2].  Furthermore,
      UFFDIO_MOVE enables moving swapped-out pages without touching them within
      the same vma.  Today, this can only be done with mremap(), which forces
      splitting the vma.
      
      [1] https://lore.kernel.org/all/1425575884-2574-1-git-send-email-aarcange@redhat.com/
      [2] https://lore.kernel.org/linux-mm/CA+EESO4uO84SSnBhArH4HvLNhaUQ5nZKNKXqxRCyjniNVjp0Aw@mail.gmail.com/
      
      Update for the ioctl_userfaultfd(2)  manpage:
      
         UFFDIO_MOVE
             (Since Linux xxx)  Move a continuous memory chunk into the
             userfault registered range and optionally wake up the blocked
             thread. The source and destination addresses and the number of
             bytes to move are specified by the src, dst, and len fields of
             the uffdio_move structure pointed to by argp:
      
                 struct uffdio_move {
                     __u64 dst;    /* Destination of move */
                     __u64 src;    /* Source of move */
                     __u64 len;    /* Number of bytes to move */
                     __u64 mode;   /* Flags controlling behavior of move */
                     __s64 move;   /* Number of bytes moved, or negated error */
                 };
      
             The following value may be bitwise ORed in mode to change the
             behavior of the UFFDIO_MOVE operation:
      
             UFFDIO_MOVE_MODE_DONTWAKE
                    Do not wake up the thread that waits for page-fault
                    resolution
      
             UFFDIO_MOVE_MODE_ALLOW_SRC_HOLES
                    Allow holes in the source virtual range that is being moved.
                    When not specified, the holes will result in ENOENT error.
                    When specified, the holes will be accounted as successfully
                    moved memory. This is mostly useful to move hugepage aligned
                    virtual regions without knowing if there are transparent
                    hugepages in the regions or not, but preventing the risk of
                    having to split the hugepage during the operation.
      
             The move field is used by the kernel to return the number of
             bytes that was actually moved, or an error (a negated errno-
             style value).  If the value returned in move doesn't match the
             value that was specified in len, the operation fails with the
             error EAGAIN.  The move field is output-only; it is not read by
             the UFFDIO_MOVE operation.
      
             The operation may fail for various reasons. Usually, remapping of
             pages that are not exclusive to the given process fails; once KSM
             deduplicates pages or fork() COW-shares pages with child
             processes, they are no longer exclusive. Further, the
             kernel might only perform lightweight checks for detecting whether
             the pages are exclusive, and return -EBUSY in case that check fails.
             To make the operation more likely to succeed, KSM should be
             disabled, fork() should be avoided or MADV_DONTFORK should be
             configured for the source VMA before fork().
      
             This ioctl(2) operation returns 0 on success.  In this case, the
             entire area was moved.  On error, -1 is returned and errno is
             set to indicate the error.  Possible errors include:
      
             EAGAIN The number of bytes moved (i.e., the value returned in
                    the move field) does not equal the value that was
                    specified in the len field.
      
             EINVAL Either dst or len was not a multiple of the system page
                    size, or the range specified by src and len or dst and len
                    was invalid.
      
             EINVAL An invalid bit was specified in the mode field.
      
             ENOENT
                    The source virtual memory range has unmapped holes and
                    UFFDIO_MOVE_MODE_ALLOW_SRC_HOLES is not set.
      
             EEXIST
                    The destination virtual memory range is fully or partially
                    mapped.
      
             EBUSY
                    The pages in the source virtual memory range are either
                    pinned or not exclusive to the process. The kernel might
                    only perform lightweight checks for detecting whether the
                    pages are exclusive. To make the operation more likely to
                    succeed, KSM should be disabled, fork() should be avoided
                    or MADV_DONTFORK should be configured for the source virtual
                    memory area before fork().
      
             ENOMEM Allocating memory needed for the operation failed.
      
             ESRCH
                    The target process has exited at the time of a UFFDIO_MOVE
                    operation.
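
      The error handling described in the manpage text above could be sketched
      in userspace roughly as follows.  This is only an illustration under
      stated assumptions: struct uffdio_move_sketch mirrors the quoted
      structure, and move_next_step() with its enum is a hypothetical helper,
      not part of the uABI.  It assumes a caller that retries the remainder of
      the range when the ioctl fails with EAGAIN and a non-negative move value.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/*
 * Hypothetical local mirror of the structure quoted above. Real code would
 * use struct uffdio_move from <linux/userfaultfd.h> on a kernel that
 * provides UFFDIO_MOVE.
 */
struct uffdio_move_sketch {
    uint64_t dst;   /* Destination of move */
    uint64_t src;   /* Source of move */
    uint64_t len;   /* Number of bytes to move */
    uint64_t mode;  /* Flags controlling behavior of move */
    int64_t  move;  /* Number of bytes moved, or negated error */
};

/* Hypothetical outcome classification for one UFFDIO_MOVE attempt. */
enum move_action { MOVE_DONE, MOVE_RETRY_REST, MOVE_FAILED };

/*
 * Interpret the result of one ioctl call. 'ret' is the ioctl return value
 * and 'err' is the errno observed right after it. On partial progress
 * (EAGAIN with 0 <= move < len) the request is advanced past the bytes
 * already moved so the caller can issue it again.
 */
enum move_action move_next_step(struct uffdio_move_sketch *mv, int ret, int err)
{
    if (ret == 0)
        return MOVE_DONE;           /* the entire area was moved */

    if (err == EAGAIN && mv->move >= 0 && (uint64_t)mv->move < mv->len) {
        uint64_t done = (uint64_t)mv->move;

        mv->dst += done;
        mv->src += done;
        mv->len -= done;
        mv->move = 0;               /* output-only field, reset for clarity */
        return MOVE_RETRY_REST;
    }

    return MOVE_FAILED;             /* EINVAL, ENOENT, EEXIST, EBUSY, ... */
}
```

      A caller would pass in the return value and errno from
      ioctl(uffd, UFFDIO_MOVE, &mv) and loop while the helper returns
      MOVE_RETRY_REST; whether retrying is appropriate for a given workload is
      a policy choice, not something the manpage text mandates.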
      
      Link: https://lkml.kernel.org/r/20231206103702.3873743-3-surenb@google.com
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Suren Baghdasaryan <surenb@google.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Kalesh Singh <kaleshsingh@google.com>
      Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
      Cc: Lokesh Gidra <lokeshgidra@google.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: Nicolas Geoffray <ngeoffray@google.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Ryan Roberts <ryan.roberts@arm.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: ZhangPeng <zhangpeng362@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/rmap: support move to different root anon_vma in folio_move_anon_rmap() · 880a99b6
      Andrea Arcangeli authored
      Patch series "userfaultfd move option", v6.
      
      This patch series introduces UFFDIO_MOVE feature to userfaultfd, which has
      long been implemented and maintained by Andrea in his local tree [1], but
      was not upstreamed due to lack of use cases where this approach would be
      better than allocating a new page and copying the contents.  Previous
      upstreaming attempts can be found at [6] and [7].
      
      UFFDIO_COPY performs ~20% better than UFFDIO_MOVE when the application
      needs pages to be allocated [2].  However, with UFFDIO_MOVE, if pages are
      available (in userspace) for recycling, as is usually the case in heap
      compaction algorithms, then we can avoid the page allocation and memcpy
      (done by UFFDIO_COPY).  Also, since the pages are recycled in the
      userspace, we avoid the need to release (via madvise) the pages back to
      the kernel [3].  We see over 40% reduction (on a Google Pixel 6 device) in
      the compacting thread's completion time by using UFFDIO_MOVE vs.
      UFFDIO_COPY.  This was measured using a benchmark that emulates a heap
      compaction implementation using userfaultfd (to allow concurrent accesses
      by application threads).  More details of the use case are explained in
      [3].
      
      Furthermore, UFFDIO_MOVE enables moving swapped-out pages without
      touching them within the same vma. Today, this can only be done with
      mremap(), which forces splitting the vma.
      
      TODOs for follow-up improvements:
      - cross-mm support. Known differences from single-mm and missing pieces:
      	- memcg recharging (might need to isolate pages in the process)
      	- mm counters
      	- cross-mm deposit table moves
      	- cross-mm test
      	- document the address space where src and dest reside in struct
      	  uffdio_move
      
      - TLB flush batching.  Will require extensive changes to PTL locking in
        move_pages_pte().  OTOH that might let us reuse parts of mremap code.
      
      
      This patch (of 5):
      
      Until now, folio_move_anon_rmap() was only used to move a folio to a
      different anon_vma after fork(), whereby the root anon_vma stayed
      unchanged.  For that, it was sufficient to hold the folio lock when
      calling folio_move_anon_rmap().
      
      However, we want to make use of folio_move_anon_rmap() to move folios
      between VMAs that have a different root anon_vma.  As folio_referenced()
      performs an RMAP walk without holding the folio lock but only holding the
      anon_vma in read mode, holding the folio lock is insufficient.
      
      When moving to an anon_vma with a different root anon_vma, we'll have to
      hold both the folio lock and the anon_vma lock in write mode.
      Consequently, whenever folio_lock_anon_vma_read() succeeds in
      read-locking the anon_vma, we have to re-check whether the mapping was
      changed in the meantime.  If so, we have to retry.
      
      Note that folio_move_anon_rmap() must only be called if the anon page is
      exclusive to a process, and must not be called on KSM folios.
      
      This is a preparation for UFFDIO_MOVE, which will hold the folio lock, the
      anon_vma lock in write mode, and the mmap_lock in read mode.
      
      Link: https://lkml.kernel.org/r/20231206103702.3873743-1-surenb@google.com
      Link: https://lkml.kernel.org/r/20231206103702.3873743-2-surenb@google.com
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Suren Baghdasaryan <surenb@google.com>
      Acked-by: Peter Xu <peterx@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Kalesh Singh <kaleshsingh@google.com>
      Cc: kernel-team@android.com
      Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
      Cc: Lokesh Gidra <lokeshgidra@google.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: Nicolas Geoffray <ngeoffray@google.com>
      Cc: Ryan Roberts <ryan.roberts@arm.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: ZhangPeng <zhangpeng362@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • buffer: fix more functions for block size > PAGE_SIZE · fa399c31
      Matthew Wilcox (Oracle) authored
      Both __block_write_full_folio() and block_read_full_folio() assumed that
      block size <= PAGE_SIZE.  Replace the shift with a divide, which is
      probably cheaper than first calculating the shift.  That lets us remove
      block_size_bits() as these were the last callers.
      
      Link: https://lkml.kernel.org/r/20231109210608.2252323-8-willy@infradead.org
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Hannes Reinecke <hare@suse.de>
      Cc: Luis Chamberlain <mcgrof@kernel.org>
      Cc: Pankaj Raghav <p.raghav@samsung.com>
      Cc: Ryusuke Konishi <konishi.ryusuke@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • buffer: handle large folios in __block_write_begin_int() · b0619401
      Matthew Wilcox (Oracle) authored
      When __block_write_begin_int() was converted to support folios, we did not
      expect large folios to be passed to it.  With the current work to support
      large block size storage devices, this will no longer be true so change
      the checks on 'from' and 'to' to be related to the size of the folio
      instead of PAGE_SIZE.  Also remove an assumption that the block size is
      smaller than PAGE_SIZE.
      
      Link: https://lkml.kernel.org/r/20231109210608.2252323-7-willy@infradead.org
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Reported-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
      Cc: Hannes Reinecke <hare@suse.de>
      Cc: Luis Chamberlain <mcgrof@kernel.org>
      Cc: Pankaj Raghav <p.raghav@samsung.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • buffer: fix various functions for block size > PAGE_SIZE · 4b04646c
      Matthew Wilcox (Oracle) authored
      If i_blkbits is larger than PAGE_SHIFT, we shift by a negative number,
      which is undefined.  It is safe to shift the block left as a block device
      must be smaller than MAX_LFS_FILESIZE, which is guaranteed to fit in
      loff_t.
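
      The arithmetic can be illustrated with a small userspace sketch.  The
      names sketch_block_to_index() and SKETCH_PAGE_SHIFT are hypothetical and
      a 4 KiB page size is assumed; this is not the kernel code itself, only
      the shape of the fix: convert the block number to a byte offset first,
      then derive the page index, so no shift count depends on the sign of
      PAGE_SHIFT - blkbits.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12    /* assumption: 4 KiB pages */

/*
 * Page-cache index of the page that contains 'block'.
 *
 * The problematic form would be: block >> (SKETCH_PAGE_SHIFT - blkbits),
 * where the shift count goes negative once blkbits > SKETCH_PAGE_SHIFT.
 * Shifting left to a byte offset first works for any block size, as long
 * as the byte offset fits in a signed 64-bit value (the commit message
 * argues this holds for block devices).
 */
uint64_t sketch_block_to_index(uint64_t block, unsigned int blkbits)
{
    int64_t pos = (int64_t)block << blkbits;   /* byte offset, like loff_t */

    return (uint64_t)pos >> SKETCH_PAGE_SHIFT;
}
```

      For example, with 64 KiB blocks (blkbits = 16), block 3 starts at byte
      196608, which lies in page 48.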
      
      Link: https://lkml.kernel.org/r/20231109210608.2252323-6-willy@infradead.org
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Reviewed-by: Pankaj Raghav <p.raghav@samsung.com>
      Cc: Hannes Reinecke <hare@suse.de>
      Cc: Luis Chamberlain <mcgrof@kernel.org>
      Cc: Ryusuke Konishi <konishi.ryusuke@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • buffer: cast block to loff_t before shifting it · 80844194
      Matthew Wilcox (Oracle) authored
      While sector_t is always defined as a u64 today, that hasn't always been
      the case and it might not always be the same size as loff_t in the future.
      
      Link: https://lkml.kernel.org/r/20231109210608.2252323-5-willy@infradead.org
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Hannes Reinecke <hare@suse.de>
      Cc: Luis Chamberlain <mcgrof@kernel.org>
      Cc: Pankaj Raghav <p.raghav@samsung.com>
      Cc: Ryusuke Konishi <konishi.ryusuke@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • buffer: fix grow_buffers() for block size > PAGE_SIZE · 5f3bd90d
      Matthew Wilcox (Oracle) authored
      We must not shift by a negative number so work in terms of a byte offset
      to avoid the awkward shift left-or-right-depending-on-sign option.  This
      means we need to use check_mul_overflow() to ensure that a large block
      number does not result in a wrap.
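
      A userspace analogue of the overflow check might look like the sketch
      below.  It uses the GCC/Clang builtin __builtin_mul_overflow() in place
      of the kernel's check_mul_overflow(); sketch_block_pos() is a
      hypothetical name, and both operands are kept as the same unsigned
      64-bit type, which is an assumption made here to keep the multiplication
      a plain 64-bit one.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Compute the byte offset of 'block' for a given block 'size'.
 * Returns false if block * size wraps around or does not fit in a signed
 * 64-bit offset; otherwise stores the offset in *pos and returns true.
 */
bool sketch_block_pos(uint64_t block, unsigned int size, int64_t *pos)
{
    uint64_t upos;

    /* Same-typed operands: a plain unsigned 64-bit multiply with overflow flag. */
    if (__builtin_mul_overflow(block, (uint64_t)size, &upos))
        return false;
    if (upos > (uint64_t)INT64_MAX)
        return false;

    *pos = (int64_t)upos;
    return true;
}
```

      Working in byte offsets means a single multiplication replaces the
      left-or-right shift decision, at the cost of having to reject block
      numbers whose offset cannot be represented.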
      
      Link: https://lkml.kernel.org/r/20231109210608.2252323-4-willy@infradead.org
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Signed-off-by: Nathan Chancellor <nathan@kernel.org>
      Cc: Hannes Reinecke <hare@suse.de>
      Cc: Luis Chamberlain <mcgrof@kernel.org>
      Cc: Pankaj Raghav <p.raghav@samsung.com>
      Cc: Ryusuke Konishi <konishi.ryusuke@gmail.com>
      [nathan@kernel.org: add cast in grow_buffers() to avoid a multiplication libcall]
        Link: https://lkml.kernel.org/r/20231128-avoid-muloti4-grow_buffers-v1-1-bc3d0f0ec483@kernel.org
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • buffer: calculate block number inside folio_init_buffers() · 382497ad
      Matthew Wilcox (Oracle) authored
      The calculation of block from index doesn't work for devices with a block
      size larger than PAGE_SIZE as we end up shifting by a negative number. 
      Instead, calculate the number of the first block from the folio's position
      in the block device.  We no longer need to pass sizebits to
      grow_dev_folio().
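
      As a rough illustration of deriving the first block from the folio's
      position, consider the userspace sketch below.  sketch_first_block() and
      SKETCH_PAGE_SHIFT are hypothetical names, a 4 KiB page size is assumed,
      and the folio is represented only by its page-cache index; the kernel
      implementation may differ in detail.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12    /* assumption: 4 KiB pages */

/*
 * Number of the first block covered by a folio that starts at page-cache
 * index 'folio_index'. Going through the byte position avoids any shift
 * by (PAGE_SHIFT - blkbits), so block sizes above the page size work too.
 */
uint64_t sketch_first_block(uint64_t folio_index, unsigned int block_size)
{
    uint64_t pos = folio_index << SKETCH_PAGE_SHIFT;   /* byte position */

    return pos / block_size;
}
```

      With 512-byte blocks a folio at index 4 starts at block 32, while with
      64 KiB blocks a folio at index 16 starts at block 1.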
      
      Link: https://lkml.kernel.org/r/20231109210608.2252323-3-willy@infradead.org
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Reviewed-by: Pankaj Raghav <p.raghav@samsung.com>
      Cc: Hannes Reinecke <hare@suse.de>
      Cc: Luis Chamberlain <mcgrof@kernel.org>
      Cc: Ryusuke Konishi <konishi.ryusuke@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • buffer: return bool from grow_dev_folio() · 6d840a18
      Matthew Wilcox (Oracle) authored
      Patch series "More buffer_head cleanups", v2.
      
      The first patch is a left-over from last cycle.  The rest fix "obvious"
      block size > PAGE_SIZE problems.  I haven't tested with a large block size
      setup (but I have done an ext4 xfstests run).
      
      
      This patch (of 7):
      
      Rename grow_dev_page() to grow_dev_folio() and make it return a bool. 
      Document what that bool means; it's more subtle than it first appears. 
      Also rename the 'failed' label to 'unlock' because it's not exactly
      'failed'.  It just hasn't succeeded.
      
      Link: https://lkml.kernel.org/r/20231109210608.2252323-2-willy@infradead.org
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Hannes Reinecke <hare@suse.de>
      Cc: Luis Chamberlain <mcgrof@kernel.org>
      Cc: Pankaj Raghav <p.raghav@samsung.com>
      Cc: Ryusuke Konishi <konishi.ryusuke@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  2. 20 Dec, 2023 30 commits