1. 29 Dec, 2023 23 commits
    • mm/ksm: add sysfs knobs for advisor · 66790e9a
      Stefan Roesch authored
      This adds five new knobs for the KSM advisor to influence its behaviour.
      
      The knobs are:
      - advisor_mode:
          none:      no advisor (default)
          scan-time: scan time advisor
      - advisor_max_cpu: 70 (default, cpu usage percent)
      - advisor_min_pages_to_scan: 500 (default)
      - advisor_max_pages_to_scan: 30000 (default)
      - advisor_target_scan_time: 200 (default in seconds)
      
      The new values will take effect on the next scan round.
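
      As a minimal illustration (not part of the patch), the advisor could be
      enabled from userspace by writing the new sysfs files; the knob names
      are the ones added above, everything else here is a sketch:

          /* sketch: enable the scan-time advisor via the new sysfs knobs */
          #include <stdio.h>

          static void write_knob(const char *name, const char *val)
          {
                  char path[128];
                  FILE *f;

                  snprintf(path, sizeof(path), "/sys/kernel/mm/ksm/%s", name);
                  f = fopen(path, "w");
                  if (!f)
                          return;         /* knob missing or no permission */
                  fputs(val, f);
                  fclose(f);
          }

          int main(void)
          {
                  write_knob("advisor_mode", "scan-time");
                  write_knob("advisor_target_scan_time", "200");
                  write_knob("advisor_max_cpu", "70");
                  return 0;
          }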
      
      Link: https://lkml.kernel.org/r/20231218231054.1625219-3-shr@devkernel.io
      Signed-off-by: Stefan Roesch <shr@devkernel.io>
      Acked-by: David Hildenbrand <david@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Rik van Riel <riel@surriel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/ksm: add ksm advisor · 4e5fa4f5
      Stefan Roesch authored
      Patch series "mm/ksm: Add ksm advisor", v5.
      
      What is the KSM advisor?
      =========================
      The ksm advisor automatically manages the pages_to_scan setting to achieve
      a target scan time.  The target scan time defines how many seconds it
      should take to scan all the candidate KSM pages.  In other words the
      pages_to_scan rate is changed by the advisor to achieve the target scan
      time.
      
      Why do we need a KSM advisor?
      ==============================
      The number of candidate pages for KSM is dynamic.  It can often be
      observed that during the startup of an application more candidate pages
      need to be processed.  Without an advisor the pages_to_scan parameter
      needs to be sized for the maximum number of candidate pages.  With the
      scan time advisor the pages_to_scan parameter can be changed based on
      demand.
      
      Algorithm
      ==========
      The algorithm calculates the change value based on the target scan time
      and the previous scan time.  To avoid perturbations an exponentially
      weighted moving average is applied.
      
      The algorithm has a max and min value to:
      - guarantee responsiveness to changes
      - limit CPU resource consumption
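
      The shape of the update step can be sketched as follows (illustrative
      only: the constants and helper names are placeholders, not the kernel's
      actual code; target_scan_time is assumed non-zero):

          #define EWMA_WEIGHT     30      /* placeholder smoothing factor, % */

          static unsigned long ewma(unsigned long prev, unsigned long curr)
          {
                  return (EWMA_WEIGHT * curr + (100 - EWMA_WEIGHT) * prev) / 100;
          }

          static unsigned long advise_pages_to_scan(unsigned long pages_to_scan,
                          unsigned long scan_time, unsigned long target_scan_time,
                          unsigned long min_pages, unsigned long max_pages)
          {
                  /* scale the scan rate by how far off the target we were */
                  unsigned long change = pages_to_scan * scan_time / target_scan_time;

                  /* smooth to avoid perturbations, then clamp to [min, max] */
                  pages_to_scan = ewma(pages_to_scan, change);
                  if (pages_to_scan < min_pages)
                          pages_to_scan = min_pages;
                  if (pages_to_scan > max_pages)
                          pages_to_scan = max_pages;
                  return pages_to_scan;
          }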
      
      Parameters to influence the KSM scan advisor
      =============================================
      The respective parameters are:
      - ksm_advisor_mode
        0: None (default), 1: scan time advisor
      - ksm_advisor_target_scan_time
        how many seconds a scan of all candidate pages should take
      - ksm_advisor_max_cpu
        upper limit for the cpu usage in percent of the ksmd background thread
      
      The initial value and the max value for the pages_to_scan parameter can
      be limited with:
      - ksm_advisor_min_pages_to_scan
        minimum value for pages_to_scan per batch
      - ksm_advisor_max_pages_to_scan
        maximum value for pages_to_scan per batch
      
      The default settings for the above two parameters should be suitable for
      most workloads.
      
      The parameters are exposed as knobs in /sys/kernel/mm/ksm. By default the
      scan time advisor is disabled.
      
      Currently there are two advisors:
      - none and
      - scan-time.
      
      Resource savings
      =================
      Tests with various workloads have shown considerable CPU savings. Most
      of the workloads I have investigated have more candidate pages during
      startup. Once the workload is stable in terms of memory, the number of
      candidate pages is reduced. Without the advisor, the pages_to_scan needs
      to be sized for the maximum number of candidate pages. So having this
      advisor definitely helps in reducing CPU consumption.
      
      For the Instagram workload, the advisor achieves a 25% CPU reduction.
      Once the memory is stable, the pages_to_scan parameter gets reduced to
      about 40% of its max value.
      
      The new advisor works especially well if the smart scan feature is also
      enabled.
      
      How is defining a target scan time better?
      ===========================================
      For an administrator it is more logical to set a target scan time.  The
      administrator can determine how many pages are scanned on each scan.
      Therefore setting a target scan time makes more sense.
      
      In addition the administrator might have a good idea about the memory
      sizing of their respective workloads.
      
      Setting CPU limits is easier than setting the pages_to_scan parameter.
      The pages_to_scan parameter is per batch, which makes it difficult for
      the administrator to set directly.
      
      Tracing
      =======
      A new tracing event has been added for the scan time advisor. The new
      trace event is called ksm_advisor. It reports the scan time, the new
      pages_to_scan setting and the cpu usage of the ksmd background thread.
      
      Other approaches
      =================
      
      Approach 1: Adapt pages_to_scan after processing each batch. If KSM
        merges pages, increase the scan rate; if fewer KSM pages are found,
        reduce the pages_to_scan rate. This doesn't work too well. While it
        increases the pages_to_scan for a short period, it generally ends up
        with a too low pages_to_scan rate.
      
      Approach 2: Adapt pages_to_scan after each scan. The problem with that
        approach is that the calculated scan rate tends to be high. The more
        aggressively KSM scans, the more pages it can de-duplicate.
      
      There have been earlier attempts at an advisor:
        propose auto-run mode of ksm and its tests
        (https://marc.info/?l=linux-mm&m=166029880214485&w=2)
      
      
      This patch (of 5):
      
      This adds the ksm advisor.  The ksm advisor automatically manages the
      pages_to_scan setting to achieve a target scan time.  The target scan time
      defines how many seconds it should take to scan all the candidate KSM
      pages.  In other words the pages_to_scan rate is changed by the advisor to
      achieve the target scan time.  The algorithm has a max and min value to:
      
      - guarantee responsiveness to changes
      - limit CPU resource consumption
      
      The respective parameters are:
      - ksm_advisor_target_scan_time (how many seconds a scan should take)
      - ksm_advisor_max_cpu (maximum value for cpu percent usage)
      
      - ksm_advisor_min_pages (minimum value for pages_to_scan per batch)
      - ksm_advisor_max_pages (maximum value for pages_to_scan per batch)
      
      The algorithm calculates the change value based on the target scan time
      and the previous scan time. To avoid perturbations an exponentially
      weighted moving average is applied.
      
      The advisor is managed by two main parameters: the target scan time and
      the max CPU time for the ksmd background thread. These parameters
      determine how aggressively ksmd scans.
      
      In addition there are min and max values for the pages_to_scan parameter
      to make sure that its initial and max values are not set too low or too
      high.  This ensures that it is able to react to changes quickly enough.
      
      The default values are:
      - target scan time: 200 secs
      - max cpu: 70%
      - min pages: 500
      - max pages: 30000
      
      By default the advisor is disabled. Currently there are two advisors:
      none and scan-time.
      
      Tests with various workloads have shown considerable CPU savings.  Most of
      the workloads I have investigated have more candidate pages during
      startup.  Once the workload is stable in terms of memory, the number of
      candidate pages is reduced.  Without the advisor, the pages_to_scan needs
      to be sized for the maximum number of candidate pages.  So having this
      advisor definitely helps in reducing CPU consumption.
      
      For the Instagram workload, the advisor achieves a 25% CPU reduction.
      Once the memory is stable, the pages_to_scan parameter gets reduced to
      about 40% of its max value.
      
      Link: https://lkml.kernel.org/r/20231218231054.1625219-1-shr@devkernel.io
      Link: https://lkml.kernel.org/r/20231218231054.1625219-2-shr@devkernel.io
      Signed-off-by: Stefan Roesch <shr@devkernel.io>
      Acked-by: David Hildenbrand <david@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Stefan Roesch <shr@devkernel.io>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: remove page_add_new_anon_rmap and lru_cache_add_inactive_or_unevictable · cafa8e37
      Matthew Wilcox (Oracle) authored
      All callers have now been converted to folio_add_new_anon_rmap() and
      folio_add_lru_vma() so we can remove the wrapper.
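
      The call-site pattern being replaced looks roughly like this
      (illustrative fragment, not an actual diff hunk from the series):

          /* kernel-context sketch: how a converted call site changes */
          static void example_map_new_anon(struct folio *folio,
                          struct vm_area_struct *vma, unsigned long addr)
          {
                  /*
                   * before the conversion (page-based):
                   *     page_add_new_anon_rmap(page, vma, addr);
                   *     lru_cache_add_inactive_or_unevictable(page, vma);
                   * after the conversion (folio-based):
                   */
                  folio_add_new_anon_rmap(folio, vma, addr);
                  folio_add_lru_vma(folio, vma);
          }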
      
      Link: https://lkml.kernel.org/r/20231211162214.2146080-10-willy@infradead.org
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: remove stale example from comment · b2926ac8
      Matthew Wilcox (Oracle) authored
      folio_add_new_anon_rmap() no longer works this way, so just remove the
      entire example.
      
      Link: https://lkml.kernel.org/r/20231211162214.2146080-6-willy@infradead.org
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: remove some calls to page_add_new_anon_rmap() · 2853b66b
      Matthew Wilcox (Oracle) authored
      We already have the folio in these functions; we just need to use it.
      folio_add_new_anon_rmap() didn't exist at the time they were converted to
      folios.
      
      Link: https://lkml.kernel.org/r/20231211162214.2146080-5-willy@infradead.org
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: remove PageAnonExclusive assertions in unuse_pte() · 8d294a8c
      Matthew Wilcox (Oracle) authored
      The page in question is either freshly allocated or known to be in
      the swap cache; these assertions are not particularly useful.
      
      Link: https://lkml.kernel.org/r/20231212164813.2540119-1-willy@infradead.org
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: convert ksm_might_need_to_copy() to work on folios · 96db66d9
      Matthew Wilcox (Oracle) authored
      Patch series "Finish two folio conversions".
      
      Most callers of page_add_new_anon_rmap() and
      lru_cache_add_inactive_or_unevictable() have been converted to their folio
      equivalents, but there are still a few stragglers.  There's a bit of
      preparatory work in ksm and unuse_pte(), but after that it's pretty
      mechanical.
      
      
      This patch (of 9):
      
      Accept a folio as an argument and return a folio result.  Removes a call
      to compound_head() in do_swap_page(), and prevents folio & page from
      getting out of sync in unuse_pte().
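
      As a rough illustration of the interface change (simplified; see the
      patch for the real diff), the helper now takes and returns a folio:

          /* before */
          struct page *ksm_might_need_to_copy(struct page *page,
                          struct vm_area_struct *vma, unsigned long addr);

          /* after */
          struct folio *ksm_might_need_to_copy(struct folio *folio,
                          struct vm_area_struct *vma, unsigned long addr);
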
      Reviewed-by: David Hildenbrand <david@redhat.com>
      [willy@infradead.org: fix smatch warning]
        Link: https://lkml.kernel.org/r/ZXnPtblC6A1IkyAB@casper.infradead.org
      [david@redhat.com: only adjust the page if the folio changed]
        Link: https://lkml.kernel.org/r/6a8f2110-fa91-4c10-9eae-88315309a6e3@redhat.com
      Link: https://lkml.kernel.org/r/20231211162214.2146080-1-willy@infradead.org
      Link: https://lkml.kernel.org/r/20231211162214.2146080-2-willy@infradead.org
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • selftests/mm: add UFFDIO_MOVE ioctl test · a2bf6a9c
      Suren Baghdasaryan authored
      Add tests for the new UFFDIO_MOVE ioctl, which uses uffd to move the
      source into the destination buffer while checking the contents of both
      after the move.
      After the operation the content of the destination buffer should match the
      original source buffer's content while the source buffer should be zeroed.
      Separate tests are designed for PMD aligned and unaligned cases because
      they utilize different code paths in the kernel.
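
      Conceptually, the post-move check is (a simplified sketch, not the
      test's actual code; saved_src is a hypothetical copy taken before the
      move):

          /* verify the move semantics described above */
          for (size_t i = 0; i < len; i++) {
                  assert(dst[i] == saved_src[i]);  /* contents moved */
                  assert(src[i] == 0);             /* source zeroed */
          }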
      
      Link: https://lkml.kernel.org/r/20231206103702.3873743-6-surenb@google.com
      Signed-off-by: Suren Baghdasaryan <surenb@google.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Kalesh Singh <kaleshsingh@google.com>
      Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
      Cc: Lokesh Gidra <lokeshgidra@google.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: Nicolas Geoffray <ngeoffray@google.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Ryan Roberts <ryan.roberts@arm.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: ZhangPeng <zhangpeng362@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • selftests/mm: add uffd_test_case_ops to allow test case-specific operations · e8a42240
      Suren Baghdasaryan authored
      Currently each test can specify unique operations using uffd_test_ops,
      however these operations are per-memory type and not per-test.  Add
      uffd_test_case_ops which each test case can customize for its own needs
      regardless of the memory type being used.  Pre- and post-allocation
      operations are added, some of which will be used in the next patch to
      implement test-specific operations like madvise after memory is allocated
      but before it is accessed.
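
      A hedged sketch of the shape of the new hooks (treat field names and
      signatures as illustrative; the real definition lives in the selftest
      headers):

          /* per-test-case hooks, independent of the memory type */
          struct uffd_test_case_ops {
                  int (*pre_alloc)(const char **errmsg);
                  int (*post_alloc)(const char **errmsg);
          };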
      
      Link: https://lkml.kernel.org/r/20231206103702.3873743-5-surenb@google.com
      Signed-off-by: Suren Baghdasaryan <surenb@google.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Kalesh Singh <kaleshsingh@google.com>
      Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
      Cc: Lokesh Gidra <lokeshgidra@google.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: Nicolas Geoffray <ngeoffray@google.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Ryan Roberts <ryan.roberts@arm.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: ZhangPeng <zhangpeng362@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • selftests/mm: call uffd_test_ctx_clear at the end of the test · 1c8d39fa
      Suren Baghdasaryan authored
      uffd_test_ctx_clear() is being called from uffd_test_ctx_init() to unmap
      areas used in the previous test run.  This approach is problematic because
      while unmapping areas uffd_test_ctx_clear() uses page_size and nr_pages
      which might differ from one test run to another.  Fix this by calling
      uffd_test_ctx_clear() after each test is done.
      
      Link: https://lkml.kernel.org/r/20231206103702.3873743-4-surenb@google.com
      Signed-off-by: Suren Baghdasaryan <surenb@google.com>
      Reviewed-by: Peter Xu <peterx@redhat.com>
      Reviewed-by: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Kalesh Singh <kaleshsingh@google.com>
      Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
      Cc: Lokesh Gidra <lokeshgidra@google.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: Nicolas Geoffray <ngeoffray@google.com>
      Cc: Ryan Roberts <ryan.roberts@arm.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: ZhangPeng <zhangpeng362@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • userfaultfd: UFFDIO_MOVE uABI · adef4406
      Andrea Arcangeli authored
      Implement the uABI of UFFDIO_MOVE ioctl.
      
      UFFDIO_COPY performs ~20% better than UFFDIO_MOVE when the application
      needs pages to be allocated [1]. However, with UFFDIO_MOVE, if pages are
      available (in userspace) for recycling, as is usually the case in heap
      compaction algorithms, then we can avoid the page allocation and memcpy
      (done by UFFDIO_COPY). Also, since the pages are recycled in the
      userspace, we avoid the need to release (via madvise) the pages back to
      the kernel [2].
      
      We see over 40% reduction (on a Google pixel 6 device) in the compacting
      thread's completion time by using UFFDIO_MOVE vs.  UFFDIO_COPY.  This was
      measured using a benchmark that emulates a heap compaction implementation
      using userfaultfd (to allow concurrent accesses by application threads). 
      More details of the usecase are explained in [2].  Furthermore,
      UFFDIO_MOVE enables moving swapped-out pages without touching them within
      the same vma.  Today, this can only be done by mremap; however, that
      forces splitting the vma.
      
      [1] https://lore.kernel.org/all/1425575884-2574-1-git-send-email-aarcange@redhat.com/
      [2] https://lore.kernel.org/linux-mm/CA+EESO4uO84SSnBhArH4HvLNhaUQ5nZKNKXqxRCyjniNVjp0Aw@mail.gmail.com/
      
      Update for the ioctl_userfaultfd(2) manpage:
      
         UFFDIO_MOVE
             (Since Linux xxx)  Move a continuous memory chunk into the
             userfault registered range and optionally wake up the blocked
             thread. The source and destination addresses and the number of
             bytes to move are specified by the src, dst, and len fields of
             the uffdio_move structure pointed to by argp:
      
                 struct uffdio_move {
                     __u64 dst;    /* Destination of move */
                     __u64 src;    /* Source of move */
                     __u64 len;    /* Number of bytes to move */
                     __u64 mode;   /* Flags controlling behavior of move */
                     __s64 move;   /* Number of bytes moved, or negated error */
                 };
      
             The following values may be bitwise ORed in mode to change the
             behavior of the UFFDIO_MOVE operation:
      
             UFFDIO_MOVE_MODE_DONTWAKE
                    Do not wake up the thread that waits for page-fault
                    resolution
      
             UFFDIO_MOVE_MODE_ALLOW_SRC_HOLES
                    Allow holes in the source virtual range that is being moved.
                    When not specified, the holes will result in ENOENT error.
                    When specified, the holes will be accounted as successfully
                    moved memory. This is mostly useful to move hugepage aligned
                    virtual regions without knowing if there are transparent
                    hugepages in the regions or not, but preventing the risk of
                    having to split the hugepage during the operation.
      
             The move field is used by the kernel to return the number of
             bytes that was actually moved, or an error (a negated errno-
             style value).  If the value returned in move doesn't match the
             value that was specified in len, the operation fails with the
             error EAGAIN.  The move field is output-only; it is not read by
             the UFFDIO_MOVE operation.
      
             The operation may fail for various reasons.  Usually, remapping
             of pages that are not exclusive to the given process fails; once
             KSM might deduplicate pages or fork() might COW-share pages with
             child processes, they are no longer exclusive.  Further, the
             kernel might only perform lightweight checks for detecting
             whether the pages are exclusive, and return -EBUSY in case that
             check fails.  To make the operation more likely to succeed, KSM
             should be disabled, fork() should be avoided or MADV_DONTFORK
             should be configured for the source VMA before fork().
      
             This ioctl(2) operation returns 0 on success.  In this case, the
             entire area was moved.  On error, -1 is returned and errno is
             set to indicate the error.  Possible errors include:
      
             EAGAIN The number of bytes moved (i.e., the value returned in
                    the move field) does not equal the value that was
                    specified in the len field.
      
             EINVAL Either dst or len was not a multiple of the system page
                    size, or the range specified by src and len or dst and len
                    was invalid.
      
             EINVAL An invalid bit was specified in the mode field.
      
             ENOENT
                    The source virtual memory range has unmapped holes and
                    UFFDIO_MOVE_MODE_ALLOW_SRC_HOLES is not set.
      
             EEXIST
                    The destination virtual memory range is fully or partially
                    mapped.
      
             EBUSY
                    The pages in the source virtual memory range are either
                    pinned or not exclusive to the process. The kernel might
                    only perform lightweight checks for detecting whether the
                    pages are exclusive. To make the operation more likely to
                    succeed, KSM should be disabled, fork() should be avoided
                    or MADV_DONTFORK should be configured for the source virtual
                    memory area before fork().
      
             ENOMEM Allocating memory needed for the operation failed.
      
             ESRCH
                    The target process has exited at the time of a UFFDIO_MOVE
                    operation.
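
      To make the flow concrete, a minimal sketch of issuing the ioctl
      (assumes uffd is a userfaultfd with the destination range registered;
      error handling abbreviated):

          #include <stddef.h>
          #include <sys/ioctl.h>
          #include <linux/userfaultfd.h>
          #include <err.h>

          /* move len bytes from src to dst; returns bytes actually moved */
          static long uffd_move(int uffd, void *dst, void *src, size_t len)
          {
                  struct uffdio_move move = {
                          .dst  = (unsigned long)dst,
                          .src  = (unsigned long)src,
                          .len  = len,
                          .mode = 0,  /* or UFFDIO_MOVE_MODE_ALLOW_SRC_HOLES */
                  };

                  if (ioctl(uffd, UFFDIO_MOVE, &move) == -1)
                          err(1, "UFFDIO_MOVE");
                  return (long)move.move;
          }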
      
      Link: https://lkml.kernel.org/r/20231206103702.3873743-3-surenb@google.com
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Suren Baghdasaryan <surenb@google.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Kalesh Singh <kaleshsingh@google.com>
      Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
      Cc: Lokesh Gidra <lokeshgidra@google.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: Nicolas Geoffray <ngeoffray@google.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Ryan Roberts <ryan.roberts@arm.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: ZhangPeng <zhangpeng362@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/rmap: support move to different root anon_vma in folio_move_anon_rmap() · 880a99b6
      Andrea Arcangeli authored
      Patch series "userfaultfd move option", v6.
      
      This patch series introduces UFFDIO_MOVE feature to userfaultfd, which has
      long been implemented and maintained by Andrea in his local tree [1], but
      was not upstreamed due to lack of use cases where this approach would be
      better than allocating a new page and copying the contents.  Previous
      upstreaming attempts can be found at [6] and [7].
      
      UFFDIO_COPY performs ~20% better than UFFDIO_MOVE when the application
      needs pages to be allocated [2].  However, with UFFDIO_MOVE, if pages are
      available (in userspace) for recycling, as is usually the case in heap
      compaction algorithms, then we can avoid the page allocation and memcpy
      (done by UFFDIO_COPY).  Also, since the pages are recycled in the
      userspace, we avoid the need to release (via madvise) the pages back to
      the kernel [3].  We see over 40% reduction (on a Google pixel 6 device) in
      the compacting thread's completion time by using UFFDIO_MOVE vs. 
      UFFDIO_COPY.  This was measured using a benchmark that emulates a heap
      compaction implementation using userfaultfd (to allow concurrent accesses
      by application threads).  More details of the usecase are explained in
      [3].
      
      Furthermore, UFFDIO_MOVE enables moving swapped-out pages without
      touching them within the same vma. Today, this can only be done by
      mremap; however, that forces splitting the vma.
      
      TODOs for follow-up improvements:
      - cross-mm support. Known differences from single-mm and missing pieces:
      	- memcg recharging (might need to isolate pages in the process)
      	- mm counters
      	- cross-mm deposit table moves
      	- cross-mm test
      	- document the address space where src and dest reside in struct
      	  uffdio_move
      
      - TLB flush batching.  Will require extensive changes to PTL locking in
        move_pages_pte().  OTOH that might let us reuse parts of mremap code.
      
      
      This patch (of 5):
      
      For now, folio_move_anon_rmap() was only used to move a folio to a
      different anon_vma after fork(), whereby the root anon_vma stayed
      unchanged.  For that, it was sufficient to hold the folio lock when
      calling folio_move_anon_rmap().
      
      However, we want to make use of folio_move_anon_rmap() to move folios
      between VMAs that have a different root anon_vma.  As folio_referenced()
      performs an RMAP walk without holding the folio lock but only holding the
      anon_vma in read mode, holding the folio lock is insufficient.
      
      When moving to an anon_vma with a different root anon_vma, we'll have to
      hold both the folio lock and the anon_vma lock in write mode.
      Consequently, whenever folio_lock_anon_vma_read() succeeds in
      read-locking the anon_vma, we have to re-check whether the mapping was
      changed in the meantime.  If that was the case, we have to retry.
      
      Note that folio_move_anon_rmap() must only be called if the anon page is
      exclusive to a process, and must not be called on KSM folios.
      
      This is a preparation for UFFDIO_MOVE, which will hold the folio lock, the
      anon_vma lock in write mode, and the mmap_lock in read mode.
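
      Schematically, the re-check behaves like this (an illustrative sketch,
      not the patch's exact code):

          /* sketch of the retry pattern in folio_lock_anon_vma_read() */
          retry:
                  anon_mapping = (unsigned long)READ_ONCE(folio->mapping);
                  anon_vma = (struct anon_vma *)(anon_mapping - PAGE_MAPPING_ANON);
                  anon_vma_lock_read(anon_vma);
                  /* a concurrent move may have changed the mapping meanwhile */
                  if ((unsigned long)READ_ONCE(folio->mapping) != anon_mapping) {
                          anon_vma_unlock_read(anon_vma);
                          goto retry;
                  }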
      
      Link: https://lkml.kernel.org/r/20231206103702.3873743-1-surenb@google.com
      Link: https://lkml.kernel.org/r/20231206103702.3873743-2-surenb@google.com
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Suren Baghdasaryan <surenb@google.com>
      Acked-by: Peter Xu <peterx@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Kalesh Singh <kaleshsingh@google.com>
      Cc: kernel-team@android.com
      Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
      Cc: Lokesh Gidra <lokeshgidra@google.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: Nicolas Geoffray <ngeoffray@google.com>
      Cc: Ryan Roberts <ryan.roberts@arm.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: ZhangPeng <zhangpeng362@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • buffer: fix more functions for block size > PAGE_SIZE · fa399c31
      Matthew Wilcox (Oracle) authored
      Both __block_write_full_folio() and block_read_full_folio() assumed that
      block size <= PAGE_SIZE.  Replace the shift with a divide, which is
      probably cheaper than first calculating the shift.  That lets us remove
      block_size_bits() as these were the last callers.
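
      The pattern of the fix, roughly (illustrative, not the exact diff):

          /* before: undefined behaviour once blkbits > PAGE_SHIFT */
          iblock = (sector_t)folio->index << (PAGE_SHIFT - blkbits);

          /* after: a divide works for any block size */
          iblock = div_u64(folio_pos(folio), blocksize);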
      
      Link: https://lkml.kernel.org/r/20231109210608.2252323-8-willy@infradead.org
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Hannes Reinecke <hare@suse.de>
      Cc: Luis Chamberlain <mcgrof@kernel.org>
      Cc: Pankaj Raghav <p.raghav@samsung.com>
      Cc: Ryusuke Konishi <konishi.ryusuke@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • buffer: handle large folios in __block_write_begin_int() · b0619401
      Matthew Wilcox (Oracle) authored
      When __block_write_begin_int() was converted to support folios, we did not
      expect large folios to be passed to it.  With the current work to support
      large block size storage devices, this will no longer be true so change
      the checks on 'from' and 'to' to be related to the size of the folio
      instead of PAGE_SIZE.  Also remove an assumption that the block size is
      smaller than PAGE_SIZE.
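
      In sketch form (illustrative), the bounds checks move from the page
      size to the folio size:

          /* before */
          BUG_ON(from > PAGE_SIZE);
          BUG_ON(to > PAGE_SIZE);

          /* after */
          BUG_ON(from > folio_size(folio));
          BUG_ON(to > folio_size(folio));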
      
      Link: https://lkml.kernel.org/r/20231109210608.2252323-7-willy@infradead.org
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Reported-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
      Cc: Hannes Reinecke <hare@suse.de>
      Cc: Luis Chamberlain <mcgrof@kernel.org>
      Cc: Pankaj Raghav <p.raghav@samsung.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • buffer: fix various functions for block size > PAGE_SIZE · 4b04646c
      Matthew Wilcox (Oracle) authored
      If i_blkbits is larger than PAGE_SHIFT, we shift by a negative number,
      which is undefined.  It is safe to shift the block left as a block device
      must be smaller than MAX_LFS_FILESIZE, which is guaranteed to fit in
      loff_t.
      
      Link: https://lkml.kernel.org/r/20231109210608.2252323-6-willy@infradead.org
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Reviewed-by: Pankaj Raghav <p.raghav@samsung.com>
      Cc: Hannes Reinecke <hare@suse.de>
      Cc: Luis Chamberlain <mcgrof@kernel.org>
      Cc: Ryusuke Konishi <konishi.ryusuke@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • buffer: cast block to loff_t before shifting it · 80844194
      Matthew Wilcox (Oracle) authored
      While sector_t is always defined as a u64 today, that hasn't always been
      the case and it might not always be the same size as loff_t in the future.
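
      The defensive pattern, sketched (illustrative only):

          /* widen before shifting so the result cannot be truncated if
           * sector_t were ever narrower than loff_t */
          loff_t pos = (loff_t)block << blkbits;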
      
      Link: https://lkml.kernel.org/r/20231109210608.2252323-5-willy@infradead.org
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Hannes Reinecke <hare@suse.de>
      Cc: Luis Chamberlain <mcgrof@kernel.org>
      Cc: Pankaj Raghav <p.raghav@samsung.com>
      Cc: Ryusuke Konishi <konishi.ryusuke@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • buffer: fix grow_buffers() for block size > PAGE_SIZE · 5f3bd90d
      Matthew Wilcox (Oracle) authored
      We must not shift by a negative number so work in terms of a byte offset
      to avoid the awkward shift left-or-right-depending-on-sign option.  This
      means we need to use check_mul_overflow() to ensure that a large block
      number does not result in a wrap.
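
      A sketch of the overflow-safe calculation (illustrative; see the patch
      for the exact check):

          loff_t pos;

          /* bail out if block * size wraps the byte offset */
          if (check_mul_overflow(block, (sector_t)size, &pos))
                  return false;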
      
      Link: https://lkml.kernel.org/r/20231109210608.2252323-4-willy@infradead.org
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Signed-off-by: Nathan Chancellor <nathan@kernel.org>
      Cc: Hannes Reinecke <hare@suse.de>
      Cc: Luis Chamberlain <mcgrof@kernel.org>
      Cc: Pankaj Raghav <p.raghav@samsung.com>
      Cc: Ryusuke Konishi <konishi.ryusuke@gmail.com>
      [nathan@kernel.org: add cast in grow_buffers() to avoid a multiplication libcall]
        Link: https://lkml.kernel.org/r/20231128-avoid-muloti4-grow_buffers-v1-1-bc3d0f0ec483@kernel.org
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • buffer: calculate block number inside folio_init_buffers() · 382497ad
      Matthew Wilcox (Oracle) authored
      The calculation of block from index doesn't work for devices with a block
      size larger than PAGE_SIZE as we end up shifting by a negative number. 
      Instead, calculate the number of the first block from the folio's position
      in the block device.  We no longer need to pass sizebits to
      grow_dev_folio().
      
      Link: https://lkml.kernel.org/r/20231109210608.2252323-3-willy@infradead.org
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Reviewed-by: Pankaj Raghav <p.raghav@samsung.com>
      Cc: Hannes Reinecke <hare@suse.de>
      Cc: Luis Chamberlain <mcgrof@kernel.org>
      Cc: Ryusuke Konishi <konishi.ryusuke@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • buffer: return bool from grow_dev_folio() · 6d840a18
      Matthew Wilcox (Oracle) authored
      Patch series "More buffer_head cleanups", v2.
      
      The first patch is a left-over from last cycle.  The rest fix "obvious"
      block size > PAGE_SIZE problems.  I haven't tested with a large block size
      setup (but I have done an ext4 xfstests run).
      
      
      This patch (of 7):
      
      Rename grow_dev_page() to grow_dev_folio() and make it return a bool.
      Document what that bool means; it's more subtle than it first appears.
      Also rename the 'failed' label to 'unlock' because it's not exactly
      'failed'.  It just hasn't succeeded.
      
      Link: https://lkml.kernel.org/r/20231109210608.2252323-2-willy@infradead.org
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Hannes Reinecke <hare@suse.de>
      Cc: Luis Chamberlain <mcgrof@kernel.org>
      Cc: Pankaj Raghav <p.raghav@samsung.com>
      Cc: Ryusuke Konishi <konishi.ryusuke@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  2. 20 Dec, 2023 17 commits