1. 02 Sep, 2024 40 commits
    • mm: fix (harmless) type confusion in lock_vma_under_rcu() · 17fe833b
      Jann Horn authored
      There is a (harmless) type confusion in lock_vma_under_rcu(): After
      vma_start_read(), we have taken the VMA lock but don't know yet whether
      the VMA has already been detached and scheduled for RCU freeing.  At this
      point, ->vm_start and ->vm_end are accessed.
      
      vm_area_struct contains a union such that ->vm_rcu uses the same memory as
      ->vm_start and ->vm_end; so accessing ->vm_start and ->vm_end of a
      detached VMA is illegal and leads to type confusion between union members.
      
      Fix it by reordering the vma->detached check above the address checks, and
      document the rules for RCU readers accessing VMAs.
      
      This will probably change the number of observed VMA_LOCK_MISS events
      (since previously, trying to access a detached VMA whose ->vm_rcu has been
      scheduled would bail out when checking the fault address against the
      rcu_head members reinterpreted as VMA bounds).
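
      As a rough sketch, the resulting order of checks looks like this
      (surrounding code trimmed; the iterator and the unlock/count helpers
      shown here are assumptions rather than quotes from the patch):

          vma = mas_walk(&mas);
          if (!vma || !vma_start_read(vma))
                  goto inval;

          /*
           * A detached VMA may already have ->vm_rcu live in the union, so
           * check ->detached before interpreting ->vm_start/->vm_end.
           */
          if (unlikely(vma->detached)) {
                  vma_end_read(vma);
                  count_vm_vma_lock_event(VMA_LOCK_MISS);
                  goto retry;
          }

          /* Only now is it safe to look at the VMA bounds. */
          if (unlikely(address < vma->vm_start || address >= vma->vm_end))
                  goto inval_end_read;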
      
      Link: https://lkml.kernel.org/r/20240805-fix-vma-lock-type-confusion-v1-1-9f25443a9a71@google.com
      Fixes: 50ee3253 ("mm: introduce lock_vma_under_rcu to be used from arch-specific code")
      Signed-off-by: Jann Horn <jannh@google.com>
      Acked-by: Suren Baghdasaryan <surenb@google.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • zswap: track swapins from disk more accurately · 0e400844
      Nhat Pham authored
      Currently, there are a couple of issues with our disk swapin tracking for
      dynamic zswap shrinker heuristics:
      
      1. We only increment the swapin counter on pivot pages. This means we
         are not taking into account pages that also need to be swapped in,
         but are already taken care of as part of the readahead window.
      
      2. We are also incrementing when the pages are read from the zswap pool,
         which is inaccurate.
      
      This patch rectifies these issues by incrementing the counter whenever we
      need to perform a non-zswap read.  Note that we are slightly overcounting,
      as a page might be read into memory by the readahead algorithm even though
       it will not be needed by users - however, this is an acceptable
       inaccuracy, as the readahead logic itself will adapt to these kinds of
      scenarios.
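
      As a rough sketch of the idea on the swap read path (simplified; the
      exact hook point and helper placement are paraphrased from the
      description above, not quoted from the patch):

          /* swap_read_folio(), simplified */
          if (zswap_load(folio)) {
                  /* Served from the zswap pool: not a disk swapin. */
                  folio_unlock(folio);
                  return;
          }

          /*
           * zswap miss: this read goes to the backing device.  Count it here
           * so every disk read is seen, including non-pivot readahead pages.
           */
          zswap_folio_swapin(folio);

          /* ... fall through to the bdev/fs read as before ... */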
      
      To test this change, I built the kernel under a cgroup with its memory.max
      set to 2 GB:
      
      real: 236.66s
      user: 4286.06s
      sys: 652.86s
      swapins: 81552
      
      For comparison, with just the new second chance algorithm, the build time
      is as follows:
      
      real: 244.85s
      user: 4327.22s
      sys: 664.39s
      swapins: 94663
      
       Without either change:
      
      real: 263.89s
      user: 4318.11s
      sys: 673.29s
      swapins: 227300.5
      
      (average over 5 runs)
      
      With this change, the kernel CPU time reduces by a further 1.7%, and the
      real time is reduced by another 3.3%, compared to just the second chance
      algorithm by itself.  The swapins count also reduces by another 13.85%.
      
       Combining the two changes, we reduce the real time by 10.32%, kernel CPU
      time by 3%, and number of swapins by 64.12%.
      
      To gauge the new scheme's ability to offload cold data, I ran another
      benchmark, in which the kernel was built under a cgroup with memory.max
      set to 3 GB, but with 0.5 GB worth of cold data allocated before each
      build (in a shmem file).
      
      Under the old scheme:
      
      real: 197.18s
      user: 4365.08s
      sys: 289.02s
      zswpwb: 72115.2
      
      Under the new scheme:
      
      real: 195.8s
      user: 4362.25s
      sys: 290.14s
      zswpwb: 87277.8
      
      (average over 5 runs)
      
      Notice that we actually observe a 21% increase in the number of written
      back pages - so the new scheme is just as good, if not better at
      offloading pages from the zswap pool when they are cold.  Build time
      reduces by around 0.7% as a result.
      
      [nphamcs@gmail.com: squeeze a comment into a single line]
        Link: https://lkml.kernel.org/r/20240806004518.3183562-1-nphamcs@gmail.com
      Link: https://lkml.kernel.org/r/20240805232243.2896283-3-nphamcs@gmail.com
      Fixes: b5ba474f ("zswap: shrink zswap pool based on memory pressure")
      Signed-off-by: Nhat Pham <nphamcs@gmail.com>
      Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Yosry Ahmed <yosryahmed@google.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Chengming Zhou <chengming.zhou@linux.dev>
      Cc: Shakeel Butt <shakeel.butt@linux.dev>
      Cc: Takero Funaki <flintglass@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • zswap: implement a second chance algorithm for dynamic zswap shrinker · e31c38e0
      Nhat Pham authored
      Patch series "improving dynamic zswap shrinker protection scheme", v3.
      
      When experimenting with the memory-pressure based (i.e "dynamic") zswap
      shrinker in production, we observed a sharp increase in the number of
      swapins, which led to performance regression.  We were able to trace this
      regression to the following problems with the shrinker's warm pages
      protection scheme: 
      
      1. The protection decays way too rapidly, and the decaying is coupled with
         zswap stores, leading to anomalous patterns, in which a small batch of
         zswap stores effectively erase all the protection in place for the
         warmer pages in the zswap LRU.
      
         This observation has also been corroborated upstream by Takero Funaki
         (in [1]).
      
      2. We inaccurately track the number of swapped in pages, missing the
         non-pivot pages that are part of the readahead window, while counting
         the pages that are found in the zswap pool.
      
      
       To alleviate these two issues, this patch series improves the dynamic zswap
      shrinker in the following manner:
      
      1. Replace the protection size tracking scheme with a second chance
         algorithm. This new scheme removes the need for haphazard stats
         decaying, and automatically adjusts the pace of pages aging with memory
         pressure, and writeback rate with pool activities: slowing down when
         the pool is dominated with zswpouts, and speeding up when the pool is
         dominated with stale entries.
      
      2. Fix the tracking of the number of swapins to take into account
         non-pivot pages in the readahead window.
      
      With these two changes in place, in a kernel-building benchmark without
      any cold data added, the number of swapins is reduced by 64.12%.  This
       translates to a 10.32% reduction in build time.  We also observe a 3%
      reduction in kernel CPU time.
      
      In another benchmark, with cold data added (to gauge the new algorithm's
      ability to offload cold data), the new second chance scheme outperforms
       the old protection scheme by around 0.7%, and actually writes back around
      21% more pages to backing swap device.  So the new scheme is just as good,
      if not even better than the old scheme on this front as well.
      
      [1]: https://lore.kernel.org/linux-mm/CAPpodddcGsK=0Xczfuk8usgZ47xeyf4ZjiofdT+ujiyz6V2pFQ@mail.gmail.com/
      
      
      This patch (of 2):
      
       The current zswap shrinker heuristic to prevent overshrinking is brittle
       and inaccurate, specifically in the way we decay the protection size (i.e.
      making pages in the zswap LRU eligible for reclaim).
      
      We currently decay protection aggressively in zswap_lru_add() calls.  This
      leads to the following unfortunate effect: when a new batch of pages enter
      zswap, the protection size rapidly decays to below 25% of the zswap LRU
      size, which is way too low.
      
      We have observed this effect in production, when experimenting with the
      zswap shrinker: the rate of shrinking shoots up massively right after a
      new batch of zswap stores.  This is somewhat the opposite of what we want
      originally - when new pages enter zswap, we want to protect both these new
      pages AND the pages that are already protected in the zswap LRU.
      
       Replace the existing heuristics with a second chance algorithm:
      
      1. When a new zswap entry is stored in the zswap pool, its referenced
         bit is set.
      2. When the zswap shrinker encounters a zswap entry with the referenced
          bit set, give it a second chance - only flip the referenced bit and
         rotate it in the LRU.
      3. If the shrinker encounters the entry again, this time with its
         referenced bit unset, then it can reclaim the entry.
      
       In this manner, the aging of the pages in the zswap LRUs is decoupled
      from zswap stores, and picks up the pace with increasing memory pressure
      (which is what we want).
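
      Condensed, the shrinker's LRU walk callback behaves as sketched below
      (illustrative only; locking, stats and error handling omitted):

          if (entry->referenced) {
                  /* Second chance: age the entry instead of reclaiming it. */
                  entry->referenced = false;
                  return LRU_ROTATE;
          }

          /* Seen twice with no intervening store: write it back to swap. */
          if (zswap_writeback_entry(entry, swpentry) == 0)
                  return LRU_REMOVED;
          return LRU_RETRY;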
      
      The second chance scheme allows us to modulate the writeback rate based on
      recent pool activities.  Entries that recently entered the pool will be
      protected, so if the pool is dominated by such entries the writeback rate
       will reduce proportionally, protecting the workload's workingset.  On the
      other hand, stale entries will be written back quickly, which increases
      the effective writeback rate.
      
      The referenced bit is added at the hole after the `length` field of struct
      zswap_entry, so there is no extra space overhead for this algorithm.
      
      We will still maintain the count of swapins, which is consumed and
      subtracted from the lru size in zswap_shrinker_count(), to further
      penalize past overshrinking that led to disk swapins.  The idea is that
      had we considered this many more pages in the LRU active/protected, they
       would not have been written back and we would not have had to swap them
      in.
      
       To test this new heuristic, I built the kernel under a cgroup with
      memory.max set to 2G, on a host with 36 cores:
      
      With the old shrinker:
      
      real: 263.89s
      user: 4318.11s
      sys: 673.29s
      swapins: 227300.5
      
      With the second chance algorithm:
      
      real: 244.85s
      user: 4327.22s
      sys: 664.39s
      swapins: 94663
      
      (average over 5 runs)
      
       We observe a 1.3% reduction in kernel CPU usage, and around 7.2%
      reduction in real time. Note that the number of swapped in pages
      dropped by 58%.
      
      [nphamcs@gmail.com: fix a small mistake in the referenced bit documentation]
        Link: https://lkml.kernel.org/r/20240806003403.3142387-1-nphamcs@gmail.com
      Link: https://lkml.kernel.org/r/20240805232243.2896283-1-nphamcs@gmail.com
      Link: https://lkml.kernel.org/r/20240805232243.2896283-2-nphamcs@gmail.com
      Signed-off-by: Nhat Pham <nphamcs@gmail.com>
      Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Yosry Ahmed <yosryahmed@google.com>
      Cc: Chengming Zhou <chengming.zhou@linux.dev>
      Cc: Shakeel Butt <shakeel.butt@linux.dev>
      Cc: Takero Funaki <flintglass@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: only enforce minimum stack gap size if it's sensible · 69b50d43
      David Gow authored
      The generic mmap_base code tries to leave a gap between the top of the
      stack and the mmap base address, but enforces a minimum gap size (MIN_GAP)
      of 128MB, which is too large on some setups.  In particular, on arm tasks
      without ADDR_LIMIT_32BIT, the STACK_TOP value is less than 128MB, so it's
      impossible to fit such a gap in.
      
      Only enforce this minimum if MIN_GAP < MAX_GAP, as we'd prefer to honour
      MAX_GAP, which is defined proportionally, so scales better and always
      leaves us with both _some_ stack space and some room for mmap.
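
      In the generic mmap_base() helper this amounts to guarding the MIN_GAP
      clamp, roughly as sketched below (not the verbatim diff):

          unsigned long gap = rlim_stack->rlim_cur;

          /* MIN_GAP is a fixed 128MB; MAX_GAP scales with STACK_TOP. */
          if (gap + stack_guard_gap > gap)
                  gap += stack_guard_gap;

          if (gap < MIN_GAP && MIN_GAP < MAX_GAP)
                  gap = MIN_GAP;
          else if (gap > MAX_GAP)
                  gap = MAX_GAP;

          return PAGE_ALIGN(STACK_TOP - gap - rnd);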
      
      This fixes the usercopy KUnit test suite on 32-bit arm, as it doesn't set
      any personality flags so gets the default (in this case 26-bit) task size.
      This test can be run with: ./tools/testing/kunit/kunit.py run --arch arm
      usercopy --make_options LLVM=1
      
      Link: https://lkml.kernel.org/r/20240803074642.1849623-2-davidgow@google.com
      Fixes: dba79c3d ("arm: use generic mmap top-down layout and brk randomization")
      Signed-off-by: David Gow <davidgow@google.com>
      Reviewed-by: Kees Cook <kees@kernel.org>
      Cc: Alexandre Ghiti <alex@ghiti.fr>
      Cc: Linus Walleij <linus.walleij@linaro.org>
      Cc: Luis Chamberlain <mcgrof@kernel.org>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: remove duplicated include in vma_internal.h · a06e79d3
      Yang Li authored
       The header file linux/bug.h is included twice in vma_internal.h, so the
       duplicate inclusion can be removed.
      
      Link: https://lkml.kernel.org/r/20240802060216.24591-1-yang.lee@linux.alibaba.com
      Signed-off-by: Yang Li <yang.lee@linux.alibaba.com>
      Reported-by: Abaci Robot <abaci@linux.alibaba.com>
      Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=9636
      Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/ksm: convert break_ksm() from walk_page_range_vma() to folio_walk · e317a8d8
      David Hildenbrand authored
      Let's simplify by reusing folio_walk.  Keep the existing behavior by
      handling migration entries and zeropages.
      
      Link: https://lkml.kernel.org/r/20240802155524.517137-12-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: Claudio Imbrenda <imbrenda@linux.ibm.com>
      Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Janosch Frank <frankja@linux.ibm.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Ryan Roberts <ryan.roberts@arm.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: remove follow_page() · 7290840d
      David Hildenbrand authored
      All users are gone, let's remove it and any leftovers in comments.  We'll
      leave any FOLL/follow_page_() naming cleanups as future work.
      
      Link: https://lkml.kernel.org/r/20240802155524.517137-11-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: Claudio Imbrenda <imbrenda@linux.ibm.com>
      Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Janosch Frank <frankja@linux.ibm.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Ryan Roberts <ryan.roberts@arm.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • s390/mm/fault: convert do_secure_storage_access() from follow_page() to folio_walk · 0b31a3ce
      David Hildenbrand authored
      Let's get rid of another follow_page() user and perform the conversion
       under PTL.  Note that this is also what follow_page_pte() ends up doing.
      
      Unfortunately we cannot currently optimize out the additional reference,
      because arch_make_folio_accessible() must be called with a raised refcount
      to protect against concurrent conversion to secure.  We can just move the
      arch_make_folio_accessible() under the PTL, like follow_page_pte() would.
      
      We'll effectively drop the "writable" check implied by FOLL_WRITE:
      follow_page_pte() would also not check that when calling
      arch_make_folio_accessible(), so there is no good reason for doing that
      here.
      
      We'll lose the secretmem check from follow_page() as well, about which we
      shouldn't really care.
      
      Link: https://lkml.kernel.org/r/20240802155524.517137-10-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Janosch Frank <frankja@linux.ibm.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Ryan Roberts <ryan.roberts@arm.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • s390/uv: convert gmap_destroy_page() from follow_page() to folio_walk · 85a7e543
      David Hildenbrand authored
      Let's get rid of another follow_page() user and perform the UV calls under
      PTL -- which likely should be fine.
      
      No need for an additional reference while holding the PTL:
      uv_destroy_folio() and uv_convert_from_secure_folio() raise the refcount,
       so any concurrent make_folio_secure() would see an unexpected reference and
      cannot set PG_arch_1 concurrently.
      
      Do we really need a writable PTE?  Likely yes, because the "destroy" part
      is, in comparison to the export, a destructive operation.  So we'll keep
      the writability check for now.
      
      We'll lose the secretmem check from follow_page().  Likely we don't care
      about that here.
      
      Link: https://lkml.kernel.org/r/20240802155524.517137-9-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Janosch Frank <frankja@linux.ibm.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Ryan Roberts <ryan.roberts@arm.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/huge_memory: convert split_huge_pages_pid() from follow_page() to folio_walk · 8710f6ed
      David Hildenbrand authored
      Let's remove yet another follow_page() user.  Note that we have to do the
      split without holding the PTL, after folio_walk_end().  We don't care
      about losing the secretmem check in follow_page().
      
      [david@redhat.com: teach can_split_folio() that we are not holding an additional reference]
        Link: https://lkml.kernel.org/r/c75d1c6c-8ea6-424f-853c-1ccda6c77ba2@redhat.com
      Link: https://lkml.kernel.org/r/20240802155524.517137-8-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Zi Yan <ziy@nvidia.com>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: Claudio Imbrenda <imbrenda@linux.ibm.com>
      Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Janosch Frank <frankja@linux.ibm.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Ryan Roberts <ryan.roberts@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/ksm: convert scan_get_next_rmap_item() from follow_page() to folio_walk · b1d3e9bb
      David Hildenbrand authored
      Let's use folio_walk instead, for example avoiding taking temporary folio
       references if the folio obviously does not even apply and getting rid of
      one more follow_page() user.  We cannot move all handling under the PTL,
      so leave the rmap handling (which implies an allocation) out.
      
      Note that zeropages obviously don't apply: old code could just have
      specified FOLL_DUMP.  Further, we don't care about losing the secretmem
      check in follow_page(): these are never anon pages and
      vma_ksm_compatible() would never consider secretmem vmas (VM_SHARED |
      VM_MAYSHARE must be set for secretmem, see secretmem_mmap()).
      
      Link: https://lkml.kernel.org/r/20240802155524.517137-7-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: Claudio Imbrenda <imbrenda@linux.ibm.com>
      Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Janosch Frank <frankja@linux.ibm.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Ryan Roberts <ryan.roberts@arm.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/ksm: convert get_mergeable_page() from follow_page() to folio_walk · 184e916c
      David Hildenbrand authored
      Let's use folio_walk instead, for example avoiding taking temporary folio
      references if the folio does not even apply and getting rid of one more
      follow_page() user.
      
      Note that zeropages obviously don't apply: old code could just have
      specified FOLL_DUMP.  Anon folios are never secretmem, so we don't care
      about losing the check in follow_page().
      
      Link: https://lkml.kernel.org/r/20240802155524.517137-6-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: Claudio Imbrenda <imbrenda@linux.ibm.com>
      Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Janosch Frank <frankja@linux.ibm.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Ryan Roberts <ryan.roberts@arm.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/migrate: convert add_page_for_migration() from follow_page() to folio_walk · 7dff875c
      David Hildenbrand authored
      Let's use folio_walk instead, so we can avoid taking a folio reference
      when we won't even be trying to migrate the folio and to get rid of
      another follow_page()/FOLL_DUMP user.  Use FW_ZEROPAGE so we can return
      "-EFAULT" for it as documented.
      
      We now perform the folio_likely_mapped_shared() check under PTL, which is
      what we want: relying on the mapcount and friends after dropping the PTL
      does not make too much sense, as the page can get unmapped concurrently
      from this process.
      
      Further, we perform the folio isolation under PTL, similar to how we
      handle it for MADV_PAGEOUT.
      
      The possible return values for follow_page() were confusing, especially
      with FOLL_DUMP set. We'll handle it like documented in the man page:
       * -EFAULT: This is a zero page or the memory area is not mapped by the
          process.
       * -ENOENT: The page is not present.
      
      We'll keep setting -ENOENT for ZONE_DEVICE.  Maybe not the right thing to
      do, but it likely doesn't really matter (just like for weird devmap,
      whereby we fake "not present").
      
       The other errors are left as is, and match the documentation in the man
      page.
      
      While at it, rename add_page_for_migration() to add_folio_for_migration().
      
      We'll lose the "secretmem" check, but that shouldn't really matter because
      these folios cannot ever be migrated.  Should vma_migratable() refuse
      these VMAs?  Maybe.
      
      Link: https://lkml.kernel.org/r/20240802155524.517137-5-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: Claudio Imbrenda <imbrenda@linux.ibm.com>
      Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Janosch Frank <frankja@linux.ibm.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Ryan Roberts <ryan.roberts@arm.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/migrate: convert do_pages_stat_array() from follow_page() to folio_walk · 46d6a9b4
      David Hildenbrand authored
      Let's use folio_walk instead, so we can avoid taking a folio reference
      just to read the nid and get rid of another follow_page()/FOLL_DUMP user. 
      Use FW_ZEROPAGE so we can return "-EFAULT" for it as documented.
      
      The possible return values for follow_page() were confusing, especially
      with FOLL_DUMP set.  We'll handle it like documented in the man page:
      
      * -EFAULT: This is a zero page or the memory area is not mapped by the
         process.
      * -ENOENT: The page is not present.
      
      We'll keep setting -ENOENT for ZONE_DEVICE.  Maybe not the right thing to
      do, but it likely doesn't really matter (just like for weird devmap,
      whereby we fake "not present").
      
       Note that the other errors (-EACCES, -EBUSY, -EIO, -EINVAL, -ENOMEM) so
      far only applied when actually moving pages, not when only querying stats.
      
      We'll effectively drop the "secretmem" check we had in follow_page(), but
      that shouldn't really matter here, we're not accessing folio/page content
      after all.
      
      Link: https://lkml.kernel.org/r/20240802155524.517137-4-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: Claudio Imbrenda <imbrenda@linux.ibm.com>
      Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Janosch Frank <frankja@linux.ibm.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Ryan Roberts <ryan.roberts@arm.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/pagewalk: introduce folio_walk_start() + folio_walk_end() · aa39ca69
      David Hildenbrand authored
      We want to get rid of follow_page(), and have a more reasonable way to
      just lookup a folio mapped at a certain address, perform some checks while
      still under PTL, and then only conditionally grab a folio reference if
      really required.
      
      Further, we might want to get rid of some walk_page_range*() users that
      really only want to temporarily lookup a single folio at a single address.
      
      So let's add a new page table walker that does exactly that, similarly to
      GUP also being able to walk hugetlb VMAs.
      
      Add folio_walk_end() as a macro for now: the compiler is not easy to
      please with the pte_unmap()->kunmap_local().
      
      Note that one difference between follow_page() and get_user_pages(1) is
      that follow_page() will not trigger faults to get something mapped.  So
      folio_walk is at least currently not a replacement for get_user_pages(1),
      but could likely be extended/reused to achieve something similar in the
      future.
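
      Typical usage then looks roughly like this (a sketch based on the
      description above; the flags value and the hypothetical do_something()
      helper are illustrative):

          struct folio_walk fw;
          struct folio *folio;

          folio = folio_walk_start(&fw, vma, addr, 0);
          if (folio) {
                  /* The PTL is held here, so the folio cannot get unmapped. */
                  if (folio_test_large(folio))
                          do_something(folio);
                  /* Take folio_get(folio) first if it is needed after unlock. */
                  folio_walk_end(&fw, vma);
          }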
      
      Link: https://lkml.kernel.org/r/20240802155524.517137-3-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: Claudio Imbrenda <imbrenda@linux.ibm.com>
      Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Janosch Frank <frankja@linux.ibm.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Ryan Roberts <ryan.roberts@arm.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: provide vm_normal_(page|folio)_pmd() with CONFIG_PGTABLE_HAS_HUGE_LEAVES · 3523a37e
      David Hildenbrand authored
      Patch series "mm: replace follow_page() by folio_walk".
      
      Looking into a way of moving the last folio_likely_mapped_shared() call in
      add_folio_for_migration() under the PTL, I found myself removing
      follow_page().  This paves the way for cleaning up all the FOLL_, follow_*
      terminology to just be called "GUP" nowadays.
      
      The new page table walker will lookup a mapped folio and return to the
      caller with the PTL held, such that the folio cannot get unmapped
      concurrently.  Callers can then conditionally decide whether they really
       want to take a short-term folio reference or whether they can simply unlock
      the PTL and be done with it.
      
      folio_walk is similar to page_vma_mapped_walk(), except that we don't know
      the folio we want to walk to and that we are only walking to exactly one
      PTE/PMD/PUD.
      
      folio_walk provides access to the pte/pmd/pud (and the referenced folio
      page because things like KSM need that), however, as part of this series
      no page table modifications are performed by users.
      
      We might be able to convert some other walk_page_range() users that really
      only walk to one address, such as DAMON with
      damon_mkold_ops/damon_young_ops.  It might make sense to extend folio_walk
      in the future to optionally fault in a folio (if applicable), such that we
      can replace some get_user_pages() users that really only want to lookup a
      single page/folio under PTL without unconditionally grabbing a folio
      reference.
      
      I have plans to extend the approach to a range walker that will try
      batching various page table entries (not just folio pages) to be a better
       replacement for walk_page_range() -- and users will be able to opt in which
      type of page table entries they want to process -- but that will require
      more work and more thoughts.
      
      KSM seems to work just fine (ksm_functional_tests selftests) and
      move_pages seems to work (migration selftest).  I tested the leaf
       implementation extensively using various hugetlb sizes (64K, 2M, 32M, 1G)
      on arm64 using move_pages and did some more testing on x86-64.  Cross
      compiled on a bunch of architectures.
      
      
      This patch (of 11):
      
      We want to make use of vm_normal_page_pmd() in generic page table walking
      code where we might walk hugetlb folios that are mapped by PMDs even
      without CONFIG_TRANSPARENT_HUGEPAGE.
      
      So let's expose vm_normal_page_pmd() + vm_normal_folio_pmd() with
      CONFIG_PGTABLE_HAS_HUGE_LEAVES.
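
      In essence (sketch of the guard change around the two helpers):

          -#ifdef CONFIG_TRANSPARENT_HUGEPAGE
          +#ifdef CONFIG_PGTABLE_HAS_HUGE_LEAVES
           struct page *vm_normal_page_pmd(struct vm_area_struct *vma,
                                           unsigned long addr, pmd_t pmd)
           ...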
      
      Link: https://lkml.kernel.org/r/20240802155524.517137-1-david@redhat.com
      Link: https://lkml.kernel.org/r/20240802155524.517137-2-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: Claudio Imbrenda <imbrenda@linux.ibm.com>
      Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Janosch Frank <frankja@linux.ibm.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Ryan Roberts <ryan.roberts@arm.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • include/linux/mmzone.h: clean up watermark accessors · 620943d7
      Andrew Morton authored
      - we have a helper wmark_pages().  Teach min_wmark_pages(),
        low_wmark_pages(), high_wmark_pages() and promo_wmark_pages() to use
        it instead of open-coding its implementation.
      
      - there's no reason to implement all these things as macros.  Redo them
        in C.
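
      A sketch of the resulting helpers (illustrative, not the exact hunk):

          static inline unsigned long wmark_pages(const struct zone *z,
                                                  enum zone_watermarks w)
          {
                  return z->_watermark[w] + z->watermark_boost;
          }

          static inline unsigned long min_wmark_pages(const struct zone *z)
          {
                  return wmark_pages(z, WMARK_MIN);
          }

          static inline unsigned long low_wmark_pages(const struct zone *z)
          {
                  return wmark_pages(z, WMARK_LOW);
          }
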
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kaiyang Zhao <kaiyang2@cs.cmu.edu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: print the promo watermark in zoneinfo · 528afe6b
      Kaiyang Zhao authored
      Print the promo watermark in zoneinfo just like other watermarks.  This
      helps users check and verify all the watermarks are appropriate.
      
      Link: https://lkml.kernel.org/r/20240801232548.36604-3-kaiyang2@cs.cmu.edu
      Signed-off-by: Kaiyang Zhao <kaiyang2@cs.cmu.edu>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: create promo_wmark_pages and clean up open-coded sites · 03790c51
      Kaiyang Zhao authored
      Patch series "mm: print the promo watermark in zoneinfo", v2.
      
      
      This patch (of 2):
      
      Define promo_wmark_pages and convert current call sites of wmark_pages
      with fixed WMARK_PROMO to using it instead.
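
      The new accessor simply wraps the existing wmark_pages() helper, so
      call sites read promo_wmark_pages(zone) instead of
      wmark_pages(zone, WMARK_PROMO) (sketch):

          static inline unsigned long promo_wmark_pages(const struct zone *z)
          {
                  return wmark_pages(z, WMARK_PROMO);
          }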
      
      Link: https://lkml.kernel.org/r/20240801232548.36604-1-kaiyang2@cs.cmu.edu
      Link: https://lkml.kernel.org/r/20240801232548.36604-2-kaiyang2@cs.cmu.edu
      Signed-off-by: Kaiyang Zhao <kaiyang2@cs.cmu.edu>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: consider CMA pages in watermark check for NUMA balancing target node · 6d192303
      Kaiyang Zhao authored
      Currently in migrate_balanced_pgdat(), ALLOC_CMA flag is not passed when
      checking watermark on the migration target node.  This does not match the
      gfp in alloc_misplaced_dst_folio() which allows allocation from CMA.
      
       This causes promotion failures when there is a lot of available CMA
      memory in the system.
      
      Therefore, we change the alloc_flags passed to zone_watermark_ok() in
      migrate_balanced_pgdat().
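
      Concretely, the watermark check now passes ALLOC_CMA as alloc_flags,
      roughly as sketched below (the surrounding zone loop is omitted):

          if (!zone_watermark_ok(zone, 0,
                                 high_wmark_pages(zone) + nr_migrate_pages,
                                 ZONE_MOVABLE, ALLOC_CMA))
                  continue;
          return true;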
      
      Link: https://lkml.kernel.org/r/20240801180456.25927-1-kaiyang2@cs.cmu.edu
      Signed-off-by: Kaiyang Zhao <kaiyang2@cs.cmu.edu>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: zswap: fix global shrinker error handling logic · 81920438
      Takero Funaki authored
      This patch fixes the zswap global shrinker, which did not shrink the zpool
      as expected.
      
      The issue addressed is that shrink_worker() did not distinguish between
      unexpected errors and expected errors, such as failed writeback from an
      empty memcg.  The shrinker would stop shrinking after iterating through
      the memcg tree 16 times, even if there was only one empty memcg.
      
      With this patch, the shrinker no longer considers encountering an empty
      memcg, encountering a memcg with writeback disabled, or reaching the end
      of a memcg tree walk as a failure, as long as there are memcgs that are
      candidates for writeback.  Systems with one or more empty memcgs will now
      observe significantly higher zswap writeback activity after the zswap pool
      limit is hit.
      
      To avoid an infinite loop when there are no writeback candidates, this
       patch tracks writeback attempts during memcg tree walks and limits retries
      if no writeback candidates are found.
      
      To handle the empty memcg case, the helper function shrink_memcg() is
      modified to check if the memcg is empty and then return -ENOENT.
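
      In outline, the worker loop becomes something like the sketch below
      (illustrative only; zswap_next_shrink_memcg() stands in for the real
      iteration over the memcg tree):

          int failures = 0, attempts = 0;

          while (zswap_total_pages() > threshold) {
                  memcg = zswap_next_shrink_memcg();      /* stand-in */
                  if (!memcg) {
                          /* Completed one full tree walk. */
                          if (!attempts)
                                  break;  /* no writeback candidates: stop */
                          attempts = 0;
                          continue;
                  }

                  ret = shrink_memcg(memcg);
                  if (ret == -ENOENT)
                          continue;       /* empty memcg or writeback disabled */

                  attempts++;
                  if (ret && ++failures >= MAX_RECLAIM_RETRIES)
                          break;
          }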
      
      Link: https://lkml.kernel.org/r/20240731004918.33182-3-flintglass@gmail.com
      Fixes: a65b0e76 ("zswap: make shrinking memcg-aware")
      Signed-off-by: Takero Funaki <flintglass@gmail.com>
      Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev>
      Reviewed-by: Nhat Pham <nphamcs@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Yosry Ahmed <yosryahmed@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: zswap: fix global shrinker memcg iteration · c5519e0a
      Takero Funaki authored
      Patch series "mm: zswap: fixes for global shrinker", v5.
      
      This series addresses issues in the zswap global shrinker that could not
      shrink stored pages.  With this series, the shrinker continues to shrink
       pages until it reaches the accept threshold more reliably, and gives much
      higher writeback when the zswap pool limit is hit.
      
      
      This patch (of 2):
      
      This patch fixes an issue where the zswap global shrinker stopped
      iterating through the memcg tree.
      
       The problem was that shrink_worker() would restart iterating the memcg tree
      from the tree root, considering an offline memcg as a failure, and abort
      shrinking after encountering the same offline memcg 16 times even if there
      is only one offline memcg.  After this change, an offline memcg in the
      tree is no longer considered a failure.  This allows the shrinker to
      continue shrinking the other online memcgs regardless of whether an
       offline memcg exists, giving higher zswap writeback activity.
      
      To avoid holding refcount of offline memcg encountered during the memcg
      tree walking, shrink_worker() must continue iterating to release the
      offline memcg to ensure the next memcg stored in the cursor is online.
      
      The offline memcg cleaner has also been changed to avoid the same issue. 
      When the next memcg of the offlined memcg is also offline, the refcount
      stored in the iteration cursor was held until the next shrink_worker()
      run.  The cleaner must release the offline memcg recursively.
      
      [yosryahmed@google.com: make critical section more obvious, unify comments]
        Link: https://lkml.kernel.org/r/CAJD7tkaScz+SbB90Q1d5mMD70UfM2a-J2zhXDT9sePR7Qap45Q@mail.gmail.com
      Link: https://lkml.kernel.org/r/20240731004918.33182-1-flintglass@gmail.com
      Link: https://lkml.kernel.org/r/20240731004918.33182-2-flintglass@gmail.com
      Fixes: a65b0e76 ("zswap: make shrinking memcg-aware")
      Signed-off-by: Takero Funaki <flintglass@gmail.com>
      Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
      Acked-by: Yosry Ahmed <yosryahmed@google.com>
      Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev>
      Reviewed-by: Nhat Pham <nphamcs@gmail.com>
      Cc: Chengming Zhou <chengming.zhou@linux.dev>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: swap: allocate folio only first time in __read_swap_cache_async() · 1d344030
      Zhaoyu Liu authored
       When reading a shared swap page, filemap_get_folio() is used to check
       whether the page is already in the swap cache.  If SWAP_HAS_CACHE has
       been marked but the swap cache is not ready yet, the loop would
       re-allocate a folio on every iteration.  Save the newly allocated folio
       so we avoid allocating a page again.
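
      A simplified sketch of the reworked loop (error paths trimmed;
      alloc_swap_folio_for_entry() and idx stand in for the real allocation
      and swap cache index):

          struct folio *new_folio = NULL, *folio;

          for (;;) {
                  /* Already in the swap cache? */
                  folio = filemap_get_folio(swap_address_space(entry), idx);
                  if (!IS_ERR(folio))
                          return folio;

                  /* Allocate only on the first pass, reuse on later ones. */
                  if (!new_folio)
                          new_folio = alloc_swap_folio_for_entry();

                  /* SWAP_HAS_CACHE claimed by someone else: wait and retry. */
                  if (swapcache_prepare(entry, 1) == -EEXIST) {
                          schedule_timeout_uninterruptible(1);
                          continue;
                  }
                  break;
          }
          folio = new_folio;
          /* ... add the folio to the swap cache and read it in ... */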
      
      Link: https://lkml.kernel.org/r/20240731133101.GA2096752@bytedance
      Signed-off-by: Zhaoyu Liu <liuzhaoyu.zackary@bytedance.com>
      Cc: Domenico Cerasuolo <cerasuolodomenico@gmail.com>
      Cc: Kairui Song <kasong@tencent.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: clarify folio_likely_mapped_shared() documentation for KSM folios · 17d5f38b
      David Hildenbrand authored
      For KSM folios, the function actually does what it is supposed to do: even
      having multiple mappings inside the same MM is considered "sharing", as
      there is no real relationship between these KSM page mappings -- in
      contrast to mapping the same file range twice and having the same
      pagecache page mapped twice.
      
      Link: https://lkml.kernel.org/r/20240731160758.808925-1-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/rmap: cleanup partially-mapped handling in __folio_remove_rmap() · 6654d289
      David Hildenbrand authored
      Let's simplify and reduce code indentation.  In the RMAP_LEVEL_PTE case,
      we already check for nr when computing partially_mapped.
      
      For RMAP_LEVEL_PMD, it's a bit more confusing.  Likely, we don't need the
      "nr" check, but we could have "nr < nr_pmdmapped" also if we stumbled into
      the "/* Raced ahead of another remove and an add?  */" case.  So let's
      simply move the nr check in there.
      
      Note that partially_mapped is always false for small folios.
      
      No functional change intended.
      
      Link: https://lkml.kernel.org/r/20240710214350.147864-1-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Zi Yan <ziy@nvidia.com>
      Reviewed-by: Yosry Ahmed <yosryahmed@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/hugetlb: remove hugetlb_follow_page_mask() leftover · 94ccd21e
      David Hildenbrand authored
      We removed hugetlb_follow_page_mask() in commit 9cb28da5 ("mm/gup:
      handle hugetlb in the generic follow_page_mask code") but forgot to
      cleanup some leftovers.
      
      While at it, simplify the hugetlb comment, it's overly detailed and rather
      confusing.  Stating that we may end up in there during coredumping is
      sufficient to explain the PF_DUMPCORE usage.
      
      Link: https://lkml.kernel.org/r/20240731142000.625044-1-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Peter Xu <peterx@redhat.com>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: Jan Kara <jack@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/memory_hotplug: get rid of __ref · f732e242
      Wei Yang authored
      After commit 73db3abd ("init/modpost: conditionally check section
      mismatch to __meminit*"), we can get rid of __ref annotations.
      
      Link: https://lkml.kernel.org/r/20240726010157.6177-1-richard.weiyang@gmail.com
      Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Cc: Masahiro Yamada <masahiroy@kernel.org>
      Cc: Oscar Salvador <osalvador@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: swap: add nr argument in swapcache_prepare and swapcache_clear to support large folios · 9f101bef
      Barry Song authored
       Right now, swapcache_prepare() and swapcache_clear() support one entry
      only, to support large folios, we need to handle multiple swap entries.
      
      To optimize stack usage, we iterate twice in __swap_duplicate(): the first
      time to verify that all entries are valid, and the second time to apply
      the modifications to the entries.
      
      Currently, we're using nr=1 for the existing users.
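
      The two-pass shape of __swap_duplicate() then looks roughly like this
      (a sketch; the per-entry check and update helpers are stand-ins for
      the real swap count handling):

          /* __swap_duplicate(entry, usage, nr), simplified */
          ci = lock_cluster_or_swap_info(si, offset);

          /* Pass 1: verify that every entry can take the new reference. */
          for (i = 0; i < nr; i++) {
                  err = swap_entry_check(si, offset + i, usage);  /* stand-in */
                  if (err)
                          goto unlock_out;
          }

          /* Pass 2: apply the change; nothing can fail after pass 1. */
          for (i = 0; i < nr; i++)
                  swap_entry_apply(si, offset + i, usage);        /* stand-in */

          unlock_out:
          unlock_cluster_or_swap_info(si, ci);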
      
      [v-songbaohua@oppo.com: clarify swap_count_continued and improve readability for  __swap_duplicate]
        Link: https://lkml.kernel.org/r/20240802071817.47081-1-21cnbao@gmail.com
      Link: https://lkml.kernel.org/r/20240730071339.107447-2-21cnbao@gmail.com
      Signed-off-by: Barry Song <v-songbaohua@oppo.com>
      Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com>
      Cc: Chris Li <chrisl@kernel.org>
      Cc: Gao Xiang <xiang@kernel.org>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kairui Song <kasong@tencent.com>
      Cc: Kalesh Singh <kaleshsingh@google.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Nhat Pham <nphamcs@gmail.com>
      Cc: Ryan Roberts <ryan.roberts@arm.com>
      Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
      Cc: Shakeel Butt <shakeel.butt@linux.dev>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Yosry Ahmed <yosryahmed@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/z3fold: add __percpu annotation to *unbuddied pointer in struct z3fold_pool · 7e60dcb2
      Uros Bizjak authored
      Compiling z3fold.c results in several sparse warnings:
      
      z3fold.c:797:21: warning: incorrect type in initializer (different address spaces)
      z3fold.c:797:21:    expected void const [noderef] __percpu *__vpp_verify
      z3fold.c:797:21:    got struct list_head *
      z3fold.c:852:37: warning: incorrect type in initializer (different address spaces)
      z3fold.c:852:37:    expected void const [noderef] __percpu *__vpp_verify
      z3fold.c:852:37:    got struct list_head *
      z3fold.c:924:25: warning: incorrect type in assignment (different address spaces)
      z3fold.c:924:25:    expected struct list_head *unbuddied
      z3fold.c:924:25:    got void [noderef] __percpu *_res
      z3fold.c:930:33: warning: incorrect type in initializer (different address spaces)
      z3fold.c:930:33:    expected void const [noderef] __percpu *__vpp_verify
      z3fold.c:930:33:    got struct list_head *
      z3fold.c:949:25: warning: incorrect type in argument 1 (different address spaces)
      z3fold.c:949:25:    expected void [noderef] __percpu *__pdata
      z3fold.c:949:25:    got struct list_head *unbuddied
      z3fold.c:979:25: warning: incorrect type in argument 1 (different address spaces)
      z3fold.c:979:25:    expected void [noderef] __percpu *__pdata
      z3fold.c:979:25:    got struct list_head *unbuddied
      
      Add __percpu annotation to *unbuddied pointer to fix these warnings.
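
      The fix itself is the annotation on the per-CPU pointer (sketch):

          struct z3fold_pool {
                  /* other members unchanged */
                  struct list_head __percpu *unbuddied;
          };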
      
      Link: https://lkml.kernel.org/r/20240730123445.5875-1-ubizjak@gmail.com
      Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
      Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Vitaly Wool <vitaly.wool@konsulko.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/cma: change the addition of totalcma_pages in the cma_init_reserved_mem · 5c053250
      Hao Ge authored
       Replace the unnecessary division calculation with cma->count when updating
      the value of totalcma_pages.
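
      In diff form this is roughly (sketch of cma_init_reserved_mem()):

          -       totalcma_pages += (size / PAGE_SIZE);
          +       totalcma_pages += cma->count;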
      
      Link: https://lkml.kernel.org/r/20240729080431.70916-1-hao.ge@linux.dev
      Signed-off-by: Hao Ge <gehao@kylinos.cn>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: improve code consistency with zonelist_* helper functions · 29943248
      Wei Yang authored
      Replace direct access to zoneref->zone, zoneref->zone_idx, or
      zone_to_nid(zoneref->zone) with the corresponding zonelist_* helper
      functions for consistency.
      
      No functional change.
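
      For example, given a struct zoneref *z, the accessors replace the
      open-coded forms (sketch):

          zone = zonelist_zone(z);        /* instead of z->zone */
          idx  = zonelist_zone_idx(z);    /* instead of z->zone_idx */
          nid  = zonelist_node_idx(z);    /* instead of zone_to_nid(z->zone) */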
      
      Link: https://lkml.kernel.org/r/20240729091717.464-1-shivankg@amd.com
      Co-developed-by: Shivank Garg <shivankg@amd.com>
      Signed-off-by: Shivank Garg <shivankg@amd.com>
      Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Cc: Mike Rapoport (IBM) <rppt@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • tools: add skeleton code for userland testing of VMA logic · 9325b8b5
      Lorenzo Stoakes authored
      Establish a new userland VMA unit testing implementation under
       tools/testing which utilises the existing maple tree support in userland
       via the now-shared code previously exclusive to radix tree testing.
      
      This provides fundamental VMA operations whose API is defined in mm/vma.h,
      while stubbing out superfluous functionality.
      
      This exists as a proof-of-concept, with the test implementation functional
      and sufficient to allow userland compilation of vma.c, but containing only
      cursory tests to demonstrate basic functionality.
      
      Link: https://lkml.kernel.org/r/533ffa2eec771cbe6b387dd049a7f128a53eb616.1722251717.git.lorenzo.stoakes@oracle.com
      Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
      Tested-by: SeongJae Park <sj@kernel.org>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Brendan Higgins <brendanhiggins@google.com>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: David Gow <davidgow@google.com>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Kees Cook <kees@kernel.org>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Rae Moar <rmoar@google.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Pengfei Xu <pengfei.xu@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • tools: separate out shared radix-tree components · 74579d8d
      Lorenzo Stoakes authored
      The core components contained within the radix-tree tests which provide
      shims for kernel headers and access to the maple tree are useful for
      testing other things, so separate them out and make the radix tree tests
      dependent on the shared components.
      
      This lays the groundwork for us to add VMA tests of the newly introduced
      vma.c file.
      
      Link: https://lkml.kernel.org/r/1ee720c265808168e0d75608e687607d77c36719.1722251717.git.lorenzo.stoakes@oracle.com
      Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Brendan Higgins <brendanhiggins@google.com>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: David Gow <davidgow@google.com>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Kees Cook <kees@kernel.org>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Rae Moar <rmoar@google.com>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Pengfei Xu <pengfei.xu@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • MAINTAINERS: add entry for new VMA files · 802443a4
      Lorenzo Stoakes authored
      The vma files contain logic split from mmap.c for the most part and are
      all relevant to VMA logic, so maintain the same reviewers for both.
      
      Link: https://lkml.kernel.org/r/bf2581cce2b4d210deabb5376c6aa0ad6facf1ff.1722251717.git.lorenzo.stoakes@oracle.com
      Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Liam R. Howlett <Liam.Howlett@oracle.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Brendan Higgins <brendanhiggins@google.com>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: David Gow <davidgow@google.com>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Kees Cook <kees@kernel.org>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Rae Moar <rmoar@google.com>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Pengfei Xu <pengfei.xu@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: move internal core VMA manipulation functions to own file · 49b1b8d6
      Lorenzo Stoakes authored
      This patch introduces vma.c and moves internal core VMA manipulation
      functions to this file from mmap.c.
      
      This allows us to isolate VMA functionality in a single place such that we
      can create userspace testing code that invokes this functionality in an
      environment where we can implement simple unit tests of core
      functionality.
      
      This patch ensures that core VMA functionality is explicitly marked as
      such by its presence in mm/vma.h.
      
      It also places the header includes required by vma.c in vma_internal.h,
      which is simply imported by vma.c.  This makes the VMA functionality
      testable, as userland testing code can simply stub out functionality as
      required.
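      
      To make the shape of this concrete, the following is a toy, self-contained
      userland model of the pattern (invented names, not the real kernel types):
      code playing the role of vma.c is written against a small internal
      interface, and a test build supplies stubs for that interface:
      
      #include <stdio.h>
      #include <stdlib.h>
      
      /* Stand-in for what a userland vma_internal.h replacement might stub. */
      struct toy_vma {
              unsigned long vm_start;
              unsigned long vm_end;
      };
      
      static struct toy_vma *toy_alloc_vma(void)
      {
              return calloc(1, sizeof(struct toy_vma));   /* stub for the kernel allocator */
      }
      
      /* Stand-in for logic that would live in vma.c and be declared in vma.h. */
      static int toy_vma_expand(struct toy_vma *vma, unsigned long new_end)
      {
              if (new_end <= vma->vm_end)
                      return -1;      /* only grow */
              vma->vm_end = new_end;
              return 0;
      }
      
      int main(void)
      {
              struct toy_vma *vma = toy_alloc_vma();
      
              vma->vm_start = 0x1000;
              vma->vm_end   = 0x2000;
              if (toy_vma_expand(vma, 0x3000) == 0)
                      printf("expanded to [%#lx, %#lx)\n", vma->vm_start, vma->vm_end);
              free(vma);
              return 0;
      }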
      
      Link: https://lkml.kernel.org/r/c77a6aafb4c42aaadb8e7271a853658cbdca2e22.1722251717.git.lorenzo.stoakes@oracle.com
      Signed-off-by: default avatarLorenzo Stoakes <lorenzo.stoakes@oracle.com>
      Reviewed-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Reviewed-by: default avatarLiam R. Howlett <Liam.Howlett@oracle.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Brendan Higgins <brendanhiggins@google.com>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: David Gow <davidgow@google.com>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Kees Cook <kees@kernel.org>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Rae Moar <rmoar@google.com>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Pengfei Xu <pengfei.xu@intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      49b1b8d6
    • Lorenzo Stoakes's avatar
      mm: move vma_shrink(), vma_expand() to internal header · d61f0d59
      Lorenzo Stoakes authored
      The vma_shrink() and vma_expand() functions are internal VMA manipulation
      functions which we ought to abstract for use outside of memory management
      code.
      
      To achieve this, we replace shift_arg_pages() in fs/exec.c with an
      invocation of a new relocate_vma_down() function implemented in mm/mmap.c,
      which enables us to also move move_page_tables() and vma_iter_prev_range()
      to internal.h.
      
      The purpose of doing this is to isolate key VMA manipulation functions in
      order that we can both abstract them and later render them easily
      testable.
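      
      For orientation, a hedged sketch of the resulting call shape in fs/exec.c;
      this is kernel-context pseudocode rather than a standalone program, and
      the exact signature of relocate_vma_down() is assumed from the
      description above:
      
      /* Previously shift_arg_pages(vma, shift) lived in fs/exec.c itself. */
      if (shift) {
              /* Move the stack VMA (and its page tables) down by 'shift' bytes. */
              ret = relocate_vma_down(vma, shift);
              if (ret)
                      goto out_unlock;
      }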
      
      Link: https://lkml.kernel.org/r/3cfcd9ec433e032a85f636fdc0d7d98fafbd19c5.1722251717.git.lorenzo.stoakes@oracle.com
      Signed-off-by: default avatarLorenzo Stoakes <lorenzo.stoakes@oracle.com>
      Reviewed-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Reviewed-by: default avatarLiam R. Howlett <Liam.Howlett@oracle.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Brendan Higgins <brendanhiggins@google.com>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: David Gow <davidgow@google.com>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Kees Cook <kees@kernel.org>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Rae Moar <rmoar@google.com>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Pengfei Xu <pengfei.xu@intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      d61f0d59
    • Lorenzo Stoakes's avatar
      mm: move vma_modify() and helpers to internal header · fa04c08f
      Lorenzo Stoakes authored
      These are core VMA manipulation functions which invoke VMA splitting and
      merging and should not be directly accessed from outside of mm/.
      
      Link: https://lkml.kernel.org/r/5efde0c6342a8860d5ffc90b415f3989fd8ed0b2.1722251717.git.lorenzo.stoakes@oracle.com
      Signed-off-by: default avatarLorenzo Stoakes <lorenzo.stoakes@oracle.com>
      Reviewed-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Reviewed-by: default avatarLiam R. Howlett <Liam.Howlett@oracle.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Brendan Higgins <brendanhiggins@google.com>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: David Gow <davidgow@google.com>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Kees Cook <kees@kernel.org>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Rae Moar <rmoar@google.com>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Pengfei Xu <pengfei.xu@intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      fa04c08f
    • Lorenzo Stoakes's avatar
      userfaultfd: move core VMA manipulation logic to mm/userfaultfd.c · a17c7d8f
      Lorenzo Stoakes authored
      Patch series "Make core VMA operations internal and testable", v4.
      
      There are a number of "core" VMA manipulation functions implemented in
      mm/mmap.c, notably those concerning VMA merging, splitting, modifying,
      expanding and shrinking, which logically don't belong there.
      
      More importantly this functionality represents an internal implementation
      detail of memory management and should not be exposed outside of mm/
      itself.
      
      This patch series isolates core VMA manipulation functionality into its
      own file, mm/vma.c, and provides an API to the rest of the mm code in
      mm/vma.h.
      
      Importantly, it also carefully implements mm/vma_internal.h, which
      specifies which headers need to be imported by vma.c, leading to the very
      useful property that vma.c depends only on mm/vma.h and mm/vma_internal.h.
      
      This means we can then re-implement vma_internal.h in userland, adding
      shims for kernel mechanisms as required, allowing us to unit test internal
      VMA functionality.
      
      This testing is useful, as opposed to e.g. a kunit implementation, because
      this way we can avoid all external kernel side-effects while testing, run
      tests VERY quickly, and iterate on and debug problems quickly.
      
      Excitingly this opens the door to, in the future, recreating precise
      problems observed in production in userland and very quickly debugging
      problems that might otherwise be very difficult to reproduce.
      
      This patch series takes advantage of existing shim logic and full userland
      maple tree support contained in tools/testing/radix-tree/ and
      tools/include/linux/, separating out shared components of the radix tree
      implementation to provide this testing.
      
      Kernel functionality is stubbed and shimmed as needed in
      tools/testing/vma/ which contains a fully functional userland
      vma_internal.h file and which imports mm/vma.c and mm/vma.h to be directly
      tested from userland.
      
      A simple, skeleton testing implementation is provided in
      tools/testing/vma/vma.c as a proof-of-concept, asserting that simple VMA
      merge, modify (testing split), expand and shrink functionality work
      correctly.
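      
      To give a flavour of that skeleton, here is a hedged, self-contained
      sketch in the same spirit; the types and helper below are simplified
      stand-ins, not the real mm/vma.c entry points:
      
      #include <assert.h>
      #include <stdio.h>
      
      struct toy_range {
              unsigned long start;
              unsigned long end;
              unsigned long flags;
      };
      
      /* Merge b into a when they abut and their flags match. */
      static int toy_try_merge(struct toy_range *a, const struct toy_range *b)
      {
              if (a->end != b->start || a->flags != b->flags)
                      return 0;
              a->end = b->end;
              return 1;
      }
      
      int main(void)
      {
              struct toy_range a = { 0x1000, 0x2000, 0x3 };
              struct toy_range b = { 0x2000, 0x3000, 0x3 };
              struct toy_range c = { 0x3000, 0x4000, 0x7 };   /* different flags */
      
              assert(toy_try_merge(&a, &b));          /* adjacent + same flags: merges */
              assert(a.end == 0x3000);
              assert(!toy_try_merge(&a, &c));         /* flag mismatch: no merge */
      
              printf("toy merge tests passed\n");
              return 0;
      }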
      
      
      This patch (of 4):
      
      This patch forms part of a patch series intending to separate out VMA
      logic and render it testable from userspace, which requires that core
      manipulation functions be exposed in an mm/-internal header file.
      
      In order to do this, we must abstract APIs we wish to test, in this
      instance functions which ultimately invoke vma_modify().
      
      This patch therefore moves all logic which ultimately invokes vma_modify()
      to mm/userfaultfd.c, trying to transfer code at a functional granularity
      where possible.
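      
      As a hedged sketch of the shape such a moved helper takes (kernel
      context, parameter list assumed; userfaultfd_clear_vma() is named in the
      fixup note below), the idea is that userfaultfd only ever goes through
      the vma_modify() family rather than splitting or merging directly:
      
      struct vm_area_struct *userfaultfd_clear_vma(struct vma_iterator *vmi,
                                                   struct vm_area_struct *prev,
                                                   struct vm_area_struct *vma,
                                                   unsigned long start,
                                                   unsigned long end)
      {
              unsigned long new_flags = vma->vm_flags & ~(VM_UFFD_MISSING | VM_UFFD_WP);
      
              /* vma_modify_flags() performs any split/merge and returns the result. */
              return vma_modify_flags(vmi, prev, vma, start, end, new_flags);
      }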
      
      [lorenzo.stoakes@oracle.com: fix use-after-free in userfaultfd_clear_vma()]
        Link: https://lkml.kernel.org/r/3c947ddc-b804-49b7-8fe9-3ea3ca13def5@lucifer.local
      Link: https://lkml.kernel.org/r/cover.1722251717.git.lorenzo.stoakes@oracle.com
      Link: https://lkml.kernel.org/r/50c3ed995fd81c45876c86304c8a00bf3e396cfd.1722251717.git.lorenzo.stoakes@oracle.com
      Signed-off-by: default avatarLorenzo Stoakes <lorenzo.stoakes@oracle.com>
      Reviewed-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Reviewed-by: default avatarLiam R. Howlett <Liam.Howlett@oracle.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Brendan Higgins <brendanhiggins@google.com>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: David Gow <davidgow@google.com>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Kees Cook <kees@kernel.org>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Rae Moar <rmoar@google.com>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Pengfei Xu <pengfei.xu@intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      a17c7d8f
    • David Finkel's avatar
      mm, memcg: cg2 memory{.swap,}.peak write tests · d075bcce
      David Finkel authored
      Extend two existing tests to cover extracting memory usage through the
      newly mutable memory.peak and memory.swap.peak handlers.
      
      In particular, make sure to exercise adding and removing watchers with
      overlapping lifetimes so the less-trivial logic gets tested.
      
      The new/updated tests attempt to detect a lack of the write handler by
      fstat'ing the memory.peak and memory.swap.peak files and skip the tests if
      that's the case.  Additionally, skip if the file doesn't exist at all.
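      
      For reference, a hedged sketch of that detection logic in plain C (the
      cgroup path and exit-code handling are simplified compared to the real
      selftest helpers):
      
      #include <limits.h>
      #include <stdio.h>
      #include <sys/stat.h>
      
      #define KSFT_SKIP 4        /* conventional kselftest "skipped" exit code */
      
      static int peak_file_usable(const char *cgroup, const char *name)
      {
              char path[PATH_MAX];
              struct stat st;
      
              snprintf(path, sizeof(path), "%s/%s", cgroup, name);
              if (stat(path, &st))
                      return 0;                        /* file absent: older kernel */
              return (st.st_mode & S_IWUSR) != 0;      /* writable => handler present */
      }
      
      int main(void)
      {
              if (!peak_file_usable("/sys/fs/cgroup/test", "memory.peak"))
                      return KSFT_SKIP;
              /* ... exercise overlapping watcher lifetimes here ... */
              return 0;
      }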
      
      [davidf@vimeo.com: update tests]
        Link: https://lkml.kernel.org/r/20240730231304.761942-3-davidf@vimeo.com
      Link: https://lkml.kernel.org/r/20240729143743.34236-3-davidf@vimeo.com
      Signed-off-by: default avatarDavid Finkel <davidf@vimeo.com>
      Acked-by: default avatarTejun Heo <tj@kernel.org>
      Reviewed-by: default avatarRoman Gushchin <roman.gushchin@linux.dev>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Michal Koutný <mkoutny@suse.com>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Shakeel Butt <shakeel.butt@linux.dev>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Waiman Long <longman@redhat.com>
      Cc: Zefan Li <lizefan.x@bytedance.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      d075bcce
    • David Finkel's avatar
      mm, memcg: cg2 memory{.swap,}.peak write handlers · c6f53ed8
      David Finkel authored
      Patch series "mm, memcg: cg2 memory{.swap,}.peak write handlers", v7.
      
      
      This patch (of 2):
      
      Other mechanisms for querying the peak memory usage of either a process or
      v1 memory cgroup allow for resetting the high watermark.  Restore parity
      with those mechanisms, but with a less racy API.
      
      For example:
       - Any write to memory.max_usage_in_bytes in a cgroup v1 mount resets
         the high watermark.
       - writing "5" to the clear_refs pseudo-file in a processes's proc
         directory resets the peak RSS.
      
      This change is an evolution of a previous patch which mostly copied the
      cgroup v1 behavior; however, there were concerns about races/ownership
      issues with a global reset, so instead this change makes the reset
      file-descriptor-local.
      
      Writing any non-empty string to the memory.peak and memory.swap.peak
      pseudo-files resets the high watermark to the current usage for subsequent
      reads through that same FD.
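      
      In userspace terms, usage might look roughly like the sketch below (the
      cgroup path is an example; any non-empty write rearms the watermark for
      this FD only):
      
      #include <fcntl.h>
      #include <stdio.h>
      #include <unistd.h>
      
      int main(void)
      {
              char buf[64];
              ssize_t n;
              int fd = open("/sys/fs/cgroup/test/memory.peak", O_RDWR);
      
              if (fd < 0)
                      return 1;
      
              n = pread(fd, buf, sizeof(buf) - 1, 0);
              if (n > 0)
                      printf("historical peak: %.*s", (int)n, buf);
      
              /* Any non-empty string works; "reset" is just a readable choice. */
              if (write(fd, "reset\n", 6) < 0)
                      perror("write");
      
              n = pread(fd, buf, sizeof(buf) - 1, 0);
              if (n > 0)
                      printf("peak since reset (this FD only): %.*s", (int)n, buf);
      
              close(fd);
              return 0;
      }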
      
      Notably, following Johannes's suggestion, this implementation moves the
      O(FDs that have written) behavior onto the FD write(2) path; the
      page-allocation path instead just adds one additional watermark to
      conditionally bump per-hierarchy level in the page-counter.
      
      Additionally, this takes Longman's suggestion of nesting the
      page-charging-path checks for the two watermarks to reduce the number of
      common-case comparisons.
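      
      A hedged, userland model of that nesting (field names follow the
      description above, not necessarily the kernel's page_counter internals):
      because the resettable local watermark can never exceed the all-time
      watermark, the common case costs a single comparison:
      
      #include <stdio.h>
      
      struct toy_counter {
              unsigned long usage;
              unsigned long watermark;        /* all-time peak, never rearmed */
              unsigned long local_watermark;  /* rearmed by an FD write */
      };
      
      static void toy_charge(struct toy_counter *c, unsigned long pages)
      {
              unsigned long new = c->usage + pages;
      
              c->usage = new;
              /* Only when the local watermark moves do we also test the all-time one. */
              if (new > c->local_watermark) {
                      c->local_watermark = new;
                      if (new > c->watermark)
                              c->watermark = new;
              }
      }
      
      int main(void)
      {
              struct toy_counter c = { 0, 0, 0 };
      
              toy_charge(&c, 100);            /* all-time peak is now 100 */
              c.usage = 20;                   /* model memory being uncharged */
              c.local_watermark = c.usage;    /* model an FD "reset" write */
              toy_charge(&c, 10);
              printf("watermark=%lu local=%lu\n", c.watermark, c.local_watermark);
              /* prints: watermark=100 local=30 */
              return 0;
      }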
      
      This behavior is particularly useful for work scheduling systems that need
      to track memory usage of worker processes/cgroups per-work-item.  Since
      memory can't be squeezed like CPU can (the OOM-killer has opinions), these
      systems need to track the peak memory usage to compute system/container
      fullness when binpacking workitems.
      
      Most notably, Vimeo's use-case involves a system that's doing global
      binpacking across many Kubernetes pods/containers, and while we can use
      PSI for some local decisions about overload, we strive to avoid packing
      workloads too tightly in the first place.  To facilitate this, we track
      the peak memory usage.  However, since we run with long-lived workers (to
      amortize startup costs) we need a way to track the high watermark while a
      work-item is executing.  Polling runs the risk of missing short spikes
      that last for timescales below the polling interval, and peak memory
      tracking at the cgroup level is otherwise perfect for this use-case.
      
      As this data is used to ensure that binpacked work ends up with sufficient
      headroom, this use-case mostly avoids the inaccuracies surrounding
      reclaimable memory.
      
      Link: https://lkml.kernel.org/r/20240730231304.761942-1-davidf@vimeo.com
      Link: https://lkml.kernel.org/r/20240729143743.34236-1-davidf@vimeo.com
      Link: https://lkml.kernel.org/r/20240729143743.34236-2-davidf@vimeo.com
      Signed-off-by: default avatarDavid Finkel <davidf@vimeo.com>
      Suggested-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Suggested-by: default avatarWaiman Long <longman@redhat.com>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: default avatarMichal Koutný <mkoutny@suse.com>
      Acked-by: default avatarTejun Heo <tj@kernel.org>
      Reviewed-by: default avatarRoman Gushchin <roman.gushchin@linux.dev>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Shakeel Butt <shakeel.butt@linux.dev>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Zefan Li <lizefan.x@bytedance.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      c6f53ed8