1. 04 Oct, 2023 40 commits
    • SeongJae Park's avatar
      mm/damon/vaddr: call damon_update_region_access_rate() always · 22a77880
      SeongJae Park authored
      When getting mm_struct of the monitoring target process fails, there will
      be no need to increase the access rate counter (nr_accesses) of the
      regions for the process.  Hence, damon_va_check_accesses() skips calling
      damon_update_region_access_rate() in the case.  This breaks the assumption
      that damon_update_region_access_rate() is called for every region, for
      every sampling interval.  Call the function for every region even in the
      case.  This might increase the overhead in some cases, but such case would
      not be frequent, so no significant impact is really expected.
      
      Link: https://lkml.kernel.org/r/20230915025251.72816-3-sj@kernel.org
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Cc: Brendan Higgins <brendanhiggins@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      22a77880
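The idea of calling the update helper for every region on every sample, even when the target's mm_struct is unavailable, can be sketched in plain userspace C. This is only an illustration under assumptions: `struct region`, its `nr_updates` field, `update_region_access_rate()` and `check_accesses()` are simplified, hypothetical stand-ins and not the kernel's definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-in for the kernel's struct damon_region. */
struct region {
	unsigned int nr_accesses;	/* accesses seen in this aggregation */
	unsigned int nr_updates;	/* hypothetical: how often the helper ran */
};

/* The single place where the access rate counter is touched. */
static void update_region_access_rate(struct region *r, bool accessed)
{
	r->nr_updates++;
	if (accessed)
		r->nr_accesses++;
}

/*
 * One sampling pass.  If the target's mm could not be obtained
 * (mm_ok == false), each region is still reported, as "not accessed".
 */
static void check_accesses(struct region *regions, size_t nr, bool mm_ok,
			   const bool *accessed)
{
	for (size_t i = 0; i < nr; i++)
		update_region_access_rate(&regions[i], mm_ok && accessed[i]);
}

/* Usage: two samples, the second one with the mm lookup failing. */
void demo_always_update(void)
{
	struct region rs[2] = { { 0, 0 }, { 0, 0 } };
	const bool accessed[2] = { true, false };

	check_accesses(rs, 2, true, accessed);
	check_accesses(rs, 2, false, accessed);

	assert(rs[0].nr_accesses == 1 && rs[1].nr_accesses == 0);
	assert(rs[0].nr_updates == 2 && rs[1].nr_updates == 2);
}
```

The helper runs once per region per sample in both passes, so any bookkeeping later added to it sees every sampling interval.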
    • SeongJae Park's avatar
      mm/damon/core: define and use a dedicated function for region access rate update · 78fbfb15
      SeongJae Park authored
      Patch series "mm/damon: provide pseudo-moving sum based access rate".
      
      DAMON checks the access to each region for every sampling interval, and
      increases the access rate counter of the region, namely nr_accesses, if
      the access was made.  For every aggregation interval, the counter is
      reset.  The counter is exposed to users to be used as a metric showing
      the relative access rate (frequency) of each region.  In other words,
      DAMON provides the access rate of each region in every aggregation
      interval.  The aggregation keeps temporal access pattern changes from
      making things confusing.  However, this also unnecessarily forces a few
      DAMON-related operations to be aligned to the aggregation interval.
      This can restrict the flexibility of DAMON applications, especially when
      the aggregation interval is huge.
      
      To provide the monitoring results in finer-grained timing while keeping
      the handling of temporal access pattern changes, this patchset
      implements a pseudo-moving sum based access rate metric.  It is a
      pseudo-moving sum because a strict moving sum implementation would need
      to keep all values for the last time window, and that could incur high
      overhead if there could be an arbitrary number of values in a time
      window.  Especially in the case of nr_accesses, since the sampling
      interval and aggregation interval can be arbitrarily set and the past
      values should be maintained for every region, it could be risky.  The
      pseudo-moving sum assumes there was no temporal access pattern change in
      the last discrete time window, to remove the need for keeping the list
      of the last time window values.  As a result, it is not a strict moving
      sum implementation, but provides reasonable accuracy.
      
      Also, it keeps an important property of the moving sum.  That is, the
      moving sum becomes the same as the discrete-window based sum at the time
      that aligns to the time window.  This means using the pseudo-moving sum
      based nr_accesses makes no change for users who read the value every
      aggregation interval.
      
      Patches Sequence
      ----------------
      
      The sequence of the patches is as follows.  The first four patches are for
      preparation of the change.  The first two (patches 1 and 2) implement a
      helper function for nr_accesses update and eliminate a corner case that
      skips use of the function, respectively.  The following two (patches 3
      and 4) respectively implement the pseudo-moving sum function and its
      simple unit test case.
      
      Two patches for making DAMON use the pseudo-moving sum follow.  The
      fifth one (patch 5) introduces a new field for representing the
      pseudo-moving sum-based access rate of each region, and the sixth one
      makes the new representation actually updated with the pseudo-moving
      sum function.
      
      The last two patches (patches 7 and 8) make followup fixes for skipping
      unnecessary updates and marking the moving sum function as static,
      respectively.
      
      
      This patch (of 8):
      
      Each DAMON operations set updates the nr_accesses field of each
      damon_region for each of its access check results, from the
      check_accesses() callback.  Directly accessing the field could make things
      complex to manage and change in the future.  Define and use a dedicated
      function for the purpose.
      
      Link: https://lkml.kernel.org/r/20230915025251.72816-1-sj@kernel.org
      Link: https://lkml.kernel.org/r/20230915025251.72816-2-sj@kernel.org
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Cc: Brendan Higgins <brendanhiggins@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      78fbfb15
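The cover letter above describes the pseudo-moving sum only in words. A minimal sketch of one formula consistent with that description follows. It assumes that the value leaving the window can be estimated as the previous window's plain sum divided by the window length (the "no temporal access pattern change" assumption); the name `pseudo_moving_sum()` and its parameters are illustrative and are not taken from the commit text.

```c
#include <assert.h>

/*
 * Pseudo-moving sum sketch.
 *
 * mvsum:      current pseudo-moving sum
 * last_sum:   plain (discrete-window) sum of the previous window
 * len_window: number of samples per window
 * new_value:  newest sample
 *
 * Instead of remembering every sample of the previous window, the sample
 * that drops out is estimated as last_sum / len_window.
 */
static unsigned int pseudo_moving_sum(unsigned int mvsum,
				      unsigned int last_sum,
				      unsigned int len_window,
				      unsigned int new_value)
{
	return mvsum - last_sum / len_window + new_value;
}

/* Usage: after a full window, the result equals the plain window sum. */
void demo_pseudo_moving_sum(void)
{
	const unsigned int len_window = 4;
	const unsigned int last_sum = 8;	/* previous window: 2 + 2 + 2 + 2 */
	const unsigned int samples[4] = { 1, 0, 1, 1 };	/* plain sum is 3 */
	unsigned int mvsum = last_sum;

	for (unsigned int i = 0; i < len_window; i++)
		mvsum = pseudo_moving_sum(mvsum, last_sum, len_window,
					  samples[i]);

	assert(mvsum == 3);
}
```

In this sketch the alignment property holds exactly because `last_sum` is divisible by `len_window`; with integer division, a remainder would otherwise carry over into the next window.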
    • SeongJae Park's avatar
      mm/damon/core: use number of passed access sampling as a timer · 4472edf6
      SeongJae Park authored
      DAMON sleeps for the sampling interval after each sampling, and checks
      if the aggregation interval and the ops update interval have passed
      using ktime_get_coarse_ts64() and baseline timestamps for the intervals.
      That design is for making the operations occur at deterministic timing
      regardless of the time spent on each piece of work.  However, it turned
      out to be not that useful, and to incur not-that-intuitive results.
      
      After all, timer functions, and especially the sleep functions that DAMON
      uses to wait for specific timing, are not necessarily strictly accurate.
      That is a legitimate design, so it is not a problem in itself.  However,
      depending on such inaccuracies, nr_accesses can be larger than the
      aggregation interval divided by the sampling interval.  For example,
      with the default setting (5 ms sampling interval and 100 ms aggregation
      interval) we frequently see regions having nr_accesses larger than 20.
      Also, if the execution of a DAMOS scheme takes a long time, the next
      aggregation could happen before enough samples are collected.  This is
      not what usual users would intuitively expect.
      
      Since access check sampling is the smallest unit of work of DAMON, using
      the number of passed sampling intervals as the DAMON-internal timer can
      easily avoid these problems.  That is, convert the aggregation and ops
      update intervals to numbers of sampling intervals that need to pass
      before those operations are executed, count the number of passed
      sampling intervals, and invoke the operations as soon as the specified
      number of sampling intervals has passed.  Make the change.
      
      Note that this could make a behavioral change for settings that use
      intervals that are not aligned to the sampling interval.  For example,
      if the sampling interval is 5 ms and the aggregation interval is 12 ms,
      DAMON currently effectively uses 15 ms as its aggregation interval,
      because it checks whether the aggregation interval has passed after
      sleeping for the sampling interval.  This change will make DAMON
      effectively use 10 ms as the aggregation interval, since it uses
      'aggregation interval / sampling interval * sampling interval' as the
      effective aggregation interval, and we don't use floating point types.
      Usual users would have used aligned intervals, so this behavioral change
      is not expected to make any meaningful impact, so just make this change.
      
      Link: https://lkml.kernel.org/r/20230914021523.60649-1-sj@kernel.org
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      4472edf6
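The interval arithmetic in the message above can be checked with a small userspace sketch. The function names are hypothetical, and the "old" model idealizes the timers as exact, which the message itself says they are not; it only reproduces the 12 ms example, not the kernel code.

```c
#include <assert.h>

/*
 * Old behaviour, idealized: the check runs only after each sampling
 * sleep, so the interval is effectively rounded up to a multiple of the
 * sampling interval.
 */
static unsigned long effective_aggr_old(unsigned long sample_us,
					unsigned long aggr_us)
{
	return (aggr_us + sample_us - 1) / sample_us * sample_us;
}

/*
 * New behaviour: the interval is converted to a whole number of samples
 * with integer division, so it is effectively rounded down.
 */
static unsigned long effective_aggr_new(unsigned long sample_us,
					unsigned long aggr_us)
{
	return aggr_us / sample_us * sample_us;
}

/* Upper bound of nr_accesses when samples are used as the timer. */
static unsigned long max_nr_accesses(unsigned long sample_us,
				     unsigned long aggr_us)
{
	return aggr_us / sample_us;
}

/* Usage: the 5 ms / 12 ms example and the 5 ms / 100 ms defaults. */
void demo_effective_intervals(void)
{
	assert(effective_aggr_old(5000, 12000) == 15000);
	assert(effective_aggr_new(5000, 12000) == 10000);
	assert(max_nr_accesses(5000, 100000) == 20);
}
```

For aligned settings such as the defaults, both models give the same effective interval, which matches the expectation that usual users see no behavioral change.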
    • Zi Yan's avatar
      mips: use nth_page() in place of direct struct page manipulation · aa5fe31b
      Zi Yan authored
      __flush_dcache_pages() is called during hugetlb migration via
      migrate_pages() -> migrate_hugetlbs() -> unmap_and_move_huge_page() ->
      move_to_new_folio() -> flush_dcache_folio().  And with hugetlb and without
      sparsemem vmemmap, struct page is not guaranteed to be contiguous beyond a
      section.  Use nth_page() instead.
      
      Without the fix, a wrong address might be used for data cache page flush.
      No bug is reported. The fix comes from code inspection.
      
      Link: https://lkml.kernel.org/r/20230913201248.452081-6-zi.yan@sent.com
      Fixes: 15fa3e8e ("mips: implement the new page table range API")
      Signed-off-by: Zi Yan <ziy@nvidia.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      aa5fe31b
    • Zi Yan's avatar
      fs: use nth_page() in place of direct struct page manipulation · 8db0ec79
      Zi Yan authored
      When dealing with hugetlb pages, struct page is not guaranteed to be
      contiguous on SPARSEMEM without VMEMMAP.  Use nth_page() to handle it
      properly.
      
      Without the fix, a wrong subpage might be checked for HWPoison, causing a
      wrong number of bytes of a page to be copied to user space.  No bug is
      reported.  The fix comes from code inspection.
      
      Link: https://lkml.kernel.org/r/20230913201248.452081-5-zi.yan@sent.com
      Fixes: 38c1ddbd ("hugetlbfs: improve read HWPOISON hugepage")
      Signed-off-by: Zi Yan <ziy@nvidia.com>
      Reviewed-by: Muchun Song <songmuchun@bytedance.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      8db0ec79
    • Zi Yan's avatar
      mm/memory_hotplug: use pfn math in place of direct struct page manipulation · 1640a0ef
      Zi Yan authored
      When dealing with hugetlb pages, manipulating struct page pointers
      directly can get to wrong struct page, since struct page is not guaranteed
      to be contiguous on SPARSEMEM without VMEMMAP.  Use pfn calculation to
      handle it properly.
      
      Without the fix, a wrong number of pages might be skipped.  Since skip
      cannot be negative, scan_movable_pages() will end early and might miss a
      movable page, returning -ENOENT.  This might fail offline_pages().  No
      bug is reported.  The fix comes from code inspection.
      
      Link: https://lkml.kernel.org/r/20230913201248.452081-4-zi.yan@sent.com
      Fixes: eeb0efd0 ("mm,memory_hotplug: fix scan_movable_pages() for gigantic hugepages")
      Signed-off-by: Zi Yan <ziy@nvidia.com>
      Reviewed-by: Muchun Song <songmuchun@bytedance.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      1640a0ef
    • Zi Yan's avatar
      mm/hugetlb: use nth_page() in place of direct struct page manipulation · 426056ef
      Zi Yan authored
      When dealing with hugetlb pages, manipulating struct page pointers
      directly can get to wrong struct page, since struct page is not guaranteed
      to be contiguous on SPARSEMEM without VMEMMAP.  Use nth_page() to handle
      it properly.
      
      Without the fix, an attempt might be made to grab a wrong or
      non-existing page, leading either to a non-freeable page or to kernel
      memory access errors.  No bug is reported.  The fix comes from code
      inspection.
      
      Link: https://lkml.kernel.org/r/20230913201248.452081-3-zi.yan@sent.com
      Fixes: 57a196a5 ("hugetlb: simplify hugetlb handling in follow_page_mask")
      Signed-off-by: Zi Yan <ziy@nvidia.com>
      Reviewed-by: Muchun Song <songmuchun@bytedance.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      426056ef
    • Zi Yan's avatar
      mm/cma: use nth_page() in place of direct struct page manipulation · 2e7cfe5c
      Zi Yan authored
      Patch series "Use nth_page() in place of direct struct page manipulation",
      v3.
      
      On SPARSEMEM without VMEMMAP, struct page is not guaranteed to be
      contiguous, since each memory section's memmap might be allocated
      independently.  hugetlb pages can go beyond a memory section size, thus
      direct struct page manipulation on hugetlb pages/subpages might give wrong
      struct page.  Kernel provides nth_page() to do the manipulation properly. 
      Use that whenever code can see hugetlb pages.
      
      
      This patch (of 5):
      
      When dealing with hugetlb pages, manipulating struct page pointers
      directly can get to wrong struct page, since struct page is not guaranteed
      to be contiguous on SPARSEMEM without VMEMMAP.  Use nth_page() to handle
      it properly.
      
      Without the fix, page_kasan_tag_reset() could reset wrong page tags,
      causing a wrong kasan result.  No related bug is reported.  The fix
      comes from code inspection.
      
      Link: https://lkml.kernel.org/r/20230913201248.452081-1-zi.yan@sent.com
      Link: https://lkml.kernel.org/r/20230913201248.452081-2-zi.yan@sent.com
      Fixes: 2813b9c0 ("kasan, mm, arm64: tag non slab memory allocated via pagealloc")
      Signed-off-by: Zi Yan <ziy@nvidia.com>
      Reviewed-by: Muchun Song <songmuchun@bytedance.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      2e7cfe5c
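Why plain pointer arithmetic on struct page can go wrong on SPARSEMEM without VMEMMAP can be shown with a small userspace model. Everything here is an assumption made for illustration: the tiny `PAGES_PER_SECTION`, the `pfn` field inside `struct page`, and the `backing` array with a deliberate gap that stands in for independently allocated per-section memmaps. Only the shape of `nth_page()`, going through the pfn, is modelled on the kernel helper.

```c
#include <assert.h>
#include <stddef.h>

#define PAGES_PER_SECTION	4UL	/* tiny value, for illustration only */
#define NR_SECTIONS		2UL

/* Simplified struct page; storing the pfn is a modelling shortcut. */
struct page {
	unsigned long pfn;
};

/* One unused section-sized gap separates the two per-section memmaps. */
static struct page backing[3 * PAGES_PER_SECTION];
static struct page *section_memmap[NR_SECTIONS];

static struct page *pfn_to_page(unsigned long pfn)
{
	return &section_memmap[pfn / PAGES_PER_SECTION][pfn % PAGES_PER_SECTION];
}

static unsigned long page_to_pfn(const struct page *page)
{
	return page->pfn;
}

/* Step n pages ahead by going through the pfn, not the pointer. */
static struct page *nth_page(struct page *page, unsigned long n)
{
	return pfn_to_page(page_to_pfn(page) + n);
}

void init_memmap(void)
{
	section_memmap[0] = &backing[0];
	section_memmap[1] = &backing[2 * PAGES_PER_SECTION];
	for (unsigned long pfn = 0; pfn < NR_SECTIONS * PAGES_PER_SECTION; pfn++)
		pfn_to_page(pfn)->pfn = pfn;
}

/* Usage: crossing the section boundary from the first page. */
void demo_nth_page(void)
{
	struct page *head;

	init_memmap();
	head = pfn_to_page(0);

	/* Inside one section both approaches agree. */
	assert(nth_page(head, 2) == head + 2);
	/* Across the boundary only nth_page() finds pfn 4. */
	assert(page_to_pfn(nth_page(head, PAGES_PER_SECTION)) == 4);
	assert(head + PAGES_PER_SECTION != nth_page(head, PAGES_PER_SECTION));
}
```

In the model, `head + PAGES_PER_SECTION` lands in the gap rather than on the struct page for pfn 4, which mirrors the "wrong struct page" risk the series describes for hugetlb pages spanning more than one memory section.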
    • Vlastimil Babka's avatar
      mm, vmscan: remove ISOLATE_UNMAPPED · 3dfbb555
      Vlastimil Babka authored
      This isolate_mode_t flag is effectively unused since 89f6c88a ("mm:
      __isolate_lru_page_prepare() in isolate_migratepages_block()") as
      sc->may_unmap is now checked directly (and only node_reclaim has a mode
      that sets it to 0).  The last remaining place is mm_vmscan_lru_isolate
      tracepoint for the isolate_mode parameter.  That one was mainly used to
      indicate the active/inactive mode, which the trace-vmscan-postprocess.pl
      script consumed, but that got silently broken.  After fixing the script by
      the previous patch, it does not need the isolate_mode anymore.  So just
      remove the parameter and with that the whole ISOLATE_UNMAPPED flag.
      
      Link: https://lkml.kernel.org/r/20230914131637.12204-4-vbabka@suse.cz
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      3dfbb555
    • Vlastimil Babka's avatar
      trace-vmscan-postprocess: sync with tracepoints updates · 83121580
      Vlastimil Babka authored
      The script has fallen behind tracepoint changes for a while, fix it up.
      
      Most changes are mechanical (renames, removal of tracepoint parameters
      that are not used by the script).  More notable change involves
      mm_vmscan_lru_isolate which is relying on the isolate_mode to determine if
      the inactive list is being scanned.  However the parameter currently only
      indicates ISOLATE_UNMAPPED.  We can use the lru parameter instead to
      determine which list is scanned, and stop checking isolate_mode.
      
      Link: https://lkml.kernel.org/r/20230914131637.12204-3-vbabka@suse.cz
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      83121580
    • Matthew Wilcox (Oracle)'s avatar
      buffer: remove __getblk_gfp() · 93b13eca
      Matthew Wilcox (Oracle) authored
      Inline it into __bread_gfp().
      
      Link: https://lkml.kernel.org/r/20230914150011.843330-9-willy@infradead.org
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Hui Zhu <teawater@antgroup.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      93b13eca
    • Matthew Wilcox (Oracle)'s avatar
      ext4: call bdev_getblk() from sb_getblk_gfp() · 8a83ac54
      Matthew Wilcox (Oracle) authored
      Most of the callers of sb_getblk_gfp() already assumed that they were
      passing the entire GFP flags to use.  Fix up the two callers that didn't,
      and remove the __GFP_NOFAIL from them since they both appear to correctly
      handle failure.
      
      Link: https://lkml.kernel.org/r/20230914150011.843330-8-willy@infradead.org
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Hui Zhu <teawater@antgroup.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      8a83ac54
    • Matthew Wilcox (Oracle)'s avatar
      buffer: convert sb_getblk() to call __getblk() · 4b9c8b19
      Matthew Wilcox (Oracle) authored
      Now that __getblk() is in the right place in the file, it is trivial to
      call it from sb_getblk().
      
      Link: https://lkml.kernel.org/r/20230914150011.843330-7-willy@infradead.org
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Hui Zhu <teawater@antgroup.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      4b9c8b19
    • Matthew Wilcox (Oracle)'s avatar
      buffer: convert getblk_unmovable() and __getblk() to use bdev_getblk() · c645e65c
      Matthew Wilcox (Oracle) authored
      Move these two functions up in the file for the benefit of the next patch,
      and pass in all of the GFP flags to use instead of the partial GFP flags
      used by __getblk_gfp().
      
      Link: https://lkml.kernel.org/r/20230914150011.843330-6-willy@infradead.org
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Hui Zhu <teawater@antgroup.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      c645e65c
    • Matthew Wilcox (Oracle)'s avatar
      buffer: use bdev_getblk() to avoid memory reclaim in readahead path · 775d9b10
      Matthew Wilcox (Oracle) authored
      __getblk() adds __GFP_NOFAIL, which is unnecessary for readahead; we're
      quite comfortable with the possibility that we may not get a bh back. 
      Switch to bdev_getblk() which does not include __GFP_NOFAIL.
      
      Link: https://lkml.kernel.org/r/20230914150011.843330-5-willy@infradead.org
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Hui Zhu <teawater@antgroup.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      775d9b10
    • Matthew Wilcox (Oracle)'s avatar
      ext4: use bdev_getblk() to avoid memory reclaim in readahead path · e509ad4d
      Matthew Wilcox (Oracle) authored
      sb_getblk_gfp adds __GFP_NOFAIL, which is unnecessary for readahead; we're
      quite comfortable with the possibility that we may not get a bh back. 
      Switch to bdev_getblk() which does not include __GFP_NOFAIL.
      
      Link: https://lkml.kernel.org/r/20230914150011.843330-4-willy@infradead.org
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Reported-by: Hui Zhu <teawater@antgroup.com>
      Closes: https://lore.kernel.org/linux-fsdevel/20230811035705.3296-1-teawaterz@linux.alibaba.com/
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      e509ad4d
    • Matthew Wilcox (Oracle)'s avatar
      buffer: hoist GFP flags from grow_dev_page() to __getblk_gfp() · 3ed65f04
      Matthew Wilcox (Oracle) authored
      grow_dev_page() is only called by grow_buffers().  grow_buffers() is only
      called by __getblk_slow() and __getblk_slow() is only called from
      __getblk_gfp(), so it is safe to move the GFP flags setting all the way
      up.  With that done, add a new bdev_getblk() entry point that leaves the
      GFP flags the way the caller specified them.
      
      [willy@infradead.org: fix grow_dev_page() error handling]
        Link: https://lkml.kernel.org/r/ZRREEIwqiy5DijKB@casper.infradead.org
      Link: https://lkml.kernel.org/r/20230914150011.843330-3-willy@infradead.org
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Hui Zhu <teawater@antgroup.com>
      Cc: Dan Carpenter <dan.carpenter@linaro.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      3ed65f04
    • Matthew Wilcox (Oracle)'s avatar
      buffer: pass GFP flags to folio_alloc_buffers() · 2a418157
      Matthew Wilcox (Oracle) authored
      Patch series "Add and use bdev_getblk()", v2.
      
      This patch series fixes a bug reported by Hui Zhu; see proposed
      patches v1 and v2:
      https://lore.kernel.org/linux-fsdevel/20230811035705.3296-1-teawaterz@linux.alibaba.com/
      https://lore.kernel.org/linux-fsdevel/20230811071519.1094-1-teawaterz@linux.alibaba.com/
      
      I decided to go in a rather different direction for this fix, and fix a
      related problem at the same time.  I don't think there's any urgency to
      rush this into Linus' tree, nor have I marked it for stable.  Reasonable
      people may disagree.
      
      
      This patch (of 8):
      
      Instead of creating entirely new flags, inherit them from grow_dev_page().
      The other callers create the same flags that this function used to
      create.
      
      Link: https://lkml.kernel.org/r/20230914150011.843330-1-willy@infradead.org
      Link: https://lkml.kernel.org/r/20230914150011.843330-2-willy@infradead.org
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Hui Zhu <teawater@antgroup.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      2a418157
    • SeongJae Park's avatar
      Docs/admin-guide/mm/damon/usage: document damos_before_apply tracepoint · 1b2b7a17
      SeongJae Park authored
      Document damos_before_apply tracepoint on the usage document.
      
      Link: https://lkml.kernel.org/r/20230913022050.2109-3-sj@kernel.org
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      1b2b7a17
    • SeongJae Park's avatar
      mm/damon/core: add a tracepoint for damos apply target regions · c603c630
      SeongJae Park authored
      Patch series "mm/damon: add a tracepoint for damos apply target regions",
      v2.
      
      DAMON provides damon_aggregated tracepoint to let users record full
      monitoring results.  Sometimes, users need to record monitoring results of
      specific pattern.  DAMOS tried regions directory of DAMON sysfs interface
      allows it, but the interface is mainly designed for snapshots and
      therefore would be inefficient for such recording.  Implement yet another
      tracepoint for efficient support of the usecase.
      
      
      This patch (of 2):
      
      DAMON provides damon_aggregated tracepoint, which exposes details of each
      region and its access monitoring results.  It is useful for getting whole
      monitoring results, e.g., for recording purposes.
      
      For investigations of DAMOS, the DAMON Sysfs interface provides DAMOS
      statistics and the tried_regions directory.  But those provide only
      statistics and snapshots.  If the scheme is frequently applied and if the
      user needs to know every detail of DAMOS behavior, the snapshot-based
      interface could be insufficient and expensive.
      
      As a last resort, userspace users need to record all the monitoring
      results via damon_aggregated tracepoint and simulate how DAMOS would
      have worked.  That is unnecessarily complicated.  DAMON kernel API
      users, meanwhile, can do that easily via the before_damos_apply()
      callback field of 'struct damon_callback'.
      
      Add a tracepoint that will be called just after before_damos_apply()
      callback for more convenient investigations of DAMOS.  The tracepoint
      exposes all details about each region, similar to damon_aggregated
      tracepoint.
      
      Please note that DAMOS is currently not only for memory management but
      also for query-like efficient monitoring results retrievals (when 'stat'
      action is used).  Until now, only statistics or snapshots were supported. 
      Addition of this tracepoint allows efficient full recording of DAMOS-based
      filtered monitoring results.
      
      Link: https://lkml.kernel.org/r/20230913022050.2109-1-sj@kernel.org
      Link: https://lkml.kernel.org/r/20230913022050.2109-2-sj@kernel.org
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>	[tracing]
      Cc: Jonathan Corbet <corbet@lwn.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      c603c630
    • Kefeng Wang's avatar
      mm: migrate: remove isolated variable in add_page_for_migration() · fa1df3f6
      Kefeng Wang authored
      Directly check the return values of isolate_hugetlb() and
      folio_isolate_lru() to remove the isolated variable.  Also set
      err = -EBUSY in advance before isolation, and update err only when
      successfully queued for migration, which helps to unify and simplify the
      code a bit.
      
      Link: https://lkml.kernel.org/r/20230913095131.2426871-9-wangkefeng.wang@huawei.com
      Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Reviewed-by: Zi Yan <ziy@nvidia.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      fa1df3f6
    • Kefeng Wang's avatar
      mm: migrate: remove PageHead() check for HugeTLB in add_page_for_migration() · b426ed78
      Kefeng Wang authored
      HugeTLB and THP behave differently when passed the address of a tail
      page: for THP, the entire THP page is migrated, but for HugeTLB,
      -EACCES is returned (or -ENOENT before commit e66f17ff
      ("mm/hugetlb: take page table lock in follow_huge_pmd()")):
      
        -EACCES The page is mapped by multiple processes and can be moved
      	  only if MPOL_MF_MOVE_ALL is specified.
        -ENOENT The page is not present.
      
      But according to the manual[1], neither of the two errnos is suitable.
      It is better to keep the same behavior between HugeTLB and THP when
      passed the address of a tail page, so just remove the PageHead() check
      for HugeTLB.
      
      [1] https://man7.org/linux/man-pages/man2/move_pages.2.html
      
      Link: https://lkml.kernel.org/r/20230913095131.2426871-8-wangkefeng.wang@huawei.com
      Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Suggested-by: Mike Kravetz <mike.kravetz@oracle.com>
      Acked-by: Zi Yan <ziy@nvidia.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      b426ed78
    • Kefeng Wang's avatar
      mm: migrate: use a folio in add_page_for_migration() · d64cfccb
      Kefeng Wang authored
      Use a folio in add_page_for_migration() to save compound_head() calls.
      
      Link: https://lkml.kernel.org/r/20230913095131.2426871-7-wangkefeng.wang@huawei.com
      Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Reviewed-by: Zi Yan <ziy@nvidia.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      d64cfccb
    • Kefeng Wang's avatar
      mm: migrate: use __folio_test_movable() · 7e2a5e5a
      Kefeng Wang authored
      Use __folio_test_movable(), no need to convert from folio to page again.
      
      Link: https://lkml.kernel.org/r/20230913095131.2426871-6-wangkefeng.wang@huawei.com
      Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Zi Yan <ziy@nvidia.com>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      7e2a5e5a
    • Kefeng Wang's avatar
      mm: migrate: convert migrate_misplaced_page() to migrate_misplaced_folio() · 73eab3ca
      Kefeng Wang authored
      At present, NUMA balancing only supports base pages and PMD-mapped THP,
      but we will expand it to support migrating large folios/pte-mapped THP
      in the future.  It is better to make migrate_misplaced_page() take a
      folio instead of a page, and rename it to migrate_misplaced_folio().
      This is a preparation, and it also removes several compound_head()
      calls.
      
      Link: https://lkml.kernel.org/r/20230913095131.2426871-5-wangkefeng.wang@huawei.com
      Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Reviewed-by: Zi Yan <ziy@nvidia.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      73eab3ca
    • Kefeng Wang's avatar
      mm: migrate: convert numamigrate_isolate_page() to numamigrate_isolate_folio() · 2ac9e99f
      Kefeng Wang authored
      Rename numamigrate_isolate_page() to numamigrate_isolate_folio(), then
      make it take a folio and use the folio API to save compound_head() calls.
      
      Link: https://lkml.kernel.org/r/20230913095131.2426871-4-wangkefeng.wang@huawei.com
      Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Reviewed-by: Zi Yan <ziy@nvidia.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      2ac9e99f
    • Kefeng Wang's avatar
      mm: migrate: remove THP mapcount check in numamigrate_isolate_page() · 728be28f
      Kefeng Wang authored
      The check for a THP mapped by multiple processes was introduced by commit
      04fa5d6a ("mm: migrate: check page_count of THP before migrating") and
      refactored by commit 340ef390 ("mm: numa: cleanup flow of transhuge page
      migration").  It is now out of date: migrate_misplaced_page() uses the
      standard migrate_pages() for both small pages and THPs, and the reference
      count is checked in folio_migrate_mapping().  So let's remove the special
      check for THP.
      
      Link: https://lkml.kernel.org/r/20230913095131.2426871-3-wangkefeng.wang@huawei.com
      Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Suggested-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      728be28f
    • Kefeng Wang's avatar
      mm: migrate: remove PageTransHuge check in numamigrate_isolate_page() · a8ac4a76
      Kefeng Wang authored
      Patch series "mm: migrate: more folio conversion and unification", v3.
      
      Convert more migrate functions to use a folio.  This is also preparation
      for large folio migration support in NUMA balancing.
      
      
      This patch (of 8):
      
      The assertion VM_BUG_ON_PAGE(order && !PageTransHuge(page), page) is not
      very useful:
      
         1) for a tail/base page, order = 0; for a head page, order > 0 and
            PageTransHuge() is true
         2) there is a PageCompound() check and only base pages are handled in
            do_numa_page(), and do_huge_pmd_numa_page() only handles
            PMD-mapped THPs
         3) even if the page is a tail page, isolate_lru_page() will post
            a warning and fail to isolate the page
         4) if large folio/PTE-mapped THP migration is supported in the
            future, we could migrate the entire folio when a NUMA fault hits
            a tail page
      
      So just remove the check.
      
      Link: https://lkml.kernel.org/r/20230913095131.2426871-1-wangkefeng.wang@huawei.com
      Link: https://lkml.kernel.org/r/20230913095131.2426871-2-wangkefeng.wang@huawei.com
      Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Suggested-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      a8ac4a76
    • David Hildenbrand's avatar
      mm/rmap: pass folio to hugepage_add_anon_rmap() · 09c55050
      David Hildenbrand authored
      Let's pass a folio; we are always mapping the entire thing.
      
      Link: https://lkml.kernel.org/r/20230913125113.313322-7-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Muchun Song <muchun.song@linux.dev>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      09c55050
    • David Hildenbrand's avatar
      mm/rmap: simplify PageAnonExclusive sanity checks when adding anon rmap · 132b180f
      David Hildenbrand authored
      Let's sanity-check PageAnonExclusive vs.  mapcount in page_add_anon_rmap()
      and hugepage_add_anon_rmap() after setting PageAnonExclusive simply by
      re-reading the mapcounts.
      
      We can stop initializing the "first" variable in page_add_anon_rmap() and
      no longer need an atomic_inc_and_test() in hugepage_add_anon_rmap().
      
      While at it, switch to VM_WARN_ON_FOLIO().
      
      [david@redhat.com: update check for doubly-mapped page]
        Link: https://lkml.kernel.org/r/d8e5a093-2e22-c14b-7e64-6da280398d9f@redhat.com
      Link: https://lkml.kernel.org/r/20230913125113.313322-6-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Muchun Song <muchun.song@linux.dev>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      132b180f
    • David Hildenbrand's avatar
      mm/rmap: warn on new PTE-mapped folios in page_add_anon_rmap() · a1f34ee1
      David Hildenbrand authored
      If swapin code would ever decide to not use order-0 pages and supply a
      PTE-mapped large folio, we will have to change how we call
      __folio_set_anon() -- eventually with exclusive=false and an adjusted
      address.  For now, let's add a VM_WARN_ON_FOLIO() with a comment about the
      situation.
      
      Link: https://lkml.kernel.org/r/20230913125113.313322-5-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Muchun Song <muchun.song@linux.dev>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      a1f34ee1
    • David Hildenbrand's avatar
      mm/rmap: move folio_test_anon() check out of __folio_set_anon() · c5c54003
      David Hildenbrand authored
      Let's handle it in the caller; no need for the "first" check based on the
      mapcount.
      
      We really only end up with !anon pages in page_add_anon_rmap() via
      do_swap_page(), where we hold the folio lock.  So races are not possible. 
      Add a VM_WARN_ON_FOLIO() to make sure that we really hold the folio lock.
      
      In the future, we might want to let do_swap_page() use
      folio_add_new_anon_rmap() on new pages instead: however, we might have to
      pass then whether the folio is exclusive or not.  So keep it in there for
      now.
      
      For hugetlb we never expect to have a non-anon page in
      hugepage_add_anon_rmap().  Remove that code, along with some other checks
      that are either not required or were checked in
      hugepage_add_new_anon_rmap() already.
      
      Link: https://lkml.kernel.org/r/20230913125113.313322-4-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Muchun Song <muchun.song@linux.dev>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      c5c54003
    • David Hildenbrand's avatar
      mm/rmap: move SetPageAnonExclusive out of __page_set_anon_rmap() · c66db8c0
      David Hildenbrand authored
      Let's handle it in the caller.  No need to pass the page.  While at it,
      rename the function to __folio_set_anon() and pass "bool exclusive"
      instead of "int exclusive".
      
      Link: https://lkml.kernel.org/r/20230913125113.313322-3-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Muchun Song <muchun.song@linux.dev>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      c66db8c0
    • David Hildenbrand's avatar
      mm/rmap: drop stale comment in page_add_anon_rmap and hugepage_add_anon_rmap() · fd639087
      David Hildenbrand authored
      Patch series "Anon rmap cleanups".
      
      Some cleanups around rmap for anon pages.  I'm working on more cleanups
      also around file rmap -- also to handle the "compound" parameter
      internally only and to let hugetlb use page_add_file_rmap(), but these
      changes make sense separately.
      
      
      This patch (of 6):
      
      That comment was added in commit 5dbe0af4 ("mm: fix kernel BUG at
      mm/rmap.c:1017!") to document why we can see vma->vm_end getting adjusted
      concurrently due to a VMA split.
      
      However, the optimized locking code was changed again in bf181b9f ("mm
      anon rmap: replace same_anon_vma linked list with an interval tree.").
      
      ...  and later, the comment was changed in commit 0503ea8f ("mm/mmap:
      remove __vma_adjust()") to talk about "vma_merge" although the original
      issue was with VMA splitting.
      
      Let's just remove that comment.  Nowadays, it's outdated, imprecise and
      confusing.
      
      Link: https://lkml.kernel.org/r/20230913125113.313322-1-david@redhat.com
      Link: https://lkml.kernel.org/r/20230913125113.313322-2-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Matthew Wilcox <willy@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      fd639087
    • Xin Hao's avatar
      mm: memcg: add THP swap out info for anonymous reclaim · 811244a5
      Xin Hao authored
      At present, we support per-memcg reclaim, but we do not know how many
      transparent huge pages (THPs) are being reclaimed.  THPs need to be split
      before they are reclaimed, which can become a performance bottleneck.
      For example, when two memcgs (A and B) are reclaiming anonymous pages at
      the same time and memcg A is reclaiming a large number of THPs, this
      information lets us attribute the bottleneck to memcg A.  Therefore, to
      make such problems easier to analyze, add THP swap-out info per memcg.
      
      [akpm@linux-foundation.org: fix swap_writepage_fs(), per Johannes]
        Link: https://lkml.kernel.org/r/20230913213343.GB48476@cmpxchg.org
      Link: https://lkml.kernel.org/r/20230913164938.16918-1-vernhao@tencent.com
      Signed-off-by: Xin Hao <vernhao@tencent.com>
      Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      811244a5
    • liujinlong's avatar
      mm: vmscan: modify an easily misunderstood function name · ed547ab6
      liujinlong authored
      While reading the memory management code, I found that the purpose of
      the function prepare_scan_count is very different from what its name
      suggests, which makes it easy to misunderstand.  The function mainly
      fills in the scan_control structure.  Therefore, I suggest renaming it
      to prepare_scan_control, which is easier to understand.
      
      Link: https://lkml.kernel.org/r/20230912085923.27238-1-liujinlong@kylinos.cn
      Signed-off-by: liujinlong <liujinlong@kylinos.cn>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      ed547ab6
    • Qi Zheng's avatar
      mm: shrinker: convert shrinker_rwsem to mutex · 8a0e8bb1
      Qi Zheng authored
      Now there are no readers of shrinker_rwsem, so we can simply replace it
      with a mutex.
      
      [akpm@linux-foundation.org: update the fix to alloc_shrinker_info()]
      Link: https://lkml.kernel.org/r/20230911094444.68966-46-zhengqi.arch@bytedance.com
      Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
      Reviewed-by: Muchun Song <songmuchun@bytedance.com>
      Cc: Abhinav Kumar <quic_abhinavk@quicinc.com>
      Cc: Alasdair Kergon <agk@redhat.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Alyssa Rosenzweig <alyssa.rosenzweig@collabora.com>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: Andreas Gruenbacher <agruenba@redhat.com>
      Cc: Anna Schumaker <anna@kernel.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Bob Peterson <rpeterso@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Carlos Llamas <cmllamas@google.com>
      Cc: Chandan Babu R <chandan.babu@oracle.com>
      Cc: Chao Yu <chao@kernel.org>
      Cc: Chris Mason <clm@fb.com>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: Christian Koenig <christian.koenig@amd.com>
      Cc: Chuck Lever <cel@kernel.org>
      Cc: Coly Li <colyli@suse.de>
      Cc: Dai Ngo <Dai.Ngo@oracle.com>
      Cc: Daniel Vetter <daniel@ffwll.ch>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: "Darrick J. Wong" <djwong@kernel.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Airlie <airlied@gmail.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Sterba <dsterba@suse.com>
      Cc: Dmitry Baryshkov <dmitry.baryshkov@linaro.org>
      Cc: Gao Xiang <hsiangkao@linux.alibaba.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Huang Rui <ray.huang@amd.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jaegeuk Kim <jaegeuk@kernel.org>
      Cc: Jani Nikula <jani.nikula@linux.intel.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jason Wang <jasowang@redhat.com>
      Cc: Jeff Layton <jlayton@kernel.org>
      Cc: Jeffle Xu <jefflexu@linux.alibaba.com>
      Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
      Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
      Cc: Josef Bacik <josef@toxicpanda.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Kent Overstreet <kent.overstreet@gmail.com>
      Cc: Kirill Tkhai <tkhai@ya.ru>
      Cc: Marijn Suijten <marijn.suijten@somainline.org>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: Mike Snitzer <snitzer@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
      Cc: Olga Kornievskaia <kolga@netapp.com>
      Cc: Paul E. McKenney <paulmck@kernel.org>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Rob Clark <robdclark@gmail.com>
      Cc: Rob Herring <robh@kernel.org>
      Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Sean Paul <sean@poorly.run>
      Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
      Cc: Song Liu <song@kernel.org>
      Cc: Stefano Stabellini <sstabellini@kernel.org>
      Cc: Steven Price <steven.price@arm.com>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tomeu Vizoso <tomeu.vizoso@collabora.com>
      Cc: Tom Talpey <tom@talpey.com>
      Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
      Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
      Cc: Yue Hu <huyue2@coolpad.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      8a0e8bb1
    • Qi Zheng's avatar
      mm: shrinker: hold write lock to reparent shrinker nr_deferred · 604b8b65
      Qi Zheng authored
      For now, reparent_shrinker_deferred() is the only holder of the read lock
      of shrinker_rwsem.  It already holds the global cgroup_mutex, so it will
      not be called in parallel.
      
      Therefore, in order to convert shrinker_rwsem to shrinker_mutex later,
      change it to hold the write lock of shrinker_rwsem when reparenting.
      
      Link: https://lkml.kernel.org/r/20230911094444.68966-45-zhengqi.arch@bytedance.com
      Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
      Reviewed-by: Muchun Song <songmuchun@bytedance.com>
      Cc: Abhinav Kumar <quic_abhinavk@quicinc.com>
      Cc: Alasdair Kergon <agk@redhat.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Alyssa Rosenzweig <alyssa.rosenzweig@collabora.com>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: Andreas Gruenbacher <agruenba@redhat.com>
      Cc: Anna Schumaker <anna@kernel.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Bob Peterson <rpeterso@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Carlos Llamas <cmllamas@google.com>
      Cc: Chandan Babu R <chandan.babu@oracle.com>
      Cc: Chao Yu <chao@kernel.org>
      Cc: Chris Mason <clm@fb.com>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: Christian Koenig <christian.koenig@amd.com>
      Cc: Chuck Lever <cel@kernel.org>
      Cc: Coly Li <colyli@suse.de>
      Cc: Dai Ngo <Dai.Ngo@oracle.com>
      Cc: Daniel Vetter <daniel@ffwll.ch>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: "Darrick J. Wong" <djwong@kernel.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Airlie <airlied@gmail.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Sterba <dsterba@suse.com>
      Cc: Dmitry Baryshkov <dmitry.baryshkov@linaro.org>
      Cc: Gao Xiang <hsiangkao@linux.alibaba.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Huang Rui <ray.huang@amd.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jaegeuk Kim <jaegeuk@kernel.org>
      Cc: Jani Nikula <jani.nikula@linux.intel.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jason Wang <jasowang@redhat.com>
      Cc: Jeff Layton <jlayton@kernel.org>
      Cc: Jeffle Xu <jefflexu@linux.alibaba.com>
      Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
      Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
      Cc: Josef Bacik <josef@toxicpanda.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Kent Overstreet <kent.overstreet@gmail.com>
      Cc: Kirill Tkhai <tkhai@ya.ru>
      Cc: Marijn Suijten <marijn.suijten@somainline.org>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: Mike Snitzer <snitzer@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
      Cc: Olga Kornievskaia <kolga@netapp.com>
      Cc: Paul E. McKenney <paulmck@kernel.org>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Rob Clark <robdclark@gmail.com>
      Cc: Rob Herring <robh@kernel.org>
      Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Sean Paul <sean@poorly.run>
      Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
      Cc: Song Liu <song@kernel.org>
      Cc: Stefano Stabellini <sstabellini@kernel.org>
      Cc: Steven Price <steven.price@arm.com>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tomeu Vizoso <tomeu.vizoso@collabora.com>
      Cc: Tom Talpey <tom@talpey.com>
      Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
      Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
      Cc: Yue Hu <huyue2@coolpad.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      604b8b65
    • Qi Zheng's avatar
      mm: shrinker: make memcg slab shrink lockless · 50d09da8
      Qi Zheng authored
      Like the global slab shrink, this commit also uses the refcount+RCU
      method to make memcg slab shrink lockless.
      
      Use the following script to run a slab shrink stress test:
      
      ```
      
      DIR="/root/shrinker/memcg/mnt"
      
      do_create()
      {
          mkdir -p /sys/fs/cgroup/memory/test
          echo 4G > /sys/fs/cgroup/memory/test/memory.limit_in_bytes
          for i in `seq 0 $1`;
          do
              mkdir -p /sys/fs/cgroup/memory/test/$i;
              echo $$ > /sys/fs/cgroup/memory/test/$i/cgroup.procs;
              mkdir -p $DIR/$i;
          done
      }
      
      do_mount()
      {
          for i in `seq $1 $2`;
          do
              mount -t tmpfs $i $DIR/$i;
          done
      }
      
      do_touch()
      {
          for i in `seq $1 $2`;
          do
              echo $$ > /sys/fs/cgroup/memory/test/$i/cgroup.procs;
              dd if=/dev/zero of=$DIR/$i/file$i bs=1M count=1 &
          done
      }
      
      case "$1" in
        touch)
          do_touch $2 $3
          ;;
        test)
          do_create 4000
          do_mount 0 4000
          do_touch 0 3000
          ;;
        *)
          exit 1
          ;;
      esac
      ```
      
      Save the above script, then run its test and touch commands.  We can then
      use the following perf command to view hotspots:
      
      perf top -U -F 999
      
      1) Before applying this patchset:
      
        33.15%  [kernel]          [k] down_read_trylock
        25.38%  [kernel]          [k] shrink_slab
        21.75%  [kernel]          [k] up_read
         4.45%  [kernel]          [k] _find_next_bit
         2.27%  [kernel]          [k] do_shrink_slab
         1.80%  [kernel]          [k] intel_idle_irq
         1.79%  [kernel]          [k] shrink_lruvec
         0.67%  [kernel]          [k] xas_descend
         0.41%  [kernel]          [k] mem_cgroup_iter
         0.40%  [kernel]          [k] shrink_node
         0.38%  [kernel]          [k] list_lru_count_one
      
      2) After applying this patchset:
      
        64.56%  [kernel]          [k] shrink_slab
        12.18%  [kernel]          [k] do_shrink_slab
         3.30%  [kernel]          [k] __rcu_read_unlock
         2.61%  [kernel]          [k] shrink_lruvec
         2.49%  [kernel]          [k] __rcu_read_lock
         1.93%  [kernel]          [k] intel_idle_irq
         0.89%  [kernel]          [k] shrink_node
         0.81%  [kernel]          [k] mem_cgroup_iter
         0.77%  [kernel]          [k] mem_cgroup_calculate_protection
         0.66%  [kernel]          [k] list_lru_count_one
      
      We can see that the first perf hotspot becomes shrink_slab, which is what
      we expect.
      
      Link: https://lkml.kernel.org/r/20230911094444.68966-44-zhengqi.arch@bytedance.com
      Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
      Cc: Abhinav Kumar <quic_abhinavk@quicinc.com>
      Cc: Alasdair Kergon <agk@redhat.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Alyssa Rosenzweig <alyssa.rosenzweig@collabora.com>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: Andreas Gruenbacher <agruenba@redhat.com>
      Cc: Anna Schumaker <anna@kernel.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Bob Peterson <rpeterso@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Carlos Llamas <cmllamas@google.com>
      Cc: Chandan Babu R <chandan.babu@oracle.com>
      Cc: Chao Yu <chao@kernel.org>
      Cc: Chris Mason <clm@fb.com>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: Christian Koenig <christian.koenig@amd.com>
      Cc: Chuck Lever <cel@kernel.org>
      Cc: Coly Li <colyli@suse.de>
      Cc: Dai Ngo <Dai.Ngo@oracle.com>
      Cc: Daniel Vetter <daniel@ffwll.ch>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: "Darrick J. Wong" <djwong@kernel.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Airlie <airlied@gmail.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Sterba <dsterba@suse.com>
      Cc: Dmitry Baryshkov <dmitry.baryshkov@linaro.org>
      Cc: Gao Xiang <hsiangkao@linux.alibaba.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Huang Rui <ray.huang@amd.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jaegeuk Kim <jaegeuk@kernel.org>
      Cc: Jani Nikula <jani.nikula@linux.intel.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jason Wang <jasowang@redhat.com>
      Cc: Jeff Layton <jlayton@kernel.org>
      Cc: Jeffle Xu <jefflexu@linux.alibaba.com>
      Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
      Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
      Cc: Josef Bacik <josef@toxicpanda.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Kent Overstreet <kent.overstreet@gmail.com>
      Cc: Kirill Tkhai <tkhai@ya.ru>
      Cc: Marijn Suijten <marijn.suijten@somainline.org>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: Mike Snitzer <snitzer@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
      Cc: Olga Kornievskaia <kolga@netapp.com>
      Cc: Paul E. McKenney <paulmck@kernel.org>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Rob Clark <robdclark@gmail.com>
      Cc: Rob Herring <robh@kernel.org>
      Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Sean Paul <sean@poorly.run>
      Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
      Cc: Song Liu <song@kernel.org>
      Cc: Stefano Stabellini <sstabellini@kernel.org>
      Cc: Steven Price <steven.price@arm.com>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tomeu Vizoso <tomeu.vizoso@collabora.com>
      Cc: Tom Talpey <tom@talpey.com>
      Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
      Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
      Cc: Yue Hu <huyue2@coolpad.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      50d09da8
    • Qi Zheng's avatar
      mm: shrinker: make global slab shrink lockless · ca1d36b8
      Qi Zheng authored
      The shrinker_rwsem is a global read-write lock in the shrinkers
      subsystem, which protects most operations such as slab shrink,
      registration and unregistration of shrinkers, etc.  This can easily cause
      problems in the following cases.
      
      1) When memory pressure is high and many filesystems are being mounted
         or unmounted at the same time, slab shrink will be affected
         (down_read_trylock() fails).
      
         One example is the real workload described by Kirill Tkhai:
      
         ```
         One of the real workloads from my experience is start
         of an overcommitted node containing many starting
         containers after node crash (or many resuming containers
         after reboot for kernel update). In these cases memory
         pressure is huge, and the node goes round in long reclaim.
         ```
      
      2) If a shrinker is blocked (such as the case mentioned
         in [1]) and a writer comes in (such as mounting a fs),
         then this writer will be blocked and cause all
         subsequent shrinker-related operations to be blocked.
      
      Even if there is no competitor when shrinking slab, there may still be a
      problem. The down_read_trylock() may become a perf hotspot with frequent
      calls to shrink_slab(). Because of the poor multicore scalability of
      atomic operations, this can lead to a significant drop in IPC
      (instructions per cycle).
      
      We previously implemented the lockless slab shrink with SRCU [2], but the
      kernel test robot then reported a -88.8% regression in the
      stress-ng.ramfs.ops_per_sec test case [3], so we reverted it [4].
      
      This commit uses the refcount+RCU method [5] proposed by Dave Chinner
      to re-implement the lockless global slab shrink. The memcg slab shrink is
      handled in the subsequent patch.
      
      All shrinker instances have now been converted to dynamic allocation and
      are freed by call_rcu().  So we can use rcu_read_{lock,unlock}() to
      ensure that the shrinker instance is valid.
      
      And the shrinker instance will not be run again after unregistration. So
      the structure that records the pointer of shrinker instance can be safely
      freed without waiting for the RCU read-side critical section.
      
      In this way, while we implement the lockless slab shrink, we don't need to
      be blocked in unregister_shrinker().
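
      The refcount half of the scheme can be sketched in userspace with C11
      atomics.  This is only an illustration under stated assumptions: the
      toy_shrinker type and the toy_shrinker_*() helpers are hypothetical
      names, RCU itself is not modelled, and the "released" flag merely stands
      in for whatever the kernel does once the last reference goes away.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Toy stand-in for a refcounted shrinker; NOT the kernel structure. */
struct toy_shrinker {
    atomic_int refcount;  /* 0 means unregistered, no new users allowed */
    bool released;        /* set when the last reference is dropped */
};

static void toy_shrinker_init(struct toy_shrinker *s)
{
    atomic_init(&s->refcount, 1);  /* initial reference held by registration */
    s->released = false;
}

/* Take a reference only if the object is still live (count > 0). */
static bool toy_shrinker_try_get(struct toy_shrinker *s)
{
    int old = atomic_load(&s->refcount);

    while (old > 0) {
        /* On failure, 'old' is reloaded and the loop re-checks it. */
        if (atomic_compare_exchange_weak(&s->refcount, &old, old + 1))
            return true;
    }
    return false;
}

/* Drop a reference; the caller that drops the last one marks the release. */
static void toy_shrinker_put(struct toy_shrinker *s)
{
    int old = atomic_fetch_sub(&s->refcount, 1);

    assert(old > 0);
    if (old == 1)
        s->released = true;  /* real code would defer the free here */
}

/* Unregistration drops the initial reference without waiting for users. */
static void toy_shrinker_unregister(struct toy_shrinker *s)
{
    toy_shrinker_put(s);
}
```

      In this model a walker that already holds a reference keeps the object
      alive across unregistration, and every toy_shrinker_try_get() issued
      after the count reaches zero fails, so the instance is not run again.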
      
      The following are the test results:
      
      stress-ng --timeout 60 --times --verify --metrics-brief --ramfs 9 &
      
      1) Before applying this patchset:
      
      setting to a 60 second run per stressor
      dispatching hogs: 9 ramfs
      stressor       bogo ops real time  usr time  sys time   bogo ops/s     bogo ops/s
                                (secs)    (secs)    (secs)   (real time) (usr+sys time)
      ramfs            473062     60.00      8.00    279.13      7884.12        1647.59
      for a 60.01s run time:
         1440.34s available CPU time
            7.99s user time   (  0.55%)
          279.13s system time ( 19.38%)
          287.12s total time  ( 19.93%)
      load average: 7.12 2.99 1.15
      successful run completed in 60.01s (1 min, 0.01 secs)
      
      2) After applying this patchset:
      
      setting to a 60 second run per stressor
      dispatching hogs: 9 ramfs
      stressor       bogo ops real time  usr time  sys time   bogo ops/s     bogo ops/s
                                (secs)    (secs)    (secs)   (real time) (usr+sys time)
      ramfs            477165     60.00      8.13    281.34      7952.55        1648.40
      for a 60.01s run time:
         1440.33s available CPU time
            8.12s user time   (  0.56%)
          281.34s system time ( 19.53%)
          289.46s total time  ( 20.10%)
      load average: 6.98 3.03 1.19
      successful run completed in 60.01s (1 min, 0.01 secs)
      
      We can see that the ops/s has hardly changed.
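
      The size of the difference can be checked with a small helper.  This is
      a sketch, and relative_change_percent() is a hypothetical name used only
      here.  Applied to the bogo ops/s (real time) figures above, 7884.12
      before and 7952.55 after, it gives a change of roughly 0.87%.

```c
#include <assert.h>

/* Relative change from 'before' to 'after', expressed in percent. */
static double relative_change_percent(double before, double after)
{
    assert(before != 0.0);  /* a zero baseline has no relative change */
    return (after - before) / before * 100.0;
}
```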
      
      [1]. https://lore.kernel.org/lkml/20191129214541.3110-1-ptikhomirov@virtuozzo.com/
      [2]. https://lore.kernel.org/lkml/20230313112819.38938-1-zhengqi.arch@bytedance.com/
      [3]. https://lore.kernel.org/lkml/202305230837.db2c233f-yujie.liu@intel.com/
      [4]. https://lore.kernel.org/all/20230609081518.3039120-1-qi.zheng@linux.dev/
      [5]. https://lore.kernel.org/lkml/ZIJhou1d55d4H1s0@dread.disaster.area/
      
      Link: https://lkml.kernel.org/r/20230911094444.68966-43-zhengqi.arch@bytedance.com
      Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
      Cc: Abhinav Kumar <quic_abhinavk@quicinc.com>
      Cc: Alasdair Kergon <agk@redhat.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Alyssa Rosenzweig <alyssa.rosenzweig@collabora.com>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: Andreas Gruenbacher <agruenba@redhat.com>
      Cc: Anna Schumaker <anna@kernel.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Bob Peterson <rpeterso@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Carlos Llamas <cmllamas@google.com>
      Cc: Chandan Babu R <chandan.babu@oracle.com>
      Cc: Chao Yu <chao@kernel.org>
      Cc: Chris Mason <clm@fb.com>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: Christian Koenig <christian.koenig@amd.com>
      Cc: Chuck Lever <cel@kernel.org>
      Cc: Coly Li <colyli@suse.de>
      Cc: Dai Ngo <Dai.Ngo@oracle.com>
      Cc: Daniel Vetter <daniel@ffwll.ch>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: "Darrick J. Wong" <djwong@kernel.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Airlie <airlied@gmail.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Sterba <dsterba@suse.com>
      Cc: Dmitry Baryshkov <dmitry.baryshkov@linaro.org>
      Cc: Gao Xiang <hsiangkao@linux.alibaba.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Huang Rui <ray.huang@amd.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jaegeuk Kim <jaegeuk@kernel.org>
      Cc: Jani Nikula <jani.nikula@linux.intel.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jason Wang <jasowang@redhat.com>
      Cc: Jeff Layton <jlayton@kernel.org>
      Cc: Jeffle Xu <jefflexu@linux.alibaba.com>
      Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
      Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
      Cc: Josef Bacik <josef@toxicpanda.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Kent Overstreet <kent.overstreet@gmail.com>
      Cc: Kirill Tkhai <tkhai@ya.ru>
      Cc: Marijn Suijten <marijn.suijten@somainline.org>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: Mike Snitzer <snitzer@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
      Cc: Olga Kornievskaia <kolga@netapp.com>
      Cc: Paul E. McKenney <paulmck@kernel.org>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Rob Clark <robdclark@gmail.com>
      Cc: Rob Herring <robh@kernel.org>
      Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Sean Paul <sean@poorly.run>
      Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
      Cc: Song Liu <song@kernel.org>
      Cc: Stefano Stabellini <sstabellini@kernel.org>
      Cc: Steven Price <steven.price@arm.com>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tomeu Vizoso <tomeu.vizoso@collabora.com>
      Cc: Tom Talpey <tom@talpey.com>
      Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
      Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
      Cc: Yue Hu <huyue2@coolpad.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      ca1d36b8