1. 06 Nov, 2021 40 commits
    • mm/damon/dbgfs: support quotas of schemes · d7d0ec85
      SeongJae Park authored
      This makes the debugfs interface of DAMON support the scheme quotas by
      changing the format of the input for the schemes file.
      
      Link: https://lkml.kernel.org/r/20211019150731.16699-6-sj@kernel.org
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Cc: Amit Shah <amit@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: David Woodhouse <dwmw@amazon.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Leonard Foerster <foersleo@amazon.de>
      Cc: Marco Elver <elver@google.com>
      Cc: Markus Boehme <markubo@amazon.de>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/damon/schemes: implement time quota · 1cd24303
      SeongJae Park authored
      The size quota feature of DAMOS is useful for IO resource-critical
      systems, but not so intuitive for CPU time-critical systems.  Systems
      using zram or zswap-like swap devices would be examples.

      To provide another intuitive way for such systems, this implements a
      time-based quota for DAMON-based Operation Schemes.  If the quota is
      set, DAMOS tries to use only up to the user-defined quota of CPU time
      within a given time window.
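
As a rough illustration of the idea, a per-window CPU time budget can be modeled in user space as below. This is a minimal sketch, not the kernel implementation: the class name `TimeQuota`, its fields, and the explicit millisecond timestamps are all hypothetical choices made for the example.

```python
from dataclasses import dataclass


@dataclass
class TimeQuota:
    """Hypothetical user-space model of a per-window CPU time budget."""

    quota_ms: float                # CPU time allowed per window
    window_ms: float               # length of one charge window
    window_start_ms: float = 0.0   # when the current window began
    charged_ms: float = 0.0        # CPU time already used in this window

    def _roll_window(self, now_ms: float) -> None:
        # Start a fresh window (and forget old charges) once the current
        # window has fully elapsed.
        if now_ms - self.window_start_ms >= self.window_ms:
            self.window_start_ms = now_ms
            self.charged_ms = 0.0

    def may_apply(self, now_ms: float) -> bool:
        """Return True while the budget of the current window is not used up."""
        self._roll_window(now_ms)
        return self.charged_ms < self.quota_ms

    def charge(self, spent_ms: float) -> None:
        """Record CPU time spent applying the action."""
        self.charged_ms += spent_ms
```

With `quota_ms=50` and `window_ms=1000`, the model corresponds to a limit of 50 ms of CPU time per second.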
      
      Link: https://lkml.kernel.org/r/20211019150731.16699-5-sj@kernel.org
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Cc: Amit Shah <amit@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: David Woodhouse <dwmw@amazon.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Leonard Foerster <foersleo@amazon.de>
      Cc: Marco Elver <elver@google.com>
      Cc: Markus Boehme <markubo@amazon.de>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/damon/schemes: skip already charged targets and regions · 50585192
      SeongJae Park authored
      If DAMOS has stopped applying the action in the middle of a group of
      memory regions due to its size quota, it starts the work again from the
      beginning of the address space in the next charge window.  If there is a
      huge memory region at the beginning of the address space and it always
      fulfills the scheme's target data access pattern, the action will be
      applied to only that region.

      This mitigates the case by skipping, at the beginning of the next charge
      window, the memory regions that were charged in the current charge
      window.
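
The effect of resuming where the previous window stopped can be sketched in user space as follows. This is an illustrative model only: the function `apply_window`, the `(start, end)` tuple representation of regions, and the `resume_from` address are hypothetical and do not mirror the kernel's data structures.

```python
def apply_window(regions, quota_sz, resume_from=0):
    """Apply the action for one charge window.

    regions     -- sorted list of (start, end) address pairs matching the scheme
    quota_sz    -- maximum number of bytes to handle in this window
    resume_from -- address up to which earlier windows already charged

    Returns (applied, next_resume_from).  next_resume_from is 0 when the
    whole list was covered, so the next window starts from the beginning.
    """
    applied = []
    charged = 0
    for start, end in regions:
        if end <= resume_from:
            continue  # fully charged in an earlier window, skip it
        start = max(start, resume_from)
        remaining = quota_sz - charged
        if remaining <= 0:
            # Quota used up with work left: continue from here next time.
            return applied, start
        size = min(end - start, remaining)
        applied.append((start, start + size))
        charged += size
        if start + size < end:
            # Stopped in the middle of this region.
            return applied, start + size
    return applied, 0
```

For a 1000-byte region followed by a 200-byte region and a 400-byte quota, three consecutive windows cover `(0, 400)`, `(400, 800)`, and then `(800, 1000)` plus `(1000, 1200)`, instead of handling `(0, 400)` over and over.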
      
      Link: https://lkml.kernel.org/r/20211019150731.16699-4-sj@kernel.org
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Cc: Amit Shah <amit@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: David Woodhouse <dwmw@amazon.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Leonard Foerster <foersleo@amazon.de>
      Cc: Marco Elver <elver@google.com>
      Cc: Markus Boehme <markubo@amazon.de>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/damon/schemes: implement size quota for schemes application speed control · 2b8a248d
      SeongJae Park authored
      There could be arbitrarily large memory regions fulfilling the target
      data access pattern of a DAMON-based operation scheme.  In that case,
      applying the action of the scheme could incur too high overhead.  To
      provide an intuitive way for avoiding it, this implements a feature
      called size quota.  If the quota is set, DAMON tries to apply the action
      only up to the given amount of memory regions within a given time
      window.
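
A size quota of this kind can be sketched as a byte budget that is consumed region by region within one window. The code below is a simplified user-space model under stated assumptions: `apply_with_size_quota` is a hypothetical helper, regions are plain `(start, end)` tuples, and cutting the last region short to fit the budget is a choice made for the example.

```python
def apply_with_size_quota(regions, quota_sz):
    """Return the address ranges the action is applied to in one window.

    regions  -- list of (start, end) address pairs matching the scheme
    quota_sz -- maximum number of bytes to handle in this window
    """
    applied = []
    charged = 0
    for start, end in regions:
        remaining = quota_sz - charged
        if remaining <= 0:
            break  # budget of this window is used up
        # Handle only as much of the region as the budget still allows.
        size = min(end - start, remaining)
        applied.append((start, start + size))
        charged += size
    return applied
```

For example, with regions `(0, 100)` and `(200, 500)` and a quota of 150 bytes, the first region is handled fully and only the first 50 bytes of the second one.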
      
      Link: https://lkml.kernel.org/r/20211019150731.16699-3-sj@kernel.org
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Cc: Amit Shah <amit@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: David Woodhouse <dwmw@amazon.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Leonard Foerster <foersleo@amazon.de>
      Cc: Marco Elver <elver@google.com>
      Cc: Markus Boehme <markubo@amazon.de>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/damon/paddr: support the pageout scheme · 57223ac2
      SeongJae Park authored
      Introduction
      ============
      
      This patchset 1) makes the engine for general data access
      pattern-oriented memory management (DAMOS) more useful for production
      environments, and 2) implements a static kernel module for lightweight
      proactive reclamation using the engine.
      
      Proactive Reclamation
      ---------------------
      
      On general memory over-committed systems, proactively reclaiming cold
      pages helps save memory and reduce the latency spikes incurred by
      direct reclaim or the CPU consumption of kswapd, while incurring only
      minimal performance degradation[2].
      
      A Free Pages Reporting[8] based memory over-commit virtualization system
      would be one more specific use case.  In such a system, the guest VMs
      report their free memory to the host, and the host reallocates the
      reported memory to other guests.  As a result, the system's memory
      utilization can be maximized.  However, the guests might not be so
      memory-frugal, because some kernel subsystems and user-space
      applications are designed to use as much memory as is available.  Then,
      guests would report only a small amount of free memory to the host,
      resulting in poor memory utilization.  Running the proactive
      reclamation in such guests could help mitigate this problem.
      
      Google has also implemented this idea and is using it in their data
      center.  They further proposed upstreaming it at LSFMM'19, and "the
      general consensus was that, while this sort of proactive reclaim would
      be useful for a number of users, the cost of this particular solution
      was too high to consider merging it upstream"[3].  The cost mainly
      comes from the coldness tracking.  Roughly speaking, the implementation
      periodically scans the 'Accessed' bit of each page.  For that reason,
      the overhead increases linearly as the size of the memory and the
      scanning frequency grow.  As a result, Google is known to dedicate one
      CPU to the work.  That's a reasonable option for someone like Google,
      but it wouldn't be so for some others.
      
      DAMON and DAMOS: An engine for data access pattern-oriented memory management
      -----------------------------------------------------------------------------
      
      DAMON[4] is a framework for general data access monitoring.  Its
      adaptive monitoring overhead control feature minimizes its monitoring
      overhead.  It also lets clients configure the upper bound of the
      overhead, regardless of the size of the monitoring target memory.
      While monitoring 70 GiB of memory on a production system every 5
      milliseconds, it consumes less than 1% of a single CPU's time.  For
      this, it could sacrifice some of the quality of the monitoring results.
      Nevertheless, the lower bound of the quality is configurable, and it
      uses a best-effort algorithm for better quality.  Our test results[5]
      show the quality is practical enough.  From the production system
      monitoring, we were able to find a 4 KiB region in the 70 GiB of memory
      that shows the highest access frequency.
      
      We normally don't monitor the data access pattern just for fun but to
      improve something like memory management.  Proactive reclamation is one
      such usage.  For such general cases, DAMON provides a feature called
      DAMON-based Operation Schemes (DAMOS)[6].  It makes DAMON an engine for
      general data access pattern-oriented memory management.  Using this,
      clients can ask DAMON to find memory regions with a specific data access
      pattern and apply some memory management action (e.g., page out, move to
      the head of the LRU list, use huge pages, ...).  We call the request a
      'scheme'.
      
      Proactive Reclamation on top of DAMON/DAMOS
      -------------------------------------------
      
      Therefore, by using DAMON for the cold pages detection, the proactive
      reclamation's monitoring overhead issue can be solved.  Actually, we
      previously implemented a version of proactive reclamation using DAMOS
      and achieved noticeable improvements with our evaluation setup[5].
      Nevertheless, it is more of a proof of concept than something for
      production use.  It supports only virtual address spaces of processes,
      and requires additional tuning efforts for given workloads and the
      hardware.  For the tuning, we introduced a simple auto-tuning user space
      tool[7].  Google is also known to use a similar ML-based approach for
      their fleets[2].  But making it just work with intuitive knobs in the
      kernel would be helpful for general users.
      
      To this end, this patchset improves DAMOS to be ready for such
      production usages, and implements another version of the proactive
      reclamation, namely DAMON_RECLAIM, on top of it.
      
      DAMOS Improvements: Aggressiveness Control, Prioritization, and Watermarks
      --------------------------------------------------------------------------
      
      First of all, the current version of DAMOS supports only virtual address
      spaces.  This patchset makes it support the physical address space for
      the page out action.

      The next major problem of the current version of DAMOS is the lack of
      aggressiveness control, which can result in arbitrary overhead.  For
      example, if huge memory regions having the data access pattern of
      interest are found, applying the requested action to all of the regions
      could incur significant overhead.  It can be controlled by tuning the
      target data access pattern with manual or automated approaches[2,7].
      But some people would prefer the kernel to just work with only
      intuitive tuning or default values.
      
      For such cases, this patchset implements a safeguard, namely time/size
      quota.  Using this, the clients can specify up to how much time can be
      used for applying the action, and/or up to how much memory the action
      can be applied to, within a user-specified time duration.  A follow-up
      question is: to which memory regions should the action be applied within
      the limits?  We implement a simple regions prioritization mechanism for
      each action and make DAMOS apply the action to high priority regions
      first.  It also allows clients to tune the prioritization mechanism to
      use different weights for the size, access frequency, and age of memory
      regions.  This means we could use not only LRU but also LFU or some
      fancy algorithms like CAR[9] with lightweight overhead.
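
One way to picture the weighted prioritization is a score that mixes normalized size, access frequency, and age. The sketch below is a user-space illustration under stated assumptions, not the kernel's scoring code: the function names, the dictionary fields (`size`, `nr_accesses`, `age`), the `limits` used for normalization, and the rule that rarely accessed regions score higher for a page-out style action are all hypothetical.

```python
def priority_score(region, limits, weights):
    """Return a score in [0, 1]; higher means 'apply the action earlier'."""
    w_size, w_freq, w_age = weights
    total = w_size + w_freq + w_age
    if total <= 0:
        raise ValueError("at least one weight must be positive")
    size_s = min(region["size"], limits["size"]) / limits["size"]
    # For a page-out style action, rarely accessed regions score higher.
    freq = min(region["nr_accesses"], limits["nr_accesses"])
    freq_s = 1.0 - freq / limits["nr_accesses"]
    age_s = min(region["age"], limits["age"]) / limits["age"]
    return (w_size * size_s + w_freq * freq_s + w_age * age_s) / total


def order_by_priority(regions, limits, weights):
    """Sort regions so that the highest priority region comes first."""
    return sorted(
        regions,
        key=lambda r: priority_score(r, limits, weights),
        reverse=True,
    )
```

In this model, putting all the weight on age gives an LRU-like order, while putting all the weight on access frequency gives an LFU-like order.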
      
      Though DAMON is lightweight, someone would want to remove even the cold
      pages monitoring overhead when it is unnecessary.  Currently, it should
      be manually turned on and off by clients, but some clients would simply
      want to turn it on and off based on some metrics like free memory ratio
      or memory fragmentation.  For such cases, this patchset implements a
      watermarks-based automatic activation feature.  It allows the clients
      to configure the metric of their interest, and three watermarks of the
      metric.  If the metric is higher than the high watermark or lower than
      the low watermark, the scheme is deactivated.  If the metric is lower
      than the mid watermark but higher than the low watermark, the scheme is
      activated.
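
The activation rule described above can be written down as a small state function. This is a minimal sketch: `next_active_state` is a hypothetical name, and keeping the current state when the metric lies between the mid and high watermarks (or exactly on a watermark) is an assumption, since the text only specifies the other cases.

```python
def next_active_state(metric, high, mid, low, active):
    """Decide whether the scheme should be active for the given metric value.

    metric         -- current value, e.g. a free memory ratio
    high, mid, low -- the three watermarks, with high >= mid >= low
    active         -- whether the scheme is currently active
    """
    if metric > high or metric < low:
        return False   # outside the watermarks: deactivate
    if low < metric < mid:
        return True    # between low and mid: activate
    return active      # otherwise keep the current state (assumption)
```

With watermarks of 50, 40, and 20, a metric of 30 activates the scheme, 60 or 10 deactivates it, and 45 leaves the state unchanged.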
      
      DAMON-based Reclaim
      -------------------
      
      Using the improved version of DAMOS, this patchset implements a static
      kernel module called 'damon_reclaim'.  It finds memory regions that
      have not been accessed for a specific time duration and pages them out.
      Consuming too much CPU for the paging out operations, or doing pageout
      too frequently, can be critical for systems configuring their swap
      devices with software-defined in-memory block devices like zram/zswap,
      or with devices that have a limited total number of writes like SSDs,
      respectively.  To avoid these problems, the time/size quotas can be
      configured.  Under the quotas, it first pages out the memory regions
      that have gone unaccessed the longest.  Also, to remove the monitoring
      overhead under peaceful situations, and to fall back to the LRU-list
      based page granularity reclamation when it doesn't make progress, the
      three watermarks based activation mechanism is used, with the free
      memory ratio as the watermark metric.
      
      For convenient configurations, it provides several module parameters.
      Using these, sysadmins can enable/disable it, and tune its parameters
      including the coldness identification time threshold, the time/size
      quotas and the three watermarks.
      
      Evaluation
      ==========
      
      In short, DAMON_RECLAIM with 50ms/s time quota and regions
      prioritization on v5.15-rc5 Linux kernel with ZRAM swap device achieves
      38.58% memory saving with only 1.94% runtime overhead.  For this,
      DAMON_RECLAIM consumes only 4.97% of single CPU time.
      
      Setup
      -----
      
      We evaluate DAMON_RECLAIM to show how each of the DAMOS improvements
      takes effect.  For this, we measure DAMON_RECLAIM's CPU consumption,
      entire system memory footprint, total number of major page faults, and
      runtime of 24 realistic workloads in the PARSEC3 and SPLASH-2X benchmark
      suites on my QEMU/KVM based virtual machine.  The virtual machine runs
      on an i3.metal AWS instance, has 130GiB memory, and runs a Linux kernel
      built on the latest -mm tree[1] plus this patchset.  It also utilizes a
      4 GiB ZRAM swap device.  We repeat the measurement 5 times and use
      averages.
      
      [1] https://github.com/hnaz/linux-mm/tree/v5.15-rc5-mmots-2021-10-13-19-55
      
      Detailed Results
      ----------------
      
      The results are summarized in the table below.

      With a coldness identification threshold of 5 seconds, DAMON_RECLAIM
      without the time quota-based speed limit achieves 47.21% memory saving,
      but incurs a 4.59% runtime slowdown to the workloads on average.  For
      this, DAMON_RECLAIM consumes about 11.28% of single CPU time.
      
      Applying time quotas of 200ms/s, 50ms/s, and 10ms/s without the regions
      prioritization changes the slowdown to 4.89%, 2.65%, and 1.5%,
      respectively.  The time quota of 200ms/s (20%) makes no real change
      compared to the version without a quota, because that version consumes
      only 11.28% CPU time.  DAMON_RECLAIM's CPU utilization is similarly
      reduced: 11.24%, 5.51%, and 2.01% of single CPU time.  That is, the
      overhead is proportional to the speed limit.  Nevertheless, the quota
      also reduces the memory saving because reclamation becomes less
      aggressive.  In detail, the three variants show 48.76%, 37.83%, and
      7.85% memory saving, respectively.
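
The relation between an "Nms/s" time quota and a share of one CPU is simple arithmetic, shown here as a short sketch. The helper name `quota_as_cpu_percent` is hypothetical; the measured values are copied from the paragraph above.

```python
def quota_as_cpu_percent(quota_ms, window_ms=1000):
    """Convert a time quota per window into a percentage of one CPU."""
    return 100.0 * quota_ms / window_ms


# CPU utilization measured without regions prioritization, per time quota.
measured_cpu_util = {200: 11.24, 50: 5.51, 10: 2.01}

for quota_ms, measured in measured_cpu_util.items():
    share = quota_as_cpu_percent(quota_ms)
    print(f"{quota_ms}ms/s -> {share:.1f}% of one CPU, measured {measured:.2f}%")
```

The 200ms/s quota corresponds to 20% of one CPU, which is above the 11.28% that the version without a quota consumes, so that quota does not act as a limit in this setup.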
      
      Applying the regions prioritization (page out the regions that have gone
      unaccessed the longest first, within the time quota) further reduces the
      performance degradation.  Runtime slowdowns and the increase in the
      total number of major page faults changed as 4.89%/218,690% ->
      4.39%/166,136% (200ms/s), 2.65%/111,886% -> 1.94%/59,053% (50ms/s), and
      1.5%/34,973.40% -> 2.08%/8,781.75% (10ms/s).  The runtime under the
      10ms/s time quota has increased with prioritization, but apparently
      that's within the margin of error.
      
          time quota   prioritization  memory_saving  cpu_util  slowdown  pgmajfaults overhead
          N            N               47.21%         11.28%    4.59%     194,802%
          200ms/s      N               48.76%         11.24%    4.89%     218,690%
          50ms/s       N               37.83%         5.51%     2.65%     111,886%
          10ms/s       N               7.85%          2.01%     1.5%      34,793.40%
          200ms/s      Y               50.08%         10.38%    4.39%     166,136%
          50ms/s       Y               38.58%         4.97%     1.94%     59,053%
          10ms/s       Y               3.63%          1.73%     2.08%     8,781.75%
      
      Baseline and Complete Git Trees
      ===============================
      
      The patches are based on the latest -mm tree
      (v5.15-rc5-mmots-2021-10-13-19-55).  You can also clone the complete git tree
      from:
      
          $ git clone git://github.com/sjp38/linux -b damon_reclaim/patches/v1
      
      A web view is also available:
      https://git.kernel.org/pub/scm/linux/kernel/git/sj/linux.git/tag/?h=damon_reclaim/patches/v1
      
      Sequence Of Patches
      ===================
      
      The first patch makes DAMOS support the physical address space for the
      page out action.  The following five patches (patches 2-6) implement the
      time/size quotas.  The next four patches (patches 7-10) implement the
      memory regions prioritization within the limit.  Then, the three
      following patches (patches 11-13) implement the watermarks-based schemes
      activation.
      
      Finally, the last two patches (patches 14-15) implement and document the
      DAMON-based reclamation using the advanced DAMOS.
      
      [1] https://www.kernel.org/doc/html/v5.15-rc1/vm/damon/index.html
      [2] https://research.google/pubs/pub48551/
      [3] https://lwn.net/Articles/787611/
      [4] https://damonitor.github.io
      [5] https://damonitor.github.io/doc/html/latest/vm/damon/eval.html
      [6] https://lore.kernel.org/linux-mm/20211001125604.29660-1-sj@kernel.org/
      [7] https://github.com/awslabs/damoos
      [8] https://www.kernel.org/doc/html/latest/vm/free_page_reporting.html
      [9] https://www.usenix.org/conference/fast-04/car-clock-adaptive-replacement
      
      This patch (of 15):
      
      This makes the DAMON primitives for physical address space support the
      pageout action for DAMON-based Operation Schemes.  With this commit,
      hence, users can easily implement system-level data access-aware
      reclamations using DAMOS.
      
      [sj@kernel.org: fix missing-prototype build warning]
        Link: https://lkml.kernel.org/r/20211025064220.13904-1-sj@kernel.org
      
      Link: https://lkml.kernel.org/r/20211019150731.16699-1-sj@kernel.org
      Link: https://lkml.kernel.org/r/20211019150731.16699-2-sj@kernel.org
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
      Cc: Amit Shah <amit@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Woodhouse <dwmw@amazon.com>
      Cc: Marco Elver <elver@google.com>
      Cc: Leonard Foerster <foersleo@amazon.de>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Markus Boehme <markubo@amazon.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/damon/dbgfs: remove unnecessary variables · 9210622a
      Rongwei Wang authored
      In some functions, it's unnecessary to declare 'err' and 'ret' variables
      at the same time.  This patch mainly simplifies such declarations by
      reusing one variable.
      
      Link: https://lkml.kernel.org/r/20211014073014.35754-1-sj@kernel.org
      Signed-off-by: Rongwei Wang <rongwei.wang@linux.alibaba.com>
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/damon/vaddr: constify static mm_walk_ops · 199b50f4
      Rikard Falkeborn authored
      The only usage of these structs is to pass their addresses to
      walk_page_range(), which takes a pointer to const mm_walk_ops as
      argument.  Make them const to allow the compiler to put them in
      read-only memory.
      
      Link: https://lkml.kernel.org/r/20211014075042.17174-2-rikard.falkeborn@gmail.com
      Signed-off-by: Rikard Falkeborn <rikard.falkeborn@gmail.com>
      Reviewed-by: SeongJae Park <sj@kernel.org>
      Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Docs/DAMON: document physical memory monitoring support · c6380721
      SeongJae Park authored
      This updates the DAMON documents for the physical memory address space
      monitoring support.
      
      Link: https://lkml.kernel.org/r/20211012205711.29216-8-sj@kernel.org
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Cc: Amit Shah <amit@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Brendan Higgins <brendanhiggins@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: David Woodhouse <dwmw@amazon.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Leonard Foerster <foersleo@amazon.de>
      Cc: Marco Elver <elver@google.com>
      Cc: Markus Boehme <markubo@amazon.de>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/damon/dbgfs: support physical memory monitoring · c026291a
      SeongJae Park authored
      This makes 'damon-dbgfs' support physical memory monitoring, in
      addition to virtual memory monitoring.
      
      Users can do the physical memory monitoring by writing a special
      keyword, 'paddr' to the 'target_ids' debugfs file.  Then, DAMON will
      check the special keyword and configure the monitoring context to run
      with the primitives for the physical address space.
      
      Unlike the virtual memory monitoring, the monitoring target region will
      not be automatically set.  Therefore, users should also set the
      monitoring target address region using the 'init_regions' debugfs file.
      
      Also, note that the physical memory monitoring will not be
      automatically terminated.  The user should explicitly turn off the
      monitoring by writing 'off' to the 'monitor_on' debugfs file.
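
The sequence of debugfs writes described above can be laid out as data, as in the sketch below. It only builds the list of writes and does not touch the filesystem. Several things are assumptions: the helper names are hypothetical, the debugfs directory `/sys/kernel/debug/damon` is a common default rather than something stated here, and the value of the first field of each 'init_regions' line for 'paddr' is not given in this message, so the caller has to supply it as `first_field`.

```python
def paddr_setup_writes(regions, first_field,
                       dbgfs_dir="/sys/kernel/debug/damon"):
    """Return (path, data) pairs that would start physical memory monitoring.

    regions     -- list of (start, end) physical address pairs
    first_field -- value for the first column of each 'init_regions' line
                   (assumption: supplied by the caller)
    """
    region_lines = "\n".join(
        f"{first_field} {start} {end}" for start, end in regions
    )
    return [
        (f"{dbgfs_dir}/target_ids", "paddr"),       # select physical address space
        (f"{dbgfs_dir}/init_regions", region_lines),  # target regions are not automatic
        (f"{dbgfs_dir}/monitor_on", "on"),          # start monitoring
    ]


def paddr_stop_write(dbgfs_dir="/sys/kernel/debug/damon"):
    """Return the (path, data) pair that explicitly stops the monitoring."""
    return (f"{dbgfs_dir}/monitor_on", "off")
```

Keeping the writes as plain data makes the required order visible: the keyword first, then the regions, then the switch.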
      
      Link: https://lkml.kernel.org/r/20211012205711.29216-7-sj@kernel.org
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Cc: Amit Shah <amit@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Brendan Higgins <brendanhiggins@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: David Woodhouse <dwmw@amazon.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Leonard Foerster <foersleo@amazon.de>
      Cc: Marco Elver <elver@google.com>
      Cc: Markus Boehme <markubo@amazon.de>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/damon: implement primitives for physical address space monitoring · a28397be
      SeongJae Park authored
      This implements the monitoring primitives for the physical memory
      address space.  Internally, it uses the PTE Accessed bit, similar to
      that of the virtual address spaces monitoring primitives.  It supports
      only user memory pages, as idle pages tracking does.  If the monitoring
      target physical memory address range contains non-user memory pages,
      access check of the pages will do nothing but simply treat the pages as
      not accessed.
      
      Link: https://lkml.kernel.org/r/20211012205711.29216-6-sj@kernel.org
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Cc: Amit Shah <amit@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Brendan Higgins <brendanhiggins@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: David Woodhouse <dwmw@amazon.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Leonard Foerster <foersleo@amazon.de>
      Cc: Marco Elver <elver@google.com>
      Cc: Markus Boehme <markubo@amazon.de>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/damon/vaddr: separate commonly usable functions · 46c3a0ac
      SeongJae Park authored
      This moves functions in the default virtual address spaces monitoring
      primitives that are commonly usable from other address spaces, like the
      physical address space, into a header file.  Those will be reused by the
      physical address space monitoring primitives which will be implemented
      by the following commit.
      
      [sj@kernel.org: include 'highmem.h' to fix a build failure]
        Link: https://lkml.kernel.org/r/20211014110848.5204-1-sj@kernel.org
      
      Link: https://lkml.kernel.org/r/20211012205711.29216-5-sj@kernel.org
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Cc: Amit Shah <amit@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Brendan Higgins <brendanhiggins@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: David Woodhouse <dwmw@amazon.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Leonard Foerster <foersleo@amazon.de>
      Cc: Marco Elver <elver@google.com>
      Cc: Markus Boehme <markubo@amazon.de>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Docs/admin-guide/mm/damon: document 'init_regions' feature · c2fe4987
      SeongJae Park authored
      This adds description of the 'init_regions' feature in the DAMON usage
      document.
      
      Link: https://lkml.kernel.org/r/20211012205711.29216-4-sj@kernel.org
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Cc: Amit Shah <amit@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Brendan Higgins <brendanhiggins@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: David Woodhouse <dwmw@amazon.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Leonard Foerster <foersleo@amazon.de>
      Cc: Marco Elver <elver@google.com>
      Cc: Markus Boehme <markubo@amazon.de>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/damon/dbgfs-test: add a unit test case for 'init_regions' · 1c2e11bf
      SeongJae Park authored
      This adds another test case for the new feature, 'init_regions'.
      
      Link: https://lkml.kernel.org/r/20211012205711.29216-3-sj@kernel.org
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Reviewed-by: Brendan Higgins <brendanhiggins@google.com>
      Cc: Amit Shah <amit@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: David Woodhouse <dwmw@amazon.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Leonard Foerster <foersleo@amazon.de>
      Cc: Marco Elver <elver@google.com>
      Cc: Markus Boehme <markubo@amazon.de>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/damon/dbgfs: allow users to set initial monitoring target regions · 90bebce9
      SeongJae Park authored
      Patch series "DAMON: Support Physical Memory Address Space Monitoring".
      
      DAMON currently supports only virtual address spaces monitoring.  It can
      be easily extended for various use cases and address spaces by
      configuring its monitoring primitives layer to use appropriate
      primitives implementations, though.  This patchset implements monitoring
      primitives for the physical address space monitoring using the
      structure.
      
      The first 3 patches allow user space users to manually set the
      monitoring regions.  The 1st patch implements the feature in
      'damon-dbgfs'.  Then, patches adding a unit test (the 2nd patch)
      and updating the documentation (the 3rd patch) follow.
      
      The following 4 patches implement the physical address space monitoring
      primitives.  The 4th patch makes some primitive functions for the
      virtual address spaces primitives reusable.  The 5th patch implements
      the physical address space monitoring primitives.  The 6th patch links
      the primitives to 'damon-dbgfs'.  Finally, the 7th patch documents the
      new features.
      
      This patch (of 7):
      
      Some 'damon-dbgfs' users would want to monitor only a part of the entire
      virtual memory address space.  The program interface users in the kernel
      space could use '->before_start()' callback or set the regions inside
      the context struct as they want, but 'damon-dbgfs' users cannot.
      
      For that reason, this introduces a new debugfs file called
      'init_regions'.  'damon-dbgfs' users can specify the initial monitoring
      target address regions they want by writing specially formatted input
      to the file.  The input should describe one region per line, in the
      form below:
      
          <pid> <start address> <end address>
      
      Note that the regions will be updated to cover the entire memory mapped
      regions after a 'regions update interval' has passed.  If you want the
      regions to not be updated after the initial setting, you could set the
      interval to a very long time, say, a few decades.
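
      As a rough user-space illustration of the input format described
      above, the sketch below renders a list of regions as one
      '<pid> <start address> <end address>' line each.  The helper name
      'format_init_regions' is hypothetical, and writing the addresses as
      decimal integers is an assumption; this message does not state the
      expected number base.

```python
def format_init_regions(regions):
    """Render (pid, start, end) tuples as '<pid> <start> <end>' lines.

    Hypothetical helper for illustration only; decimal formatting of the
    addresses is an assumption.
    """
    lines = []
    for pid, start, end in regions:
        if start >= end:
            raise ValueError(f"empty or inverted region: {start}-{end}")
        lines.append(f"{pid} {start} {end}")
    return "\n".join(lines)


# Two regions of a single (made-up) target process with pid 42.
payload = format_init_regions([(42, 4096, 8192), (42, 16384, 20480)])
print(payload)
```

      The resulting string is what a user would then write to the new
      debugfs file; the write itself is left out because it needs root
      privileges and a DAMON-enabled kernel.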
      
      Link: https://lkml.kernel.org/r/20211012205711.29216-1-sj@kernel.org
      Link: https://lkml.kernel.org/r/20211012205711.29216-2-sj@kernel.org
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
      Cc: Amit Shah <amit@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Woodhouse <dwmw@amazon.com>
      Cc: Marco Elver <elver@google.com>
      Cc: Leonard Foerster <foersleo@amazon.de>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Markus Boehme <markubo@amazon.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Brendan Higgins <brendanhiggins@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      90bebce9
    • SeongJae Park's avatar
      Docs/admin-guide/mm/damon: document DAMON-based Operation Schemes · 68536f8e
      SeongJae Park authored
      This adds the description of DAMON-based operation schemes in the DAMON
      documents.
      
      Link: https://lkml.kernel.org/r/20211001125604.29660-8-sj@kernel.org
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Cc: Amit Shah <amit@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: David Woodhouse <dwmw@amazon.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Leonard Foerster <foersleo@amazon.de>
      Cc: Marco Elver <elver@google.com>
      Cc: Markus Boehme <markubo@amazon.de>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      68536f8e
    • SeongJae Park's avatar
      selftests/damon: add 'schemes' debugfs tests · 8d5d4c63
      SeongJae Park authored
      This adds simple selftests for the 'schemes' debugfs file of DAMON.
      
      Link: https://lkml.kernel.org/r/20211001125604.29660-7-sj@kernel.org
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Cc: Amit Shah <amit@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: David Woodhouse <dwmw@amazon.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Leonard Foerster <foersleo@amazon.de>
      Cc: Marco Elver <elver@google.com>
      Cc: Markus Boehme <markubo@amazon.de>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8d5d4c63
    • SeongJae Park's avatar
      mm/damon/schemes: implement statistics feature · 2f0b548c
      SeongJae Park authored
      To tune the DAMON-based operation schemes, knowing how many and how
      large regions are affected by each of the schemes will be helpful.  Those
      stats could be used for not only the tuning, but also monitoring of the
      working set size and the number of regions, if the scheme does not
      change the program behavior too much.
      
      For that reason, this implements the statistics for the schemes.  The
      total number and size of the regions that each scheme is applied to are
      exported to users via '->stat_count' and '->stat_sz' of 'struct damos'.
      Admins can also check the numbers by reading the 'schemes' debugfs file.
      The last two integers now represent the stats.  To allow collecting the
      stats without changing the program behavior, this also adds a new scheme
      action, 'DAMOS_STAT'.  Note that 'DAMOS_STAT' not only performs no
      memory operation actions, but also does not reset the age of regions.
      
      Link: https://lkml.kernel.org/r/20211001125604.29660-6-sj@kernel.org
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Cc: Amit Shah <amit@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: David Woodhouse <dwmw@amazon.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Leonard Foerster <foersleo@amazon.de>
      Cc: Marco Elver <elver@google.com>
      Cc: Markus Boehme <markubo@amazon.de>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2f0b548c
    • SeongJae Park's avatar
      mm/damon/dbgfs: support DAMON-based Operation Schemes · af122dd8
      SeongJae Park authored
      This makes 'damon-dbgfs' support the data access monitoring oriented
      memory management schemes.  Users can read and update the schemes using
      the ``<debugfs>/damon/schemes`` file.  The format is::
      
          <min/max size> <min/max access frequency> <min/max age> <action>
      
      Link: https://lkml.kernel.org/r/20211001125604.29660-5-sj@kernel.org
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Cc: Amit Shah <amit@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: David Woodhouse <dwmw@amazon.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Leonard Foerster <foersleo@amazon.de>
      Cc: Marco Elver <elver@google.com>
      Cc: Markus Boehme <markubo@amazon.de>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      af122dd8
    • SeongJae Park's avatar
      mm/damon/vaddr: support DAMON-based Operation Schemes · 6dea8add
      SeongJae Park authored
      This makes DAMON's default primitives for virtual address spaces
      support DAMON-based Operation Schemes (DAMOS) by implementing action
      application functions and registering them to the monitoring context.  The
      implementation simply links 'madvise()' for related DAMOS actions.  That
      is, 'madvise(MADV_WILLNEED)' is called for the 'WILLNEED' DAMOS action, and
      similarly for the other actions ('COLD', 'PAGEOUT', 'HUGEPAGE', 'NOHUGEPAGE').
      
      So, the kernel space DAMON users can now use the DAMON-based
      optimizations with only a small amount of code.
      
      Link: https://lkml.kernel.org/r/20211001125604.29660-4-sj@kernel.org
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Cc: Amit Shah <amit@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: David Woodhouse <dwmw@amazon.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Leonard Foerster <foersleo@amazon.de>
      Cc: Marco Elver <elver@google.com>
      Cc: Markus Boehme <markubo@amazon.de>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6dea8add
    • SeongJae Park's avatar
      mm/damon/core: implement DAMON-based Operation Schemes (DAMOS) · 1f366e42
      SeongJae Park authored
      In many cases, users might use DAMON for simple data access aware memory
      management optimizations such as applying an operation scheme to a
      memory region of a specific size having a specific access frequency for
      a specific time.  For example, "page out a memory region larger than 100
      MiB but having a low access frequency for more than 10 minutes", or "Use THP
      for a memory region larger than 2 MiB having a high access frequency for
      more than 2 seconds".
      
      The simplest form of the solution would be doing offline data access
      pattern profiling using DAMON and modifying the application source code
      or system configuration based on the profiling results.  Alternatively,
      one could develop a daemon constructed with two modules (one for access
      monitoring and the other for applying memory management actions via
      mlock(), madvise(), sysctl, etc).
      
      To avoid users spending their time on implementing such simple
      data access monitoring-based operation schemes, this makes DAMON
      handle such schemes directly.  With this change, users can simply
      specify their desired schemes to DAMON.  Then, DAMON will automatically
      apply the schemes to the user-specified target processes.
      
      Each of the schemes is composed of conditions for filtering the
      target memory regions and the desired memory management action for the
      target.  Specifically, the format is::
      
          <min/max size> <min/max access frequency> <min/max age> <action>
      
      The filtering conditions are the size of the memory region, the number of
      accesses to the region monitored by DAMON, and the age of the region.  The
      age of a region is incremented periodically but reset when its addresses or
      access frequency have significantly changed or the action of a scheme was
      applied.  For the action, the current implementation supports a few
      madvise()-like hints: ``WILLNEED``, ``COLD``, ``PAGEOUT``, ``HUGEPAGE``,
      and ``NOHUGEPAGE``.
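
      The filtering described above can be pictured with the following
      user-space sketch.  It is not the kernel implementation: the 'Region'
      and 'Scheme' classes, the 'apply_schemes' helper, the inclusive
      bounds, and the unit of 'age' are all assumptions made for
      illustration.  Only the three conditions, the action names, and the
      age reset on application come from the description.

```python
from dataclasses import dataclass

MIB = 1 << 20
NO_LIMIT = 2**63  # stand-in for "no upper bound" (assumption)


@dataclass
class Region:
    size: int          # bytes
    nr_accesses: int   # access count observed by the monitor
    age: int           # abstract age units (assumption)


@dataclass
class Scheme:
    min_size: int
    max_size: int
    min_nr_accesses: int
    max_nr_accesses: int
    min_age: int
    max_age: int
    action: str        # e.g. "PAGEOUT"; opaque label in this sketch

    def matches(self, region):
        # All three conditions must hold; bounds are treated as inclusive.
        return (self.min_size <= region.size <= self.max_size
                and self.min_nr_accesses <= region.nr_accesses <= self.max_nr_accesses
                and self.min_age <= region.age <= self.max_age)


def apply_schemes(regions, schemes):
    """Return (action, region) pairs and reset the age of acted-on regions."""
    applied = []
    for region in regions:
        for scheme in schemes:
            if scheme.matches(region):
                applied.append((scheme.action, region))
                region.age = 0  # age is reset once an action was applied
                break
    return applied


# "Page out regions larger than 100 MiB that stayed rarely accessed."
pageout = Scheme(100 * MIB, NO_LIMIT, 0, 2, 600, NO_LIMIT, "PAGEOUT")
cold = Region(size=200 * MIB, nr_accesses=0, age=700)
hot = Region(size=200 * MIB, nr_accesses=50, age=700)
applied = apply_schemes([cold, hot], [pageout])
```

      Here only the rarely accessed region is selected, and its age starts
      over afterwards so the same scheme does not fire again immediately.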
      
      Because DAMON supports various address spaces and application of the
      actions to a monitoring target region depends on the type of the
      target address space, the application code should be implemented by each
      set of primitives and registered to the framework.  Note that this only
      implements the framework part.  The following commit will implement the
      action applications for the virtual address spaces primitives.
      
      Link: https://lkml.kernel.org/r/20211001125604.29660-3-sj@kernel.org
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Cc: Amit Shah <amit@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: David Woodhouse <dwmw@amazon.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Leonard Foerster <foersleo@amazon.de>
      Cc: Marco Elver <elver@google.com>
      Cc: Markus Boehme <markubo@amazon.de>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1f366e42
    • SeongJae Park's avatar
      mm/damon/core: account age of target regions · fda504fa
      SeongJae Park authored
      Patch series "Implement Data Access Monitoring-based Memory Operation Schemes".
      
      Introduction
      ============
      
      DAMON[1] can be used as a primitive for data access aware memory
      management optimizations.  For that, users who want such optimizations
      should run DAMON, read the monitoring results, analyze them, plan a new
      memory management scheme, and apply the new scheme by themselves.  Such
      efforts will be inevitable for some complicated optimizations.
      
      However, in many other cases, the users would simply want the system to
      apply a memory management action to a memory region of a specific size
      having a specific access frequency for a specific time.  For example,
      "page out a memory region larger than 100 MiB keeping only rare accesses
      for more than 2 minutes", or "Do not use THP for a memory region larger
      than 2 MiB rarely accessed for more than 1 second".
      
      To make such work easier and non-redundant, this patchset implements a
      new feature of DAMON, which is called Data Access Monitoring-based
      Operation Schemes (DAMOS).  Using the feature, users can describe the
      normal schemes in a simple way and ask DAMON to execute those on its
      own.
      
      [1] https://damonitor.github.io
      
      Evaluations
      ===========
      
      DAMOS is accurate and useful for memory management optimizations.  An
      experimental DAMON-based operation scheme for THP, 'ethp', removes
      76.15% of THP memory overheads while preserving 51.25% of THP speedup.
      Another experimental DAMON-based 'proactive reclamation' implementation,
      'prcl', reduces 93.38% of resident sets and 23.63% of system memory
      footprint while incurring only 1.22% runtime overhead in the best case
      (parsec3/freqmine).
      
      NOTE that the experimental THP optimization and proactive reclamation
      are not for production but only proofs of concept.
      
      Please refer to the showcase web site's evaluation document[1] for
      detailed evaluation setup and results.
      
      [1] https://damonitor.github.io/doc/html/v34/vm/damon/eval.html
      
      Long-term Support Trees
      -----------------------
      
      For people who want to test DAMON on LTS kernels, there are
      another couple of trees based on the two latest LTS kernels respectively and
      containing the 'damon/master' backports.
      
      - For v5.4.y: https://git.kernel.org/sj/h/damon/for-v5.4.y
      - For v5.10.y: https://git.kernel.org/sj/h/damon/for-v5.10.y
      
      Sequence Of Patches
      ===================
      
      The 1st patch accounts age of each region.  The 2nd patch implements the
      core of the DAMON-based operation schemes feature.  The 3rd patch makes
      the default monitoring primitives for virtual address spaces to support
      the schemes.  From this point, the kernel space users can use DAMOS.
      The 4th patch exports the feature to the user space via the debugfs
      interface.  The 5th patch implements schemes statistics feature for
      easier tuning of the schemes and runtime access pattern analysis, and
      the 6th patch adds selftests for these changes.  Finally, the 7th patch
      documents this new feature.
      
      This patch (of 7):
      
      DAMON can be used for data access pattern aware memory management
      optimizations.  For that, users should run DAMON, read the monitoring
      results, analyze them, plan a new memory management scheme, and apply the
      new scheme by themselves.  It would not be too hard, but would still
      require some level of effort.  For complicated cases, this effort is
      inevitable.
      
      That said, in many cases, users would simply want to apply an action to
      a memory region of a specific size having a specific access frequency
      for a specific time.  For example, "page out a memory region larger than
      100 MiB but having a low access frequency for more than 10 minutes", or
      "Use THP for a memory region larger than 2 MiB having a high access
      frequency for more than 2 seconds".
      
      For such optimizations, users will need to first account the age of each
      region themselves.  To reduce such efforts, this implements a simple age
      accounting of each region in DAMON.  For each aggregation step, DAMON
      compares the access frequency with that from the last aggregation and
      resets the age of the region if the change is significant.  Otherwise, the
      age is incremented.  Also, when regions are merged, the region
      size-weighted average of the ages is set as the age of the new merged
      region.
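
      The accounting rule can be sketched in user space as below.  This is
      an illustration under assumptions, not the kernel code: the 'Region'
      class, the helper names, the explicit 'threshold' parameter that
      decides what counts as a "significant" change, and the use of integer
      division are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Region:
    start: int
    end: int
    nr_accesses: int
    last_nr_accesses: int = 0
    age: int = 0

    @property
    def size(self):
        return self.end - self.start


def update_age(region, threshold):
    """Run once per aggregation step for each region."""
    if abs(region.nr_accesses - region.last_nr_accesses) > threshold:
        region.age = 0       # access frequency changed significantly
    else:
        region.age += 1      # pattern is stable, so the region gets older
    region.last_nr_accesses = region.nr_accesses


def merged_age(left, right):
    """Size-weighted average of the two ages, for the merged region."""
    total = left.size + right.size
    return (left.age * left.size + right.age * right.size) // total


r = Region(0, 100, nr_accesses=5, last_nr_accesses=5, age=3)
update_age(r, threshold=2)           # unchanged frequency: age grows to 4
stable_age = r.age
r.nr_accesses = 20
update_age(r, threshold=2)           # large jump: age resets to 0
reset_age = r.age

a = Region(0, 100, nr_accesses=1, age=10)
b = Region(100, 400, nr_accesses=1, age=2)
combined = merged_age(a, b)          # (10*100 + 2*300) // 400
```

      Weighting by size keeps a small, old region from making a large,
      young neighbour look old after the merge.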
      
      Link: https://lkml.kernel.org/r/20211001125604.29660-1-sj@kernel.org
      Link: https://lkml.kernel.org/r/20211001125604.29660-2-sj@kernel.org
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
      Cc: Amit Shah <amit@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Woodhouse <dwmw@amazon.com>
      Cc: Marco Elver <elver@google.com>
      Cc: Leonard Foerster <foersleo@amazon.de>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Markus Boehme <markubo@amazon.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      fda504fa
    • Colin Ian King's avatar
      mm/damon/core: nullify pointer ctx->kdamond with a NULL · 7ec1992b
      Colin Ian King authored
      Currently a plain integer is being used to nullify the pointer
      ctx->kdamond.  Use NULL instead.  Cleans up sparse warning:
      
        mm/damon/core.c:317:40: warning: Using plain integer as NULL pointer
      
      Link: https://lkml.kernel.org/r/20210925215908.181226-1-colin.king@canonical.com
      Signed-off-by: Colin Ian King <colin.king@canonical.com>
      Reviewed-by: SeongJae Park <sj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7ec1992b
    • Changbin Du's avatar
      mm/damon: needn't hold kdamond_lock to print pid of kdamond · 42e4cef5
      Changbin Du authored
      Just get the pid by 'current->pid'.  Meanwhile, to be symmetrical make
      the 'starts' and 'finishes' logs both use debug level.
      
      Link: https://lkml.kernel.org/r/20210927232432.17750-1-changbin.du@gmail.com
      Signed-off-by: Changbin Du <changbin.du@gmail.com>
      Reviewed-by: SeongJae Park <sj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      42e4cef5
    • Changbin Du's avatar
      mm/damon: remove unnecessary do_exit() from kdamond · 5f7fe2b9
      Changbin Du authored
      Just return from the kthread function.
      
      Link: https://lkml.kernel.org/r/20210927232421.17694-1-changbin.du@gmail.com
      Signed-off-by: Changbin Du <changbin.du@gmail.com>
      Cc: SeongJae Park <sjpark@amazon.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5f7fe2b9
    • SeongJae Park's avatar
      mm/damon/core: print kdamond start log in debug mode only · 704571f9
      SeongJae Park authored
      Logging of kdamond startup is using 'pr_info()' unnecessarily.  This
      makes it use 'pr_debug()' instead.
      
      Link: https://lkml.kernel.org/r/20210917123958.3819-6-sj@kernel.org
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: SeongJae Park <sjpark@amazon.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      704571f9
    • SeongJae Park's avatar
      include/linux/damon.h: fix kernel-doc comments for 'damon_callback' · d2f272b3
      SeongJae Park authored
      A few Kernel-doc comments in 'damon.h' are broken.  This fixes them.
      
      Link: https://lkml.kernel.org/r/20210917123958.3819-5-sj@kernel.org
      Signed-off-by: SeongJae Park <sjpark@amazon.de>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d2f272b3
    • SeongJae Park's avatar
      docs/vm/damon: remove broken reference · 876d0aac
      SeongJae Park authored
      Building DAMON documents warns for a reference to nonexisting doc, as
      below:
      
          $ time make htmldocs
          [...]
          Documentation/vm/damon/index.rst:24: WARNING: toctree contains reference to nonexisting document 'vm/damon/plans'
      
      This fixes the warning by removing the wrong reference.
      
      Link: https://lkml.kernel.org/r/20210917123958.3819-4-sj@kernel.org
      Signed-off-by: SeongJae Park <sjpark@amazon.de>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      876d0aac
    • SeongJae Park's avatar
      MAINTAINERS: update SeongJae's email address · f9803a99
      SeongJae Park authored
      This updates SeongJae's email address in MAINTAINERS file to his
      preferred one.
      
      Link: https://lkml.kernel.org/r/20210917123958.3819-3-sj@kernel.org
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: SeongJae Park <sjpark@amazon.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f9803a99
    • SeongJae Park's avatar
      Documentation/vm: move user guides to admin-guide/mm/ · ad782c48
      SeongJae Park authored
      Most memory management user guide documents are in 'admin-guide/mm/',
      but two of those are in 'vm/'.  This moves the two docs into
      'admin-guide/mm/' so that the documents are easier to find.
      
      Link: https://lkml.kernel.org/r/20210917123958.3819-2-sj@kernel.org
      Signed-off-by: SeongJae Park <sjpark@amazon.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ad782c48
    • Geert Uytterhoeven's avatar
      mm/damon: grammar s/works/work/ · f24b0626
      Geert Uytterhoeven authored
      Correct a singular versus plural grammar mistake in the help text for
      the DAMON_VADDR config symbol.
      
      Link: https://lkml.kernel.org/r/20210914073451.3883834-1-geert@linux-m68k.org
      Fixes: 3f49584b ("mm/damon: implement primitives for the virtual memory address spaces")
      Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
      Reviewed-by: SeongJae Park <sjpark@amazon.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f24b0626
    • Marco Elver's avatar
      kfence: default to dynamic branch instead of static keys mode · 4f612ed3
      Marco Elver authored
      We have observed that on very large machines with newer CPUs, the static
      key/branch switching delay is on the order of milliseconds.  This is due
      to the required broadcast IPIs, which simply do not scale well to
      hundreds of CPUs (cores).  If done too frequently, this can adversely
      affect tail latencies of various workloads.
      
      One workaround is to increase the sample interval to several seconds,
      while decreasing sampled allocation coverage, but the problem still
      exists and could still increase tail latencies.
      
      As already noted in the Kconfig help text, there are trade-offs: at
      lower sample intervals the dynamic branch results in better performance;
      however, at very large sample intervals, the static keys mode can result
      in better performance -- careful benchmarking is recommended.
      
      Our initial benchmarking showed that with large enough sample intervals
      and workloads stressing the allocator, the static keys mode was slightly
      better.  Evaluating and observing the possible system-wide side-effects
      of the static-key-switching induced broadcast IPIs, however, was a blind
      spot (in particular on large machines with 100s of cores).
      
      Therefore, a major downside of the static keys mode is, unfortunately,
      that it is hard to predict performance on new system architectures and
      topologies, and also hard to draw conclusions about the performance of new
      workloads based on a limited set of benchmarks.
      
      Most distributions will simply select the defaults, while targeting a
      large variety of different workloads and system architectures.  As such,
      the better default is CONFIG_KFENCE_STATIC_KEYS=n, and re-enabling it is
      only recommended after careful evaluation.
      
      For reference, on x86-64 the condition in kfence_alloc() generates
      exactly 2 instructions in the kmem_cache_alloc() fast-path:
      
       | ...
       | cmpl   $0x0,0x1a8021c(%rip)  # ffffffff82d560d0 <kfence_allocation_gate>
       | je     ffffffff812d6003      <kmem_cache_alloc+0x243>
       | ...
      
      which, given kfence_allocation_gate is infrequently modified, should be
      well predicted by most CPUs.
      
      Link: https://lkml.kernel.org/r/20211019102524.2807208-2-elver@google.com
      Signed-off-by: Marco Elver <elver@google.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Jann Horn <jannh@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4f612ed3
    • Marco Elver's avatar
      kfence: always use static branches to guard kfence_alloc() · 07e8481d
      Marco Elver authored
      Regardless of KFENCE mode (CONFIG_KFENCE_STATIC_KEYS: either using
      static keys to gate allocations, or using a simple dynamic branch),
      always use a static branch to avoid the dynamic branch in kfence_alloc()
      if KFENCE was disabled at boot.
      
      For CONFIG_KFENCE_STATIC_KEYS=n, this now avoids the dynamic branch if
      KFENCE was disabled at boot.
      
      To simplify, this also unifies the location where kfence_allocation_gate
      is read-checked to just be inline in kfence_alloc().
      
      Link: https://lkml.kernel.org/r/20211019102524.2807208-1-elver@google.com
      Signed-off-by: Marco Elver <elver@google.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Jann Horn <jannh@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      07e8481d
    • Marco Elver's avatar
      kfence: shorten critical sections of alloc/free · 49332956
      Marco Elver authored
      Initializing memory and setting/checking the canary bytes is relatively
      expensive, and doing so in the meta->lock critical sections extends the
      duration with preemption and interrupts disabled unnecessarily.
      
      Any reads to meta->addr and meta->size in kfence_guarded_alloc() and
      kfence_guarded_free() don't require locking meta->lock as long as the
      object is removed from the freelist: only kfence_guarded_alloc() sets
      meta->addr and meta->size after removing it from the freelist, which
      requires a preceding kfence_guarded_free() returning it to the list or
      the initial state.
      
      Therefore move reads to meta->addr and meta->size, including expensive
      memory initialization using them, out of meta->lock critical sections.
      
      Link: https://lkml.kernel.org/r/20210930153706.2105471-1-elver@google.com
      Signed-off-by: Marco Elver <elver@google.com>
      Acked-by: Alexander Potapenko <glider@google.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Jann Horn <jannh@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      49332956
    • Marco Elver's avatar
      kfence: test: use kunit_skip() to skip tests · f51733e2
      Marco Elver authored
      Use the new kunit_skip() to skip tests if requirements were not met.  It
      makes it easier to see in KUnit's summary if there were skipped tests.
      
      Link: https://lkml.kernel.org/r/20210922182541.1372400-1-elver@google.com
      Signed-off-by: Marco Elver <elver@google.com>
      Reviewed-by: David Gow <davidgow@google.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Aleksandr Nogikh <nogikh@google.com>
      Cc: Taras Madan <tarasmadan@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f51733e2
    • Marco Elver's avatar
      kfence: add note to documentation about skipping covered allocations · 5cc906b4
      Marco Elver authored
      Add a note briefly mentioning the new policy about "skipping currently
      covered allocations if pool close to full." Since this has a notable
      impact on KFENCE's bug-detection ability on systems with large uptimes,
      it is worth pointing out the feature.
      
      Link: https://lkml.kernel.org/r/20210923104803.2620285-5-elver@google.com
      Signed-off-by: Marco Elver <elver@google.com>
      Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
      Acked-by: Alexander Potapenko <glider@google.com>
      Cc: Aleksandr Nogikh <nogikh@google.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Taras Madan <tarasmadan@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Marco Elver's avatar
      kfence: limit currently covered allocations when pool nearly full · 08f6b106
      Marco Elver authored
      One of KFENCE's main design principles is that with increasing uptime,
      allocation coverage increases sufficiently to detect previously
      undetected bugs.
      
      We have observed that frequent long-lived allocations of the same source
      (e.g. pagecache) tend to permanently fill up the KFENCE pool with
      increasing system uptime, thus breaking the above requirement.  The
      workaround thus far had been increasing the sample interval and/or
      increasing the KFENCE pool size, but this is not a reliable solution.
      
      To ensure diverse coverage of allocations, limit currently covered
      allocations of the same source once pool utilization reaches 75%
      (configurable via `kfence.skip_covered_thresh`) or above.  The effect is
      retaining reasonable allocation coverage when the pool is close to full.
      
      A side-effect is that this also limits frequent long-lived allocations
      of the same source filling up the pool permanently.
      
      Uniqueness of an allocation for coverage purposes is based on its
      (partial) allocation stack trace (the source).  A Counting Bloom filter
      is used to check if an allocation is covered; if the allocation is
      currently covered, the allocation is skipped by KFENCE.
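
The coverage check described above can be sketched in userspace C as follows. This is a minimal illustration, not the kernel implementation: the filter size, the number of probes, the `next_hash()` mixing step, and every function name here are assumptions made for the example. The 75% threshold is passed in as a parameter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define FILTER_ORDER 8                      /* assumed: 2^8 = 256 counters */
#define FILTER_SIZE  (1u << FILTER_ORDER)
#define FILTER_MASK  (FILTER_SIZE - 1)
#define FILTER_HNUM  2                      /* assumed: two probes per key */

/* One counter per slot, so entries can be removed again on free. */
static int counters[FILTER_SIZE];

/* Hypothetical mixing step: multiplicative hash keeping the top bits. */
static uint32_t next_hash(uint32_t h)
{
	return (h * 0x61C88647u) >> (32 - FILTER_ORDER);
}

/* Add (delta = 1) or remove (delta = -1) one allocation of this source. */
static void covered_add(uint32_t stack_hash, int delta)
{
	for (int i = 0; i < FILTER_HNUM; i++) {
		uint32_t idx = stack_hash & FILTER_MASK;

		counters[idx] += delta;
		/* Removing a source that was never added would be a bug. */
		assert(counters[idx] >= 0);
		stack_hash = next_hash(stack_hash);
	}
}

/* True if every probed counter is non-zero (false positives are possible). */
static bool covered_contains(uint32_t stack_hash)
{
	for (int i = 0; i < FILTER_HNUM; i++) {
		if (counters[stack_hash & FILTER_MASK] == 0)
			return false;
		stack_hash = next_hash(stack_hash);
	}
	return true;
}

/* Skip only when the pool is at or above the threshold and the source is covered. */
static bool should_skip(uint32_t stack_hash, unsigned int used,
			unsigned int total, unsigned int thresh_percent)
{
	return used * 100 >= total * thresh_percent &&
	       covered_contains(stack_hash);
}
```

In this sketch a caller would invoke `covered_add(hash, 1)` when an allocation is placed in the pool and `covered_add(hash, -1)` when it is freed. Counters rather than single bits are what make that removal possible.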
      
      Testing was done using:
      
      	(a) a synthetic workload that performs frequent long-lived
      	    allocations (default config values; sample_interval=1;
      	    num_objects=63), and
      
      	(b) normal desktop workloads on an otherwise idle machine where
      	    the problem was first reported after a few days of uptime
      	    (default config values).
      
      In both test cases the sampled allocation rate no longer drops to zero
      at any point.  In the case of (b) we observe (after 2 days uptime) 15%
      unique allocations in the pool, 77% pool utilization, with 20% "skipped
      allocations (covered)".
      
      [elver@google.com: simplify and just use hash_32(), use more random stack_hash_seed]
        Link: https://lkml.kernel.org/r/YU3MRGaCaJiYht5g@elver.google.com
      [elver@google.com: fix 32 bit]
      
      Link: https://lkml.kernel.org/r/20210923104803.2620285-4-elver@google.com
      Signed-off-by: Marco Elver <elver@google.com>
      Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
      Acked-by: Alexander Potapenko <glider@google.com>
      Cc: Aleksandr Nogikh <nogikh@google.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Taras Madan <tarasmadan@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Marco Elver's avatar
      kfence: move saving stack trace of allocations into __kfence_alloc() · a9ab52bb
      Marco Elver authored
      Move the saving of the stack trace of allocations into __kfence_alloc(),
      so that the stack entries array can be used outside of
      kfence_guarded_alloc() and we avoid potentially unwinding the stack
      multiple times.
      
      Link: https://lkml.kernel.org/r/20210923104803.2620285-3-elver@google.com
      Signed-off-by: Marco Elver <elver@google.com>
      Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
      Acked-by: Alexander Potapenko <glider@google.com>
      Cc: Aleksandr Nogikh <nogikh@google.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Taras Madan <tarasmadan@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Marco Elver's avatar
      kfence: count unexpectedly skipped allocations · 9a19aeb5
      Marco Elver authored
      Maintain a counter to count allocations that are skipped due to being
      incompatible (oversized, incompatible gfp flags) or due to lack of capacity.
      
      This is to compute the fraction of allocations that could not be
      serviced by KFENCE, which we expect to be rare.
      
      Link: https://lkml.kernel.org/r/20210923104803.2620285-2-elver@google.com
      Signed-off-by: Marco Elver <elver@google.com>
      Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
      Acked-by: Alexander Potapenko <glider@google.com>
      Cc: Aleksandr Nogikh <nogikh@google.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Taras Madan <tarasmadan@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Marco Elver's avatar
      stacktrace: move filter_irq_stacks() to kernel/stacktrace.c · f39f21b3
      Marco Elver authored
      filter_irq_stacks() has little to do with the stackdepot implementation,
      except that it is usually used by users (such as KASAN) of stackdepot to
      reduce the stack trace.
      
      However, filter_irq_stacks() itself is not useful without a stack trace
      as obtained by stack_trace_save() and friends.
      
      Therefore, move filter_irq_stacks() to kernel/stacktrace.c, so that new
      users of filter_irq_stacks() do not have to start depending on
      STACKDEPOT only for filter_irq_stacks().
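
For readers unfamiliar with the helper, the trimming it performs on a saved trace can be sketched in userspace C. This assumes the usual behaviour of cutting the trace just after the first frame that lies in interrupt-entry code. The address range and the names `in_irq_entry_text()` and `trim_irq_frames()` are hypothetical stand-ins, not kernel symbols.

```c
/* Hypothetical address range standing in for the IRQ entry text section. */
#define IRQ_TEXT_START 0x1000UL
#define IRQ_TEXT_END   0x2000UL

/* Non-zero if the address falls inside the assumed IRQ entry range. */
static int in_irq_entry_text(unsigned long addr)
{
	return addr >= IRQ_TEXT_START && addr < IRQ_TEXT_END;
}

/*
 * Return the number of entries to keep: everything up to and including
 * the first IRQ entry frame, or the whole trace if there is none.
 */
static unsigned int trim_irq_frames(const unsigned long *entries,
				    unsigned int nr_entries)
{
	for (unsigned int i = 0; i < nr_entries; i++) {
		if (in_irq_entry_text(entries[i]))
			return i + 1;
	}
	return nr_entries;
}
```

The function only needs an array of return addresses and its length, which is why it has no inherent dependency on a stack depot.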
      
      Link: https://lkml.kernel.org/r/20210923104803.2620285-1-elver@google.com
      Signed-off-by: Marco Elver <elver@google.com>
      Acked-by: Dmitry Vyukov <dvyukov@google.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Aleksandr Nogikh <nogikh@google.com>
      Cc: Taras Madan <tarasmadan@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Mianhan Liu's avatar
      include/linux/mm.h: move nr_free_buffer_pages from swap.h to mm.h · a1554c00
      Mianhan Liu authored
      nr_free_buffer_pages could be exposed through mm.h instead of swap.h.
      The advantage of this change is that it can remove obsolete
      includes.  For example, net/ipv4/tcp.c wouldn't need swap.h any more
      since it has already included mm.h.  Similarly, after checking all the
      other files, it turns out that tcp.c, udp.c, meter.c, ... follow the same
      rule, so these files can have swap.h removed too.
      
      Moreover, after preprocessing all the files that use
      nr_free_buffer_pages, it turns out that those files have already
      included mm.h.  Thus, we can move nr_free_buffer_pages from swap.h to mm.h
      safely.  This change will not affect the compilation of other files.
      
      Link: https://lkml.kernel.org/r/20210912133640.1624-1-liumh1@shanghaitech.edu.cn
      Signed-off-by: Mianhan Liu <liumh1@shanghaitech.edu.cn>
      Cc: Jakub Kicinski <kuba@kernel.org>
      CC: Ulf Hansson <ulf.hansson@linaro.org>
      Cc: "David S . Miller" <davem@davemloft.net>
      Cc: Simon Horman <horms@verge.net.au>
      Cc: Pravin B Shelar <pshelar@ovn.org>
      Cc: Vlad Yasevich <vyasevich@gmail.com>
      Cc: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>