1. 16 Oct, 2023 13 commits
    • acpi, hmat: refactor hmat_register_target_initiators() · d0376aac
      Huang Ying authored
      Previously, hmat_register_target_initiators(), which is called during
      memory onlining, both calculated the performance attributes and created
      the corresponding sysfs links and files.
      
      But now, to calculate the abstract distance of a memory target before
      memory onlining, we need to calculate the performance attributes for a
      memory target without creating sysfs links and files.
      
      To do that, hmat_register_target_initiators() is refactored to make it
      possible to calculate performance attributes separately.
      
      Link: https://lkml.kernel.org/r/20230926060628.265989-3-ying.huang@intel.com
      Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Reviewed-by: Alistair Popple <apopple@nvidia.com>
      Tested-by: Alistair Popple <apopple@nvidia.com>
      Tested-by: Bharata B Rao <bharata@amd.com>
      Reviewed-by: Dave Jiang <dave.jiang@intel.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Wei Xu <weixugc@google.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Rafael J Wysocki <rafael.j.wysocki@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • memory tiering: add abstract distance calculation algorithms management · 07a8bdd4
      Huang Ying authored
      Patch series "memory tiering: calculate abstract distance based on ACPI
      HMAT", v4.
      
      We have the explicit memory tiers framework to manage systems with
      multiple types of memory, e.g., DRAM in DIMM slots and CXL memory devices.
      Memory devices of the same kind are grouped into memory types, which are
      then put into memory tiers.  To describe the performance of a memory type,
      abstract distance is defined.  It is directly proportional to the memory
      latency and inversely proportional to the memory bandwidth.  To keep the
      code as simple as possible, a fixed abstract distance is used in dax/kmem
      to describe slow memory such as Optane DCPMM.
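
      As a rough illustration of the proportionality described above, the
      sketch below scales a baseline distance by relative latency and
      relative bandwidth.  The struct, the function name and the baseline
      value 512 are assumptions made for this example; they are not taken
      from the series and may not match what the kernel actually computes.

```c
#include <assert.h>

/* Hypothetical types and names for illustration; not the kernel's API. */
struct mem_perf {
    unsigned int latency_ns;      /* access latency in nanoseconds */
    unsigned int bandwidth_mbps;  /* access bandwidth in MB/s */
};

/* Arbitrary abstract distance assigned to the baseline memory type. */
#define BASELINE_ADISTANCE 512

/*
 * Scale the baseline distance up with latency and down with bandwidth,
 * both relative to the baseline memory type.
 */
static int perf_to_adistance(const struct mem_perf *baseline,
                             const struct mem_perf *perf)
{
    unsigned long long dist = BASELINE_ADISTANCE;

    assert(baseline->latency_ns != 0 && perf->bandwidth_mbps != 0);

    dist = dist * perf->latency_ns / baseline->latency_ns;
    dist = dist * baseline->bandwidth_mbps / perf->bandwidth_mbps;
    return (int)dist;
}
```

      With this sketch, memory with twice the baseline latency ends up twice
      as far away, and memory with twice the baseline bandwidth ends up half
      as far away.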
      
      To support more memory types, this series adds the abstract distance
      calculation algorithm management mechanism, provides an algorithm
      implementation based on ACPI HMAT, and uses the general abstract distance
      calculation interface in the dax/kmem driver.  So, dax/kmem can support
      HBM (high bandwidth memory) in addition to the original Optane DCPMM.
      
      
      This patch (of 4):
      
      The abstract distance may be calculated by various drivers, such as ACPI
      HMAT, CXL CDAT, etc., and it may be used by various code that hot-adds
      memory nodes, such as dax/kmem.  To decouple the algorithm users from
      the providers, the abstract distance calculation algorithm management
      mechanism is implemented in this patch.  It provides an interface for
      the providers to register their implementations, and an interface for
      the users.
      
      Multiple algorithm implementations can cooperate via calculating abstract
      distance for different memory nodes.  The preference of algorithm
      implementations can be specified via priority (notifier_block.priority).
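
      The registration and priority scheme described above can be pictured
      with a small userspace sketch.  It is not the kernel notifier API: the
      struct, the function names, the two sample providers and their return
      values are all invented here, and the sketch assumes that the first
      provider able to handle a node ends the walk.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for a notifier block; not the kernel type. */
struct adist_provider {
    int priority;                       /* higher value is asked first */
    int (*calc)(int node, int *adist);  /* returns 1 if it handled the node */
    struct adist_provider *next;
};

static struct adist_provider *adist_chain;

/* Insert the provider so the chain stays sorted by descending priority. */
static void adist_register(struct adist_provider *p)
{
    struct adist_provider **pp = &adist_chain;

    while (*pp && (*pp)->priority >= p->priority)
        pp = &(*pp)->next;
    p->next = *pp;
    *pp = p;
}

/* Ask providers in priority order; stop at the first one that answers. */
static int adist_calculate(int node, int *adist)
{
    struct adist_provider *p;

    for (p = adist_chain; p != NULL; p = p->next)
        if (p->calc(node, adist))
            return 1;
    return 0;
}

/* Two made-up providers that cover overlapping sets of nodes. */
static int firmware_table_calc(int node, int *adist)
{
    if (node > 1)
        return 0;
    *adist = 600;
    return 1;
}

static int device_table_calc(int node, int *adist)
{
    if (node < 1 || node > 2)
        return 0;
    *adist = 900;
    return 1;
}
```

      In this sketch both providers know about node 1, and the one registered
      with the higher priority supplies the answer; nodes known to only one
      provider are answered by that provider regardless of priority.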
      
      Link: https://lkml.kernel.org/r/20230926060628.265989-1-ying.huang@intel.com
      Link: https://lkml.kernel.org/r/20230926060628.265989-2-ying.huang@intel.com
      Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Tested-by: Bharata B Rao <bharata@amd.com>
      Reviewed-by: Alistair Popple <apopple@nvidia.com>
      Reviewed-by: Dave Jiang <dave.jiang@intel.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Wei Xu <weixugc@google.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Rafael J Wysocki <rafael.j.wysocki@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/hugetlb: replace page_ref_freeze() with folio_ref_freeze() in hugetlb_folio_init_vmemmap() · a48bf7b4
      Sidhartha Kumar authored
      No functional difference: folio_ref_freeze() is currently a wrapper for
      page_ref_freeze().
      
      Link: https://lkml.kernel.org/r/20230926174433.81241-1-sidhartha.kumar@oracle.com
      Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com>
      Reviewed-by: Muchun Song <songmuchun@bytedance.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Usama Arif <usama.arif@bytedance.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/filemap: remove hugetlb special casing in filemap.c · a08c7193
      Sidhartha Kumar authored
      Remove special cased hugetlb handling code within the page cache by
      changing the granularity of ->index to the base page size rather than the
      huge page size.  The motivation of this patch is to reduce complexity
      within the filemap code while also increasing performance by removing
      branches that are evaluated on every page cache lookup.
      
      To support the change in index, new wrappers for hugetlb page cache
      interactions are added.  These wrappers perform the conversion to a linear
      index which is now expected by the page cache for huge pages.
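
      The index conversion mentioned above amounts to a change of units.  The
      sketch below is a userspace illustration only: the helper names are
      made up, and a 4 KiB base page size is assumed.

```c
#include <assert.h>

#define BASE_PAGE_SHIFT 12  /* assumes 4 KiB base pages */

/* Convert an index counted in huge pages to one counted in base pages. */
static unsigned long huge_to_linear_index(unsigned long huge_idx,
                                          unsigned int huge_page_shift)
{
    assert(huge_page_shift >= BASE_PAGE_SHIFT);
    return huge_idx << (huge_page_shift - BASE_PAGE_SHIFT);
}

/* Convert a byte offset in the file to an index counted in base pages. */
static unsigned long offset_to_linear_index(unsigned long long file_offset)
{
    return (unsigned long)(file_offset >> BASE_PAGE_SHIFT);
}
```

      For example, the fourth 2MB huge page of a file (huge index 3) starts
      at linear index 1536, since each 2MB page spans 512 base pages.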
      
      ========================= PERFORMANCE ======================================
      
      Perf was used to check the performance differences after the patch.
      Overall the performance is similar to mainline, with a slightly larger
      overhead in __filemap_add_folio() and hugetlb_add_to_page_cache().  This
      is because of the larger overhead in xa_load() and xa_store(), as the
      xarray now uses more entries to store hugetlb folios in the page cache.
      
      Timing
      
      aarch64
          2MB Page Size
              6.5-rc3 + this patch:
                  [root@sidhakum-ol9-1 hugepages]# time fallocate -l 700GB test.txt
                  real    1m49.568s
                  user    0m0.000s
                  sys     1m49.461s
      
              6.5-rc3:
                  [root]# time fallocate -l 700GB test.txt
                  real    1m47.495s
                  user    0m0.000s
                  sys     1m47.370s
          1GB Page Size
              6.5-rc3 + this patch:
                  [root@sidhakum-ol9-1 hugepages1G]# time fallocate -l 700GB test.txt
                  real    1m47.024s
                  user    0m0.000s
                  sys     1m46.921s
      
              6.5-rc3:
                  [root@sidhakum-ol9-1 hugepages1G]# time fallocate -l 700GB test.txt
                  real    1m44.551s
                  user    0m0.000s
                  sys     1m44.438s
      
      x86
          2MB Page Size
              6.5-rc3 + this patch:
                  [root@sidhakum-ol9-2 hugepages]# time fallocate -l 100GB test.txt
                  real    0m22.383s
                  user    0m0.000s
                  sys     0m22.255s
      
              6.5-rc3:
                  [opc@sidhakum-ol9-2 hugepages]$ time sudo fallocate -l 100GB /dev/hugepages/test.txt
                  real    0m22.735s
                  user    0m0.038s
                  sys     0m22.567s
      
          1GB Page Size
              6.5-rc3 + this patch:
                  [root@sidhakum-ol9-2 hugepages1GB]# time fallocate -l 100GB test.txt
                  real    0m25.786s
                  user    0m0.001s
                  sys     0m25.589s
      
              6.5-rc3:
                  [root@sidhakum-ol9-2 hugepages1G]# time fallocate -l 100GB test.txt
                  real    0m33.454s
                  user    0m0.001s
                  sys     0m33.193s
      
      aarch64:
          workload - fallocate a 700GB file backed by huge pages
      
          6.5-rc3 + this patch:
              2MB Page Size:
                  --100.00%--__arm64_sys_fallocate
                                ksys_fallocate
                                vfs_fallocate
                                hugetlbfs_fallocate
                                |
                                |--95.04%--__pi_clear_page
                                |
                                |--3.57%--clear_huge_page
                                |          |
                                |          |--2.63%--rcu_all_qs
                                |          |
                                |           --0.91%--__cond_resched
                                |
                                 --0.67%--__cond_resched
                  0.17%     0.00%             0  fallocate  [kernel.vmlinux]       [k] hugetlb_add_to_page_cache
                  0.14%     0.10%            11  fallocate  [kernel.vmlinux]       [k] __filemap_add_folio
      
          6.5-rc3
              2MB Page Size:
                      --100.00%--__arm64_sys_fallocate
                                ksys_fallocate
                                vfs_fallocate
                                hugetlbfs_fallocate
                                |
                                |--94.91%--__pi_clear_page
                                |
                                |--4.11%--clear_huge_page
                                |          |
                                |          |--3.00%--rcu_all_qs
                                |          |
                                |           --1.10%--__cond_resched
                                |
                                 --0.59%--__cond_resched
                  0.08%     0.01%             1  fallocate  [kernel.kallsyms]  [k] hugetlb_add_to_page_cache
                  0.05%     0.03%             3  fallocate  [kernel.kallsyms]  [k] __filemap_add_folio
      
      x86
          workload - fallocate a 100GB file backed by huge pages
      
          6.5-rc3 + this patch:
              2MB Page Size:
                  hugetlbfs_fallocate
                  |
                  --99.57%--clear_huge_page
                      |
                      --98.47%--clear_page_erms
                          |
                          --0.53%--asm_sysvec_apic_timer_interrupt
      
                  0.04%     0.04%             1  fallocate  [kernel.kallsyms]     [k] xa_load
                  0.04%     0.00%             0  fallocate  [kernel.kallsyms]     [k] hugetlb_add_to_page_cache
                  0.04%     0.00%             0  fallocate  [kernel.kallsyms]     [k] __filemap_add_folio
                  0.04%     0.00%             0  fallocate  [kernel.kallsyms]     [k] xas_store
      
          6.5-rc3
              2MB Page Size:
                      --99.93%--__x64_sys_fallocate
                                vfs_fallocate
                                hugetlbfs_fallocate
                                |
                                 --99.38%--clear_huge_page
                                           |
                                           |--98.40%--clear_page_erms
                                           |
                                            --0.59%--__cond_resched
                  0.03%     0.03%             1  fallocate  [kernel.kallsyms]  [k] __filemap_add_folio
      
      ========================= TESTING ======================================
      
      This patch passes libhugetlbfs tests and LTP hugetlb tests
      
      ********** TEST SUMMARY
      *                      2M
      *                      32-bit 64-bit
      *     Total testcases:   110    113
      *             Skipped:     0      0
      *                PASS:   107    113
      *                FAIL:     0      0
      *    Killed by signal:     3      0
      *   Bad configuration:     0      0
      *       Expected FAIL:     0      0
      *     Unexpected PASS:     0      0
      *    Test not present:     0      0
      * Strange test result:     0      0
      **********
      
          Done executing testcases.
          LTP Version:  20220527-178-g2761a81c4
      
      page migration was also tested using Mike Kravetz's test program.[8]
      
      [dan.carpenter@linaro.org: fix an NULL vs IS_ERR() bug]
        Link: https://lkml.kernel.org/r/1772c296-1417-486f-8eef-171af2192681@moroto.mountain
      Link: https://lkml.kernel.org/r/20230926192017.98183-1-sidhartha.kumar@oracle.com
      Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com>
      Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
      Reported-and-tested-by: syzbot+c225dea486da4d5592bd@syzkaller.appspotmail.com
      Closes: https://syzkaller.appspot.com/bug?extid=c225dea486da4d5592bd
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/ksm: test case for prctl fork/exec workflow · 0374af1d
      Stefan Roesch authored
      This adds a new test case to the ksm functional tests to make sure that
      the KSM setting is inherited by the child process when doing a fork/exec.
      
      Link: https://lkml.kernel.org/r/20230922211141.320789-3-shr@devkernel.io
      Signed-off-by: Stefan Roesch <shr@devkernel.io>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Cc: Carl Klemm <carl@uvos.xyz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Rik van Riel <riel@surriel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/ksm: support fork/exec for prctl · 3c6f33b7
      Stefan Roesch authored
      Patch series "mm/ksm: add fork-exec support for prctl", v4.
      
      A process can enable KSM with the prctl system call.  When the process is
      forked, the KSM flag is inherited by the child process.  However, if the
      process executes an exec system call directly after the fork, the KSM
      setting is cleared.  This patch series addresses this problem.
      
      1) Change the mask in coredump.h for execing a new process
      2) Add a new test case in ksm_functional_tests
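
      The message only says that a mask in coredump.h is changed.  Assuming
      that mask selects which per-process flags survive into the new address
      space, the effect can be sketched as below.  All flag names and bit
      positions here are invented for the illustration and do not match the
      kernel's definitions.

```c
#include <assert.h>

/* Bit positions are made up for this sketch; the kernel's differ. */
#define FLAG_DUMPABLE    (1UL << 0)
#define FLAG_MERGE_ANY   (1UL << 1)   /* "KSM enabled for the whole process" */
#define FLAG_TRANSIENT   (1UL << 2)   /* something that must not survive exec */

/* Mask before the fix: the KSM bit is not listed, so exec drops it. */
#define INIT_MASK_OLD    (FLAG_DUMPABLE)
/* Mask after the fix: the KSM bit is carried into the new address space. */
#define INIT_MASK_NEW    (FLAG_DUMPABLE | FLAG_MERGE_ANY)

/* Flags of the new address space are the old flags filtered by the mask. */
static unsigned long flags_after_exec(unsigned long old_flags,
                                      unsigned long init_mask)
{
    return old_flags & init_mask;
}
```

      Under that assumption, adding the KSM bit to the mask is enough for the
      setting to survive, while flags left out of the mask are still dropped.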
      
      
      This patch (of 2):
      
      Today we have two ways to enable KSM:
      
      1) madvise system call
         This allows enabling KSM for a memory region and has been available
         for a long time.

      2) prctl system call
         This is a recent addition to enable KSM for the complete process.
         In addition, when a process is forked, the KSM setting is inherited.
      
      This change only affects the second case.
      
      One of the use cases for (2) was to support the ability to enable
      KSM for cgroups. This allows systemd to enable KSM for the seed
      process. By enabling it in the seed process all child processes inherit
      the setting.
      
      This works correctly when the process is forked.  However, it doesn't
      support the fork/exec workflow.
      
      From the previous cover letter:
      
      ....
      Use case 3:
      With the madvise call, sharing opportunities are only enabled for the
      current process: it is a workload-local decision. A considerable number
      of sharing opportunities may exist across multiple workloads or jobs
      (if they are part of the same security domain). Only a higher level
      entity like a job scheduler or container can know for certain if it's
      running one or more instances of a job. That job scheduler however
      doesn't have the necessary internal workload knowledge to make targeted
      madvise calls.
      ....
      
      In addition it can also be a bit surprising that fork keeps the KSM
      setting and fork/exec does not.
      
      Link: https://lkml.kernel.org/r/20230922211141.320789-1-shr@devkernel.io
      Link: https://lkml.kernel.org/r/20230922211141.320789-2-shr@devkernel.io
      Signed-off-by: Stefan Roesch <shr@devkernel.io>
      Fixes: d7597f59 ("mm: add new api to enable ksm per process")
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Reported-by: Carl Klemm <carl@uvos.xyz>
      Tested-by: Carl Klemm <carl@uvos.xyz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Rik van Riel <riel@surriel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/damon/core: remove unnecessary si_meminfo invoke. · 987ffa5a
      Huan Yang authored
      si_meminfo() reads and assigns more information than just the free and
      RAM page counts.  For DAMOS_WMARK_FREE_MEM_RATE, getting only the free
      and RAM page counts is enough and saves CPU time.
      
      Link: https://lkml.kernel.org/r/20230920015727.4482-1-link@vivo.com
      Signed-off-by: Huan Yang <link@vivo.com>
      Reviewed-by: SeongJae Park <sj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • sched/numa, mm: make numa migrate functions to take a folio · 8c9ae56d
      Kefeng Wang authored
      The cpupid (or access time) is stored in the head page for THP, so it is
      safe to make should_numa_migrate_memory() and numa_hint_fault_latency()
      take a folio.  This is in preparation for large folio NUMA balancing.
      
      Link: https://lkml.kernel.org/r/20230921074417.24004-7-wangkefeng.wang@huawei.com
      Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: mempolicy: make mpol_misplaced() to take a folio · 75c70128
      Kefeng Wang authored
      In preparation for large folio NUMA balancing, make mpol_misplaced()
      take a folio.  No functional change intended.
      
      Link: https://lkml.kernel.org/r/20230921074417.24004-6-wangkefeng.wang@huawei.com
      Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: memory: make numa_migrate_prep() to take a folio · cda6d936
      Kefeng Wang authored
      In preparation for large folio NUMA balancing, make numa_migrate_prep()
      take a folio.  No functional change intended.
      
      Link: https://lkml.kernel.org/r/20230921074417.24004-5-wangkefeng.wang@huawei.com
      Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: memory: use a folio in do_numa_page() · 6695cf68
      Kefeng Wang authored
      NUMA balancing only tries to migrate non-compound pages in
      do_numa_page(), so use a folio there to save several compound_head()
      calls.  Note that we use folio_estimated_sharers(), which is enough to
      check the folio sharers since only normal pages are handled; if large
      folio NUMA balancing is supported, a precise folio sharers check would be
      used.  No functional change intended.
      
      Link: https://lkml.kernel.org/r/20230921074417.24004-4-wangkefeng.wang@huawei.com
      Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: huge_memory: use a folio in do_huge_pmd_numa_page() · 667ffc31
      Kefeng Wang authored
      Use a folio in do_huge_pmd_numa_page(), reducing three page_folio() calls
      to one.  No functional change intended.
      
      Link: https://lkml.kernel.org/r/20230921074417.24004-3-wangkefeng.wang@huawei.com
      Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: memory: add vm_normal_folio_pmd() · 65610453
      Kefeng Wang authored
      Patch series "mm: convert numa balancing functions to use a folio", v2.
      
      do_numa_page() only handles non-compound pages, and only PMD-mapped THPs
      are handled in do_huge_pmd_numa_page().  But large, PTE-mapped folios
      will be supported, so let's convert more NUMA balancing functions to
      use/take a folio in preparation for that.  No functional change intended
      for now.
      
      
      This patch (of 6):
      
      The new vm_normal_folio_pmd() wrapper is similar to vm_normal_folio().
      It allows callers to completely replace their struct page variables
      with struct folio variables.
      
      Link: https://lkml.kernel.org/r/20230921074417.24004-1-wangkefeng.wang@huawei.com
      Link: https://lkml.kernel.org/r/20230921074417.24004-2-wangkefeng.wang@huawei.com
      Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  2. 06 Oct, 2023 13 commits
  3. 04 Oct, 2023 14 commits
    • mm: mlock: update mlock_pte_range to handle large folio · dc68badc
      Yin Fengwei authored
      The current kernel only locks base-size folios during the mlock syscall.
      Add large folio support with the following rules:
        - Only mlock a large folio when it is in a VM_LOCKED VMA range
          and fully mapped to the page table.

          A fully mapped folio is required because, if the folio is not
          fully mapped to a VM_LOCKED VMA and the system is under memory
          pressure, page reclaim is allowed to pick up this folio, split it
          and reclaim the pages which are not in a VM_LOCKED VMA.

        - munlock will apply to a large folio which is in the VMA range
          or crosses the VMA boundary.

          This is required to handle the case where the large folio is
          mlocked and the VMA is later split in the middle of the large folio.
      
      Link: https://lkml.kernel.org/r/20230918073318.1181104-4-fengwei.yin@intel.com
      Signed-off-by: Yin Fengwei <fengwei.yin@intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Ryan Roberts <ryan.roberts@arm.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Yosry Ahmed <yosryahmed@google.com>
      Cc: Yu Zhao <yuzhao@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: handle large folio when large folio in VM_LOCKED VMA range · 1acbc3f9
      Yin Fengwei authored
      If a large folio is in the range of a VM_LOCKED VMA, it should be mlocked
      to avoid being picked by page reclaim, which may split the large folio
      and then mlock each page again.

      Mlock this kind of large folio to prevent it from being picked by page
      reclaim.

      For a large folio which crosses the boundary of a VM_LOCKED VMA or is not
      fully mapped to a VM_LOCKED VMA, we'd better not mlock it.  So if the
      system is under memory pressure, this kind of large folio will be split
      and the pages out of the VM_LOCKED VMA can be reclaimed.

      Ideally, for a large folio, we should mlock it when the large folio is
      fully mapped to the VMA and munlock it if any page is unmapped from the
      VMA.  But it's not easy to detect whether the large folio is fully mapped
      to the VMA in some cases (like add/remove rmap).  So we update
      mlock_vma_folio() and munlock_vma_folio() to mlock/munlock the folio
      according to vma->vm_flags, and let the caller decide whether to call
      these two functions.

      For add rmap, only mlock normal 4K folios and postpone large folio
      handling to the page reclaim phase.  It is possible to reuse the page
      table iterator to detect whether a folio is fully mapped or not during
      the page reclaim phase.  For remove rmap, invoke munlock_vma_folio() to
      munlock the folio unconditionally because removing the rmap makes the
      folio not fully mapped to the VMA.
      
      Link: https://lkml.kernel.org/r/20230918073318.1181104-3-fengwei.yin@intel.com
      Signed-off-by: Yin Fengwei <fengwei.yin@intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Ryan Roberts <ryan.roberts@arm.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Yosry Ahmed <yosryahmed@google.com>
      Cc: Yu Zhao <yuzhao@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: add functions folio_in_range() and folio_within_vma() · 28e56657
      Yin Fengwei authored
      Patch series "support large folio for mlock", v3.
      
      Yu mentioned at [1] that mlock() can't be applied to large folios.

      I studied the related code and here is my understanding:

      - For RLIMIT_MEMLOCK, there is no problem, because the RLIMIT_MEMLOCK
        statistics are not tied to the underlying pages.  That means mlocking
        or munlocking the underlying pages doesn't impact the RLIMIT_MEMLOCK
        statistics collection, which is always correct.

      - For keeping the page in RAM, there is no problem either.  At least
        during try_to_unmap_one(), once it detects that the VMA has the
        VM_LOCKED bit set in vm_flags, the folio is kept whether the folio is
        mlocked or not.

      So mlock functions correctly for large folios.  But it's not optimized,
      because page reclaim needs to scan these large folios and may split them.
      
      This series classifies large folios for mlock into four types:
        - The large folio is in the VM_LOCKED range and fully mapped to the
          range

        - The large folio is in the VM_LOCKED range but not fully mapped to
          the range

        - The large folio crosses the VM_LOCKED VMA boundary

        - The large folio crosses the last level page table boundary
      
      For the first type, we mlock the large folio so page reclaim will skip it.

      For the second/third type, we don't mlock the large folio.  As the pages
      not mapped to the VM_LOCKED range are mapped to a non-VM_LOCKED range, if
      the system is under memory pressure, the large folio can be picked by
      page reclaim and split.  Then the pages not mapped to the VM_LOCKED range
      can be reclaimed.

      For the fourth type, we don't mlock the large folio because locking one
      page table lock can't prevent the part in another last level page table
      from being unmapped.  Thanks to Ryan for pointing this out.
      
      
      To check whether the folio is fully mapped to the range, the PTEs need to
      be checked to see whether each page of the folio is associated.  This
      requires taking the page table lock and is a heavy operation.  So far,
      the only places that need this check are madvise and page reclaim.  These
      functions already have their own PTE iterator.

      patch1 introduces an API to check whether a large folio is in the VMA range.
      patch2 makes page reclaim/mlock_vma_folio/munlock_vma_folio support
             large folio mlock/munlock.
      patch3 makes the mlock/munlock syscalls support large folios.
      
      Yu also mentioned a race which can make a folio unevictable after munlock
      during the RFC v2 discussion [3].
      We decided that the race issue didn't block this series because:
        - The race issue was not introduced by this series

        - We had a looks-ok fix for the race issue, but need to wait
          for the mlock_count fixing patch as Yosry Ahmed suggested [4]
      
      [1] https://lore.kernel.org/linux-mm/CAOUHufbtNPkdktjt_5qM45GegVO-rCFOMkSh0HQminQ12zsV8Q@mail.gmail.com/
      [2] https://lore.kernel.org/linux-mm/20230809061105.3369958-1-fengwei.yin@intel.com/
      [3] https://lore.kernel.org/linux-mm/CAOUHufZ6=9P_=CAOQyw0xw-3q707q-1FVV09dBNDC-hpcpj2Pg@mail.gmail.com/
      
      
      This patch (of 3):
      
      folio_in_range() will be used to check whether the folio is mapped to a
      specific VMA and whether the mapping address of the folio is in the range.

      Also add a helper function folio_within_vma(), based on folio_in_range(),
      to check whether the folio is in the range of the VMA.
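
      The relationship between the two helpers can be illustrated with a
      simplified userspace sketch.  It works on plain virtual addresses and
      ignores everything else the kernel functions must consider (for example
      whether the folio belongs to the VMA at all); the type and function
      names below are invented for the example.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a VMA: the half-open range [start, end). */
struct addr_range {
    unsigned long start;
    unsigned long end;
};

/*
 * Return true if the mapping [addr, addr + len) lies entirely inside
 * [start, end) after that range has been clipped to the VMA.
 */
static bool span_in_range(unsigned long addr, unsigned long len,
                          const struct addr_range *vma,
                          unsigned long start, unsigned long end)
{
    if (start < vma->start)
        start = vma->start;
    if (end > vma->end)
        end = vma->end;
    if (start >= end)
        return false;

    /* Written to avoid overflow in addr + len. */
    return addr >= start && addr <= end && len <= end - addr;
}

/* The VMA-wide check is the range check applied to the whole VMA. */
static bool span_within_vma(unsigned long addr, unsigned long len,
                            const struct addr_range *vma)
{
    return span_in_range(addr, len, vma, vma->start, vma->end);
}
```

      In this sketch a 2MB mapping that starts 1MB before the end of the VMA
      is reported as not within the VMA, which corresponds to the "crosses
      the VMA boundary" case in the cover letter.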
      
      Link: https://lkml.kernel.org/r/20230918073318.1181104-1-fengwei.yin@intel.com
      Link: https://lkml.kernel.org/r/20230918073318.1181104-2-fengwei.yin@intel.com
      Signed-off-by: Yin Fengwei <fengwei.yin@intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Ryan Roberts <ryan.roberts@arm.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Yosry Ahmed <yosryahmed@google.com>
      Cc: Yu Zhao <yuzhao@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/damon/core-test: fix memory leak in damon_new_ctx() · a0ce7925
      Jinjie Ruan authored
      When CONFIG_DAMON_KUNIT_TEST=y, CONFIG_DEBUG_KMEMLEAK=y and
      CONFIG_DEBUG_KMEMLEAK_AUTO_SCAN=y are set, the memory leak below is
      detected.

      The damon_ctx allocated by kzalloc() in damon_new_ctx() is not freed in
      damon_test_ops_registration() and damon_test_set_attrs().  So use
      damon_destroy_ctx() to free it.  After applying this patch, the
      following memory leak is no longer detected:
      
          unreferenced object 0xffff2b49c6968800 (size 512):
            comm "kunit_try_catch", pid 350, jiffies 4294895294 (age 557.028s)
            hex dump (first 32 bytes):
              88 13 00 00 00 00 00 00 a0 86 01 00 00 00 00 00  ................
              00 87 93 03 00 00 00 00 0a 00 00 00 00 00 00 00  ................
            backtrace:
              [<0000000088e71769>] slab_post_alloc_hook+0xb8/0x368
              [<0000000073acab3b>] __kmem_cache_alloc_node+0x174/0x290
              [<00000000b5f89cef>] kmalloc_trace+0x40/0x164
              [<00000000eb19e83f>] damon_new_ctx+0x28/0xb4
              [<00000000daf6227b>] damon_test_ops_registration+0x34/0x328
              [<00000000559c4801>] kunit_try_run_case+0x50/0xac
              [<000000003932ed49>] kunit_generic_run_threadfn_adapter+0x20/0x2c
              [<000000003c3e9211>] kthread+0x124/0x130
              [<0000000028f85bdd>] ret_from_fork+0x10/0x20
          unreferenced object 0xffff2b49c1a9cc00 (size 512):
            comm "kunit_try_catch", pid 356, jiffies 4294895306 (age 557.000s)
            hex dump (first 32 bytes):
              88 13 00 00 00 00 00 00 a0 86 01 00 00 00 00 00  ................
              00 00 00 00 00 00 00 00 0a 00 00 00 00 00 00 00  ................
            backtrace:
              [<0000000088e71769>] slab_post_alloc_hook+0xb8/0x368
              [<0000000073acab3b>] __kmem_cache_alloc_node+0x174/0x290
              [<00000000b5f89cef>] kmalloc_trace+0x40/0x164
              [<00000000eb19e83f>] damon_new_ctx+0x28/0xb4
              [<00000000058495c4>] damon_test_set_attrs+0x30/0x1a8
              [<00000000559c4801>] kunit_try_run_case+0x50/0xac
              [<000000003932ed49>] kunit_generic_run_threadfn_adapter+0x20/0x2c
              [<000000003c3e9211>] kthread+0x124/0x130
              [<0000000028f85bdd>] ret_from_fork+0x10/0x20
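
      The shape of the fix can be sketched in user space as below.  All
      sketch_* names are hypothetical stand-ins (calloc()/free() replace the
      kernel allocator, and the counter plays the role kmemleak plays in the
      report above); this is an illustration of pairing every allocation
      with a destroy call, not the code in the patch.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical context type standing in for struct damon_ctx. */
struct sketch_ctx {
    unsigned long sample_interval_us;
};

/* Number of contexts that are allocated but not yet freed. */
int sketch_live_ctxs;

struct sketch_ctx *sketch_new_ctx(void)
{
    struct sketch_ctx *ctx = calloc(1, sizeof(*ctx));

    if (ctx)
        sketch_live_ctxs++;
    return ctx;
}

void sketch_destroy_ctx(struct sketch_ctx *ctx)
{
    if (!ctx)
        return;
    assert(sketch_live_ctxs > 0);
    free(ctx);
    sketch_live_ctxs--;
}

/* A test body in the fixed shape: the context is destroyed before returning. */
int sketch_test_body(void)
{
    struct sketch_ctx *ctx = sketch_new_ctx();

    if (!ctx)
        return -1;
    ctx->sample_interval_us = 5000;
    /* ... the test's checks on ctx would go here ... */
    sketch_destroy_ctx(ctx);
    return 0;
}
```

      Dropping the sketch_destroy_ctx() call leaves the counter at one after
      the test body returns, which is the user-space analogue of the
      unreferenced objects reported above.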
      
      Link: https://lkml.kernel.org/r/20230918120951.2230468-3-ruanjinjie@huawei.com
      Fixes: d1836a3b ("mm/damon/core-test: initialise context before test in damon_test_set_attrs()")
      Fixes: 4f540f5a ("mm/damon/core-test: add a kunit test case for ops registration")
      Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com>
      Reviewed-by: Feng Tang <feng.tang@intel.com>
      Reviewed-by: SeongJae Park <sj@kernel.org>
      Cc: Brendan Higgins <brendan.higgins@linux.dev>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      a0ce7925
    • Jinjie Ruan's avatar
      mm/damon/core-test: fix memory leak in damon_new_region() · f950fa6e
      Jinjie Ruan authored
      Patch series "mm/damon/core-test: Fix memory leaks in core-test", v3.
      
      There are a few memory leaks in core-test which are detected by kmemleak. 
      This patchset fixes the issues.
      
      
      This patch (of 2):
      
      With CONFIG_DAMON_KUNIT_TEST=y, CONFIG_DEBUG_KMEMLEAK=y and
      CONFIG_DEBUG_KMEMLEAK_AUTO_SCAN=y, the memory leak below is detected.
      
      The damon_region objects which damon_new_region() allocates with
      kmem_cache_alloc() in damon_test_regions() and
      damon_test_update_monitoring_result() are not freed.
      
      So for damon_test_regions(), replace the damon_del_region() call with
      damon_destroy_region(), which calls both damon_del_region() and
      damon_free_region(); the latter frees the damon_region.  For
      damon_test_update_monitoring_result(), call damon_free_region() to
      free it.  After applying this patch, the following memory leak is no
      longer detected.
      
          unreferenced object 0xffff2b49c3edc000 (size 56):
            comm "kunit_try_catch", pid 338, jiffies 4294895280 (age 557.084s)
            hex dump (first 32 bytes):
              01 00 00 00 00 00 00 00 02 00 00 00 00 00 00 00  ................
              00 00 00 00 00 00 00 00 00 00 00 00 49 2b ff ff  ............I+..
            backtrace:
              [<0000000088e71769>] slab_post_alloc_hook+0xb8/0x368
              [<00000000b528f67c>] kmem_cache_alloc+0x168/0x284
              [<000000008603f022>] damon_new_region+0x28/0x54
              [<00000000a3b8c64e>] damon_test_regions+0x38/0x270
              [<00000000559c4801>] kunit_try_run_case+0x50/0xac
              [<000000003932ed49>] kunit_generic_run_threadfn_adapter+0x20/0x2c
              [<000000003c3e9211>] kthread+0x124/0x130
              [<0000000028f85bdd>] ret_from_fork+0x10/0x20
          unreferenced object 0xffff2b49c5b20000 (size 56):
            comm "kunit_try_catch", pid 354, jiffies 4294895304 (age 556.988s)
            hex dump (first 32 bytes):
              03 00 00 00 00 00 00 00 07 00 00 00 00 00 00 00  ................
              00 00 00 00 00 00 00 00 96 00 00 00 49 2b ff ff  ............I+..
            backtrace:
              [<0000000088e71769>] slab_post_alloc_hook+0xb8/0x368
              [<00000000b528f67c>] kmem_cache_alloc+0x168/0x284
              [<000000008603f022>] damon_new_region+0x28/0x54
              [<00000000ca019f80>] damon_test_update_monitoring_result+0x18/0x34
              [<00000000559c4801>] kunit_try_run_case+0x50/0xac
              [<000000003932ed49>] kunit_generic_run_threadfn_adapter+0x20/0x2c
              [<000000003c3e9211>] kthread+0x124/0x130
              [<0000000028f85bdd>] ret_from_fork+0x10/0x20
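
      The difference between "delete" and "destroy" that this fix relies on
      can be sketched with a small user-space list.  The sketch_* names, the
      singly linked list and the live-object counter are assumptions made
      for illustration; they mirror only the split described above (unlink
      versus unlink plus free), not the kernel's data structures.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical region on a singly linked list; not struct damon_region. */
struct sketch_region {
    unsigned long start, end;
    struct sketch_region *next;
};

/* Number of regions that are allocated but not yet freed. */
int sketch_live_regions;

struct sketch_region *sketch_new_region(unsigned long start, unsigned long end)
{
    struct sketch_region *r = malloc(sizeof(*r));

    if (!r)
        return NULL;
    r->start = start;
    r->end = end;
    r->next = NULL;
    sketch_live_regions++;
    return r;
}

/* Unlink only: the caller still owns the memory afterwards. */
void sketch_del_region(struct sketch_region **head, struct sketch_region *r)
{
    struct sketch_region **pp = head;

    while (*pp && *pp != r)
        pp = &(*pp)->next;
    if (*pp)
        *pp = r->next;
    r->next = NULL;
}

/* Release the memory of a region that is no longer on any list. */
void sketch_free_region(struct sketch_region *r)
{
    assert(r != NULL);
    free(r);
    sketch_live_regions--;
}

/* Unlink and free in one call. */
void sketch_destroy_region(struct sketch_region **head, struct sketch_region *r)
{
    sketch_del_region(head, r);
    sketch_free_region(r);
}
```

      A caller that stops after sketch_del_region() still holds a live
      object, which corresponds to the leak in damon_test_regions().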
      
      Link: https://lkml.kernel.org/r/20230918120951.2230468-1-ruanjinjie@huawei.com
      Link: https://lkml.kernel.org/r/20230918120951.2230468-2-ruanjinjie@huawei.com
      Fixes: 17ccae8b ("mm/damon: add kunit tests")
      Fixes: f4c978b6 ("mm/damon/core-test: add a test for damon_update_monitoring_results()")
      Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com>
      Reviewed-by: SeongJae Park <sj@kernel.org>
      Cc: Brendan Higgins <brendan.higgins@linux.dev>
      Cc: Feng Tang <feng.tang@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      f950fa6e
    • Jianguo Bao's avatar
      mm/writeback: update filemap_dirty_folio() comment · ab428b4c
      Jianguo Bao authored
      Change to use new address space operation dirty_folio().
      
      Link: https://lkml.kernel.org/r/20230917-trycontrib1-v1-1-db22630b8839@gmail.com
      Fixes: 6f31a5a2 ("fs: Add aops->dirty_folio")
      Signed-off-by: Jianguo Bau <roidinev@gmail.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      ab428b4c
    • SeongJae Park's avatar
      Docs/ABI/damon: update for DAMOS apply intervals · d57d36b5
      SeongJae Park authored
      Update DAMON ABI document for the newly added DAMON sysfs file for DAMOS
      apply intervals (apply_interval_us file).
      
      Link: https://lkml.kernel.org/r/20230916020945.47296-10-sj@kernel.org
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      d57d36b5
    • SeongJae Park's avatar
      Docs/admin-guide/mm/damon/usage: update for DAMOS apply intervals · 033343d5
      SeongJae Park authored
      Update DAMON usage document's DAMON sysfs interface section for the newly
      added DAMOS apply intervals support (apply_interval_us file).
      
      Link: https://lkml.kernel.org/r/20230916020945.47296-9-sj@kernel.org
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      033343d5
    • SeongJae Park's avatar
      selftests/damon/sysfs: test DAMOS apply intervals · 65ded14e
      SeongJae Park authored
      Update DAMON selftests to test existence of the file for reading/writing
      DAMOS apply interval under each scheme directory.
      
      Link: https://lkml.kernel.org/r/20230916020945.47296-8-sj@kernel.org
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      65ded14e
    • SeongJae Park's avatar
      mm/damon/sysfs-schemes: support DAMOS apply interval · a2a9f68e
      SeongJae Park authored
      Update DAMON sysfs interface to support DAMOS apply intervals by adding a
      new file, 'apply_interval_us' in each scheme directory.  Users can set and
      get the interval for each scheme in microseconds by writing to and reading
      from the file.
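
      Setting and getting the interval from a program could look roughly
      like the sketch below.  The helper names are hypothetical, and the
      caller supplies the path; assuming the usual DAMON sysfs layout it
      would be something like
      /sys/kernel/mm/damon/admin/kdamonds/0/contexts/0/schemes/0/apply_interval_us,
      but that path is not stated in this commit message.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Write an interval in microseconds to the file at path.
 * Returns 0 on success, -1 on failure.
 */
int sketch_write_interval_us(const char *path, unsigned long interval_us)
{
    FILE *f = fopen(path, "w");
    int ok;

    if (!f)
        return -1;
    ok = fprintf(f, "%lu\n", interval_us) > 0;
    if (fclose(f) != 0)
        ok = 0;
    return ok ? 0 : -1;
}

/*
 * Read an interval in microseconds from the file at path into *interval_us.
 * Returns 0 on success, -1 on failure.
 */
int sketch_read_interval_us(const char *path, unsigned long *interval_us)
{
    FILE *f;
    int ok;

    assert(interval_us != NULL);
    f = fopen(path, "r");
    if (!f)
        return -1;
    ok = fscanf(f, "%lu", interval_us) == 1;
    fclose(f);
    return ok ? 0 : -1;
}
```

      Writing to the real sysfs file normally requires root privileges, so
      the helpers take an arbitrary path and can be exercised on a regular
      file first.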
      
      Link: https://lkml.kernel.org/r/20230916020945.47296-7-sj@kernel.org
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      a2a9f68e
    • SeongJae Park's avatar
      Docs/mm/damon/design: document DAMOS apply intervals · 3f8723f1
      SeongJae Park authored
      Update the DAMON design doc to explain DAMOS apply intervals.
      
      Link: https://lkml.kernel.org/r/20230916020945.47296-6-sj@kernel.org
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      3f8723f1
    • SeongJae Park's avatar
      mm/damon/core: implement scheme-specific apply interval · 42f994b7
      SeongJae Park authored
      DAMON-based operation schemes are applied for every aggregation interval.
      That was mainly because schemes were using nr_accesses, which becomes
      complete enough to be used only at every aggregation interval.  However,
      the schemes are now using nr_accesses_bp, which is updated for each
      sampling interval in a way that is reasonable to be used.  Therefore,
      there is no reason to apply schemes only for each aggregation interval.
      
      The unnecessary alignment with the aggregation interval was also making
      some use cases of DAMOS tricky.  Setting quotas under a long aggregation
      interval is one such example.  Suppose the aggregation interval is ten
      seconds, and there is a scheme having a CPU quota of 100ms per 1s.  The
      scheme will actually use 100ms per ten seconds, since it cannot be
      applied before the next aggregation interval.  The feature is working as
      intended, but the results might not be that intuitive for some users.
      This could be fixed by updating the quota to 1s per 10s.  But, in that
      case, the CPU usage of DAMOS could look like spikes, and would actually
      have a bad effect on other CPU-sensitive workloads.
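
      The arithmetic of this example can be written down as a small helper.
      It is a simplified model, assuming the scheme can spend at most its
      quota each time it is applied and is applied once per period; the
      function name and the basis-point unit of the result are choices made
      for this sketch only.

```c
#include <assert.h>

/*
 * Upper bound on the CPU time share, in basis points (1/10,000), of a scheme
 * that may spend at most quota_ms each time it is applied and is applied
 * once every apply_period_ms.
 */
unsigned long sketch_cpu_share_bp(unsigned long quota_ms,
                                  unsigned long apply_period_ms)
{
    assert(apply_period_ms > 0);
    return quota_ms * 10000UL / apply_period_ms;
}
```

      Under this model, 100ms per 1s is a 10% share, but the same 100ms
      quota applied only once per ten seconds is a 1% share.  Raising the
      quota to 1s per 10s restores the 10% average, at the cost of spending
      up to one second in a single burst.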
      
      Implement a dedicated timing interval for each DAMON-based operation
      scheme, namely apply_interval.  The interval will be sampling interval
      aligned, and each scheme will be applied for its apply_interval.  The
      interval is set to 0 by default, and it means the scheme should use the
      aggregation interval instead.  This avoids old users getting any
      behavioral difference.
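
      The fallback and alignment rules described above can be sketched as
      follows.  The struct, the function names, the round-down division and
      the minimum of one sampling interval are assumptions made for this
      illustration; the commit message states only that zero means "use the
      aggregation interval" and that the interval is sampling interval
      aligned.

```c
#include <assert.h>

/* Hypothetical subset of the monitoring timing attributes. */
struct sketch_attrs {
    unsigned long sample_interval_us;
    unsigned long aggr_interval_us;
};

/* Interval a scheme effectively uses: 0 falls back to the aggregation interval. */
unsigned long sketch_effective_apply_interval_us(const struct sketch_attrs *attrs,
                                                 unsigned long apply_interval_us)
{
    return apply_interval_us ? apply_interval_us : attrs->aggr_interval_us;
}

/*
 * Number of sampling intervals between two applications of a scheme.
 * Rounds down to a whole number of sampling intervals, with a minimum of
 * one (both are assumptions of this sketch).
 */
unsigned long sketch_apply_period_in_samples(const struct sketch_attrs *attrs,
                                             unsigned long apply_interval_us)
{
    unsigned long interval;
    unsigned long nr_samples;

    assert(attrs->sample_interval_us > 0);
    interval = sketch_effective_apply_interval_us(attrs, apply_interval_us);
    nr_samples = interval / attrs->sample_interval_us;
    return nr_samples ? nr_samples : 1;
}
```

      With a 5ms sampling interval and a 100ms aggregation interval, a
      scheme that leaves the interval at zero is applied every 20 sampling
      intervals, as before the change, while a scheme with a 1s interval is
      applied every 200.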
      
      Link: https://lkml.kernel.org/r/20230916020945.47296-5-sj@kernel.org
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      42f994b7
    • SeongJae Park's avatar
      mm/damon/core: use nr_accesses_bp as a source of damos_before_apply tracepoint · a72217ad
      SeongJae Park authored
      damos_before_apply tracepoint is exposing access rate of DAMON regions
      using nr_accesses field of regions, which was actually used by DAMOS in
      the past.  However, it has changed to use nr_accesses_bp instead.  Update
      the tracepoint to expose the value that DAMOS is really using.
      
      Note that it doesn't expose the value as is in the basis point, but after
      converting it to the natural number by dividing it by 10,000.  Therefore
      this change doesn't make user-visible behavioral differences.
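
      The conversion mentioned above is a plain integer division.  A minimal
      sketch, with hypothetical helper names and assuming unsigned int
      values:

```c
#include <assert.h>
#include <limits.h>

#define SKETCH_BP_PER_UNIT 10000U /* basis points per one access */

/* Convert an access count in basis points to a natural number (truncating). */
unsigned int sketch_nr_accesses_from_bp(unsigned int nr_accesses_bp)
{
    return nr_accesses_bp / SKETCH_BP_PER_UNIT;
}

/* Inverse direction, shown only to make the unit explicit. */
unsigned int sketch_nr_accesses_to_bp(unsigned int nr_accesses)
{
    assert(nr_accesses <= UINT_MAX / SKETCH_BP_PER_UNIT);
    return nr_accesses * SKETCH_BP_PER_UNIT;
}
```

      Because the division truncates, any fractional part carried by the
      basis-point value is dropped, so the exposed value stays a whole
      number of accesses.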
      
      Link: https://lkml.kernel.org/r/20230916020945.47296-4-sj@kernel.org
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      a72217ad
    • SeongJae Park's avatar
      mm/damon/sysfs-schemes: use nr_accesses_bp as the source of tried_regions/<N>/nr_accesses · e7639bb4
      SeongJae Park authored
      DAMON sysfs interface exposes access rate of each region via DAMOS tried
      regions directory.  For this, the nr_accesses field of the region is used.
      DAMOS was actually using nr_accesses in the past, but it uses
      nr_accesses_bp now.  Use the value that it is really using as the source.
      
      Note that this doesn't expose nr_accesses_bp as is (in basis point), but
      after converting it to the natural number by dividing the value by 10,000.
      Hence there is no behavioral change from users' perspective.
      
      Link: https://lkml.kernel.org/r/20230916020945.47296-3-sj@kernel.org
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      e7639bb4