1. 09 Sep, 2024 9 commits
  2. 04 Sep, 2024 31 commits
    • mm: memory_hotplug: unify Huge/LRU/non-LRU movable folio isolation · 6f1833b8
      Kefeng Wang authored
      Use isolate_folio_to_list() to unify hugetlb/LRU/non-LRU folio
      isolation, which cleans up the code a bit and saves a few calls to
      compound_head().
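
      A minimal sketch of the unified call site in do_migrate_range(),
      assuming the bool-returning isolate_folio_to_list() helper introduced
      by this series (simplified, not the exact diff):

          folio = page_folio(page);

          /*
           * One path for every folio type: the helper internally
           * dispatches to hugetlb, non-LRU movable or LRU isolation.
           */
          if (!isolate_folio_to_list(folio, &source))
                  pr_warn("failed to isolate pfn %lx\n", pfn);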
      
      [wangkefeng.wang@huawei.com: various fixes]
        Link: https://lkml.kernel.org/r/20240829150500.2599549-1-wangkefeng.wang@huawei.com
      Link: https://lkml.kernel.org/r/20240827114728.3212578-6-wangkefeng.wang@huawei.com
      Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Dan Carpenter <dan.carpenter@linaro.org>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
      Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: migrate: add isolate_folio_to_list() · f1264e95
      Kefeng Wang authored
      Add an isolate_folio_to_list() helper to try to isolate HugeTLB,
      non-LRU movable and LRU folios to a list, which will be reused by
      do_migrate_range() from memory hotplug soon.  Also drop
      mf_isolate_folio(), since we can use the new helper directly in
      soft_offline_in_use_page().
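
      A sketch of the helper's dispatch shape, per the description above
      (simplified; exact locking and statistics handling may differ):

          /* Try to isolate a hugetlb, non-LRU movable or LRU folio onto @list. */
          bool isolate_folio_to_list(struct folio *folio, struct list_head *list)
          {
                  bool isolated, lru;

                  if (folio_test_hugetlb(folio))
                          return isolate_hugetlb(folio, list);

                  lru = !__folio_test_movable(folio);
                  if (lru)
                          isolated = folio_isolate_lru(folio);
                  else
                          isolated = isolate_movable_page(&folio->page,
                                                          ISOLATE_UNEVICTABLE);
                  if (!isolated)
                          return false;

                  list_add(&folio->lru, list);
                  if (lru)
                          node_stat_add_folio(folio, NR_ISOLATED_ANON +
                                              folio_is_file_lru(folio));
                  return true;
          }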
      
      Link: https://lkml.kernel.org/r/20240827114728.3212578-5-wangkefeng.wang@huawei.com
      Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Acked-by: Miaohe Lin <linmiaohe@huawei.com>
      Tested-by: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Dan Carpenter <dan.carpenter@linaro.org>
      Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
      Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: memory_hotplug: check hwpoisoned page firstly in do_migrate_range() · e8a796fa
      Kefeng Wang authored
      Commit b15c8726 ("hwpoison, memory_hotplug: allow hwpoisoned pages to
      be offlined") did not handle hugetlb pages, so the endless loop could
      still occur when offlining a hwpoisoned hugetlb page.  Luckily, after
      commit e591ef7d ("mm, hwpoison,hugetlb,memory_hotplug: hotremove memory
      section with hwpoisoned hugepage"), the HPageMigratable flag of a
      hwpoisoned hugetlb page is cleared, and such a page is skipped in
      scan_movable_pages(), so the endless loop issue is fixed.

      However, if the HPageMigratable() check passes (without a reference and
      lock), the hugetlb page may still be hwpoisoned.  This causes no real
      issue, since the hwpoisoned page is handled correctly in the next
      movable pages scan loop: it is isolated in do_migrate_range() but then
      fails to migrate.  In order to avoid the unnecessary isolation and to
      unify all hwpoisoned page handling, unconditionally check hwpoison
      first, and if the page is a hwpoisoned hugetlb page, try to unmap it as
      a catch-all safety net, as is done for normal pages.
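
      A sketch of the reordered check at the top of the do_migrate_range()
      folio loop (simplified; the unmap_poisoned_folio() signature and the
      TTU flags here are assumptions):

          folio = page_folio(page);

          /*
           * Handle hwpoison first and skip isolation/migration for
           * poisoned folios entirely, hugetlb included.
           */
          if (folio_test_hwpoison(folio) ||
              (folio_test_large(folio) && folio_test_has_hwpoisoned(folio))) {
                  if (WARN_ON(folio_test_lru(folio)))
                          folio_isolate_lru(folio);
                  if (folio_mapped(folio))
                          unmap_poisoned_folio(folio, TTU_IGNORE_MLOCK);
                  continue;
          }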
      
      Link: https://lkml.kernel.org/r/20240827114728.3212578-4-wangkefeng.wang@huawei.com
      Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Dan Carpenter <dan.carpenter@linaro.org>
      Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
      Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: memory-failure: add unmap_poisoned_folio() · 16038c4f
      Kefeng Wang authored
      Add an unmap_poisoned_folio() helper, which will be reused by
      do_migrate_range() from memory hotplug soon.
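
      A sketch of the helper's likely shape, assuming it factors the existing
      hwpoison unmap logic into a try_to_unmap() wrapper with hugetlb-aware
      i_mmap locking (a simplification, not the exact extraction):

          void unmap_poisoned_folio(struct folio *folio, enum ttu_flags ttu)
          {
                  if (folio_test_hugetlb(folio) && !folio_test_anon(folio)) {
                          /*
                           * Unmapping a file-backed hugetlb folio requires
                           * holding the i_mmap rwsem for write.
                           */
                          struct address_space *mapping;

                          mapping = hugetlb_folio_mapping_lock_write(folio);
                          if (!mapping)
                                  return;

                          try_to_unmap(folio, ttu | TTU_RMAP_LOCKED);
                          i_mmap_unlock_write(mapping);
                  } else {
                          try_to_unmap(folio, ttu);
                  }
          }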
      
      [akpm@linux-foundation.org: whitespace tweak, per Miaohe Lin]
        Link: https://lkml.kernel.org/r/1f80c7e3-c30d-1ac1-6a36-d1a5f5907f7c@huawei.com
      Link: https://lkml.kernel.org/r/20240827114728.3212578-3-wangkefeng.wang@huawei.com
      Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Acked-by: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Dan Carpenter <dan.carpenter@linaro.org>
      Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
      Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: memory_hotplug: remove head variable in do_migrate_range() · b62b51d2
      Kefeng Wang authored
      Patch series "mm: memory_hotplug: improve do_migrate_range()", v3.
      
      Unify hwpoisoned page handling and the isolation of HugeTLB/LRU/non-LRU
      movable pages, and convert do_migrate_range() to use folios.
      
      
      This patch (of 5):
      
      Directly use a folio for HugeTLB and THP when calculating the next pfn,
      then remove the now-unused head variable.
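
      A sketch of the folio-based pfn advance (simplified):

          struct folio *folio = page_folio(page);

          /*
           * Skip the tail pages of a large folio in one step by
           * advancing pfn to the folio's last page; no separate
           * 'head' page variable is needed.
           */
          if (folio_test_large(folio))
                  pfn = folio_pfn(folio) + folio_nr_pages(folio) - 1;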
      
      Link: https://lkml.kernel.org/r/20240827114728.3212578-1-wangkefeng.wang@huawei.com
      Link: https://lkml.kernel.org/r/20240827114728.3212578-2-wangkefeng.wang@huawei.com
      Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Cc: Dan Carpenter <dan.carpenter@linaro.org>
      Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
      Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/damon/tests: add .kunitconfig file for DAMON kunit tests · f66ac836
      SeongJae Park authored
      The '--kunitconfig' option of 'kunit.py run' supports the '.kunitconfig'
      file name convention.  Add the file for DAMON kunit tests to make kunit
      runs more convenient.
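
      An illustrative .kunitconfig and run command; the exact option set of
      the real file is an assumption:

          # mm/damon/tests/.kunitconfig (illustrative)
          CONFIG_KUNIT=y
          CONFIG_DAMON=y
          CONFIG_DAMON_VADDR=y
          CONFIG_DAMON_PADDR=y

      With the file in place, the suite can then be run with:

          ./tools/testing/kunit/kunit.py run --kunitconfig=mm/damon/tests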
      
      Link: https://lkml.kernel.org/r/20240827030336.7930-10-sj@kernel.org
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Cc: Brendan Higgins <brendanhiggins@google.com>
      Cc: David Gow <davidgow@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/damon: move kunit tests to tests/ subdirectory with _kunit suffix · 9bfbaa5e
      SeongJae Park authored
      There was a discussion about better places for kunit test code[1] and
      the test file name suffix[2].  Following the conclusion, move the kunit
      tests for DAMON to the mm/damon/tests/ subdirectory and rename them
      accordingly.
      
      [1] https://lore.kernel.org/CABVgOS=pUdWb6NDHszuwb1HYws4a1-b1UmN=i8U_ED7HbDT0mg@mail.gmail.com
      [2] https://lore.kernel.org/CABVgOSmKwPq7JEpHfS6sbOwsR0B-DBDk_JP-ZD9s9ZizvpUjbQ@mail.gmail.com
      
      Link: https://lkml.kernel.org/r/20240827030336.7930-9-sj@kernel.org
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Cc: Brendan Higgins <brendanhiggins@google.com>
      Cc: David Gow <davidgow@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/damon/dbgfs-test: skip dbgfs_set_init_regions() test if PADDR is not registered · 61879eed
      SeongJae Park authored
      The test depends on the registration of DAMON_OPS_PADDR, which is
      registered only when CONFIG_DAMON_PADDR is set.  DAMON core kunit tests
      do fake ops registration for such cases; however, the functions for
      such fake ops registration are not available to the DAMON debugfs
      interface.  Just skip the test in that case.
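
      A sketch of the skip check, using DAMON's existing
      damon_is_registered_ops() helper and the standard kunit_skip() API:

          if (!damon_is_registered_ops(DAMON_OPS_PADDR))
                  kunit_skip(test, "PADDR ops is not registered");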
      
      Link: https://lkml.kernel.org/r/20240827030336.7930-8-sj@kernel.org
      Fixes: 999b9467 ("mm/damon/dbgfs-test: fix is_target_id() change")
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Cc: Brendan Higgins <brendanhiggins@google.com>
      Cc: David Gow <davidgow@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/damon/dbgfs-test: skip dbgfs_set_targets() test if PADDR is not registered · 8e34bac5
      SeongJae Park authored
      The test depends on the registration of DAMON_OPS_PADDR, which is
      registered only when CONFIG_DAMON_PADDR is set.  DAMON core kunit tests
      do fake ops registration for such cases; however, the functions for
      such fake ops registration are not available to the DAMON debugfs
      interface.  Just skip the test in that case.
      
      Link: https://lkml.kernel.org/r/20240827030336.7930-7-sj@kernel.org
      Fixes: 999b9467 ("mm/damon/dbgfs-test: fix is_target_id() change")
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Cc: Brendan Higgins <brendanhiggins@google.com>
      Cc: David Gow <davidgow@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/damon/core-test: fix damon_test_ops_registration() for DAMON_VADDR unset case · e43772dc
      SeongJae Park authored
      The DAMON core kunit test can be executed without CONFIG_DAMON_VADDR.
      In that case, the vaddr DAMON ops is not registered, while the ops
      registration kunit test assumes it is.  Check and handle the case by
      registering fake vaddr ops inside the test code.
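
      A sketch of the check-and-register step (simplified; the fake ops'
      callbacks are left at their defaults):

          if (!damon_is_registered_ops(DAMON_OPS_VADDR)) {
                  struct damon_operations ops = { .id = DAMON_OPS_VADDR };

                  /* Register fake vaddr ops so the checks below hold. */
                  KUNIT_EXPECT_EQ(test, 0, damon_register_ops(&ops));
          }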
      
      Link: https://lkml.kernel.org/r/20240827030336.7930-6-sj@kernel.org
      Fixes: 4f540f5a ("mm/damon/core-test: add a kunit test case for ops registration")
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Cc: Brendan Higgins <brendanhiggins@google.com>
      Cc: David Gow <davidgow@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/damon/core-test: test only vaddr case on ops registration test · 9fcce7e7
      SeongJae Park authored
      The DAMON ops registration kunit test exercises both the vaddr and
      paddr use cases in parts of its test cases.  Testing only one ops use
      case is basically enough, so do the test with only the vaddr use case.
      
      Link: https://lkml.kernel.org/r/20240827030336.7930-5-sj@kernel.org
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Cc: Brendan Higgins <brendanhiggins@google.com>
      Cc: David Gow <davidgow@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • selftests/damon: add execute permissions to test scripts · 8c211412
      SeongJae Park authored
      Some test scripts are missing executable permissions.  This causes
      warnings that make the test output unnecessarily verbose.  Add the
      executable permissions.
      
      Link: https://lkml.kernel.org/r/20240827030336.7930-4-sj@kernel.org
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Cc: Brendan Higgins <brendanhiggins@google.com>
      Cc: David Gow <davidgow@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • selftests/damon: cleanup __pycache__/ with 'make clean' · 582c04b0
      SeongJae Park authored
      Python-based tests create a __pycache__/ directory.  Remove it with
      'make clean' by defining it as EXTRA_CLEAN.
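
      A one-line illustration, relying on the EXTRA_CLEAN hook provided by
      the kselftest lib.mk:

          # tools/testing/selftests/damon/Makefile
          EXTRA_CLEAN = __pycache__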
      
      Link: https://lkml.kernel.org/r/20240827030336.7930-3-sj@kernel.org
      Fixes: b5906f5f ("selftests/damon: add a test for update_schemes_tried_regions sysfs command")
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Cc: Brendan Higgins <brendanhiggins@google.com>
      Cc: David Gow <davidgow@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • selftests/damon: add access_memory_even to .gitignore · 9cb75552
      SeongJae Park authored
      Patch series "misc fixups for DAMON {self,kunit} tests".
      
      This patchset contains minor fixups for DAMON selftests and kunit
      tests.  The first three patches make DAMON selftests more cleanly
      maintained (patches 1 and 2) without unnecessary warnings (patch 3).
      The following six patches remove an unnecessary test case (patch 4),
      handle config combinations that can make tests fail (patches 5-7),
      reorganize the test files following the new guideline (patch 8), and
      add a reference kunitconfig for DAMON kunit tests (patch 9).
      
      
      This patch (of 9):
      
      DAMON selftests build access_memory_even, but it's not on the
      .gitignore list.  Add it to make the 'git status' output cleaner.
      
      Link: https://lkml.kernel.org/r/20240827030336.7930-1-sj@kernel.org
      Link: https://lkml.kernel.org/r/20240827030336.7930-2-sj@kernel.org
      Fixes: c94df805 ("selftests/damon: implement a program for even-numbered memory regions access")
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Cc: Brendan Higgins <brendanhiggins@google.com>
      Cc: David Gow <davidgow@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • sched/numa: Fix the vma scan starving issue · f22cde43
      Yujie Liu authored
      Problem statement:
      Since commit fc137c0d ("sched/numa: enhance vma scanning logic"), the
      NUMA vma scan overhead has been reduced a lot.  Meanwhile, the reduced
      vma scanning might produce less NUMA page fault information, and the
      insufficient information makes it harder for the NUMA balancer to make
      decisions.  Later, commit b7a5b537 ("sched/numa: Complete scanning of
      partial VMAs regardless of PID activity") and commit 84db47ca
      ("sched/numa: Fix mm numa_scan_seq based unconditional scan") were found
      to bring back part of the performance.
      
      Recently, when running SPECcpu omnetpp_r on a 320 CPUs/2 sockets system,
      a long duration of remote NUMA node reads was observed via PMU events: a
      few cores had ~500MB/s of remote memory access for ~20 seconds, causing
      high core-to-core variance and a performance penalty.  After
      investigation, it was found that many vmas are skipped due to the active
      PID check.  According to the trace events, in most cases
      vma_is_accessed() returns false because the history access info stored
      in the pids_active array has been cleared.
      
      Proposal:
      The main idea is to adjust vma_is_accessed() so that it returns true
      more easily: compare the diff between mm->numa_scan_seq and
      vma->numab_state->prev_scan_seq, and if the diff exceeds the threshold,
      scan the vma.
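
      A sketch of the relaxed check inside vma_is_accessed() (the exact
      threshold and placement are assumptions based on the description):

          /*
           * This vma has not been scanned for a while: relax the per-PID
           * activity filter and scan it anyway.
           */
          if ((READ_ONCE(current->mm->numa_scan_seq) -
               vma->numab_state->prev_scan_seq) >= 2)
                  return true;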
      
      This patch especially helps cases with a small number of threads, like
      the process-based SPECcpu.  Without this patch, if the SPECcpu process
      accesses the vma at the beginning and then sleeps for a long time, the
      pids_active array will be cleared.  As a result, if this process is
      woken up again, it never has a chance to set prot_none anymore, because
      vma scanning is only granted for the first two accesses:
      (current->mm->numa_scan_seq - vma->numab_state->start_scan_seq) < 2.  To
      make things worse, no other threads within the task can help set
      prot_none.  This causes information to be lost.
      
      Raghavendra helped test the current patch and got positive results on
      the AMD platform:
      
      autonumabench NUMA01
                                  base                  patched
      Amean     syst-NUMA01      194.05 (   0.00%)      165.11 *  14.92%*
      Amean     elsp-NUMA01      324.86 (   0.00%)      315.58 *   2.86%*
      
      Duration User      380345.36   368252.04
      Duration System      1358.89     1156.23
      Duration Elapsed     2277.45     2213.25
      
      autonumabench NUMA02
      
      Amean     syst-NUMA02        1.12 (   0.00%)        1.09 *   2.93%*
      Amean     elsp-NUMA02        3.50 (   0.00%)        3.56 *  -1.84%*
      
      Duration User        1513.23     1575.48
      Duration System         8.33        8.13
      Duration Elapsed       28.59       29.71
      
      kernbench
      
      Amean     user-256    22935.42 (   0.00%)    22535.19 *   1.75%*
      Amean     syst-256     7284.16 (   0.00%)     7608.72 *  -4.46%*
      Amean     elsp-256      159.01 (   0.00%)      158.17 *   0.53%*
      
      Duration User       68816.41    67615.74
      Duration System     21873.94    22848.08
      Duration Elapsed      506.66      504.55
      
      On an Intel 256 CPUs/2 sockets system, the autonuma benchmark also
      shows improvements:
      
                                                     v6.10-rc5              v6.10-rc5
                                                                               +patch
      Amean     syst-NUMA01                  245.85 (   0.00%)      230.84 *   6.11%*
      Amean     syst-NUMA01_THREADLOCAL      205.27 (   0.00%)      191.86 *   6.53%*
      Amean     syst-NUMA02                   18.57 (   0.00%)       18.09 *   2.58%*
      Amean     syst-NUMA02_SMT                2.63 (   0.00%)        2.54 *   3.47%*
      Amean     elsp-NUMA01                  517.17 (   0.00%)      526.34 *  -1.77%*
      Amean     elsp-NUMA01_THREADLOCAL       99.92 (   0.00%)      100.59 *  -0.67%*
      Amean     elsp-NUMA02                   15.81 (   0.00%)       15.72 *   0.59%*
      Amean     elsp-NUMA02_SMT               13.23 (   0.00%)       12.89 *   2.53%*
      
                         v6.10-rc5   v6.10-rc5
                                        +patch
      Duration User     1064010.16  1075416.23
      Duration System      3307.64     3104.66
      Duration Elapsed     4537.54     4604.73
      
      The SPECcpu remote node access issue disappears with the patch applied.
      
      Link: https://lkml.kernel.org/r/20240827112958.181388-1-yu.c.chen@intel.com
      Fixes: fc137c0d ("sched/numa: enhance vma scanning logic")
      Signed-off-by: Chen Yu <yu.c.chen@intel.com>
      Co-developed-by: Chen Yu <yu.c.chen@intel.com>
      Signed-off-by: Yujie Liu <yujie.liu@intel.com>
      Reported-by: Xiaoping Zhou <xiaoping.zhou@intel.com>
      Reviewed-and-tested-by: Raghavendra K T <raghavendra.kt@amd.com>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: "Chen, Tim C" <tim.c.chen@intel.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Juri Lelli <juri.lelli@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Raghavendra K T <raghavendra.kt@amd.com>
      Cc: Vincent Guittot <vincent.guittot@linaro.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • memory tier: fix deadlock warning while onlining pages · 073c78ed
      Yanfei Xu authored
      Commit 823430c8 ("memory tier: consolidate the initialization of
      memory tiers") introduced a locking change that uses guard(mutex)
      instead of mutex_lock/unlock() for memory_tier_lock.  It unexpectedly
      expanded the locked region to include hotplug_memory_notifier(); as a
      result, lockdep reports a locking dependency warning about an ABBA
      deadlock.  Exclude hotplug_memory_notifier() from the locked region to
      fix it.

      The deadlock scenario is that when a memory online event occurs, the
      execution of the memory notifier takes the read lock of
      memory_chain.rwsem, while the registration of the memory notifier in
      memory_tier_init() acquires the write lock of memory_chain.rwsem while
      holding memory_tier_lock.  The memory online event then continues to
      invoke the memory hotplug callback registered by memory_tier_init().
      Since this callback tries to acquire memory_tier_lock, a deadlock
      occurs.

      In fact, this deadlock can't happen, because memory_tier_init() always
      executes before memory online events happen, as subsys_initcall() has a
      higher priority than module_init().
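
      A sketch of the fix's shape: take the mutex only around the tier setup
      and register the hotplug notifier after dropping it (callback name from
      the report below; the setup body is elided and MEMTIER_HOTPLUG_PRIO is
      an assumption):

          static int __init memory_tier_init(void)
          {
                  mutex_lock(&memory_tier_lock);
                  /* ... set up the default memory tier (elided) ... */
                  mutex_unlock(&memory_tier_lock);

                  /*
                   * Register the notifier only after memory_tier_lock is
                   * released, so memory_chain.rwsem is never acquired while
                   * holding memory_tier_lock.
                   */
                  hotplug_memory_notifier(memtier_hotplug_callback,
                                          MEMTIER_HOTPLUG_PRIO);
                  return 0;
          }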
      
      [  133.491106] WARNING: possible circular locking dependency detected
      [  133.493656] 6.11.0-rc2+ #146 Tainted: G           O     N
      [  133.504290] ------------------------------------------------------
      [  133.515194] (udev-worker)/1133 is trying to acquire lock:
      [  133.525715] ffffffff87044e28 (memory_tier_lock){+.+.}-{3:3}, at: memtier_hotplug_callback+0x383/0x4b0
      [  133.536449]
      [  133.536449] but task is already holding lock:
      [  133.549847] ffffffff875d3310 ((memory_chain).rwsem){++++}-{3:3}, at: blocking_notifier_call_chain+0x60/0xb0
      [  133.556781]
      [  133.556781] which lock already depends on the new lock.
      [  133.556781]
      [  133.569957]
      [  133.569957] the existing dependency chain (in reverse order) is:
      [  133.577618]
      [  133.577618] -> #1 ((memory_chain).rwsem){++++}-{3:3}:
      [  133.584997]        down_write+0x97/0x210
      [  133.588647]        blocking_notifier_chain_register+0x71/0xd0
      [  133.592537]        register_memory_notifier+0x26/0x30
      [  133.596314]        memory_tier_init+0x187/0x300
      [  133.599864]        do_one_initcall+0x117/0x5d0
      [  133.603399]        kernel_init_freeable+0xab0/0xeb0
      [  133.606986]        kernel_init+0x28/0x2f0
      [  133.610312]        ret_from_fork+0x59/0x90
      [  133.613652]        ret_from_fork_asm+0x1a/0x30
      [  133.617012]
      [  133.617012] -> #0 (memory_tier_lock){+.+.}-{3:3}:
      [  133.623390]        __lock_acquire+0x2efd/0x5c60
      [  133.626730]        lock_acquire+0x1ce/0x580
      [  133.629757]        __mutex_lock+0x15c/0x1490
      [  133.632731]        mutex_lock_nested+0x1f/0x30
      [  133.635717]        memtier_hotplug_callback+0x383/0x4b0
      [  133.638748]        notifier_call_chain+0xbf/0x370
      [  133.641647]        blocking_notifier_call_chain+0x76/0xb0
      [  133.644636]        memory_notify+0x2e/0x40
      [  133.647427]        online_pages+0x597/0x720
      [  133.650246]        memory_subsys_online+0x4f6/0x7f0
      [  133.653107]        device_online+0x141/0x1d0
      [  133.655831]        online_memory_block+0x4d/0x60
      [  133.658616]        walk_memory_blocks+0xc0/0x120
      [  133.661419]        add_memory_resource+0x51d/0x6c0
      [  133.664202]        add_memory_driver_managed+0xf5/0x180
      [  133.667060]        dev_dax_kmem_probe+0x7f7/0xb40 [kmem]
      [  133.669949]        dax_bus_probe+0x147/0x230
      [  133.672687]        really_probe+0x27f/0xac0
      [  133.675463]        __driver_probe_device+0x1f3/0x460
      [  133.678493]        driver_probe_device+0x56/0x1b0
      [  133.681366]        __driver_attach+0x277/0x570
      [  133.684149]        bus_for_each_dev+0x145/0x1e0
      [  133.686937]        driver_attach+0x49/0x60
      [  133.689673]        bus_add_driver+0x2f3/0x6b0
      [  133.692421]        driver_register+0x170/0x4b0
      [  133.695118]        __dax_driver_register+0x141/0x1b0
      [  133.697910]        dax_kmem_init+0x54/0xff0 [kmem]
      [  133.700794]        do_one_initcall+0x117/0x5d0
      [  133.703455]        do_init_module+0x277/0x750
      [  133.706054]        load_module+0x5d1d/0x74f0
      [  133.708602]        init_module_from_file+0x12c/0x1a0
      [  133.711234]        idempotent_init_module+0x3f1/0x690
      [  133.713937]        __x64_sys_finit_module+0x10e/0x1a0
      [  133.716492]        x64_sys_call+0x184d/0x20d0
      [  133.719053]        do_syscall_64+0x6d/0x140
      [  133.721537]        entry_SYSCALL_64_after_hwframe+0x76/0x7e
      [  133.724239]
      [  133.724239] other info that might help us debug this:
      [  133.724239]
      [  133.730832]  Possible unsafe locking scenario:
      [  133.730832]
      [  133.735298]        CPU0                    CPU1
      [  133.737759]        ----                    ----
      [  133.740165]   rlock((memory_chain).rwsem);
      [  133.742623]                                lock(memory_tier_lock);
      [  133.745357]                                lock((memory_chain).rwsem);
      [  133.748141]   lock(memory_tier_lock);
      [  133.750489]
      [  133.750489]  *** DEADLOCK ***
      [  133.750489]
      [  133.756742] 6 locks held by (udev-worker)/1133:
      [  133.759179]  #0: ffff888207be6158 (&dev->mutex){....}-{3:3}, at: __driver_attach+0x26c/0x570
      [  133.762299]  #1: ffffffff875b5868 (device_hotplug_lock){+.+.}-{3:3}, at: lock_device_hotplug+0x20/0x30
      [  133.765565]  #2: ffff88820cf6a108 (&dev->mutex){....}-{3:3}, at: device_online+0x2f/0x1d0
      [  133.768978]  #3: ffffffff86d08ff0 (cpu_hotplug_lock){++++}-{0:0}, at: mem_hotplug_begin+0x17/0x30
      [  133.772312]  #4: ffffffff8702dfb0 (mem_hotplug_lock){++++}-{0:0}, at: mem_hotplug_begin+0x23/0x30
      [  133.775544]  #5: ffffffff875d3310 ((memory_chain).rwsem){++++}-{3:3}, at: blocking_notifier_call_chain+0x60/0xb0
      [  133.779113]
      [  133.779113] stack backtrace:
      [  133.783728] CPU: 5 UID: 0 PID: 1133 Comm: (udev-worker) Tainted: G           O     N 6.11.0-rc2+ #146
      [  133.787220] Tainted: [O]=OOT_MODULE, [N]=TEST
      [  133.789948] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
      [  133.793291] Call Trace:
      [  133.795826]  <TASK>
      [  133.798284]  dump_stack_lvl+0xea/0x150
      [  133.801025]  dump_stack+0x19/0x20
      [  133.803609]  print_circular_bug+0x477/0x740
      [  133.806341]  check_noncircular+0x2f4/0x3e0
      [  133.809056]  ? __pfx_check_noncircular+0x10/0x10
      [  133.811866]  ? __pfx_lockdep_lock+0x10/0x10
      [  133.814670]  ? __sanitizer_cov_trace_const_cmp8+0x1c/0x30
      [  133.817610]  __lock_acquire+0x2efd/0x5c60
      [  133.820339]  ? __pfx___lock_acquire+0x10/0x10
      [  133.823128]  ? __dax_driver_register+0x141/0x1b0
      [  133.825926]  ? do_one_initcall+0x117/0x5d0
      [  133.828648]  lock_acquire+0x1ce/0x580
      [  133.831349]  ? memtier_hotplug_callback+0x383/0x4b0
      [  133.834293]  ? __pfx_lock_acquire+0x10/0x10
      [  133.837134]  __mutex_lock+0x15c/0x1490
      [  133.839829]  ? memtier_hotplug_callback+0x383/0x4b0
      [  133.842753]  ? memtier_hotplug_callback+0x383/0x4b0
      [  133.845602]  ? __this_cpu_preempt_check+0x21/0x30
      [  133.848438]  ? __pfx___mutex_lock+0x10/0x10
      [  133.851200]  ? __pfx_lock_acquire+0x10/0x10
      [  133.853935]  ? global_dirty_limits+0xc0/0x160
      [  133.856699]  ? __sanitizer_cov_trace_switch+0x58/0xa0
      [  133.859564]  mutex_lock_nested+0x1f/0x30
      [  133.862251]  ? mutex_lock_nested+0x1f/0x30
      [  133.864964]  memtier_hotplug_callback+0x383/0x4b0
      [  133.867752]  notifier_call_chain+0xbf/0x370
      [  133.870550]  ? writeback_set_ratelimit+0xe8/0x160
      [  133.873372]  blocking_notifier_call_chain+0x76/0xb0
      [  133.876311]  memory_notify+0x2e/0x40
      [  133.879013]  online_pages+0x597/0x720
      [  133.881686]  ? irqentry_exit+0x3e/0xa0
      [  133.884397]  ? __pfx_online_pages+0x10/0x10
      [  133.887244]  ? __sanitizer_cov_trace_const_cmp8+0x1c/0x30
      [  133.890299]  ? mhp_init_memmap_on_memory+0x7a/0x1c0
      [  133.893203]  memory_subsys_online+0x4f6/0x7f0
      [  133.896099]  ? __pfx_memory_subsys_online+0x10/0x10
      [  133.899039]  ? xa_load+0x16d/0x2e0
      [  133.901667]  ? __pfx_xa_load+0x10/0x10
      [  133.904366]  ? __pfx_memory_subsys_online+0x10/0x10
      [  133.907218]  device_online+0x141/0x1d0
      [  133.909845]  online_memory_block+0x4d/0x60
      [  133.912494]  walk_memory_blocks+0xc0/0x120
      [  133.915104]  ? __pfx_online_memory_block+0x10/0x10
      [  133.917776]  add_memory_resource+0x51d/0x6c0
      [  133.920404]  ? __pfx_add_memory_resource+0x10/0x10
      [  133.923104]  ? _raw_write_unlock+0x31/0x60
      [  133.925781]  ? register_memory_resource+0x119/0x180
      [  133.928450]  add_memory_driver_managed+0xf5/0x180
      [  133.931036]  dev_dax_kmem_probe+0x7f7/0xb40 [kmem]
      [  133.933665]  ? __pfx_dev_dax_kmem_probe+0x10/0x10 [kmem]
      [  133.936332]  ? __pfx___up_read+0x10/0x10
      [  133.938878]  dax_bus_probe+0x147/0x230
      [  133.941332]  ? __pfx_dax_bus_probe+0x10/0x10
      [  133.943954]  really_probe+0x27f/0xac0
      [  133.946387]  ? __sanitizer_cov_trace_const_cmp1+0x1e/0x30
      [  133.949106]  __driver_probe_device+0x1f3/0x460
      [  133.951704]  ? parse_option_str+0x149/0x190
      [  133.954241]  driver_probe_device+0x56/0x1b0
      [  133.956749]  __driver_attach+0x277/0x570
      [  133.959228]  ? __pfx___driver_attach+0x10/0x10
      [  133.961776]  bus_for_each_dev+0x145/0x1e0
      [  133.964367]  ? __pfx_bus_for_each_dev+0x10/0x10
      [  133.967019]  ? __kasan_check_read+0x15/0x20
      [  133.969543]  ? _raw_spin_unlock+0x31/0x60
      [  133.972132]  driver_attach+0x49/0x60
      [  133.974536]  bus_add_driver+0x2f3/0x6b0
      [  133.977044]  driver_register+0x170/0x4b0
      [  133.979480]  __dax_driver_register+0x141/0x1b0
      [  133.982126]  ? __pfx_dax_kmem_init+0x10/0x10 [kmem]
      [  133.984724]  dax_kmem_init+0x54/0xff0 [kmem]
      [  133.987284]  ? __pfx_dax_kmem_init+0x10/0x10 [kmem]
      [  133.989965]  do_one_initcall+0x117/0x5d0
      [  133.992506]  ? __pfx_do_one_initcall+0x10/0x10
      [  133.995185]  ? __kasan_kmalloc+0x88/0xa0
      [  133.997748]  ? kasan_poison+0x3e/0x60
      [  134.000288]  ? kasan_unpoison+0x2c/0x60
      [  134.002762]  ? kasan_poison+0x3e/0x60
      [  134.005202]  ? __asan_register_globals+0x62/0x80
      [  134.007753]  ? __pfx_dax_kmem_init+0x10/0x10 [kmem]
      [  134.010439]  do_init_module+0x277/0x750
      [  134.012953]  load_module+0x5d1d/0x74f0
      [  134.015406]  ? __pfx_load_module+0x10/0x10
      [  134.017887]  ? __pfx_ima_post_read_file+0x10/0x10
      [  134.020470]  ? __sanitizer_cov_trace_const_cmp8+0x1c/0x30
      [  134.023127]  ? __sanitizer_cov_trace_const_cmp4+0x1a/0x20
      [  134.025767]  ? security_kernel_post_read_file+0xa2/0xd0
      [  134.028429]  ? __sanitizer_cov_trace_const_cmp4+0x1a/0x20
      [  134.031162]  ? kernel_read_file+0x503/0x820
      [  134.033645]  ? __pfx_kernel_read_file+0x10/0x10
      [  134.036232]  ? __pfx___lock_acquire+0x10/0x10
      [  134.038766]  init_module_from_file+0x12c/0x1a0
      [  134.041291]  ? init_module_from_file+0x12c/0x1a0
      [  134.043936]  ? __pfx_init_module_from_file+0x10/0x10
      [  134.046516]  ? __this_cpu_preempt_check+0x21/0x30
      [  134.049091]  ? __kasan_check_read+0x15/0x20
      [  134.051551]  ? do_raw_spin_unlock+0x60/0x210
      [  134.054077]  idempotent_init_module+0x3f1/0x690
      [  134.056643]  ? __pfx_idempotent_init_module+0x10/0x10
      [  134.059318]  ? __sanitizer_cov_trace_const_cmp4+0x1a/0x20
      [  134.061995]  ? __fget_light+0x17d/0x210
      [  134.064428]  __x64_sys_finit_module+0x10e/0x1a0
      [  134.066976]  x64_sys_call+0x184d/0x20d0
      [  134.069405]  do_syscall_64+0x6d/0x140
      [  134.071926]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
      
      [yanfei.xu@intel.com: add mutex_lock/unlock() pair back]
        Link: https://lkml.kernel.org/r/20240830102447.1445296-1-yanfei.xu@intel.com
      Link: https://lkml.kernel.org/r/20240827113614.1343049-1-yanfei.xu@intel.com
      Fixes: 823430c8 ("memory tier: consolidate the initialization of memory tiers")
      Signed-off-by: Yanfei Xu <yanfei.xu@intel.com>
      Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
      Cc: Ho-Ren (Jack) Chuang <horen.chuang@linux.dev>
      Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: vmalloc: refactor vm_area_alloc_pages() function · 7de8728f
      Uladzislau Rezki (Sony) authored
      The aim is to simplify the vm_area_alloc_pages() function and make it
      less confusing, as it has become rather clogged over time:

      - eliminate the "bulk_gfp" variable and do not overwrite the gfp
        flags for the bulk allocator;
      - drop the __GFP_NOFAIL flag for high-order-page requests on the upper
        layer, so that __GFP_NOFAIL handling is less spread between levels;
      - add a comment about the fallback path taken when a high-order attempt
        is unsuccessful, because __GFP_NOFAIL is dropped for such cases;
      - fix a typo in a commit message.
      
      Link: https://lkml.kernel.org/r/20240827190916.34242-1-urezki@gmail.com
      Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Baoquan He <bhe@redhat.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sony.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: rework vm_ops->close() handling on VMA merge · 01c373e9
      Lorenzo Stoakes authored
      In commit 714965ca ("mm/mmap: start distinguishing if vma can be
      removed in mergeability test") we relaxed the VMA merge rules for VMAs
      possessing a vm_ops->close() hook, permitting this operation in instances
      where we wouldn't delete the VMA as part of the merge operation.
      
      This was later corrected in commit fc0c8f90 ("mm, mmap: fix
      vma_merge() case 7 with vma_ops->close") to account for a subtle case that
      the previous commit had not taken into account.
      
      In both instances, we first rely on is_mergeable_vma() to determine
      whether we might be dealing with a VMA that might be removed, taking
      advantage of the fact that a 'previous' VMA will never be deleted, only
      VMAs that follow it.
      
      The second patch corrects the instance where a merge of the previous VMA
      into a subsequent one did not correctly check whether the subsequent VMA
      had a vm_ops->close() handler.
      
      Both changes prevent merge cases that are actually permissible (for
      instance a merge of a VMA into a following VMA with a vm_ops->close(), but
      with no previous VMA, which would result in the next VMA being extended,
      not deleted).
      
      In addition, both changes fail to consider the case where a VMA that would
      otherwise be merged with the previous and next VMA might have
      vm_ops->close(), on the assumption that for this to be the case, all three
      would have to have the same vma->vm_file to be mergeable and thus the same
      vm_ops.
      
      And in addition both changes operate at 50,000 feet, trying to guess
      whether a VMA will be deleted.
      
      As we have majorly refactored the VMA merge operation and de-duplicated
      code to the point where we know precisely where deletions will occur, this
      patch removes the aforementioned checks altogether and instead explicitly
      checks whether a VMA will be deleted.
      
      In cases where a reduced merge is still possible (where we merge both
      previous and next VMA but the next VMA has a vm_ops->close hook, meaning
      we could just merge the previous and current VMA), we do so, otherwise the
      merge is not permitted.
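
      A sketch of the explicit check this enables (the helper name is
      hypothetical, illustrating the shape rather than the exact
      implementation):

          /* A VMA with a vm_ops->close() hook must never be deleted by a merge. */
          static bool can_merge_remove_vma(struct vm_area_struct *vma)
          {
                  return !vma->vm_ops || !vma->vm_ops->close;
          }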
      
      We take advantage of our userland testing to assert that this functions
      correctly - replacing the previous limited vm_ops->close() tests with
      tests for every single case where we delete a VMA.
      
      We also update all testing for both new and modified VMAs to set
      vma->vm_ops->close() in every single instance where this would not prevent
      the merge, to assert that we never do so.
      
      Link: https://lkml.kernel.org/r/9f96b8cfeef3d14afabddac3d6144afdfbef2e22.1725040657.git.lorenzo.stoakes@oracle.com
      Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
      Cc: Mark Brown <broonie@kernel.org>
      Cc: Bert Karwatzki <spasswolf@web.de>
      Cc: Jeff Xu <jeffxu@chromium.org>
      Cc: Jiri Olsa <olsajiri@gmail.com>
      Cc: Kees Cook <kees@kernel.org>
      Cc: Lorenzo Stoakes <lstoakes@gmail.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: "Paul E. McKenney" <paulmck@kernel.org>
      Cc: Paul Moore <paul@paul-moore.com>
      Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: refactor vma_merge() into modify-only vma_merge_existing_range() · cc8cb369
      Lorenzo Stoakes authored
      The existing vma_merge() function is no longer required to handle what
      were previously referred to as cases 1-3 (i.e.  the merging of a new VMA),
      as this is now handled by vma_merge_new_vma().
      
      Additionally, simplify the convoluted control flow of the original,
      maintaining identical logic only expressed more clearly, and do away
      with the complicated set of numbered cases, instead logically examining
      each possible outcome - merging of both the previous and subsequent
      VMA, merging of the previous VMA and merging of the subsequent VMA
      alone.
      
      We now utilise the previously implemented commit_merge() function to share
      logic with vma_expand() de-duplicating code and providing less surface
      area for bugs and confusion.  In order to do so, we adjust this function
      to accept parameters specific to merging existing ranges.
      
      Link: https://lkml.kernel.org/r/2cf6016b7bfcc4965fc3cde10827560c42e4f12c.1725040657.git.lorenzo.stoakes@oracle.com
      Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
      Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
      Cc: Mark Brown <broonie@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Bert Karwatzki <spasswolf@web.de>
      Cc: Jeff Xu <jeffxu@chromium.org>
      Cc: Jiri Olsa <olsajiri@gmail.com>
      Cc: Kees Cook <kees@kernel.org>
      Cc: Lorenzo Stoakes <lstoakes@gmail.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: "Paul E. McKenney" <paulmck@kernel.org>
      Cc: Paul Moore <paul@paul-moore.com>
      Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: introduce commit_merge(), abstracting final commit of merge · 65e0aa64
      Lorenzo Stoakes authored
      Pull the part of vma_expand() which actually commits the merge operation
      (that is, inserting the VMA into the maple tree and setting its
      vma->vm_start and vma->vm_end fields) into its own function.
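
      A sketch of what the extracted commit step covers (signature and
      internals simplified; the vma_prepare()/vma_complete() bookkeeping is
      elided):

          static int commit_merge(struct vma_merge_struct *vmg)
          {
                  struct vm_area_struct *vma = vmg->vma;

                  vma_iter_config(vmg->vmi, vmg->start, vmg->end);

                  /* Commit the new range and store it in the maple tree. */
                  vma->vm_start = vmg->start;
                  vma->vm_end = vmg->end;
                  vma->vm_pgoff = vmg->pgoff;
                  vma_iter_store(vmg->vmi, vma);

                  return 0;
          }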
      
      We implement only the parts needed for vma_expand(), which, as a result
      of the previous work, is now also the means by which new VMA ranges are
      merged.
      
      The next commit in the series will implement merging of existing ranges
      which will extend commit_merge() to accommodate this case and result in
      all merges using this common code.
      
      Link: https://lkml.kernel.org/r/7b985a20dfa549e3c370cd274d732b64c44f6dbd.1725040657.git.lorenzo.stoakes@oracle.com
      Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
      Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
      Cc: Mark Brown <broonie@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Bert Karwatzki <spasswolf@web.de>
      Cc: Jeff Xu <jeffxu@chromium.org>
      Cc: Jiri Olsa <olsajiri@gmail.com>
      Cc: Kees Cook <kees@kernel.org>
      Cc: Lorenzo Stoakes <lstoakes@gmail.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: "Paul E. McKenney" <paulmck@kernel.org>
      Cc: Paul Moore <paul@paul-moore.com>
      Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: make vma_prepare() and friends static and internal to vma.c · 25d3925f
      Lorenzo Stoakes authored
      Now that we have abstracted merge behaviour for new VMA ranges, we are
      able to render vma_prepare(), init_vma_prep(), vma_complete(),
      can_vma_merge_before() and can_vma_merge_after() static and internal to
      vma.c.
      
      These are internal implementation details of kernel VMA manipulation and
      merging mechanisms and thus should not be exposed.  This also renders the
      functions userland testable.
      
      Link: https://lkml.kernel.org/r/7f7f1c34ce10405a6aab2714c505af3cf41b7851.1725040657.git.lorenzo.stoakes@oracle.com
      Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
      Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
      Cc: Mark Brown <broonie@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Bert Karwatzki <spasswolf@web.de>
      Cc: Jeff Xu <jeffxu@chromium.org>
      Cc: Jiri Olsa <olsajiri@gmail.com>
      Cc: Kees Cook <kees@kernel.org>
      Cc: Lorenzo Stoakes <lstoakes@gmail.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: "Paul E. McKenney" <paulmck@kernel.org>
      Cc: Paul Moore <paul@paul-moore.com>
      Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: avoid using vma_merge() for new VMAs · cacded5e
      Lorenzo Stoakes authored
      Abstract vma_merge_new_vma() to use vma_merge_struct and rename the
      resultant function vma_merge_new_range() to make clear what the purpose
      of this function is - a new VMA is desired in the specified range, and
      we wish to see if it is possible to 'merge' surrounding VMAs into this
      range rather than having to allocate a new VMA.
      
      Note that this function uses vma_expand() exclusively, so it adopts its
      requirement that the iterator point at or before the gap.  We add an
      assert to this effect.
      
      This is as opposed to vma_merge_existing_range(), which will be introduced
      in a subsequent commit, and provide the same functionality for cases in
      which we are modifying an existing VMA.
      
      In mmap_region() and do_brk_flags() we open code scenarios where we prefer
      to use vma_expand() rather than invoke a full vma_merge() operation.
      
      Abstract this logic, eliminate all of the open-coding, and use the same
      logic for all cases where we add new VMAs, so that they all ultimately
      use vma_expand() rather than vma_merge().
      
      Doing so removes duplication and simplifies VMA merging in all such cases,
      laying the ground for us to eliminate the merging of new VMAs in
      vma_merge() altogether.
      
      Also add the ability for the vmg to track state and to report errors,
      allowing us to differentiate a failed merge from an inability to
      allocate memory in callers.
      
      This makes it far easier to understand what is happening in these cases,
      avoiding confusion and bugs and allowing for future optimisation.
      
      Also introduce vma_iter_next_rewind() to allow for retrieval of the next,
      and (optionally) the prev VMA, rewinding to the start of the previous gap.
      
      Introduce are_anon_vmas_compatible() to abstract individual VMA anon_vma
      comparison for the case of merging on both sides, where the anon_vma of
      the VMA being merged may be compatible with prev and next, but prev's
      and next's anon_vmas may not be compatible with each other.
      
      Finally also introduce can_vma_merge_left() / can_vma_merge_right() to
      check adjacent VMA compatibility and that they are indeed adjacent.
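
      A sketch of the left-hand predicate (simplified; can_vma_merge_right()
      mirrors it for vmg->next):

          /* Can we merge with prev, and is prev actually adjacent? */
          static bool can_vma_merge_left(struct vma_merge_struct *vmg)
          {
                  return vmg->prev && vmg->prev->vm_end == vmg->start &&
                         can_vma_merge_after(vmg);
          }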
      
      Link: https://lkml.kernel.org/r/49d37c0769b6b9dc03b27fe4d059173832556392.1725040657.git.lorenzo.stoakes@oracle.com
      Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
      Tested-by: Mark Brown <broonie@kernel.org>
      Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Bert Karwatzki <spasswolf@web.de>
      Cc: Jeff Xu <jeffxu@chromium.org>
      Cc: Jiri Olsa <olsajiri@gmail.com>
      Cc: Kees Cook <kees@kernel.org>
      Cc: Lorenzo Stoakes <lstoakes@gmail.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: "Paul E. McKenney" <paulmck@kernel.org>
      Cc: Paul Moore <paul@paul-moore.com>
      Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: abstract vma_expand() to use vma_merge_struct · fc21959f
      Lorenzo Stoakes authored
      The purpose of the vmg is to thread merge state through functions and
      avoid egregious parameter lists.  We expand this to vma_expand(), which is
      used for a number of merge cases.
      
      Accordingly, adjust its callers, mmap_region() and relocate_vma_down(), to
      use a vmg.
      
      An added purpose of this change is the ability in a future commit to
      perform all new VMA range merging using vma_expand().
      
      Link: https://lkml.kernel.org/r/4bc8c9dbc9ca52452ef8e587b28fe555854ceb38.1725040657.git.lorenzo.stoakes@oracle.com
      Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
      Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
      Cc: Mark Brown <broonie@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Bert Karwatzki <spasswolf@web.de>
      Cc: Jeff Xu <jeffxu@chromium.org>
      Cc: Jiri Olsa <olsajiri@gmail.com>
      Cc: Kees Cook <kees@kernel.org>
      Cc: Lorenzo Stoakes <lstoakes@gmail.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: "Paul E. McKenney" <paulmck@kernel.org>
      Cc: Paul Moore <paul@paul-moore.com>
      Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: remove duplicated open-coded VMA policy check · 3e01310d
      Lorenzo Stoakes authored
      Both can_vma_merge_before() and can_vma_merge_after() are invoked after
      checking for compatible VMA NUMA policy, so we can simply move this
      check into is_mergeable_vma() and abstract it altogether.
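
      A sketch of the consolidated check (simplified; the other mergeability
      conditions are elided):

          static bool is_mergeable_vma(struct vma_merge_struct *vmg, bool merge_next)
          {
                  struct vm_area_struct *vma = merge_next ? vmg->next : vmg->prev;

                  /* The NUMA policy comparison now lives here, in one place. */
                  if (!mpol_equal(vmg->policy, vma_policy(vma)))
                          return false;

                  /* ... flag, file and other compatibility checks elided ... */
                  return true;
          }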
      
      In mmap_region() we set vmg->policy to NULL, so the policy comparisons
      checked in can_vma_merge_before() and can_vma_merge_after() are exactly
      equivalent to !vma_policy(vmg.next) and !vma_policy(vmg.prev).
      
      Equally, in do_brk_flags(), vmg->policy is NULL, so the
      can_vma_merge_after() is checking !vma_policy(vma), as we set vmg.prev to
      vma.
      
      In vma_merge(), we compare prev and next policies with vmg->policy before
      checking can_vma_merge_after() and can_vma_merge_before() respectively,
      which this patch causes to be checked in precisely the same way.
      
      This therefore maintains precisely the same logic as before, only now
      abstracted into is_mergeable_vma().
      
      Link: https://lkml.kernel.org/r/0dbff286d9c4988333bc6f4ff3734cb95dd5410a.1725040657.git.lorenzo.stoakes@oracle.com
      Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
      Cc: Mark Brown <broonie@kernel.org>
      Cc: Bert Karwatzki <spasswolf@web.de>
      Cc: Jeff Xu <jeffxu@chromium.org>
      Cc: Jiri Olsa <olsajiri@gmail.com>
      Cc: Kees Cook <kees@kernel.org>
      Cc: Lorenzo Stoakes <lstoakes@gmail.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: "Paul E. McKenney" <paulmck@kernel.org>
      Cc: Paul Moore <paul@paul-moore.com>
      Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: introduce vma_merge_struct and abstract vma_merge(),vma_modify() · 2f1c6611
      Lorenzo Stoakes authored
      Rather than passing around huge numbers of parameters to numerous helper
      functions, abstract them into a single struct that we thread through the
      operation, the vma_merge_struct ('vmg').
      
      Adjust vma_merge() and vma_modify() to accept this parameter, as well as
      predicate functions can_vma_merge_before(), can_vma_merge_after(), and the
      vma_modify_...() helper functions.
      
      Also introduce VMG_STATE() and VMG_VMA_STATE() helper macros to allow for
      easy vmg declaration.
      
      We additionally remove the requirement that vma_merge() is passed a VMA
      object representing the candidate new VMA.  Previously it used this to
      obtain the mm_struct, file and anon_vma properties of the proposed range
      (a rather confusing state of affairs), which are now provided by the vmg
      directly.
      
      We also remove the pgoff calculation previously performed in
      vma_modify(), and instead calculate this in VMG_VMA_STATE() via the
      vma_pgoff_offset() helper.
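
      An illustrative shape for the vmg, abridged from the description above
      (not the exact field list):

          struct vma_merge_struct {
                  struct mm_struct *mm;
                  struct vma_iterator *vmi;
                  pgoff_t pgoff;
                  struct vm_area_struct *prev;
                  struct vm_area_struct *next;
                  struct vm_area_struct *vma;   /* existing VMA being modified */
                  unsigned long start, end;
                  unsigned long flags;
                  struct file *file;
                  struct anon_vma *anon_vma;
                  struct mempolicy *policy;
          };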
      
      Link: https://lkml.kernel.org/r/a955aad09d81329f6fbeb636b2dd10cde7b73dab.1725040657.git.lorenzo.stoakes@oracle.com
      Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
      Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
      Cc: Mark Brown <broonie@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Bert Karwatzki <spasswolf@web.de>
      Cc: Jeff Xu <jeffxu@chromium.org>
      Cc: Jiri Olsa <olsajiri@gmail.com>
      Cc: Kees Cook <kees@kernel.org>
      Cc: Lorenzo Stoakes <lstoakes@gmail.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: "Paul E. McKenney" <paulmck@kernel.org>
      Cc: Paul Moore <paul@paul-moore.com>
      Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • tools: add VMA merge tests · 955db396
      Lorenzo Stoakes authored
      Add a variety of VMA merge unit tests to assert that the behaviour of VMA
      merge is correct at an abstract level and VMAs are merged or not merged as
      expected.
      
      These are intentionally added _before_ we start refactoring vma_merge() in
      order that we can continually assert correctness throughout the rest of
      the series.
      
      In order to reduce churn going forward, we backport the vma_merge_struct
      data type to the test code which we introduce and use in a future commit,
      and add wrappers around the merge new and existing VMA cases.
      
      Link: https://lkml.kernel.org/r/1c7a0b43cfad2c511a6b1b52f3507696478ff51a.1725040657.git.lorenzo.stoakes@oracle.com
      Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
      Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
      Cc: Mark Brown <broonie@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Bert Karwatzki <spasswolf@web.de>
      Cc: Jeff Xu <jeffxu@chromium.org>
      Cc: Jiri Olsa <olsajiri@gmail.com>
      Cc: Kees Cook <kees@kernel.org>
      Cc: Lorenzo Stoakes <lstoakes@gmail.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: "Paul E. McKenney" <paulmck@kernel.org>
      Cc: Paul Moore <paul@paul-moore.com>
      Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • tools: improve vma test Makefile · 4e52a60a
      Lorenzo Stoakes authored
      Patch series "mm: remove vma_merge()", v3.
      
      The infamous vma_merge() function has been the cause of a great deal of
      pain, bugs and confusion for a very long time.
      
      It is subtle, contains many corner cases, tries to do far too much and is
      as a result very fragile.
      
      The fact that the function requires a numbering system to cover each
      possible eventuality, with references throughout the many branches of
      its implementation as to which case you are looking at, speaks to all
      this.
      
      Some of this complexity is inherent - unfortunately there is no getting
      away from the need to figure out precisely how to execute the merge,
      whether we need to remove VMAs, whether it is safe to do so, what
      constitutes a mergeable VMA and so on.
      
      However, a lot of the complexity is not inherent but instead a product of
      the function's 'organic' development.
      
      Liam has gone to great lengths to improve the situation as a part of his
      maple tree implementation, greatly improving the readability of the code,
      and Vlastimil and myself have additionally gone to lengths to try to
      improve things further.
      
      However, with the availability of userland VMA testing, it now becomes
      possible to perform a rather more significant refactoring while
      maintaining confidence in its correct operation.
      
      An attempt was previously made by Vlastimil [0] to eliminate vma_merge(),
      however it was rather - brutal - and an astute reader might refer to the
      date of that patch for insight as to its intent.
      
      This series instead divides merge operations into two natural kinds -
      merges which occur when a NEW vma is being added to the address space, and
      merges which occur when a vma is being MODIFIED.
      
      Happily, the vma_expand() function introduced by Liam, which has the
      capacity for also deleting a subsequent VMA, covers each of the NEW vma
      cases.
      
      By abstracting the actual final commit of changes to a VMA to its own
      function, commit_merge() and writing a wrapper around vma_expand() for new
      VMA cases vma_merge_new_range(), we can avoid having to use vma_merge()
      for these instances altogether.
      
      By doing so we are also able to then de-duplicate all existing merge logic
      in mmap_region() and do_brk_flags() and have everything invoke this new
      function, so we universally take the same approach to merging new VMAs.
      
      Having done so, we can then completely rework vma_merge() into
      vma_merge_existing_range() and use this for the instances where a merge is
      proposed for a region of an existing VMA.
      
      This eliminates vma_merge() and its numbered cases and instead divides
      things into logical cases - merge both, merge left, merge right (the
      latter 2 being either partial or full merges).
      
      The code is heavily annotated with ASCII diagrams and greatly simplified
      in comparison to the existing vma_merge() function.
      
      Having made this change, we take the opportunity to address an issue with
      merging VMAs possessing a vm_ops->close() hook - commit 714965ca
      ("mm/mmap: start distinguishing if vma can be removed in mergeability
      test") and commit fc0c8f90 ("mm, mmap: fix vma_merge() case 7 with
      vma_ops->close") make efforts to relax how we handle these, making
      assumptions about which VMAs might end up deleted (and thus, if possessing
      a vm_ops->close() hook, cannot be).
      
      This refactor means we do not need to guess, so instead explicitly only
      disallow merge in instances where a VMA with a vm_ops->close() hook would
      be deleted (and try a smaller merge in cases where this is possible).
      
      In addition to these changes, we introduce a new vma_merge_struct
      abstraction to allow VMA merge state to be threaded through the operation
      neatly.
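
      As a hedged sketch of the idea (the field names and types below are
      illustrative assumptions, not the exact layout of the series' struct
      vma_merge_struct):

      /* Illustrative only: merge state threaded through the operation,
       * replacing the long parameter lists of the old vma_merge(). */
      struct toy_vma_merge_struct {
              struct vma_iterator *vmi;      /* position in the VMA tree */
              struct vm_area_struct *prev;   /* candidate to the left */
              struct vm_area_struct *next;   /* candidate to the right */
              struct vm_area_struct *vma;    /* existing VMA, if modifying */
              unsigned long start, end;      /* proposed range */
              unsigned long flags;           /* vm_flags for the range */
              unsigned long pgoff;           /* file offset, if file-backed */
      };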
      
      There is heavy unit testing provided for all merge functionality, added
      prior to the refactoring, allowing for before/after testing.
      
      The vm_ops->close() change also introduces exhaustive testing to
      demonstrate that this functions as expected, and in addition to this the
      reproduction code from commit fc0c8f90 ("mm, mmap: fix vma_merge()
      case 7 with vma_ops->close") was tested and confirmed passing.
      
      [0]: https://lore.kernel.org/linux-mm/20240401192623.18575-2-vbabka@suse.cz/
      
      
      This patch (of 10):
      
      Have vma.o depend on its source dependencies explicitly; previously
      changes to these were simply being ignored whenever the existing object
      files were up to date.
      
      The build is now correctly re-triggered when mm/ source changes, as
      well as when the local source code does.
      
      Also set clean as a phony rule.
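
      In sketch form (not the exact Makefile contents - the paths below are
      assumptions for illustration):

      # Make vma.o depend on the mm/ sources it pulls in, so that editing
      # them re-triggers the build rather than being silently ignored.
      vma.o: vma.c ../../../mm/vma.c ../../../mm/vma.h

      # 'clean' names no real file; marking it phony ensures a stray file
      # called 'clean' can never satisfy the rule and skip the removal.
      .PHONY: clean
      clean:
      	rm -f *.o vma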
      
      Link: https://lkml.kernel.org/r/cover.1725040657.git.lorenzo.stoakes@oracle.com
      Link: https://lkml.kernel.org/r/e3ea58f08364ae5432c9a074de0195a7c7e0b04a.1725040657.git.lorenzo.stoakes@oracle.com
      Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
      Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
      Cc: Mark Brown <broonie@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Bert Karwatzki <spasswolf@web.de>
      Cc: Jeff Xu <jeffxu@chromium.org>
      Cc: Jiri Olsa <olsajiri@gmail.com>
      Cc: Kees Cook <kees@kernel.org>
      Cc: Lorenzo Stoakes <lstoakes@gmail.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: "Paul E. McKenney" <paulmck@kernel.org>
      Cc: Paul Moore <paul@paul-moore.com>
      Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      4e52a60a
    • mm/vma.h: optimise vma_munmap_struct · 723e1e8b
      Liam R. Howlett authored
      The vma_munmap_struct has a 4-byte hole, which pushes the struct to
      three cachelines.  Relocating the three booleans upwards allows the
      struct to fit in two cachelines (as reported by pahole on amd64).
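
      (For reference, layout reports like those below can be regenerated
      with pahole, e.g. "pahole -C vma_munmap_struct" run against the object
      file containing the struct - the exact path depends on the build.)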
      
      Before:
      struct vma_munmap_struct {
              struct vma_iterator *      vmi;                  /*     0     8 */
              struct vm_area_struct *    vma;                  /*     8     8 */
              struct vm_area_struct *    prev;                 /*    16     8 */
              struct vm_area_struct *    next;                 /*    24     8 */
              struct list_head *         uf;                   /*    32     8 */
              long unsigned int          start;                /*    40     8 */
              long unsigned int          end;                  /*    48     8 */
              long unsigned int          unmap_start;          /*    56     8 */
              /* --- cacheline 1 boundary (64 bytes) --- */
              long unsigned int          unmap_end;            /*    64     8 */
              int                        vma_count;            /*    72     4 */
      
              /* XXX 4 bytes hole, try to pack */
      
              long unsigned int          nr_pages;             /*    80     8 */
              long unsigned int          locked_vm;            /*    88     8 */
              long unsigned int          nr_accounted;         /*    96     8 */
              long unsigned int          exec_vm;              /*   104     8 */
              long unsigned int          stack_vm;             /*   112     8 */
              long unsigned int          data_vm;              /*   120     8 */
              /* --- cacheline 2 boundary (128 bytes) --- */
              bool                       unlock;               /*   128     1 */
              bool                       clear_ptes;           /*   129     1 */
              bool                       closed_vm_ops;        /*   130     1 */
      
              /* size: 136, cachelines: 3, members: 19 */
              /* sum members: 127, holes: 1, sum holes: 4 */
              /* padding: 5 */
              /* last cacheline: 8 bytes */
      };
      
      After:
      struct vma_munmap_struct {
              struct vma_iterator *      vmi;                  /*     0     8 */
              struct vm_area_struct *    vma;                  /*     8     8 */
              struct vm_area_struct *    prev;                 /*    16     8 */
              struct vm_area_struct *    next;                 /*    24     8 */
              struct list_head *         uf;                   /*    32     8 */
              long unsigned int          start;                /*    40     8 */
              long unsigned int          end;                  /*    48     8 */
              long unsigned int          unmap_start;          /*    56     8 */
              /* --- cacheline 1 boundary (64 bytes) --- */
              long unsigned int          unmap_end;            /*    64     8 */
              int                        vma_count;            /*    72     4 */
              bool                       unlock;               /*    76     1 */
              bool                       clear_ptes;           /*    77     1 */
              bool                       closed_vm_ops;        /*    78     1 */
      
              /* XXX 1 byte hole, try to pack */
      
              long unsigned int          nr_pages;             /*    80     8 */
              long unsigned int          locked_vm;            /*    88     8 */
              long unsigned int          nr_accounted;         /*    96     8 */
              long unsigned int          exec_vm;              /*   104     8 */
              long unsigned int          stack_vm;             /*   112     8 */
              long unsigned int          data_vm;              /*   120     8 */
      
              /* size: 128, cachelines: 2, members: 19 */
              /* sum members: 127, holes: 1, sum holes: 1 */
      };
      
      Link: https://lkml.kernel.org/r/20240830040101.822209-22-Liam.Howlett@oracle.com
      Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
      Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
      Cc: Bert Karwatzki <spasswolf@web.de>
      Cc: Jeff Xu <jeffxu@chromium.org>
      Cc: Jiri Olsa <olsajiri@gmail.com>
      Cc: Kees Cook <kees@kernel.org>
      Cc: Lorenzo Stoakes <lstoakes@gmail.com>
      Cc: Mark Brown <broonie@kernel.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: "Paul E. McKenney" <paulmck@kernel.org>
      Cc: Paul Moore <paul@paul-moore.com>
      Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      723e1e8b
    • mm/vma: drop incorrect comment from vms_gather_munmap_vmas() · 20831cd6
      Liam R. Howlett authored
      The comment has been outdated since commit 6b73cff2 ("mm: change
      munmap splitting order and move_vma()"); move_vma() was altered at
      that point to fix the fragile state of the accounting.
      
      Link: https://lkml.kernel.org/r/20240830040101.822209-21-Liam.Howlett@oracle.com
      Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
      Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
      Cc: Bert Karwatzki <spasswolf@web.de>
      Cc: Jeff Xu <jeffxu@chromium.org>
      Cc: Jiri Olsa <olsajiri@gmail.com>
      Cc: Kees Cook <kees@kernel.org>
      Cc: Lorenzo Stoakes <lstoakes@gmail.com>
      Cc: Mark Brown <broonie@kernel.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: "Paul E. McKenney" <paulmck@kernel.org>
      Cc: Paul Moore <paul@paul-moore.com>
      Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      20831cd6
    • mm: move may_expand_vm() check in mmap_region() · 224c1c70
      Liam R. Howlett authored
      The may_expand_vm() check requires the count of the pages within the
      munmap range.  Since this count is needed for accounting anyway and is
      obtained later, reordering may_expand_vm() to later in the call stack,
      after the vma munmap struct (vms) is initialised and the gather stage
      has potentially been run, allows for a single loop over the vmas.  The
      gather stage does not commit any work, so everything can be undone in
      the case of a failure.
      
      The MAP_FIXED page count is available after the vms_gather_munmap_vmas()
      call, so use it instead of looping over the vmas twice.
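
      In outline, the reordered flow looks something like the fragment below
      (a sketch only: function names follow the series, but the surrounding
      error handling, including unwinding the gathered state, is elided and
      approximate):

      /* Run the (fully undoable) gather first... */
      if (vms_gather_munmap_vmas(&vms, &mas_detach))
              return -ENOMEM;        /* gather committed no work */

      /*
       * ...then do the limit check.  vms.nr_pages now counts the pages the
       * MAP_FIXED unmap will release, so may_expand_vm() can allow for
       * them without a second loop over the vmas.
       */
      if (!may_expand_vm(mm, vm_flags, pglen - vms.nr_pages))
              return -ENOMEM;        /* gathered state unwound here */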
      
      Link: https://lkml.kernel.org/r/20240830040101.822209-20-Liam.Howlett@oracle.com
      Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
      Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
      Cc: Bert Karwatzki <spasswolf@web.de>
      Cc: Jeff Xu <jeffxu@chromium.org>
      Cc: Jiri Olsa <olsajiri@gmail.com>
      Cc: Kees Cook <kees@kernel.org>
      Cc: Lorenzo Stoakes <lstoakes@gmail.com>
      Cc: Mark Brown <broonie@kernel.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: "Paul E. McKenney" <paulmck@kernel.org>
      Cc: Paul Moore <paul@paul-moore.com>
      Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      224c1c70
    • ipc/shm, mm: drop do_vma_munmap() · 63fc66f5
      Liam R. Howlett authored
      The do_vma_munmap() wrapper existed for callers that didn't have a vma
      iterator and needed to check the vma mseal status prior to calling the
      underlying munmap().  All callers now use a vma iterator, and since the
      mseal check has been moved to do_vmi_align_munmap() and the vmas are
      aligned, that function can simply be called directly instead.
      
      do_vmi_align_munmap() can no longer be static, as ipc/shm now uses it;
      it is exported via the mm.h header.
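
      The ipc/shm call site then takes roughly the shape below (a sketch
      with locking and error handling omitted, assuming the upstream
      do_vmi_align_munmap() signature):

      /* The caller already holds an iterator positioned on an aligned VMA,
       * so it can invoke do_vmi_align_munmap() directly rather than going
       * through the do_vma_munmap() wrapper. */
      VMA_ITERATOR(vmi, mm, vma->vm_start);

      do_vmi_align_munmap(&vmi, vma, mm, vma->vm_start, vma->vm_end,
                          NULL /* uf */, false /* don't unlock mmap lock */);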
      
      Link: https://lkml.kernel.org/r/20240830040101.822209-19-Liam.Howlett@oracle.com
      Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
      Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
      Cc: Bert Karwatzki <spasswolf@web.de>
      Cc: Jeff Xu <jeffxu@chromium.org>
      Cc: Jiri Olsa <olsajiri@gmail.com>
      Cc: Kees Cook <kees@kernel.org>
      Cc: Lorenzo Stoakes <lstoakes@gmail.com>
      Cc: Mark Brown <broonie@kernel.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: "Paul E. McKenney" <paulmck@kernel.org>
      Cc: Paul Moore <paul@paul-moore.com>
      Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      63fc66f5