1. 09 Sep, 2024 20 commits
  2. 04 Sep, 2024 20 commits
    • mm: memory_hotplug: unify Huge/LRU/non-LRU movable folio isolation · 6f1833b8
      Kefeng Wang authored
      Use isolate_folio_to_list() to unify hugetlb/LRU/non-LRU folio
      isolation, which cleans up the code a bit and saves a few calls to
      compound_head().
      
      [wangkefeng.wang@huawei.com: various fixes]
        Link: https://lkml.kernel.org/r/20240829150500.2599549-1-wangkefeng.wang@huawei.com
      Link: https://lkml.kernel.org/r/20240827114728.3212578-6-wangkefeng.wang@huawei.com
      Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Dan Carpenter <dan.carpenter@linaro.org>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
      Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: migrate: add isolate_folio_to_list() · f1264e95
      Kefeng Wang authored
      Add an isolate_folio_to_list() helper that tries to isolate HugeTLB,
      non-LRU movable and LRU folios to a list; it will soon be reused by
      do_migrate_range() in memory hotplug.  Also drop mf_isolate_folio(),
      since soft_offline_in_use_page() can use the new helper directly.
      
      Link: https://lkml.kernel.org/r/20240827114728.3212578-5-wangkefeng.wang@huawei.com
      Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Acked-by: Miaohe Lin <linmiaohe@huawei.com>
      Tested-by: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Dan Carpenter <dan.carpenter@linaro.org>
      Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
      Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: memory_hotplug: check hwpoisoned page firstly in do_migrate_range() · e8a796fa
      Kefeng Wang authored
      Commit b15c8726 ("hwpoison, memory_hotplug: allow hwpoisoned pages to
      be offlined") doesn't handle hugetlb pages, so the endless loop still
      occurs when offlining a hwpoisoned hugetlb page.  Luckily, since commit
      e591ef7d ("mm, hwpoison,hugetlb,memory_hotplug: hotremove memory section
      with hwpoisoned hugepage"), the HPageMigratable flag of the hugetlb page
      is cleared and the hwpoisoned hugetlb page is skipped in
      scan_movable_pages(), so the endless loop issue is fixed.

      However, the HPageMigratable() check is done without holding a reference
      or lock, so the hugetlb page may become hwpoisoned after the check has
      passed.  This doesn't cause a problem, since the hwpoisoned page will be
      handled correctly in the next movable pages scan loop; in the meantime it
      will be isolated in do_migrate_range() but fail to migrate.  To avoid the
      unnecessary isolation and unify all hwpoisoned page handling,
      unconditionally check for hwpoison first, and if it is a hwpoisoned
      hugetlb page, try to unmap it as a catch-all safety net, as is done for
      normal pages.
      
      Link: https://lkml.kernel.org/r/20240827114728.3212578-4-wangkefeng.wang@huawei.com
      Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Dan Carpenter <dan.carpenter@linaro.org>
      Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
      Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: memory-failure: add unmap_poisoned_folio() · 16038c4f
      Kefeng Wang authored
      Add unmap_poisoned_folio() helper which will be reused by
      do_migrate_range() from memory hotplug soon.
      
      [akpm@linux-foundation.org: whitespace tweak, per Miaohe Lin]
        Link: https://lkml.kernel.org/r/1f80c7e3-c30d-1ac1-6a36-d1a5f5907f7c@huawei.com
      Link: https://lkml.kernel.org/r/20240827114728.3212578-3-wangkefeng.wang@huawei.com
      Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Acked-by: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Dan Carpenter <dan.carpenter@linaro.org>
      Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
      Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: memory_hotplug: remove head variable in do_migrate_range() · b62b51d2
      Kefeng Wang authored
      Patch series "mm: memory_hotplug: improve do_migrate_range()", v3.
      
      Unify hwpoisoned page handling and isolation of HugeTLB/LRU/non-LRU
      movable page, also convert to use folios in do_migrate_range().
      
      
      This patch (of 5):
      
      Directly use a folio for HugeTLB and THP when calculating the next pfn,
      then remove the unused head variable.
      
      Link: https://lkml.kernel.org/r/20240827114728.3212578-1-wangkefeng.wang@huawei.com
      Link: https://lkml.kernel.org/r/20240827114728.3212578-2-wangkefeng.wang@huawei.com
      Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Cc: Dan Carpenter <dan.carpenter@linaro.org>
      Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
      Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/damon/tests: add .kunitconfig file for DAMON kunit tests · f66ac836
      SeongJae Park authored
      '--kunitconfig' option of 'kunit.py run' supports '.kunitconfig' file name
      convention.  Add the file for DAMON kunit tests for more convenient kunit
      run.
      
      Link: https://lkml.kernel.org/r/20240827030336.7930-10-sj@kernel.org
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Cc: Brendan Higgins <brendanhiggins@google.com>
      Cc: David Gow <davidgow@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/damon: move kunit tests to tests/ subdirectory with _kunit suffix · 9bfbaa5e
      SeongJae Park authored
      There was a discussion about better places for kunit test code[1] and test
      file name suffix[2].  Following the conclusion, move kunit tests for DAMON
      to mm/damon/tests/ subdirectory and rename those.
      
      [1] https://lore.kernel.org/CABVgOS=pUdWb6NDHszuwb1HYws4a1-b1UmN=i8U_ED7HbDT0mg@mail.gmail.com
      [2] https://lore.kernel.org/CABVgOSmKwPq7JEpHfS6sbOwsR0B-DBDk_JP-ZD9s9ZizvpUjbQ@mail.gmail.com
      
      Link: https://lkml.kernel.org/r/20240827030336.7930-9-sj@kernel.org
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Cc: Brendan Higgins <brendanhiggins@google.com>
      Cc: David Gow <davidgow@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/damon/dbgfs-test: skip dbgfs_set_init_regions() test if PADDR is not registered · 61879eed
      SeongJae Park authored
      The test depends on registration of DAMON_OPS_PADDR.  It would be
      registered only when CONFIG_DAMON_PADDR is set.  DAMON core kunit tests do
      fake ops registration for such a case.  However, the functions for such
      fake ops registration are not available to the DAMON debugfs interface.
      Just skip the test in that case.
      
      Link: https://lkml.kernel.org/r/20240827030336.7930-8-sj@kernel.org
      Fixes: 999b9467 ("mm/damon/dbgfs-test: fix is_target_id() change")
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Cc: Brendan Higgins <brendanhiggins@google.com>
      Cc: David Gow <davidgow@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/damon/dbgfs-test: skip dbgfs_set_targets() test if PADDR is not registered · 8e34bac5
      SeongJae Park authored
      The test depends on registration of DAMON_OPS_PADDR.  It would be
      registered only when CONFIG_DAMON_PADDR is set.  DAMON core kunit tests do
      fake ops registration for such a case.  However, the functions for such
      fake ops registration are not available to the DAMON debugfs interface.
      Just skip the test in that case.
      
      Link: https://lkml.kernel.org/r/20240827030336.7930-7-sj@kernel.org
      Fixes: 999b9467 ("mm/damon/dbgfs-test: fix is_target_id() change")
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Cc: Brendan Higgins <brendanhiggins@google.com>
      Cc: David Gow <davidgow@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/damon/core-test: fix damon_test_ops_registration() for DAMON_VADDR unset case · e43772dc
      SeongJae Park authored
      The DAMON core kunit test can be executed without CONFIG_DAMON_VADDR.  In
      that case, the vaddr DAMON ops are not registered.  Meanwhile, the ops
      registration kunit test assumes the vaddr ops are registered.  Check and
      handle the case by registering fake vaddr ops inside the test code.
      
      Link: https://lkml.kernel.org/r/20240827030336.7930-6-sj@kernel.org
      Fixes: 4f540f5a ("mm/damon/core-test: add a kunit test case for ops registration")
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Cc: Brendan Higgins <brendanhiggins@google.com>
      Cc: David Gow <davidgow@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/damon/core-test: test only vaddr case on ops registration test · 9fcce7e7
      SeongJae Park authored
      DAMON ops registration kunit test tests both vaddr and paddr use cases in
      parts of the whole test cases.  Basically testing only one ops use case is
      enough.  Do the test with only vaddr use case.
      
      Link: https://lkml.kernel.org/r/20240827030336.7930-5-sj@kernel.org
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Cc: Brendan Higgins <brendanhiggins@google.com>
      Cc: David Gow <davidgow@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • selftests/damon: add execute permissions to test scripts · 8c211412
      SeongJae Park authored
      Some test scripts are missing executable permissions.  It causes warnings
      that make the test output unnecessarily verbose.  Add executable
      permissions.
      
      Link: https://lkml.kernel.org/r/20240827030336.7930-4-sj@kernel.org
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Cc: Brendan Higgins <brendanhiggins@google.com>
      Cc: David Gow <davidgow@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • selftests/damon: cleanup __pycache__/ with 'make clean' · 582c04b0
      SeongJae Park authored
      Python-based tests create a __pycache__/ directory.  Remove it with 'make
      clean' by defining it as EXTRA_CLEAN.
      
      Link: https://lkml.kernel.org/r/20240827030336.7930-3-sj@kernel.org
      Fixes: b5906f5f ("selftests/damon: add a test for update_schemes_tried_regions sysfs command")
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Cc: Brendan Higgins <brendanhiggins@google.com>
      Cc: David Gow <davidgow@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • selftests/damon: add access_memory_even to .gitignore · 9cb75552
      SeongJae Park authored
      Patch series "misc fixups for DAMON {self,kunit} tests".
      
      This patchset is for minor fixups of DAMON selftests and kunit tests.
      The first three patches make DAMON selftests more cleanly maintained
      (patches 1 and 2) without unnecessary warnings (patch 3).  The following
      six patches remove an unnecessary test case (patch 4), handle config
      combinations that can make tests fail (patches 5-7), reorganize the test
      files following the new guideline (patch 8), and add a reference
      kunitconfig for DAMON kunit tests (patch 9).
      
      
      This patch (of 9):
      
      DAMON selftests build access_memory_even, but it's not on the .gitignore
      list.  Add it to make 'git status' output cleaner.
      
      Link: https://lkml.kernel.org/r/20240827030336.7930-1-sj@kernel.org
      Link: https://lkml.kernel.org/r/20240827030336.7930-2-sj@kernel.org
      Fixes: c94df805 ("selftests/damon: implement a program for even-numbered memory regions access")
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Cc: Brendan Higgins <brendanhiggins@google.com>
      Cc: David Gow <davidgow@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • sched/numa: Fix the vma scan starving issue · f22cde43
      Yujie Liu authored
      Problem statement:
      Since commit fc137c0d ("sched/numa: enhance vma scanning logic"), the
      NUMA vma scan overhead has been reduced a lot.  Meanwhile, the reduced
      vma scanning may generate less NUMA page fault information.  The
      insufficient information makes it harder for the NUMA balancer to make
      decisions.  Later, commit b7a5b537 ("sched/numa: Complete scanning of
      partial VMAs regardless of PID activity") and commit 84db47ca
      ("sched/numa: Fix mm numa_scan_seq based unconditional scan") were found
      to bring back part of the performance.

      Recently, when running SPECcpu omnetpp_r on a 320 CPUs/2 Sockets system,
      a long duration of remote NUMA node reads was observed via PMU events: a
      few cores had ~500MB/s of remote memory access for ~20 seconds.  This
      causes high core-to-core variance and a performance penalty.  After
      investigation, it was found that many vmas are skipped due to the active
      PID check.  According to the trace events, in most cases
      vma_is_accessed() returns false because the access history stored in the
      pids_active array has been cleared.
      
      Proposal:
      The main idea is to adjust vma_is_accessed() so that it returns true more
      easily: compare mm->numa_scan_seq with vma->numab_state->prev_scan_seq
      and, if the difference exceeds the threshold, scan the vma.
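
The proposed check can be sketched as a small userspace C model. This is only an illustration under stated assumptions, not the kernel implementation: `struct vma_scan_state`, `vma_should_scan()` and the `threshold` parameter are hypothetical names, and the PID-activity test is reduced to a single boolean. Only the field names `start_scan_seq` and `prev_scan_seq` and the `< 2` check are taken from this commit message.

```c
#include <stdbool.h>

/* Hypothetical userspace stand-in for the per-VMA NUMA scan state. */
struct vma_scan_state {
	unsigned int start_scan_seq; /* models vma->numab_state->start_scan_seq */
	unsigned int prev_scan_seq;  /* models vma->numab_state->prev_scan_seq */
	bool pid_active;             /* models "this PID is recorded in pids_active" */
};

/*
 * Simplified model of the scan decision. mm_scan_seq models
 * mm->numa_scan_seq; threshold is an assumed tunable, not a value
 * taken from the patch.
 */
bool vma_should_scan(const struct vma_scan_state *vma,
		     unsigned int mm_scan_seq, unsigned int threshold)
{
	/* The first two scan sequences are always granted. */
	if (mm_scan_seq - vma->start_scan_seq < 2)
		return true;

	/* Recent access by this PID still qualifies the VMA. */
	if (vma->pid_active)
		return true;

	/* New rule: scan anyway if the VMA has gone unscanned for too long. */
	if (mm_scan_seq - vma->prev_scan_seq > threshold)
		return true;

	return false;
}
```

In this model, with a threshold of 2, a VMA last scanned at sequence 3 is still skipped at sequence 4 but is scanned again at sequence 6, even though its PID history has been cleared.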
      
      This patch especially helps cases where there is a small number of
      threads, like the process-based SPECcpu.  Without this patch, if the
      SPECcpu process accesses the vma at the beginning and then sleeps for a
      long time, the pids_active array will be cleared.  As a result, when this
      process is woken up again, it never gets a chance to set prot_none
      anymore, because the vma scan is only granted for the first 2 scan
      sequences:
      (current->mm->numa_scan_seq - vma->numab_state->start_scan_seq) < 2
      To make things worse, no other threads within the task can help set
      prot_none.  This causes information loss.
      
      Raghavendra helped test the current patch and got positive results
      on an AMD platform:
      
      autonumabench NUMA01
                                  base                  patched
      Amean     syst-NUMA01      194.05 (   0.00%)      165.11 *  14.92%*
      Amean     elsp-NUMA01      324.86 (   0.00%)      315.58 *   2.86%*
      
      Duration User      380345.36   368252.04
      Duration System      1358.89     1156.23
      Duration Elapsed     2277.45     2213.25
      
      autonumabench NUMA02
      
      Amean     syst-NUMA02        1.12 (   0.00%)        1.09 *   2.93%*
      Amean     elsp-NUMA02        3.50 (   0.00%)        3.56 *  -1.84%*
      
      Duration User        1513.23     1575.48
      Duration System         8.33        8.13
      Duration Elapsed       28.59       29.71
      
      kernbench
      
      Amean     user-256    22935.42 (   0.00%)    22535.19 *   1.75%*
      Amean     syst-256     7284.16 (   0.00%)     7608.72 *  -4.46%*
      Amean     elsp-256      159.01 (   0.00%)      158.17 *   0.53%*
      
      Duration User       68816.41    67615.74
      Duration System     21873.94    22848.08
      Duration Elapsed      506.66      504.55
      
      Intel 256 CPUs/2 Sockets:
      autonuma benchmark also shows improvements:
      
                                                     v6.10-rc5              v6.10-rc5
                                                                               +patch
      Amean     syst-NUMA01                  245.85 (   0.00%)      230.84 *   6.11%*
      Amean     syst-NUMA01_THREADLOCAL      205.27 (   0.00%)      191.86 *   6.53%*
      Amean     syst-NUMA02                   18.57 (   0.00%)       18.09 *   2.58%*
      Amean     syst-NUMA02_SMT                2.63 (   0.00%)        2.54 *   3.47%*
      Amean     elsp-NUMA01                  517.17 (   0.00%)      526.34 *  -1.77%*
      Amean     elsp-NUMA01_THREADLOCAL       99.92 (   0.00%)      100.59 *  -0.67%*
      Amean     elsp-NUMA02                   15.81 (   0.00%)       15.72 *   0.59%*
      Amean     elsp-NUMA02_SMT               13.23 (   0.00%)       12.89 *   2.53%*
      
                         v6.10-rc5   v6.10-rc5
                                        +patch
      Duration User     1064010.16  1075416.23
      Duration System      3307.64     3104.66
      Duration Elapsed     4537.54     4604.73
      
      The SPECcpu remote node access issue disappears with the patch applied.
      
      Link: https://lkml.kernel.org/r/20240827112958.181388-1-yu.c.chen@intel.com
      Fixes: fc137c0d ("sched/numa: enhance vma scanning logic")
      Signed-off-by: Chen Yu <yu.c.chen@intel.com>
      Co-developed-by: Chen Yu <yu.c.chen@intel.com>
      Signed-off-by: Yujie Liu <yujie.liu@intel.com>
      Reported-by: Xiaoping Zhou <xiaoping.zhou@intel.com>
      Reviewed-and-tested-by: Raghavendra K T <raghavendra.kt@amd.com>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: "Chen, Tim C" <tim.c.chen@intel.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Juri Lelli <juri.lelli@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Raghavendra K T <raghavendra.kt@amd.com>
      Cc: Vincent Guittot <vincent.guittot@linaro.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • memory tier: fix deadlock warning while onlining pages · 073c78ed
      Yanfei Xu authored
      Commit 823430c8 ("memory tier: consolidate the initialization of
      memory tiers") introduced a locking change that uses guard(mutex)
      instead of mutex_lock/unlock() for memory_tier_lock.  It unexpectedly
      expanded the locked region to include hotplug_memory_notifier(); as a
      result, lockdep reports a possible circular locking dependency (an ABBA
      deadlock).  Exclude hotplug_memory_notifier() from the locked region to
      fix it.
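
How a scope-based guard can silently widen a critical section is shown in the small userspace sketch below. It uses the GCC/Clang `cleanup` variable attribute, the compiler feature that the kernel's guard() helpers build on. `toy_lock`, `TOY_GUARD()`, `tier_lock` and the `init_*` functions are hypothetical names invented for this illustration; none of them come from the patch.

```c
#include <stdbool.h>

/* Toy lock that only records whether it is currently held. */
struct toy_lock { bool held; };

static void toy_lock_acquire(struct toy_lock *l) { l->held = true; }
static void toy_lock_release(struct toy_lock *l) { l->held = false; }

/* Cleanup handlers receive a pointer to the annotated variable. */
static void toy_guard_release(struct toy_lock **l) { toy_lock_release(*l); }

/* Acquire now, release automatically when the enclosing scope ends. */
#define TOY_GUARD(lock)                                               \
	struct toy_lock *toy_guard_                                   \
		__attribute__((cleanup(toy_guard_release), unused)) = \
		(toy_lock_acquire(lock), (lock))

struct toy_lock tier_lock; /* stands in for memory_tier_lock */

/* Stands in for notifier registration; reports whether the lock is held. */
bool notifier_registration_sees_lock(void) { return tier_lock.held; }

/* Guard version: the lock stays held until the function returns. */
bool init_with_guard(void)
{
	TOY_GUARD(&tier_lock);
	/* ... tier setup that really needs the lock ... */
	return notifier_registration_sees_lock(); /* still inside the scope */
}

/* Explicit version: the lock is dropped before the registration step. */
bool init_with_explicit_unlock(void)
{
	toy_lock_acquire(&tier_lock);
	/* ... tier setup that really needs the lock ... */
	toy_lock_release(&tier_lock);
	return notifier_registration_sees_lock();
}
```

In this model, init_with_guard() performs the registration step with the lock still held, while init_with_explicit_unlock() does not. This mirrors the reason given here for bringing the mutex_lock/unlock() pair back.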
      
      The deadlock scenario is as follows: when a memory online event occurs,
      the memory notifier execution takes the read lock of memory_chain.rwsem.
      Meanwhile, the registration of the memory notifier in memory_tier_init()
      acquires the write lock of memory_chain.rwsem while holding
      memory_tier_lock.  The memory online event then continues and invokes the
      memory hotplug callback registered by memory_tier_init().  Since this
      callback tries to acquire memory_tier_lock, a deadlock occurs.
      
      In practice, this deadlock can't happen, because memory_tier_init()
      always executes before any memory online event: subsys_initcall() has a
      higher priority than module_init().
      
      [  133.491106] WARNING: possible circular locking dependency detected
      [  133.493656] 6.11.0-rc2+ #146 Tainted: G           O     N
      [  133.504290] ------------------------------------------------------
      [  133.515194] (udev-worker)/1133 is trying to acquire lock:
      [  133.525715] ffffffff87044e28 (memory_tier_lock){+.+.}-{3:3}, at: memtier_hotplug_callback+0x383/0x4b0
      [  133.536449]
      [  133.536449] but task is already holding lock:
      [  133.549847] ffffffff875d3310 ((memory_chain).rwsem){++++}-{3:3}, at: blocking_notifier_call_chain+0x60/0xb0
      [  133.556781]
      [  133.556781] which lock already depends on the new lock.
      [  133.556781]
      [  133.569957]
      [  133.569957] the existing dependency chain (in reverse order) is:
      [  133.577618]
      [  133.577618] -> #1 ((memory_chain).rwsem){++++}-{3:3}:
      [  133.584997]        down_write+0x97/0x210
      [  133.588647]        blocking_notifier_chain_register+0x71/0xd0
      [  133.592537]        register_memory_notifier+0x26/0x30
      [  133.596314]        memory_tier_init+0x187/0x300
      [  133.599864]        do_one_initcall+0x117/0x5d0
      [  133.603399]        kernel_init_freeable+0xab0/0xeb0
      [  133.606986]        kernel_init+0x28/0x2f0
      [  133.610312]        ret_from_fork+0x59/0x90
      [  133.613652]        ret_from_fork_asm+0x1a/0x30
      [  133.617012]
      [  133.617012] -> #0 (memory_tier_lock){+.+.}-{3:3}:
      [  133.623390]        __lock_acquire+0x2efd/0x5c60
      [  133.626730]        lock_acquire+0x1ce/0x580
      [  133.629757]        __mutex_lock+0x15c/0x1490
      [  133.632731]        mutex_lock_nested+0x1f/0x30
      [  133.635717]        memtier_hotplug_callback+0x383/0x4b0
      [  133.638748]        notifier_call_chain+0xbf/0x370
      [  133.641647]        blocking_notifier_call_chain+0x76/0xb0
      [  133.644636]        memory_notify+0x2e/0x40
      [  133.647427]        online_pages+0x597/0x720
      [  133.650246]        memory_subsys_online+0x4f6/0x7f0
      [  133.653107]        device_online+0x141/0x1d0
      [  133.655831]        online_memory_block+0x4d/0x60
      [  133.658616]        walk_memory_blocks+0xc0/0x120
      [  133.661419]        add_memory_resource+0x51d/0x6c0
      [  133.664202]        add_memory_driver_managed+0xf5/0x180
      [  133.667060]        dev_dax_kmem_probe+0x7f7/0xb40 [kmem]
      [  133.669949]        dax_bus_probe+0x147/0x230
      [  133.672687]        really_probe+0x27f/0xac0
      [  133.675463]        __driver_probe_device+0x1f3/0x460
      [  133.678493]        driver_probe_device+0x56/0x1b0
      [  133.681366]        __driver_attach+0x277/0x570
      [  133.684149]        bus_for_each_dev+0x145/0x1e0
      [  133.686937]        driver_attach+0x49/0x60
      [  133.689673]        bus_add_driver+0x2f3/0x6b0
      [  133.692421]        driver_register+0x170/0x4b0
      [  133.695118]        __dax_driver_register+0x141/0x1b0
      [  133.697910]        dax_kmem_init+0x54/0xff0 [kmem]
      [  133.700794]        do_one_initcall+0x117/0x5d0
      [  133.703455]        do_init_module+0x277/0x750
      [  133.706054]        load_module+0x5d1d/0x74f0
      [  133.708602]        init_module_from_file+0x12c/0x1a0
      [  133.711234]        idempotent_init_module+0x3f1/0x690
      [  133.713937]        __x64_sys_finit_module+0x10e/0x1a0
      [  133.716492]        x64_sys_call+0x184d/0x20d0
      [  133.719053]        do_syscall_64+0x6d/0x140
      [  133.721537]        entry_SYSCALL_64_after_hwframe+0x76/0x7e
      [  133.724239]
      [  133.724239] other info that might help us debug this:
      [  133.724239]
      [  133.730832]  Possible unsafe locking scenario:
      [  133.730832]
      [  133.735298]        CPU0                    CPU1
      [  133.737759]        ----                    ----
      [  133.740165]   rlock((memory_chain).rwsem);
      [  133.742623]                                lock(memory_tier_lock);
      [  133.745357]                                lock((memory_chain).rwsem);
      [  133.748141]   lock(memory_tier_lock);
      [  133.750489]
      [  133.750489]  *** DEADLOCK ***
      [  133.750489]
      [  133.756742] 6 locks held by (udev-worker)/1133:
      [  133.759179]  #0: ffff888207be6158 (&dev->mutex){....}-{3:3}, at: __driver_attach+0x26c/0x570
      [  133.762299]  #1: ffffffff875b5868 (device_hotplug_lock){+.+.}-{3:3}, at: lock_device_hotplug+0x20/0x30
      [  133.765565]  #2: ffff88820cf6a108 (&dev->mutex){....}-{3:3}, at: device_online+0x2f/0x1d0
      [  133.768978]  #3: ffffffff86d08ff0 (cpu_hotplug_lock){++++}-{0:0}, at: mem_hotplug_begin+0x17/0x30
      [  133.772312]  #4: ffffffff8702dfb0 (mem_hotplug_lock){++++}-{0:0}, at: mem_hotplug_begin+0x23/0x30
      [  133.775544]  #5: ffffffff875d3310 ((memory_chain).rwsem){++++}-{3:3}, at: blocking_notifier_call_chain+0x60/0xb0
      [  133.779113]
      [  133.779113] stack backtrace:
      [  133.783728] CPU: 5 UID: 0 PID: 1133 Comm: (udev-worker) Tainted: G           O     N 6.11.0-rc2+ #146
      [  133.787220] Tainted: [O]=OOT_MODULE, [N]=TEST
      [  133.789948] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
      [  133.793291] Call Trace:
      [  133.795826]  <TASK>
      [  133.798284]  dump_stack_lvl+0xea/0x150
      [  133.801025]  dump_stack+0x19/0x20
      [  133.803609]  print_circular_bug+0x477/0x740
      [  133.806341]  check_noncircular+0x2f4/0x3e0
      [  133.809056]  ? __pfx_check_noncircular+0x10/0x10
      [  133.811866]  ? __pfx_lockdep_lock+0x10/0x10
      [  133.814670]  ? __sanitizer_cov_trace_const_cmp8+0x1c/0x30
      [  133.817610]  __lock_acquire+0x2efd/0x5c60
      [  133.820339]  ? __pfx___lock_acquire+0x10/0x10
      [  133.823128]  ? __dax_driver_register+0x141/0x1b0
      [  133.825926]  ? do_one_initcall+0x117/0x5d0
      [  133.828648]  lock_acquire+0x1ce/0x580
      [  133.831349]  ? memtier_hotplug_callback+0x383/0x4b0
      [  133.834293]  ? __pfx_lock_acquire+0x10/0x10
      [  133.837134]  __mutex_lock+0x15c/0x1490
      [  133.839829]  ? memtier_hotplug_callback+0x383/0x4b0
      [  133.842753]  ? memtier_hotplug_callback+0x383/0x4b0
      [  133.845602]  ? __this_cpu_preempt_check+0x21/0x30
      [  133.848438]  ? __pfx___mutex_lock+0x10/0x10
      [  133.851200]  ? __pfx_lock_acquire+0x10/0x10
      [  133.853935]  ? global_dirty_limits+0xc0/0x160
      [  133.856699]  ? __sanitizer_cov_trace_switch+0x58/0xa0
      [  133.859564]  mutex_lock_nested+0x1f/0x30
      [  133.862251]  ? mutex_lock_nested+0x1f/0x30
      [  133.864964]  memtier_hotplug_callback+0x383/0x4b0
      [  133.867752]  notifier_call_chain+0xbf/0x370
      [  133.870550]  ? writeback_set_ratelimit+0xe8/0x160
      [  133.873372]  blocking_notifier_call_chain+0x76/0xb0
      [  133.876311]  memory_notify+0x2e/0x40
      [  133.879013]  online_pages+0x597/0x720
      [  133.881686]  ? irqentry_exit+0x3e/0xa0
      [  133.884397]  ? __pfx_online_pages+0x10/0x10
      [  133.887244]  ? __sanitizer_cov_trace_const_cmp8+0x1c/0x30
      [  133.890299]  ? mhp_init_memmap_on_memory+0x7a/0x1c0
      [  133.893203]  memory_subsys_online+0x4f6/0x7f0
      [  133.896099]  ? __pfx_memory_subsys_online+0x10/0x10
      [  133.899039]  ? xa_load+0x16d/0x2e0
      [  133.901667]  ? __pfx_xa_load+0x10/0x10
      [  133.904366]  ? __pfx_memory_subsys_online+0x10/0x10
      [  133.907218]  device_online+0x141/0x1d0
      [  133.909845]  online_memory_block+0x4d/0x60
      [  133.912494]  walk_memory_blocks+0xc0/0x120
      [  133.915104]  ? __pfx_online_memory_block+0x10/0x10
      [  133.917776]  add_memory_resource+0x51d/0x6c0
      [  133.920404]  ? __pfx_add_memory_resource+0x10/0x10
      [  133.923104]  ? _raw_write_unlock+0x31/0x60
      [  133.925781]  ? register_memory_resource+0x119/0x180
      [  133.928450]  add_memory_driver_managed+0xf5/0x180
      [  133.931036]  dev_dax_kmem_probe+0x7f7/0xb40 [kmem]
      [  133.933665]  ? __pfx_dev_dax_kmem_probe+0x10/0x10 [kmem]
      [  133.936332]  ? __pfx___up_read+0x10/0x10
      [  133.938878]  dax_bus_probe+0x147/0x230
      [  133.941332]  ? __pfx_dax_bus_probe+0x10/0x10
      [  133.943954]  really_probe+0x27f/0xac0
      [  133.946387]  ? __sanitizer_cov_trace_const_cmp1+0x1e/0x30
      [  133.949106]  __driver_probe_device+0x1f3/0x460
      [  133.951704]  ? parse_option_str+0x149/0x190
      [  133.954241]  driver_probe_device+0x56/0x1b0
      [  133.956749]  __driver_attach+0x277/0x570
      [  133.959228]  ? __pfx___driver_attach+0x10/0x10
      [  133.961776]  bus_for_each_dev+0x145/0x1e0
      [  133.964367]  ? __pfx_bus_for_each_dev+0x10/0x10
      [  133.967019]  ? __kasan_check_read+0x15/0x20
      [  133.969543]  ? _raw_spin_unlock+0x31/0x60
      [  133.972132]  driver_attach+0x49/0x60
      [  133.974536]  bus_add_driver+0x2f3/0x6b0
      [  133.977044]  driver_register+0x170/0x4b0
      [  133.979480]  __dax_driver_register+0x141/0x1b0
      [  133.982126]  ? __pfx_dax_kmem_init+0x10/0x10 [kmem]
      [  133.984724]  dax_kmem_init+0x54/0xff0 [kmem]
      [  133.987284]  ? __pfx_dax_kmem_init+0x10/0x10 [kmem]
      [  133.989965]  do_one_initcall+0x117/0x5d0
      [  133.992506]  ? __pfx_do_one_initcall+0x10/0x10
      [  133.995185]  ? __kasan_kmalloc+0x88/0xa0
      [  133.997748]  ? kasan_poison+0x3e/0x60
      [  134.000288]  ? kasan_unpoison+0x2c/0x60
      [  134.002762]  ? kasan_poison+0x3e/0x60
      [  134.005202]  ? __asan_register_globals+0x62/0x80
      [  134.007753]  ? __pfx_dax_kmem_init+0x10/0x10 [kmem]
      [  134.010439]  do_init_module+0x277/0x750
      [  134.012953]  load_module+0x5d1d/0x74f0
      [  134.015406]  ? __pfx_load_module+0x10/0x10
      [  134.017887]  ? __pfx_ima_post_read_file+0x10/0x10
      [  134.020470]  ? __sanitizer_cov_trace_const_cmp8+0x1c/0x30
      [  134.023127]  ? __sanitizer_cov_trace_const_cmp4+0x1a/0x20
      [  134.025767]  ? security_kernel_post_read_file+0xa2/0xd0
      [  134.028429]  ? __sanitizer_cov_trace_const_cmp4+0x1a/0x20
      [  134.031162]  ? kernel_read_file+0x503/0x820
      [  134.033645]  ? __pfx_kernel_read_file+0x10/0x10
      [  134.036232]  ? __pfx___lock_acquire+0x10/0x10
      [  134.038766]  init_module_from_file+0x12c/0x1a0
      [  134.041291]  ? init_module_from_file+0x12c/0x1a0
      [  134.043936]  ? __pfx_init_module_from_file+0x10/0x10
      [  134.046516]  ? __this_cpu_preempt_check+0x21/0x30
      [  134.049091]  ? __kasan_check_read+0x15/0x20
      [  134.051551]  ? do_raw_spin_unlock+0x60/0x210
      [  134.054077]  idempotent_init_module+0x3f1/0x690
      [  134.056643]  ? __pfx_idempotent_init_module+0x10/0x10
      [  134.059318]  ? __sanitizer_cov_trace_const_cmp4+0x1a/0x20
      [  134.061995]  ? __fget_light+0x17d/0x210
      [  134.064428]  __x64_sys_finit_module+0x10e/0x1a0
      [  134.066976]  x64_sys_call+0x184d/0x20d0
      [  134.069405]  do_syscall_64+0x6d/0x140
      [  134.071926]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
      
      [yanfei.xu@intel.com: add mutex_lock/unlock() pair back]
        Link: https://lkml.kernel.org/r/20240830102447.1445296-1-yanfei.xu@intel.com
      Link: https://lkml.kernel.org/r/20240827113614.1343049-1-yanfei.xu@intel.com
      Fixes: 823430c8 ("memory tier: consolidate the initialization of memory tiers")
      Signed-off-by: Yanfei Xu <yanfei.xu@intel.com>
      Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
      Cc: Ho-Ren (Jack) Chuang <horen.chuang@linux.dev>
      Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      073c78ed
    • Uladzislau Rezki (Sony)'s avatar
      mm: vmalloc: refactor vm_area_alloc_pages() function · 7de8728f
      Uladzislau Rezki (Sony) authored
      The aim is to simplify the vm_area_alloc_pages() function and make it
      less confusing, as it has become more cluttered over time:
      
      - eliminate the "bulk_gfp" variable and do not overwrite the gfp
        flags for the bulk allocator;
      - drop the __GFP_NOFAIL flag for high-order-page requests in the upper
        layer, so that the handling of __GFP_NOFAIL allocations is less
        spread between levels;
      - add a comment about the fallback path taken if a high-order attempt
        is unsuccessful, because __GFP_NOFAIL is dropped for such cases;
      - fix a typo in a commit message.
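
      The flag handling described in the list above can be modelled in user
      space roughly as follows. This is a minimal sketch, not the kernel code:
      the TOY_GFP_* bits, toy_alloc() and the other toy_* helpers are
      hypothetical names invented for illustration, and the stub allocator
      simply pretends that every high-order request fails so that the
      fallback path is exercised.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag bits for this sketch; not the kernel's real gfp_t values. */
#define TOY_GFP_KERNEL 0x1u
#define TOY_GFP_NOFAIL 0x2u

/* Flags to pass down for one allocation attempt at the given page order. */
static unsigned int toy_flags_for_order(unsigned int gfp, unsigned int order)
{
	/* High-order attempts are allowed to fail, so NOFAIL is masked out. */
	return order ? (gfp & ~TOY_GFP_NOFAIL) : gfp;
}

/* Stub allocator (assumption): every high-order request fails. */
static bool toy_alloc(unsigned int gfp, unsigned int order)
{
	(void)gfp;
	return order == 0;
}

/* Returns the order that finally succeeded, or -1 if nothing did. */
static int toy_alloc_with_fallback(unsigned int gfp, unsigned int order)
{
	if (order && toy_alloc(toy_flags_for_order(gfp, order), order))
		return (int)order;

	/* Fallback path: order-0 pages with the caller's original flags. */
	return toy_alloc(toy_flags_for_order(gfp, 0), 0) ? 0 : -1;
}
```

      With this model, a request for order 2 with the NOFAIL bit set is
      attempted without that bit, and the bit is only honoured once the
      request has fallen back to order 0.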
      
      Link: https://lkml.kernel.org/r/20240827190916.34242-1-urezki@gmail.com
      Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Baoquan He <bhe@redhat.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sony.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      7de8728f
    • Lorenzo Stoakes's avatar
      mm: rework vm_ops->close() handling on VMA merge · 01c373e9
      Lorenzo Stoakes authored
      In commit 714965ca ("mm/mmap: start distinguishing if vma can be
      removed in mergeability test") we relaxed the VMA merge rules for VMAs
      possessing a vm_ops->close() hook, permitting this operation in instances
      where we wouldn't delete the VMA as part of the merge operation.
      
      This was later corrected in commit fc0c8f90 ("mm, mmap: fix
      vma_merge() case 7 with vma_ops->close") to account for a subtle case that
      the previous commit had not taken into account.
      
      In both instances, we first rely on is_mergeable_vma() to determine
      whether we might be dealing with a VMA that might be removed, taking
      advantage of the fact that a 'previous' VMA will never be deleted, only
      VMAs that follow it.
      
      The second patch corrects the instance where a merge of the previous VMA
      into a subsequent one did not correctly check whether the subsequent VMA
      had a vm_ops->close() handler.
      
      Both changes prevent merge cases that are actually permissible (for
      instance a merge of a VMA into a following VMA with a vm_ops->close(), but
      with no previous VMA, which would result in the next VMA being extended,
      not deleted).
      
      In addition, both changes fail to consider the case where a VMA that would
      otherwise be merged with the previous and next VMA might have
      vm_ops->close(), on the assumption that for this to be the case, all three
      would have to have the same vma->vm_file to be mergeable and thus the same
      vm_ops.
      
      And in addition both changes operate at 50,000 feet, trying to guess
      whether a VMA will be deleted.
      
      As we have majorly refactored the VMA merge operation and de-duplicated
      code to the point where we know precisely where deletions will occur, this
      patch removes the aforementioned checks altogether and instead explicitly
      checks whether a VMA will be deleted.
      
      In cases where a reduced merge is still possible (where we merge both
      previous and next VMA but the next VMA has a vm_ops->close hook, meaning
      we could just merge the previous and current VMA), we do so, otherwise the
      merge is not permitted.
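
      The decision described above can be sketched as a small user-space
      model. This is an illustration under stated assumptions, not the kernel
      implementation: struct toy_vma, toy_can_remove() and toy_plan_merge()
      are hypothetical names, and the model assumes the whole of the current
      VMA is being merged, so that every successful outcome deletes it.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for a VMA: only records whether vm_ops->close() is set. */
struct toy_vma {
	bool has_close;
};

enum toy_merge {
	TOY_MERGE_NONE,
	TOY_MERGE_PREV,   /* prev absorbs current; current is deleted */
	TOY_MERGE_NEXT,   /* next is extended over current; current is deleted */
	TOY_MERGE_BOTH    /* prev absorbs current and next; both are deleted */
};

/* A VMA with a close() hook must not be deleted as part of a merge. */
static bool toy_can_remove(const struct toy_vma *vma)
{
	return !vma->has_close;
}

/* prev_ok / next_ok: whether the neighbours are otherwise mergeable. */
static enum toy_merge toy_plan_merge(bool prev_ok, bool next_ok,
				     const struct toy_vma *cur,
				     const struct toy_vma *next)
{
	if (!prev_ok && !next_ok)
		return TOY_MERGE_NONE;

	/* Assumption of this model: every outcome deletes the current VMA. */
	if (!toy_can_remove(cur))
		return TOY_MERGE_NONE;

	/* Merging both would delete next; reduce to merging prev and current. */
	if (prev_ok && next_ok && !toy_can_remove(next))
		return TOY_MERGE_PREV;

	if (prev_ok && next_ok)
		return TOY_MERGE_BOTH;

	return prev_ok ? TOY_MERGE_PREV : TOY_MERGE_NEXT;
}
```

      Note that a close() hook on the next VMA does not block a merge into it
      when there is no mergeable previous VMA, since in that outcome the next
      VMA is extended rather than deleted.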
      
      We take advantage of our userland testing to assert that this functions
      correctly - replacing the previous limited vm_ops->close() tests with
      tests for every single case where we delete a VMA.
      
      We also update all testing for both new and modified VMAs to set
      vma->vm_ops->close() in every single instance where this would not prevent
      the merge, to assert that we never do so.
      
      Link: https://lkml.kernel.org/r/9f96b8cfeef3d14afabddac3d6144afdfbef2e22.1725040657.git.lorenzo.stoakes@oracle.com
      Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
      Cc: Mark Brown <broonie@kernel.org>
      Cc: Bert Karwatzki <spasswolf@web.de>
      Cc: Jeff Xu <jeffxu@chromium.org>
      Cc: Jiri Olsa <olsajiri@gmail.com>
      Cc: Kees Cook <kees@kernel.org>
      Cc: Lorenzo Stoakes <lstoakes@gmail.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: "Paul E. McKenney" <paulmck@kernel.org>
      Cc: Paul Moore <paul@paul-moore.com>
      Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      01c373e9
    • Lorenzo Stoakes's avatar
      mm: refactor vma_merge() into modify-only vma_merge_existing_range() · cc8cb369
      Lorenzo Stoakes authored
      The existing vma_merge() function is no longer required to handle what
      were previously referred to as cases 1-3 (i.e.  the merging of a new VMA),
      as this is now handled by vma_merge_new_vma().
      
      Additionally, simplify the convoluted control flow of the original,
      maintaining identical logic but expressing it more clearly.  Do away
      with the complicated set of cases and instead logically examine each
      possible outcome: merging of both the previous and subsequent VMA,
      merging of the previous VMA alone and merging of the subsequent VMA
      alone.
      
      We now utilise the previously implemented commit_merge() function to share
      logic with vma_expand(), de-duplicating code and providing less surface
      area for bugs and confusion.  In order to do so, we adjust this function
      to accept parameters specific to merging existing ranges.
      
      Link: https://lkml.kernel.org/r/2cf6016b7bfcc4965fc3cde10827560c42e4f12c.1725040657.git.lorenzo.stoakes@oracle.com
      Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
      Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
      Cc: Mark Brown <broonie@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Bert Karwatzki <spasswolf@web.de>
      Cc: Jeff Xu <jeffxu@chromium.org>
      Cc: Jiri Olsa <olsajiri@gmail.com>
      Cc: Kees Cook <kees@kernel.org>
      Cc: Lorenzo Stoakes <lstoakes@gmail.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: "Paul E. McKenney" <paulmck@kernel.org>
      Cc: Paul Moore <paul@paul-moore.com>
      Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      cc8cb369
    • Lorenzo Stoakes's avatar
      mm: introduce commit_merge(), abstracting final commit of merge · 65e0aa64
      Lorenzo Stoakes authored
      Pull the part of vma_expand() which actually commits the merge operation
      (that is, inserts the VMA into the maple tree and sets the VMA's
      vma->vm_start and vma->vm_end fields) into its own function.

      We implement only the parts needed for vma_expand(), which, as a result
      of previous work, is now also the means by which new VMA ranges are merged.
      
      The next commit in the series will implement merging of existing ranges
      which will extend commit_merge() to accommodate this case and result in
      all merges using this common code.
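
      The shape of this refactoring can be illustrated with a small user-space
      sketch. Everything here is a stand-in chosen for illustration: struct
      toy_vma, toy_tree (a flat array playing the role of the maple tree),
      toy_commit_merge() and toy_expand() are hypothetical names and do not
      reflect the kernel's actual signatures.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_NPAGES 16

struct toy_vma {
	size_t start;   /* first page index covered */
	size_t end;     /* one past the last page index covered */
};

/* Stand-in for the maple tree: maps each page index to the VMA covering it. */
static struct toy_vma *toy_tree[TOY_NPAGES];

/* The commit step on its own: update the VMA's bounds and store the range. */
static void toy_commit_merge(struct toy_vma *vma, size_t start, size_t end)
{
	vma->start = start;
	vma->end = end;
	for (size_t i = start; i < end; i++)
		toy_tree[i] = vma;
}

/* Expansion validates the request, then delegates to the shared commit step. */
static int toy_expand(struct toy_vma *vma, size_t start, size_t end)
{
	/* The new range must cover the old one and fit in the address space. */
	if (start > vma->start || end < vma->end || end > TOY_NPAGES)
		return -1;

	toy_commit_merge(vma, start, end);
	return 0;
}
```

      Keeping the commit step separate means that a second caller, such as a
      merge of existing ranges, can reuse it after doing its own validation.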
      
      Link: https://lkml.kernel.org/r/7b985a20dfa549e3c370cd274d732b64c44f6dbd.1725040657.git.lorenzo.stoakes@oracle.com
      Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
      Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
      Cc: Mark Brown <broonie@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Bert Karwatzki <spasswolf@web.de>
      Cc: Jeff Xu <jeffxu@chromium.org>
      Cc: Jiri Olsa <olsajiri@gmail.com>
      Cc: Kees Cook <kees@kernel.org>
      Cc: Lorenzo Stoakes <lstoakes@gmail.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: "Paul E. McKenney" <paulmck@kernel.org>
      Cc: Paul Moore <paul@paul-moore.com>
      Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      65e0aa64