1. 24 Feb, 2024 32 commits
• crash: split crash dumping code out from kexec_core.c · 02aff848
      Baoquan He authored
Currently, KEXEC_CORE selects CRASH_CORE automatically because crash code
needs to be built in to avoid compile errors when building kexec code, even
though the crash dumping functionality is not enabled. E.g.
      --------------------
      CONFIG_CRASH_CORE=y
      CONFIG_KEXEC_CORE=y
      CONFIG_KEXEC=y
      CONFIG_KEXEC_FILE=y
      ---------------------
      
      After splitting out crashkernel reservation code and vmcoreinfo exporting
      code, there's only crash related code left in kernel/crash_core.c. Now
      move crash related codes from kexec_core.c to crash_core.c and only build it
      in when CONFIG_CRASH_DUMP=y.
      
      And also wrap up crash codes inside CONFIG_CRASH_DUMP ifdeffery scope,
      or replace inappropriate CONFIG_KEXEC_CORE ifdef with CONFIG_CRASH_DUMP
      ifdef in generic kernel files.
      
With these changes, the crash_core code is abstracted from the kexec code
and can be disabled entirely if only the kexec reboot feature is wanted.
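
As a rough illustration of the ifdeffery change described above, here is a
hedged sketch (the function below is hypothetical, not a hunk from the patch)
of how a generic kernel file switches from a CONFIG_KEXEC_CORE guard to a
CONFIG_CRASH_DUMP guard:

#ifdef CONFIG_CRASH_DUMP		/* was: #ifdef CONFIG_KEXEC_CORE */
static void example_crash_only_work(void)
{
	/* crash-dump-only code, compiled out when CONFIG_CRASH_DUMP=n */
	pr_info("crash dumping support is built in\n");
}
#else
static inline void example_crash_only_work(void) { }
#endif /* CONFIG_CRASH_DUMP */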
      
Link: https://lkml.kernel.org/r/20240124051254.67105-5-bhe@redhat.com
Signed-off-by: Baoquan He <bhe@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Hari Bathini <hbathini@linux.ibm.com>
      Cc: Pingfan Liu <piliu@redhat.com>
      Cc: Klara Modin <klarasmodin@gmail.com>
      Cc: Michael Kelley <mhklinux@outlook.com>
      Cc: Nathan Chancellor <nathan@kernel.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Yang Li <yang.lee@linux.alibaba.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      02aff848
• crash: remove dependency of FA_DUMP on CRASH_DUMP · 2c44b67e
      Baoquan He authored
In the kdump kernel, /proc/vmcore is an ELF file mapping the crashed kernel's
old memory content. Its ELF header is constructed in the 1st kernel and passed
to the kdump kernel via elfcorehdr_addr. Config CRASH_DUMP enables the code
for accessing the 1st kernel's old memory on different architectures.

Currently, config FA_DUMP has a dependency on CRASH_DUMP because fadump needs
to access the global variable 'elfcorehdr_addr' to judge whether it is running
in a kdump kernel, inside the function is_kdump_kernel(). In the current
kernel/crash_dump.c, the variable 'elfcorehdr_addr' is defined, and the
function setup_elfcorehdr() is used to parse the kernel parameter and fetch
the passed value of elfcorehdr_addr. Just for accessing elfcorehdr_addr,
FA_DUMP really doesn't have to depend on CRASH_DUMP.
      
To remove the dependency of FA_DUMP on CRASH_DUMP and avoid confusion, rename
kernel/crash_dump.c to kernel/elfcorehdr.c, and build it when
CONFIG_VMCORE_INFO is enabled. With this, FA_DUMP doesn't need to depend
on CRASH_DUMP.
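
For context, a hedged sketch of the check involved: is_kdump_kernel()
essentially tests whether a valid elfcorehdr_addr was passed in. Treat this
as an approximation rather than the exact in-tree definition:

#include <linux/crash_dump.h>

/* Approximate sketch; the real helper may differ in detail. */
static inline bool example_is_kdump_kernel(void)
{
	/*
	 * elfcorehdr_addr is filled from the "elfcorehdr=" kernel parameter
	 * by setup_elfcorehdr(); ELFCORE_ADDR_MAX means it was not passed.
	 */
	return elfcorehdr_addr != ELFCORE_ADDR_MAX;
}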
      
      [bhe@redhat.com: power/fadump: make FA_DUMP select CRASH_DUMP]
        Link: https://lkml.kernel.org/r/Zb8D1ASrgX0qVm9z@MiWiFi-R3L-srv
Link: https://lkml.kernel.org/r/20240124051254.67105-4-bhe@redhat.com
Signed-off-by: Baoquan He <bhe@redhat.com>
Acked-by: Hari Bathini <hbathini@linux.ibm.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Pingfan Liu <piliu@redhat.com>
      Cc: Klara Modin <klarasmodin@gmail.com>
      Cc: Michael Kelley <mhklinux@outlook.com>
      Cc: Nathan Chancellor <nathan@kernel.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Yang Li <yang.lee@linux.alibaba.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      2c44b67e
• crash: split vmcoreinfo exporting code out from crash_core.c · 443cbaf9
      Baoquan He authored
Now move the relevant code into separate files:
kernel/vmcore_info.c, include/linux/vmcore_info.h.

And add config item VMCORE_INFO to control its enabling.
      
And also update the old ifdeffery of CONFIG_CRASH_CORE, the inclusion of
<linux/crash_core.h>, and config item dependencies on CRASH_CORE
accordingly.
      
      And also do renaming as follows:
       - arch/xxx/kernel/{crash_core.c => vmcore_info.c}
      because they are only related to vmcoreinfo exporting on x86, arm64,
      riscv.
      
And also remove config item CRASH_CORE, and rely on CONFIG_KEXEC_CORE to
decide whether to build in crash_core.c.
      
      [yang.lee@linux.alibaba.com: remove duplicated include in vmcore_info.c]
        Link: https://lkml.kernel.org/r/20240126005744.16561-1-yang.lee@linux.alibaba.com
Link: https://lkml.kernel.org/r/20240124051254.67105-3-bhe@redhat.com
Signed-off-by: Baoquan He <bhe@redhat.com>
Signed-off-by: Yang Li <yang.lee@linux.alibaba.com>
Acked-by: Hari Bathini <hbathini@linux.ibm.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Pingfan Liu <piliu@redhat.com>
      Cc: Klara Modin <klarasmodin@gmail.com>
      Cc: Michael Kelley <mhklinux@outlook.com>
      Cc: Nathan Chancellor <nathan@kernel.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Yang Li <yang.lee@linux.alibaba.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      443cbaf9
• kexec: split crashkernel reservation code out from crash_core.c · 85fcde40
      Baoquan He authored
      Patch series "Split crash out from kexec and clean up related config
      items", v3.
      
      Motivation:
      =============
Previously, LKP reported a build error. When investigating, I found it
couldn't be resolved reasonably with the present messy kdump config items.
      
       https://lore.kernel.org/oe-kbuild-all/202312182200.Ka7MzifQ-lkp@intel.com/
      
The kdump (crash dumping) related config items can cause confusion:
      
      Firstly,
      
      CRASH_CORE enables codes including
       - crashkernel reservation;
       - elfcorehdr updating;
       - vmcoreinfo exporting;
       - crash hotplug handling;
      
Now fadump of powerpc, kcore dynamic debugging and kdump all select
CRASH_CORE, while:
 - fadump needs crashkernel parsing, vmcoreinfo exporting, and access to
   the global variable 'elfcorehdr_addr';
 - kcore only needs vmcoreinfo exporting;
 - kdump needs all of the current kernel/crash_core.c.

So enabling only PROC_KCORE or FA_DUMP will enable CRASH_CORE, which misleads
people into thinking crash dumping is enabled, when actually it is not.
      
      Secondly,
      
It's not reasonable to let KEXEC_CORE select CRASH_CORE.

KEXEC_CORE enables code which allocates control pages, copies kexec/kdump
segments, and prepares for switching. This code is shared by both kexec
reboot and kdump. We may want kexec reboot but have kdump disabled; in that
case, CRASH_CORE should not be selected.
      
       --------------------
       CONFIG_CRASH_CORE=y
       CONFIG_KEXEC_CORE=y
       CONFIG_KEXEC=y
       CONFIG_KEXEC_FILE=y
       ---------------------
      
      Thirdly,
      
It's not reasonable to let CRASH_DUMP select KEXEC_CORE.

That could make KEXEC_CORE and CRASH_DUMP be enabled independently of KEXEC
or KEXEC_FILE. However, without KEXEC or KEXEC_FILE, the built-in KEXEC_CORE
code doesn't make any sense, because no kernel loading or switching will
happen to utilize it.
       ---------------------
       CONFIG_CRASH_CORE=y
       CONFIG_KEXEC_CORE=y
       CONFIG_CRASH_DUMP=y
       ---------------------
      
What is worse, in this case, on arch sh and arm KEXEC relies on MMU while
CRASH_DUMP can still be enabled with !MMU, and then the compile error is
seen, as the lkp test robot reported in the above link.
      
       ------arch/sh/Kconfig------
       config ARCH_SUPPORTS_KEXEC
               def_bool MMU
      
       config ARCH_SUPPORTS_CRASH_DUMP
               def_bool BROKEN_ON_SMP
       ---------------------------
      
      Changes:
      ===========
      1, split out crash_reserve.c from crash_core.c;
2, split out vmcore_info.c from crash_core.c;
      3, move crash related codes in kexec_core.c into crash_core.c;
      4, remove dependency of FA_DUMP on CRASH_DUMP;
      5, clean up kdump related config items;
6, wrap up crash code in crash-related ifdefs on all 8 arch-es
   which support crash dumping, except ppc;
      
      Achievement:
      ===========
      With above changes, I can rearrange the config item logic as below (the right
      item depends on or is selected by the left item):
      
          PROC_KCORE -----------> VMCORE_INFO
      
                     |----------> VMCORE_INFO
          FA_DUMP----|
                     |----------> CRASH_RESERVE
      
                                                          ---->VMCORE_INFO
                                                         /
                                                         |---->CRASH_RESERVE
          KEXEC      --|                                /|
                       |--> KEXEC_CORE--> CRASH_DUMP-->/-|---->PROC_VMCORE
          KEXEC_FILE --|                               \ |
                                                         \---->CRASH_HOTPLUG
      
      
          KEXEC      --|
                       |--> KEXEC_CORE (for kexec reboot only)
          KEXEC_FILE --|
      
      Test
      ========
On all 8 architectures, i.e. x86_64, arm64, s390x, sh, arm, mips, riscv and
loongarch, I tested the below three cases of config item settings, and all
builds passed. Take the configs on x86_64 as an example here:
      
(1) Both CONFIG_KEXEC and KEXEC_FILE are unset, then all kexec/kdump
items are unset automatically:
      # Kexec and crash features
      # CONFIG_KEXEC is not set
      # CONFIG_KEXEC_FILE is not set
      # end of Kexec and crash features
      
      (2) set CONFIG_KEXEC_FILE and 'make olddefconfig':
      ---------------
      # Kexec and crash features
      CONFIG_CRASH_RESERVE=y
      CONFIG_VMCORE_INFO=y
      CONFIG_KEXEC_CORE=y
      CONFIG_KEXEC_FILE=y
      CONFIG_CRASH_DUMP=y
      CONFIG_CRASH_HOTPLUG=y
      CONFIG_CRASH_MAX_MEMORY_RANGES=8192
      # end of Kexec and crash features
      ---------------
      
      (3) unset CONFIG_CRASH_DUMP in case 2 and execute 'make olddefconfig':
      ------------------------
      # Kexec and crash features
      CONFIG_KEXEC_CORE=y
      CONFIG_KEXEC_FILE=y
      # end of Kexec and crash features
      ------------------------
      
      Note:
For ppc, it needs investigation to work out how to split out the crash code
in the arch folder. I hope Hari and Pingfan can help have a look and see if
it's doable. For now, I make it either have both kexec and crash enabled, or
disable both of them altogether.
      
      
      This patch (of 14):
      
      Both kdump and fa_dump of ppc rely on crashkernel reservation.  Move the
      relevant codes into separate files: crash_reserve.c,
      include/linux/crash_reserve.h.
      
And also add config item CRASH_RESERVE to control the enabling of the code.
And update config items which have a relationship with crashkernel
reservation.
      
      And also change ifdeffery from CONFIG_CRASH_CORE to CONFIG_CRASH_RESERVE
      when those scopes are only crashkernel reservation related.
      
      And also rename arch/XXX/include/asm/{crash_core.h => crash_reserve.h} on
      arm64, x86 and risc-v because those architectures' crash_core.h is only
      related to crashkernel reservation.
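
A hedged sketch of the kind of ifdef update this describes; the function
below is hypothetical and only illustrates guarding purely
reservation-related code with the new config item and header:

#ifdef CONFIG_CRASH_RESERVE		/* was: #ifdef CONFIG_CRASH_CORE */
#include <linux/crash_reserve.h>

static void __init example_setup_crashkernel(void)
{
	/* parse "crashkernel=" and reserve the requested memory range */
}
#endif /* CONFIG_CRASH_RESERVE */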
      
      [akpm@linux-foundation.org: s/CRASH_RESEERVE/CRASH_RESERVE/, per Klara Modin]
      Link: https://lkml.kernel.org/r/20240124051254.67105-1-bhe@redhat.com
Link: https://lkml.kernel.org/r/20240124051254.67105-2-bhe@redhat.com
Signed-off-by: Baoquan He <bhe@redhat.com>
Acked-by: Hari Bathini <hbathini@linux.ibm.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Pingfan Liu <piliu@redhat.com>
      Cc: Klara Modin <klarasmodin@gmail.com>
      Cc: Michael Kelley <mhklinux@outlook.com>
      Cc: Nathan Chancellor <nathan@kernel.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Yang Li <yang.lee@linux.alibaba.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      85fcde40
• mm: vmalloc: refactor vmalloc_dump_obj() function · 8be4d46e
      Uladzislau Rezki (Sony) authored
This patch simplifies the function in question by removing the extra stack
variable "objp" and returning to an early-exit approach if spin_trylock()
fails or the VA is not found.
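
A hedged sketch of that early-exit shape; "vn" and the lookup below are
placeholders, not the actual body of vmalloc_dump_obj() from this patch:

static bool example_dump_obj(unsigned long addr)
{
	struct vmap_area *va;

	if (!spin_trylock(&vn->busy.lock))
		return false;		/* early exit: lock not taken */

	va = __find_vmap_area(addr, &vn->busy.root);
	if (!va) {
		spin_unlock(&vn->busy.lock);
		return false;		/* early exit: VA not found */
	}

	/* ... report the object while still holding the lock ... */
	spin_unlock(&vn->busy.lock);
	return true;
}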
      
Link: https://lkml.kernel.org/r/20240124180920.50725-2-urezki@gmail.com
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sony.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      8be4d46e
• mm: vmalloc: improve description of vmap node layer · 15e02a39
      Uladzislau Rezki (Sony) authored
      This patch adds extra explanation of recently added vmap node layer based
      on community feedback.  No functional change.
      
Link: https://lkml.kernel.org/r/20240124180920.50725-1-urezki@gmail.com
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sony.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      15e02a39
• mm: vmalloc: add a shrinker to drain vmap pools · 7679ba6b
      Uladzislau Rezki (Sony) authored
The added shrinker is used to return currently cached VAs to the global
vmap space when the system enters a low-memory mode.
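
A hedged sketch of how such a shrinker could be wired up; the count/scan
helpers and the symbols they touch are illustrative placeholders, not the
functions added by this patch:

#include <linux/shrinker.h>

static unsigned long example_vmap_pools_count(struct shrinker *sh,
					      struct shrink_control *sc)
{
	/* report how many cached vmap areas could be released */
	return example_cached_va_count() ?: SHRINK_EMPTY;
}

static unsigned long example_vmap_pools_scan(struct shrinker *sh,
					     struct shrink_control *sc)
{
	/* return cached VAs back to the global vmap space */
	return example_drain_cached_vas(sc->nr_to_scan);
}

static int __init example_vmap_shrinker_init(void)
{
	struct shrinker *sh = shrinker_alloc(0, "vmap-pools");

	if (!sh)
		return -ENOMEM;

	sh->count_objects = example_vmap_pools_count;
	sh->scan_objects = example_vmap_pools_scan;
	shrinker_register(sh);
	return 0;
}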
      
Link: https://lkml.kernel.org/r/20240102184633.748113-12-urezki@gmail.com
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
      Cc: Kazuhito Hagio <k-hagio-ab@nec.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
      Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
      Cc: Lorenzo Stoakes <lstoakes@gmail.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sony.com>
      Cc: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      7679ba6b
• mm: vmalloc: set nr_nodes based on CPUs in a system · 8f33a2ff
      Uladzislau Rezki (Sony) authored
The number of nodes used in the alloc/free paths is set based on
num_possible_cpus() in a system.  Please note that the upper limit is fixed
and corresponds to 128 nodes.

For 32-bit or single-core systems, access to the global vmap heap is not
balanced.  Such small systems do not suffer from lock contention due to the
low number of CPUs, so in that case nr_nodes is equal to 1.
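
A hedged sketch of that sizing rule (names are illustrative; the real
initialization code may differ):

/* Pick the number of vmap nodes from the CPU count, clamped to 128. */
#define EXAMPLE_MAX_NODES	128U

static unsigned int example_nr_nodes(void)
{
	/* single-core and tiny systems end up with one node */
	return clamp(num_possible_cpus(), 1U, EXAMPLE_MAX_NODES);
}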
      
      Test on AMD Ryzen Threadripper 3970X 32-Core Processor: sudo
      ./test_vmalloc.sh run_test_mask=7 nr_threads=64
      
      <default perf>
       94.41%     0.89%  [kernel]        [k] _raw_spin_lock
       93.35%    93.07%  [kernel]        [k] native_queued_spin_lock_slowpath
       76.13%     0.28%  [kernel]        [k] __vmalloc_node_range
       72.96%     0.81%  [kernel]        [k] alloc_vmap_area
       56.94%     0.00%  [kernel]        [k] __get_vm_area_node
       41.95%     0.00%  [kernel]        [k] vmalloc
       37.15%     0.01%  [test_vmalloc]  [k] full_fit_alloc_test
       35.17%     0.00%  [kernel]        [k] ret_from_fork_asm
       35.17%     0.00%  [kernel]        [k] ret_from_fork
       35.17%     0.00%  [kernel]        [k] kthread
       35.08%     0.00%  [test_vmalloc]  [k] test_func
       34.45%     0.00%  [test_vmalloc]  [k] fix_size_alloc_test
       28.09%     0.01%  [test_vmalloc]  [k] long_busy_list_alloc_test
       23.53%     0.25%  [kernel]        [k] vfree.part.0
       21.72%     0.00%  [kernel]        [k] remove_vm_area
       20.08%     0.21%  [kernel]        [k] find_unlink_vmap_area
        2.34%     0.61%  [kernel]        [k] free_vmap_area_noflush
      <default perf>
         vs
      <patch-series perf>
       82.32%     0.22%  [test_vmalloc]  [k] long_busy_list_alloc_test
       63.36%     0.02%  [kernel]        [k] vmalloc
       63.34%     2.64%  [kernel]        [k] __vmalloc_node_range
       30.42%     4.46%  [kernel]        [k] vfree.part.0
       28.98%     2.51%  [kernel]        [k] __alloc_pages_bulk
       27.28%     0.19%  [kernel]        [k] __get_vm_area_node
       26.13%     1.50%  [kernel]        [k] alloc_vmap_area
       21.72%    21.67%  [kernel]        [k] clear_page_rep
       19.51%     2.43%  [kernel]        [k] _raw_spin_lock
       16.61%    16.51%  [kernel]        [k] native_queued_spin_lock_slowpath
       13.40%     2.07%  [kernel]        [k] free_unref_page
       10.62%     0.01%  [kernel]        [k] remove_vm_area
        9.02%     8.73%  [kernel]        [k] insert_vmap_area
        8.94%     0.00%  [kernel]        [k] ret_from_fork_asm
        8.94%     0.00%  [kernel]        [k] ret_from_fork
        8.94%     0.00%  [kernel]        [k] kthread
        8.29%     0.00%  [test_vmalloc]  [k] test_func
        7.81%     0.05%  [test_vmalloc]  [k] full_fit_alloc_test
        5.30%     4.73%  [kernel]        [k] purge_vmap_node
        4.47%     2.65%  [kernel]        [k] free_vmap_area_noflush
      <patch-series perf>
      
This confirms that native_queued_spin_lock_slowpath goes down from 93.07%
to 16.51%.
      
      The throughput is ~12x higher:
      
      urezki@pc638:~$ time sudo ./test_vmalloc.sh run_test_mask=7 nr_threads=64
      Run the test with following parameters: run_test_mask=7 nr_threads=64
      Done.
      Check the kernel ring buffer to see the summary.
      
      real    10m51.271s
      user    0m0.013s
      sys     0m0.187s
      urezki@pc638:~$
      
      urezki@pc638:~$ time sudo ./test_vmalloc.sh run_test_mask=7 nr_threads=64
      Run the test with following parameters: run_test_mask=7 nr_threads=64
      Done.
      Check the kernel ring buffer to see the summary.
      
      real    0m51.301s
      user    0m0.015s
      sys     0m0.040s
      urezki@pc638:~$
      
Link: https://lkml.kernel.org/r/20240102184633.748113-11-urezki@gmail.com
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
      Cc: Kazuhito Hagio <k-hagio-ab@nec.com>
      Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
      Cc: Lorenzo Stoakes <lstoakes@gmail.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sony.com>
      Cc: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      8f33a2ff
• mm: vmalloc: support multiple nodes in vmallocinfo · 8e1d743f
      Uladzislau Rezki (Sony) authored
Allocated areas are spread among the nodes, which implies that scanning has
to be performed on each node individually in order to dump all existing
VAs.
      
Link: https://lkml.kernel.org/r/20240102184633.748113-10-urezki@gmail.com
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
      Cc: Kazuhito Hagio <k-hagio-ab@nec.com>
      Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
      Cc: Lorenzo Stoakes <lstoakes@gmail.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sony.com>
      Cc: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      8e1d743f
• mm: vmalloc: support multiple nodes in vread_iter · 53becf32
      Uladzislau Rezki (Sony) authored
Extend vread_iter() to be able to perform sequential reading of VAs that are
spread among multiple nodes, so that a data read over /dev/kmem correctly
reflects the vmalloc memory layout.
      
Link: https://lkml.kernel.org/r/20240102184633.748113-9-urezki@gmail.com
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Reviewed-by: Baoquan He <bhe@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
      Cc: Kazuhito Hagio <k-hagio-ab@nec.com>
      Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
      Cc: Lorenzo Stoakes <lstoakes@gmail.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sony.com>
      Cc: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      53becf32
• mm: vmalloc: add a scan area of VA only once · 96aa8437
      Uladzislau Rezki (Sony) authored
Invoke the kmemleak_scan_area() function only for newly allocated objects to
add a scan area within that object.  There is no reason to add the same scan
area (a pointer to the beginning of or inside the object) several times.  If
a VA is obtained from the cache, its scan area has already been associated.
      
      Link: https://lkml.kernel.org/r/20240202190628.47806-1-urezki@gmail.com
      Fixes: 7db166b4aa0d ("mm: vmalloc: offload free_vmap_area_lock lock")
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
      Cc: Kazuhito Hagio <k-hagio-ab@nec.com>
      Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sony.com>
      Cc: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      96aa8437
• mm: vmalloc: offload free_vmap_area_lock lock · 72210662
      Uladzislau Rezki (Sony) authored
Concurrent access to a global vmap space is a bottleneck.  We can simulate
high contention by running a vmalloc test suite.

To address it, introduce an effective vmap node logic.  Each node behaves as
an independent entity.  When a node is accessed, it serves a request directly
(if possible) from its pool.

This model has size-based pools for requests, i.e.  pools are serialized and
populated based on object size and real demand.  The maximum object size that
a pool can handle is set to 256 pages.

This technique reduces pressure on the global vmap lock.
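
A hedged sketch of a size-indexed per-node pool lookup, as implied above;
the structures and indexing here are illustrative and differ from the
patch's real data structures:

#define EXAMPLE_MAX_POOL_PAGES	256

struct example_pool {
	struct list_head head;	/* cached, ready-to-reuse vmap areas */
	unsigned long len;
};

struct example_node {
	spinlock_t lock;
	struct example_pool pool[EXAMPLE_MAX_POOL_PAGES];
};

/* Serve a request from the node's pool when a matching size is cached. */
static struct vmap_area *
example_pool_alloc(struct example_node *n, unsigned long size)
{
	unsigned int idx = (size >> PAGE_SHIFT) - 1;
	struct vmap_area *va = NULL;

	if (idx >= EXAMPLE_MAX_POOL_PAGES)
		return NULL;	/* too large: fall back to the global space */

	spin_lock(&n->lock);
	va = list_first_entry_or_null(&n->pool[idx].head,
				      struct vmap_area, list);
	if (va) {
		list_del_init(&va->list);
		n->pool[idx].len--;
	}
	spin_unlock(&n->lock);

	return va;
}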
      
Link: https://lkml.kernel.org/r/20240102184633.748113-8-urezki@gmail.com
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
      Cc: Kazuhito Hagio <k-hagio-ab@nec.com>
      Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
      Cc: Lorenzo Stoakes <lstoakes@gmail.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sony.com>
      Cc: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      72210662
• mm: vmalloc: remove global purge_vmap_area_root rb-tree · 282631cb
      Uladzislau Rezki (Sony) authored
Similar to busy VAs, a lazily-freed area is stored in the node it belongs
to.  Such an approach does not require any global locking primitive; instead,
access becomes scalable, which mitigates contention.
      
      This patch removes a global purge-lock, global purge-tree and global purge
      list.
      
Link: https://lkml.kernel.org/r/20240102184633.748113-7-urezki@gmail.com
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Reviewed-by: Baoquan He <bhe@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
      Cc: Kazuhito Hagio <k-hagio-ab@nec.com>
      Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
      Cc: Lorenzo Stoakes <lstoakes@gmail.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sony.com>
      Cc: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      282631cb
• mm/vmalloc: remove vmap_area_list · 55c49fee
      Baoquan He authored
Earlier, vmap_area_list was exported to vmcoreinfo so that makedumpfile could
get the base address of the vmalloc area.  Now vmap_area_list is empty, so
export VMALLOC_START to vmcoreinfo instead, and remove vmap_area_list.
      
      [urezki@gmail.com: fix a warning in the crash_save_vmcoreinfo_init()]
        Link: https://lkml.kernel.org/r/20240111192329.449189-1-urezki@gmail.com
Link: https://lkml.kernel.org/r/20240102184633.748113-6-urezki@gmail.com
Signed-off-by: Baoquan He <bhe@redhat.com>
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Acked-by: Lorenzo Stoakes <lstoakes@gmail.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
      Cc: Kazuhito Hagio <k-hagio-ab@nec.com>
      Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sony.com>
      Cc: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      55c49fee
• mm: vmalloc: remove global vmap_area_root rb-tree · d0936029
      Uladzislau Rezki (Sony) authored
Store allocated objects in separate nodes.  A va->va_start address is
converted into the correct node where it should be placed and reside.  The
addr_to_node() function is used to do the address conversion and determine
the node that contains a VA.

Such an approach balances VAs across the nodes; as a result, access becomes
scalable.  The number of nodes in a system depends on the number of CPUs.
      
      Please note:
      
1. As of now allocated VAs are bound to node-0. It means the
   patch does not make any difference compared with the current
   behavior;
      
2. The global vmap_area_lock and vmap_area_root are removed as there
   is no need for them anymore. The vmap_area_list is still kept and
   is _empty_. It is exported for kexec only;
      
      3. The vmallocinfo and vread() have to be reworked to be able to
         handle multiple nodes.
      
      [urezki@gmail.com: mark vmap_init_free_space() with __init tag]
        Link: https://lkml.kernel.org/r/20240111132628.299644-1-urezki@gmail.com
      [urezki@gmail.com: fix a wrong value passed to __find_vmap_area()]
        Link: https://lkml.kernel.org/r/20240111121104.180993-1-urezki@gmail.com
Link: https://lkml.kernel.org/r/20240102184633.748113-5-urezki@gmail.com
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Reviewed-by: Baoquan He <bhe@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
      Cc: Kazuhito Hagio <k-hagio-ab@nec.com>
      Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sony.com>
      Cc: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      d0936029
• mm: vmalloc: move vmap_init_free_space() down in vmalloc.c · 7fa8cee0
      Uladzislau Rezki (Sony) authored
vmap_init_free_space() is a function that sets up the vmap space and is
considered part of the initialization phase.  Since the main entry point,
vmalloc_init(), has been moved down in vmalloc.c, it makes sense to follow
the same pattern.

There is no functional change as a result of this patch.
      
Link: https://lkml.kernel.org/r/20240102184633.748113-4-urezki@gmail.com
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Reviewed-by: Baoquan He <bhe@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
      Cc: Kazuhito Hagio <k-hagio-ab@nec.com>
      Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sony.com>
      Cc: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      7fa8cee0
• mm: vmalloc: rename adjust_va_to_fit_type() function · 5b75b8e1
      Uladzislau Rezki (Sony) authored
This patch renames the adjust_va_to_fit_type() function to va_clip(), which
is shorter and more expressive.

There is no functional change as a result of this patch.
      
Link: https://lkml.kernel.org/r/20240102184633.748113-3-urezki@gmail.com
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Reviewed-by: Baoquan He <bhe@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
      Cc: Kazuhito Hagio <k-hagio-ab@nec.com>
      Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sony.com>
      Cc: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      5b75b8e1
• mm: vmalloc: add va_alloc() helper · 38f6b9af
      Uladzislau Rezki (Sony) authored
      Patch series "Mitigate a vmap lock contention", v3.
      
      1. Motivation
      
- Offload the global vmap locks, making them scale with the number of CPUs;

- If possible, and there is an agreement, we can remove the "Per cpu kva
  allocator" to make the vmap code simpler;

- There were complaints from XFS folk that a vmalloc might be contended
  on their workloads.
      
      2. Design(high level overview)
      
We introduce an effective vmap node logic.  A node behaves as an independent
entity serving an allocation request directly (if possible) from its pool.
That way it bypasses the global vmap space that is protected by its own
lock.

Access to the pools is serialized by CPUs.  The number of nodes is equal to
the number of CPUs in a system.  Please note the upper threshold is bound to
128 nodes.
      
Pools are size segregated and populated based on system demand.  The maximum
alloc request that can be stored in a segregated storage is 256 pages.  The
lazy drain path decays a pool by 25% as a first step, and as a second step
populates it with freshly freed VAs for reuse, instead of returning them to
the global space.
      
When a VA is obtained (alloc path), it is stored in a separate node.  A
va->va_start address is converted into the correct node where it should be
placed and reside.  Doing so, we balance VAs across the nodes; as a result,
access becomes scalable.  The addr_to_node() function does the address
conversion to the correct node.
      
The vmap space is divided into segments of a fixed size of 16 pages.  That
way any address can be associated with a segment number.  The number of
segments is equal to num_possible_cpus(), but not greater than 128.  The
numbering starts from 0.  See below how an address is converted:
      
      static inline unsigned int
      addr_to_node_id(unsigned long addr)
      {
      	return (addr / zone_size) % nr_nodes;
      }
      
On the free path, a VA can easily be found by converting its "va_start"
address to the node it resides in.  It is moved from the "busy" data
structure to the "lazy" data structure.  Later on, as noted earlier, the lazy
kworker decays each node's pool and populates it with fresh incoming VAs.
Please note, a VA is returned to the node that did the alloc request.
      
      3. Test on AMD Ryzen Threadripper 3970X 32-Core Processor
      
      sudo ./test_vmalloc.sh run_test_mask=7 nr_threads=64
      
      <default perf>
       94.41%     0.89%  [kernel]        [k] _raw_spin_lock
       93.35%    93.07%  [kernel]        [k] native_queued_spin_lock_slowpath
       76.13%     0.28%  [kernel]        [k] __vmalloc_node_range
       72.96%     0.81%  [kernel]        [k] alloc_vmap_area
       56.94%     0.00%  [kernel]        [k] __get_vm_area_node
       41.95%     0.00%  [kernel]        [k] vmalloc
       37.15%     0.01%  [test_vmalloc]  [k] full_fit_alloc_test
       35.17%     0.00%  [kernel]        [k] ret_from_fork_asm
       35.17%     0.00%  [kernel]        [k] ret_from_fork
       35.17%     0.00%  [kernel]        [k] kthread
       35.08%     0.00%  [test_vmalloc]  [k] test_func
       34.45%     0.00%  [test_vmalloc]  [k] fix_size_alloc_test
       28.09%     0.01%  [test_vmalloc]  [k] long_busy_list_alloc_test
       23.53%     0.25%  [kernel]        [k] vfree.part.0
       21.72%     0.00%  [kernel]        [k] remove_vm_area
       20.08%     0.21%  [kernel]        [k] find_unlink_vmap_area
        2.34%     0.61%  [kernel]        [k] free_vmap_area_noflush
      <default perf>
         vs
      <patch-series perf>
       82.32%     0.22%  [test_vmalloc]  [k] long_busy_list_alloc_test
       63.36%     0.02%  [kernel]        [k] vmalloc
       63.34%     2.64%  [kernel]        [k] __vmalloc_node_range
       30.42%     4.46%  [kernel]        [k] vfree.part.0
       28.98%     2.51%  [kernel]        [k] __alloc_pages_bulk
       27.28%     0.19%  [kernel]        [k] __get_vm_area_node
       26.13%     1.50%  [kernel]        [k] alloc_vmap_area
       21.72%    21.67%  [kernel]        [k] clear_page_rep
       19.51%     2.43%  [kernel]        [k] _raw_spin_lock
       16.61%    16.51%  [kernel]        [k] native_queued_spin_lock_slowpath
       13.40%     2.07%  [kernel]        [k] free_unref_page
       10.62%     0.01%  [kernel]        [k] remove_vm_area
        9.02%     8.73%  [kernel]        [k] insert_vmap_area
        8.94%     0.00%  [kernel]        [k] ret_from_fork_asm
        8.94%     0.00%  [kernel]        [k] ret_from_fork
        8.94%     0.00%  [kernel]        [k] kthread
        8.29%     0.00%  [test_vmalloc]  [k] test_func
        7.81%     0.05%  [test_vmalloc]  [k] full_fit_alloc_test
        5.30%     4.73%  [kernel]        [k] purge_vmap_node
        4.47%     2.65%  [kernel]        [k] free_vmap_area_noflush
      <patch-series perf>
      
This confirms that native_queued_spin_lock_slowpath goes down from 93.07%
to 16.51%.
      
      The throughput is ~12x higher:
      
      urezki@pc638:~$ time sudo ./test_vmalloc.sh run_test_mask=7 nr_threads=64
      Run the test with following parameters: run_test_mask=7 nr_threads=64
      Done.
      Check the kernel ring buffer to see the summary.
      
      real    10m51.271s
      user    0m0.013s
      sys     0m0.187s
      urezki@pc638:~$
      
      urezki@pc638:~$ time sudo ./test_vmalloc.sh run_test_mask=7 nr_threads=64
      Run the test with following parameters: run_test_mask=7 nr_threads=64
      Done.
      Check the kernel ring buffer to see the summary.
      
      real    0m51.301s
      user    0m0.015s
      sys     0m0.040s
      urezki@pc638:~$
      
      
      This patch (of 11):
      
Currently the __alloc_vmap_area() function contains open-coded logic that
finds and adjusts a VA based on the allocation request.

Introduce a va_alloc() helper that adjusts a found VA only.  There is no
functional change as a result of this patch.
      
      Link: https://lkml.kernel.org/r/20240102184633.748113-1-urezki@gmail.com
Link: https://lkml.kernel.org/r/20240102184633.748113-2-urezki@gmail.com
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Reviewed-by: Baoquan He <bhe@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
      Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sony.com>
      Cc: Paul E. McKenney <paulmck@kernel.org>
      Cc: Kazuhito Hagio <k-hagio-ab@nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      38f6b9af
• mm,page_owner: update Documentation regarding page_owner_stacks · ba6fe537
      Oscar Salvador authored
      Update page_owner documentation including the new page_owner_stacks
      feature to show how it can be used.
      
Link: https://lkml.kernel.org/r/20240215215907.20121-8-osalvador@suse.de
Signed-off-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Marco Elver <elver@google.com>
Acked-by: Andrey Konovalov <andreyknvl@gmail.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      ba6fe537
• mm,page_owner: filter out stacks by a threshold · 05bb6f4e
      Oscar Salvador authored
We want to be able to filter out the stacks based on a threshold we can
tune.  By writing to the 'count_threshold' file, we can adjust the threshold
value.
      
Link: https://lkml.kernel.org/r/20240215215907.20121-7-osalvador@suse.de
Signed-off-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Andrey Konovalov <andreyknvl@gmail.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Marco Elver <elver@google.com>
      Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      05bb6f4e
• mm,page_owner: display all stacks and their count · 765973a0
      Oscar Salvador authored
This patch adds a new directory called 'page_owner_stacks' under
/sys/kernel/debug/, with a file called 'show_stacks' in it.  Reading from
that file will show all stacks that were added by page_owner, followed by
their count, giving us a clear overview of the stack <-> count
relationship.
      
      E.g:
      
        prep_new_page+0xa9/0x120
        get_page_from_freelist+0x801/0x2210
        __alloc_pages+0x18b/0x350
        alloc_pages_mpol+0x91/0x1f0
        folio_alloc+0x14/0x50
        filemap_alloc_folio+0xb2/0x100
        __filemap_get_folio+0x14a/0x490
        ext4_write_begin+0xbd/0x4b0 [ext4]
        generic_perform_write+0xc1/0x1e0
        ext4_buffered_write_iter+0x68/0xe0 [ext4]
        ext4_file_write_iter+0x70/0x740 [ext4]
        vfs_write+0x33d/0x420
        ksys_write+0xa5/0xe0
        do_syscall_64+0x80/0x160
        entry_SYSCALL_64_after_hwframe+0x6e/0x76
       stack_count: 4578
      
      The seq stack_{start,next} functions will iterate through the list
      stack_list in order to print all stacks.
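
A hedged sketch of the seq_file plumbing this implies; the iterator bodies
below assume a list_head-based stack_list and are illustrative rather than
the exact code from the patch:

#include <linux/seq_file.h>

static void *stack_start(struct seq_file *m, loff_t *ppos)
{
	return seq_list_start(&stack_list, *ppos);
}

static void *stack_next(struct seq_file *m, void *v, loff_t *ppos)
{
	return seq_list_next(v, &stack_list, ppos);
}

static void stack_stop(struct seq_file *m, void *v)
{
}

static int stack_print(struct seq_file *m, void *v)
{
	/* print the stack trace followed by its current count */
	return 0;
}

static const struct seq_operations example_stack_ops = {
	.start	= stack_start,
	.next	= stack_next,
	.stop	= stack_stop,
	.show	= stack_print,
};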
      
Link: https://lkml.kernel.org/r/20240215215907.20121-6-osalvador@suse.de
Signed-off-by: Oscar Salvador <osalvador@suse.de>
Acked-by: Marco Elver <elver@google.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Andrey Konovalov <andreyknvl@gmail.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      765973a0
• mm,page_owner: implement the tracking of the stacks count · 217b2119
      Oscar Salvador authored
      Implement {inc,dec}_stack_record_count() which increments or decrements on
      respective allocation and free operations, via __reset_page_owner() (free
      operation) and __set_page_owner() (alloc operation).
      
      Newly allocated stack_record structs will be added to the list stack_list
      via add_stack_record_to_list().  Modifications on the list are protected
      via a spinlock with irqs disabled, since this code can also be reached
      from IRQ context.
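
A hedged sketch of that locking pattern; the list, lock and record type
below are illustrative placeholders rather than the names used by the patch:

#include <linux/list.h>
#include <linux/refcount.h>
#include <linux/spinlock.h>

static LIST_HEAD(example_stack_list);
static DEFINE_SPINLOCK(example_stack_list_lock);

struct example_stack_record {
	struct list_head list;
	refcount_t count;
};

static void example_add_stack_record_to_list(struct example_stack_record *rec)
{
	unsigned long flags;

	/* IRQs disabled: this path can also be reached from IRQ context */
	spin_lock_irqsave(&example_stack_list_lock, flags);
	list_add_tail(&rec->list, &example_stack_list);
	spin_unlock_irqrestore(&example_stack_list_lock, flags);
}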
      
Link: https://lkml.kernel.org/r/20240215215907.20121-5-osalvador@suse.de
Signed-off-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Marco Elver <elver@google.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Andrey Konovalov <andreyknvl@gmail.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      217b2119
• mm,page_owner: maintain own list of stack_records structs · 4bedfb31
      Oscar Salvador authored
      page_owner needs to increment a stack_record refcount when a new
      allocation occurs, and decrement it on a free operation.  In order to do
      that, we need to have a way to get a stack_record from a handle. 
      Implement __stack_depot_get_stack_record() which just does that, and make
      it public so page_owner can use it.
      
      Also, traversing all stackdepot buckets comes with its own complexity,
      plus we would have to implement a way to mark only those stack_records
      that were originated from page_owner, as those are the ones we are
      interested in.  For that reason, page_owner maintains its own list of
      stack_records, because traversing that list is faster than traversing all
      buckets while keeping at the same time a low complexity.
      
For now, add to stack_list only the stack_records of dummy_handle and
failure_handle, and set their refcount to 1.
      
      Further patches will add code to increment or decrement stack_records
      count on allocation and free operation.
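
A hedged sketch of how page_owner might use the new helper; only
__stack_depot_get_stack_record() comes from the text above, while the
refcount handling around it is an illustrative assumption:

#include <linux/refcount.h>
#include <linux/stackdepot.h>

/* Bump the count of the stack behind a depot handle on allocation. */
static void example_inc_stack_count(depot_stack_handle_t handle)
{
	struct stack_record *rec;

	rec = __stack_depot_get_stack_record(handle);
	if (!rec)
		return;

	refcount_inc(&rec->count);
}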
      
Link: https://lkml.kernel.org/r/20240215215907.20121-4-osalvador@suse.de
Signed-off-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Marco Elver <elver@google.com>
Acked-by: Andrey Konovalov <andreyknvl@gmail.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      4bedfb31
• lib/stackdepot: move stack_record struct definition into the header · 8151c7a3
      Oscar Salvador authored
      In order to move the heavy lifting into page_owner code, this one needs to
      have access to the stack_record structure, which right now sits in
      lib/stackdepot.c.  Move it to the stackdepot.h header so page_owner can
      access stack_record's struct fields.
      
Link: https://lkml.kernel.org/r/20240215215907.20121-3-osalvador@suse.de
Signed-off-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Marco Elver <elver@google.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Andrey Konovalov <andreyknvl@gmail.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      8151c7a3
• lib/stackdepot: fix first entry having a 0-handle · 3ee34eab
      Oscar Salvador authored
      Patch series "page_owner: print stacks and their outstanding allocations",
      v10.
      
page_owner is a great debugging tool that lets us know about all pages that
have been allocated/freed and their specific stacktrace.  This comes in very
handy when debugging memory leaks, since with some scripting we can see the
outstanding allocations, which might point to a memory leak.
      
In my experience, that is one of the most useful cases, but it can get
really tedious to screen through all pages and try to reconstruct the
stack <-> allocated/freed relationship, which most of the time becomes a
daunting and slow process when we have tons of allocation/free operations.
      
This patchset aims to ease that by adding a new functionality into
page_owner.  This functionality creates a new directory called
'page_owner_stacks' under '/sys/kernel/debug' with a read-only file called
'show_stacks', which prints out all the stacks followed by their outstanding
number of allocations (that being the number of times the stacktrace has
allocated but not yet freed).  This gives us a clear and quick overview of
stacks <-> allocated/free.
      
We take advantage of the new refcount_t field that the stack_record struct
gained, and increment/decrement the stack refcount on every
__set_page_owner() (alloc operation) and __reset_page_owner() (free
operation) call.
      
Unfortunately, we cannot use the new stackdepot api STACK_DEPOT_FLAG_GET
because it does not fulfill page_owner's needs, meaning we would have to
special-case things, at which point it makes more sense for page_owner to do
its own {inc,dec}rementing of the stacks.  E.g.: using stack_depot_put(),
once the refcount reaches 0 such a stack gets evicted, so page_owner would
lose information.
      
This patchset also creates a new file called 'set_threshold' within the
'page_owner_stacks' directory; by writing a value to it, the stacks whose
refcount is below that value will be filtered out.
      
      A PoC can be found below:
      
       # cat /sys/kernel/debug/page_owner_stacks/show_stacks > page_owner_full_stacks.txt
       # head -40 page_owner_full_stacks.txt 
        prep_new_page+0xa9/0x120
        get_page_from_freelist+0x801/0x2210
        __alloc_pages+0x18b/0x350
        alloc_pages_mpol+0x91/0x1f0
        folio_alloc+0x14/0x50
        filemap_alloc_folio+0xb2/0x100
        page_cache_ra_unbounded+0x96/0x180
        filemap_get_pages+0xfd/0x590
        filemap_read+0xcc/0x330
        blkdev_read_iter+0xb8/0x150
        vfs_read+0x285/0x320
        ksys_read+0xa5/0xe0
        do_syscall_64+0x80/0x160
        entry_SYSCALL_64_after_hwframe+0x6e/0x76
       stack_count: 521
      
      
      
        prep_new_page+0xa9/0x120
        get_page_from_freelist+0x801/0x2210
        __alloc_pages+0x18b/0x350
        alloc_pages_mpol+0x91/0x1f0
        folio_alloc+0x14/0x50
        filemap_alloc_folio+0xb2/0x100
        __filemap_get_folio+0x14a/0x490
        ext4_write_begin+0xbd/0x4b0 [ext4]
        generic_perform_write+0xc1/0x1e0
        ext4_buffered_write_iter+0x68/0xe0 [ext4]
        ext4_file_write_iter+0x70/0x740 [ext4]
        vfs_write+0x33d/0x420
        ksys_write+0xa5/0xe0
        do_syscall_64+0x80/0x160
        entry_SYSCALL_64_after_hwframe+0x6e/0x76
       stack_count: 4609
      ...
      ...
      
       # echo 5000 > /sys/kernel/debug/page_owner_stacks/set_threshold 
       # cat /sys/kernel/debug/page_owner_stacks/show_stacks > page_owner_full_stacks_5000.txt
       # head -40 page_owner_full_stacks_5000.txt 
        prep_new_page+0xa9/0x120
        get_page_from_freelist+0x801/0x2210
        __alloc_pages+0x18b/0x350
        alloc_pages_mpol+0x91/0x1f0
        folio_alloc+0x14/0x50
        filemap_alloc_folio+0xb2/0x100
        __filemap_get_folio+0x14a/0x490
        ext4_write_begin+0xbd/0x4b0 [ext4]
        generic_perform_write+0xc1/0x1e0
        ext4_buffered_write_iter+0x68/0xe0 [ext4]
        ext4_file_write_iter+0x70/0x740 [ext4]
        vfs_write+0x33d/0x420
        ksys_pwrite64+0x75/0x90
        do_syscall_64+0x80/0x160
        entry_SYSCALL_64_after_hwframe+0x6e/0x76
       stack_count: 6781
      
      
      
        prep_new_page+0xa9/0x120
        get_page_from_freelist+0x801/0x2210
        __alloc_pages+0x18b/0x350
        pcpu_populate_chunk+0xec/0x350
        pcpu_balance_workfn+0x2d1/0x4a0
        process_scheduled_works+0x84/0x380
        worker_thread+0x12a/0x2a0
        kthread+0xe3/0x110
        ret_from_fork+0x30/0x50
        ret_from_fork_asm+0x1b/0x30
       stack_count: 8641
      
      
      This patch (of 7):
      
The very first entry of stack_record gets a handle of 0, but this is wrong
because stackdepot treats a 0-handle as a non-valid one.  E.g. see the check
in stack_depot_fetch().

Fix this by adding an offset of 1.
      
This bug has been lurking since the very beginning of stackdepot, but it
seems no one really cared.  Because of that, I am not adding a Fixes
tag.
      
      Link: https://lkml.kernel.org/r/20240215215907.20121-1-osalvador@suse.de
Link: https://lkml.kernel.org/r/20240215215907.20121-2-osalvador@suse.de
Co-developed-by: Marco Elver <elver@google.com>
Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Oscar Salvador <osalvador@suse.de>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Andrey Konovalov <andreyknvl@gmail.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      3ee34eab
• mm/debug_vm_pgtable: fix BUG_ON with pud advanced test · 720da1e5
      Aneesh Kumar K.V (IBM) authored
Architectures like powerpc add debug checks to ensure we find only devmap
PUD pte entries.  These debug checks are only done with CONFIG_DEBUG_VM.
This patch marks the ptes used for the PUD advanced test as devmap pte
entries so that we don't hit the debug checks on architectures like ppc64,
as below.
      
      WARNING: CPU: 2 PID: 1 at arch/powerpc/mm/book3s64/radix_pgtable.c:1382 radix__pud_hugepage_update+0x38/0x138
      ....
      NIP [c0000000000a7004] radix__pud_hugepage_update+0x38/0x138
      LR [c0000000000a77a8] radix__pudp_huge_get_and_clear+0x28/0x60
      Call Trace:
      [c000000004a2f950] [c000000004a2f9a0] 0xc000000004a2f9a0 (unreliable)
      [c000000004a2f980] [000d34c100000000] 0xd34c100000000
      [c000000004a2f9a0] [c00000000206ba98] pud_advanced_tests+0x118/0x334
      [c000000004a2fa40] [c00000000206db34] debug_vm_pgtable+0xcbc/0x1c48
      [c000000004a2fc10] [c00000000000fd28] do_one_initcall+0x60/0x388
      
      Also
      
       kernel BUG at arch/powerpc/mm/book3s64/pgtable.c:202!
       ....
      
       NIP [c000000000096510] pudp_huge_get_and_clear_full+0x98/0x174
       LR [c00000000206bb34] pud_advanced_tests+0x1b4/0x334
       Call Trace:
       [c000000004a2f950] [000d34c100000000] 0xd34c100000000 (unreliable)
       [c000000004a2f9a0] [c00000000206bb34] pud_advanced_tests+0x1b4/0x334
       [c000000004a2fa40] [c00000000206db34] debug_vm_pgtable+0xcbc/0x1c48
       [c000000004a2fc10] [c00000000000fd28] do_one_initcall+0x60/0x388
      
      Link: https://lkml.kernel.org/r/20240129060022.68044-1-aneesh.kumar@kernel.org
      Fixes: 27af67f3 ("powerpc/book3s64/mm: enable transparent pud hugepage")
Signed-off-by: Aneesh Kumar K.V (IBM) <aneesh.kumar@kernel.org>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      720da1e5
• mm: cachestat: fix folio read-after-free in cache walk · 3a75cb05
      Nhat Pham authored
      In cachestat, we access the folio from the page cache's xarray to compute
      its page offset, and check for its dirty and writeback flags.  However, we
      do not hold a reference to the folio before performing these actions,
      which means the folio can concurrently be released and reused as another
      folio/page/slab.
      
      Get around this altogether by just using xarray's existing machinery for
      the folio page offsets and dirty/writeback states.
      
      This changes behavior for tmpfs files to now always report zeroes in their
      dirty and writeback counters.  This is okay as tmpfs doesn't follow
      conventional writeback cache behavior: its pages get "cleaned" during
      swapout, after which they're no longer resident etc.
      
      Link: https://lkml.kernel.org/r/20240220153409.GA216065@cmpxchg.org
      Fixes: cf264e13 ("cachestat: implement cachestat syscall")
Reported-by: Jann Horn <jannh@google.com>
Suggested-by: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Nhat Pham <nphamcs@gmail.com>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Tested-by: Jann Horn <jannh@google.com>
      Cc: <stable@vger.kernel.org>	[6.4+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      3a75cb05
• MAINTAINERS: add memory mapping entry with reviewers · 00130266
      Lorenzo Stoakes authored
      Recently there have been a number of patches affecting various aspects of
      the memory mapping logic implemented in mm/mmap.c, for which it would have
      been useful to notify regular contributors.
      
      Add an entry for this part of mm, with regular contributors listed as
      reviewers.
      
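      The entry follows the usual MAINTAINERS format; a sketch of its shape is
      below (the names are placeholders, the actual maintainer and reviewer
      lines are in the patch itself):
      
        MEMORY MAPPING
        M:      <maintainer>
        R:      <regular contributor 1>
        R:      <regular contributor 2>
        L:      linux-mm@kvack.org
        S:      Maintained
        F:      mm/mmap.c
      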
      Link: https://lkml.kernel.org/r/20240220064410.4639-1-lstoakes@gmail.com
      Signed-off-by: default avatarLorenzo Stoakes <lstoakes@gmail.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Acked-by: default avatarLiam R. Howlett <Liam.Howlett@oracle.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      00130266
    • Byungchul Park's avatar
      mm/vmscan: fix a bug calling wakeup_kswapd() with a wrong zone index · 2774f256
      Byungchul Park authored
      With NUMA balancing on, when a NUMA system is running where a NUMA node
      doesn't have its local memory, and therefore has no managed zones, the
      following oops has been observed.  It happens because wakeup_kswapd() is
      called with an invalid zone index, -1.  Fix it by checking the index
      before calling wakeup_kswapd() (see the sketch after the trace below).
      
      > BUG: unable to handle page fault for address: 00000000000033f3
      > #PF: supervisor read access in kernel mode
      > #PF: error_code(0x0000) - not-present page
      > PGD 0 P4D 0
      > Oops: 0000 [#1] PREEMPT SMP NOPTI
      > CPU: 2 PID: 895 Comm: masim Not tainted 6.6.0-dirty #255
      > Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
      >    rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
      > RIP: 0010:wakeup_kswapd (./linux/mm/vmscan.c:7812)
      > Code: (omitted)
      > RSP: 0000:ffffc90004257d58 EFLAGS: 00010286
      > RAX: ffffffffffffffff RBX: ffff88883fff0480 RCX: 0000000000000003
      > RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff88883fff0480
      > RBP: ffffffffffffffff R08: ff0003ffffffffff R09: ffffffffffffffff
      > R10: ffff888106c95540 R11: 0000000055555554 R12: 0000000000000003
      > R13: 0000000000000000 R14: 0000000000000000 R15: ffff88883fff0940
      > FS:  00007fc4b8124740(0000) GS:ffff888827c00000(0000) knlGS:0000000000000000
      > CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      > CR2: 00000000000033f3 CR3: 000000026cc08004 CR4: 0000000000770ee0
      > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      > DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      > PKRU: 55555554
      > Call Trace:
      >  <TASK>
      > ? __die
      > ? page_fault_oops
      > ? __pte_offset_map_lock
      > ? exc_page_fault
      > ? asm_exc_page_fault
      > ? wakeup_kswapd
      > migrate_misplaced_page
      > __handle_mm_fault
      > handle_mm_fault
      > do_user_addr_fault
      > exc_page_fault
      > asm_exc_page_fault
      > RIP: 0033:0x55b897ba0808
      > Code: (omitted)
      > RSP: 002b:00007ffeefa821a0 EFLAGS: 00010287
      > RAX: 000055b89983acd0 RBX: 00007ffeefa823f8 RCX: 000055b89983acd0
      > RDX: 00007fc2f8122010 RSI: 0000000000020000 RDI: 000055b89983acd0
      > RBP: 00007ffeefa821a0 R08: 0000000000000037 R09: 0000000000000075
      > R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000000
      > R13: 00007ffeefa82410 R14: 000055b897ba5dd8 R15: 00007fc4b8340000
      >  </TASK>
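      
      A minimal sketch of the check, assuming the call site that picks the
      highest managed zone of the target node before waking kswapd (simplified;
      the exact function and its surrounding code are abbreviated here):
      
        int z;
      
        /* Find the highest managed zone on the target node. */
        for (z = pgdat->nr_zones - 1; z >= 0; z--) {
            if (managed_zone(pgdat->node_zones + z))
                break;
        }
      
        /*
         * A node with no managed zones leaves z at -1, so
         * pgdat->node_zones + z would point outside the array; bail out
         * instead of calling wakeup_kswapd() with a bogus zone.
         */
        if (z < 0)
            return 0;
      
        wakeup_kswapd(pgdat->node_zones + z, 0, folio_order(folio),
                      ZONE_MOVABLE);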
      
      Link: https://lkml.kernel.org/r/20240216111502.79759-1-byungchul@sk.com
      Signed-off-by: default avatarByungchul Park <byungchul@sk.com>
      Reported-by: default avatarHyeongtak Ji <hyeongtak.ji@sk.com>
      Fixes: c574bbe9 ("NUMA balancing: optimize page placement for memory tiering system")
      Reviewed-by: default avatarOscar Salvador <osalvador@suse.de>
      Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      2774f256
    • Marco Elver's avatar
      kasan: revert eviction of stack traces in generic mode · 711d3491
      Marco Elver authored
      This partially reverts commits cc478e0b, 63b85ac5, 08d7c94d,
      a414d428, and 773688a6 to make use of variable-sized stack depot
      records, since eviction of stack entries from stack depot forces fixed-
      sized stack records.  Care was taken to retain the code cleanups by the
      above commits.
      
      Eviction was added to generic KASAN to alleviate the additional memory
      usage caused by fixed-sized stack records, but even with eviction the
      memory usage remained higher than before.
      
      With the re-introduction of variable-sized records for stack depot, we can
      just switch back to non-evictable stack records again, and return back to
      the previous performance and memory usage baseline.
      
      Before (observed after a KASAN kernel boot):
      
        pools: 597
        refcounted_allocations: 17547
        refcounted_frees: 6477
        refcounted_in_use: 11070
        freelist_size: 3497
        persistent_count: 12163
        persistent_bytes: 1717008
      
      After:
      
        pools: 319
        refcounted_allocations: 0
        refcounted_frees: 0
        refcounted_in_use: 0
        freelist_size: 0
        persistent_count: 29397
        persistent_bytes: 5183536
      
      As can be seen from the counters, with a generic KASAN config, refcounted
      allocations and evictions are no longer used.  Due to using variable-sized
      records, I observe a reduction of 278 stack depot pools (saving 4448 KiB)
      with my test setup.
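      
      In terms of the stack depot API, the switch back amounts to generic KASAN
      no longer requesting evictable (refcounted) records; a hedged sketch of
      the two variants, using the existing stack_depot_save_flags() interface:
      
        /* With eviction (reverted): the record is refcounted and must be
         * released with stack_depot_put() when the track is reset. */
        handle = stack_depot_save_flags(entries, nr_entries, gfp_flags,
                                        STACK_DEPOT_FLAG_CAN_ALLOC |
                                        STACK_DEPOT_FLAG_GET);
      
        /* Without eviction (this patch): a plain non-evictable record,
         * no refcount and no stack_depot_put() on object free/reuse. */
        handle = stack_depot_save_flags(entries, nr_entries, gfp_flags,
                                        STACK_DEPOT_FLAG_CAN_ALLOC);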
      
      Link: https://lkml.kernel.org/r/20240129100708.39460-2-elver@google.com
      Fixes: cc478e0b ("kasan: avoid resetting aux_lock")
      Fixes: 63b85ac5 ("kasan: stop leaking stack trace handles")
      Fixes: 08d7c94d ("kasan: memset free track in qlink_free")
      Fixes: a414d428 ("kasan: handle concurrent kasan_record_aux_stack calls")
      Fixes: 773688a6 ("kasan: use stack_depot_put for Generic mode")
      Signed-off-by: default avatarMarco Elver <elver@google.com>
      Reviewed-by: default avatarAndrey Konovalov <andreyknvl@gmail.com>
      Tested-by: default avatarMikhail Gavrilov <mikhail.v.gavrilov@gmail.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      711d3491
    • Marco Elver's avatar
      stackdepot: use variable size records for non-evictable entries · 31639fd6
      Marco Elver authored
      With the introduction of stack depot evictions, each stack record is now
      fixed size, so that future reuse after an eviction can safely store
      differently sized stack traces.  In all cases that do not make use of
      evictions, this wastes lots of space.
      
      Fix it by re-introducing variable size stack records (up to the max
      allowed size) for entries that will never be evicted.  An entry will never
      be evicted if the flag STACK_DEPOT_FLAG_GET was not provided when it was
      saved, since a later stack_depot_put() on such an entry is undefined
      behavior.
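      
      A hedged sketch of the sizing decision described above (alloc_record()
      and its arguments are hypothetical stand-ins for the internal allocator,
      not the actual lib/stackdepot.c code):
      
        if (depot_flags & STACK_DEPOT_FLAG_GET) {
            /*
             * Evictable: the slot may later be reused for a different,
             * possibly longer trace, so reserve the maximum size.
             */
            record = alloc_record(entries, CONFIG_STACKDEPOT_MAX_FRAMES, hash);
        } else {
            /*
             * Non-evictable (no later stack_depot_put()): size the
             * record to this trace only, up to the maximum.
             */
            record = alloc_record(entries,
                                  min_t(u32, nr_entries,
                                        CONFIG_STACKDEPOT_MAX_FRAMES),
                                  hash);
        }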
      
      With my current kernel config that enables KASAN and also SLUB owner
      tracking, I observe (after a kernel boot) a whopping reduction of 296
      stack depot pools, which translates into 4736 KiB saved.  The savings here
      are from SLUB owner tracking only, because KASAN generic mode still uses
      refcounting.
      
      Before:
      
        pools: 893
        allocations: 29841
        frees: 6524
        in_use: 23317
        freelist_size: 3454
      
      After:
      
        pools: 597
        refcounted_allocations: 17547
        refcounted_frees: 6477
        refcounted_in_use: 11070
        freelist_size: 3497
        persistent_count: 12163
        persistent_bytes: 1717008
      
      [elver@google.com: fix -Wstringop-overflow warning]
        Link: https://lore.kernel.org/all/20240201135747.18eca98e@canb.auug.org.au/
        Link: https://lkml.kernel.org/r/20240201090434.1762340-1-elver@google.com
        Link: https://lore.kernel.org/all/CABXGCsOzpRPZGg23QqJAzKnqkZPKzvieeg=W7sgjgi3q0pBo0g@mail.gmail.com/
      Link: https://lkml.kernel.org/r/20240129100708.39460-1-elver@google.com
      Fixes: 108be8de ("lib/stackdepot: allow users to evict stack traces")
      Signed-off-by: default avatarMarco Elver <elver@google.com>
      Reviewed-by: default avatarAndrey Konovalov <andreyknvl@gmail.com>
      Tested-by: default avatarMikhail Gavrilov <mikhail.v.gavrilov@gmail.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      31639fd6
  2. 22 Feb, 2024 8 commits