24 Feb, 2024 (40 commits)
    • mm/damon/core: split out quota goal related fields to a struct · 106e26fc
      SeongJae Park authored
      'struct damos_quota' is not small now.  Split out fields for quota goal to
      a separate struct for easier reading.
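
      For illustration only, the general shape of such a split (struct and field
      names here are hypothetical, not the upstream layout):

          /* Hypothetical illustration of the refactoring pattern only. */
          struct damos_quota_goal_fields {        /* hypothetical name */
                  unsigned long target_value;     /* hypothetical field */
                  unsigned long current_value;    /* hypothetical field */
          };

          struct damos_quota_like {               /* hypothetical name */
                  unsigned long ms;               /* time quota */
                  unsigned long sz;               /* size quota */
                  struct damos_quota_goal_fields goal;  /* goal fields grouped */
          };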
      
      Link: https://lkml.kernel.org/r/20240219194431.159606-8-sj@kernel.org
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/damon: move comments and fields for damos-quota-prioritization to the end · 4d791a0a
      SeongJae Park authored
      The comments and definition of 'struct damos_quota' list a few fields for
      effective quota generation first, then fields for prioritization of
      regions under the quota, and then the remaining fields for effective quota
      generation.  Readers therefore have to switch context unnecessarily in the
      middle.  List all the fields for the effective quota first, and then the
      fields for the prioritization, to make it easier to read.
      
      Link: https://lkml.kernel.org/r/20240219194431.159606-7-sj@kernel.org
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • Docs/admin-guide/mm/damon/usage: document effective_bytes file · a6068d6d
      SeongJae Park authored
      Update DAMON usage document for the effective quota file of the DAMON
      sysfs interface.
      
      Link: https://lkml.kernel.org/r/20240219194431.159606-6-sj@kernel.org
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • Docs/ABI/damon: document effective_bytes sysfs file · 68c4905b
      SeongJae Park authored
      Update the DAMON ABI doc for the effective_bytes sysfs file and the
      kdamond state file input command for updating the content of the file.
      
      Link: https://lkml.kernel.org/r/20240219194431.159606-5-sj@kernel.org
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/damon/sysfs: implement a kdamond command for updating schemes' effective quotas · c71f8a71
      SeongJae Park authored
      Implement yet another kdamond 'state' file input command, namely
      'update_schemes_effective_quotas'.  If it is written, the
      'effective_bytes' files of the kdamond will be updated to provide the
      current effective size quota of each scheme in bytes.
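
      A userspace sketch of the intended flow (the sysfs paths below are example
      paths and depend on how many kdamonds, contexts and schemes are configured):

          #include <fcntl.h>
          #include <stdio.h>
          #include <string.h>
          #include <unistd.h>

          int main(void)
          {
                  const char *cmd = "update_schemes_effective_quotas";
                  char buf[64] = { 0 };
                  int fd;

                  /* ask kdamond 0 to refresh its effective_bytes files */
                  fd = open("/sys/kernel/mm/damon/admin/kdamonds/0/state", O_WRONLY);
                  if (fd < 0)
                          return 1;
                  if (write(fd, cmd, strlen(cmd)) < 0)
                          perror("write");
                  close(fd);

                  /* read the first scheme's effective size quota in bytes */
                  fd = open("/sys/kernel/mm/damon/admin/kdamonds/0/contexts/0/"
                            "schemes/0/quotas/effective_bytes", O_RDONLY);
                  if (fd < 0)
                          return 1;
                  if (read(fd, buf, sizeof(buf) - 1) > 0)
                          printf("effective quota: %s", buf);
                  close(fd);
                  return 0;
          }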
      
      Link: https://lkml.kernel.org/r/20240219194431.159606-4-sj@kernel.org
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/damon/sysfs-schemes: implement quota effective_bytes file · 68131315
      SeongJae Park authored
      The DAMON sysfs interface allows users to set two types of quotas, namely
      a time quota and a size quota.  DAMOS converts the time quota into a size
      quota and uses the smaller of the two resulting size quotas.  The
      resulting effective size quota can be helpful for debugging and analysis,
      but it is not exposed to the user.  The recently added feedback-driven
      quota auto-tuning makes it even more mysterious.
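
      Not the upstream code, just the rule described above: given a size quota
      and a time quota already converted to bytes, the effective quota is the
      smaller non-zero one (zero meaning "unlimited"):

          static unsigned long effective_size_quota(unsigned long sz_quota,
                                                    unsigned long sz_from_time_quota)
          {
                  if (!sz_quota)
                          return sz_from_time_quota;
                  if (!sz_from_time_quota)
                          return sz_quota;
                  return sz_quota < sz_from_time_quota ?
                                  sz_quota : sz_from_time_quota;
          }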
      
      Implement a DAMON sysfs interface read-only empty file, namely
      'effective_bytes', under the quota goal DAMON sysfs directory.  It will be
      extended to expose the effective quota to the end user.
      
      Link: https://lkml.kernel.org/r/20240219194431.159606-3-sj@kernel.org
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/damon/core: set damos_quota->esz as public field and document · 78f2f603
      SeongJae Park authored
      Patch series "mm/damon: let DAMOS feeds and tame/auto-tune itself".
      
      The Aim-oriented Feedback-driven DAMOS Aggressiveness Auto-tuning
      patchset[1], which has been merged since commit 9294a037 ("mm/damon/core:
      implement goal-oriented feedback-driven quota auto-tuning"), separated the
      mechanism from the policy.  That is, users can set a part of the DAMOS
      control policies without a deep understanding of the mechanism, using just
      their demands such as an SLA.
      
      However, users are still required to do some additional work of manually
      collecting their target metric and feeding it to DAMOS.  In the case of
      end-users who use DAMON sysfs interface, the context switches between
      user-space and kernel-space could also make it inefficient.  The overhead
      is supposed to be only trivial in common cases, though.  Meanwhile, in
      simple use cases, the target metric could be common system metrics that
      the kernel can efficiently self-retrieve, such as memory pressure stall
      time (PSI).
      
      Extend DAMOS quota auto-tuning to support multiple types of metrics,
      including ones that DAMOS can retrieve by itself, and add support for the
      memory pressure stall time (PSI) metric.  More types of metrics can be
      supported in the future.  The auto-tuning capability is currently
      available only to users of the DAMOS kernel API and the DAMON sysfs
      interface; extend the support to DAMON_RECLAIM.
      
      Patches Sequence
      ================
      
      The first five patches help with debugging and fine-tuning the existing
      quota control features.  The first one (patch 1) exposes the effective
      quota that is made from the given user inputs to DAMOS kernel API users,
      and documents it via kernel-doc.  The following four patches implement
      (patches 2 and 3) and document (patches 4 and 5) a new DAMON sysfs file
      that exposes the value.
      
      The following six patches clean up and simplify the existing DAMOS quota
      auto-tuning code by improving the layout of comments and data structures
      (patches 6 and 7), supporting a common use case, namely multiple goals
      (patches 8, 9 and 10), and simplifying the interface (patch 11).
      
      Then six patches for the main purpose of this patchset follow.  The first
      three changes extend the core logic for various target metrics (patch 12),
      implement memory pressure stall time-based target metric support (patch
      13), and update DAMON sysfs interface to support the new target metric
      (patch 14).  Then, documentation updates for the features on design (patch
      15), ABI (patch 16), and usage (patch 17) follow.
      
      Last three patches add auto-tuning support on DAMON_RECLAIM.  The patches
      implement DAMON_RECLAIM parameters for user-feedback driven quota
      auto-tuning (patch 18), memory pressure stall time-driven quota
      self-tuning (patch 19), and finally update the DAMON_RECLAIM usage
      document for the new parameters (patch 20).
      
      [1] https://lore.kernel.org/all/20231130023652.50284-1-sj@kernel.org/
      
      
      This patch (of 20):
      
      DAMOS allows users to specify the quota in multiple ways, including a time
      quota, a size quota, and feedback-based auto-tuning.  DAMOS makes one
      effective quota out of the inputs and uses it in the end.  Knowing the
      current effective quota helps with understanding DAMOS' internal mechanism
      and with fine-tuning quotas.  DAMON kernel API users can get the
      information from the ->esz field of struct damos_quota, but the field is
      marked for private use only, and is not kernel-doc documented.  Make it
      public and document it.
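
      A hedged sketch of what the now-public, kernel-doc documented field could
      look like (excerpt only; the other members of the struct are omitted):

          /**
           * struct damos_quota - Illustrative excerpt, other members omitted.
           * @esz:  Effective size quota in bytes, which DAMOS computes from the
           *        user-given time/size quotas and the feedback-based tuning.
           */
          struct damos_quota {
                  unsigned long esz;
                  /* ... remaining quota fields ... */
          };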
      
      Link: https://lkml.kernel.org/r/20240219194431.159606-1-sj@kernel.org
      Link: https://lkml.kernel.org/r/20240219194431.159606-2-sj@kernel.org
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/khugepaged: bypassing unnecessary scans with MMF_DISABLE_THP check · 879c6000
      Lance Yang authored
      khugepaged scans the entire address space in the background for each
      given mm, looking for opportunities to merge sequences of basic pages
      into huge pages.  However, when an mm is inserted to the mm_slots list,
      and the MMF_DISABLE_THP flag is set later, this scanning process
      becomes unnecessary for that mm and can be skipped to avoid redundant
      operations, especially in scenarios with a large address space.
      
      On an Intel Core i5 CPU, the time taken by khugepaged to scan the
      address space of the process, which has been set with the
      MMF_DISABLE_THP flag after being added to the mm_slots list, is as
      follows (shorter is better):
      
      VMA Count |   Old   |   New   |  Change
      ---------------------------------------
          50    |   23us  |    9us  |  -60.9%
         100    |   32us  |    9us  |  -71.9%
         200    |   44us  |    9us  |  -79.5%
         400    |   75us  |    9us  |  -88.0%
         800    |   98us  |    9us  |  -90.8%
      
      Once the number of VMAs for the process exceeds pages_to_scan, khugepaged
      needs to wait for scan_sleep_millisecs ms before scanning the next
      process.  IMO, unnecessary scans can simply be skipped with a very
      inexpensive mm->flags check in this case.
      
      This commit introduces a check before each scanning process to test the
      MMF_DISABLE_THP flag for the given mm; if the flag is set, the scanning
      process is bypassed, thereby improving the efficiency of khugepaged.
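
      A simplified sketch of such a check (illustrative, not the exact upstream
      hunk; the helper name is hypothetical):

          #include <linux/mm_types.h>
          #include <linux/sched/coredump.h>   /* MMF_DISABLE_THP */

          /* Skip the whole mm when userspace disabled THP for it via prctl(). */
          static bool khugepaged_thp_disabled(struct mm_struct *mm)
          {
                  return test_bit(MMF_DISABLE_THP, &mm->flags);
          }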
      
      This optimization is not a correctness issue but rather an enhancement
      to save expensive checks on each VMA when userspace cannot prctl itself
      before spawning into the new process.
      
      On some servers within our company, we deploy a daemon responsible for
      monitoring and updating local applications.  Some applications prefer
      not to use THP, so the daemon calls prctl to disable THP before
      fork/exec.  Conversely, for other applications, the daemon calls prctl
      to enable THP before fork/exec.
      
      Ideally, the daemon should invoke prctl after the fork, but its current
      implementation follows the described approach.  In the Go standard
      library, there is no direct encapsulation of the fork system call;
      instead, fork and execve are combined into one through
      syscall.ForkExec.
      
      Link: https://lkml.kernel.org/r/20240129054551.57728-1-ioworker0@gmail.com
      Signed-off-by: Lance Yang <ioworker0@gmail.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Zach O'Keefe <zokeefe@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • MAINTAINERS: update mm and memcg entries · b659a7c2
      Mike Rapoport (IBM) authored
      Add F: lines for memory management and memory cgroup include files.
      
      Link: https://lkml.kernel.org/r/20240208055727.142387-1-rppt@kernel.org
      Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • arch, crash: move arch_crash_save_vmcoreinfo() out to file vmcore_info.c · 199da871
      Baoquan He authored
      Nathan reported below building error:
      
      =====
      $ curl -LSso .config https://git.alpinelinux.org/aports/plain/community/linux-edge/config-edge.armv7
      $ make -skj"$(nproc)" ARCH=arm CROSS_COMPILE=arm-linux-gnueabi- olddefconfig all
      ..
      arm-linux-gnueabi-ld: arch/arm/kernel/machine_kexec.o: in function `arch_crash_save_vmcoreinfo':
      machine_kexec.c:(.text+0x488): undefined reference to `vmcoreinfo_append_str'
      ====
      
      On architectures like arm, s390, ppc and sh, the function
      arch_crash_save_vmcoreinfo() is located in machine_kexec.c and can only be
      compiled in when CONFIG_KEXEC_CORE=y.

      That's not right, because arch_crash_save_vmcoreinfo() is used to export
      arch specific vmcoreinfo, so CONFIG_VMCORE_INFO is supposed to control
      whether it is compiled in.  However, CONFIG_VMCORE_INFO can be independent
      of CONFIG_KEXEC_CORE, e.g. CONFIG_PROC_KCORE=y will select
      CONFIG_VMCORE_INFO.  Likewise, if CONFIG_KEXEC/CONFIG_KEXEC_FILE is set
      while CONFIG_CRASH_DUMP is not, a linking error is reported.

      So, on arm, s390, ppc and sh, move arch_crash_save_vmcoreinfo() out to a
      new file, vmcore_info.c, and let CONFIG_VMCORE_INFO decide whether
      arch_crash_save_vmcoreinfo() is compiled in.
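
      The rough shape of the moved code, simplified (the header name assumes the
      post-split layout described in this series; the exported value is a
      placeholder):

          #include <linux/vmcore_info.h>

          void arch_crash_save_vmcoreinfo(void)
          {
                  /* export arch specific details for vmcore parsers */
                  vmcoreinfo_append_str("NUMBER(EXAMPLE_ARCH_DETAIL)=%d\n", 0);
          }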
      
      [akpm@linux-foundation.org: remove stray newlines at eof]
      Link: https://lkml.kernel.org/r/20240129135033.157195-3-bhe@redhat.com
      Signed-off-by: Baoquan He <bhe@redhat.com>
      Reported-by: Nathan Chancellor <nathan@kernel.org>
      Closes: https://lore.kernel.org/all/20240126045551.GA126645@dev-arch.thelio-3990X/T/#u
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Hari Bathini <hbathini@linux.ibm.com>
      Cc: Klara Modin <klarasmodin@gmail.com>
      Cc: Michael Kelley <mhklinux@outlook.com>
      Cc: Pingfan Liu <piliu@redhat.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Yang Li <yang.lee@linux.alibaba.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • loongarch, crash: wrap crash dumping code into crash related ifdefs · ea034d0b
      Baoquan He authored
      Now that the crash code under the kernel/ folder has been split out from
      the kexec code, crash dumping can be separated from kexec reboot in the
      config items on loongarch with some adjustments.

      Here, use an IS_ENABLED(CONFIG_CRASH_RESERVE) check to decide whether to
      compile in the crashkernel reservation code.
      
      Link: https://lkml.kernel.org/r/20240124051254.67105-15-bhe@redhat.com
      Signed-off-by: Baoquan He <bhe@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Hari Bathini <hbathini@linux.ibm.com>
      Cc: Pingfan Liu <piliu@redhat.com>
      Cc: Klara Modin <klarasmodin@gmail.com>
      Cc: Michael Kelley <mhklinux@outlook.com>
      Cc: Nathan Chancellor <nathan@kernel.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Yang Li <yang.lee@linux.alibaba.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • arm, crash: wrap crash dumping code into crash related ifdefs · 5057dff3
      Baoquan He authored
      Now that the crash code under the kernel/ folder has been split out from
      the kexec code, crash dumping can be separated from kexec reboot in the
      config items on arm with some adjustments.

      Here, use a CONFIG_CRASH_RESERVE ifdef to replace the CONFIG_KEXEC ifdef.
      
      Link: https://lkml.kernel.org/r/20240124051254.67105-14-bhe@redhat.com
      Signed-off-by: Baoquan He <bhe@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Hari Bathini <hbathini@linux.ibm.com>
      Cc: Pingfan Liu <piliu@redhat.com>
      Cc: Klara Modin <klarasmodin@gmail.com>
      Cc: Michael Kelley <mhklinux@outlook.com>
      Cc: Nathan Chancellor <nathan@kernel.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Yang Li <yang.lee@linux.alibaba.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • riscv, crash: wrap crash dumping code into crash related ifdefs · 0978a63f
      Baoquan He authored
      Now that the crash code under the kernel/ folder has been split out from
      the kexec code, crash dumping can be separated from kexec reboot in the
      config items on risc-v with some adjustments.

      Here, wrap up the crash dumping code with CONFIG_CRASH_DUMP ifdeffery, and
      use an IS_ENABLED(CONFIG_CRASH_RESERVE) check to decide whether to compile
      in the crashkernel reservation code.
      
      Link: https://lkml.kernel.org/r/20240124051254.67105-13-bhe@redhat.com
      Signed-off-by: Baoquan He <bhe@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Hari Bathini <hbathini@linux.ibm.com>
      Cc: Pingfan Liu <piliu@redhat.com>
      Cc: Klara Modin <klarasmodin@gmail.com>
      Cc: Michael Kelley <mhklinux@outlook.com>
      Cc: Nathan Chancellor <nathan@kernel.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Yang Li <yang.lee@linux.alibaba.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mips, crash: wrap crash dumping code into crash related ifdefs · d739f190
      Baoquan He authored
      Now that the crash code under the kernel/ folder has been split out from
      the kexec code, crash dumping can be separated from kexec reboot in the
      config items on mips with some adjustments.

      Here, use an IS_ENABLED(CONFIG_CRASH_RESERVE) check to decide whether to
      compile in the crashkernel reservation code.
      
      Link: https://lkml.kernel.org/r/20240124051254.67105-12-bhe@redhat.com
      Signed-off-by: Baoquan He <bhe@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Hari Bathini <hbathini@linux.ibm.com>
      Cc: Pingfan Liu <piliu@redhat.com>
      Cc: Klara Modin <klarasmodin@gmail.com>
      Cc: Michael Kelley <mhklinux@outlook.com>
      Cc: Nathan Chancellor <nathan@kernel.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Yang Li <yang.lee@linux.alibaba.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • sh, crash: wrap crash dumping code into crash related ifdefs · e3892635
      Baoquan He authored
      Now that the crash code under the kernel/ folder has been split out from
      the kexec code, crash dumping can be separated from kexec reboot in the
      config items on SuperH with some adjustments.

      Wrap up the crash dumping code with CONFIG_CRASH_DUMP ifdeffery, and use
      an IS_ENABLED(CONFIG_CRASH_RESERVE) check to decide whether to compile in
      the crashkernel reservation code.
      
      Link: https://lkml.kernel.org/r/20240124051254.67105-11-bhe@redhat.com
      Signed-off-by: Baoquan He <bhe@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Hari Bathini <hbathini@linux.ibm.com>
      Cc: Pingfan Liu <piliu@redhat.com>
      Cc: Klara Modin <klarasmodin@gmail.com>
      Cc: Michael Kelley <mhklinux@outlook.com>
      Cc: Nathan Chancellor <nathan@kernel.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Yang Li <yang.lee@linux.alibaba.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • s390, crash: wrap crash dumping code into crash related ifdefs · 865e2acd
      Baoquan He authored
      Now that the crash code under the kernel/ folder has been split out from
      the kexec code, crash dumping can be separated from kexec reboot in the
      config items on s390 with some adjustments.

      Here, wrap up the crash dumping code with CONFIG_CRASH_DUMP ifdeffery.
      
      Link: https://lkml.kernel.org/r/20240124051254.67105-10-bhe@redhat.com
      Signed-off-by: Baoquan He <bhe@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Hari Bathini <hbathini@linux.ibm.com>
      Cc: Pingfan Liu <piliu@redhat.com>
      Cc: Klara Modin <klarasmodin@gmail.com>
      Cc: Michael Kelley <mhklinux@outlook.com>
      Cc: Nathan Chancellor <nathan@kernel.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Yang Li <yang.lee@linux.alibaba.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • ppc, crash: enforce KEXEC and KEXEC_FILE to select CRASH_DUMP · 086d67ef
      Baoquan He authored
      In PowerPC, crash dumping and kexec reboot share code in
      arch_kexec_locate_mem_hole(), in which struct crash_mem is used.

      Here, enforce KEXEC and KEXEC_FILE to select CRASH_DUMP for now.
      
      [bhe@redhat.com: fix allnoconfig on ppc]
        Link: https://lkml.kernel.org/r/ZbJwMyCpz4HDySoo@MiWiFi-R3L-srv
      Link: https://lkml.kernel.org/r/20240124051254.67105-9-bhe@redhat.com
      Signed-off-by: Baoquan He <bhe@redhat.com>
      Acked-by: Hari Bathini <hbathini@linux.ibm.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Pingfan Liu <piliu@redhat.com>
      Cc: Klara Modin <klarasmodin@gmail.com>
      Cc: Michael Kelley <mhklinux@outlook.com>
      Cc: Nathan Chancellor <nathan@kernel.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Yang Li <yang.lee@linux.alibaba.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • arm64, crash: wrap crash dumping code into crash related ifdefs · 40254101
      Baoquan He authored
      Now that the crash code under the kernel/ folder has been split out from
      the kexec code, crash dumping can be separated from kexec reboot in the
      config items on arm64 with some adjustments.

      Here, wrap up the crash dumping code with CONFIG_CRASH_DUMP ifdeffery.
      
      [bhe@redhat.com: fix building error in generic codes]
        Link: https://lkml.kernel.org/r/20240129135033.157195-2-bhe@redhat.com
      Link: https://lkml.kernel.org/r/20240124051254.67105-8-bhe@redhat.com
      Signed-off-by: Baoquan He <bhe@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Hari Bathini <hbathini@linux.ibm.com>
      Cc: Pingfan Liu <piliu@redhat.com>
      Cc: Klara Modin <klarasmodin@gmail.com>
      Cc: Michael Kelley <mhklinux@outlook.com>
      Cc: Nathan Chancellor <nathan@kernel.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Yang Li <yang.lee@linux.alibaba.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • x86, crash: wrap crash dumping code into crash related ifdefs · a4eeb217
      Baoquan He authored
      Now that the crash code under the kernel/ folder has been split out from
      the kexec code, crash dumping can be separated from kexec reboot in the
      config items on x86 with some adjustments.

      Here, also change some ifdefs or IS_ENABLED() checks to more appropriate
      ones, e.g., as sketched below:
       - #ifdef CONFIG_KEXEC_CORE -> #ifdef CONFIG_CRASH_DUMP
       - (!IS_ENABLED(CONFIG_KEXEC_CORE)) -> (!IS_ENABLED(CONFIG_CRASH_RESERVE))
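
      A minimal sketch of the conversion pattern (illustrative, not an actual
      hunk from the patch; function names are hypothetical):

          /* before: compiled in whenever kexec was, even without crash dumping */
          #ifdef CONFIG_CRASH_DUMP                /* was: CONFIG_KEXEC_CORE */
          static void crash_only_setup(void)      /* hypothetical function */
          {
                  /* crash dumping specific setup ... */
          }
          #endif /* CONFIG_CRASH_DUMP */

          static void reserve_memory(void)        /* hypothetical function */
          {
                  if (!IS_ENABLED(CONFIG_CRASH_RESERVE))  /* was: CONFIG_KEXEC_CORE */
                          return;
                  /* crashkernel reservation ... */
          }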
      
      [bhe@redhat.com: don't nest CONFIG_CRASH_DUMP ifdef inside CONFIG_KEXEC_CODE ifdef scope]
        Link: https://lore.kernel.org/all/SN6PR02MB4157931105FA68D72E3D3DB8D47B2@SN6PR02MB4157.namprd02.prod.outlook.com/T/#u
      Link: https://lkml.kernel.org/r/20240124051254.67105-7-bhe@redhat.com
      Signed-off-by: Baoquan He <bhe@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Hari Bathini <hbathini@linux.ibm.com>
      Cc: Pingfan Liu <piliu@redhat.com>
      Cc: Klara Modin <klarasmodin@gmail.com>
      Cc: Michael Kelley <mhklinux@outlook.com>
      Cc: Nathan Chancellor <nathan@kernel.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Yang Li <yang.lee@linux.alibaba.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • crash: clean up kdump related config items · 75bc255a
      Baoquan He authored
      By splitting CRASH_RESERVE and VMCORE_INFO out from CRASH_CORE, cleaning
      up the dependency of FA_DUMP on CRASH_DUMP, and moving the crash code from
      kexec_core.c to crash_core.c, we can now rearrange CRASH_DUMP to depend on
      KEXEC_CORE, and make CRASH_DUMP select CRASH_RESERVE and VMCORE_INFO.

      KEXEC_CORE won't select CRASH_RESERVE and VMCORE_INFO any more, because
      KEXEC_CORE enables code which allocates control pages, copies kexec/kdump
      segments, and prepares for switching.  This code is shared by both kexec
      reboot and crash dumping.

      Doing this makes the code and the corresponding config items more logical
      (the right item depends on or is selected by the left item).
      
      PROC_KCORE -----------> VMCORE_INFO
      
                 |----------> VMCORE_INFO
      FA_DUMP----|
                 |----------> CRASH_RESERVE
      
                                                      ---->VMCORE_INFO
                                                     /
                                                     |---->CRASH_RESERVE
      KEXEC      --|                                /|
                   |--> KEXEC_CORE--> CRASH_DUMP-->/-|---->PROC_VMCORE
      KEXEC_FILE --|                               \ |
                                                     \---->CRASH_HOTPLUG
      
      KEXEC      --|
                   |--> KEXEC_CORE--> kexec reboot
      KEXEC_FILE --|
      
      Link: https://lkml.kernel.org/r/20240124051254.67105-6-bhe@redhat.com
      Signed-off-by: Baoquan He <bhe@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Hari Bathini <hbathini@linux.ibm.com>
      Cc: Pingfan Liu <piliu@redhat.com>
      Cc: Klara Modin <klarasmodin@gmail.com>
      Cc: Michael Kelley <mhklinux@outlook.com>
      Cc: Nathan Chancellor <nathan@kernel.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Yang Li <yang.lee@linux.alibaba.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • crash: split crash dumping code out from kexec_core.c · 02aff848
      Baoquan He authored
      Currently, KEXEC_CORE selects CRASH_CORE automatically, because crash code
      needs to be built in to avoid compile errors when building the kexec code,
      even when the crash dumping functionality is not enabled.  E.g.
      --------------------
      CONFIG_CRASH_CORE=y
      CONFIG_KEXEC_CORE=y
      CONFIG_KEXEC=y
      CONFIG_KEXEC_FILE=y
      ---------------------
      
      After splitting out the crashkernel reservation code and the vmcoreinfo
      exporting code, only crash related code is left in kernel/crash_core.c.
      Now move the crash related code from kexec_core.c to crash_core.c and only
      build it in when CONFIG_CRASH_DUMP=y.

      Also wrap up the crash code inside CONFIG_CRASH_DUMP ifdeffery scope, or
      replace inappropriate CONFIG_KEXEC_CORE ifdefs with CONFIG_CRASH_DUMP
      ifdefs in generic kernel files.

      With these changes, the crash_core code is abstracted from the kexec code
      and can be disabled entirely if only the kexec reboot feature is wanted.
      
      Link: https://lkml.kernel.org/r/20240124051254.67105-5-bhe@redhat.com
      Signed-off-by: Baoquan He <bhe@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Hari Bathini <hbathini@linux.ibm.com>
      Cc: Pingfan Liu <piliu@redhat.com>
      Cc: Klara Modin <klarasmodin@gmail.com>
      Cc: Michael Kelley <mhklinux@outlook.com>
      Cc: Nathan Chancellor <nathan@kernel.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Yang Li <yang.lee@linux.alibaba.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • crash: remove dependency of FA_DUMP on CRASH_DUMP · 2c44b67e
      Baoquan He authored
      In the kdump kernel, /proc/vmcore is an elf file mapping the crashed
      kernel's old memory content.  Its elf header is constructed in the 1st
      kernel and passed to the kdump kernel via elfcorehdr_addr.  Config
      CRASH_DUMP enables the code for accessing the 1st kernel's old memory on
      different architectures.

      Currently, config FA_DUMP depends on CRASH_DUMP because fadump needs to
      access the global variable 'elfcorehdr_addr' to judge whether it is
      running in a kdump kernel, within the function is_kdump_kernel().  In the
      current kernel/crash_dump.c, the variable 'elfcorehdr_addr' is defined,
      and the function setup_elfcorehdr() is used to parse the kernel parameter
      and fetch the passed value of elfcorehdr_addr.  Just for accessing
      elfcorehdr_addr, FA_DUMP really doesn't have to depend on CRASH_DUMP.

      To remove the dependency of FA_DUMP on CRASH_DUMP and avoid confusion,
      rename kernel/crash_dump.c to kernel/elfcorehdr.c, and build it when
      CONFIG_VMCORE_INFO is enabled.  With this, FA_DUMP doesn't need to depend
      on CRASH_DUMP.
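
      A simplified form of the check being discussed (the in-tree helper lives
      in include/linux/crash_dump.h):

          static inline bool is_kdump_kernel(void)
          {
                  return elfcorehdr_addr != ELFCORE_ADDR_MAX;
          }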
      
      [bhe@redhat.com: power/fadump: make FA_DUMP select CRASH_DUMP]
        Link: https://lkml.kernel.org/r/Zb8D1ASrgX0qVm9z@MiWiFi-R3L-srv
      Link: https://lkml.kernel.org/r/20240124051254.67105-4-bhe@redhat.com
      Signed-off-by: Baoquan He <bhe@redhat.com>
      Acked-by: Hari Bathini <hbathini@linux.ibm.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Pingfan Liu <piliu@redhat.com>
      Cc: Klara Modin <klarasmodin@gmail.com>
      Cc: Michael Kelley <mhklinux@outlook.com>
      Cc: Nathan Chancellor <nathan@kernel.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Yang Li <yang.lee@linux.alibaba.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • crash: split vmcoreinfo exporting code out from crash_core.c · 443cbaf9
      Baoquan He authored
      Now move the relevant code into separate files: kernel/vmcore_info.c and
      include/linux/vmcore_info.h.

      And add the config item VMCORE_INFO to control its enabling.

      Also update the old ifdeffery of CONFIG_CRASH_CORE, the inclusion of
      <linux/crash_core.h>, and config item dependencies on CRASH_CORE
      accordingly.

      Also do the following renaming:
       - arch/xxx/kernel/{crash_core.c => vmcore_info.c}
      because those files are only related to vmcoreinfo exporting on x86, arm64
      and riscv.

      Also remove the config item CRASH_CORE, and rely on CONFIG_KEXEC_CORE to
      decide whether crash_core.c is built in.
      
      [yang.lee@linux.alibaba.com: remove duplicated include in vmcore_info.c]
        Link: https://lkml.kernel.org/r/20240126005744.16561-1-yang.lee@linux.alibaba.com
      Link: https://lkml.kernel.org/r/20240124051254.67105-3-bhe@redhat.com
      Signed-off-by: Baoquan He <bhe@redhat.com>
      Signed-off-by: Yang Li <yang.lee@linux.alibaba.com>
      Acked-by: Hari Bathini <hbathini@linux.ibm.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Pingfan Liu <piliu@redhat.com>
      Cc: Klara Modin <klarasmodin@gmail.com>
      Cc: Michael Kelley <mhklinux@outlook.com>
      Cc: Nathan Chancellor <nathan@kernel.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Yang Li <yang.lee@linux.alibaba.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • kexec: split crashkernel reservation code out from crash_core.c · 85fcde40
      Baoquan He authored
      Patch series "Split crash out from kexec and clean up related config
      items", v3.
      
      Motivation:
      =============
      Previously, LKP reported a build error that, upon investigation, could not
      be resolved reasonably with the present messy kdump config items.

       https://lore.kernel.org/oe-kbuild-all/202312182200.Ka7MzifQ-lkp@intel.com/

      The kdump (crash dumping) related config items can cause confusion:
      
      Firstly,
      
      CRASH_CORE enables codes including
       - crashkernel reservation;
       - elfcorehdr updating;
       - vmcoreinfo exporting;
       - crash hotplug handling;
      
      Now fadump of powerpc, kcore dynamic debugging and kdump all select
      CRASH_CORE, while
       - fadump needs crashkernel parsing, vmcoreinfo exporting, and access to
         the global variable 'elfcorehdr_addr';
       - kcore only needs vmcoreinfo exporting;
       - kdump needs all of the current kernel/crash_core.c.

      So enabling only PROC_KCORE or FA_DUMP will enable CRASH_CORE, which
      misleads people into thinking crash dumping is enabled, when actually it
      is not.
      
      Secondly,
      
      It's not reasonable to let KEXEC_CORE select CRASH_CORE, because
      KEXEC_CORE enables code which allocates control pages, copies kexec/kdump
      segments, and prepares for switching.  This code is shared by both kexec
      reboot and kdump.  We may want kexec reboot but have kdump disabled; in
      that case, CRASH_CORE should not be selected.
      
       --------------------
       CONFIG_CRASH_CORE=y
       CONFIG_KEXEC_CORE=y
       CONFIG_KEXEC=y
       CONFIG_KEXEC_FILE=y
       ---------------------
      
      Thirdly,
      
      It's not reasonable to let CRASH_DUMP select KEXEC_CORE.

      That would allow KEXEC_CORE and CRASH_DUMP to be enabled independently of
      KEXEC or KEXEC_FILE.  However, without KEXEC or KEXEC_FILE, the built-in
      KEXEC_CORE code doesn't make any sense, because no kernel loading or
      switching will happen to utilize it.
       ---------------------
       CONFIG_CRASH_CORE=y
       CONFIG_KEXEC_CORE=y
       CONFIG_CRASH_DUMP=y
       ---------------------
      
      In this case, what is worse, on arch sh and arm, KEXEC relies on MMU,
      while CRASH_DUMP can still be enabled when !MMU; then a compile error is
      seen, as the LKP test robot reported in the above link.
      
       ------arch/sh/Kconfig------
       config ARCH_SUPPORTS_KEXEC
               def_bool MMU
      
       config ARCH_SUPPORTS_CRASH_DUMP
               def_bool BROKEN_ON_SMP
       ---------------------------
      
      Changes:
      ===========
      1, split out crash_reserve.c from crash_core.c;
      2, split out vmcore_info.c from crash_core.c;
      3, move crash related code from kexec_core.c into crash_core.c;
      4, remove the dependency of FA_DUMP on CRASH_DUMP;
      5, clean up kdump related config items;
      6, wrap up crash code in crash related ifdefs on all 8 arch-es
         which support crash dumping, except ppc;
      
      Achievement:
      ===========
      With above changes, I can rearrange the config item logic as below (the right
      item depends on or is selected by the left item):
      
          PROC_KCORE -----------> VMCORE_INFO
      
                     |----------> VMCORE_INFO
          FA_DUMP----|
                     |----------> CRASH_RESERVE
      
                                                          ---->VMCORE_INFO
                                                         /
                                                         |---->CRASH_RESERVE
          KEXEC      --|                                /|
                       |--> KEXEC_CORE--> CRASH_DUMP-->/-|---->PROC_VMCORE
          KEXEC_FILE --|                               \ |
                                                         \---->CRASH_HOTPLUG
      
      
          KEXEC      --|
                       |--> KEXEC_CORE (for kexec reboot only)
          KEXEC_FILE --|
      
      Test
      ========
      On all 8 architectures, i.e. x86_64, arm64, s390x, sh, arm, mips, riscv
      and loongarch, I tried the three cases of config item settings below and
      all of them built successfully.  Take the configs on x86_64 as an example
      here:

      (1) Both CONFIG_KEXEC and CONFIG_KEXEC_FILE are unset; then all
      kexec/kdump items are unset automatically:
      # Kexec and crash features
      # CONFIG_KEXEC is not set
      # CONFIG_KEXEC_FILE is not set
      # end of Kexec and crash features
      
      (2) set CONFIG_KEXEC_FILE and 'make olddefconfig':
      ---------------
      # Kexec and crash features
      CONFIG_CRASH_RESERVE=y
      CONFIG_VMCORE_INFO=y
      CONFIG_KEXEC_CORE=y
      CONFIG_KEXEC_FILE=y
      CONFIG_CRASH_DUMP=y
      CONFIG_CRASH_HOTPLUG=y
      CONFIG_CRASH_MAX_MEMORY_RANGES=8192
      # end of Kexec and crash features
      ---------------
      
      (3) unset CONFIG_CRASH_DUMP in case 2 and execute 'make olddefconfig':
      ------------------------
      # Kexec and crash features
      CONFIG_KEXEC_CORE=y
      CONFIG_KEXEC_FILE=y
      # end of Kexec and crash features
      ------------------------
      
      Note:
      For ppc, it needs investigation to work out how to split out the crash
      code in the arch folder.  Hopefully Hari and Pingfan can help have a look
      and see if it's doable.  For now, it either has both kexec and crash
      enabled, or has both of them disabled altogether.
      
      
      This patch (of 14):
      
      Both kdump and fa_dump of ppc rely on crashkernel reservation.  Move the
      relevant code into separate files: kernel/crash_reserve.c and
      include/linux/crash_reserve.h.

      Also add the config item CRASH_RESERVE to control enabling of this code,
      and update the config items which have a relationship with crashkernel
      reservation.
      
      Also change the ifdeffery from CONFIG_CRASH_CORE to CONFIG_CRASH_RESERVE
      where those scopes are only related to crashkernel reservation.

      Also rename arch/XXX/include/asm/{crash_core.h => crash_reserve.h} on
      arm64, x86 and risc-v, because those architectures' crash_core.h is only
      related to crashkernel reservation.
      
      [akpm@linux-foundation.org: s/CRASH_RESEERVE/CRASH_RESERVE/, per Klara Modin]
      Link: https://lkml.kernel.org/r/20240124051254.67105-1-bhe@redhat.com
      Link: https://lkml.kernel.org/r/20240124051254.67105-2-bhe@redhat.com
      Signed-off-by: Baoquan He <bhe@redhat.com>
      Acked-by: Hari Bathini <hbathini@linux.ibm.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Pingfan Liu <piliu@redhat.com>
      Cc: Klara Modin <klarasmodin@gmail.com>
      Cc: Michael Kelley <mhklinux@outlook.com>
      Cc: Nathan Chancellor <nathan@kernel.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Yang Li <yang.lee@linux.alibaba.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: vmalloc: refactor vmalloc_dump_obj() function · 8be4d46e
      Uladzislau Rezki (Sony) authored
      This patch simplifies the function in question by removing the extra stack
      variable "objp" and returning to an early-exit approach if spin_trylock()
      fails or the VA is not found.
      
      Link: https://lkml.kernel.org/r/20240124180920.50725-2-urezki@gmail.com
      Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
      Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sony.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: vmalloc: improve description of vmap node layer · 15e02a39
      Uladzislau Rezki (Sony) authored
      This patch adds extra explanation of recently added vmap node layer based
      on community feedback.  No functional change.
      
      Link: https://lkml.kernel.org/r/20240124180920.50725-1-urezki@gmail.com
      Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
      Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sony.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: vmalloc: add a shrinker to drain vmap pools · 7679ba6b
      Uladzislau Rezki (Sony) authored
      The added shrinker is used to return currently cached VAs back to the
      global vmap space when the system enters a low-memory mode.
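
      A rough sketch of the registration path (assuming the shrinker_alloc()/
      shrinker_register() API; the callback bodies and the init hook name are
      illustrative stubs, not the upstream code):

          #include <linux/shrinker.h>

          static unsigned long vmap_pools_count(struct shrinker *sh,
                                                struct shrink_control *sc)
          {
                  /* real code: number of cached VAs sitting in node pools */
                  return SHRINK_EMPTY;
          }

          static unsigned long vmap_pools_scan(struct shrinker *sh,
                                               struct shrink_control *sc)
          {
                  /* real code: drain pools back to the global space, return count */
                  return 0;
          }

          static int __init vmap_pools_shrinker_init(void)  /* hypothetical hook */
          {
                  struct shrinker *s = shrinker_alloc(0, "vmap-node");

                  if (!s)
                          return -ENOMEM;
                  s->count_objects = vmap_pools_count;
                  s->scan_objects = vmap_pools_scan;
                  shrinker_register(s);
                  return 0;
          }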
      
      Link: https://lkml.kernel.org/r/20240102184633.748113-12-urezki@gmail.com
      Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
      Cc: Kazuhito Hagio <k-hagio-ab@nec.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
      Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
      Cc: Lorenzo Stoakes <lstoakes@gmail.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sony.com>
      Cc: Paul E. McKenney <paulmck@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: vmalloc: set nr_nodes based on CPUs in a system · 8f33a2ff
      Uladzislau Rezki (Sony) authored
      The number of nodes used in the alloc/free paths is set based on
      num_possible_cpus() in a system.  Please note, though, that a fixed upper
      limit of 128 nodes applies.

      For 32-bit or single-core systems, access to the global vmap heap is not
      balanced.  Such small systems do not suffer from lock contention due to
      the low number of CPUs; in such a case nr_nodes is equal to 1.
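
      The sizing rule above, as a sketch (the helper name is illustrative):

          #include <linux/cpumask.h>

          static unsigned int calc_nr_vmap_nodes(void)
          {
                  unsigned int n = num_possible_cpus();

                  if (n > 128)
                          return 128;
                  return n ? n : 1;
          }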
      
      Test on AMD Ryzen Threadripper 3970X 32-Core Processor: sudo
      ./test_vmalloc.sh run_test_mask=7 nr_threads=64
      
      <default perf>
       94.41%     0.89%  [kernel]        [k] _raw_spin_lock
       93.35%    93.07%  [kernel]        [k] native_queued_spin_lock_slowpath
       76.13%     0.28%  [kernel]        [k] __vmalloc_node_range
       72.96%     0.81%  [kernel]        [k] alloc_vmap_area
       56.94%     0.00%  [kernel]        [k] __get_vm_area_node
       41.95%     0.00%  [kernel]        [k] vmalloc
       37.15%     0.01%  [test_vmalloc]  [k] full_fit_alloc_test
       35.17%     0.00%  [kernel]        [k] ret_from_fork_asm
       35.17%     0.00%  [kernel]        [k] ret_from_fork
       35.17%     0.00%  [kernel]        [k] kthread
       35.08%     0.00%  [test_vmalloc]  [k] test_func
       34.45%     0.00%  [test_vmalloc]  [k] fix_size_alloc_test
       28.09%     0.01%  [test_vmalloc]  [k] long_busy_list_alloc_test
       23.53%     0.25%  [kernel]        [k] vfree.part.0
       21.72%     0.00%  [kernel]        [k] remove_vm_area
       20.08%     0.21%  [kernel]        [k] find_unlink_vmap_area
        2.34%     0.61%  [kernel]        [k] free_vmap_area_noflush
      <default perf>
         vs
      <patch-series perf>
       82.32%     0.22%  [test_vmalloc]  [k] long_busy_list_alloc_test
       63.36%     0.02%  [kernel]        [k] vmalloc
       63.34%     2.64%  [kernel]        [k] __vmalloc_node_range
       30.42%     4.46%  [kernel]        [k] vfree.part.0
       28.98%     2.51%  [kernel]        [k] __alloc_pages_bulk
       27.28%     0.19%  [kernel]        [k] __get_vm_area_node
       26.13%     1.50%  [kernel]        [k] alloc_vmap_area
       21.72%    21.67%  [kernel]        [k] clear_page_rep
       19.51%     2.43%  [kernel]        [k] _raw_spin_lock
       16.61%    16.51%  [kernel]        [k] native_queued_spin_lock_slowpath
       13.40%     2.07%  [kernel]        [k] free_unref_page
       10.62%     0.01%  [kernel]        [k] remove_vm_area
        9.02%     8.73%  [kernel]        [k] insert_vmap_area
        8.94%     0.00%  [kernel]        [k] ret_from_fork_asm
        8.94%     0.00%  [kernel]        [k] ret_from_fork
        8.94%     0.00%  [kernel]        [k] kthread
        8.29%     0.00%  [test_vmalloc]  [k] test_func
        7.81%     0.05%  [test_vmalloc]  [k] full_fit_alloc_test
        5.30%     4.73%  [kernel]        [k] purge_vmap_node
        4.47%     2.65%  [kernel]        [k] free_vmap_area_noflush
      <patch-series perf>
      
      This confirms that native_queued_spin_lock_slowpath drops from 93.07% to
      16.51%.
      
      The throughput is ~12x higher:
      
      urezki@pc638:~$ time sudo ./test_vmalloc.sh run_test_mask=7 nr_threads=64
      Run the test with following parameters: run_test_mask=7 nr_threads=64
      Done.
      Check the kernel ring buffer to see the summary.
      
      real    10m51.271s
      user    0m0.013s
      sys     0m0.187s
      urezki@pc638:~$
      
      urezki@pc638:~$ time sudo ./test_vmalloc.sh run_test_mask=7 nr_threads=64
      Run the test with following parameters: run_test_mask=7 nr_threads=64
      Done.
      Check the kernel ring buffer to see the summary.
      
      real    0m51.301s
      user    0m0.015s
      sys     0m0.040s
      urezki@pc638:~$
      
      Link: https://lkml.kernel.org/r/20240102184633.748113-11-urezki@gmail.com
      Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
      Cc: Kazuhito Hagio <k-hagio-ab@nec.com>
      Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
      Cc: Lorenzo Stoakes <lstoakes@gmail.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sony.com>
      Cc: Paul E. McKenney <paulmck@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: vmalloc: support multiple nodes in vmallocinfo · 8e1d743f
      Uladzislau Rezki (Sony) authored
      Allocated areas are spread among the nodes, which implies that the
      scanning has to be performed individually for each node in order to dump
      all existing VAs.
      
      Link: https://lkml.kernel.org/r/20240102184633.748113-10-urezki@gmail.com
      Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
      Cc: Kazuhito Hagio <k-hagio-ab@nec.com>
      Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
      Cc: Lorenzo Stoakes <lstoakes@gmail.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sony.com>
      Cc: Paul E. McKenney <paulmck@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: vmalloc: support multiple nodes in vread_iter · 53becf32
      Uladzislau Rezki (Sony) authored
      Extend vread_iter() to be able to perform sequential reading of VAs which
      are spread among multiple nodes, so that a data read over /dev/kmem
      correctly reflects the vmalloc memory layout.
      
      Link: https://lkml.kernel.org/r/20240102184633.748113-9-urezki@gmail.com
      Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
      Reviewed-by: Baoquan He <bhe@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
      Cc: Kazuhito Hagio <k-hagio-ab@nec.com>
      Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
      Cc: Lorenzo Stoakes <lstoakes@gmail.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sony.com>
      Cc: Paul E. McKenney <paulmck@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: vmalloc: add a scan area of VA only once · 96aa8437
      Uladzislau Rezki (Sony) authored
      Invoke the kmemleak_scan_area() function only for newly allocated objects,
      to add a scan area within that object.  There is no reason to add the same
      scan area (a pointer to the beginning of, or inside, the object) several
      times.  If a VA is obtained from the cache, its scan area has already been
      associated.
      
      Link: https://lkml.kernel.org/r/20240202190628.47806-1-urezki@gmail.com
      Fixes: 7db166b4aa0d ("mm: vmalloc: offload free_vmap_area_lock lock")
      Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
      Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
      Cc: Kazuhito Hagio <k-hagio-ab@nec.com>
      Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sony.com>
      Cc: Paul E. McKenney <paulmck@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: vmalloc: offload free_vmap_area_lock lock · 72210662
      Uladzislau Rezki (Sony) authored
      Concurrent access to the global vmap space is a bottleneck.  We can
      simulate high contention by running a vmalloc test suite.

      To address it, introduce an effective vmap node logic.  Each node behaves
      as an independent entity.  When a node is accessed, it serves a request
      directly (if possible) from its pool.

      This model has size-based pools for requests, i.e. pools are segregated
      and populated based on object size and real demand.  The maximum object
      size that a pool can handle is set to 256 pages.

      This technique reduces pressure on the global vmap lock.
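
      A hypothetical data layout matching the description above (names and
      sizes are illustrative, not the upstream ones):

          #include <linux/list.h>
          #include <linux/spinlock.h>

          #define MAX_POOL_PAGES  256

          struct va_pool {                        /* one free list per size */
                  struct list_head head;
                  unsigned long len;
          };

          struct vmap_node_like {                 /* hypothetical name */
                  struct va_pool pool[MAX_POOL_PAGES];
                  spinlock_t pool_lock;
          };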
      
      Link: https://lkml.kernel.org/r/20240102184633.748113-8-urezki@gmail.com
      Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
      Cc: Kazuhito Hagio <k-hagio-ab@nec.com>
      Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
      Cc: Lorenzo Stoakes <lstoakes@gmail.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sony.com>
      Cc: Paul E. McKenney <paulmck@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: vmalloc: remove global purge_vmap_area_root rb-tree · 282631cb
      Uladzislau Rezki (Sony) authored
      Similar to the busy VAs, a lazily-freed area is stored in the node it
      belongs to.  Such an approach does not require any global locking
      primitive; instead, access becomes scalable, which mitigates contention.

      This patch removes the global purge lock, the global purge tree and the
      global purge list.
      
      Link: https://lkml.kernel.org/r/20240102184633.748113-7-urezki@gmail.com
      Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
      Reviewed-by: Baoquan He <bhe@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
      Cc: Kazuhito Hagio <k-hagio-ab@nec.com>
      Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
      Cc: Lorenzo Stoakes <lstoakes@gmail.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sony.com>
      Cc: Paul E. McKenney <paulmck@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/vmalloc: remove vmap_area_list · 55c49fee
      Baoquan He authored
      Earlier, vmap_area_list was exported to vmcoreinfo so that makedumpfile
      could get the base address of the vmalloc area.  Now vmap_area_list is
      empty, so export VMALLOC_START to vmcoreinfo instead, and remove
      vmap_area_list.
      
      [urezki@gmail.com: fix a warning in the crash_save_vmcoreinfo_init()]
        Link: https://lkml.kernel.org/r/20240111192329.449189-1-urezki@gmail.com
      Link: https://lkml.kernel.org/r/20240102184633.748113-6-urezki@gmail.com
      Signed-off-by: Baoquan He <bhe@redhat.com>
      Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
      Acked-by: Lorenzo Stoakes <lstoakes@gmail.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
      Cc: Kazuhito Hagio <k-hagio-ab@nec.com>
      Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sony.com>
      Cc: Paul E. McKenney <paulmck@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: vmalloc: remove global vmap_area_root rb-tree · d0936029
      Uladzislau Rezki (Sony) authored
      Store allocated objects in separate nodes.  A va->va_start address is
      converted into the correct node where it should be placed and reside.  An
      addr_to_node() function is used to do the proper address conversion to
      determine the node that contains a VA.

      Such an approach balances VAs across the nodes; as a result, access
      becomes scalable.  The number of nodes in a system depends on the number
      of CPUs.
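
      Illustrative only, as the upstream hashing differs in detail: the idea is
      to convert a VA start address into the index of the node that owns it:

          static unsigned int addr_to_node_id(unsigned long addr,
                                              unsigned int nr_nodes)
          {
                  return (addr >> PAGE_SHIFT) % nr_nodes;
          }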
      
      Please note:
      
      1. As of now, allocated VAs are bound to node-0. It means the
         patch does not make any difference compared with the current
         behavior;

      2. The global vmap_area_lock and vmap_area_root are removed, as there
         is no need for them anymore. The vmap_area_list is still kept and
         is _empty_. It is exported for kexec only;

      3. The vmallocinfo and vread() have to be reworked to be able to
         handle multiple nodes.
      
      [urezki@gmail.com: mark vmap_init_free_space() with __init tag]
        Link: https://lkml.kernel.org/r/20240111132628.299644-1-urezki@gmail.com
      [urezki@gmail.com: fix a wrong value passed to __find_vmap_area()]
        Link: https://lkml.kernel.org/r/20240111121104.180993-1-urezki@gmail.com
      Link: https://lkml.kernel.org/r/20240102184633.748113-5-urezki@gmail.com
      Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
      Reviewed-by: Baoquan He <bhe@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com>
      Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
      Cc: Kazuhito Hagio <k-hagio-ab@nec.com>
      Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sony.com>
      Cc: Paul E. McKenney <paulmck@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: vmalloc: move vmap_init_free_space() down in vmalloc.c · 7fa8cee0
      Uladzislau Rezki (Sony) authored
      vmap_init_free_space() is a function that sets up the vmap space and
      is considered part of the initialization phase.  Since the main entry
      point, vmalloc_init(), has been moved down in vmalloc.c, it makes
      sense to follow the same pattern.
      
      There is no functional change as a result of this patch.
      
      Link: https://lkml.kernel.org/r/20240102184633.748113-4-urezki@gmail.com
      Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
      Reviewed-by: Baoquan He <bhe@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
      Cc: Kazuhito Hagio <k-hagio-ab@nec.com>
      Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sony.com>
      Cc: Paul E. McKenney <paulmck@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      7fa8cee0
    • Uladzislau Rezki (Sony)'s avatar
      mm: vmalloc: rename adjust_va_to_fit_type() function · 5b75b8e1
      Uladzislau Rezki (Sony) authored
      This patch renames the adjust_va_to_fit_type() function to va_clip(),
      which is shorter and more expressive.
      
      There is no functional change as a result of this patch.
      
      Link: https://lkml.kernel.org/r/20240102184633.748113-3-urezki@gmail.com
      Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
      Reviewed-by: Baoquan He <bhe@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
      Cc: Kazuhito Hagio <k-hagio-ab@nec.com>
      Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sony.com>
      Cc: Paul E. McKenney <paulmck@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      5b75b8e1
    • Uladzislau Rezki (Sony)'s avatar
      mm: vmalloc: add va_alloc() helper · 38f6b9af
      Uladzislau Rezki (Sony) authored
      Patch series "Mitigate a vmap lock contention", v3.
      
      1. Motivation
      
      - Offload the global vmap locks so that locking scales with the number
        of CPUs;
      
      - If possible and there is agreement, remove the "Per cpu kva
        allocator" to make the vmap code simpler;
      
      - There were complaints from XFS folks that vmalloc can be contended
        on their workloads.
      
      2. Design (high-level overview)
      
      We introduce a vmap node logic.  A node behaves as an independent
      entity and serves an allocation request directly (if possible) from
      its own pool.  That way it bypasses the global vmap space that is
      protected by its own lock.
      
      Access to the pools is serialized per CPU.  The number of nodes equals
      the number of CPUs in the system, with an upper bound of 128 nodes.
      
      Pools are size-segregated and populated based on system demand.  The
      maximum allocation request that can be stored in a segregated pool is
      256 pages.  The lazy drain path first decays a pool by 25% and then
      repopulates it with freshly freed VAs for reuse, instead of returning
      them to the global space.
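      
      A rough sketch of the decay step, assuming a per-node pool kept on a
      list with a length counter (the struct and names here are
      illustrative, not the exact code from the series):
      
      /* Detach roughly 25% of a pool's VAs so they can be purged lazily. */
      static void decay_pool(struct vmap_pool *vp, struct list_head *to_purge)
      {
      	unsigned long nr_to_drop = vp->len >> 2;	/* 25% of the pool */
      
      	while (nr_to_drop-- && !list_empty(&vp->head)) {
      		struct vmap_area *va;
      
      		va = list_first_entry(&vp->head, struct vmap_area, list);
      		list_move_tail(&va->list, to_purge);
      		vp->len--;
      	}
      }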
      
      When a VA is obtained (alloc path), it is stored in one of the
      separate nodes: its va->va_start address is converted into the node
      where it should be placed and reside.  Doing so balances VAs across
      the nodes, so access becomes scalable.  The addr_to_node() function
      performs the address-to-node conversion.
      
      The vmap space is divided into segments of a fixed size, 16 pages
      each, so any address can be associated with a segment number.  The
      number of segments equals num_possible_cpus() but is not greater than
      128, and numbering starts from 0.  See below how an address is
      converted:
      
      static inline unsigned int
      addr_to_node_id(unsigned long addr)
      {
      	/* zone_size is the fixed segment size (16 pages); nr_nodes is capped at 128. */
      	return (addr / zone_size) % nr_nodes;
      }
      
      On the free path, a VA can easily be found by converting its
      "va_start" address to the node it resides in.  It is then moved from
      the "busy" data structure to the "lazy" one.  Later on, as noted
      earlier, the lazy kworker decays each node pool and populates it with
      fresh incoming VAs.  Please note, a VA is returned to the node that
      made the alloc request.
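      
      A minimal sketch of the free path, under the same illustrative naming
      as the earlier sketch (vn->lazy_head stands for the node's lazy list;
      not the exact code from the series):
      
      /*
       * Move a freed VA from its node's "busy" tree to that node's "lazy"
       * list; the lazy kworker drains and reuses it later.
       */
      static void free_vmap_area_to_node(struct vmap_area *va)
      {
      	struct vmap_node *vn = addr_to_node(va->va_start);
      
      	spin_lock(&vn->lock);
      	unlink_va(va, &vn->busy_root);			/* leave the busy tree */
      	list_add_tail(&va->list, &vn->lazy_head);	/* queue for lazy drain */
      	spin_unlock(&vn->lock);
      }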
      
      3. Test on AMD Ryzen Threadripper 3970X 32-Core Processor
      
      sudo ./test_vmalloc.sh run_test_mask=7 nr_threads=64
      
      <default perf>
       94.41%     0.89%  [kernel]        [k] _raw_spin_lock
       93.35%    93.07%  [kernel]        [k] native_queued_spin_lock_slowpath
       76.13%     0.28%  [kernel]        [k] __vmalloc_node_range
       72.96%     0.81%  [kernel]        [k] alloc_vmap_area
       56.94%     0.00%  [kernel]        [k] __get_vm_area_node
       41.95%     0.00%  [kernel]        [k] vmalloc
       37.15%     0.01%  [test_vmalloc]  [k] full_fit_alloc_test
       35.17%     0.00%  [kernel]        [k] ret_from_fork_asm
       35.17%     0.00%  [kernel]        [k] ret_from_fork
       35.17%     0.00%  [kernel]        [k] kthread
       35.08%     0.00%  [test_vmalloc]  [k] test_func
       34.45%     0.00%  [test_vmalloc]  [k] fix_size_alloc_test
       28.09%     0.01%  [test_vmalloc]  [k] long_busy_list_alloc_test
       23.53%     0.25%  [kernel]        [k] vfree.part.0
       21.72%     0.00%  [kernel]        [k] remove_vm_area
       20.08%     0.21%  [kernel]        [k] find_unlink_vmap_area
        2.34%     0.61%  [kernel]        [k] free_vmap_area_noflush
      <default perf>
         vs
      <patch-series perf>
       82.32%     0.22%  [test_vmalloc]  [k] long_busy_list_alloc_test
       63.36%     0.02%  [kernel]        [k] vmalloc
       63.34%     2.64%  [kernel]        [k] __vmalloc_node_range
       30.42%     4.46%  [kernel]        [k] vfree.part.0
       28.98%     2.51%  [kernel]        [k] __alloc_pages_bulk
       27.28%     0.19%  [kernel]        [k] __get_vm_area_node
       26.13%     1.50%  [kernel]        [k] alloc_vmap_area
       21.72%    21.67%  [kernel]        [k] clear_page_rep
       19.51%     2.43%  [kernel]        [k] _raw_spin_lock
       16.61%    16.51%  [kernel]        [k] native_queued_spin_lock_slowpath
       13.40%     2.07%  [kernel]        [k] free_unref_page
       10.62%     0.01%  [kernel]        [k] remove_vm_area
        9.02%     8.73%  [kernel]        [k] insert_vmap_area
        8.94%     0.00%  [kernel]        [k] ret_from_fork_asm
        8.94%     0.00%  [kernel]        [k] ret_from_fork
        8.94%     0.00%  [kernel]        [k] kthread
        8.29%     0.00%  [test_vmalloc]  [k] test_func
        7.81%     0.05%  [test_vmalloc]  [k] full_fit_alloc_test
        5.30%     4.73%  [kernel]        [k] purge_vmap_node
        4.47%     2.65%  [kernel]        [k] free_vmap_area_noflush
      <patch-series perf>
      
      This confirms that native_queued_spin_lock_slowpath drops from 93.07%
      to 16.51%.
      
      The throughput is ~12x higher (the same test completes in about 51
      seconds instead of about 10 minutes 51 seconds, as shown below):
      
      urezki@pc638:~$ time sudo ./test_vmalloc.sh run_test_mask=7 nr_threads=64
      Run the test with following parameters: run_test_mask=7 nr_threads=64
      Done.
      Check the kernel ring buffer to see the summary.
      
      real    10m51.271s
      user    0m0.013s
      sys     0m0.187s
      urezki@pc638:~$
      
      urezki@pc638:~$ time sudo ./test_vmalloc.sh run_test_mask=7 nr_threads=64
      Run the test with following parameters: run_test_mask=7 nr_threads=64
      Done.
      Check the kernel ring buffer to see the summary.
      
      real    0m51.301s
      user    0m0.015s
      sys     0m0.040s
      urezki@pc638:~$
      
      
      This patch (of 11):
      
      Currently the __alloc_vmap_area() function contains open-coded logic
      that finds and adjusts a VA based on the allocation request.
      
      Introduce a va_alloc() helper that only adjusts the found VA.  There
      is no functional change as a result of this patch.
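      
      A rough sketch of the resulting split, with a simplified signature
      (illustrative only, not the exact code from the patch):
      
      /*
       * Clip the already-found free VA to the requested range; the search
       * itself stays in __alloc_vmap_area().
       */
      static unsigned long
      va_alloc(struct vmap_area *va, struct rb_root *root, struct list_head *head,
      	 unsigned long size, unsigned long align,
      	 unsigned long vstart, unsigned long vend)
      {
      	unsigned long nva_start_addr;
      
      	if (va->va_start > vstart)
      		nva_start_addr = ALIGN(va->va_start, align);
      	else
      		nva_start_addr = ALIGN(vstart, align);
      
      	/* Check the "vend" restriction. */
      	if (nva_start_addr + size > vend)
      		return vend;
      
      	/* Adjust the free VA (the helper renamed to va_clip() above). */
      	if (adjust_va_to_fit_type(root, head, va, nva_start_addr, size))
      		return vend;
      
      	return nva_start_addr;
      }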
      
      Link: https://lkml.kernel.org/r/20240102184633.748113-1-urezki@gmail.com
      Link: https://lkml.kernel.org/r/20240102184633.748113-2-urezki@gmail.com
      Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
      Reviewed-by: Baoquan He <bhe@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
      Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sony.com>
      Cc: Paul E. McKenney <paulmck@kernel.org>
      Cc: Kazuhito Hagio <k-hagio-ab@nec.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      38f6b9af
    • Oscar Salvador's avatar
      mm,page_owner: update Documentation regarding page_owner_stacks · ba6fe537
      Oscar Salvador authored
      Update the page_owner documentation to cover the new page_owner_stacks
      feature and show how it can be used.
      
      Link: https://lkml.kernel.org/r/20240215215907.20121-8-osalvador@suse.de
      Signed-off-by: Oscar Salvador <osalvador@suse.de>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: Marco Elver <elver@google.com>
      Acked-by: Andrey Konovalov <andreyknvl@gmail.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      ba6fe537
    • Oscar Salvador's avatar
      mm,page_owner: filter out stacks by a threshold · 05bb6f4e
      Oscar Salvador authored
      We want to be able to filter out stacks based on a threshold that we
      can tune.  The threshold value can be adjusted by writing to the
      'count_threshold' file.
      
      Link: https://lkml.kernel.org/r/20240215215907.20121-7-osalvador@suse.de
      Signed-off-by: Oscar Salvador <osalvador@suse.de>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Andrey Konovalov <andreyknvl@gmail.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Marco Elver <elver@google.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      05bb6f4e