1. 24 Feb, 2024 40 commits
    • Docs/admin-guide/mm/damon/usage: document quota goal metric file · 57e88e86
      SeongJae Park authored
      Update DAMON usage document for the quota goal target_metric file.
      
      [sj@kernel.org: fix a typo on the auto-tuning design reference link]
        Link: https://lkml.kernel.org/r/20240221170852.55529-3-sj@kernel.org
      Link: https://lkml.kernel.org/r/20240219194431.159606-18-sj@kernel.org
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • Docs/ABI/damon: document quota goal metric file · adc3908b
      SeongJae Park authored
      Update DAMON ABI document for the quota goal target_metric file.
      
      Link: https://lkml.kernel.org/r/20240219194431.159606-17-sj@kernel.org
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • Docs/mm/damon/design: document quota goal self-tuning · 3c17174f
      SeongJae Park authored
      Update DAMON design doc to explain the quota goal self-tuning, which can
      be used by setting the goal's metric to one that the kernel can
      self-retrieve.
      
      Link: https://lkml.kernel.org/r/20240219194431.159606-16-sj@kernel.org
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/damon/sysfs-schemes: support PSI-based quota auto-tune · 4daacfe8
      SeongJae Park authored
      Extend DAMON sysfs interface to support the PSI-based quota auto-tuning by
      adding a new file, 'target_metric' under the quota goal directory.  Old
      users don't get any behavioral changes since the default value of the
      metric is 'user input'.
      
      Link: https://lkml.kernel.org/r/20240219194431.159606-15-sj@kernel.org
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/damon/core: implement PSI metric DAMOS quota goal · 2dbb60f7
      SeongJae Park authored
      Extend DAMOS quota goal metric with system-wide memory pressure stall
      time.  Specifically, the system-level 'some' PSI for memory is used.  The
      target value can be set in microseconds.  DAMOS measures the increase
      of the PSI metric over the last quota_reset_interval and uses the ratio
      of it to the user-specified target PSI value as the score for the
      auto-tuning feedback loop.
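
The score computation described above can be sketched in plain user-space C. This is only an illustration, not the kernel code: the names `psi_goal` and `psi_goal_score` and the fixed-point scale of 10000 are assumptions, since the message only says that the score is the ratio of the PSI increase over the last quota_reset_interval to the target value.

```c
#include <stdint.h>

/* Hypothetical scale: a score of SCORE_SCALE means the target is exactly met. */
#define SCORE_SCALE 10000ULL

/* Hypothetical per-goal state; not the kernel's data structure. */
struct psi_goal {
	uint64_t target_us;     /* user-specified target PSI increase, in microseconds */
	uint64_t last_total_us; /* cumulative 'some' memory PSI seen at the previous call */
};

/*
 * Take the cumulative PSI total observed now, compute how much it grew
 * since the previous call, and return that growth relative to the target.
 * The caller is assumed to invoke this once per quota reset interval.
 */
static uint64_t psi_goal_score(struct psi_goal *goal, uint64_t now_total_us)
{
	uint64_t delta_us = now_total_us - goal->last_total_us;

	goal->last_total_us = now_total_us;
	if (goal->target_us == 0)
		return 0;	/* no target set; avoid dividing by zero */
	return delta_us * SCORE_SCALE / goal->target_us;
}
```

Under this assumed scaling, a result below 10000 means the memory pressure grew less than the target during the interval, and a result above it means the target was exceeded.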
      
      Link: https://lkml.kernel.org/r/20240219194431.159606-14-sj@kernel.org
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/damon/core: support multiple metrics for quota goal · bcce9bc1
      SeongJae Park authored
      DAMOS quota auto-tuning asks users to assess the current tuned quota and
      provide the feedback manually and repeatedly.  It allows users to
      generate the feedback from a source that the kernel cannot access, and
      writing a script or a function for the manual and repeated feeding
      is not a big deal.  However, additional work is additional work, and it
      could be more efficient if DAMOS could do the fetch itself, especially in
      the DAMON sysfs interface use case, since it can avoid the context
      switches between user space and kernel space, though the overhead
      would be only trivial in most cases.  Also, in many cases the feedback
      could come from kernel-accessible sources, such as PSI, CPU usage, etc.
      Make the quota goal support multiple types of metrics, including such ones.
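
One way to picture a goal that supports more than one metric type is a small tagged structure. The sketch below is user-space C with hypothetical names (`goal_metric`, `quota_goal`, `quota_goal_refresh`); the `source` pointer merely stands in for whatever kernel-accessible counter a real implementation would read.

```c
#include <stdint.h>

/* Hypothetical metric types, loosely following the description above. */
enum goal_metric {
	GOAL_USER_INPUT,	/* the user writes current_value directly */
	GOAL_SELF_RETRIEVED,	/* the value is fetched without user help */
};

/* Hypothetical goal layout; not the kernel's data structure. */
struct quota_goal {
	enum goal_metric metric;
	uint64_t target_value;
	uint64_t current_value;
	const uint64_t *source;	/* read only for GOAL_SELF_RETRIEVED */
};

/* Refresh current_value according to the goal's metric type. */
static void quota_goal_refresh(struct quota_goal *goal)
{
	switch (goal->metric) {
	case GOAL_USER_INPUT:
		/* Nothing to do: the user already stored current_value. */
		break;
	case GOAL_SELF_RETRIEVED:
		if (goal->source)
			goal->current_value = *goal->source;
		break;
	}
}
```

With this shape, existing users who feed the value themselves keep working unchanged, while self-retrieved metrics only need a new enum value and a fetch path.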
      
      Link: https://lkml.kernel.org/r/20240219194431.159606-13-sj@kernel.org
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/damon/core: let goal specified with only target and current values · 06ba5b30
      SeongJae Park authored
      The DAMOS quota auto-tuning feature lets users set the goal by providing a
      function for getting the current score of the tuned quota.  It allows
      flexible goal setup, but only a simple user-set quota is currently being
      used.  As a result, the only user of the DAMOS quota auto-tuning is using
      a silly void pointer casting based score value passing function.  Simplify
      the interface and the user code by letting the user directly set the target
      and the current value.
      
      Link: https://lkml.kernel.org/r/20240219194431.159606-12-sj@kernel.org
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/damon/core: remove ->goal field of damos_quota · 89d347a5
      SeongJae Park authored
      The DAMOS quota auto-tuning feature supports a static single goal and
      dynamic multiple goals via the DAMON kernel API, specifically via the
      ->goal and ->goals fields of the damos_quota struct, respectively.  All
      in-tree DAMOS kernel API users are using only the dynamic multiple goals
      now.  Remove the unused static single goal interface.
      
      Link: https://lkml.kernel.org/r/20240219194431.159606-11-sj@kernel.org
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/damon/sysfs: use only quota->goals · 9e736fdf
      SeongJae Park authored
      DAMON sysfs interface implements multiple quota auto-tuning goals on its
      level since the DAMOS core logic supported only a single goal.  Now the
      core logic supports multiple goals on its level.  Update DAMON sysfs
      interface to reuse the core logic and drop the unnecessary duplicated
      multiple goals implementation.
      
      Link: https://lkml.kernel.org/r/20240219194431.159606-10-sj@kernel.org
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/damon/core: add multiple goals per damos_quota and helpers for those · 91f21216
      SeongJae Park authored
      The feedback-driven DAMOS quota auto-tuning feature allows only a single
      goal for DAMON kernel API users.  The API users could implement
      multiple goals for the end-users on their level, and that's what DAMON
      sysfs interface is doing.  More DAMON kernel API users such as
      DAMON_RECLAIM would need to do similar work.  To avoid unnecessary future
      duplicated efforts, support multiple goals from the DAMOS core layer.  To
      keep the change minimal and non-destructive, keep the old single goal
      setup interface, and add multiple goals setup.  The single goal will be
      treated as one of the multiple goals, so old API users are not required to
      make any change.
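
A minimal sketch of how several goals might be folded into one feedback score, in user-space C. The message above does not say how the goals are combined; taking the highest per-goal score (so that the most-exceeded goal drives the tuning) is an assumption of this sketch, as are the names `tuning_goal` and `goals_score` and the 10000 scale.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scale: a score of SCORE_SCALE means a goal is exactly met. */
#define SCORE_SCALE 10000ULL

/* Hypothetical goal: a target and the most recently observed value. */
struct tuning_goal {
	uint64_t target_value;
	uint64_t current_value;
};

/*
 * Combine all goals into one score by taking the highest per-goal
 * ratio of current to target.  Goals without a target are ignored.
 */
static uint64_t goals_score(const struct tuning_goal *goals, size_t nr_goals)
{
	uint64_t highest = 0;
	size_t i;

	for (i = 0; i < nr_goals; i++) {
		uint64_t score;

		if (goals[i].target_value == 0)
			continue;
		score = goals[i].current_value * SCORE_SCALE /
			goals[i].target_value;
		if (score > highest)
			highest = score;
	}
	return highest;
}
```

In this sketch a single goal is just the `nr_goals == 1` case, which mirrors the idea that the old single-goal setup can be treated as one of the multiple goals.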
      
      Link: https://lkml.kernel.org/r/20240219194431.159606-9-sj@kernel.org
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/damon/core: split out quota goal related fields to a struct · 106e26fc
      SeongJae Park authored
      'struct damos_quota' is not small now.  Split out fields for quota goal to
      a separate struct for easier reading.
      
      Link: https://lkml.kernel.org/r/20240219194431.159606-8-sj@kernel.org
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/damon: move comments and fields for damos-quota-prioritization to the end · 4d791a0a
      SeongJae Park authored
      The comments and definition of 'struct damos_quota' list a few fields for
      effective quota generation first, then fields for regions prioritization
      under the quota, and then the remaining fields for effective quota
      generation.  Readers have to switch context unnecessarily in the middle.
      List all the fields for the effective quota first, and then the fields for
      the prioritization, to make it easier to read.
      
      Link: https://lkml.kernel.org/r/20240219194431.159606-7-sj@kernel.org
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • Docs/admin-guide/mm/damon/usage: document effective_bytes file · a6068d6d
      SeongJae Park authored
      Update DAMON usage document for the effective quota file of the DAMON
      sysfs interface.
      
      Link: https://lkml.kernel.org/r/20240219194431.159606-6-sj@kernel.org
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • Docs/ABI/damon: document effective_bytes sysfs file · 68c4905b
      SeongJae Park authored
      Update the DAMON ABI doc for the effective_bytes sysfs file and the
      kdamond state file input command for updating the content of the file.
      
      Link: https://lkml.kernel.org/r/20240219194431.159606-5-sj@kernel.org
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/damon/sysfs: implement a kdamond command for updating schemes' effective quotas · c71f8a71
      SeongJae Park authored
      Implement yet another kdamond 'state' file input command, namely
      'update_schemes_effective_quotas'.  If it is written, the
      'effective_bytes' files of the kdamond will be updated to provide the
      current effective size quota of each scheme in bytes.
      
      Link: https://lkml.kernel.org/r/20240219194431.159606-4-sj@kernel.org
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/damon/sysfs-schemes: implement quota effective_bytes file · 68131315
      SeongJae Park authored
      DAMON sysfs interface allows users to set two types of quotas, namely time
      quota and size quota.  DAMOS converts the time quota to a size quota and
      uses the smaller one of the resulting two size quotas.  The resulting
      effective size quota can be helpful for debugging and analysis, but is not
      exposed to the user.  The recently added feedback-driven quota auto-tuning
      is making it even more mysterious.
      
      Implement a DAMON sysfs interface read-only empty file, namely
      'effective_bytes', under the quota goal DAMON sysfs directory.  It will be
      extended to expose the effective quota to the end user.
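
The "convert the time quota, then take the smaller one" step can be sketched as follows. This is user-space C under stated assumptions: the function name `effective_size_quota`, the convention that 0 means "quota not set", and the `bytes_per_ms` throughput estimate are all hypothetical and only illustrate the idea.

```c
#include <stdint.h>

/*
 * Derive one effective size quota (in bytes) from a time quota and a
 * size quota.  A value of 0 for either quota means "not set" in this
 * sketch.  bytes_per_ms is an assumed estimate of how many bytes the
 * scheme's action can process per millisecond.
 */
static uint64_t effective_size_quota(uint64_t time_quota_ms,
				     uint64_t size_quota_bytes,
				     uint64_t bytes_per_ms)
{
	uint64_t from_time;

	if (time_quota_ms == 0)
		return size_quota_bytes;	/* only the size quota applies */

	/* Convert the time quota into an equivalent size quota. */
	from_time = time_quota_ms * bytes_per_ms;

	/* Use the smaller of the two size quotas that are set. */
	if (size_quota_bytes == 0 || from_time < size_quota_bytes)
		return from_time;
	return size_quota_bytes;
}
```

For example, with a 10 ms time quota, an estimated 4096 bytes per millisecond, and a 1 MiB size quota, the time-derived quota (40960 bytes) is the smaller one and becomes the effective quota.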
      
      Link: https://lkml.kernel.org/r/20240219194431.159606-3-sj@kernel.org
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/damon/core: set damos_quota->esz as public field and document · 78f2f603
      SeongJae Park authored
      Patch series "mm/damon: let DAMOS feeds and tame/auto-tune itself".
      
      The Aim-oriented Feedback-driven DAMOS Aggressiveness Auto-tuning
      patchset[1] which has merged since commit 9294a037 ("mm/damon/core:
      implement goal-oriented feedback-driven quota auto-tuning") made the
      mechanism and the policy separated.  That is, users can set a part of
      DAMOS control policies without a deep understanding of the mechanism but
      just their demands such as SLA.
      
      However, users are still required to do some additional work of manually
      collecting their target metric and feeding it to DAMOS.  In the case of
      end-users who use DAMON sysfs interface, the context switches between
      user-space and kernel-space could also make it inefficient.  The overhead
      is supposed to be only trivial in common cases, though.  Meanwhile, in
      simple use cases, the target metric could be common system metrics that
      the kernel can efficiently self-retrieve, such as memory pressure stall
      time (PSI).
      
      Extend DAMOS quota auto-tuning to support multiple types of metrics
      including the DAMOS self-retrievable ones, and add support for memory
      pressure stall time metric.  Different types of metrics can be supported
      in future.  The auto-tuning capability is currently supported for only
      users of DAMOS kernel API and DAMON sysfs interface.  Extend the support
      to DAMON_RECLAIM.
      
      Patches Sequence
      ================
      
      The first five patches help with debugging and fine-tuning the existing
      quota control features.  The first one (patch 1) exposes the effective
      quota that is made with given user inputs to DAMOS kernel API users and
      kernel-doc documents.  The following four patches implement (patches 2
      and 3) and document (patches 4 and 5) a new DAMON sysfs file that exposes
      the value.

      The following six patches clean up and simplify the existing DAMOS quota
      auto-tuning code by improving the layout of comments and data structures
      (patches 6 and 7), supporting common use cases, namely multiple goals
      (patches 8, 9 and 10), and simplifying the interface (patch 11).
      
      Then six patches for the main purpose of this patchset follow.  The first
      three changes extend the core logic for various target metrics (patch 12),
      implement memory pressure stall time-based target metric support (patch
      13), and update DAMON sysfs interface to support the new target metric
      (patch 14).  Then, documentation updates for the features on design (patch
      15), ABI (patch 16), and usage (patch 17) follow.
      
      Last three patches add auto-tuning support on DAMON_RECLAIM.  The patches
      implement DAMON_RECLAIM parameters for user-feedback driven quota
      auto-tuning (patch 18), memory pressure stall time-driven quota
      self-tuning (patch 19), and finally update the DAMON_RECLAIM usage
      document for the new parameters (patch 20).
      
      [1] https://lore.kernel.org/all/20231130023652.50284-1-sj@kernel.org/
      
      
      This patch (of 20):
      
      DAMOS allows users to specify the quota as they want in multiple ways
      including time quota, size quota, and feedback-based auto-tuning.  DAMOS
      makes one effective quota out of the inputs and uses it in the end.
      Knowing the current effective quota helps in understanding DAMOS' internal
      mechanism and fine-tuning quotas.  DAMON kernel API users can get the
      information from the ->esz field of the damos_quota struct, but the field
      is marked as private and is not kernel-doc documented.  Make it public
      and document it.
      
      Link: https://lkml.kernel.org/r/20240219194431.159606-1-sj@kernel.org
      Link: https://lkml.kernel.org/r/20240219194431.159606-2-sj@kernel.org
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/khugepaged: bypassing unnecessary scans with MMF_DISABLE_THP check · 879c6000
      Lance Yang authored
      khugepaged scans the entire address space in the background for each
      given mm, looking for opportunities to merge sequences of basic pages
      into huge pages.  However, when an mm is inserted to the mm_slots list,
      and the MMF_DISABLE_THP flag is set later, this scanning process
      becomes unnecessary for that mm and can be skipped to avoid redundant
      operations, especially in scenarios with a large address space.
      
      On an Intel Core i5 CPU, the time taken by khugepaged to scan the
      address space of the process, which has been set with the
      MMF_DISABLE_THP flag after being added to the mm_slots list, is as
      follows (shorter is better):
      
      VMA Count |   Old   |   New   |  Change
      ---------------------------------------
          50    |   23us  |    9us  |  -60.9%
         100    |   32us  |    9us  |  -71.9%
         200    |   44us  |    9us  |  -79.5%
         400    |   75us  |    9us  |  -88.0%
         800    |   98us  |    9us  |  -90.8%
      
      Once the count of VMAs for the process exceeds pages_to_scan, khugepaged
      needs to wait for scan_sleep_millisecs ms before scanning the next
      process.  IMO, unnecessary scans could actually be skipped with a very
      inexpensive mm->flags check in this case.
      
      This commit introduces a check before each scanning process to test the
      MMF_DISABLE_THP flag for the given mm; if the flag is set, the scanning
      process is bypassed, thereby improving the efficiency of khugepaged.
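
The early bail-out can be pictured with a small user-space model. Everything here is hypothetical: `sketch_mm`, `SKETCH_MMF_DISABLE_THP`, and `scan_mm` are stand-ins chosen for this illustration, and the loop body only counts VMAs where the real scanner would run its per-VMA checks.

```c
#include <stddef.h>

/* Hypothetical flag bit; the kernel's MMF_DISABLE_THP value is not reproduced here. */
#define SKETCH_MMF_DISABLE_THP (1UL << 0)

/* Hypothetical stand-in for an mm: its flags and how many VMAs it has. */
struct sketch_mm {
	unsigned long flags;
	size_t nr_vmas;
};

/*
 * Scan one mm and return the number of VMAs visited.  When THP is
 * disabled for the mm, return immediately so that no VMA is visited.
 */
static size_t scan_mm(const struct sketch_mm *mm)
{
	size_t visited = 0;
	size_t i;

	if (mm->flags & SKETCH_MMF_DISABLE_THP)
		return 0;	/* one cheap flag test replaces the whole walk */

	for (i = 0; i < mm->nr_vmas; i++)
		visited++;	/* stand-in for the per-VMA eligibility check */
	return visited;
}
```

The cost of the skipped path no longer depends on the VMA count, which is consistent with the flat "New" column in the table above.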
      
      This optimization is not a correctness issue but rather an enhancement
      to save expensive checks on each VMA when userspace cannot prctl itself
      before spawning into the new process.
      
      On some servers within our company, we deploy a daemon responsible for
      monitoring and updating local applications.  Some applications prefer
      not to use THP, so the daemon calls prctl to disable THP before
      fork/exec.  Conversely, for other applications, the daemon calls prctl
      to enable THP before fork/exec.
      
      Ideally, the daemon should invoke prctl after the fork, but its current
      implementation follows the described approach.  In the Go standard
      library, there is no direct encapsulation of the fork system call;
      instead, fork and execve are combined into one through
      syscall.ForkExec.
      
      Link: https://lkml.kernel.org/r/20240129054551.57728-1-ioworker0@gmail.com
      Signed-off-by: Lance Yang <ioworker0@gmail.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Zach O'Keefe <zokeefe@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • MAINTAINERS: update mm and memcg entries · b659a7c2
      Mike Rapoport (IBM) authored
      Add F: lines for memory management and memory cgroup include files.
      
      Link: https://lkml.kernel.org/r/20240208055727.142387-1-rppt@kernel.org
      Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • arch, crash: move arch_crash_save_vmcoreinfo() out to file vmcore_info.c · 199da871
      Baoquan He authored
      Nathan reported the build error below:
      
      =====
      $ curl -LSso .config https://git.alpinelinux.org/aports/plain/community/linux-edge/config-edge.armv7
      $ make -skj"$(nproc)" ARCH=arm CROSS_COMPILE=arm-linux-gnueabi- olddefconfig all
      ..
      arm-linux-gnueabi-ld: arch/arm/kernel/machine_kexec.o: in function `arch_crash_save_vmcoreinfo':
      machine_kexec.c:(.text+0x488): undefined reference to `vmcoreinfo_append_str'
      ====
      
      On architectures like arm, s390, ppc and sh, the function
      arch_crash_save_vmcoreinfo() is located in machine_kexec.c and it can
      only be compiled in when CONFIG_KEXEC_CORE=y.

      That's not right because arch_crash_save_vmcoreinfo() is used to export
      arch-specific vmcoreinfo.  CONFIG_VMCORE_INFO is supposed to control
      whether it is compiled in.  However, CONFIG_VMCORE_INFO could be
      independent of CONFIG_KEXEC_CORE, e.g. CONFIG_PROC_KCORE=y will select
      CONFIG_VMCORE_INFO.  Or, if CONFIG_KEXEC/CONFIG_KEXEC_FILE is set while
      CONFIG_CRASH_DUMP is not set, a linking error is reported.

      So, on arm, s390, ppc and sh, move arch_crash_save_vmcoreinfo() out to
      a new file vmcore_info.c.  Let CONFIG_VMCORE_INFO decide whether
      arch_crash_save_vmcoreinfo() is compiled in.
      
      [akpm@linux-foundation.org: remove stray newlines at eof]
      Link: https://lkml.kernel.org/r/20240129135033.157195-3-bhe@redhat.com
      Signed-off-by: Baoquan He <bhe@redhat.com>
      Reported-by: Nathan Chancellor <nathan@kernel.org>
      Closes: https://lore.kernel.org/all/20240126045551.GA126645@dev-arch.thelio-3990X/T/#u
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Hari Bathini <hbathini@linux.ibm.com>
      Cc: Klara Modin <klarasmodin@gmail.com>
      Cc: Michael Kelley <mhklinux@outlook.com>
      Cc: Pingfan Liu <piliu@redhat.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Yang Li <yang.lee@linux.alibaba.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • loongarch, crash: wrap crash dumping code into crash related ifdefs · ea034d0b
      Baoquan He authored
      Now that the crash code under the kernel/ folder has been split out from
      the kexec code, crash dumping can be separated from kexec reboot in the
      config items on loongarch with some adjustments.

      Here, use an IS_ENABLED(CONFIG_CRASH_RESERVE) check to decide whether to
      compile in the crashkernel reservation code.
      
      Link: https://lkml.kernel.org/r/20240124051254.67105-15-bhe@redhat.com
      Signed-off-by: Baoquan He <bhe@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Hari Bathini <hbathini@linux.ibm.com>
      Cc: Pingfan Liu <piliu@redhat.com>
      Cc: Klara Modin <klarasmodin@gmail.com>
      Cc: Michael Kelley <mhklinux@outlook.com>
      Cc: Nathan Chancellor <nathan@kernel.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Yang Li <yang.lee@linux.alibaba.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • arm, crash: wrap crash dumping code into crash related ifdefs · 5057dff3
      Baoquan He authored
      Now that the crash code under the kernel/ folder has been split out from
      the kexec code, crash dumping can be separated from kexec reboot in the
      config items on arm with some adjustments.

      Here, use the CONFIG_CRASH_RESERVE ifdef to replace the CONFIG_KEXEC ifdef.
      
      Link: https://lkml.kernel.org/r/20240124051254.67105-14-bhe@redhat.com
      Signed-off-by: Baoquan He <bhe@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Hari Bathini <hbathini@linux.ibm.com>
      Cc: Pingfan Liu <piliu@redhat.com>
      Cc: Klara Modin <klarasmodin@gmail.com>
      Cc: Michael Kelley <mhklinux@outlook.com>
      Cc: Nathan Chancellor <nathan@kernel.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Yang Li <yang.lee@linux.alibaba.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • riscv, crash: wrap crash dumping code into crash related ifdefs · 0978a63f
      Baoquan He authored
      Now that the crash code under the kernel/ folder has been split out from
      the kexec code, crash dumping can be separated from kexec reboot in the
      config items on risc-v with some adjustments.

      Here, wrap up the crash dumping code with CONFIG_CRASH_DUMP ifdeffery, and
      use an IS_ENABLED(CONFIG_CRASH_RESERVE) check to decide whether to compile
      in the crashkernel reservation code.
      
      Link: https://lkml.kernel.org/r/20240124051254.67105-13-bhe@redhat.com
      Signed-off-by: Baoquan He <bhe@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Hari Bathini <hbathini@linux.ibm.com>
      Cc: Pingfan Liu <piliu@redhat.com>
      Cc: Klara Modin <klarasmodin@gmail.com>
      Cc: Michael Kelley <mhklinux@outlook.com>
      Cc: Nathan Chancellor <nathan@kernel.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Yang Li <yang.lee@linux.alibaba.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mips, crash: wrap crash dumping code into crash related ifdefs · d739f190
      Baoquan He authored
      Now that the crash code under the kernel/ folder has been split out from
      the kexec code, crash dumping can be separated from kexec reboot in the
      config items on mips with some adjustments.

      Here, use an IS_ENABLED(CONFIG_CRASH_RESERVE) check to decide whether to
      compile in the crashkernel reservation code.
      
      Link: https://lkml.kernel.org/r/20240124051254.67105-12-bhe@redhat.com
      Signed-off-by: Baoquan He <bhe@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Hari Bathini <hbathini@linux.ibm.com>
      Cc: Pingfan Liu <piliu@redhat.com>
      Cc: Klara Modin <klarasmodin@gmail.com>
      Cc: Michael Kelley <mhklinux@outlook.com>
      Cc: Nathan Chancellor <nathan@kernel.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Yang Li <yang.lee@linux.alibaba.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • sh, crash: wrap crash dumping code into crash related ifdefs · e3892635
      Baoquan He authored
      Now that the crash code under the kernel/ folder has been split out from
      the kexec code, crash dumping can be separated from kexec reboot in the
      config items on SuperH with some adjustments.

      Wrap up the crash dumping code with CONFIG_CRASH_DUMP ifdeffery, and use
      an IS_ENABLED(CONFIG_CRASH_RESERVE) check to decide whether to compile in
      the crashkernel reservation code.
      
      Link: https://lkml.kernel.org/r/20240124051254.67105-11-bhe@redhat.com
      Signed-off-by: Baoquan He <bhe@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Hari Bathini <hbathini@linux.ibm.com>
      Cc: Pingfan Liu <piliu@redhat.com>
      Cc: Klara Modin <klarasmodin@gmail.com>
      Cc: Michael Kelley <mhklinux@outlook.com>
      Cc: Nathan Chancellor <nathan@kernel.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Yang Li <yang.lee@linux.alibaba.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • s390, crash: wrap crash dumping code into crash related ifdefs · 865e2acd
      Baoquan He authored
      Now that the crash code under the kernel/ folder has been split out from
      the kexec code, crash dumping can be separated from kexec reboot in the
      config items on s390 with some adjustments.

      Here, wrap up the crash dumping code with CONFIG_CRASH_DUMP ifdeffery.
      
      Link: https://lkml.kernel.org/r/20240124051254.67105-10-bhe@redhat.com
      Signed-off-by: Baoquan He <bhe@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Hari Bathini <hbathini@linux.ibm.com>
      Cc: Pingfan Liu <piliu@redhat.com>
      Cc: Klara Modin <klarasmodin@gmail.com>
      Cc: Michael Kelley <mhklinux@outlook.com>
      Cc: Nathan Chancellor <nathan@kernel.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Yang Li <yang.lee@linux.alibaba.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • ppc, crash: enforce KEXEC and KEXEC_FILE to select CRASH_DUMP · 086d67ef
      Baoquan He authored
      In PowerPC, the crash dumping and kexec reboot share code in
      arch_kexec_locate_mem_hole(), in which struct crash_mem is used.
      
      Here, enforce KEXEC and KEXEC_FILE to select CRASH_DUMP for now.
      
      [bhe@redhat.com: fix allnoconfig on ppc]
        Link: https://lkml.kernel.org/r/ZbJwMyCpz4HDySoo@MiWiFi-R3L-srv
      Link: https://lkml.kernel.org/r/20240124051254.67105-9-bhe@redhat.com
      Signed-off-by: Baoquan He <bhe@redhat.com>
      Acked-by: Hari Bathini <hbathini@linux.ibm.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Pingfan Liu <piliu@redhat.com>
      Cc: Klara Modin <klarasmodin@gmail.com>
      Cc: Michael Kelley <mhklinux@outlook.com>
      Cc: Nathan Chancellor <nathan@kernel.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Yang Li <yang.lee@linux.alibaba.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • arm64, crash: wrap crash dumping code into crash related ifdefs · 40254101
      Baoquan He authored
      Now that the crash code under the kernel/ folder has been split out from
      the kexec code, crash dumping can be separated from kexec reboot in the
      config items on arm64 with some adjustments.

      Here, wrap up the crash dumping code with CONFIG_CRASH_DUMP ifdeffery.
      
      [bhe@redhat.com: fix building error in generic codes]
        Link: https://lkml.kernel.org/r/20240129135033.157195-2-bhe@redhat.com
      Link: https://lkml.kernel.org/r/20240124051254.67105-8-bhe@redhat.com
      Signed-off-by: Baoquan He <bhe@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Hari Bathini <hbathini@linux.ibm.com>
      Cc: Pingfan Liu <piliu@redhat.com>
      Cc: Klara Modin <klarasmodin@gmail.com>
      Cc: Michael Kelley <mhklinux@outlook.com>
      Cc: Nathan Chancellor <nathan@kernel.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Yang Li <yang.lee@linux.alibaba.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • x86, crash: wrap crash dumping code into crash related ifdefs · a4eeb217
      Baoquan He authored
      Now that the crash code under the kernel/ folder has been split out from
      the kexec code, crash dumping can be separated from kexec reboot in the
      config items on x86 with some adjustments.

      Here, also change some ifdefs or IS_ENABLED() checks to more appropriate
      ones, e.g.
       - #ifdef CONFIG_KEXEC_CORE -> #ifdef CONFIG_CRASH_DUMP
       - (!IS_ENABLED(CONFIG_KEXEC_CORE)) -> (!IS_ENABLED(CONFIG_CRASH_RESERVE))
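
The difference between the two kinds of check can be tried out in user space. The sketch below defines `SK_IS_ENABLED()`, a simplified stand-in written for this illustration (it ignores module "=m" options, unlike the kernel's IS_ENABLED()), and pretends that CONFIG_CRASH_DUMP is enabled while CONFIG_CRASH_RESERVE is not. The helper functions are hypothetical.

```c
/*
 * Simplified stand-in for the kernel's IS_ENABLED(): expands to 1 when
 * the option macro is defined as 1, and to 0 when it is undefined.
 */
#define SK_ARG_PLACEHOLDER_1 0,
#define SK_TAKE_SECOND(ignored, val, ...) val
#define SK_IS_DEFINED_3(arg1_or_junk) SK_TAKE_SECOND(arg1_or_junk 1, 0, 0)
#define SK_IS_DEFINED_2(val) SK_IS_DEFINED_3(SK_ARG_PLACEHOLDER_##val)
#define SK_IS_ENABLED(option) SK_IS_DEFINED_2(option)

/* Assumed configuration for this sketch only. */
#define CONFIG_CRASH_DUMP 1
/* CONFIG_CRASH_RESERVE is deliberately left undefined. */

/* Preprocessor style: the disabled branch is never seen by the compiler. */
static int crash_dump_built(void)
{
#ifdef CONFIG_CRASH_DUMP
	return 1;
#else
	return 0;
#endif
}

/* IS_ENABLED style: both branches are compiled, the dead one is optimized out. */
static int crash_reserve_built(void)
{
	if (!SK_IS_ENABLED(CONFIG_CRASH_RESERVE))
		return 0;
	return 1;
}
```

With the assumed configuration, `crash_dump_built()` returns 1 and `crash_reserve_built()` returns 0. The second style keeps the guarded code visible to the compiler, so it is still type-checked even when the option is off.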
      
      [bhe@redhat.com: don't nest CONFIG_CRASH_DUMP ifdef inside CONFIG_KEXEC_CORE ifdef scope]
        Link: https://lore.kernel.org/all/SN6PR02MB4157931105FA68D72E3D3DB8D47B2@SN6PR02MB4157.namprd02.prod.outlook.com/T/#u
      Link: https://lkml.kernel.org/r/20240124051254.67105-7-bhe@redhat.com
      Signed-off-by: Baoquan He <bhe@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Hari Bathini <hbathini@linux.ibm.com>
      Cc: Pingfan Liu <piliu@redhat.com>
      Cc: Klara Modin <klarasmodin@gmail.com>
      Cc: Michael Kelley <mhklinux@outlook.com>
      Cc: Nathan Chancellor <nathan@kernel.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Yang Li <yang.lee@linux.alibaba.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • crash: clean up kdump related config items · 75bc255a
      Baoquan He authored
      By splitting CRASH_RESERVE and VMCORE_INFO out from CRASH_CORE, cleaning
      up the dependency of FA_DUMP on CRASH_DUMP, and moving crash code from
      kexec_core.c to crash_core.c, we can now rearrange CRASH_DUMP to
      depend on KEXEC_CORE, and make CRASH_DUMP select CRASH_RESERVE and
      VMCORE_INFO.
      
      KEXEC_CORE no longer selects CRASH_RESERVE and VMCORE_INFO, because
      KEXEC_CORE enables code which allocates control pages, copies
      kexec/kdump segments, and prepares for switching. This code is shared
      by both kexec reboot and crash dumping.
      
      Doing this makes the code and the corresponding config items more
      logical (the right item depends on or is selected by the left item).
      
      PROC_KCORE -----------> VMCORE_INFO
      
                 |----------> VMCORE_INFO
      FA_DUMP----|
                 |----------> CRASH_RESERVE
      
                                                      ---->VMCORE_INFO
                                                     /
                                                     |---->CRASH_RESERVE
      KEXEC      --|                                /|
                   |--> KEXEC_CORE--> CRASH_DUMP-->/-|---->PROC_VMCORE
      KEXEC_FILE --|                               \ |
                                                     \---->CRASH_HOTPLUG
      
      KEXEC      --|
                   |--> KEXEC_CORE--> kexec reboot
      KEXEC_FILE --|
      
      Link: https://lkml.kernel.org/r/20240124051254.67105-6-bhe@redhat.com
      Signed-off-by: Baoquan He <bhe@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Hari Bathini <hbathini@linux.ibm.com>
      Cc: Pingfan Liu <piliu@redhat.com>
      Cc: Klara Modin <klarasmodin@gmail.com>
      Cc: Michael Kelley <mhklinux@outlook.com>
      Cc: Nathan Chancellor <nathan@kernel.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Yang Li <yang.lee@linux.alibaba.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      75bc255a
    • Baoquan He's avatar
      crash: split crash dumping code out from kexec_core.c · 02aff848
      Baoquan He authored
      Currently, KEXEC_CORE selects CRASH_CORE automatically because the crash
      code needs to be built in to avoid compile errors when building kexec code,
      even though the crash dumping functionality is not enabled. E.g.
      --------------------
      CONFIG_CRASH_CORE=y
      CONFIG_KEXEC_CORE=y
      CONFIG_KEXEC=y
      CONFIG_KEXEC_FILE=y
      ---------------------
      
      After splitting out the crashkernel reservation code and the vmcoreinfo
      exporting code, only crash related code is left in kernel/crash_core.c. Now
      move the crash related code from kexec_core.c to crash_core.c and only build
      it in when CONFIG_CRASH_DUMP=y.
      
      Also wrap the crash code inside CONFIG_CRASH_DUMP ifdeffery scope, or
      replace inappropriate CONFIG_KEXEC_CORE ifdefs with CONFIG_CRASH_DUMP
      ifdefs in generic kernel files.
      
      With these changes, the crash_core code is separated from the kexec code
      and can be disabled entirely if only the kexec reboot feature is wanted.
      
      Link: https://lkml.kernel.org/r/20240124051254.67105-5-bhe@redhat.com
      Signed-off-by: Baoquan He <bhe@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Hari Bathini <hbathini@linux.ibm.com>
      Cc: Pingfan Liu <piliu@redhat.com>
      Cc: Klara Modin <klarasmodin@gmail.com>
      Cc: Michael Kelley <mhklinux@outlook.com>
      Cc: Nathan Chancellor <nathan@kernel.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Yang Li <yang.lee@linux.alibaba.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      02aff848
    • Baoquan He's avatar
      crash: remove dependency of FA_DUMP on CRASH_DUMP · 2c44b67e
      Baoquan He authored
      In the kdump kernel, /proc/vmcore is an ELF file mapping the crashed
      kernel's old memory content. Its ELF header is constructed in the 1st
      kernel and passed to the kdump kernel via elfcorehdr_addr. Config
      CRASH_DUMP enables the code for accessing the 1st kernel's old memory on
      different architectures.
      
      Currently, config FA_DUMP depends on CRASH_DUMP because fadump needs to
      access the global variable 'elfcorehdr_addr' to judge whether it is in the
      kdump kernel within function is_kdump_kernel(). In the current
      kernel/crash_dump.c, variable 'elfcorehdr_addr' is defined, and function
      setup_elfcorehdr() is used to parse the kernel parameter to fetch the
      passed value of elfcorehdr_addr. Just for accessing elfcorehdr_addr,
      FA_DUMP really doesn't have to depend on CRASH_DUMP.
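      
      A rough userspace sketch of the mechanism described above, for
      illustration only. The all-ones "not set" sentinel, the "size@addr"
      argument form and every demo_* name are assumptions of this sketch, and
      the parser handles plain numbers only (no K/M/G suffixes):

```c
#include <stdbool.h>
#include <stdlib.h>

/* Assumed sentinel meaning "no ELF core header address was passed". */
#define DEMO_ELFCORE_ADDR_MAX (~0ULL)

static unsigned long long demo_elfcorehdr_addr = DEMO_ELFCORE_ADDR_MAX;
static unsigned long long demo_elfcorehdr_size;

/*
 * Parse "addr" or "size@addr" given as plain numbers.
 * Returns 0 on success and -1 on malformed input.
 */
static int demo_setup_elfcorehdr(const char *arg)
{
	unsigned long long first, addr;
	const char *rest;
	char *end;

	if (!arg || !*arg)
		return -1;

	first = strtoull(arg, &end, 0);
	if (end == arg)
		return -1;

	if (*end == '@') {
		/* "size@addr": the first number was the size. */
		rest = end + 1;
		addr = strtoull(rest, &end, 0);
		if (end == rest || *end != '\0')
			return -1;
		demo_elfcorehdr_size = first;
		demo_elfcorehdr_addr = addr;
		return 0;
	}

	if (*end != '\0')
		return -1;
	demo_elfcorehdr_addr = first;
	return 0;
}

/* A kernel that was handed a header address is treated as a kdump kernel. */
static bool demo_is_kdump_kernel(void)
{
	return demo_elfcorehdr_addr != DEMO_ELFCORE_ADDR_MAX;
}
```

      Nothing here needs the rest of the crash dumping code, which is the
      point of building it separately from CRASH_DUMP.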
      
      To remove the dependency of FA_DUMP on CRASH_DUMP and avoid confusion,
      rename kernel/crash_dump.c to kernel/elfcorehdr.c, and build it when
      CONFIG_VMCORE_INFO is enabled. With this, FA_DUMP doesn't need to depend
      on CRASH_DUMP.
      
      [bhe@redhat.com: power/fadump: make FA_DUMP select CRASH_DUMP]
        Link: https://lkml.kernel.org/r/Zb8D1ASrgX0qVm9z@MiWiFi-R3L-srv
      Link: https://lkml.kernel.org/r/20240124051254.67105-4-bhe@redhat.com
      Signed-off-by: Baoquan He <bhe@redhat.com>
      Acked-by: Hari Bathini <hbathini@linux.ibm.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Pingfan Liu <piliu@redhat.com>
      Cc: Klara Modin <klarasmodin@gmail.com>
      Cc: Michael Kelley <mhklinux@outlook.com>
      Cc: Nathan Chancellor <nathan@kernel.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Yang Li <yang.lee@linux.alibaba.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      2c44b67e
    • Baoquan He's avatar
      crash: split vmcoreinfo exporting code out from crash_core.c · 443cbaf9
      Baoquan He authored
      Now move the relevant code into separate files:
      kernel/vmcore_info.c, include/linux/vmcore_info.h.
      
      And add config item VMCORE_INFO to control its enabling.
      
      And also update the old ifdeffery of CONFIG_CRASH_CORE, the includes of
      <linux/crash_core.h>, and the config item dependencies on CRASH_CORE
      accordingly.
      
      And also do renaming as follows:
       - arch/xxx/kernel/{crash_core.c => vmcore_info.c}
      because they are only related to vmcoreinfo exporting on x86, arm64,
      riscv.
      
      And also remove config item CRASH_CORE, and rely on CONFIG_KEXEC_CORE to
      decide whether to build in crash_core.c.
      
      [yang.lee@linux.alibaba.com: remove duplicated include in vmcore_info.c]
        Link: https://lkml.kernel.org/r/20240126005744.16561-1-yang.lee@linux.alibaba.com
      Link: https://lkml.kernel.org/r/20240124051254.67105-3-bhe@redhat.com
      Signed-off-by: Baoquan He <bhe@redhat.com>
      Signed-off-by: Yang Li <yang.lee@linux.alibaba.com>
      Acked-by: Hari Bathini <hbathini@linux.ibm.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Pingfan Liu <piliu@redhat.com>
      Cc: Klara Modin <klarasmodin@gmail.com>
      Cc: Michael Kelley <mhklinux@outlook.com>
      Cc: Nathan Chancellor <nathan@kernel.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Yang Li <yang.lee@linux.alibaba.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      443cbaf9
    • Baoquan He's avatar
      kexec: split crashkernel reservation code out from crash_core.c · 85fcde40
      Baoquan He authored
      Patch series "Split crash out from kexec and clean up related config
      items", v3.
      
      Motivation:
      =============
      Previously, LKP reported a build error. When investigating, it couldn't
      be resolved reasonably with the present messy kdump config items.
      
       https://lore.kernel.org/oe-kbuild-all/202312182200.Ka7MzifQ-lkp@intel.com/
      
      The kdump (crash dumping) related config items can cause confusion:
      
      Firstly,
      
      CRASH_CORE enables code including
       - crashkernel reservation;
       - elfcorehdr updating;
       - vmcoreinfo exporting;
       - crash hotplug handling;
      
      Now fadump of powerpc, kcore dynamic debugging and kdump all select
      CRASH_CORE, while
       - fadump needs crashkernel parsing, vmcoreinfo exporting, and accessing
         global variable 'elfcorehdr_addr';
       - kcore only needs vmcoreinfo exporting;
       - kdump needs all of the current kernel/crash_core.c.
      
      So enabling only PROC_KCORE or FA_DUMP will enable CRASH_CORE, which
      misleads people into thinking crash dumping is enabled when it is not.
      
      Secondly,
      
      It's not reasonable to allow KEXEC_CORE to select CRASH_CORE.
      
      KEXEC_CORE enables code which allocates control pages, copies
      kexec/kdump segments, and prepares for switching. This code is
      shared by both kexec reboot and kdump. We may want kexec reboot
      but not kdump. In that case, CRASH_CORE should not be selected.
      
       --------------------
       CONFIG_CRASH_CORE=y
       CONFIG_KEXEC_CORE=y
       CONFIG_KEXEC=y
       CONFIG_KEXEC_FILE=y
       ---------------------
      
      Thirdly,
      
      It's not reasonable to allow CRASH_DUMP to select KEXEC_CORE.
      
      That allows KEXEC_CORE and CRASH_DUMP to be enabled independently of
      KEXEC or KEXEC_FILE. However, w/o KEXEC or KEXEC_FILE, the KEXEC_CORE
      code built in doesn't make any sense because no kernel loading or
      switching will happen to utilize the KEXEC_CORE code.
       ---------------------
       CONFIG_CRASH_CORE=y
       CONFIG_KEXEC_CORE=y
       CONFIG_CRASH_DUMP=y
       ---------------------
      
      What is worse in this case, on arch sh and arm, KEXEC relies on MMU,
      while CRASH_DUMP can still be enabled when !MMU, so the compile error
      reported by the lkp test robot in the above link is seen.
      
       ------arch/sh/Kconfig------
       config ARCH_SUPPORTS_KEXEC
               def_bool MMU
      
       config ARCH_SUPPORTS_CRASH_DUMP
               def_bool BROKEN_ON_SMP
       ---------------------------
      
      Changes:
      ===========
      1, split out crash_reserve.c from crash_core.c;
      2, split out vmcore_info.c from crash_core.c;
      3, move crash related code in kexec_core.c into crash_core.c;
      4, remove dependency of FA_DUMP on CRASH_DUMP;
      5, clean up kdump related config items;
      6, wrap up crash code in crash related ifdefs on all 8 arches
         which support crash dumping, except for ppc;
      
      Achievement:
      ===========
      With the above changes, I can rearrange the config item logic as below (the
      right item depends on or is selected by the left item):
      
          PROC_KCORE -----------> VMCORE_INFO
      
                     |----------> VMCORE_INFO
          FA_DUMP----|
                     |----------> CRASH_RESERVE
      
                                                          ---->VMCORE_INFO
                                                         /
                                                         |---->CRASH_RESERVE
          KEXEC      --|                                /|
                       |--> KEXEC_CORE--> CRASH_DUMP-->/-|---->PROC_VMCORE
          KEXEC_FILE --|                               \ |
                                                         \---->CRASH_HOTPLUG
      
      
          KEXEC      --|
                       |--> KEXEC_CORE (for kexec reboot only)
          KEXEC_FILE --|
      
      Test
      ========
      On all 8 architectures, including x86_64, arm64, s390x, sh, arm, mips,
      riscv, loongarch, I tried the three config item settings below and
      all builds passed. Take the configs on x86_64 as an example here:
      
      (1) Both CONFIG_KEXEC and KEXEC_FILE are unset, then all kexec/kdump
      items are unset automatically:
      # Kexec and crash features
      # CONFIG_KEXEC is not set
      # CONFIG_KEXEC_FILE is not set
      # end of Kexec and crash features
      
      (2) set CONFIG_KEXEC_FILE and 'make olddefconfig':
      ---------------
      # Kexec and crash features
      CONFIG_CRASH_RESERVE=y
      CONFIG_VMCORE_INFO=y
      CONFIG_KEXEC_CORE=y
      CONFIG_KEXEC_FILE=y
      CONFIG_CRASH_DUMP=y
      CONFIG_CRASH_HOTPLUG=y
      CONFIG_CRASH_MAX_MEMORY_RANGES=8192
      # end of Kexec and crash features
      ---------------
      
      (3) unset CONFIG_CRASH_DUMP in case 2 and execute 'make olddefconfig':
      ------------------------
      # Kexec and crash features
      CONFIG_KEXEC_CORE=y
      CONFIG_KEXEC_FILE=y
      # end of Kexec and crash features
      ------------------------
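      
      The three cases above follow from the dependency diagram. A
      deliberately simplified model in C, for illustration only: it ignores
      PROC_KCORE, FA_DUMP, PROC_VMCORE and CRASH_HOTPLUG, and the struct and
      function names are hypothetical:

```c
#include <stdbool.h>

struct demo_config {
	/* User choices. */
	bool kexec;
	bool kexec_file;
	bool want_crash_dump;	/* only honoured when KEXEC_CORE ends up set */

	/* Derived items. */
	bool kexec_core;
	bool crash_dump;
	bool crash_reserve;
	bool vmcore_info;
};

/* Derive the dependent items from the user choices. */
static void demo_resolve(struct demo_config *c)
{
	/* KEXEC and KEXEC_FILE each select KEXEC_CORE. */
	c->kexec_core = c->kexec || c->kexec_file;

	/* CRASH_DUMP depends on KEXEC_CORE. */
	c->crash_dump = c->kexec_core && c->want_crash_dump;

	/* CRASH_DUMP selects CRASH_RESERVE and VMCORE_INFO. */
	c->crash_reserve = c->crash_dump;
	c->vmcore_info = c->crash_dump;
}
```

      Case (3) corresponds to kexec_file set with want_crash_dump cleared:
      only kexec_core remains set among the derived items.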
      
      Note:
      For ppc, it needs investigation to make clear how to split out the crash
      code in the arch folder. Hope Hari and Pingfan can help have a look and see
      if it's doable. For now, I make it either have both kexec and crash
      enabled, or disable both of them altogether.
      
      
      This patch (of 14):
      
      Both kdump and fa_dump of ppc rely on crashkernel reservation.  Move the
      relevant code into separate files: crash_reserve.c,
      include/linux/crash_reserve.h.
      
      And also add config item CRASH_RESERVE to control the enabling of the
      code.  And update config items which are related to crashkernel
      reservation.
      
      And also change ifdeffery from CONFIG_CRASH_CORE to CONFIG_CRASH_RESERVE
      when those scopes are only crashkernel reservation related.
      
      And also rename arch/XXX/include/asm/{crash_core.h => crash_reserve.h} on
      arm64, x86 and risc-v because those architectures' crash_core.h is only
      related to crashkernel reservation.
      
      [akpm@linux-foundation.org: s/CRASH_RESEERVE/CRASH_RESERVE/, per Klara Modin]
      Link: https://lkml.kernel.org/r/20240124051254.67105-1-bhe@redhat.com
      Link: https://lkml.kernel.org/r/20240124051254.67105-2-bhe@redhat.com
      Signed-off-by: Baoquan He <bhe@redhat.com>
      Acked-by: Hari Bathini <hbathini@linux.ibm.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Pingfan Liu <piliu@redhat.com>
      Cc: Klara Modin <klarasmodin@gmail.com>
      Cc: Michael Kelley <mhklinux@outlook.com>
      Cc: Nathan Chancellor <nathan@kernel.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Yang Li <yang.lee@linux.alibaba.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      85fcde40
    • Uladzislau Rezki (Sony)'s avatar
      mm: vmalloc: refactor vmalloc_dump_obj() function · 8be4d46e
      Uladzislau Rezki (Sony) authored
      This patch simplifies the function in question by removing the
      extra stack variable "objp" and returning to an early-exit approach if
      spin_trylock() fails or the VA was not found.
      
      Link: https://lkml.kernel.org/r/20240124180920.50725-2-urezki@gmail.com
      Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
      Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sony.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      8be4d46e
    • Uladzislau Rezki (Sony)'s avatar
      mm: vmalloc: improve description of vmap node layer · 15e02a39
      Uladzislau Rezki (Sony) authored
      This patch adds extra explanation of the recently added vmap node layer,
      based on community feedback.  No functional change.
      
      Link: https://lkml.kernel.org/r/20240124180920.50725-1-urezki@gmail.com
      Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
      Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sony.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      15e02a39
    • Uladzislau Rezki (Sony)'s avatar
      mm: vmalloc: add a shrinker to drain vmap pools · 7679ba6b
      Uladzislau Rezki (Sony) authored
      The added shrinker is used to return currently cached VAs to the global
      vmap space when the system enters a low memory mode.
      
      Link: https://lkml.kernel.org/r/20240102184633.748113-12-urezki@gmail.com
      Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
      Cc: Kazuhito Hagio <k-hagio-ab@nec.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
      Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
      Cc: Lorenzo Stoakes <lstoakes@gmail.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sony.com>
      Cc: Paul E. McKenney <paulmck@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      7679ba6b
    • Uladzislau Rezki (Sony)'s avatar
      mm: vmalloc: set nr_nodes based on CPUs in a system · 8f33a2ff
      Uladzislau Rezki (Sony) authored
      The number of nodes used in the alloc/free paths is set based on
      num_possible_cpus() in a system.  Please note that the upper limit
      is fixed and corresponds to 128 nodes.
      
      For 32-bit or single core systems, access to the global vmap heap is not
      balanced.  Such small systems do not suffer from lock contention due to
      the low number of CPUs.  In such cases nr_nodes is equal to 1.
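      
      The sizing rule described above can be sketched as follows. This is a
      userspace illustration under stated assumptions, not the code from this
      patch: the function name and its parameters are hypothetical, and only
      the 128-node cap and the 32-bit/single-core fallback come from the text:

```c
/* Upper limit on the number of vmap nodes, as stated in the description. */
#define DEMO_MAX_NODES 128u

/*
 * Pick the number of nodes from the possible CPU count.
 * 32-bit and single-core systems fall back to one node.
 */
static unsigned int demo_nr_nodes(unsigned int possible_cpus,
				  unsigned int bits_per_long)
{
	if (bits_per_long != 64 || possible_cpus <= 1)
		return 1;

	return possible_cpus > DEMO_MAX_NODES ? DEMO_MAX_NODES : possible_cpus;
}
```

      For a 64-thread, 64-bit machine this gives 64 nodes; a 256-CPU system
      would be capped at 128.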
      
      Test on AMD Ryzen Threadripper 3970X 32-Core Processor:
      sudo ./test_vmalloc.sh run_test_mask=7 nr_threads=64
      
      <default perf>
       94.41%     0.89%  [kernel]        [k] _raw_spin_lock
       93.35%    93.07%  [kernel]        [k] native_queued_spin_lock_slowpath
       76.13%     0.28%  [kernel]        [k] __vmalloc_node_range
       72.96%     0.81%  [kernel]        [k] alloc_vmap_area
       56.94%     0.00%  [kernel]        [k] __get_vm_area_node
       41.95%     0.00%  [kernel]        [k] vmalloc
       37.15%     0.01%  [test_vmalloc]  [k] full_fit_alloc_test
       35.17%     0.00%  [kernel]        [k] ret_from_fork_asm
       35.17%     0.00%  [kernel]        [k] ret_from_fork
       35.17%     0.00%  [kernel]        [k] kthread
       35.08%     0.00%  [test_vmalloc]  [k] test_func
       34.45%     0.00%  [test_vmalloc]  [k] fix_size_alloc_test
       28.09%     0.01%  [test_vmalloc]  [k] long_busy_list_alloc_test
       23.53%     0.25%  [kernel]        [k] vfree.part.0
       21.72%     0.00%  [kernel]        [k] remove_vm_area
       20.08%     0.21%  [kernel]        [k] find_unlink_vmap_area
        2.34%     0.61%  [kernel]        [k] free_vmap_area_noflush
      <default perf>
         vs
      <patch-series perf>
       82.32%     0.22%  [test_vmalloc]  [k] long_busy_list_alloc_test
       63.36%     0.02%  [kernel]        [k] vmalloc
       63.34%     2.64%  [kernel]        [k] __vmalloc_node_range
       30.42%     4.46%  [kernel]        [k] vfree.part.0
       28.98%     2.51%  [kernel]        [k] __alloc_pages_bulk
       27.28%     0.19%  [kernel]        [k] __get_vm_area_node
       26.13%     1.50%  [kernel]        [k] alloc_vmap_area
       21.72%    21.67%  [kernel]        [k] clear_page_rep
       19.51%     2.43%  [kernel]        [k] _raw_spin_lock
       16.61%    16.51%  [kernel]        [k] native_queued_spin_lock_slowpath
       13.40%     2.07%  [kernel]        [k] free_unref_page
       10.62%     0.01%  [kernel]        [k] remove_vm_area
        9.02%     8.73%  [kernel]        [k] insert_vmap_area
        8.94%     0.00%  [kernel]        [k] ret_from_fork_asm
        8.94%     0.00%  [kernel]        [k] ret_from_fork
        8.94%     0.00%  [kernel]        [k] kthread
        8.29%     0.00%  [test_vmalloc]  [k] test_func
        7.81%     0.05%  [test_vmalloc]  [k] full_fit_alloc_test
        5.30%     4.73%  [kernel]        [k] purge_vmap_node
        4.47%     2.65%  [kernel]        [k] free_vmap_area_noflush
      <patch-series perf>
      
      confirms that native_queued_spin_lock_slowpath goes down to
      16.51% from 93.07%.
      
      The throughput is ~12x higher:
      
      urezki@pc638:~$ time sudo ./test_vmalloc.sh run_test_mask=7 nr_threads=64
      Run the test with following parameters: run_test_mask=7 nr_threads=64
      Done.
      Check the kernel ring buffer to see the summary.
      
      real    10m51.271s
      user    0m0.013s
      sys     0m0.187s
      urezki@pc638:~$
      
      urezki@pc638:~$ time sudo ./test_vmalloc.sh run_test_mask=7 nr_threads=64
      Run the test with following parameters: run_test_mask=7 nr_threads=64
      Done.
      Check the kernel ring buffer to see the summary.
      
      real    0m51.301s
      user    0m0.015s
      sys     0m0.040s
      urezki@pc638:~$
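      
      The ~12x figure can be checked from the two "real" times, assuming the
      first run is the default kernel and the second one is the patched
      kernel. A small sketch with hypothetical helper names:

```c
/* Convert a "time"-style reading (minutes, seconds) into seconds. */
static double demo_to_seconds(unsigned int minutes, double seconds)
{
	return minutes * 60.0 + seconds;
}

/* Ratio of wall-clock times; above 1 means the second run was faster. */
static double demo_speedup(double before_s, double after_s)
{
	return before_s / after_s;
}
```

      With the values above, demo_speedup(demo_to_seconds(10, 51.271),
      demo_to_seconds(0, 51.301)) is 651.271 / 51.301, roughly 12.7.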
      
      Link: https://lkml.kernel.org/r/20240102184633.748113-11-urezki@gmail.com
      Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
      Cc: Kazuhito Hagio <k-hagio-ab@nec.com>
      Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
      Cc: Lorenzo Stoakes <lstoakes@gmail.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sony.com>
      Cc: Paul E. McKenney <paulmck@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      8f33a2ff
    • Uladzislau Rezki (Sony)'s avatar
      mm: vmalloc: support multiple nodes in vmallocinfo · 8e1d743f
      Uladzislau Rezki (Sony) authored
      Allocated areas are spread among nodes, which implies that scanning has
      to be performed individually on each node in order to dump all existing
      VAs.
      
      Link: https://lkml.kernel.org/r/20240102184633.748113-10-urezki@gmail.com
      Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
      Cc: Kazuhito Hagio <k-hagio-ab@nec.com>
      Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
      Cc: Lorenzo Stoakes <lstoakes@gmail.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sony.com>
      Cc: Paul E. McKenney <paulmck@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      8e1d743f
    • Uladzislau Rezki (Sony)'s avatar
      mm: vmalloc: support multiple nodes in vread_iter · 53becf32
      Uladzislau Rezki (Sony) authored
      Extend vread_iter() to be able to perform sequential reading of VAs
      which are spread among multiple nodes, so that data read over /dev/kmem
      correctly reflects the vmalloc memory layout.
      
      Link: https://lkml.kernel.org/r/20240102184633.748113-9-urezki@gmail.com
      Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
      Reviewed-by: Baoquan He <bhe@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
      Cc: Kazuhito Hagio <k-hagio-ab@nec.com>
      Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
      Cc: Lorenzo Stoakes <lstoakes@gmail.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sony.com>
      Cc: Paul E. McKenney <paulmck@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      53becf32