1. 18 Oct, 2023 40 commits
    • tools/mm: fix the default case for page_owner_sort · c6d5e490
      Audra Mitchell authored
      With the additional commands and timestamps added to the tool, the default
      case (-t) has been broken.  Now that the allocation timestamps are saved
      outside of the txt field, allow us to properly sort the data by number of
      times the record has been seen.  Furthermore prevent the misuse of the
      commandline arguments so only one compare option can be used.
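
A minimal sketch of the ordering described above, not the tool's actual code. The `struct record` layout, its `num` field, and both function names are hypothetical and chosen only for illustration.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical record: the stack text plus how many times it was seen. */
struct record {
	const char *txt;
	int num;
};

/* qsort() comparator: the most frequently seen record sorts first. */
int compare_num(const void *p1, const void *p2)
{
	const struct record *a = p1, *b = p2;

	return (b->num > a->num) - (b->num < a->num);
}

/* Sort records in place by descending occurrence count. */
void sort_by_times(struct record *recs, size_t n)
{
	qsort(recs, n, sizeof(*recs), compare_num);
}

/*
 * Accept only one compare option: store the first one and return 0,
 * return -1 if an option was already chosen.
 */
int set_compare_option(int *chosen, int opt)
{
	if (*chosen != 0)
		return -1;
	*chosen = opt;
	return 0;
}
```

Calling `set_compare_option()` once per parsed flag lets the argument parser reject a second compare option before any sorting starts.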
      
      Link: https://lkml.kernel.org/r/20231013190350.579407-5-audra@redhat.com
      Signed-off-by: Audra Mitchell <audra@redhat.com>
      Acked-by: Rafael Aquini <aquini@redhat.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Georgi Djakov <djakov@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • tools/mm: filter out timestamps for correct collation · 63a15062
      Audra Mitchell authored
      With the introduction of allocation timestamps being included in
      page_owner output, each record becomes unique due to the timestamp
      nanosecond granularity.  Remove the check in add_list that tries to
      collate each record during processing as the memcmp() is just additional
      overhead at this point.
      
      Also keep the allocation timestamps, but allow collation to occur without
      consideration of the allocation timestamp except in the case where
      allocation timestamps are requested by the user (the -a option).
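
The collation rule can be sketched as follows, assuming the timestamp is stored next to the record text rather than inside it. The `struct rec` layout, the `use_ts` flag and `collate()` are hypothetical names for this illustration, not identifiers from page_owner_sort.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical record: text without the timestamp, timestamp kept apart. */
struct rec {
	const char *txt;
	unsigned long long ts;	/* allocation timestamp in ns */
	int num;		/* how many identical records were merged */
};

/* Set by collate(); true models the user asking for timestamps (-a). */
static bool use_ts;

/* Compare by text, and by timestamp only when it was requested. */
int rec_cmp(const void *p1, const void *p2)
{
	const struct rec *a = p1, *b = p2;
	int c = strcmp(a->txt, b->txt);

	if (c || !use_ts)
		return c;
	return (a->ts > b->ts) - (a->ts < b->ts);
}

/*
 * Sort, then merge neighbours that compare equal.  Returns the number of
 * records left.  When timestamps are ignored, the timestamp kept for a
 * merged record is whichever copy sorted first.
 */
size_t collate(struct rec *r, size_t n, bool by_ts)
{
	size_t out = 0;

	if (n == 0)
		return 0;
	use_ts = by_ts;
	qsort(r, n, sizeof(*r), rec_cmp);
	for (size_t i = 1; i < n; i++) {
		if (rec_cmp(&r[out], &r[i]) == 0)
			r[out].num += r[i].num;
		else
			r[++out] = r[i];
	}
	return out + 1;
}
```

Sorting first and merging afterwards avoids a per-record comparison while the input is being read, which is the overhead the commit message refers to.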
      
      Link: https://lkml.kernel.org/r/20231013190350.579407-4-audra@redhat.com
      Signed-off-by: Audra Mitchell <audra@redhat.com>
      Acked-by: Rafael Aquini <aquini@redhat.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Georgi Djakov <djakov@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • tools/mm: remove references to free_ts from page_owner_sort · 0179c628
      Audra Mitchell authored
      With the removal of free timestamps from page_owner output, we no longer
      need to handle this case or the "unreleased" case.  Remove all references
      to both cases.
      
      Link: https://lkml.kernel.org/r/20231013190350.579407-3-audra@redhat.com
      Signed-off-by: Audra Mitchell <audra@redhat.com>
      Acked-by: Rafael Aquini <aquini@redhat.com>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Georgi Djakov <djakov@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/page_owner: remove free_ts from page_owner output · b459f090
      Audra Mitchell authored
      Patch series "Fix page_owner's use of free timestamps".
      
      While page owner output is used to investigate memory utilization,
      typically the allocation pathway, the introduction of timestamps to the
      page owner records caused each record to become unique due to the
      granularity of the nanosecond timestamp (for example):
      
        Page allocated via order 0 ... ts 5206196026 ns, free_ts 5187156703 ns
        Page allocated via order 0 ... ts 5206198540 ns, free_ts 5187162702 ns
      
      Furthermore, the page_owner output only dumps the currently allocated
      records, so having the free timestamps is nonsensical for the typical use
      case.
      
      In addition, the introduction of timestamps was not properly handled in
      the page_owner_sort tool causing most use cases to be broken.  This series
      is meant to remove the free timestamps from the page_owner output and fix
      the page_owner_sort tool so proper collation can occur.
      
      
      This patch (of 5):
      
      When printing page_owner data via the debugfs interface, no free pages will
      ever be dumped due to the series of checks in read_page_owner():
      
          /*
           * Although we do have the info about past allocation of free
           * pages, it's not relevant for current memory usage.
           */
          if (!test_bit(PAGE_EXT_OWNER_ALLOCATED, &page_ext->flags))
      
      The free_ts values are still used when dump_page_owner() is called, so
      keep the field for other use cases but remove it from the output of the
      typical page_owner case.
      
      Link: https://lkml.kernel.org/r/20231013190350.579407-1-audra@redhat.com
      Link: https://lkml.kernel.org/r/20231013190350.579407-2-audra@redhat.com
      Fixes: 866b4852 ("mm/page_owner: record the timestamp of all pages during free")
      Signed-off-by: Audra Mitchell <audra@redhat.com>
      Acked-by: Rafael Aquini <aquini@redhat.com>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Georgi Djakov <djakov@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: abstract VMA merge and extend into vma_merge_extend() helper · 93bf5d4a
      Lorenzo Stoakes authored
      mremap uses vma_merge() in the case where a VMA needs to be extended. This
      can be significantly simplified and abstracted.
      
      This makes it far easier to understand what the actual function is doing,
      avoids future mistakes in use of the confusing vma_merge() function and
      importantly allows us to make future changes to how vma_merge() is
      implemented by knowing explicitly which merge cases each invocation uses.
      
      Note that in the mremap() extend case, we perform this merge only when
      old_len == vma->vm_end - addr. The extension_start, i.e. the start of the
      extended portion of the VMA is equal to addr + old_len, i.e. vma->vm_end.
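
The condition above can be illustrated with a small sketch. `struct toy_vma` is a hypothetical, cut-down stand-in for the kernel's VMA structure, and both helper names are assumptions made for this example.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a VMA: a half-open range [vm_start, vm_end). */
struct toy_vma {
	unsigned long vm_start;
	unsigned long vm_end;
};

/* True when the range [addr, addr + old_len) runs up to the VMA's end. */
bool can_extend_in_place(const struct toy_vma *vma, unsigned long addr,
			 unsigned long old_len)
{
	return old_len == vma->vm_end - addr;
}

/* Start of the newly added portion of the mapping. */
unsigned long extension_start(unsigned long addr, unsigned long old_len)
{
	return addr + old_len;
}
```

Whenever `can_extend_in_place()` holds, `extension_start()` equals `vma->vm_end`, which is why the extension begins exactly where the existing VMA stops.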
      
      With this refactoring, vma_merge() is no longer required anywhere except
      mm/mmap.c, so mark it static.
      
      Link: https://lkml.kernel.org/r/f16cbdc2e72d37a1a097c39dc7d1fee8919a1c93.1697043508.git.lstoakes@gmail.com
      Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: abstract merge for new VMAs into vma_merge_new_vma() · 4b5f2d20
      Lorenzo Stoakes authored
      Only in mmap_region() and copy_vma() do we attempt to merge VMAs which
      occupy entirely new regions of virtual memory.
      
      We can abstract this logic and make the intent of these invocations
      completely explicit, rather than invoking vma_merge() with an inscrutable
      wall of parameters.
      
      This also paves the way for a simplification of the core vma_merge()
      implementation, as we seek to make it entirely an implementation detail.
      
      The VMA merge call in mmap_region() occurs only for file-backed mappings,
      where each of the parameters previously specified as NULL are defaulted to
      NULL in vma_init() (called by vm_area_alloc()).
      
      This matches the previous behaviour of specifying NULL for a number of
      fields, however note that prior to this call we pass the VMA to the file
      system driver via call_mmap(), which may in theory adjust fields that we
      pass in to vma_merge_new_vma().
      
      Therefore we actually resolve an oversight here by allowing for the fact
      that the driver may have done this.
      
      Link: https://lkml.kernel.org/r/3dc71d17e307756a54781d4a4ce7315cf8b18bea.1697043508.git.lstoakes@gmail.com
      Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: make vma_merge() and split_vma() internal · adb20b0c
      Lorenzo Stoakes authored
      Now that the common pattern of attempting a merge via vma_merge() and,
      should this fail, splitting VMAs via split_vma() has been abstracted, the
      former can be placed into mm/internal.h and the latter made static.
      
      In addition, the split_vma() nommu variant also need not be exported.
      
      Link: https://lkml.kernel.org/r/405f2be10e20c4e9fbcc9fe6b2dfea105f6642e0.1697043508.git.lstoakes@gmail.com
      Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: abstract the vma_merge()/split_vma() pattern for mprotect() et al. · 94d7d923
      Lorenzo Stoakes authored
      mprotect() and other functions which change VMA parameters over a range
      each employ a pattern of:-
      
      1. Attempt to merge the range with adjacent VMAs.
      2. If this fails, and the range spans a subset of the VMA, split it
         accordingly.
      
      This is open-coded and duplicated in each case. Also in each case most of
      the parameters passed to vma_merge() remain the same.
      
      Create a new function, vma_modify(), which abstracts this operation,
      accepting only those parameters which can be changed.
      
      To avoid the mess of invoking each function call with unnecessary
      parameters, create inline wrapper functions for each of the modify
      operations, parameterised only by what is required to perform the action.
      
      We can also significantly simplify the logic - by returning the VMA if we
      split (or merged VMA if we do not) we no longer need specific handling for
      merge/split cases in any of the call sites.
      
      Note that the userfaultfd_release() case works even though it does not
      split VMAs - since start is set to vma->vm_start and end is set to
      vma->vm_end, the split logic does not trigger.
      
      In addition, since we calculate pgoff to be equal to vma->vm_pgoff +
      ((start - vma->vm_start) >> PAGE_SHIFT), and start - vma->vm_start will be 0 in this
      instance, this invocation will remain unchanged.
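
A worked sketch of the page offset calculation, assuming the shift applies only to the byte distance from the start of the VMA. In C, `+` binds tighter than `>>`, so the inner parentheses are needed, and only this grouping leaves the offset unchanged when start equals vm_start. `TOY_PAGE_SHIFT` (4 KiB pages) and `range_pgoff()` are hypothetical names for this example.

```c
#include <assert.h>

/* Assumed page size of 4 KiB for the example; the real value is per-arch. */
#define TOY_PAGE_SHIFT 12

/*
 * Page offset of 'start' within the backing object, given the VMA's own
 * start address and page offset.  The byte distance is converted to pages
 * first, then added to the VMA's page offset.
 */
unsigned long range_pgoff(unsigned long vm_start, unsigned long vm_pgoff,
			  unsigned long start)
{
	return vm_pgoff + ((start - vm_start) >> TOY_PAGE_SHIFT);
}
```

With `start == vm_start` the distance is zero and the result is simply `vm_pgoff`, matching the userfaultfd_release() case described above.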
      
      We eliminate a VM_WARN_ON() in mprotect_fixup() as this simply asserts that
      vma_merge() correctly ensures that flags remain the same, something that is
      already checked in is_mergeable_vma() and elsewhere, and in any case is not
      specific to mprotect().
      
      Link: https://lkml.kernel.org/r/0dfa9368f37199a423674bf0ee312e8ea0619044.1697043508.git.lstoakes@gmail.com
      Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: move vma_policy() and anon_vma_name() decls to mm_types.h · 3657fdc2
      Lorenzo Stoakes authored
      Patch series "Abstract vma_merge() and split_vma()", v4.
      
      The vma_merge() interface is very confusing and its implementation has led
      to numerous bugs as a result of that confusion.
      
      In addition there is duplication both in invocation of vma_merge(), but
      also in the common mprotect()-style pattern of attempting a merge, then if
      this fails, splitting the portion of a VMA about to have its attributes
      changed.
      
      This pattern has been copy/pasted around the kernel in each instance where
      such an operation has been required, each very slightly modified from the
      last to make it even harder to decipher what is going on.
      
      Simplify the whole thing by dividing the actual uses of vma_merge() and
      split_vma() into specific and abstracted functions and de-duplicate the
      vma_merge()/split_vma() pattern altogether.
      
      Doing so also opens the door to changing how vma_merge() is implemented -
      by knowing precisely what cases a caller is invoking rather than having a
      central interface where anything might happen we can untangle the brittle
      and confusing vma_merge() implementation into something more workable.
      
      For mprotect()-like cases we introduce vma_modify() which performs the
      vma_merge()/split_vma() pattern, returning a pointer to either the merged
      or split VMA or an ERR_PTR(err) if the splits fail.
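
The "pointer or ERR_PTR(err)" return convention can be imitated in user space as sketched below. The `toy_` helpers mimic the idea behind the kernel's ERR_PTR(), IS_ERR() and PTR_ERR(), and `toy_modify()` is a hypothetical function that only models the return contract, not the merge or split logic.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Error codes occupy the top 4095 values of the address space. */
#define TOY_MAX_ERRNO 4095

/* Encode a negative errno value in a pointer. */
void *toy_err_ptr(long err)
{
	return (void *)(intptr_t)err;
}

/* Decode the errno value from an error pointer. */
long toy_ptr_err(const void *ptr)
{
	return (long)(intptr_t)ptr;
}

/* True if the pointer encodes an error rather than a valid address. */
int toy_is_err(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-TOY_MAX_ERRNO;
}

/* Hypothetical stand-in for a VMA. */
struct toy_vma {
	unsigned long vm_start;
	unsigned long vm_end;
};

/*
 * Models only the return contract: hand back the VMA to operate on, or an
 * encoded -ENOMEM when the (simulated) split fails.
 */
struct toy_vma *toy_modify(struct toy_vma *vma, int split_fails)
{
	if (split_fails)
		return toy_err_ptr(-ENOMEM);
	return vma;
}
```

A caller checks the result once with `toy_is_err()` and otherwise continues with the returned VMA, so no separate handling of the merge and split cases is required at the call site.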
      
      We provide a number of inline helper functions to make things even clearer:-
      
      * vma_modify_flags()      - Prepare to modify the VMA's flags.
      * vma_modify_flags_name() - Prepare to modify the VMA's flags/anon_vma_name
      * vma_modify_policy()     - Prepare to modify the VMA's mempolicy.
      * vma_modify_flags_uffd() - Prepare to modify the VMA's flags/uffd context.
      
      For cases where a new VMA is attempted to be merged with adjacent VMAs we
      add:-
      
      * vma_merge_new_vma() - Prepare to merge a new VMA.
      * vma_merge_extend()  - Prepare to extend the end of a new VMA.
      
      
      This patch (of 5):
      
      The vma_policy() define is a helper specifically for a VMA field so it
      makes sense to host it in the memory management types header.
      
      The anon_vma_name(), anon_vma_name_alloc() and anon_vma_name_free()
      functions are a little out of place in mm_inline.h as they define external
      functions, and so it makes sense to locate them in mm_types.h.
      
      The purpose of these relocations is to make it possible to abstract static
      inline wrappers which invoke both of these helpers.
      
      Link: https://lkml.kernel.org/r/cover.1697043508.git.lstoakes@gmail.com
      Link: https://lkml.kernel.org/r/24bfc6c9e382fffbcb0ea8d424392c27d56cc8ca.1697043508.git.lstoakes@gmail.com
      Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
      Cc: Lorenzo Stoakes <lstoakes@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • sched: remove wait bookmarks · 37acade0
      Matthew Wilcox (Oracle) authored
      There are no users of wait bookmarks left, so simplify the wait
      code by removing them.
      
      Link: https://lkml.kernel.org/r/20231010035829.544242-2-willy@infradead.org
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Acked-by: Ingo Molnar <mingo@kernel.org>
      Cc: Benjamin Segall <bsegall@google.com>
      Cc: Bin Lai <sclaibin@gmail.com>
      Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
      Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Juri Lelli <juri.lelli@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
      Cc: Valentin Schneider <vschneid@redhat.com>
      Cc: Vincent Guittot <vincent.guittot@linaro.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • filemap: remove use of wait bookmarks · b0b598ee
      Matthew Wilcox (Oracle) authored
      The original problem of the overly long list of waiters on a locked page
      was solved properly by commit 9a1ea439 ("mm:
      put_and_wait_on_page_locked() while page is migrated").  In the meantime,
      using bookmarks for the writeback bit can cause livelocks, so we need to
      stop using them.
      
      Link: https://lkml.kernel.org/r/20231010035829.544242-1-willy@infradead.org
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Bin Lai <sclaibin@gmail.com>
      Cc: Benjamin Segall <bsegall@google.com>
      Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
      Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Juri Lelli <juri.lelli@redhat.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
      Cc: Valentin Schneider <vschneid@redhat.com>
      Cc: Vincent Guittot <vincent.guittot@linaro.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/mprotect: allow unfaulted VMAs to be unaccounted on mprotect() · 9b914329
      Lorenzo Stoakes authored
      When mprotect() is used to make unwritable VMAs writable, they have the
      VM_ACCOUNT flag applied and memory accounted accordingly.
      
      If the VMA has had no pages faulted in and is then made unwritable once
      again, it will remain accounted for, despite not being capable of
      extending memory usage.
      
      Consider:-
      
      ptr = mmap(NULL, page_size * 3, PROT_READ, MAP_ANON | MAP_PRIVATE, -1, 0);
      mprotect(ptr + page_size, page_size, PROT_READ | PROT_WRITE);
      mprotect(ptr + page_size, page_size, PROT_READ);
      
      The first mprotect() splits the range into 3 VMAs and the second fails to
      merge the three as the middle VMA has VM_ACCOUNT set and the others do
      not, rendering them unmergeable.
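
The sequence above can be wrapped into a small user-space function for experimentation. This is a sketch assuming Linux with glibc. The function name is hypothetical, and the code only reports whether the calls succeed, since the resulting VMA layout depends on the running kernel.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE		/* for MAP_ANONYMOUS under strict -std= modes */
#endif
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Map three read-only anonymous pages, make the middle one writable and
 * then read-only again without touching it.  Returns 0 if every call
 * succeeds and -1 otherwise.
 */
int run_mprotect_sequence(void)
{
	long page_size = sysconf(_SC_PAGESIZE);
	char *ptr;
	int ret = -1;

	if (page_size <= 0)
		return -1;

	ptr = mmap(NULL, page_size * 3, PROT_READ,
		   MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
	if (ptr == MAP_FAILED)
		return -1;

	/* No page is faulted in between the two calls. */
	if (mprotect(ptr + page_size, page_size, PROT_READ | PROT_WRITE) == 0 &&
	    mprotect(ptr + page_size, page_size, PROT_READ) == 0)
		ret = 0;

	munmap(ptr, page_size * 3);
	return ret;
}
```

To observe the VMA layout, the contents of /proc/self/maps can be inspected after the second mprotect() and before the munmap() call.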
      
      This is unnecessary, since no pages have actually been allocated and the
      middle VMA is not capable of utilising more memory, thereby introducing
      unnecessary VMA fragmentation (and accounting for more memory than is
      necessary).
      
      Since we cannot efficiently determine which pages map to an anonymous VMA,
      we have to be very conservative - determining whether any pages at all
      have been faulted in, by checking whether vma->anon_vma is NULL.
      
      We can see that the lack of anon_vma implies that no anonymous pages are
      present as evidenced by vma_needs_copy() utilising this on fork to
      determine whether page tables need to be copied.
      
      The only place where anon_vma is set NULL explicitly is on fork with
      VM_WIPEONFORK set, however since this flag is intended to cause the child
      process to not CoW on a given memory range, it is right to interpret this
      as indicating the VMA has no faulted-in anonymous memory mapped.
      
      If the VMA was forked without VM_WIPEONFORK set, then anon_vma_fork() will
      have ensured that a new anon_vma is assigned (and correctly related to its
      parent anon_vma) should any pages be CoW-mapped.
      
      The overall operation is safe against races as we hold a write lock against
      mm->mmap_lock.
      
      If we could efficiently look up the VMA's faulted-in pages then we would
      unaccount all those pages not yet faulted in.  However, as the original
      comment alludes, this simply isn't currently possible, so we are
      conservative and account all pages or none at all.
      
      Link: https://lkml.kernel.org/r/ad5540371a16623a069f03f4db1739f33cde1fab.1696921767.git.lstoakes@gmail.com
      Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: add printf attribute to shrinker_debugfs_name_alloc · f04eba13
      Lucy Mielke authored
      This fixes a compiler warning when compiling an allyesconfig with W=1:
      
      mm/internal.h:1235:9: error: function might be a candidate for `gnu_printf'
      format attribute [-Werror=suggest-attribute=format]
      
      [akpm@linux-foundation.org: fix shrinker_alloc() as well per Qi Zheng]
        Link: https://lkml.kernel.org/r/822387b7-4895-4e64-5806-0f56b5d6c447@bytedance.com
      Link: https://lkml.kernel.org/r/ZSBue-3kM6gI6jCr@mainframe
      Fixes: c42d50ae ("mm: shrinker: add infrastructure for dynamically allocating shrinker")
      Signed-off-by: Lucy Mielke <lucymielke@icloud.com>
      Cc: Qi Zheng <zhengqi.arch@bytedance.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/thp: fix "mm: thp: kill __transhuge_page_enabled()" · 7a81751f
      Zach O'Keefe authored
      The 6.0 commits:
      
      commit 9fec5168 ("mm: thp: kill transparent_hugepage_active()")
      commit 7da4e2cb ("mm: thp: kill __transhuge_page_enabled()")
      
      merged "can we have THPs in this VMA?" logic that was previously done
      separately by fault-path, khugepaged, and smaps "THPeligible" checks.
      
      During the process, the semantics of the fault path check changed in two
      ways:
      
      1) A VM_NO_KHUGEPAGED check was introduced (also added to smaps path).
      2) We no longer checked if non-anonymous memory had a vm_ops->huge_fault
         handler that could satisfy the fault.  Previously, this check had been
         done in create_huge_pud() and create_huge_pmd() routines, but after
         the changes, we never reach those routines.
      
      During the review of the above commits, it was determined that in-tree
      users weren't affected by the change; most notably, since the only
      relevant user (in terms of THP) of VM_MIXEDMAP or ->huge_fault is DAX,
      which is explicitly approved early in approval logic.  However, this was a
      bad assumption to make as it assumes the only reason to support
      ->huge_fault was for DAX (which is not true in general).
      
      Remove the VM_NO_KHUGEPAGED check when not in collapse path and give any
      ->huge_fault handler a chance to handle the fault.  Note that we don't
      validate the file mode or mapping alignment, which is consistent with the
      behavior before the aforementioned commits.
      
      Link: https://lkml.kernel.org/r/20230925200110.1979606-1-zokeefe@google.com
      Fixes: 7da4e2cb ("mm: thp: kill __transhuge_page_enabled()")
      Reported-by: Saurabh Singh Sengar <ssengar@microsoft.com>
      Signed-off-by: Zach O'Keefe <zokeefe@google.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Ryan Roberts <ryan.roberts@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • selftests: add a selftest to verify hugetlb usage in memcg · c0dddb7a
      Nhat Pham authored
      This patch adds a new kselftest to demonstrate and verify the new hugetlb
      memcg accounting behavior.
      
      Link: https://lkml.kernel.org/r/20231006184629.155543-5-nphamcs@gmail.com
      Signed-off-by: Nhat Pham <nphamcs@gmail.com>
      Cc: Frank van der Linden <fvdl@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Tejun heo <tj@kernel.org>
      Cc: Yosry Ahmed <yosryahmed@google.com>
      Cc: Zefan Li <lizefan.x@bytedance.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • hugetlb: memcg: account hugetlb-backed memory in memory controller · 8cba9576
      Nhat Pham authored
      Currently, hugetlb memory usage is not accounted for in the memory
      controller, which could lead to memory overprotection for cgroups with
      hugetlb-backed memory.  This has been observed in our production system.
      
      For instance, here is one of our use cases: suppose there are two 32G
      containers.  The machine is booted with hugetlb_cma=6G, and each container
      may or may not use up to 3 gigantic pages, depending on the workload within
      it.  The rest is anon, cache, slab, etc.  We can set the hugetlb cgroup
      limit of each cgroup to 3G to enforce hugetlb fairness.  But it is very
      difficult to configure memory.max to keep overall consumption, including
      anon, cache, slab, etc., fair.
      
      What we have had to resort to is to constantly poll hugetlb usage and
      readjust memory.max.  A similar procedure is applied to other memory limits
      (memory.low, for example).  However, this is rather cumbersome and buggy.
      Furthermore, when there is a delay in correcting the memory limits (for
      example, when hugetlb usage changes between consecutive runs of the
      userspace agent), the system could be in an over/underprotected state.
      
      This patch rectifies this issue by charging the memcg when the hugetlb
      folio is utilized, and uncharging when the folio is freed (analogous to
      the hugetlb controller).  Note that we do not charge when the folio is
      allocated to the hugetlb pool, because at this point it is not owned by
      any memcg.
      
      Some caveats to consider:
        * This feature is only available on cgroup v2.
        * There is no hugetlb pool management involved in the memory
          controller. As stated above, hugetlb folios are only charged towards
          the memory controller when they are used. Host overcommit management
          has to consider it when configuring hard limits.
        * Failure to charge towards the memcg results in SIGBUS. This could
          happen even if the hugetlb pool still has pages (but the cgroup
          limit is hit and reclaim attempt fails).
        * When this feature is enabled, hugetlb pages contribute to memory
          reclaim protection. low, min limits tuning must take into account
          hugetlb memory.
        * Hugetlb pages utilized while this option is not selected will not
          be tracked by the memory controller (even if cgroup v2 is remounted
          later on).
      
      Link: https://lkml.kernel.org/r/20231006184629.155543-4-nphamcs@gmail.com
      Signed-off-by: Nhat Pham <nphamcs@gmail.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Frank van der Linden <fvdl@google.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Tejun heo <tj@kernel.org>
      Cc: Yosry Ahmed <yosryahmed@google.com>
      Cc: Zefan Li <lizefan.x@bytedance.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • memcontrol: only transfer the memcg data for migration · 85ce2c51
      Nhat Pham authored
      For most migration use cases, only transfer the memcg data from the old
      folio to the new folio, and clear the old folio's memcg data.  No charging
      and uncharging will be done.
      
      This shaves off some work on the migration path, and avoids the temporary
      double charging of a folio during its migration.
      
      The only exception is replace_page_cache_folio(), which will use the old
      mem_cgroup_migrate() (now renamed to mem_cgroup_replace_folio).  In that
      context, the isolation of the old page isn't quite as thorough as with
      migration, so we cannot use our new implementation directly.
      
      This patch is the result of the following discussion on the new hugetlb
      memcg accounting behavior:
      
      https://lore.kernel.org/lkml/20231003171329.GB314430@monkey/
      
      Link: https://lkml.kernel.org/r/20231006184629.155543-3-nphamcs@gmail.com
      Signed-off-by: Nhat Pham <nphamcs@gmail.com>
      Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Frank van der Linden <fvdl@google.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Tejun heo <tj@kernel.org>
      Cc: Yosry Ahmed <yosryahmed@google.com>
      Cc: Zefan Li <lizefan.x@bytedance.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • memcontrol: add helpers for hugetlb memcg accounting · 4b569387
      Nhat Pham authored
      Patch series "hugetlb memcg accounting", v4.
      
      Currently, hugetlb memory usage is not accounted for in the memory
      controller, which could lead to memory overprotection for cgroups with
      hugetlb-backed memory.  This has been observed in our production system.
      
      For instance, here is one of our use cases: suppose there are two 32G
      containers.  The machine is booted with hugetlb_cma=6G, and each container
      may or may not use up to 3 gigantic pages, depending on the workload within
      it.  The rest is anon, cache, slab, etc.  We can set the hugetlb cgroup
      limit of each cgroup to 3G to enforce hugetlb fairness.  But it is very
      difficult to configure memory.max to keep overall consumption, including
      anon, cache, slab, etc., fair.
      
      What we have had to resort to is to constantly poll hugetlb usage and
      readjust memory.max.  A similar procedure is applied to other memory limits
      (memory.low, for example).  However, this is rather cumbersome and buggy.
      Furthermore, when there is a delay in correcting the memory limits (for
      example, when hugetlb usage changes between consecutive runs of the
      userspace agent), the system could be in an over/underprotected state.
      
      This patch series rectifies this issue by charging the memcg when the
      hugetlb folio is allocated, and uncharging when the folio is freed.  In
      addition, a new selftest is added to demonstrate and verify this new
      behavior.
      
      
      This patch (of 4):
      
      This patch exposes charge committing and cancelling as parts of the memory
      controller interface.  These functionalities are useful when the
      try_charge() and commit_charge() stages have to be separated by other
      actions in between (which can fail).  One such example is the new hugetlb
      accounting behavior in the following patch.
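
A toy model of a two-stage charge separated by a step that can fail. It is a sketch under stated assumptions: `struct toy_memcg`, `struct toy_folio` and every `toy_` function are hypothetical and do not reflect the memory controller's real data structures or signatures.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical counter standing in for a memcg; usage never exceeds limit. */
struct toy_memcg {
	unsigned long usage;	/* pages charged or reserved */
	unsigned long limit;
};

/* Hypothetical folio that remembers which memcg owns it. */
struct toy_folio {
	struct toy_memcg *memcg;
};

/* Stage 1: reserve nr pages, failing if the limit would be exceeded. */
bool toy_try_charge(struct toy_memcg *cg, unsigned long nr)
{
	if (nr > cg->limit - cg->usage)
		return false;
	cg->usage += nr;
	return true;
}

/* Stage 2a: the step in between failed, so return the reservation. */
void toy_cancel_charge(struct toy_memcg *cg, unsigned long nr)
{
	cg->usage -= nr;
}

/* Stage 2b: the step in between succeeded, so bind the folio to the memcg. */
void toy_commit_charge(struct toy_memcg *cg, struct toy_folio *folio)
{
	folio->memcg = cg;
}

/* Charge first, take a page from the (simulated) pool, then commit. */
int toy_alloc_folio(struct toy_memcg *cg, struct toy_folio *folio,
		    unsigned long nr, bool pool_has_page)
{
	if (!toy_try_charge(cg, nr))
		return -1;
	if (!pool_has_page) {
		toy_cancel_charge(cg, nr);
		return -1;
	}
	toy_commit_charge(cg, folio);
	return 0;
}
```

Splitting the charge this way means a failure in the middle step leaves the counter exactly as it was before the attempt.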
      
      The patch also adds a helper function to obtain a reference to the
      current task's memcg.
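
      The separation of the charge stages described above can be modelled in
      a few lines of userspace C.  This is only a sketch under stated
      assumptions: struct counter and the model_* functions are hypothetical
      stand-ins invented for illustration, not the kernel's memcg interface,
      and the alloc_ok flag stands in for whatever fallible step sits between
      the two stages.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model of a charge counter; not a kernel struct. */
struct counter {
	size_t usage;		/* pages reserved or committed so far */
	size_t limit;		/* maximum number of pages allowed */
	size_t committed;	/* pages whose charge has been committed */
};

/* Stage 1: reserve nr pages against the limit; fail if it would overflow. */
static bool model_try_charge(struct counter *c, size_t nr)
{
	if (nr > c->limit - c->usage)
		return false;
	c->usage += nr;
	return true;
}

/* Stage 2a: the step in between succeeded, so make the charge permanent. */
static void model_commit_charge(struct counter *c, size_t nr)
{
	c->committed += nr;
}

/* Stage 2b: the step in between failed, so hand the reservation back. */
static void model_cancel_charge(struct counter *c, size_t nr)
{
	assert(c->usage >= nr);
	c->usage -= nr;
}

/* Charge nr pages around a fallible step (alloc_ok models its outcome). */
static bool model_alloc_charged(struct counter *c, size_t nr, bool alloc_ok)
{
	if (!model_try_charge(c, nr))
		return false;
	if (!alloc_ok) {
		model_cancel_charge(c, nr);
		return false;
	}
	model_commit_charge(c, nr);
	return true;
}
```

      The point of the split is that a failure between the two stages leaves
      the counter exactly where it started.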
      
      Link: https://lkml.kernel.org/r/20231006184629.155543-1-nphamcs@gmail.com
      Link: https://lkml.kernel.org/r/20231006184629.155543-2-nphamcs@gmail.com
      Signed-off-by: Nhat Pham <nphamcs@gmail.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Frank van der Linden <fvdl@google.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Yosry Ahmed <yosryahmed@google.com>
      Cc: Zefan Li <lizefan.x@bytedance.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      4b569387
    • Frank van der Linden's avatar
      mm, hugetlb: remove HUGETLB_CGROUP_MIN_ORDER · 59838b25
      Frank van der Linden authored
      Originally, hugetlb_cgroup was the only hugetlb user of tail page
      structure fields.  So, the code defined and checked against
      HUGETLB_CGROUP_MIN_ORDER to make sure pages weren't too small to use.
      
      However, by now, tail page #2 is used to store hugetlb hwpoison and
      subpool information as well.  In other words, without that tail page
      hugetlb doesn't work.
      
      Acknowledge this fact by getting rid of HUGETLB_CGROUP_MIN_ORDER and
      checks against it.  Instead, just check for the minimum viable page order
      at hstate creation time.
      
      Link: https://lkml.kernel.org/r/20231004153248.3842997-1-fvdl@google.com
      Signed-off-by: Frank van der Linden <fvdl@google.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Muchun Song <muchun.song@linux.dev>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      59838b25
    • Matthew Wilcox (Oracle)'s avatar
      mm: use folio_xor_flags_has_waiters() in folio_end_writeback() · 2580d554
      Matthew Wilcox (Oracle) authored
      Match how folio_unlock() works by combining the test for PG_waiters with
      the clearing of PG_writeback.  This should have a small performance win,
      and removes the last user of folio_wake().
      
      Link: https://lkml.kernel.org/r/20231004165317.1061855-18-willy@infradead.org
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Albert Ou <aou@eecs.berkeley.edu>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Palmer Dabbelt <palmer@dabbelt.com>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Richard Henderson <richard.henderson@linaro.org>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      2580d554
    • Matthew Wilcox (Oracle)'s avatar
      mm: make __end_folio_writeback() return void · 7d0795d0
      Matthew Wilcox (Oracle) authored
      Rather than check the result of test-and-clear, just check that we have
      the writeback bit set at the start.  This wouldn't catch every case, but
      it's good enough (and enables the next patch).
      
      Link: https://lkml.kernel.org/r/20231004165317.1061855-17-willy@infradead.org
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Albert Ou <aou@eecs.berkeley.edu>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Palmer Dabbelt <palmer@dabbelt.com>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Richard Henderson <richard.henderson@linaro.org>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      7d0795d0
    • Matthew Wilcox (Oracle)'s avatar
      mm: add folio_xor_flags_has_waiters() · 0410cd84
      Matthew Wilcox (Oracle) authored
      Optimise folio_end_read() by setting the uptodate bit at the same time we
      clear the unlock bit.  This saves at least one memory barrier and one
      write-after-write hazard.
      
      Link: https://lkml.kernel.org/r/20231004165317.1061855-16-willy@infradead.org
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Albert Ou <aou@eecs.berkeley.edu>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Palmer Dabbelt <palmer@dabbelt.com>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Richard Henderson <richard.henderson@linaro.org>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      0410cd84
    • Matthew Wilcox (Oracle)'s avatar
      mm: delete checks for xor_unlock_is_negative_byte() · f12fb73b
      Matthew Wilcox (Oracle) authored
      Architectures which don't define their own use the one in
      asm-generic/bitops/lock.h.  Get rid of all the ifdefs around "maybe we
      don't have it".
      
      Link: https://lkml.kernel.org/r/20231004165317.1061855-15-willy@infradead.org
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Albert Ou <aou@eecs.berkeley.edu>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Palmer Dabbelt <palmer@dabbelt.com>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Richard Henderson <richard.henderson@linaro.org>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      f12fb73b
    • Matthew Wilcox (Oracle)'s avatar
      s390: implement arch_xor_unlock_is_negative_byte · 12010aa8
      Matthew Wilcox (Oracle) authored
      Inspired by the s390 arch_test_and_clear_bit(), this will surely be more
      efficient than the generic one defined in filemap.c.
      
      Link: https://lkml.kernel.org/r/20231004165317.1061855-14-willy@infradead.org
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Albert Ou <aou@eecs.berkeley.edu>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Palmer Dabbelt <palmer@dabbelt.com>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Richard Henderson <richard.henderson@linaro.org>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      12010aa8
    • Matthew Wilcox (Oracle)'s avatar
      riscv: implement xor_unlock_is_negative_byte · 2a667285
      Matthew Wilcox (Oracle) authored
      Inspired by the riscv clear_bit_unlock(), this will surely be
      more efficient than the generic one defined in filemap.c.
      
      Link: https://lkml.kernel.org/r/20231004165317.1061855-13-willy@infradead.org
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Albert Ou <aou@eecs.berkeley.edu>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Palmer Dabbelt <palmer@dabbelt.com>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Richard Henderson <richard.henderson@linaro.org>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      2a667285
    • Matthew Wilcox (Oracle)'s avatar
      powerpc: implement arch_xor_unlock_is_negative_byte on 32-bit · 51a752c2
      Matthew Wilcox (Oracle) authored
      Simply remove the ifdef.  The assembly is identical to that in the
      non-optimised case of test_and_clear_bits() on PPC32, and it's not clear
      to me how the PPC32 optimisation works, nor whether it would work for
      arch_xor_unlock_is_negative_byte().  If that optimisation would work,
      someone can implement it later, but this is more efficient than the
      implementation in filemap.c.
      
      Link: https://lkml.kernel.org/r/20231004165317.1061855-12-willy@infradead.org
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Albert Ou <aou@eecs.berkeley.edu>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Palmer Dabbelt <palmer@dabbelt.com>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Richard Henderson <richard.henderson@linaro.org>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      51a752c2
    • Matthew Wilcox (Oracle)'s avatar
      mips: implement xor_unlock_is_negative_byte · 8da36b26
      Matthew Wilcox (Oracle) authored
      Inspired by the mips test_and_change_bit(), this will surely be more
      efficient than the generic one defined in filemap.c.
      
      Link: https://lkml.kernel.org/r/20231004165317.1061855-11-willy@infradead.org
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Albert Ou <aou@eecs.berkeley.edu>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Palmer Dabbelt <palmer@dabbelt.com>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Richard Henderson <richard.henderson@linaro.org>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      8da36b26
    • Matthew Wilcox (Oracle)'s avatar
      m68k: implement xor_unlock_is_negative_byte · ea845e31
      Matthew Wilcox (Oracle) authored
      Using EOR to clear the guaranteed-to-be-set lock bit will test the
      negative flag just like the x86 implementation.  This should be more
      efficient than the generic implementation in filemap.c.  It would be
      better if m68k had __GCC_ASM_FLAG_OUTPUTS__.
      
      Coldfire doesn't have a byte-sized EOR, so we test bit 7 after the EOR,
      which is a second memory access, but it's slightly better than the current
      C code.
      
      Link: https://lkml.kernel.org/r/20231004165317.1061855-10-willy@infradead.org
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Albert Ou <aou@eecs.berkeley.edu>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Palmer Dabbelt <palmer@dabbelt.com>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Richard Henderson <richard.henderson@linaro.org>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      ea845e31
    • Matthew Wilcox (Oracle)'s avatar
      alpha: implement xor_unlock_is_negative_byte · e28ff5dc
      Matthew Wilcox (Oracle) authored
      Inspired by the alpha clear_bit() and arch_atomic_add_return(), this will
      surely be more efficient than the generic one defined in filemap.c.
      
      Link: https://lkml.kernel.org/r/20231004165317.1061855-9-willy@infradead.org
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Albert Ou <aou@eecs.berkeley.edu>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Palmer Dabbelt <palmer@dabbelt.com>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Richard Henderson <richard.henderson@linaro.org>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      e28ff5dc
    • Matthew Wilcox (Oracle)'s avatar
      bitops: add xor_unlock_is_negative_byte() · 247dbcdb
      Matthew Wilcox (Oracle) authored
      Replace clear_bit_and_unlock_is_negative_byte() with
      xor_unlock_is_negative_byte().  We have a few places that like to lock a
      folio, set a flag and unlock it again.  Allow for the possibility of
      combining the latter two operations for efficiency.  We are guaranteed
      that the caller holds the lock, so it is safe to unlock it with the xor. 
      The caller must guarantee that nobody else will set the flag without
      holding the lock; it is not safe to do this with the PG_dirty flag, for
      example.
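
      The operation described above can be sketched in portable C11 atomics.
      This is a userspace model only: the model_* name and the MODEL_* bit
      definitions are assumptions made for illustration (the waiters flag is
      placed at bit 7 so that it is the sign bit of the low byte), and the
      per-architecture versions in this series use assembly instead.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define MODEL_LOCKED	(1UL << 0)	/* lock bit, held by the caller */
#define MODEL_WAITERS	(1UL << 7)	/* sign bit of the low byte */

/*
 * XOR mask into *flags with release ordering and report whether the
 * waiters bit was set beforehand.  mask must contain MODEL_LOCKED, and
 * every other bit in mask must only ever change while the lock is held,
 * otherwise the XOR could flip a bit the wrong way.
 */
static bool model_xor_unlock_is_negative_byte(unsigned long mask,
					       atomic_ulong *flags)
{
	unsigned long old;

	assert(mask & MODEL_LOCKED);
	old = atomic_fetch_xor_explicit(flags, mask, memory_order_release);
	return (old & MODEL_WAITERS) != 0;
}
```

      Because the lock bit is known to be set on entry, XORing it is
      equivalent to clearing it, which is what makes a single
      read-modify-write sufficient.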
      
      Link: https://lkml.kernel.org/r/20231004165317.1061855-8-willy@infradead.org
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Albert Ou <aou@eecs.berkeley.edu>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Palmer Dabbelt <palmer@dabbelt.com>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Richard Henderson <richard.henderson@linaro.org>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      247dbcdb
    • Matthew Wilcox (Oracle)'s avatar
      iomap: use folio_end_read() · 7a4847e5
      Matthew Wilcox (Oracle) authored
      Combine the setting of the uptodate flag with the clearing of the locked
      flag.
      
      Link: https://lkml.kernel.org/r/20231004165317.1061855-7-willy@infradead.org
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Albert Ou <aou@eecs.berkeley.edu>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Palmer Dabbelt <palmer@dabbelt.com>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Richard Henderson <richard.henderson@linaro.org>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      7a4847e5
    • Matthew Wilcox (Oracle)'s avatar
      buffer: use folio_end_read() · 6ba924d3
      Matthew Wilcox (Oracle) authored
      There are two places that we can use this new helper.
      
      Link: https://lkml.kernel.org/r/20231004165317.1061855-6-willy@infradead.org
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Albert Ou <aou@eecs.berkeley.edu>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Palmer Dabbelt <palmer@dabbelt.com>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Richard Henderson <richard.henderson@linaro.org>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      6ba924d3
    • Matthew Wilcox (Oracle)'s avatar
      ext4: use folio_end_read() · f8174a11
      Matthew Wilcox (Oracle) authored
      folio_end_read() is the perfect fit for ext4.
      
      Link: https://lkml.kernel.org/r/20231004165317.1061855-5-willy@infradead.org
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Albert Ou <aou@eecs.berkeley.edu>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Palmer Dabbelt <palmer@dabbelt.com>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Richard Henderson <richard.henderson@linaro.org>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      f8174a11
    • Matthew Wilcox (Oracle)'s avatar
      mm: add folio_end_read() · 0b237047
      Matthew Wilcox (Oracle) authored
      Provide a function for filesystems to call when they have finished reading
      an entire folio.
      
      Link: https://lkml.kernel.org/r/20231004165317.1061855-4-willy@infradead.org
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Albert Ou <aou@eecs.berkeley.edu>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Palmer Dabbelt <palmer@dabbelt.com>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Richard Henderson <richard.henderson@linaro.org>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      0b237047
    • Matthew Wilcox (Oracle)'s avatar
      iomap: protect read_bytes_pending with the state_lock · f45b494e
      Matthew Wilcox (Oracle) authored
      Perform one atomic operation (acquiring the spinlock) instead of two
      (spinlock & atomic_sub) per read completion.
      
      Link: https://lkml.kernel.org/r/20231004165317.1061855-3-willy@infradead.org
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Albert Ou <aou@eecs.berkeley.edu>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Palmer Dabbelt <palmer@dabbelt.com>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Richard Henderson <richard.henderson@linaro.org>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      f45b494e
    • Matthew Wilcox (Oracle)'s avatar
      iomap: hold state_lock over call to ifs_set_range_uptodate() · 279d5fc3
      Matthew Wilcox (Oracle) authored
      Patch series "Add folio_end_read", v2.
      
      The core of this patchset is the new folio_end_read() call which
      filesystems can use when finishing a page cache read instead of separate
      calls to mark the folio uptodate and unlock it.  As an illustration of its
      use, I converted ext4, iomap & mpage; more can be converted.
      
      I think that's useful by itself, but the interesting optimisation is that
      we can implement that with a single XOR instruction that sets the uptodate
      bit, clears the lock bit, tests the waiter bit and provides a write memory
      barrier.  That removes one memory barrier and one atomic instruction from
      each page read, which seems worth doing.  That's in patch 15.
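
      As a rough illustration of that single-XOR idea, the sketch below models
      the end of a read in userspace C11.  The sk_end_read() name and the
      SK_* bit positions are hypothetical and chosen only for this example;
      the real helper also wakes the waiters, which is left to the caller
      here.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define SK_LOCKED	(1UL << 0)	/* set while the read is in flight */
#define SK_UPTODATE	(1UL << 3)	/* clear until the read succeeds */
#define SK_WAITERS	(1UL << 7)	/* someone is sleeping on the lock */

/*
 * Finish a read with one atomic XOR: drop the lock and, on success, set
 * the uptodate bit in the same operation.  The uptodate bit is known to be
 * clear while the read is in flight, so XORing it sets it.  Returns true
 * if the caller needs to wake waiters.
 */
static bool sk_end_read(atomic_ulong *flags, bool success)
{
	unsigned long mask = SK_LOCKED;
	unsigned long old;

	if (success)
		mask |= SK_UPTODATE;
	old = atomic_fetch_xor_explicit(flags, mask, memory_order_release);
	assert(old & SK_LOCKED);	/* the caller must have held the lock */
	return (old & SK_WAITERS) != 0;
}
```

      The release ordering on the XOR plays the role of the write barrier
      mentioned above: the data written into the folio becomes visible before
      the uptodate bit does.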
      
      The last two patches could be a separate series, but basically we can do
      the same thing with the writeback flag that we do with the unlock flag;
      clear it and test the waiters bit at the same time.
      
      
      This patch (of 17):
      
      This is really preparation for the next patch, but it lets us call
      folio_mark_uptodate() in just one place instead of two.
      
      Link: https://lkml.kernel.org/r/20231004165317.1061855-1-willy@infradead.org
      Link: https://lkml.kernel.org/r/20231004165317.1061855-2-willy@infradead.org
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: Richard Henderson <richard.henderson@linaro.org>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Palmer Dabbelt <palmer@dabbelt.com>
      Cc: Albert Ou <aou@eecs.berkeley.edu>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      279d5fc3
    • Breno Leitao's avatar
      selftests/mm: add a new test for madv and hugetlb · 116d5730
      Breno Leitao authored
      Create a selftest that exercises the race between page faults and
      madvise(MADV_DONTNEED) in the same huge page.  Do it by running two
      threads at the same time, one that touches the huge page and one that
      calls madvise(MADV_DONTNEED) on it.
      
      In case of a SIGBUS coming at pagefault, the test should fail, since we
      hit the bug.
      
      The test doesn't have a signal handler, so if it fails, it fails like
      the following:
      
        ----------------------------------
        running ./hugetlb_fault_after_madv
        ----------------------------------
        ./run_vmtests.sh: line 186: 595563 Bus error    (core dumped) "$@"
        [FAIL]
      
      This selftest goes together with the fix of the bug[1] itself.
      
      [1] https://lore.kernel.org/all/20231001005659.2185316-1-riel@surriel.com/#r
      
      Link: https://lkml.kernel.org/r/20231005163922.87568-3-leitao@debian.org
      Signed-off-by: Breno Leitao <leitao@debian.org>
      Reviewed-by: Rik van Riel <riel@surriel.com>
      Tested-by: Rik van Riel <riel@surriel.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Muchun Song <muchun.song@linux.dev>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      116d5730
    • Breno Leitao's avatar
      selftests/mm: export get_free_hugepages() · c8b90731
      Breno Leitao authored
      Patch series "New selftest for mm", v2.
      
      This is a simple test case that reproduces an mm problem[1], where a page
      fault races with madvise(), and it is not trivial to reproduce and debug.
      
      This test case aims to prevent such race problems from recurring and
      impacting workloads that leverage external allocators, such as tcmalloc,
      jemalloc, etc.
      
      [1] https://lore.kernel.org/all/20231001005659.2185316-1-riel@surriel.com/#r
      
      
      This patch (of 2):
      
      get_free_hugepages() is helpful for other hugepage tests.  Export it to
      the common file (vm_util.c) to be reused.
      
      Link: https://lkml.kernel.org/r/20231005163922.87568-1-leitao@debian.org
      Link: https://lkml.kernel.org/r/20231005163922.87568-2-leitao@debian.org
      Signed-off-by: Breno Leitao <leitao@debian.org>
      Reviewed-by: Rik van Riel <riel@surriel.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Muchun Song <muchun.song@linux.dev>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      c8b90731
    • Mark-PK Tsai's avatar
      zsmalloc: use copy_page for full page copy · afb2d666
      Mark-PK Tsai authored
      Some architectures, such as arm, have implemented an optimized copy_page
      for full page copying.

      On my arm platform, using the copy_page helper for single page copying is
      about 10 percent faster than memcpy.
      
      Link: https://lkml.kernel.org/r/20231006060245.7411-1-mark-pk.tsai@mediatek.com
      Signed-off-by: Mark-PK Tsai <mark-pk.tsai@mediatek.com>
      Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
      Cc: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com>
      Cc: Matthias Brugger <matthias.bgg@gmail.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: YJ Chiang <yj.chiang@mediatek.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      afb2d666
    • Pankaj Raghav's avatar
      filemap: call filemap_get_folios_tag() from filemap_get_folios() · bafd7e9d
      Pankaj Raghav authored
      filemap_get_folios() is filemap_get_folios_tag() with XA_PRESENT as the
      tag that is being matched.  Return filemap_get_folios_tag() with
      XA_PRESENT as the tag instead of duplicating the code in
      filemap_get_folios().
      
      No functional changes.
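
      The shape of this refactoring can be shown with a small stand-alone
      model.  Everything in it is hypothetical and exists only for
      illustration: struct sk_entry, the sk_* functions and the
      SK_TAG_PRESENT sentinel are not the page cache API, they merely mimic
      "the untagged lookup is the tagged lookup with a match-everything tag".

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sentinel meaning "match every present entry". */
#define SK_TAG_PRESENT	0xffu

struct sk_entry {
	int present;		/* non-zero if the slot holds an entry */
	unsigned int tags;	/* bitmask of the tags set on the entry */
};

/* Collect the indices of present entries that carry the given tag. */
static size_t sk_get_entries_tag(const struct sk_entry *e, size_t n,
				 unsigned int tag, size_t *out, size_t max)
{
	size_t found = 0;

	assert(e != NULL && out != NULL);
	for (size_t i = 0; i < n && found < max; i++) {
		if (!e[i].present)
			continue;
		if (tag != SK_TAG_PRESENT && !(e[i].tags & (1u << tag)))
			continue;
		out[found++] = i;
	}
	return found;
}

/* The untagged lookup simply delegates instead of duplicating the loop. */
static size_t sk_get_entries(const struct sk_entry *e, size_t n,
			     size_t *out, size_t max)
{
	return sk_get_entries_tag(e, n, SK_TAG_PRESENT, out, max);
}
```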
      
      Link: https://lkml.kernel.org/r/20231006110120.136809-1-kernel@pankajraghav.com
      Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
      Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      bafd7e9d