1. 25 Oct, 2023 19 commits
  2. 18 Oct, 2023 21 commits
    • mm: perform the mapping_map_writable() check after call_mmap() · 15897894
      Lorenzo Stoakes authored
      In order for a F_SEAL_WRITE sealed memfd mapping to have an opportunity to
      clear VM_MAYWRITE, we must be able to invoke the appropriate
      vm_ops->mmap() handler to do so.  We would otherwise fail the
      mapping_map_writable() check before we had the opportunity to avoid it.
      
      This patch moves this check after the call_mmap() invocation.  Only memfd
      actively denies write access causing a potential failure here (in
      memfd_add_seals()), so there should be no impact on non-memfd cases.
      
      This patch makes the userland-visible change that MAP_SHARED, PROT_READ
      mappings of an F_SEAL_WRITE sealed memfd will now succeed.
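
      As an illustration of the userland-visible behaviour, here is a minimal
      sketch in C, assuming a Linux system that provides memfd_create(2).  The
      helper name map_sealed_memfd() is made up for this example and is not
      part of the patch.  It write-seals a memfd and then attempts a MAP_SHARED
      mapping with the requested protection, returning 0 on success or the
      errno of the step that failed.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE /* for memfd_create() and the F_SEAL_* constants */
#endif
#include <errno.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from the patch): create a memfd, apply
 * F_SEAL_WRITE, then try a MAP_SHARED mapping with the given protection.
 * Returns 0 on success, otherwise the errno of the step that failed.
 */
static int map_sealed_memfd(int prot)
{
	long page = sysconf(_SC_PAGESIZE);
	int fd, err = 0;
	void *ptr;

	fd = memfd_create("sealed-example", MFD_ALLOW_SEALING);
	if (fd < 0)
		return errno;

	/* Give the file one page, then forbid all further writes. */
	if (ftruncate(fd, page) < 0 ||
	    fcntl(fd, F_ADD_SEALS, F_SEAL_WRITE) < 0) {
		err = errno;
		goto out;
	}

	ptr = mmap(NULL, (size_t)page, prot, MAP_SHARED, fd, 0);
	if (ptr == MAP_FAILED)
		err = errno;
	else
		munmap(ptr, (size_t)page);
out:
	close(fd);
	return err;
}
```

      Calling map_sealed_memfd(PROT_READ) is expected to return 0 on kernels
      carrying this series and EPERM on older ones, while
      map_sealed_memfd(PROT_READ | PROT_WRITE) should return EPERM either way.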
      
      There is a delicate situation with cleanup paths assuming that a writable
      mapping must have occurred in circumstances where it may now not have.  To
      ensure we do not accidentally mark a writable file unwritable, we
      explicitly track whether we have a writable mapping and unmap only if we
      do.
      
      [lstoakes@gmail.com: do not set writable_file_mapping in inappropriate case]
        Link: https://lkml.kernel.org/r/c9eb4cc6-7db4-4c2b-838d-43a0b319a4f0@lucifer.local
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=217238
      Link: https://lkml.kernel.org/r/55e413d20678a1bb4c7cce889062bbb07b0df892.1697116581.git.lstoakes@gmail.com
      Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Muchun Song <muchun.song@linux.dev>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: update memfd seal write check to include F_SEAL_WRITE · 28464bbb
      Lorenzo Stoakes authored
      The seal_check_future_write() function is called by shmem_mmap() or
      hugetlbfs_file_mmap() to disallow any future writable mappings of a memfd
      sealed this way.
      
      The F_SEAL_WRITE flag is not checked here, as that is handled via the
      mapping->i_mmap_writable mechanism and so any attempt at a mapping would
      fail before this could be run.
      
      However, we intend to change this, meaning this check can be performed for
      F_SEAL_WRITE mappings also.
      
      The logic here is equally applicable to both flags, so update this
      function to accommodate both and rename it accordingly.
      
      Link: https://lkml.kernel.org/r/913628168ce6cce77df7d13a63970bae06a526e0.1697116581.git.lstoakes@gmail.com
      Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Muchun Song <muchun.song@linux.dev>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: drop the assumption that VM_SHARED always implies writable · e8e17ee9
      Lorenzo Stoakes authored
      Patch series "permit write-sealed memfd read-only shared mappings", v4.
      
      The man page for fcntl() describing memfd file seals states the following
      about F_SEAL_WRITE:-
      
          Furthermore, trying to create new shared, writable memory-mappings via
          mmap(2) will also fail with EPERM.
      
      With emphasis on 'writable'.  It turns out in fact that currently the
      kernel simply disallows all new shared memory mappings for a memfd with
      F_SEAL_WRITE applied, rendering this documentation inaccurate.
      
      This matters because users are therefore unable to obtain a shared mapping
      to a memfd after write sealing altogether, which limits their usefulness. 
      This was reported in the discussion thread [1] originating from a bug
      report [2].
      
      This is a product of both using the struct address_space->i_mmap_writable
      atomic counter to determine whether writing may be permitted, and the
      kernel adjusting this counter when any VM_SHARED mapping is performed and
      more generally implicitly assuming VM_SHARED implies writable.
      
      It seems sensible that we should only update this counter if VM_MAYWRITE
      is specified, i.e. if it is possible that this mapping could at any
      point be written to.
      
      If we do so then all we need to do to permit write seals to function as
      documented is to clear VM_MAYWRITE when mapping read-only.  It turns out
      this functionality already exists for F_SEAL_FUTURE_WRITE - we can
      therefore simply adapt this logic to do the same for F_SEAL_WRITE.
      
      We then hit a chicken and egg situation in mmap_region() where the check
      for VM_MAYWRITE occurs before we are able to clear this flag.  To work
      around this, perform this check after we invoke call_mmap(), with careful
      consideration of error paths.
      
      Thanks to Andy Lutomirski for the suggestion!
      
      [1]:https://lore.kernel.org/all/20230324133646.16101dfa666f253c4715d965@linux-foundation.org/
      [2]:https://bugzilla.kernel.org/show_bug.cgi?id=217238
      
      
      This patch (of 3):
      
      There is a general assumption that VMAs with the VM_SHARED flag set are
      writable.  If the VM_MAYWRITE flag is not set, then this is simply not the
      case.
      
      Update those checks which affect the struct address_space->i_mmap_writable
      field to explicitly test for this by introducing
      [vma_]is_shared_maywrite() helper functions.
      
      This remains entirely conservative, as the lack of VM_MAYWRITE guarantees
      that the VMA cannot be written to.
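
      The check described above can be sketched in standalone C as follows.
      This is an illustration rather than the patch itself: the flag values and
      struct toy_vma are stand-ins chosen so the snippet compiles outside the
      kernel, and only the helper names come from the commit message.

```c
#include <stdbool.h>

/* Illustrative stand-ins for the kernel's VMA flag bits. */
#define VM_SHARED   0x08UL
#define VM_MAYWRITE 0x20UL

typedef unsigned long vm_flags_t;

/* Toy VMA carrying only the field this sketch needs (an assumption). */
struct toy_vma {
	vm_flags_t vm_flags;
};

/* True only if the mapping is shared AND could ever be made writable. */
static inline bool is_shared_maywrite(vm_flags_t vm_flags)
{
	return (vm_flags & (VM_SHARED | VM_MAYWRITE)) ==
		(VM_SHARED | VM_MAYWRITE);
}

/* Convenience wrapper taking the VMA rather than raw flags. */
static inline bool vma_is_shared_maywrite(const struct toy_vma *vma)
{
	return is_shared_maywrite(vma->vm_flags);
}
```

      A VM_SHARED mapping with VM_MAYWRITE cleared therefore no longer counts
      towards i_mmap_writable in this model.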
      
      Link: https://lkml.kernel.org/r/cover.1697116581.git.lstoakes@gmail.com
      Link: https://lkml.kernel.org/r/d978aefefa83ec42d18dfa964ad180dbcde34795.1697116581.git.lstoakes@gmail.com
      Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
      Suggested-by: Andy Lutomirski <luto@kernel.org>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Muchun Song <muchun.song@linux.dev>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • Docs/admin-guide/mm/damon/usage: update for tried regions update time interval · bc17ea26
      SeongJae Park authored
      The documentation says the DAMOS tried regions update feature of the DAMON
      sysfs interface does the update for one aggregation interval after the
      request is made.  Since the introduction of the per-scheme apply interval,
      that behavior makes little sense.  Hence the implementation has changed
      to update the regions for each scheme for only its apply interval.
      Update the document as well to reflect the real behavior.
      
      Link: https://lkml.kernel.org/r/20231012192256.33556-4-sj@kernel.org
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/damon/sysfs: avoid empty scheme tried regions for large apply interval · 76126332
      SeongJae Park authored
      DAMON_SYSFS assumes that every scheme will be applied to at least one DAMON
      monitoring results snapshot within one aggregation interval, or that
      waiting for it makes no sense because DAMON is deactivated by the
      watermarks.  The assumption about the deactivated status still holds, but
      the aggregation interval based assumption is now invalid because each
      scheme can have its own apply interval.  For schemes whose apply interval
      is larger than the aggregation or watermarks check interval, a DAMOS tried
      regions update request can finish without the update having happened.
      Avoid the case by explicitly checking the status of the schemes tried
      regions update and the watermarks-based DAMON deactivation.
      
      Link: https://lkml.kernel.org/r/20231012192256.33556-3-sj@kernel.org
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/damon/sysfs-schemes: do not update tried regions more than one DAMON snapshot · 4d4e41b6
      SeongJae Park authored
      Patch series "mm/damon/sysfs-schemes: Do DAMOS tried regions update for
      only one apply interval".
      
      The DAMOS tried regions update feature of the DAMON sysfs interface does
      the update for one aggregation interval after the request is made.  Now
      that the per-scheme apply interval is supported, that behavior makes
      little sense.  That is, the tried regions directory will have regions from
      multiple DAMON monitoring results snapshots, or no regions at all, for
      apply intervals that are much shorter or much longer than the aggregation
      interval, respectively.  Change the behavior to update the regions for
      each scheme for only its apply interval, and update the document.

      Since the DAMOS apply interval is the aggregation interval by default,
      this change makes no visible behavioral difference to old users who don't
      explicitly set the apply intervals.
      
      Patches Sequence
      ----------------
      
      The first two patches make schemes whose apply intervals are much shorter
      or longer than the aggregation interval keep the maximum and minimum
      times for continuing the update.  After the two patches, the update aligns
      with each scheme's apply interval.
      
      Finally, the third patch updates the document to reflect the behavior.
      
      
      This patch (of 3):
      
      DAMON_SYSFS exposes every DAMON-found region that is eligible for applying
      the scheme action for one aggregation interval.  However, each DAMON-based
      operation scheme has its own apply interval.  Hence, for a scheme whose
      apply interval is much smaller than the aggregation interval, DAMON_SYSFS
      will expose scheme regions that were applied to more than one DAMON
      monitoring results snapshot.  Since the purpose of DAMON tried regions is
      to expose a single snapshot, this makes little sense.  Track the progress
      of each scheme's tried regions update and avoid the case.
      
      Link: https://lkml.kernel.org/r/20231012192256.33556-1-sj@kernel.org
      Link: https://lkml.kernel.org/r/20231012192256.33556-2-sj@kernel.org
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • tools/mm: update the usage output to be more organized · d8ea435f
      Audra Mitchell authored
      Organize the usage options alphabetically and improve the description of
      some options.  Also separate the more complicated cull options from the
      single use compare options.
      
      Link: https://lkml.kernel.org/r/20231013190350.579407-6-audra@redhat.com
      Signed-off-by: Audra Mitchell <audra@redhat.com>
      Acked-by: Rafael Aquini <aquini@redhat.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Georgi Djakov <djakov@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • tools/mm: fix the default case for page_owner_sort · c6d5e490
      Audra Mitchell authored
      With the additional commands and timestamps added to the tool, the default
      case (-t) has been broken.  Now that the allocation timestamps are saved
      outside of the txt field, allow us to properly sort the data by number of
      times the record has been seen.  Furthermore prevent the misuse of the
      commandline arguments so only one compare option can be used.
      
      Link: https://lkml.kernel.org/r/20231013190350.579407-5-audra@redhat.com
      Signed-off-by: Audra Mitchell <audra@redhat.com>
      Acked-by: Rafael Aquini <aquini@redhat.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Georgi Djakov <djakov@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • tools/mm: filter out timestamps for correct collation · 63a15062
      Audra Mitchell authored
      With the introduction of allocation timestamps being included in
      page_owner output, each record becomes unique due to the timestamp
      nanosecond granularity.  Remove the check in add_list that tries to
      collate each record during processing as the memcmp() is just additional
      overhead at this point.
      
      Also keep the allocation timestamps, but allow collation to occur without
      consideration of the allocation timestamp except in the case where
      allocation timestamps are requested by the user (the -a option).
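
      The idea of collating records while ignoring the allocation timestamp can
      be sketched as below.  This is not the page_owner_sort implementation:
      strip_alloc_ts() and same_record() are hypothetical helpers, the
      " ts <digits> ns" field layout is inferred from the sample output quoted
      in this series, and the fixed 256-byte buffers would truncate long
      records.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: copy rec into out, dropping the first
 * " ts <digits> ns" field if one is present.
 */
static void strip_alloc_ts(const char *rec, char *out, size_t outsz)
{
	const char *ts = strstr(rec, " ts ");

	if (ts) {
		const char *p = ts + 4;

		while (*p >= '0' && *p <= '9')
			p++;
		/* Require at least one digit followed by " ns". */
		if (p > ts + 4 && strncmp(p, " ns", 3) == 0) {
			snprintf(out, outsz, "%.*s%s",
				 (int)(ts - rec), rec, p + 3);
			return;
		}
	}
	snprintf(out, outsz, "%s", rec);
}

/* Two records collate together if they match once timestamps are removed. */
static bool same_record(const char *a, const char *b)
{
	char ka[256], kb[256];

	strip_alloc_ts(a, ka, sizeof(ka));
	strip_alloc_ts(b, kb, sizeof(kb));
	return strcmp(ka, kb) == 0;
}
```

      With this comparison, two records that differ only in their nanosecond
      timestamp fall into the same bucket, while records of different order or
      origin stay separate.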
      
      Link: https://lkml.kernel.org/r/20231013190350.579407-4-audra@redhat.com
      Signed-off-by: Audra Mitchell <audra@redhat.com>
      Acked-by: Rafael Aquini <aquini@redhat.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Georgi Djakov <djakov@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • tools/mm: remove references to free_ts from page_owner_sort · 0179c628
      Audra Mitchell authored
      With the removal of free timestamps from page_owner output, we no longer
      need to handle this case or the "unreleased" case.  Remove all references
      to both cases.
      
      Link: https://lkml.kernel.org/r/20231013190350.579407-3-audra@redhat.com
      Signed-off-by: Audra Mitchell <audra@redhat.com>
      Acked-by: Rafael Aquini <aquini@redhat.com>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Georgi Djakov <djakov@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/page_owner: remove free_ts from page_owner output · b459f090
      Audra Mitchell authored
      Patch series "Fix page_owner's use of free timestamps".
      
      While page owner output is used to investigate memory utilization,
      typically the allocation pathway, the introduction of timestamps to the
      page owner records caused each record to become unique due to the
      granularity of the nanosecond timestamp (for example):
      
        Page allocated via order 0 ... ts 5206196026 ns, free_ts 5187156703 ns
        Page allocated via order 0 ... ts 5206198540 ns, free_ts 5187162702 ns
      
      Furthermore, the page_owner output only dumps the currently allocated
      records, so having the free timestamps is nonsensical for the typical use
      case.
      
      In addition, the introduction of timestamps was not properly handled in
      the page_owner_sort tool causing most use cases to be broken.  This series
      is meant to remove the free timestamps from the page_owner output and fix
      the page_owner_sort tool so proper collation can occur.
      
      
      This patch (of 5):
      
      When printing page_owner data via the debugfs interface, no free pages will
      ever be dumped due to the series of checks in read_page_owner():
      
          /*
           * Although we do have the info about past allocation of free
           * pages, it's not relevant for current memory usage.
           */
           if (!test_bit(PAGE_EXT_OWNER_ALLOCATED, &page_ext->flags))
      
      The free_ts values are still used when dump_page_owner() is called, so
      keep the field for other use cases but remove it from the typical
      page_owner output.
      
      Link: https://lkml.kernel.org/r/20231013190350.579407-1-audra@redhat.com
      Link: https://lkml.kernel.org/r/20231013190350.579407-2-audra@redhat.com
      Fixes: 866b4852 ("mm/page_owner: record the timestamp of all pages during free")
      Signed-off-by: Audra Mitchell <audra@redhat.com>
      Acked-by: Rafael Aquini <aquini@redhat.com>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Georgi Djakov <djakov@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: abstract VMA merge and extend into vma_merge_extend() helper · 93bf5d4a
      Lorenzo Stoakes authored
      mremap uses vma_merge() in the case where a VMA needs to be extended. This
      can be significantly simplified and abstracted.
      
      This makes it far easier to understand what the actual function is doing,
      avoids future mistakes in use of the confusing vma_merge() function and
      importantly allows us to make future changes to how vma_merge() is
      implemented by knowing explicitly which merge cases each invocation uses.
      
      Note that in the mremap() extend case, we perform this merge only when
      old_len == vma->vm_end - addr. The extension_start, i.e. the start of the
      extended portion of the VMA is equal to addr + old_len, i.e. vma->vm_end.
      
      With this refactoring, vma_merge() is no longer required anywhere except
      mm/mmap.c, so mark it static.
      
      Link: https://lkml.kernel.org/r/f16cbdc2e72d37a1a097c39dc7d1fee8919a1c93.1697043508.git.lstoakes@gmail.com
      Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: abstract merge for new VMAs into vma_merge_new_vma() · 4b5f2d20
      Lorenzo Stoakes authored
      Only in mmap_region() and copy_vma() do we attempt to merge VMAs which
      occupy entirely new regions of virtual memory.
      
      We can abstract this logic and make the intent of these invocations
      completely explicit, rather than invoking vma_merge() with an inscrutable
      wall of parameters.
      
      This also paves the way for a simplification of the core vma_merge()
      implementation, as we seek to make it entirely an implementation detail.
      
      The VMA merge call in mmap_region() occurs only for file-backed mappings,
      where each of the parameters previously specified as NULL is defaulted to
      NULL in vma_init() (called by vm_area_alloc()).
      
      This matches the previous behaviour of specifying NULL for a number of
      fields, however note that prior to this call we pass the VMA to the file
      system driver via call_mmap(), which may in theory adjust fields that we
      pass in to vma_merge_new_vma().
      
      Therefore we actually resolve an oversight here by allowing for the fact
      that the driver may have done this.
      
      Link: https://lkml.kernel.org/r/3dc71d17e307756a54781d4a4ce7315cf8b18bea.1697043508.git.lstoakes@gmail.com
      Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: make vma_merge() and split_vma() internal · adb20b0c
      Lorenzo Stoakes authored
      Now the common pattern of - attempting a merge via vma_merge() and should
      this fail splitting VMAs via split_vma() - has been abstracted, the former
      can be placed into mm/internal.h and the latter made static.
      
      In addition, the split_vma() nommu variant also need not be exported.
      
      Link: https://lkml.kernel.org/r/405f2be10e20c4e9fbcc9fe6b2dfea105f6642e0.1697043508.git.lstoakes@gmail.com
      Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: abstract the vma_merge()/split_vma() pattern for mprotect() et al. · 94d7d923
      Lorenzo Stoakes authored
      mprotect() and other functions which change VMA parameters over a range
      each employ a pattern of:-
      
      1. Attempt to merge the range with adjacent VMAs.
      2. If this fails, and the range spans a subset of the VMA, split it
         accordingly.
      
      This is open-coded and duplicated in each case. Also in each case most of
      the parameters passed to vma_merge() remain the same.
      
      Create a new function, vma_modify(), which abstracts this operation,
      accepting only those parameters which can be changed.
      
      To avoid the mess of invoking each function call with unnecessary
      parameters, create inline wrapper functions for each of the modify
      operations, parameterised only by what is required to perform the action.
      
      We can also significantly simplify the logic - by returning the VMA if we
      split (or merged VMA if we do not) we no longer need specific handling for
      merge/split cases in any of the call sites.
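
      The shape of this operation can be modelled in a few lines of standalone
      C.  This is a toy sketch, not vma_modify() itself: struct toy_vma, struct
      toy_mm and toy_modify_flags() are invented for illustration, VMAs live in
      a small sorted array, and the toy splits first and coalesces afterwards,
      whereas the kernel attempts the merge first.  What it does share with the
      description above is that the caller always gets back the one VMA that
      now covers the range.

```c
#include <stddef.h>

#define TOY_MAX 16

/* Toy stand-ins (assumptions): a VMA is just [start, end) plus flags. */
struct toy_vma { unsigned long start, end, flags; };
struct toy_mm  { struct toy_vma v[TOY_MAX]; size_t n; };

static void toy_insert(struct toy_mm *mm, size_t i, struct toy_vma vma)
{
	for (size_t j = mm->n; j > i; j--)
		mm->v[j] = mm->v[j - 1];
	mm->v[i] = vma;
	mm->n++;
}

static void toy_remove(struct toy_mm *mm, size_t i)
{
	for (size_t j = i; j + 1 < mm->n; j++)
		mm->v[j] = mm->v[j + 1];
	mm->n--;
}

/*
 * Apply flags to [start, end), which must lie within mm->v[i].
 * Returns the index of the VMA covering the range afterwards, or -1.
 */
static long toy_modify_flags(struct toy_mm *mm, size_t i,
			     unsigned long start, unsigned long end,
			     unsigned long flags)
{
	if (i >= mm->n || start >= end ||
	    start < mm->v[i].start || end > mm->v[i].end ||
	    mm->n + 2 > TOY_MAX)
		return -1;

	/* Split off the part before the range, if any. */
	if (start > mm->v[i].start) {
		struct toy_vma head = { mm->v[i].start, start, mm->v[i].flags };

		mm->v[i].start = start;
		toy_insert(mm, i, head);
		i++;
	}
	/* Split off the part after the range, if any. */
	if (end < mm->v[i].end) {
		struct toy_vma tail = { end, mm->v[i].end, mm->v[i].flags };

		mm->v[i].end = end;
		toy_insert(mm, i + 1, tail);
	}
	mm->v[i].flags = flags;

	/* Coalesce with adjacent neighbours that now have identical flags. */
	if (i + 1 < mm->n && mm->v[i + 1].start == mm->v[i].end &&
	    mm->v[i + 1].flags == flags) {
		mm->v[i].end = mm->v[i + 1].end;
		toy_remove(mm, i + 1);
	}
	if (i > 0 && mm->v[i - 1].end == mm->v[i].start &&
	    mm->v[i - 1].flags == flags) {
		mm->v[i - 1].end = mm->v[i].end;
		toy_remove(mm, i);
		i--;
	}
	return (long)i;
}
```

      Because the function hands back the covering VMA in every case, a caller
      in this model needs no separate handling for "merged" versus "split".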
      
      Note that the userfaultfd_release() case works even though it does not
      split VMAs - since start is set to vma->vm_start and end is set to
      vma->vm_end, the split logic does not trigger.
      
      In addition, since we calculate pgoff to be equal to vma->vm_pgoff +
      ((start - vma->vm_start) >> PAGE_SHIFT), and start - vma->vm_start will be
      0 in this instance, this invocation will remain unchanged.
      
      We eliminate a VM_WARN_ON() in mprotect_fixup() as this simply asserts that
      vma_merge() correctly ensures that flags remain the same, something that is
      already checked in is_mergeable_vma() and elsewhere, and in any case is not
      specific to mprotect().
      
      Link: https://lkml.kernel.org/r/0dfa9368f37199a423674bf0ee312e8ea0619044.1697043508.git.lstoakes@gmail.com
      Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: move vma_policy() and anon_vma_name() decls to mm_types.h · 3657fdc2
      Lorenzo Stoakes authored
      Patch series "Abstract vma_merge() and split_vma()", v4.
      
      The vma_merge() interface is very confusing and its implementation has led
      to numerous bugs as a result of that confusion.
      
      In addition there is duplication both in invocation of vma_merge(), but
      also in the common mprotect()-style pattern of attempting a merge, then if
      this fails, splitting the portion of a VMA about to have its attributes
      changed.
      
      This pattern has been copy/pasted around the kernel in each instance where
      such an operation has been required, each very slightly modified from the
      last to make it even harder to decipher what is going on.
      
      Simplify the whole thing by dividing the actual uses of vma_merge() and
      split_vma() into specific and abstracted functions and de-duplicate the
      vma_merge()/split_vma() pattern altogether.
      
      Doing so also opens the door to changing how vma_merge() is implemented -
      by knowing precisely what cases a caller is invoking rather than having a
      central interface where anything might happen we can untangle the brittle
      and confusing vma_merge() implementation into something more workable.
      
      For mprotect()-like cases we introduce vma_modify() which performs the
      vma_merge()/split_vma() pattern, returning a pointer to either the merged
      or split VMA or an ERR_PTR(err) if the splits fail.
      
      We provide a number of inline helper functions to make things even clearer:-
      
      * vma_modify_flags()      - Prepare to modify the VMA's flags.
      * vma_modify_flags_name() - Prepare to modify the VMA's flags/anon_vma_name
      * vma_modify_policy()     - Prepare to modify the VMA's mempolicy.
      * vma_modify_flags_uffd() - Prepare to modify the VMA's flags/uffd context.
      
      For cases where a new VMA is attempted to be merged with adjacent VMAs we
      add:-
      
      * vma_merge_new_vma() - Prepare to merge a new VMA.
      * vma_merge_extend()  - Prepare to extend the end of a new VMA.
      
      
      This patch (of 5):
      
      The vma_policy() define is a helper specifically for a VMA field so it
      makes sense to host it in the memory management types header.
      
      The anon_vma_name(), anon_vma_name_alloc() and anon_vma_name_free()
      functions are a little out of place in mm_inline.h as they define external
      functions, and so it makes sense to locate them in mm_types.h.
      
      The purpose of these relocations is to make it possible to abstract static
      inline wrappers which invoke both of these helpers.
      
      Link: https://lkml.kernel.org/r/cover.1697043508.git.lstoakes@gmail.com
      Link: https://lkml.kernel.org/r/24bfc6c9e382fffbcb0ea8d424392c27d56cc8ca.1697043508.git.lstoakes@gmail.com
      Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
      Cc: Lorenzo Stoakes <lstoakes@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • sched: remove wait bookmarks · 37acade0
      Matthew Wilcox (Oracle) authored
      There are no users of wait bookmarks left, so simplify the wait
      code by removing them.
      
      Link: https://lkml.kernel.org/r/20231010035829.544242-2-willy@infradead.org
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Acked-by: Ingo Molnar <mingo@kernel.org>
      Cc: Benjamin Segall <bsegall@google.com>
      Cc: Bin Lai <sclaibin@gmail.com>
      Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
      Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Juri Lelli <juri.lelli@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
      Cc: Valentin Schneider <vschneid@redhat.com>
      Cc: Vincent Guittot <vincent.guittot@linaro.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • filemap: remove use of wait bookmarks · b0b598ee
      Matthew Wilcox (Oracle) authored
      The original problem of the overly long list of waiters on a locked page
      was solved properly by commit 9a1ea439 ("mm:
      put_and_wait_on_page_locked() while page is migrated").  In the meantime,
      using bookmarks for the writeback bit can cause livelocks, so we need to
      stop using them.
      
      Link: https://lkml.kernel.org/r/20231010035829.544242-1-willy@infradead.org
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Bin Lai <sclaibin@gmail.com>
      Cc: Benjamin Segall <bsegall@google.com>
      Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
      Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Juri Lelli <juri.lelli@redhat.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
      Cc: Valentin Schneider <vschneid@redhat.com>
      Cc: Vincent Guittot <vincent.guittot@linaro.org>
      
      Cc: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/mprotect: allow unfaulted VMAs to be unaccounted on mprotect() · 9b914329
      Lorenzo Stoakes authored
      When mprotect() is used to make unwritable VMAs writable, they have the
      VM_ACCOUNT flag applied and memory accounted accordingly.
      
      If the VMA has had no pages faulted in and is then made unwritable once
      again, it will remain accounted for, despite not being capable of
      extending memory usage.
      
      Consider:-
      
      ptr = mmap(NULL, page_size * 3, PROT_READ, MAP_ANON | MAP_PRIVATE, -1, 0);
      mprotect(ptr + page_size, page_size, PROT_READ | PROT_WRITE);
      mprotect(ptr + page_size, page_size, PROT_READ);
      
      The first mprotect() splits the range into 3 VMAs and the second fails to
      merge the three as the middle VMA has VM_ACCOUNT set and the others do
      not, rendering them unmergeable.
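
      A runnable version of the sequence above might look like the following
      sketch, assuming a Linux system with /proc/self/maps.  The helpers
      count_vmas() and run_mprotect_example() are made up for this example.
      The first counts how many entries in /proc/self/maps overlap a range,
      which serves as a rough proxy for the number of VMAs backing it.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE /* for MAP_ANONYMOUS */
#endif
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: count /proc/self/maps entries overlapping a range. */
static int count_vmas(uintptr_t start, uintptr_t end)
{
	FILE *f = fopen("/proc/self/maps", "r");
	char line[512];
	unsigned long lo, hi;
	int n = 0;

	if (!f)
		return -1;
	while (fgets(line, sizeof(line), f)) {
		if (sscanf(line, "%lx-%lx", &lo, &hi) == 2 &&
		    lo < end && hi > start)
			n++;
	}
	fclose(f);
	return n;
}

/*
 * Run the mmap()/mprotect() sequence from the commit message and record
 * the VMA count after each mprotect().  Returns 0 on success, -1 on error.
 */
static int run_mprotect_example(int *after_first, int *after_second)
{
	long page_size = sysconf(_SC_PAGESIZE);
	size_t len = (size_t)page_size * 3;
	char *ptr = mmap(NULL, len, PROT_READ,
			 MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
	int ret = -1;

	if (ptr == MAP_FAILED)
		return -1;

	if (mprotect(ptr + page_size, page_size, PROT_READ | PROT_WRITE))
		goto out;
	*after_first = count_vmas((uintptr_t)ptr, (uintptr_t)ptr + len);

	if (mprotect(ptr + page_size, page_size, PROT_READ))
		goto out;
	*after_second = count_vmas((uintptr_t)ptr, (uintptr_t)ptr + len);
	ret = 0;
out:
	munmap(ptr, len);
	return ret;
}
```

      The count after the first mprotect() is expected to be 3.  After the
      second it is expected to stay at 3 on kernels without this patch and to
      drop to 1 on kernels with it, though the exact result depends on the
      running kernel.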
      
      This is unnecessary, since no pages have actually been allocated and the
      middle VMA is not capable of utilising more memory, thereby introducing
      unnecessary VMA fragmentation (and accounting for more memory than is
      necessary).
      
      Since we cannot efficiently determine which pages map to an anonymous VMA,
      we have to be very conservative - determining whether any pages at all
      have been faulted in, by checking whether vma->anon_vma is NULL.
      
      We can see that the lack of anon_vma implies that no anonymous pages are
      present as evidenced by vma_needs_copy() utilising this on fork to
      determine whether page tables need to be copied.
      
      The only place where anon_vma is set NULL explicitly is on fork with
      VM_WIPEONFORK set, however since this flag is intended to cause the child
      process to not CoW on a given memory range, it is right to interpret this
      as indicating the VMA has no faulted-in anonymous memory mapped.
      
      If the VMA was forked without VM_WIPEONFORK set, then anon_vma_fork() will
      have ensured that a new anon_vma is assigned (and correctly related to its
      parent anon_vma) should any pages be CoW-mapped.
      
      The overall operation is safe against races as we hold a write lock against
      mm->mmap_lock.
      
      If we could efficiently look up the VMA's faulted-in pages then we would
      unaccount all those pages not yet faulted in.  However, as the original
      comment alludes to, this simply isn't currently possible, so we are
      conservative and account all pages or none at all.
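
      The all-or-nothing rule can be modelled in userspace roughly as
      follows.  This is a simplified sketch and not the kernel code: the
      struct, the flag values and the function name are hypothetical, and
      the cases the real path also considers (shared mappings, hugetlb,
      noreserve, the actual charge) are left out.

```c
#include <stdbool.h>
#include <stddef.h>

/* Flag bits for this model only; they are not the kernel's values. */
#define MODEL_VM_WRITE   0x1u
#define MODEL_VM_ACCOUNT 0x2u

struct model_vma {
	unsigned int flags;
	bool is_anonymous;
	const void *anon_vma;	/* NULL: no anonymous pages faulted in */
};

/*
 * Compute the flags a VMA would carry after an mprotect()-style change.
 * Making a not-yet-accounted VMA writable sets the account flag; making
 * an accounted anonymous VMA unwritable clears it only when anon_vma is
 * NULL, i.e. when nothing has been faulted in.
 */
static unsigned int model_mprotect_flags(const struct model_vma *vma,
					 bool want_write)
{
	unsigned int newflags = vma->flags & ~MODEL_VM_WRITE;

	if (want_write) {
		newflags |= MODEL_VM_WRITE;
		if (!(vma->flags & (MODEL_VM_ACCOUNT | MODEL_VM_WRITE)))
			newflags |= MODEL_VM_ACCOUNT;
	} else if ((vma->flags & MODEL_VM_ACCOUNT) && vma->is_anonymous &&
		   vma->anon_vma == NULL) {
		/* Nothing faulted in: safe to drop accounting entirely. */
		newflags &= ~MODEL_VM_ACCOUNT;
	}
	return newflags;
}
```

      In this model a VMA with a non-NULL anon_vma stays accounted after
      being made unwritable, which mirrors the conservative choice
      described above.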
      
      Link: https://lkml.kernel.org/r/ad5540371a16623a069f03f4db1739f33cde1fab.1696921767.git.lstoakes@gmail.com
      Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: add printf attribute to shrinker_debugfs_name_alloc · f04eba13
      Lucy Mielke authored
      This fixes a compiler warning when compiling an allyesconfig with W=1:
      
      mm/internal.h:1235:9: error: function might be a candidate for `gnu_printf'
      format attribute [-Werror=suggest-attribute=format]
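
      For illustration, this is how the GCC format attribute is typically
      applied to a function that forwards a format string.  The helpers
      below are hypothetical userspace stand-ins, not the kernel functions;
      in kernel code the __printf(a, b) macro expands to the same attribute.

```c
#include <stdarg.h>
#include <stdio.h>

/*
 * va_list variant: the format string is argument 3, and the second
 * attribute parameter is 0 because there are no variadic arguments for
 * the compiler to check against the format.
 */
__attribute__((format(printf, 3, 0)))
static int name_format_v(char *buf, size_t size, const char *fmt, va_list ap)
{
	return vsnprintf(buf, size, fmt, ap);
}

/*
 * Variadic variant: the format string is argument 3 and the arguments
 * to check start at position 4.
 */
__attribute__((format(printf, 3, 4)))
static int name_format(char *buf, size_t size, const char *fmt, ...)
{
	va_list ap;
	int ret;

	va_start(ap, fmt);
	ret = name_format_v(buf, size, fmt, ap);
	va_end(ap);
	return ret;
}
```

      With the attribute in place the compiler can check callers' format
      strings against their arguments, and the suggest-attribute warning
      no longer applies to these functions.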
      
      [akpm@linux-foundation.org: fix shrinker_alloc() as well per Qi Zheng]
        Link: https://lkml.kernel.org/r/822387b7-4895-4e64-5806-0f56b5d6c447@bytedance.com
      Link: https://lkml.kernel.org/r/ZSBue-3kM6gI6jCr@mainframe
      Fixes: c42d50ae ("mm: shrinker: add infrastructure for dynamically allocating shrinker")
      Signed-off-by: Lucy Mielke <lucymielke@icloud.com>
      Cc: Qi Zheng <zhengqi.arch@bytedance.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/thp: fix "mm: thp: kill __transhuge_page_enabled()" · 7a81751f
      Zach O'Keefe authored
      The 6.0 commits:
      
      commit 9fec5168 ("mm: thp: kill transparent_hugepage_active()")
      commit 7da4e2cb ("mm: thp: kill __transhuge_page_enabled()")
      
      merged "can we have THPs in this VMA?" logic that was previously done
      separately by fault-path, khugepaged, and smaps "THPeligible" checks.
      
      During the process, the semantics of the fault path check changed in two
      ways:
      
      1) A VM_NO_KHUGEPAGED check was introduced (also added to smaps path).
      2) We no longer checked if non-anonymous memory had a vm_ops->huge_fault
         handler that could satisfy the fault.  Previously, this check had been
         done in create_huge_pud() and create_huge_pmd() routines, but after
         the changes, we never reach those routines.
      
      During the review of the above commits, it was determined that in-tree
      users weren't affected by the change, most notably because the only
      relevant user (in terms of THP) of VM_MIXEDMAP or ->huge_fault is DAX,
      which is explicitly approved early in the approval logic.  However, this
      was a bad assumption to make, as it assumes the only reason to support
      ->huge_fault was for DAX (which is not true in general).
      
      Remove the VM_NO_KHUGEPAGED check when not in collapse path and give any
      ->huge_fault handler a chance to handle the fault.  Note that we don't
      validate the file mode or mapping alignment, which is consistent with the
      behavior before the aforementioned commits.
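
      A loose userspace model of the two adjustments described above.  All
      names are hypothetical and the sketch is not the kernel
      implementation: checks that run earlier in the real code (for example
      THP being disabled for the VMA) are omitted, and every remaining
      eligibility check is collapsed into a single boolean parameter.

```c
#include <stdbool.h>

enum model_path {
	MODEL_PATH_FAULT,	/* page fault path */
	MODEL_PATH_SMAPS,	/* smaps "THPeligible" reporting */
	MODEL_PATH_COLLAPSE,	/* khugepaged / collapse path */
};

struct model_thp_vma {
	bool is_anonymous;
	bool no_khugepaged;	/* stands in for VM_NO_KHUGEPAGED */
	bool has_huge_fault;	/* stands in for vm_ops->huge_fault != NULL */
};

static bool model_thp_eligible(const struct model_thp_vma *vma,
			       enum model_path path,
			       bool remaining_checks_pass)
{
	/* The VM_NO_KHUGEPAGED restriction applies only when collapsing. */
	if (path == MODEL_PATH_COLLAPSE && vma->no_khugepaged)
		return false;

	/* On a fault, let a ->huge_fault handler try to satisfy it. */
	if (path == MODEL_PATH_FAULT && !vma->is_anonymous &&
	    vma->has_huge_fault)
		return true;

	return remaining_checks_pass;
}
```

      The path is passed explicitly so that the collapse-only restriction
      and the fault-only shortcut stay visibly separate.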
      
      Link: https://lkml.kernel.org/r/20230925200110.1979606-1-zokeefe@google.com
      Fixes: 7da4e2cb ("mm: thp: kill __transhuge_page_enabled()")
      Reported-by: Saurabh Singh Sengar <ssengar@microsoft.com>
      Signed-off-by: Zach O'Keefe <zokeefe@google.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Ryan Roberts <ryan.roberts@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>