1. 18 Oct, 2023 27 commits
    • Hugh Dickins's avatar
      shmem: shrink shmem_inode_info: dir_offsets in a union · ee615d45
      Hugh Dickins authored
      Patch series "shmem,tmpfs: general maintenance".
      
      Mostly just cosmetic mods in mm/shmem.c, but the last two enforcing the
      "size=" limit better.  8/8 goes into percpu counter territory, and could
      stand alone.
      
      
      This patch (of 8):
      
      Shave 32 bytes off (the 64-bit) shmem_inode_info.  There was a 4-byte
      pahole after stop_eviction, better filled by fsflags.  And the 24-byte
      dir_offsets can only be used by directories, whereas shrinklist and
      swaplist only by shmem_mapping() inodes (regular files or long symlinks):
      so put those into a union.  No change in mm/shmem.c is required for this.
      
      Link: https://lkml.kernel.org/r/c7441dc6-f3bb-dd60-c670-9f5cbd9f266@google.com
      Link: https://lkml.kernel.org/r/86ebb4b-c571-b9e8-27f5-cb82ec50357e@google.comSigned-off-by: default avatarHugh Dickins <hughd@google.com>
      Reviewed-by: default avatarChuck Lever <chuck.lever@oracle.com>
      Reviewed-by: default avatarJan Kara <jack@suse.cz>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Carlos Maiolino <cem@kernel.org>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Darrick J. Wong <djwong@kernel.org>
      Cc: Dave Chinner <dchinner@redhat.com>
      Cc: Tim Chen <tim.c.chen@intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      ee615d45
    • Lorenzo Stoakes's avatar
      mm/filemap: clarify filemap_fault() comments for not uptodate case · 6facf36e
      Lorenzo Stoakes authored
      The existing comments in filemap_fault() suggest that, after either a
      minor fault has occurred and filemap_get_folio() found a folio in the page
      cache, or a major fault arose and __filemap_get_folio(FGP_CREATE...) did
      the job (having relied on do_sync_mmap_readahead() or filemap_read_folio()
      to read in the folio), the only possible reason it could not be uptodate
      is because of an error.
      
      This is not so, as if, for instance, the fault occurred within a VMA which
      had the VM_RAND_READ flag set (via madvise() with the MADV_RANDOM flag
      specified), this would cause even synchronous readahead to fail to read in
      the folio.
      
      I confirmed this by dropping page caches and faulting in memory
      madvise()'d this way, observing that this code path was reached on each
      occasion.
      
      Clarify the comments to include this case, and additionally update the
      comment recently added around the invalidate lock logic to make it clear
      the comment explicitly refers to the minor fault case.
      
      In addition, while we're here, refer to folios rather than pages.
      
      [lstoakes@gmail.com: correct identation as per Christopher's feedback]
        Link: https://lkml.kernel.org/r/2c7014c0-6343-4e76-8697-3f84f54350bd@lucifer.local
      Link: https://lkml.kernel.org/r/20230930231029.88196-1-lstoakes@gmail.comSigned-off-by: default avatarLorenzo Stoakes <lstoakes@gmail.com>
      Reviewed-by: default avatarJan Kara <jack@suse.cz>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      6facf36e
    • Liam R. Howlett's avatar
      radix tree test suite: fix allocation calculation in kmem_cache_alloc_bulk() · 7771dcf0
      Liam R. Howlett authored
      The bulk allocation is iterating through an array and storing enough
      memory for the entire bulk allocation instead of a single array entry. 
      Only allocate an array element of the size set in the kmem_cache.
      
      Link: https://lkml.kernel.org/r/20230929201359.2857583-1-Liam.Howlett@oracle.com
      Fixes: cc86e0c2 ("radix tree test suite: add support for slab bulk APIs")
      Signed-off-by: default avatarLiam R. Howlett <Liam.Howlett@oracle.com>
      Reported-by: default avatarChristophe JAILLET <christophe.jaillet@wanadoo.fr>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      7771dcf0
    • Muhammad Usama Anjum's avatar
      selftests: mm: add pagemap ioctl tests · 46fd75d4
      Muhammad Usama Anjum authored
      Add pagemap ioctl tests. Add several different types of tests to judge
      the correction of the interface.
      
      Link: https://lkml.kernel.org/r/20230821141518.870589-7-usama.anjum@collabora.comSigned-off-by: default avatarMuhammad Usama Anjum <usama.anjum@collabora.com>
      Cc: Alex Sierra <alex.sierra@amd.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Andrei Vagin <avagin@gmail.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: Cyrill Gorcunov <gorcunov@gmail.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Gustavo A. R. Silva <gustavoars@kernel.org>
      Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Miroslaw <emmir@google.com>
      Cc: Michał Mirosław <mirq-linux@rere.qmqm.pl>
      Cc: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
      Cc: Paul Gofman <pgofman@codeweavers.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Yun Zhou <yun.zhou@windriver.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      46fd75d4
    • Muhammad Usama Anjum's avatar
      mm/pagemap: add documentation of PAGEMAP_SCAN IOCTL · 18825b8a
      Muhammad Usama Anjum authored
      Add some explanation and method to use write-protection and written-to
      on memory range.
      
      Link: https://lkml.kernel.org/r/20230821141518.870589-6-usama.anjum@collabora.comSigned-off-by: default avatarMuhammad Usama Anjum <usama.anjum@collabora.com>
      Cc: Alex Sierra <alex.sierra@amd.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Andrei Vagin <avagin@gmail.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: Cyrill Gorcunov <gorcunov@gmail.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Gustavo A. R. Silva <gustavoars@kernel.org>
      Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Miroslaw <emmir@google.com>
      Cc: Michał Mirosław <mirq-linux@rere.qmqm.pl>
      Cc: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
      Cc: Paul Gofman <pgofman@codeweavers.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Yun Zhou <yun.zhou@windriver.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      18825b8a
    • Muhammad Usama Anjum's avatar
      tools headers UAPI: update linux/fs.h with the kernel sources · b58aa0f4
      Muhammad Usama Anjum authored
      New IOCTL and macros has been added in the kernel sources. Update the
      tools header file as well.
      
      Link: https://lkml.kernel.org/r/20230821141518.870589-5-usama.anjum@collabora.comSigned-off-by: default avatarMuhammad Usama Anjum <usama.anjum@collabora.com>
      Cc: Alex Sierra <alex.sierra@amd.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Andrei Vagin <avagin@gmail.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: Cyrill Gorcunov <gorcunov@gmail.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Gustavo A. R. Silva <gustavoars@kernel.org>
      Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Miroslaw <emmir@google.com>
      Cc: Michał Mirosław <mirq-linux@rere.qmqm.pl>
      Cc: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
      Cc: Paul Gofman <pgofman@codeweavers.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Yun Zhou <yun.zhou@windriver.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      b58aa0f4
    • Muhammad Usama Anjum's avatar
      fs/proc/task_mmu: add fast paths to get/clear PAGE_IS_WRITTEN flag · 12f6b01a
      Muhammad Usama Anjum authored
      Adding fast code paths to handle specifically only get and/or clear
      operation of PAGE_IS_WRITTEN, increases its performance by 0-35%.  The
      results of some test cases are given below:
      
      Test-case-1
      t1 = (Get + WP) time
      t2 = WP time
                             t1            t2
      Without this patch:    140-170mcs    90-115mcs
      With this patch:       110mcs        80mcs
      Worst case diff:       35% faster    30% faster
      
      Test-case-2
      t3 = atomic Get and WP
                            t3
      Without this patch:   120-140mcs
      With this patch:      100-110mcs
      Worst case diff:      21% faster
      
      Link: https://lkml.kernel.org/r/20230821141518.870589-4-usama.anjum@collabora.comSigned-off-by: default avatarMuhammad Usama Anjum <usama.anjum@collabora.com>
      Cc: Alex Sierra <alex.sierra@amd.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Andrei Vagin <avagin@gmail.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: Cyrill Gorcunov <gorcunov@gmail.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Gustavo A. R. Silva <gustavoars@kernel.org>
      Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Miroslaw <emmir@google.com>
      Cc: Michał Mirosław <mirq-linux@rere.qmqm.pl>
      Cc: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
      Cc: Paul Gofman <pgofman@codeweavers.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Yun Zhou <yun.zhou@windriver.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      12f6b01a
    • Muhammad Usama Anjum's avatar
      fs/proc/task_mmu: implement IOCTL to get and optionally clear info about PTEs · 52526ca7
      Muhammad Usama Anjum authored
      The PAGEMAP_SCAN IOCTL on the pagemap file can be used to get or optionally
      clear the info about page table entries. The following operations are
      supported in this IOCTL:
      - Scan the address range and get the memory ranges matching the provided
        criteria. This is performed when the output buffer is specified.
      - Write-protect the pages. The PM_SCAN_WP_MATCHING is used to write-protect
        the pages of interest. The PM_SCAN_CHECK_WPASYNC aborts the operation if
        non-Async Write Protected pages are found. The ``PM_SCAN_WP_MATCHING``
        can be used with or without PM_SCAN_CHECK_WPASYNC.
      - Both of those operations can be combined into one atomic operation where
        we can get and write protect the pages as well.
      
      Following flags about pages are currently supported:
      - PAGE_IS_WPALLOWED - Page has async-write-protection enabled
      - PAGE_IS_WRITTEN - Page has been written to from the time it was write protected
      - PAGE_IS_FILE - Page is file backed
      - PAGE_IS_PRESENT - Page is present in the memory
      - PAGE_IS_SWAPPED - Page is in swapped
      - PAGE_IS_PFNZERO - Page has zero PFN
      - PAGE_IS_HUGE - Page is THP or Hugetlb backed
      
      This IOCTL can be extended to get information about more PTE bits. The
      entire address range passed by user [start, end) is scanned until either
      the user provided buffer is full or max_pages have been found.
      
      [akpm@linux-foundation.org: update it for "mm: hugetlb: add huge page size param to set_huge_pte_at()"]
      [akpm@linux-foundation.org: fix CONFIG_HUGETLB_PAGE=n warning]
      [arnd@arndb.de: hide unused pagemap_scan_backout_range() function]
        Link: https://lkml.kernel.org/r/20230927060257.2975412-1-arnd@kernel.org
      [sfr@canb.auug.org.au: fix "fs/proc/task_mmu: hide unused pagemap_scan_backout_range() function"]
        Link: https://lkml.kernel.org/r/20230928092223.0625c6bf@canb.auug.org.au
      Link: https://lkml.kernel.org/r/20230821141518.870589-3-usama.anjum@collabora.comSigned-off-by: default avatarMuhammad Usama Anjum <usama.anjum@collabora.com>
      Signed-off-by: default avatarMichał Mirosław <mirq-linux@rere.qmqm.pl>
      Signed-off-by: default avatarArnd Bergmann <arnd@arndb.de>
      Signed-off-by: default avatarStephen Rothwell <sfr@canb.auug.org.au>
      Reviewed-by: default avatarAndrei Vagin <avagin@gmail.com>
      Reviewed-by: default avatarMichał Mirosław <mirq-linux@rere.qmqm.pl>
      Cc: Alex Sierra <alex.sierra@amd.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: Cyrill Gorcunov <gorcunov@gmail.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Gustavo A. R. Silva <gustavoars@kernel.org>
      Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Miroslaw <emmir@google.com>
      Cc: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
      Cc: Paul Gofman <pgofman@codeweavers.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Yun Zhou <yun.zhou@windriver.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      52526ca7
    • Peter Xu's avatar
      userfaultfd: UFFD_FEATURE_WP_ASYNC · d61ea1cb
      Peter Xu authored
      Patch series "Implement IOCTL to get and optionally clear info about
      PTEs", v33.
      
      *Motivation*
      The real motivation for adding PAGEMAP_SCAN IOCTL is to emulate Windows
      GetWriteWatch() and ResetWriteWatch() syscalls [1].  The GetWriteWatch()
      retrieves the addresses of the pages that are written to in a region of
      virtual memory.
      
      This syscall is used in Windows applications and games etc.  This syscall
      is being emulated in pretty slow manner in userspace.  Our purpose is to
      enhance the kernel such that we translate it efficiently in a better way. 
      Currently some out of tree hack patches are being used to efficiently
      emulate it in some kernels.  We intend to replace those with these
      patches.  So the whole gaming on Linux can effectively get benefit from
      this.  It means there would be tons of users of this code.
      
      CRIU use case [2] was mentioned by Andrei and Danylo:
      > Use cases for migrating sparse VMAs are binaries sanitized with ASAN,
      > MSAN or TSAN [3]. All of these sanitizers produce sparse mappings of
      > shadow memory [4]. Being able to migrate such binaries allows to highly
      > reduce the amount of work needed to identify and fix post-migration
      > crashes, which happen constantly.
      
      Andrei defines the following uses of this code:
      * it is more granular and allows us to track changed pages more
        effectively. The current interface can clear dirty bits for the entire
        process only. In addition, reading info about pages is a separate
        operation. It means we must freeze the process to read information
        about all its pages, reset dirty bits, only then we can start dumping
        pages. The information about pages becomes more and more outdated,
        while we are processing pages. The new interface solves both these
        downsides. First, it allows us to read pte bits and clear the
        soft-dirty bit atomically. It means that CRIU will not need to freeze
        processes to pre-dump their memory. Second, it clears soft-dirty bits
        for a specified region of memory. It means CRIU will have actual info
        about pages to the moment of dumping them.
      * The new interface has to be much faster because basic page filtering
        is happening in the kernel. With the old interface, we have to read
        pagemap for each page.
      
      *Implementation Evolution (Short Summary)*
      From the definition of GetWriteWatch(), we feel like kernel's soft-dirty
      feature can be used under the hood with some additions like:
      * reset soft-dirty flag for only a specific region of memory instead of
      clearing the flag for the entire process
      * get and clear soft-dirty flag for a specific region atomically
      
      So we decided to use ioctl on pagemap file to read or/and reset soft-dirty
      flag. But using soft-dirty flag, sometimes we get extra pages which weren't
      even written. They had become soft-dirty because of VMA merging and
      VM_SOFTDIRTY flag. This breaks the definition of GetWriteWatch(). We were
      able to by-pass this short coming by ignoring VM_SOFTDIRTY until David
      reported that mprotect etc messes up the soft-dirty flag while ignoring
      VM_SOFTDIRTY [5]. This wasn't happening until [6] got introduced. We
      discussed if we can revert these patches. But we could not reach to any
      conclusion. So at this point, I made couple of tries to solve this whole
      VM_SOFTDIRTY issue by correcting the soft-dirty implementation:
      * [7] Correct the bug fixed wrongly back in 2014. It had potential to cause
      regression. We left it behind.
      * [8] Keep a list of soft-dirty part of a VMA across splits and merges. I
      got the reply don't increase the size of the VMA by 8 bytes.
      
      At this point, we left soft-dirty considering it is too much delicate and
      userfaultfd [9] seemed like the only way forward. From there onward, we
      have been basing soft-dirty emulation on userfaultfd wp feature where
      kernel resolves the faults itself when WP_ASYNC feature is used. It was
      straight forward to add WP_ASYNC feature in userfautlfd. Now we get only
      those pages dirty or written-to which are really written in reality. (PS
      There is another WP_UNPOPULATED userfautfd feature is required which is
      needed to avoid pre-faulting memory before write-protecting [9].)
      
      All the different masks were added on the request of CRIU devs to create
      interface more generic and better.
      
      [1] https://learn.microsoft.com/en-us/windows/win32/api/memoryapi/nf-memoryapi-getwritewatch
      [2] https://lore.kernel.org/all/20221014134802.1361436-1-mdanylo@google.com
      [3] https://github.com/google/sanitizers
      [4] https://github.com/google/sanitizers/wiki/AddressSanitizerAlgorithm#64-bit
      [5] https://lore.kernel.org/all/bfcae708-db21-04b4-0bbe-712badd03071@redhat.com
      [6] https://lore.kernel.org/all/20220725142048.30450-1-peterx@redhat.com/
      [7] https://lore.kernel.org/all/20221122115007.2787017-1-usama.anjum@collabora.com
      [8] https://lore.kernel.org/all/20221220162606.1595355-1-usama.anjum@collabora.com
      [9] https://lore.kernel.org/all/20230306213925.617814-1-peterx@redhat.com
      [10] https://lore.kernel.org/all/20230125144529.1630917-1-mdanylo@google.com
      
      
      This patch (of 6):
      
      Add a new userfaultfd-wp feature UFFD_FEATURE_WP_ASYNC, that allows
      userfaultfd wr-protect faults to be resolved by the kernel directly.
      
      It can be used like a high accuracy version of soft-dirty, without vma
      modifications during tracking, and also with ranged support by default
      rather than for a whole mm when reset the protections due to existence of
      ioctl(UFFDIO_WRITEPROTECT).
      
      Several goals of such a dirty tracking interface:
      
      1. All types of memory should be supported and tracable. This is nature
         for soft-dirty but should mention when the context is userfaultfd,
         because it used to only support anon/shmem/hugetlb. The problem is for
         a dirty tracking purpose these three types may not be enough, and it's
         legal to track anything e.g. any page cache writes from mmap.
      
      2. Protections can be applied to partial of a memory range, without vma
         split/merge fuss.  The hope is that the tracking itself should not
         affect any vma layout change.  It also helps when reset happens because
         the reset will not need mmap write lock which can block the tracee.
      
      3. Accuracy needs to be maintained.  This means we need pte markers to work
         on any type of VMA.
      
      One could question that, the whole concept of async dirty tracking is not
      really close to fundamentally what userfaultfd used to be: it's not "a
      fault to be serviced by userspace" anymore. However, using userfaultfd-wp
      here as a framework is convenient for us in at least:
      
      1. VM_UFFD_WP vma flag, which has a very good name to suite something like
         this, so we don't need VM_YET_ANOTHER_SOFT_DIRTY. Just use a new
         feature bit to identify from a sync version of uffd-wp registration.
      
      2. PTE markers logic can be leveraged across the whole kernel to maintain
         the uffd-wp bit as long as an arch supports, this also applies to this
         case where uffd-wp bit will be a hint to dirty information and it will
         not go lost easily (e.g. when some page cache ptes got zapped).
      
      3. Reuse ioctl(UFFDIO_WRITEPROTECT) interface for either starting or
         resetting a range of memory, while there's no counterpart in the old
         soft-dirty world, hence if this is wanted in a new design we'll need a
         new interface otherwise.
      
      We can somehow understand that commonality because uffd-wp was
      fundamentally a similar idea of write-protecting pages just like
      soft-dirty.
      
      This implementation allows WP_ASYNC to imply WP_UNPOPULATED, because so
      far WP_ASYNC seems to not usable if without WP_UNPOPULATE.  This also
      gives us chance to modify impl of WP_ASYNC just in case it could be not
      depending on WP_UNPOPULATED anymore in the future kernels.  It's also fine
      to imply that because both features will rely on PTE_MARKER_UFFD_WP config
      option, so they'll show up together (or both missing) in an UFFDIO_API
      probe.
      
      vma_can_userfault() now allows any VMA if the userfaultfd registration is
      only about async uffd-wp.  So we can track dirty for all kinds of memory
      including generic file systems (like XFS, EXT4 or BTRFS).
      
      One trick worth mention in do_wp_page() is that we need to manually update
      vmf->orig_pte here because it can be used later with a pte_same() check -
      this path always has FAULT_FLAG_ORIG_PTE_VALID set in the flags.
      
      The major defect of this approach of dirty tracking is we need to populate
      the pgtables when tracking starts.  Soft-dirty doesn't do it like that. 
      It's unwanted in the case where the range of memory to track is huge and
      unpopulated (e.g., tracking updates on a 10G file with mmap() on top,
      without having any page cache installed yet).  One way to improve this is
      to allow pte markers exist for larger than PTE level for PMD+.  That will
      not change the interface if to implemented, so we can leave that for
      later.
      
      Link: https://lkml.kernel.org/r/20230821141518.870589-1-usama.anjum@collabora.com
      Link: https://lkml.kernel.org/r/20230821141518.870589-2-usama.anjum@collabora.comSigned-off-by: default avatarPeter Xu <peterx@redhat.com>
      Co-developed-by: default avatarMuhammad Usama Anjum <usama.anjum@collabora.com>
      Signed-off-by: default avatarMuhammad Usama Anjum <usama.anjum@collabora.com>
      Cc: Alex Sierra <alex.sierra@amd.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Andrei Vagin <avagin@gmail.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: Cyrill Gorcunov <gorcunov@gmail.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Gustavo A. R. Silva <gustavoars@kernel.org>
      Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Miroslaw <emmir@google.com>
      Cc: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
      Cc: Paul Gofman <pgofman@codeweavers.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Yun Zhou <yun.zhou@windriver.com>
      Cc: Michał Mirosław <mirq-linux@rere.qmqm.pl>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      d61ea1cb
    • Yosry Ahmed's avatar
      mm: memcg: normalize the value passed into memcg_rstat_updated() · 7bd5bc3c
      Yosry Ahmed authored
      memcg_rstat_updated() uses the value of the state update to keep track of
      the magnitude of pending updates, so that we only do a stats flush when
      it's worth the work.  Most values passed into memcg_rstat_updated() are in
      pages, however, a few of them are actually in bytes or KBs.
      
      To put this into perspective, a 512 byte slab allocation today would look
      the same as allocating 512 pages.  This may result in premature flushes,
      which means unnecessary work and latency.
      
      Normalize all the state values passed into memcg_rstat_updated() to pages.
      Round up non-zero sub-page to 1 page, because memcg_rstat_updated()
      ignores 0 page updates.
      
      Link: https://lkml.kernel.org/r/20230922175741.635002-3-yosryahmed@google.com
      Fixes: 5b3be698 ("memcg: better bounds on the memcg stats updates")
      Signed-off-by: default avatarYosry Ahmed <yosryahmed@google.com>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Michal Koutný <mkoutny@suse.com>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Shakeel Butt <shakeelb@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      7bd5bc3c
    • Yosry Ahmed's avatar
      mm: memcg: refactor page state unit helpers · ff841a06
      Yosry Ahmed authored
      Patch series "mm: memcg: fix tracking of pending stats updates values", v2.
      
      While working on adjacent code [1], I realized that the values passed into
      memcg_rstat_updated() to keep track of the magnitude of pending updates is
      consistent.  It is mostly in pages, but sometimes it can be in bytes or
      KBs.  Fix that.
      
      Patch 1 reworks memcg_page_state_unit() so that we can reuse it in patch 2
      to check and normalize the units of state updates.
      
      [1]https://lore.kernel.org/lkml/20230921081057.3440885-1-yosryahmed@google.com/
      
      
      This patch (of 2):
      
      memcg_page_state_unit() is currently used to identify the unit of a memcg
      state item so that all stats in memory.stat are in bytes.  However, it
      lies about the units of WORKINGSET_* stats.  These stats actually
      represent pages, but we present them to userspace as a scalar number of
      events.  In retrospect, maybe those stats should have been memcg "events"
      rather than memcg "state".
      
      In preparation for using memcg_page_state_unit() for other purposes that
      need to know the truthful units of different stat items, break it down
      into two helpers:
      - memcg_page_state_unit() retuns the actual unit of the item.
      - memcg_page_state_output_unit() returns the unit used for output.
      
      Use the latter instead of the former in memcg_page_state_output() and
      lruvec_page_state_output().  While we are at it, let's show cgroup v1 some
      love and add memcg_page_state_local_output() for consistency.
      
      No functional change intended.
      
      Link: https://lkml.kernel.org/r/20230922175741.635002-1-yosryahmed@google.com
      Link: https://lkml.kernel.org/r/20230922175741.635002-2-yosryahmed@google.comSigned-off-by: default avatarYosry Ahmed <yosryahmed@google.com>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Michal Koutný <mkoutny@suse.com>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Shakeel Butt <shakeelb@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      ff841a06
    • Kees Cook's avatar
      mm/memcg: annotate struct mem_cgroup_threshold_ary with __counted_by · b7c67206
      Kees Cook authored
      Prepare for the coming implementation by GCC and Clang of the __counted_by
      attribute.  Flexible array members annotated with __counted_by can have
      their accesses bounds-checked at run-time checking via CONFIG_UBSAN_BOUNDS
      (for array indexing) and CONFIG_FORTIFY_SOURCE (for strcpy/memcpy-family
      functions).
      
      As found with Coccinelle[1], add __counted_by for struct
      mem_cgroup_threshold_ary.
      
      [1] https://github.com/kees/kernel-tools/blob/trunk/coccinelle/examples/counted_by.cocci
      
      Link: https://lkml.kernel.org/r/20230922175327.work.985-kees@kernel.orgSigned-off-by: default avatarKees Cook <keescook@chromium.org>
      Acked-by: default avatarShakeel Butt <shakeelb@google.com>
      Acked-by: default avatarRoman Gushchin <roman.gushchin@linux.dev>
      Reviewed-by: default avatarGustavo A. R. Silva <gustavoars@kernel.org>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      b7c67206
    • Mike Kravetz's avatar
      hugetlb: check for hugetlb folio before vmemmap_restore · 30a89adf
      Mike Kravetz authored
      In commit d8f5f7e4 ("hugetlb: set hugetlb page flag before
      optimizing vmemmap") checks were added to print a warning if
      hugetlb_vmemmap_restore was called on a non-hugetlb page.
      
      This was mostly due to ordering issues in the hugetlb page set up and tear
      down sequencees.  One place missed was the routine
      dissolve_free_huge_page.
      
      Naoya Horiguchi noted: "I saw that VM_WARN_ON_ONCE() in
      hugetlb_vmemmap_restore is triggered when memory_failure() is called on a
      free hugetlb page with vmemmap optimization disabled (the warning is not
      triggered if vmemmap optimization is enabled).  I think that we need check
      folio_test_hugetlb() before dissolve_free_huge_page() calls
      hugetlb_vmemmap_restore_folio()."
      
      Perform the check as suggested by Naoya.
      
      Link: https://lkml.kernel.org/r/20231017032140.GA3680@monkey
      Fixes: d8f5f7e4 ("hugetlb: set hugetlb page flag before optimizing vmemmap")
      Signed-off-by: default avatarMike Kravetz <mike.kravetz@oracle.com>
      Suggested-by: default avatarNaoya Horiguchi <naoya.horiguchi@linux.dev>
      Tested-by: default avatarNaoya Horiguchi <naoya.horiguchi@linux.dev>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Barry Song <song.bao.hua@hisilicon.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joao Martins <joao.m.martins@oracle.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      30a89adf
    • Andrew Morton's avatar
    • Tiezhu Yang's avatar
      selftests/clone3: Fix broken test under !CONFIG_TIME_NS · fc7f04dc
      Tiezhu Yang authored
      When execute the following command to test clone3 under !CONFIG_TIME_NS:
      
        # make headers && cd tools/testing/selftests/clone3 && make && ./clone3
      
      we can see the following error info:
      
        # [7538] Trying clone3() with flags 0x80 (size 0)
        # Invalid argument - Failed to create new process
        # [7538] clone3() with flags says: -22 expected 0
        not ok 18 [7538] Result (-22) is different than expected (0)
        ...
        # Totals: pass:18 fail:1 xfail:0 xpass:0 skip:0 error:0
      
      This is because if CONFIG_TIME_NS is not set, but the flag
      CLONE_NEWTIME (0x80) is used to clone a time namespace, it
      will return -EINVAL in copy_time_ns().
      
      If kernel does not support CONFIG_TIME_NS, /proc/self/ns/time
      will be not exist, and then we should skip clone3() test with
      CLONE_NEWTIME.
      
      With this patch under !CONFIG_TIME_NS:
      
        # make headers && cd tools/testing/selftests/clone3 && make && ./clone3
        ...
        # Time namespaces are not supported
        ok 18 # SKIP Skipping clone3() with CLONE_NEWTIME
        ...
        # Totals: pass:18 fail:0 xfail:0 xpass:0 skip:1 error:0
      
      Link: https://lkml.kernel.org/r/1689066814-13295-1-git-send-email-yangtiezhu@loongson.cn
      Fixes: 515bddf0 ("selftests/clone3: test clone3 with CLONE_NEWTIME")
      Signed-off-by: default avatarTiezhu Yang <yangtiezhu@loongson.cn>
      Suggested-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      fc7f04dc
    • Liam R. Howlett's avatar
      maple_tree: add GFP_KERNEL to allocations in mas_expected_entries() · 099d7439
      Liam R. Howlett authored
      Users complained about OOM errors during fork without triggering
      compaction.  This can be fixed by modifying the flags used in
      mas_expected_entries() so that the compaction will be triggered in low
      memory situations.  Since mas_expected_entries() is only used during fork,
      the extra argument does not need to be passed through.
      
      Additionally, the two test_maple_tree test cases and one benchmark test
      were altered to use the correct locking type so that allocations would not
      trigger sleeping and thus fail.  Testing was completed with lockdep atomic
      sleep detection.
      
      The additional locking change requires rwsem support additions to the
      tools/ directory through the use of pthreads pthread_rwlock_t.  With this
      change test_maple_tree works in userspace, as a module, and in-kernel.
      
      Users may notice that the system gave up early on attempting to start new
      processes instead of attempting to reclaim memory.
      
      Link: https://lkml.kernel.org/r/20230915093243epcms1p46fa00bbac1ab7b7dca94acb66c44c456@epcms1p4
      Link: https://lkml.kernel.org/r/20231012155233.2272446-1-Liam.Howlett@oracle.com
      Fixes: 54a611b6 ("Maple Tree: add new data structure")
      Signed-off-by: default avatarLiam R. Howlett <Liam.Howlett@oracle.com>
      Reviewed-by: default avatarPeng Zhang <zhangpeng.00@bytedance.com>
      Cc: <jason.sim@samsung.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      099d7439
    • Samasth Norway Ananda's avatar
      selftests/mm: include mman header to access MREMAP_DONTUNMAP identifier · e2de156b
      Samasth Norway Ananda authored
      Definition for MREMAP_DONTUNMAP is not present in glibc older than 2.32
      thus throwing an undeclared error when running make on mm.  Including
      linux/mman.h solves the build error for people having older glibc.
      
      Link: https://lkml.kernel.org/r/20231012155257.891776-1-samasth.norway.ananda@oracle.com
      Fixes: 0183d777 ("selftests: mm: remove duplicate unneeded defines")
      Signed-off-by: default avatarSamasth Norway Ananda <samasth.norway.ananda@oracle.com>
      Reported-by: default avatarLinux Kernel Functional Testing <lkft@linaro.org>
      Closes: https://lore.kernel.org/linux-mm/CA+G9fYvV-71XqpCr_jhdDfEtN701fBdG3q+=bafaZiGwUXy_aA@mail.gmail.com/Tested-by: default avatarMuhammad Usama Anjum <usama.anjum@collabora.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      e2de156b
    • Oleksij Rempel's avatar
      mailmap: correct email aliasing for Oleksij Rempel · d2313c77
      Oleksij Rempel authored
      Ensure the current work email addresses for Oleksij Rempel are preserved
      and not overridden by private address.  Alias the alternate work email to
      the primary work email address.
      
      Link: https://lkml.kernel.org/r/20231011112519.1427077-1-o.rempel@pengutronix.deSigned-off-by: default avatarOleksij Rempel <o.rempel@pengutronix.de>
      Cc: Jakub Kicinski <kuba@kernel.org>
      Cc: Konrad Dybcio <konrad.dybcio@linaro.org> # qcom
      Cc: Mark Brown <broonie@kernel.org>
      Cc: Qais Yousef <qyousef@layalina.io>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      d2313c77
    • Bartosz Golaszewski's avatar
      mailmap: map Bartosz's old address to the current one · 002e39e9
      Bartosz Golaszewski authored
      I no longer work for BayLibre but many DT bindings have my BL address in
      the maintainers entries.  Map it to the email address I use for kernel
      development.
      
      Link: https://lkml.kernel.org/r/20231011150104.73863-1-brgl@bgdev.plSigned-off-by: default avatarBartosz Golaszewski <bartosz.golaszewski@linaro.org>
      Suggested-by: default avatarConor Dooley <conor@kernel.org>
      Cc: Bartosz Golaszewski <bartosz.golaszewski@linaro.org>
      Cc: Bjorn Andersson <quic_bjorande@quicinc.com>
      Cc: Heiko Stuebner <heiko@sntech.de>
      Cc: Jakub Kicinski <kuba@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Konrad Dybcio <konrad.dybcio@linaro.org> # qcom
      Cc: Qais Yousef <qyousef@layalina.io>
      Cc: Stephen Hemminger <stephen@networkplumber.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      002e39e9
    • SeongJae Park's avatar
      mm/damon/sysfs: check DAMOS regions update progress from before_terminate() · 76b7069b
      SeongJae Park authored
      DAMON_SYSFS can receive DAMOS tried regions update request while kdamond
      is already out of the main loop and before_terminate callback
      (damon_sysfs_before_terminate() in this case) is not yet called.  And
      damon_sysfs_handle_cmd() can further be finished before the callback is
      invoked.  Then, damon_sysfs_before_terminate() unlocks damon_sysfs_lock,
      which is not locked by anyone.  This happens because the callback function
      assumes damon_sysfs_cmd_request_callback() should be called before it. 
      Check if the assumption was true before doing the unlock, to avoid this
      problem.
      
      Link: https://lkml.kernel.org/r/20231007200432.3110-1-sj@kernel.org
      Fixes: f1d13cac ("mm/damon/sysfs: implement DAMOS tried regions update command")
      Signed-off-by: default avatarSeongJae Park <sj@kernel.org>
      Cc: <stable@vger.kernel.org>	[6.2.x]
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      76b7069b
    • Ondrej Jirman's avatar
      MAINTAINERS: Ondrej has moved · c5155d4e
      Ondrej Jirman authored
      Update my email-address in MAINTAINERS to <megi@xff.cz>.  Also add
      .mailmap entries to map my old, now blocked, email address.
      
      Link: https://lkml.kernel.org/r/20231008105812.1084226-1-megi@xff.czSigned-off-by: default avatarOndrej Jirman <megi@xff.cz>
      Cc: Bjorn Andersson <quic_bjorande@quicinc.com>
      Cc: Heiko Stuebner <heiko@sntech.de>
      Cc: Jakub Kicinski <kuba@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Konrad Dybcio <konrad.dybcio@linaro.org> # qcom
      Cc: Mark Brown <broonie@kernel.org>
      Cc: Qais Yousef <qyousef@layalina.io>
      Cc: Stephen Hemminger <stephen@networkplumber.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      c5155d4e
    • Arnd Bergmann's avatar
      kasan: disable kasan_non_canonical_hook() for HW tags · 17c17567
      Arnd Bergmann authored
      On arm64, building with CONFIG_KASAN_HW_TAGS now causes a compile-time
      error:
      
      mm/kasan/report.c: In function 'kasan_non_canonical_hook':
      mm/kasan/report.c:637:20: error: 'KASAN_SHADOW_OFFSET' undeclared (first use in this function)
        637 |         if (addr < KASAN_SHADOW_OFFSET)
            |                    ^~~~~~~~~~~~~~~~~~~
      mm/kasan/report.c:637:20: note: each undeclared identifier is reported only once for each function it appears in
      mm/kasan/report.c:640:77: error: expected expression before ';' token
        640 |         orig_addr = (addr - KASAN_SHADOW_OFFSET) << KASAN_SHADOW_SCALE_SHIFT;
      
      This was caused by removing the dependency on CONFIG_KASAN_INLINE that
      used to prevent this from happening. Use the more specific dependency
      on KASAN_SW_TAGS || KASAN_GENERIC to only ignore the function for hwasan
      mode.
      
      Link: https://lkml.kernel.org/r/20231016200925.984439-1-arnd@kernel.org
      Fixes: 12ec6a919b0f ("kasan: print the original fault addr when access invalid shadow")
      Signed-off-by: default avatarArnd Bergmann <arnd@arndb.de>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Andrey Konovalov <andreyknvl@gmail.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Haibo Li <haibo.li@mediatek.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
      Cc: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com>
      Cc: Matthias Brugger <matthias.bgg@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      17c17567
    • Haibo Li's avatar
      kasan: print the original fault addr when access invalid shadow · babddbfb
      Haibo Li authored
      when the checked address is illegal,the corresponding shadow address from
      kasan_mem_to_shadow may have no mapping in mmu table.  Access such shadow
      address causes kernel oops.  Here is a sample about oops on arm64(VA
      39bit) with KASAN_SW_TAGS and KASAN_OUTLINE on:
      
      [ffffffb80aaaaaaa] pgd=000000005d3ce003, p4d=000000005d3ce003,
          pud=000000005d3ce003, pmd=0000000000000000
      Internal error: Oops: 0000000096000006 [#1] PREEMPT SMP
      Modules linked in:
      CPU: 3 PID: 100 Comm: sh Not tainted 6.6.0-rc1-dirty #43
      Hardware name: linux,dummy-virt (DT)
      pstate: 80000005 (Nzcv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
      pc : __hwasan_load8_noabort+0x5c/0x90
      lr : do_ib_ob+0xf4/0x110
      ffffffb80aaaaaaa is the shadow address for efffff80aaaaaaaa.
      The problem is reading invalid shadow in kasan_check_range.
      
      The generic kasan also has similar oops.
      
      It only reports the shadow address which causes oops but not
      the original address.
      
      Commit 2f004eea("x86/kasan: Print original address on #GP")
      introduce to kasan_non_canonical_hook but limit it to KASAN_INLINE.
      
      This patch extends it to KASAN_OUTLINE mode.
      
      Link: https://lkml.kernel.org/r/20231009073748.159228-1-haibo.li@mediatek.com
      Fixes: 2f004eea("x86/kasan: Print original address on #GP")
      Signed-off-by: default avatarHaibo Li <haibo.li@mediatek.com>
      Reviewed-by: default avatarAndrey Konovalov <andreyknvl@gmail.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Haibo Li <haibo.li@mediatek.com>
      Cc: Matthias Brugger <matthias.bgg@gmail.com>
      Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Kees Cook <keescook@chromium.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      babddbfb
    • Rik van Riel's avatar
      hugetlbfs: close race between MADV_DONTNEED and page fault · 2820b0f0
      Rik van Riel authored
      Malloc libraries, like jemalloc and tcalloc, take decisions on when to
      call madvise independently from the code in the main application.
      
      This sometimes results in the application page faulting on an address,
      right after the malloc library has shot down the backing memory with
      MADV_DONTNEED.
      
      Usually this is harmless, because we always have some 4kB pages sitting
      around to satisfy a page fault.  However, with hugetlbfs systems often
      allocate only the exact number of huge pages that the application wants.
      
      Due to TLB batching, hugetlbfs MADV_DONTNEED will free pages outside of
      any lock taken on the page fault path, which can open up the following
      race condition:
      
             CPU 1                            CPU 2
      
             MADV_DONTNEED
             unmap page
             shoot down TLB entry
                                             page fault
                                             fail to allocate a huge page
                                             killed with SIGBUS
             free page
      
      Fix that race by pulling the locking from __unmap_hugepage_final_range
      into helper functions called from zap_page_range_single.  This ensures
      page faults stay locked out of the MADV_DONTNEED VMA until the huge pages
      have actually been freed.
      
      Link: https://lkml.kernel.org/r/20231006040020.3677377-4-riel@surriel.com
      Fixes: 04ada095 ("hugetlb: don't delete vma_lock in hugetlb MADV_DONTNEED processing")
      Signed-off-by: default avatarRik van Riel <riel@surriel.com>
      Reviewed-by: default avatarMike Kravetz <mike.kravetz@oracle.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      2820b0f0
    • Rik van Riel's avatar
      hugetlbfs: extend hugetlb_vma_lock to private VMAs · bf491692
      Rik van Riel authored
      Extend the locking scheme used to protect shared hugetlb mappings from
      truncate vs page fault races, in order to protect private hugetlb mappings
      (with resv_map) against MADV_DONTNEED.
      
      Add a read-write semaphore to the resv_map data structure, and use that
      from the hugetlb_vma_(un)lock_* functions, in preparation for closing the
      race between MADV_DONTNEED and page faults.
      
      Link: https://lkml.kernel.org/r/20231006040020.3677377-3-riel@surriel.com
      Fixes: 04ada095 ("hugetlb: don't delete vma_lock in hugetlb MADV_DONTNEED processing")
      Signed-off-by: default avatarRik van Riel <riel@surriel.com>
      Reviewed-by: default avatarMike Kravetz <mike.kravetz@oracle.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      bf491692
    • Rik van Riel's avatar
      hugetlbfs: clear resv_map pointer if mmap fails · 92fe9dcb
      Rik van Riel authored
      Patch series "hugetlbfs: close race between MADV_DONTNEED and page fault", v7.
      
      Malloc libraries, like jemalloc and tcalloc, take decisions on when to
      call madvise independently from the code in the main application.
      
      This sometimes results in the application page faulting on an address,
      right after the malloc library has shot down the backing memory with
      MADV_DONTNEED.
      
      Usually this is harmless, because we always have some 4kB pages sitting
      around to satisfy a page fault.  However, with hugetlbfs systems often
      allocate only the exact number of huge pages that the application wants.
      
      Due to TLB batching, hugetlbfs MADV_DONTNEED will free pages outside of
      any lock taken on the page fault path, which can open up the following
      race condition:
      
             CPU 1                            CPU 2
      
             MADV_DONTNEED
             unmap page
             shoot down TLB entry
                                             page fault
                                             fail to allocate a huge page
                                             killed with SIGBUS
             free page
      
      Fix that race by extending the hugetlb_vma_lock locking scheme to also
      cover private hugetlb mappings (with resv_map), and pulling the locking
      from __unmap_hugepage_final_range into helper functions called from
      zap_page_range_single.  This ensures page faults stay locked out of the
      MADV_DONTNEED VMA until the huge pages have actually been freed.
      
      
      This patch (of 3):
      
      Hugetlbfs leaves a dangling pointer in the VMA if mmap fails.  This has
      not been a problem so far, but other code in this patch series tries to
      follow that pointer.
      
      Link: https://lkml.kernel.org/r/20231006040020.3677377-1-riel@surriel.com
      Link: https://lkml.kernel.org/r/20231006040020.3677377-2-riel@surriel.com
      Fixes: 04ada095 ("hugetlb: don't delete vma_lock in hugetlb MADV_DONTNEED processing")
      Signed-off-by: default avatarMike Kravetz <mike.kravetz@oracle.com>
      Signed-off-by: default avatarRik van Riel <riel@surriel.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      92fe9dcb
    • Johannes Weiner's avatar
      mm: zswap: fix pool refcount bug around shrink_worker() · 969d63e1
      Johannes Weiner authored
      When a zswap store fails due to the limit, it acquires a pool reference
      and queues the shrinker.  When the shrinker runs, it drops the reference. 
      However, there can be multiple store attempts before the shrinker wakes up
      and runs once.  This results in reference leaks and eventual saturation
      warnings for the pool refcount.
      
      Fix this by dropping the reference again when the shrinker is already
      queued.  This ensures one reference per shrinker run.
      
      Link: https://lkml.kernel.org/r/20231006160024.170748-1-hannes@cmpxchg.org
      Fixes: 45190f01 ("mm/zswap.c: add allocation hysteresis if pool limit is hit")
      Signed-off-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Reported-by: default avatarChris Mason <clm@fb.com>
      Acked-by: default avatarNhat Pham <nphamcs@gmail.com>
      Cc: Vitaly Wool <vitaly.wool@konsulko.com>
      Cc: Domenico Cerasuolo <cerasuolodomenico@gmail.com>
      Cc: <stable@vger.kernel.org>	[5.6+]
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      969d63e1
  2. 16 Oct, 2023 13 commits
    • Stefan Roesch's avatar
      mm/ksm: document pages_skipped sysfs knob · b0540208
      Stefan Roesch authored
      This adds documentation for the new metric pages_skipped.
      
      Link: https://lkml.kernel.org/r/20230926040939.516161-5-shr@devkernel.ioSigned-off-by: default avatarStefan Roesch <shr@devkernel.io>
      Reviewed-by: default avatarDavid Hildenbrand <david@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Rik van Riel <riel@surriel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      b0540208
    • Stefan Roesch's avatar
      mm/ksm: document smart scan mode · 75d7dd41
      Stefan Roesch authored
      This adds documentation for the smart scan mode of KSM.
      
      [akpm@linux-foundation.org: fix typo]
      [akpm@linux-foundation.org: document that smart_scan defaults to on]
      Link: https://lkml.kernel.org/r/20230926040939.516161-4-shr@devkernel.ioSigned-off-by: default avatarStefan Roesch <shr@devkernel.io>
      Reviewed-by: default avatarDavid Hildenbrand <david@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Rik van Riel <riel@surriel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      75d7dd41
    • Stefan Roesch's avatar
      mm/ksm: add pages_skipped metric · e5a68991
      Stefan Roesch authored
      This change adds the "pages skipped" metric.  To be able to evaluate how
      successful smart page scanning is, the pages skipped metric can be
      compared to the pages scanned metric.
      
      The pages skipped metric is a cumulative counter.  The counter is stored
      under /sys/kernel/mm/ksm/pages_skipped.
      
      Link: https://lkml.kernel.org/r/20230926040939.516161-3-shr@devkernel.ioSigned-off-by: default avatarStefan Roesch <shr@devkernel.io>
      Reviewed-by: default avatarDavid Hildenbrand <david@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Rik van Riel <riel@surriel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      e5a68991
    • Stefan Roesch's avatar
      mm/ksm: add "smart" page scanning mode · 5e924ff5
      Stefan Roesch authored
      Patch series "Smart scanning mode for KSM", v3.
      
      This patch series adds "smart scanning" for KSM.
      
      What is smart scanning?
      =======================
      KSM evaluates all the candidate pages for each scan. It does not use historic
      information from previous scans. This has the effect that candidate pages that
      couldn't be used for KSM de-duplication continue to be evaluated for each scan.
      
      The idea of "smart scanning" is to keep historic information. With the historic
      information we can temporarily skip the candidate page for one or several scans.
      
      Details:
      ========
      "Smart scanning" is to keep two small counters to store if the page has been
      used for KSM. One counter stores how often we already tried to use the page for
      KSM and the other counter stores how often we skip a page.
      
      How often we skip the candidate page depends how often a page failed KSM
      de-duplication. The code skips a maximum of 8 times. During testing this has
      shown to be a good compromise for different workloads.
      
      New sysfs knob:
      ===============
      Smart scanning is not enabled by default. With /sys/kernel/mm/ksm/smart_scan
      smart scanning can be enabled.
      
      Monitoring:
      ===========
      To monitor how effective smart scanning is a new sysfs knob has been introduced.
      /sys/kernel/mm/pages_skipped report how many pages have been skipped by smart
      scanning.
      
      Results:
      ========
      - Various workloads have shown a 20% - 25% reduction in page scans
        For the instagram workload for instance, the number of pages scanned has been
        reduced from over 20M pages per scan to less than 15M pages.
      - Less pages scans also resulted in an overall higher de-duplication rate as
        some shorter lived pages could be de-duplicated additionally
      - Less pages scanned allows to reduce the pages_to_scan parameter
        and this resulted in  a 25% reduction in terms of CPU.
      - The improvements have been observed for workloads that enable KSM with
        madvise as well as prctl
      
      
      This patch (of 4):
      
      This change adds a "smart" page scanning mode for KSM.  So far all the
      candidate pages are continuously scanned to find candidates for
      de-duplication.  There are a considerably number of pages that cannot be
      de-duplicated.  This is costly in terms of CPU.  By using smart scanning
      considerable CPU savings can be achieved.
      
      This change takes the history of scanning pages into account and skips the
      page scanning of certain pages for a while if de-deduplication for this
      page has not been successful in the past.
      
      To do this it introduces two new fields in the ksm_rmap_item structure:
      age and remaining_skips.  age, is the KSM age and remaining_skips
      determines how often scanning of this page is skipped.  The age field is
      incremented each time the page is scanned and the page cannot be de-
      duplicated.  age updated is capped at U8_MAX.
      
      How often a page is skipped is dependent how often de-duplication has been
      tried so far and the number of skips is currently limited to 8.  This
      value has shown to be effective with different workloads.
      
      The feature is currently disable by default and can be enabled with the
      new smart_scan knob.
      
      The feature has shown to be very effective: upt to 25% of the page scans
      can be eliminated; the pages_to_scan rate can be reduced by 40 - 50% and a
      similar de-duplication rate can be maintained.
      
      [akpm@linux-foundation.org: make ksm_smart_scan default true, for testing]
      Link: https://lkml.kernel.org/r/20230926040939.516161-1-shr@devkernel.io
      Link: https://lkml.kernel.org/r/20230926040939.516161-2-shr@devkernel.ioSigned-off-by: default avatarStefan Roesch <shr@devkernel.io>
      Reviewed-by: default avatarDavid Hildenbrand <david@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Stefan Roesch <shr@devkernel.io>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      5e924ff5
    • Huang Ying's avatar
      dax, kmem: calculate abstract distance with general interface · 6bc2cfdf
      Huang Ying authored
      Previously, a fixed abstract distance MEMTIER_DEFAULT_DAX_ADISTANCE is
      used for slow memory type in kmem driver.  This limits the usage of kmem
      driver, for example, it cannot be used for HBM (high bandwidth memory).
      
      So, we use the general abstract distance calculation mechanism in kmem
      drivers to get more accurate abstract distance on systems with proper
      support.  The original MEMTIER_DEFAULT_DAX_ADISTANCE is used as fallback
      only.
      
      Now, multiple memory types may be managed by kmem.  These memory types are
      put into the "kmem_memory_types" list and protected by
      kmem_memory_type_lock.
      
      Link: https://lkml.kernel.org/r/20230926060628.265989-5-ying.huang@intel.comSigned-off-by: default avatar"Huang, Ying" <ying.huang@intel.com>
      Tested-by: default avatarBharata B Rao <bharata@amd.com>
      Reviewed-by: default avatarDave Jiang <dave.jiang@intel.com>
      Reviewed-by: default avatarAlistair Popple <apopple@nvidia.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Wei Xu <weixugc@google.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Rafael J Wysocki <rafael.j.wysocki@intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      6bc2cfdf
    • Huang Ying's avatar
      acpi, hmat: calculate abstract distance with HMAT · 3718c02d
      Huang Ying authored
      A memory tiering abstract distance calculation algorithm based on ACPI
      HMAT is implemented.  The basic idea is as follows.
      
      The performance attributes of system default DRAM nodes are recorded as
      the base line.  Whose abstract distance is MEMTIER_ADISTANCE_DRAM.  Then,
      the ratio of the abstract distance of a memory node (target) to
      MEMTIER_ADISTANCE_DRAM is scaled based on the ratio of the performance
      attributes of the node to that of the default DRAM nodes.
      
      The functions to record the read/write latency/bandwidth of the default
      DRAM nodes and calculate abstract distance according to read/write
      latency/bandwidth ratio will be used by CXL CDAT (Coherent Device
      Attribute Table) and other memory device drivers.  So, they are put in
      memory-tiers.c.
      
      Link: https://lkml.kernel.org/r/20230926060628.265989-4-ying.huang@intel.comSigned-off-by: default avatar"Huang, Ying" <ying.huang@intel.com>
      Tested-by: default avatarBharata B Rao <bharata@amd.com>
      Reviewed-by: default avatarDave Jiang <dave.jiang@intel.com>
      Reviewed-by: default avatarAlistair Popple <apopple@nvidia.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Wei Xu <weixugc@google.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Rafael J Wysocki <rafael.j.wysocki@intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      3718c02d
    • Huang Ying's avatar
      acpi, hmat: refactor hmat_register_target_initiators() · d0376aac
      Huang Ying authored
      Previously, in hmat_register_target_initiators(), the performance
      attributes are calculated and the corresponding sysfs links and files are
      created too.  Which is called during memory onlining.
      
      But now, to calculate the abstract distance of a memory target before
      memory onlining, we need to calculate the performance attributes for a
      memory target without creating sysfs links and files.
      
      To do that, hmat_register_target_initiators() is refactored to make it
      possible to calculate performance attributes separately.
      
      Link: https://lkml.kernel.org/r/20230926060628.265989-3-ying.huang@intel.comSigned-off-by: default avatar"Huang, Ying" <ying.huang@intel.com>
      Reviewed-by: default avatarAlistair Popple <apopple@nvidia.com>
      Tested-by: default avatarAlistair Popple <apopple@nvidia.com>
      Tested-by: default avatarBharata B Rao <bharata@amd.com>
      Reviewed-by: default avatarDave Jiang <dave.jiang@intel.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Wei Xu <weixugc@google.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Rafael J Wysocki <rafael.j.wysocki@intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      d0376aac
    • Huang Ying's avatar
      memory tiering: add abstract distance calculation algorithms management · 07a8bdd4
      Huang Ying authored
      Patch series "memory tiering: calculate abstract distance based on ACPI
      HMAT", v4.
      
      We have the explicit memory tiers framework to manage systems with
      multiple types of memory, e.g., DRAM in DIMM slots and CXL memory devices.
      Where, same kind of memory devices will be grouped into memory types,
      then put into memory tiers.  To describe the performance of a memory type,
      abstract distance is defined.  Which is in direct proportion to the memory
      latency and inversely proportional to the memory bandwidth.  To keep the
      code as simple as possible, fixed abstract distance is used in dax/kmem to
      describe slow memory such as Optane DCPMM.
      
      To support more memory types, in this series, we added the abstract
      distance calculation algorithm management mechanism, provided a algorithm
      implementation based on ACPI HMAT, and used the general abstract distance
      calculation interface in dax/kmem driver.  So, dax/kmem can support HBM
      (high bandwidth memory) in addition to the original Optane DCPMM.
      
      
      This patch (of 4):
      
      The abstract distance may be calculated by various drivers, such as ACPI
      HMAT, CXL CDAT, etc.  While it may be used by various code which hot-add
      memory node, such as dax/kmem etc.  To decouple the algorithm users and
      the providers, the abstract distance calculation algorithms management
      mechanism is implemented in this patch.  It provides interface for the
      providers to register the implementation, and interface for the users.
      
      Multiple algorithm implementations can cooperate via calculating abstract
      distance for different memory nodes.  The preference of algorithm
      implementations can be specified via priority (notifier_block.priority).
      
      Link: https://lkml.kernel.org/r/20230926060628.265989-1-ying.huang@intel.com
      Link: https://lkml.kernel.org/r/20230926060628.265989-2-ying.huang@intel.comSigned-off-by: default avatar"Huang, Ying" <ying.huang@intel.com>
      Tested-by: default avatarBharata B Rao <bharata@amd.com>
      Reviewed-by: default avatarAlistair Popple <apopple@nvidia.com>
      Reviewed-by: default avatarDave Jiang <dave.jiang@intel.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Wei Xu <weixugc@google.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Rafael J Wysocki <rafael.j.wysocki@intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      07a8bdd4
    • Sidhartha Kumar's avatar
      mm/hugetlb: replace page_ref_freeze() with folio_ref_freeze() in hugetlb_folio_init_vmemmap() · a48bf7b4
      Sidhartha Kumar authored
      No functional difference, folio_ref_freeze() is currently a wrapper for
      page_ref_freeze().
      
      Link: https://lkml.kernel.org/r/20230926174433.81241-1-sidhartha.kumar@oracle.comSigned-off-by: default avatarSidhartha Kumar <sidhartha.kumar@oracle.com>
      Reviewed-by: Muchun Song <songmuchun@bytedance.com> 
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Usama Arif <usama.arif@bytedance.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      a48bf7b4
    • Sidhartha Kumar's avatar
      mm/filemap: remove hugetlb special casing in filemap.c · a08c7193
      Sidhartha Kumar authored
      Remove special cased hugetlb handling code within the page cache by
      changing the granularity of ->index to the base page size rather than the
      huge page size.  The motivation of this patch is to reduce complexity
      within the filemap code while also increasing performance by removing
      branches that are evaluated on every page cache lookup.
      
      To support the change in index, new wrappers for hugetlb page cache
      interactions are added.  These wrappers perform the conversion to a linear
      index which is now expected by the page cache for huge pages.
      
      ========================= PERFORMANCE ======================================
      
      Perf was used to check the performance differences after the patch. 
      Overall the performance is similar to mainline with a very small larger
      overhead that occurs in __filemap_add_folio() and
      hugetlb_add_to_page_cache().  This is because of the larger overhead that
      occurs in xa_load() and xa_store() as the xarray is now using more entries
      to store hugetlb folios in the page cache.
      
      Timing
      
      aarch64
          2MB Page Size
              6.5-rc3 + this patch:
                  [root@sidhakum-ol9-1 hugepages]# time fallocate -l 700GB test.txt
                  real    1m49.568s
                  user    0m0.000s
                  sys     1m49.461s
      
              6.5-rc3:
                  [root]# time fallocate -l 700GB test.txt
                  real    1m47.495s
                  user    0m0.000s
                  sys     1m47.370s
          1GB Page Size
              6.5-rc3 + this patch:
                  [root@sidhakum-ol9-1 hugepages1G]# time fallocate -l 700GB test.txt
                  real    1m47.024s
                  user    0m0.000s
                  sys     1m46.921s
      
              6.5-rc3:
                  [root@sidhakum-ol9-1 hugepages1G]# time fallocate -l 700GB test.txt
                  real    1m44.551s
                  user    0m0.000s
                  sys     1m44.438s
      
      x86
          2MB Page Size
              6.5-rc3 + this patch:
                  [root@sidhakum-ol9-2 hugepages]# time fallocate -l 100GB test.txt
                  real    0m22.383s
                  user    0m0.000s
                  sys     0m22.255s
      
              6.5-rc3:
                  [opc@sidhakum-ol9-2 hugepages]$ time sudo fallocate -l 100GB /dev/hugepages/test.txt
                  real    0m22.735s
                  user    0m0.038s
                  sys     0m22.567s
      
          1GB Page Size
              6.5-rc3 + this patch:
                  [root@sidhakum-ol9-2 hugepages1GB]# time fallocate -l 100GB test.txt
                  real    0m25.786s
                  user    0m0.001s
                  sys     0m25.589s
      
              6.5-rc3:
                  [root@sidhakum-ol9-2 hugepages1G]# time fallocate -l 100GB test.txt
                  real    0m33.454s
                  user    0m0.001s
                  sys     0m33.193s
      
      aarch64:
          workload - fallocate a 700GB file backed by huge pages
      
          6.5-rc3 + this patch:
              2MB Page Size:
                  --100.00%--__arm64_sys_fallocate
                                ksys_fallocate
                                vfs_fallocate
                                hugetlbfs_fallocate
                                |
                                |--95.04%--__pi_clear_page
                                |
                                |--3.57%--clear_huge_page
                                |          |
                                |          |--2.63%--rcu_all_qs
                                |          |
                                |           --0.91%--__cond_resched
                                |
                                 --0.67%--__cond_resched
                  0.17%     0.00%             0  fallocate  [kernel.vmlinux]       [k] hugetlb_add_to_page_cache
                  0.14%     0.10%            11  fallocate  [kernel.vmlinux]       [k] __filemap_add_folio
      
          6.5-rc3
              2MB Page Size:
                      --100.00%--__arm64_sys_fallocate
                                ksys_fallocate
                                vfs_fallocate
                                hugetlbfs_fallocate
                                |
                                |--94.91%--__pi_clear_page
                                |
                                |--4.11%--clear_huge_page
                                |          |
                                |          |--3.00%--rcu_all_qs
                                |          |
                                |           --1.10%--__cond_resched
                                |
                                 --0.59%--__cond_resched
                  0.08%     0.01%             1  fallocate  [kernel.kallsyms]  [k] hugetlb_add_to_page_cache
                  0.05%     0.03%             3  fallocate  [kernel.kallsyms]  [k] __filemap_add_folio
      
      x86
          workload - fallocate a 100GB file backed by huge pages
      
          6.5-rc3 + this patch:
              2MB Page Size:
                  hugetlbfs_fallocate
                  |
                  --99.57%--clear_huge_page
                      |
                      --98.47%--clear_page_erms
                          |
                          --0.53%--asm_sysvec_apic_timer_interrupt
      
                  0.04%     0.04%             1  fallocate  [kernel.kallsyms]     [k] xa_load
                  0.04%     0.00%             0  fallocate  [kernel.kallsyms]     [k] hugetlb_add_to_page_cache
                  0.04%     0.00%             0  fallocate  [kernel.kallsyms]     [k] __filemap_add_folio
                  0.04%     0.00%             0  fallocate  [kernel.kallsyms]     [k] xas_store
      
          6.5-rc3
              2MB Page Size:
                      --99.93%--__x64_sys_fallocate
                                vfs_fallocate
                                hugetlbfs_fallocate
                                |
                                 --99.38%--clear_huge_page
                                           |
                                           |--98.40%--clear_page_erms
                                           |
                                            --0.59%--__cond_resched
                  0.03%     0.03%             1  fallocate  [kernel.kallsyms]  [k] __filemap_add_folio
      
      ========================= TESTING ======================================
      
      This patch passes libhugetlbfs tests and LTP hugetlb tests
      
      ********** TEST SUMMARY
      *                      2M
      *                      32-bit 64-bit
      *     Total testcases:   110    113
      *             Skipped:     0      0
      *                PASS:   107    113
      *                FAIL:     0      0
      *    Killed by signal:     3      0
      *   Bad configuration:     0      0
      *       Expected FAIL:     0      0
      *     Unexpected PASS:     0      0
      *    Test not present:     0      0
      * Strange test result:     0      0
      **********
      
          Done executing testcases.
          LTP Version:  20220527-178-g2761a81c4
      
      page migration was also tested using Mike Kravetz's test program.[8]
      
      [dan.carpenter@linaro.org: fix an NULL vs IS_ERR() bug]
        Link: https://lkml.kernel.org/r/1772c296-1417-486f-8eef-171af2192681@moroto.mountain
      Link: https://lkml.kernel.org/r/20230926192017.98183-1-sidhartha.kumar@oracle.comSigned-off-by: default avatarSidhartha Kumar <sidhartha.kumar@oracle.com>
      Signed-off-by: default avatarDan Carpenter <dan.carpenter@linaro.org>
      Reported-and-tested-by: syzbot+c225dea486da4d5592bd@syzkaller.appspotmail.com
      Closes: https://syzkaller.appspot.com/bug?extid=c225dea486da4d5592bd
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      a08c7193
    • Stefan Roesch's avatar
      mm/ksm: test case for prctl fork/exec workflow · 0374af1d
      Stefan Roesch authored
      This adds a new test case to the ksm functional tests to make sure that
      the KSM setting is inherited by the child process when doing a fork/exec.
      
      Link: https://lkml.kernel.org/r/20230922211141.320789-3-shr@devkernel.ioSigned-off-by: default avatarStefan Roesch <shr@devkernel.io>
      Reviewed-by: default avatarDavid Hildenbrand <david@redhat.com>
      Cc: Carl Klemm <carl@uvos.xyz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Rik van Riel <riel@surriel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      0374af1d
    • Stefan Roesch's avatar
      mm/ksm: support fork/exec for prctl · 3c6f33b7
      Stefan Roesch authored
      Patch series "mm/ksm: add fork-exec support for prctl", v4.
      
      A process can enable KSM with the prctl system call.  When the process is
      forked the KSM flag is inherited by the child process.  However if the
      process is executing an exec system call directly after the fork, the KSM
      setting is cleared.  This patch series addresses this problem.
      
      1) Change the mask in coredump.h for execing a new process
      2) Add a new test case in ksm_functional_tests
      
      
      This patch (of 2):
      
      Today we have two ways to enable KSM:
      
      1) madvise system call
         This allows to enable KSM for a memory region for a long time.
      
      2) prctl system call
         This is a recent addition to enable KSM for the complete process.
         In addition when a process is forked, the KSM setting is inherited.
      
      This change only affects the second case.
      
      One of the use cases for (2) was to support the ability to enable
      KSM for cgroups. This allows systemd to enable KSM for the seed
      process. By enabling it in the seed process all child processes inherit
      the setting.
      
      This works correctly when the process is forked. However it doesn't
      support fork/exec workflow.
      
      From the previous cover letter:
      
      ....
      Use case 3:
      With the madvise call sharing opportunities are only enabled for the
      current process: it is a workload-local decision. A considerable number
      of sharing opportunities may exist across multiple workloads or jobs
      (if they are part of the same security domain). Only a higler level
      entity like a job scheduler or container can know for certain if its
      running one or more instances of a job. That job scheduler however
      doesn't have the necessary internal workload knowledge to make targeted
      madvise calls.
      ....
      
      In addition it can also be a bit surprising that fork keeps the KSM
      setting and fork/exec does not.
      
      Link: https://lkml.kernel.org/r/20230922211141.320789-1-shr@devkernel.io
      Link: https://lkml.kernel.org/r/20230922211141.320789-2-shr@devkernel.ioSigned-off-by: default avatarStefan Roesch <shr@devkernel.io>
      Fixes: d7597f59 ("mm: add new api to enable ksm per process")
      Reviewed-by: default avatarDavid Hildenbrand <david@redhat.com>
      Reported-by: default avatarCarl Klemm <carl@uvos.xyz>
      Tested-by: default avatarCarl Klemm <carl@uvos.xyz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Rik van Riel <riel@surriel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      3c6f33b7
    • Huan Yang's avatar
      mm/damon/core: remove unnecessary si_meminfo invoke. · 987ffa5a
      Huan Yang authored
      si_meminfo() will read and assign more info not just free/ram pages.  For
      just DAMOS_WMARK_FREE_MEM_RATE use, only get free and ram pages is ok to
      save cpu.
      
      Link: https://lkml.kernel.org/r/20230920015727.4482-1-link@vivo.comSigned-off-by: default avatarHuan Yang <link@vivo.com>
      Reviewed-by: default avatarSeongJae Park <sj@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      987ffa5a