1. 13 Oct, 2022 7 commits
    • mm/memory.c: fix race when faulting a device private page · 16ce101d
      Alistair Popple authored
      Patch series "Fix several device private page reference counting issues",
      v2
      
      This series aims to fix a number of page reference counting issues in
      drivers dealing with device private ZONE_DEVICE pages.  These result in
      use-after-free type bugs, either from accessing a struct page which no
      longer exists because it has been removed or accessing fields within the
      struct page which are no longer valid because the page has been freed.
      
      During normal usage it is unlikely these will cause any problems.  However
      without these fixes it is possible to crash the kernel from userspace. 
      These crashes can be triggered either by unloading the kernel module or
      unbinding the device from the driver prior to a userspace task exiting. 
      In modules such as Nouveau it is also possible to trigger some of these
      issues by explicitly closing the device file-descriptor prior to the task
      exiting and then accessing device private memory.
      
      This involves some minor changes to both PowerPC and AMD GPU code. 
      Unfortunately I lack hardware to test either of those so any help there
      would be appreciated.  The changes mimic what is done for both Nouveau
      and hmm-tests though, so I doubt they will cause problems.
      
      
      This patch (of 8):
      
      When the CPU tries to access a device private page the migrate_to_ram()
      callback associated with the pgmap for the page is called.  However no
      reference is taken on the faulting page.  Therefore a concurrent migration
      of the device private page can free the page and possibly the underlying
      pgmap.  This results in a race which can crash the kernel due to the
      migrate_to_ram() function pointer becoming invalid.  It also means drivers
      can't reliably read the zone_device_data field because the page may have
      been freed with memunmap_pages().
      
      Close the race by getting a reference on the page while holding the ptl to
      ensure it has not been freed.  Unfortunately the elevated reference count
      will cause the migration required to handle the fault to fail.  To avoid
      this failure pass the faulting page into the migrate_vma functions so that
      if an elevated reference count is found it can be checked to see if it's
      expected or not.
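      
      A simplified sketch of the idea (illustrative, not the literal diff): in
      do_swap_page(), take the reference while the page table lock is still
      held, and only then invoke the driver callback:
      
        /* Sketch: pin the device private page under the PTL so a concurrent
         * migration cannot free the page (and, transitively, its pgmap)
         * while the driver's migrate_to_ram() callback runs.
         */
        get_page(vmf->page);
        pte_unmap_unlock(vmf->pte, vmf->ptl);
        ret = vmf->page->pgmap->ops->migrate_to_ram(vmf);
        put_page(vmf->page);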
      
      [mpe@ellerman.id.au: fix build]
        Link: https://lkml.kernel.org/r/87fsgbf3gh.fsf@mpe.ellerman.id.au
      Link: https://lkml.kernel.org/r/cover.60659b549d8509ddecafad4f498ee7f03bb23c69.1664366292.git-series.apopple@nvidia.com
      Link: https://lkml.kernel.org/r/d3e813178a59e565e8d78d9b9a4e2562f6494f90.1664366292.git-series.apopple@nvidia.com
      Signed-off-by: Alistair Popple <apopple@nvidia.com>
      Acked-by: Felix Kuehling <Felix.Kuehling@amd.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Lyude Paul <lyude@redhat.com>
      Cc: Alex Deucher <alexander.deucher@amd.com>
      Cc: Alex Sierra <alex.sierra@amd.com>
      Cc: Ben Skeggs <bskeggs@redhat.com>
      Cc: Christian König <christian.koenig@amd.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      16ce101d
    • mm/damon: use damon_sz_region() in appropriate place · ab63f63f
      Xin Hao authored
      In many places we can use damon_sz_region() instead of the open-coded
      "r->ar.end - r->ar.start".
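      
      For reference, the helper is simply a wrapper around that length
      computation, so call sites (for example the region merge logic) change
      along these lines (sketch only):
      
        /* damon_sz_region(): length of the address range covered by @r. */
        static inline unsigned long damon_sz_region(struct damon_region *r)
        {
                return r->ar.end - r->ar.start;
        }

        /* call sites then read: */
        sz = damon_sz_region(r);        /* instead of: r->ar.end - r->ar.start */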
      
      Link: https://lkml.kernel.org/r/20220927001946.85375-2-xhao@linux.alibaba.com
      Signed-off-by: Xin Hao <xhao@linux.alibaba.com>
      Suggested-by: SeongJae Park <sj@kernel.org>
      Reviewed-by: SeongJae Park <sj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      ab63f63f
    • mm/damon: move sz_damon_region to damon_sz_region · 652e0446
      Xin Hao authored
      Rename sz_damon_region() to damon_sz_region(), and move it to
      "include/linux/damon.h", because it can be used in many places.
      
      Link: https://lkml.kernel.org/r/20220927001946.85375-1-xhao@linux.alibaba.com
      Signed-off-by: Xin Hao <xhao@linux.alibaba.com>
      Suggested-by: SeongJae Park <sj@kernel.org>
      Reviewed-by: SeongJae Park <sj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      652e0446
    • lib/test_meminit: add checks for the allocation functions · ea091fa5
      Xiaoke Wang authored
      alloc_pages(), kmalloc() and vmalloc() are all memory allocation functions
      which can return NULL when an internal memory failure happens.  Check their
      return values so that such failures are caught in time and the tests can
      report them cleanly.
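      
      The added checks follow the usual early-bailout pattern; a hedged sketch
      (simplified, not the literal diff):
      
        /* Sketch: treat an allocation failure as a failed test case rather
         * than dereferencing a NULL pointer later on.
         */
        buf = kmalloc(size, alloc_flags);
        if (!buf) {
                pr_err("Could not allocate test buffer\n");
                return 1;       /* count this case as failed */
        }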
      
      Link: https://lkml.kernel.org/r/tencent_D44A49FFB420EDCCBFB9221C8D14DFE12908@qq.com
      Signed-off-by: Xiaoke Wang <xkernel.wang@foxmail.com>
      Reviewed-by: Alexander Potapenko <glider@google.com>
      Cc: Andrey Konovalov <andreyknvl@gmail.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Marco Elver <elver@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      ea091fa5
    • kmsan: unpoison @tlb in arch_tlb_gather_mmu() · ac801e7e
      Alexander Potapenko authored
      This is an optimization to reduce stackdepot pressure.
      
      struct mmu_gather contains 7 1-bit fields packed into a 32-bit unsigned
      int value.  The remaining 25 bits remain uninitialized and are never used,
      but KMSAN updates the origin for them in zap_pXX_range() in mm/memory.c,
      thus creating very long origin chains.  This is technically correct, but
      consumes too much memory.
      
      Unpoisoning the whole structure will prevent creating such chains.
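      
      The change boils down to one call in the mmu_gather setup path (sketch;
      kmsan_unpoison_memory() compiles away when KMSAN is disabled):
      
        /* Treat the whole mmu_gather as initialized so KMSAN stops tracking
         * origins for the unused padding bits of the packed bitfield word.
         */
        kmsan_unpoison_memory(tlb, sizeof(*tlb));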
      
      Link: https://lkml.kernel.org/r/20220905122452.2258262-20-glider@google.com
      Signed-off-by: Alexander Potapenko <glider@google.com>
      Acked-by: Marco Elver <elver@google.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Cc: Andrey Konovalov <andreyknvl@gmail.com>
      Cc: Andrey Konovalov <andreyknvl@google.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Eric Biggers <ebiggers@google.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Herbert Xu <herbert@gondor.apana.org.au>
      Cc: Ilya Leoshkevich <iii@linux.ibm.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Liu Shixin <liushixin2@huawei.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michael S. Tsirkin <mst@redhat.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Petr Mladek <pmladek@suse.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Vegard Nossum <vegard.nossum@oracle.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      ac801e7e
    • ext4,f2fs: fix readahead of verity data · 4fa0e3ff
      Matthew Wilcox (Oracle) authored
      The recent change of page_cache_ra_unbounded() arguments was buggy in the
      two callers, causing us to readahead the wrong pages.  Move the definition
      of ractl down to after the index is set correctly.  This affected
      performance on configurations that use fs-verity.
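      
      In other words, the readahead_control has to be built only after the
      index has been translated into the verity metadata area.  A simplified
      sketch of the corrected ordering for the ext4 side (assumed shape, not
      the literal diff):
      
        /* Compute the final page cache index first ... */
        index += ext4_verity_metadata_pos(inode) >> PAGE_SHIFT;

        page = find_get_page_flags(inode->i_mapping, index, FGP_ACCESSED);
        if (!page || !PageUptodate(page)) {
                /* ... and only then initialize the readahead_control with it. */
                DEFINE_READAHEAD(ractl, NULL, NULL, inode->i_mapping, index);
                page_cache_ra_unbounded(&ractl, num_ra_pages, 0);
        }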
      
      Link: https://lkml.kernel.org/r/20221012193419.1453558-1-willy@infradead.org
      Fixes: 73bb49da ("mm/readahead: make page_cache_ra_unbounded take a readahead_control")
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Reported-by: Jintao Yin <nicememory@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      4fa0e3ff
    • mm/mmap: undo ->mmap() when arch_validate_flags() fails · deb0f656
      Carlos Llamas authored
      Commit c462ac28 ("mm: Introduce arch_validate_flags()") added a late
      check in mmap_region() to let architectures validate vm_flags.  The check
      needs to happen after calling ->mmap() as the flags can potentially be
      modified during this callback.
      
      If the arch_validate_flags() check fails we unmap and free the vma.  However,
      the error path fails to undo the ->mmap() call that previously succeeded,
      and depending on the specific ->mmap() implementation this translates to
      reference increments, memory allocations and other operations that will
      not be cleaned up.
      
      There are several places (mainly device drivers) where this is an issue.
      However, one specific example is bpf_map_mmap() which keeps count of the
      mappings in map->writecnt.  The count is incremented on ->mmap() and then
      decremented on vm_ops->close().  When arch_validate_flags() fails this
      count is off since bpf_map_mmap_close() is never called.
      
      One can reproduce this issue in arm64 devices with MTE support.  Here the
      vm_flags are checked to only allow VM_MTE if VM_MTE_ALLOWED has been set
      previously.  From userspace it is then enough to pass the PROT_MTE flag to
      the mmap() syscall to trigger the arch_validate_flags() failure.
      
      The following program reproduces this issue:
      
        #include <stdio.h>
        #include <unistd.h>
        #include <linux/unistd.h>
        #include <linux/bpf.h>
        #include <sys/mman.h>
      
        int main(void)
        {
      	union bpf_attr attr = {
      		.map_type = BPF_MAP_TYPE_ARRAY,
      		.key_size = sizeof(int),
      		.value_size = sizeof(long long),
      		.max_entries = 256,
      		.map_flags = BPF_F_MMAPABLE,
      	};
      	int fd;
      
      	fd = syscall(__NR_bpf, BPF_MAP_CREATE, &attr, sizeof(attr));
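      	/* PROT_MTE is rejected by arch_validate_flags() unless the vma has
      	 * VM_MTE_ALLOWED; by then ->mmap() has already bumped map->writecnt */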
      	mmap(NULL, 4096, PROT_WRITE | PROT_MTE, MAP_SHARED, fd, 0);
      
      	return 0;
        }
      
      By manually adding some log statements to the vm_ops callbacks we can
      confirm that when passing PROT_MTE to mmap() the map->writecnt is off upon
      ->release():
      
      With PROT_MTE flag:
        root@debian:~# ./bpf-test
        [  111.263874] bpf_map_write_active_inc: map=9 writecnt=1
        [  111.288763] bpf_map_release: map=9 writecnt=1
      
      Without PROT_MTE flag:
        root@debian:~# ./bpf-test
        [  157.816912] bpf_map_write_active_inc: map=10 writecnt=1
        [  157.830442] bpf_map_write_active_dec: map=10 writecnt=0
        [  157.832396] bpf_map_release: map=10 writecnt=0
      
      This patch fixes the above issue by calling vm_ops->close() when the
      arch_validate_flags() check fails, after this we can proceed to unmap and
      free the vma on the error path.
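      
      A sketch of the resulting error handling in mmap_region() (simplified;
      the actual patch may structure this differently, e.g. via an error label):
      
        /* Sketch: undo a successful ->mmap() before unmapping and freeing
         * the vma when arch_validate_flags() rejects the final vm_flags.
         */
        if (!arch_validate_flags(vma->vm_flags)) {
                error = -EINVAL;
                if (vma->vm_ops && vma->vm_ops->close)
                        vma->vm_ops->close(vma);
                goto unmap_and_free_vma;
        }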
      
      Link: https://lkml.kernel.org/r/20220930003844.1210987-1-cmllamas@google.com
      Fixes: c462ac28 ("mm: Introduce arch_validate_flags()")
      Signed-off-by: Carlos Llamas <cmllamas@google.com>
      Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
      Acked-by: Andrii Nakryiko <andrii@kernel.org>
      Reviewed-by: Liam Howlett <liam.howlett@oracle.com>
      Cc: Christian Brauner (Microsoft) <brauner@kernel.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: <stable@vger.kernel.org>	[5.10+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      deb0f656
  2. 12 Oct, 2022 6 commits
  3. 07 Oct, 2022 5 commits
    • hugetlb: allocate vma lock for all sharable vmas · bbff39cc
      Mike Kravetz authored
      The hugetlb vma lock was originally designed to synchronize pmd sharing. 
      As such, it was only necessary to allocate the lock for vmas that were
      capable of pmd sharing.  Later in the development cycle, it was discovered
      that it could also be used to simplify fault/truncation races as described
      in [1].  However, a subsequent change to allocate the lock for all vmas
      that use the page cache was never made.  A fault/truncation race could
      leave pages in a file past i_size until the file is removed.
      
      Remove the previous restriction and allocate lock for all VM_MAYSHARE
      vmas.  Warn in the unlikely event of allocation failure.
      
      [1] https://lore.kernel.org/lkml/Yxiv0SkMkZ0JWGGp@monkey/#t
      
      Link: https://lkml.kernel.org/r/20221005011707.514612-4-mike.kravetz@oracle.com
      Fixes: "hugetlb: clean up code checking for fault/truncation races"
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: James Houghton <jthoughton@google.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
      Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Prakash Sangappa <prakash.sangappa@oracle.com>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      bbff39cc
    • hugetlb: take hugetlb vma_lock when clearing vma_lock->vma pointer · ecfbd733
      Mike Kravetz authored
      hugetlb file truncation/hole punch code may need to back out and reacquire
      locks in the correct order in the routine hugetlb_unmap_file_folio().  This
      code could race with vma freeing as pointed out in [1] and result in
      accessing a stale vma pointer.  To address this, take the vma_lock when
      clearing the vma_lock->vma pointer.
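      
      The shape of the change (sketch, using the hugetlb vma lock's rw_sema
      field):
      
        /* Sketch: clear the back-pointer only while holding the lock itself,
         * so hugetlb_unmap_file_folio() cannot race with vma freeing and
         * dereference a stale vma through vma_lock->vma.
         */
        down_write(&vma_lock->rw_sema);
        vma_lock->vma = NULL;
        up_write(&vma_lock->rw_sema);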
      
      [1] https://lore.kernel.org/linux-mm/01f10195-7088-4462-6def-909549c75ef4@huawei.com/
      
      [mike.kravetz@oracle.com: address build issues]
        Link: https://lkml.kernel.org/r/Yz5L1uxQYR1VqFtJ@monkey
      Link: https://lkml.kernel.org/r/20221005011707.514612-3-mike.kravetz@oracle.com
      Fixes: "hugetlb: use new vma_lock for pmd sharing synchronization"
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: James Houghton <jthoughton@google.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
      Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Prakash Sangappa <prakash.sangappa@oracle.com>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      ecfbd733
    • hugetlb: fix vma lock handling during split vma and range unmapping · 131a79b4
      Mike Kravetz authored
      Patch series "hugetlb: fixes for new vma lock series".
      
      In review of the series "hugetlb: Use new vma lock for huge pmd sharing
      synchronization", Miaohe Lin pointed out two key issues:
      
      1) There is a race in the routine hugetlb_unmap_file_folio when locks
         are dropped and reacquired in the correct order [1].
      
      2) With the switch to using vma lock for fault/truncate synchronization,
         we need to make sure lock exists for all VM_MAYSHARE vmas, not just
         vmas capable of pmd sharing.
      
      These two issues are addressed here.  In addition, having a vma lock
      present in all VM_MAYSHARE vmas, uncovered some issues around vma
      splitting.  Those are also addressed.
      
      [1] https://lore.kernel.org/linux-mm/01f10195-7088-4462-6def-909549c75ef4@huawei.com/
      
      
      This patch (of 3):
      
      The hugetlb vma lock hangs off the vm_private_data field and is specific
      to the vma.  When vm_area_dup() is called as part of vma splitting, the
      vma lock pointer is copied to the new vma.  This will result in issues
      such as double freeing of the structure.  Update the hugetlb open vm_ops
      to allocate a new vma lock for the new vma.
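      
      Roughly, the open vm_op gains logic of this shape (illustrative sketch;
      the real code also handles reservation maps and other details):
      
        /* If vm_area_dup() copied another vma's lock pointer, drop it and
         * allocate a separate lock for the new vma.
         */
        if ((vma->vm_flags & VM_MAYSHARE) && vma->vm_private_data) {
                struct hugetlb_vma_lock *vma_lock = vma->vm_private_data;

                if (vma_lock->vma != vma) {
                        vma->vm_private_data = NULL;
                        hugetlb_vma_lock_alloc(vma);
                }
        }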
      
      The routine __unmap_hugepage_range_final unconditionally unsets VM_MAYSHARE
      to prevent subsequent pmd sharing.  hugetlb_vma_lock_free attempted to
      anticipate this by checking both VM_MAYSHARE and VM_SHARED.  However, if
      only VM_MAYSHARE was set we would miss the free.  With the introduction of
      the vma lock, a vma cannot participate in pmd sharing if vm_private_data
      is NULL.  Instead of clearing VM_MAYSHARE in __unmap_hugepage_range_final,
      free the vma lock to prevent sharing.  Also, update the sharing code to
      make sure the vma lock is indeed a condition for pmd sharing.
      hugetlb_vma_lock_free can then key off VM_MAYSHARE and not miss any vmas.
      
      Link: https://lkml.kernel.org/r/20221005011707.514612-1-mike.kravetz@oracle.com
      Link: https://lkml.kernel.org/r/20221005011707.514612-2-mike.kravetz@oracle.com
      Fixes: "hugetlb: add vma based lock for pmd sharing"
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: James Houghton <jthoughton@google.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
      Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Prakash Sangappa <prakash.sangappa@oracle.com>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      131a79b4
    • mm/mglru: don't sync disk for each aging cycle · 14aa8b2d
      Yu Zhao authored
      wakeup_flusher_threads() was added under the assumption that if a system
      runs out of clean cold pages, it might want to write back dirty pages more
      aggressively so that they can become clean and be dropped.
      
      However, doing so can breach the rate limit a system wants to impose on
      writeback, resulting in early SSD wearout.
      
      Link: https://lkml.kernel.org/r/YzSiWq9UEER5LKup@google.com
      Fixes: bd74fdae ("mm: multi-gen LRU: support page table walks")
      Signed-off-by: Yu Zhao <yuzhao@google.com>
      Reported-by: Axel Rasmussen <axelrasmussen@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      14aa8b2d
  4. 03 Oct, 2022 22 commits
    • mm: memcontrol: drop dead CONFIG_MEMCG_SWAP config symbol · e55b9f96
      Johannes Weiner authored
      Since 2d1c4980 ("mm: memcontrol: make swap tracking an integral part
      of memory control"), CONFIG_MEMCG_SWAP hasn't been a user-visible config
      option anymore; it just means CONFIG_MEMCG && CONFIG_SWAP.
      
      Update the sites accordingly and drop the symbol.
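      
      The mechanical part of the conversion is simply (sketch):
      
        /* Code that used to be guarded by #ifdef CONFIG_MEMCG_SWAP is now
         * guarded by the two symbols it always implied.
         */
        #if defined(CONFIG_MEMCG) && defined(CONFIG_SWAP)
                /* swap accounting bits */
        #endif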
      
      [ While touching the docs, remove two references to CONFIG_MEMCG_KMEM,
        which hasn't been a user-visible symbol for over half a decade. ]
      
      Link: https://lkml.kernel.org/r/20220926135704.400818-5-hannes@cmpxchg.org
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Shakeel Butt <shakeelb@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      e55b9f96
    • mm: memcontrol: use do_memsw_account() in a few more places · b94c4e94
      Johannes Weiner authored
      It's slightly more descriptive and consistent with other places that
      distinguish cgroup1's combined memory+swap accounting scheme from
      cgroup2's dedicated swap accounting.
      
      Link: https://lkml.kernel.org/r/20220926135704.400818-4-hannes@cmpxchg.org
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Shakeel Butt <shakeelb@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      b94c4e94
    • mm: memcontrol: deprecate swapaccounting=0 mode · b25806dc
      Johannes Weiner authored
      The swapaccounting= commandline option already does very little today.  To
      close a trivial containment failure case, the swap ownership tracking part
      of the swap controller has recently become mandatory (see commit
      2d1c4980 ("mm: memcontrol: make swap tracking an integral part of
      memory control") for details), which makes up the majority of the work
      during swapout, swapin, and the swap slot map.
      
      The only thing left under this flag is the page_counter operations and the
      visibility of the swap control files in the first place, which are rather
      meager savings.  There also aren't many scenarios, if any, where
      controlling the memory of a cgroup while allowing it unlimited access to a
      global swap space is a workable resource isolation strategy.
      
      On the other hand, there have been several bugs and confusion around the
      many possible swap controller states (cgroup1 vs cgroup2 behavior, memory
      accounting without swap accounting, memcg runtime disabled).
      
      This puts the maintenance overhead of retaining the toggle above its
      practical benefits.  Deprecate it.
      
      Link: https://lkml.kernel.org/r/20220926135704.400818-3-hannes@cmpxchg.org
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Suggested-by: Shakeel Butt <shakeelb@google.com>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      b25806dc
    • mm: memcontrol: don't allocate cgroup swap arrays when memcg is disabled · c91bdc93
      Johannes Weiner authored
      Patch series "memcg swap fix & cleanups".
      
      
      This patch (of 4):
      
      Since commit 2d1c4980 ("mm: memcontrol: make swap tracking an integral
      part of memory control"), the cgroup swap arrays are used to track memory
      ownership at the time of swap readahead and swapoff, even if swap space
      *accounting* has been turned off by the user via swapaccount=0 (which sets
      cgroup_memory_noswap).
      
      However, the patch was overzealous: by simply dropping the
      cgroup_memory_noswap conditionals in the swapon, swapoff and uncharge
      paths, it caused the cgroup arrays to be allocated even when the memory
      controller as a whole is disabled.  This is a waste of that memory.
      
      Restore mem_cgroup_disabled() checks, implied previously by
      cgroup_memory_noswap, in the swapon, swapoff, and swap_entry_free
      callbacks.
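      
      The restored checks are simple early returns of this form (illustrative
      sketch, placed at the top of the affected callbacks):
      
        /* Skip the per-swapfile cgroup tracking array entirely when the
         * memory controller is disabled (cgroup_disable=memory).
         */
        if (mem_cgroup_disabled())
                return 0;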
      
      Link: https://lkml.kernel.org/r/20220926135704.400818-1-hannes@cmpxchg.org
      Link: https://lkml.kernel.org/r/20220926135704.400818-2-hannes@cmpxchg.org
      Fixes: 2d1c4980 ("mm: memcontrol: make swap tracking an integral part of memory control")
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reported-by: Hugh Dickins <hughd@google.com>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Hugh Dickins <hughd@google.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      c91bdc93
    • mm/secretmem: remove reduntant return value · f7c5b1aa
      Xiu Jianfeng authored
      The return value @ret is always 0, so remove it and return 0 directly.
      
      Link: https://lkml.kernel.org/r/20220920012205.246217-1-xiujianfeng@huawei.com
      Signed-off-by: Xiu Jianfeng <xiujianfeng@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      f7c5b1aa
    • mm/hugetlb: add available_huge_pages() func · 8346d69d
      Xin Hao authored
      In hugetlb.c there are several places which compare the values of
      'h->free_huge_pages' and 'h->resv_huge_pages'.  It looks a bit messy, so
      add a new available_huge_pages() function to do this.
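      
      The helper is presumably of this shape (sketch):
      
        /* True when there are free huge pages beyond the reserved ones. */
        static inline bool available_huge_pages(struct hstate *h)
        {
                return h->free_huge_pages - h->resv_huge_pages > 0;
        }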
      
      Link: https://lkml.kernel.org/r/20220922021929.98961-1-xhao@linux.alibaba.com
      Signed-off-by: Xin Hao <xhao@linux.alibaba.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: Muchun Song <songmuchun@bytedance.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      8346d69d
    • mm: remove unused inline functions from include/linux/mm_inline.h · 6b91e5df
      Gaosheng Cui authored
      Remove the following unused inline functions from mm_inline.h:
      
      1.  All uses of add_page_to_lru_list_tail() have been removed since
         commit 7a3dbfe8 ("mm/swap: convert lru_deactivate_file to a
         folio_batch"), and it can be replaced by lruvec_add_folio_tail().
      
      2.  All uses of __clear_page_lru_flags() have been removed since commit
         188e8cae ("mm/swap: convert __page_cache_release() to use a
         folio"), and it can be replaced by __folio_clear_lru_flags().
      
      They are useless, so remove them.
      
      Link: https://lkml.kernel.org/r/20220922110935.1495099-1-cuigaosheng1@huawei.com
      Signed-off-by: Gaosheng Cui <cuigaosheng1@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      6b91e5df
    • selftests/vm: add selftest for MADV_COLLAPSE of uffd-minor memory · 0f633baa
      Zach O'Keefe authored
      Add :collapse mod to userfaultfd selftest.  Currently this mod is only
      valid for "shmem" test type, but could be used for other test types.
      
      When provided, memory allocated by ->allocate_area() will be
      hugepage-aligned and enforced to be hugepage-sized.  userfaultfd_minor_test,
      after the UFFD-registered mapping has been populated by the UFFD minor fault
      handler, attempts to MADV_COLLAPSE the UFFD-registered mapping to collapse
      the memory into a pmd-mapped THP.
      
      This test is meant to be a functional test of what occurs during
      UFFD-driven live migration of VMs backed by huge tmpfs where, after a
      hugepage-sized region has been successfully migrated (in native page-sized
      chunks, to avoid the latency of fetching a hugepage over the network), we want
      to reclaim previous VM performance by remapping it at the PMD level.
      
      Link: https://lkml.kernel.org/r/20220907144521.3115321-11-zokeefe@google.com
      Link: https://lkml.kernel.org/r/20220922224046.1143204-11-zokeefe@google.com
      Signed-off-by: Zach O'Keefe <zokeefe@google.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Chris Kennelly <ckennelly@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: James Houghton <jthoughton@google.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Rongwei Wang <rongwei.wang@linux.alibaba.com>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Song Liu <songliubraving@fb.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      0f633baa
    • selftests/vm: add file/shmem MADV_COLLAPSE selftest for cleared pmd · 69d9428c
      Zach O'Keefe authored
      This test exercises MADV_COLLAPSE acting on file/shmem memory for which
      (1) the file extent mapped by the memory is already a huge page in the
      page cache, and (2) the pmd mapping this memory in the target process is
      none.
      
      In practice, (1)+(2) is the state left over after khugepaged has
      successfully collapsed file/shmem memory for a target VMA, but the memory
      has not yet been refaulted.  So, this test in effect tests MADV_COLLAPSE
      racing with khugepaged to collapse the memory first.
      
      Link: https://lkml.kernel.org/r/20220907144521.3115321-10-zokeefe@google.com
      Link: https://lkml.kernel.org/r/20220922224046.1143204-10-zokeefe@google.com
      Signed-off-by: Zach O'Keefe <zokeefe@google.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Chris Kennelly <ckennelly@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: James Houghton <jthoughton@google.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Rongwei Wang <rongwei.wang@linux.alibaba.com>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Song Liu <songliubraving@fb.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      69d9428c
    • selftests/vm: add thp collapse shmem testing · d0d35b60
      Zach O'Keefe authored
      Add memory operations for shmem (memfd) memory, and reuse existing tests
      with the new memory operations.
      
      Shmem tests can be called with the "shmem" mem_type, and shmem tests are run
      with the "all" mem_type as well.
      
      Link: https://lkml.kernel.org/r/20220907144521.3115321-9-zokeefe@google.com
      Link: https://lkml.kernel.org/r/20220922224046.1143204-9-zokeefe@google.com
      Signed-off-by: Zach O'Keefe <zokeefe@google.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Chris Kennelly <ckennelly@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: James Houghton <jthoughton@google.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Rongwei Wang <rongwei.wang@linux.alibaba.com>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Song Liu <songliubraving@fb.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      d0d35b60
    • selftests/vm: add thp collapse file and tmpfs testing · 1b03d0d5
      Zach O'Keefe authored
      Add memory operations for file-backed and tmpfs memory.  Call existing
      tests with these new memory operations to test collapse functionality of
      khugepaged and MADV_COLLAPSE on file-backed and tmpfs memory.  Not all
      tests are reusable; for example, collapse_swapin_single_pte() which checks
      swap usage.
      
      Refactor test arguments.  Usage is now:
      
      Usage: ./khugepaged <test type> [dir]
      
              <test type>     : <context>:<mem_type>
              <context>       : [all|khugepaged|madvise]
              <mem_type>      : [all|anon|file]
      
              "file,all" mem_type requires [dir] argument
      
              "file,all" mem_type requires kernel built with
              CONFIG_READ_ONLY_THP_FOR_FS=y
      
              if [dir] is a (sub)directory of a tmpfs mount, tmpfs must be
              mounted with huge=madvise option for khugepaged tests to work
      
      Refactor the calling of tests to make it clear what collapse context / memory
      operations they support, but only invoke tests requested by the user.  Also
      log what test is being run, and with what context / memory, to make test
      logs more human readable.
      
      A new test file is created and deleted for every test to ensure no pages
      remain in the page cache between tests (tests also may attempt to collapse
      different amounts of memory).
      
      For file-backed memory where the file is stored on a block device, disable
      /sys/block/<device>/queue/read_ahead_kb so that pages don't find their way
      into the page cache without the tests faulting them in.
      
      Add file and shmem wrappers to vm_util to check for file and shmem hugepages
      in smaps.
      
      [zokeefe@google.com: fix "add thp collapse file and tmpfs testing" for
        tmpfs]
        Link: https://lkml.kernel.org/r/20220913212517.3163701-1-zokeefe@google.com
      Link: https://lkml.kernel.org/r/20220907144521.3115321-8-zokeefe@google.com
      Link: https://lkml.kernel.org/r/20220922224046.1143204-8-zokeefe@google.com
      Signed-off-by: Zach O'Keefe <zokeefe@google.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Chris Kennelly <ckennelly@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: James Houghton <jthoughton@google.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Rongwei Wang <rongwei.wang@linux.alibaba.com>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Song Liu <songliubraving@fb.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      1b03d0d5
    • selftests/vm: modularize thp collapse memory operations · 8e638707
      Zach O'Keefe authored
      Modularize operations to setup, cleanup, fault, and check for huge pages,
      for a given memory type.  This allows reusing existing tests with
      additional memory types by defining new memory operations.  Following
      patches will add file and shmem memory types.
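      
      Conceptually, each memory type supplies a small operations table along
      these lines (illustrative only; the selftest's actual struct and
      signatures may differ):
      
        /* Per-memory-type hooks used by the generic collapse tests. */
        struct mem_ops {
                void *(*setup_area)(int nr_hpages);
                void (*cleanup_area)(void *p, unsigned long size);
                void (*fault)(void *p, unsigned long start, unsigned long end);
                bool (*check_huge)(void *addr, int nr_hpages);
        };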
      
      Link: https://lkml.kernel.org/r/20220907144521.3115321-7-zokeefe@google.com
      Link: https://lkml.kernel.org/r/20220922224046.1143204-7-zokeefe@google.com
      Signed-off-by: Zach O'Keefe <zokeefe@google.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Chris Kennelly <ckennelly@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: James Houghton <jthoughton@google.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Rongwei Wang <rongwei.wang@linux.alibaba.com>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Song Liu <songliubraving@fb.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      8e638707
    • selftests/vm: dedup THP helpers · c07c343c
      Zach O'Keefe authored
      These files:
      
      tools/testing/selftests/vm/vm_util.c
      tools/testing/selftests/vm/khugepaged.c
      
      Both contain logic to:
      
      1) Determine hugepage size on current system
      2) Read /proc/self/smaps to determine number of THPs at an address
      
      Refactor selftests/vm/khugepaged.c to use the vm_util common helpers and
      add it as a build dependency.
      
      Since selftests/vm/khugepaged.c is the largest user of check_huge(),
      change the signature of check_huge() to match selftests/vm/khugepaged.c's
      usage: take an expected number of hugepages, and return a bool indicating
      if the correct number of hugepages were found.  Add a wrapper,
      check_huge_anon(), in anticipation of checking smaps for file and shmem
      hugepages.
      
      Update existing callsites to use the new pattern / function.
      
      Likewise, check_for_pattern() was duplicated, and it's a general enough
      helper to include in vm_util helpers as well.
      
      Link: https://lkml.kernel.org/r/20220907144521.3115321-6-zokeefe@google.com
      Link: https://lkml.kernel.org/r/20220922224046.1143204-6-zokeefe@google.com
      Signed-off-by: Zach O'Keefe <zokeefe@google.com>
      Reviewed-by: Zi Yan <ziy@nvidia.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Chris Kennelly <ckennelly@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: James Houghton <jthoughton@google.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Rongwei Wang <rongwei.wang@linux.alibaba.com>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Song Liu <songliubraving@fb.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      c07c343c
    • mm/khugepaged: add tracepoint to hpage_collapse_scan_file() · d41fd201
      Zach O'Keefe authored
      Add huge_memory:trace_mm_khugepaged_scan_file tracepoint to
      hpage_collapse_scan_file() analogously to hpage_collapse_scan_pmd().
      
      While this change is targeted at debugging the MADV_COLLAPSE pathway, the
      "mm_khugepaged" prefix is retained for symmetry with
      huge_memory:trace_mm_khugepaged_scan_pmd, which retains its legacy name
      to avoid changing the kernel ABI as much as possible.
      
      Link: https://lkml.kernel.org/r/20220907144521.3115321-5-zokeefe@google.com
      Link: https://lkml.kernel.org/r/20220922224046.1143204-5-zokeefe@google.com
      Signed-off-by: Zach O'Keefe <zokeefe@google.com>
      Reviewed-by: Yang Shi <shy828301@gmail.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Chris Kennelly <ckennelly@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: James Houghton <jthoughton@google.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Rongwei Wang <rongwei.wang@linux.alibaba.com>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Song Liu <songliubraving@fb.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      d41fd201
    • mm/madvise: add file and shmem support to MADV_COLLAPSE · 34488399
      Zach O'Keefe authored
      Add support for MADV_COLLAPSE to collapse shmem-backed and file-backed
      memory into THPs (requires CONFIG_READ_ONLY_THP_FOR_FS=y).
      
      On success, the backing memory will be a hugepage.  For the memory range
      and process provided, the page tables will synchronously have a huge pmd
      installed, mapping the THP.  Other mappings of the file extent mapped by
      the memory range may be added to a set of entries that khugepaged will
      later process and attempt to update their page tables to map the THP by a
      pmd.
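      
      From userspace the call looks the same as for anonymous memory; a minimal
      sketch against an existing file-backed mapping (MADV_COLLAPSE may need to
      be defined by hand with older userspace headers; the value below is from
      the asm-generic uapi):
      
        #include <stdio.h>
        #include <sys/mman.h>

        #ifndef MADV_COLLAPSE
        #define MADV_COLLAPSE 25
        #endif

        /* Try to synchronously collapse an already-mapped, hugepage-aligned
         * file-backed range (e.g. program text) into pmd-mapped THPs.
         */
        static int collapse_range(void *addr, size_t len)
        {
                if (madvise(addr, len, MADV_COLLAPSE)) {
                        perror("madvise(MADV_COLLAPSE)");
                        return -1;
                }
                return 0;
        }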
      
      This functionality unlocks two important uses:
      
      (1)	Immediately back executable text by THPs.  Current support provided
      	by CONFIG_READ_ONLY_THP_FOR_FS may take a long time on a large
      	system which might impair services from serving at their full rated
      	load after (re)starting.  Tricks like mremap(2)'ing text onto
      	anonymous memory to immediately realize iTLB performance prevents
      	page sharing and demand paging, both of which increase steady state
      	memory footprint.  Now, we can have the best of both worlds: Peak
      	upfront performance and lower RAM footprints.
      
      (2)	userfaultfd-based live migration of virtual machines satisfy UFFD
      	faults by fetching native-sized pages over the network (to avoid
      	latency of transferring an entire hugepage).  However, after guest
      	memory has been fully copied to the new host, MADV_COLLAPSE can
      	be used to immediately increase guest performance.
      
      Since khugepaged is single threaded, this change now introduces the
      possibility of collapse contexts racing in the file collapse path.  There
      are a few important places to consider:
      
      (1)	hpage_collapse_scan_file(), when we xas_pause() and drop RCU.
      	We could have the memory collapsed out from under us, but
      	the next xas_for_each() iteration will correctly pick up the
      	hugepage.  The hugepage might not be up to date (insofar as
      	copying of small page contents might not have completed - the
      	page still may be locked), but regardless what small page index
      	we were iterating over, we'll find the hugepage and identify it
      	as a suitably aligned compound page of order HPAGE_PMD_ORDER.
      
      	In the khugepaged path, we locklessly check the value of the pmd,
      	and only add it to the deferred collapse array if we find a pmd
      	mapping a pte table.  This is fine, since other values that could
      	have raced in right afterwards denote failure, or that the
      	memory was successfully collapsed, so we don't need further
      	processing.
      
      	In the madvise path, we'll take mmap_lock in write mode to serialize
      	against page table updates and will know what to do based on the
      	true value of the pmd: recheck all ptes if we point to a pte table,
      	directly install the pmd if the pmd has been cleared but the memory
      	has not yet been faulted, or do nothing at all if we find a huge pmd.
      
      	It's worth putting emphasis here on how we treat the none pmd
      	here.  If khugepaged has processed this mm's page tables
      	already, it will have left the pmd cleared (ready for refault by
      	the process).  Depending on the VMA flags and sysfs settings,
      	amount of RAM on the machine, and the current load, could be a
      	relatively common occurrence - and as such is one we'd like to
      	handle successfully in MADV_COLLAPSE.  When we see the none pmd
      	in collapse_pte_mapped_thp(), we've locked mmap_lock in write
      	and checked (a) hugepage_vma_check() to see if the backing
      	memory is appropriate still, along with VMA sizing and
      	appropriate hugepage alignment within the file, and (b) we've
      	found a hugepage head of order HPAGE_PMD_ORDER at the offset
      	in the file mapped by our hugepage-aligned virtual address.
      	Even though the common case is likely a race with khugepaged,
      	given these checks (regardless of how we got here - we could be
      	operating on a completely different file than originally checked
      	in hpage_collapse_scan_file() for all we know) it should be safe
      	to directly make the pmd a huge pmd pointing to this hugepage.
      
      (2)	collapse_file() is mostly serialized on the same file extent by
      	lock sequence:
      
      		|	lock hugepage
      		|		lock mapping->i_pages
      		|			lock 1st page
      		|		unlock mapping->i_pages
      		|				<page checks>
      		|		lock mapping->i_pages
      		|				page_ref_freeze(3)
      		|				xas_store(hugepage)
      		|		unlock mapping->i_pages
      		|				page_ref_unfreeze(1)
      		|			unlock 1st page
      		V	unlock hugepage
      
      	Once a context (who already has their fresh hugepage locked)
      	locks mapping->i_pages exclusively, it will hold said lock
      	until it locks the first page, and it will hold that lock until
      	after the hugepage has been added to the page cache (and
      	will unlock the hugepage after page table update, though that
      	isn't important here).
      
      	A racing context that loses the race for mapping->i_pages will
      	then lose the race to locking the first page.  Here - depending
      	on how far the other racing context has gotten - we might find
      	the new hugepage (in which case we'll exit cleanly when we
      	check PageTransCompound()), or we'll find the "old" 1st small
      	page (in which case we'll exit cleanly when we discover an unexpected
      	refcount of 2 after isolate_lru_page()).  This is assuming we
      	are able to successfully lock the page we find - in the shmem path,
      	we could just fail the trylock and exit cleanly anyway.
      
      	Failure path in collapse_file() is similar: once we hold lock
      	on 1st small page, we are serialized against other collapse
      	contexts.  Before the 1st small page is unlocked, we add it
      	back to the pagecache and unfreeze the refcount appropriately.
      	Contexts who lost the race to the 1st small page will then find
      	the same 1st small page with the correct refcount and will be
      	able to proceed.
      
      [zokeefe@google.com: don't check pmd value twice in collapse_pte_mapped_thp()]
        Link: https://lkml.kernel.org/r/20220927033854.477018-1-zokeefe@google.com
      [shy828301@gmail.com: Delete hugepage_vma_revalidate_anon(), remove
      	check for multi-add in khugepaged_add_pte_mapped_thp()]
        Link: https://lore.kernel.org/linux-mm/CAHbLzkrtpM=ic7cYAHcqkubah5VTR8N5=k5RT8MTvv5rN1Y91w@mail.gmail.com/
      Link: https://lkml.kernel.org/r/20220907144521.3115321-4-zokeefe@google.com
      Link: https://lkml.kernel.org/r/20220922224046.1143204-4-zokeefe@google.com
      Signed-off-by: Zach O'Keefe <zokeefe@google.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Chris Kennelly <ckennelly@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: James Houghton <jthoughton@google.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Rongwei Wang <rongwei.wang@linux.alibaba.com>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Song Liu <songliubraving@fb.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      34488399
    • mm/khugepaged: attempt to map file/shmem-backed pte-mapped THPs by pmds · 58ac9a89
      Zach O'Keefe authored
      The main benefit of THPs is that they can be mapped at the pmd level,
      increasing the likelihood of a TLB hit and spending fewer cycles in page
      table walks.  pte-mapped hugepages - that is - hugepage-aligned compound
      pages of order HPAGE_PMD_ORDER mapped by ptes - although being contiguous
      in physical memory, don't have this advantage.  In fact, one could argue
      they are detrimental to system performance overall since they occupy a
      precious hugepage-aligned/sized region of physical memory that could
      otherwise be used more effectively.  Additionally, pte-mapped hugepages
      can be the cheapest memory to collapse for khugepaged since no new
      hugepage allocation or copying of memory contents is necessary - we only
      need to update the mapping page tables.
      
      In the anonymous collapse path, we are able to collapse pte-mapped
      hugepages (albeit, perhaps suboptimally), but the file/shmem path makes no
      effort when compound pages (of any order) are encountered.
      
      Identify pte-mapped hugepages in the file/shmem collapse path.  The
      final step of which makes a racy check of the value of the pmd to
      ensure it maps a pte table.  This should be fine, since races that
      result in false-positive (i.e.  attempt collapse even though we
      shouldn't) will fail later in collapse_pte_mapped_thp() once we
      actually lock mmap_lock and reinspect the pmd value.  Races that result
      in false-negatives (i.e.  where we decide to not attempt collapse, but
      should have) shouldn't be an issue, since in the worst case, we do
      nothing - which is what we've done up to this point.  We make a similar
      check in retract_page_tables().  If we do think we've found a
      pte-mapped hugepage in khugepaged context, attempt to update the page
      tables mapping this hugepage.
      
      Note that these collapses still count towards the
      /sys/kernel/mm/transparent_hugepage/khugepaged/pages_collapsed counter,
      and if the pte-mapped hugepage was also mapped into multiple processes'
      address spaces, it could be incremented for each page table update.  Since we
      increment the counter when a pte-mapped hugepage is successfully added to
      the list of to-collapse pte-mapped THPs, it's possible that we never
      actually update the page table either.  This is different from how
      file/shmem pages_collapsed accounting works today where only a successful
      page cache update is counted (it's also possible here that no page tables
      are actually changed).  Though it incurs some slop, this is preferred to
      either not accounting for the event at all, or plumbing through data in
      struct mm_slot on whether to account for the collapse or not.
      
      Also note that work still needs to be done to support arbitrary compound
      pages, and that this should all be converted to using folios.
      
      [shy828301@gmail.com: Spelling mistake, update comment, and add Documentation]
        Link: https://lore.kernel.org/linux-mm/CAHbLzkpHwZxFzjfX9nxVoRhzup8WMjMfyL6Xiq8mZ9M-N3ombw@mail.gmail.com/
      Link: https://lkml.kernel.org/r/20220907144521.3115321-3-zokeefe@google.com
      Link: https://lkml.kernel.org/r/20220922224046.1143204-3-zokeefe@google.com
      Signed-off-by: Zach O'Keefe <zokeefe@google.com>
      Reviewed-by: Yang Shi <shy828301@gmail.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Chris Kennelly <ckennelly@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: James Houghton <jthoughton@google.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Rongwei Wang <rongwei.wang@linux.alibaba.com>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Song Liu <songliubraving@fb.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      58ac9a89
    • mm/shmem: add flag to enforce shmem THP in hugepage_vma_check() · 7c6c6cc4
      Zach O'Keefe authored
      Patch series "mm: add file/shmem support to MADV_COLLAPSE", v4.
      
      This series builds on top of the previous "mm: userspace hugepage
      collapse" series which introduced the MADV_COLLAPSE madvise mode and added
      support for private, anonymous mappings[2], by adding support for file and
      shmem backed memory to CONFIG_READ_ONLY_THP_FOR_FS=y kernels.
      
      File and shmem support have been added with effort to align with existing
      MADV_COLLAPSE semantics and policy decisions[3].  Collapse of shmem-backed
      memory ignores kernel-guiding directives and heuristics including all
      sysfs settings (transparent_hugepage/shmem_enabled), and tmpfs huge= mount
      options (shmem always supports large folios).  Like anonymous mappings, on
      successful return of MADV_COLLAPSE on file/shmem memory, the contents of
      memory mapped by the addresses provided will be synchronously pmd-mapped
      THPs.
      
      This functionality unlocks two important uses:
      
      (1)	Immediately back executable text by THPs.  Current support provided
      	by CONFIG_READ_ONLY_THP_FOR_FS may take a long time on a large
      	system which might impair services from serving at their full rated
      	load after (re)starting.  Tricks like mremap(2)'ing text onto
      	anonymous memory to immediately realize iTLB performance prevents
      	page sharing and demand paging, both of which increase steady state
      	memory footprint.  Now, we can have the best of both worlds: Peak
      	upfront performance and lower RAM footprints.
      
      (2)  userfaultfd-based live migration of virtual machines satisfies
           UFFD faults by fetching native-sized pages over the network (to
           avoid the latency of transferring an entire hugepage).  However,
           after guest memory has been fully copied to the new host,
           MADV_COLLAPSE can be used to immediately increase guest
           performance (a minimal userspace sketch follows below).
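      
      A rough userspace sketch of the post-copy step in use (2) follows.  It
      is not code from this series; MADV_COLLAPSE needs a kernel carrying
      these patches, and the fallback #define below is an assumption for
      older uapi headers:
      
         #include <errno.h>
         #include <stdio.h>
         #include <string.h>
         #include <sys/mman.h>
      
         #ifndef MADV_COLLAPSE
         #define MADV_COLLAPSE 25   /* assumed value for older uapi headers */
         #endif
      
         int main(void)
         {
            size_t len = 16UL * 1024 * 1024;   /* stand-in for guest RAM */
            char *guest = mmap(NULL, len, PROT_READ | PROT_WRITE,
                               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            if (guest == MAP_FAILED)
               return 1;
      
            /* Stage 1: guest memory arrives page-by-page (UFFD minor faults
             * in the real flow); here we just touch it to simulate the copy. */
            memset(guest, 0, len);
      
            /* Stage 2: once fully copied, ask the kernel to back the range
             * with pmd-mapped THPs right away. */
            if (madvise(guest, len, MADV_COLLAPSE))
               fprintf(stderr, "MADV_COLLAPSE: %s\n", strerror(errno));
            return 0;
         }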
      
      khugepaged has received a small improvement by association and can now
      detect and collapse pte-mapped THPs.  However, there is still work to be
      done along the file collapse path.  Compound pages of arbitrary order
      still need to be supported, and THP collapse needs to be converted to
      use folios in general.  Eventually, we'd like to move away from the
      read-only and executable-mapped constraints currently imposed on eligible
      files and support any inode claiming huge folio support.  That said, I
      think the series as-is covers enough to claim that MADV_COLLAPSE supports
      file/shmem memory.
      
      Patches 1-3	Implement the guts of the series.
      Patch 4 	Adds a tracepoint for debugging.
      Patches 5-9 	Refactor existing khugepaged selftests to work with new
      		memory types + new collapse tests.
      Patch 10 	Adds a userfaultfd selftest mode to mimic a functional test
      		of UFFDIO_REGISTER_MODE_MINOR+MADV_COLLAPSE live migration.
      		(v4 note: "userfaultfd shmem" selftest is failing as of
      		Sep 22 mm-unstable)
      
      [1] https://lore.kernel.org/linux-mm/YyiK8YvVcrtZo0z3@google.com/
      [2] https://lore.kernel.org/linux-mm/20220706235936.2197195-1-zokeefe@google.com/
      [3] https://lore.kernel.org/linux-mm/YtBmhaiPHUTkJml8@google.com/
      [4] https://lore.kernel.org/linux-mm/20220922222731.1124481-1-zokeefe@google.com/
      [5] https://lore.kernel.org/linux-mm/20220922184651.1016461-1-zokeefe@google.com/
      
      
      This patch (of 10):
      
      Extend 'mm/thp: add flag to enforce sysfs THP in hugepage_vma_check()' to
      shmem, allowing callers to ignore
      /sys/kernel/mm/transparent_hugepage/shmem_enabled and the tmpfs huge=
      mount option.
      
      This is intended to be used by MADV_COLLAPSE, and the rationale is
      analogous to the anon/file case: MADV_COLLAPSE is not coupled to
      directives that advise the kernel's decisions on when THPs should be
      considered eligible.  shmem/tmpfs always claims large folio support,
      regardless of sysfs or mount options.
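      
      As a minimal sketch of the shape of this change (placeholder names and
      a hypothetical shmem_policy_allows_huge() helper, not the upstream
      diff), the new flag simply short-circuits the policy checks:
      
         /* Sketch only: a force flag that bypasses shmem THP policy. */
         static bool shmem_huge_allowed(struct vm_area_struct *vma,
                                        bool shmem_huge_force)
         {
                 /* MADV_COLLAPSE: ignore shmem_enabled sysfs and huge= mount. */
                 if (shmem_huge_force)
                         return true;
      
                 /* Otherwise honour the configured policy (hypothetical helper). */
                 return shmem_policy_allows_huge(vma);
         }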
      
      [shy828301@gmail.com: test shmem_huge_force explicitly]
        Link: https://lore.kernel.org/linux-mm/CAHbLzko3A5-TpS0BgBeKkx5cuOkWgLvWXQH=TdgW-baO4rPtdg@mail.gmail.com/
      Link: https://lkml.kernel.org/r/20220922224046.1143204-1-zokeefe@google.com
      Link: https://lkml.kernel.org/r/20220907144521.3115321-2-zokeefe@google.com
      Link: https://lkml.kernel.org/r/20220922224046.1143204-2-zokeefe@google.com
      Signed-off-by: default avatarZach O'Keefe <zokeefe@google.com>
      Reviewed-by: default avatarYang Shi <shy828301@gmail.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Chris Kennelly <ckennelly@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: James Houghton <jthoughton@google.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Rongwei Wang <rongwei.wang@linux.alibaba.com>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Song Liu <songliubraving@fb.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      7c6c6cc4
    • Zach O'Keefe's avatar
      selftests/vm: retry on EAGAIN for MADV_COLLAPSE selftest · 3505c8e6
      Zach O'Keefe authored
      MADV_COLLAPSE is a best-effort request that will set errno to an
      actionable value if the request cannot be performed.
      
      For example, if pages are not found on the LRU, or if they are currently
      locked by something else, MADV_COLLAPSE will fail and set errno to EAGAIN
      to inform callers that they may try again.
      
      Since the khugepaged selftest is the first public use of MADV_COLLAPSE,
      set a best practice of checking errno and retrying on EAGAIN.
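      
      A hedged sketch of that practice (not the selftest code itself; the
      bounded retry count and the MADV_COLLAPSE fallback define are
      assumptions):
      
         #include <errno.h>
         #include <sys/mman.h>
      
         #ifndef MADV_COLLAPSE
         #define MADV_COLLAPSE 25   /* assumed value for older uapi headers */
         #endif
      
         /* Retry MADV_COLLAPSE a bounded number of times while the kernel
          * reports EAGAIN (e.g. page temporarily off the LRU or locked). */
         static int collapse_with_retry(void *addr, size_t len, int max_tries)
         {
            int ret;
      
            do {
               ret = madvise(addr, len, MADV_COLLAPSE);
            } while (ret && errno == EAGAIN && --max_tries > 0);
      
            return ret;
         }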
      
      Link: https://lkml.kernel.org/r/20220922184651.1016461-2-zokeefe@google.com
      Fixes: 9330694d ("selftests/vm: add MADV_COLLAPSE collapse context to selftests")
      Signed-off-by: default avatarZach O'Keefe <zokeefe@google.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Chris Kennelly <ckennelly@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: James Houghton <jthoughton@google.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Rongwei Wang <rongwei.wang@linux.alibaba.com>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Song Liu <songliubraving@fb.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      3505c8e6
    • Zach O'Keefe's avatar
      mm/madvise: MADV_COLLAPSE return EAGAIN when page cannot be isolated · 0f3e2a2c
      Zach O'Keefe authored
      MADV_COLLAPSE is a best-effort request that attempts to set an actionable
      errno value if the request cannot be fulfilled at the time.  EAGAIN should
      be used to communicate that a resource was temporarily unavailable, but
      that the user may try again immediately.
      
      SCAN_DEL_PAGE_LRU is an internal result code used when a page cannot be
      isolated from its LRU list.  Since this, like SCAN_PAGE_LRU, is likely a
      transitory state, make MADV_COLLAPSE return EAGAIN so that users know they
      may reattempt the operation.
      
      Another important scenario to consider is a race with khugepaged:
      khugepaged might isolate a page while MADV_COLLAPSE is interested in it.
      Even though racing with khugepaged might mean that the memory has already
      been collapsed, signalling an errno that is not intrinsic to that memory
      or to the arguments provided to madvise(2) lets the user know that future
      attempts might (and in this case likely would) succeed, and avoids
      false-negative assumptions by the user.
      
      Link: https://lkml.kernel.org/r/20220922184651.1016461-1-zokeefe@google.com
      Fixes: 7d8faaf1 ("mm/madvise: introduce MADV_COLLAPSE sync hugepage collapse")
      Signed-off-by: default avatarZach O'Keefe <zokeefe@google.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Chris Kennelly <ckennelly@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: James Houghton <jthoughton@google.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Rongwei Wang <rongwei.wang@linux.alibaba.com>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Song Liu <songliubraving@fb.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      0f3e2a2c
    • Zach O'Keefe's avatar
      mm/khugepaged: check compound_order() in collapse_pte_mapped_thp() · 780a4b6f
      Zach O'Keefe authored
      By the time we lock a page in collapse_pte_mapped_thp(), the page mapped
      by the address pushed onto the slot's .pte_mapped_thp[] array might have
      changed arbitrarily since we last looked at it.  We revalidate that the
      page is still the head of a compound page, but we don't revalidate whether
      the compound page is of order HPAGE_PMD_ORDER before applying rmap and
      page table updates.
      
      Since the kernel now supports large folios of arbitrary order, and since
      replacing a page's pte mappings with a pmd mapping only makes sense for
      compound pages of order HPAGE_PMD_ORDER, revalidate that the compound
      page is indeed of order HPAGE_PMD_ORDER before proceeding.
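      
      A minimal sketch of the revalidation (simplified helper form; the
      function name is an assumption, and the actual patch folds the check
      into collapse_pte_mapped_thp() under the page lock):
      
         /* Only a compound head of order HPAGE_PMD_ORDER is a valid target
          * for replacing pte mappings with a pmd mapping. */
         static bool pte_mapped_thp_still_valid(struct page *page)
         {
                 return PageHead(page) && compound_order(page) == HPAGE_PMD_ORDER;
         }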
      
      Link: https://lore.kernel.org/linux-mm/CAHbLzkon+2ky8v9ywGcsTUgXM_B35jt5NThYqQKXW2YV_GUacw@mail.gmail.com/
      Link: https://lkml.kernel.org/r/20220922222731.1124481-1-zokeefe@google.com
      Signed-off-by: default avatarZach O'Keefe <zokeefe@google.com>
      Suggested-by: default avatarYang Shi <shy828301@gmail.com>
      Reviewed-by: default avatarYang Shi <shy828301@gmail.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Chris Kennelly <ckennelly@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: James Houghton <jthoughton@google.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Rongwei Wang <rongwei.wang@linux.alibaba.com>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Song Liu <songliubraving@fb.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      780a4b6f
    • Liu Shixin's avatar
      mm: hugetlb: fix UAF in hugetlb_handle_userfault · 958f32ce
      Liu Shixin authored
      The vma_lock and hugetlb_fault_mutex are dropped before handling the
      userfault and reacquired after handle_userfault(), but reacquiring the
      vma_lock can lead to a UAF[1,2] due to the following race:
      
      hugetlb_fault
        hugetlb_no_page
          /*unlock vma_lock */
          hugetlb_handle_userfault
            handle_userfault
              /* unlock mm->mmap_lock*/
                                                 vm_mmap_pgoff
                                                   do_mmap
                                                     mmap_region
                                                       munmap_vma_range
                                                         /* clean old vma */
              /* lock vma_lock again  <--- UAF */
          /* unlock vma_lock */
      
      Since the vma_lock is unlocked immediately after
      hugetlb_handle_userfault() anyway, drop the unneeded lock and unlock
      inside hugetlb_handle_userfault() to fix the issue.
      
      [1] https://lore.kernel.org/linux-mm/000000000000d5e00a05e834962e@google.com/
      [2] https://lore.kernel.org/linux-mm/20220921014457.1668-1-liuzixian4@huawei.com/
      Link: https://lkml.kernel.org/r/20220923042113.137273-1-liushixin2@huawei.com
      Fixes: 1a1aad8a ("userfaultfd: hugetlbfs: add userfaultfd hugetlb hook")
      Signed-off-by: default avatarLiu Shixin <liushixin2@huawei.com>
      Signed-off-by: default avatarKefeng Wang <wangkefeng.wang@huawei.com>
      Reported-by: syzbot+193f9cee8638750b23cf@syzkaller.appspotmail.com
      Reported-by: default avatarLiu Zixian <liuzixian4@huawei.com>
      Reviewed-by: default avatarMike Kravetz <mike.kravetz@oracle.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
      Cc: <stable@vger.kernel.org>	[4.14+]
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      958f32ce
    • Kairui Song's avatar
      mm: memcontrol: make cgroup_memory_noswap a static key · c1b8fdae
      Kairui Song authored
      cgroup_memory_noswap is used in many hot paths, so make it a static key
      to lower the kernel overhead.
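      
      For reference, the general shape of such a conversion (placeholder key
      name; not the exact symbols introduced by this patch) replaces a bool
      load in the hot path with a patched jump via the jump-label API:
      
         #include <linux/jump_label.h>
      
         /* Placeholder: swap accounting defaults to enabled. */
         static DEFINE_STATIC_KEY_TRUE(memcg_swap_key);
      
         /* One-time setup, e.g. from kernel cmdline parsing. */
         static void memcg_swap_setup(bool noswap)
         {
                 if (noswap)
                         static_branch_disable(&memcg_swap_key);
         }
      
         /* Hot-path check: compiles to a patched NOP/JMP, no memory load. */
         static inline bool memcg_swap_enabled(void)
         {
                 return static_branch_likely(&memcg_swap_key);
         }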
      
      Using 8G of ZRAM as SWAP, benchmarked with `perf stat -d -d -d --repeat 100`
      running the following code snippet in a non-root cgroup:
      
         #include <stdio.h>
         #include <string.h>
         #include <linux/mman.h>
         #include <sys/mman.h>
      
         #define MB (1024UL * 1024UL)
      
         int main(int argc, char **argv)
         {
            /* Dirty 8000MB of anonymous memory, push it out to swap (ZRAM)
             * with MADV_PAGEOUT, then dirty it again to force swap-in. */
            void *p = mmap(NULL, 8000 * MB, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            if (p == MAP_FAILED)
               return 1;
            memset(p, 0xff, 8000 * MB);
            madvise(p, 8000 * MB, MADV_PAGEOUT);
            memset(p, 0xff, 8000 * MB);
            return 0;
         }
      
      Before:
                7,021.43 msec task-clock                #    0.967 CPUs utilized            ( +-  0.03% )
                   4,010      context-switches          #  573.853 /sec                     ( +-  0.01% )
                       0      cpu-migrations            #    0.000 /sec
               2,052,057      page-faults               #  293.661 K/sec                    ( +-  0.00% )
          12,616,546,027      cycles                    #    1.805 GHz                      ( +-  0.06% )  (39.92%)
             156,823,666      stalled-cycles-frontend   #    1.25% frontend cycles idle     ( +-  0.10% )  (40.25%)
             310,130,812      stalled-cycles-backend    #    2.47% backend cycles idle      ( +-  4.39% )  (40.73%)
          18,692,516,591      instructions              #    1.49  insn per cycle
                                                        #    0.01  stalled cycles per insn  ( +-  0.04% )  (40.75%)
           4,907,447,976      branches                  #  702.283 M/sec                    ( +-  0.05% )  (40.30%)
              13,002,578      branch-misses             #    0.26% of all branches          ( +-  0.08% )  (40.48%)
           7,069,786,296      L1-dcache-loads           #    1.012 G/sec                    ( +-  0.03% )  (40.32%)
             649,385,847      L1-dcache-load-misses     #    9.13% of all L1-dcache accesses  ( +-  0.07% )  (40.10%)
           1,485,448,688      L1-icache-loads           #  212.576 M/sec                    ( +-  0.15% )  (39.49%)
              31,628,457      L1-icache-load-misses     #    2.13% of all L1-icache accesses  ( +-  0.40% )  (39.57%)
               6,667,311      dTLB-loads                #  954.129 K/sec                    ( +-  0.21% )  (39.50%)
               5,668,555      dTLB-load-misses          #   86.40% of all dTLB cache accesses  ( +-  0.12% )  (39.03%)
                     765      iTLB-loads                #  109.476 /sec                     ( +- 21.81% )  (39.44%)
               4,370,351      iTLB-load-misses          # 214320.09% of all iTLB cache accesses  ( +-  1.44% )  (39.86%)
             149,207,254      L1-dcache-prefetches      #   21.352 M/sec                    ( +-  0.13% )  (40.27%)
      
                 7.25869 +- 0.00203 seconds time elapsed  ( +-  0.03% )
      
      After:
                6,576.16 msec task-clock                #    0.953 CPUs utilized            ( +-  0.10% )
                   4,020      context-switches          #  605.595 /sec                     ( +-  0.01% )
                       0      cpu-migrations            #    0.000 /sec
               2,052,056      page-faults               #  309.133 K/sec                    ( +-  0.00% )
          11,967,619,180      cycles                    #    1.803 GHz                      ( +-  0.36% )  (38.76%)
             161,259,240      stalled-cycles-frontend   #    1.38% frontend cycles idle     ( +-  0.27% )  (36.58%)
             253,605,302      stalled-cycles-backend    #    2.16% backend cycles idle      ( +-  4.45% )  (34.78%)
          19,328,171,892      instructions              #    1.65  insn per cycle
                                                        #    0.01  stalled cycles per insn  ( +-  0.10% )  (31.46%)
           5,213,967,902      branches                  #  785.461 M/sec                    ( +-  0.18% )  (30.68%)
              12,385,170      branch-misses             #    0.24% of all branches          ( +-  0.26% )  (34.13%)
           7,271,687,822      L1-dcache-loads           #    1.095 G/sec                    ( +-  0.12% )  (35.29%)
             649,873,045      L1-dcache-load-misses     #    8.93% of all L1-dcache accesses  ( +-  0.11% )  (41.41%)
           1,950,037,608      L1-icache-loads           #  293.764 M/sec                    ( +-  0.33% )  (43.11%)
              31,365,566      L1-icache-load-misses     #    1.62% of all L1-icache accesses  ( +-  0.39% )  (45.89%)
               6,767,809      dTLB-loads                #    1.020 M/sec                    ( +-  0.47% )  (48.42%)
               6,339,590      dTLB-load-misses          #   95.43% of all dTLB cache accesses  ( +-  0.50% )  (46.60%)
                     736      iTLB-loads                #  110.875 /sec                     ( +-  1.79% )  (48.60%)
               4,314,836      iTLB-load-misses          # 518653.73% of all iTLB cache accesses  ( +-  0.63% )  (42.91%)
             144,950,156      L1-dcache-prefetches      #   21.836 M/sec                    ( +-  0.37% )  (41.39%)
      
                 6.89935 +- 0.00703 seconds time elapsed  ( +-  0.10% )
      
      The performance is clearly better.  There is no single dominant hotspot
      improvement according to perf report, since there are quite a few
      callers of memcg_swap_enabled and do_memsw_account (which calls
      memcg_swap_enabled); many small optimizations add up to lower overhead
      for the branch predictor and better overall performance.
      
      Link: https://lkml.kernel.org/r/20220919180634.45958-3-ryncsn@gmail.com
      Signed-off-by: default avatarKairui Song <kasong@tencent.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Acked-by: default avatarShakeel Butt <shakeelb@google.com>
      Acked-by: default avatarRoman Gushchin <roman.gushchin@linux.dev>
      Acked-by: default avatarMuchun Song <songmuchun@bytedance.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      c1b8fdae