1. 24 Feb, 2024 10 commits
    • mm,page_owner: maintain own list of stack_records structs · 4bedfb31
      Oscar Salvador authored
      page_owner needs to increment a stack_record refcount when a new
      allocation occurs, and decrement it on a free operation.  In order to do
      that, we need to have a way to get a stack_record from a handle.
      Implement __stack_depot_get_stack_record(), which does just that, and make
      it public so page_owner can use it.

      Also, traversing all stackdepot buckets comes with its own complexity,
      plus we would have to implement a way to mark only those stack_records
      that originated from page_owner, as those are the ones we are
      interested in.  For that reason, page_owner maintains its own list of
      stack_records, because traversing that list is faster than traversing all
      buckets while at the same time keeping complexity low.

      For now, add to stack_list only the stack_records of dummy_handle and
      failure_handle, and set their refcount to 1.

      Further patches will add code to increment or decrement the
      stack_records' count on allocation and free operations.
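
      The idea of a private list can be sketched in user-space C as follows.
      This is only an illustration, not the kernel code: the struct layouts,
      the plain int counter and the helper names add_stack_record_to_list()
      and count_listed_records() are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical user-space stand-in for the kernel's struct stack_record. */
struct stack_record {
	unsigned int handle;	/* non-zero identifier of the stack trace */
	int count;		/* outstanding allocations (a plain int in this sketch) */
};

/* One node of the private list kept by the tracker. */
struct stack {
	struct stack_record *rec;
	struct stack *next;
};

static struct stack *stack_list;	/* head of the private list */

/* Link a record at the head of the private list; 0 on success, -1 on OOM. */
static int add_stack_record_to_list(struct stack_record *rec)
{
	struct stack *node;

	assert(rec != NULL);
	node = malloc(sizeof(*node));
	if (!node)
		return -1;
	node->rec = rec;
	node->next = stack_list;
	stack_list = node;
	return 0;
}

/* Walk only the private list instead of every depot bucket. */
static size_t count_listed_records(void)
{
	size_t n = 0;

	for (const struct stack *s = stack_list; s; s = s->next)
		n++;
	return n;
}
```

      Registering two records whose count starts at 1, in the way the message
      describes for dummy_handle and failure_handle, leaves a two-element
      list that can be walked without touching any other depot entry.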
      
      Link: https://lkml.kernel.org/r/20240215215907.20121-4-osalvador@suse.de
      Signed-off-by: Oscar Salvador <osalvador@suse.de>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: Marco Elver <elver@google.com>
      Acked-by: Andrey Konovalov <andreyknvl@gmail.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • lib/stackdepot: move stack_record struct definition into the header · 8151c7a3
      Oscar Salvador authored
      In order to move the heavy lifting into page_owner code, this one needs to
      have access to the stack_record structure, which right now sits in
      lib/stackdepot.c.  Move it to the stackdepot.h header so page_owner can
      access stack_record's struct fields.
      
      Link: https://lkml.kernel.org/r/20240215215907.20121-3-osalvador@suse.de
      Signed-off-by: Oscar Salvador <osalvador@suse.de>
      Reviewed-by: Marco Elver <elver@google.com>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Andrey Konovalov <andreyknvl@gmail.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • lib/stackdepot: fix first entry having a 0-handle · 3ee34eab
      Oscar Salvador authored
      Patch series "page_owner: print stacks and their outstanding allocations",
      v10.
      
      page_owner is a great debugging tool that lets us know about all pages
      that have been allocated/freed and their specific stacktrace.  This
      comes in very handy when debugging memory leaks, since with some
      scripting we can see the outstanding allocations, which might point to
      a memory leak.

      In my experience, that is one of the most useful cases, but it can get
      really tedious to sift through all pages and try to reconstruct the
      stack <-> allocated/freed relationship, which is most of the time a
      daunting and slow process when we have tons of allocation/free
      operations.
      
      This patchset aims to ease that by adding new functionality to
      page_owner.  It creates a new directory called 'page_owner_stacks'
      under '/sys/kernel/debug' with a read-only file called 'show_stacks',
      which prints out all the stacks followed by their outstanding number of
      allocations (that is, the number of times the stacktrace has allocated
      a page that has not been freed yet).  This gives us a clear and quick
      overview of stacks <-> allocated/free.
      
      We take advantage of the new refcount_t field that the stack_record
      struct gained, and increment/decrement the stack refcount on every
      __set_page_owner() (alloc operation) and __reset_page_owner() (free
      operation) call.
      
      Unfortunately, we cannot use the new stackdepot API
      STACK_DEPOT_FLAG_GET because it does not fulfill page_owner's needs,
      meaning we would have to special-case things, at which point it makes
      more sense for page_owner to do its own {dec,inc}rementing of the
      stacks.  For example, with STACK_DEPOT_FLAG_PUT, once the refcount
      reaches 0 the stack gets evicted, so page_owner would lose information.
      
      This patchset also creates a new file called 'set_threshold' within the
      'page_owner_stacks' directory.  Writing a value to it filters out the
      stacks whose refcount is below that value.
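
      The threshold semantics described here can be modelled with a small
      user-space C sketch.  It is an illustration under assumptions, not the
      page_owner implementation: struct stack_count and filter_stacks() are
      hypothetical names, and the filter simply keeps entries whose count is
      at least the threshold.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical summary of one stack: a label plus its outstanding count. */
struct stack_count {
	const char *label;
	unsigned int count;
};

/*
 * Copy the entries of in[] whose count is at least `threshold` into out[],
 * which must have room for n entries.  Returns the number of entries kept.
 */
static size_t filter_stacks(const struct stack_count *in, size_t n,
			    unsigned int threshold, struct stack_count *out)
{
	size_t kept = 0;

	assert(in != NULL && out != NULL);
	for (size_t i = 0; i < n; i++) {
		if (in[i].count < threshold)
			continue;	/* below the threshold: filtered out */
		out[kept++] = in[i];
	}
	return kept;
}
```

      Fed with the four stack_count values that appear in the PoC output of
      this message (521, 4609, 6781 and 8641) and a threshold of 5000, the
      sketch keeps only the last two.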
      
      A PoC can be found below:
      
       # cat /sys/kernel/debug/page_owner_stacks/show_stacks > page_owner_full_stacks.txt
       # head -40 page_owner_full_stacks.txt 
        prep_new_page+0xa9/0x120
        get_page_from_freelist+0x801/0x2210
        __alloc_pages+0x18b/0x350
        alloc_pages_mpol+0x91/0x1f0
        folio_alloc+0x14/0x50
        filemap_alloc_folio+0xb2/0x100
        page_cache_ra_unbounded+0x96/0x180
        filemap_get_pages+0xfd/0x590
        filemap_read+0xcc/0x330
        blkdev_read_iter+0xb8/0x150
        vfs_read+0x285/0x320
        ksys_read+0xa5/0xe0
        do_syscall_64+0x80/0x160
        entry_SYSCALL_64_after_hwframe+0x6e/0x76
       stack_count: 521
      
      
      
        prep_new_page+0xa9/0x120
        get_page_from_freelist+0x801/0x2210
        __alloc_pages+0x18b/0x350
        alloc_pages_mpol+0x91/0x1f0
        folio_alloc+0x14/0x50
        filemap_alloc_folio+0xb2/0x100
        __filemap_get_folio+0x14a/0x490
        ext4_write_begin+0xbd/0x4b0 [ext4]
        generic_perform_write+0xc1/0x1e0
        ext4_buffered_write_iter+0x68/0xe0 [ext4]
        ext4_file_write_iter+0x70/0x740 [ext4]
        vfs_write+0x33d/0x420
        ksys_write+0xa5/0xe0
        do_syscall_64+0x80/0x160
        entry_SYSCALL_64_after_hwframe+0x6e/0x76
       stack_count: 4609
      ...
      ...
      
       # echo 5000 > /sys/kernel/debug/page_owner_stacks/set_threshold 
       # cat /sys/kernel/debug/page_owner_stacks/show_stacks > page_owner_full_stacks_5000.txt
       # head -40 page_owner_full_stacks_5000.txt 
        prep_new_page+0xa9/0x120
        get_page_from_freelist+0x801/0x2210
        __alloc_pages+0x18b/0x350
        alloc_pages_mpol+0x91/0x1f0
        folio_alloc+0x14/0x50
        filemap_alloc_folio+0xb2/0x100
        __filemap_get_folio+0x14a/0x490
        ext4_write_begin+0xbd/0x4b0 [ext4]
        generic_perform_write+0xc1/0x1e0
        ext4_buffered_write_iter+0x68/0xe0 [ext4]
        ext4_file_write_iter+0x70/0x740 [ext4]
        vfs_write+0x33d/0x420
        ksys_pwrite64+0x75/0x90
        do_syscall_64+0x80/0x160
        entry_SYSCALL_64_after_hwframe+0x6e/0x76
       stack_count: 6781
      
      
      
        prep_new_page+0xa9/0x120
        get_page_from_freelist+0x801/0x2210
        __alloc_pages+0x18b/0x350
        pcpu_populate_chunk+0xec/0x350
        pcpu_balance_workfn+0x2d1/0x4a0
        process_scheduled_works+0x84/0x380
        worker_thread+0x12a/0x2a0
        kthread+0xe3/0x110
        ret_from_fork+0x30/0x50
        ret_from_fork_asm+0x1b/0x30
       stack_count: 8641
      
      
      This patch (of 7):
      
      The very first entry of stack_record gets a handle of 0, but this is wrong
      because stackdepot treats a 0-handle as a non-valid one.  For example,
      see the check in stack_depot_fetch().

      Fix this by adding an offset of 1.

      This bug has been lurking since the very beginning of stackdepot, but it
      seems no one really cared.  Because of that I am not adding a Fixes
      tag.
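
      Why an offset of 1 avoids the 0-handle can be shown with a minimal
      user-space C sketch.  The bit layout below (OFFSET_BITS, a 32-bit
      handle) and the helper names are assumptions chosen for the example and
      do not reproduce stackdepot's real handle layout.

```c
#include <assert.h>
#include <stdint.h>

#define OFFSET_BITS	10u	/* hypothetical width of the in-pool offset */
#define OFFSET_MASK	((1u << OFFSET_BITS) - 1u)

typedef uint32_t handle_t;

/* Encode (pool_index, offset); storing pool_index + 1 keeps 0 reserved. */
static handle_t make_handle(uint32_t pool_index, uint32_t offset)
{
	assert(offset <= OFFSET_MASK);
	assert(pool_index < (UINT32_MAX >> OFFSET_BITS));
	return ((pool_index + 1u) << OFFSET_BITS) | offset;
}

/* A handle of 0 means "no stack trace", as in the check the message cites. */
static int handle_is_valid(handle_t h)
{
	return h != 0;
}

/* Undo the +1 applied when the handle was built. */
static uint32_t handle_pool_index(handle_t h)
{
	assert(handle_is_valid(h));
	return (h >> OFFSET_BITS) - 1u;
}

static uint32_t handle_offset(handle_t h)
{
	return h & OFFSET_MASK;
}
```

      Without the +1, the record at pool 0 and offset 0 would encode to 0 and
      be indistinguishable from "no handle".  With it, that record encodes to
      a non-zero value and still decodes back to pool 0, offset 0.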
      
      Link: https://lkml.kernel.org/r/20240215215907.20121-1-osalvador@suse.de
      Link: https://lkml.kernel.org/r/20240215215907.20121-2-osalvador@suse.de
      Co-developed-by: Marco Elver <elver@google.com>
      Signed-off-by: Marco Elver <elver@google.com>
      Signed-off-by: Oscar Salvador <osalvador@suse.de>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Andrey Konovalov <andreyknvl@gmail.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/debug_vm_pgtable: fix BUG_ON with pud advanced test · 720da1e5
      Aneesh Kumar K.V (IBM) authored
      Architectures like powerpc add debug checks to ensure we find only devmap
      PUD pte entries.  These debug checks are only done with CONFIG_DEBUG_VM.
      This patch marks the ptes used for the PUD advanced test as devmap pte
      entries so that we don't hit the debug checks on architectures like
      ppc64, as shown below.
      
      WARNING: CPU: 2 PID: 1 at arch/powerpc/mm/book3s64/radix_pgtable.c:1382 radix__pud_hugepage_update+0x38/0x138
      ....
      NIP [c0000000000a7004] radix__pud_hugepage_update+0x38/0x138
      LR [c0000000000a77a8] radix__pudp_huge_get_and_clear+0x28/0x60
      Call Trace:
      [c000000004a2f950] [c000000004a2f9a0] 0xc000000004a2f9a0 (unreliable)
      [c000000004a2f980] [000d34c100000000] 0xd34c100000000
      [c000000004a2f9a0] [c00000000206ba98] pud_advanced_tests+0x118/0x334
      [c000000004a2fa40] [c00000000206db34] debug_vm_pgtable+0xcbc/0x1c48
      [c000000004a2fc10] [c00000000000fd28] do_one_initcall+0x60/0x388
      
      Also
      
       kernel BUG at arch/powerpc/mm/book3s64/pgtable.c:202!
       ....
      
       NIP [c000000000096510] pudp_huge_get_and_clear_full+0x98/0x174
       LR [c00000000206bb34] pud_advanced_tests+0x1b4/0x334
       Call Trace:
       [c000000004a2f950] [000d34c100000000] 0xd34c100000000 (unreliable)
       [c000000004a2f9a0] [c00000000206bb34] pud_advanced_tests+0x1b4/0x334
       [c000000004a2fa40] [c00000000206db34] debug_vm_pgtable+0xcbc/0x1c48
       [c000000004a2fc10] [c00000000000fd28] do_one_initcall+0x60/0x388
      
      Link: https://lkml.kernel.org/r/20240129060022.68044-1-aneesh.kumar@kernel.org
      Fixes: 27af67f3 ("powerpc/book3s64/mm: enable transparent pud hugepage")
      Signed-off-by: Aneesh Kumar K.V (IBM) <aneesh.kumar@kernel.org>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: cachestat: fix folio read-after-free in cache walk · 3a75cb05
      Nhat Pham authored
      In cachestat, we access the folio from the page cache's xarray to compute
      its page offset, and check for its dirty and writeback flags.  However, we
      do not hold a reference to the folio before performing these actions,
      which means the folio can concurrently be released and reused as another
      folio/page/slab.
      
      Get around this altogether by just using xarray's existing machinery for
      the folio page offsets and dirty/writeback states.
      
      This changes behavior for tmpfs files to now always report zeroes in their
      dirty and writeback counters.  This is okay as tmpfs doesn't follow
      conventional writeback cache behavior: its pages get "cleaned" during
      swapout, after which they're no longer resident etc.
      
      Link: https://lkml.kernel.org/r/20240220153409.GA216065@cmpxchg.org
      Fixes: cf264e13 ("cachestat: implement cachestat syscall")
      Reported-by: Jann Horn <jannh@google.com>
      Suggested-by: Matthew Wilcox <willy@infradead.org>
      Signed-off-by: Nhat Pham <nphamcs@gmail.com>
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Tested-by: Jann Horn <jannh@google.com>
      Cc: <stable@vger.kernel.org>	[6.4+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • MAINTAINERS: add memory mapping entry with reviewers · 00130266
      Lorenzo Stoakes authored
      Recently there have been a number of patches which have affected various
      aspects of the memory mapping logic as implemented in mm/mmap.c where it
      would have been useful for regular contributors to have been notified.
      
      Add an entry for this part of mm in particular with regular contributors
      tagged as reviewers.
      
      Link: https://lkml.kernel.org/r/20240220064410.4639-1-lstoakes@gmail.com
      Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Liam R. Howlett <Liam.Howlett@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/vmscan: fix a bug calling wakeup_kswapd() with a wrong zone index · 2774f256
      Byungchul Park authored
      With NUMA balancing enabled, the following oops has been observed on a
      NUMA system where a node has no local memory and therefore no managed
      zones.  It happens because wakeup_kswapd() is called with a wrong zone
      index, -1.  Fix it by checking the index before calling
      wakeup_kswapd().
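
      The guard can be sketched in user-space C as below.  This is a
      simplified model under assumptions, not the mm/ code: struct zone,
      struct node, MAX_ZONES and the helpers highest_managed_zone(),
      wake_reclaimer() and try_wake_reclaimer() are hypothetical stand-ins.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_ZONES 4	/* hypothetical number of zones per node */

struct zone {
	unsigned long managed_pages;	/* 0 means the zone is not managed */
	int wakeups;			/* how often the reclaimer was woken */
};

struct node {
	struct zone zones[MAX_ZONES];
	int nr_zones;
};

/* Index of the highest managed zone, or -1 if the node has none. */
static int highest_managed_zone(const struct node *n)
{
	int z;

	for (z = n->nr_zones - 1; z >= 0; z--)
		if (n->zones[z].managed_pages)
			break;
	return z;
}

/* Stand-in for the wakeup call; it must never see an invalid zone. */
static void wake_reclaimer(struct zone *zone)
{
	assert(zone != NULL);
	zone->wakeups++;
}

/* Check the index first: a memoryless node yields -1 and is skipped. */
static bool try_wake_reclaimer(struct node *n)
{
	int z = highest_managed_zone(n);

	if (z < 0)
		return false;
	wake_reclaimer(&n->zones[z]);
	return true;
}
```

      Without the z < 0 check, a memoryless node would index one element
      before the zone array, which matches the kind of wild pointer access
      seen in the oops.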
      
      > BUG: unable to handle page fault for address: 00000000000033f3
      > #PF: supervisor read access in kernel mode
      > #PF: error_code(0x0000) - not-present page
      > PGD 0 P4D 0
      > Oops: 0000 [#1] PREEMPT SMP NOPTI
      > CPU: 2 PID: 895 Comm: masim Not tainted 6.6.0-dirty #255
      > Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
      >    rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
      > RIP: 0010:wakeup_kswapd (./linux/mm/vmscan.c:7812)
      > Code: (omitted)
      > RSP: 0000:ffffc90004257d58 EFLAGS: 00010286
      > RAX: ffffffffffffffff RBX: ffff88883fff0480 RCX: 0000000000000003
      > RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff88883fff0480
      > RBP: ffffffffffffffff R08: ff0003ffffffffff R09: ffffffffffffffff
      > R10: ffff888106c95540 R11: 0000000055555554 R12: 0000000000000003
      > R13: 0000000000000000 R14: 0000000000000000 R15: ffff88883fff0940
      > FS:  00007fc4b8124740(0000) GS:ffff888827c00000(0000) knlGS:0000000000000000
      > CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      > CR2: 00000000000033f3 CR3: 000000026cc08004 CR4: 0000000000770ee0
      > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      > DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      > PKRU: 55555554
      > Call Trace:
      >  <TASK>
      > ? __die
      > ? page_fault_oops
      > ? __pte_offset_map_lock
      > ? exc_page_fault
      > ? asm_exc_page_fault
      > ? wakeup_kswapd
      > migrate_misplaced_page
      > __handle_mm_fault
      > handle_mm_fault
      > do_user_addr_fault
      > exc_page_fault
      > asm_exc_page_fault
      > RIP: 0033:0x55b897ba0808
      > Code: (omitted)
      > RSP: 002b:00007ffeefa821a0 EFLAGS: 00010287
      > RAX: 000055b89983acd0 RBX: 00007ffeefa823f8 RCX: 000055b89983acd0
      > RDX: 00007fc2f8122010 RSI: 0000000000020000 RDI: 000055b89983acd0
      > RBP: 00007ffeefa821a0 R08: 0000000000000037 R09: 0000000000000075
      > R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000000
      > R13: 00007ffeefa82410 R14: 000055b897ba5dd8 R15: 00007fc4b8340000
      >  </TASK>
      
      Link: https://lkml.kernel.org/r/20240216111502.79759-1-byungchul@sk.com
      Signed-off-by: Byungchul Park <byungchul@sk.com>
      Reported-by: Hyeongtak Ji <hyeongtak.ji@sk.com>
      Fixes: c574bbe9 ("NUMA balancing: optimize page placement for memory tiering system")
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • kasan: revert eviction of stack traces in generic mode · 711d3491
      Marco Elver authored
      This partially reverts commits cc478e0b, 63b85ac5, 08d7c94d,
      a414d428, and 773688a6 to make use of variable-sized stack depot
      records, since eviction of stack entries from stack depot forces
      fixed-sized stack records.  Care was taken to retain the code cleanups
      by the above commits.

      Eviction was added to generic KASAN to alleviate the additional memory
      usage from fixed-sized stack records, but this still uses more memory
      than previously.

      With the re-introduction of variable-sized records for stack depot, we can
      just switch back to non-evictable stack records again, and return to
      the previous performance and memory usage baseline.
      
      Before (observed after a KASAN kernel boot):
      
        pools: 597
        refcounted_allocations: 17547
        refcounted_frees: 6477
        refcounted_in_use: 11070
        freelist_size: 3497
        persistent_count: 12163
        persistent_bytes: 1717008
      
      After:
      
        pools: 319
        refcounted_allocations: 0
        refcounted_frees: 0
        refcounted_in_use: 0
        freelist_size: 0
        persistent_count: 29397
        persistent_bytes: 5183536
      
      As can be seen from the counters, with a generic KASAN config, refcounted
      allocations and evictions are no longer used.  Due to using variable-sized
      records, I observe a reduction of 278 stack depot pools (saving 4448 KiB)
      with my test setup.
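
      The savings figure follows from simple arithmetic, sketched below in C.
      The 16 KiB pool size is an assumption inferred from the numbers in this
      message (4448 KiB / 278 pools); POOL_KIB and pools_saved_kib() are
      hypothetical names used only for the example.

```c
#include <assert.h>

/* Assumption: one stack depot pool is 16 KiB (4448 KiB / 278 pools). */
#define POOL_KIB 16u

/* Memory saved, in KiB, when the pool count drops from before to after. */
static unsigned int pools_saved_kib(unsigned int before, unsigned int after)
{
	assert(before >= after);
	return (before - after) * POOL_KIB;
}
```

      With the counters above, pools_saved_kib(597, 319) covers 278 pools and
      returns 4448, matching the figure quoted in the message.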
      
      Link: https://lkml.kernel.org/r/20240129100708.39460-2-elver@google.com
      Fixes: cc478e0b ("kasan: avoid resetting aux_lock")
      Fixes: 63b85ac5 ("kasan: stop leaking stack trace handles")
      Fixes: 08d7c94d ("kasan: memset free track in qlink_free")
      Fixes: a414d428 ("kasan: handle concurrent kasan_record_aux_stack calls")
      Fixes: 773688a6 ("kasan: use stack_depot_put for Generic mode")
      Signed-off-by: Marco Elver <elver@google.com>
      Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com>
      Tested-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • stackdepot: use variable size records for non-evictable entries · 31639fd6
      Marco Elver authored
      With the introduction of stack depot evictions, each stack record is now
      fixed size, so that future reuse after an eviction can safely store
      differently sized stack traces.  In all cases that do not make use of
      evictions, this wastes lots of space.
      
      Fix it by re-introducing variable size stack records (up to the max
      allowed size) for entries that will never be evicted.  We know that an
      entry will never be evicted if the flag STACK_DEPOT_FLAG_GET is not
      provided, since a later stack_depot_put() attempt is undefined behavior.
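
      The sizing rule can be illustrated with a user-space C sketch that uses
      a flexible array member.  It is a simplified model, not the stackdepot
      code: struct record, MAX_FRAMES, record_bytes() and record_alloc() are
      hypothetical, and alignment of records inside a pool is ignored.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define MAX_FRAMES 64u	/* hypothetical cap on frames per stack trace */

struct record {
	unsigned int size;		/* number of frames actually stored */
	unsigned long entries[];	/* flexible array member */
};

/*
 * Evictable records must fit any future trace, so they always reserve
 * MAX_FRAMES slots; persistent records reserve only what they need.
 */
static size_t record_bytes(unsigned int nr_frames, int evictable)
{
	unsigned int slots = evictable ? MAX_FRAMES : nr_frames;

	assert(nr_frames <= MAX_FRAMES);
	return offsetof(struct record, entries) +
	       (size_t)slots * sizeof(unsigned long);
}

/* Allocate a record and copy the frames in; NULL on allocation failure. */
static struct record *record_alloc(const unsigned long *frames,
				   unsigned int nr_frames, int evictable)
{
	struct record *r = malloc(record_bytes(nr_frames, evictable));

	if (!r)
		return NULL;
	r->size = nr_frames;
	if (nr_frames)
		memcpy(r->entries, frames, nr_frames * sizeof(unsigned long));
	return r;
}
```

      In this model a 10-frame persistent record reserves 10 slots, while an
      evictable one reserves all 64 regardless of the trace length, which is
      the waste the message describes.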
      
      With my current kernel config that enables KASAN and also SLUB owner
      tracking, I observe (after a kernel boot) a whopping reduction of 296
      stack depot pools, which translates into 4736 KiB saved.  The savings here
      are from SLUB owner tracking only, because KASAN generic mode still uses
      refcounting.
      
      Before:
      
        pools: 893
        allocations: 29841
        frees: 6524
        in_use: 23317
        freelist_size: 3454
      
      After:
      
        pools: 597
        refcounted_allocations: 17547
        refcounted_frees: 6477
        refcounted_in_use: 11070
        freelist_size: 3497
        persistent_count: 12163
        persistent_bytes: 1717008
      
      [elver@google.com: fix -Wstringop-overflow warning]
        Link: https://lore.kernel.org/all/20240201135747.18eca98e@canb.auug.org.au/
        Link: https://lkml.kernel.org/r/20240201090434.1762340-1-elver@google.com
        Link: https://lore.kernel.org/all/CABXGCsOzpRPZGg23QqJAzKnqkZPKzvieeg=W7sgjgi3q0pBo0g@mail.gmail.com/
      Link: https://lkml.kernel.org/r/20240129100708.39460-1-elver@google.com
      Link: https://lore.kernel.org/all/CABXGCsOzpRPZGg23QqJAzKnqkZPKzvieeg=W7sgjgi3q0pBo0g@mail.gmail.com/
      Fixes: 108be8de ("lib/stackdepot: allow users to evict stack traces")
      Signed-off-by: Marco Elver <elver@google.com>
      Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com>
      Tested-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  2. 22 Feb, 2024 30 commits