1. 30 Nov, 2022 1 commit
    • mm: replace VM_WARN_ON to pr_warn if the node is offline with __GFP_THISNODE · dec1d352
      Yang Shi authored
      Syzbot reported the below splat:
      
      WARNING: CPU: 1 PID: 3646 at include/linux/gfp.h:221 __alloc_pages_node
      include/linux/gfp.h:221 [inline]
      WARNING: CPU: 1 PID: 3646 at include/linux/gfp.h:221
      hpage_collapse_alloc_page mm/khugepaged.c:807 [inline]
      WARNING: CPU: 1 PID: 3646 at include/linux/gfp.h:221
      alloc_charge_hpage+0x802/0xaa0 mm/khugepaged.c:963
      Modules linked in:
      CPU: 1 PID: 3646 Comm: syz-executor210 Not tainted
      6.1.0-rc1-syzkaller-00454-ga7038524 #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
      Google 10/11/2022
      RIP: 0010:__alloc_pages_node include/linux/gfp.h:221 [inline]
      RIP: 0010:hpage_collapse_alloc_page mm/khugepaged.c:807 [inline]
      RIP: 0010:alloc_charge_hpage+0x802/0xaa0 mm/khugepaged.c:963
      Code: e5 01 4c 89 ee e8 6e f9 ae ff 4d 85 ed 0f 84 28 fc ff ff e8 70 fc
      ae ff 48 8d 6b ff 4c 8d 63 07 e9 16 fc ff ff e8 5e fc ae ff <0f> 0b e9
      96 fa ff ff 41 bc 1a 00 00 00 e9 86 fd ff ff e8 47 fc ae
      RSP: 0018:ffffc90003fdf7d8 EFLAGS: 00010293
      RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
      RDX: ffff888077f457c0 RSI: ffffffff81cd8f42 RDI: 0000000000000001
      RBP: ffff888079388c0c R08: 0000000000000001 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
      R13: dffffc0000000000 R14: 0000000000000000 R15: 0000000000000000
      FS:  00007f6b48ccf700(0000) GS:ffff8880b9b00000(0000)
      knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00007f6b48a819f0 CR3: 00000000171e7000 CR4: 00000000003506e0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      Call Trace:
       <TASK>
       collapse_file+0x1ca/0x5780 mm/khugepaged.c:1715
       hpage_collapse_scan_file+0xd6c/0x17a0 mm/khugepaged.c:2156
       madvise_collapse+0x53a/0xb40 mm/khugepaged.c:2611
       madvise_vma_behavior+0xd0a/0x1cc0 mm/madvise.c:1066
       madvise_walk_vmas+0x1c7/0x2b0 mm/madvise.c:1240
       do_madvise.part.0+0x24a/0x340 mm/madvise.c:1419
       do_madvise mm/madvise.c:1432 [inline]
       __do_sys_madvise mm/madvise.c:1432 [inline]
       __se_sys_madvise mm/madvise.c:1430 [inline]
       __x64_sys_madvise+0x113/0x150 mm/madvise.c:1430
       do_syscall_x64 arch/x86/entry/common.c:50 [inline]
       do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
       entry_SYSCALL_64_after_hwframe+0x63/0xcd
      RIP: 0033:0x7f6b48a4eef9
      Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 b1 15 00 00 90 48 89 f8 48 89
      f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01
      f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48
      RSP: 002b:00007f6b48ccf318 EFLAGS: 00000246 ORIG_RAX: 000000000000001c
      RAX: ffffffffffffffda RBX: 00007f6b48af0048 RCX: 00007f6b48a4eef9
      RDX: 0000000000000019 RSI: 0000000000600003 RDI: 0000000020000000
      RBP: 00007f6b48af0040 R08: 0000000000000000 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000246 R12: 00007f6b48aa53a4
      R13: 00007f6b48bffcbf R14: 00007f6b48ccf400 R15: 0000000000022000
       </TASK>
      
      This happens because khugepaged allocates pages with __GFP_THISNODE, but
      the preferred node is bogus.  The previous patch fixed the khugepaged code
      to avoid allocating pages from a non-existent node, but the code is still
      racy against memory hot-remove.  There is no synchronization with memory
      hotplug, so memory can go offline during a long-running scan.
      
      So this warning is still not very helpful because:
        * There is no guarantee that the node is online in a __GFP_THISNODE
          context at every callsite.
        * The kernel fails the allocation regardless of the warning, and all
          callsites appear to handle the allocation failure gracefully.
      
      Although the warning has helped to identify buggy code, it is not safe
      in general: it can panic a system that uses the panic-on-warn
      configuration, which tends to be used surprisingly often.  So replace
      VM_WARN_ON with pr_warn().  The new warning is printed only if
      __GFP_NOWARN is set, since the allocator already prints a warning for
      this case when __GFP_NOWARN is not set.
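      
      The decision described above can be sketched in user-space C.  This is
      only an illustration, not the kernel code: the flag bit values, the
      node_online_map array and should_warn_offline() are hypothetical
      stand-ins.  Only the logic follows the text: print a message when both
      __GFP_THISNODE and __GFP_NOWARN are set and the node is offline.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical flag bits; the kernel's real values differ. */
#define GFP_THISNODE_BIT 0x1u
#define GFP_NOWARN_BIT   0x2u

#define MAX_NODES 4

/* Hypothetical stand-in for the kernel's node online state. */
static bool node_online_map[MAX_NODES] = { true, false, true, false };

/*
 * Returns true when a message should be printed: the caller asked for this
 * exact node, suppressed the allocator's own warning, and the node is offline.
 */
bool should_warn_offline(int this_node, unsigned int gfp_mask)
{
	unsigned int warn_gfp = gfp_mask & (GFP_THISNODE_BIT | GFP_NOWARN_BIT);

	assert(this_node >= 0 && this_node < MAX_NODES);

	if (warn_gfp != (GFP_THISNODE_BIT | GFP_NOWARN_BIT))
		return false;	/* the allocator prints its own warning, or none is due */
	if (node_online_map[this_node])
		return false;
	return true;
}

/* Print a plain message instead of a WARN, so panic-on-warn is not triggered. */
void warn_if_node_offline(int this_node, unsigned int gfp_mask)
{
	if (should_warn_offline(this_node, gfp_mask))
		fprintf(stderr, "gfp 0x%x allocation from offline node %d\n",
			gfp_mask, this_node);
}
```

      In this sketch, node 1 is offline, so only a call that carries both flag
      bits for node 1 produces a message.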
      
      [shy828301@gmail.com: rename nid to this_node and gfp to warn_gfp]
        Link: https://lkml.kernel.org/r/20221123193014.153983-1-shy828301@gmail.com
      [akpm@linux-foundation.org: fix whitespace]
      [akpm@linux-foundation.org: print gfp_mask instead of warn_gfp, per Michel]
      Link: https://lkml.kernel.org/r/20221108184357.55614-3-shy828301@gmail.com
      Fixes: 7d8faaf1 ("mm/madvise: introduce MADV_COLLAPSE sync hugepage collapse")
      Signed-off-by: Yang Shi <shy828301@gmail.com>
      Reported-by: <syzbot+0044b22d177870ee974f@syzkaller.appspotmail.com>
      Suggested-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Zach O'Keefe <zokeefe@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  2. 23 Nov, 2022 24 commits
    • test_kprobes: fix implicit declaration error of test_kprobes · de3db3f8
      Li Hua authored
      If KPROBES_SANITY_TEST and ARCH_CORRECT_STACKTRACE_ON_KRETPROBE are enabled but
      STACKTRACE is not set, the build fails as below:
      
      lib/test_kprobes.c: In function `stacktrace_return_handler':
      lib/test_kprobes.c:228:8: error: implicit declaration of function `stack_trace_save'; did you mean `stacktrace_driver'? [-Werror=implicit-function-declaration]
        ret = stack_trace_save(stack_buf, STACK_BUF_SIZE, 0);
              ^~~~~~~~~~~~~~~~
              stacktrace_driver
      cc1: all warnings being treated as errors
      scripts/Makefile.build:250: recipe for target 'lib/test_kprobes.o' failed
      make[2]: *** [lib/test_kprobes.o] Error 1
      
      To fix this error, select STACKTRACE if ARCH_CORRECT_STACKTRACE_ON_KRETPROBE is enabled.
      
      Link: https://lkml.kernel.org/r/20221121030620.63181-1-hucool.lihua@huawei.com
      Fixes: 1f6d3a8f ("kprobes: Add a test case for stacktrace from kretprobe handler")
      Signed-off-by: Li Hua <hucool.lihua@huawei.com>
      Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
      Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • nilfs2: fix nilfs_sufile_mark_dirty() not set segment usage as dirty · 512c5ca0
      Chen Zhongjin authored
      When extending segments, nilfs_sufile_alloc() is called to get an
      unassigned segment, which is then marked as dirty to avoid accidentally
      allocating the same segment in the future.
      
      But in some special cases, such as a corrupted image, this can be
      unreliable.  If the dirty state of the segment is corrupted, nilfs2 may
      reallocate a segment that is in use and pick the same segment for writing
      twice at the same time.
      
      This will cause the problem reported by syzkaller:
      https://syzkaller.appspot.com/bug?id=c7c4748e11ffcc367cef04f76e02e931833cbd24
      
      This case started with segbuf1.segnum = 3, nextnum = 4 when constructed.
      It assumed that segment 4 had already been allocated and marked as dirty.
      
      However, the dirty state was corrupted and the usage of segment 4 was not
      dirty.  On the first call to nilfs_segctor_extend_segments(), segment 4
      was allocated again, so segbuf2 and the following segbuf3 both had
      segment 4.
      
      sb_getblk() then returns the same bh for segbuf2 and segbuf3, and this bh
      is added to the buffer lists of both segbufs.  This breaks the lists and
      causes a NULL pointer dereference.
      
      Fix the problem by setting usage as dirty every time in
      nilfs_sufile_mark_dirty(), which is called while constructing the current
      segment to be written out and before allocating the next segment.
      
      [chenzhongjin@huawei.com: add lock protection per Ryusuke]
        Link: https://lkml.kernel.org/r/20221121091141.214703-1-chenzhongjin@huawei.com
      Link: https://lkml.kernel.org/r/20221118063304.140187-1-chenzhongjin@huawei.com
      Fixes: 9ff05123 ("nilfs2: segment constructor")
      Signed-off-by: Chen Zhongjin <chenzhongjin@huawei.com>
      Reported-by: <syzbot+77e4f0...@syzkaller.appspotmail.com>
      Reported-by: Liu Shixin <liushixin2@huawei.com>
      Acked-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
      Tested-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/cgroup/reclaim: fix dirty pages throttling on cgroup v1 · 81a70c21
      Aneesh Kumar K.V authored
      balance_dirty_pages doesn't do the required dirty throttling on cgroupv1. 
      See commit 9badce00 ("cgroup, writeback: don't enable cgroup writeback
      on traditional hierarchies").  Instead, the kernel depends on writeback
      throttling in shrink_folio_list to achieve the same goal.  On systems with
      large memory, the flusher may not be able to write back quickly enough, so
      shrink_folio_list starts finding pages that are already under writeback.
      Hence for cgroupv1 let's do a reclaim throttle after waking up the
      flusher.
      
      With this change, the test below, which used to fail on a 256GB system,
      runs until the file system is full.
      
      root@lp2:/sys/fs/cgroup/memory# mkdir test
      root@lp2:/sys/fs/cgroup/memory# cd test/
      root@lp2:/sys/fs/cgroup/memory/test# echo 120M > memory.limit_in_bytes
      root@lp2:/sys/fs/cgroup/memory/test# echo $$ > tasks
      root@lp2:/sys/fs/cgroup/memory/test# dd if=/dev/zero of=/home/kvaneesh/test bs=1M
      Killed
      
      Link: https://lkml.kernel.org/r/20221118070603.84081-1-aneesh.kumar@linux.ibm.com
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: zefan li <lizefan.x@bytedance.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: fix unexpected changes to {failslab|fail_page_alloc}.attr · ea4452de
      Qi Zheng authored
      When we specify __GFP_NOWARN, we only expect that no warnings will be
      issued for the current caller.  But in __should_failslab() and
      __should_fail_alloc_page(), the local GFP flags alter the global
      {failslab|fail_page_alloc}.attr, which is persistent and shared by all
      tasks.  This is not what we expect, so fix it.
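      
      The difference between altering shared state and passing a per-call
      flag can be sketched as follows.  This is a simplified user-space
      illustration under stated assumptions: struct fault_attr_sketch, the
      flag values and all three functions are hypothetical and do not mirror
      the kernel's real signatures.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define GFP_NOWARN_BIT      0x2u	/* hypothetical GFP flag bit */
#define FAULT_NOWARN_SKETCH 0x1		/* hypothetical per-call flag */

/* Hypothetical shared, persistent fault-injection settings. */
struct fault_attr_sketch {
	bool no_warn;
};

struct fault_attr_sketch failslab_attr = { .no_warn = false };

/* Returns true if this particular call would print a warning. */
bool fail_would_warn(const struct fault_attr_sketch *attr, int flags)
{
	assert(attr != NULL);
	return !attr->no_warn && !(flags & FAULT_NOWARN_SKETCH);
}

/* Buggy pattern: the caller's local GFP flag leaks into the shared attr. */
bool failslab_buggy(unsigned int gfp)
{
	if (gfp & GFP_NOWARN_BIT)
		failslab_attr.no_warn = true;	/* persists for every later caller */
	return fail_would_warn(&failslab_attr, 0);
}

/* Fixed pattern: the flag travels with the call; the attr is not modified. */
bool failslab_fixed(unsigned int gfp)
{
	int flags = (gfp & GFP_NOWARN_BIT) ? FAULT_NOWARN_SKETCH : 0;

	return fail_would_warn(&failslab_attr, flags);
}
```

      With the fixed pattern, a caller that passes the no-warn bit silences
      only its own call, and a later caller without the bit still gets a
      warning.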
      
      [akpm@linux-foundation.org: unexport should_fail_ex()]
      Link: https://lkml.kernel.org/r/20221118100011.2634-1-zhengqi.arch@bytedance.com
      Fixes: 3f913fc5 ("mm: fix missing handler for __GFP_NOWARN")
      Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
      Reported-by: Dmitry Vyukov <dvyukov@google.com>
      Reviewed-by: Akinobu Mita <akinobu.mita@gmail.com>
      Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Akinobu Mita <akinobu.mita@gmail.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • swapfile: fix soft lockup in scan_swap_map_slots · de1ccfb6
      Chen Wandun authored
      A soft lockup occurs when scanning for free swap slots under heavy memory
      pressure.  The test scenario is: 64 CPU cores, 64GB memory, and 28 zram
      devices, each with a disksize of 50MB.
      
      LATENCY_LIMIT is used to prevent soft lockups in scan_swap_map_slots(),
      but the actual number of loop iterations can exceed LATENCY_LIMIT because
      "goto checks" and "goto scan" are taken repeatedly without decreasing
      latency_ration.
      
      To fix it, decrease latency_ration in advance.
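      
      The idea of charging the latency budget before the availability check
      can be sketched in user-space C.  This is not the kernel loop: the
      small LATENCY_LIMIT value, the usable array (a stand-in for the later
      "checks" stage rejecting a slot) and yield_cpu() (a stand-in for a
      reschedule point) are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define LATENCY_LIMIT 4		/* hypothetical; small so the effect is visible */

int yields;			/* counts calls to the reschedule stand-in */

void yield_cpu(void)
{
	yields++;
}

/*
 * Scan n slots for one that is free (slots[i] == 0) and also passes a
 * second-stage check (usable[i] != 0).  Returns the slot index or -1.
 * The budget is charged BEFORE the free check, so probes that bounce
 * between "checks" and "scan" are still counted.
 */
int scan_slots(const unsigned char *slots, const unsigned char *usable, size_t n)
{
	int latency_ration = LATENCY_LIMIT;
	size_t offset = 0;

	assert(slots && usable);

scan:
	for (; offset < n; offset++) {
		if (--latency_ration < 0) {
			yield_cpu();
			latency_ration = LATENCY_LIMIT;
		}
		if (slots[offset] == 0)
			goto checks;
	}
	return -1;

checks:
	if (!usable[offset]) {
		offset++;
		goto scan;	/* this probe was already charged above */
	}
	return (int)offset;
}
```

      In this sketch every probe costs one unit of budget, so the scan yields
      at a bounded interval even when each free slot is rejected and the code
      jumps back to the scan loop.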
      
      There is also a suspicious place in get_swap_pages() that may cause soft
      lockups.  In this function, the "goto start_over" may result in
      continuous scanning of the swap partition.  If there is no cond_resched()
      in scan_swap_map_slots(), it would cause a soft lockup (I am not sure
      about this).
      
      WARN: soft lockup - CPU#11 stuck for 11s! [kswapd0:466]
      CPU: 11 PID: 466 Comm: kswapd0 Kdump: loaded Tainted: G
      dump_backtrace+0x0/0x1e4
      show_stack+0x20/0x2c
      dump_stack+0xd8/0x140
      watchdog_print_info+0x48/0x54
      watchdog_process_before_softlockup+0x98/0xa0
      watchdog_timer_fn+0x1ac/0x2d0
      __hrtimer_run_queues+0xb0/0x130
      hrtimer_interrupt+0x13c/0x3c0
      arch_timer_handler_virt+0x3c/0x50
      handle_percpu_devid_irq+0x90/0x1f4
      __handle_domain_irq+0x84/0x100
      gic_handle_irq+0x88/0x2b0
      el1_irq+0xb8/0x140
      scan_swap_map_slots+0x678/0x890
      get_swap_pages+0x29c/0x440
      get_swap_page+0x120/0x2e0
      add_to_swap+0x20/0x9c
      shrink_page_list+0x5d0/0x152c
      shrink_inactive_list+0x16c/0x500
      shrink_lruvec+0x270/0x304
      
      WARN: soft lockup - CPU#32 stuck for 11s! [stress-ng:309915]
      watchdog_timer_fn+0x1ac/0x2d0
      __run_hrtimer+0x98/0x2a0
      __hrtimer_run_queues+0xb0/0x130
      hrtimer_interrupt+0x13c/0x3c0
      arch_timer_handler_virt+0x3c/0x50
      handle_percpu_devid_irq+0x90/0x1f4
      __handle_domain_irq+0x84/0x100
      gic_handle_irq+0x88/0x2b0
      el1_irq+0xb8/0x140
      get_swap_pages+0x1e8/0x440
      get_swap_page+0x1c8/0x2e0
      add_to_swap+0x20/0x9c
      shrink_page_list+0x5d0/0x152c
      reclaim_pages+0x160/0x310
      madvise_cold_or_pageout_pte_range+0x7bc/0xe3c
      walk_pmd_range.isra.0+0xac/0x22c
      walk_pud_range+0xfc/0x1c0
      walk_pgd_range+0x158/0x1b0
      __walk_page_range+0x64/0x100
      walk_page_range+0x104/0x150
      
      Link: https://lkml.kernel.org/r/20221118133850.3360369-1-chenwandun@huawei.com
      Fixes: 048c27fd ("[PATCH] swap: scan_swap_map latency breaks")
      Signed-off-by: Chen Wandun <chenwandun@huawei.com>
      Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: Nanyong Sun <sunnanyong@huawei.com>
      Cc: <xialonglong1@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • hugetlb: fix __prep_compound_gigantic_page page flag setting · 7fb0728a
      Mike Kravetz authored
      Commit 2b21624f ("hugetlb: freeze allocated pages before creating
      hugetlb pages") changed the order in which page flags are cleared and set
      in the head page.  It moved the __ClearPageReserved after __SetPageHead.
      However, there is a check to make sure __ClearPageReserved is never done
      on a head page.  If CONFIG_DEBUG_VM_PGFLAGS is enabled, the following BUG
      will be hit when creating a hugetlb gigantic page:
      
          page dumped because: VM_BUG_ON_PAGE(1 && PageCompound(page))
          ------------[ cut here ]------------
          kernel BUG at include/linux/page-flags.h:500!
          Call Trace will differ depending on whether hugetlb page is created
          at boot time or run time.
      
      Make sure to __ClearPageReserved BEFORE __SetPageHead.
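      
      The ordering constraint can be illustrated with a small user-space
      sketch.  The struct, the flag bit values and the function names are
      hypothetical; the real helpers are kernel macros, and the real check is
      a VM_BUG_ON_PAGE rather than a return value.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define PG_RESERVED_SKETCH 0x1u	/* hypothetical bit values */
#define PG_HEAD_SKETCH     0x2u

struct page_sketch {
	unsigned int flags;
};

/* Mimics the debug check: clearing Reserved on a head page is refused. */
bool clear_page_reserved(struct page_sketch *p)
{
	if (p->flags & PG_HEAD_SKETCH)
		return false;	/* the kernel would hit the BUG here */
	p->flags &= ~PG_RESERVED_SKETCH;
	return true;
}

void set_page_head(struct page_sketch *p)
{
	p->flags |= PG_HEAD_SKETCH;
}

/* Order used by the fix: clear Reserved first, then mark the page as head. */
bool prep_head_fixed(struct page_sketch *p)
{
	assert(p != NULL);
	if (!clear_page_reserved(p))
		return false;
	set_page_head(p);
	return true;
}

/* Order that trips the check: head flag set before Reserved is cleared. */
bool prep_head_buggy(struct page_sketch *p)
{
	assert(p != NULL);
	set_page_head(p);
	return clear_page_reserved(p);
}
```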
      
      Link: https://lkml.kernel.org/r/20221118195249.178319-1-mike.kravetz@oracle.com
      Fixes: 2b21624f ("hugetlb: freeze allocated pages before creating hugetlb pages")
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reported-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Acked-by: Muchun Song <songmuchun@bytedance.com>
      Tested-by: Tarun Sahu <tsahu@linux.ibm.com>
      Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Joao Martins <joao.m.martins@oracle.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • kfence: fix stack trace pruning · 747c0f35
      Marco Elver authored
      Commit b1405135 ("mm/sl[au]b: generalize kmalloc subsystem")
      refactored large parts of the kmalloc subsystem, which broke the stack
      trace pruning logic done by KFENCE.
      
      While b1405135 attempted to fix the situation by including
      '__kmem_cache_free' in the list of functions KFENCE should skip through,
      this only works when the compiler actually optimizes the tail call from
      kfree() to __kmem_cache_free() into a jump (so that kfree() does _not_
      appear in the full stack trace to begin with).
      
      In some configurations, the compiler no longer optimizes the tail call
      into a jump, and __kmem_cache_free() appears in the stack trace.  This
      means that the pruned stack trace shown by KFENCE would include kfree()
      which is not intended - for example:
      
       | BUG: KFENCE: invalid free in kfree+0x7c/0x120
       |
       | Invalid free of 0xffff8883ed8fefe0 (in kfence-#126):
       |  kfree+0x7c/0x120
       |  test_double_free+0x116/0x1a9
       |  kunit_try_run_case+0x90/0xd0
       | [...]
      
      Fix it by moving __kmem_cache_free() to the list of functions that may be
      tail called by an allocator entry function, making the pruning logic work
      in both the optimized and unoptimized tail call cases.
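      
      A pruning scheme of this kind can be sketched in user-space C.  This is
      a simplified illustration, not KFENCE's code: stack_skipnr(), the two
      short prefix lists and the frame names used to exercise it are
      assumptions.  Frames are ordered from the innermost call outwards.

```c
#include <assert.h>
#include <string.h>

static int has_prefix(const char *s, const char *prefix)
{
	return strncmp(s, prefix, strlen(prefix)) == 0;
}

/* Return the index of the first frame that should be shown to the user. */
int stack_skipnr(const char *const frames[], int num)
{
	int skipnr, fallback = 0;

	assert(frames && num >= 0);

	for (skipnr = 0; skipnr < num; skipnr++) {
		const char *sym = frames[skipnr];

		/* Internal functions that an entry point may tail-call. */
		if (has_prefix(sym, "kfence_") ||
		    has_prefix(sym, "__kmem_cache_free"))
			fallback = skipnr + 1;

		/* Allocator entry points: prune up to and including them. */
		if (has_prefix(sym, "kfree") || has_prefix(sym, "kmalloc"))
			goto found;
	}
	/* Entry point absent (tail call became a jump): use the fallback. */
	return fallback < num ? fallback : 0;

found:
	skipnr++;
	return skipnr < num ? skipnr : 0;
}
```

      Whether or not kfree() is present in the trace, the first frame kept by
      this sketch is the caller of the allocator entry point.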
      
      Link: https://lkml.kernel.org/r/20221118152216.3914899-1-elver@google.com
      Fixes: b1405135 ("mm/sl[au]b: generalize kmalloc subsystem")
      Signed-off-by: Marco Elver <elver@google.com>
      Reviewed-by: Alexander Potapenko <glider@google.com>
      Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
      Cc: Feng Tang <feng.tang@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • proc/meminfo: fix spacing in SecPageTables · f850c849
      Yosry Ahmed authored
      SecPageTables has a tab after it instead of a space, which can break
      fragile parsers that depend on spaces after the stat names.
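      
      To illustrate the kind of parser that breaks, the sketch below compares
      a parser that insists on a literal space after the colon with one that
      accepts any blanks.  Both helpers are hypothetical, and the sample
      lines used to exercise them are made up for illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Fragile: requires a literal space right after "Name:". */
int parse_fragile(const char *line, const char *name, unsigned long *kb)
{
	size_t n = strlen(name);

	assert(line && name && kb);
	if (strncmp(line, name, n) != 0 || line[n] != ':' || line[n + 1] != ' ')
		return -1;
	return sscanf(line + n + 1, "%lu", kb) == 1 ? 0 : -1;
}

/* Tolerant: %lu skips any leading spaces or tabs after the colon. */
int parse_tolerant(const char *line, const char *name, unsigned long *kb)
{
	size_t n = strlen(name);

	assert(line && name && kb);
	if (strncmp(line, name, n) != 0 || line[n] != ':')
		return -1;
	return sscanf(line + n + 1, "%lu", kb) == 1 ? 0 : -1;
}
```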
      
      Link: https://lkml.kernel.org/r/20221117043247.133294-1-yosryahmed@google.com
      Fixes: ebc97a52 ("mm: add NR_SECONDARY_PAGETABLE to count secondary page table uses.")
      Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Shakeel Butt <shakeelb@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Marc Zyngier <maz@kernel.org>
      Cc: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: multi-gen LRU: retry folios written back while isolated · 359a5e14
      Yu Zhao authored
      The page reclaim isolates a batch of folios from the tail of one of the
      LRU lists and works on those folios one by one.  For a suitable
      swap-backed folio, if the swap device is async, it queues that folio for
      writeback.  After the page reclaim finishes an entire batch, it puts back
      the folios it queued for writeback to the head of the original LRU list.
      
      In the meantime, the page writeback flushes the queued folios also by
      batches.  Its batching logic is independent from that of the page reclaim.
      For each of the folios it writes back, the page writeback calls
      folio_rotate_reclaimable() which tries to rotate a folio to the tail.
      
      folio_rotate_reclaimable() only works for a folio after the page reclaim
      has put it back.  If an async swap device is fast enough, the page
      writeback can finish with that folio while the page reclaim is still
      working on the rest of the batch containing it.  In this case, that folio
      will remain at the head and the page reclaim will not retry it before
      reaching there.
      
      This patch adds a retry to evict_folios().  After evict_folios() has
      finished an entire batch and before it puts back folios it cannot free
      immediately, it retries those that may have missed the rotation.
      
      Before this patch, ~60% of folios swapped to an Intel Optane missed
      folio_rotate_reclaimable().  After this patch, ~99% of missed folios were
      reclaimed upon retry.
      
      This problem affects relatively slow async swap devices like Samsung 980
      Pro much less and does not affect sync swap devices like zram or zswap at
      all.
      
      Link: https://lkml.kernel.org/r/20221116013808.3995280-1-yuzhao@google.com
      Fixes: ac35a490 ("mm: multi-gen LRU: minimal implementation")
      Signed-off-by: Yu Zhao <yuzhao@google.com>
      Cc: "Yin, Fengwei" <fengwei.yin@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mailmap: update email address for Satya Priya · 47123d7f
      Satya Priya authored
      Add and update the email address, since skakit@codeaurora.org is no
      longer active.
      
      Link: https://lkml.kernel.org/r/20221116105017.3018971-1-quic_c_skakit@quicinc.com
      Signed-off-by: Satya Priya <quic_c_skakit@quicinc.com>
      Cc: Konrad Dybcio <konrad.dybcio@linaro.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/migrate_device: return number of migrating pages in args->cpages · 44af0b45
      Alistair Popple authored
      migrate_vma->cpages originally contained a count of the number of pages
      migrating including non-present pages which can be populated directly on
      the target.
      
      Commit 241f6885 ("mm/migrate_device.c: refactor migrate_vma and
      migrate_device_coherent_page()") inadvertently changed this to contain
      just the number of pages that were unmapped.  Usage of migrate_vma->cpages
      isn't documented, but most drivers use it to see if all the requested
      addresses can be migrated, so restore the original behaviour.
      
      Link: https://lkml.kernel.org/r/20221111005135.1344004-1-apopple@nvidia.com
      Fixes: 241f6885 ("mm/migrate_device.c: refactor migrate_vma and migrate_deivce_coherent_page()")
      Signed-off-by: Alistair Popple <apopple@nvidia.com>
      Reported-by: Ralph Campbell <rcampbell@nvidia.com>
      Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Alex Sierra <alex.sierra@amd.com>
      Cc: Ben Skeggs <bskeggs@redhat.com>
      Cc: Felix Kuehling <Felix.Kuehling@amd.com>
      Cc: Lyude Paul <lyude@redhat.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • kbuild: fix -Wimplicit-function-declaration in license_is_gpl_compatible · 50c69721
      Sam James authored
      Add missing <linux/string.h> include for strcmp.
      
      Clang 16 makes -Wimplicit-function-declaration an error by default. 
      Unfortunately, out of tree modules may use this in configure scripts,
      which means failure might cause silent miscompilation or misconfiguration.
      
      For more information, see LWN.net [0] or LLVM's Discourse [1], gentoo-dev@ [2],
      or the (new) c-std-porting mailing list [3].
      
      [0] https://lwn.net/Articles/913505/
      [1] https://discourse.llvm.org/t/configure-script-breakage-with-the-new-werror-implicit-function-declaration/65213
      [2] https://archives.gentoo.org/gentoo-dev/message/dd9f2d3082b8b6f8dfbccb0639e6e240
      [3] hosted at lists.linux.dev.
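      
      A user-space sketch of the pattern looks roughly like this.  It uses
      <string.h> where the kernel header would use <linux/string.h>, and the
      list of accepted license strings is an assumption for illustration
      rather than a copy of the kernel's list.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>	/* declares strcmp(); without it the call is implicit */

/* Returns non-zero if the license string is treated as GPL compatible. */
int license_is_gpl_compatible(const char *license)
{
	assert(license != NULL);
	return strcmp(license, "GPL") == 0
		|| strcmp(license, "GPL v2") == 0
		|| strcmp(license, "Dual BSD/GPL") == 0
		|| strcmp(license, "Dual MIT/GPL") == 0;
}
```

      With the include present, the compiler sees a proper prototype for
      strcmp() and has no reason to emit -Wimplicit-function-declaration.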
      
      [akpm@linux-foundation.org: remember "linux/"]
      Link: https://lkml.kernel.org/r/20221116182634.2823136-1-sam@gentoo.org
      Signed-off-by: Sam James <sam@gentoo.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mailmap: update Alex Hung's email address · d39e2ad6
      Alex Hung authored
      I am no longer at Canonical, so add an entry for my personal email address.
      
      Link: https://lkml.kernel.org/r/20221114001302.671897-1-alex.hung@amd.com
      Signed-off-by: Alex Hung <alexhung@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: mmap: fix documentation for vma_mas_szero · 4a423440
      Ian Cowan authored
      When the struct_mm input, mm, was changed to a struct ma_state, mas, the
      documentation for the function was never updated.  This updates that
      documentation reference.
      
      Link: https://lkml.kernel.org/r/20221114003349.41235-1-ian@linux.cowan.aero
      Signed-off-by: Ian Cowan <ian@linux.cowan.aero>
      Acked-by: David Hildenbrand <david@redhat.com>
      Cc: Liam Howlett <liam.howlett@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/damon/sysfs-schemes: skip stats update if the scheme directory is removed · 8468b486
      SeongJae Park authored
      A DAMON sysfs interface user can start DAMON with a scheme, remove the
      sysfs directory for the scheme, and then request an update of the scheme's
      stats.  Because the scheme stats update logic isn't aware of this
      situation, it results in an invalid memory access.  Fix the bug by
      checking if the scheme sysfs directory exists.
      
      Link: https://lkml.kernel.org/r/20221114175552.1951-1-sj@kernel.org
      Fixes: 0ac32b8a ("mm/damon/sysfs: support DAMOS stats")
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Cc: <stable@vger.kernel.org>	[v5.18]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/memory: return vm_fault_t result from migrate_to_ram() callback · 4a955bed
      Alistair Popple authored
      The migrate_to_ram() callback should always succeed, but in rare cases it
      can fail, usually returning VM_FAULT_SIGBUS.  Commit 16ce101d
      ("mm/memory.c: fix race when faulting a device private page") incorrectly
      stopped passing the return code up the stack.  Fix this by setting the ret
      variable, restoring the previous behaviour on migrate_to_ram() failure.
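      
      The difference between dropping and propagating the callback result can
      be sketched as follows.  This is a user-space illustration: the
      vm_fault_t typedef, the VM_FAULT_SIGBUS_SKETCH value, the callback type
      and the function names are all hypothetical stand-ins.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned int vm_fault_t;	/* stand-in for the kernel type */
#define VM_FAULT_SIGBUS_SKETCH 0x2u	/* hypothetical value */

typedef vm_fault_t (*migrate_to_ram_fn)(void *ctx);

/* Buggy pattern: the callback runs, but its result never reaches the caller. */
vm_fault_t fault_buggy(migrate_to_ram_fn migrate_to_ram, void *ctx)
{
	vm_fault_t ret = 0;

	assert(migrate_to_ram != NULL);
	(void)migrate_to_ram(ctx);
	return ret;
}

/* Fixed pattern: store the callback result in ret and pass it up the stack. */
vm_fault_t fault_fixed(migrate_to_ram_fn migrate_to_ram, void *ctx)
{
	vm_fault_t ret;

	assert(migrate_to_ram != NULL);
	ret = migrate_to_ram(ctx);
	return ret;
}

/* Hypothetical callback that always fails, to exercise both patterns. */
vm_fault_t failing_callback(void *ctx)
{
	(void)ctx;
	return VM_FAULT_SIGBUS_SKETCH;
}
```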
      
      Link: https://lkml.kernel.org/r/20221114115537.727371-1-apopple@nvidia.com
      Fixes: 16ce101d ("mm/memory.c: fix race when faulting a device private page")
      Signed-off-by: Alistair Popple <apopple@nvidia.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Alex Sierra <alex.sierra@amd.com>
      Cc: Ben Skeggs <bskeggs@redhat.com>
      Cc: Lyude Paul <lyude@redhat.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: correctly charge compressed memory to its memcg · cd08d80e
      Li Liguang authored
      Kswapd reclaims memory when memory pressure is high, and anonymous
      memory is compressed and stored in the zpool if zswap is enabled.
      The memcg_kmem_bypass() check in get_obj_cgroup_from_page() bypasses
      kernel threads, so the compressed memory is not charged to its memory
      cgroup.
      
      Remove the memcg_kmem_bypass() call and properly charge compressed memory
      to its corresponding memory cgroup.
      
      Link: https://lore.kernel.org/linux-mm/CALvZod4nnn8BHYqAM4xtcR0Ddo2-Wr8uKm9h_CHWUaXw7g_DCg@mail.gmail.com/
      Link: https://lkml.kernel.org/r/20221114194828.100822-1-hannes@cmpxchg.org
      Fixes: f4840ccf ("zswap: memcg accounting")
      Signed-off-by: Li Liguang <liliguang@baidu.com>
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Shakeel Butt <shakeelb@google.com>
      Reviewed-by: Muchun Song <songmuchun@bytedance.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: <stable@vger.kernel.org>	[5.19+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • ipc/shm: call underlying open/close vm_ops · b6305049
      Mike Kravetz authored
      Shared memory segments can be created that are backed by hugetlb pages. 
      When this happens, the vmas associated with any mappings (shmat) are
      marked VM_HUGETLB, yet the vm_ops for such mappings are provided by
      ipc/shm (shm_vm_ops).  There is a mechanism to call the underlying hugetlb
      vm_ops, and this is done for most operations.  However, it is not done for
      open and close.
      
      This was not an issue until the introduction of the hugetlb vma_lock. 
      This lock structure is pointed to by vm_private_data and the open/close
      vm_ops help maintain this structure.  The special hugetlb routine called
      at fork took care of structure updates at fork time.  However,
      vma_splitting is not properly handled for ipc shared memory mappings
      backed by hugetlb pages.  This can result in a "kernel NULL pointer
      dereference" BUG or use after free as two vmas point to the same lock
      structure.
      
      Update the shm open and close routines to always call the underlying open
      and close routines.
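      
      The delegation pattern can be sketched in user-space C.  The structs
      and function names below are hypothetical; lock_refs merely stands in
      for the per-vma lock state that the underlying open and close handlers
      maintain.

```c
#include <assert.h>
#include <stddef.h>

struct vma_sketch;

struct vm_ops_sketch {
	void (*open)(struct vma_sketch *vma);
	void (*close)(struct vma_sketch *vma);
};

struct vma_sketch {
	const struct vm_ops_sketch *underlying;	/* ops of the backing file */
	int lock_refs;				/* stand-in for lock state */
};

/* Hypothetical underlying handlers that maintain per-vma state. */
void backing_open(struct vma_sketch *vma)
{
	vma->lock_refs++;
}

void backing_close(struct vma_sketch *vma)
{
	vma->lock_refs--;
}

const struct vm_ops_sketch backing_ops = {
	.open = backing_open,
	.close = backing_close,
};

/* Wrapper open: after the wrapper's own work, always call the underlying op. */
void shm_open_sketch(struct vma_sketch *vma)
{
	assert(vma != NULL);
	if (vma->underlying && vma->underlying->open)
		vma->underlying->open(vma);
}

/* Wrapper close: mirror of the open path. */
void shm_close_sketch(struct vma_sketch *vma)
{
	assert(vma != NULL);
	if (vma->underlying && vma->underlying->close)
		vma->underlying->close(vma);
}
```

      If the wrapper skipped these calls, the underlying state would never be
      updated when a mapping is opened or closed, which is the class of
      problem the commit describes.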
      
      Link: https://lkml.kernel.org/r/20221114210018.49346-1-mike.kravetz@oracle.com
      Fixes: 8d9bfb26 ("hugetlb: add vma based lock for pmd sharing")
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reported-by: Doug Nelson <doug.nelson@intel.com>
      Reported-by: <syzbot+83b4134621b7c326d950@syzkaller.appspotmail.com>
      Cc: Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com>
      Cc: "Eric W . Biederman" <ebiederm@xmission.com>
      Cc: Manfred Spraul <manfred@colorfullife.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • gcov: clang: fix the buffer overflow issue · a6f810ef
      Mukesh Ojha authored
      Currently, in the clang version of the gcov code, when a module is
      removed, gcov_info_add() incorrectly adds the sfn_ptr->counter to all the
      dst->functions, which results in the kernel panic shown in the crash
      report below.  Fix this by properly handling it.
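
      One plausible reading of "properly handling it" is that each source
      function's counters should only be added to the matching destination
      function.  The sketch below shows that pairing on plain arrays; struct
      fn_info and merge_paired are hypothetical names, and the bounds clamp is
      an assumption made for the illustration, not taken from the patch.

```c
#include <stddef.h>

/* Hypothetical, simplified per-function profile record. */
struct fn_info {
    size_t num_counters;
    unsigned long long *counters;
};

/*
 * Add the counters of src function f into dst function f only.
 * Adding every src function into every dst function would write past
 * the end of any dst counter array that is shorter than the src one.
 */
static void merge_paired(struct fn_info *dst, const struct fn_info *src,
                         size_t nfuncs)
{
    for (size_t f = 0; f < nfuncs; f++) {
        /* Clamp to the shorter array so the write stays in bounds. */
        size_t n = dst[f].num_counters < src[f].num_counters
                       ? dst[f].num_counters
                       : src[f].num_counters;

        for (size_t i = 0; i < n; i++)
            dst[f].counters[i] += src[f].counters[i];
    }
}
```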
      
      [    8.899094][  T599] Unable to handle kernel write to read-only memory at virtual address ffffff80461cc000
      [    8.899100][  T599] Mem abort info:
      [    8.899102][  T599]   ESR = 0x9600004f
      [    8.899103][  T599]   EC = 0x25: DABT (current EL), IL = 32 bits
      [    8.899105][  T599]   SET = 0, FnV = 0
      [    8.899107][  T599]   EA = 0, S1PTW = 0
      [    8.899108][  T599]   FSC = 0x0f: level 3 permission fault
      [    8.899110][  T599] Data abort info:
      [    8.899111][  T599]   ISV = 0, ISS = 0x0000004f
      [    8.899113][  T599]   CM = 0, WnR = 1
      [    8.899114][  T599] swapper pgtable: 4k pages, 39-bit VAs, pgdp=00000000ab8de000
      [    8.899116][  T599] [ffffff80461cc000] pgd=18000009ffcde003, p4d=18000009ffcde003, pud=18000009ffcde003, pmd=18000009ffcad003, pte=00600000c61cc787
      [    8.899124][  T599] Internal error: Oops: 9600004f [#1] PREEMPT SMP
      [    8.899265][  T599] Skip md ftrace buffer dump for: 0x1609e0
      ....
      ..,
      [    8.899544][  T599] CPU: 7 PID: 599 Comm: modprobe Tainted: G S         OE     5.15.41-android13-8-g38e9b1af6bce #1
      [    8.899547][  T599] Hardware name: XXX (DT)
      [    8.899549][  T599] pstate: 82400005 (Nzcv daif +PAN -UAO +TCO -DIT -SSBS BTYPE=--)
      [    8.899551][  T599] pc : gcov_info_add+0x9c/0xb8
      [    8.899557][  T599] lr : gcov_event+0x28c/0x6b8
      [    8.899559][  T599] sp : ffffffc00e733b00
      [    8.899560][  T599] x29: ffffffc00e733b00 x28: ffffffc00e733d30 x27: ffffffe8dc297470
      [    8.899563][  T599] x26: ffffffe8dc297000 x25: ffffffe8dc297000 x24: ffffffe8dc297000
      [    8.899566][  T599] x23: ffffffe8dc0a6200 x22: ffffff880f68bf20 x21: 0000000000000000
      [    8.899569][  T599] x20: ffffff880f68bf00 x19: ffffff8801babc00 x18: ffffffc00d7f9058
      [    8.899572][  T599] x17: 0000000000088793 x16: ffffff80461cbe00 x15: 9100052952800785
      [    8.899575][  T599] x14: 0000000000000200 x13: 0000000000000041 x12: 9100052952800785
      [    8.899577][  T599] x11: ffffffe8dc297000 x10: ffffffe8dc297000 x9 : ffffff80461cbc80
      [    8.899580][  T599] x8 : ffffff8801babe80 x7 : ffffffe8dc2ec000 x6 : ffffffe8dc2ed000
      [    8.899583][  T599] x5 : 000000008020001f x4 : fffffffe2006eae0 x3 : 000000008020001f
      [    8.899586][  T599] x2 : ffffff8027c49200 x1 : ffffff8801babc20 x0 : ffffff80461cb3a0
      [    8.899589][  T599] Call trace:
      [    8.899590][  T599]  gcov_info_add+0x9c/0xb8
      [    8.899592][  T599]  gcov_module_notifier+0xbc/0x120
      [    8.899595][  T599]  blocking_notifier_call_chain+0xa0/0x11c
      [    8.899598][  T599]  do_init_module+0x2a8/0x33c
      [    8.899600][  T599]  load_module+0x23cc/0x261c
      [    8.899602][  T599]  __arm64_sys_finit_module+0x158/0x194
      [    8.899604][  T599]  invoke_syscall+0x94/0x2bc
      [    8.899607][  T599]  el0_svc_common+0x1d8/0x34c
      [    8.899609][  T599]  do_el0_svc+0x40/0x54
      [    8.899611][  T599]  el0_svc+0x94/0x2f0
      [    8.899613][  T599]  el0t_64_sync_handler+0x88/0xec
      [    8.899615][  T599]  el0t_64_sync+0x1b4/0x1b8
      [    8.899618][  T599] Code: f905f56c f86e69ec f86e6a0f 8b0c01ec (f82e6a0c)
      [    8.899620][  T599] ---[ end trace ed5218e9e5b6e2e6 ]---
      
      Link: https://lkml.kernel.org/r/1668020497-13142-1-git-send-email-quic_mojha@quicinc.com
      Fixes: e178a5be ("gcov: clang support")
      Signed-off-by: Mukesh Ojha <quic_mojha@quicinc.com>
      Reviewed-by: Peter Oberparleiter <oberpar@linux.ibm.com>
      Tested-by: Peter Oberparleiter <oberpar@linux.ibm.com>
      Cc: Nathan Chancellor <nathan@kernel.org>
      Cc: Nick Desaulniers <ndesaulniers@google.com>
      Cc: Tom Rix <trix@redhat.com>
      Cc: <stable@vger.kernel.org>	[5.2+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      a6f810ef
    • Gautam Menghani's avatar
      mm/khugepaged: refactor mm_khugepaged_scan_file tracepoint to remove filename from function call · 045634ff
      Gautam Menghani authored
      Refactor the mm_khugepaged_scan_file tracepoint to move filename
      dereference to the tracepoint definition, to maintain consistency with
      other tracepoints[1].
      
      [1]:lore.kernel.org/lkml/20221024111621.3ba17e2c@gandalf.local.home/
      
      Link: https://lkml.kernel.org/r/20221026044524.54793-1-gautammenghani201@gmail.com
      Fixes: d41fd201 ("mm/khugepaged: add tracepoint to hpage_collapse_scan_file()")
      Signed-off-by: Gautam Menghani <gautammenghani201@gmail.com>
      Reviewed-by: Yang Shi <shy828301@gmail.com>
      Reviewed-by: Zach O'Keefe <zokeefe@google.com>
      Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Masami Hiramatsu (Google) <mhiramat@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      045634ff
    • Charan Teja Kalla's avatar
      mm/page_ext: fix kernel doc warning in page_ext_put() · ed86b748
      Charan Teja Kalla authored
      Fix the following kernel-doc warning reported with 'make W=1 mm/':
      mm/page_ext.c:178: warning: Function parameter or member 'page_ext' not
      described in 'page_ext_put'.
      
      [quic_pkondeti@quicinc.com: better patch title]
      Link: https://lkml.kernel.org/r/1667884582-2465-1-git-send-email-quic_charante@quicinc.com
      Fixes: b1d5488a ("mm: fix use-after free of page_ext after race with memory-offline")
      Signed-off-by: Charan Teja Kalla <quic_charante@quicinc.com>
      Reported-by: Vlastimil Babka <vbabka@suse.cz>
      Tested-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Pavan Kondeti <quic_pkondeti@quicinc.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      ed86b748
    • Yang Shi's avatar
      mm: khugepaged: allow page allocation fallback to eligible nodes · e031ff96
      Yang Shi authored
      Syzbot reported the below splat:
      
      WARNING: CPU: 1 PID: 3646 at include/linux/gfp.h:221 __alloc_pages_node include/linux/gfp.h:221 [inline]
      WARNING: CPU: 1 PID: 3646 at include/linux/gfp.h:221 hpage_collapse_alloc_page mm/khugepaged.c:807 [inline]
      WARNING: CPU: 1 PID: 3646 at include/linux/gfp.h:221 alloc_charge_hpage+0x802/0xaa0 mm/khugepaged.c:963
      Modules linked in:
      CPU: 1 PID: 3646 Comm: syz-executor210 Not tainted 6.1.0-rc1-syzkaller-00454-ga7038524 #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/11/2022
      RIP: 0010:__alloc_pages_node include/linux/gfp.h:221 [inline]
      RIP: 0010:hpage_collapse_alloc_page mm/khugepaged.c:807 [inline]
      RIP: 0010:alloc_charge_hpage+0x802/0xaa0 mm/khugepaged.c:963
      Code: e5 01 4c 89 ee e8 6e f9 ae ff 4d 85 ed 0f 84 28 fc ff ff e8 70 fc ae ff 48 8d 6b ff 4c 8d 63 07 e9 16 fc ff ff e8 5e fc ae ff <0f> 0b e9 96 fa ff ff 41 bc 1a 00 00 00 e9 86 fd ff ff e8 47 fc ae
      RSP: 0018:ffffc90003fdf7d8 EFLAGS: 00010293
      RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
      RDX: ffff888077f457c0 RSI: ffffffff81cd8f42 RDI: 0000000000000001
      RBP: ffff888079388c0c R08: 0000000000000001 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
      R13: dffffc0000000000 R14: 0000000000000000 R15: 0000000000000000
      FS:  00007f6b48ccf700(0000) GS:ffff8880b9b00000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00007f6b48a819f0 CR3: 00000000171e7000 CR4: 00000000003506e0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      Call Trace:
       <TASK>
       collapse_file+0x1ca/0x5780 mm/khugepaged.c:1715
       hpage_collapse_scan_file+0xd6c/0x17a0 mm/khugepaged.c:2156
       madvise_collapse+0x53a/0xb40 mm/khugepaged.c:2611
       madvise_vma_behavior+0xd0a/0x1cc0 mm/madvise.c:1066
       madvise_walk_vmas+0x1c7/0x2b0 mm/madvise.c:1240
       do_madvise.part.0+0x24a/0x340 mm/madvise.c:1419
       do_madvise mm/madvise.c:1432 [inline]
       __do_sys_madvise mm/madvise.c:1432 [inline]
       __se_sys_madvise mm/madvise.c:1430 [inline]
       __x64_sys_madvise+0x113/0x150 mm/madvise.c:1430
       do_syscall_x64 arch/x86/entry/common.c:50 [inline]
       do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
       entry_SYSCALL_64_after_hwframe+0x63/0xcd
      RIP: 0033:0x7f6b48a4eef9
      Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 b1 15 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48
      RSP: 002b:00007f6b48ccf318 EFLAGS: 00000246 ORIG_RAX: 000000000000001c
      RAX: ffffffffffffffda RBX: 00007f6b48af0048 RCX: 00007f6b48a4eef9
      RDX: 0000000000000019 RSI: 0000000000600003 RDI: 0000000020000000
      RBP: 00007f6b48af0040 R08: 0000000000000000 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000246 R12: 00007f6b48aa53a4
      R13: 00007f6b48bffcbf R14: 00007f6b48ccf400 R15: 0000000000022000
       </TASK>
      
      The khugepaged code picks the node with the most hits as the preferred
      node, and also tries to balance across nodes when several of them have
      the same hit record.  Conceptually it does the following:
          * If the target_node <= last_target_node, then iterate from
      last_target_node + 1 to MAX_NUMNODES (1024 on default config)
          * If the max_value == node_load[nid], then target_node = nid
      
      But there is a corner case, particularly for MADV_COLLAPSE, in which a
      non-existent node may be returned as the preferred node.
      
      Assume the system has 2 nodes, the target_node is 0 and the
      last_target_node is 1.  If the MADV_COLLAPSE path is hit, the max_value
      may be 0, so the code may return 2 for target_node even though that node
      does not exist (it is offline), and the warning is triggered.
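
      The corner case can be reproduced with a small user-space model of the
      two steps listed above.  This is a sketch under assumptions, not the
      kernel function: MAX_NODES is a small stand-in for MAX_NUMNODES,
      find_target_node_old is a hypothetical name, and stopping at the first
      equal node is assumed.

```c
#define MAX_NODES 8 /* small stand-in for MAX_NUMNODES */

/*
 * Model of the selection described above: pick the node with the most
 * hits, then rotate to a later node with the same hit count.
 * Nothing here checks whether the chosen node actually exists.
 */
static int find_target_node_old(const unsigned int *node_load,
                                int last_target_node)
{
    int nid, target_node = 0;
    unsigned int max_value = 0;

    /* Find the first node with the highest hit count. */
    for (nid = 0; nid < MAX_NODES; nid++) {
        if (node_load[nid] > max_value) {
            max_value = node_load[nid];
            target_node = nid;
        }
    }

    /* Balance: look past the previous target for an equal hit count. */
    if (target_node <= last_target_node) {
        for (nid = last_target_node + 1; nid < MAX_NODES; nid++) {
            if (node_load[nid] == max_value) {
                target_node = nid;
                break; /* assumed: first match wins */
            }
        }
    }

    return target_node;
}
```

      With an all-zero node_load and last_target_node equal to 1, the model
      returns 2, a node that a 2-node system does not have.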
      
      The node balance was introduced by commit 9f1b868a ("mm: thp:
      khugepaged: add policy for finding target node") to satisfy
      "numactl --interleave=all".  But interleaving is a mere hint rather than
      something that has hard requirements.
      
      So use a nodemask to record the nodes that have the same hit record, so
      that the hugepage allocation can fall back to those nodes.  Also remove
      __GFP_THISNODE, since it disallows fallback.  If the nodemask has just
      one node set, meaning a single node has the highest hit record, the
      nodemask approach behaves like __GFP_THISNODE.
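
      A minimal sketch of the nodemask idea, using a plain bitmask in user
      space.  The names (find_target_and_fallbacks, online_mask, alloc_mask)
      and the restriction to online nodes are assumptions made for this
      illustration; MAX_NODES is a small stand-in for MAX_NUMNODES.

```c
#define MAX_NODES 8 /* small stand-in for MAX_NUMNODES */

/*
 * Return the first node with the highest hit count and record in
 * *alloc_mask every online node that shares that hit count, so the
 * allocation has a set of acceptable fallback nodes.
 */
static int find_target_and_fallbacks(const unsigned int *node_load,
                                     unsigned long online_mask,
                                     unsigned long *alloc_mask)
{
    int nid, target_node = 0;
    unsigned int max_value = 0;

    for (nid = 0; nid < MAX_NODES; nid++) {
        if (node_load[nid] > max_value) {
            max_value = node_load[nid];
            target_node = nid;
        }
    }

    *alloc_mask = 0;
    for (nid = 0; nid < MAX_NODES; nid++) {
        /* Only nodes that exist may be offered as fallbacks. */
        if (((online_mask >> nid) & 1UL) && node_load[nid] == max_value)
            *alloc_mask |= 1UL << nid;
    }

    return target_node;
}
```

      When one node clearly has the most hits, the mask has a single bit set,
      which matches the single-node behaviour described above.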
      
      Link: https://lkml.kernel.org/r/20221108184357.55614-2-shy828301@gmail.com
      Fixes: 7d8faaf1 ("mm/madvise: introduce MADV_COLLAPSE sync hugepage collapse")
      Signed-off-by: Yang Shi <shy828301@gmail.com>
      Suggested-by: Zach O'Keefe <zokeefe@google.com>
      Suggested-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Zach O'Keefe <zokeefe@google.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Reported-by: <syzbot+0044b22d177870ee974f@syzkaller.appspotmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      e031ff96
    • Johannes Weiner's avatar
      mm: vmscan: fix extreme overreclaim and swap floods · f53af428
      Johannes Weiner authored
      During proactive reclaim, we sometimes observe severe overreclaim, with
      several thousand times more pages reclaimed than requested.
      
      This trace was obtained from shrink_lruvec() during such an instance:
      
          prio:0 anon_cost:1141521 file_cost:7767
          nr_reclaimed:4387406 nr_to_reclaim:1047 (or_factor:4190)
          nr=[7161123 345 578 1111]
      
      While the reclaimer requested 4M, vmscan reclaimed close to 16G, most of it
      by swapping.  These requests take over a minute, during which the write()
      to memory.reclaim is unkillably stuck inside the kernel.
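
      The figures in the trace can be checked with a few lines of arithmetic.
      The sketch below assumes 4 KiB pages, which the trace does not state but
      which is consistent with the 4M and 16G quoted above; the helper names
      are hypothetical.

```c
/* Assumed page size; the trace itself only reports page counts. */
#define PAGE_BYTES 4096ULL

/* Convert a page count to bytes under the assumed page size. */
static unsigned long long pages_to_bytes(unsigned long long pages)
{
    return pages * PAGE_BYTES;
}

/* Integer ratio of pages reclaimed to pages requested. */
static unsigned long long overreclaim_factor(unsigned long long reclaimed,
                                             unsigned long long requested)
{
    return reclaimed / requested;
}
```

      For nr_reclaimed:4387406 and nr_to_reclaim:1047, the ratio is 4190
      (the or_factor in the trace), the request is a little over 4 MiB and
      the amount reclaimed is a little under 17 GiB.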
      
      Digging into the source, this is caused by the proportional reclaim
      bailout logic.  This code tries to resolve a fundamental conflict: to
      reclaim roughly what was requested, while also aging all LRUs fairly and
      in accordance with their size, swappiness, refault rates etc.  The way it
      attempts fairness is that once the reclaim goal has been reached, it stops
      scanning the LRUs with the smaller remaining scan targets, and adjusts the
      remainder of the bigger LRUs according to how much of the smaller LRUs was
      scanned.  It then finishes scanning that remainder regardless of the
      reclaim goal.
      
      This works fine if priority levels are low and the LRU lists are
      comparable in size.  However, in this instance, the cgroup that is
      targeted by proactive reclaim has almost no files left - they've already
      been squeezed out by proactive reclaim earlier - and the remaining anon
      pages are hot.  Anon rotations cause the priority level to drop to 0,
      which results in reclaim targeting all of anon (a lot) and all of file
      (almost nothing).  By the time reclaim decides to bail, it has scanned
      most or all of the file target, and therefore must also scan most or all of
      the enormous anon target.  This target is thousands of times larger than
      the reclaim goal, thus causing the overreclaim.
      
      The bailout code hasn't changed in years, why is this failing now?  The
      most likely explanations are two other recent changes in anon reclaim:
      
      1. Before the series starting with commit 5df74196 ("mm: fix LRU
         balancing effect of new transparent huge pages"), the VM was
         overall relatively reluctant to swap at all, even if swap was
         configured. This means the LRU balancing code didn't come into play
         as often as it does now, and mostly in high pressure situations
         where pronounced swap activity wouldn't be as surprising.
      
      2. For historic reasons, shrink_lruvec() loops on the scan targets of
         all LRU lists except the active anon one, meaning it would bail if
         the only remaining pages to scan were active anon - even if there
         were a lot of them.
      
         Before the series starting with commit ccc5dc67 ("mm/vmscan:
         make active/inactive ratio as 1:1 for anon lru"), most anon pages
         would live on the active LRU; the inactive one would contain only a
         handful of preselected reclaim candidates. After the series, anon
         gets aged similarly to file, and the inactive list is the default
         for new anon pages as well, making it often the much bigger list.
      
         As a result, the VM is now more likely to actually finish large
         anon targets than before.
      
      Change the code such that only one SWAP_CLUSTER_MAX-sized nudge toward the
      larger LRU lists is made before bailing out on a met reclaim goal.
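
      The difference between finishing the rescaled remainder and making a
      single nudge can be seen in a toy model.  This is a heavily simplified
      sketch and not the vmscan code: it assumes two lists only, that every
      scanned page is reclaimed, and a batch size of 32 standing in for
      SWAP_CLUSTER_MAX.  The names model_reclaim, take and single_nudge are
      hypothetical.

```c
#include <stdbool.h>

#define BATCH 32ULL /* assumed stand-in for SWAP_CLUSTER_MAX */

/* Scan up to one batch from a list and return how much was taken. */
static unsigned long long take(unsigned long long *remaining)
{
    unsigned long long n = *remaining < BATCH ? *remaining : BATCH;

    *remaining -= n;
    return n;
}

/*
 * Toy model: scan both lists in batches until the goal is met, then
 * stop the smaller list and rescale the larger one to the same
 * fraction.  With single_nudge the loop bails after one more batch;
 * without it the rescaled remainder is scanned to the end.
 */
static unsigned long long model_reclaim(unsigned long long anon_target,
                                        unsigned long long file_target,
                                        unsigned long long goal,
                                        bool single_nudge)
{
    unsigned long long anon = anon_target, file = file_target;
    unsigned long long reclaimed = 0;
    bool adjusted = false;

    while (anon || file) {
        reclaimed += take(&anon);
        reclaimed += take(&file);

        if (reclaimed < goal)
            continue;
        if (adjusted) {
            if (single_nudge)
                break;  /* one extra batch was scanned: bail out */
            continue;   /* otherwise finish the rescaled remainder */
        }
        if (!anon || !file)
            break;

        /* Percentage of the smaller list's target still unscanned. */
        bool file_smaller = file < anon;
        unsigned long long small_left = file_smaller ? file : anon;
        unsigned long long small_target = file_smaller ? file_target : anon_target;
        unsigned long long large_left = file_smaller ? anon : file;
        unsigned long long large_target = file_smaller ? anon_target : file_target;
        unsigned long long pct_left = small_left * 100 / (small_target + 1);

        /* Scale the larger list to the fraction already done on the smaller. */
        unsigned long long scanned = large_target - large_left;
        unsigned long long want = large_target * (100 - pct_left) / 100;
        unsigned long long new_left = want > scanned ? want - scanned : 0;

        anon = file_smaller ? new_left : 0;
        file = file_smaller ? 0 : new_left;
        adjusted = true;
    }

    return reclaimed;
}
```

      With a very large anon target, a small file target and a goal of about
      a thousand pages, the model without single_nudge reclaims thousands of
      times the goal, while the single_nudge variant stops within a few
      batches of it.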
      
      This fixes the extreme overreclaim problem.
      
      Fairness is more subtle and harder to evaluate.  No obvious misbehavior
      was observed on the test workload, in any case.  Conceptually, fairness
      should primarily be a cumulative effect from regular, lower priority
      scans.  Once the VM is in trouble and needs to escalate scan targets to
      make forward progress, fairness needs to take a backseat.  This is also
      acknowledged by the myriad exceptions in get_scan_count().  This patch
      makes fairness decrease gradually, as it keeps fairness work static over
      increasing priority levels with growing scan targets.  This should make
      more sense - although we may have to re-visit the exact values.
      
      Link: https://lkml.kernel.org/r/20220802162811.39216-1-hannes@cmpxchg.org
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Rik van Riel <riel@surriel.com>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      f53af428
  3. 08 Nov, 2022 15 commits