1. 18 Jul, 2024 7 commits
    • mm/mglru: fix overshooting shrinker memory · 3f74e6bd
      Yu Zhao authored
      set_initial_priority() tries to jump-start global reclaim by estimating
      the priority based on cold/hot LRU pages.  The estimation does not account
      for shrinker objects, and it cannot, because their sizes are in units
      other than pages.
      
      If shrinker objects are the majority, e.g., on TrueNAS SCALE 24.04.0 where
      the ZFS ARC can use almost all system memory, set_initial_priority() can
      vastly underestimate how much memory the ARC shrinker can evict, assigning
      extremely low values to scan_control->priority and causing overshoots of
      shrinker objects.
      
      To reproduce the problem, use TrueNAS SCALE 24.04.0 with 32GB of DRAM, a
      test ZFS pool, and the following commands:
      
        fio --name=mglru.file --numjobs=36 --ioengine=io_uring \
            --directory=/root/test-zfs-pool/ --size=1024m --buffered=1 \
            --rw=randread --random_distribution=random \
            --time_based --runtime=1h &
      
        for ((i = 0; i < 20; i++))
        do
          sleep 120
          fio --name=mglru.anon --numjobs=16 --ioengine=mmap \
            --filename=/dev/zero --size=1024m --fadvise_hint=0 \
            --rw=randrw --random_distribution=random \
            --time_based --runtime=1m
        done
      
      To fix the problem:
      1. Cap scan_control->priority at or above DEF_PRIORITY/2, to prevent
         the jump-start from being overly aggressive.
      2. Account for the progress from mm_account_reclaimed_pages(), to
         prevent kswapd_shrink_node() from raising the priority
         unnecessarily.
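
      A minimal userspace sketch of step 1 (DEF_PRIORITY is 12 in the kernel;
      the function name below is hypothetical, not the actual kernel code):

```c
#include <assert.h>

#define DEF_PRIORITY 12

/* Hypothetical sketch of the cap: the jump-started priority is clamped
 * so it never drops below DEF_PRIORITY/2, keeping the estimated scan
 * size from becoming overly aggressive. */
int cap_initial_priority(int estimated)
{
    return estimated < DEF_PRIORITY / 2 ? DEF_PRIORITY / 2 : estimated;
}
```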
      
      Link: https://lkml.kernel.org/r/20240711191957.939105-2-yuzhao@google.com
      Fixes: e4dde56c ("mm: multi-gen LRU: per-node lru_gen_folio lists")
      Signed-off-by: Yu Zhao <yuzhao@google.com>
      Reported-by: Alexander Motin <mav@ixsystems.com>
      Cc: Wei Xu <weixugc@google.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/mglru: fix div-by-zero in vmpressure_calc_level() · 8b671fe1
      Yu Zhao authored
      evict_folios() uses a second pass to reclaim folios that have gone through
      page writeback and become clean before it finishes the first pass, since
      folio_rotate_reclaimable() cannot handle those folios due to the
      isolation.
      
      The second pass tries to avoid potential double counting by deducting
      scan_control->nr_scanned.  However, this can underflow nr_scanned when
      shrink_folio_list() does not increment it, i.e., when folio_trylock()
      fails.
      
      The underflow can cause the divisor, i.e., scale=scanned+reclaimed in
      vmpressure_calc_level(), to become zero, resulting in the following crash:
      
        [exception RIP: vmpressure_work_fn+101]
        process_one_work at ffffffffa3313f2b
      
      Since scan_control->nr_scanned has no established semantics, the potential
      double counting has minimal risks.  Therefore, fix the problem by not
      deducting scan_control->nr_scanned in evict_folios().
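
      The arithmetic hazard can be reproduced in a few lines of plain C (the
      names below only mirror the kernel's; this is not the kernel code):

```c
#include <assert.h>
#include <limits.h>

/* sc->nr_scanned is unsigned, so deducting folios that were never
 * counted (e.g. because folio_trylock() failed) wraps around instead
 * of going negative. */
unsigned long deduct_scanned(unsigned long nr_scanned, unsigned long retried)
{
    return nr_scanned - retried; /* wraps if retried > nr_scanned */
}
```

      Adding reclaimed=1 to the wrapped value then overflows back to zero,
      which is exactly how the divisor scale=scanned+reclaimed vanished.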
      
      Link: https://lkml.kernel.org/r/20240711191957.939105-1-yuzhao@google.com
      Fixes: 359a5e14 ("mm: multi-gen LRU: retry folios written back while isolated")
      Reported-by: Wei Xu <weixugc@google.com>
      Signed-off-by: Yu Zhao <yuzhao@google.com>
      Cc: Alexander Motin <mav@ixsystems.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/kmemleak: replace strncpy() with strscpy() · 0b847801
      Kees Cook authored
      Replace the deprecated[1] strncpy() calls with strscpy().  Uses of
      object->comm do not depend on strncpy()'s zero-padding side effect.
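
      strscpy() always NUL-terminates, returns the copied length or -E2BIG on
      truncation, and, unlike strncpy(), does not zero-pad the remainder.  A
      userspace analogue of that contract (my_strscpy is a hypothetical name;
      the kernel's implementation differs):

```c
#include <errno.h>
#include <string.h>
#include <sys/types.h>

/* Userspace sketch of the kernel strscpy() contract: guaranteed
 * NUL-termination, length or -E2BIG as the return value, no padding. */
ssize_t my_strscpy(char *dst, const char *src, size_t size)
{
    size_t len;

    if (size == 0)
        return -E2BIG;
    len = strnlen(src, size);
    if (len == size) {              /* src does not fit: truncate */
        memcpy(dst, src, size - 1);
        dst[size - 1] = '\0';
        return -E2BIG;
    }
    memcpy(dst, src, len + 1);      /* includes the NUL; no zero-padding */
    return (ssize_t)len;
}
```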
      
      Link: https://github.com/KSPP/linux/issues/90 [1]
      Link: https://lkml.kernel.org/r/20240710001300.work.004-kees@kernel.org
      Signed-off-by: Kees Cook <kees@kernel.org>
      Acked-by: Catalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm, page_alloc: put should_fail_alloc_page() back behind CONFIG_FAIL_PAGE_ALLOC · 53dabce2
      Vlastimil Babka authored
      This mostly reverts commit af3b8544 ("mm/page_alloc.c: allow error
      injection").  The commit made should_fail_alloc_page() a noinline function
      that's always called from the page allocation hotpath, even if it's empty
      because CONFIG_FAIL_PAGE_ALLOC is not enabled, and there is no option to
      disable it and prevent the associated function call overhead.
      
      As with the preceding patch "mm, slab: put should_failslab back behind
      CONFIG_SHOULD_FAILSLAB" and for the same reasons, put the
      should_fail_alloc_page() back behind the config option.  When enabled, the
      ALLOW_ERROR_INJECTION and BTF_ID records are preserved so it's not a
      complete revert.
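
      The pattern being restored can be sketched as follows (a userspace
      approximation; the kernel's signature takes gfp_t and order):

```c
#include <stdbool.h>

/* When the config option is off, the check collapses to a static
 * inline stub that the compiler removes, so the page allocation
 * hotpath pays no function-call overhead at all. */
#ifdef CONFIG_FAIL_PAGE_ALLOC
bool should_fail_alloc_page(unsigned int order); /* real fault injection */
#else
static inline bool should_fail_alloc_page(unsigned int order)
{
    (void)order;
    return false; /* compiles away entirely */
}
#endif
```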
      
      Link: https://lkml.kernel.org/r/20240711-b4-fault-injection-reverts-v1-2-9e2651945d68@suse.cz
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Akinobu Mita <akinobu.mita@gmail.com>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Cc: Andrii Nakryiko <andrii@kernel.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Eduard Zingerman <eddyz87@gmail.com>
      Cc: Hao Luo <haoluo@google.com>
      Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
      Cc: Jiri Olsa <jolsa@kernel.org>
      Cc: John Fastabend <john.fastabend@gmail.com>
      Cc: KP Singh <kpsingh@kernel.org>
      Cc: Martin KaFai Lau <martin.lau@linux.dev>
      Cc: Mateusz Guzik <mjguzik@gmail.com>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Song Liu <song@kernel.org>
      Cc: Stanislav Fomichev <sdf@fomichev.me>
      Cc: Yonghong Song <yonghong.song@linux.dev>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm, slab: put should_failslab() back behind CONFIG_SHOULD_FAILSLAB · a7526fe8
      Vlastimil Babka authored
      Patch series "revert unconditional slab and page allocator fault injection
      calls".
      
      These two patches largely revert commits that added function call overhead
      into slab and page allocation hotpaths and that cannot be currently
      disabled even though related CONFIG_ options do exist.
      
      A much more involved solution that keeps the callsites always present but
      hidden behind a static key when unused is possible [1] and can be pursued
      by anyone who believes it's necessary.  Meanwhile, the fact that
      should_failslab() error injection is already non-functional on kernels
      built with current gcc without anyone noticing [2], and the lukewarm
      response to [1], suggest the need is not there.  I believe it is fairer to
      have the state after this series as the baseline for possible further
      optimisation, instead of the unconditional overhead.
      
      For example a possible compromise for anyone who's fine with an empty
      function call overhead but not the full CONFIG_FAILSLAB /
      CONFIG_FAIL_PAGE_ALLOC overhead is to reuse patch 1 from [1] but insert a
      static key check only inside should_failslab() and
      should_fail_alloc_page() before performing the more expensive checks.
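
      That compromise could look roughly like the following sketch, with a
      plain bool standing in for the kernel's static_branch_unlikely() (all
      names below are hypothetical):

```c
#include <stdbool.h>

static bool failslab_active; /* stand-in for a static key; default off */

/* The call stays on the hotpath, but a near-free early-out guards the
 * expensive CONFIG_FAILSLAB bookkeeping; with a real static key the
 * branch is patched out of the instruction stream entirely. */
bool should_failslab_sketch(void)
{
    if (!failslab_active)
        return false;
    /* the expensive fault-injection checks would run only here */
    return true;
}
```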
      
      [1] https://lore.kernel.org/all/20240620-fault-injection-statickeys-v2-0-e23947d3d84b@suse.cz/#t
      [2] https://github.com/bpftrace/bpftrace/issues/3258
      
      
      This patch (of 2):
      
      This mostly reverts commit 4f6923fb ("mm: make should_failslab always
      available for fault injection").  The commit made should_failslab() a
      noinline function that's always called from the slab allocation hotpath,
      even if it's empty because CONFIG_SHOULD_FAILSLAB is not enabled, and
      there is no option to disable that call.  This is visible in profiles and
      the function call overhead can be noticeable especially with cpu
      mitigations.
      
      Meanwhile the bpftrace program example in the commit silently does not
      work without CONFIG_SHOULD_FAILSLAB anyway with a recent gcc, because the
      empty function gets a .constprop clone that is actually being called
      (uselessly) from the slab hotpath, while the error injection is hooked to
      the original function that's not being called at all [1].
      
      Thus put the whole should_failslab() function back behind
      CONFIG_SHOULD_FAILSLAB.  It's not a complete revert of 4f6923fb - the
      int return type that returns -ENOMEM on failure is preserved, as well as
      the ALLOW_ERROR_INJECTION annotation.  The BTF_ID() record that was
      meanwhile added is also guarded by CONFIG_SHOULD_FAILSLAB.
      
      [1] https://github.com/bpftrace/bpftrace/issues/3258
      
      Link: https://lkml.kernel.org/r/20240711-b4-fault-injection-reverts-v1-0-9e2651945d68@suse.cz
      Link: https://lkml.kernel.org/r/20240711-b4-fault-injection-reverts-v1-1-9e2651945d68@suse.cz
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Akinobu Mita <akinobu.mita@gmail.com>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Cc: Andrii Nakryiko <andrii@kernel.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Eduard Zingerman <eddyz87@gmail.com>
      Cc: Hao Luo <haoluo@google.com>
      Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
      Cc: Jiri Olsa <jolsa@kernel.org>
      Cc: John Fastabend <john.fastabend@gmail.com>
      Cc: KP Singh <kpsingh@kernel.org>
      Cc: Martin KaFai Lau <martin.lau@linux.dev>
      Cc: Mateusz Guzik <mjguzik@gmail.com>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Song Liu <song@kernel.org>
      Cc: Stanislav Fomichev <sdf@fomichev.me>
      Cc: Yonghong Song <yonghong.song@linux.dev>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: ignore data-race in __swap_writepage · 7b7aca6d
      Pei Li authored
      Syzbot reported a possible data race:
      
      BUG: KCSAN: data-race in __swap_writepage / scan_swap_map_slots
      
      read-write to 0xffff888102fca610 of 8 bytes by task 7106 on cpu 1.
      read to 0xffff888102fca610 of 8 bytes by task 7080 on cpu 0.
      
      While __swap_writepage() is reading sis->flags, scan_swap_map_slots()
      is trying to update it with SWP_SCANNING.
      
      value changed: 0x0000000000008083 -> 0x0000000000004083.
      
      While the field can be updated non-atomically, the update does not affect
      SWP_SYNCHRONOUS_IO, so this data race is considered safe.
      
      The race was likely introduced by commit 3222d8c2 ("block: remove
      ->rw_page"), which added this if branch.
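
      The resulting annotation pattern can be sketched in userspace, with a
      no-op macro standing in for the kernel's data_race() and a made-up flag
      value (both are hypothetical here):

```c
#define data_race(expr) (expr) /* kernel version also tells KCSAN the race is intended */

#define SWP_SYNCHRONOUS_IO 0x4000UL /* hypothetical bit value for this sketch */

static unsigned long sis_flags = SWP_SYNCHRONOUS_IO;

/* Concurrent writers may flip other bits (e.g. SWP_SCANNING), but the
 * bit tested here is stable, so the lockless read is benign and is
 * annotated rather than made atomic. */
int swap_synchronous_io(void)
{
    return (data_race(sis_flags) & SWP_SYNCHRONOUS_IO) != 0;
}
```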
      
      Link: https://lkml.kernel.org/r/20240711-bug13-v1-1-cea2b8ae8d76@gmail.com
      Fixes: 3222d8c2 ("block: remove ->rw_page")
      Signed-off-by: Pei Li <peili.dev@gmail.com>
      Reported-by: syzbot+da25887cc13da6bf3b8c@syzkaller.appspotmail.com
      Closes: https://syzkaller.appspot.com/bug?extid=da25887cc13da6bf3b8c
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Shuah Khan <skhan@linuxfoundation.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • hugetlbfs: ensure generic_hugetlb_get_unmapped_area() returns higher address than mmap_min_addr · dffe24e9
      Donet Tom authored
      generic_hugetlb_get_unmapped_area() could return an address below
      mmap_min_addr if the mmap() addr argument, after alignment, fell below
      mmap_min_addr, causing mmap() to fail.

      This is because the current generic_hugetlb_get_unmapped_area() code does
      not take mmap_min_addr into account.
      
      This patch ensures that generic_hugetlb_get_unmapped_area() always returns
      an address greater than mmap_min_addr.  Additionally, similar to
      generic_get_unmapped_area(), vm_end_gap() checks are included to maintain
      the stack gap.
      
      How to reproduce
      ================
      
       #include <stdio.h>
       #include <stdlib.h>
       #include <sys/mman.h>
       #include <unistd.h>
      
       #define HUGEPAGE_SIZE (16 * 1024 * 1024)
      
       int main() {
      
          void *addr = mmap((void *)-1, HUGEPAGE_SIZE,
                       PROT_READ | PROT_WRITE,
                       MAP_SHARED | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
          if (addr == MAP_FAILED) {
              perror("mmap");
              exit(EXIT_FAILURE);
          }
      
          snprintf((char *)addr, HUGEPAGE_SIZE, "Hello, Huge Pages!");
      
          printf("%s\n", (char *)addr);
      
          if (munmap(addr, HUGEPAGE_SIZE) == -1) {
              perror("munmap");
              exit(EXIT_FAILURE);
          }
      
          return 0;
       }
      
      Result without fix
      ==================
       # cat /proc/meminfo |grep -i HugePages_Free
       HugePages_Free:       20
       # ./test
       mmap: Permission denied
       #
      
      Result with fix
      ===============
       # cat /proc/meminfo |grep -i HugePages_Free
       HugePages_Free:       20
       # ./test
       Hello, Huge Pages!
       #
      
      Link: https://lkml.kernel.org/r/20240710051912.4681-1-donettom@linux.ibm.com
      Signed-off-by: Donet Tom <donettom@linux.ibm.com>
      Reported-by: Pavithra Prakash <pavrampu@linux.vnet.ibm.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Aneesh Kumar K.V <aneesh.kumar@kernel.org>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
      Cc: Tony Battersby <tonyb@cybernetics.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  2. 12 Jul, 2024 33 commits