1. 26 Jan, 2024 15 commits
    • Audra Mitchell's avatar
      selftests/mm: Update va_high_addr_switch.sh to check CPU for la57 flag · 52e63d67
      Audra Mitchell authored
      In order for the page table level 5 to be in use, the CPU must have the
      setting enabled in addition to the CONFIG option. Check for the flag to be
      set to avoid false test failures on systems that do not have this cpu flag
      set.
      
      The test does a series of mmap calls including three using the
      MAP_FIXED flag and specifying an address that is 1<<47 or 1<<48.  These
      addresses are only available if you are using level 5 page tables,
      which requires both the CPU to have the capability (la57 flag) and the
      kernel to be configured.  Currently the test only checks for the kernel
      configuration option, so this test can still report a false positive. 
      Here are the three failing lines:
      
      $ ./va_high_addr_switch | grep FAILED
      mmap(ADDR_SWITCH_HINT, 2 * PAGE_SIZE, MAP_FIXED): 0xffffffffffffffff - FAILED
      mmap(HIGH_ADDR, MAP_FIXED): 0xffffffffffffffff - FAILED
      mmap(ADDR_SWITCH_HINT, 2 * PAGE_SIZE, MAP_FIXED): 0xffffffffffffffff - FAILED
      
      I thought (for about a second) about refactoring the test so that these three
      mmap calls will only be run on systems with the level 5 page tables
      available, but the whole point of the test is to check the level 5
      feature...
      
      Link: https://lkml.kernel.org/r/20240119205801.62769-1-audra@redhat.com
      Fixes: 4f2930c6 ("selftests/vm: only run 128TBswitch with 5-level paging")
      Signed-off-by: Audra Mitchell <audra@redhat.com>
      Cc: Rafael Aquini <raquini@redhat.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Adam Sindelar <adam@wowsignal.io>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      52e63d67
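The two-part requirement described in the commit above (kernel option plus the la57 CPU flag) can be sketched in C as follows. This is a minimal illustration, not the code of va_high_addr_switch.sh: the helper names `cpu_flags_contain` and `can_run_5level_test` are hypothetical, and the input is assumed to be text shaped like the "flags" line of /proc/cpuinfo.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Return true if `flag` appears as a whole, space-delimited word in `flags`,
 * a string shaped like the "flags" line of /proc/cpuinfo.
 * (Hypothetical helper, for illustration only.)
 */
static bool cpu_flags_contain(const char *flags, const char *flag)
{
	size_t len = strlen(flag);
	const char *p = flags;

	assert(len > 0);
	while ((p = strstr(p, flag)) != NULL) {
		bool starts_word = (p == flags) || p[-1] == ' ' || p[-1] == '\t';
		bool ends_word = p[len] == '\0' || p[len] == ' ' || p[len] == '\n';

		if (starts_word && ends_word)
			return true;
		p += len;	/* partial match, keep searching */
	}
	return false;
}

/*
 * The 5-level paging test is only meaningful when both conditions hold:
 * the kernel was built with the option and the CPU advertises la57.
 */
static bool can_run_5level_test(bool kernel_config_enabled, const char *flags)
{
	return kernel_config_enabled && cpu_flags_contain(flags, "la57");
}
```

Matching on whole words matters here: a plain substring search would also accept a longer flag name that merely contains "la57".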
    • Nico Pache's avatar
      selftests: mm: fix map_hugetlb failure on 64K page size systems · 91b80cc5
      Nico Pache authored
      On systems with 64k page size and 512M huge page sizes, the allocation and
      test succeeds but errors out at the munmap.  As the comment states, munmap
      will fail if it's not HUGEPAGE aligned.  This is due to the length of
      the mapping being 1/2 the size of the hugepage causing the munmap to not
      be hugepage aligned.  Fix this by making the mapping length the full
      hugepage if the hugepage is larger than the length of the mapping.
      
      Link: https://lkml.kernel.org/r/20240119131429.172448-1-npache@redhat.com
      Signed-off-by: Nico Pache <npache@redhat.com>
      Cc: Donet Tom <donettom@linux.vnet.ibm.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Christophe Leroy <christophe.leroy@c-s.fr>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      91b80cc5
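The length adjustment described in the commit above can be sketched as a small C helper. This is an illustration under stated assumptions, not the selftest's actual code: the function name `hugetlb_mapping_length` is hypothetical, and the 256 MB request is inferred from the message's "1/2 the size" of a 512M hugepage.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Pick a mapping length that is at least one full hugepage, so that a
 * later munmap() of the same length is hugepage aligned.
 * (Hypothetical helper, for illustration only.)
 */
static size_t hugetlb_mapping_length(size_t requested, size_t hugepage_size)
{
	assert(hugepage_size > 0);
	/* A mapping shorter than one hugepage is grown to a full hugepage. */
	if (hugepage_size > requested)
		return hugepage_size;
	return requested;
}
```

With a 512 MB hugepage, a 256 MB request becomes 512 MB; with a 2 MB hugepage the same request is left unchanged.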
    • Yosry Ahmed's avatar
      MAINTAINERS: supplement of zswap maintainers update · 0fe8ff51
      Yosry Ahmed authored
      As discussed on the mailing list [1], merge the zpool maintainers entry
      into the zswap one.  Also, add CREDITS entries for previous zswap/zpool
      maintainers.
      
      [1] https://lore.kernel.org/linux-mm/CAJD7tkYx4YWhGoVwnSeGc8dY_1aRRxxg8PzWBV==A6iqG_OgFw@mail.gmail.com/
      
      Link: https://lkml.kernel.org/r/20240117182152.1439822-1-yosryahmed@google.com
      Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
      Acked-by: Nhat Pham <nphamcs@gmail.com>
      Acked-by: Dan Streetman <ddstreet@ieee.org>
      Acked-by: Seth Jennings <sjenning@redhat.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vitaly Wool <vitaly.wool@konsulko.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      0fe8ff51
    • Marco Elver's avatar
      stackdepot: make fast paths lock-less again · 4434a56e
      Marco Elver authored
      With the introduction of the pool_rwlock (reader-writer lock), several
      fast paths end up taking the pool_rwlock as readers.  Furthermore,
      stack_depot_put() unconditionally takes the pool_rwlock as a writer.
      
      Despite allowing readers to make forward-progress concurrently,
      reader-writer locks have inherent cache contention issues, which do not
      scale well on systems with large CPU counts.
      
      Rework the synchronization story of stack depot to again avoid taking any
      locks in the fast paths.  This is done by relying on RCU-protected list
      traversal, and the NMI-safe subset of RCU to delay reuse of freed stack
      records.  See code comments for more details.
      
      Along with the performance issues, this also fixes incorrect nesting of
      rwlock within a raw_spinlock, given that stack depot should still be
      usable from anywhere:
      
       | [ BUG: Invalid wait context ]
       | -----------------------------
       | swapper/0/1 is trying to lock:
       | ffffffff89869be8 (pool_rwlock){..--}-{3:3}, at: stack_depot_save_flags
       | other info that might help us debug this:
       | context-{5:5}
       | 2 locks held by swapper/0/1:
       |  #0: ffffffff89632440 (rcu_read_lock){....}-{1:3}, at: __queue_work
       |  #1: ffff888100092018 (&pool->lock){-.-.}-{2:2}, at: __queue_work  <-- raw_spin_lock
      
      Stack depot usage stats are similar to the previous version after a KASAN
      kernel boot:
      
       $ cat /sys/kernel/debug/stackdepot/stats
       pools: 838
       allocations: 29865
       frees: 6604
       in_use: 23261
       freelist_size: 1879
      
      The number of pools is the same as previously.  The freelist size is
      minimally larger, but this may also be due to variance across system
      boots.  This shows that even though we do not eagerly wait for the next
      RCU grace period (such as with synchronize_rcu() or call_rcu()) after
      freeing a stack record - requiring depot_pop_free() to "poll" if an entry
      may be used - new allocations are very likely to happen in later RCU grace
      periods.
      
      Link: https://lkml.kernel.org/r/20240118110216.2539519-2-elver@google.com
      Fixes: 108be8de ("lib/stackdepot: allow users to evict stack traces")
      Reported-by: Andi Kleen <ak@linux.intel.com>
      Signed-off-by: Marco Elver <elver@google.com>
      Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Andrey Konovalov <andreyknvl@gmail.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      4434a56e
    • Marco Elver's avatar
      stackdepot: add stats counters exported via debugfs · c2a29254
      Marco Elver authored
      Add a few basic stats counters for stack depot that can be used to derive
      if stack depot is working as intended.  This is a snapshot of the new
      stats after booting a system with a KASAN-enabled kernel:
      
       $ cat /sys/kernel/debug/stackdepot/stats
       pools: 838
       allocations: 29861
       frees: 6561
       in_use: 23300
       freelist_size: 1840
      
      Generally, "pools" should be well below the max; once the system is
      booted, "in_use" should remain relatively steady.
      
      Link: https://lkml.kernel.org/r/20240118110216.2539519-1-elver@google.com
      Signed-off-by: Marco Elver <elver@google.com>
      Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Alexander Potapenko <glider@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      c2a29254
    • Marco Elver's avatar
      mm, kmsan: fix infinite recursion due to RCU critical section · f6564fce
      Marco Elver authored
      Alexander Potapenko writes in [1]: "For every memory access in the code
      instrumented by KMSAN we call kmsan_get_metadata() to obtain the metadata
      for the memory being accessed.  For virtual memory the metadata pointers
      are stored in the corresponding `struct page`, therefore we need to call
      virt_to_page() to get them.
      
      According to the comment in arch/x86/include/asm/page.h,
      virt_to_page(kaddr) returns a valid pointer iff virt_addr_valid(kaddr) is
      true, so KMSAN needs to call virt_addr_valid() as well.
      
      To avoid recursion, kmsan_get_metadata() must not call instrumented code,
      therefore ./arch/x86/include/asm/kmsan.h forks parts of
      arch/x86/mm/physaddr.c to check whether a virtual address is valid or not.
      
      But the introduction of rcu_read_lock() to pfn_valid() added instrumented
      RCU API calls to virt_to_page_or_null(), which is called by
      kmsan_get_metadata(), so there is an infinite recursion now.  I do not
      think it is correct to stop that recursion by doing
      kmsan_enter_runtime()/kmsan_exit_runtime() in kmsan_get_metadata(): that
      would prevent instrumented functions called from within the runtime from
      tracking the shadow values, which might introduce false positives."
      
      Fix the issue by switching pfn_valid() to the _sched() variant of
      rcu_read_lock/unlock(), which does not require calling into RCU.  Given
      the critical section in pfn_valid() is very small, this is a reasonable
      trade-off (with preemptible RCU).
      
      KMSAN further needs to be careful to suppress calls into the scheduler,
      which would be another source of recursion.  This can be done by wrapping
      the call to pfn_valid() into preempt_disable/enable_no_resched().  The
      downside is that this sacrifices scheduling guarantees; however,
      a kernel compiled with KMSAN has already given up any performance
      guarantees due to being heavily instrumented.
      
      Note, KMSAN code already disables tracing via Makefile, and since mmzone.h
      is included, it is not necessary to use the notrace variant, which is
      generally preferred in all other cases.
      
      Link: https://lkml.kernel.org/r/20240115184430.2710652-1-glider@google.com [1]
      Link: https://lkml.kernel.org/r/20240118110022.2538350-1-elver@google.com
      Fixes: 5ec8e8ea ("mm/sparsemem: fix race in accessing memory_section->usage")
      Signed-off-by: Marco Elver <elver@google.com>
      Reported-by: Alexander Potapenko <glider@google.com>
      Reported-by: syzbot+93a9e8a3dea8d6085e12@syzkaller.appspotmail.com
      Reviewed-by: Alexander Potapenko <glider@google.com>
      Tested-by: Alexander Potapenko <glider@google.com>
      Cc: Charan Teja Kalla <quic_charante@quicinc.com>
      Cc: Borislav Petkov (AMD) <bp@alien8.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      f6564fce
    • Zach O'Keefe's avatar
      mm/writeback: fix possible divide-by-zero in wb_dirty_limits(), again · 9319b647
      Zach O'Keefe authored
      (struct dirty_throttle_control *)->thresh is an unsigned long, but is
      passed as the u32 divisor argument to div_u64().  On architectures where
      unsigned long is 64 bits, the argument will be implicitly truncated.
      
      Use div64_u64() instead of div_u64() so that the value used in the "is
      this a safe division" check is the same as the divisor.
      
      Also, remove redundant cast of the numerator to u64, as that should happen
      implicitly.
      
      This would be difficult to exploit in memcg domain, given the ratio-based
      arithmetic domain_dirty_limits() uses, but is much easier in global
      writeback domain with a BDI_CAP_STRICTLIMIT-backing device, using e.g. 
      vm.dirty_bytes=(1<<32)*PAGE_SIZE so that dtc->thresh == (1<<32)
      
      Link: https://lkml.kernel.org/r/20240118181954.1415197-1-zokeefe@google.com
      Fixes: f6789593 ("mm/page-writeback.c: fix divide by zero in bdi_dirty_limits()")
      Signed-off-by: Zach O'Keefe <zokeefe@google.com>
      Cc: Maxim Patlasov <MPatlasov@parallels.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      9319b647
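The truncation described in the commit above can be reproduced in plain userspace C. The sketch below does not use the kernel's div_u64() or div64_u64(); the helper names (`as_u32_divisor`, `check_is_defeated`, `ratio_64`) are hypothetical stand-ins that model a 32-bit divisor parameter and a 64-by-64 division.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Model of passing a 64-bit value to a parameter that is only 32 bits wide. */
static uint32_t as_u32_divisor(uint64_t divisor)
{
	return (uint32_t)divisor;	/* upper 32 bits are silently dropped */
}

/*
 * True when a "divisor != 0" check on the 64-bit value would pass even
 * though the 32-bit value actually used for the division is zero.
 */
static bool check_is_defeated(uint64_t thresh)
{
	return thresh != 0 && as_u32_divisor(thresh) == 0;
}

/* 64-by-64 division keeps the checked value and the divisor identical. */
static uint64_t ratio_64(uint64_t numerator, uint64_t thresh)
{
	assert(thresh != 0);
	return numerator / thresh;
}
```

A threshold of exactly 1<<32, the value named in the message, passes the nonzero check but truncates to a zero divisor; any multiple of 1<<32 behaves the same way.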
    • Muhammad Usama Anjum's avatar
      selftests/mm: switch to bash from sh · bc29036e
      Muhammad Usama Anjum authored
      Running charge_reserved_hugetlb.sh generates errors if sh is set to
      dash:
      
      ./charge_reserved_hugetlb.sh: 9: [[: not found
      ./charge_reserved_hugetlb.sh: 19: [[: not found
      ./charge_reserved_hugetlb.sh: 27: [[: not found
      ./charge_reserved_hugetlb.sh: 37: [[: not found
      ./charge_reserved_hugetlb.sh: 45: Syntax error: "(" unexpected
      
      Switch to using /bin/bash instead of /bin/sh.  Make the switch for
      write_hugetlb_memory.sh as well which is called from
      charge_reserved_hugetlb.sh.
      
      Link: https://lkml.kernel.org/r/20240116090455.3407378-1-usama.anjum@collabora.com
      Signed-off-by: Muhammad Usama Anjum <usama.anjum@collabora.com>
      Cc: Muhammad Usama Anjum <usama.anjum@collabora.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: David Laight <David.Laight@ACULAB.COM>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      bc29036e
    • Petr Vorel's avatar
    • Johannes Weiner's avatar
      mm: memcontrol: don't throttle dying tasks on memory.high · 63fd3270
      Johannes Weiner authored
      While investigating hosts with high cgroup memory pressures, Tejun
      found culprit zombie tasks that were holding on to a lot of
      memory, had SIGKILL pending, but were stuck in memory.high reclaim.
      
      In the past, we used to always force-charge allocations from tasks
      that were exiting in order to accelerate them dying and freeing up
      their rss. This changed for memory.max in a4ebf1b6 ("memcg:
      prohibit unconditional exceeding the limit of dying tasks"); it noted
      that this can cause (userspace-inducible) containment failures, so it
      added a mandatory reclaim and OOM kill cycle before forcing charges.
      At the time, memory.high enforcement was handled in the userspace
      return path, which isn't reached by dying tasks, and so memory.high
      was still never enforced by dying tasks.
      
      When c9afe31e ("memcg: synchronously enforce memory.high for large
      overcharges") added synchronous reclaim for memory.high, it added
      unconditional memory.high enforcement for dying tasks as well. The
      callstack shows that this is the path in which the zombie is stuck.
      
      We need to accelerate dying tasks getting past memory.high, but we
      cannot do it quite the same way as we do for memory.max: memory.max is
      enforced strictly, and tasks aren't allowed to move past it without
      FIRST reclaiming and OOM killing if necessary. This ensures very small
      levels of excess. With memory.high, though, enforcement happens lazily
      after the charge, and OOM killing is never triggered. A lot of
      concurrent threads could have pushed, or could actively be pushing,
      the cgroup into excess. The dying task will enter reclaim on every
      allocation attempt, with little hope of restoring balance.
      
      To fix this, skip synchronous memory.high enforcement on dying tasks
      altogether again. Update memory.high path documentation while at it.
      
      [hannes@cmpxchg.org: also handle tasks that are being killed during reclaim]
        Link: https://lkml.kernel.org/r/20240111192807.GA424308@cmpxchg.org
      Link: https://lkml.kernel.org/r/20240111132902.389862-1-hannes@cmpxchg.org
      Fixes: c9afe31e ("memcg: synchronously enforce memory.high for large overcharges")
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reported-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Yosry Ahmed <yosryahmed@google.com>
      Acked-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Dan Schatzberg <schatzberg.dan@gmail.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Muchun Song <muchun.song@linux.dev>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      63fd3270
    • Yang Shi's avatar
      mm: mmap: map MAP_STACK to VM_NOHUGEPAGE · c4608d1b
      Yang Shi authored
      Commit efa7df3e ("mm: align larger anonymous mappings on THP
      boundaries") incurred a regression for the stress-ng pthread benchmark
      [1].  This is because THPs are now much more likely to be allocated for
      pthread stack areas than before.  A pthread's stack area is allocated by
      mmap without the VM_GROWSDOWN or VM_GROWSUP flag, so the kernel can't
      tell whether it is a stack area or not.
      
      The MAP_STACK flag is used to mark the stack area, but it is a no-op on
      Linux.  Map MAP_STACK to VM_NOHUGEPAGE to prevent THPs from being
      allocated for such stack areas.
      
      With this change the stack area looks like:
      
      fffd18e10000-fffd19610000 rw-p 00000000 00:00 0
      Size:               8192 kB
      KernelPageSize:        4 kB
      MMUPageSize:           4 kB
      Rss:                  12 kB
      Pss:                  12 kB
      Pss_Dirty:            12 kB
      Shared_Clean:          0 kB
      Shared_Dirty:          0 kB
      Private_Clean:         0 kB
      Private_Dirty:        12 kB
      Referenced:           12 kB
      Anonymous:            12 kB
      KSM:                   0 kB
      LazyFree:              0 kB
      AnonHugePages:         0 kB
      ShmemPmdMapped:        0 kB
      FilePmdMapped:         0 kB
      Shared_Hugetlb:        0 kB
      Private_Hugetlb:       0 kB
      Swap:                  0 kB
      SwapPss:               0 kB
      Locked:                0 kB
      THPeligible:           0
      VmFlags: rd wr mr mw me ac nh
      
      The "nh" flag is set.
      
      [1] https://lore.kernel.org/linux-mm/202312192310.56367035-oliver.sang@intel.com/
      
      Link: https://lkml.kernel.org/r/20231221065943.2803551-2-shy828301@gmail.com
      Fixes: efa7df3e ("mm: align larger anonymous mappings on THP boundaries")
      Signed-off-by: Yang Shi <yang@os.amperecomputing.com>
      Reported-by: kernel test robot <oliver.sang@intel.com>
      Tested-by: Oliver Sang <oliver.sang@intel.com>
      Reviewed-by: Yin Fengwei <fengwei.yin@intel.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Christopher Lameter <cl@linux.com>
      Cc: Huang, Ying <ying.huang@intel.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      c4608d1b
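From userspace, marking a mapping as a stack looks roughly like the sketch below. It is a minimal Linux-only illustration of an anonymous mmap() with MAP_STACK, not the allocation code of any particular pthread library; the helper names `alloc_stack_mapping` and `free_stack_mapping` are hypothetical. Whether the resulting mapping carries the "nh" flag depends on running a kernel that includes the change above.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/*
 * Allocate an anonymous region intended for use as a thread stack.
 * MAP_STACK tells the kernel what the mapping is for.
 * Returns NULL on failure. (Hypothetical helper.)
 */
static void *alloc_stack_mapping(size_t size)
{
	void *p;

	assert(size > 0);
	p = mmap(NULL, size, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK, -1, 0);
	return p == MAP_FAILED ? NULL : p;
}

/* Release a region obtained from alloc_stack_mapping(). Returns 0 on success. */
static int free_stack_mapping(void *p, size_t size)
{
	return munmap(p, size);
}
```

Calling `alloc_stack_mapping(8UL << 20)` requests a region of the same 8192 kB size as the one shown in the smaps output above; the VmFlags line for that region can then be inspected in /proc/self/smaps.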
    • David Hildenbrand's avatar
      uprobes: use pagesize-aligned virtual address when replacing pages · 4dca82d1
      David Hildenbrand authored
      uprobes passes an unaligned page mapping address to
      folio_add_new_anon_rmap(), which ends up triggering a VM_BUG_ON() we
      recently extended in commit 372cbd4d ("mm: non-pmd-mappable, large
      folios for folio_add_new_anon_rmap()").
      
      Arguably, this is uprobes code doing something wrong; however, for the
      time being it would have likely worked in rmap code because
      __folio_set_anon() would set folio->index to the same value.
      
      Looking at __replace_page(), we'd also pass slightly wrong values to
      mmu_notifier_range_init(), page_vma_mapped_walk(), flush_cache_page(),
      ptep_clear_flush() and set_pte_at_notify().  I suspect most of them are
      fine, but let's just mark the introducing commit as the one needing fixing.
      I don't think CC stable is warranted.
      
      We'll add more sanity checks in rmap code separately, to make sure that we
      always get properly aligned addresses.
      
      Link: https://lkml.kernel.org/r/20240115100731.91007-1-david@redhat.com
      Fixes: c517ee74 ("uprobes: __replace_page() should not use page_address_in_vma()")
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Reported-by: Jiri Olsa <jolsa@kernel.org>
      Closes: https://lkml.kernel.org/r/ZaMR2EWN-HvlCfUl@krava
      Tested-by: Jiri Olsa <jolsa@kernel.org>
      Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
      Acked-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Alexander Shishkin
      Cc: Jiri Olsa <jolsa@kernel.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Ian Rogers <irogers@google.com>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      4dca82d1
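The alignment the commit title refers to amounts to masking off the offset within the page. The following is a userspace sketch of that arithmetic only, with a hypothetical helper name (`page_align_down`) and the page size passed in explicitly; it is not the uprobes code itself.

```c
#include <assert.h>
#include <stdint.h>

/* Round a virtual address down to the start of the page that contains it. */
static uintptr_t page_align_down(uintptr_t vaddr, uintptr_t page_size)
{
	/* page_size must be a nonzero power of two for the mask to be valid */
	assert(page_size != 0 && (page_size & (page_size - 1)) == 0);
	return vaddr & ~(page_size - 1);
}
```

For a probe at address 0x401234 with 4 KB pages, the page-aligned address is 0x401000; an already aligned address is returned unchanged.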
    • Muhammad Usama Anjum's avatar
      selftests/mm: mremap_test: fix build warning · f8ee4361
      Muhammad Usama Anjum authored
      Use 2 separate variables of types int and unsigned long long instead of
      confusing them.  This applies the correct print format to each of them
      and removes the build warning:
      
      warning: format `%d' expects argument of type `int', but argument 2 has type `long long unsigned int'
      
      Link: https://lkml.kernel.org/r/20240112071851.612930-1-usama.anjum@collabora.com
      Fixes: a4cb3b24 ("selftests: mm: add a test for remapping to area immediately after existing mapping")
      Signed-off-by: Muhammad Usama Anjum <usama.anjum@collabora.com>
      Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
      Cc: Lorenzo Stoakes <lstoakes@gmail.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      f8ee4361
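The warning quoted above arises when a value of type unsigned long long is printed with `%d`. A minimal sketch of the corrected pattern follows, assuming two illustrative values (a count and a size); the function name `format_values` and the output text are hypothetical and not taken from mremap_test.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Format a count and a size with conversion specifiers that match their
 * types: %d for int, %llu for unsigned long long.
 * Returns the value reported by snprintf().
 */
static int format_values(char *buf, size_t len, int count,
			 unsigned long long size)
{
	assert(buf != NULL && len > 0);
	return snprintf(buf, len, "count=%d size=%llu", count, size);
}
```

Keeping one variable per type means each argument matches its specifier, so the compiler's format checking has nothing to flag.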
    • Sidhartha Kumar's avatar
      fs/hugetlbfs/inode.c: mm/memory-failure.c: fix hugetlbfs hwpoison handling · 19d3e221
      Sidhartha Kumar authored
      has_extra_refcount() makes the assumption that the page cache adds a ref
      count of 1 and subtracts this in the extra_pins case.  Commit a08c7193
      (mm/filemap: remove hugetlb special casing in filemap.c) modifies
      __filemap_add_folio() by calling folio_ref_add(folio, nr); for all cases
      (including hugetlb) where nr is the number of pages in the folio.  We
      should adjust the number of references coming from the page cache by
      subtracting the number of pages rather than 1.
      
      In hugetlbfs_read_iter(), folio_test_has_hwpoisoned() is testing the wrong
      flag as, in the hugetlb case, memory-failure code calls
      folio_test_set_hwpoison() to indicate poison.  folio_test_hwpoison() is
      the correct function to test for that flag.
      
      After these fixes, the hugetlb hwpoison read selftest passes all cases.
      
      Link: https://lkml.kernel.org/r/20240112180840.367006-1-sidhartha.kumar@oracle.com
      Fixes: a08c7193 ("mm/filemap: remove hugetlb special casing in filemap.c")
      Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com>
      Closes: https://lore.kernel.org/linux-mm/20230713001833.3778937-1-jiaqiyan@google.com/T/#m8e1469119e5b831bbd05d495f96b842e4a1c5519
      Reported-by: Muhammad Usama Anjum <usama.anjum@collabora.com>
      Tested-by: Muhammad Usama Anjum <usama.anjum@collabora.com>
      Acked-by: Miaohe Lin <linmiaohe@huawei.com>
      Acked-by: Muchun Song <muchun.song@linux.dev>
      Cc: James Houghton <jthoughton@google.com>
      Cc: Jiaqi Yan <jiaqiyan@google.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: <stable@vger.kernel.org>	[6.7+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      19d3e221
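The reference-count adjustment in the first paragraph of the commit above can be modelled with plain integers. This is a simplified sketch, not the body of has_extra_refcount(): the function names are hypothetical, and it assumes that exactly one reference belongs to the caller and is discounted first.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Simplified model: references left over after discounting the caller's
 * own reference (an assumption of this sketch) and, if the folio is in
 * the page cache, one page-cache reference per page in the folio.
 */
static int extra_refcount(int refcount, int nr_pages, bool in_page_cache)
{
	int count;

	assert(refcount >= 1 && nr_pages >= 1);
	count = refcount - 1;		/* the caller's reference */
	if (in_page_cache)
		count -= nr_pages;	/* subtracting 1 here was the bug */
	return count;
}

/* Any remaining reference means someone else still holds the folio. */
static bool has_unexpected_refs(int refcount, int nr_pages, bool in_page_cache)
{
	return extra_refcount(refcount, nr_pages, in_page_cache) > 0;
}
```

For a 512-page folio held only by the caller and the page cache, the count is 1 + 512; subtracting just 1 for the page cache would leave 511 apparent extra references.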
    • Jan Kara's avatar
      readahead: avoid multiple marked readahead pages · ab4443fe
      Jan Kara authored
      ra_alloc_folio() marks a page that should trigger the next round of async
      readahead.  However, it rounds the computed index up to the order of the
      page being allocated, which can lead to multiple consecutive pages being
      marked with the readahead flag.  Consider a situation with index == 1, mark ==
      1, order == 0.  We insert order 0 page at index 1 and mark it.  Then we
      bump order to 1, index to 2, mark (still == 1) is rounded up to 2 so page
      at index 2 is marked as well.  Then we bump order to 2, index is
      incremented to 4, mark gets rounded to 4 so page at index 4 is marked as
      well.  The fact that multiple pages get marked within a single readahead
      window confuses the readahead logic and results in readahead window being
      trimmed back to 1.  This situation is triggered in particular when maximum
      readahead window size is not a power of two (in the observed case it was
      768 KB) and as a result sequential read throughput suffers.
      
      Fix the problem by rounding 'mark' down instead of up.  Because the index
      is naturally aligned to 'order', we are guaranteed 'rounded mark' == index
      iff 'mark' is within the page we are allocating at 'index' and thus
      exactly one page is marked with readahead flag as required by the
      readahead code and sequential read performance is restored.
      
      This effectively reverts part of commit b9ff43dd ("mm/readahead: Fix
      readahead with large folios").  The commit changed the rounding with the
      rationale:
      
      "...  we were setting the readahead flag on the folio which contains the
      last byte read from the block.  This is wrong because we will trigger
      readahead at the end of the read without waiting to see if a subsequent
      read is going to use the pages we just read."
      
      Although this is true, the fact is this was always the case with read
      sizes not aligned to folio boundaries and large folios in the page cache
      just make the situation more obvious (and frequent).  Also for sequential
      read workloads it is better to trigger the readahead earlier rather than
      later.  It is true that the difference in the rounding and thus earlier
      triggering of the readahead can result in reading more for semi-random
      workloads.  However workloads really suffering from this seem to be rare. 
      In particular I have verified that the workload described in commit
      b9ff43dd ("mm/readahead: Fix readahead with large folios") of reading
      random 100k blocks from a file like:
      
      [reader]
      bs=100k
      rw=randread
      numjobs=1
      size=64g
      runtime=60s
      
      is not impacted by the rounding change and achieves ~70MB/s in both cases.
      
      [jack@suse.cz: fix one more place where mark rounding was done as well]
        Link: https://lkml.kernel.org/r/20240123153254.5206-1-jack@suse.cz
      Link: https://lkml.kernel.org/r/20240104085839.21029-1-jack@suse.cz
      Fixes: b9ff43dd ("mm/readahead: Fix readahead with large folios")
      Signed-off-by: Jan Kara <jack@suse.cz>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Guo Xuenan <guoxuenan@huawei.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      ab4443fe
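The worked example in the commit above (index == 1, mark == 1, order growing from 0 to 2) can be replayed with a small simulation. This is a sketch of the rounding arithmetic only; `count_marked` and the rounding helpers are hypothetical names, and the loop simply bumps the order by one after each folio as the message describes.

```c
#include <assert.h>
#include <stdbool.h>

/* Round x up or down to a multiple of align (align must be a power of two). */
static unsigned long round_up_pow2(unsigned long x, unsigned long align)
{
	return (x + align - 1) & ~(align - 1);
}

static unsigned long round_down_pow2(unsigned long x, unsigned long align)
{
	return x & ~(align - 1);
}

/*
 * Walk the allocation sequence from the commit message: start at `index`
 * with order 0, bump the order after each folio, and count how many
 * folios would receive the readahead flag for the given rounding mode.
 */
static int count_marked(unsigned long index, unsigned long mark,
			unsigned int max_order, bool round_mark_up)
{
	int marked = 0;

	assert(max_order < 32);
	for (unsigned int order = 0; order <= max_order; order++) {
		unsigned long size = 1UL << order;
		unsigned long rounded = round_mark_up ?
			round_up_pow2(mark, size) : round_down_pow2(mark, size);

		if (rounded == index)
			marked++;
		index += size;	/* next folio starts right after this one */
	}
	return marked;
}
```

With rounding up, the folios at indices 1, 2 and 4 all match the rounded mark, so three folios are flagged; with rounding down, only the folio at index 1 matches, which is the single mark the readahead logic expects.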
  2. 23 Jan, 2024 1 commit
  3. 21 Jan, 2024 24 commits
    • Linus Torvalds's avatar
      Linux 6.8-rc1 · 6613476e
      Linus Torvalds authored
      6613476e
    • Linus Torvalds's avatar
      Merge tag 'bcachefs-2024-01-21' of https://evilpiepirate.org/git/bcachefs · 35a4474b
      Linus Torvalds authored
      Pull more bcachefs updates from Kent Overstreet:
       "Some fixes, Some refactoring, some minor features:
      
         - Assorted prep work for disk space accounting rewrite
      
         - BTREE_TRIGGER_ATOMIC: after combining our trigger callbacks, this
           makes our trigger context more explicit
      
         - A few fixes to avoid excessive transaction restarts on
           multithreaded workloads: fstests (in addition to ktest tests) are
           now checking slowpath counters, and that's shaking out a few bugs
      
         - Assorted tracepoint improvements
      
         - Starting to break up bcachefs_format.h and move on disk types so
           they're with the code they belong to; this will make room to start
           documenting the on disk format better.
      
         - A few minor fixes"
      
      * tag 'bcachefs-2024-01-21' of https://evilpiepirate.org/git/bcachefs: (46 commits)
        bcachefs: Improve inode_to_text()
        bcachefs: logged_ops_format.h
        bcachefs: reflink_format.h
        bcachefs; extents_format.h
        bcachefs: ec_format.h
        bcachefs: subvolume_format.h
        bcachefs: snapshot_format.h
        bcachefs: alloc_background_format.h
        bcachefs: xattr_format.h
        bcachefs: dirent_format.h
        bcachefs: inode_format.h
        bcachefs; quota_format.h
        bcachefs: sb-counters_format.h
        bcachefs: counters.c -> sb-counters.c
        bcachefs: comment bch_subvolume
        bcachefs: bch_snapshot::btime
        bcachefs: add missing __GFP_NOWARN
        bcachefs: opts->compression can now also be applied in the background
        bcachefs: Prep work for variable size btree node buffers
        bcachefs: grab s_umount only if snapshotting
        ...
      35a4474b
    • Linus Torvalds's avatar
      Merge tag 'timers-core-2024-01-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 4fbbed78
      Linus Torvalds authored
      Pull timer updates from Thomas Gleixner:
       "Updates for time and clocksources:
      
         - A fix for the idle and iowait time accounting vs CPU hotplug.
      
           The time is reset on CPU hotplug which makes the accumulated
           systemwide time jump backwards.
      
         - Assorted fixes and improvements for clocksource/event drivers"
      
      * tag 'timers-core-2024-01-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        tick-sched: Fix idle and iowait sleeptime accounting vs CPU hotplug
        clocksource/drivers/ep93xx: Fix error handling during probe
        clocksource/drivers/cadence-ttc: Fix some kernel-doc warnings
        clocksource/drivers/timer-ti-dm: Fix make W=n kerneldoc warnings
        clocksource/timer-riscv: Add riscv_clock_shutdown callback
        dt-bindings: timer: Add StarFive JH8100 clint
        dt-bindings: timer: thead,c900-aclint-mtimer: separate mtime and mtimecmp regs
      4fbbed78
    • Linus Torvalds's avatar
      Merge tag 'powerpc-6.8-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux · 7b297a5c
      Linus Torvalds authored
      Pull powerpc fixes from Aneesh Kumar:
      
       - Increase default stack size to 32KB for Book3S
      
      Thanks to Michael Ellerman.
      
      * tag 'powerpc-6.8-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
        powerpc/64s: Increase default stack size to 32KB
      7b297a5c
    • Kent Overstreet's avatar
      bcachefs: Improve inode_to_text() · 249f441f
      Kent Overstreet authored
      Add line breaks - inode_to_text() is now much easier to read.
      Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
      249f441f
    • Kent Overstreet's avatar
      bcachefs: logged_ops_format.h · d826cc57
      Kent Overstreet authored
      Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
      d826cc57
    • Kent Overstreet's avatar
      bcachefs: reflink_format.h · 8d52ba60
      Kent Overstreet authored
      Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
      8d52ba60
    • Kent Overstreet's avatar
      bcachefs: extents_format.h · b2fa1b63
      Kent Overstreet authored
      Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
      b2fa1b63
    • Kent Overstreet's avatar
      bcachefs: ec_format.h · 0560eb9a
      Kent Overstreet authored
      Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
      0560eb9a
    • Kent Overstreet's avatar
      bcachefs: subvolume_format.h · c6c4ff65
      Kent Overstreet authored
      Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
      c6c4ff65
    • Kent Overstreet's avatar
      bcachefs: snapshot_format.h · 8fed323b
      Kent Overstreet authored
      Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
      8fed323b
    • Kent Overstreet's avatar
      d455179f
    • Kent Overstreet's avatar
      bcachefs: xattr_format.h · 72e08010
      Kent Overstreet authored
      Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
      72e08010
    • Kent Overstreet's avatar
      bcachefs: dirent_format.h · 7ffc4daa
      Kent Overstreet authored
      Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
      7ffc4daa
    • Kent Overstreet's avatar
      bcachefs: inode_format.h · b36425da
      Kent Overstreet authored
      Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
      b36425da
    • Kent Overstreet's avatar
      bcachefs: quota_format.h · 82de6207
      Kent Overstreet authored
      Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
      82de6207
    • Kent Overstreet's avatar
      bcachefs: sb-counters_format.h · 43314801
      Kent Overstreet authored
      bcachefs_format.h has gotten too big; let's do some organizing.
      Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
      43314801
    • Kent Overstreet's avatar
      3a58dfbc
    • Kent Overstreet's avatar
      12207f49
    • Kent Overstreet's avatar
      bcachefs: bch_snapshot::btime · d32088f2
      Kent Overstreet authored
      Add a field to bch_snapshot for creation time; this will be important
      when we start exposing the snapshot tree to userspace.
      Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
      d32088f2
    • Kent Overstreet's avatar
      7be0208f
    • Kent Overstreet's avatar
      bcachefs: opts->compression can now also be applied in the background · d7e77f53
      Kent Overstreet authored
      The "apply this compression method in the background" paths now use the
      compression option if background_compression is not set; this means that
      setting or changing the compression option will cause existing data to
      be compressed accordingly in the background.
      Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
      d7e77f53
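The fallback rule described in this commit message can be sketched as a small selection function. This is a minimal model, not bcachefs code: the function name is hypothetical, and "not set" is represented here as None, which is an assumption about how to model the option rather than how the filesystem stores it.

```python
from typing import Optional


def background_compression_target(
    compression: Optional[str],
    background_compression: Optional[str],
) -> Optional[str]:
    """Pick the method the background paths should apply.

    background_compression wins when it is set; otherwise the plain
    compression option is used. None means neither option is set.
    """
    if background_compression is not None:
        return background_compression
    return compression


print(background_compression_target("lz4", None))
print(background_compression_target("lz4", "zstd"))
```

Under this rule, changing only the compression option is enough for existing data to be picked up by the background paths, which matches the behaviour the message describes.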
    • Kent Overstreet's avatar
      bcachefs: Prep work for variable size btree node buffers · ec4edd7b
      Kent Overstreet authored
      bcachefs btree nodes are big - typically 256k - and btree roots are
      pinned in memory. As we're now up to 18 btrees, we now have significant
      memory overhead in mostly empty btree roots.
      
      And in the future we're going to start enforcing that certain btree node
      boundaries exist, to solve lock contention issues - analogous to XFS's
      AGIs.
      
      Thus, we need to start allocating smaller btree node buffers when we
      can. This patch changes code that refers to the filesystem constant
      c->opts.btree_node_size to refer to the btree node buffer size -
      btree_buf_bytes() - where appropriate.
      Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
      ec4edd7b
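The memory overhead mentioned in this commit message follows from simple arithmetic. The sketch below uses the two figures given in the message (18 btrees, a typical node size of 256k, read here as 256 KiB); the 4 KiB buffer for a mostly empty root is a made-up value used only for comparison, and the function name is hypothetical.

```python
KIB = 1024

NR_BTREES = 18                    # from the commit message
TYPICAL_NODE_BYTES = 256 * KIB    # "typically 256k", read as 256 KiB
SMALL_BUF_BYTES = 4 * KIB         # hypothetical smaller buffer, for illustration


def pinned_root_bytes(nr_btrees: int, buf_bytes: int) -> int:
    """Memory pinned by btree roots if each root holds one buffer of buf_bytes."""
    return nr_btrees * buf_bytes


full = pinned_root_bytes(NR_BTREES, TYPICAL_NODE_BYTES)
small = pinned_root_bytes(NR_BTREES, SMALL_BUF_BYTES)
print(f"full-size root buffers: {full / (KIB * KIB):.1f} MiB")
print(f"small root buffers:     {small / KIB:.0f} KiB")
```

With full-size buffers the roots alone pin 4.5 MiB per filesystem, most of it unused when the roots are nearly empty.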
    • Su Yue's avatar
      bcachefs: grab s_umount only if snapshotting · 2acc59dd
      Su Yue authored
      While testing MongoDB on bcachefs with compression, I hit a lockdep
      warning when snapshotting the MongoDB data volume.
      
      $ cat test.sh
      prog=bcachefs
      
      $prog subvolume create /mnt/data
      $prog subvolume create /mnt/data/snapshots
      
      while true;do
          $prog subvolume snapshot /mnt/data /mnt/data/snapshots/$(date +%s)
          sleep 1s
      done
      
      $ cat /etc/mongodb.conf
      systemLog:
        destination: file
        logAppend: true
        path: /mnt/data/mongod.log
      
      storage:
        dbPath: /mnt/data/
      
      lockdep reports:
      [ 3437.452330] ======================================================
      [ 3437.452750] WARNING: possible circular locking dependency detected
      [ 3437.453168] 6.7.0-rc7-custom+ #85 Tainted: G            E
      [ 3437.453562] ------------------------------------------------------
      [ 3437.453981] bcachefs/35533 is trying to acquire lock:
      [ 3437.454325] ffffa0a02b2b1418 (sb_writers#10){.+.+}-{0:0}, at: filename_create+0x62/0x190
      [ 3437.454875]
                     but task is already holding lock:
      [ 3437.455268] ffffa0a02b2b10e0 (&type->s_umount_key#48){.+.+}-{3:3}, at: bch2_fs_file_ioctl+0x232/0xc90 [bcachefs]
      [ 3437.456009]
                     which lock already depends on the new lock.
      
      [ 3437.456553]
                     the existing dependency chain (in reverse order) is:
      [ 3437.457054]
                     -> #3 (&type->s_umount_key#48){.+.+}-{3:3}:
      [ 3437.457507]        down_read+0x3e/0x170
      [ 3437.457772]        bch2_fs_file_ioctl+0x232/0xc90 [bcachefs]
      [ 3437.458206]        __x64_sys_ioctl+0x93/0xd0
      [ 3437.458498]        do_syscall_64+0x42/0xf0
      [ 3437.458779]        entry_SYSCALL_64_after_hwframe+0x6e/0x76
      [ 3437.459155]
                     -> #2 (&c->snapshot_create_lock){++++}-{3:3}:
      [ 3437.459615]        down_read+0x3e/0x170
      [ 3437.459878]        bch2_truncate+0x82/0x110 [bcachefs]
      [ 3437.460276]        bchfs_truncate+0x254/0x3c0 [bcachefs]
      [ 3437.460686]        notify_change+0x1f1/0x4a0
      [ 3437.461283]        do_truncate+0x7f/0xd0
      [ 3437.461555]        path_openat+0xa57/0xce0
      [ 3437.461836]        do_filp_open+0xb4/0x160
      [ 3437.462116]        do_sys_openat2+0x91/0xc0
      [ 3437.462402]        __x64_sys_openat+0x53/0xa0
      [ 3437.462701]        do_syscall_64+0x42/0xf0
      [ 3437.462982]        entry_SYSCALL_64_after_hwframe+0x6e/0x76
      [ 3437.463359]
                     -> #1 (&sb->s_type->i_mutex_key#15){+.+.}-{3:3}:
      [ 3437.463843]        down_write+0x3b/0xc0
      [ 3437.464223]        bch2_write_iter+0x5b/0xcc0 [bcachefs]
      [ 3437.464493]        vfs_write+0x21b/0x4c0
      [ 3437.464653]        ksys_write+0x69/0xf0
      [ 3437.464839]        do_syscall_64+0x42/0xf0
      [ 3437.465009]        entry_SYSCALL_64_after_hwframe+0x6e/0x76
      [ 3437.465231]
                     -> #0 (sb_writers#10){.+.+}-{0:0}:
      [ 3437.465471]        __lock_acquire+0x1455/0x21b0
      [ 3437.465656]        lock_acquire+0xc6/0x2b0
      [ 3437.465822]        mnt_want_write+0x46/0x1a0
      [ 3437.465996]        filename_create+0x62/0x190
      [ 3437.466175]        user_path_create+0x2d/0x50
      [ 3437.466352]        bch2_fs_file_ioctl+0x2ec/0xc90 [bcachefs]
      [ 3437.466617]        __x64_sys_ioctl+0x93/0xd0
      [ 3437.466791]        do_syscall_64+0x42/0xf0
      [ 3437.466957]        entry_SYSCALL_64_after_hwframe+0x6e/0x76
      [ 3437.467180]
                     other info that might help us debug this:
      
      [ 3437.467507] Chain exists of:
                       sb_writers#10 --> &c->snapshot_create_lock --> &type->s_umount_key#48
      
      [ 3437.467979]  Possible unsafe locking scenario:
      
      [ 3437.468223]        CPU0                    CPU1
      [ 3437.468405]        ----                    ----
      [ 3437.468585]   rlock(&type->s_umount_key#48);
      [ 3437.468758]                                lock(&c->snapshot_create_lock);
      [ 3437.469030]                                lock(&type->s_umount_key#48);
      [ 3437.469291]   rlock(sb_writers#10);
      [ 3437.469434]
                      *** DEADLOCK ***
      
      [ 3437.469670] 2 locks held by bcachefs/35533:
      [ 3437.469838]  #0: ffffa0a02ce00a88 (&c->snapshot_create_lock){++++}-{3:3}, at: bch2_fs_file_ioctl+0x1e3/0xc90 [bcachefs]
      [ 3437.470294]  #1: ffffa0a02b2b10e0 (&type->s_umount_key#48){.+.+}-{3:3}, at: bch2_fs_file_ioctl+0x232/0xc90 [bcachefs]
      [ 3437.470744]
                     stack backtrace:
      [ 3437.470922] CPU: 7 PID: 35533 Comm: bcachefs Kdump: loaded Tainted: G            E      6.7.0-rc7-custom+ #85
      [ 3437.471313] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux 1.16.3-1-1 04/01/2014
      [ 3437.471694] Call Trace:
      [ 3437.471795]  <TASK>
      [ 3437.471884]  dump_stack_lvl+0x57/0x90
      [ 3437.472035]  check_noncircular+0x132/0x150
      [ 3437.472202]  __lock_acquire+0x1455/0x21b0
      [ 3437.472369]  lock_acquire+0xc6/0x2b0
      [ 3437.472518]  ? filename_create+0x62/0x190
      [ 3437.472683]  ? lock_is_held_type+0x97/0x110
      [ 3437.472856]  mnt_want_write+0x46/0x1a0
      [ 3437.473025]  ? filename_create+0x62/0x190
      [ 3437.473204]  filename_create+0x62/0x190
      [ 3437.473380]  user_path_create+0x2d/0x50
      [ 3437.473555]  bch2_fs_file_ioctl+0x2ec/0xc90 [bcachefs]
      [ 3437.473819]  ? lock_acquire+0xc6/0x2b0
      [ 3437.474002]  ? __fget_files+0x2a/0x190
      [ 3437.474195]  ? __fget_files+0xbc/0x190
      [ 3437.474380]  ? lock_release+0xc5/0x270
      [ 3437.474567]  ? __x64_sys_ioctl+0x93/0xd0
      [ 3437.474764]  ? __pfx_bch2_fs_file_ioctl+0x10/0x10 [bcachefs]
      [ 3437.475090]  __x64_sys_ioctl+0x93/0xd0
      [ 3437.475277]  do_syscall_64+0x42/0xf0
      [ 3437.475454]  entry_SYSCALL_64_after_hwframe+0x6e/0x76
      [ 3437.475691] RIP: 0033:0x7f2743c313af
      ======================================================
      
      In __bch2_ioctl_subvolume_create(), we grab s_umount unconditionally
      and unlock it at the end of the function. A "why do we need this lock?"
      comment on the lock dates from
      commit 42d23732 ("bcachefs: Snapshot creation, deletion").
      The reason is that __bch2_ioctl_subvolume_create() calls
      sync_inodes_sb(), which requires s_umount to be held, to write back all
      dirty inodes before doing the snapshot work.
      
      Fix it by read locking s_umount for snapshotting only and unlocking
      s_umount after sync_inodes_sb().
      Signed-off-by: Su Yue <glass.su@suse.com>
      Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
      2acc59dd
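The circular dependency in the lockdep report above can be reproduced with a small lock-order graph. This is a user-space sketch, not how lockdep is implemented: lock names are shortened, the edge list is transcribed by hand from the "-> #0" to "-> #3" entries of the report, and find_cycle() is a hypothetical helper. It models only those four edges and says nothing about other dependencies in the filesystem.

```python
from typing import Dict, List, Optional, Tuple

Edge = Tuple[str, str]  # (lock already held, lock acquired while holding it)

# Dependencies that existed before the failing acquisition (-> #1 .. -> #3).
REPORTED_EDGES: List[Edge] = [
    ("sb_writers", "i_mutex"),             # -> #1: down_write in bch2_write_iter
    ("i_mutex", "snapshot_create_lock"),   # -> #2: down_read in bch2_truncate
    ("snapshot_create_lock", "s_umount"),  # -> #3: down_read in bch2_fs_file_ioctl
]
# The new dependency (-> #0): filename_create() takes sb_writers with s_umount held.
NEW_EDGE: Edge = ("s_umount", "sb_writers")


def find_cycle(edges: List[Edge]) -> Optional[List[str]]:
    """Return one cycle in the lock-order graph as a list of locks, or None."""
    graph: Dict[str, List[str]] = {}
    for held, acquired in edges:
        graph.setdefault(held, []).append(acquired)
        graph.setdefault(acquired, [])

    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    colour = {node: WHITE for node in graph}
    path: List[str] = []

    def visit(node: str) -> Optional[List[str]]:
        colour[node] = GREY
        path.append(node)
        for nxt in graph[node]:
            if colour[nxt] == GREY:  # back edge: nxt is already on the path
                return path[path.index(nxt):] + [nxt]
            if colour[nxt] == WHITE:
                found = visit(nxt)
                if found:
                    return found
        path.pop()
        colour[node] = BLACK
        return None

    for node in graph:
        if colour[node] == WHITE:
            found = visit(node)
            if found:
                return found
    return None


print(find_cycle(REPORTED_EDGES))               # no cycle without the new edge
print(find_cycle(REPORTED_EDGES + [NEW_EDGE]))  # cycle once s_umount -> sb_writers exists
```

Dropping s_umount right after sync_inodes_sb(), as the fix does, means sb_writers is no longer acquired with s_umount held, so the s_umount -> sb_writers edge is not created and the reported cycle cannot close.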