1. 27 Mar, 2024 6 commits
    • Merge tag 'for-6.9-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux · 400dd456
      Linus Torvalds authored
      Pull btrfs fixes from David Sterba:
      
       - fix race when reading extent buffer and 'uptodate' status is missed
         by one thread (introduced in 6.5)
      
       - do additional validation of devices using major:minor numbers
      
       - zoned mode fixes:
           - use zone-aware super block access during scrub
           - fix use-after-free during device replace (found by KASAN)
           - also delete zones that are 100% unusable to reclaim space
      
       - extent unpinning fixes:
           - fix extent map leak after error handling
           - print correct range in error message
      
       - error code and message updates
      
      * tag 'for-6.9-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
        btrfs: fix race in read_extent_buffer_pages()
        btrfs: return accurate error code on open failure in open_fs_devices()
        btrfs: zoned: don't skip block groups with 100% zone unusable
        btrfs: use btrfs_warn() to log message at btrfs_add_extent_mapping()
        btrfs: fix message not properly printing interval when adding extent map
        btrfs: fix warning messages not printing interval at unpin_extent_range()
        btrfs: fix extent map leak in unexpected scenario at unpin_extent_cache()
        btrfs: validate device maj:min during open
        btrfs: zoned: fix use-after-free in do_zone_finish()
        btrfs: zoned: use zone aware sb location for scrub
    • Merge tag 'mm-hotfixes-stable-2024-03-27-11-25' of... · dc189b8e
      Linus Torvalds authored
      Merge tag 'mm-hotfixes-stable-2024-03-27-11-25' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
      
      Pull misc fixes from Andrew Morton:
       "Various hotfixes. About half are cc:stable and the remainder address
        post-6.8 issues or aren't considered suitable for backporting.
      
        zswap figures prominently in the post-6.8 issues - followup against the
        large amount of changes we have just made to that code.
      
        Apart from that, all over the map"
      
      * tag 'mm-hotfixes-stable-2024-03-27-11-25' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (21 commits)
        crash: use macro to add crashk_res into iomem early for specific arch
        mm: zswap: fix data loss on SWP_SYNCHRONOUS_IO devices
        selftests/mm: fix ARM related issue with fork after pthread_create
        hexagon: vmlinux.lds.S: handle attributes section
        userfaultfd: fix deadlock warning when locking src and dst VMAs
        tmpfs: fix race on handling dquot rbtree
        selftests/mm: sigbus-wp test requires UFFD_FEATURE_WP_HUGETLBFS_SHMEM
        mm: zswap: fix writeback shinker GFP_NOIO/GFP_NOFS recursion
        ARM: prctl: reject PR_SET_MDWE on pre-ARMv6
        prctl: generalize PR_SET_MDWE support check to be per-arch
        MAINTAINERS: remove incorrect M: tag for dm-devel@lists.linux.dev
        mm: zswap: fix kernel BUG in sg_init_one
        selftests: mm: restore settings from only parent process
        tools/Makefile: remove cgroup target
        mm: cachestat: fix two shmem bugs
        mm: increase folio batch size
        mm,page_owner: fix recursion
        mailmap: update entry for Leonard Crestez
        init: open /initrd.image with O_LARGEFILE
        selftests/mm: Fix build with _FORTIFY_SOURCE
        ...
    • Merge tag 'probes-fixes-v6.9-rc1' of... · 96249052
      Linus Torvalds authored
      Merge tag 'probes-fixes-v6.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
      
      Pull probes fixlet from Masami Hiramatsu:
      
       - tracing/probes: initialize a 'val' local variable with zero.
      
         This variable is read by FETCH_OP_ST_EDATA in a loop, and is
         initialized by FETCH_OP_ARG in the same loop. Since this
         initialization is not obvious, smatch warns about it.
      
         Explicitly initializing 'val' with zero fixes this warning.
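
A minimal userspace sketch of why a checker cannot see the initialization. The op names `OP_ARG` and `OP_ST_EDATA` and the function `run_ops()` are hypothetical stand-ins for illustration, not the kernel's fetch-op code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the two fetch ops named in the message. */
enum op { OP_ARG, OP_ST_EDATA, OP_END };

/*
 * Walk a small op program. OP_ARG writes 'val' and OP_ST_EDATA reads it.
 * A checker cannot prove that OP_ARG always runs before OP_ST_EDATA, so
 * 'val' starts at zero to make every read well-defined.
 */
static size_t run_ops(const enum op *code, const unsigned long *args,
		      unsigned long *edata)
{
	unsigned long val = 0;	/* explicit zero initialization */
	size_t n = 0;

	assert(code && args && edata);
	for (; *code != OP_END; code++) {
		switch (*code) {
		case OP_ARG:
			val = args[n];
			break;
		case OP_ST_EDATA:
			edata[n++] = val;
			break;
		default:
			break;
		}
	}
	return n;	/* number of entries stored */
}
```

In a well-formed program each store is preceded by a fetch, so the zero is never observed; it only matters for the path the checker cannot rule out.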
      
      * tag 'probes-fixes-v6.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
        tracing: probes: Fix to zero initialize a local variable
    • Merge tag 'execve-v6.9-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux · f4a43291
      Linus Torvalds authored
      Pull execve fixes from Kees Cook:
      
       - Fix selftests to conform to the TAP output format (Muhammad Usama
         Anjum)
      
       - Fix NOMMU linux_binprm::exec pointer in auxv (Max Filippov)
      
       - Replace deprecated strncpy usage (Justin Stitt)
      
       - Replace another /bin/sh instance in selftests
      
      * tag 'execve-v6.9-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
        binfmt: replace deprecated strncpy
        exec: Fix NOMMU linux_binprm::exec in transfer_args_to_stack()
        selftests/exec: Convert remaining /bin/sh to /bin/bash
        selftests/exec: execveat: Improve debug reporting
        selftests/exec: recursion-depth: conform test to TAP format output
        selftests/exec: load_address: conform test to TAP format output
        selftests/exec: binfmt_script: Add the overall result line according to TAP
    • Fix build errors due to new UIO_MEM_DMA_COHERENT mess · 498e47cd
      Linus Torvalds authored
      Commit 576882ef ("uio: introduce UIO_MEM_DMA_COHERENT type")
      introduced a new use-case for 'struct uio_mem' where the 'mem' field now
      contains a kernel virtual address when 'memtype' is set to
      UIO_MEM_DMA_COHERENT.
      
      That in turn causes build errors, because 'mem' is of type
      'phys_addr_t', and a virtual address is a pointer type.  When the code
      just blindly uses cast to mix the two, it caused problems when
      phys_addr_t isn't the same size as a pointer - notably on 32-bit
      architectures with PHYS_ADDR_T_64BIT.
      
      The proper thing to do would probably be to use a union member, and not
      have any casts, and make the 'mem' member be a union of 'mem.physaddr'
      and 'mem.vaddr', based on 'memtype'.
      
      This is not that proper thing.  This is just fixing the ugly casts to be
      even uglier, but at least not cause build errors on 32-bit platforms
      with 64-bit physical addresses.
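
The union approach that the message calls the proper fix (and does not implement) might look roughly like the sketch below. The member names `physaddr` and `vaddr` come from the message; the struct name, the memtype constants and the accessor are hypothetical, and `phys_addr_t` is typedef'd to a 64-bit integer here to mimic PHYS_ADDR_T_64BIT.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Mimic a platform whose physical addresses are 64 bits wide. */
typedef uint64_t phys_addr_t;

/* Hypothetical memtype values, for illustration only. */
enum { SKETCH_MEM_PHYS = 1, SKETCH_MEM_DMA_COHERENT = 4 };

struct uio_mem_sketch {
	int memtype;
	union {
		phys_addr_t physaddr;	/* valid for physical memtypes */
		void *vaddr;		/* valid for SKETCH_MEM_DMA_COHERENT */
	} mem;
};

/* Return the virtual address, or NULL when the memtype holds none. */
static void *sketch_vaddr(const struct uio_mem_sketch *m)
{
	assert(m);
	return m->memtype == SKETCH_MEM_DMA_COHERENT ? m->mem.vaddr : NULL;
}
```

With a union, neither member is ever cast to the other's type, so the size mismatch between a pointer and `phys_addr_t` cannot cause a build error.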
      Reported-by: Guenter Roeck <linux@roeck-us.net>
      Fixes: 576882ef ("uio: introduce UIO_MEM_DMA_COHERENT type")
      Fixes: 7722151e ("uio_pruss: UIO_MEM_DMA_COHERENT conversion")
      Fixes: 01994780 ("uio_dmem_genirq: UIO_MEM_DMA_COHERENT conversion")
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Chris Leech <cleech@redhat.com>
      Cc: Nilesh Javali <njavali@marvell.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Linus Torvalds <torvalds@linuxfoundation.org>
    • Fix memory leak in posix_clock_open() · 5b4cdd9c
      Linus Torvalds authored
      If the clk ops.open() function returns an error, we don't release the
      pccontext we allocated for this clock.
      
      Re-organize the code slightly to make it all more obvious.
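
The shape of the fix can be sketched in userspace as follows. Every name here (`sketch_clock`, `sketch_ctx`, `sketch_clock_open()`, the `live_contexts` counter) is a hypothetical stand-in; this is not the kernel function, only the allocate, call, release-on-error pattern the message describes.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the clock and its per-open context. */
struct sketch_ctx { int unused; };
struct sketch_clock { int (*open)(struct sketch_ctx *ctx); };

/* Counts live contexts so that a leak would be observable. */
static int live_contexts;

/* An open callback that always fails, used to exercise the error path. */
static int sketch_open_fails(struct sketch_ctx *ctx)
{
	(void)ctx;
	return -EIO;
}

static int sketch_clock_open(struct sketch_clock *clk, struct sketch_ctx **out)
{
	struct sketch_ctx *ctx;
	int err = 0;

	assert(clk && out);
	ctx = calloc(1, sizeof(*ctx));
	if (!ctx)
		return -ENOMEM;
	live_contexts++;

	if (clk->open)
		err = clk->open(ctx);

	if (err) {
		/* The error path must release what was allocated above. */
		free(ctx);
		live_contexts--;
		return err;
	}

	*out = ctx;
	return 0;
}
```

Keeping the allocation and its release in the same function makes the ownership rule easy to check by eye, which is the point of the reorganization.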
      Reported-by: Rohit Keshri <rkeshri@redhat.com>
      Acked-by: Oleg Nesterov <oleg@redhat.com>
      Fixes: 60c69466 ("posix-clock: introduce posix_clock_context concept")
      Cc: Jakub Kicinski <kuba@kernel.org>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Linus Torvalds <torvalds@linuxfoundation.org>
  2. 26 Mar, 2024 32 commits
    • crash: use macro to add crashk_res into iomem early for specific arch · 32fbe524
      Baoquan He authored
      There are regression reports[1][2] that the crashkernel region on x86_64
      sometimes can't be added into the iomem tree.  This causes kdump loading
      to fail later.
      
      This happened after commit 4a693ce6 ("kdump: defer the insertion of
      crashkernel resources") was merged.
      
      Even though these reported issues proved to be related to other
      components and were merely exposed once the above commit was applied, I
      would still like to keep crashk_res and crashk_low_res being added into
      iomem early as before, because the early adding has always been there on
      x86_64 and has worked very well.  For the safety of kdump, let's change
      it back.
      
      Here, add a macro HAVE_ARCH_ADD_CRASH_RES_TO_IOMEM_EARLY so that only
      an ARCH defining the macro adds crashk_res/_low_res into iomem early.
      Then define HAVE_ARCH_ADD_CRASH_RES_TO_IOMEM_EARLY on x86 to enable it.
      
      Note: In reserve_crashkernel_low(), there's a remnant of crashk_low_res
      handling which was mistakenly added back in commit 85fcde40 ("kexec:
      split crashkernel reservation code out from crash_core.c").
      
      [1]
      [PATCH V2] x86/kexec: do not update E820 kexec table for setup_data
      https://lore.kernel.org/all/Zfv8iCL6CT2JqLIC@darkstar.users.ipa.redhat.com/T/#u
      
      [2]
      Question about Address Range Validation in Crash Kernel Allocation
      https://lore.kernel.org/all/4eeac1f733584855965a2ea62fa4da58@huawei.com/T/#u
      
      Link: https://lkml.kernel.org/r/ZgDYemRQ2jxjLkq+@MiWiFi-R3L-srv
      Fixes: 4a693ce6 ("kdump: defer the insertion of crashkernel resources")
      Signed-off-by: Baoquan He <bhe@redhat.com>
      Cc: Dave Young <dyoung@redhat.com>
      Cc: Huacai Chen <chenhuacai@loongson.cn>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Jiri Bohac <jbohac@suse.cz>
      Cc: Li Huafei <lihuafei1@huawei.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: zswap: fix data loss on SWP_SYNCHRONOUS_IO devices · 25cd2414
      Johannes Weiner authored
      Zhongkun He reports data corruption when combining zswap with zram.
      
      The issue is the exclusive loads we're doing in zswap. They assume
      that all reads are going into the swapcache, which can assume
      authoritative ownership of the data and so the zswap copy can go.
      
      However, zram files are marked SWP_SYNCHRONOUS_IO, and faults will try to
      bypass the swapcache.  This results in an optimistic read of the swap data
      into a page that will be dismissed if the fault fails due to races.  In
      this case, zswap mustn't drop its authoritative copy.
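
The ownership rule described above can be modelled with a small sketch. The struct and function names are hypothetical and the model ignores compression entirely; it only shows when the zswap copy may be dropped.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of one compressed entry held by zswap. */
struct sketch_entry { bool present; };

/*
 * Load an entry into a page. Only a read that lands in the swapcache
 * takes ownership of the data, so only then may the zswap copy go.
 * A read that bypasses the swapcache is optimistic and may be thrown
 * away, so the entry has to stay authoritative.
 */
static bool sketch_load(struct sketch_entry *e, bool into_swapcache)
{
	assert(e);
	if (!e->present)
		return false;		/* nothing stored for this slot */
	if (into_swapcache)
		e->present = false;	/* exclusive load: drop the copy */
	return true;
}
```

A bypassing read followed by a retried fault therefore still finds the data, which is the case that previously lost it.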
      
      Link: https://lore.kernel.org/all/CACSyD1N+dUvsu8=zV9P691B9bVq33erwOXNTmEaUbi9DrDeJzw@mail.gmail.com/
      Fixes: b9c91c43 ("mm: zswap: support exclusive loads")
      Link: https://lkml.kernel.org/r/20240324210447.956973-1-hannes@cmpxchg.org
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reported-by: Zhongkun He <hezhongkun.hzk@bytedance.com>
      Tested-by: Zhongkun He <hezhongkun.hzk@bytedance.com>
      Acked-by: Yosry Ahmed <yosryahmed@google.com>
      Acked-by: Barry Song <baohua@kernel.org>
      Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev>
      Reviewed-by: Nhat Pham <nphamcs@gmail.com>
      Acked-by: Chris Li <chrisl@kernel.org>
      Cc: <stable@vger.kernel.org>	[6.5+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • selftests/mm: fix ARM related issue with fork after pthread_create · 8c864371
      Edward Liaw authored
      The following issue was observed while running the uffd-unit-tests selftest
      on ARM devices.  On x86_64 no issues were detected:
      
      pthread_create followed by fork caused a deadlock in certain cases wherein
      fork required some work to be completed by the created thread.  Use
      synchronization to ensure that the created thread's start function has
      started before invoking fork.
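
A minimal userspace sketch of that synchronization, assuming C11 atomics and POSIX threads. The function and variable names are hypothetical and this is not the selftest's code; it only shows a parent that waits on an `atomic_bool` before calling fork().

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static atomic_bool thread_ready;

static void *sketch_thread_fn(void *arg)
{
	(void)arg;
	/* Signal that the start function is running. */
	atomic_store(&thread_ready, true);
	return NULL;
}

/* Returns the child's exit status, or -1 on failure. */
static int sketch_spawn_then_fork(void)
{
	pthread_t t;
	pid_t pid;
	int status = -1;

	atomic_store(&thread_ready, false);
	if (pthread_create(&t, NULL, sketch_thread_fn, NULL))
		return -1;

	/* Do not fork until the new thread has actually started. */
	while (!atomic_load(&thread_ready))
		sched_yield();
	assert(atomic_load(&thread_ready));

	pid = fork();
	if (pid == 0)
		_exit(0);	/* child: nothing to do in this sketch */

	if (pid > 0 && waitpid(pid, &status, 0) == pid && WIFEXITED(status))
		status = WEXITSTATUS(status);
	else
		status = -1;

	pthread_join(t, NULL);
	return status;
}
```

Spinning with sched_yield() keeps the sketch short; a condition variable or a barrier would serve the same purpose.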
      
      [edliaw@google.com: refactored to use atomic_bool]
      Link: https://lkml.kernel.org/r/20240325194100.775052-1-edliaw@google.com
      Fixes: 760aee0b ("selftests/mm: add tests for RO pinning vs fork()")
      Signed-off-by: Lokesh Gidra <lokeshgidra@google.com>
      Signed-off-by: Edward Liaw <edliaw@google.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • hexagon: vmlinux.lds.S: handle attributes section · 549aa967
      Nathan Chancellor authored
      After the linked LLVM change, the build fails with
      CONFIG_LD_ORPHAN_WARN_LEVEL="error", which happens with allmodconfig:
      
        ld.lld: error: vmlinux.a(init/main.o):(.hexagon.attributes) is being placed in '.hexagon.attributes'
      
      Handle the attributes section in a similar manner as arm and riscv by
      adding it after the primary ELF_DETAILS grouping in vmlinux.lds.S, which
      fixes the error.
      
      Link: https://lkml.kernel.org/r/20240319-hexagon-handle-attributes-section-vmlinux-lds-s-v1-1-59855dab8872@kernel.org
      Fixes: 113616ec ("hexagon: select ARCH_WANT_LD_ORPHAN_WARN")
      Link: https://github.com/llvm/llvm-project/commit/31f4b329c8234fab9afa59494d7f8bdaeaefeaad
      Signed-off-by: Nathan Chancellor <nathan@kernel.org>
      Reviewed-by: Brian Cain <bcain@quicinc.com>
      Cc: Bill Wendling <morbo@google.com>
      Cc: Justin Stitt <justinstitt@google.com>
      Cc: Nick Desaulniers <ndesaulniers@google.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • userfaultfd: fix deadlock warning when locking src and dst VMAs · 30af24fa
      Lokesh Gidra authored
      Use down_read_nested() to avoid the warning.
      
      Link: https://lkml.kernel.org/r/20240321235818.125118-1-lokeshgidra@google.com
      Fixes: 867a43a3 ("userfaultfd: use per-vma locks in userfaultfd operations")
      Reported-by: syzbot+49056626fe41e01f2ba7@syzkaller.appspotmail.com
      Signed-off-by: Lokesh Gidra <lokeshgidra@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: Jann Horn <jannh@google.com> [Bug #2]
      Cc: Kalesh Singh <kaleshsingh@google.com>
      Cc: Lokesh Gidra <lokeshgidra@google.com>
      Cc: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: Nicolas Geoffray <ngeoffray@google.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • tmpfs: fix race on handling dquot rbtree · 0a69b6b3
      Carlos Maiolino authored
      A syzkaller reproducer found a race while attempting to remove dquot
      information from the rb tree.
      
      Fetching the rb_tree root node must also be protected by the
      dqopt->dqio_sem, otherwise, given the right timing, shmem_release_dquot()
      will trigger a warning because it couldn't find a node in the tree, when
      the real reason was the root node changing before the search starts:
      
      Thread 1				Thread 2
      - shmem_release_dquot()			- shmem_{acquire,release}_dquot()
      
      - fetch ROOT				- Fetch ROOT
      
      					- acquire dqio_sem
      - wait dqio_sem
      
      					- do something, trigger a tree rebalance
      					- release dqio_sem
      
      - acquire dqio_sem
      - start searching for the node, but
        from the wrong location, missing
        the node, and triggering a warning.
      
      Link: https://lkml.kernel.org/r/20240320124011.398847-1-cem@kernel.org
      Fixes: eafc474e ("shmem: prepare shmem quota infrastructure")
      Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
      Reported-by: Ubisectech Sirius <bugreport@ubisectech.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • selftests/mm: sigbus-wp test requires UFFD_FEATURE_WP_HUGETLBFS_SHMEM · 105840eb
      Edward Liaw authored
      The sigbus-wp test requires the UFFD_FEATURE_WP_HUGETLBFS_SHMEM flag for
      shmem and hugetlb targets.  Otherwise it is not backwards compatible with
      kernels <5.19 and fails with EINVAL.
      
      Link: https://lkml.kernel.org/r/20240321232023.2064975-1-edliaw@google.com
      Fixes: 73c1ea93 ("selftests/mm: move uffd sig/events tests into uffd unit tests")
      Signed-off-by: Edward Liaw <edliaw@google.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: zswap: fix writeback shinker GFP_NOIO/GFP_NOFS recursion · 30fb6a8d
      Johannes Weiner authored
      Kent forwards this bug report of zswap re-entering the block layer
      from an IO request allocation and locking up:
      
      [10264.128242] sysrq: Show Blocked State
      [10264.128268] task:kworker/20:0H   state:D stack:0     pid:143   tgid:143   ppid:2      flags:0x00004000
      [10264.128271] Workqueue: bcachefs_io btree_write_submit [bcachefs]
      [10264.128295] Call Trace:
      [10264.128295]  <TASK>
      [10264.128297]  __schedule+0x3e6/0x1520
      [10264.128303]  schedule+0x32/0xd0
      [10264.128304]  schedule_timeout+0x98/0x160
      [10264.128308]  io_schedule_timeout+0x50/0x80
      [10264.128309]  wait_for_completion_io_timeout+0x7f/0x180
      [10264.128310]  submit_bio_wait+0x78/0xb0
      [10264.128313]  swap_writepage_bdev_sync+0xf6/0x150
      [10264.128317]  zswap_writeback_entry+0xf2/0x180
      [10264.128319]  shrink_memcg_cb+0xe7/0x2f0
      [10264.128322]  __list_lru_walk_one+0xb9/0x1d0
      [10264.128325]  list_lru_walk_one+0x5d/0x90
      [10264.128326]  zswap_shrinker_scan+0xc4/0x130
      [10264.128327]  do_shrink_slab+0x13f/0x360
      [10264.128328]  shrink_slab+0x28e/0x3c0
      [10264.128329]  shrink_one+0x123/0x1b0
      [10264.128331]  shrink_node+0x97e/0xbc0
      [10264.128332]  do_try_to_free_pages+0xe7/0x5b0
      [10264.128333]  try_to_free_pages+0xe1/0x200
      [10264.128334]  __alloc_pages_slowpath.constprop.0+0x343/0xde0
      [10264.128337]  __alloc_pages+0x32d/0x350
      [10264.128338]  allocate_slab+0x400/0x460
      [10264.128339]  ___slab_alloc+0x40d/0xa40
      [10264.128345]  kmem_cache_alloc+0x2e7/0x330
      [10264.128348]  mempool_alloc+0x86/0x1b0
      [10264.128349]  bio_alloc_bioset+0x200/0x4f0
      [10264.128352]  bio_alloc_clone+0x23/0x60
      [10264.128354]  alloc_io+0x26/0xf0 [dm_mod 7e9e6b44df4927f93fb3e4b5c782767396f58382]
      [10264.128361]  dm_submit_bio+0xb8/0x580 [dm_mod 7e9e6b44df4927f93fb3e4b5c782767396f58382]
      [10264.128366]  __submit_bio+0xb0/0x170
      [10264.128367]  submit_bio_noacct_nocheck+0x159/0x370
      [10264.128368]  bch2_submit_wbio_replicas+0x21c/0x3a0 [bcachefs 85f1b9a7a824f272eff794653a06dde1a94439f2]
      [10264.128391]  btree_write_submit+0x1cf/0x220 [bcachefs 85f1b9a7a824f272eff794653a06dde1a94439f2]
      [10264.128406]  process_one_work+0x178/0x350
      [10264.128408]  worker_thread+0x30f/0x450
      [10264.128409]  kthread+0xe5/0x120
      
      The zswap shrinker resumes the swap_writepage()s that were intercepted
      by the zswap store. This will enter the block layer, and may even
      enter the filesystem depending on the swap backing file.
      
      Make it respect GFP_NOIO and GFP_NOFS.
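
The check can be sketched as below. The flag bits and the function are hypothetical stand-ins (the kernel's real GFP bit values and shrinker interface differ); the sketch only shows a scan routine that declines to write back unless the caller's allocation context permits both IO and filesystem entry.

```c
#include <assert.h>

/* Hypothetical flag bits standing in for the IO and FS permissions. */
#define SKETCH_GFP_IO	0x1u
#define SKETCH_GFP_FS	0x2u

/* Hypothetical shrinker scan: returns the number of entries written back. */
static unsigned long sketch_shrinker_scan(unsigned int gfp_mask,
					  unsigned long nr_to_scan)
{
	const unsigned int need = SKETCH_GFP_IO | SKETCH_GFP_FS;

	/*
	 * Writeback enters the block layer and possibly the filesystem.
	 * If the allocation context forbids either, do nothing rather
	 * than recurse into the layer that is currently allocating.
	 */
	if ((gfp_mask & need) != need)
		return 0;

	return nr_to_scan;	/* pretend every scanned entry was written back */
}
```

In the trace above the allocation came from inside an IO submission path, which is the kind of context where such a check would return early.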
      
      Link: https://lore.kernel.org/linux-mm/rc4pk2r42oyvjo4dc62z6sovquyllq56i5cdgcaqbd7wy3hfzr@n4nbxido3fme/
      Link: https://lkml.kernel.org/r/20240321182532.60000-1-hannes@cmpxchg.org
      Fixes: b5ba474f ("zswap: shrink zswap pool based on memory pressure")
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reported-by: Kent Overstreet <kent.overstreet@linux.dev>
      Acked-by: Yosry Ahmed <yosryahmed@google.com>
      Reported-by: Jérôme Poulin <jeromepoulin@gmail.com>
      Reviewed-by: Nhat Pham <nphamcs@gmail.com>
      Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev>
      Cc: stable@vger.kernel.org	[v6.8]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • ARM: prctl: reject PR_SET_MDWE on pre-ARMv6 · 166ce846
      Zev Weiss authored
      On v5 and lower CPUs we can't provide MDWE protection, so ensure we fail
      any attempt to enable it via prctl(PR_SET_MDWE).
      
      Previously such an attempt would misleadingly succeed, leading to any
      subsequent mmap(PROT_READ|PROT_WRITE) or execve() failing unconditionally
      (the latter somewhat violently via force_fatal_sig(SIGSEGV) due to
      READ_IMPLIES_EXEC).
      
      Link: https://lkml.kernel.org/r/20240227013546.15769-6-zev@bewilderbeest.net
      Signed-off-by: Zev Weiss <zev@bewilderbeest.net>
      Cc: <stable@vger.kernel.org>	[6.3+]
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Florent Revest <revest@chromium.org>
      Cc: Helge Deller <deller@gmx.de>
      Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
      Cc: Josh Triplett <josh@joshtriplett.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Miguel Ojeda <ojeda@kernel.org>
      Cc: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Ondrej Mosnacek <omosnace@redhat.com>
      Cc: Rick Edgecombe <rick.p.edgecombe@intel.com>
      Cc: Russell King (Oracle) <linux@armlinux.org.uk>
      Cc: Sam James <sam@gentoo.org>
      Cc: Stefan Roesch <shr@devkernel.io>
      Cc: Yang Shi <yang@os.amperecomputing.com>
      Cc: Yin Fengwei <fengwei.yin@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • prctl: generalize PR_SET_MDWE support check to be per-arch · d5aad4c2
      Zev Weiss authored
      Patch series "ARM: prctl: Reject PR_SET_MDWE where not supported".
      
      I noticed after a recent kernel update that my ARM926 system started
      segfaulting on any execve() after calling prctl(PR_SET_MDWE).  After some
      investigation it appears that ARMv5 is incapable of providing the
      appropriate protections for MDWE, since any readable memory is also
      implicitly executable.
      
      The prctl_set_mdwe() function already had some special-case logic added
      disabling it on PARISC (commit 79383813, "prctl: Disable
      prctl(PR_SET_MDWE) on parisc"); this patch series (1) generalizes that
      check to use an arch_*() function, and (2) adds a corresponding override
      for ARM to disable MDWE on pre-ARMv6 CPUs.
      
      With the series applied, prctl(PR_SET_MDWE) is rejected on ARMv5 and
      subsequent execve() calls (as well as mmap(PROT_READ|PROT_WRITE)) can
      succeed instead of unconditionally failing; on ARMv6 the prctl works as it
      did previously.
      
      [0] https://lore.kernel.org/all/2023112456-linked-nape-bf19@gregkh/
      
      
      This patch (of 2):
      
      There exist systems other than PARISC where MDWE may not be feasible to
      support; rather than cluttering up the generic code with additional
      arch-specific logic let's add a generic function for checking MDWE support
      and allow each arch to override it as needed.
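
One common way to express "a generic function that each arch may override" in C is a default guarded by a macro of the same name. The sketch below uses hypothetical names (`sketch_arch_mdwe_supported`, `sketch_set_mdwe`) and is not the patch itself; it only illustrates the pattern.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/*
 * Generic default. An architecture that cannot provide the protection
 * would define its own function plus a macro of the same name before
 * this point, which makes the preprocessor skip the default below.
 */
#ifndef sketch_arch_mdwe_supported
static inline bool sketch_arch_mdwe_supported(void)
{
	return true;
}
#define sketch_arch_mdwe_supported sketch_arch_mdwe_supported
#endif

/* Hypothetical generic entry point, free of arch-specific logic. */
static int sketch_set_mdwe(void)
{
	if (!sketch_arch_mdwe_supported())
		return -EINVAL;
	return 0;
}
```

The generic code then carries a single call site, and adding another unsupported architecture does not touch it.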
      
      Link: https://lkml.kernel.org/r/20240227013546.15769-4-zev@bewilderbeest.net
      Link: https://lkml.kernel.org/r/20240227013546.15769-5-zev@bewilderbeest.net
      Signed-off-by: Zev Weiss <zev@bewilderbeest.net>
      Acked-by: Helge Deller <deller@gmx.de>	[parisc]
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Florent Revest <revest@chromium.org>
      Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
      Cc: Josh Triplett <josh@joshtriplett.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Miguel Ojeda <ojeda@kernel.org>
      Cc: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Ondrej Mosnacek <omosnace@redhat.com>
      Cc: Rick Edgecombe <rick.p.edgecombe@intel.com>
      Cc: Russell King (Oracle) <linux@armlinux.org.uk>
      Cc: Sam James <sam@gentoo.org>
      Cc: Stefan Roesch <shr@devkernel.io>
      Cc: Yang Shi <yang@os.amperecomputing.com>
      Cc: Yin Fengwei <fengwei.yin@intel.com>
      Cc: <stable@vger.kernel.org>	[6.3+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • MAINTAINERS: remove incorrect M: tag for dm-devel@lists.linux.dev · db09f2df
      Kuan-Wei Chiu authored
      The dm-devel@lists.linux.dev mailing list should only be listed under the
      L: (List) tag in the MAINTAINERS file.  However, it was incorrectly listed
      under both L: and M: (Maintainers) tags, which is not accurate.  Remove
      the M: tag for dm-devel@lists.linux.dev in the MAINTAINERS file to reflect
      the correct categorization.
      
      Link: https://lkml.kernel.org/r/20240319181842.249547-1-visitorckw@gmail.com
      Signed-off-by: Kuan-Wei Chiu <visitorckw@gmail.com>
      Cc: Ching-Chun (Jim) Huang <jserv@ccns.ncku.edu.tw>
      Cc: Matthew Sakai <msakai@redhat.com>
      Cc: Michael Sclafani <dm-devel@lists.linux.dev>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: zswap: fix kernel BUG in sg_init_one · 9c500835
      Barry Song authored
      sg_init_one() relies on linearly mapped low memory for the safe
      utilization of virt_to_page().  Otherwise, we trigger a kernel BUG,
      
      kernel BUG at include/linux/scatterlist.h:187!
      Internal error: Oops - BUG: 0 [#1] PREEMPT SMP ARM
      Modules linked in:
      CPU: 0 PID: 2997 Comm: syz-executor198 Not tainted 6.8.0-syzkaller #0
      Hardware name: ARM-Versatile Express
      PC is at sg_set_buf include/linux/scatterlist.h:187 [inline]
      PC is at sg_init_one+0x9c/0xa8 lib/scatterlist.c:143
      LR is at sg_init_table+0x2c/0x40 lib/scatterlist.c:128
      Backtrace:
      [<807e16ac>] (sg_init_one) from [<804c1824>] (zswap_decompress+0xbc/0x208 mm/zswap.c:1089)
       r7:83471c80 r6:def6d08c r5:844847d0 r4:ff7e7ef4
      [<804c1768>] (zswap_decompress) from [<804c4468>] (zswap_load+0x15c/0x198 mm/zswap.c:1637)
       r9:8446eb80 r8:8446eb80 r7:8446eb84 r6:def6d08c r5:00000001 r4:844847d0
      [<804c430c>] (zswap_load) from [<804b9644>] (swap_read_folio+0xa8/0x498 mm/page_io.c:518)
       r9:844ac800 r8:835e6c00 r7:00000000 r6:df955d4c r5:00000001 r4:def6d08c
      [<804b959c>] (swap_read_folio) from [<804bb064>] (swap_cluster_readahead+0x1c4/0x34c mm/swap_state.c:684)
       r10:00000000 r9:00000007 r8:df955d4b r7:00000000 r6:00000000 r5:00100cca
       r4:00000001
      [<804baea0>] (swap_cluster_readahead) from [<804bb3b8>] (swapin_readahead+0x68/0x4a8 mm/swap_state.c:904)
       r10:df955eb8 r9:00000000 r8:00100cca r7:84476480 r6:00000001 r5:00000000
       r4:00000001
      [<804bb350>] (swapin_readahead) from [<8047cde0>] (do_swap_page+0x200/0xcc4 mm/memory.c:4046)
       r10:00000040 r9:00000000 r8:844ac800 r7:84476480 r6:00000001 r5:00000000
       r4:df955eb8
      [<8047cbe0>] (do_swap_page) from [<8047e6c4>] (handle_pte_fault mm/memory.c:5301 [inline])
      [<8047cbe0>] (do_swap_page) from [<8047e6c4>] (__handle_mm_fault mm/memory.c:5439 [inline])
      [<8047cbe0>] (do_swap_page) from [<8047e6c4>] (handle_mm_fault+0x3d8/0x12b8 mm/memory.c:5604)
       r10:00000040 r9:842b3900 r8:7eb0d000 r7:84476480 r6:7eb0d000 r5:835e6c00
       r4:00000254
      [<8047e2ec>] (handle_mm_fault) from [<80215d28>] (do_page_fault+0x148/0x3a8 arch/arm/mm/fault.c:326)
       r10:00000007 r9:842b3900 r8:7eb0d000 r7:00000207 r6:00000254 r5:7eb0d9b4
       r4:df955fb0
      [<80215be0>] (do_page_fault) from [<80216170>] (do_DataAbort+0x38/0xa8 arch/arm/mm/fault.c:558)
       r10:7eb0da7c r9:00000000 r8:80215be0 r7:df955fb0 r6:7eb0d9b4 r5:00000207
       r4:8261d0e0
      [<80216138>] (do_DataAbort) from [<80200e3c>] (__dabt_usr+0x5c/0x60 arch/arm/kernel/entry-armv.S:427)
      Exception stack(0xdf955fb0 to 0xdf955ff8)
      5fa0:                                     00000000 00000000 22d5f800 0008d158
      5fc0: 00000000 7eb0d9a4 00000000 00000109 00000000 00000000 7eb0da7c 7eb0da3c
      5fe0: 00000000 7eb0d9a0 00000001 00066bd4 00000010 ffffffff
       r8:824a9044 r7:835e6c00 r6:ffffffff r5:00000010 r4:00066bd4
      Code: 1a000004 e1822003 e8860094 e89da8f0 (e7f001f2)
      ---[ end trace 0000000000000000 ]---
      ----------------
      Code disassembly (best guess):
         0:	1a000004 	bne	0x18
         4:	e1822003 	orr	r2, r2, r3
         8:	e8860094 	stm	r6, {r2, r4, r7}
         c:	e89da8f0 	ldm	sp, {r4, r5, r6, r7, fp, sp, pc}
      * 10:	e7f001f2 	udf	#18 <-- trapping instruction
      
      Consequently, we have two choices: either employ kmap_to_page() alongside
      sg_set_page(), or resort to copying high memory contents to a temporary
      buffer residing in low memory.  However, considering the introduction of
      the WARN_ON_ONCE in commit ef6e06b2 ("highmem: fix kmap_to_page() for
      kmap_local_page() addresses"), which specifically addresses high memory
      concerns, it appears that memcpy remains the sole viable option.
      
      Link: https://lkml.kernel.org/r/20240318234706.95347-1-21cnbao@gmail.com
      Fixes: 270700dd ("mm/zswap: remove the memcpy if acomp is not sleepable")
      Signed-off-by: Barry Song <v-songbaohua@oppo.com>
      Reported-by: syzbot+adbc983a1588b7805de3@syzkaller.appspotmail.com
      Closes: https://lore.kernel.org/all/000000000000bbb3d80613f243a6@google.com/
      Tested-by: syzbot+adbc983a1588b7805de3@syzkaller.appspotmail.com
      Acked-by: Yosry Ahmed <yosryahmed@google.com>
      Reviewed-by: Nhat Pham <nphamcs@gmail.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Chris Li <chrisl@kernel.org>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • selftests: mm: restore settings from only parent process · c52eb6db
      Muhammad Usama Anjum authored
      The atexit() handler runs in the parent process as well as in forked
      processes.  Hence a child restores the settings when it exits while the
      parent is still executing.  Fix this by checking the pid of the process
      running the atexit() handler and only restoring the THP number from the
      parent process.
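
A minimal sketch of the pid check, assuming POSIX getpid() and C atexit(). The names `sketch_setup()`, `sketch_restore_settings()` and the `restore_count` counter are hypothetical; the real test writes a saved value back, which is replaced here by a counter.

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

static pid_t parent_pid;
static int restore_count;	/* how many times settings were restored */

/* Hypothetical restore routine standing in for the real settings write. */
static void sketch_restore_settings(void)
{
	/* Forked children inherit the handler; only the parent restores. */
	if (getpid() != parent_pid)
		return;
	restore_count++;
}

/* Record the parent's pid and register the exit handler. */
static int sketch_setup(void)
{
	parent_pid = getpid();
	assert(parent_pid > 0);
	return atexit(sketch_restore_settings);
}
```

Because a forked child gets a copy of `parent_pid`, its own getpid() no longer matches and the handler returns without touching the settings.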
      
      Link: https://lkml.kernel.org/r/20240314094045.157149-1-usama.anjum@collabora.com
      Fixes: c23ea617 ("selftests/mm: protection_keys: save/restore nr_hugepages settings")
      Signed-off-by: Muhammad Usama Anjum <usama.anjum@collabora.com>
      Tested-by: Joey Gouly <joey.gouly@arm.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      c52eb6db
    • Cong Liu's avatar
      tools/Makefile: remove cgroup target · 950bf45d
      Cong Liu authored
      The tools/cgroup directory no longer contains a Makefile.  This patch
      updates the top-level tools/Makefile to remove references to building and
      installing cgroup components.  This change reflects the current structure
      of the tools directory and fixes the build failure when building tools in
      the top-level directory.
      
      linux/tools$ make cgroup
        DESCEND cgroup
      make[1]: *** No targets specified and no makefile found.  Stop.
      make: *** [Makefile:73: cgroup] Error 2
      
      Link: https://lkml.kernel.org/r/20240315012249.439639-1-liucong2@kylinos.cn
      Signed-off-by: Cong Liu <liucong2@kylinos.cn>
      Acked-by: Stanislav Fomichev <sdf@google.com>
      Reviewed-by: Dmitry Rokosov <ddrokosov@salutedevices.com>
      Cc: Cong Liu <liucong2@kylinos.cn>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      950bf45d
    • Johannes Weiner's avatar
      mm: cachestat: fix two shmem bugs · d5d39c70
      Johannes Weiner authored
      When cachestat on shmem races with swapping and invalidation, there
      are two possible bugs:
      
      1) A swapin error can have resulted in a poisoned swap entry in the
         shmem inode's xarray. Calling get_shadow_from_swap_cache() on it
         will result in an out-of-bounds access to swapper_spaces[].
      
         Validate the entry with non_swap_entry() before going further.
      
      2) When we find a valid swap entry in the shmem's inode, the shadow
         entry in the swapcache might not exist yet: swap IO is still in
         progress and we're before __remove_mapping; swapin, invalidation,
         or swapoff have removed the shadow from swapcache after we saw the
         shmem swap entry.
      
         This will send a NULL to workingset_test_recent(). The latter
         purely operates on pointer bits, so it won't crash - node 0, memcg
         ID 0, eviction timestamp 0, etc. are all valid inputs - but it's a
         bogus test. In theory that could result in a false "recently
         evicted" count.
      
         Such a false positive wouldn't be the end of the world. But for
         code clarity and (future) robustness, be explicit about this case.
      
         Bail on get_shadow_from_swap_cache() returning NULL.
      
      Link: https://lkml.kernel.org/r/20240315095556.GC581298@cmpxchg.org
      Fixes: cf264e13 ("cachestat: implement cachestat syscall")
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reported-by: Chengming Zhou <chengming.zhou@linux.dev>	[Bug #1]
      Reported-by: Jann Horn <jannh@google.com>		[Bug #2]
      Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev>
      Reviewed-by: Nhat Pham <nphamcs@gmail.com>
      Cc: <stable@vger.kernel.org>				[v6.5+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      d5d39c70
    • Matthew Wilcox (Oracle)'s avatar
      mm: increase folio batch size · 9cecde80
      Matthew Wilcox (Oracle) authored
      On a 104 thread, 2 socket Skylake system, Intel report a 4.7% performance
      reduction with will-it-scale page_fault2.  This was due to reducing the
      size of the batch from 32 to 15.  Increasing the folio batch size from 15
      to 31 gives a performance increase of 12.5% relative to the original, or
      17.2% relative to the reduced performance commit.
      
      The penalty of this commit is an additional 128 bytes of stack usage.  Six
      folio_batches are also allocated from percpu memory in cpu_fbatches so
      that will be an additional 768 bytes of percpu memory (per CPU).  Tim Chen
      originally submitted a patch like this in 2020:
      https://lore.kernel.org/linux-mm/d1cc9f12a8ad6c2a52cb600d93b06b064f2bbc57.1593205965.git.tim.c.chen@linux.intel.com/
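
The figures above can be checked with a little arithmetic: 16 extra folio pointers at 8 bytes each come to 128 bytes per batch, and six per-CPU batches come to 768 bytes. A sketch, assuming a 64-bit build and a simplified stand-in for struct folio_batch (three one-byte header fields followed by the pointer array). The macro and function names are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

struct folio; /* opaque: only pointers to it are stored */

/* Simplified stand-in for struct folio_batch with a given number of slots. */
#define DEFINE_BATCH(name, slots)		\
	struct name {				\
		unsigned char nr;		\
		unsigned char i;		\
		bool percpu_pvec_drained;	\
		struct folio *folios[slots];	\
	}

DEFINE_BATCH(batch15, 15);
DEFINE_BATCH(batch31, 31);

/* Extra bytes one batch costs when it grows from 15 to 31 slots. */
static size_t extra_bytes_per_batch(void)
{
	return sizeof(struct batch31) - sizeof(struct batch15);
}

/* Extra per-CPU bytes when nbatches such batches live in percpu memory. */
static size_t extra_percpu_bytes(size_t nbatches)
{
	return nbatches * extra_bytes_per_batch();
}
```

With 8-byte pointers, extra_bytes_per_batch() is 16 * 8 = 128 and extra_percpu_bytes(6) is 768, matching the numbers quoted in the commit message.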
      
      Link: https://lkml.kernel.org/r/20240315140823.2478146-1-willy@infradead.org
      Fixes: 99fbb6bf ("mm: make folios_put() the basis of release_pages()")
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Tested-by: Yujie Liu <yujie.liu@intel.com>
      Reported-by: kernel test robot <oliver.sang@intel.com>
      Closes: https://lore.kernel.org/oe-lkp/202403151058.7048f6a8-oliver.sang@intel.com
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      9cecde80
    • Oscar Salvador's avatar
      mm,page_owner: fix recursion · 7844c014
      Oscar Salvador authored
      Prior to 217b2119 ("mm,page_owner: implement the tracking of the
      stacks count") the only place where page_owner could potentially go into
      recursion due to its need of allocating more memory was in save_stack(),
      which ends up calling into stackdepot code with the possibility of
      allocating memory.
      
      We made sure to guard against that by signaling that the current task was
      already in page_owner code, so in case a recursion attempt was made, we
      could catch that and return dummy_handle.
      
      After the above commit, a new place in page_owner code was introduced where
      we could allocate memory, meaning we could go into recursion if we took
      that path.
      
      Make sure to signal that we are in page_owner in that codepath as well. 
      Move the guard code into two helpers {un}set_current_in_page_owner() and
      use them prior to calling in the two functions that might allocate memory.
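
The guard pattern described above can be modeled in userspace roughly as follows. This is only a sketch: the thread-local flag stands in for a per-task marker, and every name in it (track_allocation(), allocate_bookkeeping(), the handle values) is hypothetical rather than taken from the kernel source.

```c
#include <stdbool.h>

#define DUMMY_HANDLE 0 /* returned when a recursive entry is detected */
#define REAL_HANDLE  1 /* returned by a normal, outermost call */

static _Thread_local bool in_tracker; /* models a per-task "in tracker" bit */
static int nested_attempts;           /* counts recursion attempts caught */

static void set_current_in_tracker(void)
{
	in_tracker = true;
}

static void unset_current_in_tracker(void)
{
	in_tracker = false;
}

static int track_allocation(void);

/* Stands in for a helper that may allocate and so re-enter the tracker. */
static void allocate_bookkeeping(void)
{
	track_allocation();
}

/* Tracks one allocation; a nested call is cut short with a dummy handle. */
static int track_allocation(void)
{
	if (in_tracker) {
		nested_attempts++;
		return DUMMY_HANDLE;
	}

	set_current_in_tracker();
	allocate_bookkeeping(); /* would recurse forever without the guard */
	unset_current_in_tracker();
	return REAL_HANDLE;
}
```

Wrapping every path that may allocate in the same pair of helpers is what lets a single flag cover all of them.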
      
      Link: https://lkml.kernel.org/r/20240315222610.6870-1-osalvador@suse.de
      Signed-off-by: Oscar Salvador <osalvador@suse.de>
      Fixes: 217b2119 ("mm,page_owner: implement the tracking of the stacks count")
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Andrey Konovalov <andreyknvl@gmail.com>
      Cc: Marco Elver <elver@google.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      7844c014
    • Leonard Crestez's avatar
      mailmap: update entry for Leonard Crestez · 32900324
      Leonard Crestez authored
      Put my personal email first because NXP employment ended some time ago.
      Also add my old intel email address.
      
      Link: https://lkml.kernel.org/r/f568faa0-2380-4e93-a312-b80c1e367645@gmail.com
      Signed-off-by: Leonard Crestez <cdleonard@gmail.com>
      Cc: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      32900324
    • John Sperbeck's avatar
      init: open /initrd.image with O_LARGEFILE · 4624b346
      John Sperbeck authored
      If initrd data is larger than 2GB, we'll eventually fail to write to the
      /initrd.image file when we hit that limit, unless O_LARGEFILE is set.
      
      Link: https://lkml.kernel.org/r/20240317221522.896040-1-jsperbeck@google.com
      Signed-off-by: John Sperbeck <jsperbeck@google.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Nick Desaulniers <ndesaulniers@google.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      4624b346
    • Vitaly Chikunov's avatar
      selftests/mm: Fix build with _FORTIFY_SOURCE · 8b65ef5a
      Vitaly Chikunov authored
      Add missing mode argument to open(2) call with O_CREAT.
      
      Some tests fail to compile if _FORTIFY_SOURCE is defined (to any valid
      value) together with -O, resulting in error messages such as:
      
        In file included from /usr/include/fcntl.h:342,
                         from gup_test.c:1:
        In function 'open',
            inlined from 'main' at gup_test.c:206:10:
        /usr/include/bits/fcntl2.h:50:11: error: call to '__open_missing_mode' declared with attribute error: open with O_CREAT or O_TMPFILE in second argument needs 3 arguments
           50 |           __open_missing_mode ();
              |           ^~~~~~~~~~~~~~~~~~~~~~
      
      _FORTIFY_SOURCE is enabled by default in some distributions, so the
      tests are not built by default and are skipped.
      
      The open(2) man page warns about a missing mode argument: "if it is not
      supplied, some arbitrary bytes from the stack will be applied as the
      file mode."
      
      Link: https://lkml.kernel.org/r/20240318023445.3192922-1-vt@altlinux.org
      Fixes: aeb85ed4 ("tools/testing/selftests/vm/gup_benchmark.c: allow user specified file")
      Fixes: fbe37501 ("mm: huge_memory: debugfs for file-backed THP split")
      Fixes: c942f5bd ("selftests: soft-dirty: add test for mprotect")
      Signed-off-by: Vitaly Chikunov <vt@altlinux.org>
      Reviewed-by: Zi Yan <ziy@nvidia.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Cc: Keith Busch <kbusch@kernel.org>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      8b65ef5a
    • Peter Xu's avatar
      mm/memory: fix missing pte marker for !page on pte zaps · f8572367
      Peter Xu authored
      Commit 0cf18e83 of large folio zap work broke uffd-wp.  Now mm's uffd
      unit test "wp-unpopulated" will trigger this WARN_ON_ONCE().
      
      The WARN_ON_ONCE() asserts that a VMA cannot be registered with
      userfaultfd-wp if it contains a !normal page, but that is actually possible.
      One example is an anonymous VMA registered with uffd-wp: reading anything
      installs a zero page, and zapping that range then triggers the warning.
      
      What's more, removing that WARN_ON_ONCE may not be enough either, because
      we should also not rely on "whether it's a normal page" to decide whether
      a pte marker is needed.  For example, one can register wr-protect over some
      DAX regions to track writes when UFFD_FEATURE_WP_ASYNC is enabled, in which
      case it can have page==NULL for a devmap but we may want to keep the
      marker around.
      
      Link: https://lkml.kernel.org/r/20240313213107.235067-1-peterx@redhat.com
      Fixes: 0cf18e83 ("mm/memory: handle !page case in zap_present_pte() separately")
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Cc: Muhammad Usama Anjum <usama.anjum@collabora.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      f8572367
    • Linus Torvalds's avatar
      Merge tag 'printk-for-6.9-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux · 7033999e
      Linus Torvalds authored
      Pull printk fix from Petr Mladek:
      
       - Prevent scheduling in an atomic context when printk() takes over the
         console flushing duty
      
      * tag 'printk-for-6.9-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux:
        printk: Update @console_may_schedule in console_trylock_spinning()
      7033999e
    • Linus Torvalds's avatar
      Merge tag 'pwm/for-6.9-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/ukleinek/linux · 576bb2d8
      Linus Torvalds authored
      Pull pwm fix from Uwe Kleine-König:
       "This contains a single fix for a regression introduced in v5.18-rc1
        which made the img pwm driver fail to bind"
      
      * tag 'pwm/for-6.9-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/ukleinek/linux:
        pwm: img: fix pwm clock lookup
      576bb2d8
    • Tavian Barnes's avatar
      btrfs: fix race in read_extent_buffer_pages() · ef1e6823
      Tavian Barnes authored
      There are reports of tree-checker detecting corrupted nodes without any
      obvious pattern, possibly caused by an overwrite in memory.  After some
      debugging, it turns out there is a race when reading an extent buffer
      in which the uptodate status can be missed.
      
      To prevent concurrent reads for the same extent buffer,
      read_extent_buffer_pages() performs these checks:
      
          /* (1) */
          if (test_bit(EXTENT_BUFFER_UPTODATE, &eb->bflags))
              return 0;
      
          /* (2) */
          if (test_and_set_bit(EXTENT_BUFFER_READING, &eb->bflags))
              goto done;
      
      At this point, it seems safe to start the actual read operation. Once
      that completes, end_bbio_meta_read() does
      
          /* (3) */
          set_extent_buffer_uptodate(eb);
      
          /* (4) */
          clear_bit(EXTENT_BUFFER_READING, &eb->bflags);
      
      Normally, this is enough to ensure only one read happens, and all other
      callers wait for it to finish before returning.  Unfortunately, there is
      a racy interleaving:
      
          Thread A | Thread B | Thread C
          ---------+----------+---------
             (1)   |          |
                   |    (1)   |
             (2)   |          |
             (3)   |          |
             (4)   |          |
                   |    (2)   |
                   |          |    (1)
      
      When this happens, thread B kicks off an unnecessary read. Worse, thread
      C will see UPTODATE set and return immediately, while the read from
      thread B is still in progress.  This race could result in tree-checker
      errors like this as the extent buffer is concurrently modified:
      
          BTRFS critical (device dm-0): corrupted node, root=256
          block=8550954455682405139 owner mismatch, have 11858205567642294356
          expect [256, 18446744073709551360]
      
      Fix it by testing UPTODATE again after setting the READING bit, and if
      it's been set, skip the unnecessary read.
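
The interleaving and the re-check can be replayed deterministically in a small userspace model. This is a sketch built on C11 atomics with hypothetical names (eb_model, claim_read(), replay_race()); it is not the btrfs code, and waking up waiters is not modeled.

```c
#include <stdatomic.h>
#include <stdbool.h>

enum { EB_UPTODATE = 1u << 0, EB_READING = 1u << 1 };

struct eb_model {
	atomic_uint bflags; /* stands in for eb->bflags */
};

static bool test_flag(struct eb_model *eb, unsigned int flag)
{
	return (atomic_load(&eb->bflags) & flag) != 0;
}

/* Sets the flag and reports whether it was already set. */
static bool test_and_set_flag(struct eb_model *eb, unsigned int flag)
{
	return (atomic_fetch_or(&eb->bflags, flag) & flag) != 0;
}

static void clear_flag(struct eb_model *eb, unsigned int flag)
{
	atomic_fetch_and(&eb->bflags, ~flag);
}

/* Step (2), optionally followed by the re-check. True: caller must read. */
static bool claim_read(struct eb_model *eb, bool recheck_uptodate)
{
	if (test_and_set_flag(eb, EB_READING))
		return false; /* another reader is active; caller would wait */

	if (recheck_uptodate && test_flag(eb, EB_UPTODATE)) {
		clear_flag(eb, EB_READING); /* a read finished in between */
		return false;
	}
	return true;
}

/* Steps (3) and (4): mark the buffer uptodate, then drop READING. */
static void finish_read(struct eb_model *eb)
{
	atomic_fetch_or(&eb->bflags, EB_UPTODATE);
	clear_flag(eb, EB_READING);
}

/* Replays A(1) B(1) A(2) A(3) A(4) B(2). True: B starts a redundant read. */
static bool replay_race(bool recheck_uptodate)
{
	struct eb_model eb;

	atomic_init(&eb.bflags, 0);
	(void)test_flag(&eb, EB_UPTODATE);        /* A (1): not uptodate */
	(void)test_flag(&eb, EB_UPTODATE);        /* B (1): not uptodate */
	if (claim_read(&eb, recheck_uptodate))    /* A (2) */
		finish_read(&eb);                 /* A (3), (4) */
	return claim_read(&eb, recheck_uptodate); /* B (2) */
}
```

In this model replay_race(false) reports that thread B starts a second read, while replay_race(true) reports that B backs off because UPTODATE is already set.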
      
      Fixes: d7172f52 ("btrfs: use per-buffer locking for extent_buffer reading")
      Link: https://lore.kernel.org/linux-btrfs/CAHk-=whNdMaN9ntZ47XRKP6DBes2E5w7fi-0U3H2+PS18p+Pzw@mail.gmail.com/
      Link: https://lore.kernel.org/linux-btrfs/f51a6d5d7432455a6a858d51b49ecac183e0bbc9.1706312914.git.wqu@suse.com/
      Link: https://lore.kernel.org/linux-btrfs/c7241ea4-fcc6-48d2-98c8-b5ea790d6c89@gmx.com/
      CC: stable@vger.kernel.org # 6.5+
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Tavian Barnes <tavianator@tavianator.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      [ minor update of changelog ]
      Signed-off-by: David Sterba <dsterba@suse.com>
      ef1e6823
    • Anand Jain's avatar
      btrfs: return accurate error code on open failure in open_fs_devices() · 2f1aeab9
      Anand Jain authored
      When attempting to exclusively open a device which has no exclusive open
      permission, such as a physical device associated with the flakey dm
      device, the open operation will fail, resulting in a mount failure.
      
      In this particular scenario, we erroneously return -EINVAL instead of the
      correct error code provided by the bdev_open_by_path() function, which is
      -EBUSY.
      
      Fix this by returning the error code from the bdev_open_by_path() function.
      With this correction, the mount error message will align with that of
      ext4 and xfs.

      Reviewed-by: Boris Burkov <boris@bur.io>
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      2f1aeab9
    • Johannes Thumshirn's avatar
      btrfs: zoned: don't skip block groups with 100% zone unusable · a8b70c7f
      Johannes Thumshirn authored
      Commit f4a9f219 ("btrfs: do not delete unused block group if it may be
      used soon") changed the behaviour of deleting unused block-groups on zoned
      filesystems. Starting with this commit, we're using
      btrfs_space_info_used() to calculate the number of used bytes in a
      space_info. But btrfs_space_info_used() also accounts
      btrfs_space_info::bytes_zone_unusable as used bytes.
      
      So if a block group is 100% zone_unusable it is skipped from the deletion
      step.
      
      In order not to skip fully zone_unusable block-groups, also check if the
      block-group has bytes left that can be used on a zoned filesystem.
      
      Fixes: f4a9f219 ("btrfs: do not delete unused block group if it may be used soon")
      CC: stable@vger.kernel.org # 6.1+
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      a8b70c7f
    • Filipe Manana's avatar
      btrfs: use btrfs_warn() to log message at btrfs_add_extent_mapping() · 21334600
      Filipe Manana authored
      At btrfs_add_extent_mapping(), if we fail to merge the extent map, which
      is unexpected and theoretically should never happen, we use WARN_ONCE() to
      log a message.  That is not great, because we don't get information about
      which filesystem it relates to when multiple btrfs filesystems are
      mounted.  So change this to use btrfs_warn() and surround the error check
      with WARN_ON(), so that we always get a useful stack trace and the
      condition is flagged as "unlikely", since it's not expected to ever happen.

      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      21334600
    • Filipe Manana's avatar
      btrfs: fix message not properly printing interval when adding extent map · 379c8723
      Filipe Manana authored
      At btrfs_add_extent_mapping(), if we are unable to merge the existing
      extent map, we print a warning message that suggests interval ranges in
      the form "[X, Y)", where the first element is the inclusive start offset
      of a range and the second element is the exclusive end offset. However
      we end up printing the length of the ranges instead of the exclusive end
      offsets. So fix this by printing the range end offsets.
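
As an illustration of the "[X, Y)" convention, the exclusive end offset is the start plus the length, and that sum is what has to be printed. A small sketch with a hypothetical helper name:

```c
#include <inttypes.h>
#include <stdio.h>

/*
 * Hypothetical helper: format a half-open range as "[start, end)".
 * The second element is the exclusive end offset (start + len), not len.
 */
static int format_range(char *buf, size_t size, uint64_t start, uint64_t len)
{
	return snprintf(buf, size, "[%" PRIu64 ", %" PRIu64 ")",
			start, start + len);
}
```

For start 4096 and length 8192 this yields "[4096, 12288)". Printing the length instead would give "[4096, 8192)", which reads as a different and shorter range.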

      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      379c8723
    • Filipe Manana's avatar
      btrfs: fix warning messages not printing interval at unpin_extent_range() · 4dc1d69c
      Filipe Manana authored
      At unpin_extent_range() we print warning messages that are supposed to
      print an interval in the form "[X, Y)", with the first element being an
      inclusive start offset and the second element being the exclusive end
      offset of a range. However we end up printing the range's length instead
      of the range's exclusive end offset, so fix that to avoid having confusing
      and nonsensical messages in case we hit one of these unexpected scenarios.
      
      Fixes: 00deaf04 ("btrfs: log messages at unpin_extent_range() during unexpected cases")
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      4dc1d69c
    • Filipe Manana's avatar
      btrfs: fix extent map leak in unexpected scenario at unpin_extent_cache() · 8a565ec0
      Filipe Manana authored
      At unpin_extent_cache() if we happen to find an extent map with an
      unexpected start offset, we jump to the 'out' label and never release the
      reference we added to the extent map through the call to
      lookup_extent_mapping(), therefore resulting in a leak. So fix this by
      moving the free_extent_map() under the 'out' label.
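
The shape of this fix, a single exit label that always drops the reference taken by the lookup, can be sketched in userspace as below. The structure and function names are hypothetical, and -EINVAL is a placeholder error code rather than the one btrfs returns.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical reference-counted object standing in for an extent map. */
struct em_model {
	int refs;
	unsigned long long start;
};

/* Lookup that hands back the entry with an extra reference held. */
static struct em_model *lookup_get(struct em_model *entry)
{
	if (entry)
		entry->refs++;
	return entry;
}

/* Drops one reference; tolerates NULL so it is safe on every exit path. */
static void put_ref(struct em_model *em)
{
	if (em)
		em->refs--;
}

static int unpin_model(struct em_model *entry, unsigned long long start)
{
	struct em_model *em = lookup_get(entry);
	int ret = 0;

	if (!em) {
		ret = -ENOENT;
		goto out;
	}
	if (em->start != start) {
		ret = -EINVAL; /* unexpected start offset */
		goto out;
	}
	/* ... the normal unpin work would go here ... */
out:
	put_ref(em); /* under the label, so error paths release it too */
	return ret;
}
```

Because the release sits under the label, the reference count returns to its starting value whether or not the start offset matches.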
      
      Fixes: c03c89f8 ("btrfs: handle errors returned from unpin_extent_cache()")
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      8a565ec0
    • Anand Jain's avatar
      btrfs: validate device maj:min during open · 9f7eb840
      Anand Jain authored
      Boris managed to create a device capable of changing its maj:min without
      altering its device path.
      
      Only multi-devices can be scanned. A device that gets scanned and remains
      in the btrfs kernel cache might end up with an incorrect maj:min.
      
      Although the temp-fsid feature patch did not introduce this bug, it could
      lead to issues if the above multi-device is converted to a single device
      with a stale maj:min. Subsequently, attempting to mount the same device
      with the correct maj:min might mistake it for another device with the same
      fsid, potentially resulting in wrongly auto-enabling the temp-fsid feature.
      
      To address this, this patch validates the device's maj:min at the time of
      device open and updates it if it has changed since the last scan.
      
      CC: stable@vger.kernel.org # 6.7+
      Fixes: a5b8a5f9 ("btrfs: support cloned-device mount capability")
      Reported-by: Boris Burkov <boris@bur.io>
      Co-developed-by: Boris Burkov <boris@bur.io>
      Reviewed-by: Boris Burkov <boris@bur.io>
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      9f7eb840
    • Johannes Thumshirn's avatar
      btrfs: zoned: fix use-after-free in do_zone_finish() · 1ec17ef5
      Johannes Thumshirn authored
      Shinichiro reported the following use-after-free triggered by the device
      replace operation in fstests btrfs/070.
      
       BTRFS info (device nullb1): scrub: finished on devid 1 with status: 0
       ==================================================================
       BUG: KASAN: slab-use-after-free in do_zone_finish+0x91a/0xb90 [btrfs]
       Read of size 8 at addr ffff8881543c8060 by task btrfs-cleaner/3494007
      
       CPU: 0 PID: 3494007 Comm: btrfs-cleaner Tainted: G        W          6.8.0-rc5-kts #1
       Hardware name: Supermicro Super Server/X11SPi-TF, BIOS 3.3 02/21/2020
       Call Trace:
        <TASK>
        dump_stack_lvl+0x5b/0x90
        print_report+0xcf/0x670
        ? __virt_addr_valid+0x200/0x3e0
        kasan_report+0xd8/0x110
        ? do_zone_finish+0x91a/0xb90 [btrfs]
        ? do_zone_finish+0x91a/0xb90 [btrfs]
        do_zone_finish+0x91a/0xb90 [btrfs]
        btrfs_delete_unused_bgs+0x5e1/0x1750 [btrfs]
        ? __pfx_btrfs_delete_unused_bgs+0x10/0x10 [btrfs]
        ? btrfs_put_root+0x2d/0x220 [btrfs]
        ? btrfs_clean_one_deleted_snapshot+0x299/0x430 [btrfs]
        cleaner_kthread+0x21e/0x380 [btrfs]
        ? __pfx_cleaner_kthread+0x10/0x10 [btrfs]
        kthread+0x2e3/0x3c0
        ? __pfx_kthread+0x10/0x10
        ret_from_fork+0x31/0x70
        ? __pfx_kthread+0x10/0x10
        ret_from_fork_asm+0x1b/0x30
        </TASK>
      
       Allocated by task 3493983:
        kasan_save_stack+0x33/0x60
        kasan_save_track+0x14/0x30
        __kasan_kmalloc+0xaa/0xb0
        btrfs_alloc_device+0xb3/0x4e0 [btrfs]
        device_list_add.constprop.0+0x993/0x1630 [btrfs]
        btrfs_scan_one_device+0x219/0x3d0 [btrfs]
        btrfs_control_ioctl+0x26e/0x310 [btrfs]
        __x64_sys_ioctl+0x134/0x1b0
        do_syscall_64+0x99/0x190
        entry_SYSCALL_64_after_hwframe+0x6e/0x76
      
       Freed by task 3494056:
        kasan_save_stack+0x33/0x60
        kasan_save_track+0x14/0x30
        kasan_save_free_info+0x3f/0x60
        poison_slab_object+0x102/0x170
        __kasan_slab_free+0x32/0x70
        kfree+0x11b/0x320
        btrfs_rm_dev_replace_free_srcdev+0xca/0x280 [btrfs]
        btrfs_dev_replace_finishing+0xd7e/0x14f0 [btrfs]
        btrfs_dev_replace_by_ioctl+0x1286/0x25a0 [btrfs]
        btrfs_ioctl+0xb27/0x57d0 [btrfs]
        __x64_sys_ioctl+0x134/0x1b0
        do_syscall_64+0x99/0x190
        entry_SYSCALL_64_after_hwframe+0x6e/0x76
      
       The buggy address belongs to the object at ffff8881543c8000
        which belongs to the cache kmalloc-1k of size 1024
       The buggy address is located 96 bytes inside of
        freed 1024-byte region [ffff8881543c8000, ffff8881543c8400)
      
       The buggy address belongs to the physical page:
       page:00000000fe2c1285 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x1543c8
       head:00000000fe2c1285 order:3 entire_mapcount:0 nr_pages_mapped:0 pincount:0
       flags: 0x17ffffc0000840(slab|head|node=0|zone=2|lastcpupid=0x1fffff)
       page_type: 0xffffffff()
       raw: 0017ffffc0000840 ffff888100042dc0 ffffea0019e8f200 dead000000000002
       raw: 0000000000000000 0000000000100010 00000001ffffffff 0000000000000000
       page dumped because: kasan: bad access detected
      
       Memory state around the buggy address:
        ffff8881543c7f00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
        ffff8881543c7f80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
       >ffff8881543c8000: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                                                              ^
        ffff8881543c8080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
        ffff8881543c8100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
      
      This UAF happens because we're accessing stale zone information of an
      already removed btrfs_device in do_zone_finish().
      
      The sequence of events is as follows:
      
      btrfs_dev_replace_start
        btrfs_scrub_dev
         btrfs_dev_replace_finishing
          btrfs_dev_replace_update_device_in_mapping_tree <-- devices replaced
          btrfs_rm_dev_replace_free_srcdev
           btrfs_free_device                              <-- device freed
      
      cleaner_kthread
       btrfs_delete_unused_bgs
        btrfs_zone_finish
         do_zone_finish              <-- refers to the freed device
      
      The reason for this is that we're using a cached pointer to the chunk_map
      from the block group, but on device replace this cached pointer can
      contain stale device entries.
      
      The staleness comes from the fact that btrfs_block_group::physical_map is
      not a pointer to a btrfs_chunk_map but a memory copy of it.
      
      Also take the fs_info::dev_replace::rwsem to prevent
      btrfs_dev_replace_update_device_in_mapping_tree() from changing the device
      underneath us again.
      
      Note: btrfs_dev_replace_update_device_in_mapping_tree() is holding
      fs_info::mapping_tree_lock, but as this is a spinning read/write lock we
      cannot take it as the call to blkdev_zone_mgmt() requires a memory
      allocation which may not sleep.
      But btrfs_dev_replace_update_device_in_mapping_tree() is always called with
      the fs_info::dev_replace::rwsem held in write mode.
      
      Many thanks to Shinichiro for analyzing the bug.

      Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com>
      CC: stable@vger.kernel.org # 6.8
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      1ec17ef5
  3. 25 Mar, 2024 2 commits