1. 04 Sep, 2024 11 commits
  2. 03 Sep, 2024 3 commits
    • Merge tag 'fuse-fixes-6.11-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse · 88fac175
      Linus Torvalds authored
      Pull fuse fixes from Miklos Szeredi:
      
       - Fix EIO if splice and page stealing are enabled on the fuse device
      
       - Disable problematic combination of passthrough and writeback-cache
      
       - Other bug fixes found by code review
      
      * tag 'fuse-fixes-6.11-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
        fuse: disable the combination of passthrough and writeback cache
        fuse: update stats for pages in dropped aux writeback list
        fuse: clear PG_uptodate when using a stolen page
        fuse: fix memory leak in fuse_create_open
        fuse: check aborted connection before adding requests to pending list for resending
        fuse: use unsigned type for getxattr/listxattr size truncation
      88fac175
    • btrfs: fix race between direct IO write and fsync when using same fd · cd9253c2
      Filipe Manana authored
      If we have 2 threads that are using the same file descriptor and one of
      them is doing direct IO writes while the other is doing fsync, we have a
      race where we can end up either:
      
      1) Attempt an fsync without holding the inode's lock, triggering an
         assertion failure when assertions are enabled;

      2) Do an invalid memory access from the fsync task because the file's
         private points to memory allocated on the stack by the direct IO
         task and it may be used by the fsync task after the stack was
         destroyed.
      
      The race happens like this:
      
      1) A user space program opens a file descriptor with O_DIRECT;
      
      2) The program spawns 2 threads using libpthread for example;
      
      3) One of the threads uses the file descriptor to do direct IO writes,
         while the other calls fsync using the same file descriptor;
      
      4) Call task A the thread doing direct IO writes and task B the thread
         doing fsyncs;
      
      5) Task A does a direct IO write, and at btrfs_direct_write() sets the
         file's private to an on stack allocated private with the member
         'fsync_skip_inode_lock' set to true;
      
      6) Task B enters btrfs_sync_file() and sees that there's a private
         structure associated to the file which has 'fsync_skip_inode_lock' set
         to true, so it skips locking the inode's VFS lock;
      
      7) Task A completes the direct IO write, and resets the file's private to
         NULL since it had no prior private and our private was stack allocated.
         Then it unlocks the inode's VFS lock;
      
      8) Task B enters btrfs_get_ordered_extents_for_logging(), then the
         assertion that checks the inode's VFS lock is held fails, since task B
         never locked it and task A has already unlocked it.
      
      The stack trace produced is the following:
      
         assertion failed: inode_is_locked(&inode->vfs_inode), in fs/btrfs/ordered-data.c:983
         ------------[ cut here ]------------
         kernel BUG at fs/btrfs/ordered-data.c:983!
         Oops: invalid opcode: 0000 [#1] PREEMPT SMP PTI
         CPU: 9 PID: 5072 Comm: worker Tainted: G     U     OE      6.10.5-1-default #1 openSUSE Tumbleweed 69f48d427608e1c09e60ea24c6c55e2ca1b049e8
         Hardware name: Acer Predator PH315-52/Covini_CFS, BIOS V1.12 07/28/2020
         RIP: 0010:btrfs_get_ordered_extents_for_logging.cold+0x1f/0x42 [btrfs]
         Code: 50 d6 86 c0 e8 (...)
         RSP: 0018:ffff9e4a03dcfc78 EFLAGS: 00010246
         RAX: 0000000000000054 RBX: ffff9078a9868e98 RCX: 0000000000000000
         RDX: 0000000000000000 RSI: ffff907dce4a7800 RDI: ffff907dce4a7800
         RBP: ffff907805518800 R08: 0000000000000000 R09: ffff9e4a03dcfb38
         R10: ffff9e4a03dcfb30 R11: 0000000000000003 R12: ffff907684ae7800
         R13: 0000000000000001 R14: ffff90774646b600 R15: 0000000000000000
         FS:  00007f04b96006c0(0000) GS:ffff907dce480000(0000) knlGS:0000000000000000
         CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
         CR2: 00007f32acbfc000 CR3: 00000001fd4fa005 CR4: 00000000003726f0
         Call Trace:
          <TASK>
          ? __die_body.cold+0x14/0x24
          ? die+0x2e/0x50
          ? do_trap+0xca/0x110
          ? do_error_trap+0x6a/0x90
          ? btrfs_get_ordered_extents_for_logging.cold+0x1f/0x42 [btrfs bb26272d49b4cdc847cf3f7faadd459b62caee9a]
          ? exc_invalid_op+0x50/0x70
          ? btrfs_get_ordered_extents_for_logging.cold+0x1f/0x42 [btrfs bb26272d49b4cdc847cf3f7faadd459b62caee9a]
          ? asm_exc_invalid_op+0x1a/0x20
          ? btrfs_get_ordered_extents_for_logging.cold+0x1f/0x42 [btrfs bb26272d49b4cdc847cf3f7faadd459b62caee9a]
          ? btrfs_get_ordered_extents_for_logging.cold+0x1f/0x42 [btrfs bb26272d49b4cdc847cf3f7faadd459b62caee9a]
          btrfs_sync_file+0x21a/0x4d0 [btrfs bb26272d49b4cdc847cf3f7faadd459b62caee9a]
          ? __seccomp_filter+0x31d/0x4f0
          __x64_sys_fdatasync+0x4f/0x90
          do_syscall_64+0x82/0x160
          ? do_futex+0xcb/0x190
          ? __x64_sys_futex+0x10e/0x1d0
          ? switch_fpu_return+0x4f/0xd0
          ? syscall_exit_to_user_mode+0x72/0x220
          ? do_syscall_64+0x8e/0x160
          ? syscall_exit_to_user_mode+0x72/0x220
          ? do_syscall_64+0x8e/0x160
          ? syscall_exit_to_user_mode+0x72/0x220
          ? do_syscall_64+0x8e/0x160
          ? syscall_exit_to_user_mode+0x72/0x220
          ? do_syscall_64+0x8e/0x160
          entry_SYSCALL_64_after_hwframe+0x76/0x7e
      
      Another problem here is that if task B grabs the private pointer and
      then uses it after task A has finished, since the private was allocated
      on the stack of task A, it results in an invalid memory access with a
      hard-to-predict result.
      
      This issue, triggering the assertion, was observed with QEMU workloads by
      two users in the Link tags below.
      
      Fix this by not relying on a file's private to pass information to fsync
      that it should skip locking the inode and instead pass this information
      through a special value stored in current->journal_info. This is safe
      because in the relevant section of the direct IO write path we are not
      holding a transaction handle, so current->journal_info is NULL.
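      The per-task trick can be sketched in user space; this is a minimal
      illustration with hypothetical names (a thread-local slot standing in
      for current->journal_info and an arbitrary magic value), not the actual
      kernel code:

```c
#include <assert.h>
#include <stddef.h>

/* A flag that only the *same* task must observe is stored in per-task
 * state, so a concurrent fsync from another task can never see it,
 * unlike the previously used shared file private. */
static _Thread_local void *journal_info;      /* stand-in for current->journal_info */
#define BTRFS_SKIP_LOCK_SENTINEL ((void *)1)  /* hypothetical magic value */

static void direct_write_begin(void)
{
    /* Safe only because this code path holds no transaction handle,
     * so the slot is known to be NULL here. */
    journal_info = BTRFS_SKIP_LOCK_SENTINEL;
}

static void direct_write_end(void)
{
    journal_info = NULL;
}

static int fsync_should_skip_inode_lock(void)
{
    /* A different thread has its own slot and always reads NULL here. */
    return journal_info == BTRFS_SKIP_LOCK_SENTINEL;
}
```

      Because the slot is per-task, the fsync path only skips the lock when
      it runs on the very task that set the sentinel.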
      
      The following C program triggers the issue:
      
         $ cat repro.c
         /* Get the O_DIRECT definition. */
         #ifndef _GNU_SOURCE
         #define _GNU_SOURCE
         #endif
      
         #include <stdio.h>
         #include <stdlib.h>
         #include <unistd.h>
         #include <stdint.h>
         #include <fcntl.h>
         #include <errno.h>
         #include <string.h>
         #include <pthread.h>
      
         static int fd;
      
         static ssize_t do_write(int fd, const void *buf, size_t count, off_t offset)
         {
             while (count > 0) {
                 ssize_t ret;
      
                 ret = pwrite(fd, buf, count, offset);
                 if (ret < 0) {
                     if (errno == EINTR)
                         continue;
                     return ret;
                 }
                 count -= ret;
                 buf += ret;
             }
             return 0;
         }
      
         static void *fsync_loop(void *arg)
         {
             while (1) {
                 int ret;
      
                 ret = fsync(fd);
                 if (ret != 0) {
                     perror("Fsync failed");
                     exit(6);
                 }
             }
         }
      
         int main(int argc, char *argv[])
         {
             long pagesize;
             void *write_buf;
             pthread_t fsyncer;
             int ret;
      
             if (argc != 2) {
                 fprintf(stderr, "Use: %s <file path>\n", argv[0]);
                 return 1;
             }
      
             fd = open(argv[1], O_WRONLY | O_CREAT | O_TRUNC | O_DIRECT, 0666);
             if (fd == -1) {
                 perror("Failed to open/create file");
                 return 1;
             }
      
             pagesize = sysconf(_SC_PAGE_SIZE);
             if (pagesize == -1) {
                 perror("Failed to get page size");
                 return 2;
             }
      
             ret = posix_memalign(&write_buf, pagesize, pagesize);
             if (ret) {
                 perror("Failed to allocate buffer");
                 return 3;
             }
      
             ret = pthread_create(&fsyncer, NULL, fsync_loop, NULL);
             if (ret != 0) {
                 fprintf(stderr, "Failed to create writer thread: %d\n", ret);
                 return 4;
             }
      
             while (1) {
                 ret = do_write(fd, write_buf, pagesize, 0);
                 if (ret != 0) {
                     perror("Write failed");
                     exit(5);
                 }
             }
      
             return 0;
         }
      
         $ mkfs.btrfs -f /dev/sdi
         $ mount /dev/sdi /mnt/sdi
         $ timeout 10 ./repro /mnt/sdi/foo
      
      Usually the race is triggered within less than 1 second. A test case for
      fstests will follow soon.
      Reported-by: Paulo Dias <paulo.miguel.dias@gmail.com>
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=219187
      Reported-by: Andreas Jahn <jahn-andi@web.de>
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=219199
      Reported-by: syzbot+4704b3cc972bd76024f1@syzkaller.appspotmail.com
      Link: https://lore.kernel.org/linux-btrfs/00000000000044ff540620d7dee2@google.com/
      Fixes: 939b656b ("btrfs: fix corruption after buffer fault in during direct IO append write")
      CC: stable@vger.kernel.org # 5.15+
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      cd9253c2
    • parisc: Delay write-protection until mark_rodata_ro() call · 213aa670
      Helge Deller authored
      Do not write-protect the kernel read-only and __ro_after_init sections
      before mark_rodata_ro() is called.  This fixes a boot issue on parisc
      which is triggered by commit 91a1d97e ("jump_label,module: Don't
      alloc static_key_mod for __ro_after_init keys"). That commit may modify
      static key contents in the __ro_after_init section at bootup, so this
      section needs to be writable at least until mark_rodata_ro() is called.
      Signed-off-by: Helge Deller <deller@gmx.de>
      Reported-by: matoro <matoro_mailinglist_kernel@matoro.tk>
      Reported-by: Christoph Biedl <linux-kernel.bfrz@manchmal.in-ulm.de>
      Tested-by: Christoph Biedl <linux-kernel.bfrz@manchmal.in-ulm.de>
      Link: https://lore.kernel.org/linux-parisc/096cad5aada514255cd7b0b9dbafc768@matoro.tk/#r
      Fixes: 91a1d97e ("jump_label,module: Don't alloc static_key_mod for __ro_after_init keys")
      Cc: stable@vger.kernel.org # v6.10+
      213aa670
  3. 02 Sep, 2024 25 commits
    • btrfs: zoned: handle broken write pointer on zones · b1934cd6
      Naohiro Aota authored
      Btrfs refuses to mount a filesystem if it finds a block group with a
      broken write pointer (e.g. unequal write pointers on the two zones of a
      RAID1 block group). Since such a case can easily happen after a
      power-loss or crash of the system, we need to handle it more gently.

      Handle such a block group by making it unallocatable, so that there
      will be no writes into it. That can be done by setting the allocation
      pointer to the end of the allocating region
      (= block_group->zone_capacity). Then, the existing code handles
      zone_unusable properly.

      Having a proper zone_capacity is necessary for the change, so set it
      as early as possible.

      We cannot handle the RAID0 and RAID10 cases like this, but they are
      unreadable anyway because of a missing stripe.
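      The "unallocatable" accounting can be sketched with simplified,
      hypothetical struct fields (the real btrfs zone handling is far more
      involved):

```c
#include <assert.h>

/* When a zone's write pointer is broken, place the allocation pointer at
 * the end of the allocatable region so the block group accepts no further
 * writes, and account the remaining space as zone_unusable. */
struct block_group {
    unsigned long long alloc_offset;   /* next allocation position */
    unsigned long long zone_capacity;  /* end of the allocatable region */
    unsigned long long zone_unusable;  /* space that can never be written */
};

static void mark_zone_unallocatable(struct block_group *bg)
{
    /* Everything between the old pointer and capacity becomes unusable. */
    bg->zone_unusable += bg->zone_capacity - bg->alloc_offset;
    bg->alloc_offset = bg->zone_capacity;
}

static int zone_has_room(const struct block_group *bg)
{
    return bg->alloc_offset < bg->zone_capacity;
}
```

      After the call, the allocator sees no free room in the block group, so
      nothing is ever written past the broken write pointer.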
      
      Fixes: 265f7237 ("btrfs: zoned: allow DUP on meta-data block groups")
      Fixes: 568220fa ("btrfs: zoned: support RAID0/1/10 on top of raid stripe tree")
      CC: stable@vger.kernel.org # 6.1+
      Reported-by: HAN Yuwei <hrx@bupt.moe>
      Cc: Xuefer <xuefer@gmail.com>
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      b1934cd6
    • perf daemon: Fix the build on more 32-bit architectures · e162cb25
      Arnaldo Carvalho de Melo authored
      FYI: I'm carrying this on perf-tools-next.
      
      The previous attempt fixed the build on debian:experimental-x-mipsel,
      but when building on a larger set of containers I noticed it broke the
      build on some other 32-bit architectures such as:
      
        42     7.87 ubuntu:18.04-x-arm            : FAIL gcc version 7.5.0 (Ubuntu/Linaro 7.5.0-3ubuntu1~18.04)
          builtin-daemon.c: In function 'cmd_session_list':
          builtin-daemon.c:692:16: error: format '%llu' expects argument of type 'long long unsigned int', but argument 4 has type 'long int' [-Werror=format=]
             fprintf(out, "%c%" PRIu64,
                          ^~~~~
          builtin-daemon.c:694:13:
              csv_sep, (curr - daemon->start) / 60);
                       ~~~~~~~~~~~~~~~~~~~~~~~~~~~
          In file included from builtin-daemon.c:3:0:
          /usr/arm-linux-gnueabihf/include/inttypes.h:105:34: note: format string is defined here
           # define PRIu64  __PRI64_PREFIX "u"
      
      So let's cast that time_t (32-bit/64-bit) to uint64_t to make sure it
      builds everywhere.
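      The fix can be illustrated with a small stand-alone example
      (format_minutes is a hypothetical helper; the real code is in
      builtin-daemon.c):

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

/* time_t may be 32-bit or 64-bit depending on the architecture, so
 * printing it with PRIu64 needs an explicit cast to uint64_t to match
 * the format string on every platform. */
static void format_minutes(char *buf, size_t len, time_t start, time_t curr)
{
    snprintf(buf, len, "%" PRIu64, (uint64_t)(curr - start) / 60);
}
```

      Without the cast, a 32-bit time_t difference passed to a "%" PRIu64
      conversion is exactly the -Werror=format= mismatch quoted above.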
      
      Fixes: 4bbe6002 ("perf daemon: Fix the build on 32-bit architectures")
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      Link: https://lore.kernel.org/r/ZsPmldtJ0D9Cua9_@x1
      Signed-off-by: Namhyung Kim <namhyung@kernel.org>
      e162cb25
    • perf python: include "util/sample.h" · aee1d559
      Xu Yang authored
      The 32-bit arm build system complains:

      tools/perf/util/python.c:75:28: error: field ‘sample’ has incomplete type
         75 |         struct perf_sample sample;

      However, the arm64 build system doesn't complain about this.

      The root cause is that arm64 defines "HAVE_KVM_STAT_SUPPORT := 1" in
      tools/perf/arch/arm64/Makefile, but the arm arch doesn't define it.
      This leads kvm-stat.h to include other header files on the arm64 build
      system, in particular "util/sample.h" for util/python.c.

      Directly include "util/sample.h" from "util/python.c" to avoid this
      build issue on the arm platform.
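      The failure mode is the standard complete-vs-incomplete type rule in C;
      a stand-alone sketch (the real struct perf_sample has many more fields
      than this illustration):

```c
#include <assert.h>

/* Embedding a struct by value requires its complete definition, while a
 * pointer member only needs a forward declaration. Including the defining
 * header (util/sample.h in perf) is what completes the type. */
struct perf_sample;                /* forward declaration only */

struct by_pointer {
    struct perf_sample *sample;    /* fine: the size of a pointer is known */
};

/* struct by_value_too_early { struct perf_sample sample; };
 * would fail here with "field 'sample' has incomplete type" */

struct perf_sample { long period; };  /* what the header supplies (simplified) */

struct by_value {
    struct perf_sample sample;     /* fine once the definition is visible */
};
```

      On arm64 the definition happened to be pulled in transitively via
      kvm-stat.h, which is why only the arm build broke.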
      Signed-off-by: Xu Yang <xu.yang_2@nxp.com>
      Cc: imx@lists.linux.dev
      Link: https://lore.kernel.org/r/20240819023403.201324-1-xu.yang_2@nxp.com
      Signed-off-by: Namhyung Kim <namhyung@kernel.org>
      aee1d559
    • perf lock contention: Fix spinlock and rwlock accounting · 287bd5cf
      Namhyung Kim authored
      The spinlock and rwlock code uses a single-element per-cpu array to
      track current locks for performance reasons.  But this means the key is
      always available, and lock stats cannot simply be accounted from the
      array because some of the entries are invalid.

      In fact, the contention_end() BPF program invalidates an entry by
      setting its 'lock' value to 0 instead of deleting the entry from the
      hashmap.  So account_end_timestamp() should skip entries with a lock
      value of 0.
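      The skip logic can be sketched with hypothetical types (the real code
      walks a BPF per-cpu array map, not a plain C array):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* The per-cpu array entry is invalidated by writing 0 to 'lock' rather
 * than by deleting it, so the aggregation pass must skip zeroed entries
 * instead of treating their stale timestamps as live contention. */
struct tstamp_entry {
    uint64_t lock;       /* 0 means the slot is invalid */
    uint64_t timestamp;  /* contention start time */
};

static uint64_t account_wait(const struct tstamp_entry *e, size_t n,
                             uint64_t now)
{
    uint64_t total = 0;

    for (size_t i = 0; i < n; i++) {
        if (e[i].lock == 0)   /* the fix: ignore invalidated slots */
            continue;
        total += now - e[i].timestamp;
    }
    return total;
}
```

      Without the lock == 0 check, a stale timestamp from a long-idle slot is
      counted as one enormous contention, producing the multi-second waits
      shown below.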
      
      Otherwise, it'd have spurious high contention on an idle machine:
      
        $ sudo perf lock con -ab -Y spinlock sleep 3
         contended   total wait     max wait     avg wait         type   caller
      
                 8      4.72 s       1.84 s     590.46 ms     spinlock   rcu_core+0xc7
                 8      1.87 s       1.87 s     233.48 ms     spinlock   process_one_work+0x1b5
                 2      1.87 s       1.87 s     933.92 ms     spinlock   worker_thread+0x1a2
                 3      1.81 s       1.81 s     603.93 ms     spinlock   tmigr_update_events+0x13c
                 2      1.72 s       1.72 s     861.98 ms     spinlock   tick_do_update_jiffies64+0x25
                 6     42.48 us     13.02 us      7.08 us     spinlock   futex_q_lock+0x2a
                 1     13.03 us     13.03 us     13.03 us     spinlock   futex_wake+0xce
                 1     11.61 us     11.61 us     11.61 us     spinlock   rcu_core+0xc7
      
      I don't believe it has contention on a spinlock longer than 1 second.
      After this change, it only reports some small contentions.
      
        $ sudo perf lock con -ab -Y spinlock sleep 3
         contended   total wait     max wait     avg wait         type   caller
      
                 4    133.51 us     43.29 us     33.38 us     spinlock   tick_do_update_jiffies64+0x25
                 4     69.06 us     31.82 us     17.27 us     spinlock   process_one_work+0x1b5
                 2     50.66 us     25.77 us     25.33 us     spinlock   rcu_core+0xc7
                 1     28.45 us     28.45 us     28.45 us     spinlock   rcu_core+0xc7
                 1     24.77 us     24.77 us     24.77 us     spinlock   tmigr_update_events+0x13c
                 1     23.34 us     23.34 us     23.34 us     spinlock   raw_spin_rq_lock_nested+0x15
      
      Fixes: b5711042 ("perf lock contention: Use per-cpu array map for spinlocks")
      Reported-by: Xi Wang <xii@google.com>
      Cc: Song Liu <song@kernel.org>
      Cc: bpf@vger.kernel.org
      Link: https://lore.kernel.org/r/20240828052953.1445862-1-namhyung@kernel.org
      Signed-off-by: Namhyung Kim <namhyung@kernel.org>
      287bd5cf
    • perf test pmu: Set uninitialized PMU alias to null · 1c7fb536
      Veronika Molnarova authored
      Commit 3e0bf9fd ("perf pmu: Restore full PMU name wildcard
      support") adds a test case "PMU cmdline match" that covers the PMU name
      wildcard support provided by the function perf_pmu__match(). The test
      covers a wide range of supported combinations of PMU name matching but
      omits the case where, if perf_pmu__match() cannot match the PMU name to
      the wildcard, it tries to match its alias. However, this variable is
      not set up, causing the test case to fail when run with subprocesses,
      or to segfault if run as a single process.
      
        ./perf test -vv 9
          9: Sysfs PMU tests                                                 :
          9.1: Parsing with PMU format directory                             : Ok
          9.2: Parsing with PMU event                                        : Ok
          9.3: PMU event names                                               : Ok
          9.4: PMU name combining                                            : Ok
          9.5: PMU name comparison                                           : Ok
          9.6: PMU cmdline match                                             : FAILED!
      
        ./perf test -F 9
          9.1: Parsing with PMU format directory                             : Ok
          9.2: Parsing with PMU event                                        : Ok
          9.3: PMU event names                                               : Ok
          9.4: PMU name combining                                            : Ok
          9.5: PMU name comparison                                           : Ok
        Segmentation fault (core dumped)
      
      Initialize the PMU alias to null for all tests of perf_pmu__match()
      as this functionality is not being tested and the alias matching works
      exactly the same as the matching of the PMU name.
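      The effect of the missing initialization can be sketched with a
      hypothetical struct (perf's real matcher also handles wildcards):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* A match routine that falls back to an alias must see a well-defined
 * alias pointer. Left uninitialized, the fallback dereferences garbage
 * (the segfault above); initialized to NULL, the alias path is skipped
 * cleanly. */
struct pmu {
    const char *name;
    const char *alias;   /* must be NULL when no alias is under test */
};

static int pmu_match(const struct pmu *p, const char *tok)
{
    if (p->name && strcmp(p->name, tok) == 0)
        return 1;
    if (p->alias && strcmp(p->alias, tok) == 0)  /* safe only if initialized */
        return 1;
    return 0;
}
```

      Designated initializers zero unnamed members, which is one idiomatic
      way to guarantee the NULL alias in the test fixtures.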
      
        ./perf test -F 9
          9.1: Parsing with PMU format directory                             : Ok
          9.2: Parsing with PMU event                                        : Ok
          9.3: PMU event names                                               : Ok
          9.4: PMU name combining                                            : Ok
          9.5: PMU name comparison                                           : Ok
          9.6: PMU cmdline match                                             : Ok
      
      Fixes: 3e0bf9fd ("perf pmu: Restore full PMU name wildcard support")
      Signed-off-by: Veronika Molnarova <vmolnaro@redhat.com>
      Cc: james.clark@arm.com
      Cc: mpetlan@redhat.com
      Cc: rstoyano@redhat.com
      Link: https://lore.kernel.org/r/20240808103749.9356-1-vmolnaro@redhat.com
      Signed-off-by: Namhyung Kim <namhyung@kernel.org>
      1c7fb536
    • btrfs: qgroup: don't use extent changeset when not needed · c346c629
      Fedor Pchelkin authored
      The local extent changeset is passed to clear_record_extent_bits() where
      it may have some additional memory dynamically allocated for ulist. When
      qgroup is disabled, the memory is leaked because in this case the
      changeset is not released upon __btrfs_qgroup_release_data() return.
      
      Since the recorded contents of the changeset are not used thereafter, just
      don't pass it.
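      The optional out-parameter pattern behind the fix can be sketched as
      follows (hypothetical names; btrfs's changeset records extent ranges in
      a ulist):

```c
#include <assert.h>
#include <stdlib.h>

/* An optional out-parameter that allocates memory on behalf of the
 * caller. If the caller never consumes the result, passing NULL avoids
 * the allocation entirely, so there is no release step to forget. */
struct changeset {
    unsigned long *ranges;  /* dynamically allocated record of changes */
    int nr;
};

static int clear_bits(unsigned long bit, struct changeset *cs)
{
    if (cs) {               /* record only when the caller asked for it */
        cs->ranges = malloc(sizeof(*cs->ranges));
        if (!cs->ranges)
            return -1;
        cs->ranges[0] = bit;
        cs->nr = 1;
    }
    /* ... clear the bits themselves ... */
    return 0;
}
```

      The leak in the report was exactly the cs != NULL branch running while
      the caller's early-return path never freed cs->ranges.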
      
      Found by Linux Verification Center (linuxtesting.org) with Syzkaller.
      
      Reported-by: syzbot+81670362c283f3dd889c@syzkaller.appspotmail.com
      Closes: https://lore.kernel.org/lkml/000000000000aa8c0c060ade165e@google.com
      Fixes: af0e2aab ("btrfs: qgroup: flush reservations during quota disable")
      CC: stable@vger.kernel.org # 6.10+
      Reviewed-by: Boris Burkov <boris@bur.io>
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Fedor Pchelkin <pchelkin@ispras.ru>
      Signed-off-by: David Sterba <dsterba@suse.com>
      c346c629
    • hwmon: (hp-wmi-sensors) Check if WMI event data exists · a54da9df
      Armin Wolf authored
      The BIOS can choose to return no event data in response to a
      WMI event, so the ACPI object passed to the WMI notify handler
      can be NULL.
      
      Check for such a situation and ignore the event in such a case.
      
      Fixes: 23902f98 ("hwmon: add HP WMI Sensors driver")
      Signed-off-by: Armin Wolf <W_Armin@gmx.de>
      Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
      Message-ID: <20240901031055.3030-2-W_Armin@gmx.de>
      Signed-off-by: Guenter Roeck <linux@roeck-us.net>
      a54da9df
    • Merge tag 'ata-6.11-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/libata/linux · 67784a74
      Linus Torvalds authored
      Pull ata fix from Damien Le Moal:
      
       - Fix a potential memory leak in the ata host initialization code (from
         Zheng)
      
      * tag 'ata-6.11-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/libata/linux:
        ata: libata: Fix memory leak for error path in ata_host_alloc()
      67784a74
    • alloc_tag: fix allocation tag reporting when CONFIG_MODULES=n · 052a45c1
      Suren Baghdasaryan authored
      codetag_module_init() is used to initialize sections containing allocation
      tags.  This function is used to initialize module sections as well as core
      kernel sections, in which case the module parameter is set to NULL.  This
      function has to be called even when CONFIG_MODULES=n to initialize core
      kernel allocation tag sections.  When CONFIG_MODULES=n, this function is a
      NOP, which is wrong.  This leads to /proc/allocinfo being reported as
      empty.  Fix this by making it independent of CONFIG_MODULES.
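      The bug shape can be sketched in plain C (a hypothetical stand-in; the
      real function initializes codetag sections):

```c
#include <assert.h>
#include <stddef.h>

/* An init routine that serves both module sections (mod != NULL) and
 * core kernel sections (mod == NULL) must not be compiled away when
 * module support is off, because the mod == NULL call still runs. */
static int core_initialized;

static void codetag_module_init_sketch(const char *mod)
{
    /* The buggy version wrapped this whole body in an #ifdef on module
     * support, turning the core-kernel (mod == NULL) call into a NOP. */
    if (mod == NULL)
        core_initialized = 1;   /* initialize core kernel sections */
    /* else: initialize the named module's sections */
}
```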
      
      Link: https://lkml.kernel.org/r/20240828231536.1770519-1-surenb@google.com
      Fixes: 916cc516 ("lib: code tagging framework")
      Signed-off-by: Suren Baghdasaryan <surenb@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Kent Overstreet <kent.overstreet@linux.dev>
      Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
      Cc: Sourav Panda <souravpanda@google.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: <stable@vger.kernel.org>	[6.10+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      052a45c1
    • mm: vmalloc: optimize vmap_lazy_nr arithmetic when purging each vmap_area · 409faf8c
      Adrian Huang authored
      When running the vmalloc stress test on a 448-core system, the average
      latency of purge_vmap_node() is about 2 seconds, as measured with the
      eBPF/bcc 'funclatency.py' tool [1].
      
        # /your-git-repo/bcc/tools/funclatency.py -u purge_vmap_node & pid1=$! && sleep 8 && modprobe test_vmalloc nr_threads=$(nproc) run_test_mask=0x7; kill -SIGINT $pid1
      
           usecs             : count    distribution
              0 -> 1         : 0       |                                        |
              2 -> 3         : 29      |                                        |
              4 -> 7         : 19      |                                        |
              8 -> 15        : 56      |                                        |
             16 -> 31        : 483     |****                                    |
             32 -> 63        : 1548    |************                            |
             64 -> 127       : 2634    |*********************                   |
            128 -> 255       : 2535    |*********************                   |
            256 -> 511       : 1776    |**************                          |
            512 -> 1023      : 1015    |********                                |
           1024 -> 2047      : 573     |****                                    |
           2048 -> 4095      : 488     |****                                    |
           4096 -> 8191      : 1091    |*********                               |
           8192 -> 16383     : 3078    |*************************               |
          16384 -> 32767     : 4821    |****************************************|
          32768 -> 65535     : 3318    |***************************             |
          65536 -> 131071    : 1718    |**************                          |
         131072 -> 262143    : 2220    |******************                      |
         262144 -> 524287    : 1147    |*********                               |
         524288 -> 1048575   : 1179    |*********                               |
        1048576 -> 2097151   : 822     |******                                  |
        2097152 -> 4194303   : 906     |*******                                 |
        4194304 -> 8388607   : 2148    |*****************                       |
        8388608 -> 16777215  : 4497    |*************************************   |
       16777216 -> 33554431  : 289     |**                                      |
      
        avg = 2041714 usecs, total: 78381401772 usecs, count: 38390
      
        The worst case falls in the 16-33 second range, so a soft lockup is
        triggered [2].
      
      [Root Cause]
      1) Each purge_list is long. The following shows the number of purged
         vmap_areas per node:
      
         crash> p vmap_nodes
         vmap_nodes = $27 = (struct vmap_node *) 0xff2de5a900100000
         crash> vmap_node 0xff2de5a900100000 128 | grep nr_purged
           nr_purged = 663070
           ...
           nr_purged = 821670
           nr_purged = 692214
           nr_purged = 726808
           ...
      
      2) atomic_long_sub() employs the 'lock' prefix to make the operation
         atomic when purging each vmap_area. However, the iteration covers
         over 600000 vmap_areas (see 'nr_purged' above).
      
         Here is objdump output:
      
           $ objdump -D vmlinux
           ffffffff813e8c80 <purge_vmap_node>:
           ...
           ffffffff813e8d70:  f0 48 29 2d 68 0c bb  lock sub %rbp,0x2bb0c68(%rip)
           ...
      
         Quote from "Instruction tables" pdf file [3]:
           Instructions with a LOCK prefix have a long latency that depends on
           cache organization and possibly RAM speed. If there are multiple
           processors or cores or direct memory access (DMA) devices, then all
           locked instructions will lock a cache line for exclusive access,
           which may involve RAM access. A LOCK prefix typically costs more
           than a hundred clock cycles, even on single-processor systems.
      
         That's why the latency of purge_vmap_node() dramatically increases
         on a many-core system: One core is busy on purging each vmap_area of
         the *long* purge_list and executing atomic_long_sub() for each
         vmap_area, while other cores free vmalloc allocations and execute
         atomic_long_add_return() in free_vmap_area_noflush().
      
      [Solution]
      Employ a local variable to record the total purged pages, and execute
      atomic_long_sub() after the traversal of the purge_list is done. The
      experiment result shows the latency improvement is 99%.
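      The before/after can be sketched with C11 atomics (a hypothetical
      user-space stand-in for the kernel's atomic_long_sub() on
      vmap_lazy_nr):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

static atomic_long vmap_lazy_nr;  /* shared counter of lazily-freed pages */

/* Before: one LOCK-prefixed subtraction per purged area. */
static void purge_per_area(const long *pages, size_t n)
{
    for (size_t i = 0; i < n; i++)
        atomic_fetch_sub(&vmap_lazy_nr, pages[i]);   /* LOCK sub each time */
}

/* After: accumulate in a plain local variable, subtract once. */
static void purge_batched(const long *pages, size_t n)
{
    long total = 0;

    for (size_t i = 0; i < n; i++)
        total += pages[i];        /* plain add, no cache-line contention */
    atomic_fetch_sub(&vmap_lazy_nr, total);          /* single LOCK sub */
}
```

      Both versions leave the counter in the same state; the batched one
      simply replaces N contended atomic operations with one.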
      
      [Experiment Result]
      1) System Configuration: Three servers (with HT-enabled) are tested.
           * 72-core server: 3rd Gen Intel Xeon Scalable Processor*1
           * 192-core server: 5th Gen Intel Xeon Scalable Processor*2
           * 448-core server: AMD Zen 4 Processor*2
      
      2) Kernel Config
           * CONFIG_KASAN is disabled
      
      3) The data in column "w/o patch" and "w/ patch"
           * Unit: micro seconds (us)
           * Each data is the average of 3-time measurements
      
               System        w/o patch (us)   w/ patch (us)    Improvement (%)
           ---------------   --------------   -------------    -------------
           72-core server          2194              14            99.36%
           192-core server       143799            1139            99.21%
           448-core server      1992122            6883            99.65%
      
      [1] https://github.com/iovisor/bcc/blob/master/tools/funclatency.py
      [2] https://gist.github.com/AdrianHuang/37c15f67b45407b83c2d32f918656c12
      [3] https://www.agner.org/optimize/instruction_tables.pdf
      
      Link: https://lkml.kernel.org/r/20240829130633.2184-1-ahuang12@lenovo.com
      Signed-off-by: Adrian Huang <ahuang12@lenovo.com>
      Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      409faf8c
    • mailmap: update entry for Jan Kuliga · 4f295229
      Jan Kuliga authored
      Soon I won't be able to use my current email address.
      
      Link: https://lkml.kernel.org/r/20240830095658.1203198-1-jankul@alatek.krakow.pl
      Signed-off-by: Jan Kuliga <jankul@alatek.krakow.pl>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Matthieu Baerts (NGI0) <matttbe@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      4f295229
    • codetag: debug: mark codetags for poisoned page as empty · 5e9784e9
      Hao Ge authored
      When PG_hwpoison pages are freed they are treated differently in
      free_pages_prepare() and instead of being released they are isolated.
      
      Page allocation tag counters are decremented at this point since the page
      is considered not in use.  Later on when such pages are released by
      unpoison_memory(), the allocation tag counters will be decremented again
      and the following warning gets reported:
      
      [  113.930443][ T3282] ------------[ cut here ]------------
      [  113.931105][ T3282] alloc_tag was not set
      [  113.931576][ T3282] WARNING: CPU: 2 PID: 3282 at ./include/linux/alloc_tag.h:130 pgalloc_tag_sub.part.66+0x154/0x164
      [  113.932866][ T3282] Modules linked in: hwpoison_inject fuse ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_conntrack ebtable_nat ebtable_broute ip6table_nat ip6table_man4
      [  113.941638][ T3282] CPU: 2 UID: 0 PID: 3282 Comm: madvise11 Kdump: loaded Tainted: G        W          6.11.0-rc4-dirty #18
      [  113.943003][ T3282] Tainted: [W]=WARN
      [  113.943453][ T3282] Hardware name: QEMU KVM Virtual Machine, BIOS unknown 2/2/2022
      [  113.944378][ T3282] pstate: 40400005 (nZcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
      [  113.945319][ T3282] pc : pgalloc_tag_sub.part.66+0x154/0x164
      [  113.946016][ T3282] lr : pgalloc_tag_sub.part.66+0x154/0x164
      [  113.946706][ T3282] sp : ffff800087093a10
      [  113.947197][ T3282] x29: ffff800087093a10 x28: ffff0000d7a9d400 x27: ffff80008249f0a0
      [  113.948165][ T3282] x26: 0000000000000000 x25: ffff80008249f2b0 x24: 0000000000000000
      [  113.949134][ T3282] x23: 0000000000000001 x22: 0000000000000001 x21: 0000000000000000
      [  113.950597][ T3282] x20: ffff0000c08fcad8 x19: ffff80008251e000 x18: ffffffffffffffff
      [  113.952207][ T3282] x17: 0000000000000000 x16: 0000000000000000 x15: ffff800081746210
      [  113.953161][ T3282] x14: 0000000000000000 x13: 205d323832335420 x12: 5b5d353031313339
      [  113.954120][ T3282] x11: ffff800087093500 x10: 000000000000005d x9 : 00000000ffffffd0
      [  113.955078][ T3282] x8 : 7f7f7f7f7f7f7f7f x7 : ffff80008236ba90 x6 : c0000000ffff7fff
      [  113.956036][ T3282] x5 : ffff000b34bf4dc8 x4 : ffff8000820aba90 x3 : 0000000000000001
      [  113.956994][ T3282] x2 : ffff800ab320f000 x1 : 841d1e35ac932e00 x0 : 0000000000000000
      [  113.957962][ T3282] Call trace:
      [  113.958350][ T3282]  pgalloc_tag_sub.part.66+0x154/0x164
      [  113.959000][ T3282]  pgalloc_tag_sub+0x14/0x1c
      [  113.959539][ T3282]  free_unref_page+0xf4/0x4b8
      [  113.960096][ T3282]  __folio_put+0xd4/0x120
      [  113.960614][ T3282]  folio_put+0x24/0x50
      [  113.961103][ T3282]  unpoison_memory+0x4f0/0x5b0
      [  113.961678][ T3282]  hwpoison_unpoison+0x30/0x48 [hwpoison_inject]
      [  113.962436][ T3282]  simple_attr_write_xsigned.isra.34+0xec/0x1cc
      [  113.963183][ T3282]  simple_attr_write+0x38/0x48
      [  113.963750][ T3282]  debugfs_attr_write+0x54/0x80
      [  113.964330][ T3282]  full_proxy_write+0x68/0x98
      [  113.964880][ T3282]  vfs_write+0xdc/0x4d0
      [  113.965372][ T3282]  ksys_write+0x78/0x100
      [  113.965875][ T3282]  __arm64_sys_write+0x24/0x30
      [  113.966440][ T3282]  invoke_syscall+0x7c/0x104
      [  113.966984][ T3282]  el0_svc_common.constprop.1+0x88/0x104
      [  113.967652][ T3282]  do_el0_svc+0x2c/0x38
      [  113.968893][ T3282]  el0_svc+0x3c/0x1b8
      [  113.969379][ T3282]  el0t_64_sync_handler+0x98/0xbc
      [  113.969980][ T3282]  el0t_64_sync+0x19c/0x1a0
      [  113.970511][ T3282] ---[ end trace 0000000000000000 ]---
      
      To fix this, clear the page tag reference after the page got isolated
      and accounted for.
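       
       The fix can be pictured with a small user-space model (all names here are
       illustrative, not the kernel's codetag implementation): decrement the
       counter once when the poisoned page is isolated and clear the tag
       reference at that point, so the later release via unpoison_memory()
       becomes a no-op instead of a second decrement.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the double-decrement fix; 'tag_set' plays the role of
 * the page's codetag reference. */
struct page_tag { bool tag_set; };

static int alloc_tag_counter = 1;   /* one live allocation tracked */

static void pgtag_sub_once(struct page_tag *pt)
{
    if (!pt->tag_set)
        return;                     /* codetag already marked empty */
    alloc_tag_counter--;            /* account the page as freed */
    pt->tag_set = false;            /* clear the reference on isolation */
}
```

       Calling the helper from both the isolation path and the later release
       path then subtracts exactly once.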
      
      Link: https://lkml.kernel.org/r/20240825163649.33294-1-hao.ge@linux.dev
      Fixes: d224eb02 ("codetag: debug: mark codetags for reserved pages as empty")
       Signed-off-by: Hao Ge <gehao@kylinos.cn>
       Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
       Acked-by: Suren Baghdasaryan <surenb@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Hao Ge <gehao@kylinos.cn>
      Cc: Kent Overstreet <kent.overstreet@linux.dev>
      Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
      Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
      Cc: <stable@vger.kernel.org>	[6.10+]
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      5e9784e9
    • Mike Yuan's avatar
      mm/memcontrol: respect zswap.writeback setting from parent cg too · e3992573
      Mike Yuan authored
       Currently, the behavior of zswap.writeback with respect to the cgroup
       hierarchy seems a bit odd.  Unlike zswap.max, it doesn't honor the value from parent
      cgroups.  This surfaced when people tried to globally disable zswap
      writeback, i.e.  reserve physical swap space only for hibernation [1] -
      disabling zswap.writeback only for the root cgroup results in subcgroups
      with zswap.writeback=1 still performing writeback.
      
      The inconsistency became more noticeable after I introduced the
      MemoryZSwapWriteback= systemd unit setting [2] for controlling the knob.
      The patch assumed that the kernel would enforce the value of parent
       cgroups.  It could probably be worked around from systemd's side, by going
       up the slice unit tree and inheriting the value.  Yet I think it's more
       sensible to make it behave consistently with zswap.max and friends.
      
      [1] https://wiki.archlinux.org/title/Power_management/Suspend_and_hibernate#Disable_zswap_writeback_to_use_the_swap_space_only_for_hibernation
      [2] https://github.com/systemd/systemd/pull/31734
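       
       A minimal sketch of the hierarchical enforcement, mirroring how
       zswap.max walks up the tree (structure and field names are simplified
       stand-ins, not the memcg implementation):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified cgroup node: writeback is permitted only if the cgroup and
 * every ancestor permit it. */
struct cg {
    struct cg *parent;
    bool zswap_writeback;
};

static bool zswap_writeback_enabled(const struct cg *c)
{
    for (; c; c = c->parent)
        if (!c->zswap_writeback)
            return false;       /* any ancestor's 0 wins */
    return true;
}
```

       With this shape, disabling writeback on the root cgroup disables it for
       every descendant regardless of the child's own setting.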
      
      Link: https://lkml.kernel.org/r/20240823162506.12117-1-me@yhndnzj.com
      Fixes: 501a06fe ("zswap: memcontrol: implement zswap writeback disabling")
       Signed-off-by: Mike Yuan <me@yhndnzj.com>
       Reviewed-by: Nhat Pham <nphamcs@gmail.com>
       Acked-by: Yosry Ahmed <yosryahmed@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Michal Koutný <mkoutny@suse.com>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Shakeel Butt <shakeel.butt@linux.dev>
      Cc: <stable@vger.kernel.org>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      e3992573
    • Marc Zyngier's avatar
      scripts: fix gfp-translate after ___GFP_*_BITS conversion to an enum · a3f6a89c
      Marc Zyngier authored
      Richard reports that since 772dd034 ("mm: enumerate all gfp flags"),
      gfp-translate is broken, as the bit numbers are implicit, leaving the
      shell script unable to extract them.  Even more, some bits are now at a
      variable location, making it double extra hard to parse using a simple
      shell script.
      
      Use a brute-force approach to the problem by generating a small C stub
      that will use the enum to dump the interesting bits.
      
      As an added bonus, we are now able to identify invalid bits for a given
      configuration.  As an added drawback, we cannot parse include files that
      predate this change anymore.  Tough luck.
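       
       The idea can be sketched as follows: instead of grepping for literal bit
       numbers, emit a tiny C program that includes the enum and lets the
       compiler resolve each flag's value (the enum entries below are
       placeholders, not the real gfp flag list):

```c
#include <assert.h>

/* Stand-in for the kernel's enum of gfp bit positions; the generated
 * stub only ever computes 1 << bit, so the shell script matches masks
 * and never has to parse bit numbers out of the header. */
enum {
    ___GFP_DMA_BIT,
    ___GFP_HIGHMEM_BIT,
    ___GFP_DMA32_BIT,
};

static unsigned int gfp_mask(int bit)
{
    return 1u << bit;
}
```

       Because the compiler evaluates the enum, bits that move around between
       configurations come out right automatically, which is exactly what the
       old text-parsing approach could not do.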
      
      Link: https://lkml.kernel.org/r/20240823163850.3791201-1-maz@kernel.org
      Fixes: 772dd034 ("mm: enumerate all gfp flags")
       Signed-off-by: Marc Zyngier <maz@kernel.org>
       Reported-by: Richard Weinberger <richard@nod.at>
      Cc: Petr Tesařík <petr@tesarici.cz>
      Cc: Suren Baghdasaryan <surenb@google.com>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      a3f6a89c
    • Usama Arif's avatar
      Revert "mm: skip CMA pages when they are not available" · bfe0857c
      Usama Arif authored
      This reverts commit 5da226db ("mm: skip CMA pages when they are not
      available") and b7108d66 ("Multi-gen LRU: skip CMA pages when they are
      not eligible").
      
      lruvec->lru_lock is highly contended and is held when calling
      isolate_lru_folios.  If the lru has a large number of CMA folios
      consecutively, while the allocation type requested is not MIGRATE_MOVABLE,
      isolate_lru_folios can hold the lock for a very long time while it skips
       those.  For a FIO workload, ~150 million order-0 folios were skipped to
       isolate a few ZONE_DMA folios [1].  This can cause lockups [1] and high
      memory pressure for extended periods of time [2].
      
       Remove skipping CMA for MGLRU as well, as it was introduced in sort_folio
       for the same reason as 5da226db.
      
      [1] https://lore.kernel.org/all/CAOUHufbkhMZYz20aM_3rHZ3OcK4m2puji2FGpUpn_-DevGk3Kg@mail.gmail.com/
      [2] https://lore.kernel.org/all/ZrssOrcJIDy8hacI@gmail.com/
      
      [usamaarif642@gmail.com: also revert b7108d66, per Johannes]
        Link: https://lkml.kernel.org/r/9060a32d-b2d7-48c0-8626-1db535653c54@gmail.com
        Link: https://lkml.kernel.org/r/357ac325-4c61-497a-92a3-bdbd230d5ec9@gmail.com
      Link: https://lkml.kernel.org/r/9060a32d-b2d7-48c0-8626-1db535653c54@gmail.com
      Fixes: 5da226db ("mm: skip CMA pages when they are not available")
       Signed-off-by: Usama Arif <usamaarif642@gmail.com>
       Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Bharata B Rao <bharata@amd.com>
      Cc: Breno Leitao <leitao@debian.org>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Yu Zhao <yuzhao@google.com>
      Cc: Zhaoyang Huang <huangzhaoyang@gmail.com>
      Cc: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
      Cc: <stable@vger.kernel.org>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      bfe0857c
    • Liam R. Howlett's avatar
      maple_tree: remove rcu_read_lock() from mt_validate() · f806de88
      Liam R. Howlett authored
      The write lock should be held when validating the tree to avoid updates
      racing with checks.  Holding the rcu read lock during a large tree
      validation may also cause a prolonged rcu read window and "rcu_preempt
      detected stalls" warnings.
      
      Link: https://lore.kernel.org/all/0000000000001d12d4062005aea1@google.com/
      Link: https://lkml.kernel.org/r/20240820175417.2782532-1-Liam.Howlett@oracle.com
      Fixes: 54a611b6 ("Maple Tree: add new data structure")
       Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
      Reported-by: syzbot+036af2f0c7338a33b0cd@syzkaller.appspotmail.com
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: "Paul E. McKenney" <paulmck@kernel.org>
      Cc: <stable@vger.kernel.org>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      f806de88
    • Petr Tesarik's avatar
      kexec_file: fix elfcorehdr digest exclusion when CONFIG_CRASH_HOTPLUG=y · 6dacd79d
      Petr Tesarik authored
      Fix the condition to exclude the elfcorehdr segment from the SHA digest
      calculation.
      
      The j iterator is an index into the output sha_regions[] array, not into
      the input image->segment[] array.  Once it reaches
      image->elfcorehdr_index, all subsequent segments are excluded.  Besides,
      if the purgatory segment precedes the elfcorehdr segment, the elfcorehdr
      may be wrongly included in the calculation.
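       
       The index mix-up can be demonstrated with a compact model (hypothetical
       names; the real loop is in the kexec_file digest code).  Segment 0 below
       is a purgatory-style segment that never reaches sha_regions[], so the
       output index j lags the input index i, and comparing j against the
       elfcorehdr index excludes the wrong segments:

```c
#include <assert.h>
#include <stdbool.h>

enum { NR_SEGMENTS = 3, PURGATORY = 0, ELFCOREHDR = 1 };

/* Returns a bitmask of the segments whose contents get digested.
 * 'use_output_index' selects the buggy comparison against j. */
static unsigned int digested_segments(bool use_output_index)
{
    unsigned int mask = 0;
    int j = 0;   /* index into the output sha_regions[] array */

    for (int i = 0; i < NR_SEGMENTS; i++) {
        if (i == PURGATORY)
            continue;               /* skipped: j now lags i */
        if ((use_output_index ? j : i) == ELFCOREHDR)
            continue;               /* excluded from the digest */
        mask |= 1u << i;            /* region added to sha_regions[] */
        j++;
    }
    return mask;
}
```

       With the buggy comparison the elfcorehdr is digested and the ordinary
       data segment is excluded; comparing against i gives the intended result.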
      
      Link: https://lkml.kernel.org/r/20240805150750.170739-1-petr.tesarik@suse.com
      Fixes: f7cc804a ("kexec: exclude elfcorehdr from the segment digest")
       Signed-off-by: Petr Tesarik <ptesarik@suse.com>
       Acked-by: Baoquan He <bhe@redhat.com>
      Cc: Eric Biederman <ebiederm@xmission.com>
      Cc: Hari Bathini <hbathini@linux.ibm.com>
      Cc: Sourabh Jain <sourabhjain@linux.ibm.com>
      Cc: Eric DeVolder <eric_devolder@yahoo.com>
      Cc: <stable@vger.kernel.org>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      6dacd79d
    • Hao Ge's avatar
      mm/slub: add check for s->flags in the alloc_tagging_slab_free_hook · ab7ca095
      Hao Ge authored
       When CONFIG_MEMCG, CONFIG_KFENCE and CONFIG_KMEMLEAK are all enabled,
       the following warning always occurs, because the following call chain
       is taken:
       mem_pool_alloc
           kmem_cache_alloc_noprof
               slab_alloc_node
                   kfence_alloc
       
       Once the kfence allocation succeeds, slab->obj_exts is not empty,
       because it has already been assigned a value in kfence_init_pool.
       
       Since the prepare_slab_obj_exts_hook function checks for
       s->flags & (SLAB_NO_OBJ_EXT | SLAB_NOLEAKTRACE), the alloc_tag_add
       function is not called as a result.  Therefore, ref->ct remains NULL.
       
       However, when we call mem_pool_free, since obj_ext is not empty, the
       alloc_tag_sub path is eventually invoked.  This is where the warning
       occurs.
       
       So add the corresponding check in alloc_tagging_slab_free_hook.  For
       the __GFP_NO_OBJ_EXT case, I didn't see a specific case where it's
       used with kfence, so I won't add the corresponding check in
       alloc_tagging_slab_free_hook for now.
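       
       The guard can be modeled like this (flag values and names are
       simplified stand-ins for the slab internals): the free hook skips the
       tag subtraction for caches matching the same flag mask that
       prepare_slab_obj_exts_hook uses to skip alloc_tag_add, keeping
       allocation and free symmetric.

```c
#include <assert.h>
#include <stdbool.h>

#define SLAB_NO_OBJ_EXT  (1u << 0)
#define SLAB_NOLEAKTRACE (1u << 1)

/* Mirror of the allocation-side condition: caches with these flags never
 * get an alloc tag, so the free hook must not try to subtract one. */
static bool should_sub_alloc_tag(unsigned int cache_flags)
{
    return !(cache_flags & (SLAB_NO_OBJ_EXT | SLAB_NOLEAKTRACE));
}
```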
      
      [    3.734349] ------------[ cut here ]------------
      [    3.734807] alloc_tag was not set
      [    3.735129] WARNING: CPU: 4 PID: 40 at ./include/linux/alloc_tag.h:130 kmem_cache_free+0x444/0x574
      [    3.735866] Modules linked in: autofs4
      [    3.736211] CPU: 4 UID: 0 PID: 40 Comm: ksoftirqd/4 Tainted: G        W          6.11.0-rc3-dirty #1
      [    3.736969] Tainted: [W]=WARN
      [    3.737258] Hardware name: QEMU KVM Virtual Machine, BIOS unknown 2/2/2022
      [    3.737875] pstate: 60400005 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
      [    3.738501] pc : kmem_cache_free+0x444/0x574
      [    3.738951] lr : kmem_cache_free+0x444/0x574
      [    3.739361] sp : ffff80008357bb60
      [    3.739693] x29: ffff80008357bb70 x28: 0000000000000000 x27: 0000000000000000
      [    3.740338] x26: ffff80008207f000 x25: ffff000b2eb2fd60 x24: ffff0000c0005700
      [    3.740982] x23: ffff8000804229e4 x22: ffff800082080000 x21: ffff800081756000
      [    3.741630] x20: fffffd7ff8253360 x19: 00000000000000a8 x18: ffffffffffffffff
      [    3.742274] x17: ffff800ab327f000 x16: ffff800083398000 x15: ffff800081756df0
      [    3.742919] x14: 0000000000000000 x13: 205d344320202020 x12: 5b5d373038343337
      [    3.743560] x11: ffff80008357b650 x10: 000000000000005d x9 : 00000000ffffffd0
      [    3.744231] x8 : 7f7f7f7f7f7f7f7f x7 : ffff80008237bad0 x6 : c0000000ffff7fff
      [    3.744907] x5 : ffff80008237ba78 x4 : ffff8000820bbad0 x3 : 0000000000000001
      [    3.745580] x2 : 68d66547c09f7800 x1 : 68d66547c09f7800 x0 : 0000000000000000
      [    3.746255] Call trace:
      [    3.746530]  kmem_cache_free+0x444/0x574
      [    3.746931]  mem_pool_free+0x44/0xf4
      [    3.747306]  free_object_rcu+0xc8/0xdc
      [    3.747693]  rcu_do_batch+0x234/0x8a4
      [    3.748075]  rcu_core+0x230/0x3e4
      [    3.748424]  rcu_core_si+0x14/0x1c
      [    3.748780]  handle_softirqs+0x134/0x378
      [    3.749189]  run_ksoftirqd+0x70/0x9c
      [    3.749560]  smpboot_thread_fn+0x148/0x22c
      [    3.749978]  kthread+0x10c/0x118
      [    3.750323]  ret_from_fork+0x10/0x20
      [    3.750696] ---[ end trace 0000000000000000 ]---
      
      Link: https://lkml.kernel.org/r/20240816013336.17505-1-hao.ge@linux.dev
      Fixes: 4b873696 ("mm/slab: add allocation accounting into slab allocation and free paths")
       Signed-off-by: Hao Ge <gehao@kylinos.cn>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Kees Cook <kees@kernel.org>
      Cc: Kent Overstreet <kent.overstreet@linux.dev>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: <stable@vger.kernel.org>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      ab7ca095
    • Ryusuke Konishi's avatar
      nilfs2: fix state management in error path of log writing function · 6576dd66
      Ryusuke Konishi authored
      After commit a694291a ("nilfs2: separate wait function from
      nilfs_segctor_write") was applied, the log writing function
      nilfs_segctor_do_construct() was able to issue I/O requests continuously
      even if user data blocks were split into multiple logs across segments,
      but two potential flaws were introduced in its error handling.
      
      First, if nilfs_segctor_begin_construction() fails while creating the
      second or subsequent logs, the log writing function returns without
      calling nilfs_segctor_abort_construction(), so the writeback flag set on
      pages/folios will remain uncleared.  This causes page cache operations to
      hang waiting for the writeback flag.  For example,
      truncate_inode_pages_final(), which is called via nilfs_evict_inode() when
      an inode is evicted from memory, will hang.
      
       Second, the NILFS_I_COLLECTED flag set on normal inodes remains uncleared.
       As a result, if the next log write involves checkpoint creation, that's
       fine, but if a partial log write that does not create a checkpoint is
       performed, inodes with NILFS_I_COLLECTED set are erroneously removed from
       the "sc_dirty_files" list, and their data and b-tree blocks may not be
       written to the device, corrupting the block mapping.
      
      Fix these issues by uniformly calling nilfs_segctor_abort_construction()
      on failure of each step in the loop in nilfs_segctor_do_construct(),
      having it clean up logs and segment usages according to progress, and
      correcting the conditions for calling nilfs_redirty_inodes() to ensure
      that the NILFS_I_COLLECTED flag is cleared.
      
      Link: https://lkml.kernel.org/r/20240814101119.4070-1-konishi.ryusuke@gmail.com
      Fixes: a694291a ("nilfs2: separate wait function from nilfs_segctor_write")
       Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
       Tested-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
      Cc: <stable@vger.kernel.org>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      6576dd66
    • Ryusuke Konishi's avatar
      nilfs2: fix missing cleanup on rollforward recovery error · 5787fcaa
      Ryusuke Konishi authored
      In an error injection test of a routine for mount-time recovery, KASAN
      found a use-after-free bug.
      
      It turned out that if data recovery was performed using partial logs
      created by dsync writes, but an error occurred before starting the log
      writer to create a recovered checkpoint, the inodes whose data had been
      recovered were left in the ns_dirty_files list of the nilfs object and
      were not freed.
      
      Fix this issue by cleaning up inodes that have read the recovery data if
      the recovery routine fails midway before the log writer starts.
      
      Link: https://lkml.kernel.org/r/20240810065242.3701-1-konishi.ryusuke@gmail.com
      Fixes: 0f3e1c7f ("nilfs2: recovery functions")
       Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
       Tested-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
      Cc: <stable@vger.kernel.org>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      5787fcaa
    • Ryusuke Konishi's avatar
      nilfs2: protect references to superblock parameters exposed in sysfs · 68340825
      Ryusuke Konishi authored
      The superblock buffers of nilfs2 can not only be overwritten at runtime
      for modifications/repairs, but they are also regularly swapped, replaced
      during resizing, and even abandoned when degrading to one side due to
      backing device issues.  So, accessing them requires mutual exclusion using
      the reader/writer semaphore "nilfs->ns_sem".
      
      Some sysfs attribute show methods read this superblock buffer without the
      necessary mutual exclusion, which can cause problems with pointer
      dereferencing and memory access, so fix it.
      
      Link: https://lkml.kernel.org/r/20240811100320.9913-1-konishi.ryusuke@gmail.com
      Fixes: da7141fb ("nilfs2: add /sys/fs/nilfs2/<device> group")
       Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
      Cc: <stable@vger.kernel.org>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      68340825
    • Jann Horn's avatar
      userfaultfd: don't BUG_ON() if khugepaged yanks our page table · 4828d207
      Jann Horn authored
      Since khugepaged was changed to allow retracting page tables in file
      mappings without holding the mmap lock, these BUG_ON()s are wrong - get
      rid of them.
      
      We could also remove the preceding "if (unlikely(...))" block, but then we
      could reach pte_offset_map_lock() with transhuge pages not just for file
      mappings but also for anonymous mappings - which would probably be fine
      but I think is not necessarily expected.
      
      Link: https://lkml.kernel.org/r/20240813-uffd-thp-flip-fix-v2-2-5efa61078a41@google.com
      Fixes: 1d65b771 ("mm/khugepaged: retract_page_tables() without mmap or vma lock")
       Signed-off-by: Jann Horn <jannh@google.com>
       Reviewed-by: Qi Zheng <zhengqi.arch@bytedance.com>
       Acked-by: David Hildenbrand <david@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Pavel Emelyanov <xemul@virtuozzo.com>
      Cc: <stable@vger.kernel.org>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      4828d207
    • Jann Horn's avatar
      userfaultfd: fix checks for huge PMDs · 71c186ef
      Jann Horn authored
      Patch series "userfaultfd: fix races around pmd_trans_huge() check", v2.
      
      The pmd_trans_huge() code in mfill_atomic() is wrong in three different
      ways depending on kernel version:
      
      1. The pmd_trans_huge() check is racy and can lead to a BUG_ON() (if you hit
         the right two race windows) - I've tested this in a kernel build with
         some extra mdelay() calls. See the commit message for a description
         of the race scenario.
         On older kernels (before 6.5), I think the same bug can even
         theoretically lead to accessing transhuge page contents as a page table
         if you hit the right 5 narrow race windows (I haven't tested this case).
      2. As pointed out by Qi Zheng, pmd_trans_huge() is not sufficient for
         detecting PMDs that don't point to page tables.
         On older kernels (before 6.5), you'd just have to win a single fairly
         wide race to hit this.
         I've tested this on 6.1 stable by racing migration (with a mdelay()
         patched into try_to_migrate()) against UFFDIO_ZEROPAGE - on my x86
         VM, that causes a kernel oops in ptlock_ptr().
      3. On newer kernels (>=6.5), for shmem mappings, khugepaged is allowed
         to yank page tables out from under us (though I haven't tested that),
         so I think the BUG_ON() checks in mfill_atomic() are just wrong.
      
      I decided to write two separate fixes for these (one fix for bugs 1+2, one
      fix for bug 3), so that the first fix can be backported to kernels
      affected by bugs 1+2.
      
      
      This patch (of 2):
      
      This fixes two issues.
      
      I discovered that the following race can occur:
      
        mfill_atomic                other thread
        ============                ============
                                    <zap PMD>
        pmdp_get_lockless() [reads none pmd]
        <bail if trans_huge>
        <if none:>
                                    <pagefault creates transhuge zeropage>
          __pte_alloc [no-op]
                                    <zap PMD>
        <bail if pmd_trans_huge(*dst_pmd)>
        BUG_ON(pmd_none(*dst_pmd))
      
      I have experimentally verified this in a kernel with extra mdelay() calls;
      the BUG_ON(pmd_none(*dst_pmd)) triggers.
      
      On kernels newer than commit 0d940a9b ("mm/pgtable: allow
      pte_offset_map[_lock]() to fail"), this can't lead to anything worse than
      a BUG_ON(), since the page table access helpers are actually designed to
      deal with page tables concurrently disappearing; but on older kernels
      (<=6.4), I think we could probably theoretically race past the two
      BUG_ON() checks and end up treating a hugepage as a page table.
      
      The second issue is that, as Qi Zheng pointed out, there are other types
      of huge PMDs that pmd_trans_huge() can't catch: devmap PMDs and swap PMDs
      (in particular, migration PMDs).
      
      On <=6.4, this is worse than the first issue: If mfill_atomic() runs on a
      PMD that contains a migration entry (which just requires winning a single,
      fairly wide race), it will pass the PMD to pte_offset_map_lock(), which
      assumes that the PMD points to a page table.
      
      Breakage follows: First, the kernel tries to take the PTE lock (which will
      crash or maybe worse if there is no "struct page" for the address bits in
      the migration entry PMD - I think at least on X86 there usually is no
      corresponding "struct page" thanks to the PTE inversion mitigation, amd64
      looks different).
      
      If that didn't crash, the kernel would next try to write a PTE into what
      it wrongly thinks is a page table.
      
      As part of fixing these issues, get rid of the check for pmd_trans_huge()
      before __pte_alloc() - that's redundant, we're going to have to check for
      that after the __pte_alloc() anyway.
      
      Backport note: pmdp_get_lockless() is pmd_read_atomic() in older kernels.
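       
       The stricter condition can be sketched with a toy model (all names are
       illustrative, not the kernel's pmd helpers): a PMD may be handed to the
       PTE-level code only if it is present and is none of the "huge" forms,
       whereas checking pmd_trans_huge() alone lets devmap and swap/migration
       PMDs through.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy PMD: flags for each form a PMD entry can take. */
struct pmd_model {
    bool present, trans_huge, devmap, swap_entry;
};

static bool old_check_ok(struct pmd_model p)   /* buggy: trans_huge only */
{
    return !p.trans_huge;
}

static bool new_check_ok(struct pmd_model p)   /* fixed: must be a table */
{
    return p.present && !p.trans_huge && !p.devmap && !p.swap_entry;
}
```

       A migration entry (present bit clear, swap entry set) passes the old
       check and is rejected by the new one, which models the oops described
       above.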
      
      Link: https://lkml.kernel.org/r/20240813-uffd-thp-flip-fix-v2-0-5efa61078a41@google.com
      Link: https://lkml.kernel.org/r/20240813-uffd-thp-flip-fix-v2-1-5efa61078a41@google.com
      Fixes: c1a4de99 ("userfaultfd: mcopy_atomic|mfill_zeropage: UFFDIO_COPY|UFFDIO_ZEROPAGE preparation")
       Signed-off-by: Jann Horn <jannh@google.com>
       Acked-by: David Hildenbrand <david@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Pavel Emelyanov <xemul@virtuozzo.com>
      Cc: Qi Zheng <zhengqi.arch@bytedance.com>
      Cc: <stable@vger.kernel.org>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      71c186ef
    • Will Deacon's avatar
      mm: vmalloc: ensure vmap_block is initialised before adding to queue · 3e3de794
      Will Deacon authored
      Commit 8c61291f ("mm: fix incorrect vbq reference in
      purge_fragmented_block") extended the 'vmap_block' structure to contain a
      'cpu' field which is set at allocation time to the id of the initialising
      CPU.
      
      When a new 'vmap_block' is being instantiated by new_vmap_block(), the
      partially initialised structure is added to the local 'vmap_block_queue'
      xarray before the 'cpu' field has been initialised.  If another CPU is
      concurrently walking the xarray (e.g.  via vm_unmap_aliases()), then it
      may perform an out-of-bounds access to the remote queue thanks to an
      uninitialised index.
      
      This has been observed as UBSAN errors in Android:
      
       | Internal error: UBSAN: array index out of bounds: 00000000f2005512 [#1] PREEMPT SMP
       |
       | Call trace:
       |  purge_fragmented_block+0x204/0x21c
       |  _vm_unmap_aliases+0x170/0x378
       |  vm_unmap_aliases+0x1c/0x28
       |  change_memory_common+0x1dc/0x26c
       |  set_memory_ro+0x18/0x24
       |  module_enable_ro+0x98/0x238
       |  do_init_module+0x1b0/0x310
      
      Move the initialisation of 'vb->cpu' in new_vmap_block() ahead of the
      addition to the xarray.
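       
       The ordering rule is the classic initialise-before-publish pattern.  A
       hedged user-space analogue (type and function names invented for
       illustration, with an atomic slot standing in for the xarray):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct vb {                 /* stand-in for struct vmap_block */
    int cpu;
};

static _Atomic(struct vb *) queue_slot;  /* stand-in for the xarray slot */

/* Fully initialise the object, then publish it with release ordering, so
 * a concurrent reader can never observe the uninitialised 'cpu' field. */
static void publish_vb(struct vb *v, int cpu)
{
    v->cpu = cpu;                                    /* init first */
    atomic_store_explicit(&queue_slot, v, memory_order_release);
}
```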
      
      Link: https://lkml.kernel.org/r/20240812171606.17486-1-will@kernel.org
      Fixes: 8c61291f ("mm: fix incorrect vbq reference in purge_fragmented_block")
       Signed-off-by: Will Deacon <will@kernel.org>
       Reviewed-by: Baoquan He <bhe@redhat.com>
       Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
      Cc: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
      Cc: Hailong.Liu <hailong.liu@oppo.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Lorenzo Stoakes <lstoakes@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: <stable@vger.kernel.org>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      3e3de794
    • Muhammad Usama Anjum's avatar
      selftests: mm: fix build errors on armhf · b808f629
      Muhammad Usama Anjum authored
       __NR_mmap isn't defined on armhf.  mmap() is a commonly available
       system call and its wrapper is present on all architectures, so it
       should be used directly.  That solves the problem for armhf and doesn't
       create problems for other architectures.
       
       Remove the sys_mmap() functions, as they don't do anything other than
       call mmap().  There is no need to set errno = 0 manually as glibc
       always resets it.
      
      For reference errors are as following:
      
        CC       seal_elf
      seal_elf.c: In function 'sys_mmap':
      seal_elf.c:39:33: error: '__NR_mmap' undeclared (first use in this function)
         39 |         sret = (void *) syscall(__NR_mmap, addr, len, prot,
            |                                 ^~~~~~~~~
      
      mseal_test.c: In function 'sys_mmap':
      mseal_test.c:90:33: error: '__NR_mmap' undeclared (first use in this function)
         90 |         sret = (void *) syscall(__NR_mmap, addr, len, prot,
            |                                 ^~~~~~~~~
      
      Link: https://lkml.kernel.org/r/20240809082511.497266-1-usama.anjum@collabora.com
      Fixes: 4926c7a5 ("selftest mm/mseal memory sealing")
       Signed-off-by: Muhammad Usama Anjum <usama.anjum@collabora.com>
      Cc: Jeff Xu <jeffxu@chromium.org>
      Cc: Kees Cook <kees@kernel.org>
      Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: <stable@vger.kernel.org>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      b808f629
  4. 01 Sep, 2024 1 commit
    • Linus Torvalds's avatar
      Merge tag 'x86-urgent-2024-09-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · c9f016e7
      Linus Torvalds authored
      Pull x86 fixes from Thomas Gleixner:
      
        - x2apic_disable() clears x2apic_state and x2apic_mode unconditionally,
          even when the state is X2APIC_ON_LOCKED, which prevents the kernel
          from disabling it, thereby creating inconsistent state.
       
          Reorder the logic so it actually works correctly.
      
       - The XSTATE logic for handling LBR is incorrect as it assumes that
         XSAVES supports LBR when the CPU supports LBR. In fact both
         conditions need to be true. Otherwise the enablement of LBR in the
         IA32_XSS MSR fails and subsequently the machine crashes on the next
         XRSTORS operation because IA32_XSS is not initialized.
      
         Cache the XSTATE support bit during init and make the related
         functions use this cached information and the LBR CPU feature bit to
         cure this.
      
       - Cure a long-standing bug in KASLR
      
         KASLR uses the full address space between PAGE_OFFSET and vaddr_end
         to randomize the starting points of the direct map, vmalloc and
         vmemmap regions. It thereby limits the size of the direct map by
         using the installed memory size plus an extra configurable margin for
         hot-plug memory. This limitation is done to gain more randomization
         space because otherwise only the holes between the direct map,
         vmalloc, vmemmap and vaddr_end would be usable for randomizing.
      
         The limited direct map size is not exposed to the rest of the kernel,
         so the memory hot-plug and resource management related code paths
         still operate under the assumption that the available address space
         can be determined with MAX_PHYSMEM_BITS.
      
         request_free_mem_region() allocates from (1 << MAX_PHYSMEM_BITS) - 1
         downwards. That means the first allocation happens past the end of
         the direct map and if unlucky this address is in the vmalloc space,
         which causes high_memory to become greater than VMALLOC_START and
         consequently causes iounmap() to fail for valid ioremap addresses.
      
         Cure this by exposing the end of the direct map via PHYSMEM_END and
         using that in the memory hot-plug and resource management related
         places instead of relying on MAX_PHYSMEM_BITS. In the KASLR case
         PHYSMEM_END maps to a variable which is initialized by the KASLR
         initialization and otherwise it is based on MAX_PHYSMEM_BITS as
         before.
      
       - Prevent a data leak in mmio_read(). The TDVMCALL exposes the value
         of an uninitialized variable on the stack to the VMM. The variable
         is only required as an output value, so it does not have to be
         exposed to the VMM in the first place.
      
       - Prevent an array overrun in the resource control code on systems with
         Sub-NUMA Clustering enabled because the code failed to adjust the
         index by the number of SNC nodes per L3 cache.
      
      * tag 'x86-urgent-2024-09-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        x86/resctrl: Fix arch_mbm_* array overrun on SNC
        x86/tdx: Fix data leak in mmio_read()
        x86/kaslr: Expose and use the end of the physical memory address space
        x86/fpu: Avoid writing LBR bit to IA32_XSS unless supported
        x86/apic: Make x2apic_disable() work correctly
      c9f016e7