1. 23 Sep, 2024 7 commits
    • Merge tag 'gfs2-v6.10-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2 · 721068de
      Linus Torvalds authored
      Pull gfs2 update from Andreas Gruenbacher:
      
       - Convert the writepage address space operation to writepages (Matthew
         Wilcox)
      
       - A syzkaller fix (by Julian Sun) and a minor cleanup (Andreas
         Gruenbacher)
      
      * tag 'gfs2-v6.10-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2:
        gfs2: Remove gfs2_aspace_writepage()
        gfs2: Remove gfs2_jdata_writepage()
        gfs2: Remove __gfs2_writepage()
        gfs2: Add gfs2_aspace_writepages()
        gfs2: fix double destroy_workqueue error
        gfs2: Minor gfs2_glock_cb cleanup
    • Merge tag 'for-6.12-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux · a1fb2fcb
      Linus Torvalds authored
      Pull btrfs fixes from David Sterba:
      
       - fix dangling pointer to rb-tree of defragmented inodes after cleanup
      
       - a followup fix to handle concurrent lseek on the same fd that could
         leak memory under some conditions
      
       - fix wrong root id reported in tree checker when verifying dref
      
      * tag 'for-6.12-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
        btrfs: fix use-after-free on rbtree that tracks inodes for auto defrag
        btrfs: tree-checker: fix the wrong output of data backref objectid
        btrfs: fix race setting file private on concurrent lseek using same fd
    • Merge tag 'fs_for_v6.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs · d0359e4c
      Linus Torvalds authored
      Pull quota and isofs updates from Jan Kara:
       "A few small cleanups in quota and isofs"
      
      * tag 'fs_for_v6.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
        isofs: Annotate struct SL_component with __counted_by()
        quota: remove unnecessary error code translation in dquot_quota_enable
        quota: remove redundant return at end of void function
        quota: remove unneeded return value of register_quota_format
        quota: avoid missing put_quota_format when DQUOT_SUSPENDED is passed
    • Merge tag 'bcachefs-2024-09-21' of git://evilpiepirate.org/bcachefs · b3f391fd
      Linus Torvalds authored
      Pull bcachefs updates from Kent Overstreet:
      
       - rcu_pending, btree key cache rework: this solves lock contention in
         the key cache, eliminating the biggest source of the srcu lock hold
         time warnings, and drastically improving performance on some metadata
         heavy workloads - on multithreaded creates we're now 3-4x faster than
         xfs.
      
       - We're now using an rhashtable instead of the system inode hash table;
         this is another significant performance improvement on multithreaded
         metadata workloads, eliminating more lock contention.
      
       - for_each_btree_key_in_subvolume_upto(): new helper for iterating over
         keys within a specific subvolume, eliminating a lot of open coded
         "subvolume_get_snapshot()" and also fixing another source of srcu
         lock time warnings, by running each loop iteration in its own
         transaction (as the existing for_each_btree_key() does).
      
       - More work on btree_trans locking asserts; we now assert that we don't
         hold btree node locks when trans->locked is false, which is important
         because we don't use lockdep for tracking individual btree node
         locks.
      
       - Some cleanups and improvements in the bset.c btree node lookup code,
         from Alan.
      
       - Rework of btree node pinning, which we use in backpointers fsck. The
         old hacky implementation, where the shrinker just skipped over nodes
         in the pinned range, was causing OOMs; instead we now use another
         shrinker with a much higher seeks number for pinned nodes.
      
       - Rebalance now uses BCH_WRITE_ONLY_SPECIFIED_DEVS; this fixes an issue
         where rebalance would sometimes fall back to allocating from the full
         filesystem, which is not what we want when it's trying to move data
         to a specific target.
      
       - Use __GFP_ACCOUNT, GFP_RECLAIMABLE for btree node, key cache
         allocations.
      
       - Idmap mounts are now supported (Hongbo Li)
      
       - Rename whiteouts are now supported (Hongbo Li)
      
       - Erasure coding can now handle devices being marked as failed, or
         forcibly removed. We still need the evacuate path for erasure coding,
         but it's getting very close to ready for people to start using.
      
      * tag 'bcachefs-2024-09-21' of git://evilpiepirate.org/bcachefs: (99 commits)
        bcachefs: return err ptr instead of null in read sb clean
        bcachefs: Remove duplicated include in backpointers.c
        bcachefs: Don't drop devices with stripe pointers
        bcachefs: bch2_ec_stripe_head_get() now checks for change in rw devices
        bcachefs: bch_fs.rw_devs_change_count
        bcachefs: bch2_dev_remove_stripes()
        bcachefs: bch2_trigger_ptr() calculates sectors even when no device
        bcachefs: improve error messages in bch2_ec_read_extent()
        bcachefs: improve error message on too few devices for ec
        bcachefs: improve bch2_new_stripe_to_text()
        bcachefs: ec_stripe_head.nr_created
        bcachefs: bch_stripe.disk_label
        bcachefs: stripe_to_mem()
        bcachefs: EIO errcode cleanup
        bcachefs: Rework btree node pinning
        bcachefs: split up btree cache counters for live, freeable
        bcachefs: btree cache counters should be size_t
        bcachefs: Don't count "skipped access bit" as touched in btree cache scan
        bcachefs: Failed devices no longer require mounting in degraded mode
        bcachefs: bch2_dev_rcu_noerror()
        ...
    • Merge tag 'pull-stable-struct_fd' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs · f8ffbc36
      Linus Torvalds authored
      Pull 'struct fd' updates from Al Viro:
       "Just the 'struct fd' layout change, with conversion to accessor
        helpers"
      
      * tag 'pull-stable-struct_fd' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
        add struct fd constructors, get rid of __to_fd()
        struct fd: representation change
        introduce fd_file(), convert all accessors to it.
    • mm: fix build on 32-bit targets without MAX_PHYSMEM_BITS · f8eb5bd9
      Linus Torvalds authored
      The merge resolution to deal with the conflict between commits
      ea72ce5d ("x86/kaslr: Expose and use the end of the physical memory
      address space") and 99185c10 ("resource, kunit: add test case for
      region_intersects()") ended up being broken in configurations that
      didn't define a MAX_PHYSMEM_BITS and that had a 32-bit 'phys_addr_t'.
      
      The fallback to using all bits set (ie "(-1ULL)") ended up causing a
      build error:
      
          kernel/resource.c: In function ‘gfr_start’:
          include/linux/minmax.h:93:30: error: conversion from ‘long long unsigned int’ to ‘resource_size_t’ {aka ‘unsigned int’} changes value from ‘18446744073709551615’ to ‘4294967295’ [-Werror=overflow]
      
      this was reported by Geert for m68k, but he points out that it happens
      on other 32-bit architectures too, eg mips, xtensa, parisc, and powerpc.
      
      Limiting 'PHYSMEM_END' to a 'phys_addr_t' (which is the same as
      'resource_size_t') fixes the build, but Geert points out that it will
      then cause a silent overflow in mm/sparse.c:
      
      	unsigned long max_sparsemem_pfn = (PHYSMEM_END + 1) >> PAGE_SHIFT;
      
      so we actually do want PHYSMEM_END to be defined as a 64-bit type - just
      not all ones, and not larger than 'phys_addr_t'.
      
      The proper fix is probably to not have some kind of default fallback at
      all, but just make sure every architecture has a valid MAX_PHYSMEM_BITS.
      But in the meantime, this just applies the rule that PHYSMEM_END is the
      largest value that fits in a 'phys_addr_t', but does not have the high
      bit set in 64 bits.
      
      Ugly, ugly.
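
The rule described above can be sketched in plain C. This is a minimal illustration under stated assumptions, not the kernel's actual definition: `demo_phys_addr_t`, `DEMO_PHYSMEM_END`, `DEMO_PAGE_SHIFT`, and `demo_max_pfn()` are hypothetical names, and the 32-bit typedef stands in for a 32-bit 'phys_addr_t'.

```c
#include <stdint.h>

/* Hypothetical stand-in for a 32-bit 'phys_addr_t'. */
typedef uint32_t demo_phys_addr_t;

/*
 * Largest value that fits in the address type, widened to 64 bits,
 * with the top bit of the 64-bit value cleared.
 */
#define DEMO_PHYSMEM_END \
	((uint64_t)((demo_phys_addr_t)-1) & ~(1ULL << 63))

/* Assumed page size of 4 KiB, for illustration only. */
#define DEMO_PAGE_SHIFT 12

/*
 * Same shape as the mm/sparse.c expression quoted above. Because the
 * value is 64 bits wide and its top bit is clear, the "+ 1" cannot wrap.
 */
static uint64_t demo_max_pfn(void)
{
	return (DEMO_PHYSMEM_END + 1) >> DEMO_PAGE_SHIFT;
}
```

With the 32-bit stand-in type, the limit is 0xFFFFFFFF and adding one yields 0x100000000 instead of silently overflowing to zero.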
      Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • hexagon: vdso: Fix build failure · 9631042b
      Guenter Roeck authored
      Hexagon images fail to build with the following error.
      
      arch/hexagon/kernel/vdso.c:57:3: error: use of undeclared identifier 'name'
                      name = "[vdso]",
                      ^
      
      Add the missing '.' to fix the problem.
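
For readers unfamiliar with the error, here is a minimal sketch of the pattern involved. The structure and variable names are hypothetical and are not the ones in arch/hexagon/kernel/vdso.c. Without the leading '.', the line is parsed as an assignment to an identifier called 'name', which does not exist in that scope.

```c
#include <string.h>

/* Hypothetical stand-in for the structure being initialized. */
struct demo_mapping {
	const char *name;
};

/* The leading '.' makes 'name' a designated initializer for the field. */
static const struct demo_mapping demo_vdso_mapping = {
	.name = "[vdso]",
};
```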
      
      Fixes: 497258df ("mm: remove legacy install_special_mapping() code")
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Guenter Roeck <linux@roeck-us.net>
      Reviewed-by: Brian Cain <bcain@quicinc.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  2. 22 Sep, 2024 7 commits
    • seqcount: replace smp_rmb() in read_seqcount() with load acquire · d0dd066a
      Christoph Lameter (Ampere) authored
      Many architectures support load acquire which can replace a memory
      barrier and save some cycles.
      
      A typical sequence
      
      	do {
      		seq = read_seqcount_begin(&s);
      		<something>
      	} while (read_seqcount_retry(&s, seq));
      
      requires 13 cycles on an N1 Neoverse arm64 core (Ampere Altra, to be
      specific) for an empty loop.  Two read memory barriers are needed.  One
      for each of the seqcount_* functions.
      
      We can replace the first read barrier with a load acquire of the
      seqcount which saves us one barrier.
      
      On the Altra doing so reduces the cycle count from 13 to 8.
      
      According to ARM, this is a general improvement for the ARM64
      architecture and not specific to a certain processor.
      
      See
      
        https://developer.arm.com/documentation/102336/0100/Load-Acquire-and-Store-Release-instructions
      
       "Weaker ordering requirements that are imposed by Load-Acquire and
        Store-Release instructions allow for micro-architectural
        optimizations, which could reduce some of the performance impacts that
        are otherwise imposed by an explicit memory barrier.
      
        If the ordering requirement is satisfied using either a Load-Acquire
        or Store-Release, then it would be preferable to use these
        instructions instead of a DMB"
      
      [ NOTE! This is my original minimal patch that unconditionally switches
        over to using smp_load_acquire(), instead of the much more involved
        and subtle patch that Christoph Lameter wrote that made it
        conditional.
      
        But Christoph gets authorship credit because I had initially thought
        that we needed the more complex model, and Christoph ran with it
        and did the work. Only after looking at code generation for all the
        relevant architectures, did I come to the conclusion that nobody
        actually really needs the old "smp_rmb()" model.
      
        Even architectures without load-acquire support generally do as well
        or better with smp_load_acquire().
      
        So credit to Christoph, but if this then causes issues on other
        architectures, put the blame solidly on me.
      
        Also note as part of the ruthless simplification, this gets rid of the
        overly subtle optimization where some code uses a non-barrier version
        of the sequence count (see the __read_seqcount_begin() users in
        fs/namei.c). They then play games with their own barriers and/or with
        nested sequence counts.
      
        Those optimizations are literally meaningless on x86, and questionable
        elsewhere. If somebody can show that they matter, we need to re-do
        them more cleanly than "use an internal helper".       - Linus ]
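
The idea in this commit can be sketched in user space with C11 atomics. This is an analogue under stated assumptions, not the kernel's seqcount code: every `demo_*` name is hypothetical, and `memory_order_acquire` stands in for smp_load_acquire(). The begin side uses a single load-acquire instead of a plain load followed by a read barrier, while the retry side keeps its barrier.

```c
#include <stdatomic.h>

/* Hypothetical user-space analogue of a sequence count guarding two values. */
struct demo_seq {
	atomic_uint seq;	/* odd while a write is in progress */
	atomic_int a, b;	/* protected data */
};

static unsigned demo_read_begin(struct demo_seq *s)
{
	unsigned seq;

	/* Load-acquire: later data loads cannot be reordered before it. */
	while ((seq = atomic_load_explicit(&s->seq, memory_order_acquire)) & 1)
		;	/* writer active, spin */
	return seq;
}

static int demo_read_retry(struct demo_seq *s, unsigned start)
{
	/* The second read barrier stays: data loads finish before the re-check. */
	atomic_thread_fence(memory_order_acquire);
	return atomic_load_explicit(&s->seq, memory_order_relaxed) != start;
}

static void demo_write(struct demo_seq *s, int a, int b)
{
	unsigned seq = atomic_load_explicit(&s->seq, memory_order_relaxed);

	/* Make the count odd, then order it before the data stores. */
	atomic_store_explicit(&s->seq, seq + 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_release);
	atomic_store_explicit(&s->a, a, memory_order_relaxed);
	atomic_store_explicit(&s->b, b, memory_order_relaxed);
	/* Store-release publishes the data before the count becomes even. */
	atomic_store_explicit(&s->seq, seq + 2, memory_order_release);
}

/* Reader loop in the shape quoted in the commit message. */
static int demo_read_sum(struct demo_seq *s)
{
	unsigned seq;
	int sum;

	do {
		seq = demo_read_begin(s);
		sum = atomic_load_explicit(&s->a, memory_order_relaxed) +
		      atomic_load_explicit(&s->b, memory_order_relaxed);
	} while (demo_read_retry(s, seq));
	return sum;
}
```

The protected fields are relaxed atomics here only so that the sketch is free of data races under the C11 memory model.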
      Signed-off-by: Christoph Lameter (Ampere) <cl@gentwo.org>
      Link: https://lore.kernel.org/all/20240912-seq_optimize-v3-1-8ee25e04dffa@gentwo.org/
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Merge branch 'address-masking' · de5cb0dc
      Linus Torvalds authored
      Merge user access fast validation using address masking.
      
      This allows architectures to optionally use a data dependent address
      masking model instead of a conditional branch for validating user
      accesses.  That avoids the Spectre-v1 speculation barriers.
      
      Right now only x86-64 takes advantage of this, and not all architectures
      will be able to do it.  It requires a guard region between the user and
      kernel address spaces (so that you can't overflow from one to the
      other), and an easy way to generate a guaranteed-to-fault address for
      invalid user pointers.
      
      Also note that this currently assumes that there is no difference
      between user read and write accesses.  If extended to architectures like
      powerpc, we'll also need to separate out the user read-vs-write cases.
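
A data-dependent mask of this kind might look like the sketch below. It is an illustration of the idea only, not the x86 implementation: `demo_mask_user_address()` is a hypothetical name, and the sketch assumes a 64-bit layout in which user addresses have the top bit clear, kernel addresses have it set, and the all-ones address is guaranteed to fault.

```c
#include <stdint.h>

/*
 * Hypothetical 64-bit layout: user addresses have the top bit clear,
 * kernel addresses have it set, and the all-ones address always faults.
 */
static uint64_t demo_mask_user_address(uint64_t addr)
{
	/* All ones when the top bit is set, zero otherwise; no branch needed. */
	uint64_t mask = 0 - (addr >> 63);

	/* Valid user addresses pass through; anything else becomes all ones. */
	return addr | mask;
}
```

Because the result depends on the address through arithmetic instead of a conditional branch, there is no branch outcome to mispredict, which is why the speculation barrier can be dropped.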
      
      * address-masking:
        x86: make the masked_user_access_begin() macro use its argument only once
        x86: do the user address masking outside the user access area
        x86: support user address masking instead of non-speculative conditional
    • x86: make the masked_user_access_begin() macro use its argument only once · 533ab223
      Linus Torvalds authored
      This doesn't actually matter for any of the current users, but before
      merging it into mainline, make sure we don't have any surprising semantics.
      
      We don't actually want to use an inline function here, because we want
      to allow - but not require - const pointer arguments, and return them as
      such.  But we already had a local auto-type variable, so let's just use
      it to avoid any possible double evaluation.
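
The double-evaluation hazard and the local-variable fix can be sketched with GNU C extensions (statement expressions and `__auto_type`). The macros and helpers below are hypothetical and unrelated to the real masked_user_access_begin(); they only show why a macro with an auto-typed local evaluates its argument once while still returning a const pointer as const.

```c
static int demo_calls;	/* counts how often the argument expression runs */
static const char demo_buf[4] = "abc";

/* Argument expression with a visible side effect. */
static const char *demo_next(void)
{
	demo_calls++;
	return demo_buf;
}

/* Evaluates 'p' twice: once for the test, once for the value. */
#define DEMO_CHECK_TWICE(p)	((p) ? (p) : (__typeof__(p))0)

/* GNU C: evaluates 'p' once into a local, and keeps its (const) type. */
#define DEMO_CHECK_ONCE(p) ({				\
	__auto_type demo_p = (p);			\
	demo_p ? demo_p : (__typeof__(demo_p))0;	\
})
```

Passing `demo_next()` to `DEMO_CHECK_TWICE` calls it twice, while `DEMO_CHECK_ONCE` calls it once.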
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Merge tag 'trace-ring-buffer-v6.12' of... · af9c191a
      Linus Torvalds authored
      Merge tag 'trace-ring-buffer-v6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
      
      Pull ring-buffer updates from Steven Rostedt:
      
       - tracing/ring-buffer: persistent buffer across reboots
      
         This allows for the tracing instance ring buffer to stay persistent
         across reboots. The way this is done is by adding to the kernel
         command line:
      
           trace_instance=boot_map@0x285400000:12M
      
         This will reserve 12 megabytes at the address 0x285400000, and then
         map the tracing instance "boot_map" ring buffer to that memory. This
         will appear as a normal instance in the tracefs system:
      
           /sys/kernel/tracing/instances/boot_map
      
         A user could enable tracing in that instance, and on reboot or kernel
         crash, if the memory is not wiped by the firmware, it will recreate
         the trace in that instance. For example, if one was debugging a
         shutdown of a kernel reboot:
      
           # cd /sys/kernel/tracing
           # echo function > instances/boot_map/current_tracer
           # reboot
           [..]
           # cd /sys/kernel/tracing
           # tail instances/boot_map/trace
                 swapper/0-1       [000] d..1.   164.549800: restore_boot_irq_mode <-native_machine_shutdown
                 swapper/0-1       [000] d..1.   164.549801: native_restore_boot_irq_mode <-native_machine_shutdown
                 swapper/0-1       [000] d..1.   164.549802: disconnect_bsp_APIC <-native_machine_shutdown
                 swapper/0-1       [000] d..1.   164.549811: hpet_disable <-native_machine_shutdown
                 swapper/0-1       [000] d..1.   164.549812: iommu_shutdown_noop <-native_machine_restart
                 swapper/0-1       [000] d..1.   164.549813: native_machine_emergency_restart <-__do_sys_reboot
                 swapper/0-1       [000] d..1.   164.549813: tboot_shutdown <-native_machine_emergency_restart
                 swapper/0-1       [000] d..1.   164.549820: acpi_reboot <-native_machine_emergency_restart
                 swapper/0-1       [000] d..1.   164.549821: acpi_reset <-acpi_reboot
                 swapper/0-1       [000] d..1.   164.549822: acpi_os_write_port <-acpi_reboot
      
         On reboot, the buffer is examined to make sure it is valid. The
         validation check even steps through every event to make sure the meta
         data of the event is correct. If any test fails, it will simply reset
         the buffer, and the buffer will be empty on boot.
      
       - Allow the tracing persistent boot buffer to use the "reserve_mem"
         option
      
         Instead of having the admin find a physical address to store the
         persistent buffer, which can be very tedious if they have to
         administrate several different machines, allow them to use the
         "reserve_mem" option that will find a location for them. It is not as
         reliable because of KASLR, as the loading of the kernel in different
         locations can cause the memory allocated to be inconsistent. Booting
         with "nokaslr" can make reserve_mem more reliable.
      
       - Have function graph tracer handle offsets from a previous boot.
      
         The ring buffer output from a previous boot may have different
         addresses due to kaslr. Have the function graph tracer handle these
         by using the delta from the previous boot to the new boot address
         space.
      
       - Only reset the saved meta offset when the buffer is started or reset
      
         The persistent memory meta data holds the previous address space
         information, so that the delta needed for function tracing can be
         calculated. This gets updated after being read to hold the new
         address space. But if the buffer isn't used for that boot, then on
         reboot the delta is calculated from the previous boot and not from
         the boot that holds the data in the ring buffer. This causes the
         functions not to be shown. Do not save the address space information
         of the current kernel until it is being recorded.
      
       - Add a magic variable to test the valid meta data
      
         Add a magic variable in the meta data that can also be used for
         validation. The validator of the previous buffer doesn't need this
         magic data, but it can be used if the meta data is changed by a new
         kernel, which may have the same format that passes the validator but
         is used differently. This magic number can also be used as a
         "versioning" of the meta data.
      
       - Align user space mapped ring buffer sub buffers to improve TLB
         entries
      
         Linus mentioned that the mapped ring buffer sub buffers were
         misaligned between the meta page and the sub-buffers, so that if the
         sub-buffers were bigger than PAGE_SIZE, it wouldn't allow the TLB to
         use bigger entries.
      
       - Add new kernel command line "traceoff" to disable tracing on boot for
         instances
      
         If tracing is enabled for a boot instance, there needs to be a way
         to disable it on boot so that new events are not entered into the
         ring buffer and mixed with events from a previous boot, as that
         can be confusing.
      
       - Allow trace_printk() to go to other instances
      
         Currently, trace_printk() can only go to the top level instance. When
         debugging with a persistent buffer, it is really useful to be able to
         add trace_printk() to go to that buffer, so that you have access to
         them after a crash.
      
       - Do not use "bin_printk()" for traces to a boot instance
      
         The bin_printk() saves only a pointer to the printk format in the
         ring buffer, as the reader of the buffer can still have access to it.
         But this is not the case if the buffer is from a previous boot. If
         the trace_printk() is going to a "persistent" buffer, it will use the
         slower version that writes the printk format into the buffer.
      
       - Add command line option to allow trace_printk() to go to an instance
      
         Allow the kernel command line to define which instance the
         trace_printk() goes to, instead of forcing the admin to set it for
         every boot via the tracefs options.
      
       - Start a document that explains how to use tracefs to debug the kernel
      
       - Add some more kernel selftests to test user mapped ring buffer
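
The "delta from the previous boot" mentioned in the function graph and meta offset items above can be illustrated with a small sketch. All names here are hypothetical and the layout is an assumption; the real meta data format is not shown in this message. The idea is that an address recorded by the previous boot is shifted by the difference between the two boots' kernel text load addresses.

```c
#include <stdint.h>

/* Hypothetical record of where kernel text was loaded in a given boot. */
struct demo_boot_meta {
	uint64_t text_addr;
};

/* Offset (modulo 2^64) to add to an address saved by the previous boot. */
static uint64_t demo_text_delta(const struct demo_boot_meta *prev,
				const struct demo_boot_meta *cur)
{
	return cur->text_addr - prev->text_addr;
}

/* Translate an address from the previous boot into the current one. */
static uint64_t demo_rebase(uint64_t old_addr, uint64_t delta)
{
	return old_addr + delta;
}
```

Unsigned arithmetic is used so that the translation is well defined whether the current boot loaded the kernel above or below the previous location.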
      
      * tag 'trace-ring-buffer-v6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: (28 commits)
        selftests/ring-buffer: Handle meta-page bigger than the system
        selftests/ring-buffer: Verify the entire meta-page padding
        tracing/Documentation: Start a document on how to debug with tracing
        tracing: Add option to set an instance to be the trace_printk destination
        tracing: Have trace_printk not use binary prints if boot buffer
        tracing: Allow trace_printk() to go to other instance buffers
        tracing: Add "traceoff" flag to boot time tracing instances
        ring-buffer: Align meta-page to sub-buffers for improved TLB usage
        ring-buffer: Add magic and struct size to boot up meta data
        ring-buffer: Don't reset persistent ring-buffer meta saved addresses
        tracing/fgraph: Have fgraph handle previous boot function addresses
        tracing: Allow boot instances to use reserve_mem boot memory
        tracing: Fix ifdef of snapshots to not prevent last_boot_info file
        ring-buffer: Use vma_pages() helper function
        tracing: Fix NULL vs IS_ERR() check in enable_instances()
        tracing: Add last boot delta offset for stack traces
        tracing: Update function tracing output for previous boot buffer
        tracing: Handle old buffer mappings for event strings and functions
        tracing/ring-buffer: Add last_boot_info file to boot instance
        ring-buffer: Save text and data locations in mapped meta data
        ...
    • Merge tag 'ktest-v6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-ktest · dd609b8a
      Linus Torvalds authored
      Pull ktest updates from Steven Rostedt:
      
       - Add notification of build warnings for all tests
      
         Currently, the build will only fail on warnings if the ktest config
         file states that it should fail or if the compile is done with
         '-Werror'. This has allowed warnings to sneak in if it doesn't fail.
      
         Add a notification at the end of the test that will state that
         warnings were found in the build so that the developer will be aware
         of it.
      
       - Fix the grub2 parser to not return the wrong kernel index
      
         ktest.pl can read the grub.cfg file to know what kernel to boot to
         via grub-reboot. This requires knowing the index that the kernel is
         referenced by in the grub.cfg file. Some distros have logic to
         determine the menuentry that can cause the ktest.pl to come up with
         the wrong index and boot the wrong kernel.
      
      * tag 'ktest-v6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-ktest:
        ktest.pl: Avoid false positives with grub2 skip regex
        ktest.pl: Always warn on build warnings
    • Merge tag 'perf-tools-for-v6.12-1-2024-09-19' of... · 891e8abe
      Linus Torvalds authored
      Merge tag 'perf-tools-for-v6.12-1-2024-09-19' of git://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools
      
      Pull perf tools updates from Arnaldo Carvalho de Melo:
      
       - Use BPF + BTF to collect and pretty print syscall and tracepoint
         arguments in 'perf trace', done as a GSoC activity
      
       - Data-type profiling improvements:
      
           - Cache debuginfo to speed up data type resolution
      
           - Add the 'typecln' sort order, to show which cacheline in a target
             is hot or cold. The following shows members in the cfs_rq's first
             cache line:
      
               $ perf report -s type,typecln,typeoff -H
               ...
               -    2.67%        struct cfs_rq
                  +    1.23%        struct cfs_rq: cache-line 2
                  +    0.57%        struct cfs_rq: cache-line 4
                  +    0.46%        struct cfs_rq: cache-line 6
                  -    0.41%        struct cfs_rq: cache-line 0
                          0.39%        struct cfs_rq +0x14 (h_nr_running)
                          0.02%        struct cfs_rq +0x38 (tasks_timeline.rb_leftmost)
      
           - When a typedef resolves to an unnamed struct, use the typedef name
      
           - When a struct has just one basic type field (int, etc), resolve
             the type sort order to the name of the struct, not the type of
             the field
      
           - Support type folding/unfolding in the data-type annotation TUI
      
           - Fix bitfields offsets and sizes
      
           - Initial support for PowerPC, using libcapstone and the usual
             objdump disassembly parsing routines
      
       - Add support for disassembling and addr2line using the LLVM libraries,
         speeding up those operations
      
       - Support --addr2line option in 'perf script' as with other tools
      
       - Intel branch counters (LBR event logging) support, only available in
         recent Intel processors, for instance, the new "brcntr" field can be
         asked from 'perf script' to print the information collected from this
         feature:
      
           $ perf script -F +brstackinsn,+brcntr
      
           # Branch counter abbr list:
           # branch-instructions:ppp = A
           # branch-misses = B
           # '-' No event occurs
           # '+' Event occurrences may be lost due to branch counter saturated
               tchain_edit  332203 3366329.405674:  53030 branch-instructions:ppp:    401781 f3+0x2c (home/sdp/test/tchain_edit)
                  f3+31:
               0000000000401774   insn: eb 04                  br_cntr: AA  # PRED 5 cycles [5]
               000000000040177a   insn: 81 7d fc 0f 27 00 00
               0000000000401781   insn: 7e e3                  br_cntr: A   # PRED 1 cycles [6] 2.00 IPC
               0000000000401766   insn: 8b 45 fc
               0000000000401769   insn: 83 e0 01
               000000000040176c   insn: 85 c0
               000000000040176e   insn: 74 06                  br_cntr: A   # PRED 1 cycles [7] 4.00 IPC
               0000000000401776   insn: 83 45 fc 01
               000000000040177a   insn: 81 7d fc 0f 27 00 00
               0000000000401781   insn: 7e e3                  br_cntr: A   # PRED 7 cycles [14] 0.43 IPC
      
       - Support Timed PEBS (Precise Event-Based Sampling), a recent hardware
         feature in Intel processors
      
       - Add 'perf ftrace profile' subcommand, using ftrace's function-graph
         tracer so that users can see the total, average, max execution time
         as well as the number of invocations easily, for instance:
      
           $ sudo perf ftrace profile -G __x64_sys_perf_event_open -- \
             perf stat -e cycles -C1 true 2> /dev/null | head
           # Total (us)  Avg (us)  Max (us)  Count  Function
                 65.611    65.611    65.611      1  __x64_sys_perf_event_open
                 30.527    30.527    30.527      1  anon_inode_getfile
                 30.260    30.260    30.260      1  __anon_inode_getfile
                 29.700    29.700    29.700      1  alloc_file_pseudo
                 17.578    17.578    17.578      1  d_alloc_pseudo
                 17.382    17.382    17.382      1  __d_alloc
                 16.738    16.738    16.738      1  kmem_cache_alloc_lru
                 15.686    15.686    15.686      1  perf_event_alloc
                 14.012     7.006    11.264      2  obj_cgroup_charge
      
       - 'perf sched timehist' improvements, including the addition of
         priority showing/filtering command line options
      
       - Various improvements to 'perf probe', including 'perf test'
         regression tests
      
       - Introduce 'perf check', initially to check if some feature is
         in place, using it in 'perf test'
      
       - Various fixes for 32-bit systems
      
       - Address more leak sanitizer failures
      
       - Fix memory leaks (LBR, disasm lock ops, etc)
      
       - More reference counting fixes (branch_info, etc)
      
       - Constify 'struct perf_tool' parameters to improve code generation
         and reduce the chances of having its internals changed, which isn't
         expected
      
       - More constifications in various other places
      
       - Add more build tests, including for JEVENTS
      
       - Add more 'perf test' entries ('perf record LBR', pipe/inject,
         --setup-filter, 'perf ftrace', 'cgroup sampling', etc)
      
       - Inject build ids for all entries in a call chain in 'perf inject',
         not just for the main sample
      
       - Improve the BPF based sample filter, allowing root to setup filters
         in bpffs that then can be used by non-root users
      
       - Allow filtering by cgroups with the BPF based sample filter
      
       - Allow a more compact way for 'perf mem report' using the
         -T/--type-profile and also provide a --sort option similar to the one
         in 'perf report', 'perf top', to setup the sort order manually
      
       - Fix --group behavior in 'perf annotate' when leader has no samples,
         where it was not showing anything even when other events in the group
         had samples
      
       - Fix spinlock and rwlock accounting in 'perf lock contention'
      
       - Fix libsubcmd fixdep Makefile dependencies
      
       - Improve 'perf ftrace' error message when ftrace isn't available
      
       - Update various Intel JSON vendor event files
      
       - ARM64 CoreSight hardware tracing infrastructure improvements, mostly
         not visible to users
      
       - Update power10 JSON events
      
      * tag 'perf-tools-for-v6.12-1-2024-09-19' of git://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools: (310 commits)
        perf trace: Mark the 'head' arg in the set_robust_list syscall as coming from user space
        perf trace: Mark the 'rseq' arg in the rseq syscall as coming from user space
        perf env: Find correct branch counter info on hybrid
        perf evlist: Print hint for group
        tools: Drop nonsensical -O6
        perf pmu: To info add event_type_desc
        perf evsel: Add accessor for tool_event
        perf pmus: Fake PMU clean up
        perf list: Avoid potential out of bounds memory read
        perf help: Fix a typo ("bellow")
        perf ftrace: Detect whether ftrace is enabled on system
        perf test shell probe_vfs_getname: Remove extraneous '=' from probe line number regex
        perf build: Require at least clang 16.0.6 to build BPF skeletons
        perf trace: If a syscall arg is marked as 'const', assume it is coming _from_ userspace
        perf parse-events: Remove duplicated include in parse-events.c
        perf callchain: Allow symbols to be optional when resolving a callchain
        perf inject: Lazy build-id mmap2 event insertion
        perf inject: Add new mmap2-buildid-all option
        perf inject: Fix build ID injection
        perf annotate-data: Add pr_debug_scope()
        ...
      891e8abe
    • Kan Liang's avatar
      perf: Fix topology_sibling_cpumask check warning on ARM · 673a5009
      Kan Liang authored
      The below warning is triggered when building with arm
      multi_v7_defconfig.
      
        kernel/events/core.c: In function 'perf_event_setup_cpumask':
        kernel/events/core.c:14012:13: warning: the comparison will always evaluate as 'true' for the address of 'thread_sibling' will never be NULL [-Waddress]
        14012 |         if (!topology_sibling_cpumask(cpu)) {
      
      The perf_event_init_cpu() may be invoked at the early boot stage, while
      the topology_*_cpumask hasn't been initialized yet.  The check is to
      specially handle the case, and initialize the perf_online_<domain>_masks
      on the boot CPU.
      
      X86 uses a per-cpu cpumask pointer, which could be NULL at the early
      boot stage.  However, ARM uses a global variable, which can never be NULL.
      
      Use perf_online_mask as an indicator instead.  Only initialize the
      perf_online_<domain>_masks when perf_online_mask is empty.
      
      Fix a typo as well.
      
      Fixes: 4ba4f1af ("perf: Generic hotplug support for a PMU with a scope")
      Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
      Closes: https://lore.kernel.org/lkml/20240911153854.240bbc1f@canb.auug.org.au/
      Reported-by: Steven Price <steven.price@arm.com>
      Closes: https://lore.kernel.org/lkml/1835eb6d-3e05-47f3-9eae-507ce165c3bf@arm.com/
      Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
      Tested-by: Steven Price <steven.price@arm.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      673a5009
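      The check described in the commit above can be sketched in userspace C
      as follows. This is a toy model under stated assumptions, not the code
      in kernel/events/core.c: cpumask_t is reduced to a 64-bit bitmap,
      struct perf_masks and setup_cpumask() are hypothetical names, and only
      one domain mask is modelled. It shows the idea of using an empty online
      mask, rather than a pointer comparison, to detect the early-boot case.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy model: one bit per CPU, at most 64 CPUs. Not the kernel's cpumask_t. */
typedef uint64_t cpumask_t;

/* Hypothetical container for the masks discussed in the commit message. */
struct perf_masks {
	cpumask_t online;	/* models perf_online_mask */
	cpumask_t core;		/* models one perf_online_<domain>_mask */
};

static bool mask_empty(cpumask_t m)
{
	return m == 0;
}

/*
 * Bring 'cpu' online. 'siblings' is the topology mask for the CPU's domain.
 * Returns true when the early-boot path was taken.
 */
static bool setup_cpumask(struct perf_masks *pm, unsigned int cpu,
			  cpumask_t siblings)
{
	bool early = mask_empty(pm->online);

	if (early) {
		/* Nothing is online yet: seed the domain mask with the boot CPU. */
		pm->core |= 1ULL << cpu;
	} else if (!(pm->core & siblings)) {
		/* No sibling represents this domain yet, so this CPU does. */
		pm->core |= 1ULL << cpu;
	}

	pm->online |= 1ULL << cpu;
	return early;
}
```

      The emptiness test behaves the same whether the architecture stores
      its topology masks behind a pointer or in a global variable, which is
      why it avoids the -Waddress warning quoted above.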
  3. 21 Sep, 2024 26 commits
    • Linus Torvalds's avatar
      Merge tag 'sched_ext-for-6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext · 88264981
      Linus Torvalds authored
      Pull sched_ext support from Tejun Heo:
       "This implements a new scheduler class called ‘ext_sched_class’, or
        sched_ext, which allows scheduling policies to be implemented as BPF
        programs.
      
        The goals of this are:
      
         - Ease of experimentation and exploration: Enabling rapid iteration
           of new scheduling policies.
      
         - Customization: Building application-specific schedulers which
           implement policies that are not applicable to general-purpose
           schedulers.
      
         - Rapid scheduler deployments: Non-disruptive swap outs of scheduling
           policies in production environments"
      
      See individual commits for more documentation, but also the cover letter
      for the latest series:
      
      Link: https://lore.kernel.org/all/20240618212056.2833381-1-tj@kernel.org/
      
      * tag 'sched_ext-for-6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext: (110 commits)
        sched: Move update_other_load_avgs() to kernel/sched/pelt.c
        sched_ext: Don't trigger ops.quiescent/runnable() on migrations
        sched_ext: Synchronize bypass state changes with rq lock
        scx_qmap: Implement highpri boosting
        sched_ext: Implement scx_bpf_dispatch[_vtime]_from_dsq()
        sched_ext: Compact struct bpf_iter_scx_dsq_kern
        sched_ext: Replace consume_local_task() with move_local_task_to_local_dsq()
        sched_ext: Move consume_local_task() upward
        sched_ext: Move sanity check and dsq_mod_nr() into task_unlink_from_dsq()
        sched_ext: Reorder args for consume_local/remote_task()
        sched_ext: Restructure dispatch_to_local_dsq()
        sched_ext: Fix processs_ddsp_deferred_locals() by unifying DTL_INVALID handling
        sched_ext: Make find_dsq_for_dispatch() handle SCX_DSQ_LOCAL_ON
        sched_ext: Refactor consume_remote_task()
        sched_ext: Rename scx_kfunc_set_sleepable to unlocked and relocate
        sched_ext: Add missing static to scx_dump_data
        sched_ext: Add missing static to scx_has_op[]
        sched_ext: Temporarily work around pick_task_scx() being called without balance_scx()
        sched_ext: Add a cgroup scheduler which uses flattened hierarchy
        sched_ext: Add cgroup support
        ...
      88264981
    • Linus Torvalds's avatar
      Merge tag 'bpf-next-6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next · 440b6523
      Linus Torvalds authored
      Pull bpf updates from Alexei Starovoitov:
      
       - Introduce '__attribute__((bpf_fastcall))' for helpers and kfuncs with
         corresponding support in LLVM.
      
         It is similar to the existing 'no_caller_saved_registers' attribute in
         GCC/LLVM with a provision for backward compatibility. It allows
         compilers to generate more efficient BPF code assuming the verifier or
         JITs will inline or partially inline a helper/kfunc with such an
         attribute. bpf_cast_to_kern_ctx, bpf_rdonly_cast,
         bpf_get_smp_processor_id are the first set of such helpers.
      
       - Harden and extend ELF build ID parsing logic.
      
         When called from sleepable context the relevant parts of the ELF file
         will be read to find and fetch .note.gnu.build-id information. Also
         harden the logic to avoid TOCTOU, overflow, out-of-bounds problems.
      
       - Improvements and fixes for sched-ext:
          - Allow passing BPF iterators as kfunc arguments
          - Make the pointer returned from iter_next method trusted
          - Fix x86 JIT convergence issue due to growing/shrinking conditional
            jumps in variable length encoding
      
       - BPF_LSM related:
          - Introduce a few VFS kfuncs and consolidate them in
            fs/bpf_fs_kfuncs.c
          - Enforce correct range of return values from certain LSM hooks
          - Disallow attaching to other LSM hooks
      
       - Prerequisite work for upcoming Qdisc in BPF:
          - Allow kptrs in program provided structs
          - Support for gen_epilogue in verifier_ops
      
       - Important fixes:
          - Fix uprobe multi pid filter check
          - Fix bpf_strtol and bpf_strtoul helpers
          - Track equal scalars history on per-instruction level
          - Fix tailcall hierarchy on x86 and arm64
          - Fix signed division overflow to prevent INT_MIN/-1 trap on x86
          - Fix get kernel stack in BPF progs attached to tracepoint:syscall
      
       - Selftests:
          - Add uprobe bench/stress tool
          - Generate file dependencies to drastically improve re-build time
          - Match JIT-ed and BPF asm with __xlated/__jited keywords
          - Convert older tests to test_progs framework
          - Add support for RISC-V
          - A few fixes when BPF programs are compiled with GCC-BPF backend
            (support for GCC-BPF in BPF CI is ongoing in parallel)
          - Add traffic monitor
          - Enable cross compile and musl libc
      
      * tag 'bpf-next-6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (260 commits)
        btf: require pahole 1.21+ for DEBUG_INFO_BTF with default DWARF version
        btf: move pahole check in scripts/link-vmlinux.sh to lib/Kconfig.debug
        btf: remove redundant CONFIG_BPF test in scripts/link-vmlinux.sh
        bpf: Call the missed kfree() when there is no special field in btf
        bpf: Call the missed btf_record_free() when map creation fails
        selftests/bpf: Add a test case to write mtu result into .rodata
        selftests/bpf: Add a test case to write strtol result into .rodata
        selftests/bpf: Rename ARG_PTR_TO_LONG test description
        selftests/bpf: Fix ARG_PTR_TO_LONG {half-,}uninitialized test
        bpf: Zero former ARG_PTR_TO_{LONG,INT} args in case of error
        bpf: Improve check_raw_mode_ok test for MEM_UNINIT-tagged types
        bpf: Fix helper writes to read-only maps
        bpf: Remove truncation test in bpf_strtol and bpf_strtoul helpers
        bpf: Fix bpf_strtol and bpf_strtoul helpers for 32bit
        selftests/bpf: Add tests for sdiv/smod overflow cases
        bpf: Fix a sdiv overflow issue
        libbpf: Add bpf_object__token_fd accessor
        docs/bpf: Add missing BPF program types to docs
        docs/bpf: Add constant values for linkages
        bpf: Use fake pt_regs when doing bpf syscall tracepoint tracing
        ...
      440b6523
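      The "INT_MIN/-1 trap" item in the list above refers to signed division
      overflow: on x86 the idiv instruction faults when the quotient does not
      fit. A minimal userspace sketch of a guarded division follows. The
      function names are hypothetical, and the results chosen for the special
      cases (minimum value divided by -1 yields the minimum value, remainder
      0; division by zero yields 0, modulo by zero yields the dividend) are
      stated here as assumptions about BPF semantics rather than taken from
      the merge message.

```c
#include <stdint.h>

/* Hypothetical helper: signed 64-bit division that never traps. */
static int64_t safe_sdiv64(int64_t a, int64_t b)
{
	if (b == 0)
		return 0;	/* assumed: division by zero yields 0 */
	if (b == -1)
		/*
		 * Negate in unsigned arithmetic so INT64_MIN does not overflow.
		 * Converting back wraps on common compilers (GCC, Clang), so
		 * INT64_MIN / -1 stays INT64_MIN.
		 */
		return (int64_t)(0 - (uint64_t)a);
	return a / b;
}

/* Hypothetical helper: signed 64-bit modulo that never traps. */
static int64_t safe_smod64(int64_t a, int64_t b)
{
	if (b == 0)
		return a;	/* assumed: modulo by zero yields the dividend */
	if (b == -1)
		return 0;	/* every integer is divisible by -1 */
	return a % b;
}
```

      Handling the divisor -1 before dividing means the hardware divide
      instruction is never reached with the one operand pair that overflows.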
    • Linus Torvalds's avatar
      Merge tag 's390-6.12-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux · 1ec6d097
      Linus Torvalds authored
      Pull s390 updates from Vasily Gorbik:
      
       - Optimize ftrace and kprobes code patching and avoid stop machine for
         kprobes if sequential instruction fetching facility is available
      
       - Add hiperdispatch feature to dynamically adjust CPU capacity in
         vertical polarization to improve scheduling efficiency and overall
         performance. Also add infrastructure for handling warning track
         interrupts (WTI), allowing for graceful CPU preemption
      
       - Rework crypto code pkey module and split it into separate,
         independent modules for sysfs, PCKMO, CCA, and EP11, allowing modules
         to load only when the relevant hardware is available
      
       - Add hardware acceleration for HMAC modes and the full AES-XTS cipher,
         utilizing message-security assist extensions (MSA) 10 and 11. It
         introduces new shash implementations for HMAC-SHA224/256/384/512 and
         registers the hardware-accelerated AES-XTS cipher as the preferred
         option. Also add clear key token support
      
       - Add MSA 10 and 11 processor activity instrumentation counters to perf
         and update PAI Extension 1 NNPA counters
      
       - Cleanup cpu sampling facility code and rework debug/WARN_ON_ONCE
         statements
      
       - Add support for SHA3 performance enhancements introduced with MSA 12
      
       - Add support for the query authentication information feature of MSA
         13 and introduce the KDSA CPACF instruction. Provide query and query
         authentication information in sysfs, enabling tools like cpacfinfo to
         present this data in a human-readable form
      
       - Update kernel disassembler instructions
      
       - Always enable EXPOLINE_EXTERN if supported by the compiler to ensure
         kpatch compatibility
      
       - Add missing warning handling and relocated lowcore support to the
         early program check handler
      
       - Optimize ftrace_return_address() and avoid calling unwinder
      
       - Make modules use kernel ftrace trampolines
      
       - Strip relocs from the final vmlinux ELF file to make it roughly 2
         times smaller
      
       - Dump register contents and call trace for early crashes to the
         console
      
       - Generate ptdump address marker array dynamically
      
       - Fix rcu_sched stalls that might occur when adding or removing large
         amounts of pages at once to or from the CMM balloon
      
       - Fix deadlock caused by recursive lock of the AP bus scan mutex
      
       - Unify sync and async register save areas in entry code
      
       - Cleanup debug prints in crypto code
      
       - Various cleanup and sanitizing patches for the decompressor
      
       - Various small ftrace cleanups
      
      * tag 's390-6.12-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (84 commits)
        s390/crypto: Display Query and Query Authentication Information in sysfs
        s390/crypto: Add Support for Query Authentication Information
        s390/crypto: Rework RRE and RRF CPACF inline functions
        s390/crypto: Add KDSA CPACF Instruction
        s390/disassembler: Remove duplicate instruction format RSY_RDRU
        s390/boot: Move boot_printk() code to own file
        s390/boot: Use boot_printk() instead of sclp_early_printk()
        s390/boot: Rename decompressor_printk() to boot_printk()
        s390/boot: Compile all files with the same march flag
        s390: Use MARCH_HAS_*_FEATURES defines
        s390: Provide MARCH_HAS_*_FEATURES defines
        s390/facility: Disable compile time optimization for decompressor code
        s390/boot: Increase minimum architecture to z10
        s390/als: Remove obsolete comment
        s390/sha3: Fix SHA3 selftests failures
        s390/pkey: Add AES xts and HMAC clear key token support
        s390/cpacf: Add MSA 10 and 11 new PCKMO functions
        s390/mm: Add cond_resched() to cmm_alloc/free_pages()
        s390/pai_ext: Update PAI extension 1 counters
        s390/pai_crypto: Add support for MSA 10 and 11 pai counters
        ...
      1ec6d097
    • Diogo Jahchan Koike's avatar
      bcachefs: return err ptr instead of null in read sb clean · 025c55a4
      Diogo Jahchan Koike authored
      syzbot reported a null-ptr-deref in bch2_fs_start. [0]
      
      When a sb is marked clean but doesn't have a clean section,
      bch2_read_superblock_clean returns NULL, which PTR_ERR_OR_ZERO
      lets through, eventually leading to a null ptr dereference down
      the line. Adjust read sb clean to return an ERR_PTR indicating the
      invalid clean section.
      
      [0] https://syzkaller.appspot.com/bug?extid=1cecc37d87c4286e5543
      
      Reported-by: syzbot+1cecc37d87c4286e5543@syzkaller.appspotmail.com
      Closes: https://syzkaller.appspot.com/bug?extid=1cecc37d87c4286e5543
      Signed-off-by: Diogo Jahchan Koike <djahchankoike@gmail.com>
      Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
      025c55a4
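      Why a NULL return slips past PTR_ERR_OR_ZERO can be shown with a small
      userspace sketch. The macros below are simplified stand-ins for the
      kernel's <linux/err.h> helpers. struct clean_section,
      read_clean_buggy() and read_clean_fixed() are hypothetical names, and
      -EINVAL is an error code chosen for illustration, not necessarily the
      one the bcachefs fix returns.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified userspace stand-ins for the kernel's <linux/err.h> helpers. */
#define MAX_ERRNO 4095

static inline void *ERR_PTR(long error)
{
	return (void *)error;
}

static inline long PTR_ERR(const void *ptr)
{
	return (long)ptr;
}

static inline int IS_ERR(const void *ptr)
{
	/* Error pointers occupy the top MAX_ERRNO addresses; NULL does not. */
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

static inline int PTR_ERR_OR_ZERO(const void *ptr)
{
	return IS_ERR(ptr) ? (int)PTR_ERR(ptr) : 0;
}

struct clean_section {		/* hypothetical payload */
	int journal_seq;
};

/* Buggy shape: a missing section is reported as NULL. */
static struct clean_section *read_clean_buggy(struct clean_section *found)
{
	return found;
}

/* Fixed shape: a missing section is reported as an error pointer. */
static struct clean_section *read_clean_fixed(struct clean_section *found)
{
	if (!found)
		return ERR_PTR(-EINVAL);	/* illustrative error code */
	return found;
}
```

      With the buggy shape, PTR_ERR_OR_ZERO(read_clean_buggy(NULL)) evaluates
      to 0, so the caller treats the NULL as success and dereferences it
      later. With the fixed shape the same check yields a negative errno.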
    • Yang Li's avatar
      bcachefs: Remove duplicated include in backpointers.c · abb43dd6
      Yang Li authored
      The header file bbpos.h is included twice in backpointers.c,
      so one of the inclusions can be removed.
      Reported-by: Abaci Robot <abaci@linux.alibaba.com>
      Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=10783
      Signed-off-by: Yang Li <yang.lee@linux.alibaba.com>
      Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
      abb43dd6
    • Kent Overstreet's avatar
      bcachefs: bch2_ec_stripe_head_get() now checks for change in rw devices · 035d72f7
      Kent Overstreet authored
      This factors out ec_stripe_head_devs_update(), which initializes the
      bitmap of devices we're allocating from, and runs it every time
      c->rw_devs_change_count changes.
      
      We also cancel pending, not allocated stripes, since they may refer to
      devices that are no longer available.
      Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
      035d72f7
    • Kent Overstreet's avatar
      bcachefs: bch_fs.rw_devs_change_count · 83ccd9b3
      Kent Overstreet authored
      Add a counter that's incremented whenever rw devices change; this will
      be used for erasure coding so that it can keep ec_stripe_head in sync
      and not deadlock on a new stripe when a device it wants goes away.
      Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
      83ccd9b3
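      The counter described in the commit above follows a common
      change-counter pattern: the owner bumps a counter on every change, and
      each consumer remembers the value it last saw. A minimal sketch,
      assuming a 64-bit bitmap of read-write devices, might look like this.
      The struct layouts and the helpers fs_set_rw_devs() and
      stripe_head_sync() are hypothetical and are not the bcachefs
      definitions.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified filesystem state. */
struct fs {
	unsigned long rw_devs_change_count;	/* bumped on every rw device change */
	uint64_t rw_devs;			/* bitmap of read-write devices */
};

/* Hypothetical, simplified consumer that caches the device bitmap. */
struct stripe_head {
	unsigned long seen_change_count;	/* counter value at last sync */
	uint64_t devs;				/* cached copy of the rw bitmap */
};

/* Change the set of rw devices and record that a change happened. */
static void fs_set_rw_devs(struct fs *c, uint64_t devs)
{
	c->rw_devs = devs;
	c->rw_devs_change_count++;
}

/*
 * Refresh the cached bitmap if the device set changed since the last call.
 * Returns true when a refresh happened, so the caller can drop any pending
 * work that referred to the old device set.
 */
static bool stripe_head_sync(struct stripe_head *h, const struct fs *c)
{
	if (h->seen_change_count == c->rw_devs_change_count)
		return false;

	h->devs = c->rw_devs;
	h->seen_change_count = c->rw_devs_change_count;
	return true;
}
```

      Comparing two counters is cheaper than comparing device sets, and it
      also detects a change that was later reverted, in which case pending
      work may still refer to stale state.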
    • Kent Overstreet's avatar
      bcachefs: bch2_dev_remove_stripes() · ad8d1f77
      Kent Overstreet authored
      We can now correctly force-remove a device that has stripes on it; this
      uses the new BCH_SB_MEMBER_INVALID sentinel value.
      Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
      ad8d1f77
    • Kent Overstreet's avatar
      bcachefs: bch2_trigger_ptr() calculates sectors even when no device · 934137b0
      Kent Overstreet authored
      This is necessary for erasure coded pointers to devices that have been
      removed.
      Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
      934137b0
    • Kent Overstreet's avatar
      bcachefs: improve bch2_new_stripe_to_text() · c9cabfb2
      Kent Overstreet authored
      also print out the new stripe key
      Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
      c9cabfb2
    • Kent Overstreet's avatar
      bcachefs: ec_stripe_head.nr_created · a4b7a0c0
      Kent Overstreet authored
      additional debug stat
      Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
      a4b7a0c0
    • Kent Overstreet's avatar
      bcachefs: bch_stripe.disk_label · fa85c473
      Kent Overstreet authored
      When reshaping existing stripes, we should keep them on the same target
      that they were allocated on; to do this, we need to add a field to the
      btree stripe type.
      
      This is a tad awkward, because we only have 8 bits left, and targets are
      16 bits - but we only need to store a label, not a full target.
      Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
      fa85c473
    • Kent Overstreet's avatar
      bcachefs: stripe_to_mem() · 1b11c4d3
      Kent Overstreet authored
      factor out a common helper
      Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
      1b11c4d3
    • Kent Overstreet's avatar
      bcachefs: EIO errcode cleanup · 54a12984
      Kent Overstreet authored
      We want to be using private errcodes whenever possible, for better error
      messages.
      Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
      54a12984
    • Kent Overstreet's avatar
      bcachefs: Rework btree node pinning · 7a51608d
      Kent Overstreet authored
      In backpointers fsck, we do a sequential scan of one btree, and check
      references to another: extents <-> backpointers
      
      Checking references generates random lookups, so we want to pin that
      btree in memory (or only a range, if it doesn't fit in ram).
      
      Previously, this was done with a simple check in the shrinker - "if
      btree node is in range being pinned, don't free it" - but this generated
      OOMs, as our shrinker wasn't well behaved if there was less memory
      available than expected.
      
      Instead, we now have two different shrinkers and lru lists; the second
      shrinker being for pinned nodes, with seeks set much higher than normal
      - so they can still be freed if necessary, but we'll prefer not to.
      Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
      7a51608d
    • Kent Overstreet's avatar
      bcachefs: split up btree cache counters for live, freeable · 91ddd715
      Kent Overstreet authored
      this is prep for introducing a second live list and shrinker for pinned
      nodes
      Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
      91ddd715
    • Kent Overstreet's avatar
      bcachefs: btree cache counters should be size_t · 691f2cba
      Kent Overstreet authored
      32 bits won't overflow any time soon, but size_t is the correct type for
      counting objects in memory.
      Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
      691f2cba
    • Kent Overstreet's avatar
      bcachefs: bch2_dev_rcu_noerror() · 805ddc20
      Kent Overstreet authored
      bch2_dev_rcu() now properly errors if the device is invalid
      Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
      805ddc20
    • Kent Overstreet's avatar
      bcachefs: bch2_opts_to_text() · 3621ecc1
      Kent Overstreet authored
      Factor out bch2_show_options() into a generic helper, for debugging
      option passing issues.
      Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
      3621ecc1
    • Kent Overstreet's avatar