1. 10 Sep, 2024 1 commit
    • perf: Generic hotplug support for a PMU with a scope · 4ba4f1af
      Kan Liang authored
      The perf subsystem assumes that the counters of a PMU are per-CPU, so
      the user space tool reads a counter from each CPU in system-wide
      mode. However, many PMUs don't have per-CPU counters; a counter is
      effective for a scope, e.g., a die or a socket. To address this, the
      kernel driver exposes a cpumask that restricts access to one CPU, which
      stands for a specific scope. In case that CPU is removed, hotplug
      support has to be implemented in each such driver.
      
      The code to support the cpumask and hotplug is very similar across drivers:
      - Expose a cpumask in sysfs.
      - Pick another CPU in the same scope if the given CPU is removed.
      - Invoke perf_pmu_migrate_context() to migrate to the new CPU.
      - In event init, always set event->cpu to the CPU in the cpumask.
      
      Similar code is duplicated in each such PMU driver. It
      would be good to introduce a generic infrastructure to avoid such
      duplication.
      
      Five popular scopes are implemented here: core, die, cluster, pkg, and
      system-wide. The scope can be set when a PMU is registered. If so, a
      "cpumask" is automatically exposed for the PMU.
      
      The "cpumask" comes from perf_online_<scope>_mask, which tracks
      the active CPU for each scope. A CPU is set in the mask when it is the
      first CPU of its scope to come online via the generic perf hotplug support. When a
      corresponding CPU is removed, the perf_online_<scope>_mask is updated
      accordingly and the PMU will be moved to a new CPU from the same scope
      if possible.
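
The bookkeeping described above can be sketched in user space as follows. This is only an illustration under stated assumptions: the topology (8 CPUs, 4 per die), the arrays, and every function name here are hypothetical stand-ins, not the kernel's data structures. `die_mask` plays the role of a perf_online_<scope>_mask for the die scope.

```c
#include <stdbool.h>

#define NR_CPUS      8
#define CPUS_PER_DIE 4  /* assumed topology: CPUs 0-3 on die 0, CPUs 4-7 on die 1 */

static bool cpu_online[NR_CPUS];
static bool die_mask[NR_CPUS];  /* hypothetical stand-in for the per-scope online mask */

static int die_of(int cpu)
{
    return cpu / CPUS_PER_DIE;
}

/* Return the CPU that currently represents the die of @cpu, or -1 if none. */
static int die_owner(int cpu)
{
    for (int i = 0; i < NR_CPUS; i++)
        if (die_mask[i] && die_of(i) == die_of(cpu))
            return i;
    return -1;
}

/* A CPU comes online: the first CPU of a die becomes its representative. */
static void scope_cpu_online(int cpu)
{
    cpu_online[cpu] = true;
    if (die_owner(cpu) < 0)
        die_mask[cpu] = true;
}

/*
 * A CPU goes offline. Returns the CPU that represents the die afterwards,
 * or -1 if no online CPU is left in that die.
 */
static int scope_cpu_offline(int cpu)
{
    cpu_online[cpu] = false;
    if (!die_mask[cpu])
        return die_owner(cpu);  /* not the representative, nothing to move */

    die_mask[cpu] = false;
    for (int i = 0; i < NR_CPUS; i++) {
        if (cpu_online[i] && die_of(i) == die_of(cpu)) {
            /* In the kernel, the PMU context would be migrated at this point. */
            die_mask[i] = true;
            return i;
        }
    }
    return -1;
}
```

With all eight CPUs online, CPUs 0 and 4 represent their dies; taking CPU 0 offline hands die 0 over to CPU 1, while die 1 is unaffected.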
      Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lore.kernel.org/r/20240802151643.1691631-2-kan.liang@linux.intel.com
  2. 05 Sep, 2024 10 commits
    • uprobes: perform lockless SRCU-protected uprobes_tree lookup · cd7bdd9d
      Andrii Nakryiko authored
      Another big bottleneck to scalability is uprobes_treelock, which is taken in
      a very hot path in handle_swbp(). Now that uprobes are SRCU-protected,
      take advantage of that and make the uprobes_tree RB-tree lookup lockless.
      
      To make the RCU-protected lockless RB-tree lookup correct, we need to take
      into account that such a lookup can return false negatives if there
      are parallel RB-tree modifications (rotations) going on. We use a seqcount
      to detect whether the RB-tree changed, and if we find nothing while
      the RB-tree got modified in between, we just retry. If a uprobe was found, then
      it's guaranteed to be a correct lookup.
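
The retry rule can be illustrated with a small user-space sketch. Everything below is an assumption made for illustration: the sequence counter is built on C11 atomics rather than the kernel's seqcount API, and `demo_raw_lookup()` is a hypothetical stand-in for the raw tree walk that misses once while a simulated writer runs. The point it shows is that a hit is trusted immediately, while a miss is trusted only if the sequence did not change during the walk.

```c
#include <stdatomic.h>
#include <stddef.h>

struct node { int key; };

/* Even value: tree is stable. Odd value: a writer is modifying it. */
static atomic_uint tree_seq;

static unsigned read_begin(void)
{
    unsigned s;

    /* Wait until no writer is in progress. */
    while ((s = atomic_load_explicit(&tree_seq, memory_order_acquire)) & 1)
        ;
    return s;
}

static int read_retry(unsigned start)
{
    atomic_thread_fence(memory_order_acquire);
    return atomic_load_explicit(&tree_seq, memory_order_relaxed) != start;
}

static void write_begin(void) { atomic_fetch_add(&tree_seq, 1); }
static void write_end(void)   { atomic_fetch_add(&tree_seq, 1); }

typedef struct node *(*raw_lookup_fn)(int key);

/* Lockless lookup: retry only on "not found" with a changed sequence. */
static struct node *find_lockless(raw_lookup_fn raw_lookup, int key, int *attempts)
{
    struct node *n;
    unsigned seq;

    do {
        seq = read_begin();
        (*attempts)++;
        n = raw_lookup(key);
        if (n)
            return n;          /* a positive result is always correct */
    } while (read_retry(seq)); /* a negative result may be a false negative */

    return NULL;
}

/* Demo only: the first call misses while a simulated writer bumps the sequence. */
static struct node demo_node = { .key = 42 };
static int demo_calls;

static struct node *demo_raw_lookup(int key)
{
    if (demo_calls++ == 0) {
        write_begin();
        write_end();
        return NULL;
    }
    return key == demo_node.key ? &demo_node : NULL;
}
```

Looking up key 42 through `find_lockless(demo_raw_lookup, 42, &attempts)` takes two attempts: the first miss coincides with a sequence change and is retried, and the second walk finds the node.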
      
      With all the lock-avoiding changes done, we get a pretty decent
      improvement in performance and scalability of uprobes with number of
      CPUs, even though we are still nowhere near linear scalability. This is
      due to SRCU not really scaling very well with number of CPUs on
      a particular hardware that was used for testing (80-core Intel Xeon Gold
      6138 CPU @ 2.00GHz), but also due to the remaining mmap_lock, which is
      currently taken to resolve interrupt address to inode+offset and then
      uprobe instance. And, of course, uretprobes still need similar RCU to
      avoid refcount in the hot path, which will be addressed in the follow up
      patches.
      
      Nevertheless, the improvement is good. We used BPF selftest-based
      uprobe-nop and uretprobe-nop benchmarks to get the below numbers,
      varying number of CPUs on which uprobes and uretprobes are triggered.
      
      BASELINE
      ========
      uprobe-nop      ( 1 cpus):    3.032 ± 0.023M/s  (  3.032M/s/cpu)
      uprobe-nop      ( 2 cpus):    3.452 ± 0.005M/s  (  1.726M/s/cpu)
      uprobe-nop      ( 4 cpus):    3.663 ± 0.005M/s  (  0.916M/s/cpu)
      uprobe-nop      ( 8 cpus):    3.718 ± 0.038M/s  (  0.465M/s/cpu)
      uprobe-nop      (16 cpus):    3.344 ± 0.008M/s  (  0.209M/s/cpu)
      uprobe-nop      (32 cpus):    2.288 ± 0.021M/s  (  0.071M/s/cpu)
      uprobe-nop      (64 cpus):    3.205 ± 0.004M/s  (  0.050M/s/cpu)
      
      uretprobe-nop   ( 1 cpus):    1.979 ± 0.005M/s  (  1.979M/s/cpu)
      uretprobe-nop   ( 2 cpus):    2.361 ± 0.005M/s  (  1.180M/s/cpu)
      uretprobe-nop   ( 4 cpus):    2.309 ± 0.002M/s  (  0.577M/s/cpu)
      uretprobe-nop   ( 8 cpus):    2.253 ± 0.001M/s  (  0.282M/s/cpu)
      uretprobe-nop   (16 cpus):    2.007 ± 0.000M/s  (  0.125M/s/cpu)
      uretprobe-nop   (32 cpus):    1.624 ± 0.003M/s  (  0.051M/s/cpu)
      uretprobe-nop   (64 cpus):    2.149 ± 0.001M/s  (  0.034M/s/cpu)
      
      SRCU CHANGES
      ============
      uprobe-nop      ( 1 cpus):    3.276 ± 0.005M/s  (  3.276M/s/cpu)
      uprobe-nop      ( 2 cpus):    4.125 ± 0.002M/s  (  2.063M/s/cpu)
      uprobe-nop      ( 4 cpus):    7.713 ± 0.002M/s  (  1.928M/s/cpu)
      uprobe-nop      ( 8 cpus):    8.097 ± 0.006M/s  (  1.012M/s/cpu)
      uprobe-nop      (16 cpus):    6.501 ± 0.056M/s  (  0.406M/s/cpu)
      uprobe-nop      (32 cpus):    4.398 ± 0.084M/s  (  0.137M/s/cpu)
      uprobe-nop      (64 cpus):    6.452 ± 0.000M/s  (  0.101M/s/cpu)
      
      uretprobe-nop   ( 1 cpus):    2.055 ± 0.001M/s  (  2.055M/s/cpu)
      uretprobe-nop   ( 2 cpus):    2.677 ± 0.000M/s  (  1.339M/s/cpu)
      uretprobe-nop   ( 4 cpus):    4.561 ± 0.003M/s  (  1.140M/s/cpu)
      uretprobe-nop   ( 8 cpus):    5.291 ± 0.002M/s  (  0.661M/s/cpu)
      uretprobe-nop   (16 cpus):    5.065 ± 0.019M/s  (  0.317M/s/cpu)
      uretprobe-nop   (32 cpus):    3.622 ± 0.003M/s  (  0.113M/s/cpu)
      uretprobe-nop   (64 cpus):    3.723 ± 0.002M/s  (  0.058M/s/cpu)
      
      Peak throughput increased from 3.7 mln/s (uprobe triggerings) up to about
      8 mln/s. For uretprobes it's a bit more modest, with a bump from 2.4 mln/s
      to 5 mln/s.
      Suggested-by: "Peter Zijlstra (Intel)" <peterz@infradead.org>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Oleg Nesterov <oleg@redhat.com>
      Link: https://lore.kernel.org/r/20240903174603.3554182-8-andrii@kernel.org
    • rbtree: provide rb_find_rcu() / rb_find_add_rcu() · 50a38035
      Peter Zijlstra authored
      Much like latch_tree, add two RCU methods for the regular RB-tree,
      which can be used in conjunction with a seqcount to provide lockless
      lookups.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: "Peter Zijlstra (Intel)" <peterz@infradead.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Oleg Nesterov <oleg@redhat.com>
      Reviewed-by: "Masami Hiramatsu (Google)" <mhiramat@kernel.org>
      Link: https://lore.kernel.org/r/20240903174603.3554182-7-andrii@kernel.org
    • perf/uprobe: split uprobe_unregister() · 04b01625
      Peter Zijlstra authored
      With uprobe_unregister() having grown a synchronize_srcu(), it becomes
      fairly slow to call, especially since both users of this API call it in a
      loop.
      
      Peel off the sync_srcu() and do it once, after the loop.
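
The effect of the split can be sketched with a toy model. The `demo_*` functions and counters below are hypothetical and only count how many grace-period waits each calling pattern would incur; they are not the kernel API.

```c
/* Counts simulated grace-period waits (each one is slow in practice). */
static int grace_periods_waited;
static int consumers_attached;

/* Hypothetical: detach one consumer without waiting for readers. */
static void demo_unregister_nosync(void)
{
    consumers_attached--;
}

/* Hypothetical: wait until all in-flight readers are done. */
static void demo_unregister_sync(void)
{
    grace_periods_waited++;
}

/* Old pattern: every unregister call waits for a grace period. */
static void detach_all_old(int n)
{
    for (int i = 0; i < n; i++) {
        demo_unregister_nosync();
        demo_unregister_sync();
    }
}

/* New pattern: detach everything first, then wait once after the loop. */
static void detach_all_new(int n)
{
    for (int i = 0; i < n; i++)
        demo_unregister_nosync();
    if (n > 0)
        demo_unregister_sync();
}
```

For eight consumers, the old pattern waits eight times and the new pattern waits once, which is the saving the commit is after.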
      
      We also need to add uprobe_unregister_sync() into uprobe_register()'s
      error handling path, as we need to be careful about returning to the
      caller before we have a guarantee that a partially attached consumer won't
      be called anymore. This is an unlikely slow path, and it is
      totally fine for it to be slow in the case of a failed attach.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: "Peter Zijlstra (Intel)" <peterz@infradead.org>
      Co-developed-by: Andrii Nakryiko <andrii@kernel.org>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Oleg Nesterov <oleg@redhat.com>
      Link: https://lore.kernel.org/r/20240903174603.3554182-6-andrii@kernel.org
    • uprobes: traverse uprobe's consumer list locklessly under SRCU protection · cc01bd04
      Andrii Nakryiko authored
      uprobe->register_rwsem is one of a few big bottlenecks to scalability of
      uprobes, so we need to get rid of it to improve uprobe performance and
      multi-CPU scalability.
      
      First, we turn uprobe's consumer list into a typical doubly-linked list
      and utilize existing RCU-aware helpers for traversing such lists, as
      well as adding and removing elements from it.
      
      For entry uprobes we already have SRCU protection active since before
      uprobe lookup. For uretprobe we keep refcount, guaranteeing that uprobe
      won't go away from under us, but we add SRCU protection around consumer
      list traversal.
      
      Lastly, to keep handler_chain()'s UPROBE_HANDLER_REMOVE handling simple,
      we remember whether any removal was requested during handler calls, but
      then we double-check the decision under a proper register_rwsem using
      consumers' filter callbacks. Handler removal is very rare, so this extra
      lock won't hurt performance, overall, but we also avoid the need for any
      extra protection (e.g., seqcount locks).
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Oleg Nesterov <oleg@redhat.com>
      Link: https://lore.kernel.org/r/20240903174603.3554182-5-andrii@kernel.org
    • uprobes: get rid of enum uprobe_filter_ctx in uprobe filter callbacks · 59da880a
      Andrii Nakryiko authored
      It serves no purpose beyond adding an unnecessary argument to the
      filter callback. Just get rid of it; no one is actually using it.
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Oleg Nesterov <oleg@redhat.com>
      Link: https://lore.kernel.org/r/20240903174603.3554182-4-andrii@kernel.org
    • uprobes: protect uprobe lifetime with SRCU · 8617408f
      Andrii Nakryiko authored
      To avoid unnecessarily taking a (brief) refcount on uprobe during
      breakpoint handling in handle_swbp for entry uprobes, make find_uprobe()
      not take refcount, but protect the lifetime of a uprobe instance with
      RCU. This improves scalability, as refcount gets quite expensive due to
      cache line bouncing between multiple CPUs.
      
      Specifically, we utilize our own uprobe-specific SRCU instance for this
      RCU protection. put_uprobe() will delay actual kfree() using call_srcu().
      
      For now, uretprobe and single-stepping handling will still acquire
      refcount as necessary. We'll address these issues in follow up patches
      by making them use SRCU with timeout.
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Oleg Nesterov <oleg@redhat.com>
      Link: https://lore.kernel.org/r/20240903174603.3554182-3-andrii@kernel.org
    • uprobes: revamp uprobe refcounting and lifetime management · 3f7f1a64
      Andrii Nakryiko authored
      Revamp how struct uprobe is refcounted, and thus how its lifetime is
      managed.
      
      Right now, there are a few possible "owners" of uprobe refcount:
        - uprobes_tree RB tree assumes one refcount when uprobe is registered
          and added to the lookup tree;
        - while uprobe is triggered and kernel is handling it in the breakpoint
          handler code, temporary refcount bump is done to keep uprobe from
          being freed;
        - if we have uretprobe requested on a given struct uprobe instance, we
          take another refcount to keep uprobe alive until user space code
          returns from the function and triggers return handler.
      
      The uprobes_tree's extra refcount of 1 is confusing and problematic. No
      matter how many actual consumers are attached, they all share the same
      refcount, and we have extra logic to drop the "last" (which might not
      really be last) refcount once uprobe's consumer list becomes empty.
      
      This is unconventional and has to be kept in mind as a special case all
      the time. Further, because of this design we have situations where
      find_uprobe() will find uprobe, bump refcount, return it to the caller,
      but that uprobe will still need a uprobe_is_active() check, after which
      the caller is required to drop refcount and try again. This leaks too
      many details to the higher level logic.
      
      This patch changes the refcounting scheme so that uprobes_tree no longer
      keeps an extra refcount for struct uprobe. Instead, each
      uprobe_consumer assumes its own refcount, which will be dropped
      when the consumer is unregistered. Other than that, all the active users of
      uprobe (entry and return uprobe handling code) keep exactly the same
      refcounting approach.
      
      With the above setup, once uprobe's refcount drops to zero, we need to
      make sure that uprobe's "destructor" removes uprobe from uprobes_tree,
      of course. This, though, races with uprobe entry handling code in
      handle_swbp(), which, through find_active_uprobe()->find_uprobe() lookup,
      can race with uprobe being destroyed after refcount drops to zero (e.g.,
      due to uprobe_consumer unregistering). So we add try_get_uprobe(), which
      will attempt to bump refcount, unless it already is zero. Caller needs
      to guarantee that uprobe instance won't be freed in parallel, which is
      the case while we keep uprobes_treelock (for read or write, doesn't
      matter).
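
The "bump refcount unless it is already zero" operation can be sketched with C11 atomics. This is an illustrative user-space version with hypothetical names (`struct obj`, `try_get()`, `put()`), not the kernel's refcount implementation. As in the text above, the caller must still guarantee that the object's memory is not freed while `try_get()` runs.

```c
#include <stdatomic.h>
#include <stdbool.h>

struct obj {
    atomic_int ref;
};

/* Take a reference only if the count has not already dropped to zero. */
static bool try_get(struct obj *o)
{
    int old = atomic_load(&o->ref);

    while (old != 0) {
        /* On failure, 'old' is reloaded with the current value. */
        if (atomic_compare_exchange_weak(&o->ref, &old, old + 1))
            return true;
    }
    return false; /* object is already on its way to destruction */
}

/* Drop a reference; returns true if this was the last one. */
static bool put(struct obj *o)
{
    return atomic_fetch_sub(&o->ref, 1) == 1;
}
```

Once `put()` has reported the last reference gone, any later `try_get()` fails instead of resurrecting the object, which is what lets a lookup race safely with destruction.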
      
      Note also, we now don't leak the race between registration and
      unregistration, so we remove the retry logic completely. If
      find_uprobe() returns valid uprobe, it's guaranteed to remain in
      uprobes_tree with properly incremented refcount. The race is handled
      inside __insert_uprobe() and put_uprobe() working together:
      __insert_uprobe() will remove uprobe from RB-tree, if it can't bump
      refcount and will retry to insert the new uprobe instance. put_uprobe()
      won't attempt to remove uprobe from RB-tree, if it's already not there.
      All that is protected by uprobes_treelock, which keeps things simple.
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Oleg Nesterov <oleg@redhat.com>
      Link: https://lore.kernel.org/r/20240903174603.3554182-2-andrii@kernel.org
    • bpf: Fix use-after-free in bpf_uprobe_multi_link_attach() · 5fe6e308
      Oleg Nesterov authored
      If bpf_link_prime() fails, bpf_uprobe_multi_link_attach() goes to the
      error_free label and frees the array of bpf_uprobe's without calling
      bpf_uprobe_unregister().
      
      This leaks bpf_uprobe->uprobe and worse, this frees bpf_uprobe->consumer
      without removing it from the uprobe->consumers list.
      
      Fixes: 89ae89f5 ("bpf: Add multi uprobe link")
      Closes: https://lore.kernel.org/all/000000000000382d39061f59f2dd@google.com/
      Reported-by: syzbot+f7a1c2c2711e4a780f19@syzkaller.appspotmail.com
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Andrii Nakryiko <andrii@kernel.org>
      Acked-by: Jiri Olsa <jolsa@kernel.org>
      Tested-by: syzbot+f7a1c2c2711e4a780f19@syzkaller.appspotmail.com
      Cc: stable@vger.kernel.org
      Link: https://lore.kernel.org/r/20240813152524.GA7292@redhat.com
    • perf/core: Fix small negative period being ignored · 62c0b106
      Luo Gengkun authored
      In perf_adjust_period, we will first calculate period, and then use
      this period to calculate delta. However, when delta is less than 0,
      there will be a deviation compared to when delta is greater than or
      equal to 0. For example, when delta is in the range of [-14,-1], the
      range of delta = delta + 7 is between [-7,6], so the final value of
      delta/8 is 0. Therefore, the impact of -1 and -2 will be ignored.
      This is unacceptable when the target period is very short, because
      we will lose a lot of samples.
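
The rounding asymmetry comes from C integer division truncating toward zero. The sketch below reproduces the arithmetic described above and shows one symmetric-rounding variant. The function names are hypothetical, and the symmetric form is an assumption about how such a fix could look, not a quote of the patch.

```c
#include <stdint.h>

/* Rounding as described above: add 7, then divide by 8 (truncates toward 0). */
static int64_t filter_old(int64_t delta)
{
    return (delta + 7) / 8;
}

/* One symmetric variant (an assumption): round away from zero on both sides. */
static int64_t filter_symmetric(int64_t delta)
{
    if (delta >= 0)
        delta += 7;
    else
        delta -= 7;
    return delta / 8;
}
```

With `filter_old()`, every delta in [-14, -1] maps to 0, so small negative corrections are dropped, whereas a delta of +1 still yields 1. With `filter_symmetric()`, a delta of -1 yields -1, mirroring the positive side.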
      
      Here are some tests and analyses:
      before:
        # perf record -e cs -F 1000  ./a.out
        [ perf record: Woken up 1 times to write data ]
        [ perf record: Captured and wrote 0.022 MB perf.data (518 samples) ]
      
        # perf script
        ...
        a.out     396   257.956048:         23 cs:  ffffffff81f4eeec schedul>
        a.out     396   257.957891:         23 cs:  ffffffff81f4eeec schedul>
        a.out     396   257.959730:         23 cs:  ffffffff81f4eeec schedul>
        a.out     396   257.961545:         23 cs:  ffffffff81f4eeec schedul>
        a.out     396   257.963355:         23 cs:  ffffffff81f4eeec schedul>
        a.out     396   257.965163:         23 cs:  ffffffff81f4eeec schedul>
        a.out     396   257.966973:         23 cs:  ffffffff81f4eeec schedul>
        a.out     396   257.968785:         23 cs:  ffffffff81f4eeec schedul>
        a.out     396   257.970593:         23 cs:  ffffffff81f4eeec schedul>
        ...
      
      after:
        # perf record -e cs -F 1000  ./a.out
        [ perf record: Woken up 1 times to write data ]
        [ perf record: Captured and wrote 0.058 MB perf.data (1466 samples) ]
      
        # perf script
        ...
        a.out     395    59.338813:         11 cs:  ffffffff81f4eeec schedul>
        a.out     395    59.339707:         12 cs:  ffffffff81f4eeec schedul>
        a.out     395    59.340682:         13 cs:  ffffffff81f4eeec schedul>
        a.out     395    59.341751:         13 cs:  ffffffff81f4eeec schedul>
        a.out     395    59.342799:         12 cs:  ffffffff81f4eeec schedul>
        a.out     395    59.343765:         11 cs:  ffffffff81f4eeec schedul>
        a.out     395    59.344651:         11 cs:  ffffffff81f4eeec schedul>
        a.out     395    59.345539:         12 cs:  ffffffff81f4eeec schedul>
        a.out     395    59.346502:         13 cs:  ffffffff81f4eeec schedul>
        ...
      
      test.c
      
      #include <unistd.h>   /* usleep() */
      
      int main() {
              for (int i = 0; i < 20000; i++)
                      usleep(10);
      
              return 0;
      }
      
        # time ./a.out
        real    0m1.583s
        user    0m0.040s
        sys     0m0.298s
      
      The above results were tested on x86-64 qemu with KVM enabled using
      test.c as test program. Ideally, we should have around 1500 samples,
      but the previous algorithm had only about 500, whereas the modified
      algorithm now has about 1400. Furthermore, the new version shows 1
      sample per 0.001s, while the previous one is 1 sample per 0.002s. This
      indicates that the new algorithm is more sensitive to small negative
      values compared to the old algorithm.
      
      Fixes: bd2b5b12 ("perf_counter: More aggressive frequency adjustment")
      Signed-off-by: Luo Gengkun <luogengkun@huaweicloud.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Adrian Hunter <adrian.hunter@intel.com>
      Reviewed-by: Kan Liang <kan.liang@linux.intel.com>
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/20240831074316.2106159-2-luogengkun@huaweicloud.com
    • Merge branch 'perf/urgent' into perf/core, to pick up fixes · 95c13662
      Ingo Molnar authored
      This also refreshes the -rc1 based branch to -rc5.
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  3. 04 Sep, 2024 1 commit
  4. 03 Sep, 2024 1 commit
  5. 25 Aug, 2024 6 commits
    • perf/x86/intel: Limit the period on Haswell · 25dfc9e3
      Kan Liang authored
      Running the ltp test cve-2015-3290 concurrently reports the following
      warnings.
      
      perfevents: irq loop stuck!
        WARNING: CPU: 31 PID: 32438 at arch/x86/events/intel/core.c:3174
        intel_pmu_handle_irq+0x285/0x370
        Call Trace:
         <NMI>
         ? __warn+0xa4/0x220
         ? intel_pmu_handle_irq+0x285/0x370
         ? __report_bug+0x123/0x130
         ? intel_pmu_handle_irq+0x285/0x370
         ? __report_bug+0x123/0x130
         ? intel_pmu_handle_irq+0x285/0x370
         ? report_bug+0x3e/0xa0
         ? handle_bug+0x3c/0x70
         ? exc_invalid_op+0x18/0x50
         ? asm_exc_invalid_op+0x1a/0x20
         ? irq_work_claim+0x1e/0x40
         ? intel_pmu_handle_irq+0x285/0x370
         perf_event_nmi_handler+0x3d/0x60
         nmi_handle+0x104/0x330
      
      Thanks to Thomas Gleixner's analysis, the issue is caused by the low
      initial period (1) of the frequency estimation algorithm, which triggers
      defects in the hardware, specifically errata HSW11 and HSW143. (For
      details, please refer to https://lore.kernel.org/lkml/87plq9l5d2.ffs@tglx/)
      
      HSW11 requires a period larger than 100 for the INST_RETIRED.ALL
      event, but the initial period in freq mode is 1. The erratum is the
      same as BDM11, which is already handled in the kernel. A minimum
      period of 128 is enforced on HSW as well.
      
      HSW143 states that fixed counter 1 may overcount by 32 when
      Hyper-Threading is enabled. However, based on testing, the hardware
      has more issues than the erratum describes. Besides fixed counter 1, the
      message 'interrupt took too long' can be observed on any counter which was
      armed with a period < 32 and two events expired in the same NMI. A minimum
      period of 32 is enforced for the rest of the events.
      The recommended workaround code for HSW143 is not implemented,
      because it only addresses the issue for the fixed counter and brings
      extra overhead through extra MSR writes. No related overcounting issue
      has been reported so far.
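
The two minimum periods described above amount to a simple clamp. The sketch below is a hypothetical user-space illustration: the function name, the constants' names, and the boolean flag that identifies INST_RETIRED.ALL are all assumptions, and the real driver identifies the event from its hardware configuration rather than from a flag.

```c
#include <stdbool.h>
#include <stdint.h>

#define DEMO_HSW_MIN_PERIOD_INST_RETIRED_ALL 128  /* HSW11, same limit as BDM11 */
#define DEMO_HSW_MIN_PERIOD_OTHER             32  /* all remaining events */

/* Raise a too-small sampling period to the minimum for the event class. */
static int64_t demo_hsw_limit_period(bool is_inst_retired_all, int64_t period)
{
    int64_t min = is_inst_retired_all ? DEMO_HSW_MIN_PERIOD_INST_RETIRED_ALL
                                      : DEMO_HSW_MIN_PERIOD_OTHER;

    return period < min ? min : period;
}
```

An initial freq-mode period of 1 therefore becomes 128 for INST_RETIRED.ALL and 32 for any other event, while periods already above the minimum pass through unchanged.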
      
      Fixes: 3a632cb2 ("perf/x86/intel: Add simple Haswell PMU support")
      Reported-by: Li Huafei <lihuafei1@huawei.com>
      Suggested-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: stable@vger.kernel.org
      Link: https://lore.kernel.org/all/20240819183004.3132920-1-kan.liang@linux.intel.com
      Closes: https://lore.kernel.org/lkml/20240729223328.327835-1-lihuafei1@huawei.com/
    • Linux 6.11-rc5 · 5be63fc1
      Linus Torvalds authored
    • Merge tag 'bcachefs-2024-08-24' of git://evilpiepirate.org/bcachefs · 72bea05c
      Linus Torvalds authored
      Pull bcachefs fixes from Kent Overstreet:
      
       - assorted syzbot fixes
      
       - some upgrade fixes for old (pre 1.0) filesystems
      
       - fix for moving data off a device that was switched to durability=0
         after data had been written to it.
      
       - nocow deadlock fix
      
       - fix for new rebalance_work accounting
      
      * tag 'bcachefs-2024-08-24' of git://evilpiepirate.org/bcachefs: (28 commits)
        bcachefs: Fix rebalance_work accounting
        bcachefs: Fix failure to flush moves before sleeping in copygc
        bcachefs: don't use rht_bucket() in btree_key_cache_scan()
        bcachefs: add missing inode_walker_exit()
        bcachefs: clear path->should_be_locked in bch2_btree_key_cache_drop()
        bcachefs: Fix double assignment in check_dirent_to_subvol()
        bcachefs: Fix refcounting in discard path
        bcachefs: Fix compat issue with old alloc_v4 keys
        bcachefs: Fix warning in bch2_fs_journal_stop()
        fs/super.c: improve get_tree() error message
        bcachefs: Fix missing validation in bch2_sb_journal_v2_validate()
        bcachefs: Fix replay_now_at() assert
        bcachefs: Fix locking in bch2_ioc_setlabel()
        bcachefs: fix failure to relock in btree_node_fill()
        bcachefs: fix failure to relock in bch2_btree_node_mem_alloc()
        bcachefs: unlock_long() before resort in journal replay
        bcachefs: fix missing bch2_err_str()
        bcachefs: fix time_stats_to_text()
        bcachefs: Fix bch2_bucket_gens_init()
        bcachefs: Fix bch2_trigger_alloc assert
        ...
    • Merge tag '6.11-rc5-server-fixes' of git://git.samba.org/ksmbd · 780bdc1b
      Linus Torvalds authored
      Pull smb server fixes from Steve French:
      
       - query directory flex array fix
      
       - fix potential null ptr reference in open
      
       - fix error message in some open cases
      
       - two minor cleanups
      
      * tag '6.11-rc5-server-fixes' of git://git.samba.org/ksmbd:
        smb/server: update misguided comment of smb2_allocate_rsp_buf()
        smb/server: remove useless assignment of 'file_present' in smb2_open()
        smb/server: fix potential null-ptr-deref of lease_ctx_info in smb2_open()
        smb/server: fix return value of smb2_open()
        ksmbd: the buffer of smb2 query dir response has at least 1 byte
    • Merge tag 's390-6.11-4' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux · 48fb4b3d
      Linus Torvalds authored
      Pull s390 fixes from Vasily Gorbik:
      
       - Fix KASLR base offset to account for symbol offsets in the vmlinux
         ELF file, preventing tool breakages like the drgn debugger
      
       - Fix potential memory corruption of physmem_info during kernel
         physical address randomization
      
       - Fix potential memory corruption due to overlap between the relocated
         lowcore and identity mapping by correctly reserving lowcore memory
      
       - Fix performance regression and avoid randomizing identity mapping
         base by default
      
       - Fix unnecessary delay of AP bus binding complete uevent to prevent
         startup lag in KVM guests using AP
      
      * tag 's390-6.11-4' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
        s390/boot: Fix KASLR base offset off by __START_KERNEL bytes
        s390/boot: Avoid possible physmem_info segment corruption
        s390/ap: Refine AP bus bindings complete processing
        s390/mm: Pin identity mapping base to zero
        s390/mm: Prevent lowcore vs identity mapping overlap
    • Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi · 891e811a
      Linus Torvalds authored
      Pull SCSI fixes from James Bottomley:
       "The important core fix is another tweak to our discard discovery
        issues. The off by 512 in logical block count seems bad, but in fact
        the inline was only ever used in debug prints, which is why no-one
        noticed"
      
      * tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
        scsi: sd: Do not attempt to configure discard unless LBPME is set
        scsi: MAINTAINERS: Add header files to SCSI SUBSYSTEM
        scsi: ufs: qcom: Add UFSHCD_QUIRK_BROKEN_LSDBS_CAP for SM8550 SoC
        scsi: ufs: core: Add a quirk for handling broken LSDBS field in controller capabilities register
        scsi: core: Fix the return value of scsi_logical_block_count()
        scsi: MAINTAINERS: Update HiSilicon SAS controller driver maintainer
  6. 24 Aug, 2024 10 commits
    • bcachefs: Fix rebalance_work accounting · 49aa7830
      Kent Overstreet authored
      rebalance_work was keying off of the presence of rebalance_opts in the
      extent, but that was incorrect: we keep those around after rebalance
      for indirect extents, since the inode's options are not directly
      available.
      
      Fixes: 20ac515a ("bcachefs: bch_acct_rebalance_work")
      Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
    • bcachefs: Fix failure to flush moves before sleeping in copygc · d3204616
      Kent Overstreet authored
      This fixes an apparent deadlock - rebalance would get stuck trying to
      take nocow locks because they weren't being released by copygc.
      Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
    • Merge tag 'cgroup-for-6.11-rc4-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup · d2bafcf2
      Linus Torvalds authored
      Pull cgroup fixes from Tejun Heo:
       "Three patches addressing cpuset corner cases"
      
      * tag 'cgroup-for-6.11-rc4-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
        cgroup/cpuset: Eliminate unnecessary sched domains rebuilds in hotplug
        cgroup/cpuset: Clear effective_xcpus on cpus_allowed clearing only if cpus.exclusive not set
        cgroup/cpuset: fix panic caused by partcmd_update
    • Merge tag 'wq-for-6.11-rc4-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq · cb2c84b3
      Linus Torvalds authored
      Pull workqueue fixes from Tejun Heo:
       "Nothing too interesting. One patch to remove spurious warning and
        others to address static checker warnings"
      
      * tag 'wq-for-6.11-rc4-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
        workqueue: Correct declaration of cpu_pwq in struct workqueue_struct
        workqueue: Fix spurious data race in __flush_work()
        workqueue: Remove incorrect "WARN_ON_ONCE(!list_empty(&worker->entry));" from dying worker
        workqueue: Fix UBSAN 'subtraction overflow' error in shift_and_mask()
        workqueue: doc: Fix function name, remove markers
    • Merge tag 'mips-fixes_6.11_1' of git://git.kernel.org/pub/scm/linux/kernel/git/mips/linux · 5bd6cf00
      Linus Torvalds authored
      Pull MIPS fixes from Thomas Bogendoerfer:
      
       - Set correct timer mode on Loongson64
      
       - Only request r4k clockevent interrupt on one CPU
      
      * tag 'mips-fixes_6.11_1' of git://git.kernel.org/pub/scm/linux/kernel/git/mips/linux:
        MIPS: cevt-r4k: Don't call get_c0_compare_int if timer irq is installed
        MIPS: Loongson64: Set timer mode in cpu-probe
    • Merge tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux · a8a8dcbd
      Linus Torvalds authored
      Pull arm64 kvm fixes from Catalin Marinas:
      
       - Don't drop references on LPIs that weren't visited by the vgic-debug
         iterator
      
       - Cure lock ordering issue when unregistering vgic redistributors
      
       - Fix for misaligned stage-2 mappings when VMs are backed by hugetlb
         pages
      
       - Treat SGI registers as UNDEFINED if a VM hasn't been configured for
         GICv3
      
      * tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
        KVM: arm64: Make ICC_*SGI*_EL1 undef in the absence of a vGICv3
        KVM: arm64: Ensure canonical IPA is hugepage-aligned when handling fault
        KVM: arm64: vgic: Don't hold config_lock while unregistering redistributors
        KVM: arm64: vgic-debug: Don't put unmarked LPIs
      a8a8dcbd
    • Merge tag 'nfs-for-6.11-2' of git://git.linux-nfs.org/projects/anna/linux-nfs · 60f0560f
      Linus Torvalds authored
      Pull NFS client fixes from Anna Schumaker:
      
       - Fix rpcrdma refcounting in xa_alloc
      
       - Fix rpcrdma usage of XA_FLAGS_ALLOC
      
       - Fix requesting FATTR4_WORD2_OPEN_ARGUMENTS
      
       - Fix attribute bitmap decoder to handle a 3rd word
      
       - Add reschedule points when returning delegations to avoid soft lockups
      
       - Fix clearing layout segments in layoutreturn
      
       - Avoid unnecessary rescanning of the per-server delegation list
      
      * tag 'nfs-for-6.11-2' of git://git.linux-nfs.org/projects/anna/linux-nfs:
        NFS: Avoid unnecessary rescanning of the per-server delegation list
        NFSv4: Fix clearing of layout segments in layoutreturn
        NFSv4: Add missing rescheduling points in nfs_client_return_marked_delegations
        nfs: fix bitmap decoder to handle a 3rd word
        nfs: fix the fetch of FATTR4_OPEN_ARGUMENTS
        rpcrdma: Trace connection registration and unregistration
        rpcrdma: Use XA_FLAGS_ALLOC instead of XA_FLAGS_ALLOC1
        rpcrdma: Device kref is over-incremented on error from xa_alloc
      60f0560f
    • Merge tag 'v6.11-rc4-client-fixes' of git://git.samba.org/sfrench/cifs-2.6 · 66ace9a8
      Linus Torvalds authored
      Pull smb client fixes from Steve French:
      
       - fix refcount leak (can cause rmmod to fail)
      
       - fix byte range locking problem with cached reads
      
       - fix for mount failure if reparse point unrecognized
      
       - minor typo
      
      * tag 'v6.11-rc4-client-fixes' of git://git.samba.org/sfrench/cifs-2.6:
        smb/client: fix typo: GlobalMid_Sem -> GlobalMid_Lock
        smb: client: ignore unhandled reparse tags
        smb3: fix problem unloading module due to leaked refcount on shutdown
        smb3: fix broken cached reads when posix locks
      66ace9a8
    • Merge tag 'input-for-v6.11-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input · 7eb61cc6
      Linus Torvalds authored
      Pull input fixes from Dmitry Torokhov:
      
       - a tweak to the uinput interface to reject requests with an abnormally
         large number of slots. 100 slots/contacts should be enough for real
         devices
      
       - support for FocalTech FT8201 added to the edt-ft5x06 driver
      
       - tweaks to i8042 to handle more devices that have issues with its
         emulation
      
       - Synaptics touchpad switched to native SMBus/RMI mode on HP Elitebook
         840 G2
      
       - other minor fixes
      
      * tag 'input-for-v6.11-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input:
        Input: himax_hx83112b - fix incorrect size when reading product ID
        Input: i8042 - use new forcenorestore quirk to replace old buggy quirk combination
        Input: i8042 - add forcenorestore quirk to leave controller untouched even on s3
        Input: i8042 - add Fujitsu Lifebook E756 to i8042 quirk table
        Input: uinput - reject requests with unreasonable number of slots
        Input: edt-ft5x06 - add support for FocalTech FT8201
        dt-bindings: input: touchscreen: edt-ft5x06: Document FT8201 support
        Input: adc-joystick - fix optional value handling
        Input: synaptics - enable SMBus for HP Elitebook 840 G2
        Input: ads7846 - ratelimit the spi_sync error message
      7eb61cc6
    • Merge tag 'drm-fixes-2024-08-24' of https://gitlab.freedesktop.org/drm/kernel · 79a899e3
      Linus Torvalds authored
      Pull drm fixes from Dave Airlie:
       "Weekly fixes. xe and msm are the major groups, with
        amdgpu/i915/nouveau having smaller bits. xe has a bunch of hw
        workaround fixes that were found to be missing, so that is why there
        are a bunch of scattered fixes, and one larger one. But overall size
        doesn't look too out of the ordinary.
      
        msm:
         - virtual plane fixes:
            - drop yuv on hw where not supported
            - csc vs yuv format fix
            - rotation fix
         - fix fb cleanup on close
         - reset phy before link training
         - fix visual corruption at 4K
         - fix NULL ptr crash on hotplug
         - simplify debug macros
         - sc7180 fix
         - adreno firmware name error path fix
      
        amdgpu:
         - GFX10 firmware loading fix
         - SDMA 5.2 fix
         - Debugfs parameter validation fix
         - eGPU hotplug fix
      
        i915:
         - fix HDCP timeouts
      
        nouveau:
         - fix SG_DEBUG crash
      
        xe:
         - Fix OA format masks which were breaking build with gcc-5
         - Fix opregion leak (Lucas)
         - Fix OA sysfs entry (Ashutosh)
         - Fix VM dma-resv lock (Brost)
         - Fix tile fini sequence (Brost)
         - Prevent UAF around preempt fence (Auld)
         - Fix DGFX display suspend/resume (Maarten)
         - Many Xe/Xe2 critical workarounds (Auld, Ngai-Mint, Bommu, Tejas, Daniele)
         - Fix devm/drmm issues (Daniele)
         - Fix missing workqueue destroy in xe_gt_pagefault (Stuart)
         - Drop HW fence pointer to HW fence ctx (Brost)
         - Free job before xe_exec_queue_put (Brost)"
      
      * tag 'drm-fixes-2024-08-24' of https://gitlab.freedesktop.org/drm/kernel: (35 commits)
        drm/xe: Free job before xe_exec_queue_put
        drm/xe: Drop HW fence pointer to HW fence ctx
        drm/xe: Fix missing workqueue destroy in xe_gt_pagefault
        drm/amdgpu: fix eGPU hotplug regression
        drm/amdgpu: Validate TA binary size
        drm/amdgpu/sdma5.2: limit wptr workaround to sdma 5.2.1
        drm/amdgpu: fixing rlc firmware loading failure issue
        drm/xe/uc: Use devm to register cleanup that includes exec_queues
        drm/xe: use devm instead of drmm for managed bo
        drm/xe/xe2hpg: Add Wa_14021821874
        drm/xe: fix WA 14018094691
        drm/xe/xe2: Add Wa_15015404425
        drm/xe/xe2: Make subsequent L2 flush sequential
        drm/xe/xe2lpg: Extend workaround 14021402888
        drm/xe/xe2lpm: Extend Wa_16021639441
        drm/xe/bmg: implement Wa_16023588340
        drm/xe/oa/uapi: Make bit masks unsigned
        drm/xe/display: Make display suspend/resume work on discrete
        drm/xe: prevent UAF around preempt fence
        drm/xe: Fix tile fini sequence
        ...
      79a899e3
  7. 23 Aug, 2024 10 commits
  8. 22 Aug, 2024 1 commit
    • Merge tag 'net-6.11-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net · aa0743a2
      Linus Torvalds authored
      Pull networking fixes from Jakub Kicinski:
       "Including fixes from bluetooth and netfilter.
      
        Current release - regressions:
      
         - virtio_net: avoid crash on resume - move netdev_tx_reset_queue()
           call before RX napi enable
      
        Current release - new code bugs:
      
         - net/mlx5e: fix page leak and incorrect header release w/ HW GRO
      
        Previous releases - regressions:
      
         - udp: fix receiving fraglist GSO packets
      
         - tcp: prevent refcount underflow due to concurrent execution of
           tcp_sk_exit_batch()
      
        Previous releases - always broken:
      
         - ipv6: fix possible UAF when incrementing error counters on output
      
         - ip6: tunnel: prevent merging of packets with different L2
      
         - mptcp: pm: fix IDs not being reusable
      
         - bonding: fix potential crashes in IPsec offload handling
      
         - Bluetooth: HCI:
            - MGMT: add error handling to pair_device() to avoid a crash
            - invert LE State quirk to be opt-out rather than opt-in
            - fix LE quote calculation
      
         - drv: dsa: VLAN fixes for Ocelot driver
      
         - drv: igb: cope with large MAX_SKB_FRAGS Kconfig settings
      
         - drv: ice: fix Rx data path on architectures with PAGE_SIZE >= 8192
      
        Misc:
      
         - netpoll: do not export netpoll_poll_[disable|enable]()
      
         - MAINTAINERS: update the list of networking headers"
      
      * tag 'net-6.11-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (82 commits)
        s390/iucv: Fix vargs handling in iucv_alloc_device()
        net: ovs: fix ovs_drop_reasons error
        net: xilinx: axienet: Fix dangling multicast addresses
        net: xilinx: axienet: Always disable promiscuous mode
        MAINTAINERS: Mark JME Network Driver as Odd Fixes
        MAINTAINERS: Add header files to NETWORKING sections
        MAINTAINERS: Add limited globs for Networking headers
        MAINTAINERS: Add net_tstamp.h to SOCKET TIMESTAMPING section
        MAINTAINERS: Add sonet.h to ATM section of MAINTAINERS
        octeontx2-af: Fix CPT AF register offset calculation
        net: phy: realtek: Fix setting of PHY LEDs Mode B bit on RTL8211F
        net: ngbe: Fix phy mode set to external phy
        netfilter: flowtable: validate vlan header
        bnxt_en: Fix double DMA unmapping for XDP_REDIRECT
        ipv6: prevent possible UAF in ip6_xmit()
        ipv6: fix possible UAF in ip6_finish_output2()
        ipv6: prevent UAF in ip6_send_skb()
        netpoll: do not export netpoll_poll_[disable|enable]()
        selftests: mlxsw: ethtool_lanes: Source ethtool lib from correct path
        udp: fix receiving fraglist GSO packets
        ...
      aa0743a2