1. 04 Jul, 2024 1 commit
  2. 05 Jun, 2024 7 commits
    • sched/core: Drop spinlocks on contention iff kernel is preemptible · c793a628
      Sean Christopherson authored
      Use preempt_model_preemptible() to detect a preemptible kernel when
      deciding whether or not to reschedule in order to drop a contended
      spinlock or rwlock.  Because PREEMPT_DYNAMIC selects PREEMPTION, kernels
      built with PREEMPT_DYNAMIC=y will yield contended locks even if the live
      preemption model is "none" or "voluntary".  In short, make kernels with
      dynamically selected models behave the same as kernels with statically
      selected models.
      
      Somewhat counter-intuitively, NOT yielding a lock can provide better
      latency for the relevant tasks/processes.  E.g. KVM x86's mmu_lock, a
      rwlock, is often contended between an invalidation event (takes mmu_lock
      for write) and a vCPU servicing a guest page fault (takes mmu_lock for
      read).  For _some_ setups, letting the invalidation task complete even
      if there is mmu_lock contention provides lower latency for *all* tasks,
      i.e. the invalidation completes sooner *and* the vCPU services the guest
      page fault sooner.
      
      But even KVM's mmu_lock behavior isn't uniform, e.g. the "best" behavior
      can vary depending on the host VMM, the guest workload, the number of
      vCPUs, the number of pCPUs in the host, why there is lock contention, etc.
      
      In other words, simply deleting the CONFIG_PREEMPTION guard (or doing the
      opposite and removing contention yielding entirely) needs to come with a
      big pile of data proving that changing the status quo is a net positive.
      
      Opportunistically document this side effect of preempt=full, as yielding
      contended spinlocks can have significant, user-visible impact.
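
      As a rough illustration of the change in decision logic, the user-space
      C sketch below contrasts a build-time check with a check of the live
      preemption model. The enum and function names are hypothetical stand-ins
      rather than kernel APIs, and the model assumes that only "full" counts
      as preemptible.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical user-space model; none of these names are kernel APIs. */
enum preempt_model { MODEL_NONE, MODEL_VOLUNTARY, MODEL_FULL };

/* Old behaviour: a build-time property decides whether to yield. */
static bool needbreak_build_time(bool config_preemption, bool contended)
{
	return config_preemption && contended;
}

/* New behaviour: the live preemption model decides whether to yield. */
static bool needbreak_live_model(enum preempt_model live, bool contended)
{
	return live == MODEL_FULL && contended;
}
```

      In this model a PREEMPT_DYNAMIC=y build corresponds to
      config_preemption == true, so the build-time check yields a contended
      lock regardless of the live model, while the live-model check yields
      only for MODEL_FULL.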
      
      Fixes: c597bfdd ("sched: Provide Kconfig support for default dynamic preempt mode")
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Ankur Arora <ankur.a.arora@oracle.com>
      Reviewed-by: Chen Yu <yu.c.chen@intel.com>
      Link: https://lore.kernel.org/kvm/ef81ff36-64bb-4cfe-ae9b-e3acf47bff24@proxmox.com
    • sched/core: Move preempt_model_*() helpers from sched.h to preempt.h · f0dc887f
      Sean Christopherson authored
      Move the declarations and inlined implementations of the preempt_model_*()
      helpers to preempt.h so that they can be referenced in spinlock.h without
      creating a potential circular dependency between spinlock.h and sched.h.
      
      No functional change intended.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Ankur Arora <ankur.a.arora@oracle.com>
      Link: https://lkml.kernel.org/r/20240528003521.979836-2-ankur.a.arora@oracle.com
    • sched/balance: Skip unnecessary updates to idle load balancer's flags · f90cc919
      Tim Chen authored
      We observed that the overhead on trigger_load_balance(), now renamed
      sched_balance_trigger(), has risen with a system's core counts.
      
      For an OLTP workload running a 6.8 kernel on a 2-socket x86 system
      with 96 cores/socket, we saw that 0.7% of cpu cycles were spent in
      trigger_load_balance(). On older systems with fewer cores/socket, this
      function's overhead was less than 0.1%.
      
      The cause of this overhead was that multiple cpus were calling
      kick_ilb(flags), updating the balancing work needed on a common idle
      load balancer cpu. The ilb_cpu's flags field got updated unconditionally
      with atomic_fetch_or().  The atomic reads and writes to ilb_cpu's flags
      cause much cache bouncing and cpu cycle overhead. This is seen in the
      annotated profile below.
      
                   kick_ilb():
                   if (ilb_cpu < 0)
                     test   %r14d,%r14d
                   ↑ js     6c
                   flags = atomic_fetch_or(flags, nohz_flags(ilb_cpu));
                     mov    $0x2d600,%rdi
                     movslq %r14d,%r8
                     mov    %rdi,%rdx
                     add    -0x7dd0c3e0(,%r8,8),%rdx
                   arch_atomic_read():
        0.01         mov    0x64(%rdx),%esi
       35.58         add    $0x64,%rdx
                   arch_atomic_fetch_or():
      
                   static __always_inline int arch_atomic_fetch_or(int i, atomic_t *v)
                   {
                   int val = arch_atomic_read(v);
      
                   do { } while (!arch_atomic_try_cmpxchg(v, &val, val | i));
        0.03  157:   mov    %r12d,%ecx
                   arch_atomic_try_cmpxchg():
                   return arch_try_cmpxchg(&v->counter, old, new);
        0.00         mov    %esi,%eax
                   arch_atomic_fetch_or():
                   do { } while (!arch_atomic_try_cmpxchg(v, &val, val | i));
                     or     %esi,%ecx
                   arch_atomic_try_cmpxchg():
                   return arch_try_cmpxchg(&v->counter, old, new);
        0.01         lock   cmpxchg %ecx,(%rdx)
       42.96       ↓ jne    2d2
                   kick_ilb():
      
      With instrumentation, we found that 81% of the updates do not result in
      any change in the ilb_cpu's flags.  That is, multiple cpus are asking
      the ilb_cpu to do the same things over and over again, before the ilb_cpu
      has a chance to run NOHZ load balance.
      
      Skip updates to ilb_cpu's flags if no new work needs to be done.
      Such updates do not change ilb_cpu's NOHZ flags.  This requires an extra
      atomic read but it is less expensive than frequent unnecessary atomic
      updates that generate cache bounces.
      
      We saw that on the OLTP workload, cpu cycles from trigger_load_balance()
      (or sched_balance_trigger()) got reduced from 0.7% to 0.2%.
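
      The check-before-update pattern described above can be sketched in
      user-space C11 as follows. This is a minimal illustration using
      <stdatomic.h>, not the kernel's kick_ilb() code; request_work() and its
      return convention are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Hypothetical helper: request work bits in a shared flag word.
 * Returns true only when an atomic read-modify-write was issued.
 */
static bool request_work(atomic_uint *shared, unsigned int flags)
{
	/* A plain atomic load does not need exclusive cache-line ownership. */
	if ((atomic_load(shared) & flags) == flags)
		return false;	/* every requested bit is already pending */

	/* At least one bit is new: pay for the read-modify-write once. */
	atomic_fetch_or(shared, flags);
	return true;
}
```

      A caller whose bits are all already set returns after the load and
      never writes to the shared word, which is the case the instrumentation
      above found for 81% of the updates.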
      Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Chen Yu <yu.c.chen@intel.com>
      Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
      Link: https://lore.kernel.org/r/20240531205452.65781-1-tim.c.chen@linux.intel.com
    • idle: Remove stale RCU comment · 764d5fcc
      Christian Loehle authored
      The call of rcu_idle_enter() from within cpuidle_idle_call() was
      removed in commit 1098582a ("sched,idle,rcu: Push rcu_idle deeper
      into the idle path") which makes the comment out of place.
      Signed-off-by: Christian Loehle <christian.loehle@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/5b936388-47df-4050-9229-6617a6c2bba5@arm.com
    • sched/headers: Move struct pre-declarations to the beginning of the header · 3cd72719
      Ingo Molnar authored
      There's a random number of structure pre-declaration lines in
      kernel/sched/sched.h, some of which are unnecessary duplicates.
      
      Move them to the head & order them a bit for readability.
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: linux-kernel@vger.kernel.org
    • sched/core: Clean up kernel/sched/sched.h a bit · 127f6bf1
      Ingo Molnar authored
       - Fix whitespace noise
       - Fix col80 linebreak damage where possible
       - Apply CodingStyle consistently
       - Use consistent #else and #endif comments
       - Use consistent vertical alignment
       - Use 'extern' consistently
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: linux-kernel@vger.kernel.org
    • sched/core: Simplify prefetch_curr_exec_start() · 85c9a8f4
      Ingo Molnar authored
      Remove unnecessary use of the address operator.
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: linux-kernel@vger.kernel.org
  3. 27 May, 2024 2 commits
    • sched: Fix spelling in comments · 402de7fc
      Ingo Molnar authored
      Do a spell-checking pass.
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/syscalls: Split out kernel/sched/syscalls.c from kernel/sched/core.c · 04746ed8
      Ingo Molnar authored
      core.c has become rather large, move most scheduler syscall
      related functionality into a separate file, syscalls.c.
      
      This is ~15% of core.c's raw linecount.
      
      Move the alloc_user_cpus_ptr(), __rt_effective_prio(),
      rt_effective_prio(), uclamp_none(), uclamp_se_set()
      and uclamp_bucket_id() inlines to kernel/sched/sched.h.
      
      Internally export the __sched_setscheduler(), __sched_setaffinity(),
      __setscheduler_prio(), set_load_weight(), enqueue_task(), dequeue_task(),
      check_class_changed(), splice_balance_callbacks() and balance_callbacks()
      methods to better facilitate this.
      
      Move the new file's build to build_policy.c, because it fits there
      semantically, but also because it's the smallest of the 4 build units
      under an allmodconfig build:
      
        -rw-rw-r-- 1 mingo mingo 7.3M May 27 12:35 kernel/sched/core.i
        -rw-rw-r-- 1 mingo mingo 6.4M May 27 12:36 kernel/sched/build_utility.i
        -rw-rw-r-- 1 mingo mingo 6.3M May 27 12:36 kernel/sched/fair.i
        -rw-rw-r-- 1 mingo mingo 5.8M May 27 12:36 kernel/sched/build_policy.i
      
      This better balances build time for scheduler subsystem rebuilds.
      
      I build-tested this new file as a standalone syscalls.o file for a bit,
      to make sure all the encapsulations & abstractions are robust.
      
      Also update/add my copyright notices to these files.
      
      Build time measurements:
      
       # -Before/+After:
      
       kepler:~/tip> perf stat -e 'cycles,instructions,duration_time' --sync --repeat 5 --pre 'rm -f kernel/sched/*.o' m kernel/sched/built-in.a >/dev/null
      
       Performance counter stats for 'm kernel/sched/built-in.a' (5 runs):
      
       -    71,938,508,607      cycles                                                                  ( +-  0.17% )
       +    71,992,916,493      cycles                                                                  ( +-  0.22% )
       -   106,214,780,964      instructions                     #    1.48  insn per cycle              ( +-  0.01% )
       +   105,450,231,154      instructions                     #    1.46  insn per cycle              ( +-  0.01% )
       -     5,878,232,620 ns   duration_time                                                           ( +-  0.38% )
       +     5,290,085,069 ns   duration_time                                                           ( +-  0.21% )
      
       -            5.8782 +- 0.0221 seconds time elapsed  ( +-  0.38% )
       +            5.2901 +- 0.0111 seconds time elapsed  ( +-  0.21% )
      
      Build time improvement of -10.0% (duration_time, 5.878s before vs.
      5.290s after; the old build took 11.1% longer) is expected: the
      parallel build time of the scheduler subsystem is determined by the
      largest, slowest to build object file, which is kernel/sched/core.o.
      By moving ~15% of its complexity into another build unit, we reduced
      build time by ~10%.
      
      Measured cycles spent on building is within its ~0.2% stddev noise envelope.
      
      The -0.7% reduction in instructions spent on building the scheduler is
      statistically reliable and somewhat surprising - I can only speculate:
      maybe compilers aren't that efficient at building & optimizing 10+ KLOC files
      (core.c), and it's an overall win to balance the linecount a bit.
      
      Anyway, this might be a data point that suggests that reducing the linecount
      of our largest files will improve not just code readability and maintainability,
      but might also improve build times a bit.
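
      The percentages in this section can be rechecked from the perf stat
      counters with a small helper. The sketch below is only a convenience
      for redoing the arithmetic; percent_change() is a hypothetical name and
      not part of any tool used above.

```c
#include <assert.h>

/* Signed change from 'before' to 'after', as a percentage of 'before'. */
static double percent_change(double before, double after)
{
	return (after - before) / before * 100.0;
}
```

      Applied to the counters above, this gives roughly -10.0% for
      duration_time (5,878,232,620 ns to 5,290,085,069 ns), -0.72% for
      instructions and +0.08% for cycles. Measured the other way around, the
      "before" build took about 11.1% longer than the "after" build.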
      
      Code generation got a bit worse, by 0.5kb text on an x86 defconfig build:
      
        # -Before/+After:
      
        kepler:~/tip> size vmlinux
           text	   data	    bss	    dec	    hex	filename
        -26475475	10439178	1740804	38655457	24dd5e1	vmlinux
        +26476003	10439178	1740804	38655985	24dd7f1	vmlinux
      
        kepler:~/tip> size kernel/sched/built-in.a
           text	   data	    bss	    dec	    hex	filename
        - 76056	  30025	    489	 106570	  1a04a	kernel/sched/core.o (ex kernel/sched/built-in.a)
        + 63452	  29453	    489	  93394	  16cd2	kernel/sched/core.o (ex kernel/sched/built-in.a)
          44299	   2181	    104	  46584	   b5f8	kernel/sched/fair.o (ex kernel/sched/built-in.a)
        - 42764	   3424	    120	  46308	   b4e4	kernel/sched/build_policy.o (ex kernel/sched/built-in.a)
        + 55651	   4044	    120	  59815	   e9a7	kernel/sched/build_policy.o (ex kernel/sched/built-in.a)
          44866	  12655	   2192	  59713	   e941	kernel/sched/build_utility.o (ex kernel/sched/built-in.a)
      
      This is primarily due to the extra functions exported, and the size
      gets exaggerated somewhat by __pfx CFI function padding:
      
      	ffffffff810cc710 <__pfx_enqueue_task>:
      	ffffffff810cc710:	90                   	nop
      	ffffffff810cc711:	90                   	nop
      	ffffffff810cc712:	90                   	nop
      	ffffffff810cc713:	90                   	nop
      	ffffffff810cc714:	90                   	nop
      	ffffffff810cc715:	90                   	nop
      	ffffffff810cc716:	90                   	nop
      	ffffffff810cc717:	90                   	nop
      	ffffffff810cc718:	90                   	nop
      	ffffffff810cc719:	90                   	nop
      	ffffffff810cc71a:	90                   	nop
      	ffffffff810cc71b:	90                   	nop
      	ffffffff810cc71c:	90                   	nop
      	ffffffff810cc71d:	90                   	nop
      	ffffffff810cc71e:	90                   	nop
      	ffffffff810cc71f:	90                   	nop
      
      AFAICS the cost is primarily not to core.o and fair.o though (which contain
      most performance sensitive scheduler functions), only to syscalls.o, which
      gets called with much lower frequency - so I think this is an acceptable
      trade-off for better code separation.
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Link: https://lore.kernel.org/r/20240407084319.1462211-2-mingo@kernel.org
  4. 26 May, 2024 5 commits
  5. 25 May, 2024 12 commits
    • Merge tag 'mm-hotfixes-stable-2024-05-25-09-13' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm · 9b62e02e
      Linus Torvalds authored
      
      Pull misc fixes from Andrew Morton:
       "16 hotfixes, 11 of which are cc:stable.
      
        A few nilfs2 fixes, the remainder are for MM: a couple of selftests
        fixes, various singletons fixing various issues in various parts"
      
      * tag 'mm-hotfixes-stable-2024-05-25-09-13' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
        mm/ksm: fix possible UAF of stable_node
        mm/memory-failure: fix handling of dissolved but not taken off from buddy pages
        mm: /proc/pid/smaps_rollup: avoid skipping vma after getting mmap_lock again
        nilfs2: fix potential hang in nilfs_detach_log_writer()
        nilfs2: fix unexpected freezing of nilfs_segctor_sync()
        nilfs2: fix use-after-free of timer for log writer thread
        selftests/mm: fix build warnings on ppc64
        arm64: patching: fix handling of execmem addresses
        selftests/mm: compaction_test: fix bogus test success and reduce probability of OOM-killer invocation
        selftests/mm: compaction_test: fix incorrect write of zero to nr_hugepages
        selftests/mm: compaction_test: fix bogus test success on Aarch64
        mailmap: update email address for Satya Priya
        mm/huge_memory: don't unpoison huge_zero_folio
        kasan, fortify: properly rename memintrinsics
        lib: add version into /proc/allocinfo output
        mm/vmalloc: fix vmalloc which may return null if called with __GFP_NOFAIL
    • Merge tag 'irq-urgent-2024-05-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · a0db36ed
      Linus Torvalds authored
      Pull irq fixes from Ingo Molnar:
      
       - Fix x86 IRQ vector leak caused by a CPU offlining race
      
       - Fix build failure in the riscv-imsic irqchip driver
         caused by an API-change semantic conflict
      
       - Fix use-after-free in irq_find_at_or_after()
      
      * tag 'irq-urgent-2024-05-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        genirq/irqdesc: Prevent use-after-free in irq_find_at_or_after()
        genirq/cpuhotplug, x86/vector: Prevent vector leak during CPU offline
        irqchip/riscv-imsic: Fixup riscv_ipi_set_virq_range() conflict
    • Merge tag 'x86-urgent-2024-05-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 3a390f24
      Linus Torvalds authored
      Pull x86 fixes from Ingo Molnar:
      
       - Fix regressions of the new x86 CPU VFM (vendor/family/model)
         enumeration/matching code
      
       - Fix crash kernel detection on buggy firmware with
         non-compliant ACPI MADT tables
      
       - Address Kconfig warning
      
      * tag 'x86-urgent-2024-05-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        x86/cpu: Fix x86_match_cpu() to match just X86_VENDOR_INTEL
        crypto: x86/aes-xts - switch to new Intel CPU model defines
        x86/topology: Handle bogus ACPI tables correctly
        x86/kconfig: Select ARCH_WANT_FRAME_POINTERS again when UNWINDER_FRAME_POINTER=y
    • Merge tag 'for-linus-6.10-1' of https://github.com/cminyard/linux-ipmi · 56676c4c
      Linus Torvalds authored
      Pull ipmi updates from Corey Minyard:
       "Mostly updates for deprecated interfaces, platform.remove and
        converting from a tasklet to a BH workqueue.
      
        Also use HAS_IOPORT for disabling inb()/outb()"
      
      * tag 'for-linus-6.10-1' of https://github.com/cminyard/linux-ipmi:
        ipmi: kcs_bmc_npcm7xx: Convert to platform remove callback returning void
        ipmi: kcs_bmc_aspeed: Convert to platform remove callback returning void
        ipmi: ipmi_ssif: Convert to platform remove callback returning void
        ipmi: ipmi_si_platform: Convert to platform remove callback returning void
        ipmi: ipmi_powernv: Convert to platform remove callback returning void
        ipmi: bt-bmc: Convert to platform remove callback returning void
        char: ipmi: handle HAS_IOPORT dependencies
        ipmi: Convert from tasklet to BH workqueue
    • Merge tag 'ceph-for-6.10-rc1' of https://github.com/ceph/ceph-client · 74eca356
      Linus Torvalds authored
      Pull ceph updates from Ilya Dryomov:
       "A series from Xiubo that adds support for additional access checks
        based on MDS auth caps which were recently made available to clients.
      
        This is needed to prevent scenarios where the MDS quietly discards
        updates that a UID-restricted client previously (wrongfully) acked to
        the user.
      
        Other than that, just a documentation fixup"
      
      * tag 'ceph-for-6.10-rc1' of https://github.com/ceph/ceph-client:
        doc: ceph: update userspace command to get CephFS metadata
        ceph: add CEPHFS_FEATURE_MDS_AUTH_CAPS_CHECK feature bit
        ceph: check the cephx mds auth access for async dirop
        ceph: check the cephx mds auth access for open
        ceph: check the cephx mds auth access for setattr
        ceph: add ceph_mds_check_access() helper
        ceph: save cap_auths in MDS client when session is opened
    • Merge tag 'ntfs3_for_6.10' of https://github.com/Paragon-Software-Group/linux-ntfs3 · 89b61ca4
      Linus Torvalds authored
      Pull ntfs3 updates from Konstantin Komarov:
       "Fixes:
         - reusing of the file index (could cause the file to be trimmed)
         - infinite dir enumeration
         - taking DOS names into account during link counting
         - le32_to_cpu conversion, 32 bit overflow, NULL check
         - some code was refactored
      
        Changes:
         - removed max link count info display during driver init
      
        Remove:
         - atomic_open has been removed for lack of use"
      
      * tag 'ntfs3_for_6.10' of https://github.com/Paragon-Software-Group/linux-ntfs3:
        fs/ntfs3: Break dir enumeration if directory contents error
        fs/ntfs3: Fix case when index is reused during tree transformation
        fs/ntfs3: Mark volume as dirty if xattr is broken
        fs/ntfs3: Always make file nonresident on fallocate call
        fs/ntfs3: Redesign ntfs_create_inode to return error code instead of inode
        fs/ntfs3: Use variable length array instead of fixed size
        fs/ntfs3: Use 64 bit variable to avoid 32 bit overflow
        fs/ntfs3: Check 'folio' pointer for NULL
        fs/ntfs3: Missed le32_to_cpu conversion
        fs/ntfs3: Remove max link count info display during driver init
        fs/ntfs3: Taking DOS names into account during link counting
        fs/ntfs3: remove atomic_open
        fs/ntfs3: use kcalloc() instead of kzalloc()
    • Merge tag '6.10-rc-ksmbd-server-fixes' of git://git.samba.org/ksmbd · 6c8b1a2d
      Linus Torvalds authored
      Pull smb server fixes from Steve French:
       "Two ksmbd server fixes, both for stable"
      
      * tag '6.10-rc-ksmbd-server-fixes' of git://git.samba.org/ksmbd:
        ksmbd: ignore trailing slashes in share paths
        ksmbd: avoid to send duplicate oplock break notifications
    • Merge tag 'rtc-6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/abelloni/linux · 54f71b03
      Linus Torvalds authored
      Pull RTC updates from Alexandre Belloni:
       "There is one new driver and then most of the changes are the device
        tree bindings conversions to yaml.
      
        New driver:
         - Epson RX8111
      
        Drivers:
         - Many Device Tree bindings conversions to dtschema
         - pcf8563: wakeup-source support"
      
      * tag 'rtc-6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/abelloni/linux:
        pcf8563: add wakeup-source support
        rtc: rx8111: handle VLOW flag
        rtc: rx8111: demote warnings to debug level
        rtc: rx6110: Constify struct regmap_config
        dt-bindings: rtc: convert trivial devices into dtschema
        dt-bindings: rtc: stmp3xxx-rtc: convert to dtschema
        dt-bindings: rtc: pxa-rtc: convert to dtschema
        rtc: Add driver for Epson RX8111
        dt-bindings: rtc: Add Epson RX8111
        rtc: mcp795: drop unneeded MODULE_ALIAS
        rtc: nuvoton: Modify part number value
        rtc: test: Split rtc unit test into slow and normal speed test
        dt-bindings: rtc: nxp,lpc1788-rtc: convert to dtschema
        dt-bindings: rtc: digicolor-rtc: move to trivial-rtc
        dt-bindings: rtc: alphascale,asm9260-rtc: convert to dtschema
        dt-bindings: rtc: armada-380-rtc: convert to dtschema
        rtc: cros-ec: provide ID table for avoiding fallback match
    • Merge tag 'i3c/for-6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/i3c/linux · 4286e1fc
      Linus Torvalds authored
      Pull i3c updates from Alexandre Belloni:
       "Runtime PM (power management) is improved and hot-join support has
        been added to the dw controller driver.
      
        Core:
         - Allow device driver to trigger controller runtime PM
      
        Drivers:
         - dw: hot-join support
         - svc: better IBI handling"
      
      * tag 'i3c/for-6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/i3c/linux:
        i3c: dw: Add hot-join support.
        i3c: master: Enable runtime PM for master controller
        i3c: master: svc: fix invalidate IBI type and miss call client IBI handler
        i3c: master: svc: change ENXIO to EAGAIN when IBI occurs during start frame
        i3c: Add comment for -EAGAIN in i3c_device_do_priv_xfers()
    • Merge tag 'jffs2-for-linus-6.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/ubifs · 6951abe8
      Linus Torvalds authored
      Pull jffs2 updates from Richard Weinberger:
      
       - Fix illegal memory access in jffs2_free_inode()
      
       - Kernel-doc fixes
      
       - print symbolic error names
      
      * tag 'jffs2-for-linus-6.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/ubifs:
        jffs2: Fix potential illegal address access in jffs2_free_inode
        jffs2: Simplify the allocation of slab caches
        jffs2: nodemgmt: fix kernel-doc comments
        jffs2: print symbolic error name instead of error code
    • Merge tag 'uml-for-linus-6.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/uml/linux · 2313022e
      Linus Torvalds authored
      Pull UML updates from Richard Weinberger:
      
       - Fixes for -Wmissing-prototypes warnings and further cleanup
      
       - Remove callback returning void from rtc and virtio drivers
      
       - Fix bash location
      
      * tag 'uml-for-linus-6.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/uml/linux: (26 commits)
        um: virtio_uml: Convert to platform remove callback returning void
        um: rtc: Convert to platform remove callback returning void
        um: Remove unused do_get_thread_area function
        um: Fix -Wmissing-prototypes warnings for __vdso_*
        um: Add an internal header shared among the user code
        um: Fix the declaration of kasan_map_memory
        um: Fix the -Wmissing-prototypes warning for get_thread_reg
        um: Fix the -Wmissing-prototypes warning for __switch_mm
        um: Fix -Wmissing-prototypes warnings for (rt_)sigreturn
        um: Stop tracking host PID in cpu_tasks
        um: process: remove unused 'n' variable
        um: vector: remove unused len variable/calculation
        um: vector: fix bpfflash parameter evaluation
        um: slirp: remove set but unused variable 'pid'
        um: signal: move pid variable where needed
        um: Makefile: use bash from the environment
        um: Add winch to winch_handlers before registering winch IRQ
        um: Fix -Wmissing-prototypes warnings for __warp_* and foo
        um: Fix -Wmissing-prototypes warnings for text_poke*
        um: Move declarations to proper headers
        ...
    • Merge tag 'drm-next-2024-05-25' of https://gitlab.freedesktop.org/drm/kernel · 56fb6f92
      Linus Torvalds authored
      Pull drm fixes from Dave Airlie:
       "Some fixes for the end of the merge window, mostly amdgpu and panthor,
        with one nouveau uAPI change that fixes a bad decision we made a few
        months back.
      
        nouveau:
         - fix bo metadata uAPI for vm bind
      
        panthor:
         - Fixes for panthor's heap logical block.
         - Reset on unrecoverable fault
         - Fix VM references.
         - Reset fix.
      
        xlnx:
         - xlnx compile and doc fixes.
      
        amdgpu:
         - Handle vbios table integrated info v2.3
      
        amdkfd:
         - Handle duplicate BOs in reserve_bo_and_cond_vms
         - Handle memory limitations on small APUs
      
        dp/mst:
         - MST null deref fix.
      
        bridge:
         - Don't let next bridge create connector in adv7511 to make probe
           work"
      
      * tag 'drm-next-2024-05-25' of https://gitlab.freedesktop.org/drm/kernel:
        drm/amdgpu/atomfirmware: add intergrated info v2.3 table
        drm/mst: Fix NULL pointer dereference at drm_dp_add_payload_part2
        drm/amdkfd: Let VRAM allocations go to GTT domain on small APUs
        drm/amdkfd: handle duplicate BOs in reserve_bo_and_cond_vms
        drm/bridge: adv7511: Attach next bridge without creating connector
        drm/buddy: Fix the warn on's during force merge
        drm/nouveau: use tile_mode and pte_kind for VM_BIND bo allocations
        drm/panthor: Call panthor_sched_post_reset() even if the reset failed
        drm/panthor: Reset the FW VM to NULL on unplug
        drm/panthor: Keep a ref to the VM at the panthor_kernel_bo level
        drm/panthor: Force an immediate reset on unrecoverable faults
        drm/panthor: Document drm_panthor_tiler_heap_destroy::handle validity constraints
        drm/panthor: Fix an off-by-one in the heap context retrieval logic
        drm/panthor: Relax the constraints on the tiler chunk size
        drm/panthor: Make sure the tiler initial/max chunks are consistent
        drm/panthor: Fix tiler OOM handling to allow incremental rendering
        drm: xlnx: zynqmp_dpsub: Fix compilation error
        drm: xlnx: zynqmp_dpsub: Fix few function comments
  6. 24 May, 2024 13 commits
    • cifs: Fix missing set of remote_i_size · 93a43155
      David Howells authored
      Occasionally, the generic/001 xfstest will fail indicating corruption in
      one of the copy chains when run on cifs against a server that supports
      FSCTL_DUPLICATE_EXTENTS_TO_FILE (eg. Samba with a share on btrfs).  The
      problem is that the remote_i_size value isn't updated by cifs_setsize()
      when called by smb2_duplicate_extents(), but i_size *is*.
      
      This may cause cifs_remap_file_range() to then skip the bit after calling
      ->duplicate_extents() that sets sizes.
      
      Fix this by calling netfs_resize_file() in smb2_duplicate_extents() before
      calling cifs_setsize() to set i_size.
      
      This means we don't then need to call netfs_resize_file() upon return from
      ->duplicate_extents(), but we also fix the test to compare against the pre-dup
      inode size.
      
      [Note that this goes back before the addition of remote_i_size with the
      netfs_inode struct.  It should probably have been setting cifsi->server_eof
      previously.]
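
      The failure mode described above can be modelled with a toy inode in
      user-space C. This is a simplified sketch under stated assumptions:
      struct toy_inode and all three functions are hypothetical and only
      mirror the ordering problem, not the actual cifs or netfs code.

```c
#include <assert.h>

/* Toy inode: hypothetical fields mirroring the two sizes named above. */
struct toy_inode {
	long long i_size;        /* local view of the file size */
	long long remote_i_size; /* size the server is believed to hold */
};

/* Pre-fix model of the duplicate-extents step: only i_size grows. */
static void dup_extents_old(struct toy_inode *ino, long long new_size)
{
	if (new_size > ino->i_size)
		ino->i_size = new_size;
}

/* Post-fix model: record the remote size before setting i_size. */
static void dup_extents_fixed(struct toy_inode *ino, long long new_size)
{
	if (new_size > ino->i_size) {
		ino->remote_i_size = new_size;
		ino->i_size = new_size;
	}
}

/* Caller's follow-up, which only runs while the file still looks short. */
static void caller_fixup(struct toy_inode *ino, long long new_size)
{
	if (new_size > ino->i_size) {
		ino->i_size = new_size;
		ino->remote_i_size = new_size;
	}
}
```

      With dup_extents_old(), i_size has already grown by the time
      caller_fixup() runs, so the size comparison fails and remote_i_size
      keeps its stale value. With dup_extents_fixed(), both sizes agree
      whether or not the follow-up runs.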
      
      Fixes: cfc63fc8 ("smb3: fix cached file size problems in duplicate extents (reflink)")
      Signed-off-by: David Howells <dhowells@redhat.com>
      cc: Steve French <sfrench@samba.org>
      cc: Paulo Alcantara <pc@manguebit.com>
      cc: Shyam Prasad N <nspmangalore@gmail.com>
      cc: Rohith Surabattula <rohiths.msft@gmail.com>
      cc: Jeff Layton <jlayton@kernel.org>
      cc: linux-cifs@vger.kernel.org
      cc: netfs@lists.linux.dev
      Signed-off-by: Steve French <stfrench@microsoft.com>
      93a43155
    • David Howells's avatar
      cifs: Fix smb3_insert_range() to move the zero_point · 8a160723
      David Howells authored
      Fix smb3_insert_range() to move the zero_point over to the new EOF.
      Without this, generic/147 fails as reads of data beyond the old EOF point
      return zeroes.
      
      Fixes: 3ee1a1fc ("cifs: Cut over to using netfslib")
      Signed-off-by: David Howells <dhowells@redhat.com>
      cc: Shyam Prasad N <nspmangalore@gmail.com>
      cc: Rohith Surabattula <rohiths.msft@gmail.com>
      cc: Jeff Layton <jlayton@kernel.org>
      cc: linux-cifs@vger.kernel.org
      cc: netfs@lists.linux.dev
      Signed-off-by: Steve French <stfrench@microsoft.com>
      8a160723
    • Linus Torvalds's avatar
      Merge tag 'mm-stable-2024-05-24-11-49' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm · 0b32d436
      Linus Torvalds authored
      Pull more mm updates from Andrew Morton:
       "Jeff Xu's implementation of the mseal() syscall"
      
      * tag 'mm-stable-2024-05-24-11-49' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
        selftest mm/mseal read-only elf memory segment
        mseal: add documentation
        selftest mm/mseal memory sealing
        mseal: add mseal syscall
        mseal: wire up mseal syscall
      0b32d436
    • Chengming Zhou's avatar
      mm/ksm: fix possible UAF of stable_node · 90e82349
      Chengming Zhou authored
      Commit 2c653d0e ("ksm: introduce ksm_max_page_sharing per page
      deduplication limit") introduced a possible failure case in
      stable_tree_insert(), where we may free the newly allocated
      stable_node_dup if we fail to prepare the missing chain node.
      
      The kfolio is then returned and unlocked with a freed stable_node still
      set, and any MM activity can come in and access kfolio->mapping, causing
      a use-after-free (UAF).
      
      Fix it by moving folio_set_stable_node() to the end after stable_node
      is inserted successfully.
      
      Link: https://lkml.kernel.org/r/20240513-b4-ksm-stable-node-uaf-v1-1-f687de76f452@linux.dev
      Fixes: 2c653d0e ("ksm: introduce ksm_max_page_sharing per page deduplication limit")
      Signed-off-by: Chengming Zhou <chengming.zhou@linux.dev>
      Acked-by: David Hildenbrand <david@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Stefan Roesch <shr@devkernel.io>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      90e82349
    • Miaohe Lin's avatar
      mm/memory-failure: fix handling of dissolved but not taken off from buddy pages · 8cf360b9
      Miaohe Lin authored
      When I ran memory failure tests recently, the panic below occurred:
      
      page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x8cee00
      flags: 0x6fffe0000000000(node=1|zone=2|lastcpupid=0x7fff)
      raw: 06fffe0000000000 dead000000000100 dead000000000122 0000000000000000
      raw: 0000000000000000 0000000000000009 00000000ffffffff 0000000000000000
      page dumped because: VM_BUG_ON_PAGE(!PageBuddy(page))
      ------------[ cut here ]------------
      kernel BUG at include/linux/page-flags.h:1009!
      invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
      RIP: 0010:__del_page_from_free_list+0x151/0x180
      RSP: 0018:ffffa49c90437998 EFLAGS: 00000046
      RAX: 0000000000000035 RBX: 0000000000000009 RCX: ffff8dd8dfd1c9c8
      RDX: 0000000000000000 RSI: 0000000000000027 RDI: ffff8dd8dfd1c9c0
      RBP: ffffd901233b8000 R08: ffffffffab5511f8 R09: 0000000000008c69
      R10: 0000000000003c15 R11: ffffffffab5511f8 R12: ffff8dd8fffc0c80
      R13: 0000000000000001 R14: ffff8dd8fffc0c80 R15: 0000000000000009
      FS:  00007ff916304740(0000) GS:ffff8dd8dfd00000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 000055eae50124c8 CR3: 00000008479e0000 CR4: 00000000000006f0
      Call Trace:
       <TASK>
       __rmqueue_pcplist+0x23b/0x520
       get_page_from_freelist+0x26b/0xe40
       __alloc_pages_noprof+0x113/0x1120
       __folio_alloc_noprof+0x11/0xb0
       alloc_buddy_hugetlb_folio.isra.0+0x5a/0x130
       __alloc_fresh_hugetlb_folio+0xe7/0x140
       alloc_pool_huge_folio+0x68/0x100
       set_max_huge_pages+0x13d/0x340
       hugetlb_sysctl_handler_common+0xe8/0x110
       proc_sys_call_handler+0x194/0x280
       vfs_write+0x387/0x550
       ksys_write+0x64/0xe0
       do_syscall_64+0xc2/0x1d0
       entry_SYSCALL_64_after_hwframe+0x77/0x7f
      RIP: 0033:0x7ff916114887
      RSP: 002b:00007ffec8a2fd78 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
      RAX: ffffffffffffffda RBX: 000055eae500e350 RCX: 00007ff916114887
      RDX: 0000000000000004 RSI: 000055eae500e390 RDI: 0000000000000003
      RBP: 000055eae50104c0 R08: 0000000000000000 R09: 000055eae50104c0
      R10: 0000000000000077 R11: 0000000000000246 R12: 0000000000000004
      R13: 0000000000000004 R14: 00007ff916216b80 R15: 00007ff916216a00
       </TASK>
      Modules linked in: mce_inject hwpoison_inject
      ---[ end trace 0000000000000000 ]---
      
      Before the panic, there was a warning about a bad page state:
      
      BUG: Bad page state in process page-types  pfn:8cee00
      page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x8cee00
      flags: 0x6fffe0000000000(node=1|zone=2|lastcpupid=0x7fff)
      page_type: 0xffffff7f(buddy)
      raw: 06fffe0000000000 ffffd901241c0008 ffffd901240f8008 0000000000000000
      raw: 0000000000000000 0000000000000009 00000000ffffff7f 0000000000000000
      page dumped because: nonzero mapcount
      Modules linked in: mce_inject hwpoison_inject
      CPU: 8 PID: 154211 Comm: page-types Not tainted 6.9.0-rc4-00499-g5544ec3178e2-dirty #22
      Call Trace:
       <TASK>
       dump_stack_lvl+0x83/0xa0
       bad_page+0x63/0xf0
       free_unref_page+0x36e/0x5c0
       unpoison_memory+0x50b/0x630
       simple_attr_write_xsigned.constprop.0.isra.0+0xb3/0x110
       debugfs_attr_write+0x42/0x60
       full_proxy_write+0x5b/0x80
       vfs_write+0xcd/0x550
       ksys_write+0x64/0xe0
       do_syscall_64+0xc2/0x1d0
       entry_SYSCALL_64_after_hwframe+0x77/0x7f
      RIP: 0033:0x7f189a514887
      RSP: 002b:00007ffdcd899718 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
      RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f189a514887
      RDX: 0000000000000009 RSI: 00007ffdcd899730 RDI: 0000000000000003
      RBP: 00007ffdcd8997a0 R08: 0000000000000000 R09: 00007ffdcd8994b2
      R10: 0000000000000000 R11: 0000000000000246 R12: 00007ffdcda199a8
      R13: 0000000000404af1 R14: 000000000040ad78 R15: 00007f189a7a5040
       </TASK>
      
      The root cause appears to be the following race:
      
       memory_failure
        try_memory_failure_hugetlb
         me_huge_page
          __page_handle_poison
           dissolve_free_hugetlb_folio
           drain_all_pages -- Buddy page can be isolated e.g. for compaction.
           take_page_off_buddy -- Failed as page is not in the buddy list.
                               -- Page can be put back into buddy after compaction.
          page_ref_inc -- Leads to buddy page with refcnt = 1.
      
      Then unpoison_memory() can unpoison the page and send the buddy page back
      into buddy list again leading to the above bad page state warning.  And
      bad_page() will call page_mapcount_reset() to remove PageBuddy from buddy
      page leading to later VM_BUG_ON_PAGE(!PageBuddy(page)) when trying to
      allocate this page.
      
      Fix this issue by only treating __page_handle_poison() as successful when
      it returns 1.
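
      The difference between the two checks can be sketched in plain C.  This
      is only an illustration: the enum names, the helper names and the
      three-way return convention (negative, 0, 1) are assumptions made for
      the sketch, not the kernel's definitions.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical result codes, assumed for this sketch only. */
enum poison_result {
    POISON_DISSOLVE_FAILED = -1, /* the page could not be dissolved */
    POISON_NOT_OFF_BUDDY   = 0,  /* dissolved, but not taken off the buddy list */
    POISON_OFF_BUDDY       = 1,  /* dissolved and taken off the buddy list */
};

/* Loose check: any non-negative result counts as success. */
static bool handled_loose(int ret)
{
    return ret >= 0;
}

/* Strict check: only "taken off the buddy list" counts as success. */
static bool handled_strict(int ret)
{
    assert(ret <= POISON_OFF_BUDDY); /* larger codes are not part of this model */
    return ret == POISON_OFF_BUDDY;
}
```

      With the loose check, a page that is still reachable through the buddy
      allocator would be reported as handled; the strict check rejects it.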
      
      Link: https://lkml.kernel.org/r/20240523071217.1696196-1-linmiaohe@huawei.com
      Fixes: ceaf8fbe ("mm, hwpoison: skip raw hwpoison page in freeing 1GB hugepage")
      Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      8cf360b9
    • Yuanyuan Zhong's avatar
      mm: /proc/pid/smaps_rollup: avoid skipping vma after getting mmap_lock again · 6d065f50
      Yuanyuan Zhong authored
      After switching smaps_rollup to use VMA iterator, searching for next entry
      is part of the condition expression of the do-while loop.  So the current
      VMA needs to be addressed before the continue statement.
      
      Otherwise, with some VMAs skipped, userspace observed memory
      consumption from /proc/pid/smaps_rollup will be smaller than the sum of
      the corresponding fields from /proc/pid/smaps.
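
      The control-flow pitfall can be reproduced in a small userspace sketch.
      The iterator type, the function names and the relock_at parameter are
      hypothetical stand-ins for the VMA iterator and the lock-retake path;
      this is not the code in fs/proc/task_mmu.c.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a VMA iterator over an array of sizes. */
struct size_iter {
    const long *sizes;
    size_t count;
    size_t pos;
};

/* Return the next size, or -1 when the iterator is exhausted. */
static long size_iter_next(struct size_iter *it)
{
    return it->pos < it->count ? it->sizes[it->pos++] : -1;
}

/*
 * Sum all sizes.  At index relock_at the loop takes a path that ends in
 * continue, simulating the "lock was dropped and retaken" case.  If
 * account_before_continue is zero, that entry is lost, because continue
 * in a do-while jumps straight to the condition, which already fetches
 * the next entry.
 */
static long rollup(const long *sizes, size_t count, size_t relock_at,
                   int account_before_continue)
{
    struct size_iter it = { sizes, count, 0 };
    long total = 0;
    size_t index = 0;
    long cur;

    assert(sizes != NULL || count == 0);
    cur = size_iter_next(&it);
    if (cur < 0)
        return 0;

    do {
        if (index++ == relock_at) {
            if (account_before_continue)
                total += cur; /* handle the current entry first */
            continue;
        }
        total += cur;
    } while ((cur = size_iter_next(&it)) >= 0);

    return total;
}
```

      For sizes {10, 20, 30} and relock_at = 1, the variant without the
      extra accounting returns 40 and the corrected variant returns 60.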
      
      Link: https://lkml.kernel.org/r/20240523183531.2535436-1-yzhong@purestorage.com
      Fixes: c4c84f06 ("fs/proc/task_mmu: stop using linked list and highest_vm_end")
      Signed-off-by: Yuanyuan Zhong <yzhong@purestorage.com>
      Reviewed-by: Mohamed Khalfella <mkhalfella@purestorage.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      6d065f50
    • Ryusuke Konishi's avatar
      nilfs2: fix potential hang in nilfs_detach_log_writer() · eb85dace
      Ryusuke Konishi authored
      Syzbot has reported a potential hang in nilfs_detach_log_writer() called
      during nilfs2 unmount.
      
      Analysis revealed that this is because nilfs_segctor_sync(), which
      synchronizes with the log writer thread, can be called after
      nilfs_segctor_destroy() terminates that thread, as shown in the call trace
      below:
      
      nilfs_detach_log_writer
        nilfs_segctor_destroy
          nilfs_segctor_kill_thread  --> Shut down log writer thread
          flush_work
            nilfs_iput_work_func
              nilfs_dispose_list
                iput
                  nilfs_evict_inode
                    nilfs_transaction_commit
                      nilfs_construct_segment (if inode needs sync)
                        nilfs_segctor_sync  --> Attempt to synchronize with
                                                log writer thread
                                 *** DEADLOCK ***
      
      Fix this issue by changing nilfs_segctor_sync() so that the log writer
      thread returns normally without synchronizing after it terminates, and by
      forcing tasks that are already waiting to complete once after the thread
      terminates.
      
      The skipped inode metadata flushout will then be processed together in the
      subsequent cleanup work in nilfs_segctor_destroy().
      
      Link: https://lkml.kernel.org/r/20240520132621.4054-4-konishi.ryusuke@gmail.com
      Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
      Reported-by: syzbot+e3973c409251e136fdd0@syzkaller.appspotmail.com
      Closes: https://syzkaller.appspot.com/bug?extid=e3973c409251e136fdd0
      Tested-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
      Cc: <stable@vger.kernel.org>
      Cc: "Bai, Shuangpeng" <sjb7183@psu.edu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      eb85dace
    • Ryusuke Konishi's avatar
      nilfs2: fix unexpected freezing of nilfs_segctor_sync() · 936184ea
      Ryusuke Konishi authored
      A potential and reproducible race issue has been identified where
      nilfs_segctor_sync() would block even after the log writer thread writes a
      checkpoint, unless there is an interrupt or other trigger to resume log
      writing.
      
      This turned out to be because, depending on the execution timing of the
      log writer thread running in parallel, the log writer thread may skip
      responding to nilfs_segctor_sync(), which causes a call to schedule()
      waiting for completion within nilfs_segctor_sync() to lose the opportunity
      to wake up.
      
      The reason why waking up the task waiting in nilfs_segctor_sync() may be
      skipped is that updating the request generation issued using a shared
      sequence counter and adding a wait queue entry to the log writer's
      request wait queue are not done atomically.  There is a possibility that
      log writing and request completion notification by nilfs_segctor_wakeup()
      may occur between the two operations, and in that case, the wait queue
      entry is not yet visible to nilfs_segctor_wakeup() and the wake-up of
      nilfs_segctor_sync() will be carried over until the next request occurs.
      
      Fix this issue by performing these two operations together inside the
      sc_state_lock critical section.  Also, following the memory barrier
      guidelines for event waiting loops, move the call to set_current_state()
      in the same location into the event waiting loop to ensure that a memory
      barrier is inserted just before the event condition determination.
      
      Link: https://lkml.kernel.org/r/20240520132621.4054-3-konishi.ryusuke@gmail.com
      Fixes: 9ff05123 ("nilfs2: segment constructor")
      Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
      Tested-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
      Cc: <stable@vger.kernel.org>
      Cc: "Bai, Shuangpeng" <sjb7183@psu.edu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      936184ea
    • Ryusuke Konishi's avatar
      nilfs2: fix use-after-free of timer for log writer thread · f5d4e046
      Ryusuke Konishi authored
      Patch series "nilfs2: fix log writer related issues".
      
      This bug fix series covers three nilfs2 log writer-related issues,
      including a timer use-after-free issue and potential deadlock issue on
      unmount, and a potential freeze issue in event synchronization found
      during their analysis.  Details are described in each commit log.
      
      
      This patch (of 3):
      
      A use-after-free issue has been reported regarding the timer sc_timer on
      the nilfs_sc_info structure.
      
      The problem is that even though it is used to wake up a sleeping log
      writer thread, sc_timer is not shut down until the nilfs_sc_info structure
      is about to be freed, and is used regardless of the thread's lifetime.
      
      Fix this issue by limiting the use of sc_timer only while the log writer
      thread is alive.
      
      Link: https://lkml.kernel.org/r/20240520132621.4054-1-konishi.ryusuke@gmail.com
      Link: https://lkml.kernel.org/r/20240520132621.4054-2-konishi.ryusuke@gmail.com
      Fixes: fdce895e ("nilfs2: change sc_timer from a pointer to an embedded one in struct nilfs_sc_info")
      Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
      Reported-by: "Bai, Shuangpeng" <sjb7183@psu.edu>
      Closes: https://groups.google.com/g/syzkaller/c/MK_LYqtt8ko/m/8rgdWeseAwAJ
      Tested-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      f5d4e046
    • Michael Ellerman's avatar
      selftests/mm: fix build warnings on ppc64 · 1901472f
      Michael Ellerman authored
      Fix warnings like:
      
        In file included from uffd-unit-tests.c:8:
        uffd-unit-tests.c: In function `uffd_poison_handle_fault':
        uffd-common.h:45:33: warning: format `%llu' expects argument of type
        `long long unsigned int', but argument 3 has type `__u64' {aka `long
        unsigned int'} [-Wformat=]
      
      By switching to unsigned long long for u64 for ppc64 builds.
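
      For comparison, a call site can also avoid this class of warning with
      an explicit cast.  This is a different technique from the type switch
      described above, shown only as a sketch; format_u64 is a hypothetical
      helper and uint64_t stands in for __u64.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Format a 64-bit value with %llu.  The cast makes the argument type
 * match the format specifier whether the platform defines its 64-bit
 * type as unsigned long or as unsigned long long.
 */
static int format_u64(char *buf, size_t len, uint64_t value)
{
    assert(buf != NULL);
    return snprintf(buf, len, "%llu", (unsigned long long)value);
}
```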
      
      Link: https://lkml.kernel.org/r/20240521030219.57439-1-mpe@ellerman.id.au
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Shuah Khan <skhan@linuxfoundation.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      1901472f
    • Will Deacon's avatar
      arm64: patching: fix handling of execmem addresses · b1480ed2
      Will Deacon authored
      Klara Modin reported warnings for a kernel configured with BPF_JIT but
      without MODULES:
      
      [   44.131296] Trying to vfree() bad address (000000004a17c299)
      [   44.138024] WARNING: CPU: 1 PID: 193 at mm/vmalloc.c:3189 remove_vm_area (mm/vmalloc.c:3189 (discriminator 1))
      [   44.146675] CPU: 1 PID: 193 Comm: kworker/1:2 Tainted: G      D W          6.9.0-01786-g2c9e5d4a #25
      [   44.158229] Hardware name: Raspberry Pi 3 Model B (DT)
      [   44.164433] Workqueue: events bpf_prog_free_deferred
      [   44.170492] pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
      [   44.178601] pc : remove_vm_area (mm/vmalloc.c:3189 (discriminator 1))
      [   44.183705] lr : remove_vm_area (mm/vmalloc.c:3189 (discriminator 1))
      [   44.188772] sp : ffff800082a13c70
      [   44.193112] x29: ffff800082a13c70 x28: 0000000000000000 x27: 0000000000000000
      [   44.201384] x26: 0000000000000000 x25: ffff00003a44efa0 x24: 00000000d4202000
      [   44.209658] x23: ffff800081223dd0 x22: ffff00003a198a40 x21: ffff8000814dd880
      [   44.217924] x20: 00000000d4202000 x19: ffff8000814dd880 x18: 0000000000000006
      [   44.226206] x17: 0000000000000000 x16: 0000000000000020 x15: 0000000000000002
      [   44.234460] x14: ffff8000811a6370 x13: 0000000020000000 x12: 0000000000000000
      [   44.242710] x11: ffff8000811a6370 x10: 0000000000000144 x9 : ffff8000811fe370
      [   44.250959] x8 : 0000000000017fe8 x7 : 00000000fffff000 x6 : ffff8000811fe370
      [   44.259206] x5 : 0000000000000000 x4 : 0000000000000000 x3 : 0000000000000000
      [   44.267457] x2 : 0000000000000000 x1 : 0000000000000000 x0 : ffff000002203240
      [   44.275703] Call trace:
      [   44.279158] remove_vm_area (mm/vmalloc.c:3189 (discriminator 1))
      [   44.283858] vfree (mm/vmalloc.c:3322)
      [   44.287835] execmem_free (mm/execmem.c:70)
      [   44.292347] bpf_jit_free_exec+0x10/0x1c
      [   44.297283] bpf_prog_pack_free (kernel/bpf/core.c:1006)
      [   44.302457] bpf_jit_binary_pack_free (kernel/bpf/core.c:1195)
      [   44.307951] bpf_jit_free (include/linux/filter.h:1083 arch/arm64/net/bpf_jit_comp.c:2474)
      [   44.312342] bpf_prog_free_deferred (kernel/bpf/core.c:2785)
      [   44.317785] process_one_work (kernel/workqueue.c:3273)
      [   44.322684] worker_thread (kernel/workqueue.c:3342 (discriminator 2) kernel/workqueue.c:3429 (discriminator 2))
      [   44.327292] kthread (kernel/kthread.c:388)
      [   44.331342] ret_from_fork (arch/arm64/kernel/entry.S:861)
      
      The problem is that bpf_arch_text_copy() silently fails to write to the
      read-only area because patch_map() faults and the resulting -EFAULT is
      discarded.
      
      Update patch_map() to use CONFIG_EXECMEM instead of
      CONFIG_STRICT_MODULE_RWX to check for vmalloc addresses.
      
      Link: https://lkml.kernel.org/r/20240521213813.703309-1-rppt@kernel.org
      Fixes: 2c9e5d4a ("bpf: remove CONFIG_BPF_JIT dependency on CONFIG_MODULES of")
      Signed-off-by: Will Deacon <will@kernel.org>
      Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>
      Reported-by: Klara Modin <klarasmodin@gmail.com>
      Closes: https://lore.kernel.org/all/7983fbbf-0127-457c-9394-8d6e4299c685@gmail.com
      Tested-by: Klara Modin <klarasmodin@gmail.com>
      Cc: Björn Töpel <bjorn@kernel.org>
      Cc: Luis Chamberlain <mcgrof@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      b1480ed2
    • Dev Jain's avatar
      selftests/mm: compaction_test: fix bogus test success and reduce probability... · fb9293b6
      Dev Jain authored
      selftests/mm: compaction_test: fix bogus test success and reduce probability of OOM-killer invocation
      
      Reset nr_hugepages to zero before the start of the test.
      
      If a non-zero number of hugepages is already set before the start of the
      test, the following problems arise:
      
       - The probability of the test getting OOM-killed increases.  Proof:
         The test wants to run on 80% of available memory to prevent OOM-killing
         (see original code comments).  Let the value of mem_free at the start
         of the test, when nr_hugepages = 0, be x.  In the other case, when
         nr_hugepages > 0, let the memory consumed by hugepages be y.  In the
         former case, the test operates on 0.8 * x of memory.  In the latter,
         the test operates on 0.8 * (x - y) of memory, with y already filled,
         hence, memory consumed is y + 0.8 * (x - y) = 0.8 * x + 0.2 * y > 0.8 *
         x.  Q.E.D
      
       - The probability of a bogus test success increases.  Proof: Let the
         memory consumed by hugepages be greater than 25% of x, with x and y
         defined as above.  The definition of compaction_index is c_index = (x -
         y)/z where z is the memory consumed by hugepages after trying to
         increase them again.  In check_compaction(), we set the number of
         hugepages to zero, and then increase them back; the probability that
         they will be set back to consume at least y amount of memory again is
         very high (since there is not much delay between the two attempts of
         changing nr_hugepages).  Hence, z >= y > (x/4) (by the 25% assumption).
         Therefore, c_index = (x - y)/z <= (x - y)/y = x/y - 1 < 4 - 1 = 3
         hence, c_index can always be forced to be less than 3, thereby the test
         succeeding always.  Q.E.D
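
      Both proofs can be checked with concrete numbers.  The sketch below
      uses integer arithmetic in arbitrary memory units; the helper names are
      hypothetical, and the sample values x = 1000, y = 300 are chosen only
      for illustration so that y exceeds 25% of x.

```c
#include <assert.h>

/*
 * Memory in use while the test runs: y is already held by hugepages and
 * the test touches 80% of what is left, i.e. y + 0.8 * (x - y).
 */
static long footprint(long x, long y)
{
    assert(y >= 0 && y <= x);
    return y + 8 * (x - y) / 10;
}

/*
 * compaction_index scaled by 100 to stay in integers:
 * c_index = (x - y) / z, where z is the memory held by hugepages after
 * the test raises nr_hugepages again.
 */
static long c_index_x100(long x, long y, long z)
{
    assert(z > 0);
    return 100 * (x - y) / z;
}
```

      With x = 1000 and y = 300, footprint() gives 860 instead of the
      intended 800, and with z = y the scaled index is 233, which is below
      the limit of 3 (300 when scaled) regardless of how well compaction
      actually worked.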
      
      Link: https://lkml.kernel.org/r/20240521074358.675031-4-dev.jain@arm.com
      Fixes: bd67d5c1 ("Test compaction of mlocked memory")
      Signed-off-by: Dev Jain <dev.jain@arm.com>
      Cc: <stable@vger.kernel.org>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Sri Jayaramappa <sjayaram@akamai.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      fb9293b6
    • Dev Jain's avatar
      selftests/mm: compaction_test: fix incorrect write of zero to nr_hugepages · 9ad665ef
      Dev Jain authored
      Currently, the test tries to set nr_hugepages to zero, but that is not
      actually done because the file offset is not reset after read().  Fix that
      using lseek().
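
      The read, rewind, write sequence looks roughly like the sketch below.
      It works on any file descriptor; read_then_write_zero is a hypothetical
      helper and not the function used in compaction_test.c.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Read the current contents into old (NUL-terminated), then write "0"
 * at the start of the file.  Without the lseek() the write would land
 * at the offset where read() stopped instead of at offset 0.
 */
static int read_then_write_zero(int fd, char *old, size_t old_size)
{
    ssize_t n;

    assert(old != NULL && old_size > 0);
    n = read(fd, old, old_size - 1);
    if (n < 0)
        return -1;
    old[n] = '\0';

    if (lseek(fd, 0, SEEK_SET) == (off_t)-1)
        return -1;
    if (write(fd, "0", 1) != 1)
        return -1;
    return 0;
}
```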
      
      Link: https://lkml.kernel.org/r/20240521074358.675031-3-dev.jain@arm.com
      Fixes: bd67d5c1 ("Test compaction of mlocked memory")
      Signed-off-by: Dev Jain <dev.jain@arm.com>
      Cc: <stable@vger.kernel.org>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Sri Jayaramappa <sjayaram@akamai.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      9ad665ef