1. 10 Oct, 2024 7 commits
    • sched_ext: Don't hold scx_tasks_lock for too long · b07996c7
      Tejun Heo authored
      While enabling and disabling a BPF scheduler, every task is iterated a
      couple times by walking scx_tasks. Except for one, all iterations keep
      holding scx_tasks_lock. On multi-socket systems under heavy rq lock
      contention and a high number of threads, this can lead to RCU and other
      stalls.
      
      The following is triggered on a 2 x AMD EPYC 7642 system (192 logical CPUs)
      running `stress-ng --workload 150 --workload-threads 10` with >400k idle
      threads and RCU stall period reduced to 5s:
      
        rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
        rcu:     91-...!: (10 ticks this GP) idle=0754/1/0x4000000000000000 softirq=18204/18206 fqs=17
        rcu:     186-...!: (17 ticks this GP) idle=ec54/1/0x4000000000000000 softirq=25863/25866 fqs=17
        rcu:     (detected by 80, t=10042 jiffies, g=89305, q=33 ncpus=192)
        Sending NMI from CPU 80 to CPUs 91:
        NMI backtrace for cpu 91
        CPU: 91 UID: 0 PID: 284038 Comm: sched_ext_ops_h Kdump: loaded Not tainted 6.12.0-rc2-work-g6bf5681f7ee2-dirty #471
        Hardware name: Supermicro Super Server/H11DSi, BIOS 2.8 12/14/2023
        Sched_ext: simple (disabling+all)
        RIP: 0010:queued_spin_lock_slowpath+0x17b/0x2f0
        Code: 02 c0 10 03 00 83 79 08 00 75 08 f3 90 83 79 08 00 74 f8 48 8b 11 48 85 d2 74 09 0f 0d 0a eb 0a 31 d2 eb 06 31 d2 eb 02 f3 90 <8b> 07 66 85 c0 75 f7 39 d8 75 0d be 01 00 00 00 89 d8 f0 0f b1 37
        RSP: 0018:ffffc9000fadfcb8 EFLAGS: 00000002
        RAX: 0000000001700001 RBX: 0000000001700000 RCX: ffff88bfcaaf10c0
        RDX: 0000000000000000 RSI: 0000000000000101 RDI: ffff88bfca8f0080
        RBP: 0000000001700000 R08: 0000000000000090 R09: ffffffffffffffff
        R10: ffff88a74761b268 R11: 0000000000000000 R12: ffff88a6b6765460
        R13: ffffc9000fadfd60 R14: ffff88bfca8f0080 R15: ffff88bfcaac0000
        FS:  0000000000000000(0000) GS:ffff88bfcaac0000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 00007f5c55f526a0 CR3: 0000000afd474000 CR4: 0000000000350eb0
        Call Trace:
         <NMI>
         </NMI>
         <TASK>
         do_raw_spin_lock+0x9c/0xb0
         task_rq_lock+0x50/0x190
         scx_task_iter_next_locked+0x157/0x170
         scx_ops_disable_workfn+0x2c2/0xbf0
         kthread_worker_fn+0x108/0x2a0
         kthread+0xeb/0x110
         ret_from_fork+0x36/0x40
         ret_from_fork_asm+0x1a/0x30
         </TASK>
        Sending NMI from CPU 80 to CPUs 186:
        NMI backtrace for cpu 186
        CPU: 186 UID: 0 PID: 51248 Comm: fish Kdump: loaded Not tainted 6.12.0-rc2-work-g6bf5681f7ee2-dirty #471
      
      scx_task_iter can safely drop locks while iterating. Make
      scx_task_iter_next() drop scx_tasks_lock every 32 iterations to avoid
      stalls.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: David Vernet <void@manifault.com>
      b07996c7
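
      The batching pattern described in this commit (periodically dropping and
      re-taking scx_tasks_lock during a long walk) looks roughly like the sketch
      below. This is an illustrative sketch rather than the actual
      kernel/sched/ext.c change; the batch constant and the iter->cnt field are
      assumptions.

        /* Illustrative sketch only, not the real ext.c code. */
        #define SCX_TASK_ITER_BATCH     32      /* assumed batch size */

        static struct task_struct *scx_task_iter_next(struct scx_task_iter *iter)
        {
                /*
                 * Every SCX_TASK_ITER_BATCH tasks, briefly let go of the lock
                 * so that rq-lock waiters and RCU can make progress.
                 */
                if (!(++iter->cnt % SCX_TASK_ITER_BATCH)) {
                        spin_unlock_irq(&scx_tasks_lock);
                        cond_resched();
                        spin_lock_irq(&scx_tasks_lock);
                }

                /* ... resume walking the scx_tasks list from the saved cursor ... */
                return NULL;    /* list-walk details elided in this sketch */
        }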
    • sched_ext: Move scx_tasks_lock handling into scx_task_iter helpers · 967da578
      Tejun Heo authored
      Iterating with scx_task_iter involves scx_tasks_lock and optionally the rq
      lock of the task being iterated. Both locks can be released during iteration
      and the iteration can be continued after re-grabbing scx_tasks_lock.
      Currently, all lock handling is pushed to the caller which is a bit
      cumbersome and makes it difficult to add lock-aware behaviors. Make the
      scx_task_iter helpers handle scx_tasks_lock.
      
      - scx_task_iter_init/scx_task_iter_exit() now grab and release
        scx_tasks_lock, respectively. Renamed to
        scx_task_iter_start/scx_task_iter_stop() to more clearly indicate that
        there are non-trivial side-effects.
      
      - Add __ prefix to scx_task_iter_rq_unlock() to indicate that the function
        is internal.
      
      - Add scx_task_iter_unlock/relock(). The former drops both rq lock (if held)
        and scx_tasks_lock and the latter re-locks only scx_tasks_lock.
      
      This doesn't cause behavior changes and will be used to implement stall
      avoidance.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: David Vernet <void@manifault.com>
      967da578
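
      Taken together with the previous entry, the resulting helper surface can be
      sketched as below. The function names come from the commit message itself;
      the bodies are simplified and are not the verbatim kernel code.

        /* Simplified sketch of the helper split; not verbatim ext.c. */
        static void scx_task_iter_start(struct scx_task_iter *iter)
        {
                spin_lock_irq(&scx_tasks_lock);   /* previously the caller's job */
                /* ... initialize the iterator cursor on the scx_tasks list ... */
        }

        static void scx_task_iter_unlock(struct scx_task_iter *iter)
        {
                __scx_task_iter_rq_unlock(iter);  /* drop the task's rq lock, if held */
                spin_unlock_irq(&scx_tasks_lock);
        }

        static void scx_task_iter_relock(struct scx_task_iter *iter)
        {
                spin_lock_irq(&scx_tasks_lock);   /* only scx_tasks_lock is re-taken */
        }

        static void scx_task_iter_stop(struct scx_task_iter *iter)
        {
                /* ... unlink the cursor ... */
                scx_task_iter_unlock(iter);
        }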
    • sched_ext: bypass mode shouldn't depend on ops.select_cpu() · aebe7ae4
      Tejun Heo authored
      Bypass mode was depending on ops.select_cpu() which, like the rest of the
      BPF scheduler, can't be trusted. Always enable and use scx_select_cpu_dfl()
      in bypass mode.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: David Vernet <void@manifault.com>
      aebe7ae4
    • sched_ext: Move scx_buildin_idle_enabled check to scx_bpf_select_cpu_dfl() · cc3e1cac
      Tejun Heo authored
      Move the sanity check from the inner function scx_select_cpu_dfl() to the
      exported kfunc scx_bpf_select_cpu_dfl(). This doesn't cause behavior
      differences and will allow using scx_select_cpu_dfl() in bypass mode
      regardless of scx_builtin_idle_enabled.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      cc3e1cac
    • sched_ext: Start schedulers with consistent p->scx.slice values · 3fdb9ebc
      Tejun Heo authored
      The disable path caps p->scx.slice to SCX_SLICE_DFL. As the field is already
      being ignored at this stage during disable, the only effect this has is that
      when the next BPF scheduler is loaded, it won't see unreasonable left-over
      slices. Ultimately, this shouldn't matter but it's better to start in a
      known state. Drop p->scx.slice capping from the disable path and instead
      reset it to SCX_SLICE_DFL in the enable path.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: David Vernet <void@manifault.com>
      3fdb9ebc
    • Revert "sched_ext: Use shorter slice while bypassing" · 54baa7ac
      Tejun Heo authored
      This reverts commit 6f34d8d3.
      
      Slice length is ignored while bypassing and tasks are switched on every tick
      and thus the patch does not make any difference. The perceived difference
      was from test noise.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: David Vernet <void@manifault.com>
      54baa7ac
    • sched_ext: use correct function name in pick_task_scx() warning message · c425180d
      Honglei Wang authored
      pick_next_task_scx() was turned into pick_task_scx() since
      commit 753e2836 ("sched_ext: Unify regular and core-sched pick
      task paths"). Update the outdated message.
      Signed-off-by: Honglei Wang <jameshongleiwang@126.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      c425180d
  2. 09 Oct, 2024 1 commit
  3. 08 Oct, 2024 1 commit
  4. 07 Oct, 2024 3 commits
    • sched_ext, scx_qmap: Add and use SCX_ENQ_CPU_SELECTED · 9b671793
      Tejun Heo authored
      scx_qmap and other schedulers in the SCX repo are using SCX_ENQ_WAKEUP to
      tell whether ops.select_cpu() was called. This is incorrect as
      ops.select_cpu() can be skipped in the wakeup path and leads to e.g.
      incorrectly skipping direct dispatch for tasks that are bound to a single
      CPU.
      
      sched core has been updated to specify ENQUEUE_RQ_SELECTED if
      ->select_task_rq() was called. Map it to SCX_ENQ_CPU_SELECTED and update
      scx_qmap to test it instead of SCX_ENQ_WAKEUP.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: David Vernet <void@manifault.com>
      Cc: Daniel Hodges <hodges.daniel.scott@gmail.com>
      Cc: Changwoo Min <multics69@gmail.com>
      Cc: Andrea Righi <andrea.righi@linux.dev>
      Cc: Dan Schatzberg <schatzberg.dan@gmail.com>
      9b671793
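
      In scheduler BPF code, the change boils down to testing the new flag instead
      of SCX_ENQ_WAKEUP in ops.enqueue(). The following is a hedged sketch in the
      style of the SCX example schedulers (assuming the tools/sched_ext common
      headers), not the actual scx_qmap diff; the dispatch targets are chosen only
      for illustration.

        void BPF_STRUCT_OPS(sketch_enqueue, struct task_struct *p, u64 enq_flags)
        {
                /*
                 * SCX_ENQ_CPU_SELECTED is set only when ops.select_cpu() actually
                 * ran.  SCX_ENQ_WAKEUP is set on every wakeup, including for tasks
                 * bound to a single CPU whose ops.select_cpu() was skipped, which
                 * is what led to the incorrect behavior described above.
                 */
                if (!(enq_flags & SCX_ENQ_CPU_SELECTED)) {
                        /* select_cpu was skipped; the task stays put, dispatch directly */
                        scx_bpf_dispatch(p, SCX_DSQ_LOCAL, SCX_SLICE_DFL, enq_flags);
                        return;
                }

                /* normal path: queue on a DSQ for later consumption */
                scx_bpf_dispatch(p, SCX_DSQ_GLOBAL, SCX_SLICE_DFL, enq_flags);
        }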
    • sched/core: Add ENQUEUE_RQ_SELECTED to indicate whether ->select_task_rq() was called · f207dc2d
      Tejun Heo authored
      During ttwu, ->select_task_rq() can be skipped if only one CPU is allowed or
      migration is disabled. sched_ext schedulers may perform operations such as
      direct dispatch from ->select_task_rq() path and it is useful for them to
      know whether ->select_task_rq() was skipped in the ->enqueue_task() path.
      
      Currently, sched_ext schedulers are using ENQUEUE_WAKEUP for this purpose
      and end up assuming incorrectly that ->select_task_rq() was called for tasks
      that are bound to a single CPU or migration disabled.
      
      Make select_task_rq() indicate whether ->select_task_rq() was called by
      setting WF_RQ_SELECTED in *wake_flags and make ttwu_do_activate() map that
      to ENQUEUE_RQ_SELECTED for ->enqueue_task().
      
      This will be used by sched_ext to fix ->select_task_rq() skip detection.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: David Vernet <void@manifault.com>
      f207dc2d
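
      A minimal sketch of the plumbing described above; it is simplified and not
      the exact kernel/sched/core.c change, and ttwu_map_enqueue_flags() is a
      hypothetical helper used only to keep the fragment self-contained.

        static inline int select_task_rq(struct task_struct *p, int cpu, int *wake_flags)
        {
                if (p->nr_cpus_allowed > 1 && !is_migration_disabled(p)) {
                        cpu = p->sched_class->select_task_rq(p, cpu, *wake_flags);
                        *wake_flags |= WF_RQ_SELECTED;  /* ->select_task_rq() ran */
                } else {
                        cpu = cpumask_any(p->cpus_ptr);
                }
                return cpu;
        }

        static inline void ttwu_map_enqueue_flags(int wake_flags, int *en_flags)
        {
                /* in ttwu_do_activate(): hand the decision to ->enqueue_task() */
                if (wake_flags & WF_RQ_SELECTED)
                        *en_flags |= ENQUEUE_RQ_SELECTED;
        }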
    • sched/core: Make select_task_rq() take the pointer to wake_flags instead of value · b62933ee
      Tejun Heo authored
      This will be used to allow select_task_rq() to indicate whether
      ->select_task_rq() was called by modifying *wake_flags.
      
      This makes try_to_wake_up() call all functions that take wake_flags with
      WF_TTWU set. Previously, only select_task_rq() was. Using the same flags is
      more consistent, and, as the flag is only tested by ->select_task_rq()
      implementations, it doesn't cause any behavior differences.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: David Vernet <void@manifault.com>
      b62933ee
  5. 04 Oct, 2024 2 commits
    • sched_ext: scx_cgroup_exit() may be called without successful scx_cgroup_init() · ec010333
      Tejun Heo authored
      568894ed ("sched_ext: Add scx_cgroup_enabled to gate cgroup operations
      and fix scx_tg_online()") assumed that scx_cgroup_exit() is only called
      after scx_cgroup_init() finished successfully. This isn't true.
      scx_cgroup_exit() can be called without scx_cgroup_init() being called at
      all or after scx_cgroup_init() failed in the middle.
      
      As init state is tracked per cgroup, scx_cgroup_exit() can be used safely to
      clean up in all cases. Remove the incorrect WARN_ON_ONCE().
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Fixes: 568894ed ("sched_ext: Add scx_cgroup_enabled to gate cgroup operations and fix scx_tg_online()")
      ec010333
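
      The WARN can simply be dropped because the per-cgroup SCX_TG_INITED state
      already makes the exit walk a no-op for any cgroup that scx_cgroup_init()
      never reached. A simplified sketch of that guard follows; it is not the
      verbatim kernel code, scx_cgroup_exit_one() is a hypothetical helper name,
      and the direct scx_ops call stands in for the kernel's op-invocation
      machinery.

        static void scx_cgroup_exit_one(struct task_group *tg)
        {
                /* only tear down groups that scx_cgroup_init() actually initialized */
                if (!(tg->scx_flags & SCX_TG_INITED))
                        return;
                tg->scx_flags &= ~SCX_TG_INITED;

                if (SCX_HAS_OP(cgroup_exit))
                        scx_ops.cgroup_exit(tg->css.cgroup);
        }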
    • sched_ext: Improve error reporting during loading · cc9877fb
      Tejun Heo authored
      When the BPF scheduler fails, ops.exit() allows rich error reporting through
      scx_exit_info. Use the ops.exit() path consistently for all failures which can
      be caused by the BPF scheduler:
      
      - scx_ops_error() is called after ops.init() and ops.cgroup_init() failure
        to record error information.
      
      - ops.init_task() failure now uses scx_ops_error() instead of pr_err().
      
      - The err_disable path is updated to automatically trigger scx_ops_error() to
        cover cases where an error message hasn't already been generated, and it now
        always returns 0 to indicate init success so that the error is reported
        through ops.exit().
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: David Vernet <void@manifault.com>
      Cc: Daniel Hodges <hodges.daniel.scott@gmail.com>
      Cc: Changwoo Min <multics69@gmail.com>
      Cc: Andrea Righi <andrea.righi@linux.dev>
      Cc: Dan Schatzberg <schatzberg.dan@gmail.com>
      cc9877fb
  6. 02 Oct, 2024 1 commit
  7. 27 Sep, 2024 9 commits
    • sched_ext: Remove redundant p->nr_cpus_allowed checker · 95b87369
      Zhang Qiao authored
      select_task_rq() already checks that 'p->nr_cpus_allowed > 1', so the
      'p->nr_cpus_allowed == 1' check in scx_select_cpu_dfl() is redundant.
      Signed-off-by: Zhang Qiao <zhangqiao22@huawei.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      95b87369
    • sched_ext: Decouple locks in scx_ops_enable() · efe231d9
      Tejun Heo authored
      The enable path uses three big locks - scx_fork_rwsem, scx_cgroup_rwsem and
      cpus_read_lock. Currently, the locks are grabbed together which is prone to
      locking order problems.
      
      For example, currently, there is a possible deadlock involving
      scx_fork_rwsem and cpus_read_lock. cpus_read_lock has to nest inside
      scx_fork_rwsem due to locking order existing in other subsystems. However,
      there exists a dependency in the other direction during hotplug if hotplug
      needs to fork a new task, which happens in some cases. This leads to the
      following deadlock:
      
         scx_ops_enable()                    hotplug

                                             percpu_down_write(&cpu_hotplug_lock)
         percpu_down_write(&scx_fork_rwsem)
         block on cpu_hotplug_lock
                                             kthread_create() waits for kthreadd
                                             kthreadd blocks on scx_fork_rwsem
      
      Note that this doesn't trigger lockdep because the hotplug side dependency
      bounces through kthreadd.
      
      With the preceding scx_cgroup_enabled change, this can be solved by
      decoupling cpus_read_lock, which is needed for static_key manipulations,
      from the other two locks.
      
      - Move the first block of static_key manipulations outside of scx_fork_rwsem
        and scx_cgroup_rwsem. This is now safe with the preceding
        scx_cgroup_enabled change.
      
      - Drop scx_cgroup_rwsem and scx_fork_rwsem between the two task iteration
        blocks so that __scx_ops_enabled static_key enabling is outside the two
        rwsems.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-and-tested-by: Aboorva Devarajan <aboorvad@linux.ibm.com>
      Link: http://lkml.kernel.org/r/8cd0ec0c4c7c1bc0119e61fbef0bee9d5e24022d.camel@linux.ibm.com
      efe231d9
    • sched_ext: Decouple locks in scx_ops_disable_workfn() · 16021656
      Tejun Heo authored
      The disable path uses three big locks - scx_fork_rwsem, scx_cgroup_rwsem and
      cpus_read_lock. Currently, the locks are grabbed together which is prone to
      locking order problems. With the preceding scx_cgroup_enabled change, we can
      decouple them:
      
      - As cgroup disabling no longer requires modifying a static_key which
        requires cpus_read_lock(), no need to grab cpus_read_lock() before
        grabbing scx_cgroup_rwsem.
      
      - cgroup can now be independently disabled before tasks are moved back to
        the fair class.
      
      Relocate scx_cgroup_exit() invocation before scx_fork_rwsem is grabbed, drop
      now unnecessary cpus_read_lock() and move static_key operations out of
      scx_fork_rwsem. This decouples all three locks in the disable path.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-and-tested-by: Aboorva Devarajan <aboorvad@linux.ibm.com>
      Link: http://lkml.kernel.org/r/8cd0ec0c4c7c1bc0119e61fbef0bee9d5e24022d.camel@linux.ibm.com
      16021656
    • sched_ext: Add scx_cgroup_enabled to gate cgroup operations and fix scx_tg_online() · 568894ed
      Tejun Heo authored
      If the BPF scheduler does not implement ops.cgroup_init(), scx_tg_online()
      didn't set SCX_TG_INITED, which meant that ops.cgroup_exit(), even if
      implemented, wouldn't be called from scx_tg_offline(). This is because
      SCX_HAS_OP(cgroup_init) was used to test both whether SCX cgroup operations
      are enabled and whether ops.cgroup_init() exists.
      
      Fix it by introducing a separate bool scx_cgroup_enabled to gate cgroup
      operations and use SCX_HAS_OP(cgroup_init) only to test whether
      ops.cgroup_init() exists. Make all cgroup operations consistently use
      scx_cgroup_enabled to test whether cgroup operations are enabled.
      scx_cgroup_enabled is added instead of using scx_enabled() to ease planned
      locking updates.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      568894ed
    • sched_ext: Enable scx_ops_init_task() separately · 4269c603
      Tejun Heo authored
      scx_ops_init_task() and the follow-up scx_ops_enable_task() in the fork path
      were gated by the scx_enabled() test and thus __scx_ops_enabled had to be turned
      on before the first scx_ops_init_task() loop in scx_ops_enable(). However,
      if an external entity causes sched_class switch before the loop is complete,
      tasks which are not initialized could be switched to SCX.
      
      The following can be reproduced by running a program which keeps toggling a
      process between SCHED_OTHER and SCHED_EXT using sched_setscheduler(2).
      
        sched_ext: Invalid task state transition 0 -> 3 for fish[1623]
        WARNING: CPU: 1 PID: 1650 at kernel/sched/ext.c:3392 scx_ops_enable_task+0x1a1/0x200
        ...
        Sched_ext: simple (enabling)
        RIP: 0010:scx_ops_enable_task+0x1a1/0x200
        ...
         switching_to_scx+0x13/0xa0
         __sched_setscheduler+0x850/0xa50
         do_sched_setscheduler+0x104/0x1c0
         __x64_sys_sched_setscheduler+0x18/0x30
         do_syscall_64+0x7b/0x140
         entry_SYSCALL_64_after_hwframe+0x76/0x7e
      
      Fix it by gating scx_ops_init_task() separately using
      scx_ops_init_task_enabled. __scx_ops_enabled is now set after all tasks are
      finished with scx_ops_init_task().
      Signed-off-by: Tejun Heo <tj@kernel.org>
      4269c603
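
      A minimal sketch of the separate gate as seen from the fork path; this is
      simplified and not the verbatim change.

        static bool scx_ops_init_task_enabled;  /* set before the init loop runs */

        int scx_fork(struct task_struct *p)
        {
                percpu_rwsem_assert_held(&scx_fork_rwsem);

                if (scx_ops_init_task_enabled)  /* was: if (scx_enabled()) */
                        return scx_ops_init_task(p, task_group(p), true);

                return 0;
        }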
    • sched_ext: Fix SCX_TASK_INIT -> SCX_TASK_READY transitions in scx_ops_enable() · 9753358a
      Tejun Heo authored
      scx_ops_enable() has two task iteration loops. The first one calls
      scx_ops_init_task() on every task and the latter switches the eligible ones
      into SCX. The first loop left the tasks in SCX_TASK_INIT state and then the
      second loop switched them into READY before switching the tasks into SCX.
      
      The distinction between INIT and READY is only meaningful in the fork path
      where it's used to tell whether the task finished forking so that we can
      tell ops.exit_task() accordingly. Leaving tasks in the INIT state between the
      two loops is inconsistent with the fork path and incorrect. The following can
      be triggered by running a program which keeps toggling a task between
      SCHED_OTHER and SCHED_EXT while a BPF scheduler is being enabled:
      
        sched_ext: Invalid task state transition 1 -> 3 for fish[1526]
        WARNING: CPU: 2 PID: 1615 at kernel/sched/ext.c:3393 scx_ops_enable_task+0x1a1/0x200
        ...
        Sched_ext: qmap (enabling+all)
        RIP: 0010:scx_ops_enable_task+0x1a1/0x200
        ...
         switching_to_scx+0x13/0xa0
         __sched_setscheduler+0x850/0xa50
         do_sched_setscheduler+0x104/0x1c0
         __x64_sys_sched_setscheduler+0x18/0x30
         do_syscall_64+0x7b/0x140
         entry_SYSCALL_64_after_hwframe+0x76/0x7e
      
      Fix it by transitioning to READY in the first loop right after
      scx_ops_init_task() succeeds.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: David Vernet <void@manifault.com>
      9753358a
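
      A sketch of the fixed first loop, with error handling elided and helper
      names following the pre-rename scx_task_iter API referenced elsewhere in
      this log; not the verbatim change.

        struct scx_task_iter sti;
        struct task_struct *p;

        scx_task_iter_init(&sti);
        while ((p = scx_task_iter_next_locked(&sti))) {
                if (scx_ops_init_task(p, task_group(p), false))
                        continue;       /* error path elided in this sketch */

                /*
                 * Move to READY right away so the state matches what the fork
                 * path produces; a sched_setscheduler(SCHED_EXT) racing between
                 * the two loops then sees a valid READY -> ENABLED transition
                 * instead of the invalid 1 -> 3 transition warned about above.
                 */
                scx_set_task_state(p, SCX_TASK_READY);
        }
        scx_task_iter_exit(&sti);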
    • sched_ext: Initialize in bypass mode · 8c2090c5
      Tejun Heo authored
      scx_ops_enable() used preempt_disable() around the task iteration loop to
      switch tasks into SCX to guarantee forward progress of the task which is
      running scx_ops_enable(). However, in the gap between setting
      __scx_ops_enabled and preempt_disable(), an external entity can put tasks
      including the enabling one into SCX prematurely, which can lead to
      malfunctions including stalls.
      
      The bypass mode can wrap the entire enabling operation and guarantee forward
      progress no matter what the BPF scheduler does. Use the bypass mode instead
      to guarantee forward progress while enabling.
      
      While at it, release and regrab scx_tasks_lock between the two task
      iteration loops in scx_ops_enable() for clarity as there is no reason to
      keep holding the lock between them.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      8c2090c5
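
      The shape of the resulting enable path, reduced to the part relevant here;
      a sketch, not the verbatim code.

        scx_ops_bypass(true);           /* every task runs via the trusted FIFO path */

        /*
         * ... set __scx_ops_enabled and switch all tasks into SCX; even if the
         * BPF scheduler misbehaves, the enabling task keeps making forward
         * progress, so the old preempt_disable() bracketing is not needed ...
         */

        scx_ops_bypass(false);          /* hand scheduling over to the BPF scheduler */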
    • sched_ext: Remove SCX_OPS_PREPPING · fc1fcebe
      Tejun Heo authored
      The distinction between SCX_OPS_PREPPING and SCX_OPS_ENABLING is not used
      anywhere and only adds confusion. Drop SCX_OPS_PREPPING.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      fc1fcebe
    • sched_ext: Relocate check_hotplug_seq() call in scx_ops_enable() · 1bbcfe62
      Tejun Heo authored
      check_hotplug_seq() is used to detect CPU hotplug events which occur while
      the BPF scheduler is being loaded so that initialization can be retried if
      CPU hotplug events take place before the CPU hotplug callbacks are online.
      
      As such, the best place to call it is in the same cpus_read_lock() section
      that enables the CPU hotplug ops. Currently, it is called in the next
      cpus_read_lock() block in scx_ops_enable(). The side effect of this
      placement is a small window in which hotplug sequence detection can trigger
      unnecessarily, which isn't critical.
      
      Move check_hotplug_seq() invocation to the same cpus_read_lock() block as
      the hotplug operation enablement to close the window and get the invocation
      out of the way for planned locking updates.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: David Vernet <void@manifault.com>
      1bbcfe62
  8. 26 Sep, 2024 5 commits
    • sched_ext: Use shorter slice while bypassing · 6f34d8d3
      Tejun Heo authored
      While bypassing, tasks are scheduled in FIFO order which favors tasks that
      hog CPUs. This can slow down e.g. unloading of the BPF scheduler. While
      bypassing, guaranteeing timely forward progress is the main goal. There's no
      point in giving long slices. Shorten the time slice used while bypassing
      from 20ms to 5ms.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: David Vernet <void@manifault.com>
      6f34d8d3
    • sched_ext: Split the global DSQ per NUMA node · b7b3b2db
      Tejun Heo authored
      In the bypass mode, the global DSQ is used to schedule all tasks in simple
      FIFO order. All tasks are queued into the global DSQ and all CPUs try to
      execute tasks from it. This creates a lot of cross-node cacheline accesses
      and scheduling across the node boundaries, and can lead to live-lock
      conditions where the system takes tens of minutes to disable the BPF
      scheduler while executing in the bypass mode.
      
      Split the global DSQ per NUMA node. Each node has its own global DSQ. When a
      task is dispatched to SCX_DSQ_GLOBAL, it's put into the global DSQ local to
      the task's CPU and all CPUs in a node only consume its node-local global
      DSQ.
      
      This resolves a livelock condition which could be reliably triggered on a
      2x EPYC 7642 system by running `stress-ng --race-sched 1024` together with
      `stress-ng --workload 80 --workload-threads 10` while repeatedly enabling
      and disabling a SCX scheduler.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: David Vernet <void@manifault.com>
      b7b3b2db
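
      The lookup at the center of the split can be sketched as follows;
      find_global_dsq() is the helper named in the bba26bf3 entry below, while
      the array layout shown here is an assumption for illustration.

        static struct scx_dispatch_q **global_dsqs;     /* one global DSQ per NUMA node */

        static struct scx_dispatch_q *find_global_dsq(struct task_struct *p)
        {
                /*
                 * SCX_DSQ_GLOBAL dispatches land on the node of the task's CPU,
                 * and each CPU only consumes from its own node's global DSQ.
                 */
                return global_dsqs[cpu_to_node(task_cpu(p))];
        }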
    • sched_ext: Relocate find_user_dsq() · bba26bf3
      Tejun Heo authored
      To prepare for the addition of find_global_dsq(). No functional changes.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: David Vernet <void@manifault.com>
      bba26bf3
    • sched_ext: Allow only user DSQs for scx_bpf_consume(), scx_bpf_dsq_nr_queued() and bpf_iter_scx_dsq_new() · 63fb3ec8
      Tejun Heo authored
      
      SCX_DSQ_GLOBAL is special in that it can't be used as a priority queue and
      is consumed implicitly, but all BPF DSQ related kfuncs could be used on it.
      SCX_DSQ_GLOBAL will be split per-node for scalability and those operations
      won't make sense anymore. Disallow SCX_DSQ_GLOBAL on scx_bpf_consume(),
      scx_bpf_dsq_nr_queued() and bpf_iter_scx_dsq_new(). This means that
      SCX_DSQ_GLOBAL can only be used as a dispatch target from BPF schedulers.
      
      With scx_flatcg, which was using SCX_DSQ_GLOBAL as the fallback DSQ,
      updated, this shouldn't affect any schedulers.
      
      This leaves find_dsq_for_dispatch() the only user of find_non_local_dsq().
      Open code and remove find_non_local_dsq().
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: David Vernet <void@manifault.com>
      63fb3ec8
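
      Conceptually, the restriction falls out of routing the affected kfuncs
      through a user-DSQ-only lookup. A sketch along the lines of the
      find_user_dsq() helper mentioned above; in the real code the built-in
      check may live in the callers, and the hash-table details are assumptions.

        static struct scx_dispatch_q *find_user_dsq(u64 dsq_id)
        {
                /* built-in IDs such as SCX_DSQ_GLOBAL and SCX_DSQ_LOCAL are rejected */
                if (dsq_id & SCX_DSQ_FLAG_BUILTIN)
                        return NULL;

                return rhashtable_lookup_fast(&dsq_hash, &dsq_id, dsq_hash_params);
        }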
    • scx_flatcg: Use a user DSQ for fallback instead of SCX_DSQ_GLOBAL · c9c809f4
      Tejun Heo authored
      scx_flatcg was using SCX_DSQ_GLOBAL for fallback handling. However, it is
      assuming that SCX_DSQ_GLOBAL isn't automatically consumed, which was true a
      while ago but is no longer the case. Also, there are further changes planned
      for SCX_DSQ_GLOBAL which will disallow explicit consumption from it. Switch
      to a user DSQ for fallback.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: David Vernet <void@manifault.com>
      c9c809f4
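
      On the scheduler side the switch is small. A sketch in SCX example-scheduler
      style follows; FALLBACK_DSQ_ID and the surrounding logic are simplified and
      are not the actual scx_flatcg diff.

        #define FALLBACK_DSQ_ID         0       /* any otherwise-unused user DSQ ID */

        s32 BPF_STRUCT_OPS_SLEEPABLE(fcg_init)
        {
                /* create an ordinary user DSQ to take over from SCX_DSQ_GLOBAL */
                return scx_bpf_create_dsq(FALLBACK_DSQ_ID, -1);
        }

        void BPF_STRUCT_OPS(fcg_enqueue, struct task_struct *p, u64 enq_flags)
        {
                /* ... when no cgroup queue applies, fall back to the user DSQ ... */
                scx_bpf_dispatch(p, FALLBACK_DSQ_ID, SCX_SLICE_DFL, enq_flags);
        }

        void BPF_STRUCT_OPS(fcg_dispatch, s32 cpu, struct task_struct *prev)
        {
                /* ... a user DSQ must be consumed explicitly, unlike SCX_DSQ_GLOBAL ... */
                scx_bpf_consume(FALLBACK_DSQ_ID);
        }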
  9. 25 Sep, 2024 2 commits
  10. 24 Sep, 2024 9 commits
    • sched_ext: Build fix for !CONFIG_SMP · 42268ad0
      Tejun Heo authored
      move_remote_task_to_local_dsq() is only defined on SMP configs but
      scx_dispatch_from_dsq() was calling move_remote_task_to_local_dsq() on UP
      configs too, causing build failures. Add a dummy
      move_remote_task_to_local_dsq() which triggers a warning.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Fixes: 4c30f5ce ("sched_ext: Implement scx_bpf_dispatch[_vtime]_from_dsq()")
      Reported-by: kernel test robot <lkp@intel.com>
      Closes: https://lore.kernel.org/oe-kbuild-all/202409241108.jaocHiDJ-lkp@intel.com/
      42268ad0
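
      The fix is just a stub for the UP case; a sketch follows, and the parameter
      list shown here is abbreviated from the SMP version and is an assumption.

        #ifdef CONFIG_SMP
        /* ... real implementation that locks the remote rq and moves the task ... */
        #else   /* !CONFIG_SMP */

        static inline void
        move_remote_task_to_local_dsq(struct task_struct *p, u64 enq_flags,
                                      struct rq *src_rq, struct rq *dst_rq)
        {
                WARN_ON_ONCE(1);        /* no remote rq exists on UP; never reached */
        }

        #endif  /* CONFIG_SMP */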
    • Merge tag 'kbuild-v6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild · 68e5c7d4
      Linus Torvalds authored
      Pull Kbuild updates from Masahiro Yamada:
      
       - Support cross-compiling linux-headers Debian package and kernel-devel
         RPM package
      
       - Add support for the linux-debug Pacman package
      
       - Improve module rebuilding speed by factoring out the common code to
         scripts/module-common.c
      
       - Separate device tree build rules into scripts/Makefile.dtbs
      
       - Add a new script to generate modules.builtin.ranges, which is useful
         for tracing tools to find symbols in built-in modules
      
       - Refactor Kconfig and misc tools
      
       - Update Kbuild and Kconfig documentation
      
      * tag 'kbuild-v6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: (51 commits)
        kbuild: doc: replace "gcc" in external module description
        kbuild: doc: describe the -C option precisely for external module builds
        kbuild: doc: remove the description about shipped files
        kbuild: doc: drop section numbering, use references in modules.rst
        kbuild: doc: throw out the local table of contents in modules.rst
        kbuild: doc: remove outdated description of the limitation on -I usage
        kbuild: doc: remove description about grepping CONFIG options
        kbuild: doc: update the description about Kbuild/Makefile split
        kbuild: remove unnecessary export of RUST_LIB_SRC
        kbuild: remove append operation on cmd_ld_ko_o
        kconfig: cache expression values
        kconfig: use hash table to reuse expressions
        kconfig: refactor expr_eliminate_dups()
        kconfig: add comments to expression transformations
        kconfig: change some expr_*() functions to bool
        scripts: move hash function from scripts/kconfig/ to scripts/include/
        kallsyms: change overflow variable to bool type
        kallsyms: squash output_address()
        kbuild: add install target for modules.builtin.ranges
        scripts: add verifier script for builtin module range data
        ...
      68e5c7d4
    • Merge tag 'linux-cpupower-6.12-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux · 7f8de2bf
      Linus Torvalds authored
      
      Pull cpupower updates from Shuah Khan:
       "The 'raw_pylibcpupower.i' file was being removed by "make mrproper".
      
         That was because '*.i', '*.s' and '*.o' files are generated during
        kernel compile and removed when the repo is cleaned by mrproper.
      
        Rename it to use .swg extension instead to avoid the problem.
      
        A second patch removes references to it from .gitignore"
      
      * tag 'linux-cpupower-6.12-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux:
        pm: cpupower: Clean up bindings gitignore
        pm: cpupower: rename raw_pylibcpupower.i
      7f8de2bf
    • Merge tag 'i3c/for-6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/i3c/linux · cd3d6477
      Linus Torvalds authored
      Pull i3c updates from Alexandre Belloni:
       "This adds support for the I3C HCI controller of the AMD SoC which as
        expected requires quirks. Also fixes for the other drivers, including
        rate selection fixes for svc.
      
        Core:
         - allow adjusting first broadcast address speed
      
        Drivers:
         - cdns: few fixes
         - mipi-i3c-hci: Add AMD SoC I3C controller support and quirks, fix
           get_i3c_mode
         - svc: adjust rates, fix race condition"
      
      * tag 'i3c/for-6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/i3c/linux:
        i3c: master: svc: Fix use after free vulnerability in svc_i3c_master Driver Due to Race Condition
        i3c: master: cdns: Fix use after free vulnerability in cdns_i3c_master Driver Due to Race Condition
        i3c: master: svc: adjust SDR according to i3c spec
        i3c: master: svc: use slow speed for first broadcast address
        i3c: master: support to adjust first broadcast address speed
        i3c/master: cmd_v1: Fix the rule for getting i3c mode
        i3c: master: cdns: fix module autoloading
        i3c: mipi-i3c-hci: Add a quirk to set Response buffer threshold
        i3c: mipi-i3c-hci: Add a quirk to set timing parameters
        i3c: mipi-i3c-hci: Relocate helper macros to HCI header file
        i3c: mipi-i3c-hci: Add a quirk to set PIO mode
        i3c: mipi-i3c-hci: Read HC_CONTROL_PIO_MODE only after i3c hci v1.1
        i3c: mipi-i3c-hci: Add AMDI5017 ACPI ID to the I3C Support List
      cd3d6477
    • remoteproc: k3-m4: use the proper dependencies · ba0c0cb5
      Linus Torvalds authored
      The TI_K3_M4_REMOTEPROC Kconfig entry selects OMAP2PLUS_MBOX, but that
      driver in turn depends on other things, which the k3-m4 driver didn't.
      
      This causes a Kconfig time warning:
      
        WARNING: unmet direct dependencies detected for OMAP2PLUS_MBOX
          Depends on [n]: MAILBOX [=y] && (ARCH_OMAP2PLUS || ARCH_K3)
          Selected by [m]:
          - TI_K3_M4_REMOTEPROC [=m] && REMOTEPROC [=y] && (ARCH_K3 || COMPILE_TEST [=y])
      
      because you can't select something that is unavailable.
      
      Make the dependencies for TI_K3_M4_REMOTEPROC match those of the
      OMAP2PLUS_MBOX driver that it needs.
      
      Fixes: ebcf9008 ("remoteproc: k3-m4: Add a remoteproc driver for M4F subsystem")
      Cc: Bjorn Andersson <andersson@kernel.org>
      Cc: Martyn Welch <martyn.welch@collabora.com>
      Cc: Hari Nagalla <hnagalla@ti.com>
      Cc: Andrew Davis <afd@ti.com>
      Cc: Mathieu Poirier <mathieu.poirier@linaro.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ba0c0cb5
    • Merge tag 'input-for-v6.12-rc0' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input · 9ae2940c
      Linus Torvalds authored
      Pull input updates from Dmitry Torokhov:
      
       - support for PixArt PS/2 touchpad
      
       - updates to tsc2004/5, usbtouchscreen, and zforce_ts drivers
      
       - support for GPIO-only mode for ADP5588 controller
      
       - support for touch keys in Zinitix driver
      
       - support for querying density of Synaptics sensors
      
       - sysfs interface for Goodix "Berlin" devices to read and write touch
         IC registers
      
       - more quirks to i8042 to handle various Tuxedo laptops
      
       - a number of drivers have been converted to using "guard" notation
         when acquiring various locks, as well as using other cleanup
         functions to simplify releasing of resources (with more drivers to
         follow)
      
       - evdev will limit the amount of data that can be written into an evdev
         instance at a given time to 4096 bytes (170 input events) to avoid
         holding evdev->mutex for too long and starving other users
      
       - Spitz has been converted to use software nodes/properties to describe
         its matrix keypad and GPIO-connected LEDs
      
       - mcs5000_ts, mcs_touchkey and keypad-nomadik-ske drivers have been
         removed since no one in mainline has been using them
      
       - other assorted cleanups and fixes
      
      * tag 'input-for-v6.12-rc0' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input: (98 commits)
        ARM: spitz: fix compile error when matrix keypad driver is enabled
        Input: hynitron_cstxxx - drop explicit initialization of struct i2c_device_id::driver_data to 0
        Input: adp5588-keys - fix check on return code
        Input: Convert comma to semicolon
        Input: i8042 - add TUXEDO Stellaris 15 Slim Gen6 AMD to i8042 quirk table
        Input: i8042 - add another board name for TUXEDO Stellaris Gen5 AMD line
        Input: tegra-kbc - use of_property_read_variable_u32_array() and of_property_present()
        Input: ps2-gpio - use IRQF_NO_AUTOEN flag in request_irq()
        Input: ims-pcu - fix calling interruptible mutex
        Input: zforce_ts - switch to using asynchronous probing
        Input: zforce_ts - remove assert/deassert wrappers
        Input: zforce_ts - do not hardcode interrupt level
        Input: zforce_ts - switch to using devm_regulator_get_enable()
        Input: zforce_ts - stop treating VDD regulator as optional
        Input: zforce_ts - make zforce_idtable constant
        Input: zforce_ts - use dev_err_probe() where appropriate
        Input: zforce_ts - do not ignore errors when acquiring regulator
        Input: zforce_ts - make parsing of contacts less confusing
        Input: zforce_ts - switch to using get_unaligned_le16
        Input: zforce_ts - use guard notation when acquiring mutexes
        ...
      9ae2940c
    • Merge tag 'hwlock-v6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/remoteproc/linux · 6db6a19f
      Linus Torvalds authored
      Pull hwspinlock update from Bjorn Andersson:
       "This converts the Spreadtrum hardware spinlock DeviceTree binding to
        YAML, to allow validation of related DeviceTree source"
      
      * tag 'hwlock-v6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/remoteproc/linux:
        dt-bindings: hwlock: sprd-hwspinlock: convert to YAML
      6db6a19f
    • Merge tag 'rpmsg-v6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/remoteproc/linux · 6e10aa1f
      Linus Torvalds authored
      Pull rpmsg updates from Bjorn Andersson:
      
       - Minor cleanup/refactor to the Qualcomm GLINK code, in order to add
         trace events related to the messages exchange with the remote side,
         useful for debugging a range of interoperability issues
      
       - Rewrite the nested structs with flexible array members in order to
         avoid the risk of invalid accesses
      
      * tag 'rpmsg-v6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/remoteproc/linux:
        rpmsg: glink: Avoid -Wflex-array-member-not-at-end warnings
        rpmsg: glink: Introduce packet tracepoints
        rpmsg: glink: Pass channel to qcom_glink_send_close_ack()
        rpmsg: glink: Tidy up RX advance handling
      6e10aa1f
    • Merge tag 'rproc-v6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/remoteproc/linux · 5c480f1d
      Linus Torvalds authored
      Pull remoteproc updates from Bjorn Andersson:
      
       - Add remoteproc support for the Cortex M4F found in AM62x and AM64x of
         the TI K3 family, support for the modem remoteproc in the Qualcomm
         SDX75, and audio, compute and general-purpose DSPs of the Qualcomm
         SA8775P.
      
       - Add support for blocking and non-blocking mailbox transmissions to
         the i.MX remoteproc driver, and implement poweroff and reboot
         mechanisms using them. Plus a few bug fixes and minor improvements.
      
       - Cleanups and bug fixes for the TI K3 DSP and R5F drivers
      
       - Support mapping SRAM regions into the AMD-Xilinx Zynqmp R5 cores
      
       - Use devres helpers for various allocations in the Ingenic, TI DA8xx,
         TI Keystone, TI K3, ST slim drivers
      
       - Replace uses of of_{find,get}_property() with of_property_present()
         where possible
      
      * tag 'rproc-v6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/remoteproc/linux: (25 commits)
        remoteporc: ingenic: Use devm_platform_ioremap_resource_byname()
        remoteproc: da8xx: Use devm_platform_ioremap_resource_byname()
        remoteproc: st_slim: Use devm_platform_ioremap_resource_byname()
        remoteproc: xlnx: Add sram support
        remoteproc: k3-r5: Fix error handling when power-up failed
        remoteproc: imx_rproc: Add support for poweroff and reboot
        remoteproc: imx_rproc: Allow setting of the mailbox transmit mode
        remoteproc: k3-r5: Delay notification of wakeup event
        remoteproc: k3-m4: Add a remoteproc driver for M4F subsystem
        remoteproc: k3: Factor out TI-SCI processor control OF get function
        dt-bindings: remoteproc: k3-m4f: Add K3 AM64x SoCs
        remoteproc: k3-dsp: Acquire mailbox handle during probe routine
        remoteproc: k3-r5: Acquire mailbox handle during probe routine
        remoteproc: k3-r5: Use devm_rproc_alloc() helper
        remoteproc: qcom: pas: Add support for SA8775p ADSP, CDSP and GPDSP
        remoteproc: qcom: pas: Add SDX75 remoteproc support
        dt-bindings: remoteproc: qcom,sm8550-pas: document the SDX75 PAS
        remoteproc: keystone: Use devm_rproc_alloc() helper
        remoteproc: keystone: Use devm_kasprintf() to build name string
        dt-bindings: remoteproc: xlnx,zynqmp-r5fss: Add missing "additionalProperties" on child nodes
        ...
      5c480f1d