1. 12 Jul, 2024 6 commits
    • sched_ext/scx_qmap: Pick idle CPU for direct dispatch on !wakeup enqueues · 1edab907
      Tejun Heo authored
      Because there was no way to directly dispatch to the local DSQ of a remote
      CPU from ops.enqueue(), scx_qmap skipped looking for an idle CPU on !wakeup
      enqueues. This restriction was removed and sched_ext now allows
      SCX_DSQ_LOCAL_ON verdicts for direct dispatches.
      
      Factor out pick_direct_dispatch_cpu() from ops.select_cpu() and use it to
      direct dispatch from ops.enqueue() on !wakeup enqueues.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: David Vernet <void@manifault.com>
      Cc: Dan Schatzberg <schatzberg.dan@gmail.com>
      Cc: Changwoo Min <changwoo@igalia.com>
      Cc: Andrea Righi <righi.andrea@gmail.com>
      1edab907
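      As an illustration of the resulting pattern, here is a minimal sketch of an
      ops.enqueue() direct dispatch. This is not the actual scx_qmap code and
      assumes the kfunc and flag names declared in the common.bpf.h header under
      tools/sched_ext (scx_bpf_pick_idle_cpu(), scx_bpf_dispatch(),
      SCX_DSQ_LOCAL_ON, SCX_SLICE_DFL):

        static s32 pick_direct_dispatch_cpu(struct task_struct *p, s32 prev_cpu)
        {
                /* stick with the previous CPU if it is still idle */
                if (scx_bpf_test_and_clear_cpu_idle(prev_cpu))
                        return prev_cpu;

                /* otherwise try any idle CPU the task is allowed to run on */
                return scx_bpf_pick_idle_cpu(p->cpus_ptr, 0);
        }

        void BPF_STRUCT_OPS(qmap_enqueue, struct task_struct *p, u64 enq_flags)
        {
                s32 cpu = pick_direct_dispatch_cpu(p, scx_bpf_task_cpu(p));

                if (cpu >= 0) {
                        /* SCX_DSQ_LOCAL_ON | cpu targets the local DSQ of @cpu */
                        scx_bpf_dispatch(p, SCX_DSQ_LOCAL_ON | cpu, SCX_SLICE_DFL,
                                         enq_flags);
                        return;
                }

                /* fall back to queueing on the scheduler's own DSQs */
        }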
    • sched_ext: Allow SCX_DSQ_LOCAL_ON for direct dispatches · 5b26f7b9
      Tejun Heo authored
      In ops.dispatch(), SCX_DSQ_LOCAL_ON can be used to dispatch the task to the
      local DSQ of any CPU. However, during direct dispatch from ops.select_cpu()
      and ops.enqueue(), this isn't allowed. This is because dispatching to the
      local DSQ of a remote CPU requires locking both the task's current and new
      rq's and such double locking can't be done directly from ops.enqueue().
      
      While waking up a task, as ops.select_cpu() can pick any CPU and both
      ops.select_cpu() and ops.enqueue() can use SCX_DSQ_LOCAL as the dispatch
      target to dispatch to the DSQ of the picked CPU, the BPF scheduler can still
      do whatever it wants to do. However, while a task is being enqueued for a
      different reason, e.g. after its slice expiration, only ops.enqueue() is
      called and there's no way for the BPF scheduler to directly dispatch to the
      local DSQ of a remote CPU. This gap in API forces schedulers into
      work-arounds which are not straightforward or optimal such as skipping
      direct dispatches in such cases.
      
      Implement deferred enqueueing to allow directly dispatching to the local DSQ
      of a remote CPU from ops.select_cpu() and ops.enqueue(). Such tasks are
      temporarily queued on rq->scx.ddsp_deferred_locals. When the rq lock can be
      safely released, the tasks are taken off the list and queued on the target
      local DSQs using dispatch_to_local_dsq().
      
      v2: - Add missing return after queue_balance_callback() in
            schedule_deferred(). (David).
      
          - dispatch_to_local_dsq() now assumes that @rq is locked but unpinned
            and thus no longer takes @rf. Updated accordingly.
      
          - UP build warning fix.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Tested-by: Andrea Righi <righi.andrea@gmail.com>
      Acked-by: David Vernet <void@manifault.com>
      Cc: Dan Schatzberg <schatzberg.dan@gmail.com>
      Cc: Changwoo Min <changwoo@igalia.com>
      5b26f7b9
    • sched_ext: s/SCX_RQ_BALANCING/SCX_RQ_IN_BALANCE/ and add SCX_RQ_IN_WAKEUP · f47a8189
      Tejun Heo authored
      SCX_RQ_BALANCING is used to mark that the rq is currently in balance().
      Rename it to SCX_RQ_IN_BALANCE and add SCX_RQ_IN_WAKEUP which marks whether
      the rq is currently enqueueing for a wakeup. This will be used to implement
      direct dispatching to local DSQ of another CPU.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: David Vernet <void@manifault.com>
      f47a8189
    • sched_ext: Unpin and repin rq lock from balance_scx() · 3cf78c5d
      Tejun Heo authored
      sched_ext often needs to migrate tasks across CPUs right before execution
      and thus uses the balance path to dispatch tasks from the BPF scheduler.
      balance_scx() is called with rq locked and pinned but is passed @rf and thus
      allowed to unpin and unlock. Currently, @rf is passed down the call stack so
      the rq lock is unpinned just when double locking is needed.
      
      This creates unnecessary complications such as having to explicitly
      manipulate lock pinning for core scheduling. We also want to use
      dispatch_to_local_dsq_lock() from other paths which are called with rq
      locked but unpinned.
      
      rq lock handling in the dispatch path is straightforward outside the
      migration implementation and extending the pinning protection down the
      callstack doesn't add enough meaningful extra protection to justify the
      extra complexity.
      
      Unpin and repin rq lock from the outer balance_scx() and drop @rf passing
      and lock pinning handling from the inner functions. UP is updated to call
      balance_one() instead of balance_scx() to avoid adding NULL @rf handling to
      balance_scx(). As this makes balance_scx() unused in UP, it's put inside a
      CONFIG_SMP block.
      
      No functional changes intended outside of lock annotation updates.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: David Vernet <void@manifault.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Andrea Righi <righi.andrea@gmail.com>
      3cf78c5d
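      Schematically, the pin handling now lives only in the outer function. A
      rough sketch of the shape (not the exact diff; the core-sched SMT handling
      is elided and the balance_one() signature is approximate):

        static int balance_scx(struct rq *rq, struct task_struct *prev,
                               struct rq_flags *rf)
        {
                int ret;

                rq_unpin_lock(rq, rf);  /* inner code may drop and retake rq->lock */

                ret = balance_one(rq, prev);
                /* ... when core-sched is enabled, also balance SMT siblings ... */

                rq_repin_lock(rq, rf);
                return ret;
        }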
    • sched_ext: Open-code task_linked_on_dsq() · d6a05910
      Tejun Heo authored
      task_linked_on_dsq() exists as a helper because it used to test both the
      rbtree and list nodes. It now only tests the list node and the list node
      will soon be used for something else too. The helper doesn't improve
      anything materially and the naming will become confusing. Open-code the list
      node testing and remove task_linked_on_dsq().
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: David Vernet <void@manifault.com>
      d6a05910
    • sched: Move struct balance_callback definition upward · fc283116
      Tejun Heo authored
      Move struct balance_callback definition upward so that it's visible to
      class-specific rq struct definitions. This will be used to embed a struct
      balance_callback in struct scx_rq.
      
      No functional changes.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: David Vernet <void@manifault.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      fc283116
  2. 09 Jul, 2024 5 commits
    • sched_ext: Make scx_bpf_reenqueue_local() skip tasks that are being migrated · e7a6395a
      Tejun Heo authored
      When a running task is migrated to another CPU, the stop_task is used to
      preempt the running task and migrate it. This, expectedly, invokes
      ops.cpu_release(). If the BPF scheduler then calls
      scx_bpf_reenqueue_local(), it re-enqueues all tasks on the local DSQ
      including the task which is being migrated.
      
      This creates an unnecessary re-enqueue of a task which is about to be
      deactivated and re-activated for migration anyway. It can also cause
      confusion for the BPF scheduler as scx_bpf_task_cpu() of the task and its
      allowed CPUs may not agree while migration is pending.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Fixes: 245254f7 ("sched_ext: Implement sched_ext_ops.cpu_acquire/release()")
      Acked-by: David Vernet <void@manifault.com>
      e7a6395a
    • sched_ext: Reimplement scx_bpf_reenqueue_local() · fd0cf516
      Tejun Heo authored
      scx_bpf_reenqueue_local() is used to re-enqueue tasks on the local DSQ from
      ops.cpu_release(). Because the BPF scheduler may dispatch tasks to the same
      local DSQ, to avoid processing the same tasks repeatedly, it first takes the
      number of queued tasks and processes the task at the head of the queue that
      number of times.
      
      This is incorrect as a task can be dispatched to the same local DSQ with
      SCX_ENQ_HEAD. Such a task will be processed repeatedly until the count is
      exhausted and the succeeding tasks won't be processed at all.
      
      Fix it by first moving all candidate tasks to a private list and then
      processing that list. While at it, remove the WARNs. They're rather
      superfluous as later steps will check them anyway.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Fixes: 245254f7 ("sched_ext: Implement sched_ext_ops.cpu_acquire/release()")
      Acked-by: David Vernet <void@manifault.com>
      fd0cf516
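      The fix follows the usual splice-to-a-private-list pattern. A rough sketch
      (the helper and field names here are illustrative, not the exact ext.c
      code):

        u32 nr_enqueued = 0;
        struct task_struct *p, *n;
        LIST_HEAD(tasks);                       /* private list */

        /* 1) detach every current candidate from the local DSQ */
        list_for_each_entry_safe(p, n, &local_dsq_list, dsq_node) {
                if (task_being_migrated(p))     /* see the follow-up fix above */
                        continue;
                list_move_tail(&p->dsq_node, &tasks);
        }

        /*
         * 2) re-enqueue from the private list; a task the BPF scheduler puts
         *    back on the local DSQ (even with SCX_ENQ_HEAD) is not revisited
         */
        list_for_each_entry_safe(p, n, &tasks, dsq_node) {
                list_del_init(&p->dsq_node);
                re_enqueue_task(p);
                nr_enqueued++;
        }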
    • sched_ext/scx_qmap: Add an example usage of DSQ iterator · 6fbd6433
      Tejun Heo authored
      Implement periodic dumping of the shared DSQ to demonstrate the use of the
      newly added DSQ iterator.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: David Vernet <dvernet@meta.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Cc: bpf@vger.kernel.org
      6fbd6433
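      The usage in scx_qmap is roughly the following, assuming the
      bpf_for_each(scx_dsq, ...) wrapper from the tools/sched_ext common.bpf.h
      header, with SHARED_DSQ standing for the scheduler's own DSQ id:

        static void dump_shared_dsq(void)
        {
                struct task_struct *p;
                s32 nr;

                nr = scx_bpf_dsq_nr_queued(SHARED_DSQ);
                if (!nr)
                        return;

                bpf_printk("Dumping %d tasks on SHARED_DSQ", nr);

                bpf_rcu_read_lock();
                bpf_for_each(scx_dsq, p, SHARED_DSQ, 0)
                        bpf_printk("%s[%d]", p->comm, p->pid);
                bpf_rcu_read_unlock();
        }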
    • sched_ext: Implement DSQ iterator · 650ba21b
      Tejun Heo authored
      DSQs are very opaque in the consumption path. The BPF scheduler has no way
      of knowing which tasks are being considered and which one is picked. This
      patch adds a BPF DSQ iterator.
      
      - Allows iterating tasks queued on a DSQ in the dispatch order or reverse
        from anywhere using bpf_for_each(scx_dsq) or calling the iterator kfuncs
        directly.
      
      - Has ordering guarantee where only tasks which were already queued when the
        iteration started are visible and consumable during the iteration.
      
      v5: - Add a comment to the naked list_empty(&dsq->list) test in
            consume_dispatch_q() to explain the reasoning behind the lockless test
            and by extension why nldsq_next_task() isn't used there.
      
          - scx_qmap changes separated into its own patch.
      
      v4: - bpf_iter_scx_dsq_new() declaration in common.bpf.h was using the wrong
            type for the last argument (bool rev instead of u64 flags). Fix it.
      
      v3: - Alexei pointed out that the iterator is too big to allocate on stack.
            Added a prep patch to reduce the size of the cursor. Now
            bpf_iter_scx_dsq is 48 bytes and bpf_iter_scx_dsq_kern is 40 bytes on
            64bit.
      
          - u32_before() comparison factored out.
      
      v2: - scx_bpf_consume_task() is separated out into a separate patch.
      
          - DSQ seq and iter flags don't need to be u64. Use u32.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: David Vernet <dvernet@meta.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Cc: bpf@vger.kernel.org
      650ba21b
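      For reference, calling the iterator kfuncs directly instead of going
      through bpf_for_each() looks like this (a sketch; the flag argument is
      shown as 0, a reverse-iteration flag can be passed instead):

        struct bpf_iter_scx_dsq it;
        struct task_struct *p;

        if (!bpf_iter_scx_dsq_new(&it, dsq_id, 0 /* or a reverse-iteration flag */)) {
                while ((p = bpf_iter_scx_dsq_next(&it)))
                        bpf_printk("%s[%d]", p->comm, p->pid);
                bpf_iter_scx_dsq_destroy(&it);
        }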
    • sched_ext: Take out ->priq and ->flags from scx_dsq_node · d4af01c3
      Tejun Heo authored
      struct scx_dsq_node contains two data structure nodes to link the containing
      task to a DSQ and a flags field that is protected by the lock of the
      associated DSQ. One reason why they are grouped into a struct is to use the
      type independently as a cursor node when iterating tasks on a DSQ. However,
      when iterating, the cursor only needs to be linked on the FIFO list and the
      rb_node part ends up inflating the size of the iterator data structure
      unnecessarily, making it potentially too expensive to place on the stack.
      
      Take ->priq and ->flags out of scx_dsq_node and put them in sched_ext_entity
      as ->dsq_priq and ->dsq_flags, respectively. scx_dsq_node is renamed to
      scx_dsq_list_node and the field names are renamed accordingly. This will
      help implementing DSQ task iterator that can be allocated on stack.
      
      No functional change intended.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Suggested-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Cc: David Vernet <void@manifault.com>
      d4af01c3
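      The resulting layout, roughly (field names as described above, types
      approximate):

        /* only the list linkage remains, cheap enough to embed in an on-stack cursor */
        struct scx_dsq_list_node {
                struct list_head        node;
        };

        struct sched_ext_entity {
                /* ... */
                struct scx_dsq_list_node dsq_list;      /* was scx_dsq_node.list */
                struct rb_node           dsq_priq;      /* was scx_dsq_node.priq */
                u32                      dsq_flags;     /* was scx_dsq_node.flags */
                /* ... */
        };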
  3. 08 Jul, 2024 8 commits
    • sched, sched_ext: Move some declarations from kernel/sched/ext.h to sched.h · e196c908
      Tejun Heo authored
      While sched_ext was out of tree, everything sched_ext specific which can be
      put in kernel/sched/ext.h was put there to ease forward porting. However,
      kernel/sched/sched.h is the better location for some of them. Relocate.
      
      - struct sched_enq_and_set_ctx, sched_deq_and_put_task() and
        sched_enq_and_set_task().
      
      - scx_enabled() and scx_switched_all().
      
      - for_active_class_range() and for_each_active_class(). sched_class
        declarations are moved above the class iterators for this.
      
      No functional changes intended.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: David Vernet <void@manifault.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      e196c908
    • sched, sched_ext: Open code for_balance_class_range() · 744d8360
      Tejun Heo authored
      For flexibility, sched_ext allows the BPF scheduler to select the CPU to
      execute a task on at dispatch time so that e.g. a queue can be shared across
      multiple CPUs. To enable this, the dispatch path is executed from balance()
      so that a dispatched task can be hot-migrated to its target CPU. This means
      that sched_ext needs its balance() method invoked before every
      pick_next_task() even when the CPU is waking up from SCHED_IDLE.
      
      for_balance_class_range() defined in kernel/sched/ext.h implements this
      selective iteration promotion. However, the indirection obfuscates more than
      helps. Open code the iteration promotion in put_prev_task_balance() and
      remove for_balance_class_range().
      
      No functional changes intended.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
      Acked-by: David Vernet <void@manifault.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      744d8360
    • sched_ext: Minor cleanups in kernel/sched/ext.h · 6ab228ec
      Tejun Heo authored
      - scx_ops_cpu_preempt is only used in kernel/sched/ext.c and doesn't need to
        be global. Make it static.
      
      - Relocate task_on_scx() so that the inline functions are located together.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: David Vernet <void@manifault.com>
      6ab228ec
    • sched_ext: Disallow loading BPF scheduler if isolcpus= domain isolation is in effect · 9f391f94
      Tejun Heo authored
      sched_domains regulate the load balancing for sched_classes. A machine can
      be partitioned into multiple sections across which no load balancing is
      done, using either the isolcpus= boot param or cpuset partitions. In such
      cases, tasks
      that are in one partition are expected to stay within that partition.
      
      cpuset configured partitions are always reflected in each member task's
      cpumask. As SCX always honors the task cpumasks, the BPF scheduler is
      automatically in compliance with the configured partitions.
      
      However, for isolcpus= domain isolation, the isolated CPUs are simply
      omitted from the top-level sched_domain[s] without further restrictions on
      tasks' cpumasks, so, for example, a task currently running on an isolated
      CPU may have other CPUs in its allowed cpumask while being expected to
      remain on the same CPU.
      
      There is no straightforward way to enforce this partitioning preemptively on
      BPF schedulers and erroring out after a violation can be surprising.
      isolcpus= domain isolation is being replaced with cpuset partitions anyway,
      so keep things simple and just disallow loading a BPF scheduler while
      isolcpus= domain isolation is in effect.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Link: http://lkml.kernel.org/r/20240626082342.GY31592@noisy.programming.kicks-ass.net
      Cc: David Vernet <void@manifault.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Frederic Weisbecker <frederic@kernel.org>
      9f391f94
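      Conceptually, the check added to the ops enable path is small. A sketch,
      assuming housekeeping_cpumask(HK_TYPE_DOMAIN) is what reflects isolcpus=
      domain isolation:

        if (!cpumask_equal(housekeeping_cpumask(HK_TYPE_DOMAIN),
                           cpu_possible_mask)) {
                pr_err("sched_ext: Not compatible with \"isolcpus=\" domain isolation\n");
                return -EINVAL;
        }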
    • sched_ext: Account for idle policy when setting p->scx.weight in scx_ops_enable_task() · e98abd22
      Tejun Heo authored
      When initializing p->scx.weight, scx_ops_enable_task() wasn't considering
      whether the task is SCHED_IDLE. Update it to use WEIGHT_IDLEPRIO as the
      source weight for SCHED_IDLE tasks. This leaves reweight_task_scx() the sole
      user of set_task_scx_weight(). Open code it. @weight is going to be provided
      by sched core in the future anyway.
      
      v2: Use the newly available @lw->weight to set @p->scx.weight in
          reweight_task_scx().
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: David Vernet <void@manifault.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      e98abd22
    • sched, sched_ext: Simplify dl_prio() case handling in sched_fork() · 60564acb
      Tejun Heo authored
      sched_fork() returns with -EAGAIN if dl_prio(@p). a7a9fc54 ("sched_ext:
      Add boilerplate for extensible scheduler class") added scx_pre_fork() call
      before it and then scx_cancel_fork() on the exit path. This is silly as the
      dl_prio() block can just be moved above the scx_pre_fork() call.
      
      Move the dl_prio() block above the scx_pre_fork() call and remove the now
      unnecessary scx_cancel_fork() invocation.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Suggested-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: David Vernet <void@manifault.com>
      60564acb
    • sched/ext: Add BPF function to fetch rq · 6203ef73
      Hongyan Xia authored
      rq contains many useful fields to implement a custom scheduler. For
      example, various clock signals like clock_task and clock_pelt can be
      used to track load. It also contains stats in other sched_classes, which
      are useful to drive scheduling decisions in ext.
      
      tj: Put the new helper below scx_bpf_task_*() helpers.
      Signed-off-by: Hongyan Xia <hongyan.xia2@arm.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      6203ef73
    • Merge branch 'sched/core' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip into for-6.11 · 7b9f6c86
      Tejun Heo authored
      d3296052 ("sched/fair: set_load_weight() must also call reweight_task()
      for SCHED_IDLE tasks") applied to sched/core changes how reweight_task() is
      called causing conflicts with e83edbf8 ("sched: Add
      sched_class->reweight_task()"). Resolve the conflicts by taking
      set_load_weight() changes from d3296052 and updating
      sched_class->reweight_task() to take pointer to struct load_weight instead
      of int prio.
      
      Signed-off-by: Tejun Heo <tj@kernel.org>
      7b9f6c86
  4. 04 Jul, 2024 2 commits
  5. 02 Jul, 2024 1 commit
  6. 01 Jul, 2024 1 commit
  7. 27 Jun, 2024 3 commits
  8. 25 Jun, 2024 1 commit
    • sched_ext: Drop tools_clean target from the top-level Makefile · eb4a3b62
      Tejun Heo authored
      2a52ca7c ("sched_ext: Add scx_simple and scx_example_qmap example
      schedulers") added the tools_clean target which is triggered by mrproper.
      The tools_clean target triggers the sched_ext_clean target in tools/. This
      unfortunately makes mrproper fail when no BTF enabled kernel image is found:
      
        Makefile:83: *** Cannot find a vmlinux for VMLINUX_BTF at any of "  ../../vmlinux /sys/kernel/btf/vmlinux/boot/vmlinux-4.15.0-136-generic".  Stop.
        Makefile:192: recipe for target 'sched_ext_clean' failed
        make[2]: *** [sched_ext_clean] Error 2
        Makefile:1361: recipe for target 'sched_ext' failed
        make[1]: *** [sched_ext] Error 2
        Makefile:240: recipe for target '__sub-make' failed
        make: *** [__sub-make] Error 2
      
      Clean targets shouldn't fail like this but also it's really odd for mrproper
      to single out and trigger the sched_ext_clean target when no other clean
      targets under tools/ are triggered.
      
      Fix builds by dropping the tools_clean target from the top-level Makefile.
      The offending Makefile line is shared across BPF targets under tools/. Let's
      revisit them later.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-by: Jon Hunter <jonathanh@nvidia.com>
      Link: http://lkml.kernel.org/r/ac065f1f-8754-4626-95db-2c9fcf02567b@nvidia.com
      Fixes: 2a52ca7c ("sched_ext: Add scx_simple and scx_example_qmap example schedulers")
      Cc: David Vernet <void@manifault.com>
      eb4a3b62
  9. 23 Jun, 2024 1 commit
    • sched_ext: Make scx_bpf_cpuperf_set() @cpu arg signed · 8a6c6b4b
      David Vernet authored
      The scx_bpf_cpuperf_set() kfunc allows a BPF program to set the relative
      performance target of a specified CPU. Commit d86adb4f ("sched_ext: Add
      cpuperf support") defined the @cpu argument to be unsigned. Let's update it
      to be signed to match the norm for the rest of ext.c and the kernel.
      
      Note that the kfunc declaration of scx_bpf_cpuperf_set() in the
      common.bpf.h header in tools/sched_ext already listed the cpu as signed, so
      this also fixes the build for tools/sched_ext and the sched_ext selftests
      due to kfunc declarations now being emitted in vmlinux.h based on BTF (thus
      causing the compiler to error due to observing conflicting types).
      
      Fixes: d86adb4f ("sched_ext: Add cpuperf support")
      Signed-off-by: David Vernet <void@manifault.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      8a6c6b4b
  10. 21 Jun, 2024 3 commits
    • sched_ext: Add cpuperf support · d86adb4f
      Tejun Heo authored
      sched_ext currently does not integrate with schedutil. When schedutil is the
      governor, frequencies are left unregulated and usually get stuck close to
      the highest performance level from running RT tasks.
      
      Add CPU performance monitoring and scaling support by integrating into
      schedutil. The following kfuncs are added:
      
      - scx_bpf_cpuperf_cap(): Query the relative performance capacity of
        different CPUs in the system.
      
      - scx_bpf_cpuperf_cur(): Query the current performance level of a CPU
        relative to its max performance.
      
      - scx_bpf_cpuperf_set(): Set the current target performance level of a CPU.
      
      This gives direct control over CPU performance setting to the BPF scheduler.
      The only changes on the schedutil side are accounting for the utilization
      factor from sched_ext and disabling the frequency holding heuristics, as
      they may not apply well to sched_ext schedulers which may have a much
      weaker connection between tasks and their current / last CPU.
      
      With cpuperf support added, there is no reason to block uclamp. Enable while
      at it.
      
      A toy implementation of cpuperf is added to scx_qmap as a demonstration of
      the feature.
      
      v2: Ignore cpu_util_cfs_boost() when scx_switched_all() in sugov_get_util()
          to avoid factoring in stale util metric. (Christian)
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: David Vernet <dvernet@meta.com>
      Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Cc: Viresh Kumar <viresh.kumar@linaro.org>
      Cc: Christian Loehle <christian.loehle@arm.com>
      d86adb4f
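      On the BPF side the kfuncs can be exercised along these lines. A toy
      sketch in the spirit of the scx_qmap demonstration; SCX_CPUPERF_ONE is
      assumed to be the full-performance scale value:

        void BPF_STRUCT_OPS(qmap_tick, struct task_struct *p)
        {
                s32 cpu = scx_bpf_task_cpu(p);

                /* relative capacity of @cpu and its current performance level */
                bpf_printk("cpu%d cap=%u cur=%u", cpu,
                           scx_bpf_cpuperf_cap(cpu), scx_bpf_cpuperf_cur(cpu));

                /* e.g. cap the CPU at half of its maximum performance */
                scx_bpf_cpuperf_set(cpu, SCX_CPUPERF_ONE / 2);
        }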
    • cpufreq_schedutil: Refactor sugov_cpu_is_busy() · 8988cad8
      Tejun Heo authored
      sugov_cpu_is_busy() is used to avoid decreasing performance level while the
      CPU is busy and called by sugov_update_single_freq() and
      sugov_update_single_perf(). Both callers repeat the same pattern to first
      test for uclamp and then for busyness. Let's refactor so that the tests
      aren't repeated.
      
      The new helper is named sugov_hold_freq() and tests both the uclamp
      exception and CPU busyness. No functional changes. This will make adding
      more exception conditions easier.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: David Vernet <dvernet@meta.com>
      Reviewed-by: Christian Loehle <christian.loehle@arm.com>
      Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Cc: Viresh Kumar <viresh.kumar@linaro.org>
      8988cad8
    • sched, sched_ext: Replace scx_next_task_picked() with sched_class->switch_class() · b999e365
      Tejun Heo authored
      scx_next_task_picked() is used by sched_ext to notify the BPF scheduler when
      a CPU is taken away by a task dispatched from a higher priority sched_class
      so that the BPF scheduler can, e.g., punt the task[s] which was running or
      were waiting for the CPU to other CPUs.
      
      Replace the sched_ext specific hook scx_next_task_picked() with a new
      sched_class operation switch_class().
      
      The changes are straightforward and the code looks better afterwards.
      However, when !CONFIG_SCHED_CLASS_EXT, this ends up adding an unused hook
      which is unlikely to be useful to other sched_classes. For further
      discussion on this subject, please refer to the following:
      
        http://lkml.kernel.org/r/CAHk-=wjFPLqo7AXu8maAGEGnOy6reUg-F4zzFhVB0Kyu22h7pw@mail.gmail.com
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      b999e365
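      The replacement is a single optional sched_class callback, roughly:

        struct sched_class {
                /* ... */
                /* a task from a higher priority class is taking over the CPU */
                void (*switch_class)(struct rq *rq, struct task_struct *next);
                /* ... */
        };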
  11. 18 Jun, 2024 9 commits
    • sched_ext: Add selftests · a5db7817
      David Vernet authored
      Add basic selftests.
      Signed-off-by: David Vernet <dvernet@meta.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      a5db7817
    • sched_ext: Documentation: scheduler: Document extensible scheduler class · fa48e8d2
      Tejun Heo authored
      Add Documentation/scheduler/sched-ext.rst which gives a high-level overview
      and pointers to the examples.
      
      v6: - Add paragraph explaining debug dump.
      
      v5: - Updated to reflect /sys/kernel interface change. Kconfig options
            added.
      
      v4: - README improved, reformatted in markdown and renamed to README.md.
      
      v3: - Added tools/sched_ext/README.
      
          - Dropped _example prefix from scheduler names.
      
      v2: - Apply minor edits suggested by Bagas. Caveats section dropped as all
            of them are addressed.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: David Vernet <dvernet@meta.com>
      Acked-by: Josh Don <joshdon@google.com>
      Acked-by: Hao Luo <haoluo@google.com>
      Acked-by: Barret Rhoden <brho@google.com>
      Cc: Bagas Sanjaya <bagasdotme@gmail.com>
      fa48e8d2
    • sched_ext: Add vtime-ordered priority queue to dispatch_q's · 06e51be3
      Tejun Heo authored
      Currently, a dsq is always a FIFO. A task which is dispatched earlier gets
      consumed or executed earlier. While this is sufficient when dsq's are used
      for simple staging areas for tasks which are ready to execute, it'd make
      dsq's a lot more useful if they can implement custom ordering.
      
      This patch adds a vtime-ordered priority queue to dsq's. When the BPF
      scheduler dispatches a task with the new scx_bpf_dispatch_vtime() helper, it
      can specify the vtime that the task should be inserted at, and the task is
      inserted into the priority queue of the dsq, which is ordered according to
      time_before64() comparison of the vtime values.
      
      A DSQ can either be a FIFO or priority queue and automatically switches
      between the two depending on whether scx_bpf_dispatch() or
      scx_bpf_dispatch_vtime() is used. Using the wrong variant while the DSQ
      already has the other type queued is not allowed and triggers an ops error.
      Built-in DSQs must always be FIFOs.
      
      This makes it easy and efficient for BPF schedulers to implement proper
      vtime based scheduling within each dsq, at a negligible cost in terms of
      code complexity and overhead.
      
      scx_simple and scx_example_flatcg are updated to default to weighted
      vtime scheduling (the latter within each cgroup). FIFO scheduling can be
      selected with -f option.
      
      v4: - As allowing mixing priority queue and FIFO on the same DSQ sometimes
            led to unexpected starvations, DSQs now error out if both modes are
            used at the same time and the built-in DSQs are no longer allowed to
            be priority queues.
      
          - Explicit type struct scx_dsq_node added to contain fields needed to be
            linked on DSQs. This will be used to implement stateful iterator.
      
          - Tasks are now always linked on dsq->list whether the DSQ is in FIFO or
            PRIQ mode. This confines PRIQ related complexities to the enqueue and
            dequeue paths. Other paths only need to look at dsq->list. This will
            also ease implementing BPF iterator.
      
          - Print p->scx.dsq_flags in debug dump.
      
      v3: - SCX_TASK_DSQ_ON_PRIQ flag is moved from p->scx.flags into its own
            p->scx.dsq_flags. The flag is protected with the dsq lock unlike other
            flags in p->scx.flags. This led to flag corruption in some cases.
      
          - Add comments explaining the interaction between using consumption of
            p->scx.slice to determine vtime progress and yielding.
      
      v2: - p->scx.dsq_vtime was not initialized on load or across cgroup
            migrations leading to some tasks being stalled for extended period of
            time depending on how saturated the machine is. Fixed.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: David Vernet <dvernet@meta.com>
      06e51be3
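      The scx_simple side of the change looks roughly like this (a sketch
      following the in-tree example; vtime_now and SHARED_DSQ are the example
      scheduler's own globals):

        static bool vtime_before(u64 a, u64 b)
        {
                return (s64)(a - b) < 0;
        }

        void BPF_STRUCT_OPS(simple_enqueue, struct task_struct *p, u64 enq_flags)
        {
                u64 vtime = p->scx.dsq_vtime;

                /* don't let a long-sleeping task hoard an unbounded vtime credit */
                if (vtime_before(vtime, vtime_now - SCX_SLICE_DFL))
                        vtime = vtime_now - SCX_SLICE_DFL;

                scx_bpf_dispatch_vtime(p, SHARED_DSQ, SCX_SLICE_DFL, vtime,
                                       enq_flags);
        }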
    • sched_ext: Implement core-sched support · 7b0888b7
      Tejun Heo authored
      The core-sched support is composed of the following parts:
      
      - task_struct->scx.core_sched_at is added. This is a timestamp which can be
        used to order tasks. Depending on whether the BPF scheduler implements
        custom ordering, it tracks either global FIFO ordering of all tasks or
        local-DSQ ordering within the dispatched tasks on a CPU.
      
      - prio_less() is updated to call scx_prio_less() when comparing SCX tasks.
        scx_prio_less() calls ops.core_sched_before() if available or uses the
        core_sched_at timestamp. For global FIFO ordering, the BPF scheduler
        doesn't need to do anything. Otherwise, it should implement
        ops.core_sched_before() which reflects the ordering.
      
      - When core-sched is enabled, balance_scx() balances all SMT siblings so
        that they all have tasks dispatched if necessary before pick_task_scx() is
        called. pick_task_scx() picks between the current task and the first
        dispatched task on the local DSQ based on availability and the
        core_sched_at timestamps. Note that FIFO ordering is expected among the
        already dispatched tasks whether running or on the local DSQ, so this path
        always compares core_sched_at instead of calling into
        ops.core_sched_before().
      
      qmap_core_sched_before() is added to scx_qmap. It scales the
      distances from the heads of the queues to compare the tasks across different
      priority queues and seems to behave as expected.
      
      v3: Fixed build error when !CONFIG_SCHED_SMT reported by Andrea Righi.
      
      v2: Sched core added the const qualifiers to prio_less task arguments.
          Explicitly drop them for ops.core_sched_before() task arguments. BPF
          enforces access control through the verifier, so the qualifier isn't
          actually operative and only gets in the way when interacting with
          various helpers.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: David Vernet <dvernet@meta.com>
      Reviewed-by: Josh Don <joshdon@google.com>
      Cc: Andrea Righi <andrea.righi@canonical.com>
      7b0888b7
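      For a scheduler that implements custom ordering, the operation's shape is
      as follows. A sketch; the ordering key here is hypothetical, scx_qmap
      derives it from scaled queue positions as described above:

        bool BPF_STRUCT_OPS(qmap_core_sched_before,
                            struct task_struct *a, struct task_struct *b)
        {
                /* true if @a should run before @b on the sibling CPUs */
                return task_order_key(a) < task_order_key(b);   /* hypothetical helper */
        }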
    • sched_ext: Bypass BPF scheduler while PM events are in progress · 0fd55582
      Tejun Heo authored
      PM operations freeze userspace. Some BPF schedulers have an active
      userspace component and may not behave as expected across PM events. While
      the system
      is frozen, nothing too interesting is happening in terms of scheduling and
      we can get by just fine with the fallback FIFO behavior. Let's make things
      easier by always bypassing the BPF scheduler while PM events are in
      progress.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: David Vernet <dvernet@meta.com>
      0fd55582
    • sched_ext: Implement sched_ext_ops.cpu_online/offline() · 60c27fb5
      Tejun Heo authored
      Add ops.cpu_online/offline() which are invoked when CPUs come online and
      offline respectively. As the enqueue path already automatically bypasses
      tasks to the local dsq on a deactivated CPU, BPF schedulers are guaranteed
      to see tasks only on CPUs which are between online() and offline().
      
      If the BPF scheduler doesn't implement ops.cpu_online/offline(), the
      scheduler is automatically exited with SCX_ECODE_RESTART |
      SCX_ECODE_RSN_HOTPLUG. Userspace can implement CPU hotplug support
      trivially by simply reinitializing and reloading the scheduler.
      
      scx_qmap is updated to print out online CPUs on hotplug events. Other
      schedulers are updated to restart based on ecode.
      
      v3: - The previous implementation added @reason to
            sched_class.rq_on/offline() to distinguish between CPU hotplug events
            and topology updates. This was buggy and fragile as the methods are
            skipped if the current state equals the target state. Instead, add
            scx_rq_[de]activate() which are directly called from
            sched_cpu_de/activate(). This also allows ops.cpu_on/offline() to
            sleep which can be useful.
      
          - ops.dispatch() could be called on a CPU that the BPF scheduler was
            told to be offline. The dispatch path is updated to bypass in such
            cases.
      
      v2: - To accommodate lock ordering change between scx_cgroup_rwsem and
            cpus_read_lock(), CPU hotplug operations are put into their own SCX_OPI
            block and enabled earlier during scx_ops_enable() so that
            cpus_read_lock() can be dropped before acquiring scx_cgroup_rwsem.
      
          - Auto exit with ECODE added.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: David Vernet <dvernet@meta.com>
      Acked-by: Josh Don <joshdon@google.com>
      Acked-by: Hao Luo <haoluo@google.com>
      Acked-by: Barret Rhoden <brho@google.com>
      60c27fb5
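      The hooks themselves are simple. A sketch of scx_qmap-style usage; per the
      v3 note above the operations may sleep, hence the sleepable variant of the
      struct-ops macro:

        void BPF_STRUCT_OPS_SLEEPABLE(qmap_cpu_online, s32 cpu)
        {
                bpf_printk("cpu%d came online", cpu);
        }

        void BPF_STRUCT_OPS_SLEEPABLE(qmap_cpu_offline, s32 cpu)
        {
                bpf_printk("cpu%d went offline", cpu);
        }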
    • sched_ext: Implement sched_ext_ops.cpu_acquire/release() · 245254f7
      David Vernet authored
      Scheduler classes are strictly ordered and when a higher priority class has
      tasks to run, the lower priority ones lose access to the CPU. Being able to
      monitor and act on these events is necessary for use cases including
      strict core-scheduling and latency management.
      
      This patch adds two operations ops.cpu_acquire() and .cpu_release(). The
      former is invoked when a CPU becomes available to the BPF scheduler and the
      opposite for the latter. This patch also implements
      scx_bpf_reenqueue_local() which can be called from .cpu_release() to trigger
      requeueing of all tasks in the local dsq of the CPU so that the tasks can be
      reassigned to other available CPUs.
      
      scx_pair is updated to use .cpu_acquire/release() along with
      %SCX_KICK_WAIT to make the pair scheduling guarantee strict even when a CPU
      is preempted by a higher priority scheduler class.
      
      scx_qmap is updated to use .cpu_acquire/release() to empty the local
      dsq of a preempted CPU. A similar approach can be adopted by BPF schedulers
      that want to have a tight control over latency.
      
      v4: Use the new SCX_KICK_IDLE to wake up a CPU after re-enqueueing.
      
      v3: Drop the const qualifier from scx_cpu_release_args.task. BPF enforces
          access control through the verifier, so the qualifier isn't actually
          operative and only gets in the way when interacting with various
          helpers.
      
      v2: Add p->scx.kf_mask annotation to allow calling scx_bpf_reenqueue_local()
          from ops.cpu_release() nested inside ops.init() and other sleepable
          operations.
      Signed-off-by: David Vernet <dvernet@meta.com>
      Reviewed-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Josh Don <joshdon@google.com>
      Acked-by: Hao Luo <haoluo@google.com>
      Acked-by: Barret Rhoden <brho@google.com>
      245254f7
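      scx_qmap's use of the new pair is essentially the following (a sketch;
      kfunc and struct names per the tools/sched_ext common.bpf.h header):

        void BPF_STRUCT_OPS(qmap_cpu_release, s32 cpu, struct scx_cpu_release_args *args)
        {
                u32 cnt;

                /*
                 * A higher priority class took the CPU; pull the local DSQ back
                 * so the tasks can be redispatched to CPUs we still own.
                 */
                cnt = scx_bpf_reenqueue_local();
                if (cnt)
                        bpf_printk("cpu%d released, re-enqueued %u tasks", cpu, cnt);
        }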
    • sched_ext: Implement SCX_KICK_WAIT · 90e55164
      David Vernet authored
      If set when calling scx_bpf_kick_cpu(), the invoking CPU will busy wait for
      the kicked cpu to enter the scheduler. See the following for example usage:
      
        https://github.com/sched-ext/scx/blob/main/scheds/c/scx_pair.bpf.c
      
      v2: - Updated to fit the updated kick_cpus_irq_workfn() implementation.
      
          - Include SCX_KICK_WAIT related information in debug dump.
      Signed-off-by: David Vernet <dvernet@meta.com>
      Reviewed-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Josh Don <joshdon@google.com>
      Acked-by: Hao Luo <haoluo@google.com>
      Acked-by: Barret Rhoden <brho@google.com>
      90e55164
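      In use it is just another flag to scx_bpf_kick_cpu(), e.g. (a sketch;
      kfunc and flag names per the tools/sched_ext common.bpf.h header):

        /* preempt the sibling and spin until it has re-entered the scheduler */
        scx_bpf_kick_cpu(sibling_cpu, SCX_KICK_PREEMPT | SCX_KICK_WAIT);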
    • sched_ext: Track tasks that are subjects of the in-flight SCX operation · 36454023
      Tejun Heo authored
      When some SCX operations are in flight, it is known that the subject task's
      rq lock is held throughout which makes it safe to access certain fields of
      the task - e.g. its current task_group. We want to add SCX kfunc helpers
      that can make use of this guarantee - e.g. to help determine the currently
      associated CPU cgroup from the task's current task_group.
      
      As it'd be dangerous to call such a helper on a task which isn't rq lock
      protected, the helper should be able to verify the input task and reject
      accordingly. This patch adds sched_ext_entity.kf_tasks[] that track the
      tasks which are currently being operated on by a terminal SCX operation. The
      new SCX_CALL_OP_[2]TASK[_RET]() can be used when invoking SCX operations
      which take tasks as arguments and the scx_kf_allowed_on_arg_tasks() can be
      used by kfunc helpers to verify the input task status.
      
      Note that as sched_ext_entity.kf_tasks[] can't handle nesting, the tracking
      is currently only limited to terminal SCX operations. If needed in the
      future, this restriction can be removed by moving the tracking to the task
      side with a couple per-task counters.
      
      v2: Updated to reflect the addition of SCX_KF_SELECT_CPU.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: David Vernet <dvernet@meta.com>
      36454023