1. 09 Jul, 2024 1 commit
    • sched_ext: Take out ->priq and ->flags from scx_dsq_node · d4af01c3
      Tejun Heo authored
      struct scx_dsq_node contains two data structure nodes to link the containing
      task to a DSQ and a flags field that is protected by the lock of the
      associated DSQ. One reason why they are grouped into a struct is to use the
      type independently as a cursor node when iterating tasks on a DSQ. However,
      when iterating, the cursor only needs to be linked on the FIFO list, and the
      rb_node part needlessly inflates the size of the iterator data structure,
      making it potentially too expensive to place on the stack.
      
      Take ->priq and ->flags out of scx_dsq_node and put them in sched_ext_entity
      as ->dsq_priq and ->dsq_flags, respectively. scx_dsq_node is renamed to
      scx_dsq_list_node and the field names are renamed accordingly. This will
      help implement a DSQ task iterator that can be allocated on the stack.
      
      No functional change intended.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Suggested-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Cc: David Vernet <void@manifault.com>
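The size argument in the commit above can be illustrated with a rough sketch using Python's ctypes. The layouts are assumptions made for illustration only: a list node is modeled as two pointers, an rb_node as three pointer-sized words, and the flags field as 32 bits. The class names and the "fifo" field name are hypothetical stand-ins, not the kernel definitions.

```python
import ctypes


class ListHead(ctypes.Structure):
    # Assumed layout of a doubly linked list node: next and prev pointers.
    _fields_ = [("next", ctypes.c_void_p), ("prev", ctypes.c_void_p)]


class RbNode(ctypes.Structure):
    # Assumed layout of a red-black tree node: parent/color word, two children.
    _fields_ = [
        ("parent_color", ctypes.c_void_p),
        ("right", ctypes.c_void_p),
        ("left", ctypes.c_void_p),
    ]


class OldDsqNode(ctypes.Structure):
    # Models the node before the change: FIFO link, priority-queue link, flags.
    _fields_ = [("fifo", ListHead), ("priq", RbNode), ("flags", ctypes.c_uint32)]


class NewDsqListNode(ctypes.Structure):
    # Models the node after the change: only the FIFO link remains.
    _fields_ = [("node", ListHead)]


def cursor_sizes():
    """Return (old_size, new_size) in bytes for a cursor embedding each node."""
    return ctypes.sizeof(OldDsqNode), ctypes.sizeof(NewDsqListNode)
```

Under these assumed layouts, a cursor that embeds only the list node shrinks to two pointers, which is the property an on-stack iterator needs.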
  2. 08 Jul, 2024 8 commits
    • sched, sched_ext: Move some declarations from kernel/sched/ext.h to sched.h · e196c908
      Tejun Heo authored
      While sched_ext was out of tree, everything sched_ext specific which can be
      put in kernel/sched/ext.h was put there to ease forward porting. However,
      kernel/sched/sched.h is the better location for some of them. Relocate.
      
      - struct sched_enq_and_set_ctx, sched_deq_and_put_task() and
        sched_enq_and_set_task().
      
      - scx_enabled() and scx_switched_all().
      
      - for_active_class_range() and for_each_active_class(). sched_class
        declarations are moved above the class iterators for this.
      
      No functional changes intended.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: David Vernet <void@manifault.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
    • sched, sched_ext: Open code for_balance_class_range() · 744d8360
      Tejun Heo authored
      For flexibility, sched_ext allows the BPF scheduler to select the CPU to
      execute a task on at dispatch time so that e.g. a queue can be shared across
      multiple CPUs. To enable this, the dispatch path is executed from balance()
      so that a dispatched task can be hot-migrated to its target CPU. This means
      that sched_ext needs its balance() method invoked before every
      pick_next_task() even when the CPU is waking up from SCHED_IDLE.
      
      for_balance_class_range() defined in kernel/sched/ext.h implements this
      selective iteration promotion. However, the indirection obscures more than
      it helps. Open code the iteration promotion in put_prev_task_balance() and
      remove for_balance_class_range().
      
      No functional changes intended.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
      Acked-by: David Vernet <void@manifault.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
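The iteration promotion described in the commit above can be modeled roughly as follows. This is a sketch under assumptions: the class order, the exclusion of the idle class from the range, and the skipping of the ext class while sched_ext is disabled are all illustrative choices, and the function names are hypothetical.

```python
# Assumed priority order, highest first; the real set depends on kernel config.
CLASS_ORDER = ["stop", "dl", "rt", "fair", "ext", "idle"]


def class_above(a, b):
    """True if class a has higher priority than class b."""
    return CLASS_ORDER.index(a) < CLASS_ORDER.index(b)


def balance_classes(prev_class, scx_enabled):
    """List the classes whose balance() may be visited for a given prev class.

    Iteration normally starts at prev's class. When sched_ext is enabled and
    prev's class ranks below ext, the start is promoted to ext so that its
    balance() always gets a chance to run.
    """
    start = prev_class
    if scx_enabled and class_above("ext", start):
        start = "ext"
    begin = CLASS_ORDER.index(start)
    end = CLASS_ORDER.index("idle")  # the idle class itself is not visited
    # Assumption: the ext class is skipped entirely while sched_ext is disabled.
    return [c for c in CLASS_ORDER[begin:end] if scx_enabled or c != "ext"]
```

The returned list is an upper bound on what is visited: the real loop stops as soon as one class reports that it has work.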
    • sched_ext: Minor cleanups in kernel/sched/ext.h · 6ab228ec
      Tejun Heo authored
      - scx_ops_cpu_preempt is only used in kernel/sched/ext.c and doesn't need to
        be global. Make it static.
      
      - Relocate task_on_scx() so that the inline functions are located together.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: David Vernet <void@manifault.com>
    • sched_ext: Disallow loading BPF scheduler if isolcpus= domain isolation is in effect · 9f391f94
      Tejun Heo authored
      sched_domains regulate the load balancing for sched_classes. A machine can
      be partitioned into multiple sections that are not load-balanced across
      using either isolcpus= boot param or cpuset partitions. In such cases, tasks
      that are in one partition are expected to stay within that partition.
      
      cpuset configured partitions are always reflected in each member task's
      cpumask. As SCX always honors the task cpumasks, the BPF scheduler is
      automatically in compliance with the configured partitions.
      
      However, for isolcpus= domain isolation, the isolated CPUs are simply
      omitted from the top-level sched_domain[s] without further restrictions on
      tasks' cpumasks, so, for example, a task currently running on an isolated
      CPU may have more CPUs in its allowed cpumask while expected to remain on
      the same CPU.
      
      There is no straightforward way to enforce this partitioning preemptively on
      BPF schedulers and erroring out after a violation can be surprising.
      isolcpus= domain isolation is being replaced with cpuset partitions anyway,
      so keep it simple and simply disallow loading a BPF scheduler if isolcpus=
      domain isolation is in effect.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Link: http://lkml.kernel.org/r/20240626082342.GY31592@noisy.programming.kicks-ass.net
      Cc: David Vernet <void@manifault.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Frederic Weisbecker <frederic@kernel.org>
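The load-time check described in the commit above amounts to comparing the set of CPUs that still take part in domain load balancing against all possible CPUs. The sketch below models that idea with plain sets. The function name, its parameters, and the choice of -EINVAL as the error code are assumptions made for illustration.

```python
import errno


def check_domain_isolation(possible_cpus, domain_balanced_cpus):
    """Return 0 if a BPF scheduler may load, or a negative errno otherwise.

    domain_balanced_cpus models the CPUs left in the load-balanced set after
    isolcpus= domain isolation. Any missing CPU means isolation is in effect.
    """
    if set(domain_balanced_cpus) != set(possible_cpus):
        return -errno.EINVAL  # assumed error code
    return 0
```

For example, on a four-CPU machine booted with two CPUs isolated, the balanced set would be smaller than the possible set and loading would be refused.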
    • sched_ext: Account for idle policy when setting p->scx.weight in scx_ops_enable_task() · e98abd22
      Tejun Heo authored
      When initializing p->scx.weight, scx_ops_enable_task() wasn't considering
      whether the task is SCHED_IDLE. Update it to use WEIGHT_IDLEPRIO as the
      source weight for SCHED_IDLE tasks. This leaves reweight_task_scx() the sole
      user of set_task_scx_weight(). Open code it. @weight is going to be provided
      by sched core in the future anyway.
      
      v2: Use the newly available @lw->weight to set @p->scx.weight in
          reweight_task_scx().
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: David Vernet <void@manifault.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
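The weight selection described in the commit above can be sketched as follows. The numeric constants are assumptions recalled for illustration (an idle-policy weight of 3, a nice-0 weight of 1024, and a cgroup-style scale of 1 to 10000 with a default of 100), and the function names are hypothetical.

```python
WEIGHT_IDLEPRIO = 3     # assumed load weight used for SCHED_IDLE tasks
NICE_0_WEIGHT = 1024    # assumed load weight of a nice-0 task
CGROUP_WEIGHT_MIN = 1   # assumed bounds and default of the target scale
CGROUP_WEIGHT_DFL = 100
CGROUP_WEIGHT_MAX = 10000


def weight_to_cgroup(weight):
    """Scale a load weight so that NICE_0_WEIGHT maps to CGROUP_WEIGHT_DFL."""
    # Integer division rounded to the closest value, then clamped to the range.
    scaled = (weight * CGROUP_WEIGHT_DFL + NICE_0_WEIGHT // 2) // NICE_0_WEIGHT
    return max(CGROUP_WEIGHT_MIN, min(CGROUP_WEIGHT_MAX, scaled))


def initial_scx_weight(is_idle_policy, nice_weight):
    """Pick the source weight first, then convert it to the sched_ext scale."""
    source = WEIGHT_IDLEPRIO if is_idle_policy else nice_weight
    return weight_to_cgroup(source)
```

With these assumed constants, a nice-0 task gets weight 100, while a SCHED_IDLE task is clamped to the minimum of 1 regardless of its nice value.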
    • sched, sched_ext: Simplify dl_prio() case handling in sched_fork() · 60564acb
      Tejun Heo authored
      sched_fork() returns with -EAGAIN if dl_prio(@p). a7a9fc54 ("sched_ext:
      Add boilerplate for extensible scheduler class") added scx_pre_fork() call
      before it and then scx_cancel_fork() on the exit path. This is silly as the
      dl_prio() block can just be moved above the scx_pre_fork() call.
      
      Move the dl_prio() block above the scx_pre_fork() call and remove the now
      unnecessary scx_cancel_fork() invocation.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Suggested-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: David Vernet <void@manifault.com>
    • sched/ext: Add BPF function to fetch rq · 6203ef73
      Hongyan Xia authored
      rq contains many useful fields to implement a custom scheduler. For
      example, various clock signals like clock_task and clock_pelt can be
      used to track load. It also contains stats in other sched_classes, which
      are useful to drive scheduling decisions in ext.
      
      tj: Put the new helper below scx_bpf_task_*() helpers.
      Signed-off-by: Hongyan Xia <hongyan.xia2@arm.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • Merge branch 'sched/core' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip into for-6.11 · 7b9f6c86
      Tejun Heo authored
      d3296052 ("sched/fair: set_load_weight() must also call reweight_task()
      for SCHED_IDLE tasks") applied to sched/core changes how reweight_task() is
      called causing conflicts with e83edbf8 ("sched: Add
      sched_class->reweight_task()"). Resolve the conflicts by taking
      set_load_weight() changes from d3296052 and updating
      sched_class->reweight_task() to take pointer to struct load_weight instead
      of int prio.
      
      Signed-off-by: Tejun Heo <tj@kernel.org>
  3. 04 Jul, 2024 2 commits
  4. 02 Jul, 2024 1 commit
  5. 01 Jul, 2024 1 commit
  6. 27 Jun, 2024 3 commits
  7. 25 Jun, 2024 1 commit
    • sched_ext: Drop tools_clean target from the top-level Makefile · eb4a3b62
      Tejun Heo authored
      2a52ca7c ("sched_ext: Add scx_simple and scx_example_qmap example
      schedulers") added the tools_clean target which is triggered by mrproper.
      The tools_clean target triggers the sched_ext_clean target in tools/. This
      unfortunately makes mrproper fail when no BTF enabled kernel image is found:
      
        Makefile:83: *** Cannot find a vmlinux for VMLINUX_BTF at any of "  ../../vmlinux /sys/kernel/btf/vmlinux/boot/vmlinux-4.15.0-136-generic".  Stop.
        Makefile:192: recipe for target 'sched_ext_clean' failed
        make[2]: *** [sched_ext_clean] Error 2
        Makefile:1361: recipe for target 'sched_ext' failed
        make[1]: *** [sched_ext] Error 2
        Makefile:240: recipe for target '__sub-make' failed
        make: *** [__sub-make] Error 2
      
      Clean targets shouldn't fail like this but also it's really odd for mrproper
      to single out and trigger the sched_ext_clean target when no other clean
      targets under tools/ are triggered.
      
      Fix builds by dropping the tools_clean target from the top-level Makefile.
      The offending Makefile line is shared across BPF targets under tools/. Let's
      revisit them later.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-by: Jon Hunter <jonathanh@nvidia.com>
      Link: http://lkml.kernel.org/r/ac065f1f-8754-4626-95db-2c9fcf02567b@nvidia.com
      Fixes: 2a52ca7c ("sched_ext: Add scx_simple and scx_example_qmap example schedulers")
      Cc: David Vernet <void@manifault.com>
  8. 23 Jun, 2024 1 commit
    • sched_ext: Make scx_bpf_cpuperf_set() @cpu arg signed · 8a6c6b4b
      David Vernet authored
      The scx_bpf_cpuperf_set() kfunc allows a BPF program to set the relative
      performance target of a specified CPU. Commit d86adb4f ("sched_ext: Add
      cpuperf support") defined the @cpu argument to be unsigned. Let's update it
      to be signed to match the norm for the rest of ext.c and the kernel.
      
      Note that the kfunc declaration of scx_bpf_cpuperf_set() in the
      common.bpf.h header in tools/sched_ext already listed the cpu as signed, so
      this also fixes the build for tools/sched_ext and the sched_ext selftests
      due to kfunc declarations now being emitted in vmlinux.h based on BTF (thus
      causing the compiler to error due to observing conflicting types).
      
      Fixes: d86adb4f ("sched_ext: Add cpuperf support")
      Signed-off-by: David Vernet <void@manifault.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
  9. 21 Jun, 2024 3 commits
    • sched_ext: Add cpuperf support · d86adb4f
      Tejun Heo authored
      sched_ext currently does not integrate with schedutil. When schedutil is the
      governor, frequencies are left unregulated and usually get stuck close to
      the highest performance level from running RT tasks.
      
      Add CPU performance monitoring and scaling support by integrating into
      schedutil. The following kfuncs are added:
      
      - scx_bpf_cpuperf_cap(): Query the relative performance capacity of
        different CPUs in the system.
      
      - scx_bpf_cpuperf_cur(): Query the current performance level of a CPU
        relative to its max performance.
      
      - scx_bpf_cpuperf_set(): Set the current target performance level of a CPU.
      
      This gives direct control over CPU performance setting to the BPF scheduler.
      The only changes on the schedutil side are accounting for the utilization
      factor from sched_ext and disabling frequency holding heuristics, as they may
      not apply well to sched_ext schedulers, which may have a much weaker
      connection between tasks and their current / last CPU.
      
      With cpuperf support added, there is no reason to block uclamp. Enable while
      at it.
      
      A toy implementation of cpuperf is added to scx_qmap as a demonstration of
      the feature.
      
      v2: Ignore cpu_util_cfs_boost() when scx_switched_all() in sugov_get_util()
          to avoid factoring in stale util metric. (Christian)
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: David Vernet <dvernet@meta.com>
      Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Cc: Viresh Kumar <viresh.kumar@linaro.org>
      Cc: Christian Loehle <christian.loehle@arm.com>
    • cpufreq_schedutil: Refactor sugov_cpu_is_busy() · 8988cad8
      Tejun Heo authored
      sugov_cpu_is_busy() is used to avoid decreasing performance level while the
      CPU is busy and called by sugov_update_single_freq() and
      sugov_update_single_perf(). Both callers repeat the same pattern to first
      test for uclamp and then the busyness. Let's refactor so that the tests
      aren't repeated.
      
      The new helper is named sugov_hold_freq() and tests both the uclamp
      exception and CPU busyness. No functional changes. This will make adding
      more exception conditions easier.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: David Vernet <dvernet@meta.com>
      Reviewed-by: Christian Loehle <christian.loehle@arm.com>
      Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Cc: Viresh Kumar <viresh.kumar@linaro.org>
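The combined test described in the commit above can be sketched as a single predicate. This is a simplified model, not the kernel helper: treating "busy" as "the idle-call counter has not advanced since the last check" is an assumption, and the parameter names are hypothetical.

```python
def hold_freq(uclamp_capped, idle_calls, saved_idle_calls):
    """Return True if a frequency decrease should be held off.

    A CPU capped by uclamp is never held (the exception comes first).
    Otherwise the CPU counts as busy when it has not gone idle since the
    previous check, modeled here by an unchanged idle-call counter.
    """
    if uclamp_capped:
        return False
    return idle_calls == saved_idle_calls
```

Folding both tests into one predicate means a further exception condition only needs one more early return instead of a change in every caller.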
    • sched, sched_ext: Replace scx_next_task_picked() with sched_class->switch_class() · b999e365
      Tejun Heo authored
      scx_next_task_picked() is used by sched_ext to notify the BPF scheduler when
      a CPU is taken away by a task dispatched from a higher priority sched_class
      so that the BPF scheduler can, e.g., punt the task[s] which was running or
      were waiting for the CPU to other CPUs.
      
      Replace the sched_ext specific hook scx_next_task_picked() with a new
      sched_class operation switch_class().
      
      The changes are straightforward and the code looks better afterwards.
      However, when !CONFIG_SCHED_CLASS_EXT, this ends up adding an unused hook
      which is unlikely to be useful to other sched_classes. For further
      discussion on this subject, please refer to the following:
      
        http://lkml.kernel.org/r/CAHk-=wjFPLqo7AXu8maAGEGnOy6reUg-F4zzFhVB0Kyu22h7pw@mail.gmail.com
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
  10. 18 Jun, 2024 19 commits
    • sched_ext: Add selftests · a5db7817
      David Vernet authored
      Add basic selftests.
      Signed-off-by: David Vernet <dvernet@meta.com>
      Acked-by: Tejun Heo <tj@kernel.org>
    • sched_ext: Documentation: scheduler: Document extensible scheduler class · fa48e8d2
      Tejun Heo authored
      Add Documentation/scheduler/sched-ext.rst which gives a high-level overview
      and pointers to the examples.
      
      v6: - Add paragraph explaining debug dump.
      
      v5: - Updated to reflect /sys/kernel interface change. Kconfig options
            added.
      
      v4: - README improved, reformatted in markdown and renamed to README.md.
      
      v3: - Added tools/sched_ext/README.
      
          - Dropped _example prefix from scheduler names.
      
      v2: - Apply minor edits suggested by Bagas. Caveats section dropped as all
            of them are addressed.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: David Vernet <dvernet@meta.com>
      Acked-by: Josh Don <joshdon@google.com>
      Acked-by: Hao Luo <haoluo@google.com>
      Acked-by: Barret Rhoden <brho@google.com>
      Cc: Bagas Sanjaya <bagasdotme@gmail.com>
    • sched_ext: Add vtime-ordered priority queue to dispatch_q's · 06e51be3
      Tejun Heo authored
      Currently, a dsq is always a FIFO. A task which is dispatched earlier gets
      consumed or executed earlier. While this is sufficient when dsq's are used
      for simple staging areas for tasks which are ready to execute, it'd make
      dsq's a lot more useful if they can implement custom ordering.
      
      This patch adds a vtime-ordered priority queue to dsq's. When the BPF
      scheduler dispatches a task with the new scx_bpf_dispatch_vtime() helper, it
      can specify the vtime that the task should be inserted at and the task is
      inserted into the priority queue in the dsq which is ordered according to
      time_before64() comparison of the vtime values.
      
      A DSQ can either be a FIFO or priority queue and automatically switches
      between the two depending on whether scx_bpf_dispatch() or
      scx_bpf_dispatch_vtime() is used. Using the wrong variant while the DSQ
      already has the other type queued is not allowed and triggers an ops error.
      Built-in DSQs must always be FIFOs.
      
      This makes it easy and efficient for BPF schedulers to implement proper
      vtime-based scheduling within each dsq, at a negligible cost in terms of
      code complexity and overhead.
      
      scx_simple and scx_example_flatcg are updated to default to weighted
      vtime scheduling (the latter within each cgroup). FIFO scheduling can be
      selected with -f option.
      
      v4: - As allowing mixing priority queue and FIFO on the same DSQ sometimes
            led to unexpected starvations, DSQs now error out if both modes are
            used at the same time and the built-in DSQs are no longer allowed to
            be priority queues.
      
          - Explicit type struct scx_dsq_node added to contain fields needed to be
            linked on DSQs. This will be used to implement stateful iterator.
      
          - Tasks are now always linked on dsq->list whether the DSQ is in FIFO or
            PRIQ mode. This confines PRIQ related complexities to the enqueue and
            dequeue paths. Other paths only need to look at dsq->list. This will
            also ease implementing BPF iterator.
      
          - Print p->scx.dsq_flags in debug dump.
      
      v3: - SCX_TASK_DSQ_ON_PRIQ flag is moved from p->scx.flags into its own
            p->scx.dsq_flags. The flag is protected with the dsq lock unlike other
            flags in p->scx.flags. This led to flag corruption in some cases.
      
          - Add comments explaining the interaction between using consumption of
            p->scx.slice to determine vtime progress and yielding.
      
      v2: - p->scx.dsq_vtime was not initialized on load or across cgroup
            migrations leading to some tasks being stalled for extended period of
            time depending on how saturated the machine is. Fixed.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: David Vernet <dvernet@meta.com>
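The ordering and mode rules described in the commit above can be modeled with a toy queue. This is a sketch, not the kernel data structure: ToyDsq and its methods are hypothetical names, a plain list stands in for the list plus rbtree pair, and time_before64() is re-implemented here as a wrap-safe 64-bit comparison.

```python
U64_MASK = (1 << 64) - 1


def time_before64(a, b):
    """Wrap-safe test that 64-bit counter a precedes b (signed difference < 0)."""
    return ((a - b) & U64_MASK) >= (1 << 63)


class ToyDsq:
    """Toy dispatch queue that is either a FIFO or a vtime-ordered queue."""

    def __init__(self):
        self.mode = None   # "fifo" or "priq" while tasks are queued
        self.tasks = []    # (vtime, name) pairs, head of the queue first

    def _enter(self, mode):
        # Mixing the two dispatch variants on a non-empty queue is an error.
        if self.tasks and self.mode != mode:
            raise RuntimeError("DSQ already holds tasks queued in the other mode")
        self.mode = mode

    def dispatch(self, name):
        self._enter("fifo")
        self.tasks.append((None, name))

    def dispatch_vtime(self, name, vtime):
        self._enter("priq")
        # Insert ahead of the first task whose vtime is later than ours, so
        # tasks with equal vtime keep their arrival order.
        pos = len(self.tasks)
        for i, (queued_vtime, _) in enumerate(self.tasks):
            if time_before64(vtime, queued_vtime):
                pos = i
                break
        self.tasks.insert(pos, (vtime, name))

    def consume(self):
        """Remove and return the task at the head, or None if empty."""
        return self.tasks.pop(0)[1] if self.tasks else None
```

Because the comparison works on the wrapped difference, a vtime just below the 64-bit limit still sorts ahead of a small vtime that was reached after the counter wrapped.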
    • sched_ext: Implement core-sched support · 7b0888b7
      Tejun Heo authored
      The core-sched support is composed of the following parts:
      
      - task_struct->scx.core_sched_at is added. This is a timestamp which can be
        used to order tasks. Depending on whether the BPF scheduler implements
        custom ordering, it tracks either global FIFO ordering of all tasks or
        local-DSQ ordering within the dispatched tasks on a CPU.
      
      - prio_less() is updated to call scx_prio_less() when comparing SCX tasks.
        scx_prio_less() calls ops.core_sched_before() if available or uses the
        core_sched_at timestamp. For global FIFO ordering, the BPF scheduler
        doesn't need to do anything. Otherwise, it should implement
        ops.core_sched_before() which reflects the ordering.
      
      - When core-sched is enabled, balance_scx() balances all SMT siblings so
        that they all have tasks dispatched if necessary before pick_task_scx() is
        called. pick_task_scx() picks between the current task and the first
        dispatched task on the local DSQ based on availability and the
        core_sched_at timestamps. Note that FIFO ordering is expected among the
        already dispatched tasks whether running or on the local DSQ, so this path
        always compares core_sched_at instead of calling into
        ops.core_sched_before().
      
      qmap_core_sched_before() is added to scx_qmap. It scales the
      distances from the heads of the queues to compare the tasks across different
      priority queues and seems to behave as expected.
      
      v3: Fixed build error when !CONFIG_SCHED_SMT reported by Andrea Righi.
      
      v2: Sched core added the const qualifiers to prio_less task arguments.
          Explicitly drop them for ops.core_sched_before() task arguments. BPF
          enforces access control through the verifier, so the qualifier isn't
          actually operative and only gets in the way when interacting with
          various helpers.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: David Vernet <dvernet@meta.com>
      Reviewed-by: Josh Don <joshdon@google.com>
      Cc: Andrea Righi <andrea.righi@canonical.com>
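The two ordering decisions described in the commit above can be sketched as follows. Tasks are modeled as dicts carrying a core_sched_at stamp. The function names are hypothetical, and keeping the current task on a timestamp tie is an assumption.

```python
U64_MASK = (1 << 64) - 1


def time_before64(a, b):
    """Wrap-safe test that 64-bit stamp a precedes b."""
    return ((a - b) & U64_MASK) >= (1 << 63)


def runs_before(a, b, core_sched_before=None):
    """Decide whether task a should run before task b.

    If the scheduler supplies an ordering callback it is used; otherwise the
    earlier core_sched_at stamp wins, which gives FIFO ordering.
    """
    if core_sched_before is not None:
        return core_sched_before(a, b)
    return time_before64(a["core_sched_at"], b["core_sched_at"])


def pick_task(curr, first):
    """Pick between the current task and the head of the local queue.

    Either may be None (unavailable). Already dispatched tasks are compared
    by timestamp only, never through the callback.
    """
    if curr is None or first is None:
        return curr if curr is not None else first
    if time_before64(first["core_sched_at"], curr["core_sched_at"]):
        return first
    return curr  # assumption: ties keep the current task
```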
    • sched_ext: Bypass BPF scheduler while PM events are in progress · 0fd55582
      Tejun Heo authored
      PM operations freeze userspace. Some BPF schedulers have an active userspace
      component and, unsurprisingly, may misbehave across PM events. While the system
      is frozen, nothing too interesting is happening in terms of scheduling and
      we can get by just fine with the fallback FIFO behavior. Let's make things
      easier by always bypassing the BPF scheduler while PM events are in
      progress.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: David Vernet <dvernet@meta.com>
    • sched_ext: Implement sched_ext_ops.cpu_online/offline() · 60c27fb5
      Tejun Heo authored
      Add ops.cpu_online/offline() which are invoked when CPUs come online and
      offline respectively. As the enqueue path already automatically bypasses
      tasks to the local dsq on a deactivated CPU, BPF schedulers are guaranteed
      to see tasks only on CPUs which are between online() and offline().
      
      If the BPF scheduler doesn't implement ops.cpu_online/offline(), the
      scheduler is automatically exited with SCX_ECODE_RESTART |
      SCX_ECODE_RSN_HOTPLUG. Userspace can implement CPU hotplug support
      trivially by simply reinitializing and reloading the scheduler.
      
      scx_qmap is updated to print out online CPUs on hotplug events. Other
      schedulers are updated to restart based on ecode.
      
      v3: - The previous implementation added @reason to
            sched_class.rq_on/offline() to distinguish between CPU hotplug events
            and topology updates. This was buggy and fragile as the methods are
            skipped if the current state equals the target state. Instead, add
            scx_rq_[de]activate() which are directly called from
            sched_cpu_de/activate(). This also allows ops.cpu_on/offline() to
            sleep which can be useful.
      
          - ops.dispatch() could be called on a CPU that the BPF scheduler was
            told to be offline. The dispatch path is updated to bypass in such
            cases.
      
      v2: - To accommodate lock ordering change between scx_cgroup_rwsem and
            cpus_read_lock(), CPU hotplug operations are put into their own SCX_OPI
            block and enabled earlier during scx_ops_enable() so that
            cpus_read_lock() can be dropped before acquiring scx_cgroup_rwsem.
      
          - Auto exit with ECODE added.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: David Vernet <dvernet@meta.com>
      Acked-by: Josh Don <joshdon@google.com>
      Acked-by: Hao Luo <haoluo@google.com>
      Acked-by: Barret Rhoden <brho@google.com>
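The restart-on-hotplug pattern mentioned in the commit above can be sketched from the userspace side. This is an illustrative model only: the bit positions of the exit-code flags are assumptions, and run_scheduler_once is a hypothetical callable that loads the scheduler, blocks until it exits, and returns the exit code.

```python
# Assumed bit positions for the exit-code fields; illustrative only.
SCX_ECODE_RSN_HOTPLUG = 1 << 32
SCX_ECODE_RESTART = 1 << 48


def should_restart(exit_code):
    """True if the exit code asks userspace to reload the scheduler."""
    return bool(exit_code & SCX_ECODE_RESTART)


def run_until_done(run_scheduler_once, max_restarts=8):
    """Reload the scheduler for as long as its exit code requests a restart.

    Returns (final_exit_code, number_of_restarts). max_restarts is a safety
    limit added for this sketch.
    """
    restarts = 0
    while True:
        exit_code = run_scheduler_once()
        if not should_restart(exit_code) or restarts >= max_restarts:
            return exit_code, restarts
        restarts += 1
```

A scheduler that never implements the hotplug callbacks can therefore still survive CPUs coming and going, at the cost of a full reinitialization on each event.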
    • sched_ext: Implement sched_ext_ops.cpu_acquire/release() · 245254f7
      David Vernet authored
      Scheduler classes are strictly ordered and when a higher priority class has
      tasks to run, the lower priority ones lose access to the CPU. Being able to
      monitor and act on these events is necessary for use cases including
      strict core-scheduling and latency management.
      
      This patch adds two operations ops.cpu_acquire() and .cpu_release(). The
      former is invoked when a CPU becomes available to the BPF scheduler and the
      opposite for the latter. This patch also implements
      scx_bpf_reenqueue_local() which can be called from .cpu_release() to trigger
      requeueing of all tasks in the local dsq of the CPU so that the tasks can be
      reassigned to other available CPUs.
      
      scx_pair is updated to use .cpu_acquire/release() along with
      %SCX_KICK_WAIT to make the pair scheduling guarantee strict even when a CPU
      is preempted by a higher priority scheduler class.
      
      scx_qmap is updated to use .cpu_acquire/release() to empty the local
      dsq of a preempted CPU. A similar approach can be adopted by BPF schedulers
      that want to have a tight control over latency.
      
      v4: Use the new SCX_KICK_IDLE to wake up a CPU after re-enqueueing.
      
      v3: Drop the const qualifier from scx_cpu_release_args.task. BPF enforces
          access control through the verifier, so the qualifier isn't actually
          operative and only gets in the way when interacting with various
          helpers.
      
      v2: Add p->scx.kf_mask annotation to allow calling scx_bpf_reenqueue_local()
          from ops.cpu_release() nested inside ops.init() and other sleepable
          operations.
      Signed-off-by: David Vernet <dvernet@meta.com>
      Reviewed-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Josh Don <joshdon@google.com>
      Acked-by: Hao Luo <haoluo@google.com>
      Acked-by: Barret Rhoden <brho@google.com>
    • sched_ext: Implement SCX_KICK_WAIT · 90e55164
      David Vernet authored
      If set when calling scx_bpf_kick_cpu(), the invoking CPU will busy wait for
      the kicked cpu to enter the scheduler. See the following for example usage:
      
        https://github.com/sched-ext/scx/blob/main/scheds/c/scx_pair.bpf.c
      
      v2: - Updated to fit the updated kick_cpus_irq_workfn() implementation.
      
          - Include SCX_KICK_WAIT related information in debug dump.
      Signed-off-by: David Vernet <dvernet@meta.com>
      Reviewed-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Josh Don <joshdon@google.com>
      Acked-by: Hao Luo <haoluo@google.com>
      Acked-by: Barret Rhoden <brho@google.com>
    • sched_ext: Track tasks that are subjects of the in-flight SCX operation · 36454023
      Tejun Heo authored
      When some SCX operations are in flight, it is known that the subject task's
      rq lock is held throughout which makes it safe to access certain fields of
      the task - e.g. its current task_group. We want to add SCX kfunc helpers
      that can make use of this guarantee - e.g. to help determine the currently
      associated CPU cgroup from the task's current task_group.
      
      As it'd be dangerous to call such a helper on a task which isn't rq lock
      protected, the helper should be able to verify the input task and reject
      accordingly. This patch adds sched_ext_entity.kf_tasks[] that track the
      tasks which are currently being operated on by a terminal SCX operation. The
      new SCX_CALL_OP_[2]TASK[_RET]() can be used when invoking SCX operations
      which take tasks as arguments and the scx_kf_allowed_on_arg_tasks() can be
      used by kfunc helpers to verify the input task status.
      
      Note that as sched_ext_entity.kf_tasks[] can't handle nesting, the tracking
      is currently only limited to terminal SCX operations. If needed in the
      future, this restriction can be removed by moving the tracking to the task
      side with a couple per-task counters.
      
      v2: Updated to reflect the addition of SCX_KF_SELECT_CPU.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: David Vernet <dvernet@meta.com>
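The tracking scheme described in the commit above can be modeled with two slots that are filled for the duration of an operation and cleared afterwards. This is a toy sketch: KfTaskTracker and its methods are hypothetical names, and raising on nesting stands in for the stated limitation that the slots cannot handle nested operations.

```python
class KfTaskTracker:
    """Toy stand-in for a two-slot record of the tasks being operated on."""

    def __init__(self):
        self.kf_tasks = [None, None]

    def call_op(self, op, task0, task1=None):
        """Invoke op with its subject tasks recorded while the call runs."""
        if any(t is not None for t in self.kf_tasks):
            raise RuntimeError("nested task-tracking operations are not supported")
        self.kf_tasks = [task0, task1]
        try:
            return op(task0) if task1 is None else op(task0, task1)
        finally:
            # Clear the slots even if the operation raises.
            self.kf_tasks = [None, None]

    def allowed_on_arg_task(self, task):
        """A helper may touch a task only while it is a recorded subject."""
        return task is not None and task in self.kf_tasks
```

A helper that relies on the subject task's lock being held would call allowed_on_arg_task() first and reject any task that is not currently recorded.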
    • sched_ext: Implement tickless support · 22a92020
      Tejun Heo authored
      Allow BPF schedulers to indicate tickless operation by setting p->scx.slice
      to SCX_SLICE_INF. A CPU whose current task has an infinite slice goes into
      tickless operation.
      
      scx_central is updated to use tickless operations for all tasks and
      instead use a BPF timer to expire slices. This also uses the SCX_ENQ_PREEMPT
      and task state tracking added by the previous patches.
      
      Currently, there is no way to pin the timer on the central CPU, so it may
      end up on one of the worker CPUs; however, outside of that, the worker CPUs
      can go tickless both while running sched_ext tasks and idling.
      
      With schbench running, scx_central shows:
      
        root@test ~# grep ^LOC /proc/interrupts; sleep 10; grep ^LOC /proc/interrupts
        LOC:     142024        656        664        449   Local timer interrupts
        LOC:     161663        663        665        449   Local timer interrupts
      
      Without it:
      
        root@test ~ [SIGINT]# grep ^LOC /proc/interrupts; sleep 10; grep ^LOC /proc/interrupts
        LOC:     188778       3142       3793       3993   Local timer interrupts
        LOC:     198993       5314       6323       6438   Local timer interrupts
      
      While scx_central itself is too barebone to be useful as a
      production scheduler, a more featureful central scheduler can be built using
      the same approach. Google's experience shows that such an approach can have
      significant benefits for certain applications such as VM hosting.
      
      v4: Allow operation even if BPF_F_TIMER_CPU_PIN is not available.
      
      v3: Pin the central scheduler's timer on the central_cpu using
          BPF_F_TIMER_CPU_PIN.
      
      v2: Convert to BPF inline iterators.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: David Vernet <dvernet@meta.com>
      Acked-by: Josh Don <joshdon@google.com>
      Acked-by: Hao Luo <haoluo@google.com>
      Acked-by: Barret Rhoden <brho@google.com>
      22a92020
    • sched_ext: Add task state tracking operations · 1c29f854
      Tejun Heo authored
      Being able to track the task runnable and running state transitions is
      useful for a variety of purposes including latency tracking and load factor
      calculation.
      
      Currently, BPF schedulers don't have a good way of tracking these
      transitions. Becoming runnable can be determined from ops.enqueue() but
      becoming quiescent can only be inferred from the lack of subsequent enqueue.
      Also, as the local dsq can have multiple tasks and some events are handled
      in the sched_ext core, it's difficult to determine when a given task starts
      and stops executing.
      
      This patch adds sched_ext_ops.runnable(), .running(), .stopping() and
      .quiescent() operations to track the task runnable and running state
      transitions. They're mostly self explanatory; however, we want to ensure
      that running <-> stopping transitions are always contained within runnable
      <-> quiescent transitions which is a bit different from how the scheduler
      core behaves. This adds a bit of complication. See the comment in
      dequeue_task_scx().
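      
      The nesting requirement can be pictured with a small checker. This is a
      user-space sketch, not the kernel implementation; TaskStateTracker and its
      method names are hypothetical and merely mirror the four operations. It
      raises if a running <-> stopping pair ever falls outside a
      runnable <-> quiescent pair.

```python
def _require(cond: bool, msg: str) -> None:
    if not cond:
        raise RuntimeError(msg)


class TaskStateTracker:
    """Checks that running/stopping pairs nest inside runnable/quiescent pairs."""

    def __init__(self) -> None:
        self.runnable = False
        self.running = False
        self.log: list[str] = []

    def on_runnable(self) -> None:
        _require(not self.runnable, "already runnable")
        self.runnable = True
        self.log.append("runnable")

    def on_running(self) -> None:
        _require(self.runnable and not self.running, "running outside runnable")
        self.running = True
        self.log.append("running")

    def on_stopping(self) -> None:
        _require(self.running, "stopping without running")
        self.running = False
        self.log.append("stopping")

    def on_quiescent(self) -> None:
        _require(self.runnable and not self.running, "quiescent while running")
        self.runnable = False
        self.log.append("quiescent")
```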
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: David Vernet <dvernet@meta.com>
      Acked-by: Josh Don <joshdon@google.com>
      Acked-by: Hao Luo <haoluo@google.com>
      Acked-by: Barret Rhoden <brho@google.com>
      1c29f854
    • sched_ext: Make watchdog handle ops.dispatch() looping stall · 0922f54f
      Tejun Heo authored
      The dispatch path retries if the local DSQ is still empty after
      ops.dispatch() either dispatched or consumed a task. This is both out of
      necessity and for convenience. It has to retry because the dispatch path
      might lose the tasks to dequeue while the rq lock is released to migrate
      tasks across CPUs, and the retry mechanism makes ops.dispatch() easier to
      implement as it only needs to make some forward progress in each
      iteration.
      
      However, this makes it possible for ops.dispatch() to stall CPUs by
      repeatedly dispatching ineligible tasks. If all CPUs are stalled that way,
      the watchdog or sysrq handler can't run and the system can't be saved. Let's
      address the issue by breaking out of the dispatch loop after 32 iterations.
      
      It is unlikely but not impossible for ops.dispatch() to legitimately go over
      the iteration limit. We want to come back to the dispatch path in such cases
      as not doing so risks stalling the CPU by idling with runnable tasks
      pending. As the previous task is still current in balance_scx(),
      resched_curr() doesn't do anything - it will just get cleared. Let's instead
      use scx_kick_bpf() which will trigger reschedule after switching to the next
      task which will likely be the idle task.
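      
      The bounded retry can be sketched as a plain loop. This is a simplified
      user-space model under stated assumptions: balance(), the `dispatch`
      callback and the constant name MAX_DISPATCH_LOOPS are hypothetical; only
      the limit of 32 comes from the description above.

```python
from collections import deque
from typing import Callable

MAX_DISPATCH_LOOPS = 32  # iteration cap described above; the name is hypothetical


def balance(local_dsq: deque, dispatch: Callable[[deque], bool]) -> str:
    """Return 'has_task', 'no_progress', or 'kick' for one pass of the dispatch path.

    `dispatch` models ops.dispatch(): it may append to `local_dsq` and returns
    True if it dispatched or consumed anything at all.
    """
    for _ in range(MAX_DISPATCH_LOOPS):
        made_progress = dispatch(local_dsq)
        if local_dsq:
            return "has_task"      # something runnable landed on the local DSQ
        if not made_progress:
            return "no_progress"   # nothing to do, safe to go idle
        # Progress was made but the local DSQ is still empty: retry.
    # Cap reached: stop looping so the CPU is not stalled, and ask to be
    # kicked back into the scheduling path instead of idling indefinitely.
    return "kick"
```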
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: David Vernet <dvernet@meta.com>
      0922f54f
    • sched_ext: Add a central scheduler which makes all scheduling decisions on one CPU · 037df2a3
      Tejun Heo authored
      This patch adds a new example scheduler, scx_central, which demonstrates
      central scheduling where one CPU is responsible for making all scheduling
      decisions in the system using scx_bpf_kick_cpu(). The central CPU makes
      scheduling decisions for all CPUs in the system, queues tasks on the
      appropriate local dsq's and preempts the worker CPUs. The worker CPUs in
      turn preempt the central CPU when they need tasks to run.
      
      Currently, every CPU depends on its own tick to expire the current task. A
      follow-up patch implementing tickless support for sched_ext will allow the
      worker CPUs to go full tickless so that they can run completely undisturbed.
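      
      As a rough picture of the central model, and not of scx_central's actual
      code, the sketch below has one function play the central CPU: it moves
      tasks from a shared queue onto the local queues of worker CPUs that need
      work and reports which workers to kick. All names here are hypothetical.

```python
from collections import deque


def central_dispatch(central_queue: deque, local_dsqs: dict, idle_cpus: list) -> list:
    """Hand queued tasks to idle worker CPUs; return the CPUs that need a kick."""
    kicked = []
    for cpu in idle_cpus:
        if not central_queue:
            break
        local_dsqs[cpu].append(central_queue.popleft())  # FIFO order
        kicked.append(cpu)  # the worker is kicked so it picks the task up
    return kicked
```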
      
      v3: - Kumar fixed a bug where the dispatch path could overflow the dispatch
            buffer if too many are dispatched to the fallback DSQ.
      
          - Use the new SCX_KICK_IDLE to wake up non-central CPUs.
      
          - Dropped '-p' option.
      
      v2: - Use RESIZABLE_ARRAY() instead of fixed MAX_CPUS and use SCX_BUG[_ON]()
            to simplify error handling.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: David Vernet <dvernet@meta.com>
      Acked-by: Josh Don <joshdon@google.com>
      Acked-by: Hao Luo <haoluo@google.com>
      Acked-by: Barret Rhoden <brho@google.com>
      Cc: Kumar Kartikeya Dwivedi <memxor@gmail.com>
      Cc: Julia Lawall <julia.lawall@inria.fr>
      037df2a3
    • sched_ext: Implement scx_bpf_kick_cpu() and task preemption support · 81aae789
      Tejun Heo authored
      It's often useful to wake up and/or trigger reschedule on other CPUs. This
      patch adds the scx_bpf_kick_cpu() kfunc helper that a BPF scheduler can
      call to kick the target CPU into the scheduling path.
      
      As a sched_ext task relinquishes its CPU only after its slice is depleted,
      this patch also adds SCX_KICK_PREEMPT and SCX_ENQ_PREEMPT which clear the
      slice of the target CPU's current task to guarantee that sched_ext's
      scheduling path runs on the CPU.
      
      If SCX_KICK_IDLE is specified, the target CPU is kicked iff the CPU is idle
      to guarantee that the target CPU will go through at least one full sched_ext
      scheduling cycle after the kicking. This can be used to wake up idle CPUs
      without incurring unnecessary overhead if it isn't currently idle.
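      
      The flag semantics described above can be modeled in a few lines. This is
      an illustrative user-space sketch: Kick, Cpu and kick_cpu() are
      hypothetical, and the bit values are made up rather than taken from the
      kernel headers.

```python
from dataclasses import dataclass
from enum import IntFlag


class Kick(IntFlag):
    # Illustrative bit values only; not taken from the kernel headers.
    IDLE = 1
    PREEMPT = 2


@dataclass
class Cpu:
    idle: bool
    curr_slice: int = 0     # remaining slice of the current task
    resched: bool = False   # set when the CPU is pushed into the scheduling path


def kick_cpu(cpu: Cpu, flags: Kick = Kick(0)) -> bool:
    """Return True if the CPU was actually kicked."""
    if Kick.IDLE in flags and not cpu.idle:
        return False          # IDLE kicks are skipped for busy CPUs
    if Kick.PREEMPT in flags:
        cpu.curr_slice = 0    # clear the slice so the current task yields
    cpu.resched = True
    return True
```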
      
      As a demonstration of how backward compatibility can be supported using BPF
      CO-RE, tools/sched_ext/include/scx/compat.bpf.h is added. It provides
      __COMPAT_scx_bpf_kick_cpu_IDLE() which uses SCX_KICK_IDLE if available or
      becomes a regular kicking otherwise. This allows schedulers to use the new
      SCX_KICK_IDLE while maintaining support for older kernels. The plan is to
      temporarily use compat helpers to ease API updates and drop them after a few
      kernel releases.
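      
      The fallback idea, reduced to its core: use the flag when the running
      kernel defines it and degrade to a regular kick otherwise. In this
      sketch, `kernel_flags` is a hypothetical stand-in for what CO-RE would
      discover about the running kernel, and the numeric values are made up.

```python
def kick_idle_compat(kernel_flags: dict[str, int]) -> int:
    """Pick kick flags: SCX_KICK_IDLE when the kernel has it, else a plain kick (0)."""
    return kernel_flags.get("SCX_KICK_IDLE", 0)


new_kernel = {"SCX_KICK_PREEMPT": 0x2, "SCX_KICK_IDLE": 0x1}  # made-up values
old_kernel = {"SCX_KICK_PREEMPT": 0x2}
print(kick_idle_compat(new_kernel), kick_idle_compat(old_kernel))  # 1 0
```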
      
      v5: - SCX_KICK_IDLE added. Note that this also adds a compat mechanism for
            schedulers so that they can support kernels without SCX_KICK_IDLE.
            This is useful as a demonstration of how new feature flags can be
            added in a backward compatible way.
      
          - kick_cpus_irq_workfn() reimplemented so that it touches the pending
            cpumasks only as necessary to reduce kicking overhead on machines with
            a lot of CPUs.
      
          - tools/sched_ext/include/scx/compat.bpf.h added.
      
      v4: - Move example scheduler to its own patch.
      
      v3: - Make scx_example_central switch all tasks by default.
      
          - Convert to BPF inline iterators.
      
      v2: - Julia Lawall reported that scx_example_central can overflow the
            dispatch buffer and malfunction. As scheduling for other CPUs can't be
            handled by the automatic retry mechanism, fix by implementing an
            explicit overflow and retry handling.
      
          - Updated to use generic BPF cpumask helpers.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: David Vernet <dvernet@meta.com>
      Acked-by: Josh Don <joshdon@google.com>
      Acked-by: Hao Luo <haoluo@google.com>
      Acked-by: Barret Rhoden <brho@google.com>
      81aae789
    • tools/sched_ext: Add scx_show_state.py · 1c3ae1cb
      Tejun Heo authored
      There are states which are interesting but don't quite fit the interface
      exposed under /sys/kernel/sched_ext. Add tools/scx_show_state.py to show
      them.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: David Vernet <dvernet@meta.com>
      1c3ae1cb
    • sched_ext: Print debug dump after an error exit · 07814a94
      Tejun Heo authored
      If a BPF scheduler triggers an error, the scheduler is aborted and the
      system is reverted to the built-in scheduler. In the process, a lot of
      information which may be useful for figuring out what happened can be lost.
      
      This patch adds debug dump which captures information which may be useful
      for debugging including runqueue and runnable thread states at the time of
      failure. The following shows a debug dump after triggering the watchdog:
      
        root@test ~# os/work/tools/sched_ext/build/bin/scx_qmap -t 100
        stats  : enq=1 dsp=0 delta=1 deq=0
        stats  : enq=90 dsp=90 delta=0 deq=0
        stats  : enq=156 dsp=156 delta=0 deq=0
        stats  : enq=218 dsp=218 delta=0 deq=0
        stats  : enq=255 dsp=255 delta=0 deq=0
        stats  : enq=271 dsp=271 delta=0 deq=0
        stats  : enq=284 dsp=284 delta=0 deq=0
        stats  : enq=293 dsp=293 delta=0 deq=0
      
        DEBUG DUMP
        ================================================================================
      
        kworker/u32:12[320] triggered exit kind 1026:
          runnable task stall (stress[1530] failed to run for 6.841s)
      
        Backtrace:
          scx_watchdog_workfn+0x136/0x1c0
          process_scheduled_works+0x2b5/0x600
          worker_thread+0x269/0x360
          kthread+0xeb/0x110
          ret_from_fork+0x36/0x40
          ret_from_fork_asm+0x1a/0x30
      
        QMAP FIFO[0]:
        QMAP FIFO[1]:
        QMAP FIFO[2]: 1436
        QMAP FIFO[3]:
        QMAP FIFO[4]:
      
        CPU states
        ----------
      
        CPU 0   : nr_run=1 ops_qseq=244
      	    curr=swapper/0[0] class=idle_sched_class
      
          QMAP: dsp_idx=1 dsp_cnt=0
      
          R stress[1530] -6841ms
      	scx_state/flags=3/0x1 ops_state/qseq=2/20
      	sticky/holding_cpu=-1/-1 dsq_id=(n/a)
      	cpus=ff
      
            QMAP: force_local=0
      
            asm_sysvec_apic_timer_interrupt+0x16/0x20
      
        CPU 2   : nr_run=2 ops_qseq=142
      	    curr=swapper/2[0] class=idle_sched_class
      
          QMAP: dsp_idx=1 dsp_cnt=0
      
          R sshd[1703] -5905ms
      	scx_state/flags=3/0x9 ops_state/qseq=2/88
      	sticky/holding_cpu=-1/-1 dsq_id=(n/a)
      	cpus=ff
      
            QMAP: force_local=1
      
            __x64_sys_ppoll+0xf6/0x120
            do_syscall_64+0x7b/0x150
            entry_SYSCALL_64_after_hwframe+0x76/0x7e
      
          R fish[1539] -4141ms
      	scx_state/flags=3/0x9 ops_state/qseq=2/124
      	sticky/holding_cpu=-1/-1 dsq_id=(n/a)
      	cpus=ff
      
            QMAP: force_local=1
      
            futex_wait+0x60/0xe0
            do_futex+0x109/0x180
            __x64_sys_futex+0x117/0x190
            do_syscall_64+0x7b/0x150
            entry_SYSCALL_64_after_hwframe+0x76/0x7e
      
        CPU 3   : nr_run=2 ops_qseq=162
      	    curr=kworker/u32:12[320] class=ext_sched_class
      
          QMAP: dsp_idx=1 dsp_cnt=0
      
         *R kworker/u32:12[320] +0ms
      	scx_state/flags=3/0xd ops_state/qseq=0/0
      	sticky/holding_cpu=-1/-1 dsq_id=(n/a)
      	cpus=ff
      
            QMAP: force_local=0
      
            scx_dump_state+0x613/0x6f0
            scx_ops_error_irq_workfn+0x1f/0x40
            irq_work_run_list+0x82/0xd0
            irq_work_run+0x14/0x30
            __sysvec_irq_work+0x40/0x140
            sysvec_irq_work+0x60/0x70
            asm_sysvec_irq_work+0x16/0x20
            scx_watchdog_workfn+0x15f/0x1c0
            process_scheduled_works+0x2b5/0x600
            worker_thread+0x269/0x360
            kthread+0xeb/0x110
            ret_from_fork+0x36/0x40
            ret_from_fork_asm+0x1a/0x30
      
          R kworker/3:2[1436] +0ms
      	scx_state/flags=3/0x9 ops_state/qseq=2/160
      	sticky/holding_cpu=-1/-1 dsq_id=(n/a)
      	cpus=08
      
            QMAP: force_local=0
      
            kthread+0xeb/0x110
            ret_from_fork+0x36/0x40
            ret_from_fork_asm+0x1a/0x30
      
        CPU 7   : nr_run=0 ops_qseq=76
      	    curr=swapper/7[0] class=idle_sched_class
      
      
        ================================================================================
      
        EXIT: runnable task stall (stress[1530] failed to run for 6.841s)
      
      It shows that CPU 3 was running the watchdog when it triggered the error
      condition and that stress[1530] had been queued on CPU 0 for over 6
      seconds but failed to run. It also prints out scx_qmap specific information
      - e.g. which tasks are queued on each FIFO and so on using the dump_*() ops.
      This dump has proved pretty useful for developing and debugging BPF
      schedulers.
      
      Debug dump is generated automatically when the BPF scheduler exits due to an
      error. The debug buffer used in such cases is determined by
      sched_ext_ops.exit_dump_len and defaults to 32k. If the debug dump overruns
      the available buffer, the output is truncated and marked accordingly.
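      
      A minimal sketch of a bounded dump buffer that marks truncation, assuming
      a character-based limit. bounded_dump() and the marker text are
      hypothetical; only the 32k default comes from the description above.

```python
DEFAULT_DUMP_LEN = 32 * 1024            # the 32k default mentioned above
TRUNC_MARKER = "\n~~~~ TRUNCATED ~~~~"   # hypothetical marker text


def bounded_dump(lines: list[str], limit: int = DEFAULT_DUMP_LEN) -> str:
    """Join dump lines, truncating to `limit` characters with a visible marker."""
    text = "\n".join(lines)
    if len(text) <= limit:
        return text
    keep = max(0, limit - len(TRUNC_MARKER))
    return text[:keep] + TRUNC_MARKER[:limit]
```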
      
      Debug dump output can also be read through the sched_ext_dump tracepoint.
      When read through the tracepoint, there is no length limit.
      
      SysRq-D can be used to trigger debug dump at any time while a BPF scheduler
      is loaded. This is non-destructive - the scheduler keeps running afterwards.
      The output can be read through the sched_ext_dump tracepoint.
      
      v2: - The size of exit debug dump buffer can now be customized using
            sched_ext_ops.exit_dump_len.
      
          - sched_ext_ops.dump*() added to enable dumping of BPF scheduler
            specific information.
      
          - Tracepoint output and SysRq-D triggering added.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: David Vernet <dvernet@meta.com>
      07814a94
    • sched_ext: Print sched_ext info when dumping stack · 1538e339
      David Vernet authored
      It would be useful to see what the sched_ext scheduler state is, and what
      scheduler is running, when we're dumping a task's stack. This patch
      therefore adds a new print_scx_info() function that's called in the same
      context as print_worker_info() and print_stop_info(). An example dump
      follows.
      
        BUG: kernel NULL pointer dereference, address: 0000000000000999
        #PF: supervisor write access in kernel mode
        #PF: error_code(0x0002) - not-present page
        PGD 0 P4D 0
        Oops: 0002 [#1] PREEMPT SMP
        CPU: 13 PID: 2047 Comm: insmod Tainted: G           O       6.6.0-work-10323-gb58d4cae8e99-dirty #34
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS unknown 2/2/2022
        Sched_ext: qmap (enabled+all), task: runnable_at=-17ms
        RIP: 0010:init_module+0x9/0x1000 [test_module]
        ...
      
      v3: - scx_ops_enable_state_str[] definition moved to an earlier patch as
            it's now used by core implementation.
      
          - Convert jiffy delta to msecs using jiffies_to_msecs() instead of
            multiplying by (HZ / MSEC_PER_SEC). The conversion is implemented in
            jiffies_delta_msecs().
      
      v2: - We are now using scx_ops_enable_state_str[] outside
            CONFIG_SCHED_DEBUG. Move it outside of CONFIG_SCHED_DEBUG and to the
            top. This was reported by Changwoo and Andrea.
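      
      The v3 conversion change is easy to see with integer arithmetic. The
      sketch below is a user-space illustration, not the kernel's
      jiffies_delta_msecs(); both function names are hypothetical.
      naive_delta_msecs() mimics multiplying by (HZ / MSEC_PER_SEC), whose
      integer ratio is 0 whenever HZ is below 1000.

```python
MSEC_PER_SEC = 1000


def naive_delta_msecs(delta_jiffies: int, hz: int) -> int:
    """The replaced approach: scale by the integer ratio HZ / MSEC_PER_SEC."""
    return delta_jiffies * (hz // MSEC_PER_SEC)


def delta_msecs(delta_jiffies: int, hz: int) -> int:
    """Convert a signed jiffy delta to milliseconds, keeping the sign."""
    msecs = abs(delta_jiffies) * MSEC_PER_SEC // hz
    return -msecs if delta_jiffies < 0 else msecs
```

      For example, at HZ=250 a delta of -5 jiffies is -20ms, while the naive
      form yields 0 for every input.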
      Signed-off-by: David Vernet <void@manifault.com>
      Reported-by: Changwoo Min <changwoo@igalia.com>
      Reported-by: Andrea Righi <andrea.righi@canonical.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      1538e339
    • sched_ext: Allow BPF schedulers to disallow specific tasks from joining SCHED_EXT · 7bb6f081
      Tejun Heo authored
      BPF schedulers might not want to schedule certain tasks - e.g. kernel
      threads. This patch adds p->scx.disallow which can be set by BPF schedulers
      in such cases. The field can be changed anytime and setting it in
      ops.prep_enable() guarantees that the task can never be scheduled by
      sched_ext.
      
      scx_qmap is updated with the -d option to disallow a specific PID:
      
        # echo $$
        1092
        # grep -E '(policy)|(ext\.enabled)' /proc/self/sched
        policy                                       :                    0
        ext.enabled                                  :                    0
        # ./set-scx 1092
        # grep -E '(policy)|(ext\.enabled)' /proc/self/sched
        policy                                       :                    7
        ext.enabled                                  :                    0
      
      Run "scx_qmap -p -d 1092" in another terminal.
      
        # cat /sys/kernel/sched_ext/nr_rejected
        1
        # grep -E '(policy)|(ext\.enabled)' /proc/self/sched
        policy                                       :                    0
        ext.enabled                                  :                    0
        # ./set-scx 1092
        setparam failed for 1092 (Permission denied)
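      
      The behavior in the transcript can be summarized by a toy model. This is
      a sketch under assumptions, not the kernel logic: Task, load_scheduler()
      and set_scx() are hypothetical, and the policy numbers follow the
      transcript (7 for SCHED_EXT, 0 for the default policy).

```python
from dataclasses import dataclass

SCHED_NORMAL = 0
SCHED_EXT = 7   # matches the "policy : 7" value in the transcript above


@dataclass
class Task:
    pid: int
    policy: int
    disallow: bool = False


def load_scheduler(tasks: list[Task]) -> int:
    """Model of loading a BPF scheduler; returns the number of rejected tasks."""
    nr_rejected = 0
    for t in tasks:
        if t.disallow and t.policy == SCHED_EXT:
            t.policy = SCHED_NORMAL   # the task is pushed back to the default class
            nr_rejected += 1
    return nr_rejected


def set_scx(task: Task) -> None:
    """Model of a later attempt to switch a disallowed task to SCHED_EXT."""
    if task.disallow:
        raise PermissionError(f"setparam failed for {task.pid}")
    task.policy = SCHED_EXT
```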
      
      - v4: Refreshed on top of tip:sched/core.
      
      - v3: Update description to reflect /sys/kernel/sched_ext interface change.
      
      - v2: Use atomic_long_t instead of atomic64_t for scx_kick_cpus_pnt_seqs to
            accommodate 32bit archs.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Suggested-by: Barret Rhoden <brho@google.com>
      Reviewed-by: David Vernet <dvernet@meta.com>
      Acked-by: Josh Don <joshdon@google.com>
      Acked-by: Hao Luo <haoluo@google.com>
      Acked-by: Barret Rhoden <brho@google.com>
      7bb6f081
    • sched_ext: Implement runnable task stall watchdog · 8a010b81
      David Vernet authored
      The most common and critical way that a BPF scheduler can misbehave is by
      failing to run runnable tasks for too long. This patch implements a
      watchdog.
      
      * All tasks record when they become runnable.
      
      * A watchdog work periodically scans all runnable tasks. If any task has
        stayed runnable for too long, the BPF scheduler is aborted.
      
      * scheduler_tick() monitors whether the watchdog itself is stuck. If so, the
        BPF scheduler is aborted.
      
      Because the watchdog only scans the tasks which are currently runnable and
      usually very infrequently, the overhead should be negligible.
      scx_qmap is updated so that it can be told to stall user and/or
      kernel tasks.
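      
      The periodic scan amounts to comparing each runnable task's timestamp
      against a timeout. The sketch below models only that scan, in user space
      and with seconds as the unit; find_stalled() and its arguments are
      hypothetical, and the check on the watchdog itself is left out.

```python
def find_stalled(runnable_at: dict[str, float], now: float, timeout_s: float):
    """Return (task, stalled_for) for the longest-stalled task, or None."""
    worst = None
    for task, since in runnable_at.items():
        waited = now - since
        if waited > timeout_s and (worst is None or waited > worst[1]):
            worst = (task, waited)
    return worst
```

      A non-None result corresponds to the point where the BPF scheduler would
      be aborted.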
      
      A detected task stall looks like the following:
      
       sched_ext: BPF scheduler "qmap" errored, disabling
       sched_ext: runnable task stall (dbus-daemon[953] failed to run for 6.478s)
          scx_check_timeout_workfn+0x10e/0x1b0
          process_one_work+0x287/0x560
          worker_thread+0x234/0x420
          kthread+0xe9/0x100
          ret_from_fork+0x1f/0x30
      
      A detected watchdog stall:
      
       sched_ext: BPF scheduler "qmap" errored, disabling
       sched_ext: runnable task stall (watchdog failed to check in for 5.001s)
          scheduler_tick+0x2eb/0x340
          update_process_times+0x7a/0x90
          tick_sched_timer+0xd8/0x130
          __hrtimer_run_queues+0x178/0x3b0
          hrtimer_interrupt+0xfc/0x390
          __sysvec_apic_timer_interrupt+0xb7/0x2b0
          sysvec_apic_timer_interrupt+0x90/0xb0
          asm_sysvec_apic_timer_interrupt+0x1b/0x20
          default_idle+0x14/0x20
          arch_cpu_idle+0xf/0x20
          default_idle_call+0x50/0x90
          do_idle+0xe8/0x240
          cpu_startup_entry+0x1d/0x20
          kernel_init+0x0/0x190
          start_kernel+0x0/0x392
          start_kernel+0x324/0x392
          x86_64_start_reservations+0x2a/0x2c
          x86_64_start_kernel+0x104/0x109
          secondary_startup_64_no_verify+0xce/0xdb
      
      Note that this patch exposes scx_ops_error[_type]() in kernel/sched/ext.h to
      inline scx_notify_sched_tick().
      
      v4: - While disabling, cancel_delayed_work_sync(&scx_watchdog_work) was
            being called before forward progress was guaranteed and thus could
            lead to system lockup. Relocated.
      
          - While enabling, it was comparing msecs against jiffies without
            conversion leading to spurious load failures on lower HZ kernels.
            Fixed.
      
          - runnable list management is now used by core bypass logic and moved to
            the patch implementing sched_ext core.
      
      v3: - bpf_scx_init_member() was incorrectly comparing ops->timeout_ms
            against SCX_WATCHDOG_MAX_TIMEOUT which is in jiffies without
            conversion leading to spurious load failures in lower HZ kernels.
            Fixed.
      
      v2: - Julia Lawall noticed that the watchdog code was mixing msecs and
            jiffies. Fix by using jiffies for everything.
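      
      The unit mix-up fixed in v3 and v4 can be reproduced with plain integers.
      This is an illustrative sketch with hypothetical function names; the 30
      second maximum used in the example is an assumption, not a value stated
      in this message.

```python
def msecs_to_jiffies(msecs: int, hz: int) -> int:
    """Round a millisecond count up to whole jiffies."""
    return -(-msecs * hz // 1000)


def timeout_ok_buggy(timeout_ms: int, max_timeout_jiffies: int) -> bool:
    return timeout_ms <= max_timeout_jiffies          # units do not match


def timeout_ok_fixed(timeout_ms: int, max_timeout_jiffies: int, hz: int) -> bool:
    return msecs_to_jiffies(timeout_ms, hz) <= max_timeout_jiffies


HZ = 100
MAX_TIMEOUT = 30 * HZ   # assumed 30 second cap, expressed in jiffies
print(timeout_ok_buggy(5000, MAX_TIMEOUT))       # False: 5s is rejected at HZ=100
print(timeout_ok_fixed(5000, MAX_TIMEOUT, HZ))   # True
```

      At HZ=1000 the buggy comparison happens to pass, which is why the failure
      only showed up on lower HZ kernels.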
      Signed-off-by: David Vernet <dvernet@meta.com>
      Reviewed-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Josh Don <joshdon@google.com>
      Acked-by: Hao Luo <haoluo@google.com>
      Acked-by: Barret Rhoden <brho@google.com>
      Cc: Julia Lawall <julia.lawall@inria.fr>
      8a010b81