1. 04 Sep, 2024 2 commits
    • Tejun Heo's avatar
      sched_ext: Replace SCX_TASK_BAL_KEEP with SCX_RQ_BAL_KEEP · 8b1451f2
      Tejun Heo authored
      SCX_TASK_BAL_KEEP is used by balance_one() to tell pick_next_task_scx() to
      keep running the current task. It's not really a task property. Replace it
      with SCX_RQ_BAL_KEEP which resides in rq->scx.flags and is a better fit for
      the usage. Also, the existing clearing rule is unnecessarily strict and
      makes it difficult to use with core-sched. Just clear it on entry to
      balance_one().
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      8b1451f2
    • Tejun Heo's avatar
      sched_ext: Don't call put_prev_task_scx() before picking the next task · 7c65ae81
      Tejun Heo authored
      fd03c5b8 ("sched: Rework pick_next_task()") changed the definition of
      pick_next_task() from:
      
        pick_next_task() := pick_task() + set_next_task(.first = true)
      
      to:
      
        pick_next_task(prev) := pick_task() + put_prev_task() + set_next_task(.first = true)
      
      making invoking put_prev_task() pick_next_task()'s responsibility. This
      reordering allows pick_task() to be shared between regular and core-sched
      paths and put_prev_task() to know the next task.
      
      sched_ext depended on put_prev_task_scx() enqueueing the current task before
      pick_next_task_scx() is called. While pulling sched/core changes,
      70cc76aa0d80 ("Merge branch 'tip/sched/core' into for-6.12") added an
      explicit put_prev_task_scx() call for SCX tasks in pick_next_task_scx()
      before picking the first task as a workaround.
      
      Clean it up and adopt the conventions that other sched classes are
      following.
      
      The operation of keeping running the current task was spread and required
      the task to be put on the local DSQ before picking:
      
        - balance_one() used SCX_TASK_BAL_KEEP to indicate that the task is still
          runnable, hasn't exhausted its slice, and thus should keep running.
      
        - put_prev_task_scx() enqueued the task to local DSQ if SCX_TASK_BAL_KEEP
          is set. It also called do_enqueue_task() with SCX_ENQ_LAST if it is the
          only runnable task. do_enqueue_task() in turn decided whether to use the
          local DSQ depending on SCX_OPS_ENQ_LAST.
      
      Consolidate the logic in balance_one() as it always knows whether it is
      going to keep the current task. balance_one() now considers all conditions
      where the current task should be kept and uses SCX_TASK_BAL_KEEP to tell
      pick_next_task_scx() to keep the current task instead of picking one from
      the local DSQ. Accordingly, SCX_ENQ_LAST handling is removed from
      put_prev_task_scx() and do_enqueue_task() and pick_next_task_scx() is
      updated to pick the current task if SCX_TASK_BAL_KEEP is set.
      
      The workaround put_prev_task[_scx]() calls are replaced with
      put_prev_set_next_task().
      
      This causes two behavior changes observable from the BPF scheduler:
      
      - When a task keep running, it no longer goes through enqueue/dequeue cycle
        and thus ops.stopping/running() transitions. The new behavior is better
        and all the existing schedulers should be able to handle the new behavior.
      
      - The BPF scheduler cannot keep executing the current task by enqueueing
        SCX_ENQ_LAST task to the local DSQ. If SCX_OPS_ENQ_LAST is specified, the
        BPF scheduler is responsible for resuming execution after each
        SCX_ENQ_LAST. SCX_OPS_ENQ_LAST is mostly useful for cases where scheduling
        decisions are not made on the local CPU - e.g. central or userspace-driven
        schedulin - and the new behavior is more logical and shouldn't pose any
        problems. SCX_OPS_ENQ_LAST demonstration from scx_qmap is dropped as it
        doesn't fit that well anymore and the last task handling is moved to the
        end of qmap_dispatch().
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Cc: David Vernet <void@manifault.com>
      Cc: Andrea Righi <righi.andrea@gmail.com>
      Cc: Changwoo Min <multics69@gmail.com>
      Cc: Daniel Hodges <hodges.daniel.scott@gmail.com>
      Cc: Dan Schatzberg <schatzberg.dan@gmail.com>
      7c65ae81
  2. 03 Sep, 2024 11 commits
  3. 31 Aug, 2024 2 commits
    • Tejun Heo's avatar
      sched_ext: Use sched_clock_cpu() instead of rq_clock_task() in touch_core_sched() · 62607d03
      Tejun Heo authored
      Since 3cf78c5d ("sched_ext: Unpin and repin rq lock from
      balance_scx()"), sched_ext's balance path terminates rq_pin in the outermost
      function. This is simpler and in line with what other balance functions are
      doing but it loses control over rq->clock_update_flags which makes
      assert_clock_udpated() trigger if other CPUs pins the rq lock.
      
      The only place this matters is touch_core_sched() which uses the timestamp
      to order tasks from sibling rq's. Switch to sched_clock_cpu(). Later, it may
      be better to use per-core dispatch sequence number.
      
      v2: Use sched_clock_cpu() instead of ktime_get_ns() per David.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Fixes: 3cf78c5d ("sched_ext: Unpin and repin rq lock from balance_scx()")
      Acked-by: default avatarDavid Vernet <void@manifault.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      62607d03
    • Tejun Heo's avatar
      sched_ext: Use task_can_run_on_remote_rq() test in dispatch_to_local_dsq() · 0366017e
      Tejun Heo authored
      When deciding whether a task can be migrated to a CPU,
      dispatch_to_local_dsq() was open-coding p->cpus_allowed and scx_rq_online()
      tests instead of using task_can_run_on_remote_rq(). This had two problems.
      
      - It was missing is_migration_disabled() check and thus could try to migrate
        a task which shouldn't leading to assertion and scheduling failures.
      
      - It was testing p->cpus_ptr directly instead of using task_allowed_on_cpu()
        and thus failed to consider ISA compatibility.
      
      Update dispatch_to_local_dsq() to use task_can_run_on_remote_rq():
      
      - Move scx_ops_error() triggering into task_can_run_on_remote_rq().
      
      - When migration isn't allowed, fall back to the global DSQ instead of the
        source DSQ by returning DTL_INVALID. This is both simpler and an overall
        better behavior.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Acked-by: default avatarDavid Vernet <void@manifault.com>
      0366017e
  4. 28 Aug, 2024 1 commit
  5. 27 Aug, 2024 1 commit
    • Tejun Heo's avatar
      scx_central: Fix smatch checker warning · 59cfdf3f
      Tejun Heo authored
      ARRAY_ELEM_PTR() is an access macro used to help the BPF verifier not
      confused by offseted memory acceeses by yiedling a valid pointer or NULL in
      a way that's clear to the verifier. As such, the canonical usage involves
      checking NULL return from the macro. Note that in many cases, the NULL
      condition can never happen - they're there just to hint the verifier.
      
      In a bpf_loop in scx_central.bpf.c::central_dispatch(), the NULL check was
      incorrect in that there was another dereference of the pointer in addition
      to the NULL checked access. This worked as the pointer can never be NULL and
      the verifier could tell it would never be NULL in this case.
      
      However, this still looks wrong and trips smatch:
      
        ./tools/sched_ext/scx_central.bpf.c:205 ____central_dispatch()
        error: we previously assumed 'gimme' could be null (see line 201)
      
        ./tools/sched_ext/scx_central.bpf.c
            195
            196                         if (!scx_bpf_dispatch_nr_slots())
            197                                 break;
            198
            199                         /* central's gimme is never set */
            200                         gimme = ARRAY_ELEM_PTR(cpu_gimme_task, cpu, nr_cpu_ids);
            201                         if (gimme && !*gimme)
      				      ^^^^^
        If gimme is NULL
      
            202                                 continue;
            203
            204                         if (dispatch_to_cpu(cpu))
        --> 205                                 *gimme = false;
      
      Fix the NULL check so that there are no derefs if NULL. This doesn't change
      actual behavior.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Reported-by: default avatarDan Carpenter <dan.carpenter@linaro.org>
      Link: http://lkml.kernel.org/r/<955e1c3c-ace2-4a1d-b246-15b8196038a3@stanley.mountain>
      59cfdf3f
  6. 20 Aug, 2024 2 commits
    • Yipeng Zou's avatar
      sched_ext: Allow dequeue_task_scx to fail · 9ad2861b
      Yipeng Zou authored
      Since dequeue_task() allowed to fail, there is a compile error:
      
      kernel/sched/ext.c:3630:19: error: initialization of ‘bool (*)(struct rq*, struct task_struct *, int)’ {aka ‘_Bool (*)(struct rq *, struct task_struct *, int)’} from incompatible pointer type ‘void (*)(struct rq*, struct task_struct *, int)’
        3630 |  .dequeue_task  = dequeue_task_scx,
             |                   ^~~~~~~~~~~~~~~~
      
      Allow dequeue_task_scx to fail too.
      
      Fixes: 863ccdbb ("sched: Allow sched_class::dequeue_task() to fail")
      Signed-off-by: default avatarYipeng Zou <zouyipeng@huawei.com>
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      9ad2861b
    • Tejun Heo's avatar
      Merge branch 'tip/sched/core' into for-6.12 · 5ac99857
      Tejun Heo authored
      To receive 863ccdbb ("sched: Allow sched_class::dequeue_task() to fail")
      which makes sched_class.dequeue_task() return bool instead of void. This
      leads to compile breakage and will be fixed by a follow-up patch.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      5ac99857
  7. 17 Aug, 2024 21 commits