• Tejun Heo's avatar
    sched_ext: Add vtime-ordered priority queue to dispatch_q's · 06e51be3
    Tejun Heo authored
    Currently, a dsq is always a FIFO. A task which is dispatched earlier gets
    consumed or executed earlier. While this is sufficient when dsq's are used
    for simple staging areas for tasks which are ready to execute, it'd make
    dsq's a lot more useful if they can implement custom ordering.
    
    This patch adds a vtime-ordered priority queue to dsq's. When the BPF
    scheduler dispatches a task with the new scx_bpf_dispatch_vtime() helper, it
    can specify the vtime tha the task should be inserted at and the task is
    inserted into the priority queue in the dsq which is ordered according to
    time_before64() comparison of the vtime values.
    
    A DSQ can either be a FIFO or priority queue and automatically switches
    between the two depending on whether scx_bpf_dispatch() or
    scx_bpf_dispatch_vtime() is used. Using the wrong variant while the DSQ
    already has the other type queued is not allowed and triggers an ops error.
    Built-in DSQs must always be FIFOs.
    
    This makes it very easy for the BPF schedulers to implement proper vtime
    based scheduling within each dsq very easy and efficient at a negligible
    cost in terms of code complexity and overhead.
    
    scx_simple and scx_example_flatcg are updated to default to weighted
    vtime scheduling (the latter within each cgroup). FIFO scheduling can be
    selected with -f option.
    
    v4: - As allowing mixing priority queue and FIFO on the same DSQ sometimes
          led to unexpected starvations, DSQs now error out if both modes are
          used at the same time and the built-in DSQs are no longer allowed to
          be priority queues.
    
        - Explicit type struct scx_dsq_node added to contain fields needed to be
          linked on DSQs. This will be used to implement stateful iterator.
    
        - Tasks are now always linked on dsq->list whether the DSQ is in FIFO or
          PRIQ mode. This confines PRIQ related complexities to the enqueue and
          dequeue paths. Other paths only need to look at dsq->list. This will
          also ease implementing BPF iterator.
    
        - Print p->scx.dsq_flags in debug dump.
    
    v3: - SCX_TASK_DSQ_ON_PRIQ flag is moved from p->scx.flags into its own
          p->scx.dsq_flags. The flag is protected with the dsq lock unlike other
          flags in p->scx.flags. This led to flag corruption in some cases.
    
        - Add comments explaining the interaction between using consumption of
          p->scx.slice to determine vtime progress and yielding.
    
    v2: - p->scx.dsq_vtime was not initialized on load or across cgroup
          migrations leading to some tasks being stalled for extended period of
          time depending on how saturated the machine is. Fixed.
    Signed-off-by: default avatarTejun Heo <tj@kernel.org>
    Reviewed-by: default avatarDavid Vernet <dvernet@meta.com>
    06e51be3
scx_simple.bpf.c 4.46 KB