    sched_ext: Split the global DSQ per NUMA node · b7b3b2db
    Tejun Heo authored
    In the bypass mode, the global DSQ is used to schedule all tasks in simple
    FIFO order. All tasks are queued into the global DSQ and all CPUs try to
    execute tasks from it. This creates a lot of cross-node cacheline accesses
    and scheduling across the node boundaries, and can lead to livelock
    conditions where the system takes tens of minutes to disable the BPF
    scheduler while executing in the bypass mode.
    
    Split the global DSQ per NUMA node. Each node has its own global DSQ. When a
    task is dispatched to SCX_DSQ_GLOBAL, it's put into the global DSQ of the
    node that the task's CPU belongs to, and the CPUs in a node consume only
    from that node's global DSQ.
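
    As a rough illustration of the split described above, the following is a
    minimal userspace sketch, not the kernel implementation. The names
    (struct dsq, global_dsqs, find_global_dsq(), dispatch_global(),
    consume_global()), the fixed 2-node/8-CPU topology and the ring buffer are
    all assumptions made for the example, and locking is omitted.

```c
#include <stddef.h>

/* Hypothetical topology for the sketch: 2 nodes, 4 CPUs each. */
#define NR_NODES 2
#define NR_CPUS  8
#define DSQ_CAP  64

struct task {
	int pid;
	int cpu;	/* CPU the task is currently associated with */
};

/* A simple FIFO ring buffer standing in for a dispatch queue. */
struct dsq {
	struct task *slots[DSQ_CAP];
	size_t head, tail, nr;
};

/* One "global" DSQ per node instead of a single system-wide one. */
static struct dsq global_dsqs[NR_NODES];

/* Assumed CPU-to-node mapping: contiguous blocks of CPUs per node. */
static int cpu_to_node(int cpu)
{
	return cpu / (NR_CPUS / NR_NODES);
}

/* Pick the global DSQ local to the task's CPU. */
static struct dsq *find_global_dsq(const struct task *p)
{
	return &global_dsqs[cpu_to_node(p->cpu)];
}

/* Queue @p at the tail of its node-local global DSQ; -1 if full. */
static int dispatch_global(struct task *p)
{
	struct dsq *q = find_global_dsq(p);

	if (q->nr == DSQ_CAP)
		return -1;
	q->slots[q->tail] = p;
	q->tail = (q->tail + 1) % DSQ_CAP;
	q->nr++;
	return 0;
}

/* @cpu only looks at its own node's global DSQ, never a remote one. */
static struct task *consume_global(int cpu)
{
	struct dsq *q = &global_dsqs[cpu_to_node(cpu)];
	struct task *p;

	if (!q->nr)
		return NULL;
	p = q->slots[q->head];
	q->head = (q->head + 1) % DSQ_CAP;
	q->nr--;
	return p;
}
```

    In this sketch, a task dispatched from CPU 5 can only be picked up by CPUs
    4-7, so queue accesses stay within one node.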
    
    This resolves a livelock condition which could be reliably triggered on a
    2x EPYC 7642 system by running `stress-ng --race-sched 1024` together with
    `stress-ng --workload 80 --workload-threads 10` while repeatedly enabling
    and disabling an SCX scheduler.
    
    Signed-off-by: Tejun Heo <tj@kernel.org>
    Acked-by: David Vernet <void@manifault.com>
ext.c 204 KB