    sched/rt: Fix live lock between select_fallback_rq() and RT push · fc090277
    Joel Fernandes (Google) authored
    During RCU-boost testing with the TREE03 rcutorture config, I found that
    after a few hours, the machine locks up.
    
    Tracing showed a live lock between two CPUs. One CPU has an RT task
    running, while the other CPU, which also has an RT task running, is
    being offlined. During offlining, all threads are migrated away, and
    the migration thread is repeatedly scheduled to migrate the tasks
    actively running on the CPU being offlined. This results in a live
    lock because select_fallback_rq() keeps picking the CPU that an RT
    task is already running on, only for the migrated task to get pushed
    back to the CPU being offlined.
    
    In any case, it is pointless to push tasks to CPUs that are being
    offlined, since the tasks will only be migrated away again. Doing so
    can also add unwanted latency to the task.
    
    Fix these issues by not selecting CPUs in RT if they are not 'active'
    for scheduling, using the cpu_active_mask. Other parts in core.c already
    use cpu_active_mask to prevent tasks from being put on CPUs going
    offline.
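
    The idea can be sketched in userspace as follows. This is only an
    illustrative model, not the kernel change itself: cpu_mask_t,
    pick_candidates() and first_cpu() are hypothetical names, and a plain
    64-bit word stands in for the kernel's cpumask. The candidate set is
    the task's allowed CPUs intersected with the CPUs running
    lower-priority work, and the extra step is one more intersection with
    the set of active CPUs.

```c
#include <stdint.h>

/* Hypothetical userspace model: one bit per CPU, up to 64 CPUs. */
typedef uint64_t cpu_mask_t;

/*
 * Build the set of CPUs an RT task may be pushed to.
 *   allowed    - CPUs permitted by the task's affinity
 *   lower_prio - CPUs currently running lower-priority work
 *   active     - CPUs that are active for scheduling (not going offline)
 */
static cpu_mask_t pick_candidates(cpu_mask_t allowed, cpu_mask_t lower_prio,
                                  cpu_mask_t active)
{
    cpu_mask_t candidates = allowed & lower_prio;

    /* The extra filter: drop CPUs that are not active. */
    candidates &= active;
    return candidates;
}

/* Return the lowest-numbered CPU set in the mask, or -1 if it is empty. */
static int first_cpu(cpu_mask_t mask)
{
    for (int cpu = 0; cpu < 64; cpu++) {
        if (mask & ((cpu_mask_t)1 << cpu))
            return cpu;
    }
    return -1;
}
```

    With CPUs 0-3 allowed, CPUs 1 and 2 running lower-priority work, and
    CPU 1 going offline, only CPU 2 remains a candidate in this model, so
    the task is never placed on the CPU that is about to be drained.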
    
    With this fix I ran the tests for days and could not reproduce the
    hang. Without the patch, I hit it in a few hours.
    Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Tested-by: Paul E. McKenney <paulmck@kernel.org>
    Cc: stable@vger.kernel.org
    Link: https://lore.kernel.org/r/20230923011409.3522762-1-joel@joelfernandes.org
cpupri.c 8.57 KB