    sched/core: Fix ttwu() race · b6e13e85
    Paul reported rcutorture occasionally hitting a NULL deref:
    
      sched_ttwu_pending()
        ttwu_do_wakeup()
          check_preempt_curr() := check_preempt_wakeup()
            find_matching_se()
              is_same_group()
                if (se->cfs_rq == pse->cfs_rq) <-- *BOOM*
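    
    For reference, the two helpers at the bottom of that chain look
    roughly like this (simplified from kernel/sched/fair.c, with
    CONFIG_FAIR_GROUP_SCHED assumed):
    
      static inline struct cfs_rq *
      is_same_group(struct sched_entity *se, struct sched_entity *pse)
      {
      	if (se->cfs_rq == pse->cfs_rq)	/* <-- se is NULL here: *BOOM* */
      		return se->cfs_rq;
    
      	return NULL;
      }
    
      static void
      find_matching_se(struct sched_entity **se, struct sched_entity **pse)
      {
      	/* ... first walk up until both entities are at the same depth ... */
    
      	/*
      	 * If the two entities hang off different CPUs' cfs_rq
      	 * hierarchies, no common group is ever found and the walk
      	 * goes past the root, where parent_entity() returns NULL.
      	 */
      	while (!is_same_group(*se, *pse)) {
      		*se = parent_entity(*se);
      		*pse = parent_entity(*pse);
      	}
      }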
    
    Debugging showed that this only appears to happen when we take the new
    code-path from commit:
    
      2ebb1771 ("sched/core: Offload wakee task activation if it the wakee is descheduling")
    
    and only when @cpu == smp_processor_id() -- something which should
    not be possible, because p->on_cpu can only be true for remote tasks.
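    
    For context, p->on_cpu is only written around a context switch, on
    the CPU doing that switch (paraphrased from kernel/sched/core.c,
    SMP case; not a verbatim quote):
    
      static inline void prepare_task(struct task_struct *next)
      {
      	/* Claim the task as running before we switch to it. */
      	next->on_cpu = 1;
      }
    
      static inline void finish_task(struct task_struct *prev)
      {
      	/*
      	 * Last reference to @prev from this CPU; only after this
      	 * RELEASE may @prev be moved elsewhere. Pairs with the
      	 * ACQUIRE loads of p->on_cpu in try_to_wake_up().
      	 */
      	smp_store_release(&prev->on_cpu, 0);
      }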
    Similarly, without the new code-path from commit:
    
      c6e7bd7a ("sched/core: Optimize ttwu() spinning on p->on_cpu")
    
    this would've unconditionally hit:
    
      smp_cond_load_acquire(&p->on_cpu, !VAL);
    
    and if 'cpu == smp_processor_id() && p->on_cpu' were possible, this
    would have resulted in an instant live-lock (with IRQs disabled),
    something that hasn't been reported.
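    
    That spin is the crux: clearing p->on_cpu is finish_task()'s job,
    on the CPU that is running p, so a waker spinning on its own CPU
    with IRQs disabled could never be rescued:
    
      /*
       * The waker busy-waits, IRQs off, for finish_task() to clear
       * p->on_cpu. If the waker itself were the CPU still "running" p,
       * nothing could ever clear it: instant live-lock.
       */
      smp_cond_load_acquire(&p->on_cpu, !VAL);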
    
    The NULL deref can be explained, however, if the task_cpu(p) load at
    the beginning of try_to_wake_up() returns an old value, and this old
    value happens to be smp_processor_id(). Further assume that the
    p->on_cpu load accurately returns 1: the task really is still
    running, just not here.
    
    Then, when we enqueue the task locally, we can crash in exactly the
    observed manner because p->se.cfs_rq != rq->cfs_rq: p's cfs_rq is
    from the wrong CPU, therefore we'll iterate into non-existent
    parents and NULL deref.
    
    The closest semi-plausible scenario I've managed to contrive is
    somewhat elaborate (then again, actual reproduction takes many CPU
    hours of rcutorture, so it can't be anything obvious):
    
    					X->cpu = 1
    					rq(1)->curr = X
    
    	CPU0				CPU1				CPU2
    
    					// switch away from X
    					LOCK rq(1)->lock
    					smp_mb__after_spinlock
    					dequeue_task(X)
    					  X->on_rq = 0
    					switch_to(Z)
    					  X->on_cpu = 0
    					UNLOCK rq(1)->lock
    
    									// migrate X to cpu 0
    									LOCK rq(1)->lock
    									dequeue_task(X)
    									set_task_cpu(X, 0)
    									  X->cpu = 0
    									UNLOCK rq(1)->lock
    
    									LOCK rq(0)->lock
    									enqueue_task(X)
    									  X->on_rq = 1
    									UNLOCK rq(0)->lock
    
    	// switch to X
    	LOCK rq(0)->lock
    	smp_mb__after_spinlock
    	switch_to(X)
    	  X->on_cpu = 1
    	UNLOCK rq(0)->lock
    
    	// X goes sleep
    	X->state = TASK_UNINTERRUPTIBLE
    	smp_mb();			// wake X
    					ttwu()
    					  LOCK X->pi_lock
    					  smp_mb__after_spinlock
    
    					  if (p->state)
    
    					  cpu = X->cpu; // =? 1
    
    					  smp_rmb()
    
    	// X calls schedule()
    	LOCK rq(0)->lock
    	smp_mb__after_spinlock
    	dequeue_task(X)
    	  X->on_rq = 0
    
    					  if (p->on_rq)
    
    					  smp_rmb();
    
    					  if (p->on_cpu && ttwu_queue_wakelist(..)) [*]
    
    					  smp_cond_load_acquire(&p->on_cpu, !VAL)
    
    					  cpu = select_task_rq(X, X->wake_cpu, ...)
    					  if (X->cpu != cpu)
    	switch_to(Y)
    	  X->on_cpu = 0
    	UNLOCK rq(0)->lock
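    
    For orientation, the CPU1 column above corresponds step for step to
    the pre-fix try_to_wake_up() sequence, roughly (abridged, not a
    verbatim quote):
    
      	LOCK p->pi_lock
      	smp_mb__after_spinlock();
      	if (!(p->state & state))
      		goto unlock;
    
      	cpu = task_cpu(p);		/* <-- can be a stale 1 */
    
      	smp_rmb();
      	if (p->on_rq && ...)
      		goto unlock;
    
      	smp_rmb();
      	if (smp_load_acquire(&p->on_cpu) &&
      	    ttwu_queue_wakelist(p, cpu, wake_flags | WF_ON_CPU))
      		goto unlock;		/* [*] queues on the stale @cpu */
    
      	smp_cond_load_acquire(&p->on_cpu, !VAL);
    
      	cpu = select_task_rq(p, p->wake_cpu, ...);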
    
    However, I'm having trouble convincing myself that's actually
    possible on x86_64 -- after all, every LOCK implies an smp_mb()
    there, so if ttwu observes ->state != RUNNING, it must also observe
    ->cpu != 1.
    
    (Most of the previous ttwu() races were found on very large PowerPC)
    
    Nevertheless, this fully explains the observed failure case.
    
    Fix it by ordering the task_cpu(p) load after the p->on_cpu load,
    which is easy since nothing actually uses @cpu before this.
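    
    Abridged, the post-fix ordering in try_to_wake_up() then looks like
    this (a sketch, not the verbatim diff):
    
      	if (smp_load_acquire(&p->on_cpu) &&
      	    ttwu_queue_wakelist(p, task_cpu(p), wake_flags | WF_ON_CPU))
      		goto unlock;
    
      	smp_cond_load_acquire(&p->on_cpu, !VAL);
    
      	cpu = select_task_rq(p, p->wake_cpu, ...);
      	if (task_cpu(p) != cpu) {
      		wake_flags |= WF_MIGRATED;
      		set_task_cpu(p, cpu);
      	}
    
      	ttwu_queue(p, cpu, wake_flags);
    
    Both paths now read task_cpu(p) behind an ACQUIRE on p->on_cpu: the
    wakelist attempt re-loads it inline, and the fallback only computes
    @cpu after the spin completes.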
    
    Fixes: c6e7bd7a ("sched/core: Optimize ttwu() spinning on p->on_cpu")
    Reported-by: Paul E. McKenney <paulmck@kernel.org>
    Tested-by: Paul E. McKenney <paulmck@kernel.org>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Link: https://lkml.kernel.org/r/20200622125649.GC576871@hirez.programming.kicks-ass.net