Commit 158fb15f authored by Ingo Molnar, committed by Linus Torvalds

[PATCH] SMP races in the timer code

This fixes two del_timer_sync() races that are still in the timer code. 

The first race was actually triggered in a 2.4 backport of the 2.6 timer
code.  The second race was never triggered - it is mostly theoretical on
a standalone kernel.  (It's more likely in any virtualized or otherwise
preemptible environment.)

Both races happen when self-rearming timers are used.  One mainstream
example is kernel/itimer.c.  The effect of the races is that
del_timer_sync() lets a timer keep running instead of synchronizing with
it, causing logic bugs (and crashes) in the affected kernel code.  One
typical incarnation of the race is a double add_timer().
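For reference, a minimal sketch of the self-rearming pattern in question,
written against the 2.6-era timer API (my_timer, my_timer_fn and
my_timer_active are made-up illustration names, not code from
kernel/itimer.c):

        #include <linux/timer.h>
        #include <linux/jiffies.h>

        static struct timer_list my_timer;      /* hypothetical example timer */
        static int my_timer_active = 1;         /* cleared at teardown time */

        /*
         * The timer function re-arms the timer itself, itimer-style.
         * This is the pattern the races below can break: one missed
         * synchronization and a double add_timer() results.
         */
        static void my_timer_fn(unsigned long data)
        {
                /* ... periodic work ... */
                if (my_timer_active) {
                        my_timer.expires = jiffies + HZ;
                        add_timer(&my_timer);   /* self-rearm */
                }
        }

        static void my_timer_setup(void)
        {
                init_timer(&my_timer);
                my_timer.function = my_timer_fn;
                my_timer.data = 0;
                my_timer.expires = jiffies + HZ;
                add_timer(&my_timer);
        }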

race #1:

this code in __run_timers() is running on CPU0:

                        list_del(&timer->entry);
                        timer->base = NULL;
                        [*]
                        set_running_timer(base, timer);
                        spin_unlock_irq(&base->lock);
                        [**]
                        fn(data);
                        spin_lock_irq(&base->lock);

CPU0 gets stuck at the [*] code-point briefly - after the timer->base has
been set to NULL, but before the base->running_timer pointer has been set
up. This is a fundamentally volatile scenario, as there's _zero_ knowledge
in the data structures that this timer is about to be executed!

Now CPU1 comes along and calls del_timer_sync().  It will find nothing -
neither timer->base nor base->running_timer will cause it to synchronize.
It will return and report that the timer has been deleted - and shortly
afterwards CPU0 continues to execute the timer fn, which will cause
crashes.

This particular race is easy to fix by reordering the timer->base
clearing with set_running_timer() and putting a wmb() between them (see
the sketch below) - but there are more races.
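A minimal sketch of that reordering, assuming the __run_timers() fragment
quoted above (illustration only - the patch below takes the heavier,
lock-all-bases route instead, which covers this race too):

                        list_del(&timer->entry);
                        /*
                         * Publish the running state before detaching the
                         * timer from its base, so a concurrent
                         * del_timer_sync() always sees at least one of
                         * the two indications:
                         */
                        set_running_timer(base, timer);
                        smp_wmb();
                        timer->base = NULL;
                        spin_unlock_irq(&base->lock);
                        fn(data);
                        spin_lock_irq(&base->lock);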

race #2:

The check in del_timer_sync() for a 'pending or running timer' is
fundamentally non-robust.  E.g. if CPU0 gets stuck at the [***] point below:

                base = &per_cpu(tvec_bases, i);
                if (base->running_timer == timer) {
                        while (base->running_timer == timer) {
                                cpu_relax();
                                preempt_check_resched();
                        }
                        [***]
                        break;
                }
        }
        smp_rmb();
        if (timer_pending(timer))
                goto del_again;


then del_timer_sync() has already decided that this timer is not running
(we just finished loop-waiting for it), but we have not done the
timer_pending() check yet.

If the timer has re-armed itself, and if the timer expires on CPU1 (this
needs a long delay on CPU0, but that's not hard to achieve, e.g. in UML
or with kernel preemption enabled), then CPU1 could start to expire the
timer and get to the [**] point in __run_timers() (see above).  If CPU1
then gets stalled and CPU0 is unstalled, the timer_pending() check in
del_timer_sync() will not notice the running timer, and del_timer_sync()
returns - while CPU1 is just about to run the timer!

Fixing this second race is hard - it involves a heavy race-check
operation that has to lock all timer bases, and has to re-check the
base->running_timer value and the timer_pending() condition atomically.

This fix also covers the first race, because it forces del_timer_sync()
to observe the timer state atomically - so the [*] code point always
synchronizes with del_timer_sync().

The patch is ugly but safe, and it has fixed the crashes in the 2.4
backport.  I tested the patch on 2.6.0-test7 with some heavy itimer use
and it works fine.  Removing self-rearming timers safely is the sole
purpose of del_timer_sync(), so there's no way around this overhead, I
think.  I believe we should ultimately fix all major del_timer_sync()
users to not use self-rearming timers - having del_timer_sync() in the
thread-exit path is now a considerable source of SMP overhead.  But that
is out of scope for the current 2.6 fixes of course, and we have to
support self-rearming timers as well.
parent 606e1044
@@ -315,23 +315,30 @@ EXPORT_SYMBOL(del_timer);
  * the timer it also makes sure the handler has finished executing on other
  * CPUs.
  *
- * Synchronization rules: callers must prevent restarting of the timer,
- * otherwise this function is meaningless. It must not be called from
- * interrupt contexts. Upon exit the timer is not queued and the handler
- * is not running on any CPU.
+ * Synchronization rules: callers must prevent restarting of the timer
+ * (except restarting the timer from the timer function itself), otherwise
+ * this function is meaningless. It must not be called from interrupt
+ * contexts. Upon exit the timer is not queued and the handler is not
+ * running on any CPU.
  *
- * The function returns whether it has deactivated a pending timer or not.
+ * The function returns the number of times it has deactivated a pending
+ * timer.
  */
 int del_timer_sync(struct timer_list *timer)
 {
+        int i, ret = 0, again;
+        unsigned long flags;
         tvec_base_t *base;
-        int i, ret = 0;
 
         check_timer(timer);
 
 del_again:
         ret += del_timer(timer);
 
+        /*
+         * First do a lighter but racy check, whether the
+         * timer is running on any other CPU:
+         */
         for (i = 0; i < NR_CPUS; i++) {
                 if (!cpu_online(i))
                         continue;
@@ -345,8 +352,33 @@ int del_timer_sync(struct timer_list *timer)
                         break;
                 }
         }
-        smp_rmb();
+
+        /*
+         * Do a heavy but race-free re-check to make sure that the
+         * timer is neither running nor pending:
+         */
+        again = 0;
+        local_irq_save(flags);
+
+        for (i = 0; i < NR_CPUS; i++)
+                if (cpu_online(i))
+                        spin_lock(&per_cpu(tvec_bases, i).lock);
+
         if (timer_pending(timer))
+                again = 1;
+        else
+                for (i = 0; i < NR_CPUS; i++)
+                        if (cpu_online(i) &&
+                            (per_cpu(tvec_bases, i).running_timer == timer))
+                                again = 1;
+
+        for (i = 0; i < NR_CPUS; i++)
+                if (cpu_online(i))
+                        spin_unlock(&per_cpu(tvec_bases, i).lock);
+
+        local_irq_restore(flags);
+
+        if (again)
                 goto del_again;
 
         return ret;
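As a usage note against the synchronization rules documented above
(my_timer and my_timer_active as in the earlier hypothetical sketch):

        static int my_timer_shutdown(void)
        {
                /*
                 * Ask the timer function to stop re-arming.  Not strictly
                 * required - del_timer_sync() copes with a self-rearming
                 * timer - but no code other than the timer function may
                 * re-queue the timer from this point on.
                 */
                my_timer_active = 0;

                /*
                 * del_timer_sync() now returns the number of pending
                 * instances it deactivated; for a self-rearming timer
                 * this can legitimately be larger than 1.
                 */
                return del_timer_sync(&my_timer);
        }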