[PATCH] SMP races in the timer code
This fixes two del_timer_sync() races that are still in the timer code.

The first race was actually triggered in a 2.4 backport of the 2.6 timer code. The second race was never triggered - it is mostly theoretical on a standalone kernel. (It is more likely in any virtualized or otherwise preemptible environment.)

Both races happen when self-rearming timers are used. One mainstream example is kernel/itimer.c. The effect of the races is that del_timer_sync() leaves a timer running instead of synchronizing with it, causing logic bugs (and crashes) in the affected kernel code. One typical incarnation of the race is a double add_timer().

race #1:

this code in __run_timers() is running on CPU0:

        list_del(&timer->entry);
        timer->base = NULL;
    [*]
        set_running_timer(base, timer);
        spin_unlock_irq(&base->lock);
    [**]
        fn(data);
        spin_lock_irq(&base->lock);

CPU0 gets stuck at the [*] code-point briefly - after timer->base has been set to NULL, but before the base->running_timer pointer has been set up. This is a fundamentally volatile scenario, as there is _zero_ knowledge in the data structures that this timer is about to be executed!

Now CPU1 comes along and calls del_timer_sync(). It will find nothing - neither timer->base nor base->running_timer will cause it to synchronize. It will return and report that the timer has been deleted - and shortly afterwards CPU0 continues to execute the timer fn, which will cause crashes.

This particular race is easy to fix by reordering the timer->base clearing with set_running_timer() and putting a wmb() between them, but there is another race:

race #2:

The 'pending or running timer' check in del_timer_sync() is fundamentally not robust. E.g. if CPU0 gets stuck at the [***] point below:

        for (i = 0; i < NR_CPUS; i++) {
                base = &per_cpu(tvec_bases, i);
                if (base->running_timer == timer) {
                        while (base->running_timer == timer) {
                                cpu_relax();
                                preempt_check_resched();
                        }
    [***]
                        break;
                }
        }
        smp_rmb();
        if (timer_pending(timer))
                goto del_again;

then del_timer_sync() has already decided that this timer is not running (we just finished loop-waiting for it), but we have not done the timer_pending() check yet.

If the timer has re-armed itself, and if the timer expires on CPU1 (this needs a long delay on CPU0, but that is not hard to achieve, e.g. in UML or with kernel preemption enabled), then CPU1 can start to expire the timer and get to the [**] point in __run_timers() (see above). If CPU1 then gets stalled and CPU0 is unstalled, the timer_pending() check in del_timer_sync() will not notice the running timer, and del_timer_sync() returns - while CPU1 is just about to run the timer!

Fixing this second race is hard - it involves a heavy race-check operation that has to lock all bases and re-check the base->running_timer value and the timer_pending() condition atomically. This also fixes the first race, because it forces del_timer_sync() to always observe the timer state atomically, so the [*] code point will always synchronize with del_timer_sync().

The patch is ugly but safe, and it has fixed the crashes in the 2.4 backport. I tested the patch on 2.6.0-test7 with some heavy itimer use and it works fine.

Removing self-rearming timers safely is the sole purpose of del_timer_sync(), so I think there is no way around this overhead. I believe we should ultimately fix all major del_timer_sync() users to not use self-rearming timers - having del_timer_sync() in the thread-exit path is now a considerable source of SMP overhead. But this is out of the scope of the current 2.6 fixes of course, and we have to support self-rearming timers as well.
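For reference, this is the pattern the races affect - a minimal sketch of a self-rearming timer in the style of kernel/itimer.c, written against the 2.6.0-era timer API. The names my_timer, my_timer_fn, my_start and my_stop are illustrative only and do not appear in the patch:

        static struct timer_list my_timer;

        /* timer handler: does its work, then re-arms itself */
        static void my_timer_fn(unsigned long data)
        {
                /* ... periodic work ... */

                my_timer.expires = jiffies + HZ;
                add_timer(&my_timer);
        }

        static void my_start(void)
        {
                init_timer(&my_timer);
                my_timer.function = my_timer_fn;
                my_timer.data = 0;
                my_timer.expires = jiffies + HZ;
                add_timer(&my_timer);
        }

        static void my_stop(void)
        {
                /*
                 * A plain del_timer() is not enough: the handler may be
                 * running on another CPU, about to call add_timer() again.
                 * del_timer_sync() must guarantee that the handler is
                 * neither pending nor running when it returns - which is
                 * exactly what the races above break, leading e.g. to the
                 * double add_timer() mentioned earlier.
                 */
                del_timer_sync(&my_timer);
        }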
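The simpler fix mentioned for race #1 - clearing timer->base only after base->running_timer is visible - would look roughly like this in __run_timers(). This is a sketch of the reordering only, not the code the patch applies:

        /*
         * racy order (current code): the timer is invisible to
         * del_timer_sync() between the base clear and set_running_timer()
         */
        list_del(&timer->entry);
        timer->base = NULL;
        set_running_timer(base, timer);

        /* reordered (sketch): publish running_timer first */
        list_del(&timer->entry);
        set_running_timer(base, timer);
        wmb();          /* order the running_timer store before the base clear */
        timer->base = NULL;

With this ordering (and a matching read barrier on the del_timer_sync() side), a CPU that observes timer->base == NULL can also observe base->running_timer == timer, so del_timer_sync() always has something to wait on.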
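The 'heavy race-check operation' is only described in prose above. Below is a minimal sketch of one way such a check could look, assuming the per-CPU tvec_bases layout from the snippets above; timer_really_gone() is a hypothetical helper name, CPU hotplug handling is omitted, and the actual patch may structure this differently:

        /*
         * Sketch: take every per-CPU base lock so that "not pending" and
         * "not running on any CPU" are evaluated as one atomic condition
         * with respect to __run_timers(), which changes both pieces of
         * state under base->lock.
         */
        static int timer_really_gone(struct timer_list *timer)
        {
                unsigned long flags;
                int i, gone;

                local_irq_save(flags);
                for (i = 0; i < NR_CPUS; i++)
                        spin_lock(&per_cpu(tvec_bases, i).lock);

                gone = !timer_pending(timer);
                for (i = 0; gone && i < NR_CPUS; i++)
                        if (per_cpu(tvec_bases, i).running_timer == timer)
                                gone = 0;

                for (i = NR_CPUS - 1; i >= 0; i--)
                        spin_unlock(&per_cpu(tvec_bases, i).lock);
                local_irq_restore(flags);

                return gone;
        }

del_timer_sync() would then repeat its del_timer() plus wait-for-running loop until timer_really_gone() succeeds. Because __run_timers() performs the [*] transition while holding base->lock, a check done under all base locks cannot observe the half-updated state of race #1, which is why the heavy check covers both races.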