[PATCH] SMP races in the timer code
This fixes two del_timer_sync() races that are still in the timer code.

The first race was actually triggered in a 2.4 backport of the 2.6 timer code. The second race was never triggered - it is mostly theoretical on a standalone kernel. (It is more likely in any virtualized or otherwise preemptible environment.)

Both races happen when self-rearming timers are used. One mainstream example is kernel/itimer.c. The effect of the races is that del_timer_sync() leaves a timer running instead of synchronizing with it, causing logic bugs (and crashes) in the affected kernel code. One typical incarnation of the race is a double add_timer().

race #1:

this code in __run_timers() is running on CPU0:

        list_del(&timer->entry);
        timer->base = NULL;
    [*]
        set_running_timer(base, timer);
        spin_unlock_irq(&base->lock);
    [**]
        fn(data);
        spin_lock_irq(&base->lock);

CPU0 gets stuck at the [*] code-point briefly - after timer->base has been set to NULL, but before the base->running_timer pointer has been set up. This is a fundamentally volatile scenario, as there is _zero_ knowledge in the data structures that this timer is about to be executed!

Now CPU1 comes along and calls del_timer_sync(). It will find nothing - neither timer->base nor base->running_timer will cause it to synchronize. It will return and report that the timer has been deleted - and shortly afterwards CPU0 continues to execute the timer fn, which will cause crashes.

This particular race is easy to fix by reordering the timer->base clearing with set_running_timer() and putting a wmb() between them, but there is another race:

race #2:

The 'pending or running timer' check in del_timer_sync() is fundamentally not robust. E.g. if CPU0 gets stuck at the [***] point below:

        for (i = 0; i < NR_CPUS; i++) {
                base = &per_cpu(tvec_bases, i);
                if (base->running_timer == timer) {
                        while (base->running_timer == timer) {
                                cpu_relax();
                                preempt_check_resched();
                        }
    [***]
                        break;
                }
        }
        smp_rmb();
        if (timer_pending(timer))
                goto del_again;

then del_timer_sync() has already decided that this timer is not running (we just finished loop-waiting for it), but we have not done the timer_pending() check yet.

If the timer has re-armed itself, and if the timer expires on CPU1 (this needs a long delay on CPU0, but that is not hard to achieve, e.g. in UML or with kernel preemption enabled), then CPU1 can start to expire the timer and get to the [**] point in __run_timers() (see above). If CPU1 then gets stalled and CPU0 is unstalled, the timer_pending() check in del_timer_sync() will not notice the running timer, and del_timer_sync() returns - while CPU1 is just about to run the timer!

Fixing this second race is hard - it involves a heavy race-check operation that has to lock all bases and re-check the base->running_timer value and the timer_pending() condition atomically. This also fixes the first race, because it forces del_timer_sync() to always observe the timer state atomically, so the [*] code point will always synchronize with del_timer_sync().

The patch is ugly but safe, and it has fixed the crashes in the 2.4 backport. I tested the patch on 2.6.0-test7 with some heavy itimer use and it works fine.

Removing self-rearming timers safely is the sole purpose of del_timer_sync(), so I think there is no way around this overhead. I believe we should ultimately fix all major del_timer_sync() users to not use self-rearming timers - having del_timer_sync() in the thread-exit path is now a considerable source of SMP overhead. But this is out of the scope of the current 2.6 fixes of course, and we have to support self-rearming timers as well.
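For reference, this is the pattern the races affect - a minimal sketch of a self-rearming timer in the style of kernel/itimer.c, written against the 2.6.0-era timer API. The names my_timer, my_timer_fn, my_start and my_stop are illustrative only and do not appear in the patch:

        static struct timer_list my_timer;

        /* timer handler: does its work, then re-arms itself */
        static void my_timer_fn(unsigned long data)
        {
                /* ... periodic work ... */

                my_timer.expires = jiffies + HZ;
                add_timer(&my_timer);
        }

        static void my_start(void)
        {
                init_timer(&my_timer);
                my_timer.function = my_timer_fn;
                my_timer.data = 0;
                my_timer.expires = jiffies + HZ;
                add_timer(&my_timer);
        }

        static void my_stop(void)
        {
                /*
                 * A plain del_timer() is not enough: the handler may be
                 * running on another CPU, about to call add_timer() again.
                 * del_timer_sync() must guarantee that the handler is
                 * neither pending nor running when it returns - which is
                 * exactly what the races above break, leading e.g. to the
                 * double add_timer() mentioned earlier.
                 */
                del_timer_sync(&my_timer);
        }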
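The simpler fix mentioned for race #1 - clearing timer->base only after base->running_timer is visible - would look roughly like this in __run_timers(). This is a sketch of the reordering only, not the code the patch applies:

        /*
         * racy order (current code): the timer is invisible to
         * del_timer_sync() between the base clear and set_running_timer()
         */
        list_del(&timer->entry);
        timer->base = NULL;
        set_running_timer(base, timer);

        /* reordered (sketch): publish running_timer first */
        list_del(&timer->entry);
        set_running_timer(base, timer);
        wmb();          /* order the running_timer store before the base clear */
        timer->base = NULL;

With this ordering (and a matching read barrier on the del_timer_sync() side), a CPU that observes timer->base == NULL can also observe base->running_timer == timer, so del_timer_sync() always has something to wait on.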
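The 'heavy race-check operation' is only described in prose above. Below is a minimal sketch of one way such a check could look, assuming the per-CPU tvec_bases layout from the snippets above; timer_really_gone() is a hypothetical helper name, CPU hotplug handling is omitted, and the actual patch may structure this differently:

        /*
         * Sketch: take every per-CPU base lock so that "not pending" and
         * "not running on any CPU" are evaluated as one atomic condition
         * with respect to __run_timers(), which changes both pieces of
         * state under base->lock.
         */
        static int timer_really_gone(struct timer_list *timer)
        {
                unsigned long flags;
                int i, gone;

                local_irq_save(flags);
                for (i = 0; i < NR_CPUS; i++)
                        spin_lock(&per_cpu(tvec_bases, i).lock);

                gone = !timer_pending(timer);
                for (i = 0; gone && i < NR_CPUS; i++)
                        if (per_cpu(tvec_bases, i).running_timer == timer)
                                gone = 0;

                for (i = NR_CPUS - 1; i >= 0; i--)
                        spin_unlock(&per_cpu(tvec_bases, i).lock);
                local_irq_restore(flags);

                return gone;
        }

del_timer_sync() would then repeat its del_timer() plus wait-for-running loop until timer_really_gone() succeeds. Because __run_timers() performs the [*] transition while holding base->lock, a check done under all base locks cannot observe the half-updated state of race #1, which is why the heavy check covers both races.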