• Heiko Carstens's avatar
    nohz: Fix printk_needs_cpu() return value on offline cpus · 61ab2544
    Heiko Carstens authored
    This patch fixes a hang observed with 2.6.32 kernels where timers got enqueued
    on offline cpus.
    
    printk_needs_cpu() may return 1 if called on offline cpus. When a cpu gets
    offlined it schedules the idle process which, before killing its own cpu, will
    call tick_nohz_stop_sched_tick(). That function in turn will call
    printk_needs_cpu() in order to check if the local tick can be disabled. On
    offline cpus this function should naturally return 0 since regardless if the
    tick gets disabled or not the cpu will be dead short after. That is besides the
    fact that __cpu_disable() should already have made sure that no interrupts on
    the offlined cpu will be delivered anyway.
    
    In this case it prevents tick_nohz_stop_sched_tick() to call
    select_nohz_load_balancer(). No idea if that really is a problem. However what
    made me debug this is that on 2.6.32 the function get_nohz_load_balancer() is
    used within __mod_timer() to select a cpu on which a timer gets enqueued. If
    printk_needs_cpu() returns 1 then the nohz_load_balancer cpu doesn't get
    updated when a cpu gets offlined. It may contain the cpu number of an offline
    cpu. In turn timers get enqueued on an offline cpu and not very surprisingly
    they never expire and cause system hangs.
    
    This has been observed 2.6.32 kernels. On current kernels __mod_timer() uses
    get_nohz_timer_target() which doesn't have that problem. However there might be
    other problems because of the too early exit tick_nohz_stop_sched_tick() in
    case a cpu goes offline.
    
    Easiest way to fix this is just to test if the current cpu is offline and call
    printk_tick() directly which clears the condition.
    
    Alternatively I tried a cpu hotplug notifier which would clear the condition,
    however between calling the notifier function and printk_needs_cpu() something
    could have called printk() again and the problem is back again. This seems to
    be the safest fix.
    Signed-off-by: default avatarHeiko Carstens <heiko.carstens@de.ibm.com>
    Signed-off-by: default avatarPeter Zijlstra <a.p.zijlstra@chello.nl>
    Cc: stable@kernel.org
    LKML-Reference: <20101126120235.406766476@de.ibm.com>
    Signed-off-by: default avatarIngo Molnar <mingo@elte.hu>
    61ab2544
printk.c 39.5 KB