    x86/mce: Ensure offline CPUs don't participate in rendezvous process · d90167a9
    Ashok Raj authored
    Intel's MCA implementation broadcasts MCEs to all CPUs on the
    node. This poses a problem for offlined CPUs which cannot
    participate in the rendezvous process:
    
      Kernel panic - not syncing: Timeout: Not all CPUs entered broadcast exception handler
      Kernel Offset: disabled
      Rebooting in 100 seconds..
    
    More specifically, Linux does a soft offline of a CPU when a 0
    is written to /sys/devices/system/cpu/cpuX/online, which
    doesn't prevent the #MC exception from being broadcast to that
    CPU.
    
    Ensure that offline CPUs don't participate in the MCE rendezvous
    and clear the RIP valid status bit so that a second MCE won't
    cause a shutdown.
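
The early bail-out described above can be modelled in user space. This is a minimal sketch, not the kernel code: `struct sim_cpu`, `cpus[]` and `sim_machine_check()` are hypothetical stand-ins for the real per-CPU state and handler. The sketch assumes an offline CPU always returns early, and it zeroes the whole simulated status word, which also clears the RIP valid bit.

```c
#include <stdbool.h>
#include <stdint.h>

#define NR_CPUS 4
/* RIP valid flag: bit 0 of IA32_MCG_STATUS. */
#define MCG_STATUS_RIPV (1ULL << 0)

/* Hypothetical per-CPU state standing in for the online mask and the MSR. */
struct sim_cpu {
    bool online;
    uint64_t mcg_status;
};

static struct sim_cpu cpus[NR_CPUS];
static int mce_callin; /* CPUs that have joined the rendezvous */

/*
 * Simulated #MC entry point.
 * Returns true if the CPU joined the rendezvous, false if it bailed out.
 */
bool sim_machine_check(int cpu)
{
    if (!cpus[cpu].online) {
        /*
         * Offline CPU: stay out of the rendezvous and clear the
         * status word (assumption: the sketch clears all bits,
         * RIPV included) so a later MCE starts from a clean state.
         */
        cpus[cpu].mcg_status = 0;
        return false;
    }

    /* Online CPU: check in with the others. */
    mce_callin++;
    return true;
}
```

With CPU 0 online and CPU 1 offline, calling `sim_machine_check()` on both leaves `mce_callin` at 1 and the simulated status of CPU 1 at 0.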
    
    Without the patch, mce_start() will increment mce_callin and
    wait for all CPUs. Offlined CPUs should avoid participating in
    the rendezvous process altogether.
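
A toy model of the call-in counting may help to show why a stray participant is a problem. The sketch rests on an assumption not spelled out in this message: that the wait succeeds only when the counter equals the number of CPUs expected to take part. `struct rendezvous`, `rendezvous_enter()`, `rendezvous_wait()` and `SPIN_BUDGET` are hypothetical names for illustration, not kernel interfaces.

```c
#include <stdbool.h>

enum { SPIN_BUDGET = 1000 }; /* stand-in for the rendezvous timeout */

struct rendezvous {
    int callin;   /* CPUs that have checked in */
    int expected; /* CPUs expected to check in (the online ones) */
};

/* Each CPU entering the handler bumps the shared counter. */
void rendezvous_enter(struct rendezvous *r)
{
    r->callin++;
}

/*
 * Spin until the counter matches the expected count or the budget
 * runs out. Returns true on a completed rendezvous, false on timeout.
 */
bool rendezvous_wait(const struct rendezvous *r)
{
    for (int spins = 0; spins < SPIN_BUDGET; spins++) {
        if (r->callin == r->expected)
            return true;
    }
    return false;
}
```

Under this assumption, three online CPUs plus one offline CPU checking in yield a count of 4 against an expected 3, so the wait times out. If the offline CPU stays out, the count reaches 3 and the wait completes.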
    
    Signed-off-by: Ashok Raj <ashok.raj@intel.com>
    [ Massage commit message. ]
    Signed-off-by: Borislav Petkov <bp@suse.de>
    Reviewed-by: Tony Luck <tony.luck@intel.com>
    Cc: <stable@vger.kernel.org>
    Cc: H. Peter Anvin <hpa@zytor.com>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: linux-edac <linux-edac@vger.kernel.org>
    Link: http://lkml.kernel.org/r/1449742346-21470-2-git-send-email-bp@alien8.de
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Signed-off-by: Thomas Gleixner <tglx@linutronix.de>