    x86/mce: Handle Intel threshold interrupt storms · 1f68ce2a
    Tony Luck authored
    Add an Intel specific hook into machine_check_poll() to keep track of
    per-CPU, per-bank corrected error logs (with a stub for the
    CONFIG_MCE_INTEL=n case).
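
The hook-plus-stub arrangement described above can be sketched in plain userspace C. This is an illustration only, not the code from this commit: every `sketch_` name, the `CONFIG_MCE_INTEL_SKETCH` macro, and the array sizes are hypothetical stand-ins. It shows how generic polling code can call a vendor hook unconditionally while the disabled configuration compiles it down to an empty inline function.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the Kconfig symbol; remove to get the stub. */
#define CONFIG_MCE_INTEL_SKETCH 1

#define SKETCH_NR_CPUS  4	/* assumed sizes, for illustration only */
#define SKETCH_NR_BANKS 8

/* Per-CPU, per-bank count of corrected error logs seen while polling. */
static unsigned int sketch_ce_count[SKETCH_NR_CPUS][SKETCH_NR_BANKS];

#ifdef CONFIG_MCE_INTEL_SKETCH
/* Vendor hook: record one corrected error log for this CPU and bank. */
static inline void sketch_track_corrected(unsigned int cpu, unsigned int bank,
					  bool corrected)
{
	assert(cpu < SKETCH_NR_CPUS && bank < SKETCH_NR_BANKS);
	if (corrected)
		sketch_ce_count[cpu][bank]++;
}
#else
/* Stub: callers need no #ifdef, and the call compiles to nothing. */
static inline void sketch_track_corrected(unsigned int cpu, unsigned int bank,
					  bool corrected)
{
	(void)cpu;
	(void)bank;
	(void)corrected;
}
#endif

/* Generic poll path: calls the hook without knowing which variant exists. */
static void sketch_poll_bank(unsigned int cpu, unsigned int bank,
			     bool corrected)
{
	sketch_track_corrected(cpu, bank, corrected);
}
```

Keeping the conditional compilation next to the hook's definition, rather than at each call site, is the usual reason for providing a stub.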
    
    When a storm is observed, the rate of interrupts is reduced by setting
    a large threshold value for this bank in IA32_MCi_CTL2. This bank is
    added to the bitmap of banks for this CPU to poll. The polling rate is
    increased to once per second.
    
    When a storm ends, reset the threshold in IA32_MCi_CTL2 back to 1, remove
    the bank from the bitmap for polling, and change the polling rate back
    to the default.
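
The storm begin and end transitions described above can be sketched as a small userspace simulation. This is a hedged illustration, not the kernel implementation: the `sketch_` names are hypothetical, the MSRs are modelled as a plain array, the storm threshold value and the 300 second default interval are assumptions, and the choice to keep fast polling while any bank remains in storm mode is also an assumption. The only architectural detail relied on is that the corrected error count threshold occupies bits 14:0 of IA32_MCi_CTL2.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_NR_BANKS         8
#define SKETCH_THRESHOLD_MASK   0x7fffULL	/* IA32_MCi_CTL2 bits 14:0 */
#define SKETCH_STORM_THRESHOLD  0x7ff0ULL	/* assumed "large" value */
#define SKETCH_NORMAL_THRESHOLD 1ULL
#define SKETCH_POLL_DEFAULT_SEC 300U		/* assumed default interval */
#define SKETCH_POLL_STORM_SEC   1U		/* once per second */

static uint64_t sketch_ctl2[SKETCH_NR_BANKS];	/* simulated CTL2 MSRs */
static uint32_t sketch_poll_banks;		/* bitmap of banks to poll */
static unsigned int sketch_poll_interval_sec = SKETCH_POLL_DEFAULT_SEC;

/* Rewrite only the threshold field, preserving the other CTL2 bits. */
static void sketch_set_threshold(unsigned int bank, uint64_t thresh)
{
	uint64_t val = sketch_ctl2[bank];

	val &= ~SKETCH_THRESHOLD_MASK;
	val |= thresh & SKETCH_THRESHOLD_MASK;
	sketch_ctl2[bank] = val;
}

/* Enter (on == true) or leave (on == false) storm mode for one bank. */
static void sketch_handle_storm(unsigned int bank, bool on)
{
	assert(bank < SKETCH_NR_BANKS);

	if (on) {
		sketch_set_threshold(bank, SKETCH_STORM_THRESHOLD);
		sketch_poll_banks |= 1U << bank;
	} else {
		sketch_set_threshold(bank, SKETCH_NORMAL_THRESHOLD);
		sketch_poll_banks &= ~(1U << bank);
	}

	/* Poll fast while any bank is still in storm mode. */
	sketch_poll_interval_sec = sketch_poll_banks ?
		SKETCH_POLL_STORM_SEC : SKETCH_POLL_DEFAULT_SEC;
}
```

Raising the threshold rather than disabling the interrupt means the bank can still signal if errors accumulate past the large count, while polling covers the common case.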
    
    If a CPU with banks in storm mode is taken offline, the new CPU that
    inherits ownership of those banks takes over management of storm(s) in
    the inherited bank(s).
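
One way to picture the ownership handoff is the simulation below. It is a sketch under stated assumptions, not the code in this commit: the `sketch_` names and sizes are hypothetical, and it assumes the inheriting CPU recognises a bank in storm mode by finding the large threshold value still programmed in that bank's control register.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_NR_CPUS         2
#define SKETCH_NR_BANKS        8
#define SKETCH_THRESHOLD_MASK  0x7fffULL	/* threshold field, bits 14:0 */
#define SKETCH_STORM_THRESHOLD 0x7ff0ULL	/* assumed storm marker value */

static uint64_t sketch_ctl2[SKETCH_NR_BANKS];	/* simulated shared banks */
static uint32_t sketch_owned[SKETCH_NR_CPUS];	/* banks owned per CPU */
static uint32_t sketch_storm_poll[SKETCH_NR_CPUS]; /* storm banks per CPU */

/* Assumption: a storm-mode bank still carries the large threshold. */
static bool sketch_bank_in_storm(unsigned int bank)
{
	return (sketch_ctl2[bank] & SKETCH_THRESHOLD_MASK) ==
		SKETCH_STORM_THRESHOLD;
}

/* Move every bank owned by the dying CPU to the heir, storms included. */
static void sketch_cpu_offline(unsigned int dying, unsigned int heir)
{
	unsigned int bank;

	assert(dying < SKETCH_NR_CPUS && heir < SKETCH_NR_CPUS);
	assert(dying != heir);

	for (bank = 0; bank < SKETCH_NR_BANKS; bank++) {
		uint32_t bit = 1U << bank;

		if (!(sketch_owned[dying] & bit))
			continue;

		sketch_owned[dying] &= ~bit;
		sketch_storm_poll[dying] &= ~bit;
		sketch_owned[heir] |= bit;

		/* The heir takes over polling for banks still in storm. */
		if (sketch_bank_in_storm(bank))
			sketch_storm_poll[heir] |= bit;
	}
}
```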
    
    The cmci_discover() function was already very large. These changes
    pushed it well over the top. Refactor with three helper functions to
    bring it back under control.
    Signed-off-by: Tony Luck <tony.luck@intel.com>
    Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
    Link: https://lore.kernel.org/r/20231115195450.12963-4-tony.luck@intel.com
intel.c 13.5 KB