    x86/MCE/intel: Cleanup CMCI storm logic · 3f2f0680
    Borislav Petkov authored
    Initially, this started with yet another report about a race
    condition in the CMCI storm adaptive period length thing. Yes, we have
    to admit, it is fragile and error prone. So let's simplify it.
    
    The simpler logic is: now, after we enter storm mode, we go straight to
    polling with CMCI_STORM_INTERVAL, i.e. once a second. We remain in storm
    mode as long as we see errors being logged while polling.
    
    Theoretically, if we see an uninterrupted error stream, we will remain
    in storm mode indefinitely and keep polling the MSRs.
    
    However, when the storm is actually a burst of errors, once we have
    logged them all, we back out of it after ~5 mins of polling and no more
    errors logged.
    
    If we encounter an error during those 5 minutes, we restart the 5 min
    back-off countdown, while still polling once a second.
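
    To make the above concrete, here is a minimal userspace sketch of the
    back-off countdown. It is not the code in mce_intel.c: struct storm_state,
    storm_enter(), storm_tick(), polls_until_exit() and both constants are
    made-up names for illustration, and the only values taken from this
    message are "once a second" and "~5 mins".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative values only: poll once a second, back off after ~5 minutes. */
#define STORM_POLL_SECS	1
#define BACKOFF_SECS	(5 * 60)

/* Hypothetical per-CPU storm bookkeeping. */
struct storm_state {
	bool in_storm;
	int backoff_cnt;	/* quiet polls left before leaving storm mode */
};

/* Enter storm mode and arm the full back-off countdown. */
static void storm_enter(struct storm_state *s)
{
	assert(s != NULL);
	s->in_storm = true;
	s->backoff_cnt = BACKOFF_SECS / STORM_POLL_SECS;
}

/*
 * One polling tick. saw_error is what the poll reported for this tick.
 * Returns true while we are still in storm mode.
 */
static bool storm_tick(struct storm_state *s, bool saw_error)
{
	assert(s != NULL);
	if (!s->in_storm)
		return false;

	if (saw_error)
		/* Error logged: restart the countdown. */
		s->backoff_cnt = BACKOFF_SECS / STORM_POLL_SECS;
	else if (--s->backoff_cnt == 0)
		/* A full quiet period has passed: leave storm mode. */
		s->in_storm = false;

	return s->in_storm;
}

/*
 * Replay a burst: the first burst_len polls each see an error, every
 * later poll is quiet. Returns the number of polls until storm mode ends.
 */
static int polls_until_exit(int burst_len)
{
	struct storm_state s;
	int polls = 0;

	storm_enter(&s);
	do {
		polls++;
	} while (storm_tick(&s, polls <= burst_len));

	return polls;
}
```

    With these illustrative numbers, a burst of N consecutive erroring polls
    keeps the sketch in storm mode for N + 300 polls, and an uninterrupted
    error stream never lets the countdown reach zero.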
    
    Making machine_check_poll() return a bool denoting whether it has
    seen an error or not lets us simplify a bunch of code and make the storm
    handling private to mce_intel.c.
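
    As a rough illustration of the bool-returning shape, the sketch below
    scans an array of simulated status words and reports whether it logged
    anything. poll_banks(), STATUS_VALID and the array-based banks are
    assumptions made for this example; the real machine_check_poll() reads
    MSRs and has a different signature.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for a "this bank holds a valid error" flag. */
#define STATUS_VALID	(1ULL << 63)

/*
 * Scan nbanks simulated status words. Every valid entry is counted in
 * *logged and then cleared. Returns true if at least one error was seen,
 * so the caller can decide whether to restart its back-off countdown.
 */
static bool poll_banks(uint64_t *status, size_t nbanks, unsigned int *logged)
{
	bool error_seen = false;
	size_t i;

	assert(status != NULL && logged != NULL);

	for (i = 0; i < nbanks; i++) {
		if (!(status[i] & STATUS_VALID))
			continue;

		(*logged)++;		/* "log" the error */
		status[i] = 0;		/* clear the bank */
		error_seen = true;
	}

	return error_seen;
}
```

    The caller needs nothing beyond the return value, which is what allows
    the storm bookkeeping to stay in one file in this sketch.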
    
    Some minor cleanups while at it.
    
    Reported-by: Calvin Owens <calvinowens@fb.com>
    Tested-by: Tony Luck <tony.luck@intel.com>
    Link: http://lkml.kernel.org/r/1417746575-23299-1-git-send-email-calvinowens@fb.com
    Signed-off-by: Borislav Petkov <bp@suse.de>