    powerpc/powernv: Panic on unhandled Machine Check · f2dd80ec
    Daniel Axtens authored
    All unrecovered machine check errors on PowerNV should cause an
    immediate panic. There are 2 reasons that this is the right policy:
    it's not safe to continue, and we're already trying to reboot.
    
    Firstly, if we go through the recovery process and do not successfully
    recover, we can't be sure about the state of the machine, and it is
    not safe to proceed.
    
    Linux knows about the following sources of Machine Check Errors:
    - Uncorrectable Errors (UE)
    - Effective to Real Address Translation (ERAT)
    - Segment Lookaside Buffer (SLB)
    - Translation Lookaside Buffer (TLB)
    - Unknown/Unrecognised
    
    In the SLB, TLB and ERAT cases, we can further categorise these as
    parity errors, multihit errors or unknown/unrecognised.
    
    We can handle SLB errors by flushing and reloading the SLB. We can
    handle TLB and ERAT multihit errors by flushing the TLB. (It appears
    we may not handle TLB and ERAT parity errors: I will investigate
    further and send a followup patch if appropriate.)
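    
    As a rough illustration of the mapping described above, the sketch
    below picks a recovery action from an error source and subtype. All
    enum and function names are hypothetical and are not the kernel's own
    identifiers. Treating TLB and ERAT parity errors as unhandled is an
    assumption based on the uncertainty noted above.

```c
#include <assert.h>

/* Hypothetical names for illustration; not the kernel's own types. */
enum mce_source { MCE_SRC_UE, MCE_SRC_ERAT, MCE_SRC_SLB, MCE_SRC_TLB, MCE_SRC_UNKNOWN };
enum mce_kind   { MCE_KIND_PARITY, MCE_KIND_MULTIHIT, MCE_KIND_UNKNOWN };
enum mce_action { MCE_ACT_NONE, MCE_ACT_FLUSH_SLB, MCE_ACT_FLUSH_TLB };

/* Pick the recovery action the text describes for a given error. */
static enum mce_action mce_recovery_action(enum mce_source src, enum mce_kind kind)
{
	/* Reject subtype values outside the enum. */
	assert(kind <= MCE_KIND_UNKNOWN);

	switch (src) {
	case MCE_SRC_SLB:
		/* Any SLB error: flush and reload the SLB. */
		return MCE_ACT_FLUSH_SLB;
	case MCE_SRC_TLB:
	case MCE_SRC_ERAT:
		/* Only multihit is known to be handled; parity is assumed unhandled. */
		return kind == MCE_KIND_MULTIHIT ? MCE_ACT_FLUSH_TLB : MCE_ACT_NONE;
	default:
		/* UEs and unknown sources have no flush-based recovery. */
		return MCE_ACT_NONE;
	}
}
```

    An MCE_ACT_NONE result here corresponds to an error that the flush
    handlers cannot recover.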
    
    This leaves us with uncorrectable errors. Uncorrectable errors are
    usually the result of ECC memory detecting an error that it cannot
    correct, but they also crop up in the context of PCI cards failing
    during DMA writes, and during CAPI error events.
    
    There are several types of UE, and there are 3 places a UE can occur:
    Skiboot, the kernel, and userspace. For Skiboot errors, we have the
    facility to make some recoverable. For userspace, we can simply kill
    (SIGBUS) the affected process. We have no meaningful way to deal with
    UEs in kernel space or in unrecoverable sections of Skiboot.
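    
    The three cases above can be summarised in a small decision function.
    This is only a model of the paragraph, not kernel code: the names
    ue_origin, ue_result and ue_disposition() are invented for
    illustration, and the recoverability of a Skiboot section is passed
    in as a plain flag.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names for illustration; not the kernel's own types. */
enum ue_origin { UE_IN_SKIBOOT, UE_IN_KERNEL, UE_IN_USERSPACE };
enum ue_result { UE_RECOVERED, UE_KILL_PROCESS, UE_UNRECOVERED };

/* Decide what can be done about a UE based on where it occurred. */
static enum ue_result ue_disposition(enum ue_origin origin,
				     bool skiboot_section_recoverable)
{
	/* Reject origin values outside the enum. */
	assert(origin <= UE_IN_USERSPACE);

	switch (origin) {
	case UE_IN_SKIBOOT:
		/* Some Skiboot sections can be made recoverable. */
		return skiboot_section_recoverable ? UE_RECOVERED : UE_UNRECOVERED;
	case UE_IN_USERSPACE:
		/* Kill the affected process with SIGBUS. */
		return UE_KILL_PROCESS;
	default:
		/* Kernel space: no meaningful way to recover. */
		return UE_UNRECOVERED;
	}
}
```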
    
    Currently, these unrecovered UEs fall through to
    machine_check_exception() in traps.c, which calls die(), which OOPSes
    and sends SIGBUS to the process. This sometimes allows us to stumble
    onwards. For example, we've seen UEs kill the kernel threads eehd and
    khugepaged. However, the killed process could have held a lock, or it
    could have been a more important process: we can no longer make
    any assertions about the state of the machine. Similarly, if we see a
    UE in Skiboot (and again we've seen this happen), we're not in a
    position where we can make any assertions about the state of the
    machine.
    
    Likewise, for unknown or unrecognised errors, we're not able to say
    anything about the state of the machine.
    
    Therefore, if we have an unrecovered MCE, the most appropriate thing
    to do is to panic.
    
    The second reason is that since e784b649 ("powerpc/powernv: Invoke
    opal_cec_reboot2() on unrecoverable machine check errors."), we
    attempt a special OPAL reboot on an unhandled MCE. This is so the
    hardware can record error data for later debugging.
    
    The comments in that commit assert that we are heading down the panic
    path anyway. At the moment this is not always true. UEs in kernel
    space, for instance, are marked as recoverable by the hardware, so if
    the attempt to reboot fails (e.g. on an old Skiboot), we don't
    panic() but simply die() and OOPS. It doesn't make sense to stagger
    on if we've just tried to reboot: we should panic().
    
    Explicitly panic() on unrecovered MCEs on PowerNV.
    Update the comments appropriately.
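    
    The behavioural change can be modelled in userspace as below. This is
    a simplified sketch of the control flow described in this message,
    not the code in opal.c or traps.c: the function names, the outcome
    enum and the boolean inputs are all assumptions made for
    illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical outcome labels; not kernel identifiers. */
enum mce_outcome { MCE_CONTINUE, MCE_REBOOT, MCE_OOPS, MCE_PANIC };

/* Behaviour described for the code before this change. */
static enum mce_outcome outcome_before(bool recovered, bool hw_recoverable,
				       bool reboot_succeeds)
{
	if (recovered)
		return MCE_CONTINUE;
	if (reboot_succeeds)
		return MCE_REBOOT;
	/* Reboot failed: only panic if the hardware said unrecoverable. */
	return hw_recoverable ? MCE_OOPS : MCE_PANIC;
}

/* Behaviour after this change: an unrecovered MCE always ends in panic. */
static enum mce_outcome outcome_after(bool recovered, bool reboot_succeeds)
{
	if (recovered)
		return MCE_CONTINUE;
	if (reboot_succeeds)
		return MCE_REBOOT;
	return MCE_PANIC;
}

/* Usage: the new policy never produces an OOPS-and-continue outcome. */
static void mce_policy_selfcheck(void)
{
	for (int recovered = 0; recovered < 2; recovered++)
		for (int reboot = 0; reboot < 2; reboot++)
			assert(outcome_after(recovered, reboot) != MCE_OOPS);
}
```

    The case of interest is an unrecovered error that the hardware marks
    recoverable while the reboot attempt fails: the old model yields an
    OOPS, the new one a panic.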
    
    This fixes some hangs following EEH events on cxlflash setups.
    
    Signed-off-by: Daniel Axtens <dja@axtens.net>
    Reviewed-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com>
    Reviewed-by: Ian Munsie <imunsie@au1.ibm.com>
    Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>