    be2net: Support UE recovery in BEx/Skyhawk adapters · 710f3e59
    Sriharsha Basavapatna authored
    This patch adds support for recovering from UEs caused by Transient Parity
    Errors (TPE) on BE2, BE3 and Skyhawk adapters. This avoids a system
    reboot when such errors occur: the driver recovers so that the adapter
    returns to the full operational state it had before the UE.
    
    The driver changes that support this are:
    
    o The driver registers its UE recoverable capability with ARM FW at init
    time. This also allows the driver to know if the feature is supported in
    the FW.
    
    o As the UE recovery requires precise, time-bound processing, the driver
    creates its own error recovery work queue with a single worker thread (per
    module, shared across functions).
    
    o Each function runs an error detection task at an interval of 1 second as
    required by the FW. The error detection logic already exists for BEx/SH,
    but it now runs in the context of a separate worker thread.
    
    o When an error is detected by the task, if it is recoverable, the PF0
    driver instance initiates a soft reset, while other PF driver instances
    wait for the reset to complete and the chip to become ready. Once
    the chip is ready, all driver instances, including PF0, resume and
    reinitialize their respective functions.
    
    o The PF0 driver checks certain recovery criteria to determine whether
    recovery can be initiated. If the criteria are not met, the PF0 driver does
    not initiate a soft reset; it retains the existing behavior of stopping
    further processing, and a reboot is required to return the chip to an
    operational state.
    
    o To allow each function to share the workq, while also making progress in
    its recovery process, a per-function recovery state machine is used.
    The per-function tasks avoid blocking operations like msleep() while in
    this state machine (until reinit state) and instead reschedule for the
    required delay.
    
    o With these changes, the existing error recovery code for Lancer also
    runs in the context of the new worker thread.
    Signed-off-by: Sriharsha Basavapatna <sriharsha.basavapatna@broadcom.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
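
The per-function recovery state machine described in the commit message can be illustrated with a small user-space sketch. This is not the driver's code: the state names, the `chip` and `func_ctx` structures, the 100 ms delays and the `simulate_recovery()` helper are all hypothetical, and the single loop only stands in for the shared worker thread. It shows each function taking one non-blocking step and then asking to be rescheduled, with PF0 issuing the one soft reset while the other functions wait for the chip to become ready.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_FUNCS 8
#define TICK_MS   100   /* assumed scheduling granularity of the simulation */

/* Hypothetical recovery states; the names are illustrative only. */
enum rec_state { REC_DETECT, REC_RESET, REC_WAIT_READY, REC_REINIT, REC_DONE };

struct chip {               /* state shared by all functions */
    bool ready;             /* set once the soft reset has completed */
    int resets;             /* number of soft resets issued */
};

struct func_ctx {
    int pf_num;                 /* PF0 is the one that resets the chip */
    enum rec_state state;
    unsigned int next_run_ms;   /* simulated time at which the next step is due */
};

struct sim_result {
    int resets;
    unsigned int finish_ms;     /* time at which the last function finished */
};

/*
 * Run one non-blocking step for one function. Returns the delay (ms) after
 * which the step should be rescheduled; the value is ignored once the
 * function reaches REC_DONE. Nothing in here sleeps.
 */
static unsigned int rec_step(struct func_ctx *f, struct chip *c)
{
    switch (f->state) {
    case REC_DETECT:            /* error confirmed: PF0 resets, others wait */
        f->state = (f->pf_num == 0) ? REC_RESET : REC_WAIT_READY;
        return 100;
    case REC_RESET:             /* stand-in for the real soft reset */
        c->resets++;
        c->ready = true;
        f->state = REC_WAIT_READY;
        return 100;
    case REC_WAIT_READY:        /* poll instead of blocking */
        if (c->ready)
            f->state = REC_REINIT;
        return 100;
    case REC_REINIT:            /* blocking work would be acceptable here */
        f->state = REC_DONE;
        return 0;
    default:
        return 0;
    }
}

/* One loop plays the role of the single worker serving every function. */
struct sim_result simulate_recovery(int nfuncs)
{
    struct func_ctx f[MAX_FUNCS];
    struct chip c = { .ready = false, .resets = 0 };
    struct sim_result r = { 0, 0 };
    unsigned int now = 0;
    int done = 0;

    assert(nfuncs >= 1 && nfuncs <= MAX_FUNCS);
    for (int i = 0; i < nfuncs; i++)
        f[i] = (struct func_ctx){ .pf_num = i, .state = REC_DETECT, .next_run_ms = 0 };

    while (done < nfuncs) {
        for (int i = 0; i < nfuncs; i++) {
            if (f[i].state == REC_DONE || f[i].next_run_ms > now)
                continue;       /* finished, or not yet due */
            unsigned int delay = rec_step(&f[i], &c);
            if (f[i].state == REC_DONE) {
                done++;
                r.finish_ms = now;
            } else {
                f[i].next_run_ms = now + delay;   /* reschedule, do not sleep */
            }
        }
        now += TICK_MS;
    }
    r.resets = c.resets;
    return r;
}
```

Calling `simulate_recovery(4)` models four functions sharing one worker: exactly one reset is issued regardless of the function count, and no function ever holds the worker while waiting. In a kernel driver the returned delay would typically be passed to a delayed-work requeue rather than to a simulated clock.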
be_main.c 159 KB