    RDMA/cxgb4: EEH errors can hang the driver · 2f25e9a5
    Steve Wise authored
    A few more EEH fixes:
    
    c4iw_wait_for_reply(): detect fatal EEH condition on timeout and
    return an error.
    
    The iw_cxgb4 driver was only calling ib_deregister_device() on an EEH
    event, followed by an ib_register_device() when the device was
    reinitialized.  However, the RDMA core doesn't allow multiple
    iterations of register/deregister by the provider. See
    drivers/infiniband/core/sysfs.c: ib_device_unregister_sysfs() where
    the kobject ref is held until the device is deallocated in
    ib_dealloc_device().  Calling deregister adds this kobj reference,
    and then a subsequent register call will generate a WARN_ON() from the
    kobject subsystem because the kobject is being initialized but is
    already initialized with the ref held.
    
    So the provider must deregister and dealloc when resetting for an EEH
    event, then alloc/register to re-initialize.  To do this, we cannot
    use the device ptr as our ULD handle since it will change with each
    reallocation.  This commit adds a ULD context struct which is used as
    the ULD handle and which holds the device pointer and other state
    needed.
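
The handle indirection described above can be sketched in user-space C. This is a minimal illustration under stated assumptions, not the structure added by the commit: `demo_dev`, `demo_uld_ctx`, `demo_bring_up`, `demo_tear_down`, and the `resets` counter are hypothetical names invented for the example.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for the device; freed and reallocated on each reset. */
struct demo_dev {
    int registered;
};

/* Stable handle given to the lower-level driver; outlives any single demo_dev. */
struct demo_uld_ctx {
    struct demo_dev *dev;  /* NULL while the device is torn down */
    int resets;            /* example of other state kept across resets */
};

/* Allocate and register a fresh device behind the stable handle. */
static int demo_bring_up(struct demo_uld_ctx *ctx)
{
    assert(ctx != NULL && ctx->dev == NULL);

    ctx->dev = calloc(1, sizeof(*ctx->dev));
    if (ctx->dev == NULL)
        return -1;
    ctx->dev->registered = 1;
    return 0;
}

/* Deregister and free the device; the handle itself stays valid. */
static void demo_tear_down(struct demo_uld_ctx *ctx)
{
    assert(ctx != NULL);

    if (ctx->dev == NULL)
        return;
    ctx->dev->registered = 0;
    free(ctx->dev);
    ctx->dev = NULL;
    ctx->resets++;
}
```

Because callers hold only the context pointer, the device behind it can be deallocated and reallocated on every reset without invalidating the handle.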
    Signed-off-by: Steve Wise <swise@opengridcomputing.com>
    Signed-off-by: Roland Dreier <roland@purestorage.com>
iw_cxgb4.h 19.9 KB