    scsi: zfcp: Fix failed recovery on gone remote port with non-NPIV FCP devices · 8c9db667
    Steffen Maier authored
    Suppose we have an environment with a number of non-NPIV FCP devices
    (virtual HBAs / FCP devices / zfcp "adapter"s) sharing the same physical
    FCP channel (HBA port) and its I_T nexus, plus a number of storage target
    ports zoned to that shared channel. Now one target port logs out of the
    fabric, causing an RSCN. Zfcp reacts with an ADISC ELS and subsequent port
    recovery depending on the ADISC result. This happens concurrently on all
    such FCP devices (in different Linux images), as they all receive a copy
    of this RSCN. In the following we look at one of those FCP devices.
    
    Requests other than FSF_QTCB_FCP_CMND can be slow until they get a
    response.
    
    Depending on which requests are affected by slow responses, there are
    different recovery outcomes. Here we want to fix failed recoveries at the
    port or adapter level by avoiding recovery requests that can be slow.
    
    We need the cached N_Port_ID for the remote port "link" test with ADISC.
    Just before sending the ADISC, we now intentionally forget the old cached
    N_Port_ID. The idea is that, on receiving an RSCN for a port, we have to
    assume that any cached information about this port is stale. This forces a
    fresh GID_PN [FC-GS] nameserver lookup on any subsequent recovery for the
    same port. Since we can typically still communicate with the nameserver
    efficiently, we now reach steady state more quickly: either the nameserver
    still does not know about the port, so we stop recovery, or it already
    knows the port, potentially with a new N_Port_ID, and we can successfully
    and quickly perform open port recovery. For the one case where ADISC
    returns successfully, we re-initialize port->d_id, because that case does
    not involve any port recovery.
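
    The following stripped-down userspace sketch illustrates this idea. It is
    not the driver code: the struct, the send_adisc()/gid_pn_lookup() helpers,
    and all values are hypothetical stand-ins for the real zfcp structures and
    the ELS/CT requests in zfcp_fc.c.

    #include <stdint.h>
    #include <stdio.h>

    struct port {
        uint32_t d_id;    /* cached N_Port_ID; 0 means "unknown" */
        uint64_t wwpn;
    };

    /* hypothetical: ADISC to the old N_Port_ID; returns confirmed ID or 0 */
    static uint32_t send_adisc(uint32_t old_d_id)
    {
        return old_d_id;    /* stub: pretend the port answered */
    }

    /* hypothetical: fresh GID_PN nameserver lookup by WWPN; 0 if gone */
    static uint32_t gid_pn_lookup(uint64_t wwpn)
    {
        (void)wwpn;
        return 0x010203;    /* stub: nameserver knows a (new) N_Port_ID */
    }

    static void link_test_on_rscn(struct port *port)
    {
        uint32_t old_d_id = port->d_id;

        /* The RSCN means any cached info may be stale: forget the old
         * N_Port_ID before sending ADISC, so any subsequent recovery
         * must start with a fresh nameserver lookup. */
        port->d_id = 0;

        uint32_t confirmed = send_adisc(old_d_id);
        if (confirmed) {
            /* ADISC success skips port recovery entirely, so this is
             * the one place to re-initialize d_id from the response. */
            port->d_id = confirmed;
            return;
        }

        /* ADISC failed: recovery does GID_PN and either stops (port
         * gone) or reopens the port under its possibly new N_Port_ID. */
        port->d_id = gid_pn_lookup(port->wwpn);
    }

    int main(void)
    {
        struct port p = { .d_id = 0x010200, .wwpn = 0x500507680b214968ULL };

        link_test_on_rscn(&p);
        printf("d_id after link test: 0x%06x\n", (unsigned int)p.d_id);
        return 0;
    }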
    
    This also solves a problem when the storage WWPN quickly logs into the
    fabric again but with a different N_Port_ID, such as on virtual WWPN
    takeover during target NPIV failover
    [https://www.redbooks.ibm.com/abstracts/redp5477.html]. In that case the
    RSCN from the storage FDISC was ignored by zfcp and we could not
    successfully recover the failover. On some later failback on the storage,
    we could have been lucky if the virtual WWPN got the same old N_Port_ID
    from the SAN switch that we still had cached; then the related RSCN
    triggered a successful port reopen recovery. However, there is no
    guarantee to get the same N_Port_ID on an NPIV FDISC.
    
    Even though NPIV-enabled FCP devices are not affected by this problem, this
    code change optimizes recovery time for gone remote ports as a side effect.
    The timely drop of cached N_Port_IDs prevents unnecessary slow open port
    attempts.
    
    While the problem might have been in code before v2.6.32 commit
    799b76d0 ("[SCSI] zfcp: Decouple gid_pn requests from erp"), this fix
    depends on the gid_pn_work introduced with that commit, so we mark it as
    the culprit to satisfy fix dependencies.
    
    Note: A point-to-point remote port is already handled separately and gets
    its N_Port_ID from the cached peer_d_id, so resetting port->d_id in
    general does not affect PtP; see the sketch below.
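
    A minimal sketch of why PtP is unaffected, under a simplified model: the
    topology enum, field names, and gid_pn() helper here are hypothetical
    approximations of the real zfcp adapter/port structures, not the actual
    recovery code.

    #include <stdint.h>

    enum topology { TOPO_FABRIC, TOPO_PTP };

    struct adapter {
        enum topology topo;
        uint32_t peer_d_id;    /* cached at PtP login, not from a nameserver */
    };

    struct port {
        uint32_t d_id;
        uint64_t wwpn;
    };

    /* hypothetical stand-in for the GID_PN CT nameserver request */
    static uint32_t gid_pn(uint64_t wwpn)
    {
        (void)wwpn;
        return 0x010203;    /* stub */
    }

    /* On open-port recovery, PtP restores d_id from the cached peer_d_id,
     * so resetting port->d_id never leaves a PtP port without an address. */
    static void open_port(struct adapter *adapter, struct port *port)
    {
        if (adapter->topo == TOPO_PTP)
            port->d_id = adapter->peer_d_id;    /* no fabric lookup needed */
        else
            port->d_id = gid_pn(port->wwpn);    /* fabric: fresh lookup */
    }

    int main(void)
    {
        struct adapter a = { .topo = TOPO_PTP, .peer_d_id = 0x010100 };
        struct port p = { .d_id = 0, .wwpn = 0x5005076801234567ULL };

        open_port(&a, &p);    /* restores d_id from peer_d_id despite reset */
        return p.d_id == a.peer_d_id ? 0 : 1;
    }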
    
    Link: https://lore.kernel.org/r/20220118165803.3667947-1-maier@linux.ibm.com
    Fixes: 799b76d0 ("[SCSI] zfcp: Decouple gid_pn requests from erp")
    Cc: <stable@vger.kernel.org> #2.6.32+
    Suggested-by: Benjamin Block <bblock@linux.ibm.com>
    Reviewed-by: Benjamin Block <bblock@linux.ibm.com>
    Signed-off-by: Steffen Maier <maier@linux.ibm.com>
    Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>