    firewire: sbp2: fix stall with "Unsolicited response" · a481e97d
    Stefan Richter authored
    Fix I/O stalls with some 4-bay RAID enclosures which are based on
    OXUF936QSE:
      - Onnto dataTale RSM4QO, old firmware (not anymore with current
        firmware),
      - inXtron Hydra Super-S LCM, old as well as current firmware
    when used in RAID-5 mode, perhaps also in other RAID modes.
    
    The stalls happen during heavy or moderate disk traffic in periods that
    are a multiple of 5 minutes, roughly twice per hour.  They are caused
    by the target responding too late to an ORB_Pointer register write:
    The target responds after Split_Timeout, hence firewire-core cancels
    the transaction, and firewire-sbp2 fails the SCSI request.  The SCSI
    core retries the request, which fails again (and again), hence the SCSI
    core calls firewire-sbp2's abort handler (and even the Management_Agent
    register write in the abort handler has the transaction timeout
    problem).
    
    During all that, the process which issued the I/O is stalled in I/O
    wait state.
    
    Meanwhile, the target actually acts on the first failed SCSI request:
    It responds to the ORB_Pointer write later (seen in the kernel log as
    "firewire_core: Unsolicited response") and also finishes the SCSI
    request with proper status (seen in the kernel log as "firewire_sbp2:
    status write for unknown orb").
    
    So let's just ignore RCODE_CANCELLED in the transaction callback and
    wait for the target to complete the ORB nevertheless.  This requires
    a small modification in sbp2_cancel_orbs(); it now needs to call
    orb->callback() regardless of whether fw_cancel_transaction() found the
    transaction unfinished or finished.
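    
    The intended behaviour can be sketched as a small user-space model.
    This is not the patch itself: struct orb, orb_callback(),
    status_write() and cancel_orb() are hypothetical, heavily reduced
    stand-ins, and the rcode values are placeholders rather than the
    kernel's definitions.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for firewire-core response codes. */
enum rcode { RCODE_COMPLETE, RCODE_SEND_ERROR, RCODE_CANCELLED };

/* Hypothetical, heavily reduced model of an ORB. */
struct orb {
	bool done;           /* the callback has run, the ORB is finished */
	bool has_status;     /* finished with a status block from the target */
	int callback_calls;  /* how often the callback was invoked */
	void (*callback)(struct orb *orb, bool has_status);
};

static void orb_callback(struct orb *orb, bool has_status)
{
	orb->done = true;
	orb->has_status = has_status;
	orb->callback_calls++;
}

/* Completion of the ORB_Pointer write transaction. */
static void complete_transaction(struct orb *orb, enum rcode rcode)
{
	if (orb->done)
		return;
	/*
	 * Split timeout: the target may still respond late and complete
	 * the ORB, so do not fail the request here.
	 */
	if (rcode == RCODE_CANCELLED)
		return;
	/* Any other error is a real failure of the write. */
	if (rcode != RCODE_COMPLETE)
		orb->callback(orb, false);
}

/* The target wrote a status block for this ORB. */
static void status_write(struct orb *orb)
{
	if (!orb->done)
		orb->callback(orb, true);
}

/*
 * Cancel path: since a cancelled transaction no longer finishes the
 * ORB, the callback must run here whether or not the transaction was
 * still pending.
 */
static void cancel_orb(struct orb *orb, bool transaction_was_pending)
{
	(void)transaction_was_pending;
	if (!orb->done)
		orb->callback(orb, false);
}
```

    In this model, a timed-out write followed by a late status write
    finishes the ORB exactly once and with status, while a timed-out
    write followed by cancel_orb() finishes it without status.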
    
    A different solution is to increase Split_Timeout on the local node.
    (Tested: 2000ms timeout; maybe 1000ms or something like that works too.
    200ms is insufficient.  Standard is 100ms.)  However, I would rather not do
    this because any software on any node could change the Split_Timeout to
    something unsuitable.  Or such a large Split_Timeout may be undesirable
    for other purposes.
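    
    For reference, the timeouts mentioned above map to register values
    roughly as follows.  This is a sketch under the assumption that
    SPLIT_TIMEOUT_HI holds whole seconds and the upper 13 bits of
    SPLIT_TIMEOUT_LO hold the fraction in units of 1/8000 s; the struct
    and function names are hypothetical.

```c
#include <stdint.h>

/* Hypothetical pair of register values for SPLIT_TIMEOUT_HI/LO. */
struct split_timeout {
	uint32_t hi;  /* whole seconds */
	uint32_t lo;  /* fraction in 1/8000 s units, in bits 31..19 */
};

/* Convert a timeout in milliseconds to the assumed register encoding. */
static struct split_timeout split_timeout_from_ms(uint32_t ms)
{
	uint32_t cycles = ms * 8;  /* 8000 cycles per second */
	struct split_timeout t;

	t.hi = cycles / 8000;
	t.lo = (cycles % 8000) << 19;
	return t;
}
```

    Under that assumption, the standard 100ms corresponds to hi = 0 and
    800 cycles in the upper bits of lo, and the tested 2000ms to hi = 2
    with lo = 0.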
    
    Signed-off-by: Stefan Richter <stefanr@s5r6.in-berlin.de>