1. 21 Aug, 2020 36 commits
  2. 18 Aug, 2020 4 commits
      scsi: ufs: Fix a race condition between error handler and runtime PM ops · 5586dd8e
      Can Guo authored
      The current IRQ handler blocks SCSI requests before scheduling eh_work.
      When the error handler then calls pm_runtime_get_sync(), and
      ufshcd_suspend/resume sends a SCSI cmd (most likely the SSU cmd),
      pm_runtime_get_sync() never returns: SCSI requests are blocked, so
      ufshcd_suspend/resume stays stuck waiting on that SCSI cmd.
      
       - In queuecommand path, hba->ufshcd_state check and ufshcd_send_command
         should stay under the same spin lock. This is to make sure that no more
         commands leak into doorbell after hba->ufshcd_state is changed.
      
       - Don't block SCSI requests before error handler starts to run, let error
         handler block SCSI requests when it is ready to start error recovery.
      
       - Don't let SCSI layer keep requeuing the SCSI cmds sent from HBA runtime
         PM ops, let them pass or fail them. Let them pass if eh_work is
         scheduled due to non-fatal errors. Fail them if eh_work is scheduled due
         to fatal errors, otherwise the cmds may eventually time out since UFS is
         in a bad state, which would block the error handler for too long. If we
         fail the SCSI cmds sent from HBA runtime PM ops, the HBA runtime PM ops
         fail too, but this does no harm since the error handler can recover
         from HBA runtime PM errors.
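
The third bullet above can be read as a small decision table. The following is a minimal userspace sketch of that table, not the driver's code: the enum names, `decide_dispatch()` and the `from_pm_op` flag are hypothetical, and the handling of ordinary (non-PM) commands is an assumption that the commit message only implies.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical host states for this sketch; not the driver's definitions. */
enum host_state {
    HOST_OPERATIONAL,
    HOST_EH_SCHEDULED_NON_FATAL,
    HOST_EH_SCHEDULED_FATAL,
};

/* What the queuecommand path does with an incoming command. */
enum dispatch {
    DISPATCH_SEND,     /* let the command through to the doorbell */
    DISPATCH_REQUEUE,  /* ask the SCSI layer to retry later */
    DISPATCH_FAIL,     /* complete the command with an error */
};

/*
 * Decide what to do with a command. In the real driver this check and the
 * actual send are described as sitting under the same spin lock; locking is
 * left out here because the sketch is single-threaded.
 */
enum dispatch decide_dispatch(enum host_state state, bool from_pm_op)
{
    switch (state) {
    case HOST_OPERATIONAL:
    case HOST_EH_SCHEDULED_NON_FATAL:
        /* Non-fatal errors: commands, including PM ones, may pass. */
        return DISPATCH_SEND;
    case HOST_EH_SCHEDULED_FATAL:
        /*
         * Fatal errors: a PM command is failed instead of requeued, so
         * the PM op returns and cannot stall the error handler.
         * Assumption: other commands are simply requeued.
         */
        return from_pm_op ? DISPATCH_FAIL : DISPATCH_REQUEUE;
    }
    return DISPATCH_REQUEUE;
}

/* Small usage example. */
void demo_dispatch(void)
{
    assert(decide_dispatch(HOST_EH_SCHEDULED_FATAL, true) == DISPATCH_FAIL);
    assert(decide_dispatch(HOST_EH_SCHEDULED_NON_FATAL, true) == DISPATCH_SEND);
}
```

Failing the PM command is the key design choice: it trades a failed runtime PM op, which the error handler can repair later, for a guarantee that pm_runtime_get_sync() returns.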
      
      Link: https://lore.kernel.org/r/1596975355-39813-9-git-send-email-cang@codeaurora.org
      Reviewed-by: Bean Huo <beanhuo@micron.com>
      Reviewed-by: Asutosh Das <asutoshd@codeaurora.org>
      Signed-off-by: Can Guo <cang@codeaurora.org>
      Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
      scsi: ufs: Move dumps in IRQ handler to error handler · c3be8d1e
      Can Guo authored
      Performing dumps in the IRQ handler causes system stability issues. Move
      dumps to the error handler and only print basic host registers here.
      
      Link: https://lore.kernel.org/r/1596975355-39813-8-git-send-email-cang@codeaurora.org
      Reviewed-by: Bean Huo <beanhuo@micron.com>
      Reviewed-by: Asutosh Das <asutoshd@codeaurora.org>
      Signed-off-by: Can Guo <cang@codeaurora.org>
      Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
      scsi: ufs: Recover HBA runtime PM error in error handler · c72e79c0
      Can Guo authored
      The current error handler cannot recover from an HBA runtime PM error if
      ufshcd_suspend/resume has failed due to UFS errors, e.g. a hibern8
      enter/exit error or an SSU cmd error. When this happens, the error handler
      may fail to perform a full reset and restore because it always assumes
      that power, IRQs and clocks are ready after pm_runtime_get_sync() returns,
      but actually they are not if ufshcd_resume fails[1].
      
      If ufshcd_suspend/resume fails due to UFS errors, the runtime PM framework
      saves the error value to dev.power.runtime_error. After that, HBA dev
      runtime suspend/resume is not invoked anymore unless runtime_error is
      cleared[2].
      
      If ufshcd_suspend/resume fails due to UFS errors, then for scenario [1]
      the error handler cannot assume anything about the result of
      pm_runtime_get_sync(), meaning it should explicitly turn ON power, IRQs
      and clocks again. To get HBA runtime PM working as expected for scenario
      [2], the error handler can clear runtime_error by calling
      pm_runtime_set_active() if the full reset and restore succeeds. More
      importantly, if pm_runtime_set_active() returns no error, which means
      runtime_error has been cleared, we also need to resume the SCSI devices
      under the HBA in case any of them failed to resume due to the HBA runtime
      resume failure. This unblocks blk_queue_enter in case there are bios
      waiting inside it.
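
The sticky-error behaviour described in [2] can be illustrated with a toy userspace model. This is only a sketch: `struct toy_dev`, `toy_runtime_resume()` and `toy_set_active()` are hypothetical stand-ins and do not call the kernel's runtime PM API, and the `-EINVAL` returned while the error is latched is an assumption of the model.

```c
#include <assert.h>
#include <errno.h>

/* Toy device; runtime_error is modeled on dev.power.runtime_error. */
struct toy_dev {
    int runtime_error;  /* sticky error saved after a failed callback */
    int resume_calls;   /* how many times the resume callback really ran */
};

/*
 * Model of a runtime resume request. callback_result stands in for what
 * the driver's resume callback would return (0 or a negative errno).
 */
int toy_runtime_resume(struct toy_dev *dev, int callback_result)
{
    if (dev->runtime_error)
        return -EINVAL;  /* error latched: callback is not invoked */

    dev->resume_calls++;
    if (callback_result < 0)
        dev->runtime_error = callback_result;  /* latch the failure */
    return callback_result;
}

/* Model of clearing the latched error once reset and restore succeeded. */
int toy_set_active(struct toy_dev *dev)
{
    dev->runtime_error = 0;
    return 0;
}

/* Small usage example. */
void demo_runtime_pm(void)
{
    struct toy_dev dev = { 0, 0 };

    assert(toy_runtime_resume(&dev, -EIO) == -EIO);  /* resume fails */
    assert(toy_runtime_resume(&dev, 0) == -EINVAL);  /* now stuck */
    assert(toy_set_active(&dev) == 0);               /* EH clears error */
    assert(toy_runtime_resume(&dev, 0) == 0);        /* works again */
}
```

In the model, as in the commit's description, nothing short of explicitly clearing the error lets the resume callback run again, which is why the error handler takes on that step after a successful reset.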
      
      Link: https://lore.kernel.org/r/1596975355-39813-7-git-send-email-cang@codeaurora.org
      Reviewed-by: Bean Huo <beanhuo@micron.com>
      Reviewed-by: Asutosh Das <asutoshd@codeaurora.org>
      Signed-off-by: Can Guo <cang@codeaurora.org>
      Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
      scsi: ufs: Fix concurrency of error handler and other error recovery paths · 4db7a236
      Can Guo authored
      Error recovery can be invoked from multiple code paths, including hibern8
      enter/exit (from ufshcd_link_recovery), ufshcd_eh_host_reset_handler() and
      eh_work scheduled from IRQ context. Ultimately, these paths are all trying
      to invoke ufshcd_reset_and_restore() in either a synchronous or
      asynchronous manner. This causes problems:
      
       - If link recovery happens during ungate work, ufshcd_hold() would be
         called recursively. Although commit 53c12d0e ("scsi: ufs: fix error
         recovery after the hibern8 exit failure") fixed a deadlock due to
         recursive calls of ufshcd_hold() by adding a check of eh_in_progress
         into ufshcd_hold, this check allows eh_work to run in parallel while
         link recovery is running.
      
       - Similar concurrency can also happen when error recovery is invoked from
         ufshcd_eh_host_reset_handler and ufshcd_link_recovery.
      
       - Concurrency can even happen between eh_works. eh_work, currently queued
         on system_wq, is allowed to have multiple instances running in parallel,
         but we don't have proper protection for that.
      
      If any of the above concurrency scenarios happens, error recovery fails
      and leaves the UFS device and host in a bad state. To fix the concurrency problem,
      this change queues eh_work on a single threaded workqueue and removes link
      recovery calls from the hibern8 enter/exit path. In addition, make use of
      eh_work in eh_host_reset_handler instead of calling
      ufshcd_reset_and_restore. This unifies the UFS error recovery mechanism.
      
      According to the UFSHCI JEDEC spec, hibern8 enter/exit error occurs when
      the link is broken. This essentially applies to any power mode change
      operations (since they all use PACP_PWR cmds in UniPro layer). So, if a
      power mode change operation (including AH8 enter/exit) fails, mark link
      state as UIC_LINK_BROKEN_STATE and schedule the eh_work.  In this case,
      error handler needs to do a full reset and restore to recover the link back
      to active. Before the link state is recovered to active,
      ufshcd_uic_pwr_ctrl simply returns -ENOLINK to avoid more errors.
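
The broken-link gating in the last paragraph can be sketched as a toy userspace model. The names `struct toy_hba`, `toy_pwr_mode_change()` and `toy_error_handler()` are hypothetical, the `-EIO` used for the initial failure is an assumption, and the model assumes the full reset and restore always succeeds.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical link states for this sketch. */
enum toy_link_state { TOY_LINK_ACTIVE, TOY_LINK_BROKEN };

struct toy_hba {
    enum toy_link_state link;
    bool eh_scheduled;     /* stands in for eh_work being queued */
    int uic_cmds_issued;   /* power mode change cmds actually sent */
};

/*
 * Model of a power mode change request. hw_fails says whether the
 * (simulated) hardware operation fails.
 */
int toy_pwr_mode_change(struct toy_hba *hba, bool hw_fails)
{
    /* While the link is broken, refuse early to avoid more errors. */
    if (hba->link == TOY_LINK_BROKEN)
        return -ENOLINK;

    hba->uic_cmds_issued++;
    if (hw_fails) {
        /* Mark the link broken and hand recovery to the error handler. */
        hba->link = TOY_LINK_BROKEN;
        hba->eh_scheduled = true;
        return -EIO;  /* assumed error code for the failed operation */
    }
    return 0;
}

/* Model of the error handler; reset and restore is assumed to succeed. */
void toy_error_handler(struct toy_hba *hba)
{
    hba->link = TOY_LINK_ACTIVE;
    hba->eh_scheduled = false;
}

/* Small usage example. */
void demo_link_recovery(void)
{
    struct toy_hba hba = { TOY_LINK_ACTIVE, false, 0 };

    assert(toy_pwr_mode_change(&hba, true) == -EIO);
    assert(toy_pwr_mode_change(&hba, false) == -ENOLINK);
    toy_error_handler(&hba);
    assert(toy_pwr_mode_change(&hba, false) == 0);
}
```

Because only the error handler moves the link back to active, every recovery goes through one place, which matches the commit's stated goal of a unified recovery mechanism.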
      
      Link: https://lore.kernel.org/r/1596975355-39813-6-git-send-email-cang@codeaurora.org
      Reviewed-by: Bean Huo <beanhuo@micron.com>
      Reviewed-by: Asutosh Das <asutoshd@codeaurora.org>
      Signed-off-by: Can Guo <cang@codeaurora.org>
      Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>