Commit af5f7eea, authored by Oded Gabbay, committed by Greg Kroah-Hartman

habanalabs: soft-reset device if context-switch fails

This patch fixes a bug in the driver: if the TPC or MME remains in a
non-IDLE state even after all command submissions are done (due to a user
bug or a malicious user), future command submissions fail in the
context-switch stage and the driver remains in a "stuck" state.

The fix is to do a soft-reset of the device in case the context-switch
fails, because the device should be IDLE during context-switch. If it is
not IDLE, then something is wrong and we should reset the compute engines.

Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
parent efaa2812
@@ -622,13 +622,15 @@ int hl_cs_ioctl(struct hl_fpriv *hpriv, void *data)
 				"Failed to switch to context %d, rejecting CS! %d\n",
 				ctx->asid, rc);
 			/*
-			 * If we timedout, we need to soft-reset because
-			 * QMAN is probably stuck. However, we can't
-			 * call to reset here directly because of
-			 * deadlock, so need to do it at the very end
-			 * of this function
+			 * If we timedout, or if the device is not IDLE
+			 * while we want to do context-switch (-EBUSY),
+			 * we need to soft-reset because QMAN is
+			 * probably stuck. However, we can't call to
+			 * reset here directly because of deadlock, so
+			 * need to do it at the very end of this
+			 * function
 			 */
-			if (rc == -ETIMEDOUT)
+			if ((rc == -ETIMEDOUT) || (rc == -EBUSY))
 				need_soft_reset = true;
 			mutex_unlock(&hpriv->restore_phase_mutex);
 			goto out;
@@ -706,7 +708,7 @@ int hl_cs_ioctl(struct hl_fpriv *hpriv, void *data)
 		args->out.seq = cs_seq;
 	}
-	if ((rc == -ETIMEDOUT) && (need_soft_reset))
+	if (((rc == -ETIMEDOUT) || (rc == -EBUSY)) && (need_soft_reset))
 		hl_device_reset(hdev, false, false);
 	return rc;
......
@@ -3138,7 +3138,7 @@ static int goya_send_job_on_qman0(struct hl_device *hdev, struct hl_cs_job *job)
 	if (!hdev->asic_funcs->is_device_idle(hdev)) {
 		dev_err_ratelimited(hdev->dev,
 			"Can't send KMD job on QMAN0 if device is not idle\n");
-		return -EFAULT;
+		return -EBUSY;
 	}
 	fence_ptr = hdev->asic_funcs->dma_pool_zalloc(hdev, 4, GFP_KERNEL,
......
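
The control flow this patch sets up can be illustrated with a small user-space sketch. It is not driver code: `struct sim_device` and the `sim_*` functions are hypothetical stand-ins for `struct hl_device`, `goya_send_job_on_qman0()`, `hl_device_reset()` and the error path of `hl_cs_ioctl()`. The mutex is modelled as a plain flag, and the sketch assumes that a soft-reset returns the engines to IDLE. It shows two points from the diff: the non-idle case is reported as -EBUSY, and the reset is only recorded while the lock is held and performed after the lock is released.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for the device state; not the real struct hl_device. */
struct sim_device {
	bool engines_idle;      /* false models a TPC/MME stuck in non-IDLE */
	bool restore_lock_held; /* models hpriv->restore_phase_mutex as a flag */
	int reset_count;        /* number of soft-resets performed so far */
};

/* Models goya_send_job_on_qman0(): refuse to run unless the device is idle. */
static int sim_send_job_on_qman0(const struct sim_device *dev)
{
	if (!dev->engines_idle)
		return -EBUSY;
	return 0;
}

/*
 * Models hl_device_reset(). The assert encodes the constraint from the
 * patched comment: the reset must not run while the restore lock is held.
 * Assumption: a soft-reset brings the engines back to IDLE.
 */
static void sim_device_reset(struct sim_device *dev)
{
	assert(!dev->restore_lock_held);
	dev->engines_idle = true;
	dev->reset_count++;
}

/* Models the context-switch error path of hl_cs_ioctl() after this patch. */
static int sim_cs_ioctl(struct sim_device *dev)
{
	bool need_soft_reset = false;
	int rc;

	dev->restore_lock_held = true;
	rc = sim_send_job_on_qman0(dev); /* the context-switch step */
	if (rc == -ETIMEDOUT || rc == -EBUSY)
		need_soft_reset = true;  /* only record the decision here */
	dev->restore_lock_held = false;

	/* The lock has been released, so the reset can run now. */
	if (need_soft_reset)
		sim_device_reset(dev);

	return rc;
}
```

In this model, a first call on a non-idle device returns -EBUSY and triggers exactly one reset, and the next call then succeeds. Before the patch the equivalent path returned -EFAULT, which the reset condition did not match, so every later submission would have failed the same way.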