Commit e4a97d6b authored by Koby Elbaz's avatar Koby Elbaz Committed by Oded Gabbay

accel/habanalabs: set device status 'malfunction' while in rmmod

hl_device_status() returns the status of an acquired device.
If a device is going down (following an rmmod cmd),
it should be marked as an unusable/malfunctioning device, and
hence should not be acquired.
However, since this was not the case so far (i.e., a device going
down would inaccurately return 'in reset' status allowing the user
to acquire the device) it introduced a bug where as part of a reset
flow, the driver could not kill processes that have not run yet, and
since those processes aren't blocked from reacquiring a device,
we get eventually a new flow of a driver attempting to kill all
processes in a list that can't be ever really empty.
Signed-off-by: default avatarKoby Elbaz <kelbaz@habana.ai>
Reviewed-by: default avatarOded Gabbay <ogabbay@kernel.org>
Signed-off-by: default avatarOded Gabbay <ogabbay@kernel.org>
parent e7b2902a
......@@ -315,7 +315,9 @@ enum hl_device_status hl_device_status(struct hl_device *hdev)
{
enum hl_device_status status;
if (hdev->reset_info.in_reset) {
if (hdev->device_fini_pending) {
status = HL_DEVICE_STATUS_MALFUNCTION;
} else if (hdev->reset_info.in_reset) {
if (hdev->reset_info.in_compute_reset)
status = HL_DEVICE_STATUS_IN_RESET_AFTER_DEVICE_RELEASE;
else
......@@ -343,9 +345,9 @@ bool hl_device_operational(struct hl_device *hdev,
*status = current_status;
switch (current_status) {
case HL_DEVICE_STATUS_MALFUNCTION:
case HL_DEVICE_STATUS_IN_RESET:
case HL_DEVICE_STATUS_IN_RESET_AFTER_DEVICE_RELEASE:
case HL_DEVICE_STATUS_MALFUNCTION:
case HL_DEVICE_STATUS_NEEDS_RESET:
return false;
case HL_DEVICE_STATUS_OPERATIONAL:
......
Markdown is supported
0%
or
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment