From: Koby Elbaz Date: Wed, 11 Jan 2023 13:43:21 +0000 (+0200) Subject: habanalabs: block soft-reset on an unusable device X-Git-Url: http://git.maquefel.me/?a=commitdiff_plain;h=a6685b573c8e0e0ec2149e48038b26d6676274f1;p=linux.git habanalabs: block soft-reset on an unusable device A device with status malfunction indicates that it can't be used. In such a case we do not support certain reset types, e.g., all kinds of soft-resets (compute reset, inference soft-reset), and reset upon device release. A hard-reset is the only way that an unusable device can change its status. All other reset procedures can't put the device in a reset procedure, which might ultimately cause the device to change its status, unintentionally, to become operational again. Such a scenario has recently occurred, when a user requested a hard-reset while another heavy user workload was ongoing (reset request is queued). Since the workload couldn't finish within reset's timeout limits, the reset has failed and set a device status malfunction. Eventually, when the user released the FD, an unsuccessful soft-reset occurred, hence followed by an additional hard-reset that changed the ASICs status back to be operational. Signed-off-by: Koby Elbaz Reviewed-by: Oded Gabbay Signed-off-by: Oded Gabbay --- diff --git a/drivers/accel/habanalabs/common/device.c b/drivers/accel/habanalabs/common/device.c index 2b6971463f125..9a9c494b08a40 100644 --- a/drivers/accel/habanalabs/common/device.c +++ b/drivers/accel/habanalabs/common/device.c @@ -1425,8 +1425,8 @@ static void handle_reset_trigger(struct hl_device *hdev, u32 flags) int hl_device_reset(struct hl_device *hdev, u32 flags) { bool hard_reset, from_hard_reset_thread, fw_reset, hard_instead_soft = false, - reset_upon_device_release = false, schedule_hard_reset = false, delay_reset, - from_dev_release, from_watchdog_thread; + reset_upon_device_release = false, schedule_hard_reset = false, + delay_reset, from_dev_release, from_watchdog_thread; u64 idle_mask[HL_BUSY_ENGINES_MASK_EXT_SIZE] = {0}; struct hl_ctx *ctx; int i, rc; @@ -1443,12 +1443,17 @@ int hl_device_reset(struct hl_device *hdev, u32 flags) delay_reset = !!(flags & HL_DRV_RESET_DELAY); from_watchdog_thread = !!(flags & HL_DRV_RESET_FROM_WD_THR); + if (!hard_reset && (hl_device_status(hdev) == HL_DEVICE_STATUS_MALFUNCTION)) { + dev_dbg(hdev->dev, "soft-reset isn't supported on a malfunctioning device\n"); + return 0; + } + if (!hard_reset && !hdev->asic_prop.supports_compute_reset) { hard_instead_soft = true; hard_reset = true; } - if (hdev->reset_upon_device_release && (flags & HL_DRV_RESET_DEV_RELEASE)) { + if (hdev->reset_upon_device_release && from_dev_release) { if (hard_reset) { dev_crit(hdev->dev, "Aborting reset because hard-reset is mutually exclusive with reset-on-device-release\n");