habanalabs: no consecutive err when user context is enabled
author Tal Cohen <talcohen@habana.ai>
Tue, 18 Oct 2022 14:35:06 +0000 (17:35 +0300)
committer Oded Gabbay <ogabbay@kernel.org>
Wed, 23 Nov 2022 14:13:44 +0000 (16:13 +0200)
The consecutive-error mechanism prevents a device reset loop caused by
h/w issues by moving the device into an unavailable state.
When the user may be the cause of the error, an unavailable state
would prevent the user from running their workloads.

This commit skips the consecutive-error mechanism when a user context
is enabled.
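
The behavior described above can be illustrated with a small standalone model. This is only a sketch under stated assumptions, not the driver's code: struct demo_device, the DEMO_TRIGGER_* values and demo_handle_reset_trigger() are hypothetical names, and the repeat tracking is simplified to a single flag. Only the early return on is_compute_ctx_active mirrors the change in this commit.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical trigger values, for illustration only. */
#define DEMO_TRIGGER_DEFAULT   0u
#define DEMO_TRIGGER_HEARTBEAT 1u

/* Hypothetical stand-in for the few fields this sketch needs. */
struct demo_device {
	bool is_compute_ctx_active;    /* a user compute context exists */
	uint32_t prev_reset_trigger;   /* trigger seen on the previous reset */
	bool reset_trigger_repeated;   /* same trigger seen twice in a row */
};

/*
 * Simplified model: record whether the same reset trigger repeats.
 * When a user context is active, the tracking is skipped entirely,
 * so user-caused errors never mark the trigger as repeated.
 */
static inline void demo_handle_reset_trigger(struct demo_device *dev,
					     uint32_t trigger)
{
	/* No consecutive mechanism when user context exists */
	if (dev->is_compute_ctx_active)
		return;

	if (dev->prev_reset_trigger != trigger) {
		dev->prev_reset_trigger = trigger;
		dev->reset_trigger_repeated = false;
	} else {
		dev->reset_trigger_repeated = true;
	}
}
```

In this model, two identical triggers in a row set reset_trigger_repeated only when no user context is active. With a user context active, the state is left untouched.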

Signed-off-by: Tal Cohen <talcohen@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
drivers/misc/habanalabs/common/device.c

index bcd95992497140dfedacc94d426e2a1eb411843b..61ddcb1ce50891d932f9cde155cff8e906c4a716 100644
@@ -1320,6 +1320,10 @@ static void handle_reset_trigger(struct hl_device *hdev, u32 flags)
 {
        u32 cur_reset_trigger = HL_RESET_TRIGGER_DEFAULT;
 
+       /* No consecutive mechanism when user context exists */
+       if (hdev->is_compute_ctx_active)
+               return;
+
        /*
         * 'reset cause' is being updated here, because getting here
         * means that it's the 1st time and the last time we're here