accel/habanalabs: set device status 'malfunction' while in rmmod
author Koby Elbaz <kelbaz@habana.ai>
Mon, 29 May 2023 08:41:04 +0000 (11:41 +0300)
committer Oded Gabbay <ogabbay@kernel.org>
Mon, 9 Oct 2023 09:37:18 +0000 (12:37 +0300)
commit e4a97d6b62599cc6c4bd37674de378b2a86225c3
tree 04f7eb537d451918876013b96b3d7ca390378111
parent e7b2902a330eba354024581edead0002521bcedf

hl_device_status() returns the status of an acquired device.
A device that is going down (following an rmmod command) should be
marked as unusable/malfunctioning and therefore must not be acquired.

Until now this was not the case: a device going down inaccurately
returned 'in reset' status, which allowed the user to acquire it.
This introduced a bug. As part of a reset flow, the driver cannot
kill processes that have not run yet, and since those processes are
not blocked from reacquiring the device, the driver ends up repeatedly
attempting to kill all processes in a list that can never become empty.
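
The intended ordering of the status checks can be sketched roughly as
follows. This is a simplified userspace model, not the driver code: the
enum, the struct, and every field and function name below are
hypothetical and chosen only for illustration. The one point it tries to
capture is that a pending removal is checked first, so it reports
'malfunction' even while a reset is in progress.

```c
#include <stdbool.h>

/* Hypothetical, simplified status values; not the driver's actual enum. */
enum sketch_status {
	SKETCH_STATUS_OPERATIONAL,
	SKETCH_STATUS_IN_RESET,
	SKETCH_STATUS_MALFUNCTION,
};

/* Hypothetical device state; field names are assumptions for illustration. */
struct sketch_device {
	bool removal_pending;	/* set once rmmod teardown has started */
	bool in_reset;		/* set while a reset flow is running */
	bool disabled;		/* set when the device is otherwise unusable */
};

/*
 * Report the device status. Teardown takes priority over reset: a device
 * that is going away reports 'malfunction', never 'in reset'.
 */
static enum sketch_status sketch_device_status(const struct sketch_device *dev)
{
	if (dev->removal_pending)
		return SKETCH_STATUS_MALFUNCTION;
	if (dev->in_reset)
		return SKETCH_STATUS_IN_RESET;
	if (dev->disabled)
		return SKETCH_STATUS_MALFUNCTION;
	return SKETCH_STATUS_OPERATIONAL;
}
```

With this ordering, a caller that refuses to hand out a device in
'malfunction' status would also refuse one that is being removed, so the
list of processes to kill can only shrink during teardown.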

Signed-off-by: Koby Elbaz <kelbaz@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
drivers/accel/habanalabs/common/device.c