Commits · 4edb4ffe39c9bdaec50186d0ca583a7ff01143de · Kirill Smelkov / linux

28 Feb, 2022 1 commit

habanalabs/gaudi: disable CGM permanently · 4edb4ffe

Oded Gabbay authored 3 years ago

Due to the need of SynapseAI to configure all TPC engines from a single
QMAN, the driver must disable CGM and never allow the user to enable
it. Otherwise, the configuration of the TPC engines will fail.
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>

26 Dec, 2021 8 commits

habanalabs: refactor reset information variables · eb135291

Ofir Bitton authored 3 years ago

Unify variables related to device reset, which will help us to
add some new reset functionality in future patches.
Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>

habanalabs: keep control device alive during hard reset · 707c1252

Dani Liberman authored 3 years ago

The user needs to be able to retrieve data during reset and afterwards
without reopening the device.
This is done by separating the user processes list into two lists:
1. fpriv_list, which contains the list of user processes that opened
   the device (currently only one).
2. fpriv_ctrl_list, which contains the list of user processes that
   opened the control device. The processes in this list shall not be
   killed during reset, only when the device is suddenly removed from
   the PCI chain.
Signed-off-by: Dani Liberman <dliberman@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>

habanalabs: remove in_debug check in device open · 7363805b

Oded Gabbay authored 3 years ago

The driver supports only a single user anyway, so there is no point
in checking whether we are in_debug state when a user tries to open
the device, because if we are in_debug, it means a user is already
using the device.
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>

habanalabs: remove compute context pointer · 5b90e59d

Oded Gabbay authored 3 years ago

It was an error to save the compute context's pointer in the device
structure, as it allowed its use without proper ref-cnt.

Change the variable to a flag that only indicates whether there is
an active compute context. Code that needs the pointer will now
be forced to use proper internal APIs to get the pointer.
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>

habanalabs: Move frequency change thread to goya_late_init · d8eb50f3

Rajaravi Krishna Katta authored 3 years ago

Changing the frequency automatically is only done in Goya. In future
ASICs this is done inside the firmware. Therefore, move the common code
into the Goya specific files.

Main changes as part of the commit are:
    1. The thread for setting frequency is moved from device_late_init
       to goya_late_init
    2. hl_device_set_frequency is removed from hl_device_open as it is
       not relevant for other ASICs, and for Goya it is taken care of
       by the thread
    3. hl_device_set_frequency is renamed as goya_set_frequency
Signed-off-by: Rajaravi Krishna Katta <rkatta@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>

habanalabs: add support for fetching historic errors · 3e55b5db

Dani Liberman authored 3 years ago

A new uAPI is added so that user space can retrieve, for debugging,
error-related data from the previous session (before device reset was
performed).

Information is filled in when a razwi or CS timeout happens and can
contain one of the following:

1. Timestamp of the last time the device was opened and a razwi or
   CS timeout happened.
2. Information about the last CS timeout.
3. Information about the last razwi error.

This information doesn't contain user data, so no danger of data
leakage between users.
Signed-off-by: Dani Liberman <dliberman@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>

habanalabs: make hdev creation code more readable · e617f5f4

Oded Gabbay authored 3 years ago

Divide the code into 3 different parts:
- Copy kernel parameters
- Setting device behavior per asic
- Fixup of various device parameters according to the device behavior.

In addition, remove non-relevant code for upstream (simulator support).
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>

habanalabs: use variable poll interval for fw loading · f4e7906d

Ohad Sharabi authored 3 years ago

Using a variable poll interval for fw loading allows us to support
much slower environments (emulation) while changing only a single
line in the code, instead of choosing a different interval in each
function that polls.
Signed-off-by: Ohad Sharabi <osharabi@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
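
The idea of a single, shared poll interval can be sketched in user-space C as follows. This is only an illustration, not the driver's code: `fw_poll_cfg`, `fw_poll()` and `ready_after_n_calls()` are hypothetical names, and elapsed time is simulated instead of sleeping.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names: none of these are taken from the driver. */
struct fw_poll_cfg {
	uint32_t interval_us;	/* single tunable shared by every poll loop */
};

typedef bool (*fw_ready_fn)(void *ctx);

/*
 * Poll ready() until it returns true or timeout_us elapses.
 * Time is simulated by adding interval_us per iteration, so the
 * sketch has no dependency on a real clock or sleep call.
 * Returns 0 on success, -1 on timeout.
 */
static int fw_poll(const struct fw_poll_cfg *cfg, fw_ready_fn ready,
		   void *ctx, uint32_t timeout_us)
{
	uint32_t elapsed_us = 0;

	assert(cfg->interval_us > 0);

	while (elapsed_us <= timeout_us) {
		if (ready(ctx))
			return 0;
		elapsed_us += cfg->interval_us;	/* a real driver would sleep here */
	}
	return -1;
}

/* Example readiness check: becomes true once the counter in ctx runs out. */
static bool ready_after_n_calls(void *ctx)
{
	int *remaining = ctx;

	return (*remaining)-- <= 0;
}
```

Because every caller reads the interval from the same configuration, supporting a slower environment means changing one value rather than each polling function.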

18 Oct, 2021 2 commits

habanalabs: initialize hpriv fields before adding new node · 4a18dde5

Moti Haimovski authored 3 years ago

When adding a new node to the hpriv list, the driver should
initialize its fields before adding the new node.

Otherwise, there may be some small chance of another thread traversing
that list and accessing the new node's fields without them being
initialized.
Signed-off-by: Moti Haimovski <mhaimovski@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
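
The ordering described above (initialize first, publish second) can be sketched in user-space C. This is a simplified illustration under stated assumptions: `fpriv_node`, `fpriv_add()` and `fpriv_count_live()` are hypothetical, there is a single writer, and C11 atomics stand in for whatever locking the driver actually uses.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a per-open private structure. */
struct fpriv_node {
	int pid;
	int refcount;
	struct fpriv_node *next;
};

static _Atomic(struct fpriv_node *) fpriv_head;

/* Fully initialize the node, then publish it as the new list head. */
static struct fpriv_node *fpriv_add(int pid)
{
	struct fpriv_node *node;

	assert(pid > 0);

	node = malloc(sizeof(*node));
	if (!node)
		return NULL;

	/* Step 1: set every field while the node is still private. */
	node->pid = pid;
	node->refcount = 1;
	node->next = atomic_load_explicit(&fpriv_head, memory_order_relaxed);

	/* Step 2: only now make the node reachable by list walkers. */
	atomic_store_explicit(&fpriv_head, node, memory_order_release);
	return node;
}

/* Reader side: any node reached through the head is already initialized. */
static int fpriv_count_live(void)
{
	int live = 0;
	struct fpriv_node *n =
		atomic_load_explicit(&fpriv_head, memory_order_acquire);

	for (; n; n = n->next)
		if (n->refcount > 0)
			live++;
	return live;
}
```

If the two steps were swapped, a concurrent reader could reach the node while `refcount` and `pid` still hold garbage.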

habanalabs: bypass reset for continuous h/w error event · 10cab81d

Bharat Jauhari authored 3 years ago

There may be a situation where the driver receives continuous fatal H/W
error events from the FW immediately after a reset cycle.
This may be due to some fault in the silicon itself.
In such a case it's better to bypass the reset cycle so we won't be stuck
in an endless loop of resets.

This commit bypasses the reset request in case the driver received two
back-to-back FW fatal errors before the first occurrence of a heartbeat
event.
Signed-off-by: Bharat Jauhari <bjauhari@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
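
The bypass rule can be modeled as a tiny state machine. The following user-space sketch is an assumption about how such logic might look; `reset_tracker`, `on_heartbeat()` and `on_fw_fatal_error()` are hypothetical names, not the driver's.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical bookkeeping; field names are not taken from the driver. */
struct reset_tracker {
	bool fatal_since_heartbeat;	/* a fatal error already triggered a reset */
};

/* A heartbeat shows the device recovered, so re-arm the reset path. */
static void on_heartbeat(struct reset_tracker *t)
{
	assert(t);
	t->fatal_since_heartbeat = false;
}

/*
 * Returns true if the driver should reset the device for this fatal
 * event, false if the reset should be bypassed.
 */
static bool on_fw_fatal_error(struct reset_tracker *t)
{
	assert(t);
	if (t->fatal_since_heartbeat)
		return false;	/* second fatal error in a row: skip the reset */

	t->fatal_since_heartbeat = true;
	return true;
}
```

The first fatal error resets the device; a second one arriving before any heartbeat is ignored, which breaks the reset loop.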

01 Sep, 2021 2 commits

habanalabs: add support for f/w reset · 8d9aa980

Oded Gabbay authored 3 years ago

When the f/w runs in secured mode, it can reset the ASIC when certain
events occur. In unsecured mode, the driver asks the f/w to reset the
ASIC for those events.

We need to perform the entire reset procedure but without accessing the
ASIC. i.e. without halting the engines and without sending messages
to the f/w.
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>

habanalabs: add "in device creation" status · 71731090

Omer Shpigelman authored 3 years ago

On init, the disabled state is cleared right before hw_init and that
causes the device to report on "Operational" state before the device
initialization is finished. Although the char device is not yet exposed
to the user at this stage, the sysfs entries are exposed.

This can cause errors in monitoring applications that use the sysfs
entries.

In order to avoid this, a new state "in device creation" is introduced
to be reported when the device is not disabled but is still in init
flow.
Signed-off-by: Omer Shpigelman <oshpigelman@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
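
A minimal sketch of how a status query could distinguish the new state is shown below. The enum, the struct and the `init_done` flag are hypothetical and only mirror the states named in the message; the driver's real status values and fields may differ.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names; they only mirror the states described above. */
enum example_dev_status {
	EXAMPLE_STATUS_OPERATIONAL,
	EXAMPLE_STATUS_DISABLED,
	EXAMPLE_STATUS_IN_DEVICE_CREATION,
};

struct example_dev {
	bool disabled;		/* cleared right before h/w init starts */
	bool init_done;		/* set only when the whole init flow has finished */
};

static enum example_dev_status example_get_status(const struct example_dev *dev)
{
	assert(dev);
	if (dev->disabled)
		return EXAMPLE_STATUS_DISABLED;
	if (!dev->init_done)
		return EXAMPLE_STATUS_IN_DEVICE_CREATION;
	return EXAMPLE_STATUS_OPERATIONAL;
}
```

With this ordering, a sysfs reader that queries the status between clearing `disabled` and finishing init sees "in device creation" instead of "operational".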

29 Aug, 2021 2 commits

habanalabs: add support for encapsulated signals reservation · dadf17ab

farah kassabri authored 3 years ago

The signaling from within encapsulated OP capability is merged into the
existing stream architecture, such that one can trigger multiple
signaling from an encapsulated op, according to the time the event
was done in the graph execution and avoid the need to wait for the
whole encapsulated OP execution to be complete before the stream can
signal.

This commit implements only the reserve/unreserve part.
Signed-off-by: farah kassabri <fkassabri@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>

habanalabs: use get_task_pid() to take PID · e79e745b

Oded Gabbay authored 3 years ago

The previous function we used, find_get_pid(), wasn't good in case
the user process was run inside docker.

As a result, we didn't have the PID and we couldn't kill the user
process in case the device got stuck and we needed to reset the
device.
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>

18 Jun, 2021 9 commits

habanalabs: added open_stats info ioctl · e307b302

Yuri Nudelman authored 3 years ago

In a system with multiple ASICs, there is a need to provide monitoring
tools with information on how long a device was opened and how many
times a device was opened.

Therefore, we add a new opcode to the INFO ioctl to provide that
information.
Signed-off-by: Yuri Nudelman <ynudelman@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>

habanalabs: Fix an error handling path in 'hl_pci_probe()' · 3002f467

Christophe JAILLET authored 3 years ago

If an error occurs after a 'pci_enable_pcie_error_reporting()' call, it
must be undone by a corresponding 'pci_disable_pcie_error_reporting()'
call, as already done in the remove function.

Fixes: 2e5eda46 ("habanalabs: PCIe Advanced Error Reporting support")
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
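
The unwind pattern behind this fix can be shown with a user-space sketch. The `fake_*` functions are hypothetical stand-ins: they only count enable/disable calls and do not touch any PCI API.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the enable/disable pair and a later init step. */
static int aer_enabled;

static void fake_enable_error_reporting(void)
{
	aer_enabled++;
}

static void fake_disable_error_reporting(void)
{
	assert(aer_enabled > 0);	/* never disable more than was enabled */
	aer_enabled--;
}

static int fake_device_init(bool fail)
{
	return fail ? -1 : 0;
}

static int fake_probe(bool init_fails)
{
	int rc;

	fake_enable_error_reporting();

	rc = fake_device_init(init_fails);
	if (rc)
		goto disable_error_reporting;

	return 0;

disable_error_reporting:
	/* Undo the enable so a failed probe leaves no side effects. */
	fake_disable_error_reporting();
	return rc;
}
```

Without the label and the disable call, a failed probe would return with error reporting still enabled.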

habanalabs: enable stop on error for all QMANs and engines · 358526be

Ofir Bitton authored 4 years ago

If there is an error in the QMAN/engine, there is no point in trying
to continue running the workload. It is better to stop to allow the
user to debug the program.
Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>

habanalabs: reset device upon FD close if not idle · 84586de4

Ofir Bitton authored 3 years ago

If the device is not idle after the user closes the FD, we must reset
the device, as the next user that tries to open the FD will encounter
a non-functional device.
Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>

habanalabs: prefer ASYNC device probing · 135ade0c

Oded Gabbay authored 3 years ago

There is no dependency when probing multiple devices so indicate to the
kernel that it can probe our devices in ASYNC fashion.

This shortens insmod of the driver from ~2 minutes to 20 seconds on
a system with 8 devices.
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>

habanalabs: track security status using positive logic · 4cb4508c

Ohad Sharabi authored 3 years ago

Using negative logic (i.e. fw_security_disabled) is confusing.

Modify the flag to use positive logic (fw_security_enabled).
Signed-off-by: Ohad Sharabi <osharabi@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>

habanalabs: set memory scrubbing to disabled by default · 7fb2a1f5

Oded Gabbay authored 3 years ago

Scrubbing memory after every unmap is very costly in terms of
performance. If a user wants it he can enable it but the default
should prioritize performance.
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>

habanalabs: check if asic secured with asic type · 190ec497

Ohad Sharabi authored 3 years ago

Fix issue in which the input to the function is_asic_secured was device
PCI_IDS number instead of the asic_type enumeration.
Signed-off-by: Ohad Sharabi <osharabi@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>

habanalabs/gaudi: send hard reset cause to preboot · 3e0ca9fa

Koby Elbaz authored 3 years ago

LKD should provide the hard reset cause to preboot prior to
loading any FW components (in case needed).
The current implementation is based on the new FW 'COMMS' protocol.
In case 'COMMS' is disabled, the reset cause won't be sent.
Currently, only 2 reset causes are shared: HEARTBEAT & TDR.

Sending the reset cause will provide the missing watchdog
info that the firmware needs to provide to the BMC.
Signed-off-by: Koby Elbaz <kelbaz@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>

08 May, 2021 1 commit

habanalabs: ignore f/w status error · 27a9e35d

Oded Gabbay authored 3 years ago

In case firmware has a bug and erroneously reports a status error
(e.g. device unusable) during boot, allow the user to tell the driver
to continue the boot regardless of the error status.

This will be done via kernel parameter which exposes a mask. The
user that loads the driver can decide exactly which status error to
ignore and which to take into account. The bitmask is according to
defines in hl_boot_if.h
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
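
The masking logic can be illustrated with a short user-space sketch. The bit names and `boot_should_fail()` are hypothetical; the real bit definitions are the ones in hl_boot_if.h and are not reproduced here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical error bits, for illustration only. */
#define EXAMPLE_ERR_DRAM_FAIL		(1u << 0)
#define EXAMPLE_ERR_DEVICE_UNUSABLE	(1u << 1)

static_assert((EXAMPLE_ERR_DRAM_FAIL & EXAMPLE_ERR_DEVICE_UNUSABLE) == 0,
	      "error bits must not overlap");

/*
 * Returns true if boot must be aborted: at least one reported error
 * bit is not covered by the user-supplied ignore mask.
 */
static bool boot_should_fail(uint32_t reported_errors, uint32_t ignore_mask)
{
	return (reported_errors & ~ignore_mask) != 0;
}
```

An ignore mask of 0 keeps the strict behavior, while setting a bit in the mask lets boot continue even if the firmware reports that specific error.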

09 Apr, 2021 3 commits

habanalabs/gaudi: derive security status from pci id · e5042a6f

Ofir Bitton authored 3 years ago

As the F/W security indication must be available before the driver
approaches the PCI bus, F/W security should be derived from the PCI id
rather than be fetched during the boot handshake with the F/W.
Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>

habanalabs: use a single FW loading bringup flag · 6a2f5d70

Ofir Bitton authored 4 years ago

For simplicity, use a single bringup flag indicating which FW
binaries should be loaded to the device.
Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>

habanalabs: change default CS timeout to 30 seconds · 17b59dd3

Oded Gabbay authored 4 years ago

Because our graph contains network operations, we need to account
for delay in the network.

5 seconds timeout per CS is not enough to account for that.
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>

28 Dec, 2020 1 commit

habanalabs: register to pci shutdown callback · fcaebc73

Oded Gabbay authored 4 years ago

We need to make sure our device is idle when rebooting a virtual
machine. This is done in the driver level.

The firmware will later handle FLR but we want to be extra safe and
stop the devices until the FLR is handled.
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>

30 Nov, 2020 4 commits

habanalabs: move HW dirty check to a proper location · d1ddd905

Ofir Bitton authored 4 years ago

Driver must verify if HW is dirty before trying to fetch preboot
information. Hence, we move this validation to a prior stage of
the boot sequence.
Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>

habanalabs: add 'needs reset' state in driver · 66a76401

Ofir Bitton authored 4 years ago

The new state indicates that the device should be reset in order
to regain functionality.
This unique state can occur if reset_on_lockup is disabled
and an actual lockup has occurred.
Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>

habanalabs/gaudi: scrub all memory upon closing FD · 03df136b

farah kassabri authored 4 years ago

In cases of multi-tenants, administrators may want to prevent data
leakage between users running on the same device one after another.

To do that, the driver can scrub the internal memory (both SRAM and
DRAM) after a user finishes using the memory.

Because in GAUDI the driver allows only one application to use the
device at a time, it can scrub the memory when the user app closes the FD.

In future devices where we have MMU on the DRAM, we can scrub the DRAM
memory with a finer granularity (page granularity) when the user
allocates the memory.

This feature is not supported in Goya.

To allow users that want to debug their applications, we add a kernel
module parameter to load the driver with this feature disabled.
Signed-off-by: farah kassabri <fkassabri@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>

habanalabs: support multiple types of firmwares · 596553db

Oded Gabbay authored 4 years ago

The driver now loads the firmware in two stages. For debugging purposes
we need to support situations where only the first stage firmware is
loaded.

Therefore, use a bitmask to determine which F/W is loaded.
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>

22 Sep, 2020 1 commit

habanalabs: PCIe Advanced Error Reporting support · 2e5eda46

Ofir Bitton authored 4 years ago

The driver will now get notified of any PCI error that occurs and
will respond according to the severity of the error.
Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <oded.gabbay@gmail.com>
Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>

24 Jul, 2020 2 commits

habanalabs: create common folder · 70b2f993

Oded Gabbay authored 4 years ago

For internal needs of our CI we need to move all the common code into a
common folder instead of putting them in the root folder of the driver.

Same applies to the common header files under include/
Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
Reviewed-by: Omer Shpigelman <oshpigelman@habana.ai>

habanalabs: remove rate limiters from GAUDI · 0b168c8f

Oded Gabbay authored 4 years ago

We no longer need to initialize the rate limiters in GAUDI A1.
Reviewed-by: Omer Shpigelman <oshpigelman@habana.ai>
Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>

10 Jul, 2020 1 commit

habanalabs: set clock gating per engine · e38bfd30

Oded Gabbay authored 4 years ago

For debugging purposes, we need to allow the root user better control of
the clock gating feature of the DMA and compute engines. Therefore, change
the clock gating debugfs interface to be a bitmask instead of true/false.
Each bit represents a different engine, according to gaudi_engine_id enum.

See debugfs documentation for more details.
Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
Reviewed-by: Omer Shpigelman <oshpigelman@habana.ai>
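
A per-engine bitmask lookup can be sketched as follows. The engine IDs here are hypothetical placeholders; in the driver, the bit positions follow the gaudi_engine_id enum, whose values are not reproduced here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical engine IDs, for illustration only. */
enum example_engine_id {
	EXAMPLE_ENGINE_DMA_0,
	EXAMPLE_ENGINE_DMA_1,
	EXAMPLE_ENGINE_TPC_0,
	EXAMPLE_ENGINE_MAX,
};

/* Bit N of the mask controls clock gating for engine ID N. */
static bool clock_gating_enabled(uint64_t mask, enum example_engine_id id)
{
	assert(id < EXAMPLE_ENGINE_MAX);
	return ((mask >> id) & 1u) != 0;
}
```

With a mask of 0x5, for example, clock gating is on for the first and third engines and off for the second, which is the finer control that a true/false switch cannot express.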

19 May, 2020 3 commits

habanalabs: enable gaudi code in driver · af57cb81

Oded Gabbay authored 4 years ago

Enable the GAUDI ASIC code in the pci probe callback of the driver so the
driver will handle GAUDI ASICs.
Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>

habanalabs: add gaudi asic-dependent code · ac0ae6a9

Oded Gabbay authored 4 years ago

Add the ASIC-dependent code for GAUDI. Supply (almost) all of the function
callbacks that the driver's common code needs to initialize, finalize and
submit workloads to the GAUDI ASIC.

It also contains the code to initialize the F/W of the GAUDI ASIC and to
receive events from the F/W.
Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>

habanalabs: support clock gating enable/disable · ca62433f

Oded Gabbay authored 4 years ago

In Gaudi there is a feature of clock gating certain engines.
Therefore, add this property to the device structure.

In addition, due to a limitation of this feature, the driver needs to
dynamically enable or disable this feature during run-time. Therefore, add
ASIC interface functions to enable/disable this function from the common
code.

Moreover, this feature must be turned off when the user wishes to debug the
ASIC by reading/writing registers and/or memory through the driver's
debugfs. Therefore, add an option to enable/disable clock gating via the
debugfs interface.
Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
