An error occurred fetching the project authors.
- 28 Feb, 2022 1 commit
-
-
Oded Gabbay authored
Due to the need of SynapseAI to configure all TPC engines from a single QMAN, the driver must disable CGM and never allow the user to enable it. Otherwise, the configuration of the TPC engines will fail. Signed-off-by:
Oded Gabbay <ogabbay@kernel.org>
-
- 26 Dec, 2021 8 commits
-
-
Ofir Bitton authored
Unify variables related to device reset, which will help us to add some new reset functionality in future patches. Signed-off-by:
Ofir Bitton <obitton@habana.ai> Reviewed-by:
Oded Gabbay <ogabbay@kernel.org> Signed-off-by:
Oded Gabbay <ogabbay@kernel.org>
-
Dani Liberman authored
Need to allow user retrieve data during reset and afterwards without the need to reopen the device. Did it by seperating the user peocesses list into two lists: 1. fpriv_list which contains list of user processes that opened the device (currently only one). 2. fpriv_ctrl_list which contains list of user processes that opened the control device. This processes in this list shall not be killed during reset, only when the device is suddenly removed from PCI chain. Signed-off-by:
Dani Liberman <dliberman@habana.ai> Reviewed-by:
Oded Gabbay <ogabbay@kernel.org> Signed-off-by:
Oded Gabbay <ogabbay@kernel.org>
-
Oded Gabbay authored
The driver supports only a single user anyway, so there is no point in checking whether we are in_debug state when a user tries to open the device, because if we are in_debug, it means a user is already using the device. Signed-off-by:
Oded Gabbay <ogabbay@kernel.org>
-
Oded Gabbay authored
It was an error to save the compute context's pointer in the device structure, as it allowed its use without proper ref-cnt. Change the variable to a flag that only indicates whether there is an active compute context. Code that needs the pointer will now be forced to use proper internal APIs to get the pointer. Signed-off-by:
Oded Gabbay <ogabbay@kernel.org>
-
Rajaravi Krishna Katta authored
Changing the frequency automatically is only done in Goya. In future ASICs this is done inside the firmware. Therefore, move the common code into the Goya specific files. Main changes as part of the commit are: 1. The thread for setting frequency is moved from device_late_init to goya_late_init 2. hl_device_set_frequency is removed from hl_device_open as it is not relevant for other ASICs and for Goya it is taken care by the thread 3. hl_device_set_frequency is renamed as goya_set_frequency Signed-off-by:
Rajaravi Krishna Katta <rkatta@habana.ai> Reviewed-by:
Oded Gabbay <ogabbay@kernel.org> Signed-off-by:
Oded Gabbay <ogabbay@kernel.org>
-
Dani Liberman authored
A new uAPI is added for debug purposes of the user-space to retrieve errors related data from previous session (before device reset was performed). Inforamtion is filled when a razwi or CS timeout happens and can contain one of the following: 1. Retrieve timestamp of last time the device was opened and razwi or CS timeout happened. 2. Retrieve information about last CS timeout. 3. Retrieve information about last razwi error. This information doesn't contain user data, so no danger of data leakage between users. Signed-off-by:
Dani Liberman <dliberman@habana.ai> Reviewed-by:
Oded Gabbay <ogabbay@kernel.org> Signed-off-by:
Oded Gabbay <ogabbay@kernel.org>
-
Oded Gabbay authored
Divide the code into 3 different parts: - Copy kernel parameters - Setting device behaivor per asic - Fixup of various device parameters according to the device behaivor. In addition, remove non-relevant code for upstream (simulator support). Signed-off-by:
Oded Gabbay <ogabbay@kernel.org>
-
Ohad Sharabi authored
Using a variable poll interval for fw loading allows us to support much slower environments (emulation) while changing only a single line in the code, instead of choosing a different interval in each function that polls. Signed-off-by:
Ohad Sharabi <osharabi@habana.ai> Reviewed-by:
Oded Gabbay <ogabbay@kernel.org> Signed-off-by:
Oded Gabbay <ogabbay@kernel.org>
-
- 18 Oct, 2021 2 commits
-
-
Moti Haimovski authored
When adding a new node to the hpriv list, the driver should initialize its fields before adding the new node. Otherwise, there may be some small chance of another thread traversing that list and accessing the new node's fields without them being initialized. Signed-off-by:
Moti Haimovski <mhaimovski@habana.ai> Reviewed-by:
Oded Gabbay <ogabbay@kernel.org> Signed-off-by:
Oded Gabbay <ogabbay@kernel.org>
-
Bharat Jauhari authored
There may be a situation where drivers receives continuous fatal H/W error events from FW immediately post reset cycle. This may be due to some fault on the silicon itself. In such case its better to bypass reset cycle so we won't be stuck in endless loop of resets. This commit bypasses reset request in case driver received two back to back FW fatal error before first occurrence of heartbeat event. Signed-off-by:
Bharat Jauhari <bjauhari@habana.ai> Reviewed-by:
Oded Gabbay <ogabbay@kernel.org> Signed-off-by:
Oded Gabbay <ogabbay@kernel.org>
-
- 01 Sep, 2021 2 commits
-
-
Oded Gabbay authored
When the f/w runs in secured mode, it can reset the ASIC when certain events occur. In unsecured mode, the driver asks the f/w to reset the ASIC for those events. We need to perform the entire reset procedure but without accessing the ASIC. i.e. without halting the engines and without sending messages to the f/w. Signed-off-by:
Oded Gabbay <ogabbay@kernel.org>
-
Omer Shpigelman authored
On init, the disabled state is cleared right before hw_init and that causes the device to report on "Operational" state before the device initialization is finished. Although the char device is not yet exposed to the user at this stage, the sysfs entries are exposed. This can cause errors in monitoring applications that use the sysfs entries. In order to avoid this, a new state "in device creation" is introduced to ne reported when the device is not disabled but is still in init flow. Signed-off-by:
Omer Shpigelman <oshpigelman@habana.ai> Reviewed-by:
Oded Gabbay <ogabbay@kernel.org> Signed-off-by:
Oded Gabbay <ogabbay@kernel.org>
-
- 29 Aug, 2021 2 commits
-
-
farah kassabri authored
The signaling from within encapsulated OP capability is merged into the existing stream architecture, such that one can trigger multiple signaling from an encapsulated op, according to the time the event was done in the graph execution and avoid the need to wait for the whole encapsulated OP execution to be complete before the stream can signal. This commit implements only the reserve/unreserve part. Signed-off-by:
farah kassabri <fkassabri@habana.ai> Reviewed-by:
Oded Gabbay <ogabbay@kernel.org> Signed-off-by:
Oded Gabbay <ogabbay@kernel.org>
-
Oded Gabbay authored
The previous function we used, find_get_pid(), wasn't good in case the user process was run inside docker. As a result, we didn't had the PID and we couldn't kill the user process in case the device got stuck and we needed to reset the device. Signed-off-by:
Oded Gabbay <ogabbay@kernel.org>
-
- 18 Jun, 2021 9 commits
-
-
Yuri Nudelman authored
In a system with multiple ASICs, there is a need to provide monitoring tools with information on how long a device was opened and how many times a device was opened. Therefore, we add a new opcode to the INFO ioctl to provide that information. Signed-off-by:
Yuri Nudelman <ynudelman@habana.ai> Reviewed-by:
Oded Gabbay <ogabbay@kernel.org> Signed-off-by:
Oded Gabbay <ogabbay@kernel.org>
-
Christophe JAILLET authored
If an error occurs after a 'pci_enable_pcie_error_reporting()' call, it must be undone by a corresponding 'pci_disable_pcie_error_reporting()' call, as already done in the remove function. Fixes: 2e5eda46 ("habanalabs: PCIe Advanced Error Reporting support") Signed-off-by:
Christophe JAILLET <christophe.jaillet@wanadoo.fr> Reviewed-by:
Oded Gabbay <ogabbay@kernel.org> Signed-off-by:
Oded Gabbay <ogabbay@kernel.org>
-
Ofir Bitton authored
If there is an error in the QMAN/engine, there is no point of trying to continue running the workload. It is better to stop to allow the user to debug the program. Signed-off-by:
Ofir Bitton <obitton@habana.ai> Reviewed-by:
Oded Gabbay <ogabbay@kernel.org> Signed-off-by:
Oded Gabbay <ogabbay@kernel.org>
-
Ofir Bitton authored
If device is not idle after user closes the FD we must reset device as next user that will try to open FD will encounter a non-functional device. Signed-off-by:
Ofir Bitton <obitton@habana.ai> Reviewed-by:
Oded Gabbay <ogabbay@kernel.org> Signed-off-by:
Oded Gabbay <ogabbay@kernel.org>
-
Oded Gabbay authored
There is no dependency when probing multiple devices so indicate to the kernel that it can probe our devices in ASYNC fashion. This shortens insmod of the driver from ~2 minutes to 20 seconds on a system with 8 devices. Signed-off-by:
Oded Gabbay <ogabbay@kernel.org>
-
Ohad Sharabi authored
Using negative logic (i.e. fw_security_disabled) is confusing. Modify the flag to use positive logic (fw_security_enabled). Signed-off-by:
Ohad Sharabi <osharabi@habana.ai> Reviewed-by:
Oded Gabbay <ogabbay@kernel.org> Signed-off-by:
Oded Gabbay <ogabbay@kernel.org>
-
Oded Gabbay authored
Scrubbing memory after every unmap is very costly in terms of performance. If a user wants it he can enable it but the default should prioritize performance. Signed-off-by:
Oded Gabbay <ogabbay@kernel.org>
-
Ohad Sharabi authored
Fix issue in which the input to the function is_asic_secured was device PCI_IDS number instead of the asic_type enumeration. Signed-off-by:
Ohad Sharabi <osharabi@habana.ai> Reviewed-by:
Oded Gabbay <ogabbay@kernel.org> Signed-off-by:
Oded Gabbay <ogabbay@kernel.org>
-
Koby Elbaz authored
LKD should provide hard reset cause to preboot prior to loading any FW components (in case needed). Current implementation is based on the new FW 'COMMS' protocol In cased 'COMMS' is disabled - reset cause won't be sent. Currently, only 2 reset causes are shared: HEARTBEAT & TDR. Sending the reset cause will provide the missing watchdog info that the firmware needs to provide to the BMC. Signed-off-by:
Koby Elbaz <kelbaz@habana.ai> Reviewed-by:
Oded Gabbay <ogabbay@kernel.org> Signed-off-by:
Oded Gabbay <ogabbay@kernel.org>
-
- 08 May, 2021 1 commit
-
-
Oded Gabbay authored
In case firmware has a bug and erroneously reports a status error (e.g. device unusable) during boot, allow the user to tell the driver to continue the boot regardless of the error status. This will be done via kernel parameter which exposes a mask. The user that loads the driver can decide exactly which status error to ignore and which to take into account. The bitmask is according to defines in hl_boot_if.h Signed-off-by:
Oded Gabbay <ogabbay@kernel.org>
-
- 09 Apr, 2021 3 commits
-
-
Ofir Bitton authored
As F/ security indication must be available before driver approaches PCI bus, F/W security should be derived from PCI id rather than be fetched during boot handshake with F/W. Signed-off-by:
Ofir Bitton <obitton@habana.ai> Reviewed-by:
Oded Gabbay <ogabbay@kernel.org> Signed-off-by:
Oded Gabbay <ogabbay@kernel.org>
-
Ofir Bitton authored
For simplicity, use a single bringup flag indicating which FW binaries should loaded to device. Signed-off-by:
Ofir Bitton <obitton@habana.ai> Reviewed-by:
Oded Gabbay <ogabbay@kernel.org> Signed-off-by:
Oded Gabbay <ogabbay@kernel.org>
-
Oded Gabbay authored
Because our graph contains network operations, we need to account for delay in the network. 5 seconds timeout per CS is not enough to account for that. Signed-off-by:
Oded Gabbay <ogabbay@kernel.org>
-
- 28 Dec, 2020 1 commit
-
-
Oded Gabbay authored
We need to make sure our device is idle when rebooting a virtual machine. This is done in the driver level. The firmware will later handle FLR but we want to be extra safe and stop the devices until the FLR is handled. Signed-off-by:
Oded Gabbay <ogabbay@kernel.org>
-
- 30 Nov, 2020 4 commits
-
-
Ofir Bitton authored
Driver must verify if HW is dirty before trying to fetch preboot information. Hence, we move this validation to a prior stage of the boot sequence. Signed-off-by:
Ofir Bitton <obitton@habana.ai> Reviewed-by:
Oded Gabbay <ogabbay@kernel.org> Signed-off-by:
Oded Gabbay <ogabbay@kernel.org>
-
Ofir Bitton authored
The new state indicates that device should be reset in order to re-gain funcionality. This unique state can occur if reset_on_lockup is disabled and an actual lockup has occurred. Signed-off-by:
Ofir Bitton <obitton@habana.ai> Reviewed-by:
Oded Gabbay <ogabbay@kernel.org> Signed-off-by:
Oded Gabbay <ogabbay@kernel.org>
-
farah kassabri authored
In cases of multi-tenants, administrators may want to prevent data leakage between users running on the same device one after another. To do that the driver can scrub the internal memory (both SRAM and DRAM) after a user finish to use the memory. Because in GAUDI the driver allows only one application to use the device at a time, it can scrub the memory when user app close FD. In future devices where we have MMU on the DRAM, we can scrub the DRAM memory with a finer granularity (page granularity) when the user allocates the memory. This feature is not supported in Goya. To allow users that want to debug their applications, we add a kernel module parameter to load the driver with this feature disabled. Signed-off-by:
farah kassabri <fkassabri@habana.ai> Reviewed-by:
Oded Gabbay <ogabbay@kernel.org> Signed-off-by:
Oded Gabbay <ogabbay@kernel.org>
-
Oded Gabbay authored
The driver now loads the firmware in two stages. For debugging purposes we need to support situations where only the first stage firmware is loaded. Therefore, use a bitmask to determine which F/W is loaded Signed-off-by:
Oded Gabbay <ogabbay@kernel.org>
-
- 22 Sep, 2020 1 commit
-
-
Ofir Bitton authored
driver will now get notified upon any PCI error occurred and will respond according to the severity of the error. Signed-off-by:
Ofir Bitton <obitton@habana.ai> Reviewed-by:
Oded Gabbay <oded.gabbay@gmail.com> Signed-off-by:
Oded Gabbay <oded.gabbay@gmail.com>
-
- 24 Jul, 2020 2 commits
-
-
Oded Gabbay authored
For internal needs of our CI we need to move all the common code into a common folder instead of putting them in the root folder of the driver. Same applies to the common header files under include/ Signed-off-by:
Oded Gabbay <oded.gabbay@gmail.com> Reviewed-by:
Omer Shpigelman <oshpigelman@habana.ai>
-
Oded Gabbay authored
We no longer need to initialize the rate limiters in GAUDI A1. Reviewed-by:
Omer Shpigelman <oshpigelman@habana.ai> Signed-off-by:
Oded Gabbay <oded.gabbay@gmail.com>
-
- 10 Jul, 2020 1 commit
-
-
Oded Gabbay authored
For debugging purposes, we need to allow the root user better control of the clock gating feature of the DMA and compute engines. Therefore, change the clock gating debugfs interface to be bitmask instead of true/false. Each bit represents a different engine, according to gaudi_engine_id enum. See debugfs documentation for more details. Signed-off-by:
Oded Gabbay <oded.gabbay@gmail.com> Reviewed-by:
Omer Shpigelman <oshpigelman@habana.ai>
-
- 19 May, 2020 3 commits
-
-
Oded Gabbay authored
Enable the GAUDI ASIC code in the pci probe callback of the driver so the driver will handle GAUDI ASICs. Signed-off-by:
Oded Gabbay <oded.gabbay@gmail.com>
-
Oded Gabbay authored
Add the ASIC-dependent code for GAUDI. Supply (almost) all of the function callbacks that the driver's common code need to initialize, finalize and submit workloads to the GAUDI ASIC. It also contains the code to initialize the F/W of the GAUDI ASIC and to receive events from the F/W. Signed-off-by:
Oded Gabbay <oded.gabbay@gmail.com>
-
Oded Gabbay authored
In Gaudi there is a feature of clock gating certain engines. Therefore, add this property to the device structure. In addition, due to a limitation of this feature, the driver needs to dynamically enable or disable this feature during run-time. Therefore, add ASIC interface functions to enable/disable this function from the common code. Moreover, this feature must be turned off when the user wishes to debug the ASIC by reading/writing registers and/or memory through the driver's debugfs. Therefore, add an option to enable/disable clock gating via the debugfs interface. Signed-off-by:
Oded Gabbay <oded.gabbay@gmail.com>
-