1. 19 May, 2020 36 commits
  2. 17 May, 2020 4 commits
    • habanalabs: handle barriers in DMA QMAN streams · 926ba4cc
      Oded Gabbay authored
      When we have a DMA QMAN with multiple streams, we need to know whether the
      command buffer contains at least one DMA packet in order to configure the
      barriers correctly when adding the 2xMSG_PROT at the end of the JOB. If
      there is no DMA packet, then there is no need to set an engine barrier.
      This is relevant only for GAUDI, as GOYA doesn't have streams, so the
      engine can't be kept busy by another stream.
      Reviewed-by: Tomer Tayar <ttayar@habana.ai>
      Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
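The check described in this commit can be sketched as follows for the multi-stream (GAUDI) case. This is a minimal illustration, not the driver's code: the packet kinds, `struct example_pkt`, and both function names are hypothetical, since the commit message does not show the real packet opcodes or parser.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical packet kinds; the real QMAN packet opcodes are not shown here. */
enum example_pkt_kind {
    EXAMPLE_PKT_DMA,
    EXAMPLE_PKT_MSG,
    EXAMPLE_PKT_OTHER,
};

/* Hypothetical, already-parsed view of one packet in a command buffer. */
struct example_pkt {
    enum example_pkt_kind kind;
};

/* Returns true if the command buffer holds at least one DMA packet. */
static bool example_cb_has_dma_packet(const struct example_pkt *pkts, size_t count)
{
    for (size_t i = 0; i < count; i++) {
        if (pkts[i].kind == EXAMPLE_PKT_DMA)
            return true;
    }
    return false;
}

/*
 * Decide whether the trailing MSG_PROT packets need an engine barrier.
 * Per the commit message: with no DMA packet in the buffer, no engine
 * barrier is needed.
 */
static bool example_need_engine_barrier(const struct example_pkt *pkts, size_t count)
{
    return example_cb_has_dma_packet(pkts, count);
}
```

The scan stops at the first DMA packet, because only the presence of one matters for the barrier decision.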
    • habanalabs: retrieve DMA mask indication from firmware · cb056b9f
      Oded Gabbay authored
      Retrieve from the firmware the DMA mask value we need to set according to
      the device's PCI controller configuration. This is needed when working on
      POWER9 machines, as the device's PCI controller is configured differently
      on those machines.
      Reviewed-by: Tomer Tayar <ttayar@habana.ai>
      Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
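As a rough sketch of the idea, the driver could turn a bit width reported by the firmware into a DMA mask. The helper names below are hypothetical, and treating 0 as "not reported, use the default" is an assumption made for illustration; the commit does not say how the value is encoded. The mask arithmetic itself mirrors the Linux kernel's `DMA_BIT_MASK(n)` macro.

```c
#include <stdint.h>

/* Same arithmetic as the kernel's DMA_BIT_MASK(n): the low n bits set. */
static uint64_t example_dma_bit_mask(unsigned int bits)
{
    return bits >= 64 ? UINT64_MAX : ((UINT64_C(1) << bits) - 1);
}

/*
 * Hypothetical selection logic: fw_bits is the address width reported by
 * the firmware. Assumption: 0 means the firmware reported nothing, so the
 * driver falls back to its own default width.
 */
static uint64_t example_select_dma_mask(unsigned int fw_bits, unsigned int default_bits)
{
    unsigned int bits = fw_bits ? fw_bits : default_bits;

    return example_dma_bit_mask(bits);
}
```

In a kernel driver the resulting width would typically be passed to `dma_set_mask_and_coherent()`; that call is omitted here so the sketch builds in user space.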
    • habanalabs: update firmware definitions · c8aee597
      Oded Gabbay authored
      Add comments for the various errors and states of the firmware during boot.
      Add a mapping of a new register that tells the driver whether the
      firmware executed the driver's request or encountered an error.
      Add a new enum for the possible values of this register.
      Reviewed-by: Omer Shpigelman <oshpigelman@habana.ai>
      Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
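To illustrate the shape of such a status register, here is a sketch with made-up names and values. The commit message does not list the real enum members or the register layout, so everything prefixed `example_` or `EXAMPLE_` is hypothetical, including the "pending" state.

```c
#include <stdint.h>

/* Hypothetical values; the real enum members are not shown in the commit. */
enum example_fw_req_status {
    EXAMPLE_FW_REQ_PENDING = 0, /* assumption: request not yet handled */
    EXAMPLE_FW_REQ_DONE = 1,    /* firmware executed the driver's request */
    EXAMPLE_FW_REQ_ERROR = 2,   /* firmware encountered an error */
};

/*
 * Translate a raw register value into a simple result code:
 * 0 = request executed, 1 = keep waiting, -1 = error or unknown value.
 */
static int example_check_fw_status(uint32_t reg_val)
{
    switch (reg_val) {
    case EXAMPLE_FW_REQ_DONE:
        return 0;
    case EXAMPLE_FW_REQ_PENDING:
        return 1;
    case EXAMPLE_FW_REQ_ERROR:
    default:
        return -1;
    }
}
```

Unknown values are treated as errors so that a value this sketch does not know about is never mistaken for success.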
    • habanalabs: increase timeout during reset · 7a65ee04
      Oded Gabbay authored
      When doing training, the DL framework (e.g. TensorFlow) performs hundreds
      of thousands of memory allocations and mappings. If the driver needs to
      perform a hard reset during training, the driver kills the application and
      unmaps all those memory allocations. Unfortunately, because of that large
      number of mappings, the driver isn't able to do that within the current
      timeout (5 seconds). Therefore, increase the timeout significantly to
      30 seconds to avoid a situation where the driver resets the device with
      active mappings, which can sometimes cause a kernel bug.
      
      Note that this doesn't mean we will spend the full 30 seconds, because the
      reset thread checks once every second whether the unmap operation is done.
      Reviewed-by: Omer Shpigelman <oshpigelman@habana.ai>
      Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
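The once-per-second polling described in this commit can be sketched as below. This is a user-space illustration under stated assumptions, not the driver's reset thread: all names are hypothetical, and the completion check and the sleep are passed in as callbacks so the loop can be exercised without actually waiting. `example_fake_unmap` and `example_fake_sleep` are stand-ins for demonstration only.

```c
#include <stdbool.h>
#include <stddef.h>

typedef bool (*example_done_fn)(void *ctx);
typedef void (*example_sleep_fn)(unsigned int seconds);

/*
 * Poll once per second until the unmap work is done or the timeout expires.
 * Returns the number of seconds waited, or -1 if mappings are still active
 * when the timeout is reached.
 */
static int example_wait_for_unmap(example_done_fn done, void *ctx,
                                  example_sleep_fn sleep_sec, int timeout_sec)
{
    for (int waited = 0; waited < timeout_sec; waited++) {
        if (done(ctx))
            return waited; /* finished early: no need to use the full timeout */
        sleep_sec(1);
    }
    /* One last check exactly at the deadline. */
    return done(ctx) ? timeout_sec : -1;
}

/* Stand-in for the real completion check: done after a fixed number of polls. */
struct example_fake_unmap {
    int polls_until_done;
};

static bool example_fake_unmap_done(void *ctx)
{
    struct example_fake_unmap *f = ctx;

    if (f->polls_until_done > 0) {
        f->polls_until_done--;
        return false;
    }
    return true;
}

/* Stand-in for a one-second sleep so the sketch runs instantly. */
static void example_fake_sleep(unsigned int seconds)
{
    (void)seconds;
}
```

With these stand-ins, an unmap that needs 10 polls fails against a 5-second timeout but completes after 10 seconds against a 30-second one, which matches the motivation given in the commit message.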