    drm/amd/amdgpu implement tdr advanced mode · e6c6338f
    Jack Zhang authored
    [Why]
    The previous tdr design treats the first job in job_timeout as the bad
    job. But sometimes a later bad compute job can block a good gfx job and
    cause an unexpected gfx job timeout, because the gfx and compute rings
    share internal GC HW.
    
    [How]
    This patch implements an advanced tdr mode. It adds a synchronous
    pre-resubmit step (Step0 Resubmit) before the normal resubmit step in
    order to find the real bad job.
    
    1. In the Step0 Resubmit stage, it synchronously submits the first job
    and waits for it to be signaled. If the job times out, we identify it
    as guilty and do a hw reset. After that, we do the normal resubmit step
    to resubmit the remaining jobs.
    
    2. For a whole gpu reset (vram lost), do the resubmit the old way.
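
    The two cases above can be sketched as a small user-space simulation.
    This is only an illustration under stated assumptions, not the driver
    code: struct sim_job, struct sim_result, sim_submit_and_wait() and
    sim_advanced_tdr() are hypothetical names, a single job queue stands in
    for the scheduler's pending list, and a "hangs" flag stands in for a
    fence wait that times out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a scheduler job; not the real kernel struct. */
struct sim_job {
    int  id;
    bool hangs;      /* simulation only: the job's fence never signals */
    bool guilty;     /* set when Step0 identifies this job as the bad one */
    bool completed;  /* set when the job's fence signals */
};

/* Hypothetical summary of one recovery pass. */
struct sim_result {
    int guilty_id;   /* id of the job found guilty in Step0, or -1 */
    int hw_resets;   /* extra hw resets triggered by Step0 */
};

/* Simulated synchronous submit + wait: true if the fence signals in time. */
static bool sim_submit_and_wait(const struct sim_job *job)
{
    return !job->hangs;
}

/*
 * Sketch of the advanced tdr flow described in the commit message.
 * vram_lost models a whole gpu reset, where Step0 is skipped and all
 * jobs go through the normal resubmit (guilt handling of the old path
 * is not modeled here).
 */
static struct sim_result sim_advanced_tdr(struct sim_job *jobs, size_t n,
                                          bool vram_lost)
{
    struct sim_result res = { .guilty_id = -1, .hw_resets = 0 };
    size_t first = 0;

    assert(jobs != NULL || n == 0);

    if (!vram_lost && n > 0) {
        /* Step0 Resubmit: submit only the first job and wait for it. */
        if (sim_submit_and_wait(&jobs[0])) {
            jobs[0].completed = true;
        } else {
            /* Timeout: this job is the real bad job, so reset the hw. */
            jobs[0].guilty = true;
            res.guilty_id = jobs[0].id;
            res.hw_resets++;
        }
        first = 1;
    }

    /* Normal resubmit of the remaining jobs (asynchronous in a real driver). */
    for (size_t i = first; i < n; i++)
        jobs[i].completed = !jobs[i].hangs;

    return res;
}
```

    In this sketch, calling sim_advanced_tdr() on a queue whose first job
    hangs marks that job guilty and counts one hw reset, while the jobs
    behind it are resubmitted normally. With vram_lost set, Step0 is
    skipped entirely.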
    
    v2: squash in build fix (Alex)
    Signed-off-by: Jack Zhang <Jack.Zhang1@amd.com>
    Reviewed-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
    Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
amdgpu_device.c 143 KB