Commits · 65781c78ad74e4260fbec92c0ecc05738044e177 · Kirill Smelkov / linux

24 May, 2017 40 commits

drm/amdgpu/SRIOV:implement guilty job TDR for(V2) · 65781c78

Monk Liu authored May 11, 2017

1,TDR will kickout guilty job if it hang exceed the threshold
of the given one from kernel paramter "job_hang_limit", that
way a bad command stream will not infinitly cause GPU hang.

by default this threshold is 1 so a job will be kicked out
after it hang.

2,if a job timeout TDR routine will not reset all sched/ring,
instead if will only reset on the givn one which is indicated
by @job of amdgpu_sriov_gpu_reset, that way we don't need to
reset and recover each sched/ring if we already know which job
cause GPU hang.

3,unblock sriov_gpu_reset for AI family.

V2:
1:put kickout guilty job after sched parked.
2:since parking scheduler prior to kickout already occupies a
while, we can do last check on the in question job before
doing hw_reset.

TODO:
1:when a job is considered as guilty, we should mark some flag
in its fence status flag, and let UMD side aware that this
fence signaling is not due to job complete but job hang.

2:if gpu reset cause all video memory lost, we need introduce
a new policy to implement TDR, like drop all jobs not yet
signaled, and all IOCTL on this device will return ERROR
DEVICE_LOST.
this will be implemented later.
Signed-off-by: Monk Liu <Monk.Liu@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

65781c78

drm/amdgpu:don't init entity for KIQ · 75fbed20

Monk Liu authored May 11, 2017

We don't need a scheduler for KIQ.
Signed-off-by: Monk Liu <Monk.Liu@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

75fbed20

drm/amdgpu:only call flr_work under infinite timeout · 0c63e113

Monk Liu authored Apr 26, 2017

Signed-off-by: Monk Liu <Monk.Liu@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

0c63e113

drm/amdgpu:use job* to replace voluntary · 7225f873

Monk Liu authored Apr 26, 2017

that way we can know which job cause hang and
can do per sched reset/recovery instead of all
sched.
Signed-off-by: Monk Liu <Monk.Liu@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

7225f873

drm/amdgpu:don't invoke srio-gpu-reset in gpu-reset (v2) · 4fbf87e2

Monk Liu authored May 05, 2017

because we don't want to do sriov-gpu-reset under certain
cases, so just split those two funtion and don't invoke
sr-iov one from bare-metal one.

V2:
remove debugfs_gpu_reset routine on SRIOV case.
Signed-off-by: Monk Liu <Monk.Liu@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

4fbf87e2

drm/amdgpu: id reset count only is updated when used end v2 · bea39672

Chunming Zhou authored May 10, 2017

before that, we have function to check if reset happens by using reset count.
v2: always update reset count after vm flush
Signed-off-by: Chunming Zhou <David1.Zhou@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

bea39672

drm/amdgpu: make pipeline sync be in same place v2 · b9bf33d5

Chunming Zhou authored May 11, 2017

v2: directly return for 'if' case.
Signed-off-by: Chunming Zhou <David1.Zhou@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

b9bf33d5

drm/amdgpu: add sched sync for amdgpu job v2 · df83d1eb

Chunming Zhou authored May 09, 2017

this is an improvement for previous patch, the sched_sync is to store fence
that could be skipped as scheduled, when job is executed, we didn't need
pipeline_sync if all fences in sched_sync are signalled, otherwise insert
pipeline_sync still.

v2: handle error when adding fence to sync failed.
Signed-off-by: Chunming Zhou <David1.Zhou@amd.com>
Reviewed-by: Junwei Zhang <Jerry.Zhang@amd.com> (v1)
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

df83d1eb

drm/amdgpu: remove unsed amdgpu_gem_handle_lockup (v2) · a022c54e

Christian König authored May 08, 2017

This kind of reset handling was removed a long time ago.

v2: fix warning (Alex)
Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

a022c54e

drm/amdgpu: print when gpu reset successed · 6643be65

Chunming Zhou authored May 05, 2017

Signed-off-by: Chunming Zhou <David1.Zhou@amd.com>
Reviewed-by: Roger.He <Hongbo.He@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

6643be65

drm/amdgpu: fix ring0 failed on pro card · fcf0649f

Chunming Zhou authored May 05, 2017

the root cause is vram content is lost completely after pci reset.
Signed-off-by: Chunming Zhou <David1.Zhou@amd.com>
Reviewed-by: Roger.He <Hongbo.He@amd.com>
Acked-by: Christian König <christian.koenig@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

fcf0649f

drm/amdgpu: extend lock range for race condition when gpu reset · 738f64cc

Roger.He authored May 05, 2017

to cover below case:
1. A task gart bind/unbind but not add to adev->gtt_list yet
2. at this time gpu reset, gtt only recover those gtt in adev->gtt_list
Reviewed-by: Chunming Zhou <david1.zhou@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Roger.He <Hongbo.He@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

738f64cc

drm/amdgpu: Fix comments in source code · 455a7bc2

Alex Xie authored May 08, 2017

Signed-off-by: Alex Xie <AlexBin.Xie@amd.com>
Reviewed-by: Michel Dänzer <michel.daenzer@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

455a7bc2

drm/amdgpu: fix errors in comments. · ea81a173

Alex Xie authored May 08, 2017

Signed-off-by: Alex Xie <AlexBin.Xie@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

ea81a173

drm/amdgpu/gfx9: move define to header file · 67fb56a6

Alex Deucher authored May 04, 2017

rather than defining it locally.
Reviewed-by: Junwei Zhang <Jerry.Zhang@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

67fb56a6

drm/amd/amdgpu: get rid of else branch · 5b9c58f9

Nikola Pajkovsky authored May 04, 2017

else branch is pointless if it's right at the end of function and use
unlikely() on err path.
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Nikola Pajkovsky <npajkovsky@suse.cz>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

5b9c58f9

drm/amdgpu:cleanup flag not used · 503bb31b

Monk Liu authored May 03, 2017

Signed-off-by: Monk Liu <Monk.Liu@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

503bb31b

drm/amdgpu:use FRAME_CNTL for new GFX ucode (v2) · 3b4d68e9