- 13 Sep, 2019 6 commits
-
-
Tao Zhou authored
Support loading and saving EEPROM records for ras, and move EEPROM record storing into bad page reserving. v2: remove redundant check for con->eh_data Signed-off-by:
Tao Zhou <tao.zhou1@amd.com> Signed-off-by:
Andrey Grodzovsky <andrey.grodzovsky@amd.com> Reviewed-by:
Guchun Chen <guchun.chen@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
Tao Zhou authored
change bps type from retired page to eeprom table record, prepare for saving umc error records to eeprom Signed-off-by:
Tao Zhou <tao.zhou1@amd.com> Reviewed-by:
Guchun Chen <guchun.chen@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
Andrey Grodzovsky authored
In case of a RAS error, allow the user to configure automatic system reboot through ras_ctrl. This is also part of the temporary workaround for the RAS hang problem. v4: Use latest kernel API for disk sync. Signed-off-by:
Andrey Grodzovsky <andrey.grodzovsky@amd.com> Reviewed-by:
Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
Andrey Grodzovsky authored
Problem: Under certain conditions, when some IP blocks take a RAS error, we can get into a situation where a GPU reset is not possible due to issues in RAS in SMU/PSP. Temporary fix until a proper solution in PSP/SMU is ready: When an uncorrectable error happens, the DF will unconditionally broadcast error event packets to all its clients/slaves upon receiving the fatal error event and freeze all its outbound queues, and the err_event_athub interrupt will be triggered. In such a case we use this interrupt to issue a GPU reset. The GPU reset code is modified for this case to avoid a HW reset: it only stops the schedulers, detaches the fences of all in-progress and not-yet-scheduled jobs, sets an error code on them and signals them. It also rejects any new incoming job submissions from user space. All this is done to notify the applications of the problem. v2: Extract amdgpu_amdkfd_pre/post_reset from amdgpu_device_lock/unlock_adev. Move amdgpu_job_stop_all_jobs_on_sched to amdgpu_job.c. Remove print param from amdgpu_ras_query_error_count. v3: Update based on the previous bug fixing patch to properly call amdgpu_amdkfd_pre_reset for other XGMI hive members. Signed-off-by:
Andrey Grodzovsky <andrey.grodzovsky@amd.com> Acked-by:
Felix Kuehling <Felix.Kuehling@amd.com> Reviewed-by:
Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
Hawking Zhang authored
In late_init for ras, the helper function will be used to 1) disable the ras feature if the IP block is masked as disabled, 2) send the enable feature command if the IP block is masked as enabled, 3) create a debugfs/sysfs node per IP block, 4) register the interrupt handler. v2: check ih_info.cb to decide whether to add the interrupt handler. v3: add ras_late_fini to clean up all the ras fs nodes and remove the interrupt handler. Signed-off-by:
Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by:
Alex Deucher <alexander.deucher@amd.com> Reviewed-by:
Tao Zhou <tao.zhou1@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
Hawking Zhang authored
Ras controller interrupt and Ras err event athub interrupt are two dedicated interrupts for RAS support. Signed-off-by:
Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by:
Alex Deucher <alexander.deucher@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
- 23 Aug, 2019 1 commit
-
-
Guchun Chen authored
Use unsigned long type for the same ras count variable. This will avoid overflow on 64-bit systems. Signed-off-by:
Guchun Chen <guchun.chen@amd.com> Reviewed-by:
Tao Zhou <tao.zhou1@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
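As a rough illustration of the kind of overflow this commit guards against, the sketch below stores the same count in a 32-bit and in a 64-bit unsigned integer. The 32-bit width of the narrower variable, the helper name `store_count`, and the sample value are assumptions made for the example; the commit message does not state the original type.

```python
import ctypes


def store_count(value, ctype):
    """Store value in a fixed-width C integer type and read it back.

    ctypes integer types do no overflow checking, so a value that does
    not fit is silently truncated, much as it would be in C.
    """
    return ctype(value).value


# Hypothetical error count that needs more than 32 bits.
count = 2**32 + 5

narrow = store_count(count, ctypes.c_uint32)  # wraps modulo 2**32
wide = store_count(count, ctypes.c_uint64)    # value is preserved
```

Keeping every variable that carries the same count at the same width avoids this silent truncation when the value is passed between them.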
-
- 12 Aug, 2019 3 commits
-
-
Tao Zhou authored
feature mask info is enough for rocm tool, "cat /sys/class/drm/card0/device/ras/features" will get the info like this: feature mask: 0x3ffb Signed-off-by:
Tao Zhou <tao.zhou1@amd.com> Reviewed-by:
Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
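A minimal sketch of how a tool might read the line shown above and list which bit positions are set. The function names are hypothetical, and the mapping from bit position to IP block is defined by the driver and is not reproduced here.

```python
def parse_feature_mask(text):
    """Parse a line such as 'feature mask: 0x3ffb' into an integer."""
    label, sep, value = text.partition(":")
    if not sep or label.strip() != "feature mask":
        raise ValueError(f"unexpected features line: {text!r}")
    return int(value.strip(), 16)


def set_bits(mask):
    """Return the positions of the bits that are set in mask, lowest first."""
    return [bit for bit in range(mask.bit_length()) if (mask >> bit) & 1]


# Sample taken from the commit message; a tool would read this text
# from /sys/class/drm/card0/device/ras/features instead.
mask = parse_feature_mask("feature mask: 0x3ffb\n")
enabled = set_bits(mask)
```

For the sample value, every bit from 0 to 13 is set except bit 2.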
-
Tao Zhou authored
call mmhub ras query/inject in amdgpu ras Signed-off-by:
Tao Zhou <tao.zhou1@amd.com> Reviewed-by:
Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
Tao Zhou authored
ras sub block index could be passed from shell command Signed-off-by:
Tao Zhou <tao.zhou1@amd.com> Reviewed-by:
Guchun Chen <guchun.chen@amd.com> Reviewed-by:
Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
- 06 Aug, 2019 1 commit
-
-
Tao Zhou authored
remove confusing ras error type info Signed-off-by:
Tao Zhou <tao.zhou1@amd.com> Reviewed-by:
Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
- 02 Aug, 2019 3 commits
-
-
Tao Zhou authored
ce can also trigger an interrupt, and both ce and ue errors can even be found in one ras query, so distinguishing between ce and ue in the interrupt handler is unnecessary. Signed-off-by:
Tao Zhou <tao.zhou1@amd.com> Suggested-by:
Guchun Chen <guchun.chen@amd.com> Reviewed-by:
Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
Tao Zhou authored
correctable error can also trigger interrupt in some ras blocks Signed-off-by:
Tao Zhou <tao.zhou1@amd.com> Reviewed-by:
Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by:
Alex Deucher <alexander.deucher@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
Tao Zhou authored
umc error address query can get ce/ue error address and clear error status Signed-off-by:
Tao Zhou <tao.zhou1@amd.com> Reviewed-by:
Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
- 31 Jul, 2019 9 commits
-
-
Dennis Li authored
check gfx error count in both the ras query function and the ras interrupt handler. gfx ras is still disabled by default due to a known stability issue found in gpu reset. Signed-off-by:
Dennis Li <Dennis.Li@amd.com> Reviewed-by:
Tao Zhou <tao.zhou1@amd.com> Reviewed-by:
Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
Tao Zhou authored
error injection address is not in gpu address space Signed-off-by:
Tao Zhou <tao.zhou1@amd.com> Reviewed-by:
Dennis Li <dennis.li@amd.com> Reviewed-by:
Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
Tao Zhou authored
only ue and ce errors are supported Signed-off-by:
Tao Zhou <tao.zhou1@amd.com> Reviewed-by:
Dennis Li <dennis.li@amd.com> Reviewed-by:
Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
Tao Zhou authored
add error data as parameter for ras interrupt cb and process it Signed-off-by:
Tao Zhou <tao.zhou1@amd.com> Reviewed-by:
Dennis Li <dennis.li@amd.com> Reviewed-by:
Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
Tao Zhou authored
more than one error address may be recorded in one query Signed-off-by:
Tao Zhou <tao.zhou1@amd.com> Reviewed-by:
Dennis Li <dennis.li@amd.com> Reviewed-by:
Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
Tao Zhou authored
create a new amdgpu_umc structure for more umc settings in future and switch to the new structure Signed-off-by:
Tao Zhou <tao.zhou1@amd.com> Signed-off-by:
Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by:
Dennis Li <dennis.li@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
Tao Zhou authored
v1: increase ras ce/ue error count v2: log the number of correctable and uncorrectable errors Signed-off-by:
Tao Zhou <tao.zhou1@amd.com> Signed-off-by:
Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by:
Dennis Li <dennis.li@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
Hawking Zhang authored
check umc error count in both the ras query function and the ras interrupt handler Signed-off-by:
Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by:
Dennis Li <dennis.li@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
Hawking Zhang authored
These are common structures that can be included by IP specific source files Signed-off-by:
Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by:
Dennis Li <dennis.li@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
- 18 Jul, 2019 6 commits
-
-
Hawking Zhang authored
this function is not needed any more. error injection is the only way to validate ras, but it can't be executed in amdgpu_ras_init, where the gpu is not even initialized Signed-off-by:
Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by:
Feifei Xu <Feifei.Xu@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
Hawking Zhang authored
error injection to other IP blocks (except UMC) will be enabled once the RAS feature stabilizes on those IP blocks Signed-off-by:
Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by:
Feifei Xu <Feifei.Xu@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
Hawking Zhang authored
driver shouldn't init any ras debugfs/sysfs node for ASICs that don't have ras hardware ability Signed-off-by:
Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by:
Feifei Xu <Feifei.Xu@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
Hawking Zhang authored
this function is not needed any more. error injection is the only way to validate ras, but it can't be executed in amdgpu_ras_init, where the gpu is not even initialized Signed-off-by:
Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by:
Feifei Xu <Feifei.Xu@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
Hawking Zhang authored
error injection to other IP blocks (except UMC) will be enabled once the RAS feature stabilizes on those IP blocks Signed-off-by:
Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by:
Feifei Xu <Feifei.Xu@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
Hawking Zhang authored
driver shouldn't init any ras debugfs/sysfs node for ASICs that don't have ras hardware ability Signed-off-by:
Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by:
Feifei Xu <Feifei.Xu@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
- 20 Jun, 2019 1 commit
-
-
xinhui pan authored
As long as the address is mapped with vram, we can do an error injection. Signed-off-by:
xinhui pan <xinhui.pan@amd.com> Acked-by:
Alex Deucher <alexander.deucher@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
- 13 Jun, 2019 1 commit
-
-
Greg Kroah-Hartman authored
When calling debugfs functions, there is no need to ever check the return value. The function can work or not, but the code logic should never do something different based on this. Cc: Alex Deucher <alexander.deucher@amd.com> Cc: "Christian König" <christian.koenig@amd.com> Cc: "David (ChunMing) Zhou" <David1.Zhou@amd.com> Cc: David Airlie <airlied@linux.ie> Cc: Daniel Vetter <daniel@ffwll.ch> Cc: xinhui pan <xinhui.pan@amd.com> Cc: Evan Quan <evan.quan@amd.com> Cc: Feifei Xu <Feifei.Xu@amd.com> Cc: amd-gfx@lists.freedesktop.org Cc: dri-devel@lists.freedesktop.org Signed-off-by:
Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
- 10 Jun, 2019 1 commit
-
-
Sam Ravnborg authored
Delete the unused drmP.h from amdgpu.h. Fix fallout in various files. Signed-off-by:
Sam Ravnborg <sam@ravnborg.org> Reviewed-by:
Alex Deucher <alexander.deucher@amd.com> Cc: "Christian König" <christian.koenig@amd.com> Cc: "David (ChunMing) Zhou" <David1.Zhou@amd.com> Cc: David Airlie <airlied@linux.ie> Cc: Daniel Vetter <daniel@ffwll.ch> Link: https://patchwork.freedesktop.org/patch/msgid/20190609220757.10862-5-sam@ravnborg.org
-
- 31 May, 2019 1 commit
-
-
xinhui pan authored
injection needs a valid gpu address. Signed-off-by:
xinhui pan <xinhui.pan@amd.com> Reviewed-by:
Alex Deucher <alexander.deucher@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
- 24 May, 2019 6 commits
-
-
Tom St Denis authored
Acked-by:
Slava Abramov <slava.abramov@amd.com> Signed-off-by:
Tom St Denis <tom.stdenis@amd.com> Reviewed-by:
Alex Deucher <alexander.deucher@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
Slava Abramov authored
v1: replace casting to unsigned long with div64_ul Acked-by:
Alex Deucher <alexander.deucher@amd.com> Signed-off-by:
Slava Abramov <slava.abramov@amd.com> Tested-by:
Slava Abramov <slava.abramov@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
xinhui pan authored
add ras suspend function. rename ras_post_init to amdgpu_ras_resume. Signed-off-by:
xinhui pan <xinhui.pan@amd.com> Reviewed-by:
Alex Deucher <alexander.deucher@amd.com> Reviewed-by:
James Zhu <James.Zhu@amd.com> Tested-by:
James Zhu <James.Zhu@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
xinhui pan authored
add badpages node. it will output badpages list in format
gpu pfn : gpu page size : flags
example
0x00000000 : 0x00001000 : R
0x00000001 : 0x00001000 : R
0x00000002 : 0x00001000 : R
0x00000003 : 0x00001000 : R
0x00000004 : 0x00001000 : R
0x00000005 : 0x00001000 : R
0x00000006 : 0x00001000 : R
0x00000007 : 0x00001000 : P
0x00000008 : 0x00001000 : P
0x00000009 : 0x00001000 : P
flags can be one of below characters
R: reserved.
P: pending for reserve.
F: failed to reserve for some reasons.
Signed-off-by:
xinhui pan <xinhui.pan@amd.com> Reviewed-by:
Alex Deucher <alexander.deucher@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
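The format described in this commit message (gpu pfn, gpu page size, and a one-character flag, separated by colons) could be parsed roughly as follows. This is only a sketch: `BadPage`, `FLAG_MEANINGS` and `parse_badpages` are hypothetical names, and the location of the badpages node is not given in the message, so the example works on text rather than on a file path.

```python
from dataclasses import dataclass

# Flag characters as listed in the commit message.
FLAG_MEANINGS = {
    "R": "reserved",
    "P": "pending for reserve",
    "F": "failed to reserve",
}


@dataclass
class BadPage:
    pfn: int   # gpu page frame number
    size: int  # gpu page size in bytes
    flag: str  # one of the keys of FLAG_MEANINGS


def parse_badpages(text):
    """Parse lines of the form '0x00000007 : 0x00001000 : P'."""
    pages = []
    for line in text.splitlines():
        fields = [field.strip() for field in line.split(":")]
        if len(fields) != 3 or fields[2] not in FLAG_MEANINGS:
            continue  # skip blank lines and anything not in the format above
        pages.append(BadPage(int(fields[0], 16), int(fields[1], 16), fields[2]))
    return pages


sample = "0x00000006 : 0x00001000 : R\n0x00000007 : 0x00001000 : P\n"
pages = parse_badpages(sample)
pending = [page.pfn for page in pages if page.flag == "P"]
```

Filtering on the flag, as in the last line, is one way a tool could report pages that are still waiting to be reserved.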
-
xinhui pan authored
add another flag to allow an IP to do a gpu reset after device init. Signed-off-by:
xinhui pan <xinhui.pan@amd.com> Reviewed-by:
Alex Deucher <alexander.deucher@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
xinhui pan authored
Check the ras TA error code and return EAGAIN. Issue the ras enable/disable cmd without checking the current state. It looks like the ras TA will handle the current state == target state case. Now the driver might need to do a reset to satisfy the ras TA. Signed-off-by:
xinhui pan <xinhui.pan@amd.com> Reviewed-by:
Alex Deucher <alexander.deucher@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
- 10 Apr, 2019 1 commit
-
-
xinhui pan authored
Many parts of the whole SW stack can program the ras enablement state during boot. Now we handle that case by adding one function which checks the ras flags and chooses a different code path. Reviewed-by:
Evan Quan <evan.quan@amd.com> Signed-off-by:
xinhui pan <xinhui.pan@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-