Commits · 98ae7f98d44b61479599ec49257ccb8abd92d092 · nexedi / linux

21 Mar, 2019 2 commits

drm/amdgpu: Wait for newly allocated PTs to be idle · 98ae7f98

Felix Kuehling authored Mar 13, 2019

When page table are updated by the CPU, synchronize with the
allocation and initialization of newly allocated page tables.
Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

98ae7f98

drm/amdgpu: more descriptive message if HMM not enabled · 194f87dd

Philip Yang authored Mar 04, 2019

If using old kernel config file, CONFIG_ZONE_DEVICE is not selected,
so CONFIG_HMM and CONFIG_HMM_MIRROR is not enabled, the current driver
error message "Failed to register MMU notifier" is not clear. Inform
user with more descriptive message on how to fix the missing kernel
config option.

Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=109808Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Michel Dänzer <michel.daenzer@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

194f87dd

19 Mar, 2019 38 commits

drm/amdgpu: support userptr cross VMAs case with HMM · 5aeaccca

Philip Yang authored Mar 04, 2019

userptr may cross two VMAs if the forked child process (not call exec
after fork) malloc buffer, then free it, and then malloc larger size
buf, kerenl will create new VMA adjacent to old VMA which was cloned
from parent process, some pages of userptr are in the first VMA, the
rest pages are in the second VMA.

HMM expects range only have one VMA, loop over all VMAs in the address
range, create multiple ranges to handle this case. See
is_mergeable_anon_vma in mm/mmap.c for details.
Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

5aeaccca

drm/amdkfd: support concurrent userptr update for HMM · 386a68e7

Philip Yang authored Mar 04, 2019

Userptr restore may have concurrent userptr invalidation after
hmm_vma_fault adds the range to the hmm->ranges list, needs call
hmm_vma_range_done to remove the range from hmm->ranges list first,
then reschedule the restore worker. Otherwise hmm_vma_fault will add
same range to the list, this will cause loop in the list because
range->next point to range itself.

Add function untrack_invalid_user_pages to reduce code duplication.
Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

386a68e7

drm/amdgpu: stop evicting busy PDs/PTs · 1bd4e4ca

Christian König authored Nov 07, 2018

Otherwise we won't be able to cleanly handle page faults.
Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Chunming Zhou <david1.zhou@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

1bd4e4ca

drm/amdgpu: wait for VM to become idle during flush · 56753e73

Christian König authored Jan 10, 2019

Make sure that not only the entities are flush, but that
we also wait for the HW to finish all processing.
Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Chunming Zhou <david1.zhou@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

56753e73

drm/amdgpu: remove non-sense NULL ptr check · 3119e7f4

Christian König authored Jan 10, 2019

It's a bug having a dead pointer in the IDR, silently returning
is the worst we can do.
Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Chunming Zhou <david1.zhou@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

3119e7f4

drm/amdgpu: remove chash · 04ed8459

Christian König authored Nov 07, 2018

Remove the chash implementation for now since it isn't used any more.
Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

04ed8459

drm/amdgpu: use ring/hash for fault handling on GMC9 v3 · c1a8abd9

Christian König authored Nov 07, 2018

Further testing showed that the idea with the chash doesn't work as expected.
Especially we can't predict when we can remove the entries from the hash again.

So replace the chash with a ring buffer/hash mix where entries in the container
age automatically based on their timestamp.

v2: use ring buffer / hash mix
v3: check the timeout to make sure all entries age
Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> (v2)
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

c1a8abd9

drm/amdgpu: limit the number of IVs processed at once · 8c65fe5f

Christian König authored Mar 05, 2019

Only process a maximum of 32 IVs before writing back the RPTR. This improves
hw handling when we get close to an overflow in the ring buffer.
Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Michel Dänzer <michel.daenzer@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

8c65fe5f

drm/amdgpu: enable IH ring 1&2 for Vega20 as well · b51cd19e

Christian König authored Mar 04, 2019

That doesn't seem to have any negative effects.
Signed-off-by: Christian König <christian.koenig@amd.com>
Acked-by: Chunming Zhou <david1.zhou@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

b51cd19e

drm/amdgpu: enable IH doorbell for ring 1&2 on Vega · 1ae64cec

Christian König authored Feb 27, 2019

The doorbells should already be reserved, just enable them.
Signed-off-by: Christian König <christian.koenig@amd.com>
Acked-by: Chunming Zhou <david1.zhou@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

1ae64cec

drm/amdgpu: change Vega IH ring 1 config · 0133690e

Christian König authored Feb 27, 2019

Disable overflow and enable full drain. This makes fault handling on ring 1
much more reliable since we don't generate back pressure any more.
Signed-off-by: Christian König <christian.koenig@amd.com>
Acked-by: Chunming Zhou <david1.zhou@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

0133690e

drm/amdgpu: Only clear dumb buffers if ring is enabled · 46846ba2

Nicholas Kazlauskas authored Mar 11, 2019

The buffers should be cleared when possible but we also don't want
buffer creation to fail in the rare case where the ring isn't ready
during the call. This could happen during some suspend/resume sequences.

Cc: Christian König <ckoenig.leichtzumerken@gmail.com>
Signed-off-by: Nicholas Kazlauskas <nicholas.kazlauskas@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

46846ba2

drm/amdgpu: Clear VRAM for DRM dumb_create buffers · 95b13468

Nicholas Kazlauskas authored Mar 08, 2019

The dumb_create API isn't intended for high performance rendering
and it's more useful for userspace (ie. IGT) to have them precleared.

The bonus here is that we also won't needlessly leak whatever was
previously in VRAM, but it also probably wasn't sensitive if it was
going through this API.
Signed-off-by: Nicholas Kazlauskas <nicholas.kazlauskas@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

95b13468

drm/amdgpu: fix semicolon.cocci warnings · 289d513b

kbuild test robot authored Mar 06, 2019

drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c:405:2-3: Unneeded semicolon
drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c:435:2-3: Unneeded semicolon

 Remove unneeded semicolon.

Generated by: scripts/coccinelle/misc/semicolon.cocci

CC: xinhui pan <xinhui.pan@amd.com>
Signed-off-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

289d513b

drm/amdgpu: add new ras workflow control flags · 108c6a63

xinhui pan authored Mar 11, 2019

add ras post init function.
Do some initialization after all IP have finished their late init.

Add new member flags which will control the ras work flow.
For now, vbios enable ras for us on boot. That might change in the
future.
So there should be a flag from vbios to tell us if ras is enabled or not
on boot. Looks like there is no such info now.

Other bits of the flags are reserved to control other parts of ras.
Signed-off-by: xinhui pan <xinhui.pan@amd.com>
Reviewed-by: Evan Quan <evan.quan@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

108c6a63

drm/amdgpu: let ras initialization a little noticeable · 5d0f903f

xinhui pan authored Mar 12, 2019

add drm info output if ras initialized successfully.
add ras atomfirmware sanity check.
Signed-off-by: xinhui pan <xinhui.pan@amd.com>
Reviewed-by: Evan Quan <evan.quan@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

5d0f903f

drm/amdgpu: Fix lockdep warning more gracely · 163def43

xinhui pan authored Mar 11, 2019

lockdep need a static key.
Previously we set ignore bit to avoid the warning.
Now call sysfs_attr_init to initialize the static key.
Signed-off-by: xinhui pan <xinhui.pan@amd.com>
Reviewed-and-Tested-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

163def43

drm/amdgpu: Fix ras debugfs data parse · b076296b

xinhui pan authored Mar 11, 2019

Unzero char is accepted by sscanf, so when data is structure but
unexpectedly return error invalid;
Signed-off-by: xinhui pan <xinhui.pan@amd.com>
Reviewed-by: Feifei Xu <Feifei.Xu@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

b076296b

drm/amdgpu: add new member hw_supported · 5caf466a

xinhui pan authored Mar 11, 2019

Currently, it is not clear how ras is supported. Both software and
hardware can set the supported. That is confusing.

Fix it by adding new member hw_supported.
Signed-off-by: xinhui pan <xinhui.pan@amd.com>
Reviewed-by: Evan Quan <evan.quan@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

5caf466a

drm/amdgpu: Fix warning when lockdep is enabled · 2b9505e3

xinhui pan authored Mar 11, 2019

Set ignore bit to satisfy locpdep.
Signed-off-by: xinhui pan <xinhui.pan@amd.com>
Reviewed-by: Evan Quan <evan.quan@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

2b9505e3

drm/amdgpu: Fix NULL pointer when ta is missing · 54eb4ed6

xinhui pan authored Mar 11, 2019

Ta is optional, so check if ta firmware is loaded or not.
Signed-off-by: xinhui pan <xinhui.pan@amd.com>
Reviewed-by: Evan Quan <evan.quan@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

54eb4ed6

drm/amdgpu: fix ras parameter descriptions · 2f3940e9

Evan Quan authored Mar 07, 2019

The descriptions of modinfo wrongly show two parameters
for each feature(see below). This patch can fix this
incorrect outputs.

parm:           amdgpu_ras_enable:Enable RAS features on the GPU (0 = disable, 1 = enable, -1 = auto (default))
parm:           ras_enable:int
parm:           amdgpu_ras_mask:Mask of RAS features to enable (default 0xffffffff), only valid when ras_enable == 1
parm:           ras_mask:uint
Signed-off-by: Evan Quan <evan.quan@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: xinhui pan <xinhui.pan@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

2f3940e9

drm/amdgpu: export both supported and enabled ras features · 1febb00e

xinhui pan authored Mar 07, 2019

Signed-off-by: xinhui pan <xinhui.pan@amd.com>
Reviewed-by: Evan Quan <evan.quan@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

1febb00e

drm/amdgpu: lookup vbios table to check ecc capability · b404ae82

xinhui pan authored Mar 07, 2019

Signed-off-by: xinhui pan <xinhui.pan@amd.com>
Reviewed-by: Evan Quan <evan.quan@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

b404ae82

drm/amdgpu: query sram ecc/ecc availability from atombios · f49ea9f8

Hawking Zhang authored Mar 07, 2019

query sram ecc capability via amdgpu_atomfirmware_ecc_default_enabled
query ecc availability via amdgpu_atomfirmware_sram_ecc_supported
Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

f49ea9f8

drm/amdgpu: add atomfirmware helper function to query sram ecc caps · 8b6da23f

Hawking Zhang authored Mar 07, 2019

sram ecc capability could be get from firmware_capability field in firmwareinfo table
Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

8b6da23f

drm/amdgpu: add atomfirmware helper function to query ecc status · 511c4348

Hawking Zhang authored Mar 07, 2019

ecc default status (enabled or disabled) could be get from umc_config field in umc_info table
Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

511c4348

drm/amdgpu: update atomfirmware header with ecc related members · ed606ca3

Hawking Zhang authored Mar 07, 2019

add new umc_info structures and new firmware_capability defines
Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

ed606ca3

drm/amdgpu: handle ras resume · acbbee01

xinhui pan authored Mar 07, 2019

Suspend will put irq, so resume need get irq back.
And in the same time, skip other ras initialization.
Signed-off-by: xinhui pan <xinhui.pan@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

acbbee01

drm/amdkfd: add RAS ECC event support (v3) · 9b54d201

Eric Huang authored Jan 11, 2019

RAS ECC event will combine with GPU reset event, due to
ECC interrupts are caused by uncorrectable error that triggers
GPU reset.

v2: Fix misleading-indentation warning
v3: fix build with CONFIG_HSA_AMD disabled
Signed-off-by: Eric Huang <JinhuiEric.Huang@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

9b54d201

drm/amdkfd: add RAS capabilities in topology for Vega20 (v2) · 0dee45a2

Eric Huang authored Jan 11, 2019

It is to collaborate with HSA_CAPABILITY in libhsakmt.

v2: squash in NULL pointer check
Signed-off-by: Eric Huang <JinhuiEric.Huang@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

0dee45a2

drm/amdgpu: add human readable debugfs control support (v2) · 96ebb307

xinhui pan authored Mar 01, 2019

Currently, the debugfs control node can't parse bash-like commands.
Now add such support for any tester that uses scripts.

v2: squash in fixes for input validation
Signed-off-by: xinhui pan <xinhui.pan@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

96ebb307

drm/amdgpu: skip gpu reset when ras error occured · 138352e5

xinhui pan authored Mar 01, 2019

gpu reset is not stable on vega20 A1.
Signed-off-by: xinhui pan <xinhui.pan@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

138352e5

drm/amdgpu: add ioctl query for enabled ras features (v2) · 5cb77114

xinhui pan authored Dec 17, 2018

Add a query for userspace to check which RAS features
are enabled.

v2: squash in warning fix
Signed-off-by: xinhui pan <xinhui.pan@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

5cb77114

drm/amdgpu: Add a new flag to AMDGPU_CTX_OP_QUERY_STATE2 · ae363a21

xinhui pan authored Dec 17, 2018

Add AMDGPU_CTX_QUERY2_FLAGS_RAS_CE/UE which indicate if any error happened
between previous query and this query.
Signed-off-by: xinhui pan <xinhui.pan@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

ae363a21

drm/amdgpu: enable ras on gmc9 · 791c4769

xinhui pan authored Jan 23, 2019

Signed-off-by: xinhui pan <xinhui.pan@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

791c4769

drm/amdgpu: enable ras on gfx9 (v2) · 760a1d55

Feifei Xu authored Dec 07, 2018

Register ecc interrupts and ecc interrupt handler on gfx9.
Add ras support on gfx9

v2: squash in warning fix
Signed-off-by: Feifei Xu <Feifei.Xu@amd.com>
Signed-off-by: xinhui pan <xinhui.pan@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

760a1d55

drm/amdgpu: enable ras on sdma4 · 8cf12507

xinhui pan authored Nov 28, 2018

register IH, enable ras features on sdma.
create sysfs debugfs file for sdma.
Signed-off-by: xinhui pan <xinhui.pan@amd.com>
Signed-off-by: Feifei Xu <Feifei.Xu@amd.com>
Signed-off-by: Eric Huang <JinhuiEric.Huang@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

8cf12507