Commits · 4416377ae1fdc41a90b665943152ccd7ff61d3c5 · Kirill Smelkov / linux

23 Aug, 2024 11 commits

drm/amdgpu: add list empty check to avoid null pointer issue · 4416377a

Yang Wang authored Aug 21, 2024

Add list empty check to avoid null pointer issues in some corner cases.
- list_for_each_entry_safe()
Signed-off-by: Yang Wang <kevinyang.wang@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amd/display: Make dcn401_dsc_funcs static · 2845f512

Jinjie Ruan authored Aug 21, 2024

The sparse tool complains as follows:

drivers/gpu/drm/amd/amdgpu/../display/dc/dsc/dcn401/dcn401_dsc.c:30:24: warning:
	symbol 'dcn401_dsc_funcs' was not declared. Should it be static?

This symbol is not used outside of dcn401_dsc.c, so mark it static.
Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amd/display: Make dcn35_hubp_funcs static · 570867ef

Jinjie Ruan authored Aug 21, 2024

The sparse tool complains as follows:

drivers/gpu/drm/amd/amdgpu/../display/dc/hubp/dcn35/dcn35_hubp.c:191:19: warning:
	symbol 'dcn35_hubp_funcs' was not declared. Should it be static?

This symbol is not used outside of dcn35_hubp.c, so mark it static.
Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amd/display: Make core_dcn4_ip_caps_base static · 0e405395

Jinjie Ruan authored Aug 21, 2024

The sparse tool complains as follows:

drivers/gpu/drm/amd/amdgpu/../display/dc/dml2/dml21/src/dml2_core/dml2_core_dcn4.c:12:28: warning:
	symbol 'core_dcn4_ip_caps_base' was not declared. Should it be static?

This symbol is not used outside of dml2_core_dcn4.c, so mark it static.

It is also not meant to be modified, so mark it const.
Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
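
The sparse fixes in this group share one pattern: a file-scope table that is used in a single translation unit gets internal linkage, and becomes const if nothing writes to it. A minimal sketch of that pattern follows. The `ip_caps` struct, its fields, and its values are hypothetical stand-ins, not the real contents of `core_dcn4_ip_caps_base`.

```c
#include <assert.h>

/* Hypothetical capability table layout; illustrative only. */
struct ip_caps {
	unsigned int pipe_count;
	unsigned int otg_count;
};

/*
 * 'static' gives the symbol internal linkage, which is what sparse asks
 * for when no header declares it. 'const' records that the table is
 * read-only after initialization.
 */
static const struct ip_caps ip_caps_base = {
	.pipe_count = 4,
	.otg_count = 4,
};

/* Only code in this file can name the table; others go through this. */
unsigned int ip_caps_pipe_count(void)
{
	/* Sanity check on the illustrative values above. */
	assert(ip_caps_base.pipe_count <= ip_caps_base.otg_count);
	return ip_caps_base.pipe_count;
}
```

Any attempt to assign to `ip_caps_base.pipe_count` would now be rejected at compile time, which is the point of adding const alongside static.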

drm/amd/display: Make core_dcn4_g6_temp_read_blackout_table static · 988bfa0b

Jinjie Ruan authored Aug 21, 2024

The sparse tool complains as follows:

drivers/gpu/drm/amd/amdgpu/../display/dc/dml2/dml21/src/dml2_core/dml2_core_dcn4_calcs.c:6853:56: warning:
symbol 'core_dcn4_g6_temp_read_blackout_table' was not declared. Should it be static?

This symbol is not used outside of dml2_core_dcn4_calcs.c, so mark it static.

It is also not meant to be modified, so mark it const.
Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amdgpu/gfx12: set UNORD_DISPATCH in compute MQDs · 40318a24

Alex Deucher authored Aug 20, 2024

This needs to be set to 1 to avoid a potential deadlock on
GC 10.x and newer. On GC 9.x and older, it needs to be set
to 0. A wrong setting can lead to hangs in some mixed
graphics and compute workloads.

Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3575
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
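
The version rule in this message can be sketched as a small helper. This is an illustration under assumptions: `UNORD_DISPATCH_BIT` is a placeholder bit position and `set_unord_dispatch()` is a hypothetical name; the real field is defined by the hardware register headers.

```c
#include <stdint.h>

/* Placeholder bit position, not the real register layout. */
#define UNORD_DISPATCH_BIT (1u << 0)

/*
 * Return the queue control word with UNORD_DISPATCH set according to
 * the rule in the commit message: 1 on GC 10.x and newer, 0 on GC 9.x
 * and older.
 */
uint32_t set_unord_dispatch(uint32_t pq_control, unsigned int gc_major)
{
	if (gc_major >= 10)
		return pq_control | UNORD_DISPATCH_BIT;
	return pq_control & ~UNORD_DISPATCH_BIT;
}
```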

drm/amdgpu: Retire query_utcl2_poison_status callback · b05d6476

Hawking Zhang authored Aug 19, 2024

The driver now uses the interrupt source id to identify
utcl2 poison events, so the polling interface is no longer needed.
Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amdgpu: Take IOMMU remapping into account for p2p checks · 75f0efbc

Rahul Jain authored Aug 13, 2024

When trying to enable p2p, amdgpu_device_is_peer_accessible()
finds that address_mask overlaps aper_base and therefore
returns 0, which leaves p2p disabled on this platform.

The IOMMU should remap the BAR addresses so the device can access
them. Hence, check whether DMA is being remapped for peer_adev.

v5: (Felix, Alex)
- fixing comment as per Alex feedback
- refactor code as per Felix

v4: (Alex)
- fix the comment and description

v3:
- remove iommu_remap variable

v2: (Alex)
- Fix as per review comments
- add new function amdgpu_device_check_iommu_remap to check if iommu
  remap
Signed-off-by: Rahul Jain <Rahul.Jain@amd.com>
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
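
The addressability decision described above can be modeled in plain C. This is a simplified sketch, not the driver code: `struct peer_model`, `struct bar_model`, and `peer_can_address_bar()` are hypothetical names, and the real check also involves conditions this message does not discuss.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative view of the peer device that wants to reach the BAR. */
struct peer_model {
	uint64_t dma_mask;      /* highest address bits the peer can generate */
	bool iommu_remaps_dma;  /* true if an IOMMU translates the peer's DMA */
};

/* Illustrative view of the exporting device's aperture. */
struct bar_model {
	uint64_t aper_base;
	uint64_t aper_size;
};

bool peer_can_address_bar(const struct bar_model *bar,
			  const struct peer_model *peer)
{
	uint64_t address_mask = ~peer->dma_mask;
	uint64_t aper_limit = bar->aper_base + bar->aper_size - 1;

	/* With remapping, the IOMMU can present the BAR at a reachable address. */
	if (peer->iommu_remaps_dma)
		return true;

	/* Without it, the whole aperture must lie inside the peer's DMA range. */
	return !((bar->aper_base & address_mask) || (aper_limit & address_mask));
}
```

With a 44-bit DMA mask and an aperture placed above bit 44, the mask test alone fails; the remap check is what lets p2p stay enabled in that case.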

drm/amd/pm: update message interface for smu v14.0.2/3 · 01bfabc2

Kenneth Feng authored Aug 20, 2024

Update the message interface for SMU v14.0.2/3.
Signed-off-by: Kenneth Feng <kenneth.feng@amd.com>
Reviewed-by: Yang Wang <kevinyang.wang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amdkfd: Drop poison handling from gfx v10 · e28604d8

Hawking Zhang authored Aug 19, 2024

Not supported.
Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amdkfd: Check int source id for utcl2 poison event · db6341a9

Hawking Zhang authored Aug 20, 2024

Traditional utcl2 fault_status polling does not
work in an SR-IOV environment. Polling of the fault
status register from the guest side is dropped
by hardware.

The driver should instead check the utcl2 interrupt
source id to identify a utcl2 poison event. It is
set to 1 when poisoned data interrupts are
signaled.

v2: drop the unused local variable (Tao)
Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

21 Aug, 2024 29 commits

drm/amd/gfx11: move the gfx mutex into the caller · 88c511de

Alex Deucher authored Aug 20, 2024

Otherwise we can fail to drop the software mutex when
we fail to take the hardware mutex.

Fixes: 76acba7b ("drm/amdgpu/gfx11: add a mutex for the gfx semaphore")
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Reviewed-by: Dan Carpenter <dan.carpenter@linaro.org>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
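
The locking fix can be illustrated with a single-threaded model in which the caller owns the software mutex, so every exit path releases it, including a failed hardware acquire. All names here (`sw_mutex_lock()`, `hw_semaphore_acquire()`, `do_guarded_work()`) are hypothetical, and the "mutex" is only a held/not-held flag rather than a real lock.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in for the software mutex: tracks only whether it is held. */
static bool sw_mutex_held;

static void sw_mutex_lock(void)
{
	assert(!sw_mutex_held);
	sw_mutex_held = true;
}

static void sw_mutex_unlock(void)
{
	assert(sw_mutex_held);
	sw_mutex_held = false;
}

bool sw_mutex_is_held(void)
{
	return sw_mutex_held;
}

/* Hypothetical hardware-semaphore acquire that can fail, e.g. on timeout. */
static bool hw_semaphore_acquire(bool hw_ready)
{
	return hw_ready;
}

/* The caller takes and drops the software mutex around the whole operation. */
int do_guarded_work(bool hw_ready)
{
	int ret = 0;

	sw_mutex_lock();
	if (!hw_semaphore_acquire(hw_ready)) {
		ret = -1;
		goto out;
	}
	/* ... work under both locks, then release the hardware semaphore ... */
out:
	sw_mutex_unlock();
	return ret;
}
```

If the lock were taken inside the acquire helper instead, an early return on failure would leave it held, which is the leak the commit describes.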

drm/amd/pm: ensure the fw_info is not null before using it · 186fb12e

Tim Huang authored Aug 07, 2024

This resolves the dereference null return value warning
reported by Coverity.
Signed-off-by: Tim Huang <tim.huang@amd.com>
Reviewed-by: Jesse Zhang <jesse.zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amd/amdgpu: allow use kiq to do hdp flush under sriov · bf2bc616

Victor Zhao authored Aug 19, 2024

When the CPU is used for page table updates under SR-IOV runtime, MMIO
access is blocked, so KIQ has to be used to flush HDP.

Change WREG32_NO_KIQ to WREG32 to allow KIQ.
Signed-off-by: Victor Zhao <Victor.Zhao@amd.com>
Reviewed-by: Emily Deng <Emily.Deng@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amdgpu: fix eGPU hotplug regression · c69b07f7

Alex Deucher authored Aug 19, 2024

The driver needs to wait for the on board firmware
to finish its initialization before probing the card.
Commit 95905698 ("drm/amdgpu: Fix discovery initialization failure during pci rescan")
switched from using msleep() to using usleep_range() which
seems to have caused init failures on some navi1x boards. Switch
back to msleep().

Fixes: 95905698 ("drm/amdgpu: Fix discovery initialization failure during pci rescan")
Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3559
Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3500
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: Ma Jun <Jun.Ma2@amd.com>

drm/amd/display: Promote DC to 3.2.297 · e389eefe

Martin Leung authored Aug 12, 2024

- Various DML 2.1 fixes
- Fix module unload
- Fix construct_phy with MXM connector
- Support UHBR10 link rate on eDP
- Revert updated DCCG wrappers
Reviewed-by: Roman Li <roman.li@amd.com>
Signed-off-by: Martin Leung <Martin.Leung@amd.com>
Signed-off-by: Roman Li <roman.li@amd.com>
Tested-by: Daniel Wheeler <daniel.wheeler@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amd/display: DML2.1 Reintegration for Various Fixes · d07722e1

Austin Zheng authored Aug 15, 2024

[Why and How]
DML2.1 reintegration for several fixes and updates to the DML
code.
Reviewed-by: Dillon Varone <dillon.varone@amd.com>
Signed-off-by: Austin Zheng <Austin.Zheng@amd.com>
Signed-off-by: Roman Li <roman.li@amd
Tested-by: Daniel Wheeler <daniel.wheeler@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amd/display: fix double free issue during amdgpu module unload · 20b5a8f9

Tim Huang authored Aug 15, 2024

Flexible endpoints use DIGs from available inflexible endpoints,
so only the encoders of inflexible links need to be freed.
Otherwise, a double free issue may occur when unloading the
amdgpu module.

[  279.190523] RIP: 0010:__slab_free+0x152/0x2f0
[  279.190577] Call Trace:
[  279.190580]  <TASK>
[  279.190582]  ? show_regs+0x69/0x80
[  279.190590]  ? die+0x3b/0x90
[  279.190595]  ? do_trap+0xc8/0xe0
[  279.190601]  ? do_error_trap+0x73/0xa0
[  279.190605]  ? __slab_free+0x152/0x2f0
[  279.190609]  ? exc_invalid_op+0x56/0x70
[  279.190616]  ? __slab_free+0x152/0x2f0
[  279.190642]  ? asm_exc_invalid_op+0x1f/0x30
[  279.190648]  ? dcn10_link_encoder_destroy+0x19/0x30 [amdgpu]
[  279.191096]  ? __slab_free+0x152/0x2f0
[  279.191102]  ? dcn10_link_encoder_destroy+0x19/0x30 [amdgpu]
[  279.191469]  kfree+0x260/0x2b0
[  279.191474]  dcn10_link_encoder_destroy+0x19/0x30 [amdgpu]
[  279.191821]  link_destroy+0xd7/0x130 [amdgpu]
[  279.192248]  dc_destruct+0x90/0x270 [amdgpu]
[  279.192666]  dc_destroy+0x19/0x40 [amdgpu]
[  279.193020]  amdgpu_dm_fini+0x16e/0x200 [amdgpu]
[  279.193432]  dm_hw_fini+0x26/0x40 [amdgpu]
[  279.193795]  amdgpu_device_fini_hw+0x24c/0x400 [amdgpu]
[  279.194108]  amdgpu_driver_unload_kms+0x4f/0x70 [amdgpu]
[  279.194436]  amdgpu_pci_remove+0x40/0x80 [amdgpu]
[  279.194632]  pci_device_remove+0x3a/0xa0
[  279.194638]  device_remove+0x40/0x70
[  279.194642]  device_release_driver_internal+0x1ad/0x210
[  279.194647]  driver_detach+0x4e/0xa0
[  279.194650]  bus_remove_driver+0x6f/0xf0
[  279.194653]  driver_unregister+0x33/0x60
[  279.194657]  pci_unregister_driver+0x44/0x90
[  279.194662]  amdgpu_exit+0x19/0x1f0 [amdgpu]
[  279.194939]  __do_sys_delete_module.isra.0+0x198/0x2f0
[  279.194946]  __x64_sys_delete_module+0x16/0x20
[  279.194950]  do_syscall_64+0x58/0x120
[  279.194954]  entry_SYSCALL_64_after_hwframe+0x6e/0x76
[  279.194980]  </TASK>
Reviewed-by: Rodrigo Siqueira <rodrigo.siqueira@amd.com>
Signed-off-by: Tim Huang <tim.huang@amd.com>
Reviewed-by: Roman Li <roman.li@amd.com>
Signed-off-by: Roman Li <roman.li@amd.com>
Tested-by: Daniel Wheeler <daniel.wheeler@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
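
The ownership rule behind this fix can be sketched as follows: a flexible link only borrows an encoder, so teardown frees the encoder through the inflexible link alone. The struct layout and function name are illustrative assumptions, not the display core's definitions.

```c
#include <stdbool.h>
#include <stdlib.h>

struct encoder {
	int id;
};

struct link_model {
	struct encoder *link_enc;
	bool is_dig_mapping_flexible; /* true: encoder is borrowed, not owned */
};

/* Counts frees so the example can show each encoder is released once. */
int encoders_freed;

void link_model_destroy(struct link_model *link)
{
	/* Only an inflexible link owns, and therefore frees, its encoder. */
	if (link->link_enc && !link->is_dig_mapping_flexible) {
		free(link->link_enc);
		encoders_freed++;
	}
	link->link_enc = NULL;
}
```

Destroying a flexible and an inflexible link that point at the same encoder then results in exactly one free, regardless of destruction order.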

drm/amd/display: DCN35 set min dispclk to 50Mhz · 23444132

Nicholas Susanto authored Aug 15, 2024

[Why]

Hard hangs occur when resuming after display off in extended/duplicate
modes.

[How]

Set the min dispclk to 50 MHz for DCN35.
Reviewed-by: Nicholas Kazlauskas <nicholas.kazlauskas@amd.com>
Signed-off-by: Nicholas Susanto <Nicholas.Susanto@amd.com>
Signed-off-by: Roman Li <roman.li@amd.com>
Tested-by: Daniel Wheeler <daniel.wheeler@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amd/display: Fix construct_phy with MXM connector · ec9e2e7a

Ilya Bakoulin authored Aug 15, 2024

[Why/How]
The call to construct_phy will fail in cases where connector type is
MXM, and the dc_link won't be properly created/initialized.
Reviewed-by: Wenjing Liu <wenjing.liu@amd.com>
Signed-off-by: Ilya Bakoulin <Ilya.Bakoulin@amd.com>
Signed-off-by: Roman Li <roman.li@amd.com>
Tested-by: Daniel Wheeler <daniel.wheeler@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amd/display: Support UHBR10 link rate on eDP · f3271893

Sung Joon Kim authored Aug 15, 2024

[why]
Supporting the UHBR10 link rate on eDP leverages
the existing DP2.0 code but needs some small
adjustments to the code.

[how]
Acknowledge the given DPCD caps for UHBR10
link rate support and allow DP2.0 programming
sequence and link training for eDP.
Reviewed-by: Wenjing Liu <wenjing.liu@amd.com>
Signed-off-by: Sung Joon Kim <Sungjoon.Kim@amd.com>
Signed-off-by: Roman Li <roman.li@amd.com>
Tested-by: Daniel Wheeler <daniel.wheeler@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amd/display: Hardware cursor changes color when switched to software cursor · 272e6aab

Nevenko Stupar authored Aug 15, 2024

[Why & How]
The DCN4 cursor has a separate degamma block and should always
apply cursor degamma for cursor color modes.
Reviewed-by: Chris Park <chris.park@amd.com>
Signed-off-by: Nevenko Stupar <Nevenko.Stupar@amd.com>
Signed-off-by: Roman Li <roman.li@amd.com>
Tested-by: Daniel Wheeler <daniel.wheeler@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amd/display: Allow UHBR Interop With eDP Supported Link Rates Table · 4e9e50b6

Michael Strauss authored Aug 15, 2024

[WHY]
eDP 2.0 is introducing support for UHBR link rates, however current eDP ILR
link optimization does not account for UHBR capabilities.
Either UHBR capabilities will be provided via the same 128b/132b rate DPCD caps
that are currently used on DP2.1, or Table 4-13 may be updated to include UHBR
rates.

[HOW]
Add extra Supported Link Rates table translations for UHBR10/13.5/20.
Update eDP link setting optimization search to be aware of 128b/132b DPCD
rate caps in order to unblock UHBR on panels with Supported Link Rates table.
Reviewed-by: Wenjing Liu <wenjing.liu@amd.com>
Signed-off-by: Michael Strauss <michael.strauss@amd.com>
Signed-off-by: Roman Li <roman.li@amd.com>
Tested-by: Daniel Wheeler <daniel.wheeler@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amd/display: Remove redundant check in DCN35 hwseq · 7c9cb6d1

Nicholas Susanto authored Aug 15, 2024

Removing redundant condition.
Reviewed-by: Hansen Dsouza <Hansen.Dsouza@amd.com>
Signed-off-by: Nicholas Susanto <Nicholas.Susanto@amd.com>
Signed-off-by: Roman Li <roman.li@amd.com>
Tested-by: Daniel Wheeler <daniel.wheeler@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amd/display: remove an extraneous call for checking dchub clock · 8783a184

Aurabindo Pillai authored Aug 15, 2024

When the amdgpu module is removed and reinserted, a call trace is
triggered:

[  334.230602] RIP: 0010:hubbub2_get_dchub_ref_freq+0xbb/0xe0 [amdgpu]
[  334.230807] Code: 25 28 00 00 00 75 3c 48 8d 65 f0 5b 41 5c 5d 31 c0 31 d2 31 c9 31 f6 31 ff 45 31 c0 45 31 c9 45 31 d2 45 31 db e9 55 a1 ca de <0f> 0b eb c6 0f 0b eb c2 d1 eb 8d 83 c0 63 ff ff 3d 20 4e 00 00 76
[  334.230809] RSP: 0018:ffffbc8b823fb540 EFLAGS: 00010246
[  334.230811] RAX: 0000000000001000 RBX: 00000000000186a0 RCX: 0000000000000000
[  334.230812] RDX: ffffbc8b823fb544 RSI: 0000000000000000 RDI: 0000000000000000
[  334.230813] RBP: ffffbc8b823fb560 R08: 0000000000000000 R09: 0000000000000000
[  334.230814] R10: 0000000000000000 R11: 000000000000000f R12: ffff9e644f1f2bb0
[  334.230815] R13: ffff9e6451361300 R14: 0000000000000000 R15: ffff9e6452c00000
[  334.230816] FS:  00007af7c8519000(0000) GS:ffff9e737dd00000(0000) knlGS:0000000000000000
[  334.230817] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  334.230818] CR2: 0000703576b9cbd0 CR3: 00000001095a2000 CR4: 0000000000750ee0
[  334.230819] PKRU: 55555554
[  334.230820] Call Trace:
[  334.230822]  <TASK>
[  334.230824]  ? show_regs+0x6d/0x80
[  334.230828]  ? __warn+0x89/0x160
[  334.230832]  ? hubbub2_get_dchub_ref_freq+0xbb/0xe0 [amdgpu]
[  334.231024]  ? report_bug+0x17e/0x1b0
[  334.231028]  ? handle_bug+0x46/0x90
[  334.231030]  ? exc_invalid_op+0x18/0x80
[  334.231032]  ? asm_exc_invalid_op+0x1b/0x20
[  334.231036]  ? hubbub2_get_dchub_ref_freq+0xbb/0xe0 [amdgpu]
[  334.231217]  dc_create_resource_pool+0xfd/0x320 [amdgpu]
[  334.231408]  dc_create+0x256/0x700 [amdgpu]
[  334.231588]  ? srso_alias_return_thunk+0x5/0x7f
[  334.231590]  ? dmi_matches+0xa0/0x230
[  334.231594]  amdgpu_dm_init+0x28c/0x25f0 [amdgpu]
[  334.231791]  ? prb_read_valid+0x1c/0x30
[  334.231795]  ? __irq_work_queue_local+0x43/0xf0
[  334.231798]  ? srso_alias_return_thunk+0x5/0x7f
[  334.231800]  ? irq_work_queue+0x2f/0x70
[  334.231802]  ? srso_alias_return_thunk+0x5/0x7f
[  334.231803]  ? __wake_up_klogd.part.0+0x40/0x70
[  334.231805]  ? srso_alias_return_thunk+0x5/0x7f
[  334.231807]  ? vprintk_emit+0xd9/0x210
[  334.231809]  ? set_dev_info+0x130/0x1c0
[  334.231812]  ? srso_alias_return_thunk+0x5/0x7f
[  334.231813]  ? dev_printk_emit+0xa1/0xe0
[  334.231819]  dm_hw_init+0x14/0x30 [amdgpu]
[  334.231993]  amdgpu_device_init+0x23c7/0x2fc0 [amdgpu]
[  334.232134]  ? pci_read_config_word+0x25/0x50
[  334.232139]  amdgpu_driver_load_kms+0x1a/0xd0 [amdgpu]
[  334.232284]  amdgpu_pci_probe+0x1f9/0x620 [amdgpu]

On DCN401, the get_dchub_ref_freq() hook is called before the init_hw() hook,
so the assert is expected to trigger. Remove the extraneous call
to get_dchub_ref_freq() to suppress the call trace.
Reviewed-by: Alvin Lee <alvin.lee2@amd.com>
Signed-off-by: Aurabindo Pillai <aurabindo.pillai@amd.com>
Signed-off-by: Roman Li <roman.li@amd.com>
Tested-by: Daniel Wheeler <daniel.wheeler@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amd/display: Update HPO I/O When Handling Link Retrain Automation Request · 9de60462

Michael Strauss authored Aug 15, 2024

[WHY]
Previous multi-display HPO fix moved where HPO I/O enable/disable is performed.
The codepath now taken to enable/disable HPO I/O is not used for compliance
test automation, meaning that if a compliance box being driven at a DP1 rate
requests retrain at UHBR, HPO I/O will remain off if it was previously off.

[HOW]
Explicitly update HPO I/O after allocating encoders for test request.
Reviewed-by: Charlene Liu <charlene.liu@amd.com>
Reviewed-by: Wenjing Liu <wenjing.liu@amd.com>
Signed-off-by: Michael Strauss <michael.strauss@amd.com>
Signed-off-by: Roman Li <roman.li@amd.com>
Tested-by: Daniel Wheeler <daniel.wheeler@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

Revert "drm/amd/display: Update to using new dccg callbacks" · 18ac82c2

Hansen Dsouza authored Aug 15, 2024

[Why]
Revert updated DCCG wrappers due to regression

[How]
This reverts commit 680458d4.
Reviewed-by: Chris Park <chris.park@amd.com>
Signed-off-by: Hansen Dsouza <Hansen.Dsouza@amd.com>
Signed-off-by: Roman Li <roman.li@amd.com>
Tested-by: Daniel Wheeler <daniel.wheeler@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amdgpu: Validate TA binary size · c0a04e35

Candice Li authored Aug 15, 2024

Add TA binary size validation to avoid OOB write.
Signed-off-by: Candice Li <candice.li@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
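
The kind of check this commit adds can be sketched as a bounds test in front of the copy. The buffer size `TA_BUF_SIZE` and the function `ta_copy_binary()` are hypothetical; the sketch only shows the general shape of size validation before a write.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TA_BUF_SIZE 64 /* hypothetical destination buffer size */

int ta_copy_binary(uint8_t dst[TA_BUF_SIZE], const uint8_t *bin,
		   size_t bin_size)
{
	/* Reject missing, empty, or oversized input before touching dst. */
	if (!bin || bin_size == 0 || bin_size > TA_BUF_SIZE)
		return -EINVAL;

	memcpy(dst, bin, bin_size);
	return 0;
}
```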

drm/amdkfd: Update BadOpcode Interrupt handling with MES · eb067d65

Mukul Joshi authored Aug 12, 2024

Based on the recommendation of MEC FW, update BadOpcode interrupt
handling when using the MES scheduler: unmap all queues, remove the queue
that got the interrupt from scheduling, and remap the rest of the queues.
This prevents the case where unmapping the bad queue fails and thereby
causes a GPU reset.
Signed-off-by: Mukul Joshi <mukul.joshi@amd.com>
Acked-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amdkfd: Update queue unmap after VM fault with MES · 9a16042f

Mukul Joshi authored Jun 03, 2024

MEC FW expects MES to unmap all queues when a VM fault is observed
on a queue and to resume them once the affected process is terminated.
Use the MES Suspend and Resume APIs to achieve this.
Signed-off-by: Mukul Joshi <mukul.joshi@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amdgpu: Implement MES Suspend and Resume APIs for GFX11 · ccf8ef6b

Mukul Joshi authored Jun 03, 2024

Add implementation for MES Suspend and Resume APIs to unmap/map
all queues for GFX11. Support for GFX12 will be added when the
corresponding firmware support is in place.
Signed-off-by: Mukul Joshi <mukul.joshi@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amdkfd: Enable processes isolation on gfx9 · 87758a0e

Amber Lin authored Apr 29, 2024

When amdgpu enables enforce_isolation, KFD enables single-process mode in
HWS and sets the exec_cleaner_shader bit in MAP_PROCESS.
Signed-off-by: Amber Lin <Amber.Lin@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amdgpu/gfx_v9_4_3: Apply Isolation Enforcement to GFX & Compute rings · f846250b

Srinivasan Shanmugam authored May 14, 2024

This commit applies isolation enforcement to the GFX and Compute rings
in the gfx_v9_4_3 module.

The commit sets `amdgpu_gfx_enforce_isolation_ring_begin_use` and
`amdgpu_gfx_enforce_isolation_ring_end_use` as the functions to be
called when a ring begins and ends its use, respectively.

`amdgpu_gfx_enforce_isolation_ring_begin_use` is called when a ring
begins its use. This function cancels any scheduled
`enforce_isolation_work` and, if necessary, signals the Kernel Fusion
Driver (KFD) to stop the runqueue.

`amdgpu_gfx_enforce_isolation_ring_end_use` is called when a ring ends
its use. This function schedules `enforce_isolation_work` to be run
after a delay.

These functions are part of the Enforce Isolation Handler, which
enforces shader isolation on AMD GPUs to prevent data leakage between
different processes.

The commit also includes a check for the type of the ring. If the type
of the ring is `AMDGPU_RING_TYPE_COMPUTE`, the `xcp_id` of the
`enforce_isolation` structure in the `gfx` structure of the
`amdgpu_device` is set to the `xcp_id` of the ring. This ensures that
the correct `xcp_id` is used when enforcing isolation on compute rings.
The `xcp_id` is an identifier for an XCP partition, and different rings
can be associated with different XCP partitions.

Cc: Christian König <christian.koenig@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com>

drm/amdgpu/gfx9: Apply Isolation Enforcement to GFX & Compute rings · b710dbe5

Srinivasan Shanmugam authored Jul 18, 2024

This commit applies isolation enforcement to the GFX and Compute rings
in the gfx_v9_0 module.

The commit sets `amdgpu_gfx_enforce_isolation_ring_begin_use` and
`amdgpu_gfx_enforce_isolation_ring_end_use` as the functions to be
called when a ring begins and ends its use, respectively.

`amdgpu_gfx_enforce_isolation_ring_begin_use` is called when a ring
begins its use. This function cancels any scheduled
`enforce_isolation_work` and, if necessary, signals the Kernel Fusion
Driver (KFD) to stop the runqueue.

`amdgpu_gfx_enforce_isolation_ring_end_use` is called when a ring ends
its use. This function schedules `enforce_isolation_work` to be run
after a delay.

These functions are part of the Enforce Isolation Handler, which
enforces shader isolation on AMD GPUs to prevent data leakage between
different processes.

Cc: Christian König <christian.koenig@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Suggested-by: Christian König <christian.koenig@amd.com>

drm/amdgpu: Implement Enforce Isolation Handler for KGD/KFD serialization · afefd6f2

Srinivasan Shanmugam authored Jun 06, 2024

This commit introduces the Enforce Isolation Handler designed to enforce
shader isolation on AMD GPUs, which helps to prevent data leakage
between different processes.

The handler counts the number of emitted fences for each GFX and compute
ring. If there are any fences, it schedules the `enforce_isolation_work`
to be run after a delay of `GFX_SLICE_PERIOD`. If there are no fences,
it signals the Kernel Fusion Driver (KFD) to resume the runqueue.

The function is synchronized using the `enforce_isolation_mutex`.

This commit also introduces a reference count mechanism
(kfd_sch_req_count) to keep track of the number of requests to enable
the KFD scheduler. When a request to enable the KFD scheduler is made,
the reference count is decremented. When the reference count reaches
zero, a delayed work is scheduled to enforce isolation after a delay of
GFX_SLICE_PERIOD.

When a request to disable the KFD scheduler is made, the function first
checks if the reference count is zero. If it is, it cancels the delayed
work for enforcing isolation and checks if the KFD scheduler is active.
If the KFD scheduler is active, it sends a request to stop the KFD
scheduler and sets the KFD scheduler state to inactive. Then, it
increments the reference count.

The function is synchronized using the kfd_sch_mutex to ensure that the
KFD scheduler state and reference count are updated atomically.

Cc: Christian König <christian.koenig@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com>
Suggested-by: Christian König <christian.koenig@amd.com>
Suggested-by: Alex Deucher <alexander.deucher@amd.com>
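
The reference-count behavior described in this message can be modeled roughly as below. This is a single-threaded sketch under assumptions: the struct and function names are hypothetical, the delayed work is reduced to a flag, and the locking that the message attributes to kfd_sch_mutex is omitted.

```c
#include <stdbool.h>

struct sched_model {
	unsigned int req_count; /* outstanding requests to keep KFD stopped */
	bool kfd_active;        /* KFD scheduler currently running */
	bool work_pending;      /* stands in for the delayed work item */
};

/* Request to disable the KFD scheduler. */
void sched_model_disable(struct sched_model *s)
{
	if (s->req_count == 0) {
		s->work_pending = false;       /* cancel the delayed work */
		if (s->kfd_active)
			s->kfd_active = false; /* ask KFD to stop */
	}
	s->req_count++;
}

/* Request to enable the KFD scheduler again. */
void sched_model_enable(struct sched_model *s)
{
	if (s->req_count == 0)
		return;                  /* unbalanced call, nothing to drop */
	if (--s->req_count == 0)
		s->work_pending = true;  /* schedule the delayed work */
}

/* Delayed work body in this model: resume KFD if nobody needs it stopped. */
void sched_model_work(struct sched_model *s)
{
	s->work_pending = false;
	if (s->req_count == 0)
		s->kfd_active = true;
}
```

Two nested disable requests therefore need two enable requests before the delayed work is queued, and KFD only resumes once that work runs.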

drm/amdkfd: APIs to stop/start KFD scheduling · 234eebe1

Amber Lin authored Jul 29, 2024

Provide amdgpu_amdkfd_stop_sched() for amdgpu to stop KFD scheduling
compute work on HIQ. amdgpu_amdkfd_start_sched() resumes the scheduling.
When amdgpu_amdkfd_stop_sched is called, KFD will unmap queues from
runlist. If users send ioctls to KFD to create queues, they'll be added
but those queues won't be mapped to runlist (so not scheduled) until
amdgpu_amdkfd_start_sched is called.

v2: fix build (Alex)
Signed-off-by: Amber Lin <Amber.Lin@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
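
The stop/start semantics above can be illustrated with a toy model: queues created while scheduling is stopped are recorded but stay unmapped until scheduling starts again. The `kfd_model` struct and the `model_*` functions are hypothetical and do not mirror the real KFD data structures.

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_QUEUES 8

struct kfd_model {
	bool sched_stopped;
	size_t n_queues;         /* queues known to the driver */
	bool mapped[MAX_QUEUES]; /* whether each queue is on the runlist */
};

/* Stop scheduling: unmap every queue from the runlist. */
void model_stop_sched(struct kfd_model *k)
{
	k->sched_stopped = true;
	for (size_t i = 0; i < k->n_queues; i++)
		k->mapped[i] = false;
}

/* Create a queue; it is mapped only if scheduling is currently running. */
int model_create_queue(struct kfd_model *k)
{
	if (k->n_queues == MAX_QUEUES)
		return -1;
	k->mapped[k->n_queues] = !k->sched_stopped;
	return (int)k->n_queues++;
}

/* Resume scheduling: map all known queues, including ones added meanwhile. */
void model_start_sched(struct kfd_model *k)
{
	k->sched_stopped = false;
	for (size_t i = 0; i < k->n_queues; i++)
		k->mapped[i] = true;
}
```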

drm/amdgpu/gfx9: Add cleaner shader support for GFX9.4.4 hardware · b1f49ff9

Srinivasan Shanmugam authored Jul 29, 2024

This commit extends the cleaner shader feature to support GFX9.4.4
hardware.

The cleaner shader feature is used to clear or initialize certain GPU
resources, such as Local Data Share (LDS), Vector General Purpose
Registers (VGPRs), and Scalar General Purpose Registers (SGPRs). This
operation needs to be performed in isolation, with no other tasks
running on the GPU at the same time.

Previously, the cleaner shader feature was implemented for GFX9.4.3
hardware. This commit adds support for GFX9.4.4 hardware by allowing the
cleaner shader to be used with this hardware version.

Cc: Christian König <christian.koenig@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amdgpu/gfx9: Add cleaner shader for GFX9.4.3 · 33528831

Srinivasan Shanmugam authored Jul 29, 2024

This commit adds the cleaner shader microcode for GFX9.4.3 GPUs. The
cleaner shader is a piece of GPU code that is used to clear or
initialize certain GPU resources, such as Local Data Share (LDS), Vector
General Purpose Registers (VGPRs), and Scalar General Purpose Registers
(SGPRs).

Clearing these resources is important for ensuring data isolation
between different workloads running on the GPU. Without the cleaner
shader, residual data from a previous workload could potentially be
accessed by a subsequent workload, leading to data leaks and incorrect
computation results.

The cleaner shader microcode is represented as an array of 32-bit words
(`gfx_9_4_3_cleaner_shader_hex`). This array is the binary
representation of the cleaner shader code, which is written in a
low-level GPU instruction set.

When the cleaner shader feature is enabled, the AMDGPU driver loads this
array into a specific location in the GPU memory. The GPU then reads
this memory location to fetch and execute the cleaner shader
instructions.

The cleaner shader is executed automatically by the GPU at the end of
each workload, before the next workload starts. This ensures that all
GPU resources are in a clean state before the start of each workload.

This addition is part of the cleaner shader feature implementation. The
cleaner shader feature helps improve GPU performance and resource
utilization by cleaning up GPU resources after they are used. It also
enhances security and reliability by preventing data leaks between
workloads.

v2: fix copyright date (Alex)

Cc: Christian König <christian.koenig@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amdgpu/gfx9: Implement cleaner shader support for GFX9.4.3 hardware · d4c38154

Srinivasan Shanmugam authored Jul 29, 2024

The patch modifies the gfx_v9_4_3_kiq_set_resources function to write
the cleaner shader's memory controller address to the ring buffer. It
also adds a new function, gfx_v9_4_3_ring_emit_cleaner_shader, which
emits the PACKET3_RUN_CLEANER_SHADER packet to the ring buffer.

This patch adds support for the PACKET3_RUN_CLEANER_SHADER packet in the
gfx_v9_4_3 module. This packet is used to emit the cleaner shader, which
is used to clear GPU memory before it's reused, helping to prevent data
leakage between different processes.

Finally, the patch updates the ring function structures to include the
new gfx_v9_4_3_ring_emit_cleaner_shader function. This allows the
cleaner shader to be emitted as part of the ring's operations.

Cc: Christian König <christian.koenig@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com>
Suggested-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
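
The idea of adding an emit hook to a ring's function table can be sketched as below. Everything here is illustrative: the struct names, the ring layout, and `PKT_RUN_CLEANER_SHADER` (a placeholder value, not the real PACKET3_RUN_CLEANER_SHADER encoding) are assumptions made for the example.

```c
#include <stddef.h>
#include <stdint.h>

#define RING_WORDS 16
#define PKT_RUN_CLEANER_SHADER 0xC0DE0001u /* placeholder packet header */

struct ring_model {
	uint32_t buf[RING_WORDS];
	size_t wptr;
};

struct ring_funcs_model {
	/* Optional hook; ring types without cleaner shader leave it NULL. */
	void (*emit_cleaner_shader)(struct ring_model *ring);
};

static void ring_write(struct ring_model *ring, uint32_t v)
{
	ring->buf[ring->wptr++ % RING_WORDS] = v;
}

/* Writes the packet header followed by one zero payload word. */
static void model_emit_cleaner_shader(struct ring_model *ring)
{
	ring_write(ring, PKT_RUN_CLEANER_SHADER);
	ring_write(ring, 0);
}

const struct ring_funcs_model model_ring_funcs = {
	.emit_cleaner_shader = model_emit_cleaner_shader,
};

/* Generic code calls the hook only when the ring type provides one. */
int ring_run_cleaner_shader(const struct ring_funcs_model *funcs,
			    struct ring_model *ring)
{
	if (!funcs->emit_cleaner_shader)
		return -1;
	funcs->emit_cleaner_shader(ring);
	return 0;
}
```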

drm/amdgpu/gfx9: Implement cleaner shader support for GFX9 hardware · c2e70d30

Srinivasan Shanmugam authored Jul 29, 2024

The patch modifies the gfx_v9_0_kiq_set_resources function to write
the cleaner shader's memory controller address to the ring buffer. It
also adds a new function, gfx_v9_0_ring_emit_cleaner_shader, which
emits the PACKET3_RUN_CLEANER_SHADER packet to the ring buffer.

This patch adds support for the PACKET3_RUN_CLEANER_SHADER packet in the
gfx_v9_0 module. This packet is used to emit the cleaner shader, which
is used to clear GPU memory before it's reused, helping to prevent data
leakage between different processes.

Finally, the patch updates the ring function structures to include the
new gfx_v9_0_ring_emit_cleaner_shader function. This allows the
cleaner shader to be emitted as part of the ring's operations.

Cc: Christian König <christian.koenig@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com>
Suggested-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
