drm/xe: fix mem_access for early lrc generation

We spawn some hw queues during device probe to generate the default LRC for every engine type, however the queue destruction step is typically async. Queue destruction needs to do stuff like GuC context deregister which requires GuC CT, which in turn requires an active mem_access ref. The caller during probe is meant to hold the mem_access token, however due to the async destruction it might have already been dropped if we are unlucky. Similar to how we already handle migrate VMs for which there is no mem_access ref, fix this by keeping the callers token alive, releasing it only when destroying the queue. We can treat a NULL vm as indication that we need to grab our own extra ref. Fixes the following splat sometimes seen during load: [ 1682.899930] WARNING: CPU: 1 PID: 8642 at drivers/gpu/drm/xe/xe_device.c:537 xe_device_assert_mem_access+0x27/0x30 [xe] [ 1682.900209] CPU: 1 PID: 8642 Comm: kworker/u24:97 Tainted: G U W E N 6.6.0-rc3+ #6 [ 1682.900214] Workqueue: submit_wq xe_sched_process_msg_work [xe] [ 1682.900303] RIP: 0010:xe_device_assert_mem_access+0x27/0x30 [xe] [ 1682.900388] Code: 90 90 90 66 0f 1f 00 0f 1f 44 00 00 53 48 89 fb e8 1e 6c 03 00 48 85 c0 74 06 5b c3 cc cc cc cc 8b 83 28 23 00 00 85 c0 75 f0 <0f> 0b 5b c3 cc cc cc cc 90 90 90 90 90 90 90 90 90 90 90 90 90 90 [ 1682.900390] RSP: 0018:ffffc900021cfb68 EFLAGS: 00010246 [ 1682.900394] RAX: 0000000000000000 RBX: ffff8886a96d8000 RCX: 0000000000000000 [ 1682.900396] RDX: 0000000000000001 RSI: ffff8886a6311a00 RDI: ffff8886a96d8000 [ 1682.900398] RBP: ffffc900021cfcc0 R08: 0000000000000001 R09: 0000000000000000 [ 1682.900400] R10: ffffc900021cfcd0 R11: 0000000000000002 R12: 0000000000000004 [ 1682.900402] R13: 0000000000000000 R14: ffff8886a6311990 R15: ffffc900021cfd74 [ 1682.900405] FS: 0000000000000000(0000) GS:ffff888829880000(0000) knlGS:0000000000000000 [ 1682.900407] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 1682.900409] CR2: 000055f70bad3fb0 CR3: 000000025243a004 CR4: 00000000003706e0 [ 1682.900412] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 1682.900413] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 1682.900415] Call Trace: [ 1682.900418] <TASK> [ 1682.900420] ? xe_device_assert_mem_access+0x27/0x30 [xe] [ 1682.900504] ? __warn+0x85/0x170 [ 1682.900510] ? xe_device_assert_mem_access+0x27/0x30 [xe] [ 1682.900596] ? report_bug+0x171/0x1a0 [ 1682.900604] ? handle_bug+0x3c/0x80 [ 1682.900608] ? exc_invalid_op+0x17/0x70 [ 1682.900612] ? asm_exc_invalid_op+0x1a/0x20 [ 1682.900621] ? xe_device_assert_mem_access+0x27/0x30 [xe] [ 1682.900706] ? xe_device_assert_mem_access+0x12/0x30 [xe] [ 1682.900790] guc_ct_send_locked+0xb9/0x1550 [xe] [ 1682.900882] ? lock_acquire+0xca/0x2b0 [ 1682.900885] ? guc_ct_send+0x3c/0x1a0 [xe] [ 1682.900977] ? lock_is_held_type+0x9b/0x110 [ 1682.900984] ? __mutex_lock+0xc0/0xb90 [ 1682.900989] ? __pfx___drm_printfn_info+0x10/0x10 [ 1682.900999] guc_ct_send+0x53/0x1a0 [xe] [ 1682.901090] ? __lock_acquire+0xf22/0x21b0 [ 1682.901097] ? process_one_work+0x1a0/0x500 [ 1682.901109] xe_guc_ct_send+0x19/0x50 [xe] [ 1682.901202] set_min_preemption_timeout+0x75/0xa0 [xe] [ 1682.901294] disable_scheduling_deregister+0x55/0x250 [xe] [ 1682.901383] ? xe_sched_process_msg_work+0x76/0xd0 [xe] [ 1682.901467] ? lock_release+0xc9/0x260 [ 1682.901474] xe_sched_process_msg_work+0x82/0xd0 [xe] [ 1682.901559] process_one_work+0x20a/0x500 v2: Add the splat Signed-off-by: Matthew Auld <matthew.auld@intel.com> Cc: Vinay Belgaumkar <vinay.belgaumkar@intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>

drm/xe: fix mem_access for early lrc generation
We spawn some hw queues during device probe to generate the default LRC for every engine type, however the queue destruction step is typically async. Queue destruction needs to do stuff like GuC context deregister which requires GuC CT, which in turn requires an active mem_access ref. The caller during probe is meant to hold the mem_access token, however due to the async destruction it might have already been dropped if we are unlucky. Similar to how we already handle migrate VMs for which there is no mem_access ref, fix this by keeping the callers token alive, releasing it only when destroying the queue. We can treat a NULL vm as indication that we need to grab our own extra ref. Fixes the following splat sometimes seen during load: [ 1682.899930] WARNING: CPU: 1 PID: 8642 at drivers/gpu/drm/xe/xe_device.c:537 xe_device_assert_mem_access+0x27/0x30 [xe] [ 1682.900209] CPU: 1 PID: 8642 Comm: kworker/u24:97 Tainted: G U W E N 6.6.0-rc3+ #6 [ 1682.900214] Workqueue: submit_wq xe_sched_process_msg_work [xe] [ 1682.900303] RIP: 0010:xe_device_assert_mem_access+0x27/0x30 [xe] [ 1682.900388] Code: 90 90 90 66 0f 1f 00 0f 1f 44 00 00 53 48 89 fb e8 1e 6c 03 00 48 85 c0 74 06 5b c3 cc cc cc cc 8b 83 28 23 00 00 85 c0 75 f0 <0f> 0b 5b c3 cc cc cc cc 90 90 90 90 90 90 90 90 90 90 90 90 90 90 [ 1682.900390] RSP: 0018:ffffc900021cfb68 EFLAGS: 00010246 [ 1682.900394] RAX: 0000000000000000 RBX: ffff8886a96d8000 RCX: 0000000000000000 [ 1682.900396] RDX: 0000000000000001 RSI: ffff8886a6311a00 RDI: ffff8886a96d8000 [ 1682.900398] RBP: ffffc900021cfcc0 R08: 0000000000000001 R09: 0000000000000000 [ 1682.900400] R10: ffffc900021cfcd0 R11: 0000000000000002 R12: 0000000000000004 [ 1682.900402] R13: 0000000000000000 R14: ffff8886a6311990 R15: ffffc900021cfd74 [ 1682.900405] FS: 0000000000000000(0000) GS:ffff888829880000(0000) knlGS:0000000000000000 [ 1682.900407] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 1682.900409] CR2: 000055f70bad3fb0 CR3: 000000025243a004 CR4: 00000000003706e0 [ 1682.900412] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 1682.900413] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 1682.900415] Call Trace: [ 1682.900418] <TASK> [ 1682.900420] ? xe_device_assert_mem_access+0x27/0x30 [xe] [ 1682.900504] ? __warn+0x85/0x170 [ 1682.900510] ? xe_device_assert_mem_access+0x27/0x30 [xe] [ 1682.900596] ? report_bug+0x171/0x1a0 [ 1682.900604] ? handle_bug+0x3c/0x80 [ 1682.900608] ? exc_invalid_op+0x17/0x70 [ 1682.900612] ? asm_exc_invalid_op+0x1a/0x20 [ 1682.900621] ? xe_device_assert_mem_access+0x27/0x30 [xe] [ 1682.900706] ? xe_device_assert_mem_access+0x12/0x30 [xe] [ 1682.900790] guc_ct_send_locked+0xb9/0x1550 [xe] [ 1682.900882] ? lock_acquire+0xca/0x2b0 [ 1682.900885] ? guc_ct_send+0x3c/0x1a0 [xe] [ 1682.900977] ? lock_is_held_type+0x9b/0x110 [ 1682.900984] ? __mutex_lock+0xc0/0xb90 [ 1682.900989] ? __pfx___drm_printfn_info+0x10/0x10 [ 1682.900999] guc_ct_send+0x53/0x1a0 [xe] [ 1682.901090] ? __lock_acquire+0xf22/0x21b0 [ 1682.901097] ? process_one_work+0x1a0/0x500 [ 1682.901109] xe_guc_ct_send+0x19/0x50 [xe] [ 1682.901202] set_min_preemption_timeout+0x75/0xa0 [xe] [ 1682.901294] disable_scheduling_deregister+0x55/0x250 [xe] [ 1682.901383] ? xe_sched_process_msg_work+0x76/0xd0 [xe] [ 1682.901467] ? lock_release+0xc9/0x260 [ 1682.901474] xe_sched_process_msg_work+0x82/0xd0 [xe] [ 1682.901559] process_one_work+0x20a/0x500 v2: Add the splat Signed-off-by: Matthew Auld <matthew.auld@intel.com> Cc: Vinay Belgaumkar <vinay.belgaumkar@intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
fcf98d68 · Matthew Auld · Rodrigo Vivi · 5152234e · fcf98d68
Commit fcf98d68 authored Nov 27, 2023 by Matthew Auld Committed by Rodrigo Vivi Dec 21, 2023
Hide whitespace changes
Inline Side-by-side

Showing with 7 additions and 7 deletions

drivers/gpu/drm/xe/xe_exec_queue.c drivers/gpu/drm/xe/xe_exec_queue.c +7 -7

No files found.
--- a/drivers/gpu/drm/xe/xe_exec_queue.c
+++ b/drivers/gpu/drm/xe/xe_exec_queue.c
@@ -90,12 +90,12 @@ static struct xe_exec_queue *__xe_exec_queue_create(struct xe_device *xe,
 	/*
 	 * Normally the user vm holds an rpm ref to keep the device
 	 * awake, and the context holds a ref for the vm, however for
-	 * some engines we use the kernels migrate vm underneath which
-	 * offers no such rpm ref. Make sure we keep a ref here, so we
-	 * can perform GuC CT actions when needed. Caller is expected to
-	 * have already grabbed the rpm ref outside any sensitive locks.
+	 * some engines we use the kernels migrate vm underneath which offers no
+	 * such rpm ref, or we lack a vm. Make sure we keep a ref here, so we
+	 * can perform GuC CT actions when needed. Caller is expected to have
+	 * already grabbed the rpm ref outside any sensitive locks.
 	 */
-	if (!(q->flags & EXEC_QUEUE_FLAG_PERMANENT) && (q->flags & EXEC_QUEUE_FLAG_VM))
+	if (!(q->flags & EXEC_QUEUE_FLAG_PERMANENT) && (q->flags & EXEC_QUEUE_FLAG_VM || !vm))
 		drm_WARN_ON(&xe->drm, !xe_device_mem_access_get_if_ongoing(xe));

 	return q;
@@ -172,10 +172,10 @@ void xe_exec_queue_fini(struct xe_exec_queue *q)

 	for (i = 0; i < q->width; ++i)
 		xe_lrc_finish(q->lrc + i);
+	if (!(q->flags & EXEC_QUEUE_FLAG_PERMANENT) && (q->flags & EXEC_QUEUE_FLAG_VM || !q->vm))
+		xe_device_mem_access_put(gt_to_xe(q->gt));
 	if (q->vm)
 		xe_vm_put(q->vm);
-	if (!(q->flags & EXEC_QUEUE_FLAG_PERMANENT) && (q->flags & EXEC_QUEUE_FLAG_VM))
-		xe_device_mem_access_put(gt_to_xe(q->gt));

 	kfree(q);
 }