1. 05 Mar, 2024 1 commit
      drm/i915/selftest_hangcheck: Check sanity with more patience · 6616e048
      Janusz Krzysztofik authored
      While trying to reproduce some other issues reported by CI for i915
      hangcheck live selftest, I found them hidden behind timeout failures
      reported by igt_hang_sanitycheck -- the very first hangcheck test case
      executed.
      
      Feb 22 19:49:06 DUT1394ACMR kernel: calling  mei_gsc_driver_init+0x0/0xff0 [mei_gsc] @ 121074
      Feb 22 19:49:06 DUT1394ACMR kernel: i915 0000:03:00.0: [drm] DRM_I915_DEBUG enabled
      Feb 22 19:49:06 DUT1394ACMR kernel: i915 0000:03:00.0: [drm] Cannot find any crtc or sizes
      Feb 22 19:49:06 DUT1394ACMR kernel: probe of i915.mei-gsc.768 returned 0 after 1475 usecs
      Feb 22 19:49:06 DUT1394ACMR kernel: probe of i915.mei-gscfi.768 returned 0 after 1441 usecs
      Feb 22 19:49:06 DUT1394ACMR kernel: initcall mei_gsc_driver_init+0x0/0xff0 [mei_gsc] returned 0 after 3010 usecs
      Feb 22 19:49:06 DUT1394ACMR kernel: i915 0000:03:00.0: [drm] DRM_I915_DEBUG_GEM enabled
      Feb 22 19:49:06 DUT1394ACMR kernel: i915 0000:03:00.0: [drm] DRM_I915_DEBUG_RUNTIME_PM enabled
      Feb 22 19:49:06 DUT1394ACMR kernel: i915: Performing live selftests with st_random_seed=0x4c26c048 st_timeout=500
      Feb 22 19:49:07 DUT1394ACMR kernel: i915: Running hangcheck
      Feb 22 19:49:07 DUT1394ACMR kernel: calling  mei_hdcp_driver_init+0x0/0xff0 [mei_hdcp] @ 121074
      Feb 22 19:49:07 DUT1394ACMR kernel: i915: Running intel_hangcheck_live_selftests/igt_hang_sanitycheck
      Feb 22 19:49:07 DUT1394ACMR kernel: probe of 0000:00:16.0-b638ab7e-94e2-4ea2-a552-d1c54b627f04 returned 0 after 1398 usecs
      Feb 22 19:49:07 DUT1394ACMR kernel: probe of i915.mei-gsc.768-b638ab7e-94e2-4ea2-a552-d1c54b627f04 returned 0 after 97 usecs
      Feb 22 19:49:07 DUT1394ACMR kernel: initcall mei_hdcp_driver_init+0x0/0xff0 [mei_hdcp] returned 0 after 101960 usecs
      Feb 22 19:49:07 DUT1394ACMR kernel: calling  mei_pxp_driver_init+0x0/0xff0 [mei_pxp] @ 121094
      Feb 22 19:49:07 DUT1394ACMR kernel: probe of 0000:00:16.0-fbf6fcf1-96cf-4e2e-a6a6-1bab8cbe36b1 returned 0 after 435 usecs
      Feb 22 19:49:07 DUT1394ACMR kernel: mei_pxp i915.mei-gsc.768-fbf6fcf1-96cf-4e2e-a6a6-1bab8cbe36b1: bound 0000:03:00.0 (ops i915_pxp_tee_component_ops [i915])
      Feb 22 19:49:07 DUT1394ACMR kernel: 100ms wait for request failed on rcs0, err=-62
      Feb 22 19:49:07 DUT1394ACMR kernel: probe of i915.mei-gsc.768-fbf6fcf1-96cf-4e2e-a6a6-1bab8cbe36b1 returned 0 after 158425 usecs
      Feb 22 19:49:07 DUT1394ACMR kernel: initcall mei_pxp_driver_init+0x0/0xff0 [mei_pxp] returned 0 after 224159 usecs
      Feb 22 19:49:07 DUT1394ACMR kernel: i915/intel_hangcheck_live_selftests: igt_hang_sanitycheck failed with error -5
      Feb 22 19:49:07 DUT1394ACMR kernel: i915: probe of 0000:03:00.0 failed with error -5
      
      Request waits that timed out after 100ms were never observed to
      remain pending for a further 100ms: the requests always completed
      within twice the originally requested wait time.
      
      Taking into account potentially significant additional concurrent workload
      generated by new auxiliary drivers that didn't exist before and now are
      loaded in parallel with the i915 module also when loaded in selftest mode,
      relax our expectations on time consumed by the sanity check request before
      it completes.
      Signed-off-by: Janusz Krzysztofik <janusz.krzysztofik@linux.intel.com>
      Reviewed-by: Andi Shyti <andi.shyti@linux.intel.com>
      Signed-off-by: Andi Shyti <andi.shyti@linux.intel.com>
      Link: https://patchwork.freedesktop.org/patch/msgid/20240228152500.38267-2-janusz.krzysztofik@linux.intel.com
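
      The effect of the longer wait can be illustrated with a minimal
      user-space sketch. This is not the selftest code: the request is
      simulated, and fake_request, wait_for_request and the 150ms
      completion time are hypothetical names and values chosen only for
      illustration. The timeout return value mirrors the err=-62 in the
      log above, which is -ETIME on Linux.

```c
#include <errno.h>

/* Hypothetical stand-in for a submitted request: it counts as complete
 * once the simulated clock reaches done_at_ms. */
struct fake_request {
	unsigned int done_at_ms;
};

/* Advance a simulated clock in 1 ms steps until the request completes
 * or the budget expires. Returns 0 on completion, -ETIME on timeout. */
static int wait_for_request(const struct fake_request *rq,
			    unsigned int budget_ms)
{
	unsigned int now_ms;

	for (now_ms = 0; now_ms <= budget_ms; now_ms++) {
		if (now_ms >= rq->done_at_ms)
			return 0;
	}

	return -ETIME;
}
```

      With a request that needs 150ms (an assumed figure), a 100ms budget
      yields -ETIME while a 200ms budget returns 0, matching the
      observation that the waits completed within the doubled time.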
  2. 04 Mar, 2024 1 commit
  3. 01 Mar, 2024 2 commits
  4. 28 Feb, 2024 1 commit
  5. 20 Feb, 2024 1 commit
  6. 15 Feb, 2024 1 commit
  7. 14 Feb, 2024 1 commit
  8. 13 Feb, 2024 1 commit
  9. 12 Feb, 2024 1 commit
  10. 24 Jan, 2024 2 commits
  11. 18 Jan, 2024 2 commits
  12. 10 Jan, 2024 1 commit
  13. 09 Jan, 2024 3 commits
      drm/i915/guc: Avoid circular locking issue on busyness flush · 0e00a881
      John Harrison authored
      Avoid the following lockdep complaint:
      <4> [298.856498] ======================================================
      <4> [298.856500] WARNING: possible circular locking dependency detected
      <4> [298.856503] 6.7.0-rc5-CI_DRM_14017-g58ac4ffc75b6+ #1 Tainted: G    N
      <4> [298.856505] ------------------------------------------------------
      <4> [298.856507] kworker/4:1H/190 is trying to acquire lock:
      <4> [298.856509] ffff8881103e9978 (&gt->reset.backoff_srcu){++++}-{0:0}, at: _intel_gt_reset_lock+0x35/0x380 [i915]
      <4> [298.856661]
      but task is already holding lock:
      <4> [298.856663] ffffc900013f7e58 ((work_completion)(&(&guc->timestamp.work)->work)){+.+.}-{0:0}, at: process_scheduled_works+0x264/0x530
      <4> [298.856671]
      which lock already depends on the new lock.
      
      The complaint is not actually valid. The busyness worker thread does
      indeed hold the worker lock and then attempt to acquire the reset lock
      (which may have happened in reverse order elsewhere). However, it does
      so with a trylock that exits if the reset lock is not available
      (specifically to prevent this and other similar deadlocks).
      Unfortunately, lockdep does not understand the trylock semantics (the
      lock is an i915 specific custom implementation for resets).
      
      Not doing a synchronous flush of the worker thread when a reset is in
      progress resolves the lockdep splat by never even attempting to grab
      the lock in this particular scenario.
      
      There are situations where a synchronous cancel is required, however.
      So, always do the synchronous cancel if not in reset. And add an extra
      synchronous cancel to the end of the reset flow to cover a reset that
      occurs at driver shutdown: there the earlier cancel is no longer
      synchronous, and a worker that is not stopped could lead to
      accesses to unallocated memory.
      Signed-off-by: Zhanjun Dong <zhanjun.dong@intel.com>
      Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
      Cc: Andi Shyti <andi.shyti@linux.intel.com>
      Cc: Daniel Vetter <daniel@ffwll.ch>
      Reviewed-by: Andi Shyti <andi.shyti@linux.intel.com>
      Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch>
      Link: https://patchwork.freedesktop.org/patch/msgid/20231219195957.212600-1-John.C.Harrison@Intel.com
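
      The cancel policy described in this message can be sketched in plain
      user-space C. This is an illustration under stated assumptions, not
      the driver code: fake_gt, its fields and all function names below
      are hypothetical, and the worker is modelled as a simple flag rather
      than a real delayed work item.

```c
#include <stdbool.h>

/* Hypothetical model of the state involved; not the i915 structures. */
struct fake_gt {
	bool reset_in_progress;
	bool worker_pending;
	unsigned int sync_cancels;
	unsigned int async_cancels;
};

/* Asynchronous cancel: does not wait, so the worker may still run. */
static void cancel_worker_async(struct fake_gt *gt)
{
	gt->async_cancels++;
}

/* Synchronous cancel: the worker is guaranteed stopped on return. */
static void cancel_worker_sync(struct fake_gt *gt)
{
	gt->sync_cancels++;
	gt->worker_pending = false;
}

/* Skip the synchronous flush while a reset is in progress, so the
 * reset lock is never even attempted from this path. */
static void cancel_busyness_worker(struct fake_gt *gt)
{
	if (gt->reset_in_progress)
		cancel_worker_async(gt);
	else
		cancel_worker_sync(gt);
}

/* End of the reset flow: the extra synchronous cancel covers a reset
 * that happened at shutdown, when the earlier cancel was async. */
static void reset_finish(struct fake_gt *gt)
{
	gt->reset_in_progress = false;
	cancel_worker_sync(gt);
}
```

      In this model, a cancel issued during a reset leaves worker_pending
      set, and only the call to reset_finish clears it.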
      drm/i915/guc: Close deregister-context race against CT-loss · 2f2cc53b
      Alan Previn authored
      If we are at the end of suspend or very early in resume,
      it's possible an async fence signal (via call_rcu) is triggered
      to free_engines, which could lead us to the execution of
      the context destruction worker (after a prior worker flush).
      
      Thus, when suspending, insert rcu_barriers at the start
      of i915_gem_suspend (part of driver's suspend prepare) and
      again in i915_gem_suspend_late so that all such cases have
      completed and context destruction list isn't missing anything.
      
      In destroyed_worker_func, close the race against CT-loss
      by checking that CT is enabled before calling into
      deregister_destroyed_contexts.
      
      Based on testing, guc_lrc_desc_unpin may still race and fail
      as we traverse the GuC's context-destroy list because the
      CT could be disabled right before calling GuC's CT send function.
      
      We've witnessed this race condition once every ~6000-8000
      suspend-resume cycles while ensuring that workloads that render
      something onscreen are continuously started just before
      we suspend (and the workload is small enough to complete
      and trigger the queued engine/context free-up either very
      late in suspend or very early in resume).
      
      In such a case, we need to unroll the entire process because
      guc-lrc-unpin takes a gt wakeref which only gets released in
      the G2H IRQ reply that never comes through in this corner
      case. Without the unroll, the taken wakeref is leaked and will
      cascade into a kernel hang later at the tail end of suspend in
      this function:
      
         intel_wakeref_wait_for_idle(&gt->wakeref)
         (called by) - intel_gt_pm_wait_for_idle
         (called by) - wait_for_suspend
      
      Thus, do an unroll in guc_lrc_desc_unpin and
      deregister_destroyed_contexts if guc_lrc_desc_unpin fails due to
      CT send failure.
      When unrolling, keep the context in the GuC's destroy-list so
      it can get picked up on the next destroy worker invocation
      (if suspend aborted) or get fully purged as part of a GuC
      sanitization (end of suspend) or a reset flow.
      Signed-off-by: Alan Previn <alan.previn.teres.alexis@intel.com>
      Signed-off-by: Anshuman Gupta <anshuman.gupta@intel.com>
      Tested-by: Mousumi Jana <mousumi.jana@intel.com>
      Acked-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
      Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
      Signed-off-by: Matt Roper <matthew.d.roper@intel.com>
      Link: https://patchwork.freedesktop.org/patch/msgid/20231229215143.581619-1-alan.previn.teres.alexis@intel.com
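
      The unroll described in this message can be sketched as a small
      user-space model. It is only an illustration under stated
      assumptions: fake_guc and its fields are hypothetical, the functions
      borrow names from the message but are not the driver functions, the
      destroy list is reduced to a counter, and -ENODEV is an assumed
      error code for a send attempted while CT is disabled.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical model; the destroy list is reduced to a counter. */
struct fake_guc {
	bool ct_enabled;
	int wakeref;		/* released by the G2H reply on success */
	int destroy_list_len;
};

/* Sending fails when CT is disabled (error code is an assumption). */
static int ct_send(const struct fake_guc *guc)
{
	return guc->ct_enabled ? 0 : -ENODEV;
}

/* Take a wakeref, then send. On send failure no G2H reply will ever
 * arrive, so drop the wakeref here instead of leaking it. */
static int lrc_desc_unpin(struct fake_guc *guc)
{
	int ret;

	guc->wakeref++;
	ret = ct_send(guc);
	if (ret)
		guc->wakeref--;

	return ret;
}

/* Walk the destroy list; on failure keep the context on the list so a
 * later worker run, a sanitization or a reset can handle it. */
static void deregister_destroyed_contexts(struct fake_guc *guc)
{
	while (guc->destroy_list_len > 0) {
		guc->destroy_list_len--;
		if (lrc_desc_unpin(guc)) {
			guc->destroy_list_len++;
			break;
		}
	}
}

/* Worker entry point: do nothing at all when CT is already disabled. */
static void destroyed_worker_func(struct fake_guc *guc)
{
	if (!guc->ct_enabled)
		return;

	deregister_destroyed_contexts(guc);
}
```

      Calling deregister_destroyed_contexts directly with CT disabled
      models the remaining race, where CT is lost after the worker's
      check: the wakeref count and the list length are both left
      unchanged.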
      drm/i915/guc: Flush context destruction worker at suspend · 5e83c060
      Alan Previn authored
      When suspending, flush the context-guc-id deregistration
      worker at the final stages of intel_gt_suspend_late, when
      we finally call gt_sanitize (which eventually leads down to
      __uc_sanitize), so that the deregistration worker doesn't
      fire off later as we reset the GuC microcontroller.
      Signed-off-by: Alan Previn <alan.previn.teres.alexis@intel.com>
      Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
      Tested-by: Mousumi Jana <mousumi.jana@intel.com>
      Signed-off-by: Matt Roper <matthew.d.roper@intel.com>
      Link: https://patchwork.freedesktop.org/patch/msgid/20231228045558.536585-2-alan.previn.teres.alexis@intel.com
  14. 06 Jan, 2024 1 commit
  15. 05 Jan, 2024 2 commits
  16. 02 Jan, 2024 1 commit
  17. 29 Dec, 2023 4 commits
  18. 22 Dec, 2023 1 commit
  19. 19 Dec, 2023 1 commit
  20. 15 Dec, 2023 9 commits
  21. 14 Dec, 2023 2 commits
  22. 13 Dec, 2023 1 commit