1. 26 Mar, 2024 1 commit
    • Janusz Krzysztofik's avatar
      drm/i915/vma: Fix UAF on destroy against retire race · f3c71b2d
      Janusz Krzysztofik authored
      Object debugging tools were sporadically reporting illegal attempts to
      free a still active i915 VMA object when parking a GT believed to be idle.
      
      [161.359441] ODEBUG: free active (active state 0) object: ffff88811643b958 object type: i915_active hint: __i915_vma_active+0x0/0x50 [i915]
      [161.360082] WARNING: CPU: 5 PID: 276 at lib/debugobjects.c:514 debug_print_object+0x80/0xb0
      ...
      [161.360304] CPU: 5 PID: 276 Comm: kworker/5:2 Not tainted 6.5.0-rc1-CI_DRM_13375-g003f860e5577+ #1
      [161.360314] Hardware name: Intel Corporation Rocket Lake Client Platform/RocketLake S UDIMM 6L RVP, BIOS RKLSFWI1.R00.3173.A03.2204210138 04/21/2022
      [161.360322] Workqueue: i915-unordered __intel_wakeref_put_work [i915]
      [161.360592] RIP: 0010:debug_print_object+0x80/0xb0
      ...
      [161.361347] debug_object_free+0xeb/0x110
      [161.361362] i915_active_fini+0x14/0x130 [i915]
      [161.361866] release_references+0xfe/0x1f0 [i915]
      [161.362543] i915_vma_parked+0x1db/0x380 [i915]
      [161.363129] __gt_park+0x121/0x230 [i915]
      [161.363515] ____intel_wakeref_put_last+0x1f/0x70 [i915]
      
      That has been tracked down to be happening when another thread is
      deactivating the VMA inside __active_retire() helper, after the VMA's
      active counter has been already decremented to 0, but before deactivation
      of the VMA's object is reported to the object debugging tool.
      
      We could prevent from that race by serializing i915_active_fini() with
      __active_retire() via ref->tree_lock, but that wouldn't stop the VMA from
      being used, e.g. from __i915_vma_retire() called at the end of
      __active_retire(), after that VMA has been already freed by a concurrent
      i915_vma_destroy() on return from the i915_active_fini().  Then, we should
      rather fix the issue at the VMA level, not in i915_active.
      
      Since __i915_vma_parked() is called from __gt_park() on last put of the
      GT's wakeref, the issue could be addressed by holding the GT wakeref long
      enough for __active_retire() to complete before that wakeref is released
      and the GT parked.
      
      I believe the issue was introduced by commit d9393973 ("drm/i915:
      Remove the vma refcount") which moved a call to i915_active_fini() from
      a dropped i915_vma_release(), called on last put of the removed VMA kref,
      to i915_vma_parked() processing path called on last put of a GT wakeref.
      However, its visibility to the object debugging tool was suppressed by a
      bug in i915_active that was fixed two weeks later with commit e92eb246
      ("drm/i915/active: Fix missing debug object activation").
      
      A VMA associated with a request doesn't acquire a GT wakeref by itself.
      Instead, it depends on a wakeref held directly by the request's active
      intel_context for a GT associated with its VM, and indirectly on that
      intel_context's engine wakeref if the engine belongs to the same GT as the
      VMA's VM.  Those wakerefs are released asynchronously to VMA deactivation.
      
      Fix the issue by getting a wakeref for the VMA's GT when activating it,
      and putting that wakeref only after the VMA is deactivated.  However,
      exclude global GTT from that processing path, otherwise the GPU never goes
      idle.  Since __i915_vma_retire() may be called from atomic contexts, use
      async variant of wakeref put.  Also, to avoid circular locking dependency,
      take care of acquiring the wakeref before VM mutex when both are needed.
      
      v7: Add inline comments with justifications for:
          - using untracked variants of intel_gt_pm_get/put() (Nirmoy),
          - using async variant of _put(),
          - not getting the wakeref in case of a global GTT,
          - always getting the first wakeref outside vm->mutex.
      v6: Since __i915_vma_active/retire() callbacks are not serialized, storing
          a wakeref tracking handle inside struct i915_vma is not safe, and
          there is no other good place for that.  Use untracked variants of
          intel_gt_pm_get/put_async().
      v5: Replace "tile" with "GT" across commit description (Rodrigo),
        - avoid mentioning multi-GT case in commit description (Rodrigo),
        - explain why we need to take a temporary wakeref unconditionally inside
          i915_vma_pin_ww() (Rodrigo).
      v4: Refresh on top of commit 5e4e06e4 ("drm/i915: Track gt pm
          wakerefs") (Andi),
        - for more easy backporting, split out removal of former insufficient
          workarounds and move them to separate patches (Nirmoy).
        - clean up commit message and description a bit.
      v3: Identify root cause more precisely, and a commit to blame,
        - identify and drop former workarounds,
        - update commit message and description.
      v2: Get the wakeref before VM mutex to avoid circular locking dependency,
        - drop questionable Fixes: tag.
      
      Fixes: d9393973 ("drm/i915: Remove the vma refcount")
      Closes: https://gitlab.freedesktop.org/drm/intel/issues/8875Signed-off-by: default avatarJanusz Krzysztofik <janusz.krzysztofik@linux.intel.com>
      Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
      Cc: Nirmoy Das <nirmoy.das@intel.com>
      Cc: Andi Shyti <andi.shyti@linux.intel.com>
      Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
      Cc: stable@vger.kernel.org # v5.19+
      Reviewed-by: default avatarNirmoy Das <nirmoy.das@intel.com>
      Signed-off-by: default avatarAndi Shyti <andi.shyti@linux.intel.com>
      Link: https://patchwork.freedesktop.org/patch/msgid/20240305143747.335367-6-janusz.krzysztofik@linux.intel.com
      f3c71b2d
  2. 20 Mar, 2024 1 commit
  3. 13 Mar, 2024 1 commit
  4. 11 Mar, 2024 1 commit
  5. 07 Mar, 2024 5 commits
  6. 06 Mar, 2024 1 commit
  7. 05 Mar, 2024 2 commits
    • Janusz Krzysztofik's avatar
      drm/i915/selftests: Fix dependency of some timeouts on HZ · 6ee3f54b
      Janusz Krzysztofik authored
      Third argument of i915_request_wait() accepts a timeout value in jiffies.
      Most users pass either a simple HZ based expression, or a result of
      msecs_to_jiffies(), or MAX_SCHEDULE_TIMEOUT, or a very small number not
      exceeding 4 if applicable as that value.  However, there is one user --
      intel_selftest_wait_for_rq() -- that passes a WAIT_FOR_RESET_TIME symbol,
      defined as a large constant value that most probably represents a desired
      timeout in ms.  While that usage results in the intended value of timeout
      on usual x86_64 kernel configurations, it is not portable across different
      architectures and custom kernel configs.
      
      Rename the symbol to clearly indicate intended units and convert it to
      jiffies before use.
      
      Fixes: 3a4bfa09 ("drm/i915/selftest: Fix workarounds selftest for GuC submission")
      Signed-off-by: default avatarJanusz Krzysztofik <janusz.krzysztofik@linux.intel.com>
      Cc: Rahul Kumar Singh <rahul.kumar.singh@intel.com>
      Cc: John Harrison <John.C.Harrison@Intel.com>
      Cc: Matthew Brost <matthew.brost@intel.com>
      Reviewed-by: default avatarAndi Shyti <andi.shyti@linux.intel.com>
      Signed-off-by: default avatarAndi Shyti <andi.shyti@linux.intel.com>
      Link: https://patchwork.freedesktop.org/patch/msgid/20240222113347.648945-2-janusz.krzysztofik@linux.intel.com
      6ee3f54b
    • Janusz Krzysztofik's avatar
      drm/i915/selftest_hangcheck: Check sanity with more patience · 6616e048
      Janusz Krzysztofik authored
      While trying to reproduce some other issues reported by CI for i915
      hangcheck live selftest, I found them hidden behind timeout failures
      reported by igt_hang_sanitycheck -- the very first hangcheck test case
      executed.
      
      Feb 22 19:49:06 DUT1394ACMR kernel: calling  mei_gsc_driver_init+0x0/0xff0 [mei_gsc] @ 121074
      Feb 22 19:49:06 DUT1394ACMR kernel: i915 0000:03:00.0: [drm] DRM_I915_DEBUG enabled
      Feb 22 19:49:06 DUT1394ACMR kernel: i915 0000:03:00.0: [drm] Cannot find any crtc or sizes
      Feb 22 19:49:06 DUT1394ACMR kernel: probe of i915.mei-gsc.768 returned 0 after 1475 usecs
      Feb 22 19:49:06 DUT1394ACMR kernel: probe of i915.mei-gscfi.768 returned 0 after 1441 usecs
      Feb 22 19:49:06 DUT1394ACMR kernel: initcall mei_gsc_driver_init+0x0/0xff0 [mei_gsc] returned 0 after 3010 usecs
      Feb 22 19:49:06 DUT1394ACMR kernel: i915 0000:03:00.0: [drm] DRM_I915_DEBUG_GEM enabled
      Feb 22 19:49:06 DUT1394ACMR kernel: i915 0000:03:00.0: [drm] DRM_I915_DEBUG_RUNTIME_PM enabled
      Feb 22 19:49:06 DUT1394ACMR kernel: i915: Performing live selftests with st_random_seed=0x4c26c048 st_timeout=500
      Feb 22 19:49:07 DUT1394ACMR kernel: i915: Running hangcheck
      Feb 22 19:49:07 DUT1394ACMR kernel: calling  mei_hdcp_driver_init+0x0/0xff0 [mei_hdcp] @ 121074
      Feb 22 19:49:07 DUT1394ACMR kernel: i915: Running intel_hangcheck_live_selftests/igt_hang_sanitycheck
      Feb 22 19:49:07 DUT1394ACMR kernel: probe of 0000:00:16.0-b638ab7e-94e2-4ea2-a552-d1c54b627f04 returned 0 after 1398 usecs
      Feb 22 19:49:07 DUT1394ACMR kernel: probe of i915.mei-gsc.768-b638ab7e-94e2-4ea2-a552-d1c54b627f04 returned 0 after 97 usecs
      Feb 22 19:49:07 DUT1394ACMR kernel: initcall mei_hdcp_driver_init+0x0/0xff0 [mei_hdcp] returned 0 after 101960 usecs
      Feb 22 19:49:07 DUT1394ACMR kernel: calling  mei_pxp_driver_init+0x0/0xff0 [mei_pxp] @ 121094
      Feb 22 19:49:07 DUT1394ACMR kernel: probe of 0000:00:16.0-fbf6fcf1-96cf-4e2e-a6a6-1bab8cbe36b1 returned 0 after 435 usecs
      Feb 22 19:49:07 DUT1394ACMR kernel: mei_pxp i915.mei-gsc.768-fbf6fcf1-96cf-4e2e-a6a6-1bab8cbe36b1: bound 0000:03:00.0 (ops i915_pxp_tee_component_ops [i915])
      Feb 22 19:49:07 DUT1394ACMR kernel: 100ms wait for request failed on rcs0, err=-62
      Feb 22 19:49:07 DUT1394ACMR kernel: probe of i915.mei-gsc.768-fbf6fcf1-96cf-4e2e-a6a6-1bab8cbe36b1 returned 0 after 158425 usecs
      Feb 22 19:49:07 DUT1394ACMR kernel: initcall mei_pxp_driver_init+0x0/0xff0 [mei_pxp] returned 0 after 224159 usecs
      Feb 22 19:49:07 DUT1394ACMR kernel: i915/intel_hangcheck_live_selftests: igt_hang_sanitycheck failed with error -5
      Feb 22 19:49:07 DUT1394ACMR kernel: i915: probe of 0000:03:00.0 failed with error -5
      
      Those request waits, once timed out after 100ms, have never been
      confirmed to still persist over another 100ms, always being able to
      complete within the originally requested wait time doubled.
      
      Taking into account potentially significant additional concurrent workload
      generated by new auxiliary drivers that didn't exist before and now are
      loaded in parallel with the i915 module also when loaded in selftest mode,
      relax our expectations on time consumed by the sanity check request before
      it completes.
      Signed-off-by: default avatarJanusz Krzysztofik <janusz.krzysztofik@linux.intel.com>
      Reviewed-by: default avatarAndi Shyti <andi.shyti@linux.intel.com>
      Signed-off-by: default avatarAndi Shyti <andi.shyti@linux.intel.com>
      Link: https://patchwork.freedesktop.org/patch/msgid/20240228152500.38267-2-janusz.krzysztofik@linux.intel.com
      6616e048
  8. 04 Mar, 2024 1 commit
  9. 01 Mar, 2024 2 commits
  10. 28 Feb, 2024 1 commit
  11. 20 Feb, 2024 1 commit
  12. 15 Feb, 2024 1 commit
  13. 14 Feb, 2024 1 commit
  14. 13 Feb, 2024 1 commit
  15. 12 Feb, 2024 1 commit
  16. 24 Jan, 2024 2 commits
  17. 18 Jan, 2024 2 commits
  18. 10 Jan, 2024 1 commit
  19. 09 Jan, 2024 3 commits
    • John Harrison's avatar
      drm/i915/guc: Avoid circular locking issue on busyness flush · 0e00a881
      John Harrison authored
      Avoid the following lockdep complaint:
      <4> [298.856498] ======================================================
      <4> [298.856500] WARNING: possible circular locking dependency detected
      <4> [298.856503] 6.7.0-rc5-CI_DRM_14017-g58ac4ffc75b6+ #1 Tainted: G
          N
      <4> [298.856505] ------------------------------------------------------
      <4> [298.856507] kworker/4:1H/190 is trying to acquire lock:
      <4> [298.856509] ffff8881103e9978 (&gt->reset.backoff_srcu){++++}-{0:0}, at:
      _intel_gt_reset_lock+0x35/0x380 [i915]
      <4> [298.856661]
      but task is already holding lock:
      <4> [298.856663] ffffc900013f7e58
      ((work_completion)(&(&guc->timestamp.work)->work)){+.+.}-{0:0}, at:
      process_scheduled_works+0x264/0x530
      <4> [298.856671]
      which lock already depends on the new lock.
      
      The complaint is not actually valid. The busyness worker thread does
      indeed hold the worker lock and then attempt to acquire the reset lock
      (which may have happened in reverse order elsewhere). However, it does
      so with a trylock that exits if the reset lock is not available
      (specifically to prevent this and other similar deadlocks).
      Unfortunately, lockdep does not understand the trylock semantics (the
      lock is an i915 specific custom implementation for resets).
      
      Not doing a synchronous flush of the worker thread when a reset is in
      progress resolves the lockdep splat by never even attempting to grab
      the lock in this particular scenario.
      
      There are situatons where a synchronous cancel is required, however.
      So, always do the synchronous cancel if not in reset. And add an extra
      synchronous cancel to the end of the reset flow to account for when a
      reset is occurring at driver shutdown and the cancel is no longer
      synchronous but could lead to unallocated memory accesses if the
      worker is not stopped.
      Signed-off-by: default avatarZhanjun Dong <zhanjun.dong@intel.com>
      Signed-off-by: default avatarJohn Harrison <John.C.Harrison@Intel.com>
      Cc: Andi Shyti <andi.shyti@linux.intel.com>
      Cc: Daniel Vetter <daniel@ffwll.ch>
      Reviewed-by: default avatarAndi Shyti <andi.shyti@linux.intel.com>
      Acked-by: default avatarDaniel Vetter <daniel.vetter@ffwll.ch>
      Link: https://patchwork.freedesktop.org/patch/msgid/20231219195957.212600-1-John.C.Harrison@Intel.com
      0e00a881
    • Alan Previn's avatar
      drm/i915/guc: Close deregister-context race against CT-loss · 2f2cc53b
      Alan Previn authored
      If we are at the end of suspend or very early in resume
      its possible an async fence signal (via rcu_call) is triggered
      to free_engines which could lead us to the execution of
      the context destruction worker (after a prior worker flush).
      
      Thus, when suspending, insert rcu_barriers at the start
      of i915_gem_suspend (part of driver's suspend prepare) and
      again in i915_gem_suspend_late so that all such cases have
      completed and context destruction list isn't missing anything.
      
      In destroyed_worker_func, close the race against CT-loss
      by checking that CT is enabled before calling into
      deregister_destroyed_contexts.
      
      Based on testing, guc_lrc_desc_unpin may still race and fail
      as we traverse the GuC's context-destroy list because the
      CT could be disabled right before calling GuC's CT send function.
      
      We've witnessed this race condition once every ~6000-8000
      suspend-resume cycles while ensuring workloads that render
      something onscreen is continuously started just before
      we suspend (and the workload is small enough to complete
      and trigger the queued engine/context free-up either very
      late in suspend or very early in resume).
      
      In such a case, we need to unroll the entire process because
      guc-lrc-unpin takes a gt wakeref which only gets released in
      the G2H IRQ reply that never comes through in this corner
      case. Without the unroll, the taken wakeref is leaked and will
      cascade into a kernel hang later at the tail end of suspend in
      this function:
      
         intel_wakeref_wait_for_idle(&gt->wakeref)
         (called by) - intel_gt_pm_wait_for_idle
         (called by) - wait_for_suspend
      
      Thus, do an unroll in guc_lrc_desc_unpin and deregister_destroyed_-
      contexts if guc_lrc_desc_unpin fails due to CT send falure.
      When unrolling, keep the context in the GuC's destroy-list so
      it can get picked up on the next destroy worker invocation
      (if suspend aborted) or get fully purged as part of a GuC
      sanitization (end of suspend) or a reset flow.
      Signed-off-by: default avatarAlan Previn <alan.previn.teres.alexis@intel.com>
      Signed-off-by: default avatarAnshuman Gupta <anshuman.gupta@intel.com>
      Tested-by: default avatarMousumi Jana <mousumi.jana@intel.com>
      Acked-by: default avatarDaniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
      Reviewed-by: default avatarRodrigo Vivi <rodrigo.vivi@intel.com>
      Signed-off-by: default avatarMatt Roper <matthew.d.roper@intel.com>
      Link: https://patchwork.freedesktop.org/patch/msgid/20231229215143.581619-1-alan.previn.teres.alexis@intel.com
      2f2cc53b
    • Alan Previn's avatar
      drm/i915/guc: Flush context destruction worker at suspend · 5e83c060
      Alan Previn authored
      When suspending, flush the context-guc-id
      deregistration worker at the final stages of
      intel_gt_suspend_late when we finally call gt_sanitize
      that eventually leads down to __uc_sanitize so that
      the deregistration worker doesn't fire off later as
      we reset the GuC microcontroller.
      Signed-off-by: default avatarAlan Previn <alan.previn.teres.alexis@intel.com>
      Reviewed-by: default avatarRodrigo Vivi <rodrigo.vivi@intel.com>
      Tested-by: default avatarMousumi Jana <mousumi.jana@intel.com>
      Signed-off-by: default avatarMatt Roper <matthew.d.roper@intel.com>
      Link: https://patchwork.freedesktop.org/patch/msgid/20231228045558.536585-2-alan.previn.teres.alexis@intel.com
      5e83c060
  20. 06 Jan, 2024 1 commit
  21. 05 Jan, 2024 2 commits
  22. 02 Jan, 2024 1 commit
  23. 29 Dec, 2023 4 commits
  24. 22 Dec, 2023 1 commit
  25. 19 Dec, 2023 1 commit
  26. 15 Dec, 2023 1 commit