    drm/i915/guc: Close deregister-context race against CT-loss · 2f2cc53b
    Alan Previn authored
    If we are at the end of suspend or very early in resume,
    it's possible that an async fence signal (via call_rcu) triggers
    free_engines, which could lead us to execute the context
    destruction worker (after a prior worker flush).
    
    Thus, when suspending, insert rcu_barrier() calls at the start
    of i915_gem_suspend (part of the driver's suspend prepare) and
    again in i915_gem_suspend_late so that all such cases have
    completed and the context destruction list isn't missing anything.
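
The ordering argument above can be sketched with a small userspace model. This is an illustration, not the driver code: every `model_*` name is hypothetical, and the barrier here simply runs the queued callbacks synchronously, whereas the kernel's rcu_barrier() waits for already-queued call_rcu() callbacks to finish.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_MAX_CB 8

typedef void (*model_cb_fn)(void *arg);

/* Hypothetical stand-in for callbacks queued via call_rcu(). */
struct model_rcu {
	model_cb_fn fn[MODEL_MAX_CB];
	void *arg[MODEL_MAX_CB];
	size_t count;
};

/* Queue a deferred callback; returns 0 on success, -1 if the queue is full. */
static int model_call_rcu(struct model_rcu *rcu, model_cb_fn fn, void *arg)
{
	assert(rcu && fn);
	if (rcu->count == MODEL_MAX_CB)
		return -1;
	rcu->fn[rcu->count] = fn;
	rcu->arg[rcu->count] = arg;
	rcu->count++;
	return 0;
}

/* Model of a barrier: run every callback queued so far, then empty the queue. */
static void model_rcu_barrier(struct model_rcu *rcu)
{
	size_t i;

	assert(rcu);
	for (i = 0; i < rcu->count; i++)
		rcu->fn[i](rcu->arg[i]);
	rcu->count = 0;
}

/* Model of free_engines: puts one more context on the destroy list. */
static void model_free_engines(void *arg)
{
	int *destroy_list_len = arg;

	(*destroy_list_len)++;
}

/*
 * Model of the start of suspend: the barrier runs first, so the value
 * returned is the destroy-list length that a later worker flush would see.
 */
static int model_gem_suspend(struct model_rcu *rcu, int *destroy_list_len)
{
	assert(rcu && destroy_list_len);
	model_rcu_barrier(rcu);
	return *destroy_list_len;
}
```

In this model, two callbacks queued before suspend leave the destroy list empty until model_gem_suspend() runs the barrier, after which both contexts are on the list and nothing is pending.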
    
    In destroyed_worker_func, close the race against CT-loss
    by checking that CT is enabled before calling into
    deregister_destroyed_contexts.
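
A minimal sketch of the guard described above, written as a userspace model rather than the actual patch. The `model_*` structures and helpers are hypothetical; only the shape of the check (skip deregistration while CT is disabled, leaving the list intact) follows the text.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the GuC CT channel state. */
struct model_ct {
	bool enabled;
};

/* Hypothetical stand-in for the GuC with its context-destroy list. */
struct model_guc {
	struct model_ct ct;
	int pending_destroy;	/* contexts waiting on the destroy list */
	int deregistered;	/* contexts already deregistered */
};

static bool model_ct_enabled(const struct model_ct *ct)
{
	assert(ct);
	return ct->enabled;
}

/* Drain the destroy list; only valid while CT can carry the requests. */
static void model_deregister_destroyed_contexts(struct model_guc *guc)
{
	assert(guc);
	while (guc->pending_destroy > 0) {
		guc->pending_destroy--;
		guc->deregistered++;
	}
}

/* Model of the destroy worker: do nothing if CT is down. */
static void model_destroyed_worker(struct model_guc *guc)
{
	assert(guc);
	if (!model_ct_enabled(&guc->ct))
		return;	/* list is left untouched for a later pass */
	model_deregister_destroyed_contexts(guc);
}
```

With CT disabled the worker returns early and all pending contexts stay queued; once CT is enabled again, the next invocation drains the list.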
    
    Based on testing, guc_lrc_desc_unpin may still race and fail
    as we traverse the GuC's context-destroy list because the
    CT could be disabled right before calling GuC's CT send function.
    
    We've witnessed this race condition once every ~6000-8000
    suspend-resume cycles while ensuring that a workload that renders
    something onscreen is continuously started just before
    we suspend (and that the workload is small enough to complete
    and trigger the queued engine/context free-up either very
    late in suspend or very early in resume).
    
    In such a case, we need to unroll the entire process because
    guc_lrc_desc_unpin takes a gt wakeref which only gets released in
    the G2H IRQ reply that never comes through in this corner
    case. Without the unroll, the taken wakeref is leaked and will
    cascade into a kernel hang later at the tail end of suspend in
    this function:
    
       intel_wakeref_wait_for_idle(&gt->wakeref)
       (called by) - intel_gt_pm_wait_for_idle
       (called by) - wait_for_suspend
    
    Thus, do an unroll in guc_lrc_desc_unpin and
    deregister_destroyed_contexts if guc_lrc_desc_unpin fails due
    to CT send failure. When unrolling, keep the context in the
    GuC's destroy-list so it can get picked up on the next destroy
    worker invocation (if suspend aborted) or get fully purged as
    part of a GuC sanitization (end of suspend) or a reset flow.
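
The unroll can be illustrated with a userspace model that tracks only the wakeref count and destroy-list membership. All `model_*` names are hypothetical, and the -ENODEV error code is an assumption made for the sketch. The real functions carry more state than is shown here.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the gt wakeref and CT state. */
struct model_gt {
	int wakeref;
	bool ct_enabled;
};

struct model_ctx {
	bool on_destroy_list;
	bool destroyed;
};

/* Model of the CT send: fails when CT has been disabled (error code assumed). */
static int model_ct_send(const struct model_gt *gt)
{
	assert(gt);
	return gt->ct_enabled ? 0 : -ENODEV;
}

/* Model of guc_lrc_desc_unpin: undo its own side effects if the send fails. */
static int model_lrc_desc_unpin(struct model_gt *gt, struct model_ctx *ce)
{
	int ret;

	assert(gt && ce);
	gt->wakeref++;		/* normally dropped when the G2H reply arrives */
	ce->destroyed = true;

	ret = model_ct_send(gt);
	if (ret) {
		/* No G2H reply will come: unroll so the wakeref is not leaked. */
		ce->destroyed = false;
		gt->wakeref--;
	}
	return ret;
}

/* Model of one deregister_destroyed_contexts step for a single context. */
static int model_deregister_one(struct model_gt *gt, struct model_ctx *ce)
{
	int ret;

	assert(gt && ce);
	ce->on_destroy_list = false;	/* taken off the destroy list */
	ret = model_lrc_desc_unpin(gt, ce);
	if (ret)
		ce->on_destroy_list = true;	/* put back for a later pass */
	return ret;
}

/* Model of the G2H reply handler releasing the wakeref. */
static void model_g2h_reply(struct model_gt *gt)
{
	assert(gt);
	gt->wakeref--;
}
```

In the failing case the wakeref count returns to its starting value and the context stays on the destroy list, so a later wait for idle in this model is not blocked by a reference that nothing would release.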
    Signed-off-by: Alan Previn <alan.previn.teres.alexis@intel.com>
    Signed-off-by: Anshuman Gupta <anshuman.gupta@intel.com>
    Tested-by: Mousumi Jana <mousumi.jana@intel.com>
    Acked-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
    Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
    Signed-off-by: Matt Roper <matthew.d.roper@intel.com>
    Link: https://patchwork.freedesktop.org/patch/msgid/20231229215143.581619-1-alan.previn.teres.alexis@intel.com