Commit a8c51ed2 authored by Chris Wilson's avatar Chris Wilson

drm/i915/gt: Try to more gracefully quiesce the system before resets

If we are doing a normal GPU reset triggered after detecting a long
period of stalled work, we can take our time and allow the engines to
quiesce. Since we've stopped submission to the engine, and if we wait
long enough an innocent context should complete, leaving the engine idle.
So by waiting a short amount of time, we should prevent clobbering other
users when resetting a stuck context.
Suggested-by: default avatarJoonas Lahtinen <joonas.lahtinen@linux.intel.com>
Suggested-by: default avatarJon Bloomfield <jon.bloomfield@intel.com>
Signed-off-by: default avatarChris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Reviewed-by: default avatarMika Kuoppala <mika.kuoppala@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20191023133108.21401-1-chris@chris-wilson.co.uk
parent a1ceb93a
...@@ -25,3 +25,14 @@ config DRM_I915_SPIN_REQUEST ...@@ -25,3 +25,14 @@ config DRM_I915_SPIN_REQUEST
May be 0 to disable the initial spin. In practice, we estimate May be 0 to disable the initial spin. In practice, we estimate
the cost of enabling the interrupt (if currently disabled) to be the cost of enabling the interrupt (if currently disabled) to be
a few microseconds. a few microseconds.
config DRM_I915_STOP_TIMEOUT
int "How long to wait for an engine to quiesce gracefully before reset (ms)"
default 100 # milliseconds
help
By stopping submission and sleeping for a short time before resetting
the GPU, we allow the innocent contexts also on the system to quiesce.
It is then less likely for a hanging context to cause collateral
damage as the system is reset in order to recover. The corollary is
that the reset itself may take longer and so be more disruptive to
interactive or low latency workloads.
...@@ -308,6 +308,9 @@ static int intel_engine_setup(struct intel_gt *gt, enum intel_engine_id id) ...@@ -308,6 +308,9 @@ static int intel_engine_setup(struct intel_gt *gt, enum intel_engine_id id)
engine->instance = info->instance; engine->instance = info->instance;
__sprint_engine_name(engine); __sprint_engine_name(engine);
engine->props.stop_timeout_ms =
CONFIG_DRM_I915_STOP_TIMEOUT;
/* /*
* To be overridden by the backend on setup. However to facilitate * To be overridden by the backend on setup. However to facilitate
* cleanup on error during setup, we always provide the destroy vfunc. * cleanup on error during setup, we always provide the destroy vfunc.
...@@ -875,6 +878,21 @@ u64 intel_engine_get_last_batch_head(const struct intel_engine_cs *engine) ...@@ -875,6 +878,21 @@ u64 intel_engine_get_last_batch_head(const struct intel_engine_cs *engine)
return bbaddr; return bbaddr;
} }
static unsigned long stop_timeout(const struct intel_engine_cs *engine)
{
if (in_atomic() || irqs_disabled()) /* inside atomic preempt-reset? */
return 0;
/*
* If we are doing a normal GPU reset, we can take our time and allow
* the engine to quiesce. We've stopped submission to the engine, and
* if we wait long enough an innocent context should complete and
* leave the engine idle. So they should not be caught unaware by
* the forthcoming GPU reset (which usually follows the stop_cs)!
*/
return READ_ONCE(engine->props.stop_timeout_ms);
}
int intel_engine_stop_cs(struct intel_engine_cs *engine) int intel_engine_stop_cs(struct intel_engine_cs *engine)
{ {
struct intel_uncore *uncore = engine->uncore; struct intel_uncore *uncore = engine->uncore;
...@@ -892,7 +910,7 @@ int intel_engine_stop_cs(struct intel_engine_cs *engine) ...@@ -892,7 +910,7 @@ int intel_engine_stop_cs(struct intel_engine_cs *engine)
err = 0; err = 0;
if (__intel_wait_for_register_fw(uncore, if (__intel_wait_for_register_fw(uncore,
mode, MODE_IDLE, MODE_IDLE, mode, MODE_IDLE, MODE_IDLE,
1000, 0, 1000, stop_timeout(engine),
NULL)) { NULL)) {
GEM_TRACE("%s: timed out on STOP_RING -> IDLE\n", engine->name); GEM_TRACE("%s: timed out on STOP_RING -> IDLE\n", engine->name);
err = -ETIMEDOUT; err = -ETIMEDOUT;
......
...@@ -542,6 +542,10 @@ struct intel_engine_cs { ...@@ -542,6 +542,10 @@ struct intel_engine_cs {
*/ */
ktime_t total; ktime_t total;
} stats; } stats;
struct {
unsigned long stop_timeout_ms;
} props;
}; };
static inline bool static inline bool
......
Markdown is supported
0%
or
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment