    drm/i915: Stop the machine whilst capturing the GPU crash dump · 9f267eb8
    Chris Wilson authored
    The error state is purposefully racy as we expect it to be called at any
    time and so have avoided any locking whilst capturing the crash dump.
    However, with multi-engine GPUs and multiple CPUs, those races can
    manifest into OOPSes as we attempt to chase dangling pointers freed on
    other CPUs. Under discussion are lots of ways to slow down normal
    operation in order to protect the post-mortem error capture, but what if
    we take the opposite approach and freeze the machine whilst the error
    capture runs? (Note the GPU may still be running, but as long as we don't
    process any of the results, the driver's bookkeeping will be static.)
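
    As a rough illustration of the call pattern described above, here is a
    minimal userspace sketch. It is not the driver code: every name ending
    in _model is a hypothetical stand-in, and stop_machine_model() simply
    calls the callback directly, whereas the kernel's
    stop_machine(fn, data, cpus) runs fn while the other CPUs are held idle.
    The only point is that the whole capture happens inside one callback and
    copies what it needs starting from the request.

```c
#include <stddef.h>

/* Hypothetical stand-ins for illustration; NOT the i915 structures. */
struct request_model {
	unsigned int seqno;
	unsigned int head;
};

struct error_state_model {
	const struct request_model *src;   /* live request to snapshot */
	struct request_model snapshot;     /* private copy kept for post-mortem */
	int captured;
};

/* Same shape as the kernel's cpu_stop_fn_t: int (*)(void *). */
typedef int (*stop_fn_model)(void *data);

/*
 * Userspace stand-in for stop_machine(). The real call freezes the other
 * CPUs around fn; this model only invokes fn and returns its result.
 */
int stop_machine_model(stop_fn_model fn, void *data)
{
	return fn(data);
}

/* Capture callback: copy everything reachable from the request. */
int capture_model(void *data)
{
	struct error_state_model *error = data;

	if (error->src == NULL)
		return -1;

	error->snapshot = *error->src;
	error->captured = 1;
	return 0;
}

/* Entry point: run the whole capture inside the "stopped" section. */
int capture_error_state_model(struct error_state_model *error)
{
	return stop_machine_model(capture_model, error);
}
```

    Anything that may itself call stop_machine() has to stay outside the
    callback, which is the reason for the v2 change noted below.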
    
    Note that by itself, this is not a complete fix. It also depends on
    the compiler barriers in list_add/list_del to prevent traversing the
    lists into the void. We also depend on only requiring state from
    carefully controlled sources - i.e. all the state we require for
    post-mortem debugging should be reachable from the request itself so
    that we only have to worry about retrieving the request carefully. Once
    we have the request, we know that all pointers from it are intact.
    
    v2: Avoid drm_clflush_pages() inside stop_machine() as it may use
    stop_machine() itself for its wbinvd fallback.

    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch>
    Link: http://patchwork.freedesktop.org/patch/msgid/20161012090522.367-3-chris@chris-wilson.co.uk
i915_gpu_error.c 41.2 KB