• Janusz Krzysztofik's avatar
    drm/i915/active: Fix misuse of non-idle barriers as fence trackers · e0e6b416
    Janusz Krzysztofik authored
    Users reported oopses on list corruptions when using i915 perf with a
    number of concurrently running graphics applications.  Root cause analysis
    pointed at an issue in barrier processing code -- a race among perf open /
    close replacing active barriers with perf requests on kernel context and
    concurrent barrier preallocate / acquire operations performed during user
    context first pin / last unpin.
    
    When adding a request to a composite tracker, we try to reuse an existing
    fence tracker, already allocated and registered with that composite.  The
    tracker we obtain may already track another fence, may be an idle barrier,
    or an active barrier.
    
    If the tracker we get occurs a non-idle barrier then we try to delete that
    barrier from a list of barrier tasks it belongs to.  However, while doing
    that we don't respect return value from a function that performs the
    barrier deletion.  Should the deletion ever fail, we would end up reusing
    the tracker still registered as a barrier task.  Since the same structure
    field is reused with both fence callback lists and barrier tasks list,
    list corruptions would likely occur.
    
    Barriers are now deleted from a barrier tasks list by temporarily removing
    the list content, traversing that content with skip over the node to be
    deleted, then populating the list back with the modified content.  Should
    that intentionally racy concurrent deletion attempts be not serialized,
    one or more of those may fail because of the list being temporary empty.
    
    Related code that ignores the results of barrier deletion was initially
    introduced in v5.4 by commit d8af05ff ("drm/i915: Allow sharing the
    idle-barrier from other kernel requests").  However, all users of the
    barrier deletion routine were apparently serialized at that time, then the
    issue didn't exhibit itself.  Results of git bisect with help of a newly
    developed igt@gem_barrier_race@remote-request IGT test indicate that list
    corruptions might start to appear after commit 31177017 ("drm/i915/gt:
    Schedule request retirement when timeline idles"), introduced in v5.5.
    
    Respect results of barrier deletion attempts -- mark the barrier as idle
    only if successfully deleted from the list.  Then, before proceeding with
    setting our fence as the one currently tracked, make sure that the tracker
    we've got is not a non-idle barrier.  If that check fails then don't use
    that tracker but go back and try to acquire a new, usable one.
    
    v3: use unlikely() to document what outcome we expect (Andi),
      - fix bad grammar in commit description.
    v2: no code changes,
      - blame commit 31177017 ("drm/i915/gt: Schedule request retirement
        when timeline idles"), v5.5, not commit d8af05ff ("drm/i915: Allow
        sharing the idle-barrier from other kernel requests"), v5.4,
      - reword commit description.
    
    Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/6333
    Fixes: 31177017 ("drm/i915/gt: Schedule request retirement when timeline idles")
    Cc: Chris Wilson <chris@chris-wilson.co.uk>
    Cc: stable@vger.kernel.org # v5.5
    Cc: Andi Shyti <andi.shyti@linux.intel.com>
    Signed-off-by: default avatarJanusz Krzysztofik <janusz.krzysztofik@linux.intel.com>
    Reviewed-by: default avatarAndi Shyti <andi.shyti@linux.intel.com>
    Signed-off-by: default avatarAndi Shyti <andi.shyti@linux.intel.com>
    Link: https://patchwork.freedesktop.org/patch/msgid/20230302120820.48740-1-janusz.krzysztofik@linux.intel.com
    (cherry picked from commit 50600605)
    Signed-off-by: default avatarJani Nikula <jani.nikula@intel.com>
    e0e6b416
i915_active.c 28.5 KB