    drm/i915: Execlists small cleanups and micro-optimisations · c6a2ac71
    Tvrtko Ursulin authored
    Assorted changes: code cleanup, fewer invariant conditionals
    in the interrupt handler, reduced lock contention, and MMIO
    access optimisation.
    
     * Remove needless initialization.
     * Improve cache locality by reorganizing code and/or using
       branch hints to keep unexpected or error conditions out
       of line.
     * Favor busy submit path vs. empty queue.
     * Less branching in hot-paths.
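
    The branch-hint item above can be illustrated with a minimal
    user-space sketch. It assumes GCC or Clang (for
    __builtin_expect and the cold/noinline attributes). The
    macros mirror the kernel's likely()/unlikely(), while
    process_events() and report_error() are hypothetical names
    made up for this example and are not taken from the patch.

```c
#include <stddef.h>

/* User-space stand-ins for the kernel's likely()/unlikely() hints. */
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

/*
 * Hypothetical error path. Marking it cold and noinline asks the
 * compiler to place it away from the hot loop, so the common case
 * stays compact in the instruction cache.
 */
static __attribute__((cold, noinline)) int report_error(int status)
{
    return status; /* pass the negative error code back unchanged */
}

/*
 * Hypothetical hot path: counts events until a negative status is
 * seen. The error branch is hinted as unexpected.
 */
int process_events(const int *status, size_t count)
{
    int handled = 0;

    for (size_t i = 0; i < count; i++) {
        if (unlikely(status[i] < 0))
            return report_error(status[i]);
        handled++;
    }

    return handled;
}
```

    The hints do not change behaviour, only code layout: the
    function returns the number of events handled, or the first
    negative status it encounters.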
    
    v2:
    
     * Avoid mmio reads when possible. (Chris Wilson)
     * Use natural integer size for csb indices.
     * Remove useless return value from execlists_update_context.
     * Extract 32-bit ppgtt PDPs update so it is out of line and
       shared with two callers.
     * Grab forcewake across all mmio operations to ease the
       load on uncore lock and use cheaper mmio ops.
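
    The forcewake item above can be sketched with a small
    user-space model. Everything here (struct fake_uncore, the
    fake_* helpers, the two sum_* functions) is hypothetical and
    only imitates the idea. It is not the i915 API, and the
    lock counter is a stand-in for real spinlock traffic.

```c
#include <stdint.h>

#define FAKE_NUM_REGS 4

/* Hypothetical model of a device whose registers need a wake reference. */
struct fake_uncore {
    unsigned int lock_acquisitions; /* times the (simulated) lock was taken */
    unsigned int wake_refs;         /* outstanding wake references */
    uint32_t regs[FAKE_NUM_REGS];
};

/* Take a wake reference; costs one simulated lock acquisition. */
static void fake_forcewake_get(struct fake_uncore *u)
{
    u->lock_acquisitions++;
    u->wake_refs++;
}

/* Drop a wake reference; also costs one simulated lock acquisition. */
static void fake_forcewake_put(struct fake_uncore *u)
{
    u->lock_acquisitions++;
    u->wake_refs--;
}

/* Raw read: the caller must already hold a wake reference. */
static uint32_t fake_read_fw(const struct fake_uncore *u, unsigned int reg)
{
    return u->regs[reg];
}

/* Self-contained read: wraps every access in its own get/put pair. */
static uint32_t fake_read(struct fake_uncore *u, unsigned int reg)
{
    uint32_t val;

    fake_forcewake_get(u);
    val = fake_read_fw(u, reg);
    fake_forcewake_put(u);

    return val;
}

/* Before: each register read pays for the lock twice. */
uint32_t sum_per_access(struct fake_uncore *u)
{
    uint32_t sum = 0;

    for (unsigned int i = 0; i < FAKE_NUM_REGS; i++)
        sum += fake_read(u, i);

    return sum;
}

/* After: one get/put pair held across all raw reads. */
uint32_t sum_batched(struct fake_uncore *u)
{
    uint32_t sum = 0;

    fake_forcewake_get(u);
    for (unsigned int i = 0; i < FAKE_NUM_REGS; i++)
        sum += fake_read_fw(u, i);
    fake_forcewake_put(u);

    return sum;
}
```

    In this model, reading four registers costs eight simulated
    lock acquisitions per access but only two when batched, with
    the same values read either way.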
    
    v3:
    
     * Removed some more pointless u8 data types.
     * Removed unused return from execlists_context_queue.
     * Commit message updates.
    
    v4:
     * Unclumsify the unqueue if statement. (Chris Wilson)
     * Hide forcewake from the queuing function. (Chris Wilson)
    
    Version 3 now makes the irq handling code path ~20% smaller on
    48-bit PPGTT hardware, and a little bit less elsewhere. Hot
    paths are mostly in-line now and hammering on the uncore
    spinlock is greatly reduced together with mmio traffic to an
    extent.
    
    Benchmarking with "gem_latency -n 100" (keep submitting
    batches with 100 nop instructions) shows approximately 4% higher
    throughput, 2% less CPU time and 22% smaller latencies. This was
    on a big-core while small-cores could benefit even more.
    
    The most likely reasons for the improvements are the MMIO
    optimisation and the uncore lock traffic reduction.
    
    One odd result is with "gem_latency -n 0" (dispatching empty
    batches) which shows 5% more throughput, 8% less CPU time,
    25% better producer and consumer latencies, but 15% higher
    dispatch latency, which is as yet unexplained.
    
    Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
    Cc: Chris Wilson <chris@chris-wilson.co.uk>
    Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
    Link: http://patchwork.freedesktop.org/patch/msgid/1456505912-22286-1-git-send-email-tvrtko.ursulin@linux.intel.com