    perf/x86: Use INST_RETIRED.PREC_DIST for cycles: ppp · 72469764
    Andi Kleen authored
    Add a new 'three-p' precise level (cycles:ppp) that uses
    INST_RETIRED.PREC_DIST as its base event. The basic mechanism of
    abusing the inverse cmask to get all cycles works the same as before.
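
    As a rough illustration of the inverse cmask trick, the sketch below
    packs a raw event config using the architectural IA32_PERFEVTSELx field
    layout. With inv=1 the counter counts cycles in which the event fired
    fewer than cmask times, and with cmask=16 that is effectively every
    cycle. The helper names are hypothetical, and the event/umask encodings
    (0xc2/0x01 for UOPS_RETIRED.ALL, 0xc0/0x01 for INST_RETIRED.PREC_DIST)
    and the cmask value of 16 are assumptions taken from Intel's event
    tables, not something stated in this message.

```c
#include <stdint.h>

/* Field positions in an IA32_PERFEVTSELx-style raw config. */
#define EVTSEL_EVENT_SHIFT 0   /* bits 0-7:   event select */
#define EVTSEL_UMASK_SHIFT 8   /* bits 8-15:  unit mask */
#define EVTSEL_INV_SHIFT   23  /* bit 23:     invert the cmask comparison */
#define EVTSEL_CMASK_SHIFT 24  /* bits 24-31: counter mask threshold */

/* Hypothetical helper for illustration; not a kernel function. */
static uint64_t evtsel_config(uint8_t event, uint8_t umask, int inv,
                              uint8_t cmask)
{
    return ((uint64_t)event << EVTSEL_EVENT_SHIFT) |
           ((uint64_t)umask << EVTSEL_UMASK_SHIFT) |
           ((uint64_t)(inv ? 1 : 0) << EVTSEL_INV_SHIFT) |
           ((uint64_t)cmask << EVTSEL_CMASK_SHIFT);
}

/* Assumed encoding of the old base event: UOPS_RETIRED.ALL, inv, cmask=16. */
static uint64_t cycles_alias_uops_retired(void)
{
    return evtsel_config(0xc2, 0x01, 1, 16);
}

/* Assumed encoding of the new base event: INST_RETIRED.PREC_DIST, inv, cmask=16. */
static uint64_t cycles_alias_prec_dist(void)
{
    return evtsel_config(0xc0, 0x01, 1, 16);
}
```

    Under these assumptions the two aliases differ only in the event select
    byte. The inv and cmask fields that turn an event count into a cycle
    count are identical.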
    
    PREC_DIST is available on Sandy Bridge or later. It had some problems
    on Sandy Bridge, so we only use it on IvyBridge and later. I tested it
    on Broadwell and Skylake.
    
    PREC_DIST has special support for avoiding shadow effects, which can
    give better results compared to UOPS_RETIRED. The drawback is that
    PREC_DIST can only schedule on counter 1, but that is ok for cycle
    sampling, as there is normally no need to do multiple cycle sampling
    runs in parallel. It is still possible to run perf top in parallel, as
    that doesn't use precise mode. Also of course the multiplexing can
    still allow parallel operation.
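
    The counter restriction can be pictured as a bitmask of allowed
    counters. The following is a simplified sketch, not the kernel's actual
    event scheduler: claim_counter() and the assumption of four
    general-purpose counters are made up for illustration. Only the
    "counter 1 only" mask for PREC_DIST comes from the text above.

```c
#include <stdint.h>

#define ANY_GP_COUNTER 0xfu /* assumed: 4 general-purpose counters (0-3) */
#define PREC_DIST_ONLY 0x2u /* counter 1 only */

/*
 * Hypothetical helper: claim the lowest free counter permitted by
 * `allowed`, mark it in `*used`, and return its index. Returns -1 when
 * every permitted counter is already taken.
 */
static int claim_counter(uint32_t allowed, uint32_t *used)
{
    uint32_t free_mask = allowed & ~*used;

    for (int i = 0; i < 32; i++) {
        if (free_mask & (1u << i)) {
            *used |= 1u << i;
            return i;
        }
    }
    return -1;
}
```

    In this model a second PREC_DIST event cannot be placed while the first
    one holds counter 1, so it would have to be multiplexed in time, whereas
    an unconstrained event can still take any of the remaining counters.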
    
    :pp stays with the previous event.
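
    A minimal sketch of how the precise level could select the base event
    is shown below. It is not the code of this patch: pick_cycles_alias()
    is a hypothetical helper, the raw config values are assumed encodings,
    treating :p the same as :pp is an assumption, and returning an error
    when PREC_DIST is not usable is this sketch's own choice.

```c
#include <stdbool.h>
#include <stdint.h>

#define RAW_CYCLES       0x003cULL     /* plain CPU_CLK_UNHALTED core cycles */
#define RAW_UOPS_RETIRED 0x108001c2ULL /* assumed: UOPS_RETIRED.ALL, inv, cmask=16 */
#define RAW_PREC_DIST    0x108001c0ULL /* assumed: INST_RETIRED.PREC_DIST, inv, cmask=16 */

/*
 * Hypothetical helper: choose the raw config for a "cycles" event at the
 * given precise level. Returns 0 and fills *config on success, or -1 if
 * the level is not supported in this sketch.
 */
static int pick_cycles_alias(int precise_ip, bool prec_dist_usable,
                             uint64_t *config)
{
    switch (precise_ip) {
    case 0:
        *config = RAW_CYCLES;       /* no precise sampling, no rewrite */
        return 0;
    case 1:                         /* assumption: handled like :pp */
    case 2:
        *config = RAW_UOPS_RETIRED; /* :pp stays with the previous event */
        return 0;
    case 3:
        if (!prec_dist_usable)      /* e.g. before IvyBridge */
            return -1;
        *config = RAW_PREC_DIST;    /* :ppp uses PREC_DIST */
        return 0;
    default:
        return -1;
    }
}
```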
    
    Example:
    
    Sample a loop with 10 sqrt with old cycles:pp
    
    	  0.14 │10:   sqrtps %xmm1,%xmm0     <--------------
    	  9.13 │      sqrtps %xmm1,%xmm0
    	 11.58 │      sqrtps %xmm1,%xmm0
    	 11.51 │      sqrtps %xmm1,%xmm0
    	  6.27 │      sqrtps %xmm1,%xmm0
    	 10.38 │      sqrtps %xmm1,%xmm0
    	 12.20 │      sqrtps %xmm1,%xmm0
    	 12.74 │      sqrtps %xmm1,%xmm0
    	  5.40 │      sqrtps %xmm1,%xmm0
    	 10.14 │      sqrtps %xmm1,%xmm0
    	 10.51 │    ↑ jmp    10
    
    We expect all 10 sqrt to get roughly the same number of samples.
    
    But you can see that the instruction directly after the JMP is
    systematically underestimated in the result, due to sampling shadow
    effects.
    
    With the new PREC_DIST based sampling this problem is gone and all
    instructions show up roughly evenly:
    
    	  9.51 │10:   sqrtps %xmm1,%xmm0
    	 11.74 │      sqrtps %xmm1,%xmm0
    	 11.84 │      sqrtps %xmm1,%xmm0
    	  6.05 │      sqrtps %xmm1,%xmm0
    	 10.46 │      sqrtps %xmm1,%xmm0
    	 12.25 │      sqrtps %xmm1,%xmm0
    	 12.18 │      sqrtps %xmm1,%xmm0
    	  5.26 │      sqrtps %xmm1,%xmm0
    	 10.13 │      sqrtps %xmm1,%xmm0
    	 10.43 │      sqrtps %xmm1,%xmm0
    	  0.16 │    ↑ jmp    10
    
    Even with PREC_DIST there is still sampling skid and the result is not
    completely even, but systematic shadow effects are significantly
    reduced.
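
    One simple way to put a number on this is the ratio between the least
    and most sampled sqrtps in each listing, where 1.0 would be perfectly
    even. The sketch below uses the percentages from the two listings
    above. The metric and the names spread_ratio(), old_pp_sqrt and
    new_ppp_sqrt are illustrative choices, not part of the patch.

```c
#include <stddef.h>

/* Sample percentages of the 10 sqrtps instructions with cycles:pp. */
static const double old_pp_sqrt[10] = {
    0.14, 9.13, 11.58, 11.51, 6.27, 10.38, 12.20, 12.74, 5.40, 10.14
};

/* Sample percentages of the 10 sqrtps instructions with cycles:ppp. */
static const double new_ppp_sqrt[10] = {
    9.51, 11.74, 11.84, 6.05, 10.46, 12.25, 12.18, 5.26, 10.13, 10.43
};

/* Ratio of the smallest to the largest share; requires n > 0. */
static double spread_ratio(const double *pct, size_t n)
{
    double lo = pct[0], hi = pct[0];

    for (size_t i = 1; i < n; i++) {
        if (pct[i] < lo)
            lo = pct[i];
        if (pct[i] > hi)
            hi = pct[i];
    }
    return lo / hi;
}
```

    For the old listing the ratio is about 0.01 (0.14 against 12.74), and
    for the new one about 0.43 (5.26 against 12.25). The near-zero outlier
    is gone, while the remaining unevenness reflects ordinary skid.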
    
    The improvements are mainly expected to make a difference in high IPC
    code. With low IPC code the results should be similar to the previous
    event.
    
    Signed-off-by: Andi Kleen <ak@linux.intel.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
    Cc: Jiri Olsa <jolsa@redhat.com>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Stephane Eranian <eranian@google.com>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: Vince Weaver <vincent.weaver@maine.edu>
    Cc: hpa@zytor.com
    Link: http://lkml.kernel.org/r/1448929689-13771-2-git-send-email-andi@firstfloor.org
    Signed-off-by: Ingo Molnar <mingo@kernel.org>