    perf events: Fix slow and broken cgroup context switch code · a8d757ef
    Stephane Eranian authored
    The current cgroup context switch code was incorrect, leading
    to bogus counts. Furthermore, as soon as there was an active
    cgroup event on a CPU, the context switch cost on that CPU
    would increase significantly, as demonstrated by a simple
    ping/pong example:
    
     $ ./pong
     Both processes pinned to CPU1, running for 10s
     10684.51 ctxsw/s
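
    pong itself is not part of this commit; as a rough stand-in, a
    pipe-based ping/pong with both processes pinned to one CPU could
    look like the sketch below (an assumed implementation, not the
    original pong source):

     /* pong.c -- minimal sketch: two processes bounce a byte over
      * pipes, both pinned to CPU1, and report context switches/sec. */
     #define _GNU_SOURCE
     #include <sched.h>
     #include <stdio.h>
     #include <stdlib.h>
     #include <time.h>
     #include <unistd.h>

     static void pin_to_cpu(int cpu)
     {
             cpu_set_t set;
             CPU_ZERO(&set);
             CPU_SET(cpu, &set);
             sched_setaffinity(0, sizeof(set), &set);
     }

     int main(void)
     {
             int ping[2], pong[2];
             char c = 0;
             long iters = 0;
             time_t end;

             pipe(ping);
             pipe(pong);

             if (fork() == 0) {
                     pin_to_cpu(1);
                     /* child: echo every byte straight back */
                     while (read(ping[0], &c, 1) == 1)
                             write(pong[1], &c, 1);
                     exit(0);
             }

             pin_to_cpu(1);
             end = time(NULL) + 10;
             while (time(NULL) < end) {
                     /* each round trip forces two context switches
                      * on CPU1, since both tasks share that CPU */
                     write(ping[1], &c, 1);
                     read(pong[0], &c, 1);
                     iters++;
             }
             printf("%.2f ctxsw/s\n", iters * 2 / 10.0);
             return 0;
     }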
    
    Now start a cgroup perf stat:
     $ perf stat -e cycles,cycles -A -a -G test  -C 1 -- sleep 100
    
    $ ./pong
     Both processes pinned to CPU1, running for 10s
     6674.61 ctxsw/s
    
    That's a 37% penalty.
    
    Note that pong is not even in the monitored cgroup.
    
    The results shown by perf stat are bogus:
     $ perf stat -e cycles,cycles -A -a -G test  -C 1 -- sleep 100
    
     Performance counter stats for 'sleep 100':
    
     CPU1 <not counted> cycles   test
     CPU1 16,984,189,138 cycles  #    0.000 GHz
    
    The second 'cycles' event should report a count at the CPU
    clock rate (here 2.4 GHz) as it is counting across all cgroups;
    over 100s that would be roughly 240 billion cycles, far from
    the ~17 billion reported above.
    
    The patch below fixes the bogus accounting and bypasses the
    cgroup switch entirely when the outgoing and incoming tasks
    are in the same cgroup.
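
    The heart of the fix is that same-cgroup bypass: on a context
    switch, compare the perf cgroups of the outgoing and incoming
    tasks and skip the (expensive) cgroup event switch when they
    match. A simplified sketch of the sched-out side of that check
    (helper names as used in the kernel; NULL handling and locking
    omitted):

     static inline void perf_cgroup_sched_out(struct task_struct *task,
                                              struct task_struct *next)
     {
             struct perf_cgroup *cgrp_out, *cgrp_in;

             cgrp_out = perf_cgroup_from_task(task);
             cgrp_in  = perf_cgroup_from_task(next);

             /*
              * Only schedule out cgroup events if the incoming task
              * belongs to a different cgroup; otherwise leave them
              * alone and the switch costs (almost) nothing extra.
              */
             if (cgrp_out != cgrp_in)
                     perf_cgroup_switch(task, PERF_CGROUP_SWOUT);
     }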
    
    With this patch the same test now yields:
     $ ./pong
     Both processes pinned to CPU1, running for 10s
     10775.30 ctxsw/s
    
    Start perf stat with cgroup:
    
     $ perf stat -e cycles,cycles -A -a -G test  -C 1 -- sleep 10
    
    Run pong outside the cgroup:
     $ ./pong
     Both processes pinned to CPU1, running for 10s
     10687.80 ctxsw/s
    
    The penalty is now less than 2%.
    
    And the results for perf stat are correct:
    
    $ perf stat -e cycles,cycles -A -a -G test  -C 1 -- sleep 10
    
     Performance counter stats for 'sleep 10':
    
     CPU1 <not counted> cycles test #    0.000 GHz
     CPU1 23,933,981,448 cycles      #    0.000 GHz
    
    Now perf stat reports the correct count for the non-cgroup
    event.
    
    If we run pong inside the cgroup, then we also get the
    correct counts:
    
    $ perf stat -e cycles,cycles -A -a -G test  -C 1 -- sleep 10
    
     Performance counter stats for 'sleep 10':
    
     CPU1 22,297,726,205 cycles test #    0.000 GHz
     CPU1 23,933,981,448 cycles      #    0.000 GHz
    
          10.001457237 seconds time elapsed
    Signed-off-by: Stephane Eranian <eranian@google.com>
    Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
    Link: http://lkml.kernel.org/r/20110825135803.GA4697@quad
    Signed-off-by: Ingo Molnar <mingo@elte.hu>