    perf: Add PERF_RECORD_SWITCH to indicate context switches · 45ac1403
    Adrian Hunter authored
    There are already two events for context switches, namely the tracepoint
    sched:sched_switch and the software event context_switches.
    Unfortunately neither is suitable for use by non-privileged users for
    the purpose of synchronizing hardware trace data (e.g. Intel PT) to the
    context switch.
    
    Tracepoints are no good at all for non-privileged users because they
    need either CAP_SYS_ADMIN or /proc/sys/kernel/perf_event_paranoid <= -1.
    
    On the other hand, kernel software events need either CAP_SYS_ADMIN or
    /proc/sys/kernel/perf_event_paranoid <= 1.
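The two permission rules above can be sketched as a pair of predicates. This is only an illustration of the thresholds quoted in this message; the function names are hypothetical, and the real checks live inside the kernel, not in user space.

```c
#include <stdbool.h>
#include <stdio.h>

/* Tracepoints: CAP_SYS_ADMIN or perf_event_paranoid <= -1 (per the text above). */
bool tracepoint_allowed(int paranoid, bool cap_sys_admin)
{
	return cap_sys_admin || paranoid <= -1;
}

/* Kernel software events: CAP_SYS_ADMIN or perf_event_paranoid <= 1. */
bool kernel_sw_event_allowed(int paranoid, bool cap_sys_admin)
{
	return cap_sys_admin || paranoid <= 1;
}

/* Read the current sysctl value; returns 0 on success, -1 on failure. */
int read_paranoid(int *level)
{
	FILE *f = fopen("/proc/sys/kernel/perf_event_paranoid", "r");
	int ok;

	if (!f)
		return -1;
	ok = (fscanf(f, "%d", level) == 1);
	fclose(f);
	return ok ? 0 : -1;
}
```

With the common default of 1, a non-privileged user passes the second predicate but not the first.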
    
    Many distributions default perf_event_paranoid to 1, which makes
    context_switches a contender. However, it has another problem (shared
    with sched:sched_switch): it happens before perf schedules events out
    instead of after perf schedules events in. Whereas a privileged user
    can see all the events anyway, a non-privileged user only sees events
    for their own processes. In other words, they see when their process
    was scheduled out, not when it was scheduled in. That presents two
    problems for using the event:
    
    1. the information comes too late, so tools have to look ahead in the
       event stream to find out what the current state is
    
    2. if they are unlucky, tracing might have stopped before the
       context_switches event is recorded.
    
    This new PERF_RECORD_SWITCH event does not have those problems
    and it also has a couple of other small advantages.
    
    It is easier to use because it is an auxiliary event (like mmap, comm
    and task events) which can be enabled by setting a single bit. It is
    smaller than sched:sched_switch and easier to parse.
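As a rough user-space sketch of "setting a single bit": the bit is the context_switch field of struct perf_event_attr in the uapi header. Attaching it to a dummy software event, and the helper names below, are assumptions made for illustration rather than anything this message prescribes. Whether the open succeeds still depends on the running kernel and on perf_event_paranoid.

```c
#include <linux/perf_event.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Prepare an attr that carries only side-band records (hypothetical helper). */
void init_switch_attr(struct perf_event_attr *attr)
{
	memset(attr, 0, sizeof(*attr));
	attr->size = sizeof(*attr);
	attr->type = PERF_TYPE_SOFTWARE;
	attr->config = PERF_COUNT_SW_DUMMY;	/* assumption: no samples wanted */
	attr->sample_id_all = 1;		/* append sample_id to side-band records */
	attr->context_switch = 1;		/* request PERF_RECORD_SWITCH records */
	attr->exclude_kernel = 1;		/* assumption: user-space only */
	attr->disabled = 1;
}

/* Open the event for a task (pid) and/or cpu; returns an fd or -1. */
int open_switch_event(pid_t pid, int cpu)
{
	struct perf_event_attr attr;

	init_switch_attr(&attr);
	return (int)syscall(SYS_perf_event_open, &attr, pid, cpu, -1, 0);
}
```

A per-task caller would use something like open_switch_event(0, -1) and then mmap the returned fd to read records.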
    
    To make the event useful for privileged users also, if the
    context is cpu-wide then the event record will be
    PERF_RECORD_SWITCH_CPU_WIDE which is the same as
    PERF_RECORD_SWITCH except it also provides the next or
    previous pid/tid.
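A minimal decoding sketch for the two record types follows. It assumes the layout documented in the uapi header: a perf_event_header, then (for the cpu-wide variant only) a u32 next_prev_pid and a u32 next_prev_tid, then the optional sample_id trailer, which is ignored here. It also assumes the PERF_RECORD_MISC_SWITCH_OUT bit in header.misc marks a switch out. The struct and function names are hypothetical.

```c
#include <linux/perf_event.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

struct switch_info {
	bool out;		/* true: switched out, false: switched in */
	bool cpu_wide;		/* true for PERF_RECORD_SWITCH_CPU_WIDE */
	uint32_t next_prev_pid;	/* valid only when cpu_wide is set */
	uint32_t next_prev_tid;	/* valid only when cpu_wide is set */
};

/* Returns 0 if buf holds a switch record, -1 otherwise. */
int decode_switch(const void *buf, size_t len, struct switch_info *info)
{
	struct perf_event_header hdr;
	const unsigned char *p = buf;

	if (len < sizeof(hdr))
		return -1;
	memcpy(&hdr, p, sizeof(hdr));
	if (hdr.size < sizeof(hdr) || hdr.size > len)
		return -1;

	memset(info, 0, sizeof(*info));
	info->out = (hdr.misc & PERF_RECORD_MISC_SWITCH_OUT) != 0;

	if (hdr.type == PERF_RECORD_SWITCH)
		return 0;
	if (hdr.type != PERF_RECORD_SWITCH_CPU_WIDE)
		return -1;
	if (hdr.size < sizeof(hdr) + 2 * sizeof(uint32_t))
		return -1;

	info->cpu_wide = true;
	memcpy(&info->next_prev_pid, p + sizeof(hdr), sizeof(uint32_t));
	memcpy(&info->next_prev_tid, p + sizeof(hdr) + sizeof(uint32_t),
	       sizeof(uint32_t));
	return 0;
}
```

The only difference between the two branches is the extra pid/tid pair in the cpu-wide record.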

    Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
    Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Tested-by: Jiri Olsa <jolsa@redhat.com>
    Cc: Andi Kleen <ak@linux.intel.com>
    Cc: Mathieu Poirier <mathieu.poirier@linaro.org>
    Cc: Pawel Moll <pawel.moll@arm.com>
    Cc: Stephane Eranian <eranian@google.com>
    Link: http://lkml.kernel.org/r/1437471846-26995-2-git-send-email-adrian.hunter@intel.com
    Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>