1. 12 Oct, 2020 1 commit
    • perf/core: Fix race in the perf_mmap_close() function · f91072ed
      Jiri Olsa authored
      There's a possible race in perf_mmap_close() when checking ring buffer's
      mmap_count refcount value. The problem is that the mmap_count check is
      not atomic because we call atomic_dec() and atomic_read() separately.
      
        perf_mmap_close:
        ...
         atomic_dec(&rb->mmap_count);
         ...
         if (atomic_read(&rb->mmap_count))
            goto out_put;
      
         <ring buffer detach>
         free_uid
      
      out_put:
        ring_buffer_put(rb); /* could be last */
      
      The race can happen when two (or more) events share the same ring
      buffer: both go through atomic_dec() and then both see 0 as the refcount
      value in the later atomic_read(). Both then go on to execute code which
      is meant to be run just once.
      
      The code that detaches the ring buffer is probably fine to execute more
      than once, but the problem is in calling free_uid() twice, which later
      shows up as related crashes and refcount warnings, like:
      
        refcount_t: addition on 0; use-after-free.
        ...
        RIP: 0010:refcount_warn_saturate+0x6d/0xf
        ...
        Call Trace:
        prepare_creds+0x190/0x1e0
        copy_creds+0x35/0x172
        copy_process+0x471/0x1a80
        _do_fork+0x83/0x3a0
        __do_sys_wait4+0x83/0x90
        __do_sys_clone+0x85/0xa0
        do_syscall_64+0x5b/0x1e0
        entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      Use an atomic decrement-and-test instead of separate calls.
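
The difference between the two patterns can be sketched in userspace with C11 atomics. This is a minimal model, not the kernel code: `struct ring_buffer_model` and both function names are hypothetical, and `atomic_fetch_sub()` stands in for the kernel's combined decrement-and-test helper.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the ring buffer; only the counter matters here. */
struct ring_buffer_model {
    atomic_int mmap_count;
};

/*
 * Racy pattern from the description: the decrement and the read are two
 * separate operations, so two concurrent callers can both observe zero.
 */
bool racy_is_last_unmap(struct ring_buffer_model *rb)
{
    assert(rb != NULL);
    atomic_fetch_sub(&rb->mmap_count, 1);
    return atomic_load(&rb->mmap_count) == 0;
}

/*
 * Combined pattern: atomic_fetch_sub() returns the value held before the
 * decrement, so exactly one caller sees the transition from 1 to 0.
 */
bool is_last_unmap(struct ring_buffer_model *rb)
{
    assert(rb != NULL);
    return atomic_fetch_sub(&rb->mmap_count, 1) == 1;
}
```

With a count of 2, the first call to `is_last_unmap()` returns false and the second returns true, regardless of how the two callers interleave.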
      Tested-by: Michael Petlan <mpetlan@redhat.com>
      Signed-off-by: Jiri Olsa <jolsa@kernel.org>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Acked-by: Namhyung Kim <namhyung@kernel.org>
      Acked-by: Wade Mealing <wmealing@redhat.com>
      Fixes: 9bb5d40c ("perf: Fix mmap() accounting hole");
      Link: https://lore.kernel.org/r/20200916115311.GE2301783@krava
  2. 06 Oct, 2020 2 commits
    • perf/x86: Fix n_metric for cancelled txn · 3dbde695
      Peter Zijlstra authored
      When a group that has TopDown members fails to be scheduled, any
      later TopDown groups will not return valid values.
      
      Here is an example.
      
      A background perf session occupies all the GP counters and
      fixed counter 1.
       $perf stat -e "{cycles,cycles,cycles,cycles,cycles,cycles,cycles,
                       cycles,cycles}:D" -a
      
      A user monitors a TopDown group. It works well, because the fixed
      counter 3 and the PERF_METRICS are available.
       $perf stat -x, --topdown -- ./workload
         retiring,bad speculation,frontend bound,backend bound,
         18.0,16.1,40.4,25.5,
      
      Then the user tries to monitor a group that has TopDown members.
      Because of the cycles event, the group fails to be scheduled.
       $perf stat -x, -e '{slots,topdown-retiring,topdown-be-bound,
                           topdown-fe-bound,topdown-bad-spec,cycles}'
                           -- ./workload
          <not counted>,,slots,0,0.00,,
          <not counted>,,topdown-retiring,0,0.00,,
          <not counted>,,topdown-be-bound,0,0.00,,
          <not counted>,,topdown-fe-bound,0,0.00,,
          <not counted>,,topdown-bad-spec,0,0.00,,
          <not counted>,,cycles,0,0.00,,
      
      The user tries to monitor a TopDown group again. It doesn't work anymore.
       $perf stat -x, --topdown -- ./workload
      
          ,,,,,
      
      In a txn, cancel_txn() truncates the event_list for a canceled
      group and updates the number of events added in this transaction.
      However, the number of TopDown events added in this transaction is not
      updated, so the kernel will probably fail to add new TopDown events.
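
The bookkeeping problem can be sketched with a small model. Everything here is hypothetical (the struct, its field names, and the three functions are illustrative, not the kernel's); the point is only that a cancel must roll back the metric count together with the event count.

```c
#include <assert.h>

/* Hypothetical per-CPU bookkeeping; field names are illustrative only. */
struct txn_model {
    int n_events;      /* events currently on the event list */
    int n_metric;      /* TopDown metric events currently accepted */
    int n_txn;         /* events added by the open transaction */
    int n_txn_metric;  /* metric events added by the open transaction */
};

void txn_start(struct txn_model *c)
{
    c->n_txn = 0;
    c->n_txn_metric = 0;
}

void txn_add(struct txn_model *c, int is_metric)
{
    c->n_events++;
    c->n_txn++;
    if (is_metric) {
        c->n_metric++;
        c->n_txn_metric++;
    }
}

/* Roll back everything the transaction added, including metric events. */
void txn_cancel(struct txn_model *c)
{
    assert(c->n_txn <= c->n_events && c->n_txn_metric <= c->n_metric);
    c->n_events -= c->n_txn;
    c->n_metric -= c->n_txn_metric;
    txn_start(c);
}
```

If `txn_cancel()` left `n_metric` untouched, the stale count would make later metric events look like duplicates of events that were never actually scheduled.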
      
      Fixes: 7b2c05a1 ("perf/x86/intel: Generic support for hardware TopDown metrics")
      Reported-by: Andi Kleen <ak@linux.intel.com>
      Reported-by: Kan Liang <kan.liang@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Tested-by: Kan Liang <kan.liang@linux.intel.com>
      Link: https://lkml.kernel.org/r/20201005082611.GH2628@hirez.programming.kicks-ass.net
    • perf/x86: Fix n_pair for cancelled txn · 871a93b0
      Peter Zijlstra authored
      Kan reported that n_metric gets corrupted for cancelled transactions;
      a similar issue exists for n_pair for AMD's Large Increment thing.
      
      The problem was confirmed and confirmed fixed by Kim using:
      
        sudo perf stat -e "{cycles,cycles,cycles,cycles}:D" -a sleep 10 &
      
        # should succeed:
        sudo perf stat -e "{fp_ret_sse_avx_ops.all}:D" -a workload
      
        # should fail:
        sudo perf stat -e "{fp_ret_sse_avx_ops.all,fp_ret_sse_avx_ops.all,cycles}:D" -a workload
      
        # previously failed, now succeeds with this patch:
        sudo perf stat -e "{fp_ret_sse_avx_ops.all}:D" -a workload
      
      Fixes: 57388912 ("perf/x86/amd: Add support for Large Increment per Cycle Events")
      Reported-by: Kan Liang <kan.liang@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Tested-by: Kim Phillips <kim.phillips@amd.com>
      Link: https://lkml.kernel.org/r/20201005082516.GG2628@hirez.programming.kicks-ass.net
  3. 03 Oct, 2020 2 commits
  4. 29 Sep, 2020 8 commits
  5. 24 Sep, 2020 11 commits
  6. 10 Sep, 2020 10 commits
    • arch/x86/amd/ibs: Fix re-arming IBS Fetch · 221bfce5
      Kim Phillips authored
      Stephane Eranian found a bug: IBS' current Fetch counter was not
      being reset when the driver wrote the new value to clear it with the
      enable bit set. He also found that adding an MSR write that first
      disables IBS Fetch makes IBS Fetch reset its current count.
      
      Indeed, the PPR for AMD Family 17h Model 31h B0 55803 Rev 0.54 - Sep 12,
      2019 states "The periodic fetch counter is set to IbsFetchCnt [...] when
      IbsFetchEn is changed from 0 to 1."
      
      Explicitly set IbsFetchEn to 0 and then to 1 when re-enabling IBS Fetch,
      so the driver properly resets the internal counter to 0 and IBS
      Fetch starts counting again.
      
      A family 15h machine tested does not have this problem, and the extra
      wrmsr is also not needed on Family 19h, so only do the extra wrmsr on
      families 16h through 18h.
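
A minimal sketch of the resulting write sequence, under stated assumptions: the enable bit position in `FETCH_ENABLE_BIT` and both function names are hypothetical, and the register writes are only recorded in an array rather than issued.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: illustrative enable bit position; check the PPR for the real layout. */
#define FETCH_ENABLE_BIT (1ULL << 48)

/* Families 16h through 18h get the extra disabling write, per the text above. */
bool fetch_needs_disable_first(unsigned int family)
{
    return family >= 0x16 && family <= 0x18;
}

/*
 * Fill `writes` with the control-register values to write, in order, to
 * re-enable fetch sampling. Returns the number of writes (1 or 2).
 */
size_t fetch_rearm_sequence(unsigned int family, uint64_t config,
                            uint64_t writes[2])
{
    size_t n = 0;

    assert(writes != NULL);
    if (fetch_needs_disable_first(family))
        writes[n++] = config & ~FETCH_ENABLE_BIT;  /* IbsFetchEn = 0 */
    writes[n++] = config | FETCH_ENABLE_BIT;       /* IbsFetchEn = 1 */
    return n;
}
```

On an affected family the sequence is two writes, so the enable bit makes the 0 to 1 transition that the PPR quote describes; elsewhere it stays a single write.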
      Reported-by: Stephane Eranian <stephane.eranian@google.com>
      Signed-off-by: Kim Phillips <kim.phillips@amd.com>
      [peterz: optimized]
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: stable@vger.kernel.org
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537
    • perf/x86/rapl: Add AMD Fam19h RAPL support · a77259bd
      Kim Phillips authored
      Family 19h RAPL support did not change from Family 17h; extend
      the existing Fam17h support to work on Family 19h too.
      Signed-off-by: Kim Phillips <kim.phillips@amd.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/20200908214740.18097-8-kim.phillips@amd.com
    • perf/x86/amd/ibs: Support 27-bit extended Op/cycle counter · 8b0bed7d
      Kim Phillips authored
      IBS hardware with the OpCntExt feature gets a 7-bit wider internal
      counter.  Both the maximum and current count bitfields in the
      IBS_OP_CTL register are extended to support reading and writing it.
      
      No changes are necessary to the driver for handling the extra
      contiguous current count bits (IbsOpCurCnt), as the driver already
      passes through 32 bits of that field.  However, the driver has to do
      some extra bit manipulation when converting from a period to the
      non-contiguous (although conveniently aligned) extra bits in the
      IbsOpMaxCnt bitfield.
      
      This decreases IBS Op interrupt overhead when the period is over
      1,048,560 (0xffff0), which would previously activate the driver's
      software counter.  That threshold is now 134,217,712 (0x7fffff0).
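
The period-to-bitfield conversion can be sketched as follows. The masks are an assumed layout inferred from the description and the two thresholds (16 low-order count bits holding the period divided by 16, plus 7 extension bits that sit at the same bit positions as in the period); the macro and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout, inferred from the description; check the PPR before relying on it. */
#define OP_MAX_CNT_LOW  0x0000FFFFULL    /* period bits 19:4, stored in bits 15:0 */
#define OP_MAX_CNT_EXT  (0x7FULL << 20)  /* period bits 26:20, stored in place */
#define OP_MAX_PERIOD   0x7FFFFF0ULL     /* 134,217,712 */

/* Convert a sampling period into the maximum-count bits of the control value. */
uint64_t op_max_cnt_bits(uint64_t period)
{
    assert(period <= OP_MAX_PERIOD && (period & 0xF) == 0);

    uint64_t ext = period & OP_MAX_CNT_EXT;
    uint64_t low = ((period & ~OP_MAX_CNT_EXT) >> 4) & OP_MAX_CNT_LOW;

    return ext | low;
}
```

The extension bits need no shift because they are already aligned, which is the "conveniently aligned" property mentioned above; only the low 16-bit field is shifted down by 4.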
      Signed-off-by: Kim Phillips <kim.phillips@amd.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/20200908214740.18097-7-kim.phillips@amd.com
    • perf/x86/amd/ibs: Fix raw sample data accumulation · 36e1be8a
      Kim Phillips authored
      Neither IbsBrTarget nor OPDATA4 is populated in IBS Fetch mode.
      Don't accumulate them into raw sample user data in that case.
      
      Also, in Fetch mode, add saving the IBS Fetch Control Extended MSR.
      
      Technically, there is an ABI change here with respect to the IBS raw
      sample data format, and I don't see any perf driver version information
      being included in perf.data file headers. However, existing users can
      detect whether the size of the sample record has reduced by 8 bytes to
      determine whether the IBS driver has this fix.
      
      Fixes: 904cb367 ("perf/x86/amd/ibs: Update IBS MSRs and feature definitions")
      Reported-by: Stephane Eranian <stephane.eranian@google.com>
      Signed-off-by: Kim Phillips <kim.phillips@amd.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/20200908214740.18097-6-kim.phillips@amd.com
    • perf/x86/amd/ibs: Don't include randomized bits in get_ibs_op_count() · 680d6963
      Kim Phillips authored
      get_ibs_op_count() adds hardware's current count (IbsOpCurCnt) bits
      to its count regardless of hardware's valid status.
      
      According to the PPR for AMD Family 17h Model 31h B0 55803 Rev 0.54,
      if the counter rolls over, valid status is set, and the lower 7 bits
      of IbsOpCurCnt are randomized by hardware.
      
      Don't include those bits in the driver's event count.
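
One way to read this, sketched on a decoded view of the register rather than the real bit layout: once the valid status is set, count only the programmed maximum and ignore the current-count field. The struct, its fields, and the assumption that the maximum count is stored in units of 16 are all hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical decoded view of the control register; not the real bit layout. */
struct op_ctl_model {
    bool valid;        /* set by hardware when the counter rolled over */
    uint32_t max_cnt;  /* programmed maximum count, assumed in units of 16 */
    uint32_t cur_cnt;  /* current count; low 7 bits are random once valid is set */
};

uint64_t op_count(const struct op_ctl_model *ctl)
{
    assert(ctl != NULL);
    if (ctl->valid)
        return (uint64_t)ctl->max_cnt << 4;  /* rolled over: ignore cur_cnt */
    return ctl->cur_cnt;                     /* still counting: cur_cnt is exact */
}
```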
      
      Fixes: 8b1e1363 ("perf/x86-ibs: Fix usage of IBS op current count")
      Signed-off-by: Kim Phillips <kim.phillips@amd.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: stable@vger.kernel.org
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537
    • perf/x86/amd: Fix sampling Large Increment per Cycle events · 26e52558
      Kim Phillips authored
      Commit 57388912 ("perf/x86/amd: Add support for Large Increment
      per Cycle Events") mistakenly zeroes the upper 16 bits of the count
      in set_period().  That's fine for counting with perf stat, but not
      sampling with perf record when only Large Increment events are being
      sampled.  To enable sampling, we sign extend the upper 16 bits of the
      merged counter pair as described in the Family 17h PPRs:
      
      "Software wanting to preload a value to a merged counter pair writes the
      high-order 16-bit value to the low-order 16 bits of the odd counter and
      then writes the low-order 48-bit value to the even counter. Reading the
      even counter of the merged counter pair returns the full 64-bit value."
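
The split described in the quoted PPR text can be sketched as plain arithmetic. The struct and function names are hypothetical; the example only shows why a negative preload value (as used for sampling periods) needs 0xffff in the odd counter.

```c
#include <assert.h>
#include <stdint.h>

struct merged_preload {
    uint64_t even;  /* low-order 48 bits, written to the even counter */
    uint64_t odd;   /* high-order 16 bits, written to the low 16 bits of the odd counter */
};

/* Split a 64-bit preload value across a merged counter pair. */
struct merged_preload split_merged_preload(int64_t value)
{
    uint64_t raw = (uint64_t)value;
    struct merged_preload p;

    p.even = raw & 0xFFFFFFFFFFFFULL;
    p.odd = (raw >> 48) & 0xFFFFULL;

    /* Recombining both halves must give back the original value. */
    assert(((p.odd << 48) | p.even) == raw);
    return p;
}
```

For a preload of -1000 the odd half is 0xffff; zeroing it instead would turn the value into a large positive count, so the pair would not overflow after 1000 increments.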
      
      Fixes: 57388912 ("perf/x86/amd: Add support for Large Increment per Cycle Events")
      Signed-off-by: Kim Phillips <kim.phillips@amd.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: stable@vger.kernel.org
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537
    • perf/amd/uncore: Set all slices and threads to restore perf stat -a behaviour · c8fe99d0
      Kim Phillips authored
      Commit 2f217d58 ("perf/x86/amd/uncore: Set the thread mask for
      F17h L3 PMCs") inadvertently changed the uncore driver's behaviour
      wrt perf tool invocations with or without a CPU list, specified with
      -C / --cpu=.
      
      Change the behaviour of the driver to assume the former all-cpu (-a)
      case, which is the more commonly desired default.  This fixes
      '-a -A' invocations without explicit cpu lists (-C) to not count
      L3 events only on behalf of the first thread of the first core
      in the L3 domain.
      
      BEFORE:
      
      Activity performed by the first thread of the last core (CPU#43) in
      CPU#40's L3 domain is not reported by CPU#40:
      
      sudo perf stat -a -A -e l3_request_g1.caching_l3_cache_accesses taskset -c 43 perf bench mem memcpy -s 32mb -l 100 -f default
      ...
      CPU36                 21,835      l3_request_g1.caching_l3_cache_accesses
      CPU40                 87,066      l3_request_g1.caching_l3_cache_accesses
      CPU44                 17,360      l3_request_g1.caching_l3_cache_accesses
      ...
      
      AFTER:
      
      The L3 domain activity is now reported by CPU#40:
      
      sudo perf stat -a -A -e l3_request_g1.caching_l3_cache_accesses taskset -c 43 perf bench mem memcpy -s 32mb -l 100 -f default
      ...
      CPU36                354,891      l3_request_g1.caching_l3_cache_accesses
      CPU40              1,780,870      l3_request_g1.caching_l3_cache_accesses
      CPU44                315,062      l3_request_g1.caching_l3_cache_accesses
      ...
      
      Fixes: 2f217d58 ("perf/x86/amd/uncore: Set the thread mask for F17h L3 PMCs")
      Signed-off-by: Kim Phillips <kim.phillips@amd.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/20200908214740.18097-2-kim.phillips@amd.com
    • perf/core: Pull pmu::sched_task() into perf_event_context_sched_out() · 44fae179
      Kan Liang authored
      The pmu::sched_task() is a context switch callback. It passes the
      cpuctx->task_ctx as a parameter to the lower code. To find the
      cpuctx->task_ctx, the current code iterates a cpuctx list.
      The same context will be iterated in perf_event_context_sched_out() soon.
      Sharing the cpuctx->task_ctx avoids the unnecessary iteration of the
      cpuctx list.
      
      The pmu::sched_task() is also required for the optimization case for
      equivalent contexts.
      
      The task_ctx_sched_out() will eventually disable and reenable the PMU
      when scheduling out events. Adding perf_pmu_disable() and perf_pmu_enable()
      around task_ctx_sched_out() doesn't break anything.
      
      Drop the cpuctx->ctx.lock for the pmu::sched_task(). The lock is for
      per-CPU context, which is not necessary for the per-task context
      schedule.
      
      No one uses sched_cb_entry, perf_sched_cb_usages, sched_cb_list, and
      perf_pmu_sched_task() any more.
      Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/20200821195754.20159-2-kan.liang@linux.intel.com
    • perf/core: Pull pmu::sched_task() into perf_event_context_sched_in() · 556cccad
      Kan Liang authored
      The pmu::sched_task() is a context switch callback. It passes the
      cpuctx->task_ctx as a parameter to the lower code. To find the
      cpuctx->task_ctx, the current code iterates a cpuctx list.
      
      The same context was just iterated in perf_event_context_sched_in(),
      which is invoked right before the pmu::sched_task().
      
      Reusing the cpuctx->task_ctx from perf_event_context_sched_in() avoids
      the unnecessary iteration of the cpuctx list.
      
      Both pmu::sched_task and perf_event_context_sched_in() have to disable
      PMU. Pulling the pmu::sched_task into perf_event_context_sched_in()
      also saves the overhead from the PMU disable and reenable.
      
      The new and old tasks may have equivalent contexts. The current code
      optimizes this case by swapping the context, which avoids the scheduling.
      For this case, pmu::sched_task() is still required, e.g., to restore the
      LBR content.
      Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/20200821195754.20159-1-kan.liang@linux.intel.com
    • perf/x86/intel/ds: Fix x86_pmu_stop warning for large PEBS · 35d1ce6b
      Kan Liang authored
      A warning as below may be triggered when sampling with large PEBS.
      
      [  410.411250] perf: interrupt took too long (72145 > 71975), lowering
      kernel.perf_event_max_sample_rate to 2000
      [  410.724923] ------------[ cut here ]------------
      [  410.729822] WARNING: CPU: 0 PID: 16397 at arch/x86/events/core.c:1422
      x86_pmu_stop+0x95/0xa0
      [  410.933811]  x86_pmu_del+0x50/0x150
      [  410.937304]  event_sched_out.isra.0+0xbc/0x210
      [  410.941751]  group_sched_out.part.0+0x53/0xd0
      [  410.946111]  ctx_sched_out+0x193/0x270
      [  410.949862]  __perf_event_task_sched_out+0x32c/0x890
      [  410.954827]  ? set_next_entity+0x98/0x2d0
      [  410.958841]  __schedule+0x592/0x9c0
      [  410.962332]  schedule+0x5f/0xd0
      [  410.965477]  exit_to_usermode_loop+0x73/0x120
      [  410.969837]  prepare_exit_to_usermode+0xcd/0xf0
      [  410.974369]  ret_from_intr+0x2a/0x3a
      [  410.977946] RIP: 0033:0x40123c
      [  411.079661] ---[ end trace bc83adaea7bb664a ]---
      
      In the non-overflow context, e.g., context switch, with large PEBS, perf
      may stop an event twice. An example is below.
      
        //max_samples_per_tick is adjusted to 2
        //NMI is triggered
        intel_pmu_handle_irq()
           handle_pmi_common()
             drain_pebs()
               __intel_pmu_pebs_event()
                 perf_event_overflow()
                   __perf_event_account_interrupt()
                     hwc->interrupts = 1
                     return 0
        //A context switch happens right after the NMI.
        //In the same tick, the perf_throttled_seq is not changed.
        perf_event_task_sched_out()
           perf_pmu_sched_task()
             intel_pmu_drain_pebs_buffer()
               __intel_pmu_pebs_event()
                 perf_event_overflow()
                   __perf_event_account_interrupt()
                     ++hwc->interrupts >= max_samples_per_tick
                     return 1
                 x86_pmu_stop();  # First stop
           perf_event_context_sched_out()
             task_ctx_sched_out()
               ctx_sched_out()
                 event_sched_out()
                   x86_pmu_del()
                     x86_pmu_stop();  # Second stop and trigger the warning
      
      Perf should only invoke the perf_event_overflow() in the overflow
      context.
      
      Current drain_pebs() is called from:
      - handle_pmi_common()			-- overflow context
      - intel_pmu_pebs_sched_task()		-- non-overflow context
      - intel_pmu_pebs_disable()		-- non-overflow context
      - intel_pmu_auto_reload_read()		-- possible overflow context
        With PERF_SAMPLE_READ + PERF_FORMAT_GROUP, the function may be
        invoked in the NMI handler. But, before calling the function, the
        PEBS buffer has already been drained. The __intel_pmu_pebs_event()
        will not be called in the possible overflow context.
      
      To fix the issue, an indicator is required to distinguish between the
      overflow context aka handle_pmi_common() and other cases.
      The dummy regs pointer can be used as the indicator.
      
      In the non-overflow context, perf should treat the last record the same
      as other PEBS records, and not invoke the generic overflow handler.
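
The indicator idea can be sketched with a toy model. None of these names are the kernel's: `struct regs_model`, `struct event_model`, `non_overflow_regs()` and `drain_records()` are hypothetical, and the model only counts how often the generic overflow handler would run.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model; none of these names are the kernel's. */
struct regs_model { int unused; };

struct event_model {
    int samples;         /* records written out */
    int overflow_calls;  /* times the generic overflow handler ran */
};

/* Passed by callers that are not in the overflow (PMI) context. */
static struct regs_model dummy_regs;

const struct regs_model *non_overflow_regs(void)
{
    return &dummy_regs;
}

void drain_records(struct event_model *ev, int n_records,
                   const struct regs_model *regs)
{
    assert(ev != NULL && regs != NULL && n_records >= 0);

    bool overflow_ctx = (regs != &dummy_regs);

    for (int i = 0; i < n_records; i++) {
        bool last = (i == n_records - 1);

        ev->samples++;
        if (last && overflow_ctx)
            ev->overflow_calls++;  /* only the PMI path may throttle or stop */
    }
}
```

Draining from a context switch with the dummy pointer still writes every record, but never reaches the throttling path that led to the first stop in the trace above.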
      
      Fixes: 21509084 ("perf/x86/intel: Handle multiple records in the PEBS buffer")
      Reported-by: Like Xu <like.xu@linux.intel.com>
      Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Tested-by: Like Xu <like.xu@linux.intel.com>
      Link: https://lkml.kernel.org/r/20200902210649.2743-1-kan.liang@linux.intel.com
  7. 18 Aug, 2020 6 commits
    • perf/x86/intel: Support per-thread RDPMC TopDown metrics · 2cb5383b
      Kan Liang authored
      Starting from Ice Lake, the TopDown metrics are directly available as
      fixed counters and do not require generic counters. Also, the TopDown
      metrics can be collected per thread. Extend the RDPMC usage to support
      per-thread TopDown metrics.
      
      The RDPMC index of the PERF_METRICS will be output if RDPMC users ask
      for the RDPMC index of the metrics events.
      
      To support per thread RDPMC TopDown, the metrics and slots counters have
      to be saved/restored during the context switching.
      
      The last_period and period_left are not used in the counting mode. Use
      the fields for saved_metric and saved_slots.
      Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/20200723171117.9918-12-kan.liang@linux.intel.com
    • perf/x86/intel: Support TopDown metrics on Ice Lake · 59a854e2
      Kan Liang authored
      Ice Lake supports the hardware TopDown metrics feature, which can free
      up the scarce GP counters.
      
      Update the event constraints for the metrics events. The metric counters
      do not exist in hardware, so they are mapped to a dummy offset. Sharing
      the same metric between multiple users without multiplexing is not allowed.
      
      Implement set_topdown_event_period for Ice Lake. The values in
      PERF_METRICS MSR are derived from the fixed counter 3. Both registers
      should start from zero.
      
      Implement update_topdown_event for Ice Lake. The metric is reported by
      multiplying the metric (fraction) with slots. To maintain accurate
      measurements, both registers are cleared for each update. The fixed
      counter 3 should always be cleared before the PERF_METRICS.
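
The "fraction times slots" step can be sketched as below. The packing is an assumption (each metric taken as an 8-bit fraction of 0xff within one 64-bit word), and the function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Assumption: each metric is an 8-bit fraction of 0xff, packed at byte
 * position `idx` of one 64-bit metrics word.
 */
uint64_t metric_to_slots(uint64_t metrics, unsigned int idx, uint64_t slots)
{
    assert(idx < 8);

    uint64_t frac = (metrics >> (idx * 8)) & 0xFF;

    /* slots * frac / 0xff, split so the product cannot overflow 64 bits */
    return (slots / 0xFF) * frac + (slots % 0xFF) * frac / 0xFF;
}
```

Because the reported value is a product of both registers, the two must cover the same interval, which is why both are cleared together on each update.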
      
      Implement td_attr for the new metrics events and the new slots fixed
      counter. Make them visible to the perf user tools.
      Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/20200723171117.9918-11-kan.liang@linux.intel.com
    • perf/x86: Add a macro for RDPMC offset of fixed counters · 0e2e45e2
      Kan Liang authored
      The RDPMC base offset of fixed counters is hard-coded. Use a meaningful
      name to replace the magic number to improve the readability of the code.
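
As a small illustration of the idea, the sketch below names the bit that selects fixed-function counters in an RDPMC index (bit 30). The macro and function names are hypothetical, not the ones the patch introduces.

```c
#include <assert.h>
#include <stdint.h>

/* Bit 30 of the RDPMC index selects the fixed-function counters. */
#define FIXED_RDPMC_BASE (1u << 30)

/* RDPMC index for fixed counter number `fixed_idx`. */
uint32_t fixed_rdpmc_index(unsigned int fixed_idx)
{
    assert(fixed_idx < FIXED_RDPMC_BASE);
    return FIXED_RDPMC_BASE | fixed_idx;
}
```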
      Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/20200723171117.9918-10-kan.liang@linux.intel.com
    • perf/x86/intel: Generic support for hardware TopDown metrics · 7b2c05a1
      Kan Liang authored
      Intro
      =====
      
      The TopDown Microarchitecture Analysis (TMA) Method is a structured
      analysis methodology to identify critical performance bottlenecks in
      out-of-order processors. perf already supports the method.
      
      The method works well, but there is one problem. To collect the TopDown
      events, several GP counters have to be used. If a user wants to collect
      other events at the same time, multiplexing will probably be triggered,
      which impacts the accuracy.
      
      To free up the scarce GP counters, the hardware TopDown metrics feature
      is introduced from Ice Lake. The hardware implements an additional
      "metrics" register and a new Fixed Counter 3 that measures pipeline
      "slots". The TopDown events can be calculated from them instead.
      
      Events
      ======
      
      The level 1 TopDown has four metrics. There is no event-code assigned to
      the TopDown metrics. Four metric events are exported as separate perf
      events, which map to the internal "metrics" counter register. Those
      events do not exist in hardware, but can be allocated by the scheduler.
      
      For the event mapping, a special 0x00 event code is used, which is
      reserved for fake events. The metric events start from umask 0x10.
      
      When setting up the metric events, they point to the Fixed Counter 3.
      They have to be specially handled.
      - Add the update_topdown_event() callback to read the additional metrics
        MSR and generate the metrics.
      - Add the set_topdown_event_period() callback to initialize metrics MSR
        and the fixed counter 3.
      - Add a variable n_metric_event to track the number of the accepted
        metrics events. The sharing between multiple users of the same metric
        without multiplexing is not allowed.
      - Only enable/disable the fixed counter 3 when there are no other active
        TopDown events, which avoids unnecessary writes to the fixed
        control register.
      - Disable the PMU when reading the metrics event. The metrics MSR and
        the fixed counter 3 are read separately. The values may be modified by
        an NMI.
      
      All four metric events don't support sampling. Since they will be
      handled specially for event update, a flag PERF_X86_EVENT_TOPDOWN is
      introduced to indicate this case.
      
      The slots event can support both sampling and counting.
      For counting, the flag is also applied.
      For sampling, it will be handled normally as other normal events.
      
      Groups
      ======
      
      The slots event is required in a Topdown group.
      To avoid reading the METRICS register multiple times, the metrics and
      slots value can only be updated by slots event in a group.
      All active slots and metrics events will be updated one time.
      Therefore, the slots event must be before any metric events in a Topdown
      group.
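
The ordering rule can be sketched as a simple check over the event kinds in a group. This is an illustrative model only: `enum ev_kind` and `topdown_group_ok()` are hypothetical names and do not reflect how the kernel validates groups.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum ev_kind { EV_OTHER, EV_SLOTS, EV_METRIC };

/* True if the group has no metric events, or a slots event precedes all of them. */
bool topdown_group_ok(const enum ev_kind *group, size_t n)
{
    assert(group != NULL || n == 0);

    bool seen_slots = false;

    for (size_t i = 0; i < n; i++) {
        if (group[i] == EV_SLOTS)
            seen_slots = true;
        else if (group[i] == EV_METRIC && !seen_slots)
            return false;
    }
    return true;
}
```

A group ordered as slots followed by metric events passes, while a group that lists a metric event before slots is rejected.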
      
      NMI
      ======
      
      The METRICS related register may overflow, in which case bit 48 of the
      STATUS register will be set. If so, PERF_METRICS and Fixed counter 3 are
      required to be reset. The patch also updates all active slots and
      metrics events in the NMI handler.
      
      The update_topdown_event() has to read two registers separately. The
      values may be modified by an NMI. PMU has to be disabled before calling
      the function.
      
      RDPMC
      ======
      
      RDPMC is temporarily disabled. A later patch will enable it.
      Suggested-by: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/20200723171117.9918-9-kan.liang@linux.intel.com
    • perf/core: Add a new PERF_EV_CAP_SIBLING event capability · 9f0c4fa1
      Kan Liang authored
      Current perf assumes that events in a group are independent. Closing an
      event doesn't impact the value of the other events in the same group.
      If the closed event is a member, after the event closure, other events
      are still running like a group. If the closed event is a leader, other
      events are running as singleton events.
      
      Add PERF_EV_CAP_SIBLING to allow events to indicate they require being
      part of a group, and when the leader dies they cannot exist
      independently.
      Suggested-by: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/20200723171117.9918-8-kan.liang@linux.intel.com
    • perf/x86/intel: Use switch in intel_pmu_disable/enable_event · 58da7dbe
      Kan Liang authored
      Currently, if-else is used in intel_pmu_disable/enable_event to
      check the type of an event. It works well, but as more types are
      added later, e.g., perf metrics, the if-else may impair the
      readability of the code compared to a switch statement.
      
      There is no harm in using a switch statement to replace the if-else
      here. Also, some optimizing compilers may compile a switch statement
      into a jump-table which is more efficient than if-else for a large
      number of cases. The performance gain may not be observed for now,
      because the number of cases is only 5, but the benefits may be observed
      with more and more types added in the future.
      
      Use switch to replace the if-else in the intel_pmu_disable/enable_event.
      
      If the idx is invalid, print a warning.
      
      For the case INTEL_PMC_IDX_FIXED_BTS in intel_pmu_disable_event, there
      is no need to check the event->attr.precise_ip. Use return for the case.
      Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/20200723171117.9918-7-kan.liang@linux.intel.com