1. 29 Oct, 2020 4 commits
    • perf/core: Add support for PERF_SAMPLE_CODE_PAGE_SIZE · 995f088e
      Stephane Eranian authored
      When studying code layout, it is useful to capture the page size of the
      sampled code address.
      
      Add a new sample type for code page size.
      The new sample type requires collecting the ip. The code page size can
      be calculated from the NMI-safe perf_get_page_size().
      
      For large PEBS, it's very unlikely that the mapping is gone for the
      earlier PEBS records. Enable the feature for the large PEBS. The worst
      case is that page-size '0' is returned.

      Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
      Signed-off-by: Stephane Eranian <eranian@google.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/20201001135749.2804-5-kan.liang@linux.intel.com
    • powerpc/perf: Support PERF_SAMPLE_DATA_PAGE_SIZE · 4cb6a42e
      Kan Liang authored
      The new sample type, PERF_SAMPLE_DATA_PAGE_SIZE, requires the virtual
      address. Update the data->addr if the sample type is set.

      Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/20201001135749.2804-4-kan.liang@linux.intel.com
    • perf/x86/intel: Support PERF_SAMPLE_DATA_PAGE_SIZE · 76a5433f
      Kan Liang authored
      The new sample type, PERF_SAMPLE_DATA_PAGE_SIZE, requires the virtual
      address. Update the data->addr if the sample type is set.
      
      Large PEBS is disabled with this sample type because perf doesn't
      support munmap tracking yet: the PEBS buffer for large PEBS cannot be
      flushed for each munmap, so a wrong page size may be calculated. Large
      PEBS can be enabled separately later, once munmap tracking is supported.

      Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/20201001135749.2804-3-kan.liang@linux.intel.com
    • perf/core: Add PERF_SAMPLE_DATA_PAGE_SIZE · 8d97e718
      Kan Liang authored
      Current perf can report both virtual addresses and physical addresses,
      but not the MMU page size. Without the MMU page size information of the
      utilized page, users cannot decide whether to promote/demote large pages
      to optimize memory usage.
      
      Add a new sample type for the data MMU page size.
      
      Current perf already has a facility to collect data virtual addresses.
      A page walker is required to walk the page tables and calculate the
      MMU page size from a given virtual address.
      
      On some platforms, e.g., X86, the page walker is invoked in an NMI
      handler, so it must be NMI-safe and low-overhead. It should also work
      for both user and kernel virtual addresses. The existing generic page
      walker, e.g., walk_page_range_novma(), is somewhat complex and is not
      guaranteed to be NMI-safe, and follow_page() only handles user virtual
      addresses.
      
      Add a new function perf_get_page_size() to walk the page tables and
      calculate the MMU page size. In the function:
      - Interrupts have to be disabled to prevent any teardown of the page
        tables.
      - For user space threads, the current->mm is used for the page walker.
        For kernel threads and the like, the current->mm is NULL. The init_mm
        is used for the page walker. The active_mm is not used here, because
        it can be NULL.
        Quote from Peter Zijlstra,
        "context_switch() can set prev->active_mm to NULL when it transfers it
         to @next. It does this before @current is updated. So an NMI that
         comes in between this active_mm swizzling and updating @current will
         see !active_mm."
      - The MMU page size is calculated from the page table level.
      
      The method should work for all architectures, but it has only been
      verified on X86. If the method doesn't work on some architecture that
      supports perf, that can be fixed separately later. Reporting the wrong
      page size would not be fatal for the architecture.
      
      Some features under discussion may impact the method in the future.
      Quote from Dave Hansen,
        "There are lots of weird things folks are trying to do with the page
         tables, like Address Space Isolation.  For instance, if you get a
         perf NMI when running userspace, current->mm->pgd is *different* than
         the PGD that was in use when userspace was running. It's close enough
         today, but it might not stay that way."
      If that happens, many consecutive page walk errors will occur. The
      worst case is that many samples report a page size of '0', which would
      not be fatal.
      The perf tool implements a check to detect this case. Once it triggers,
      a kernel patch can be implemented accordingly.

      Suggested-by: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/20201001135749.2804-2-kan.liang@linux.intel.com
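The commit message above states that the MMU page size is calculated from the page table level. A minimal user-space sketch of that mapping follows, assuming a 4-level x86-64 layout with 4 KiB, 2 MiB and 1 GiB leaf pages. The enum, the shift table and page_size_at_level() are hypothetical names for illustration and are not taken from the kernel patch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names: levels at which a leaf mapping can be found. */
enum pt_level { PT_LEVEL_PTE, PT_LEVEL_PMD, PT_LEVEL_PUD };

/* Number of address bits covered by a leaf at each level (assumed x86-64 values). */
static const unsigned int level_shift[] = {
    [PT_LEVEL_PTE] = 12, /* 4 KiB page */
    [PT_LEVEL_PMD] = 21, /* 2 MiB page */
    [PT_LEVEL_PUD] = 30, /* 1 GiB page */
};

/* Size in bytes of the page mapped by a leaf entry found at the given level. */
static uint64_t page_size_at_level(enum pt_level level)
{
    assert(level <= PT_LEVEL_PUD);
    return UINT64_C(1) << level_shift[level];
}
```

In this sketch a walker that stops at the PMD level would report 2 MiB. A failed walk has no level to look up, which corresponds to the page size of '0' that the message describes as the worst case.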
  2. 12 Oct, 2020 1 commit
    • perf/core: Fix race in the perf_mmap_close() function · f91072ed
      Jiri Olsa authored
      There's a possible race in perf_mmap_close() when checking ring buffer's
      mmap_count refcount value. The problem is that the mmap_count check is
      not atomic because we call atomic_dec() and atomic_read() separately.
      
        perf_mmap_close:
        ...
         atomic_dec(&rb->mmap_count);
         ...
         if (atomic_read(&rb->mmap_count))
            goto out_put;
      
         <ring buffer detach>
         free_uid
      
      out_put:
        ring_buffer_put(rb); /* could be last */
      
      The race can happen when two (or more) events share the same ring
      buffer: both go through atomic_dec() and then both see 0 as the refcount
      value in the later atomic_read(). Both then go on to execute code that
      is meant to run just once.
      
      The code that detaches the ring buffer is probably fine to execute more
      than once, but calling free_uid() twice is a problem; it later shows up
      as related crashes and refcount warnings, like:
      
        refcount_t: addition on 0; use-after-free.
        ...
        RIP: 0010:refcount_warn_saturate+0x6d/0xf
        ...
        Call Trace:
        prepare_creds+0x190/0x1e0
        copy_creds+0x35/0x172
        copy_process+0x471/0x1a80
        _do_fork+0x83/0x3a0
        __do_sys_wait4+0x83/0x90
        __do_sys_clone+0x85/0xa0
        do_syscall_64+0x5b/0x1e0
        entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      Use an atomic decrement-and-test instead of separate calls.

      Tested-by: Michael Petlan <mpetlan@redhat.com>
      Signed-off-by: Jiri Olsa <jolsa@kernel.org>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Acked-by: Namhyung Kim <namhyung@kernel.org>
      Acked-by: Wade Mealing <wmealing@redhat.com>
      Fixes: 9bb5d40c ("perf: Fix mmap() accounting hole");
      Link: https://lore.kernel.org/r/20200916115311.GE2301783@krava
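The difference between a separate decrement and read and a single decrement-and-test can be sketched in user space with C11 atomics. This is only a model of the idea: dec_and_test() stands in for the kernel's atomic_dec_and_test(), and mmap_close_model() is a hypothetical name, not the kernel function.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* User-space stand-in for the kernel's atomic_dec_and_test(). */
static bool dec_and_test(atomic_int *count)
{
    /* atomic_fetch_sub() returns the value held before the subtraction,
     * so exactly one caller can observe the 1 -> 0 transition. */
    return atomic_fetch_sub(count, 1) == 1;
}

/* Hypothetical close path: returns true only for the caller that dropped
 * the last reference and therefore has to run the one-time teardown. */
static bool mmap_close_model(atomic_int *mmap_count)
{
    assert(atomic_load(mmap_count) > 0);
    return dec_and_test(mmap_count);
}
```

With two events sharing one buffer, the first close returns false and the second returns true, so the teardown that ends in free_uid() would run once regardless of how the two callers interleave.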
  3. 06 Oct, 2020 2 commits
    • perf/x86: Fix n_metric for cancelled txn · 3dbde695
      Peter Zijlstra authored
      When a group that has TopDown members fails to be scheduled, any
      later TopDown groups will not return valid values.
      
      Here is an example.
      
      A background perf session occupies all the GP counters and fixed
      counter 1.
       $perf stat -e "{cycles,cycles,cycles,cycles,cycles,cycles,cycles,
                       cycles,cycles}:D" -a
      
      A user monitors a TopDown group. It works well, because the fixed
      counter 3 and the PERF_METRICS are available.
       $perf stat -x, --topdown -- ./workload
         retiring,bad speculation,frontend bound,backend bound,
         18.0,16.1,40.4,25.5,
      
      Then the user tries to monitor a group that has TopDown members.
      Because of the cycles event, the group fails to be scheduled.
       $perf stat -x, -e '{slots,topdown-retiring,topdown-be-bound,
                           topdown-fe-bound,topdown-bad-spec,cycles}'
                           -- ./workload
          <not counted>,,slots,0,0.00,,
          <not counted>,,topdown-retiring,0,0.00,,
          <not counted>,,topdown-be-bound,0,0.00,,
          <not counted>,,topdown-fe-bound,0,0.00,,
          <not counted>,,topdown-bad-spec,0,0.00,,
          <not counted>,,cycles,0,0.00,,
      
      The user tries to monitor a TopDown group again. It doesn't work anymore.
       $perf stat -x, --topdown -- ./workload
      
          ,,,,,
      
      Within a transaction, cancel_txn() truncates the event_list for a
      cancelled group and updates the number of events added in this
      transaction. However, the number of TopDown events added in the
      transaction is not updated, so the kernel will probably fail to add new
      TopDown events.
      
      Fixes: 7b2c05a1 ("perf/x86/intel: Generic support for hardware TopDown metrics")
      Reported-by: Andi Kleen <ak@linux.intel.com>
      Reported-by: Kan Liang <kan.liang@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Tested-by: Kan Liang <kan.liang@linux.intel.com>
      Link: https://lkml.kernel.org/r/20201005082611.GH2628@hirez.programming.kicks-ass.net
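The bookkeeping problem described above can be illustrated with a small user-space model. The struct and its field names are hypothetical simplifications chosen for this sketch; it only shows why a cancelled transaction has to roll back its metric count together with its event count.

```c
#include <assert.h>

/* Simplified, hypothetical stand-in for the per-CPU event bookkeeping. */
struct hw_events_model {
    int n_events;     /* events currently on the event list */
    int n_metric;     /* TopDown metric events among them */
    int n_txn;        /* events added by the open transaction */
    int n_txn_metric; /* metric events added by the open transaction */
};

static void start_txn(struct hw_events_model *hw)
{
    hw->n_txn = 0;
    hw->n_txn_metric = 0;
}

static void add_event(struct hw_events_model *hw, int is_metric)
{
    hw->n_events++;
    hw->n_txn++;
    if (is_metric) {
        hw->n_metric++;
        hw->n_txn_metric++;
    }
}

/* Roll back everything the cancelled transaction added, metrics included. */
static void cancel_txn(struct hw_events_model *hw)
{
    assert(hw->n_txn <= hw->n_events && hw->n_txn_metric <= hw->n_metric);
    hw->n_events -= hw->n_txn;
    hw->n_metric -= hw->n_txn_metric;
    start_txn(hw);
}
```

If cancel_txn() in this model skipped the n_metric line, the metric count would stay inflated after a failed group, which mirrors the symptom in the report: later TopDown groups cannot be added.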
    • perf/x86: Fix n_pair for cancelled txn · 871a93b0
      Peter Zijlstra authored
      Kan reported that n_metric gets corrupted for cancelled transactions;
      a similar issue exists for n_pair for AMD's Large Increment thing.
      
      The problem was confirmed and confirmed fixed by Kim using:
      
        sudo perf stat -e "{cycles,cycles,cycles,cycles}:D" -a sleep 10 &
      
        # should succeed:
        sudo perf stat -e "{fp_ret_sse_avx_ops.all}:D" -a workload
      
        # should fail:
        sudo perf stat -e "{fp_ret_sse_avx_ops.all,fp_ret_sse_avx_ops.all,cycles}:D" -a workload
      
        # previously failed, now succeeds with this patch:
        sudo perf stat -e "{fp_ret_sse_avx_ops.all}:D" -a workload
      
      Fixes: 57388912 ("perf/x86/amd: Add support for Large Increment per Cycle Events")
      Reported-by: Kan Liang <kan.liang@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Tested-by: Kim Phillips <kim.phillips@amd.com>
      Link: https://lkml.kernel.org/r/20201005082516.GG2628@hirez.programming.kicks-ass.net
  4. 03 Oct, 2020 2 commits
  5. 29 Sep, 2020 8 commits
  6. 24 Sep, 2020 11 commits
  7. 10 Sep, 2020 10 commits
    • arch/x86/amd/ibs: Fix re-arming IBS Fetch · 221bfce5
      Kim Phillips authored
      Stephane Eranian found a bug: IBS' current Fetch counter was not being
      reset when the driver wrote the new value to clear it with the enable
      bit set. He also found that adding an MSR write that first disables
      IBS Fetch makes IBS Fetch reset its current count.
      
      Indeed, the PPR for AMD Family 17h Model 31h B0 55803 Rev 0.54 - Sep 12,
      2019 states "The periodic fetch counter is set to IbsFetchCnt [...] when
      IbsFetchEn is changed from 0 to 1."
      
      Explicitly set IbsFetchEn to 0 and then to 1 when re-enabling IBS Fetch,
      so the driver properly resets the internal counter to 0 and IBS
      Fetch starts counting again.
      
      A family 15h machine tested does not have this problem, and the extra
      wrmsr is also not needed on Family 19h, so only do the extra wrmsr on
      families 16h through 18h.

      Reported-by: Stephane Eranian <stephane.eranian@google.com>
      Signed-off-by: Kim Phillips <kim.phillips@amd.com>
      [peterz: optimized]
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: stable@vger.kernel.org
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537
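The re-arming sequence can be sketched against a toy model of the affected hardware, in which the current count resets only when the enable bit goes from 0 to 1, as the quoted PPR text says. Everything here is illustrative: the struct, model_wrmsr() and rearm_fetch() are hypothetical names, and the position of IbsFetchEn at bit 48 is an assumption of this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define FETCH_ENABLE (UINT64_C(1) << 48) /* assumed position of IbsFetchEn */

/* Toy model: the counter resets only on a 0 -> 1 edge of the enable bit. */
struct fetch_model {
    uint64_t ctl;
    uint64_t cur_cnt;
};

static void model_wrmsr(struct fetch_model *hw, uint64_t val)
{
    bool was_enabled = (hw->ctl & FETCH_ENABLE) != 0;
    bool now_enabled = (val & FETCH_ENABLE) != 0;

    if (!was_enabled && now_enabled)
        hw->cur_cnt = 0;
    hw->ctl = val;
}

/* The message limits the extra write to families 16h through 18h. */
static bool family_needs_disable_first(unsigned int family)
{
    return family >= 0x16 && family <= 0x18;
}

static void rearm_fetch(struct fetch_model *hw, uint64_t config,
                        unsigned int family)
{
    if (family_needs_disable_first(family))
        model_wrmsr(hw, config & ~FETCH_ENABLE); /* force the bit to 0 first */
    model_wrmsr(hw, config | FETCH_ENABLE);      /* then 0 -> 1 resets the count */
    assert(hw->ctl & FETCH_ENABLE);
}
```

In the model, writing a value with the enable bit set over an already enabled control register leaves cur_cnt untouched, while the two-step sequence clears it.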
    • perf/x86/rapl: Add AMD Fam19h RAPL support · a77259bd
      Kim Phillips authored
      Family 19h RAPL support did not change from Family 17h; extend
      the existing Fam17h support to work on Family 19h too.

      Signed-off-by: Kim Phillips <kim.phillips@amd.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/20200908214740.18097-8-kim.phillips@amd.com
    • perf/x86/amd/ibs: Support 27-bit extended Op/cycle counter · 8b0bed7d
      Kim Phillips authored
      IBS hardware with the OpCntExt feature gets a 7-bit wider internal
      counter.  Both the maximum and current count bitfields in the
      IBS_OP_CTL register are extended to support reading and writing it.
      
      No changes are necessary to the driver for handling the extra
      contiguous current count bits (IbsOpCurCnt), as the driver already
      passes through 32 bits of that field.  However, the driver has to do
      some extra bit manipulation when converting from a period to the
      non-contiguous (although conveniently aligned) extra bits in the
      IbsOpMaxCnt bitfield.
      
      This decreases IBS Op interrupt overhead when the period is over
      1,048,560 (0xffff0), which would previously activate the driver's
      software counter.  That threshold is now 134,217,712 (0x7fffff0).

      Signed-off-by: Kim Phillips <kim.phillips@amd.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/20200908214740.18097-7-kim.phillips@amd.com
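The period-to-register conversion described above can be sketched as follows. The thresholds 0xffff0 and 0x7fffff0 come from the message; the placement of the 7 extra bits at register bits 26:20 and all macro and function names are assumptions of this sketch, not definitions copied from the driver.

```c
#include <assert.h>
#include <stdint.h>

#define OP_MAX_CNT_LOW_MASK UINT64_C(0xFFFF)       /* 16-bit base field */
#define OP_MAX_CNT_EXT_MASK (UINT64_C(0x7F) << 20) /* assumed: extra bits 26:20 */

/* Convert a sampling period into IbsOpMaxCnt register bits (sketch). */
static uint64_t encode_op_max_cnt(uint64_t period)
{
    /* The low 4 bits are not representable and the period must fit in 27 bits. */
    assert((period & 0xF) == 0 && period <= UINT64_C(0x7FFFFF0));

    /* Period bits 26:20 are already aligned with the extended field. */
    uint64_t ext = period & OP_MAX_CNT_EXT_MASK;
    /* Period bits 19:4 land in field bits 15:0. */
    uint64_t low = (period >> 4) & OP_MAX_CNT_LOW_MASK;

    return ext | low;
}
```

The old limit of 0xffff0 fills only the 16-bit base field, while 0x7fffff0 also sets all 7 extended bits, which is where the new threshold of 134,217,712 comes from (2^27 minus 16).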
    • perf/x86/amd/ibs: Fix raw sample data accumulation · 36e1be8a
      Kim Phillips authored
      Neither IbsBrTarget nor OPDATA4 is populated in IBS Fetch mode.
      Don't accumulate them into raw sample user data in that case.
      
      Also, in Fetch mode, add saving the IBS Fetch Control Extended MSR.
      
      Technically, there is an ABI change here with respect to the IBS raw
      sample data format, and I don't see any perf driver version information
      being included in perf.data file headers. However, existing users can
      detect whether the size of the sample record has been reduced by 8 bytes
      to determine whether the IBS driver has this fix.
      
      Fixes: 904cb367 ("perf/x86/amd/ibs: Update IBS MSRs and feature definitions")
      Reported-by: Stephane Eranian <stephane.eranian@google.com>
      Signed-off-by: Kim Phillips <kim.phillips@amd.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/20200908214740.18097-6-kim.phillips@amd.com
    • perf/x86/amd/ibs: Don't include randomized bits in get_ibs_op_count() · 680d6963
      Kim Phillips authored
      get_ibs_op_count() adds hardware's current count (IbsOpCurCnt) bits
      to its count regardless of hardware's valid status.
      
      According to the PPR for AMD Family 17h Model 31h B0 55803 Rev 0.54,
      if the counter rolls over, valid status is set, and the lower 7 bits
      of IbsOpCurCnt are randomized by hardware.
      
      Don't include those bits in the driver's event count.
      
      Fixes: 8b1e1363 ("perf/x86-ibs: Fix usage of IBS op current count")
      Signed-off-by: Kim Phillips <kim.phillips@amd.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: stable@vger.kernel.org
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537
    • perf/x86/amd: Fix sampling Large Increment per Cycle events · 26e52558
      Kim Phillips authored
      Commit 57388912 ("perf/x86/amd: Add support for Large Increment
      per Cycle Events") mistakenly zeroes the upper 16 bits of the count
      in set_period().  That's fine for counting with perf stat, but not
      sampling with perf record when only Large Increment events are being
      sampled.  To enable sampling, we sign extend the upper 16 bits of the
      merged counter pair as described in the Family 17h PPRs:
      
      "Software wanting to preload a value to a merged counter pair writes the
      high-order 16-bit value to the low-order 16 bits of the odd counter and
      then writes the low-order 48-bit value to the even counter. Reading the
      even counter of the merged counter pair returns the full 64-bit value."
      
      Fixes: 57388912 ("perf/x86/amd: Add support for Large Increment per Cycle Events")
      Signed-off-by: Kim Phillips <kim.phillips@amd.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: stable@vger.kernel.org
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537
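The preload rule quoted from the PPR can be sketched as a split of one 64-bit value into the two counter writes. The struct and function names are hypothetical; the sketch assumes the usual perf convention of preloading the negated number of events left, so that the counter overflows after that many increments.

```c
#include <assert.h>
#include <stdint.h>

#define LOW48_MASK ((UINT64_C(1) << 48) - 1)

/* Values to write to the odd and even counters of a merged pair. */
struct merged_preload {
    uint64_t odd;  /* high-order 16 bits, placed in the low 16 bits */
    uint64_t even; /* low-order 48 bits */
};

/* Split the negated period into the two writes described in the PPR quote. */
static struct merged_preload split_preload(int64_t left)
{
    assert(left > 0);

    /* Two's complement negation: the upper bits carry the sign extension. */
    uint64_t value = (uint64_t)(-left);
    struct merged_preload p = {
        .odd = (value >> 48) & 0xFFFF,
        .even = value & LOW48_MASK,
    };
    return p;
}
```

For any period that fits in 48 bits the odd counter receives 0xffff. Writing zeros there instead, as the earlier commit did, leaves the merged counter far from overflow, which is harmless for counting but prevents sampling.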
    • perf/amd/uncore: Set all slices and threads to restore perf stat -a behaviour · c8fe99d0
      Kim Phillips authored
      Commit 2f217d58 ("perf/x86/amd/uncore: Set the thread mask for
      F17h L3 PMCs") inadvertently changed the uncore driver's behaviour
      wrt perf tool invocations with or without a CPU list, specified with
      -C / --cpu=.
      
      Change the behaviour of the driver to assume the former all-cpu (-a)
      case, which is the more commonly desired default.  This fixes
      '-a -A' invocations without explicit cpu lists (-C) to not count
      L3 events only on behalf of the first thread of the first core
      in the L3 domain.
      
      BEFORE:
      
      Activity performed by the first thread of the last core (CPU#43) in
      CPU#40's L3 domain is not reported by CPU#40:
      
      sudo perf stat -a -A -e l3_request_g1.caching_l3_cache_accesses taskset -c 43 perf bench mem memcpy -s 32mb -l 100 -f default
      ...
      CPU36                 21,835      l3_request_g1.caching_l3_cache_accesses
      CPU40                 87,066      l3_request_g1.caching_l3_cache_accesses
      CPU44                 17,360      l3_request_g1.caching_l3_cache_accesses
      ...
      
      AFTER:
      
      The L3 domain activity is now reported by CPU#40:
      
      sudo perf stat -a -A -e l3_request_g1.caching_l3_cache_accesses taskset -c 43 perf bench mem memcpy -s 32mb -l 100 -f default
      ...
      CPU36                354,891      l3_request_g1.caching_l3_cache_accesses
      CPU40              1,780,870      l3_request_g1.caching_l3_cache_accesses
      CPU44                315,062      l3_request_g1.caching_l3_cache_accesses
      ...
      
      Fixes: 2f217d58 ("perf/x86/amd/uncore: Set the thread mask for F17h L3 PMCs")
      Signed-off-by: Kim Phillips <kim.phillips@amd.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/20200908214740.18097-2-kim.phillips@amd.com
    • perf/core: Pull pmu::sched_task() into perf_event_context_sched_out() · 44fae179
      Kan Liang authored
      The pmu::sched_task() is a context switch callback. It passes the
      cpuctx->task_ctx as a parameter to the lower code. To find the
      cpuctx->task_ctx, the current code iterates a cpuctx list.
      The same context will be iterated in perf_event_context_sched_out() soon.
      Sharing the cpuctx->task_ctx avoids the unnecessary iteration of the
      cpuctx list.
      
      The pmu::sched_task() is also required for the optimization case for
      equivalent contexts.
      
      The task_ctx_sched_out() will eventually disable and re-enable the PMU
      when scheduling out events, so adding perf_pmu_disable() and
      perf_pmu_enable() around task_ctx_sched_out() doesn't break anything.
      
      Drop the cpuctx->ctx.lock for the pmu::sched_task(). The lock protects
      the per-CPU context and is not needed for per-task context scheduling.
      
      No one uses sched_cb_entry, perf_sched_cb_usages, sched_cb_list, and
      perf_pmu_sched_task() any more.

      Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/20200821195754.20159-2-kan.liang@linux.intel.com
    • perf/core: Pull pmu::sched_task() into perf_event_context_sched_in() · 556cccad
      Kan Liang authored
      The pmu::sched_task() is a context switch callback. It passes the
      cpuctx->task_ctx as a parameter to the lower code. To find the
      cpuctx->task_ctx, the current code iterates a cpuctx list.
      
      The same context was just iterated in perf_event_context_sched_in(),
      which is invoked right before the pmu::sched_task().
      
      Reusing the cpuctx->task_ctx from perf_event_context_sched_in() avoids
      the unnecessary iteration of the cpuctx list.
      
      Both pmu::sched_task() and perf_event_context_sched_in() have to disable
      the PMU. Pulling pmu::sched_task() into perf_event_context_sched_in()
      also saves the overhead of the extra PMU disable and re-enable.
      
      The new and old tasks may have equivalent contexts. The current code
      optimizes this case by swapping the contexts, which avoids the scheduling.
      In this case, pmu::sched_task() is still required, e.g., to restore the
      LBR content.

      Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/20200821195754.20159-1-kan.liang@linux.intel.com
    • perf/x86/intel/ds: Fix x86_pmu_stop warning for large PEBS · 35d1ce6b
      Kan Liang authored
      A warning like the one below may be triggered when sampling with large PEBS.
      
      [  410.411250] perf: interrupt took too long (72145 > 71975), lowering
      kernel.perf_event_max_sample_rate to 2000
      [  410.724923] ------------[ cut here ]------------
      [  410.729822] WARNING: CPU: 0 PID: 16397 at arch/x86/events/core.c:1422
      x86_pmu_stop+0x95/0xa0
      [  410.933811]  x86_pmu_del+0x50/0x150
      [  410.937304]  event_sched_out.isra.0+0xbc/0x210
      [  410.941751]  group_sched_out.part.0+0x53/0xd0
      [  410.946111]  ctx_sched_out+0x193/0x270
      [  410.949862]  __perf_event_task_sched_out+0x32c/0x890
      [  410.954827]  ? set_next_entity+0x98/0x2d0
      [  410.958841]  __schedule+0x592/0x9c0
      [  410.962332]  schedule+0x5f/0xd0
      [  410.965477]  exit_to_usermode_loop+0x73/0x120
      [  410.969837]  prepare_exit_to_usermode+0xcd/0xf0
      [  410.974369]  ret_from_intr+0x2a/0x3a
      [  410.977946] RIP: 0033:0x40123c
      [  411.079661] ---[ end trace bc83adaea7bb664a ]---
      
      In the non-overflow context, e.g., context switch, with large PEBS, perf
      may stop an event twice. An example is below.
      
        //max_samples_per_tick is adjusted to 2
        //NMI is triggered
        intel_pmu_handle_irq()
           handle_pmi_common()
             drain_pebs()
               __intel_pmu_pebs_event()
                 perf_event_overflow()
                   __perf_event_account_interrupt()
                     hwc->interrupts = 1
                     return 0
        //A context switch happens right after the NMI.
        //In the same tick, the perf_throttled_seq is not changed.
        perf_event_task_sched_out()
           perf_pmu_sched_task()
             intel_pmu_drain_pebs_buffer()
               __intel_pmu_pebs_event()
                 perf_event_overflow()
                   __perf_event_account_interrupt()
                     ++hwc->interrupts >= max_samples_per_tick
                     return 1
                 x86_pmu_stop();  # First stop
           perf_event_context_sched_out()
             task_ctx_sched_out()
               ctx_sched_out()
                 event_sched_out()
                   x86_pmu_del()
                     x86_pmu_stop();  # Second stop and trigger the warning
      
      Perf should only invoke the perf_event_overflow() in the overflow
      context.
      
      Current drain_pebs() is called from:
      - handle_pmi_common()			-- overflow context
      - intel_pmu_pebs_sched_task()		-- non-overflow context
      - intel_pmu_pebs_disable()		-- non-overflow context
      - intel_pmu_auto_reload_read()		-- possible overflow context
        With PERF_SAMPLE_READ + PERF_FORMAT_GROUP, the function may be
        invoked in the NMI handler. But, before calling the function, the
        PEBS buffer has already been drained. The __intel_pmu_pebs_event()
        will not be called in the possible overflow context.
      
      To fix the issue, an indicator is required to distinguish between the
      overflow context aka handle_pmi_common() and other cases.
      The dummy regs pointer can be used as the indicator.
      
      In the non-overflow context, perf should treat the last record the same
      as other PEBS records, and not invoke the generic overflow handler.
      
      Fixes: 21509084 ("perf/x86/intel: Handle multiple records in the PEBS buffer")
      Reported-by: Like Xu <like.xu@linux.intel.com>
      Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Tested-by: Like Xu <like.xu@linux.intel.com>
      Link: https://lkml.kernel.org/r/20200902210649.2743-1-kan.liang@linux.intel.com
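The idea of using the dummy regs pointer as a context indicator can be sketched in user space. This is a simplified model with hypothetical names: callers outside the overflow path pass NULL, the drain routine substitutes a dummy regs object, and only a caller that supplied real regs lets the last record go through the overflow handler.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the saved register state. */
struct regs_model { int unused; };

/* Placeholder used when the caller has no real interrupt regs. */
static struct regs_model dummy_regs;

struct drain_result {
    int plain_outputs;  /* records written without overflow handling */
    int overflow_calls; /* records passed to the generic overflow handler */
};

/* Drain n_records; iregs is NULL when called outside the overflow path. */
static struct drain_result drain_model(int n_records, struct regs_model *iregs)
{
    struct drain_result r = { 0, 0 };

    assert(n_records >= 0);
    if (!iregs)
        iregs = &dummy_regs;

    for (int i = 0; i < n_records; i++) {
        int last = (i == n_records - 1);

        /* Only the last record in a real overflow context may throttle or stop. */
        if (last && iregs != &dummy_regs)
            r.overflow_calls++;
        else
            r.plain_outputs++;
    }
    return r;
}
```

Called with real regs, the model sends one record to the overflow handler; called with NULL, as a context-switch drain would, it sends none, so the event cannot be stopped a first time before the scheduler stops it again.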
  8. 18 Aug, 2020 2 commits