- 12 Oct, 2020 1 commit
-
-
Jiri Olsa authored
There's a possible race in perf_mmap_close() when checking ring buffer's mmap_count refcount value. The problem is that the mmap_count check is not atomic because we call atomic_dec() and atomic_read() separately. perf_mmap_close: ... atomic_dec(&rb->mmap_count); ... if (atomic_read(&rb->mmap_count)) goto out_put; <ring buffer detach> free_uid out_put: ring_buffer_put(rb); /* could be last */ The race can happen when we have two (or more) events sharing same ring buffer and they go through atomic_dec() and then they both see 0 as refcount value later in atomic_read(). Then both will go on and execute code which is meant to be run just once. The code that detaches ring buffer is probably fine to be executed more than once, but the problem is in calling free_uid(), which will later on demonstrate in related crashes and refcount warnings, like: refcount_t: addition on 0; use-after-free. ... RIP: 0010:refcount_warn_saturate+0x6d/0xf ... Call Trace: prepare_creds+0x190/0x1e0 copy_creds+0x35/0x172 copy_process+0x471/0x1a80 _do_fork+0x83/0x3a0 __do_sys_wait4+0x83/0x90 __do_sys_clone+0x85/0xa0 do_syscall_64+0x5b/0x1e0 entry_SYSCALL_64_after_hwframe+0x44/0xa9 Using atomic decrease and check instead of separated calls. Tested-by: Michael Petlan <mpetlan@redhat.com> Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Namhyung Kim <namhyung@kernel.org> Acked-by: Wade Mealing <wmealing@redhat.com> Fixes: 9bb5d40c ("perf: Fix mmap() accounting hole"); Link: https://lore.kernel.org/r/20200916115311.GE2301783@krava
-
- 06 Oct, 2020 2 commits
-
-
Peter Zijlstra authored
When a group that has TopDown members is failed to be scheduled, any later TopDown groups will not return valid values. Here is an example. A background perf that occupies all the GP counters and the fixed counter 1. $perf stat -e "{cycles,cycles,cycles,cycles,cycles,cycles,cycles, cycles,cycles}:D" -a A user monitors a TopDown group. It works well, because the fixed counter 3 and the PERF_METRICS are available. $perf stat -x, --topdown -- ./workload retiring,bad speculation,frontend bound,backend bound, 18.0,16.1,40.4,25.5, Then the user tries to monitor a group that has TopDown members. Because of the cycles event, the group is failed to be scheduled. $perf stat -x, -e '{slots,topdown-retiring,topdown-be-bound, topdown-fe-bound,topdown-bad-spec,cycles}' -- ./workload <not counted>,,slots,0,0.00,, <not counted>,,topdown-retiring,0,0.00,, <not counted>,,topdown-be-bound,0,0.00,, <not counted>,,topdown-fe-bound,0,0.00,, <not counted>,,topdown-bad-spec,0,0.00,, <not counted>,,cycles,0,0.00,, The user tries to monitor a TopDown group again. It doesn't work anymore. $perf stat -x, --topdown -- ./workload ,,,,, In a txn, cancel_txn() is to truncate the event_list for a canceled group and update the number of events added in this transaction. However, the number of TopDown events added in this transaction is not updated. The kernel will probably fail to add new Topdown events. Fixes: 7b2c05a1 ("perf/x86/intel: Generic support for hardware TopDown metrics") Reported-by: Andi Kleen <ak@linux.intel.com> Reported-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Kan Liang <kan.liang@linux.intel.com> Link: https://lkml.kernel.org/r/20201005082611.GH2628@hirez.programming.kicks-ass.net
-
Peter Zijlstra authored
Kan reported that n_metric gets corrupted for cancelled transactions; a similar issue exists for n_pair for AMD's Large Increment thing. The problem was confirmed and confirmed fixed by Kim using: sudo perf stat -e "{cycles,cycles,cycles,cycles}:D" -a sleep 10 & # should succeed: sudo perf stat -e "{fp_ret_sse_avx_ops.all}:D" -a workload # should fail: sudo perf stat -e "{fp_ret_sse_avx_ops.all,fp_ret_sse_avx_ops.all,cycles}:D" -a workload # previously failed, now succeeds with this patch: sudo perf stat -e "{fp_ret_sse_avx_ops.all}:D" -a workload Fixes: 57388912 ("perf/x86/amd: Add support for Large Increment per Cycle Events") Reported-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Kim Phillips <kim.phillips@amd.com> Link: https://lkml.kernel.org/r/20201005082516.GG2628@hirez.programming.kicks-ass.net
-
- 03 Oct, 2020 2 commits
-
-
Colin Ian King authored
An incorrect sizeof is being used, struct attribute ** is not correct, it should be struct attribute *. Note that since ** is the same size as * this is not causing any issues. Improve this fix by using sizeof(*attrs) as this allows us to not even reference the type of the pointer. Addresses-Coverity: ("Sizeof not portable (SIZEOF_MISMATCH)") Fixes: 51686546 ("x86/events/amd/iommu: Fix sysfs perf attribute groups") Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20201001113900.58889-1-colin.king@canonical.com
-
Kan Liang authored
It might be possible that different CPUs have different CPU metrics on a platform. In this case, writing the GLOBAL_CTRL_EN_PERF_METRICS bit to the GLOBAL_CTRL register of a CPU, which doesn't support the TopDown perf metrics feature, causes MSR access error. Current TopDown perf metrics feature is enumerated using the boot CPU's PERF_CAPABILITIES MSR. The MSR only indicates the boot CPU supports this feature. Check the PERF_CAPABILITIES MSR for each CPU. If any CPU doesn't support the perf metrics feature, disable the feature globally. Fixes: 59a854e2 ("perf/x86/intel: Support TopDown metrics on Ice Lake") Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20201001211711.25708-1-kan.liang@linux.intel.com
-
- 29 Sep, 2020 8 commits
-
-
Kan Liang authored
An error occues when sampling non-PEBS INST_RETIRED.PREC_DIST(0x01c0) event. perf record -e cpu/event=0xc0,umask=0x01/ -- sleep 1 Error: The sys_perf_event_open() syscall returned with 22 (Invalid argument) for event (cpu/event=0xc0,umask=0x01/). /bin/dmesg | grep -i perf may provide additional information. The idxmsk64 of the event is set to 0. The event never be successfully scheduled. The event should be limit to the fixed counter 0. Fixes: 60176089 ("perf/x86/intel: Add Icelake support") Reported-by: Yi, Ammy <ammy.yi@intel.com> Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20200928134726.13090-1-kan.liang@linux.intel.com
-
Kan Liang authored
The "MiB" result of the IMC free-running bandwidth events, uncore_imc_free_running/read/ and uncore_imc_free_running/write/ are 16 times too small. The "MiB" value equals the raw IMC free-running bandwidth counter value times a "scale" which is inaccurate. The IMC free-running bandwidth events should be incremented per 64B cache line, not DWs (4 bytes). The "scale" should be 6.103515625e-5. Fix the "scale" for both Snow Ridge and Ice Lake. Fixes: 2b3b76b5 ("perf/x86/intel/uncore: Add Ice Lake server uncore support") Fixes: ee49532b ("perf/x86/intel/uncore: Add IMC uncore support for Snow Ridge") Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20200928133240.12977-1-kan.liang@linux.intel.com
-
Alexander Antonov authored
Introduced early attributes /sys/devices/uncore_iio_<pmu_idx>/die* are initialized by skx_iio_set_mapping(), however, for example, for multiple segment platforms skx_iio_get_topology() returns -EPERM before a list of attributes in skx_iio_mapping_group will have been initialized. As a result the list is being NULL. Thus the warning "sysfs: (bin_)attrs not set by subsystem for group: uncore_iio_*/" appears and uncore_iio pmus are not available in sysfs. Clear IIO attr_update to properly handle the cases when topology information cannot be retrieved. Fixes: bb42b3d3 ("perf/x86/intel/uncore: Expose an Uncore unit to IIO PMON mapping") Reported-by: Kyle Meyer <kyle.meyer@hpe.com> Suggested-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Alexander Antonov <alexander.antonov@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Alexei Budankov <alexey.budankov@linux.intel.com> Reviewed-by: Kan Liang <kan.liang@linux.intel.com> Link: https://lkml.kernel.org/r/20200928102133.61041-1-alexander.antonov@linux.intel.com
-
Kan Liang authored
The Jasper Lake processor is also a Tremont microarchitecture. From the perspective of perf MSR, there is nothing changed compared with Elkhart Lake. Share the code path with Elkhart Lake. Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/1601296242-32763-2-git-send-email-kan.liang@linux.intel.com
-
Kan Liang authored
The Jasper Lake processor is also a Tremont microarchitecture. From the perspective of Intel PMU, there is nothing changed compared with Elkhart Lake. Share the perf code with Elkhart Lake. Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/1601296242-32763-1-git-send-email-kan.liang@linux.intel.com
-
Kan Liang authored
An oops is triggered by the fuzzy test. [ 327.853081] unchecked MSR access error: RDMSR from 0x70c at rIP: 0xffffffffc082c820 (uncore_msr_read_counter+0x10/0x50 [intel_uncore]) [ 327.853083] Call Trace: [ 327.853085] <IRQ> [ 327.853089] uncore_pmu_event_start+0x85/0x170 [intel_uncore] [ 327.853093] uncore_pmu_event_add+0x1a4/0x410 [intel_uncore] [ 327.853097] ? event_sched_in.isra.118+0xca/0x240 There are 2 GP counters for each CBOX, but the current code claims 4 counters. Accessing the invalid registers triggers the oops. Fixes: 6e394376 ("perf/x86/intel/uncore: Add Intel Icelake uncore support") Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20200925134905.8839-3-kan.liang@linux.intel.com
-
Kan Liang authored
There are some updates for the Icelake model specific uncore performance monitors. (The update can be found at 10th generation intel core processors families specification update Revision 004, ICL068) 1) Counter 0 of ARB uncore unit is not available for software use 2) The global 'enable bit' (bit 29) and 'freeze bit' (bit 31) of MSR_UNC_PERF_GLOBAL_CTRL cannot be used to control counter behavior. Needs to use local enable in event select MSR. Accessing the modified bit/registers will be ignored by HW. Users may observe inaccurate results with the current code. The changes of the MSR_UNC_PERF_GLOBAL_CTRL imply that groups cannot be read atomically anymore. Although the error of the result for a group becomes a bit bigger, it still far lower than not using a group. The group support is still kept. Only Remove the *_box() related implementation. Since the counter 0 of ARB uncore unit is not available, update the MSR address for the ARB uncore unit. There is no change for IMC uncore unit, which only include free-running counters. Fixes: 6e394376 ("perf/x86/intel/uncore: Add Intel Icelake uncore support") Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20200925134905.8839-2-kan.liang@linux.intel.com
-
Kan Liang authored
Previously, the MSR uncore for the Ice Lake and Tiger Lake are identical. The code path is shared. However, with recent update, the global MSR_UNC_PERF_GLOBAL_CTRL register and ARB uncore unit are changed for the Ice Lake. Split the Ice Lake and Tiger Lake MSR uncore support. The changes only impact the MSR ops() and the ARB uncore unit. Other codes can still be shared between the Ice Lake and the Tiger Lake. Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20200925134905.8839-1-kan.liang@linux.intel.com
-
- 24 Sep, 2020 11 commits
-
-
Kan Liang authored
The Snow Ridge integrated PCIe3 uncore unit can be used to collect performance data, e.g. utilization, between PCIe devices, plugged into the PCIe port, and the components (in M2IOSF) responsible for translating and managing requests to/from the device. The performance data is very useful for analyzing the performance of PCIe devices. The device with the PCIe3 uncore PMON units is owned by the portdrv_pci driver. Create a PCI sub driver for the PCIe3 uncore PMON units. Here are some difference between PCIe3 uncore unit and other uncore pci units. - There may be several Root Ports on a system. But the uncore counters only exist in the Root Port A. A user can configure the channel mask to collect the data from other Root Ports. - The event format of the PCIe3 uncore unit is the same as IIO unit of SKX. - The Control Register of PCIe3 uncore unit is 64 bits. - The offset of each counters is 8, which is the same as M2M unit of SNR. - New MSR addresses for unit control, counter and counter config. Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/1600094060-82746-7-git-send-email-kan.liang@linux.intel.com
-
Kan Liang authored
Some uncore counters may be located in the configuration space of a PCI device, which already has a bonded driver. Currently, the uncore driver cannot register a PCI uncore PMU for these counters, because, to register a PCI uncore PMU, the uncore driver must be bond to the device. However, one device can only have one bonded driver. Add an uncore PCI sub driver to support such kind of devices. The sub driver doesn't own the device. In initialization, the sub driver searches the device via pci_get_device(), and register the corresponding PMU for the device. In the meantime, the sub driver registers a PCI bus notifier, which is used to notify the sub driver once the device is removed. The sub driver can unregister the PMU accordingly. The sub driver only searches the devices defined in its id table. The id table varies on different platforms, which will be implemented in the following platform-specific patch. Suggested-by: Bjorn Helgaas <helgaas@kernel.org> Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/1600094060-82746-6-git-send-email-kan.liang@linux.intel.com
-
Kan Liang authored
The PMU unregistration in the uncore PCI sub driver is similar as the normal PMU unregistration for a PCI device. The codes to unregister a PCI PMU can be shared. Factor out uncore_pci_pmu_unregister(), which will be used later. Use uncore_pci_get_dev_die_info() to replace the codes which retrieve the socket and die informaion. The pci_set_drvdata() is not included in uncore_pci_pmu_unregister() as well, because the uncore PCI sub driver will not touch the private driver data pointer of the device. Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/1600094060-82746-5-git-send-email-kan.liang@linux.intel.com
-
Kan Liang authored
The PMU registration in the uncore PCI sub driver is similar as the normal PMU registration for a PCI device. The codes to register a PCI PMU can be shared. Factor out uncore_pci_pmu_register(), which will be used later. The pci_set_drvdata() is not included in uncore_pci_pmu_register(). The uncore PCI sub driver doesn't own the PCI device. It will not touch the private driver data pointer for the device. Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/1600094060-82746-4-git-send-email-kan.liang@linux.intel.com
-
Kan Liang authored
When an uncore PCI sub driver gets a remove notification, the corresponding PMU has to be retrieved and unregistered. The codes, which find the corresponding PMU by comparing the pci_device_id table, can be shared. Factor out uncore_pci_find_dev_pmu(), which will be used later. There is no functional change. Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/1600094060-82746-3-git-send-email-kan.liang@linux.intel.com
-
Kan Liang authored
The socket and die information is required to register/unregister a PMU in the uncore PCI sub driver. The codes, which get the socket and die information from a BUS number, can be shared. Factor out uncore_pci_get_dev_die_info(), which will be used later. There is no functional change. Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/1600094060-82746-2-git-send-email-kan.liang@linux.intel.com
-
Kim Phillips authored
Previously, the uncore driver would say "NB counters detected" on F17h machines, which don't have NorthBridge (NB) counters. They have Data Fabric (DF) counters. Just use the pmu.name to inform users which pmu to use and its associated counter count. F17h dmesg BEFORE: amd_uncore: AMD NB counters detected amd_uncore: AMD LLC counters detected F17h dmesg AFTER: amd_uncore: 4 amd_df counters detected amd_uncore: 6 amd_l3 counters detected Signed-off-by: Kim Phillips <kim.phillips@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20200921144330.6331-5-kim.phillips@amd.com
-
Kim Phillips authored
On Family 19h, the driver checks for a populated 2-bit threadmask in order to establish that the user wants to measure individual slices, individual cores (only one can be measured at a time), and lets the user also directly specify enallcores and/or enallslices if desired. Example F19h invocation to measure L3 accesses (event 4, umask 0xff) by the first thread (id 0 -> mask 0x1) of the first core (id 0) on the first slice (id 0): perf stat -a -e instructions,amd_l3/umask=0xff,event=0x4,coreid=0,threadmask=1,sliceid=0,enallcores=0,enallslices=0/ <workload> Signed-off-by: Kim Phillips <kim.phillips@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20200921144330.6331-4-kim.phillips@amd.com
-
Kim Phillips authored
Continue to fully populate either one of threadmask or slicemask if the user doesn't. Signed-off-by: Kim Phillips <kim.phillips@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20200921144330.6331-3-kim.phillips@amd.com
-
Kim Phillips authored
Replace AMD_FORMAT_ATTR with the more apropos DEFINE_UNCORE_FORMAT_ATTR stolen from arch/x86/events/intel/uncore.h. This way we can clearly see the bit-variants of each of the attributes that want to have the same name across families. Also unroll AMD_ATTRIBUTE because we are going to separately add new attributes that differ between DF and L3. Also clean up the if-Family 17h-else logic in amd_uncore_init. This is basically a rewrite of commit da6adaea ("perf/x86/amd/uncore: Update sysfs attributes for Family17h processors"). No functional changes. Tested F17h+ /sys/bus/event_source/devices/amd_{l3,df}/format/* content remains unchanged: /sys/bus/event_source/devices/amd_l3/format/event:config:0-7 /sys/bus/event_source/devices/amd_l3/format/umask:config:8-15 /sys/bus/event_source/devices/amd_df/format/event:config:0-7,32-35,59-60 /sys/bus/event_source/devices/amd_df/format/umask:config:8-15 Signed-off-by: Kim Phillips <kim.phillips@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20200921144330.6331-2-kim.phillips@amd.com
-
Jarkko Sakkinen authored
It is advised to use module_name() macro instead of dereferencing mod->name directly. This makes sense for consistencys sake and also it prevents a hard dependency to CONFIG_MODULES. Signed-off-by: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Masami Hiramatsu <mhiramat@kernel.org> Link: https://lkml.kernel.org/r/20200818050857.117998-1-jarkko.sakkinen@linux.intel.com
-
- 10 Sep, 2020 10 commits
-
-
Kim Phillips authored
Stephane Eranian found a bug in that IBS' current Fetch counter was not being reset when the driver would write the new value to clear it along with the enable bit set, and found that adding an MSR write that would first disable IBS Fetch would make IBS Fetch reset its current count. Indeed, the PPR for AMD Family 17h Model 31h B0 55803 Rev 0.54 - Sep 12, 2019 states "The periodic fetch counter is set to IbsFetchCnt [...] when IbsFetchEn is changed from 0 to 1." Explicitly set IbsFetchEn to 0 and then to 1 when re-enabling IBS Fetch, so the driver properly resets the internal counter to 0 and IBS Fetch starts counting again. A family 15h machine tested does not have this problem, and the extra wrmsr is also not needed on Family 19h, so only do the extra wrmsr on families 16h through 18h. Reported-by: Stephane Eranian <stephane.eranian@google.com> Signed-off-by: Kim Phillips <kim.phillips@amd.com> [peterz: optimized] Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: stable@vger.kernel.org Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537
-
Kim Phillips authored
Family 19h RAPL support did not change from Family 17h; extend the existing Fam17h support to work on Family 19h too. Signed-off-by: Kim Phillips <kim.phillips@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20200908214740.18097-8-kim.phillips@amd.com
-
Kim Phillips authored
IBS hardware with the OpCntExt feature gets a 7-bit wider internal counter. Both the maximum and current count bitfields in the IBS_OP_CTL register are extended to support reading and writing it. No changes are necessary to the driver for handling the extra contiguous current count bits (IbsOpCurCnt), as the driver already passes through 32 bits of that field. However, the driver has to do some extra bit manipulation when converting from a period to the non-contiguous (although conveniently aligned) extra bits in the IbsOpMaxCnt bitfield. This decreases IBS Op interrupt overhead when the period is over 1,048,560 (0xffff0), which would previously activate the driver's software counter. That threshold is now 134,217,712 (0x7fffff0). Signed-off-by: Kim Phillips <kim.phillips@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20200908214740.18097-7-kim.phillips@amd.com
-
Kim Phillips authored
Neither IbsBrTarget nor OPDATA4 are populated in IBS Fetch mode. Don't accumulate them into raw sample user data in that case. Also, in Fetch mode, add saving the IBS Fetch Control Extended MSR. Technically, there is an ABI change here with respect to the IBS raw sample data format, but I don't see any perf driver version information being included in perf.data file headers, but, existing users can detect whether the size of the sample record has reduced by 8 bytes to determine whether the IBS driver has this fix. Fixes: 904cb367 ("perf/x86/amd/ibs: Update IBS MSRs and feature definitions") Reported-by: Stephane Eranian <stephane.eranian@google.com> Signed-off-by: Kim Phillips <kim.phillips@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20200908214740.18097-6-kim.phillips@amd.com
-
Kim Phillips authored
get_ibs_op_count() adds hardware's current count (IbsOpCurCnt) bits to its count regardless of hardware's valid status. According to the PPR for AMD Family 17h Model 31h B0 55803 Rev 0.54, if the counter rolls over, valid status is set, and the lower 7 bits of IbsOpCurCnt are randomized by hardware. Don't include those bits in the driver's event count. Fixes: 8b1e1363 ("perf/x86-ibs: Fix usage of IBS op current count") Signed-off-by: Kim Phillips <kim.phillips@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: stable@vger.kernel.org Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537
-
Kim Phillips authored
Commit 57388912 ("perf/x86/amd: Add support for Large Increment per Cycle Events") mistakenly zeroes the upper 16 bits of the count in set_period(). That's fine for counting with perf stat, but not sampling with perf record when only Large Increment events are being sampled. To enable sampling, we sign extend the upper 16 bits of the merged counter pair as described in the Family 17h PPRs: "Software wanting to preload a value to a merged counter pair writes the high-order 16-bit value to the low-order 16 bits of the odd counter and then writes the low-order 48-bit value to the even counter. Reading the even counter of the merged counter pair returns the full 64-bit value." Fixes: 57388912 ("perf/x86/amd: Add support for Large Increment per Cycle Events") Signed-off-by: Kim Phillips <kim.phillips@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: stable@vger.kernel.org Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537
-
Kim Phillips authored
Commit 2f217d58 ("perf/x86/amd/uncore: Set the thread mask for F17h L3 PMCs") inadvertently changed the uncore driver's behaviour wrt perf tool invocations with or without a CPU list, specified with -C / --cpu=. Change the behaviour of the driver to assume the former all-cpu (-a) case, which is the more commonly desired default. This fixes '-a -A' invocations without explicit cpu lists (-C) to not count L3 events only on behalf of the first thread of the first core in the L3 domain. BEFORE: Activity performed by the first thread of the last core (CPU#43) in CPU#40's L3 domain is not reported by CPU#40: sudo perf stat -a -A -e l3_request_g1.caching_l3_cache_accesses taskset -c 43 perf bench mem memcpy -s 32mb -l 100 -f default ... CPU36 21,835 l3_request_g1.caching_l3_cache_accesses CPU40 87,066 l3_request_g1.caching_l3_cache_accesses CPU44 17,360 l3_request_g1.caching_l3_cache_accesses ... AFTER: The L3 domain activity is now reported by CPU#40: sudo perf stat -a -A -e l3_request_g1.caching_l3_cache_accesses taskset -c 43 perf bench mem memcpy -s 32mb -l 100 -f default ... CPU36 354,891 l3_request_g1.caching_l3_cache_accesses CPU40 1,780,870 l3_request_g1.caching_l3_cache_accesses CPU44 315,062 l3_request_g1.caching_l3_cache_accesses ... Fixes: 2f217d58 ("perf/x86/amd/uncore: Set the thread mask for F17h L3 PMCs") Signed-off-by: Kim Phillips <kim.phillips@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20200908214740.18097-2-kim.phillips@amd.com
-
Kan Liang authored
The pmu::sched_task() is a context switch callback. It passes the cpuctx->task_ctx as a parameter to the lower code. To find the cpuctx->task_ctx, the current code iterates a cpuctx list. The same context will iterated in perf_event_context_sched_out() soon. Share the cpuctx->task_ctx can avoid the unnecessary iteration of the cpuctx list. The pmu::sched_task() is also required for the optimization case for equivalent contexts. The task_ctx_sched_out() will eventually disable and reenable the PMU when schedule out events. Add perf_pmu_disable() and perf_pmu_enable() around task_ctx_sched_out() don't break anything. Drop the cpuctx->ctx.lock for the pmu::sched_task(). The lock is for per-CPU context, which is not necessary for the per-task context schedule. No one uses sched_cb_entry, perf_sched_cb_usages, sched_cb_list, and perf_pmu_sched_task() any more. Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20200821195754.20159-2-kan.liang@linux.intel.com
-
Kan Liang authored
The pmu::sched_task() is a context switch callback. It passes the cpuctx->task_ctx as a parameter to the lower code. To find the cpuctx->task_ctx, the current code iterates a cpuctx list. The same context was just iterated in perf_event_context_sched_in(), which is invoked right before the pmu::sched_task(). Reuse the cpuctx->task_ctx from perf_event_context_sched_in() can avoid the unnecessary iteration of the cpuctx list. Both pmu::sched_task and perf_event_context_sched_in() have to disable PMU. Pull the pmu::sched_task into perf_event_context_sched_in() can also save the overhead from the PMU disable and reenable. The new and old tasks may have equivalent contexts. The current code optimize this case by swapping the context, which avoids the scheduling. For this case, pmu::sched_task() is still required, e.g., restore the LBR content. Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20200821195754.20159-1-kan.liang@linux.intel.com
-
Kan Liang authored
A warning as below may be triggered when sampling with large PEBS. [ 410.411250] perf: interrupt took too long (72145 > 71975), lowering kernel.perf_event_max_sample_rate to 2000 [ 410.724923] ------------[ cut here ]------------ [ 410.729822] WARNING: CPU: 0 PID: 16397 at arch/x86/events/core.c:1422 x86_pmu_stop+0x95/0xa0 [ 410.933811] x86_pmu_del+0x50/0x150 [ 410.937304] event_sched_out.isra.0+0xbc/0x210 [ 410.941751] group_sched_out.part.0+0x53/0xd0 [ 410.946111] ctx_sched_out+0x193/0x270 [ 410.949862] __perf_event_task_sched_out+0x32c/0x890 [ 410.954827] ? set_next_entity+0x98/0x2d0 [ 410.958841] __schedule+0x592/0x9c0 [ 410.962332] schedule+0x5f/0xd0 [ 410.965477] exit_to_usermode_loop+0x73/0x120 [ 410.969837] prepare_exit_to_usermode+0xcd/0xf0 [ 410.974369] ret_from_intr+0x2a/0x3a [ 410.977946] RIP: 0033:0x40123c [ 411.079661] ---[ end trace bc83adaea7bb664a ]--- In the non-overflow context, e.g., context switch, with large PEBS, perf may stop an event twice. An example is below. //max_samples_per_tick is adjusted to 2 //NMI is triggered intel_pmu_handle_irq() handle_pmi_common() drain_pebs() __intel_pmu_pebs_event() perf_event_overflow() __perf_event_account_interrupt() hwc->interrupts = 1 return 0 //A context switch happens right after the NMI. //In the same tick, the perf_throttled_seq is not changed. perf_event_task_sched_out() perf_pmu_sched_task() intel_pmu_drain_pebs_buffer() __intel_pmu_pebs_event() perf_event_overflow() __perf_event_account_interrupt() ++hwc->interrupts >= max_samples_per_tick return 1 x86_pmu_stop(); # First stop perf_event_context_sched_out() task_ctx_sched_out() ctx_sched_out() event_sched_out() x86_pmu_del() x86_pmu_stop(); # Second stop and trigger the warning Perf should only invoke the perf_event_overflow() in the overflow context. Current drain_pebs() is called from: - handle_pmi_common() -- overflow context - intel_pmu_pebs_sched_task() -- non-overflow context - intel_pmu_pebs_disable() -- non-overflow context - intel_pmu_auto_reload_read() -- possible overflow context With PERF_SAMPLE_READ + PERF_FORMAT_GROUP, the function may be invoked in the NMI handler. But, before calling the function, the PEBS buffer has already been drained. The __intel_pmu_pebs_event() will not be called in the possible overflow context. To fix the issue, an indicator is required to distinguish between the overflow context aka handle_pmi_common() and other cases. The dummy regs pointer can be used as the indicator. In the non-overflow context, perf should treat the last record the same as other PEBS records, and doesn't invoke the generic overflow handler. Fixes: 21509084 ("perf/x86/intel: Handle multiple records in the PEBS buffer") Reported-by: Like Xu <like.xu@linux.intel.com> Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Like Xu <like.xu@linux.intel.com> Link: https://lkml.kernel.org/r/20200902210649.2743-1-kan.liang@linux.intel.com
-
- 18 Aug, 2020 6 commits
-
-
Kan Liang authored
Starts from Ice Lake, the TopDown metrics are directly available as fixed counters and do not require generic counters. Also, the TopDown metrics can be collected per thread. Extend the RDPMC usage to support per-thread TopDown metrics. The RDPMC index of the PERF_METRICS will be output if RDPMC users ask for the RDPMC index of the metrics events. To support per thread RDPMC TopDown, the metrics and slots counters have to be saved/restored during the context switching. The last_period and period_left are not used in the counting mode. Use the fields for saved_metric and saved_slots. Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20200723171117.9918-12-kan.liang@linux.intel.com
-
Kan Liang authored
Ice Lake supports the hardware TopDown metrics feature, which can free up the scarce GP counters. Update the event constraints for the metrics events. The metric counters do not exist, which are mapped to a dummy offset. The sharing between multiple users of the same metric without multiplexing is not allowed. Implement set_topdown_event_period for Ice Lake. The values in PERF_METRICS MSR are derived from the fixed counter 3. Both registers should start from zero. Implement update_topdown_event for Ice Lake. The metric is reported by multiplying the metric (fraction) with slots. To maintain accurate measurements, both registers are cleared for each update. The fixed counter 3 should always be cleared before the PERF_METRICS. Implement td_attr for the new metrics events and the new slots fixed counter. Make them visible to the perf user tools. Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20200723171117.9918-11-kan.liang@linux.intel.com
-
Kan Liang authored
The RDPMC base offset of fixed counters is hard-code. Use a meaningful name to replace the magic number to improve the readability of the code. Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20200723171117.9918-10-kan.liang@linux.intel.com
-
Kan Liang authored
Intro ===== The TopDown Microarchitecture Analysis (TMA) Method is a structured analysis methodology to identify critical performance bottlenecks in out-of-order processors. Current perf has supported the method. The method works well, but there is one problem. To collect the TopDown events, several GP counters have to be used. If a user wants to collect other events at the same time, the multiplexing probably be triggered, which impacts the accuracy. To free up the scarce GP counters, the hardware TopDown metrics feature is introduced from Ice Lake. The hardware implements an additional "metrics" register and a new Fixed Counter 3 that measures pipeline "slots". The TopDown events can be calculated from them instead. Events ====== The level 1 TopDown has four metrics. There is no event-code assigned to the TopDown metrics. Four metric events are exported as separate perf events, which map to the internal "metrics" counter register. Those events do not exist in hardware, but can be allocated by the scheduler. For the event mapping, a special 0x00 event code is used, which is reserved for fake events. The metric events start from umask 0x10. When setting up the metric events, they point to the Fixed Counter 3. They have to be specially handled. - Add the update_topdown_event() callback to read the additional metrics MSR and generate the metrics. - Add the set_topdown_event_period() callback to initialize metrics MSR and the fixed counter 3. - Add a variable n_metric_event to track the number of the accepted metrics events. The sharing between multiple users of the same metric without multiplexing is not allowed. - Only enable/disable the fixed counter 3 when there are no other active TopDown events, which avoid the unnecessary writing of the fixed control register. - Disable the PMU when reading the metrics event. The metrics MSR and the fixed counter 3 are read separately. The values may be modified by an NMI. All four metric events don't support sampling. Since they will be handled specially for event update, a flag PERF_X86_EVENT_TOPDOWN is introduced to indicate this case. The slots event can support both sampling and counting. For counting, the flag is also applied. For sampling, it will be handled normally as other normal events. Groups ====== The slots event is required in a Topdown group. To avoid reading the METRICS register multiple times, the metrics and slots value can only be updated by slots event in a group. All active slots and metrics events will be updated one time. Therefore, the slots event must be before any metric events in a Topdown group. NMI ====== The METRICS related register may be overflow. The bit 48 of the STATUS register will be set. If so, PERF_METRICS and Fixed counter 3 are required to be reset. The patch also update all active slots and metrics events in the NMI handler. The update_topdown_event() has to read two registers separately. The values may be modified by an NMI. PMU has to be disabled before calling the function. RDPMC ====== RDPMC is temporarily disabled. A later patch will enable it. Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20200723171117.9918-9-kan.liang@linux.intel.com
-
Kan Liang authored
Current perf assumes that events in a group are independent. Close an event doesn't impact the value of the other events in the same group. If the closed event is a member, after the event closure, other events are still running like a group. If the closed event is a leader, other events are running as singleton events. Add PERF_EV_CAP_SIBLING to allow events to indicate they require being part of a group, and when the leader dies they cannot exist independently. Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20200723171117.9918-8-kan.liang@linux.intel.com
-
Kan Liang authored
Currently, the if-else is used in the intel_pmu_disable/enable_event to check the type of an event. It works well, but with more and more types added later, e.g., perf metrics, compared to the switch statement, the if-else may impair the readability of the code. There is no harm to use the switch statement to replace the if-else here. Also, some optimizing compilers may compile a switch statement into a jump-table which is more efficient than if-else for a large number of cases. The performance gain may not be observed for now, because the number of cases is only 5, but the benefits may be observed with more and more types added in the future. Use switch to replace the if-else in the intel_pmu_disable/enable_event. If the idx is invalid, print a warning. For the case INTEL_PMC_IDX_FIXED_BTS in intel_pmu_disable_event, don't need to check the event->attr.precise_ip. Use return for the case. Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20200723171117.9918-7-kan.liang@linux.intel.com
-