1. 22 Feb, 2024 5 commits
    • perf thread_map: Skip exited threads when scanning /proc · 510e5287
      Ian Rogers authored
      Scanning /proc is inherently racy. Scanning /proc/pid/task within that
      is also racy as the pid can terminate. Rather than failing in
      __thread_map__new_all_cpus, skip pids for such failures.
      Signed-off-by: Ian Rogers <irogers@google.com>
      Cc: James Clark <james.clark@arm.com>
      Cc: Justin Stitt <justinstitt@google.com>
      Cc: Bill Wendling <morbo@google.com>
      Cc: Nick Desaulniers <ndesaulniers@google.com>
      Cc: Yang Jihong <yangjihong1@huawei.com>
      Cc: Nathan Chancellor <nathan@kernel.org>
      Cc: Kan Liang <kan.liang@linux.intel.com>
      Cc: Athira Jajeev <atrajeev@linux.vnet.ibm.com>
      Cc: llvm@lists.linux.dev
      Signed-off-by: Namhyung Kim <namhyung@kernel.org>
      Link: https://lore.kernel.org/r/20240221034155.1500118-2-irogers@google.com
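The skip-on-failure idea described in this commit can be illustrated outside of perf. The following is a minimal Python sketch, not the C code in `__thread_map__new_all_cpus`; the function name `list_all_threads` and the `proc_root` parameter are hypothetical and exist only so the example can be exercised against a fake directory tree.

```python
import os


def list_all_threads(proc_root="/proc"):
    """Return {pid: [tid, ...]}, skipping PIDs that vanish mid-scan."""
    threads = {}
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a PID directory (e.g. "self", "cpuinfo")
        task_dir = os.path.join(proc_root, entry, "task")
        try:
            # The process may exit between the outer listdir() and this call.
            tids = sorted(int(t) for t in os.listdir(task_dir) if t.isdigit())
        except (FileNotFoundError, NotADirectoryError, ProcessLookupError):
            continue  # PID is gone: skip it rather than fail the whole scan
        threads[int(entry)] = tids
    return threads
```

Treating a vanished PID as "nothing to record" keeps one short-lived process from aborting a system-wide scan.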
    • perf list: fix short description for some cache events · b6968f9b
      Thomas Richter authored
      Correct the short description of the following events:
      DCW_REQ, DCW_REQ_CHIP_HIT, DCW_REQ_DRAWER_HIT, DCW_REQ_IV,
      DCW_ON_CHIP, DCW_ON_CHIP_IV, DCW_ON_CHIP_CHIP_HIT,
      DCW_ON_CHIP_DRAWER_HIT, DCW_ON_MODULE, DCW_ON_DRAWER,
      DCW_OFF_DRAWER, IDCW_ON_MODULE_IV, IDCW_ON_MODULE_CHIP_HIT,
      IDCW_ON_MODULE_DRAWER_HIT, IDCW_ON_DRAWER_IV, IDCW_ON_DRAWER_CHIP_HIT,
      IDCW_ON_DRAWER_DRAWER_HIT, IDCW_OFF_DRAWER_IV, IDCW_OFF_DRAWER_CHIP_HIT,
      IDCW_OFF_DRAWER_DRAWER_HIT, ICW_REQ, ICW_REQ_IV, ICW_REQ_CHIP_HIT,
      ICW_REQ_DRAWER_HIT, ICW_ON_CHIP, ICW_ON_CHIP_IV, ICW_ON_CHIP_CHIP_HIT,
      ICW_ON_CHIP_DRAWER_HIT, ICW_ON_MODULE and ICW_OFF_DRAWER.
      
      The second "Cache" in each description should read "L2-Cache".
      
      Output before (showing only the first four events):
        # perf list -d
        DCW_REQ
             [Directory Write Level 1 Data Cache from Cache. Unit: cpum_cf]
        DCW_REQ_CHIP_HIT
             [Directory Write Level 1 Data Cache from Cache with Chip HP \
      	       Hit. Unit: cpum_cf]
        DCW_REQ_DRAWER_HIT
             [Directory Write Level 1 Data Cache from Cache with Drawer \
      	       HP Hit. Unit: cpum_cf]
        DCW_REQ_IV
             [Directory Write Level 1 Data Cache from Cache with Intervention. \
      	       Unit: cpum_cf]
      
      Output after:
        # perf list -d
        DCW_REQ
             [Directory Write Level 1 Data Cache from L2-Cache. Unit: cpum_cf]
        DCW_REQ_CHIP_HIT
             [Directory Write Level 1 Data Cache from L2-Cache with Chip HP \
      	       Hit. Unit: cpum_cf]
        DCW_REQ_DRAWER_HIT
             [Directory Write Level 1 Data Cache from L2-Cache with Drawer \
      	       HP Hit. Unit: cpum_cf]
        DCW_REQ_IV
             [Directory Write Level 1 Data Cache from L2-Cache with \
      	       Intervention. Unit: cpum_cf]
      
      Fixes: 7f76b311 ("perf list: Add IBM z16 event description for s390")
      Reported-by: Andreas Krebbel <krebbel@linux.ibm.com>
      Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
      Acked-by: Andreas Krebbel <krebbel@linux.ibm.com>
      Reviewed-by: Ian Rogers <irogers@google.com>
      Cc: gor@linux.ibm.com
      Cc: hca@linux.ibm.com
      Cc: sumanthk@linux.ibm.com
      Cc: svens@linux.ibm.com
      Signed-off-by: Namhyung Kim <namhyung@kernel.org>
      Link: https://lore.kernel.org/r/20240221091908.1759083-1-tmricht@linux.ibm.com
    • perf stat: Fix metric-only aggregation index · bafd4e75
      Ian Rogers authored
      The aggregation index was being computed using the evsel's cpumap, which
      may have a different number of entries (typically the same or fewer).
      
      Before:
      ```
      $ perf stat --metric-only -A -M memory_bandwidth_total -a sleep 1
      
       Performance counter stats for 'system wide':
      
             MB/s  memory_bandwidth_total MB/s  memory_bandwidth_total MB/s  memory_bandwidth_total MB/s  memory_bandwidth_total MB/s  memory_bandwidth_total MB/s  memory_bandwidth_total
      CPU0                            12.8                           0.0                          12.9                          12.7                           0.0                          12.6
      CPU1
      
             1.007806367 seconds time elapsed
      ```
      
      After:
      ```
      $ perf stat --metric-only -A -M memory_bandwidth_total -a sleep 1
      
       Performance counter stats for 'system wide':
      
             MB/s  memory_bandwidth_total MB/s  memory_bandwidth_total MB/s  memory_bandwidth_total MB/s  memory_bandwidth_total MB/s  memory_bandwidth_total MB/s  memory_bandwidth_total
      CPU0                            15.4                           0.0                          15.3                          15.0                           0.0                          14.9
      CPU18                            0.0                           0.0                          13.5                           5.2                           0.0                          11.9
      
             1.007858736 seconds time elapsed
      ```
      
      Signed-off-by: Ian Rogers <irogers@google.com>
      Acked-by: Namhyung Kim <namhyung@kernel.org>
      Cc: K Prateek Nayak <kprateek.nayak@amd.com>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Kaige Ye <ye@kaige.org>
      Cc: Kajol Jain <kjain@linux.ibm.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Kan Liang <kan.liang@linux.intel.com>
      Cc: John Garry <john.g.garry@oracle.com>
      Signed-off-by: Namhyung Kim <namhyung@kernel.org>
      Link: https://lore.kernel.org/r/20240221070754.4163916-3-irogers@google.com
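One way to picture the index mismatch in this commit is the toy model below. It is an illustration only, not perf's code: the 36-CPU machine, the `[0, 18]` event CPU map, and both helper names are assumptions chosen to reproduce the "CPU0 / CPU1" versus "CPU0 / CPU18" rows shown in the Before and After output.

```python
def aggr_index_from_evsel(cpu, evsel_cpus):
    # Buggy variant: position of the CPU within the event's own CPU map.
    return evsel_cpus.index(cpu)


def aggr_index_from_aggr_map(cpu, aggr_cpus):
    # Fixed variant: position of the CPU within the aggregation map itself.
    return aggr_cpus.index(cpu)


# Hypothetical machine: with -A there is one aggregation slot per CPU.
aggr_cpus = list(range(36))
# Hypothetical uncore event that only counts on CPUs 0 and 18.
evsel_cpus = [0, 18]

buggy = [aggr_cpus[aggr_index_from_evsel(c, evsel_cpus)] for c in evsel_cpus]
fixed = [aggr_cpus[aggr_index_from_aggr_map(c, aggr_cpus)] for c in evsel_cpus]

print(buggy)  # [0, 1]
print(fixed)  # [0, 18]
```

When the event's map has fewer entries than the aggregation map, an index taken from the former points at the wrong slot in the latter.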
    • perf metrics: Compute unmerged uncore metrics individually · a59fb796
      Ian Rogers authored
      When merging counts from multiple uncore PMUs the metric is only
      computed for the metric leader. When merging/aggregation is disabled,
      prior to this patch just the leader's metric would be computed. Fix
      this by computing the metric for each PMU.
      
      On a SkylakeX:
      Before:
      ```
      $ perf stat -A -M memory_bandwidth_total -a sleep 1
      
       Performance counter stats for 'system wide':
      
      CPU0               82,217      UNC_M_CAS_COUNT.RD [uncore_imc_0] #      9.2 MB/s  memory_bandwidth_total
      CPU18                   0      UNC_M_CAS_COUNT.RD [uncore_imc_0] #      0.0 MB/s  memory_bandwidth_total
      CPU0               61,395      UNC_M_CAS_COUNT.WR [uncore_imc_0]
      CPU18                   0      UNC_M_CAS_COUNT.WR [uncore_imc_0]
      CPU0                    0      UNC_M_CAS_COUNT.RD [uncore_imc_1]
      CPU18                   0      UNC_M_CAS_COUNT.RD [uncore_imc_1]
      CPU0                    0      UNC_M_CAS_COUNT.WR [uncore_imc_1]
      CPU18                   0      UNC_M_CAS_COUNT.WR [uncore_imc_1]
      CPU0               81,570      UNC_M_CAS_COUNT.RD [uncore_imc_2]
      CPU18             113,886      UNC_M_CAS_COUNT.RD [uncore_imc_2]
      CPU0               62,330      UNC_M_CAS_COUNT.WR [uncore_imc_2]
      CPU18              66,942      UNC_M_CAS_COUNT.WR [uncore_imc_2]
      CPU0               75,489      UNC_M_CAS_COUNT.RD [uncore_imc_3]
      CPU18              27,958      UNC_M_CAS_COUNT.RD [uncore_imc_3]
      CPU0               55,864      UNC_M_CAS_COUNT.WR [uncore_imc_3]
      CPU18              38,727      UNC_M_CAS_COUNT.WR [uncore_imc_3]
      CPU0                    0      UNC_M_CAS_COUNT.RD [uncore_imc_4]
      CPU18                   0      UNC_M_CAS_COUNT.RD [uncore_imc_4]
      CPU0                    0      UNC_M_CAS_COUNT.WR [uncore_imc_4]
      CPU18                   0      UNC_M_CAS_COUNT.WR [uncore_imc_4]
      CPU0               75,423      UNC_M_CAS_COUNT.RD [uncore_imc_5]
      CPU18             104,527      UNC_M_CAS_COUNT.RD [uncore_imc_5]
      CPU0               57,596      UNC_M_CAS_COUNT.WR [uncore_imc_5]
      CPU18              56,777      UNC_M_CAS_COUNT.WR [uncore_imc_5]
      CPU0        1,003,440,851 ns   duration_time
      
             1.003440851 seconds time elapsed
      ```
      
      After:
      ```
      $ perf stat -A -M memory_bandwidth_total -a sleep 1
      
       Performance counter stats for 'system wide':
      
      CPU0               88,968      UNC_M_CAS_COUNT.RD [uncore_imc_0] #      9.5 MB/s  memory_bandwidth_total
      CPU18                   0      UNC_M_CAS_COUNT.RD [uncore_imc_0] #      0.0 MB/s  memory_bandwidth_total
      CPU0               59,498      UNC_M_CAS_COUNT.WR [uncore_imc_0]
      CPU18                   0      UNC_M_CAS_COUNT.WR [uncore_imc_0]
      CPU0                    0      UNC_M_CAS_COUNT.RD [uncore_imc_1] #      0.0 MB/s  memory_bandwidth_total
      CPU18                   0      UNC_M_CAS_COUNT.RD [uncore_imc_1] #      0.0 MB/s  memory_bandwidth_total
      CPU0                    0      UNC_M_CAS_COUNT.WR [uncore_imc_1]
      CPU18                   0      UNC_M_CAS_COUNT.WR [uncore_imc_1]
      CPU0               88,635      UNC_M_CAS_COUNT.RD [uncore_imc_2] #      9.5 MB/s  memory_bandwidth_total
      CPU18             117,975      UNC_M_CAS_COUNT.RD [uncore_imc_2] #     11.5 MB/s  memory_bandwidth_total
      CPU0               60,829      UNC_M_CAS_COUNT.WR [uncore_imc_2]
      CPU18              62,105      UNC_M_CAS_COUNT.WR [uncore_imc_2]
      CPU0               82,238      UNC_M_CAS_COUNT.RD [uncore_imc_3] #      8.7 MB/s  memory_bandwidth_total
      CPU18              22,906      UNC_M_CAS_COUNT.RD [uncore_imc_3] #      3.6 MB/s  memory_bandwidth_total
      CPU0               53,959      UNC_M_CAS_COUNT.WR [uncore_imc_3]
      CPU18              32,990      UNC_M_CAS_COUNT.WR [uncore_imc_3]
      CPU0                    0      UNC_M_CAS_COUNT.RD [uncore_imc_4] #      0.0 MB/s  memory_bandwidth_total
      CPU18                   0      UNC_M_CAS_COUNT.RD [uncore_imc_4] #      0.0 MB/s  memory_bandwidth_total
      CPU0                    0      UNC_M_CAS_COUNT.WR [uncore_imc_4]
      CPU18                   0      UNC_M_CAS_COUNT.WR [uncore_imc_4]
      CPU0               83,595      UNC_M_CAS_COUNT.RD [uncore_imc_5] #      8.9 MB/s  memory_bandwidth_total
      CPU18             110,151      UNC_M_CAS_COUNT.RD [uncore_imc_5] #     10.5 MB/s  memory_bandwidth_total
      CPU0               56,540      UNC_M_CAS_COUNT.WR [uncore_imc_5]
      CPU18              53,816      UNC_M_CAS_COUNT.WR [uncore_imc_5]
      CPU0        1,003,353,416 ns   duration_time
      ```
      
      Signed-off-by: Ian Rogers <irogers@google.com>
      Acked-by: Namhyung Kim <namhyung@kernel.org>
      Cc: K Prateek Nayak <kprateek.nayak@amd.com>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Kaige Ye <ye@kaige.org>
      Cc: Kajol Jain <kjain@linux.ibm.com>
      Cc: Kan Liang <kan.liang@linux.intel.com>
      Cc: John Garry <john.g.garry@oracle.com>
      Signed-off-by: Namhyung Kim <namhyung@kernel.org>
      Link: https://lore.kernel.org/r/20240221070754.4163916-2-irogers@google.com
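The per-PMU values in the After output can be reproduced by hand. The sketch below assumes the metric is (reads + writes) × 64 bytes, divided by 1e6 and by the elapsed time in seconds. That formula is not stated in the commit message; it is an assumption that happens to agree with the printed figures, and the function name is hypothetical.

```python
CACHE_LINE_BYTES = 64  # assumption: each CAS count moves one 64-byte line


def memory_bandwidth_mb_s(reads, writes, duration_ns):
    """Bandwidth in MB/s for one PMU on one CPU (assumed formula)."""
    bytes_moved = (reads + writes) * CACHE_LINE_BYTES
    return bytes_moved / 1e6 / (duration_ns / 1e9)


duration_ns = 1_003_353_416  # duration_time from the After output

# CPU0, uncore_imc_0: RD 88,968 and WR 59,498
print(round(memory_bandwidth_mb_s(88_968, 59_498, duration_ns), 1))   # 9.5
# CPU18, uncore_imc_2: RD 117,975 and WR 62,105
print(round(memory_bandwidth_mb_s(117_975, 62_105, duration_ns), 1))  # 11.5
```

Computing the value once per PMU, as the patch does, gives each `uncore_imc_N` row its own figure instead of only the leader's.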
    • perf stat: Pass fewer metric arguments · eee41e6b
      Ian Rogers authored
      Pass metric_expr and evsel rather than specific variables from the
      struct, thereby reducing the number of arguments. This will enable
      later fixes.
      
      To reduce the size of the diff, local variables are added to match the
      previous parameter names. This isn't done in the case of "name", as
      evsel->name is more intention-revealing. A whitespace issue is also
      addressed.
      Signed-off-by: Ian Rogers <irogers@google.com>
      Acked-by: Namhyung Kim <namhyung@kernel.org>
      Cc: K Prateek Nayak <kprateek.nayak@amd.com>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Kaige Ye <ye@kaige.org>
      Cc: Kajol Jain <kjain@linux.ibm.com>
      Cc: Kan Liang <kan.liang@linux.intel.com>
      Cc: John Garry <john.g.garry@oracle.com>
      Signed-off-by: Namhyung Kim <namhyung@kernel.org>
      Link: https://lore.kernel.org/r/20240221070754.4163916-1-irogers@google.com
  2. 21 Feb, 2024 5 commits
  3. 17 Feb, 2024 2 commits
    • perf list: For metricgroup only list include description · 81377de0
      Ian Rogers authored
      If perf list is invoked with 'metricgroups', include the description
      unless it is invoked with flags that exclude it. Make printing the
      metricgroup description depend on the desc flag in print_state, as is
      already done for metrics.
      
      Before:
      ```
      $ perf list metricgroups
      List of pre-defined events (to be used in -e or -M):
      
      Metric Groups:
      
      Backend
      Bad
      BadSpec
      ...
      ```
      
      After:
      ```
      $ perf list metricgroups
      List of pre-defined events (to be used in -e or -M):
      
      Metric Groups:
      
      Backend [Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet]
      Bad [Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet]
      BadSpec
      ...
      ```
      Signed-off-by: Ian Rogers <irogers@google.com>
      Acked-by: Namhyung Kim <namhyung@kernel.org>
      Signed-off-by: Namhyung Kim <namhyung@kernel.org>
      Link: https://lore.kernel.org/r/20240216192044.119897-1-irogers@google.com
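The flag-dependent output described in this commit can be sketched roughly as follows. This is a Python illustration, not perf's C implementation: `PrintState` is a hypothetical stand-in for the print_state structure mentioned above, and `format_metricgroup` is an invented helper.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PrintState:
    # Hypothetical stand-in for the desc flag in perf's print_state.
    desc: bool = True


def format_metricgroup(name: str, description: Optional[str],
                       state: PrintState) -> str:
    """Append the description only when the flag allows it and one exists."""
    if state.desc and description:
        return f"{name} [{description}]"
    return name


tma = "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet"
print(format_metricgroup("Backend", tma, PrintState()))   # with description
print(format_metricgroup("BadSpec", None, PrintState()))  # no description known
```

A group without a description, such as BadSpec in the After output, is printed as a bare name either way.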
    • perf tools: Fixup module symbol end address properly · bacefe0c
      Namhyung Kim authored
      I got a strange error on ARM: processing the FINISHED_ROUND record
      failed.  It turned out to be failing in symbol__alloc_hist() because
      the symbol size was too big.
      
      It failed when a sample was captured in a specific BPF program.  I
      added debug code and found that the symbol's end address comes from
      the next module, which is placed far away.
      
        ffff800008795778-ffff80000879d6d8: bpf_prog_1bac53b8aac4bc58_netcg_sock    [bpf]
        ffff80000879d6d8-ffff80000ad656b4: bpf_prog_76867454b5944e15_netcg_getsockopt      [bpf]
        ffff80000ad656b4-ffffd69b7af74048: bpf_prog_1d50286d2eb1be85_hn_egress     [bpf]   <---------- here
        ffffd69b7af74048-ffffd69b7af74048: $x.5    [sha3_generic]
        ffffd69b7af74048-ffffd69b7af740b8: crypto_sha3_init        [sha3_generic]
        ffffd69b7af740b8-ffffd69b7af741e0: crypto_sha3_update      [sha3_generic]
      
      The logic in symbols__fixup_end() just uses curr->start to update
      prev->end.  In this case that doesn't work, because the two addresses
      are too far apart.
      
      I think ARM has a different kernel memory layout for modules and BPF
      than x86.  There is already logic to handle the kernel/module
      boundary.  Let's do the same for symbols in different modules.
      Signed-off-by: Namhyung Kim <namhyung@kernel.org>
      Reviewed-by: Leo Yan <leo.yan@linux.dev>
      Cc: Will Deacon <will@kernel.org>
      Cc: Mike Leach <mike.leach@linaro.org>
      Cc: John Garry <john.g.garry@oracle.com>
      Link: https://lore.kernel.org/r/20240212233322.1855161-1-namhyung@kernel.org
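The end-address fixup described in this commit can be sketched as below. This is a simplified Python model, not the C code in symbols__fixup_end(). The `Symbol` class, the helper names, and in particular the cross-module policy (round the end up to the next page boundary past the start) are assumptions made for illustration; the commit message only says that module-to-module boundaries should be handled like the kernel/module boundary.

```python
from dataclasses import dataclass

PAGE_SIZE = 4096  # assumption: 4 KiB pages


@dataclass
class Symbol:
    start: int
    end: int      # equal to start when the real end is unknown
    name: str
    module: str


def round_up(value, align):
    return (value + align - 1) // align * align


def fixup_end(symbols):
    """Fill in unknown end addresses without crossing module boundaries."""
    symbols.sort(key=lambda s: s.start)
    for prev, curr in zip(symbols, symbols[1:]):
        if prev.end != prev.start:
            continue  # end address already known
        if prev.module == curr.module:
            prev.end = curr.start  # same module: next symbol bounds this one
        else:
            # Different module: the next start may be far away, so cap the
            # symbol at a page boundary instead (assumed policy).
            prev.end = round_up(prev.start + PAGE_SIZE, PAGE_SIZE)
    return symbols  # the last symbol is left untouched in this sketch


syms = fixup_end([
    Symbol(0xffff80000879d6d8, 0xffff80000879d6d8, "bpf_prog_76867454b5944e15_netcg_getsockopt", "bpf"),
    Symbol(0xffff80000ad656b4, 0xffff80000ad656b4, "bpf_prog_1d50286d2eb1be85_hn_egress", "bpf"),
    Symbol(0xffffd69b7af74048, 0xffffd69b7af740b8, "crypto_sha3_init", "sha3_generic"),
])
print(hex(syms[1].end))  # 0xffff80000ad67000
```

With the addresses from the listing above, the hn_egress symbol no longer stretches to the start of sha3_generic.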
  4. 16 Feb, 2024 28 commits