1. 25 Jun, 2019 4 commits
    • Kyle Meyer's avatar
      perf tools: Increase MAX_NR_CPUS and MAX_CACHES · 9f94c7f9
      Kyle Meyer authored
      Attempting to profile 1024 or more CPUs with perf causes two errors:
      
        perf record -a
        [ perf record: Woken up X times to write data ]
        way too many cpu caches..
        [ perf record: Captured and wrote X MB perf.data (X samples) ]
      
        perf report -C 1024
        Error: failed to set  cpu bitmap
        Requested CPU 1024 too large. Consider raising MAX_NR_CPUS
      
        Increasing MAX_NR_CPUS from 1024 to 2048 and redefining MAX_CACHES as
        MAX_NR_CPUS * 4 returns normal functionality to perf:
      
        perf record -a
        [ perf record: Woken up X times to write data ]
        [ perf record: Captured and wrote X MB perf.data (X samples) ]
      
        perf report -C 1024
        ...
      Signed-off-by: default avatarKyle Meyer <kyle.meyer@hpe.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/20190620193630.154025-1-meyerk@stormcage.eag.rdlabs.hpecorp.netSigned-off-by: default avatarArnaldo Carvalho de Melo <acme@redhat.com>
      9f94c7f9
    • Adrian Hunter's avatar
      perf thread-stack: Eliminate code duplicating thread_stack__pop_ks() · eb5d8544
      Adrian Hunter authored
      Use new function thread_stack__pop_ks() in place of equivalent code.
      Signed-off-by: default avatarAdrian Hunter <adrian.hunter@intel.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Link: http://lkml.kernel.org/r/20190619064429.14940-3-adrian.hunter@intel.comSigned-off-by: default avatarArnaldo Carvalho de Melo <acme@redhat.com>
      eb5d8544
    • Adrian Hunter's avatar
      perf thread-stack: Fix thread stack return from kernel for kernel-only case · 97860b48
      Adrian Hunter authored
      Commit f08046cb ("perf thread-stack: Represent jmps to the start of a
      different symbol") had the side-effect of introducing more stack entries
      before return from kernel space.
      
      When user space is also traced, those entries are popped before entry to
      user space, but when user space is not traced, they get stuck at the
      bottom of the stack, making the stack grow progressively larger.
      
      Fix by detecting a return-from-kernel branch type, and popping kernel
      addresses from the stack then.
      
      Note, the problem and fix affect the exported Call Graph / Tree but not
      the callindent option used by "perf script --call-trace".
      
      Example:
      
        perf-with-kcore record example -e intel_pt//k -- ls
        perf-with-kcore script example --itrace=bep -s ~/libexec/perf-core/scripts/python/export-to-sqlite.py example.db branches calls
        ~/libexec/perf-core/scripts/python/exported-sql-viewer.py example.db
      
        Menu option: Reports -> Context-Sensitive Call Graph
      
        Before: (showing Call Path column only)
      
          Call Path
           perf
          ▼ ls
            ▼ 12111:12111
               setup_new_exec
               __task_pid_nr_ns
               perf_event_pid_type
               perf_event_comm_output
               perf_iterate_ctx
               perf_iterate_sb
               perf_event_comm
               __set_task_comm
               load_elf_binary
               search_binary_handler
               __do_execve_file.isra.41
               __x64_sys_execve
               do_syscall_64
              ▼ entry_SYSCALL_64_after_hwframe
                ▼ swapgs_restore_regs_and_return_to_usermode
                  ▼ native_iret
                     error_entry
                     do_page_fault
                    ▼ error_exit
                      ▼ retint_user
                         prepare_exit_to_usermode
                        ▼ native_iret
                           error_entry
                           do_page_fault
                          ▼ error_exit
                            ▼ retint_user
                               prepare_exit_to_usermode
                              ▼ native_iret
                                 error_entry
                                 do_page_fault
                                ▼ error_exit
                                  ▼ retint_user
                                     prepare_exit_to_usermode
                                     native_iret
      
        After: (showing Call Path column only)
      
          Call Path
           perf
          ▼ ls
            ▼ 12111:12111
               setup_new_exec
               __task_pid_nr_ns
               perf_event_pid_type
               perf_event_comm_output
               perf_iterate_ctx
               perf_iterate_sb
               perf_event_comm
               __set_task_comm
               load_elf_binary
               search_binary_handler
               __do_execve_file.isra.41
               __x64_sys_execve
               do_syscall_64
               entry_SYSCALL_64_after_hwframe
               page_fault
              ▼ entry_SYSCALL_64
                ▼ do_syscall_64
                   __x64_sys_brk
                   __x64_sys_access
                   __x64_sys_openat
                   __x64_sys_newfstat
                   __x64_sys_mmap
                   __x64_sys_close
                   __x64_sys_read
                   __x64_sys_mprotect
                   __x64_sys_arch_prctl
                   __x64_sys_munmap
                   exit_to_usermode_loop
                   __x64_sys_set_tid_address
                   __x64_sys_set_robust_list
                   __x64_sys_rt_sigaction
                   __x64_sys_rt_sigprocmask
                   __x64_sys_prlimit64
                   __x64_sys_statfs
                   __x64_sys_ioctl
                   __x64_sys_getdents64
                   __x64_sys_write
                   __x64_sys_exit_group
      
      Committer notes:
      
      The first arg to the perf-with-kcore needs to be the same for the
      'record' and 'script' lines, otherwise we'll record the perf.data file
      and kcore_dir/ files in one directory ('example') to then try to use it
      from the 'bep' directory, fix the instructions above it so that both use
      'example'.
      Signed-off-by: default avatarAdrian Hunter <adrian.hunter@intel.com>
      Tested-by: default avatarArnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: stable@vger.kernel.org
      Fixes: f08046cb ("perf thread-stack: Represent jmps to the start of a different symbol")
      Link: http://lkml.kernel.org/r/20190619064429.14940-2-adrian.hunter@intel.comSigned-off-by: default avatarArnaldo Carvalho de Melo <acme@redhat.com>
      97860b48
    • Numfor Mbiziwo-Tiapo's avatar
      perf tools: Fix cache.h include directive · 2d7102a0
      Numfor Mbiziwo-Tiapo authored
      Change the include path so that progress.c can find cache.h since it was
      previously searching in the wrong directory.
      
      Committer notes:
      
        $ ls -la tools/perf/ui/../cache.h
        ls: cannot access 'tools/perf/ui/../cache.h': No such file or directory
      
      So it really should include ../../util/cache.h, or plain cache.h, since
      we have -Iutil in INC_FLAGS in tools/perf/Makefile.config
      Signed-off-by: default avatarNumfor Mbiziwo-Tiapo <nums@google.com>
      Cc: Jiri Olsa <jolsa@redhat.com>,
      Cc: Luke Mujica <lukemujica@google.com>,
      Cc: Stephane Eranian <eranian@google.com>
      To: Ian Rogers <irogers@google.com>
      Link: https://lkml.kernel.org/n/tip-pud8usyutvd2npg2vpsygncz@git.kernel.orgSigned-off-by: default avatarArnaldo Carvalho de Melo <acme@redhat.com>
      2d7102a0
  2. 24 Jun, 2019 11 commits
    • Ian Rogers's avatar
      perf/cgroups: Don't rotate events for cgroups unnecessarily · fd7d5517
      Ian Rogers authored
      Currently perf_rotate_context assumes that if the context's nr_events !=
      nr_active a rotation is necessary for perf event multiplexing. With
      cgroups, nr_events is the total count of events for all cgroups and
      nr_active will not include events in a cgroup other than the current
      task's. This makes rotation appear necessary for cgroups when it is not.
      
      Add a perf_event_context flag that is set when rotation is necessary.
      Clear the flag during sched_out and set it when a flexible sched_in
      fails due to resources.
      Signed-off-by: default avatarIan Rogers <irogers@google.com>
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Kan Liang <kan.liang@linux.intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Link: https://lkml.kernel.org/r/20190601082722.44543-1-irogers@google.comSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
      fd7d5517
    • Jiri Olsa's avatar
      perf/x86/rapl: Get quirk state from new probe framework · 637d97b5
      Jiri Olsa authored
      Getting the apply_quirk bool from new rapl_model_match array.
      
      And because apply_quirk was the last remaining piece of data
      in rapl_cpu_match, replacing it with rapl_model_match as device
      table.
      
      The switch to new perf_msr_probe detection API is done.
      Signed-off-by: default avatarJiri Olsa <jolsa@kernel.org>
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Kan <kan.liang@linux.intel.com>
      Cc: Liang
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Link: https://lkml.kernel.org/r/20190616140358.27799-9-jolsa@kernel.orgSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
      637d97b5
    • Jiri Olsa's avatar
      perf/x86/rapl: Get attributes from new probe framework · 5fc1bd84
      Jiri Olsa authored
      We no longer need model specific attribute arrays,
      because we get all this detected in rapl_events_attrs.
      Signed-off-by: default avatarJiri Olsa <jolsa@kernel.org>
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Kan <kan.liang@linux.intel.com>
      Cc: Liang
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Link: https://lkml.kernel.org/r/20190616140358.27799-8-jolsa@kernel.orgSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
      5fc1bd84
    • Jiri Olsa's avatar
      perf/x86/rapl: Get MSR values from new probe framework · 122f1c51
      Jiri Olsa authored
      There's no need to have special code for getting
      the bit and MSR value for given event. We can
      now easily get it from rapl_msrs array.
      
      Also getting rid of RAPL_IDX_*, which is no longer
      needed and replacing INTEL_RAPL* with PERF_RAPL*
      enums.
      Signed-off-by: default avatarJiri Olsa <jolsa@kernel.org>
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Kan <kan.liang@linux.intel.com>
      Cc: Liang
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Link: https://lkml.kernel.org/r/20190616140358.27799-7-jolsa@kernel.orgSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
      122f1c51
    • Jiri Olsa's avatar
      perf/x86/rapl: Get rapl_cntr_mask from new probe framework · cd105aed
      Jiri Olsa authored
      We get rapl_cntr_mask from perf_msr_probe call, as a replacement
      for current intel_rapl_init_fun::cntr_mask value for each model.
      Signed-off-by: default avatarJiri Olsa <jolsa@kernel.org>
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Kan <kan.liang@linux.intel.com>
      Cc: Liang
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Link: https://lkml.kernel.org/r/20190616140358.27799-6-jolsa@kernel.orgSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
      cd105aed
    • Jiri Olsa's avatar
      perf/x86/rapl: Use new MSR detection interface · 5fb5273a
      Jiri Olsa authored
      Using perf_msr_probe function to probe for RAPL MSRs.
      
      Adding new rapl_model_match device table, that
      gathers events info for given model, following
      the MSR and cstate module design.
      
      It will replace the current rapl_cpu_match device
      table and detection code in following patches.
      Signed-off-by: default avatarJiri Olsa <jolsa@kernel.org>
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Kan <kan.liang@linux.intel.com>
      Cc: Liang
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Link: https://lkml.kernel.org/r/20190616140358.27799-5-jolsa@kernel.orgSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
      5fb5273a
    • Jiri Olsa's avatar
      perf/x86/cstate: Use new probe function · 8f2a28c5
      Jiri Olsa authored
      Using perf_msr_probe function to probe for cstate events.
      
      The functionality is the same, with one exception, that
      perf_msr_probe checks for rdmsr to return value != 0 for
      given MSR register.
      
      Using the new attribute groups and adding the events via
      pmu::attr_update.
      Signed-off-by: default avatarJiri Olsa <jolsa@kernel.org>
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Kan <kan.liang@linux.intel.com>
      Cc: Liang
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Link: https://lkml.kernel.org/r/20190616140358.27799-4-jolsa@kernel.orgSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
      8f2a28c5
    • Jiri Olsa's avatar
      perf/x86/msr: Use new probe function · dde5e720
      Jiri Olsa authored
      Using perf_msr_probe function to probe for msr events.
      
      The functionality is the same, with one exception, that
      perf_msr_probe checks for rdmsr to return value != 0 for
      given MSR register.
      
      Using the new attribute groups and adding the events via
      pmu::attr_update.
      Signed-off-by: default avatarJiri Olsa <jolsa@kernel.org>
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Kan <kan.liang@linux.intel.com>
      Cc: Liang
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Link: https://lkml.kernel.org/r/20190616140358.27799-3-jolsa@kernel.orgSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
      dde5e720
    • Jiri Olsa's avatar
      perf/x86: Add MSR probe interface · 98253a54
      Jiri Olsa authored
      Adding perf_msr_probe function to provide interface for
      checking up on MSR register and set the related attribute
      group visibility.
      
      User defines following struct for each MSR register:
      
        struct perf_msr {
             u64                       msr;
             struct attribute_group   *grp;
             bool                    (*test)(int idx, void *data);
             bool                      no_check;
        };
      
      Where:
        msr      - is the MSR address
        attrs    - is attribute groups array to add if the check passed
        test     - is test function pointer
        no_check - is bool that bypass the check and adds the
                    attribute without any test
      
      The array of struct perf_msr is passed into:
      
        perf_msr_probe(struct perf_msr *msr, int cnt, bool zero, void *data)
      
      Together with:
        cnt  - which is the number of struct msr array elements
        data - which is user pointer passed to the test function
        zero - allow counters that returns zero on rdmsr
      
      The perf_msr_probe will executed test code, read the MSR and
      check the value is != 0. If all these tests pass, related
      attribute group is kept visible.
      
      Also adding PMU_EVENT_GROUP macro helper to define attribute
      group for single attribute. It will be used in following patches.
      Signed-off-by: default avatarJiri Olsa <jolsa@kernel.org>
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Kan <kan.liang@linux.intel.com>
      Cc: Liang
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Link: https://lkml.kernel.org/r/20190616140358.27799-2-jolsa@kernel.orgSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
      98253a54
    • Ingo Molnar's avatar
    • Ingo Molnar's avatar
      b9271f0c
  3. 23 Jun, 2019 5 commits
    • Fenghua Yu's avatar
      Documentation/ABI: Document umwait control sysfs interfaces · 203dffac
      Fenghua Yu authored
      Since two new sysfs interface files are created for umwait control, add
      an ABI document entry for the files:
      
         /sys/devices/system/cpu/umwait_control/enable_c02
         /sys/devices/system/cpu/umwait_control/max_time
      
      [ tglx: Made the write value instructions readable ]
      Signed-off-by: default avatarFenghua Yu <fenghua.yu@intel.com>
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Reviewed-by: default avatarAshok Raj <ashok.raj@intel.com>
      Cc: "Borislav Petkov" <bp@alien8.de>
      Cc: "H Peter Anvin" <hpa@zytor.com>
      Cc: "Andy Lutomirski" <luto@kernel.org>
      Cc: "Peter Zijlstra" <peterz@infradead.org>
      Cc: "Tony Luck" <tony.luck@intel.com>
      Cc: "Ravi V Shankar" <ravi.v.shankar@intel.com>
      Link: https://lkml.kernel.org/r/1560994438-235698-6-git-send-email-fenghua.yu@intel.com
      203dffac
    • Fenghua Yu's avatar
      x86/umwait: Add sysfs interface to control umwait maximum time · bd9a0c97
      Fenghua Yu authored
      IA32_UMWAIT_CONTROL[31:2] determines the maximum time in TSC-quanta
      that processor can stay in C0.1 or C0.2. A zero value means no maximum
      time.
      
      Each instruction sets its own deadline in the instruction's implicit
      input EDX:EAX value. The instruction wakes up if the time-stamp counter
      reaches or exceeds the specified deadline, or the umwait maximum time
      expires, or a store happens in the monitored address range in umwait.
      
      The administrator can write an unsigned 32-bit number to
      /sys/devices/system/cpu/umwait_control/max_time to change the default
      value. Note that a value of zero means there is no limit. The lower two
      bits of the value must be zero.
      
      [ tglx: Simplify the write function. Massage changelog ]
      Signed-off-by: default avatarFenghua Yu <fenghua.yu@intel.com>
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Reviewed-by: default avatarAshok Raj <ashok.raj@intel.com>
      Reviewed-by: default avatarTony Luck <tony.luck@intel.com>
      Cc: "Borislav Petkov" <bp@alien8.de>
      Cc: "H Peter Anvin" <hpa@zytor.com>
      Cc: "Andy Lutomirski" <luto@kernel.org>
      Cc: "Peter Zijlstra" <peterz@infradead.org>
      Cc: "Ravi V Shankar" <ravi.v.shankar@intel.com>
      Link: https://lkml.kernel.org/r/1560994438-235698-5-git-send-email-fenghua.yu@intel.com
      bd9a0c97
    • Fenghua Yu's avatar
      x86/umwait: Add sysfs interface to control umwait C0.2 state · ff4b353f
      Fenghua Yu authored
      C0.2 state in umwait and tpause instructions can be enabled or disabled
      on a processor through IA32_UMWAIT_CONTROL MSR register.
      
      By default, C0.2 is enabled and the user wait instructions results in
      lower power consumption with slower wakeup time.
      
      But in real time systems which require faster wakeup time although power
      savings could be smaller, the administrator needs to disable C0.2 and all
      umwait invocations from user applications use C0.1.
      
      Create a sysfs interface which allows the administrator to control C0.2
      state during run time.
      
      Andy Lutomirski suggested to turn off local irqs before writing the MSR to
      ensure the cached control value is not changed by a concurrent sysfs write
      from a different CPU via IPI.
      
      [ tglx: Simplified the update logic in the write function and got rid of
        	all the convoluted type casts. Added a shared update function and
      	made the namespace consistent. Moved the sysfs create invocation.
      	Massaged changelog ]
      Signed-off-by: default avatarFenghua Yu <fenghua.yu@intel.com>
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Reviewed-by: default avatarAshok Raj <ashok.raj@intel.com>
      Reviewed-by: default avatarTony Luck <tony.luck@intel.com>
      Cc: "Borislav Petkov" <bp@alien8.de>
      Cc: "H Peter Anvin" <hpa@zytor.com>
      Cc: "Andy Lutomirski" <luto@kernel.org>
      Cc: "Peter Zijlstra" <peterz@infradead.org>
      Cc: "Ravi V Shankar" <ravi.v.shankar@intel.com>
      Link: https://lkml.kernel.org/r/1560994438-235698-4-git-send-email-fenghua.yu@intel.com
      ff4b353f
    • Fenghua Yu's avatar
      x86/umwait: Initialize umwait control values · bd688c69
      Fenghua Yu authored
      umwait or tpause allows the processor to enter a light-weight
      power/performance optimized state (C0.1 state) or an improved
      power/performance optimized state (C0.2 state) for a period specified by
      the instruction or until the system time limit or until a store to the
      monitored address range in umwait.
      
      IA32_UMWAIT_CONTROL MSR register allows the OS to enable/disable C0.2 on
      the processor and to set the maximum time the processor can reside in C0.1
      or C0.2.
      
      By default C0.2 is enabled so the user wait instructions can enter the
      C0.2 state to save more power with slower wakeup time.
      
      Andy Lutomirski proposed to set the maximum umwait time to 100000 cycles by
      default. A quote from Andy:
      
        "What I want to avoid is the case where it works dramatically differently
         on NO_HZ_FULL systems as compared to everything else. Also, UMWAIT may
         behave a bit differently if the max timeout is hit, and I'd like that
         path to get exercised widely by making it happen even on default
         configs."
      
      A sysfs interface to adjust the time and the C0.2 enablement is provided in
      a follow up change.
      
      [ tglx: Renamed MSR_IA32_UMWAIT_CONTROL_MAX_TIME to
        	MSR_IA32_UMWAIT_CONTROL_TIME_MASK because the constant is used as
        	mask throughout the code.
      	Massaged comments and changelog ]
      Signed-off-by: default avatarFenghua Yu <fenghua.yu@intel.com>
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Reviewed-by: default avatarAshok Raj <ashok.raj@intel.com>
      Reviewed-by: default avatarAndy Lutomirski <luto@kernel.org>
      Cc: "Borislav Petkov" <bp@alien8.de>
      Cc: "H Peter Anvin" <hpa@zytor.com>
      Cc: "Peter Zijlstra" <peterz@infradead.org>
      Cc: "Tony Luck" <tony.luck@intel.com>
      Cc: "Ravi V Shankar" <ravi.v.shankar@intel.com>
      Link: https://lkml.kernel.org/r/1560994438-235698-3-git-send-email-fenghua.yu@intel.com
      bd688c69
    • Fenghua Yu's avatar
      x86/cpufeatures: Enumerate user wait instructions · 6dbbf5ec
      Fenghua Yu authored
      umonitor, umwait, and tpause are a set of user wait instructions.
      
      umonitor arms address monitoring hardware using an address. The
      address range is determined by using CPUID.0x5. A store to
      an address within the specified address range triggers the
      monitoring hardware to wake up the processor waiting in umwait.
      
      umwait instructs the processor to enter an implementation-dependent
      optimized state while monitoring a range of addresses. The optimized
      state may be either a light-weight power/performance optimized state
      (C0.1 state) or an improved power/performance optimized state
      (C0.2 state).
      
      tpause instructs the processor to enter an implementation-dependent
      optimized state C0.1 or C0.2 state and wake up when time-stamp counter
      reaches specified timeout.
      
      The three instructions may be executed at any privilege level.
      
      The instructions provide power saving method while waiting in
      user space. Additionally, they can allow a sibling hyperthread to
      make faster progress while this thread is waiting. One example of an
      application usage of umwait is when waiting for input data from another
      application, such as a user level multi-threaded packet processing
      engine.
      
      Availability of the user wait instructions is indicated by the presence
      of the CPUID feature flag WAITPKG CPUID.0x07.0x0:ECX[5].
      
      Detailed information on the instructions and CPUID feature WAITPKG flag
      can be found in the latest Intel Architecture Instruction Set Extensions
      and Future Features Programming Reference and Intel 64 and IA-32
      Architectures Software Developer's Manual.
      Signed-off-by: default avatarFenghua Yu <fenghua.yu@intel.com>
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Reviewed-by: default avatarAshok Raj <ashok.raj@intel.com>
      Reviewed-by: default avatarAndy Lutomirski <luto@kernel.org>
      Cc: "Borislav Petkov" <bp@alien8.de>
      Cc: "H Peter Anvin" <hpa@zytor.com>
      Cc: "Peter Zijlstra" <peterz@infradead.org>
      Cc: "Tony Luck" <tony.luck@intel.com>
      Cc: "Ravi V Shankar" <ravi.v.shankar@intel.com>
      Link: https://lkml.kernel.org/r/1560994438-235698-2-git-send-email-fenghua.yu@intel.com
      6dbbf5ec
  4. 22 Jun, 2019 20 commits