1. 08 Jul, 2020 23 commits
    • perf/x86/intel/lbr: Support XSAVES for arch LBR read · c085fb87
      Kan Liang authored
      Reading LBR registers in a perf NMI handler for a non-PEBS event
      causes a high overhead because the number of LBR registers is huge.
      To reduce the overhead, the XSAVES instruction should be used to replace
      the LBR registers' reading method.
      
      The XSAVES buffer used for LBR read has to be per-CPU because the NMI
      handler invokes lbr_read(). The existing task_ctx_data buffer cannot be
      used because it is per-task and is only allocated for the LBR call
      stack mode. A new lbr_xsave pointer is introduced in the cpu_hw_events
      as an XSAVES buffer for LBR read.
      
      The XSAVES buffer should be allocated only when LBR is used by a
      non-PEBS event on the CPU because the total size of the lbr_xsave is
      not small (~1.4KB).
      
      The XSAVES buffer is allocated when a non-PEBS event is added, but it
      is lazily released in x86_release_hardware() when perf releases the
      entire PMU hardware resource, because perf may frequently schedule the
      event, e.g. under a high context-switch rate. The lazy release method
      reduces the overhead of frequently allocating/freeing the buffer.
      
      If the lbr_xsave fails to be allocated, fall back to the normal Arch LBR
      lbr_read().
      Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Dave Hansen <dave.hansen@intel.com>
      Link: https://lkml.kernel.org/r/1593780569-62993-24-git-send-email-kan.liang@linux.intel.com
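The allocation policy described in the commit above (allocate on first use, keep the buffer across event scheduling, free only when the whole PMU is released, fall back to the register-by-register read if allocation failed) can be illustrated with a small user-space model. This is only a sketch under stated assumptions: struct cpu_state, the helper names and the alloc_may_succeed switch are hypothetical stand-ins, not the kernel's code; only the lbr_xsave field name and the buffer size come from the commit messages.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for the per-CPU state; only lbr_xsave mirrors the commit. */
struct cpu_state {
    void *lbr_xsave;            /* per-CPU XSAVES buffer, NULL until first needed */
};

enum lbr_read_path { LBR_READ_MSR, LBR_READ_XSAVES };

/* 808-byte LBR state plus 576 bytes of legacy region and XSAVE header. */
#define LBR_XSAVE_SIZE 1384u

/* A non-PEBS LBR event is added: allocate once, reuse afterwards. */
void lbr_add_event(struct cpu_state *cpu, bool alloc_may_succeed)
{
    if (cpu->lbr_xsave)
        return;                 /* kept from an earlier add */
    if (alloc_may_succeed)      /* test hook that models an allocation failure */
        cpu->lbr_xsave = calloc(1, LBR_XSAVE_SIZE);
}

/* Scheduling the event out does not free the buffer (lazy release). */
void lbr_del_event(struct cpu_state *cpu)
{
    (void)cpu;
}

/* Only releasing the whole PMU frees the buffer. */
void lbr_release_hardware(struct cpu_state *cpu)
{
    free(cpu->lbr_xsave);
    cpu->lbr_xsave = NULL;
}

/* Without a buffer, fall back to reading the LBR registers one by one. */
enum lbr_read_path lbr_pick_read_path(const struct cpu_state *cpu)
{
    return cpu->lbr_xsave ? LBR_READ_XSAVES : LBR_READ_MSR;
}
```

The point of the model is that a del/add cycle reuses the same buffer, so a workload with frequent context switches pays for the allocation only once.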
    • perf/x86/intel/lbr: Support XSAVES/XRSTORS for LBR context switch · ce711ea3
      Kan Liang authored
      In the LBR call stack mode, LBR information is used to reconstruct a
      call stack. To get the complete call stack, perf has to save/restore
      all LBR registers during a context switch. Due to a large number of the
      LBR registers, this process causes a high CPU overhead. To reduce the
      CPU overhead during a context switch, use the XSAVES/XRSTORS
      instructions.
      
      Every XSAVE area must follow a canonical format: the legacy region, an
      XSAVE header and the extended region. Although the LBR information is
      only kept in the extended region, a space for the legacy region and
      XSAVE header is still required. Add a new dedicated structure for LBR
      XSAVES support.
      
      Before enabling XSAVES support, the size of the LBR state has to be
      sanity checked, because:
      - the size of the software structure is calculated from the max number
      of the LBR depth, which is enumerated by the CPUID leaf for Arch LBR.
      The size of the LBR state is enumerated by the CPUID leaf for XSAVE
      support of Arch LBR. If the values from the two CPUID leaves are not
      consistent, it may trigger a buffer overflow. For example, a hypervisor
      may unintentionally set inconsistent values for the two emulated CPUID
      leaves.
      - unlike other state components, the size of an LBR state depends on the
      max number of LBRs, which may vary from generation to generation.
      
      Expose the function xfeature_size() for the sanity check.
      The LBR XSAVES support will be disabled if the size of the LBR state
      enumerated by CPUID doesn't match the size of the software structure.
      
      The XSAVE instruction requires 64-byte alignment for state buffers. A
      new macro is added to reflect the alignment requirement. A 64-byte
      aligned kmem_cache is created for architecture LBR.
      
      Currently, the structure for each state component is maintained in
      fpu/types.h. The structure for the new LBR state component should be
      maintained in the same place. Move structure lbr_entry to fpu/types.h as
      well for broader sharing.
      
      Add dedicated lbr_save/lbr_restore functions for LBR XSAVES support,
      which invoke the corresponding xstate helpers to XSAVES/XRSTORS LBR
      information at the context switch when the call stack mode is enabled.
      Since the XSAVES/XRSTORS instructions will eventually be invoked, the
      dedicated functions are named with an '_xsaves'/'_xrstors' suffix.
      Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Dave Hansen <dave.hansen@intel.com>
      Link: https://lkml.kernel.org/r/1593780569-62993-23-git-send-email-kan.liang@linux.intel.com
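The size sanity check described in the commit above can be sketched as plain arithmetic. The layout below is an assumption made for illustration: five 64-bit control/LER words followed by one three-word entry per LBR, which reproduces the 808-byte figure quoted elsewhere in this series for 32 entries (96 registers). The struct and function names are hypothetical, not the kernel's.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define XSAVE_LEGACY_SIZE 512u  /* legacy region */
#define XSAVE_HEADER_SIZE  64u  /* XSAVE header */

/* One LBR record: FROM, TO and INFO. */
struct lbr_entry {
    uint64_t from, to, info;
};

/* Assumed fixed part of the LBR state component (illustrative layout). */
struct lbr_state_fixed {
    uint64_t lbr_ctl, lbr_depth, ler_from, ler_to, ler_info;
};

/* Size of the software structure, derived from the enumerated max depth. */
size_t lbr_state_size(unsigned int max_depth)
{
    return sizeof(struct lbr_state_fixed) +
           (size_t)max_depth * sizeof(struct lbr_entry);
}

/* A dedicated buffer still needs room for the legacy region and header. */
size_t lbr_xsave_buffer_size(unsigned int max_depth)
{
    return XSAVE_LEGACY_SIZE + XSAVE_HEADER_SIZE + lbr_state_size(max_depth);
}

/*
 * Enable XSAVES for LBR only if both CPUID leaves agree: the depth leaf
 * (software structure size) and the XSAVE leaf (hardware state size).
 */
bool lbr_xsaves_usable(unsigned int cpuid_max_depth, size_t cpuid_state_size)
{
    return lbr_state_size(cpuid_max_depth) == cpuid_state_size;
}
```

If a hypervisor reported a depth of 32 together with a state size other than 808 bytes, lbr_xsaves_usable() would return false and the slower register-based save/restore would remain in use.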
    • x86/fpu/xstate: Add helpers for LBR dynamic supervisor feature · 50f408d9
      Kan Liang authored
      The perf subsystem only needs to save/restore the LBR state.
      However, the existing helpers save all supported supervisor states to a
      kernel buffer, which is unnecessary. Two helpers are introduced to
      save/restore only the requested dynamic supervisor states. The supervisor
      features in the XFEATURE_MASK_SUPERVISOR_SUPPORTED and
      XFEATURE_MASK_SUPERVISOR_UNSUPPORTED masks cannot be saved/restored using
      these helpers.
      
      The helpers will be used in the following patch.
      Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Dave Hansen <dave.hansen@intel.com>
      Link: https://lkml.kernel.org/r/1593780569-62993-22-git-send-email-kan.liang@linux.intel.com
    • x86/fpu/xstate: Support dynamic supervisor feature for LBR · f0dccc9d
      Kan Liang authored
      Last Branch Records (LBR) registers are used to log taken branches and
      other control flows. In perf with call stack mode, LBR information is
      used to reconstruct a call stack. To get the complete call stack, perf
      has to save/restore all LBR registers during a context switch. Due to
      the large number of the LBR registers, e.g., the current platform has
      96 LBR registers, this process causes a high CPU overhead. To reduce
      the CPU overhead during a context switch, an LBR state component that
      contains all the LBR related registers is introduced in hardware. All
      LBR registers can be saved/restored together using one XSAVES/XRSTORS
      instruction.
      
      However, the kernel should not save/restore the LBR state component at
      each context switch, like other state components, because of the
      following unique features of LBR:
      - The LBR state component only contains valuable information when LBR
        is enabled in the perf subsystem, but for most of the time, LBR is
        disabled.
      - The size of the LBR state component is huge. For the current
        platform, it's 808 bytes.
      If the kernel saves/restores the LBR state at each context switch, for
      most of the time, it is just a waste of space and cycles.
      
      To efficiently support the LBR state component, the kernel should:
      - only context-switch the LBR when the LBR feature is enabled in perf.
      - only allocate an LBR-specific XSAVE buffer on demand.
        (Besides the LBR state, a legacy region and an XSAVE header have to be
         included in the buffer as well. There is a total of (808+576) bytes
         of overhead for the LBR-specific XSAVE buffer. The overhead only
         occurs when perf is actively using LBRs. There is still a space
         saving, on average, when it replaces the constant 808 bytes of
         overhead for every task, all the time, on the systems that support
         architectural LBR.)
      - be able to use XSAVES/XRSTORS for accessing LBR at run time.
        However, the IA32_XSS should not be adjusted at run time.
        (The XCR0 | IA32_XSS are used to determine the requested-feature
        bitmap (RFBM) of XSAVES.)
      
      A solution, called dynamic supervisor feature, is introduced to address
      this issue, which
      - does not allocate a buffer in each task->fpu;
      - does not save/restore a state component at each context switch;
      - sets the bit corresponding to the dynamic supervisor feature in
        IA32_XSS at boot time, and avoids setting it at run time.
      - dynamically allocates a specific buffer for a state component
        on demand, e.g. only allocates LBR-specific XSAVE buffer when LBR is
        enabled in perf. (Note: The buffer has to include the LBR state
        component, a legacy region and an XSAVE header space.)
        (Implemented in a later patch)
      - saves/restores a state component on demand, e.g. manually invokes
        the XSAVES/XRSTORS instruction to save/restore the LBR state
        to/from the buffer when perf is active and a call stack is required.
        (Implemented in a later patch)
      
      A new mask XFEATURE_MASK_DYNAMIC and a helper xfeatures_mask_dynamic()
      are introduced to indicate the dynamic supervisor feature. For the
      systems which support the Architecture LBR, LBR is the only dynamic
      supervisor feature for now. For the previous systems, there is no
      dynamic supervisor feature available.
      Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Dave Hansen <dave.hansen@intel.com>
      Link: https://lkml.kernel.org/r/1593780569-62993-21-git-send-email-kan.liang@linux.intel.com
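The trade-off argued in the commit above can be made concrete with a little arithmetic. The sketch below models the dynamic mask helper and compares the memory cost of carrying the LBR state in every task's FPU buffer against allocating a dedicated buffer only for LBR users. The byte counts come from the commit message; the bit position of the LBR feature and the has_arch_lbr parameter are assumptions of this illustration, and the cost functions are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define XFEATURE_MASK_LBR      (1ull << 15)  /* assumed bit position */
#define XFEATURE_MASK_DYNAMIC  XFEATURE_MASK_LBR

#define LBR_STATE_BYTES    808u  /* size of the LBR state component */
#define XSAVE_FIXED_BYTES  576u  /* legacy region + XSAVE header */

/* LBR is the only dynamic supervisor feature, and only with Arch LBR. */
uint64_t xfeatures_mask_dynamic(bool has_arch_lbr)
{
    return has_arch_lbr ? XFEATURE_MASK_DYNAMIC : 0;
}

/* Memory used if every task->fpu carried the LBR state all the time. */
size_t lbr_bytes_static(size_t nr_tasks)
{
    return nr_tasks * LBR_STATE_BYTES;
}

/* Memory used if only active LBR users get a dedicated XSAVE buffer. */
size_t lbr_bytes_dynamic(size_t nr_lbr_users)
{
    return nr_lbr_users * (LBR_STATE_BYTES + XSAVE_FIXED_BYTES);
}
```

For example, with 1000 tasks of which 10 use LBR call stacks, the static scheme costs 808000 bytes while the dynamic scheme costs 13840 bytes, even though each dynamic buffer is larger.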
    • x86/fpu: Use proper mask to replace full instruction mask · a063bf24
      Kan Liang authored
      When saving xstate to a kernel/user XSAVE area with the XSAVE family of
      instructions, the current code applies the 'full' instruction mask (-1),
      which tries to XSAVE all possible features. This method relies on
      hardware to trim 'all possible' down to what is enabled in the
      hardware. The code works well for now. However, there will be a
      problem if some features are enabled in hardware but are not suitable
      to be saved into all kernel XSAVE buffers, like task->fpu, for
      performance reasons.
      
      One such example is the Last Branch Records (LBR) state. The LBR state
      only contains valuable information when LBR is explicitly enabled by
      the perf subsystem, and the size of an LBR state is large (808 bytes
      for now). To avoid both CPU overhead and space overhead at each context
      switch, the LBR state should not be saved into task->fpu like other
      state components. It should be saved/restored on demand when LBR is
      enabled in the perf subsystem. Current copy_xregs_to_* will trigger a
      buffer overflow for such cases.
      
      Three sites use the '-1' instruction mask which must be updated.
      
      Two are saving/restoring the xstate to/from a kernel-allocated XSAVE
      buffer and can use 'xfeatures_mask_all', which will save/restore all of
      the features present in a normal task FPU buffer.
      
      The last one saves the register state directly to a user buffer. It
      could also use 'xfeatures_mask_all'. Just as it was with the '-1'
      argument, any supervisor states in the mask will be filtered out by the
      hardware and not saved to the buffer. But, to be more explicit about
      what is expected to be saved, use xfeatures_mask_user() for the
      instruction mask.
      
      KVM includes the header file fpu/internal.h. To avoid an 'undefined
      xfeatures_mask_all' build error, move copy_fpregs_to_fpstate() to
      fpu/core.c and export it, because:
      - The xfeatures_mask_all is indirectly used via copy_fpregs_to_fpstate()
        by KVM. The function which is directly used by other modules should be
        exported.
      - The copy_fpregs_to_fpstate() is a function, while xfeatures_mask_all
        is a variable for the "internal" FPU state. It's safer to export a
        function than a variable, which may be implicitly changed by others.
      - The copy_fpregs_to_fpstate() is a big function with many checks. The
        removal of the inline keyword should not impact the performance.
      Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Dave Hansen <dave.hansen@intel.com>
      Link: https://lkml.kernel.org/r/1593780569-62993-20-git-send-email-kan.liang@linux.intel.com
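Why the '-1' instruction mask becomes a problem can be shown by modelling how the requested-feature bitmap is formed. The following is a minimal sketch, assuming illustrative feature bit positions and hypothetical function names; it only models the mask arithmetic, not the instructions themselves.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative feature bits; treat the positions as assumptions. */
#define XFEATURE_MASK_FP   (1ull << 0)
#define XFEATURE_MASK_SSE  (1ull << 1)
#define XFEATURE_MASK_YMM  (1ull << 2)
#define XFEATURE_MASK_LBR  (1ull << 15)  /* supervisor state, enabled in IA32_XSS */

/* XSAVES: features enabled in XCR0 | IA32_XSS, trimmed by the instruction mask. */
uint64_t xsaves_rfbm(uint64_t xcr0, uint64_t xss, uint64_t instr_mask)
{
    return (xcr0 | xss) & instr_mask;
}

/* XSAVE to a user buffer: supervisor states are never included. */
uint64_t xsave_user_rfbm(uint64_t xcr0, uint64_t instr_mask)
{
    return xcr0 & instr_mask;
}
```

With the LBR bit set in IA32_XSS, xsaves_rfbm(xcr0, xss, ~0ull) includes the LBR state, which a task->fpu buffer has no room for. Passing a mask that lists only the features present in a normal task buffer keeps the LBR state out.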
    • perf/x86: Remove task_ctx_size · 5a09928d
      Kan Liang authored
      A new kmem_cache method has replaced the kzalloc() to allocate the PMU
      specific data. The task_ctx_size is not required anymore.
      Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/1593780569-62993-19-git-send-email-kan.liang@linux.intel.com
    • perf/x86/intel/lbr: Create kmem_cache for the LBR context data · 33cad284
      Kan Liang authored
      A new kmem_cache method is introduced to allocate the PMU specific data
      task_ctx_data, which requires the PMU specific code to create a
      kmem_cache.
      
      Currently, the task_ctx_data is only used by the Intel LBR call stack
      feature, which was introduced with Haswell. The kmem_cache should only
      be created for Haswell and later platforms. There is no alignment
      requirement for the existing platforms.
      Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/1593780569-62993-18-git-send-email-kan.liang@linux.intel.com
    • perf/core: Use kmem_cache to allocate the PMU specific data · 217c2a63
      Kan Liang authored
      Currently, the PMU specific data task_ctx_data is allocated by the
      function kzalloc() in the perf generic code. As long as there is no
      specific alignment requirement for the task_ctx_data, the method works
      well. However, there will be a problem once a specific alignment
      requirement is introduced by future features, e.g., the Architecture LBR
      XSAVE feature requires 64-byte alignment. If the specific alignment
      requirement is not fulfilled, the XSAVE family of instructions will fail
      to save/restore the xstate to/from the task_ctx_data.
      
      The function kzalloc() itself only guarantees a natural alignment. A
      new method to allocate the task_ctx_data has to be introduced, which
      has to meet the following requirements:
      - must be a generic method that can be used by different architectures,
        because the allocation of the task_ctx_data is implemented in the
        perf generic code;
      - must guarantee the alignment (the alignment requirement does not
        change after boot);
      - must be able to allocate/free a buffer (smaller than a page size)
        dynamically;
      - should not cause extra CPU overhead or space overhead.
      
      Several options were considered:
      - One option is to allocate a larger buffer for task_ctx_data. E.g.,
          ptr = kmalloc(size + alignment, GFP_KERNEL);
          ptr = PTR_ALIGN(ptr, alignment);
        This option causes space overhead.
      - Another option is to allocate the task_ctx_data in the PMU specific
        code. To do so, several function pointers have to be added. As a
        result, both the generic structure and the PMU specific structure
        will become bigger. Besides, extra function calls are added when
        allocating/freeing the buffer. This option will increase both the
        space overhead and CPU overhead.
      - The third option is to use a kmem_cache to allocate a buffer for the
        task_ctx_data. The kmem_cache can be created with a specific alignment
        requirement by the PMU at boot time. A new pointer for kmem_cache has
        to be added in the generic struct pmu, which would be used to
        dynamically allocate a buffer for the task_ctx_data at run time.
        Although the new pointer is added to the struct pmu, the existing
        variable task_ctx_size is not required anymore. The size of the
        generic structure is kept the same.
      
      The third option which meets all the aforementioned requirements is used
      to replace kzalloc() for the PMU specific data allocation. A later patch
      will remove the kzalloc() method and the related variables.
      Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/1593780569-62993-17-git-send-email-kan.liang@linux.intel.com
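The third option above (fix the object size and alignment once, then allocate aligned objects on demand) can be imitated in user space to see what the alignment guarantee buys. This is not the kernel's kmem_cache API: struct obj_cache and its helpers are hypothetical, and C11 aligned_alloc() is swapped in as the allocator purely for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical user-space stand-in for a cache with a fixed size and alignment. */
struct obj_cache {
    size_t size;    /* object size, rounded up to a multiple of align */
    size_t align;   /* power of two */
};

/* "Boot time": the PMU code decides size and alignment once. */
struct obj_cache obj_cache_create(size_t size, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    /* aligned_alloc() expects the size to be a multiple of the alignment. */
    struct obj_cache c = { (size + align - 1) & ~(align - 1), align };
    return c;
}

/* "Run time": generic code allocates a zeroed, correctly aligned object. */
void *obj_cache_zalloc(const struct obj_cache *c)
{
    void *p = aligned_alloc(c->align, c->size);

    if (p)
        memset(p, 0, c->size);
    return p;
}

void obj_cache_free(void *p)
{
    free(p);
}
```

The generic caller only needs the cache handle. It does not have to know that the buffer is destined for an instruction that faults on a misaligned operand, which is the separation the commit is after.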
    • perf/core: Factor out functions to allocate/free the task_ctx_data · ff9ff926
      Kan Liang authored
      The method to allocate/free the task_ctx_data is going to be changed in
      the following patch. Currently, the task_ctx_data is allocated/freed in
      several different places. To avoid repeatedly modifying the same code
      in several different places, alloc_task_ctx_data() and
      free_task_ctx_data() are factored out to allocate/free the
      task_ctx_data. The modification then only needs to be applied once.
      Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/1593780569-62993-16-git-send-email-kan.liang@linux.intel.com
    • perf/x86/intel/lbr: Support Architectural LBR · 47125db2
      Kan Liang authored
      Last Branch Records (LBR) enables recording of software path history by
      logging taken branches and other control flows within architectural
      registers. Intel CPUs have had model-specific LBR for quite some
      time, but this evolves them into an architectural feature.
      
      The main improvements of Architectural LBR implemented here include:
      - Linux kernel can support the LBR features without knowing the model
        number of the current CPU.
      - Architectural LBR capabilities can be enumerated by CPUID. The
        lbr_ctl_map is based on the CPUID Enumeration.
      - The possible LBR depth can be retrieved from CPUID enumeration. The
        max value is written to the new MSR_ARCH_LBR_DEPTH as the number of
        LBR entries.
      - A new IA32_LBR_CTL MSR is introduced to enable and configure LBRs,
        which replaces the IA32_DEBUGCTL[bit 0] and the LBR_SELECT MSR.
      - Each LBR record or entry still comprises three MSRs,
        IA32_LBR_x_FROM_IP, IA32_LBR_x_TO_IP and IA32_LBR_x_INFO,
        but they are now architectural MSRs.
      - Architectural LBR is stack-like now. Entry 0 is always the youngest
        branch, entry 1 the next youngest... The TOS MSR has been removed.
      
      The way to enable/disable Architectural LBR is similar to the previous
      model-specific LBR. __intel_pmu_lbr_enable/disable() can be reused, but
      some modifications are required, which include:
      - MSR_ARCH_LBR_CTL is used to enable and configure the Architectural
        LBR.
      - When checking the value of the IA32_DEBUGCTL MSR, ignore the
        DEBUGCTLMSR_LBR (bit 0) for Architectural LBR, which has no meaning
        and always returns 0.
      - The FREEZE_LBRS_ON_PMI has to be explicitly set/cleared, because
        MSR_IA32_DEBUGCTLMSR is not touched in __intel_pmu_lbr_disable() for
        Architectural LBR.
      - Only MSR_ARCH_LBR_CTL is cleared in __intel_pmu_lbr_disable() for
        Architectural LBR.
      
      Some Architectural LBR dedicated functions are implemented to
      reset/read/save/restore LBR.
      - For reset, writing to the ARCH_LBR_DEPTH MSR clears all Arch LBR
        entries, which is a lot faster and can improve the context switch
        latency.
      - For read, the branch type information can be retrieved from
        the MSR_ARCH_LBR_INFO_*. But it's not fully compatible due to the
        OTHER_BRANCH type. Software decoding is still required for the
        OTHER_BRANCH case.
        LBR records are stored in age order as well, so
        intel_pmu_store_lbr() is reused. Check the CPUID enumeration before
        accessing the corresponding bits in LBR_INFO.
      - For save/restore, apply the fast reset (writing ARCH_LBR_DEPTH).
        Read 'lbr_from' of entry 0 instead of the TOS MSR to check whether
        the LBR registers were reset in the deep C-state. If 'the deep
        C-state reset' bit is not set in CPUID enumeration, skip the check.
        XSAVE support for Architectural LBR will be implemented later.
      
      The number of LBR entries cannot be hardcoded anymore; it has to be
      retrieved from CPUID enumeration. A new structure
      x86_perf_task_context_arch_lbr is introduced for Architectural LBR.
      Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/1593780569-62993-15-git-send-email-kan.liang@linux.intel.com
    • perf/x86/intel/lbr: Factor out intel_pmu_store_lbr · 631618a0
      Kan Liang authored
      The way to store the LBR information from a PEBS LBR record can be
      reused in Architecture LBR, because
      - The LBR information is stored like a stack. Entry 0 is always the
        youngest branch.
      - The layout of the LBR INFO MSR is similar.
      
      The LBR information may be retrieved from either the LBR registers
      (non-PEBS event) or a buffer (PEBS event). Extend rdlbr_*() to support
      both methods.
      
      Explicitly check for an invalid entry (all 0s), which avoids unnecessary
      MSR accesses when using a non-PEBS event. For a PEBS event, the check
      should slightly improve the performance as well. The invalid entries are
      cut, so intel_pmu_lbr_filter() doesn't need to check and filter them out.
      
      The function cannot be shared with the current model-specific LBR read,
      because the direction of the LBR growth is opposite.
      Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/1593780569-62993-14-git-send-email-kan.liang@linux.intel.com
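Because the records are stack-like (entry 0 is the youngest) and a zero 'from' marks an invalid entry, the store loop can stop at the first invalid entry instead of scanning the full depth. A minimal sketch of that idea follows, operating on an in-memory array such as a PEBS buffer. The function name store_lbr() and struct branch are hypothetical; the sketch also assumes, as the early exit implies, that no valid entry follows an invalid one.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Unified record format: FROM, TO and INFO. */
struct lbr_entry {
    uint64_t from, to, info;
};

/* Simplified output record for this sketch. */
struct branch {
    uint64_t from, to;
};

/*
 * Copy valid records in age order (youngest first) and return how many
 * were stored. The first entry with from == 0 ends the walk, so the
 * remaining entries are never touched.
 */
size_t store_lbr(const struct lbr_entry *lbr, size_t nr, struct branch *out)
{
    size_t i;

    for (i = 0; i < nr; i++) {
        if (!lbr[i].from)
            break;              /* invalid entry: cut here */
        out[i].from = lbr[i].from;
        out[i].to = lbr[i].to;
    }
    return i;
}
```

When the source is a set of MSRs rather than a buffer, each skipped entry is a skipped register read, which is where the saving for non-PEBS events comes from.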
    • perf/x86/intel/lbr: Factor out rdlbr_all() and wrlbr_all() · fda1f99f
      Kan Liang authored
      The previous model-specific LBR and Architecture LBR (legacy way) use a
      similar method to save/restore the LBR information, which directly
      accesses the LBR registers. The code which reads/writes a set of LBR
      registers can be shared between them.
      
      Factor out two functions which are used to read/write a set of LBR
      registers.
      
      Add lbr_info into structure x86_pmu, and use it to replace the hardcoded
      LBR INFO MSR, because the LBR INFO MSR address of the previous
      model-specific LBR is different from Architecture LBR. The MSR address
      should be assigned at boot time. For now, only Sky Lake and later
      platforms have the LBR INFO MSR.
      Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/1593780569-62993-13-git-send-email-kan.liang@linux.intel.com
    • perf/x86/intel/lbr: Mark the {rd,wr}lbr_{to,from} wrappers __always_inline · 020d91e5
      Kan Liang authored
      The {rd,wr}lbr_{to,from} wrappers are invoked in hot paths, e.g. context
      switch and NMI handler. They should always be inlined to achieve better
      performance. However, CONFIG_OPTIMIZE_INLINING allows the compiler
      to uninline functions marked 'inline'.
      
      Mark the {rd,wr}lbr_{to,from} wrappers as __always_inline to force
      the compiler to inline them.
      Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/1593780569-62993-12-git-send-email-kan.liang@linux.intel.com
    • perf/x86/intel/lbr: Unify the stored format of LBR information · 5624986d
      Kan Liang authored
      Current LBR information in the structure x86_perf_task_context is stored
      in a different format from the PEBS LBR record and Architecture LBR,
      which prevents the sharing of common code.
      
      Use the format of the PEBS LBR record as a unified format. Use a generic
      name lbr_entry to replace pebs_lbr_entry.
      Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/1593780569-62993-11-git-send-email-kan.liang@linux.intel.com
    • perf/x86/intel/lbr: Support LBR_CTL · 49d8184f
      Kan Liang authored
      An IA32_LBR_CTL MSR is introduced for Architecture LBR to enable and
      configure LBR registers, replacing the previous LBR_SELECT.
      
      All the related members in struct cpu_hw_events and struct x86_pmu
      have to be renamed.
      
      Some new macros are added to reflect the layout of LBR_CTL.
      
      The mapping from PERF_SAMPLE_BRANCH_* to the corresponding bits in
      LBR_CTL MSR is saved in lbr_ctl_map now, which is not a const value.
      The value relies on the CPUID enumeration.
      
      For the previous model-specific LBR, most of the bits in LBR_SELECT
      operate in suppress mode. For the bits in LBR_CTL, the polarity is
      inverted.
      
      For the previous model-specific LBR format 5 (LBR_FORMAT_INFO), if the
      NO_CYCLES and NO_FLAGS types are set, the flag LBR_NO_INFO will be set to
      avoid the unnecessary LBR_INFO MSR read. Although Architecture LBR also
      has a dedicated LBR_INFO MSR, perf doesn't need to check and set the
      flag LBR_NO_INFO. For Architecture LBR, the XSAVES instruction will be
      used as the default way to read the LBR MSRs all together. The overhead
      which the flag tries to avoid doesn't exist anymore. Dropping the flag
      saves the extra check for the flag in the lbr_read() later, and makes
      the code cleaner.
      Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/1593780569-62993-10-git-send-email-kan.liang@linux.intel.com
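The polarity inversion mentioned above is easy to get wrong, so here is a toy model of it. The branch-type bits are hypothetical and do not reflect the real MSR layouts, and both function names are invented for this sketch. The only point is that a suppress-mode register is programmed with the types that are not wanted, while an enable-mode register is programmed with the wanted types, limited to what the CPUID enumeration reports as supported.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical branch-type bits; the real MSR layouts differ. */
#define BR_CALL    (1u << 0)
#define BR_RETURN  (1u << 1)
#define BR_JUMP    (1u << 2)
#define BR_ALL     (BR_CALL | BR_RETURN | BR_JUMP)

/* Suppress mode (LBR_SELECT style): set a bit to filter a type OUT. */
uint32_t lbr_select_value(uint32_t wanted)
{
    return ~wanted & BR_ALL;
}

/* Enable mode (LBR_CTL style): set a bit to capture a type, if supported. */
uint32_t lbr_ctl_value(uint32_t wanted, uint32_t cpuid_supported)
{
    return wanted & cpuid_supported;
}
```

Because the supported set is only known after reading CPUID, a mapping of this kind cannot be a compile-time constant, which matches the remark that lbr_ctl_map is not a const value.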
    • perf/x86: Expose CPUID enumeration bits for arch LBR · af6cf129
      Kan Liang authored
      The LBR capabilities of Architecture LBR are retrieved from the CPUID
      enumeration once at boot time. The capabilities have to be saved for
      future usage.
      
      Several new fields are added into structure x86_pmu to indicate the
      capabilities. The fields will be used in the following patches.
      Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/1593780569-62993-9-git-send-email-kan.liang@linux.intel.com
    • x86/msr-index: Add bunch of MSRs for Arch LBR · d6a162a4
      Kan Liang authored
      Add Arch LBR related MSRs and the new LBR INFO bits in MSR-index.
      Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/1593780569-62993-8-git-send-email-kan.liang@linux.intel.com
    • perf/x86/intel/lbr: Use dynamic data structure for task_ctx · f42be865
      Kan Liang authored
      The type of task_ctx is hardcoded as struct x86_perf_task_context,
      which doesn't apply to Architecture LBR. For example, Architecture LBR
      doesn't have the TOS MSR. The number of LBR entries is variable. A new
      struct will be introduced for Architecture LBR. Perf has to determine
      the type of task_ctx at run time.
      
      The type of task_ctx pointer is changed to 'void *', which will be
      determined at run time.
      
      The generic LBR optimization can be shared between Architecture LBR and
      model-specific LBR. Both need to access the structure for the generic
      LBR optimization. A helper task_context_opt() is introduced to retrieve
      the pointer of the structure at run time.
      Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/1593780569-62993-7-git-send-email-kan.liang@linux.intel.com
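The 'void *' context plus accessor pattern described above can be sketched as follows. The two context layouts and the callstack_users field are simplified, hypothetical stand-ins, and the arch_lbr flag is passed as a parameter here only to keep the example self-contained; only the helper name task_context_opt() is taken from the commit message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Fields shared by the generic optimizations (simplified). */
struct ctx_opt {
    int callstack_users;
};

/* Model-specific layout: fixed number of registers plus a TOS. */
struct ctx_model_lbr {
    uint64_t lbr_from[32];
    uint64_t lbr_to[32];
    int tos;
    struct ctx_opt opt;
};

/* Architectural layout: no TOS, variable number of entries. */
struct ctx_arch_lbr {
    struct ctx_opt opt;
    uint64_t entries[];
};

/* The context is an untyped pointer; pick the right layout at run time. */
struct ctx_opt *task_context_opt(void *ctx, bool arch_lbr)
{
    if (arch_lbr)
        return &((struct ctx_arch_lbr *)ctx)->opt;
    return &((struct ctx_model_lbr *)ctx)->opt;
}
```

Code implementing the shared optimizations then works on struct ctx_opt only and never needs to know which of the two layouts it was handed.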
    • perf/x86/intel/lbr: Factor out a new struct for generic optimization · 530bfff6
      Kan Liang authored
      To reduce the overhead of a context switch with LBR enabled, some
      generic optimizations were introduced, e.g. avoiding restoring the LBRs
      if no one else touched them. The generic optimizations can also be used by
      Architecture LBR later. Currently, the fields for the generic
      optimizations are part of structure x86_perf_task_context, which will be
      deprecated by Architecture LBR. A new structure should be introduced
      for the common fields of generic optimization, which can be shared
      between Architecture LBR and model-specific LBR.
      
      Both 'valid_lbrs' and 'tos' are also used by the generic optimizations,
      but they are not moved into the new structure, because Architecture LBR
      is stack-like. The 'valid_lbrs' which records the index of the valid LBR
      is not required anymore. The TOS MSR will be removed.
      
      LBR registers may be cleared in the deep C-state. If so, the generic
      optimizations should not be applied. Perf has to unconditionally
      restore the LBR registers. A generic function is required to detect the
      reset due to the deep C-state. lbr_is_reset_in_cstate() is introduced.
      Currently, for the model-specific LBR, the TOS MSR is used to detect the
      reset. There will be another method introduced for Architecture LBR
      later.
      Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/1593780569-62993-6-git-send-email-kan.liang@linux.intel.com
    • perf/x86/intel/lbr: Add the function pointers for LBR save and restore · 799571bf
      Kan Liang authored
      The MSRs of Architectural LBR are different from previous model-specific
      LBR. Perf has to implement different functions to save and restore them.
      
      The function pointers for LBR save and restore are introduced. Perf
      should initialize the corresponding functions at boot time.
      
      The generic optimizations, e.g. avoiding restoring the LBRs if no one
      else touched them, still apply for Architectural LBRs. The related code
      is not moved to model-specific functions.
      
      Current model-specific LBR functions are set as default.
      Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/1593780569-62993-5-git-send-email-kan.liang@linux.intel.com
      799571bf
    • Kan Liang's avatar
      perf/x86/intel/lbr: Add a function pointer for LBR read · c301b1d8
      Kan Liang authored
      The method to read Architectural LBRs is different from previous
      model-specific LBR. Perf has to implement a different function.
      
      A function pointer for LBR read is introduced. Perf should initialize
      the corresponding function at boot time, and avoid checking lbr_format
      at run time.
      
      The current 64-bit LBR read function is set as default.
      Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/1593780569-62993-4-git-send-email-kan.liang@linux.intel.com
      c301b1d8
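The pattern in the commit above, choosing the LBR read routine once at boot instead of testing lbr_format on every read, might look roughly like the following user-space sketch. The names `struct lbr_ops`, `lbr_init()`, `lbr_sample()` and the `FMT_*` values are hypothetical, and the readers only return a bit width so the dispatch is observable. The real readers and format constants live in the kernel sources.

```c
/* Hypothetical format identifiers, loosely modelled on LBR_FORMAT_*. */
enum lbr_format { FMT_32 = 0, FMT_64 = 1 };

/* Hypothetical ops table holding the read routine chosen at init time. */
struct lbr_ops {
    int (*read)(void);
};

/* Stand-in readers: each returns the record width it would handle. */
static int lbr_read_32(void) { return 32; }
static int lbr_read_64(void) { return 64; }

/* The 64-bit reader is the default, as the commit message states. */
static struct lbr_ops lbr = { .read = lbr_read_64 };

/* Called once at "boot": the format is inspected here and nowhere else. */
static void lbr_init(enum lbr_format fmt)
{
    if (fmt == FMT_32)
        lbr.read = lbr_read_32;
}

/* Hot path: no format check, just an indirect call. */
static int lbr_sample(void)
{
    return lbr.read();
}
```

The design choice is to trade a per-read branch on the format for a single indirect call, which also gives a later Architectural LBR reader a slot to plug into without touching the hot path.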
    • Kan Liang's avatar
      perf/x86/intel/lbr: Add a function pointer for LBR reset · 9f354a72
      Kan Liang authored
      The method to reset Architectural LBRs is different from previous
      model-specific LBR. Perf has to implement a different function.
      
      A function pointer is introduced for LBR reset. The enum of
      LBR_FORMAT_* is also moved to perf_event.h. Perf should initialize the
      corresponding functions at boot time, and avoid checking lbr_format at
      run time.
      
      The current 64-bit LBR reset function is set as default.
      Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/1593780569-62993-3-git-send-email-kan.liang@linux.intel.com
      9f354a72
    • Kan Liang's avatar
      x86/cpufeatures: Add Architectural LBRs feature bit · bd657aa3
      Kan Liang authored
      CPUID.(EAX=07H, ECX=0):EDX[19] indicates whether an Intel CPU supports
      Architectural LBRs.
      
      The "X86_FEATURE_..., word 18" is already mirrored from CPUID
      "0x00000007:0 (EDX)". Add X86_FEATURE_ARCH_LBR under the "word 18"
      section.
      
      The feature will appear as "arch_lbr" in /proc/cpuinfo.
      
      The Architectural Last Branch Records (LBR) feature enables recording
      of software path history by logging taken branches and other control
      flows. The feature will be supported in the perf_events subsystem.
      Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Dave Hansen <dave.hansen@intel.com>
      Link: https://lkml.kernel.org/r/1593780569-62993-2-git-send-email-kan.liang@linux.intel.com
      bd657aa3
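As a rough user-space illustration of the CPUID bit named in the commit above, the sketch below tests bit 19 of EDX from CPUID leaf 7, subleaf 0. It assumes a GCC or Clang toolchain that provides `__get_cpuid_count()` in `<cpuid.h>`. The helper names `edx_has_arch_lbr()` and `cpu_has_arch_lbr()` are hypothetical, and the kernel itself uses its own cpufeature machinery (X86_FEATURE_ARCH_LBR) rather than code like this.

```c
#include <stdbool.h>
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>
#endif

/* CPUID.(EAX=07H, ECX=0):EDX[19], as stated in the commit message. */
#define ARCH_LBR_EDX_BIT 19

/* Pure helper: does this EDX value advertise Architectural LBRs? */
static bool edx_has_arch_lbr(uint32_t edx)
{
    return (edx >> ARCH_LBR_EDX_BIT) & 1u;
}

/* Queries the running CPU; reports false on non-x86 builds
 * or when leaf 7 is not available. */
static bool cpu_has_arch_lbr(void)
{
#if defined(__x86_64__) || defined(__i386__)
    unsigned int eax, ebx, ecx, edx;

    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
        return false;
    return edx_has_arch_lbr(edx);
#else
    return false;
#endif
}
```

On a kernel that carries this commit, the same information is expected to be visible as the "arch_lbr" flag in /proc/cpuinfo, which is usually the simpler check from user space.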
  2. 02 Jul, 2020 6 commits
  3. 28 Jun, 2020 11 commits
    • Linus Torvalds's avatar
      Linux 5.8-rc3 · 9ebcfadb
      Linus Torvalds authored
      9ebcfadb
    • Linus Torvalds's avatar
      Merge tag 'arm-omap-fixes-5.8-1' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc · f7db192b
      Linus Torvalds authored
      Pull ARM OMAP fixes from Arnd Bergmann:
       "The OMAP developers are particularly active at hunting down
        regressions, so this is a separate branch with OMAP specific
        fixes for v5.8:
      
        As Tony explains
          "The recent display subsystem (DSS) related platform data changes
           caused display related regressions for suspend and resume. Looks
           like I only tested suspend and resume before dropping the legacy
           platform data, and forgot to test it after dropping it. Turns out
           the main issue was that we no longer have platform code calling
           pm_runtime_suspend for DSS like we did for the legacy platform data
           case, and that fix is still being discussed on the dri-devel list
           and will get merged separately. The DSS related testing exposed a
           pile of other display related issues that also need fixing
           though":
      
         - Fix ti-sysc optional clock handling and reset status checks for
           devices that reset automatically in idle like DSS
      
         - Ignore ti-sysc clockactivity bit unless separately requested to
           avoid unexpected performance issues
      
         - Init ti-sysc framedonetv_irq to true and disable for am4
      
         - Avoid duplicate DSS reset for legacy mode with dts data
      
         - Remove LCD timings for am4 as they cause warnings now that we're
           using generic panels
      
        Other OMAP changes from Tony include:
      
         - Fix omap_prm reset deassert as we still have drivers setting the
           pm_runtime_irq_safe() flag
      
         - Flush posted write for ti-sysc enable and disable
      
         - Fix droid4 spi related errors with spi flags
      
         - Fix am335x USB range and a typo for softreset
      
         - Fix dra7 timer nodes for clocks for IPU and DSP
      
         - Drop duplicate mailboxes after mismerge for dra7
      
         - Prevent pocketbeagle header line signal from accidentally setting
           micro-SD write protection signal by removing the default mux
      
         - Fix NFSroot flakiness after resume for duovero by switching the
           smsc911x gpio interrupt back to level sensitive
      
         - Fix regression for omap4 clockevent source after recent system
           timer changes
      
         - Yet another ethernet regression fix for the "rgmii" vs "rgmii-rxid"
           phy-mode
      
         - One patch to convert am3/am4 DT files to use the regular sdhci-omap
           driver instead of the old hsmmc driver, this was meant for the
           merge window but got lost in the process"
      
      * tag 'arm-omap-fixes-5.8-1' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc: (21 commits)
        ARM: dts: am5729: beaglebone-ai: fix rgmii phy-mode
        ARM: dts: Fix omap4 system timer source clocks
        ARM: dts: Fix duovero smsc interrupt for suspend
        ARM: dts: am335x-pocketbeagle: Fix mmc0 Write Protect
        Revert "bus: ti-sysc: Increase max softreset wait"
        ARM: dts: am437x-epos-evm: remove lcd timings
        ARM: dts: am437x-gp-evm: remove lcd timings
        ARM: dts: am437x-sk-evm: remove lcd timings
        ARM: dts: dra7-evm-common: Fix duplicate mailbox nodes
        ARM: dts: dra7: Fix timer nodes properly for timer_sys_ck clocks
        ARM: dts: Fix am33xx.dtsi ti,sysc-mask wrong softreset flag
        ARM: dts: Fix am33xx.dtsi USB ranges length
        bus: ti-sysc: Increase max softreset wait
        ARM: OMAP2+: Fix legacy mode dss_reset
        bus: ti-sysc: Fix uninitialized framedonetv_irq
        bus: ti-sysc: Ignore clockactivity unless specified as a quirk
        bus: ti-sysc: Use optional clocks on for enable and wait for softreset bit
        ARM: dts: omap4-droid4: Fix spi configuration and increase rate
        bus: ti-sysc: Flush posted write on enable and disable
        soc: ti: omap-prm: use atomic iopoll instead of sleeping one
        ...
      f7db192b
    • Linus Torvalds's avatar
      Merge tag 'arm-fixes-5.8-1' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc · e44b59cd
      Linus Torvalds authored
      Pull ARM SoC fixes from Arnd Bergmann:
       "Here are a couple of bug fixes, mostly for devicetree files
      
        NXP i.MX:
         - Use correct voltage on some i.MX8M board device trees to avoid
           hardware damage
         - Code fixes for a compiler warning and incorrect reference counting,
           both harmless.
         - Fix the i.MX8M SoC driver to correctly identify imx8mp
         - Fix watchdog configuration in imx6ul-kontron device tree.
      
        Broadcom:
         - A small regression fix for the Raspberry-Pi firmware driver
         - A Kconfig change to use the correct timer driver on Northstar
         - A DT fix for the Luxul XWC-2000 machine
         - Two more DT fixes for NSP SoCs
      
        STMicroelectronics STI:
         - Revert one broken patch for L2 cache configuration
      
        ARM Versatile Express:
         - Fix a regression by reverting a broken DT cleanup
      
        TEE drivers:
         - MAINTAINERS: change tee mailing list"
      
      * tag 'arm-fixes-5.8-1' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc:
        Revert "ARM: sti: Implement dummy L2 cache's write_sec"
        soc: imx8m: fix build warning
        ARM: imx6: add missing put_device() call in imx6q_suspend_init()
        ARM: imx5: add missing put_device() call in imx_suspend_alloc_ocram()
        soc: imx8m: Correct i.MX8MP UID fuse offset
        ARM: dts: imx6ul-kontron: Change WDOG_ANY signal from push-pull to open-drain
        ARM: dts: imx6ul-kontron: Move watchdog from Kontron i.MX6UL/ULL board to SoM
        arm64: dts: imx8mm-beacon: Fix voltages on LDO1 and LDO2
        arm64: dts: imx8mn-ddr4-evk: correct ldo1/ldo2 voltage range
        arm64: dts: imx8mm-evk: correct ldo1/ldo2 voltage range
        ARM: dts: NSP: Correct FA2 mailbox node
        ARM: bcm2835: Fix integer overflow in rpi_firmware_print_firmware_revision()
        MAINTAINERS: change tee mailing list
        ARM: dts: NSP: Disable PL330 by default, add dma-coherent property
        ARM: bcm: Select ARM_TIMER_SP804 for ARCH_BCM_NSP
        ARM: dts: BCM5301X: Add missing memory "device_type" for Luxul XWC-2000
        arm: dts: vexpress: Move mcc node back into motherboard node
      e44b59cd
    • Linus Torvalds's avatar
      Merge tag 'timers-urgent-2020-06-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 668f532d
      Linus Torvalds authored
      Pull timer fix from Ingo Molnar:
       "A single DocBook fix"
      
      * tag 'timers-urgent-2020-06-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        timekeeping: Fix kerneldoc system_device_crosststamp & al
      668f532d
    • Linus Torvalds's avatar
      Merge tag 'perf-urgent-2020-06-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · ae71d4bf
      Linus Torvalds authored
      Pull perf fix from Ingo Molnar:
       "A single Kbuild dependency fix"
      
      * tag 'perf-urgent-2020-06-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        perf/x86/rapl: Fix RAPL config variable bug
      ae71d4bf
    • Linus Torvalds's avatar
      Merge tag 'efi-urgent-2020-06-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · bc53f67d
      Linus Torvalds authored
      Pull EFI fixes from Ingo Molnar:
      
       - Fix build regression with GCC v4.8 and older
      
       - Robustness fix for TPM log parsing code
      
       - kobject refcount fix for the ESRT parsing code
      
       - Two efivarfs fixes to make it behave more like an ordinary file
         system
      
       - Style fixup for zero length arrays
      
       - Fix a regression in path separator handling in the initrd loader
      
       - Fix a missing prototype warning
      
       - Add some kerneldoc headers for newly introduced stub routines
      
       - Allow support for SSDT overrides via EFI variables to be disabled
      
       - Report CPU mode and MMU state upon entry for 32-bit ARM
      
       - Use the correct stack pointer alignment when entering from mixed mode
      
      * tag 'efi-urgent-2020-06-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        efi/libstub: arm: Print CPU boot mode and MMU state at boot
        efi/libstub: arm: Omit arch specific config table matching array on arm64
        efi/x86: Setup stack correctly for efi_pe_entry
        efi: Make it possible to disable efivar_ssdt entirely
        efi/libstub: Descriptions for stub helper functions
        efi/libstub: Fix path separator regression
        efi/libstub: Fix missing-prototype warning for skip_spaces()
        efi: Replace zero-length array and use struct_size() helper
        efivarfs: Don't return -EINTR when rate-limiting reads
        efivarfs: Update inode modification time for successful writes
        efi/esrt: Fix reference count leak in esre_create_sysfs_entry.
        efi/tpm: Verify event log header before parsing
        efi/x86: Fix build with gcc 4
      bc53f67d
    • Linus Torvalds's avatar
      Merge tag 'sched_urgent_for_5.8_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 91a9a90d
      Linus Torvalds authored
      Pull scheduler fixes from Borislav Petkov:
       "The most anticipated fix in this pull request is probably the horrible
        build fix for the RANDSTRUCT fail that didn't make -rc2. Also included
        is the cleanup that removes those BUILD_BUG_ON()s and replaces them with
        ugly unions.
      
        Also included is the try_to_wake_up() race fix that was first
        triggered by Paul's RCU-torture runs, but was independently hit by
        Dave Chinner's fstest runs as well"
      
      * tag 'sched_urgent_for_5.8_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        sched/cfs: change initial value of runnable_avg
        smp, irq_work: Continue smp_call_function*() and irq_work*() integration
        sched/core: s/WF_ON_RQ/WQ_ON_CPU/
        sched/core: Fix ttwu() race
        sched/core: Fix PI boosting between RT and DEADLINE tasks
        sched/deadline: Initialize ->dl_boosted
        sched/core: Check cpus_mask, not cpus_ptr in __set_cpus_allowed_ptr(), to fix mask corruption
        sched/core: Fix CONFIG_GCC_PLUGIN_RANDSTRUCT build fail
      91a9a90d
    • Linus Torvalds's avatar
      Merge tag 'x86_urgent_for_5.8_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 098c7938
      Linus Torvalds authored
      Pull x86 fixes from Borislav Petkov:
      
       - AMD Memory bandwidth counter width fix, by Babu Moger.
      
       - Use the proper length type in the 32-bit truncate() syscall variant,
         by Jiri Slaby.
      
       - Reinit IA32_FEAT_CTL during wakeup to fix the case where after
         resume, VMXON would #GP due to VMX not being properly enabled, by
         Sean Christopherson.
      
       - Fix a static checker warning in the resctrl code, by Dan Carpenter.
      
       - Add a CR4 pinning mask for bits which cannot change after boot, by
         Kees Cook.
      
       - Align the start of the loop of __clear_user() to 16 bytes, to improve
         performance on AMD zen1 and zen2 microarchitectures, by Matt Fleming.
      
      * tag 'x86_urgent_for_5.8_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        x86/asm/64: Align start of __clear_user() loop to 16-bytes
        x86/cpu: Use pinning mask for CR4 bits needing to be 0
        x86/resctrl: Fix a NULL vs IS_ERR() static checker warning in rdt_cdp_peer_get()
        x86/cpu: Reinitialize IA32_FEAT_CTL MSR on BSP during wakeup
        syscalls: Fix offset type of ksys_ftruncate()
        x86/resctrl: Fix memory bandwidth counter width for AMD
      098c7938
    • Linus Torvalds's avatar
      Merge tag 'rcu_urgent_for_5.8_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · c141b30e
      Linus Torvalds authored
      Pull RCU-vs-KCSAN fixes from Borislav Petkov:
       "A single commit that uses "arch_" atomic operations to avoid the
        instrumentation that comes with the non-"arch_" versions.
      
        In preparation for that commit, it also has another commit that makes
        these "arch_" atomic operations available to generic code.
      
        Without these commits, KCSAN users can see pointless errors"
      
      * tag 'rcu_urgent_for_5.8_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        rcu: Fixup noinstr warnings
        locking/atomics: Provide the arch_atomic_ interface to generic code
      c141b30e
    • Linus Torvalds's avatar
      Merge tag 'objtool_urgent_for_5.8_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 7ecb59a5
      Linus Torvalds authored
      Pull objtool fixes from Borislav Petkov:
       "Three fixes from Peter Zijlstra suppressing KCOV instrumentation in
        noinstr sections.
      
        Peter Zijlstra says:
          "Address KCOV vs noinstr. There is no function attribute to
           selectively suppress KCOV instrumentation, instead teach objtool
           to NOP out the calls in noinstr functions"
      
        This cures a bunch of KCOV crashes (as used by syzkaller)"
      
      * tag 'objtool_urgent_for_5.8_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        objtool: Fix noinstr vs KCOV
        objtool: Provide elf_write_{insn,reloc}()
        objtool: Clean up elf_write() condition
      7ecb59a5
    • Linus Torvalds's avatar
      Merge tag 'x86_entry_for_5.8' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · a358505d
      Linus Torvalds authored
      Pull x86 entry fixes from Borislav Petkov:
       "This is the x86/entry urgent pile which has accumulated since the
        merge window.
      
        It is not the smallest but considering the almost complete entry core
        rewrite, the amount of fixes to follow is somewhat higher than usual,
        which is to be expected.
      
        Peter Zijlstra says:
         'These patches address a number of instrumentation issues that were
          found after the x86/entry overhaul. When combined with rcu/urgent
          and objtool/urgent, these patches make UBSAN/KASAN/KCSAN happy
          again.
      
          Part of making this all work is bumping the minimum GCC version for
          KASAN builds to gcc-8.3, the reason for this is that the
          __no_sanitize_address function attribute is broken in GCC releases
          before that.
      
          No known GCC version has a working __no_sanitize_undefined, however
          because the only noinstr violation that results from this happens
          when an UB is found, we treat it like WARN. That is, we allow it to
          violate the noinstr rules in order to get the warning out'"
      
      * tag 'x86_entry_for_5.8' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        x86/entry: Fix #UD vs WARN more
        x86/entry: Increase entry_stack size to a full page
        x86/entry: Fixup bad_iret vs noinstr
        objtool: Don't consider vmlinux a C-file
        kasan: Fix required compiler version
        compiler_attributes.h: Support no_sanitize_undefined check with GCC 4
        x86/entry, bug: Comment the instrumentation_begin() usage for WARN()
        x86/entry, ubsan, objtool: Whitelist __ubsan_handle_*()
        x86/entry, cpumask: Provide non-instrumented variant of cpu_is_offline()
        compiler_types.h: Add __no_sanitize_{address,undefined} to noinstr
        kasan: Bump required compiler version
        x86, kcsan: Add __no_kcsan to noinstr
        kcsan: Remove __no_kcsan_or_inline
        x86, kcsan: Remove __no_kcsan_or_inline usage
      a358505d