  1. 16 Mar, 2021 3 commits
  2. 06 Mar, 2021 1 commit
  3. 26 Feb, 2021 1 commit
  4. 17 Feb, 2021 1 commit
  5. 01 Feb, 2021 1 commit
    • perf/core: Add PERF_SAMPLE_WEIGHT_STRUCT · 2a6c6b7d
      Kan Liang authored
      The current PERF_SAMPLE_WEIGHT sample type is very useful to express the
      cost of an action represented by the sample. This allows the profiler
      to scale the samples to be more informative to the programmer. It can
      also help to locate a hotspot, e.g., when profiling by memory latencies,
      the expensive loads appear higher up in the histograms. But the current
      PERF_SAMPLE_WEIGHT sample type is solely determined by one factor. This
      could be a problem if users want two or more factors to contribute to
      the weight. For example, the Golden Cove core PMU can provide both the
      instruction latency and the cache latency information as factors for
      memory profiling.
      
      For current X86 platforms, although meminfo::latency is defined as a
      u64, only the lower 32 bits contain valid data in practice (no memory
      access could last longer than 4G cycles). The upper 32 bits can be used
      to store new factors.
      
      Add a new sample type, PERF_SAMPLE_WEIGHT_STRUCT, to indicate the new
      sample weight structure. It shares the same space as the
      PERF_SAMPLE_WEIGHT sample type.
      
      Users can apply either the PERF_SAMPLE_WEIGHT sample type or the
      PERF_SAMPLE_WEIGHT_STRUCT sample type to retrieve the sample weight, but
      they cannot apply both sample types simultaneously.
      
      Currently, only X86 and PowerPC use the PERF_SAMPLE_WEIGHT sample type.
      - For PowerPC, nothing changes for the PERF_SAMPLE_WEIGHT sample type,
        and the new PERF_SAMPLE_WEIGHT_STRUCT sample type has no effect.
        PowerPC can restructure the weight field in a similar way later.
      - For X86, the same value will be dumped for the PERF_SAMPLE_WEIGHT
        sample type or the PERF_SAMPLE_WEIGHT_STRUCT sample type for now.
        The following patches will apply the new factors for the
        PERF_SAMPLE_WEIGHT_STRUCT sample type.
      
      The fields in the union perf_sample_weight should be shared among
      different architectures. A generic name is required, but it's hard to
      abstract a name that applies to all architectures. For example, on X86
      the fields store various kinds of latency, while on PowerPC they store
      MMCRA[TECX/TECM], which is not a latency. So the general name prefix
      'var$NUM' is used here.
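
      For reference, a minimal sketch of how the shared union could be laid
      out under the 'var$NUM' naming (the exact field widths and the endian
      handling shown here are illustrative, not an authoritative uapi
      definition):

        union perf_sample_weight {
                __u64           full;           /* PERF_SAMPLE_WEIGHT view  */
        #if defined(__LITTLE_ENDIAN_BITFIELD)
                struct {
                        __u32   var1_dw;        /* e.g. memory load latency */
                        __u16   var2_w;         /* e.g. instruction latency */
                        __u16   var3_w;         /* spare                    */
                };
        #elif defined(__BIG_ENDIAN_BITFIELD)
                struct {
                        __u16   var3_w;
                        __u16   var2_w;
                        __u32   var1_dw;
                };
        #endif
        };
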
      Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/1611873611-156687-2-git-send-email-kan.liang@linux.intel.com
  6. 15 Jan, 2021 1 commit
  7. 10 Dec, 2020 1 commit
    • exec: Transform exec_update_mutex into a rw_semaphore · f7cfd871
      Eric W. Biederman authored
      Recently syzbot reported[0] that there is a deadlock amongst the users
      of exec_update_mutex.  The problematic lock ordering found by lockdep
      was:
      
         perf_event_open  (exec_update_mutex -> ovl_i_mutex)
         chown            (ovl_i_mutex       -> sb_writes)
         sendfile         (sb_writes         -> p->lock)
           by reading from a proc file and writing to overlayfs
         proc_pid_syscall (p->lock           -> exec_update_mutex)
      
      While looking at possible solutions it occurred to me that all of the
      users and possible users involved only wanted the state of the given
      process to remain the same.  They are all readers.  The only writer is
      exec.
      
      There is no reason for readers to block on each other.  So fix
      this deadlock by transforming exec_update_mutex into a rw_semaphore
      named exec_update_lock that only exec takes for writing.
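
      As a rough illustration of the resulting locking pattern (a minimal
      sketch; in the real kernel the rw_semaphore lives in the task's
      signal_struct rather than being a global, and the function names below
      are placeholders):

        static DECLARE_RWSEM(exec_update_lock);

        /* exec is the only writer: it is about to change the task's state. */
        static void exec_side_example(void)
        {
                down_write(&exec_update_lock);
                /* ... swap mm, creds, etc. ... */
                up_write(&exec_update_lock);
        }

        /* perf_event_open(), proc_pid_syscall(), ... are readers: they only
         * need the task state to stay stable, so they no longer block each
         * other. */
        static void reader_side_example(void)
        {
                down_read(&exec_update_lock);
                /* ... inspect the task ... */
                up_read(&exec_update_lock);
        }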
      
      Cc: Jann Horn <jannh@google.com>
      Cc: Vasiliy Kulikov <segoon@openwall.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Bernd Edlinger <bernd.edlinger@hotmail.de>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Christopher Yeoh <cyeoh@au1.ibm.com>
      Cc: Cyrill Gorcunov <gorcunov@gmail.com>
      Cc: Sargun Dhillon <sargun@sargun.me>
      Cc: Christian Brauner <christian.brauner@ubuntu.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Fixes: eea96732 ("exec: Add exec_update_mutex to replace cred_guard_mutex")
      [0] https://lkml.kernel.org/r/00000000000063640c05ade8e3de@google.com
      Reported-by: syzbot+db9cdf3dd1f64252c6ef@syzkaller.appspotmail.com
      Link: https://lkml.kernel.org/r/87ft4mbqen.fsf@x220.int.ebiederm.org
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
  8. 09 Dec, 2020 1 commit
    • perf: Break deadlock involving exec_update_mutex · 78af4dc9
      peterz@infradead.org authored
      Syzbot reported a lock inversion involving perf. The sore point is perf
      holding exec_update_mutex for a very long time, specifically across a
      whole bunch of filesystem ops in pmu::event_init() (uprobes) and in
      anon_inode_getfile().
      
      This then inverts against procfs code trying to take
      exec_update_mutex.
      
      Move the permission checks later, such that we need to hold the mutex
      over less code.
      
      Reported-by: syzbot+db9cdf3dd1f64252c6ef@syzkaller.appspotmail.com
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
  9. 03 Dec, 2020 1 commit
  10. 09 Nov, 2020 6 commits
  11. 07 Nov, 2020 1 commit
    • perf/core: Fix a memory leak in perf_event_parse_addr_filter() · 7bdb157c
      kiyin(尹亮) authored
      As shown through runtime testing, the "filename" allocation is not
      always freed in perf_event_parse_addr_filter().
      
      There are three possible ways that this could happen:
      
       - It could be allocated twice on subsequent iterations through the loop,
       - or leaked on the success path,
       - or on the failure path.
      
      Clean up the code flow to make it obvious that 'filename' is always
      freed in the reallocation path and in the two return paths as well.
      
      We rely on the fact that kfree(NULL) is a NOP and that filename is
      initialized to NULL.
      
      This fixes the leak. No other side effects expected.
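
      A minimal sketch of the resulting pattern (schematic, not the literal
      kernel code; more_tokens()/next_token() are placeholders for the real
      filter parsing):

        char *filename = NULL;
        int ret = -EINVAL;

        while (more_tokens()) {
                kfree(filename);        /* kfree(NULL) is a no-op on pass 1 */
                filename = kstrdup(next_token(), GFP_KERNEL);
                if (!filename) {
                        ret = -ENOMEM;
                        goto fail;
                }
                /* ... resolve the path and build the filter ... */
        }
        ret = 0;
        fail:
        kfree(filename);                /* covers both return paths */
        return ret;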
      
      [ Dan Carpenter: cleaned up the code flow & added a changelog. ]
      [ Ingo Molnar: updated the changelog some more. ]
      
      Fixes: 375637bc ("perf/core: Introduce address range filtering")
      Signed-off-by: "kiyin(尹亮)" <kiyin@tencent.com>
      Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Cc: "Srivatsa S. Bhat" <srivatsa@csail.mit.edu>
      Cc: Anthony Liguori <aliguori@amazon.com>
      --
       kernel/events/core.c | 12 +++++-------
       1 file changed, 5 insertions(+), 7 deletions(-)
  12. 29 Oct, 2020 3 commits
    • perf,mm: Handle non-page-table-aligned hugetlbfs · 51b646b2
      Peter Zijlstra authored
      A limited number of architectures support hugetlbfs sizes that do not
      align with the page tables (ARM64, Power, Sparc64). Add support for
      this to the generic perf_get_page_size() implementation, and also
      allow an architecture to override this implementation.

      The latter is only needed when an architecture uses non-page-table-
      aligned huge pages in its kernel map.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    • perf/core: Add support for PERF_SAMPLE_CODE_PAGE_SIZE · 995f088e
      Stephane Eranian authored
      When studying code layout, it is useful to capture the page size of the
      sampled code address.
      
      Add a new sample type for code page size.
      The new sample type requires collecting the ip. The code page size can
      be calculated from the NMI-safe perf_get_page_size().
      
      For large PEBS, it's very unlikely that the mapping is gone for the
      earlier PEBS records. Enable the feature for large PEBS as well. The
      worst case is that a page size of '0' is returned.
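
      For illustration, a user-space profiler could request the new sample
      type roughly like this (a sketch; the event choice and period are
      arbitrary):

        #include <linux/perf_event.h>
        #include <string.h>

        static void setup_attr(struct perf_event_attr *attr)
        {
                memset(attr, 0, sizeof(*attr));
                attr->size          = sizeof(*attr);
                attr->type          = PERF_TYPE_HARDWARE;
                attr->config        = PERF_COUNT_HW_CPU_CYCLES;
                attr->sample_period = 100000;
                /* The code page size is resolved from the sampled IP, so
                 * PERF_SAMPLE_IP must be requested alongside it. */
                attr->sample_type   = PERF_SAMPLE_IP |
                                      PERF_SAMPLE_CODE_PAGE_SIZE;
        }
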
      Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
      Signed-off-by: Stephane Eranian <eranian@google.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/20201001135749.2804-5-kan.liang@linux.intel.com
    • perf/core: Add PERF_SAMPLE_DATA_PAGE_SIZE · 8d97e718
      Kan Liang authored
      Current perf can report both virtual addresses and physical addresses,
      but not the MMU page size. Without the MMU page size information of the
      utilized page, users cannot decide whether to promote/demote large pages
      to optimize memory usage.
      
      Add a new sample type for the data MMU page size.
      
      Current perf already has a facility to collect data virtual addresses.
      A page walker is required to walk the page tables and calculate the
      MMU page size from a given virtual address.

      On some platforms, e.g., X86, the page walker is invoked in an NMI
      handler. So the page walker must be NMI-safe and low overhead. Besides,
      the page walker should work for both user and kernel virtual addresses.
      The existing generic page walker, e.g., walk_page_range_novma(), is a
      little bit complex and doesn't guarantee NMI-safety, and follow_page()
      only works for user virtual addresses.
      
      Add a new function perf_get_page_size() to walk the page tables and
      calculate the MMU page size (see the sketch after this list). In the
      function:
      - Interrupts have to be disabled to prevent any teardown of the page
        tables.
      - For user space threads, current->mm is used for the page walker.
        For kernel threads and the like, current->mm is NULL, so init_mm is
        used for the page walker instead. The active_mm is not used here,
        because it can be NULL.
        Quote from Peter Zijlstra,
        "context_switch() can set prev->active_mm to NULL when it transfers it
         to @next. It does this before @current is updated. So an NMI that
         comes in between this active_mm swizzling and updating @current will
         see !active_mm."
      - The MMU page size is calculated from the page table level.
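
      A simplified sketch of that walk (not the literal kernel code: the
      PTE-level check is omitted and the exact leaf/none predicates vary by
      kernel version and architecture):

        static u64 perf_get_page_size_sketch(unsigned long addr)
        {
                struct mm_struct *mm;
                unsigned long flags;
                u64 size = 0;
                pgd_t *pgd;
                p4d_t *p4d;
                pud_t *pud;
                pmd_t *pmd;

                if (!addr)
                        return 0;

                /* Disabling IRQs prevents the page tables from being torn
                 * down underneath the walker. */
                local_irq_save(flags);

                /* Kernel threads have no mm; fall back to init_mm so kernel
                 * addresses resolve too. active_mm is avoided because it can
                 * transiently be NULL around context_switch(). */
                mm = current->mm ? current->mm : &init_mm;

                pgd = pgd_offset(mm, addr);
                if (pgd_none(*pgd))
                        goto out;
                p4d = p4d_offset(pgd, addr);
                if (p4d_none(*p4d))
                        goto out;
                pud = pud_offset(p4d, addr);
                if (pud_none(*pud))
                        goto out;
                if (pud_leaf(*pud)) {
                        size = 1ULL << PUD_SHIFT;       /* e.g. 1GB on x86_64 */
                        goto out;
                }
                pmd = pmd_offset(pud, addr);
                if (pmd_none(*pmd))
                        goto out;
                if (pmd_leaf(*pmd)) {
                        size = 1ULL << PMD_SHIFT;       /* e.g. 2MB on x86_64 */
                        goto out;
                }
                size = PAGE_SIZE;                       /* base page */
        out:
                local_irq_restore(flags);
                return size;
        }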
      
      The method should work for all architectures, but it has only been
      verified on X86. Should there be some architectures which support perf
      but where the method doesn't work, it can be fixed later separately.
      Reporting the wrong page size would not be fatal for the architecture.
      
      Some under discussion features may impact the method in the future.
      Quote from Dave Hansen,
        "There are lots of weird things folks are trying to do with the page
         tables, like Address Space Isolation.  For instance, if you get a
         perf NMI when running userspace, current->mm->pgd is *different* than
         the PGD that was in use when userspace was running. It's close enough
         today, but it might not stay that way."
      If that case happens later, lots of consecutive page walk errors will
      occur. The worst case is that lots of page-size '0' values are returned,
      which would not be fatal.
      In the perf tool, a check is implemented to detect this case. Once it
      happens, a kernel patch can be implemented accordingly.
      Suggested-by: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/20201001135749.2804-2-kan.liang@linux.intel.com
  13. 12 Oct, 2020 1 commit
    • perf/core: Fix race in the perf_mmap_close() function · f91072ed
      Jiri Olsa authored
      There's a possible race in perf_mmap_close() when checking the ring
      buffer's mmap_count refcount value. The problem is that the mmap_count
      check is not atomic because we call atomic_dec() and atomic_read()
      separately.
      
        perf_mmap_close:
        ...
         atomic_dec(&rb->mmap_count);
         ...
         if (atomic_read(&rb->mmap_count))
            goto out_put;
      
         <ring buffer detach>
         free_uid
      
      out_put:
        ring_buffer_put(rb); /* could be last */
      
      The race can happen when we have two (or more) events sharing the same
      ring buffer and they both go through atomic_dec() and then both see 0
      as the refcount value later in atomic_read(). Then both will go on and
      execute code which is meant to be run just once.

      The code that detaches the ring buffer is probably fine to be executed
      more than once, but the problem is in calling free_uid(), which will
      later manifest in related crashes and refcount warnings, like:
      
        refcount_t: addition on 0; use-after-free.
        ...
        RIP: 0010:refcount_warn_saturate+0x6d/0xf
        ...
        Call Trace:
        prepare_creds+0x190/0x1e0
        copy_creds+0x35/0x172
        copy_process+0x471/0x1a80
        _do_fork+0x83/0x3a0
        __do_sys_wait4+0x83/0x90
        __do_sys_clone+0x85/0xa0
        do_syscall_64+0x5b/0x1e0
        entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      Use an atomic decrement-and-check instead of the separate calls.
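
      Schematically, the fix replaces the two-step sequence with a single
      atomic operation along these lines (a sketch, not the exact patch):

        /* Before: decrement and re-read are separate atomics, so two
         * closers can both observe 0 and both run the detach path. */
        atomic_dec(&rb->mmap_count);
        ...
        if (atomic_read(&rb->mmap_count))
                goto out_put;

        /* After: atomic_dec_and_test() returns true for exactly one caller,
         * the one whose decrement brought the count to zero. */
        if (!atomic_dec_and_test(&rb->mmap_count))
                goto out_put;
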
      Tested-by: Michael Petlan <mpetlan@redhat.com>
      Signed-off-by: Jiri Olsa <jolsa@kernel.org>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Acked-by: Namhyung Kim <namhyung@kernel.org>
      Acked-by: Wade Mealing <wmealing@redhat.com>
      Fixes: 9bb5d40c ("perf: Fix mmap() accounting hole")
      Link: https://lore.kernel.org/r/20200916115311.GE2301783@krava
  14. 09 Oct, 2020 1 commit
  15. 10 Sep, 2020 2 commits
  16. 23 Aug, 2020 1 commit
  17. 18 Aug, 2020 1 commit
  18. 12 Aug, 2020 1 commit
  19. 06 Aug, 2020 1 commit
  20. 26 Jul, 2020 1 commit
  21. 16 Jul, 2020 1 commit
  22. 08 Jul, 2020 3 commits
    • perf/x86: Remove task_ctx_size · 5a09928d
      Kan Liang authored
      A new kmem_cache method has replaced kzalloc() for allocating the PMU
      specific data. The task_ctx_size is not required anymore.
      Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/1593780569-62993-19-git-send-email-kan.liang@linux.intel.com
    • perf/core: Use kmem_cache to allocate the PMU specific data · 217c2a63
      Kan Liang authored
      Currently, the PMU specific data task_ctx_data is allocated by the
      function kzalloc() in the perf generic code. When there is no specific
      alignment requirement for the task_ctx_data, the method works well for
      now. However, there will be a problem once a specific alignment
      requirement is introduced in future features, e.g., the Architecture LBR
      XSAVE feature requires 64-byte alignment. If the specific alignment
      requirement is not fulfilled, the XSAVE family of instructions will fail
      to save/restore the xstate to/from the task_ctx_data.
      
      The function kzalloc() itself only guarantees a natural alignment. A
      new method to allocate the task_ctx_data has to be introduced, which
      has to meet the requirements below:
      - it must be a generic method that can be used by different
        architectures, because the allocation of the task_ctx_data is
        implemented in the perf generic code;
      - it must guarantee the requested alignment (the alignment requirement
        does not change after boot);
      - it must be able to allocate/free a buffer (smaller than a page size)
        dynamically;
      - it should not cause extra CPU overhead or space overhead.
      
      Several options were considered as below:
      - One option is to allocate a larger buffer for task_ctx_data and align
        the pointer up inside it. E.g.,
          ptr = kmalloc(size + alignment, GFP_KERNEL);
          ptr = PTR_ALIGN(ptr, alignment);
        This option causes space overhead.
      - Another option is to allocate the task_ctx_data in the PMU specific
        code. To do so, several function pointers have to be added. As a
        result, both the generic structure and the PMU specific structure
        will become bigger. Besides, extra function calls are added when
        allocating/freeing the buffer. This option will increase both the
        space overhead and CPU overhead.
      - The third option is to use a kmem_cache to allocate a buffer for the
        task_ctx_data. The kmem_cache can be created with a specific alignment
        requirement by the PMU at boot time. A new pointer for kmem_cache has
        to be added in the generic struct pmu, which would be used to
        dynamically allocate a buffer for the task_ctx_data at run time.
        Although the new pointer is added to the struct pmu, the existing
        variable task_ctx_size is not required anymore. The size of the
        generic structure is kept the same.
      
      The third option which meets all the aforementioned requirements is used
      to replace kzalloc() for the PMU specific data allocation. A later patch
      will remove the kzalloc() method and the related variables.
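
      A minimal sketch of the third option (illustrative only; the field and
      the call sites live in the perf core, and the names/sizes here follow
      the description above rather than the final patch):

        /* At PMU registration, the PMU creates a cache with the alignment
         * it needs (e.g. 64 bytes for the Arch LBR XSAVE buffer): */
        pmu->task_ctx_cache = kmem_cache_create("pmu_task_ctx_data",
                                                ctx_size,  /* object size */
                                                64,        /* alignment   */
                                                SLAB_PANIC, NULL);

        /* The generic code then allocates/frees the per-task data from it: */
        task_ctx_data = kmem_cache_zalloc(pmu->task_ctx_cache, GFP_KERNEL);
        ...
        kmem_cache_free(pmu->task_ctx_cache, task_ctx_data);
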
      Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/1593780569-62993-17-git-send-email-kan.liang@linux.intel.com
    • perf/core: Factor out functions to allocate/free the task_ctx_data · ff9ff926
      Kan Liang authored
      The method used to allocate/free the task_ctx_data is going to be
      changed in the following patch. Currently, the task_ctx_data is
      allocated/freed in several different places. To avoid repeatedly
      modifying the same code in several different places,
      alloc_task_ctx_data() and free_task_ctx_data() are factored out to
      allocate/free the task_ctx_data. The modification then only needs to be
      applied once.
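
      At this stage the factored-out helpers are just thin wrappers around
      the existing kzalloc()/kfree() calls, roughly:

        static void *alloc_task_ctx_data(struct pmu *pmu)
        {
                return kzalloc(pmu->task_ctx_size, GFP_KERNEL);
        }

        static void free_task_ctx_data(struct pmu *pmu, void *task_ctx_data)
        {
                kfree(task_ctx_data);
        }
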
      Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/1593780569-62993-16-git-send-email-kan.liang@linux.intel.com
  23. 15 Jun, 2020 1 commit
    • perf: Add perf text poke event · e17d43b9
      Adrian Hunter authored
      Record (single instruction) changes to the kernel text (i.e.
      self-modifying code) in order to support tracers like Intel PT and
      ARM CoreSight.
      
      A copy of the running kernel code is needed as a reference point (e.g.
      from /proc/kcore). The text poke event records the old bytes and the
      new bytes so that the event can be processed forwards or backwards.
      
      The basic problem is recording the modified instruction in an
      unambiguous manner given SMP instruction cache (in)coherence. That is,
      when modifying an instruction concurrently, any solution with one or
      multiple timestamps is not sufficient:
      
      	CPU0				CPU1
       0
       1	write insn A
       2					execute insn A
       3	sync-I$
       4
      
      Due to I$, CPU1 might execute either the old or new A. No matter where
      we record tracepoints on CPU0, one simply cannot tell what CPU1 will
      have observed, except that at 0 it must be the old one and at 4 it
      must be the new one.
      
      To solve this, take inspiration from x86 text poking, which has to
      solve this exact problem due to variable length instruction encoding
      and I-fetch windows.
      
       1) overwrite the instruction with a breakpoint and sync I$
      
      This guarantees that code flow will never hit the target instruction
      anymore, on any CPU (or rather, it will cause an exception).
      
       2) issue the TEXT_POKE event
      
       3) overwrite the breakpoint with the new instruction and sync I$
      
      Now we know that any execution after the TEXT_POKE event will either
      observe the breakpoint (and hit the exception) or the new instruction.
      
      So by guarding the TEXT_POKE event with an exception on either side,
      we can now tell, without doubt, which instruction another CPU will
      have observed.
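
      Roughly, the record layout this introduces looks like the following (a
      sketch of the fields as described above, not the authoritative uapi
      comment):

        /*
         * PERF_RECORD_TEXT_POKE (sketch):
         *
         *      struct {
         *              struct perf_event_header  header;
         *              u64                       addr;      poked address
         *              u16                       old_len;   number of old bytes
         *              u16                       new_len;   number of new bytes
         *              u8                        bytes[];   old bytes, then new
         *              struct sample_id          sample_id;
         *      };
         */
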
      Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/20200512121922.8997-2-adrian.hunter@intel.com
  24. 09 Jun, 2020 2 commits
  25. 08 Jun, 2020 1 commit
    • mm/gup.c: convert to use get_user_{page|pages}_fast_only() · dadbb612
      Souptick Joarder authored
      The API __get_user_pages_fast() is renamed to get_user_pages_fast_only()
      to align with pin_user_pages_fast_only().

      As part of this, we get rid of the write parameter.  Instead, callers
      pass FOLL_WRITE to get_user_pages_fast_only().  This does not change any
      existing functionality of the API.
      
      All the callers are changed to pass FOLL_WRITE.
      
      Also introduce get_user_page_fast_only(), and use it in a few places
      that hard-code nr_pages to 1.
      
      Updated the documentation of the API.
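
      A sketch of the caller-side conversion (before/after; the surrounding
      variables are illustrative):

        struct page *page;
        bool ok;

        /* Before: nr_pages and a numeric 'write' flag were passed. */
        ok = (__get_user_pages_fast(addr, 1, 1, &page) == 1);

        /* After: gup flags (FOLL_WRITE) replace 'write', and the
         * single-page helper hard-codes nr_pages == 1. */
        ok = get_user_page_fast_only(addr, FOLL_WRITE, &page);
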
      Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: John Hubbard <jhubbard@nvidia.com>
      Reviewed-by: Paul Mackerras <paulus@ozlabs.org>		[arch/powerpc/kvm]
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Michal Suchanek <msuchanek@suse.de>
      Link: http://lkml.kernel.org/r/1590396812-31277-1-git-send-email-jrdr.linux@gmail.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  26. 07 May, 2020 1 commit
  27. 30 Apr, 2020 1 commit