1. 17 Feb, 2012 2 commits
    • Srikar Dronamraju's avatar
      uprobes, mm, x86: Add the ability to install and remove uprobes breakpoints · 2b144498
      Srikar Dronamraju authored
      Add uprobes support to the core kernel, with x86 support.
      
      This commit adds the kernel facilities, the actual uprobes
      user-space ABI and perf probe support comes in later commits.
      
      General design:
      
      Uprobes are maintained in an rb-tree indexed by inode and offset
      (the offset here is from the start of the mapping). For a unique
      (inode, offset) tuple, there can be at most one uprobe in the
      rb-tree.
      
      Since the (inode, offset) tuple identifies a unique uprobe, more
      than one user may be interested in the same uprobe. This provides
      the ability to connect multiple 'consumers' to the same uprobe.
      
      Each consumer defines a handler and a filter (optional). The
      'handler' is run every time the uprobe is hit, if it matches the
      'filter' criteria.
      
      The first consumer of a uprobe causes the breakpoint to be
      inserted at the specified address and subsequent consumers are
      appended to this list.  On subsequent probes, the consumer gets
      appended to the existing list of consumers. The breakpoint is
      removed when the last consumer unregisters. For all other
      unregisterations, the consumer is removed from the list of
      consumers.
      
      Given a inode, we get a list of the mms that have mapped the
      inode. Do the actual registration if mm maps the page where a
      probe needs to be inserted/removed.
      
      We use a temporary list to walk through the vmas that map the
      inode.
      
      - The number of maps that map the inode, is not known before we
        walk the rmap and keeps changing.
      - extending vm_area_struct wasn't recommended, it's a
        size-critical data structure.
      - There can be more than one maps of the inode in the same mm.
      
      We add callbacks to the mmap methods to keep an eye on text vmas
      that are of interest to uprobes.  When a vma of interest is mapped,
      we insert the breakpoint at the right address.
      
      Uprobe works by replacing the instruction at the address defined
      by (inode, offset) with the arch specific breakpoint
      instruction. We save a copy of the original instruction at the
      uprobed address.
      
      This is needed for:
      
       a. executing the instruction out-of-line (xol).
       b. instruction analysis for any subsequent fixups.
       c. restoring the instruction back when the uprobe is unregistered.
      
      We insert or delete a breakpoint instruction, and this
      breakpoint instruction is assumed to be the smallest instruction
      available on the platform. For fixed size instruction platforms
      this is trivially true, for variable size instruction platforms
      the breakpoint instruction is typically the smallest (often a
      single byte).
      
      Writing the instruction is done by COWing the page and changing
      the instruction during the copy, this even though most platforms
      allow atomic writes of the breakpoint instruction. This also
      mirrors the behaviour of a ptrace() memory write to a PRIVATE
      file map.
      
      The core worker is derived from KSM's replace_page() logic.
      
      In essence, similar to KSM:
      
       a. allocate a new page and copy over contents of the page that
          has the uprobed vaddr
       b. modify the copy and insert the breakpoint at the required
          address
       c. switch the original page with the copy containing the
          breakpoint
       d. flush page tables.
      
      replace_page() is being replicated here because of some minor
      changes in the type of pages and also because Hugh Dickins had
      plans to improve replace_page() for KSM specific work.
      
      Instruction analysis on x86 is based on instruction decoder and
      determines if an instruction can be probed and determines the
      necessary fixups after singlestep.  Instruction analysis is done
      at probe insertion time so that we avoid having to repeat the
      same analysis every time a probe is hit.
      
      A lot of code here is due to the improvement/suggestions/inputs
      from Peter Zijlstra.
      
      Changelog:
      
      (v10):
       - Add code to clear REX.B prefix as suggested by Denys Vlasenko
         and Masami Hiramatsu.
      
      (v9):
       - Use insn_offset_modrm as suggested by Masami Hiramatsu.
      
      (v7):
      
       Handle comments from Peter Zijlstra:
      
       - Dont take reference to inode. (expect inode to uprobe_register to be sane).
       - Use PTR_ERR to set the return value.
       - No need to take reference to inode.
       - use PTR_ERR to return error value.
       - register and uprobe_unregister share code.
      
      (v5):
      
       - Modified del_consumer as per comments from Peter.
       - Drop reference to inode before dropping reference to uprobe.
       - Use i_size_read(inode) instead of inode->i_size.
       - Ensure uprobe->consumers is NULL, before __uprobe_unregister() is called.
       - Includes errno.h as recommended by Stephen Rothwell to fix a build issue
         on sparc defconfig
       - Remove restrictions while unregistering.
       - Earlier code leaked inode references under some conditions while
         registering/unregistering.
       - Continue the vma-rmap walk even if the intermediate vma doesnt
         meet the requirements.
       - Validate the vma found by find_vma before inserting/removing the
         breakpoint
       - Call del_consumer under mutex_lock.
       - Use hash locks.
       - Handle mremap.
       - Introduce find_least_offset_node() instead of close match logic in
         find_uprobe
       - Uprobes no more depends on MM_OWNER; No reference to task_structs
         while inserting/removing a probe.
       - Uses read_mapping_page instead of grab_cache_page so that the pages
         have valid content.
       - pass NULL to get_user_pages for the task parameter.
       - call SetPageUptodate on the new page allocated in write_opcode.
       - fix leaking a reference to the new page under certain conditions.
       - Include Instruction Decoder if Uprobes gets defined.
       - Remove const attributes for instruction prefix arrays.
       - Uses mm_context to know if the application is 32 bit.
      Signed-off-by: default avatarSrikar Dronamraju <srikar@linux.vnet.ibm.com>
      Also-written-by: default avatarJim Keniston <jkenisto@us.ibm.com>
      Reviewed-by: default avatarPeter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Roland McGrath <roland@hack.frob.com>
      Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
      Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
      Cc: Anton Arapov <anton@redhat.com>
      Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Denys Vlasenko <vda.linux@googlemail.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Linux-mm <linux-mm@kvack.org>
      Link: http://lkml.kernel.org/r/20120209092642.GE16600@linux.vnet.ibm.com
      [ Made various small edits to the commit log ]
      Signed-off-by: default avatarIngo Molnar <mingo@elte.hu>
      2b144498
    • Ingo Molnar's avatar
      Merge tag 'perf-core-for-mingo' of... · d1e169da
      Ingo Molnar authored
      Merge tag 'perf-core-for-mingo' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux into perf/core
      
      Includes smaller fixes and improvements plus the exclude_{host,guest} feature
      test and fallback to handle older kernels.
      Signed-off-by: default avatarArnaldo Carvalho de Melo <acme@redhat.com>
      d1e169da
  2. 14 Feb, 2012 15 commits
  3. 13 Feb, 2012 1 commit
  4. 11 Feb, 2012 4 commits
  5. 09 Feb, 2012 2 commits
    • David Ahern's avatar
      perf record: No build id option fails · d3665498
      David Ahern authored
      A recent refactoring of perf-record introduced the following:
      
      perf record -a -B
      Couldn't generating buildids. Use --no-buildid to profile anyway.
      sleep: Terminated
      
      I believe the triple negative was meant to be only a double negative.
      :-) While I'm there, fixed the grammar on the error message.
      
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Robert Richter <robert.richter@amd.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1328567272-13190-1-git-send-email-dsahern@gmail.comSigned-off-by: default avatarDavid Ahern <dsahern@gmail.com>
      Signed-off-by: default avatarArnaldo Carvalho de Melo <acme@redhat.com>
      d3665498
    • Stephane Eranian's avatar
      perf tools: fix endianness detection in perf.data · 73323f54
      Stephane Eranian authored
      The current version of perf detects whether or not the perf.data file is
      written in a different endianness using the attr_size field in the
      header of the file. This field represents sizeof(struct perf_event_attr)
      as known to perf record. If the sizes do not match, then perf tries the
      byte-swapped version. If they match, then the tool assumes a different
      endianness.
      
      The issue with the approach is that it assumes the size of
      perf_event_attr always has to match between perf record and perf report.
      However, the kernel perf_event ABI is extensible.  New fields can be
      added to struct perf_event_attr. Consequently, it is not possible to use
      attr_size to detect endianness.
      
      This patch takes another approach by using the magic number written at
      the beginning of the perf.data file to detect endianness. The magic
      number is an eight-byte signature.  It's primary purpose is to identify
      (signature) a perf.data file. But it could also be used to encode the
      endianness.
      
      The patch introduces a new value for this signature. The key difference
      is that the signature is written differently in the file depending on
      the endianness. Thus, by comparing the signature from the file with the
      tool's own signature it is possible to detect endianness. The new
      signature is "PERFILE2".
      
      Backward compatiblity with existing perf.data file is ensured.
      Tested-by: default avatarDavid Ahern <dsahern@gmail.com>
      Acked-by: default avatarDavid Ahern <dsahern@gmail.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Cc: Arun Sharma <asharma@fb.com>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Lin Ming <ming.m.lin@intel.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Roberto Agostino Vitillo <ravitillo@lbl.gov>
      Cc: Robert Richter <robert.richter@amd.com>
      Cc: Vince Weaver <vweaver1@eecs.utk.edu>
      Link: http://lkml.kernel.org/r/1328187288-24395-15-git-send-email-eranian@google.comSigned-off-by: default avatarStephane Eranian <eranian@google.com>
      Signed-off-by: default avatarArnaldo Carvalho de Melo <acme@redhat.com>
      73323f54
  6. 07 Feb, 2012 3 commits
  7. 06 Feb, 2012 8 commits
  8. 03 Feb, 2012 1 commit
    • Stephane Eranian's avatar
      perf: Remove deprecated WARN_ON_ONCE() · 84f2b9b2
      Stephane Eranian authored
      With the new throttling/unthrottling code introduced with
      commit:
      
        e050e3f0 ("perf: Fix broken interrupt rate throttling")
      
      we occasionally hit two WARN_ON_ONCE() checks in:
      
        - intel_pmu_pebs_enable()
        - intel_pmu_lbr_enable()
        - x86_pmu_start()
      
      The assertions are no longer problematic. There is a valid
      path where they can trigger but it is harmless.
      
      The assertion can be triggered with:
      
        $ perf record -e instructions:pp ....
      
      Leading to paths:
      
        intel_pmu_pebs_enable
        intel_pmu_enable_event
        x86_perf_event_set_period
        x86_pmu_start
        perf_adjust_freq_unthr_context
        perf_event_task_tick
        scheduler_tick
      
      And:
      
        intel_pmu_lbr_enable
        intel_pmu_enable_event
        x86_perf_event_set_period
        x86_pmu_start
        perf_adjust_freq_unthr_context.
        perf_event_task_tick
        scheduler_tick
      
      cpuc->enabled is always on because when we get to
      perf_adjust_freq_unthr_context() the PMU is not totally
      disabled. Furthermore when we need to adjust a period,
      we only stop the event we need to change and not the
      entire PMU. Thus, when we re-enable, cpuc->enabled is
      already set. Note that when we stop the event, both
      pebs and lbr are stopped if necessary (and possible).
      Signed-off-by: default avatarStephane Eranian <eranian@google.com>
      Cc: peterz@infradead.org
      Link: http://lkml.kernel.org/r/20120202110401.GA30911@quadSigned-off-by: default avatarIngo Molnar <mingo@elte.hu>
      84f2b9b2
  9. 02 Feb, 2012 4 commits