1. 21 Jul, 2011 1 commit
  2. 15 Jul, 2011 1 commit
    • Steven Rostedt's avatar
      ftrace: Fix regression where ftrace breaks when modules are loaded · f7bc8b61
      Steven Rostedt authored
      Enabling function tracer to trace all functions, then load a module and
      then disable function tracing will cause ftrace to fail.
      
      This can also happen by enabling function tracing on the command line:
      
        ftrace=function
      
      and during boot up, modules are loaded, then you disable function tracing
      with 'echo nop > current_tracer' you will trigger a bug in ftrace that
      will shut itself down.
      
      The reason is, the new ftrace code keeps ref counts of all ftrace_ops that
      are registered for tracing. When one or more ftrace_ops are registered,
      all the records that represent the functions that the ftrace_ops will
      trace have a ref count incremented. If this ref count is not zero,
      when the code modification runs, that function will be enabled for tracing.
      If the ref count is zero, that function will be disabled from tracing.
      
      To make sure the accounting was working, FTRACE_WARN_ON()s were added
      to updating of the ref counts.
      
      If the ref count hits its max (> 2^30 ftrace_ops added), or if
      the ref count goes below zero, a FTRACE_WARN_ON() is triggered which
      disables all modification of code.
      
      Since it is common for ftrace_ops to trace all functions in the kernel,
      instead of creating > 20,000 hash items for the ftrace_ops, the hash
      count is just set to zero, and it represents that the ftrace_ops is
      to trace all functions. This is where the issues arrise.
      
      If you enable function tracing to trace all functions, and then add
      a module, the modules function records do not get the ref count updated.
      When the function tracer is disabled, all function records ref counts
      are subtracted. Since the modules never had their ref counts incremented,
      they go below zero and the FTRACE_WARN_ON() is triggered.
      
      The solution to this is rather simple. When modules are loaded, and
      their functions are added to the the ftrace pool, look to see if any
      ftrace_ops are registered that trace all functions. And for those,
      update the ref count for the module function records.
      Reported-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Signed-off-by: default avatarSteven Rostedt <rostedt@goodmis.org>
      f7bc8b61
  3. 11 Jul, 2011 1 commit
  4. 07 Jul, 2011 4 commits
    • Steven Rostedt's avatar
      tracing, x86/irq: Do not trace arch_local_{*,irq_*}() functions · e08fbb78
      Steven Rostedt authored
      I triggered a triple fault with gcc 4.5.1 because it did not
      honor the inline annotation to arch_local_save_flags() function
      and that function was added to the pool of functions traced by
      the function tracer.
      
      When preempt_schedule() called arch_local_save_flags() (called
      by irqs_disabled()), it was traced, but the first thing the
      function tracer does is disable preemption. When it enables
      preemption, the NEED_RESCHED flag will not have been cleared and
      the preemption check will trigger the call to preempt_schedule()
      again.
      
      Although the dynamic function tracer crashed immediately, the
      static version of the function tracer (CONFIG_DYNAMIC_FTRACE is
      not set) actually was able to show where the problem was.
      
       swapper-1       3.N.. 103885us : arch_local_save_flags <-preempt_schedule
       swapper-1       3.N.. 103886us : arch_local_save_flags <-preempt_schedule
       swapper-1       3.N.. 103886us : arch_local_save_flags <-preempt_schedule
       swapper-1       3.N.. 103887us : arch_local_save_flags <-preempt_schedule
       swapper-1       3.N.. 103887us : arch_local_save_flags <-preempt_schedule
       swapper-1       3.N.. 103888us : arch_local_save_flags <-preempt_schedule
       swapper-1       3.N.. 103888us : arch_local_save_flags <-preempt_schedule
      
      It went on for a while before it triple faulted with a corrupted
      stack.
      
      The arch_local_save_flags and arch_local_irq_* functions should
      not be traced. Even though they are marked as inline, gcc may
      still make them a function and enable tracing of them.
      
      The simple solution is to just mark them as notrace. I had to
      add the <linux/types.h> for this file to include the notrace
      tag.
      Signed-off-by: default avatarSteven Rostedt <rostedt@goodmis.org>
      Link: http://lkml.kernel.org/r/20110702033852.733414762@goodmis.orgSigned-off-by: default avatarIngo Molnar <mingo@elte.hu>
      e08fbb78
    • Steven Rostedt's avatar
      ftrace: Fix regression of :mod:module function enabling · 43dd61c9
      Steven Rostedt authored
      The new code that allows different utilities to pick and choose
      what functions they trace broke the :mod: hook that allows users
      to trace only functions of a particular module.
      
      The reason is that the :mod: hook bypasses the hash that is setup
      to allow individual users to trace their own functions and uses
      the global hash directly. But if the global hash has not been
      set up, it will cause a bug:
      
      echo '*:mod:radeon' > /sys/kernel/debug/set_ftrace_filter
      
      produces:
      
       [drm:drm_mode_getfb] *ERROR* invalid framebuffer id
       [drm:radeon_crtc_page_flip] *ERROR* failed to reserve new rbo buffer before flip
       BUG: unable to handle kernel paging request at ffffffff8160ec90
       IP: [<ffffffff810d9136>] add_hash_entry+0x66/0xd0
       PGD 1a05067 PUD 1a09063 PMD 80000000016001e1
       Oops: 0003 [#1] SMP Jul  7 04:02:28 phyllis kernel: [55303.858604] CPU 1
       Modules linked in: cryptd aes_x86_64 aes_generic binfmt_misc rfcomm bnep ip6table_filter hid radeon r8169 ahci libahci mii ttm drm_kms_helper drm video i2c_algo_bit intel_agp intel_gtt
      
       Pid: 10344, comm: bash Tainted: G        WC  3.0.0-rc5 #1 Dell Inc. Inspiron N5010/0YXXJJ
       RIP: 0010:[<ffffffff810d9136>]  [<ffffffff810d9136>] add_hash_entry+0x66/0xd0
       RSP: 0018:ffff88003a96bda8  EFLAGS: 00010246
       RAX: ffff8801301735c0 RBX: ffffffff8160ec80 RCX: 0000000000306ee0
       RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff880137c92940
       RBP: ffff88003a96bdb8 R08: ffff880137c95680 R09: 0000000000000000
       R10: 0000000000000001 R11: 0000000000000000 R12: ffffffff81c9df78
       R13: ffff8801153d1000 R14: 0000000000000000 R15: 0000000000000000
       FS: 00007f329c18a700(0000) GS:ffff880137c80000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       CR2: ffffffff8160ec90 CR3: 000000003002b000 CR4: 00000000000006e0
       DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
       DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
       Process bash (pid: 10344, threadinfo ffff88003a96a000, task ffff88012fcfc470)
       Stack:
        0000000000000fd0 00000000000000fc ffff88003a96be38 ffffffff810d92f5
        ffff88011c4c4e00 ffff880000000000 000000000b69f4d0 ffffffff8160ec80
        ffff8800300e6f06 0000000081130295 0000000000000282 ffff8800300e6f00
       Call Trace:
        [<ffffffff810d92f5>] match_records+0x155/0x1b0
        [<ffffffff810d940c>] ftrace_mod_callback+0xbc/0x100
        [<ffffffff810dafdf>] ftrace_regex_write+0x16f/0x210
        [<ffffffff810db09f>] ftrace_filter_write+0xf/0x20
        [<ffffffff81166e48>] vfs_write+0xc8/0x190
        [<ffffffff81167001>] sys_write+0x51/0x90
        [<ffffffff815c7e02>] system_call_fastpath+0x16/0x1b
       Code: 48 8b 33 31 d2 48 85 f6 75 33 49 89 d4 4c 03 63 08 49 8b 14 24 48 85 d2 48 89 10 74 04 48 89 42 08 49 89 04 24 4c 89 60 08 31 d2
       RIP [<ffffffff810d9136>] add_hash_entry+0x66/0xd0
        RSP <ffff88003a96bda8>
       CR2: ffffffff8160ec90
       ---[ end trace a5d031828efdd88e ]---
      Reported-by: default avatarBrian Marete <marete@toshnix.com>
      Signed-off-by: default avatarSteven Rostedt <rostedt@goodmis.org>
      43dd61c9
    • Steven Rostedt's avatar
      tracing: Have "enable" file use refcounts like the "filter" file · 40ee4dff
      Steven Rostedt authored
      The "enable" file for the event system can be removed when a module
      is unloaded and the event system only has events from that module.
      As the event system nr_events count goes to zero, it may be freed
      if its ref_count is also set to zero.
      
      Like the "filter" file, the "enable" file may be opened by a task and
      referenced later, after a module has been unloaded and the events for
      that event system have been removed.
      
      Although the "filter" file referenced the event system structure,
      the "enable" file only references a pointer to the event system
      name. Since the name is freed when the event system is removed,
      it is possible that an access to the "enable" file may reference
      a freed pointer.
      
      Update the "enable" file to use the subsystem_open() routine that
      the "filter" file uses, to keep a reference to the event system
      structure while the "enable" file is opened.
      
      Cc: <stable@kernel.org>
      Reported-by: default avatarJohannes Berg <johannes.berg@intel.com>
      Signed-off-by: default avatarSteven Rostedt <rostedt@goodmis.org>
      40ee4dff
    • Steven Rostedt's avatar
      tracing: Fix bug when reading system filters on module removal · e9dbfae5
      Steven Rostedt authored
      The event system is freed when its nr_events is set to zero. This happens
      when a module created an event system and then later the module is
      removed. Modules may share systems, so the system is allocated when
      it is created and freed when the modules are unloaded and all the
      events under the system are removed (nr_events set to zero).
      
      The problem arises when a task opened the "filter" file for the
      system. If the module is unloaded and it removed the last event for
      that system, the system structure is freed. If the task that opened
      the filter file accesses the "filter" file after the system has
      been freed, the system will access an invalid pointer.
      
      By adding a ref_count, and using it to keep track of what
      is using the event system, we can free it after all users
      are finished with the event system.
      
      Cc: <stable@kernel.org>
      Reported-by: default avatarJohannes Berg <johannes.berg@intel.com>
      Signed-off-by: default avatarSteven Rostedt <rostedt@goodmis.org>
      e9dbfae5
  5. 05 Jul, 2011 2 commits
  6. 04 Jul, 2011 1 commit
  7. 03 Jul, 2011 1 commit
  8. 02 Jul, 2011 6 commits
    • Frederic Weisbecker's avatar
      x86: Don't use frame pointer to save old stack on irq entry · a2bbe750
      Frederic Weisbecker authored
      rbp is used in SAVE_ARGS_IRQ to save the old stack pointer
      in order to restore it later in ret_from_intr.
      
      It is convenient because we save its value in the irq regs
      and it's easily restored using the leave instruction.
      
      However this is a kind of abuse of the frame pointer which
      role is to help unwinding the kernel by chaining frames
      together, each node following the return address to the
      previous frame.
      
      But although we are breaking the frame by changing the stack
      pointer, there is no preceding return address before the new
      frame. Hence using the frame pointer to link the two stacks
      breaks the stack unwinders that find a random value instead of
      a return address here.
      
      There is no workaround that can work in every case. We are using
      the fixup_bp_irq_link() function to dereference that abused frame
      pointer in the case of non nesting interrupt (which means stack
      changed).
      But that doesn't fix the case of interrupts that don't change the
      stack (but we still have the unconditional frame link), which is
      the case of hardirq interrupting softirq. We have no way to detect
      this transition so the frame irq link is considered as a real frame
      pointer and the return address is dereferenced but it is still a
      spurious one.
      
      There are two possible results of this: either the spurious return
      address, a random stack value, luckily belongs to the kernel text
      and then the unwinding can continue and we just have a weird entry
      in the stack trace. Or it doesn't belong to the kernel text and
      unwinding stops there.
      
      This is the reason why stacktraces (including perf callchains) on
      irqs that interrupted softirqs don't work very well.
      
      To solve this, we don't save the old stack pointer on rbp anymore
      but we save it to a scratch register that we push on the new
      stack and that we pop back later on irq return.
      
      This preserves the whole frame chain without spurious return addresses
      in the middle and drops the need for the horrid fixup_bp_irq_link()
      workaround.
      
      And finally irqs that interrupt softirq are sanely unwinded.
      
      Before:
      
          99.81%         perf  [kernel.kallsyms]  [k] perf_pending_event
                         |
                         --- perf_pending_event
                             irq_work_run
                             smp_irq_work_interrupt
                             irq_work_interrupt
                            |
                            |--41.60%-- __read
                            |          |
                            |          |--99.90%-- create_worker
                            |          |          bench_sched_messaging
                            |          |          cmd_bench
                            |          |          run_builtin
                            |          |          main
                            |          |          __libc_start_main
                            |           --0.10%-- [...]
      
      After:
      
           1.64%  swapper  [kernel.kallsyms]  [k] perf_pending_event
                  |
                  --- perf_pending_event
                      irq_work_run
                      smp_irq_work_interrupt
                      irq_work_interrupt
                     |
                     |--95.00%-- arch_irq_work_raise
                     |          irq_work_queue
                     |          __perf_event_overflow
                     |          perf_swevent_overflow
                     |          perf_swevent_event
                     |          perf_tp_event
                     |          perf_trace_softirq
                     |          __do_softirq
                     |          call_softirq
                     |          do_softirq
                     |          irq_exit
                     |          |
                     |          |--73.68%-- smp_apic_timer_interrupt
                     |          |          apic_timer_interrupt
                     |          |          |
                     |          |          |--96.43%-- amd_e400_idle
                     |          |          |          cpu_idle
                     |          |          |          start_secondary
      Signed-off-by: default avatarFrederic Weisbecker <fweisbec@gmail.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jan Beulich <JBeulich@novell.com>
      a2bbe750
    • Frederic Weisbecker's avatar
      x86: Remove useless unwinder backlink from irq regs saving · 48ffee7d
      Frederic Weisbecker authored
      The unwinder backlink in interrupt entry is very useless.
      It's actually not part of the stack frame chain and thus is
      never used.
      Signed-off-by: default avatarFrederic Weisbecker <fweisbec@gmail.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jan Beulich <JBeulich@novell.com>
      48ffee7d
    • Frederic Weisbecker's avatar
      x86,64: Separate arg1 from rbp handling in SAVE_REGS_IRQ · 3b99a3ef
      Frederic Weisbecker authored
      Just for clarity in the code. Have a first block that handles
      the frame pointer and a separate one that handles pt_regs
      pointer and its use.
      Signed-off-by: default avatarFrederic Weisbecker <fweisbec@gmail.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jan Beulich <JBeulich@novell.com>
      3b99a3ef
    • Frederic Weisbecker's avatar
      x86,64: Simplify save_regs() · 1871853f
      Frederic Weisbecker authored
      The save_regs function that saves the regs on low level
      irq entry is complicated because of the fact it changes
      its stack in the middle and also because it manipulates
      data allocated in the caller frame and accesses there
      are directly calculated from callee rsp value with the
      return address in the middle of the way.
      
      This complicates the static stack offsets calculation and
      require more dynamic ones. It also needs a save/restore
      of the function's return address.
      
      To simplify and optimize this, turn save_regs() into a
      macro.
      Signed-off-by: default avatarFrederic Weisbecker <fweisbec@gmail.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jan Beulich <JBeulich@novell.com>
      1871853f
    • Frederic Weisbecker's avatar
      x86: Fetch stack from regs when possible in dump_trace() · 47ce11a2
      Frederic Weisbecker authored
      When regs are passed to dump_stack(), we fetch the frame
      pointer from the regs but the stack pointer is taken from
      the current frame.
      
      Thus the frame and stack pointers may not come from the same
      context. For example this can result in the unwinder to
      think the context is in irq, due to the current value of
      the stack, but the frame pointer coming from the regs points
      to a frame from another place. It then tries to fix up
      the irq link but ends up dereferencing a random frame
      pointer that doesn't belong to the irq stack:
      
      [ 9131.706906] ------------[ cut here ]------------
      [ 9131.707003] WARNING: at arch/x86/kernel/dumpstack_64.c:129 dump_trace+0x2aa/0x330()
      [ 9131.707003] Hardware name: AMD690VM-FMH
      [ 9131.707003] Perf: bad frame pointer = 0000000000000005 in callchain
      [ 9131.707003] Modules linked in:
      [ 9131.707003] Pid: 1050, comm: perf Not tainted 3.0.0-rc3+ #181
      [ 9131.707003] Call Trace:
      [ 9131.707003]  <IRQ>  [<ffffffff8104bd4a>] warn_slowpath_common+0x7a/0xb0
      [ 9131.707003]  [<ffffffff8104be21>] warn_slowpath_fmt+0x41/0x50
      [ 9131.707003]  [<ffffffff8178b873>] ? bad_to_user+0x6d/0x10be
      [ 9131.707003]  [<ffffffff8100c2da>] dump_trace+0x2aa/0x330
      [ 9131.707003]  [<ffffffff810107d3>] ? native_sched_clock+0x13/0x50
      [ 9131.707003]  [<ffffffff8101b164>] perf_callchain_kernel+0x54/0x70
      [ 9131.707003]  [<ffffffff810d391f>] perf_prepare_sample+0x19f/0x2a0
      [ 9131.707003]  [<ffffffff810d546c>] __perf_event_overflow+0x16c/0x290
      [ 9131.707003]  [<ffffffff810d5430>] ? __perf_event_overflow+0x130/0x290
      [ 9131.707003]  [<ffffffff810107d3>] ? native_sched_clock+0x13/0x50
      [ 9131.707003]  [<ffffffff8100fbb9>] ? sched_clock+0x9/0x10
      [ 9131.707003]  [<ffffffff810752e5>] ? T.375+0x15/0x90
      [ 9131.707003]  [<ffffffff81084da4>] ? trace_hardirqs_on_caller+0x64/0x180
      [ 9131.707003]  [<ffffffff810817bd>] ? trace_hardirqs_off+0xd/0x10
      [ 9131.707003]  [<ffffffff810d5764>] perf_event_overflow+0x14/0x20
      [ 9131.707003]  [<ffffffff810d588c>] perf_swevent_hrtimer+0x11c/0x130
      [ 9131.707003]  [<ffffffff817821a1>] ? error_exit+0x51/0xb0
      [ 9131.707003]  [<ffffffff81072e93>] __run_hrtimer+0x83/0x1e0
      [ 9131.707003]  [<ffffffff810d5770>] ? perf_event_overflow+0x20/0x20
      [ 9131.707003]  [<ffffffff81073256>] hrtimer_interrupt+0x106/0x250
      [ 9131.707003]  [<ffffffff812a3bfd>] ? trace_hardirqs_off_thunk+0x3a/0x3c
      [ 9131.707003]  [<ffffffff81024833>] smp_apic_timer_interrupt+0x53/0x90
      [ 9131.707003]  [<ffffffff81789053>] apic_timer_interrupt+0x13/0x20
      [ 9131.707003]  <EOI>  [<ffffffff817821a1>] ? error_exit+0x51/0xb0
      [ 9131.707003]  [<ffffffff8178219c>] ? error_exit+0x4c/0xb0
      [ 9131.707003] ---[ end trace b2560d4876709347 ]---
      
      Fix this by simply taking the stack pointer from regs->sp
      when regs are provided.
      Signed-off-by: default avatarFrederic Weisbecker <fweisbec@gmail.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      47ce11a2
    • Frederic Weisbecker's avatar
      x86: Save stack pointer in perf live regs savings · 9e46294d
      Frederic Weisbecker authored
      In order to prepare for fetching the stack pointer from the
      regs when possible in dump_trace() instead of taking the
      local one, save the current stack pointer in perf live regs saving.
      Signed-off-by: default avatarFrederic Weisbecker <fweisbec@gmail.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      9e46294d
  9. 01 Jul, 2011 22 commits
  10. 29 Jun, 2011 1 commit
    • Frederic Weisbecker's avatar
      perf tools: Only display parent field if explictly sorted · cb1955b8
      Frederic Weisbecker authored
      We don't need to display the parent field if the parent
      sorting machinery is only used for parent filtering
      (as in "-p foo").
      
      However if parent filtering is used in combination with
      explicit parent sorting ( -s parent), we want to
      display it.
      
      Result with:
      
        perf report -p kernel_thread -s parent
      
      Before:
      
       # Overhead  Parent symbol
       # ........  .............
       #
           0.07%
                  |
                  --- ioread8
                      ata_sff_check_status
                      ata_sff_tf_load
                      ata_sff_qc_issue
                      ata_bmdma_qc_issue
                      ata_qc_issue
                      ata_scsi_translate
                      ata_scsi_queuecmd
                      scsi_dispatch_cmd
                      scsi_request_fn
                      __blk_run_queue
                      __make_request
                      generic_make_request
                      submit_bio
                      submit_bh
                      journal_submit_commit_record
                      jbd2_journal_commit_transaction
                      kjournald2
                      kthread
                      kernel_thread_helpe
      
      After:
      
       # Overhead  Parent symbol
       # ........  .............
       #
           0.07%  kernel_thread_helper
                  |
                  --- ioread8
                      ata_sff_check_status
                      ata_sff_tf_load
                      ata_sff_qc_issue
                      ata_bmdma_qc_issue
                      ata_qc_issue
                      ata_scsi_translate
                      ata_scsi_queuecmd
                      scsi_dispatch_cmd
                      scsi_request_fn
                      __blk_run_queue
                      __make_request
                      generic_make_request
                      submit_bio
                      submit_bh
                      journal_submit_commit_record
                      jbd2_journal_commit_transaction
                      kjournald2
                      kthread
                      kernel_thread_helper
      Signed-off-by: default avatarFrederic Weisbecker <fweisbec@gmail.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Sam Liao <phyomh@gmail.com>
      cb1955b8