1. 21 Dec, 2023 4 commits
  2. 20 Dec, 2023 5 commits
  3. 19 Dec, 2023 11 commits
    • Steven Rostedt (Google)'s avatar
      ring-buffer: Check if absolute timestamp goes backwards · f50345b4
      Steven Rostedt (Google) authored
      The check_buffer() which checks the timestamps of the ring buffer
      sub-buffer page, when enabled, only checks if the adding of deltas of the
      events from the last absolute timestamp or the timestamp of the sub-buffer
      page adds up to the current event.
      
      What it does not check is if the absolute timestamp causes the time of the
      events to go backwards, as that can cause issues elsewhere.
      
      Test for the timestamp going backwards too.
      
      This also fixes a slight issue where if the warning triggers at boot up
      (because of the resetting of the tsc), it will disable all further checks,
      even those that are after boot Have it continue checking if the warning
      was ignored during boot up.
      
      Link: https://lore.kernel.org/linux-trace-kernel/20231219074732.18b092d4@gandalf.local.home
      
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Signed-off-by: default avatarSteven Rostedt (Google) <rostedt@goodmis.org>
      f50345b4
    • Steven Rostedt (Google)'s avatar
      ring-buffer: Add interrupt information to dump of data sub-buffer · d40dbb61
      Steven Rostedt (Google) authored
      When the ring buffer timestamp verifier triggers, it dumps the content of
      the sub-buffer. But currently it only dumps the timestamps and the offset
      of the data as well as the deltas. It would be even more informative if
      the event data also showed the interrupt context level it was in.
      
      That is, if each event showed that the event was written in normal,
      softirq, irq or NMI context. Then a better idea about how the events may
      have been interrupted from each other.
      
      As the payload of the ring buffer is really a black box of the ring
      buffer, just assume that if the payload is larger than a trace entry, that
      it is a trace entry. As trace entries have the interrupt context
      information saved in a flags field, look at that location and report the
      output of the flags.
      
      If the payload is not a trace entry, there's no way to really know, and
      the information will be garbage. But that's OK, because this is for
      debugging only (this output is not used in production as the buffer check
      that calls it causes a huge overhead to the tracing). This information,
      when available, is crucial for debugging timestamp issues. If it's
      garbage, it will also be pretty obvious that its garbage too.
      
      As this output usually happens in kselftests of the tracing code, the user
      will know what the payload is at the time.
      
      Link: https://lore.kernel.org/linux-trace-kernel/20231219074542.6f304601@gandalf.local.home
      
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Suggested-by: default avatarJoel Fernandes (Google) <joel@joelfernandes.org>
      Signed-off-by: default avatarSteven Rostedt (Google) <rostedt@goodmis.org>
      d40dbb61
    • Steven Rostedt (Google)'s avatar
      ring-buffer: Remove 32bit timestamp logic · c84897c0
      Steven Rostedt (Google) authored
      Each event has a 27 bit timestamp delta that is used to hold the delta
      from the last event. If the time between events is greater than 2^27, then
      a timestamp is added that holds a 59 bit absolute timestamp.
      
      Until a389d86f ("ring-buffer: Have nested events still record running
      time stamp"), if an interrupt interrupted an event in progress, all the
      events delta would be zero to not deal with the races that need to be
      handled. The commit a389d86f changed that to handle the races giving
      all events, even those that preempt other events, still have an accurate
      timestamp.
      
      To handle those races requires performing 64-bit cmpxchg on the
      timestamps. But doing 64-bit cmpxchg on 32-bit architectures is considered
      very slow. To try to deal with this the timestamp logic was broken into
      two and then three 32-bit cmpxchgs, with the thought that two (or three)
      32-bit cmpxchgs are still faster than a single 64-bit cmpxchg on 32-bit
      architectures.
      
      Part of the problem with this is that I didn't have any 32-bit
      architectures to test on. After hitting several subtle bugs in this code,
      an effort was made to try and see if three 32-bit cmpxchgs are indeed
      faster than a single 64-bit. After a few people brushed off the dust of
      their old 32-bit machines, tests were done, and even though 32-bit cmpxchg
      was faster than a single 64-bit, it was in the order of 50% at best, not
      300%.
      
      After some more refactoring of the code, all 4 64-bit cmpxchg were removed:
      
       https://lore.kernel.org/linux-trace-kernel/20231211114420.36dde01b@gandalf.local.home
       https://lore.kernel.org/linux-trace-kernel/20231214222921.193037a7@gandalf.local.home
       https://lore.kernel.org/linux-trace-kernel/20231215081810.1f4f38fe@rorschach.local.home
       https://lore.kernel.org/linux-trace-kernel/20231218230712.3a76b081@gandalf.local.home/
      
      With all the 64-bit cmpxchg removed, the complex 32-bit workaround can also be
      removed.
      
      The 32-bit and 64-bit logic is now exactly the same.
      
      Link: https://lore.kernel.org/all/20231213214632.15047c40@gandalf.local.home/
      Link: https://lore.kernel.org/linux-trace-kernel/20231219074303.28f9abda@gandalf.local.home
      
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarSteven Rostedt (Google) <rostedt@goodmis.org>
      c84897c0
    • Steven Rostedt (Google)'s avatar
      tracing/selftests: Add test to test the trace_marker · 3bf70092
      Steven Rostedt (Google) authored
      Add a test that writes longs strings, some over the size of the sub buffer
      and make sure that the entire content is there.
      
      Link: https://lore.kernel.org/linux-trace-kernel/20231213111527.6a4c9b58@gandalf.local.home
      
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Acked-by: default avatarShuah Khan <skhan@linuxfoundation.org>
      Signed-off-by: default avatarSteven Rostedt (Google) <rostedt@goodmis.org>
      3bf70092
    • Steven Rostedt (Google)'s avatar
      tracing: Increase size of trace_marker_raw to max ring buffer entry · 76ca20c7
      Steven Rostedt (Google) authored
      There's no reason to give an arbitrary limit to the size of a raw trace
      marker. Just let it be as big as the size that is allowed by the ring
      buffer itself.
      
      And there's also no reason to artificially break up the write to
      TRACE_BUF_SIZE, as that's not even used.
      
      Link: https://lore.kernel.org/linux-trace-kernel/20231213104218.2efc70c1@gandalf.local.home
      
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Reviewed-by: default avatarMasami Hiramatsu (Google) <mhiramat@kernel.org>
      Signed-off-by: default avatarSteven Rostedt (Google) <rostedt@goodmis.org>
      76ca20c7
    • Steven Rostedt (Google)'s avatar
      tracing: Have trace_marker break up by lines by size of trace_seq · 9482341d
      Steven Rostedt (Google) authored
      If a trace_marker write is bigger than what trace_seq can hold, then it
      will print "LINE TOO BIG" message and not what was written.
      
      Instead, check if the write is bigger than the trace_seq and break it
      up by that size.
      
      Ideally, we could make the trace_seq dynamic that could hold this. But
      that's for another time.
      
      Link: https://lore.kernel.org/linux-trace-kernel/20231212190422.1eaf224f@gandalf.local.home
      
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Reviewed-by: default avatarMasami Hiramatsu (Google) <mhiramat@kernel.org>
      Signed-off-by: default avatarSteven Rostedt (Google) <rostedt@goodmis.org>
      9482341d
    • Steven Rostedt (Google)'s avatar
      trace_seq: Increase the buffer size to almost two pages · 40fc60e3
      Steven Rostedt (Google) authored
      Now that trace_marker can hold more than 1KB string, and can write as much
      as the ring buffer can hold, the trace_seq is not big enough to hold
      writes:
      
       ~# a="1234567890"
       ~# cnt=4080
       ~# s=""
       ~# while [ $cnt -gt 10 ]; do
       ~#	s="${s}${a}"
       ~#	cnt=$((cnt-10))
       ~# done
       ~# echo $s > trace_marker
       ~# cat trace
       # tracer: nop
       #
       # entries-in-buffer/entries-written: 2/2   #P:8
       #
       #                                _-----=> irqs-off/BH-disabled
       #                               / _----=> need-resched
       #                              | / _---=> hardirq/softirq
       #                              || / _--=> preempt-depth
       #                              ||| / _-=> migrate-disable
       #                              |||| /     delay
       #           TASK-PID     CPU#  |||||  TIMESTAMP  FUNCTION
       #              | |         |   |||||     |         |
                  <...>-860     [002] .....   105.543465: tracing_mark_write[LINE TOO BIG]
                  <...>-860     [002] .....   105.543496: tracing_mark_write: 789012345678901234567890
      
      By increasing the trace_seq buffer to almost two pages, it can now print
      out the first line.
      
      This also subtracts the rest of the trace_seq fields from the buffer, so
      that the entire trace_seq is now PAGE_SIZE aligned.
      
      Link: https://lore.kernel.org/linux-trace-kernel/20231209175220.19867af4@gandalf.local.home
      
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Reviewed-by: default avatarMasami Hiramatsu (Google) <mhiramat@kernel.org>
      Signed-off-by: default avatarSteven Rostedt (Google) <rostedt@goodmis.org>
      40fc60e3
    • Steven Rostedt (Google)'s avatar
      tracing: Allow for max buffer data size trace_marker writes · 8ec90be7
      Steven Rostedt (Google) authored
      Allow a trace write to be as big as the ring buffer tracing data will
      allow. Currently, it only allows writes of 1KB in size, but there's no
      reason that it cannot allow what the ring buffer can hold.
      
      Link: https://lore.kernel.org/linux-trace-kernel/20231212131901.5f501e72@gandalf.local.home
      
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Signed-off-by: default avatarSteven Rostedt (Google) <rostedt@goodmis.org>
      8ec90be7
    • Steven Rostedt (Google)'s avatar
      ring-buffer: Add offset of events in dump on mismatch · 0b9036ef
      Steven Rostedt (Google) authored
      On bugs that have the ring buffer timestamp get out of sync, the config
      CONFIG_RING_BUFFER_VALIDATE_TIME_DELTAS, that checks for it and if it is
      detected it causes a dump of the bad sub buffer.
      
      It shows each event and their timestamp as well as the delta in the event.
      But it's also good to see the offset into the subbuffer for that event to
      know if how close to the end it is.
      
      Also print where the last event actually ended compared to where it was
      expected to end.
      
      Link: https://lore.kernel.org/linux-trace-kernel/20231211131623.59eaebd2@gandalf.local.home
      
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Acked-by: default avatarMasami Hiramatsu (Google) <mhiramat@kernel.org>
      Signed-off-by: default avatarSteven Rostedt (Google) <rostedt@goodmis.org>
      0b9036ef
    • Steven Rostedt (Google)'s avatar
      tracing: Allow creating instances with specified system events · d2356997
      Steven Rostedt (Google) authored
      A trace instance may only need to enable specific events. As the eventfs
      directory of an instance currently creates all events which adds overhead,
      allow internal instances to be created with just the events in systems
      that they care about. This currently only deals with systems and not
      individual events, but this should bring down the overhead of creating
      instances for specific use cases quite bit.
      
      The trace_array_get_by_name() now has another parameter "systems". This
      parameter is a const string pointer of a comma/space separated list of
      event systems that should be created by the trace_array. (Note if the
      trace_array already exists, this parameter is ignored).
      
      The list of systems is saved and if a module is loaded, its events will
      not be added unless the system for those events also match the systems
      string.
      
      Link: https://lore.kernel.org/linux-trace-kernel/20231213093701.03fddec0@gandalf.local.home
      
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: Sean Paul <seanpaul@chromium.org>
      Cc: Arun Easi   <aeasi@marvell.com>
      Cc: Daniel Wagner <dwagner@suse.de>
      Tested-by: default avatarDmytro Maluka <dmaluka@chromium.org>
      Signed-off-by: default avatarSteven Rostedt (Google) <rostedt@goodmis.org>
      d2356997
    • Steven Rostedt (Google)'s avatar
      ring-buffer: Fix slowpath of interrupted event · b803d7c6
      Steven Rostedt (Google) authored
      To synchronize the timestamps with the ring buffer reservation, there are
      two timestamps that are saved in the buffer meta data.
      
      1. before_stamp
      2. write_stamp
      
      When the two are equal, the write_stamp is considered valid, as in, it may
      be used to calculate the delta of the next event as the write_stamp is the
      timestamp of the previous reserved event on the buffer.
      
      This is done by the following:
      
       /*A*/	w = current position on the ring buffer
      	before = before_stamp
      	after = write_stamp
      	ts = read current timestamp
      
      	if (before != after) {
      		write_stamp is not valid, force adding an absolute
      		timestamp.
      	}
      
       /*B*/	before_stamp = ts
      
       /*C*/	write = local_add_return(event length, position on ring buffer)
      
      	if (w == write - event length) {
      		/* Nothing interrupted between A and C */
       /*E*/		write_stamp = ts;
      		delta = ts - after
      		/*
      		 * If nothing interrupted again,
      		 * before_stamp == write_stamp and write_stamp
      		 * can be used to calculate the delta for
      		 * events that come in after this one.
      		 */
      	} else {
      
      		/*
      		 * The slow path!
      		 * Was interrupted between A and C.
      		 */
      
      This is the place that there's a bug. We currently have:
      
      		after = write_stamp
      		ts = read current timestamp
      
       /*F*/		if (write == current position on the ring buffer &&
      		    after < ts && cmpxchg(write_stamp, after, ts)) {
      
      			delta = ts - after;
      
      		} else {
      			delta = 0;
      		}
      
      The assumption is that if the current position on the ring buffer hasn't
      moved between C and F, then it also was not interrupted, and that the last
      event written has a timestamp that matches the write_stamp. That is the
      write_stamp is valid.
      
      But this may not be the case:
      
      If a task context event was interrupted by softirq between B and C.
      
      And the softirq wrote an event that got interrupted by a hard irq between
      C and E.
      
      and the hard irq wrote an event (does not need to be interrupted)
      
      We have:
      
       /*B*/ before_stamp = ts of normal context
      
         ---> interrupted by softirq
      
      	/*B*/ before_stamp = ts of softirq context
      
      	  ---> interrupted by hardirq
      
      		/*B*/ before_stamp = ts of hard irq context
      		/*E*/ write_stamp = ts of hard irq context
      
      		/* matches and write_stamp valid */
      	  <----
      
      	/*E*/ write_stamp = ts of softirq context
      
      	/* No longer matches before_stamp, write_stamp is not valid! */
      
         <---
      
       w != write - length, go to slow path
      
      // Right now the order of events in the ring buffer is:
      //
      // |-- softirq event --|-- hard irq event --|-- normal context event --|
      //
      
       after = write_stamp (this is the ts of softirq)
       ts = read current timestamp
      
       if (write == current position on the ring buffer [true] &&
           after < ts [true] && cmpxchg(write_stamp, after, ts) [true]) {
      
      	delta = ts - after  [Wrong!]
      
      The delta is to be between the hard irq event and the normal context
      event, but the above logic made the delta between the softirq event and
      the normal context event, where the hard irq event is between the two. This
      will shift all the remaining event timestamps on the sub-buffer
      incorrectly.
      
      The write_stamp is only valid if it matches the before_stamp. The cmpxchg
      does nothing to help this.
      
      Instead, the following logic can be done to fix this:
      
      	before = before_stamp
      	ts = read current timestamp
      	before_stamp = ts
      
      	after = write_stamp
      
      	if (write == current position on the ring buffer &&
      	    after == before && after < ts) {
      
      		delta = ts - after
      
      	} else {
      		delta = 0;
      	}
      
      The above will only use the write_stamp if it still matches before_stamp
      and was tested to not have changed since C.
      
      As a bonus, with this logic we do not need any 64-bit cmpxchg() at all!
      
      This means the 32-bit rb_time_t workaround can finally be removed. But
      that's for a later time.
      
      Link: https://lore.kernel.org/linux-trace-kernel/20231218175229.58ec3daf@gandalf.local.home/
      Link: https://lore.kernel.org/linux-trace-kernel/20231218230712.3a76b081@gandalf.local.home
      
      Cc: stable@vger.kernel.org
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Fixes: dd939425 ("ring-buffer: Do not try to put back write_stamp")
      Signed-off-by: default avatarSteven Rostedt (Google) <rostedt@goodmis.org>
      b803d7c6
  4. 15 Dec, 2023 6 commits
    • Steven Rostedt (Google)'s avatar
      ring-buffer: Do not record in NMI if the arch does not support cmpxchg in NMI · 71229230
      Steven Rostedt (Google) authored
      As the ring buffer recording requires cmpxchg() to work, if the
      architecture does not support cmpxchg in NMI, then do not do any recording
      within an NMI.
      
      Link: https://lore.kernel.org/linux-trace-kernel/20231213175403.6fc18540@gandalf.local.home
      
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Signed-off-by: default avatarSteven Rostedt (Google) <rostedt@goodmis.org>
      71229230
    • Steven Rostedt (Google)'s avatar
      ring-buffer: Have rb_time_cmpxchg() set the msb counter too · 0aa0e528
      Steven Rostedt (Google) authored
      The rb_time_cmpxchg() on 32-bit architectures requires setting three
      32-bit words to represent the 64-bit timestamp, with some salt for
      synchronization. Those are: msb, top, and bottom
      
      The issue is, the rb_time_cmpxchg() did not properly salt the msb portion,
      and the msb that was written was stale.
      
      Link: https://lore.kernel.org/linux-trace-kernel/20231215084114.20899342@rorschach.local.home
      
      Cc: stable@vger.kernel.org
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Fixes: f03f2abc ("ring-buffer: Have 32 bit time stamps use all 64 bits")
      Signed-off-by: default avatarSteven Rostedt (Google) <rostedt@goodmis.org>
      0aa0e528
    • Mathieu Desnoyers's avatar
      ring-buffer: Fix 32-bit rb_time_read() race with rb_time_cmpxchg() · dec89008
      Mathieu Desnoyers authored
      The following race can cause rb_time_read() to observe a corrupted time
      stamp:
      
      rb_time_cmpxchg()
      [...]
              if (!rb_time_read_cmpxchg(&t->msb, msb, msb2))
                      return false;
              if (!rb_time_read_cmpxchg(&t->top, top, top2))
                      return false;
      <interrupted before updating bottom>
      __rb_time_read()
      [...]
              do {
                      c = local_read(&t->cnt);
                      top = local_read(&t->top);
                      bottom = local_read(&t->bottom);
                      msb = local_read(&t->msb);
              } while (c != local_read(&t->cnt));
      
              *cnt = rb_time_cnt(top);
      
              /* If top and msb counts don't match, this interrupted a write */
              if (*cnt != rb_time_cnt(msb))
                      return false;
                ^ this check fails to catch that "bottom" is still not updated.
      
      So the old "bottom" value is returned, which is wrong.
      
      Fix this by checking that all three of msb, top, and bottom 2-bit cnt
      values match.
      
      The reason to favor checking all three fields over requiring a specific
      update order for both rb_time_set() and rb_time_cmpxchg() is because
      checking all three fields is more robust to handle partial failures of
      rb_time_cmpxchg() when interrupted by nested rb_time_set().
      
      Link: https://lore.kernel.org/lkml/20231211201324.652870-1-mathieu.desnoyers@efficios.com/
      Link: https://lore.kernel.org/linux-trace-kernel/20231212193049.680122-1-mathieu.desnoyers@efficios.com
      
      Fixes: f458a145 ("ring-buffer: Test last update in 32bit version of __rb_time_read()")
      Signed-off-by: default avatarMathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Signed-off-by: default avatarSteven Rostedt (Google) <rostedt@goodmis.org>
      dec89008
    • Steven Rostedt (Google)'s avatar
      ring-buffer: Fix a race in rb_time_cmpxchg() for 32 bit archs · fff88fa0
      Steven Rostedt (Google) authored
      Mathieu Desnoyers pointed out an issue in the rb_time_cmpxchg() for 32 bit
      architectures. That is:
      
       static bool rb_time_cmpxchg(rb_time_t *t, u64 expect, u64 set)
       {
      	unsigned long cnt, top, bottom, msb;
      	unsigned long cnt2, top2, bottom2, msb2;
      	u64 val;
      
      	/* The cmpxchg always fails if it interrupted an update */
      	 if (!__rb_time_read(t, &val, &cnt2))
      		 return false;
      
      	 if (val != expect)
      		 return false;
      
      <<<< interrupted here!
      
      	 cnt = local_read(&t->cnt);
      
      The problem is that the synchronization counter in the rb_time_t is read
      *after* the value of the timestamp is read. That means if an interrupt
      were to come in between the value being read and the counter being read,
      it can change the value and the counter and the interrupted process would
      be clueless about it!
      
      The counter needs to be read first and then the value. That way it is easy
      to tell if the value is stale or not. If the counter hasn't been updated,
      then the value is still good.
      
      Link: https://lore.kernel.org/linux-trace-kernel/20231211201324.652870-1-mathieu.desnoyers@efficios.com/
      Link: https://lore.kernel.org/linux-trace-kernel/20231212115301.7a9c9a64@gandalf.local.home
      
      Cc: stable@vger.kernel.org
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Fixes: 10464b4a ("ring-buffer: Add rb_time_t 64 bit operations for speeding up 32 bit")
      Reported-by: default avatarMathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Reviewed-by: default avatarMathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Signed-off-by: default avatarSteven Rostedt (Google) <rostedt@goodmis.org>
      fff88fa0
    • Steven Rostedt (Google)'s avatar
      ring-buffer: Remove useless update to write_stamp in rb_try_to_discard() · 083e9f65
      Steven Rostedt (Google) authored
      When filtering is enabled, a temporary buffer is created to place the
      content of the trace event output so that the filter logic can decide
      from the trace event output if the trace event should be filtered out or
      not. If it is to be filtered out, the content in the temporary buffer is
      simply discarded, otherwise it is written into the trace buffer.
      
      But if an interrupt were to come in while a previous event was using that
      temporary buffer, the event written by the interrupt would actually go
      into the ring buffer itself to prevent corrupting the data on the
      temporary buffer. If the event is to be filtered out, the event in the
      ring buffer is discarded, or if it fails to discard because another event
      were to have already come in, it is turned into padding.
      
      The update to the write_stamp in the rb_try_to_discard() happens after a
      fix was made to force the next event after the discard to use an absolute
      timestamp by setting the before_stamp to zero so it does not match the
      write_stamp (which causes an event to use the absolute timestamp).
      
      But there's an effort in rb_try_to_discard() to put back the write_stamp
      to what it was before the event was added. But this is useless and
      wasteful because nothing is going to be using that write_stamp for
      calculations as it still will not match the before_stamp.
      
      Remove this useless update, and in doing so, we remove another
      cmpxchg64()!
      
      Also update the comments to reflect this change as well as remove some
      extra white space in another comment.
      
      Link: https://lore.kernel.org/linux-trace-kernel/20231215081810.1f4f38fe@rorschach.local.home
      
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: Joel Fernandes <joel@joelfernandes.org>
      Cc: Vincent Donnefort   <vdonnefort@google.com>
      Fixes: b2dd7975 ("ring-buffer: Force absolute timestamp on discard of event")
      Signed-off-by: default avatarSteven Rostedt (Google) <rostedt@goodmis.org>
      083e9f65
    • Steven Rostedt (Google)'s avatar
      ring-buffer: Do not try to put back write_stamp · dd939425
      Steven Rostedt (Google) authored
      If an update to an event is interrupted by another event between the time
      the initial event allocated its buffer and where it wrote to the
      write_stamp, the code try to reset the write stamp back to the what it had
      just overwritten. It knows that it was overwritten via checking the
      before_stamp, and if it didn't match what it wrote to the before_stamp
      before it allocated its space, it knows it was overwritten.
      
      To put back the write_stamp, it uses the before_stamp it read. The problem
      here is that by writing the before_stamp to the write_stamp it makes the
      two equal again, which means that the write_stamp can be considered valid
      as the last timestamp written to the ring buffer. But this is not
      necessarily true. The event that interrupted the event could have been
      interrupted in a way that it was interrupted as well, and can end up
      leaving with an invalid write_stamp. But if this happens and returns to
      this context that uses the before_stamp to update the write_stamp again,
      it can possibly incorrectly make it valid, causing later events to have in
      correct time stamps.
      
      As it is OK to leave this function with an invalid write_stamp (one that
      doesn't match the before_stamp), there's no reason to try to make it valid
      again in this case. If this race happens, then just leave with the invalid
      write_stamp and the next event to come along will just add a absolute
      timestamp and validate everything again.
      
      Bonus points: This gets rid of another cmpxchg64!
      
      Link: https://lore.kernel.org/linux-trace-kernel/20231214222921.193037a7@gandalf.local.home
      
      Cc: stable@vger.kernel.org
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: Joel Fernandes <joel@joelfernandes.org>
      Cc: Vincent Donnefort <vdonnefort@google.com>
      Fixes: a389d86f ("ring-buffer: Have nested events still record running time stamp")
      Signed-off-by: default avatarSteven Rostedt (Google) <rostedt@goodmis.org>
      dd939425
  5. 14 Dec, 2023 1 commit
    • Zheng Yejian's avatar
      tracing: Fix uaf issue when open the hist or hist_debug file · 1cc111b9
      Zheng Yejian authored
      KASAN report following issue. The root cause is when opening 'hist'
      file of an instance and accessing 'trace_event_file' in hist_show(),
      but 'trace_event_file' has been freed due to the instance being removed.
      'hist_debug' file has the same problem. To fix it, call
      tracing_{open,release}_file_tr() in file_operations callback to have
      the ref count and avoid 'trace_event_file' being freed.
      
        BUG: KASAN: slab-use-after-free in hist_show+0x11e0/0x1278
        Read of size 8 at addr ffff242541e336b8 by task head/190
      
        CPU: 4 PID: 190 Comm: head Not tainted 6.7.0-rc5-g26aff849 #133
        Hardware name: linux,dummy-virt (DT)
        Call trace:
         dump_backtrace+0x98/0xf8
         show_stack+0x1c/0x30
         dump_stack_lvl+0x44/0x58
         print_report+0xf0/0x5a0
         kasan_report+0x80/0xc0
         __asan_report_load8_noabort+0x1c/0x28
         hist_show+0x11e0/0x1278
         seq_read_iter+0x344/0xd78
         seq_read+0x128/0x1c0
         vfs_read+0x198/0x6c8
         ksys_read+0xf4/0x1e0
         __arm64_sys_read+0x70/0xa8
         invoke_syscall+0x70/0x260
         el0_svc_common.constprop.0+0xb0/0x280
         do_el0_svc+0x44/0x60
         el0_svc+0x34/0x68
         el0t_64_sync_handler+0xb8/0xc0
         el0t_64_sync+0x168/0x170
      
        Allocated by task 188:
         kasan_save_stack+0x28/0x50
         kasan_set_track+0x28/0x38
         kasan_save_alloc_info+0x20/0x30
         __kasan_slab_alloc+0x6c/0x80
         kmem_cache_alloc+0x15c/0x4a8
         trace_create_new_event+0x84/0x348
         __trace_add_new_event+0x18/0x88
         event_trace_add_tracer+0xc4/0x1a0
         trace_array_create_dir+0x6c/0x100
         trace_array_create+0x2e8/0x568
         instance_mkdir+0x48/0x80
         tracefs_syscall_mkdir+0x90/0xe8
         vfs_mkdir+0x3c4/0x610
         do_mkdirat+0x144/0x200
         __arm64_sys_mkdirat+0x8c/0xc0
         invoke_syscall+0x70/0x260
         el0_svc_common.constprop.0+0xb0/0x280
         do_el0_svc+0x44/0x60
         el0_svc+0x34/0x68
         el0t_64_sync_handler+0xb8/0xc0
         el0t_64_sync+0x168/0x170
      
        Freed by task 191:
         kasan_save_stack+0x28/0x50
         kasan_set_track+0x28/0x38
         kasan_save_free_info+0x34/0x58
         __kasan_slab_free+0xe4/0x158
         kmem_cache_free+0x19c/0x508
         event_file_put+0xa0/0x120
         remove_event_file_dir+0x180/0x320
         event_trace_del_tracer+0xb0/0x180
         __remove_instance+0x224/0x508
         instance_rmdir+0x44/0x78
         tracefs_syscall_rmdir+0xbc/0x140
         vfs_rmdir+0x1cc/0x4c8
         do_rmdir+0x220/0x2b8
         __arm64_sys_unlinkat+0xc0/0x100
         invoke_syscall+0x70/0x260
         el0_svc_common.constprop.0+0xb0/0x280
         do_el0_svc+0x44/0x60
         el0_svc+0x34/0x68
         el0t_64_sync_handler+0xb8/0xc0
         el0t_64_sync+0x168/0x170
      
      Link: https://lore.kernel.org/linux-trace-kernel/20231214012153.676155-1-zhengyejian1@huawei.comSuggested-by: default avatarSteven Rostedt <rostedt@goodmis.org>
      Signed-off-by: default avatarZheng Yejian <zhengyejian1@huawei.com>
      Signed-off-by: default avatarSteven Rostedt (Google) <rostedt@goodmis.org>
      1cc111b9
  6. 13 Dec, 2023 8 commits
    • Steven Rostedt (Google)'s avatar
      tracing: Add size check when printing trace_marker output · 60be76ee
      Steven Rostedt (Google) authored
      If for some reason the trace_marker write does not have a nul byte for the
      string, it will overflow the print:
      
        trace_seq_printf(s, ": %s", field->buf);
      
      The field->buf could be missing the nul byte. To prevent overflow, add the
      max size that the buf can be by using the event size and the field
      location.
      
        int max = iter->ent_size - offsetof(struct print_entry, buf);
      
        trace_seq_printf(s, ": %*.s", max, field->buf);
      
      Link: https://lore.kernel.org/linux-trace-kernel/20231212084444.4619b8ce@gandalf.local.home
      
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Reviewed-by: default avatarMasami Hiramatsu (Google) <mhiramat@kernel.org>
      Signed-off-by: default avatarSteven Rostedt (Google) <rostedt@goodmis.org>
      60be76ee
    • Steven Rostedt (Google)'s avatar
      ring-buffer: Have saved event hold the entire event · b0495258
      Steven Rostedt (Google) authored
      For the ring buffer iterator (non-consuming read), the event needs to be
      copied into the iterator buffer to make sure that a writer does not
      overwrite it while the user is reading it. If a write happens during the
      copy, the buffer is simply discarded.
      
      But the temp buffer itself was not big enough. The allocation of the
      buffer was only BUF_MAX_DATA_SIZE, which is the maximum data size that can
      be passed into the ring buffer and saved. But the temp buffer needs to
      hold the meta data as well. That would be BUF_PAGE_SIZE and not
      BUF_MAX_DATA_SIZE.
      
      Link: https://lore.kernel.org/linux-trace-kernel/20231212072558.61f76493@gandalf.local.home
      
      Cc: stable@vger.kernel.org
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Fixes: 785888c5 ("ring-buffer: Have rb_iter_head_event() handle concurrent writer")
      Signed-off-by: default avatarSteven Rostedt (Google) <rostedt@goodmis.org>
      b0495258
    • Steven Rostedt (Google)'s avatar
      ring-buffer: Do not update before stamp when switching sub-buffers · 9e45e39d
      Steven Rostedt (Google) authored
      The ring buffer timestamps are synchronized by two timestamp placeholders.
      One is the "before_stamp" and the other is the "write_stamp" (sometimes
      referred to as the "after stamp" but only in the comments. These two
      stamps are key to knowing how to handle nested events coming in with a
      lockless system.
      
      When moving across sub-buffers, the before stamp is updated but the write
      stamp is not. There's an effort to put back the before stamp to something
      that seems logical in case there's nested events. But as the current event
      is about to cross sub-buffers, and so will any new nested event that happens,
      updating the before stamp is useless, and could even introduce new race
      conditions.
      
      The first event on a sub-buffer simply uses the sub-buffer's timestamp
      and keeps a "delta" of zero. The "before_stamp" and "write_stamp" are not
      used in the algorithm in this case. There's no reason to try to fix the
      before_stamp when this happens.
      
      As a bonus, it removes a cmpxchg() when crossing sub-buffers!
      
      Link: https://lore.kernel.org/linux-trace-kernel/20231211114420.36dde01b@gandalf.local.home
      
      Cc: stable@vger.kernel.org
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Fixes: a389d86f ("ring-buffer: Have nested events still record running time stamp")
      Reviewed-by: default avatarMasami Hiramatsu (Google) <mhiramat@kernel.org>
      Signed-off-by: default avatarSteven Rostedt (Google) <rostedt@goodmis.org>
      9e45e39d
    • Steven Rostedt (Google)'s avatar
      tracing: Update snapshot buffer on resize if it is allocated · d06aff1c
      Steven Rostedt (Google) authored
      The snapshot buffer is to mimic the main buffer so that when a snapshot is
      needed, the snapshot and main buffer are swapped. When the snapshot buffer
      is allocated, it is set to the minimal size that the ring buffer may be at
      and still functional. When it is allocated it becomes the same size as the
      main ring buffer, and when the main ring buffer changes in size, it should
      do.
      
      Currently, the resize only updates the snapshot buffer if it's used by the
      current tracer (ie. the preemptirqsoff tracer). But it needs to be updated
      anytime it is allocated.
      
      When changing the size of the main buffer, instead of looking to see if
      the current tracer is utilizing the snapshot buffer, just check if it is
      allocated to know if it should be updated or not.
      
      Also fix typo in comment just above the code change.
      
      Link: https://lore.kernel.org/linux-trace-kernel/20231210225447.48476a6a@rorschach.local.home
      
      Cc: stable@vger.kernel.org
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Fixes: ad909e21 ("tracing: Add internal tracing_snapshot() functions")
      Reviewed-by: default avatarMasami Hiramatsu (Google) <mhiramat@kernel.org>
      Signed-off-by: default avatarSteven Rostedt (Google) <rostedt@goodmis.org>
      d06aff1c
    • Steven Rostedt (Google)'s avatar
      ring-buffer: Fix memory leak of free page · 17d80175
      Steven Rostedt (Google) authored
      Reading the ring buffer does a swap of a sub-buffer within the ring buffer
      with a empty sub-buffer. This allows the reader to have full access to the
      content of the sub-buffer that was swapped out without having to worry
      about contention with the writer.
      
      The readers call ring_buffer_alloc_read_page() to allocate a page that
      will be used to swap with the ring buffer. When the code is finished with
      the reader page, it calls ring_buffer_free_read_page(). Instead of freeing
      the page, it stores it as a spare. Then next call to
      ring_buffer_alloc_read_page() will return this spare instead of calling
      into the memory management system to allocate a new page.
      
      Unfortunately, on freeing of the ring buffer, this spare page is not
      freed, and causes a memory leak.
      
      Link: https://lore.kernel.org/linux-trace-kernel/20231210221250.7b9cc83c@rorschach.local.home
      
      Cc: stable@vger.kernel.org
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Fixes: 73a757e6 ("ring-buffer: Return reader page back into existing ring buffer")
      Acked-by: default avatarMasami Hiramatsu (Google) <mhiramat@kernel.org>
      Signed-off-by: default avatarSteven Rostedt (Google) <rostedt@goodmis.org>
      17d80175
    • Beau Belgrave's avatar
      eventfs: Fix events beyond NAME_MAX blocking tasks · 5eaf7f05
      Beau Belgrave authored
      Eventfs uses simple_lookup(), however, it will fail if the name of the
      entry is beyond NAME_MAX length. When this error is encountered, eventfs
      still tries to create dentries instead of skipping the dentry creation.
      When the dentry is attempted to be created in this state d_wait_lookup()
      will loop forever, waiting for the lookup to be removed.
      
      Fix eventfs to return the error in simple_lookup() back to the caller
      instead of continuing to try to create the dentry.
      
      Link: https://lore.kernel.org/linux-trace-kernel/20231210213534.497-1-beaub@linux.microsoft.com
      
      Fixes: 63940449 ("eventfs: Implement eventfs lookup, read, open functions")
      Link: https://lore.kernel.org/linux-trace-kernel/20231208183601.GA46-beaub@linux.microsoft.com/Signed-off-by: default avatarBeau Belgrave <beaub@linux.microsoft.com>
      Signed-off-by: default avatarSteven Rostedt (Google) <rostedt@goodmis.org>
      5eaf7f05
    • Steven Rostedt (Google)'s avatar
      tracing: Have large events show up as '[LINE TOO BIG]' instead of nothing · b55b0a0d
      Steven Rostedt (Google) authored
      If a large event was added to the ring buffer that is larger than what the
      trace_seq can handle, it just drops the output:
      
       ~# cat /sys/kernel/tracing/trace
       # tracer: nop
       #
       # entries-in-buffer/entries-written: 2/2   #P:8
       #
       #                                _-----=> irqs-off/BH-disabled
       #                               / _----=> need-resched
       #                              | / _---=> hardirq/softirq
       #                              || / _--=> preempt-depth
       #                              ||| / _-=> migrate-disable
       #                              |||| /     delay
       #           TASK-PID     CPU#  |||||  TIMESTAMP  FUNCTION
       #              | |         |   |||||     |         |
                  <...>-859     [001] .....   141.118951: tracing_mark_write           <...>-859     [001] .....   141.148201: tracing_mark_write: 78901234
      
      Instead, catch this case and add some context:
      
       ~# cat /sys/kernel/tracing/trace
       # tracer: nop
       #
       # entries-in-buffer/entries-written: 2/2   #P:8
       #
       #                                _-----=> irqs-off/BH-disabled
       #                               / _----=> need-resched
       #                              | / _---=> hardirq/softirq
       #                              || / _--=> preempt-depth
       #                              ||| / _-=> migrate-disable
       #                              |||| /     delay
       #           TASK-PID     CPU#  |||||  TIMESTAMP  FUNCTION
       #              | |         |   |||||     |         |
                  <...>-852     [001] .....   121.550551: tracing_mark_write[LINE TOO BIG]
                  <...>-852     [001] .....   121.550581: tracing_mark_write: 78901234
      
      This now emulates the same output as trace_pipe.
      
      Link: https://lore.kernel.org/linux-trace-kernel/20231209171058.78c1a026@gandalf.local.home
      
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Reviewed-by: default avatarMasami Hiramatsu (Google) <mhiramat@kernel.org>
      Signed-off-by: default avatarSteven Rostedt (Google) <rostedt@goodmis.org>
      b55b0a0d
    • Steven Rostedt (Google)'s avatar
      ring-buffer: Fix writing to the buffer with max_data_size · b3ae7b67
      Steven Rostedt (Google) authored
      The maximum ring buffer data size is the maximum size of data that can be
      recorded on the ring buffer. Events must be smaller than the sub buffer
      data size minus any meta data. This size is checked before trying to
      allocate from the ring buffer because the allocation assumes that the size
      will fit on the sub buffer.
      
      The maximum size was calculated as the size of a sub buffer page (which is
      currently PAGE_SIZE minus the sub buffer header) minus the size of the
      meta data of an individual event. But it missed the possible adding of a
      time stamp for events that are added long enough apart that the event meta
      data can't hold the time delta.
      
      When an event is added that is greater than the current BUF_MAX_DATA_SIZE
      minus the size of a time stamp, but still less than or equal to
      BUF_MAX_DATA_SIZE, the ring buffer would go into an infinite loop, looking
      for a page that can hold the event. Luckily, there's a check for this loop
      and after 1000 iterations and a warning is emitted and the ring buffer is
      disabled. But this should never happen.
      
      This can happen when a large event is added first, or after a long period
      where an absolute timestamp is prefixed to the event, increasing its size
      by 8 bytes. This passes the check and then goes into the algorithm that
      causes the infinite loop.
      
      For events that are the first event on the sub-buffer, it does not need to
      add a timestamp, because the sub-buffer itself contains an absolute
      timestamp, and adding one is redundant.
      
      The fix is to check if the event is to be the first event on the
      sub-buffer, and if it is, then do not add a timestamp.
      
      This also fixes 32 bit adding a timestamp when a read of before_stamp or
      write_stamp is interrupted. There's still no need to add that timestamp if
      the event is going to be the first event on the sub buffer.
      
      Also, if the buffer has "time_stamp_abs" set, then also check if the
      length plus the timestamp is greater than the BUF_MAX_DATA_SIZE.
      
      Link: https://lore.kernel.org/all/20231212104549.58863438@gandalf.local.home/
      Link: https://lore.kernel.org/linux-trace-kernel/20231212071837.5fdd6c13@gandalf.local.home
      Link: https://lore.kernel.org/linux-trace-kernel/20231212111617.39e02849@gandalf.local.home
      
      Cc: stable@vger.kernel.org
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Fixes: a4543a2f ("ring-buffer: Get timestamp after event is allocated")
      Fixes: 58fbc3c6 ("ring-buffer: Consolidate add_timestamp to remove some branches")
      Reported-by: Kent Overstreet <kent.overstreet@linux.dev> # (on IRC)
      Acked-by: default avatarMasami Hiramatsu (Google) <mhiramat@kernel.org>
      Signed-off-by: default avatarSteven Rostedt (Google) <rostedt@goodmis.org>
      b3ae7b67
  7. 06 Dec, 2023 2 commits
    • Steven Rostedt (Google)'s avatar
      ring-buffer: Test last update in 32bit version of __rb_time_read() · f458a145
      Steven Rostedt (Google) authored
      Since 64 bit cmpxchg() is very expensive on 32bit architectures, the
      timestamp used by the ring buffer does some interesting tricks to be able
      to still have an atomic 64 bit number. It originally just used 60 bits and
      broke it up into two 32 bit words where the extra 2 bits were used for
      synchronization. But this was not enough for all use cases, and all 64
      bits were required.
      
      The 32bit version of the ring buffer timestamp was then broken up into 3
      32bit words using the same counter trick. But one update was not done. The
      check to see if the read operation was done without interruption only
      checked the first two words and not last one (like it had before this
      update). Fix it by making sure all three updates happen without
      interruption by comparing the initial counter with the last updated
      counter.
      
      Link: https://lore.kernel.org/linux-trace-kernel/20231206100050.3100b7bb@gandalf.local.home
      
      Cc: stable@vger.kernel.org
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Fixes: f03f2abc ("ring-buffer: Have 32 bit time stamps use all 64 bits")
      Signed-off-by: default avatarSteven Rostedt (Google) <rostedt@goodmis.org>
      f458a145
    • Steven Rostedt (Google)'s avatar
      ring-buffer: Force absolute timestamp on discard of event · b2dd7975
      Steven Rostedt (Google) authored
      There's a race where if an event is discarded from the ring buffer and an
      interrupt were to happen at that time and insert an event, the time stamp
      is still used from the discarded event as an offset. This can screw up the
      timings.
      
      If the event is going to be discarded, set the "before_stamp" to zero.
      When a new event comes in, it compares the "before_stamp" with the
      "write_stamp" and if they are not equal, it will insert an absolute
      timestamp. This will prevent the timings from getting out of sync due to
      the discarded event.
      
      Link: https://lore.kernel.org/linux-trace-kernel/20231206100244.5130f9b3@gandalf.local.home
      
      Cc: stable@vger.kernel.org
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Fixes: 6f6be606 ("ring-buffer: Force before_stamp and write_stamp to be different on discard")
      Signed-off-by: default avatarSteven Rostedt (Google) <rostedt@goodmis.org>
      b2dd7975
  8. 05 Dec, 2023 3 commits
    • Petr Pavlu's avatar
      tracing: Fix a possible race when disabling buffered events · c0591b1c
      Petr Pavlu authored
      Function trace_buffered_event_disable() is responsible for freeing pages
      backing buffered events and this process can run concurrently with
      trace_event_buffer_lock_reserve().
      
      The following race is currently possible:
      
      * Function trace_buffered_event_disable() is called on CPU 0. It
        increments trace_buffered_event_cnt on each CPU and waits via
        synchronize_rcu() for each user of trace_buffered_event to complete.
      
      * After synchronize_rcu() is finished, function
        trace_buffered_event_disable() has the exclusive access to
        trace_buffered_event. All counters trace_buffered_event_cnt are at 1
        and all pointers trace_buffered_event are still valid.
      
      * At this point, on a different CPU 1, the execution reaches
        trace_event_buffer_lock_reserve(). The function calls
        preempt_disable_notrace() and only now enters an RCU read-side
        critical section. The function proceeds and reads a still valid
        pointer from trace_buffered_event[CPU1] into the local variable
        "entry". However, it doesn't yet read trace_buffered_event_cnt[CPU1]
        which happens later.
      
      * Function trace_buffered_event_disable() continues. It frees
        trace_buffered_event[CPU1] and decrements
        trace_buffered_event_cnt[CPU1] back to 0.
      
      * Function trace_event_buffer_lock_reserve() continues. It reads and
        increments trace_buffered_event_cnt[CPU1] from 0 to 1. This makes it
        believe that it can use the "entry" that it already obtained but the
        pointer is now invalid and any access results in a use-after-free.
      
      Fix the problem by making a second synchronize_rcu() call after all
      trace_buffered_event values are set to NULL. This waits on all potential
      users in trace_event_buffer_lock_reserve() that still read a previous
      pointer from trace_buffered_event.
      
      Link: https://lore.kernel.org/all/20231127151248.7232-2-petr.pavlu@suse.com/
      Link: https://lkml.kernel.org/r/20231205161736.19663-4-petr.pavlu@suse.com
      
      Cc: stable@vger.kernel.org
      Fixes: 0fc1b09f ("tracing: Use temp buffer when filtering events")
      Signed-off-by: default avatarPetr Pavlu <petr.pavlu@suse.com>
      Signed-off-by: default avatarSteven Rostedt (Google) <rostedt@goodmis.org>
      c0591b1c
    • Petr Pavlu's avatar
      tracing: Fix a warning when allocating buffered events fails · 34209fe8
      Petr Pavlu authored
      Function trace_buffered_event_disable() produces an unexpected warning
      when the previous call to trace_buffered_event_enable() fails to
      allocate pages for buffered events.
      
      The situation can occur as follows:
      
      * The counter trace_buffered_event_ref is at 0.
      
      * The soft mode gets enabled for some event and
        trace_buffered_event_enable() is called. The function increments
        trace_buffered_event_ref to 1 and starts allocating event pages.
      
      * The allocation fails for some page and trace_buffered_event_disable()
        is called for cleanup.
      
      * Function trace_buffered_event_disable() decrements
        trace_buffered_event_ref back to 0, recognizes that it was the last
        use of buffered events and frees all allocated pages.
      
      * The control goes back to trace_buffered_event_enable() which returns.
        The caller of trace_buffered_event_enable() has no information that
        the function actually failed.
      
      * Some time later, the soft mode is disabled for the same event.
        Function trace_buffered_event_disable() is called. It warns on
        "WARN_ON_ONCE(!trace_buffered_event_ref)" and returns.
      
      Buffered events are just an optimization and can handle failures. Make
      trace_buffered_event_enable() exit on the first failure and left any
      cleanup later to when trace_buffered_event_disable() is called.
      
      Link: https://lore.kernel.org/all/20231127151248.7232-2-petr.pavlu@suse.com/
      Link: https://lkml.kernel.org/r/20231205161736.19663-3-petr.pavlu@suse.com
      
      Fixes: 0fc1b09f ("tracing: Use temp buffer when filtering events")
      Signed-off-by: default avatarPetr Pavlu <petr.pavlu@suse.com>
      Signed-off-by: default avatarSteven Rostedt (Google) <rostedt@goodmis.org>
      34209fe8
    • Petr Pavlu's avatar
      tracing: Fix incomplete locking when disabling buffered events · 7fed14f7
      Petr Pavlu authored
      The following warning appears when using buffered events:
      
      [  203.556451] WARNING: CPU: 53 PID: 10220 at kernel/trace/ring_buffer.c:3912 ring_buffer_discard_commit+0x2eb/0x420
      [...]
      [  203.670690] CPU: 53 PID: 10220 Comm: stress-ng-sysin Tainted: G            E      6.7.0-rc2-default #4 56e6d0fcf5581e6e51eaaecbdaec2a2338c80f3a
      [  203.670704] Hardware name: Intel Corp. GROVEPORT/GROVEPORT, BIOS GVPRCRB1.86B.0016.D04.1705030402 05/03/2017
      [  203.670709] RIP: 0010:ring_buffer_discard_commit+0x2eb/0x420
      [  203.735721] Code: 4c 8b 4a 50 48 8b 42 48 49 39 c1 0f 84 b3 00 00 00 49 83 e8 01 75 b1 48 8b 42 10 f0 ff 40 08 0f 0b e9 fc fe ff ff f0 ff 47 08 <0f> 0b e9 77 fd ff ff 48 8b 42 10 f0 ff 40 08 0f 0b e9 f5 fe ff ff
      [  203.735734] RSP: 0018:ffffb4ae4f7b7d80 EFLAGS: 00010202
      [  203.735745] RAX: 0000000000000000 RBX: ffffb4ae4f7b7de0 RCX: ffff8ac10662c000
      [  203.735754] RDX: ffff8ac0c750be00 RSI: ffff8ac10662c000 RDI: ffff8ac0c004d400
      [  203.781832] RBP: ffff8ac0c039cea0 R08: 0000000000000000 R09: 0000000000000000
      [  203.781839] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
      [  203.781842] R13: ffff8ac10662c000 R14: ffff8ac0c004d400 R15: ffff8ac10662c008
      [  203.781846] FS:  00007f4cd8a67740(0000) GS:ffff8ad798880000(0000) knlGS:0000000000000000
      [  203.781851] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  203.781855] CR2: 0000559766a74028 CR3: 00000001804c4000 CR4: 00000000001506f0
      [  203.781862] Call Trace:
      [  203.781870]  <TASK>
      [  203.851949]  trace_event_buffer_commit+0x1ea/0x250
      [  203.851967]  trace_event_raw_event_sys_enter+0x83/0xe0
      [  203.851983]  syscall_trace_enter.isra.0+0x182/0x1a0
      [  203.851990]  do_syscall_64+0x3a/0xe0
      [  203.852075]  entry_SYSCALL_64_after_hwframe+0x6e/0x76
      [  203.852090] RIP: 0033:0x7f4cd870fa77
      [  203.982920] Code: 00 b8 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 66 90 b8 89 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d e9 43 0e 00 f7 d8 64 89 01 48
      [  203.982932] RSP: 002b:00007fff99717dd8 EFLAGS: 00000246 ORIG_RAX: 0000000000000089
      [  203.982942] RAX: ffffffffffffffda RBX: 0000558ea1d7b6f0 RCX: 00007f4cd870fa77
      [  203.982948] RDX: 0000000000000000 RSI: 00007fff99717de0 RDI: 0000558ea1d7b6f0
      [  203.982957] RBP: 00007fff99717de0 R08: 00007fff997180e0 R09: 00007fff997180e0
      [  203.982962] R10: 00007fff997180e0 R11: 0000000000000246 R12: 00007fff99717f40
      [  204.049239] R13: 00007fff99718590 R14: 0000558e9f2127a8 R15: 00007fff997180b0
      [  204.049256]  </TASK>
      
      For instance, it can be triggered by running these two commands in
      parallel:
      
       $ while true; do
          echo hist:key=id.syscall:val=hitcount > \
            /sys/kernel/debug/tracing/events/raw_syscalls/sys_enter/trigger;
        done
       $ stress-ng --sysinfo $(nproc)
      
      The warning indicates that the current ring_buffer_per_cpu is not in the
      committing state. It happens because the active ring_buffer_event
      doesn't actually come from the ring_buffer_per_cpu but is allocated from
      trace_buffered_event.
      
      The bug is in function trace_buffered_event_disable() where the
      following normally happens:
      
      * The code invokes disable_trace_buffered_event() via
        smp_call_function_many() and follows it by synchronize_rcu(). This
        increments the per-CPU variable trace_buffered_event_cnt on each
        target CPU and grants trace_buffered_event_disable() the exclusive
        access to the per-CPU variable trace_buffered_event.
      
      * Maintenance is performed on trace_buffered_event, all per-CPU event
        buffers get freed.
      
      * The code invokes enable_trace_buffered_event() via
        smp_call_function_many(). This decrements trace_buffered_event_cnt and
        releases the access to trace_buffered_event.
      
      A problem is that smp_call_function_many() runs a given function on all
      target CPUs except on the current one. The following can then occur:
      
      * Task X executing trace_buffered_event_disable() runs on CPU 0.
      
      * The control reaches synchronize_rcu() and the task gets rescheduled on
        another CPU 1.
      
      * The RCU synchronization finishes. At this point,
        trace_buffered_event_disable() has the exclusive access to all
        trace_buffered_event variables except trace_buffered_event[CPU0]
        because trace_buffered_event_cnt[CPU0] is never incremented and if the
        buffer is currently unused, remains set to 0.
      
      * A different task Y is scheduled on CPU 0 and hits a trace event. The
        code in trace_event_buffer_lock_reserve() sees that
        trace_buffered_event_cnt[CPU0] is set to 0 and decides the use the
        buffer provided by trace_buffered_event[CPU0].
      
      * Task X continues its execution in trace_buffered_event_disable(). The
        code incorrectly frees the event buffer pointed by
        trace_buffered_event[CPU0] and resets the variable to NULL.
      
      * Task Y writes event data to the now freed buffer and later detects the
        created inconsistency.
      
      The issue is observable since commit dea49978 ("tracing: Fix warning
      in trace_buffered_event_disable()") which moved the call of
      trace_buffered_event_disable() in __ftrace_event_enable_disable()
      earlier, prior to invoking call->class->reg(.. TRACE_REG_UNREGISTER ..).
      The underlying problem in trace_buffered_event_disable() is however
      present since the original implementation in commit 0fc1b09f
      ("tracing: Use temp buffer when filtering events").
      
      Fix the problem by replacing the two smp_call_function_many() calls with
      on_each_cpu_mask() which invokes a given callback on all CPUs.
      
      Link: https://lore.kernel.org/all/20231127151248.7232-2-petr.pavlu@suse.com/
      Link: https://lkml.kernel.org/r/20231205161736.19663-2-petr.pavlu@suse.com
      
      Cc: stable@vger.kernel.org
      Fixes: 0fc1b09f ("tracing: Use temp buffer when filtering events")
      Fixes: dea49978 ("tracing: Fix warning in trace_buffered_event_disable()")
      Signed-off-by: default avatarPetr Pavlu <petr.pavlu@suse.com>
      Signed-off-by: default avatarSteven Rostedt (Google) <rostedt@goodmis.org>
      7fed14f7