Commit 10dbf582 authored by Stan Hu's avatar Stan Hu

Merge branch 'stackprof' into 'master'

Trigger stackprof by sending a SIGUSR2 signal

See merge request gitlab-org/gitlab!35993
parents 7315fd5e 64c39f21
---
title: Trigger stackprof by sending a SIGUSR2 signal
merge_request: 35993
author:
type: performance
# frozen_string_literal: true
# trigger stackprof by sending a SIGUSR2 signal
#
# default settings:
# * collect raw samples
# * sample at 100hz (every 10k microseconds)
# * timeout profile after 30 seconds
# * write to $TMPDIR/stackprof.$PID.$RAND.profile
if Gitlab::Utils.to_boolean(ENV['STACKPROF_ENABLED'].to_s)
Gitlab::Cluster::LifecycleEvents.on_worker_start do
require 'stackprof'
require 'tmpdir'
Gitlab::AppJsonLogger.info "stackprof: listening on SIGUSR2 signal"
# create a pipe in order to propagate signal out of the signal handler
# see also: https://cr.yp.to/docs/selfpipe.html
read, write = IO.pipe
# create a separate thread that polls for signals on the pipe.
#
# this way we do not execute in signal handler context, which
# lifts restrictions and also serializes the calls in a thread-safe
# manner.
#
# it's very similar to a goroutine and channel design.
#
# another nice benefit of this method is that we can timeout the
# IO.select call, allowing the profile to automatically stop after
# a given interval (by default 30 seconds), avoiding unbounded memory
# growth from a profile that was started and never stopped.
t = Thread.new do
timeout_s = ENV['STACKPROF_TIMEOUT_S']&.to_i || 30
current_timeout_s = nil
loop do
got_value = IO.select([read], nil, nil, current_timeout_s)
read.getbyte if got_value
if StackProf.running?
stackprof_file_prefix = ENV['STACKPROF_FILE_PREFIX'] || Dir.tmpdir
stackprof_out_file = "#{stackprof_file_prefix}/stackprof.#{Process.pid}.#{SecureRandom.hex(6)}.profile"
Gitlab::AppJsonLogger.info(
event: "stackprof",
message: "stopping profile",
output_filename: stackprof_out_file,
pid: Process.pid,
timeout_s: timeout_s,
timed_out: got_value.nil?
)
StackProf.stop
StackProf.results(stackprof_out_file)
current_timeout_s = nil
else
Gitlab::AppJsonLogger.info(
event: "stackprof",
message: "starting profile",
pid: Process.pid
)
StackProf.start(
raw: Gitlab::Utils.to_boolean(ENV['STACKPROF_RAW'] || 'true'),
interval: ENV['STACKPROF_INTERVAL_US']&.to_i || 10_000
)
current_timeout_s = timeout_s
end
end
end
t.abort_on_exception = true
# in the case of puma, this will override the existing SIGUSR2 signal handler
# that can be used to trigger a restart.
#
# puma cluster has two types of restarts:
# * SIGUSR1: phased restart
# * SIGUSR2: restart
#
# phased restart is not supported in our configuration, because we use
# preload_app. this means we will always perform a normal restart.
# additionally, phased restart is not supported when sending a SIGUSR2
# directly to a puma worker (as opposed to the master process).
#
# the result is that the behaviour of SIGUSR1 and SIGUSR2 is identical in
# our configuration, and we can always use a SIGUSR1 to perform a restart.
#
# thus, it is acceptable for us to re-appropriate the SIGUSR2 signal, and
# override the puma behaviour.
#
# see also:
# * https://github.com/puma/puma/blob/master/docs/signals.md#puma-signals
# * https://github.com/phusion/unicorn/blob/master/SIGNALS
# * https://github.com/mperham/sidekiq/wiki/Signals
Signal.trap('SIGUSR2') do
write.write('.')
end
end
end
...@@ -36,7 +36,6 @@ graphs/dashboards. ...@@ -36,7 +36,6 @@ graphs/dashboards.
GitLab provides built-in tools to help improve performance and availability: GitLab provides built-in tools to help improve performance and availability:
- [Profiling](profiling.md). - [Profiling](profiling.md).
- [Sherlock](profiling.md#sherlock).
- [Distributed Tracing](distributed_tracing.md) - [Distributed Tracing](distributed_tracing.md)
- [GitLab Performance Monitoring](../administration/monitoring/performance/index.md). - [GitLab Performance Monitoring](../administration/monitoring/performance/index.md).
- [Request Profiling](../administration/monitoring/performance/request_profiling.md). - [Request Profiling](../administration/monitoring/performance/request_profiling.md).
...@@ -108,16 +107,24 @@ In short: ...@@ -108,16 +107,24 @@ In short:
## Profiling ## Profiling
By collecting snapshots of process state at regular intervals, profiling allows By collecting snapshots of process state at regular intervals, profiling allows
you to see where time is spent in a process. The [StackProf](https://github.com/tmm1/stackprof) you to see where time is spent in a process. The
gem is included in GitLab's development environment, allowing you to investigate [Stackprof](https://github.com/tmm1/stackprof) gem is included in GitLab,
the behavior of suspect code in detail. allowing you to profile which code is running on CPU in detail.
It's important to note that profiling an application *alters its performance*, It's important to note that profiling an application *alters its performance*.
and will generally be done *in an unrepresentative environment*. In particular, Different profiling strategies have different overheads. Stackprof is a sampling
a method is not necessarily troublesome just because it's executed many times, profiler. It will sample stack traces from running threads at a configurable
or takes a long time to execute. Profiles are tools you can use to better frequency (e.g. 100hz, that is 100 stacks per second). This type of profiling
understand what is happening in an application - using that information wisely has quite a low (albeit non-zero) overhead and is generally considered to be
is up to you! safe for production.
### Development
A profiler can be a very useful tool during development, even if it does run *in
an unrepresentative environment*. In particular, a method is not necessarily
troublesome just because it's executed many times, or takes a long time to
execute. Profiles are tools you can use to better understand what is happening
in an application - using that information wisely is up to you!
Keeping that in mind, to create a profile, identify (or create) a spec that Keeping that in mind, to create a profile, identify (or create) a spec that
exercises the troublesome code path, then run it using the `bin/rspec-stackprof` exercises the troublesome code path, then run it using the `bin/rspec-stackprof`
...@@ -211,11 +218,56 @@ application code, these profiles can be used to investigate slow tests as well. ...@@ -211,11 +218,56 @@ application code, these profiles can be used to investigate slow tests as well.
However, for smaller runs (like this example), this means that the cost of However, for smaller runs (like this example), this means that the cost of
setting up the test suite will tend to dominate. setting up the test suite will tend to dominate.
It's also possible to modify the application code in-place to output profiles ### Production
whenever a particular code path is triggered without going through the test
suite first. See the Stackprof can also be used to profile production workloads.
[StackProf documentation](https://github.com/tmm1/stackprof/blob/master/README.md)
for details. In order to enable production profiling for Ruby processes, you can set the `STACKPROF_ENABLED` environment variable to `true`.
The following configuration options can be configured:
- `STACKPROF_ENABLED`: Enables stackprof signal handler on SIGUSR2 signal.
Defaults to `false`.
- `STACKPROF_INTERVAL_US`: Sampling interval in microseconds. Defaults to
`10000` μs (100hz).
- `STACKPROF_FILE_PREFIX`: File path prefix where profiles are stored. Defaults
to `$TMPDIR` (often corresponds to `/tmp`).
- `STACKPROF_TIMEOUT_S`: Profiling timeout in seconds. Profiling will
automatically stop after this time has elapsed. Defaults to `30`.
- `STACKPROF_RAW`: Whether to collect raw samples or only aggregates. Raw
samples are needed to generate flamegraphs, but they do have a higher memory
and disk overhead. Defaults to `true`.
Once enabled, profiling can be triggered by sending a `SIGUSR2` signal to the
Ruby process. The process will begin sampling stacks. Profiling can be stopped
by sending another `SIGUSR2`. Alternatively, it will automatically stop after
the timeout.
Once profiling stops, the profile is written out to disk at
`$STACKPROF_FILE_PREFIX/stackprof.$PID.$RAND.profile`. It can then be inspected
further via the `stackprof` command line tool, as described in the previous
section.
Currently supported profiling targets are:
- Puma worker
- Sidekiq
NOTE: **Note:** The Puma master process is not supported. Neither is Unicorn.
Sending SIGUSR2 to either of those will trigger restarts. In the case of Puma,
take care to only send the signal to Puma workers.
This can be done via `pkill -USR2 puma:`. The `:` disambiguates between `puma
4.3.3.gitlab.2 ...` (the master process) from `puma: cluster worker 0: ...` (the
worker processes), selecting the latter.
Production profiles can be especially noisy. It can be helpful to visualize them
as a [flamegraph](https://github.com/brendangregg/FlameGraph). This can be done
via:
```shell
bundle exec stackprof --stackcollapse /tmp/stackprof.55769.c6c3906452.profile | flamegraph.pl > flamegraph.svg
```
## RSpec profiling ## RSpec profiling
......
Markdown is supported
0%
or
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment