Commit f99d6755 authored by Mike Jang

Merge branch 'mwaw/add_postgres_distributed_hll_documentation' into 'master'

Update usage ping guide

See merge request gitlab-org/gitlab!50270
parents 8fefb07a b93684e3
......@@ -132,8 +132,8 @@ general guidelines around how to collect those, due to the individual nature of
There are several types of counters which are all found in `usage_data.rb`:
- **Ordinary Batch Counters:** Simple count of a given ActiveRecord_Relation
- **Distinct Batch Counters:** Distinct count of a given ActiveRecord_Relation on given column
- **Sum Batch Counters:** Sum the values of a given ActiveRecord_Relation on given column
- **Distinct Batch Counters:** Distinct count of a given ActiveRecord_Relation in a given column
- **Sum Batch Counters:** Sum the values of a given ActiveRecord_Relation in a given column
- **Alternative Counters:** Used for settings and configurations
- **Redis Counters:** Used for in-memory counts.
......@@ -153,7 +153,15 @@ For GitLab.com, there are extremely large tables with 15 second query timeouts,
| `merge_request_diff_files` | 1082 |
| `events` | 514 |
There are two batch counting methods provided, `Ordinary Batch Counters` and `Distinct Batch Counters`. Batch counting requires indexes on columns to calculate max, min, and range queries. In some cases, a specialized index may need to be added on the columns involved in a counter.
We have several batch counting methods available:
- `Ordinary Batch Counters`
- `Distinct Batch Counters`
- `Sum Batch Counters`
- `Estimated Batch Counters`
Batch counting requires indexes on columns to calculate max, min, and range queries. In some cases,
you may need to add a specialized index on the columns involved in a counter.
### Ordinary Batch Counters
......@@ -248,6 +256,76 @@ sum(Issue.group(:state_id), :weight))
# returns => {1=>3542, 2=>6820}
```
### Estimated Batch Counters
> - [Introduced](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/48233) in GitLab 13.7.
Estimated batch counter functionality handles `ActiveRecord::StatementInvalid` errors
when used through the provided `estimate_batch_distinct_count` method.
If such an error occurs, the method returns a value of `-1`.
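The fallback behavior described above can be pictured with a minimal sketch. It is not GitLab's implementation: `StatementInvalid` is a local stand-in for `ActiveRecord::StatementInvalid` so the snippet runs without Rails, and `with_fallback` and `FALLBACK` are hypothetical names.

```ruby
# Local stand-in for ActiveRecord::StatementInvalid (assumption, so no Rails is needed).
class StatementInvalid < StandardError; end

# Hypothetical constant mirroring the documented error value.
FALLBACK = -1

# Runs the given block and returns FALLBACK if the query raises StatementInvalid.
def with_fallback
  yield
rescue StatementInvalid
  FALLBACK
end

ok     = with_fallback { 1_234 }
failed = with_fallback { raise StatementInvalid, 'canceling statement due to statement timeout' }

puts ok      # the block's own value
puts failed  # the fallback value
```

A consumer of the metric can then treat `-1` as "count unavailable" rather than as a real count.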
WARNING:
This functionality estimates a distinct count of a specific ActiveRecord_Relation in a given column,
which uses the [HyperLogLog](http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf) algorithm.
As the HyperLogLog algorithm is probabilistic, the **results always include error**.
The highest encountered error rate is 4.9%.
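To make the source of that error concrete, here is a minimal HyperLogLog sketch in plain Ruby. It is an illustration only and not the PostgreSQL-based implementation behind `estimate_batch_distinct_count`: the class name `TinyHll`, the choice of MD5 as the hash, and the 4,096 registers are all assumptions made for this example.

```ruby
require 'digest'

# Minimal HyperLogLog estimator (illustrative; hypothetical class name).
class TinyHll
  def initialize(bits = 12)
    @bits = bits                 # number of hash bits used to pick a register
    @m = 1 << bits               # number of registers
    @registers = Array.new(@m, 0)
  end

  def add(value)
    # Use the first 64 bits of an MD5 digest as a deterministic hash.
    hash = Digest::MD5.digest(value.to_s).unpack1('Q>')
    width = 64 - @bits
    index = hash >> width                    # top bits select the register
    rest = hash & ((1 << width) - 1)         # remaining bits
    rank = width - rest.bit_length + 1       # position of the first 1 bit
    @registers[index] = rank if rank > @registers[index]
  end

  def count
    alpha = 0.7213 / (1 + 1.079 / @m)        # bias correction for m >= 128
    raw = alpha * @m * @m / @registers.sum { |r| 2.0**-r }
    zeros = @registers.count(0)
    # Small-range correction (linear counting) when many registers are empty.
    return (@m * Math.log(@m.to_f / zeros)).round if raw <= 2.5 * @m && zeros > 0

    raw.round
  end
end

hll = TinyHll.new
3.times { 50_000.times { |i| hll.add("user-#{i}") } } # duplicates do not change the state
puts hll.count # close to, but generally not exactly, 50000
```

The estimate depends only on the set of values seen, so repeated values cost nothing extra, but the result is an approximation whose accuracy depends on the number of registers.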
When correctly used, the `estimate_batch_distinct_count` method enables efficient counting over
columns that contain non-unique values, which other counters cannot guarantee.
Method: [`estimate_batch_distinct_count(relation, column = nil, batch_size: nil, start: nil, finish: nil)`](https://gitlab.com/gitlab-org/gitlab/-/blob/master/lib/gitlab/utils/usage_data.rb#L63)
The method includes the following arguments:
- `relation`: The ActiveRecord_Relation on which to perform the count.
- `column`: The column on which to perform the distinct count. The default is the primary key.
- `batch_size`: The default is 10,000, from `Gitlab::Database::PostgresHll::BatchDistinctCounter::DEFAULT_BATCH_SIZE`.
- `start`: The custom start of the batch count, to avoid complex minimum calculations.
- `finish`: The custom end of the batch count, to avoid complex maximum calculations.
The method includes the following prerequisites:
1. The supplied `relation` must include a primary key defined as a numeric column.
For example: `id bigint NOT NULL`.
1. The `estimate_batch_distinct_count` method can handle a joined relation. To use its ability to
count non-unique columns, the joined relation **must NOT** have a one-to-many relationship,
such as `has_many :boards`.
1. Both `start` and `finish` arguments should always represent primary key relationship values,
even if the estimated count refers to another column, for example:
```ruby
estimate_batch_distinct_count(::Note, :author_id, start: ::Note.minimum(:id), finish: ::Note.maximum(:id))
```
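The last prerequisite is easier to see in a small model of batching. The sketch below is a simplification under stated assumptions, not the real counter: `Note` is an in-memory stand-in for a table, `batched_distinct_count` is a hypothetical helper, `finish` is treated as inclusive, and an exact `Set` takes the place of HyperLogLog. It shows that the batches walk the primary key range while the distinct values come from another column.

```ruby
require 'set'

# Hypothetical in-memory stand-in for a row of the notes table.
Note = Struct.new(:id, :author_id)

# Walks the primary key range [start, finish] in windows of batch_size and
# collects the distinct values of `column` seen in each window.
# Exact counting with a Set replaces HyperLogLog to keep the sketch short.
def batched_distinct_count(rows, column, batch_size:, start:, finish:)
  seen = Set.new
  batch_start = start
  while batch_start <= finish
    batch_end = [batch_start + batch_size - 1, finish].min
    rows.each do |row|
      seen << row[column] if row.id.between?(batch_start, batch_end)
    end
    batch_start = batch_end + 1
  end
  seen.size
end

notes = [Note.new(1, 7), Note.new(2, 7), Note.new(5, 9), Note.new(8, 7), Note.new(9, 3)]
ids = notes.map(&:id)

# start and finish are primary key values, even though author_id is counted.
puts batched_distinct_count(notes, :author_id, batch_size: 4, start: ids.min, finish: ids.max)
```

Passing `author_id` values as `start` and `finish` in this model would select the wrong windows of rows, which is why the boundaries refer to the primary key.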
Examples:
1. Simple execution of the estimated batch counter with only a relation provided. The returned value represents the estimated
number of unique values in the `id` column (the primary key) of the `Project` relation:
```ruby
estimate_batch_distinct_count(::Project)
```
1. Execution of the estimated batch counter where the provided relation has an additional filter applied (`.where(time_period)`), the number of unique values is estimated in a custom column (`:author_id`), and the `start` and `finish` parameters together set the boundaries of the range of the provided relation to analyze:
```ruby
estimate_batch_distinct_count(::Note.with_suggestions.where(time_period), :author_id, start: ::Note.minimum(:id), finish: ::Note.maximum(:id))
```
1. Execution of the estimated batch counter with a joined relation (`joins(:cluster)`), for a custom column (`'clusters.user_id'`):
```ruby
estimate_batch_distinct_count(::Clusters::Applications::CertManager.where(time_period).available.joins(:cluster), 'clusters.user_id')
```
When instrumenting a metric that uses the estimated batch counter, add the `_estimated` suffix to its name, for example:
```ruby
"counts": {
"ci_builds_estimated": estimate_batch_distinct_count(Ci::Build),
...
```
### Redis Counters
Handles `::Redis::CommandError` and `Gitlab::UsageDataCounters::BaseCounter::UnknownEvent`
......@@ -309,6 +387,10 @@ Examples of implementation:
#### Redis HLL Counters
WARNING:
HyperLogLog (HLL) is a probabilistic algorithm and its **results always include some small error**. According to [Redis documentation](https://redis.io/commands/pfcount), data from
the HLL implementation used is "approximated with a standard error of 0.81%".
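The quoted 0.81% figure can be reproduced with a short calculation. This is a worked example rather than anything taken from the GitLab code base: it relies on the usual HyperLogLog approximation that the standard error is about `1.04 / sqrt(m)` for `m` registers, and on the Redis implementation using 16,384 registers.

```ruby
# Register count used by the Redis HyperLogLog implementation (2**14).
REGISTERS = 16_384

# Standard error of a HyperLogLog estimate: about 1.04 / sqrt(m) for m registers.
standard_error = 1.04 / Math.sqrt(REGISTERS)

puts format('%.2f%%', standard_error * 100) # prints 0.81%
```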
`Gitlab::UsageDataCounters::HLLRedisCounter` provides data structures used to count unique values.
It is implemented using the Redis commands [PFADD](https://redis.io/commands/pfadd) and [PFCOUNT](https://redis.io/commands/pfcount).
......@@ -783,8 +865,6 @@ appear to be associated to any of the services running, since they all appear to
## Aggregated metrics
> - [Introduced](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/45979) in GitLab 13.6.
> - It's [deployed behind a feature flag](../../user/feature_flags.md), disabled by default.
> - It's enabled on GitLab.com.
WARNING:
This feature is intended solely for internal GitLab use.
......