Commit a70428b3 authored by Grzegorz Bizon

Merge branch 'docs/gb/architecture/update-ci-scaling' into 'master'

Update CI/CD Scaling blueprint with the current state

See merge request gitlab-org/gitlab!76111
parents c1a020ce fa3af0c1
...@@ -5,7 +5,7 @@ comments: false ...@@ -5,7 +5,7 @@ comments: false
description: 'Improve scalability of GitLab CI/CD' description: 'Improve scalability of GitLab CI/CD'
--- ---
# Next CI/CD scale target: 20M builds per day by 2024 # CI/CD Scaling
## Summary ## Summary
...@@ -20,13 +20,8 @@ store all the builds in PostgreSQL in `ci_builds` table, and because we are ...@@ -20,13 +20,8 @@ store all the builds in PostgreSQL in `ci_builds` table, and because we are
creating more than [2 million builds each day on GitLab.com](https://docs.google.com/spreadsheets/d/17ZdTWQMnTHWbyERlvj1GA7qhw_uIfCoI5Zfrrsh95zU), creating more than [2 million builds each day on GitLab.com](https://docs.google.com/spreadsheets/d/17ZdTWQMnTHWbyERlvj1GA7qhw_uIfCoI5Zfrrsh95zU),
we are reaching database limits that are slowing our development velocity down. we are reaching database limits that are slowing our development velocity down.
On February 1st, 2021, a billionth CI/CD job was created and the number of On February 1st, 2021, GitLab.com surpassed 1 billion CI/CD builds created and the number of
builds is growing exponentially. We will run out of the available primary keys builds continues to grow exponentially.
for builds before December 2021 unless we improve the database model used to
store CI/CD data.
We expect to see 20M builds created daily on GitLab.com in the first half of
2024.
![CI builds cumulative with forecast](ci_builds_cumulative_forecast.png) ![CI builds cumulative with forecast](ci_builds_cumulative_forecast.png)
...@@ -60,8 +55,8 @@ that have the same problem. ...@@ -60,8 +55,8 @@ that have the same problem.
Primary keys problem will be tackled by our Database Team. Primary keys problem will be tackled by our Database Team.
Status: As of October 2021 the primary keys in CI tables have been migrated to **Status**: As of October 2021 the primary keys in CI tables have been migrated
big integers. to big integers.
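
The overflow risk behind this migration can be illustrated with a rough headroom estimate. This is only a sketch: it assumes the original primary keys were 32-bit signed integers (PostgreSQL `integer`), and it assumes a constant daily rate, which understates the exponential growth described in the blueprint. The function name `days_until_exhaustion` is hypothetical.

```python
# Largest values of PostgreSQL's signed 32-bit and 64-bit integer types.
INT4_MAX = 2**31 - 1
INT8_MAX = 2**63 - 1


def days_until_exhaustion(current_id: int, builds_per_day: int, max_id: int) -> int:
    """Return whole days until the key space is used up at a constant rate.

    Assumption: one primary key is consumed per build and the daily rate
    does not change, so this is an optimistic upper bound under growth.
    """
    if builds_per_day <= 0:
        raise ValueError("builds_per_day must be positive")
    return max(0, (max_id - current_id) // builds_per_day)


# Figures taken from the blueprint text: about 1 billion builds created,
# roughly 2 million new builds per day.
int4_days = days_until_exhaustion(1_000_000_000, 2_000_000, INT4_MAX)
int8_days = days_until_exhaustion(1_000_000_000, 2_000_000, INT8_MAX)
```

Even at a flat rate the 32-bit key space lasts well under two years from the billionth build, while the 64-bit key space is effectively unbounded for planning purposes.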
### The table is too large ### The table is too large
...@@ -84,6 +79,14 @@ seem fine in the development environment may not work on GitLab.com. The ...@@ -84,6 +79,14 @@ seem fine in the development environment may not work on GitLab.com. The
difference in the dataset size between the environments makes it difficult to difference in the dataset size between the environments makes it difficult to
predict the performance of even the most simple queries. predict the performance of even the most simple queries.
Team members and the wider community members are struggling to contribute to the
Verify area, because we restricted the possibility of extending `ci_builds`
even further. Our static analysis tools prevent adding more columns to this
table. Adding new queries is unpredictable because of the size of the dataset
and the number of queries executed using the table. This significantly hinders
the development velocity and contributes to incidents in the production
environment.
We also expect a significant, exponential growth in the upcoming years. We also expect a significant, exponential growth in the upcoming years.
One of the forecasts done using [Facebook's One of the forecasts done using [Facebook's
...@@ -94,6 +97,10 @@ sustain in upcoming years. ...@@ -94,6 +97,10 @@ sustain in upcoming years.
![CI builds daily forecast](ci_builds_daily_forecast.png) ![CI builds daily forecast](ci_builds_daily_forecast.png)
**Status**: As of October 2021 we reduced the growth rate of `ci_builds` table
by writing build options and variables to `ci_builds_metadata` table. We plan
to ship further improvements that will be described in a separate blueprint.
### Queuing mechanisms are using the large table ### Queuing mechanisms are using the large table
Because of how large the table is, mechanisms that we use to build queues of Because of how large the table is, mechanisms that we use to build queues of
...@@ -114,8 +121,8 @@ table that will accelerate SQL queries used to build ...@@ -114,8 +121,8 @@ table that will accelerate SQL queries used to build
queues](https://gitlab.com/gitlab-org/gitlab/-/issues/322766) and we want to queues](https://gitlab.com/gitlab-org/gitlab/-/issues/322766) and we want to
explore them. explore them.
Status: the new architecture [has been implemented on GitLab.com](https://gitlab.com/groups/gitlab-org/-/epics/5909#note_680407908). **Status**: As of October 2021 the new architecture [has been implemented on
GitLab.com](https://gitlab.com/groups/gitlab-org/-/epics/5909#note_680407908).
The following epic tracks making it generally available: [Make the new pending The following epic tracks making it generally available: [Make the new pending
builds architecture generally available]( builds architecture generally available](
https://gitlab.com/groups/gitlab-org/-/epics/6954). https://gitlab.com/groups/gitlab-org/-/epics/6954).
...@@ -136,17 +143,8 @@ columns, tables, partitions or database shards. ...@@ -136,17 +143,8 @@ columns, tables, partitions or database shards.
Effort to improve background migrations will be owned by our Database Team. Effort to improve background migrations will be owned by our Database Team.
Status: In progress. **Status**: In progress. We plan to ship further improvements that will be
described in a separate architectural blueprint.
### Development velocity is negatively affected
Team members and the wider community members are struggling to contribute to the
Verify area, because we restricted the possibility of extending `ci_builds`
even further. Our static analysis tools prevent adding more columns to this
table. Adding new queries is unpredictable because of the size of the dataset
and the number of queries executed using the table. This significantly hinders
the development velocity and contributes to incidents in the production
environment.
## Proposal ## Proposal
...@@ -157,32 +155,34 @@ First, we want to focus on things that are urgently needed right now. We need ...@@ -157,32 +155,34 @@ First, we want to focus on things that are urgently needed right now. We need
to fix primary keys overflow risk and unblock other teams that are working on to fix primary keys overflow risk and unblock other teams that are working on
database partitioning and sharding. database partitioning and sharding.
We want to improve situation around bottlenecks that are known already, like We want to improve known bottlenecks, like
queuing mechanisms using the large table and things that are holding other builds queuing mechanisms that are using the large table, and other things that
teams back. are holding other teams back.
Extending CI/CD metrics is important to get a better sense of how the system Extending CI/CD metrics is important to get a better sense of how the system
performs and what growth we should expect. This will make it easier for us performs and what growth we should expect. This will make it easier for us
to identify bottlenecks and perform more advanced capacity planning. to identify bottlenecks and perform more advanced capacity planning.
As we work on first iterations we expect our Database Sharding team and Next step is to better understand how we can leverage strong time-decay
Database Scalability Working Group to make progress on patterns we will be able characteristic of CI/CD data. This might help us to partition CI/CD dataset to
to use to partition the large CI/CD dataset. We consider the strong time-decay reduce the size of CI/CD database tables.
effect, related to the diminishing importance of pipelines with time, as an
opportunity we might want to seize.
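
As a rough illustration of the time-decay idea, the sketch below routes a build to a monthly partition by its creation date and flags older partitions as "cold". The blueprint does not specify a partition key, granularity, or naming scheme, so the monthly split, the three-month hot window, and the names `partition_name` and `is_cold` are all assumptions made for this example.

```python
from datetime import date


def partition_name(created_on: date, table: str = "ci_builds") -> str:
    """Return a hypothetical monthly partition name for a row's creation date."""
    return f"{table}_{created_on.year:04d}_{created_on.month:02d}"


def months_between(older: date, newer: date) -> int:
    """Count calendar-month boundaries crossed between two dates."""
    return (newer.year - older.year) * 12 + (newer.month - older.month)


def is_cold(created_on: date, today: date, hot_months: int = 3) -> bool:
    """Treat data older than the hot window as rarely accessed.

    Assumption: pipeline data loses importance with age, so partitions
    outside the window could be kept out of the hot working set.
    """
    return months_between(created_on, today) >= hot_months
```

With this kind of routing, queries for recent pipelines would touch only a few small partitions, which is the size reduction the paragraph above is aiming for.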
## Iterations ## Iterations
Work required to achieve our next CI/CD scaling target is tracked in the Work required to achieve our next CI/CD scaling target is tracked in the
[GitLab CI/CD 20M builds per day scaling [CI/CD Scaling](https://gitlab.com/groups/gitlab-org/-/epics/5745) epic.
target](https://gitlab.com/groups/gitlab-org/-/epics/5745) epic.
1. ✓ Migrate primary keys to big integers on GitLab.com.
1. ✓ Implement the new architecture of builds queuing on GitLab.com.
1. Make the new builds queuing architecture generally available.
1. Partition CI/CD data using time-decay pattern.
## Status ## Status
|-------------|--------------| |-------------|--------------|
| Created at | 21.01.2021 | | Created at | 21.01.2021 |
| Approved at | 26.04.2021 | | Approved at | 26.04.2021 |
| Updated at | 28.10.2021 | | Updated at | 06.12.2021 |
Status: In progress. Status: In progress.
...@@ -215,6 +215,7 @@ Domain experts: ...@@ -215,6 +215,7 @@ Domain experts:
| Area | Who | Area | Who
|------------------------------|------------------------| |------------------------------|------------------------|
| Domain Expert / Verify | Fabio Pitino | | Domain Expert / Verify | Fabio Pitino |
| Domain Expert / Verify | Marius Bobin |
| Domain Expert / Database | Jose Finotto | | Domain Expert / Database | Jose Finotto |
| Domain Expert / PostgreSQL | Nikolay Samokhvalov | | Domain Expert / PostgreSQL | Nikolay Samokhvalov |
......