Describe three parallel workstreams for CI/CD time-decay

This commit also assigns an AEC and a domain expert.

12491bd1 · Grzegorz Bizon · aea0661a · 12491bd1
Commit 12491bd1 authored Dec 03, 2021 by Grzegorz Bizon

Showing with 108 additions and 14 deletions

doc/architecture/blueprints/ci_data_decay/index.md +108 -14

--- a/doc/architecture/blueprints/ci_data_decay/index.md
+++ b/doc/architecture/blueprints/ci_data_decay/index.md
@@ -48,34 +48,125 @@ the upcoming years.
 CI/CD data is subject to
 [time-decay](https://about.gitlab.com/company/team/structure/working-groups/database-scalability/time-decay.html)
 because, usually, pipelines that are a few months old are not frequently
-accessed or relevant anymore. Restricting access to processing pipelines that
+accessed or are not even relevant anymore. Restricting access to processing
-are longer than a few months might help us to Describe our approach to
+pipelines that are longer than a few months might help us to move this data to
-partitioning using time-decay pattern.
+a different storage that is more performant and cost effective.
-It is already possible to prevent processing of builds [that have been
+It is already possible to prevent processing builds [that have been
 archived](/ee/user/admin_area/settings/continuous_integration.html#archive-jobs).
 When a build gets archived it will not be possible to retry it, but we do not
-remove data from the database.
+move data from the database.
 In order to improve performance and make it easier to scale CI/CD data storage
-we might want to:
+we might want to follow these three tracks described below.
-1. Make it possible to move data out of PostgreSQL to a different data store
+### Move rarely accessed data
-   when a build, or a pipeline, gets archived using a retention policy.
-1. Make it possible to partition certain database tables, storing CI/CD data,
-   using the same retention policy as in the point above.
-### Data retention
+Once a build (or a pipeline) gets archived, it is no longer possible to resume
+pipeline processing in such pipeline. It means that all the metadata, we store
+in PostgreSQL, that is needed to efficiently and reliably process builds can be
+safely moved to a different data store.
-### Partitioning
+Currently, storing pipeline processing data is expensive as this kind of CI/CD
+data represents a significant portion of data stored in CI/CD tables. Once we
+restrict access to processing archived pipelines, we can move this metadata to
+a different place - preferably object storage - and make it accessible on
+demand, when it is really needed again (for example for compliance purposes).
-### Archived data access
+Epic: [Link]
+### Partition rarely accessed data
+After we move CI/CD metadata to a different store, the problem of having
+billions of rows describing pipelines, builds and artifacts, remains. We still
+need to keep reference to the metadata we store in object storage and we still
+do need to be able to retrieve this information reliably in bulk (or search
+through it).
+It means that by moving data to object storage we might not be able to reduce
+the number of rows in CI/CD tables. Moving data to object storage should help
+with reducing the data size, but not the quantity of entries describing this
+data. Because of this limitation, we still want to partition CI/CD data to
+reduce the impact on the database (indices size, auto-vacuum time and
+frequency).
+Partitioning rarely accessed data should also follow the policy defined for
+builds archival, to make it consistent and reliable.
+Epic: [Link]
+### Partition frequently used queuing tables
+While working on the [CI/CD Scale](https://docs.gitlab.com/ee/architecture/blueprints/ci_scale/)
+architecture, we have introduced a [new architecture for queuing CI/CD builds](https://gitlab.com/groups/gitlab-org/-/epics/5909#note_680407908)
+for execution.
+This allowed us to significantly improve performance, but we still consider
+the new solution as an intermediate mechanism, needed before we start working
+on the next iteration, that should improve the architecture of builds queuing
+even more (it might require moving off the PostgreSQL fully or partially).
+In the meantime we want to ship another iteration, an intermediate step towards
+a more flexible and reliable solution. We want to partition the new queuing
+tables, to reduce the impact on the database, to improve reliability and
+database health.
+Partitioning of CI/CD queuing tables does not need to follow the policy defined
+for builds archival. Instead we should leverage a long-standing policy saying
+that builds created more than 24 hours ago need to be removed from the queue.
+This business rule has been present in the product since the inception of GitLab CI.
+Epic: [Prepare queuing tables for list-style partitioning](https://gitlab.com/gitlab-org/gitlab/-/issues/347027).
+## Caveats
+All three tracks we will use to implement the CI/CD time-decay pattern are
+associated with some challenges. The most important ones are documented below.
+### Removing data
+While it might be tempting to simply remove old or archived data from our
+database, this should be avoided. We should not permanently remove user data
+unless consent is given to do so. We can, however, move data to a different
+data store, like object storage.
+Archived data can still be needed sometimes (for example for compliance
+reasons). We want to be able to retrieve this data if needed, as long as
+permanent removal has not been requested or approved by a user.
+### Accessing data
+Implementing CI/CD data time-decay through partitioning might be challenging
+when we still want to make it possible for users to access data stored across
+many partitions.
+In order to do that we will need to make sure that when archived data needs to be
+accessed, users provide a time range in which the data has been created. In
+order to make it efficient it might be necessary to restrict access to querying
+data residing in more than two partitions at once. We can do that by supporting
+time ranges spanning a duration equal to the builds archival policy.
+#### Merge request pipelines
+Once we partition CI/CD data, especially CI builds, we need to find an
+efficient mechanism to present pipeline statuses in merge requests.
+How to exactly do that is an implementation detail that we will need to figure
+out as the work progresses. We do have many tools to achieve that - data
+denormalization, routing reads to proper partitions based on data stored with a
+merge request.
 ## Iterations
+All three tracks can be worked on in parallel:
+1. [Move archived CI/CD data to object storage](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/68228)
+2. [Partition CI/CD tables using CI/CD data retention policy](LINK)
+3. [Partition CI/CD queuing tables using list partitioning](https://gitlab.com/gitlab-org/gitlab/-/issues/347027)
 ## Status
-In progress.
+Request For Comments.
 ## Who
@@ -88,6 +179,7 @@ Proposal:
 | Author                       | Grzegorz Bizon          |
 | Engineering Leader           | Cheryl Li               |
 | Product Manager              | Jackie Porter           |
+| Architecture Evolution Coach | Kamil Trzciński         |
 DRIs:
@@ -101,5 +193,7 @@ Domain experts:
 | Area                         | Who
 |------------------------------|------------------------|
+| Continuous Integration       | Marius Bobin           |
+| PostgreSQL Database          | Andreas Brandl         |
 <!-- vale gitlab.Spelling = YES -->
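
The "Accessing data" caveat in this change proposes limiting queries over archived data to at most two partitions by capping the requested time range at the builds archival policy. The following is a minimal Python sketch of why that cap works, not part of the commit. It assumes fixed-width partitions whose width equals the archival policy; the 90-day policy, the partition origin, and the function names are all hypothetical and chosen only for illustration.

```python
from datetime import datetime, timedelta

# Assumptions for illustration only: the blueprint does not define these values.
ARCHIVAL_POLICY = timedelta(days=90)   # hypothetical builds archival policy
PARTITION_ORIGIN = datetime(2012, 1, 1)  # hypothetical start of the first partition


def partition_index(timestamp: datetime) -> int:
    """Return the index of the fixed-width partition containing `timestamp`."""
    return (timestamp - PARTITION_ORIGIN) // ARCHIVAL_POLICY


def partitions_for_range(start: datetime, end: datetime) -> list[int]:
    """Return the partitions a query must read, rejecting overly wide ranges.

    A range no longer than one partition width can overlap at most two
    adjacent partitions, which is the limit the blueprint describes.
    """
    if end < start:
        raise ValueError("end must not precede start")
    if end - start > ARCHIVAL_POLICY:
        raise ValueError("time range exceeds the archival policy duration")
    return list(range(partition_index(start), partition_index(end) + 1))
```

Because each partition is as wide as the maximum allowed range, any accepted range either falls inside one partition or straddles a single boundary. A real implementation would more likely rely on the database's own partition pruning; this sketch only shows the arithmetic behind the restriction.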