Commit d5d153d9 authored by Amy Qualls, committed by Craig Norris

Convert future tense to present tense

More cleaning up of future tense in docs, in favor of present tense.
parent 0ba15986
......@@ -11,7 +11,7 @@ GitLab creates a new Project with an associated Git repository that is a
copy of the original project at the time of the fork. If a large project
gets forked often, this can lead to a quick increase in Git repository
storage disk use. To counteract this problem, we are adding Git object
deduplication for forks to GitLab. In this document, we will describe how
deduplication for forks to GitLab. In this document, we describe how
GitLab implements Git object deduplication.
## Pool repositories
......@@ -27,9 +27,9 @@ If we want repository A to borrow from repository B, we first write a
path that resolves to `B.git/objects` in the special file
`A.git/objects/info/alternates`. This establishes the alternates link.
Next, we must perform a Git repack in A. After the repack, any objects
that are duplicated between A and B will get deleted from A. Repository
that are duplicated between A and B are deleted from A. Repository
A is now no longer self-contained, but it still has its own refs and
configuration. Objects in A that are not in B will remain in A. For this
configuration. Objects in A that are not in B remain in A. For this
to work, it is of course critical that **no objects ever get deleted from
B** because A might need them.
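
As a rough illustration of the alternates link described in this hunk, the following Ruby sketch writes the `objects/info/alternates` file for a repository A so that it points at B's object directory. The helper name `link_alternates` and the temporary directories are assumptions made for this example; this is not how GitLab or Gitaly implement it. On a real repository the link would be followed by a repack in A (for example `git repack -a -d -l`), which the sketch does not run.

```ruby
require "fileutils"
require "tmpdir"

# Hypothetical helper: lets `borrower_git_dir` borrow objects from
# `lender_git_dir` by writing the alternates file.
def link_alternates(borrower_git_dir, lender_git_dir)
  info_dir = File.join(borrower_git_dir, "objects", "info")
  FileUtils.mkdir_p(info_dir)

  alternates_path = File.join(info_dir, "alternates")
  # Each line of the file is a path to another repository's objects directory.
  File.write(alternates_path, File.join(lender_git_dir, "objects") + "\n")
  alternates_path
end

# Throwaway directories standing in for A.git and B.git.
demo_root = Dir.mktmpdir
a_git = File.join(demo_root, "A.git")
b_git = File.join(demo_root, "B.git")
FileUtils.mkdir_p(File.join(b_git, "objects"))

alternates_file = link_alternates(a_git, b_git)
puts File.read(alternates_file) # prints the path that resolves to B.git/objects
```

The file only records where to look for borrowed objects. Nothing is removed from A until a repack runs, and nothing protects B from deletions, which is why the text insists that no objects ever get deleted from B.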
......@@ -49,7 +49,7 @@ repositories** which are hidden from the user. We then use Git
alternates to let a collection of project repositories borrow from a
single pool repository. We call such a collection of project
repositories a pool. Pools form star-shaped networks of repositories
that borrow from a single pool, which will resemble (but not be
that borrow from a single pool, which resemble (but are not
identical to) the fork networks that get formed when users fork
projects.
......@@ -72,9 +72,9 @@ across a collection of GitLab project repositories at the Git level:
The effectiveness of Git object deduplication in GitLab depends on the
amount of overlap between the pool repository and each of its
participants. Each time garbage collection runs on the source project,
Git objects from the source project will get migrated to the pool
Git objects from the source project are migrated to the pool
repository. One by one, as garbage collection runs, other member
projects will benefit from the new objects that got added to the pool.
projects benefit from the new objects that got added to the pool.
## SQL model
......@@ -123,19 +123,19 @@ are as follows:
the fork parent and the fork child project become members of the new
pool.
- Once project A has become the source project of a pool, all future
eligible forks of A will become pool members.
eligible forks of A become pool members.
- If the fork source is itself a fork, the resulting repository will
neither join the repository nor will a new pool repository be
neither join the repository nor is a new pool repository
seeded.
eg:
Such as:
Suppose fork A is part of a pool repository, any forks created off
of fork A *will not* be a part of the pool repository that fork A is
of fork A *are not* a part of the pool repository that fork A is
a part of.
Suppose B is a fork of A, and A does not belong to an object pool.
Now C gets created as a fork of B. C will not be part of a pool
Now C gets created as a fork of B. C is not part of a pool
repository.
> TODO should forks of forks be deduplicated?
......@@ -146,11 +146,11 @@ are as follows:
- If a normal Project participating in a pool gets moved to another
Gitaly storage shard, its "belongs to PoolRepository" relation will
be broken. Because of the way moving repositories between shard is
implemented, we will automatically get a fresh self-contained copy
implemented, we get a fresh self-contained copy
of the project's repository on the new storage shard.
- If the source project of a pool gets moved to another Gitaly storage
shard or is deleted the "source project" relation is not broken.
However, as of GitLab 12.0 a pool will not fetch from a source
However, as of GitLab 12.0 a pool does not fetch from a source
unless the source is on the same Gitaly shard.
## Consistency between the SQL pool relation and Gitaly
......@@ -163,7 +163,7 @@ repository and a pool.
### Pool existence
If GitLab thinks a pool repository exists (i.e. it exists according to
SQL), but it does not on the Gitaly server, then it will be created on
SQL), but it does not on the Gitaly server, then it is created on
the fly by Gitaly.
### Pool relation existence
......@@ -173,27 +173,27 @@ There are three different things that can go wrong here.
#### 1. SQL says repo A belongs to pool P but Gitaly says A has no alternate objects
In this case, we miss out on disk space savings but all RPC's on A
itself will function fine. The next time garbage collection runs on A,
itself function fine. The next time garbage collection runs on A,
the alternates connection gets established in Gitaly. This is done by
`Projects::GitDeduplicationService` in GitLab Rails.
#### 2. SQL says repo A belongs to pool P1 but Gitaly says A has alternate objects in pool P2
In this case `Projects::GitDeduplicationService` will throw an exception.
In this case `Projects::GitDeduplicationService` throws an exception.
#### 3. SQL says repo A does not belong to any pool but Gitaly says A belongs to P
In this case `Projects::GitDeduplicationService` will try to
In this case `Projects::GitDeduplicationService` tries to
"re-duplicate" the repository A using the DisconnectGitAlternates RPC.
## Git object deduplication and GitLab Geo
When a pool repository record is created in SQL on a Geo primary, this
will eventually trigger an event on the Geo secondary. The Geo secondary
will then create the pool repository in Gitaly. This leads to an
eventually triggers an event on the Geo secondary. The Geo secondary
then creates the pool repository in Gitaly. This leads to an
"eventually consistent" situation because as each pool participant gets
synchronized, Geo will eventually trigger garbage collection in Gitaly on
the secondary, at which stage Git objects will get deduplicated.
synchronized, Geo eventually triggers garbage collection in Gitaly on
the secondary, at which stage Git objects are deduplicated.
> TODO How do we handle the edge case where at the time the Geo
> secondary tries to create the pool repository, the source project does
......
......@@ -46,8 +46,8 @@ and running.
Can the queries used potentially take down any critical services and result in
engineers being woken up in the night? Can a malicious user abuse the code to
take down a GitLab instance? Will my changes simply make loading a certain page
slower? Will execution time grow exponentially given enough load or data in the
take down a GitLab instance? Do my changes simply make loading a certain page
slower? Does execution time grow exponentially given enough load or data in the
database?
These are all questions one should ask themselves before submitting a merge
......@@ -67,14 +67,14 @@ in turn can request a performance specialist to review the changes.
## Think outside of the box
Everyone has their own perception how the new feature is going to be used.
Everyone has their own perception of how to use the new feature.
Always consider how users might be using the feature instead. Usually,
users test our features in a very unconventional way,
like by brute forcing or abusing edge conditions that we have.
## Data set
The data set that will be processed by the merge request should be known
The data set the merge request processes should be known
and documented. The feature should clearly document what the expected
data set is for this feature to process, and what problems it might cause.
......@@ -86,8 +86,8 @@ from the repository and perform search for the set of files.
As an author you should in context of that problem consider
the following:
1. What repositories are going to be supported?
1. How long it will take for big repositories like Linux kernel?
1. What repositories are planned to be supported?
1. How long do big repositories like the Linux kernel take?
1. Is there something that we can do differently to not process such a
big data set?
1. Should we build some fail-safe mechanism to contain
......@@ -96,28 +96,28 @@ the following:
## Query plans and database structure
The query plan can tell us if we will need additional
The query plan can tell us if we need additional
indexes, or expensive filtering (such as using sequential scans).
Each query plan should be run against substantial size of data set.
For example, if you look for issues with specific conditions,
you should consider validating a query against
a small number (a few hundred) and a big number (100_000) of issues.
See how the query will behave if the result will be a few
See how the query behaves if the result is a few
and a few thousand.
This is needed as we have users using GitLab for very big projects and
in a very unconventional way. Even if it seems that it's unlikely
that such a big data set will be used, it's still plausible that one
of our customers will encounter a problem with the feature.
that such a big data set is used, it's still plausible that one
of our customers could encounter a problem with the feature.
Understanding ahead of time how it's going to behave at scale, even if we accept it,
is the desired outcome. We should always have a plan or understanding of what it will take
Understanding ahead of time how it behaves at scale, even if we accept it,
is the desired outcome. We should always have a plan or understanding of what is needed
to optimize the feature for higher usage patterns.
Every database structure should be optimized and sometimes even over-described
in preparation for easy extension. The hardest part after some point is
data migration. Migrating millions of rows will always be troublesome and
data migration. Migrating millions of rows is always troublesome and
can have a negative impact on the application.
To better understand how to get help with the query plan reviews
......@@ -130,7 +130,7 @@ queries unless absolutely necessary.
The total number of queries executed by the code modified or added by a merge request
must not increase unless absolutely necessary. When building features it's
entirely possible you will need some extra queries, but you should try to keep
entirely possible you need some extra queries, but you should try to keep
this at a minimum.
As an example, say you introduce a feature that updates a number of database
......@@ -144,7 +144,7 @@ objects_to_update.each do |object|
end
```
This will end up running one query for every object to update. This code can
This means running one query for every object to update. This code can
easily overload a database given enough rows to update or many instances of this
code running in parallel. This particular problem is known as the
["N+1 query problem"](https://guides.rubyonrails.org/active_record_querying.html#eager-loading-associations). You can write a test with [QueryRecorder](query_recorder.md) to detect this and prevent regressions.
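
To make the cost difference concrete, here is a minimal plain-Ruby sketch that counts simulated queries. `FakeTable` is a hypothetical in-memory stand-in invented for this example, not ActiveRecord; its `update_all` only mimics the idea of a single bulk statement.

```ruby
# Hypothetical in-memory "table" that counts how many queries it receives.
class FakeTable
  attr_reader :query_count, :rows

  def initialize(ids)
    @rows = ids.to_h { |id| [id, { some_field: nil }] }
    @query_count = 0
  end

  # Simulates one UPDATE statement for a single row.
  def update(id, attrs)
    @query_count += 1
    @rows.fetch(id).merge!(attrs)
  end

  # Simulates one UPDATE statement covering all given rows.
  def update_all(ids, attrs)
    @query_count += 1
    ids.each { |id| @rows.fetch(id).merge!(attrs) }
  end
end

ids = (1..100).to_a

looped = FakeTable.new(ids)
ids.each { |id| looped.update(id, some_field: "some_value") }

batched = FakeTable.new(ids)
batched.update_all(ids, some_field: "some_value")

puts "loop:  #{looped.query_count} queries"  # 100
puts "batch: #{batched.query_count} query"   # 1
```

Both approaches leave the rows in the same state. Only the number of round trips differs, and in the loop it grows linearly with the number of objects.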
......@@ -162,10 +162,10 @@ query. This in turn makes it much harder for this code to overload a database.
**Summary:** a merge request **should not** execute duplicated cached queries.
Rails provides an [SQL Query Cache](cached_queries.md#cached-queries-guidelines),
used to cache the results of database queries for the duration of the request.
Rails provides an [SQL Query Cache](cached_queries.md#cached-queries-guidelines),
used to cache the results of database queries for the duration of the request.
See [why cached queries are considered bad](cached_queries.md#why-cached-queries-are-considered-bad) and
See [why cached queries are considered bad](cached_queries.md#why-cached-queries-are-considered-bad) and
[how to detect them](cached_queries.md#how-to-detect-cached-queries).
The code introduced by a merge request, should not execute multiple duplicated cached queries.
......@@ -189,8 +189,8 @@ build.project == pipeline_project
# => true
```
When we call `build.project`, it will not hit the database, it will use the cached result, but it will re-instantiate
same pipeline project object. It turns out that associated objects do not point to the same in-memory object.
When we call `build.project`, it doesn't hit the database, it uses the cached result, but it re-instantiates
the same pipeline project object. It turns out that associated objects do not point to the same in-memory object.
If we try to serialize each build:
......@@ -200,7 +200,7 @@ pipeline.builds.each do |build|
end
```
It will re-instantiate project object for each build, instead of using the same in-memory object.
It re-instantiates the project object for each build, instead of using the same in-memory object.
In this particular case the workaround is fairly easy:
......@@ -212,7 +212,7 @@ end
```
We can assign `pipeline.project` to each `build.project`, since we know it should point to the same project.
This will allow us that each build point to the same in-memory project,
This lets each build point to the same in-memory project,
avoiding the cached SQL query and re-instantiation of the project object for each build.
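
The distinction between "equal" and "the same in-memory object" can be sketched in plain Ruby without Rails. `Project`, `Build`, and `load_project` below are hypothetical stand-ins for this example; `load_project` only imitates building a fresh object from a cached row.

```ruby
Project = Struct.new(:id, :name)
Build = Struct.new(:id, :project)

# Hypothetical loader: returns a new object every time, even though the
# underlying row (the "cached result") is the same.
def load_project(cached_row)
  Project.new(cached_row[:id], cached_row[:name])
end

cached_row = { id: 1, name: "gitlab" }

pipeline_project = load_project(cached_row)
build_project = load_project(cached_row)

puts build_project == pipeline_project      # true: same attribute values
puts build_project.equal?(pipeline_project) # false: two separate objects

# The workaround: hand the already loaded project to every build.
builds = [Build.new(10, nil), Build.new(11, nil)]
builds.each { |build| build.project = pipeline_project }

puts builds.all? { |build| build.project.equal?(pipeline_project) } # true
```

`==` compares values, while `equal?` compares object identity. The workaround in the text relies on identity, so that no extra project objects are instantiated per build.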
## Executing Queries in Loops
......@@ -323,7 +323,7 @@ Certain UI elements may not always be needed. For example, when hovering over a
diff line there's a small icon displayed that can be used to create a new
comment. Instead of always rendering these kind of elements they should only be
rendered when actually needed. This ensures we don't spend time generating
Haml/HTML when it's not going to be used.
Haml/HTML when it's not used.
## Instrumenting New Code
......@@ -411,8 +411,8 @@ The quota should be optimised to a level that we consider the feature to
be performant and usable for the user, but **not limiting**.
**We want the features to be fully usable for the users.**
**However, we want to ensure that the feature will continue to perform well if used at its limit**
**and it won't cause availability issues.**
**However, we want to ensure that the feature continues to perform well if used at its limit**
**and it doesn't cause availability issues.**
Consider that it's always better to start with some kind of limitation,
instead of later introducing a breaking change that would result in some
......@@ -433,11 +433,11 @@ The intent of quotas could be different:
Examples:
1. Pipeline Schedules: It's very unlikely that user will want to create
1. Pipeline Schedules: It's very unlikely that user wants to create
more than 50 schedules.
In such cases it's rather expected that this is either misuse
or abuse of the feature. Lack of the upper limit can result
in service degradation as the system will try to process all schedules
in service degradation as the system tries to process all schedules
assigned the project.
1. GitLab CI/CD includes: We started with the limit of maximum of 50 nested includes.
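
An upper limit like the ones in these examples can be sketched as a simple guard. The constant, the error class, and `add_schedule` are hypothetical names chosen for illustration; the value 50 only mirrors the pipeline schedules example and is not taken from GitLab's code.

```ruby
# Hypothetical limit, mirroring the "50 schedules" example in the text.
MAX_SCHEDULES_PER_PROJECT = 50

class LimitExceededError < StandardError; end

# Returns a new list including `new_schedule`, or raises once the limit is hit.
def add_schedule(existing_schedules, new_schedule, limit: MAX_SCHEDULES_PER_PROJECT)
  if existing_schedules.size >= limit
    raise LimitExceededError, "limit of #{limit} schedules reached"
  end

  existing_schedules + [new_schedule]
end

schedules = []
50.times { |i| schedules = add_schedule(schedules, "schedule-#{i}") }

begin
  add_schedule(schedules, "one-too-many")
rescue LimitExceededError => e
  puts e.message # limit of 50 schedules reached
end
```

Rejecting the request at creation time keeps the failure visible to the user and bounded for the system, instead of letting an unbounded number of schedules degrade the service later.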
......@@ -477,7 +477,7 @@ We can consider the following types of storages:
for most of our implementations. Even though this allows the above limit to be significantly larger,
it does not really mean that you can use more. The shared temporary storage is shared by
all nodes. Thus, the job that uses significant amount of that space or performs a lot
of operations will create a contention on execution of all other jobs and request
of operations creates a contention on execution of all other jobs and request
across the whole application, this can easily impact stability of the whole GitLab.
Be respectful of that.
......
......@@ -467,7 +467,7 @@ of the `gitlab-org/gitlab-foss` project. These jobs are only created in the foll
The `* as-if-foss` jobs are run in addition to the regular EE-context jobs. They have the `FOSS_ONLY='1'` variable
set and get their EE-specific folders removed before the tests start running.
The intent is to ensure that a change won't introduce a failure once the `gitlab-org/gitlab` project will be synced to
The intent is to ensure that a change doesn't introduce a failure after the `gitlab-org/gitlab` project is synced to
the `gitlab-org/gitlab-foss` project.
## Performance
......@@ -502,7 +502,7 @@ request, be sure to start the `dont-interrupt-me` job before pushing.
- `update-assets-compile-production-cache`, defined in [`.gitlab/ci/frontend.gitlab-ci.yml`](https://gitlab.com/gitlab-org/gitlab/blob/master/.gitlab/ci/frontend.gitlab-ci.yml).
- `update-assets-compile-test-cache`, defined in [`.gitlab/ci/frontend.gitlab-ci.yml`](https://gitlab.com/gitlab-org/gitlab/blob/master/.gitlab/ci/frontend.gitlab-ci.yml).
- `update-yarn-cache`, defined in [`.gitlab/ci/frontend.gitlab-ci.yml`](https://gitlab.com/gitlab-org/gitlab/blob/master/.gitlab/ci/frontend.gitlab-ci.yml).
1. These jobs will run in merge requests whose title include `UPDATE CACHE`.
1. These jobs run in merge requests whose title include `UPDATE CACHE`.
### Pre-clone step
......@@ -546,8 +546,7 @@ on a scheduled pipeline, it does the following:
1. Saves the data as a `.tar.gz`.
1. Uploads it into the Google Cloud Storage bucket.
When a CI job runs with this configuration, you'll see something like
this:
When a CI job runs with this configuration, the output looks something like this:
```shell
$ eval "$CI_PRE_CLONE_SCRIPT"
......@@ -568,7 +567,7 @@ GitLab Team Member, find credentials in the
[GitLab shared 1Password account](https://about.gitlab.com/handbook/security/#1password-for-teams).
Note that this bucket should be located in the same continent as the
runner, or [network egress charges will apply](https://cloud.google.com/storage/pricing).
runner, or [you can incur network egress charges](https://cloud.google.com/storage/pricing).
## CI configuration internals
......@@ -662,7 +661,7 @@ and included in `rules` definitions via [YAML anchors](../ci/yaml/README.md#anch
| `if-not-canonical-namespace` | Matches if the project isn't in the canonical (`gitlab-org/`) or security (`gitlab-org/security`) namespace. | Use to create a job for forks (by using `when: on_success\|manual`), or **not** create a job for forks (by using `when: never`). |
| `if-not-ee` | Matches if the project isn't EE (i.e. project name isn't `gitlab` or `gitlab-ee`). | Use to create a job only in the FOSS project (by using `when: on_success\|manual`), or **not** create a job if the project is EE (by using `when: never`). |
| `if-not-foss` | Matches if the project isn't FOSS (i.e. project name isn't `gitlab-foss`, `gitlab-ce`, or `gitlabhq`). | Use to create a job only in the EE project (by using `when: on_success\|manual`), or **not** create a job if the project is FOSS (by using `when: never`). |
| `if-default-refs` | Matches if the pipeline is for `master`, `/^[\d-]+-stable(-ee)?$/` (stable branches), `/^\d+-\d+-auto-deploy-\d+$/` (auto-deploy branches), `/^security\//` (security branches), merge requests, and tags. | Note that jobs won't be created for branches with this default configuration. |
| `if-default-refs` | Matches if the pipeline is for `master`, `/^[\d-]+-stable(-ee)?$/` (stable branches), `/^\d+-\d+-auto-deploy-\d+$/` (auto-deploy branches), `/^security\//` (security branches), merge requests, and tags. | Note that jobs aren't created for branches with this default configuration. |
| `if-master-refs` | Matches if the current branch is `master`. | |
| `if-master-or-tag` | Matches if the pipeline is for the `master` branch or for a tag. | |
| `if-merge-request` | Matches if the pipeline is for a merge request. | |
......