Commit b11f7da6 authored by Marcin Sedlak-Jakubowski

Merge branch 'axil-hashed-storage-dedup' into 'master'

Refactor the hashed storage docs

See merge request gitlab-org/gitlab!31331
parents 3480f08d 2ed4f048
# Repository storage Rake tasks **(CORE ONLY)** # Repository storage Rake tasks **(CORE ONLY)**
This is a collection of Rake tasks you can use to help you list and migrate This is a collection of Rake tasks to help you list and migrate
existing projects and attachments associated with it from Legacy storage to existing projects and their attachments to the new
the new Hashed storage type. [hashed storage](../repository_storage_types.md) that GitLab
uses to organize the Git data.
You can read more about the storage types [here](../repository_storage_types.md). ## List projects and attachments
## Migrate existing projects to hashed storage The following Rake tasks will list the projects and attachments that are
available on legacy and hashed storage.
Before migrating your existing projects, you should ### On legacy storage
[enable hashed storage](../repository_storage_types.md#how-to-migrate-to-hashed-storage) for the new projects as well.
This task will schedule all your existing projects and attachments associated with them to be migrated to the To have a summary and then a list of projects and their attachments using legacy storage:
**Hashed** storage type:
**Omnibus Installation** - **Omnibus installation**
```shell ```shell
sudo gitlab-rake gitlab:storage:migrate_to_hashed # Projects
``` sudo gitlab-rake gitlab:storage:legacy_projects
sudo gitlab-rake gitlab:storage:list_legacy_projects
**Source Installation** # Attachments
sudo gitlab-rake gitlab:storage:legacy_attachments
sudo gitlab-rake gitlab:storage:list_legacy_attachments
```
```shell - **Source installation**
sudo -u git -H bundle exec rake gitlab:storage:migrate_to_hashed RAILS_ENV=production
```
They both also accept a range as environment variable: ```shell
# Projects
sudo -u git -H bundle exec rake gitlab:storage:legacy_projects RAILS_ENV=production
sudo -u git -H bundle exec rake gitlab:storage:list_legacy_projects RAILS_ENV=production
```shell # Attachments
# to migrate any non migrated project from ID 20 to 50. sudo -u git -H bundle exec rake gitlab:storage:legacy_attachments RAILS_ENV=production
export ID_FROM=20 sudo -u git -H bundle exec rake gitlab:storage:list_legacy_attachments RAILS_ENV=production
export ID_TO=50 ```
```
You can monitor the progress in the **{admin}** **Admin Area > Monitoring > Background Jobs** page. ### On hashed storage
There is a specific queue you can watch to see how long it will take to finish:
`hashed_storage:hashed_storage_project_migrate`.
After it reaches zero, you can confirm every project has been migrated by running the commands below. To have a summary and then a list of projects and their attachments using hashed storage:
If you find it necessary, you can run this migration script again to schedule missing projects.
Any error or warning will be logged in Sidekiq's log file. - **Omnibus installation**
NOTE: **Note:** ```shell
If [Geo](../geo/replication/index.md) is enabled, each project that is successfully migrated # Projects
generates an event to replicate the changes on any **secondary** nodes. sudo gitlab-rake gitlab:storage:hashed_projects
sudo gitlab-rake gitlab:storage:list_hashed_projects
You only need the `gitlab:storage:migrate_to_hashed` Rake task to migrate your repositories, but we have additional # Attachments
commands below that help you inspect projects and attachments in both legacy and hashed storage. sudo gitlab-rake gitlab:storage:hashed_attachments
sudo gitlab-rake gitlab:storage:list_hashed_attachments
```
## Rollback from hashed storage to legacy storage - **Source installation**
If you need to rollback the storage migration for any reason, you can follow the steps described here. ```shell
# Projects
sudo -u git -H bundle exec rake gitlab:storage:hashed_projects RAILS_ENV=production
sudo -u git -H bundle exec rake gitlab:storage:list_hashed_projects RAILS_ENV=production
NOTE: **Note:** # Attachments
Hashed storage will be required in a future version of GitLab. sudo -u git -H bundle exec rake gitlab:storage:hashed_attachments RAILS_ENV=production
sudo -u git -H bundle exec rake gitlab:storage:list_hashed_attachments RAILS_ENV=production
```
To prevent new projects from being created in the Hashed storage, ## Migrate to hashed storage
you need to undo the [enable hashed storage](../repository_storage_types.md#how-to-migrate-to-hashed-storage) changes.
This task will schedule all your existing projects and associated attachments to be rolled back to the NOTE: **Note:**
Legacy storage type. In GitLab 13.0, [hashed storage](../repository_storage_types.md#hashed-storage)
is enabled by default and the legacy storage is deprecated.
Support for legacy storage will be removed in GitLab 14.0. If you're on GitLab
13.0 and later, switching new projects to legacy storage is not possible.
The option to choose between hashed and legacy storage in the admin area has
been disabled.
For Omnibus installations, run the following: This task will schedule all your existing projects and attachments associated
with them to be migrated to the **Hashed** storage type:
```shell - **Omnibus installation**
sudo gitlab-rake gitlab:storage:rollback_to_legacy
```
For source installations, run the following: ```shell
sudo gitlab-rake gitlab:storage:migrate_to_hashed
```
```shell - **Source installation**
sudo -u git -H bundle exec rake gitlab:storage:rollback_to_legacy RAILS_ENV=production
```
Both commands accept a range as environment variable: ```shell
sudo -u git -H bundle exec rake gitlab:storage:migrate_to_hashed RAILS_ENV=production
```
If you have any existing integration, you may want to do a small rollout first,
to validate. You can do so by specifying an ID range with the operation by using
the environment variables `ID_FROM` and `ID_TO`. For example, to limit the rollout
to project IDs 50 to 100 in an Omnibus GitLab installation:
```shell ```shell
# to rollback any migrated project from ID 20 to 50. sudo gitlab-rake gitlab:storage:migrate_to_hashed ID_FROM=50 ID_TO=100
export ID_FROM=20
export ID_TO=50
``` ```
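A staged rollout over several ID ranges could be scripted along the following lines. This is only a sketch: the `plan_batches` helper, its arguments, and the batch size are hypothetical and not part of GitLab, and it assumes `ID_FROM` and `ID_TO` are inclusive bounds. It prints the `gitlab-rake` commands rather than running them, so they can be reviewed first.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print one migrate_to_hashed command per ID batch.
# Nothing is executed here; review the output, then run the lines one at a time.
plan_batches() {
  local first=$1 last=$2 size=$3
  local from=$first to
  while [ "$from" -le "$last" ]; do
    to=$(( from + size - 1 ))
    # Clamp the final batch to the requested upper bound.
    if [ "$to" -gt "$last" ]; then to=$last; fi
    echo "sudo gitlab-rake gitlab:storage:migrate_to_hashed ID_FROM=$from ID_TO=$to"
    from=$(( to + 1 ))
  done
}

# Example: project IDs 50 to 100 in batches of 25.
plan_batches 50 100 25
```

Waiting for the `hashed_storage:hashed_storage_project_migrate` queue to reach zero between batches keeps each validation step separate from the next.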
You can monitor the progress in the **{admin}** **Admin Area > Monitoring > Background Jobs** page. You can monitor the progress in the **{admin}** **Admin Area > Monitoring > Background Jobs** page.
On the **Queues** tab, you can watch the `hashed_storage:hashed_storage_project_rollback` queue to see how long the process will take to finish. There is a specific queue you can watch to see how long it will take to finish:
`hashed_storage:hashed_storage_project_migrate`.
After it reaches zero, you can confirm every project has been rolled back by running the commands below. After it reaches zero, you can confirm every project has been migrated by running the commands below.
If some projects weren't rolled back, you can run this rollback script again to schedule further rollbacks. If you find it necessary, you can run this migration script again to schedule missing projects.
Any error or warning will be logged in Sidekiq's log file. Any error or warning will be logged in Sidekiq's log file.
## List projects NOTE: **Note:**
If [Geo](../geo/replication/index.md) is enabled, each project that is successfully migrated
The following are Rake tasks for listing projects. generates an event to replicate the changes on any **secondary** nodes.
### List projects on legacy storage
To have a simple summary of projects using legacy storage:
**Omnibus Installation**
```shell
sudo gitlab-rake gitlab:storage:legacy_projects
```
**Source Installation**
```shell
sudo -u git -H bundle exec rake gitlab:storage:legacy_projects RAILS_ENV=production
```
To list projects using legacy storage:
**Omnibus Installation**
```shell
sudo gitlab-rake gitlab:storage:list_legacy_projects
```
**Source Installation**
```shell
sudo -u git -H bundle exec rake gitlab:storage:list_legacy_projects RAILS_ENV=production
```
### List projects on hashed storage
To have a simple summary of projects using hashed storage:
**Omnibus Installation**
```shell
sudo gitlab-rake gitlab:storage:hashed_projects
```
**Source Installation**
```shell
sudo -u git -H bundle exec rake gitlab:storage:hashed_projects RAILS_ENV=production
```
To list projects using hashed storage:
**Omnibus Installation**
```shell
sudo gitlab-rake gitlab:storage:list_hashed_projects
```
**Source Installation**
```shell
sudo -u git -H bundle exec rake gitlab:storage:list_hashed_projects RAILS_ENV=production
```
## List attachments
The following are Rake tasks for listing attachments.
### List attachments on legacy storage
To have a simple summary of project attachments using legacy storage:
**Omnibus Installation**
```shell
sudo gitlab-rake gitlab:storage:legacy_attachments
```
**Source Installation**
```shell
sudo -u git -H bundle exec rake gitlab:storage:legacy_attachments RAILS_ENV=production
```
To list project attachments using legacy storage: You only need the `gitlab:storage:migrate_to_hashed` Rake task to migrate your repositories, but we have additional
commands below that help you inspect projects and attachments in both legacy and hashed storage.
**Omnibus Installation** ## Rollback from hashed storage to legacy storage
```shell NOTE: **Deprecated:**
sudo gitlab-rake gitlab:storage:list_legacy_attachments In GitLab 13.0, [hashed storage](../repository_storage_types.md#hashed-storage)
``` is enabled by default and the legacy storage is deprecated.
Support for legacy storage will be removed in GitLab 14.0. If you're on GitLab
13.0 and later, switching new projects to legacy storage is not possible.
The option to choose between hashed and legacy storage in the admin area has
been disabled.
**Source Installation** This task will schedule all your existing projects and associated attachments to be rolled back to the
legacy storage type.
```shell - **Omnibus installation**
sudo -u git -H bundle exec rake gitlab:storage:list_legacy_attachments RAILS_ENV=production
```
### List attachments on hashed storage ```shell
sudo gitlab-rake gitlab:storage:rollback_to_legacy
```
To have a simple summary of project attachments using hashed storage: - **Source installation**
**Omnibus Installation** ```shell
sudo -u git -H bundle exec rake gitlab:storage:rollback_to_legacy RAILS_ENV=production
```
```shell If you have any existing integration, you may want to do a small rollback first,
sudo gitlab-rake gitlab:storage:hashed_attachments to validate. You can do so by specifying an ID range with the operation by using
``` the environment variables `ID_FROM` and `ID_TO`. For example, to limit the rollback
to project IDs 50 to 100 in an Omnibus GitLab installation:
**Source Installation**
```shell ```shell
sudo -u git -H bundle exec rake gitlab:storage:hashed_attachments RAILS_ENV=production sudo gitlab-rake gitlab:storage:rollback_to_legacy ID_FROM=50 ID_TO=100
``` ```
To list project attachments using hashed storage: You can monitor the progress in the **{admin}** **Admin Area > Monitoring > Background Jobs** page.
On the **Queues** tab, you can watch the `hashed_storage:hashed_storage_project_rollback` queue to see how long the process will take to finish.
**Omnibus Installation**
```shell
sudo gitlab-rake gitlab:storage:list_hashed_attachments
```
**Source Installation** After it reaches zero, you can confirm every project has been rolled back by running the commands below.
If some projects weren't rolled back, you can run this rollback script again to schedule further rollbacks.
Any error or warning will be logged in Sidekiq's log file.
```shell If you have a Geo setup, the rollback will not be reflected automatically
sudo -u git -H bundle exec rake gitlab:storage:list_hashed_attachments RAILS_ENV=production on the **secondary** node. You may need to wait for a backfill operation to kick in and remove
``` the remaining repositories from the special `@hashed/` folder manually.
# Repository Storage Types # Repository storage types **(CORE ONLY)**
> [Introduced](https://gitlab.com/gitlab-org/gitlab-foss/issues/28283) in GitLab 10.0. > - [Introduced](https://gitlab.com/gitlab-org/gitlab-foss/issues/28283) in GitLab 10.0.
> - Hashed storage became the default for new installations in GitLab 12.0.
Two different storage layouts can be used > - Hashed storage is enabled by default for new and renamed projects in GitLab 13.0.
to store the repositories on disk and their characteristics.
GitLab can be configured to use one or multiple repository storage paths/shard GitLab can be configured to use one or multiple repository storage paths/shard
locations that can be: locations that can be:
...@@ -20,40 +19,17 @@ The `default` repository shard that is available in any installations ...@@ -20,40 +19,17 @@ The `default` repository shard that is available in any installations
that haven't customized it, points to the local folder: `/var/opt/gitlab/git-data`. that haven't customized it, points to the local folder: `/var/opt/gitlab/git-data`.
Anything discussed below is expected to be part of that folder. Anything discussed below is expected to be part of that folder.
## Legacy Storage ## Hashed storage
Legacy Storage is the storage behavior prior to version 10.0. For historical
reasons, GitLab replicated the same mapping structure from the projects URLs:
- Project's repository: `#{namespace}/#{project_name}.git`
- Project's wiki: `#{namespace}/#{project_name}.wiki.git`
This structure made it simple to migrate from existing solutions to GitLab and
easy for Administrators to find where the repository is stored.
On the other hand this has some drawbacks:
The storage location will concentrate a huge number of top-level namespaces. The
impact can be reduced by the introduction of
[multiple storage paths](repository_storage_paths.md).
Because backups are a snapshot of the same URL mapping, if you try to recover a
very old backup, you need to verify whether any project has taken the place of
an old removed or renamed project sharing the same URL. This means that
`mygroup/myproject` from your backup may not be the same original project that
is at that same URL today.
Any change in the URL will need to be reflected on disk (when groups / users or
projects are renamed). This can add a lot of load in big installations,
especially if using any type of network based filesystem.
## Hashed Storage
CAUTION: **Important:** NOTE: **Note:**
Geo requires Hashed Storage since 12.0. If you haven't migrated yet, In GitLab 13.0, hashed storage is enabled by default and the legacy storage is
check the [migration instructions](#how-to-migrate-to-hashed-storage) ASAP. deprecated. Support for legacy storage will be removed in GitLab 14.0.
If you haven't migrated yet, check the
[migration instructions](raketasks/storage.md#migrate-to-hashed-storage).
The option to choose between hashed and legacy storage in the admin area has
been disabled.
Hashed Storage is the new storage behavior we rolled out with 10.0. Instead Hashed storage is the storage behavior we rolled out with 10.0. Instead
of coupling project URL and the folder structure where the repository will be of coupling project URL and the folder structure where the repository will be
stored on disk, we are coupling a hash, based on the project's ID. This makes stored on disk, we are coupling a hash, based on the project's ID. This makes
the folder structure immutable, and therefore eliminates any requirement to the folder structure immutable, and therefore eliminates any requirement to
...@@ -134,6 +110,11 @@ The output includes the project ID and the project name: ...@@ -134,6 +110,11 @@ The output includes the project ID and the project name:
> [Introduced](https://gitlab.com/gitlab-org/gitaly/issues/1606) in GitLab 12.1. > [Introduced](https://gitlab.com/gitlab-org/gitaly/issues/1606) in GitLab 12.1.
DANGER: **Danger:**
Do not run `git prune` or `git gc` in pool repositories! This can
cause data loss in "real" repositories that depend on the pool in
question.
Forks of public projects are deduplicated by creating a third repository, the Forks of public projects are deduplicated by creating a third repository, the
object pool, containing the objects from the source project. Using object pool, containing the objects from the source project. Using
`objects/info/alternates`, the source project and forks use the object pool for `objects/info/alternates`, the source project and forks use the object pool for
...@@ -145,71 +126,15 @@ when housekeeping is run on the source project. ...@@ -145,71 +126,15 @@ when housekeeping is run on the source project.
"@pools/#{hash[0..1]}/#{hash[2..3]}/#{hash}.git" "@pools/#{hash[0..1]}/#{hash[2..3]}/#{hash}.git"
``` ```
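The template above can be read as plain substring slicing. The sketch below assumes the hash is already known and passed in as a string; the `pool_path` function name and the sample value are hypothetical, and how GitLab derives the hash for a pool is not covered here.

```shell
#!/usr/bin/env bash
# Build the on-disk pool path from a hash string, following the template
# "@pools/#{hash[0..1]}/#{hash[2..3]}/#{hash}.git".
pool_path() {
  local hash=$1
  # ${hash:0:2} is the first two characters, ${hash:2:2} the next two.
  printf '@pools/%s/%s/%s.git\n' "${hash:0:2}" "${hash:2:2}" "$hash"
}

# The value below is a placeholder, not a real pool hash.
pool_path "b17ef6d19c7a5b1ee83b907c595526dcb1eb06db8227d650d5dda0a9f4ce8cd9"
```

Because the path depends only on the hash, it does not change when a project or namespace is renamed.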
DANGER: **Danger:** ### Hashed storage coverage migration
Do not run `git prune` or `git gc` in pool repositories! This can
cause data loss in "real" repositories that depend on the pool in
question.
### How to migrate to Hashed Storage
To start a migration, enable Hashed Storage for new projects:
1. Go to **Admin > Settings > Repository** and expand the **Repository Storage** section.
1. Select the **Use hashed storage paths for newly created and renamed projects** checkbox.
Check if the change breaks any existing integration you may have that
either runs on the same machine where your repositories are located, or may log in to that machine
to access data (for example, a remote backup solution).
To schedule a complete rollout, see the
[Rake task documentation for storage migration](raketasks/storage.md#migrate-existing-projects-to-hashed-storage) for instructions.
If you do have any existing integration, you may want to do a small rollout first,
to validate. You can do so by specifying a range with the operation.
This is an example of how to limit the rollout to Project IDs 50 to 100, running in
an Omnibus GitLab installation:
```shell Files stored in an S3 compatible endpoint will not have the downsides
sudo gitlab-rake gitlab:storage:migrate_to_hashed ID_FROM=50 ID_TO=100
```
Check the [documentation](raketasks/storage.md#migrate-existing-projects-to-hashed-storage) for additional information and instructions for
source-based installation.
#### Rollback
Similar to the migration, to disable Hashed Storage for new
projects:
1. Go to **Admin > Settings > Repository** and expand the **Repository Storage** section.
1. Uncheck the **Use hashed storage paths for newly created and renamed projects** checkbox.
To schedule a complete rollback, see the
[Rake task documentation for storage rollback](raketasks/storage.md#rollback-from-hashed-storage-to-legacy-storage) for instructions.
The rollback task also supports specifying a range of Project IDs. Here is an example
of limiting the rollback to Project IDs 50 to 100, in an Omnibus GitLab installation:
```shell
sudo gitlab-rake gitlab:storage:rollback_to_legacy ID_FROM=50 ID_TO=100
```
If you have a Geo setup, please note that the rollback will not be reflected automatically
on the **secondary** node. You may need to wait for a backfill operation to kick in and remove
the remaining repositories from the special `@hashed/` folder manually.
### Hashed Storage coverage
We are incrementally moving every storable object in GitLab to the Hashed
Storage pattern. You can check the current coverage status below (and also see
the [issue](https://gitlab.com/gitlab-com/infrastructure/issues/2821)).
Note that things stored in an S3 compatible endpoint will not have the downsides
mentioned earlier, if they are not prefixed with `#{namespace}/#{project_name}`, mentioned earlier, if they are not prefixed with `#{namespace}/#{project_name}`,
which is true for CI Cache and LFS Objects. which is true for CI Cache and LFS Objects.
| Storable Object | Legacy Storage | Hashed Storage | S3 Compatible | GitLab Version | In the table below, you can find the coverage of the migration to the hashed storage.
| Storable Object | Legacy storage | Hashed storage | S3 Compatible | GitLab Version |
| --------------- | -------------- | -------------- | ------------- | -------------- | | --------------- | -------------- | -------------- | ------------- | -------------- |
| Repository | Yes | Yes | - | 10.0 | | Repository | Yes | Yes | - | 10.0 |
| Attachments | Yes | Yes | - | 10.2 | | Attachments | Yes | Yes | - | 10.2 |
...@@ -222,18 +147,16 @@ which is true for CI Cache and LFS Objects. ...@@ -222,18 +147,16 @@ which is true for CI Cache and LFS Objects.
| LFS Objects | Yes | Similar | Yes | 10.0 / 10.7 | | LFS Objects | Yes | Similar | Yes | 10.0 / 10.7 |
| Repository pools| No | Yes | - | 11.6 | | Repository pools| No | Yes | - | 11.6 |
#### Implementation Details #### Avatars
##### Avatars
Each file is stored in a folder with its `id` from the database. The filename is always `avatar.png` for user avatars. Each file is stored in a folder with its `id` from the database. The filename is always `avatar.png` for user avatars.
When an avatar is replaced, the `Upload` model is destroyed and a new one takes its place with a different `id`. When an avatar is replaced, the `Upload` model is destroyed and a new one takes its place with a different `id`.
##### CI Artifacts #### CI artifacts
CI Artifacts are S3 compatible since **9.4** (GitLab Premium), and available in GitLab Core since **10.6**. CI Artifacts are S3 compatible since **9.4** (GitLab Premium), and available in GitLab Core since **10.6**.
##### LFS Objects #### LFS objects
[LFS Objects in GitLab](../topics/git/lfs/index.md) implement a similar [LFS Objects in GitLab](../topics/git/lfs/index.md) implement a similar
storage pattern using 2 chars, 2 level folders, following Git's own implementation: storage pattern using 2 chars, 2 level folders, following Git's own implementation:
...@@ -246,3 +169,39 @@ storage pattern using 2 chars, 2 level folders, following Git's own implementati ...@@ -246,3 +169,39 @@ storage pattern using 2 chars, 2 level folders, following Git's own implementati
``` ```
LFS objects are also [S3 compatible](lfs/index.md#storing-lfs-objects-in-remote-object-storage). LFS objects are also [S3 compatible](lfs/index.md#storing-lfs-objects-in-remote-object-storage).
## Legacy storage
NOTE: **Deprecated:**
In GitLab 13.0, hashed storage is enabled by default and the legacy storage is
deprecated. If you haven't migrated yet, check the
[migration instructions](raketasks/storage.md#migrate-to-hashed-storage).
Support for legacy storage will be removed in GitLab 14.0. If you're on GitLab
13.0 and later, switching new projects to legacy storage is not possible.
The option to choose between hashed and legacy storage in the admin area has
been disabled.
Legacy storage is the storage behavior prior to version 10.0. For historical
reasons, GitLab replicated the same mapping structure from the projects URLs:
- Project's repository: `#{namespace}/#{project_name}.git`
- Project's wiki: `#{namespace}/#{project_name}.wiki.git`
This structure made it simple to migrate from existing solutions to GitLab and
easy for Administrators to find where the repository is stored.
On the other hand this has some drawbacks:
The storage location will concentrate a huge number of top-level namespaces. The
impact can be reduced by the introduction of
[multiple storage paths](repository_storage_paths.md).
Because backups are a snapshot of the same URL mapping, if you try to recover a
very old backup, you need to verify whether any project has taken the place of
an old removed or renamed project sharing the same URL. This means that
`mygroup/myproject` from your backup may not be the same original project that
is at that same URL today.
Any change in the URL will need to be reflected on disk (when groups / users or
projects are renamed). This can add a lot of load in big installations,
especially if using any type of network based filesystem.
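To make the rename drawback concrete, the legacy mapping can be sketched as a small function. The `legacy_paths` name and the sample namespace and project names are hypothetical; only the two path templates come from the text above.

```shell
#!/usr/bin/env bash
# Print the legacy on-disk paths for a project, following the templates
# "#{namespace}/#{project_name}.git" and "#{namespace}/#{project_name}.wiki.git".
legacy_paths() {
  local namespace=$1 project_name=$2
  printf '%s/%s.git\n' "$namespace" "$project_name"
  printf '%s/%s.wiki.git\n' "$namespace" "$project_name"
}

# Same project before and after a rename: both paths change on disk.
legacy_paths mygroup myproject
legacy_paths mygroup renamed-project
```

Every rename produces a different pair of paths, which is why the change has to be mirrored on disk under legacy storage.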
...@@ -44,7 +44,7 @@ Access the default page for admin area settings by navigating to ...@@ -44,7 +44,7 @@ Access the default page for admin area settings by navigating to
| Option | Description | | Option | Description |
| ------ | ----------- | | ------ | ----------- |
| [Repository mirror](visibility_and_access_controls.md#allow-mirrors-to-be-set-up-for-projects) | Configure repository mirroring. | | [Repository mirror](visibility_and_access_controls.md#allow-mirrors-to-be-set-up-for-projects) | Configure repository mirroring. |
| [Repository storage](../../../administration/repository_storage_types.md#how-to-migrate-to-hashed-storage) | Configure storage path settings. | | [Repository storage](../../../administration/repository_storage_types.md) | Configure storage path settings. |
| Repository maintenance | ([Repository checks](../../../administration/repository_checks.md) and [Housekeeping](../../../administration/housekeeping.md)). Configure automatic Git checks and housekeeping on repositories. | | Repository maintenance | ([Repository checks](../../../administration/repository_checks.md) and [Housekeeping](../../../administration/housekeeping.md)). Configure automatic Git checks and housekeeping on repositories. |
| [Repository static objects](../../../administration/static_objects_external_storage.md) | Serve repository static objects (for example, archives, blobs, ...) from an external storage (for example, a CDN). | | [Repository static objects](../../../administration/static_objects_external_storage.md) | Serve repository static objects (for example, archives, blobs, ...) from an external storage (for example, a CDN). |
......