Commit 526da962 authored by James Ramsay

Rewrite repo size reduction docs

Purging data from a Git repo can be useful for different reasons,
including making cloning faster, or reducing repo size on the GitLab
server. The first is easy, the second is hard.

Documentation has been rewritten to use `git filter-repo` which is the
leading repo rewriting tool for this application, and is recommended by
the Git project.

Documentation now also provides a workflow for pruning all kinds of refs,
including refs/keep-around/*, which was previously undocumented. Without
it, purging files was incomplete from a disk use perspective on the
GitLab server.

References to BFG have been removed.
parent cdcdeccc
...@@ -97,7 +97,7 @@ Read the documentation on how to [migrate an existing Git repo with Git LFS](mig ...@@ -97,7 +97,7 @@ Read the documentation on how to [migrate an existing Git repo with Git LFS](mig
To remove objects from LFS: To remove objects from LFS:
1. Use [BFG-Cleaner](../../../user/project/repository/reducing_the_repo_size_using_git.md#using-the-bfg-repo-cleaner) or [filter-branch](../../../user/project/repository/reducing_the_repo_size_using_git.md#using-git-filter-branch) to remove the objects from the repository. 1. Use [`git filter-repo`](../../../user/project/repository/reducing_the_repo_size_using_git.md) to remove the objects from the repository.
1. Delete the relevant LFS lines for the objects you have removed from your `.gitattributes` file and commit those changes. 1. Delete the relevant LFS lines for the objects you have removed from your `.gitattributes` file and commit those changes.
## File Locking ## File Locking
......
...@@ -7,147 +7,239 @@ type: howto ...@@ -7,147 +7,239 @@ type: howto
# Reducing the repository size using Git # Reducing the repository size using Git
A GitLab Enterprise Edition administrator can set a [repository size limit](../../admin_area/settings/account_and_limit_settings.md) When large files are added to a Git repository this makes fetching the
which will prevent you from exceeding it. repository slower, because everyone will need to download the file. These files
can also take up a large amount of storage space on the server over time.
When a project has reached its size limit, you will not be able to push to it,
create a new merge request, or merge existing ones. You will still be able to Rewriting a repository can remove unwanted history to make the repository
create new issues, and clone the project though. Uploading LFS objects will smaller. [`git filter-repo`](https://github.com/newren/git-filter-repo) is a
also be denied. tool for quickly rewriting Git repository history, and is recommended over [`git
filter-branch`](https://git-scm.com/docs/git-filter-branch) and
If you exceed the repository size limit, your first thought might be to remove
some data, make a new commit and push back to the repository. Perhaps you can
move some blobs to LFS, or remove some old dependency updates from history.
Unfortunately, it's not so easy and that workflow won't work. Deleting files in
a commit doesn't actually reduce the size of the repo since the earlier commits
and blobs are still around. What you need to do is rewrite history with Git's
[`filter-branch` option](https://git-scm.com/book/en/v2/Git-Tools-Rewriting-History#The-Nuclear-Option:-filter-branch),
or an open source community-maintained tool like the
[BFG](https://rtyley.github.io/bfg-repo-cleaner/). [BFG](https://rtyley.github.io/bfg-repo-cleaner/).
Note that even with that method, until `git gc` runs on the GitLab side, the DANGER: **Danger:**
"removed" commits and blobs will still be around. You also need to be able to Rewriting repository history is a destructive operation. Make sure to back up
push the rewritten history to GitLab, which may be impossible if you've already your repository before you begin. The best way is to [export the
exceeded the maximum size limit. project](../settings/import_export.html#exporting-a-project-and-its-data).
In order to lift these restrictions, the administrator of the GitLab instance ## Purging files from your repository history
needs to increase the limit on the particular project that exceeded it, so it's
always better to spot that you're approaching the limit and act proactively to
stay underneath it. If you hit the limit, and your admin can't - or won't -
temporarily increase it for you, your only option is to prune all the unneeded
stuff locally, and then create a new project on GitLab and start using that
instead.
If you can continue to use the original project, we recommend [using To make cloning your project faster, rewrite branches and tags to remove
BFG](#using-the-bfg-repo-cleaner), a tool that's built and unwanted files.
maintained by the open source community. It's faster and simpler than
`git filter-branch`, and GitLab can use its account of what has changed to clean
up its own internal state, maximizing the space saved.
CAUTION: **Caution:** 1. [Install `git
Make sure to first make a copy of your repository since rewriting history will filter-repo`](https://github.com/newren/git-filter-repo/blob/master/INSTALL.md)
purge the files and information you are about to delete. Also make sure to using a supported package manager, or from source.
inform any collaborators to not use `pull` after your changes, but use `rebase`.
CAUTION: **Caution:** 1. Clone a fresh copy of the repository using `--bare`.
This process is not suitable for removing sensitive data like passwords or keys
from your repository. Information about commits, including file content, is
cached in the database, and will remain visible even after they have been
removed from the repository.
## Using the BFG Repo-Cleaner
> [Introduced](https://gitlab.com/gitlab-org/gitlab-foss/-/issues/19376) in GitLab 11.6.
1. [Install BFG](https://rtyley.github.io/bfg-repo-cleaner/) from its open source community repository.
1. Navigate to your repository:
```shell ```shell
cd my_repository/ git clone --bare https://example.gitlab.com/my/project.git
``` ```
1. Change to the branch you want to remove the big file from: 1. Using `git filter-repo`, purge any files from the history of your repository.
To purge all large files, the `--strip-blobs-bigger-than` option can be used:
```shell ```shell
git checkout master git filter-repo --strip-blobs-bigger-than 10M
``` ```
1. Create a commit removing the large file from the branch, if it still exists: To purge specific large files by path, the `--path` and `--invert-paths`
options can be combined.
```shell ```shell
git rm path/to/big_file.mpg git filter-repo --path path/to/big/file.m4v --invert-paths
git commit -m 'Remove unneeded large file'
``` ```
1. Rewrite history: See the [`git filter-repo`
documentation](https://htmlpreview.github.io/?https://github.com/newren/git-filter-repo/blob/docs/html/git-filter-repo.html#EXAMPLES)
for more examples, and the complete documentation.
1. Force push your changes to overwrite all branches on GitLab.
```shell ```shell
bfg --delete-files path/to/big_file.mpg git push origin --force --all
``` ```
An object map file will be written to `object-id-map.old-new.txt`. Keep it [Protected Branches](../protected_branches.md) will cause this to fail. To
around - you'll need it for the final step! proceed you will need to remove branch protection, push, and then
reconfigure protected branches.
1. Force-push the changes to GitLab: 1. To remove large files from tagged releases, force push your changes to all
tags on GitLab.
```shell ```shell
git push --force-with-lease origin master git push origin --force --tags
``` ```
If this step fails, someone has changed the `master` branch while you were [Protected Tags](../protected_tags.md) will cause this to
rewriting history. You could restore the branch and re-run BFG to preserve fail. To proceed you will need to remove tag protection, push, and then
their changes, or use `git push --force` to overwrite their changes. reconfigure protected tags.
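
Before choosing a threshold for `--strip-blobs-bigger-than`, it can help to see which blobs are actually the largest. The following is a minimal sketch, not part of the documented workflow: it builds a throwaway repository (the file names and the `Example` identity are hypothetical) and lists blobs by uncompressed size using standard Git plumbing commands.

```shell
# Build a throwaway repository so the sketch runs anywhere.
# All paths and file names below are hypothetical.
repo=$(mktemp -d)
git init -q "$repo"
printf 'small\n' > "$repo/notes.txt"
head -c 200000 /dev/zero > "$repo/big_file.bin"
git -C "$repo" add .
git -C "$repo" -c user.name=Example -c user.email=example@example.com \
  commit -q -m 'Add files'

# List every object reachable from any ref, look up its type and
# uncompressed size, keep only blobs, and sort largest first.
largest=$(git -C "$repo" rev-list --objects --all |
  git -C "$repo" cat-file \
    --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' |
  awk '$1 == "blob"' |
  sort -k3 -n -r |
  head -n 10)
printf '%s\n' "$largest"
```

In a real project, the same pipeline would be run inside the bare clone instead of the temporary repository.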
1. Navigate to **Project > Settings > Repository > Repository Cleanup**: ## Purging files from GitLab storage
![Repository settings cleanup form](img/repository_cleanup.png) To reduce the size of your repository in GitLab you will need to remove GitLab
internal refs that reference commits containing large files. Before completing
these steps, first [purge files from your repository
history](#purging-files-from-your-repository-history).
Upload the `object-id-map.old-new.txt` file and press **Start cleanup**. As well as branches and tags, which are a type of Git ref, GitLab automatically
This will remove any internal Git references to the old commits, and run creates other refs. These refs prevent dead links to commits, or missing diffs
`git gc` against the repository. You will receive an email once it has when viewing merge requests. [Repository cleanup](#repository-cleanup) can be
completed. used to remove these from GitLab.
NOTE: **Note:** The internal refs for merge requests (`refs/merge-requests/*`),
This process will remove some copies of the rewritten commits from GitLab's [pipelines](../../../ci/pipelines/index.md#troubleshooting-fatal-reference-is-not-a-tree)
cache and database, but there are still numerous gaps in coverage - at present, (`refs/pipelines/*`), and environments (`refs/environments/*`) are not
some of the copies may persist indefinitely. [Clearing the instance cache](../../../administration/raketasks/maintenance.md#clear-redis-cache) advertised, which means they are not included when fetching, which makes
may help to remove some of them, but it should not be depended on for security fetching faster. The hidden refs to prevent commits with discussion from being
purposes! deleted (`refs/keep-around/*`) cannot be fetched at all. These refs can,
however, be accessed from the Git bundle inside the project export.
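
To see which refs a bundle carries, `git bundle list-heads` can be used. The sketch below is illustrative only: it simulates an internal ref in a throwaway repository (the `refs/keep-around/<sha>` ref is created by hand here, and the `Example` identity is hypothetical), bundles every ref, and lists the bundle contents. With a real export, the `project.bundle` file from the archive would be inspected instead.

```shell
# Throwaway repository standing in for a GitLab project.
src=$(mktemp -d)
git init -q "$src"
g() { git -C "$src" -c user.name=Example -c user.email=example@example.com "$@"; }
g commit -q --allow-empty -m 'Initial commit'
sha=$(g rev-parse HEAD)

# Simulate an internal ref of the kind described above.
g update-ref "refs/keep-around/$sha" "$sha"

# Bundle every ref, then list the refs stored in the bundle.
g bundle create "$src/project.bundle" --all >/dev/null 2>&1
refs=$(g bundle list-heads "$src/project.bundle")
printf '%s\n' "$refs"
```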
## Using `git filter-branch` 1. [Install `git
filter-repo`](https://github.com/newren/git-filter-repo/blob/master/INSTALL.md)
using a supported package manager, or from source.
1. Navigate to your repository: 1. Generate a fresh [export the
project](../settings/import_export.html#exporting-a-project-and-its-data) and
download to your computer.
1. Decompress the backup using `tar`:
```shell ```shell
cd my_repository/ tar xzf project-backup.tar.gz
``` ```
1. Change to the branch you want to remove the big file from: This will contain a `project.bundle` file, which was created by [`git
bundle`](https://git-scm.com/docs/git-bundle)
1. Clone a fresh copy of the repository from the bundle.
```shell ```shell
git checkout master git clone --bare --mirror /path/to/project.bundle
``` ```
1. Use `filter-branch` to remove the big file: 1. Using `git filter-repo`, purge any files from the history of your repository.
Because we are trying to remove internal refs, we will rely on the
`commit-map` produced by each run to tell us which internal refs to remove.
NOTE: **Note:**
`git filter-repo` creates a new `commit-map` file every run, and overwrites the
`commit-map` from the previous run. You will need this file from **every**
run. Do the next step every time you run `git filter-repo`.
To purge all large files, the `--strip-blobs-bigger-than` option can be used:
```shell ```shell
git filter-branch --force --tree-filter 'rm -f path/to/big_file.mpg' HEAD git filter-repo --strip-blobs-bigger-than 10M
``` ```
1. Instruct Git to purge the unwanted data: To purge specific large files by path, the `--path` and `--invert-paths`
options can be combined.
```shell ```shell
git reflog expire --expire=now --all && git gc --prune=now --aggressive git filter-repo --path path/to/big/file.m4v --invert-paths
``` ```
1. Lastly, force push to the repository: See the [`git filter-repo`
documentation](https://htmlpreview.github.io/?https://github.com/newren/git-filter-repo/blob/docs/html/git-filter-repo.html#EXAMPLES)
for more examples, and the complete documentation.
1. After running `git filter-repo`, the header and unchanged commits need to be
removed from the `commit-map` before uploading to GitLab.
```shell ```shell
git push --force origin master tail -n +2 filter-repo/commit-map | grep -E -v '^(\w+) \1$' >> commit-map.txt
``` ```
Your repository should now be below the size limit. This command can be run after each run of `git filter-repo` to append the
output of the run to `commit-map.txt`.
1. Navigate to **Project > Settings > Repository > Repository Cleanup**.
Upload the `commit-map.txt` file and press **Start cleanup**. This will
remove any internal Git references to the old commits, and run `git gc`
against the repository. You will receive an email once it has completed.
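
The `commit-map` filtering step above relies on `\w` and a backreference in `grep -E`, which not every `grep` implementation supports. As a sketch of an alternative, assuming the file layout is a header line followed by `<old-sha> <new-sha>` pairs, the same filtering can be expressed with `awk`. The sample data and temporary paths below are hypothetical.

```shell
# Hypothetical sample mimicking the layout of filter-repo/commit-map:
# a header line, then "<old-sha> <new-sha>" pairs.
workdir=$(mktemp -d)
cat > "$workdir/commit-map" <<'EOF'
old                                      new
1111111111111111111111111111111111111111 aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
2222222222222222222222222222222222222222 2222222222222222222222222222222222222222
3333333333333333333333333333333333333333 bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb
EOF

# Skip the header (NR > 1) and keep only rows whose two columns differ,
# appending so that results from several runs accumulate.
awk 'NR > 1 && $1 != $2' "$workdir/commit-map" >> "$workdir/commit-map.txt"

cat "$workdir/commit-map.txt"
```

Only the rewritten commits (the first and third pairs in the sample) end up in `commit-map.txt`; the unchanged commit and the header are dropped.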
## Repository cleanup
> [Introduced](https://gitlab.com/gitlab-org/gitlab-foss/-/issues/19376) in GitLab 11.6.
Repository cleanup allows you to upload a text file of objects and GitLab will remove
internal Git references to these objects.
To clean up a repository:
1. Go to the project for the repository.
1. Navigate to **{settings}** **Settings > Repository**.
1. Upload a list of objects.
1. Click **Start cleanup**.
This will remove any internal Git references to old commits, and run `git gc`
against the repository. You will receive an email once it has completed.
These tools produce suitable output for purging history on the server:
- [`git filter-repo`](https://github.com/newren/git-filter-repo): use the
`commit-map` file.
- [BFG](https://rtyley.github.io/bfg-repo-cleaner/): use the
`object-id-map.old-new.txt` file.
NOTE: **Note:**
Housekeeping prunes loose objects older than 2 weeks. This means objects added
in the last 2 weeks will not be removed immediately. If you have access to the
Gitaly server, you may run `git gc --prune=now` to prune all loose objects
immediately.
NOTE: **Note:**
This process will remove some copies of the rewritten commits from GitLab's
cache and database, but there are still numerous gaps in coverage - at present,
some of the copies may persist indefinitely. [Clearing the instance
cache](../../../administration/raketasks/maintenance.md#clear-redis-cache) may
help to remove some of them, but it should not be depended on for security
purposes!
## Exceeding storage limit
A GitLab Enterprise Edition administrator can set a [repository size
limit](../../admin_area/settings/account_and_limit_settings.md) which will
prevent you from exceeding it.
When a project has reached its size limit, you will not be able to push to it,
create a new merge request, or merge existing ones. You will still be able to
create new issues, and clone the project though. Uploading LFS objects will
also be denied.
If you exceed the repository size limit, your first thought might be to remove
some data, make a new commit and push back to the repository. Perhaps you can
move some blobs to LFS, or remove some old dependency updates from history.
Unfortunately, it's not so easy and that workflow won't work. Deleting files in
a commit doesn't actually reduce the size of the repo since the earlier commits
and blobs are still around. What you need to do is rewrite history with Git's
[`filter-branch` option](https://git-scm.com/book/en/v2/Git-Tools-Rewriting-History#The-Nuclear-Option:-filter-branch),
or an open source community-maintained tool like
[`git filter-repo`](https://github.com/newren/git-filter-repo).
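
The point that deleting a file in a new commit does not shrink the repository can be checked locally. The following sketch uses a throwaway repository with a hypothetical file name and commit identity: it commits a large file, removes it in a second commit, and then confirms that the blob is still reachable through history.

```shell
# Throwaway repository; names and identity are hypothetical.
repo=$(mktemp -d)
git init -q "$repo"
g() { git -C "$repo" -c user.name=Example -c user.email=example@example.com "$@"; }

head -c 200000 /dev/zero > "$repo/big_file.bin"
g add big_file.bin
g commit -q -m 'Add large file'
blob=$(g rev-parse HEAD:big_file.bin)

g rm -q big_file.bin
g commit -q -m 'Remove large file'

# The file is gone from the latest commit...
if g cat-file -e HEAD:big_file.bin 2>/dev/null; then in_head=yes; else in_head=no; fi
# ...but the blob is still reachable through the earlier commit.
if g rev-list --objects --all | grep -q "^$blob "; then in_history=yes; else in_history=no; fi
echo "in latest commit: $in_head, in history: $in_history"
```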
Note that even with that method, until `git gc` runs on the GitLab side, the
"removed" commits and blobs will still be around. You also need to be able to
push the rewritten history to GitLab, which may be impossible if you've already
exceeded the maximum size limit.
In order to lift these restrictions, the administrator of the GitLab instance
needs to increase the limit on the particular project that exceeded it, so it's
always better to spot that you're approaching the limit and act proactively to
stay underneath it. If you hit the limit, and your admin can't - or won't -
temporarily increase it for you, your only option is to prune all the unneeded
stuff locally, and then create a new project on GitLab and start using that
instead.
CAUTION: **Caution:**
This process is not suitable for removing sensitive data like passwords or keys
from your repository. Information about commits, including file content, is
cached in the database, and will remain visible even after they have been
removed from the repository.
<!-- ## Troubleshooting <!-- ## Troubleshooting
......