Merge branch 'improve-partial-clone-docs' into 'master'

Update partial clone docs See merge request gitlab-org/gitlab!28884

Merge branch 'improve-partial-clone-docs' into 'master'
Update partial clone docs See merge request gitlab-org/gitlab!28884
5e87fec3 · Achilleas Pipinellis · a4a33415 · 3e385ed6 · 5e87fec3
Commit 5e87fec3 authored Apr 09, 2020 by Achilleas Pipinellis
Hide whitespace changes
Inline Side-by-side

Showing with 91 additions and 58 deletions

doc/topics/git/partial_clone.md doc/topics/git/partial_clone.md +91 -58

No files found.
--- a/doc/topics/git/partial_clone.md
+++ b/doc/topics/git/partial_clone.md
-# Partial Clone for Large Repositories
+# Partial Clone
-CAUTION: **Alpha:**
+As Git repositories grow in size, they can become cumbersome to work with
-Partial Clone is an experimental feature, and will significantly increase
+because of the large amount of history that must be downloaded, and the large
-Gitaly resource utilization when performing a partial clone, and decrease
+amount of disk space they require.
-performance of subsequent fetch operations.
-As Git repositories become very large, usability decreases as performance
-decreases. One major challenge is cloning the repository, because Git will
-download the entire repository including every commit and every version of
-every object. This can be slow to transfer, and require large amounts of disk
-space.
-Historically, performing a **shallow clone**
-([`--depth`](https://www.git-scm.com/docs/git-clone#Documentation/git-clone.txt---depthltdepthgt))
-has been the only way to reduce the amount of data transferred when cloning
-a Git repository. This does not, however, allow filtering by sub-tree which is
-important for monolithic repositories containing many projects, or by object
-size preventing unnecessary large objects being downloaded.
 [Partial clone](https://github.com/git/git/blob/master/Documentation/technical/partial-clone.txt)
 is a performance optimization that "allows Git to function without having a
 complete copy of the repository. The goal of this work is to allow Git better
 handle extremely large repositories."
-Specifically, using partial clone, it should be possible for Git to natively
+## Filter by file size
-support:
- large objects, instead of using [Git LFS](https://git-lfs.github.com/)
- enormous repositories
-Briefly, partial clone works by:
- excluding objects from being transferred when cloning or fetching a
+> [Introduced](https://gitlab.com/gitlab-org/gitaly/-/issues/2553) in GitLab 12.10.
-  repository using a new `--filter` flag
- downloading missing objects on demand
-Follow [Git for enormous repositories](https://gitlab.com/groups/gitlab-org/-/epics/773) for roadmap and updates.
+Storing large binary files in Git is normally discouraged, because every large
+file added will be downloaded by everyone who clones or fetches changes
+thereafter. This is slow, if not a complete obstruction when working from a slow
+or unreliable internet connection.
-## Enabling partial clone
+Using partial clone with a file size filter solves this problem, by excluding
+troublesome large files from clones and fetches. When Git encounters a missing
+file, it will be downloaded on demand.
-> [Introduced](https://gitlab.com/gitlab-org/gitaly/issues/1553) in GitLab 12.4.
+When cloning a repository, use the `--filter=blob:limit=<size>` argument. For example,
+to clone the repository excluding files larger than 1 megabyte:
-To enable partial clone, use the [feature flags API](../../api/features.md).
-For example:
 ```shell
-curl --data "value=true" --header "PRIVATE-TOKEN: <your_access_token>" https://gitlab.example.com/api/v4/features/gitaly_upload_pack_filter
+git clone --filter=blob:limit=1m git@gitlab.com:gitlab-com/www-gitlab-com.git
 ```
-Alternatively, flip the switch and enable the feature flag:
+This would produce the following output:
-```ruby
+```plaintext
-Feature.enable(:gitaly_upload_pack_filter)
+Cloning into 'www-gitlab-com'...
+remote: Enumerating objects: 832467, done.
+remote: Counting objects: 100% (832467/832467), done.
+remote: Compressing objects: 100% (207226/207226), done.
+remote: Total 832467 (delta 585563), reused 826624 (delta 580099), pack-reused 0
+Receiving objects: 100% (832467/832467), 2.34 GiB | 5.05 MiB/s, done.
+Resolving deltas: 100% (585563/585563), done.
+remote: Enumerating objects: 146, done.
+remote: Counting objects: 100% (146/146), done.
+remote: Compressing objects: 100% (138/138), done.
+remote: Total 146 (delta 8), reused 144 (delta 8), pack-reused 0
+Receiving objects: 100% (146/146), 471.45 MiB | 4.60 MiB/s, done.
+Resolving deltas: 100% (8/8), done.
+Updating files: 100% (13008/13008), done.
+Filtering content: 100% (3/3), 131.24 MiB | 4.65 MiB/s, done.
 ```
-## Excluding objects by size
+The output will be longer because Git will first clone the repository excluding
+files larger than 1 megabyte, and second download any missing large files needed
-Partial Clone allows large objects to be stored directly in the Git repository,
+to checkout the `master` branch.
-and be excluded from clones as desired by the user. This eliminates the error
-prone process of deciding which objects should be stored in LFS or not. Using
+When changing branches, Git may need to download more missing files.
-partial clone, all files – large or small – may be treated the same.
+## Filter by object type
+> [Introduced](https://gitlab.com/gitlab-org/gitaly/-/issues/2553) in GitLab 12.10.
+For enormous repositories with millions of files, and long history, it may be
+helpful to exclude all files and use in combination with `sparse-checkout` to
+reduce the size of your working copy.
+```plaintext
+# Clone the repo excluding all files
+$ git clone --filter=blob:none --sparse git@gitlab.com:gitlab-com/www-gitlab-com/git
+Cloning into 'www-gitlab-com'...
+remote: Enumerating objects: 678296, done.
+remote: Counting objects: 100% (678296/678296), done.
+remote: Compressing objects: 100% (165915/165915), done.
+remote: Total 678296 (delta 472342), reused 673292 (delta 467476), pack-reused 0
+Receiving objects: 100% (678296/678296), 81.06 MiB | 5.74 MiB/s, done.
+Resolving deltas: 100% (472342/472342), done.
+remote: Enumerating objects: 28, done.
+remote: Counting objects: 100% (28/28), done.
+remote: Compressing objects: 100% (25/25), done.
+remote: Total 28 (delta 0), reused 12 (delta 0), pack-reused 0
+Receiving objects: 100% (28/28), 140.29 KiB | 341.00 KiB/s, done.
+Updating files: 100% (28/28), done.
+$ cd www-gitlab-com
+$ git sparse-checkout init --cone
+$ git sparse-checkout add data
+remote: Enumerating objects: 301, done.
+remote: Counting objects: 100% (301/301), done.
+remote: Compressing objects: 100% (292/292), done.
+remote: Total 301 (delta 16), reused 102 (delta 9), pack-reused 0
+Receiving objects: 100% (301/301), 1.15 MiB | 608.00 KiB/s, done.
+Resolving deltas: 100% (16/16), done.
+Updating files: 100% (302/302), done.
+```
-With the `uploadpack.allowFilter` and `uploadpack.allowAnySHA1InWant` options
+For more details, see the Git documentation for
-enabled on the Git server:
+[`sparse-checkout`](https://git-scm.com/docs/git-sparse-checkout).
-```shell
+## Filter by file path
-# clone the repo, excluding blobs larger than 1 megabyte
-git clone --filter=blob:limit=1m <url>
-# in the checkout step of the clone, and any subsequent operations
+CAUTION: **Experimental:**
-# any blobs that are needed will be downloaded on demand
+Partial Clone using `sparse` filters is experimental, slow, and will
-git checkout feature-branch
+significantly increase Gitaly resource utilization when cloning and fetching.
-```
-## Excluding objects by path
+Deeper integration between Partial Clone and Sparse Checkout is being explored
+through the `--filter=sparse:oid=<blob-ish>` filter spec, but this is highly
+experimental. This mode of filtering uses a format similar to a `.gitignore`
+file to specify which files should be included when cloning and fetching.
-Partial Clone allows clones to be filtered by path using a format similar to a
+For more details, see the Git documentation for
-`.gitignore` file stored inside the repository.
+[`rev-list-options`](https://gitlab.com/gitlab-org/git/-/blob/9fadedd637b312089337d73c3ed8447e9f0aa775/Documentation/rev-list-options.txt#L735-780).
 With the `uploadpack.allowFilter` and `uploadpack.allowAnySHA1InWant` options
 enabled on the Git server: