Use object quarantine directory to enumerate new LFS pointers
When accepting pushes, we check whether they contain any new LFS pointers and, if so, verify that we have the corresponding LFS object for each of the pointers in order to ensure consistency. Determining new LFS pointers is expensive though: we need to perform a complete graph walk in order to determine which blobs are new and which aren't. The integrity check's runtime thus scales with repository size and is frequently seen to take multiple seconds or even time out after 30 seconds. As a result, the push either seems to hang for quite some time doing nothing, or it is refused altogether in the case of a timeout.

We can do better though: instead of doing a graph walk, we can inspect the pushed objects directly by enumerating them. This is quite trivial to do: when git-receive-pack(1) receives a push, all pushed objects are put into a quarantine directory which is then made available to git hooks via the GIT_OBJECT_DIRECTORY variable, while the real repository's object directory is passed in GIT_ALTERNATE_OBJECT_DIRECTORIES. Instead of doing the graph walk, we can use git-cat-file(1) with the `--batch-all-objects` flag and the alternate object directories unset. The result is a direct enumeration of all pushed objects, which scales linearly with push size and not with repository size.

Doing some benchmarks for gitlab-org/gitlab showed that these computations are around 100-200x faster than doing the graph walk, reducing the time from around 200ms to 2-4ms.

This functionality has recently been implemented in Gitaly via the new `ListAllLFSPointers()` RPC: given a repository, it simply lists all objects, whether reachable or unreachable. We can now make use of these semantics when a pre-receive hook environment is active by stripping the repository's alternate object directories, so that the RPC only lists newly pushed objects.
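The enumeration described above can be sketched roughly as follows. This is an illustration only, not the Gitaly implementation: the helper names (`quarantine_env`, `parse_batch_check`, `list_pushed_blob_candidates`) and the `MAX_POINTER_SIZE` cut-off are assumptions made for this sketch. It relies on `git cat-file --batch-all-objects --batch-check`, which prints one `<oid> <type> <size>` line per object by default.

```python
import os
import subprocess

# Assumed size cut-off for LFS pointer candidates; not taken from the commit.
MAX_POINTER_SIZE = 1024


def quarantine_env(env):
    """Return a copy of env without the alternate object directories.

    GIT_OBJECT_DIRECTORY (the quarantine directory) is kept, so git only
    sees the objects that were just pushed.
    """
    stripped = dict(env)
    stripped.pop("GIT_ALTERNATE_OBJECT_DIRECTORIES", None)
    return stripped


def parse_batch_check(output):
    """Pick small blobs out of `--batch-check` output lines."""
    candidates = []
    for line in output.splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue  # skip anything that is not "<oid> <type> <size>"
        oid, objtype, size = parts
        if objtype == "blob" and int(size) <= MAX_POINTER_SIZE:
            candidates.append(oid)
    return candidates


def list_pushed_blob_candidates(repo_path, env=None):
    """Enumerate quarantined objects; meant to run inside a pre-receive hook."""
    hook_env = quarantine_env(os.environ if env is None else env)
    result = subprocess.run(
        ["git", "-C", repo_path, "cat-file",
         "--batch-all-objects", "--batch-check"],
        env=hook_env, capture_output=True, text=True, check=True,
    )
    return parse_batch_check(result.stdout)
```

The returned object IDs are only candidates: each blob's contents would still need to be read and checked against the LFS pointer format before looking up the corresponding LFS object.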