Commit e07a1c70 authored by Patrick Steinhardt's avatar Patrick Steinhardt

Use object quarantine directory to enumerate new LFS pointers

When accepting pushes, we will check whether pushes contain any new LFS
pointers and, if so, verify that we've got the corresponding LFS object
for each of the poniters in order to ensure consistency. Determining new
LFS pointers is expensive though: we need to perform a complete graph
walk in order to determine which blobs are new and which aren't. The
integrity check's runtime thus scales with repository size and is
frequently seen to take multiple seconds or even time out after 30
seconds. Results are that the push seems to be hanging for quite some
time doing nothing, or that the push is refused altogether in the case
of a timeout.

We can do better though: instead of doing a graph walk, we can just
inspect all pushed objects directly by enumerating all pushed objects.
This is quite trivial to do: when git-receive-pack(1) receives a push,
all pushed objects will be put into a quarantine directory which is then
made available to git hooks via the GIT_OBJECT_DIRECTORY variable, where
the real repository is stored in GIT_ALTERNATIVE_OBJECT_DIRECTORIES.
Instead of doing the graph walk, we can just use git-cat-file(1) with
the `--batch-all-objects` flag and the alternate object directories
unset. The result is a direct enumeration of all pushed objects, which
scales linearly with push size and not with repository size. Doing some
benchmarks for gitlab-org/gitlab showed that these computations are
around 100-200x faster than doing the graph walk, reducing the time from
around 200ms to 2-4ms.

This functionality has recently been implemented in Gitaly via the new
`ListAllLFSPointers()` RPC: given a repository, it will simply list all
reachable or unreachable objects. We can now use above semantics when a
pre-receive hook environment is active and strip the repository's
alternative object directory, which will as a result only list newly
pushed objects.
parent d9f880c7
---
title: Use object quarantine directory to enumerate new LFS pointers
merge_request: 58634
author:
type: performance
---
name: lfs_integrity_inspect_quarantined_objects
introduced_by_url: https://gitlab.com/gitlab-org/gitlab/-/merge_requests/58634
rollout_issue_url: https://gitlab.com/gitlab-org/gitlab/-/issues/327440
milestone: '13.11'
type: development
group: group::gitaly
default_enabled: false
...@@ -78,17 +78,7 @@ module Gitlab ...@@ -78,17 +78,7 @@ module Gitlab
end end
def get_new_lfs_pointers(revision, limit, not_in, dynamic_timeout = nil) def get_new_lfs_pointers(revision, limit, not_in, dynamic_timeout = nil)
request = Gitaly::GetNewLFSPointersRequest.new( request, rpc = create_new_lfs_pointers_request(revision, limit, not_in)
repository: @gitaly_repo,
revision: encode_binary(revision),
limit: limit || 0
)
if not_in.nil? || not_in == :all
request.not_in_all = true
else
request.not_in_refs += not_in
end
timeout = timeout =
if dynamic_timeout if dynamic_timeout
...@@ -100,7 +90,7 @@ module Gitlab ...@@ -100,7 +90,7 @@ module Gitlab
response = GitalyClient.call( response = GitalyClient.call(
@gitaly_repo.storage_name, @gitaly_repo.storage_name,
:blob_service, :blob_service,
:get_new_lfs_pointers, rpc,
request, request,
timeout: timeout timeout: timeout
) )
...@@ -118,6 +108,39 @@ module Gitlab ...@@ -118,6 +108,39 @@ module Gitlab
private private
def create_new_lfs_pointers_request(revision, limit, not_in)
# If the check happens for a change which is using a quarantine
# environment for incoming objects, then we can avoid doing the
# necessary graph walk to detect only new LFS pointers and instead scan
# through all quarantined objects.
git_env = ::Gitlab::Git::HookEnv.all(@gitaly_repo.gl_repository)
if Feature.enabled?(:lfs_integrity_inspect_quarantined_objects, @project, default_enabled: :yaml) && git_env['GIT_OBJECT_DIRECTORY_RELATIVE'].present?
repository = @gitaly_repo.dup
repository.git_alternate_object_directories = Google::Protobuf::RepeatedField.new(:string)
request = Gitaly::ListAllLFSPointersRequest.new(
repository: repository,
limit: limit || 0
)
[request, :list_all_lfs_pointers]
else
request = Gitaly::GetNewLFSPointersRequest.new(
repository: @gitaly_repo,
revision: encode_binary(revision),
limit: limit || 0
)
if not_in.nil? || not_in == :all
request.not_in_all = true
else
request.not_in_refs += not_in
end
[request, :get_new_lfs_pointers]
end
end
def consume_blob_response(response) def consume_blob_response(response)
data = [] data = []
blob = nil blob = nil
......
...@@ -43,6 +43,33 @@ RSpec.describe Gitlab::GitalyClient::BlobService do ...@@ -43,6 +43,33 @@ RSpec.describe Gitlab::GitalyClient::BlobService do
subject subject
end end
end end
context 'with hook environment' do
let(:git_env) do
{
'GIT_OBJECT_DIRECTORY_RELATIVE' => '.git/objects',
'GIT_ALTERNATE_OBJECT_DIRECTORIES_RELATIVE' => ['/dir/one', '/dir/two']
}
end
let(:expected_params) do
expected_repository = repository.gitaly_repository
expected_repository.git_alternate_object_directories = Google::Protobuf::RepeatedField.new(:string)
{ limit: limit, repository: expected_repository }
end
it 'sends a list_all_lfs_pointers message' do
allow(Gitlab::Git::HookEnv).to receive(:all).with(repository.gl_repository).and_return(git_env)
expect_any_instance_of(Gitaly::BlobService::Stub)
.to receive(:list_all_lfs_pointers)
.with(gitaly_request_with_params(expected_params), kind_of(Hash))
.and_return([])
subject
end
end
end end
describe '#get_all_lfs_pointers' do describe '#get_all_lfs_pointers' do
......
Markdown is supported
0%
or
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment