pull: Switch from porcelain `git fetch` to plumbing `git fetch-pack` + friends
    On lab.nexedi.com `git-backup pull` became slow, and most of the slowness
    was tracked down to the following:
    
    `git fetch`, before fetching data from the remote repository, first checks
    whether it already has locally all the objects the remote advertises. This
    boils down to running
    
    	echo $remote_tips | git rev-list --quiet --objects --stdin --not --all
    
    and checking whether it succeeds or not:
    
    	https://git.kernel.org/pub/scm/git/git.git/commit/?h=4191c35671
    	https://git.kernel.org/pub/scm/git/git.git/tree/builtin/fetch.c?h=v2.18.0-rc1-1-g6f333ff2fb#n925
    	https://git.kernel.org/pub/scm/git/git.git/tree/connected.c?h=v2.18.0-rc1-1-g6f333ff2fb#n8
    
    The "--not --all" in the query means that objects should be not
    reachable from all locally existing refs and is implemented by linearly
    scanning from tip of those existing refs and marking objects reachable
    from there as "do not print".
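    
    For illustration, here is a hedged sketch of that check as a standalone
    shell snippet ($url and the printed messages are hypothetical, not git's
    actual code):
    
    	# exit status 0 <=> every object reachable from the advertised tips
    	# already exists locally; a missing object makes rev-list fail
    	remote_tips=$(git ls-remote "$url" | cut -f1)
    	if echo "$remote_tips" | git rev-list --quiet --objects --stdin --not --all; then
    		echo "all remote objects are already present - fetch can be skipped"
    	else
    		echo "some objects are missing - a real fetch has to happen"
    	fi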
    
    In the case of git-backup, where we have mostly one master whose tip is a
    super commit merging the whole histories of all projects plus the backup
    history, linearly scanning from such a tip goes through a lot of commits.
    Up to the point where fetching a small, outdated repository, which was
    already pulled into the backup and has not changed for a long time, takes
    more than 30 seconds, with almost 100% of that time being spent in
    quickfetch() alone.
    
    The solution will be to optimize the check for whether we already have all
    the remote objects, and not to repeat the whole backup-repo scan for every
    pulled repository. This will be done by first querying, via `git
    ls-remote`, which tips the remote repository has, then checking against a
    git-backup-specific index which of those tips we already have, and then
    fetching only the rest, as sketched below. This way we essentially move
    most of git's quickfetch phase into git-backup.
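    
    A minimal sketch of that flow, assuming a hypothetical have_tip helper
    that consults the git-backup index (this is not the actual
    implementation):
    
    	# ask the remote what tips it has, keep only the refs whose tips
    	# the backup does not know yet, and fetch just those refs
    	git ls-remote "$url" 'refs/heads/*' 'refs/tags/*' |
    	grep -v '\^{}' |	# drop peeled tag entries for brevity
    	while read sha ref; do
    		have_tip "$sha" || echo "$ref"	# have_tip: hypothetical index lookup
    	done |
    	xargs -r git fetch-pack "$url"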
    
    Since we'll be telling git to fetch only some of the remote refs, we will
    either have to amend ourselves the refs `git fetch` creates after
    fetching, or not rely on `git fetch` creating any refs at all. Since we
    already have a long-standing issue that the many, many refs which come
    alive after `git fetch` slow down further git fetches
    
    https://lab.nexedi.com/kirr/git-backup/blob/0ab7bbb6/git-backup.go#L551
    
    the longer-term plan will be to not create unneeded references at all.
    Since two forks could have references covering the same commits, we would
    either have to compare the references created after git-fetch and
    deduplicate them, or manage reference creation ourselves.
    
    It is also generally better to split `git fetch` into steps at the
    plumbing layer: after doing so we get the chance to optimize or tweak any
    of the steps on our side, with full knowledge of the git-backup context
    and indices.
    
    This commit only switches from using `git fetch` to its plumbing
    counterpart `git fetch-pack` + friends, plus manually creating the fetched
    refs exactly the way `git fetch` used to do. There should be neither a
    functionality change nor any speedup yet.
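    
    Roughly, the porcelain/plumbing correspondence looks like this (a sketch;
    the ref names are examples and the exact options used in git-backup
    differ):
    
    	# porcelain: git fetch "$url" master
    	# plumbing equivalent, split into steps:
    	# 1. negotiate haves/wants and download the pack;
    	#    fetch-pack prints "<sha> <refname>" for each fetched ref
    	out=$(git fetch-pack "$url" refs/heads/master)
    	sha=$(echo "$out" | awk '{print $1}')
    	# 2. create the tracking ref ourselves, as `git fetch` would have done
    	git update-ref refs/remotes/origin/master "$sha"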
    
    Further commits will start to take advantage of the switch and optimize
    `git-backup pull`.