Commits · 3aedc246272c30b362978cb83e845e1ba037ef74 · Alain Takoudjou / git-backup

13 Dec, 2016 1 commit

Move error-handling routines & co to lab.nexedi.com/kirr/go123 · 3aedc246

Kirill Smelkov authored Dec 13, 2016

error.go is being moved in its entirety to that shared place for handy Go
utilities, split into several subpackages:

lab.nexedi.com/kirr/go123/exc -- exception-style error handling for Go
lab.nexedi.com/kirr/go123/myname -- easy way to determine current function's name and package
lab.nexedi.com/kirr/go123/xerr -- addons for error-handling
lab.nexedi.com/kirr/go123/xruntime -- addons to standard package runtime

3aedc246

03 Nov, 2016 1 commit

Don't be fooled by strings.Split(..., "\n") result always having empty "" last element · 3ba6cf73

Kirill Smelkov authored Nov 03, 2016

By definition strings.Split(..., sep) "slices s into all substrings
separated by sep and returns a slice of the substrings between those
separators". That means that

    strings.Split("hello\nworld\n", "\n") -> ["hello", "world", ""]     # NOTE the last ""

When parsing a file by lines it is handy, though, not to get the last
empty "" after the last "\n". #6 shows how we forgot to filter it out
for the case of an empty backup.refs file and errored out because of that.

To fix this let's introduce a helper - splitlines() - which filters out
the last empty entry after the last separator. By using this helper
everywhere we can hopefully avoid problems when pulling only empty
repositories (the #6 case), as well as similar ones.
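
A minimal sketch of such a helper follows. The name splitlines() comes
from the message above; the exact signature and the demo in main() are
assumptions made for illustration, not the code from the commit.

```go
package main

import (
	"fmt"
	"strings"
)

// splitlines splits s by sep and drops the single empty element that
// strings.Split leaves after a trailing separator (or for empty input).
func splitlines(s, sep string) []string {
	sv := strings.Split(s, sep)
	if l := len(sv); l > 0 && sv[l-1] == "" {
		sv = sv[:l-1]
	}
	return sv
}

func main() {
	fmt.Printf("%q\n", strings.Split("hello\nworld\n", "\n"))
	fmt.Printf("%q\n", splitlines("hello\nworld\n", "\n"))
	fmt.Printf("%q\n", splitlines("", "\n")) // the empty backup.refs case
}
```

Only the last empty element is dropped; empty lines in the middle of the
input are kept as they are.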

Fixes #6
/reported-by @iv

3ba6cf73

01 Aug, 2016 3 commits

pull: Don't let a lot of empty directories stay under refs/backup/... work prefix after end of pull · 7535343c

Kirill Smelkov authored Aug 01, 2016

Continuing 62374038 (pull: Turns unused refs are removed not 100% and a
lot of empty directories are accumulated) we just make sure to remove
them at the end of pull.

But NOTE: there could still be hidden O(n^2) behaviour, so it makes
sense to eventually revisit this and clean up empty dirs earlier.

For now we only care not to degrade the performance of future pulls.
An appropriate time for revisiting could be when reworking pull to do
fetches in parallel.

Updates: https://lab.nexedi.com/lab.nexedi.com/lab.nexedi.com/issues/4

7535343c

restore: Extract packs in multiple workers · ff2f0b67

Kirill Smelkov authored Aug 01, 2016

This allows us to leverage multiple CPUs on a system for pack
extractions, which are computation-heavy operations.

The way to do it is more-or-less classical:

    - main worker prepares requests for pack extraction jobs

    - there are multiple pack-extraction workers, which read requests
      from the jobs queue and perform them

    - at the end we wait for everything to stop, collect errors and
      optionally signal the whole thing to cancel if we see an error
      coming. (it is only a signal and we still have to wait for
      everything to stop)

The default number of workers is N(CPU) on the system - because we spawn
separate `git pack-objects ...` for every request.

We also now explicitly limit the number of CPUs each `git pack-objects ...`
can use to 1. This way control over how many resources to use is in
git-backup's hands. Git also packs better this way (when using only 1
thread), because when deltifying all objects are considered against
each other, not only the objects inside 1 thread's object pool. And
even when pack.threads is not 1, the first "objects counting" phase of
packing is serial - wasting all but 1 core.

On lab.nexedi.com we already use pack.threads=1 by default in the global
gitconfig, but the above change makes the code universal.

Time to restore nexedi/ from lab.nexedi.com backup:

2CPU laptop:

    before (pack.threads=1)     10m11s
    before (pack.threads=NCPU)   9m13s
    after  -j1                  10m11s
    after                        6m17s

8CPU system (with other load present, noisy) :

    before (pack.threads=1)     ~5m
    after                       ~1m30s

ff2f0b67

raisef: Fix it wrt erraddcallingcontext() · 6c2abbbf

Kirill Smelkov authored Aug 01, 2016

As in 302aaaea (raiseif: Fix it wrt erraddcallingcontext()), now fix
raisef, which I originally overlooked.

6c2abbbf

31 Jul, 2016 3 commits

xcommit_tree: Teach it to create commit without spawning `git commit-tree ...` · 3a7b390c

Kirill Smelkov authored Jul 31, 2016

Because spawning a separate process per commit is slow.

Libgit2 does not allow creating commits knowing only tree & parentv
sha1s, but we can create commit objects by hand pretty easily - their format is

    tree <sha1>
    parent <parent1-sha1>
    parent <parent2-sha1>
    ...
    author user <email> date +offset
    committer user <email> date +offset
    LF
    message
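
As an illustration of that format, the sketch below lays out a commit by
hand and computes the id git would give it (sha1 over "<type> <size>\0"
followed by the content). The function names, the identity and the date
are made up for the example; writing the object into the object database
is not shown.

```go
package main

import (
	"crypto/sha1"
	"fmt"
	"strings"
)

// gitObjectSha1 returns the id git assigns to raw content of a given type.
func gitObjectSha1(objType string, content []byte) string {
	h := sha1.New()
	fmt.Fprintf(h, "%s %d\x00", objType, len(content))
	h.Write(content)
	return fmt.Sprintf("%x", h.Sum(nil))
}

// rawCommit lays out commit content in the format shown above.
// author and committer are already formatted as "user <email> date +offset".
func rawCommit(tree string, parents []string, author, committer, msg string) []byte {
	var b strings.Builder
	fmt.Fprintf(&b, "tree %s\n", tree)
	for _, p := range parents {
		fmt.Fprintf(&b, "parent %s\n", p)
	}
	fmt.Fprintf(&b, "author %s\n", author)
	fmt.Fprintf(&b, "committer %s\n", committer)
	b.WriteString("\n") // the LF separating headers from the message
	b.WriteString(msg)
	return []byte(b.String())
}

func main() {
	emptyTree := gitObjectSha1("tree", nil) // id of the empty tree
	who := "A U Thor <author@example.com> 1469923200 +0300" // made-up identity
	commit := rawCommit(emptyTree, nil, who, who, "initial\n")
	fmt.Printf("%s", commit)
	fmt.Println("commit id:", gitObjectSha1("commit", commit))
}
```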

Time for pulling-in kirr/slapos.git

before: 2.5s
after:  0.9s

NOTE AuthorInfo is changed to inherit from git.Signature (same fields
    and semantics)

NOTE Since libgit2 default ident can fail, and does not look beyond
    user.name and user.email, we do the fallback identity detection
    (user/hostname) ourselves - in a way similar to what Git does.

3a7b390c

Move xcommit_tree() & friends to gitobjects.go · cc450765

Kirill Smelkov authored Jul 31, 2016

We are going to rework this function, but before adding changes let's
move it to a more appropriate place. Since xcommit_tree() creates a
commit object from tree and parents and is a pretty standard git
function - the appropriate place is gitobjects.

NOTE we cannot just replace xcommit_tree() with g.CreateCommit() as the
latter works with already loaded tree and parent objects, but we
want to be able to make commits knowing only tree and parents sha1.

cc450765

Verify tag/tree/blob encoding is consistent and always the same · 5aac4734

Kirill Smelkov authored Jul 31, 2016

In an upcoming patch we are going to switch xcommit_tree() to our own
implementation, and since this can potentially change how commits are
represented, for backward compatibility we need to make sure objects
encoded as commits stay the same.

So for all kinds of objects (they are present in testdata/ repositories)
add checks that:

    - encode/decode is idempotent
    - encoding and decoding produces exactly the expected sha1

One nice side effect is that we can now remove the runtime consistency
check from the tail of decoding. That check was there from the
beginning - from 6f237f22 (git-backup: Initial draft) - mainly because
there was no testsuite at that time. The place of that check was,
however, not even completely right: if we somehow pulled an object
wrongly, that has to be detected at pull time, not at restore time. So
the check covered only 1/2 of the implementation - and not the main
one: it only verified that decoding does not mess things up.

Since we now have a proper testsuite and add encode/decode tests in this
patch, we can remove that partial runtime check. And even if decoding
messes something up despite being covered by tests, it will be 100%
caught by the restore process: if some object that needs to be present
in an extracted repository is missing, pack generation for that
repository will fail. So the removal is safe.

Time for restoring kirr/slapos.git from lab.nexedi.com backup

before: 5.5s
after:  3.5s

( the difference is that large because there are ~500 tags in slapos.git
  and tag encoding is currently done by spawning a separate subprocess
  per tag )

5aac4734

30 Jul, 2016 1 commit

pull: Add blobs to index in batch · dbf86b19

Kirill Smelkov authored Jul 30, 2016

Do not waste resources by spawning `git update-index ...` for every file
converted to blob - we can queue the info and add all entries to the
index in one go.
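
A rough sketch of such batching is given below. It assumes the entries
are fed to `git update-index -z --index-info`, which reads
"<mode> SP <sha1> TAB <path>" records from stdin; whether the commit uses
exactly this invocation is an assumption, and the type and function names
are hypothetical.

```go
package main

import (
	"bytes"
	"fmt"
	"os"
	"os/exec"
)

// indexEntry is one queued "file converted to blob" record.
type indexEntry struct {
	mode uint32 // e.g. 0o100644 for a regular file
	sha1 string // blob id, 40 hex digits
	path string // path inside the index
}

// indexInfo renders entries as NUL-terminated --index-info records.
func indexInfo(entries []indexEntry) []byte {
	var b bytes.Buffer
	for _, e := range entries {
		fmt.Fprintf(&b, "%06o %s\t%s\x00", e.mode, e.sha1, e.path)
	}
	return b.Bytes()
}

// addToIndex adds all entries with a single git invocation.
// It must run inside a git repository.
func addToIndex(entries []indexEntry) error {
	cmd := exec.Command("git", "update-index", "-z", "--index-info")
	cmd.Stdin = bytes.NewReader(indexInfo(entries))
	cmd.Stderr = os.Stderr
	return cmd.Run()
}

func main() {
	queue := []indexEntry{
		{0o100644, "e69de29bb2d1d6434b8b29ae775ad8c2e48c5391", "dir/empty file.txt"},
	}
	// Only show what would be sent; call addToIndex(queue) inside a repository.
	fmt.Printf("%q\n", indexInfo(queue))
}
```

With -z the records are NUL-terminated, so paths containing spaces or
newlines need no quoting.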

Time to pull files part for lab.nexedi.com

before: ~110s
after:    ~3s

dbf86b19

29 Jul, 2016 6 commits

obj_recreate_from_commit: Re-create tag without spawning hash-object · c33dc392

Kirill Smelkov authored Jul 30, 2016

Time for restoring kirr/slapos.git from lab.nexedi.com backup

before: 7.4s
after:  5.6s

c33dc392

Switch xload_tag() to work without spawning Git subprocess · 5b1cdca3

Kirill Smelkov authored Jul 30, 2016

We can reuse ReadObject() like for blob_to_file().

We cannot drop xload_tag() in favor of Repository.LookupTag() because
upon tag loading we need to have not only parsed tag, but also its raw
content for encoding in another object.

Time for restoring kirr/slapos.git from lab.nexedi.com backup

before: 8.9s
after:  7.4s

( it goes down because on restore the restored tags are re-encoded to
  verify that restoration was ok. Pulling time should go down
  correspondingly as well )

5b1cdca3

Switch file_to_blob() and blob_to_file() to work without spawning Git subprocesses · fbd72c02

Kirill Smelkov authored Jul 29, 2016

Replace `git cat-file` with Odb.Read() and `git hash-object -w` with
Odb.Write().

Timing for restoring only files from lab.nexedi.com backup:

before: ~95s
after:   ~8s

Timings for the files part of making a backup should improve similarly.

fbd72c02

Drop xload_commit() in favor of git2go's Repository.LookupCommit() · 87283e4b

Kirill Smelkov authored Jul 29, 2016

This saves us one `git cat-file` call per recreated tag.

Time for restoring kirr/slapos.git from lab.nexedi.com backup

before: 10.3s
after:   8.9s

87283e4b

Hook in git2go (cgo bindings to libgit2) · 624393db

Kirill Smelkov authored Jul 29, 2016

Currently for every file -> blob and blob -> file conversion we invoke a
git subprocess (cat-file or hash-object). We also invoke a git subprocess
for every tag read/write, and the same for commits. This scheme of 1
subprocess per 1 object has very high overhead.

The ways to avoid such overhead could be:

1) for every kind of operation spawn git service process, like e.g.
   `git cat-file --batch` for reading files, and only do request/reply
   per object with it.

2) use some go library to work with git repository ourselves.

"1" can work but:

    - at present there is no counterpart of `cat-file --batch` for
      e.g. `hash-object` - i.e. we cannot write objects without quirks
      or patching git.

    - even if we add support for hashing via request/reply, as all
      requests are processed sequentially on git side by e.g. `git
      cat-file --batch`, we won't be able to leverage parallelism.

    - request/reply has also latency attached.

For "2" we have roughly the following choices:

    - use cgo bindings to libgit2   (git2go)

    - use some pure-go git library

The pure-go approach has the advantage that by design it avoids problems
related to the tricky rules for passing pointers between C and Go in CGo.
The fact that this was sorted out by the go team itself only during the
1.6 cycle

    https://github.com/golang/go/issues/12416

tells a lot. The net is full of examples where those rules were hard to
get right, and git2go in particular has a history of e.g. heap corruption
(the bug was on the golang side itself and was fixed only for 1.5)

    https://github.com/libgit2/git2go/issues/223
    https://groups.google.com/forum/#!topic/golang-nuts/Vi1HD-54BTA/discussion

However there is no good (to my knowledge) pure-go git library, and the
family of forks around github.com/speedata/gogit either:

    - works 3x slower compared to git2go

      ( or the same 3x in serial mode compared to e.g. `git cat-file --batch`,
        as in serial mode the git subservice and git2go have roughly similar
        performance )

    - or does not work at all (e.g. barfing out on REF_DELTA pack
      entries, etc)

So because of the 3x slowdown, the pure-go way is currently a non-starter.

Since one person from golang team cared to update git2go to properly
follow the CGo rules

    https://github.com/libgit2/git2go/pull/282

we can be relatively confident about git2go bindings quality and try to
use it.

This commit only hooks git2go into the build, subcommands and to Sha1
for to/from Oid conversion. We'll be switching individual places over to
git2go incrementally in upcoming patches.

NOTE for now we need git2go from next branch for

    https://github.com/libgit2/git2go/commit/cf7553e7

The plan is to eventually switch to

    gopkg.in/libgit2/git2go.v25

once it is out.

624393db

Rename git() -> ggit() · fdaa4a19

Kirill Smelkov authored Jul 29, 2016

We are going to use git2go (see next patch) for which canonical import
path is git (import "github.com/libgit2/git2go" results in package name
being autotruncated to just "git") so free up the "git" name for that
package.

Reason is: git() - as a function - is not used often, while the package
will be used often.

Regarding naming: not sure it is a good choice, but ggit() is something
like xgit(), only g is for "GitError".

fdaa4a19

27 Jul, 2016 1 commit

NOTES.restore: Clarify heuristic to limit search · ad6c6853

Kirill Smelkov authored Jul 27, 2016

We can do similar to what git does for blobs - searching in a window of
repositories sorted by repo basename.

ad6c6853

25 Jul, 2016 1 commit

error/mypkgname: Fix for a package living under dotted prefix · 36da74e6

Kirill Smelkov authored Jul 25, 2016

In 28986e0e (Rewrite in Go) I added mypkgname() with a comment that Go
escapes all '.' in a function name with %2e. That turned out not to be
true: Go escapes only dots in the last component after the last slash, e.g.

    lab.nexedi.com/kirr/git-backup/package%2ename.Function
    lab.nexedi.com/kirr/git-backup/pkg2.qqq/name%2ezzz.Function

Correct mypkgname() accordingly.
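
Under that rule, recovering the package name from a function name could
be sketched as follows. The helper pkgname() is hypothetical, and
returning the full import path is an assumption about the intended
result; it is not the actual mypkgname() code.

```go
package main

import (
	"fmt"
	"runtime"
	"strings"
)

// pkgname extracts the package path from a function name as reported by
// the Go runtime, e.g. "lab.nexedi.com/kirr/git-backup/package%2ename.Function".
func pkgname(funcname string) string {
	// Everything up to and including the last slash is kept as-is:
	// dots there are not escaped.
	slash := strings.LastIndexByte(funcname, '/')
	prefix, tail := funcname[:slash+1], funcname[slash+1:]

	// In the last component the package name ends at the first literal
	// dot; dots belonging to the package name itself appear as %2e.
	if dot := strings.IndexByte(tail, '.'); dot >= 0 {
		tail = tail[:dot]
	}
	return prefix + strings.ReplaceAll(tail, "%2e", ".")
}

func main() {
	pc, _, _, _ := runtime.Caller(0)
	fmt.Println(pkgname(runtime.FuncForPC(pc).Name()))
	fmt.Println(pkgname("lab.nexedi.com/kirr/git-backup/pkg2.qqq/name%2ezzz.Function"))
}
```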

Noted while trying to run git-backup in a GOPATH root, not as
standalone.

36da74e6

07 Jul, 2016 2 commits

raiseif: Fix it wrt erraddcallingcontext() · 302aaaea

Kirill Smelkov authored Jul 07, 2016

erraddcallingcontext() already tries not to go beyond raise, but since
raiseif was calling raise, it was omitting raise but not raiseif itself.
So an error could look like this

    cmd_restore: raiseif: mkdir ../R/1: file exists

while it should be

    cmd_restore: mkdir ../R/1: file exists

Fix it.

302aaaea

My notes on algorithmic restore optimization · 7fcb8c67

Kirill Smelkov authored Apr 18, 2016

when/if we ever get to need them.

7fcb8c67

06 Jul, 2016 2 commits

obj_represent_as_commit is always called with obj_type non-empty · b8bd89a3

Kirill Smelkov authored Jul 06, 2016

Autodetecting the object type when obj_type=None was a leftover default
that was never needed: from the very beginning - from bbee44ce (Start of
git-backup.git) - obj_represent_as_commit() has always been called with
obj_type explicitly passed in.

So remove the leftover.

b8bd89a3

Rewrite in Go · 28986e0e

Kirill Smelkov authored Jul 06, 2016

This is more-or-less 1-to-1 port of git-backup to Go. There are things
we handle a bit differently:

- there is a separate type for Sha1
- conversion of repo paths to git references is now more robust wrt
  avoiding constructs not allowed in git, like ".." or ".lock"

  https://git.kernel.org/cgit/git/git.git/tree/refs.c?h=v2.9.0-37-g6d523a3#n34

The rewrite happened because we need to optimize restore, and e.g. for
the parallelizing part it should be convenient to use goroutines and
channels.

I'm not very comfortable with how error handling is done: contrary to
what the canonical Go way seems to be, in a lot of places exceptions
still look to me like a better idea than plain error codes, though in
many other places plain error codes are better and make more sense.
Probably there will be fewer exceptions over time, once the code becomes
a set of collaborating goroutines communicating via channels.

Still a lot of Python habits on my side.

And as a bonus we now have end-to-end pull/restore tests...

28986e0e

20 Jun, 2016 2 commits

ref_to_repo: Fix format error on raise · a6cfe210

Kirill Smelkov authored Jun 20, 2016

Bug present since the beginning: 6f237f22 (git-backup: Initial draft).

a6cfe210

pull: Turns unused refs are removed not 100% and a lot of empty directories are accumulated · 62374038

Kirill Smelkov authored Jun 20, 2016

Even though we delete all temporary refs after pull, git leaves empty
directories in the place where the refs were - for example if there was
a ref

    dir/ref

and we delete ref `ref`, the empty dir/ is still left there.

This hurts the performance of subsequent pulls more and more - before
pulling git wants to scan all local refs, and while doing so it descends
into all directories under refs/.

As after several pulls we can have many such empty directories under
refs/backup/, this scanning can take quite some time: e.g. for
lab.nexedi.com normal pull currently takes ~3 minutes, but after doing
pull ~60 times, it can become as bad as ~10 minutes for one pull. And
all that slowness goes away after cleaning refs/backup/ manually.

/cc https://lab.nexedi.com/lab.nexedi.com/lab.nexedi.com/issues/4

62374038

13 Jun, 2016 1 commit

readme: Fix relative URLs in README · ac60eb17

Kirill Smelkov authored Jun 13, 2016

Same story as in e.g.

    wendelin.core@b0b2c52e

( in short: GitLab now prepends namespace/repo/blob/ref/ prefix by
  itself )

ac60eb17

02 May, 2016 1 commit

restore: x_heads_sha1 is the same as repo_sha1_heads · 94a54c8f

Kirill Smelkov authored May 02, 2016

No need to compute that twice. My mistake from original 6f237f22
(git-backup: Initial draft).

94a54c8f

13 Apr, 2016 2 commits

Fix link to README · 21bed0c6

Kirill Smelkov authored Apr 13, 2016

README.txt was renamed -> README.rst in a695bdbe (readme: .txt -> .rst)

21bed0c6

contrib/gitlab-backup: Intro · fa5226c9

Kirill Smelkov authored Apr 13, 2016

Add a short introduction which outlines what gitlab-backup program does.

fa5226c9

29 Feb, 2016 1 commit

gitlab-backup/restore: Gitlab >= 8.5 now wants uploads to be 0700 · 6a2852cf

Kirill Smelkov authored Feb 29, 2016

Following up on 48062989 (gitlab-backup/restore: Gitlab wants uploads/
to be 0750 and dirs inside uploads/ to be 0755):

Starting from 8.5:

    https://gitlab.com/gitlab-org/gitlab-ce/commit/4f946f03

GitLab wants uploads/ to be 0700 and dirs inside uploads/ to be 0700
too.

6a2852cf

28 Feb, 2016 1 commit

gitlab-backup/restore: Don't allow ln ambiguity (which can lead to failures) · 7279754d

Kirill Smelkov authored Feb 28, 2016

ln has several syntaxes. From `man 1 ln`:

   SYNOPSIS
          ln [OPTION]... [-T] TARGET LINK_NAME   (1st form)
          ln [OPTION]... TARGET                  (2nd form)
          ln [OPTION]... TARGET... DIRECTORY     (3rd form)
          ln [OPTION]... -t DIRECTORY TARGET...  (4th form)

so without -T or -t it is ambiguous what is the target and what is the
link name, and ln tries to guess. Now imagine:

    ln -sf /path/to/new/hook    $H

and let us consider that $H is already a symlink, pointing to some place
which _exists_, but which the current user does not have access to. Then
ln will complain:

    ln: accessing `$H': Permission denied

and abort.

Fix it by explicitly specifying the ln form we use with -T.

7279754d

10 Feb, 2016 1 commit

Make sure git will recognize *.git as repositories, even empty ones, after restore · b770b689

Kirill Smelkov authored Feb 10, 2016

On restore we were initializing refs/ and objects/ for repositories
obtained from the backed-up refs set, but this approach does not cover
empty repositories - i.e. repositories without any ref at all.

A frequent case for this is *.wiki.git in gitlab: if we restore only
files for such a repo, without empty refs/ and objects/, it would look
like it was restored ok, but any git-related operation on such a repo
will fail.

Fix it by making sure to create refs/ and objects/ the first time we
see a *.git while restoring files.

/cc @kazuhiko

b770b689

09 Feb, 2016 9 commits

gitlab-backup: Cosmetics · 02c80d58

Kirill Smelkov authored Feb 09, 2016

Add comments about what each function does, and add appropriate echo
calls which were missing in several pull & restore places.

02c80d58

gitlab-backup/restore: Review restoration commands + add way to actually run them on user request · 14ce9ff3

Kirill Smelkov authored Feb 09, 2016

- don't start/stop services - we assume the appropriate services
  start/stop will be done by the invoker, and tell people to do so by
  printing proper comments. (Rationale: services are started/stopped
  differently on different systems, e.g. in omnibus and in slapos)

- mv repositories into place atomically with just 1 mv + fix the case
  when there was no repositories/ previously at all.

- adjust `gitlab-rake gitlab:backup:restore` with force=yes, so it does
  not interactively ask whether it is ok to restore ssh keys - just do it.

- add `-go` option to actually run gitlab restoration in addition to
  preparing backup files.

/cc @kazuhiko

14ce9ff3

gitlab-backup/restore: Allow restoration on higher GitLab version, if user requests so · a8ba07d5

Kirill Smelkov authored Feb 09, 2016

Currently GitLab backup restoration works only on exactly the same
GitLab version as the one with which the backup was made:

    https://gitlab.com/gitlab-org/gitlab-ce/blob/7383453b/lib/backup/manager.rb#L132

However in many cases restoring a backup on a newer GitLab version is
desirable - e.g. when moving a GitLab instance to upgraded software.
GitLab's answer - that we should first prepare exactly the same GitLab
version on the moved instance, restore the backup, and then upgrade
GitLab itself _in place_ - is not satisfactory in e.g. the slapos case,
as upgrading can take a long time, and in-place software changes can
render the GitLab instance non-working.

What we prefer to do is to fully prepare the new GitLab software
version, and then, knowing the software is ready, restore the backup
quickly.

The following analysis says we should be 99% ok to do so:

1. git-backup cares about backward compatibility of the repositories
   backup format.
2. db dump is backward compatible, because Rails, when seeing an old db
   schema, will run migrations.
3. the rest is relatively minor - e.g. uploads, which are just files in
   a tar, and the format of such things seldom changes.

Because of 3, strictly speaking, it is not 100% correct to restore a
backup from an older gitlab version to a newer one (since gitlab does not
promise backward compatibility for e.g. the uploads/ backup format),
but in practice it is 99% correct and is usually handy.

/cc @kazuhiko

a8ba07d5

gitlab-backup/restore: Gitlab wants uploads/ to be 0750 and dirs inside uploads/ to be 0755 · 48062989

Kirill Smelkov authored Feb 09, 2016

As with repositories (see patch "gitlab-backup/restore: GitLab wants
repositories to be drwxrws---") Gitlab wants proper permissions for
uploads/ - else the following checks fail

    Uploads directory setup correctly? ... no
      Try fixing it:
      sudo chmod 0750 .../var/gitlab/uploads
      For more information see:
      doc/install/installation.md in section "GitLab"
      Please fix the error above and rerun the checks.

    Uploads directory setup correctly? ... no
      Try fixing it:
      sudo chown -R slapuser14 .../var/gitlab/uploads
      sudo find .../var/gitlab/uploads -type f -exec chmod 0644 {} \;
      sudo find .../var/gitlab/uploads -type d -not -path .../var/gitlab/uploads -exec chmod 0755 {} \;

and files are not served back from uploads - e.g. uploaded icons are not shown.

/cc @kazuhiko

48062989

gitlab-backup/restore: Adjust hooks links to point to current gitlab-shell location · a3e3e5ad

Kirill Smelkov authored Feb 09, 2016

By design Gitlab currently symlinks *.git/hooks to hooks in the
gitlab-shell working tree. Since the gitlab-shell worktree can be located
in another place when restoring a backup on a different machine, all
hooks need to be adjusted upon restoration.

Btw, Gitlab itself does the same:

    https://gitlab.com/gitlab-org/gitlab-ce/blob/7383453b/lib/backup/repository.rb#L103
    https://gitlab.com/gitlab-org/gitlab-ce/commit/1d03fa2e

/cc @kazuhiko

a3e3e5ad

gitlab-backup/restore: GitLab wants repositories to be drwxrws--- · c8ac2f3a

Kirill Smelkov authored Feb 09, 2016

As git-backup does not currently preserve file permissions fully, we
need to adjust them on restore. For repositories after restore the
following gitlab check currently fails:

    Repo base access is drwxrws---? ... no

Fix it.

/cc @kazuhiko

c8ac2f3a

gitlab-backup: Split each table to parts <= 16M in size · d31febed

Kirill Smelkov authored Feb 09, 2016

As was outlined 2 patches before (gitlab-backup: Dump DB ourselves),
currently the DB dump is not git friendly, because for each table the
dump is just one (potentially large) file which grows over time. In
Gitlab there is one big table which accounts for ~95% of the whole dump
size.

So to avoid overloading git with large blobs, which it handles
inefficiently, let's split each table into parts <= 16M in size.

The fact that table data is sorted (see previous patch) helps the
splitting result to be more-or-less stable: as we split not purely by
byte size but by lines, and the max size of 16M is only approximate, if
a row is changed in a part, the table will be split the same way on the
next backup run.
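
A line-based split of this kind could be sketched as below. The function
name and the exact rule (close a part when the next line would push it
over the limit) are assumptions made for illustration; the commit itself
may cut parts differently.

```go
package main

import (
	"bytes"
	"fmt"
)

// splitByLines cuts data into parts of at most maxSize bytes, cutting
// only at line boundaries. A single line longer than maxSize becomes a
// part of its own, so the limit is approximate.
func splitByLines(data []byte, maxSize int) [][]byte {
	var parts [][]byte
	start := 0 // start of the part being built
	pos := 0   // start of the line being looked at
	for pos < len(data) {
		// end of the current line, including its LF if there is one
		end := len(data)
		if i := bytes.IndexByte(data[pos:], '\n'); i >= 0 {
			end = pos + i + 1
		}
		// close the part if this line would overflow a non-empty part
		if end-start > maxSize && pos > start {
			parts = append(parts, data[start:pos])
			start = pos
		}
		pos = end
	}
	if start < len(data) {
		parts = append(parts, data[start:])
	}
	return parts
}

func main() {
	table := []byte("1\talpha\n2\tbeta\n3\tgamma\n4\tdelta\n")
	for i, part := range splitByLines(table, 16) { // 16 bytes here, 16M in the text
		fmt.Printf("part %d: %q\n", i, part)
	}
}
```

Concatenating the parts in order gives back the original table dump, so
restore only has to read them back sequentially.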

This does not work so well when row entries are large themselves (e.g.
for big patches which change a lot of files with big diffs). For such
cases splitting can be improved by splitting at edges found similarly
to e.g. bup[1] - at points determined by a rolling checksum - but for
now we are staying with the simpler way of doing the split.

This reduces load on git packing (e.g. for repack or when doing fetch
and push) a lot.

[1] https://github.com/bup/bup

/cc @kazuhiko

d31febed

gitlab-backup: Sort each DB table data · 5534e682

Kirill Smelkov authored Feb 09, 2016

As was outlined in the previous patch, the DB dump is currently not
git/rsync friendly because the order of rows in a PostgreSQL dump
constantly changes:

pg_dump dumps table data with `COPY ... TO stdout`, which does not
guarantee any ordering -
  http://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=src/bin/pg_dump/pg_dump.c;h=aa01d6a6;hb=HEAD#l1590
  http://stackoverflow.com/questions/24622579/does-or-can-the-postgresql-copy-to-command-guarantee-a-particular-row-order
- in fact it dumps data as stored raw in DB pages, and every record
update changes row order.

On the other hand, by convention Rails adds an integer `id` first column
to every table by default -
  http://edgeguides.rubyonrails.org/active_record_basics.html
and GitLab does not override this. So we can sort tables on id and this
way make data order stable.

And even if there is no id column we can sort - as COPY does not
guarantee ordering, we can change the order of rows in _whatever_ way and
the dump will still be correct.
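
The sorting idea can be sketched like this, assuming rows are lines of
COPY text output with tab-separated columns. The function name and the
fallback rule (plain lexicographic order when the first column is not an
integer in every row) are assumptions for illustration.

```go
package main

import (
	"fmt"
	"sort"
	"strconv"
	"strings"
)

// sortRows orders COPY text rows in place: by integer first column when
// every row has one, otherwise lexicographically by the whole line.
func sortRows(lines []string) {
	type row struct {
		id   int64
		line string
	}
	rows := make([]row, len(lines))
	byID := true
	for i, l := range lines {
		first := strings.SplitN(l, "\t", 2)[0]
		id, err := strconv.ParseInt(first, 10, 64)
		if err != nil {
			byID = false // no integer id here: fall back for the whole table
		}
		rows[i] = row{id, l}
	}
	sort.Slice(rows, func(i, j int) bool {
		if byID && rows[i].id != rows[j].id {
			return rows[i].id < rows[j].id
		}
		return rows[i].line < rows[j].line
	})
	for i := range rows {
		lines[i] = rows[i].line
	}
}

func main() {
	rows := []string{"10\tzeta", "9\talpha", "2\tmu"}
	sortRows(rows)
	fmt.Printf("%q\n", rows)
}
```

Deciding between the two orders once per table keeps the comparison
consistent, which a per-pair mix of numeric and string comparison would
not.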

This change helps git a lot to find good object deltas in less time, and
it should also help rsync to find smaller deltas between backup dumps.

NOTE no changes are needed on restore side at all - the dump stays valid
    - sorted or not, and restores to semantically the same DB, even if
    internal rows ordering is different.

/cc @kazuhiko

5534e682

gitlab-backup: Dump DB ourselves · 6fa6df4b

Kirill Smelkov authored Feb 08, 2016

The reason to do this is that we want to have more control over DB dump
process. Current problems which lead to this decision are:

1. DB dump is one large file whose size grows over time. This is not
friendly to git;

2. DB dump is currently not git/rsync friendly - when PostgreSQL
does a dump, it just copies internal pages of data to the output.
And the internal ordering changes every time a row is updated.

http://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=src/bin/pg_dump/pg_dump.c;h=aa01d6a6;hb=HEAD#l1590
http://stackoverflow.com/questions/24622579/does-or-can-the-postgresql-copy-to-command-guarantee-a-particular-row-order

Both 1 and 2 currently bring our backup tool to its knees. We'll be
handling those issues in the following patches.

For now we perform the dump manually and switch from dumping in
plain-text SQL to dumping in PostgreSQL native "directory" format, where
there is a small table of contents with the schema (toc.dat) and the
output of `COPY <table> TO stdout` for each table in a separate file.

http://www.postgresql.org/docs/9.5/static/app-pgdump.html

On restore we regenerate plain-text SQL with pg_restore and give this
plain-text SQL back to gitlab, so it thinks it restores it the usual way.

NOTE: backward compatibility is preserved - the restore part, if it sees
a backup made by an older version of gitlab-backup, which dumped
database.sql in plain text, restores it correctly.

NOTE2: gitlab-backup now supports only PostgreSQL (e.g. not MySQL).
Adding support for other databases is possible, but requires a custom
handler for every DB (or maybe just a fallback to the usual plaintext).

NOTE3: even though we split the DB into separate tables, this does not
currently help with problem #1, as in GitLab it is mostly just one table
which occupies the whole space.

/cc @kazuhiko

6fa6df4b

08 Feb, 2016 1 commit

gitlab-backup: Make $tmpd absolute · 5cdfd51e

Kirill Smelkov authored Feb 08, 2016

So far having $tmpd relative worked ok, but in the next patch we are
going to pass this directory to a command which, when run, changes its
working directory as a first step, so passing $tmpd as a relative
pathname won't work for it.

So switch $tmpd to be an absolute path.

/cc @kazuhiko

5cdfd51e