  1. 31 Jul, 2016 1 commit
    • Verify tag/tree/blob encoding is consistent and always the same · 5aac4734
      Kirill Smelkov authored
      In an upcoming patch we are going to switch xcommit_tree() to our own
      implementation. Since this can potentially change how commits are
      represented, for backward compatibility we need to make sure that
      objects encoded as commits stay the same.
      
      So for all kinds of objects (they are present in testdata/ repositories)
      add checks that:
      
          - encode/decode is idempotent
          - encoding and decoding produce exactly the expected sha1
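
      The two checks above can be sketched as follows. This is only an
      illustration: toy_encode() / toy_decode() are hypothetical stand-ins,
      not git-backup's real commit-based encoding. Only the object id formula
      (sha1 over "<type> <size>" + NUL + payload) is git's own.

```python
import hashlib


def git_sha1(obj_type: str, payload: bytes) -> str:
    # git object id: sha1 over "<type> <size>", a NUL byte, then the payload
    header = f"{obj_type} {len(payload)}\0".encode("ascii")
    return hashlib.sha1(header + payload).hexdigest()


def toy_encode(obj_type: str, payload: bytes) -> bytes:
    # hypothetical encoding: type on the first line, raw payload after it
    return obj_type.encode("ascii") + b"\n" + payload


def toy_decode(encoded: bytes):
    obj_type, _, payload = encoded.partition(b"\n")
    return obj_type.decode("ascii"), payload


def check_object(obj_type: str, payload: bytes, expected_sha1: str) -> None:
    encoded = toy_encode(obj_type, payload)
    decoded_type, decoded_payload = toy_decode(encoded)
    # decoding gives back exactly what was encoded
    assert (decoded_type, decoded_payload) == (obj_type, payload)
    # re-encoding the decoded object gives the same bytes again
    assert toy_encode(decoded_type, decoded_payload) == encoded
    # the decoded object hashes to exactly the expected sha1
    assert git_sha1(decoded_type, decoded_payload) == expected_sha1


# the empty blob and the empty tree have well-known ids
check_object("blob", b"", "e69de29bb2d1d6434b8b29ae775ad8c2e48c5391")
check_object("tree", b"", "4b825dc642cb6eb9a060e54bf8d69288fbee4904")
```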
      
      One nice side effect of this is that we can now remove the runtime
      consistency check from the tail of decoding. That check was there from
      the beginning - from 6f237f22 (git-backup: Initial draft) - mainly
      because there was no testsuite at that time. The check was, however,
      not even in the right place: if we somehow wrongly pulled an object,
      that has to be detected at pull time, not at restore time. So the check
      was covering only half of the implementation - and not the main one -
      namely that decoding does not mess things up.
      
      Since we now have a proper testsuite and add encode/decode tests in this
      patch, we can remove that partial runtime check. And even if decoding
      messes something up despite being covered by tests, it will be 100%
      caught by the restore process: if an object which needs to be present
      in an extracted repository is missing, pack generation for that
      repository will fail. So the removal is safe.
      
      Time for restoring kirr/slapos.git from lab.nexedi.com backup
      
      before: 5.5s
      after:  3.5s
      
      ( the gain is this large because there are ~ 500 tags in slapos.git and
        tag encoding currently spawns a separate subprocess per tag )
  2. 30 Jul, 2016 1 commit
    • pull: Add blobs to index in batch · dbf86b19
      Kirill Smelkov authored
      Do not waste resources by spawning `git update-index ...` for every file
      converted to a blob - we can queue the info and add all entries to the
      index in one go.
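
      A minimal sketch of such batching, assuming the queued entries are fed
      to a single `git update-index --index-info` process. The commit message
      does not say which update-index mode is actually used, so treat that
      choice, and the helper names, as assumptions.

```python
import subprocess


def index_info_lines(entries) -> bytes:
    # entries: iterable of (mode, sha1, path), e.g. ("100644", "<sha1>", "a/b")
    # --index-info accepts lines of the form "<mode> SP <sha1> TAB <path>"
    # (paths with unusual characters would additionally need quoting)
    return b"".join(
        f"{mode} {sha1}\t{path}\n".encode("utf-8") for mode, sha1, path in entries
    )


def add_to_index(git_dir: str, entries) -> None:
    # one subprocess for the whole queue instead of one per file
    subprocess.run(
        ["git", "--git-dir", git_dir, "update-index", "--index-info"],
        input=index_info_lines(entries),
        check=True,
    )
```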
      
      Time to pull the files part for lab.nexedi.com
      
      before: ~110s
      after:    ~3s
  3. 29 Jul, 2016 6 commits
    • obj_recreate_from_commit: Re-create tag without spawning hash-object · c33dc392
      Kirill Smelkov authored
      Time for restoring kirr/slapos.git from lab.nexedi.com backup
      
      before: 7.4s
      after:  5.6s
    • Switch xload_tag() to work without spawning Git subprocess · 5b1cdca3
      Kirill Smelkov authored
      We can reuse ReadObject(), as we do for blob_to_file().
      
      We cannot drop xload_tag() in favor of Repository.LookupTag() because
      upon tag loading we need not only the parsed tag, but also its raw
      content for encoding in another object.
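
      To illustrate why both forms are needed, here is a small sketch that
      parses a raw tag object while keeping the raw bytes next to the parsed
      fields. The Tag type and parse_tag() are hypothetical names, not the
      actual xload_tag() implementation; only the tag object layout
      (object/type/tag/tagger headers, blank line, message) is git's.

```python
from typing import NamedTuple


class Tag(NamedTuple):
    obj: str        # sha1 of the tagged object
    obj_type: str   # type of the tagged object
    name: str       # tag name
    tagger: str     # tagger line, empty if absent
    message: bytes  # everything after the first blank line
    raw: bytes      # untouched raw content, kept for re-encoding


def parse_tag(raw: bytes) -> Tag:
    head, _, message = raw.partition(b"\n\n")
    fields = {}
    for line in head.split(b"\n"):
        key, _, value = line.partition(b" ")
        # keep the first occurrence of every header
        fields.setdefault(key.decode("ascii"), value.decode("utf-8", "replace"))
    return Tag(fields["object"], fields["type"], fields["tag"],
               fields.get("tagger", ""), message, raw)
```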
      
      Time for restoring kirr/slapos.git from lab.nexedi.com backup
      
      before: 8.9s
      after:  7.4s
      
      ( it goes down because on restore the restored tags are re-encoded to
        verify that restoration was ok. Pulling time should go down
        correspondingly as well )
    • Switch file_to_blob() and blob_to_file() to work without spawning Git subprocesses · fbd72c02
      Kirill Smelkov authored
      Replace `git cat-file` with Odb.Read() and `git hash-object -w` with
      Odb.Write().
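
      For intuition about what in-process object access amounts to, here is a
      sketch that writes and reads loose objects directly. It is not what
      libgit2's Odb does internally (packs, caching, etc. are ignored); the
      function names are made up for illustration.

```python
import hashlib
import os
import zlib


def write_loose(objects_dir: str, obj_type: str, payload: bytes) -> str:
    # in-process counterpart of `git hash-object -w` for loose objects
    store = f"{obj_type} {len(payload)}\0".encode("ascii") + payload
    sha1 = hashlib.sha1(store).hexdigest()
    path = os.path.join(objects_dir, sha1[:2], sha1[2:])
    if not os.path.exists(path):
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "wb") as f:
            f.write(zlib.compress(store))
    return sha1


def read_loose(objects_dir: str, sha1: str):
    # in-process counterpart of `git cat-file` for loose objects
    with open(os.path.join(objects_dir, sha1[:2], sha1[2:]), "rb") as f:
        store = zlib.decompress(f.read())
    header, _, payload = store.partition(b"\0")
    obj_type, size = header.decode("ascii").split(" ")
    if int(size) != len(payload):
        raise ValueError(f"{sha1}: size mismatch")
    return obj_type, payload
```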
      
      Timing for restoring only files from lab.nexedi.com backup:
      
      before: ~95s
      after:   ~8s
      
      Timings for the files part of making a backup should improve similarly.
    • Drop xload_commit() in favor of git2go's Repository.LookupCommit() · 87283e4b
      Kirill Smelkov authored
      This saves us one `git cat-file` call per recreated tag.
      
      Time for restoring kirr/slapos.git from lab.nexedi.com backup
      
      before: 10.3s
      after:   8.9s
    • Hook in git2go (cgo bindings to libgit2) · 624393db
      Kirill Smelkov authored
      Currently for every file -> blob and blob -> file conversion we invoke
      a git subprocess (cat-file or hash-object). We also invoke a git
      subprocess for every tag read/write, and the same for commits, and this
      1-subprocess per 1 object scheme has very high overhead.
      
      The ways to avoid such overhead could be:
      
      1) for every kind of operation spawn git service process, like e.g.
         `git cat-file --batch` for reading files, and only do request/reply
         per object with it.
      
      2) use some go library to work with git repository ourselves.
      
      "1" can work but:
      
          - at present there is no counterpart of `cat-file --batch` for
            e.g. `hash-object` - i.e. we cannot write objects without quirks
            or patching git.
      
          - even if we add support for hashing via request/reply, all
            requests are processed sequentially on the git side by e.g. `git
            cat-file --batch`, so we won't be able to leverage parallelism.
      
          - request/reply also has latency attached.
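
      For reference, the request/reply exchange of option "1" looks roughly
      like this on the reading side. The sketch only parses replies in the
      `git cat-file --batch` output format; read_batch_reply() is a
      hypothetical helper, and in real use `stream` would be the stdout of a
      long-running `git cat-file --batch` process to which object names are
      written one per line.

```python
import io


def read_batch_reply(stream):
    # reply format: "<sha1> SP <type> SP <size> LF <contents> LF"
    # or, for an unknown object: "<name> SP missing LF"
    header = stream.readline().rstrip(b"\n").split(b" ")
    if len(header) == 2 and header[1] == b"missing":
        raise KeyError(header[0].decode("ascii"))
    sha1, obj_type, size = header
    payload = stream.read(int(size))
    stream.read(1)  # consume the LF that terminates the contents
    return sha1.decode("ascii"), obj_type.decode("ascii"), payload


# two replies back to back, as the service would send them for two requests
replies = io.BytesIO(
    b"e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 blob 0\n\n"
    b"3b18e512dba79e4c8300dd08aeb37f8e728b8dad blob 12\nhello world\n\n"
)
first = read_batch_reply(replies)
second = read_batch_reply(replies)
```

      Every object costs one such round trip, and the replies arrive strictly
      in request order, which is where the latency and the lack of
      parallelism mentioned above come from.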
      
      For "2" we have roughly the following choices:
      
          - use cgo bindings to libgit2   (git2go)
      
          - use some pure-go git library
      
      The pure-Go approach has the advantage that it by design avoids
      problems related to the tricky CGo rules for passing pointers between
      C and Go. The fact that this was sorted out by the Go team itself only
      during the 1.6 cycle
      
          https://github.com/golang/go/issues/12416
      
      tells a lot. The net is full of examples where those were hard to get
      right, and git2go in particular has a history of e.g. heap corruption
      (the bug was on the Go side itself and was fixed only for 1.5)
      
          https://github.com/libgit2/git2go/issues/223
          https://groups.google.com/forum/#!topic/golang-nuts/Vi1HD-54BTA/discussion
      
      However there is no good (to my knowledge) pure-Go git library, and the
      family of forks around github.com/speedata/gogit either:
      
          - works 3x slower compared to git2go
      
            ( or the same 3x in serial mode compared to e.g. `git cat-file --batch`,
              as in serial mode a git subservice and git2go have roughly similar performance )
      
          - or does not work at all (e.g. barfing on REF_DELTA pack
            entries, etc)
      
      So because of the 3x slowdown, the pure-Go way is currently a non-starter.
      
      Since one person from the Go team took care to update git2go to
      properly follow the CGo rules
      
          https://github.com/libgit2/git2go/pull/282
      
      we can be relatively confident about the quality of the git2go
      bindings and try to use them.
      
      This commit only hooks git2go into the build and subcommands, and into
      Sha1 for to/from Oid conversion. We'll be switching places to git2go
      incrementally in upcoming patches.
      
      NOTE for now we need git2go from next branch for
      
          https://github.com/libgit2/git2go/commit/cf7553e7
      
      The plan is to eventually switch to
      
          gopkg.in/libgit2/git2go.v25
      
      once it is out.
    • Rename git() -> ggit() · fdaa4a19
      Kirill Smelkov authored
      We are going to use git2go (see next patch), whose package name is git
      (import "github.com/libgit2/git2go" brings the package in under the
      name "git"), so free up the "git" name for that package.
      
      The reason is: git() - as a function - is not used often, while the
      package will be used often.
      
      Regarding naming: not sure it is a good choice, but ggit() is something
      like xgit(), only the g is for "GitError".
  4. 27 Jul, 2016 1 commit
  5. 25 Jul, 2016 1 commit
    • error/mypkgname: Fix for a package living under dotted prefix · 36da74e6
      Kirill Smelkov authored
      In 28986e0e (Rewrite in Go) I added mypkgname() with a comment that Go
      escapes all '.' in a function name with %2e. That turned out not to be
      true: Go escapes dots only in the last component, after the last slash, e.g.
      
          lab.nexedi.com/kirr/git-backup/package%2ename.Function
          lab.nexedi.com/kirr/git-backup/pkg2.qqq/name%2ezzz.Function
      
      Correct mypkgname() accordingly.
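
      The corrected rule can be sketched as follows. pkgname_of() is a
      hypothetical illustration of the rule, not the actual mypkgname() code,
      and returning the full import path is an assumption made here.

```python
def pkgname_of(funcname: str) -> str:
    # Dots before the last slash are literal; in the last component the
    # first unescaped dot separates the package from the function name.
    prefix, slash, last = funcname.rpartition("/")
    pkg_last = last.split(".", 1)[0]
    # only the last component carries %2e escapes
    return prefix + slash + pkg_last.replace("%2e", ".")


a = pkgname_of("lab.nexedi.com/kirr/git-backup/package%2ename.Function")
b = pkgname_of("lab.nexedi.com/kirr/git-backup/pkg2.qqq/name%2ezzz.Function")
```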
      
      Noted while trying to run git-backup from a GOPATH root, not as
      standalone.
  6. 07 Jul, 2016 2 commits
  7. 06 Jul, 2016 2 commits
    • obj_represent_as_commit is always called with obj_type non-empty · b8bd89a3
      Kirill Smelkov authored
      The default to autodetect the object type when obj_type=None was a
      leftover from the very beginning - from bbee44ce (Start of
      git-backup.git) - and even there obj_represent_as_commit() was always
      called with obj_type passed in explicitly.
      
      So remove the leftover.
    • Rewrite in Go · 28986e0e
      Kirill Smelkov authored
      This is a more-or-less 1-to-1 port of git-backup to Go. There are
      things we handle a bit differently:
      
      - there is a separate type for Sha1
      - conversion of repo paths to git references is now more robust wrt
        avoiding constructs not allowed in git, like ".." or ".lock"
      
        https://git.kernel.org/cgit/git/git.git/tree/refs.c?h=v2.9.0-37-g6d523a3#n34
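
      As an illustration of the kind of escaping involved, here is a sketch
      that maps a repository path to something usable inside a git reference
      name. The %xx escaping scheme and path_to_ref() are hypothetical, not
      necessarily what git-backup does, and the input is assumed to be a
      normalized path without empty components.

```python
import re

# characters git refuses in reference names, plus '%' which we use for escaping
_FORBIDDEN = re.compile(r"[%\x00-\x20\x7f~^:?*\[\\]")


def path_to_ref(path: str) -> str:
    def esc(match):
        return "%%%02x" % ord(match.group())

    components = []
    for comp in path.split("/"):
        comp = _FORBIDDEN.sub(esc, comp)
        comp = comp.replace("..", ".%2e")       # no ".." anywhere
        comp = comp.replace("@{", "@%7b")       # no "@{" sequence
        if comp.startswith("."):                # no component starting with "."
            comp = "%2e" + comp[1:]
        if comp.endswith(".lock"):              # no component ending in ".lock"
            comp = comp[:-len(".lock")] + "%2elock"
        if comp.endswith("."):                  # no trailing "."
            comp = comp[:-1] + "%2e"
        components.append(comp)
    return "/".join(components)
```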
      
      The rewrite happened because we need to optimize restore, and for e.g.
      the parallelizing part it should be convenient to use goroutines and
      channels.
      
      I'm not very comfortable with how error handling is done: contrary to
      what the canonical Go way seems to be, in a lot of places exceptions
      still look to me like a better idea than plain error codes, though in
      many other places plain error codes are better and make more sense.
      Probably there will be fewer exceptions over time, once the code
      becomes a set of collaborating goroutines communicating via channels.
      
      Still a lot of python habits on my side.
      
      And as a bonus we now have end-to-end pull/restore tests...
  8. 20 Jun, 2016 2 commits
  9. 13 Jun, 2016 1 commit
  10. 02 May, 2016 1 commit
  11. 13 Apr, 2016 2 commits
  12. 29 Feb, 2016 1 commit
  13. 28 Feb, 2016 1 commit
    • gitlab-backup/restore: Don't allow ln ambiguity (which can lead to failures) · 7279754d
      Kirill Smelkov authored
      ln has several syntaxes. From `man 1 ln`:
      
         SYNOPSIS
                ln [OPTION]... [-T] TARGET LINK_NAME   (1st form)
                ln [OPTION]... TARGET                  (2nd form)
                ln [OPTION]... TARGET... DIRECTORY     (3rd form)
                ln [OPTION]... -t DIRECTORY TARGET...  (4th form)
      
      so without -T or -t it is ambiguous what is the target and what is the
      link name, and ln tries to guess. Now imagine:
      
          ln -sf /path/to/new/hook    $H
      
      and suppose that $H is already a symlink pointing to some place which
      _exists_, but which the current user does not have access to. Then ln
      will complain:
      
          ln: accessing `$H': Permission denied
      
      and abort.
      
      Fix it by explicitly specifying the ln form we use, with -T.
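
      The same idea expressed outside of the shell: the sketch below replaces
      a link without ever looking at what the old link points to. Note that
      this swaps in a different technique (symlink + rename) instead of
      `ln -sfT`; force_symlink() and the temporary name are made up for
      illustration.

```python
import os


def force_symlink(target: str, link_name: str) -> None:
    # Create the new link under a temporary name and rename it over the old
    # one. Neither call follows link_name, so it does not matter where the
    # old link points or whether we may access that place.
    tmp = f"{link_name}.tmp{os.getpid()}"
    os.symlink(target, tmp)
    os.replace(tmp, link_name)
```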
  14. 10 Feb, 2016 1 commit
    • Make sure git will recognize *.git as repositories, even empty ones, after restore · b770b689
      Kirill Smelkov authored
      On restore we were initializing refs/ and objects/ for repositories
      obtained from the backed-up refs set, but this approach does not cover
      empty repositories - e.g. repositories without any ref at all.
      
      A frequent case for this is *.wiki.git in gitlab: if we restore only
      the files for such a repo, without empty refs/ and objects/, it would
      look like it was restored ok, but any git-related operation on it will
      fail.
      
      Fix it by making sure to create refs/ and objects/ the first time we
      see a *.git while restoring files.
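
      A minimal sketch of that fix, with a hypothetical helper name; the real
      code lives in the restore path of git-backup and may differ.

```python
import os


def ensure_git_skeleton(repo_dir: str) -> None:
    # git only treats a directory as a repository if refs/ and objects/
    # exist, even when both are empty
    for sub in ("refs", "objects"):
        os.makedirs(os.path.join(repo_dir, sub), exist_ok=True)
```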
      
      /cc @kazuhiko
  15. 09 Feb, 2016 9 commits
    • gitlab-backup: Cosmetics · 02c80d58
      Kirill Smelkov authored
      Add comments about what each function does, and add appropriate echo
      messages which were missing in several pull & restore places.
    • gitlab-backup/restore: Review restoration commands + add way to actually run them on user request · 14ce9ff3
      Kirill Smelkov authored
      - don't start/stop services - we assume appropriate services start/stop
        will be done by the invoker, and tell people to do so by printing
        proper comments. (Rationale: services are started/stopped differently
        on different systems, e.g. in omnibus and in slapos)
      
      - mv in repositories atomically with just 1 mv + fix the case when
        there was no repositories/ previously at all.
      
      - adjust `gitlab-rake gitlab:backup:restore` with force=yes, so it does
        not interactively ask about whether ok to restore ssh keys - just do it.
      
      - add `-go` option to actually run gitlab restoration in addition to
        preparing backup files.
      
      /cc @kazuhiko
    • gitlab-backup/restore: Allow restoration on higher GitLab version, if user requests so · a8ba07d5
      Kirill Smelkov authored
      Currently GitLab backup restoration works only on exactly the same
      GitLab version as the one with which the backup was made:
      
          https://gitlab.com/gitlab-org/gitlab-ce/blob/7383453b/lib/backup/manager.rb#L132
      
      However in many cases restoring a backup on a newer GitLab version is
      desirable - e.g. when moving a GitLab instance to upgraded software.
      GitLab's answer - that we should first prepare exactly the same GitLab
      version on the moved instance, restore the backup, then upgrade GitLab
      itself _in place_ - is not satisfactory in e.g. the slapos case, as
      upgrading can take a long time, and in-place software changes can
      render the GitLab instance non-working.
      
      What we prefer to do is to fully prepare the new GitLab software
      version and then, knowing the software is ready, restore the backup
      quickly.
      
      The following analysis says we should be 99% ok to do so:
      
      1. git-backup cares about backward compatibility of the repositories
         backup format.
      2. db dump is backward compatible, because Rails, when seeing old db
         schema, will run migrations.
      3. the rest is relatively minor - e.g. uploads, which are just files in
         a tar, and the format for such things seldom changes.
      
      Because of 3, strictly speaking, it is not 100% correct to restore a
      backup from an older gitlab version to a newer one (since gitlab does
      not promise backward compatibility of e.g. the uploads/ backup
      format), but in practice it is 99% correct and is usually handy.
      
      /cc @kazuhiko
    • gitlab-backup/restore: Gitlab wants uploads/ to be 0750 and dirs inside uploads/ to be 0755 · 48062989
      Kirill Smelkov authored
      As with repositories (see patch "gitlab-backup/restore: GitLab wants
      repositories to be drwxrws---") Gitlab wants proper permissions for
      uploads/ - else the following checks fail
      
          Uploads directory setup correctly? ... no
            Try fixing it:
            sudo chmod 0750 .../var/gitlab/uploads
            For more information see:
            doc/install/installation.md in section "GitLab"
            Please fix the error above and rerun the checks.
      
          Uploads directory setup correctly? ... no
            Try fixing it:
            sudo chown -R slapuser14 .../var/gitlab/uploads
            sudo find .../var/gitlab/uploads -type f -exec chmod 0644 {} \;
            sudo find .../var/gitlab/uploads -type d -not -path .../var/gitlab/uploads -exec chmod 0755 {} \;
      
      and files are not served back from uploads - e.g. uploaded icons are not shown.
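
      The permission layout requested by those checks can be sketched as
      below. fix_uploads_permissions() is a hypothetical helper; the chown
      step from the check output is left out because it needs privileges.

```python
import os


def fix_uploads_permissions(uploads: str) -> None:
    # uploads/ itself: 0750, directories inside: 0755, files: 0644
    os.chmod(uploads, 0o750)
    for dirpath, dirnames, filenames in os.walk(uploads):
        for name in dirnames:
            os.chmod(os.path.join(dirpath, name), 0o755)
        for name in filenames:
            os.chmod(os.path.join(dirpath, name), 0o644)
```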
      
      /cc @kazuhiko
    • gitlab-backup/restore: Adjust hooks links to point to current gitlab-shell location · a3e3e5ad
      Kirill Smelkov authored
      By design Gitlab currently symlinks *.git/hooks to hooks in the
      gitlab-shell working tree. Since, when restoring a backup on a
      different machine, the gitlab-shell worktree can be located in another
      place, all hooks need to be adjusted upon restoration.
      
      Btw, Gitlab itself does the same:
      
          https://gitlab.com/gitlab-org/gitlab-ce/blob/7383453b/lib/backup/repository.rb#L103
          https://gitlab.com/gitlab-org/gitlab-ce/commit/1d03fa2e
      
      /cc @kazuhiko
    • gitlab-backup/restore: GitLab wants repositories to be drwxrws--- · c8ac2f3a
      Kirill Smelkov authored
      As git-backup does not currently preserve file permissions fully, we
      need to adjust them on restore. For repositories after restore the
      following gitlab check currently fails:
      
          Repo base access is drwxrws---? ... no
      
      Fix it.
      
      /cc @kazuhiko
    • gitlab-backup: Split each table to parts <= 16M in size · d31febed
      Kirill Smelkov authored
      As was outlined 2 patches before (gitlab-backup: Dump DB ourselves),
      the DB dump is currently not git friendly, because the dump of each
      table is just one (potentially large) file which grows over time. In
      Gitlab there is one big table which accounts for ~95% of the whole
      dump size.
      
      So to avoid overloading git with very large blobs, which it handles
      inefficiently, let's split each table into parts <= 16M in size.
      
      The fact that table data is sorted (see previous patch) helps the
      splitting result to be more-or-less stable: as we split not purely by
      byte size but by lines, and the max size of 16M is only approximate, a
      part in which a row changed will be split the same way on the next
      backup run.
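
      A minimal sketch of line-based splitting with an approximate size
      limit. The greedy strategy and the split_lines() name are assumptions
      for illustration; the commit message does not describe the exact
      procedure used.

```python
def split_lines(lines, max_size: int):
    # Group whole lines into parts of at most max_size bytes each.
    # A single line longer than max_size still gets a part of its own,
    # so the limit is only approximate.
    parts, current, current_size = [], [], 0
    for line in lines:
        if current and current_size + len(line) > max_size:
            parts.append(current)
            current, current_size = [], 0
        current.append(line)
        current_size += len(line)
    if current:
        parts.append(current)
    return parts


parts = split_lines([b"aa\n", b"bb\n", b"cc\n"], max_size=6)
```

      With a 6-byte limit the first two 3-byte rows share a part and the
      third row starts a new one; rows are never cut in the middle.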
      
      This does not work so well when row entries are themselves large (e.g.
      for big patches which change a lot of files with a big diff). For such
      cases splitting can be improved by splitting at edges found similarly
      to e.g. bup[1] - by finding nodes of a rolling checksum - but for now
      we are staying with the simpler way of doing the split.
      
      This reduces the load on git packing (e.g. for repack, or when doing
      fetch and push) a lot.
      
      [1] https://github.com/bup/bup
      
      /cc @kazuhiko
    • gitlab-backup: Sort each DB table data · 5534e682
      Kirill Smelkov authored
      As was outlined in the previous patch, the DB dump is currently not
      git/rsync friendly because the order of rows in a PostgreSQL dump
      constantly changes:
      
      pg_dump dumps table data with `COPY ... TO stdout`, which does not guarantee any ordering -
        http://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=src/bin/pg_dump/pg_dump.c;h=aa01d6a6;hb=HEAD#l1590
        http://stackoverflow.com/questions/24622579/does-or-can-the-postgresql-copy-to-command-guarantee-a-particular-row-order
      - in fact it dumps data as stored raw in DB pages, and every record update changes row order.
      
      On the other hand, Rails by default adds an integer `id` first column
      to every table as a convention -
        http://edgeguides.rubyonrails.org/active_record_basics.html
      and GitLab does not override this. So we can sort tables on id and this
      way make data order stable.
      
      And even if there is no id column we can sort - as COPY does not
      guarantee ordering, we can change the order of rows in _whatever_ way and
      the dump will still be correct.
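
      A sketch of such sorting over rows in `COPY ... TO stdout` text format
      (tab-separated columns, one row per line). sort_copy_rows() is a
      hypothetical helper; rows whose first column is not a plain integer
      are ordered by their raw bytes, which is one arbitrary but stable
      choice.

```python
def sort_copy_rows(rows):
    def key(row: bytes):
        first = row.split(b"\t", 1)[0].rstrip(b"\n")
        if first.isdigit():
            return (0, int(first), row)   # numeric id: sort by its value
        return (1, 0, row)                # no integer id: sort by raw bytes
    return sorted(rows, key=key)


rows = [b"10\tz\n", b"2\ta\n", b"\\N\tq\n"]
ordered = sort_copy_rows(rows)
```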
      
      This change helps git a lot to find good object deltas in less time,
      and it should also help rsync to find smaller deltas between backup
      dumps.
      
      NOTE no changes are needed on the restore side at all - the dump stays
          valid, sorted or not, and restores to semantically the same DB,
          even if the internal row ordering is different.
      
      /cc @kazuhiko
    • gitlab-backup: Dump DB ourselves · 6fa6df4b
      Kirill Smelkov authored
      The reason to do this is that we want to have more control over the DB
      dump process. Current problems which lead to this decision are:
      
          1. DB dump is one large file whose size grows over time. This is
             not friendly to git;
      
          2. DB dump is currently not git/rsync friendly - when PostgreSQL
             does a dump, it just copies internal pages for data to output.
             And internal ordering changes every time a row is updated.
      
              http://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=src/bin/pg_dump/pg_dump.c;h=aa01d6a6;hb=HEAD#l1590
              http://stackoverflow.com/questions/24622579/does-or-can-the-postgresql-copy-to-command-guarantee-a-particular-row-order
      
      Both 1 and 2 currently bring our backup tool to its knees. We'll be
      handling those issues in the following patches.
      
      For now we perform the dump manually and switch from dumping in
      plain-text SQL to dumping in PostgreSQL native "directory" format,
      where there is a small table of contents with schema (toc.dat) and the
      output of `COPY <table> TO stdout` for each table in a separate file.
      
          http://www.postgresql.org/docs/9.5/static/app-pgdump.html
      
      On restore we regenerate plain-text SQL with pg_restore and give this
      plain-text SQL back to gitlab, so it thinks it restores it the usual way.
      
      NOTE: backward compatibility is preserved - restore part, if it sees
          backup made by older version of gitlab-backup, which dumps
          database.sql in plain text - restores it correctly.
      
      NOTE2: now gitlab-backup supports only PostgreSQL (e.g. not MySQL).
          Adding support for other databases is possible, but requires custom
          handler for every DB (or just a fallback to usual plaintext maybe).
      
      NOTE3: even though we split the DB into separate tables, this does not
          currently help with problem #1, as in GitLab it is mostly just one
          table which occupies the whole space.
      
      /cc @kazuhiko
  16. 08 Feb, 2016 5 commits
    • gitlab-backup: Make $tmpd absolute · 5cdfd51e
      Kirill Smelkov authored
      Until now a relative $tmpd worked ok, but in the next patch we are
      going to pass this directory to a command which, when run, changes its
      working directory as a first step, so passing $tmpd as a relative
      pathname won't work for it.
      
      So switch $tmpd to be an absolute path.
      
      /cc @kazuhiko
    • gitlab-backup: Refactor need_gitlab_config() a bit · 8099a8bf
      Kirill Smelkov authored
      In the following patches we will be adding more and more settings to
      read from gitlab config, so the code which does this is now structured
      to be better prepared for that:
      
          - part that emits the settings (in Ruby) is now multiline
          - we prepare shortcuts c & s which are Gitlab.config and
            Gitlab.config.gitlab_shell
          - at the end "END" is emitted, and the reader checks for it to
            make sure the generating and reading parts stay in sync.
      
      /cc @kazuhiko
    • gitlab-backup: There is no need to use ';' in inner block of need_gitlab_config() · bb03a6d7
      Kirill Smelkov authored
      It works ok without it:
      
          ---- 8< ---- z.sh
          #!/bin/bash -e
      
          {
              read A
              read B
          } < <(echo -e 'AAA\nBBB')
      
          echo $A
          echo $B
          ---- 8< ----
      
          $ ./z.sh
          AAA
          BBB
          $ echo $?
          0
      
      /cc @kazuhiko
    • gitlab-backup: Use find in a way, that does not hide errors · 64a16570
      Kirill Smelkov authored
      In 495bd2fa (gitlab-backup: Unpack *.tar.gz before storing them in git)
      we used find(1) to find *.tar.gz and unpack/repack them on
      backup/restore. However `find -exec ...` does not stop on errors and
      does not report them. Compare:
      
          ---- 8< ---- x.sh
          #!/bin/bash -e
      
          echo AAA
          find . -exec false ';'
          echo BBB
          ---- 8< ----
      
          ---- 8< ---- y.sh
          #!/bin/bash -e
      
          echo XXX
          find . | \
          while read F; do
              false
          done
          echo YYY
          ---- 8< ----
      
          $ ./x.sh
          AAA
          BBB
          $ echo $?
          0
      
          $ ./y.sh
          XXX
          $ echo $?
          1
      
      So we switch to the second style, where find passes entries to the
      processing program via a pipe. This second style is also cleaner, in my
      view, because the listing and processing parts are now better structured.
      
      /cc @kazuhiko
    • *: Update copyright years · 70776a8f
      Kirill Smelkov authored
  17. 30 Dec, 2015 1 commit
  18. 14 Oct, 2015 1 commit
    • fsck incoming objects on pull · 7c0e3ff2
      Kirill Smelkov authored
      Since objects are shared between backed up repositories, it is
      important to make sure we never pull a broken object even once, as that
      would program future corruption of that object, after restore, in all
      repositories which use it.
      
      Object corruption could happen for two reasons:
      
          - plain storage corruption, or
          - someone intentionally pushing corrupted object with known sha1 to
            any repository.
      
      The second case is even more dangerous, as it potentially allows an
      attacker to change data in repositories not available to him.
      
      Now objects are checked on pull, and if an object is corrupt,
      git-backup complains, e.g. this way:
      
          RuntimeError: git -c fetch.fsckObjects=true fetch --no-tags ../D/corrupt.git refs/*:refs/backup/20151014-1914/aaa/corrupt.git/*
          error: inflate: data stream error (incorrect data check)
          fatal: loose object 52baccfe8479b61c2a0d5447bc0a6bf7c6827c60 (stored in ./objects/52/baccfe8479b61c2a0d5447bc0a6bf7c6827c60) is corrupt
          fatal: The remote end hung up unexpectedly
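
      The core of what such a check catches can be sketched for a single
      loose object: the compressed stream must inflate cleanly and the
      content must hash to the name it is stored under. This is only an
      illustration with hypothetical names; git's fsck additionally
      validates object structure, and git-backup delegates the whole check
      to git via fetch.fsckObjects as shown above.

```python
import hashlib
import zlib


class CorruptObject(Exception):
    pass


def verify_loose_object(compressed: bytes, expected_sha1: str) -> bytes:
    try:
        store = zlib.decompress(compressed)
    except zlib.error as e:
        # storage corruption usually shows up already at inflate time
        raise CorruptObject(f"{expected_sha1}: {e}") from e
    actual = hashlib.sha1(store).hexdigest()
    if actual != expected_sha1:
        # well-formed stream, but content does not match the claimed sha1
        raise CorruptObject(f"{expected_sha1}: content hashes to {actual}")
    return store
```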
  19. 24 Sep, 2015 1 commit