gitlab-backup: Dump DB ourselves
The reason for this is that we want more control over the DB dump process. The current problems that led to this decision are:

1. The DB dump is one large file whose size grows over time. This is not friendly to git;

2. The DB dump is currently not git/rsync friendly: when PostgreSQL does a dump, it just copies the data from its internal pages to the output, and the internal ordering changes every time a row is updated.

   http://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=src/bin/pg_dump/pg_dump.c;h=aa01d6a6;hb=HEAD#l1590
   http://stackoverflow.com/questions/24622579/does-or-can-the-postgresql-copy-to-command-guarantee-a-particular-row-order

Both 1 and 2 currently bring our backup tool to its knees. We'll be handling those issues in the following patches.

For now we perform the dump manually and switch from dumping in plain-text SQL to dumping in PostgreSQL native "directory" format, where there is a small table of contents with the schema (toc.dat) and the output of `COPY <table> TO stdout` for each table in a separate file.

http://www.postgresql.org/docs/9.5/static/app-pgdump.html

On restore we regenerate plain-text SQL from the dump with pg_restore and hand this plain-text SQL back to gitlab, so it thinks it is restoring the usual way.

NOTE: backward compatibility is preserved: if the restore part sees a backup made by an older version of gitlab-backup, which dumps database.sql in plain text, it restores it correctly.

NOTE2: gitlab-backup now supports only PostgreSQL (e.g. not MySQL). Adding support for other databases is possible, but requires a custom handler for every DB (or maybe just a fallback to the usual plain-text dump).

NOTE3: even though we now split the DB into separate tables, this does not currently help with problem #1, since in GitLab it is mostly just one table that takes up nearly all of the space.

/cc @kazuhiko
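
For illustration, the dump and restore steps described above can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the actual gitlab-backup code: the helper names and the `database.pgdump` directory name are hypothetical, and only `database.sql`, `toc.dat`, pg_dump and pg_restore come from the description above. The commands are only built here, not executed.

```python
import os


def dump_cmd(dbname, outdir):
    """Build a pg_dump command that writes `dbname` in "directory" format.

    pg_dump creates `outdir` itself and puts toc.dat plus one data file
    per table into it.
    """
    return ["pg_dump", "--format=directory", "--file=" + outdir, dbname]


def restore_to_sql_cmd(dumpdir, sqlfile):
    """Build a pg_restore command that turns a directory dump into plain SQL.

    Without --dbname, pg_restore does not connect to a database; it only
    writes the equivalent SQL script to `sqlfile`.
    """
    return ["pg_restore", "--file=" + sqlfile, dumpdir]


def backup_kind(dbdir):
    """Classify the DB part of a backup.

    The layout is an assumption: an old backup holds `database.sql`,
    a new one holds a (hypothetically named) `database.pgdump` directory
    containing toc.dat.
    """
    if os.path.isfile(os.path.join(dbdir, "database.sql")):
        return "plain"
    if os.path.isfile(os.path.join(dbdir, "database.pgdump", "toc.dat")):
        return "directory"
    return "unknown"
```

On restore, a "plain" backup would be passed to gitlab unchanged, while a "directory" backup would first go through the pg_restore command to obtain the plain-text SQL.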