gitlab-backup: Sort each DB table data
As outlined in the previous patch, the DB dump is currently not git/rsync friendly because the order of rows in a PostgreSQL dump constantly changes: pg_dump dumps table data with `COPY ... TO stdout`, which does not guarantee any ordering

    http://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=src/bin/pg_dump/pg_dump.c;h=aa01d6a6;hb=HEAD#l1590
    http://stackoverflow.com/questions/24622579/does-or-can-the-postgresql-copy-to-command-guarantee-a-particular-row-order

In fact it dumps data as stored raw in the DB pages, and every record update changes the row order.

On the other hand, Rails by default adds an integer `id` column as the first column of every table, by convention

    http://edgeguides.rubyonrails.org/active_record_basics.html

and GitLab does not override this. So we can sort tables on id and this way make the data order stable.

And even if there is no id column we can still sort: since COPY does not guarantee ordering, we can change the order of rows in _whatever_ way and the dump will still be correct.

This change helps git a lot to find good object deltas in less time, and it should also help rsync find smaller deltas between backup dumps.

NOTE: no changes are needed on the restore side at all. The dump stays valid, sorted or not, and restores to semantically the same DB, even if the internal row ordering is different.

/cc @kazuhiko
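The idea above can be illustrated with a minimal Python sketch. It is not the code from this patch: it assumes a plain-text pg_dump, where each table's rows sit between a `COPY ... FROM stdin;` line and a terminating `\.` line, one tab-separated row per line. The names `row_key` and `sort_copy_blocks` are hypothetical. Rows whose first column parses as an integer are ordered numerically on it (the Rails `id` case); any other rows are ordered as plain text.

```python
def row_key(row):
    """Sort key: integer first column (e.g. a Rails `id`) first, else raw text."""
    first = row.split("\t", 1)[0]
    try:
        return (0, int(first), row)
    except ValueError:
        # No integer first column: any stable order is fine, so use the text itself.
        return (1, 0, row)


def sort_copy_blocks(lines):
    """Return the dump lines with the rows of every COPY block sorted.

    `lines` is an iterable of dump lines without trailing newlines.
    Everything outside COPY blocks is passed through unchanged.
    """
    out = []
    rows = None  # None: outside a COPY block; list: rows collected so far
    for line in lines:
        if rows is None:
            out.append(line)
            if line.startswith("COPY ") and line.endswith(" FROM stdin;"):
                rows = []
        elif line == "\\.":
            # End-of-data marker: emit the collected rows in sorted order.
            out.extend(sorted(rows, key=row_key))
            out.append(line)
            rows = None
        else:
            rows.append(line)
    if rows is not None:
        # Unterminated block (truncated input): leave its rows untouched.
        out.extend(rows)
    return out
```

Because the restore side loads rows in whatever order they appear, the output of such a filter is still a valid dump; only the order of rows within each table differs.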