Skip to content
Projects
Groups
Snippets
Help
Loading...
Help
Support
Keyboard shortcuts
?
Submit feedback
Contribute to GitLab
Sign in / Register
Toggle navigation
G
gitlab-ce
Project overview
Project overview
Details
Activity
Releases
Repository
Repository
Files
Commits
Branches
Tags
Contributors
Graph
Compare
Issues
0
Issues
0
List
Boards
Labels
Milestones
Merge Requests
1
Merge Requests
1
Analytics
Analytics
Repository
Value Stream
Wiki
Wiki
Snippets
Snippets
Members
Members
Collapse sidebar
Close sidebar
Activity
Graph
Create a new issue
Commits
Issue Boards
Open sidebar
nexedi
gitlab-ce
Commits
862da3cf
Commit
862da3cf
authored
Aug 10, 2017
by
Yorick Peterse
Browse files
Options
Browse Files
Download
Email Patches
Plain Diff
Add more database development related docs
parent
9ac2a517
Changes
4
Hide whitespace changes
Inline
Side-by-side
Showing
4 changed files
with
170 additions
and
0 deletions
+170
-0
doc/development/README.md
doc/development/README.md
+2
-0
doc/development/database_merge_request_checklist.md
doc/development/database_merge_request_checklist.md
+15
-0
doc/development/ordering_table_columns.md
doc/development/ordering_table_columns.md
+127
-0
doc/development/sql.md
doc/development/sql.md
+26
-0
No files found.
doc/development/README.md
View file @
862da3cf
...
@@ -46,6 +46,7 @@
...
@@ -46,6 +46,7 @@
## Databases
## Databases
-
[
Merge Request Checklist
](
database_merge_request_checklist.md
)
-
[
What requires downtime?
](
what_requires_downtime.md
)
-
[
What requires downtime?
](
what_requires_downtime.md
)
-
[
Adding database indexes
](
adding_database_indexes.md
)
-
[
Adding database indexes
](
adding_database_indexes.md
)
-
[
Post Deployment Migrations
](
post_deployment_migrations.md
)
-
[
Post Deployment Migrations
](
post_deployment_migrations.md
)
...
@@ -56,6 +57,7 @@
...
@@ -56,6 +57,7 @@
-
[
Background Migrations
](
background_migrations.md
)
-
[
Background Migrations
](
background_migrations.md
)
-
[
Storing SHA1 Hashes As Binary
](
sha1_as_binary.md
)
-
[
Storing SHA1 Hashes As Binary
](
sha1_as_binary.md
)
-
[
Iterating Tables In Batches
](
iterating_tables_in_batches.md
)
-
[
Iterating Tables In Batches
](
iterating_tables_in_batches.md
)
-
[
Ordering Table Columns
](
ordering_table_columns.md
)
## i18n
## i18n
...
...
doc/development/database_merge_request_checklist.md
0 → 100644
View file @
862da3cf
# Merge Request Checklist
When creating a merge request that performs database related changes (schema
changes, adjusting queries to optimise performance, etc) you should use the
merge request template called "Database Changes". This template contains a
checklist of steps to follow to make sure the changes are up to snuff.
To use the checklist, create a new merge request and click on the "Choose a
template" dropdown, then click "Database Changes".
An example of this checklist can be found at
https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/12463.
The source code of the checklist can be found in at
https://gitlab.com/gitlab-org/gitlab-ce/blob/master/.gitlab/merge_request_templates/Database%20Changes.md
doc/development/ordering_table_columns.md
0 → 100644
View file @
862da3cf
# Ordering Table Columns
Similar to C structures the space of a table is influenced by the order of
columns. This is because the size of columns is aligned depending on the type of
the column. Take the following column order for example:
*
id (integer, 4 bytes)
*
name (text, variable)
*
user_id (integer, 4 bytes)
Integers are aligned to the word size. This means that on a 64 bit platform the
actual size of each column would be: 8 bytes, variable, 8 bytes. This means that
each row will require at least 16 bytes for the two integers, and a variable
amount for the text field. If a table has a few rows this is not an issue, but
once you start storing millions of rows you can save space by using a different
order. For the above example a more ideal column order would be the following:
*
id (integer, 4 bytes)
*
user_id (integer, 4 bytes)
*
name (text, variable)
In this setup the
`id`
and
`user_id`
columns can be packed together, which means
we only need 8 bytes to store _both_ of them. This in turn each row will require
8 bytes less of space.
For GitLab we require that columns of new tables are ordered based to use the
least amount of space. An easy way of doing this is to order them based on the
type size in descending order with variable sizes (string and text columns for
example) at the end.
## Type Sizes
While the PostgreSQL docuemntation
(https://www.postgresql.org/docs/current/static/datatype.html) contains plenty
of information we will list the sizes of common types here so it's easier to
look them up. Here "word" refers to the word size, which is 4 bytes for a 32
bits platform and 8 bytes for a 64 bits platform.
| Type | Size | Aligned To |
|:-----------------|:-------------------------------------|:-----------|
| smallint | 2 bytes | 1 word |
| integer | 4 bytes | 1 word |
| bigint | 8 bytes | 8 bytes |
| real | 4 bytes | 1 word |
| double precision | 8 bytes | 8 bytes |
| boolean | 1 byte | not needed |
| text / string | variable, 1 byte plus the data | 1 word |
| bytea | variable, 1 or 4 bytes plus the data | 1 word |
| timestamp | 8 bytes | 8 bytes |
| timestamptz | 8 bytes | 8 bytes |
| date | 4 bytes | 1 word |
A "variable" size means the actual size depends on the value being stored. If
PostgreSQL determines this can be embedded directly into a row it may do so, but
for very large values it will store the data externally and store a pointer (of
1 word in size) in the column. Because of this variable sized columns should
always be at the end of a table.
## Real Example
Let's use the "events" table as an example, which currently has the following
layout:
| Column | Type | Size |
|:------------|:----------------------------|:---------|
| id | integer | 4 bytes |
| target_type | character varying | variable |
| target_id | integer | 4 bytes |
| title | character varying | variable |
| data | text | variable |
| project_id | integer | 4 bytes |
| created_at | timestamp without time zone | 8 bytes |
| updated_at | timestamp without time zone | 8 bytes |
| action | integer | 4 bytes |
| author_id | integer | 4 bytes |
After adding padding to align the columns this would translate to columns being
divided into fixed size chunks as follows:
| Chunk Size | Columns |
|:-----------|:------------------|
| 8 bytes | id |
| variable | target_type |
| 8 bytes | target_id |
| variable | title |
| variable | data |
| 8 bytes | project_id |
| 8 bytes | created_at |
| 8 bytes | updated_at |
| 8 bytes | action, author_id |
This means that excluding the variable sized data we need at least 48 bytes per
row.
We can optimise this by using the following column order instead:
| Column | Type | Size |
|:------------|:----------------------------|:---------|
| created_at | timestamp without time zone | 8 bytes |
| updated_at | timestamp without time zone | 8 bytes |
| id | integer | 4 bytes |
| target_id | integer | 4 bytes |
| project_id | integer | 4 bytes |
| action | integer | 4 bytes |
| author_id | integer | 4 bytes |
| target_type | character varying | variable |
| title | character varying | variable |
| data | text | variable |
This would produce the following chunks:
| Chunk Size | Columns |
|:-----------|:-------------------|
| 8 bytes | created_at |
| 8 bytes | updated_at |
| 8 bytes | id, target_id |
| 8 bytes | project_id, action |
| 8 bytes | author_id |
| variable | target_type |
| variable | title |
| variable | data |
Here we only need 40 bytes per row excluding the variable sized data. 8 bytes
being saved may not sound like much, but for tables as large as the "events"
table it does begin to matter. For example, when storing 80 000 000 rows this
translates to a space saving of at least 610 MB: all by just changing the order
of a few columns.
doc/development/sql.md
View file @
862da3cf
...
@@ -216,4 +216,30 @@ exact same results. This also means there's no need to add an index on
...
@@ -216,4 +216,30 @@ exact same results. This also means there's no need to add an index on
`created_at`
to ensure consistent performance as
`id`
is already indexed by
`created_at`
to ensure consistent performance as
`id`
is already indexed by
default.
default.
## Use WHERE EXISTS instead of WHERE IN
While
`WHERE IN`
and
`WHERE EXISTS`
can be used to produce the same data it is
recommended to use
`WHERE EXISTS`
whenever possible. While in many cases
PostgreSQL can optimise
`WHERE IN`
quite well there are also many cases where
`WHERE EXISTS`
will perform (much) better.
In Rails you have to use this by creating SQL fragments:
```
ruby
Project
.
where
(
'EXISTS (?)'
,
User
.
select
(
1
).
where
(
'projects.creator_id = users.id AND users.foo = X'
))
```
This would then produce a query along the lines of the following:
```
sql
SELECT
*
FROM
projects
WHERE
EXISTS
(
SELECT
1
FROM
users
WHERE
projects
.
creator_id
=
users
.
id
AND
users
.
foo
=
X
)
```
[
gin-index
]:
http://www.postgresql.org/docs/current/static/gin.html
[
gin-index
]:
http://www.postgresql.org/docs/current/static/gin.html
Write
Preview
Markdown
is supported
0%
Try again
or
attach a new file
Attach a file
Cancel
You are about to add
0
people
to the discussion. Proceed with caution.
Finish editing this message first!
Cancel
Please
register
or
sign in
to comment