Commit 64fbee7e authored by James Ramsay's avatar James Ramsay

Simplify failover documentation

Automatic failover is needed for high availability and should be
enabled in the default setup. This commit integrates enabling automatic
failover into the Gitaly node configuration step after verifying they
can be reached.

This also solves the problem of Grafana docs being buried below.
parent 85d98d4b
......@@ -430,7 +430,7 @@ documentation](index.md#3-gitaly-server-configuration).
gitlab-ctl restart gitaly
```
**Complete these steps for each Gitaly node!**
**The steps above must be completed for each Gitaly node!**
After all Gitaly nodes are configured, you can run the Praefect connection
checker to verify Praefect can connect to all Gitaly servers in the Praefect
......@@ -442,6 +442,34 @@ config.
sudo /opt/gitlab/embedded/bin/praefect -config /var/opt/gitlab/praefect/config.toml dial-nodes
```
1. Enable automatic failover by editing `/etc/gitlab/gitlab.rb`:
```ruby
praefect['failover_enabled'] = true
```
When automatic failover is enabled, Praefect check the health of internal
Gitaly nodes. If the primary has a certain amount of health checks fail, it
will promote one of the secondaries to be primary, and demote the primary to
be a secondary.
Manual failover is possible by updating `praefect['virtual_storages']` and
nominating a new primary node.
NOTE: **Note:**: Automatic failover is not yet supported for setups with
multiple Praefect nodes. There is currently no coordination between Praefect
nodes, which could result in two Praefect instances thinking two different
Gitaly nodes are the primary. Follow
[gitaly#2547](https://gitlab.com/gitlab-org/gitaly/-/issues/2547) for
updates.
1. Save the changes to `/etc/gitlab/gitlab.rb` and [reconfigure
Praefect](../restart_gitlab.md#omnibus-gitlab-reconfigure):
```shell
gitlab-ctl reconfigure
```
### GitLab
To complete this section you will need:
......@@ -554,158 +582,14 @@ Particular attention should be shown to:
- Deselect the **default** storage location
- Select the **praefect** storage location
![Update repository storage](img/praefect_storage_v12_10.png)
1. Verify everything is still working by creating a new project. Check the
"Initialize repository with a README" box so that there is content in the
repository that viewed. If the project is created, and you can see the
README file, it works!
Congratulations! You have configured a highly available Praefect cluster.
### Failover
There are two ways to do a failover from one internal Gitaly node to another as the primary. Manually, or automatically.
As an example, in this `config.toml` we have one virtual storage named "default" with two internal Gitaly nodes behind it.
One is deemed the "primary". This means that read and write traffic will go to `internal_storage_0`, and writes
will get replicated to `internal_storage_1`:
```toml
socket_path = "/path/to/Praefect.socket"
# failover_enabled will enable automatic failover
failover_enabled = false
[logging]
format = "json"
level = "info"
[[virtual_storage]]
name = "default"
[[virtual_storage.node]]
name = "internal_storage_0"
address = "tcp://localhost:9999"
primary = true
token = "supersecret"
[[virtual_storage.node]]
name = "internal_storage_1"
address = "tcp://localhost:9998"
token = "supersecret"
```
#### Manual failover
In order to failover from using one internal Gitaly node to using another, a manual failover step can be used. Unless `failover_enabled` is set to `true`
in the `config.toml`, the only way to fail over from one primary to using another node as the primary is to do a manual failover.
1. Move `primary = true` from the current `[[virtual_storage.node]]` to another node in `/etc/gitlab/gitlab.rb`:
```ruby
praefect['virtual_storages'] = {
'praefect' => {
'gitaly-1' => {
'address' => 'tcp://GITALY_HOST:8075',
'token' => 'PRAEFECT_INTERNAL_TOKEN',
# no longer the primary
},
'gitaly-2' => {
'address' => 'tcp://GITALY_HOST:8075',
'token' => 'PRAEFECT_INTERNAL_TOKEN',
# this is the new primary
'primary' => true
},
'gitaly-3' => {
'address' => 'tcp://GITALY_HOST:8075',
'token' => 'PRAEFECT_INTERNAL_TOKEN',
}
}
}
```
1. Save the file and [reconfigure GitLab](../restart_gitlab.md#omnibus-gitlab-reconfigure).
On a restart, Praefect will send write traffic to `internal_storage_1`. `internal_storage_0` is the new secondary now,
and replication jobs will be created to replicate repository data to `internal_storage_0` **from** `internal_storage_1`
#### Automatic failover
When automatic failover is enabled, Praefect will do automatic detection of the health of internal Gitaly nodes. If the
primary has a certain amount of health checks fail, it will decide to promote one of the secondaries to be primary, and
demote the primary to be a secondary.
1. To enable automatic failover, edit `/etc/gitlab/gitlab.rb`:
```ruby
# failover_enabled turns on automatic failover
praefect['failover_enabled'] = true
praefect['virtual_storages'] = {
'praefect' => {
'gitaly-1' => {
'address' => 'tcp://GITALY_HOST:8075',
'token' => 'PRAEFECT_INTERNAL_TOKEN',
'primary' => true
},
'gitaly-2' => {
'address' => 'tcp://GITALY_HOST:8075',
'token' => 'PRAEFECT_INTERNAL_TOKEN'
},
'gitaly-3' => {
'address' => 'tcp://GITALY_HOST:8075',
'token' => 'PRAEFECT_INTERNAL_TOKEN'
}
}
}
```
1. Save the file and [reconfigure GitLab](../restart_gitlab.md#omnibus-gitlab-reconfigure).
Below is the picture when Praefect starts up with the config.toml above:
```mermaid
graph TD
A[Praefect] -->|Mutator RPC| B(internal_storage_0)
B --> |Replication|C[internal_storage_1]
```
Let's say suddenly `internal_storage_0` goes down. Praefect will detect this and
automatically switch over to `internal_storage_1`, and `internal_storage_0` will serve as a secondary:
```mermaid
graph TD
A[Praefect] -->|Mutator RPC| B(internal_storage_1)
B --> |Replication|C[internal_storage_0]
```
NOTE: **Note:**: Currently this feature is supported for setups that only have 1 Praefect instance. Praefect instances running,
for example behind a load balancer, `failover_enabled` should be disabled. The reason is The reason is because there
is no coordination that currently happens across different Praefect instances, so there could be a situation where
two Praefect instances think two different Gitaly nodes are the primary.
## Backend Node Recovery
When a Praefect backend node fails and is no longer able to
replicate changes, the backend node will start to drift from the primary. If
that node eventually recovers, it will need to be reconciled with the current
primary. The primary node is considered the single source of truth for the
state of a shard. The Praefect `reconcile` subcommand allows for the manual
reconciliation between a backend node and the current primary.
Run the following command on the Praefect server after all placeholders
(`<virtual-storage>` and `<target-storage>`) have been replaced:
```shell
sudo /opt/gitlab/embedded/bin/praefect -config /var/opt/gitlab/praefect/config.toml reconcile -virtual <virtual-storage> -target <target-storage>
```
- Replace the placeholder `<virtual-storage>` with the virtual storage containing the backend node storage to be checked.
- Replace the placeholder `<target-storage>` with the backend storage name.
The command will return a list of repositories that were found to be
inconsistent against the current primary. Each of these inconsistencies will
also be logged with an accompanying replication job ID.
## Grafana
### Grafana
Grafana is included with GitLab, and can be used to monitor your Praefect
cluster. See [Grafana Dashboard
......@@ -748,6 +632,33 @@ To get started quickly:
1. Go to **Explore** and query `gitlab_build_info` to verify that you are
getting metrics from all your machines.
Congratulations! You've configured an observable highly available Praefect
cluster.
## Backend Node Recovery
When a Praefect backend node fails and is no longer able to
replicate changes, the backend node will start to drift from the primary. If
that node eventually recovers, it will need to be reconciled with the current
primary. The primary node is considered the single source of truth for the
state of a shard. The Praefect `reconcile` subcommand allows for the manual
reconciliation between a backend node and the current primary.
Run the following command on the Praefect server after all placeholders
(`<virtual-storage>` and `<target-storage>`) have been replaced:
```shell
sudo /opt/gitlab/embedded/bin/praefect -config /var/opt/gitlab/praefect/config.toml reconcile -virtual <virtual-storage> -target <target-storage>
```
- Replace the placeholder `<virtual-storage>` with the virtual storage containing the backend node storage to be checked.
- Replace the placeholder `<target-storage>` with the backend storage name.
The command will return a list of repositories that were found to be
inconsistent against the current primary. Each of these inconsistencies will
also be logged with an accompanying replication job ID.
## Migrating existing repositories to Praefect
If your GitLab instance already has repositories, these won't be migrated
......
Markdown is supported
0%
or
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment