You may also want to edit the `wal_keep_segments` and `max_wal_senders` to
match your database replication requirements. Consult the [PostgreSQL - Replication documentation](https://www.postgresql.org/docs/9.6/static/runtime-config-replication.html)
for more information.
1. Set the access control on the primary to allow TCP connections using the
server's public IP and set the connection from the secondary to require a
password. Edit `pg_hba.conf` (for Debian/Ubuntu that would be
`/etc/postgresql/9.x/main/pg_hba.conf`):
```bash
host all all 127.0.0.1/32 trust
host all all 1.2.3.4/32 trust
host replication gitlab_replicator 5.6.7.8/32 md5
```
Where `1.2.3.4` is the public IP address of the primary server, and `5.6.7.8`
the public IP address of the secondary one. If you want to add another
secondary, add one more row like the replication one and change the IP
The topology above assumes that the primary and secondary Geo clusters
are located in two separate locations, on their own virtual network
with private IP addresses. The network is configured such that all machines within
one geographic location can communicate with each other using their private IP addresses.
The IP addresses given are examples and may be different depending on the
network topology of your deployment.
The only external way to access the two Geo deployments is by HTTPS at
`gitlab.us.example.com` and `gitlab.eu.example.com` in the example above.
> **Note:** The primary and secondary Geo deployments must be able to
> communicate to each other over HTTPS.
## Redis and PostgreSQL High Availability
The primary and secondary Redis and PostgreSQL should be configured
for high availability. Because of the additional complexity involved
in setting up this configuration for PostgreSQL and Redis
it is not covered by this Geo HA documentation.
The two services will instead be configured such that
they will each run on a single machine.
For more information about setting up a highly available PostgreSQL cluster and Redis cluster using the omnibus package see the high availability documentation for
[PostgreSQL](../high_availability/database.md) and
geo_secondary['db_password'] = 'Geo tracking DB password'
```
NOTE: **Note:**
Be sure that other non-postgresql services are disabled by setting `enable` to `false` in
the [gitlab.rb configuration](https://gitlab.com/gitlab-org/omnibus-gitlab/blob/master/files/gitlab-config-template/gitlab.rb.template).
After making these changes be sure to run `sudo gitlab-ctl reconfigure` so that they take effect.
### Step 3: Setup the LoadBalancer
In this topology there will need to be a load balancers at each geographical location
to route traffic to the application servers.
See the [Load Balancer for GitLab HA](../high_availability/load_balancer.md)
documentation for more information.
### Step 4: Configure the Geo Frontend Application Servers
In the architecture overview there are two machines running the GitLab application
services. These services are enabled selectively in the configuration. Additionally
the addresses of the remote endpoints for PostgreSQL and Redis will need to be specified.
#### On the GitLab Primary Frontend servers
1. Edit `/etc/gitlab/gitlab.rb` and ensure the following to disable PostgreSQL and Redis from running locally.
```ruby
##
## Disable PostgreSQL on the local machine and connect to the remote
##
postgresql['enable'] = false
gitlab_rails['auto_migrate'] = false
gitlab_rails['db_host'] = '10.0.3.1'
gitlab_rails['db_password'] = 'DB password'
##
## Disable Redis on the local machine and connect to the remote
##
redis['enable'] = false
gitlab_rails['redis_host'] = '10.0.2.1'
gitlab_rails['redis_password'] = 'Redis password'
geo_primary_role['enable'] = true
```
#### On the GitLab Secondary Frontend servers
On the secondary the remote endpoint for the PostgreSQL Geo database will
be specified.
1. Edit `/etc/gitlab/gitlab.rb` and ensure the following to disable PostgreSQL and Redis from running locally. Configure the secondary to connect to the Geo tracking database.
```ruby
##
## Disable PostgreSQL on the local machine and connect to the remote
##
postgresql['enable'] = false
gitlab_rails['auto_migrate'] = false
gitlab_rails['db_host'] = '10.1.3.1'
gitlab_rails['db_password'] = 'DB password'
##
## Disable Redis on the local machine and connect to the remote
##
redis['enable'] = false
gitlab_rails['redis_host'] = '10.1.2.1'
gitlab_rails['redis_password'] = 'Redis password'
##
## Enable the geo secondary role and configure the
## geo tracking database
##
geo_secondary_role['enable'] = true
geo_secondary['db_host'] = '10.1.4.1'
geo_secondary['db_password'] = 'Geo tracking DB password'
geo_postgresql['enable'] = false
```
After making these changes [Reconfigure GitLab][] so that they take effect.
On the primary the following GitLab frontend services will be enabled:
* gitlab-pages
* gitlab-workhorse
* logrotate
* nginx
* registry
* remote-syslog
* sidekiq
* unicorn
On the secondary the following GitLab frontend services will be enabled:
* geo-logcursor
* gitlab-pages
* gitlab-workhorse
* logrotate
* nginx
* registry
* remote-syslog
* sidekiq
* unicorn
Verify these services by running `sudo gitlab-ctl status` on the frontend
- Using Geo in combination with High Availability is considered **GA** in GitLab Enterprise Edition 10.4
>**Note:**
Geo changes significantly from release to release. Upgrades **are**
supported and [documented](#updating-the-geo-nodes), but you should ensure that
you're following the right version of the documentation for your installation!
The best way to do this is to follow the documentation from the `/help` endpoint
on your **primary** node, but you can also navigate to [this page on GitLab.com](https://gitlab.com/gitlab-org/gitlab-ee/blob/master/doc/gitlab-geo/README.md)
and choose the appropriate release from the `tags` dropdown, e.g., `v10.0.0-ee`.
Geo allows you to replicate your GitLab instance to other geographical
locations as a read-only fully operational version.
## Overview
If you have two or more teams geographically spread out, but your GitLab
instance is in a single location, fetching large repositories can take a long
time.
Your Geo instance can be used for cloning and fetching projects, in addition to
reading any data. This will make working with large repositories over large
distances much faster.
![Geo overview](img/geo-overview.png)
When Geo is enabled, we refer to your original instance as a **primary** node
and the replicated read-only ones as **secondaries**.
Keep in mind that:
- Secondaries talk to the primary to get user data for logins (API) and to
replicate repositories, LFS Objects and Attachments (HTTPS + JWT).
- Since GitLab Premium 10.0, the primary no longer talks to
secondaries to notify for changes (API).
## Use-cases
- Can be used for cloning and fetching projects, in addition
to reading any data available in the GitLab web interface (see [current limitations](#current-limitations))
- Overcomes slow connection between distant offices, saving time by
improving speed for distributed teams
- Helps reducing the loading time for automated tasks,
We use the tracking database as metadata to control what needs to be
updated on the disk of the local instance (for example, download new assets,
fetch new LFS Objects or fetch changes from a repository that has recently been
updated).
Because the replicated instance is read-only, we need this additional instance
per secondary location.
### Geo Log Cursor
This daemon reads a log of events replicated by the primary node to the secondary
database and updates the Geo Tracking Database with changes that need to be
executed.
When something is marked to be updated in the tracking database, asynchronous
jobs running on the secondary node will execute the required operations and
update the state.
This new architecture allows us to be resilient to connectivity issues between the
nodes. It doesn't matter if it was just a few minutes or days. The secondary
instance will be able to replay all the events in the correct order and get in
sync again.
## Setup instructions
These instructions assume you have a working instance of GitLab. They will
guide you through making your existing instance the primary Geo node and
adding secondary Geo nodes.
The steps below should be followed in the order they appear. **Make sure the
GitLab version is the same on all nodes.**
### Using Omnibus GitLab
If you installed GitLab using the Omnibus packages (highly recommended):
1.[Install GitLab Enterprise Edition][install-ee] on the server that will serve
as the **secondary** Geo node. Do not create an account or login to the new
secondary node.
1.[Upload the GitLab License](../user/admin_area/license.md) on the **primary**
Geo node to unlock Geo.
1.[Setup the database replication](database.md)(`primary(read-write)<->
secondary (read-only)` topology).
1. [Configure fast lookup of authorized SSH keys in the database](../operations/fast_ssh_key_lookup.md),
this step is required and needs to be done on both the primary AND secondary nodes.
1. [Configure GitLab](configuration.md) to set the primary and secondary nodes.
1. Optional: [Configure a secondary LDAP server](../auth/ldap.md)
for the secondary. See [notes on LDAP](#ldap).
1. [Follow the "Using a Geo Server" guide](using_a_geo_server.md).
[install-ee]: https://about.gitlab.com/downloads-ee/ "GitLab Enterprise Edition Omnibus packages downloads page"
### Using GitLab installed from source
If you installed GitLab from source:
1. [Install GitLab Enterprise Edition][install-ee-source] on the server that
will serve as the **secondary** Geo node. Do not create an account or login
to the new secondary node.
1. [Upload the GitLab License](../user/admin_area/license.md) on the **primary**
Geo node to unlock Geo.
1. [Setup the database replication](database_source.md) (`primary (read-write)
<-> secondary (read-only)` topology).
1. [Configure fast lookup of authorized SSH keys in the database](../operations/fast_ssh_key_lookup.md),
do this step for both primary AND secondary nodes.
1. [Configure GitLab](configuration_source.md) to set the primary and secondary
nodes.
1. [Follow the "Using a Geo Server" guide](using_a_geo_server.md).
[install-ee-source]: https://docs.gitlab.com/ee/install/installation.html "GitLab Enterprise Edition installation from source"
## Configuring Geo
Read through the [Geo configuration](configuration.md) documentation.
## Updating the Geo nodes
Read how to [update your Geo nodes to the latest GitLab version](updating_the_geo_nodes.md).
## Configuring Geo HA
Read through the [Geo High Availability documentation](ha.md).
## Configuring Geo with Object storage
When you have object storage enabled, please consult the
[Geo with Object Storage](object_storage.md) documentation.
## Replicating the Container Registry
Read how to [replicate the Container Registry](docker_registry.md).
## Current limitations
- You cannot push code to secondary nodes, see [3912](https://gitlab.com/gitlab-org/gitlab-ee/issues/3912) for details.
- The primary node has to be online for OAuth login to happen (existing sessions and Git are not affected)
- It works for repos, wikis, issues, and merge requests, but it does not work for job logs, artifacts, GitLab Pages, and Docker images of the Container
Registry (by default, but you can configure it separately, see [replicate the Container Registry](docker_registry.md) for details)
- The installation takes multiple manual steps that together can take about an hour depending on circumstances; we are working on improving this experience, see [#2978](https://gitlab.com/gitlab-org/omnibus-gitlab/issues/2978) for details.
## Frequently Asked Questions
Read more in the [Geo FAQ](faq.md).
## Log files
Since GitLab 9.5, Geo stores structured log messages in a `geo.log` file. For
Omnibus installations, this file can be found in
`/var/log/gitlab/gitlab-rails/geo.log`. This file contains information about
when Geo attempts to sync repositories and files. Each line in the file contains a
separate JSON entry that can be ingested into Elasticsearch, Splunk, etc. For
The following security review of the Geo feature set focuses on security
aspects of the feature as they apply to customers running their own GitLab
instances. The review questions are based in part on the [application security architecture](https://www.owasp.org/index.php/Application_Security_Architecture_Cheat_Sheet)
questions from [owasp.org](https://www.owasp.org).
## Business Model
### What geographic areas does the application service?
- This varies by customer. Geo allows customers to deploy to multiple areas,
and they get to choose where they are.
- Region and node selection is entirely manual.
## Data Essentials
### What data does the application receive, produce, and process?
- Geo streams almost all data held by a GitLab instance between sites. This
includes full database replication, most files (user-uploaded attachments,
etc) and repository + wiki data. In a typical configuration, this will
happen across the public Internet, and be TLS-encrypted.
- PostgreSQL replication is TLS-encrypted.
- See also: [only TLSv1.2 should be supported](https://gitlab.com/gitlab-org/omnibus-gitlab/issues/2948)
### How can the data be classified into categories according to its sensitivity?
- GitLab’s model of sensitivity is centered around public vs. internal vs.
private projects. Geo replicates them all indiscriminately. “Selective sync”
exists for files and repositories (but not database content), which would permit
only less-sensitive projects to be replicated to a secondary if desired.
- See also: [developing a data classification policy](https://gitlab.com/gitlab-com/security/issues/4).
### What data backup and retention requirements have been defined for the application?
- Geo is designed to provide replication of a certain subset of the application
data. It is part of the solution, rather than part of the problem.
## End-Users
### Who are the application's end‐users?
- Geo nodes (secondaries) are created in regions that are distant (in terms of
Internet latency) from the main GitLab installation (the primary). They are
intended to be used by anyone who would ordinarily use the primary, who finds
that the secondary is closer to them (in terms of Internet latency).
### How do the end‐users interact with the application?
- A Geo secondary node provides all the interfaces a Geo primary node does
(notably a HTTP/HTTPS web application, and HTTP/HTTPS or SSH git repository
access), but is constrained to read-only activities. The principal use case is
envisioned to be cloning git repositories from the secondary in favor of the
primary, but end-users may use the GitLab web interface to view projects,
issues, merge requests, snippets, etc.
### What security expectations do the end‐users have?
- The replication process must be secure. It would typically be unacceptable to
transmit the entire database contents or all files and repositories across the
public Internet in plaintext, for instance.
- The Geo secondary must have the same access controls over its content as the
primary - unauthenticated users must not be able to gain access to privileged
information on the primary by querying the secondary.
- Attackers must not be able to impersonate the secondary to the primary, and
thus gain access to privileged information.
## Administrators
### Who has administrative capabilities in the application?
- Nothing Geo-specific. Any user where `admin: true` is set in the database is
considered an admin with super-user privileges.
- See also: [more granular access control](https://gitlab.com/gitlab-org/gitlab-ce/issues/32730)
(not geo-specific)
- Much of Geo’s integration (database replication, for instance) must be
configured with the application, typically by system administrators.
### What administrative capabilities does the application offer?
- Geo secondaries may be added, modified, or removed by users with
administrative access.
- The replication process may be controlled (start/stop) via the Sidekiq
administrative controls.
## Network
### What details regarding routing, switching, firewalling, and load‐balancing have been defined?
- Geo requires the primary and secondary to be able to communicate with each
other across a TCP/IP network. In particular, the secondaries must be able to
access HTTP/HTTPS and PostgreSQL services on the primary.
### What core network devices support the application?
- Varies from customer to customer.
### What network performance requirements exist?
- Maximum replication speeds between primary and secondary is limited by the
available bandwidth between sites. No hard requirements exist - time to complete
replication (and ability to keep up with changes on the primary) is a function
of the size of the data set, tolerance for latency, and available network
capacity.
### What private and public network links support the application?
- Customers choose their own networks. As sites are intended to be
geographically separated, it is envisioned that replication traffic will pass
over the public Internet in a typical deployment, but this is not a requirement.
## Systems
### What operating systems support the application?
- Geo imposes no additional restrictions on operating system (see the
[GitLab installation](https://about.gitlab.com/installation/) page for more
details), however we recommend using the operating systems listed in the [Geo documentation](http://docs.gitlab.com/ee/gitlab-geo/#geo-recommendations).
### What details regarding required OS components and lock‐down needs have been defined?
- The recommended installation method (Omnibus) packages most components itself.
A from-source installation method exists. Both are documented at
https://docs.gitlab.com/ee/gitlab-geo/
- There are significant dependencies on the system-installed OpenSSH daemon (Geo
requires users to set up custom authentication methods) and the omnibus or
system-provided PostgreSQL daemon (it must be configured to listen on TCP,
additional users and replication slots must be added, etc).
- The process for dealing with security updates (for example, if there is a
significant vulnerability in OpenSSH or other services, and the customer
wants to patch those services on the OS) is identical to the non-Geo
situation: security updates to OpenSSH would be provided to the user via the
usual distribution channels. Geo introduces no delay there.
## Infrastructure Monitoring
### What network and system performance monitoring requirements have been defined?
- None specific to Geo.
### What mechanisms exist to detect malicious code or compromised application components?
- None specific to Geo.
### What network and system security monitoring requirements have been defined?
- None specific to Geo.
## Virtualization and Externalization
### What aspects of the application lend themselves to virtualization?
- All.
## What virtualization requirements have been defined for the application?
- Nothing Geo-specific, but everything in GitLab needs to have full
functionality in such an environment.
### What aspects of the product may or may not be hosted via the cloud computing model?
- GitLab is “cloud native” and this applies to Geo as much as to the rest of the
product. Deployment in clouds is a common and supported scenario.
## If applicable, what approach(es) to cloud computing will be taken (Managed Hosting versus "Pure" Cloud, a "full machine" approach such as AWS-EC2 versus a "hosted database" approach such as AWS-RDS and Azure, etc)?
- To be decided by our customers, according to their operational needs.
## Environment
### What frameworks and programming languages have been used to create the application?
- Ruby on Rails, Ruby.
### What process, code, or infrastructure dependencies have been defined for the application?
- Nothing specific to Geo.
### What databases and application servers support the application?
- PostgreSQL >= 9.6, Redis, Sidekiq, Unicorn.
### How will database connection strings, encryption keys, and other sensitive components be stored, accessed, and protected from unauthorized detection?
- There are some Geo-specific values. Some are shared secrets which must be
securely transmitted from the primary to the secondary at setup time. Our
documentation recommends transmitting them from the primary to the system
administrator via SSH, and then back out to the secondary in the same manner.
In particular, this includes the PostgreSQL replication credentials and a secret
key (`db_key_base`) which is used to decrypt certain columns in the database.
The `db_key_base` secret is stored unencrypted on the filesystem, in
`/etc/gitlab/gitlab-secrets.json`, along with a number of other secrets. There is
no at-rest protection for them.
## Data Processing
### What data entry paths does the application support?
- Data is entered via the web application exposed by GitLab itself. Some data is
also entered using system administration commands on the GitLab servers (e.g.,
`gitlab-ctl set-primary-node`).
- Secondaries also receive inputs via PostgreSQL streaming replication from the
primary.
### What data output paths does the application support?
- Primaries output via PostgreSQL streaming replication to the secondary.
Otherwise, principally via the web application exposed by GitLab itself, and via
SSH `git clone` operations initiated by the end-user.
### How does data flow across the application's internal components?
- Secondaries and primaries interact via HTTP/HTTPS (secured with JSON web
tokens) and via PostgreSQL streaming replication.
- Within a primary or secondary, the SSOT is the filesystem and the database
(including Geo tracking database on secondary). The various internal components
are orchestrated to make alterations to these stores.
### What data input validation requirements have been defined?
- Secondaries must have a faithful replication of the primary’s data.
### What data does the application store and how?
- Git repositories and files, tracking information related to the them, and the
GitLab database contents.
### What data is or may need to be encrypted and what key management requirements have been defined?
- Neither primaries or secondaries encrypt Git repository or filesystem data at
rest. A subset of database columns are encrypted at rest using the `db_otp_key`
- a static secret shared across all hosts in a GitLab deployment.
- In transit, data should be encrypted, although the application does permit
communication to proceed unencrypted. The two main transits are the secondary’s
replication process for PostgreSQL, and for git repositories/files. Both should
be protected using TLS, with the keys for that managed via Omnibus per existing
configuration for end-user access to GitLab.
### What capabilities exist to detect the leakage of sensitive data?
- Comprehensive system logs exist, tracking every connection to GitLab and
PostgreSQL.
### What encryption requirements have been defined for data in transit - including transmission over WAN, LAN, SecureFTP, or publicly accessible protocols such as http: and https:?
- Data must have the option to be encrypted in transit, and be secure against
both passive and active attack (e.g., MITM attacks should not be possible).
## Access
### What user privilege levels does the application support?
- Geo adds one type of privilege: secondaries can access a special Geo API to
download files over HTTP/HTTPS, and to clone repositories using HTTP/HTTPS.
### What user identification and authentication requirements have been defined?
- Geo secondaries identify to Geo primaries via OAuth or JWT authentication
based on the shared database (HTTP access) or a PostgreSQL replication user (for
database replication). The database replication also requires IP-based access
controls to be defined.
### What user authorization requirements have been defined?
- Secondaries must only be able to *read* data. They are not currently able to
mutate data on the primary.
### What session management requirements have been defined?
- Geo JWTs are defined to last for only two minutes before needing to be
regenerated.
### What access requirements have been defined for URI and Service calls?
- A Geo secondary makes many calls to the primary's API. This is how file
replication proceeds, for instance. This endpoint is only accessible with a JWT
token.
- The primary also makes calls to the secondary to get status information.
## Application Monitoring
### What application auditing requirements have been defined? How are audit and debug logs accessed, stored, and secured?
- Structured JSON log is written to the filesystem, and can also be ingested
By running the command above, `primary` should be `true` when executed in
the primary node, and `false` on any secondary.
#### How do I fix the message, "ERROR: replication slots can only be used if max_replication_slots > 0"?
This means that the `max_replication_slots` PostgreSQL variable needs to
be set on the primary database. In GitLab 9.4, we have made this setting
default to 1. You may need to increase this value if you have more Geo
secondary nodes. Be sure to restart PostgreSQL for this to take
effect. See the [PostgreSQL replication
setup](database.md#postgresql-replication) guide for more details.
#### How do I fix the message, "FATAL: could not start WAL streaming: ERROR: replication slot "geo_secondary_my_domain_com" does not exist"?
This occurs when PostgreSQL does not have a replication slot for the
secondary by that name. You may want to rerun the [replication
process](database.md) on the secondary.
#### How do I fix the message, "Command exceeded allowed execution time" when setting up replication?
This may happen while [initiating the replication process](database.md#step-4-initiate-the-replication-process) on the Geo secondary, and indicates that your
initial dataset is too large to be replicated in the default timeout (30 minutes).
Re-run `gitlab-ctl replicate-geo-database`, but include a larger value for
- Using Geo in combination with High Availability is considered **GA** in GitLab Enterprise Edition 10.4
>**Note:**
Geo changes significantly from release to release. Upgrades **are**
supported and [documented](#updating-the-geo-nodes), but you should ensure that
you're following the right version of the documentation for your installation!
The best way to do this is to follow the documentation from the `/help` endpoint
on your **primary** node, but you can also navigate to [this page on GitLab.com](https://gitlab.com/gitlab-org/gitlab-ee/blob/master/doc/gitlab-geo/README.md)
and choose the appropriate release from the `tags` dropdown, e.g., `v10.0.0-ee`.
Geo allows you to replicate your GitLab instance to other geographical
locations as a read-only fully operational version.
## Overview
If you have two or more teams geographically spread out, but your GitLab
instance is in a single location, fetching large repositories can take a long
time.
Your Geo instance can be used for cloning and fetching projects, in addition to
reading any data. This will make working with large repositories over large
distances much faster.
![Geo overview](img/geo-overview.png)
When Geo is enabled, we refer to your original instance as a **primary** node
and the replicated read-only ones as **secondaries**.
Keep in mind that:
- Secondaries talk to the primary to get user data for logins (API) and to
replicate repositories, LFS Objects and Attachments (HTTPS + JWT).
- Since GitLab Premium 10.0, the primary no longer talks to
secondaries to notify for changes (API).
## Use-cases
- Can be used for cloning and fetching projects, in addition
to reading any data available in the GitLab web interface (see [current limitations](#current-limitations))
- Overcomes slow connection between distant offices, saving time by
improving speed for distributed teams
- Helps reducing the loading time for automated tasks,
We use the tracking database as metadata to control what needs to be
updated on the disk of the local instance (for example, download new assets,
fetch new LFS Objects or fetch changes from a repository that has recently been
updated).
Because the replicated instance is read-only, we need this additional instance
per secondary location.
### Geo Log Cursor
This daemon reads a log of events replicated by the primary node to the secondary
database and updates the Geo Tracking Database with changes that need to be
executed.
When something is marked to be updated in the tracking database, asynchronous
jobs running on the secondary node will execute the required operations and
update the state.
This new architecture allows us to be resilient to connectivity issues between the
nodes. It doesn't matter if it was just a few minutes or days. The secondary
instance will be able to replay all the events in the correct order and get in
sync again.
## Setup instructions
These instructions assume you have a working instance of GitLab. They will
guide you through making your existing instance the primary Geo node and
adding secondary Geo nodes.
The steps below should be followed in the order they appear. **Make sure the
GitLab version is the same on all nodes.**
### Using Omnibus GitLab
If you installed GitLab using the Omnibus packages (highly recommended):
1.[Install GitLab Enterprise Edition][install-ee] on the server that will serve
as the **secondary** Geo node. Do not create an account or login to the new
secondary node.
1.[Upload the GitLab License](../user/admin_area/license.md) on the **primary**
Geo node to unlock Geo.
1.[Setup the database replication](database.md)(`primary(read-write)<->
secondary (read-only)` topology).
1. [Configure fast lookup of authorized SSH keys in the database](../administration/operations/fast_ssh_key_lookup.md),
this step is required and needs to be done on both the primary AND secondary nodes.
1. [Configure GitLab](configuration.md) to set the primary and secondary nodes.
1. Optional: [Configure a secondary LDAP server](../administration/auth/ldap.md)
for the secondary. See [notes on LDAP](#ldap).
1. [Follow the "Using a Geo Server" guide](using_a_geo_server.md).
[install-ee]: https://about.gitlab.com/downloads-ee/ "GitLab Enterprise Edition Omnibus packages downloads page"
### Using GitLab installed from source
If you installed GitLab from source:
1. [Install GitLab Enterprise Edition][install-ee-source] on the server that
will serve as the **secondary** Geo node. Do not create an account or login
to the new secondary node.
1. [Upload the GitLab License](../user/admin_area/license.md) on the **primary**
Geo node to unlock Geo.
1. [Setup the database replication](database_source.md) (`primary (read-write)
<-> secondary (read-only)` topology).
1. [Configure fast lookup of authorized SSH keys in the database](../administration/operations/fast_ssh_key_lookup.md),
do this step for both primary AND secondary nodes.
1. [Configure GitLab](configuration_source.md) to set the primary and secondary
nodes.
1. [Follow the "Using a Geo Server" guide](using_a_geo_server.md).
[install-ee-source]: https://docs.gitlab.com/ee/install/installation.html "GitLab Enterprise Edition installation from source"
## Configuring Geo
Read through the [Geo configuration](configuration.md) documentation.
## Updating the Geo nodes
Read how to [update your Geo nodes to the latest GitLab version](updating_the_geo_nodes.md).
## Configuring Geo HA
Read through the [Geo High Availability documentation](ha.md).
## Configuring Geo with Object storage
When you have object storage enabled, please consult the
[Geo with Object Storage](object_storage.md) documentation.
## Replicating the Container Registry
Read how to [replicate the Container Registry](docker_registry.md).
## Current limitations
- You cannot push code to secondary nodes, see [3912](https://gitlab.com/gitlab-org/gitlab-ee/issues/3912) for details.
- The primary node has to be online for OAuth login to happen (existing sessions and Git are not affected)
- It works for repos, wikis, issues, and merge requests, but it does not work for job logs, artifacts, GitLab Pages, and Docker images of the Container
Registry (by default, but you can configure it separately, see [replicate the Container Registry](docker_registry.md) for details)
- The installation takes multiple manual steps that together can take about an hour depending on circumstances; we are working on improving this experience, see [#2978](https://gitlab.com/gitlab-org/omnibus-gitlab/issues/2978) for details.
## Frequently Asked Questions
Read more in the [Geo FAQ](faq.md).
## Log files
Since GitLab 9.5, Geo stores structured log messages in a `geo.log` file. For
Omnibus installations, this file can be found in
`/var/log/gitlab/gitlab-rails/geo.log`. This file contains information about
when Geo attempts to sync repositories and files. Each line in the file contains a
separate JSON entry that can be ingested into Elasticsearch, Splunk, etc. For
You may also want to edit the `wal_keep_segments` and `max_wal_senders` to
match your database replication requirements. Consult the [PostgreSQL - Replication documentation](https://www.postgresql.org/docs/9.6/static/runtime-config-replication.html)
for more information.
1. Set the access control on the primary to allow TCP connections using the
server's public IP and set the connection from the secondary to require a
password. Edit `pg_hba.conf` (for Debian/Ubuntu that would be
`/etc/postgresql/9.x/main/pg_hba.conf`):
```bash
host all all 127.0.0.1/32 trust
host all all 1.2.3.4/32 trust
host replication gitlab_replicator 5.6.7.8/32 md5
```
Where `1.2.3.4` is the public IP address of the primary server, and `5.6.7.8`
the public IP address of the secondary one. If you want to add another
secondary, add one more row like the replication one and change the IP
The topology above assumes that the primary and secondary Geo clusters
are located in two separate locations, on their own virtual network
with private IP addresses. The network is configured such that all machines within
one geographic location can communicate with each other using their private IP addresses.
The IP addresses given are examples and may be different depending on the
network topology of your deployment.
The only external way to access the two Geo deployments is by HTTPS at
`gitlab.us.example.com` and `gitlab.eu.example.com` in the example above.
> **Note:** The primary and secondary Geo deployments must be able to
> communicate to each other over HTTPS.
## Redis and PostgreSQL High Availability
The primary and secondary Redis and PostgreSQL should be configured
for high availability. Because of the additional complexity involved
in setting up this configuration for PostgreSQL and Redis
it is not covered by this Geo HA documentation.
The two services will instead be configured such that
they will each run on a single machine.
For more information about setting up a highly available PostgreSQL cluster and Redis cluster using the omnibus package see the high availability documentation for
[PostgreSQL](../administration/high_availability/database.md) and
geo_secondary['db_password'] = 'Geo tracking DB password'
```
NOTE: **Note:**
Be sure that other non-postgresql services are disabled by setting `enable` to `false` in
the [gitlab.rb configuration](https://gitlab.com/gitlab-org/omnibus-gitlab/blob/master/files/gitlab-config-template/gitlab.rb.template).
After making these changes be sure to run `sudo gitlab-ctl reconfigure` so that they take effect.
### Step 3: Setup the LoadBalancer
In this topology there will need to be a load balancers at each geographical location
to route traffic to the application servers.
See the [Load Balancer for GitLab HA](../administration/high_availability/load_balancer.md)
documentation for more information.
### Step 4: Configure the Geo Frontend Application Servers
In the architecture overview there are two machines running the GitLab application
services. These services are enabled selectively in the configuration. Additionally
the addresses of the remote endpoints for PostgreSQL and Redis will need to be specified.
#### On the GitLab Primary Frontend servers
1. Edit `/etc/gitlab/gitlab.rb` and ensure the following to disable PostgreSQL and Redis from running locally.
```ruby
##
## Disable PostgreSQL on the local machine and connect to the remote
##
postgresql['enable'] = false
gitlab_rails['auto_migrate'] = false
gitlab_rails['db_host'] = '10.0.3.1'
gitlab_rails['db_password'] = 'DB password'
##
## Disable Redis on the local machine and connect to the remote
##
redis['enable'] = false
gitlab_rails['redis_host'] = '10.0.2.1'
gitlab_rails['redis_password'] = 'Redis password'
geo_primary_role['enable'] = true
```
#### On the GitLab Secondary Frontend servers
On the secondary the remote endpoint for the PostgreSQL Geo database will
be specified.
1. Edit `/etc/gitlab/gitlab.rb` and ensure the following to disable PostgreSQL and Redis from running locally. Configure the secondary to connect to the Geo tracking database.
```ruby
##
## Disable PostgreSQL on the local machine and connect to the remote
##
postgresql['enable'] = false
gitlab_rails['auto_migrate'] = false
gitlab_rails['db_host'] = '10.1.3.1'
gitlab_rails['db_password'] = 'DB password'
##
## Disable Redis on the local machine and connect to the remote
##
redis['enable'] = false
gitlab_rails['redis_host'] = '10.1.2.1'
gitlab_rails['redis_password'] = 'Redis password'
##
## Enable the geo secondary role and configure the
## geo tracking database
##
geo_secondary_role['enable'] = true
geo_secondary['db_host'] = '10.1.4.1'
geo_secondary['db_password'] = 'Geo tracking DB password'
geo_postgresql['enable'] = false
```
After making these changes [Reconfigure GitLab][] so that they take effect.
On the primary the following GitLab frontend services will be enabled:
* gitlab-pages
* gitlab-workhorse
* logrotate
* nginx
* registry
* remote-syslog
* sidekiq
* unicorn
On the secondary the following GitLab frontend services will be enabled:
* geo-logcursor
* gitlab-pages
* gitlab-workhorse
* logrotate
* nginx
* registry
* remote-syslog
* sidekiq
* unicorn
Verify these services by running `sudo gitlab-ctl status` on the frontend
The following security review of the Geo feature set focuses on security
aspects of the feature as they apply to customers running their own GitLab
instances. The review questions are based in part on the [application security architecture](https://www.owasp.org/index.php/Application_Security_Architecture_Cheat_Sheet)
questions from [owasp.org](https://www.owasp.org).
## Business Model
### What geographic areas does the application service?
- This varies by customer. Geo allows customers to deploy to multiple areas,
and they get to choose where they are.
- Region and node selection is entirely manual.
## Data Essentials
### What data does the application receive, produce, and process?
- Geo streams almost all data held by a GitLab instance between sites. This
includes full database replication, most files (user-uploaded attachments,
etc) and repository + wiki data. In a typical configuration, this will
happen across the public Internet, and be TLS-encrypted.
- PostgreSQL replication is TLS-encrypted.
- See also: [only TLSv1.2 should be supported](https://gitlab.com/gitlab-org/omnibus-gitlab/issues/2948)
### How can the data be classified into categories according to its sensitivity?
- GitLab’s model of sensitivity is centered around public vs. internal vs.
private projects. Geo replicates them all indiscriminately. “Selective sync”
exists for files and repositories (but not database content), which would permit
only less-sensitive projects to be replicated to a secondary if desired.
- See also: [developing a data classification policy](https://gitlab.com/gitlab-com/security/issues/4).
### What data backup and retention requirements have been defined for the application?
- Geo is designed to provide replication of a certain subset of the application
data. It is part of the solution, rather than part of the problem.
## End-Users
### Who are the application's end‐users?
- Geo nodes (secondaries) are created in regions that are distant (in terms of
Internet latency) from the main GitLab installation (the primary). They are
intended to be used by anyone who would ordinarily use the primary, who finds
that the secondary is closer to them (in terms of Internet latency).
### How do the end‐users interact with the application?
- A Geo secondary node provides all the interfaces a Geo primary node does
(notably a HTTP/HTTPS web application, and HTTP/HTTPS or SSH git repository
access), but is constrained to read-only activities. The principal use case is
envisioned to be cloning git repositories from the secondary in favor of the
primary, but end-users may use the GitLab web interface to view projects,
issues, merge requests, snippets, etc.
### What security expectations do the end‐users have?
- The replication process must be secure. It would typically be unacceptable to
transmit the entire database contents or all files and repositories across the
public Internet in plaintext, for instance.
- The Geo secondary must have the same access controls over its content as the
primary - unauthenticated users must not be able to gain access to privileged
information on the primary by querying the secondary.
- Attackers must not be able to impersonate the secondary to the primary, and
thus gain access to privileged information.
## Administrators
### Who has administrative capabilities in the application?
- Nothing Geo-specific. Any user where `admin: true` is set in the database is
considered an admin with super-user privileges.
- See also: [more granular access control](https://gitlab.com/gitlab-org/gitlab-ce/issues/32730)
(not geo-specific)
- Much of Geo’s integration (database replication, for instance) must be
configured with the application, typically by system administrators.
### What administrative capabilities does the application offer?
- Geo secondaries may be added, modified, or removed by users with
administrative access.
- The replication process may be controlled (start/stop) via the Sidekiq
administrative controls.
## Network
### What details regarding routing, switching, firewalling, and load‐balancing have been defined?
- Geo requires the primary and secondary to be able to communicate with each
other across a TCP/IP network. In particular, the secondaries must be able to
access HTTP/HTTPS and PostgreSQL services on the primary.
### What core network devices support the application?
- Varies from customer to customer.
### What network performance requirements exist?
- Maximum replication speeds between primary and secondary is limited by the
available bandwidth between sites. No hard requirements exist - time to complete
replication (and ability to keep up with changes on the primary) is a function
of the size of the data set, tolerance for latency, and available network
capacity.
### What private and public network links support the application?
- Customers choose their own networks. As sites are intended to be
geographically separated, it is envisioned that replication traffic will pass
over the public Internet in a typical deployment, but this is not a requirement.
## Systems
### What operating systems support the application?
- Geo imposes no additional restrictions on operating system (see the
[GitLab installation](https://about.gitlab.com/installation/) page for more
details), however we recommend using the operating systems listed in the [Geo documentation](http://docs.gitlab.com/ee/gitlab-geo/#geo-recommendations).
### What details regarding required OS components and lock‐down needs have been defined?
- The recommended installation method (Omnibus) packages most components itself.
A from-source installation method exists. Both are documented at
https://docs.gitlab.com/ee/gitlab-geo/
- There are significant dependencies on the system-installed OpenSSH daemon (Geo
requires users to set up custom authentication methods) and the omnibus or
system-provided PostgreSQL daemon (it must be configured to listen on TCP,
additional users and replication slots must be added, etc).
- The process for dealing with security updates (for example, if there is a
significant vulnerability in OpenSSH or other services, and the customer
wants to patch those services on the OS) is identical to the non-Geo
situation: security updates to OpenSSH would be provided to the user via the
usual distribution channels. Geo introduces no delay there.
## Infrastructure Monitoring
### What network and system performance monitoring requirements have been defined?
- None specific to Geo.
### What mechanisms exist to detect malicious code or compromised application components?
- None specific to Geo.
### What network and system security monitoring requirements have been defined?
- None specific to Geo.
## Virtualization and Externalization
### What aspects of the application lend themselves to virtualization?
- All.
## What virtualization requirements have been defined for the application?
- Nothing Geo-specific, but everything in GitLab needs to have full
functionality in such an environment.
### What aspects of the product may or may not be hosted via the cloud computing model?
- GitLab is “cloud native” and this applies to Geo as much as to the rest of the
product. Deployment in clouds is a common and supported scenario.
## If applicable, what approach(es) to cloud computing will be taken (Managed Hosting versus "Pure" Cloud, a "full machine" approach such as AWS-EC2 versus a "hosted database" approach such as AWS-RDS and Azure, etc)?
- To be decided by our customers, according to their operational needs.
## Environment
### What frameworks and programming languages have been used to create the application?
- Ruby on Rails, Ruby.
### What process, code, or infrastructure dependencies have been defined for the application?
- Nothing specific to Geo.
### What databases and application servers support the application?
- PostgreSQL >= 9.6, Redis, Sidekiq, Unicorn.
### How will database connection strings, encryption keys, and other sensitive components be stored, accessed, and protected from unauthorized detection?
- There are some Geo-specific values. Some are shared secrets which must be
securely transmitted from the primary to the secondary at setup time. Our
documentation recommends transmitting them from the primary to the system
administrator via SSH, and then back out to the secondary in the same manner.
In particular, this includes the PostgreSQL replication credentials and a secret
key (`db_key_base`) which is used to decrypt certain columns in the database.
The `db_key_base` secret is stored unencrypted on the filesystem, in
`/etc/gitlab/gitlab-secrets.json`, along with a number of other secrets. There is
no at-rest protection for them.
## Data Processing
### What data entry paths does the application support?
- Data is entered via the web application exposed by GitLab itself. Some data is
also entered using system administration commands on the GitLab servers (e.g.,
`gitlab-ctl set-primary-node`).
- Secondaries also receive inputs via PostgreSQL streaming replication from the
primary.
### What data output paths does the application support?
- Primaries output via PostgreSQL streaming replication to the secondary.
Otherwise, principally via the web application exposed by GitLab itself, and via
SSH `git clone` operations initiated by the end-user.
### How does data flow across the application's internal components?
- Secondaries and primaries interact via HTTP/HTTPS (secured with JSON web
tokens) and via PostgreSQL streaming replication.
- Within a primary or secondary, the SSOT is the filesystem and the database
(including Geo tracking database on secondary). The various internal components
are orchestrated to make alterations to these stores.
### What data input validation requirements have been defined?
- Secondaries must have a faithful replication of the primary’s data.
### What data does the application store and how?
- Git repositories and files, tracking information related to the them, and the
GitLab database contents.
### What data is or may need to be encrypted and what key management requirements have been defined?
- Neither primaries or secondaries encrypt Git repository or filesystem data at
rest. A subset of database columns are encrypted at rest using the `db_otp_key`
- a static secret shared across all hosts in a GitLab deployment.
- In transit, data should be encrypted, although the application does permit
communication to proceed unencrypted. The two main transits are the secondary’s
replication process for PostgreSQL, and for git repositories/files. Both should
be protected using TLS, with the keys for that managed via Omnibus per existing
configuration for end-user access to GitLab.
### What capabilities exist to detect the leakage of sensitive data?
- Comprehensive system logs exist, tracking every connection to GitLab and
PostgreSQL.
### What encryption requirements have been defined for data in transit - including transmission over WAN, LAN, SecureFTP, or publicly accessible protocols such as http: and https:?
- Data must have the option to be encrypted in transit, and be secure against
both passive and active attack (e.g., MITM attacks should not be possible).
## Access
### What user privilege levels does the application support?
- Geo adds one type of privilege: secondaries can access a special Geo API to
download files over HTTP/HTTPS, and to clone repositories using HTTP/HTTPS.
### What user identification and authentication requirements have been defined?
- Geo secondaries identify to Geo primaries via OAuth or JWT authentication
based on the shared database (HTTP access) or a PostgreSQL replication user (for
database replication). The database replication also requires IP-based access
controls to be defined.
### What user authorization requirements have been defined?
- Secondaries must only be able to *read* data. They are not currently able to
mutate data on the primary.
### What session management requirements have been defined?
- Geo JWTs are defined to last for only two minutes before needing to be
regenerated.
### What access requirements have been defined for URI and Service calls?
- A Geo secondary makes many calls to the primary's API. This is how file
replication proceeds, for instance. This endpoint is only accessible with a JWT
token.
- The primary also makes calls to the secondary to get status information.
## Application Monitoring
### What application auditing requirements have been defined? How are audit and debug logs accessed, stored, and secured?
- Structured JSON log is written to the filesystem, and can also be ingested
into a Kibana installation for further analysis.
This document was moved to [another location](../administration/geo/security_review.md).
By running the command above, `primary` should be `true` when executed in
the primary node, and `false` on any secondary.
#### How do I fix the message, "ERROR: replication slots can only be used if max_replication_slots > 0"?
This means that the `max_replication_slots` PostgreSQL variable needs to
be set on the primary database. In GitLab 9.4, we have made this setting
default to 1. You may need to increase this value if you have more Geo
secondary nodes. Be sure to restart PostgreSQL for this to take
effect. See the [PostgreSQL replication
setup](database.md#postgresql-replication) guide for more details.
#### How do I fix the message, "FATAL: could not start WAL streaming: ERROR: replication slot "geo_secondary_my_domain_com" does not exist"?
This occurs when PostgreSQL does not have a replication slot for the
secondary by that name. You may want to rerun the [replication
process](database.md) on the secondary.
#### How do I fix the message, "Command exceeded allowed execution time" when setting up replication?
This may happen while [initiating the replication process](database.md#step-4-initiate-the-replication-process) on the Geo secondary, and indicates that your
initial dataset is too large to be replicated in the default timeout (30 minutes).
Re-run `gitlab-ctl replicate-geo-database`, but include a larger value for