in production environments caused by simultaneous writes to different NFS
clients. Data corruption is not an acceptable risk.
Gitaly Cluster is purpose-built to provide reliable, high-performance,
fault-tolerant Git storage.
Further reading:
- Blog post: [The road to Gitaly v1.0 (aka, why GitLab doesn't require NFS for storing Git data anymore)](https://about.gitlab.com/blog/2018/09/12/the-road-to-gitaly-1-0/)
- Blog post: [How we spent two weeks hunting an NFS bug in the Linux kernel](https://about.gitlab.com/blog/2018/11/14/how-we-spent-two-weeks-hunting-an-nfs-bug/)
### Where Gitaly Cluster fits
GitLab accesses [repositories](../../user/project/repository/index.md) through the configured
[repository storages](../repository_storage_paths.md). Each new repository is stored on one of the
repository storages based on their configured weights. Each repository storage is either:
- A Gitaly storage served directly by Gitaly. These map to a directory on the file system of a
Gitaly node.
- A [virtual storage](#virtual-storage-or-direct-gitaly-storage) served by Praefect. A virtual
storage is a cluster of Gitaly storages that appear as a single repository storage.
Virtual storages are a feature of Gitaly Cluster. They support replicating the repositories to
multiple storages for fault tolerance. Virtual storages can improve performance by distributing
requests across Gitaly nodes. Their distributed nature makes it viable to have a single repository
storage in GitLab to simplify repository management.
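
The weight-based placement of new repositories can be pictured as a weighted random choice. The following is a minimal sketch in Python under stated assumptions, not GitLab's implementation: the storage names, the weights, and the `pick_storage` helper are all hypothetical, and the sketch assumes that a storage with a weight of zero receives no new repositories.

```python
import random

# Hypothetical weights for three repository storages (assumption for illustration).
STORAGE_WEIGHTS = {"default": 50, "storage-2": 50, "archive": 0}


def pick_storage(weights, rng=random):
    """Pick a storage name with probability proportional to its weight."""
    # In this sketch, storages with a weight of zero are never selected.
    eligible = {name: weight for name, weight in weights.items() if weight > 0}
    if not eligible:
        raise ValueError("no storage has a positive weight")
    names = list(eligible)
    return rng.choices(names, weights=[eligible[n] for n in names], k=1)[0]
```

Calling `pick_storage(STORAGE_WEIGHTS)` returns either `default` or `storage-2` with roughly equal likelihood. With a single virtual storage, the choice becomes trivial because only one storage is configured.
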
### Components of Gitaly Cluster
Gitaly Cluster consists of multiple components:
- [Load balancer](praefect.md#load-balancer) for distributing requests and providing fault-tolerant access to
Praefect nodes.
- [Praefect](praefect.md#praefect) nodes for managing the cluster and routing requests to Gitaly nodes.
- [PostgreSQL database](praefect.md#postgresql) for persisting cluster metadata and [PgBouncer](praefect.md#pgbouncer),
recommended for pooling Praefect's database connections.
- Gitaly nodes to provide repository storage and Git access.
![Cluster example](img/cluster_example_v13_3.png)
In this example:
- Repositories are stored on a virtual storage called `storage-1`.
- Three Gitaly nodes provide `storage-1` access: `gitaly-1`, `gitaly-2`, and `gitaly-3`.
- The three Gitaly nodes store data on their file systems.
### Virtual storage or direct Gitaly storage
Gitaly supports multiple models of scaling:
- Clustering using Gitaly Cluster, where each repository is stored on multiple Gitaly nodes in the
cluster. Read requests are distributed between repository replicas and write requests are
broadcast to repository replicas. GitLab accesses virtual storage.
- Direct access to Gitaly storage using [repository storage paths](../repository_storage_paths.md),
where each repository is stored on the assigned Gitaly node. All requests are routed to this node.
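
The difference between the two models can be sketched as a toy routing table. This is an illustrative Python model only, not how Praefect or Gitaly are implemented: the `VirtualStorage` and `DirectStorage` classes are hypothetical, and the sketch ignores replica health and whether a replica is up to date.

```python
import random


class VirtualStorage:
    """Toy model of a virtual storage backed by several replicas (hypothetical)."""

    def __init__(self, replicas, rng=None):
        self.replicas = list(replicas)
        self.rng = rng or random.Random()

    def route_read(self):
        # A read can be served by any one replica, which spreads the load.
        return [self.rng.choice(self.replicas)]

    def route_write(self):
        # A write is broadcast so that every replica receives the change.
        return list(self.replicas)


class DirectStorage:
    """Toy model of direct Gitaly storage: one node handles every request."""

    def __init__(self, node):
        self.node = node

    def route_read(self):
        return [self.node]

    def route_write(self):
        return [self.node]
```

For example, `VirtualStorage(["gitaly-1", "gitaly-2", "gitaly-3"]).route_write()` returns all three nodes, while `DirectStorage("gitaly-1")` always returns the single assigned node for both reads and writes.
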
The following shows Gitaly set up for direct access instead of Gitaly Cluster:
![Shard example](img/shard_example_v13_3.png)
In this example:
- Each repository is stored on one of three Gitaly storages: `storage-1`, `storage-2`,
or `storage-3`.
- Each storage is serviced by a Gitaly node.
- The three Gitaly nodes store data in three separate hashed storage locations.
Generally, virtual storage with Gitaly Cluster can replace direct Gitaly storage configurations, at
the expense of additional storage needed to store each repository on multiple Gitaly nodes. The
benefits of using Gitaly Cluster over direct Gitaly storage are:
- Improved fault tolerance, because each Gitaly node has a copy of every repository.
- Improved resource utilization, reducing the need for over-provisioning for shard-specific peak
loads, because read loads are distributed across replicas.
- Manual rebalancing for performance is not required, because read loads are distributed across
replicas.
- Simpler management, because all Gitaly nodes are identical.
Under some workloads, CPU and memory requirements may require a large fleet of Gitaly nodes. It
can be uneconomical to have a one-to-one replication factor.
A hybrid approach can be used in these instances, where each shard is configured as a smaller
cluster. [Variable replication factor](https://gitlab.com/groups/gitlab-org/-/epics/3372) is planned
to provide greater flexibility for extremely large GitLab instances.
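
The storage trade-off behind the hybrid approach can be illustrated with simple arithmetic. The following Python sketch is an assumption-laden illustration, not a sizing tool: the function names and the figures in the usage note are hypothetical, and it assumes repositories are spread evenly across shards.

```python
def full_replication_storage(total_data_tb, node_count):
    """Storage used when every node in one cluster holds every repository."""
    return total_data_tb * node_count


def hybrid_storage(total_data_tb, replicas_per_shard):
    """Storage used when each shard is a small cluster of its own.

    Each shard holds only its slice of the data, stored once per replica,
    so the slices add up to total_data_tb * replicas_per_shard.
    """
    return total_data_tb * replicas_per_shard


def hybrid_node_count(shard_count, replicas_per_shard):
    """Number of Gitaly nodes in the hybrid layout."""
    return shard_count * replicas_per_shard
```

As a hypothetical example, suppose a workload needs 12 Gitaly nodes for CPU and memory and holds 10 TB of Git data. One fully replicated cluster of 12 nodes stores 120 TB, while 4 shards of 3 nodes each also provide 12 nodes but store 30 TB.
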
### Gitaly Cluster compared to Geo
Gitaly Cluster and [Geo](../geo/index.md) both provide redundancy. However, the redundancy of:
- Gitaly Cluster provides fault tolerance for data storage and is invisible to the user. Users are
not aware when Gitaly Cluster is used.
- Geo provides [replication](../geo/index.md) and [disaster recovery](../geo/disaster_recovery/index.md) for
an entire instance of GitLab. Users know when they are using Geo for
[replication](../geo/index.md). Geo [replicates multiple data types](../geo/replication/datatypes.md#limitations-on-replicationverification),
including Git data.
The following table outlines the major differences between Gitaly Cluster and Geo:
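| Tool           | Nodes    | Locations | Latency tolerance  | Failover                                                        | Consistency                              | Provides redundancy for |
|:---------------|:---------|:----------|:-------------------|:----------------------------------------------------------------|:-----------------------------------------|:------------------------|
| Gitaly Cluster | Multiple | Single    | Approximately 1 ms | [Automatic](praefect.md#automatic-failover-and-leader-election) | [Strong](praefect.md#strong-consistency) | Data storage in Git     |
| Geo            | Multiple | Multiple  | Up to one minute   | [Manual](../geo/disaster_recovery/index.md)                     | Eventual                                 | Entire GitLab instance  |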
For more information, see:
- [Gitaly](index.md).
- Geo [use cases](../geo/index.md#use-cases) and [architecture](../geo/index.md#architecture).
## Architecture
Praefect is a router and transaction manager for Gitaly, and a required