Guidance for Gitaly, Gitaly Cluster, and Geo capabilities

Commit 5f8ba253 authored Sep 30, 2021 by Sarah Waldner Committed by Craig Norris Sep 30, 2021
16 changed files
--- a/doc/administration/configure.md
+++ b/doc/administration/configure.md
@@ -7,10 +7,44 @@ type: reference

 # Configure your GitLab installation **(FREE SELF)**

-Customize and configure your self-managed GitLab installation.
+Customize and configure your self-managed GitLab installation. Here are some quick links to get you started:

 - [Authentication](auth/index.md)
 - [Configuration](../user/admin_area/index.md)
 - [Repository storage](repository_storage_paths.md)
 - [Geo](geo/index.md)
 - [Packages](packages/index.md)
+
+The following tables are intended to help you choose the right combination of capabilities based on your requirements. It is common to want the most
+available, quickly recoverable, highly performant, and fully data-resilient solution. However, there are trade-offs.
+
+The tables list features on the left and their capabilities to the right, along with known trade-offs.
+
+## Gitaly Capabilities
+
+| | Availability | Recoverability | Data Resiliency | Performance | Risks/Trade-offs|
+|-|--------------|----------------|-----------------|-------------|-----------------|
+|Gitaly Cluster | Very high - tolerant of node failures | RTO for a single node of 10s with no manual intervention | Data is stored on multiple nodes | Good - While writes may take slightly longer due to voting, read distribution improves read speeds | **Trade-off** - Slight decrease in write speed for a redundant, strongly consistent storage solution. **Risks** - [Does not currently support snapshot backups](gitaly/index.md#snapshot-backup-and-recovery-limitations), GitLab backup task can be slow for large data sets |
+|Gitaly Shards | Single storage location is a single point of failure | Would need to restore only shards which failed | Single point of failure | Good - can allocate repositories to shards to spread load | **Trade-off** - Need to manually configure repositories into different shards to balance loads / storage space. **Risks** - Single point of failure; relies on a recovery process when a single-node failure occurs |
+|Gitaly + NFS | Single storage location is a single point of failure | Single node failure requires restoration from backup | Single point of failure | Average - NFS is not ideally suited to large quantities of small reads / writes which can have a detrimental impact on performance | **Trade-off** - Easy and familiar administration, though NFS is not ideally suited to Git demands. **Risks** - Many instances of NFS compatibility issues which provide very poor customer experiences |
+
+## Geo Capabilities
+
+If your availability needs to span multiple zones or multiple locations, please read about [Geo](geo/index.md).
+
+| | Availability | Recoverability | Data Resiliency | Performance | Risks/Trade-offs|
+|-|--------------|----------------|-----------------|-------------|-----------------|
+|Geo| Depends on the architecture of the Geo site. It is possible to deploy secondaries in single and multiple node configurations. | Eventually consistent. Recovery point depends on replication lag, which depends on a number of factors such as network speeds. Geo supports failover from a primary to secondary site using manual commands that are scriptable. | Geo currently replicates 100% of planned data types and verifies 50%. See [limitations table](geo/replication/datatypes.md#limitations-on-replicationverification) for more detail. | Improves read/clone times for users of a secondary.  | Geo is not intended to replace other backup/restore solutions. Because of replication lag and the possibility of replicating bad data from a primary, we recommend that customers also take regular backups of their primary site and test the restore process. |
+
+## Scenarios for failure modes and available mitigation paths
+
+The following table outlines failure modes and mitigation paths for the product offerings detailed in the tables above. Note: a Gitaly Cluster install assumes an odd replication factor of 3 or greater.
+
+| Gitaly Mode | Loss of Single Gitaly Node | Application / Data Corruption | Regional Outage (Loss of Instance) | Notes |
+| ----------- | -------------------------- | ----------------------------- | ---------------------------------- | ----- |
+| Single Gitaly Node | Downtime - Must restore from backup | Downtime - Must restore from backup | Downtime - Must wait for outage to end | |
+| Single Gitaly Node + Geo Secondary | Downtime - Must restore from backup, can perform a manual failover to secondary | Downtime - Must restore from backup, errors could have propagated to secondary | Manual intervention - failover to Geo secondary | |
+| Sharded Gitaly Install | Partial Downtime - Only repos on impacted node affected, must restore from backup | Partial Downtime - Only repos on impacted node affected, must restore from backup | Downtime - Must wait for outage to end | |
+| Sharded Gitaly Install + Geo Secondary | Partial Downtime - Only repos on impacted node affected, must restore from backup, could perform manual failover to secondary for impacted repos | Partial Downtime - Only repos on impacted node affected, must restore from backup, errors could have propagated to secondary | Manual intervention - failover to Geo secondary | |
+| Gitaly Cluster Install* | No Downtime - will swap repository primary to another node after 10 seconds | N/A - All writes are voted on by multiple Gitaly Cluster nodes | Downtime - Must wait for outage to end | Snapshot backups for Gitaly Cluster nodes not supported at this time |
+| Gitaly Cluster Install* + Geo Secondary | No Downtime - will swap repository primary to another node after 10 seconds | N/A - All writes are voted on by multiple Gitaly Cluster nodes | Manual intervention - failover to Geo secondary | Snapshot backups for Gitaly Cluster nodes not supported at this time |
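
The note about an odd replication factor of 3 or greater can be illustrated with a little arithmetic. The sketch below assumes that write voting needs a simple majority of replicas, which the text above does not state explicitly; the function names are hypothetical and are not part of Gitaly or Praefect.

```python
def majority_quorum(replication_factor: int) -> int:
    """Smallest number of votes that forms a strict majority (assumed voting rule)."""
    if replication_factor < 1:
        raise ValueError("replication factor must be positive")
    return replication_factor // 2 + 1


def tolerated_failures(replication_factor: int) -> int:
    """Number of nodes that can be lost while a majority is still reachable."""
    return replication_factor - majority_quorum(replication_factor)


if __name__ == "__main__":
    # Compare an even factor with its odd neighbours.
    for rf in (3, 4, 5):
        print(rf, majority_quorum(rf), tolerated_failures(rf))
```

Under this assumption, a factor of 4 tolerates the same single node loss as a factor of 3, so the extra replica adds storage cost without extra fault tolerance.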
--- a/doc/administration/gitaly/faq.md
+++ b/doc/administration/gitaly/faq.md
@@ -35,7 +35,7 @@ For more information, see:

 ## Are there instructions for migrating to Gitaly Cluster?

-Yes! For more information, see [Migrate to Gitaly Cluster](index.md#migrate-to-gitaly-cluster).
+Yes! For more information, see [Migrating to Gitaly Cluster](index.md#migrating-to-gitaly-cluster).

 ## What are some repository storage recommendations?


--- a/doc/administration/gitaly/index.md
+++ b/doc/administration/gitaly/index.md
--- a/doc/administration/gitaly/praefect.md
+++ b/doc/administration/gitaly/praefect.md
@@ -429,7 +429,7 @@ On the **Praefect** node:
   WARNING:
   If you have data on an already existing storage called
   `default`, you should configure the virtual storage with another name and
-   [migrate the data to the Gitaly Cluster storage](index.md#migrate-to-gitaly-cluster)
+   [migrate the data to the Gitaly Cluster storage](index.md#migrating-to-gitaly-cluster)
   afterwards.

   Replace `PRAEFECT_INTERNAL_TOKEN` with a strong secret, which is used by
@@ -893,7 +893,7 @@ Particular attention should be shown to:

   WARNING:
   If you have existing data stored on the default Gitaly storage,
-   you should [migrate the data your Gitaly Cluster storage](index.md#migrate-to-gitaly-cluster)
+   you should [migrate the data to your Gitaly Cluster storage](index.md#migrating-to-gitaly-cluster)
   first.

   ```ruby

--- a/doc/administration/nfs.md
+++ b/doc/administration/nfs.md
@@ -20,12 +20,46 @@ file system performance, see

 ## Gitaly and NFS deprecation

+Starting with GitLab version 14.0, support for NFS to store Git repository data will be deprecated. Technical customer support and engineering support will be available for the 14.x releases. Engineering will fix bugs and security vulnerabilities consistent with our [release and maintenance policy](../policy/maintenance.md#security-releases). 
+
+At the end of the 14.12 milestone (tentatively June 22nd, 2022), technical and engineering support for using NFS to store Git repository data will officially be at end-of-life. There will be no product changes or troubleshooting provided via Engineering, Security, or Paid Support channels.
+
+For those customers still running earlier versions of GitLab, [our support eligibility and maintenance policy applies](https://about.gitlab.com/support/statement-of-support.html#version-support).
+
+For the 14.x releases, we will continue to help with Git-related tickets from customers running one or more Gitaly servers with their data stored on NFS. Examples may include:
+
+- Performance issues or timeouts accessing Git data
+- Commits or branches vanish
+- GitLab intermittently returns the wrong Git data (such as reporting that a repository has no branches)
+
+Assistance will be limited to activities like:
+
+- Verifying developers' workflow uses features like protected branches
+- Reviewing GitLab event data from the database to advise if it looks like a force push overwrote branches
+- Verifying that NFS client mount options match our [documented recommendations](#mount-options)
+- Analyzing the GitLab Workhorse and Rails logs, and determining that `500` errors being seen in the environment are caused by slow responses from Gitaly
+
+GitLab support will be unable to continue with the investigation if:
+
+- The date of the request is on or after the release of GitLab version 15.0, and
+- Support Engineers and Management determine that all reasonable non-NFS root causes have been exhausted
+
+If the issue is reproducible, or if it happens intermittently but regularly, GitLab Support will investigate, provided the issue reproduces without the use of NFS. To reproduce without NFS, the affected repositories should be migrated to a different Gitaly shard, such as Gitaly Cluster or a standalone Gitaly VM, backed with block storage.
+
+### Why remove NFS for Git repository data
+
+{:.no-toc}
+
+NFS is not well-suited to a workload consisting of many small files, like Git repositories. NFS does provide a number of configuration options designed to improve performance. However, over time, a number of these mount options have proven to result in inconsistencies across multiple nodes mounting the NFS volume, up to and including data loss. Addressing these inconsistencies consumes extraordinary development and support engineer time, which hampers our ability to develop [Gitaly Cluster](gitaly/praefect.md), our purpose-built solution for addressing the deficiencies of NFS in this environment.
+
+Please note that Gitaly Cluster provides highly-available Git repository storage. If this is not a requirement, single-node Gitaly backed by block storage is a suitable substitute.
+
 Engineering support for NFS for Git repositories is deprecated. Technical support is planned to be
 unavailable from GitLab 15.0. No further enhancements are planned for this feature.

 Read:

- The [Gitaly and NFS deprecation notice](gitaly/index.md#nfs-deprecation-notice).
+- [Moving beyond NFS](gitaly/index.md#moving-beyond-nfs).
 - About the [correct mount options to use](#upgrade-to-gitaly-cluster-or-disable-caching-if-experiencing-data-loss).

 ## Known kernel version incompatibilities
@@ -370,8 +404,8 @@ sudo ufw allow from <client_ip_address> to any port nfs
 ### Upgrade to Gitaly Cluster or disable caching if experiencing data loss

 WARNING:
-Engineering support for NFS for Git repositories is deprecated. Read the
-[Gitaly and NFS deprecation notice](gitaly/index.md#nfs-deprecation-notice).
+Engineering support for NFS for Git repositories is deprecated. Read about
+[moving beyond NFS](gitaly/index.md#moving-beyond-nfs).

 Customers and users have reported data loss on high-traffic repositories when using NFS for Git repositories.
 For example, we have seen:

--- a/doc/administration/operations/moving_repositories.md
+++ b/doc/administration/operations/moving_repositories.md
@@ -27,7 +27,7 @@ For more information, see:
  querying and scheduling snippet repository moves.
 - [The API documentation](../../api/group_repository_storage_moves.md) details the endpoints for
  querying and scheduling group repository moves **(PREMIUM SELF)**.
- [Migrate to Gitaly Cluster](../gitaly/index.md#migrate-to-gitaly-cluster).
+- [Migrating to Gitaly Cluster](../gitaly/index.md#migrating-to-gitaly-cluster).

 ### Move Repositories


--- a/doc/administration/reference_architectures/10k_users.md
+++ b/doc/administration/reference_architectures/10k_users.md
@@ -2118,7 +2118,7 @@ unavailable from GitLab 15.0. No further enhancements are planned for this featu

 Read:

- The [Gitaly and NFS deprecation notice](../gitaly/index.md#nfs-deprecation-notice).
+- [Gitaly and NFS Deprecation](../nfs.md#gitaly-and-nfs-deprecation).
 - About the [correct mount options to use](../nfs.md#upgrade-to-gitaly-cluster-or-disable-caching-if-experiencing-data-loss).

 <div align="right">

--- a/doc/administration/reference_architectures/25k_users.md
+++ b/doc/administration/reference_architectures/25k_users.md
@@ -2124,7 +2124,7 @@ unavailable from GitLab 15.0. No further enhancements are planned for this featu

 Read:

- The [Gitaly and NFS deprecation notice](../gitaly/index.md#nfs-deprecation-notice).
+- [Gitaly and NFS Deprecation](../nfs.md#gitaly-and-nfs-deprecation).
 - About the [correct mount options to use](../nfs.md#upgrade-to-gitaly-cluster-or-disable-caching-if-experiencing-data-loss).

 ## Cloud Native Hybrid reference architecture with Helm Charts (alternative)

--- a/doc/administration/reference_architectures/2k_users.md
+++ b/doc/administration/reference_architectures/2k_users.md
@@ -965,7 +965,7 @@ unavailable from GitLab 15.0. No further enhancements are planned for this featu

 Read:

- The [Gitaly and NFS deprecation notice](../gitaly/index.md#nfs-deprecation-notice).
+- [Gitaly and NFS Deprecation](../nfs.md#gitaly-and-nfs-deprecation).
 - About the [correct mount options to use](../nfs.md#upgrade-to-gitaly-cluster-or-disable-caching-if-experiencing-data-loss).

 ## Cloud Native Hybrid reference architecture with Helm Charts (alternative)

--- a/doc/administration/reference_architectures/3k_users.md
+++ b/doc/administration/reference_architectures/3k_users.md
@@ -2072,7 +2072,7 @@ unavailable from GitLab 15.0. No further enhancements are planned for this featu

 Read:

- The [Gitaly and NFS deprecation notice](../gitaly/index.md#nfs-deprecation-notice).
+- [Gitaly and NFS Deprecation](../nfs.md#gitaly-and-nfs-deprecation).
 - About the [correct mount options to use](../nfs.md#upgrade-to-gitaly-cluster-or-disable-caching-if-experiencing-data-loss).

 ## Supported modifications for lower user counts (HA)

--- a/doc/administration/reference_architectures/50k_users.md
+++ b/doc/administration/reference_architectures/50k_users.md
@@ -2138,7 +2138,7 @@ unavailable from GitLab 15.0. No further enhancements are planned for this featu

 Read:

- The [Gitaly and NFS deprecation notice](../gitaly/index.md#nfs-deprecation-notice).
+- [Gitaly and NFS Deprecation](../nfs.md#gitaly-and-nfs-deprecation).
 - About the [correct mount options to use](../nfs.md#upgrade-to-gitaly-cluster-or-disable-caching-if-experiencing-data-loss).

 ## Cloud Native Hybrid reference architecture with Helm Charts (alternative)

--- a/doc/administration/reference_architectures/5k_users.md
+++ b/doc/administration/reference_architectures/5k_users.md
@@ -2066,7 +2066,7 @@ unavailable from GitLab 15.0. No further enhancements are planned for this featu

 Read:

- The [Gitaly and NFS deprecation notice](../gitaly/index.md#nfs-deprecation-notice).
+- [Gitaly and NFS Deprecation](../nfs.md#gitaly-and-nfs-deprecation).
 - About the [correct mount options to use](../nfs.md#upgrade-to-gitaly-cluster-or-disable-caching-if-experiencing-data-loss).

 ## Cloud Native Hybrid reference architecture with Helm Charts (alternative)

--- a/doc/administration/repository_storage_paths.md
+++ b/doc/administration/repository_storage_paths.md
@@ -174,4 +174,4 @@ information.
 ## Move repositories

 To move a repository to a different repository storage (for example, from `default` to `storage2`), use the
-same process as [migrating to Gitaly Cluster](gitaly/index.md#migrate-to-gitaly-cluster).
+same process as [migrating to Gitaly Cluster](gitaly/index.md#migrating-to-gitaly-cluster).
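
As a rough illustration of the move described above (from `default` to `storage2`), the sketch below builds the HTTP request that would schedule a project repository storage move. The endpoint path and the `destination_storage_name` parameter reflect my understanding of the project repository storage moves API and should be checked against its documentation; the host, token, project ID, and function name are hypothetical placeholders.

```python
import json
import urllib.request


def build_storage_move_request(base_url, token, project_id, destination):
    """Build (but do not send) a POST that schedules a project repository storage move."""
    # Assumed endpoint: /api/v4/projects/:project_id/repository_storage_moves
    url = f"{base_url.rstrip('/')}/api/v4/projects/{project_id}/repository_storage_moves"
    body = json.dumps({"destination_storage_name": destination}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"PRIVATE-TOKEN": token, "Content-Type": "application/json"},
    )


if __name__ == "__main__":
    # Hypothetical host, token, and project ID.
    req = build_storage_move_request(
        "https://gitlab.example.com", "<your_access_token>", 42, "storage2"
    )
    print(req.get_method(), req.full_url)
    # Sending is left to the caller, for example: urllib.request.urlopen(req)
```

Building the request separately from sending it makes it easy to inspect the URL and payload before touching a live instance.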
--- a/doc/api/group_repository_storage_moves.md
+++ b/doc/api/group_repository_storage_moves.md
@@ -10,7 +10,7 @@ type: reference
 > [Introduced](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/53016) in GitLab 13.9.

 Group repositories can be moved between storages. This API can help you when
-[migrating to Gitaly Cluster](../administration/gitaly/index.md#migrate-to-gitaly-cluster), for
+[migrating to Gitaly Cluster](../administration/gitaly/index.md#migrating-to-gitaly-cluster), for
 example, or to migrate a [group wiki](../user/project/wiki/index.md#group-wikis).

 As group repository storage moves are processed, they transition through different states. Values

--- a/doc/api/project_repository_storage_moves.md
+++ b/doc/api/project_repository_storage_moves.md
@@ -10,7 +10,7 @@ type: reference
 > [Introduced](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/31285) in GitLab 13.0.

 Project repositories including wiki and design repositories can be moved between storages. This can be useful when
-[migrating to Gitaly Cluster](../administration/gitaly/index.md#migrate-to-gitaly-cluster),
+[migrating to Gitaly Cluster](../administration/gitaly/index.md#migrating-to-gitaly-cluster),
 for example.

 As project repository storage moves are processed, they transition through different states. Values

--- a/doc/api/snippet_repository_storage_moves.md
+++ b/doc/api/snippet_repository_storage_moves.md
@@ -10,7 +10,7 @@ type: reference
 > [Introduced](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/49228) in GitLab 13.8.

 Snippet repositories can be moved between storages. This can be useful when
-[migrating to Gitaly Cluster](../administration/gitaly/index.md#migrate-to-gitaly-cluster), for
+[migrating to Gitaly Cluster](../administration/gitaly/index.md#migrating-to-gitaly-cluster), for
 example.

 As snippet repository storage moves are processed, they transition through different states. Values