Commit 7353c1c8 authored by Stan Hu

Add example to multi-version compatibility

This documents issues dealing with delays in deployment times across
node types and why a feature flag is useful. This stemmed from a
production incident discussed in the following issues:

1. https://gitlab.com/gitlab-org/gitlab/-/issues/230739
2. https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2442
parent 193ab7ff
@@ -20,6 +20,22 @@
but AJAX requests to URLs (like the GraphQL endpoint) won't match the pattern.
With this canary setup, we'd be in this mixed-versions state for an extended period of time until canary is promoted to
production and post-deployment migrations run.
Also be aware that during a deployment to production, Web, API, and
Sidekiq nodes are updated in parallel, but they may finish at
different times. That means there may be a window of time when the
application code is not in sync across the whole fleet. Changes that
cut across Sidekiq, Web, and/or the API may [introduce unexpected
errors until the deployment is complete](#builds-failing-due-to-varying-deployment-times-across-node-types).
One way to handle this is to use a feature flag that is disabled by
default. The feature flag can be enabled when the deployment is in a
consistent state. However, this method of synchronization doesn't
guarantee that customers with on-premises instances can [upgrade with
zero downtime](https://docs.gitlab.com/omnibus/update/#zero-downtime-updates),
since point releases bundle many changes together. Minimizing the time
during which versions are out of sync across the fleet may help mitigate
errors caused by upgrades.
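For example, a change whose behavior must stay consistent across Web, API, and
Sidekiq could be gated roughly as follows. This is a minimal sketch assuming
GitLab's `Feature.enabled?` helper; the flag name and the method names are
hypothetical placeholders, not the actual implementation:

```ruby
# Hypothetical sketch: gate cross-node behavior behind a default-off flag.
def perform_cross_node_change
  if Feature.enabled?(:my_cross_node_change)
    # New code path: only taken after the flag is enabled, once Web, API,
    # and Sidekiq nodes are all running the new application code.
    new_behavior
  else
    # Old code path: remains the default while the fleet runs mixed versions.
    legacy_behavior
  end
end
```

The flag ships disabled, is enabled once the deployment is consistent across
node types, and is removed together with the old code path in a later release.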
## Examples of previous incidents

### Some links to issues and MRs were broken
@@ -75,3 +91,37 @@
the new application code, hence QA was successful. Unfortunately, the production
instance still uses the older code, so it started failing to insert a new release entry.
For more information, see [this issue related to the Releases API](https://gitlab.com/gitlab-org/gitlab-foss/-/issues/64151).
### Builds failing due to varying deployment times across node types
In [one production issue](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2442),
CI builds that used the `parallel` keyword and depended on the
variable `CI_NODE_TOTAL` being an integer failed. This happened because, after a user pushed a commit:
1. New code: Sidekiq created a new pipeline and new build. `build.options[:parallel]` is a `Hash`.
1. Old code: Runners requested a job from an API node that is running the previous version.
1. As a result, the [new code](https://gitlab.com/gitlab-org/gitlab/blob/42b82a9a3ac5a96f9152aad6cbc583c42b9fb082/app/models/concerns/ci/contextable.rb#L104)
   was not run on the API server. The runner's request failed because the
   older API server tried to return the `CI_NODE_TOTAL` CI variable, but
   instead of sending an integer value (for example, 9), it sent a serialized
   `Hash` value (`{:number=>9, :total=>9}`), as illustrated in the sketch after this list.
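The following sketch illustrates the shape of the mismatch. It is a simplified,
hypothetical reconstruction rather than the actual code from the linked file,
with `options` standing in for `build.options`:

```ruby
# The new Sidekiq code persists the parallel setting as a Hash:
options = { parallel: { number: 9, total: 9 } }

# New code on an updated API node extracts the integer before exposing it:
options.dig(:parallel, :total)  # => 9

# Old code on a not-yet-updated API node passes the value through unchanged:
options[:parallel]              # => {:number=>9, :total=>9}
# The runner received a serialized Hash where it expected an integer,
# so jobs that relied on CI_NODE_TOTAL failed.
```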
If you look at the [deployment pipeline](https://ops.gitlab.net/gitlab-com/gl-infra/deployer/-/pipelines/202212),
you can see that all nodes were updated in parallel:
![GitLab.com deployment pipeline](img/deployment_pipeline_v13_3.png)
However, even though the updates started around the same time, the completion times varied significantly:
| Node type | Duration (min) |
|-----------|----------------|
| API       | 54             |
| Sidekiq   | 21             |
| K8S       | 8              |
Builds that used the `parallel` keyword and depended on `CI_NODE_TOTAL`
and `CI_NODE_INDEX` would fail in the window after Sidekiq was updated
but before the API nodes finished updating. Since Kubernetes (K8S) also
runs Sidekiq pods, the window could have been as long as 46 minutes
(54 - 8) or as short as 33 minutes (54 - 21). Either way, having a
feature flag to turn on after the deployment finished would have
prevented this from happening.