Commit ef1d05d0 authored by Achilleas Pipinellis

Merge branch '213918-document-elasticsearch-re-indexing-using-an-alias' into 'master'

Add zero-downtime re-indexing documentation

Closes #213918

See merge request gitlab-org/gitlab!29788
parents b3b25cce 71564639
---
title: Improve Elasticsearch Reindexing documentation
merge_request: 29788
author:
type: other
@@ -121,6 +121,9 @@ Patterns:
## Zero downtime reindexing with multiple indices
NOTE: **Note:**
This is not applicable yet as multiple indices functionality is not fully implemented.
Currently GitLab can only handle a single version of setting. Any setting/schema changes would require reindexing everything from scratch. Since reindexing can take a long time, this can cause search functionality downtime.
To avoid downtime, GitLab is working to support multiple indices that
......
@@ -423,6 +423,140 @@ or creating [extra Sidekiq processes](../administration/operations/extra_sidekiq
For repository and snippet files, GitLab will only index up to 1 MiB of content, in order to avoid indexing timeouts.
## Zero downtime reindexing
The idea behind this reindexing method is to leverage the Elasticsearch index alias feature to atomically swap between two indices.
We will refer to each index as `primary` (online and used by GitLab for reads/writes) and `secondary` (offline, for reindexing purposes).
Instead of connecting directly to the `primary` index, we'll set up an index alias so that we can change the underlying index at will.
NOTE: **Note:**
Any index attached to the production alias is deemed a `primary` and will end up being used by the GitLab Elasticsearch integration.
### Pause the indexing
Under **Admin Area > Integration > Elasticsearch**, check the **Pause Elasticsearch Indexing** setting and save.
While indexing is paused, all updates destined for your Elasticsearch index are buffered. They are applied once indexing is unpaused.
### Setup
TIP: **Tip:**
If your index has been created with GitLab v13.0+ you can skip directly to [trigger the reindex](#trigger-the-reindex-via-the-elasticsearch-administration).
This process involves multiple shell commands and curl invocations, so a good initial setup will help down the road:
```shell
# You can find this value under Admin Area > Integration > Elasticsearch > URL
export CLUSTER_URL="http://localhost:9200"
export PRIMARY_INDEX="gitlab-production"
export SECONDARY_INDEX="gitlab-production-$(date +%s)"
```
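Because every later command depends on these three variables, it can help to check them once before going further. The following is a minimal sketch, not something shipped with GitLab: `check_reindex_env` is a hypothetical helper name, and it only assumes the variables exported above.

```shell
# Hypothetical helper (not part of GitLab): fail early if one of the variables
# exported above is missing, or if both index names are identical.
check_reindex_env() {
  for name in CLUSTER_URL PRIMARY_INDEX SECONDARY_INDEX; do
    # Indirect variable lookup that works in POSIX sh as well as Bash.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "$name is not set" >&2
      return 1
    fi
  done
  if [ "$PRIMARY_INDEX" = "$SECONDARY_INDEX" ]; then
    echo "PRIMARY_INDEX and SECONDARY_INDEX must differ" >&2
    return 1
  fi
  echo "OK: $PRIMARY_INDEX -> $SECONDARY_INDEX on $CLUSTER_URL"
}
```

Running `check_reindex_env` in the same shell session prints a one-line summary on success and returns a non-zero status otherwise.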
### Reclaiming the `gitlab-production` index name
CAUTION: **Caution:**
It is highly recommended that you take a snapshot of your cluster to make sure there is a recovery path if anything goes wrong.
NOTE: **Note:**
Due to a technical limitation, there will be a slight downtime because we need to reclaim the current `primary` index name to be used as the alias.
To reclaim the `gitlab-production` index name, you need to first create a `secondary` index and then trigger the re-index from `primary`.
#### Creating a secondary index
To create a secondary index, run the following Rake task. The `SKIP_ALIAS`
environment variable will disable the automatic creation of the Elasticsearch
alias, which would conflict with the existing index under `$PRIMARY_INDEX`:
```shell
# Omnibus installation
sudo SKIP_ALIAS=1 gitlab-rake "gitlab:elastic:create_empty_index[$SECONDARY_INDEX]"
# Source installation
SKIP_ALIAS=1 bundle exec rake "gitlab:elastic:create_empty_index[$SECONDARY_INDEX]"
```
The index should be created successfully, with the latest index options and mappings.
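One way to confirm that the index is actually there is a `HEAD` request on the index name, which Elasticsearch answers with HTTP 200 when the index exists and 404 when it does not. The sketch below assumes `CLUSTER_URL` is set as in the setup step; `index_exists` is a hypothetical helper name, not a GitLab or Elasticsearch command.

```shell
# Hypothetical helper: succeed only if Elasticsearch answers HTTP 200 for a
# HEAD request on the given index name.
index_exists() {
  status="$(curl --silent --head --output /dev/null \
    --write-out '%{http_code}' "$CLUSTER_URL/$1")"
  [ "$status" = "200" ]
}

# Usage:
#   index_exists "$SECONDARY_INDEX" && echo "secondary index is present"
```

Note that the same request also succeeds for an alias name, so this check only tells you that the name resolves to something in the cluster.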
#### Trigger the re-index from `primary`
To trigger the re-index from `primary` index:
1. Use the Elasticsearch [Reindex API](https://www.elastic.co/guide/en/elasticsearch/reference/7.6/docs-reindex.html):
```shell
curl --request POST \
--header 'Content-Type: application/json' \
--data "{ \"source\": { \"index\": \"$PRIMARY_INDEX\" }, \"dest\": { \"index\": \"$SECONDARY_INDEX\" } }" \
"$CLUSTER_URL/_reindex?slices=auto&wait_for_completion=false"
```
The output will be similar to:
```plaintext
{"task":"3qw_Tr0YQLq7PF16Xek8YA:1012"}
```
Note the `task` value here as it will be useful to follow the reindex progress.
1. Wait for the reindex process to complete by checking the `completed` value.
Using the `task` value from the previous step:
```shell
export TASK_ID=3qw_Tr0YQLq7PF16Xek8YA:1012
curl "$CLUSTER_URL/_tasks/$TASK_ID?pretty"
```
The output will be like:
```plaintext
{"completed":false, …}
```
Once the returned value is `true`, you may continue to the next step.
1. Make sure that the secondary index has data in it. You can use the Elasticsearch
API to get the document count of each index and compare the two:
```shell
curl "$CLUSTER_URL/$PRIMARY_INDEX/_count"    # => {"count":123123, …}
curl "$CLUSTER_URL/$SECONDARY_INDEX/_count"  # => {"count":123123, …}
```
TIP: **Tip:**
Comparing the document count is more accurate than using the index size, as improvements to the storage might cause the new index to be smaller than the original one.
1. Once you are confident your `secondary` index is valid, you can proceed to create the alias:
```shell
# Delete the original index
curl --request DELETE "$CLUSTER_URL/$PRIMARY_INDEX"
# Create the alias and add the `secondary` index to it
curl --request POST \
--header 'Content-Type: application/json' \
--data "{\"actions\":[{\"add\":{\"index\":\"$SECONDARY_INDEX\",\"alias\":\"$PRIMARY_INDEX\"}}]}" \
"$CLUSTER_URL/_aliases"
```
The reindexing is now completed. Your GitLab instance is now ready to use the [automated in-cluster reindexing](#trigger-the-reindex-via-the-elasticsearch-administration) feature for future reindexing.
1. Unpause the indexing
Under **Admin Area > Integration > Elasticsearch**, uncheck the **Pause Elasticsearch Indexing** setting and save.
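The waiting and verification steps above can also be scripted. The following is a minimal sketch under stated assumptions: `CLUSTER_URL` is set as in the setup step, `curl`, `grep`, and `sed` are available, and the responses have the `completed` and `count` fields shown earlier. All function names and the `POLL_SECONDS` variable are hypothetical, not part of GitLab.

```shell
# Hypothetical helpers, not part of GitLab. They assume CLUSTER_URL is set.

# Print "true" or "false" from a _tasks response read on stdin.
task_completed() {
  grep -Eo '"completed"[[:space:]]*:[[:space:]]*(true|false)' |
    head -n 1 | grep -Eo '(true|false)$'
}

# Poll the task every POLL_SECONDS (default 30) until it reports completion.
# This loops forever if the cluster never answers, so interrupt it if needed.
wait_for_task() {
  while :; do
    state="$(curl --silent "$CLUSTER_URL/_tasks/$1" | task_completed || true)"
    if [ "$state" = "true" ]; then
      return 0
    fi
    sleep "${POLL_SECONDS:-30}"
  done
}

# Print the document count of an index, taken from the _count response.
doc_count() {
  curl --silent "$CLUSTER_URL/$1/_count" |
    sed -n 's/.*"count"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p'
}

# Succeed only if both indices report the same, non-empty document count.
counts_match() {
  primary="$(doc_count "$1")"
  secondary="$(doc_count "$2")"
  [ -n "$primary" ] && [ "$primary" = "$secondary" ]
}

# Usage:
#   wait_for_task "$TASK_ID" && counts_match "$PRIMARY_INDEX" "$SECONDARY_INDEX"
```

A task can report `completed` as `true` and still contain failures, so it is worth reading the full task response before deleting the original index.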
### Trigger the reindex via the Elasticsearch administration
> [Introduced](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/34069) in [GitLab Starter](https://about.gitlab.com/pricing/) 13.2.
Under **Admin Area > Integration > Elasticsearch zero-downtime reindexing**, click on **Trigger cluster reindexing**.
NOTE: **Note:**
Reindexing can be a lengthy process depending on the size of your Elasticsearch cluster.
While the reindexing is running, you will be able to follow its progress under that same section.
## GitLab Elasticsearch Rake tasks
Rake tasks are available to:
@@ -586,7 +720,7 @@ Here are some common pitfalls and how to overcome them:
 - **I indexed all the repositories but then switched Elasticsearch servers and now I can't find anything**
-  You will need to re-run all the Rake tasks to re-index the database, repositories, and wikis.
+  You will need to re-run all the Rake tasks to reindex the database, repositories, and wikis.
 - **The indexing process is taking a very long time**
......
@@ -75,6 +75,7 @@ module Gitlab
   client.indices.create create_index_options
   client.indices.put_alias(name: target_name, index: new_index_name) if with_alias
   new_index_name
 end
......
@@ -27,14 +27,15 @@ namespace :gitlab do
 desc "GitLab | Elasticsearch | Index projects in the background"
 task index_projects: :environment do
   print "Enqueuing projects"
-  project_id_batches do |ids|
+  count = project_id_batches do |ids|
     ::Elastic::ProcessInitialBookkeepingService.backfill_projects!(*Project.find(ids))
     print "."
   end
-  puts "OK"
+  marker = count > 0 ? "✔" : "∅"
+  puts " #{marker} (#{count})"
 end
 desc "GitLab | ElasticSearch | Check project indexing status"
@@ -58,10 +59,17 @@ namespace :gitlab do
 desc "GitLab | Elasticsearch | Create empty index and assign alias"
 task :create_empty_index, [:target_name] => [:environment] do |t, args|
+  with_alias = ENV["SKIP_ALIAS"].nil?
+  options = {}
+  # only create an index at the specified name
+  options[:index_name] = args[:target_name] unless with_alias
   helper = Gitlab::Elastic::Helper.new(target_name: args[:target_name])
-  helper.create_empty_index
+  index_name = helper.create_empty_index(with_alias: with_alias, options: options)
-  puts "Index and underlying alias '#{helper.target_name}' has been created.".color(:green)
+  puts "Index '#{index_name}' has been created.".color(:green)
+  puts "Alias '#{helper.target_name}' → '#{index_name}' has been created".color(:green) if with_alias
 end
 desc "GitLab | Elasticsearch | Delete index"
@@ -108,20 +116,25 @@ namespace :gitlab do
 end
 def project_id_batches(&blk)
-  relation = Project
+  relation = Project.all
   unless ENV['UPDATE_INDEX']
     relation = relation.includes(:index_status).where('index_statuses.id IS NULL').references(:index_statuses)
   end
   if ::Gitlab::CurrentSettings.elasticsearch_limit_indexing?
-    relation = relation.where(id: ::Gitlab::CurrentSettings.elasticsearch_limited_projects.select(:id))
+    relation.merge!(::Gitlab::CurrentSettings.elasticsearch_limited_projects)
   end
-  relation.all.in_batches(start: ENV['ID_FROM'], finish: ENV['ID_TO']) do |relation| # rubocop: disable Cop/InBatches
+  count = 0
+  relation.in_batches(start: ENV['ID_FROM'], finish: ENV['ID_TO']) do |relation| # rubocop: disable Cop/InBatches
     ids = relation.reorder(:id).pluck(:id)
     yield ids
+    count += ids.size
   end
+  count
 end
 def display_unindexed(projects)
......