Commit 0383cee9 authored by Paul Slaughter

Merge branch '297193-retry-button-for-halted-migrations' into 'master'

Allow retry for halted Elasticsearch migrations

See merge request gitlab-org/gitlab!51335
parents e1e0b3d5 973888bc
......@@ -36,6 +36,7 @@ The type of problem will determine what steps to take. The possible troubleshoot
- Indexing.
- Integration.
- Performance.
- Background Migrations.
### Search Results workflow
......@@ -147,6 +148,30 @@ graph TD;
F7(Escalate to<br>GitLab support.)
```
### Background Migrations workflow
```mermaid
graph TD;
D --> |No| D1
D --> |Yes| D2
D2 --> |No| D3
D2 --> |Yes| D4
D4 --> |No| D5
D4 --> |Yes| D6
D6 --> |No| D8
D6 --> |Yes| D7
D{Is there a halted migration?}
D1[Migrations run in the<br>background and will<br>stop when completed.]
D2{Does the elasticsearch.log<br>file contain errors?}
D3[This is likely a bug/issue<br>in GitLab and will require<br>deeper investigation. Escalate<br>to GitLab support.]
D4{Have the errors<br>been addressed?}
D5[Have an Elasticsearch admin<br>review and address<br>the errors.]
D6{Has the migration<br>been retried?}
D7[This is likely a bug/issue<br>in GitLab and will require<br>deeper investigation. Escalate<br>to GitLab support.]
D8[Retry the migration from<br>the Admin > Settings ><br>Advanced Search UI.]
```
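The flowchart above can also be read as a small decision function. The following is a minimal sketch in plain Ruby; `next_migration_step` and the returned symbols are hypothetical names used only for illustration and are not part of GitLab.

```ruby
# Hypothetical helper that mirrors the Background Migrations flowchart.
# Each keyword argument corresponds to one decision node (D, D2, D4, D6).
def next_migration_step(halted:, log_has_errors: false, errors_addressed: false, retried: false)
  return :wait_for_completion unless halted          # D  -> D1
  return :escalate_to_support unless log_has_errors  # D2 -> D3
  return :have_admin_fix_errors unless errors_addressed # D4 -> D5
  return :escalate_to_support if retried             # D6 -> D7

  :retry_from_admin_ui                               # D6 -> D8
end

# A halted migration whose logged errors were fixed but which has not been retried yet:
puts next_migration_step(halted: true, log_has_errors: true, errors_addressed: true)
# prints "retry_from_admin_ui"
```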
## Troubleshooting walkthrough
Most Elasticsearch troubleshooting can be broken down into the following categories:
......@@ -155,6 +180,7 @@ Most Elasticsearch troubleshooting can be broken down into 4 categories:
- [Troubleshooting indexing](#troubleshooting-indexing)
- [Troubleshooting integration](#troubleshooting-integration)
- [Troubleshooting performance](#troubleshooting-performance)
- [Troubleshooting background migrations](#troubleshooting-background-migrations)
Generally speaking, if it does not fall into those categories, it is either:
......@@ -330,6 +356,18 @@ dig further into these.
Feel free to reach out to GitLab support, but this is likely to be something a skilled
Elasticsearch admin has more experience with.
### Troubleshooting background migrations
Troubleshooting background migration failures can be difficult and may require contacting
an Elasticsearch admin or GitLab Support.
The best place to start while debugging issues with a background migration is the
[`elasticsearch.log` file](../logs.md#elasticsearchlog). While a migration is in progress,
it logs progress information and any errors it encounters.
Apply fixes for any errors found in the log and retry the migration.
If you still encounter issues after retrying the migration, reach out to GitLab support.
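As a starting point for reviewing the log, the sketch below filters log lines down to error entries. It assumes each line of `elasticsearch.log` is a JSON object with a `severity` key. `error_entries` is a hypothetical helper, and the sample lines are made up for illustration rather than taken from a real log.

```ruby
require 'json'

# Returns the parsed entries whose severity is ERROR or FATAL.
# Assumption: one JSON object per line with a "severity" key.
# Lines that are not valid JSON, or not JSON objects, are skipped.
def error_entries(lines)
  lines.filter_map do |line|
    entry = JSON.parse(line)
    entry if entry.is_a?(Hash) && %w[ERROR FATAL].include?(entry['severity'])
  rescue JSON::ParserError
    nil
  end
end

# Made-up sample lines, for illustration only.
sample = [
  '{"severity":"INFO","message":"hypothetical progress message"}',
  '{"severity":"ERROR","message":"hypothetical failure message"}',
  'not json'
]

error_entries(sample).each { |entry| puts entry['message'] }
```

To run this against a real file, pass `File.foreach(path)` instead of `sample`, where `path` is wherever your installation writes `elasticsearch.log`.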
## Common issues
All common issues [should be documented](../../integration/elasticsearch.md#troubleshooting). If not,
......
......@@ -216,6 +216,9 @@ cron worker sequentially.
Any update to the Elastic index mappings should be replicated in [`Elastic::Latest::Config`](https://gitlab.com/gitlab-org/gitlab/-/blob/master/ee/lib/elastic/latest/config.rb).
Migrations can be built with a retry limit and can be [failed and marked as halted](https://gitlab.com/gitlab-org/gitlab/-/blob/66e899b6637372a4faf61cfd2f254cbdd2fb9f6d/ee/lib/elastic/migration.rb#L40).
Any data or index cleanup needed to support migration retries should be handled within the migration.
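To illustrate the idea of a retry limit combined with cleanup before each attempt, here is a minimal plain-Ruby sketch. `RetryLimitedMigration`, `run_once`, and the state keys are hypothetical stand-ins and do not reflect the actual `Elastic::Migration` API.

```ruby
# Hypothetical stand-in for a migration with a retry limit; not GitLab code.
class RetryLimitedMigration
  attr_reader :state

  def initialize(max_attempts:, &work)
    @max_attempts = max_attempts
    @work = work
    @state = { attempts: 0, halted: false, completed: false }
  end

  # One worker pass. Does nothing once the migration is halted or completed.
  def run_once
    return state if state[:halted] || state[:completed]

    cleanup # remove leftovers from a previous failed attempt so a retry starts clean
    @work.call
    state[:completed] = true
    state
  rescue StandardError
    state[:attempts] += 1
    # Halt once the number of failed attempts reaches the limit.
    state[:halted] = state[:attempts] >= @max_attempts
    state
  end

  private

  # Placeholder: a real migration would delete partially created data or indices here.
  def cleanup; end
end

migration = RetryLimitedMigration.new(max_attempts: 2) { raise 'simulated failure' }
3.times { migration.run_once }
puts migration.state[:halted] # prints "true"
```

The third call returns early because the migration is already halted, which is why an explicit retry action is needed to run it again.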
### Migration options supported by the [`Elastic::MigrationWorker`](https://gitlab.com/gitlab-org/gitlab/blob/master/ee/app/workers/elastic/migration_worker.rb)
- `batched!` - Allow the migration to run in batches. If set, the [`Elastic::MigrationWorker`](https://gitlab.com/gitlab-org/gitlab/blob/master/ee/app/workers/elastic/migration_worker.rb)
......
......@@ -508,6 +508,15 @@ This should return something similar to:
In order to debug issues with the migrations you can check the [`elasticsearch.log` file](../administration/logs.md#elasticsearchlog).
### Retry a halted migration
Some migrations are built with a retry limit. If a migration cannot finish within its retry limit,
it is halted and a notification is displayed in the Advanced Search integration settings.
Before retrying, check the [`elasticsearch.log` file](../administration/logs.md#elasticsearchlog) to
find out why the migration was halted and make any necessary changes. Once you believe you've
fixed the cause of the failure, click "Retry migration", and the migration is scheduled to be retried
in the background.
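The retry action in this change deletes the stored migration record and drops the cached halted flag, so the migration is no longer reported as halted. The sketch below models that with in-memory hashes. `FakeMigrationStore` and the version number are hypothetical, standing in for the migrations index and `Rails.cache`.

```ruby
# Hypothetical in-memory stand-ins for the migrations index and the cache.
class FakeMigrationStore
  def initialize
    @records = {} # version => document stored in the migrations index
    @cache = {}   # version => cached halted flag
  end

  def save(version, state)
    @records[version] = { 'state' => state }
  end

  # Cached lookup: the stored answer is reused until the cache entry is dropped.
  def halted?(version)
    @cache.fetch(version) do
      @cache[version] = !!@records.dig(version, 'state', 'halted')
    end
  end

  # Mirrors the two steps of the retry action: delete the record, then drop the cache.
  # Returns true if a record was actually deleted.
  def retry!(version)
    deleted = !@records.delete(version).nil?
    @cache.delete(version)
    deleted
  end
end

version = 20210101000000 # hypothetical migration version
store = FakeMigrationStore.new
store.save(version, { 'halted' => true })

puts store.halted?(version) # prints "true"
store.retry!(version)
puts store.halted?(version) # prints "false"
```

Dropping the cache entry matters: without it, the stale halted flag would keep being served until the cache entry expired, even though the record was gone.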
## GitLab Advanced Search Rake tasks
Rake tasks are available to:
......
......@@ -43,6 +43,19 @@ class Admin::ElasticsearchController < Admin::ApplicationController
redirect_to redirect_path
end
# POST
# Retry a halted migration
def retry_migration
migration = Elastic::DataMigrationService[params[:version].to_i]
Gitlab::Elastic::Helper.default.delete_migration_record(migration)
Elastic::DataMigrationService.drop_migration_halted_cache!(migration)
flash[:notice] = _('Migration has been scheduled to be retried')
redirect_to redirect_path
end
private
def redirect_path
......
......@@ -4,6 +4,7 @@ module Elastic
class DataMigrationService
MIGRATIONS_PATH = 'ee/elastic/migrate'
MIGRATION_REGEXP = /\A([0-9]+)_([_a-z0-9]*)\.rb\z/.freeze
CACHE_TIMEOUT = 30.minutes
class << self
def migration_files
......@@ -29,7 +30,7 @@ module Elastic
end
def migration_has_finished?(name)
Rails.cache.fetch cache_key(:migration_has_finished, name.to_s.underscore), expires_in: 30.minutes do
Rails.cache.fetch cache_key(:migration_has_finished, name.to_s.underscore), expires_in: CACHE_TIMEOUT do
migration_has_finished_uncached?(name)
end
end
......@@ -40,12 +41,38 @@ module Elastic
!!migration&.load_from_index&.dig('_source', 'completed')
end
def migration_halted?(migration)
Rails.cache.fetch cache_key(:migration_halted, migration.name_for_key), expires_in: CACHE_TIMEOUT do
migration_halted_uncached?(migration)
end
end
def drop_migration_halted_cache!(migration)
Rails.cache.delete cache_key(:migration_halted, migration.name_for_key)
end
def migration_halted_uncached?(migration)
!!migration&.load_from_index&.dig('_source', 'state', 'halted')
end
def pending_migrations?
migrations.reverse.any? do |migration|
!migration_has_finished?(migration.name_for_key)
end
end
def halted_migrations?
migrations.reverse.any? do |migration|
migration_halted?(migration)
end
end
def halted_migration
migrations.reverse.find do |migration|
migration_halted?(migration)
end
end
def mark_all_as_completed!
migrations.each do |migration|
migration.save!(completed: true)
......
......@@ -4,6 +4,7 @@
- recreate_index_link_start = '<a href="%{url}" target="_blank" rel="noopener noreferrer">'.html_safe % { url: recreate_index_url }
- recreate_index_text = s_("Changes won't take place until the index is %{link_start}recreated%{link_end}.").html_safe % { link_start: recreate_index_link_start, link_end: '</a>'.html_safe }
- expanded = integration_expanded?('elasticsearch_')
- elasticsearch_available = Gitlab::Elastic::Helper.default.client.ping
%section.settings.as-elasticsearch.no-animate#js-elasticsearch-settings{ class: ('expanded' if expanded), data: { qa_selector: 'elasticsearch_tab' } }
.settings-header
......@@ -13,12 +14,27 @@
= expanded ? 'Collapse' : 'Expand'
%p
= _('Advanced Search with Elasticsearch')
.settings-content
= form_for @application_setting, url: general_admin_application_settings_path(anchor: 'js-elasticsearch-settings'), html: { class: 'fieldset-form' } do |f|
= form_errors(@application_setting) if expanded
%fieldset
.sub-section
- halted_migrations = elasticsearch_available && Elastic::DataMigrationService.halted_migrations?
- if halted_migrations
.gl-alert.gl-alert-warning.gl-mt-3.gl-mb-3{ role: 'alert' }
= sprite_icon('warning', css_class: 'gl-icon gl-alert-icon gl-alert-icon-no-title')
%button.js-close.gl-alert-dismiss{ type: 'button', 'aria-label' => _('Dismiss') }
= sprite_icon('close', css_class: 'gl-icon')
.gl-alert-body
%h4.gl-alert-title= s_('There is a halted Elasticsearch migration')
= html_escape_once(s_('Check the elasticsearch.log file to debug why the migration was halted and make any changes before retrying the migration. When you fix the cause of the failure, click "Retry migration", and the migration will be scheduled to be retried in the background.')).html_safe
= link_to _('Learn more.'), help_page_path('integration/elasticsearch', anchor: 'background-migrations')
.gl-alert-actions
- migration = Elastic::DataMigrationService.halted_migration
= link_to s_('Retry migration'), admin_elasticsearch_retry_migration_path(version: migration.version), class: 'btn gl-alert-action btn-warning gl-button', disabled: @elasticsearch_reindexing_task&.in_progress?, data: { confirm: _('Are you sure you want to retry this migration?') }, method: :post
.form-group
.form-check
= f.check_box :elasticsearch_indexing, class: 'form-check-input', data: { qa_selector: 'indexing_checkbox' }
......@@ -35,7 +51,7 @@
.card-body
.form-group
.form-check
- pending_migrations = Elastic::DataMigrationService.pending_migrations? && Gitlab::CurrentSettings.elasticsearch_pause_indexing? rescue false
- pending_migrations = elasticsearch_available && Elastic::DataMigrationService.pending_migrations? && Gitlab::CurrentSettings.elasticsearch_pause_indexing?
- disable_checkbox = !Gitlab::CurrentSettings.elasticsearch_indexing? || pending_migrations
= f.check_box :elasticsearch_pause_indexing, class: 'form-check-input', data: { qa_selector: 'pause_checkbox' }, disabled: disable_checkbox
= f.label :elasticsearch_pause_indexing, class: 'form-check-label' do
......
---
title: Add retry ability for halted Elasticsearch migrations
merge_request: 51335
author:
type: changed
......@@ -81,5 +81,6 @@ namespace :admin do
post :enqueue_index
post :trigger_reindexing
post :cancel_index_deletion
post :retry_migration
end
end
......@@ -27,8 +27,10 @@ class MigrateIssuesToSeparateIndex < Elastic::Migration
def migrate
# On initial batch we only create index
if migration_state[:slice].blank?
cleanup # support retries
log "Create standalone issues index under #{issues_index_name}"
helper.create_standalone_indices unless helper.index_exists?(index_name: issues_index_name)
helper.create_standalone_indices(target_classes: [Issue])
options = {
slice: 0,
......@@ -88,6 +90,10 @@ class MigrateIssuesToSeparateIndex < Elastic::Migration
private
def cleanup
helper.delete_index(index_name: issues_index_name) if helper.index_exists?(index_name: issues_index_name)
end
def reindex(slice:, max_slices:)
body = query(slice: slice, max_slices: max_slices)
......
......@@ -93,14 +93,24 @@ module Gitlab
migrations_index_name
end
def standalone_indices_proxies
ES_SEPARATE_CLASSES.map do |class_name|
def delete_migration_record(migration)
result = client.delete(index: migrations_index_name, type: '_doc', id: migration.version)
result['result'] == 'deleted'
rescue ::Elasticsearch::Transport::Transport::Errors::NotFound => e
Gitlab::ErrorTracking.log_exception(e)
false
end
def standalone_indices_proxies(target_classes: nil)
classes = target_classes.presence || ES_SEPARATE_CLASSES
classes.map do |class_name|
::Elastic::Latest::ApplicationClassProxy.new(class_name, use_separate_indices: true)
end
end
def create_standalone_indices(with_alias: true, options: {})
standalone_indices_proxies.each_with_object({}) do |proxy, indices|
def create_standalone_indices(with_alias: true, options: {}, target_classes: nil)
proxies = standalone_indices_proxies(target_classes: target_classes)
proxies.each_with_object({}) do |proxy, indices|
alias_name = proxy.index_name
new_index_name = "#{alias_name}-#{Time.now.strftime("%Y%m%d-%H%M")}"
......
......@@ -26,7 +26,7 @@ RSpec.describe Admin::ElasticsearchController do
context 'without an index' do
before do
allow(Gitlab::Elastic::Helper.default).to(receive(:index_exists?)).and_return(false)
allow(helper).to(receive(:index_exists?)).and_return(false)
end
it 'does nothing and returns 404' do
......@@ -80,4 +80,27 @@ RSpec.describe Admin::ElasticsearchController do
expect(response).to redirect_to general_admin_application_settings_path(anchor: 'js-elasticsearch-settings')
end
end
describe 'POST #retry_migration' do
before do
sign_in(admin)
end
let(:migration) { Elastic::DataMigrationService.migrations.last }
let(:migration_version) { migration.version.to_i }
it 'deletes the migration record and drops the halted cache' do
allow(Elastic::MigrationRecord).to receive(:new).and_call_original
allow(Elastic::MigrationRecord).to receive(:new).with(version: migration.version, name: migration.name, filename: migration.filename).and_return(migration)
allow(Elastic::DataMigrationService).to receive(:migration_halted?).and_return(false)
allow(Elastic::DataMigrationService).to receive(:migration_halted?).with(migration).and_return(true, false)
expect(Elastic::DataMigrationService.halted_migrations?).to be_truthy
post :retry_migration, params: { version: migration.version }
expect(Elastic::DataMigrationService.halted_migrations?).to be_falsey
expect(controller).to set_flash[:notice].to include('Migration has been scheduled to be retried')
expect(response).to redirect_to general_admin_application_settings_path(anchor: 'js-elasticsearch-settings')
end
end
end
......@@ -321,4 +321,42 @@ RSpec.describe Gitlab::Elastic::Helper do
end
end
end
describe '#delete_migration_record', :elastic do
let(:migration) { ::Elastic::DataMigrationService.migrations.last }
subject { helper.delete_migration_record(migration) }
context 'when record exists' do
it { is_expected.to be_truthy }
end
context 'when record does not exist' do
before do
allow(migration).to receive(:version).and_return(1)
end
it { is_expected.to be_falsey }
end
end
describe '#standalone_indices_proxies' do
subject { helper.standalone_indices_proxies(target_classes: classes) }
context 'when target_classes is not provided' do
let(:classes) { nil }
it 'creates proxies for each separate class' do
expect(subject.count).to eq(Gitlab::Elastic::Helper::ES_SEPARATE_CLASSES.count)
end
end
context 'when target_classes is provided' do
let(:classes) { [Issue] }
it 'creates proxies for only the target classes' do
expect(subject.count).to eq(1)
end
end
end
end
......@@ -119,4 +119,73 @@ RSpec.describe Elastic::DataMigrationService, :elastic do
expect(subject.migration_has_finished?(migration_name)).to eq(finished)
end
end
describe '.migration_halted?' do
let(:migration) { subject.migrations.last }
before do
allow(Rails).to receive(:cache).and_return(ActiveSupport::Cache::MemoryStore.new)
allow(subject).to receive(:migration_halted_uncached?).with(migration).and_return(true, false)
end
it 'calls the uncached method only once' do
expect(subject).to receive(:migration_halted_uncached?).once
expect(subject.migration_halted?(migration)).to eq(true)
expect(subject.migration_halted?(migration)).to eq(true)
end
end
describe '.migration_halted_uncached?' do
let(:migration) { subject.migrations.last }
let(:halted_response) { { '_source': { 'state': { halted: true } } }.with_indifferent_access }
let(:not_halted_response) { { '_source': { 'state': { halted: false } } }.with_indifferent_access }
it 'returns true if migration has been halted' do
allow(migration).to receive(:load_from_index).and_return(not_halted_response)
expect(subject.migration_halted_uncached?(migration)).to eq(false)
allow(migration).to receive(:load_from_index).and_return(halted_response)
expect(subject.migration_halted_uncached?(migration)).to eq(true)
end
end
describe '.drop_migration_halted_cache!' do
let(:migration) { subject.migrations.last }
before do
allow(Rails).to receive(:cache).and_return(ActiveSupport::Cache::MemoryStore.new)
allow(subject).to receive(:migration_halted_uncached?).with(migration).and_return(true, false)
end
it 'drops cache' do
expect(subject).to receive(:migration_halted_uncached?).twice
expect(subject.migration_halted?(migration)).to eq(true)
subject.drop_migration_halted_cache!(migration)
expect(subject.migration_halted?(migration)).to eq(false)
end
end
describe '.halted_migration' do
let(:migration) { subject.migrations.last }
let(:halted_response) { { '_source': { 'state': { halted: true } } }.with_indifferent_access }
before do
allow(Rails).to receive(:cache).and_return(ActiveSupport::Cache::MemoryStore.new)
allow(Elastic::MigrationRecord).to receive(:new).and_call_original
allow(Elastic::MigrationRecord).to receive(:new).with(version: migration.version, name: migration.name, filename: migration.filename).and_return(migration)
end
it 'returns a migration when it is halted' do
expect(subject.halted_migration).to be_nil
allow(migration).to receive(:load_from_index).and_return(halted_response)
subject.drop_migration_halted_cache!(migration)
expect(subject.halted_migration).to eq(migration)
end
end
end
......@@ -192,4 +192,31 @@ RSpec.describe 'admin/application_settings/_elasticsearch_form' do
end
end
end
context 'elasticsearch migrations' do
let(:application_setting) { build(:application_setting) }
it 'does not show the retry migration card' do
render
expect(rendered).not_to include('There is a halted Elasticsearch migration')
expect(rendered).not_to include('Retry migration')
end
context 'when there is a halted migration' do
let(:migration) { Elastic::DataMigrationService.migrations.last }
before do
allow(Elastic::DataMigrationService).to receive(:halted_migrations?).and_return(true)
allow(Elastic::DataMigrationService).to receive(:halted_migration).and_return(migration)
end
it 'shows the retry migration card' do
render
expect(rendered).to include('There is a halted Elasticsearch migration')
expect(rendered).to include('Retry migration')
end
end
end
end
......@@ -3805,6 +3805,9 @@ msgstr ""
msgid "Are you sure you want to reset the registration token?"
msgstr ""
msgid "Are you sure you want to retry this migration?"
msgstr ""
msgid "Are you sure you want to revoke this %{type}? This action cannot be undone."
msgstr ""
......@@ -5398,6 +5401,9 @@ msgstr ""
msgid "Check the %{docs_link_start}documentation%{docs_link_end}."
msgstr ""
msgid "Check the elasticsearch.log file to debug why the migration was halted and make any changes before retrying the migration. When you fix the cause of the failure, click \"Retry migration\", and the migration will be scheduled to be retried in the background."
msgstr ""
msgid "Check your Docker images for known vulnerabilities."
msgstr ""
......@@ -18215,6 +18221,9 @@ msgstr ""
msgid "Migrated %{success_count}/%{total_count} files."
msgstr ""
msgid "Migration has been scheduled to be retried"
msgstr ""
msgid "Migration successful."
msgstr ""
......@@ -24305,6 +24314,9 @@ msgstr ""
msgid "Retry job"
msgstr ""
msgid "Retry migration"
msgstr ""
msgid "Retry this job"
msgstr ""
......@@ -28460,6 +28472,9 @@ msgstr ""
msgid "There are running deployments on the environment. Please retry later."
msgstr ""
msgid "There is a halted Elasticsearch migration"
msgstr ""
msgid "There is already a To-Do for this design."
msgstr ""
......