Commit ceaefe24 authored by Ash McKenzie

Merge branch 'elasticsearch-disaster-recovery-logging' into 'master'

Improve Gitlab::Elastic::Indexer update logging and document possible use of logs

See merge request gitlab-org/gitlab!50482
parents e8af934c 72f17829
@@ -337,3 +337,48 @@
cluster.routing.allocation.disk.watermark.high: 10gb
Restart Elasticsearch, and the `read_only_allow_delete` block will clear on its own.
_from "Disk-based Shard Allocation | Elasticsearch Reference" [5.6](https://www.elastic.co/guide/en/elasticsearch/reference/5.6/disk-allocator.html#disk-allocator) and [6.x](https://www.elastic.co/guide/en/elasticsearch/reference/6.7/disk-allocator.html)_
### Disaster recovery/data loss/backups
The use of Elasticsearch in GitLab is only ever as a secondary data store.
This means that all of the data stored in Elasticsearch can always be derived
again from other data sources, specifically PostgreSQL and Gitaly. Therefore, if
the Elasticsearch data store is ever corrupted for whatever reason, you can
simply reindex everything from scratch.
If your Elasticsearch index is incredibly large, it may be too time-consuming,
or cause too much downtime, to reindex from scratch. There is no built-in
mechanism for automatically finding discrepancies and resyncing an
Elasticsearch index that has fallen out of sync, but one useful approach is to
inspect the logs for all of the updates that occurred in a time range you
believe may have been missed. This information is very low level and only
useful for operators that are familiar with the GitLab codebase. It is
documented here in case it is useful for others. The relevant logs that could
theoretically be used to figure out what needs to be replayed are listed
below, followed by a rough sketch of how they might be replayed:
1. All non-repository updates that were synced can be found in
   [`elasticsearch.log`](../administration/logs.md#elasticsearchlog) by
   searching for
   [`track_items`](https://gitlab.com/gitlab-org/gitlab/-/blob/1e60ea99bd8110a97d8fc481e2f41cab14e63d31/ee/app/services/elastic/process_bookkeeping_service.rb#L25).
   These can be replayed by sending the items through
   `::Elastic::ProcessBookkeepingService.track!` again.
1. All repository updates that occurred can be found in
   [`elasticsearch.log`](../administration/logs.md#elasticsearchlog) by
   searching for
   [`indexing_commit_range`](https://gitlab.com/gitlab-org/gitlab/-/blob/6f9d75dd3898536b9ec2fb206e0bd677ab59bd6d/ee/lib/gitlab/elastic/indexer.rb#L41).
   Replaying these requires resetting the
   [`IndexStatus#last_commit/last_wiki_commit`](https://gitlab.com/gitlab-org/gitlab/-/blob/master/ee/app/models/index_status.rb)
   to the oldest `from_sha` in the logs and then triggering another index of
   the project using
   [`ElasticCommitIndexerWorker`](https://gitlab.com/gitlab-org/gitlab/-/blob/master/ee/app/workers/elastic_commit_indexer_worker.rb).
1. All project deletes that occurred can be found in
   [`sidekiq.log`](../administration/logs.md#sidekiqlog) by searching for
   [`ElasticDeleteProjectWorker`](https://gitlab.com/gitlab-org/gitlab/-/blob/master/ee/app/workers/elastic_delete_project_worker.rb).
   These deletes can be replayed by triggering another
   `ElasticDeleteProjectWorker`.
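For illustration, here is a rough, untested sketch of what replaying each
category might look like from a Rails console. Each `indexing_commit_range`
entry is a JSON line carrying the `project_id`, `from_sha`, `to_sha`, and
`index_wiki` fields added in this merge request. In the sketch,
`missed_items`, `project_id`, and `oldest_from_sha` are placeholders you would
derive from the extracted log lines yourself, and the worker argument lists
are assumptions based on the current worker signatures.

```ruby
# 1. Non-repository updates: feed the logged `track_items` entries back
#    through the bookkeeping service. Assumes the logged items can be
#    reconstructed with Gitlab::Elastic::DocumentReference.deserialize.
refs = missed_items.map { |item| Gitlab::Elastic::DocumentReference.deserialize(item) }
::Elastic::ProcessBookkeepingService.track!(*refs)

# 2. Repository updates: reset the index status to the oldest `from_sha`
#    seen in the `indexing_commit_range` entries, then re-enqueue the
#    indexer (use `last_wiki_commit` and the wiki flag for wiki updates).
project = Project.find(project_id) # project_id taken from the logs
project.index_status.update!(last_commit: oldest_from_sha)
ElasticCommitIndexerWorker.perform_async(project.id)

# 3. Project deletes: re-enqueue the delete worker with the project's
#    database ID and Elasticsearch routing ID.
ElasticDeleteProjectWorker.perform_async(project.id, project.es_id)
```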
With the above methods, and by taking regular
[Elasticsearch snapshots](https://www.elastic.co/guide/en/elasticsearch/reference/current/snapshot-restore.html),
we should be able to recover from different kinds of data loss issues in a
relatively short period of time compared to indexing everything from
scratch.
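The snapshot half of this strategy is a pair of REST calls against the
Elasticsearch snapshot API. Below is a minimal Ruby sketch assuming a local
node; the repository name `gitlab_backup` and the `/mnt/es_backups` location
are made-up examples, and the location must be whitelisted under `path.repo`
in `elasticsearch.yml`.

```ruby
require 'net/http'
require 'json'

es = URI('http://localhost:9200')

Net::HTTP.start(es.host, es.port) do |http|
  # One-time setup: register a filesystem snapshot repository.
  repo = Net::HTTP::Put.new('/_snapshot/gitlab_backup')
  repo['Content-Type'] = 'application/json'
  repo.body = { type: 'fs', settings: { location: '/mnt/es_backups' } }.to_json
  puts http.request(repo).body

  # Take a snapshot of all indices and block until it completes.
  snap = Net::HTTP::Put.new('/_snapshot/gitlab_backup/snapshot_1?wait_for_completion=true')
  puts http.request(snap).body
end
```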
@@ -980,3 +980,11 @@
results and assuming that basic search is supported in that scope. This "basic
search" will behave as though you don't have Advanced Search enabled at all for
your instance and search using other data sources (ie. PostgreSQL data and Git
data).
### Data recovery: Elasticsearch is a secondary data store only
The use of Elasticsearch in GitLab is only ever as a secondary data store.
This means that all of the data stored in Elasticsearch can always be derived
again from other data sources, specifically PostgreSQL and Gitaly. Therefore, if
the Elasticsearch data store is ever corrupted for whatever reason, you can
simply reindex everything from scratch.
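For a single project this can be done from a Rails console; the sketch below
is illustrative only and assumes the EE service
`Elastic::ProcessInitialBookkeepingService` is available. For a whole
instance, the documented `gitlab:elastic:index` Rake task is the supported
path.

```ruby
# Rebuild one project's Elasticsearch documents from the primary data
# stores (PostgreSQL and Gitaly).
project = Project.find(42) # example ID

# Queues the project record, its database-backed documents (issues, notes,
# and so on), and its repository for indexing from scratch.
::Elastic::ProcessInitialBookkeepingService.backfill_projects!(project)
```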
@@ -38,9 +38,11 @@ module Gitlab
         return update_index_status(Gitlab::Git::BLANK_SHA) unless commit
 
         repository.__elasticsearch__.elastic_writing_targets.each do |target|
-          Sidekiq.logger.debug(message: "Indexation running for #{project.id} #{from_sha}..#{commit.sha}",
+          logger.debug(message: "indexing_commit_range",
                        project_id: project.id,
-                       wiki: index_wiki?)
+                       from_sha: from_sha,
+                       to_sha: commit.sha,
+                       index_wiki: index_wiki?)
           run_indexer!(commit.sha, target)
         end
@@ -178,10 +180,10 @@ module Gitlab
       # rubocop: disable CodeReuse/ActiveRecord
       def update_index_status(to_sha)
         unless Project.exists?(id: project.id)
-          Gitlab::Elasticsearch::Logger.build.debug(
+          logger.debug(
             message: 'Index status could not be updated as the project does not exist',
             project_id: project.id,
-            wiki: index_wiki?
+            index_wiki: index_wiki?
           )
           return false
         end
@@ -204,6 +206,10 @@ module Gitlab
         project.reload_index_status
       end
       # rubocop: enable CodeReuse/ActiveRecord
+
+      def logger
+        @logger ||= ::Gitlab::Elasticsearch::Logger.build
+      end
     end
   end
 end
@@ -396,13 +396,14 @@ RSpec.describe Gitlab::Elastic::Indexer do
       before do
         allow(Gitlab::Elasticsearch::Logger).to receive(:build).and_return(logger_double)
         allow(indexer).to receive(:run_indexer!) { Project.where(id: project.id).delete_all }
+        allow(logger_double).to receive(:debug)
       end
 
       it 'does not raises an exception and prints log message' do
         expect(logger_double).to receive(:debug).with(
           message: 'Index status could not be updated as the project does not exist',
           project_id: project.id,
-          wiki: false
+          index_wiki: false
         )
         expect(IndexStatus).not_to receive(:safe_find_or_create_by!).with(project_id: project.id)
         expect { indexer.run }.not_to raise_error