Commit 5e468ced authored and committed by Markus Koller

Respect limited indexing settings in rake tasks

Some of the rake tasks were not respecting the new limited indexing
settings for Elasticsearch, so this refactors them to use
IndexRecordService through ElasticIndexerWorker. As part of this change
we're also queuing a Sidekiq job for each individual project now
instead of processing them in batches in each job, and we're always
processing them asynchronously.

- The index_repositories, index_repositories_async, index_database and
  index_$MODEL tasks were replaced with a single index_projects task,
  which indexes projects and all their associated records and their
  repositories

- The BATCH environment variable was removed because it's not useful
  anymore, since everything gets queued in Sidekiq anyway
parent 7e5e9f9c
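As a rough illustration of the change described in the commit message, the sketch below builds one `[:index, 'Project', id, nil]` argument list per project and hands them to a worker in bulk, skipping projects outside the limited-indexing set. It is plain Ruby, not the code from this commit: `FakeIndexerWorker`, `enqueue_projects`, `allowed_ids`, and `batch_size` are hypothetical names standing in for `ElasticIndexerWorker.bulk_perform_async`, the `project_id_batches` helper, and the ActiveRecord query used in the real rake task.

```ruby
# Hypothetical stand-in for ElasticIndexerWorker: it only records the
# argument lists it receives, so the sketch runs without Sidekiq or Rails.
class FakeIndexerWorker
  def self.queued
    @queued ||= []
  end

  def self.bulk_perform_async(args_list)
    queued.concat(args_list)
  end
end

# Queues one job per project ID, one bulk call per batch of IDs.
# `allowed_ids` is nil when limited indexing is disabled (assumption: the
# real task filters with a database query instead of an in-memory list).
def enqueue_projects(project_ids, allowed_ids: nil, batch_size: 1000, worker: FakeIndexerWorker)
  ids = allowed_ids ? project_ids.select { |id| allowed_ids.include?(id) } : project_ids

  ids.sort.each_slice(batch_size) do |batch|
    # The last element (es_id) is unused for :index, as noted in the diff.
    args = batch.map { |id| [:index, 'Project', id, nil] }
    worker.bulk_perform_async(args)
  end
end

enqueue_projects([1, 2, 3], allowed_ids: [1, 3])
p FakeIndexerWorker.queued
```

Batching here only groups the enqueue calls; each project still ends up as its own job, which matches the commit message's point that the `BATCH` variable is no longer useful.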
......@@ -192,9 +192,6 @@ Performing asynchronous indexing, as this will describe, will generate a lot of
Make sure to prepare for this task by either [Horizontally Scaling](../administration/high_availability/README.md#basic-scaling)
or creating [extra sidekiq processes](../administration/operations/extra_sidekiq_processes.md)
NOTE: **Note**:
After indexing the repositories asynchronously, you **MUST** index the database to be able to search.
Configure Elasticsearch's host and port in **Admin > Settings > Integrations**. Then create empty indexes using one of the following commands:
```sh
......@@ -217,78 +214,49 @@ curl --request PUT localhost:9200/gitlab-production/_settings --data '{
} }'
```
Then enable Elasticsearch indexing and run repository indexing tasks:
Then enable Elasticsearch indexing and run project indexing tasks:
```sh
# Omnibus installations
sudo gitlab-rake gitlab:elastic:index_repositories_async
sudo gitlab-rake gitlab:elastic:index_projects
# Installations from source
bundle exec rake gitlab:elastic:index_repositories_async RAILS_ENV=production
bundle exec rake gitlab:elastic:index_projects RAILS_ENV=production
```
This enqueues a number of Sidekiq jobs to index your existing repositories.
You can view the jobs in the admin panel (they are placed in the `elastic_batch_project_indexer`)
This enqueues a Sidekiq job for each project that needs to be indexed.
You can view the jobs in the admin panel (they are placed in the `elastic_indexer`
queue), or you can query indexing status using a rake task:
```sh
# Omnibus installations
sudo gitlab-rake gitlab:elastic:index_repositories_status
sudo gitlab-rake gitlab:elastic:index_projects_status
# Installations from source
bundle exec rake gitlab:elastic:index_repositories_status RAILS_ENV=production
bundle exec rake gitlab:elastic:index_projects_status RAILS_ENV=production
Indexing is 65.55% complete (6555/10000 projects)
```
By default, one job is created for every 300 projects. For large numbers of
projects, you may wish to increase the batch size, by setting the `BATCH`
environment variable.
You can also run the initial indexing synchronously - this is most useful if
you have a small number of projects or need finer-grained control over indexing
than Sidekiq permits:
If you want to limit the index to a range of projects you can provide the
`ID_FROM` and `ID_TO` parameters:
```sh
# Omnibus installations
sudo gitlab-rake gitlab:elastic:index_repositories
sudo gitlab-rake gitlab:elastic:index_projects ID_FROM=1001 ID_TO=2000
# Installations from source
bundle exec rake gitlab:elastic:index_repositories RAILS_ENV=production
```
It might take a while depending on how big your Git repositories are.
If you want to run several tasks in parallel (probably in separate terminal
windows) you can provide the `ID_FROM` and `ID_TO` parameters:
```sh
# Omnibus installations
sudo gitlab-rake gitlab:elastic:index_repositories ID_FROM=1001 ID_TO=2000
# Installations from source
bundle exec rake gitlab:elastic:index_repositories ID_FROM=1001 ID_TO=2000 RAILS_ENV=production
bundle exec rake gitlab:elastic:index_projects ID_FROM=1001 ID_TO=2000 RAILS_ENV=production
```
Where `ID_FROM` and `ID_TO` are project IDs. Both parameters are optional.
As an example, if you have 3,000 repositories and you want to run three separate indexing tasks, you might run:
The above examples will index all projects starting with ID `1001` up to (and including) ID `2000`.
```sh
# Omnibus installations
sudo gitlab-rake gitlab:elastic:index_repositories ID_TO=1000
sudo gitlab-rake gitlab:elastic:index_repositories ID_FROM=1001 ID_TO=2000
sudo gitlab-rake gitlab:elastic:index_repositories ID_FROM=2001
# Installations from source
bundle exec rake gitlab:elastic:index_repositories RAILS_ENV=production ID_TO=1000
bundle exec rake gitlab:elastic:index_repositories RAILS_ENV=production ID_FROM=1001 ID_TO=2000
bundle exec rake gitlab:elastic:index_repositories RAILS_ENV=production ID_FROM=2001
```
Sometimes your repository index process `gitlab:elastic:index_repositories` or
`gitlab:elastic:index_repositories_async` can get interrupted. This may happen
for many reasons, but it's always safe to run the indexing job again - it will
skip those repositories that have already been indexed.
TIP: **Troubleshooting:**
Sometimes the project indexing jobs queued by `gitlab:elastic:index_projects`
can get interrupted. This may happen for many reasons, but it's always safe
to run the indexing task again - it will skip those repositories that have
already been indexed.
As the indexer stores the last commit SHA of every indexed repository in the
database, you can run the indexer with the special parameter `UPDATE_INDEX` and
......@@ -297,10 +265,10 @@ that repository is indexed, it can be useful in case if your index is outdated:
```sh
# Omnibus installations
sudo gitlab-rake gitlab:elastic:index_repositories UPDATE_INDEX=true ID_TO=1000
sudo gitlab-rake gitlab:elastic:index_projects UPDATE_INDEX=true ID_TO=1000
# Installations from source
bundle exec rake gitlab:elastic:index_repositories UPDATE_INDEX=true ID_TO=1000 RAILS_ENV=production
bundle exec rake gitlab:elastic:index_projects UPDATE_INDEX=true ID_TO=1000 RAILS_ENV=production
```
You can also use the `gitlab:elastic:clear_index_status` Rake task to force the
......@@ -320,16 +288,6 @@ bundle exec rake gitlab:elastic:index_wikis RAILS_ENV=production
The wiki indexer also supports the `ID_FROM` and `ID_TO` parameters if you want
to limit a project set.
Index all database entities (Keep in mind it can take a while, so consider using `screen` or `tmux`):
```sh
# Omnibus installations
sudo gitlab-rake gitlab:elastic:index_database
# Installations from source
bundle exec rake gitlab:elastic:index_database RAILS_ENV=production
```
Enable replication and refreshing again after indexing (only if you previously disabled it):
```bash
......@@ -376,25 +334,15 @@ There are several rake tasks available to you via the command line:
- This is a wrapper task. It does the following:
- `sudo gitlab-rake gitlab:elastic:create_empty_index`
- `sudo gitlab-rake gitlab:elastic:clear_index_status`
- `sudo gitlab-rake gitlab:elastic:index_projects`
- `sudo gitlab-rake gitlab:elastic:index_wikis`
- `sudo gitlab-rake gitlab:elastic:index_database`
- `sudo gitlab-rake gitlab:elastic:index_repositories`
- [sudo gitlab-rake gitlab:elastic:index_repositories_async](https://gitlab.com/gitlab-org/gitlab-ee/blob/master/ee/lib/tasks/gitlab/elastic.rake)
- This iterates over all projects and places them in batches. It then sends these batches to the background via sidekiq jobs to be indexed.
- [sudo gitlab-rake gitlab:elastic:index_repositories_status](https://gitlab.com/gitlab-org/gitlab-ee/blob/master/ee/lib/tasks/gitlab/elastic.rake)
- `sudo gitlab-rake gitlab:elastic:index_snippets`
- [sudo gitlab-rake gitlab:elastic:index_projects](https://gitlab.com/gitlab-org/gitlab-ee/blob/master/ee/lib/tasks/gitlab/elastic.rake)
- This iterates over all projects and queues sidekiq jobs to index them in the background.
- [sudo gitlab-rake gitlab:elastic:index_projects_status](https://gitlab.com/gitlab-org/gitlab-ee/blob/master/ee/lib/tasks/gitlab/elastic.rake)
- This determines the overall status of the indexing. It is done by counting the total number of indexed projects, dividing by a count of the total number of projects, then multiplying by 100.
- [sudo gitlab-rake gitlab:elastic:index_repositories](https://gitlab.com/gitlab-org/gitlab-ee/blob/master/ee/lib/tasks/gitlab/elastic.rake)
- This iterates over all projects and places them in batches. It then performs indexing on said batches synchronously.
- [sudo gitlab-rake gitlab:elastic:index_wikis](https://gitlab.com/gitlab-org/gitlab-ee/blob/master/ee/lib/tasks/gitlab/elastic.rake)
- Iterates over every project, determines if said project contains wiki data, and then indexes the blobs (content) of said wiki data.
- [sudo gitlab-rake gitlab:elastic:index_database](https://gitlab.com/gitlab-org/gitlab-ee/blob/master/ee/lib/tasks/gitlab/elastic.rake)
- This is a [rake multitask](https://www.rubydoc.info/github/ruby/rake/Rake/MultiTask). It does the following:
- `sudo gitlab-rake gitlab:elastic:index_projects`
- `sudo gitlab-rake gitlab:elastic:index_issues`
- `sudo gitlab-rake gitlab:elastic:index_merge_requests`
- `sudo gitlab-rake gitlab:elastic:index_snippets`
- `sudo gitlab-rake gitlab:elastic:index_notes`
- `sudo gitlab-rake gitlab:elastic:index_milestones`
- [sudo gitlab-rake gitlab:elastic:create_empty_index](https://gitlab.com/gitlab-org/gitlab-ee/blob/master/ee/lib/tasks/gitlab/elastic.rake)
- This generates an empty index on the Elasticsearch side.
- [sudo gitlab-rake gitlab:elastic:clear_index_status](https://gitlab.com/gitlab-org/gitlab-ee/blob/master/ee/lib/tasks/gitlab/elastic.rake)
......@@ -405,18 +353,8 @@ There are several rake tasks available to you via the command line:
- Does the same thing as `sudo gitlab-rake gitlab:elastic:create_empty_index`
- [sudo gitlab-rake gitlab:elastic:add_feature_visibility_levels_to_project](https://gitlab.com/gitlab-org/gitlab-ee/blob/master/ee/lib/tasks/gitlab/elastic.rake)
- Adds visibility information to the indices for projects.
- [sudo gitlab-rake gitlab:elastic:index_projects](https://gitlab.com/gitlab-org/gitlab-ee/blob/master/ee/lib/tasks/gitlab/elastic.rake)
- Performs an Elasticsearch import that indexes projects data.
- [sudo gitlab-rake gitlab:elastic:index_issues](https://gitlab.com/gitlab-org/gitlab-ee/blob/master/ee/lib/tasks/gitlab/elastic.rake)
- Performs an Elasticsearch import that indexes issues data.
- [sudo gitlab-rake gitlab:elastic:index_merge_requests](https://gitlab.com/gitlab-org/gitlab-ee/blob/master/ee/lib/tasks/gitlab/elastic.rake)
- Performs an Elasticsearch import that indexes merge requests data.
- [sudo gitlab-rake gitlab:elastic:index_snippets](https://gitlab.com/gitlab-org/gitlab-ee/blob/master/ee/lib/tasks/gitlab/elastic.rake)
- Performs an Elasticsearch import that indexes the snippets data.
- [sudo gitlab-rake gitlab:elastic:index_notes](https://gitlab.com/gitlab-org/gitlab-ee/blob/master/ee/lib/tasks/gitlab/elastic.rake)
- Performs an Elasticsearch import that indexes the notes data.
- [sudo gitlab-rake gitlab:elastic:index_milestones](https://gitlab.com/gitlab-org/gitlab-ee/blob/master/ee/lib/tasks/gitlab/elastic.rake)
- Performs an Elasticsearch import that indexes the milestones data.
### Environment Variables
......@@ -424,40 +362,16 @@ In addition to the rake tasks, there are some environment variables that can be
| Environment Variable | Data Type | What it does |
| -------------------- |:---------:| ---------------------------------------------------------------------------- |
| `BATCH` | Integer | Modifies the size of the indexing batch (default 300 projects). |
| `UPDATE_INDEX` | Boolean | Tells the indexer to overwrite any existing index data (true/false). |
| `ID_TO` | Integer | Tells the indexer to only index projects less than or equal to the value. |
| `ID_FROM` | Integer | Tells the indexer to only index projects greater than or equal to the value. |
### Batching
The ability to apply batching makes the indexer run more efficiently. The default
size of a batch is 300 projects, which may or may not be ideal for your setup.
Depending on the resources available to your GitLab instance (sidekiq) and your
Elasticsearch instance (RAM, CPU), you may be able to increase or decrease the
batch size for more efficiency.
- The larger the batch size is, the less sidekiq jobs and indexing requests get created.
- The larger the batch size is, the more time and RAM it takes to process.
- The smaller the batch size, the more sidekiq jobs, and indexing requests get created.
- The smaller the batch size, the more CPU gets utilized.
Finding the ideal size can be tricky, and will vary from GitLab instance to GitLab instance.
Generally speaking, if the default is not ideal for you, try reducing it to somewhere in
the 50-150 range (for bigger sized repos) or 450-600 range (for many small-sized repos).
Example use:
```sh
sudo gitlab-rake gitlab:elastic:index_repositories_async BATCH=50
```
### Indexing a specific project
Because the `ID_TO` and `ID_FROM` environment variables use the `or equal to` comparison, you can index only one project by using both these variables with the same project ID number:
```sh
root@git:~# sudo gitlab-rake gitlab:elastic:index_repositories ID_TO=5 ID_FROM=5
root@git:~# sudo gitlab-rake gitlab:elastic:index_projects ID_TO=5 ID_FROM=5
Indexing project repositories...I, [2019-03-04T21:27:03.083410 #3384] INFO -- : Indexing GitLab User / test (ID=33)...
I, [2019-03-04T21:27:05.215266 #3384] INFO -- : Indexing GitLab User / test (ID=33) is done!
```
......@@ -554,7 +468,7 @@ Here are some common pitfalls and how to overcome them:
- **The indexing process is taking a very long time**
The more data present in your GitLab instance, the longer the indexing process takes. You might want to try adjusting the BATCH sizes for asynchronous indexing to help speed up the process.
The more data present in your GitLab instance, the longer the indexing process takes.
- **No new data is added to the Elasticsearch index when I push code**
......
......@@ -33,6 +33,8 @@ module Elastic
end
def use_elasticsearch?
# FIXME: check project.use_elasticsearch? for ProjectSnippets?
# see https://gitlab.com/gitlab-org/gitlab-ee/issues/11850
::Gitlab::CurrentSettings.elasticsearch_indexing?
end
......
......@@ -8,6 +8,8 @@ module Elastic
# @param indexing [Boolean] determines whether operation is "indexing" or "updating"
def execute(record, indexing, options = {})
return true unless record.use_elasticsearch?
record.__elasticsearch__.client = client
import(record, record.class.nested?, indexing)
......
---
title: Respect limited indexing settings in rake tasks
merge_request: 13437
author:
type: fixed
......@@ -9,25 +9,29 @@ namespace :gitlab do
Rake::Task["gitlab:elastic:create_empty_index"].invoke
Rake::Task["gitlab:elastic:clear_index_status"].invoke
Rake::Task["gitlab:elastic:index_projects"].invoke
Rake::Task["gitlab:elastic:index_wikis"].invoke
Rake::Task["gitlab:elastic:index_database"].invoke
Rake::Task["gitlab:elastic:index_repositories"].invoke
Rake::Task["gitlab:elastic:index_snippets"].invoke
end
desc "GitLab | Elasticsearch | Index project repositories in the background"
task index_repositories_async: :environment do
print "Enqueuing project repositories in batches of #{batch_size}"
desc "GitLab | Elasticsearch | Index projects in the background"
task index_projects: :environment do
print "Enqueuing projects"
project_id_batches do |start, finish|
ElasticBatchProjectIndexerWorker.perform_async(start, finish)
project_id_batches do |ids|
args = ids.collect do |id|
[:index, 'Project', id, nil] # es_id is unused for :index
end
ElasticIndexerWorker.bulk_perform_async(args)
print "."
end
puts "OK"
end
desc "GitLab | ElasticSearch | Check project repository indexing status"
task index_repositories_status: :environment do
desc "GitLab | ElasticSearch | Check project indexing status"
task index_projects_status: :environment do
indexed = IndexStatus.count
projects = Project.count
percent = (indexed / projects.to_f) * 100.0
......@@ -35,16 +39,6 @@ namespace :gitlab do
puts "Indexing is %.2f%% complete (%d/%d projects)" % [percent, indexed, projects]
end
desc "GitLab | Elasticsearch | Index project repositories"
task index_repositories: :environment do
print "Indexing project repositories..."
Sidekiq::Logging.logger = Logger.new(STDOUT)
project_id_batches do |start, finish|
ElasticBatchProjectIndexerWorker.new.perform(start, finish)
end
end
desc 'GitLab | Elasticsearch | Unlock repositories for indexing in case something gets stuck'
task clear_locked_projects: :environment do
Gitlab::Redis::SharedState.with { |redis| redis.del(:elastic_projects_indexing) }
......@@ -70,34 +64,15 @@ namespace :gitlab do
end
end
INDEXABLE_CLASSES = {
"Project" => "index_projects",
"Issue" => "index_issues",
"MergeRequest" => "index_merge_requests",
"Snippet" => "index_snippets",
"Note" => "index_notes",
"Milestone" => "index_milestones"
}.freeze
INDEXABLE_CLASSES.each do |klass_name, task_name|
task task_name => :environment do
logger = Logger.new(STDOUT)
logger.info("Indexing #{klass_name.pluralize}...")
klass = Kernel.const_get(klass_name)
if klass_name == 'Note'
Note.searchable.es_import
else
klass.es_import
end
desc "GitLab | Elasticsearch | Index all snippets"
task index_snippets: :environment do
logger = Logger.new(STDOUT)
logger.info("Indexing snippets...")
logger.info("Indexing #{klass_name.pluralize}... " + "done".color(:green))
end
end
Snippet.es_import
desc "GitLab | Elasticsearch | Index all database objects"
multitask index_database: INDEXABLE_CLASSES.values
logger.info("Indexing snippets... " + "done".color(:green))
end
desc "GitLab | Elasticsearch | Create empty index"
task create_empty_index: :environment do
......@@ -190,10 +165,6 @@ namespace :gitlab do
end
end
def batch_size
ENV.fetch('BATCH', 300).to_i
end
def project_id_batches(&blk)
relation = Project
......@@ -201,10 +172,14 @@ namespace :gitlab do
relation = relation.includes(:index_status).where('index_statuses.id IS NULL').references(:index_statuses)
end
relation.all.in_batches(of: batch_size, start: ENV['ID_FROM'], finish: ENV['ID_TO']) do |relation| # rubocop: disable Cop/InBatches
if ::Gitlab::CurrentSettings.elasticsearch_limit_indexing?
relation = relation.where(id: ::Gitlab::CurrentSettings.elasticsearch_limited_projects.select(:id))
end
relation.all.in_batches(start: ENV['ID_FROM'], finish: ENV['ID_TO']) do |relation| # rubocop: disable Cop/InBatches
ids = relation.reorder(:id).pluck(:id)
Gitlab::Redis::SharedState.with { |redis| redis.sadd(:elastic_projects_indexing, ids) }
yield ids[0], ids[-1]
yield ids
end
end
......
......@@ -125,4 +125,21 @@ describe Elastic::IndexRecordService, :elastic do
expect(Note.elastic_search('note_2', options: options).present?).to eq(true)
expect(Note.elastic_search('note_3', options: options).present?).to eq(true)
end
it 'skips records for which indexing is disabled' do
project = nil
Sidekiq::Testing.disable! do
project = create :project, name: 'project_1'
end
expect(project).to receive(:use_elasticsearch?).and_return(false)
Sidekiq::Testing.inline! do
subject.execute(project, true)
Gitlab::Elastic::Helper.refresh_index
end
expect(Project.elastic_search('project_1').present?).to eq(false)
end
end
# frozen_string_literal: true
require 'rake_helper'
describe 'gitlab:elastic namespace rake tasks', :elastic, :sidekiq do
before do
Rake.application.rake_require 'tasks/gitlab/elastic'
stub_ee_application_setting(elasticsearch_indexing: true)
end
describe 'index' do
it 'calls all indexing tasks in order' do
expect(Rake::Task['gitlab:elastic:create_empty_index']).to receive(:invoke).ordered
expect(Rake::Task['gitlab:elastic:clear_index_status']).to receive(:invoke).ordered
expect(Rake::Task['gitlab:elastic:index_projects']).to receive(:invoke).ordered
expect(Rake::Task['gitlab:elastic:index_wikis']).to receive(:invoke).ordered
expect(Rake::Task['gitlab:elastic:index_snippets']).to receive(:invoke).ordered
run_rake_task 'gitlab:elastic:index'
end
end
describe 'index_projects' do
let(:project1) { create :project }
let(:project2) { create :project }
let(:project3) { create :project }
before do
Sidekiq::Testing.disable! do
project1
project2
end
end
it 'queues jobs for each project batch' do
expect(ElasticIndexerWorker).to receive(:bulk_perform_async).with([
[:index, 'Project', project1.id, nil],
[:index, 'Project', project2.id, nil]
])
run_rake_task 'gitlab:elastic:index_projects'
end
context 'with limited indexing enabled' do
before do
Sidekiq::Testing.disable! do
project1
project2
project3
create :elasticsearch_indexed_project, project: project1
create :elasticsearch_indexed_namespace, namespace: project3.namespace
end
stub_ee_application_setting(elasticsearch_limit_indexing: true)
end
it 'does not queue jobs for projects that should not be indexed' do
expect(ElasticIndexerWorker).to receive(:bulk_perform_async).with([
[:index, 'Project', project1.id, nil],
[:index, 'Project', project3.id, nil]
])
run_rake_task 'gitlab:elastic:index_projects'
end
end
end
describe 'index_snippets' do
it 'indexes snippets' do
expect(Snippet).to receive(:es_import)
run_rake_task 'gitlab:elastic:index_snippets'
end
end
end
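The percentage reported by the `index_projects_status` task in the diff above can be reproduced with a small standalone sketch. The method name `indexing_status` is hypothetical, and the guard for zero projects is an added assumption; the task in the diff divides without such a check.

```ruby
# Formats the indexing progress the same way the rake task's output does.
# `indexed` and `projects` stand in for IndexStatus.count and Project.count.
def indexing_status(indexed, projects)
  # Assumption: report 0% when there are no projects, to avoid dividing by zero.
  percent = projects.zero? ? 0.0 : (indexed / projects.to_f) * 100.0
  format("Indexing is %.2f%% complete (%d/%d projects)", percent, indexed, projects)
end

puts indexing_status(6555, 10000)
```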