Initial elasticsearch development docs

46b81466 · Mario de la Ossa · d87a9904 · 46b81466 · 46b81466
Commit 46b81466 authored Apr 01, 2018 by Mario de la Ossa

Showing with 118 additions and 0 deletions

doc/development/README.md doc/development/README.md +1 -0

doc/development/elasticsearch.md doc/development/elasticsearch.md +117 -0

--- a/doc/development/README.md
+++ b/doc/development/README.md
@@ -43,6 +43,7 @@ comments: false
 - [Issue and merge requests state models](object_state_models.md)
 - [How to dump production data to staging](db_dump.md)
 - [Working with the GitHub importer](github_importer.md)
+- [Elasticsearch integration docs](elasticsearch.md)

 ## Performance guides


--- a/doc/development/elasticsearch.md
+++ b/doc/development/elasticsearch.md
+# Elasticsearch knowledge
+
+This page collects useful information for working with Elasticsearch.
+
+Information on how to enable Elasticsearch and perform the initial indexing is kept in https://docs.gitlab.com/ee/integration/elasticsearch.html#enabling-elasticsearch
+
+## Initial installation on OS X
+
+It is recommended to use the Docker image. After installing Docker, you can immediately spin up an instance with
+
+```
+docker run --name elastic55 -p 9200:9200 -p 9300:9300 -e "discovery.type=single-node" docker.elastic.co/elasticsearch/elasticsearch:5.5.3
+```
+
+Use `docker stop elastic55` and `docker start elastic55` to stop and start it.
+
+### Installing on the host
+
+We currently only support Elasticsearch [up to 5.5](https://docs.gitlab.com/ee/integration/elasticsearch.html#requirements), but `brew` only has Elasticsearch 6, 5.6, and 2.4 available. While 2.4 would work, you probably want to test against the latest version we support.
+
+In order to install 5.5.2, you would usually have to hunt down an old homebrew-core commit that contains the recipe for it. We've already done the work for you. Simply run:
+
+```
+brew install https://raw.githubusercontent.com/Homebrew/homebrew-core/f1a767645f61112762f05e68a610d89b161faa99/Formula/elasticsearch.rb
+```
+
+There is no need to install any plugins.
+
+## Experimental indexer
+
+If you're interested in working with the experimental indexer, all you need to do is:
+
+- `git clone git@gitlab.com:gitlab-org/gitlab-elasticsearch-indexer.git`
+- `make`
+- `make install`
+
+This adds `gitlab-elasticsearch-indexer` to `$GOPATH/bin`; please make sure that directory is in your `$PATH`. After that, GitLab will find it and you'll be able to enable it in the admin settings area.
+
+**Note:** `make` will not recompile the executable unless you run `make clean` beforehand.
+
+## How does it work?
+
+The Elasticsearch integration depends on an external indexer. We ship a [Ruby indexer](https://gitlab.com/gitlab-org/gitlab-ee/blob/master/bin/elastic_repo_indexer) by default but are also working on an [indexer written in Go](https://gitlab.com/gitlab-org/gitlab-elasticsearch-indexer). The user must trigger the initial indexing via a Rake task. After that, GitLab itself triggers reindexing when required via `after_` callbacks on create, update, and destroy that are inherited from [/ee/app/models/concerns/elastic/application_search.rb](https://gitlab.com/gitlab-org/gitlab-ee/blob/master/ee/app/models/concerns/elastic/application_search.rb).
+
+All indexing after the initial one is done via `ElasticIndexerWorker` (Sidekiq jobs).
+
+Search queries are generated by the concerns found in [ee/app/models/concerns/elastic](https://gitlab.com/gitlab-org/gitlab-ee/tree/master/ee/app/models/concerns/elastic). These concerns are also in charge of access control and have historically been a source of security bugs, so please pay close attention to them!
+
+## Existing Analyzers/Tokenizers/Filters
+These are all defined in https://gitlab.com/gitlab-org/gitlab-ee/blob/master/ee/lib/elasticsearch/git/model.rb
+
+### Analyzers
+#### `path_analyzer`
+Used when indexing blobs' paths. Uses the `path_tokenizer` and the `lowercase` and `asciifolding` filters.
+
+Please see the `path_tokenizer` explanation below for an example.
+
+#### `sha_analyzer`
+Used in blobs and commits. Uses the `sha_tokenizer` and the `lowercase` and `asciifolding` filters.
+
+Please see the `sha_tokenizer` explanation below for an example.
+
+#### `code_analyzer`
+Used when indexing a blob's filename and content. Uses the `whitespace` tokenizer and the filters `code`, `edgeNGram_filter`, `lowercase`, and `asciifolding`.
+
+The `whitespace` tokenizer was selected in order to have more control over how tokens are split. For example, the string `Foo::bar(4)` needs to generate tokens like `Foo` and `bar(4)` in order to be properly searched.
+
+Please see the `code` filter for an explanation of how tokens are split.
+
+#### `code_search_analyzer`
+Not directly used for indexing, but rather used to transform a search input. Uses the `whitespace` tokenizer and the `lowercase` and `asciifolding` filters.
+
+### Tokenizers
+#### `sha_tokenizer`
+This is a custom tokenizer that uses the [`edgeNGram` tokenizer](https://www.elastic.co/guide/en/elasticsearch/reference/5.5/analysis-edgengram-tokenizer.html) to allow SHAs to be searchable by any prefix of at least 5 characters.
+
+Example:
+
+`240c29dc7e` becomes:
+- `240c2`
+- `240c29`
+- `240c29d`
+- `240c29dc`
+- `240c29dc7`
+- `240c29dc7e`
+
+#### `path_tokenizer`
+This is a custom tokenizer that uses the [`path_hierarchy` tokenizer](https://www.elastic.co/guide/en/elasticsearch/reference/5.5/analysis-pathhierarchy-tokenizer.html) with `reverse: true` in order to allow searches to find paths no matter how much or how little of the path is given as input.
+
+Example:
+
+`'/some/path/application.js'` becomes:
+- `'/some/path/application.js'`
+- `'some/path/application.js'`
+- `'path/application.js'`
+- `'application.js'`
+
+### Filters
+#### `code`
+Uses a [Pattern Capture token filter](https://www.elastic.co/guide/en/elasticsearch/reference/5.5/analysis-pattern-capture-tokenfilter.html) to split tokens into more easily searched versions of themselves.
+
+Patterns:
+- `"(\\p{Ll}+|\\p{Lu}\\p{Ll}+|\\p{Lu}+)"`: captures CamelCased and lowerCamelCased strings as separate tokens
+- `"(\\d+)"`: extracts digits
+- `"(?=([\\p{Lu}]+[\\p{L}]+))"`: captures CamelCased strings recursively. Ex: `ThisIsATest` => `[ThisIsATest, IsATest, ATest, Test]`
+- `'"((?:\\"|[^"]|\\")*)"'`: captures terms inside quotes, removing the quotes
+- `"'((?:\\'|[^']|\\')*)'"`: same as above, for single-quotes
+- `'\.([^.]+)(?=\.|\s|\Z)'`: separate terms with periods in-between
+- `'\/?([^\/]+)(?=\/|\b)'`: separate path terms `like/this/one`
+
+#### `edgeNGram_filter`
+Uses an [Edge NGram token filter](https://www.elastic.co/guide/en/elasticsearch/reference/5.5/analysis-edgengram-tokenfilter.html) to allow inputs with only the beginning of a token to find the token. For example, it would turn `glasses` into prefixes starting with `gl` and ending with `glasses`, which would allow a search for "`glass`" to find the original token `glasses`.
+
+## Gotchas
+
+- Searches can have their own analyzers. Remember to check the search analyzers when editing the indexing analyzers.
+- `Character` filters (as opposed to token filters) always replace the original character, so they're not a good choice as they can hinder exact searches.
\ No newline at end of file
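
The tokenizer and filter examples in `doc/development/elasticsearch.md` above can be reproduced outside Elasticsearch with a small plain-Ruby sketch. It only approximates the behaviour the doc describes. The helper names `edge_ngrams`, `reverse_path_tokens`, and `camel_case_suffixes` are hypothetical. The minimum gram length of 2 for `glasses` is inferred from the doc's example rather than read from `model.rb`. Ruby's regex engine stands in for the Java one that Elasticsearch uses.

```ruby
# Plain-Ruby approximations of the analysis steps described in the doc.
# All helper names are hypothetical; none of this is GitLab or Elasticsearch code.

# Edge n-grams: every prefix of `text` between `min` and `max` characters long.
def edge_ngrams(text, min:, max: text.length)
  upper = [max, text.length].min
  (min..upper).map { |len| text[0, len] }
end

# Reverse path hierarchy: the full path first, then the same path with
# leading segments dropped one at a time.
def reverse_path_tokens(path)
  tokens = [path]
  rest = path
  while (idx = rest.index('/'))
    rest = rest[(idx + 1)..-1]
    tokens << rest unless rest.empty?
  end
  tokens
end

# Recursive CamelCase capture, using the lookahead pattern from the `code`
# filter: each position where an uppercase run starts yields one token.
def camel_case_suffixes(text)
  text.scan(/(?=(\p{Lu}+\p{L}+))/).flatten
end

p edge_ngrams('240c29dc7e', min: 5)
# => ["240c2", "240c29", "240c29d", "240c29dc", "240c29dc7", "240c29dc7e"]
p reverse_path_tokens('/some/path/application.js')
# => ["/some/path/application.js", "some/path/application.js", "path/application.js", "application.js"]
p camel_case_suffixes('ThisIsATest')
# => ["ThisIsATest", "IsATest", "ATest", "Test"]
p edge_ngrams('glasses', min: 2).include?('glass')
# => true
```

Running the sketch next to a change to the analyzers is a quick way to sanity-check which tokens a given input is expected to produce. The real token stream should still be confirmed against a running Elasticsearch instance.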