Commit a0a14f6f authored by Dylan Griffith

Include letters, numbers & underscore always as Elasticsearch token

There are various other regexes here that are trying to capture tokens
in different contexts but at the very least we should also always be
greedily capturing a series of letters, numbers and underscores.
It's OK if this is already covered in some cases by another regex since
we de-duplicate tokens anyway.

The test included in this change is an example of a token we don't
capture correctly today. It is a common pattern in Ruby, so we should
cover it.
parent 7f13e0bc
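
The gap described in the commit message can be reproduced with a rough Ruby approximation of the token capture. This is only a sketch: it assumes each pattern is applied to a whitespace-delimited token and that every capture group match becomes a token, and it uses Ruby's regex engine rather than the one Elasticsearch runs. `capture_tokens` is a hypothetical helper for illustration, not part of the GitLab codebase.

```ruby
# Patterns copied from the diff; OLD_PATTERNS is the list before this commit.
OLD_PATTERNS = [
  '"((?:\\"|[^"]|\\")*)"',  # terms inside double quotes
  "'((?:\\'|[^']|\\')*)'",  # terms inside single quotes
  '\.([^.]+)(?=\.|\s|\Z)',  # terms separated by periods
  '([\p{L}_.-]+)'           # file-name-like runs (no digits)
].freeze

# The pattern added by this commit: letters, numbers and underscores.
NEW_PATTERNS = (OLD_PATTERNS + ['([\p{L}\d_]+)']).freeze

# Hypothetical helper (an assumption, not GitLab code): collect every
# capture group match for every pattern, then de-duplicate the result.
def capture_tokens(token, patterns)
  patterns.flat_map { |pattern| token.scan(Regexp.new(pattern)).flatten }.uniq
end

token = 'RubyClassInvoking.ruby_method_call(with_arg)'

old_tokens = capture_tokens(token, OLD_PATTERNS)
new_tokens = capture_tokens(token, NEW_PATTERNS)

puts old_tokens.include?('ruby_method_call') # false: only longer runs are captured
puts new_tokens.include?('ruby_method_call') # true: the new pattern isolates it
```

Under these assumptions the old patterns only yield `ruby_method_call(with_arg)`, `RubyClassInvoking.ruby_method_call` and `with_arg`, so a search for `ruby_method_call` alone has nothing to match. The new pattern also yields `with_arg` a second time, which the `uniq` call drops, mirroring the de-duplication the commit message relies on.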
......@@ -111,7 +111,8 @@ Patterns:
- `'"((?:\\"|[^"]|\\")*)"'`: captures terms inside quotes, removing the quotes
- `"'((?:\\'|[^']|\\')*)'"`: same as above, for single-quotes
- `'\.([^.]+)(?=\.|\s|\Z)'`: separate terms with periods in-between
-- `'([\p{L}_.-]+)'` : some common chars in file names to keep the whole filename intact (eg. `my_file-ñame.txt`)
+- `'([\p{L}_.-]+)'`: some common chars in file names to keep the whole filename intact (eg. `my_file-ñame.txt`)
+- `'([\p{L}\d_]+)'`: letters, numbers and underscores are the most common tokens in programming. Always capture them greedily regardless of context.
## Gotchas
......
---
title: Allow searching word tokens with letters, numbers and underscores in advanced global search
merge_request: 36255
author:
type: changed
......@@ -61,7 +61,8 @@ module Elastic
'"((?:\\"|[^"]|\\")*)"', # capture terms inside quotes, removing the quotes
"'((?:\\'|[^']|\\')*)'", # same as above, for single quotes
'\.([^.]+)(?=\.|\s|\Z)', # separate terms on periods
-'([\p{L}_.-]+)' # some common chars in file names to keep the whole filename intact (eg. my_file-name.txt)
+'([\p{L}_.-]+)', # some common chars in file names to keep the whole filename intact (eg. my_file-name.txt)
+'([\p{L}\d_]+)' # letters, numbers and underscores are the most common tokens in programming. Always capture them greedily regardless of context.
]
}
},
......
......@@ -637,6 +637,12 @@ RSpec.describe Gitlab::Elastic::SearchResults, :elastic, :sidekiq_might_not_need
ParenthesesBetweenTokens)tokenAfterParentheses
a.b.c=missing_token_around_equals
+def self.ruby_method_name(ruby_method_arg)
+RubyClassInvoking.ruby_method_call(with_arg)
+def self.ruby_method_123(ruby_another_method_arg)
+RubyClassInvoking.ruby_call_method_123(with_arg)
FILE
end
let(:file_name) { 'elastic_specialchars_test.md' }
......@@ -703,6 +709,22 @@ RSpec.describe Gitlab::Elastic::SearchResults, :elastic, :sidekiq_might_not_need
it 'finds a token after = without a space' do
expect(search_for('missing_token_around_equals')).to include(file_name)
end
+it 'finds a ruby method name even if preceded with dot' do
+expect(search_for('ruby_method_name')).to include(file_name)
+end
+it 'finds a ruby method name with numbers' do
+expect(search_for('ruby_method_123')).to include(file_name)
+end
+it 'finds a ruby method call even if preceded with dot' do
+expect(search_for('ruby_method_call')).to include(file_name)
+end
+it 'finds a ruby method call with numbers' do
+expect(search_for('ruby_call_method_123')).to include(file_name)
+end
end
end
......