Skip to content
Projects
Groups
Snippets
Help
Loading...
Help
Support
Keyboard shortcuts
?
Submit feedback
Contribute to GitLab
Sign in / Register
Toggle navigation
G
gitlab-ce
Project overview
Project overview
Details
Activity
Releases
Repository
Repository
Files
Commits
Branches
Tags
Contributors
Graph
Compare
Issues
0
Issues
0
List
Boards
Labels
Milestones
Merge Requests
1
Merge Requests
1
Analytics
Analytics
Repository
Value Stream
Wiki
Wiki
Snippets
Snippets
Members
Members
Collapse sidebar
Close sidebar
Activity
Graph
Create a new issue
Commits
Issue Boards
Open sidebar
nexedi
gitlab-ce
Commits
db077106
Commit
db077106
authored
Jul 26, 2019
by
GitLab Bot
Browse files
Options
Browse Files
Download
Plain Diff
Automatic merge of gitlab-org/gitlab-ce master
parents
5b3e3bb7
0d2b5bff
Changes
2
Hide whitespace changes
Inline
Side-by-side
Showing
2 changed files
with
170 additions
and
0 deletions
+170
-0
doc/development/README.md
doc/development/README.md
+4
-0
doc/development/filtering_by_label.md
doc/development/filtering_by_label.md
+166
-0
No files found.
doc/development/README.md
View file @
db077106
...
...
@@ -113,6 +113,10 @@ description: 'Learn how to contribute to GitLab.'
-
[
Database helper modules
](
database_helpers.md
)
-
[
Code comments
](
code_comments.md
)
## Case studies
-
[
Database case study: Filtering by label
](
filtering_by_label.md
)
## Integration guides
-
[
Jira Connect app
](
integrations/jira_connect.md
)
...
...
doc/development/filtering_by_label.md
0 → 100644
View file @
db077106
# Filtering by label
## Introduction
GitLab has
[
labels
](
../user/project/labels.md
)
that can be assigned to issues,
merge requests, and epics. Labels on those objects are a many-to-many relation
through the polymorphic
`label_links`
table.
To filter these objects by multiple labels - for instance, 'all open
issues with the label ~Plan and the label ~backend' - we generate a
query containing a
`GROUP BY`
clause. In a simple form, this looks like:
```
sql
SELECT
issues
.
*
FROM
issues
INNER
JOIN
label_links
ON
label_links
.
target_id
=
issues
.
id
AND
label_links
.
target_type
=
'Issue'
INNER
JOIN
labels
ON
labels
.
id
=
label_links
.
label_id
WHERE
issues
.
project_id
=
13083
AND
(
issues
.
state
IN
(
'opened'
))
AND
labels
.
title
IN
(
'Plan'
,
'backend'
)
GROUP
BY
issues
.
id
HAVING
(
COUNT
(
DISTINCT
labels
.
title
)
=
2
)
ORDER
BY
issues
.
updated_at
DESC
,
issues
.
id
DESC
LIMIT
20
OFFSET
0
```
In particular, note that:
1.
We
`GROUP BY issues.id`
so that we can ...
2.
Use the
`HAVING (COUNT(DISTINCT labels.title) = 2)`
condition to ensure that
all matched issues have both labels.
This is more complicated than is ideal. It makes the query construction more
prone to errors (such as
[
gitlab-org/gitlab-ce#15557
](
https://gitlab.com/gitlab-org/gitlab-ce/issues/15557
)
).
## Attempt A: WHERE EXISTS
### Attempt A1: use multiple subqueries with WHERE EXISTS
In
[
gitlab-org/gitlab-ce#37137
](
https://gitlab.com/gitlab-org/gitlab-ce/issues/37137
)
and its associated merge request
[
gitlab-org/gitlab-ce!14022
](
https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/14022
)
,
we tried to replace the
`GROUP BY`
with multiple uses of
`WHERE EXISTS`
. For the
example above, this would give:
```
sql
WHERE
(
EXISTS
(
SELECT
TRUE
FROM
label_links
INNER
JOIN
labels
ON
labels
.
id
=
label_links
.
label_id
WHERE
labels
.
title
=
'Plan'
AND
target_type
=
'Issue'
AND
target_id
=
issues
.
id
))
AND
(
EXISTS
(
SELECT
TRUE
FROM
label_links
INNER
JOIN
labels
ON
labels
.
id
=
label_links
.
label_id
WHERE
labels
.
title
=
'backend'
AND
target_type
=
'Issue'
AND
target_id
=
issues
.
id
))
```
While this worked without schema changes, and did improve readability somewhat,
it did not improve query performance.
## Attempt B: Denormalize using an array column
Having
[
removed MySQL support in GitLab
12.1
](
https://about.gitlab.com/2019/06/27/removing-mysql-support/
)
, using
[
Postgres's arrays
](
https://www.postgresql.org/docs/9.6/arrays.html
)
became more
tractable as we didn't have to support two databases. We discussed denormalizing
the
`label_links`
table for querying in
[
gitlab-org/gitlab-ce#49651
](
https://gitlab.com/gitlab-org/gitlab-ce/issues/49651
)
,
with two options: label IDs and titles.
We can think of both of those as array columns on
`issues`
,
`merge_requests`
,
and
`epics`
:
`issues.label_ids`
would be an array column of label IDs, and
`issues.label_titles`
would be an array of label titles.
These array columns can be complemented with
[
GIN
indexes
](
https://www.postgresql.org/docs/9.6/gin-intro.html
)
to improve
matching.
### Attempt B1: store label IDs for each object
This has some strong advantages over titles:
1.
Unless a label is deleted, or a project is moved, we never need to
bulk-update the denormalized column.
2.
It uses less storage than the titles.
Unfortunately, our application design makes this hard. If we were able to query
just by label ID easily, we wouldn't need the
`INNER JOIN labels`
in the initial
query at the start of this document. GitLab allows users to filter by label
title across projects and even across groups, so a filter by the label ~Plan may
include labels with multiple distinct IDs.
We do not want users to have to know about the different IDs, which means that
given this data set:
| Project | ~Plan label ID | ~backend label ID |
| --- | --- | --- |
| A | 11 | 12 |
| B | 21 | 22 |
| C | 31 | 32 |
We would need something like:
```
sql
WHERE
label_ids
@>
ARRAY
[
11
,
12
]
OR
label_ids
@>
ARRAY
[
21
,
22
]
OR
label_ids
@>
ARRAY
[
31
,
32
]
```
This can get even more complicated when we consider that in some cases, there
might be two ~backend labels - with different IDs - that could apply to the same
object, so the number of combinations would balloon further.
### Attempt B2: store label titles for each object
From the perspective of updating the labelable object, this is the worst
option. We have to bulk update the objects when:
1.
The objects are moved from one project to another.
1.
The project is moved from one group to another.
1.
The label is renamed.
1.
The label is deleted.
It also uses much more storage. Querying is simple, though:
```
sql
WHERE
label_titles
@>
ARRAY
[
'Plan'
,
'backend'
]
```
And our
[
tests in
gitlab-org/gitlab-ce#49651
](
https://gitlab.com/gitlab-org/gitlab-ce/issues/49651#note_188777346
)
showed that this could be fast.
However, at present, the disadvantages outweigh the advantages.
## Conclusion
We have yet to find a method that is demonstratably better than the current
method, when considering:
1.
Query performance.
1.
Readability.
1.
Ease of maintaining schema consistency.
Write
Preview
Markdown
is supported
0%
Try again
or
attach a new file
Attach a file
Cancel
You are about to add
0
people
to the discussion. Proceed with caution.
Finish editing this message first!
Cancel
Please
register
or
sign in
to comment