lib/gitlab/pagination/keyset/order.rb · 74729c56b38a062e333ffe113f9a0551dbe477c4 · nexedi / gitlab-ce

Rebalance issues relative position without transaction · 8834fda8

Alexandru Croitor authored Aug 23, 2021

Rebalancing issues in a long-running transaction with a lock
retry generates subtransactions that lead to overall DB
performance degradation.

See https://gitlab.com/gitlab-org/gitlab/-/issues/338346

So we are moving issue rebalancing out of one single big
transaction and instead locking repositioning within a project or
namespace for the duration of the rebalance.

Rebalancing is a long-running job: it can take multiple hours
to finish a rebalance of a namespace with a large number of issues.
This change introduces a simple mechanism for resuming a rebalance
from a checkpoint.

- A limited number of rebalancing jobs are allowed to run,
5 at this point
- Before starting a rebalance the number of running rebalances
is checked.
- If the limit of running rebalances is met we store the first
project id from the list of projects to be rebalanced and use that
to determine how many rebalances are running.
- We load all namespace issue ids into a Redis sorted set, using
each issue's current relative position as the score. This avoids
a very slow SQL query that would otherwise require ordering: we
can load the issue ids in batches and get the sorting for free
from Redis. It also allows us to resume issue loading after a
failure, starting from the project we read last time.
- Because we are no longer in a DB transaction and we want to
preserve the relative position of the issues after the rebalance,
we need to disable issue repositioning in the namespace
while rebalancing.
- Once all the issue ids are loaded, the positions are updated
in batches by reading the issues in sorted order and computing
the new positions based on the number of issues in the
namespace, distributed equally.
- Every successful update stores a checkpoint from which the
next update can be picked up in case of a failure.
- Updates are retried on failure, with the batch size dynamically
reduced on each retry down to a limit of 5 issues per batch.
- All cache keys are set to expire 10 days after the last
interaction, to leave enough time for the job to be picked up and
also to clean up any unused keys after the grace period.
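
The update steps above (equally distributed positions, batched
updates, a checkpoint after each successful batch, and batch
downsizing on failure) can be sketched roughly as follows. This is
an illustration only, not the code in this commit: the method
names, the position range and the in-memory checkpoint hash are
all hypothetical. The real job reads ids from a Redis sorted set
and keeps its checkpoint in cache keys.

```ruby
# Hypothetical sketch; names and defaults are illustrative assumptions.
MIN_BATCH_SIZE = 5 # lower limit for batch downsizing, per the list above

# Spread `count` positions equally across the open range (min_pos, max_pos).
def evenly_spaced_positions(count, min_pos:, max_pos:)
  return [] if count.zero?

  gap = (max_pos - min_pos) / (count + 1)
  (1..count).map { |i| min_pos + (i * gap) }
end

# sorted_ids: issue ids already ordered by current relative position
#             (the order a Redis sorted set would return them in).
# checkpoint: Hash standing in for a persisted cache key; it holds the
#             index of the next issue that still needs updating.
# The block receives [[id, new_position], ...] and performs the update;
# it may raise to signal a failed batch.
def rebalance(sorted_ids, checkpoint:, batch_size: 100, min_pos: 0, max_pos: 1_000_000)
  positions = evenly_spaced_positions(sorted_ids.size, min_pos: min_pos, max_pos: max_pos)
  index = checkpoint.fetch(:next_index, 0) # resume from the last checkpoint

  while index < sorted_ids.size
    batch = sorted_ids[index, batch_size].zip(positions[index, batch_size])

    begin
      yield batch
      index += batch.size
      checkpoint[:next_index] = index # store a checkpoint after each success
    rescue StandardError
      raise if batch_size <= MIN_BATCH_SIZE # give up once the limit is reached

      batch_size = [batch_size / 2, MIN_BATCH_SIZE].max # downsize and retry
    end
  end

  checkpoint
end

# Usage: collect the new positions in a Hash instead of writing to a DB.
new_positions = {}
checkpoint = {}
rebalance([7, 3, 9], checkpoint: checkpoint) do |batch|
  batch.each { |id, position| new_positions[id] = position }
end
```

Halving the batch size is an arbitrary choice made for this sketch;
the list above only states that batches are downsized down to a
limit of 5 issues.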

Changelog: changed
