Commit a53383fd authored by Mike Kozono

Geo: Increase backoff cap for missing on primary

This applies to legacy blobs: Job artifacts, LFS objects, and Uploads.

On staging.gitlab.com, many files are (intentionally) missing on the
primary, so geo.staging.gitlab.com attempts to sync them every hour. We
don't want to disable retries after some maximum number, because we want
the system to automatically recover if the files ever appear. But every
hour is a bit excessive, given all retries have failed up to that point.
So this commit raises the retry time cap for legacy blobs missing on
primary from 1 hour to 4 hours.
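As a back-of-the-envelope check on the paragraph above (an illustrative calculation, not code from this commit): once the backoff has saturated at its cap, a permanently missing file is retried roughly once per cap interval, so raising the cap from 1 hour to 4 hours cuts retries from about 24 to about 6 per day.

```ruby
# Approximate retries per day once the backoff has hit its cap
# (jitter ignored; values are illustrative).
SECONDS_PER_DAY = 24 * 3600

old_cap = 1 * 3600 # previous 1-hour cap
new_cap = 4 * 3600 # new 4-hour cap for blobs missing on primary

puts SECONDS_PER_DAY / old_cap # => 24
puts SECONDS_PER_DAY / new_cap # => 6
```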

As an aside, resources replicated by the Geo framework will soon gain
automatic verification and re-verification. That feature will eventually
resync resources which were missing on the primary and later reappear.
parent 6efb0e39
@@ -79,9 +79,11 @@ module Geo
         retry_later = !registry.success || registry.missing_on_primary

         if retry_later
+          custom_max_wait_time = missing_on_primary ? 4.hours : nil
+
           # We don't limit the amount of retries
           registry.retry_count = (registry.retry_count || 0) + 1
-          registry.retry_at = next_retry_time(registry.retry_count)
+          registry.retry_at = next_retry_time(registry.retry_count, custom_max_wait_time)
         else
           registry.retry_count = 0
           registry.retry_at = nil
---
title: 'Geo: Increase backoff cap for Job artifacts, LFS objects, and Uploads which are missing on primary'
merge_request: 50812
author:
type: changed
@@ -8,9 +8,10 @@ module Delay
   # To prevent the retry time from storing invalid dates in the database,
   # cap the max time to an hour plus some random jitter value.
-  def next_retry_time(retry_count)
+  def next_retry_time(retry_count, custom_max_wait_time = nil)
     proposed_time = Time.now + delay(retry_count).seconds
-    max_future_time = 1.hour.from_now + delay(1).seconds
+    max_wait_time = custom_max_wait_time || 1.hour
+    max_future_time = max_wait_time.from_now + delay(1).seconds

     [proposed_time, max_future_time].min
   end
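The capped-backoff behavior of `next_retry_time` can be sketched in plain Ruby, without ActiveSupport durations. Note the `delay` formula below is an illustrative jittered backoff standing in for GitLab's actual implementation, and `DelaySketch` is a hypothetical name; only the cap logic mirrors the diff.

```ruby
module DelaySketch
  ONE_HOUR = 3600 # seconds

  # Illustrative jittered, roughly exponential backoff in seconds.
  # This is an assumption, not GitLab's exact delay implementation.
  def self.delay(retry_count)
    (retry_count**4) + 15 + (rand(30) * (retry_count + 1))
  end

  # Mirrors the patched method: the proposed time grows with retry_count,
  # but is capped at custom_max_wait (default: one hour) past now,
  # plus the first-retry jitter.
  def self.next_retry_time(retry_count, custom_max_wait = nil)
    now = Time.now
    proposed_time   = now + delay(retry_count)
    max_wait        = custom_max_wait || ONE_HOUR
    max_future_time = now + max_wait + delay(1)
    [proposed_time, max_future_time].min
  end
end

# After 100 failures the uncapped delay would be ~100**4 seconds (over
# three years); a 4-hour cap keeps the next attempt about 4 hours out.
puts DelaySketch.next_retry_time(100, 4 * DelaySketch::ONE_HOUR)
```

Passing `nil` (or nothing) for the cap falls back to the original one-hour behavior, which is why the diff only sets `custom_max_wait_time` for blobs missing on the primary.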
@@ -323,13 +323,13 @@ RSpec.describe Geo::FileDownloadService do
       end
     end

-    it 'sets a retry date with a maximum of about 7 days' do
-      registry_entry.update!(retry_count: 100, retry_at: 7.days.from_now)
+    it 'sets a retry date with a maximum of about 4 hours' do
+      registry_entry.update!(retry_count: 100, retry_at: 1.minute.ago)

       freeze_time do
         execute!

-        expect(registry_entry.reload.retry_at < 8.days.from_now).to be_truthy
+        expect(registry_entry.reload.retry_at).to be_within(3.minutes).of(4.hours.from_now)
       end
     end
   end
@@ -362,13 +362,13 @@ RSpec.describe Geo::FileDownloadService do
       end
     end

-    it 'sets a retry date with a maximum of about 7 days' do
-      registry_entry.update!(retry_count: 100, retry_at: 7.days.from_now)
+    it 'sets a retry date with a maximum of about 1 hour' do
+      registry_entry.update!(retry_count: 100, retry_at: 1.minute.ago)

       freeze_time do
         execute!

-        expect(registry_entry.reload.retry_at < 8.days.from_now).to be_truthy
+        expect(registry_entry.reload.retry_at).to be_within(3.minutes).of(1.hour.from_now)
       end
     end
   end