Sean McGivern authored
A few years ago we set the default number of retries to 3 globally. The reasoning was that we have a lot of jobs that try to communicate with external services, and that those jobs don't need 25 retries - if the service is down, we shouldn't waste our time trying to send to it the other 22 times.

I think that this is valid, but it has some problems as a default:

1. Most of our workers don't connect to an external service.
2. 3 retries will happen over a couple of minutes. If - say - our database goes down for 5 minutes, all jobs that we tried during that period will fail completely. If we allow 25 retries then we have a few weeks to fix that issue, which should be sufficient.
3. We're rolling out a change to use fewer Sidekiq queues. Because we have a mixed-stage deployment, we can have jobs scheduled from canary using new worker classes that don't exist on the main stage (which runs Sidekiq) yet. Those jobs would just fail immediately. Setting a higher number of retries lets those jobs succeed once the main stage is deployed - albeit with a delay due to the back-off.

The meta-reasoning here is that 25 is the default in Sidekiq for a reason, and in retrospect I think we were too hasty in changing it. Sidekiq is very heavily used software, and its default settings have come out of a lot of hard-won experience. Let's try to use more of that!
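
To make the "couple of minutes" versus "a few weeks" comparison concrete, here is a rough sketch of the retry window. It assumes Sidekiq's documented back-off of roughly `count ** 4 + 15` seconds before each retry and leaves out the random jitter term, so real windows will be somewhat longer. The helper names `approx_retry_delay` and `total_retry_window` are hypothetical and not part of Sidekiq.

```ruby
# Approximate delay (in seconds) before retry number `count` (0-based).
# Assumption: Sidekiq's back-off is about count**4 + 15 seconds, plus a
# random jitter term that is ignored here.
def approx_retry_delay(count)
  (count**4) + 15
end

# Total time (in seconds) between the first failure and the last retry
# when a job is allowed `max_retries` retries.
def total_retry_window(max_retries)
  (0...max_retries).sum { |count| approx_retry_delay(count) }
end

three = total_retry_window(3)        # 15 + 16 + 31 = 62 seconds
twenty_five = total_retry_window(25) # 1_763_395 seconds

puts "3 retries:  ~#{three} seconds"
puts "25 retries: ~#{(twenty_five / 86_400.0).round(1)} days"
```

Under these assumptions, 3 retries are exhausted in about a minute before jitter, which is shorter than a 5 minute outage, while 25 retries spread over roughly 20 days.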
a3c6d078