    Let new Sidekiq workers retry 25 times by default · a3c6d078
    Sean McGivern authored
    A few years ago we set the default number of retries to 3 globally. The
    reasoning was that we have a lot of jobs that try to communicate with
    external services, and that those jobs don't need 25 retries - if the
    service is down, we shouldn't waste our time trying to send to it the
    other 22 times.
    
    I think that this is valid, but it has some problems as a default:
    
    1. Most of our workers don't connect to an external service.
    2. Three retries happen within a couple of minutes. If - say - our
       database goes down for 5 minutes, all jobs that we tried during that
       period will fail completely. If we allow 25 retries then we have
       about three weeks to fix that issue, which should be sufficient.
    3. We're rolling out a change to use fewer Sidekiq queues. Because we
       have a mixed-stage deployment, we can have jobs scheduled from canary
       using new worker classes that don't exist on the main stage (which
       runs Sidekiq) yet. Those jobs would just fail immediately. Setting a
       higher number of retries lets those jobs succeed once the main stage
       is deployed - albeit with a delay due to the back-off.
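    
    The timing in point 2 can be checked with a rough sketch. It assumes
    the back-off formula described in the Sidekiq documentation, roughly
    (retry_count ** 4) + 15 seconds, and leaves out the random jitter
    Sidekiq adds, so real totals are somewhat higher. The method names
    are illustrative and not part of Sidekiq.

```ruby
# Approximate delay in seconds before retry number `count` (0-based).
# Assumption: Sidekiq's documented back-off, with the random jitter omitted.
def retry_delay(count)
  (count**4) + 15
end

# Total time spent waiting across `retries` retry attempts.
def total_delay(retries)
  (0...retries).sum { |count| retry_delay(count) }
end

three_retries = total_delay(3)
twenty_five_retries_days = (total_delay(25) / 86_400.0).round(1)

puts three_retries            # 62 (seconds, before jitter)
puts twenty_five_retries_days # 20.4 (days, before jitter)
```

    With jitter included, 3 retries are exhausted after a minute or
    two, while 25 retries span roughly three weeks.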
    
    The meta-reasoning here is that 25 is the default in Sidekiq for a
    reason, and in retrospect I think we were too hasty in changing it.
    Sidekiq is very heavily used software, and its default settings have
    come out of a lot of hard-won experience. Let's try to use more of that!
every_sidekiq_worker_spec.rb 22.6 KB