    Atomically select replicas that meet LSN requirement · f3e1d2c6
    Stan Hu authored
    During a merge, we attempt to find a merge request matching a SHA
    using a replica that should be up-to-date with the primary for a given
    PostgreSQL log sequence number (LSN). However, a race condition can
    occur if service discovery alters the host list after this check has
    taken place. This most likely happens when a Web worker starts:
    
    1. When Rails starts up for the first time, there is a 1- to 2-minute
    delay before service discovery finds replicas
    (see https://gitlab.com/gitlab-org/gitlab/-/issues/271575).
    
    2. During this time `LoadBalancer#all_caught_up?` will return
    `true`. This will indicate to the Web worker that it can use replicas
    and does not have to use the primary.
    
    3. During a request, service discovery may load all the replicas and
    change the host list. As a result, the next read may be directed to a
    lagging replica.
    
    As a result, the merge may fail because the lagging replica cannot
    find a matching merge request.
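
The sequence above can be reproduced with a small model. This is only a sketch: `Replica`, `NaiveBalancer`, and the integer LSNs are hypothetical stand-ins, not GitLab's classes. It also assumes that the check passes during startup because `all?` over an empty host list is vacuously true; the commit message does not state the mechanism.

```ruby
# Hypothetical replica model: LSNs are plain integers here for simplicity.
Replica = Struct.new(:name, :replayed_lsn) do
  # Caught up once the replica has replayed at least the required LSN.
  def caught_up?(required_lsn)
    replayed_lsn >= required_lsn
  end
end

# Hypothetical balancer that checks and reads in two separate steps.
class NaiveBalancer
  attr_reader :hosts

  def initialize
    @hosts = [] # service discovery has not found any replicas yet
  end

  # Assumption: with an empty host list, `all?` returns true.
  def all_caught_up?(required_lsn)
    @hosts.all? { |host| host.caught_up?(required_lsn) }
  end

  # Reads go to whatever the host list contains at read time.
  def next_host
    @hosts.first
  end
end

balancer = NaiveBalancer.new
check_passed = balancer.all_caught_up?(90)

# Service discovery loads replicas after the check, during the same request.
balancer.hosts << Replica.new("lagging", 10)
read_host = balancer.next_host
```

Here `check_passed` is `true`, yet `read_host.caught_up?(90)` is `false`: the check and the read saw different host lists.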
    
    When a user merges a merge request, Sidekiq logs the minimum LSN needed
    to match a merge request for the API. If we have this LSN, we now:
    
    1. Select from the available list of replicas that meet this LSN
    requirement.
    2. Store this subset for the given request.
    3. Round-robin reads with this subset of replicas.
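
The three steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in this commit: `Host`, `RequestScopedBalancer`, and their method names are hypothetical, and LSNs are modeled as integers rather than PostgreSQL's native format.

```ruby
# Hypothetical host model: LSNs are plain integers here for simplicity.
Host = Struct.new(:name, :replayed_lsn) do
  # Caught up once the host has replayed at least the required LSN.
  def caught_up?(required_lsn)
    replayed_lsn >= required_lsn
  end
end

# Hypothetical balancer that fixes the replica subset for one request.
class RequestScopedBalancer
  def initialize(hosts)
    @hosts = hosts   # shared list that service discovery may change
    @selected = []
    @index = 0
  end

  # Steps 1 and 2: filter the host list once and store the subset.
  # Returns false when no replica qualifies (assumption: the caller
  # would then use the primary).
  def select_up_to_date_hosts(required_lsn)
    @selected = @hosts.select { |host| host.caught_up?(required_lsn) }
    @index = 0
    @selected.any?
  end

  # Step 3: round-robin over the stored subset only.
  def next_host
    return nil if @selected.empty?

    host = @selected[@index % @selected.size]
    @index += 1
    host
  end
end

hosts = [Host.new("a", 100), Host.new("b", 50), Host.new("c", 120)]
balancer = RequestScopedBalancer.new(hosts)
selected = balancer.select_up_to_date_hosts(90)

# Service discovery adds a lagging replica in the middle of the request.
hosts << Host.new("d", 10)
names = Array.new(4) { balancer.next_host.name }
```

Because the subset is computed once and stored, later changes to the shared host list do not affect reads for the current request: `names` cycles through `a` and `c` only.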
    
    Relates to https://gitlab.com/gitlab-org/gitlab/-/issues/247857
sh-load-balancer-atomic-replica.yml 108 Bytes