Fix feature flag race condition with stale reads

On GitLab.com we sometimes observe that a feature flag has been changed - and we have the response in chatops to demonstrate that - but the behaviour of the application doesn't reflect that. While we cannot prove that this is the case, we believe that this is due to a combination of these factors: 1. We have some feature flags (Gitaly feature flags in particular) that are read very frequently. All Gitaly feature flags are read on every Gitaly call. 2. We use database load balancing, so a read from a secondary may be slightly behind the primary. 3. The feature flag library we use (Flipper) deletes the feature flag from the cache when it's changed, and relies on the next read to re-populate the cache. As these reads use what is essentially `Rails.cache.fetch`, the last writer 'wins'. That could lead to this situation: 1. Client A is using the primary. It performs `fetch`, finds nothing, executes its block to get the value. 2. Client B does the same, but using a secondary and missing out the write performed by client A. 3. Client A finishes its block and writes to Redis. 4. Client B does the same, clobbering client A's write. The solution in this change is simple: instead of deleting the feature flag from the cache when updating it, we always immediately write the new value back. That way any clients using `Rails.cache.fetch` won't attempt to write until the cache has expired after an hour, by which point the secondaries would have caught up anyway. This is behind a feature flag because it is a somewhat risky change, but using a feature flag is safe here as this only controls how we handle feature flag writes.

Fix feature flag race condition with stale reads
On GitLab.com we sometimes observe that a feature flag has been changed - and we have the response in chatops to demonstrate that - but the behaviour of the application doesn't reflect that. While we cannot prove that this is the case, we believe that this is due to a combination of these factors: 1. We have some feature flags (Gitaly feature flags in particular) that are read very frequently. All Gitaly feature flags are read on every Gitaly call. 2. We use database load balancing, so a read from a secondary may be slightly behind the primary. 3. The feature flag library we use (Flipper) deletes the feature flag from the cache when it's changed, and relies on the next read to re-populate the cache. As these reads use what is essentially `Rails.cache.fetch`, the last writer 'wins'. That could lead to this situation: 1. Client A is using the primary. It performs `fetch`, finds nothing, executes its block to get the value. 2. Client B does the same, but using a secondary and missing out the write performed by client A. 3. Client A finishes its block and writes to Redis. 4. Client B does the same, clobbering client A's write. The solution in this change is simple: instead of deleting the feature flag from the cache when updating it, we always immediately write the new value back. That way any clients using `Rails.cache.fetch` won't attempt to write until the cache has expired after an hour, by which point the secondaries would have caught up anyway. This is behind a feature flag because it is a somewhat risky change, but using a feature flag is safe here as this only controls how we handle feature flag writes.
1cd0540c · Sean McGivern · dc611b97 · 1cd0540c · 1cd0540c · 1cd0540c
Commit 1cd0540c authored Mar 30, 2021 by Sean McGivern
4 changed files
--- a/config/feature_flags/development/feature_flags_cache_stale_read_fix.yml
+++ b/config/feature_flags/development/feature_flags_cache_stale_read_fix.yml
+---
+name: feature_flags_cache_stale_read_fix
+introduced_by_url: https://gitlab.com/gitlab-org/gitlab/-/merge_requests/57819
+rollout_issue_url: https://gitlab.com/gitlab-org/gitlab/-/issues/325452
+milestone: '13.11'
+type: development
+group:
+default_enabled: false
--- a/lib/feature.rb
+++ b/lib/feature.rb
@@ -18,10 +18,6 @@ class Feature
    superclass.table_name = 'feature_gates'
  end

-  class ActiveSupportCacheStoreAdapter < Flipper::Adapters::ActiveSupportCacheStore
-    # overrides methods in EE
-  end
-
  InvalidFeatureFlagError = Class.new(Exception) # rubocop:disable Lint/InheritException

  class << self

--- a/lib/feature/active_support_cache_store_adapter.rb
+++ b/lib/feature/active_support_cache_store_adapter.rb
+# frozen_string_literal: true
+
+# rubocop:disable Gitlab/NamespacedClass
+# This class was already nested this way before moving to a separate file
+class Feature
+  class ActiveSupportCacheStoreAdapter < Flipper::Adapters::ActiveSupportCacheStore
+    def enable(feature, gate, thing)
+      return super unless Feature.enabled?(:feature_flags_cache_stale_read_fix, default_enabled: :yaml)
+
+      result = @adapter.enable(feature, gate, thing)
+      @cache.write(key_for(feature.key), @adapter.get(feature), @write_options)
+      result
+    end
+
+    def disable(feature, gate, thing)
+      return super unless Feature.enabled?(:feature_flags_cache_stale_read_fix, default_enabled: :yaml)
+
+      result = @adapter.disable(feature, gate, thing)
+      @cache.write(key_for(feature.key), @adapter.get(feature), @write_options)
+      result
+    end
+
+    def remove(feature)
+      return super unless Feature.enabled?(:feature_flags_cache_stale_read_fix, default_enabled: :yaml)
+
+      result = @adapter.remove(feature)
+      @cache.delete(FeaturesKey)
+      @cache.write(key_for(feature.key), {}, @write_options)
+      result
+    end
+  end
+end
+# rubocop:disable Gitlab/NamespacedClass
--- a/spec/lib/feature_spec.rb
+++ b/spec/lib/feature_spec.rb
@@ -487,6 +487,109 @@ RSpec.describe Feature, stub_feature_flags: false do
    end
  end

+  context 'caching with stale reads from the database', :use_clean_rails_redis_caching, :request_store, :aggregate_failures do
+    let(:actor) { stub_feature_flag_gate('CustomActor:5') }
+    let(:another_actor) { stub_feature_flag_gate('CustomActor:10') }
+
+    # This is a bit unpleasant. For these tests we want to simulate stale reads
+    # from the database (due to database load balancing). A simple way to do
+    # that is to stub the response on the adapter Flipper uses for reading from
+    # the database. However, there isn't a convenient API for this. We know that
+    # the ActiveRecord adapter is always at the 'bottom' of the chain, so we can
+    # find it that way.
+    let(:active_record_adapter) do
+      adapter = described_class.flipper
+
+      loop do
+        break adapter unless adapter.instance_variable_get(:@adapter)
+
+        adapter = adapter.instance_variable_get(:@adapter)
+      end
+    end
+
+    [true, false].each do |enabled|
+      context "with feature_flags_cache_stale_read_fix #{enabled ? 'enabled' : 'disabled'}" do # rubocop:disable RSpec/EmptyExampleGroup
+        # Without this environment variable enabled we want the specs to fail.
+        method = enabled ? :it : :pending
+
+        before do
+          stub_feature_flags(feature_flags_cache_stale_read_fix: enabled)
+        end
+
+        send(method, 'gives the correct value when enabling for an additional actor') do
+          described_class.enable(:enabled_feature_flag, actor)
+          initial_gate_values = active_record_adapter.get(described_class.get(:enabled_feature_flag))
+
+          # This should only be enabled for `actor`
+          expect(described_class.enabled?(:enabled_feature_flag, actor)).to be(true)
+          expect(described_class.enabled?(:enabled_feature_flag, another_actor)).to be(false)
+          expect(described_class.enabled?(:enabled_feature_flag)).to be(false)
+
+          # Enable for `another_actor` and simulate a stale read
+          described_class.enable(:enabled_feature_flag, another_actor)
+          allow(active_record_adapter).to receive(:get).once.and_return(initial_gate_values)
+
+          # Should read from the cache and be enabled for both of these actors
+          expect(described_class.enabled?(:enabled_feature_flag, actor)).to be(true)
+          expect(described_class.enabled?(:enabled_feature_flag, another_actor)).to be(true)
+          expect(described_class.enabled?(:enabled_feature_flag)).to be(false)
+        end
+
+        send(method, 'gives the correct value when enabling for percentage of time') do
+          described_class.enable_percentage_of_time(:enabled_feature_flag, 10)
+          initial_gate_values = active_record_adapter.get(described_class.get(:enabled_feature_flag))
+
+          # Test against `gate_values` directly as otherwise it would be non-determistic
+          expect(described_class.get(:enabled_feature_flag).gate_values.percentage_of_time).to eq(10)
+
+          # Enable 50% of time and simulate a stale read
+          described_class.enable_percentage_of_time(:enabled_feature_flag, 50)
+          allow(active_record_adapter).to receive(:get).once.and_return(initial_gate_values)
+
+          # Should read from the cache and be enabled 50% of the time
+          expect(described_class.get(:enabled_feature_flag).gate_values.percentage_of_time).to eq(50)
+        end
+
+        send(method, 'gives the correct value when disabling the flag') do
+          described_class.enable(:enabled_feature_flag, actor)
+          described_class.enable(:enabled_feature_flag, another_actor)
+          initial_gate_values = active_record_adapter.get(described_class.get(:enabled_feature_flag))
+
+          # This be enabled for `actor` and `another_actor`
+          expect(described_class.enabled?(:enabled_feature_flag, actor)).to be(true)
+          expect(described_class.enabled?(:enabled_feature_flag, another_actor)).to be(true)
+          expect(described_class.enabled?(:enabled_feature_flag)).to be(false)
+
+          # Disable for `another_actor` and simulate a stale read
+          described_class.disable(:enabled_feature_flag, another_actor)
+          allow(active_record_adapter).to receive(:get).once.and_return(initial_gate_values)
+
+          # Should read from the cache and be enabled only for `actor`
+          expect(described_class.enabled?(:enabled_feature_flag, actor)).to be(true)
+          expect(described_class.enabled?(:enabled_feature_flag, another_actor)).to be(false)
+          expect(described_class.enabled?(:enabled_feature_flag)).to be(false)
+        end
+
+        send(method, 'gives the correct value when deleting the flag') do
+          described_class.enable(:enabled_feature_flag, actor)
+          initial_gate_values = active_record_adapter.get(described_class.get(:enabled_feature_flag))
+
+          # This should only be enabled for `actor`
+          expect(described_class.enabled?(:enabled_feature_flag, actor)).to be(true)
+          expect(described_class.enabled?(:enabled_feature_flag)).to be(false)
+
+          # Remove and simulate a stale read
+          described_class.remove(:enabled_feature_flag)
+          allow(active_record_adapter).to receive(:get).once.and_return(initial_gate_values)
+
+          # Should read from the cache and be disabled everywhere
+          expect(described_class.enabled?(:enabled_feature_flag, actor)).to be(false)
+          expect(described_class.enabled?(:enabled_feature_flag)).to be(false)
+        end
+      end
+    end
+  end
+
  describe Feature::Target do
    describe '#targets' do
      let(:project) { create(:project) }