• Kirill Smelkov's avatar
    Add tests that demonstrates data corruption when ZEO5 client is served by ZEO4 server · 72320df7
    Kirill Smelkov authored
    As explained by "Bug1" in https://github.com/zopefoundation/ZEO/issues/209
    there is a race in between load and invalidate when ZEO5 client works
    wrt ZEO4 server:
    
        ---- 8< ----
        ZEO5.client - contrary to ZEO4.client - does not account for simultaneous invalidations when working wrt ZEO4.server.
        It shows as e.g.
    
        ```console
        (z-dev) kirr@deca:~/src/wendelin/z/ZEO5$ ZEO4_SERVER=1 zope-testrunner -fvvvx --test-path=src -t check_race_load_vs_external
        ...
    
        AssertionError: T1: obj1 (24)  !=  obj2 (23)
        obj1._p_serial: 0x03ea4b6fb05b52cc  obj2._p_serial: 0x03ea4b6faf253855
        zconn_at: 0x03ea4b6fb05b52cc  # approximated as max(serials)
        zstor.loadBefore(obj1, @zconn.at)       ->  serial: 0x03ea4b6fb05b52cc  next_serial: None
        zstor.loadBefore(obj2, @zconn.at)       ->  serial: 0x03ea4b6faf253855  next_serial: 0x03ea4b6fb07def66
        zstor._cache.clear()
        zstor.loadBefore(obj1, @zconn.at)       ->  serial: 0x03ea4b6fb05b52cc  next_serial: 0x03ea4b6fb07def66
        zstor.loadBefore(obj2, @zconn.at)       ->  serial: 0x03ea4b6fb05b52cc  next_serial: 0x03ea4b6fb07def66
        ```
    
        indicating that obj2 was provided to user from the cache that erroneously got stale.
    
        With [IO trace](kirr/ZEO@f184a1d9) showing message exchange in between client and the server, this looks as:
    
        ```
        # loadBefore issued
        tx (('\x00\x00\x00\x00\x00\x00\x00\x02', '\x03\xeaKo\xaf%8V'), False, 'loadBefore', ('\x00\x00\x00\x00\x00\x00\x00\x02', '\x03\xeaKo\xaf%8V'))
    
        # received invalidateTransaction
        rx (0, 1, 'invalidateTransaction', ('\x03\xeaKo\xb0[R\xcc', ['\x00\x00\x00\x00\x00\x00\x00\x01', '\x00\x00\x00\x00\x00\x00\x00\x02']))
    
        # received loadBefore reply but with end_tid=None !!!
        rx (('\x00\x00\x00\x00\x00\x00\x00\x02', '\x03\xeaKo\xaf%8V'), 0, '.reply', ('\x80\x03cZODB.tests.MinPO\nMinPO\nq\x01.\x80\x03}q\x02U\x05valueq\x03K\x17s.', '\x03\xeaKo\xaf%8U', None))
        ```
    
        which:
    
        1) contradicts what @jimfulton [wrote about ZEO4](https://github.com/zopefoundation/ZEO/blob/master/docs/ordering.rst#zeo-4): that there _invalidations are sent in a callback called when the storage lock is held, blocking loads while committing_, and
        2) points out what the bug is:
    
        Since in ZEO4 loads can be handled while a commit is in progress, ZEO4.client has special care to detect if an `invalidateTransaction` comes in between `load` request and corresponding `.reply` response, and _does not update the cache if invalidateTransaction sneaked in-between_:
    
        https://github.com/zopefoundation/ZEO/blob/47d3fbe8cbf24cad91b183483df069ef20708874/src/ZEO/ClientStorage.py#L367-L374
        https://github.com/zopefoundation/ZEO/blob/47d3fbe8cbf24cad91b183483df069ef20708874/src/ZEO/ClientStorage.py#L841-L852
        https://github.com/zopefoundation/ZEO/blob/47d3fbe8cbf24cad91b183483df069ef20708874/src/ZEO/ClientStorage.py#L1473-L1476
    
        but in ZEO5.client `loadBefore` does not have anything like that
    
        https://github.com/zopefoundation/ZEO/blob/fc0729b3cc754bda02c7f54319260b5527dd42a3/src/ZEO/ClientStorage.py#L603-L608
        https://github.com/zopefoundation/ZEO/blob/fc0729b3cc754bda02c7f54319260b5527dd42a3/src/ZEO/asyncio/client.py#L289-L309
    
        and thus `invalidateTransaction` sneaked in between `loadBefore` request and
        corresponding `.reply` causes ZEO client cache corruption.
    
        In the original `check_race_load_vs_external_invalidate` the problem appears
        only sometimes, but the bug happens with ~ 100% probability with the
        following delay injected on the server after `loadBefore`:
    
        ```diff
        --- a/src/ZEO/tests/ZEO4/StorageServer.py
        +++ b/src/ZEO/tests/ZEO4/StorageServer.py
        @@ -285,7 +285,9 @@ def loadEx(self, oid):
    
             def loadBefore(self, oid, tid):
                 self.stats.loads += 1
        -        return self.storage.loadBefore(oid, tid)
        +        x = self.storage.loadBefore(oid, tid)
        +        time.sleep(0.1)
        +        return x
    
             def getInvalidations(self, tid):
                 invtid, invlist = self.server.get_invalidations(self.storage_id, tid)
        ```
    
        so maybe, in addition to normal test runs, it could be also good idea to run
        the whole ZEO testsuite against so-amended storage backend. This way it will be
        similar to how e.g. [races are detected](https://lab.nexedi.com/kirr/go123/blob/070bfdbb/tracing/tracetest/tracetest.go#L76-102) by my [tracetest](https://lab.nexedi.com/kirr/go123/blob/070bfdbb/tracing/tracetest/tracetest.go#L20-73) package.
        ---- 8< ----
    
    -> Add suggested test to the whole ZEO testsuite. Currently it reliably
    catches the data curruption caused by explained load/invalidate race
    condition and fails as follows:
    
        (z-dev) kirr@deca:~/src/wendelin/z/ZEO5$ ZEO4_SERVER=1 zope-testrunner -fvvvc -a 100 --test-path=src
        ...
    
        Running .FileStorageLoadDelayedTests tests:
        ...
    
        Failure in test check_race_load_vs_external_invalidate (ZEO.tests.testZEO.FileStorageLoadDelayedTests)
        Traceback (most recent call last):
          File "/usr/lib/python2.7/unittest/case.py", line 329, in run
            testMethod()
          File "/home/kirr/src/wendelin/z/ZODB/src/ZODB/tests/racetest.py", line 253, in check_race_load_vs_external_invalidate
            return self._check_race_load_vs_external_invalidate(T2ObjectsInc())
          File "/home/kirr/src/wendelin/z/ZODB/src/ZODB/tests/util.py", line 417, in _
            return f(*argv, **kw)
          File "/home/kirr/src/wendelin/z/ZODB/src/ZODB/tests/racetest.py", line 346, in _check_race_load_vs_external_invalidate
            self.fail('\n\n'.join([_ for _ in failure if _]))
          File "/usr/lib/python2.7/unittest/case.py", line 410, in fail
            raise self.failureException(msg)
        AssertionError: T4: obj1 (19)  !=  obj2 (20)
        obj1._p_serial: 0x03eabd580073db55  obj2._p_serial: 0x03eabd580079a5dd
        zconn_at: 0x03eabd580079a5dd  # approximated as max(serials)
        zstor.loadBefore(obj1, @zconn.at)       ->  serial: 0x03eabd580073db55  next_serial: None
        zstor.loadBefore(obj2, @zconn.at)       ->  serial: 0x03eabd580079a5dd  next_serial: None
        zstor._cache.clear()
        zstor.loadBefore(obj1, @zconn.at)       ->  serial: 0x03eabd580079a5dd  next_serial: None
        zstor.loadBefore(obj2, @zconn.at)       ->  serial: 0x03eabd580079a5dd  next_serial: 0x03eabd5800aff0ee
    
        T7: obj1 (18)  !=  obj2 (20)
        obj1._p_serial: 0x03eabd580071a011  obj2._p_serial: 0x03eabd580079a5dd
        zconn_at: 0x03eabd580079a5dd  # approximated as max(serials)
        zstor.loadBefore(obj1, @zconn.at)       ->  serial: 0x03eabd580071a011  next_serial: None
        zstor.loadBefore(obj2, @zconn.at)       ->  serial: 0x03eabd580079a5dd  next_serial: None
        zstor._cache.clear()
        zstor.loadBefore(obj1, @zconn.at)       ->  serial: 0x03eabd580079a5dd  next_serial: None
        zstor.loadBefore(obj2, @zconn.at)       ->  serial: 0x03eabd580079a5dd  next_serial: None
    
    Fortunately added tests do not fail for ZEO5.client-ZEO5.server.
    
    Running added tests takes more time compared to running regular tests.
    For example the time to run BlobAdaptedFileStorageTests at default
    level=1 takes 19s. To run the same BlobAdaptedFileStorageTests at
    level=2 (thus including check_race_external_invalidate_vs_disconnect
    from https://github.com/zopefoundation/ZODB/pull/368) takes 25s.
    And to run hereby-added FileStorageLoadDelayedTests takes 85s.
    
    That's why the test is marked as very long via level=10
    72320df7
component.xml 286 Bytes