Add tests that demonstrates data corruption when ZEO5 client is served by ZEO4 server
As explained by "Bug1" in https://github.com/zopefoundation/ZEO/issues/209 there is a race in between load and invalidate when ZEO5 client works wrt ZEO4 server: ---- 8< ---- ZEO5.client - contrary to ZEO4.client - does not account for simultaneous invalidations when working wrt ZEO4.server. It shows as e.g. ```console (z-dev) kirr@deca:~/src/wendelin/z/ZEO5$ ZEO4_SERVER=1 zope-testrunner -fvvvx --test-path=src -t check_race_load_vs_external ... AssertionError: T1: obj1 (24) != obj2 (23) obj1._p_serial: 0x03ea4b6fb05b52cc obj2._p_serial: 0x03ea4b6faf253855 zconn_at: 0x03ea4b6fb05b52cc # approximated as max(serials) zstor.loadBefore(obj1, @zconn.at) -> serial: 0x03ea4b6fb05b52cc next_serial: None zstor.loadBefore(obj2, @zconn.at) -> serial: 0x03ea4b6faf253855 next_serial: 0x03ea4b6fb07def66 zstor._cache.clear() zstor.loadBefore(obj1, @zconn.at) -> serial: 0x03ea4b6fb05b52cc next_serial: 0x03ea4b6fb07def66 zstor.loadBefore(obj2, @zconn.at) -> serial: 0x03ea4b6fb05b52cc next_serial: 0x03ea4b6fb07def66 ``` indicating that obj2 was provided to user from the cache that erroneously got stale. With [IO trace](kirr/ZEO@f184a1d9) showing message exchange in between client and the server, this looks as: ``` # loadBefore issued tx (('\x00\x00\x00\x00\x00\x00\x00\x02', '\x03\xeaKo\xaf%8V'), False, 'loadBefore', ('\x00\x00\x00\x00\x00\x00\x00\x02', '\x03\xeaKo\xaf%8V')) # received invalidateTransaction rx (0, 1, 'invalidateTransaction', ('\x03\xeaKo\xb0[R\xcc', ['\x00\x00\x00\x00\x00\x00\x00\x01', '\x00\x00\x00\x00\x00\x00\x00\x02'])) # received loadBefore reply but with end_tid=None !!! rx (('\x00\x00\x00\x00\x00\x00\x00\x02', '\x03\xeaKo\xaf%8V'), 0, '.reply', ('\x80\x03cZODB.tests.MinPO\nMinPO\nq\x01.\x80\x03}q\x02U\x05valueq\x03K\x17s.', '\x03\xeaKo\xaf%8U', None)) ``` which: 1) contradicts what @jimfulton [wrote about ZEO4](https://github.com/zopefoundation/ZEO/blob/master/docs/ordering.rst#zeo-4): that there _invalidations are sent in a callback called when the storage lock is held, blocking loads while committing_, and 2) points out what the bug is: Since in ZEO4 loads can be handled while a commit is in progress, ZEO4.client has special care to detect if an `invalidateTransaction` comes in between `load` request and corresponding `.reply` response, and _does not update the cache if invalidateTransaction sneaked in-between_: https://github.com/zopefoundation/ZEO/blob/47d3fbe8cbf24cad91b183483df069ef20708874/src/ZEO/ClientStorage.py#L367-L374 https://github.com/zopefoundation/ZEO/blob/47d3fbe8cbf24cad91b183483df069ef20708874/src/ZEO/ClientStorage.py#L841-L852 https://github.com/zopefoundation/ZEO/blob/47d3fbe8cbf24cad91b183483df069ef20708874/src/ZEO/ClientStorage.py#L1473-L1476 but in ZEO5.client `loadBefore` does not have anything like that https://github.com/zopefoundation/ZEO/blob/fc0729b3cc754bda02c7f54319260b5527dd42a3/src/ZEO/ClientStorage.py#L603-L608 https://github.com/zopefoundation/ZEO/blob/fc0729b3cc754bda02c7f54319260b5527dd42a3/src/ZEO/asyncio/client.py#L289-L309 and thus `invalidateTransaction` sneaked in between `loadBefore` request and corresponding `.reply` causes ZEO client cache corruption. In the original `check_race_load_vs_external_invalidate` the problem appears only sometimes, but the bug happens with ~ 100% probability with the following delay injected on the server after `loadBefore`: ```diff --- a/src/ZEO/tests/ZEO4/StorageServer.py +++ b/src/ZEO/tests/ZEO4/StorageServer.py @@ -285,7 +285,9 @@ def loadEx(self, oid): def loadBefore(self, oid, tid): self.stats.loads += 1 - return self.storage.loadBefore(oid, tid) + x = self.storage.loadBefore(oid, tid) + time.sleep(0.1) + return x def getInvalidations(self, tid): invtid, invlist = self.server.get_invalidations(self.storage_id, tid) ``` so maybe, in addition to normal test runs, it could be also good idea to run the whole ZEO testsuite against so-amended storage backend. This way it will be similar to how e.g. [races are detected](https://lab.nexedi.com/kirr/go123/blob/070bfdbb/tracing/tracetest/tracetest.go#L76-102) by my [tracetest](https://lab.nexedi.com/kirr/go123/blob/070bfdbb/tracing/tracetest/tracetest.go#L20-73) package. ---- 8< ---- -> Add suggested test to the whole ZEO testsuite. Currently it reliably catches the data curruption caused by explained load/invalidate race condition and fails as follows: (z-dev) kirr@deca:~/src/wendelin/z/ZEO5$ ZEO4_SERVER=1 zope-testrunner -fvvvc -a 100 --test-path=src ... Running .FileStorageLoadDelayedTests tests: ... Failure in test check_race_load_vs_external_invalidate (ZEO.tests.testZEO.FileStorageLoadDelayedTests) Traceback (most recent call last): File "/usr/lib/python2.7/unittest/case.py", line 329, in run testMethod() File "/home/kirr/src/wendelin/z/ZODB/src/ZODB/tests/racetest.py", line 253, in check_race_load_vs_external_invalidate return self._check_race_load_vs_external_invalidate(T2ObjectsInc()) File "/home/kirr/src/wendelin/z/ZODB/src/ZODB/tests/util.py", line 417, in _ return f(*argv, **kw) File "/home/kirr/src/wendelin/z/ZODB/src/ZODB/tests/racetest.py", line 346, in _check_race_load_vs_external_invalidate self.fail('\n\n'.join([_ for _ in failure if _])) File "/usr/lib/python2.7/unittest/case.py", line 410, in fail raise self.failureException(msg) AssertionError: T4: obj1 (19) != obj2 (20) obj1._p_serial: 0x03eabd580073db55 obj2._p_serial: 0x03eabd580079a5dd zconn_at: 0x03eabd580079a5dd # approximated as max(serials) zstor.loadBefore(obj1, @zconn.at) -> serial: 0x03eabd580073db55 next_serial: None zstor.loadBefore(obj2, @zconn.at) -> serial: 0x03eabd580079a5dd next_serial: None zstor._cache.clear() zstor.loadBefore(obj1, @zconn.at) -> serial: 0x03eabd580079a5dd next_serial: None zstor.loadBefore(obj2, @zconn.at) -> serial: 0x03eabd580079a5dd next_serial: 0x03eabd5800aff0ee T7: obj1 (18) != obj2 (20) obj1._p_serial: 0x03eabd580071a011 obj2._p_serial: 0x03eabd580079a5dd zconn_at: 0x03eabd580079a5dd # approximated as max(serials) zstor.loadBefore(obj1, @zconn.at) -> serial: 0x03eabd580071a011 next_serial: None zstor.loadBefore(obj2, @zconn.at) -> serial: 0x03eabd580079a5dd next_serial: None zstor._cache.clear() zstor.loadBefore(obj1, @zconn.at) -> serial: 0x03eabd580079a5dd next_serial: None zstor.loadBefore(obj2, @zconn.at) -> serial: 0x03eabd580079a5dd next_serial: None Fortunately added tests do not fail for ZEO5.client-ZEO5.server. Running added tests takes more time compared to running regular tests. For example the time to run BlobAdaptedFileStorageTests at default level=1 takes 19s. To run the same BlobAdaptedFileStorageTests at level=2 (thus including check_race_external_invalidate_vs_disconnect from https://github.com/zopefoundation/ZODB/pull/368) takes 25s. And to run hereby-added FileStorageLoadDelayedTests takes 85s. That's why the test is marked as very long via level=10
Showing
src/ZEO/tests/component.xml
0 → 100644
src/ZEO/tests/racetest.py
0 → 100644
Please register or sign in to comment