- 20 Mar, 2017 2 commits
-
-
Julien Muchembled authored
-
Julien Muchembled authored
-
- 18 Mar, 2017 1 commit
-
-
Julien Muchembled authored
Traceback (most recent call last):
  ...
  File "neo/lib/handler.py", line 72, in dispatch
    method(conn, *args, **kw)
  File "neo/master/handlers/client.py", line 70, in askFinishTransaction
    conn.getPeerId(),
  File "neo/master/transactions.py", line 387, in prepare
    assert node_list, (ready, failed)
AssertionError: (set([]), frozenset([]))

Master log leading to the crash:

  PACKET #0x0009 StartOperation > S1
  PACKET #0x0004 BeginTransaction < C1
  DEBUG Begin <...>
  PACKET #0x0004 AnswerBeginTransaction > C1
  PACKET #0x0001 NotifyReady < S1

It was wrong to process BeginTransaction before receiving NotifyReady.

The changes in the storage are cosmetic: the 'ready' attribute has become redundant with 'operational'.
-
- 17 Mar, 2017 3 commits
-
-
Julien Muchembled authored
-
Julien Muchembled authored
Due to a bug in MariaDB Connector/C 2.3.2, some tests like testBasicStore and test_max_allowed_packet were retrying the same failing query indefinitely.
-
Julien Muchembled authored
-
- 14 Mar, 2017 4 commits
-
-
Julien Muchembled authored
On clusters with many deadlock avoidances, this flooded logs. Hopefully, this commit reduces the size of logs without losing information.
-
Julien Muchembled authored
An issue that happened for the first time on a storage node didn't always cause other nodes to flush their logs, which made debugging difficult.
-
Julien Muchembled authored
-
Julien Muchembled authored
-
- 07 Mar, 2017 1 commit
-
-
Julien Muchembled authored
-
- 03 Mar, 2017 1 commit
-
-
Julien Muchembled authored
Generators are not thread-safe:

Exception in thread T2:
Traceback (most recent call last):
  ...
  File "ZODB/tests/StorageTestBase.py", line 157, in _dostore
    r2 = self._storage.tpc_vote(t)
  File "neo/client/Storage.py", line 95, in tpc_vote
    return self.app.tpc_vote(transaction)
  File "neo/client/app.py", line 507, in tpc_vote
    self.waitStoreResponses(txn_context)
  File "neo/client/app.py", line 500, in waitStoreResponses
    _waitAnyTransactionMessage(txn_context)
  File "neo/client/app.py", line 145, in _waitAnyTransactionMessage
    self._waitAnyMessage(queue, block=block)
  File "neo/client/app.py", line 128, in _waitAnyMessage
    conn, packet, kw = get(block)
  File "neo/lib/locking.py", line 203, in get
    self._lock()
  File "neo/tests/threaded/__init__.py", line 590, in _lock
    for i in TIC_LOOP:
ValueError: generator already executing

======================================================================
FAIL: check_checkCurrentSerialInTransaction (neo.tests.zodb.testBasic.BasicTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "neo/tests/zodb/testBasic.py", line 33, in check_checkCurrentSerialInTransaction
    super(BasicTests, self).check_checkCurrentSerialInTransaction()
  File "ZODB/tests/BasicStorage.py", line 294, in check_checkCurrentSerialInTransaction
    utils.load_current(self._storage, b'\0\0\0\0\0\0\0\xf4')[1])
failureException: False is not true
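The "generator already executing" error can be reproduced without threads, and one common way to share an iterator between threads is to serialize calls to next() with a lock. The sketch below is only an illustration: `LockedIterator`, `count_up` and `reentrant` are hypothetical names, and this is not necessarily how the commit fixed TIC_LOOP.

```python
import threading

# A generator that is resumed while it is already running raises ValueError.
# Two threads calling next() at the same time hit the same check; here the
# generator re-enters itself so that the failure is deterministic.
def reentrant():
    yield next(gen)  # 'gen' is the generator currently executing

gen = reentrant()
try:
    next(gen)
    error = None
except ValueError as exc:
    error = str(exc)

class LockedIterator:
    """Hypothetical wrapper: only one thread at a time may advance the iterator."""

    def __init__(self, iterable):
        self._it = iter(iterable)
        self._lock = threading.Lock()

    def __iter__(self):
        return self

    def __next__(self):
        with self._lock:
            return next(self._it)

def count_up(n):
    for i in range(n):
        yield i

shared = LockedIterator(count_up(10000))
seen = []

def worker():
    for item in shared:
        seen.append(item)

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

With the lock in place, each value is handed to exactly one thread, so the four workers together see every item once.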
-
- 02 Mar, 2017 2 commits
-
-
Julien Muchembled authored
This is done by moving self.replicator.populate() after the switch to MasterOperationHandler, so that the latter is not delayed. This change comes with some refactoring of the main loop, to clean up app.checker and app.replicator properly (like app.tm).

Another option could have been to process notifications with the last handler, instead of the first one. But if possible, cleaning up the whole code to not delay handlers anymore looks like the best option.
-
Julien Muchembled authored
-
- 27 Feb, 2017 3 commits
-
-
Julien Muchembled authored
This happened in 2 cases:

- Commit a4c06242 ("Review aborting of transactions") introduced a race condition causing oids to remain write-locked forever after the transaction modifying them is aborted.
- An unfinished transaction is not locked/unlocked during tpc_finish: oids must be unlocked when being notified that the transaction is finished.
-
Julien Muchembled authored
This was found by the first assertion of answerRebaseObject (client) because a storage node missed a few transactions and reported a conflict with an older serial than the one being stored. This must never happen, and this commit adds a more generic assertion on the storage side.

The above case is when the "first phase" of replication of a partition (all history up to the tid before unfinished transactions) ended after the unfinished transactions were finished: this was a corruption bug, where UP_TO_DATE cells could miss data.

Otherwise, if the "first phase" ended before, then the partition remained stuck in OUT_OF_DATE state. Restarting the storage node was enough to recover.
-
Julien Muchembled authored
Traceback (most recent call last):
  ...
  File "neo/client/app.py", line 507, in tpc_vote
    self.waitStoreResponses(txn_context)
  File "neo/client/app.py", line 500, in waitStoreResponses
    _waitAnyTransactionMessage(txn_context)
  File "neo/client/app.py", line 150, in _waitAnyTransactionMessage
    self._handleConflicts(txn_context)
  File "neo/client/app.py", line 474, in _handleConflicts
    self._store(txn_context, oid, conflict_serial, data)
  File "neo/client/app.py", line 410, in _store
    self._waitAnyTransactionMessage(txn_context, False)
  File "neo/client/app.py", line 145, in _waitAnyTransactionMessage
    self._waitAnyMessage(queue, block=block)
  File "neo/client/app.py", line 133, in _waitAnyMessage
    _handlePacket(conn, packet, kw)
  File "neo/lib/threaded_app.py", line 133, in _handlePacket
    handler.dispatch(conn, packet, kw)
  File "neo/lib/handler.py", line 72, in dispatch
    method(conn, *args, **kw)
  File "neo/client/handlers/storage.py", line 122, in answerRebaseObject
    assert txn_context.conflict_dict[oid] == (serial, conflict)
AssertionError

Scenario:
0. unanswered rebase from S2
1. conflict resolved between t1 and t2 -> S1 & S2
2. S1 reports a new conflict
3. S2 answers to the rebase: returned serial (t1) is smaller than in conflict_dict (t2)
4. S2 reports the same conflict as in 2
-
- 24 Feb, 2017 2 commits
-
-
Julien Muchembled authored
Traceback (most recent call last):
  ...
  File "neo/storage/handlers/storage.py", line 111, in answerFetchObjects
    self.app.replicator.finish()
  File "neo/storage/replicator.py", line 370, in finish
    self._nextPartition()
  File "neo/storage/replicator.py", line 279, in _nextPartition
    assert app.pt.getCell(offset, app.uuid).isOutOfDate()
AssertionError

The scenario is:
1. partition A: start of replication, with unfinished transactions
2. partition A: all unfinished transactions are finished
3. partition A: end of replication with ReplicationDone notification
4. replication of partition B
5. partition A: AssertionError when starting replication

The bug is that in 3, partition A is only partially replicated and the storage node must not notify the master.
-
Julien Muchembled authored
-
- 23 Feb, 2017 1 commit
-
-
Julien Muchembled authored
This fixes testBasicStore when run with the MySQL backend, which started to fail with commit 9eb06ff1 when the -L runner option is not used.
-
- 21 Feb, 2017 6 commits
-
-
Julien Muchembled authored
-
Julien Muchembled authored
-
Julien Muchembled authored
-
Julien Muchembled authored
This is a first version with several optimizations possible:
- improve EventQueue (or implement a specific queue) to minimize deadlocks
- turn the RebaseObject packet into a notification

Sorting oids could also be useful to reduce the probability of deadlocks, but that would never be enough to avoid them completely, even if there's a single storage. For example:
1. C1 does a first store (x or y)
2. C2 stores x and y; one is delayed
3. C1 stores the other -> deadlock

When solving the deadlock, the data of the first store may only exist on the storage.

2 functional tests are removed because they're redundant, either with ZODB tests or with the new threaded tests.
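The three-step example can be replayed on a toy write-lock table. This is a minimal sketch under stated assumptions: `ToyStorage`, its `store` method and the returned strings are hypothetical and do not reflect NEO's transaction manager. Both clients touch x before y, so the stores are already in sorted oid order, yet the wait-for graph still ends up with a cycle.

```python
class ToyStorage:
    """Toy write-lock table for a single storage node (illustration only)."""

    def __init__(self):
        self.locks = {}    # oid -> transaction holding the write-lock
        self.waiting = {}  # transaction -> transaction it is waiting for

    def store(self, txn, oid):
        owner = self.locks.setdefault(oid, txn)
        if owner == txn:
            return "locked"
        # The oid is write-locked by someone else: the store is delayed.
        self.waiting[txn] = owner
        return "deadlock" if self._has_cycle(txn) else "delayed"

    def _has_cycle(self, start):
        # Follow the wait-for chain and see whether it leads back to 'start'.
        seen = set()
        txn = self.waiting.get(start)
        while txn is not None and txn not in seen:
            if txn == start:
                return True
            seen.add(txn)
            txn = self.waiting.get(txn)
        return False

storage = ToyStorage()
results = [
    storage.store("C1", "x"),  # 1. C1 does a first store
    storage.store("C2", "x"),  # 2. C2 stores x, delayed behind C1 ...
    storage.store("C2", "y"),  #    ... and y, which it locks
    storage.store("C1", "y"),  # 3. C1 stores the other -> deadlock
]
print(results)
```

C2 does not wait for the answer to its first store before sending the second one, which is why ordering the oids does not prevent C2 from holding y while it waits for x.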
-
Julien Muchembled authored
- Make sure that errors while processing a delayed packet are reported to the connection that sent this packet.
- Provide a mechanism to process events for the same connection in chronological order.
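The two properties can be pictured with a small queue of delayed events. This is a rough sketch only: `DelayedEvents`, `FakeConnection`, `submit` and `replay` are hypothetical names and are not the API of NEO's EventQueue. An event that is ready still waits if an older event from the same connection is queued, and an exception raised while replaying is recorded on the connection that sent the event.

```python
from collections import deque

class FakeConnection:
    """Stand-in for a connection; errors are collected instead of sent."""

    def __init__(self, name):
        self.name = name
        self.errors = []

class DelayedEvents:
    def __init__(self):
        self._events = deque()  # (conn, func) pairs in arrival order

    def submit(self, conn, func, ready=True):
        # Run immediately only if nothing older from this connection is
        # queued, so that per-connection order is preserved.
        if ready and not any(c is conn for c, _ in self._events):
            self._run(conn, func)
        else:
            self._events.append((conn, func))

    def replay(self):
        # Process the events queued so far, oldest first.
        for _ in range(len(self._events)):
            conn, func = self._events.popleft()
            self._run(conn, func)

    @staticmethod
    def _run(conn, func):
        try:
            func(conn)
        except Exception as exc:
            # Report to the connection that sent the event.
            conn.errors.append(str(exc))

order = []

def record(label):
    return lambda conn: order.append((conn.name, label))

def fail(conn):
    raise RuntimeError("boom")

a, b = FakeConnection("A"), FakeConnection("B")
events = DelayedEvents()
events.submit(a, record("first"), ready=False)  # delayed
events.submit(a, record("second"))              # ready, but queued behind "first"
events.submit(b, record("other"))               # nothing pending for B: runs now
events.submit(a, fail)                          # queued behind A's earlier events
events.replay()
```

After the replay, A's events have run in the order they arrived, and only A has recorded the error.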
-
Julien Muchembled authored
-
- 14 Feb, 2017 8 commits
-
-
Julien Muchembled authored
Fix conflict handling after a successful store to a node being disconnected for having missed a transaction
-
Julien Muchembled authored
-
Julien Muchembled authored
-
Julien Muchembled authored
- fail sooner in case of unresolvable conflict
- avoid OOM when there are many conflicts
-
Julien Muchembled authored
-
Julien Muchembled authored
-
Julien Muchembled authored
-
Julien Muchembled authored
-
- 02 Feb, 2017 6 commits
-
-
Julien Muchembled authored
-
Julien Muchembled authored
Now that we do inequality comparisons between timestamps, the master must use a monotonic clock, to avoid issues when the clock is turned back. Before, the probability that time.time() returned the same value again was probably negligible.
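The difference can be illustrated as follows. This sketch assumes Python 3, where time.monotonic() is available in the standard library; the commit does not say which clock source the master uses, and `is_newer` is a hypothetical helper.

```python
import time

def is_newer(stamp, reference):
    """Inequality comparison between two timestamps."""
    return stamp > reference

# Simulated wall-clock readings around a backward clock adjustment:
# the event that happened later gets the smaller timestamp.
wall_before, wall_after = 1000.0, 998.5
later_event_looks_newer = is_newer(wall_after, wall_before)

# A monotonic clock never goes backwards, so a later reading is never
# smaller than an earlier one (it may be equal on a coarse clock).
first = time.monotonic()
second = time.monotonic()
monotonic_ok = second >= first
```

Monotonic readings are only meaningful relative to each other within the same process; they are not calendar times.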
-
Julien Muchembled authored
-
Julien Muchembled authored
-
Julien Muchembled authored
-
Julien Muchembled authored
This optimizes the normal case, and handlers can now take specific action when requests are cancelled because a connection is closed.
-