- 31 Mar, 2017 11 commits
-
Julien Muchembled authored
Commit ad43dcd3 should have bumped it as well.
-
Julien Muchembled authored
Unused but it is likely to be useful in the future.
-
Julien Muchembled authored
The bug could lead to data corruption (if a partition is wrongly marked as UP_TO_DATE) or crashes (assertion failure on either the storage or the master).

The protocol is extended to handle the following scenario:

     S                                     M
                                           partition 0 outdated
     <-- UnfinishedTransactions ------>
     replication of partition 0 ...
                                           partition 1 outdated
     --- UnfinishedTransactions ...
     ... replication finished
     --- ReplicationDone ...
                                           tweak
     <-- partition 1 discarded --------
                                           tweak
     <-- partition 1 outdated ---------
     ... UnfinishedTransactions -->
     ... ReplicationDone --------->

The master can't simply mark all outdated cells as being updatable when it receives an UnfinishedTransactions packet.
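The idea behind the extended protocol can be sketched as follows. This is a hypothetical model, not NEO's real API: the master tags each cell with the partition-table id (ptid) at which it became outdated, and a delayed UnfinishedTransactions packet may only make updatable the cells that were already outdated when the storage sent it.

```python
# Hypothetical sketch (invented names, not NEO's actual classes): the master
# must remember *when* each cell became outdated, so that a delayed
# UnfinishedTransactions packet cannot affect cells outdated by a later tweak.

class Master:
    def __init__(self):
        self.ptid = 0              # partition table id, bumped by each change
        self.outdated_since = {}   # partition -> ptid at which it became outdated

    def outdate(self, partition):
        self.ptid += 1
        self.outdated_since[partition] = self.ptid

    def discard(self, partition):
        self.ptid += 1
        self.outdated_since.pop(partition, None)

    def updatable_on_unfinished(self, sent_at_ptid):
        # The buggy behaviour would be `return set(self.outdated_since)`,
        # i.e. mark *all* outdated cells updatable.  Instead, only cells
        # already outdated when the packet was sent qualify.
        return {p for p, since in self.outdated_since.items()
                if since <= sent_at_ptid}

m = Master()
m.outdate(0)                 # partition 0 outdated
m.outdate(1)                 # partition 1 outdated
delayed = m.ptid             # storage sends UnfinishedTransactions (delayed)
m.discard(1)                 # tweak: partition 1 discarded
m.outdate(1)                 # tweak: partition 1 outdated again
# The delayed packet finally arrives: only partition 0 may become updatable.
assert m.updatable_on_unfinished(delayed) == {0}
```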
-
Julien Muchembled authored
-
Julien Muchembled authored
After an attempt to read from a non-readable cell, which happens when a client has a newer or older partition table than the storage's, the client now retries the read. This bugfix covers all kinds of read access except undoLog, which can still report incomplete results.
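The retry behaviour described above can be sketched like this. All names here are illustrative, not NEO's actual client code: on a "non-readable cell" error, the client resynchronizes its partition table and retries the load.

```python
# Hedged sketch (invented names): retry a load when the storage answers that
# the requested cell is not readable, i.e. the client's partition table is
# older or newer than the storage's.

class NonReadableCell(Exception):
    pass

def load_with_retry(storage, oid, max_retries=5):
    for _ in range(max_retries):
        try:
            return storage.load(oid)
        except NonReadableCell:
            # Resync the partition table with the master before retrying.
            storage.refresh_partition_table()
    raise NonReadableCell(oid)

class FakeStorage:
    """Fails once with a mismatched partition table, then succeeds."""
    def __init__(self):
        self.synced = False
    def load(self, oid):
        if not self.synced:
            raise NonReadableCell(oid)
        return b'data-%d' % oid
    def refresh_partition_table(self):
        self.synced = True

assert load_with_retry(FakeStorage(), 7) == b'data-7'
```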
-
Julien Muchembled authored
-
Julien Muchembled authored
-
Julien Muchembled authored
This reverts commit bddc1802, to fix the following storage crash:

Traceback (most recent call last):
  ...
  File "neo/lib/handler.py", line 72, in dispatch
    method(conn, *args, **kw)
  File "neo/storage/handlers/master.py", line 44, in notifyPartitionChanges
    app.pt.update(ptid, cell_list, app.nm)
  File "neo/lib/pt.py", line 231, in update
    assert node is not None, 'No node found for uuid ' + uuid_str(uuid)
AssertionError: No node found for uuid S3

Partition table updates must also be processed with InitializationHandler when nodes remain in the PENDING state, because such nodes are not added to the cluster.
-
Julien Muchembled authored
-
Julien Muchembled authored
-
Julien Muchembled authored
-
- 30 Mar, 2017 1 commit
-
Julien Muchembled authored
-
- 23 Mar, 2017 9 commits
-
Julien Muchembled authored
In the worst case, with many clients trying to lock the same oids, the cluster could enter an infinite cascade of deadlocks. Here is an overview with 3 storage nodes and 3 transactions:

 S1     S2     S3     order of locking tids   # abbreviations:
 l1     l1     l2     123                     # l: lock
 q23    q23    d1q3   231                     # d: deadlock triggered
 r1:l3  r1:l2  (r1)         # for S3, we still have l2
                            # q: queued
 d2q1   q13    q13    312                     # r: rebase

Above, we show what happens when a random transaction gets a lock just after another one is rebased. Here, the result is that the last 2 lines are a permutation of the first 2, and this can repeat indefinitely with bad luck.

This commit reduces the probability of deadlock by processing delayed stores/checks in the order of their locking tid. In the above example, S1 would give the lock to 2 when 1 is rebased, and 2 would vote successfully.
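The ordering idea behind the fix can be sketched with a priority queue. This is an illustration of the principle, not the actual NEO code: delayed stores/checks are replayed in increasing order of locking tid, so the oldest transaction gets the lock first instead of a random one.

```python
# Sketch of the fix's ordering idea: replay delayed operations in
# increasing order of locking tid, using a heap.

import heapq

class DelayedOperations:
    def __init__(self):
        self._heap = []
        self._counter = 0      # tie-breaker, keeps insertion order stable

    def delay(self, locking_tid, operation):
        heapq.heappush(self._heap, (locking_tid, self._counter, operation))
        self._counter += 1

    def replay(self):
        # Pop operations in locking-tid order, oldest transaction first.
        while self._heap:
            yield heapq.heappop(self._heap)[2]

ops = DelayedOperations()
ops.delay(3, 'store by t3')
ops.delay(1, 'store by t1')
ops.delay(2, 'check by t2')
assert list(ops.replay()) == ['store by t1', 'check by t2', 'store by t3']
```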
-
Julien Muchembled authored
-
Julien Muchembled authored
This fixes a bug that could lead to data corruption or crashes.
-
Julien Muchembled authored
It becomes possible to answer with several packets:
- the last one is the usual associated answer packet
- all other (previously sent) packets are notifications

Connection.send does not return the packet id anymore. This is not useful enough, and the caller can inspect the sent packet (getId).
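A simplified model of this answering scheme, with invented classes (not NEO's real Connection/Packet API): several packets may be sent for one request, only the final one carries the request's id, and send() returns None while callers inspect the packet via getId().

```python
# Illustrative model (invented names): multi-packet answers where only the
# last packet is the real answer, and send() no longer returns the id.

class Packet:
    def __init__(self, name, pid=None):
        self.name = name
        self._id = pid
    def setId(self, pid):
        self._id = pid
    def getId(self):
        return self._id

class Connection:
    def __init__(self):
        self.sent = []
    def send(self, packet):
        # Returns None: callers that need the id call packet.getId().
        self.sent.append(packet)
    def answer(self, packet, request_id):
        # Only the final packet of the exchange carries the request id.
        packet.setId(request_id)
        self.send(packet)

conn = Connection()
conn.send(Packet('NotifyProgress'))            # notification, no request id
conn.answer(Packet('AnswerFetch'), request_id=42)
assert [p.getId() for p in conn.sent] == [None, 42]
assert conn.sent[-1].getId() == 42             # caller can still get the id
```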
-
Julien Muchembled authored
-
Julien Muchembled authored
-
Julien Muchembled authored
-
Julien Muchembled authored
-
Julien Muchembled authored
-
- 22 Mar, 2017 1 commit
-
Julien Muchembled authored
In reality, this was tested with taskset 1 neotestrunner ...
-
- 21 Mar, 2017 1 commit
-
Julien Muchembled authored
-
- 20 Mar, 2017 2 commits
-
Julien Muchembled authored
-
Julien Muchembled authored
-
- 18 Mar, 2017 1 commit
-
Julien Muchembled authored
Traceback (most recent call last):
  ...
  File "neo/lib/handler.py", line 72, in dispatch
    method(conn, *args, **kw)
  File "neo/master/handlers/client.py", line 70, in askFinishTransaction
    conn.getPeerId(),
  File "neo/master/transactions.py", line 387, in prepare
    assert node_list, (ready, failed)
AssertionError: (set([]), frozenset([]))

Master log leading to the crash:

  PACKET #0x0009 StartOperation          > S1
  PACKET #0x0004 BeginTransaction        < C1
  DEBUG   Begin <...>
  PACKET #0x0004 AnswerBeginTransaction  > C1
  PACKET #0x0001 NotifyReady             < S1

It was wrong to process BeginTransaction before receiving NotifyReady.

The changes in the storage are cosmetic: the 'ready' attribute has become redundant with 'operational'.
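The ordering constraint can be sketched as follows. This is a hypothetical model, not the master's real handler code: BeginTransaction requests are delayed until every storage node has sent NotifyReady, otherwise prepare() later finds no ready node and the assertion above fires.

```python
# Minimal sketch (invented API): the master must not process
# BeginTransaction before all storages have notified they are ready.

class Master:
    def __init__(self, storages):
        self.pending = set(storages)   # storages that have not sent NotifyReady
        self.queued = []               # delayed BeginTransaction requests

    def notifyReady(self, storage):
        self.pending.discard(storage)
        if not self.pending:
            # Safe to begin the delayed transactions now.
            started, self.queued = self.queued, []
            return started
        return []

    def askBeginTransaction(self, client):
        if self.pending:
            # The bug was to process this immediately instead of delaying.
            self.queued.append(client)
            return None
        return 'ttid-for-%s' % client   # placeholder for a real ttid

m = Master(['S1'])
assert m.askBeginTransaction('C1') is None   # delayed, not processed
assert m.notifyReady('S1') == ['C1']         # processed once S1 is ready
assert m.askBeginTransaction('C2') == 'ttid-for-C2'
```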
-
- 17 Mar, 2017 3 commits
-
Julien Muchembled authored
-
Julien Muchembled authored
Due to a bug in MariaDB Connector/C 2.3.2, some tests like testBasicStore and test_max_allowed_packet were retrying the same failing query indefinitely.
-
Julien Muchembled authored
-
- 14 Mar, 2017 4 commits
-
Julien Muchembled authored
On clusters with many deadlock avoidances, this flooded logs. Hopefully, this commit reduces the size of logs without losing information.
-
Julien Muchembled authored
An issue that happened for the first time on a storage node didn't always cause other nodes to flush their logs, which made debugging difficult.
-
Julien Muchembled authored
-
Julien Muchembled authored
-
- 07 Mar, 2017 1 commit
-
Julien Muchembled authored
-
- 03 Mar, 2017 1 commit
-
Julien Muchembled authored
Generators are not thread-safe:

Exception in thread T2:
Traceback (most recent call last):
  ...
  File "ZODB/tests/StorageTestBase.py", line 157, in _dostore
    r2 = self._storage.tpc_vote(t)
  File "neo/client/Storage.py", line 95, in tpc_vote
    return self.app.tpc_vote(transaction)
  File "neo/client/app.py", line 507, in tpc_vote
    self.waitStoreResponses(txn_context)
  File "neo/client/app.py", line 500, in waitStoreResponses
    _waitAnyTransactionMessage(txn_context)
  File "neo/client/app.py", line 145, in _waitAnyTransactionMessage
    self._waitAnyMessage(queue, block=block)
  File "neo/client/app.py", line 128, in _waitAnyMessage
    conn, packet, kw = get(block)
  File "neo/lib/locking.py", line 203, in get
    self._lock()
  File "neo/tests/threaded/__init__.py", line 590, in _lock
    for i in TIC_LOOP:
ValueError: generator already executing

======================================================================
FAIL: check_checkCurrentSerialInTransaction (neo.tests.zodb.testBasic.BasicTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "neo/tests/zodb/testBasic.py", line 33, in check_checkCurrentSerialInTransaction
    super(BasicTests, self).check_checkCurrentSerialInTransaction()
  File "ZODB/tests/BasicStorage.py", line 294, in check_checkCurrentSerialInTransaction
    utils.load_current(self._storage, b'\0\0\0\0\0\0\0\xf4')[1])
failureException: False is not true
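The failure mode itself is easy to reproduce without threads: CPython refuses to resume a generator frame that is already executing, which is exactly what happened when two threads iterated the shared TIC_LOOP generator. Re-entering a generator from inside its own frame triggers the same error:

```python
# Self-contained demonstration of "ValueError: generator already executing".
# Resuming a generator while it is already running is forbidden by CPython;
# two racing threads hit the same check as this single-threaded re-entry.

def demo():
    def reenter():
        try:
            next(gen)          # resume the generator from inside itself
        except ValueError as e:
            yield str(e)       # capture the interpreter's error message
    gen = reenter()            # closure: reenter sees `gen` at call time
    return next(gen)

assert demo() == 'generator already executing'
```

This is why the fix must protect the shared generator (or avoid sharing it) rather than relying on generators behaving like thread-safe iterators.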
-
- 02 Mar, 2017 2 commits
-
Julien Muchembled authored
This is done by moving self.replicator.populate() after the switch to MasterOperationHandler, so that the latter is not delayed.

This change comes with some refactoring of the main loop, to clean up app.checker and app.replicator properly (like app.tm).

Another option could have been to process notifications with the last handler, instead of the first one. But if possible, cleaning up the whole code so that handlers are no longer delayed looks like the best option.
-
Julien Muchembled authored
-
- 27 Feb, 2017 3 commits
-
Julien Muchembled authored
This happened in 2 cases:
- Commit a4c06242 ("Review aborting of transactions") introduced a race condition causing oids to remain write-locked forever after the transaction modifying them is aborted.
- An unfinished transaction is not locked/unlocked during tpc_finish: oids must be unlocked when being notified that the transaction is finished.
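The invariant being restored can be sketched as follows. This is a hedged illustration with invented names, not NEO's lock manager: every path that ends a transaction, abort as well as the finished notification, must release that transaction's write locks, or the oids stay locked forever.

```python
# Sketch (invented structure): a write-lock table where both abort and the
# "transaction finished" notification release the transaction's locks.

class LockTable:
    def __init__(self):
        self.write_locks = {}          # oid -> ttid holding the write lock

    def lock(self, oid, ttid):
        # A second locker would be queued in the real code; here we just
        # assert the oid is free or already ours.
        assert self.write_locks.setdefault(oid, ttid) == ttid

    def abort(self, ttid):
        self._release(ttid)            # the buggy race skipped this path

    def notify_finished(self, ttid):
        self._release(ttid)            # unfinished transactions end here

    def _release(self, ttid):
        for oid in [o for o, t in self.write_locks.items() if t == ttid]:
            del self.write_locks[oid]

lt = LockTable()
lt.lock(1, 'ttid-A')
lt.abort('ttid-A')
assert not lt.write_locks              # nothing stays write-locked forever
```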
-
Julien Muchembled authored
This was found by the first assertion of answerRebaseObject (client), because a storage node missed a few transactions and reported a conflict with an older serial than the one being stored: this must never happen, and this commit adds a more generic assertion on the storage side.

The above case occurs when the "first phase" of replication of a partition (all history up to the tid before unfinished transactions) ends after the unfinished transactions are finished: this was a corruption bug, where UP_TO_DATE cells could miss data. Otherwise, if the "first phase" ended before, the partition merely remained stuck in the OUT_OF_DATE state, and restarting the storage node was enough to recover.
-
Julien Muchembled authored
Traceback (most recent call last):
  ...
  File "neo/client/app.py", line 507, in tpc_vote
    self.waitStoreResponses(txn_context)
  File "neo/client/app.py", line 500, in waitStoreResponses
    _waitAnyTransactionMessage(txn_context)
  File "neo/client/app.py", line 150, in _waitAnyTransactionMessage
    self._handleConflicts(txn_context)
  File "neo/client/app.py", line 474, in _handleConflicts
    self._store(txn_context, oid, conflict_serial, data)
  File "neo/client/app.py", line 410, in _store
    self._waitAnyTransactionMessage(txn_context, False)
  File "neo/client/app.py", line 145, in _waitAnyTransactionMessage
    self._waitAnyMessage(queue, block=block)
  File "neo/client/app.py", line 133, in _waitAnyMessage
    _handlePacket(conn, packet, kw)
  File "neo/lib/threaded_app.py", line 133, in _handlePacket
    handler.dispatch(conn, packet, kw)
  File "neo/lib/handler.py", line 72, in dispatch
    method(conn, *args, **kw)
  File "neo/client/handlers/storage.py", line 122, in answerRebaseObject
    assert txn_context.conflict_dict[oid] == (serial, conflict)
AssertionError

Scenario:
0. unanswered rebase from S2
1. conflict resolved between t1 and t2 -> S1 & S2
2. S1 reports a new conflict
3. S2 answers to the rebase: returned serial (t1) is smaller than in conflict_dict (t2)
4. S2 reports the same conflict as in 2
-