Commits · 655a4ea946167ec08b4007afae0be9b8f07eaac7 · Stefane Fermigier / neo

20 Mar, 2017 2 commits
- client: fix harmless 'unexpected ... AnswerRequestIdentification' exceptions · 655a4ea9
  Julien Muchembled authored Mar 20, 2017
  
  655a4ea9
- qa: do not always use MySQL backend in testPack (neo.tests.zodb) · 2cb7bf1b
  Julien Muchembled authored Mar 20, 2017
  
  2cb7bf1b
18 Mar, 2017 1 commit

master: fix crash when a transaction begins while a storage node starts operation · 781b4eb5

Julien Muchembled authored Mar 17, 2017

Traceback (most recent call last):
  ...
  File "neo/lib/handler.py", line 72, in dispatch
    method(conn, *args, **kw)
  File "neo/master/handlers/client.py", line 70, in askFinishTransaction
    conn.getPeerId(),
  File "neo/master/transactions.py", line 387, in prepare
    assert node_list, (ready, failed)
AssertionError: (set([]), frozenset([]))

Master log leading to the crash:
  PACKET    #0x0009 StartOperation                 > S1
  PACKET    #0x0004 BeginTransaction               < C1
  DEBUG     Begin <...>
  PACKET    #0x0004 AnswerBeginTransaction         > C1
  PACKET    #0x0001 NotifyReady                    < S1

It was wrong to process BeginTransaction before receiving NotifyReady.

The changes in the storage are cosmetic: the 'ready' attribute has become
redundant with 'operational'.

781b4eb5

17 Mar, 2017 3 commits
- qa: fix ConnectionFilter bug causing packets to be stuck after __exit__/remove · 0fd3b652
  Julien Muchembled authored Mar 17, 2017
  
  0fd3b652
- mysql: do not retry a failing query forever · f0c45ea4
  Julien Muchembled authored Mar 16, 2017
```
Due to a bug in MariaDB Connector/C 2.3.2, some tests like testBasicStore and
test_max_allowed_packet were retrying the same failing query indefinitely.
```
  f0c45ea4
- qa: fix tests to not loop forever when the master dies unexpectedly · d0d0c143
  Julien Muchembled authored Mar 17, 2017
  
  d0d0c143
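
For f0c45ea4 above, a generic bounded-retry sketch. This is not the code of the commit: `QueryFailed`, `run_query`, the `execute` callable and the limit of 3 attempts are all hypothetical names and values chosen for illustration.

```python
class QueryFailed(Exception):
    """Hypothetical error standing in for a driver's operational error."""


def run_query(execute, query, max_attempts=3):
    """Call execute(query), giving up after max_attempts failures."""
    last_error = None
    for _ in range(max_attempts):
        try:
            return execute(query)
        except QueryFailed as exc:
            last_error = exc  # remember why this attempt failed
    # Surface the failure instead of retrying the same query forever.
    raise RuntimeError(
        "giving up on %r after %d attempts" % (query, max_attempts)
    ) from last_error
```

With a cap, a query that can never succeed makes the test fail quickly instead of hanging the run.
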
14 Mar, 2017 4 commits
- storage: avoid repeated 'Lock delayed' logs · e5fd0233
  Julien Muchembled authored Mar 10, 2017
```
On clusters with many deadlock avoidances, this flooded logs.
Hopefully, this commit reduces the size of logs without losing information.
```
  e5fd0233
- Warn when a cell becomes non-readable whereas all cells were readable · 3a39ac9a
  Julien Muchembled authored Mar 09, 2017
```
An issue that happened for the first time on a storage node didn't always cause
other nodes to flush their logs, which made debugging difficult.
```
  3a39ac9a
- Code clean up: PartitionTable · 1eed0239
  Julien Muchembled authored Mar 09, 2017
  
  1eed0239
- mysql: do not flood logs when retrying to connect non-stop · b61ee7f1
  Julien Muchembled authored Mar 13, 2017
  
  b61ee7f1
07 Mar, 2017 1 commit
- storage: fix possible KeyError when notifying about replicated partitions · ed966e80
  Julien Muchembled authored Mar 07, 2017
  
  ed966e80
03 Mar, 2017 1 commit

qa: fix random failure of check_checkCurrentSerialInTransaction · fec9a3a5

Julien Muchembled authored Mar 03, 2017

Generators are not thread-safe:

Exception in thread T2:
Traceback (most recent call last):
  ...
  File "ZODB/tests/StorageTestBase.py", line 157, in _dostore
    r2 = self._storage.tpc_vote(t)
  File "neo/client/Storage.py", line 95, in tpc_vote
    return self.app.tpc_vote(transaction)
  File "neo/client/app.py", line 507, in tpc_vote
    self.waitStoreResponses(txn_context)
  File "neo/client/app.py", line 500, in waitStoreResponses
    _waitAnyTransactionMessage(txn_context)
  File "neo/client/app.py", line 145, in _waitAnyTransactionMessage
    self._waitAnyMessage(queue, block=block)
  File "neo/client/app.py", line 128, in _waitAnyMessage
    conn, packet, kw = get(block)
  File "neo/lib/locking.py", line 203, in get
    self._lock()
  File "neo/tests/threaded/__init__.py", line 590, in _lock
    for i in TIC_LOOP:
ValueError: generator already executing

======================================================================
FAIL: check_checkCurrentSerialInTransaction (neo.tests.zodb.testBasic.BasicTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "neo/tests/zodb/testBasic.py", line 33, in check_checkCurrentSerialInTransaction
    super(BasicTests, self).check_checkCurrentSerialInTransaction()
  File "ZODB/tests/BasicStorage.py", line 294, in check_checkCurrentSerialInTransaction
    utils.load_current(self._storage, b'\0\0\0\0\0\0\0\xf4')[1])
failureException: False is not true
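
The `ValueError` above can be reproduced outside NEO. The sketch below (Python 3, all names hypothetical) parks one thread inside a generator and resumes the same generator from another thread, then shows one common remedy, a lock around `next()`. It is not necessarily how this commit fixes the test framework.

```python
import threading


def blocking_ticks(entered, release):
    """Generator that parks inside its body so it can be caught mid-execution."""
    while True:
        entered.set()    # the generator frame is now executing
        release.wait()   # stay inside the frame until told to continue
        yield


def reproduce():
    """Return the message raised when a running generator is resumed again."""
    entered, release = threading.Event(), threading.Event()
    gen = blocking_ticks(entered, release)
    worker = threading.Thread(target=next, args=(gen,))
    worker.start()
    entered.wait()       # the worker is now inside the generator
    try:
        next(gen)        # second resume while the first one is still running
    except ValueError as exc:
        return str(exc)
    finally:
        release.set()
        worker.join()


class LockedIterator:
    """Serialize next() calls on an iterator shared between threads."""

    def __init__(self, iterable):
        self._it = iter(iterable)
        self._lock = threading.Lock()

    def __iter__(self):
        return self

    def __next__(self):
        with self._lock:
            return next(self._it)


def drain_concurrently(n_threads, per_thread):
    """Pull values from one lock-protected generator in several threads."""
    def counter():
        n = 0
        while True:
            yield n
            n += 1

    shared = LockedIterator(counter())
    seen = []

    def pull():
        for _ in range(per_thread):
            seen.append(next(shared))

    threads = [threading.Thread(target=pull) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(seen)
```

`reproduce()` returns the same message as in the traceback, while `drain_concurrently(4, 500)` completes without error and yields each value exactly once.
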

fec9a3a5

02 Mar, 2017 2 commits

storage: fix PT updates in case of late AnswerUnfinishedTransactions · a74937c8

Julien Muchembled authored Feb 28, 2017

This is done by moving
        self.replicator.populate()
after the switch to MasterOperationHandler, so that the latter is not delayed.

This change comes with some refactoring of the main loop,
to clean up app.checker and app.replicator properly (like app.tm).

Another option could have been to process notifications with the last handler
instead of the first one. But if possible, cleaning up the whole code so that
handlers are no longer delayed looks like the best option.

a74937c8

mysql: code clean up · 041a3eda
Julien Muchembled authored Feb 24, 2017

041a3eda

27 Feb, 2017 3 commits

Fix oids remaining write-locked forever · 9b33b1db

Julien Muchembled authored Feb 24, 2017

This happened in 2 cases:
- Commit a4c06242 ("Review aborting of transactions") introduced a race
  condition causing oids to remain write-locked forever after the transaction
  modifying them is aborted.
- An unfinished transaction is not locked/unlocked during tpc_finish: oids
  must be unlocked when being notified that the transaction is finished.

9b33b1db

storage: fix bug not replicating unfinished transactions when the last ones are aborted · 7f754b5e

Julien Muchembled authored Feb 24, 2017

This was found by the first assertion of answerRebaseObject (client) because
a storage node missed a few transactions and reported a conflict with an older
serial than the one being stored: this must never happen and this commit adds a
more generic assertion on the storage side.

The above case is when the "first phase" of replication of a partition
(all history up to the tid before unfinished transactions) ended after
the unfinished transactions were finished: this was a corruption bug,
where UP_TO_DATE cells could miss data.

Otherwise, if the "first phase" ended before, then the partition remained stuck
in OUT_OF_DATE state. Restarting the storage node was enough to recover.

7f754b5e

client: fix an AssertionError while processing late AnswerRebaseObject · 44452395

Julien Muchembled authored Feb 24, 2017

Traceback (most recent call last):
  ...
  File "neo/client/app.py", line 507, in tpc_vote
    self.waitStoreResponses(txn_context)
  File "neo/client/app.py", line 500, in waitStoreResponses
    _waitAnyTransactionMessage(txn_context)
  File "neo/client/app.py", line 150, in _waitAnyTransactionMessage
    self._handleConflicts(txn_context)
  File "neo/client/app.py", line 474, in _handleConflicts
    self._store(txn_context, oid, conflict_serial, data)
  File "neo/client/app.py", line 410, in _store
    self._waitAnyTransactionMessage(txn_context, False)
  File "neo/client/app.py", line 145, in _waitAnyTransactionMessage
    self._waitAnyMessage(queue, block=block)
  File "neo/client/app.py", line 133, in _waitAnyMessage
    _handlePacket(conn, packet, kw)
  File "neo/lib/threaded_app.py", line 133, in _handlePacket
    handler.dispatch(conn, packet, kw)
  File "neo/lib/handler.py", line 72, in dispatch
    method(conn, *args, **kw)
  File "neo/client/handlers/storage.py", line 122, in answerRebaseObject
    assert txn_context.conflict_dict[oid] == (serial, conflict)
AssertionError

Scenario:
0. unanswered rebase from S2
1. conflict resolved between t1 and t2 -> S1 & S2
2. S1 reports a new conflict
3. S2 answers the rebase:
   returned serial (t1) is smaller than in conflict_dict (t2)
4. S2 reports the same conflict as in 2

44452395

24 Feb, 2017 2 commits

storage: fix an AssertionError in internal replication · 560e4fb1

Julien Muchembled authored Feb 24, 2017

Traceback (most recent call last):
  ...
  File "neo/storage/handlers/storage.py", line 111, in answerFetchObjects
    self.app.replicator.finish()
  File "neo/storage/replicator.py", line 370, in finish
    self._nextPartition()
  File "neo/storage/replicator.py", line 279, in _nextPartition
    assert app.pt.getCell(offset, app.uuid).isOutOfDate()
AssertionError

The scenario is:
1. partition A: start of replication, with unfinished transactions
2. partition A: all unfinished transactions are finished
3. partition A: end of replication with ReplicationDone notification
4. replication of partition B
5. partition A: AssertionError when starting replication

The bug is in step 3: partition A is only partially replicated at that point,
so the storage node must not notify the master.

560e4fb1

Improve log messages that are related to write-locking of objects · 61f72f9b
Julien Muchembled authored Feb 16, 2017

61f72f9b

23 Feb, 2017 1 commit

logger: fix backlog with a limit on packet size · 05cd65b6

Julien Muchembled authored Feb 23, 2017

This fixes testBasicStore when run with the MySQL backend, which started to fail
with commit 9eb06ff1 when the -L runner option is not used.

05cd65b6

21 Feb, 2017 6 commits

Remove obsolete comment · df01cdcf
Julien Muchembled authored Feb 09, 2017

df01cdcf
Fix several issues when undoing transactions with conflict resolutions · ee5cb1f9
Julien Muchembled authored Jan 30, 2017

ee5cb1f9
Bump protocol version · c42baaef
Julien Muchembled authored Feb 21, 2017

c42baaef

Implement deadlock avoidance · 092992db

Julien Muchembled authored Dec 22, 2016

This is a first version with several optimizations possible:
- improve EventQueue (or implement a specific queue) to minimize deadlocks
- turn the RebaseObject packet into a notification

Sorting oids could also be useful to reduce the probability of deadlocks,
but that would never be enough to avoid them completely, even if there's a
single storage. For example:

1. C1 does a first store (x or y)
2. C2 stores x and y; one is delayed
3. C1 stores the other -> deadlock
   When solving the deadlock, the data of the first store may only
   exist on the storage.
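
The scenario above can be replayed on a toy lock table. This sketch is not NEO code: `simulate` and `has_deadlock` are hypothetical helpers, and it assumes that a client sends its stores without waiting for earlier ones to be granted, which is why sorting oids does not help here.

```python
def simulate(stores):
    """Replay (client, oid) store requests against a single toy lock table.

    Returns (owners, waiting): which client holds each oid, and which
    client each delayed client is waiting for.
    """
    owners, waiting = {}, {}
    for client, oid in stores:
        holder = owners.setdefault(oid, client)  # first store takes the lock
        if holder != client:
            waiting[client] = holder             # this store is delayed
    return owners, waiting


def has_deadlock(waiting):
    """True if the wait-for graph contains a cycle."""
    for start in waiting:
        seen, node = set(), start
        while node in waiting:
            if node in seen:
                return True
            seen.add(node)
            node = waiting[node]
    return False


# Both clients store oids in sorted order (x before y).
owners, waiting = simulate([
    ("C1", "x"),  # 1. C1 does a first store
    ("C2", "x"),  # 2. C2 stores x (delayed behind C1) ...
    ("C2", "y"),  #    ... and y (granted)
    ("C1", "y"),  # 3. C1 stores the other oid: delayed behind C2
])
deadlocked = has_deadlock(waiting)
```

Here C2 waits for C1 on x while C1 waits for C2 on y, so `deadlocked` is true even though there is a single lock table.
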

2 functional tests are removed because they're redundant,
either with ZODB tests or with the new threaded tests.

092992db

Fixes/improvements to EventQueue · cc8d0a7c

Julien Muchembled authored Feb 02, 2017

- Make sure that errors while processing a delayed packet are reported to the
  connection that sent this packet.
- Provide a mechanism to process events for the same connection in
  chronological order.
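
A minimal sketch of the two properties listed above, under stated assumptions: `Delayed`, `Conn` and `DelayedEvents` are hypothetical names, and this toy is not NEO's EventQueue. A packet never overtakes an older delayed packet of the same connection, and an error raised while processing a delayed packet is reported to the connection that sent it.

```python
from collections import deque


class Delayed(Exception):
    """Raised by a handler that cannot process its packet yet (hypothetical)."""


class Conn:
    """Stand-in for a connection; errors are reported by appending here."""

    def __init__(self, name):
        self.name = name
        self.errors = []


class DelayedEvents:
    """Toy queue of delayed packets. This is not NEO's EventQueue."""

    def __init__(self):
        self._pending = deque()  # (conn, handler, packet), in arrival order

    def submit(self, conn, handler, packet):
        # A packet must not overtake older delayed packets of its connection.
        behind_older = any(c is conn for c, _, _ in self._pending)
        if behind_older or not self._process(conn, handler, packet):
            self._pending.append((conn, handler, packet))

    def retry(self):
        """Reprocess delayed packets, oldest first."""
        blocked = set()  # connections that still have a delayed packet
        pending, self._pending = self._pending, deque()
        for conn, handler, packet in pending:
            if conn in blocked or not self._process(conn, handler, packet):
                blocked.add(conn)
                self._pending.append((conn, handler, packet))

    @staticmethod
    def _process(conn, handler, packet):
        """Return False if the packet must stay delayed."""
        try:
            handler(conn, packet)
        except Delayed:
            return False
        except Exception as exc:
            # Report the failure to the connection that sent the packet.
            conn.errors.append((packet, exc))
        return True


def demo():
    locked = {"x"}
    log = []

    def store(conn, packet):
        if packet in locked:
            raise Delayed
        if packet == "bad":
            raise ValueError("bad packet")
        log.append((conn.name, packet))

    a, b = Conn("A"), Conn("B")
    queue = DelayedEvents()
    queue.submit(a, store, "x")    # delayed: x is locked
    queue.submit(a, store, "y")    # queued behind x to keep A's order
    queue.submit(b, store, "z")    # other connection: processed at once
    queue.submit(a, store, "bad")  # queued behind x and y
    locked.clear()
    queue.retry()
    return log, a.errors, b.errors
```

In `demo()`, connection B is served immediately, A's packets are processed in the order x, y once the lock is released, and the `ValueError` ends up in A's error list only.
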

cc8d0a7c

Change order of oid/serial fields in CheckCurrentSerial packet · 3e6adac3
Julien Muchembled authored Dec 22, 2016

3e6adac3

14 Feb, 2017 8 commits
- Fix conflict handling after a successful store to a node being disconnected... · 74c69d54
  Julien Muchembled authored Jan 24, 2017
```
Fix conflict handling after a successful store to a node being disconnected for having missed a transaction
```
  74c69d54
- client: use a class instead of a simple dict to hold transaction information · d3780906
  Julien Muchembled authored Jan 23, 2017
  
  d3780906
- client: make tpc_vote compute its return value only if successful · 97e57031
  Julien Muchembled authored Jan 19, 2017
  
  97e57031
- client: do not wait for tpc_vote to start resolving conflicts · 5ae69542
  Julien Muchembled authored Jan 18, 2017
```
- fail sooner in case of unresolvable conflict
- avoid OOM when there are many conflicts
```
  5ae69542
- Review aborting of transactions · a4c06242
  Julien Muchembled authored Jan 10, 2017
  
  a4c06242
- Make sure transactions are committed in full when using internal replication · dc7a129c
  Julien Muchembled authored Jan 04, 2017
  
  dc7a129c
- Update TODO · cf6e48ea
  Julien Muchembled authored Jan 05, 2017
  
  cf6e48ea
- Lockless stores/checks during replication · 7af948cf
  Julien Muchembled authored Jan 04, 2017
  
  7af948cf
02 Feb, 2017 6 commits
- client: comment TransactionContainer and drop duplicate 'object_base_serial_dict' · 3a93658b
  Julien Muchembled authored Dec 30, 2016
  
  3a93658b
- Delayed connection acceptance when the storage node is ready · b7a5bc99
  Julien Muchembled authored Jan 03, 2017
```
Now that we do inequality comparisons between timestamps, the master must
use a monotonic clock, to avoid issues when the clock is turned back.
Before, the probability that time.time() returned the same value again was
probably negligible.
```
  b7a5bc99
- qa: new --log runner option · 9eb06ff1
  Julien Muchembled authored Feb 02, 2017
  
  9eb06ff1
- client: do not loop forever if conflicts happen on a large amount of data · 0e133ebb
  Julien Muchembled authored Jan 02, 2017
  
  0e133ebb
- client: simplify handling of conflicts · aeb8549b
  Julien Muchembled authored Dec 31, 2016
  
  aeb8549b
- More generic handling of closed MTConnection with pending requests · 8bf149d0
  Julien Muchembled authored Jan 06, 2017
```
This optimizes the normal case, and handlers can now take specific action
when requests are cancelled because a connection is closed.
```
  8bf149d0
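
For b7a5bc99 above, a small illustration of why inequality comparisons between timestamps call for a monotonic clock. It is a sketch in Python 3, where `time.monotonic()` is available (3.3+); how the commit itself obtains a monotonic clock is not shown in this log. `SteppedClock` and `is_newer` are hypothetical names.

```python
import time


class SteppedClock:
    """Hypothetical wall clock that an administrator can turn back."""

    def __init__(self, now):
        self.now = now

    def __call__(self):
        return self.now


def is_newer(candidate, reference):
    """Inequality comparison between two timestamps of the same clock."""
    return candidate > reference


# Wall clock: the second event happens later but the clock was turned back.
wall = SteppedClock(1000.0)
first = wall()
wall.now -= 3600                      # clock turned back by one hour
second = wall()
wall_ok = is_newer(second, first)     # the later event looks older

# Monotonic clock: successive readings never decrease.
first_m = time.monotonic()
second_m = time.monotonic()
mono_ok = second_m >= first_m
```

With the stepped wall clock the comparison gives the wrong answer, whereas two successive `time.monotonic()` readings can be equal but never go backwards.
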