1. 04 Apr, 2018 1 commit
    • tests/cluster: speedup waiting a bit · 2bef65b7
      Kirill Smelkov authored
      NEO functional tests use pdb.wait() in a few places, for example in
      NEOCluster.run(), .start() and .expectCondition(). The wait
      implementation uses polling with an exponentially growing wait period.
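
      A minimal, hypothetical sketch of such a polling wait (a rough
      illustration, not the actual cluster.py code; the 1.5 growth factor is
      inferred from the printed values below):

      	import time
      	
      	def wait(test, timeout):
      	    # Poll test() until it returns True or the timeout expires.
      	    # Before this patch the sleep period grew exponentially
      	    # (0.1s, 0.15s, 0.225s, ...); after it, a constant period
      	    # about one order of magnitude smaller (~0.01s) is used.
      	    deadline = time.time() + timeout
      	    next_sleep = 0.1                 # pre-patch initial period
      	    while time.time() < deadline:
      	        if test():
      	            return True
      	        time.sleep(next_sleep)
      	        next_sleep *= 1.5            # pre-patch exponential growth
      	    return False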
      
      With the following instrumentation
      
      	--- a/neo/tests/cluster.py
      	+++ b/neo/tests/cluster.py
      	@@ -236,6 +236,7 @@ def wait(self, test, timeout):
      	                         return False
      	             finally:
      	                 cluster_dict.release()
      	+            print 'next_sleep:', next_sleep
      	             sleep(next_sleep)
      	         return True
      
      during the execution of functional tests it is not uncommon to see the
      following sleep periods:
      
      	next_sleep: 0.1
      	next_sleep: 0.1
      	next_sleep: 0.15
      	next_sleep: 0.225
      	next_sleep: 0.3375
      	next_sleep: 0.50625
      	next_sleep: 0.1
      	next_sleep: 0.1
      	next_sleep: 0.15
      	next_sleep: 0.225
      	next_sleep: 0.3375
      	next_sleep: 0.50625
      	next_sleep: 0.1
      	next_sleep: 0.1
      	next_sleep: 0.1
      	next_sleep: 0.15
      	next_sleep: 0.225
      	next_sleep: 0.3375
      	next_sleep: 0.1
      	next_sleep: 0.1
      	next_sleep: 0.1
      	next_sleep: 0.1
      	next_sleep: 0.1
      	next_sleep: 0.1
      	next_sleep: 0.1
      	next_sleep: 0.1
      	next_sleep: 0.1
      	next_sleep: 0.1
      	next_sleep: 0.15
      	next_sleep: 0.225
      	next_sleep: 0.3375
      	next_sleep: 0.50625
      
      .
      
      Without going into reworking the wait mechanism to use real
      notifications instead of polling, it was observed that the exponential
      progression tends to create sleeps that are too coarse. The initial
      0.1s interval was also found to be too long.
      
      This patch removes the exponential period growth and reduces the period
      by one order of magnitude. The functional test timings on my computer
      are thus:
      
      before patch:
      
      	Functional tests
      
      	28 Tests, 0 Failed
      
      	Title                     : TestRunner
      	Date                      : 2018-04-04
      	Node                      : deco
      	Machine                   : x86_64
      	System                    : Linux
      	Python                    : 2.7.14
      
      	Directory                 : /tmp/neo_tests/1522868674.115798
      	Status                    : 100.000%
      	NEO_TESTS_ADAPTER         : SQLite
      
      	                               NEO TESTS REPORT
      
      	              Test Module |  run  | unexpected | expected | skipped |  time
      	--------------------------+-------+------------+----------+---------+----------
      	                   Client |    6  |       .    |      .   |     .   |   8.51s
      	                  Cluster |    7  |       .    |      .   |     .   |   9.84s
      	                   Master |    4  |       .    |      .   |     .   |   9.68s
      	                  Storage |   11  |       .    |      .   |     .   |  20.76s
      	--------------------------+-------+------------+----------+---------+----------
      	     neo.tests.functional |       |            |          |         |
      	--------------------------+-------+------------+----------+---------+----------
      	                  Summary |   28  |       .    |      .   |     .   |  48.79s
      	--------------------------+-------+------------+----------+---------+----------
      
      after patch:
      
      	Functional tests
      
      	28 Tests, 0 Failed
      
      	Title                     : TestRunner
      	Date                      : 2018-04-04
      	Node                      : deco
      	Machine                   : x86_64
      	System                    : Linux
      	Python                    : 2.7.14
      
      	Directory                 : /tmp/neo_tests/1522868527.624376
      	Status                    : 100.000%
      	NEO_TESTS_ADAPTER         : SQLite
      
      	                               NEO TESTS REPORT
      
      	              Test Module |  run  | unexpected | expected | skipped |  time
      	--------------------------+-------+------------+----------+---------+----------
      	                   Client |    6  |       .    |      .   |     .   |   7.38s
      	                  Cluster |    7  |       .    |      .   |     .   |   7.05s
      	                   Master |    4  |       .    |      .   |     .   |   8.22s
      	                  Storage |   11  |       .    |      .   |     .   |  19.22s
      	--------------------------+-------+------------+----------+---------+----------
      	     neo.tests.functional |       |            |          |         |
      	--------------------------+-------+------------+----------+---------+----------
      	                  Summary |   28  |       .    |      .   |     .   |  41.87s
      	--------------------------+-------+------------+----------+---------+----------
      
      In other words, this is roughly a 14% improvement (48.79s → 41.87s) in
      the total time to run the functional tests.
      2bef65b7
  2. 29 Mar, 2018 2 commits
    • master: automatically discard feeding cells that get out-of-date · 3efbbfe3
      Julien Muchembled authored
      This is a follow-up of commit 2ca7c335,
      which changed 'tweak' not to discard readable cells too quickly.
      
      The scenario of a storage being lost while it still has feeding cells was
      forgotten. These must be discarded immediately; otherwise we end up with
      more up-to-date cells than wanted. Without the change in outdate(),
      testSafeTweak would end with: UU.|U.U|UUU
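
      A minimal, hypothetical sketch of the idea (not NEO's actual
      PartitionTable code; the names and cell representation are
      assumptions): when a storage node is lost, its FEEDING cells are
      dropped outright instead of merely being marked OUT_OF_DATE.

      	# Illustrative cell states; placeholders, not NEO's real constants.
      	UP_TO_DATE, OUT_OF_DATE, FEEDING = 'U', 'O', 'F'
      	
      	def outdate(partition_cells, lost_node):
      	    # Cells are assumed to be dicts like {'node': ..., 'state': ...}.
      	    # Cells of the lost node that were FEEDING are discarded outright;
      	    # the others are merely marked OUT_OF_DATE.
      	    for cell in list(partition_cells):
      	        if cell['node'] is lost_node:
      	            if cell['state'] == FEEDING:
      	                partition_cells.remove(cell)
      	            else:
      	                cell['state'] = OUT_OF_DATE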
      
      Once replication is optimized not to always restart checking cells from the
      beginning:
      - Remembering that an out-of-date cell was feeding could be a safer
        option, but it may not be worth the extra complexity.
      - Another possibility may be to replace the FEEDING state by an automatic
        partial tweak that only discards excess up-to-date cells whenever a cell
        becomes up-to-date.
      3efbbfe3
    • Julien Muchembled authored
      3443d483
  3. 20 Mar, 2018 2 commits
  4. 14 Mar, 2018 1 commit
    • storage: fix replication of creation undone · c3343279
      Julien Muchembled authored
      For records that undo object creation, None values are used at the backend
      level whereas the protocol is not designed to serialize None for any field.
      
      Therefore, a dance is done in many places around packet serialization,
      using the specific 0/ZERO_HASH/'' triplet to represent a deleted oid. For
      replication, it was missing at the sender side, leading to the following
      crash:
      
        Traceback (most recent call last):
          File "neo/storage/app.py", line 147, in run
            self._run()
          File "neo/storage/app.py", line 178, in _run
            self.doOperation()
          File "neo/storage/app.py", line 257, in doOperation
            next(task_queue[-1]) or task_queue.rotate()
          File "neo/storage/handlers/storage.py", line 271, in push
            conn.send(Packets.AddObject(oid, *object), msg_id)
          File "neo/lib/protocol.py", line 234, in __init__
            self._fmt.encode(buf.write, args)
          File "neo/lib/protocol.py", line 345, in encode
            return self._trace(self._encode, writer, items)
          File "neo/lib/protocol.py", line 334, in _trace
            return method(*args)
          File "neo/lib/protocol.py", line 367, in _encode
            item.encode(writer, value)
          File "neo/lib/protocol.py", line 345, in encode
            return self._trace(self._encode, writer, items)
          File "neo/lib/protocol.py", line 342, in _trace
            raise ParseError(self, trace)
        ParseError: at add_object/checksum:
          File "neo/lib/protocol.py", line 553, in _encode
            assert len(checksum) == 20, (len(checksum), checksum)
        TypeError: object of type 'NoneType' has no len()
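
      For illustration, a hedged sketch of the 0/ZERO_HASH/'' dance described
      above (helper names and the field layout are assumptions, not NEO's
      actual protocol code):

      	ZERO_HASH = '\0' * 20   # SHA-1-sized null checksum
      	
      	def encode_undone_creation(compression, checksum, data):
      	    # Sender side: the protocol cannot serialize None, so a record
      	    # that undoes object creation is sent as the 0/ZERO_HASH/'' triplet.
      	    if data is None:
      	        return 0, ZERO_HASH, ''
      	    return compression, checksum, data
      	
      	def decode_undone_creation(compression, checksum, data):
      	    # Receiver side: map the sentinel triplet back to None values.
      	    if (compression, checksum, data) == (0, ZERO_HASH, ''):
      	        return None, None, None
      	    return compression, checksum, data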
      c3343279
  5. 13 Mar, 2018 1 commit
  6. 02 Mar, 2018 3 commits
    • master: fix resumption of backup replication (internal or not) · 27229793
      Julien Muchembled authored
      Before, it waited for upstream activity until all partitions were touched.
      However, when upstream is idle, the backup cluster could remain stuck
      forever if it had been interrupted while some cells were still late.
      27229793
    • master: fix/simplify generation of TID · 7b2e6752
      Julien Muchembled authored
      The 'min_tid < new_tid' assertion failed when jumping to the past.
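
      A hedged sketch of the usual way to keep generated TIDs monotonic even
      when the clock jumps backwards (TIDs are treated as plain integers here
      for illustration; this is not necessarily NEO's exact code):

      	import time
      	
      	def new_tid(last_tid, clock=time.time):
      	    # A TID is normally derived from the current time; if the clock
      	    # jumps to the past, fall back to last_tid + 1 so that
      	    # new_tid > last_tid always holds (the 'min_tid < new_tid'
      	    # invariant mentioned above).
      	    tid = int(clock() * 1e6)        # microsecond resolution, illustrative
      	    if last_tid is not None and tid <= last_tid:
      	        tid = last_tid + 1
      	    return tid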
      7b2e6752
    • master: fix possible failure when reading data in a backup cluster with replicas · ca2f7061
      Julien Muchembled authored
      Given that:
      - read locks are only taken by transactions (not replication)
      - in backup mode, storage nodes stay in UP_TO_DATE state, even if partitions
        are synchronized up to different tids
      
      there was a race condition with the master node replying to LastTransaction
      with a TID that may not be replicated yet by all replicas, potentially
      causing such replicas to reply OidDoesNotExist or OidNotFound if a client
      asks them for data too early.
      
      IOW, even if the cluster does contain the data up to `getBackupTid(max)`,
      it is only readable by NEO clients up to `getBackupTid(min)` as long as the
      cluster is in BACKINGUP state.
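
      A minimal, hypothetical sketch of the idea (not NEO's actual master
      code): while the cluster is in BACKINGUP state, LastTransaction is
      answered with the minimum backup TID over the cells, i.e. the highest
      TID guaranteed to be readable everywhere.

      	BACKINGUP = 'BACKINGUP'
      	
      	def answer_last_transaction(cluster_state, last_committed_tid, backup_tids):
      	    # backup_tids: the TID up to which each cell has been replicated,
      	    # i.e. the values behind getBackupTid().
      	    if cluster_state == BACKINGUP:
      	        return min(backup_tids)     # getBackupTid(min): safe for readers
      	    return last_committed_tid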
      ca2f7061
  7. 17 Jan, 2018 1 commit
  8. 11 Jan, 2018 1 commit
  9. 08 Jan, 2018 1 commit
    • storage: optimize storage layout of raw data for replication · f4dd4bab
      Julien Muchembled authored
      # Previous status
      
      The issue was that we had extreme storage fragmentation from the point of view
      of the replication algorithm, which processes one partition at a time.
      
      Because an autoincrement was used for the 'data' table, rows were ordered
      by the time at which they were added:
      - parts may be the result of replication -> ordered by partition, tid, oid
      - other rows are globally sorted by tid
      
      This means that when scanning a given partition, many rows were skipped
      all the time:
      - if readahead is big enough, the efficiency is 1/N for a node with N
        partitions assigned
      - otherwise, it is even worse because it seeks all the time
      
      For huge databases, the replication was horribly slow, in particular from HDD.
      
      # Chosen solution
      
      This commit changes how ids are generated to somehow split 'data'
      per partition. The backend tracks one last id per assigned partition,
      where the 16 highest bits contain the partition. Keep in mind that the
      value of the id has no meaning and is only chosen for performance
      reasons. IOW, a row can be referred to by an oid of a partition different
      from the 16 highest bits of its id:
      - there's no migration needed and the 16 highest bits of all existing rows are 0
      - in case of deduplication, a row can still be shared by different partitions
      
      Due to https://jira.mariadb.org/browse/MDEV-12836, we leave the autoincrement
      on existing databases.
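
      As an illustration, a hedged sketch of the id scheme described above
      (the total id width and helper names are assumptions, not the actual
      backend code):

      	PARTITION_BITS = 16
      	ID_BITS = 64                        # assumed total width, for illustration
      	OFFSET_BITS = ID_BITS - PARTITION_BITS
      	
      	def new_data_id(last_id_by_partition, partition):
      	    # One last id is tracked per assigned partition; the partition
      	    # number lives in the highest 16 bits, so rows of a partition stay
      	    # clustered in the 'data' table regardless of when they were written.
      	    last = last_id_by_partition.get(partition, partition << OFFSET_BITS)
      	    last += 1
      	    last_id_by_partition[partition] = last
      	    return last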
      
      ## Downsides
      
      On insertion, increasing the number of partitions now slows things down
      significantly: for 2 nodes using TokuDB, by 4% for 180 partitions and by
      40% for 2000. For 12 partitions, the difference remains negligible. The
      solution to this issue will be to make it possible to increase the number
      of partitions efficiently, so that nodes can keep a small number of them,
      even for DBs that are expected to grow so much that many nodes are added
      over time: such a feature was already considered so that users no longer
      have to worry about this obscure setting at database creation.
      
      Read performance is only slowed down for applications that read a lot of
      data that was written contiguously but split into small blocks. A solution
      is to extend ZODB so that the application can tell it to choose new oids
      that will end up in the same partition. As with insertion, there should
      not be too many partitions.
      
      With RocksDB (MariaDB 10.2.10), it takes a significant amount of time to
      collect all last ids at startup when there are many partitions.
      
      ## Other advantages
      
      - The storage layout of data is now always the same and does not depend on
        whether rows came from replication or commits.
      - Efficient deletion of a partition to free space in-place will be possible.
      
      # Considered alternative
      
      The only serious alternative was to replicate as many partitions as
      possible at the same time, ideally all assigned partitions, but it's not
      always possible. For best performance, it would often require
      synchronizing new nodes, or even all of them, so that the source nodes
      don't have to scan 'data' several times.
      
      If existing nodes are kept, all data that isn't copied to the newly added
      nodes has to be skipped. If the number of nodes is multiplied by N, the
      efficiency is 1-1/N at best (synchronized nodes); otherwise it's even
      worse because partitions are somehow shuffled.
      
      Checking/replacing a single node would remain slow when there are several
      source nodes.
      
      Lastly, such an algorithm would be much more complex, and we would not
      have the other advantages listed above.
      f4dd4bab
  10. 05 Jan, 2018 6 commits
  11. 21 Dec, 2017 1 commit
  12. 15 Dec, 2017 3 commits
  13. 13 Dec, 2017 1 commit
  14. 11 Dec, 2017 3 commits
  15. 05 Dec, 2017 2 commits
  16. 04 Dec, 2017 1 commit
  17. 21 Nov, 2017 1 commit
    • client: bug found, add log to collect more information · a1082cbc
      Julien Muchembled authored
      INFO Z2 Log files reopened successfully
      INFO SignalHandler Caught signal SIGTERM
      INFO Z2 Shutting down fast
      INFO ZServer closing HTTP to new connections
      ERROR ZODB.Connection Couldn't load state for BTrees.LOBTree.LOBucket 0xc12e29
      Traceback (most recent call last):
        File "ZODB/Connection.py", line 909, in setstate
          self._setstate(obj, oid)
        File "ZODB/Connection.py", line 953, in _setstate
          p, serial = self._storage.load(oid, '')
        File "neo/client/Storage.py", line 81, in load
          return self.app.load(oid)[:2]
        File "neo/client/app.py", line 355, in load
          data, tid, next_tid, _ = self._loadFromStorage(oid, tid, before_tid)
        File "neo/client/app.py", line 387, in _loadFromStorage
          askStorage)
        File "neo/client/app.py", line 297, in _askStorageForRead
          self.sync()
        File "neo/client/app.py", line 898, in sync
          self._askPrimary(Packets.Ping())
        File "neo/client/app.py", line 163, in _askPrimary
          return self._ask(self._getMasterConnection(), packet,
        File "neo/client/app.py", line 177, in _getMasterConnection
          result = self.master_conn = self._connectToPrimaryNode()
        File "neo/client/app.py", line 202, in _connectToPrimaryNode
          index = (index + 1) % len(master_list)
      ZeroDivisionError: integer division or modulo by zero
      a1082cbc
  18. 19 Nov, 2017 1 commit
  19. 17 Nov, 2017 4 commits
  20. 15 Nov, 2017 1 commit
  21. 07 Nov, 2017 2 commits
  22. 27 Oct, 2017 1 commit