Commits · 0fc95175cc446a67ce1a4a1846f8072d75d10eb0 · nexedi / neoppod

16 Oct, 2023 2 commits

Bump protocol version · 0fc95175
Julien Muchembled authored Oct 13, 2023

0fc95175

Reimplement pack in a scalable way, partial pack & approval/reject of pack orders · 4c3b6c4d

Julien Muchembled authored Sep 03, 2020

This is still pack without garbage collection, and without deleting
any transaction metadata ('trans' table).

Partial pack means that the client can take a list of oids: only these
oids will be packed. No API is defined yet at IStorage level.

Storage nodes pack in background, independently from other storage
nodes, partition by partition, and calling IStorage.pack() returns
immediately (though internally, NEO does have a mechanism to wait
until it's done, which can be required for some ZODB unit tests).

This new implementation also introduces the concept of signing pack
orders. The idea is that calling IStorage.pack() only records a pack
order in the database, that can be reviewed/approved/rejected using
a UI that remains to be done. For the moment, pack orders are
automatically approved (by the master).

Internally, pack orders are stored as extra metadata of a transaction.
IOW, IStorage.pack() implies the commit of an (empty) transaction.

IStorage.pack() can be called without waiting for the previous one
to be completed. Pack orders are processed in the same order as they are
requested:
- an unsigned pack order blocks the processing of any newer pack order;
- rejected pack orders are ignored.
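
The two ordering rules above can be illustrated with a minimal sketch. This is not NEO's code: `PackOrder` and `processable` are hypothetical names, and the approval state is modelled as `None` (unsigned), `True` (approved) or `False` (rejected).

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PackOrder:
    tid: int                  # id of the transaction that recorded the order
    approved: Optional[bool]  # None = unsigned, True = approved, False = rejected


def processable(orders: List[PackOrder]) -> List[PackOrder]:
    """Return the approved orders that may be processed now, oldest first."""
    result = []
    for order in sorted(orders, key=lambda o: o.tid):
        if order.approved is None:
            break  # an unsigned order blocks every newer order
        if order.approved:
            result.append(order)
        # rejected orders are simply skipped
    return result
```

With orders 1 (approved), 2 (rejected), 3 (approved), 4 (unsigned) and 5 (approved), only 1 and 3 are returned: 2 is ignored and 5 waits behind 4.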

Approving a pack order also triggers pack on backup clusters.
That's the simplest way to have everything consistent.
Maybe later we could identify scenarios where it would be ok
to unsign pack orders during asynchronous replication.

The feature to check replicas is marked as experimental because it is
not aware of differences that can happen during pack operations.
_______________________________________________________________________

About concurrency within the storage node, a first implementation
extended what was done to delete partitions in background (see
previous commit). But here, the job can't be easily split in slices
that are never too big:
- it's simpler to never split the processing of an oid but this can
  freeze the application for a long time when packing an oid that was
  modified many times (e.g. 30 min for an oid with 20 million
  historical records);
- then an attempt to process an oid in several passes was
  inefficient, maybe due to a limit in RocksDB (packing the oid in the
  above example would take days, during which NEO is significantly
  slower).

So background database jobs were moved to a separate thread, using a
separate connection to the underlying database. This is obviously
only useful for the MySQL backend. In order to share as much code as
possible between backends, SQLite also does the work in a separate
thread but sharing the main connection instead of opening a separate
one (so such backend would not be suited in the above example).

But deleting raw data with a secondary connection is not possible
without fsyncing too often (or transaction isolation issues...): these
deletions are deferred by recording them in a new table, which is
processed later with the main connection. This is not so bad because
the actual deletion of raw data is usually more efficient this way
(more sequential IO).

Here are a few numbers:
- without load: 10h45 (12h for the first reimplementation)
- with a load that normally takes 6h58:
  - load: 7h33 (so 8.4% slower)
  - pack: 15h36 (+4h51)
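
The percentages and differences quoted above follow directly from the durations; a short check, where `minutes` is a helper introduced only for this example:

```python
def minutes(hhmm: str) -> int:
    """Convert a duration such as '6h58' to minutes."""
    hours, mins = hhmm.split("h")
    return int(hours) * 60 + int(mins)


load_alone = minutes("6h58")
load_with_pack = minutes("7h33")
pack_alone = minutes("10h45")
pack_with_load = minutes("15h36")

slowdown = (load_with_pack / load_alone - 1) * 100
extra = pack_with_load - pack_alone

print(f"load slowdown: {slowdown:.1f}%")                  # load slowdown: 8.4%
print(f"pack overhead: +{extra // 60}h{extra % 60:02d}")  # pack overhead: +4h51
```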

As explained above, the pack of a partition is split in 2 steps:
- the longest one (here 78% without load) should have negligible
  performance impact on the application because the work is done in a
  separate thread with a secondary connection, and also with something
  to minimize GIL impact by prioritizing the main thread;
- the shortest one (22%) to process the deferred deletions,
  with even lower priority than replication: it tries to split
  the work in tasks that take ~10ms.
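
One way to split work into tasks of roughly 10 ms is to drive an iterator until a time budget is spent. The sketch below only illustrates that idea; `run_slice` is a hypothetical helper, not the scheduler NEO uses.

```python
import time


def run_slice(work, budget=0.010):
    """Consume items from `work` for about `budget` seconds.

    Each item pulled from the iterator represents one small unit of work.
    Return True if the budget ran out (work may remain), False if the
    iterator was exhausted.
    """
    deadline = time.monotonic() + budget
    for _ in work:
        if time.monotonic() >= deadline:
            return True
    return False
```

A caller would invoke `run_slice` again whenever nothing of higher priority is pending, until it returns False.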

4c3b6c4d

11 Oct, 2023 1 commit

storage: delete partitions in a scalable way · 3204a4c6

Julien Muchembled authored May 09, 2017

This is implemented using the same concurrency mechanism as for the
replication: the work is split in slices that should be small enough
to avoid slowing down network requests significantly.

3204a4c6

04 Apr, 2023 9 commits
- undo: code clean-up · fd95a217
  Julien Muchembled authored Mar 16, 2021
```
undone_data_tid can't be equal to a TTID.
```
  fd95a217
- mysql: drop support for horizontal partitioning of trans/obj · 8535b9cc
  Julien Muchembled authored May 09, 2017
```
It has never been enabled and the code to drop partitions will be
changed in a way that only 'trans' may still benefit from partitioning.
We'll see in the future if we have cases where 'trans' is too big to
delete all rows (of a given partition) in a single query.
```
  8535b9cc
- debug: add an example to profile with yappi · fd87e153
  Julien Muchembled authored Mar 27, 2023
  
  fd87e153
- Fix signals not always being processed as soon as possible · 0e43dd1f
  Julien Muchembled authored Mar 21, 2023
  
  0e43dd1f
- storage: small code simplification · 3f516cd6
  Julien Muchembled authored Apr 04, 2023
  
  3f516cd6
- Do not define exception classes in protocol.py · 39ae4a2f
  Julien Muchembled authored Mar 16, 2021
  
  39ae4a2f
- undo: bugfixes · df2bf949
  Julien Muchembled authored Mar 12, 2021
```
- When undoing current record, fix:
  - crash of storage nodes that don't have the undo data (non-readable cells);
  - and conflict resolution.
- Fix undo deduplication in replication when NEO deduplication is disabled.
- client: minor fixes in undo() about concurrent storage disconnections
  and PT updates.
```
  df2bf949
- Small optimization in epoll loop · f7f8533a
  Julien Muchembled authored Mar 11, 2023
  
  f7f8533a
- sqlite3: fix rare strange exception when table does not exist · 90a5aa17
  Julien Muchembled authored Apr 01, 2023
```
Found by running testPruneOrphan many times. Once I even got:

  SystemError: NULL result without error in PyObject_Call
```
  90a5aa17
09 Mar, 2023 2 commits
- qa: do never import MySQL-specific code when testing SQLite · 40b6ba64
  Julien Muchembled authored Mar 09, 2023
  
  40b6ba64
- qa: fix interface check of SQLite & MySQL implementations · cf8f2028
  Julien Muchembled authored Mar 03, 2023
```
This reverts a wrong change in commit 30a02bdc
("importer: new option to write back new transactions to the source database").
```
  cf8f2028
19 Feb, 2023 1 commit
- mysql: drop support for TokuDB · ecc9c63c
  Julien Muchembled authored Feb 19, 2023
  
  ecc9c63c
16 Feb, 2023 2 commits
- qa: update a test for recent ZConfig · b83668e7
  Julien Muchembled authored Feb 15, 2023
  
  b83668e7
- Add support for PyPy & PyMySQL · 6153a752
  Julien Muchembled authored Feb 13, 2023
  
  6153a752
14 Feb, 2023 3 commits
- fixup! Drop support for ZODB3 · c29b8b2d
  Julien Muchembled authored Feb 13, 2023
  
  c29b8b2d
- mysql: minor optimization · f844fe0b
  Julien Muchembled authored Feb 13, 2023
```
We haven't received 'array' objects for many years; no idea since when exactly.
```
  f844fe0b
- Rename mysqldb.py to mysql.py · 9691459d
  Julien Muchembled authored Feb 14, 2023
  
  9691459d
10 Feb, 2023 1 commit

sqlite: minor optimization · 933412cd

Julien Muchembled authored Feb 10, 2023

Like commit 243c1a0f ("sqlite: optimize storage of metadata"), the fake
changes in test data are because we don't force upgrade for this optimization.

933412cd

02 Feb, 2022 1 commit

Fix breakage with zodbpickle >= 2 · d5afef8e

Kirill Smelkov authored Feb 02, 2022

Starting from zodbpickle 2 its binary class does not allow users to set
arbitrary attributes and so

	binary._pack = bytes.__str__

fails with

	TypeError: can't set attributes of built-in/extension type 'zodbpickle.binary'

-> Fix it by explicitly checking for binary type on encoding instead of
setting binary._pack
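
The general shape of such a fix, checking the type at encoding time rather than patching an attribute onto the class, can be sketched as follows. `Binary` is a stand-in for `zodbpickle.binary`, and `encode` with its `"bin"`/`"str"` tags is purely hypothetical; none of this is NEO's actual encoder.

```python
class Binary(bytes):
    """Stand-in for zodbpickle.binary (illustration only)."""


def encode(obj):
    # Explicit type check at encoding time, instead of looking up a
    # `_pack` attribute that would have to be set on the class itself.
    if isinstance(obj, Binary):
        return ("bin", bytes(obj))
    if isinstance(obj, bytes):
        return ("str", bytes(obj))
    raise TypeError("unsupported type: %s" % type(obj).__name__)
```

The subclass must be tested before `bytes`, since every `Binary` instance is also a `bytes` instance.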

See slapos@27f574bc for pre-history.

/cc @jerome

d5afef8e

04 Jun, 2021 1 commit

admin: fix crash if not operational and a downstream cluster is RUNNING · 7f81ac2d

Julien Muchembled authored Jun 03, 2021

Traceback (most recent call last):
  ...
  File ".../neo/lib/handler.py", line 75, in dispatch
    method(conn, *args, **kw)
  File ".../neo/admin/handler.py", line 174, in wrapper
    return func(self, name, *args, **kw)
  File ".../neo/admin/handler.py", line 190, in notifyMonitorInformation
    self.app.updateMonitorInformation(name, **info)
  File ".../neo/admin/app.py", line 290, in updateMonitorInformation
    self._notify(self.operational)
  File ".../neo/admin/app.py", line 315, in _notify
    body += '', name, '    ' + backup.formatSummary(upstream)[1]
  File ".../neo/admin/app.py", line 83, in formatSummary
    tid = self.ltid
AttributeError: 'Backup' object has no attribute 'ltid'

7f81ac2d

11 May, 2021 1 commit
- neoctl: fix tweak command when used without any argument · ba0bc779
  Julien Muchembled authored May 11, 2021
  
  ba0bc779
02 Apr, 2021 5 commits
- qa: more Importer tests · de0feb4e
  Julien Muchembled authored Mar 30, 2021
  
  de0feb4e
- qa: at the end of each ZODB test, check there is no storage space leak · 34d0725e
  Julien Muchembled authored Mar 17, 2021
  
  34d0725e
- qa: when comparing replicas, checksum metadata & data rather than only keys · 28e097c8
  Julien Muchembled authored Mar 12, 2021
  
  28e097c8
- PartitionTable: small optimization · 60bcbc5c
  Julien Muchembled authored Apr 02, 2021
  
  60bcbc5c
- PartitionTable: rename getAssignedPartitionList to getReadableOffsetList · aa48adf9
  Julien Muchembled authored Mar 12, 2021
  
  aa48adf9
22 Mar, 2021 1 commit
- qa: renew certificates for tests · fa581be5
  Julien Muchembled authored Mar 22, 2021
  
  fa581be5
04 Mar, 2021 2 commits
- Drop support for ZODB3 · 3a8f6f03
  Julien Muchembled authored Mar 04, 2021
  
  3a8f6f03
- importer: fix assertion failure when loading a deleted oid that is fully imported · 414573b9
  Julien Muchembled authored Mar 04, 2021
  
  414573b9
15 Jan, 2021 2 commits

ssl: don't care whether EOF is ragged or not · d98205d0

Julien Muchembled authored Jan 15, 2021

The purpose of suppress_ragged_eofs=False was to micro-optimize the
normal case: when there's no EOF.

But commit 061cd5d8 showed that this option only turns ragged EOF into an
exception. It may be easier for alternate NEO implementations to close the
SSL connection properly. Or the performance benefit was not worth the risk
of freezing a NEO process.

d98205d0

ssl: Don't ignore non-ragged EOF · 061cd5d8

Kirill Smelkov authored Jan 13, 2021

Testing the NEO/go client against a NEO/py server revealed a bug in NEO/py SSL
handling: a proper non-ragged EOF from a peer is ignored, which leads to a
hang in an infinite loop inside _SSL.receive, with read_buf memory growing
indefinitely. Details are below:

NEO/py wraps raw sockets with

	ssl.wrap_socket(suppress_ragged_eofs=False)

which instructs SSL layer to convert unexpected EOF when receiving a TLS
record into SSLEOFError exception. However when remote peer properly
closes its side of the connection, socket.read() still returns b'' to
report non-ragged regular EOF:

https://github.com/python/cpython/blob/v2.7.18/Lib/ssl.py#L630-L650

The code handled SSLEOFError but not a b'' return from socket recv.
Thus, after the NEO/go client disconnected and properly closed its side
of the connection, the code started to loop indefinitely in _SSL.receive
under `while 1`, with the b'' returned by self.socket.recv() appended to
read_buf again and again.

-> Fix it by detecting non-ragged EOF as well and, similarly to how
SSLEOFError is handled, converting them into self._error('recv', None).
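
The missing check can be shown with a plain (non-SSL) socket, where `recv()` likewise returns `b''` once the peer has closed its side. This is a simplified sketch: `receive_all` is a hypothetical function, and it just stops reading on EOF instead of calling NEO's `self._error('recv', None)`.

```python
import socket


def receive_all(sock, bufsize=4096):
    """Read from `sock` until the peer closes its side of the connection."""
    read_buf = bytearray()
    while True:
        data = sock.recv(bufsize)
        if not data:
            # Regular EOF: recv() returned b''. Without this check the
            # loop would keep appending b'' and never terminate.
            break
        read_buf += data
    return bytes(read_buf)
```

It can be exercised with `socket.socketpair()`: send a few bytes on one end, close it, and call `receive_all` on the other.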

See merge request nexedi/neoppod!17

061cd5d8

11 Jan, 2021 4 commits
- client: fix relative import · 261dd4b4
  Julien Muchembled authored Jan 06, 2021
  
  261dd4b4
- qa: add testStorageGettingReadyDuringRecovery · 80d180e7
  Julien Muchembled authored Dec 17, 2020
  
  80d180e7
- master: ignore late AnswerInformationLocked during recovery · b22847d2
  Julien Muchembled authored Dec 15, 2020
  
  b22847d2
- master: simplify verification by ignoring completely nodes without readable cells · a760258b
  Julien Muchembled authored Nov 06, 2020
```
The scenario that was described in comments was meaningless
because S1 never goes out-of-date.
```
  a760258b
02 Oct, 2020 1 commit

Fix handling of -m/--masters arg · fa63d856

Julien Muchembled authored Oct 02, 2020

For the master, the purpose of -m/--masters is to specify addresses
of other master nodes, since its own address is already known via
-b/--bind. Therefore, an empty value for -m/--masters is valid.
The user remains free to repeat the -b value in -m.

More generally, a node may choose to only specify master addresses
via -D/--dynamic-master-list, so the check that at least one master
address is specified is moved where the NodeManager is expected to be
initialized.
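
A rough sketch of the idea: accept an empty -m at parse time and only require a master address once all sources are known. The option names come from the message above; everything else (`build_parser`, `initial_masters`, the space-separated address format, the `is_master` flag and the `dynamic` argument standing for addresses loaded via -D) is assumed for illustration and is not NEO's implementation.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser()
    parser.add_argument("-b", "--bind", default="")
    # An empty value is valid here: no check at parse time.
    parser.add_argument("-m", "--masters", default="")
    parser.add_argument("-D", "--dynamic-master-list", default=None)
    return parser


def initial_masters(args, dynamic=(), is_master=False):
    """Collect master addresses; fail only if none is known at all."""
    addresses = args.masters.split() + list(dynamic)
    if is_master and args.bind:
        # A master already knows its own address via -b.
        addresses.append(args.bind)
    if not addresses:
        raise ValueError("no master address specified")
    return addresses
```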

fa63d856

29 Sep, 2020 1 commit
- Remove dead code · c34d332f
  Julien Muchembled authored Sep 29, 2020
  
  c34d332f