Commit 79078049 authored by Kirill Smelkov's avatar Kirill Smelkov Committed by GitHub

fsrefs: Optimize IO (take 2) (#340)

* fsrefs: Optimize IO  (take 2)

Access objects in the order of their position in file instead of in the order
of their OID. This should give dramatical speedup when data are on HDD.

For example @perrinjerome reports that on a 73Go database it takes
almost 8h to run fsrefs (where on the same database, fstest takes 15
minutes) [1,2]. After the patch fsrefs took ~80 minutes to run on the same
database. In other words this is ~ 6x improvement.

Fsrefs has no tests. I tested it only lightly via generating a bit
corrupt database with deleted referred object(*), and it gives the same
output as unmodified fsrefs.

    oid 0x0 __main__.Object
    last updated: 1979-01-03 21:00:42.900001, tid=0x285cbacb70a3db3
    refers to invalid objects:
            oid 0x07 missing: '<unknown>'
            oid 0x07 object creation was undone: '<unknown>'

This "take 2" version is derived from https://github.com/zopefoundation/ZODB/pull/338
and only iterates objects in the order of their in-file position without
building complete references graph in-RAM, because that in-RAM graph would
consume ~12GB of memory.

Added pos2oid in-RAM index also consumes memory: for the 73GB database in
question fs._index takes ~700MB, while pos2oid takes ~2GB. In theory it could be less,
because we need only array of oid sorted by key(oid)=fs._index[oid]. However
array.array does not support sorting, and if we use plain list to keep just
[]oid, the memory consumption just for that list is ~5GB. Also because
list.sort(key=...) internally allocates memory for key array (and
list.sort(cmp=...) was removed from Python3), total memory consumption just to
produce list of []oid ordered by pos is ~10GB.
So without delving into C/Cython and/or manually sorting the array in Python (=
slow), using QQBTree seems to be the best out-of-the-box option for oid-by-pos index.

[1] nexedi/zodbtools!19 (comment 129480)
[2] nexedi/zodbtools!19 (comment 129551)

(*) test database generated via a bit modified gen_testdata.py from
zodbtools:

https://lab.nexedi.com/nexedi/zodbtools/blob/v0.0.0.dev8-28-g129afa6/zodbtools/test/gen_testdata.py

+

```diff
--- a/zodbtools/test/gen_testdata.py
+++ b/zodbtools/test/gen_testdata.py
@@ -229,7 +229,7 @@ def ext(subj): return {}
         # delete an object
         name = random.choice(list(root.keys()))
         obj = root[name]
-        root[name] = Object("%s%i*" % (name, i))
+#       root[name] = Object("%s%i*" % (name, i))
         # NOTE user/ext are kept empty on purpose - to also test this case
         commit(u"", u"predelete %s" % unpack64(obj._p_oid), {})
```

/cc @tim-one, @jeremyhylton, @jamadden
/reviewed-by @jamadden, @perrinjerome 
/reviewed-on https://github.com/zopefoundation/ZODB/pull/340
parent 22d1405d
...@@ -8,6 +8,8 @@ ...@@ -8,6 +8,8 @@
- Fix UnboundLocalError when running fsoids.py script. - Fix UnboundLocalError when running fsoids.py script.
See `issue 268 <https://github.com/zopefoundation/ZODB/issues/285>`_. See `issue 268 <https://github.com/zopefoundation/ZODB/issues/285>`_.
- Rework ``fsrefs`` script to work significantly faster by optimizing how it does
IO. See `PR 340 <https://github.com/zopefoundation/ZODB/pull/340>`_.
5.6.0 (2020-06-11) 5.6.0 (2020-06-11)
================== ==================
......
...@@ -66,9 +66,10 @@ import traceback ...@@ -66,9 +66,10 @@ import traceback
from ZODB.FileStorage import FileStorage from ZODB.FileStorage import FileStorage
from ZODB.TimeStamp import TimeStamp from ZODB.TimeStamp import TimeStamp
from ZODB.utils import u64, oid_repr, get_pickle_metadata, load_current from ZODB.utils import u64, p64, oid_repr, get_pickle_metadata, load_current
from ZODB.serialize import get_refs from ZODB.serialize import get_refs
from ZODB.POSException import POSKeyError from ZODB.POSException import POSKeyError
from BTrees.QQBTree import QQBTree
# There's a problem with oid. 'data' is its pickle, and 'serial' its # There's a problem with oid. 'data' is its pickle, and 'serial' its
# serial number. 'missing' is a list of (oid, class, reason) triples, # serial number. 'missing' is a list of (oid, class, reason) triples,
...@@ -118,7 +119,18 @@ def main(path=None): ...@@ -118,7 +119,18 @@ def main(path=None):
# This does not include oids in undone. # This does not include oids in undone.
noload = {} noload = {}
for oid in fs._index.keys(): # build {pos -> oid} index that is reverse to {oid -> pos} fs._index
# we'll need this to iterate objects in order of ascending file position to
# optimize disk IO.
pos2oid = QQBTree() # pos -> u64(oid)
for oid, pos in fs._index.iteritems():
pos2oid[pos] = u64(oid)
# pass 1: load all objects listed in the index and remember those objects
# that are deleted or load with an error. Iterate objects in order of
# ascending file position to optimize disk IO.
for oid64 in pos2oid.itervalues():
oid = p64(oid64)
try: try:
data, serial = load_current(fs, oid) data, serial = load_current(fs, oid)
except (KeyboardInterrupt, SystemExit): except (KeyboardInterrupt, SystemExit):
...@@ -130,9 +142,13 @@ def main(path=None): ...@@ -130,9 +142,13 @@ def main(path=None):
traceback.print_exc() traceback.print_exc()
noload[oid] = 1 noload[oid] = 1
# pass 2: go through all objects again and verify that their references do
# not point to problematic object set. Iterate objects in order of ascending
# file position to optimize disk IO.
inactive = noload.copy() inactive = noload.copy()
inactive.update(undone) inactive.update(undone)
for oid in fs._index.keys(): for oid64 in pos2oid.itervalues():
oid = p64(oid64)
if oid in inactive: if oid in inactive:
continue continue
data, serial = load_current(fs, oid) data, serial = load_current(fs, oid)
......
Markdown is supported
0%
or
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment