Kirill Smelkov authored
ΔBtail provides BTree-level history tail that WCFS - via ΔFtail - will
use to compute which blocks of a ZBigFile need to be invalidated in OS
file cache given raw ZODB changes on ZODB invalidation message. It also
will be used by WCFS to implement isolation protocol, where on every
FUSE READ request WCFS will query ΔBtail - again via ΔFtail - to find
out revision of corresponding file block.

Quoting ΔBtail documentation:

---- 8< ----
ΔBtail provides BTree-level history tail.

It translates ZODB object-level changes to information about which keys of
which BTree were modified, and provides service to query that information.

ΔBtail class documentation
~~~~~~~~~~~~~~~~~~~~~~~~~~

ΔBtail represents tail of revisional changes to BTrees.

It semantically consists of

    []δB                    ; rev ∈ (tail, head]

where δB represents a change in BTrees space

    δB:
        .rev↑
        {} root -> {}(key, δvalue)

It covers only changes to keys from tracked subset of BTrees parts.
In particular a key that was not explicitly requested to be tracked, even if
it was changed in δZ, is not guaranteed to be present in δB.

ΔBtail provides the following operations:

  .Track(path)    - start tracking tree nodes and keys; root=path[0], keys=path[-1].(lo,hi]

  .Update(δZ) -> δB                       - update BTree δ tail given raw ZODB changes
  .ForgetPast(revCut)                     - forget changes ≤ revCut
  .SliceByRev(lo, hi) -> []δB             - query for all trees changes with rev ∈ (lo, hi]
  .SliceByRootRev(root, lo, hi) -> []δT   - query for changes of a tree with rev ∈ (lo, hi]
  .GetAt(root, key, at) -> (value, rev)   - get root[key] @at assuming root[key] ∈ tracked

where δT represents a change to one tree

    δT:
        .rev↑
        {}(key, δvalue)

An example for tracked set is a set of visited BTree paths.
There is no requirement that tracked set belongs to only one single BTree.

See also zodb.ΔTail and zdata.ΔFtail

Concurrency

ΔBtail is safe to use in single-writer / multiple-readers mode. That is at
any time there should be either only sole writer, or, potentially several
simultaneous readers.
The table below classifies operations:

    Writers:  Update, ForgetPast
    Readers:  Track + all queries (SliceByRev, SliceByRootRev, GetAt)

Note that, in particular, it is correct to run multiple Track and queries
requests simultaneously.

ΔBtail organization
~~~~~~~~~~~~~~~~~~~

ΔBtail keeps raw ZODB history in ΔZtail and uses BTree-diff algorithm(*) to
turn δZ into BTree-level diff. For each tracked BTree a separate ΔTtail is
maintained with tree-level history in ΔTtail.vδT .

Because it is very computationally expensive(+) to find out for an object to
which BTree it belongs, ΔBtail cannot provide full BTree-level history given
just ΔZtail with δZ changes. Due to this ΔBtail requires help from
users, which are expected to call ΔBtail.Track(treepath) to let ΔBtail know
that such and such ZODB objects constitute a path from root of a tree to some
of its leaf. After Track call the objects from the path and tree keys, that
are covered by leaf node, become tracked: from now-on ΔBtail will detect
and provide BTree-level changes caused by any change of tracked tree objects
or tracked keys. This guarantee can be provided because ΔBtail now knows
that such and such objects belong to a particular tree.

To manage knowledge which tree part is tracked ΔBtail uses PPTreeSubSet.
This data-structure represents so-called PP-connected set of tree nodes:
simply speaking it builds on some leafs and then includes parent(leaf),
parent(parent(leaf)), etc. In other words it's a "parent"-closure of the
leafs. The property of being PP-connected means that starting from any node
from such set, it is always possible to reach root node by traversing
.parent links, and that every intermediate node went-through during
traversal also belongs to the set.

A new Track request potentially grows tracked keys coverage. Due to this,
on a query, ΔBtail needs to recompute potentially whole vδT of the affected
tree.
This recomputation is managed by "vδTSnapForTracked*" and "_rebuild"
functions and uses the same treediff algorithm, that Update is using, but
modulo PPTreeSubSet corresponding to δ key coverage. Update also potentially
needs to rebuild whole vδT history, not only append new δT, because a
change to tracked tree nodes can result in growth of tracked key coverage.

Queries are relatively straightforward code that work on vδT snapshot. The
main complexity, besides BTree-diff algorithm, lies in recomputing vδT when
set of tracked keys changes, and in handling that recomputation in such a way
that multiple Track and queries requests could be all served in parallel.

Concurrency

In order to allow multiple Track and queries requests to be served in
parallel ΔBtail employs special organization of vδT rebuild process where
complexity of concurrency is reduced to math on merging updates to vδT and
trackSet, and on key range lookup:

1. vδT is managed under read-copy-update (RCU) discipline: before making
   any vδT change the mutator atomically clones whole vδT and applies its
   change to the clone. This way a query, once it retrieves vδT snapshot,
   does not need to further synchronize with vδT mutators, and can rely on
   that retrieved vδT snapshot will remain immutable.

2. a Track request goes through 3 states: "new", "handle-in-progress" and
   "handled". At each state keys/nodes of the Track are maintained in:

   - ΔTtail.ktrackNew and .trackNew       for "new",
   - ΔTtail.krebuildJobs                  for "handle-in-progress", and
   - ΔBtail.trackSet                      for "handled".

   trackSet keeps nodes, and implicitly keys, from all handled Track
   requests. For all keys, covered by trackSet, vδT is fully computed.

   a new Track(keycov, path) is remembered in ktrackNew and trackNew to be
   further processed when a query should need keys from keycov. vδT is not
   yet providing data for keycov keys.

   when a Track request starts to be processed, its keys and nodes are moved
   from ktrackNew/trackNew into krebuildJobs.
   vδT is not yet providing data for requested-to-be-tracked keys.

   all trackSet, trackNew/ktrackNew and krebuildJobs are completely disjoint:

       trackSet ^ trackNew     = ø
       trackSet ^ krebuildJobs = ø
       trackNew ^ krebuildJobs = ø

3. when a query is served, it needs to retrieve vδT snapshot that takes
   related previous Track requests into account. Retrieving such snapshots
   is implemented in vδTSnapForTracked*() family of functions: there it
   checks ktrackNew/trackNew, and if those sets overlap with query's keys
   of interest, run vδT rebuild for keys queued in ktrackNew.

   the main part of that rebuild can be run without any locks, because it
   does not use nor modify any ΔBtail data, and for δ(vδT) it just computes
   a fresh full vδT build modulo retrieved ktrackNew. Only after that
   computation is complete, ΔBtail is locked again to quickly merge in
   δ(vδT) update back into vδT.

   This organization is based on the fact that

       vδT/(T₁∪T₂) = vδT/T₁ | vδT/T₂

     ( i.e. vδT computed for tracked set being union of T₁ and T₂ is the
       same as merge of vδT computed for tracked set T₁ and vδT computed
       for tracked set T₂ )

   and that

       trackSet | (δPP₁|δPP₂) = (trackSet|δPP₁) | (trackSet|δPP₂)

     ( i.e. tracking set updated for union of δPP₁ and δPP₂ is the same
       as union of tracking set updated with δPP₁ and tracking set updated
       with δPP₂ )

   these merge properties allow to run computation for δ(vδT) and
   δ(trackSet) independently and with ΔBtail unlocked, which in turn
   enables running several Track/queries in parallel.

4. while vδT rebuild is being run, krebuildJobs keeps corresponding keycov
   entry to indicate in-progress rebuild. Should a query need vδT for keys
   from that job, it first waits for corresponding job(s) to complete.

Explained rebuild organization allows non-overlapping queries/track-requests
to run simultaneously. (This property is essential to WCFS because otherwise
WCFS would not be able to serve several non-overlapping READ requests to one
file in parallel.)
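
The RCU discipline from item 1 of the list above can be illustrated with a
minimal Go sketch. It is not the actual ΔBtail/ΔTtail code: Tail, DeltaT,
Snapshot, Append and ForgetPast are hypothetical, simplified names, and a
plain mutex is assumed for publishing a new vδT. The only point shown is that
a mutator clones vδT before changing it, so a snapshot already handed to a
query stays unchanged.

```go
package main

import (
	"fmt"
	"sync"
)

// DeltaT is a hypothetical, simplified stand-in for δT:
// one revision together with the keys changed in it.
type DeltaT struct {
	Rev int64
	KV  map[int64]string
}

// Tail is a hypothetical stand-in for ΔTtail; vdT plays the role of vδT.
type Tail struct {
	mu  sync.Mutex
	vdT []DeltaT // never modified in place once published
}

// Snapshot returns the currently published vdT. The caller can read it
// without further locking because mutators only ever publish clones.
func (t *Tail) Snapshot() []DeltaT {
	t.mu.Lock()
	defer t.mu.Unlock()
	return t.vdT
}

// Append clones vdT, adds d to the clone and publishes the clone.
func (t *Tail) Append(d DeltaT) {
	t.mu.Lock()
	defer t.mu.Unlock()
	clone := make([]DeltaT, len(t.vdT), len(t.vdT)+1)
	copy(clone, t.vdT)
	t.vdT = append(clone, d)
}

// ForgetPast publishes a fresh vdT without entries with Rev <= revCut.
func (t *Tail) ForgetPast(revCut int64) {
	t.mu.Lock()
	defer t.mu.Unlock()
	var clone []DeltaT
	for _, d := range t.vdT {
		if d.Rev > revCut {
			clone = append(clone, d)
		}
	}
	t.vdT = clone
}

func main() {
	tail := &Tail{}
	tail.Append(DeltaT{Rev: 1, KV: map[int64]string{10: "a"}})
	tail.Append(DeltaT{Rev: 2, KV: map[int64]string{11: "b"}})

	snap := tail.Snapshot() // a query takes its snapshot
	tail.ForgetPast(1)      // a writer trims history afterwards

	// The old snapshot still has both revisions; the new one has only rev 2.
	fmt.Println(len(snap), snap[0].Rev, len(tail.Snapshot()))
}
```

Cloning makes every mutation cost O(len(vδT)), which is the price paid for
letting queries proceed on their snapshot without synchronizing with
mutators.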
--------
(*) implemented in treediff.go
(+) full database scan
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Some preliminary history:

kirr/wendelin.core@877e64a9    X wcfs: Fix tests to pass again
kirr/wendelin.core@c32055fc    X wcfs/xbtree: ΔBtail tests += ø -> Tree; Tree -> ø
kirr/wendelin.core@78f2f88b    X wcfs/xbtree: Fix treediff(a, ø)
kirr/wendelin.core@5324547c    X wcfs/xbtree: root(a) must stay in trackSet even after treediff(a,ø)
kirr/wendelin.core@f65f775b    X wcfs/xbtree: treediff(ø, b)
kirr/wendelin.core@c75b1c6f    X wcfs/xbtree: Start killing holeIdx
kirr/wendelin.core@0fa06cbd    X kadj must be taken into account as kadj^δZ
kirr/wendelin.core@ef5e5183    X treediff ret += δtkeycov
kirr/wendelin.core@f30826a6    X another bug in δtkeyconv computation
kirr/wendelin.core@0917380e    X wcfs: assert that keycov only grow
kirr/wendelin.core@502e05c2    X found why TestΔBTailAllStructs was not effective to find δtkeycov bugs
kirr/wendelin.core@450ba707    X Fix rebuild with ø @at2
kirr/wendelin.core@f60528c9    X ΔBtail.Clone had bug that it was aliasing klon and orig data
kirr/wendelin.core@9d20f8e8    X treediff: Fix BUG while computing AB coverage
kirr/wendelin.core@ddb28043    X rebuild: Don't return nil for empty ΔPPTreeSubSet - that leads to SIGSEGV
kirr/wendelin.core@324241eb    X rebuild: tests: Don't reflect.DeepEqual in inner loop
kirr/wendelin.core@8f6e2b1e    X rebuild: tests: Don't access ZODB in XGetδKV
kirr/wendelin.core@2c0b4793    X rebuild: tests: Don't access ZODB in xtrackKeys
kirr/wendelin.core@8f0e37f2    X rebuild: tests: Precompute kadj10·kadj21
kirr/wendelin.core@271d953d    X rebuild: tests: Move ΔBtail.Clone test out of hot inner loop into separate test
kirr/wendelin.core@a87cc6de    X rebuild: tests: Don't recompute trackSet(keys1R2) several times
kirr/wendelin.core@01433e96    X rebuild: tests: Don't compute keyCover in trackSet
kirr/wendelin.core@7371f9c5    X rebuild: tests: Inline _assertTrack
kirr/wendelin.core@3e9164b3    X rebuild: tests: Don't exercise keys from keys2 that already became tracked after Track(keys1) + Update
kirr/wendelin.core@e9c4b619    X rebuild: tests: Random testing
kirr/wendelin.core@d0fe680a    X δbtail += ForgetPast
kirr/wendelin.core@210e9b07    X Fix ΔBtail.SliceByRootRev (lo,hi] handling
kirr/wendelin.core@855ab4b8    X ΔBtail: Goodbye .KVAtTail
kirr/wendelin.core@2f5582e6    X ΔBtail: Tweak tests to run faster in normal mode
kirr/wendelin.core@cf352737    X random testing found another failing test for rebuild...
kirr/wendelin.core@7f7e34e0    X wcfs/xbtree: Fix update not to add duplicate extra point if rebuild - called by Update - already added it
kirr/wendelin.core@6ad0052c    X ΔBtail.Track: No need to return error
kirr/wendelin.core@aafcacdf    X xbtree: GetAt test
kirr/wendelin.core@784a6761    X xbtree: Fix KAdj definition after treediff was reworked this summer to base decisions on node keycoverage instead of particular node keys
kirr/wendelin.core@0bb1c22e    X xbtree: Verify that ForgetPast clones vδT on trim
kirr/wendelin.core@a8945cbf    X Start reworking rebuild routines not to modify data inplace
kirr/wendelin.core@b74dda09    X Start switching Track from Track(key) to Track(keycov)
kirr/wendelin.core@dea85e87    X Switch GetAt to vδTSnapForTrackedKey
kirr/wendelin.core@aa0288ce    X Switch SliceByRootRev to vδTSnapForTracked
kirr/wendelin.core@c4366b14    X xbtree: tests: Also verify state of ΔTtail.ktrackNew
kirr/wendelin.core@b98706ad    X Track should be nop if keycov/path is already in krebuildJobs
kirr/wendelin.core@e141848a    X test.go ↑ timeout 10m -> 20m
kirr/wendelin.core@423f77be    X wcfs: Goodby holeIdx
kirr/wendelin.core@37c2e806    X wcfs: Teach treediff to compute not only δtrack (set of nodes), but also δ for track-key coverage
kirr/wendelin.core@52c72dbb    X ΔBtail.rebuild started to work draftly
kirr/wendelin.core@c9f13fc7    X Get rebuild tests to run in a sane time; Add proper random-based testing for rebuild
kirr/wendelin.core@c7f1e3c9    X xbtree: Factor testing infrastructure bits into xbtree/xbtreetest
kirr/wendelin.core@7602c1f4    ΔBtail concurrency
2ab4be93