• Kirill Smelkov's avatar
    wcfs: xbtree: ΔBtail · 2ab4be93
    Kirill Smelkov authored
    ΔBtail provides BTree-level history tail that WCFS - via ΔFtail - will
    use to compute which blocks of a ZBigFile need to be invalidated in OS
    file cache given raw ZODB changes on ZODB invalidation message.
    
    It also will be used by WCFS to implement isolation protocol, where on
    every FUSE READ request WCFS will query ΔBtail - again via ΔFtail - to
    find out revision of corresponding file block.
    
    Quoting ΔBtail documentation:
    
    ---- 8< ----
    
    ΔBtail provides BTree-level history tail.
    
    It translates ZODB object-level changes to information about which keys of
    which BTree were modified, and provides service to query that information.
    
    ΔBtail class documentation
    ~~~~~~~~~~~~~~~~~~~~~~~~~~
    
    ΔBtail represents tail of revisional changes to BTrees.
    
    It semantically consists of
    
        []δB			; rev ∈ (tail, head]
    
    where δB represents a change in BTrees space
    
        δB:
        	.rev↑
        	{} root -> {}(key, δvalue)
    
    It covers only changes to keys from tracked subset of BTrees parts.
    In particular a key that was not explicitly requested to be tracked, even if
    it was changed in δZ, is not guaranteed to be present in δB.
    
    ΔBtail provides the following operations:
    
      .Track(path)	- start tracking tree nodes and keys; root=path[0], keys=path[-1].(lo,hi]
    
      .Update(δZ) -> δB				- update BTree δ tail given raw ZODB changes
      .ForgetPast(revCut)			- forget changes ≤ revCut
      .SliceByRev(lo, hi) -> []δB		- query for all trees changes with rev ∈ (lo, hi]
      .SliceByRootRev(root, lo, hi) -> []δT	- query for changes of a tree with rev ∈ (lo, hi]
      .GetAt(root, key, at) -> (value, rev)	- get root[key] @at assuming root[key] ∈ tracked
    
    where δT represents a change to one tree
    
        δT:
        	.rev↑
        	{}(key, δvalue)
    
    An example for tracked set is a set of visited BTree paths.
    There is no requirement that tracked set belongs to only one single BTree.
    
    See also zodb.ΔTail and zdata.ΔFtail
    
    Concurrency
    
    ΔBtail is safe to use in single-writer / multiple-readers mode. That is at
    any time there should be either only sole writer, or, potentially several
    simultaneous readers. The table below classifies operations:
    
        Writers:  Update, ForgetPast
        Readers:  Track + all queries (SliceByRev, SliceByRootRev, GetAt)
    
    Note that, in particular, it is correct to run multiple Track and queries
    requests simultaneously.
    
    ΔBtail organization
    ~~~~~~~~~~~~~~~~~~~
    
    ΔBtail keeps raw ZODB history in ΔZtail and uses BTree-diff algorithm(*) to
    turn δZ into BTree-level diff. For each tracked BTree a separate ΔTtail is
    maintained with tree-level history in ΔTtail.vδT .
    
    Because it is very computationally expensive(+) to find out for an object to
    which BTree it belongs, ΔBtail cannot provide full BTree-level history given
    just ΔZtail with δZ changes. Due to this ΔBtail requires help from
    users, which are expected to call ΔBtail.Track(treepath) to let ΔBtail know
    that such and such ZODB objects constitute a path from root of a tree to some
    of its leaf. After Track call the objects from the path and tree keys, that
    are covered by leaf node, become tracked: from now-on ΔBtail will detect
    and provide BTree-level changes caused by any change of tracked tree objects
    or tracked keys. This guarantee can be provided because ΔBtail now knows
    that such and such objects belong to a particular tree.
    
    To manage knowledge which tree part is tracked ΔBtail uses PPTreeSubSet.
    This data-structure represents so-called PP-connected set of tree nodes:
    simply speaking it builds on some leafs and then includes parent(leaf),
    parent(parent(leaf)), etc. In other words it's a "parent"-closure of the
    leafs. The property of being PP-connected means that starting from any node
    from such set, it is always possible to reach root node by traversing
    .parent links, and that every intermediate node went-through during
    traversal also belongs to the set.
    
    A new Track request potentially grows tracked keys coverage. Due to this,
    on a query, ΔBtail needs to recompute potentially whole vδT of the affected
    tree. This recomputation is managed by "vδTSnapForTracked*" and "_rebuild"
    functions and uses the same treediff algorithm, that Update is using, but
    modulo PPTreeSubSet corresponding to δ key coverage. Update also potentially
    needs to rebuild whole vδT history, not only append new δT, because a
    change to tracked tree nodes can result in growth of tracked key coverage.
    
    Queries are relatively straightforward code that work on vδT snapshot. The
    main complexity, besides BTree-diff algorithm, lies in recomputing vδT when
    set of tracked keys changes, and in handling that recomputation in such a way
    that multiple Track and queries requests could be all served in parallel.
    
    Concurrency
    
    In order to allow multiple Track and queries requests to be served in
    parallel ΔBtail employs special organization of vδT rebuild process where
    complexity of concurrency is reduced to math on merging updates to vδT and
    trackSet, and on key range lookup:
    
    1. vδT is managed under read-copy-update (RCU) discipline: before making
       any vδT change the mutator atomically clones whole vδT and applies its
       change to the clone. This way a query, once it retrieves vδT snapshot,
       does not need to further synchronize with vδT mutators, and can rely on
       that retrieved vδT snapshot will remain immutable.
    
    2. a Track request goes through 3 states: "new", "handle-in-progress" and
       "handled". At each state keys/nodes of the Track are maintained in:
    
       - ΔTtail.ktrackNew and .trackNew       for "new",
       - ΔTtail.krebuildJobs                  for "handle-in-progress", and
       - ΔBtail.trackSet                      for "handled".
    
       trackSet keeps nodes, and implicitly keys, from all handled Track
       requests. For all keys, covered by trackSet, vδT is fully computed.
    
       a new Track(keycov, path) is remembered in ktrackNew and trackNew to be
       further processed when a query should need keys from keycov. vδT is not
       yet providing data for keycov keys.
    
       when a Track request starts to be processed, its keys and nodes are moved
       from ktrackNew/trackNew into krebuildJobs. vδT is not yet providing data
       for requested-to-be-tracked keys.
    
       all trackSet, trackNew/ktrackNew and krebuildJobs are completely disjoint:
    
        trackSet ^ trackNew     = ø
        trackSet ^ krebuildJobs = ø
        trackNew ^ krebuildJobs = ø
    
    3. when a query is served, it needs to retrieve vδT snapshot that takes
       related previous Track requests into account. Retrieving such snapshots
       is implemented in vδTSnapForTracked*() family of functions: there it
       checks ktrackNew/trackNew, and if those sets overlap with query's keys
       of interest, run vδT rebuild for keys queued in ktrackNew.
    
       the main part of that rebuild can be run without any locks, because it
       does not use nor modify any ΔBtail data, and for δ(vδT) it just computes
       a fresh full vδT build modulo retrieved ktrackNew. Only after that
       computation is complete, ΔBtail is locked again to quickly merge in
       δ(vδT) update back into vδT.
    
       This organization is based on the fact that
    
        vδT/(T₁∪T₂) = vδT/T₁ | vδT/T₂
    
         ( i.e. vδT computed for tracked set being union of T₁ and T₂ is the
           same as merge of vδT computed for tracked set T₁ and vδT computed
          for tracked set T₂ )
    
       and that
    
        trackSet | (δPP₁|δPP₂) = (trackSet|δPP₁) | (trackSet|δPP₂)
    
        ( i.e. tracking set updated for union of δPP₁ and δPP₂ is the same
          as union of tracking set updated with δPP₁ and tracking set updated
          with δPP₂ )
    
       these merge properties allow to run computation for δ(vδT) and δ(trackSet)
       independently and with ΔBtail unlocked, which in turn enables running
       several Track/queries in parallel.
    
    4. while vδT rebuild is being run, krebuildJobs keeps corresponding keycov
       entry to indicate in-progress rebuild. Should a query need vδT for keys
       from that job, it first waits for corresponding job(s) to complete.
    
    Explained rebuild organization allows non-overlapping queries/track-requests
    to run simultaneously. (This property is essential to WCFS because otherwise
    WCFS would not be able to serve several non-overlapping READ requests to one
    file in parallel.)
    
    --------
    
    (*) implemented in treediff.go
    (+) full database scan
    
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    
    Some preliminary history:
    
    877e64a9    X wcfs: Fix tests to pass again
    c32055fc    X wcfs/xbtree: ΔBtail tests += ø -> Tree; Tree -> ø
    78f2f88b    X wcfs/xbtree: Fix treediff(a, ø)
    5324547c    X wcfs/xbtree: root(a) must stay in trackSet even after treediff(a,ø)
    f65f775b    X wcfs/xbtree: treediff(ø, b)
    c75b1c6f    X wcfs/xbtree: Start killing holeIdx
    0fa06cbd    X kadj must be taken into account as kadj^δZ
    ef5e5183    X treediff ret += δtkeycov
    f30826a6    X another bug in δtkeyconv computation
    0917380e    X wcfs: assert that keycov only grow
    502e05c2    X found why TestΔBTailAllStructs was not effective to find δtkeycov bugs
    450ba707    X Fix rebuild with ø @at2
    f60528c9    X ΔBtail.Clone had bug that it was aliasing klon and orig data
    9d20f8e8    X treediff: Fix BUG while computing AB coverage
    ddb28043    X rebuild: Don't return nil for empty ΔPPTreeSubSet - that leads to SIGSEGV
    324241eb    X rebuild: tests: Don't reflect.DeepEqual in inner loop
    8f6e2b1e    X rebuild: tests: Don't access ZODB in XGetδKV
    2c0b4793    X rebuild: tests: Don't access ZODB in xtrackKeys
    8f0e37f2    X rebuild: tests: Precompute kadj10·kadj21
    271d953d    X rebuild: tests: Move ΔBtail.Clone test out of hot inner loop into separate test
    a87cc6de    X rebuild: tests: Don't recompute trackSet(keys1R2) several times
    01433e96    X rebuild: tests: Don't compute keyCover in trackSet
    7371f9c5    X rebuild: tests: Inline _assertTrack
    3e9164b3    X rebuild: tests: Don't exercise keys from keys2 that already became tracked after Track(keys1) + Update
    e9c4b619    X rebuild: tests: Random testing
    d0fe680a    X δbtail += ForgetPast
    210e9b07    X Fix ΔBtail.SliceByRootRev (lo,hi] handling
    855ab4b8    X ΔBtail: Goodbye .KVAtTail
    2f5582e6    X ΔBtail: Tweak tests to run faster in normal mode
    cf352737    X random testing found another failing test for rebuild...
    7f7e34e0    X wcfs/xbtree: Fix update not to add duplicate extra point if rebuild  - called by Update - already added it
    6ad0052c    X ΔBtail.Track: No need to return error
    aafcacdf    X xbtree: GetAt test
    784a6761    X xbtree: Fix KAdj definition after treediff was reworked this summer to base decisions on node keycoverage instead of particular node keys
    0bb1c22e    X xbtree: Verify that ForgetPast clones vδT on trim
    a8945cbf    X Start reworking rebuild routines not to modify data inplace
    b74dda09    X Start switching Track from Track(key) to Track(keycov)
    dea85e87    X Switch GetAt to vδTSnapForTrackedKey
    aa0288ce    X Switch SliceByRootRev to vδTSnapForTracked
    c4366b14    X xbtree: tests: Also verify state of ΔTtail.ktrackNew
    b98706ad    X Track should be nop if keycov/path is already in krebuildJobs
    e141848a    X test.go  ↑ timeout  10m -> 20m
    423f77be    X wcfs: Goodby holeIdx
    37c2e806    X wcfs: Teach treediff to compute not only δtrack (set of nodes), but also δ for track-key coverage
    52c72dbb    X ΔBtail.rebuild started to work draftly
    c9f13fc7    X Get rebuild tests to run in a sane time; Add proper random-based testing for rebuild
    c7f1e3c9    X xbtree: Factor testing infrastructure bits into xbtree/xbtreetest
    7602c1f4    ΔBtail concurrency
    2ab4be93
xbtree.go 3.05 KB