1. 03 Jul, 2018 1 commit
  2. 09 Apr, 2018 1 commit
  3. 13 Mar, 2018 1 commit
  4. 02 Mar, 2018 3 commits
    • master: fix resumption of backup replication (internal or not) · 27229793
      Julien Muchembled authored
      Previously, it waited for upstream activity until all partitions had been
      touched. However, when upstream is idle, the backup cluster could remain stuck
      forever if it was interrupted while some cells were still late.
    • master: fix/simplify generation of TID · 7b2e6752
      Julien Muchembled authored
      The 'min_tid < new_tid' assertion failed when jumping to the past.
    • master: fix possible failure when reading data in a backup cluster with replicas · ca2f7061
      Julien Muchembled authored
      Given that:
      - read locks are only taken by transactions (not replication)
      - in backup mode, storage nodes stay in UP_TO_DATE state, even if partitions
        are synchronized up to different tids
      
      there was a race condition with the master node replying to LastTransaction
      with a TID that may not be replicated yet by all replicas, potentially causing
      such replicas to reply OidDoesNotExist or OidNotFound if a client asks them for
      data too early.
      
      IOW, even if the cluster does contain the data up to `getBackupTid(max)`,
      it is only readable by NEO clients up to `getBackupTid(min)` as long as the
      cluster is in BACKINGUP state.
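
The rule above can be sketched as follows. This is only an illustration in Go, not the NEO/py code: the `Tid` type and the `readableBackupTid` helper are hypothetical names, and the per-partition TIDs are assumed to be given as a plain slice.

```go
package main

// Tid is a hypothetical stand-in for a NEO transaction id.
type Tid uint64

// readableBackupTid returns the highest TID that every partition has
// already replicated, i.e. the minimum over the per-partition backup TIDs.
// ok is false when no partition is given.
func readableBackupTid(partitionTids []Tid) (tid Tid, ok bool) {
	if len(partitionTids) == 0 {
		return 0, false
	}
	tid = partitionTids[0]
	for _, t := range partitionTids[1:] {
		if t < tid {
			tid = t
		}
	}
	return tid, true
}
```

Replying to LastTransaction with this minimum rather than the maximum means a client never asks a replica for data it has not received yet.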
  5. 17 Jan, 2018 1 commit
  6. 15 Jan, 2018 25 commits
    • go/zodb/zodbtools: TODO (cmp, analyze) · 6faed528
      Kirill Smelkov authored
    • go/zodb/zodbtools: Catobj · aa1d7e12
      Kirill Smelkov authored
      `zodb catobj` command to dump content of an object - similarly to `git
      cat-file`. Two modes: raw and verbose with `zodb dump` like headers for
      the object present.
      
      There is no such command currently in zodbtools/py.
    • go/zodb/zodbtools: Info · 27d02ad5
      Kirill Smelkov authored
      Command to print general information about a ZODB database.
      Same as `zodb info` in zodbtools/py.
    • go/zodb/zodbtools: Dump · dbb63f65
      Kirill Smelkov authored
      Add `zodb dump` command to dump arbitrary ZODB database in generic
      format. The actual dump protocol being used here is the same as in
      zodbtools/py with
      
      	zodbtools!3
      
      applied. (the MR there is OK and is just waiting for upstream ZODB to
      negotiate a way to retrieve transaction extension data in raw form).
    • go/zodb: Start of zodbtools - tools for managing ZODB databases · c6457cf7
      Kirill Smelkov authored
      Add zodbtools, a generic (in contrast to fs1tools) set of ZODB
      management utilities. Only the package and command infrastructure is here;
      the actual commands will follow in the next patches.
    • go/zodb/fs1tools: Reindex, Verify-index · 11ee44e0
      Kirill Smelkov authored
      Add commands for FileStorage index maintenance: manually rebuild the
      index and perform index verification.
    • go/zodb/fs1tools: Dump · 9de107fe
      Kirill Smelkov authored
      Add various FileStorage-specific dump commands with output being
      bit-to-bit exact with the following ZODB/py FileStorage tools:
      
      - fsdump.py
      - fsdump.py (verbose dumper)
      - fstail.py
      
      Please see the patch for links about these dump formats.
    • go/zodb/fs1: My notes on I/O · 0814c1e1
      Kirill Smelkov authored
    • d232237e
    • go/zodb/fs1: Actual FileStorage ZODB driver · 7792a133
      Kirill Smelkov authored
      Build FileStorage ZODB driver out of format record loading/decoding
      and index routines we just added in previous patches.
      
      The driver supports only read-only mode so far.
      
      Promised tests for data format interoperability with ZODB/py are added.
    • go/zodb/fs1: Index save/load · 8fa9fdaf
      Kirill Smelkov authored
      Build index type on top of fsb.Tree introduced in the previous patch and
      add routines to save and load it to/from disk.
      
      We ensure ZODB/py compatibility via generating test FileStorage database
      + its index and checking we can load index from it and also that if we
      save an index ZODB/py can load it back. FileStorage index is hard to get
      bit-to-bit identical since this index uses python pickles which can
      encode the same objects in several different ways.
    • go/zodb/fs1: BTree specialized with KEY=zodb.Oid, VALUE=int64 · 33d10066
      Kirill Smelkov authored
      FileStorage index maps an oid to the file position of the latest data record
      for this oid. Such an index is natural to implement via a BTree, as e.g.
      ZODB/py does.
      
      In Go world there is github.com/cznic/b BTree library but without
      specialization and working via interface{} it is slower than it could be
      and allocates a lot. So generate specialized version of that code with
      key and value types exactly suitable for FileStorage indexing.
      
      We use a bit patched b version with speed ups for bulk-loading data via
      regular point-ingestion BTree entry point:
      
      	https://lab.nexedi.com/kirr/b x/refill
      
      The patches have not been upstreamed because they slow down the general case
      a bit (only a bit, but still this is a "no" to me), and because with a
      dedicated bulk-loading API it could be possible to load data several times
      faster still. The current version is nevertheless enough for indices that
      are not very huge.
      
      Btw ZODB/py does the same (see fsBucket + friends).
    • go/zodb: Start of FileStorage support · 8f64f6ed
      Kirill Smelkov authored
      Start implementing FileStorage support by adding code to load/decode
      FileStorage records and way to iterate a FileStorage.
      
      Tests will come in a later patch together with ZODB-level loading
      support.
    • go/zodb: Way for storage-drivers to be registered and for clients to open them by URL · fcab9405
      Kirill Smelkov authored
      Storage drivers can register themselves via zodb.RegisterDriver.
      
      Later, clients can request to open a storage by URL via zodb.OpenStorage.
      The opener will look up the driver registry and wrap the created driver
      instance with a common layer (cache etc.) to turn an IStorageDriver into a fully
      working IStorage.
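
A minimal sketch of such a scheme-keyed registry is shown below. The function names follow the commit message, but the signatures, the `Driver` interface, `DriverOpener` and the toy `memDriver` are assumptions made for illustration; the real zodb package may differ, and the wrapping layer with cache is left out.

```go
package main

import (
	"fmt"
	"net/url"
	"sync"
)

// Driver is a hypothetical minimal driver interface for this sketch.
type Driver interface {
	Name() string
}

// DriverOpener creates a driver instance for an already parsed URL.
type DriverOpener func(u *url.URL) (Driver, error)

var (
	registryMu sync.Mutex
	registry   = map[string]DriverOpener{}
)

// RegisterDriver associates a URL scheme with a driver opener.
// Registering the same scheme twice is treated as a programming error.
func RegisterDriver(scheme string, opener DriverOpener) {
	registryMu.Lock()
	defer registryMu.Unlock()
	if _, dup := registry[scheme]; dup {
		panic(fmt.Sprintf("driver for scheme %q already registered", scheme))
	}
	registry[scheme] = opener
}

// OpenStorage parses the URL, looks the driver up by scheme and opens it.
func OpenStorage(rawurl string) (Driver, error) {
	u, err := url.Parse(rawurl)
	if err != nil {
		return nil, err
	}
	registryMu.Lock()
	opener, ok := registry[u.Scheme]
	registryMu.Unlock()
	if !ok {
		return nil, fmt.Errorf("%s: no driver registered for scheme %q", rawurl, u.Scheme)
	}
	return opener(u)
}

// memDriver is a toy driver used only to exercise the registry.
type memDriver struct{ path string }

func (d *memDriver) Name() string { return "mem:" + d.path }

// openMem is the opener for the toy "mem" scheme.
func openMem(u *url.URL) (Driver, error) {
	return &memDriver{path: u.Path}, nil
}
```

With this shape a driver package would typically call `RegisterDriver("mem", openMem)` once at start-up, after which `OpenStorage("mem:///tmp/x")` resolves to it.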
    • zodb/go: In-RAM client cache · 7233b4c0
      Kirill Smelkov authored
      The cache is needed so that we can provide IStorage.Prefetch
      functionality generally wrapped on top of a storage driver: when an
      object is loaded, the loading itself consists of steps:
      
      1. start loading object into cache,
      2. wait for the loading to complete.
      
      This way Prefetch is naturally only "1" - start loading object into
      cache but do not wait for the loading to be complete. Go's goroutines
      naturally help here where we can spawn every such loading into its own
      goroutine instead of explicitly programming loading in terms of a state
      machine.
      
      Since this cache is mainly needed for Prefetch to work, not to actually
      cache data (though it works as cache for repeating access too), the goal
      when writing it was to add minimal overhead for "data-not-yet-in-cache"
      case. Currently we are not completely there yet, but the latency is
      acceptable - depending on the workload the cache layer adds ~
      
      	0.5 - 1 - 3µs
      
      to loading times.
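
The two-step loading described above can be sketched like this. It is a simplified illustration, not the actual cache: the `Cache`, `entry` and `NewCache` names are hypothetical, objects are keyed by a bare `uint64`, and there is no eviction or size accounting.

```go
package main

import "sync"

// entry holds one object that is being loaded or is already loaded.
type entry struct {
	ready chan struct{} // closed once data and err are set
	data  []byte
	err   error
}

// Cache is a hypothetical in-RAM cache keyed by object id.
type Cache struct {
	mu      sync.Mutex
	entries map[uint64]*entry
	loader  func(oid uint64) ([]byte, error)
}

// NewCache creates a cache that loads missing objects with loader.
func NewCache(loader func(oid uint64) ([]byte, error)) *Cache {
	return &Cache{entries: map[uint64]*entry{}, loader: loader}
}

// start is step 1: begin loading oid into the cache unless a load for it
// was already started. Each load runs in its own goroutine.
func (c *Cache) start(oid uint64) *entry {
	c.mu.Lock()
	defer c.mu.Unlock()
	e, ok := c.entries[oid]
	if !ok {
		e = &entry{ready: make(chan struct{})}
		c.entries[oid] = e
		go func() {
			e.data, e.err = c.loader(oid)
			close(e.ready)
		}()
	}
	return e
}

// Prefetch performs step 1 only and does not wait.
func (c *Cache) Prefetch(oid uint64) {
	c.start(oid)
}

// Load performs step 1 and then step 2: wait for the loading to complete.
func (c *Cache) Load(oid uint64) ([]byte, error) {
	e := c.start(oid)
	<-e.ready
	return e.data, e.err
}
```

Because `Prefetch` and `Load` share the same entry, a `Load` issued after a `Prefetch` for the same object waits on the load already in flight instead of starting a second one.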
    • go/zodb: Minimal serialization compatibility with ZODB/py · dfd4fb73
      Kirill Smelkov authored
      ZODB/py serializes data using python pickles. Basically every serialized
      object has two parts: class description and object state. Here we
      start by providing minimal functionality to extract class-name from
      serialized data.
      
      The library used for pickle decoding (and in later patches encoding) is
      
      	github.com/kisielk/og-rek
      
      It was audited by me for security flaws to some extent.
      
      Contrary to Python pickle module it does not run arbitrary code on
      decoding.
    • go/zodb: Tid connection with time · bac6c953
      Kirill Smelkov authored
      Since ZODB TIDs correspond to time, provide functionality to
      convert a TID to a timestamp. Do so in exactly the same way as ZODB/py
      does for interoperability.
    • 3d13a276
    • go: Start of ZODB · 20d8456c
      Kirill Smelkov authored
      Our path of implementing NEO in Go will be not only for server-side, but
      also for client-side, since it is needed by Wendelin.core. On
      server-side we'll also need to work with types and data model Python
      ZODB implementation uses, so here it goes: Start of ZODB in Go.
      
      Here we define ZODB data types, data model and operational interfaces
      for IStorage + friends.
      
      The interfaces are currently read-only with stubs for write mode.
    • go: Basic .gitignore · 7cb20f32
      Kirill Smelkov authored
      Ignore files commonly produced while profiling Go programs and running
      tests.
    • NEO/go licensing · 612d556d
      Kirill Smelkov authored
      We want to make sure the code can be used by all projects without a
      problem. This way the license is GPLv3+ with wide exception for all Free
      Software / Open Source projects + Business options.
      
      Nexedi stack is licensed under Free Software licenses with various exceptions
      that cover three business cases:
      
      - Free Software
      - Proprietary Software
      - Rebranding
      
      As long as one intends to develop Free Software based on Nexedi stack, no
      license cost is involved. Developing proprietary software based on Nexedi stack
      may require a proprietary exception license. Rebranding Nexedi stack is
      prohibited unless rebranding license is acquired.
      
      Through this licensing approach, Nexedi expects to encourage Free Software
      development without restrictions and at the same time create a framework for
      proprietary software to contribute to the long term sustainability of the
      Nexedi stack.
      
      Please see https://www.nexedi.com/licensing for details, rationale and options.
      
      ( NEO/py for now stays at the old terms but it will be upgraded to the same
        terms as NEO/go eventually )
    • Sync NEO/py · a48d51c2
      Kirill Smelkov authored
      Sync with current NEO in Python implementation as the first step.
      
      We'll be using some common bits and in particular on-the-wire protocol
      must be the same and for py/go interoperability testing we'll also need
      python parts.
  7. 11 Jan, 2018 1 commit
  8. 08 Jan, 2018 1 commit
    • storage: optimize storage layout of raw data for replication · f4dd4bab
      Julien Muchembled authored
      # Previous status
      
      The issue was that we had extreme storage fragmentation from the point of view
      of the replication algorithm, which processes one partition at a time.
      
      By using an autoincrement for the 'data' table, rows were ordered by the time
      at which they were added:
      - parts may be the result of replication -> ordered by partition, tid, oid
      - other rows are globally sorted by tid
      
      Which means that when scanning a given partition, many rows were skipped all
      the time:
      - if readahead is big enough, the efficiency is 1/N for a node with N
        partitions assigned
      - else, it is worse because it seeks all the time
      
      For huge databases, the replication was horribly slow, in particular from HDD.
      
      # Chosen solution
      
      This commit changes how ids are generated to somehow split 'data'
      per partition. The backend tracks 1 last id per assigned partition, where the
      16 higher bits contain the partition. Keep in mind that the value of id has no
      meaning and it's only chosen for performance reasons. IOW, a row can be
      referred to by an oid of a partition different from the 16 higher bits of id:
      - there's no migration needed and the 16 higher bits of all existing rows are 0
      - in case of deduplication, a row can still be shared by different partitions
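
The id scheme can be sketched as follows. This is an illustration in Go rather than the backend code: it assumes a 64-bit id whose 16 higher bits hold the partition and whose 48 lower bits hold a per-partition counter, and the `idAllocator` type with its methods is hypothetical. A real backend would initialize the last ids from the database at startup.

```go
package main

// partitionShift assumes a 64-bit id: 16 high bits for the partition,
// 48 low bits for a per-partition counter.
const partitionShift = 48

// idAllocator tracks one last id per assigned partition.
type idAllocator struct {
	lastID map[uint16]uint64
}

func newIDAllocator() *idAllocator {
	return &idAllocator{lastID: map[uint16]uint64{}}
}

// next returns a new id for a row stored on behalf of the given partition.
// Ids of one partition are contiguous, so its rows are stored together.
func (a *idAllocator) next(partition uint16) uint64 {
	last, ok := a.lastID[partition]
	if !ok {
		// First id of this partition in this sketch: start of its range.
		last = uint64(partition) << partitionShift
	}
	id := last + 1
	a.lastID[partition] = id
	return id
}

// layoutPartition returns the partition encoded in the high bits of id.
// It only describes where the row is laid out: pre-existing rows have 0
// there, and deduplicated rows may be shared by other partitions.
func layoutPartition(id uint64) uint16 {
	return uint16(id >> partitionShift)
}
```

Scanning one partition then amounts to a range scan over its id interval instead of skipping rows that belong to other partitions.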
      
      Due to https://jira.mariadb.org/browse/MDEV-12836, we leave the autoincrement
      on existing databases.
      
      ## Downsides
      
      Insertion now slows down significantly as the number of partitions increases:
      for 2 nodes using TokuDB, 4% for 180 partitions, 40% for 2000. For 12
      partitions, the difference remains negligible. The solution for this issue will
      be to make it possible to increase the number of partitions efficiently, so that
      nodes can keep a small number of them, even for DBs that are expected to grow so
      much that many nodes are added over time: such a feature was already considered
      so that users no longer have to worry about this obscure setting at database
      creation.
      
      Read performance is only slowed down for applications that read a lot of data
      that were written contiguously, but split in small blocks. A solution is to
      extend ZODB so that the application tells it to choose new oids that will end up
      in the same partition. Like for insertion, there should not be too many
      partitions.
      
      With RocksDB (MariaDB 10.2.10), it takes a significant amount of time to
      collect all last ids at startup when there are many partitions.
      
      ## Other advantages
      
      - The storage layout of data is now always the same and does not depend on
        whether rows came from replication or commits.
      - Efficient deletion of partition to free space in-place will be possible.
      
      # Considered alternative
      
      The only serious alternative was to replicate as many partitions as possible at
      the same time, ideally all assigned partitions, but it's not always possible.
      For best performance, it would often require synchronizing new nodes, or even
      all of them, so that the source nodes don't have to scan 'data' several times.
      
      If existing nodes are kept, all data that aren't copied to the newly added
      nodes have to be skipped. If the number of nodes is multiplied by N, the
      efficiency is 1-1/N at best (synchronized nodes), else it's even worse
      because partitions are somehow shuffled.
      
      Checking/replacing a single node would remain slow when there are several
      source nodes.
      
      Finally, such an algorithm would be much more complex and we would not have the
      other advantages listed above.
  9. 05 Jan, 2018 6 commits