1. 21 Dec, 2016 3 commits
    • Julien Muchembled's avatar
      storage: start replicating the partition which is furthest behind · 4d3f3723
      Julien Muchembled authored
      This fixes the following case when the backup is far behing the upstream DB,
      and there are transactions being committed at the same time:
      
      1. replicate partition 0
      2. replicate partition 0
      3. replicate partition 1
      4. replicate partition 0
      5. replicate partition 1
      6. replicate partition 2
      7. replicate partition 0
      ...
      and so on in a quadratic way.
      
      When the upstream activity was too high, the backup could even be stuck looping
      on the first partitions.
      4d3f3723
    • Julien Muchembled's avatar
      master: fix possibly wrong knowledge of cells' backup_tid when resuming backup · 17af3b47
      Julien Muchembled authored
      The issue happens when there were commits while the backup cluster was down.
      In this case, the master thinks that these commits are already replicated,
      reporting wrong backup_tid to neoctl. It solved by itself once:
      - there are new commits triggering replication for all partitions;
      - all storage nodes have really replicated.
      
      This also resulted in an inconsistent database when leaving backup mode during
      this period.
      17af3b47
    • Julien Muchembled's avatar
      Minor comment/doc changes · c95c6c39
      Julien Muchembled authored
      c95c6c39
  2. 20 Dec, 2016 1 commit
  3. 06 Dec, 2016 2 commits
    • Julien Muchembled's avatar
      master,client: ignore notifications before complete initialization · 36b2d141
      Julien Muchembled authored
      A backup master crashed with the following traceback after a reconnection:
      
          Traceback (most recent call last):
            File "neo/master/app.py", line 127, in run
              self._run()
            File "neo/master/app.py", line 147, in _run
              self.playPrimaryRole()
            File "neo/master/app.py", line 348, in playPrimaryRole
              self.backup_app.provideService())
            File "neo/master/backup_app.py", line 123, in provideService
              poll(1)
            File "neo/lib/event.py", line 126, in poll
              to_process.process()
            File "neo/lib/connection.py", line 500, in process
              self._handlers.handle(self, self._queue.pop(0))
            File "neo/lib/connection.py", line 110, in handle
              self._handle(connection, packet)
            File "neo/lib/connection.py", line 125, in _handle
              handler.packetReceived(connection, packet)
            File "neo/lib/handler.py", line 117, in packetReceived
              self.dispatch(*args)
            File "neo/lib/handler.py", line 66, in dispatch
              method(conn, *args, **kw)
            File "neo/master/handlers/backup.py", line 52, in invalidateObjects
              app.invalidatePartitions(tid, partition_set)
            File "neo/master/backup_app.py", line 257, in invalidatePartitions
              self.triggerBackup(node)
            File "neo/master/backup_app.py", line 281, in triggerBackup
              assert cell_list, offset
          AssertionError: 0
      36b2d141
    • Julien Muchembled's avatar
  4. 01 Dec, 2016 5 commits
    • Julien Muchembled's avatar
      Remove dead code found by coverage · 23b9544d
      Julien Muchembled authored
      23b9544d
    • Julien Muchembled's avatar
      Remove some useless unit tests · 1e4a4178
      Julien Muchembled authored
      Many "unit" tests (!= "threaded" tests) don't do more than checking
      implementation details, and increase coverage artificially. As with testEvent
      in commit 71e30fb9, most of these tests will
      either be removed or rewritten as threaded tests.
      
      The fact that the remaining unit tests actually cover code that other test
      don't gives motivation to maintain them. It will be also less code to update
      when switching to https://pypi.python.org/pypi/mock
      
      I proceeded as follows:
      
      1. Measure coverage for all tests except unit tests. While checking my work,
         I found that coverage stats for threaded/functional/zodb tests are quite
         unstable, so I restarted from the beginning by doing this measure several
         times and only keeping the intersection of coverage data.
      
      2. Measure coverage individually for each 'unit' tests, and substract the
         each result with the data in 1.
      
      3. The candidates for deletion are those without any code covered.
      
      Tests I didn't delete:
      
      - neo.tests.master.testElectionHandler: I always do minimal changes about
        election, as long as there's no serious review.
      
      - neo.tests.master.testMasterPT.MasterPartitionTableTests.test_13_outdate
      
      - 4 tests in neo.tests.testPT:
        test_01_Cell, test_04_removeCell, test_06_clear, test_08_filled
      
      - neo.tests.storage.testStorage{MySQL,SQLite}
      
      - neo.tests.testUtil.UtilTests.testReadBufferRead
      
      In a way, this commit is actually quite conservative. There are still many
      useless tests that only check error paths and for simple tested methods, this
      is just duplicating thie tested code.
      1e4a4178
    • Julien Muchembled's avatar
    • Julien Muchembled's avatar
      Remove unused imports, found by pylint · 3b5a6edb
      Julien Muchembled authored
      3b5a6edb
    • Julien Muchembled's avatar
      TODO: tweak should be safer · e0a2a217
      Julien Muchembled authored
      e0a2a217
  5. 30 Nov, 2016 3 commits
    • Julien Muchembled's avatar
      Various neoctl/neolog formatting improvements/fixes · 264f6f57
      Julien Muchembled authored
      - format IPv6 inside [] when followed by :<port>
      - unify rendering of node lists
      - neoctl: do not crash on empty DB (no PT id)
      264f6f57
    • Julien Muchembled's avatar
    • Julien Muchembled's avatar
      client: fix simultaneous (re)connections to the master · ec031cdf
      Julien Muchembled authored
      This fixes a reqression in commit c39d5c67,
      which could leads to failures like:
      
      2016-11-29 09:56:58,756 ERROR ZODB.Connection Couldn't load state for 0x4843
      Traceback (most recent call last):
        File "ZODB/Connection.py", line 860, in setstate
          self._setstate(obj)
        File "ZODB/Connection.py", line 901, in _setstate
          p, serial = self._storage.load(obj._p_oid, '')
        File "neo/client/Storage.py", line 82, in load
          return self.app.load(oid)[:2]
        File "neo/client/app.py", line 352, in load
          data, tid, next_tid, _ = self._loadFromStorage(oid, tid, before_tid)
        File "neo/client/app.py", line 372, in _loadFromStorage
          for node, conn in self.cp.iterateForObject(oid, readable=True):
        File "neo/client/pool.py", line 91, in iterateForObject
          pt = self.app.pt
        File "neo/client/app.py", line 146, in __getattr__
          return self.__getattribute__(attr)
      AttributeError: 'Application' object has no attribute 'pt'
      ec031cdf
  6. 28 Nov, 2016 4 commits
  7. 27 Nov, 2016 11 commits
    • Julien Muchembled's avatar
      Bump protocol version · 8eb14b01
      Julien Muchembled authored
      8eb14b01
    • Julien Muchembled's avatar
      Fix identification issues, including a race condition causing id conflicts · 9385706f
      Julien Muchembled authored
      The added test describes how the new id timestamps fix the race condition.
      These timestamps could be any unique opaque values, and the protocol is
      extended to exchange them along with node ids.
      
      Internally, nodes also reuse timestamps as a marker to identify the first
      NotifyNodeInformation packets from the master: since this packet is a complete
      list of nodes in the cluster, any other node in the node manager has left the
      cluster definitely and is removed.
      
      The secondary masters didn't receive update about master nodes.
      It's also useless to send them information about non-master nodes.
      9385706f
    • Julien Muchembled's avatar
      54e819ff
    • Julien Muchembled's avatar
      Remove AskNodeInformation packet · d048a52d
      Julien Muchembled authored
      When Client (including backup master) and admin nodes are identified,
      the primary master now sends them automatically all nodes with
      NotifyNodeInformation, as with storage nodes.
      d048a52d
    • Julien Muchembled's avatar
      master: fix crashes in identification due to buggy nodes · 35664759
      Julien Muchembled authored
      - check address conflicts
      - on invalid values, reject peer instead of dying
      35664759
    • Julien Muchembled's avatar
      lib.node: fix NodeManager accessors returning identified nodes · e7cccf01
      Julien Muchembled authored
      Listing connected/connecting nodes with a UUID is used:
      - in one place by storage nodes: here, it does not matter if we skip nodes that
        aren't really identified
      - in many places by the master, only for server connections, in which case we
        have equivalence with real identification
      
      So in practice, NodeManager is only simplified to reuse the 'identified'
      property of nodes.
      e7cccf01
    • Julien Muchembled's avatar
      lib.node: code refactoring · 5941b27d
      Julien Muchembled authored
      5941b27d
    • Julien Muchembled's avatar
      storage: only accept clients that are known by the master · c17f5f91
      Julien Muchembled authored
      Therefore, a client node in the node manager is always RUNNING.
      c17f5f91
    • Julien Muchembled's avatar
      Give new ids to clients whose ids were already reallocated · d752aadb
      Julien Muchembled authored
      Although the change applies to any node with a temporary ids (all but storage),
      only clients don't have addresses and are therefore not recognizable.
      
      After a client is disconnected from the master and before reconnecting, another
      client may join the cluster and "steals" the id of the first client. This issue
      leads to stuck clients, failing in loop with exceptions like the following one:
      
          ERROR ZODB.Connection Couldn't load state for 0x0251
          Traceback (most recent call last):
            File "ZODB/Connection.py", line 860, in setstate
              self._setstate(obj)
            File "ZODB/Connection.py", line 901, in _setstate
              p, serial = self._storage.load(obj._p_oid, '')
            File "neo/client/Storage.py", line 82, in load
              return self.app.load(oid)[:2]
            File "neo/client/app.py", line 353, in load
              data, tid, next_tid, _ = self._loadFromStorage(oid, tid, before_tid)
            File "neo/client/app.py", line 373, in _loadFromStorage
              for node, conn in self.cp.iterateForObject(oid, readable=True):
            File "neo/client/pool.py", line 91, in iterateForObject
              pt = self.app.pt
            File "neo/client/app.py", line 145, in __getattr__
              self._getMasterConnection()
            File "neo/client/app.py", line 214, in _getMasterConnection
              result = self.master_conn = self._connectToPrimaryNode()
            File "neo/client/app.py", line 246, in _connectToPrimaryNode
              handler=handler)
            File "neo/lib/threaded_app.py", line 154, in _ask
              _handlePacket(qconn, qpacket, kw, handler)
            File "neo/lib/threaded_app.py", line 135, in _handlePacket
              handler.dispatch(conn, packet, kw)
            File "neo/lib/handler.py", line 66, in dispatch
              method(conn, *args, **kw)
            File "neo/lib/handler.py", line 188, in error
              getattr(self, Errors[code])(conn, message)
            File "neo/client/handlers/__init__.py", line 23, in protocolError
              raise StorageError("protocol error: %s" % message)
          StorageError: protocol error: already connected
      d752aadb
    • Julien Muchembled's avatar
      spelling: oudated -> outdated · b62b8dc3
      Julien Muchembled authored
      b62b8dc3
    • Julien Muchembled's avatar
      Fix spelling mistakes · 6e32ebb7
      Julien Muchembled authored
      6e32ebb7
  8. 25 Nov, 2016 2 commits
  9. 21 Nov, 2016 2 commits
    • Julien Muchembled's avatar
      client: fix item eviction from cache, which could break loading from storage · 4ef05b9e
      Julien Muchembled authored
      `ClientCache._oid_dict` shall not have empty values. For a given oid, when the
      last item is removed from the cache, the oid must be removed as well to free
      memory. In some cases, this was not done.
      
      A consequence of this bug is the following exception:
      
          ERROR ZODB.Connection Couldn't load state for 0x02d1e1e4
          Traceback (most recent call last):
            File "ZODB/Connection.py", line 860, in setstate
              self._setstate(obj)
            File "ZODB/Connection.py", line 901, in _setstate
              p, serial = self._storage.load(obj._p_oid, '')
            File "neo/client/Storage.py", line 82, in load
              return self.app.load(oid)[:2]
            File "neo/client/app.py", line 358, in load
              self._cache.store(oid, data, tid, next_tid)
            File "neo/client/cache.py", line 228, in store
              prev = item_list[-1]
          IndexError: list index out of range
      4ef05b9e
    • Julien Muchembled's avatar
  10. 15 Nov, 2016 2 commits
    • Kirill Smelkov's avatar
      backup: Teach cluster in BACKUPING state to also serve regular ZODB clients in read-only mode · d4944062
      Kirill Smelkov authored
      A backup cluster for tids <= backup_tid has all data to provide regular
      read-only ZODB service. Having regular ZODB access to the data can be
      handy e.g. for externally verifying data for consistency between
      main and backup clusters. Peeking around without disturbing main
      cluster might be also useful sometimes.
      
      In this patch:
      
      - master & storage nodes are taught:
      
          * to instantiate read-only or regular client service handler depending on cluster state:
            RUNNING   -> regular
            BACKINGUP -> read-only
      
          * in read-only client handler:
            + to reject write-related operations
            + to provide read operations but adjust semantic as last_tid in the database
              would be = backup_tid
      
      - new READ_ONLY_ACCESS protocol error code is introduced so that client can
        raise POSException.ReadOnlyError upon receiving it.
      
      I have not implemented back-channel for invalidations in read-only mode (yet ?).
      This way once a client connects to cluster in backup state, it won't see
      new data fetched by backup cluster from upstream after client connected.
      
      The reasons invalidations are not implemented is that for now (imho)
      there is no off-hand ready infrastructure to get updates from
      replicating node on transaction-by-transaction basis (it currently only
      notifies when whole batch is done). For consistency verification (main
      reason for this patch) we also don't need invalidations to work, as in
      that task we always connect afresh to backup. So I simply only put
      relevant TODOs about invalidations for now.
      
      The patch is not very polished but should work.
      
      /reviewed-on nexedi/neoppod!4
      d4944062
    • Kirill Smelkov's avatar
  11. 27 Oct, 2016 1 commit
    • Iliya Manolov's avatar
      neoctl: make 'print ids' command display time of TIDs · d9dd39f0
      Iliya Manolov authored
      Currently, the command "neoctl [arguments] print ids" has the following output:
      
          last_oid = 0x...
          last_tid = 0x...
          last_ptid = ...
      
      or
      
          backup_tid = 0x...
          last_tid = 0x...
          last_ptid = ...
      
      depending on whether the cluster is in normal or backup mode.
      
      This is extremely unreadable since the admin is often interested in the time that corresponds to each tid. Now the output is:
      
          last_oid = 0x...
          last_tid = 0x... (yyyy-mm-dd hh:mm:ss.ssssss)
          last_ptid = ...
      
      or
      
          backup_tid = 0x... (yyyy-mm-dd hh:mm:ss.ssssss)
          last_tid = 0x... (yyyy-mm-dd hh:mm:ss.ssssss)
          last_ptid = ...
      
      /reviewed-on nexedi/neoppod!2
      d9dd39f0
  12. 17 Oct, 2016 1 commit
    • Kirill Smelkov's avatar
      mysql: force _getNextTID() to use appropriate/whole index · eaa00a88
      Kirill Smelkov authored
      Similarly to 13911ca3 on the same instance after MariaDB was upgraded to
      10.1.17 the following query, even after `OPTIMIZE TABLE obj`, started to execute
      very slowly:
      
          MariaDB [(none)]> SELECT tid FROM neo1.obj WHERE `partition`=5 AND oid=79613 AND tid>268707071353462798 ORDER BY tid LIMIT 1;
          +--------------------+
          | tid                |
          +--------------------+
          | 268707072758797063 |
          +--------------------+
          1 row in set (4.82 sec)
      
      Both explain and analyze says the query will/is using `partition` key but only partially (note key_len is only 10, not 18):
      
          MariaDB [(none)]> SHOW INDEX FROM neo1.obj;
          +-------+------------+-----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
          | Table | Non_unique | Key_name  | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
          +-------+------------+-----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
          | obj   |          0 | PRIMARY   |            1 | partition   | A         |    28755928 |     NULL | NULL   |      | BTREE      |         |               |
          | obj   |          0 | PRIMARY   |            2 | tid         | A         |    28755928 |     NULL | NULL   |      | BTREE      |         |               |
          | obj   |          0 | PRIMARY   |            3 | oid         | A         |    28755928 |     NULL | NULL   |      | BTREE      |         |               |
          | obj   |          0 | partition |            1 | partition   | A         |    28755928 |     NULL | NULL   |      | BTREE      |         |               |
          | obj   |          0 | partition |            2 | oid         | A         |    28755928 |     NULL | NULL   |      | BTREE      |         |               |
          | obj   |          0 | partition |            3 | tid         | A         |    28755928 |     NULL | NULL   |      | BTREE      |         |               |
          | obj   |          1 | data_id   |            1 | data_id     | A         |    28755928 |     NULL | NULL   | YES  | BTREE      |         |               |
          +-------+------------+-----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
          7 rows in set (0.00 sec)
      
          MariaDB [(none)]> explain SELECT tid FROM neo1.obj WHERE `partition`=5 AND oid=79613 AND tid>268707071353462798 ORDER BY tid LIMIT 1;
          +------+-------------+-------+------+-------------------+-----------+---------+-------------+------+--------------------------+
          | id   | select_type | table | type | possible_keys     | key       | key_len | ref         | rows | Extra                    |
          +------+-------------+-------+------+-------------------+-----------+---------+-------------+------+--------------------------+
          |    1 | SIMPLE      | obj   | ref  | PRIMARY,partition | partition | 10      | const,const |    2 | Using where; Using index |
          +------+-------------+-------+------+-------------------+-----------+---------+-------------+------+--------------------------+
          1 row in set (0.00 sec)
      
          MariaDB [(none)]> analyze SELECT tid FROM neo1.obj WHERE `partition`=5 AND oid=79613 AND tid>268707071353462798 ORDER BY tid LIMIT 1;
          +------+-------------+-------+------+-------------------+-----------+---------+-------------+------+------------+----------+------------+--------------------------+
          | id   | select_type | table | type | possible_keys     | key       | key_len | ref         | rows | r_rows     | filtered | r_filtered | Extra                    |
          +------+-------------+-------+------+-------------------+-----------+---------+-------------+------+------------+----------+------------+--------------------------+
          |    1 | SIMPLE      | obj   | ref  | PRIMARY,partition | partition | 10      | const,const |    2 | 9741121.00 |   100.00 |       0.00 | Using where; Using index |
          +------+-------------+-------+------+-------------------+-----------+---------+-------------+------+------------+----------+------------+--------------------------+
          1 row in set (4.93 sec)
      
      By explicitly forcing (partition, oid, tid) index usage which is precisely designed to serve this and similar queries can avoid the query from being slow:
      
          MariaDB [(none)]> analyze SELECT tid FROM neo1.obj FORCE INDEX(`partition`) WHERE `partition`=5 AND oid=79613 AND tid>268707071353462798 ORDER BY tid LIMIT 1;
          +------+-------------+-------+-------+---------------+-----------+---------+------+------+--------+----------+------------+--------------------------+
          | id   | select_type | table | type  | possible_keys | key       | key_len | ref  | rows | r_rows | filtered | r_filtered | Extra                    |
          +------+-------------+-------+-------+---------------+-----------+---------+------+------+--------+----------+------------+--------------------------+
          |    1 | SIMPLE      | obj   | range | partition     | partition | 18      | NULL |    2 |   1.00 |   100.00 |     100.00 | Using where; Using index |
          +------+-------------+-------+-------+---------------+-----------+---------+------+------+--------+----------+------------+--------------------------+
          1 row in set (0.00 sec)
      
      /cc @jm, @vpelltier, @Tyagov
      
      /reviewed-on !1
      eaa00a88
  13. 12 Sep, 2016 1 commit
  14. 29 Aug, 2016 2 commits
    • Julien Muchembled's avatar
      mysql: fix use of wrong SQL index when checking for dropped partitions · 13911ca3
      Julien Muchembled authored
      After partitions were dropped with TokuDB, we had a case where MariaDB 10.1.14
      stopped using the most appropriate index.
      
      MariaDB [neo0]> explain SELECT DISTINCT data_id FROM obj WHERE `partition`=5;
      +------+-------------+-------+-------+-------------------+---------+---------+------+------+---------------------------------------+
      | id   | select_type | table | type  | possible_keys     | key     | key_len | ref  | rows | Extra                                 |
      +------+-------------+-------+-------+-------------------+---------+---------+------+------+---------------------------------------+
      |    1 | SIMPLE      | obj   | range | PRIMARY,partition | data_id | 11      | NULL |   10 | Using where; Using index for group-by |
      +------+-------------+-------+-------+-------------------+---------+---------+------+------+---------------------------------------+
      MariaDB [neo0]> SELECT SQL_NO_CACHE DISTINCT data_id FROM obj WHERE `partition`=5;
      Empty set (1 min 51.47 sec)
      
      Expected:
      
      MariaDB [neo1]> explain SELECT DISTINCT data_id FROM obj WHERE `partition`=4;
      +------+-------------+-------+------+-------------------+---------+---------+-------+------+------------------------------+
      | id   | select_type | table | type | possible_keys     | key     | key_len | ref   | rows | Extra                        |
      +------+-------------+-------+------+-------------------+---------+---------+-------+------+------------------------------+
      |    1 | SIMPLE      | obj   | ref  | PRIMARY,partition | PRIMARY | 2       | const |    1 | Using where; Using temporary |
      +------+-------------+-------+------+-------------------+---------+---------+-------+------+------------------------------+
      1 row in set (0.00 sec)
      MariaDB [neo1]> SELECT SQL_NO_CACHE DISTINCT data_id FROM obj WHERE `partition`=4;
      Empty set (0.00 sec)
      
      Restarting the server or 'OPTIMIZE TABLE obj; ' does not help.
      
      Such issue could prevent the cluster to start due to timeouts, by always going
      back to RECOVERING state.
      13911ca3
    • Julien Muchembled's avatar
      Update TODO · 00ffb1ef
      Julien Muchembled authored
      00ffb1ef