  Documentation
    - Clarify the meaning of node states, and consider renaming them in the code.
      Ideas:
        TEMPORARILY_DOWN becomes UNAVAILABLE
        BROKEN is removed ?
    - Clarify the use of each error code:
      - NOT_READY removed (connection kept open until ready)
      - Split PROTOCOL_ERROR (BAD IDENTIFICATION, ...)
    - Add docstrings (think of doctests)

  Code

    Code changes often impact more than just one node. They are categorised by
    node where the most important changes are needed.

    General
    - Review XXX/TODO code tags (CODE)
    - When all cells are OUT_OF_DATE in backup mode, the one with the most data
      could become UP_TO_DATE with an appropriate backup_tid, so that the cluster
      stays operational. (FEATURE)
    - Finish renaming UUID into NID everywhere (CODE)
    - Remove sleeps (LATENCY, CPU WASTE)
      Code still contains many delays (explicit sleeps or polling timeouts).
      They must be removed to be either infinite (sleep until some condition
      becomes true, without waking up needlessly in the meantime) or null
      (don't wait at all).
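      A minimal sketch of the intended waiting pattern, assuming a hypothetical
      'operational' threading.Event (not NEO's actual API):

        import threading

        operational = threading.Event()   # hypothetical condition flag

        # Polling pattern to remove (wastes CPU and adds latency):
        #   while not operational.is_set():
        #       time.sleep(0.1)

        # Intended pattern: sleep until the condition becomes true, without
        # waking up needlessly in the meantime.
        def wait_until_operational():
            operational.wait()

        # Whoever makes the condition true wakes all waiters at once.
        def set_operational():
            operational.set()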
    - Implement delayed connection acceptance.
      Currently, any node that connects too early to another that is busy for
      some reason is immediately rejected with the 'not ready' error code. This
      should be replaced by a queue in the listening node that keeps a pool of
      nodes to be accepted later, once the conditions are satisfied.
      This is mainly the case for:
        - Clients rejected before the cluster is operational
        - Empty storages rejected during the recovery process
      Masters involved in the election process should still reject any
      connection, as the primary master is still unknown.
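      A rough sketch of the queueing idea; the callbacks and names below are
      hypothetical, not the actual handler/connection API:

        from collections import deque

        class DelayedAcceptor(object):
            """Keep too-early connections pending instead of rejecting them."""

            def __init__(self, is_ready, accept):
                self.is_ready = is_ready  # callable: can this node be served now ?
                self.accept = accept      # callable: finish the identification
                self.pending = deque()

            def on_identification(self, conn):
                if self.is_ready(conn):
                    self.accept(conn)
                else:
                    # Instead of answering 'not ready' and closing, park it.
                    self.pending.append(conn)

            def on_ready(self):
                # Called once the cluster is operational / recovery is over.
                while self.pending:
                    self.accept(self.pending.popleft())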
    - Wait before reconnecting to a node after an error. For example, a backup
      storage node that is rejected because the upstream cluster is not ready
      currently reconnects in a loop without delay, using 100% CPU and
      flooding the logs.
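      A possible backoff helper, purely illustrative (the constants and the
      'try_connect' hook are assumptions):

        import time

        def reconnect_delays(initial=1.0, maximum=60.0):
            """Yield increasing delays between reconnection attempts."""
            delay = initial
            while True:
                yield delay
                delay = min(delay * 2, maximum)

        # Usage sketch: sleep before each retry instead of looping immediately.
        #   for delay in reconnect_delays():
        #       if try_connect():       # hypothetical
        #           break
        #       time.sleep(delay)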
    - Implement transaction garbage collection API (FEATURE)
      NEO packing implementation does not update transaction metadata when
      deleting object revisions. It must be possible to clean up this
      inconsistency from a client application, much in the same way the garbage
      collection part of packing is done.
    - Factorise node initialisation for admin, client and storage (CODE)
      The same code to ask/receive node list and partition table exists in too
      many places.
    - Clarify handler methods to call when a connection is accepted from a
      listening connection and when the remote node is identified
      (cf. neo/lib/bootstrap.py).
    - Choose how to handle storage integrity verification when a storage comes
      back: run the replication process or the verification stage, with or
      without unfinished transactions? Do its cells have to be set as outdated,
      and if so, should the partition table changes be broadcast?
      (BANDWIDTH, SPEED)
    - Make SIGINT on the primary master switch the cluster to the STOPPING state.
    - Review PENDING/HIDDEN/SHUTDOWN states, don't use notifyNodeInformation()
      to do a state switch, use an exception-based mechanism ? (CODE)
    - Review handler split (CODE)
      The current handler split is the result of small incremental changes. A
      global review is required to make them consistent.
    - Make handler instances become singletons (SPEED, MEMORY)
      In some places handlers are instantiated outside of App.__init__ . As a
      handler is completely re-entrant (no modifiable properties), it can and
      should be made a singleton (saving the CPU time needed to instantiate all
      the copies - often when a connection is established - and the memory used
      by each copy).
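      One possible shape, sketched with a hypothetical per-class cache (a
      simpler alternative is to instantiate every handler once in App.__init__):

        _handler_cache = {}

        def get_handler(cls, app):
            """Return the single shared instance of a handler class."""
            # Assumes one App per process, as handlers keep no mutable state.
            try:
                return _handler_cache[cls]
            except KeyError:
                handler = _handler_cache[cls] = cls(app)
                return handler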
    - Review node notifications. E.g. a storage doesn't have to be notified of
      new clients, but only when one is lost.
    - Review transactional isolation of various methods
      Some methods might not implement proper transaction isolation when they
      should. An example is object history (undoLog), which can see data
      committed by future transactions.
    - Add a 'devid' storage configuration so that the master does not distribute
      replicated partitions to storages with the same 'devid'.
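      A sketch of the intended constraint when assigning replicas of a
      partition (the 'devid' attribute and this helper are hypothetical):

        def choose_replicas(nodes, count):
            """Pick up to `count` storages, never reusing the same devid."""
            chosen, used_devids = [], set()
            for node in nodes:
                devid = getattr(node, 'devid', None)
                if devid is not None and devid in used_devids:
                    continue  # same physical device: would not add redundancy
                chosen.append(node)
                used_devids.add(devid)
                if len(chosen) == count:
                    break
            return chosen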

    Storage
    - Use HailDB instead of a stand-alone MySQL server.
    - Notify the master when a storage becomes available for clients (LATENCY)
      Currently, storage presence is broadcast to client nodes too early, as
      the storage node would refuse them until it has only up-to-date data (not
      only up-to-date cells, but also a partition table and node states).
    - In backup mode, 2 simultaneous replications should be possible so that:
      - outdated cells do not block backup for too long
      - constantly modified partitions do not prevent outdated cells from
        replicating
      Current behaviour is undefined and the above 2 scenarios may happen.
    - Create a specialized PartitionTable that knows the database and replicator
      to remove duplicates and remove logic from handlers (CODE)
    - Consider inserting multiple objects at a time in the database, taking
      care of the maximum allowed SQL request size. (SPEED)
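      A sketch of chunked inserts with MySQLdb's executemany; the table layout
      and chunk size are illustrative only, and a real implementation would
      derive the chunk size from max_allowed_packet:

        def store_objects(db, rows, chunk_size=1000):
            """Insert object rows in batches instead of one INSERT per object."""
            q = "INSERT INTO obj (oid, tid, data_id) VALUES (%s, %s, %s)"
            c = db.cursor()
            for i in range(0, len(rows), chunk_size):
                c.executemany(q, rows[i:i + chunk_size])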
    - Prevent SQL injection: escape() from the MySQLdb API is not sufficient,
      consider using query(request, args) instead of query(request % args)
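      For illustration, the difference with MySQLdb (values passed as a
      separate argument are escaped by the driver itself):

        def get_transaction(cursor, tid):
            # Unsafe: value interpolated in the SQL text, query(request % args):
            #   cursor.execute("SELECT * FROM trans WHERE tid = %s" % tid)
            # Safe: the driver escapes the value, query(request, args):
            cursor.execute("SELECT * FROM trans WHERE tid = %s", (tid,))
            return cursor.fetchone()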
    - Make the listening address and port optional, and if they are not
      provided, listen on all interfaces on any available port.
    - Make replication speed configurable (HIGH AVAILABILITY)
      In its current implementation, replication runs at the lowest priority, so
      as not to degrade performance for client nodes. But when there is only 1
      storage left for a partition, it may be desirable to guarantee a minimum
      speed to avoid complete data loss if another failure happens too early.
    - Pack segmentation & throttling (HIGH AVAILABILITY)
      In its current implementation, pack runs in one call on all storage nodes
      at the same time, which locks down the whole cluster. This task should
      be split into chunks and processed in the "background" on storage nodes.
      Pack throttling should probably be at the lowest possible priority
      (below interactive use and below replication).
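      A very rough sketch of the chunked approach; the callbacks are
      hypothetical, the point is to process a bounded number of oids per step
      and yield back to the event loop between chunks:

        def pack_in_chunks(get_next_oids, pack_one, chunk_size=100):
            """Pack a bounded number of oids per step instead of all at once."""
            while True:
                oids = get_next_oids(chunk_size)  # hypothetical cursor/iterator
                if not oids:
                    break
                for oid in oids:
                    pack_one(oid)
                yield  # let the storage node serve clients between chunks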
    - tpc_finish failures propagation to master (FUNCTIONALITY)
      When asked to lock transaction data, if something goes wrong, the master
      node must be informed.
    - Verify data checksum on reception (FUNCTIONALITY)
      In the current implementation, the client generates a checksum before
      storing, which is only checked upon load. This doesn't prevent storing
      altered data, which misses the point of having a checksum, and creates
      weird decisions (ex: if checksum verification fails on load, what should
      be done ? hope to find a storage with a valid checksum ? assume that data
      is correct in storage but was altered when it travelled through the
      network as we loaded it ?).
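      A sketch of the missing check on the storage side, assuming the checksum
      is the SHA-1 digest of the data (handler and error handling are
      illustrative):

        from hashlib import sha1

        def check_stored_data(data, checksum):
            """Verify the client-provided checksum before writing anything."""
            if sha1(data).digest() != checksum:
                # Reject the store instead of silently keeping altered data.
                raise ValueError('checksum mismatch on store')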
    - Check replicas: (HIGH AVAILABILITY)
      - Automatically tell corrupted cells to fix their data when a good source
        is known.
      - Add an option to also check all rows of trans/obj/data, instead of only
        keys (trans.tid & obj.{tid,oid}).

    Master
    - Master node data redundancy (HIGH AVAILABILITY)
      Secondary master nodes should replicate primary master data (i.e. the
      primary master should inform them of such changes).
      This data takes too long to extract from storage nodes, and losing it
      increases the risk of starting from underestimated values.
      This risk is (currently) unavoidable when all nodes stop running, but this
      case must be avoided.
    - If the cluster can't start automatically because the last partition table
      is not operational, allow the user to select an older operational one,
      and truncate the DB.
    - Optimize operational status check by recording which rows are ready
      instead of parsing the whole partition table. (SPEED)
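      A minimal sketch of the bookkeeping (names are hypothetical): keep the
      set of rows that have at least one readable cell up to date on each cell
      state change, so that the operational check becomes O(1):

        class ReadyRowTracker(object):
            """Track which partition table rows have a readable cell."""

            def __init__(self, num_partitions):
                self.num_partitions = num_partitions
                self.ready_rows = set()

            def cell_changed(self, offset, row_has_readable_cell):
                if row_has_readable_cell:
                    self.ready_rows.add(offset)
                else:
                    self.ready_rows.discard(offset)

            def is_operational(self):
                # Previously: parse the whole partition table on every check.
                return len(self.ready_rows) == self.num_partitions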
    - tpc_finish failures propagation to client (FUNCTIONALITY)
      When a storage node notifies a problem during the lock/unlock phase, an
      error must be propagated to the client.

    Client
    - Merge Application into Storage (SPEED)
    - Optimize cache.py by rewriting it either in C or Cython (LOAD LATENCY)
    - Use generic bootstrap module (CODE)
    - If too many storage nodes are dead, the client should check that the
      partition table hasn't changed by pinging the master, and retry if
      necessary.
    - Implement IStorageRestoreable (ZODB API) in order to preserve data
      serials (i.e. undo information).
    - tpc_finish might raise while the transaction got successfully committed.
      This can happen if the client gets disconnected from the primary master
      while waiting for AnswerFinishTransaction, after the primary received it
      and hence will commit the transaction independently of client presence.
      The client could legitimately think the transaction is not committed, and
      might decide to retry. To solve this, the client can know whether its TTID
      got successfully committed by looking at the currently unused
      '(t)trans.ttid' column.
      See neo.threaded.test.Test.testStorageFailureDuringTpcFinish
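      A sketch of the recovery check; the SQL is illustrative only (in practice
      this would go through a protocol message to a storage, and it assumes the
      ttid column is filled at commit time, which is precisely the TODO):

        def was_committed(cursor, ttid):
            """Return the committed tid for this ttid, or None if unknown."""
            cursor.execute("SELECT tid FROM trans WHERE ttid = %s", (ttid,))
            row = cursor.fetchone()
            return None if row is None else row[0]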
    - Fix and reenable deadlock avoidance (SPEED). This is required for
      neo.threaded.test.Test.testDeadlockAvoidance

    Admin
    - Make the admin node able to monitor multiple clusters simultaneously
    - Send notifications (i.e. mail) when a storage or master node is lost
    - Add a ctl command to truncate the DB at an arbitrary TID. The 'Truncate'
      message can be reused. There should also be a way to list the last
      transactions, like fstail for FileStorage.

    Tests
    - Use another mock library that is eggified and maintained.
      See http://garybernhardt.github.com/python-mock-comparison/
      for a comparison of available mocking libraries/frameworks.
    - Fix epoll descriptor leak.
    - Fix occasional deadlocks in threaded tests.

  Later
    - Consider auto-generating the cluster name upon initial startup (it might
      actually be a partition property).
    - Consider ways to centralise the configuration file, or make the
      configuration updatable automatically on all nodes.
    - Consider storing some metadata on master nodes (partition table [version],
      ...). This data should be treated non-authoritatively, as a way to lower
      the probability of using an outdated partition table.
    - Decentralize primary master tasks as much as possible (consider
      distributed lock mechanisms, ...)
    - Choose how to compute the storage size
    - Make the storage check whether the OID matches its partitions during a
      store
    - Investigate delta compression for stored data
      The idea would be to store the few most recent revisions fully, and older
      revisions delta-compressed, in order to save space.
    - Consider using multicast for cluster-wide notifications. (BANDWIDTH)
      Currently, multi-receiver notifications are sent in unicast to each
      receiver. Multicast should be used.