RC = Release Critical (for next release)

  Documentation
    - Clarify the meaning of each node state, and consider renaming them in the code.
      Ideas:
        TEMPORARILY_DOWN becomes UNAVAILABLE
        DOWN becomes UNKNOWN
        BROKEN is removed?
    - Clarify the use of each error code:
      - NO_ERROR error packet vs ACK special packet
      - NOT_READY removed (connection kept opened until ready)
      - Split PROTOCOL_ERROR (BAD IDENTIFICATION, ...)
RC  - Clarify the meaning of each cell state
    - Add docstrings (think of doctests)
RC  - Update README (TODOs should be dropped/moved here)

  Tests
    - rewrite tests
RC  - write ZODB-API-level tests

  Code

    Code changes often impact more than just one node. They are categorised by
    the node where the most important changes are needed.

    General
RC  - Review XXX in the code (CODE)
RC  - Review TODO in the code (CODE)
RC  - Review output of pylint (CODE)
    - Connections should be integrated into Node class instances (CODE)
      Currently, connections are managed separately from nodes, and the code
      very often needs to find one from the other. As all connections are to
      a node, and as all nodes can be represented as Node class instances, such
      instances should directly contain the associated connection, for code
      simplicity.
      Link node state and connection state: from a node's point of view, a node
      cannot be seen as running and unconnected at the same time.
      Consider something like:
        - UNAVAILABLE implies UNCONNECTED
        - UNKNOWN implies UNCONNECTED
        - RUNNING implies CONNECTED
        - IDENTIFIED implies RUNNING
        - DOWN implies that the node is dropped from manager/db/config
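      A minimal sketch of the intended invariant, using the state names above
      (this helper is illustrative only, not existing code):

        UNCONNECTED, CONNECTED = 'UNCONNECTED', 'CONNECTED'
        # Hypothetical mapping: node state -> connection state it implies.
        STATE_IMPLIES = {
            'UNAVAILABLE': UNCONNECTED,
            'UNKNOWN': UNCONNECTED,
            'RUNNING': CONNECTED,
            'IDENTIFIED': CONNECTED,
        }

        def isNodeConsistent(state, connection):
            """Check that a node state agrees with its connection (None if none)."""
            implied = STATE_IMPLIES.get(state)
            if implied == UNCONNECTED:
                return connection is None
            if implied == CONNECTED:
                return connection is not None
            return True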
    - Rework indexes in NodeManager class (CODE)
      NodeManager should provide indexes to quickly find nodes by type, UUID, 
      (ip, port), connection type (listening...) and state (identified...)
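      A rough sketch of such indexes (the Node accessors used below are
      assumptions for illustration):

        from collections import defaultdict

        class NodeManager(object):
            def __init__(self):
                self._uuid_index = {}                  # uuid -> node
                self._address_index = {}               # (ip, port) -> node
                self._type_index = defaultdict(set)    # node type -> nodes
                self._state_index = defaultdict(set)   # node state -> nodes

            def add(self, node):
                self._uuid_index[node.getUUID()] = node
                self._address_index[node.getServer()] = node
                self._type_index[node.getType()].add(node)
                self._state_index[node.getState()].add(node)

            def getNodeByUUID(self, uuid):
                return self._uuid_index.get(uuid)

            def getNodeListByState(self, state):
                return list(self._state_index[state])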
    - Keep-alive (HIGH AVAILABILITY)
      Consider the need to implement a keep-alive system (packets sent 
      automatically when there is no activity on the connection for a period 
      of time).
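      A possible sketch, assuming the event loop can tell how long each
      connection has been idle (the connection methods used are hypothetical):

        import time

        KEEP_ALIVE_PERIOD = 60  # seconds without traffic before pinging

        def checkKeepAlive(connection_list, now=None):
            """Send a ping on every connection that stayed idle too long."""
            if now is None:
                now = time.time()
            for conn in connection_list:
                if now - conn.getLastActivity() > KEEP_ALIVE_PERIOD:
                    conn.ping()  # hypothetical helper sending a Ping packet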
    - Factorise packet data when sending partition table cells (BANDWIDTH)
      Currently, each cell in a partition table update contains UUIDs of all 
      involved nodes.
      It must be changed to a correspondence table using shorter keys (sent
      in the packet) to avoid repeating the same UUIDs many times.
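      A sketch of the idea (the actual packet layout is out of scope here):
      build a UUID list once per packet and reference each cell by an index
      into that list.

        def encodeCellList(cell_list):
            """cell_list: iterable of (offset, uuid, state) tuples."""
            uuid_list = []
            uuid_index = {}
            compact_cell_list = []
            for offset, uuid, state in cell_list:
                if uuid not in uuid_index:
                    uuid_index[uuid] = len(uuid_list)
                    uuid_list.append(uuid)
                compact_cell_list.append((offset, uuid_index[uuid], state))
            # Each UUID is sent once; cells only carry a small integer key.
            return uuid_list, compact_cell_list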
    - Make IdleEvent know what message it is expecting (DEBUGGABILITY)
      If a PING packet is sent, there is currently no way to know which request
      created the associated IdleEvent, nor which response is expected (knowing
      either should be enough).
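      A sketch of the extra information an IdleEvent could carry (the
      constructor signature shown is an assumption):

        class IdleEvent(object):
            def __init__(self, conn, msg_id, request_type, expected_type):
                self._conn = conn
                self._msg_id = msg_id
                self._request_type = request_type    # packet type that was sent
                self._expected_type = expected_type  # answer type being awaited

            def __repr__(self):
                return 'IdleEvent(msg_id=%r, request=%r, expects=%r)' % (
                    self._msg_id, self._request_type, self._expected_type)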
    - Consider using multicast for cluster-wide notifications. (BANDWIDTH)
      Currently, multi-receivers notifications are sent in unicast to each 
      receiver. Multicast should be used.
    - Remove sleeps (LATENCY, CPU WASTE)
      Code still contains many delays (explicit sleeps or polling timeouts).
      They must be removed: waits should be either infinite (sleep until some
      condition becomes true, without waking up needlessly in the meantime) or
      null (don't wait at all).
      There is such a delay somewhere in master node startup (near the end of
      the election phase).
    - Implement delayed connection acceptance.
      Currently, any node that connects too early to another that is busy for
      some reason is immediately rejected with the 'not ready' error code. This
      should be replaced by a queue in the listening node that keeps a pool of
      nodes to be accepted later, when the conditions are satisfied.
      This is mainly the case for:
        - Clients rejected before the cluster is operational
        - Empty storages rejected during the recovery process
      Masters involved in the election process should still reject any connection,
      as the primary master is still unknown.
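      A sketch of such a queue on the listening side (class and method names
      are hypothetical):

        class IdentificationQueue(object):
            """Hold identification requests until the node is ready to serve."""
            def __init__(self):
                self._pending = []

            def delay(self, conn, packet):
                self._pending.append((conn, packet))

            def processAll(self, handler):
                # To be called once the cluster (or the node) becomes ready.
                pending, self._pending = self._pending, []
                for conn, packet in pending:
                    handler.requestNodeIdentification(conn, packet)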
    - Connections must support 2 simultaneous handlers (CODE)
      Connections currently define only one handler, which is enough for
      single-threaded code. But when using multithreaded code, there are 2
      possible handlers involved in a packet reception:
      - The first one handles notifications only (nothing special to do
        regarding multithreading)
      - The second one handles expected messages (such a message must be
        directed to the right thread)
      It must be possible to set the second handler on the connection when that
      connection is thread-safe (MT version of connection classes).
      Also, the code detecting whether a response is expected or not must be
      made generic and moved out of handlers.
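      A sketch of the dispatch logic (class, attribute and method names below
      are assumptions, not the current API):

        class MTConnection(object):
            def __init__(self, notification_handler):
                self._notification_handler = notification_handler
                self._answer_handler_dict = {}  # msg_id -> handler for the answer

            def expectAnswer(self, msg_id, handler):
                """Register the handler that must receive the answer to msg_id."""
                self._answer_handler_dict[msg_id] = handler

            def dispatch(self, packet):
                # Answers go to the handler registered by the asking thread,
                # everything else goes to the shared notification handler.
                handler = self._answer_handler_dict.pop(packet.getId(),
                                                        self._notification_handler)
                handler.packetReceived(self, packet)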
    - Pack (FEATURE)
    - Check that the client processed all invalidations before starting a
      transaction (CONSISTENCY)
      If a client starts a transaction before it has received an invalidation
      message caused by a committed transaction, it will use outdated data.
      This is a known bug in ZEO.
    - Factorise node initialisation for admin, client and storage (CODE)
      The same code to ask/receive node list and partition table exists in too
      many places.
    - Clarify handler methods to call when a connection is accepted from a 
      listening connection and when the remote node is identified
      (cf. neo/bootstrap.py).
    - Choose how to handle storage integrity verification when a storage comes
      back. Should the replication process and the verification stage run, with
      or without unfinished transactions? Do cells have to be set as outdated,
      and if so, should the partition table changes be broadcast?
      (BANDWIDTH, SPEED)
    - Review PENDING/HIDDEN/SHUTDOWN states; don't use notifyNodeInformation()
      to do a state switch, use an exception-based mechanism? (CODE)
    - Ensure that registered timeouts are cancelled if the related connection is
      closed. (CODE)
    - Clarify big packet handling: is it needed to split them at connection
      level or application level, or to use the ask/send/answer scheme?
      Currently it's not consistent, especially for ask/answer/send partition
      table.
    - Split protocol.py into a 'protocol' module
    - Review handler split (CODE)
      The current handler split is the result of small incremental changes. A
      global review is required to make them consistent.
    - Make handler instances become singletons (SPEED, MEMORY)
      In some places handlers are instantiated outside of App.__init__. As a
      handler is completely re-entrant (no modifiable properties), it can and
      should be made a singleton. This saves the CPU time needed to instantiate
      all the copies (often when a connection is established) and the memory
      used by each copy.
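      A minimal sketch of the intended pattern (names are illustrative):

        class ClientOperationHandler(object):
            """Stateless, so a single shared instance is enough."""
            def handleAskObject(self, conn, packet, oid):
                pass  # actual handling here

        # Instantiated once at module (or App.__init__) level, then shared by
        # every connection instead of being re-created per connection.
        client_operation_handler = ClientOperationHandler()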
    - Consider replacing the setNodeState admin packet by one packet per action
      (e.g. dropNode), to reduce packet processing complexity and prevent bad
      actions such as setting a node in TEMPORARILY_DOWN state.
    - Consider processing writable events in the event.poll() method, to ensure
      that pending outgoing data are sent if the network is ready, instead of
      waiting for an incoming packet to trigger the poll() system call.
    - Allow daemonizing NEO processes; re-use code from TIDStorage and support
      start/stop/restart/status commands.
    - Consider not closing the connection after sending a packet, but waiting
      (a bit) for the closure from the remote peer.

    Storage
    - Implement incremental storage verification (BANDWIDTH)
      When a partition cell is in out-of-date state, the entire transaction
      history is checked.
      This is because there might be gaps in cell tid history, as an out-of-date
      node is writable (although non-readable).
      It should use an incremental mechanism to only check transactions past a
      certain TID known to have no gaps.
    - Use an embedded MySQL database instead of a stand-alone MySQL server.
      (LATENCY)(to be discussed)
    - Make replication work even in non-operational cluster state 
      (HIGH AVAILABILITY)
      When a master decides a partition change that triggers replication,
      replication should happen independently of cluster state. (Maybe we still
      need a primary master, to avoid replicating from an outdated partition
      table setup.)
    - Asynchronously flush objects from partition cells not served (DISK SPACE)
    - Close connections to other storage nodes (SYSTEM RESOURCE USAGE)
      When a replication finishes, the connection is not closed currently. It 
      should be closed (possibly asynchronously, and possibly by detecting that
      connection is idle - similar to keep-alive principle)
    - Notify master when storage becomes available for clients (LATENCY)
      Currently, storage presence is broadcast to client nodes too early, as the
      storage node will refuse them until it has fully up-to-date data (not only
      up-to-date cells, but also a partition table and node states).
    - Create a specialized PartitionTable that knows the database and the
      replicator, to remove duplicated code and move logic out of handlers (CODE)
    - Consider inserting multiple objects at a time in the database, taking care
      of the maximum allowed SQL request size. (SPEED)
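      A sketch using the DB-API executemany() with a naive row cap (table and
      column names are only examples):

        def storeObjectList(cursor, object_list, max_rows=1000):
            """object_list: iterable of (oid, serial, compression, checksum, value)."""
            query = ("INSERT INTO obj (oid, serial, compression, checksum, value)"
                     " VALUES (%s, %s, %s, %s, %s)")
            batch = []
            for row in object_list:
                batch.append(row)
                # A real implementation should also bound the request size in
                # bytes (MySQL's max_allowed_packet), not only the row count.
                if len(batch) >= max_rows:
                    cursor.executemany(query, batch)
                    del batch[:]
            if batch:
                cursor.executemany(query, batch)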
    - Prevent SQL injection: escape() from the MySQLdb API is not sufficient;
      consider using query(request, args) instead of query(request % args)
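      An illustration of the difference (table and column names are only
      examples):

        def loadObjectData(cursor, oid):
            # Unsafe: plain string interpolation, escaping is easy to get wrong:
            #   cursor.execute("SELECT value FROM obj WHERE oid = %d" % oid)
            # Safer: let the DB-API driver quote the arguments itself.
            cursor.execute("SELECT value FROM obj WHERE oid = %s", (oid,))
            return cursor.fetchone()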
    - Create database adapters for other RDBMS (sqlite, postgres)
    - Fix __undoLog when there are out-of-date cells: there is a busy loop
      because the client expects more answers than the available number of
      storage nodes.
    - Improve replication process (BANDWIDTH)
      The current implementation replicates objects this way (for a given TID):
        S1 > S2 : Ask for a range of OIDs
        S1 < S2 : Answer the range of OIDs
        For each OID :
          S1 > S2 : Ask a range of the object history
          S1 < S2 : Answer the object history
          For each missing version of the object :
            S1 > S2 : Ask object data
            S1 < S2 : Answer object data
      Proposal (just to keep the basics in mind):
        S1 > S2 : Send its object state list, with the last serial for each OID
        S1 < S2 : Answer object data for the latest state of each object
        Or something like that; the idea is to say what we have instead of
        checking what we don't have.
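      A sketch of the data S1 would send in the proposed scheme (the database
      iteration method used is an assumption):

        def getObjectStateList(database, partition):
            """Return {oid: last known serial} for every object in a partition."""
            state_dict = {}
            for oid, serial in database.iterLastSerials(partition):  # hypothetical
                state_dict[oid] = serial
            return state_dict

      S2 would then answer only the objects for which it has a more recent
      serial than the one listed, or which are absent from the list.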

    Master
    - Master node data redundancy (HIGH AVAILABILITY)
      Secondary master nodes should replicate primary master data (i.e. the
      primary master should inform them of such changes).
      This data takes too long to extract from storage nodes, and losing it
      increases the risk of starting from underestimated values.
      This risk is (currently) unavoidable when all nodes stop running, but this
      case must be avoided.
    - Don't reject peers during startup phases (STARTUP LATENCY)
      When (for example) a client sends a RequestNodeIdentification to the 
      primary master node while the cluster is not yet operational, the primary 
      master should postpone the node acceptance until the cluster is 
      operational, instead of closing the connection immediately. This would 
      avoid the need to poll the master to know when it is ready.
    - Differential partition table updates (BANDWIDTH)
      When a storage asks for the current partition table (when it connects to a
      cluster in service state), it must update its knowledge of the partition
      table. Currently this is done by fetching the entire table. If the master
      keeps a history of the last few changes to the partition table, it would
      be able to send only a differential update (via the incremental update
      mechanism).
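      A sketch of the master-side choice (attribute and method names are
      hypothetical):

        def answerPartitionTable(pt, known_ptid):
            """Return either a full table or only the changes after known_ptid."""
            history = pt.getChangeHistory()  # list of (ptid, cell_change_list)
            if history and history[0][0] <= known_ptid:
                # The requested starting point is still covered: send a diff.
                return [change for ptid, change in history if ptid > known_ptid]
            # Too old or unknown: fall back to sending the whole table.
            return pt.getRowList()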
    - During recovery phase, store multiple partition tables (ADMINISTRATION)
      When storage nodes know different versions of the partition table, the
      master should be able to present them to the admin to allow choosing one
      when moving on to the next phase.
    - Optimize operational status check by recording which rows are ready
      instead of parsing the whole partition table. (SPEED)
    - Improve partition table tweaking algorithm to reduce differences between
      frequently and rarely used nodes (SCALABILITY)

    Client
    - Client should prefer storage nodes it's already connected to when 
      retrieving objects (LOAD LATENCY)
    - Implement C version of mq.py (LOAD LATENCY)
    - Move object data replication task to storage nodes (COMMIT LATENCY)
      Currently the client node must send a single object's data to all storage
      nodes in charge of the partition cell containing that object. This
      increases the time the client has to wait for storage responses, and
      increases client-to-storage bandwidth usage. It must be possible to send
      object data to only one storage, and that storage should automatically
      replicate it to the other storages. Locks on objects would then be
      released by storage nodes.
    - Use generic bootstrap module (CODE)
    - Extend waitMessage to expect more than one response, on multiple 
      connections (LATENCY)
      To be able to pipeline requests, waitMessage must be extended to allow 
      responses to arrive out of order.
      The extreme case is when we must ask multiple nodes for object history 
      (used to support undo) because different msg_ids are expected on different
      connections.
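      A sketch of a dispatch structure allowing answers to be collected in any
      order and from several connections (names are hypothetical):

        try:
            from queue import Queue   # Python 3
        except ImportError:
            from Queue import Queue   # Python 2

        class AnswerDispatcher(object):
            def __init__(self):
                self._queue_dict = {}  # (connection, msg_id) -> Queue

            def register(self, conn, msg_id):
                """Called by the asking thread before sending the request."""
                queue = Queue()
                self._queue_dict[(conn, msg_id)] = queue
                return queue

            def dispatch(self, conn, msg_id, packet):
                """Called by the polling thread when an answer arrives."""
                self._queue_dict.pop((conn, msg_id)).put(packet)

            def waitAll(self, queue_list):
                # Blocks until every expected answer arrived, whatever the order.
                return [queue.get() for queue in queue_list]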
    - Find a way to make ask() callable from the polling thread, to allow
      sending the initial packet (requestNodeIdentification) from the
      connectionCompleted() event instead of from the app. This requires knowing
      which thread will wait for the answer.
    - Discuss dead storage notification. If a client fails to connect to a
      storage node supposed to be in running state, it should notify the master
      so that it checks whether this node is really up or not.
    - Cache for loadSerial/loadBefore
    - Implement restore() ZODB API method to bypass consistency checks during
      imports.

  Later
    - Consider auto-generating the cluster name upon initial startup (it might
      actually be a partition property).
    - Consider ways to centralise the configuration file, or make the
      configuration updatable automatically on all nodes.
    - Consider storing some metadata on master nodes (partition table [version],
      ...). This data should be treated non-authoritatively, as a way to lower 
      the probability to use an outdated partition table.
    - Decentralize primary master tasks as much as possible (consider 
      distributed lock mechanisms, ...)
    - Make admin node able to monitor multiple clusters simultaneously
    - Choose how to compute the storage size
    - Make storage check if the OID matches one of its partitions during a store
    - Send notifications when a storage node is lost
    - When importing data, objects with non-allocated OIDs are stored. The
      storage can detect this and could notify the master not to allocate lower
      OIDs. But during import, each stored object triggers this notification and
      may cause a big network overhead. It would be better to refuse any client
      connection, and thus any OID allocation, during import. It may be
      interesting to create a new stage for the cluster startup... to be
      discussed.
    - Simple deployment solution, based on an embedded database and integrated
      master and storage nodes, that works out of the box.
    - Simple import/export solution that generates SQL/data.fs.
    - Consider using out-of-band TCP feature.

Bugs:
  - Node in partition table already known: when a node receives a partition
    table (full or partial), the nodes used must already be known and referenced
    in the node manager (special case for the master during recovery). Sometimes
    functional tests fail on the assert in pt.load(), meaning the node is not in
    the node manager.
  - Race condition between master, storage and admin nodes:
      - Master is in recovery state
      - Storage connects to the master
      - Master asks the storage for its last IDs
      - Admin connects
      - Admin allows the cluster to start
      - Master changes handlers
      - Storage answers the last IDs
      -> unexpected packet, the storage is marked as broken

Old TODO

    - Handling connection timeouts.
    - Handling write timeouts.
    - IdleEvent for a certain message type as well as a message ID.
    - Flushing write buffers only without reading packets.
    - Garbage collection of unused nodes.
    - Stopping packet processing by returning a boolean value from
      a handler, otherwise too tricky to exchange a handler with another.
    - History.
    - Multiple undo.
    - Expiration of temporarily down nodes.