RC = Release Critical (for next release)

  Documentation
    - Clarify the meaning of node states, and consider renaming them in the
      code. Ideas:
        TEMPORARILY_DOWN becomes UNAVAILABLE
        BROKEN is removed?
    - Clarify the use of each error code:
      - NOT_READY removed (connection kept open until ready)
      - Split PROTOCOL_ERROR (BAD IDENTIFICATION, ...)
RC  - Clarify the meaning of cell states
    - Add docstrings (think of doctests)

  Tests
RC  - write ZODB-API-level tests

  Code

    Code changes often impact more than just one node. They are categorised by
    the node where the most important changes are needed.

    General
RC  - Review XXX in the code (CODE)
RC  - Review TODO in the code (CODE)
RC  - Review output of pylint (CODE)
    - Keep-alive (HIGH AVAILABILITY)
      Consider the need to implement a keep-alive system (packets sent
      automatically when there is no activity on the connection for a period
      of time). (see sketch after this list)
    - Factorise packet data when sending partition table cells (BANDWIDTH)
      Currently, each cell in a partition table update contains the UUIDs of
      all involved nodes. It must be changed to a correspondence table using
      shorter keys (sent in the packet) to avoid repeating the same UUIDs many
      times. (see sketch after this list)
    - Consider using multicast for cluster-wide notifications. (BANDWIDTH)
      Currently, multi-receiver notifications are sent in unicast to each
      receiver. Multicast should be used instead.
    - Remove sleeps (LATENCY, CPU WASTE)
      The code still contains many delays (explicit sleeps or polling
      timeouts). They must be replaced by either infinite waits (sleep until
      some condition becomes true, without waking up needlessly in the
      meantime) or no wait at all.
    - Implement delayed connection acceptance
      Currently, any node that connects too early to another node that is busy
      for some reason is immediately rejected with the 'not ready' error code.
      This should be replaced by a queue in the listening node, keeping a pool
      of nodes that will be accepted later, once the conditions are satisfied.
      This is mainly the case for:
        - Clients rejected before the cluster is operational
        - Empty storages rejected during the recovery process
      Masters involved in the election process should still reject any
      connection as long as the primary master is unknown.
      (see sketch after this list)
    - Connections must support 2 simultaneous handlers (CODE)
      Connections currently define only one handler, which is enough for
      monothreaded code. But when using multithreaded code, there are 2
      possible handlers involved in a packet reception:
      - The first one handles notifications only (nothing special to do
        regarding multithreading)
      - The second one handles expected messages (such a message must be
        directed to the right thread)
      It must be possible to set the second handler on the connection when
      that connection is thread-safe (MT version of connection classes).
      Also, the code detecting whether a response is expected or not must be
      genericised and moved out of handlers.
    - Implement transaction garbage collection API (FEATURE)
      The NEO packing implementation does not update transaction metadata when
      deleting object revisions. It must be possible to clean up this
      inconsistency from a client application, much in the same way the
      garbage collection part of packing is done.
    - Factorise node initialisation for admin, client and storage (CODE)
      The same code to ask/receive node list and partition table exists in too
      many places.
    - Clarify which handler methods to call when a connection is accepted from
      a listening connection and when the remote node is identified
      (cf. neo/bootstrap.py).
    - Choose how to handle storage integrity verification when a storage comes
      back. Run the replication process or the verification stage, with or
      without unfinished transactions? Do cells have to be set as outdated,
      and if so, should the partition table changes be broadcast?
      (BANDWIDTH, SPEED)
    - Review PENDING/HIDDEN/SHUTDOWN states; don't use notifyNodeInformation()
      to do a state switch, use an exception-based mechanism? (CODE)
    - Clarify big packet handling: is it needed to split them at the
      connection level or the application level, or to use the ask/send/answer
      scheme? Currently it's not consistent, especially with ask/answer/send
      of the partition table.
    - Split protocol.py into a 'protocol' module
    - Review handler split (CODE)
      The current handler split is the result of small incremental changes. A
      global review is required to make them consistent.
    - Make handler instances singletons (SPEED, MEMORY)
      In some places handlers are instantiated outside of App.__init__. As a
      handler is completely re-entrant (no modifiable properties), it can and
      should be made a singleton. This saves the CPU time needed to
      instantiate all the copies (often when a connection is established) and
      the memory used by each copy. (see sketch after this list)
    - Consider replacing the setNodeState admin packet by one packet per
      action (like dropNode) to reduce packet processing complexity and avoid
      bad actions such as setting a node in TEMPORARILY_DOWN state.
    - Consider processing writable events in the event.poll() method to ensure
      that pending outgoing data is sent when the network is ready, instead of
      waiting for an incoming packet to trigger the poll() system call.
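
    Sketch for the "Keep-alive" item above: a minimal illustration of pinging
    an idle connection and closing it when the answer is late. The names
    (the connection attributes, send_ping(), PING_DELAY, ...) are hypothetical
    and not NEO's actual API.

        import time

        PING_DELAY = 5.0     # seconds of inactivity before sending a ping
        PING_TIMEOUT = 5.0   # seconds to wait for the answer before closing

        def check_keep_alive(connection, now=None):
            """'connection' is assumed to expose last_activity (timestamp of
            the last received packet), ping_sent (timestamp or None),
            send_ping() and close()."""
            if now is None:
                now = time.time()
            if connection.ping_sent is not None:
                if now - connection.ping_sent > PING_TIMEOUT:
                    connection.close()      # peer considered dead
            elif now - connection.last_activity > PING_DELAY:
                connection.send_ping()      # any received packet resets this
                connection.ping_sent = now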
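
    Sketch for the "Factorise packet data" item above: a hypothetical encoding
    where the packet carries each UUID once in a small index, and every cell
    refers to a node by its (much shorter) position in that index.

        def encode_cells(cells):
            """cells: list of (partition_offset, node_uuid, cell_state).
            Returns (uuid_list, compact_cells), where compact_cells reference
            nodes by their index in uuid_list."""
            uuid_index = {}
            uuid_list = []
            compact_cells = []
            for offset, uuid, state in cells:
                key = uuid_index.get(uuid)
                if key is None:
                    key = uuid_index[uuid] = len(uuid_list)
                    uuid_list.append(uuid)
                compact_cells.append((offset, key, state))
            return uuid_list, compact_cells

        def decode_cells(uuid_list, compact_cells):
            return [(offset, uuid_list[key], state)
                    for offset, key, state in compact_cells]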
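
    Sketch for the "delayed connection acceptance" item above: instead of
    answering 'not ready', the listening node could park the identification
    request and flush the queue once the required condition is met. All names
    are hypothetical.

        from collections import deque

        class AcceptanceQueue:
            """Park identification requests until the node is ready."""

            def __init__(self):
                self._pending = deque()

            def park(self, connection, identification_packet):
                # Keep the connection open instead of rejecting it.
                self._pending.append((connection, identification_packet))

            def flush(self, accept):
                # To be called when the cluster becomes operational (or the
                # recovery process ends); 'accept' is the normal
                # identification handler.
                while self._pending:
                    connection, packet = self._pending.popleft()
                    accept(connection, packet)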
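
    Sketch for the "handler singletons" item above: since a handler holds no
    per-connection state, one shared instance can serve every connection
    instead of instantiating a new handler each time a connection is
    established. The handler and method names are hypothetical.

        class ClientOperationHandler(object):
            # No instance attribute is ever modified after __init__, so a
            # single shared instance is safe.
            def askObject(self, connection, packet, oid, serial, tid):
                pass  # handle the request

        # One shared instance, created at import time.
        CLIENT_OPERATION_HANDLER = ClientOperationHandler()

        def acceptClient(connection):
            # Reuse the singleton instead of instantiating a new handler.
            connection.setHandler(CLIENT_OPERATION_HANDLER)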


    Storage
    - Use Kyoto Cabinet instead of a stand-alone MySQL server.
    - Make replication work even in a non-operational cluster state
      (HIGH AVAILABILITY)
      When a master decides on a partition change triggering replication,
      replication should happen independently of cluster state. (Maybe we
      still need a primary master, to avoid replicating from an outdated
      partition table setup.)
    - Notify the master when a storage becomes available for clients (LATENCY)
      Currently, storage presence is broadcast to client nodes too early, as
      the storage node will refuse them until it has fully up-to-date data
      (not only up-to-date cells, but also a partition table and node states).
    - Create a specialised PartitionTable that knows the database and the
      replicator, to remove duplicates and move logic out of handlers (CODE)
    - Consider inserting multiple objects at a time in the database, taking
      care of the maximum allowed SQL request size. (SPEED)
      (see sketch after this list)
    - Prevent SQL injection; escape() from the MySQLdb API is not sufficient,
      consider using query(request, args) instead of query(request % args).
      (see sketch after this list)
    - Create database adapters for other RDBMS (SQLite, PostgreSQL)
    - Make the listening address and port optional; if they are not provided,
      listen on all interfaces on any available port.
    - Replication throttling (HIGH AVAILABILITY)
      In its current implementation, replication runs at full speed, which
      degrades performance for client nodes. Replication should allow
      throttling, and that throttling should be configurable.
      (see sketch after this list)
    - Pack segmentation & throttling (HIGH AVAILABILITY)
      In its current implementation, pack runs in one call on all storage
      nodes at the same time, which locks down the whole cluster. This task
      should be split into chunks and processed in the "background" on storage
      nodes. Pack throttling should probably be at the lowest possible
      priority (below interactive use and below replication).
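
    Sketch for the "insert multiple objects at a time" item above: build
    batches of rows and flush them with a multi-row INSERT whenever the next
    row would exceed the maximum allowed request size. The table layout is
    hypothetical; 'cursor' is a standard DB-API cursor (e.g. MySQLdb).

        def store_objects(cursor, rows, max_request_size=1 << 20):
            """rows: list of (oid, serial, compression, checksum, value)."""
            statement = ("INSERT INTO obj"
                         " (oid, serial, compression, checksum, value)"
                         " VALUES (%s, %s, %s, %s, %s)")
            batch = []
            batch_size = 0
            for row in rows:
                # Rough estimate of the quoted row size in the final request.
                row_size = sum(len(str(field)) for field in row) + 16
                if batch and batch_size + row_size > max_request_size:
                    cursor.executemany(statement, batch)
                    batch, batch_size = [], 0
                batch.append(row)
                batch_size += row_size
            if batch:
                cursor.executemany(statement, batch)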
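
    Sketch for the "SQL injection" item above: pass the arguments separately
    so the driver escapes and quotes them, instead of interpolating them into
    the SQL string with %. This is standard MySQLdb (DB-API) usage; the table
    is hypothetical.

        def load_object(db_connection, oid, serial):
            cursor = db_connection.cursor()
            # Unsafe: cursor.execute("... WHERE oid = %s" % oid)
            # Safe: the driver performs the quoting.
            cursor.execute(
                "SELECT compression, checksum, value FROM obj"
                " WHERE oid = %s AND serial = %s",
                (oid, serial))
            return cursor.fetchone()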
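
    Sketch for the "Replication throttling" item above: a simple, configurable
    token bucket the replicator could consult before requesting the next chunk
    of data. Entirely hypothetical; nothing like this exists in NEO today.

        import time

        class Throttle:
            def __init__(self, rate=None):
                self.rate = rate           # bytes per second, None = no limit
                self._allowance = 0.0
                self._last = time.time()

            def delay_for(self, nbytes):
                """Return how long (in seconds) to wait before transferring
                'nbytes' more bytes; 0.0 if the transfer can happen now."""
                if not self.rate:
                    return 0.0
                now = time.time()
                self._allowance = min(
                    self.rate,
                    self._allowance + (now - self._last) * self.rate)
                self._last = now
                self._allowance -= nbytes
                if self._allowance >= 0.0:
                    return 0.0
                return -self._allowance / self.rate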

    Master
    - Master node data redundancy (HIGH AVAILABILITY)
      Secondary master nodes should replicate primary master data (i.e., the
      primary master should inform them of such changes).
      This data takes too long to extract from storage nodes, and losing it
      increases the risk of starting from underestimated values.
      This risk is (currently) unavoidable when all nodes stop running, but
      this case must be avoided.
    - Don't reject peers during startup phases (STARTUP LATENCY)
      When (for example) a client sends a RequestNodeIdentification to the
      primary master node while the cluster is not yet operational, the
      primary master should postpone the node acceptance until the cluster is
      operational, instead of closing the connection immediately. This would
      avoid the need to poll the master to know when it is ready.
    - Differential partition table updates (BANDWIDTH)
      When a storage asks for the current partition table (when it connects to
      a cluster in service state), it must update its knowledge of the
      partition table. Currently this is done by fetching the entire table. If
      the master kept a history of the last few changes to the partition
      table, it would be able to send only a differential update (via the
      incremental update mechanism). (see sketch after this list)
    - During the recovery phase, store multiple partition tables
      (ADMINISTRATION)
      When storage nodes know different versions of the partition table, the
      master should be able to present them to the admin to allow him to
      choose one when moving on to the next phase.
    - Optimize operational status check by recording which rows are ready
      instead of parsing the whole partition table. (SPEED)
      (see sketch after this list)
    - Improve partition table tweaking algorithm to reduce differences between
      frequently and rarely used nodes (SCALABILITY)
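
    Sketch for the "Differential partition table updates" item above: the
    master could keep the last changes indexed by partition table ID and
    answer with only the changes a storage missed, falling back to the full
    table when the history does not go back far enough. The structure is
    hypothetical.

        from collections import deque

        class PartitionTableHistory:
            def __init__(self, max_entries=1000):
                self.ptid = 0
                self._history = deque(maxlen=max_entries)

            def record(self, cell_changes):
                # cell_changes: list of (offset, uuid, state) applied
                # atomically; each application bumps the partition table ID.
                self.ptid += 1
                self._history.append((self.ptid, cell_changes))

            def changes_since(self, known_ptid):
                """Return the list of missed (ptid, cell_changes), or None if
                the full partition table must be sent instead."""
                if known_ptid == self.ptid:
                    return []
                missed = [(ptid, changes) for ptid, changes in self._history
                          if ptid > known_ptid]
                if not missed or missed[0][0] != known_ptid + 1:
                    return None    # history truncated: send the full table
                return missed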
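
    Sketch for the "operational status check" item above: maintain a set of
    rows that currently have at least one up-to-date cell, updated on every
    cell state change, so the operational check becomes a single comparison
    instead of a scan of the whole partition table. Hypothetical names.

        class ReadyRowTracker:
            def __init__(self, partition_count):
                self.partition_count = partition_count
                self._ready_rows = set()

            def row_changed(self, offset, has_up_to_date_cell):
                # Called whenever any cell of row 'offset' changes state.
                if has_up_to_date_cell:
                    self._ready_rows.add(offset)
                else:
                    self._ready_rows.discard(offset)

            def is_operational(self):
                return len(self._ready_rows) == self.partition_count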

    Client
    - Implement C version of mq.py (LOAD LATENCY)
    - Move object data replication task to storage nodes (COMMIT LATENCY)
      Currently the client node must send a single object's data to all
      storage nodes in charge of the partition cell containing that object.
      This increases the time the client has to wait for storage responses,
      and increases client-to-storage bandwidth usage. It must be possible to
      send object data to only one storage, and that storage should
      automatically replicate it to the other storages. Locks on objects
      would then be released by storage nodes.
    - Use generic bootstrap module (CODE)
    - Find a way to call ask() from the polling thread, to allow sending the
      initial packet (requestNodeIdentification) from the connectionCompleted()
      event instead of from the app. This requires knowing which thread will
      wait for the answer.
    - Discuss dead storage notification. If a client fails to connect to a
      storage node supposed to be in running state, it should notify the
      master so it can check whether this node is really up or not.
    - Cache for loadSerial/loadBefore (see sketch after this list)
    - Implement restore() ZODB API method to bypass consistency checks during
      imports.
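
    Sketch for the "Cache for loadSerial/loadBefore" item above: object
    revisions are immutable once written, so answers can be memoised by
    (oid, serial) in a small bounded cache. This is a hypothetical wrapper,
    not the existing mq.py cache.

        class RevisionCache:
            def __init__(self, max_entries=10000):
                self._data = {}            # (oid, serial) -> object record
                self._max_entries = max_entries

            def load_serial(self, oid, serial, load_from_storage):
                key = (oid, serial)
                try:
                    return self._data[key]
                except KeyError:
                    pass
                record = load_from_storage(oid, serial)
                if len(self._data) >= self._max_entries:
                    self._data.popitem()   # crude eviction policy
                self._data[key] = record
                return record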

  Later
    - Consider auto-generating the cluster name upon initial startup (it might
      actually be a partition property).
    - Consider ways to centralise the configuration file, or make the
      configuration updatable automatically on all nodes.
    - Consider storing some metadata on master nodes (partition table
      [version], ...). This data should be treated non-authoritatively, as a
      way to lower the probability of using an outdated partition table.
    - Decentralize primary master tasks as much as possible (consider
      distributed lock mechanisms, ...)
    - Make the admin node able to monitor multiple clusters simultaneously
    - Choose how to compute the storage size
    - Make the storage check whether the OID matches its partitions during a
      store
    - Send notifications when a storage node is lost
    - When importing data, objects with non-allocated OIDs are stored. The
      storage can detect this and could notify the master not to allocate
      lower OIDs. But during import, each stored object triggers this
      notification and may cause a big network overhead. It would be better
      to refuse any client connection, and thus any OID allocation, during
      import. It may be interesting to create a new stage for the cluster
      startup... to be discussed.
    - Simple deployment solution, based on an embedded database and an
      integrated master and storage node, that works out of the box
    - Simple import/export solution that generates SQL/data.fs.
    - Consider using out-of-band TCP feature.
    - IPv6 support (address field, bind, name resolution)

Old TODO

    - Handling write timeouts.
    - Flushing write buffers only without reading packets.
    - Garbage collection of unused nodes.
    - Stopping packet processing by returning a boolean value from
      a handler, otherwise too tricky to exchange a handler with another.
    - Expiration of temporarily down nodes.