WIP: erp5: Introduce mariadb replication at SlapOS level
1. Remove mariadb_update service
Instead, initialize databases and users on creation, and run updater and apply timezones info on every (re)start. This covers the actions that mariadb_update used to handle.
In particular: before this, mariadb_update would regularly overwrite any changes to a user (e.g. password change) made through direct interaction with mariadb. Now the configuration in SlapOS is really only an initial configuration.
This is a prerequisite to mariadb replication because mariadb_update was a) interfering with replication and b) overwriting the users replicated from a primary.
To facilitate these changes, component/mariadb now exposes a template script for the mariadbd service, with ready hooks to take actions on database creation and on database (re)start.
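As a rough illustration of the two hooks, here is a minimal Python sketch. It is not the template shipped in component/mariadb: the function name start_mariadb, the callback names and the "data directory exists" test are assumptions made for the example.

```python
import os
import tempfile


def start_mariadb(datadir, on_create, on_start):
    """Run on_create only when datadir does not exist yet, then on_start.

    Hypothetical helper: the real service template may detect creation
    differently and runs mariadbd itself.
    """
    created = not os.path.isdir(datadir)
    if created:
        os.makedirs(datadir)
        on_create()  # e.g. initialize databases and users
    on_start()  # e.g. run the updater and load timezone info
    return created


# Usage: the creation hook fires once, the (re)start hook every time.
calls = []
datadir = os.path.join(tempfile.mkdtemp(), 'mariadb')
start_mariadb(datadir, lambda: calls.append('create'), lambda: calls.append('start'))
start_mariadb(datadir, lambda: calls.append('create'), lambda: calls.append('start'))
```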
2. Allow requesting a mariadb set-up to replicate another mariadb
Using parameters of the form:
'replication': {
'bootstrap-url': 'http(s)://<recent-backup-of-primary>',
'primary-url': 'mysql://<replication-user>:<password>@<ip>:<port>',
'seconds-behind-master-threshold': <integer, defaults to 0>,
}
This takes effect on mariadb database creation - when no data exists yet. That way existing data cannot be deleted by setting or changing the replication parameters after the fact.
A promise checks that the state of the running mariadb matches the requested state (replica/primary, replication source); but if not, the mariadb database will not automatically converge without human intervention once ~/srv/mariadb directory exists.
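The comparison such a promise performs could look like the sketch below. This is an assumption about the logic, not the actual mariadb_replication promise: the function name is hypothetical, the status dict is assumed to carry the columns of SHOW SLAVE STATUS, and the lag is assumed to be compared with seconds-behind-master-threshold using "less than or equal".

```python
def replication_matches(expected, status, threshold=0):
    """Return True when the observed state matches the requested one.

    expected: None for a primary, or (host, port) of the replication source.
    status: None when mariadb is not replicating, else a dict holding
            SHOW SLAVE STATUS columns (assumed input format).
    """
    if expected is None:
        return status is None
    if status is None:
        return False
    if (status['Master_Host'], int(status['Master_Port'])) != expected:
        return False
    if status['Slave_IO_Running'] != 'Yes' or status['Slave_SQL_Running'] != 'Yes':
        return False
    lag = status['Seconds_Behind_Master']
    # Seconds_Behind_Master is NULL (None) when replication is not running.
    return lag is not None and int(lag) <= threshold


ok_status = {
    'Master_Host': '10.0.0.1', 'Master_Port': 3306,
    'Slave_IO_Running': 'Yes', 'Slave_SQL_Running': 'Yes',
    'Seconds_Behind_Master': 0,
}
```

A failing check only reports the mismatch; as described above, nothing here makes the database converge.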
The bootstrap-url may be omitted: this skips replication bootstrap and requires that all binlogs be still available on the primary. This is useful when the primary is recent and may not have a ready backup for bootstrap yet.
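To make the parameter format concrete, the following sketch parses a 'replication' dict with the standard library. The function name parse_replication and the shape of the returned dict are illustrative assumptions; only the parameter names and the default of 0 come from the description above.

```python
from urllib.parse import urlsplit


def parse_replication(params):
    """Return parsed replication settings, or None when replication is unset."""
    replication = params.get('replication')
    if not replication:
        return None
    primary = urlsplit(replication['primary-url'])
    if primary.scheme != 'mysql':
        raise ValueError('primary-url must use the mysql:// scheme')
    return {
        'host': primary.hostname,
        'port': primary.port,
        'user': primary.username,
        'password': primary.password,
        # bootstrap-url may be omitted: None means "skip bootstrap".
        'bootstrap-url': replication.get('bootstrap-url'),
        'seconds-behind-master-threshold': int(
            replication.get('seconds-behind-master-threshold', 0)),
    }


parsed = parse_replication({
    'replication': {'primary-url': 'mysql://repl:secret@10.0.0.1:3306'},
})
```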
Finally, a mariadb replica can optionally disable TCP access:
'replication': {
# ...
'allow-tcp-connections-on-replica': True or False,
}
Add option allow-tcp-connections-on-replica, set to true by default. This option concerns only replica mariadbs: TCP connections are always enabled when replication parameters are unset, even if the database is actually in replication state in contradiction with the parameters.
This option corresponds to skip-networking in mariadb configuration; this setting is static, so when it changes the mariadb process will be automatically restarted by SlapOS to apply the new configuration.
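The mapping from instance parameters to skip-networking can be summarised by the sketch below. The function name is hypothetical and the real template logic may be written differently; the rule itself (TCP always on when replication is unset, default true on replicas) is the one stated above.

```python
def wants_skip_networking(params):
    """Return True when skip-networking should be written in the config."""
    replication = params.get('replication')
    if not replication:
        # Replication parameters unset: TCP connections are always enabled.
        return False
    return not replication.get('allow-tcp-connections-on-replica', True)
```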
Note: disabling TCP connections on replicas with this option currently breaks the property that takeover can be done without changing the instance parameters and reprocessing the partition: until both are done, the taken-over mariadb still has TCP disabled and remains unusable.
TODO:
- Allow a replica mariadb to stop replicating and become a primary without requiring manual login to the instance and manual operations on the DB (e.g. by providing a url where the user can click to perform this action). This will be a necessary step of an eventual automated takeover procedure.
- Find a better solution for mariadb_update functionality. See #1.
- Make the mariadb_replication promise avoid needless partition processing (bang): currently, it will trigger a bang when the state of mariadb (replica/primary, replication source) does not match the expected state (corresponding to the parameters), even though SlapOS only controls the initial state on database creation, and reprocessing the partition will by design not make it converge to the expected state.
- For mariadb replicas requested with allow-tcp-connections-on-replica=false (which results in skip-networking being written in the config file), find a way to take over without needing to edit the instance parameters and reprocess the partition. This requires a way to restart mariadb with different options, using only the privileges of the partition. This could maybe be done by wrapping the mariadb service in a wrapper program (maybe an ad-hoc script, maybe supervisord) that allows restarting mariadb with skip-networking enabled or disabled as appropriate. Note that when allow-tcp-connections-on-replica=true, takeover does not require editing the instance parameters nor reprocessing (which is the main reason true is the current default).
3. Automate mariadb replication bootstrapping
Make any mariadb (replica or primary) a) statically serve recent backups (dumps) on the same IP as the mariadb server and b) have a configured replication_user with random password, and publish two corresponding connection parameters replication-bootstrap-url and replication-primary-url, to be used to setup a replica mariadb.
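As an illustration only, the two published values could be built as follows. The helper name, the port arguments and the URL path of the backup are assumptions; the description above only fixes the parameter names, the replication_user account and the fact that the password is random.

```python
import secrets


def replication_connection_parameters(ip, mariadb_port, backup_port,
                                      user='replication_user'):
    """Build the two connection parameters (hypothetical layout)."""
    # token_urlsafe only yields characters that are safe inside a URL.
    password = secrets.token_urlsafe(16)
    # IPv6 literals must be bracketed inside URLs.
    host = '[%s]' % ip if ':' in ip else ip
    return {
        'replication-bootstrap-url': 'http://%s:%d/' % (host, backup_port),
        'replication-primary-url': 'mysql://%s:%s@%s:%d' % (
            user, password, host, mariadb_port),
    }


published = replication_connection_parameters('10.0.0.1', 3306, 8080)
```

A replica can then be requested by copying these two values into bootstrap-url and primary-url of its 'replication' parameter.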
Note: This is currently insecure when using public IPv6, but it can already be used on private IPv4 when the primary and the replica are in the same LAN.
TODO:
- Use SSL on public IP: serve the backups with SSL and proxy the mariadb server with SSL on public IP (instead of enabling SSL in mariadb directly, to allow SSL-less access with private IP (?)). This will also impact the replica-initialisation logic. Ideally, use (mutual?) authentication with trusted certificates (provided by caucase?). As a temporary step, maybe self-signed SSL with passwords published in connection parameters?
- Use mariabackup instead of dumps to allow fast bootstrapping of a replica. This will affect the replica-initialisation logic as well.
- Propagate these mariadb connection parameters in erp5 root instance.
  --> mariadb-replication-primary-url and mariadb-replication-bootstrap-url
4. Automate neo asynchronous replication
When upstream-cluster and upstream-masters are given, also pass --backup to the neo master so that it converges automatically to BACKINGUP state.
In other words, when a neo is requested with upstream- parameters, make it automatically start in BACKINGUP state without requiring manual intervention. This applies only on neo database creation.
Add a promise that asserts neo is in BACKINGUP state when upstream- parameters are set (but does not assert neo is in RUNNING state when upstream- parameters are unset, for backwards compatibility with current usage).
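The asymmetry of this promise (assert BACKINGUP when upstream parameters are set, assert nothing otherwise) can be sketched as below. The function name and the way the cluster state is obtained are assumptions; only the parameter names and state names come from the description.

```python
def neo_state_promise(cluster_state, params):
    """Return True when the promise passes."""
    has_upstream = bool(
        params.get('upstream-cluster') and params.get('upstream-masters'))
    if not has_upstream:
        # Backwards compatibility: nothing is asserted without upstream.
        return True
    return cluster_state == 'BACKINGUP'
```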
TODO:
- Make the neo state promise avoid needless partition processing (bang): currently, it will trigger a bang e.g. when the state is RUNNING and the promise expects BACKINGUP, even though SlapOS only controls the initial state on database creation, and reprocessing the partition will by design not attempt to make it converge to the expected state.
5. Make zope aware of replication
Deactivate zope promises when the neo is expected to be BACKINGUP, as this makes the zope process crash, which is currently expected.
Deactivating the zope process entirely in that case is not desired because reactivating it would require updating instance parameters and reprocessing the partition. Instead, ideally, the zope service should adapt to the state of the neo.
Also, move zope service from etc/service to etc/run to make it not be "on-watch", so that when the neo is BACKINGUP and zope crashes, the partition does not bang and reprocess continuously. This seems ok because the promise already asserts the service is running.
TODO:
- Adapt the zope service so that it detects when neo is in BACKINGUP state and goes on standby until neo is RUNNING as part of normal execution of the service, instead of crashing. One envisioned way is to wrap the existing zope service in a wrapper program that will handle this additional functionality, catch zope crashing, poll neo state or otherwise be notified of neo state changes, and relaunch zope as needed. Such a program could be an ad-hoc wrapper script, or maybe a supervisord launching a zope and a kind of neo-listener service.
- Standardize operations related to creating an ERP5 clone of a production ERP5: this implies creating a replica, "detaching" it (like taking it over without stopping the original primary), selectively starting an admin zope while making sure activity zopes remain stopped, and changing all that is required to prevent the clone from interfering with the actual production ERP5 before starting the remaining zopes. One way is simply to start only selected zope partitions via SlapOS. Another way could be that zope services may be started or stopped directly: this could also be achieved via a wrapper program such as supervisord, but would require that it offer a remote interface. Or maybe the right thing to do would be to standardize a way to control network access of each partition via a firewall, so as to be able to selectively cut network access.
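For the wrapper envisioned in the first TODO item above, the core decision could be as small as the sketch below. It is purely illustrative: the function name and the 'start' / 'stop' / 'wait' vocabulary are invented for the example, and how the wrapper learns the neo state (polling or notification) is left open, as in the text.

```python
def next_action(neo_state, zope_running):
    """Decide what a hypothetical zope wrapper should do next."""
    if neo_state == 'RUNNING':
        # Neo is usable: make sure zope is up.
        return 'wait' if zope_running else 'start'
    # BACKINGUP (or any other non-RUNNING state): keep zope on standby.
    return 'stop' if zope_running else 'wait'
```

A wrapper loop would call this on every neo state change or zope exit and act on the result, instead of letting zope crash.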
6. Miscellaneous fixes
Include some miscellaneous fixes for mariadb-with-IPv6 and gcc-version-for-Python2-SRs.