sql/log_event_server.cc · a3cbc44b24ec467f33e445f57e2022e038b88623 · nexedi / MariaDB

MDEV-31833 replication breaks when using optimistic replication and replica is a galera node · a3cbc44b

sjaakola authored Sep 12, 2023

MariaDB async replication SQL thread was stopped for any failure
in applying of replication events and error message logged for the failure
was: "Node has dropped from cluster". The assumption was that event applying
failure is always due to node dropping out.
With optimistic parallel replication, event applying can fail for natural
reasons and applying should be retried to handle the failure. This retry
logic was never exercised because the slave SQL thread was stopped with first
applying failure.

To support optimistic parallel replication retrying logic this commit will
now skip replication slave abort, if node remains in cluster (wsrep_ready==ON)
and replication is configured for optimistic or aggressive retry logic.

During the development of this fix, galera.galera_as_slave_nonprim test showed
some problems. The test was analyzed, and it appears to need some attention.
One excessive sleep command was removed in this commit, but it will need more
fixes still to be fully deterministic. After this commit galera_as_slave_nonprim
is successful, though.
Signed-off-by: Julius Goryavsky <julius.goryavsky@mariadb.com>

a3cbc44b

log_event_server.cc 276 KB

Replace log_event_server.cc