fix a 3-way deadlock in galera_sr.galera-features#56

rarely (try --repeat 1000), the following happens: * from wsrep_bf_abort (when a thread is being killed), wsrep-lib starts streaming_rollback that wants to convert_streaming_client_to_applier. wsrep_create_streaming_applier creates a new THD(). All while the other THD is being killed, so under LOCK_thd_kill and LOCK_thd_data. In particular, THD::init() takes LOCK_global_system_variables under LOCK_thd_kill. * updating @@wsrep_slave_threads takes LOCK_global_system_variables and LOCK_wsrep_cluster_config (in that order) and invokes wsrep_slave_threads_update() that takes LOCK_wsrep_slave_threads * wsrep_replication_process() takes LOCK_wsrep_slave_threads and invokes wsrep_close_applier(), that does thd->set_killed() which takes LOCK_thd_kill. et voilà. As a fix I copied a workaround from wsrep_cluster_address_update() to wsrep_slave_threads_update(). It seems to be safe: without mutexes a race condition is possible and a concurrent SET might change wsrep_slave_threads, but wsrep_slave_threads_update() always verifies if there's a need to do something, so it will not run twice in this case, it'll be a no-op.

fix a 3-way deadlock in galera_sr.galera-features#56
rarely (try --repeat 1000), the following happens: * from wsrep_bf_abort (when a thread is being killed), wsrep-lib starts streaming_rollback that wants to convert_streaming_client_to_applier. wsrep_create_streaming_applier creates a new THD(). All while the other THD is being killed, so under LOCK_thd_kill and LOCK_thd_data. In particular, THD::init() takes LOCK_global_system_variables under LOCK_thd_kill. * updating @@wsrep_slave_threads takes LOCK_global_system_variables and LOCK_wsrep_cluster_config (in that order) and invokes wsrep_slave_threads_update() that takes LOCK_wsrep_slave_threads * wsrep_replication_process() takes LOCK_wsrep_slave_threads and invokes wsrep_close_applier(), that does thd->set_killed() which takes LOCK_thd_kill. et voilà. As a fix I copied a workaround from wsrep_cluster_address_update() to wsrep_slave_threads_update(). It seems to be safe: without mutexes a race condition is possible and a concurrent SET might change wsrep_slave_threads, but wsrep_slave_threads_update() always verifies if there's a need to do something, so it will not run twice in this case, it'll be a no-op.
b91e77cf · Sergei Golubchik · 259b9452 · b91e77cf
Commit b91e77cf authored Feb 12, 2021 by Sergei Golubchik
Hide whitespace changes
Inline Side-by-side

Showing with 4 additions and 0 deletions

sql/wsrep_var.cc sql/wsrep_var.cc +4 -0

No files found.
--- a/sql/wsrep_var.cc
+++ b/sql/wsrep_var.cc
@@ -674,7 +674,11 @@ static void wsrep_slave_count_change_update ()

 bool wsrep_slave_threads_update (sys_var *self, THD* thd, enum_var_type type)
 {
+  mysql_mutex_unlock(&LOCK_wsrep_cluster_config);
+  mysql_mutex_unlock(&LOCK_global_system_variables);
  mysql_mutex_lock(&LOCK_wsrep_slave_threads);
+  mysql_mutex_lock(&LOCK_global_system_variables);
+  mysql_mutex_lock(&LOCK_wsrep_cluster_config);
  bool res= false;

  wsrep_slave_count_change_update();