MDEV-28430: Fix memory barrier missing of lf_alloc on Arm64
When testing MariaDB on Arm64, a stall issue will occur, jira link: https://jira.mariadb.org/browse/MDEV-28430. The stall occurs because of an unexpected circular reference in the LF_PINS->purgatory list which is traversed in lf_pinbox_real_free(). We found that on Arm64, ABA problem in LF_ALLOCATOR->top list was not solved, and various undefined problems will occur, including circular reference in LF_PINS->purgatory list. The following codes are used to solve ABA problem, code copied from below link. https://github.com/MariaDB/server/blob/cb4c2713553c5f522d2a4ebf186c6505384c748d/mysys/lf_alloc-pin.c#L501-#L505 do { 503 node= allocator->top; 504 lf_pin(pins, 0, node); 505 } while (node != allocator->top && LF_BACKOFF()); 1. ABA problem on Arm64 Combine the below steps to analyze how ABA problem occur on Arm64, the relevant codes in steps are simplified, code line numbers below are in MariaDB v10.4. ------------------------------------------------------------------------ Abnormal case. Initial state: pin = 0, top = A, top list: A->B T1 T2 step1. write top=B //seq-cst, #L517 step2. write A->next= "any" step3. read pin==0 //relaxed, #L295 step1. write pin=A //seq-cst, #L504 step2. read old value of top==A //relaxed, #L505 step3. next=A->next="any" //#L517 step4. write A->next=B,top=A //#L420-435 step4. CAS(top,A,next) //#L517 step5. write pin=0 //#L521 ------------------------------------------------------------------------ Above case is due to T1.step2 reading the old value of top, causing "T1.step3, T1.step4" and "T2.step4" to occur at the same time, in other words, they are not mutually exclusive. It may happen that T2.step4 is sandwiched between T1.step3 and T1.step4, which cause top to be updated to "any", which may be in-use or invalid address. 2. Analyze above issue with Dekker's algorithm Above problem can be mapped to Dekker's algorithm, link is as below https://en.wikipedia.org/wiki/Dekker%27s_algorithm. The following extracts the read and write operations on 'top' and 'pin', and maps them to Dekker's algorithm to analyze the root cause. ------------------------------------------------------------------------ Initial state: top = A, pin = 0 T1 T2 store_seq_cst(pin, A) // write pin store_seq_cst(top, B) //write top rt= load_relaxed(top) // read top rp= load_relaxed(pin) //read pin if (rt == A && rp == 0) printf("oops\n"); // will "oops" be printed? ------------------------------------------------------------------------ How T1 and T2 enter their critical section: (1) T1, write pin, if T1 reads that top has not been updated, T1 enter its critical section(T1.step3 and T1.step4, try to obtain 'A', #L517), otherwise just give up (T1 without priority). (2) T2, write top, if T2 reads that pin has not been updated, T2 enter critical section(T2.step4, try to add 'A' to top list again, #L420-435), otherwise wait until pin!=A (T2 with priority). In the previous code, due to load 'top' and 'pin' with relaxed semantic, on arm and ppc, there is no guarantee that the above critical sections are mutually exclusive, in other words, "oops" will be printed. This bug only happens on arm and ppc, not x86. On current x86 implementation, load is always seq-cst (relaxed and seq-cst load generates same machine code), as shown in https://godbolt.org/z/sEzMvnjd9 3. Fix method Add sequential-consistency semantic to read 'top' in #L505(T1.step2), Add sequential-consistency semantic to read "el->pin[i]" in #L295 and #L320. 4. Issue reproduce Add "delay" after #L503 in lf_alloc-pin.c, When run unit.lf, can quickly get segment fault because "top" point to an invalid address. For detail, see comment area of below link. https://jira.mariadb.org/browse/MDEV-28430. 5. Futher improvement To make this code more robust and safe on all platforms, we recommend replacing volatile with C11 atomics and to fix all data races. This will also make the code easier to reason. Signed-off-by: Xiaotong Niu <xiaotong.niu@arm.com>
Showing
Please register or sign in to comment