MDEV-29023 MTR hangs after multiple failures
Passing $opt_parallel as $childs is wrong: a child can be killed before it connects, and $childs is then never decremented for it.

Another problem, and the actual cause of this bug: a child can be killed without ever closing the server socket. This can happen, for example, after SIGKILL, which cannot be caught or masked. In that case the socket is closed only when the child is reaped, but reaping never happens inside the socket-reading loop in run_test_server().

The proper design is a non-blocking reap of children inside the socket loop; once there are no more children, the socket loop finishes.

On Windows we don't control the children via waitpid(), so all clients must close the socket normally, and only that can finish the socket loop. On Unix we treat that case as "all children closed the socket, but not all have died yet", and handle it with a final blocking waitpid() (this was done before the patch as well).

To be more complete, we now handle 3 end-of-game scenarios on Unix:

1. All children closed the socket, all children died: everything is handled by the socket loop.
2. All children closed the socket, not all have died yet: we wait for the live children to die after exiting the socket loop.
3. Not all children closed the socket, all children died: everything is handled by the socket loop.

On Windows there is only one end-of-game scenario: all children close the socket.
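
The Unix scenarios above can be illustrated with a small sketch. It is written in Python rather than the Perl of mysql-test-run.pl and is not the patch itself: reap_exited(), spawn_children(), socket_loop() and final_wait() are hypothetical names, and the open-socket count is passed in as a plain number instead of being read from real sockets.

```python
import os
import time


def reap_exited(children):
    """Non-blocking reap: drop every pid in `children` that has already exited."""
    for pid in list(children):
        try:
            done, _status = os.waitpid(pid, os.WNOHANG)
        except ChildProcessError:
            done = pid  # no such child any more; treat it as gone
        if done == pid:  # waitpid returns 0 while the child is still running
            children.discard(pid)
    return children


def spawn_children(n):
    """Fork n children that exit at once without 'closing their socket'."""
    children = set()
    for _ in range(n):
        pid = os.fork()
        if pid == 0:
            os._exit(0)  # child: die immediately
        children.add(pid)
    return children


def socket_loop(children, open_sockets, poll_interval=0.01):
    """Model of the socket loop: end when no children or no sockets remain."""
    while True:
        reap_exited(children)
        if not children:
            return "all children died"  # scenarios 1 and 3
        if open_sockets == 0:
            return "all sockets closed"  # scenario 2, needs final_wait()
        time.sleep(poll_interval)  # stands in for waiting on the sockets


def final_wait(children):
    """Blocking wait for children still alive after the socket loop."""
    for pid in list(children):
        try:
            os.waitpid(pid, 0)
        except ChildProcessError:
            pass
        children.discard(pid)
```

With open_sockets left at a non-zero value, the sketch models scenario 3: the loop still terminates because every child is eventually reaped, which is the case that caused the hang when the loop only watched the sockets.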