Fix race conditions in EventManager between _poll/connection_dict and (un)registration
The following error was reported on a client node: #0x0000 Error < None (2001:...:2051) 1 (Retry Later) connection closed for <MTClientConnection(uuid=None, address=2001:...:2051, handler=PrimaryNotificationsHandler, closed, client) at 7f1ea7c42f90> Event Manager: connection started for <MTClientConnection(uuid=None, address=2001:...:2051, handler=PrimaryNotificationsHandler, fd=13, on_close=onConnectionClosed, connecting, client) at 7f1ea7c25c10> #0x0000 RequestIdentification > None (2001:...:2051) Readers: [] Writers: [] Connections: 13: <MTClientConnection(uuid=None, address=2001:...:2051, handler=PrimaryNotificationsHandler, fd=13, on_close=onConnectionClosed, connecting, client) at 7f1ea7c25c10> (pending=False) Node manager : 1 nodes * None | MASTER | 2001:...:2051 | UNKNOWN <ClientCache history_size=0 oid_count=0 size=0 time=0 queue_length=[0] (life_time=10000 max_history_size=100000 max_size=20971520)> poll raised, retrying Traceback (most recent call last): File "neo/lib/threaded_app.py", line 93, in _run poll(1) File "neo/lib/event.py", line 134, in poll self._poll(0) File "neo/lib/event.py", line 164, in _poll conn = self.connection_dict[fd] KeyError: 13 which means that: - while the poll thread is getting a (13, EPOLLIN) event because it is closed (aborted by the master) - another thread processes the error packet, by closing it in PrimaryBootstrapHandler.notReady - next, the poll thread resumes the execution of EpollEventManager._poll and fails to find fd=13 in self.connection_dict So here, we have a race condition between epoll_wait and any further use of connection_dict to map returned fds. However, what commit a4731a0c does to handle the case of fd reallocation only works for mono-threaded applications. In EPOLLIN, wrapping 'self.connection_dict[fd]' the same way as for other events is not enough. For example: - case 1: - thread 1: epoll returns fd=13 - thread 2: close(13) - thread 2: open(13) - thread 1: self.connection_dict[13] does not match but this would be handled by the 'unregistered' list - case 2: - thread 1: reset 'unregistered' - thread 2: close(13) - thread 2: open(13) - thread 1: epoll returns fd=13 - thread 1: self.connection_dict[13] matches but it would be wrongly ignored by 'unregistered' - case 3: - thread 1: about to call readable/writable/onTimeout on a connection - thread 2: this connection is closed - thread 1: readable/writable/onTimeout wrongly called on a closed connection We could protect _poll() with a lock, and make unregister() use wakeup() so that it gets a chance to acquire it, but that causes threaded tests to deadlock (continuing in this direction seems too complicated). So we have to deal with the fact that there can be race conditions at any time and there's no way to make 'connection_dict' match exactly what epoll returns. We solve this by preventing fd reallocation inside _poll(), which is fortunately possible with sockets, using 'shutdown': the closing of fds is delayed. For above case 3, readable/writable/onTimeout for MTClientConnection are also changed to test whether the connection is still open while it has the lock. Just for safety, we do the same for 'process'. At last, another kind of race condition that this commit also fixes concerns the use of itervalues() on EventManager.connection_dict.
Showing
Please register or sign in to comment