Add a "process_shutdown" method to be invoked asynchronously by a tic in every shutdown phase (see Lifetime.py) while the Zope shutdown sequence is running.
git-svn-id: https://svn.erp5.org/repos/public/erp5/trunk@22386 20353a03-c40f-0410-a6d1-a30d3c3de9de
-
Owner
I did a bit of digging in the git history. For reference, here are my findings: this was something supported only in ClockServer. Zope's ZServer had an API to control the shutdown sequence, allowing servers to implement a "clean shutdown protocol" with a clean_shutdown_control method that was called at every phase of the shutdown, as described here: https://github.com/zopefoundation/ZServer/blob/0259e288/src/Lifetime/__init__.py#L20-L38
ClockServer implemented the clean_shutdown_control method: https://lab.nexedi.com/nexedi/erp5/blob/07a136ec3/product/ClockServer/ClockServer.py#L57
The README ( https://lab.nexedi.com/nexedi/erp5/blob/07a136ec3/product/ClockServer/README#L32 ) described that it was expected to be configured to call TimerService's process_shutdown. It seems we were using a mix of ClockServer + TimerService there, maybe during a transition phase. I did not find when this configuration was set. At the time there was an "experimental buildout" in the svn repository; maybe it was there, or maybe it was never set. So it seems that, besides the project scripts using
wget -q "http://127.0.0.1:${ZOPE_PORT}/erp5/portal_activities/process_shutdown?phase:int=3&time_in_phase:int=0" -O -
there was also the beginning of something integrated.
In 49400d40 (we don't use ClockServer. we use TimerService instead., 2013-10-07) we removed ClockServer; since that day, this code has not been used in Zope.
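To make the protocol above concrete, here is a minimal sketch of a phased shutdown loop that calls clean_shutdown_control on each server and forwards the call to a process_shutdown callable. This is not ZServer's or ClockServer's actual code: the class name, run_shutdown_phases, the phase numbers 1 to 3, the per-phase timeout and the "a true return value keeps the loop in the current phase" convention are all assumptions made for illustration.

```python
import time


class ShutdownAwareServer:
    """Hypothetical server object taking part in a phased shutdown."""

    def __init__(self, process_shutdown):
        # process_shutdown: callable(phase, time_in_phase). In this sketch a
        # true return value means "not ready to leave this phase yet".
        self.process_shutdown = process_shutdown

    def clean_shutdown_control(self, phase, time_in_phase):
        # Forward the shutdown notification, as the ClockServer README
        # described doing towards TimerService's process_shutdown.
        return self.process_shutdown(phase, time_in_phase)


def run_shutdown_phases(servers, phases=(1, 2, 3), phase_timeout=1.0,
                        poll_interval=0.01):
    """Notify every server in every phase; return the completed phases.

    A phase ends when no server objects any more, or when phase_timeout
    seconds have been spent in it (assumed policy, not ZServer's).
    """
    completed = []
    for phase in phases:
        started = time.monotonic()
        while True:
            time_in_phase = time.monotonic() - started
            veto = False
            for server in servers:
                if server.clean_shutdown_control(phase, time_in_phase):
                    veto = True
            if not veto or time_in_phase >= phase_timeout:
                break
            time.sleep(poll_interval)
        completed.append(phase)
    return completed
```

With something like this in place, the external wget call would only be one possible caller of process_shutdown; the loop itself would drive the same method with increasing phase and time_in_phase values.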
In waitress/WSGI mode, there is no "clean_shutdown_control" API, so supporting this with waitress would need more work. Also, I think we were discussing implementing a "clean shutdown" at the slapos level. In any case we will need slapos cooperation, at least to get more time than the default "SIGTERM, wait 60 seconds, SIGKILL" sequence from supervisor (60 is configured here).
I started writing this as a commit message for a commit removing clean_shutdown_control, but maybe we want to keep this; I did not understand until today that it was called externally by wget.
-
Owner
Here are my views on clean shutdown:
- the "probably easy" part of the shutdown issue is on the http server side: it needs to close the listening socket (so any new connection attempt gets a "port closed"), and then to wait for any established connection to get closed (modulo whatever maximum time we accept to wait). This should provide a graceful shutdown, by completing any started transaction, while other requests should be already automatically directed (by haproxy) to any other process in the same cluster.
- the "harder" part of the shutdown issue is that, once we decide that we do not want to wait (much) more, we should prevent any new transaction from entering the 2-phase commit, and wait for those which are already there. Any brutal shutdown while there are transactions in the 2pc will cause data inconsistency between databases, so if we really want a timeout here, it should be rather long. Also, I think Zope should self-monitor the time spent in the 2pc, and warn if this is getting close to whatever maximum delay we are ready to wait for, so we can either reconsider this delay, or somehow lower the time needed for such transactions. And of course if the intent is "kill this instance which is not paying to be hosted anymore", then we probably do not care about database consistency and kill -9 is made for this (to me, this distinction is missing in slapos process management: not all "stop" are equal).
- finally, a not necessarily hard (the code should already be here, but has likely been unused for 10 years, so is likely somewhat broken), but necessary part of such work is to get CMFActivity to stop its processing when a shutdown is requested. I implemented this with a lock which would be acquired in the processing loop in CMFActivity (CMFActivity loops to avoid the large latency cost of waiting for the next tic whenever we already know there is still stuff to do). The intent is both to tell CMFActivity that it is not ok to start the next loop, and to let the caller know when CMFActivity has stopped processing anything: wget will block for as long as CMFActivity is running an activity. Unless there is a Zope-level http timeout, but I do not think there is.
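The lock idea from the last point can be sketched roughly as follows. This is only an illustration under assumptions, not the CMFActivity code: ActivityProcessor, tic, the work list and the use of a threading.Event next to the lock are all hypothetical names and choices. It shows the two intended effects: no new loop iteration starts once shutdown is requested, and the shutdown caller blocks until the activity currently running has finished.

```python
import threading


class ActivityProcessor:
    """Hypothetical stand-in for the CMFActivity processing loop."""

    def __init__(self, work):
        self.work = list(work)   # pending activities, as plain callables
        self.done = []           # results of executed activities
        self._lock = threading.Lock()
        self._shutdown_requested = threading.Event()

    def tic(self):
        # Loop while there is work, instead of waiting for the next tic.
        while True:
            with self._lock:
                if self._shutdown_requested.is_set() or not self.work:
                    return
                activity = self.work.pop(0)
                # The activity runs while the lock is held, so a shutdown
                # request has to wait for it to finish.
                self.done.append(activity())

    def process_shutdown(self):
        # First forbid starting any further iteration...
        self._shutdown_requested.set()
        # ...then block until the running activity (if any) releases the lock.
        with self._lock:
            pass
```

In this sketch a caller of process_shutdown, such as the wget request, returns only once the loop is idle, which is the blocking behaviour described above.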