ZApplication: Do not let Cleanup references be held anywhere except REQUEST
Cleanup.__del__() calls transaction.abort(). transaction.abort() implicitly uses the thread-local transaction manager. The __del__ method is called when the last reference to an object is dropped. If there are no object cycles, this can be more or less controlled. However, if an object with a __del__ method is held by other objects forming a cycle, the __del__ method will be called only eventually, whenever that cycle is garbage-collected - and the garbage collection can be run from *whatever* thread happened to trigger the GC.

If we leave the `cleanup` local variable in place, a reference to it will be held in the frame's .f_locals, and if e.g. an exception is raised, and even caught, a cycle will be created:

```
exc_traceback
   |       ^
   |       |
   v  .f_localsplus
  frame
```

This is exactly the situation outlined above - a cycle of objects which will be freed only by the GC, and one of those objects (.f_localsplus) references `cleanup`. This way Cleanup.__del__ will be called not strictly when REQUEST goes away, but maybe a bit later if the cycle stays alive longer (thus potentially aborting a different, next transaction), and/or from another thread (thus aborting an unrelated transaction).

The case was caught via analyzing strange messages in Zope logs on a Wendelin system:

```
Exception KeyError: (140574709077840,) in <bound method Cleanup.__del__ of <App.ZApplication.Cleanup instance at 0x7fd9f22f6830>> ignored
```

This turned out to be Wendelin.core's _ZBigFileH.on_connection_close() trying to unregister itself from the transaction manager it previously registered to on ZODB connection open:

https://lab.nexedi.com/nexedi/wendelin.core/blob/87bf4908/bigfile/file_zodb.py#L621
https://lab.nexedi.com/nexedi/wendelin.core/blob/87bf4908/bigfile/file_zodb.py#L608

the transaction_manager being ThreadTransactionManager:

https://github.com/zopefoundation/transaction/blob/bd26c334/transaction/__init__.py#L23
https://github.com/zopefoundation/transaction/blob/bd26c334/transaction/_manager.py#L122

which implicitly keeps its instance variables in thread-local storage (i.e. self.var is different for every thread), and so transaction_manager.unregisterSynch() was removing the _ZBigFileH instance from a set different from the one where it was originally registered:

https://github.com/zopefoundation/transaction/blob/bd26c334/transaction/_manager.py#L79

thus the error message, not to mention the further inevitable desynchronization between the wendelin.core transaction manager

https://lab.nexedi.com/nexedi/wendelin.core/blob/87bf4908/bigfile/file_zodb.py#L548

and the ZODB transaction manager, which can lead to all kinds of corruption.

----

So do not let the `cleanup` variable stay in the frame. This is a simple but not optimal fix. The good solution would avoid using __del__ at all and track request lifetime explicitly via regular programming logic.
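For illustration, here is a minimal standalone sketch (not part of the original commit; `Witness` is a hypothetical stand-in for Cleanup) of how storing a traceback in a local creates exactly such a frame cycle, delaying __del__ until a later GC run:

```python
import gc, sys

class Witness(object):
    def __del__(self):
        print('__del__ runs only when the GC collects the cycle')

def handler():
    w = Witness()               # plays the role of `cleanup` in the frame's locals
    try:
        raise RuntimeError('boom')
    except RuntimeError:
        tb = sys.exc_info()[2]  # tb references this very frame via tb_frame,
                                # and the frame's locals reference tb -> cycle

gc.disable()                    # make the effect visible deterministically
handler()
print('handler returned; Witness is still alive')
gc.collect()                    # only now the cycle - and Witness - go away
```

Here __del__ fires from whichever thread happens to run the collection, which is the whole point of the commit message above.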
-
Maintainer
I'd like to add that the bug is more critical for Wendelin systems than for regular ERP5 or Zope, since in Wendelin systems wendelin.core has to implement its own transaction manager logic and carefully synchronize it with the ZODB transaction-managing machinery.
-
@kirr : Very nice low-level debugging and clear analysis, as usual :)
A `__del__` method being (de facto made) responsible for transaction control is totally frightening to me. I believe this should also be removed. Transaction control on a "normal" instance (ie, not unit tests) must be the exclusive task of `ZPublisher` and (until it someday gets rewritten to go through `ZPublisher` for every single transaction) `CMFActivity`. Nothing else should ever get close to the "control transaction boundaries" side of the `transaction` module.

Would it make sense to reverse the relation ? Like, trigger cleanup in a before (or after) commit hook ? I would expect the ZODB connection to be kept alive merely because every single persistent object loaded in any frame of about every script run during the transaction keeps a reference to the connection (`_p_jar`), but I may be missing something. Also, I would expect `ZODB.Connection`, by registering itself to the transaction manager, to be also kept alive by that path until transaction end, which must ultimately happen in `ZPublisher`.

About the fix, what about collapsing it all into a single statement, so the local does not exist and hence does not need to be deleted ?

```python
REQUEST._hold(Cleanup(conn))
```

It does remove the natural place to document the change in the code, but it is already well documented in the commit message. It is admittedly one of my OCDs, but I dislike such single-use variables very much as they just load the (mental) namespace for no good reason, even when there is no GC trickery around. Fewer statements FTW !
-
We all 3 had the idea of collapsing it into a single statement. But in any case, a comment is required in the code to make sure nobody reverts it. Commit messages are obviously not a place for warnings.
-
No comment is needed if `Cleanup.__del__` stops aborting transactions...

And we cannot reasonably put all the "don't"s in all code everywhere, so I do not think the warning is critical enough that we should keep the code overly complex (...for some ridiculously low amount of extra complexity, granted) just for the sake of having the comment in the code.
-
Maintainer
@vpelletier, @jm thanks for feedback.
> A `__del__` method being (de facto made) responsible for transaction control is totally frightening to me. I believe this should also be removed.

Yes, this is what I was saying in "This is simple but not optimal fix. The good solution would avoid using `__del__` at all and track request lifetime explicitly via regular programming logic." I agree we should look into properly fixing this, probably as you say in ZPublisher.

> Would it make sense to reverse the relation ? Like, trigger cleanup in a before (or after) commit hook ?

Yes, I would program it this way: after the request is handled without an error, the transaction is committed; if there is an error, the transaction is aborted. Then the database connection is "closed" (see below).

> I would expect the ZODB connection to be kept alive merely because of every single persistent object loaded in any frame of about every script run during the transaction keeping a reference to the connection (`_p_jar`), but I may be missing something.

When a connection is closed it is not destroyed but actually returned to the database connection pool. Its cache is also retained for the next requests to use:

https://github.com/zopefoundation/ZODB/blob/3b700be9/src/ZODB/interfaces.py#L186
https://github.com/zopefoundation/ZODB/blob/3b700be9/src/ZODB/Connection.py#L279
https://github.com/zopefoundation/ZODB/blob/3b700be9/src/ZODB/Connection.py#L325

After a transaction is committed or aborted there cannot be live objects left in dirty state - they all become synchronized to database state, either to the newly created transaction, or to the previous transaction in case of abort. This way all live objects are actually cache. The ZODB cache keeps them (and garbage-collects them from time to time), but if a reference to an object stays alive from outside after the transaction ends, it has to be understood that it becomes a reference into a cache, and that modifying an object via such a reference is playing dirty games with cache internals, thus potentially affecting unrelated transactions. In other words, it is invalid to modify objects whose references were obtained in a transaction after that transaction's lifetime ends.
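For illustration, a small self-contained sketch of that pooling behaviour (assumption: plain ZODB with an in-memory DemoStorage, not the original system's configuration):

```python
from ZODB import DB
from ZODB.DemoStorage import DemoStorage

db = DB(DemoStorage())   # in-memory database, just for the demo
c1 = db.open()
c1.close()               # not destroyed: returned to the pool, cache kept
c2 = db.open()           # the pool typically hands back the same connection
print(c1 is c2)          # True here (single thread, default pool)
db.close()
```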
> Also, I would expect `ZODB.Connection`, by registering itself to the transaction manager, to be also kept alive by that path until transaction end, which must ultimately happen in `ZPublisher`.

The transaction manager keeps registered synchronizers in a weakset:

https://github.com/zopefoundation/transaction/blob/1.1.1/transaction/_manager.py#L76
https://github.com/zopefoundation/transaction/blob/1.1.1/transaction/_manager.py#L56
https://github.com/zopefoundation/ZODB/blob/3b700be9/src/ZODB/Connection.py#L925

so one cannot rely on just the transaction manager keeping a reference to a registered connection.
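A tiny illustration (not transaction's actual code, just the same weakset semantics) of why registration alone keeps nothing alive:

```python
import weakref

class Synch(object):
    pass

registry = weakref.WeakSet()   # like the manager's synchronizer registry
s = Synch()
registry.add(s)
print(len(registry))           # 1 - registered
del s                          # drop the only strong reference
print(len(registry))           # 0 - silently gone from the registry
```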
It is imho better to explicitly program the logic: request comes; open db connection; handle request; transaction commit or abort; close connection.
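A minimal sketch of that explicit lifecycle, with no `__del__` involved (hedged: `serve` and `handle` are hypothetical names, not Zope API):

```python
import transaction

def serve(db, request, handle):
    conn = db.open()                 # bind a ZODB connection to this request
    try:
        transaction.begin()
        try:
            result = handle(conn.root(), request)
            transaction.commit()     # request handled without error
            return result
        except:
            transaction.abort()      # error: roll back this thread's txn
            raise
    finally:
        conn.close()                 # always return the connection to the pool
```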
I would agree to collapsing the Cleanup call into `REQUEST._hold`, with comments left. However I think this is minor, and we'd better fix Zope to serve requests without `__del__` at all.

-
> probably as you say in ZPublisher.

What I meant is that ZPublisher is already handling transaction boundaries: it is where exceptions are caught so the error page is rendered, which also provides natural locations for abort & commit.

> When connection is closed it is not destroyed

Oh right, that code was not about holding a strong reference to `Connection`, but merely using weak references to control something else (transaction lifetime).

> The transaction manager keeps registered synchronizers in a weakset:

Mmh, this is a bit weird to me. I mean, if a transactional adapter expressed interest in the current transaction, I see this as a strong expectation that the transaction will actually do stuff on that connector, and it may not be trivial to guarantee the synchronizer will be kept alive another way. Maybe there is a separate strong-reference chain elsewhere ? Anyway, this may be drifting too far from the original topic.
-
Maintainer
For the reference - with the following patch applied on one system:
```diff
--- ZApplication.py.kirr	2017-06-14 14:46:30.000000000 +0200
+++ ZApplication.py	2017-08-20 15:47:08.231696334 +0200
@@ -49,6 +49,14 @@
     # arrange for the connection to be closed when the request goes away
     cleanup = Cleanup(conn)
     REQUEST._hold(cleanup)
+    # https://lab.nexedi.com/kirr/Zope/commit/9ca50f15
+    # do not let `cleanup` stay referenced by local frame.
+    # this could be dangerous if the frame is held in any traceback
+    # collected e.g. for exceptions raised and caught in ^^^ hooks, because
+    # if so cleanup lifetime will be different from REQUEST lifetime and
+    # thus cleanup will be called at arbitrary time from under GC from
+    # arbitrary thread.
+    del cleanup

     conn.setDebugInfo(REQUEST.environ, REQUEST.other)

@@ -75,10 +83,25 @@

     return connection.root()[aname]


+import traceback
+from thread import get_ident as gettid
+import time
+import sys
 class Cleanup:
     def __init__(self, jar):
+        self._tid = gettid()
+        self._created = time.time()
         self._jar = jar

     def __del__(self):
-        transaction.abort()
-        self._jar.close()
+        tid = gettid()
+        if tid != self._tid:
+            now = time.time()
+            print >>sys.stderr, "E: %s: created from thread %s @%s; __del__ runs from thread %s @%s" \
+                % (self, self._tid, time.ctime(self._created), tid, time.ctime(now))
+        try:
+            transaction.abort()
+            self._jar.close()
+        except:
+            traceback.print_exc()
+            raise
```
I still see entries in the log like this:
```
==> slappart8/.slappart8_zope-2.log <==
E: <App.ZApplication.Cleanup instance at 0x7f1ca5237c68>: created from thread 139761418483456 @Tue Aug 29 04:35:38 2017; __del__ runs from thread 139761443661568 @Tue Aug 29 04:35:42 2017
```
in other words, Cleanup is still being run from a thread other than the one that created it, and oops - because the `transaction` module is implicitly thread-local, Cleanup wrongly aborts the transaction of another thread.

I can imagine explaining it this way: even if Cleanup is held only by `REQUEST`, the `REQUEST` itself could participate in a cycle, and thus it won't necessarily be destroyed when the main zpublisher logic drops its reference to it. So if REQUEST stays alive in a cycle until GC is run, and if GC is triggered by another thread -> then Cleanup will be called by another thread.

So the only reliable fix is to stop using `__del__` at all.
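A small demonstration of that thread-locality (assumption: the stock `transaction` package with its default ThreadTransactionManager), showing that each thread sees its own current transaction, so an abort from the wrong thread touches a different one:

```python
from __future__ import print_function
import threading, transaction

def show():
    # transaction.get() consults the thread-local manager: each thread
    # sees its own "current" transaction object
    print(threading.current_thread().name, id(transaction.get()))

show()                              # MainThread: one transaction
t = threading.Thread(target=show)   # Thread-1: a different one
t.start()
t.join()
```

-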
Maintainer
Another data point:
On a project we started to see the following regular errors on one of the zopes from the activities zope group:
```
2018-07-09 08:28:49,472 ERROR ZODB.Connection Shouldn't load state for 0xf7b1 when the connection is closed
------
2018-07-09 08:28:49,472 ERROR ZServerPublisher exception caught
Traceback (most recent call last):
  File "/srv/slapgrid/slappart8/srv/runner/software/56ad0b3e7929159eb04546ac668fd1e9/eggs/Zope2-2.13.24-py2.7.egg/ZServer/PubCore/ZServerPublisher.py", line 31, in __init__
    response=b)
  File "/srv/slapgrid/slappart8/srv/runner/software/56ad0b3e7929159eb04546ac668fd1e9/eggs/Zope2-2.13.24-py2.7.egg/ZPublisher/Publish.py", line 455, in publish_module
    environ, debug, request, response)
  File "/srv/slapgrid/slappart8/srv/runner/software/56ad0b3e7929159eb04546ac668fd1e9/eggs/Products.LongRequestLogger-2.0.0-py2.7.egg/Products/LongRequestLogger/__init__.py", line 19, in publish_module_standard
    return publish_module_standard.original(*args, **kw)
  File "/srv/slapgrid/slappart8/srv/runner/software/56ad0b3e7929159eb04546ac668fd1e9/eggs/Zope2-2.13.24-py2.7.egg/ZPublisher/Publish.py", line 276, in publish_module_standard
    if request is not None: request.close()
  File "/srv/slapgrid/slappart8/srv/runner/software/56ad0b3e7929159eb04546ac668fd1e9/eggs/Zope2-2.13.24-py2.7.egg/ZPublisher/BaseRequest.py", line 220, in close
    notify(EndRequestEvent(None, self))
  File "/srv/slapgrid/slappart8/srv/runner/software/56ad0b3e7929159eb04546ac668fd1e9/eggs/zope.event-3.5.2-py2.7.egg/zope/event/__init__.py", line 31, in notify
    subscriber(event)
  File "/srv/slapgrid/slappart8/srv/runner/software/56ad0b3e7929159eb04546ac668fd1e9/eggs/zope.component-3.9.5-py2.7.egg/zope/component/event.py", line 24, in dispatch
    zope.component.subscribers(event, None)
  File "/srv/slapgrid/slappart8/srv/runner/software/56ad0b3e7929159eb04546ac668fd1e9/eggs/zope.component-3.9.5-py2.7.egg/zope/component/_api.py", line 136, in subscribers
    return sitemanager.subscribers(objects, interface)
  File "/srv/slapgrid/slappart8/srv/runner/software/56ad0b3e7929159eb04546ac668fd1e9/eggs/ZODB3-3.10.7-py2.7-linux-x86_64.egg/ZODB/Connection.py", line 859, in setstate
    raise ConnectionStateError(msg)
ConnectionStateError: Shouldn't load state for 0xf7b1 when the connection is closed
```
Collecting all such objects for which the state could not be loaded because the connection was closed showed they are:
```
# grep "Shouldn't load state for" *event* | awk '{print $9}' | sort | uniq -c
      3 1
        0x00
    128 0x01
     77 0x01e8
     20 0x03
    200 0xf7b0
      9 0xf87b
```
and checking via `zodb catobj` showed that:

```
0x00   is a root object
0x01   = OFS.Application.Application
0x03   = App.ApplicationManager.ApplicationManager
0x01e8 = Products.TimerService.TimerService
0xf87b = erp5.portal_type.Activity Tool
```
in other words all problematic objects were somehow very global.
Further inspecting, it turned out these were errors from the health checker trying to access `/` every second. Curl confirmed that:

```
[12:33:15] <kirr> (venv.1) (56ad0b3e7929159eb04546ac668fd1e9) slapuser8@micfix01:~/srv/runner/instance/slappart8/var/log$ curl -v 10.0.208.122:2221
* Rebuilt URL to: 10.0.208.122:2221/
*   Trying 10.0.208.122...
* Connected to 10.0.208.122 (10.0.208.122) port 2221 (#0)
> GET / HTTP/1.1
> Host: 10.0.208.122:2221
> User-Agent: curl/7.47.0
> Accept: */*
>
< HTTP/1.1 500 Internal Server Error
< Server: Zope/(2.13.24, python 2.7.12, linux2) ZServer/1.1
< Date: Mon, 09 Jul 2018 09:32:48 GMT
< Bobo-Exception-Line: 859
< Content-Length: 1157
< Bobo-Exception-Value: See the server error log for details
< Content-Type: text/html; charset=utf-8
< Bobo-Exception-Type: <class 'ZODB.POSException.ConnectionStateError'>
< Bobo-Exception-File: Connection.py
<
<h2>Site Error</h2>
<p>An error was encountered while publishing this resource.
</p>
<p><strong><class 'ZODB.POSException.ConnectionStateError'></strong></p>
Sorry, a site error occurred.<p><p>Traceback (innermost last):</p>
<ul>
<li>  Module ZPublisher.Publish, line 247, in publish_module_standard<br />
    setDefaultSkin(request)</li>
<li>  Module zope.publisher.skinnable, line 46, in setDefaultSkin<br />
    adapters = zope.component.getSiteManager().adapters</li>
<li>  Module ZODB.Connection, line 859, in setstate<br />
    raise ConnectionStateError(msg)</li>
</ul><p>ConnectionStateError: Shouldn't load state for 0xf7b1 when the connection is closed
</p></p>
<hr noshade="noshade"/>
<p>Troubleshooting Suggestions</p>
<ul>
<li>The URL may be incorrect.</li>
<li>The parameters passed to this resource may be incorrect.</li>
<li>A resource that this resource relies on may be encountering an error.</li>
</ul>
<p>For more detailed information about the error, please refer to the error log.
</p>
<p>If the error persists please contact the site maintainer.
Thank you for your patience.
* Connection #0 to host 10.0.208.122 left intact
```
It was also observed that liveness checks for that particular zope started returning 500:

```
zope-1-Z2.log:10.0.208.122 - Anonymous [08/Jul/2018:18:48:30 +0000] "GET / HTTP/1.0" 500 0 "" ""
```
and after checking `.slappart8_*.log` it was discovered that even though

```
Exception KeyError: (140574709077840,) in <bound method Cleanup.__del__ of <App.ZApplication.Cleanup instance at 0x7fd9f22f6830>> ignored
```

is regularly seen for all zopes, for this particular zope there was something new:

```
Exception AssertionError: AssertionError() in <bound method Cleanup.__del__ of <App.ZApplication.Cleanup instance at 0x7fe62b4b2098>> ignored
```

This error was at the end of `.slappart8_zope-1.log`, and the last mtime of that file corresponded to the time when 500 errors started to appear in `zope-1-Z2.log`.

Inspecting the `Cleanup` code:

```python
class Cleanup:
    def __init__(self, jar):
        self._jar = jar

    def __del__(self):
        transaction.abort()
        self._jar.close()
```

shows that `Cleanup.__del__` tries to abort the current transaction and close the ZODB connection initially pinned to it. Unfortunately there is no traceback printed for the AssertionError, since it happened while a finalizer was running. Anyway, what can be said is that because `Cleanup.__del__` was run either a) from a wrong thread, or b) too late, some assert and missynchronization happened in ZODB and prevented it from further working correctly.

For reference, inspecting all threads of that zope showed nothing abnormal:
```
Threads traceback dump at 2018-07-09 09:55:44

Thread 140626557400832:
  File "/srv/slapgrid/slappart8/srv/runner/software/56ad0b3e7929159eb04546ac668fd1e9/eggs/Zope2-2.13.24-py2.7.egg/ZServer/PubCore/ZServerPublisher.py", line 25, in __init__
    name, a, b=accept()
  File "/srv/slapgrid/slappart8/srv/runner/software/56ad0b3e7929159eb04546ac668fd1e9/eggs/Zope2-2.13.24-py2.7.egg/ZServer/PubCore/ZRendezvous.py", line 73, in accept
    l.acquire()

Thread 140626635888384:
  File "/srv/slapgrid/slappart8/srv/runner/software/56ad0b3e7929159eb04546ac668fd1e9/parts/python2.7/lib/python2.7/threading.py", line 774, in __bootstrap
    self.__bootstrap_inner()
  File "/srv/slapgrid/slappart8/srv/runner/software/56ad0b3e7929159eb04546ac668fd1e9/parts/python2.7/lib/python2.7/threading.py", line 801, in __bootstrap_inner
    self.run()
  File "/srv/slapgrid/slappart8/srv/runner/software/56ad0b3e7929159eb04546ac668fd1e9/eggs/Products.LongRequestLogger-2.0.0-py2.7.egg/Products/LongRequestLogger/monitor.py", line 46, in run
    while not select(rlist, (), (), t)[0] or os.read(r, 8):

Thread 140626878301952:
  File "/srv/slapgrid/slappart8/srv/runner/software/56ad0b3e7929159eb04546ac668fd1e9/eggs/timerserver-2.0.2-py2.7.egg/timerserver/TimerServer.py", line 93, in run
    time.sleep(interval)

Thread 140626669999872:
  File "/srv/slapgrid/slappart8/srv/runner/software/56ad0b3e7929159eb04546ac668fd1e9/parts/python2.7/lib/python2.7/threading.py", line 774, in __bootstrap
    self.__bootstrap_inner()
  File "/srv/slapgrid/slappart8/srv/runner/software/56ad0b3e7929159eb04546ac668fd1e9/parts/python2.7/lib/python2.7/threading.py", line 801, in __bootstrap_inner
    self.run()
  File "/srv/slapgrid/slappart8/srv/runner/software/56ad0b3e7929159eb04546ac668fd1e9/parts/python2.7/lib/python2.7/threading.py", line 754, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/srv/slapgrid/slappart8/srv/runner/software/56ad0b3e7929159eb04546ac668fd1e9/parts/neoppod-repository/neo/lib/threaded_app.py", line 81, in run
    self._run()
  File "/srv/slapgrid/slappart8/srv/runner/software/56ad0b3e7929159eb04546ac668fd1e9/parts/neoppod-repository/neo/lib/threaded_app.py", line 91, in _run
    poll(1)
  File "/srv/slapgrid/slappart8/srv/runner/software/56ad0b3e7929159eb04546ac668fd1e9/parts/neoppod-repository/neo/lib/event.py", line 155, in poll
    self._poll(blocking)
  File "/srv/slapgrid/slappart8/srv/runner/software/56ad0b3e7929159eb04546ac668fd1e9/parts/neoppod-repository/neo/lib/event.py", line 202, in _poll
    event_list = self.epoll.poll(blocking)

End of dump
```
and after that zope was restarted, the problem went away.
Once again: the only reliable fix is to stop using `__del__` at all.

-
Maintainer
@Tyagov, as 9ca50f15 (comment 38864) says, even if we add an explicit `del cleanup` it won't prevent problems, and `Cleanup.__del__` can still be, and in practice is, called from a wrong context. The only solution is to rework Zope more to avoid depending on the finalizer at all. There is no patch for this yet.

-
> I would expect the ZODB connection to be kept alive merely because of every single persistent object loaded in any frame of about every script run during the transaction keeping a reference to the connection (`_p_jar`), but I may be missing something.
Because of ZODB#185, which I hope to get fixed one day, we currently want Connections to be closed once the zope worker has finished with a request. I agree this generates overhead, but the only solution to avoid memory leaks would be to also process invalidations immediately for open Connections that are outside transactions. I didn't think about this before, but that looks doable now that Connections can require explicit calls to `transaction.begin` (transaction >= 2.1.0).

`publish()` is actually agnostic about accessed databases. We're somewhat biased because in our case, we always open the same DB. For the moment, I don't see what I could change in this method or in `publish_module_standard` (transactions are already begun/committed/aborted when needed, and the request is explicitly closed at the end). We could say that the whole ZPublisher is not optimised for our use.

> Like, trigger cleanup in a before (or after) commit hook ?

What about aborts ? Or multiple commits within a request ? I know the latter is questionable (to the point we often wonder what's the difference between transactions and requests, and where to cache things - transactional cache or request ?), but there's actually a good example: process_timer (in particular CMFActivity). With what you suggest, we're likely to clean up things too early, or never.
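For reference, a minimal sketch of the hook mechanism under discussion (`addAfterCommitHook` is real `transaction` API; `FakeConn` is a stand-in, not ZODB code). It also shows the abort problem: after-commit hooks simply never run on `transaction.abort()`:

```python
import transaction

class FakeConn(object):          # hypothetical stand-in for a ZODB connection
    def close(self):
        print('connection closed')

conn = FakeConn()
txn = transaction.begin()
txn.addAfterCommitHook(lambda ok, c: c.close(), args=(conn,))
transaction.commit()             # prints 'connection closed'

transaction.begin()              # hooks are per-transaction: must re-register
transaction.abort()              # prints nothing - cleanup would never happen
```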
-
I thought about using zope components, by making ZApplicationWrapper clean up things during `EndRequestEvent` (see `BaseRequest.close`), but there would be no guarantee that this cleanup is processed last (as written in a comment, subscribers might need the zodb).

Another thing is that `Cleanup` seems to do more than just cleanup. It also keeps the Connection open during the request: that's not done by transactions (which only have weak refs), and it's not clear whether traversal guarantees to have references to persistent objects (either in locals or in the request). BTW, since `__del__` is not implemented, a Connection disappears without returning to the DB pool if all its referrers are deleted.

Here is the best idea I have for the moment (not tested):
```diff
diff --git a/src/App/ZApplication.py b/src/App/ZApplication.py
index 1f8c785..7e3c5d1 100644
--- a/src/App/ZApplication.py
+++ b/src/App/ZApplication.py
@@ -47,8 +47,11 @@ def __bobo_traverse__(self, REQUEST=None, name=None):
                 hook(conn)
 
         # arrange for the connection to be closed when the request goes away
-        cleanup = Cleanup(conn)
-        REQUEST._hold(cleanup)
+        request_clear = REQUEST.clear
+        def clear():
+            request_clear()
+            conn.close()
+        REQUEST.clear = clear
 
         conn.setDebugInfo(REQUEST.environ, REQUEST.other)
 
@@ -73,12 +76,3 @@ def __call__(self, connection=None):
             connection=db.open(connection)
 
         return connection.root()[aname]
-
-
-class Cleanup:
-    def __init__(self, jar):
-        self._jar = jar
-
-    def __del__(self):
-        transaction.abort()
-        self._jar.close()
```
-
> I can imagine explaining it this way: even if Cleanup is held only by REQUEST, the REQUEST itself could participate in a cycle and thus it won't necessarily be destroyed when the main zpublisher logic drops its reference to it. So if REQUEST stays alive in a cycle until GC is run, and if GC is triggered by another thread -> then Cleanup will be called by another thread.
What remains obscure to me is that ZPublisher does close the request explicitly. How is it possible that `self._held=None` (in `BaseRequest.clear`) would not trigger `Cleanup.__del__` synchronously ? Something may be wrong with nexedi/erp5@bc120f55.

-
> How is it possible that `self._held=None` (in `BaseRequest.clear`) would not trigger `Cleanup.__del__` synchronously ?

Why would it be guaranteed to be synchronous ? `__del__` is known to be an obscure mechanism in python, and in my experience it should be seen as a hook into the GC cleanup worker rather than a regular destructor/finaliser for instances. And AFAIK it is inherently asynchronous.

For example, I have seen `__del__` be called despite the object still being referenced (although I do not remember now if it was during interpreter shutdown, which is yet another case triggering more counter-intuitive effects): `a.foo = b`, with `b.__del__` being called before `a.__del__`. This was not even part of a reference loop resolution. The answer I was given when asking about this behaviour was along the lines of "use context managers, they are the only lifetime management mechanism available in python which works deterministically and reliably". Aside from writing our own `try..finally` blocks, of course, but that's just more verbose for AFAIK no gain.

I think relying on the exact `__del__` call sequence could legitimately make our code non-portable to other python interpreters (thinking of pypy, as it has a radically different approach to garbage collection than cPython).

-
My point is not to defend `__del__`; I know very well it's wrong in this case, at least because it's incompatible with PyPy. But we're not going to use Zope 2.13 with PyPy, nor fix coding crimes in Zope if they don't actually affect us. I also spent time in NEO removing many of them.

> Why would it be guaranteed to be synchronous ?

Because of refcounting. GC only plays a role in case of cycles, and before CPython 3.4 it ignores objects with a `__del__` method. The second paragraph of this commit message is wrong.

I read the CPython code (not trivial, so I might be wrong) and it matches what we can read e.g. at https://rushter.com/blog/python-garbage-collector/. tp_dealloc is called immediately when the refcount reaches 0, which in turn calls `__del__` and deallocs everything (I leave aside specific cases like weak refs, resurrection, etc).
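A quick check of that claim (CPython-specific, of course): with no cycle involved, `__del__` fires synchronously the moment the last reference is dropped:

```python
class A(object):
    def __del__(self):
        print('__del__')

a = A()
print('before del')
del a              # refcount hits 0 -> tp_dealloc -> __del__, right here
print('after del')
```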
For this commit to do anything, we should also make sure that an exception actually happened, so that a ref to `cleanup` is kept.

For me `__del__` is predictable and I just want to understand the bug.

-
Thanks for explaining, I understand your objective better now.
-
Too bad that the tracebacks we have on Prod don't show what calls `__del__`. But the log added by Kirill shows that it's done by the LongRequestLogger thread. I see nothing there about `request._held` or `request.__dict__`, but maybe manipulating frames is enough.

Quick check:

```python
import sys, threading, time

class A(object):
    def __del__(self):
        print('__del__')

def f():
    x = A()
    time.sleep(.2)
    #del x

def g():
    f()
    print('stopped')

g()
t = threading.Thread(target=g)
t.start()
time.sleep(.1)
x = sys._current_frames()[t.ident]
t.join()
print('end')
del x
```

displays:

```
__del__
stopped
stopped
end
__del__
```

and if the `del x` statement is uncommented:

```
__del__
stopped
__del__
stopped
end
```
I see only 1 place where frames could keep a reference to the cleanup object: `BaseRequest._hold`. Actually, Kirill's patch handles `ZApplication.__bobo_traverse__` but misses callees. The remaining problem with this hypothesis is that the logger should not attempt to dump a request so early.

Maybe Kirill has more information, so I'll wait until he's back.
@klaus @Tyagov Meanwhile, the patch I suggest above should work, and you can either apply it in this form or nexedi/erp5@879a8a62. Alternatives are:

- if you're curious, add `del object` at the end of `BaseRequest._hold` (in addition to Kirill's patch)
- disable the long request logger
-
> - if you're curious, add `del object` at the end of `BaseRequest._hold` (in addition to Kirill's patch)

That's actually a good idea for 2 reasons:

- it may confirm my analysis
- we're sure there's no regression with such a change

About nexedi/erp5@879a8a62, a test result completed at https://nexedi.erp5.net/test_result_module/20180821-55EF02 (apparently successfully; it's never obvious with some functional tests).
-
Maintainer
@jm, @vpelletier, nexedi/erp5@bc120f55 indeed seems to be suspicious: at first glance, `my_self.REQUEST.clear()` or `my_self.REQUEST.close()` seems to be missing after we hand over held objects to the original `self.REQUEST`. This way, e.g. after `self.REQUEST` is explicitly closed by ZPublisher, and if say `myself` were still held by a cycle, it could be eventually released at that future GC run garbage-collecting that cycle.

@jm, I suggest not to guess and simply add to `BaseRequest.clear` a verification of the reference count of the released Cleanup object. And if it is not what is expected - print information about who is additionally holding the object, via `gc.get_referrers`. This way, next time the bug happens, we'll get the information about who is holding Cleanup automatically. That would help to write a test case for the bug.
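A hedged sketch of that diagnostic (names are hypothetical; the real check would live in `BaseRequest.clear`, and the expected baseline depends on the call site):

```python
import gc, sys

def report_referrers(obj, expected=3):
    # sys.getrefcount also counts its own argument and the `obj` parameter,
    # so with a single outside holder the expected value here is 3
    n = sys.getrefcount(obj)
    if n > expected:
        for ref in gc.get_referrers(obj):
            print('still held by %r' % (ref,))
```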
I think the LongRequestLogger is unrelated - it is only the place where the user-visible bug happens, but the cause of the bug happened before, and was that transaction.abort() was incorrectly called from a wrong context. In other words, the LongRequestLogger traceback is only a consequence of the bug.

> The second paragraph of this commit message is wrong.

I still think it is right, as the following testcase demonstrates, where `__del__` is called by thread T2, because the object cycle that references the finalized object is garbage-collected by that second thread, not by the original T1 where the finalized object was created:

```python
#!/usr/bin/env python
# demonstrate that __del__ might be called from whatever thread causes cyclic GC
from __future__ import print_function

from golang import go, chan
from six.moves._thread import get_ident as gettid
import gc, sys

# class without finalizer (objects with finalizers, if cycled, are not collected)
class X:
    pass

# call f on __del__
class DelCall:
    def __del__(self):
        print('__del__ called under thread', gettid())

# mkgarbagecycle creates an object cycle, so that only cyclic GC can free it.
def mkgarbagecycle():
    # cycle
    a, b = X(), X()
    a.b, b.a = b, a
    return a

def main():
    ready = chan()  # t1 -> t2
    done  = chan()  # t1 <- t2

    # second thread
    def T2():
        ready.recv()  # wait till first thread has done its work
        print('T2:', gettid())
        # perform GC.
        # this simulates a possible scenario as if cyclic GC was automatically
        # triggered by the T2 thread.
        gc.collect()
        done.close()
    go(T2)

    # first thread
    print(sys.version)
    print('T1:', gettid())

    # disable automatic GC; now only objects whose reference count explicitly
    # goes to 0 are deallocated. objects which are kept in cycles stay alive.
    # this simulates a possible scenario where cyclic GC would be triggered by
    # the T2 thread.
    gc.disable()

    # create cyclic garbage referencing an object with __del__
    a = mkgarbagecycle()
    a.held = DelCall()
    del a

    ready.close()
    done.recv()

if __name__ == '__main__':
    main()
```
```
(py2.venv) kirr@deco:~/tmp/trashme/gcdel$ ./gcdel.py
2.7.15 (default, Jul 28 2018, 11:29:29)
[GCC 8.1.0]
T1: 139630321579840
T2: 139630308894464
__del__ called under thread 139630308894464
```
```
(py3.venv) kirr@deco:~/tmp/trashme/gcdel$ ./gcdel.py
3.6.6 (default, Jun 27 2018, 14:44:17)
[GCC 8.1.0]
T1: 140348348168000
T2: 140348332013312
__del__ called under thread 140348332013312
```
-
> nexedi/erp5@bc120f55 indeed seems to be suspicious: ...

But unrelated to the current issue.

> In other words the LongRequestLogger traceback is only a consequence of the bug.

No. It could really be what triggers the bug, by keeping a ref to the Cleanup instance.

> I still think it is right, as the following testcase demonstrates, where `__del__` is called by thread T2, because the object cycle that references the finalized object is garbage-collected by that second thread, not by the original T1 where the finalized object was created:

I already know what your testcase shows. And I misread the second paragraph of the commit message.

Your hypothesis is rather that a cycle is formed by an exception. But that's valid only if it happens in `ZApplication.__bobo_traverse__` (or a callee) where the `cleanup` variable is defined. I don't see what could raise an exception. An example ?

@klaus @Tyagov Do not wait for this discussion to end. First disable the long request logger on activity nodes. This is anyway a good idea, because this logger is currently incompatible with CMFActivity, in that it produces useless logs.
-
Maintainer
@jm, I don't have a ready answer, but I already said what I would do to automatically determine who is holding the additional reference on Cleanup. Maybe it is due to an exception, maybe not. Maybe the LongRequestLogger is related (though I do not see how after a quick check; maybe my fault), but the usefulness of the LongRequestLogger is high, especially on activity nodes. Thus I would suggest not to disable it blindly without a clear understanding of how/if it is faulty and whether we can fix it.
-
> the usefulness of the LongRequestLogger is high, especially on activity nodes

Sure, a working LongRequestLogger would be useful on activity nodes. But what happens currently is that `process_timer` processes activities in a loop as long as there are activities to process, whereas LongRequestLogger starts the timeout at the beginning of `process_timer`, rather than at the beginning of each activity. So as soon as you have more than a few validated activities, LongRequestLogger can dump a lot even if each activity takes less than the configured timeout.

Does someone really look at long request logs of activity nodes ? I'd like to, but: these are currently huge, with much noise, and there is no way to make stats from them. On another project, it was the first cause of downtime, by filling the disk.

> Thus I would suggest not to disable it blindly without a clear understanding of how/if it is faulty and whether we can fix it.

But we have logs reporting that it happens ~10 times per day. Applying any of my suggestions won't give all the answers, but we'd still learn something. They're safe, quick to test, and I have good hope they work.

So what is the mysterious ref to the Cleanup instance, even after the unconditional call to `BaseRequest.close` in `publish_module_standard` ? I hoped you might have answers from a previous debugging session, but no. If you want to add debug information to `BaseRequest.clear`, go on.

-
Maintainer
Julien, I'm assuming the assignment for this issue was passed to you; that's why I'm not debugging it further myself. Please let me know if that is not the case.
-
> Your hypothesis is rather that a cycle is formed by an exception.

Is a stack trace any different from an exception for this purpose ?

At a short glance at DeadlockDebugger/dumper.py, I see that `frame` does not get deleted before `dump_threads` returns. Values extracted from non-`DeadlockDebugger` threads are kept alive as well: `co`, `code` and `request` (!). If `DeadlockDebugger`'s own thread is the last one returned by `_current_frames`, then it will cause a cycle with itself and keep the above variables from the last non-`DeadlockDebugger` thread alive.

Off-topic: it's somewhat worrying to realise that `DeadlockDebugger`'s password matching is done by regex. I'm certain this is vulnerable to timing attacks. We normally do not expose zope root externally, so it should not be critical, but it's bad practice nevertheless.
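(For reference, the standard mitigation is a constant-time comparison; `hmac.compare_digest` has been in the stdlib since Python 2.7.7 - a sketch, not DeadlockDebugger code:)

```python
import hmac

def password_ok(expected, supplied):
    # compares all bytes regardless of where the first mismatch is, so the
    # response time does not leak the length of the matching prefix
    return hmac.compare_digest(expected, supplied)
```

-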
I agree with JM on his quick approach of disabling long request logging. With some luck this can give us a quick and cheap hint of what is going on. @klaus, I suggest we disable it for 2-3 weeks and monitor. Still, I think it is not the activity nodes which are affected (in the Woelfel SHM* case) but the "web" family of nodes.
-
@vpelletier, generally I agree with you. My point is to do it only for 2-3 weeks, gather information, make conclusions and bring it back (and hopefully fix it).

If, for whatever business reasons which Klaus can comment on, this is not an option, then well, no option...
-
> Still, I think it is not the activity nodes which are affected (in the Woelfel SHM* case) but the "web" family of nodes.
They are, but it's true it's a little too rare. For example, this night:
```
/srv/slapgrid/slappart5/srv/runner/instance/slappart5/.slappart5_zope-4.log:E: <App.ZApplication.Cleanup instance at 0x7f545ca9bb90>: created from thread 140000710526720 @Tue Aug 28 01:05:46 2018; __del__ runs from thread 140000719443712 @Tue Aug 28 01:05:50 2018
```
-
We disabled LongRequestLogger on the activity nodes of the mic_wind and mic_construction production servers today ([forum post for reference](https://www.erp5.com/project_section/woelfel-monitoring-project/forum/Mic-wind-upgrade---deactivate-the-long-request-logger-for-activity-nodes-XCcymCzmqK)).
-
Maintainer
/cc @levin.zimmermann, @jerome