storeblk() vs DB lock deadlock
The following was observed on a project: a Zope thread was stuck for a very long time while memory usage kept growing. This is probably similar to the loadblk() vs DB lock deadlock we already solved in f49c11a3. Details:
In slappart9/zope-0 we currently have:
.../slappart9/var/log$ grep -v '^$' longrequest_logger_zope-0.log |grep -o 'Thread .*:' |sort |uniq
Thread 140164638697216:
Thread 140164647089920:
Thread 140164991964928:
Thread 140165000357632:
-> 4 long-running threads. Here is the latest info about them:
.../slappart9/var/log$ grep -v '^$' longrequest_logger_zope-0.log |tail -4
2016-10-21 09:24:44,515 - Thread 140165000357632: Started on 1476162420.8; Running for 879463.7 secs; Same.
2016-10-21 09:24:44,515 - Thread 140164647089920: Started on 1476162461.5; Running for 879423.0 secs; Same.
2016-10-21 09:24:44,516 - Thread 140164991964928: Started on 1476162461.5; Running for 879423.0 secs; Same.
2016-10-21 09:24:44,516 - Thread 140164638697216: Started on 1476162461.7; Running for 879422.8 secs; Same.
So yes, "Running for 879463.7 secs" means the thread has been stuck for ~244 hours (wall-clock), and "Started on 1476162420.8" means it was started on:
In [1]: import time
In [2]: time.ctime(1476162420.8)
Out[2]: 'Tue Oct 11 08:07:00 2016'
-> On Oct 11 morning.
I then went to slappart9/srv/backup/logrotate and traced those threads back to their origins in longrequest_logger_zope-0.log-*.gz (verifying along the way that nothing else long-running started or stopped in between). Here is the original information from when they got stuck:
2016-10-11 05:07:41,917 - Thread 140165000357632: Started on 1476162420.8; Running for 41.1 secs; request: GET http://:/Control_Panel/timer_service/process_timer
retry count: 0
form: {}
other: {'ACTUAL_URL': 'http://:/Control_Panel/timer_service/process_timer',
'AUTHENTICATED_USER': <SpecialUser 'Anonymous User'>,
'AUTHENTICATION_PATH': '',
'AcceptLanguage': {},
'PARENTS': [<Activity Tool at /erp5/portal_activities>, <ERP5Site at /erp5>],
'PUBLISHED': <bound method TimerService.process_timer of <TimerService at /timer_service used for /Control_Panel>>,
'RESPONSE': TimerResponse(''),
'SERVER_URL': 'http://:',
'TraversalRequestNameStack': [],
'URL': 'http://:/Control_Panel/timer_service/process_timer',
'interval': 1,
'method': 'GET'}
Traceback:
File ".../eggs/Zope2-2.13.24-py2.7.egg/ZServer/PubCore/ZServerPublisher.py", line 31, in __init__
response=b)
File ".../eggs/Zope2-2.13.24-py2.7.egg/ZPublisher/Publish.py", line 455, in publish_module
environ, debug, request, response)
File ".../eggs/Products.LongRequestLogger-2.0.0-py2.7.egg/Products/LongRequestLogger/__init__.py", line 19, in publish_module_standard
return publish_module_standard.original(*args, **kw)
File ".../eggs/Zope2-2.13.24-py2.7.egg/ZPublisher/Publish.py", line 249, in publish_module_standard
response = publish(request, module_name, after_list, debug=debug)
File ".../parts/erp5/product/Localizer/patches.py", line 84, in new_publish
x = zope_publish(request, module_name, after_list, debug)
File ".../eggs/Zope2-2.13.24-py2.7.egg/ZPublisher/Publish.py", line 138, in publish
request, bind=1)
File ".../eggs/Zope2-2.13.24-py2.7.egg/ZPublisher/mapply.py", line 77, in mapply
if debug is not None: return debug(object,args,context)
File ".../eggs/Zope2-2.13.24-py2.7.egg/ZPublisher/Publish.py", line 48, in call_object
result=apply(object,args) # Type s<cr> to step into published object.
File ".../parts/erp5/product/TimerService/TimerService.py", line 102, in process_timer
DateTime(prev_tick), DateTime(next_tick))
File ".../parts/erp5/product/CMFActivity/ActivityTool.py", line 1040, in process_timer
self.tic(processing_node_list.index(currentNode) + 1)
File ".../parts/erp5/product/CMFActivity/ActivityTool.py", line 1095, in tic
if not last.dequeueMessage(inner_self, processing_node):
File ".../parts/erp5/product/CMFActivity/Activity/SQLBase.py", line 509, in dequeueMessage
transaction.commit()
File ".../eggs/transaction-1.1.1-py2.7.egg/transaction/_manager.py", line 89, in commit
return self.get().commit()
File ".../eggs/transaction-1.1.1-py2.7.egg/transaction/_transaction.py", line 329, in commit
self._commitResources()
File ".../eggs/Products.TIDStorage-5.4.9-py2.7.egg/Products/TIDStorage/transaction_transaction.py", line 261, in _commitResources
result = original__commitResources(self, *args, **kw)
File ".../eggs/transaction-1.1.1-py2.7.egg/transaction/_transaction.py", line 443, in _commitResources
rm.commit(self)
File ".../develop-eggs/wendelin.core-0.8-py2.7-linux-x86_64.egg/wendelin/bigfile/file_zodb.py", line 716, in commit
self.zfileh.dirty_writeout(WRITEOUT_STORE)
File ".../develop-eggs/wendelin.core-0.8-py2.7-linux-x86_64.egg/wendelin/bigfile/file_zodb.py", line 406, in storeblk
def storeblk(self, blk, buf): return self.zself.storeblk(blk, buf)
File ".../develop-eggs/wendelin.core-0.8-py2.7-linux-x86_64.egg/wendelin/bigfile/file_zodb.py", line 496, in storeblk
blkchanged = zblk.setblkdata(buf)
File ".../develop-eggs/wendelin.core-0.8-py2.7-linux-x86_64.egg/wendelin/bigfile/file_zodb.py", line 311, in setblkdata
if (start % CHUNKSIZE) or len(chunk.data) > CHUNKSIZE:
File ".../eggs/ZODB3-3.10.7-py2.7-linux-x86_64.egg/ZODB/Connection.py", line 860, in setstate
self._setstate(obj)
File ".../eggs/ZODB3-3.10.7-py2.7-linux-x86_64.egg/ZODB/Connection.py", line 901, in _setstate
p, serial = self._storage.load(obj._p_oid, '')
File ".../eggs/ZODB3-3.10.7-py2.7-linux-x86_64.egg/ZEO/ClientStorage.py", line 833, in load
data, tid = self._server.loadEx(oid)
File ".../eggs/ZODB3-3.10.7-py2.7-linux-x86_64.egg/ZEO/ServerStub.py", line 176, in loadEx
return self.rpc.call("loadEx", oid)
File ".../eggs/ZODB3-3.10.7-py2.7-linux-x86_64.egg/ZEO/zrpc/connection.py", line 763, in call
r_args = self.wait(msgid)
File ".../eggs/ZODB3-3.10.7-py2.7-linux-x86_64.egg/ZEO/zrpc/connection.py", line 791, in wait
self.replies_cond.wait()
File ".../parts/python2.7/lib/python2.7/threading.py", line 340, in wait
waiter.acquire()
( 140165000357632 appears to be an activity-executor thread triggered via /timer_service/process_timer; an activity involving wendelin.core was run, followed by a commit, which involves wendelin.core's writeout. During writeout wendelin.core wants to store an object to ZEO, which in turn triggers another load from ZEO, and that load is stuck forever for some reason. )
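In code terms, the writeout side of this can be sketched roughly as follows. This is a minimal sketch with illustrative names (dirty_writeout, storeblk, virtmem_lock stand in for the real wendelin.core machinery, not its actual code); the point is only that the lock is held across every store, and each store may itself block on a ZEO load:

```python
import threading

virtmem_lock = threading.Lock()  # guards virtual-memory bookkeeping

def dirty_writeout(dirty_blocks, storeblk):
    """Write all dirty blocks back to the DB, under the virtmem lock."""
    with virtmem_lock:               # held for the WHOLE writeout
        for blk, buf in dirty_blocks:
            storeblk(blk, buf)       # may block on a ZEO load -> deadlock risk
```

If any storeblk() call blocks, virtmem_lock stays held indefinitely, which is what the remaining threads below run into.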
2016-10-11 05:07:45,739 - Thread 140164638697216: Started on 1476162461.7; Running for 4.0 secs; [No request]
Traceback:
File ".../eggs/Zope2-2.13.24-py2.7.egg/ZServer/PubCore/ZServerPublisher.py", line 31, in __init__
response=b)
File ".../eggs/Zope2-2.13.24-py2.7.egg/ZPublisher/Publish.py", line 455, in publish_module
environ, debug, request, response)
File ".../eggs/Products.LongRequestLogger-2.0.0-py2.7.egg/Products/LongRequestLogger/__init__.py", line 19, in publish_module_standard
return publish_module_standard.original(*args, **kw)
File ".../eggs/Zope2-2.13.24-py2.7.egg/ZPublisher/Publish.py", line 249, in publish_module_standard
response = publish(request, module_name, after_list, debug=debug)
File ".../parts/erp5/product/Localizer/patches.py", line 84, in new_publish
x = zope_publish(request, module_name, after_list, debug)
File ".../eggs/Zope2-2.13.24-py2.7.egg/ZPublisher/Publish.py", line 127, in publish
object=request.traverse(path, validated_hook=validated_hook)
File ".../eggs/Zope2-2.13.24-py2.7.egg/ZPublisher/BaseRequest.py", line 413, in traverse
object=object.__bobo_traverse__(request)
File ".../eggs/Zope2-2.13.24-py2.7.egg/App/ZApplication.py", line 43, in __bobo_traverse__
conn = db.open()
File ".../eggs/ZODB3-3.10.7-py2.7-linux-x86_64.egg/ZODB/DB.py", line 725, in open
self._a()
File ".../parts/python2.7/lib/python2.7/threading.py", line 174, in acquire
rc = self.__block.acquire(blocking)
2016-10-11 05:07:45,537 - Thread 140164991964928: Started on 1476162461.5; Running for 4.0 secs; [No request]
Traceback:
File ".../eggs/Zope2-2.13.24-py2.7.egg/ZServer/PubCore/ZServerPublisher.py", line 31, in __init__
response=b)
File ".../eggs/Zope2-2.13.24-py2.7.egg/ZPublisher/Publish.py", line 455, in publish_module
environ, debug, request, response)
File ".../eggs/Products.LongRequestLogger-2.0.0-py2.7.egg/Products/LongRequestLogger/__init__.py", line 19, in publish_module_standard
return publish_module_standard.original(*args, **kw)
File ".../eggs/Zope2-2.13.24-py2.7.egg/ZPublisher/Publish.py", line 249, in publish_module_standard
response = publish(request, module_name, after_list, debug=debug)
File ".../parts/erp5/product/Localizer/patches.py", line 84, in new_publish
x = zope_publish(request, module_name, after_list, debug)
File ".../eggs/Zope2-2.13.24-py2.7.egg/ZPublisher/Publish.py", line 127, in publish
object=request.traverse(path, validated_hook=validated_hook)
File ".../eggs/Zope2-2.13.24-py2.7.egg/ZPublisher/BaseRequest.py", line 413, in traverse
object=object.__bobo_traverse__(request)
File ".../eggs/Zope2-2.13.24-py2.7.egg/App/ZApplication.py", line 43, in __bobo_traverse__
conn = db.open()
File ".../eggs/ZODB3-3.10.7-py2.7-linux-x86_64.egg/ZODB/DB.py", line 725, in open
self._a()
File ".../parts/python2.7/lib/python2.7/threading.py", line 174, in acquire
rc = self.__block.acquire(blocking)
( The above are just two standard Zope threads: they want to open the database first, but are stuck because the DB lock is already taken. )
2016-10-11 05:07:45,474 - Thread 140164647089920: Started on 1476162461.5; Running for 4.0 secs; [No request]
Traceback:
File ".../eggs/Zope2-2.13.24-py2.7.egg/ZServer/PubCore/ZServerPublisher.py", line 31, in __init__
response=b)
File ".../eggs/Zope2-2.13.24-py2.7.egg/ZPublisher/Publish.py", line 455, in publish_module
environ, debug, request, response)
File ".../eggs/Products.LongRequestLogger-2.0.0-py2.7.egg/Products/LongRequestLogger/__init__.py", line 19, in publish_module_standard
return publish_module_standard.original(*args, **kw)
File ".../eggs/Zope2-2.13.24-py2.7.egg/ZPublisher/Publish.py", line 249, in publish_module_standard
response = publish(request, module_name, after_list, debug=debug)
File ".../parts/erp5/product/Localizer/patches.py", line 84, in new_publish
x = zope_publish(request, module_name, after_list, debug)
File ".../eggs/Zope2-2.13.24-py2.7.egg/ZPublisher/Publish.py", line 127, in publish
object=request.traverse(path, validated_hook=validated_hook)
File ".../eggs/Zope2-2.13.24-py2.7.egg/ZPublisher/BaseRequest.py", line 413, in traverse
object=object.__bobo_traverse__(request)
File ".../eggs/Zope2-2.13.24-py2.7.egg/App/ZApplication.py", line 43, in __bobo_traverse__
conn = db.open()
File ".../eggs/ZODB3-3.10.7-py2.7-linux-x86_64.egg/ZODB/DB.py", line 751, in open
result.open(transaction_manager)
File ".../develop-eggs/wendelin.core-0.8-py2.7-linux-x86_64.egg/wendelin/bigfile/file_zodb.py", line 534, in Connection_open
orig_Connection_open(self, transaction_manager, delegate)
File ".../eggs/ZODB3-3.10.7-py2.7-linux-x86_64.egg/ZODB/Connection.py", line 1046, in open
self._flush_invalidations()
File ".../eggs/ZODB3-3.10.7-py2.7-linux-x86_64.egg/ZODB/Connection.py", line 539, in _flush_invalidations
self._cache.invalidate(invalidated)
File ".../develop-eggs/wendelin.core-0.8-py2.7-linux-x86_64.egg/wendelin/bigfile/file_zodb.py", line 152, in _p_invalidate
self._v_zfile.invalidateblk(self._v_blk)
File ".../develop-eggs/wendelin.core-0.8-py2.7-linux-x86_64.egg/wendelin/bigfile/file_zodb.py", line 505, in invalidateblk
fileh.invalidate_page(blk) # XXX assumes blksize == pagesize
File ".../develop-eggs/wendelin.core-0.8-py2.7-linux-x86_64.egg/wendelin/bigfile/file_zodb.py", line 635, in invalidate_page
return self.zfileh.invalidate_page(pgoffset)
( This seems to be another worker thread, in between transactions, which received invalidations from the DB; processing them triggers wendelin.core invalidations, which in turn want to acquire the virtmem lock and wait forever, because that lock is already held by the writeout code. )
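The invalidation side can be sketched similarly (again with illustrative names; the real path goes through Connection._flush_invalidations and ZBlk._p_invalidate as shown in the traceback above): a DB-side lock is already held when the virtmem lock is requested, i.e. the opposite acquisition order from the writeout:

```python
import threading

db_lock = threading.Lock()       # stand-in for the ZODB-side lock held while invalidating
virtmem_lock = threading.Lock()  # the same virtmem lock the writeout holds

def flush_invalidations(invalidated_blks, invalidate_page):
    """Drop mapped pages for invalidated blocks, from under a DB-side lock."""
    with db_lock:                          # taken by Connection/DB machinery
        for blk in invalidated_blks:
            with virtmem_lock:             # blocks while writeout is in progress
                invalidate_page(blk)
```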
All this looks similar to the deadlock we already solved between loadblk() and the ZEO lock (f49c11a3). Here we probably have some kind of storeblk() vs ZEO lock deadlock: I checked, and storeblk() is still run from under the virtmem lock. It is not clear off-hand, though, why the memory usage is growing.
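The suspected cycle can be modeled as the classic inversion: two locks taken in opposite order by two threads. A minimal self-contained demo (the lock names are only stand-ins for the virtmem and DB/ZEO-side locks; acquire timeouts and a barrier make the demo terminate deterministically, whereas in the real process both acquires block forever):

```python
import threading

virtmem_lock = threading.Lock()   # held across storeblk() during writeout
db_side_lock = threading.Lock()   # stand-in for the DB/ZEO-side lock

barrier = threading.Barrier(2)    # lockstep, so the interleaving is deterministic
blocked = []                      # which sides timed out waiting

def writeout():
    with virtmem_lock:                 # commit -> dirty_writeout -> storeblk()
        barrier.wait()                 # both threads now hold their first lock
        # the ZEO load that storeblk() needs is modeled as the DB-side lock
        if not db_side_lock.acquire(timeout=0.2):
            blocked.append('writeout')
        barrier.wait()                 # hold virtmem_lock until both have tried

def invalidation():
    with db_side_lock:                 # invalidation processing on the DB side
        barrier.wait()
        # invalidate_page() needs the virtmem lock held by the writeout
        if not virtmem_lock.acquire(timeout=0.2):
            blocked.append('invalidation')
        barrier.wait()

threads = [threading.Thread(target=writeout), threading.Thread(target=invalidation)]
for t in threads: t.start()
for t in threads: t.join()
print(sorted(blocked))             # -> ['invalidation', 'writeout']: both sides stuck
```

Without the timeouts this is exactly a permanent deadlock: each thread holds the lock the other needs.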