Fix assertion failure when retrying to migrate a workflow history
When a first migration was reverted for any reason (e.g. ConflictError), the assertion failed because __setstate__ had not been called yet to fix the class back to the old one.
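For illustration, here is a sketch of the resulting migration walk. It mirrors the shape of the patch quoted later in this thread and reuses its names (`WorkflowHistoryList`, `NewWorkflowHistoryList`, the `stack`/`bucket` loop); it is a sketch, not necessarily the exact code of this commit:

```python
def _migrate_buckets(stack, bucket):
    # Sketch only: walk the legacy buckets and swap their class to the
    # new one, tolerating a previously reverted migration attempt.
    while bucket is not None:
        stack.append(bucket)
        # Load the bucket's pickled state first: __setstate__ restores
        # the legacy class if an aborted migration left the new class
        # on the in-memory (ghost) object.
        bucket._p_activate()
        assert bucket.__class__ is WorkflowHistoryList, bucket.__class__
        bucket.__class__ = NewWorkflowHistoryList
        bucket = bucket._prev
```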
-
Owner
@all Due to a bug in ZSQLCatalog, the assertion failure can turn into data corruption, so this commit should be backported to all instances using ERP5 >= 2019-10-04 (the regression was in 1ca655a3).
An attempt to fix ZSQLCatalog (partially) was tried with:
```diff
--- a/product/ZSQLCatalog/SQLCatalog.py
+++ b/product/ZSQLCatalog/SQLCatalog.py
@@ -265,6 +265,12 @@ def __getitem__(self, index):
       except ConflictError:
         raise
       except Exception:
+        # XXX: Better than nothing. It is too dangerous to ignore errors
+        #      if there are changes to the ZODB. We already had cases of
+        #      data corruption.
+        conn = getattr(document, '_p_jar', None)
+        if conn is not None and not conn._needs_to_join:
+          raise
         LOG('SQLCatalog', WARNING,
             'Failed to call method %s on %r' % (attribute, document),
             error=True,
```
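The guard relies on the fact that a ZODB connection only joins the current transaction once one of its objects has been modified: `_needs_to_join` (a private attribute of `ZODB.Connection`) flips to `False` at that point, so `not conn._needs_to_join` means uncommitted ZODB changes already exist. A minimal standalone demonstration of that behavior:

```python
import transaction
from ZODB import DB
from ZODB.DemoStorage import DemoStorage

db = DB(DemoStorage())
conn = db.open()
print(conn._needs_to_join)  # True: nothing modified yet in this transaction
conn.root()['x'] = 1        # first change: the connection joins the transaction
print(conn._needs_to_join)  # False: there are now pending ZODB changes
transaction.abort()
```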
Unfortunately, as reported by https://nexedi.erp5.net/test_result_module/20200729-24F92F7A/ ("TypeError: getAcquiredCategoryList() takes exactly 2 arguments (1 given)" everywhere), this would need more work.
-
Owner
What does the data corruption look like? Do workflow state getters return wrong results (another state)? Is any error raised?
> "TypeError: getAcquiredCategoryList() takes exactly 2 arguments (1 given)" everywhere
Yes, `def getAcquiredCategoryList(self)` in `Base` and `def getAcquiredCategoryList(self, context)` in `CategoryTool`. We already fixed several cases like this, where two methods have the same name but different signatures and different semantics. pylint detects these problems, so hopefully we are no longer introducing new ones.
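A minimal illustration (hypothetical, stripped-down classes) of how such a clash produces the error seen in the test results:

```python
class Base(object):
    def getAcquiredCategoryList(self):
        return []

class CategoryTool(object):
    def getAcquiredCategoryList(self, context):
        return []

# The catalog calls the accessor with no extra argument; when name lookup
# resolves to the CategoryTool variant instead, the call fails (Python 2):
CategoryTool().getAcquiredCategoryList()
# TypeError: getAcquiredCategoryList() takes exactly 2 arguments (1 given)
```
-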
Owner
> What does the data corruption look like?

And is there a known way to fix documents where it already happened?
BTW, this fix (beyond the issue found in tests) will only work for single-ZODB instances, as documents from multiple connectors may be indexed. If this fix direction is kept, it may need to rather introspect the current transaction, find the registered ZODB connections, and check them all. I know this is not the way we want new instances to be, but I believe we still maintain instances which do use multiple ZODBs.
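A sketch of that introspection (it reads `Transaction._resources`, a private attribute of the transaction package holding the joined resource managers, so this is illustrative only):

```python
import transaction
from ZODB.Connection import Connection

def zodb_has_pending_changes():
    # A ZODB connection joins the transaction as a resource manager as
    # soon as one of its objects is modified, so scanning the joined
    # resources covers every mounted ZODB, not just the document's own.
    return any(isinstance(rm, Connection)
               for rm in transaction.get()._resources)
```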
And more broadly, if we have the opportunity, I would like the effort to go in the direction of getting rid of this broad try..except altogether: "silently" skipping part of the indexation of a document (I do not believe anyone notices these logs) can be worse than a visible failure, because catalog query results then become inconsistent with the ZODB content, and in turn code running these queries will make inconsistent decisions. This means we need to locate and fix these API inconsistencies; while they can be hard to fix, they should at least be very easy to detect.
-
Owner
For information, here is the summary Rafael gave me after the bug was discovered in SlapOS master:
Rafael: basically workflow history got corrupted by a bug
Tomo: in zodb ?
Rafael: yes

in short, there is a conflict while migrating the workflow history to the new format:

```
2020-07-06 22:33:33,665 WARNING SQLDict Exception raised when invoking messages (uid, path, method_id) [(4055295444616983670L, ('', 'erp5', 'software_instance_module', '20180214-2F346902'), 'immediateReindexObject'), (4055295444616983671L, ('', 'erp5', 'software_instance_module', '20180213-24B5A97F'), 'immediateReindexObject')]
Traceback (most recent call last):
  File "/srv/slapgrid/slappart0/srv/runner/software/28960713fbbf4df63d84897ae0c6eb13/parts/erp5/product/CMFActivity/Activity/SQLBase.py", line 666, in dequeueMessage
    transaction.commit()
  File "/srv/slapgrid/slappart0/srv/runner/software/28960713fbbf4df63d84897ae0c6eb13/eggs/transaction-1.7.0-py2.7.egg/transaction/_manager.py", line 123, in commit
    return self.get().commit()
  File "/srv/slapgrid/slappart0/srv/runner/software/28960713fbbf4df63d84897ae0c6eb13/eggs/transaction-1.7.0-py2.7.egg/transaction/_transaction.py", line 280, in commit
    reraise(t, v, tb)
  File "/srv/slapgrid/slappart0/srv/runner/software/28960713fbbf4df63d84897ae0c6eb13/eggs/transaction-1.7.0-py2.7.egg/transaction/_transaction.py", line 271, in commit
    self._commitResources()
  File "/srv/slapgrid/slappart0/srv/runner/software/28960713fbbf4df63d84897ae0c6eb13/eggs/Products.TIDStorage-5.5.0-py2.7.egg/Products/TIDStorage/transaction_transaction.py", line 251, in _commitResources
    result = original__commitResources(self, *args, **kw)
  File "/srv/slapgrid/slappart0/srv/runner/software/28960713fbbf4df63d84897ae0c6eb13/eggs/transaction-1.7.0-py2.7.egg/transaction/_transaction.py", line 416, in _commitResources
    reraise(t, v, tb)
  File "/srv/slapgrid/slappart0/srv/runner/software/28960713fbbf4df63d84897ae0c6eb13/eggs/transaction-1.7.0-py2.7.egg/transaction/_transaction.py", line 393, in _commitResources
    rm.tpc_vote(self)
  File "/srv/slapgrid/slappart0/srv/runner/software/28960713fbbf4df63d84897ae0c6eb13/eggs/ZODB-4.4.5-py2.7.egg/ZODB/Connection.py", line 797, in tpc_vote
    s = vote(transaction)
  File "/srv/slapgrid/slappart0/srv/runner/software/28960713fbbf4df63d84897ae0c6eb13/eggs/ZEO-4.3.1+slapospatched001-py2.7.egg/ZEO/ClientStorage.py", line 1072, in tpc_vote
    return self._check_serials()
  File "/srv/slapgrid/slappart0/srv/runner/software/28960713fbbf4df63d84897ae0c6eb13/eggs/ZEO-4.3.1+slapospatched001-py2.7.egg/ZEO/ClientStorage.py", line 902, in _check_serials
    raise s
ConflictError: database conflict error (oid 0x09cb6ceb, class Products.ERP5Type.Workflow.WorkflowHistoryList, serial this txn started with 0x03d913a98404f0bb 2020-07-06 22:33:30.942023, serial currently committed 0x03d913a98e78f833 2020-07-06 22:33:33.392001)
```

while cataloging the document, and it was caused by this warning:

```
2020-07-06 22:33:33,944 WARNING SQLCatalog Failed to call method getCreationDate on <Products.ERP5Catalog.CatalogTool.IndexableObjectWrapper for /erp5/software_instance_module/20180214-2F346902>
Traceback (most recent call last):
  File "/srv/slapgrid/slappart0/srv/runner/software/28960713fbbf4df63d84897ae0c6eb13/parts/erp5/product/ZSQLCatalog/SQLCatalog.py", line 264, in __getitem__
    value = value()
  File "/srv/slapgrid/slappart0/srv/runner/software/28960713fbbf4df63d84897ae0c6eb13/parts/erp5/product/ERP5Type/Base.py", line 3236, in getCreationDate
    return history_list['edit_workflow'][0]['time']
  File "/srv/slapgrid/slappart0/srv/runner/software/28960713fbbf4df63d84897ae0c6eb13/parts/erp5/product/ERP5Type/ConflictFree.py", line 104, in __getitem__
    return next(islice(self, start, None))
  File "/srv/slapgrid/slappart0/srv/runner/software/28960713fbbf4df63d84897ae0c6eb13/parts/erp5/product/ERP5Type/ConflictFree.py", line 57, in __iter__
    bucket = self._next
  File "/srv/slapgrid/slappart0/srv/runner/software/28960713fbbf4df63d84897ae0c6eb13/parts/erp5/product/ERP5Type/patches/WorkflowTool.py", line 745, in _next
    self._migrate()
  File "/srv/slapgrid/slappart0/srv/runner/software/28960713fbbf4df63d84897ae0c6eb13/parts/erp5/product/ERP5Type/patches/WorkflowTool.py", line 762, in _migrate
    assert bucket.__class__ is WorkflowHistoryList, bucket.__class__
AssertionError: <class 'Products.ERP5Type.Workflow.WorkflowHistoryList'>
```

in general this is a warning and not an error, so the zodb isn't rolled back on this thread, causing the migration to fail and corrupting the history. so it only happens when it conflicts in this way. the fix is: https://lab.nexedi.com/nexedi/erp5/commit/e57dd4fb090387ad546ee9f325cb9c9c0b841331

for the next documents, I followed up on the suggestion of julien and recreated the corrupted workflow histories, first using "Base_detectAndFixWorkflowHistory" (which I added on production). and considering I wouldn't know which ones were corrupted or not (consider that the bug only shows up when a buffer is filled), it is a lazy problem: some problems are from 7/07 and were only raised now, so I wrote this:

```diff
root@sd-148139:/srv/slapgrid/slappart0/srv/runner/instance/slappart4/software_release/parts/erp5/product# git diff
diff --git a/product/ERP5Type/patches/WorkflowTool.py b/product/ERP5Type/patches/WorkflowTool.py
index 366c57107a..b91847b125 100644
--- a/product/ERP5Type/patches/WorkflowTool.py
+++ b/product/ERP5Type/patches/WorkflowTool.py
@@ -759,6 +759,7 @@ class WorkflowHistoryList(NewWorkflowHistoryList):
     stack = [self]
     while True:
       stack.append(bucket)
+      bucket._p_activate()
       assert bucket.__class__ is WorkflowHistoryList, bucket.__class__
       bucket.__class__ = NewWorkflowHistoryList
       bucket = bucket._prev
@@ -792,6 +793,14 @@ def WorkflowTool_setStatusOf(self, wf_id, ob, status):
       if wfh is not None and not isinstance(wfh, NewWorkflowHistoryList):
         wfh = NewWorkflowHistoryList(wfh)
         ob.workflow_history[wf_id] = wfh
+      # Hackish fixup
+      elif wfh is not None and isinstance(wfh, NewWorkflowHistoryList):
+        if wfh._prev is not None and wfh._next is None:
+          wfh = list(reversed(wfh))
+          wfh.reverse()
+          wfh = NewWorkflowHistoryList(wfh)
+          ob.workflow_history[wf_id] = wfh
+          # LOG('WorkflowTool', WARNING, 'Bad workflow migrated %s %s' % (wf_id, ob.getUid())
   if wfh is None:
     wfh = NewWorkflowHistoryList()
   if not has_history:
```

in general, I fix it before it gets appended, and then the problem is gone
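`Base_detectAndFixWorkflowHistory` itself is not in this repository (as noted further down in this thread), so the following is only a guessed sketch of the kind of detection it could perform, built from the condition used in the "Hackish fixup" above: a new-format history whose head still has a `_prev` bucket but no `_next` was left half-migrated.

```python
# Guessed sketch, not the actual Base_detectAndFixWorkflowHistory script.
# Assumes the class names used in ERP5Type/patches/WorkflowTool.py.
from Products.ERP5Type.patches.WorkflowTool import NewWorkflowHistoryList

def listSuspectWorkflowHistories(document):
    """Return the workflow ids whose history matches the fixup condition."""
    suspects = []
    for wf_id, wfh in getattr(document, 'workflow_history', {}).items():
        if (isinstance(wfh, NewWorkflowHistoryList)
            and wfh._prev is not None and wfh._next is None):
            suspects.append(wf_id)
    return suspects
```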
-
Owner
> ConflictError: ... while cataloging the document, and it was caused by this warning
No. The explanation is wrong.
First, there was a normal migration and the transaction got aborted for a reason that does not matter (the fact that at this moment, it was a ConflictError during reindexing is irrelevant).
Then, while reindexing, there was a second attempt to migrate the workflow history, and that's where things started to go wrong, because on transaction abort, ZODB.Connection did not restore the class as it is actually stored (more generally, as written in `WorkflowHistoryList.__setstate__`, ZODB.Connection never fixes the class when loading an object). This led to an assertion failure (unexpected new class) while the workflow history was only partially migrated. By ignoring the exception, ZSQLCatalog forced the transaction to commit with data that are neither in the old format nor in the new one (somewhere in between, and not usable by the code).
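This ZODB behavior can be demonstrated outside ERP5 with plain ZODB (hypothetical `Old`/`New` classes, throwaway in-memory storage): a class swapped in memory survives both the abort and the subsequent reload of the state.

```python
import transaction
from persistent import Persistent
from ZODB import DB
from ZODB.DemoStorage import DemoStorage

class Old(Persistent):
    pass

class New(Old):
    pass

conn = DB(DemoStorage()).open()
conn.root()['x'] = obj = Old()
transaction.commit()

obj.__class__ = New      # in-memory "migration" of the class
obj._p_changed = True    # make sure the object is registered as modified
transaction.abort()      # its state is invalidated (object becomes a ghost)...
obj._p_activate()        # ...and reloaded from the database on next access,
print(obj.__class__)     # ...yet the class is still New, not Old
```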
> (consider that the bug only shows up when a buffer is filled)
( s/buffer/bucket/ )
Not only. This corruption also causes bugs immediately without raising exceptions (e.g. wrong `__len__`).
-
Owner
Thanks Julien for fixing up the explanation.
-
Owner
> `Base_detectAndFixWorkflowHistory`
This script is not available in this repository.
With the new posts, I believe the answers to the questions Jérôme and I asked are:
> What does the data corruption look like? Do workflow state getters return wrong results (another state)? Is any error raised?
`AssertionError` gets raised from a type-checking statement, at least when migration is retried. But is it also the case on future non-migration calls, if there are such things? Or will it always trigger a new migration attempt, which would always detect the issue and react with such an `AssertionError`?

Also, if this migration is triggered when just reading documents (e.g. calling `getCreationDate` on a document), then it means that not all corrupted documents are visible in the catalog. For those which are visible in the catalog, it would mean they have `NULL` in the `creation_date`, `modification_date`, and `*_state` columns.

It would be nice to have a clear confirmation of this, rather than being left to guess what we should be looking for.
> And is there a known way to fix documents where it already happened?
Yes, but it is not published yet. The closest we have so far is the "Hackish fixup" from Thomas' quote of Rafael's message.
It would be nice to share the tools in case others uncover other occurrences of the same issue, rather than leaving them to duplicate debugging work.
Is this an accurate summary of the current situation?
-
Owner
There's little to see in the `WorkflowHistoryList` class and its `__setstate__` method:
- if the workflow history is migrated, there isn't anything overloading `DoublyLinkList` to trigger automatic migration;
- else a legacy class is used, with a few overloading attributes that migrate automatically if there's any benefit in doing so. The type-checking assertion is in the migration code.
The involved code is small. The important lines are highlighted. The immediate consequences are described. Further consequences can be varied, and you have enough information to guess them as well as I can.
I also didn't think about ways to find all corrupted workflow histories.
The issue only affects workflow histories with more than 16 entries.
> Also, if this migration is triggered when just reading documents (e.g. calling `getCreationDate` on a document), then it means that not all corrupted documents are visible in the catalog. For those which are visible in the catalog, it would mean they have `NULL` in the `creation_date`, `modification_date`, and `*_state` columns.

Not sure all columns would be `NULL`. Maybe only the first computed one. For the other columns, the workflow history is only using the new code, and because the data aren't completely migrated, it can either return correct values (e.g. `...[-1]`) or wrong values (e.g. `...[0]`). I don't see further exceptions on read operations (only when adding items).