Commit 041642d0 authored by Vincent Pelletier's avatar Vincent Pelletier

Products.CMFActivity.ActivityTool: Improve behaviour on single-node instances.

- Ignore node preference when spawning activities.
  Otherwise, activities which are not spawned with a preferred node will
  get an effective priority penalty compared to same-priority activities
  spawned *with* a node preference, despite both being to execute by the
  same processing node.
- Break activity processing loop when the current processing node is also
  the activity validation node.
  This avoids pathological cases of activity accumulation, for example when
  reindexing an entire site: _recurseCallMethod is spawned in
  processing_node=0, but immediateReindexObject is spawned in
  processing_node=-1 because of serialization_tag dependency, so with such
  loop _recurseCallMethod will be executed over and over, piling indexation
  activities up until _recurseCallMethod does not self-respawn.
  In turn, such activity accumulation lead to an increased overhead, and
  decreased activity processing efficiency.
  This may also allow multi-node instances to more reliably use the
  validation node as a processing node.

The cost for multi-node instances of these changes should be absolutely
minimal (no extra IO necessary, minimal extra code).
A possible drawback on single-node instances is that tic period may become
more important because process_timer will return more often.
parent a55b0f78
Pipeline #18991 failed with stage
in 0 seconds
......@@ -1293,8 +1293,9 @@ class ActivityTool (BaseTool):
self.registerNode(currentNode)
processing_node_list = self.getNodeList(role=ROLE_PROCESSING)
is_distributing_node = self.getDistributingNode() == currentNode
# only distribute when we are the distributingNode
if self.getDistributingNode() == currentNode:
if is_distributing_node:
self.distribute(len(processing_node_list))
# SkinsTool uses a REQUEST cache to store skin objects, as
......@@ -1310,7 +1311,20 @@ class ActivityTool (BaseTool):
# the processing_node numbers are the indices of the elements
# in the node tuple +1 because processing_node starts form 1
if currentNode in processing_node_list:
self.tic(processing_node_list.index(currentNode) + 1)
self.tic(
processing_node=processing_node_list.index(currentNode) + 1,
# Tell tic it cannot keep processing activities forever when
# current node is also the distribution node: it must return
# in order for validation to happen. Otherwise, there can be
# an accumulation of activities in processing_node=-1 which
# would otherwise be possible to execute.
# To minimize the impact of this control, this is computed
# here, so if the current node is given the distribution node
# role while already in the "tic" method, it will keep looping.
# Node role changes are rare and several are dangerous on a
# busy cluster, so it should not be a big drawback.
can_loop=is_distributing_node,
)
except:
# Catch ALL exception to avoid killing timerserver.
LOG('ActivityTool', ERROR, 'process_timer received an exception',
......@@ -1330,7 +1344,7 @@ class ActivityTool (BaseTool):
activity.distribute(aq_inner(self), node_count)
security.declarePublic('tic')
def tic(self, processing_node=1, force=0):
def tic(self, processing_node=1, force=0, can_loop=True):
"""
Starts again an activity
processing_node starts from 1 (there is not node 0)
......@@ -1371,6 +1385,8 @@ class ActivityTool (BaseTool):
break
else:
break
if not can_loop:
break
finally:
is_running_lock.release()
finally:
......@@ -1462,9 +1478,13 @@ class ActivityTool (BaseTool):
elif node != 'same':
kw['node'] = self.getFamilyId(node)
break
processing_node_list = self.getNodeList(role=ROLE_PROCESSING)
if len(processing_node_list) < 2:
# If there is less than 2 processing nodes at the time this activity
# is spawned, then ignore "same" (including implicit "same").
break
try:
kw['node'] = 1 + self.getNodeList(
role=ROLE_PROCESSING).index(getCurrentNode())
kw['node'] = 1 + processing_node_list.index(getCurrentNode())
except ValueError:
pass
break
......
  • it seems this commit break our tests , there has a lot pending immediateReindexObject activities , @vpelletier can you pls check?

  • Thanks for the notification, this should be fixed by 4eb26017 : I did not identify that tests monkey-patch the tic method I modified here.

  • ...and one more fix, this time for some UI tests which now require more tic calls now: 208b11ad .

    Hopefully this should be the last fix needed to stabilise the tests, as only testFunctionalCore and erp5_accounting_renderjs_ui_test:testFunctionalRJSAccountingReport were failing...

  • @vpelletier it seems the accounting report tests are still failing. I didn't investigate yet why this script starts failing after your change.

    Maybe @jerome has an idea?

  • There are at least 3 issues behind these failures:

    • since 49b9612b, the Preference owner is changed, but not which user has the Owner local role, which is inconsistent in itself. I am fixing this.
    • Preference_disableOtherPreferences searched for preferences by owner but then checks the current user hard the Owner local role. It should rather search by viewable_owner, as this is based on the role and not document ownership, and if for whatever reason the owner does not have view permission for their Owner role, it would already not be listed in the owner-based catalog result. I am fixing this.
    • ...that the above points have exactly nothing to do with my CMFActivity change. So there is more here than these issues. So I should keep looking before applying the above fixes to see what else I am missing.

    [EDIT] I pushed the early fixes (untested for now) in vpelletier/erp5/fix_preference_ownership_checks.

    Edited by Vincent Pelletier
  • And another bug in the bug in the bug: the ownership change does not reindex the modified document, so the tests behave differently before and after the first time that preference gets reindexed.

  • Ahah, this may be the explanation of all current errors in erp5 tests

  • I pushed the first 2 fixes (the BT dependency fix is unnecessary and I am getting some failures with it). I'm still trying to understand how my CMFActivity change caused all this to surface.

  • ...yet another fix for code which relied on a limited number of tic calls to be enough: 1c3ff213 .

    I am looking at testERP5Security failures, which seem to happen because stop_condition is being invoked many more times for a given activity queue (so the test discovers that there appears to be a hole when no user can log in - which is what the test was trying to verify), which is also causing a feed-back loop (try creating Person during stop_condition, get exception, no rollback: activities get queued, so one more tic loop, so one more stop_condition call, etc). I do not have a fix for this ready yet.

    I am feeling that these failures are going on for too long, and while the issues discovered are generally real issues, I feel that it could maybe be better to revert the CMFActivity change (and corresponding tic API change fixup in one test), keep the other fixes in, and debug in a separate test runner. If this evening I cannot get decent test scores, I intend to do so, so next week should start on a clean test result (even though we now know there are bugs hidden by the old single-node tic scheduling).

  • I am more and more tempted with reverting this (and the related fixups, but not the test fixes nor the tic-count-to-deadline changes).

    Here is the breakdown of remaining failing tests:

    • erp5_workflow_test:testWorkflowAndDCWorkflow: UID change (document changed between the moment the activity was spawned and the moment it was executed)
    • erp5_core_test:testRestrictedPythonSecurity: this one is surprising, the CRYPT password hashing scheme is causing the test to fail. I suspect this is unrelated to my change as this is not an ERP5 test, but a zope test.
    • erp5_archive:testArchive: takes longer than before (over 3 times longer than before)
    • testUpgradeInstanceWithOldDataFs: (likewise)
    • erp5_configurator_standard:testStandardConfigurationWorkflow: (likewise)
    • erp5_officejs_support_request_ui_test:testFunctionalSupportRequest: random UI test failure ? (timeout on non-activity-related selenium step)
    • erp5_officejs_ui_test:testFunctionalOfficeJSoOoSpreadsheet: random UI test failure ? (ERP5JS traceback on non-activity-related selenium step)

    So overall:

    • 4 maybe-issues, 3 likely unrelated to this commit
    • 3 failures just because the tests take a lot longer than before to execute activities (because instead of looking deep inside CMFActivity, it is the test tic loop which iterates more than before - so more code needs to run for each activity execution)

    @jerome reported a test installing an instance taking 5x longer than before to complete. This could be improved by modifying process_timer to loop instead of returning control to timerserver, but given the much-stronger-than-expected performance hit that even test tic is getting, I am not very optimistic about the final performances.

    So while I like that tests could detect a lot of actual bugs (missing activity dependencies, or accidentally-strong expectations of some tests on how the activities are processed), the overall performance impact on high-saturation uses (like creating a new ERP5 instance and running tests) and ease of deploying a multi-zopes development instance, I now feel this change is not the best solution. Maybe we can decide that single-zope setups are not a supported setup for anything beyond environments with only a handful of documents ?

  • mentioned in commit 340acac6

    Toggle commit list
  • Revert pushed: 340acac6 .

    • erp5_core_test:testRestrictedPythonSecurity: this one is surprising, the CRYPT password hashing scheme is causing the test to fail. I suspect this is unrelated to my change as this is not an ERP5 test, but a zope test.

    Yes, this one is surprising. It might not be related. I also saw it starting to fail recently, it's might be because of something else, maybe when we updated test machines to debian 11. I don't see what recent changes in erp5 or slapos might cause this.

  • maybe when we updated test machines to debian 11

    This would be my guess, indeed. This, or maybe an updated egg responsible for registering this password hashing scheme to the registry (or otherwise messing with the registry).

    crypt is 3des-based, not salted, ... Deprecating it would be a good thing. Having it as a fallback for any existing crypt-encoded password is nice, but AFAIK ERP5 has always used SSHA so far so it should not be necessary in any ERP5 instance.

    EDIT: Just a random duckduckgo find: RIPE deprecated crypt-pw in its own password database in 2006~2007

    Edited by Vincent Pelletier
  • Let's remove crypt test !1530 (merged)

Markdown is supported
0%
or
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment