Products.CMFActivity.ActivityTool: Improve behaviour on single-node instances. (!1526) · Merge Requests · nexedi / erp5

Products.CMFActivity.ActivityTool: Improve behaviour on single-node instances.

Background:

I investigated abnormal activity spawning patterns on Romain's dev instance when reindexing the entire site, which contains about 1 million documents:

- The main reindexation phase was spawning indexation activities which were not being validated, so once all `_recursiveReindexObject` activities were done in SQLQueue there were over a million indexation activities in SQLDict. This is because `ActivityTool.tic` keeps looping as long as it finds activities to run. This is perfectly fine when another process is doing activity validation, but when the cluster consists of a single Zope process it completely freezes activity validation. This not only causes the accumulation described above, but also makes any interactive use of the site impossible: indexation activities spawned by interactive use are never validated either.
- When by chance some `recursiveReindexObject` activities in SQLDict did get validated, they were not executed for as long as `_recursiveReindexObject` activities existed in SQLQueue. This is because `recursiveReindexObject` activities are spawned without node preference, while `_recursiveReindexObject` activities are spawned with one. These choices make sense, but they also mean that the effective priority of the former is 3, while that of the latter is 2. Combined with the fact that they are spawned in different queues, and that `_recursiveReindexObject` respawns itself and is immediately validated (inserted with `processing_node=0`), this means that SQLDict is never executed for as long as `_recursiveReindexObject` activities exist.
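
To illustrate the second point, here is a toy model of the starvation, not the actual CMFActivity scheduling code. It assumes only what is stated above: the queue whose next activity has the lowest effective priority number runs first, and `_recursiveReindexObject` re-inserts itself already validated. The function name `simulate` and the data layout are made up for the illustration.

```python
from collections import deque


def simulate(respawn_count):
    """Return the order in which the toy scheduler executes activities.

    Each queue holds (effective_priority, name) entries that are assumed to
    be already validated. A lower priority number is executed first.
    """
    sql_queue = deque([(2, "_recursiveReindexObject")])
    sql_dict = deque([(3, "recursiveReindexObject")])
    executed = []
    while sql_queue or sql_dict:
        # Pick the non-empty queue whose head has the lowest priority number.
        candidates = [queue for queue in (sql_queue, sql_dict) if queue]
        queue = min(candidates, key=lambda q: q[0][0])
        priority, name = queue.popleft()
        executed.append(name)
        # The recursive activity respawns itself, immediately validated.
        if name == "_recursiveReindexObject" and respawn_count > 0:
            respawn_count -= 1
            sql_queue.append((priority, name))
    return executed
```

In this model, the SQLDict activity only runs once the self-respawning activity has stopped respawning, however many times it respawns.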

The first point is fixed by `ActivityTool.process_timer` telling `ActivityTool.tic` whether it is allowed to keep executing activities, and disallowing it when the current node is the validation node. The internal logic of breaking the iteration as soon as a queue could execute activities is preserved, so that activity validation happens before queue priorities are recomputed.
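
The control flow can be sketched roughly as follows. This is a simplified model with plain functions, not the real `ActivityTool` methods: the names `toy_tic` and `toy_process_timer`, their parameters, and the representation of queues as callables are all assumptions made for the illustration.

```python
def toy_tic(queues, may_loop):
    """Execute activity batches and return how many were executed.

    `queues` is a list of callables. Each one tries to execute a batch of
    activities and returns True if it executed something.
    """
    batches = 0
    while True:
        for execute_batch in queues:
            if execute_batch():
                batches += 1
                break  # stop at the first queue that executed something
        else:
            return batches  # no queue had anything to execute
        if not may_loop:
            return batches  # give validation a chance before continuing


def toy_process_timer(queues, validate, is_validation_node, is_processing_node):
    """One timer tick: validate if this node validates, then execute."""
    if is_validation_node:
        validate()
    if is_processing_node:
        # A node that also validates must not monopolise the tick.
        return toy_tic(queues, may_loop=not is_validation_node)
    return 0
```

On a node that is both validation and processing node, each tick in this model validates once and executes at most one batch; on a pure processing node, the loop keeps running until no queue has anything left to execute.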

The second point is fixed by not setting the `same` node preference when spawning activities while there is a single processing node. This is done at activity insertion because it seems easier to do there, with very low overhead, than during priority computations later in the activity's lifecycle. As a consequence, a cluster temporarily configured with a single processing node will trigger this condition for all activities spawned during that period, but I believe this is exceedingly rare, and the temporary performance loss from sub-optimal node selection in such a transitory configuration should be negligible. Explicit node family choices are obeyed regardless of the number of processing nodes.
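
The decision made at insertion time amounts to something like the sketch below. The function name `choose_node_preference`, its parameters and the family name `family_a` are hypothetical; only the rule itself (drop `same` with a single processing node, keep explicit families) comes from the description above.

```python
def choose_node_preference(requested_node, processing_node_count):
    """Return the node preference to store when an activity is inserted.

    `requested_node` is None (no preference), "same" (prefer the spawning
    node), or an explicit node family name.
    """
    if requested_node == "same" and processing_node_count <= 1:
        # With a single processing node, "same" brings no benefit and only
        # skews the effective priority, so no preference is stored.
        return None
    # Explicit node families and the multi-node case are left untouched.
    return requested_node


# Example: only the "same" preference is affected by the node count.
print(choose_node_preference("same", 1))
print(choose_node_preference("same", 3))
print(choose_node_preference("family_a", 1))
```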

These changes should have an unnoticeable performance impact on multi-processing-nodes setups.

These changes should also benefit multi-processing-nodes setups in which one node is configured as both validation node and processing node (which is historically not a recommended setup): such a node will no longer completely stall validation for as long as there are processable activities. I would still recommend against such a setup, as it necessarily increases validation latency, which has a negative effect on activity performance.

With these changes, the activity spawning & execution pattern on Romain's single-node instance was much more stable.

/cc @jm @romain

