Products.CMFActivity.ActivityTool: Improve behaviour on single-node instances.
Background:
I investigated abnormal activity spawning patterns on Romain's dev instance when reindexing the entire site, which contains about 1 million documents:
- The main reindexation phase was spawning indexation activities which were not being validated, so once all `_recursiveReindexObject` activities were done in SQLQueue there were over a million indexation activities in SQLDict. This is because `ActivityTool.tic` keeps looping as long as it finds activities to run. This is perfectly fine when another process is doing activity validation, but when the cluster is composed of a single zope this completely freezes the activity validation process. This not only causes such activity accumulation, but also means that any interactive use of the site is impossible: indexation activities spawned by interactive use are also never validated.
- When by chance some (`recursiveReindexObject`) activities in SQLDict did get validated, they were not executed for as long as `_recursiveReindexObject` activities existed in SQLQueue. This is because `recursiveReindexObject` activities are spawned without node preference, but `_recursiveReindexObject` activities are spawned with one. These choices make sense, but they also mean that the effective priority of the former is 3, while the priority of the latter is 2. This, combined with the fact that they are spawned in different queues, and the fact that `_recursiveReindexObject` respawns itself and is immediately validated (inserted with `processing_node=0`), means that SQLDict is never executed for as long as `_recursiveReindexObject` activities exist.
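
To illustrate the second point, here is a toy model of the starvation, not the actual CMFActivity scheduling code. It assumes only what is described above: the numerically lower effective priority runs first, and `_recursiveReindexObject` respawns itself already validated. The `run` function and its tuple layout are hypothetical.

```python
import heapq


def run(respawn_count):
    """Return the execution order of a toy two-queue scheduler."""
    # Entries are (effective priority, sequence, queue, method).
    # Assumption: the lowest priority value is executed first.
    ready = [
        (3, 0, "SQLDict", "recursiveReindexObject"),
        (2, 1, "SQLQueue", "_recursiveReindexObject"),
    ]
    heapq.heapify(ready)
    order = []
    sequence = 2
    while ready:
        priority, _, queue, method = heapq.heappop(ready)
        order.append(method)
        if method == "_recursiveReindexObject" and respawn_count > 0:
            # The activity respawns itself and is immediately executable.
            respawn_count -= 1
            heapq.heappush(ready, (priority, sequence, queue, method))
            sequence += 1
    return order


order = run(respawn_count=3)
```

In this model the SQLDict activity is only reached once no `_recursiveReindexObject` entry is left, however many times the latter respawns.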
The first point is fixed by `ActivityTool.process_timer` telling `ActivityTool.tic` whether it is allowed to keep executing activities, and disallowing it when the current node is the validation node. The internal logic of breaking the iteration when a queue could execute activities is preserved, so that activity validation happens before queue priorities are recomputed.
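
The idea can be sketched with a simplified model. `ToyNode`, its attributes and the `keep_looping` argument are hypothetical names, not the real `ActivityTool` signatures, and the model assumes one activity per `tic` iteration and a single queue.

```python
from collections import deque


class ToyNode:
    """Toy stand-in for a Zope node; not the real ActivityTool API."""

    def __init__(self, is_validation_node, validated):
        self.is_validation_node = is_validation_node
        self.pending = deque()             # spawned, awaiting validation
        self.validated = deque(validated)  # ready to be executed
        self.log = []

    def validate(self):
        if self.pending:
            self.log.append("validate %d" % len(self.pending))
            self.validated.extend(self.pending)
            self.pending.clear()

    def tic(self, keep_looping):
        while self.validated:
            name = self.validated.popleft()
            self.log.append("run " + name)
            if name.startswith("reindex"):
                # Each reindex activity spawns one unvalidated indexation.
                self.pending.append("index" + name[len("reindex"):])
            if not keep_looping:
                break

    def process_timer(self):
        if self.is_validation_node:
            self.validate()
        # A validation node must return quickly so validation can run again.
        self.tic(keep_looping=not self.is_validation_node)


node = ToyNode(True, ["reindex-1", "reindex-2", "reindex-3"])
for _ in range(6):
    node.process_timer()
```

With `keep_looping=False`, validation is interleaved with execution on each timer call. Calling `tic(keep_looping=True)` on the same starting state instead runs every reindex activity first and leaves all spawned indexations unvalidated.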
The second point is fixed by not setting the same-node preference when spawning activities at a time when there is a single processing node. This is done at activity insertion because it seems easier to do with a very low overhead than during priority computations later in the activity's lifecycle. This means that a cluster temporarily set with a single processing node will trigger this condition for all activities spawned during such a period, but I believe this is exceedingly rare, and the temporary performance loss from having sub-optimal node selection in such a transitory configuration should be negligible. Explicit node family choices are obeyed independently of the number of processing nodes.
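
The insertion-time decision can be sketched as follows. This is an illustration under stated assumptions only: `choose_node_preference`, its parameters and the `"same-node"` marker are hypothetical and do not reflect how CMFActivity actually stores node preferences.

```python
def choose_node_preference(wants_same_node, processing_node_count,
                           node_family=None):
    """Return the node hint stored with a newly spawned activity (toy model)."""
    if node_family is not None:
        # Explicit node family choices are obeyed regardless of node count.
        return node_family
    if wants_same_node and processing_node_count > 1:
        return "same-node"
    # Single processing node, or no preference requested: store no hint,
    # so the activity competes on equal terms with unhinted activities.
    return None
```

For example, `choose_node_preference(True, 1)` yields no hint, while `choose_node_preference(True, 3)` keeps the same-node preference.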
These changes should have an unnoticeable performance impact on multi-processing-node setups in general.
They should have a positive effect on multi-processing-node setups where one node is configured both as validation node and as processing node (which is historically not a recommended setup), as such a node will no longer completely stall validation for as long as there are processable activities. I would still recommend against such a setup, as it will necessarily increase the validation latency, which will have a negative effect on activity performance.
With these changes, the activity spawning & execution pattern on Romain's single-node instance was much more stable.