phtyson opened this issue 7 months ago
I have been studying this problem for a few days and have tried lots of things that didn't work, but the problem remains. The big picture:

`YNetRunner.processCompletedSubnet()` is called. The items are processed asynchronously, and the exit flag is set only when the number of active items equals the number of completed items. At that point, the last (current) item is completed and unpersisted. The other items of this completed multiple-instance task remain in the database, and can be seen in the `runner_states` table. (I have added debug logging so I can see most of what's going on after it happens. Unfortunately I have not yet put it in live debug mode.)

I have tried:

- `ResourceManager`, when it tries to restore runners: this results in further state problems that prevent the engine from starting.
- `YNetRunner` and `YTask` methods, to try to complete the dangling MI items: this is always prevented until the last item is completed, and then it does not appear possible to cancel or complete the other items.

Any suggestions for improving the code to handle this would be appreciated. I'm working on a slightly modified fork of 4.2. Alternatively, if there is a way of restructuring the process specification to avoid this, I could do that.
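The completion guard described above (exit only when active items equal completed items, with only the triggering item unpersisted) can be modeled with a toy sketch. This is not YAWL source; the class, field, and method names below are invented purely to illustrate how the earlier-completed siblings could be left behind as dangling persisted rows:

```java
import java.util.ArrayList;
import java.util.List;

public class MiCompletionSketch {
    static class Item {
        final String id;
        boolean completed;
        boolean persisted = true;          // every MI child starts out persisted
        Item(String id) { this.id = id; }
    }

    final List<Item> active = new ArrayList<>();

    /** Complete one item; mimics the reported behavior: the exit flag fires
     *  when active == completed, but only the *current* item is unpersisted. */
    boolean complete(Item item) {
        item.completed = true;
        long done = active.stream().filter(i -> i.completed).count();
        if (done == active.size()) {       // exit condition: all items completed
            item.persisted = false;        // only the last item is cleaned up
            return true;                   // subnet exits here
        }
        return false;
    }

    public static void main(String[] args) {
        MiCompletionSketch net = new MiCompletionSketch();
        Item a = new Item("n.n.n.1"), b = new Item("n.n.n.2");
        net.active.add(a);
        net.active.add(b);
        net.complete(a);                   // sibling completes first
        boolean exited = net.complete(b);  // last item triggers the exit
        // prints: exited=true, n.n.n.1 still persisted=true
        System.out.println("exited=" + exited
                + ", n.n.n.1 still persisted=" + a.persisted);
    }
}
```

If the real logic has this shape, the dangling `runner_states` rows follow directly: nothing ever revisits the siblings after the exit fires.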
We can offer professional support here. More information on this is here: https://yawlfoundation.github.io/page11.html
Further investigation indicates these symptoms may have been caused by the workflow specification configuration. I rewrote the spec to eliminate some subnets and simplify the joins, and the runtime symptoms disappeared.
I have not done controlled testing to isolate the problematic configuration. There were two places in the spec where it was failing, and both had the same characteristic flow configuration: two multiple-instance composite tasks on different XOR branches connected directly to the subnet's output condition. I don't have time to pursue this right now, but I suspect there is danger in connecting a multiple-instance composite task directly to a subnet output condition when there is more than one instance of the task.
I have a multiple-instance composite task that decomposes into a single atomic automated task with a codelet. At runtime, several child instance workitems are created, with ids like `n.n.n.1`, `n.n.n.2`, etc., and for each of those the atomic task workitem id is `n.n.n.1.1`, `n.n.n.2.1`, etc.

All the automated atomic tasks are completed, checked in, and unpersisted. However, only one of the individual MI tasks is processed completely (checked out, completed, and checked in): the last child MI workitem processed. The others remain in a persisted "busy" state. This causes problems when restoring state after restarting the service, because the engine state doesn't agree with the workitem cache, and data is lost.

I see that `ResourceManager.handleAutoTask()` only processes the first child. The obvious fix would be to loop through all children and call `processAutoTask()`, but since all the atomic tasks are already being processed, there must be something else going on. UPDATE: This appears to be irrelevant; the call chains for each child of the MI task are identical as far as I can tell.

Can someone explain the code for handling multiple instances of automated tasks, and suggest a solution?
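For clarity, the loop fix considered above (and later set aside as irrelevant) would have roughly this shape. The types and method names below are mocked stand-ins for the `ResourceManager` API mentioned in the thread, not the actual 4.2 code:

```java
import java.util.ArrayList;
import java.util.List;

public class AutoTaskSketch {
    // Minimal stand-in for a workitem record; only the id matters here.
    interface WorkItemRecord { String getID(); }

    // Mocked processAutoTask(): just records which workitem ids were handled.
    static final List<String> processed = new ArrayList<>();
    static void processAutoTask(WorkItemRecord wir) {
        processed.add(wir.getID());
    }

    // Shape of the proposed fix: iterate over every child of the MI auto task
    // instead of handling only children.get(0).
    static void handleAutoTask(List<WorkItemRecord> children) {
        for (WorkItemRecord child : children) {
            processAutoTask(child);
        }
    }

    public static void main(String[] args) {
        handleAutoTask(List.<WorkItemRecord>of(
                () -> "n.n.n.1.1",
                () -> "n.n.n.2.1"));
        System.out.println(processed);  // prints: [n.n.n.1.1, n.n.n.2.1]
    }
}
```

As the UPDATE notes, this change alone would not explain the symptoms, since the atomic tasks are all being processed already; the sketch is only to pin down what "loop through all children" would mean.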