yawlfoundation / yawl

Yet Another Workflow Language
http://www.yawlfoundation.org
GNU Lesser General Public License v3.0

ResourceManager does not finalize multiple instances of automated task #647

Open phtyson opened 7 months ago

phtyson commented 7 months ago

I have a multiple-instance composite task that decomposes into a single atomic automated task with a codelet. At runtime, several child instance workitems are created, with ids like n.n.n.1, n.n.n.2, etc., and for each of those the atomic task workitem id is like n.n.n.1.1, n.n.n.2.1, etc. All the automated atomic tasks are completed, checked in, and unpersisted. However, only one of the individual MI tasks is processed completely (checked out, completed, and checked in): the last child MI workitem processed. The others remain persisted in a "busy" state. This causes problems when restoring state after restarting the service, because the engine state doesn't agree with the workitem cache, and data is lost.
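For anyone following along, the id nesting described above can be pictured as a plain parent/child relationship on the dotted ids. The snippet below is purely illustrative (it is not YAWL API); the concrete ids are made up, and parentIdOf() just strips the trailing segment.

```java
// Illustrative only -- not part of the YAWL API. Shows the nesting described
// above: an MI child workitem "n.n.n.2" owns one automated atomic workitem
// "n.n.n.2.1", and so on for each child instance.
public class WorkItemIdSketch {

    // Drop the last dotted segment: "3.1.7.2.1" -> "3.1.7.2"
    static String parentIdOf(String id) {
        int lastDot = id.lastIndexOf('.');
        return (lastDot > 0) ? id.substring(0, lastDot) : id;
    }

    public static void main(String[] args) {
        String atomicItem = "3.1.7.2.1";             // automated atomic task instance
        String miChild    = parentIdOf(atomicItem);  // "3.1.7.2" -- MI child instance
        String miParent   = parentIdOf(miChild);     // "3.1.7"   -- the MI composite task
        System.out.println(atomicItem + " -> " + miChild + " -> " + miParent);
    }
}
```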

I see that ResourceManager.handleAutoTask() only processes the first child. The obvious fix would be to loop through all children and call processAutoTask(), but since all the atomic tasks are already being processed, there must be something else going on. UPDATE: This appears to be irrelevant; the call chains for each child of the MI task are identical as far as I can tell.
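For reference, the "obvious fix" mentioned above would look roughly like the fragment below. handleAutoTask() and processAutoTask() are the method names from this report; the parent/child test and the getChildren() accessor are assumptions about the surrounding code rather than the actual 4.2 source, and as the UPDATE notes, this may not be where the real problem lies.

```java
// Sketch of the "loop through all children" idea -- a fragment, not the actual
// ResourceManager code. getChildren() is an assumed accessor for the child
// workitem records of an MI parent.
private void handleAutoTask(WorkItemRecord wir) {
    List<WorkItemRecord> children = getChildren(wir);   // assumption: resolve MI children
    if (children == null || children.isEmpty()) {
        processAutoTask(wir);                            // ordinary automated task
    }
    else {
        for (WorkItemRecord child : children) {          // process every child,
            processAutoTask(child);                      // not only the first one
        }
    }
}
```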

Can someone explain the code for handling multiple instances of automated tasks, and suggest a solution?

phtyson commented 7 months ago

I have been studying this problem for a few days, and have tried lots of things that didn't work, but the problem remains. The big picture is:

  1. A multiple composite task fires with two or more instances.
  2. All instance workitems are successfully created, checked out, executed, and checked back in. (All automatic codelets.)
  3. For each completed MI workitem, YNetRunner.processCompletedSubnet() is called. The items are processed asynchronously, and the exit flag is set only when the number of active items equals the number of completed items (see the sketch after this list). At that point, the last (current) item is completed and unpersisted. The other items of this completed multiple instance remain in the database, and can be seen in the runner_states table. (I have added debug logging so I can see most of what's going on after it happens. Unfortunately I have not yet put it in live debug mode.)
  4. Everything appears normal in UI resource and monitor front ends until the services are restarted.
  5. When the services are restarted, the engine attempts to restore all runner states, including the completed MI subnets. This causes an NPE, which logs a warning that the engine state is unreliable. When the resource manager then tries to restore from its cache, it doesn't find what it expects in the engine and removes the data, resulting in the loss of active workitems.
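To make step 3 concrete, the behaviour described there can be simulated with the toy below (hypothetical names and ids, not the actual YNetRunner code): every child is marked completed as it checks in, but only the item that makes the completed count equal the active count triggers the exit/unpersist path, so its siblings are left behind as persisted rows.

```java
import java.util.*;

// Toy simulation of the symptom described in step 3 -- hypothetical names,
// not the actual YNetRunner.processCompletedSubnet() logic.
public class MiCompletionSketch {

    public static void main(String[] args) {
        List<String> children = List.of("3.1.7.1", "3.1.7.2", "3.1.7.3");
        Set<String> persisted = new LinkedHashSet<>(children);  // all start persisted
        Set<String> completed = new LinkedHashSet<>();

        for (String id : children) {                 // items check in one at a time
            completed.add(id);
            if (completed.size() == children.size()) {
                // exit flag set only now, so only the current (last) item
                // is unpersisted; its siblings keep their runner_states rows
                persisted.remove(id);
            }
        }
        System.out.println("still persisted: " + persisted);
        // prints: still persisted: [3.1.7.1, 3.1.7.2]
    }
}
```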

I have tried:

  1. Bypassing the null nets in the ResourceManager when it tries to restore runners (sketched below). This results in further state problems that prevent the engine from starting.
  2. Forcing the YNetRunner and YTask methods to try to complete the dangling MI items. Completion is always prevented until the last item is completed, and after that it does not appear possible to cancel or complete the other items.
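For what it's worth, attempt 1 above was essentially a guard of the following shape during runner restoration (hypothetical names, not the actual ResourceManager code): skipping the records whose net cannot be resolved avoids the NPE but, as noted, just moves the inconsistency further into startup.

```java
import java.util.*;

// Toy sketch of attempt 1 -- hypothetical names, not ResourceManager code.
// Records whose net cannot be resolved (the leftover completed MI subnets)
// are skipped instead of triggering an NPE during restore.
public class RestoreGuardSketch {

    record RunnerRecord(String caseId, String netId) {}

    public static void main(String[] args) {
        Set<String> knownNets = Set.of("rootNet");
        List<RunnerRecord> persistedRunners = List.of(
                new RunnerRecord("3", "rootNet"),
                new RunnerRecord("3.1.7.1", "completedMiSubnet"));  // dangling row

        for (RunnerRecord r : persistedRunners) {
            if (!knownNets.contains(r.netId())) {       // net resolves to null in practice
                System.out.println("skipping runner " + r.caseId() + ": unresolvable net");
                continue;                               // bypass instead of NPE
            }
            System.out.println("restoring runner " + r.caseId() + " on " + r.netId());
        }
    }
}
```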

Any suggestions for improving the code to handle this would be appreciated. I'm working on a slightly modified fork of 4.2. Alternatively, if there is a way of restructuring the process specification to avoid this, I could do that.

ahense commented 7 months ago

We can offer professional support here. More information is available here: https://yawlfoundation.github.io/page11.html

phtyson commented 7 months ago

Further investigation indicates these symptoms may have been caused by the workflow specification configuration. I rewrote the spec to eliminate some subnets and simplify the joins, and the runtime symptoms disappeared.

I have not done controlled testing to find the problematic configuration. There were two places in the spec where it was failing, and both had the same characteristic flow configuration: two multiple composite tasks on different XOR branches were connected directly to the subnet output condition. I don't have time to pursue this right now, but I suspect there is a risk in connecting a multiple composite task directly to a subnet output condition when there is more than one instance of the task.