workfloworchestrator / orchestrator-core

The workflow orchestrator core repository
Apache License 2.0

Child subscription disabling actions on parent subscription #140

Closed by gstewart86 2 years ago

gstewart86 commented 2 years ago

Today we ran into a situation where we had an old Layer3 subscription that had been manually terminated on CREATE for some reason, likely due to bad data. This had the effect of disabling all actions on the parent subscription (Prefix List). As a workaround, we manually set the terminated Layer3 to in-sync to allow parent subscription actions.

So, a few questions:

hanstrompert commented 2 years ago

The philosophy is that when the orchestrator is in sync with all OSS/BSS, all workflows will succeed. If a workflow fails for whatever reason, it leaves the subscription out of sync to prevent other workflows from being started, because they are likely to fail as well. And because it is not known which OSS/BSS were (partially) updated, we want to prevent further changes to OSS/BSS that could create bigger problems. All failed workflows should be checked and, if necessary, the CoreDB and OSS/BSS should be corrected manually.
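The gating described above can be illustrated with a minimal sketch. It is not the orchestrator-core implementation: `Subscription`, `OutOfSyncError` and `run_workflow` are hypothetical names chosen for illustration, and the only behaviour assumed is what this comment describes (a failed workflow leaves the subscription out of sync, and an out-of-sync subscription refuses new workflows).

```python
from dataclasses import dataclass


@dataclass
class Subscription:
    # Hypothetical stand-in, not the orchestrator-core model.
    name: str
    insync: bool = True


class OutOfSyncError(Exception):
    """Raised when a workflow is started on an out-of-sync subscription."""


def run_workflow(subscription, steps):
    # Refuse to start while an earlier failure is still unresolved.
    if not subscription.insync:
        raise OutOfSyncError(f"{subscription.name} is out of sync")
    # Mark the subscription out of sync for the duration of the run.
    subscription.insync = False
    for step in steps:
        step()  # an exception here leaves insync == False
    # Only a fully successful run restores the in-sync flag.
    subscription.insync = True
```

In this sketch, a step that raises leaves `insync` set to `False`, so every later `run_workflow` call on the same subscription fails with `OutOfSyncError` until someone has checked the external systems and reset the flag by hand.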

You could clean things up automatically when you are absolutely sure that the orchestrator and all OSS/BSS are in sync. But you could also ask yourself why people abort workflows, review the procedures they follow, and minimise the reasons for aborting workflows.

acidjunk commented 2 years ago

I'm not sure what is meant by the phrase "workflows which were aborted during the CREATE process" and by stale workflows. Is a stale workflow one that failed to start, e.g. one that has the state RUNNING but doesn't do anything because the threadpool wasn't running, or because some other error kept the workflow from actually starting?

I think the answer by Hans describes the general process philosophy, but occasionally you can run into problems caused by stale workflows, for example when the threadpool is killed with a -9 while executing workflows. (That's why the GUI has a Pause Engine feature.)

These can often be resolved by aborting and deleting the workflow; of course, this depends on which step it was executing. I think the solution you used (forcing a subscription to in-sync) is OK for that kind of scenario. We discussed having a button for that in the GUI, but it was deemed too dangerous, as users would probably click it without looking at the process state first.

gstewart86 commented 2 years ago

I think I muddied the waters by providing too much context around the circumstances in which the parent subscription was terminated.

Given a subscription tree:

node
│
└► service edge
       │
       └► layer 3
            │
            └► prefix list

I was surprised that the status of a child subscription (prefix list) was able to affect the status of a parent subscription (layer 3). It seemed to me that a child subscription would certainly be affected by the status of the parent, but not the other way around.
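To make the observed behaviour concrete, here is a small sketch of a check in which an out-of-sync related subscription blocks actions on another one. This is an assumption about the mechanism, not a description of how orchestrator-core implements it; `Sub`, `related`, `blocking_subscriptions` and `actions_allowed` are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class Sub:
    # Hypothetical stand-in, not the orchestrator-core model.
    name: str
    insync: bool = True
    related: list["Sub"] = field(default_factory=list)


def blocking_subscriptions(sub):
    """Return the names of related subscriptions that are out of sync."""
    return [r.name for r in sub.related if not r.insync]


def actions_allowed(sub):
    # Actions need the subscription itself and all related ones in sync.
    return sub.insync and not blocking_subscriptions(sub)


layer3 = Sub("layer 3", insync=False)  # terminated, left out of sync
prefix_list = Sub("prefix list", related=[layer3])
```

With these objects, `actions_allowed(prefix_list)` is `False` until `layer3.insync` is set back to `True`, which mirrors the workaround of forcing the terminated Layer3 subscription to in-sync.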

In any case, we haven't seen this in production since, so I'll go ahead and resolve the issue unless we see it again.

Thanks for weighing in with all the info, regardless!