Failure with inject_environment should fail only task which triggered it

kalisp commented 4 months ago

Is there an existing issue for this?

[X] I have searched the existing issues.

Please describe the feature you have in mind and explain what the current shortcomings are?

It is possible that issue in inject_environment (as not installed AYON, or wrong credentials or timeout in server communication) could cause full stop of the job and not only task that has that issue (Via RepositoryUtils.FailJob(job) https://github.com/ynput/ayon-deadline/blob/develop/client/ayon_deadline/repository/custom/plugins/GlobalJobPreLoad.py#L584

How would you imagine the implementation of the feature?

This needs to be replicated and if it is the case, only failing task should be stopped from running (as AYON environments must be present and we cannot fail silently). Failed task could be picked up by another worker which may be configured correctly and handle that task.

Are there any labels you wish to add?

[X] I have added the relevant labels to the enhancement request.

Describe alternatives you've considered:

No response

Additional context:

No response

iLLiCiTiT commented 4 months ago

If I remember correctly we did try to achieve that in past and failed with that.

GlobalJobPreLoad does exactly what is in the name, prepares job, not task. So at the moment when it is triggered it actually does not have access to a task yet.

At the same time raising error in GlobalJobPreLoad did not cause fail of the task and continued to render. Althou this should be re-validated we're talking about deadline from 2020 maybe earlier. But render fail can be caused only by Plugin calling FailRender or using FailRenderException , maybe that could be triggered from job preload?

fabiaserra commented 4 months ago

At the same time raising error in GlobalJobPreLoad did not cause fail of the task and continued to render.

That's fine but you are calling RepositoryUtils.FailJob(job) . That does fail the job no?

iLLiCiTiT commented 4 months ago

That's fine but you are calling

That is not fine, the AYON env preparation does not happen, but ~~rendering~~ processing happens.

Ignoring env preparation is not the right fix for this issue.

fabiaserra commented 4 months ago

That is not fine, the AYON env preparation does not happen, but ~rendering~ processing happens.

I'm saying that's fine to your comment about raising errors did not cause the fail of the task, but clearly a call of FailJob would do what you claim didn't happen

BigRoy commented 4 months ago

I gave this a quick go. If we just raise an error with a clear log message then this comes out fine from the job and the task basically 'fails' for that machine. It doesn't mark the Task or Job failed - it just logs an error and the 'requeues' the task.

The machine will then continue with another task.

The job will remain active - it will just have a +1 on the logged errors:

We should not trigger FailJob but just raise an error and all is good.

BigRoy commented 4 months ago

We could in special circumstances if we know exactly what the error is and that the machine will 100% definitely fail again - add the machine to the deny list (or remove from allow list). But I'd argue that's way too magical for the majority of cases.

Deadline has sufficient capabilities to decide to do on an error - like e.g. failing the job if it hits 10+ errors or marking a machine as "bad" after it has e.g. 5+ errors:

ynput / ayon-deadline