Closed: kalisp closed this issue 3 months ago.
If I remember correctly, we did try to achieve that in the past and failed.
GlobalJobPreLoad does exactly what the name says: it prepares the job, not the task. So at the moment it is triggered, it does not actually have access to a task yet.
At the same time, raising an error in GlobalJobPreLoad did not cause the task to fail; rendering continued. Although this should be re-validated, as we're talking about Deadline from 2020, maybe earlier. But a render failure can only be caused by a plugin calling `FailRender` or raising `FailRenderException`; maybe that could be triggered from the job preload?
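To make that question concrete, here is a minimal sketch of what triggering it from the preload could look like, assuming GlobalJobPreLoad.py's standard `__main__(deadlinePlugin)` entry point and the `inject_environment` helper named later in this issue; whether `FailRender` actually works from a pre-load script (rather than only from a render plugin) is exactly the part that would need re-validating.

```python
# Sketch only: GlobalJobPreLoad.py receives the plugin instance, so in
# principle the failure could be reported through FailRender from here.
# Unverified whether a job pre-load script can fail the render this way.
def __main__(deadlinePlugin):
    try:
        inject_environment(deadlinePlugin)  # assumed env-preparation step
    except Exception as exc:
        # FailRender logs the message and marks the render attempt failed.
        deadlinePlugin.FailRender(
            "AYON environment injection failed: {}".format(exc)
        )
```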
> At the same time, raising an error in GlobalJobPreLoad did not cause the task to fail; rendering continued.
That's fine, but you are calling `RepositoryUtils.FailJob(job)`. That does fail the job, no?
> That's fine, but you are calling `RepositoryUtils.FailJob(job)`.
That is not fine: the AYON env preparation does not happen, but rendering processing happens.
Ignoring env preparation is not the right fix for this issue.
> That is not fine, the AYON env preparation does not happen, but ~rendering~ processing happens.
I'm saying "that's fine" about your comment that raising errors did not cause the task to fail, but clearly a call to `FailJob` would do what you claim didn't happen.
I gave this a quick go. If we just raise an error with a clear log message, then this comes out fine from the job, and the task basically 'fails' for that machine. It doesn't mark the Task or Job as failed; it just logs an error and requeues the task.
The machine will then continue with another task.
The job will remain active; it will just have a +1 on its logged errors.
We should not trigger `FailJob`, but just raise an error, and all is good.
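As a rough sketch of that approach (the injection helper's name and signature are assumptions here, not the exact ayon-deadline code): log a clear message, then let the exception bubble up so Deadline records a task error and requeues the task, leaving the job active.

```python
def __main__(deadlinePlugin):
    try:
        inject_environment(deadlinePlugin)  # assumed env-preparation step
    except Exception as exc:
        # Log a clear message, then let the exception bubble up. Deadline
        # records it as a task error and requeues the task on another
        # worker; the job itself stays active.
        deadlinePlugin.LogWarning(
            "AYON environment preparation failed: {}".format(exc)
        )
        raise
```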
In special circumstances, if we know exactly what the error is and that the machine will definitely fail again, we could add the machine to the deny list (or remove it from the allow list). But I'd argue that's way too magical for the majority of cases.
Deadline has sufficient capabilities to decide what to do on an error, e.g. failing the job if it hits 10+ errors, or marking a machine as "bad" after it has e.g. 5+ errors.
**Is there an existing issue for this?**

**Please describe the feature you have in mind and explain what the current shortcomings are?**
It is possible that an issue in `inject_environment` (such as AYON not being installed, wrong credentials, or a timeout in communication with the server) could cause a full stop of the job, and not only of the task that hit the issue (via `RepositoryUtils.FailJob(job)`):
https://github.com/ynput/ayon-deadline/blob/develop/client/ayon_deadline/repository/custom/plugins/GlobalJobPreLoad.py#L584
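For context, the pattern at the linked line is roughly the following (a hedged paraphrase, not a verbatim copy of the linked code): a per-worker problem gets escalated into a job-wide failure.

```python
from Deadline.Scripting import RepositoryUtils

def inject_environment(deadlinePlugin):
    job = deadlinePlugin.GetJob()
    try:
        # ...query the AYON server and apply the environment here...
        pass
    except Exception:
        deadlinePlugin.LogWarning("Failed to inject AYON environment.")
        # This fails the WHOLE job for every worker, not just the task
        # on the misconfigured machine:
        RepositoryUtils.FailJob(job)
        raise
```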
**How would you imagine the implementation of the feature?**
This needs to be replicated, and if it is the case, only the failing task should be stopped from running (as AYON environments must be present, and we cannot fail silently). The failed task could then be picked up by another worker, which may be configured correctly and able to handle it.
**Are there any labels you wish to add?**

**Describe alternatives you've considered:**
No response
**Additional context:**
No response