NOTE: This should be tested on Linux machines too.
@fabiaserra - any chance you could roll this out on Linux machines and test?
Yeah, this is the exact same change I made on my fork when we started discussing it, but I haven't been able to run a test yet to confirm it's working. I will try soon.
BTW, please bump the version of the Deadline plugin.
By the way - there may be cases where failing the job still makes sense. E.g. this check:
if ayon_publish_job == "1" and ayon_render_job == "1":
    raise RuntimeError(
        "Misconfiguration. Job couldn't be both render and publish."
    )
Or if the AyonExecutable is not configured at all:
if not exe_list:
    raise RuntimeError(
        "Path to AYON executable not configured."
        "Please set it in Ayon Deadline Plugin."
    )
Will always fail, since the setting lives on the job or the Deadline plugin and the result will be the same for all machines. So it may make sense to fail the job then?
Maybe this:
if not all(add_kwargs.values()):
    raise RuntimeError((
        "Missing required env vars: AYON_PROJECT_NAME,"
        " AYON_FOLDER_PATH, AYON_TASK_NAME, AYON_APP_NAME"
    ))
May also make sense to always fail, since it should behave quite similarly across the workers/machines?
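For illustration, a minimal sketch (not part of this PR) of a helper the pre-load script could call for such job-wide misconfigurations - assuming the Deadline Scripting API's RepositoryUtils.FailJob(job) and the deadlinePlugin.GetJob() accessor:
from Deadline.Scripting import RepositoryUtils

def fail_whole_job(deadlinePlugin, message):
    """Fail the full job before raising, for errors that hit every worker.

    Assumes RepositoryUtils.FailJob(job) and deadlinePlugin.GetJob() from
    the Deadline Scripting API.
    """
    RepositoryUtils.FailJob(deadlinePlugin.GetJob())
    raise RuntimeError(message)
The misconfiguration and missing env var checks above could then call this helper instead of doing a bare raise.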
There are also cases where it may make sense to directly mark the Worker as bad for the job.
For example this:
exe_list = get_ayon_executable()
exe = FileUtils.SearchFileList(exe_list)
if not exe:
    raise RuntimeError((
        "Ayon executable was not found in the semicolon "
        "separated list \"{}\"."
        "The path to the render executable can be configured"
        " from the Plugin Configuration in the Deadline Monitor."
    ).format(exe_list))
This may fail per worker, depending on whether the executable can be found at any of the paths on that machine. There is a high likelihood that the machine won't find it on the next run either. So we could mark the worker "bad" for the job, using RepositoryUtils.AddBadSlaveForJob - see the sketch below.
...
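A rough sketch of that per-worker case, reusing the get_ayon_executable and deadlinePlugin names from the snippet above - assuming RepositoryUtils.AddBadSlaveForJob(slaveName, job) and a deadlinePlugin.GetSlaveName() accessor (exact signatures should be checked against the Deadline docs):
from Deadline.Scripting import FileUtils, RepositoryUtils

exe_list = get_ayon_executable()
exe = FileUtils.SearchFileList(exe_list)
if not exe:
    # This machine is unlikely to find the executable on a retry either,
    # so ban it from the job and let other workers keep picking up tasks.
    RepositoryUtils.AddBadSlaveForJob(
        deadlinePlugin.GetSlaveName(), deadlinePlugin.GetJob()
    )
    raise RuntimeError(
        "Ayon executable was not found in \"{}\".".format(exe_list)
    )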
As such - we may do a follow-up PR or issue to be more specific with our raised errors. For example raising a dedicated error for when we should fail the job.
class AYONJobConfigurationError(RuntimeError):
    """Raised when we know the full job should fail because retrying on
    other machines would be pointless.

    This may be the case if e.g. not all env vars required to inject the
    AYON environment are configured.
    """
Or a dedicated error when we should mark the Worker as bad:
class AYONWorkerBadForJobError(RuntimeError):
    """When raised, the worker will be marked bad for the current job.

    This should be raised when we know that the machine will most likely
    also fail on subsequent tries.
    """
However - a server timeout should just let the job error and requeue with the same worker, so it can try again. So many errors attributable to not being able to reach the server itself should not trigger such a hard failure.
I can confirm this works on Linux.
Works for me on Windows - @kalisp @iLLiCiTiT ready for merge when you've decided next steps. :)
A client also ended up reporting this worked for them and actually helped them - so, on to merging! :)
Changelog Description
Do not force the full job to fail - it may be that just this machine has a server connection timeout. This allows a single machine to fail yet lets others continue in the queue. It also allows a single machine to try again, since Deadline's default "mark bad after X errors" setting decides when a worker is 'bad'.
Additional info
Fix #14
Testing notes: