ynput / ayon-deadline
Deadline addon for AYON

Fail GlobalJobPreLoad harder when we know for sure repeating the task won't work #20

Open BigRoy opened 3 months ago

BigRoy commented 3 months ago

Is there an existing issue for this?

Please describe the feature you have in mind and explain what the current shortcomings are?

Since #15, the full job no longer fails if GlobalJobPreLoad errors during the AYON environment injection.

However, there may very well be cases where we know that restarting won't be feasible, as reported here.

For example:

    if ayon_publish_job == "1" and ayon_render_job == "1":
        raise RuntimeError(
            "Misconfiguration. Job couldn't be both render and publish."
        )

Or if the AyonExecutable is not configured at all.

    if not exe_list:
        raise RuntimeError(
            "Path to AYON executable not configured. "
            "Please set it in Ayon Deadline Plugin."
        )

These will always fail, since the value is set on the job or the Deadline plugin and the result will be the same for all machines. So it may make sense to fail the whole job then?

Maybe this:

        if not all(add_kwargs.values()):
            raise RuntimeError((
                "Missing required env vars: AYON_PROJECT_NAME,"
                " AYON_FOLDER_PATH, AYON_TASK_NAME, AYON_APP_NAME"
            ))

It may also make sense to always fail here, since the check should behave quite similarly across the workers/machines.
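
As a rough sketch of what failing the full job could look like, assuming GlobalJobPreLoad receives the deadlinePlugin instance as its entry point argument and that RepositoryUtils.FailJob is available in Deadline's scripting API (the name and exact behavior should be double-checked against the Deadline docs; inject_ayon_environment is a hypothetical stand-in for the existing injection function):

    # Sketch only: RepositoryUtils.FailJob and the function name are
    # assumptions to verify against the Deadline scripting API docs.
    from Deadline.Scripting import RepositoryUtils

    def inject_ayon_environment(deadlinePlugin):
        job = deadlinePlugin.GetJob()

        exe_list = get_ayon_executable()
        if not exe_list:
            # The setting lives on the Deadline plugin configuration, so
            # every worker would hit the exact same error. Fail the whole
            # job instead of letting each worker retry.
            RepositoryUtils.FailJob(job)
            raise RuntimeError(
                "Path to AYON executable not configured. "
                "Please set it in Ayon Deadline Plugin."
            )

Whether the raise is still needed after FailJob (or whether the task should just return) would need testing against Deadline's behavior.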


There are also cases where it may make sense to directly mark the Worker as bad for the job.

For example this:

        exe_list = get_ayon_executable()
        exe = FileUtils.SearchFileList(exe_list)

        if not exe:
            raise RuntimeError((
                "Ayon executable was not found in the semicolon "
                "separated list \"{}\". "
                "The path to the render executable can be configured "
                "from the Plugin Configuration in the Deadline Monitor."
            ).format(exe_list))

This may fail per worker, depending on whether the executable can be found at any of the listed paths on that particular machine.

There is a high likelihood that the machine won't find it on the next run either, so we could mark the worker "bad" for the job using RepositoryUtils.AddBadSlaveForJob.
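
A minimal sketch of that idea, assuming the worker name is accessible via deadlinePlugin.GetSlaveName() (to be verified against the Deadline plugin API), with find_ayon_executable as a hypothetical wrapper around the existing lookup:

    from Deadline.Scripting import FileUtils, RepositoryUtils

    def find_ayon_executable(deadlinePlugin):
        job = deadlinePlugin.GetJob()
        exe_list = get_ayon_executable()
        exe = FileUtils.SearchFileList(exe_list)
        if not exe:
            # Only this machine is missing the executable: exclude it from
            # the job so the task requeues on other workers instead.
            RepositoryUtils.AddBadSlaveForJob(
                deadlinePlugin.GetSlaveName(), job
            )
            raise RuntimeError(
                "Ayon executable was not found in the semicolon "
                "separated list \"{}\".".format(exe_list)
            )
        return exe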

How would you imagine the implementation of the feature?

For example, raising a dedicated error for when we should fail the full job:

class AYONJobConfigurationError(RuntimeError):
    """Error that, when raised, means the full job should fail because
    retrying on other machines would be worthless.

    This may be the case if e.g. the env vars required to inject the
    AYON environment are not fully configured.
    """

Or a dedicated error when we should mark the Worker as bad:

class AYONWorkerBadForJobError(RuntimeError):
    """When raised, the worker will be marked bad for the current job.

    This should be raised when we know that the machine will most likely
    also fail on subsequent tries.
    """

However, a server timeout should just let the task error and requeue with the same worker so it can try again. So most errors attributed to not being able to reach the server itself should not generate such a hard failure.
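
Putting it together, the GlobalJobPreLoad entry point could then dispatch on the raised error type. Again a sketch only: the availability and signatures of RepositoryUtils.FailJob and RepositoryUtils.AddBadSlaveForJob should be verified, and inject_ayon_environment stands in for the existing injection logic:

    from Deadline.Scripting import RepositoryUtils

    def __main__(deadlinePlugin):
        job = deadlinePlugin.GetJob()
        try:
            inject_ayon_environment(deadlinePlugin)
        except AYONJobConfigurationError:
            # Deterministic misconfiguration: no worker can succeed,
            # so fail the whole job.
            RepositoryUtils.FailJob(job)
            raise
        except AYONWorkerBadForJobError:
            # Only this machine is broken for this job: exclude it and
            # let the task requeue elsewhere.
            RepositoryUtils.AddBadSlaveForJob(
                deadlinePlugin.GetSlaveName(), job
            )
            raise
        # Anything else (e.g. an AYON server timeout) propagates as a
        # normal task error so Deadline requeues it on the same worker.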

Are there any labels you wish to add?

Describe alternatives you've considered:

Just leave it completely up to the Deadline settings for 'monitoring failures' instead of forcing a behavior onto it. Yet at the same time, we do want to avoid many machines retrying many times if we know early on that all of them would fail regardless.

Additional context:

No response