ynput / ayon-deadline

Deadline addon for AYON
Apache License 2.0
3 stars 11 forks source link

Do not fail job on error in GlobalJobPreLoad #15

Closed BigRoy closed 3 months ago

BigRoy commented 4 months ago

Changelog Description

Do not force the full job to fail - it may be just this machine has a server connection timeout. This allows a single machine to fail - yet others to continue in the queue. This also allows a single machine to try again - since it'll use Deadline's default "mark bad after X errors" setting to decide when a worker is 'bad'.

Additional info

Fix #14

Testing notes:

  1. Submit job to Deadline
  2. Turn off your AYON server (or do whatever to make the GlobalJobPreLoad fail)
  3. It should not instantly mark the full job and all its task "failed"
iLLiCiTiT commented 4 months ago

NOTE: This should be tested on linux mashines too.

BigRoy commented 4 months ago

@fabiaserra - any chance you could roll this out on Linux machines and test?

fabiaserra commented 4 months ago

@fabiaserra - any chance you could roll this out on Linux machines and test?

Yeah, this is the exact same change I did on my fork when we started discussing it but I haven't been able yet to do a test to confirm it's working, I will try soon

iLLiCiTiT commented 4 months ago

BTW bump version of deadline plugin pls

BigRoy commented 4 months ago

By the way - there may be cases where failing the job still makes sense. E.g. the:

    if ayon_publish_job == "1" and ayon_render_job == "1":
        raise RuntimeError(
            "Misconfiguration. Job couldn't be both render and publish."
        )

Or if the AyonExecutable is not configured at all.

    if not exe_list:
        raise RuntimeError(
            "Path to AYON executable not configured."
            "Please set it in Ayon Deadline Plugin."
        )

Will always fail - since it's set on the job or deadlineplugin and will be the same result for all machines. So it may make sense to fail the job then?

Maybe this:

        if not all(add_kwargs.values()):
            raise RuntimeError((
                "Missing required env vars: AYON_PROJECT_NAME,"
                " AYON_FOLDER_PATH, AYON_TASK_NAME, AYON_APP_NAME"
            ))

May also make sense to always fail since it should behave quite similar across the workers/machines?


There are also cases where it may make sense to directly mark the Worker as bad for the job.

For example this:

        exe_list = get_ayon_executable()
        exe = FileUtils.SearchFileList(exe_list)

        if not exe:
            raise RuntimeError((
               "Ayon executable was not found in the semicolon "
               "separated list \"{}\"."
               "The path to the render executable can be configured"
               " from the Plugin Configuration in the Deadline Monitor."
            ).format(exe_list))

This may fail per worker depending on whether it has the exe to be found at any of the paths.

There is a high likelihood that that machine may not find it the next run either? So we could mark the worker "bad" for the job? Using RepositoryUtils.AddBadSlaveForJob...

As such - we may do a follow-up PR or issue to be more specific with our raised errors. For example raising a dedicated error for when we should fail the job.

class AYONJobConfigurationError(RuntimeError):
    """An error of which we know when raised that the full job should fail
    and retrying by other machines will be worthless.

    This may be the case if e.g. not the fully required env vars are configured
    to inject the AYON environment.
    """

Or a dedicated error when we should mark the Worker as bad:

class AYONWorkerBadForJobError(RuntimeError):
    """When raised, the worker will be marked bad for the current job.

    This should be raised when we know that the machine will most likely
    also fail on subsequent tries.
    """

However - a server timeout should allow the job to just error and let it requeue with the same worker.. so it can try again. So a lot of error attributed to not being able to access the server itself should not generate such a hard failure.

fabiaserra commented 4 months ago

I can confirm this works in Linux

BigRoy commented 4 months ago

Works for me on Windows - @kalisp @iLLiCiTiT ready for merge when you've decided next steps. :)

BigRoy commented 3 months ago

A client also ended up reporting this worked for them and actually helped them - so, on to merging! :)