Integrator: Atomicity of operation

antirotor commented 3 years ago

Problem

When integrator is doing its job it goes through lot of operations - writing to database, copying files, etc. Problem is when something fails. It will leave stuff in invalid state.

Case study

Publishing renders on farm. It will create versions in database and then starts copying. For some reason (but it really doesn't matter if it is missing file or invalid permissions) copying of files fails, integrator crashes and then we have versions in database and partial stuff on disk.

Solution

Parts of it can be solved by validators (checking if we have all files before integrating them) but some errors can occur during integration itself. Those should fail gracefully, rolling back whole "session".

Integrator should create journal of operations. If one of them failed then next action should go over this journal and revert everything back.

[cuID:OP-1192]

iLLiCiTiT commented 3 years ago

Mongo modification abilities

pymongo can use (and internally is using) operations that are sent to mongo in bulks
it is possible to store operations into list and then tell to execute them in mongo

Example

Right now we're creating version or updating existing version directly in mongo at the beginning of integration, but we can store InsertOne or UpdateOne operation which can be executed at the end of integration with method bulk_write (documentation)

this requires to create _id of new version document in code with ObjectId()
until integration is finished you can't tell that somebody is already trying to publish that version
- if you would run 2 processes to publish with same subset, asset and project then one process may override files of the second and in database will be 2 version documents with same parent and same name (version number)

antirotor commented 3 years ago

Database is one thing, file system other. Atomicity there is much harder to achieve - but it is equally important, especially with big renders, etc.

antirotor commented 3 years ago

Related to #416

antirotor commented 3 years ago

This hit the fan today, when ftrack integrator fails on publishing renders on farm because of invalid SSL certificate. All work almost done, files published, version created in Avalon and that last one bit fails. So deadline will try again and again creating more versions on disk and database.

antirotor commented 3 years ago

if you would run 2 processes to publish with same subset, asset and project then one process may override files of the second

We should probably introduce some "soft" locks. My idea is that:

When publishing reaches integrator, lock in db should stop integration of possible concurrent publishes of the same subset.
Journal of all file io operations is created and live to the end of session. If failure is detected at the end, this journal is reversed and files returned to pre-integration state.
We need to think about how many tries (if more then one) Deadline - or farms in general - should have on publishing job.

mkolar commented 3 years ago

also depends on #1355

BigRoy commented 2 years ago

We need to think about how many tries (if more then one) Deadline - or farms in general - should have on publishing job.

We had previously set this to 3 total errors on the job and every machine got at most 1 try and then got blacklisted immediately. Worked a charm for us.

BigRoy commented 2 years ago

The logic added in this commit integrates files as .tmp files during integration, and upon no errors try to rename the individual files. However, as came up on Discord renaming those .tmp files can produce errors with hardlinks. That particular issue does not occur if we instead write to .tmp folder and only rename the folder instead of the direct hardlinked filepath.

So I was thinking of proposing something along the lines of a transactional temp directory. Here's some quick pseudocode I threw together:

# Consider this pseudocode - I wrote this without testing/running at all!
import os
import sys
import six
import contextlib

def transaction_dir(path):
    """Use temporary directory that we can rollback if errors occur.

    This yields a temporary folder next to the provided `path` which
    is basically `path + ".tmp{tmp_id}"`. This folder can be used to
    position and copy files to during the context. After the context
    if no errors occurred the folder is renamed to `path`.

    If errors occur it tries to behave a transactional rollback and
    remove the temporary folder. Actual files MOVED to the
    transaction dir are not MOVED back. This context manager only
    tries to manage the folder itself and move its content on success
    or remove the folder when errors occurred.

    """

    # Final path is not allowed to exist - we do not overwrite
    assert not os.path.exists(path), "Path already exists: %s" % path

    # Create the folder with .tmp and a tiny uuid or timestamp
    # e.g. /folder.tmp.202212001
    # todo: Use an actual uuid or timestamp
    tmp = path + ".tmp.202212001"
    create_folder(tmp)

    def _rollback():
        # Before Py3.8
        # On Windows, shutil.rmtree delete the contents of a 
        # directory junction before removing the junction.
        # Source: https://docs.python.org/3/library/shutil.html#shutil.rmtree
        shutil.rmtree(tmp)
        six.reraise(*sys.exc_info())

    try:
        # Yield the temporary folder path
        yield tmp
    except Exception:
        # Rollback: remove the tmp folder
        _rollback()
    else:
        # Success: Rename the folder to final location
        try:
            os.rename(tmp, path)
        except Exception:
            # If the final rename of the folder fails we should still
            # roll back and report the error
            _rollback()

    finally:
        # Confirm the temp folder is removed
        # and the expected final path exists
        # just for sake of sanity
        try:
            assert not os.path.exists(tmp)
            assert os.path.exists(path)
        except Exception:
            _rollback()

with transaction_dir("C:/test/v001") as folder:
    # folder is the path to the v001.tmp directory
    # that gets renamed to v001 after this context
    # >>> copy_file(src, os.path.join(folder, "myfile.txt"))
    pass

Thoughts?

Like anything filesystem related this is a non-trivial beast to tackle with a filesystem not being transaction based like a database. But this might at least solve the hardlink issue during integration some were facing on Windows every now and then.

antirotor commented 2 years ago

I think that the tmp directory approach is definitely closer to atomicity then individual files. But still as every publish plugin can do whatever it wants it is hard to enforce it. One way would be that plugins inherit from TransactionalPlugin and that will define common methods to register generated file, added db data and changes to files/db. New file could be "batch grouped" to tmp folders and those renamed when "commiting" transactions. More difficult would be handling of changes then adding/removing stuff.

BigRoy commented 1 year ago

Just wondering, how much more trivial would a rollback become for the AYON database backend?

ynput / OpenPype