antirotor opened this issue 3 years ago
pymongo can use (and internally is using) operations that are sent to mongo in bulks. Right now we're creating a version or updating an existing version directly in mongo at the beginning of integration, but we could instead store an `InsertOne` or `UpdateOne` operation which can be executed at the end of integration with the method `bulk_write` (documentation). We can generate the `_id` of the new version document in code with `ObjectId()`, and an existing version can be matched by its `parent` and same `name` (version number).

Database is one thing, the file system another. Atomicity there is much harder to achieve, but it is equally important, especially with big renders, etc.
Related to #416
This hit the fan today, when the ftrack integrator failed on publishing renders on the farm because of an invalid SSL certificate. All work almost done, files published, version created in Avalon, and that last bit fails. So Deadline will try again and again, creating more versions on disk and in the database.
If you run two processes publishing with the same subset, asset and project, one process may overwrite files of the other.
We should probably introduce some "soft" locks. My idea is that:
also depends on #1355
We need to think about how many tries (if more than one) Deadline - or farms in general - should have on a publishing job.
We had previously set this to 3 total errors on the job and every machine got at most 1 try and then got blacklisted immediately. Worked a charm for us.
The logic added in this commit integrates files as `.tmp` files during integration and, upon no errors, tries to rename the individual files. However, as came up on Discord, renaming those `.tmp` files can produce errors with hardlinks. That particular issue does not occur if we instead write to a `.tmp` folder and only rename the folder instead of the direct hardlinked filepath.
So I was thinking of proposing something along the lines of a transactional temp directory. Here's some quick pseudocode I threw together:
```python
# Consider this pseudocode - I wrote this without testing/running at all!
import contextlib
import os
import shutil
import uuid


@contextlib.contextmanager
def transaction_dir(path):
    """Use temporary directory that we can roll back if errors occur.

    This yields a temporary folder next to the provided `path` which
    is basically `path + ".tmp.{tmp_id}"`. This folder can be used to
    position and copy files to during the context. After the context,
    if no errors occurred, the folder is renamed to `path`.

    If errors occur it tries to behave like a transactional rollback
    and removes the temporary folder. Actual files MOVED to the
    transaction dir are not MOVED back. This context manager only
    tries to manage the folder itself and move its content on success
    or remove the folder when errors occurred.
    """
    # Final path is not allowed to exist - we do not overwrite
    assert not os.path.exists(path), "Path already exists: %s" % path

    # Create the folder with a .tmp suffix plus a short unique id,
    # e.g. /folder.tmp.1a2b3c4d
    tmp = path + ".tmp." + uuid.uuid4().hex[:8]
    os.makedirs(tmp)

    def _rollback():
        # Before Py3.8, on Windows, shutil.rmtree deletes the contents
        # of a directory junction before removing the junction.
        # Source: https://docs.python.org/3/library/shutil.html#shutil.rmtree
        shutil.rmtree(tmp)

    try:
        # Yield the temporary folder path
        yield tmp
    except Exception:
        # Rollback: remove the tmp folder, then re-raise the error
        _rollback()
        raise
    else:
        # Success: rename the folder to the final location
        try:
            os.rename(tmp, path)
        except Exception:
            # If the final rename of the folder fails we should still
            # roll back and report the error
            _rollback()
            raise
        # Just for sake of sanity, confirm the temp folder is removed
        # and the expected final path exists
        assert not os.path.exists(tmp)
        assert os.path.exists(path)


with transaction_dir("C:/test/v001") as folder:
    # folder is the path to the v001.tmp.<id> directory
    # that gets renamed to v001 after this context
    # >>> copy_file(src, os.path.join(folder, "myfile.txt"))
    pass
```
Thoughts?
Like anything filesystem related, this is a non-trivial beast to tackle, since a filesystem is not transaction based like a database. But this might at least solve the hardlink issue during integration that some were facing on Windows every now and then.
I think that the tmp directory approach is definitely closer to atomicity than individual files. But still, as every publish plugin can do whatever it wants, it is hard to enforce. One way would be that plugins inherit from TransactionalPlugin, which would define common methods to register generated files, added db data and changes to files/db. New files could be "batch grouped" into tmp folders and those renamed when "committing" the transaction. More difficult would be the handling of changes, rather than adding/removing stuff.
Just wondering, how much more trivial would a rollback become for the AYON database backend?
### Problem

When the integrator is doing its job it goes through a lot of operations - writing to the database, copying files, etc. The problem is when something fails: it will leave stuff in an invalid state.
### Case study

Publishing renders on the farm. It will create versions in the database and then start copying files. For some reason (it really doesn't matter if it is a missing file or invalid permissions) copying of files fails, the integrator crashes, and then we have versions in the database and partial stuff on disk.
### Solution

Parts of it can be solved by validators (checking that we have all files before integrating them), but some errors can occur during integration itself. Those should fail gracefully, rolling back the whole "session".

The integrator should create a journal of operations. If one of them fails, the next action should go over this journal and revert everything back.
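As a sketch, such a journal could simply pair each operation with a callable that reverts it, and replay those in reverse order on failure (all names here are illustrative):

```python
class Journal:
    """Minimal journal sketch: record each performed operation
    together with a callable that undoes it; on failure, undo
    everything in reverse order, like a transaction log replayed
    backwards."""

    def __init__(self):
        self._entries = []  # (description, undo_callable) pairs

    def record(self, description, undo):
        """Call this right after an operation succeeded."""
        self._entries.append((description, undo))

    def rollback(self):
        """Undo all recorded operations, newest first.

        Returns the descriptions in the order they were reverted.
        """
        reverted = []
        while self._entries:
            description, undo = self._entries.pop()
            undo()
            reverted.append(description)
        return reverted
```

For example, copying a file would record an undo that deletes the copy, and creating a version document would record an undo that deletes it from the database again.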
[cuID:OP-1192]