yuch7 / cwlexec

A new open source tool to run CWL workflows on LSF

Don't *copy* result files. #63

Open vinjana opened 3 years ago

vinjana commented 3 years ago

Currently, cwlexec copies files from the work directories to the output directory (here if I am correct).

If possible, avoid copying output files. These files can be huge (e.g. ours are usually around 100 GB, but they can be much bigger; this is common with human whole genome sequencing data), and copying them is a waste of space and time. Space may not be a problem, because the copies can be deleted after processing. Time is more of a problem on network-based storage when processing times must be short (e.g. for routine cancer diagnostics).

Alternatives (at least on POSIX filesystems) include symlinking and hardlinking.

I am not sure what the standard says about it, but even if the standard says "do copy", for some of our workflows we'd rather drop CWL than accept copies.

It may be desirable to let the user choose between copying, symlinking and hardlinking. Replacing a symlink with the file it points to is only a small problem, so symlinking seems to be a reasonable default.

For both linking approaches, file ownership may be more of an issue, because the access rights are identical for all hard links and symlinks to the same data.
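To make the three options concrete, here is a minimal sketch in Python. It is not cwlexec code (cwlexec itself is written in Java), and the names `stage_output` and `materialize` are hypothetical helpers invented for this illustration. It assumes a POSIX filesystem and that the destination path does not exist yet. The hardlink mode falls back to a copy when linking fails, which happens for instance when work and output directories are on different filesystems.

```python
import os
import shutil
from pathlib import Path


def stage_output(src, dest, mode="symlink"):
    """Make the file at `src` available at `dest` (hypothetical helper).

    mode is one of "symlink", "hardlink" or "copy".
    Raises FileExistsError if `dest` already exists.
    """
    src = Path(src).resolve()
    dest = Path(dest)
    dest.parent.mkdir(parents=True, exist_ok=True)

    if mode == "symlink":
        # Cheap: only a directory entry pointing at the absolute source path.
        os.symlink(src, dest)
    elif mode == "hardlink":
        try:
            # A second name for the same data; no bytes are copied.
            os.link(src, dest)
        except FileExistsError:
            raise
        except OSError:
            # Hard links cannot cross filesystems, so fall back to a copy.
            shutil.copy2(src, dest)
    elif mode == "copy":
        shutil.copy2(src, dest)
    else:
        raise ValueError(f"unknown staging mode: {mode!r}")
    return dest


def materialize(link):
    """Replace a symlink with a real copy of the file it points to."""
    link = Path(link)
    if not link.is_symlink():
        return link
    target = link.resolve(strict=True)
    tmp = link.with_name(link.name + ".tmp")
    shutil.copy2(target, tmp)
    # Renaming over the link path replaces the link itself, not its target.
    os.replace(tmp, link)
    return link
```

With something like this, symlinking can stay the default, and a user who later needs a self-contained output directory can call `materialize` on individual results and pay the copy cost only for those files.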

mr-c commented 3 years ago

> I am not sure what the standard says about it, but even if the standard says "do copy", for some of our workflows we'd rather drop CWL than accept copies.

Hello @vinjana

The CWL standards do not forbid symlinking or hardlinking :-)