wustl-oncology / cloud-workflows

Infrastructure and tooling required to get genomic workflows running in the cloud

cloudize-workflow.py - Handle Directory input types #2

Closed johnmaruska closed 2 years ago

johnmaruska commented 3 years ago

Problem: The cloudize-workflow.py script, which modifies the input YAML and uploads the relevant files, doesn't handle Directory inputs.

Directory inputs have the problem of unknown size, so a naive approach would likely result in failing runs: the CWL must be modified to increase tmpdirMin on each task using the directory in order to size the compute worker in GCP to download those inputs. Additionally, a directory may be far larger than an individual run actually needs (e.g. vep_cache_dir), so downloading all of it wastes time and disk.
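One piece of the sizing problem above could be handled by measuring the directory at cloudize time and deriving a tmpdirMin value. This is a minimal sketch, not part of the actual script; the `safety_factor` and the 1 GiB floor are assumptions:

```python
from pathlib import Path

def directory_size_bytes(path):
    """Total size of all regular files under `path`."""
    return sum(p.stat().st_size for p in Path(path).rglob("*") if p.is_file())

def tmpdir_min_mib(path, safety_factor=2.0):
    """Suggest a CWL tmpdirMin (in MiB) for a task that stages this directory.

    safety_factor leaves headroom for intermediate files; the 1024 MiB
    floor is an arbitrary minimum, not a Cromwell/GCP requirement.
    """
    size_mib = directory_size_bytes(path) / (1024 * 1024)
    return max(1024, int(size_mib * safety_factor))
```

Each task's ResourceRequirement would then need to be patched with the computed value, which is exactly the per-task CWL editing the issue describes.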

We need to figure out how we want to handle these directory inputs, and once that's decided, modify the script so the YAML reflects that approach.

Could potentially use reference disk images, or pre-made GCP directories with a mapping to point to them.
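The pre-made-directories-with-a-mapping idea might look something like this sketch. The mapping table and the local path key are hypothetical; only the VEP_cache bucket path comes from the workaround below:

```python
# Hypothetical lookup from local Directory paths to pre-uploaded cloud prefixes.
PREMADE_DIRECTORIES = {
    "/opt/vep_cache": "gs://griffith-lab-cromwell/input_data/VEP_cache",
}

def cloudize_directory_input(value):
    """Rewrite a parsed CWL Directory input to point at a pre-made cloud copy.

    `value` is the parsed YAML node, e.g.
    {"class": "Directory", "path": "/opt/vep_cache"}.
    Non-Directory values pass through unchanged.
    """
    if isinstance(value, dict) and value.get("class") == "Directory":
        local = value.get("path") or value.get("location")
        if local in PREMADE_DIRECTORIES:
            return {"class": "Directory", "location": PREMADE_DIRECTORIES[local]}
        raise ValueError(f"no pre-made cloud copy known for directory {local!r}")
    return value
```

This sidesteps the upload step entirely but only works for directories someone has already staged, so unknown directories still need a fallback.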

The workaround for the time being is to manually upload the relevant files for the directory and manually modify the input (e.g. a subset of vep_cache_dir has been uploaded to griffith-lab-cromwell/input_data/VEP_cache).

johnmaruska commented 2 years ago

This stopped being relevant when we switched to WDL, which does not have a Directory type. Converting inputs from CWL to WDL already requires some manual intervention, so there isn't much need to handle this conversion.

Optionally, the Directory => zip change could be automated, including zipping the directory and uploading the archive.
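That zip-and-upload automation could be sketched roughly as below. This is a hypothetical helper, not existing code in the repo; it assumes the gsutil CLI is installed and authenticated, and the bucket prefix is made up:

```python
import shutil
import subprocess
from pathlib import Path

def zip_and_upload(directory, bucket_prefix, dry_run=True):
    """Zip `directory` and upload the archive to a GCS prefix.

    Returns (local_zip_path, remote_uri). With dry_run=True the upload
    is skipped, which is useful for testing the rewrite logic locally.
    The corresponding WDL task would take the zip as a File input and
    unpack it itself.
    """
    directory = Path(directory)
    # Creates <directory>.zip next to the directory.
    zip_path = shutil.make_archive(str(directory), "zip", root_dir=directory)
    remote = f"{bucket_prefix.rstrip('/')}/{Path(zip_path).name}"
    if not dry_run:
        # Assumes gsutil is on PATH and credentials are configured.
        subprocess.run(["gsutil", "cp", zip_path, remote], check=True)
    return zip_path, remote
```

The trade-off versus the pre-made-directory mapping is that every run re-uploads the full directory, but no manual staging is required.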