wustl-oncology / analysis-wdls

Scalable genomic analysis pipelines, written in WDL

use localization_optional when possible #64

Open chrisamiller opened 2 years ago

chrisamiller commented 2 years ago

There is an option that allows large files (like bams) to be streamed from a gs:// address when the tool supports doing so: https://cromwell.readthedocs.io/en/stable/optimizations/FileLocalization/

We would like to implement this wherever possible to save on disk usage and on the time spent localizing large bam files. (Mutect and GATK HaplotypeCaller are two obvious places to start.)
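For reference, here is a minimal sketch of what this could look like for a HaplotypeCaller task. The task name, input names, output name, and the docker_image input are hypothetical and not taken from this repo. It assumes a backend that supports the optimization (e.g. Cromwell on Google Cloud) and a GATK4 image, since GATK4 can read gs:// paths directly.

```wdl
version 1.0

# Hypothetical task: all names here are illustrative only.
task haplotype_caller_streaming {
  input {
    File bam
    File bai
    File reference
    File reference_fai
    File reference_dict
    String docker_image  # assumption: an image with GATK4 on the PATH
  }

  parameter_meta {
    # Ask Cromwell not to copy these files to local disk. The command
    # then receives the gs:// path instead of a local path.
    bam: { localization_optional: true }
    bai: { localization_optional: true }
  }

  command <<<
    set -euo pipefail
    # The index is passed explicitly because it is not localized
    # next to the bam.
    gatk HaplotypeCaller \
      -R ~{reference} \
      -I ~{bam} \
      --read-index ~{bai} \
      -O output.g.vcf.gz \
      -ERC GVCF
  >>>

  runtime {
    docker: docker_image
  }

  output {
    File gvcf = "output.g.vcf.gz"
  }
}
```

The reference files are still localized as usual here. Only the large bam and its index are left in the bucket, which is where most of the disk and copy time goes.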

Other options could include a "hack" to stream the bam through samtools first. Varscan in particular seems amenable to this, as it is parallelized in chunks, so we could run samtools view <bam> <region> > tmp.bam && samtools mpileup tmp.bam | java -jar varscan.jar ... (or maybe mpileup can even stream directly?)
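A rough sketch of the samtools variant, under several assumptions: the image has samtools built with GCS support and gcloud available for credentials, the .bai index sits next to the bam in the bucket, and mpileup2snp stands in for whichever Varscan subcommand the real step uses. The task name, input names, and the jar path input are hypothetical.

```wdl
version 1.0

# Hypothetical task: all names here are illustrative only.
task varscan_region_streaming {
  input {
    File bam
    File reference
    File reference_fai
    String region        # e.g. "chr1:1-1000000"
    String varscan_jar   # assumption: path to the Varscan jar in the image
    String docker_image  # assumption: image with samtools, java, gcloud
  }

  parameter_meta {
    # The bam stays in the bucket; samtools fetches only the region.
    bam: { localization_optional: true }
  }

  command <<<
    set -euo pipefail
    # Assumption: htslib reads GCS credentials from this variable.
    export GCS_OAUTH_TOKEN=$(gcloud auth print-access-token)

    # Pull just this chunk's reads from the remote bam.
    samtools view -b "~{bam}" "~{region}" > tmp.bam

    # Pile up the small local bam and pipe it into Varscan.
    samtools mpileup -f ~{reference} tmp.bam \
      | java -jar ~{varscan_jar} mpileup2snp --output-vcf 1 \
      > region.vcf
  >>>

  runtime {
    docker: docker_image
  }

  output {
    File vcf = "region.vcf"
  }
}
```

Whether mpileup can read the remote bam directly (skipping tmp.bam) would need to be tested.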

chrisamiller commented 2 years ago

some chatter that might be useful here: https://griffithlab.slack.com/archives/C03CERB39A4/p1655229025691229

tmooney commented 1 year ago

To record some experimentation: since Picard doesn't currently support cloud URLs (see also: broadinstitute/picard#1653), I tried piping a streamed input into a few of the Picard QC steps. The runtimes were similar to those with normal localization: slightly shorter in some cases and slightly longer in others. Therefore I'd say that, for now, it's not worth implementing localization_optional for those steps.

chrisamiller commented 1 year ago

Good to know - thanks, Tom!

malachig commented 3 weeks ago

Another point to consider: over time we have encountered sporadic failures in the steps that use localization_optional, and have felt the need to turn it off in a few places to improve reliability. Unfortunately, when it does fail, it sometimes does so three times in a row.