Open chrisamiller opened 2 years ago
some chatter that might be useful here: https://griffithlab.slack.com/archives/C03CERB39A4/p1655229025691229
To record some experimentation: since Picard doesn't currently support cloud URLs (see also: broadinstitute/picard#1653) I tried piping a streamed input into a few of the Picard QC steps. This ended up taking a similar amount of time to not doing this--in some cases slightly less, and in some slightly more. Therefore I'd say for now it's not worth implementing localization_optional for those steps.
Good to know - thanks, Tom!
Another point to consider: over time we have encountered sporadic failures in the steps that use localization_optional, and have felt the need to turn it off in a few places to improve reliability. Unfortunately, when it does fail, it sometimes does so three times in a row.
There is an option that allows large files (like bams) to be streamed from a gs:// address when the tool supports doing so: https://cromwell.readthedocs.io/en/stable/optimizations/FileLocalization/
We would like to implement this wherever possible to save on disk usage and time spent localizing large bam files. (Mutect and GATK HaplotypeCaller are two obvious places to start.)
Other options could include a "hack" to stream the bam through samtools first. Varscan, in particular, seems amenable to this, as it is parallelized in chunks, so we could run:
samtools view region >tmp.bam && mpileup tmp.bam | java -jar varscan.jar ...
(or maybe even mpileup can stream directly?)
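To make the per-chunk streaming idea a bit more concrete, here is a minimal sketch that only builds the command line for one region, without running anything. The function name `region_pipeline`, the bucket path, the reference name, and the choice of `mpileup2snp` as the Varscan subcommand are all placeholders for illustration, not what the pipeline actually uses. It also assumes the samtools build can read gs:// URLs with the BAM index available next to the file, which would need to be checked.

```python
import shlex


def region_pipeline(bam_url, region, reference,
                    varscan_jar="varscan.jar", varscan_cmd="mpileup2snp"):
    """Build (but do not run) a shell pipeline for one region of a BAM.

    All argument values are hypothetical; varscan_cmd in particular is an
    assumption and should be swapped for whatever subcommand the step uses.
    """
    # Extract only the requested region as BAM; this needs an index for the BAM.
    view = ["samtools", "view", "-b", bam_url, region]
    # "-" asks samtools mpileup to read the BAM from standard input.
    pileup = ["samtools", "mpileup", "-f", reference, "-"]
    # Varscan reads the pileup from standard input when no file is given.
    varscan = ["java", "-jar", varscan_jar, varscan_cmd]
    # Quote each argument, then join the three steps with pipes.
    return " | ".join(shlex.join(step) for step in (view, pileup, varscan))


if __name__ == "__main__":
    # One command per chunk, mirroring how Varscan is parallelized by region.
    for region in ("chr1:1-1000000", "chr1:1000001-2000000"):
        print(region_pipeline("gs://example-bucket/sample.bam", region, "ref.fa"))
```

Piping straight from `samtools view` into `samtools mpileup` avoids the intermediate tmp.bam, at the cost of making a failure in any one stage harder to retry on its own.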