wustl-oncology / analysis-wdls

Scalable genomic analysis pipelines, written in WDL
MIT License
5 stars, 11 forks

Explore streaming sequence to Fastq and/or trimFastq steps #36

Open chrisamiller opened 2 years ago

chrisamiller commented 2 years ago

These two steps eat up a lot of the disk cost:

sequenceToFastq: 0.139237127, local-disk 552 SSD (two of those)
trimFastq: 0.792021773, local-disk 1148 SSD (two of those)

1) sequenceToFastq runs even when FASTQs are given and just makes a copy, which is wasteful. Can we make that step execute conditionally? (Does WDL support that?)

2) trimFastq could probably be piped directly into bwa, sacrificing some composability for speed. There is already optional adapter trimming in sequence_align_and_tag.wdl, so it seems we could add other trimming there as well (or in the HISAT or STAR steps for RNA).
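To make point 2 concrete, here is a rough sketch of what a streamed trim-and-align task could look like. It is not the existing trimFastq or sequence_align_and_tag.wdl code: the task name, inputs, and output file name are hypothetical, and cutadapt is only a stand-in for whichever trimmer the pipeline really uses. The idea it shows is that the trimmer writes interleaved reads to stdout and bwa mem reads them from stdin with -p, so no trimmed FASTQ is ever written to disk.

```wdl
version 1.0

# Hypothetical task: trim and align in one streamed step.
# cutadapt is an assumed stand-in trimmer, not the pipeline's own.
task trimAndAlign {
  input {
    File reads1
    File reads2
    # A real task would also declare the bwa index files as inputs
    # so that they are localized next to the reference.
    File reference
    String adapter
    Int cores = 4
  }

  command <<<
    set -o errexit -o pipefail -o nounset
    # cutadapt writes interleaved trimmed reads to stdout (its report
    # goes to stderr); bwa mem -p treats stdin as interleaved pairs.
    cutadapt --interleaved -a ~{adapter} -A ~{adapter} ~{reads1} ~{reads2} \
      | bwa mem -p -t ~{cores} ~{reference} - \
      | samtools view -b -o aligned.bam -
  >>>

  output {
    File aligned_bam = "aligned.bam"
  }
}
```

The trade-off is the one noted above: disk usage drops because the intermediate FASTQs disappear, but the trimming step can no longer be reused or cached on its own.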

Layth17 commented 1 year ago

Referencing this video: https://www.youtube.com/watch?v=13YfaNPv088

WDL can indeed do conditional execution; however, this seems to be implemented in WDL version 1.1 (and not our current 1.0).

1.1: https://github.com/openwdl/wdl/blob/main/versions/1.1/SPEC.md#conditional-if-block
1.0: https://github.com/openwdl/wdl/blob/main/versions/1.0/SPEC.md

Boolean flag # some flag for existence or non-existence

if ( flag ) {  
   call some_task { ... }
}

Edit: the conditional if-block section does not have the ✨ symbol (which the 1.1 spec uses to mark new features) next to it, so it may well be in version 1.0 as well...
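For reference, a minimal self-contained sketch of a conditional call written against version 1.0. The task, workflow, and flag names are made up for illustration. It also shows one consequence that matters later: outside the if-block, the output of a conditionally called task has an optional type.

```wdl
version 1.0

# Hypothetical task, present only so there is something to call.
task say_hello {
  command <<<
    echo "hello"
  >>>
  output {
    String greeting = read_string(stdout())
  }
}

workflow conditional_demo {
  input {
    Boolean run_it = true  # hypothetical flag
  }

  # The call is executed only when run_it is true.
  if (run_it) {
    call say_hello
  }

  output {
    # Seen from outside the if-block, the call output is String?, not String.
    String? greeting = say_hello.greeting
  }
}
```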

Layth17 commented 1 year ago

Ok, so I tested the following (syntax-wise, version 1.0) and it works:

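workflow sequenceToTrimmedFastq {
  input {
    ...
    # NOTE: File-to-Boolean coercion is unconfirmed; defined(...) is the
    # documented way to test whether an optional value was supplied.
    Boolean bfastq1 = unaligned.sequence.fastq1
    Boolean bfastq2 = unaligned.sequence.fastq2
  }

  # Convert only when the FASTQs were NOT both supplied.
  if (!(bfastq1 && bfastq2)) {
    call stf.sequenceToFastq as sequenceToFastq {
      input:
      ....
    }
  }

  call tf.trimFastq {
    input:
    # Use the converted FASTQs if the call ran, else the supplied ones.
    reads1=select_first([sequenceToFastq.read1_fastq, unaligned.sequence.fastq1]),
    reads2=select_first([sequenceToFastq.read2_fastq, unaligned.sequence.fastq2]),
    ...
  }
}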

We would need to use select_first([]) to choose between values that are only optionally generated.

I can only assume that assigning a File to a Boolean yields false when the file does not exist, but I cannot confirm that yet.
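One way to sidestep the File-to-Boolean question is the defined() function from the WDL 1.0 standard library, which returns a Boolean for an optional value. Below is a minimal sketch under some loud assumptions: the struct is flattened into three optional File inputs, the sequenceToFastq task is a hypothetical stand-in for the real one, and the workflow name is invented. The conversion runs only when the FASTQs are missing, and select_first() then picks whichever pair exists.

```wdl
version 1.0

# Hypothetical stand-in for the real sequenceToFastq task.
task sequenceToFastq {
  input {
    File bam
  }
  command <<<
    samtools fastq -1 read1.fastq -2 read2.fastq ~{bam}
  >>>
  output {
    File read1_fastq = "read1.fastq"
    File read2_fastq = "read2.fastq"
  }
}

workflow sequenceToTrimmedFastqSketch {
  input {
    # Assumed flattening of unaligned.sequence into optional inputs.
    File? bam
    File? fastq1
    File? fastq2
  }

  # True only when both FASTQs were supplied.
  Boolean have_fastqs = defined(fastq1) && defined(fastq2)

  # Convert only when the FASTQs are missing.
  if (!have_fastqs) {
    call sequenceToFastq { input: bam = select_first([bam]) }
  }

  # Prefer the converted reads if the call ran, else the supplied FASTQs.
  File reads1 = select_first([sequenceToFastq.read1_fastq, fastq1])
  File reads2 = select_first([sequenceToFastq.read2_fastq, fastq2])

  output {
    File chosen_reads1 = reads1
    File chosen_reads2 = reads2
  }
}
```

If neither a BAM nor both FASTQs are supplied, select_first([bam]) fails at runtime, which is arguably the right behavior for this sketch.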

I think this is what you are looking for here, @chrisamiller?