johnbradley commented 3 years ago

The pipeline requires all input files to be in the repo's base directory. Intermediate and output files are also stored in the repo's base directory. Because of this, it is not safe to run more than one pipeline at a time.

johnbradley commented 3 years ago

@wodanaz I was wondering how you wanted the pipeline to work with regard to the input, intermediate, and output files. The current logic assumes all files are in the current directory. Is this an intentional design that you want to maintain? Or would creating a temporary directory to hold the intermediate files be ok? Also, which files do you consider to be the output files? Thanks.

wodanaz commented 3 years ago

I think it was not intentional because we didn't know that scaling and automating was going to be a thing. Having a temp directory to hold intermediate data will make everything very nice. The most important output files will be the masked.fasta, the gatk.tab file, the *gatk.filt.vcf.gz, and table.sort.tab.

I am realizing, from how we are looking at the first set of data, that for this experiment the generation of the *filt.tab and depth.tab tables is not necessary, because they will be enormous spreadsheets and the masked*fasta files already have that info.

Thanks!
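The temporary-directory idea can be sketched with `mktemp -d`. This is only an illustration, not the pipeline's actual code: the function name `make_workdir` and the `escape-variants` prefix are hypothetical, and the example uses `/tmp` purely for demonstration.

```shell
#!/usr/bin/env bash
# Sketch only: make_workdir and the "escape-variants" prefix are hypothetical names.

# Create a unique temporary directory under the given parent
# (default: current directory) and print its path.
make_workdir() {
    local parent="${1:-.}"
    mktemp -d "${parent%/}/escape-variants.XXXXXX"
}

# Two runs get separate directories, so their intermediate files cannot collide.
# /tmp is used here only for the demo.
RUN_A=$(make_workdir "${TMPDIR:-/tmp}")
RUN_B=$(make_workdir "${TMPDIR:-/tmp}")
echo "run A intermediates: $RUN_A"
echo "run B intermediates: $RUN_B"

# Remove the directories once the runs are finished.
rm -rf "$RUN_A" "$RUN_B"
```

Because each run writes its intermediate files into its own directory, two pipelines started at the same time no longer share file names in the repo's base directory.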

johnbradley commented 3 years ago

> I am realizing, from how we are looking at the first set of data, that for this experiment the generation of the *filt.tab and depth.tab tables is not necessary, because they will be enormous spreadsheets and the masked*fasta files already have that info.

So can we remove the "GATK 11. Compile all tab tables into one for depth and genotype" part from the pipeline?

wodanaz commented 3 years ago

Yes, we can remove it for now. However, I have a question: can we add a yes-or-no argument to decide that at the beginning of the experiment? I am suggesting it because of the experiment where we search for escape variants in the Sempowsky Lab. It would be nice to generalize this pipeline to handle both surveillance and experimental runs.

Just like we previously discussed, where we would run:

./run-escape-variants.sh --genome=MT246667.fasta --surveillance=(yes/no)
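A yes/no switch like the one proposed above could gate the optional table-compilation step roughly as follows. This is a hedged sketch: `should_compile_tables` is a hypothetical helper, and the choice that surveillance runs are the ones that skip the large tables is an assumption read from this thread, not confirmed behavior.

```shell
#!/usr/bin/env bash
# Sketch only: should_compile_tables is a hypothetical helper.
# ASSUMPTION: surveillance runs skip the large depth/genotype tables.

# Returns 0 if the tables should be compiled, 1 if they should be skipped,
# and 2 if the value is neither "yes" nor "no".
should_compile_tables() {
    local surveillance="$1"
    case "$surveillance" in
        yes) return 1 ;;
        no)  return 0 ;;
        *)   echo "error: expected yes or no, got '$surveillance'" >&2
             return 2 ;;
    esac
}

for mode in yes no; do
    if should_compile_tables "$mode"; then
        echo "surveillance=$mode: compile depth and genotype tables"
    else
        echo "surveillance=$mode: skip table compilation"
    fi
done
```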

johnbradley commented 3 years ago

Will do. Unfortunately, long options aren't portable in bash (the getopts builtin only handles short options), so I'm going to use short arguments.

@wodanaz What do you think about the following command line options?

Runs a Slurm pipeline determining escape variants in fastq.gz files.

usage: ./run-escape-variants.sh -g genome -i inputdir [-o outdir] [-w workdir] [-e email] [-s]
options:
-g genome    *.fasta genome to use - required
-i inputdir  directory containing *.fastq.gz files to process - required
-o outdir    directory to hold output files and logs directory - defaults to current directory
-w workdir   directory that will hold a tempdir - defaults to current directory
-e email     email address to notify on pipeline completion - defaults to empty (no email sent)
-s           runs surveillance mode - default is run experimental mode

NOTE: The input genome must first be indexed by running ./setup-variants-pipeline.sh.
NOTE: The inputdir, outdir, and workdir must be directories shared across the slurm cluster.

So the equivalent command for the current functionality is:

./run-escape-variants.sh -g MT246667.fasta -i .
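The options above map naturally onto the bash `getopts` builtin. The following is a minimal sketch under stated assumptions, not the script that was eventually merged: the function name `parse_args`, the variable names, and the `Y`/`N` encoding of the surveillance flag are all hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: parse_args and the variable names are hypothetical.

parse_args() {
    # Defaults taken from the proposed usage text.
    GENOME="" INPUTDIR="" OUTDIR="." WORKDIR="." EMAIL="" SURVEILLANCE="N"
    local OPTIND=1 opt
    while getopts "g:i:o:w:e:s" opt; do
        case "$opt" in
            g) GENOME="$OPTARG" ;;
            i) INPUTDIR="$OPTARG" ;;
            o) OUTDIR="$OPTARG" ;;
            w) WORKDIR="$OPTARG" ;;
            e) EMAIL="$OPTARG" ;;
            s) SURVEILLANCE="Y" ;;
            *) return 1 ;;   # unknown option or missing argument
        esac
    done
    # -g and -i are required.
    if [ -z "$GENOME" ] || [ -z "$INPUTDIR" ]; then
        echo "error: -g genome and -i inputdir are required" >&2
        return 1
    fi
}

# Equivalent of the current functionality: everything in the current directory.
parse_args -g MT246667.fasta -i . &&
    echo "genome=$GENOME inputdir=$INPUTDIR outdir=$OUTDIR workdir=$WORKDIR surveillance=$SURVEILLANCE"
```

Using a flag without an argument for `-s` keeps experimental mode as the default, so existing invocations only need `-g` and `-i`.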

wodanaz commented 3 years ago

> -s runs surveillance mode - default is run experimental mode

I really love this solution. Thanks so much. I hope I will be writing smarter code in the future.

johnbradley commented 3 years ago

Fixed by #11

wodanaz / Assembling_viruses

Support running multiple pipelines at once #9
