Adds first two sbatch scripts for ev pipeline

johnbradley commented 3 years ago

Adds ./run-escape-variants.sh shell script that will run the pipeline. The run-escape-variants.sh script creates a directory for logs and runs the scripts/escape-variants-pipeline.sh sbatch script. The escape-variants-pipeline.sh script will run additional sbatch jobs for the various steps of the pipeline. Only the first two steps from Escape_Variants.md have been implemented. Step "1. Index Genome Reference" is scripts/index-reference-genome.sh. Step "2. Remove Nextera Adapters" is scripts/remove-nextera-adapters.sh. All of the logic is based on Escape_Variants.md.

One intentional change from the Escape_Variants.md logic is to use a sbatch array job. We do this to control how many processes are running at once and to more easily monitor/control the jobs. The MAX_ARRAY_JOBS environment variable in escape-variants-pipeline.sh can be adjusted to run more array jobs at the same time.

The sbatch scripts write logs to the 'logs' directory, which can be monitored to watch pipeline progress. The logs directory has been added to the .gitignore file so we do not accidentally commit log files.
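As a rough illustration of the wrapper's role (create the log directory, then submit the pipeline job), here is a minimal sketch. It is not the actual contents of run-escape-variants.sh: the LOG_DIR variable and the launch_command helper are hypothetical, and the use of sbatch's --output option with the %j (job ID) pattern is an assumption about how the log name is produced. The helper only prints the command so it can be tried without Slurm.

```shell
#!/usr/bin/env bash
# Sketch only, not the real run-escape-variants.sh.
# LOG_DIR and launch_command are hypothetical names used for illustration.

LOG_DIR="${LOG_DIR:-logs}"

# Create the log directory and print the sbatch command that would be run.
# %j is expanded by sbatch to the job ID (assumed naming for the main log).
launch_command() {
    mkdir -p "$LOG_DIR"
    echo "sbatch --output=${LOG_DIR}/ev-pipeline-%j.out scripts/escape-variants-pipeline.sh"
}

# Example: print the command instead of submitting it.
# launch_command
```

On a cluster with Slurm, the printed command could be run directly once it looks right.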

Limitations: Some file names are still hard-coded, such as MT246667.fasta. Only one pipeline can be run at a time because input, intermediate, and result files are all stored in the current directory.

johnbradley commented 3 years ago

Testing

To test these changes, clone this repo on HARDAC, check out this branch, and place the input files (MT246667.fasta and the *.fastq.gz files) in the root directory of this repo. Then start the pipeline by running the main script:

./run-escape-variants.sh

The above script submits an sbatch job that runs the pipeline, then returns immediately. You can monitor progress with squeue and by watching the logs. The main log is named ev-pipeline-<jobid>.out, where <jobid> is the Slurm job ID, so an example tail command is:

tail -f logs/ev-pipeline-24427412.out
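If you do not want to look up the job ID by hand, a small helper can pick the most recently modified main log. This is a sketch under the assumption that the main logs match logs/ev-pipeline-*.out as in the example above; the latest_pipeline_log function name is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: print the most recently modified main pipeline log.
# latest_pipeline_log is a hypothetical helper; the logs directory and the
# ev-pipeline-*.out pattern follow the tail example above.

latest_pipeline_log() {
    local log_dir="${1:-logs}"
    # ls -t sorts newest first; head keeps only the first entry.
    ls -t "$log_dir"/ev-pipeline-*.out 2>/dev/null | head -n 1
}

# Example usage on the cluster:
# squeue -u "$USER"                  # list your queued and running jobs
# tail -f "$(latest_pipeline_log)"   # follow the newest main log
```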

Code Explanation

The main pipeline sbatch script is scripts/escape-variants-pipeline.sh. This script's job is to run other sbatch scripts. At a high level, escape-variants-pipeline.sh follows Escape_Variants.md. The main change is the use of sbatch array jobs.

Before

Escape_Variants.md uses logic like this, which writes one script per input file and submits each one separately:

module load cutadapt
module load TrimGalore/0.6.5-fasrc01

ls *.fastq.gz > reads.list
for i in `cat reads.list`; do
root=`basename $i .fastq.gz`;
echo '#!/usr/bin/env bash' > $root.trimgalore.sh;
echo "trim_galore --fastqc --nextera $i  " >> $root.trimgalore.sh
done

for file in *trimgalore.sh ; do sbatch $file ; done

After

These changes create an array job:

https://github.com/wodanaz/Assembling_viruses/blob/d945d079a8e096349ba66a0baca177932358a36e/scripts/escape-variants-pipeline.sh#L25-L33

The --array flag specifies the range of values for the SLURM_ARRAY_TASK_ID environment variable and how many tasks may run at once. For example, --array=1-10%4 runs 10 array tasks numbered 1-10, with at most 4 running at a time.
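To make the flag concrete, here is a minimal sketch of building the array range from a list of input files. It is an illustration, not the code at the link above: the array_spec helper and the reads.list file name are assumptions, while MAX_ARRAY_JOBS is the variable described in this PR.

```shell
#!/usr/bin/env bash
# Sketch only: build the value for sbatch --array from a list file.
# array_spec and reads.list are hypothetical; MAX_ARRAY_JOBS is the
# variable named in this PR (the default of 4 here is an assumption).

MAX_ARRAY_JOBS="${MAX_ARRAY_JOBS:-4}"

# Print a spec such as "1-10%4" for a list file with 10 lines.
array_spec() {
    local list_file="$1"
    local count
    count=$(wc -l < "$list_file")
    count=$((count))   # normalize whitespace padding that some wc versions add
    if [ "$count" -eq 0 ]; then
        echo "no input files listed in $list_file" >&2
        return 1
    fi
    echo "1-${count}%${MAX_ARRAY_JOBS}"
}

# Example (requires Slurm, so it is left commented out):
# ls *.fastq.gz > reads.list
# sbatch --array="$(array_spec reads.list)" scripts/remove-nextera-adapters.sh
```

Raising MAX_ARRAY_JOBS only changes the number after the % sign, so the same set of tasks runs with more of them active at once.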

The above code then runs the associated sbatch array job script: https://github.com/wodanaz/Assembling_viruses/blob/d945d079a8e096349ba66a0baca177932358a36e/scripts/remove-nextera-adapters.sh#L10-L16
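Each array task has to work out which input file it owns from its SLURM_ARRAY_TASK_ID. One common way to do that, shown here as a sketch and not as the linked script's actual code, is to treat the task ID as a 1-based line number in a list of input files. The file_for_task helper and the reads.list file name are assumptions for illustration; the trim_galore options are the ones used in Escape_Variants.md.

```shell
#!/usr/bin/env bash
# Sketch only: pick the input file for one array task.
# file_for_task and reads.list are hypothetical names.

# Print line number $2 (1-based) of list file $1.
file_for_task() {
    local list_file="$1"
    local task_id="$2"
    sed -n "${task_id}p" "$list_file"
}

# Inside an array job, Slurm sets SLURM_ARRAY_TASK_ID for each task:
# fastq=$(file_for_task reads.list "$SLURM_ARRAY_TASK_ID")
# trim_galore --fastqc --nextera "$fastq"
```

With this mapping, task 1 handles the first file in the list, task 2 the second, and so on, which is why the array range starts at 1.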

johnbradley commented 3 years ago

I would like feedback on the filenames and on whether you can follow the scripts. The scripts are connected together like so:

./run-escape-variants.sh
       `-> scripts/escape-variants-pipeline.sh
                    `-> scripts/index-reference-genome.sh
                    `-> scripts/remove-nextera-adapters.sh

Thanks.

wodanaz commented 3 years ago

Thanks John, this looks great. I am going to start implementing it on a set of test samples.

Fortunately, the indexing step only needs to be run once, and you can always move those index files from another directory to the working directory. Even better, you can point your run at the directory where the indexed genomes are located. We also have the option to map to different genomes, so we can keep as many indexed genomes as we need to map against.

I kept that step there to help the postdoc at the Sempowski lab make sure she had those files in her working directory.

johnbradley commented 3 years ago

@wodanaz by implementing it on a set of test samples do you mean running it on a set of test samples?

My next step was going to be another PR adding additional steps from Escape_Variants.md to the pipeline.

wodanaz commented 3 years ago

Yes, I mean running it on some old set of fastq files or samples I know are good.

wodanaz commented 3 years ago

@johnbradley it is working, I have a set of 4+ trimmed and clean fastq files that are being generated.
