wodanaz / Assembling_viruses

Add Snakefile to setup genome #49

Closed johnbradley closed 3 years ago

johnbradley commented 3 years ago

Adds a Snakemake workflow to run setup steps on the genome.

Input and Output

The input genome is named NC_045512.fasta and is stored in a resources directory, as the Snakemake docs recommend. The Snakefile consists of an "all" rule that lists the files to generate: the bwa index files, the samtools index file, and the picard dictionary file. Three rules create these output files.

Per rule conda environment

Instead of sharing a single environment, each rule has its own conda environment. This allows greater flexibility when choosing tools for the various steps. Snakemake creates and activates these environments when it is run with the --use-conda flag.

Logging config

The rules specify a location for their log files. Note the "&>{log}" part of each shell command, which redirects the command's output (both stdout and stderr) to the rule's log file.
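To illustrate what the "&>{log}" redirection does, here is a minimal Python sketch that sends both stdout and stderr of a command to one log file. It is only an analogy for the shell behaviour, not how Snakemake implements logging; the run_with_log helper and the logs/demo.log path are hypothetical names chosen for this example.

```python
import subprocess
import sys
from pathlib import Path


def run_with_log(command, log_path):
    """Run a command, writing stdout and stderr to one file (like `cmd &> log`)."""
    log_path = Path(log_path)
    # Create the log directory if it does not exist yet.
    log_path.parent.mkdir(parents=True, exist_ok=True)
    with log_path.open("w") as log_file:
        # stderr=subprocess.STDOUT merges stderr into the same file as stdout.
        completed = subprocess.run(
            command, stdout=log_file, stderr=subprocess.STDOUT
        )
    return completed.returncode


if __name__ == "__main__":
    # Hypothetical demo command that writes to both streams.
    demo = [
        sys.executable,
        "-c",
        "import sys; print('to stdout'); print('to stderr', file=sys.stderr)",
    ]
    run_with_log(demo, "logs/demo.log")
```

After running the demo, logs/demo.log should contain both lines, which is the property the "&>{log}" suffix gives each rule.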

Future Changes

The docs also recommend putting each rule in a separate file but for clarity (while we are getting started) I left the rules in the main Snakefile.

Running on a Slurm cluster requires a "slurm" profile. I will address this in a later change.

Running

Right now I have installed snakemake on my laptop and run the pipeline like so:

snakemake --cores 1 --use-conda

When the run completes, the genome indexes and dict file will be in the resources/ directory.
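A quick way to confirm the run produced everything is to check for the expected files. The sketch below is a hypothetical helper, not part of this PR. It assumes the bwa suffixes listed in the multiext section, the .fai suffix that samtools faidx appends to the FASTA name, and a dictionary file named NC_045512.dict; the dictionary name in particular is an assumption, so adjust it to match the Snakefile.

```python
from pathlib import Path

GENOME = "NC_045512.fasta"
BWA_SUFFIXES = [".amb", ".ann", ".bwt", ".pac", ".sa"]


def expected_outputs(resources_dir="resources"):
    """Return the paths the setup workflow is expected to create."""
    base = Path(resources_dir)
    # bwa index files: genome name plus each suffix.
    files = [base / (GENOME + suffix) for suffix in BWA_SUFFIXES]
    # samtools index file.
    files.append(base / (GENOME + ".fai"))
    # ASSUMPTION: the picard dictionary file name is not stated in this PR.
    files.append(base / "NC_045512.dict")
    return files


def missing_outputs(resources_dir="resources"):
    """Return the expected output paths that do not exist yet."""
    return [str(p) for p in expected_outputs(resources_dir) if not p.exists()]


if __name__ == "__main__":
    missing = missing_outputs()
    if missing:
        print("Missing outputs:")
        for path in missing:
            print("  " + path)
    else:
        print("All expected genome setup files are present.")
```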

multiext function

The rules make use of the multiext function. This function appends each of a series of suffixes to a filename and returns a list of filenames. The multiext("resources/NC_045512.fasta", ".amb", ".ann", ".bwt", ".pac", ".sa") function call returns the following Python list of filenames:

['resources/NC_045512.fasta.amb',
'resources/NC_045512.fasta.ann',
'resources/NC_045512.fasta.bwt',
'resources/NC_045512.fasta.pac',
'resources/NC_045512.fasta.sa']
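For readers without Snakemake installed, the behaviour described here can be approximated in plain Python. The multiext_sketch function below is a hypothetical stand-in written for illustration; it is not Snakemake's implementation, which may do extra validation and bookkeeping on the returned paths.

```python
def multiext_sketch(prefix, *suffixes):
    """Append each suffix to prefix and return the resulting filenames."""
    return [prefix + suffix for suffix in suffixes]


if __name__ == "__main__":
    bwa_index_files = multiext_sketch(
        "resources/NC_045512.fasta", ".amb", ".ann", ".bwt", ".pac", ".sa"
    )
    # Prints one filename per line, matching the list shown above.
    for name in bwa_index_files:
        print(name)
```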

This is part of issue #46.

johnbradley commented 3 years ago

After running the pipeline I created an HTML report using snakemake --report, which produced a report.html file. When opened it looks like so:

[Screenshot: Screen Shot 2021-06-01 at 10 10 20 AM]

You can click on each rule to see more information:

[Screenshot: Screen Shot 2021-06-01 at 10 10 35 AM]

The statistics tab shows some runtime statistics:

[Screenshot: Screen Shot 2021-06-01 at 10 12 39 AM]