Adds sbatch scripts for gatk steps

johnbradley commented 3 years ago

Adds scripts for GATK steps to escape-variants-pipeline.sh based on Escape_Variants.md. The scripts are intentionally trying to be as close as possible to the Escape_Variants.md code. This is to help in comparing against the notes in Escape_Variants.md. The main change from Escape_Variants.md is using sbatch-array jobs. Any module load commands without a software version have been updated to the current HARDAC default version.

Concern over sed commands in GATK step 10

For the GATK step 10. Make final consensus fasta sequence using the SNPs in the vcf file. I included some code runs some really specific sed values: https://github.com/wodanaz/Assembling_viruses/blob/c6252b23e9496d407558f83818030cfc2e92fb4d/scripts/fix-merged-bed-positions.sh#L12-L14 The markdown suggests looking at the document by running head *merged.bed. Is there another approach I should use here since we are automating this step?

Testing

To test these changes you should still run ./run-escape-variants.sh after placing the *.fastq.gz files into the base directory of a clone of this repo. Before re-running the pipeline the intermediate data must be deleted. Also only a single instance of the pipeline can be run at a time. I will create issues to address these limitations. I would also like to improve the readability of some of the scripts.

After this PR is merged the pipeline should be able to run to completion without error. I want to address the remaining issues and an bugs found with testing before considering the pipeline production ready.

Fixes #2

wodanaz commented 3 years ago

@johnbradley I need to think about it. This is a final step for the consensus sequence output. It is specific because it depends on the genome reference. In other words, the correct masking of the consensus sequence occurs when the name of the genome/chromosome coincides with the 1st column of the bed file and the chrom column in the vcf file.

Moreover, there is this weird effect caused by the ways bcftools and bedtools start counting (some programs count sequence at position 0, others at position 1). Here I have to modify the position 1 for 0 and make sure the first line correspond to an actual coordinate. So far I do it "by hand" with sed because haven't been able to figure out a better way to tell bcf tools how to mask.

I have noted that other people making assemblies just assume the first 5-25 bp are always masked. which can help making a more automatic pipeline.

What would be your thoughts?

johnbradley commented 3 years ago

What would be your thoughts?

Is there a program that will convert count sequence at position 0 files to count sequence at position 1 files (or whatever variation of this we need)?

wodanaz commented 3 years ago

@johnbradley Well, I don't know. bcftools is one of the most simple programs I know that make a consensus sequence. There are other programs that are more elaborated that take a different approach and they generate an assembly based on the map. One example is coronaspades (based on SPAdes https://cab.spbu.ru/software/coronaspades/). I haven't tried yet but seems that it could help us in automating this step.

Give me some time and I take a look at this software to see how can be implemented.

Thanks!

johnbradley commented 3 years ago

@wodanaz Sounds good. I'll create an issue to automate the consensus sequence issue.

wodanaz / Assembling_viruses

Adds sbatch scripts for gatk steps #5

Concern over sed commands in GATK step 10

Testing