Automate consensus sequence generation

johnbradley commented 3 years ago

The escape variants pipeline has some hard coded sed commands that are part of consensus sequence generation. This process occurs in the GATK step "10. Make final consensus fasta sequence using the SNPs in the vcf file" in Escape_Variants.md. The markdown suggests running `head *merged.bed` to determine what needs to be changed.

Update the pipeline so that no manual intervention is required.

Details from https://github.com/wodanaz/Assembling_viruses/pull/5#issuecomment-775313150

This is a final step for the consensus sequence output. It is specific because it depends on the genome reference. In other words, the correct masking of the consensus sequence occurs when the name of the genome/chromosome coincides with the 1st column of the bed file and the chrom column in the vcf file.
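The name-matching requirement above can be checked before the consensus step runs. The following is a minimal sketch, not the pipeline's own code: the file names and their toy contents are assumptions made up for illustration, and only the reference name MT246667 comes from this thread.

```shell
#!/usr/bin/env bash
# Sketch: check that the reference name agrees across FASTA, BED, and VCF.
# All file names and contents below are hypothetical toy data.
workdir=$(mktemp -d)

printf '>MT246667\nACGTACGTACGTACGT\n' > "$workdir/ref.fasta"
printf 'MT246667\t0\t10\nMT246667\t120\t150\n' > "$workdir/sample_merged.bed"
printf '##fileformat=VCFv4.2\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\nMT246667\t12\t.\tT\tC\t50\tPASS\t.\n' > "$workdir/sample.vcf"

# Sequence name: first word of each FASTA header, without the leading ">".
fasta_names=$(grep '^>' "$workdir/ref.fasta" | sed 's/^>//' | awk '{print $1}' | sort -u)
# Column 1 of the BED file.
bed_names=$(cut -f1 "$workdir/sample_merged.bed" | sort -u)
# CHROM column of the VCF, skipping header lines that start with "#".
vcf_names=$(grep -v '^#' "$workdir/sample.vcf" | cut -f1 | sort -u)

if [ "$fasta_names" = "$bed_names" ] && [ "$fasta_names" = "$vcf_names" ]; then
    names_match=yes
else
    names_match=no
fi
echo "names_match=$names_match"
```

If the three names differ, the mask and the variants would refer to a sequence that the reference does not contain, which is the situation the hard coded sed commands were working around.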

Relevant code: https://github.com/wodanaz/Assembling_viruses/blob/ebbb2bf0526d567a1bce766b6b32d8f236fd9f06/scripts/fix-merged-bed-positions.sh#L12-L14 called from https://github.com/wodanaz/Assembling_viruses/blob/ebbb2bf0526d567a1bce766b6b32d8f236fd9f06/scripts/escape-variants-pipeline.sh#L140-L143

wodanaz commented 3 years ago

@johnbradley I looked into coronaSPAdes software. It is a great tool that generates assemblies but not in an intuitive format. I installed it in my home directory (/home/ab620/SPAdes-3.15.0-Linux/bin) and added the path to my .bashrc source.

It does what it is supposed to do, but I am not happy with the result: since we cannot generate full coverage across the genomes, coronaSPAdes produces assemblies with >-separated contigs and scaffolds.

I would prefer to stick with bcftools and just use a general rule for all the new genomes where we mask the first 10 nucleotides. I think that can be solved by making the first line of the merged bed file always MT246667
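One way the general rule could look, as a sketch only: the file name and the existing intervals are hypothetical, and the 0 to 10 interval assumes the standard BED convention of 0-based, half-open coordinates, so it covers the first 10 nucleotides.

```shell
#!/usr/bin/env bash
# Sketch: force the first line of a merged BED file to mask bases 1-10.
# The file name and the existing intervals are hypothetical toy data.
workdir=$(mktemp -d)
bed="$workdir/sample_merged.bed"
printf 'MT246667\t120\t150\nMT246667\t2000\t2100\n' > "$bed"

ref_name="MT246667"   # reference name mentioned in this thread
mask_len=10           # number of leading nucleotides to mask

# BED uses 0-based, half-open coordinates, so 0..10 covers bases 1-10.
mask_line=$(printf '%s\t0\t%s' "$ref_name" "$mask_len")

# Write the mask line first, then every original line except an identical
# one, so running the step twice does not duplicate the mask.
{
    printf '%s\n' "$mask_line"
    grep -vxF "$mask_line" "$bed"
} > "$bed.tmp"
mv "$bed.tmp" "$bed"

first_line=$(head -n 1 "$bed")
line_count=$(wc -l < "$bed" | tr -d ' ')
cat "$bed"
```

Because the mask line is written from variables rather than edited in place, the same step would work for any sample without inspecting the file first.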

johnbradley commented 3 years ago

> I would prefer to stick with bcftools and just use a general rule for all the new genomes where we mask the first 10 nucleotides. I think that can be solved by making the first line of the merged bed file always MT246667

@wodanaz Sounds good. One of my goals is to make sure these scripts are easy to read and edit. To that end would you feel comfortable making the above change to the scripts? If not I'm glad to make the change to the scripts if you first update Escape_Variants.md with the changes, and we can circle back around to improving the readability/editability of the scripts.

johnbradley commented 3 years ago

@wodanaz Is there anything you need from me for this issue?

wodanaz commented 3 years ago

So, is it ready to run?

johnbradley commented 3 years ago

> So, is it ready to run?

I am confused. The pipeline is ready to run via run-escape-variants.sh or run-dds-escape-variants.sh, but it still includes a step with some hard coded sed commands: https://github.com/wodanaz/Assembling_viruses/blob/a4336481521c2224003bff3471e6c2b3bde77a12/scripts/fix-merged-bed-positions.sh#L11-L13

Earlier you said

> I would prefer to stick with bcftools and just use a general rule for all the new genomes where we mask the first 10 nucleotides. I think that can be solved by making the first line of the merged bed file always MT246667

I'm not sure what commands/changes are needed to mask the nucleotides and sort the merged bed file appropriately.
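For the sorting part, a plain-shell sketch follows. It is not taken from the pipeline: the file name and intervals are hypothetical, and the awk merge is a stand-in for whatever tool produces the merged bed file. It sorts by name and start position, then merges intervals that overlap or touch.

```shell
#!/usr/bin/env bash
# Sketch: sort a BED file and merge overlapping or touching intervals.
workdir=$(mktemp -d)
bed="$workdir/sample_merged.bed"
# Hypothetical, unsorted toy intervals; two of them overlap.
printf 'MT246667\t120\t150\nMT246667\t0\t10\nMT246667\t5\t30\n' > "$bed"

# Sort by name, then numerically by start. awk then extends the current
# interval while the next one starts at or before its end.
merged=$(sort -k1,1 -k2,2n "$bed" | awk 'BEGIN { FS = OFS = "\t" }
    NR == 1 { chrom = $1; start = $2 + 0; stop = $3 + 0; next }
    $1 == chrom && ($2 + 0) <= stop { if (($3 + 0) > stop) stop = $3 + 0; next }
    { print chrom, start, stop; chrom = $1; start = $2 + 0; stop = $3 + 0 }
    END { if (NR > 0) print chrom, start, stop }')

printf '%s\n' "$merged"
```

With this toy input, the intervals 0 to 10 and 5 to 30 collapse into a single interval starting at position 0, so a fixed leading mask stays on the first line after sorting.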

wodanaz commented 3 years ago

> > So, is it ready to run?
>
> I am confused. The pipeline is ready to run via run-escape-variants.sh or run-dds-escape-variants.sh, but it still includes a step with some hard coded sed commands: https://github.com/wodanaz/Assembling_viruses/blob/a4336481521c2224003bff3471e6c2b3bde77a12/scripts/fix-merged-bed-positions.sh#L11-L13
>
> Earlier you said
>
> > I would prefer to stick with bcftools and just use a general rule for all the new genomes where we mask the first 10 nucleotides. I think that can be solved by making the first line of the merged bed file always MT246667
>
> I'm not sure what commands/changes are needed to mask the nucleotides and sort the merged bed file appropriately.

Never mind, I got it! Thanks.

johnbradley commented 3 years ago

Closing this issue since we have removed the hard coded sed statements in question.
