wodanaz / Assembling_viruses

Add pangolin step #34

Closed johnbradley closed 3 years ago

johnbradley commented 3 years ago

Add a step to the end of the pipeline that runs the pangolin software on consensus sequences.

Notes from @wodanaz

Concatenate all new consensus sequences and run pangolin:


cat *fasta > consensus_sequences.fasta
conda activate pangolin

pangolin consensus_sequences.fasta --outfile $Project_name   # $Project_name is the same as "-i sars-cov2-example" from the DDS project


> Search for variants of concern:

grep -E 'B.1.351|B.1.1.7|P.1|P.2|B.1.427|B.1.429|B.1.526' lineage_report.csv sars-cov2-example.csv > sars-cov2-example_lineages_of_concern.csv



> Finally, save these new files to the DDS and keep them on HARDAC. Keep in mind that sometimes there may be no variants of concern. If grep finds no variants and the file is empty, this file should not be saved.

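The "do not save an empty file" rule could be sketched roughly as below. This is only an illustration, not the pipeline's actual code: `filter_lineages_of_concern` is a hypothetical helper name, and the pattern differs from the grep above in that it escapes the dots and matches whole comma-separated fields (an assumption, so that `B.1.1.7` does not also match a longer name such as `B.1.1.70`). The upload to the DDS is not shown.

```shell
# Hypothetical helper: keep the lineages-of-concern file only when grep finds rows.
# Assumes the pangolin report is a CSV with the lineage in its own field.
filter_lineages_of_concern() {
    local report="$1"   # pangolin CSV report, e.g. sars-cov2-example.csv
    local out="$2"      # e.g. sars-cov2-example_lineages_of_concern.csv
    # Escaped dots plus field boundaries: match the lineage name exactly.
    local pattern='(^|,)(B\.1\.351|B\.1\.1\.7|P\.1|P\.2|B\.1\.427|B\.1\.429|B\.1\.526)(,|$)'

    # grep exits non-zero when no line matches (or on error); in that case
    # remove the empty output so nothing is saved.
    if grep -E "$pattern" "$report" > "$out"; then
        echo "Lineages of concern written to $out"
    else
        rm -f "$out"
        echo "No lineages of concern found; nothing to save"
    fi
}

# Example call (file names follow the thread's example project):
# filter_lineages_of_concern sars-cov2-example.csv sars-cov2-example_lineages_of_concern.csv
```

A later step can then test whether the output file exists before saving it.
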
johnbradley commented 3 years ago

@wodanaz Should consensus_sequences.fasta be made up of just the *.cleaned.fasta files instead of all *.fasta files? The cat *.fasta will also include some *.masked.fasta files.

Code that creates the .cleaned.fasta and .masked.fasta files: https://github.com/wodanaz/Assembling_viruses/blob/7e7dc31dce71194ce8055e0752812fdf9b0150a1/scripts/run-bcftools-consensus.sh#L26-L27

wodanaz commented 3 years ago

Correct, we should use *cleaned.fasta
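
A minimal sketch of that change, assuming the cleaned files end in `.cleaned.fasta` as in the linked script. `build_consensus` is a hypothetical helper name used only for illustration:

```shell
# Hypothetical helper: concatenate only the *.cleaned.fasta files from a directory.
build_consensus() {
    local indir="$1"   # directory holding the per-sample FASTA files
    local out="$2"     # e.g. consensus_sequences.fasta

    # Expand the glob into the positional parameters of this function.
    set -- "$indir"/*.cleaned.fasta
    # If nothing matched, the glob stays literal and the "file" does not exist.
    if [ ! -e "$1" ]; then
        echo "No *.cleaned.fasta files found in $indir" >&2
        return 1
    fi
    # *.masked.fasta files are never matched, so they stay out of the output.
    cat "$@" > "$out"
}

# Example call:
# build_consensus . consensus_sequences.fasta
```

Because the output name does not end in `.cleaned.fasta`, rerunning the step does not feed the previous output back into `cat`.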

johnbradley commented 3 years ago

@wodanaz The grep command is searching two files: lineage_report.csv and sars-cov2-example.csv. Should we be searching just sars-cov2-example.csv ($Project_name.csv)?

The pangolin command creates a single CSV file with a default name of lineage_report.csv, but since we are specifying --outfile it would create something like sars-cov2-example.csv.

grep -E 'B.1.351|B.1.1.7|P.1|P.2|B.1.427|B.1.429|B.1.526' lineage_report.csv sars-cov2-example.csv > sars-cov2-example_lineages_of_concern.csv

wodanaz commented 3 years ago

Sorry, it should have been a single file, the output from pangolin.

I added sars-cov2-example.csv to represent the name given with the -i flag at the beginning of the pipeline.

That means the pangolin run should use the value of -i as the output name.
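
One way this could look, as a sketch only: `run_pangolin` is a hypothetical wrapper, and it assumes pangolin is already on the PATH (for example after `conda activate pangolin`) and that the report should be named `<project name>.csv`.

```shell
# Hypothetical wrapper: name the pangolin report after the project name given with -i.
run_pangolin() {
    local project_name="$1"   # value of -i, e.g. sars-cov2-example
    local fasta="$2"          # e.g. consensus_sequences.fasta
    local report="${project_name}.csv"

    # --outfile replaces the default lineage_report.csv name.
    pangolin "$fasta" --outfile "$report" || return 1
    # Print the report name so later steps (grep, upload) can reuse it.
    echo "$report"
}

# Example call:
# report=$(run_pangolin sars-cov2-example consensus_sequences.fasta)
```

The grep step then only needs to search the single file whose name the wrapper prints.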