yatisht / usher

Ultrafast Sample Placement on Existing Trees
MIT License
120 stars 40 forks source link

Extracting gene mutations using matUtils extract and generate subsequent tree #367

Open Biophylo2001 opened 4 months ago

Biophylo2001 commented 4 months ago

Hi, Is there any way to extract all mutations in a Particular gene for all samples , say in the Spike region and then build a phylogenetic tree with it Something like this ? matUtils extract --i my_pb_file.pb -m S_gene_mutation -t tree_with_S_gene_mutations.nwk

Or use the Mutation Annotated jsonl tree file to filter mutation in specific gene?

Thank you ,

AngieHinrichs commented 4 months ago

Hi @Sanyukta2001, I don't think we have a single command to do that, but I think it could be done by extracting VCF from your protobuf tree file, filtering the VCF to keep only the Spike mutations, and running usher-sampled on the filtered VCF to build a new tree. It would go something like this, assuming the reference sequence is NC_045512.2 (Wuhan/Hu-1) for your tree so Spike coords are 21563-25384:

matUtils extract -i my_pb_file.pb -v my_pb_mutations.vcf

# Filter VCF to only mutations within the Spike gene coordinates
grep ^# my_pb_mutations.vcf > my_pb_mutations.filtered.vcf
grep -v ^# my_pb_mutations.vcf | awk '$2 >= 21563 && $2 <= 25384' >> my_pb_mutations.filtered.vcf

# Use filtered VCF to build a new tree
echo '()' > emptyTree.nwk
usher-sampled -t emptyTree.nwk -v my_pb_mutations.filtered.vcf -o my_pb_file.spikeOnly.pb

If your tree is large (say >10,000 samples) then it might be faster to add the usher-sampled option --optimization_radius 0 to prevent usher-sampled from doing rounds of optimization interspersed with adding samples from VCF, and then run matOptimize after usher-sampled.