yatisht / usher

Ultrafast Sample Placement on Existing Trees
MIT License

Is UShER specific to SARS-CoV-2, or does it work with sequences from other viruses? #318

Closed AndreaAguadoM closed 1 year ago

AndreaAguadoM commented 1 year ago

Hello! My name is Andrea, and I would be interested in using this tool with other types of viral sequences, e.g. HIV-1. Would that be possible? Thanks in advance!

jmcbroome commented 1 year ago

Hello, Andrea. Yes, it is possible! You will need to construct a mutation-annotated tree (MAT) for your pathogen of interest.
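For readers new to the term: a mutation-annotated tree is a phylogeny in which each node carries the mutations inferred on the branch leading to it, so a sample's genotype can be recovered by walking from the root down to its leaf. The sketch below is a minimal Python illustration of that idea only. It is not UShER's protobuf schema or its API; the `Node` class, the `genotype` helper, and the example positions are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

# A mutation is (1-based position, base in the parent, base in this node).
Mutation = Tuple[int, str, str]


@dataclass
class Node:
    name: str
    mutations: List[Mutation] = field(default_factory=list)
    # Exclude the back-pointer from repr/eq to avoid infinite recursion.
    parent: Optional["Node"] = field(default=None, repr=False, compare=False)
    children: List["Node"] = field(default_factory=list)

    def add_child(self, child: "Node") -> "Node":
        child.parent = self
        self.children.append(child)
        return child


def genotype(node: Node) -> Dict[int, str]:
    """Return {position: base} for every site where `node` differs from the root."""
    path = []
    current: Optional[Node] = node
    while current is not None:
        path.append(current)
        current = current.parent

    root_base: Dict[int, str] = {}
    allele: Dict[int, str] = {}
    for n in reversed(path):  # apply mutations from the root downwards
        for pos, old, new in n.mutations:
            root_base.setdefault(pos, old)  # first change seen starts from the root state
            allele[pos] = new
    # A later back-mutation restores the root state, so drop those sites.
    return {p: b for p, b in allele.items() if b != root_base[p]}


# Hypothetical three-node example.
root = Node("root")
internal = root.add_child(Node("internal_1", [(241, "C", "T")]))
sample_1 = internal.add_child(Node("sample_1", [(3037, "C", "T")]))
sample_2 = internal.add_child(Node("sample_2", [(241, "T", "C")]))  # reversion

print(genotype(sample_1))  # {241: 'T', 3037: 'T'}
print(genotype(sample_2))  # {}
```

Storing only the changes per branch, rather than a full sequence per sample, is what keeps this kind of structure compact for closely related genomes.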

One option is to use a Nextstrain Auspice v2 JSON, if you have one. These files are already MATs (just in a different format) and we take them natively.

matUtils extract -i input.json -o output.pb

The other approach is a little lengthier and requires constructing a VCF file for your species. This is what you will need to do if all you have is a FASTA of sequences. I have a repository with a small pipeline as an example here: https://github.com/jmcbroome/pathogen-protobuf. Essentially, you align all of your sequences to your reference, convert the alignment to a VCF, then allow UShER to construct your tree for you.
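To make the alignment-to-VCF step concrete, here is a rough Python sketch that turns sequences already aligned to a reference into a minimal multi-sample VCF of single-nucleotide differences. It is not the pipeline from the linked repository, and whether UShER accepts this exact output has not been checked here. The function names, the `chrom` label, and the toy sequences are assumptions. Gaps, `N`, and ambiguity codes are written as missing calls, which matches the point that indels are not represented.

```python
import io
from typing import Dict, Iterable, List

VALID = set("ACGT")


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA text into {name: sequence}, upper-casing the bases."""
    chunks: Dict[str, List[str]] = {}
    name = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            chunks[name] = []
        elif name is not None:
            chunks[name].append(line)
    return {n: "".join(parts).upper() for n, parts in chunks.items()}


def alignment_to_vcf(seqs: Dict[str, str], ref_name: str, chrom: str = "ref") -> str:
    """Emit one VCF record per reference position where any sample has a SNV."""
    ref = seqs[ref_name]
    samples = [n for n in seqs if n != ref_name]
    for n in samples:
        if len(seqs[n]) != len(ref):
            raise ValueError(f"{n} is not aligned to the reference length")

    header = ["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO", "FORMAT"]
    out = ["##fileformat=VCFv4.2", "\t".join(header + samples)]

    pos = 0  # 1-based coordinate on the ungapped reference
    for col, ref_base in enumerate(ref):
        if ref_base == "-":
            continue  # column is an insertion relative to the reference: skipped
        pos += 1
        if ref_base not in VALID:
            continue
        alts: List[str] = []
        calls: List[str] = []
        for n in samples:
            base = seqs[n][col]
            if base == ref_base:
                calls.append("0")
            elif base in VALID:
                if base not in alts:
                    alts.append(base)
                calls.append(str(alts.index(base) + 1))
            else:
                calls.append(".")  # gap, N, or ambiguity code -> missing
        if alts:
            fields = [chrom, str(pos), ".", ref_base, ",".join(alts), ".", ".", ".", "GT"]
            out.append("\t".join(fields + calls))
    return "\n".join(out) + "\n"


# Toy alignment (hypothetical sequences).
example_fasta = ">ref\nACGTACGT\n>s1\nACGTACGA\n>s2\nATGT-CGA\n>s3\nACGTNCGT\n"
vcf_text = alignment_to_vcf(read_fasta(io.StringIO(example_fasta)), "ref")
print(vcf_text)
```

In the toy alignment, only positions 2 and 8 produce records; the gap in `s2` and the `N` in `s3` at position 5 yield no variant line because no sample carries an alternate base there.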

Alternatively, if you have both a tree in Newick format and a VCF ready, you can construct the protobuf without inferring a new tree.

usher -v input.vcf -t input.nwk -o output.pb

Once you have a protobuf file for your species, you should be able to apply anything in our toolkit. You may also be interested in using our Python API for scripting with the MAT.

It's worth noting that UShER will slow down significantly on pathogens with very large genomes, and that right now we don't handle insertions/deletions. You may also encounter some odd behavior or assumptions for some operations, but most of the core code should work just fine. If you do encounter any bugs, please report them in the issues! Effectively generalizing this toolkit to other pathogens is very important to us.

AndreaAguadoM commented 1 year ago

That is perfect! Thank you so much! I will be using it with HIV-1, whose genome is even shorter than that of SARS-CoV-2, so I do not foresee many problems. Are you planning to implement insertion/deletion support in the near future? Kind regards!

jmcbroome commented 1 year ago

Indel support is one of our active projects, but it requires some pretty significant changes to the data structure itself to properly handle nested/overlapping indel events. I'm afraid we don't have an ETA on it at this time. However, that won't stop you from creating an HIV-1 parsimony tree with UShER!
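As background on what "parsimony" means here: a parsimony score counts the minimum number of substitutions a tree needs to explain the observed bases, and lower is better. The sketch below computes that count for a single site on a small binary tree using the classic Fitch algorithm. It is a textbook illustration, not UShER's implementation; the nested-tuple tree encoding and the sample names are assumptions.

```python
from typing import Dict, Set, Tuple, Union

# A tree is either a leaf name or a pair (left_subtree, right_subtree).
Tree = Union[str, Tuple["Tree", "Tree"]]


def fitch(tree: Tree, bases: Dict[str, str]) -> Tuple[Set[str], int]:
    """Return (candidate ancestral bases, minimum substitutions) for one site."""
    if isinstance(tree, str):
        return {bases[tree]}, 0
    left_set, left_cost = fitch(tree[0], bases)
    right_set, right_cost = fitch(tree[1], bases)
    common = left_set & right_set
    if common:
        # Children can agree on a base: no extra substitution needed here.
        return common, left_cost + right_cost
    # Children disagree: one substitution is required on a branch below.
    return left_set | right_set, left_cost + right_cost + 1


tree = (("s1", "s2"), ("s3", "s4"))
print(fitch(tree, {"s1": "A", "s2": "A", "s3": "G", "s4": "G"})[1])  # 1
print(fitch(tree, {"s1": "A", "s2": "G", "s3": "A", "s4": "G"})[1])  # 2
```

The first site fits the tree with a single change between the two clades, while the second needs two changes on the same tree.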