yatisht / usher

Ultrafast Sample Placement on Existing Trees
MIT License

Duplicate Nodes with Special Characters in Name #368

Open gp201 opened 4 months ago

gp201 commented 4 months ago

Description

When node names contain certain special characters, a duplicate node is created.

Steps to Reproduce

1. usher -t tree.nwk -v aligned.vcf -o tree.pb
   Observed in final_tree.nh
2. matUtils extract -i tree.pb -C lineagePaths.txt -j auspice_tree.json -S samplePaths.txt
   Observed in auspice_tree.json

Expected Behavior

The phylogenetic tree should not contain duplicate nodes.

Actual Behavior

The node 'hRSV/A/Germany/22-02516/2021' is present twice in the tree.

Additional Information

Files to reproduce the bug: bug_example.zip. Run run.sh to generate the relevant files.

Environment

Conda Usher: 0.6.3

Please let me know whether this is a genuine error or an oversight on my end. Thank you.

AngieHinrichs commented 4 months ago

Hi @gp201, that is a great little test case. It appears that usher's Newick parsing does not handle quoting. So it appears to usher that there are two distinct sequences: 'hRSV/A/Germany/22-02516/2021' in tree.nwk (treating the quotes as part of the name) with no substitutions relative to the reference, and a different sequence hRSV/A/Germany/22-02516/2021 in aligned.vcf (no quotes).
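To check whether a tree is affected, it may help to list the labels that appear inside single quotes in the Newick file. The following is only a rough sketch: the function name list_quoted_labels is hypothetical, and it assumes labels are wrapped in single quotes with no escaped quotes inside them.

```shell
# Hypothetical helper: print every single-quoted label found in a Newick file.
# Assumes simple quoting ('name') with no doubled or escaped quotes inside a label.
list_quoted_labels() {
    # -o prints only the matching text, one match per line.
    grep -oE "'[^']*'" "$1"
}
```

Running list_quoted_labels tree.nwk should print nothing once the quotes have been removed from the tree.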

It would be better for usher's Newick parsing to recognize quoting, but in the meantime there is a straightforward workaround: strip all quote characters from your input Newick. (Also make sure before creating Newick and VCF that none of the input sequence names contain any characters with special meaning in Newick like [():,;].) Here is an example command that would do that:

sed -re "s/['\"]//g;" tree.nwk > tree.noQuotes.nwk
usher -t tree.noQuotes.nwk -v aligned.vcf -o tree.pb
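The advice above about checking sequence names could be sketched as a small pre-flight check on the VCF. This is an illustration, not part of usher: the function name list_unsafe_vcf_names is hypothetical, it assumes an uncompressed VCF whose #CHROM header line lists sample names from column 10 onward, and it flags quotes and square brackets in addition to the characters ( ) : , ; mentioned above.

```shell
# Hypothetical helper: print VCF sample names that contain characters
# with special meaning in Newick. Assumes an uncompressed, tab-delimited VCF.
list_unsafe_vcf_names() {
    # Sample names are columns 10 and onward of the #CHROM header line.
    grep -m1 '^#CHROM' "$1" \
        | cut -f10- \
        | tr '\t' '\n' \
        | grep -E "[][():,;'\"]"
}
```

If list_unsafe_vcf_names aligned.vcf prints nothing (and exits non-zero, as grep does when there is no match), none of the sample names contain the flagged characters.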