Open gp201 opened 4 months ago
Hi @gp201, that is a great little test case. It appears that usher's Newick parsing does not handle quoting. So it appears to usher that there are two distinct sequences: 'hRSV/A/Germany/22-02516/2021' in tree.nwk (treating the quotes as part of the name) with no substitutions relative to the reference, and a different sequence hRSV/A/Germany/22-02516/2021 in aligned.vcf (no quotes).
It would be better for usher's Newick parsing to recognize quoting, but in the meantime there is a straightforward workaround: strip all quote characters from your input Newick. (Also make sure, before creating the Newick and VCF, that none of the input sequence names contain any characters with special meaning in Newick like [():,;].) Here is an example command that would do that:
sed -re "s/['\"]//g;" tree.nwk > tree.noQuotes.nwk
usher -t tree.noQuotes.nwk -v aligned.vcf -o tree.pb
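If you would rather do the cleanup and the name check in one script, here is a minimal Python sketch of the same idea. It is not part of usher. The function names, the example tree string and the exact set of characters treated as unsafe (the ones listed above plus quotes, square brackets and whitespace) are assumptions made for illustration.

```python
import re

# Characters assumed unsafe in names: Newick structural characters,
# square brackets (Newick comments), and both kinds of quote.
NEWICK_SPECIAL = set("():,;[]'\"")


def strip_quotes(newick: str) -> str:
    """Remove every single and double quote, like the sed command above."""
    return re.sub(r"['\"]", "", newick)


def unsafe_names(names):
    """Return the names containing whitespace or a character in NEWICK_SPECIAL."""
    return [n for n in names
            if any(c in NEWICK_SPECIAL or c.isspace() for c in n)]


def vcf_sample_names(header_line: str):
    """Sample names are the tab-separated columns after FORMAT in the #CHROM line."""
    return header_line.rstrip("\n").split("\t")[9:]


# Hypothetical two-leaf tree, used only to show the effect of stripping quotes.
tree = "('hRSV/A/Germany/22-02516/2021':0.1,ref:0.0);"
print(strip_quotes(tree))  # (hRSV/A/Germany/22-02516/2021:0.1,ref:0.0);
```

Running unsafe_names(vcf_sample_names(line)) on the #CHROM header line of aligned.vcf before building the tree should flag any names that would need renaming. Note that stripping quotes only helps when the names themselves contain no special characters.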
Description
When node names contain certain special characters, a duplicate node is created.
Steps to Reproduce
1. usher -t tree.nwk -v aligned.vcf -o tree.pb
   Observed in final_tree.nh
2. matUtils extract -i tree.pb -C lineagePaths.txt -j auspice_tree.json -S samplePaths.txt
   Observed in auspice_tree.json
Expected Behavior
The phylogenetic tree should not contain duplicate nodes.
Actual Behavior
The node 'hRSV/A/Germany/22-02516/2021' is present twice in the tree.
Additional Information
Files to reproduce the bug: bug_example.zip. Run run.sh to generate relevant files.
Environment
Conda Usher: 0.6.3
Please let me know whether this is a genuine error or an oversight on my end. Thank you.