xflouris / libpll

Phylogenetic Likelihood Library
GNU Affero General Public License v3.0

NEWICK parser: breaks if taxa names contain special chars or start with a digit #72

Closed: amkozlov closed this issue 8 years ago

amkozlov commented 8 years ago

In the bison files, a taxon name is defined as STRING, which is in turn defined by the following lexer rule:

[a-zA-Z_][a-zA-Z_0-9]*                       {pll_rtree_lval.s = xstrndup(pll_rtree_text, pll_rtree_leng); return STRING;}

This is incorrect, since names such as "123" or "KJ953909|Pentalinon_luteum" are perfectly valid taxon names in a Newick file.

I'm not sure if it's possible with bison, but ideally a taxon name should be defined as anything between [(,] and [,:)] that does not itself contain any of these four characters.
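To make the difference concrete, here is a minimal C sketch, not libpll code. `matches_identifier_rule` and `label_length` are hypothetical helpers written for illustration: the first mimics the identifier pattern quoted above, the second applies the proposed definition by taking everything up to the next `(`, `,`, `:` or `)`.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Hypothetical helper: returns 1 if the whole of name matches
   [a-zA-Z_][a-zA-Z_0-9]* (assuming the default "C" locale), else 0. */
static int matches_identifier_rule(const char *name)
{
  assert(name != NULL);

  /* first character must be a letter or underscore */
  if (!(isalpha((unsigned char)*name) || *name == '_'))
    return 0;

  /* remaining characters must be letters, digits or underscores */
  for (++name; *name; ++name)
    if (!(isalnum((unsigned char)*name) || *name == '_'))
      return 0;

  return 1;
}

/* Hypothetical helper: length of the unquoted label starting at s,
   i.e. the run of characters before the first '(' ',' ':' or ')'
   (or the end of the string). */
static size_t label_length(const char *s)
{
  assert(s != NULL);
  return strcspn(s, "(),:");
}
```

With these helpers, `matches_identifier_rule("123")` and `matches_identifier_rule("KJ953909|Pentalinon_luteum")` both return 0, while `label_length("KJ953909|Pentalinon_luteum:0.2)")` covers the whole name up to the colon. The sketch ignores quoting, comments and whitespace, which a real parser would also have to handle.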

snacktavish commented 8 years ago

This sounds like it might be the same issue we have in FastDate - https://github.com/xflouris/speed-dating/issues/50

xflouris commented 8 years ago

I'll update the newick parsers of libpll/fastdate with the one from the new PTP after the merge, as I don't have much time these days.

@amkozlov : Yes, in fact flex/bison handle context-free grammars.

The PTP parser accepts anything inside single- or double-quoted literals (even [(,]:)), and anything except [(,][:) when non-quoted.
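For readers who want to see the quoted/non-quoted distinction in code, the following is a rough C sketch and not the PTP parser itself. `scan_label` and `UNQUOTED_STOP` are hypothetical names, and the delimiter set is only one reading of the character classes mentioned above.

```c
#include <assert.h>
#include <string.h>

/* Assumption: characters that end an unquoted label. */
#define UNQUOTED_STOP "(),:[]"

/* Hypothetical helper, not the PTP or libpll API.
   Copies the label starting at s into out (NUL-terminated, truncated
   to outsz - 1 characters) and returns the number of input characters
   consumed. Returns 0 if a quoted label has no closing quote. */
static size_t scan_label(const char *s, char *out, size_t outsz)
{
  assert(s != NULL && out != NULL && outsz > 0);

  const char *start = s;
  size_t len, consumed;

  if (*s == '\'' || *s == '"')
  {
    /* quoted: accept everything up to the matching closing quote */
    const char *end = strchr(s + 1, *s);
    if (!end)
    {
      out[0] = '\0';
      return 0;
    }
    start = s + 1;
    len = (size_t)(end - start);
    consumed = len + 2;            /* label plus both quote characters */
  }
  else
  {
    /* unquoted: stop at the first delimiter or end of string */
    len = strcspn(s, UNQUOTED_STOP);
    consumed = len;
  }

  if (len >= outsz)
    len = outsz - 1;               /* truncate to fit the buffer */
  memcpy(out, start, len);
  out[len] = '\0';
  return consumed;
}
```

Called on `'a(b,c):d':0.1`, this yields the label `a(b,c):d`; called on `KJ953909|Pentalinon_luteum:0.2`, it yields the full accession-plus-name string. Escaped quotes inside quoted labels (a doubled `''` in Newick) are deliberately left out to keep the sketch short.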

amkozlov commented 8 years ago

This would be great, thanks!

xflouris commented 8 years ago

fixed