woodbri / address-standardizer

An address parser and standardizer in C++

Tokenizer does not split off ° symbol #32

Closed by woodbri 8 years ago

woodbri commented 8 years ago

In lex-spain.txt we have:

LEXENTRY:   °   °   UNITH   ATT_SUF,DET_SUF

This should split the token '4°' into two tokens, '4' and '°'.

After some debugging, it looks like a problem with the regex. We generate "\B°\b", which is correct for all other splitting, but the '\B' seems to be invalid between '\d' and '°'.

I tried to work around this by changing Tokenizer.cpp:50 to:

boost::u32regex re = boost::make_u32regex( std::string( "\\<(\\d+)([[:alpha:]\\p{L}°])\\>" ) );
boost::u32regex re = boost::make_u32regex( std::string( "^(\\d+)([[:alpha:]\\p{L}°])$" ) );

but this did not work either. More investigation is needed.

woodbri commented 8 years ago

closed with commit 8243688