Handle multiple identical adjacent tokens

woodbri / address-standardizer

An address parser and standardizer in C++


Handle multiple identical adjacent tokens #9

Closed: woodbri closed this issue 8 years ago

woodbri commented 8 years ago

Need to handle multiple adjacent WORD tokens. Options:

- collapse them within a rule
- keep them across rules?
- handle them in the tokenizer?
- handle them in the grammar search algorithm
- consider extending the grammar to support:
  - <token>+ one or more tokens
  - <token>* zero or more tokens
  - <token>{n,m} for n to m tokens
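
For illustration, the "collapse" and `<token>{n,m}` ideas above could be sketched roughly as follows. This is not code from address-standardizer: `collapseRuns` and `matchesRange` are hypothetical helpers, and token classes are assumed to be plain strings (only `WORD` appears in this issue; any other class name is made up).

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper, not part of address-standardizer: collapses each run
// of adjacent identical token classes into a single (class, count) pair.
std::vector<std::pair<std::string, std::size_t>>
collapseRuns(const std::vector<std::string>& classes) {
    std::vector<std::pair<std::string, std::size_t>> runs;
    for (const auto& cls : classes) {
        if (!runs.empty() && runs.back().first == cls) {
            ++runs.back().second;                  // same class: extend the run
        } else {
            runs.emplace_back(cls, std::size_t{1}); // new class: start a run
        }
    }
    return runs;
}

// Hypothetical check for a <token>{n,m} quantifier: a run matches when its
// length lies in the inclusive range [n, m].
bool matchesRange(std::size_t count, std::size_t n, std::size_t m) {
    return count >= n && count <= m;
}
```

With this sketch, a sequence with three adjacent `WORD` tokens collapses to a single `("WORD", 3)` entry, which `matchesRange(3, 1, 4)` would accept as `WORD{1,4}`. Whether collapsing belongs in the tokenizer or in the search step is left open here, as in the list above.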

woodbri commented 8 years ago

Decided that this is not an issue because it can be handled to some extent in the grammar via:

[section]
WORD -> <OutClass> -> <score>
WORD WORD -> <OutClass> <OutClass> -> <score>
WORD WORD WORD -> <OutClass> <OutClass> <OutClass> -> <score>
WORD WORD WORD WORD -> <OutClass> <OutClass> <OutClass> <OutClass> -> <score>

The search algorithm should be robust enough to handle this. I would decrease the score slightly for each additional WORD token.
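
As a rough sketch of that approach, the repeated rules could be generated rather than written by hand. Everything here is an assumption for illustration: `repeatedTokenRules` is a hypothetical helper, the base score and per-token penalty are made-up values, and the output only mimics the rule layout shown above rather than a confirmed grammar file format.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical generator: emits one rule per repeat count from 1 to maxRepeat,
// in the "<tokens> -> <out classes> -> <score>" layout used above.
// baseScore and penalty are illustrative, not values from the project.
std::vector<std::string> repeatedTokenRules(const std::string& token,
                                            const std::string& outClass,
                                            std::size_t maxRepeat,
                                            double baseScore,
                                            double penalty) {
    std::vector<std::string> rules;
    for (std::size_t n = 1; n <= maxRepeat; ++n) {
        std::ostringstream line;
        // Left-hand side: the token repeated n times.
        for (std::size_t i = 0; i < n; ++i) {
            line << token << ' ';
        }
        line << "->";
        // Output side: one output class per input token.
        for (std::size_t i = 0; i < n; ++i) {
            line << ' ' << outClass;
        }
        // Lower the score slightly for each additional token.
        line << " -> " << (baseScore - penalty * static_cast<double>(n - 1));
        rules.push_back(line.str());
    }
    return rules;
}
```

Calling `repeatedTokenRules("WORD", "STREET", 4, 0.5, 0.125)` yields four rules whose scores step down from 0.5 to 0.125, where `STREET` stands in for whatever `<OutClass>` the section needs.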