woodbri / address-standardizer

An address parser and standardizer in C++

Redesign Tokenizer to work with Lexicon #3

Closed woodbri closed 8 years ago

woodbri commented 8 years ago

Issue: Lexicon entries can span multiple words, but the current Tokenizer does not take that into account.

Potential solutions:

  1. Extract all Lexicon keys and build a regex to compare against the head of the string being tokenized.
  2. Cycle through all keys, matching each against the head of the string.

Item 2 can be optimized by grouping Lexicon phrases by their first 1-2 characters, so that on each token pass only the matching group needs to be cycled through.

woodbri commented 8 years ago

Closed with push 811c997..b593965. Still needs some more testing with a lexicon, but it works with an empty lexicon.