woodbri / address-standardizer

An address parser and standardizer in C++

Look into breaking the lexicon regex into multiple smaller regexes #14

Closed woodbri closed 8 years ago

woodbri commented 8 years ago

Currently, Lexicon::regex() returns one HUGE regex string. This might be too large for a large lexicon. There are two potential ways that this might be improved:

  1. implement something equivalent to Perl's Regexp::Optimizer
  2. change the call to Lexicon::regex(char) and create multiple regex patterns based on the leading character

Tokenizer would need changes for item 2, but my thought is that smaller regexes will be more memory efficient and will be evaluated faster.
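Item 2 could be sketched roughly as follows. This is an assumption about the approach, not the actual Lexicon/Tokenizer code: it groups lexicon entries by their leading character and builds one small alternation regex per group, so each token is only matched against the regex for its own leading character.

```cpp
#include <map>
#include <regex>
#include <string>
#include <vector>

// Hypothetical sketch: instead of one huge alternation over the whole
// lexicon, build one smaller alternation regex per leading character.
std::map<char, std::regex> buildRegexMap(const std::vector<std::string>& words) {
    std::map<char, std::string> parts;
    for (const auto& w : words) {
        if (w.empty()) continue;          // skip empty entries
        std::string& p = parts[w[0]];     // bucket by leading character
        if (!p.empty()) p += "|";
        p += w;                           // assumes entries contain no regex metacharacters
    }
    std::map<char, std::regex> out;
    for (const auto& kv : parts)
        out.emplace(kv.first, std::regex("(" + kv.second + ")"));
    return out;
}

// Look up the per-leading-character regex for a token and test it.
bool matchesLexicon(const std::map<char, std::regex>& m, const std::string& tok) {
    if (tok.empty()) return false;
    auto it = m.find(tok[0]);
    return it != m.end() && std::regex_match(tok, it->second);
}
```

The win is that each per-character pattern is far shorter than the full lexicon alternation, and only one of them is compiled against any given token.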

woodbri commented 8 years ago

I've added some timing stats in src/tester/t2.cpp and at the moment, it looks like the performance bottleneck is in the search algorithm, so this is probably a lower priority.
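The kind of timing measurement mentioned above can be done with std::chrono. The actual stats live in src/tester/t2.cpp; this standalone helper is only an illustrative sketch of the technique:

```cpp
#include <chrono>
#include <regex>
#include <string>

// Minimal timing sketch: wall-clock a repeated regex match with
// std::chrono::steady_clock and return the elapsed microseconds.
long long timeMatchesUs(const std::regex& re, const std::string& tok, int iters) {
    auto start = std::chrono::steady_clock::now();
    volatile bool matched = false;        // volatile discourages the compiler
    for (int i = 0; i < iters; ++i)       // from optimizing the loop away
        matched = std::regex_match(tok, re);
    (void)matched;
    auto end = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::microseconds>(end - start).count();
}
```

Comparing this number for the single huge regex versus the per-leading-character regexes would show whether the split actually pays off.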

woodbri commented 8 years ago

Closing this with push 35f54f0..3c649e8 to develop. The regexes are now optimized and the Tokenizer runs about 5 times faster.