woodbri / address-standardizer

An address parser and standardizer in C++

Convert data from lex, gaz and rules to new formats #13

Closed woodbri closed 8 years ago

woodbri commented 8 years ago

The new code handles things differently, so it might not make sense to convert the old files. This needs to be evaluated. The lex and gaz files should not be a problem. The rules should be greatly simplified, because the Grammar files support hierarchical grammar definitions, which would remove a lot of the redundancy in the rules.txt file.

woodbri commented 8 years ago

The old rules file will not convert in a useful way because it is a very flat list rather than a tree. I'm not sure it makes sense to try to build a tree out of the phrases in the rules and then convert that to the new grammar. My first attempt created a huge flat file that had performance problems. Building a tree would probably work better, but it should be easy to just build a new grammar from scratch, which we will have to do for all the other countries anyway.
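To illustrate the flat-versus-tree difference, here is a minimal sketch. It assumes a hypothetical in-memory representation (`Sequence`, `Grammar`, `expand`, `sampleGrammar`) and made-up token class names; none of these are the project's actual grammar file format or API. It expands a small hierarchical grammar into every flat token sequence it implies, which is roughly what a flat rules list has to spell out line by line.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical representation, NOT the project's real grammar format:
// a non-terminal maps to alternatives; an alternative is a sequence of symbols.
using Sequence = std::vector<std::string>;
using Grammar = std::map<std::string, std::vector<Sequence>>;

// Expand a symbol into every flat token sequence it can produce.
// Symbols that are not keys in the grammar are treated as terminal token classes.
// Assumes the grammar is a tree (non-recursive); the assert guards that assumption.
std::vector<Sequence> expand(const Grammar& g, const std::string& symbol,
                             std::size_t depth = 0) {
    assert(depth < 64 && "grammar appears to be recursive");
    auto it = g.find(symbol);
    if (it == g.end()) return {Sequence{symbol}};

    std::vector<Sequence> result;
    for (const Sequence& alt : it->second) {
        // Start with one empty prefix and extend it symbol by symbol.
        std::vector<Sequence> partial{Sequence{}};
        for (const std::string& sym : alt) {
            std::vector<Sequence> next;
            for (const Sequence& tail : expand(g, sym, depth + 1)) {
                for (const Sequence& head : partial) {
                    Sequence joined = head;
                    joined.insert(joined.end(), tail.begin(), tail.end());
                    next.push_back(std::move(joined));
                }
            }
            partial = std::move(next);
        }
        result.insert(result.end(), partial.begin(), partial.end());
    }
    return result;
}

// A tiny illustrative grammar; the token class names are made up for this sketch.
Grammar sampleGrammar() {
    Grammar g;
    g["ADDRESS"] = {Sequence{"HOUSE", "STREET"},
                    Sequence{"HOUSE", "STREET", "UNIT"}};
    g["HOUSE"] = {Sequence{"NUMBER"}, Sequence{"NUMBER", "FRACTION"}};
    g["STREET"] = {Sequence{"WORD", "TYPE"},
                   Sequence{"DIRECT", "WORD", "TYPE"},
                   Sequence{"WORD", "TYPE", "DIRECT"}};
    g["UNIT"] = {Sequence{"UNITH", "NUMBER"}};
    return g;
}
```

Calling `expand(sampleGrammar(), "ADDRESS")` turns the 8 alternatives in this toy grammar into 12 flat sequences. Each additional optional component multiplies the flat count, while the tree only grows by the alternatives added.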

woodbri commented 8 years ago

The old grammar files do not map well to the new grammar, and the new grammars are easy to change. There might be some value in adding words from the lexicon or gazetteer, but blindly dumping lots of words into it only makes it slower.