woodbri / address-standardizer

An address parser and standardizer in C++
Other
7 stars 1 forks source link

Redesign Grammar and Search to improve preformance #35

Closed woodbri closed 8 years ago

woodbri commented 8 years ago

The current design is clean and simple and relies on storing the data in a map. The down side of this is that we are constantly running find(rule) on the map. Since the grammar is really a tree of nodes, it might be faster to create nodes with pointers to children and just follow the pointers.

TODO:

Collected some stats on how many calls to find():

$ time ../src/tester/t2 lex-greatbritain.txt grammar/greatbritain.grammar '1A Seastone Cottages Weybourne HOLT NR25 7HG UK'
Address: '1A Seastone Cottages Weybourne HOLT NR25 7HG UK'
Normalized: '1A Seastone Cottages Weybourne HOLT NR25 7HG UK'
UpperCase: '1A SEASTONE COTTAGES WEYBOURNE HOLT NR25 7HG UK'
Timer: read lexicon: 2 ms
Timer: load lexicon: 149 ms
Timer: init tokenizer: 0 ms
Timer: tokenizer::getTokens: 164 ms
-------------------------------
TOKEN:  1               NUMBER  BADTOKEN                1
TOKEN:  A               WORD,SINGLE     BADTOKEN                0
TOKEN:  SEASTONE                WORD    BADTOKEN                0
TOKEN:  COTTAGES                WORD,TYPE       BADTOKEN                1
TOKEN:  WEYBOURNE               WORD    BADTOKEN                0
TOKEN:  HOLT            WORD    BADTOKEN                0
TOKEN:  NR25            MIXED,PCH       BADTOKEN                0
TOKEN:  7HG             MIXED,PCT       BADTOKEN                0
TOKEN:  UK              NATION  BADTOKEN                1
-------------------------------
Timer: read grammar file: 0 ms
Timer: init Grammar: 16 ms
Timer: Search::searchAndReclassBest: 6816 ms
Stats: findMetas: 198583
Stats: findRules: 149479
bestCost: 0.566667
TOKEN:  1       1       NUMBER  HOUSE           1
TOKEN:  A       A       SINGLE  HOUSE           0
TOKEN:  SEASTONE        SEASTONE        WORD    STREET          0
TOKEN:  COTTAGES        COTTAGES        TYPE    SUFTYP          1
TOKEN:  WEYBOURNE       WEYBOURNE       WORD    CITY            0
TOKEN:  HOLT    HOLT    WORD    CITY            0
TOKEN:  NR25    NR25    MIXED   POSTAL          0
TOKEN:  7HG     7HG     MIXED   POSTAL          0
TOKEN:  UK      UNITED KINGDOM OF GREAT BRITAIN NATION  NATION          1

real    0m7.161s
user    0m7.138s
sys     0m0.020s
woodbri commented 8 years ago

I've rewriting the grammar and search to use pointers rather than maps to lookup rules. The above test now takes 0.663s instead of 7.161s:

$ time ../src/tester/t2 lex-greatbritain.txt grammar/greatbritain.grammar '1A Seastone Cottages Weybourne HOLT NR25 7HG UK'
Address: '1A Seastone Cottages Weybourne HOLT NR25 7HG UK'
Normalized: '1A Seastone Cottages Weybourne HOLT NR25 7HG UK'
UpperCase: '1A SEASTONE COTTAGES WEYBOURNE HOLT NR25 7HG UK'
Timer: read lexicon: 2 ms
Timer: load lexicon: 151 ms
Timer: init tokenizer: 0 ms
Timer: tokenizer::getTokens: 164 ms
-------------------------------
TOKEN:  1               NUMBER  BADTOKEN                1
TOKEN:  A               WORD,SINGLE     BADTOKEN                0
TOKEN:  SEASTONE                WORD    BADTOKEN                0
TOKEN:  COTTAGES                WORD,TYPE       BADTOKEN                1
TOKEN:  WEYBOURNE               WORD    BADTOKEN                0
TOKEN:  HOLT            WORD    BADTOKEN                0
TOKEN:  NR25            MIXED,PCH       BADTOKEN                0
TOKEN:  7HG             MIXED,PCT       BADTOKEN                0
TOKEN:  UK              NATION  BADTOKEN                1
-------------------------------
Timer: read grammar file: 0 ms
Timer: init Grammar: 17 ms
OK loading grammar:
Timer: Search::searchAndReclassBest: 317 ms
Stats: findMetas: 720
Stats: findRules: 12730
bestCost: 0.566667
TOKEN:  1       1       NUMBER  HOUSE           1
TOKEN:  A       A       SINGLE  HOUSE           0
TOKEN:  SEASTONE        SEASTONE        WORD    STREET          0
TOKEN:  COTTAGES        COTTAGES        TYPE    SUFTYP          1
TOKEN:  WEYBOURNE       WEYBOURNE       WORD    CITY            0
TOKEN:  HOLT    HOLT    WORD    CITY            0
TOKEN:  NR25    NR25    MIXED   POSTAL          0
TOKEN:  7HG     7HG     MIXED   POSTAL          0
TOKEN:  UK      UNITED KINGDOM OF GREAT BRITAIN NATION  NATION          1

real    0m0.663s
user    0m0.649s
sys     0m0.012s

Closing this with commit 312a661 to develop