Redesign Grammar and Search to improve preformance

woodbri / address-standardizer

An address parser and standardizer in C++

Other

7 stars 1 forks source link

The current design is clean and simple and relies on storing the data in a map. The down side of this is that we are constantly running find(rule) on the map. Since the grammar is really a tree of nodes, it might be faster to create nodes with pointers to children and just follow the pointers.

TODO:

verify that the bulk of the time is spent doing find()
redesign the grammar to use pointers rather them the current map

Collected some stats on how many calls to find():

$ time ../src/tester/t2 lex-greatbritain.txt grammar/greatbritain.grammar '1A Seastone Cottages Weybourne HOLT NR25 7HG UK'
Address: '1A Seastone Cottages Weybourne HOLT NR25 7HG UK'
Normalized: '1A Seastone Cottages Weybourne HOLT NR25 7HG UK'
UpperCase: '1A SEASTONE COTTAGES WEYBOURNE HOLT NR25 7HG UK'
Timer: read lexicon: 2 ms
Timer: load lexicon: 149 ms
Timer: init tokenizer: 0 ms
Timer: tokenizer::getTokens: 164 ms
-------------------------------
TOKEN:  1               NUMBER  BADTOKEN                1
TOKEN:  A               WORD,SINGLE     BADTOKEN                0
TOKEN:  SEASTONE                WORD    BADTOKEN                0
TOKEN:  COTTAGES                WORD,TYPE       BADTOKEN                1
TOKEN:  WEYBOURNE               WORD    BADTOKEN                0
TOKEN:  HOLT            WORD    BADTOKEN                0
TOKEN:  NR25            MIXED,PCH       BADTOKEN                0
TOKEN:  7HG             MIXED,PCT       BADTOKEN                0
TOKEN:  UK              NATION  BADTOKEN                1
-------------------------------
Timer: read grammar file: 0 ms
Timer: init Grammar: 16 ms
Timer: Search::searchAndReclassBest: 6816 ms
Stats: findMetas: 198583
Stats: findRules: 149479
bestCost: 0.566667
TOKEN:  1       1       NUMBER  HOUSE           1
TOKEN:  A       A       SINGLE  HOUSE           0
TOKEN:  SEASTONE        SEASTONE        WORD    STREET          0
TOKEN:  COTTAGES        COTTAGES        TYPE    SUFTYP          1
TOKEN:  WEYBOURNE       WEYBOURNE       WORD    CITY            0
TOKEN:  HOLT    HOLT    WORD    CITY            0
TOKEN:  NR25    NR25    MIXED   POSTAL          0
TOKEN:  7HG     7HG     MIXED   POSTAL          0
TOKEN:  UK      UNITED KINGDOM OF GREAT BRITAIN NATION  NATION          1

real    0m7.161s
user    0m7.138s
sys     0m0.020s

$ time ../src/tester/t2 lex-greatbritain.txt grammar/greatbritain.grammar '1A Seastone Cottages Weybourne HOLT NR25 7HG UK' Address: '1A Seastone Cottages Weybourne HOLT NR25 7HG UK' Normalized: '1A Seastone Cottages Weybourne HOLT NR25 7HG UK' UpperCase: '1A SEASTONE COTTAGES WEYBOURNE HOLT NR25 7HG UK' Timer: read lexicon: 2 ms Timer: load lexicon: 151 ms Timer: init tokenizer: 0 ms Timer: tokenizer::getTokens: 164 ms ------------------------------- TOKEN: 1 NUMBER BADTOKEN 1 TOKEN: A WORD,SINGLE BADTOKEN 0 TOKEN: SEASTONE WORD BADTOKEN 0 TOKEN: COTTAGES WORD,TYPE BADTOKEN 1 TOKEN: WEYBOURNE WORD BADTOKEN 0 TOKEN: HOLT WORD BADTOKEN 0 TOKEN: NR25 MIXED,PCH BADTOKEN 0 TOKEN: 7HG MIXED,PCT BADTOKEN 0 TOKEN: UK NATION BADTOKEN 1 ------------------------------- Timer: read grammar file: 0 ms Timer: init Grammar: 17 ms OK loading grammar: Timer: Search::searchAndReclassBest: 317 ms Stats: findMetas: 720 Stats: findRules: 12730 bestCost: 0.566667 TOKEN: 1 1 NUMBER HOUSE 1 TOKEN: A A SINGLE HOUSE 0 TOKEN: SEASTONE SEASTONE WORD STREET 0 TOKEN: COTTAGES COTTAGES TYPE SUFTYP 1 TOKEN: WEYBOURNE WEYBOURNE WORD CITY 0 TOKEN: HOLT HOLT WORD CITY 0 TOKEN: NR25 NR25 MIXED POSTAL 0 TOKEN: 7HG 7HG MIXED POSTAL 0 TOKEN: UK UNITED KINGDOM OF GREAT BRITAIN NATION NATION 1 real 0m0.663s user 0m0.649s sys 0m0.012s

woodbri / address-standardizer

Redesign Grammar and Search to improve preformance #35