woodbri closed this issue 8 years ago
I've rewritten the grammar and search code to use pointers rather than map lookups to find rules. The above test now takes 0.663s instead of 7.161s:
$ time ../src/tester/t2 lex-greatbritain.txt grammar/greatbritain.grammar '1A Seastone Cottages Weybourne HOLT NR25 7HG UK'
Address: '1A Seastone Cottages Weybourne HOLT NR25 7HG UK'
Normalized: '1A Seastone Cottages Weybourne HOLT NR25 7HG UK'
UpperCase: '1A SEASTONE COTTAGES WEYBOURNE HOLT NR25 7HG UK'
Timer: read lexicon: 2 ms
Timer: load lexicon: 151 ms
Timer: init tokenizer: 0 ms
Timer: tokenizer::getTokens: 164 ms
-------------------------------
TOKEN: 1 NUMBER BADTOKEN 1
TOKEN: A WORD,SINGLE BADTOKEN 0
TOKEN: SEASTONE WORD BADTOKEN 0
TOKEN: COTTAGES WORD,TYPE BADTOKEN 1
TOKEN: WEYBOURNE WORD BADTOKEN 0
TOKEN: HOLT WORD BADTOKEN 0
TOKEN: NR25 MIXED,PCH BADTOKEN 0
TOKEN: 7HG MIXED,PCT BADTOKEN 0
TOKEN: UK NATION BADTOKEN 1
-------------------------------
Timer: read grammar file: 0 ms
Timer: init Grammar: 17 ms
OK loading grammar:
Timer: Search::searchAndReclassBest: 317 ms
Stats: findMetas: 720
Stats: findRules: 12730
bestCost: 0.566667
TOKEN: 1 1 NUMBER HOUSE 1
TOKEN: A A SINGLE HOUSE 0
TOKEN: SEASTONE SEASTONE WORD STREET 0
TOKEN: COTTAGES COTTAGES TYPE SUFTYP 1
TOKEN: WEYBOURNE WEYBOURNE WORD CITY 0
TOKEN: HOLT HOLT WORD CITY 0
TOKEN: NR25 NR25 MIXED POSTAL 0
TOKEN: 7HG 7HG MIXED POSTAL 0
TOKEN: UK UNITED KINGDOM OF GREAT BRITAIN NATION NATION 1
real 0m0.663s
user 0m0.649s
sys 0m0.012s
Closing this with commit 312a661 to develop.
The current design is clean and simple: it stores the rules in a map. The downside is that we constantly call find(rule) on that map during the search. Since the grammar is really a tree of rules, it should be faster to build nodes that hold pointers to their children and simply follow the pointers at search time, resolving each rule name exactly once at load time.
TODO:
Collected some stats on how many calls to find() are made: