xen / dawgdic

Automatically exported from code.google.com/p/dawgdic
BSD 3-Clause "New" or "Revised" License
0 stars 0 forks source link

dawgdic-build gives `error: failed to insert key:` on certain keys #8

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. download test150.txt (attached)
2. run dawgdic-build test150.txt

What is the expected output? What do you see instead?

Trying to get a built dawgdic (traced this issue back from a project I'm 
working on with the `dawg` module in python.

Instead I get the error:
    error: failed to insert key: +.n|+.x|a.x|c.n;b.nAgAmAC0A

What version of the product are you using? On what operating system?

using version 0.4.4 (latest) on mac os X 10.6

Please provide any additional information below.

The attached file is the output of a python process that uses 
`f.write(dawg.BytesDawg()._raw_key(k, v))` to generate the lines. It works for 
virtually all values I throw at it it, but very rarely (maybe one out of 
several million keys) fails like this.

I'm hoping to at least find a work-around -- this library is really 
extraordinary.

Thanks.

Original issue reported on code.google.com by peter.mc...@gmail.com on 11 Oct 2012 at 6:06

Attachments:

GoogleCodeExporter commented 9 years ago
Thanks for your report.

I've found that test150.txt is not correctly ordered.
In detail, the 65th line is greater than the 66th line.

65: +.n|+.x|a.x|c.n;b.nAgAmACoA
66: +.n|+.x|a.x|c.n;b.nAgAmAC0A

I've checked this as follows:

$ LC_ALL=C sort test150.txt > sorted.txt
$ diff test
$ diff test150.txt sorted.txt 
65,66d64
< +.n|+.x|a.x|c.n;b.nAgAmACoA
< +.n|+.x|a.x|c.n;b.nAgAmAC0A
67a66,67
> +.n|+.x|a.x|c.n;b.nAgAmAC0A
> +.n|+.x|a.x|c.n;b.nAgAmACoA

Would you test again after this correction?

Original comment by susumu.y...@gmail.com on 12 Oct 2012 at 1:05

GoogleCodeExporter commented 9 years ago
yes sorry it does seem that the file is out of order, thanks.

However now, using my full data set (a 2.3GB text file) I get the following 
error:

bash-3.2$ dawgdic-build -g testBundleSubs2sorted.txt testBundleSubs2.dawg
no. keys: 46112348
no. states: 307452395
no. transitions: 348225617
no. merged states: 1073607010
no. merging states: 39222660
no. merged transitions: 1032833787
error: failed to build Dictionary

I think in fact that this is the problem I was actually tracing back from the 
python wrapper.
The data is quite large, does dawgdic have any kind of size limits that may be 
the problem?
Unfortunately I can't share the full data file.

Thanks

Original comment by peter.mc...@gmail.com on 13 Oct 2012 at 3:29

GoogleCodeExporter commented 9 years ago
The number of units in Dictionary is limited to 2^29 (around 536M).

I'm not sure how many keys you can put in a Dictionary, but the number of units 
must be greater than the number of transitions.

In this case, maybe you can put a half of the data set in one Dictionary.

Original comment by susumu.y...@gmail.com on 14 Oct 2012 at 11:42

GoogleCodeExporter commented 9 years ago
Ok, I suspected I was hitting some sort of size limit.
Rather than split the data over multiple dictionaries I've switched over to 
using marisa-trie (I haven't yet hit any upper limits there).
Thanks for the help, and for the library, which will almost certainly be useful 
for the future.

Original comment by peter.mc...@gmail.com on 14 Oct 2012 at 2:24

GoogleCodeExporter commented 9 years ago
It's a nice idea.

I heard that 1 billion keys (probably several times larger than your data set) 
can be stored in one marisa-trie instance.

But please note that marisa-trie is much slower than dawgdic.
If you need a fast data structure, it might be a bad choice.

Good luck.

Original comment by susumu.y...@gmail.com on 15 Oct 2012 at 9:02