zverok / spylls

Pure Python spell-checker, (almost) full port of Hunspell
https://spylls.readthedocs.io
Mozilla Public License 2.0

Support for extracting all valid words from a dictionary #10

Closed · fabiospampinato closed this issue 3 years ago

fabiospampinato commented 3 years ago

Hello there! First of all, thanks a lot for writing Spylls, all the documentation, and the blog series. I haven't read it all in depth yet, but so far it has been wonderfully informative; from what I've seen it's the best documentation available on how Hunspell works, and Spylls itself is probably the best port of Hunspell to another language.

I have a use case where I basically need to perform spell-checking in the browser. From what I've seen, mainly from playing around with nspell, a Hunspell-like approach to spell-checking would be pretty prohibitive for my use case, at least for some languages: it takes too long to generate words or even just to parse dictionaries, and keeping all those strings in memory can blow up memory usage fairly rapidly, which is a concern for me.

I think it's possible to approach the problem differently: pre-parsing dictionaries into a compact ~binary representation and working with that directly, trading some lookup performance (hopefully not too much) for great memory usage and startup times. There's some shallow documentation about that here, but I think you alluded to something like this in a blog post when mentioning another spell checker, so you might already be familiar with the idea.

Anyway, I'd like to spend some time implementing something like that. To do that I would first need to extract all valid words from all dictionaries, and that sounds like a task Spylls could be well suited to address. It's not a problem if it takes hours for some dictionaries; I care more about extracting all valid words as correctly as possible than about doing it quickly.

So could Spylls expose an API for doing that?

zverok commented 3 years ago

Hey!

Thanks for your kind words about my work. The question of linearizing the dictionary (Hunspell and its predecessors call this "unmunching" for some reason) is a quirky one.

Basically, if we ignore word compounding, it is more or less trivial: look at the word's flags, see which suffixes and prefixes it might have, and produce all valid combinations. Here is a proof-of-concept Spylls-based script for doing so: https://gist.github.com/zverok/c574b7a9c42cc17bdc2aa396e3edd21a I haven't tested it extensively, but it seems to work and produce valid results, so you can try to use it for your idea!
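
A stripped-down illustration of that idea (made-up stems and affix table, ignoring affix conditions, cross-products, two-fold suffix stripping and compounding; the gist handles the real thing):

```python
# Each stem carries flags; each flag points to (strip, add) suffix pairs.
STEMS = {"walk": {"S", "G"}, "cat": {"S"}}
SUFFIXES = {
    "S": [("", "s")],    # walk -> walks, cat -> cats
    "G": [("", "ing")],  # walk -> walking
}

def unmunch(stem, flags):
    yield stem
    for flag in flags:
        for strip, add in SUFFIXES.get(flag, []):
            base = stem[: -len(strip)] if strip else stem
            yield base + add

for stem, flags in sorted(STEMS.items()):
    print(*sorted(unmunch(stem, flags)))
# cat cats
# walk walking walks
```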

The problem is, this script can't be made universal to handle every possible dictionary, because of word compounding. In languages with word compounding, the "potentially possible" words list is incredibly huge. (The topic is covered in this article.)

For example, in the nb_NO dictionary (one of the two variants of Norwegian), there are 11k words with COMPOUNDFLAG. This means that any combination of those 11k words is "theoretically possible" (not only every pair, but also three, four, ... in a row). The dictionary contains enough information to tell whether a given compound word is valid, but not enough to produce all really existing/used compounds (see the example in the article I've linked: “røykfritt” (“smoke-free” = “non-smoking”) and “frittrøyk” are both "valid" for Hunspell, but the second one is not actually an existing word... though I assume some poet might use it, and Norwegian readers might understand what they mean in context).

That's why "unmunching" the list of words for languages with compounding is impossible; so, for getting just a list of "all actually existing Norwegian words", Hunspell's dictionary is useless.

Another curious example: the English dictionary has "compounding rules" used to check numerals, like "11th is valid, and 10000011th is valid, but 10000011nd is not". An attempt to produce "all possible" combinations from such a rule will obviously lead to an infinite loop :)
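
To make the "infinite loop" point concrete, here is a toy model of such a rule (the flags are hypothetical, not the real en_US ones); a compound rule behaves essentially like a regular expression over the flags of the compound's parts:

```python
import re
from itertools import product

# Hypothetical flags: every digit word carries 'n', the word "1" additionally
# carries '1', the ordinal suffixes "th"/"nd" carry 't'/'d'.  The rule "n*1t"
# then accepts 11th, 10000011th, ... but not 10000011nd.
FLAGS = {str(d): {"n"} for d in range(10)}
FLAGS["1"].add("1")
FLAGS["th"] = {"t"}
FLAGS["nd"] = {"d"}

RULE = re.compile(r"^n*1t$")

def is_valid_compound(parts):
    # Valid if, picking one flag per part, the flag string matches the rule.
    if not all(p in FLAGS for p in parts):
        return False
    return any(RULE.match("".join(pick))
               for pick in product(*(FLAGS[p] for p in parts)))

print(is_valid_compound([*"11", "th"]))        # True
print(is_valid_compound([*"10000011", "th"]))  # True
print(is_valid_compound([*"10000011", "nd"]))  # False
# Enumerating everything "n*" can match never terminates.
```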

As far as I understand, that's the biggest obstacle for "full list"-based spellcheckers: AFAIK, the LanguageTool grammar-checking project uses the morfologik spellchecker, which encodes word lists with an FSA very efficiently, but falls back to Hunspell for languages like German or Norwegian due to compounding.

fabiospampinato commented 3 years ago

> Here is a proof-of-concept Spylls-based script for doing so: gist.github.com/zverok/c574b7a9c42cc17bdc2aa396e3edd21a

Awesome! Thanks a lot.

> In languages with word compounding, the "potentially possible" words list is incredibly huge.

> The English dictionary has "compounding rules" used to check numerals, like "11th is valid, and 10000011th is valid, but 10000011nd is not". An attempt to produce "all possible" combinations from such a rule will obviously lead to an infinite loop :)

That's indeed a big problem 🤔 Thanks for pointing that out.

Potentially that could be solved by taking a hybrid approach: using an FSA for regular words and some dynamic logic to account for compounding. However, with a hybrid approach I can't just ignore Hunspell's logic entirely, I can't just throw away all the flags after parsing, and accounting for compounding will most probably be tough at best.

I'll have to think more about this.

zverok commented 3 years ago

> Potentially that could be solved by taking a hybrid approach

Yes, that's what comes to mind:

  1. "unmunch" should preserve flags (maybe recoding it into more obvious format) like "onlyincompound", "compoundflag", "comboundbegin" etc
  2. Then, at least the part of Hunspell's algorithms that calculates compounding by flags should be ported (compound_by_flags, but instead of calling affix_forms it should just look in the unmunched word list and deduce whether compound is good)
  3. But looking closer to it, unmunching should be quite smart to understand also "compoundforbid" (this affix should never be for compound word) and "compoundpermit" (this affix can even be in the middle of the compound), and tagging the unmunched words appropriately (this particular form is ok in any compound... this is not)
  4. Mechanism of compounding by rules seem to be used (in all dictionaries I saw it) only for numerals producing, so I believe it is more productive to hardcode it somehow in more simple form
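
A very rough sketch of what item 2 could look like on top of a tagged unmunched list (hypothetical tags and toy words; real Hunspell also checks COMPOUNDMIN, CHECKCOMPOUNDDUP/REP/PATTERN, forced capitalization, etc.):

```python
# Hypothetical tagged word list: each unmunched form is annotated with the
# compound positions it may occupy (derived from compoundbegin/middle/end,
# compoundpermit and compoundforbid while unmunching).
WORDLIST = {
    "røyk": {"begin", "middle", "end"},
    "fritt": {"end"},
}

def is_compound(word, min_part=2):
    # A word is a compound if some sequence of splits uses only parts
    # allowed at their respective positions.
    def check(rest, position):
        for i in range(min_part, len(rest) - min_part + 1):
            part, tail = rest[:i], rest[i:]
            if position not in WORDLIST.get(part, ()):
                continue
            if "end" in WORDLIST.get(tail, ()) or check(tail, "middle"):
                return True
        return False
    return check(word, "begin")

print(is_compound("røykfritt"))  # True
print(is_compound("frittrøyk"))  # False (with these particular tags)
```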

In general, it all looks bearable, yet not a small amount of work. If you take this road and need more info on how this mad mechanism works, feel free to ping me (here or by email); I've spent a lot of intimate moments with it and will be glad if my acquired knowledge, however incomplete, is of some use.

fabiospampinato commented 3 years ago

> In general, it all looks bearable, yet not a small amount of work.

Yeah, another problem is that I don't really have a clue how to build a serialized DA-FSA that can be worked with directly, without requiring it to be deserialized first. There's some somewhat approachable material on ~serializing a Trie, but the compression ratios would be much better with a DA-FSA, and for that I've only been able to find some papers that talk about it and some probably buggy implementations that nobody seems to use.

> If you take this road and need more info on how this mad mechanism works, feel free to ping me (here or by email); I've spent a lot of intimate moments with it and will be glad if my acquired knowledge, however incomplete, is of some use.

Thank you, the knowledge you've distilled in what you posted online has already been very useful to me. I'll let you know if I actually end up working on that; for now I'm thinking it would probably be unwise for me to dedicate that much time to it. I think the next step for me should be to check the performance characteristics of a WASM port of Hunspell, as I probably overlooked a few things the first time I looked into that. If memory usage there isn't insane like it is with nspell, I'll probably just use that for now; but even in that scenario I'll still be very much interested in making something better for my use case.

I'm potentially open to looking into alternative approaches, but I care about memory usage and startup time, and for that working with a succinct DA-FSA seems optimal really (just load a compressed string into memory and that's it).

zverok commented 3 years ago

> I don't really have a clue how to build a serialized DA-FSA that can be worked with directly, without requiring it to be deserialized first

You might be interested in morfologik, BTW. They solve exactly this problem. I actually did a port of the algorithm to Ruby (you might notice a recurring motif in my activities :joy:); even a naive implementation in a "known to be slow" language is quite effective. But that reimplementation was "blind" (I just ported the Java to Ruby, cleaned it up, and debugged it to work exactly like the Java version), so I don't have a deep understanding, but I assume at the top level it is laid out like this:

1. Assume we have words to encode: "aardvark", "abacus", "abbot", "abbreviation", ...

2. In the FSA, it would be encoded (logically) this way:

a -> a -> rdvark
       b -> a -> cus
              b -> o -> t
                     r -> eviation

3. Physically, it would be laid out like this:

<start point>
<address to "A" as a first letter>
<address to "B" as a first letter>
....

A-as-a-first-letter point
<address to "A" as a second letter>
<address to "B" as a second letter>
....

A-as-a-second-letter point
<address to "A" as a third letter...>

So, when we want to check "aardvark", we do this: start at the start point, follow the address stored for "a" as a first letter, from there the address for "a" as a second letter, and so on for every letter; if at some step there is no address for the next letter, the word is not in the list.

Something like that
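
A minimal Python sketch of that walk over a toy in-memory layout (a plain trie rather than morfologik's actual byte format; a real DA-FSA would additionally merge shared suffixes):

```python
# Nodes are rows of a flat table; each row maps a letter to the index of the
# child row. END marks "a word may end here".
WORDS = ["aardvark", "abacus", "abbot", "abbreviation"]
END = "$"

nodes = [{}]                       # row 0 is the start point
for word in WORDS:
    cur = 0
    for letter in word:
        if letter not in nodes[cur]:
            nodes.append({})
            nodes[cur][letter] = len(nodes) - 1
        cur = nodes[cur][letter]
    nodes[cur][END] = True

def lookup(word):
    cur = 0
    for letter in word:            # follow the "address" stored for each letter
        if letter not in nodes[cur]:
            return False
        cur = nodes[cur][letter]
    return END in nodes[cur]       # must stop where a word is allowed to end

print(lookup("aardvark"), lookup("abbot"), lookup("abb"))  # True True False
```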

fabiospampinato commented 3 years ago

I'll look into morfologik, thanks for the pointer.

The way you explained it seems fairly approachable, perhaps I should try implementing something like that.

However there might be a couple of problematic details with that:

The best documentation I've found so far about this is this lecture, but it stops at encoding Tries, not full-blown DA-FSAs.

fabiospampinato commented 3 years ago

> that's at minimum 20 bits of information just for each of those pointers.

Well, I guess each pointer, rather than encoding the distance from the start, could encode the distance from itself, saving a good amount of memory; but as in theory one doesn't need to encode this information at all, perhaps it'd be better just to try to implement the more optimal data structure.
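
For illustration, relative offsets also pair nicely with a variable-length encoding, so short jumps cost fewer bytes than absolute addresses (the numbers below are made up, just to show the order of magnitude):

```python
def varint(n: int) -> bytes:
    # LEB128-style encoding: 7 bits per byte, high bit means "more follows".
    out = bytearray()
    while True:
        byte, n = n & 0x7F, n >> 7
        out.append(byte | (0x80 if n else 0))
        if not n:
            return bytes(out)

print(len(varint(1_500_000)))  # absolute address into a big table: 3 bytes
print(len(varint(90)))         # short jump relative to the pointer itself: 1 byte
```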

zverok commented 3 years ago

Bear in mind that I'm retelling "how it works" from memory of a loose reimplementation I did 2 years ago :) But I know that morfologik's encoding is quite efficient:

> Example: for Polish, the 3.5 million word list becomes less than 1MB file in the FSA format.

fabiospampinato commented 3 years ago

Interesting, perhaps then it would be pretty good for my use case 🤔 I definitely have to look into that.

fabiospampinato commented 3 years ago

Apparently the current size of that Polish dictionary is closer to 3MB, but that's still pretty good; maybe they just added more words to it: https://github.com/morfologik/morfologik-stemming/blob/master/morfologik-polish/src/main/resources/morfologik/stemming/polish/polish.dict

fabiospampinato commented 3 years ago

I've started dedicating some time to this. Some random thoughts:

Traceback (most recent call last):
  File "src/unmunch.py", line 32, in <module>
    from spylls.hunspell.dictionary import Dictionary
  File "/usr/local/Cellar/pypy3/7.3.3/libexec/site-packages/spylls/hunspell/__init__.py", line 1, in <module>
    from .dictionary import Dictionary
  File "/usr/local/Cellar/pypy3/7.3.3/libexec/site-packages/spylls/hunspell/dictionary.py", line 11, in <module>
    from spylls.hunspell.algo import lookup, suggest
  File "/usr/local/Cellar/pypy3/7.3.3/libexec/site-packages/spylls/hunspell/algo/lookup.py", line 37, in <module>
    import spylls.hunspell.algo.permutations as pmt
AttributeError: module 'spylls' has no attribute 'hunspell'


Traceback (most recent call last):
  File "/Users/fabio/Desktop/notable-spellchecker/src/unmunch.py", line 34, in <module>
    dictionary = Dictionary.from_files(sys.argv[1])
  File "/usr/local/lib/python3.9/site-packages/spylls/hunspell/dictionary.py", line 141, in from_files
    aff, context = readers.read_aff(FileReader(path + '.aff'))
  File "/usr/local/lib/python3.9/site-packages/spylls/hunspell/readers/aff.py", line 105, in read_aff
    dir_value = read_directive(source, line, context=context)
  File "/usr/local/lib/python3.9/site-packages/spylls/hunspell/readers/aff.py", line 161, in read_directive
    value = read_value(source, name, *arguments, context=context)
  File "/usr/local/lib/python3.9/site-packages/spylls/hunspell/readers/aff.py", line 251, in read_value
    for line in _read_array(int(count))
ValueError: invalid literal for int() with base 10: 'ցվեցին/CH'
Traceback (most recent call last):
  File "/Users/fabio/Desktop/notable-spellchecker/src/unmunch.py", line 34, in <module>
    dictionary = Dictionary.from_files(sys.argv[1])
  File "/usr/local/lib/python3.9/site-packages/spylls/hunspell/dictionary.py", line 141, in from_files
    aff, context = readers.read_aff(FileReader(path + '.aff'))
  File "/usr/local/lib/python3.9/site-packages/spylls/hunspell/readers/aff.py", line 105, in read_aff
    dir_value = read_directive(source, line, context=context)
  File "/usr/local/lib/python3.9/site-packages/spylls/hunspell/readers/aff.py", line 161, in read_directive
    value = read_value(source, name, *arguments, context=context)
  File "/usr/local/lib/python3.9/site-packages/spylls/hunspell/readers/aff.py", line 249, in read_value
    return [
  File "/usr/local/lib/python3.9/site-packages/spylls/hunspell/readers/aff.py", line 250, in <listcomp>
    make_affix(directive, flag, crossproduct, *line, context=context)
  File "/usr/local/lib/python3.9/site-packages/spylls/hunspell/readers/aff.py", line 294, in make_affix
    return kind_class(
  File "<string>", line 9, in __init__
  File "/usr/local/lib/python3.9/site-packages/spylls/hunspell/data/aff.py", line 266, in __post_init__
    self.cond_regexp = re.compile(self.condition.replace('-', '\\-') + '$')
  File "/usr/local/Cellar/python@3.9/3.9.2_1/Frameworks/Python.framework/Versions/3.9/lib/python3.9/re.py", line 252, in compile
    return _compile(pattern, flags)
  File "/usr/local/Cellar/python@3.9/3.9.2_1/Frameworks/Python.framework/Versions/3.9/lib/python3.9/re.py", line 304, in _compile
    p = sre_compile.compile(pattern, flags)
  File "/usr/local/Cellar/python@3.9/3.9.2_1/Frameworks/Python.framework/Versions/3.9/lib/python3.9/sre_compile.py", line 764, in compile
    p = sre_parse.parse(p, flags)
  File "/usr/local/Cellar/python@3.9/3.9.2_1/Frameworks/Python.framework/Versions/3.9/lib/python3.9/sre_parse.py", line 948, in parse
    p = _parse_sub(source, state, flags & SRE_FLAG_VERBOSE, 0)
  File "/usr/local/Cellar/python@3.9/3.9.2_1/Frameworks/Python.framework/Versions/3.9/lib/python3.9/sre_parse.py", line 443, in _parse_sub
    itemsappend(_parse(source, state, verbose, nested + 1,
  File "/usr/local/Cellar/python@3.9/3.9.2_1/Frameworks/Python.framework/Versions/3.9/lib/python3.9/sre_parse.py", line 549, in _parse
    raise source.error("unterminated character set",
re.error: unterminated character set at position 36

fabiospampinato commented 3 years ago

Short summary:

Overall the unmunch script seems to have worked pretty well, I'd say the next steps are:

fabiospampinato commented 3 years ago

Here's the result of running Hunspell's unmunch over the problematic dictionaries:

Better than nothing, I guess. It'd be great if I could regenerate these word lists with Spylls though, disabling compound rules entirely, as I'll handle them specially.

zverok commented 3 years ago

@fabiospampinato Interesting investigation! I'll definitely dive into fixing/investigating all the problems sometime in the upcoming week (can't promise I'll investigate the PyPy one, but broken dictionaries shouldn't remain broken). Just to be in sync: where did you get the dictionaries from? (So I'll be testing on the exact same ones.)

fabiospampinato commented 3 years ago

> can't promise I'll investigate the PyPy one, but broken dictionaries shouldn't remain broken

No worries, that doesn't actually matter to me; I'd be happy to leave this running overnight or something. I care much more about it working well.

> Just to be in sync: where did you get the dictionaries from? (So I'll be testing on the exact same ones.)

I'm using the dictionaries listed here: https://github.com/wooorm/dictionaries

fabiospampinato commented 3 years ago

Update on the progress with the succinct trie:

Basically I'm kind of ecstatic, this data structure is amazing.

I'm just hoping the situation regarding compound words has a reasonably simple solution, because I potentially can't ship this until I get that working.

fabiospampinato commented 3 years ago

A little analysis of alphabets as extracted from words lists ordered by frequencies:

fabiospampinato commented 3 years ago

Interestingly, all alphabets but one seem to have <= 256 letters, which should allow me to use at most 1 byte per letter to store them. One should probably be able to pack more than one letter into a byte for languages with very short alphabets; I'll have to investigate that.

The language with more than 256 letters may be a bit of an issue encoding-wise 🤔
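
The rough arithmetic behind the packing idea (the alphabet sizes here are just examples, not the measured ones):

```python
import math

# A fixed-width code needs ceil(log2(k)) bits per letter for a k-symbol
# alphabet, so a bit-packed run of 8 letters takes `bits` bytes instead of 8.
for k in (16, 26, 200, 300):
    bits = math.ceil(math.log2(k))
    print(f"{k:>3} symbols: {bits} bits/letter -> 8 letters fit in {bits} bytes")
```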

fabiospampinato commented 3 years ago

Update on the encoded trie, which is largely done and so far has the following characteristics:

I'll put the finishing touches on this thing, and then I'll try to understand what the situation is regarding compound words; any guidance on that would be greatly appreciated.

Mainly I'd say I'd like to have a way to extract compound words up to some limit, just to gauge the situation across dictionaries; then I can think about how to address it.

fabiospampinato commented 3 years ago

Two random thoughts:

zverok commented 3 years ago

NB: I follow your comments, but during the weekdays I can only give them shallow attention; I'll probably read more deeply on the weekend. But here are two things I know about morfologik's approach that you might find interesting:

  1. They seem to encode the data byte-by-byte, so the format "knows" nothing about the encoding (e.g. the UTF-8 string "мама" is put into the FSA as the byte sequence [208, 188, 208, 176, 208, 188, 208, 176]; see the snippet below the list). It is probably harder to debug, but easier for memory layout, so the encoder/decoder can be absolutely "dumb".
  2. They seem to store only the "canonical" capitalization, so only "gatto" (cat) can be found in the FSA, but not "Gatto" or "GATTO", and only "Parigi" (Paris), but not "PARIGI".
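
The byte sequence in point 1 is just UTF-8 (plain Python, nothing morfologik-specific):

```python
print(list("мама".encode("utf-8")))  # [208, 188, 208, 176, 208, 188, 208, 176]
```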

(I still suspect that digging in their code/links/docs could be fruitful for your tasks, too.)

I'll investigate the dictionary bugs in the next few days, and hopefully I'll be able to provide some guidance on compounding when the time comes.

zverok commented 3 years ago

This might be a good starting point for their approach to FSA building: https://github.com/morfologik/morfologik-stemming/blob/master/morfologik-fsa-builders/src/main/java/morfologik/fsa/builders/FSABuilder.java
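
For orientation, here is a compact sketch of the classic incremental construction of a minimal DA-FSA from a sorted word list (Daciuk et al.); I believe builders like this one follow a broadly similar idea, with serialization to a byte layout as a separate, later step:

```python
class Node:
    def __init__(self):
        self.edges = {}      # letter -> Node
        self.final = False   # a word may end here

    def signature(self):
        # Two nodes are interchangeable iff their signatures are equal.
        return (self.final, tuple((c, id(n)) for c, n in sorted(self.edges.items())))

def build_dafsa(sorted_words):
    root, register, previous = Node(), {}, ""

    def replace_or_register(node):
        # Minimize the most recently added child (the largest edge, since
        # the input is sorted), bottom-up.
        letter = max(node.edges)
        child = node.edges[letter]
        if child.edges:
            replace_or_register(child)
        sig = child.signature()
        if sig in register:
            node.edges[letter] = register[sig]   # reuse an equivalent node
        else:
            register[sig] = child

    for word in sorted_words:
        # Length of the common prefix with the previously inserted word.
        p = 0
        while p < min(len(word), len(previous)) and word[p] == previous[p]:
            p += 1
        node = root
        for c in word[:p]:
            node = node.edges[c]
        if node.edges:
            # The suffix of the previous word below `node` won't change any
            # more, so it can be minimized against the register.
            replace_or_register(node)
        for c in word[p:]:
            node.edges[c] = Node()
            node = node.edges[c]
        node.final = True
        previous = word
    if root.edges:
        replace_or_register(root)
    return root

def lookup(root, word):
    node = root
    for c in word:
        if c not in node.edges:
            return False
        node = node.edges[c]
    return node.final

fsa = build_dafsa(["aardvark", "abacus", "abbot", "abbreviation"])
print(lookup(fsa, "abbot"), lookup(fsa, "abba"))  # True False
```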

fabiospampinato commented 3 years ago

Sorry I'm dumping too many words in this thread 😅.

An update/summary:

zverok commented 3 years ago

About dictionary parsing (hmm, it would be cool to split this ticket into several discussions actually... Maybe we can go with a private repo for your project with me as a contributor, so I can see/answer issues and discussions there? Or WDYT?):

I'll investigate "forever to complete" (infinite loops or too long lists) dictionaries next.

PS: About this one:

> It'd be great if you could update the unmunch script to extract compound words

I don't believe that is the right way, and I'm not sure I have the resources to do it. If you want to try approaching this yourself, I might help with some advice; if you go the preprocessing way (try to split, then see if it is a proper compound, which I believe is more promising), I can be of some advice too.

fabiospampinato commented 3 years ago

> About dictionary parsing (hmm, it would be cool to split this ticket into several discussions actually... Maybe we can go with a private repo for your project with me as a contributor, so I can see/answer issues and discussions there? Or WDYT?):

Sure, I'll open up a repo for this tomorrow.

In the meantime I'll close this issue as I would consider the original request addressed.

Regarding the dictionaries that throw errors:

> I don't believe that is the right way, and I'm not sure I have the resources to do it. If you want to try approaching this yourself, I might help with some advice; if you go the preprocessing way (try to split, then see if it is a proper compound, which I believe is more promising), I can be of some advice too.

I think I'll put this on hold for now, as all the possible paths seem complicated and I can't dedicate too much time to this at the moment. I'll probably get back to it in a couple of months or so, or if a simpler path materializes. In the meantime I'll ship what I've got and see how it behaves in practice.