zverok / spylls

Pure Python spell-checker, (almost) full port of Hunspell
https://spylls.readthedocs.io
Mozilla Public License 2.0

spylls fails to load Dutch dictionary #7

Closed rsmith-nl closed 3 years ago

rsmith-nl commented 3 years ago

When I tried to load the Dutch dictionary from https://github.com/OpenTaal/opentaal-hunspell, it failed:

In [1]: from spylls.hunspell import Dictionary

In [2]: dictionary = Dictionary.from_files("github/opentaal-hunspell/nl")
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-2-a2daf207bcaa> in <module>
----> 1 dictionary = Dictionary.from_files("github/opentaal-hunspell/nl")

/usr/local/lib/python3.9/site-packages/spylls/hunspell/dictionary.py in from_files(cls, path)
    116 
    117         aff, context = readers.read_aff(FileReader(path + '.aff'))
--> 118         dic = readers.read_dic(FileReader(path + '.dic', encoding=context.encoding), aff=aff, context=context)
    119 
    120         return cls(aff, dic)

/usr/local/lib/python3.9/site-packages/spylls/hunspell/readers/dic.py in read_dic(source, aff, context)
     58                 # So we just mutate the list of parts we are currently processing, so those fetched
     59                 # by numeric alias would be handled.
---> 60                 parts.extend(aff.AM[part])
     61             else:
     62                 # ...otherwise, it is still part of the word

KeyError: '10'

Looking at the nl.dic file, the relevant word is on line 18906: boekentop 10.

Hunspell itself, however, accepts this dictionary and uses it:

hunspell -d github/opentaal-hunspell/nl
Hunspell 1.7.0
boekentop 10
*
*

boekentop-10
& boekentop-10 1 0: boekentop 10

This seems to indicate that the comment on line 56 of hunspell/readers/dic.py ("# If it is just numeric AND not the first part in string, it is "morphology alias"") is not correct. I cannot find the term "morphology alias" in hunspell(5), so I'm not sure what is meant by it. The manual does show numerical flags being used, but numbers in the stem should not be interpreted, AFAICT.

zverok commented 3 years ago

Awesome find, thank you :) It was a false assumption on my side that, after a " ", the second part is always a morphological alias (those are documented in this section): [image]

Digging deeper into Hunspell's code, I found out that morphological aliases must be split off the stem with \t; otherwise the number is indeed part of the word. The dictionary now loads successfully in spylls (and it indeed suggests "boekentop 10" for "boekentop10" and "boekentop-10"). I'll just redocument it a bit and push/release.
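
The rule described above can be illustrated with a minimal sketch. This is not the actual spylls code: `split_dic_line` and `morph_aliases` are hypothetical names, and the sketch deliberately ignores everything else about the .dic format (escaped slashes, other kinds of morphological fields). It only shows that a number after a space stays in the word, while a number after a tab is looked up as an alias.

```python
def split_dic_line(line, morph_aliases):
    """Split one .dic entry into (stem, flags, morphology).

    Hypothetical helper for illustration, not part of spylls.
    `morph_aliases` is assumed to map alias numbers (as strings)
    to lists of morphological fields, like the AM table in the traceback.
    """
    # Only a TAB separates the word from morphological data or aliases;
    # a space (as in "boekentop 10") belongs to the word itself.
    word_part, _, morph_part = line.rstrip("\n").partition("\t")

    # Flags, if any, follow the first slash (escaped slashes are ignored here).
    stem, _, flags = word_part.partition("/")

    morphology = []
    for field in morph_part.split():
        if field.isdigit():
            # Purely numeric field after the TAB: expand the alias.
            morphology.extend(morph_aliases[field])
        else:
            morphology.append(field)
    return stem, flags, morphology
```

With this split, `split_dic_line("boekentop 10", {})` keeps "boekentop 10" as the stem and never touches the alias table, so the KeyError from the report cannot occur for such entries.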

Thank you for the time you spent on the investigation!

rsmith-nl commented 3 years ago

Victor,

Thank you for writing spylls!

Over the years I've tried using hunspell for checking LaTeX files several times. But even with the -t flag it insists on checking LaTeX command names, which is silly and annoying. With spylls it will become easier to write something that works better.
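
As a rough idea of what such a tool could look like, here is a minimal sketch that drops LaTeX command names before checking the remaining words. The names `misspelled_words` and `is_known` are hypothetical, and the regular expressions are a simplification: command arguments such as label keys or environment names are still checked.

```python
import re

# A LaTeX command name: backslash, letters, optional star (e.g. \section*).
LATEX_COMMAND = re.compile(r"\\[A-Za-z]+\*?")
# A word: runs of letters, optionally joined by hyphens or apostrophes.
WORD = re.compile(r"[^\W\d_]+(?:[-'][^\W\d_]+)*")


def misspelled_words(latex_source, is_known):
    """Return the words in `latex_source` that `is_known` rejects.

    `is_known` is any callable taking a word and returning True/False.
    """
    # Replace command names with a space so they are never checked.
    text = LATEX_COMMAND.sub(" ", latex_source)
    return [word for word in WORD.findall(text) if not is_known(word)]
```

Assuming a dictionary loaded with `Dictionary.from_files(...)` as above, its `lookup` method could be passed as the checker: `misspelled_words(source, dictionary.lookup)`.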

zverok commented 3 years ago

The new version (0.1.2) has been released; it should hopefully work with Dutch dictionaries!