shantanuo closed 3 years ago
Oh, that's an interesting one. I can imagine how it happened, though. Could you please share the dictionaries you are using, so I can test it more easily?
Here is how to repeat:
!wget -N https://kagapa.s3.ap-south-1.amazonaws.com/with_acor_N.oxt
!unzip -o ./with_acor_N.oxt
from spylls.hunspell import Dictionary
dictionary = Dictionary.from_files('./dicts/mr_IN')
#####
# Returns True
print(dictionary.lookup('मानवी'))
# This should not return anything because the word is correct
for suggestion in dictionary.suggest('मानवी'):
    print(suggestion)
#####
# Returns False
print(dictionary.lookup('मान्वी'))
# Should return suggestions, but getting an error
for suggestion in dictionary.suggest('मान्वी'):
    print(suggestion)
It works as expected using hunspell module.
#sudo apt-get install -y libhunspell-dev
#pip install python-dev
#pip install hunspell
import hunspell
spellchecker = hunspell.HunSpell(
"./dicts/mr_IN.dic",
"./dicts/mr_IN.aff",
)
spellchecker.spell('मानवी')
for suggestion in spellchecker.suggest('मानवी'):
    print(suggestion)
spellchecker.spell('मान्वी')
for suggestion in spellchecker.suggest('मान्वी'):
    print(suggestion)
What is the advantage of using spylls over hunspell?
Thanks for the details, I'll look into it!
What is the advantage of using spylls over hunspell?
If you just need to check spelling, I believe there is not much: maybe the fact that spylls is pure Python, so it can be installed where hunspell can't (some CI environments?), and that it is hackable (you can look into dictionary contents, settings, etc.).
The goal of the project is to be readable and hackable, while (hopefully) reproducing all of Hunspell's behavior.
Yes, I can see where it can be useful. For example, someone could resolve this bug:
https://github.com/hunspell/hunspell/issues/497
It would be helpful if I could nest more than 2 levels of affix rules.
The type error is fixed in master, thanks for noticing!
for suggestion in dictionary.suggest('मान्वी'):
    print(suggestion)
# Now prints:
# मानवी
# मानावी
As for whether suggestions should be printed for an already correct word, I prefer to keep it simple. It is just as easy for client code to check whether the word is correct first, and returning suggestions for a correct word might be considered useful functionality, too (it lists words similar to this one).
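The client-side check mentioned above could look something like the sketch below. The helper name `suggestions_for_misspelled` is made up for illustration and is not part of spylls; it relies only on the `lookup` and `suggest` methods already used earlier in this thread.

```python
def suggestions_for_misspelled(dictionary, word):
    """Return suggestions only when `word` fails the spelling check.

    `dictionary` is any object with `lookup(word)` and `suggest(word)`,
    for example the spylls Dictionary loaded earlier in this thread.
    The helper name is hypothetical, not part of spylls.
    """
    if dictionary.lookup(word):
        # The word is spelled correctly, so skip the suggestion step.
        return []
    # suggest() is iterated over in the examples above; collect it into a list.
    return list(dictionary.suggest(word))
```

With the Marathi dictionary from the report, `suggestions_for_misspelled(dictionary, 'मानवी')` would return an empty list, since `lookup` returns True for that word.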
The hunspell module returns these 4 suggestions, while spylls returns only 2:
मानावी
मान्यवर
मान्यही
मानव्य
One word, "मानावी", is common to both. The word returned by spylls, "मानवी", is not in the hunspell output. Can you guess the reason?
मानावी
मानवी
As a matter of fact, the word that is in the spylls output and not in hunspell's, 'मानवी', is the correct, expected word! I would like to know how this has been achieved.
I would like to know how this has been achieved.
That's an interesting one :) Most of the algorithms in the original Hunspell work well with, and are tested on, 1- or 2-byte characters. As Marathi characters are 3 bytes each (in UTF-8), some of Hunspell's internals fall back to a "default" (almost "random") mode, including the n-gram-based suggestion (word-distance similarity). Thanks to Python's excellent Unicode support, spylls doesn't have this limitation. So the algorithms are the same; they just work more correctly with 3-byte characters.
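To make the "3-byte" point concrete, here is a small illustration using only built-in Python. It does not reproduce any Hunspell or spylls internals; it only shows how the same Marathi words look when counted as Unicode code points versus UTF-8 bytes. The `bigrams` helper is a made-up name for this sketch.

```python
word = 'मानवी'   # correct spelling from the report
typo = 'मान्वी'  # misspelling from the report

# Python strings are sequences of Unicode code points,
# while the UTF-8 encoding uses 3 bytes for each of these characters.
print(len(word), len(word.encode('utf-8')))
print(len(typo), len(typo.encode('utf-8')))


def bigrams(seq):
    """Return the set of adjacent pairs in a str (or bytes) sequence."""
    return {seq[i:i + 2] for i in range(len(seq) - 1)}


# Bigrams over code points always consist of whole characters.
shared = bigrams(word) & bigrams(typo)
print(len(shared), 'shared code-point bigrams')

# A 2-byte slice of the UTF-8 form cuts a character in the middle.
raw = word.encode('utf-8')
try:
    raw[:2].decode('utf-8')
except UnicodeDecodeError:
    print('a 2-byte slice is not a whole character')
```

Any similarity measure that walks the text in 1- or 2-byte steps sees fragments like that last slice, whereas one that walks code points compares whole characters.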
It works for some words, but I am getting an error for others.