zverok / spylls

Pure Python spell-checker, (almost) full port of Hunspell
https://spylls.readthedocs.io
Mozilla Public License 2.0

TypeError: '<' not supported between instances of 'Word' and 'Word' #4

Closed shantanuo closed 3 years ago

shantanuo commented 3 years ago

It works for some words, but I get an error for others.

from spylls.hunspell import Dictionary
dictionary = Dictionary.from_files('/root/marathi/dicts/mr_IN')

for suggestion in dictionary.suggest('मान्वी'):
  print(suggestion)
मानवी
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-6-cec47a2e0f5b> in <module>
      3 dictionary = Dictionary.from_files('/root/marathi/dicts/mr_IN')
      4 
----> 5 for suggestion in dictionary.suggest('मान्वी'):
      6   print(suggestion)

/root/miniforge3/lib/python3.7/site-packages/spylls/hunspell/dictionary.py in suggest(self, word)
    201         """
    202 
--> 203         yield from self.suggester(word)

/root/miniforge3/lib/python3.7/site-packages/spylls/hunspell/algo/suggest.py in __call__(self, word)
    181             word: Word to check
    182         """
--> 183         yield from (suggestion.text for suggestion in self.suggest_internal(word))
    184 
    185     def suggest_internal(self, word: str) -> Iterator[Suggestion]:  # pylint: disable=too-many-statements

/root/miniforge3/lib/python3.7/site-packages/spylls/hunspell/algo/suggest.py in <genexpr>(.0)
    181             word: Word to check
    182         """
--> 183         yield from (suggestion.text for suggestion in self.suggest_internal(word))
    184 
    185     def suggest_internal(self, word: str) -> Iterator[Suggestion]:  # pylint: disable=too-many-statements

/root/miniforge3/lib/python3.7/site-packages/spylls/hunspell/algo/suggest.py in suggest_internal(self, word)
    345 
    346         ngrams_seen = 0
--> 347         for sug in self.ngram_suggestions(word, handled=handled):
    348             for res in handle_found(Suggestion(sug, 'ngram'), check_inclusion=True):
    349                 ngrams_seen += 1

/root/miniforge3/lib/python3.7/site-packages/spylls/hunspell/algo/suggest.py in ngram_suggestions(self, word, handled)
    508                     known={*(word.lower() for word in handled)},
    509                     maxdiff=self.aff.MAXDIFF,
--> 510                     onlymaxdiff=self.aff.ONLYMAXDIFF)
    511 
    512     def phonet_suggestions(self, word: str) -> Iterator[str]:

/root/miniforge3/lib/python3.7/site-packages/spylls/hunspell/algo/ngram_suggest.py in ngram_suggest(misspelling, dictionary_words, prefixes, suffixes, known, maxdiff, onlymaxdiff)
     81             heapq.heappushpop(root_scores, (score, word.stem, word))
     82         else:
---> 83             heapq.heappush(root_scores, (score, word.stem, word))
     84 
     85     roots = heapq.nlargest(MAX_ROOTS, root_scores)

TypeError: '<' not supported between instances of 'Word' and 'Word'

zverok commented 3 years ago

Oh, that's an interesting one. I can imagine how it happened, though. Could you please share the dictionaries you are using, so that I can test it more easily?
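
For readers following the traceback: the failing line pushes tuples of the form (score, word.stem, word) onto a heap. Python compares tuples element by element, so when two entries tie on both score and stem, the comparison falls through to the Word objects themselves. The sketch below reproduces that general mechanism with a hypothetical stand-in Word class (not spylls' actual class) and shows one common workaround, a unique tie-breaker counter. It is an illustration under those assumptions, not necessarily the fix applied in the library.

```python
import heapq
import itertools
from dataclasses import dataclass


@dataclass
class Word:
    """Hypothetical stand-in for a dictionary entry; it defines == but not <."""
    stem: str
    flags: frozenset


# Two entries with the same stem but different flags (assumed scenario).
first = Word("मानवी", frozenset({"A"}))
second = Word("मानवी", frozenset({"B"}))

heap = []
heapq.heappush(heap, (10, first.stem, first))
try:
    # Same score, same stem: the tuple comparison reaches the Word objects.
    heapq.heappush(heap, (10, second.stem, second))
    failed = False
    message = ""
except TypeError as exc:
    failed = True
    message = str(exc)

print(failed, message)

# Workaround sketch: a monotonically increasing counter placed before the
# object guarantees the comparison is decided before reaching Word.
tie_breaker = itertools.count()
safe_heap = []
for score, word in [(10, first), (10, second)]:
    heapq.heappush(safe_heap, (score, word.stem, next(tie_breaker), word))

print(len(safe_heap))
```

This would also explain why only some words trigger the error: it needs two candidates with equal score and equal stem that are nevertheless distinct entries.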

shantanuo commented 3 years ago

Here is how to repeat:

!wget -N https://kagapa.s3.ap-south-1.amazonaws.com/with_acor_N.oxt
!unzip -o ./with_acor_N.oxt

from spylls.hunspell import Dictionary
dictionary = Dictionary.from_files('./dicts/mr_IN')

#####
# Returns True
print(dictionary.lookup('मानवी'))

# This should not return anything because the word is correct
for suggestion in dictionary.suggest('मानवी'):
  print(suggestion)

#####
# Returns False
print(dictionary.lookup('मान्वी'))

# Should return suggestions, but getting an error
for suggestion in dictionary.suggest('मान्वी'):
  print(suggestion)

shantanuo commented 3 years ago

It works as expected with the hunspell module.

#sudo apt-get install -y libhunspell-dev python3-dev
#pip install hunspell

import hunspell

spellchecker = hunspell.HunSpell(
    "./dicts/mr_IN.dic",
    "./dicts/mr_IN.aff",
)

spellchecker.spell('मानवी')

for suggestion in spellchecker.suggest('मानवी'):
  print(suggestion)

spellchecker.spell('मान्वी')

for suggestion in spellchecker.suggest('मान्वी'):
  print(suggestion)

What is the advantage of using spylls over hunspell?

zverok commented 3 years ago

Thanks for the details, I'll look into it!

What is the advantage of using spylls over hunspell?

If you just need to check spelling, I believe there is not much: maybe the fact that spylls is pure Python and can therefore be installed where hunspell can't (some CI?), and that it is hackable (you can look into dictionary contents, settings, etc.).

The goal of the project is to be readable and hackable, while (hopefully) reproducing all of hunspell's behavior.

shantanuo commented 3 years ago

Yes, I can see where it can be useful. For example, someone could resolve this bug:

https://github.com/hunspell/hunspell/issues/497

It would be helpful if I could nest more than 2 levels of affix rules.

zverok commented 3 years ago

The TypeError is fixed in master, thanks for noticing!

for suggestion in dictionary.suggest('मान्वी'): 
  print(suggestion) 
# Now prints:
# मानवी
# मानावी

As for whether suggestions should be printed for an already correct word, I prefer to keep it simple. It is just as easy for client code to check whether the word is correct first, and printing suggestions for a correct word might be considered useful functionality, too (print words similar to this one).
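
A minimal sketch of such a client-side check, assuming only the lookup and suggest methods used earlier in this thread. The helper name suggest_if_misspelled is hypothetical and not part of spylls.

```python
def suggest_if_misspelled(dictionary, word):
    """Yield suggestions only when the dictionary rejects the word.

    `dictionary` is any object exposing lookup(word) -> bool and
    suggest(word) -> iterable of str, as the spylls Dictionary in this
    thread does. The helper itself is a hypothetical convenience wrapper.
    """
    if dictionary.lookup(word):
        # The word is already correct: produce no suggestions.
        return
    yield from dictionary.suggest(word)
```

With the dictionary loaded as above, list(suggest_if_misspelled(dictionary, 'मानवी')) should come back empty, since lookup returns True for that word, while the misspelled 'मान्वी' is passed through to suggest unchanged.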

shantanuo commented 3 years ago

The hunspell module returns these 4 suggestions, while spylls returns only 2:

मानावी
मान्यवर
मान्यही
मानव्य

One word, "मानावी", is common to both. The word returned by spylls, "मानवी", is not among hunspell's suggestions. Can you guess the reason?

मानावी
मानवी

As a matter of fact, the word that spylls suggests and hunspell does not, 'मानवी', is the correct, expected word! I would like to know how this has been achieved.

zverok commented 3 years ago

I would like to know how this has been achieved.

That's an interesting one :) Most of the algorithms in the original Hunspell work well and are tested with 1- or 2-byte characters. As Marathi characters take 3 bytes each (in UTF-8), some of Hunspell's internals fall back to a "default" (almost "random") mode, including n-gram-based suggestion (word distance similarity). Thanks to Python's excellent Unicode support, spylls doesn't have this limitation. So the algorithms are the same; they just work more correctly with 3-byte characters.
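
To make the byte-versus-character distinction concrete, here is a small sketch. It only shows that Python strings are indexed by code point while the UTF-8 encoding of each Devanagari character takes 3 bytes. The bigrams helper is a simplified illustration and not the n-gram scoring that Hunspell or spylls actually uses.

```python
def code_points_and_bytes(word):
    """Return (number of code points, number of UTF-8 bytes) for a word."""
    return len(word), len(word.encode("utf-8"))


def bigrams(text):
    """All overlapping 2-character slices, indexed by code point."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


misspelled = "मान्वी"
expected = "मानवी"

print(code_points_and_bytes(misspelled))  # (6, 18)
print(code_points_and_bytes(expected))    # (5, 15)

# Character-level bigrams shared by the two words. Slicing a str never cuts
# a character in half, which byte-oriented code has to guard against.
shared = set(bigrams(misspelled)) & set(bigrams(expected))
print(sorted(shared))
```

Because slicing happens on whole characters, the misspelling and the intended word visibly share bigrams, which is the kind of overlap an n-gram similarity measure relies on.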