zverok / spylls

Pure Python spell-checker, (almost) full port of Hunspell
https://spylls.readthedocs.io
Mozilla Public License 2.0

Using spylls to clean-up text file #8

Closed shantanuo closed 3 years ago

shantanuo commented 3 years ago

Is it possible to run spylls against a large corpus and remove all misspelled words? Something like what is asked here: https://stackoverflow.com/questions/65785287/using-hunspell-to-find-incorrect-words-in-jamspell

zverok commented 3 years ago

It is kind of possible, but spylls may not be the best tool for the task, and you'll need to write some Python :) At the highest level, the code will look like:

  1. Load your corpus
  2. Tokenize it into words (with some existing Python tokenization library)
  3. Check each word with spylls' Dictionary.lookup() method
  4. Drop those for which it returns False
  5. ...save the filtered corpus.
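
A minimal sketch of the steps above. It assumes the en_US dictionary bundled with spylls, and it swaps in a simple regular expression for the tokenization library mentioned in step 2. The names `remove_misspelled` and `clean_file` are hypothetical helpers made up for this illustration, not part of spylls.

```python
import re
from typing import Callable

# Assumption: a "word" is a run of letters, optionally with internal apostrophes.
# A real tokenization library would handle hyphens, URLs, numbers, etc. better.
WORD_RE = re.compile(r"[^\W\d_]+(?:'[^\W\d_]+)*")


def remove_misspelled(text: str, is_known: Callable[[str], bool]) -> str:
    """Delete every word for which is_known() returns False."""
    kept = WORD_RE.sub(
        lambda m: m.group(0) if is_known(m.group(0)) else "", text
    )
    # Collapse the runs of spaces left behind by removed words.
    return re.sub(r"[ \t]{2,}", " ", kept).strip()


def clean_file(src: str, dst: str, dict_path: str = "en_US") -> None:
    """Hypothetical helper: filter src line by line and write the result to dst."""
    # Imported here so the filtering logic above works without spylls installed.
    from spylls.hunspell import Dictionary

    dictionary = Dictionary.from_files(dict_path)
    with open(src, encoding="utf-8") as fin, \
            open(dst, "w", encoding="utf-8") as fout:
        for line in fin:
            fout.write(remove_misspelled(line, dictionary.lookup) + "\n")
```

Calling `clean_file("corpus.txt", "corpus.clean.txt")` (both file names are placeholders) would run the whole pipeline. Punctuation next to a removed word is left in place, so a real cleanup would likely need an extra pass for that.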

You can probably do the same with hunspell (the command-line tool or a Python wrapper), and it will be more performant...
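
A rough sketch of the command-line route, assuming the `hunspell` binary and an en_US dictionary are installed on the system. It relies on hunspell's `-l` option, which prints the misspelled words found on standard input. The function names `misspelled_words` and `drop_words` are hypothetical, and `drop_words` uses naive whitespace splitting, so it collapses line breaks.

```python
import subprocess


def misspelled_words(text: str, dict_name: str = "en_US") -> set:
    """Return the words that `hunspell -l` reports as misspelled."""
    result = subprocess.run(
        ["hunspell", "-d", dict_name, "-l"],
        input=text,
        capture_output=True,
        text=True,
        check=True,
    )
    return set(result.stdout.split())


def drop_words(text: str, bad: set) -> str:
    """Remove whitespace-separated tokens whose core word is in `bad`."""
    punctuation = ".,;:!?\"'()"
    return " ".join(
        token for token in text.split()
        if token.strip(punctuation) not in bad
    )
```

Usage would then be something like `drop_words(corpus, misspelled_words(corpus))`, where `corpus` is the text already loaded into a string.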