wolfgarbe / SymSpell

SymSpell: 1 million times faster spelling correction & fuzzy search through Symmetric Delete spelling correction algorithm
https://seekstorm.com/blog/1000x-spelling-correction/
MIT License
3.12k stars 284 forks source link

Possible issue when misspelled word has 2 non consecutive deletes #104

Closed rockroland closed 3 years ago

rockroland commented 3 years ago

I successfully implemented symspell using a mysql memory table and with an odbc driver and some post processing to filter candidates the performance is absolutely amazing! But I ran across a case today that I thought might be an issue.

I was looking for corrections to some abbreviated words and I had this abbreviation: IMPROVMNT and expected the suggested correction to be :IMPROVEMENT which is only 2 deletes away but instead got no suggestions.

In my prepared symspell database I have nearby candidates in my index such as IMPROVEMNT and IMPROVMENT but since my precalculation iteratively deletes one letter at a time to populate the index I would never have words where the 2 deletes are non-consecutive. Am I missing something?

wolfgarbe commented 3 years ago

You want to match IMPROVMNT to IMPROVEMENT The two terms have a Damerau–Levenshtein edit distance distance of 2. This means, you have to recursively delete 2 two characters both from all dictionary terms (only once in the pre-calculation phase) and from the (misspelled) input term.

Lets have a look at your example:

First delete iteration of Dictionary term: IMPROVEMENT MPROVEMENT IPROVEMENT IMROVEMENT IMPOVEMENT IMPRVEMENT IMPROEMENT IMPROVMENT * IMPROVEENT IMPROVEMNT IMPROVEMET IMPROVEMEN

Second delete iteration of first iteration delete IMPROVMENT (do this for each of the first iteration deletes) MPROVMENT
IPROVMENT
IMROVMENT
IMPOVMENT
IMPRVMENT
IMPROMENT
IMPROVENT
IMPROVMNT -> this is the match for your input term IMPROVMET
IMPROVMEN

The above method does generates non-consecutive deletes. And don't forget that you do need also to generate deletes from the input term, not only for the dictionary term. While for your specific example this is not necessary, in many other cases it is!

rockroland commented 3 years ago

Thanks. I need to review if my pre-calculated dictionary does the second round of deletes. Otherwise this is working great..