wolfgarbe / SymSpell

SymSpell: 1 million times faster spelling correction & fuzzy search through Symmetric Delete spelling correction algorithm
https://seekstorm.com/blog/1000x-spelling-correction/
MIT License
3.15k stars 298 forks source link

prioritizing types of distance errors? #5

Closed allanchao closed 7 years ago

allanchao commented 7 years ago

Great library! It works very well and is highly performant. To be honest I don't know or understand the underlying algorithm, but I have a suggestion / request from a user's perspective.

Right now the results seem only ordered by the # of character modifications, and are agnostic to the type of modification. However, this results in some unexpected "corrections". I think adjacent letter swaps should be highest priority, followed by missing a repeated letter, followed by add a letter, followed by remove a letter, followed by total letter replacement (this is most likely to result in a different intended word). Alternatively some sort of ranking/sorting involving for including more % of characters in the input word.

Examples: basicly --> basic --> expected basically collegue --> college --> expected colleague finaly --> final --> expected finally jist --> list --> expected gist (this one could be theoretically helped by the j sound being the same as g sound) liase --> laser --> expected liaise peice --> price --> expected piece politican --> political --> expected politician realy --> real --> expected really rember --> member --> expected remember (this is two steps away, so maybe ignore it) seige --> beige --> expected siege tonge --> lounge --> expected tongue

This is an incomplete example list that I just got with a quick list from https://en.oxforddictionaries.com/spelling/common-misspellings and using SymSpell.editDistanceMax = 3; (because 2 missed too many misspellings).

Overall, great library, thank you for maintaining it.

wolfgarbe commented 7 years ago

Thank you for your feedback.

  1. The results are not only ordered by edit distance, but also by word frequency. This means that for the same edit distance those words are prioritized which have a higher likelihood to occur in the text.

  2. There are all kinds of possible ranking policies: by edit type, by keyboard proximity, by phonetic similarity (soundex) etc. The idea is that SymSpell finds all possible suggestions within a certain edit distance in a very short time. It is up to you to implement your preferred policy to reorder/sort the candidates in a second step. This gives the user more freedom, while keeping the package compact.

  3. Have a look at preDict which is based on SymSpell (v3.0), but implements weighted Damerau-Levenshtein distance, where each operation (delete, insert, swap, replace) can have different edit weights. They also added a phonetic proximity and keyboard proximity.