wolfgarbe / SymSpell

SymSpell: 1 million times faster spelling correction & fuzzy search through Symmetric Delete spelling correction algorithm
https://seekstorm.com/blog/1000x-spelling-correction/
MIT License
3.12k stars 284 forks source link

How to calculate distance manually in symspell? #60

Closed MukhtarShaima closed 5 years ago

MukhtarShaima commented 5 years ago

Greetings Sir, I'm working on spell checker for Urdu Language. Tried with your algorithm it gives me great results. Now, Can you explain how symspell algorithm works? Like,if I wanted to calculate the edit distance of LD manual I know what's the way, So how to calculate deletion in symspell or sympound manually,or what's the procedure for calculating manually.

Plus,I didn't got the algorithm.i know there is some deletion instead on insertion and all. But I don't have the proper understanding of the given algorithm. I read the algorithm written by you on medium.

wolfgarbe commented 5 years ago

SymSpell uses the Damerau-Levenshtein edit distance. So you can calculate it manually in order to get the edit distance used by SymSpell.

The Symmetric delete algorithm is only used as a fast and efficient pre-filter in order to reduce the number of candidates from the big dictionary.

After this filtering step only for a very small number of candidates we need to do the (numerically expensive) calculation of the Damerau-Levenshtein edit distance.

Norvigs algorithm applies deletes, transposes, replaces and inserts operations to the input term, until it matches one of the dictionary terms. The lookup in the dictionary is done between the modified input term and the original dictionary terms.

SymSpell applies only delete operations, but both to the input term and all dictionary terms (the latter only once during a pre-calculation phase). The lookup in the dictionary is done between the delete of the input term and the deletes of the dictionary terms.

SymSpell is much faster, as the number of possible deletes to an input term is much smaller than all possible combinations of deletes, transposes, replaces and inserts to an input term.