wolfgarbe / SymSpell

SymSpell: 1 million times faster spelling correction & fuzzy search through Symmetric Delete spelling correction algorithm
https://seekstorm.com/blog/1000x-spelling-correction/
MIT License
3.12k stars 284 forks source link

Next valid letters #38

Closed trans closed 6 years ago

trans commented 6 years ago

Does SymSpell have a way to get the next valid set of letters given a prefix? E.g. if I give it "carpo" it would return "o" and "r" for "carpool" and "carport" (maybe others).

wolfgarbe commented 6 years ago

SymSpell with maximum edit distance=2 and input="carpo" will return both "carpool" and "carport". But it will also return "carbon" (edit distance=2). And it would not return "carpooling" (edit distance=5).

For an autocomplete of a given prefix you should use a trie, also called digital tree, radix tree or prefix tree.

trans commented 6 years ago

Hmm... ok, then I suppose I could write a function that takes "carpool", "carport" and "carbon" and extracts the sixth letter of each, if the prefix matches. That would basically give me what I want even if "carpooling" is not included. It wouldn't be perfect but probably good enough.

Yes, for straight up autocomplete, a binary search on an alphabetically sorted array is fastest (though not as memory efficient).

wolfgarbe commented 6 years ago

You should then skip early all candidates which don't share a common prefix with the input term - before they are added to the suggestion list. After line 436 you should add: if (!suggestion.StartsWith(input)) continue; That prevents unnecessary Levenshtein calculations (which are time consuming) for suggestions you would throw away later anyway.

A lookup in a Trie is faster than binary search: Trie lookup: O(L) , L is the average length of a word Binary search: O(log N), N is the number of words in the dictionary

trans commented 6 years ago

Thank you. Nice tip!!!

Yes the lookup of the prefix in a trie is faster but then you have to traverse all the branches to collect the completions. With two binary searches on an ordered array it's already done -- all the completions are between the two indexes.