wolfgarbe / SymSpell

SymSpell: 1 million times faster spelling correction & fuzzy search through Symmetric Delete spelling correction algorithm
https://seekstorm.com/blog/1000x-spelling-correction/
MIT License
3.12k stars 284 forks source link

any comparation with lucene's Levenshtein Automaton ? #72

Open byronhe opened 5 years ago

byronhe commented 5 years ago

http://blog.mikemccandless.com/2011/03/lucenes-fuzzyquery-is-100-times-faster.html

wolfgarbe commented 5 years ago

Something like http://blog.notdot.net/2010/07/Damn-Cool-Algorithms-Levenshtein-Automata or https://issues.apache.org/jira/browse/LUCENE-2507 ?

From what I understand from Michael McCandless post:

Prior to 4.0, FuzzyQuery took a brute force approach: it visits every single unique term in the dictionary, computes the Levenshtein distance for it, and accepts the term if the edit distance<=maximum edit distance.

From 4.0 they were pre-calculating a Levenshtein automaton, that accepts only the terms within edit distance N. And then at query time they were intersecting that automaton with all the terms in the dictionary.

This would mean that they save time, because the don't need to calculate the Levenshtein distance for every single term, but it seems they still need to intersect the pre-built Levenshtein automaton with every single term in the dictionary, whereas SymSpell has to calculate the Levenshtein distance only for 0.0042% to 0.016% of the vocabulary.

It would sure be interesting to extend the benchmark, to see how the two algorithms compare in reality.

missinglink commented 2 years ago

but it seems they still need to intersect the pre-built Levenshtein automaton with every single term in the dictionary

This is not the case if the dictionary is also indexed in a trie structure.

In that case, intersecting the automata with the dictionary is a simple recursive algorithm which only follows branches of the dictionary with corresponding transitions in the state machine.

wolfgarbe commented 2 years ago

It seems like somebody did indeed benchmark SymSpell vs. Lucene's spelling correction: image Source: https://github.com/Shivakumar-Narayanan/Spell-Check https://docs.google.com/document/d/1QQROh8ndwBHbPwx2t1kKZcquHDDfF0Pz