wolfgarbe / SymSpell

SymSpell: 1 million times faster spelling correction & fuzzy search through Symmetric Delete spelling correction algorithm
https://seekstorm.com/blog/1000x-spelling-correction/
MIT License
3.15k stars 298 forks source link

CreateDictionaryEntry behavior when `word` is empty string #118

Closed mammothb closed 2 years ago

mammothb commented 2 years ago

I noticed a strange behavior with Edits method when maxDictionaryEditDistance is larger than some of the words being added the dictionary and would like to understand the rationale/reason behind it.

The following code snippet replicate the observation

const int initialCapacity = 82834;
const int maxEditDistance = 5;
const int prefixLength = 7;
var symSpell = new SymSpell(initialCapacity, maxEditDistance, prefixLength);

symSpell.CreateDictionaryEntry("asdf", 2);
symSpell.CreateDictionaryEntry("qwer", 2);
symSpell.CreateDictionaryEntry("zxcv", 2);
symSpell.CreateDictionaryEntry("uiopjkl", 2);
symSpell.CreateDictionaryEntry("", 2);

foreach (KeyValuePair<int, string[]> entry in symSpell.deletes)
{
    string output = " (" + entry.Key + ") [";
    foreach (string val in entry.Value)
    {
        output += "\"" + val + "\", ";
    }
    output += "]";
    Console.WriteLine(output);
}

The list of deletes for the empty string "" is (18652612) ["asdf", "qwer", "zxcv", "", ] which contains all the dictionary words that are below the maxEditDistance. I assume this is because at a max edit distance of 5, "asdf", "qwer", and "zxcv" which are 4 edit distance away from "" are viable candidates.

However, may I know if there are cases where this entry will be actually selected and used to select a candidate? So far I can only think of doing Lookup("", Verbosity.All) but that will still return "" as the top result since it's an exact match but the other results are unused.

wolfgarbe commented 2 years ago

This has been a fix introduced in v5.0 https://github.com/wolfgarbe/SymSpell#changes-in-v50 It ensures that suggestions for input_words with Length <= editDistanceMax are complete. E.g. input_word "axe" has a length <= maxEditDistance 5. One of the input_deletes becomes "". And then we have a match with the dictionary_delete "". And the dictionary_delete "" points to all dictionary_words with length <= maxEditDistance as suggestions.

All input words of length <= n are within maximum edit distance n of all dictionary words of length <=n .

Does this answer your question?

mammothb commented 2 years ago

Yes, I understand how it works better now, thanks for the explanation!