wolfgarbe / SymSpell

SymSpell: 1 million times faster spelling correction & fuzzy search through Symmetric Delete spelling correction algorithm
https://seekstorm.com/blog/1000x-spelling-correction/
MIT License
3.12k stars 284 forks source link

Word completion #99

Closed fortierq closed 3 years ago

fortierq commented 3 years ago

Hello, Let's say I have a string s of size n and a dictionary D. Is it possible to find the word in D whose prefix of size n is closest to s? That is to say, complete s in a word of D. n may vary and I don't want to create a new SymSpell dictionary for each possible n.

wolfgarbe commented 3 years ago

For prefix search & auto-completion you probably should use my PruningRadixTrie instead - 1000x faster Radix trie for prefix search & auto-complete https://github.com/wolfgarbe/PruningRadixTrie

SymSpell is inefficient for small prefixes of long words. I would recommend SymSpell for prefix search only, if the prefix may contain spelling errors.

fortierq commented 3 years ago

Thanks. I may have errors in the prefix. However I have only few searchs to perform, so SymSpell might be enough. I guess I should iterate through the dictionary and select the word whose n-prefix is closest to my query s?

wolfgarbe commented 3 years ago

You don't need to iterate yourself through the dictionary. SymSpell.Lookup() will search the dictionary for you:

If you search for car it will return both car and carpet, but also bar.

But the choice of maxEditDistanceDictionary and maxEditDistanceLookup is important. In order to find the suggestion carpet for input term/prefix car both maximum edit distances need to be >=3. The larger the edit distance is, the higher is the memory consumption of the precalculated dictionary is and the slower the lookup in the dictionary becomes. Also, with higher maximum edit distances the number of returned suggestions increases, many of them probably unwanted and unexpected (the precision decreases). E.g. If you have a prefix of length 3 and an edit distance of 3, then ALL words of length 3 are returned, even if they have no letter in common with the prefix, and all words of length 4 which have only a single letter at the same position in common, and all words of length 5, which have only 2 letters at the same position in common.

//create object
int initialCapacity = 82765;
int maxEditDistanceDictionary = 3; //maximum edit distance per dictionary precalculation
var symSpell = new SymSpell(initialCapacity, maxEditDistanceDictionary);

//load dictionary
string baseDirectory = AppDomain.CurrentDomain.BaseDirectory;
string dictionaryPath= baseDirectory + "../../../../SymSpell/frequency_dictionary_en_82_765.txt";
int termIndex = 0; //column of the term in the dictionary text file
int countIndex = 1; //column of the term frequency in the dictionary text file
if (!symSpell.LoadDictionary(dictionaryPath, termIndex, countIndex))
{
  Console.WriteLine("File not found!");
  //press any key to exit program
  Console.ReadKey();
  return;
}

//lookup suggestions for single-word input strings
string inputTerm="car";
int maxEditDistanceLookup = 3; //max edit distance per lookup (maxEditDistanceLookup<=maxEditDistanceDictionary)
var suggestionVerbosity = SymSpell.Verbosity.All; //Top, Closest, All
var suggestions = symSpell.Lookup(inputTerm, suggestionVerbosity, maxEditDistanceLookup);