Closed nhaberl closed 4 years ago
SymSpell or any other (Damerau-Levenshtein) edit distance based spelling correction is not the best solution for autocomplete:
If you type "mic" for "microsoft" this would require an edit distance= 6 which is slow, computing intense,and will return all words from the dictionary (instead of only those starting with the "mic" prefix)
Trie or Radix Trie based solutions are better suited.
Soon will be releasing an autocomplete solution based on Pruning Radix Tries that is 1000x faster than the available solutions.
If you want nevertheless use SymSpell for autocomplete and attach an ID you have three possibilities:
You may either amend the words dictionary which currently is
Dictionary<string, Int64> words; //word, word count
into something like
Dictionary<string, (Int64,Int32)> words; //word,(word count, ID)
you may abuse the Int64 count as ID. Currently SymSpell uses the count to display suggestion in order of descending word frequency, but if this is not important than you can use the count otherwise, e.g. as ID.
You may maintain your own word -> ID translation table, where you translate the suggestions to ID in a post processing step.
Thanks for your quick answer! Yes I am using a RadixTrie (Concurrent Suffix Tree) which is optimized for concurrent usage for contains like searches but the problem is if someone is misspelling any word like 'managment' I won't find anything ... so my guess was to "normalize" this word and afterwards searching in my Trie structure.
Any thoughts on using your library in that case ? How will your autocomplete solution look like ?
You may
How will your autocomplete solution look like ?
It will create a Trie from text file of terms and term bigrams with word frequency counts as input. For autosuggestion it will then return the topK best suggestions for any given prefix (e.g. the top 10 suggestions with the highest word frequency - or price, or profit margin etc.), sorted by word frequency. For 6 million unique terms and bigrams and a prefix="a" (length=1) it will return the top 10 in 0,03 ms. This is an exhaustive search for the top 10, not just the first 10 results that match the prefix.
Hey, is there a chance to add metatdata to the dictionary items so we can use it when get the suggestionitems .... would be great for autocomplete scenarios (like an ID or something)