wolfgarbe / SymSpell

SymSpell: 1 million times faster spelling correction & fuzzy search through Symmetric Delete spelling correction algorithm
https://seekstorm.com/blog/1000x-spelling-correction/
MIT License
3.09k stars 283 forks source link

Add metatdata to dictionary items #76

Closed nhaberl closed 4 years ago

nhaberl commented 4 years ago

Hey, is there a chance to add metatdata to the dictionary items so we can use it when get the suggestionitems .... would be great for autocomplete scenarios (like an ID or something)

wolfgarbe commented 4 years ago

SymSpell or any other (Damerau-Levenshtein) edit distance based spelling correction is not the best solution for autocomplete:

If you type "mic" for "microsoft" this would require an edit distance= 6 which is slow, computing intense,and will return all words from the dictionary (instead of only those starting with the "mic" prefix)

Trie or Radix Trie based solutions are better suited.

Soon will be releasing an autocomplete solution based on Pruning Radix Tries that is 1000x faster than the available solutions.

If you want nevertheless use SymSpell for autocomplete and attach an ID you have three possibilities:

  1. You may either amend the words dictionary which currently is Dictionary<string, Int64> words; //word, word count into something like Dictionary<string, (Int64,Int32)> words; //word,(word count, ID)

  2. you may abuse the Int64 count as ID. Currently SymSpell uses the count to display suggestion in order of descending word frequency, but if this is not important than you can use the count otherwise, e.g. as ID.

  3. You may maintain your own word -> ID translation table, where you translate the suggestions to ID in a post processing step.

nhaberl commented 4 years ago

Thanks for your quick answer! Yes I am using a RadixTrie (Concurrent Suffix Tree) which is optimized for concurrent usage for contains like searches but the problem is if someone is misspelling any word like 'managment' I won't find anything ... so my guess was to "normalize" this word and afterwards searching in my Trie structure.

Any thoughts on using your library in that case ? How will your autocomplete solution look like ?

wolfgarbe commented 4 years ago

You may

  1. combine auto completion and spell correction by first trying autocomplete, and if no suggestions returned then fall back to spelling correction suggestion (this is how I am doing it in my search solutions).
  2. Alternatively you can do both auto completion and spell correction and combine the results (usually only one of them will contribute).
  3. combine auto completion and spell correction by first trying spelling correction, and feed the spelling suggestions to autocomplete. If no spelling suggestions were returned (or only suggestions with large edit distance and low word frequency) send the original input directly to auto completion (bypassing the spelling correction).
  4. Sending always the spelling correction results to auto completion will not work for a short prefix (correct or misspelled), because the spelling correction will nor be able to correct the input term if the edit distance to the next valid word is too large. Then the auto correction will have no input.

How will your autocomplete solution look like ?

It will create a Trie from text file of terms and term bigrams with word frequency counts as input. For autosuggestion it will then return the topK best suggestions for any given prefix (e.g. the top 10 suggestions with the highest word frequency - or price, or profit margin etc.), sorted by word frequency. For 6 million unique terms and bigrams and a prefix="a" (length=1) it will return the top 10 in 0,03 ms. This is an exhaustive search for the top 10, not just the first 10 results that match the prefix.