wolfgarbe / SymSpell

SymSpell: 1 million times faster spelling correction & fuzzy search through Symmetric Delete spelling correction algorithm
https://seekstorm.com/blog/1000x-spelling-correction/
MIT License
3.15k stars 299 forks

More than 2 columns and space separated words #33

Amir-Eskandari closed 6 years ago

Amir-Eskandari commented 6 years ago

Hi,

1 - I want to add more columns such as 'category', 'type' or 'culture' to the dataset, in which case a word may need to appear twice in the dataset. You mentioned that adding more columns is possible; should I change the LoadDictionary method to support more than 2 columns?

2 - What can I do about space separated words, such as "Mercedes benz"?

Best, Amir

wolfgarbe commented 6 years ago

  1. Let's say you have a table with 3 columns: term, count, culture. In this table you have four lines:

    car 2000 en-us
    car 1000 en-gb
    color 500 en-us
    colour 200 en-gb

You can simply read your dictionary/dataset like this:

    int initialCapacity = 82765;
    int maxEditDistanceDictionary = 2;
    var symSpell = new SymSpell(initialCapacity, maxEditDistanceDictionary);
    int termIndex = 0;  // column of the term in the dictionary text file
    int countIndex = 1; // column of the term frequency in the dictionary text file
    symSpell.LoadDictionary(dictionaryPath, termIndex, countIndex);

The culture column is simply ignored by SymSpell. The word "car" appears in two lines, but the two lines are combined into a single entry in the internal dictionary. The two values in the count column (2000 and 1000) are added up to count=3000.
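The merging described above can be illustrated with a minimal sketch. It does not use SymSpell's internal code; `CountMergeSketch` and `Merge` are hypothetical names, and a plain `Dictionary<string, long>` stands in for the internal dictionary. It only mimics the described behaviour: extra columns are ignored and counts of repeated terms are summed.

```csharp
using System;
using System.Collections.Generic;

public static class CountMergeSketch
{
    // Reads whitespace-separated lines, keeps only the term and count columns,
    // and sums the counts of repeated terms. Any other column is ignored.
    public static Dictionary<string, long> Merge(IEnumerable<string> lines, int termIndex, int countIndex)
    {
        var merged = new Dictionary<string, long>();
        foreach (string line in lines)
        {
            // A null separator array means "split on any whitespace".
            string[] parts = line.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);
            if (parts.Length <= Math.Max(termIndex, countIndex)) continue; // too few columns
            if (!long.TryParse(parts[countIndex], out long count)) continue; // count is not a number

            // TryGetValue leaves 'previous' at 0 when the term is new.
            merged.TryGetValue(parts[termIndex], out long previous);
            merged[parts[termIndex]] = previous + count;
        }
        return merged;
    }
}
```

Calling `CountMergeSketch.Merge` on the four example lines with `termIndex` 0 and `countIndex` 1 gives three entries, with "car" mapped to 3000.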

  2. You can use CreateDictionaryEntry("Mercedes benz".ToLower(), count, staging); While CreateDictionaryEntry() supports space separated words, LoadDictionary() does not yet, because it cannot decide whether an extra space separated token is a custom data column it should ignore or part of a term that consists of multiple words. So you would have to modify LoadDictionary() or implement your own using the existing CreateDictionaryEntry().

Amir-Eskandari commented 6 years ago

Hello,

Firstly, thank you for the fast response. Secondly, if I want to create my own LoadDictionary(), I need access to some fields like deletes in order to check them before CommitStaged(). Could I add a public property named Deletes and push it to the master branch, so that I can keep using your NuGet package in my web site?

By the way, some properties like EntryCount and WordCount don't check whether deletes or words is null. I can fix them too.

Best, Amir

wolfgarbe commented 6 years ago

I'm not sure what you are trying to achieve, but in order to support both space separated words and custom columns you could just change the dictionary format from space separated columns to comma separated columns:

    mercedes benz,500,en-us

In LoadDictionary(), line 273 changes from

    string[] lineParts = line.Split(null);

to

    string[] lineParts = line.Split(',');
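For readers who prefer not to edit the library source, the same comma-based parsing can be sketched as a standalone loader. This is only an illustration under stated assumptions: `CommaDictionarySketch`, `Load` and the `addEntry` callback are hypothetical names, and lowercasing with `ToLowerInvariant()` is a choice made here, not something the library prescribes.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public static class CommaDictionarySketch
{
    // Parses "term,count[,extra columns...]" lines and hands every valid
    // (term, count) pair to addEntry. Returns the number of pairs passed on.
    public static int Load(IEnumerable<string> lines, int termIndex, int countIndex,
                           Action<string, long> addEntry)
    {
        int loaded = 0;
        foreach (string line in lines)
        {
            // Commas separate columns, so spaces inside a term survive.
            string[] parts = line.Split(',');
            if (parts.Length <= Math.Max(termIndex, countIndex)) continue; // too few columns

            string term = parts[termIndex].Trim().ToLowerInvariant();
            if (term.Length == 0) continue; // skip empty terms

            if (!long.TryParse(parts[countIndex].Trim(), NumberStyles.Integer,
                               CultureInfo.InvariantCulture, out long count)) continue;

            addEntry(term, count); // extra columns such as culture are ignored
            loaded++;
        }
        return loaded;
    }
}
```

With a SymSpell instance, `addEntry` could be `(term, count) => symSpell.CreateDictionaryEntry(term, count)`, assuming the staging argument is optional in your version; otherwise pass a staging object as in the CreateDictionaryEntry() call quoted earlier in this thread. The lines could come from `File.ReadLines(dictionaryPath)`. Only the public CreateDictionaryEntry() is involved, so no access to internal fields is needed.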

Alternatively, you could put space separated words inside quotation marks and adapt the parsing. No access to deletes is required.
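The quotation mark variant could look roughly like the following sketch. `QuotedColumnsSketch` and `Split` are hypothetical names; the sketch assumes double quotes only and does not handle escaped quotes inside a term. An unterminated quote simply runs to the end of the line.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class QuotedColumnsSketch
{
    // Splits a line into columns on whitespace, except inside double quotes.
    // Example: "\"mercedes benz\" 500 en-us" -> ["mercedes benz", "500", "en-us"].
    public static List<string> Split(string line)
    {
        var columns = new List<string>();
        var current = new StringBuilder();
        bool inQuotes = false;
        bool hasToken = false; // true once the current column has started

        foreach (char c in line)
        {
            if (c == '"')
            {
                inQuotes = !inQuotes; // the quote itself is not part of the column
                hasToken = true;
            }
            else if (char.IsWhiteSpace(c) && !inQuotes)
            {
                // Whitespace outside quotes ends the current column, if any.
                if (hasToken)
                {
                    columns.Add(current.ToString());
                    current.Clear();
                    hasToken = false;
                }
            }
            else
            {
                current.Append(c); // includes spaces inside quotes
                hasToken = true;
            }
        }
        if (hasToken) columns.Add(current.ToString());
        return columns;
    }
}
```

The resulting list can be indexed with termIndex and countIndex in the same way as the lineParts array, so unquoted dictionaries keep working unchanged.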

Of course you can make whatever changes you want in your own fork, but the behaviour and structure of internal fields like deletes can change in the future. That is why they are not public: exposing them would risk breaking changes for users of the library. LoadDictionary will support space separated words in the future.

Wolf

Amir-Eskandari commented 6 years ago

I wanted to avoid changing it in my own fork so that I could keep using the updates of your NuGet package, but it seems I have to do it to make it fit my needs.

Thank you, Amir