wolfgarbe / SymSpell

SymSpell: 1 million times faster spelling correction & fuzzy search through Symmetric Delete spelling correction algorithm
https://seekstorm.com/blog/1000x-spelling-correction/
MIT License

What is the expected content of the dictionary file when using custom separator with LoadBigramDictionary? #78

Closed: mammothb closed this issue 4 years ago

mammothb commented 4 years ago

When I tried to load a dictionary file (using '$' as the separator) that looks like:

different distributions$123
growth factor$456

I wasn't able to add any terms to bigrams, because line.Split(separatorChars) splits each line into only 2 parts. How should the dictionary file content be formatted when loading it with LoadBigramDictionary and a custom separator?
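To make the symptom concrete, here is a minimal sketch of what `String.Split` returns for the two line formats. `BigramSplitDemo`, `CountParts` and `PrintExamples` are hypothetical names used only for illustration and are not part of SymSpell; the space-separated line is an assumed example of the default format.

```csharp
using System;

// Hypothetical demo class for illustration; not part of SymSpell.
public static class BigramSplitDemo
{
    // Returns how many parts String.Split produces for a line and separator set.
    public static int CountParts(string line, char[] separatorChars)
    {
        return line.Split(separatorChars).Length;
    }

    public static void PrintExamples()
    {
        // Custom separator: both words of the bigram stay in one part.
        Console.WriteLine(CountParts("different distributions$123", new[] { '$' })); // 2

        // Space as separator: the two words land in separate parts.
        Console.WriteLine(CountParts("different distributions 123", new[] { ' ' })); // 3
    }
}
```

A loader that requires at least 3 parts per line would therefore skip every line of the `$`-separated file.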

wolfgarbe commented 4 years ago

There was a bug in my code. Please see the corrected version:

```csharp
public bool LoadBigramDictionary(Stream corpusStream, int termIndex, int countIndex, char[] separatorChars = defaultSeparatorChars)
{
    using (StreamReader sr = new StreamReader(corpusStream, System.Text.Encoding.UTF8, false))
    {
        String line;
        //default (whitespace) separator: two term parts + count = 3 parts; custom separator: one term part + count = 2 parts
        int linePartsLength = (separatorChars == defaultSeparatorChars) ? 3 : 2;
        //process a single line at a time only for memory efficiency
        while ((line = sr.ReadLine()) != null)
        {
            string[] lineParts = line.Split(separatorChars);

            if (lineParts.Length >= linePartsLength)
            {
                //if default (whitespace) is defined as separator take 2 term parts, otherwise take only one
                string key = (separatorChars == defaultSeparatorChars) ? lineParts[termIndex] + " " + lineParts[termIndex + 1] : lineParts[termIndex];
                //lines whose count cannot be parsed are skipped
                if (Int64.TryParse(lineParts[countIndex], out Int64 count))
                {
                    bigrams[key] = count;
                    if (count < bigramCountMin) bigramCountMin = count;
                }
            }
        }
    }
    return true;
}
```

mammothb commented 4 years ago

I see, so the number of line parts to check for depends on whether a custom separator is used: 3 with the default whitespace separator and 2 with a custom one. Thanks!
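The per-line rule discussed in this thread can be sketched in isolation as follows. `BigramLineParser` and `TryParseLine` are hypothetical names for illustration, not SymSpell's API. The sketch assumes that `null` stands for the default whitespace separator (passing `null` to `String.Split` splits on white space), whereas the corrected method compares against `defaultSeparatorChars`. The index guard is an addition of this sketch.

```csharp
using System;

// Hypothetical helper for illustration; not part of SymSpell.
public static class BigramLineParser
{
    // Parses one dictionary line into a bigram key and its count.
    // Assumption: separatorChars == null means "default whitespace separator".
    public static bool TryParseLine(string line, int termIndex, int countIndex,
                                    char[] separatorChars, out string key, out long count)
    {
        key = null;
        count = 0;
        bool isDefault = separatorChars == null;

        // Whitespace splits the bigram into two parts, so three parts are needed;
        // a custom separator keeps the bigram whole, so two parts are enough.
        int minParts = isDefault ? 3 : 2;

        string[] parts = line.Split(separatorChars);
        if (parts.Length < minParts) return false;

        // Guard added in this sketch: reject indexes outside the split result.
        int lastTermIndex = isDefault ? termIndex + 1 : termIndex;
        if (termIndex < 0 || countIndex < 0 ||
            lastTermIndex >= parts.Length || countIndex >= parts.Length) return false;

        key = isDefault ? parts[termIndex] + " " + parts[termIndex + 1] : parts[termIndex];
        return long.TryParse(parts[countIndex], out count);
    }
}
```

With the file from the question, `TryParseLine("growth factor$456", 0, 1, new[] { '$' }, out key, out count)` yields the key `growth factor` and the count 456. A whitespace-separated line such as `growth factor 456` needs `countIndex` 2 instead, because the two words occupy parts 0 and 1.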