wolfgarbe / SymSpell

SymSpell: 1 million times faster spelling correction & fuzzy search through Symmetric Delete spelling correction algorithm
MIT License

What is the expected content of the dictionary file when using custom separator with LoadBigramDictionary? #78

Closed (mammothb closed this issue 4 years ago)

mammothb commented 4 years ago

When I tried to load a dictionary file (using '$' as the separator) that looks like:

different distributions$123
growth factor$456

I wasn't able to add any terms to bigrams, because line.Split(separatorChars) splits each line into only 2 parts. How should the dictionary file be formatted when loading it with LoadBigramDictionary and a custom separator?

wolfgarbe commented 4 years ago

There was a bug in my code. Please see the corrected version:


public bool LoadBigramDictionary(Stream corpusStream, int termIndex, int countIndex, char[] separatorChars = defaultSeparatorChars)
{
    using (StreamReader sr = new StreamReader(corpusStream, System.Text.Encoding.UTF8, false))
    {
        String line;
        int linePartsLength = (separatorChars == defaultSeparatorChars) ? 3 : 2;
        //process a single line at a time only for memory efficiency
        while ((line = sr.ReadLine()) != null)
        {
            string[] lineParts = line.Split(separatorChars);

            if (lineParts.Length >= linePartsLength)
            {
                //if default (whitespace) is defined as separator take 2 term parts, otherwise take only one
                string key = (separatorChars == defaultSeparatorChars) ? lineParts[termIndex] + " " + lineParts[termIndex + 1] : lineParts[termIndex];
                if (Int64.TryParse(lineParts[countIndex], out Int64 count))
                {
                    bigrams[key] = count;
                    if (count < bigramCountMin) bigramCountMin = count;
                }
            }
        }
    }
    return true;
}


mammothb commented 4 years ago

I see, so we change the number of line parts to check for depending on whether a custom separator was used. Thanks!
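
To make the resulting file format concrete, here is a minimal standalone sketch of the same parsing rule. It is not SymSpell code: `BigramFormatDemo` and `ParseBigrams` are hypothetical names introduced for illustration, and the sketch assumes the default separator is `null` (which makes `Split` break on whitespace). Under those assumptions, a custom-separator file holds the whole bigram in one field with the count in the next (so `countIndex` would be 1), while a whitespace-separated file holds two term fields followed by the count (so `countIndex` would be 2).

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class BigramFormatDemo
{
    // Hypothetical helper mirroring the parsing rule of the corrected loader above.
    // Assumption: a null separator array stands for the default (whitespace) separator.
    public static Dictionary<string, long> ParseBigrams(
        string content, int termIndex, int countIndex, char[] separatorChars = null)
    {
        var bigrams = new Dictionary<string, long>();
        bool whitespaceSeparated = separatorChars == null;
        // Whitespace: term1, term2, count (3 parts). Custom: "term1 term2", count (2 parts).
        int minParts = whitespaceSeparated ? 3 : 2;

        using (var reader = new StringReader(content))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                string[] parts = line.Split(separatorChars);
                if (parts.Length < minParts) continue; // skip malformed lines

                string key = whitespaceSeparated
                    ? parts[termIndex] + " " + parts[termIndex + 1]
                    : parts[termIndex];

                if (long.TryParse(parts[countIndex], out long count))
                {
                    bigrams[key] = count;
                }
            }
        }
        return bigrams;
    }

    public static void Main()
    {
        // Custom separator: the bigram is a single field, so countIndex is 1.
        var custom = ParseBigrams(
            "different distributions$123\ngrowth factor$456", 0, 1, new[] { '$' });
        Console.WriteLine(custom["growth factor"]); // prints 456

        // Default whitespace separator: two term fields, so countIndex is 2.
        var standard = ParseBigrams(
            "different distributions 123\ngrowth factor 456", 0, 2);
        Console.WriteLine(standard["different distributions"]); // prints 123
    }
}
```

Both calls produce the same keys ("different distributions", "growth factor"), so the two file layouts are interchangeable as long as `countIndex` matches the separator in use.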