wolfgarbe / SymSpell

SymSpell: 1 million times faster spelling correction & fuzzy search through Symmetric Delete spelling correction algorithm
https://seekstorm.com/blog/1000x-spelling-correction/
MIT License

using accent mark #94

Closed (alet149 closed this issue 4 years ago)

alet149 commented 4 years ago

I am using the SymSpell library in C# with a Spanish corpus, but I have a problem when the correct word has an accent mark: the edit distance to the correct word is larger than the distance to an incorrect word. This is my case:

bad word: MEDEĽTIN
correct word in dictionary: medellín
suggestion list returned (term, edit distance, frequency):
medefin, 2, 6
Medellin, 3, 14671

What can I do? Thanks.

wolfgarbe commented 4 years ago

I don't think that the accent mark is the problem. I guess the problem is that SymSpell expects all words in the dictionary to be lower case, but in your dictionary Medellin has its first letter in upper case. Therefore SymSpell calculates an edit distance of 3 instead of only 2. If you write medellin in the dictionary (in lower case), then both medefin and medellin will have an edit distance of 2, but medellin will win because of its much larger word frequency.

alet149 commented 4 years ago

Thanks Wolf, I made a mistake when I wrote up my case. In the dictionary I have this:

medefin 6
medellín 14671

bad word: MEDEĽTIN

When I run the Lookup method with verbosity = All, I receive this suggestion list:

[0], {{medefin, 2, 6}}
[1], {{medellín, 3, 14671}}
[2], {{medestia, 3, 30871}}
. . .

I attached an image of the suggestions.

Thanks for your help!

wolfgarbe commented 4 years ago

In SymSpell (as in Unicode), letters with and without accents, or with different accents, are treated as different letters. Each such mismatch therefore adds to the edit distance, even if the letters look or sound similar (I don't know how those letters are pronounced in Spanish). This is reasonable, as you probably want to correct words with a missing or wrong accent. In your example, ľ and l are treated as different letters, as are i and í.

That said, you can remove the accents (diacritics, https://en.wikipedia.org/wiki/Diacritic) from the letters and transform them to their base form. Then the edit distance will be closer to what you are probably expecting.

You can do this as a post-processing step: use the unmodified SymSpell, get the suggestion list, and remove the accents from both the input term and the suggestions. Then re-calculate the Damerau-Levenshtein edit distance (on the terms with accents removed) and re-sort the suggestion list by the new edit distance.


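using System;
using System.Globalization;
using System.Text;

// Returns s with combining accent marks removed, e.g. "í" -> "i", "ľ" -> "l".
public static String RemoveDiacritics(string s)
{
    // FormD splits each accented letter into a base letter plus combining marks.
    String normalizedString = s.Normalize(NormalizationForm.FormD);

    // Always filter: input that is already decomposed equals its FormD form,
    // so skipping the loop when nothing changed would leave its accents in place.
    StringBuilder stringBuilder = new StringBuilder();
    for (int i = 0; i < normalizedString.Length; i++)
    {
        Char c = normalizedString[i];
        // Keep everything except the combining (non-spacing) marks.
        if (CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
            stringBuilder.Append(c);
    }
    return stringBuilder.ToString();
}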

string[] termList= new string[] { "MEDEĽTIN", "medefin", "medellín", "medestia" };
foreach (string s in termList) Console.WriteLine(s+" "+s.ToLower()+" "+ RemoveDiacritics(s.ToLower()));

Output:

MEDEĽTIN medeľtin medeltin
medefin medefin medefin
medellín medellín medellin
medestia medestia medestia
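
The post-processing step described above (strip accents, re-calculate the distance, re-sort) could look roughly like the sketch below. It is only an illustration, not part of SymSpell: the class `AccentInsensitiveRerank`, the methods `StripAccents`, `OsaDistance` and `Rerank`, and the `(Term, Count)` tuple shape are all hypothetical names chosen here, and the distance function is a plain optimal string alignment implementation (the restricted form of Damerau-Levenshtein) standing in for whatever edit distance you already use. In practice you would copy the term and count out of the suggestions returned by the lookup.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;
using System.Text;

// Hypothetical helper, not part of SymSpell.
public static class AccentInsensitiveRerank
{
    // Decompose to FormD, then drop the combining (non-spacing) marks.
    public static string StripAccents(string s)
    {
        var sb = new StringBuilder();
        foreach (char c in s.Normalize(NormalizationForm.FormD))
            if (CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
                sb.Append(c);
        return sb.ToString();
    }

    // Optimal string alignment distance: insertions, deletions, substitutions
    // and transpositions of two adjacent characters each cost 1.
    public static int OsaDistance(string a, string b)
    {
        var d = new int[a.Length + 1, b.Length + 1];
        for (int i = 0; i <= a.Length; i++) d[i, 0] = i;
        for (int j = 0; j <= b.Length; j++) d[0, j] = j;
        for (int i = 1; i <= a.Length; i++)
        {
            for (int j = 1; j <= b.Length; j++)
            {
                int cost = a[i - 1] == b[j - 1] ? 0 : 1;
                d[i, j] = Math.Min(Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
                                   d[i - 1, j - 1] + cost);
                if (i > 1 && j > 1 && a[i - 1] == b[j - 2] && a[i - 2] == b[j - 1])
                    d[i, j] = Math.Min(d[i, j], d[i - 2, j - 2] + 1);
            }
        }
        return d[a.Length, b.Length];
    }

    // Re-scores each suggestion against the input with accents and case removed,
    // then sorts by the new distance (ascending) and by count (descending).
    public static List<(string Term, int Distance, long Count)> Rerank(
        string input, IEnumerable<(string Term, long Count)> suggestions)
    {
        string key = StripAccents(input.ToLowerInvariant());
        return suggestions
            .Select(s => (s.Term,
                          Distance: OsaDistance(key, StripAccents(s.Term.ToLowerInvariant())),
                          s.Count))
            .OrderBy(s => s.Distance)
            .ThenByDescending(s => s.Count)
            .ToList();
    }
}
```

With the terms from this thread, the accent-stripped input is medeltin, which is at distance 1 from medellin and distance 2 from medefin, so calling `Rerank("MEDEĽTIN", ...)` on those suggestions should put medellín first. The original accented terms are kept in the result, so the suggestion shown to the user still carries its accent.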