ztane / python-Levenshtein

The Levenshtein Python C extension module contains functions for fast computation of Levenshtein distance and string similarity
GNU General Public License v2.0

Bug with jaro_winkler function? #47

Open flo-blg opened 4 years ago

flo-blg commented 4 years ago

Hi,

I'm facing a strange result using jaro_winkler function, which looks like a bug:

In [73]: Levenshtein.jaro_winkler('guerrilla girls', 'guerilla girls')
Out[73]: 0.9295238095238095

I was surprised to see such a low score for the simple omission of one "r" from a 15-character string.

So I tried replacing the second "r" in the first string with a "b". The only difference from the first test is that the "r" omission in the second string becomes a "b" omission.

And now the score is pretty good, and much closer to what I expected:

In [74]: Levenshtein.jaro_winkler('guerbilla girls', 'guerilla girls')
Out[74]: 0.9866666666666667

I ran the same two tests with another library (jaro-winkler), and there the two scores are identical in both situations (and equal to the second result from python-Levenshtein):

In [77]: jaro.jaro_winkler_metric('guerrilla girls', 'guerilla girls')
Out[77]: 0.9866666666666667
In [78]: jaro.jaro_winkler_metric('guerbilla girls', 'guerilla girls')
Out[78]: 0.9866666666666667
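For reference, here is a minimal sketch of the textbook Jaro-Winkler algorithm (my own implementation, not this library's code, using the usual prefix weight p = 0.1 and a maximum prefix length of 4). It reproduces the 0.9866... value for both pairs, which suggests python-Levenshtein's first result diverges from the standard definition:

```python
def jaro(s1, s2):
    """Plain Jaro similarity, as commonly defined."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    # Characters match if equal and within this distance of each other.
    window = max(0, max(len1, len2) // 2 - 1)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, c in enumerate(s1):
        lo, hi = max(0, i - window), min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == c:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Transpositions: matched characters that appear in a different order.
    a = [c for c, f in zip(s1, matched1) if f]
    b = [c for c, f in zip(s2, matched2) if f]
    transpositions = sum(x != y for x, y in zip(a, b)) // 2
    return (matches / len1
            + matches / len2
            + (matches - transpositions) / matches) / 3

def jaro_winkler(s1, s2, p=0.1):
    """Jaro similarity boosted by the length of the common prefix (max 4)."""
    j = jaro(s1, s2)
    prefix = 0
    for x, y in zip(s1[:4], s2[:4]):
        if x != y:
            break
        prefix += 1
    return j + prefix * p * (1 - j)

print(jaro_winkler('guerrilla girls', 'guerilla girls'))  # 0.9866666666666667
print(jaro_winkler('guerbilla girls', 'guerilla girls'))  # 0.9866666666666667
```

Both pairs have 14 matching characters, no transpositions, and the common prefix "guer", so the standard formula gives the same score for both.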

What do you think about it? The first result is really weird, no?