ztane / python-Levenshtein

The Levenshtein Python C extension module contains functions for fast computation of Levenshtein distance and string similarity
GNU General Public License v2.0

Jaro Winkler distance equals 1 for strings that are not identical #11

Closed rcalsaverini closed 10 years ago

rcalsaverini commented 10 years ago

Hi, I was testing your library and found a case where two non-identical strings give a Jaro-Winkler similarity of 1:

In [7]: Levenshtein.jaro_winkler('gestor de dho', 'gestor de residuos')
Out[7]: 1.0

This doesn't seem correct. I thought that Jaro-Winkler could only be 1.0 for identical strings. Is this a bug?

Thanks for your attention.

ztane commented 10 years ago

The docstring for jaro_winkler states the following:

Compute Jaro string similarity metric of two strings.

jaro_winkler(string1, string2[, prefix_weight])

The Jaro-Winkler string similarity metric is a modification of Jaro
metric giving more weight to common prefix, as spelling mistakes are
more likely to occur near ends of words.

The prefix weight is inverse value of common prefix length sufficient
to consider the strings `identical'.  If no prefix weight is
specified, 1/10 is used.

Examples:
>>> jaro_winkler('Brian', 'Jesus')
0.0
>>> jaro_winkler('Thorkel', 'Thorgier')
0.86785714285714288
>>> jaro_winkler('Dinsdale', 'D')
0.73750000000000004
>>> jaro_winkler('Thorkel', 'Thorgier', 0.25)
1.0

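The default prefix weight of 0.1 means that a common prefix of 10 characters is sufficient for the strings to be considered identical, and your two strings share exactly the 10-character prefix 'gestor de '. The rest of the characters then no longer matter. Try changing the weight to 0.05 to see the difference: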

>>> Levenshtein.jaro_winkler('gestor de dho', 'gestor de residuos', 0.1)
1.0
>>> Levenshtein.jaro_winkler('gestor de dho', 'gestor de residuos', 0.05)
0.9316239316239316
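
To make the arithmetic behind those two numbers concrete, here is a minimal sketch of the prefix adjustment. It is not the library's code: it assumes the score is computed as `jaro + (1 - jaro) * prefix_len * prefix_weight`, clamped to 1.0, with no cap on the prefix length (many descriptions of Jaro-Winkler cap it at 4 characters). The helper names `common_prefix_len` and `winkler_adjust` are hypothetical, and the plain Jaro score is back-solved from the 0.05 result above rather than computed.

```python
def common_prefix_len(a: str, b: str) -> int:
    """Count the leading characters that the two strings share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def winkler_adjust(jaro: float, prefix_len: int, prefix_weight: float = 0.1) -> float:
    """Boost a plain Jaro score by the common prefix, clamped to 1.0.

    Assumption: the prefix length is not capped, so prefix_len * prefix_weight
    can reach 1 and erase whatever difference the Jaro score recorded.
    """
    return min(1.0, jaro + (1.0 - jaro) * prefix_len * prefix_weight)


a, b = 'gestor de dho', 'gestor de residuos'
p = common_prefix_len(a, b)  # 10: the shared prefix is 'gestor de '

# Plain Jaro score, back-solved from the 0.05 result under the assumed formula:
# score = jaro + (1 - jaro) * 10 * 0.05  =>  jaro = 2 * score - 1
jaro = 2 * 0.9316239316239316 - 1

default_score = winkler_adjust(jaro, p, 0.1)   # 10 * 0.1 = 1: the gap is fully closed
halved_score = winkler_adjust(jaro, p, 0.05)   # 10 * 0.05 = 0.5: half the gap remains

print(p, default_score, halved_score)
```

Under these assumptions, any pair of strings sharing a prefix of at least `1 / prefix_weight` characters scores 1.0 (up to floating-point rounding), whatever comes after the prefix. The same reasoning covers the docstring example with weight 0.25, where 'Thorkel' and 'Thorgier' share the 4-character prefix 'Thor'.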

I am not the author of the original code, just the maintainer; judging from the examples, one should use Jaro-Winkler to match individual name components, not full names.

In any case, this is not a bug but a documented feature.