Closed rcalsaverini closed 10 years ago
The docstring for `jaro_winkler` states the following:
Compute Jaro string similarity metric of two strings.
jaro_winkler(string1, string2[, prefix_weight])
The Jaro-Winkler string similarity metric is a modification of Jaro
metric giving more weight to common prefix, as spelling mistakes are
more likely to occur near ends of words.
The prefix weight is inverse value of common prefix length sufficient
to consider the strings `identical'. If no prefix weight is
specified, 1/10 is used.
Examples:
>>> jaro_winkler('Brian', 'Jesus')
0.0
>>> jaro_winkler('Thorkel', 'Thorgier')
0.86785714285714288
>>> jaro_winkler('Dinsdale', 'D')
0.73750000000000004
>>> jaro_winkler('Thorkel', 'Thorgier', 0.25)
1.0
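With the default prefix weight of 0.1, a common prefix of 10 characters is sufficient for two strings to be considered identical, no matter what follows the prefix. The two strings below share the 10-character prefix 'gestor de ' (including the trailing space), so they score 1.0. Change the weight to 0.05 to see the difference: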
>>> Levenshtein.jaro_winkler('gestor de dho', 'gestor de residuos', 0.1)
1.0
>>> Levenshtein.jaro_winkler('gestor de dho', 'gestor de residuos', 0.05)
0.9316239316239316
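To make the arithmetic behind these numbers visible, here is a minimal pure-Python sketch. It assumes the score is computed as `jaro + prefix_len * prefix_weight * (1 - jaro)` with no upper limit on the prefix length and a clamp at 1.0. That formula is inferred from the docstring and the examples above, not taken from the library's C source, and the names `jaro`, `common_prefix_len`, `jaro_winkler_sketch` and the `max_prefix` parameter are hypothetical helpers introduced only for illustration.

```python
def jaro(s1: str, s2: str) -> float:
    """Textbook Jaro similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters match only if they are equal and close enough in position.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Transpositions: matched characters that line up in a different order.
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3.0


def common_prefix_len(s1: str, s2: str) -> int:
    """Number of leading characters the two strings share."""
    n = 0
    for a, b in zip(s1, s2):
        if a != b:
            break
        n += 1
    return n


def jaro_winkler_sketch(s1, s2, prefix_weight=0.1, max_prefix=None):
    """Assumed formula: jaro + p * w * (1 - jaro), clamped to 1.0.

    max_prefix=None leaves the prefix length uncapped (the behaviour the
    docstring above describes); max_prefix=4 mimics the cap used in the
    usual textbook definition of Jaro-Winkler.
    """
    j = jaro(s1, s2)
    p = common_prefix_len(s1, s2)
    if max_prefix is not None:
        p = min(p, max_prefix)
    return min(1.0, j + p * prefix_weight * (1.0 - j))


a, b = 'gestor de dho', 'gestor de residuos'
print(common_prefix_len(a, b))                       # shared prefix 'gestor de '
print(jaro(a, b))                                    # plain Jaro, below 1.0
print(jaro_winkler_sketch(a, b, 0.1))                # 10 * 0.1 cancels (1 - jaro)
print(jaro_winkler_sketch(a, b, 0.05))               # only half the gap is closed
print(jaro_winkler_sketch(a, b, 0.1, max_prefix=4))  # capped variant stays below 1.0
```

Under these assumptions the sketch should reproduce the values quoted in this thread to within floating-point rounding. The last call shows why other implementations, which typically cap the prefix at 4 characters, would not return 1.0 for this pair.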
I am not the author of the original code, just the maintainer; judging from the examples, one should use Jaro-Winkler to match individual name components, not full names.
In any case, this is not a bug but a documented feature.
Hi, I was testing your library and found a case of two non-identical strings that gives a Jaro-Winkler similarity of 1:
This doesn't seem correct. I thought that Jaro-Winkler could only be 1.0 for identical strings. Is this a bug?
Thanks for your attention.