ztane / python-Levenshtein

The Levenshtein Python C extension module contains functions for fast computation of Levenshtein distance and string similarity
GNU General Public License v2.0

Q: Different weight in .ratio and .distance #35

Open Talon24 opened 5 years ago

Talon24 commented 5 years ago

Calling Levenshtein.ratio gives a different result than calculating the ratio by hand from Levenshtein.distance, as this example shows:

In [1]: import Levenshtein as lev
   ...: a = "twostring"
   ...: b = "threestring"
   ...: ldist = lev.distance(a, b)
   ...: lensum = len(a) + len(b)
   ...: ratio = lev.ratio(a, b)
   ...: myratio = (lensum - ldist) / lensum # ~ line 771
   ...: print("lev.ratio: {}\n my ratio: {}".format(ratio, myratio))
   ...:
lev.ratio: 0.7
 my ratio: 0.8

After reading through the code, I noticed that levenshtein_common is called for the ratio with an increased cost for the replace operation. Is there a particular reason why the two functions should calculate this differently?
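To make the effect of the replace cost concrete, here is a minimal pure-Python sketch, not the extension's C code. The names edit_distance, normalized and replace_cost are made up for illustration, and the replace cost of 2 is an assumption that happens to reproduce the numbers above.

```python
def edit_distance(a, b, replace_cost=1):
    """Weighted edit distance; insertions and deletions always cost 1."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            substitute = prev[j - 1] + (0 if ca == cb else replace_cost)
            delete = prev[j] + 1
            insert = curr[j - 1] + 1
            curr.append(min(substitute, delete, insert))
        prev = curr
    return prev[-1]


def normalized(a, b, replace_cost):
    """(lensum - distance) / lensum, the formula used by hand above."""
    lensum = len(a) + len(b)
    if lensum == 0:
        return 1.0
    return (lensum - edit_distance(a, b, replace_cost)) / lensum


a, b = "twostring", "threestring"
print(edit_distance(a, b, replace_cost=1))  # 4
print(edit_distance(a, b, replace_cost=2))  # 6
print(normalized(a, b, replace_cost=1))     # 0.8
print(normalized(a, b, replace_cost=2))     # 0.7
```

With a replace cost of 1 the hand-computed 0.8 comes out, and with a replace cost of 2 the 0.7 reported by lev.ratio comes out, so for this pair the whole gap is explained by the replace cost.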

sstadick commented 5 years ago

I have also run into this inconsistency. It seems like they should use the same cost.

ztane commented 5 years ago

I am just a maintainer, not the original author. But please see the discussion here: https://stackoverflow.com/questions/14260126/how-python-levenshtein-ratio-is-computed

maxbachmann commented 3 years ago

ratio is based on the InDel distance (which only allows insertions and deletions), while distance is based on the uniform Levenshtein distance. I suppose this was done so that the results of ratio are closer to those of difflib's ratio function, while distance still provides the normal uniform Levenshtein distance. I agree that this can be surprising, and the documentation should probably include a note on this difference in behavior.
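As a rough illustration of that relationship, the sketch below computes an InDel-based ratio from the longest common subsequence and compares it with difflib for the strings from the original report. The helpers lcs_length and indel_ratio are hypothetical names, not part of python-Levenshtein, and only the standard library is used. difflib uses its own matching heuristic, so the two values are not guaranteed to agree for every input.

```python
from difflib import SequenceMatcher


def lcs_length(a, b):
    """Length of the longest common subsequence of a and b."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def indel_ratio(a, b):
    """Similarity based on the InDel distance (insertions and deletions only)."""
    lensum = len(a) + len(b)
    if lensum == 0:
        return 1.0
    # Every character outside the common subsequence must be deleted or inserted.
    indel = lensum - 2 * lcs_length(a, b)
    return (lensum - indel) / lensum


a, b = "twostring", "threestring"
print(indel_ratio(a, b))                     # 0.7
print(SequenceMatcher(None, a, b).ratio())   # 0.7
```

For this pair both values are 0.7, matching the lev.ratio output reported at the top of the thread.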