ztane / python-Levenshtein

The Levenshtein Python C extension module contains functions for fast computation of Levenshtein distance and string similarity
GNU General Public License v2.0
1.25k stars 156 forks source link

Operation weights not configurable #86

Open KthProg opened 1 year ago

KthProg commented 1 year ago

Hello.

Generally Lev allows configurable weights, I'd really like to see that here as using Lev as a substring "approximate" matcher doesn't work when the insert op is weighted the same as other operations. It's hard to tweak search results when the weights cannot be changed.

maxbachmann commented 1 year ago

You can use the Levenshtein fork for this

pip install Levenshtein

Which allows the following:

from Levenshtein import distance
distance(s1, s2, weights=(insertion, deletion, substitution))

A future release will allow the usage of character dependent weights as well: https://github.com/maxbachmann/RapidFuzz/issues/241

gamesguru commented 1 year ago

@maxbachmann

Is there any reference on which of these functions are fastest vs. slowest? We need to sort through about 10,000 search items very quickly. Or I can experiment with them.

And is there any maintenance work being done on the old version? We are supporting Python 3.4 still, and the new maintainer isn't providing backwards compatibility. Perhaps we can back port a few essentials?

maxbachmann commented 1 year ago

Is there any reference on which of these functions are fastest vs. slowest? We need to sort through about 10,000 search items very quickly. Or I can experiment with them.

The Levenshtein library only provides the Levenshtein distance with weights. Rapidfuzz provides a couple more metrics. When you say sort through 10,000 elements do you mean search for 1 query in these 10k elements, or do you mean a task like deduplication which will pretty much require 10k x 10k comparisions?

And is there any maintenance work being done on the old version?

No only the latest version is maintained

We are supporting Python 3.4 still, and the new maintainer isn't providing backwards compatibility.

I am this new maintainer ;)

Perhaps we can back port a few essentials?

The biggest problem is the fact, that rapidfuzz uses scikit-build for building. It is getting increasingly hard to build wheels for these versions, since tools like cibuildwheel and manylinux dropped support for them. I currently maintain Python support as long as all of the following are true:

It is not really worth it to maintain support for other versions, since the user base is really slim and most of them can just use an older version of the projects. Python 3.4 is a special case, since it was never supported by the library. I did have support for Python 2.7 + Python 3.5 until rapidfuzz < 2.0.

gamesguru commented 1 year ago

It's only one query over 10k items. But fairly long items and has to be <100ms ideally.

The original library python-Levenshtein<=0.12.2 did work on python3.4

Python 3.5 would be a good start. I'm not sure where the distance() function is defined?

Old pip versions also don't respect python_requires=, and it's necessary to tell it specifically what bounds to use for each version of python. It seems f-strings were introduced at some point, and now it is only 3.7+? Getting specific version pins on these breaking changes would help.

maxbachmann commented 1 year ago

It's only one query over 10k items. But fairly long items and has to be <100ms ideally.

for long items the new implementation is significantly faster:

A large amount of those performance advantages did already exist prior to rapidfuzz v2.0

The original library python-Levenshtein<=0.12.2 did work on python3.4

in case the old implementation worked for you, you can always just pin it

Python 3.5 would be a good start. I'm not sure where the distance() function is defined?

This function was new in v2.0.0 of rapidfuzz. v1.9.1 is the last version which supported Python 3.5. In this older version there is rapidfuzz.string_metrics.levenshtein.

Old pip versions also don't respect python_requires=, and it's necessary to tell it specifically what bounds to use for each version of python. It seems f-strings were introduced at some point, and now it is only 3.7+? Getting specific version pins on these breaking changes would help.

Generally I mention these changes in the changelog: https://github.com/maxbachmann/RapidFuzz/blob/main/CHANGELOG.rst: