Open KthProg opened 1 year ago
You can use the Levenshtein fork for this
pip install Levenshtein
Which allows the following:
from Levenshtein import distance
distance(s1, s2, weights=(insertion, deletion, substitution))
A future release will allow the usage of character dependent weights as well: https://github.com/maxbachmann/RapidFuzz/issues/241
@maxbachmann
Is there any reference on which of these functions are fastest vs. slowest? We need to sort through about 10,000 search items very quickly. Or I can experiment with them.
And is there any maintenance work being done on the old version? We are supporting Python 3.4 still, and the new maintainer isn't providing backwards compatibility. Perhaps we can back port a few essentials?
Is there any reference on which of these functions are fastest vs. slowest? We need to sort through about 10,000 search items very quickly. Or I can experiment with them.
The Levenshtein
library only provides the Levenshtein distance with weights. Rapidfuzz provides a couple more metrics. When you say sort through 10,000 elements do you mean search for 1 query in these 10k elements, or do you mean a task like deduplication which will pretty much require 10k x 10k comparisions?
And is there any maintenance work being done on the old version?
No only the latest version is maintained
We are supporting Python 3.4 still, and the new maintainer isn't providing backwards compatibility.
I am this new maintainer ;)
Perhaps we can back port a few essentials?
The biggest problem is the fact, that rapidfuzz
uses scikit-build
for building. It is getting increasingly hard to build wheels for these versions, since tools like cibuildwheel
and manylinux
dropped support for them.
I currently maintain Python support as long as all of the following are true:
It is not really worth it to maintain support for other versions, since the user base is really slim and most of them can just use an older version of the projects. Python 3.4 is a special case, since it was never supported by the library. I did have support for Python 2.7 + Python 3.5 until rapidfuzz < 2.0
.
It's only one query over 10k items. But fairly long items and has to be <100ms ideally.
The original library python-Levenshtein<=0.12.2
did work on python3.4
Python 3.5 would be a good start. I'm not sure where the distance()
function is defined?
Old pip versions also don't respect python_requires=
, and it's necessary to tell it specifically what bounds to use for each version of python. It seems f-strings
were introduced at some point, and now it is only 3.7+? Getting specific version pins on these breaking changes would help.
It's only one query over 10k items. But fairly long items and has to be <100ms ideally.
for long items the new implementation is significantly faster:
A large amount of those performance advantages did already exist prior to rapidfuzz v2.0
The original library python-Levenshtein<=0.12.2 did work on python3.4
in case the old implementation worked for you, you can always just pin it
Python 3.5 would be a good start. I'm not sure where the distance() function is defined?
This function was new in v2.0.0 of rapidfuzz. v1.9.1 is the last version which supported Python 3.5. In this older version there is rapidfuzz.string_metrics.levenshtein
.
Old pip versions also don't respect python_requires=, and it's necessary to tell it specifically what bounds to use for each version of python. It seems f-strings were introduced at some point, and now it is only 3.7+? Getting specific version pins on these breaking changes would help.
Generally I mention these changes in the changelog: https://github.com/maxbachmann/RapidFuzz/blob/main/CHANGELOG.rst:
Hello.
Generally Lev allows configurable weights, I'd really like to see that here as using Lev as a substring "approximate" matcher doesn't work when the insert op is weighted the same as other operations. It's hard to tweak search results when the weights cannot be changed.