ztane / python-Levenshtein

The Levenshtein Python C extension module contains functions for fast computation of Levenshtein distance and string similarity
GNU General Public License v2.0

I made a fork with wheels #61

Closed · polm closed this 1 year ago

polm commented 3 years ago

I forked this package and put wheels (OSX, Windows, Linux) on PyPI as levenshtein.

https://pypi.org/project/levenshtein/

This should fix the compiler-related installation issues that people seem to be having.

The last release of this project is from 2014, and I can't find any comments from the maintainer since 2019, so I assume they've moved on. At the moment the fork is just minimal changes to make wheels, but if there's interest I would be up for maintaining it.

ztane commented 3 years ago

Howdy, there is no need to fork; I can pass the maintainership forward. I have not been using python-Levenshtein for ages. Back in the day it was the only fast alternative when I needed lots of string matching, but its license has been a problem lately.

Lucas-C commented 3 years ago

Just FYI, there is another published version with wheels here: https://pypi.org/project/python-Levenshtein-wheels/ by @Tobotimus, who already opened a PR about wheel support on this repo 2 years ago: https://github.com/ztane/python-Levenshtein/pull/36

Maybe the people who worked on forks & alternative PyPI releases could team up & combine their efforts on this repo, given your proposal to hand over maintainership, @ztane?

The next step could be to migrate this project into https://github.com/python-Levenshtein/python-Levenshtein and give GitHub maintainer permissions to a few volunteers who have already contributed to this package? Or submit it to https://jazzband.co?

ztane commented 3 years ago

@Lucas-C I looked into creating an organization. To create an organization on GitHub, one needs to set up a single communications email for the organization, which will presumably be used for recovery and such.

polm commented 3 years ago

Is having a single communications email a problem? I think there still needs to be a primary maintainer if it's an organization, but presumably that email wouldn't have to be used regularly once other admin users are added.

maxbachmann commented 3 years ago

@ztane I am working on a project with a similar feature set: https://github.com/maxbachmann/RapidFuzz (in fact created mostly because of the license), and I am willing to help maintain this project.
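
For illustration, a minimal usage sketch (assuming the fuzz.ratio and string_metric.levenshtein APIs from the RapidFuzz documentation):

from rapidfuzz import fuzz, string_metric

# Normalized similarity in the 0-100 range.
print(fuzz.ratio("python-Levenshtein", "Levenshtein"))

# Plain Levenshtein distance.
print(string_metric.levenshtein("lewenstein", "levenshtein"))  # 2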

polm commented 3 years ago

This project is MIT-licensed and has Levenshtein distance, so it may be a good alternative.

https://github.com/Martinsos/edlib
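
A minimal sketch of its use, assuming edlib's align API and its editDistance/cigar result keys:

import edlib

# Global (NW) alignment corresponds to the plain Levenshtein distance.
result = edlib.align("lewenstein", "levenshtein")
print(result["editDistance"])  # 2

# Requesting the path additionally yields a CIGAR string describing the alignment.
result = edlib.align("lewenstein", "levenshtein", task="path")
print(result["cigar"])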

maxbachmann commented 3 years ago

edlib provides some useful information about the alignment. However, at least for short inputs (< 128 characters) it is a lot slower than python-Levenshtein. I performed a benchmark of multiple Levenshtein implementations a while ago:

@ztane What is your current plan for the maintenance? A couple of things that I think would be useful (I would be willing to work on them):

woctezuma commented 3 years ago

If I understand this plot [1] correctly, then i) all these libraries provide support for Levenshtein distance, ii) in Python, and iii) for every input string length, rapidfuzz [2] and polyleven [3] are a lot faster than the other libraries.

If true, then these two libraries look more enticing anyway... So I guess I am missing something there about support for python-Levenshtein. Does python-Levenshtein provide some utility that these two other libraries miss?

On a side note, I don't understand why the runtime seems quadratic w.r.t. the input length N for python-Levenshtein and linear for the other libraries, while you mention in [1] that the time complexity is O([N/64]*M).

Shouldn't all of these plots show a quadratic curve in the case N=M?

Or is it quadratic, but due to the division by 64, the x-axis is scaled by a factor of 8 when N=M? In this case, shouldn't the performance of rapidfuzz at N=512 be on par with or better than that of python-Levenshtein at N=64?

[image: Performance]

Sorry if I seem inquisitive, but I am curious now. :)

[1] https://maxbachmann.github.io/RapidFuzz/string_metric.html#levenshtein
[2] https://github.com/maxbachmann/RapidFuzz
[3] https://github.com/fujimotos/polyleven

maxbachmann commented 3 years ago

So I guess I am missing something there about support for python-Levenshtein. Does python-Levenshtein provide some utility that these two other libraries miss?

For most use cases these libraries would be a better fit, since they are faster and more permissively licensed. However, there are many people using python-Levenshtein, probably for multiple reasons. Some I could think of: 1) it has already existed for a long time, and 2) it is easier to find a library called Levenshtein than RapidFuzz/Polyleven.

It makes sense to fix some of the long-standing bugs and make it compatible with future Python versions. It would even be possible to implement the Levenshtein distance faster in python-Levenshtein as well. AFAIK this should not take a lot of time, since I already work on very similar libraries.

[...] in the worst case for rapidfuzz it is O(NM) as well. You mention O([N/64]M), but that is O(N*M) as well, right?

All the other libraries use an implementation based on Myers or Hyyrö. They convert one of the two strings into bitvectors of computer word length. Afterwards they iterate over the second string and calculate the Levenshtein distance for a whole bitvector at once using bitwise operations. All of these implementations use 64-bit computer words, although it would be possible to extend this further to 128/256/512 bits using SIMD. You can clearly see this behavior in the graph: 1) there are steps every 64 characters, since beyond that point another bitvector is required, and 2) there is linear growth in between, since each character of the second string is compared against each bitvector.
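
For readers unfamiliar with the technique, here is a minimal single-word sketch (Hyyrö's formulation of Myers' algorithm) in pure Python, limited to a first string of at most 64 characters:

def bitparallel_levenshtein(a, b):
    # Single-word Myers/Hyyrö bit-parallel Levenshtein distance, len(a) <= 64.
    if not a:
        return len(b)
    assert len(a) <= 64, "single-word variant: first string must fit one machine word"

    # Bitmask per character of `a`: bit i is set iff a[i] == c.
    peq = {}
    for i, c in enumerate(a):
        peq[c] = peq.get(c, 0) | (1 << i)

    mask = (1 << len(a)) - 1
    last = 1 << (len(a) - 1)
    vp, vn = mask, 0              # vertical positive/negative delta vectors
    dist = len(a)

    for c in b:                   # one round of bitwise operations per character of b
        eq = peq.get(c, 0)
        xv = eq | vn
        xh = ((((eq & vp) + vp) & mask) ^ vp) | eq
        hp = vn | (~(xh | vp) & mask)
        hn = vp & xh
        if hp & last:
            dist += 1
        if hn & last:
            dist -= 1
        hp = ((hp << 1) | 1) & mask
        hn = (hn << 1) & mask
        vp = hn | (~(xv | hp) & mask)
        vn = hp & xv
    return dist

print(bitparallel_levenshtein("lewenstein", "levenshtein"))  # 2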

Especially in the graph of polyleven you can see the onset of the quadratic growth.

In this case, shouldn't the performance of rapidfuzz at N=512 be on par with or better than that of python-Levenshtein at N=64?

You cannot really compare them this way, since they use completely different algorithms, so each step already takes a different amount of time. This can be seen for editdistance and edlib: both of them use the bit-parallel approach, but have a very inefficient implementation (e.g. one of them uses a hash map, which is very slow).
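
As a back-of-the-envelope illustration of why raw input lengths cannot be compared across algorithms, the operation counts can even coincide while the cost per operation differs:

# Classic DP: one cell update per character pair.
classic_ops = 64 * 64                  # N = M = 64  -> 4096 cell updates
# Bit-parallel (64-bit words): one word update per (word, character) pair.
bitparallel_ops = (512 // 64) * 512    # N = M = 512 -> 4096 word updates
print(classic_ops, bitparallel_ops)    # equal counts, very different per-operation cost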

In my opinion RapidFuzz is the best suited of these libraries, for the following reasons (as the author I am super biased):

maxbachmann commented 3 years ago

@ztane any updates?

ztane commented 3 years ago

@maxbachmann hey, yeah, I do not really have much to update. Unfortunately, I've not needed approximate string matching at scale for ages and therefore would have only limited time for python-Levenshtein. And since the library is still so widely used, I'd hesitate to make any changes to it in my limited time. It would perhaps make most sense to have a library that's drop-in API compatible with python-Levenshtein (just need to change the import) and point people to it.

maxbachmann commented 3 years ago

@ztane As far as I understood you, you prefer a fork with a new package name over passing on ownership of python-Levenshtein. @polm does not have the time to take on maintenance of the project, so he handed over ownership of the PyPI project Levenshtein to me. I created a fork of the project, which can be found here: https://github.com/maxbachmann/Levenshtein. It will remain API compatible with python-Levenshtein (same import + same API), so all that changes for users is that they need to install Levenshtein instead of python-Levenshtein.
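
A minimal sketch of the switch, using the distance and ratio functions from the existing python-Levenshtein API:

# Before: pip install python-Levenshtein
# After:  pip install levenshtein
import Levenshtein

print(Levenshtein.distance("lewenstein", "levenshtein"))  # 2
print(Levenshtein.ratio("lewenstein", "levenshtein"))     # normalized similarity in [0, 1]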

I just released version 0.13.0 which:

The full release notes can be found here: https://github.com/maxbachmann/Levenshtein/releases/tag/v0.13.0

ztane commented 3 years ago

@maxbachmann well, I can pass the maintainership of the python-Levenshtein PyPI project and would gladly do so. The reason I've been stuck as the maintainer was that back in the day there was no fast enough library for different distance calculations for Python 3, there were PRs already, and miohtama didn't have time to merge and test the Python 3 changes, so I did it to scratch an itch.

Back then, for that project, GPL was an acceptable licensing option. But right now I am constantly doing projects where relicensing the entire code base is an impossibility, so what I do is simply not use python-Levenshtein. In my opinion, having a utility library licensed under the GPL is a mistake. I would rather see people not forking the codebase but doing a complete rewrite under some sensible license, and I could have time to help with that.

Right now it is rather silly for me to spend time "maintaining" (i.e. adding new features, writing documentation that did not exist in the first place, etc.) a project that I cannot use myself, a project based on algorithms whose papers I've not read, answering questions about algorithms whose correctness I cannot vouch for, etc.

graingert commented 3 years ago

@maxbachmann well, I can pass the maintainership of the python-Levenshtein PyPI project and would gladly do so. The reason I've been stuck as the maintainer was that back in the day there was no fast enough library for different distance calculations for Python 3, there were PRs already, and miohtama didn't have time to merge and test the Python 3 changes, so I did it to scratch an itch.

Back then, for that project, GPL was an acceptable licensing option. But right now I am constantly doing projects where relicensing the entire code base is an impossibility, so what I do is simply not use python-Levenshtein. In my opinion, having a utility library licensed under the GPL is a mistake. I would rather see people not forking the codebase but doing a complete rewrite under some sensible license, and I could have time to help with that.

Right now it is rather silly for me to spend time "maintaining" (i.e. adding new features, writing documentation that did not exist in the first place, etc.) a project that I cannot use myself, a project based on algorithms whose papers I've not read, answering questions about algorithms whose correctness I cannot vouch for, etc.

Why not transfer the project to jazzband?

polm commented 3 years ago

Hey, just wanted to clarify that like @maxbachmann said, my situation has changed since I opened this issue and I no longer have time to maintain this. Anyone who's using the code is welcome to the wheel code I added and I've transferred the levenshtein name on PyPI to Max.

I think @ztane's points about the license are correct. Since it sounds like Max is rewriting things anyway, maybe it makes sense to have a v2 branch that preserves the API but replaces the entire implementation from scratch, and re-license that. I think the point that most people using the library aren't following development, and that this gives merit to continuing to use the name, is absolutely true.

maxbachmann commented 3 years ago

Back then, for that project, GPL was an acceptable licensing option. But right now I am constantly doing projects where relicensing the entire code base is an impossibility, so what I do is simply not use python-Levenshtein. In my opinion, having a utility library licensed under the GPL is a mistake. I would rather see people not forking the codebase but doing a complete rewrite under some sensible license, and I could have time to help with that.

Since it sounds like Max is rewriting things anyway, maybe it makes sense to have a v2 branch that preserves the API but replaces the entire implementation from scratch, and re-license that. I think the point that most people using the library aren't following development, and that this gives merit to continuing to use the name, is absolutely true.

Yes, actually I already rewrite most things in C++ under the MIT license, since 1) I agree that a utility library like this should not use the GPL (in fact a big share of the projects using python-Levenshtein violate this license), and 2) C++ support nowadays is good enough that it is really not worth the pain to write the algorithms in C (happy to hear your opinion on the C vs C++ question).

Current state of things

Python 2.7 support

Do we still need Python 2.7 support for a rewrite? python-Levenshtein still gets around 10% of its downloads from Python 2.7, so even though it requires some effort (string handling worked differently in Python 2.7), it might be worth maintaining it a bit longer.

C Library

@ztane are there many people using this as a C library? C++ is a lot simpler to use for these algorithms, since we often match many different string types, so templates are very helpful. And in my experience it is simple enough to write a C wrapper for something like this when needed.

rewritten algorithms

All algorithms I rewrite are currently placed in https://github.com/maxbachmann/rapidfuzz-cpp, which is used in both https://github.com/maxbachmann/Levenshtein and https://github.com/maxbachmann/RapidFuzz, and they are all MIT licensed.

distance / ratio / hamming

These algorithms are all already implemented in rapidfuzz, so I simply reuse its optimized implementations.

jaro / jaro-winkler

The implementation in python-Levenshtein has a few bugs. I developed a much faster implementation of the algorithm; however, it still requires some more development time. It will afterwards be placed in rapidfuzz-cpp.

median, median_improve, quickmedian, setmedian, seqratio, setratio

I have no clue about these algorithms, and since I am not able to copy them from the current implementation (license), I am currently unable to reimplement them.

editops / opcodes / inverse / apply_edit / matching_blocks / subtract_edit

They have really weird APIs, which do not use keyword arguments and instead overload on the argument count... It also led to many bug reports that matching_blocks does not actually return the same result as difflib. For rapidfuzz I created a fast C++ rewrite of the algorithm from difflib. However, I am still unsure how to properly replace the current mess.
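
A minimal sketch of the API shape in question, assuming the documented editops/matching_blocks signatures (the outputs are deliberately not asserted here, since the two results can differ):

import difflib
import Levenshtein

a, b = "qabxcd", "abycdf"

# python-Levenshtein: matching_blocks takes the editops plus both strings.
ops = Levenshtein.editops(a, b)
print(Levenshtein.matching_blocks(ops, a, b))

# difflib: the stdlib reference many users expect the result to match.
print(difflib.SequenceMatcher(None, a, b).get_matching_blocks())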

Licensing

I would generally love to use a more open license for the project. However, I would probably need help from someone with the median algorithms, which I do not know.

Future of python-Levenshtein

For users it would probably help to keep publishing to python-Levenshtein, so their software keeps working on upcoming Python versions.

Collaboration

I personally do not like the idea of transferring the project to jazzband, since we are not really just a Python project. E.g. there are definitely people using rapidfuzz-cpp as a standalone C++ library (someone recently packaged it for Conan as well). However, especially if someone else like @ztane is interested in helping with more permissively licensed implementations, it could be worth placing:

in a common organisation

ztane commented 3 years ago

I'd say one good reason for the rewrite is totally dropping the Python 2 support. Would make the code cleaner too.

As for C++, it is probably fine to use a C++ compiler, but C++ can be tricky otherwise in that one needs to recompile for different stdlib implementations, more so than for C, so it would be nice if the C++ code didn't use much of the standard library. The C++ templates and such would still be fine.

maxbachmann commented 3 years ago

I'd say one good reason for the rewrite is totally dropping the Python 2 support. Would make the code cleaner too.

Generally I would split it into a Python wrapper and the implementation of the algorithms. The current macro-based selection is e.g. broken in the latest version, since safe_malloc is only defined when Python is used.

As for C++, it is probably fine to use a C++ compiler, but C++ can be tricky otherwise in that one needs to recompile for different stdlib implementations, more so than for C, so it would be nice if the C++ code didn't use much of the standard library. The C++ templates and such would still be fine.

Which parts of the stdlib/C++ do you consider problematic? I currently use:

Note that the implementation so far is mostly header-only, since the functions are either templates or very short. Most things could probably be implemented using fewer of these features, but it should be documented which features can and can't be used.

maxbachmann commented 3 years ago

@maxbachmann well, I can pass the maintainership of the python-Levenshtein PyPI project and would gladly do so. The reason I've been stuck as the maintainer was that back in the day there was no fast enough library for different distance calculations for Python 3, there were PRs already, and miohtama didn't have time to merge and test the Python 3 changes, so I did it to scratch an itch.

@ztane I think this would be a good first step, so I can upload new releases for python-Levenshtein as well (https://pypi.org/user/maxbachmann/)

maxbachmann commented 3 years ago

@ztane ping regarding the maintainership of the python-Levenshtein PyPI project

maxbachmann commented 2 years ago

@ztane it has been 2 months. Are there any updates on this?

maxbachmann commented 2 years ago

I reached similar performance improvements for Levenshtein.jaro_winkler, which will be added to the project once the JaroWinkler package is released.

@ztane is there any update on this? I would love to take up maintainership of the project.

maxbachmann commented 2 years ago

In addition, I significantly improved the editops implementation in RapidFuzz. It is not only faster, but requires 16x less memory as well. This will be added to the Levenshtein project once RapidFuzz v2.0.0 is released.

[image: editops]

maxbachmann commented 2 years ago

Released v0.18.0, which includes the performance improvements and fixes multiple bugs in the original implementation. E.g. the following code previously allocated a string using an uninitialized variable as its size:

>>> import Levenshtein
>>> Levenshtein.quickmedian(["tes", "teste"], [0, 0])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: character U+91594b00 is not in range [U+0000; U+10ffff]

@ztane Do you still plan to hand over maintainership of this project? It has now been around a year since my original offer to maintain this project. I can completely understand if you want to continue maintaining the project. However, I would appreciate it if you could give an update on this so that I don't just waste my time here.

woctezuma commented 2 years ago

I don't think you are wasting your time, as some people have found your project thanks to this issue and your plots. ❤️

maxbachmann commented 2 years ago

I finally had the time to continue working on the library. The latest release finalizes the work required to keep the Levenshtein library working in Python 3.12 when the deprecated Unicode APIs get removed (see #54).

maxbachmann commented 2 years ago

I now have maintainer privileges for the python-Levenshtein project on PyPI and will continue to upload subsequent releases to both the Levenshtein and python-Levenshtein projects.

Since after installation the project was always imported as Levenshtein anyway, I converted python-Levenshtein into a metapackage which simply installs the corresponding version of Levenshtein (e.g. python-Levenshtein==0.20.4 will install Levenshtein==0.20.4 in the background). This avoids clashes when people install both Levenshtein and python-Levenshtein, which would previously install into the same directory. The code for this metapackage can be found here: https://github.com/maxbachmann/python-Levenshtein
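
A hypothetical sketch of what such a metapackage boils down to (assumed for illustration, not the actual repository contents): a setup.py whose only job is to depend on the matching Levenshtein release.

from setuptools import setup

setup(
    name="python-Levenshtein",
    version="0.20.4",
    description="Metapackage that installs the Levenshtein package",
    # Pin to the exact matching release so both package names stay in lockstep.
    install_requires=["Levenshtein==0.20.4"],
)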

woctezuma commented 2 years ago

Congratulations! 🎉

polm commented 1 year ago

Closing since @maxbachmann has taken care of this. Thanks Max!