-
Notifications
You must be signed in to change notification settings - Fork 155
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
I made a fork with wheels #61
Comments
Howdy, there is no need to fork, I can pass the maintainership forward. I have not been using python-Levenshtein for ages, back in the day it was the only fast alternative when I needed lots of string matching but its license has been a problem lately. |
Just FYI, there is another published version with wheels there : https://pypi.org/project/python-Levenshtein-wheels/ by @Tobotimus, who already opened a PR about wheel support on this repo 2 years ago: #36 Maybe the people who worked on forks & alternative Pypi releases could team up & gather their efforts on this repo, The next step could be to migrate this project into https://github.com/python-Levenshtein/python-Levenshtein and give GitHub maintainer permissions to a few volunteers who have already contributed to this package? Or submit it to https://jazzband.co? |
@Lucas-C I looked into creating an organization. To create an organization on Github one needs to set up a single communications email for the organization, which presumably will be used for recovery and such. |
Is having a single communications email a problem? I think there still needs to be a primary maintainer if it's an organization, but presumably that mail wouldn't have to be used regularly once other admin users are added. |
@ztane I am working on a project with a similar feature set: https://github.com/maxbachmann/RapidFuzz (In fact mostly created because of the License) and I am willing to help maintaining this project |
This project is MIT Licsensed and has Levenshtein distance, so it may be a good alternative. |
edlib provides some useful information about the alignment. However at least for short input (< 128 characters) it is a lot slower, than python-Levenshtein. I did perform a benchmark for multiple Levenshtein implementations a while ago: @ztane What is your current plan on the maintenance? A couple of things that I think would be useful (I would be willing to work on them)
|
If I understand this plot [1] correctly, then all these libraries i) provide support for Levenshtein distance ii) in Python, and iii) for every input string length, If true, then these two libraries look more enticing anyway... So I guess I am missing something there about support for On a side-note, I don't understand why the runtime seems quadratic w.r.t. the input length N for
Shouldn't all of these plots show a quadratic curve in the case N=M? Or is it quadratic, but due to the division by 64, the x-axis is scaled by a factor 8 when N=M? Sorry if I look inquisitive, but I am curious now. :) [1] https://maxbachmann.github.io/RapidFuzz/string_metric.html#levenshtein |
For most use cases these libraries would be a better fit, since they are faster and more open licensed. However, there are many people using python-Levenshtein. This has probably multiple reasons. Some I could think of:
It makes sense to fix some of the long-standing bugs and make it compatible with future Python versions. It would be even possible to implement the Levenshtein distance in faster for python-Levenshtein as well. AFAIK this should not take a lot of time, since I already work on very similar libraries.
All the other libraries use an implementation based on Myers or Hyyrö. They convert one of the two strings into bitvectors of computer word length. Afterwards they iterate over the second string and calculate the Levenshtein distance for the whole bitvector at once using bitwise operations. In all of these implementations computer words of 64 bit are used. However it would be possible to further extend this to 128/256/512 using SIMD. You can clearly see this behavior in the graph:
Especially in the graph of polyleven you can see the start of the exponential growth.
You can not really compare them this way, since they use completely different algorithms. So the runtime for each step already takes different times. This can be seen for editdistance and edlib. Both of them use the bitparallel implementation, but have a very inefficient implementation (E.g. one of them uses a hash map which is very slow). In my opinion RapidFuzz is the best suited of the libraries for the following reasons (as the author I am super biased)
|
@ztane any updates? |
@maxbachmann hey, yeah, I do not really have much to update. Unfortunately enough, I've not needed approximate string-matching in scale for ages and therefore would have only limited time for Python-Levenshtein. And since the library is still so widely used I'd hesitate making any changes to it in my limited time. It would perhaps make most sense to have a library that's drop-in API compatible with python-Levenshtein (just need to change the import) and point people to it. |
@ztane As far as I understood you, you prefer a fork with a new package name over passing on ownership of python-Levenshtein. @polm does not has the time to take on maintenance of the project, so he handed over the ownership of the pypi project I just released version 0.13.0 which:
The full release notes can be found here: https://github.com/maxbachmann/Levenshtein/releases/tag/v0.13.0 |
@maxbachmann well, I can pass the maintainership of the python-Levenshtein PyPI project and would gladly do so. The reason why I've been stuck as the maintainer was that back in the day there was no fast enough library for different distance calculations back in the day for Python 3 and there were PRs already and miohtama didn't have time to merge and test the Python 3 changes, so I did it to scratch an itch. Back in that project GPL was acceptable licensing option. But right now, I am all the time doing projects where relicensing the entire code base is an impossibility, so the thing I do is to not use python-Levenshtein. In my opinion having a utility library licensed under GPL is a mistake. I would rather see people not forking the codebase but doing a complete rewrite that was under some sensible license, and I could have time to help with that. Right now it is rather silly for me to spend time "maintaining" (i.e. adding new features etc, writing documentation that did not exist in the first place) to a project that I cannot use myself, project based on algorithms whose papers I've not read etc, answering questions about algorithms whose correctness I cannot vouch for etc. |
Why not transfer the project to jazzband? |
Hey, just wanted to clarify that like @maxbachmann said, my situation has changed since I opened this issue and I no longer have time to maintain this. Anyone who's using the code is welcome to the wheel code I added and I've transferred the I think @ztane's points about the license are correct. Since it sounds like Max is rewriting things anyway maybe it makes sense to have a v2 branch that just preserves the API, but replaces the entire implementation from scratch, and re-license that. I think the point that most people using the library aren't following development, and that that gives merit to continuing to use the name, is absolutely true. |
Yes actually I do rewrite most things in C++ under the MIT license already, since
Current state of thingsPython2.7 supportDo we still need Python2.7 support for a rewrite? python-Levenshtein still has around 10% downloads for python2.7, so even though it requires some effort (string handling in Python worked differently in Python2.7) it might be worth maintaining it a bit longer. C Library@ztane are there many people using this as C library? C++ is a lot simpler to use for these algorithms, since we often match many different string types, so templates are very helpful. And in my experience it is simple enough to write a wrapper for something like this in C when needed. rewritten algorithmsAll algorithms I rewrite are currently placed in the following projects: https://github.com/maxbachmann/rapidfuzz-cpp, which is then used both in https://github.com/maxbachmann/Levenshtein and https://github.com/maxbachmann/RapidFuzz and are all MIT licensed. distance / ratio / hammingThese algorithms are all already implemented in rapidfuzz, so I simply reuse their optimized implementations jaro / jaro-winklerThe implementation in python-Levenshtein has a few bugs. I developed a way faster implementation of the algorithm. However it still requires some more development time. It will afterwards be placed in rapidfuzz-cpp median, median_improve, quickmedian, setmedian, seqratio, setratioI have no clue about these algorithms and since I am not able to copy them from the current implementation (license), I am currently unable to re implement them. editop / opcodes / inverse / appy_edit / matching_blocks / subtract_editThey have really weird APIs, which do not use keyword arguments and instead overload on the args count .... LicensingI would generally love to use a more open License for the project. However I would probably need help by someone for the whole median algorithms I do not know. Future of python-LevenshteinFor users it would probably help to keep publishing to python-Levenshtein, so their software keeps working in upcoming python versions. CollaborationI personally do not like the idea transfer the project to jazzband, since we are not really just a Python project. E.g. there are definetly people using rapidfuzz-cpp as a standalone C++ library (someone recently packaged it for Conan as well). However especially when someone else like @ztane is interested in helping with more open licensed implementations it could be worth placing:
in a common organisation |
I'd say one good reason for the rewrite is totally dropping the Python 2 support. Would make the code cleaner too. As for the C++, it probably is fine to use a C++ compiler, but C++ can be tricky otherwise in that need to recompile for different stdlib implementation, more so than for C, so it would be nice if the C++ code wouldn't use much of the standard library. The C++ templates and such would be fine still. |
Generally I would split it into a Python Wrapper and the implementation of the algorithms. The current macro based selection is e.g. broken in the latest version since safe_malloc is only defined when Python is used.
Which parts of the stlib/C++ do you consider problematic? I currently use:
Note that the implementation so far is mostly header only, since the functions are either templates or very short. Most things could probably be implemented using less of these features, but it should be documented which features can/can't be used. |
@ztane I think this would be a good first step, so I can upload new releases for python-Levenshtein as well (https://pypi.org/user/maxbachmann/) |
@ztane ping regarding the maintainership of the python-Levenshtein PyPI project |
@ztane it has been 2 months. Are there any updates on this? |
I reached similar performance improvements for @ztane is there any update on this? I would love to take up maintainership of the project. |
Released v0.18.0 which includes the performance improvements and fixes multiple bugs from the original implementation. E.g. the following code previously did allocate a string using an uninitialized variable as size. >>> import Levenshtein
>>> Levenshtein.quickmedian(["tes", "teste"], [0, 0])
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: character U+91594b00 is not in range [U+0000; U+10ffff] @ztane Do you still plan to hand over maintainership of this project? It has now been around a year since my original offer to maintain this project. I can completely understand if you want to continue maintaining the project. However, I would appreciate it if you could give an update on this so that I don't just waste my time here. |
I don't think you waste your time, as some people have found your project thanks to this issue and your plots. ❤️ |
I finally had the time to continue working on the library. The latest release finalizes the work required to keep the Levenshtein library working in Python 3.12 when the deprecated Unicode APIs get removed (see #54). |
I do now have maintainer privileges for the Since after install the project was always used |
Congratulations! 🎉 |
Closing since @maxbachmann has taken care of this. Thanks Max! |
I forked this package and put wheels (OSX, Windows, Linux) on PyPI as
levenshtein
.https://pypi.org/project/levenshtein/
This should fix compiler related issues with install that people seem to be having.
The last release of this project is from 2014, and I can't find any comments from the maintainer since 2019, so I assume they've moved on. At the moment the fork is just minimal changes to make wheels, but if there's interest I would be up for maintaining it.
The text was updated successfully, but these errors were encountered: