-
-
Notifications
You must be signed in to change notification settings - Fork 30.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base64
module: Link against SIMD library for 10x performance.
#124951
Comments
I think this is a good idea, but it seems that relying too much on external C libraries may cause instability (such as many linking and compilation problems). Do you have any better solution to this problem? |
We're not really concerned about that. We aleady link against a bunch of libraries and have vendored a few (as much as we try to avoid that), including some involving SIMD code, in the past. |
For the performances themselves, they are worth the inclusion IMO (base64 is very common in web applications)
The default base64 lib comes with coreutils IIRC (at least on most Linux distributions). But for SIMD support, I don't know of any that is well-known (there is an Alpine Linux package for that though which directly uses the lib you linked but well.. that's the only one).
Yes, but that could give distros more confidence and reasons in pushing it into a proper
I think it's a good idea to make built-in functions faster when possible if the cost of introducing the overall machinary is not too much (but yes, let's open a separate issue if needs arise). |
From a general approach, I am not sure it's worth maintaining SIMD-related logic in CPython. |
FWIW, we already have SIMD and/or accelerated instruction use in hashlib code (and I expect to add more, as HACL* can provide it to us - our goal is to eliminate the need to link hashlib against openssl).
It feels like the kind of thing that a C version of, if written in the appropriate manner, might get a compiler to auto-vectorize? But tricking compilers into doing that for you instead of explicitly just doing what you wanted to happen can be less maintainable. In general though I wouldn't reach for SIMD on things like our main data types unless it demonstrated a meaningful measurable performance suite win. (bytes.translate is rather niche in that regard) JSON is a similarly important widely used bulk data format. A lot more complicated to use SIMD on. There are a variety of libraries out there doing it. If anyone wants to consider a behavior compatible version doing that as the A more interesting library to consider linking against than what I opened with may be https://github.com/simdutf/simdutf which covers UTF8 & friends as well as base64. (though it is C++ and I believe we've so far avoided requiring C++ linkage within CPython itself?) Unrelated other than the SIMD theme - https://pypi.org/project/stringzilla/. It does things that even adopting simplest conservative versions of them for use by our CPython internals in a compatible manner might be enough to show up on the performance suite. Someone would need to take on trying it and testing to find out. |
My #.1 criteria is maintainability. #.2 criteria is "is it widely used on enough data to matter?" with a #.3 feeding into that of "Would doing this reduce the need to train people to always use a special PyPI package instead of the stdlib for their common task that either can do because of a performance concern?" |
Performance enhancement
Proposal:
https://pypi.org/project/pybase64/ aka https://github.com/mayeut/pybase64 (BSD licensed) exists. On top of some of its own SIMD code for base64 module extra features (character translation)^, it links against https://github.com/aklomp/base64, a BSD licensed C99 library with SIMD acceleration giving 5-20x performance on base64 encoding and decoding operations vs our existing generic byte based base64 C code.
We could adopt a bunch of the pybase64 code to make the default base64 module experience better - it is relatively straight forward extension module code (as one would expect). On the other hand, I expect pybase64 to still be where new development and further improvements in this space continue to happen as people who care strongly about performance need the latest and greatest from PyPI regardless of their current CPython version. (looping in @mayeut for thoughts on that)
Practicalities: Library availability? we'd vendor a libbase64 build for use on our binary distributions. I don't think it is currently widely available (? I only did a quick search on Ubuntu) as a package on Linux distributions though so we'd currently need to vendor our own copy in tree to be fair and match the good performance there (yuck, but ideally only temporary until distros pick it up as a package of its own, consider it similar to a Modules/_decimal/libmpdec/ situation - our configure.ac finds an installed one & distros link against that)
Risks: It is a new C library dependency. Security concerns within it thus become our own. As
base64
is frequently used to process untrusted input. But its surface of possible problems is limited (very simple data format). We should ensure the library gets proper oss-fuzz test coverage before adoption (@aklomp for visibility).^
bytes.translate
,bytearray.translate
, orstr.translate
might benefit from similar SIMD treatment - which would be better from a CPython perspective than only doing that within this module? If so, lets file a new issue just for that bit.Has this already been discussed elsewhere?
No response given
Links to previous discussion of this feature:
If we spawn Discuss threads around this, lets edit and drop links here.
The text was updated successfully, but these errors were encountered: