Precompiled wheels with CuBLAS activated #243
As per @jmtatsch's reply to my idea of pushing pre-compiled Docker images to Docker hub, providing precompiled wheels is likely equally problematic due to:
So you'd potentially need to pre-compile for every possible combination of hardware and CUDA version. Assuming (conservatively) there are, say, 32 (ARM64 + Intel) hardware combinations and 11 possible major CUDA versions, that's 352 wheels. Then you need to somehow educate the user on how to choose the single best wheel for their environment. 🤦 You could provide a static wheel with all AVX* disabled and no CUDA/cuBLAS support, using OpenBLAS instead, but then you'd get people complaining about how much faster a natively compiled build is. |
Not an easy task I confess. |
@ParisNeo on Discord trying to decide what all the moving parts are: |
AVX2+FMA and OpenCL compatibility is a pretty good assumption (going back to Ryzen 1000/Intel 4000), and it would be reasonably fast on most hardware. Users on older hardware or other ISAs can build for themselves, and you could print a warning in the console if the device supports higher feature sets. CUDA is a problem though. |
It would add another dependency, but could sanity check the AVX or FMA capabilities of the host before loading the model:
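For example, something along these lines (a rough sketch that assumes the optional py-cpuinfo package; the required flag set would be whatever the wheel was actually compiled with):

# Sketch: refuse to load a wheel built with AVX2/FMA on a host that lacks them.
# Assumes the third-party py-cpuinfo package (`pip install py-cpuinfo`).
from cpuinfo import get_cpu_info

REQUIRED_FLAGS = {"avx2", "fma"}  # assumption: what the precompiled wheel was built with

def check_host_supports_wheel() -> None:
    flags = set(get_cpu_info().get("flags", []))
    missing = REQUIRED_FLAGS - flags
    if missing:
        raise RuntimeError(
            f"This wheel was compiled with {', '.join(sorted(REQUIRED_FLAGS))} "
            f"but this CPU is missing: {', '.join(sorted(missing))}. "
            "Please build llama-cpp-python from source instead."
        )

if __name__ == "__main__":
    check_host_supports_wheel()
    print("CPU supports the required instruction sets.")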
|
Could you please at least provide a precompiled wheel with default OpenBLAS support? |
@raymond-infinitecode do you have F16C, AVX, AVX2, AVX512, and/or FMA support on your CPU? |
Hi, yes I have all of them except AVX512. Is that enough for a performance boost without OpenBLAS? |
"Enough" also depends on the size of your CPU L1/L2/L3 caches, main memory I/O speed, the size of your model, etc. In any case @AlphaAtlas's proposal would mean not supporting the AVX2 extensions on your CPU, so it would run slower than it could compared to if you compiled from source. You're still interested in convenience over performance? |
I am all for performance. If we manage to borrow a 32-core Xeon, how well would the Python binding utilise it? |
Horribly. After 4 physical cores, each additional core provides diminishing returns; 8 cores is about optimal. That 32-core Xeon likely has AVX512, which means you would get a nice performance bump even though it really can't utilise much over 8-12 cores. |
The readme tells people to use the command line option "-t 8", causing 8 threads to be started. On systems with fewer than 8 cores, this causes a significant slowdown. Remove the option from the example command lines and use /proc/cpuinfo on Linux to determine a sensible default.
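A sensible default could be computed in the binding itself rather than documented as a fixed "-t 8". A rough sketch of that idea (the cap of 8 and the /proc/cpuinfo parsing are assumptions; non-Linux hosts fall back to os.cpu_count()):

import os

def default_thread_count(cap: int = 8) -> int:
    """Pick a default thread count: physical cores from /proc/cpuinfo, capped."""
    physical_cores = set()
    try:
        phys_id = "0"
        with open("/proc/cpuinfo") as f:
            for line in f:
                if line.startswith("physical id"):
                    phys_id = line.split(":", 1)[1].strip()
                elif line.startswith("core id"):
                    # unique (socket, core) pairs = physical cores
                    physical_cores.add((phys_id, line.split(":", 1)[1].strip()))
    except OSError:
        pass  # not Linux, or /proc is unavailable
    n = len(physical_cores) or os.cpu_count() or 1
    return max(1, min(cap, n))

if __name__ == "__main__":
    print(default_thread_count())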
Hi, I am interested in helping tackle this problem. I think taking inspiration from PyTorch might be the best idea. I see this as two problems: generating all the required wheels, and getting them into users' hands with the least friction.

For generating wheels, I think macOS/Windows wheels are easiest to build in CI. For Linux, building wheels in Docker containers each release is pretty straightforward, and I am happy to donate compute to do that. I also have a colocated server with a gigabit connection that could be used to host all of the wheels. I already host an apt repo (https://apt.cli.rs/), so this isn't my first time doing something like this. So basically, I'm happy to host all the wheels; I can probably store at least a TB of them (back-of-the-napkin math says each release will be 200-300 MB to store).

As for how to make it easy for the user: for getting the correct architecture on x86 we can do something like https://github.com/flababah/cpuid.py (writing instructions to rwx memory is basically what JITs do, so this isn't actually that cursed). Then we have a zero-dependency script users can run which automatically determines architecture and OS, which gets us most of the way (see the sketch below). For ARM, unless I'm mistaken (and I could be, development is quick these days), GGML only supports NEON, and I think defaulting to compiling with NEON makes sense; there are few platforms without it these days. Metal support can be chosen automatically based on OS, so that's easy. In fact there are pretty much only a few choices for macOS: ARM NEON or x86 with AVX2 (Metal can be included by default for both). We can also ask whether the user wants BLAS (with a short snippet about why they might want it).

I'm happy to work on this but I'd appreciate some upstream buy-in. I am imagining a script in the top level of the repo that users run to download the correct wheel for their platform and requirements. Feedback/suggestions/etc. are appreciated; I am not as knowledgeable about all the various methods of accelerating GGML, but I think I covered most of them. |
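As a starting point, the "zero-dependency script" could look something like this sketch (standard library only; the variant names and flag-to-variant mapping are illustrative, not an agreed naming scheme, and the flag check only covers Linux's /proc/cpuinfo):

import platform
import sys

def linux_cpu_flags() -> set:
    try:
        with open("/proc/cpuinfo") as f:
            for line in f:
                if line.startswith("flags"):
                    return set(line.split(":", 1)[1].split())
    except OSError:
        pass
    return set()

def pick_variant() -> str:
    system = platform.system()            # 'Linux', 'Darwin', 'Windows'
    machine = platform.machine().lower()  # 'x86_64', 'aarch64', 'arm64', ...
    if machine in ("arm64", "aarch64"):
        return "metal" if system == "Darwin" else "neon"
    flags = linux_cpu_flags()
    if "avx512f" in flags:
        return "avx512"
    if "avx2" in flags:
        return "avx2"
    if "avx" in flags:
        return "avx"
    return "basic"

if __name__ == "__main__":
    print(f"{platform.system().lower()}-{platform.machine().lower()}-{pick_variant()}"
          f"-cp{sys.version_info.major}{sys.version_info.minor}")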
Happy to support you with smoke testing in this endeavor if it reduces the number of build-related bugs logged against the project. Firstly, why not statically compile OpenBLAS into the non-GPU-accelerated wheels? That avoids any dynamic dependencies and probably wouldn't make them much larger:
EDIT: Oops, 60X larger. Maybe provide a statically compiled OpenBLAS that just supports AVX2 and earlier CPU optimizations? Secondly, do you have all the different hardware architectures, or are you planning on cross-compiling? We'd at least need a macOS environment to smoke test any Apple Metal BLAS builds. I have a mix of GPUs: a GTX 980Ti, GTX 1080Ti, RTX 2070 and RTX 3090Ti, although currently only the 1080Ti and 3090Ti are plugged into an Ubuntu 22.04.2 server with an older AMD Ryzen CPU. Varying the version of CUDA can be done through different CUDA instances using the supplied Docker files. |
Doing this on Linux is easy enough. On Windows... it's a bit less easy, but doable.
Yeah, limiting OpenBLAS to just AVX2 (maybe NEON?) could work.
I can run smoke tests on all of the x86 CPU-only builds automatically. I don't currently have a GPU in my server, so help with smoke tests for that would be appreciated. I also don't have any Apple devices, so help testing that would be very useful. I was planning on using GitHub Actions for macOS wheels, doing the x86 builds in containers on Linux, and maybe also Windows in VMs I can run. For x86, no cross-compiling should be needed; it's just passing the requisite flags to limit compilation to SSE/AVX/AVX2/etc. For ARM, I have a device I could set up to compile, but tbh a VM may be faster, I'll have to benchmark it. |
I can cover the CUDA testing on Ubuntu Linux using NVidia docker instances. |
Great! I think that covers most targets, minus macOS. I'm actually unsure if you can run Metal jobs in GitHub Actions; do you know if that's possible? |
Any hardware-specific requirements such as GPUs aren't going to be supported, at a guess. The code will be running on virtualized x86_64 instances with virtual video drivers. That's why I was offering my CUDA APIs. Once you have the wheels built it is relatively easy to add a GitHub Action to install them on CUDA Docker images and push them to Docker Hub. @abetlen is already doing it here. Then it is a matter of polling Docker Hub for new CUDA llama-cpp-python images and smoke testing them on my kit. Not ideal, but at least that way we would discover any upstream breakage. |
Hi there. Is it possible to have a parameter to activate/deactivate CPU features? |
@ParisNeo the llama.cpp library does not support doing that at runtime, unfortunately :(
@gjmulder A horrible hack / shower thought I had was to have a Docker image that ships with a compiler, the Python source, and all dependencies installed, and then just have a script in the image that rebuilds the shared library on startup, since that's the only platform-specific component. You could maybe volume in the .so and use the Makefile(?) to do this if it's not available. That might be a terrible idea and a huge abuse of how Docker is meant to work, but it could be worth a try? |
Can someone list the possible CPU flags / GPU combinations? |
|
I'm trying to see if potentially building wheels only for the main CPU features and the last 2 versions of CUDA is viable. So the CPU features would be:
And then for GPU we have Metal, and cuBLAS for each CUDA version? |
Optional CPU hardware acceleration features are:
PyTorch avoids messy CUDA dependencies by distributing its own CUDA libraries with the package. |
We could potentially have the following base wheels precompiled:
For each one of those, support the N latest versions of CUDA.
|
11.8: cudatoolkit-feedstock |
That's a good start:
^ Those combinations multiplied by each supported Python version, py3.7->3.11, although 3.7 can probably be removed since it went EOL this summer.
@abetlen That would be py3.8->3.11 (4 versions) * 5 possible combinations = 20 wheels. |
I have an old Ryzen CPU that doesn't support AVX512. I am using CUDA 12.2 on a new Ubuntu install as I have a recent GPU and don't like to upgrade if I can avoid it. Ideally, you need to provide wheels for: py3.8->3.11 (5) * CUDA (2) * x86 (3) = 60 wheels. |
You are right, it's more, since you want both CUDA versions for each CPU option.
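To make the combinatorics concrete, here is a small illustration; the Python, CPU-feature and CUDA lists are placeholders, so the total will not match the figures quoted above:

# Illustration of how quickly the wheel matrix multiplies out.
from itertools import product

python_versions = ["3.8", "3.9", "3.10", "3.11"]
cpu_variants = ["basic", "avx", "avx2", "avx512"]
accelerators = ["cpu", "cu117", "cu118"]  # CPU-only plus two CUDA versions

matrix = list(product(python_versions, cpu_variants, accelerators))
print(len(matrix), "wheels")  # 4 * 4 * 3 = 48

# Purely illustrative naming for each cell of the matrix:
for py, cpu, acc in matrix[:3]:
    print(f"llama_cpp_python-<version>+{cpu}.{acc}-cp{py.replace('.', '')}-<platform>.whl")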
Related #595 |
List:
|
This Bash script automatically generates the supported flags on Linux:
Maybe you want to name the wheels with a suffix similar to this, so that we can document the above script to automatically choose the best CPU wheel? EDIT: This only works on physical hardware. We've had a number of cases where virtualised CPUs report hardware extensions, but those extensions are disabled by the virtualisation. |
That shouldn't matter for building the wheels, right? It only matters at runtime whether a feature is enabled or not. |
Correct. I'm just thinking ahead on how to explain to people which wheel they should use. Otherwise we're going to get a lot of "Invalid instruction" issues. |
So I'm able to build the wheels using:
CMAKE_ARGS="-DLLAMA_AVX512=off -DLLAMA_AVX2=off" FORCE_CMAKE=1 python3 setup.py bdist_wheel
I'm not sure how to give them a specific name to associate them with features? The one on my VM comes out as:
Regardless of which feature flags I enable, the wheel filename comes out the same. |
I just realized that the |
Example script to detect and run specific wheels:
#!/bin/sh
cpuinfo="$(cat /proc/cpuinfo)"
if [ $(echo "$cpuinfo" | grep -c avx512) -gt 0 ]; then
./llama_avx512 "$@"
elif [ $(echo "$cpuinfo" | grep -c avx2) -gt 0 ]; then
./llama_avx2 "$@"
else
./llama_avx "$@"
fi |
It looks like you can't override the generated wheel name.
This bash script looks to build the different wheel variants. |
I've prepended an |
Found this project: https://github.com/jllllll/llama-cpp-python-cuBLAS-wheels
Maybe @jllllll can shed some light on this. |
The large number of meaningful build configurations makes building pre-compiled wheels for cuBLAS pretty unwieldy. This is the workflow that I am currently using: it builds 240 wheels + 40 CPU-only wheels for the various CPU instruction set, CUDA version and Python version combinations.

The sheer number of wheels being built for every version means that it is simply impractical for people to search through releases for a wheel they want to use. This allows for much simpler installation by modifying the URL according to the configuration you wish to install, instead of searching through thousands of wheels for the one you want. It also allows for significantly easier manual downloading by simply clicking the options you want in your browser.

The biggest downside that I've encountered with this is that the GitHub Actions API key assigned to the runner has a limited number of API requests that it can make. This means that many wheels can't be uploaded automatically to a release and need to be manually uploaded after the workflow is finished. I have written the workflow to account for this by uploading wheels that fail to upload to a release as build artifacts instead. |
This is the workflow for building CPU-only wheels:
This workflow names each wheel according to the build by using this:
|
@jllllll Awesome work! I noticed you still have py3.7 support, which is EOL. Removing that one should reduce the number of wheels. Not sure if it's significant, but anything helps. |
Support for py3.7 was really only added because I had room for it in the workflow and figured I might as well since there are 3.7 wheels being uploaded to this repo and I wanted to maintain support parity. If an additional configuration is needed in the future, 3.7 wheels will be the first to go. |
That makes total sense. Amazing work! I think the only thing missing is some kind of wrapper to auto-detect which wheel is the right one to use. This would help other projects that are based on llama-cpp-python. |
|
You can take the approach that ctransformers took: https://github.com/marella/ctransformers/blob/main/ctransformers/lib.py
They package libs built with different configurations together and use a runtime check to select between them. Due to how they build the wheel, it doesn't get associated with a particular Python minor version; instead, it just works with all Python 3 versions. If fully renaming the package is what you want, it is certainly doable. This is a workflow that does just that: it uses a fixed name, but can be relatively easily adapted to use a dynamically generated name. As far as a wrapper for installing wheels is concerned, |
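A minimal sketch of that runtime-selection idea (the directory layout, variant names and library filename are hypothetical, not ctransformers' or llama-cpp-python's actual code):

import ctypes
import platform
from pathlib import Path

LIB_DIR = Path(__file__).parent / "lib"  # hypothetical folder of bundled builds

def _cpu_flags() -> set:
    try:  # Linux only; other platforms would need their own detection
        with open("/proc/cpuinfo") as f:
            for line in f:
                if line.startswith("flags"):
                    return set(line.split(":", 1)[1].split())
    except OSError:
        pass
    return set()

def load_llama_library() -> ctypes.CDLL:
    # Pick the most capable build the host CPU can actually run.
    suffix = {"Linux": ".so", "Darwin": ".dylib", "Windows": ".dll"}[platform.system()]
    flags = _cpu_flags()
    variant = "avx2" if "avx2" in flags else "avx" if "avx" in flags else "basic"
    return ctypes.CDLL(str(LIB_DIR / variant / ("libllama" + suffix)))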
@jllllll Thanks for the info, really appreciate it! 💪 |
@jllllll Random question, would it be possible to add arm64 wheels? |
I think it is technically possible, but it would require the use of qemu or docker and a lot of trial and error. GitHub is planning to add arm64 runners at some point, which would make this significantly easier. No clue when that will be though. |
@jllllll This is what I use for doing arm64 builds https://github.com/docker/setup-qemu-action |
I use your library in my UI:
https://github.com/nomic-ai/gpt4all-ui
But somehow, some users report that when they run my software and it tries to install your library, they get errors because pip tries to recompile the wheel.
Is there a way to precompile this for many platforms so that they won't need to build it themselves? Building requires them to have a build environment, which is not easy for noobs.
Thanks for this wonderful backend