
Precompiled wheels with CuBLAS activated #243

Closed
ParisNeo opened this issue May 19, 2023 · 49 comments
Labels: build, enhancement (New feature or request), hardware (Hardware specific issue)

Comments

@ParisNeo

I use your library in my UI:
https://github.com/nomic-ai/gpt4all-ui

But somehow, some users report that when they run my software and it tries to install your library, they get errors because pip tries to recompile the wheel.

Is there a way to precompile this for many platforms so that they won't need to build it themselves? Building requires a build environment, which is not easy for beginners to set up.

Thanks for this wonderful backend

@gjmulder added the build and hardware (Hardware specific issue) labels May 19, 2023
@gjmulder
Contributor

gjmulder commented May 19, 2023

As per @jmtatsch's reply to my idea of pushing pre-compiled Docker images to Docker hub, providing precompiled wheels is likely equally problematic due to:

  1. llama.cpp interrogates the hardware it is being compiled on and then aggressively optimises its compiled code for that specific hardware (e.g. ARM64 or x86_64, and within x86_64 it may or may not use the F16C, AVX, AVX2 and/or AVX512 Intel hardware acceleration extensions).
  2. If you pre-compile against CUDA, the exact same CUDA version needs to be pre-installed, unless you build a static binary that includes all the CUDA libraries. Even then you might get NVidia GPU driver compatibility issues, which I suspect may be the problem in issue #234 (CUDA Forward Compatibility on non supported HW).

So you'd potentially need to pre-compile for every possible combination of hardware and CUDA version. Assuming (conservatively) there are, say, 32 (ARM64 + Intel) hardware combinations and 11 possible major CUDA versions, that's 352 wheels. Then you need to somehow educate the user on how to choose the single best wheel for their environment. 🤦

You could provide a static wheel with all AVX* disabled and no CUDA/cuBLAS support, using OpenBLAS instead, but then you'd get people complaining about how much faster llama.cpp is 🤷‍♂️

@gjmulder added the enhancement (New feature or request) label May 19, 2023
@ParisNeo
Author

Not an easy task I confess.

@gjmulder
Contributor

Not an easy task I confess.

@ParisNeo on Discord trying to decide what all the moving parts are:

gpt4all-webui

@AlphaAtlas

AlphaAtlas commented May 19, 2023

AVX2+FMA and OpenCL compatibility is a pretty good assumption (going back to Ryzen 1000/Intel 4000), and it would be reasonably fast on most hardware. Users on older hardware or other ISAs can build for themselves, and you could print a warning in the console if the device supports higher feature sets.

CUDA is a problem though.

@gjmulder
Contributor

AVX2+FMA and OpenCL compatibility is a pretty good assumption (going back to Ryzen 1000/Intel 4000), and it would be reasonably fast on most hardware

It would add another dependency, but you could sanity-check the AVX or FMA capabilities of the host before loading the model:

import cpuinfo  # the py-cpuinfo package

def check_instruction_set(instruction_set):
    info = cpuinfo.get_cpu_info()
    if 'flags' in info:
        if instruction_set in info['flags']:
            print(f"{instruction_set} is supported")
        else:
            print(f"{instruction_set} is not supported")
    else:
        print("Unable to retrieve CPU information")

# Flag names as py-cpuinfo reports them: the AVX512 foundation shows up as
# 'avx512f' and FMA3 as plain 'fma'; 'fma4' only exists on some older AMD CPUs.
instruction_sets = ['avx', 'avx2', 'avx512f', 'f16c', 'fma', 'fma4']

for instruction_set in instruction_sets:
    check_instruction_set(instruction_set)

@raymond-infinitecode

Could you at least support a default OpenBLAS precompiled wheel?

@gjmulder
Contributor

Could you at least support a default OpenBLAS precompiled wheel?

@raymond-infinitecode do you have F16C, AVX, AVX2, AVX512, and/or FMA support on your CPU?

@raymond-infinitecode

raymond-infinitecode commented May 24, 2023

Hi, yes I have all of them except AVX512. Is that enough for a performance boost without OpenBLAS?

@gjmulder
Contributor

"Enough" also depends on the size of your CPU L1/L2/L3 caches, main memory I/O speed, the size of your model, etc.

In any case @AlphaAtlas's proposal would mean not supporting the AVX2 extensions on your CPU, so it would run slower than if you compiled from source.

You're still interested in convenience over performance?

@raymond-infinitecode

I am all out for performance. If we manage to borrow a 32-core Xeon, how well would the Python binding utilize it?

@gjmulder
Contributor

Horribly. After 4 physical cores each additional core provides diminishing returns. 8 cores is about optimal.

That 32-core Xeon likely has AVX512, which means you would get a nice performance bump even if it really can't utilise much more than 8-12 cores.
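
For what it's worth, a default thread count for the Python binding could be derived along these lines (a minimal sketch; the cap of 8 just encodes the rule of thumb above, and the result is something you could pass as n_threads):

import os

def default_n_threads(cap=8):
    # Use the cores this process may actually run on (respects taskset/cgroups
    # on Linux); fall back to the logical core count elsewhere.
    try:
        available = len(os.sched_getaffinity(0))
    except AttributeError:
        available = os.cpu_count() or 4
    # Past ~8 cores the returns diminish quickly, so cap the default.
    return max(1, min(cap, available))

print(default_n_threads())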

xaptronic pushed a commit to xaptronic/llama-cpp-python that referenced this issue Jun 13, 2023
The readme tells people to use the command line option "-t 8", causing 8
threads to be started. On systems with fewer than 8 cores, this causes a
significant slowdown. Remove the option from the example command lines
and use /proc/cpuinfo on Linux to determine a sensible default.
xaptronic pushed a commit to xaptronic/llama-cpp-python that referenced this issue Jun 13, 2023
@emmatyping

Hi, I am interested in helping tackle this problem.

I think taking inspiration from pytorch might be the best idea:
https://pytorch.org/get-started/locally/

I see this as two problems: generating all the required wheels, and how to get them into users hands with the least friction.

For generating wheels, I think macOS/Windows wheels are easiest to build in CI. For Linux, building wheels in Docker containers each release is pretty straightforward, and I am happy to donate compute to do that. I also have a colocated server with a gigabit connection that could be used to host all of the wheels needed. I already host an apt repo (https://apt.cli.rs/), so this isn't my first time doing something like this. So basically, I'm happy to host all the wheels; I can probably store at least a TB of them (back-of-the-napkin math says each release will be 200-300 MB to store).

As for how to make it easy for the user:
Pytorch asks for your CUDA version, so I think it is reasonable to ask users for that info.

For getting the correct architecture on x86 we can do something like this: https://github.com/flababah/cpuid.py (writing instructions to rwx memory is basically what JITs do, so this isn't actually that cursed). Then we have a zero-dependency script users can run which automatically detects the architecture and OS, which gets us most of the way.
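
As a rough illustration of what such a zero-dependency check could look like on Linux (a minimal stdlib-only sketch; Windows/macOS would need the cpuid.py or py-cpuinfo route, and the flag names below are the /proc/cpuinfo ones):

import platform

def detect_platform():
    # Best-effort OS / architecture / x86 feature detection with no third-party deps.
    os_name = platform.system()        # 'Linux', 'Darwin', 'Windows'
    arch = platform.machine().lower()  # 'x86_64', 'aarch64', 'arm64', ...
    flags = set()
    if os_name == "Linux" and arch in ("x86_64", "amd64"):
        try:
            with open("/proc/cpuinfo") as f:
                for line in f:
                    if line.startswith("flags"):
                        flags = set(line.split(":", 1)[1].split())
                        break
        except OSError:
            pass
    return os_name, arch, flags

os_name, arch, flags = detect_platform()
print(os_name, arch, sorted(flags & {"avx", "avx2", "avx512f", "f16c", "fma"}))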

For ARM, unless I'm mistaken (and I could be, development is quick these days) GGML only supports NEON, and I think defaulting to compiling with NEON makes sense? There are few platforms without it these days.

Metal support can be automatically chosen based on OS, so that's easy to choose. In fact there's pretty much only a few choices for macOS: ARM Neon or x86 with AVX2 (Metal can be included by default for both).

We also can ask about whether the user wants BLAS (with a short snippet about why they might want it).

I'm happy to work on this but I'd appreciate some upstream buy-in. I am imagining a script in the top level of the repo that users run to download the correct wheel for their platform and requirements. Feedback/suggestions/etc. are also appreciated; I am not as knowledgeable about all the various methods of accelerating GGML, though I think I covered most of them.

@gjmulder
Contributor

gjmulder commented Jun 25, 2023

Happy to support you with smoke testing in this endeavor if it reduces the number of build-related bugs logged against llama-cpp-python, which are almost certainly mostly due to upstream llama.cpp cmake issues and local environment issues.

Firstly, why not statically compile OpenBLAS into the non-GPU-accelerated wheels? That avoids any dynamic dependencies and probably wouldn't make them much larger:

llama.cpp$ ls -l ./main 
-rwxrwxr-x 1 user user 535416 Jun 25 08:59 ./main
llama.cpp$ git diff Makefile
diff --git a/Makefile b/Makefile
index 5dd676f..7ba5084 100644
--- a/Makefile
+++ b/Makefile
@@ -45,8 +45,8 @@ endif
 # -Ofast tends to produce faster code, but may not be available for some compilers.
 #OPT = -Ofast
 OPT = -O3
-CFLAGS   = -I.              $(OPT) -std=c11   -fPIC
-CXXFLAGS = -I. -I./examples $(OPT) -std=c++11 -fPIC
+CFLAGS   = -I.              $(OPT) -std=c11   -fPIC -static
+CXXFLAGS = -I. -I./examples $(OPT) -std=c++11 -fPIC -static
 LDFLAGS  =
 
 ifdef LLAMA_DEBUG
llama.cpp$ ldd ./main 
	not a dynamic executable

llama.cpp$ ls -l ./main 
-rwxrwxr-x 1 user user 30180528 Jun 25 09:04 ./main

EDIT: Oops, 60X larger. Maybe provide a statically compiled OpenBLAS that just supports AVX2 and earlier CPU optimizations?

Secondly, do you have all the different hardware architectures, or are you planning on cross compiling? We'd at least need a MacOS environment to smoke test any Apple Metal BLAS builds.

I have a mix of GPUs from a GTX 980Ti, GTX 1080Ti, RTX 2070 and RTX 3090Ti, although currently only the 1080Ti and 3090Ti are plugged into an Ubuntu 22.04.2 server with an older AMD Ryzen CPU. Varying the version of CUDA can be done through different CUDA instances using the Docker files supplied with llama-cpp-python.

@emmatyping

Firstly, why not statically compile OpenBLAS into the non-GPU accelerated wheels?

Doing this on Linux is easy enough. On Windows... it's a bit less easy, but doable.

Maybe provide a statically compiled OpenBLAS that just supports AVX2 and earlier CPU optimizations?

Yeah, limiting OpenBLAS to just AVX2 (maybe NEON?) could work.

Secondly, do you have all the different hardware architectures, or are you planning on cross compiling? We'd at least need a MacOS environment to smoke test any Apple Metal BLAS builds.

I can run smoke tests on all of the x86 CPU-only builds automatically. I don't currently have a GPU in my server, so help on smoke tests for that would be appreciated, I also don't have any Apple devices, so help testing that would be very useful.

I was planning on using GitHub Actions for macOS wheels, and doing the x86 builds in containers on Linux and maybe also Windows in VMs I can run. For x86, no cross-compiling should be needed; it's just passing the requisite flags to limit compilation to SSE/AVX/AVX2/etc.

For ARM, I have a device I could set up to compile, but tbh a VM may be faster, I'll have to benchmark it.

@gjmulder
Contributor

I can cover the CUDA testing on Ubuntu Linux using NVidia docker instances.

@emmatyping

Great! I think that covers most targets, minus macOS. I'm actually unsure if you can run Metal jobs in Github Actions, do you know if that's possible?

@gjmulder
Contributor

Any hardware-specific requirements such as GPUs aren't going to be supported, at a guess. The code will be running on virtualized x86_64 instances with virtual video drivers. That's why I was offering my CUDA APIs.

Once you have the wheels built it is relatively easy to add a Github action to install them on CUDA docker images and push them to Docker hub. @abetlen is already doing it here.

Then it is a matter of polling Docker hub for new CUDA llama-cpp-python images and smoke testing them on my kit. Not ideal, but at least that way we would discover any upstream llama.cpp breaking CUDA changes within a short while. There's been a few 😄

@ParisNeo
Author

Hi there. Is it possible to have a parameter to activate/deactivate CPU features?

@abetlen
Owner

abetlen commented Jun 26, 2023

@ParisNeo the llama.cpp library does not support doing that at runtime unfortunately :(

@gjmulder A horrible hack / shower thought I had was to have a Docker image that ships with a compiler, the Python source, and all dependencies installed, then just have a script in the image rebuild the shared library on startup, since that's the only platform-specific component. You could maybe volume in the .so and use the makefile(?) to do this if it's not available. That might be a terrible idea and a huge abuse of how Docker is meant to work, but could be worth a try?

@gjmulder changed the title from "A precompiled wheels with cublas activated" to "Precompiled wheels with CuBLAS activated" Jun 30, 2023
@gaby

gaby commented Aug 9, 2023

Can someone list the possible CPU flags / GPU combinations?

@gjmulder
Contributor

gjmulder commented Aug 9, 2023

llama.cpp$ sed -n 's/.*\(LLAMA_[^ )]*\).*/\1/p' CMakeLists.txt | tr -d '"}' | sort -u
LLAMA_ACCELERATE
LLAMA_ALL_WARNINGS
LLAMA_ALL_WARNINGS_3RD_PARTY
LLAMA_AVX
LLAMA_AVX2
LLAMA_AVX512
LLAMA_AVX512_VBMI
LLAMA_AVX512_VNNI
LLAMA_BLAS
LLAMA_BLAS_VENDOR
LLAMA_BUILD
LLAMA_BUILD_EXAMPLES
LLAMA_BUILD_SERVER
LLAMA_BUILD_TESTS
LLAMA_CLBLAST
LLAMA_CUBLAS
LLAMA_CUDA_DMMV_F16
LLAMA_CUDA_DMMV_X
LLAMA_CUDA_DMMV_Y
LLAMA_CUDA_KQUANTS_ITER
LLAMA_EXTRA_INCLUDES
LLAMA_EXTRA_LIBS
LLAMA_F16C
LLAMA_FMA
LLAMA_GPROF
LLAMA_K_QUANTS
LLAMA_LTO
LLAMA_METAL
LLAMA_NATIVE
LLAMA_SANITIZE_ADDRESS
LLAMA_SANITIZE_THREAD
LLAMA_SANITIZE_UNDEFINED
LLAMA_STANDALONE
LLAMA_STATIC
LLAMA_WASM_SINGLE_FILE

@gaby

gaby commented Aug 9, 2023

I'm trying to see if building wheels for just the main CPU features and the last 2 versions of CUDA is potentially viable.

So the CPU features would be: F16C, AVX, AVX2, AVX512, right?

And then for GPU we have METAL, and CUBLAS for each CUDA version?

@gjmulder
Contributor

gjmulder commented Aug 10, 2023

Optional CPU hardware acceleration features are: FMA, F16C, AVX, AVX2, AVX512. Almost all CPUs will have FMA, F16C and AVX, many will have AVX2, and a few will have AVX512. Don't forget LLAMA_BLAS for CPU-only.

Pytorch avoids messy CUDA deps by distributing its own CUDA with the package.

@gaby

gaby commented Aug 10, 2023

We could potentially have the following base wheels precompiled:

  • FMA, F16C, AVX
  • FMA, F16C, AVX, AVX2
  • FMA, F16C, AVX, AVX2, AVX512

For each one of those, support the N latest versions of CUDA.

@gjmulder
Contributor

gjmulder commented Aug 10, 2023

We could potentially have the following base wheels precompiled:

  • FMA, F16C, AVX
  • FMA, F16C, AVX, AVX2
  • FMA, F16C, AVX, AVX2, AVX512

For each one of those, support the N latest versions of CUDA.

11.8 & 12.2

@gjmulder
Contributor

11.8: cudatoolkit-feedstock

@gaby

gaby commented Aug 10, 2023

We could potentially have the following base wheels precompiled:

  • FMA, F16C, AVX
  • FMA, F16C, AVX, AVX2
  • FMA, F16C, AVX, AVX2, AVX512

For each one of those, support the N latest versions of CUDA.

11.8 & 12.2

That's a good start:

  • FMA, F16C, AVX
  • FMA, F16C, AVX, AVX2
  • FMA, F16C, AVX, AVX2, AVX512
  • FMA, F16C, AVX, AVX2, AVX512, CUDA11.8
  • FMA, F16C, AVX, AVX2, AVX512, CUDA12.2

^ Those combinations multiplied by each supported Python version (py3.7->3.11), although 3.7 can probably be removed since it went EOL this summer @abetlen

That would be py3.8->3.11 (4 versions) * 5 possible combinations = 20 wheels, or 25 if py3.7 is kept.

@gjmulder
Contributor

  • FMA, F16C, AVX, AVX2, AVX512, CUDA12.2

I have an old Ryzen CPU that doesn't support AVX512. I am using CUDA 12.2 on a new Ubuntu install as I have a recent GPU and don't like to upgrade if I can avoid it.

Ideally, you need to provide wheels for:

py3.8->3.11 (5) * CUDA (2) * x86 (3) = 60 wheels.

@gaby

gaby commented Aug 10, 2023

You are right, it's more, since you want both CUDA versions for each CPU option.

@gaby

gaby commented Aug 10, 2023

Related #595

@gaby

gaby commented Aug 10, 2023

  • FMA, F16C, AVX, AVX2, AVX512, CUDA12.2

I have an old Ryzen CPU that doesn't support AVX512. I am using CUDA 12.2 on a new Ubuntu install as I have a recent GPU and don't like to upgrade if I can avoid it.

Ideally, you need to provide wheels for:

py3.8->3.11 (5) * CUDA (2) * x86 (3) = 60 wheels.

List:

  • FMA, F16C, AVX
  • FMA, F16C, AVX, CUDA11.8
  • FMA, F16C, AVX, CUDA12.2
  • FMA, F16C, AVX, AVX2
  • FMA, F16C, AVX, AVX2, CUDA11.8
  • FMA, F16C, AVX, AVX2, CUDA12.2
  • FMA, F16C, AVX, AVX2, AVX512
  • FMA, F16C, AVX, AVX2, AVX512, CUDA11.8
  • FMA, F16C, AVX, AVX2, AVX512, CUDA12.2

@gjmulder
Contributor

gjmulder commented Aug 10, 2023

This Bash script automatically generates the supported flags on Linux:

$ grep -oE 'avx|avx2|avx512|f16c|fma' /proc/cpuinfo | sort -ur | tr "\n" "_"
fma_f16c_avx2_avx_

Maybe you want to name the wheels with a similar suffix so that we can document the above command as a way to automatically choose the best CPU wheel?

EDIT: This only works on physical hardware. We've had a number of cases where virtualised CPUs report hardware extensions but those extensions are disabled by the virtualisation.

@gaby

gaby commented Aug 11, 2023

This Bash script automatically generates the supported flags on Linux:

$ grep -oE 'avx|avx2|avx512|f16c|fma' /proc/cpuinfo | sort -ur | tr "\n" "_"
fma_f16c_avx2_avx_

Maybe you want to name the wheels with a similar suffix so that we can document the above command as a way to automatically choose the best CPU wheel?

EDIT: This only works on physical hardware. We've had a number of cases where virtualised CPUs report hardware extensions but those extensions are disabled by the virtualisation.

That shouldn't matter for building the wheels, right? It only matters at runtime whether a feature is enabled or not.

@gjmulder
Contributor

That shouldn't matter for building the wheels, right? It only matters at runtime if a feature is enabled or not

Correct. I'm just thinking ahead on how to explain to people which wheel they should use. Otherwise we're going to get a lot of "Invalid instruction" issues.

@gaby

gaby commented Aug 14, 2023

So I'm able to build the wheels using:

CMAKE_ARGS="-DLLAMA_AVX512=off -DLLAMA_AVX2=off" FORCE_CMAKE=1 python3 setup.py bdist_wheel

I'm not sure how to give them a specific name to associate them with features. The one on my VM comes out as:

llama_cpp_python-0.1.77-cp310-cp310-linux_x86_64.whl

Regardless of which feature flags I enable

@gaby

gaby commented Aug 14, 2023

I just realized that the llama.cpp team is already doing this for Windows. They are compiling llama.cpp using GitHub Actions here: https://github.com/ggerganov/llama.cpp/blob/master/.github/workflows/build.yml#L197

@gaby

gaby commented Aug 14, 2023

Example script to detect and run specific wheels:

#!/bin/sh

cpuinfo="$(cat /proc/cpuinfo)"
if [ $(echo "$cpuinfo" | grep -c avx512) -gt 0 ]; then
	./llama_avx512 "$@"
elif [ $(echo "$cpuinfo" | grep -c avx2) -gt 0 ]; then
	./llama_avx2 "$@"
else
	./llama_avx "$@"
fi

Source: ggerganov/llama.cpp#537 (comment)

@gjmulder
Contributor

gjmulder commented Aug 14, 2023

I'm not sure how to give them a specific name to associate them with features? The one on my VM comes out as:

llama_cpp_python-0.1.77-cp310-cp310-linux_x86_64.whl

Regardless of which feature flags I enable

It looks like you can't override setup.py's name etc:

llama-cpp-python$ grep name setup.py 
    name="llama_cpp_python",

The bash script below builds the CMake flags and patches setup.py accordingly. It should be relatively easy to extend it to support CUDA:

#!/bin/bash

# Flags combinations
FLAGS_COMBINATIONS=(
  "FMA,F16C,AVX"
  "FMA,F16C,AVX,AVX2"
  "FMA,F16C,AVX,AVX2,AVX512"
)

cp -v setup.py setup.py.bak

# For each flags combination
for FLAGS in "${FLAGS_COMBINATIONS[@]}"; do
  # Start with all flags turned off
  CMAKE_ARGS="-DLLAMA_AVX512=off -DLLAMA_AVX2=off -DLLAMA_AVX=off -DLLAMA_F16C=off -DLLAMA_FMA=off"
  
  # Construct the suffix for the package name
  PACKAGE_SUFFIX=""
  IFS=',' read -ra ADDR <<< "$FLAGS"
  for FLAG in "${ADDR[@]}"; do
    FLAG_NAME="LLAMA_$FLAG"
    CMAKE_ARGS=$(echo $CMAKE_ARGS | sed "s/-D$FLAG_NAME=off/-D$FLAG_NAME=on/")
    
    # Append the flag to the package suffix in lowercase
    PACKAGE_SUFFIX="${PACKAGE_SUFFIX}_$(echo $FLAG | tr '[:upper:]' '[:lower:]')"
  done
  
  # Patch setup.py with the new name
  sed -i "s/name=\"llama_cpp_python\"/name=\"llama_cpp_python$PACKAGE_SUFFIX\"/" setup.py
  grep name setup.py
  
  # Execute the desired command with the constructed CMake arguments
  echo eval "CMAKE_ARGS=\"$CMAKE_ARGS\" FORCE_CMAKE=1 python3 setup.py bdist_wheel"

  # Restore the original setup.py for the next iteration
  cp -v setup.py.bak setup.py
done

I've prepended an echo to the build line so you can do a dry run first. Remove the echo to actually do the builds.

@gaby

gaby commented Aug 27, 2023

Found this project https://github.com/jllllll/llama-cpp-python-cuBLAS-wheels

Maybe @jllllll can shed some light on this.

@jllllll

jllllll commented Aug 27, 2023

The large number of meaningful build configurations makes building pre-compiled wheels for cuBLAS pretty unwieldy.

This is the workflow that I am currently using:
https://github.com/jllllll/llama-cpp-python-cuBLAS-wheels/blob/main/.github/workflows/build-wheels.yml

It builds 240 wheels + 40 CPU-only wheels for the various CPU instruction, CUDA version and Python version combinations. The sheer number of wheels being built for every version means that it is simply impractical for people to search through releases for a wheel they want to use.
To solve this issue, I am uploading wheels to grouped releases, rather than version-specific releases:
https://github.com/jllllll/llama-cpp-python-cuBLAS-wheels/releases
I then created a pip package index through GitHub Pages to allow pip to parse through the wheels based on the index URL given:
https://jllllll.github.io/llama-cpp-python-cuBLAS-wheels

This allows for much simpler installation by modifying the URL according to the configuration you wish to install, instead of searching through thousands of wheels for the one you want. It also allows for significantly easier manual downloading by simply clicking the options you want in your browser.

The biggest downside that I've encountered with this is that the GitHub Actions API key assigned to the runner has a limited number of API requests that it can make. This means that many wheels can't be uploaded automatically to a release and need to be manually uploaded after the workflow is finished. I have written the workflow to account for this by uploading wheels that fail to upload to a release as build artifacts instead.

@jllllll

jllllll commented Aug 27, 2023

This is the workflow for building CPU-only wheels:
https://github.com/jllllll/llama-cpp-python-cuBLAS-wheels/blob/main/.github/workflows/build-wheels-cpu.yml

This workflow names each wheel according to the build by using this:

python -m build --wheel -C--build-option=egg_info "-C--build-option=--tag-build=+cpu$env:AVXVER"
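
If I understand the egg_info --tag-build trick correctly, that tag ends up appended to the version as a PEP 440 local version label, so with AVXVER set to AVX2 the resulting wheel would be named something like llama_cpp_python-0.1.78+cpuavx2-cp310-cp310-win_amd64.whl (the version, Python tag and platform tag here are only illustrative).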

@gaby

gaby commented Aug 27, 2023

@jllllll Awesome work, I noticed you still had py3.7 support which is EOL. Removing that one should reduce the # of wheels. Not sure if significant, but anything helps.

@jllllll

jllllll commented Aug 27, 2023

Support for py3.7 was really only added because I had room for it in the workflow and figured I might as well since there are 3.7 wheels being uploaded to this repo and I wanted to maintain support parity.

If an additional configuration is needed in the future, 3.7 wheels will be the first to go.

@gaby

gaby commented Aug 27, 2023

Support for py3.7 was really only added because I had room for it in the workflow and figured I might as well since there are 3.7 wheels being uploaded to this repo and I wanted to maintain support parity.

If an additional configuration is needed in the future, 3.7 wheels will be the first to go.

That makes total sense. Amazing work! I think the only thing missing is some kind of wrapper to auto-detect which wheel is the right one to use. This would help other projects based on llama-cpp-python integrate this wrapper into their tooling.

  • Finding the Python version
  • Finding whether the CPU supports basic, AVX, or AVX512
  • Whether to use CUDA or not

@jllllll

jllllll commented Aug 27, 2023

You can take the approach that ctransformers took: https://github.com/marella/ctransformers/blob/main/ctransformers/lib.py

They package together libs of different configurations and use the py-cpuinfo package to determine which to load.

Due to how they build the wheel, it doesn't get associated with a particular Python minor version. Instead, it just works with all Python 3 versions.
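
Roughly, that load-time selection amounts to something like this (a minimal sketch, not ctransformers' actual code; the bundled library filenames are hypothetical):

import ctypes
from pathlib import Path

import cpuinfo  # py-cpuinfo

def load_best_lib(lib_dir):
    # Pick the most capable bundled shared library the current CPU supports.
    # The filenames (libllama_avx512.so etc.) are placeholders.
    flags = set(cpuinfo.get_cpu_info().get("flags", []))
    if "avx512f" in flags:
        name = "libllama_avx512.so"
    elif "avx2" in flags:
        name = "libllama_avx2.so"
    elif "avx" in flags:
        name = "libllama_avx.so"
    else:
        name = "libllama_basic.so"
    return ctypes.CDLL(str(Path(lib_dir) / name))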

If fully renaming the package is what you want, it is certainly doable. This is a workflow that does just that:
https://github.com/jllllll/llama-cpp-python-cuBLAS-wheels/blob/main/.github/workflows/build-wheels-ggml.yml

It uses a fixed name, but can be relatively easily adapted to use a dynamically generated name.
These lines are where the names are replaced: 81 86 87 90

As far as a wrapper for installing wheels is concerned, py-cpuinfo will work for finding the supported instructions. Detection of CUDA is a little more complicated without Pytorch, but shouldn't be too hard. You don't really need to worry about finding the Python version, as pip can handle that on its own.
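
For the CUDA part, one low-tech heuristic is to just look for the NVIDIA driver bits on the system, e.g. (a rough sketch; it only tells you CUDA is plausibly present, not which version, so picking between cu118 and cu122 wheels would still mean parsing nvidia-smi output or simply asking the user, PyTorch-style):

import ctypes.util
import shutil

def cuda_available():
    # A usable NVIDIA driver normally ships nvidia-smi and libcuda.
    if shutil.which("nvidia-smi"):
        return True
    return ctypes.util.find_library("cuda") is not None

print(cuda_available())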

@gaby

gaby commented Aug 27, 2023

@jllllll Thanks for the info, really appreciate it! 💪

@gaby

gaby commented Sep 5, 2023

@jllllll Random question, would it be possible to add arm64 wheels?

@jllllll

jllllll commented Sep 5, 2023

I think it is technically possible, but it would require the use of qemu or docker and a lot of trial and error.
The OS images + CUDA Toolkit will take up a fair amount of disk space, so a lot of manual file deletions would be needed to make room for it all. I don't have any experience with using qemu in GitHub Actions, so this would take quite a bit of time for me to figure out.

GitHub is planning to add arm64 runners at some point, which would make this significantly easier. No clue when that will be though.

@gaby

gaby commented Sep 5, 2023

@jllllll This is what I use for doing arm64 builds https://github.com/docker/setup-qemu-action

@abetlen mentioned this issue Mar 3, 2024
@abetlen closed this as completed Apr 6, 2024