Precompiled wheels with CuBLAS activated #243
As per @jmtatsch's reply to my idea of pushing pre-compiled Docker images to Docker hub, providing precompiled wheels is likely equally problematic due to:
So you'd potentially need to pre-compile for every possible combination of hardware and CUDA version. Assuming (conservatively) there are, say, 32 (ARM64 + Intel) hardware combinations and 11 possible major CUDA versions, that's 352 wheels. Then you need to somehow educate the user on how to choose the single best wheel for their environment. 🤦 You could provide a static wheel with all AVX* disabled and no CUDA/cuBLAS support, using OpenBLAS instead, but then you'd get people complaining about how much faster a natively compiled build is. |
Not an easy task I confess. |
@ParisNeo on Discord trying to decide what all the moving parts are: |
AVX2+FMA and OpenCL compatibility is a pretty good assumption (going back to Ryzen 1000/Intel 4000), and it would be reasonably fast on most hardware. Users on older hardware or other ISAs can build for themselves, and you could print a warning in the console if the device supports higher feature sets. CUDA is a problem though. |
It would add another dependency, but could sanity check the AVX or FMA capabilities of the host before loading the model:
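For example, something along these lines (a rough sketch that assumes the optional py-cpuinfo package; the required flag set would be whatever the wheel was actually compiled with):

# Sketch: refuse to load a wheel built with AVX2/FMA on a host that lacks them.
# Assumes the third-party py-cpuinfo package (`pip install py-cpuinfo`).
from cpuinfo import get_cpu_info

REQUIRED_FLAGS = {"avx2", "fma"}  # assumption: what the precompiled wheel was built with

def check_host_supports_wheel() -> None:
    flags = set(get_cpu_info().get("flags", []))
    missing = REQUIRED_FLAGS - flags
    if missing:
        raise RuntimeError(
            f"This wheel was compiled with {', '.join(sorted(REQUIRED_FLAGS))} "
            f"but this CPU is missing: {', '.join(sorted(missing))}. "
            "Please build llama-cpp-python from source instead."
        )

if __name__ == "__main__":
    check_host_supports_wheel()
    print("CPU supports the required instruction sets.")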
|
Could you please at least provide a precompiled wheel with default OpenBLAS support? |
@raymond-infinitecode do you have F16C, AVX, AVX2, AVX512, and/or FMA support on your CPU? |
Hi, yes I have all of them except AVX512. Is that enough for a performance boost without OpenBLAS? |
"Enough" also depends on the size of your CPU L1/L2/L3 caches, main memory I/O speed, the size of your model, etc. In any case @AlphaAtlas's proposal would mean not supporting the AVX2 extensions on your CPU, so it would run slower than it could compared to if you compiled from source. You're still interested in convenience over performance? |
I am all for performance. If we manage to borrow a 32-core Xeon, how well would the Python binding utilise it? |
Horribly. After 4 physical cores, each additional core provides diminishing returns; 8 cores is about optimal. That 32-core Xeon likely has AVX512, which means you would get a nice performance bump even though it really can't utilise much over 8-12 cores. |
The readme tells people to use the command line option "-t 8", causing 8 threads to be started. On systems with fewer than 8 cores, this causes a significant slowdown. Remove the option from the example command lines and use /proc/cpuinfo on Linux to determine a sensible default.
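A sensible default could be computed in the binding itself rather than documented as a fixed "-t 8". A rough sketch of that idea (the cap of 8 and the /proc/cpuinfo parsing are assumptions; non-Linux hosts fall back to os.cpu_count()):

import os

def default_thread_count(cap: int = 8) -> int:
    """Pick a default thread count: physical cores from /proc/cpuinfo, capped."""
    physical_cores = set()
    try:
        phys_id = "0"
        with open("/proc/cpuinfo") as f:
            for line in f:
                if line.startswith("physical id"):
                    phys_id = line.split(":", 1)[1].strip()
                elif line.startswith("core id"):
                    # unique (socket, core) pairs = physical cores
                    physical_cores.add((phys_id, line.split(":", 1)[1].strip()))
    except OSError:
        pass  # not Linux, or /proc is unavailable
    n = len(physical_cores) or os.cpu_count() or 1
    return max(1, min(cap, n))

if __name__ == "__main__":
    print(default_thread_count())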
Hi, I am interested in helping tackle this problem. I think taking inspiration from PyTorch might be the best idea. I see this as two problems: generating all the required wheels, and getting them into users' hands with the least friction.

For generating wheels, I think macOS/Windows wheels are easiest to build in CI. For Linux, building wheels in Docker containers each release is pretty straightforward, and I am happy to donate compute to do that. I also have a colocated server with a gigabit connection that could be used to host all of the wheels. I already host an apt repo (https://apt.cli.rs/), so this isn't my first time doing something like this. So basically, I'm happy to host all the wheels; I can probably store at least a TB of them (back-of-the-napkin math says each release will be 200-300 MB to store).

As for how to make it easy for the user: for getting the correct architecture on x86 we can do something like https://github.com/flababah/cpuid.py (writing instructions to rwx memory is basically what JITs do, so this isn't actually that cursed). Then we have a zero-dependency script users can run which automatically determines architecture and OS, which gets us most of the way (see the sketch below). For ARM, unless I'm mistaken (and I could be, development is quick these days), GGML only supports NEON, and I think defaulting to compiling with NEON makes sense; there are few platforms without it these days. Metal support can be chosen automatically based on OS, so that's easy. In fact there are pretty much only a few choices for macOS: ARM NEON or x86 with AVX2 (Metal can be included by default for both). We can also ask whether the user wants BLAS (with a short snippet about why they might want it).

I'm happy to work on this but I'd appreciate some upstream buy-in. I am imagining a script in the top level of the repo that users run to download the correct wheel for their platform and requirements. Feedback/suggestions/etc. are appreciated; I am not as knowledgeable about all the various methods of accelerating GGML, but I think I covered most of them. |
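As a starting point, the "zero-dependency script" could look something like this sketch (standard library only; the variant names and flag-to-variant mapping are illustrative, not an agreed naming scheme, and the flag check only covers Linux's /proc/cpuinfo):

import platform
import sys

def linux_cpu_flags() -> set:
    try:
        with open("/proc/cpuinfo") as f:
            for line in f:
                if line.startswith("flags"):
                    return set(line.split(":", 1)[1].split())
    except OSError:
        pass
    return set()

def pick_variant() -> str:
    system = platform.system()            # 'Linux', 'Darwin', 'Windows'
    machine = platform.machine().lower()  # 'x86_64', 'aarch64', 'arm64', ...
    if machine in ("arm64", "aarch64"):
        return "metal" if system == "Darwin" else "neon"
    flags = linux_cpu_flags()
    if "avx512f" in flags:
        return "avx512"
    if "avx2" in flags:
        return "avx2"
    if "avx" in flags:
        return "avx"
    return "basic"

if __name__ == "__main__":
    print(f"{platform.system().lower()}-{platform.machine().lower()}-{pick_variant()}"
          f"-cp{sys.version_info.major}{sys.version_info.minor}")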
Happy to support you with smoke testing in this endeavor if it reduces the number of build-related bugs logged against the project. Firstly, why not statically compile OpenBLAS into the non-GPU-accelerated wheels? That avoids any dynamic dependencies and probably wouldn't make them much larger:
EDIT: Oops, 60X larger. Maybe provide a statically compiled OpenBLAS that just supports AVX2 and earlier CPU optimizations? Secondly, do you have all the different hardware architectures, or are you planning on cross-compiling? We'd at least need a macOS environment to smoke test any Apple Metal BLAS builds. I have a mix of GPUs: a GTX 980Ti, GTX 1080Ti, RTX 2070 and RTX 3090Ti, although currently only the 1080Ti and 3090Ti are plugged into an Ubuntu 22.04.2 server with an older AMD Ryzen CPU. Varying the version of CUDA can be done through different CUDA instances using the supplied Docker files. |
Doing this on Linux is easy enough. On Windows... it's a bit less easy, but doable.
Yeah, limiting OpenBLAS to just AVX2 (maybe NEON?) could work.
I can run smoke tests on all of the x86 CPU-only builds automatically. I don't currently have a GPU in my server, so help with smoke tests for that would be appreciated. I also don't have any Apple devices, so help testing that would be very useful. I was planning on using GitHub Actions for macOS wheels, doing the x86 builds in containers on Linux, and maybe also Windows in VMs I can run. For x86, no cross-compiling should be needed; it's just passing the requisite flags to limit compilation to SSE/AVX/AVX2/etc. For ARM, I have a device I could set up to compile, but tbh a VM may be faster, I'll have to benchmark it. |
I can cover the CUDA testing on Ubuntu Linux using NVidia docker instances. |
Great! I think that covers most targets, minus macOS. I'm actually unsure if you can run Metal jobs in GitHub Actions; do you know if that's possible? |
Any hardware-specific requirements such as GPUs aren't going to be supported, at a guess. The code will be running on virtualized x86_64 instances with virtual video drivers. That's why I was offering my CUDA APIs. Once you have the wheels built it is relatively easy to add a GitHub Action to install them on CUDA Docker images and push them to Docker Hub. @abetlen is already doing it here. Then it is a matter of polling Docker Hub for new CUDA llama-cpp-python images and smoke testing them on my kit. Not ideal, but at least that way we would discover any upstream breakage. |
Hi there. Is it possible to have a parameter to activate/deactivate CPU features? |
@ParisNeo the llama.cpp library does not support doing that at runtime, unfortunately :(
@gjmulder A horrible hack / shower thought I had was to have a Docker image that ships with a compiler, the Python source, and all dependencies installed, and then just have a script in the image that rebuilds the shared library on startup, since that's the only platform-specific component. You could maybe volume in the .so and use the Makefile(?) to do this if it's not available. That might be a terrible idea and a huge abuse of how Docker is meant to work, but it could be worth a try? |
Can someone list the possible CPU flags / GPU combinations? |
|
I'm trying to see if potentially building wheels only for the main CPU features and the last 2 versions of CUDA is viable. So the CPU features would be:
And then for GPU we have Metal, and cuBLAS for each CUDA version? |
Optional CPU hardware acceleration features are:
PyTorch avoids messy CUDA dependencies by distributing its own CUDA libraries with the package. |
We could potentially have the following base wheels precompiled:
For each one of those, support the N latest versions of CUDA.
|
11.8: cudatoolkit-feedstock |
That's a good start:
^ Those combinations multiplied by each supported Python version, py3.7->3.11, although 3.7 can probably be removed since it went EOL this summer.
@abetlen That would be py3.8->3.11 (4 versions) * 5 possible combinations = 20 wheels. |
I have an old Ryzen CPU that doesn't support AVX512. I am using CUDA 12.2 on a new Ubuntu install as I have a recent GPU and don't like to upgrade if I can avoid it. Ideally, you need to provide wheels for: py3.8->3.11 (5) * CUDA (2) * x86 (3) = 60 wheels. |
You are right, it's more, since you want both CUDA versions for each CPU option.
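To make the combinatorics concrete, here is a small illustration; the Python, CPU-feature and CUDA lists are placeholders, so the total will not match the figures quoted above:

# Illustration of how quickly the wheel matrix multiplies out.
from itertools import product

python_versions = ["3.8", "3.9", "3.10", "3.11"]
cpu_variants = ["basic", "avx", "avx2", "avx512"]
accelerators = ["cpu", "cu117", "cu118"]  # CPU-only plus two CUDA versions

matrix = list(product(python_versions, cpu_variants, accelerators))
print(len(matrix), "wheels")  # 4 * 4 * 3 = 48

# Purely illustrative naming for each cell of the matrix:
for py, cpu, acc in matrix[:3]:
    print(f"llama_cpp_python-<version>+{cpu}.{acc}-cp{py.replace('.', '')}-<platform>.whl")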
Related #595 |
List:
|
This Bash script automatically generates the supported flags on Linux:
Maybe you want to name the wheels with a suffix similar to this, so that we can document the above script to automatically choose the best CPU wheel? EDIT: This only works on physical hardware. We've had a number of cases where virtualised CPUs report hardware extensions, but those extensions are disabled by the virtualisation. |
That shouldn't matter for building the wheels, right? It only matters at runtime whether a feature is enabled or not. |
Correct. I'm just thinking ahead on how to explain to people which wheel they should use. Otherwise we're going to get a lot of "Invalid instruction" issues. |
So I'm able to build the wheels using:
CMAKE_ARGS="-DLLAMA_AVX512=off -DLLAMA_AVX2=off" FORCE_CMAKE=1 python3 setup.py bdist_wheel
I'm not sure how to give them a specific name to associate them with features? The one on my VM comes out as:
Regardless of which feature flags I enable, the wheel filename comes out the same. |
I just realized that the |
Example script to detect and run specific wheels:
#!/bin/sh
cpuinfo="$(cat /proc/cpuinfo)"
if [ $(echo "$cpuinfo" | grep -c avx512) -gt 0 ]; then
./llama_avx512 "$@"
elif [ $(echo "$cpuinfo" | grep -c avx2) -gt 0 ]; then
./llama_avx2 "$@"
else
./llama_avx "$@"
fi |
It looks like you can't override the generated wheel name.
This bash script looks to build the different wheel variants. |
I've prepended an |
Found this project: https://github.com/jllllll/llama-cpp-python-cuBLAS-wheels
Maybe @jllllll can shed some light on this. |
The large number of meaningful build configurations makes building pre-compiled wheels for cuBLAS pretty unwieldy. This is the workflow that I am currently using: it builds 240 wheels + 40 CPU-only wheels for the various CPU instruction set, CUDA version and Python version combinations.

The sheer number of wheels being built for every version means that it is simply impractical for people to search through releases for a wheel they want to use. This allows for much simpler installation by modifying the URL according to the configuration you wish to install, instead of searching through thousands of wheels for the one you want. It also allows for significantly easier manual downloading by simply clicking the options you want in your browser.

The biggest downside that I've encountered with this is that the GitHub Actions API key assigned to the runner has a limited number of API requests that it can make. This means that many wheels can't be uploaded automatically to a release and need to be manually uploaded after the workflow is finished. I have written the workflow to account for this by uploading wheels that fail to upload to a release as build artifacts instead. |
This is the workflow for building CPU-only wheels:
This workflow names each wheel according to the build by using this:
|
@jllllll Awesome work! I noticed you still have py3.7 support, which is EOL. Removing that one should reduce the number of wheels. Not sure if it's significant, but anything helps. |
Support for py3.7 was really only added because I had room for it in the workflow and figured I might as well since there are 3.7 wheels being uploaded to this repo and I wanted to maintain support parity. If an additional configuration is needed in the future, 3.7 wheels will be the first to go. |
That makes total sense. Amazing work! I think the only thing missing is some kind of wrapper to auto-detect which wheel is the right one to use. This would help other projects that are based on llama-cpp-python. |
|
You can take the approach that ctransformers took: https://github.com/marella/ctransformers/blob/main/ctransformers/lib.py
They package libs built with different configurations together and use a runtime check to select between them. Due to how they build the wheel, it doesn't get associated with a particular Python minor version; instead, it just works with all Python 3 versions. If fully renaming the package is what you want, it is certainly doable. This is a workflow that does just that: it uses a fixed name, but can be relatively easily adapted to use a dynamically generated name. As far as a wrapper for installing wheels is concerned, |
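A minimal sketch of that runtime-selection idea (the directory layout, variant names and library filename are hypothetical, not ctransformers' or llama-cpp-python's actual code):

import ctypes
import platform
from pathlib import Path

LIB_DIR = Path(__file__).parent / "lib"  # hypothetical folder of bundled builds

def _cpu_flags() -> set:
    try:  # Linux only; other platforms would need their own detection
        with open("/proc/cpuinfo") as f:
            for line in f:
                if line.startswith("flags"):
                    return set(line.split(":", 1)[1].split())
    except OSError:
        pass
    return set()

def load_llama_library() -> ctypes.CDLL:
    # Pick the most capable build the host CPU can actually run.
    suffix = {"Linux": ".so", "Darwin": ".dylib", "Windows": ".dll"}[platform.system()]
    flags = _cpu_flags()
    variant = "avx2" if "avx2" in flags else "avx" if "avx" in flags else "basic"
    return ctypes.CDLL(str(LIB_DIR / variant / ("libllama" + suffix)))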
@jllllll Thanks for the info, really appreciate it! 💪 |
@jllllll Random question, would it be possible to add arm64 wheels? |
I think it is technically possible, but it would require the use of qemu or docker and a lot of trial and error. GitHub is planning to add arm64 runners at some point, which would make this significantly easier. No clue when that will be though. |
@jllllll This is what I use for doing arm64 builds https://github.com/docker/setup-qemu-action |
I use your library in my UI:
https://github.com/nomic-ai/gpt4all-ui
But somehow, some users report that when they run my software and it tries to install your library, they get errors because pip tries to recompile the wheel.
Is there a way to precompile this for many platforms so that they won't need to build it themselves? Building requires them to have a build environment, which is not easy for noobs.
Thanks for this wonderful backend