
CUDA 12.5 support or GPU acceleration not working after graphics driver update #2394

Closed
CodeMazeSolver opened this issue May 24, 2024 · 29 comments · Fixed by #2994
Labels
bug (Something isn't working), unconfirmed

Comments

@CodeMazeSolver

Hey there,
I'm running

LocalAI version:

docker run --rm -ti --gpus all -p 8080:8080 -e DEBUG=true -v $PWD/models:/models --name local-ai localai/localai:latest-aio-gpu-nvidia-cuda-12 --models-path /models --context-size 1000 --threads 14

LocalAI version: v2.15.0 (f69de3b)

Environment, CPU architecture, OS, and Version:

13th Gen Intel(R) Core(TM) i9-13900H 2.60 GHz, on Windows 11 with Docker for Windows.

Describe the bug

I get this debug message right before the model is loaded.

stderr ggml_cuda_init: failed to initialize CUDA: named symbol not found

This indicates to me that the models will not use GPU support. However, this worked just fine before.

After updating the graphics driver, the CUDA version changed as well, from 12.4 to 12.5. It seems the CUDA environment is no longer used by any LLM, even though the CUDA version is detected correctly when starting the LocalAI Docker container.

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 555.42.03              Driver Version: 555.85         CUDA Version: 12.5     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4090 ...    On  |   00000000:01:00.0  On |                  N/A |
| N/A   48C    P8              7W /  105W |     148MiB /  16376MiB |      1%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
NVIDIA GPU detected. Attempting to find memory size...
Total GPU Memory: 16376 MiB

Instead of utilizing the GPU, the application uses the fallback and runs only on the CPU.

To Reproduce

Expected behavior

Utilizing the GPU.

Logs

Here are the full logs for the mistral-7b-instruct-v0.1.Q5_K_M.gguf model; I also tried several other models that worked before, and none utilize the GPU after installing the new graphics driver.

localai.log

Additional context

Checking Task Manager also shows that no GPU usage is taking place.

@CodeMazeSolver added the bug (Something isn't working) and unconfirmed labels on May 24, 2024
@nmbgeek

nmbgeek commented May 26, 2024

I am trying to deploy LocalAI to an Ubuntu 24.04 server (Proxmox VM) with an A2000 passed through and think I am running into the same issue. I initially had the 550 drivers installed on the server, which corresponded to what I saw in nvidia-smi when LocalAI started, but I have also purged the NVIDIA drivers and put the server back to the 535 drivers. Regardless, I get this message in the logs when attempting to use a GPU model: INF GPU device found but no CUDA backend present.

I have tried images tagged master-aio-gpu-nvidia-cuda-12, master-aio-gpu-nvidia-cuda-11, master-cublas-cuda12-ffmpeg, and have also tried with the env variable REBUILD=true. I am currently able to run v2.15.0-aio-gpu-nvidia-cuda-12 and everything seems to work. I haven't tried any builds between that one and the current one.

@Hideman85

Same here; I tried the Docker version without success.

Logs here
docker run -p 8080:8080 --rm -v ./Documents/AIModels/:/build/models -ti localai/localai:latest-aio-gpu-nvidia-cuda-12
===> LocalAI All-in-One (AIO) container starting...
NVIDIA GPU detected
/aio/entrypoint.sh: line 52: nvidia-smi: command not found
NVIDIA GPU detected, but nvidia-smi is not installed. GPU acceleration will not be available.
AMD GPU detected
AMD GPU detected, but ROCm is not installed. GPU acceleration will not be available.
GPU acceleration is not enabled or supported. Defaulting to CPU.
[...]
10:27AM INF core/startup process completed!
10:27AM INF LocalAI API is listening! Please connect to the endpoint for API documentation. endpoint=http://0.0.0.0:8080
10:27AM INF Success ip=172.17.0.1 latency=29.872297ms method=POST status=200 url=/v1/chat/completions
10:27AM INF Trying to load the model 'b5869d55688a529c3738cb044e92c331' with the backend '[llama-cpp llama-ggml gpt4all llama-cpp-fallback piper stablediffusion rwkv whisper huggingface bert-embeddings /build/backend/python/bark/run.sh /build/backend/python/vall-e-x/run.sh /build/backend/python/transformers/run.sh /build/backend/python/sentencetransformers/run.sh /build/backend/python/exllama/run.sh /build/backend/python/exllama2/run.sh /build/backend/python/autogptq/run.sh /build/backend/python/transformers-musicgen/run.sh /build/backend/python/coqui/run.sh /build/backend/python/diffusers/run.sh /build/backend/python/vllm/run.sh /build/backend/python/openvoice/run.sh /build/backend/python/mamba/run.sh /build/backend/python/parler-tts/run.sh /build/backend/python/rerankers/run.sh /build/backend/python/sentencetransformers/run.sh /build/backend/python/petals/run.sh]'
10:27AM INF [llama-cpp] Attempting to load
10:27AM INF Loading model 'b5869d55688a529c3738cb044e92c331' with backend llama-cpp
10:27AM INF GPU device found but no CUDA backend present
10:27AM INF [llama-cpp] attempting to load with AVX2 variant

@madgagarin

madgagarin commented May 28, 2024

Confirmed, CPU only:
docker 2.16
ubuntu 24.04
nvidia 550
INF GPU device found but no CUDA backend present

@stephenleo

stephenleo commented May 29, 2024

Same issue. I spent over two days trying to figure out what happened until I found this issue.
Installing an older NVIDIA driver from https://www.nvidia.com/download/index.aspx?lang=en-us fixed it.
Specifically, I downloaded and installed the 551.86 driver.

@crazymxm

crazymxm commented May 30, 2024

Error: INF GPU device found but no CUDA backend present.

I think I have found the reason!

If you do not build the dist target, llama-cpp-cuda will NOT be included in the backends!

The new version changed the backends a lot, and the documentation was not updated.

The new version's Makefile has this dist target:
dist:
	STATIC=true $(MAKE) backend-assets/grpc/llama-cpp-avx2
ifeq ($(OS),Darwin)
	$(info ${GREEN}I Skip CUDA build on MacOS${RESET})
else
	$(MAKE) backend-assets/grpc/llama-cpp-cuda
endif
	$(MAKE) build
	mkdir -p release
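
For anyone building from source, a minimal sketch of invoking that target so the CUDA backend ends up in backend-assets (BUILD_TYPE=cublas is the value used elsewhere in this thread; a working CUDA toolchain is assumed):

# Sketch: build the dist target so backend-assets/grpc/llama-cpp-cuda is produced
git clone https://github.com/mudler/LocalAI && cd LocalAI
BUILD_TYPE=cublas make dist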

@jobongo

jobongo commented May 30, 2024

I can confirm, running in WSL 2. I tried rebuilding from source; during the build it states that CUDA was found, but it falls back to AVX2 when loading the model. Downgrading the drivers to 551.86 "fixes" the issue.

@CodeMazeSolver
Author

Same issue. I spent over two days trying to figure out what happened until I found this issue. Installing an older NVIDIA driver from https://www.nvidia.com/download/index.aspx?lang=en-us fixed it. Specifically, I downloaded and installed the 551.86 driver.

Yes, I also moved to an older version of the driver for now.

@Phate334

Phate334 commented Jun 4, 2024

Upgrading to the latest CUDA toolkit can fix it.

Driver Version: 555.42.02 CUDA Version: 12.5

$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Wed_Apr_17_19:19:55_PDT_2024
Cuda compilation tools, release 12.5, V12.5.40
Build cuda_12.5.r12.5/compiler.34177558_0

@sbushmanov

sbushmanov commented Jun 6, 2024

@Phate334

Could you elaborate a bit on your setup (OS, the repos the NVIDIA drivers were installed from)? I have exactly the same CUDA version as yours, but still no joy.

$ nvcc -V 
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Wed_Apr_17_19:19:55_PDT_2024
Cuda compilation tools, release 12.5, V12.5.40
Build cuda_12.5.r12.5/compiler.34177558_0
apt-cache policy nvidia-driver-555
nvidia-driver-555:
  Installed: 555.42.02-0ubuntu1
  Candidate: 555.42.02-0ubuntu1
  Version table:
 *** 555.42.02-0ubuntu1 600
        600 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  Packages
        100 /var/lib/dpkg/status
     555.42.02-0ubuntu0~gpu22.04.1 500
        500 https://ppa.launchpadcontent.net/graphics-drivers/ppa/ubuntu jammy/main amd64 Packages

Ubuntu 22.04

@Phate334

Phate334 commented Jun 6, 2024

apt-cache policy nvidia-driver-555
...

I used the runfile to install the toolkit instead of apt. LocalAI v2.16.0.

https://developer.nvidia.com/cuda-downloads?target_os=Linux&target_arch=x86_64&Distribution=Ubuntu&target_version=22.04&target_type=runfile_local
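
For reference, a runfile-based install boils down to downloading the .run installer from the page above and executing it. A minimal sketch; the exact installer filename is an assumption based on the CUDA 12.5.0 release and may differ, so take the current one from the download page:

# Sketch: install the CUDA 12.5 toolkit via the NVIDIA runfile instead of apt
wget https://developer.download.nvidia.com/compute/cuda/12.5.0/local_installers/cuda_12.5.0_555.42.02_linux.run
sudo sh cuda_12.5.0_555.42.02_linux.run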

@nmbgeek

nmbgeek commented Jun 9, 2024

@Phate334

I have driver 555.42 and CUDA 12.5 running and working everywhere except LocalAI.
sudo docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi returns:

Sun Jun  9 12:54:06 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 555.42.02              Driver Version: 555.42.02      CUDA Version: 12.5     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA RTX A2000               Off |   00000000:00:10.0 Off |                  Off |
| 30%   32C    P8              4W /   70W |       2MiB /   6138MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

nvcc -V returns:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Wed_Apr_17_19:19:55_PDT_2024
Cuda compilation tools, release 12.5, V12.5.40
Build cuda_12.5.r12.5/compiler.34177558_0

nvidia-smi returns:

nvidia-smi
Sun Jun  9 12:57:01 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 555.42.02              Driver Version: 555.42.02      CUDA Version: 12.5     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA RTX A2000               Off |   00000000:00:10.0 Off |                  Off |
| 30%   32C    P8              4W /   70W |       2MiB /   6138MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

Now, with all of that in place and a reboot of the system for good measure, I am getting this when running localai/localai:master-aio-gpu-nvidia-cuda-12:

===> LocalAI All-in-One (AIO) container starting...
NVIDIA GPU detected
/aio/entrypoint.sh: line 52: nvidia-smi: command not found
NVIDIA GPU detected, but nvidia-smi is not installed. GPU acceleration will not be available.
GPU acceleration is not enabled or supported. Defaulting to CPU.

@sbushmanov

sbushmanov commented Jun 9, 2024

@nmbgeek

I'm on Ubuntu 22.04. After a fresh reinstall of everything connected to CUDA (including nvidia-driver-555), I have the following:

$ nvcc -V     
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Wed_Apr_17_19:19:55_PDT_2024
Cuda compilation tools, release 12.5, V12.5.40
Build cuda_12.5.r12.5/compiler.34177558_0

$ nvidia-smi                                                                                                                                                 
Sun Jun  9 17:46:40 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 555.42.02              Driver Version: 555.42.02      CUDA Version: 12.5     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3070 ...    Off |   00000000:01:00.0  On |                  N/A |
| N/A   59C    P0             28W /   80W |    7044MiB /   8192MiB |      3%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A      6416      G   /usr/lib/xorg/Xorg                            381MiB |
|    0   N/A  N/A      6924      G   /usr/bin/gnome-shell                           96MiB |
|    0   N/A  N/A    326812      G   ...SidePanel --variations-seed-version        349MiB |
|    0   N/A  N/A    351309      G   x-terminal-emulator                            10MiB |
|    0   N/A  N/A    352079      C   .../backend-assets/grpc/llama-cpp-avx2       6152MiB |
+-----------------------------------------------------------------------------------------+

everything works just fine.

Versions I've got:

$ apt-cache policy nvidia-driver-555                                                                                                                                                       
nvidia-driver-555:
  Installed: 555.42.02-0ubuntu1
  Candidate: 555.42.02-0ubuntu1
  Version table:
     555.52.04-0ubuntu0~gpu22.04.1 500
        500 https://ppa.launchpadcontent.net/graphics-drivers/ppa/ubuntu jammy/main amd64 Packages
 *** 555.42.02-0ubuntu1 600
        600 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  Packages
        100 /var/lib/dpkg/status

$ apt-cache policy cuda                                                                                                                                                                    
cuda:
  Installed: 12.5.0-1
  Candidate: 12.5.0-1
  Version table:
 *** 12.5.0-1 600
        600 https://developer.download.nvidia.com/compute/cuda/repos/debian12/x86_64  Packages
        100 /var/lib/dpkg/status
     12.5.0-1 600
        600 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  Packages
     12.4.1-1 600
        600 https://developer.download.nvidia.com/compute/cuda/repos/debian12/x86_64  Packages
     12.4.1-1 600

@WolframRavenwolf

Also have the "GPU device found but no CUDA backend present" issue:

  • Host is up-to-date Ubuntu 22.04.4 LTS with NVIDIA RTX 6000 Ada Generation.
  • Docker images localai/localai:latest-aio-gpu-nvidia-cuda-12 and localai/localai:master-aio-gpu-nvidia-cuda-12 both have this issue.
  • LocalAI version: v2.17.0 (2f29797) + 2437a27 (2437a27)
  • Rebuilding the image didn't fix it.
  • Using a cuda-11 instead of cuda-12 image didn't fix it.
  • Upgrading from nvidia-driver-535 (NVIDIA-SMI 535.171.04, Driver Version: 535.171.04, CUDA Version: 12.2) to nvidia-driver-550 (NVIDIA-SMI 550.67, Driver Version: 550.67, CUDA Version: 12.4) didn't fix it.
  • CUDA works perfectly well with other AI app containers on this host, like Aphrodite-Engine, Ollama, vLLM.
docker exec -it localai bash

root@localai:/build# nvcc -V

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Mon_Apr__3_17:16:06_PDT_2023
Cuda compilation tools, release 12.1, V12.1.105
Build cuda_12.1.r12.1/compiler.32688072_0

root@localai:/build# nvidia-smi

Tue Jun 18 10:39:30 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.67                 Driver Version: 550.67         CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA RTX 6000 Ada Gene...    Off |   00000000:55:00.0 Off |                  Off |
| 30%   36C    P8             28W /  300W |    7024MiB /  49140MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
+-----------------------------------------------------------------------------------------+

@MCP-LTS

MCP-LTS commented Jun 19, 2024

For some reason the auto-detection does not call the following from the Makefile; I don't know where or how the auto-detection happens.

https://github.com/mudler/LocalAI/blob/master/Makefile

backend-assets/grpc/llama-cpp-cuda: backend-assets/grpc
	cp -rf backend/cpp/llama backend/cpp/llama-cuda
	$(MAKE) -C backend/cpp/llama-cuda purge
	$(info ${GREEN}I llama-cpp build info:cuda${RESET})
	CMAKE_ARGS="$(CMAKE_ARGS) -DLLAMA_AVX=on -DLLAMA_AVX2=off -DLLAMA_AVX512=off -DLLAMA_FMA=off -DLLAMA_F16C=off -DLLAMA_CUDA=ON" $(MAKE) VARIANT="llama-cuda" build-llama-cpp-grpc-server
	cp -rfv backend/cpp/llama-cuda/grpc-server backend-assets/grpc/llama-cpp-cuda

To make it work temporarily I used
image: localai/localai:master-cublas-cuda12-ffmpeg
and put the following in the environment variables to force compiling llama-cuda:

    environment:
# had to set the following for llama.cpp to build llama-cuda
      - BUILD_GRPC_FOR_BACKEND_LLAMA=true
      - VARIANT=llama-cuda
      - GRPC_BACKENDS=backend-assets/grpc/llama-cpp-cuda

      - REBUILD=true
      - BUILD_TYPE=cublas

Note: nvidia-smi on both the host and inside the container reports CUDA 12.5.

Hope this helps someone find a way to auto-detect this and do the above as part of the normal rebuild process or in the prebuilt container.
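
For plain docker run, a rough equivalent of these variables plus GPU access looks like this (image tag taken from the comment above; the model path is a placeholder):

# Sketch: docker run equivalent of the compose workaround above
docker run --rm -ti --gpus all -p 8080:8080 \
  -e REBUILD=true -e BUILD_TYPE=cublas \
  -e BUILD_GRPC_FOR_BACKEND_LLAMA=true -e VARIANT=llama-cuda \
  -e GRPC_BACKENDS=backend-assets/grpc/llama-cpp-cuda \
  -v $PWD/models:/build/models \
  localai/localai:master-cublas-cuda12-ffmpeg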

@WolframRavenwolf

Thanks, @MCP-LTS, this made CUDA work for me inside the LocalAI container!

Now we just need this to be fixed inside the official images since rebuilding took hours on my (actually pretty beefy) AI server.

@dallumnz

Thank you @MCP-LTS, this also works with v2.17.1.

@jobongo

jobongo commented Jun 20, 2024

I found another GitHub issue for Ollama that seems to be related: ollama/ollama#4563 (comment).

It seems that the new NVIDIA driver doesn't load the necessary kernel module on Linux. I have not tested this with LocalAI on my Linux deployments yet.

I also run LocalAI on Windows WSL2 with Docker Desktop and was having the same issue. The same thread mentions updates to Docker Desktop.

I updated Docker Desktop to 4.31.1 (https://docs.docker.com/desktop/release-notes/) and it finally works with the latest drivers (555.99).

So, for anyone out there running this in WSL2, try updating Docker Desktop.
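
A quick sanity check that the updated Docker Desktop / WSL2 setup actually exposes the GPU to containers (the CUDA base image tag here is just an example):

# Sketch: verify the GPU is visible from inside a container
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi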

@rwlove

rwlove commented Jun 21, 2024

@MCP-LTS I followed your workaround and the build seemed to succeed. However, when I try to chat with a model I get the following error:

12:18AM INF [llama-cpp] attempting to load with CUDA variant
12:18AM DBG Loading GRPC Process: /tmp/localai/backend_data/backend-assets/grpc/llama-cpp-cuda
12:18AM DBG GRPC(Mistral-7B-Instruct-v0.3.Q4_K_M.gguf-127.0.0.1:34409): stderr /tmp/localai/backend_data/backend-assets/grpc/llama-cpp-cuda: error while loading shared libraries: libcuda.so.1: cannot open shared object file: No such file or directory

Running LocalAI on a K8S node.

On the node:

➜  ~ ls -l /usr/lib64/libcuda.so.1
lrwxrwxrwx 1 root root 20 May 15 11:53 /usr/lib64/libcuda.so.1 -> libcuda.so.555.42.02

In the container:

root@localai-local-ai-649fd7f4bd-rqtlj:/build# find /usr | grep libcuda
/usr/local/cuda-12.5/targets/x86_64-linux/lib/cmake/libcudacxx
/usr/local/cuda-12.5/targets/x86_64-linux/lib/cmake/libcudacxx/libcudacxx-config-version.cmake
/usr/local/cuda-12.5/targets/x86_64-linux/lib/cmake/libcudacxx/libcudacxx-config.cmake
/usr/local/cuda-12.5/targets/x86_64-linux/lib/cmake/libcudacxx/libcudacxx-header-search.cmake
/usr/local/cuda-12.5/targets/x86_64-linux/lib/libcudadevrt.a
/usr/local/cuda-12.5/targets/x86_64-linux/lib/libcudart.so
/usr/local/cuda-12.5/targets/x86_64-linux/lib/libcudart.so.12
/usr/local/cuda-12.5/targets/x86_64-linux/lib/libcudart.so.12.5.39
/usr/local/cuda-12.5/targets/x86_64-linux/lib/libcudart_static.a
/usr/local/cuda-12.5/targets/x86_64-linux/lib/stubs/libcuda.so

Any suggestions?

@ER-EPR

ER-EPR commented Jun 25, 2024

I can't successfully build llama-cuda inside the container. I prefer it to be delivered precompiled within the images. When will this be fixed?

@ER-EPR

ER-EPR commented Jun 26, 2024

To avoid repeated rebuilds I use this Dockerfile to build a new image, and it works.

FROM localai/localai:v2.17.1-cublas-cuda12-ffmpeg

ENV BUILD_GRPC_FOR_BACKEND_LLAMA=true
ENV VARIANT=llama-cuda
ENV GRPC_BACKENDS=backend-assets/grpc/llama-cpp-cuda
ENV REBUILD=true
ENV BUILD_TYPE=cublas

RUN cd /build && rm -rf ./local-ai && make build -j${BUILD_PARALLELISM:-1}
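
Usage is then roughly as follows (the image tag is arbitrary; the run flags mirror the original report):

# Sketch: build the patched image once, then run it with GPU access
docker build -t localai-cuda-rebuild .
docker run --rm -ti --gpus all -p 8080:8080 -v $PWD/models:/build/models localai-cuda-rebuild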

@di-rse

di-rse commented Jul 3, 2024

I can confirm that this is still an issue with v2.18.1. Using @ER-EPR's solution for the time being.

@2snoopy88

2snoopy88 commented Jul 8, 2024

@rwlove

@MCP-LTS I followed your workaround and the build seemed to succeed. However, when I try to chat with a model I get the following error: [...] error while loading shared libraries: libcuda.so.1: cannot open shared object file: No such file or directory [...] Any suggestions?

I use a Docker container; when I add the GPU configuration in docker-compose.yaml, it works fine for me:

deploy:
  resources:
    reservations:
      devices:
        - capabilities: [gpu]

@ascheucher-shopify-partner

ascheucher-shopify-partner commented Jul 9, 2024

EDIT: Sorry, this doesn't work! It doesn't complain, but doesn't utilize the GPU at all. :/

Former (not working) workaround:
As the current release still doesn't work for everyone, I want to add my updated solution using LocalAI 2.18.1 with CUDA 12.5 / driver 555.

I created a Dockerfile based on @MCP-LTS's tips, slightly different from @ER-EPR's so as to use the latest versions.

Had a clean installation of Ubuntu 24.04.

First I purged all NVIDIA drivers:

sudo apt purge nvidia-driver-* nvidia-utils-* nvidia-kernel-* libnvidia-* xserver-xorg-video-nvidia-550 nvidia-compute* nvidia-dkms* nvidia-fs-dkms nvidia-prime nvidia-settings
# check for missing ones and purge them as well
dpkg -l | grep nvidia

Then I installed everything from scratch:


Install the container toolkit [following this tutorial](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html):

curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
  && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
    sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
    sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit

Some Docker Stuff:

sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

rootless mode:

nvidia-ctk runtime configure --runtime=docker --config=$HOME/.config/docker/daemon.json
systemctl --user restart docker
sudo nvidia-ctk config --set nvidia-container-cli.no-cgroups --in-place

also:

sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
sudo nvidia-ctk runtime configure --runtime=containerd
sudo systemctl restart containerd

check for version:

docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu20.04 nvidia-smi

I also added this, for when the GPU is detected but fails to init (some background):

vi /etc/nvidia-container-runtime/config.toml

# change no-cgroups=true to no-cgroups=false

Then the CUDA toolkit installation. Instructions found here.

wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update
sudo apt-get -y install cuda-toolkit-12-5
sudo apt-get -y install nvidia-container-runtime

nvidia drivers:

First try with the open kernel module drivers:

sudo apt-get install -y nvidia-driver-555-open
sudo apt-get install -y cuda-drivers-555

Disable IOMMU:

  • Is it active? sudo dmesg | grep -i iommu
  • Disable it: sudo vi /etc/default/grub
  • Add amd_iommu=off to GRUB_CMDLINE_LINUX_DEFAULT (see the sketch below)
  • Update GRUB: sudo update-grub
  • sudo reboot now
  • Check with: cat /proc/cmdline
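
A sketch of the edited line in /etc/default/grub; the existing options shown are placeholders, keep whatever is already there and append the flag:

# Sketch: /etc/default/grub after appending amd_iommu=off
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash amd_iommu=off"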

LocalAI
Start with the getting started section

Get the GPU ID(s):

nvidia-smi -L

Build a custom image, as long as the CUDA detection/compilation does not work in the original image:

Create a docker file ~/Documents/LocalAI/CUDA-not-found-workaround/Dockerfile:

FROM localai/localai:v2.18.1-cublas-cuda12-ffmpeg

ENV BUILD_GRPC_FOR_BACKEND_LLAMA=true
ENV VARIANT=llama-cuda
ENV GRPC_BACKENDS=backend-assets/grpc/llama-cpp-cuda
ENV REBUILD=true
ENV BUILD_TYPE=cublas

RUN cd /build && rm -rf ./local-ai && make build -j${BUILD_PARALLELISM:-1}

In the same directory build the image:

docker build -t local-ai-cuda-hack-v2.18.1-cublas-cuda12-ffmpeg .

Start with:

docker volume create localai-models
docker rm local-ai; docker run -p 8080:8080 --name local-ai -ti -v localai-models:/build/models -e NVIDIA_VISIBLE_DEVICES=GPU-233d81ca-903f-0195-63b2-798f5fb087eb  --runtime=nvidia --memory=16g local-ai-cuda-hack-v2.18.1-cublas-cuda12-ffmpeg --context-size 1000 --threads 8

check in container: docker exec -it local-ai nvidia-smi

Looks good as well:

user@orca:~/Documents/LocalAI/CUDA-not-found-workaround$ docker exec -it local-ai nvidia-smi
Tue Jul  9 13:34:19 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 555.42.06              Driver Version: 555.42.06      CUDA Version: 12.5     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3090        Off |   00000000:08:00.0 Off |                  N/A |
|  0%   31C    P8              7W /  350W |    6490MiB /  24576MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A       339      C   .../backend-assets/grpc/llama-cpp-cuda          0MiB |
+-----------------------------------------------------------------------------------------+

I am not sure whether everything done here is necessary, but this finally made it work for me, at least for gpt-4.

This is the startup log; there is no nvidia-smi output (previously I had nvidia-smi output there):

user@orca:~$ docker rm local-ai; docker run -p 8080:8080 --name local-ai -ti -v localai-models:/build/models -e NVIDIA_VISIBLE_DEVICES=GPU-233d81ca-903f-0195-63b2-798f5fb087eb  --runtime=nvidia --memory=16g local-ai-cuda-hack --context-size 1000 --threads 8
local-ai
go mod edit -replace github.com/donomii/go-rwkv.cpp=/build/sources/go-rwkv.cpp
go mod edit -replace github.com/ggerganov/whisper.cpp=/build/sources/whisper.cpp
go mod edit -replace github.com/ggerganov/whisper.cpp/bindings/go=/build/sources/whisper.cpp/bindings/go
go mod edit -replace github.com/go-skynet/go-bert.cpp=/build/sources/go-bert.cpp
go mod edit -replace github.com/M0Rf30/go-tiny-dream=/build/sources/go-tiny-dream
go mod edit -replace github.com/mudler/go-piper=/build/sources/go-piper
go mod edit -replace github.com/mudler/go-stable-diffusion=/build/sources/go-stable-diffusion
go mod edit -replace github.com/nomic-ai/gpt4all/gpt4all-bindings/golang=/build/sources/gpt4all/gpt4all-bindings/golang
go mod edit -replace github.com/go-skynet/go-llama.cpp=/build/sources/go-llama.cpp
go mod download
mkdir -p pkg/grpc/proto
protoc --experimental_allow_proto3_optional -Ibackend/ --go_out=pkg/grpc/proto/ --go_opt=paths=source_relative --go-grpc_out=pkg/grpc/proto/ --go-grpc_opt=paths=source_relative \
    backend/backend.proto
mkdir -p backend-assets/grpc
I local-ai build info:
I BUILD_TYPE: cublas
I GO_TAGS:
I LD_FLAGS: -X "github.com/go-skynet/LocalAI/internal.Version=v2.18.1" -X "github.com/go-skynet/LocalAI/internal.Commit=b941732f5494a7f3d7cc88b8fc64aaeae27a97e2"
CGO_LDFLAGS="-lcublas -lcudart -L/usr/local/cuda/lib64/" go build -ldflags "-X "github.com/go-skynet/LocalAI/internal.Version=v2.18.1" -X "github.com/go-skynet/LocalAI/internal.Commit=b941732f5494a7f3d7cc88b8fc64aaeae27a97e2"" -tags "" -o local-ai ./
1:15PM INF env file found, loading environment variables from file envFile=.env
1:15PM INF Setting logging to info
1:15PM INF Starting LocalAI using 8 threads, with models path: /build/models
1:15PM INF LocalAI version:  ()
WARNING: failed to read int from file: open /sys/class/drm/card0/device/numa_node: no such file or directory
WARNING: error parsing the pci address "simple-framebuffer.0"
1:15PM INF Preloading models from /build/models

  Model name: text-embedding-ada-002



  You can test this model with curl like this:

  curl http://localhost:8080/embeddings -X POST -H "Content-Type:
  application/json" -d '{ "input": "Your text string goes here", "model": "text-
  embedding-ada-002" }'



  Model name: whisper-1



  ## example audio file

  wget --quiet --show-progress -O gb1.ogg
...

@salmundani

Having the same issue on the latest LocalAI version with Ubuntu 22.04.4 LTS, CUDA 12.4, and NVIDIA drivers v550. I also tried upgrading to CUDA 12.5 and drivers v555, but it still doesn't work.

@mudler
Owner

mudler commented Jul 23, 2024

Can someone help me test the image referenced in #2994 (comment)? I could only test with CUDA 12.2 and 12.4, and it seems to work perfectly fine; I lack a testbed for 12.5.

Container image: ttl.sh/localai-ci-pr-2994:sha-1a89570-cublas-cuda12-ffmpeg
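
A sketch of how to pull and run that CI image for a test, reusing the flags from the original report (the model path and trailing flags are placeholders):

# Sketch: run the PR #2994 CI image with debug logging and GPU access
docker run --rm -ti --gpus all -p 8080:8080 -e DEBUG=true \
  -v $PWD/models:/models \
  ttl.sh/localai-ci-pr-2994:sha-1a89570-cublas-cuda12-ffmpeg \
  --models-path /models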

@rmarku

rmarku commented Jul 23, 2024

I have been having this problem since some older versions. I tested the PR #2994 image (changed the docker compose image) and keep getting GPU device found but no CUDA backend present.
Ollama is working well, running in Docker.

Log:

api-1  | 5:31PM INF Trying to load the model 'Meta-Llama-3-8B-Instruct.Q4_0.gguf' with the backend '[llama-cpp llama-ggml gpt4all llama-cpp-fallback piper rwkv stablediffusion whisper huggingface bert-embeddings /build/backend/python/openvoice/run.sh /build/backend/python/transformers-musicgen/run.sh /build/backend/python/sentencetransformers/run.sh /build/backend/python/mamba/run.sh /build/backend/python/transformers/run.sh /build/backend/python/exllama/run.sh /build/backend/python/parler-tts/run.sh /build/backend/python/rerankers/run.sh /build/backend/python/vllm/run.sh /build/backend/python/autogptq/run.sh /build/backend/python/bark/run.sh /build/backend/python/vall-e-x/run.sh /build/backend/python/petals/run.sh /build/backend/python/sentencetransformers/run.sh /build/backend/python/exllama2/run.sh /build/backend/python/coqui/run.sh /build/backend/python/diffusers/run.sh]'
api-1  | 5:31PM INF [llama-cpp] Attempting to load
api-1  | 5:31PM INF Loading model 'Meta-Llama-3-8B-Instruct.Q4_0.gguf' with backend llama-cpp
api-1  | 5:31PM INF GPU device found but no CUDA backend present
api-1  | 5:31PM INF [llama-cpp] attempting to load with AVX2 variant
api-1  | 5:31PM INF [llama-cpp] Loads OK

Getting the same problem:

Host info:

╰─❯ nvidia-smi                                
Tue Jul 23 14:33:32 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 555.58.02              Driver Version: 555.58.02      CUDA Version: 12.5     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3070 ...    Off |   00000000:01:00.0  On |                  N/A |
| N/A   61C    P8             18W /   80W |      92MiB /   8192MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A      2345      G   /usr/bin/X                                     57MiB |
|    0   N/A  N/A      2540      G   Hyprland                                        3MiB |
+-----------------------------------------------------------------------------------------+
╰─❯ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Wed_Apr_17_19:19:55_PDT_2024
Cuda compilation tools, release 12.5, V12.5.40
Build cuda_12.5.r12.5/compiler.34177558_0

Inside docker:

root@edfefd60f9f2:/build# nvidia-smi   
Tue Jul 23 17:35:02 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 555.58.02              Driver Version: 555.58.02      CUDA Version: 12.5     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3070 ...    Off |   00000000:01:00.0  On |                  N/A |
| N/A   61C    P5             20W /   80W |      92MiB /   8192MiB |      2%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
+-----------------------------------------------------------------------------------------+
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Fri_Jan__6_16:45:21_PST_2023
Cuda compilation tools, release 12.0, V12.0.140
Build cuda_12.0.r12.0/compiler.32267302_0
root@edfefd60f9f2:/build# 

@mudler
Owner

mudler commented Jul 23, 2024

I have been having this problem since some older versions. I tested the PR #2994 image (changed the docker compose image) and keep getting GPU device found but no CUDA backend present. Ollama is working well, running in Docker.

You can ignore that message. It's a red herring: the container images do not ship the alternative CUDA LocalAI binary, but they are built with nvcc, so they already work with NVIDIA GPUs. I agree the message is misleading and should be suppressed in that case; that's where we have to improve the logging.

If you could paste the logs with --debug, that would be great; we can then check whether the GPU was actually used or not.

@rmarku

rmarku commented Jul 23, 2024

Looks like it is working; a speed of 41 t/s suggests GPU usage.
Here are the logs:

api-1  | 6:10PM DBG Loading GRPC Process: /tmp/localai/backend_data/backend-assets/grpc/llama-cpp-avx2
api-1  | 6:10PM DBG GRPC Service for Meta-Llama-3-8B-Instruct.Q4_0.gguf will be running at: '127.0.0.1:41077'
api-1  | 6:10PM DBG GRPC Service state dir: /tmp/go-processmanager4137429056
api-1  | 6:10PM DBG GRPC Service Started
api-1  | 6:10PM DBG GRPC(Meta-Llama-3-8B-Instruct.Q4_0.gguf-127.0.0.1:41077): stderr WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
api-1  | 6:10PM DBG GRPC(Meta-Llama-3-8B-Instruct.Q4_0.gguf-127.0.0.1:41077): stderr I0000 00:00:1721758247.258414      38 config.cc:230] gRPC experiments enabled: call_status_override_on_cancellation, event_engine_dns, event_engine_listener, http2_stats_fix, monitoring_experiment, pick_first_new, trace_record_callops, work_serializer_clears_time_cache, work_serializer_dispatch
api-1  | 6:10PM DBG GRPC(Meta-Llama-3-8B-Instruct.Q4_0.gguf-127.0.0.1:41077): stderr I0000 00:00:1721758247.258629      38 ev_epoll1_linux.cc:125] grpc epoll fd: 3
api-1  | 6:10PM DBG GRPC(Meta-Llama-3-8B-Instruct.Q4_0.gguf-127.0.0.1:41077): stderr I0000 00:00:1721758247.258818      38 server_builder.cc:392] Synchronous server. Num CQs: 1, Min pollers: 1, Max Pollers: 2, CQ timeout (msec): 10000
api-1  | 6:10PM DBG GRPC(Meta-Llama-3-8B-Instruct.Q4_0.gguf-127.0.0.1:41077): stderr I0000 00:00:1721758247.261197      38 ev_epoll1_linux.cc:359] grpc epoll fd: 5
api-1  | 6:10PM DBG GRPC(Meta-Llama-3-8B-Instruct.Q4_0.gguf-127.0.0.1:41077): stderr I0000 00:00:1721758247.261709      38 tcp_socket_utils.cc:634] TCP_USER_TIMEOUT is available. TCP_USER_TIMEOUT will be used thereafter
api-1  | 6:10PM DBG GRPC(Meta-Llama-3-8B-Instruct.Q4_0.gguf-127.0.0.1:41077): stdout Server listening on 127.0.0.1:41077
api-1  | 6:10PM DBG GRPC Service Ready
api-1  | 6:10PM DBG GRPC: Loading model with options: {state:{NoUnkeyedLiterals:{} DoNotCompare:[] DoNotCopy:[] atomicMessageInfo:<nil>} sizeCache:0 unknownFields:[] Model:Meta-Llama-3-8B-Instruct.Q4_0.gguf ContextSize:8192 Seed:860747239 NBatch:512 F16Memory:true MLock:false MMap:true VocabOnly:false LowVRAM:false Embeddings:false NUMA:false NGPULayers:99999999 MainGPU: TensorSplit: Threads:20 LibrarySearchPath: RopeFreqBase:0 RopeFreqScale:0 RMSNormEps:0 NGQA:0 ModelFile:/models/Meta-Llama-3-8B-Instruct.Q4_0.gguf Device: UseTriton:false ModelBaseName: UseFastTokenizer:false PipelineType: SchedulerType: CUDA:false CFGScale:0 IMG2IMG:false CLIPModel: CLIPSubfolder: CLIPSkip:0 ControlNet: Tokenizer: LoraBase: LoraAdapter: LoraScale:0 NoMulMatQ:false DraftModel: AudioPath: Quantization: GPUMemoryUtilization:0 TrustRemoteCode:false EnforceEager:false SwapSpace:0 MaxModelLen:0 TensorParallelSize:0 MMProj: RopeScaling: YarnExtFactor:0 YarnAttnFactor:0 YarnBetaFast:0 YarnBetaSlow:0 Type: FlashAttention:false NoKVOffload:false}
api-1  | 6:10PM DBG GRPC(Meta-Llama-3-8B-Instruct.Q4_0.gguf-127.0.0.1:41077): stderr llama_model_loader: loaded meta data with 22 key-value pairs and 291 tensors from /models/Meta-Llama-3-8B-Instruct.Q4_0.gguf (version GGUF V3 (latest))
api-1  | 6:10PM DBG GRPC(Meta-Llama-3-8B-Instruct.Q4_0.gguf-127.0.0.1:41077): stderr llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
api-1  | 6:10PM DBG GRPC(Meta-Llama-3-8B-Instruct.Q4_0.gguf-127.0.0.1:41077): stderr llama_model_loader: - kv   0:                       general.architecture str              = llama
api-1  | 6:10PM DBG GRPC(Meta-Llama-3-8B-Instruct.Q4_0.gguf-127.0.0.1:41077): stderr llama_model_loader: - kv   1:                               general.name str              = models
api-1  | 6:10PM DBG GRPC(Meta-Llama-3-8B-Instruct.Q4_0.gguf-127.0.0.1:41077): stderr llama_model_loader: - kv   2:                          llama.block_count u32              = 32
api-1  | 6:10PM DBG GRPC(Meta-Llama-3-8B-Instruct.Q4_0.gguf-127.0.0.1:41077): stderr llama_model_loader: - kv   3:                       llama.context_length u32              = 8192
api-1  | 6:10PM DBG GRPC(Meta-Llama-3-8B-Instruct.Q4_0.gguf-127.0.0.1:41077): stderr llama_model_loader: - kv   4:                     llama.embedding_length u32              = 4096
api-1  | 6:10PM DBG GRPC(Meta-Llama-3-8B-Instruct.Q4_0.gguf-127.0.0.1:41077): stderr llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
api-1  | 6:10PM DBG GRPC(Meta-Llama-3-8B-Instruct.Q4_0.gguf-127.0.0.1:41077): stderr llama_model_loader: - kv   6:                 llama.attention.head_count u32              = 32
api-1  | 6:10PM DBG GRPC(Meta-Llama-3-8B-Instruct.Q4_0.gguf-127.0.0.1:41077): stderr llama_model_loader: - kv   7:              llama.attention.head_count_kv u32              = 8
api-1  | 6:10PM DBG GRPC(Meta-Llama-3-8B-Instruct.Q4_0.gguf-127.0.0.1:41077): stderr llama_model_loader: - kv   8:                       llama.rope.freq_base f32              = 500000.000000
api-1  | 6:10PM DBG GRPC(Meta-Llama-3-8B-Instruct.Q4_0.gguf-127.0.0.1:41077): stderr llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
api-1  | 6:10PM DBG GRPC(Meta-Llama-3-8B-Instruct.Q4_0.gguf-127.0.0.1:41077): stderr llama_model_loader: - kv  10:                          general.file_type u32              = 2
api-1  | 6:10PM DBG GRPC(Meta-Llama-3-8B-Instruct.Q4_0.gguf-127.0.0.1:41077): stderr llama_model_loader: - kv  11:                           llama.vocab_size u32              = 128256
api-1  | 6:10PM DBG GRPC(Meta-Llama-3-8B-Instruct.Q4_0.gguf-127.0.0.1:41077): stderr llama_model_loader: - kv  12:                 llama.rope.dimension_count u32              = 128
api-1  | 6:10PM DBG GRPC(Meta-Llama-3-8B-Instruct.Q4_0.gguf-127.0.0.1:41077): stderr llama_model_loader: - kv  13:                       tokenizer.ggml.model str              = gpt2
api-1  | 6:10PM DBG GRPC(Meta-Llama-3-8B-Instruct.Q4_0.gguf-127.0.0.1:41077): stderr llama_model_loader: - kv  14:                         tokenizer.ggml.pre str              = llama-bpe
api-1  | 6:10PM DBG GRPC(Meta-Llama-3-8B-Instruct.Q4_0.gguf-127.0.0.1:41077): stderr llama_model_loader: - kv  15:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
api-1  | 6:10PM DBG GRPC(Meta-Llama-3-8B-Instruct.Q4_0.gguf-127.0.0.1:41077): stderr llama_model_loader: - kv  16:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
api-1  | 6:10PM DBG GRPC(Meta-Llama-3-8B-Instruct.Q4_0.gguf-127.0.0.1:41077): stderr llama_model_loader: - kv  17:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
api-1  | 6:10PM DBG GRPC(Meta-Llama-3-8B-Instruct.Q4_0.gguf-127.0.0.1:41077): stderr llama_model_loader: - kv  18:                tokenizer.ggml.bos_token_id u32              = 128000
api-1  | 6:10PM DBG GRPC(Meta-Llama-3-8B-Instruct.Q4_0.gguf-127.0.0.1:41077): stderr llama_model_loader: - kv  19:                tokenizer.ggml.eos_token_id u32              = 128001
api-1  | 6:10PM DBG GRPC(Meta-Llama-3-8B-Instruct.Q4_0.gguf-127.0.0.1:41077): stderr llama_model_loader: - kv  20:                    tokenizer.chat_template str              = {% set loop_messages = messages %}{% ...
api-1  | 6:10PM DBG GRPC(Meta-Llama-3-8B-Instruct.Q4_0.gguf-127.0.0.1:41077): stderr llama_model_loader: - kv  21:               general.quantization_version u32              = 2
api-1  | 6:10PM DBG GRPC(Meta-Llama-3-8B-Instruct.Q4_0.gguf-127.0.0.1:41077): stderr llama_model_loader: - type  f32:   65 tensors
api-1  | 6:10PM DBG GRPC(Meta-Llama-3-8B-Instruct.Q4_0.gguf-127.0.0.1:41077): stderr llama_model_loader: - type q4_0:  225 tensors
api-1  | 6:10PM DBG GRPC(Meta-Llama-3-8B-Instruct.Q4_0.gguf-127.0.0.1:41077): stderr llama_model_loader: - type q6_K:    1 tensors
api-1  | 6:10PM DBG GRPC(Meta-Llama-3-8B-Instruct.Q4_0.gguf-127.0.0.1:41077): stderr llm_load_vocab: special tokens cache size = 256
api-1  | 6:10PM DBG GRPC(Meta-Llama-3-8B-Instruct.Q4_0.gguf-127.0.0.1:41077): stderr llm_load_vocab: token to piece cache size = 0.8000 MB
api-1  | 6:10PM DBG GRPC(Meta-Llama-3-8B-Instruct.Q4_0.gguf-127.0.0.1:41077): stderr llm_load_print_meta: format           = GGUF V3 (latest)
api-1  | 6:10PM DBG GRPC(Meta-Llama-3-8B-Instruct.Q4_0.gguf-127.0.0.1:41077): stderr llm_load_print_meta: arch             = llama
api-1  | 6:10PM DBG GRPC(Meta-Llama-3-8B-Instruct.Q4_0.gguf-127.0.0.1:41077): stderr llm_load_print_meta: vocab type       = BPE
api-1  | 6:10PM DBG GRPC(Meta-Llama-3-8B-Instruct.Q4_0.gguf-127.0.0.1:41077): stderr llm_load_print_meta: n_vocab          = 128256
api-1  | 6:10PM DBG GRPC(Meta-Llama-3-8B-Instruct.Q4_0.gguf-127.0.0.1:41077): stderr llm_load_print_meta: n_merges         = 280147
api-1  | 6:10PM DBG GRPC(Meta-Llama-3-8B-Instruct.Q4_0.gguf-127.0.0.1:41077): stderr llm_load_print_meta: vocab_only       = 0
api-1  | 6:10PM DBG GRPC(Meta-Llama-3-8B-Instruct.Q4_0.gguf-127.0.0.1:41077): stderr llm_load_print_meta: n_ctx_train      = 8192
api-1  | 6:10PM DBG GRPC(Meta-Llama-3-8B-Instruct.Q4_0.gguf-127.0.0.1:41077): stderr llm_load_print_meta: n_embd           = 4096
api-1  | 6:10PM DBG GRPC(Meta-Llama-3-8B-Instruct.Q4_0.gguf-127.0.0.1:41077): stderr llm_load_print_meta: n_layer          = 32
api-1  | 6:10PM DBG GRPC(Meta-Llama-3-8B-Instruct.Q4_0.gguf-127.0.0.1:41077): stderr llm_load_print_meta: n_head           = 32
api-1  | 6:10PM DBG GRPC(Meta-Llama-3-8B-Instruct.Q4_0.gguf-127.0.0.1:41077): stderr llm_load_print_meta: n_head_kv        = 8
api-1  | 6:10PM DBG GRPC(Meta-Llama-3-8B-Instruct.Q4_0.gguf-127.0.0.1:41077): stderr llm_load_print_meta: n_rot            = 128
api-1  | 6:10PM DBG GRPC(Meta-Llama-3-8B-Instruct.Q4_0.gguf-127.0.0.1:41077): stderr llm_load_print_meta: n_swa            = 0
api-1  | 6:10PM DBG GRPC(Meta-Llama-3-8B-Instruct.Q4_0.gguf-127.0.0.1:41077): stderr llm_load_print_meta: n_embd_head_k    = 128
api-1  | 6:10PM DBG GRPC(Meta-Llama-3-8B-Instruct.Q4_0.gguf-127.0.0.1:41077): stderr llm_load_print_meta: n_embd_head_v    = 128
api-1  | 6:10PM DBG GRPC(Meta-Llama-3-8B-Instruct.Q4_0.gguf-127.0.0.1:41077): stderr llm_load_print_meta: n_gqa            = 4
api-1  | 6:10PM DBG GRPC(Meta-Llama-3-8B-Instruct.Q4_0.gguf-127.0.0.1:41077): stderr llm_load_print_meta: n_embd_k_gqa     = 1024
api-1  | 6:10PM DBG GRPC(Meta-Llama-3-8B-Instruct.Q4_0.gguf-127.0.0.1:41077): stderr llm_load_print_meta: n_embd_v_gqa     = 1024
api-1  | 6:10PM DBG GRPC(Meta-Llama-3-8B-Instruct.Q4_0.gguf-127.0.0.1:41077): stderr llm_load_print_meta: f_norm_eps       = 0.0e+00
api-1  | 6:10PM DBG GRPC(Meta-Llama-3-8B-Instruct.Q4_0.gguf-127.0.0.1:41077): stderr llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
api-1  | 6:10PM DBG GRPC(Meta-Llama-3-8B-Instruct.Q4_0.gguf-127.0.0.1:41077): stderr llm_load_print_meta: f_clamp_kqv      = 0.0e+00
api-1  | 6:10PM DBG GRPC(Meta-Llama-3-8B-Instruct.Q4_0.gguf-127.0.0.1:41077): stderr llm_load_print_meta: f_max_alibi_bias = 0.0e+00
api-1  | 6:10PM DBG GRPC(Meta-Llama-3-8B-Instruct.Q4_0.gguf-127.0.0.1:41077): stderr llm_load_print_meta: f_logit_scale    = 0.0e+00
api-1  | 6:10PM DBG GRPC(Meta-Llama-3-8B-Instruct.Q4_0.gguf-127.0.0.1:41077): stderr llm_load_print_meta: n_ff             = 14336
api-1  | 6:10PM DBG GRPC(Meta-Llama-3-8B-Instruct.Q4_0.gguf-127.0.0.1:41077): stderr llm_load_print_meta: n_expert         = 0
api-1  | 6:10PM DBG GRPC(Meta-Llama-3-8B-Instruct.Q4_0.gguf-127.0.0.1:41077): stderr llm_load_print_meta: n_expert_used    = 0
api-1  | 6:10PM DBG GRPC(Meta-Llama-3-8B-Instruct.Q4_0.gguf-127.0.0.1:41077): stderr llm_load_print_meta: causal attn      = 1
api-1  | 6:10PM DBG GRPC(Meta-Llama-3-8B-Instruct.Q4_0.gguf-127.0.0.1:41077): stderr llm_load_print_meta: pooling type     = 0
api-1  | 6:10PM DBG GRPC(Meta-Llama-3-8B-Instruct.Q4_0.gguf-127.0.0.1:41077): stderr llm_load_print_meta: rope type        = 0
api-1  | 6:10PM DBG GRPC(Meta-Llama-3-8B-Instruct.Q4_0.gguf-127.0.0.1:41077): stderr llm_load_print_meta: rope scaling     = linear
api-1  | 6:10PM DBG GRPC(Meta-Llama-3-8B-Instruct.Q4_0.gguf-127.0.0.1:41077): stderr llm_load_print_meta: freq_base_train  = 500000.0
api-1  | 6:10PM DBG GRPC(Meta-Llama-3-8B-Instruct.Q4_0.gguf-127.0.0.1:41077): stderr llm_load_print_meta: freq_scale_train = 1
api-1  | 6:10PM DBG GRPC(Meta-Llama-3-8B-Instruct.Q4_0.gguf-127.0.0.1:41077): stderr llm_load_print_meta: n_ctx_orig_yarn  = 8192
api-1  | 6:10PM DBG GRPC(Meta-Llama-3-8B-Instruct.Q4_0.gguf-127.0.0.1:41077): stderr llm_load_print_meta: rope_finetuned   = unknown
api-1  | 6:10PM DBG GRPC(Meta-Llama-3-8B-Instruct.Q4_0.gguf-127.0.0.1:41077): stderr llm_load_print_meta: ssm_d_conv       = 0
api-1  | 6:10PM DBG GRPC(Meta-Llama-3-8B-Instruct.Q4_0.gguf-127.0.0.1:41077): stderr llm_load_print_meta: ssm_d_inner      = 0
api-1  | 6:10PM DBG GRPC(Meta-Llama-3-8B-Instruct.Q4_0.gguf-127.0.0.1:41077): stderr llm_load_print_meta: ssm_d_state      = 0
api-1  | 6:10PM DBG GRPC(Meta-Llama-3-8B-Instruct.Q4_0.gguf-127.0.0.1:41077): stderr llm_load_print_meta: ssm_dt_rank      = 0
api-1  | 6:10PM DBG GRPC(Meta-Llama-3-8B-Instruct.Q4_0.gguf-127.0.0.1:41077): stderr llm_load_print_meta: model type       = 8B
api-1  | 6:10PM DBG GRPC(Meta-Llama-3-8B-Instruct.Q4_0.gguf-127.0.0.1:41077): stderr llm_load_print_meta: model ftype      = Q4_0
api-1  | 6:10PM DBG GRPC(Meta-Llama-3-8B-Instruct.Q4_0.gguf-127.0.0.1:41077): stderr llm_load_print_meta: model params     = 8.03 B
api-1  | 6:10PM DBG GRPC(Meta-Llama-3-8B-Instruct.Q4_0.gguf-127.0.0.1:41077): stderr llm_load_print_meta: model size       = 4.33 GiB (4.64 BPW) 
api-1  | 6:10PM DBG GRPC(Meta-Llama-3-8B-Instruct.Q4_0.gguf-127.0.0.1:41077): stderr llm_load_print_meta: general.name     = models
api-1  | 6:10PM DBG GRPC(Meta-Llama-3-8B-Instruct.Q4_0.gguf-127.0.0.1:41077): stderr llm_load_print_meta: BOS token        = 128000 '<|begin_of_text|>'
api-1  | 6:10PM DBG GRPC(Meta-Llama-3-8B-Instruct.Q4_0.gguf-127.0.0.1:41077): stderr llm_load_print_meta: EOS token        = 128001 '<|end_of_text|>'
api-1  | 6:10PM DBG GRPC(Meta-Llama-3-8B-Instruct.Q4_0.gguf-127.0.0.1:41077): stderr llm_load_print_meta: LF token         = 128 'Ä'
api-1  | 6:10PM DBG GRPC(Meta-Llama-3-8B-Instruct.Q4_0.gguf-127.0.0.1:41077): stderr llm_load_print_meta: EOT token        = 128009 '<|eot_id|>'
api-1  | 6:10PM DBG GRPC(Meta-Llama-3-8B-Instruct.Q4_0.gguf-127.0.0.1:41077): stderr llm_load_print_meta: max token length = 256
api-1  | 6:10PM DBG GRPC(Meta-Llama-3-8B-Instruct.Q4_0.gguf-127.0.0.1:41077): stderr ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
api-1  | 6:10PM DBG GRPC(Meta-Llama-3-8B-Instruct.Q4_0.gguf-127.0.0.1:41077): stderr ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
api-1  | 6:10PM DBG GRPC(Meta-Llama-3-8B-Instruct.Q4_0.gguf-127.0.0.1:41077): stderr ggml_cuda_init: found 1 CUDA devices:
api-1  | 6:10PM DBG GRPC(Meta-Llama-3-8B-Instruct.Q4_0.gguf-127.0.0.1:41077): stderr   Device 0: NVIDIA GeForce RTX 3070 Ti Laptop GPU, compute capability 8.6, VMM: yes
api-1  | 6:10PM DBG GRPC(Meta-Llama-3-8B-Instruct.Q4_0.gguf-127.0.0.1:41077): stderr llm_load_tensors: ggml ctx size =    0.27 MiB
api-1  | 6:10PM DBG GRPC(Meta-Llama-3-8B-Instruct.Q4_0.gguf-127.0.0.1:41077): stderr llm_load_tensors: offloading 32 repeating layers to GPU
api-1  | 6:10PM DBG GRPC(Meta-Llama-3-8B-Instruct.Q4_0.gguf-127.0.0.1:41077): stderr llm_load_tensors: offloading non-repeating layers to GPU
api-1  | 6:10PM DBG GRPC(Meta-Llama-3-8B-Instruct.Q4_0.gguf-127.0.0.1:41077): stderr llm_load_tensors: offloaded 33/33 layers to GPU
api-1  | 6:10PM DBG GRPC(Meta-Llama-3-8B-Instruct.Q4_0.gguf-127.0.0.1:41077): stderr llm_load_tensors:        CPU buffer size =   281.81 MiB
api-1  | 6:10PM DBG GRPC(Meta-Llama-3-8B-Instruct.Q4_0.gguf-127.0.0.1:41077): stderr llm_load_tensors:      CUDA0 buffer size =  4155.99 MiB
api-1  | 6:10PM DBG GRPC(Meta-Llama-3-8B-Instruct.Q4_0.gguf-127.0.0.1:41077): stderr .......................................................................................
api-1  | 6:10PM DBG GRPC(Meta-Llama-3-8B-Instruct.Q4_0.gguf-127.0.0.1:41077): stderr llama_new_context_with_model: n_ctx      = 8192
api-1  | 6:10PM DBG GRPC(Meta-Llama-3-8B-Instruct.Q4_0.gguf-127.0.0.1:41077): stderr llama_new_context_with_model: n_batch    = 512
api-1  | 6:10PM DBG GRPC(Meta-Llama-3-8B-Instruct.Q4_0.gguf-127.0.0.1:41077): stderr llama_new_context_with_model: n_ubatch   = 512
api-1  | 6:10PM DBG GRPC(Meta-Llama-3-8B-Instruct.Q4_0.gguf-127.0.0.1:41077): stderr llama_new_context_with_model: flash_attn = 0
api-1  | 6:10PM DBG GRPC(Meta-Llama-3-8B-Instruct.Q4_0.gguf-127.0.0.1:41077): stderr llama_new_context_with_model: freq_base  = 500000.0
api-1  | 6:10PM DBG GRPC(Meta-Llama-3-8B-Instruct.Q4_0.gguf-127.0.0.1:41077): stderr llama_new_context_with_model: freq_scale = 1
api-1  | 6:10PM DBG GRPC(Meta-Llama-3-8B-Instruct.Q4_0.gguf-127.0.0.1:41077): stderr llama_kv_cache_init:      CUDA0 KV buffer size =  1024.00 MiB
api-1  | 6:10PM DBG GRPC(Meta-Llama-3-8B-Instruct.Q4_0.gguf-127.0.0.1:41077): stderr llama_new_context_with_model: KV self size  = 1024.00 MiB, K (f16):  512.00 MiB, V (f16):  512.00 MiB
api-1  | 6:10PM DBG GRPC(Meta-Llama-3-8B-Instruct.Q4_0.gguf-127.0.0.1:41077): stderr llama_new_context_with_model:  CUDA_Host  output buffer size =     0.49 MiB
api-1  | 6:10PM DBG GRPC(Meta-Llama-3-8B-Instruct.Q4_0.gguf-127.0.0.1:41077): stderr llama_new_context_with_model:      CUDA0 compute buffer size =   560.00 MiB
api-1  | 6:10PM DBG GRPC(Meta-Llama-3-8B-Instruct.Q4_0.gguf-127.0.0.1:41077): stderr llama_new_context_with_model:  CUDA_Host compute buffer size =    24.01 MiB
api-1  | 6:10PM DBG GRPC(Meta-Llama-3-8B-Instruct.Q4_0.gguf-127.0.0.1:41077): stderr llama_new_context_with_model: graph nodes  = 1030
api-1  | 6:10PM DBG GRPC(Meta-Llama-3-8B-Instruct.Q4_0.gguf-127.0.0.1:41077): stderr llama_new_context_with_model: graph splits = 2
api-1  | 6:10PM DBG GRPC(Meta-Llama-3-8B-Instruct.Q4_0.gguf-127.0.0.1:41077): stdout {"timestamp":1721758252,"level":"INFO","function":"initialize","line":502,"message":"initializing slots","n_slots":1}
api-1  | 6:10PM DBG GRPC(Meta-Llama-3-8B-Instruct.Q4_0.gguf-127.0.0.1:41077): stdout {"timestamp":1721758252,"level":"INFO","function":"initialize","line":511,"message":"new slot","slot_id":0,"n_ctx_slot":8192}
api-1  | 6:10PM INF [llama-cpp] Loads OK
api-1  | 6:10PM DBG GRPC(Meta-Llama-3-8B-Instruct.Q4_0.gguf-127.0.0.1:41077): stdout {"timestamp":1721758252,"level":"INFO","function":"launch_slot_with_data","line":884,"message":"slot is processing task","slot_id":0,"task_id":0}
api-1  | 6:10PM DBG GRPC(Meta-Llama-3-8B-Instruct.Q4_0.gguf-127.0.0.1:41077): stderr sampling: 
api-1  | 6:10PM DBG GRPC(Meta-Llama-3-8B-Instruct.Q4_0.gguf-127.0.0.1:41077): stdout {"timestamp":1721758252,"level":"INFO","function":"update_slots","line":1785,"message":"kv cache rm [p0, end)","slot_id":0,"task_id":0,"p0":0}
api-1  | 6:10PM DBG GRPC(Meta-Llama-3-8B-Instruct.Q4_0.gguf-127.0.0.1:41077): stderr 	repeat_last_n = 0, repeat_penalty = 0.000, frequency_penalty = 0.000, presence_penalty = 0.000
api-1  | 6:10PM DBG GRPC(Meta-Llama-3-8B-Instruct.Q4_0.gguf-127.0.0.1:41077): stderr 	top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.900
api-1  | 6:10PM DBG GRPC(Meta-Llama-3-8B-Instruct.Q4_0.gguf-127.0.0.1:41077): stderr 	mirostat = 2, mirostat_lr = 0.100, mirostat_ent = 5.000
api-1  | 6:10PM DBG GRPC(Meta-Llama-3-8B-Instruct.Q4_0.gguf-127.0.0.1:41077): stderr check_double_bos_eos: Added a BOS token to the prompt as specified by the model but the prompt also starts with a BOS token. So now the final prompt starts with 2 BOS tokens. Are you sure this is what you want?
api-1  | 6:10PM DBG Sending chunk: {"created":1721758237,"object":"chat.completion.chunk","id":"3dee1239-0adb-4b4e-bce3-182739ec8fd3","model":"llama3-8b-instruct","choices":[{"index":0,"finish_reason":"","delta":{"content":"H"}}],"usage":{"prompt_tokens":0,"completion_tokens":0,"total_tokens":0}}
api-1  | 
api-1  | 6:10PM DBG Sending chunk: {"created":1721758237,"object":"chat.completion.chunk","id":"3dee1239-0adb-4b4e-bce3-182739ec8fd3","model":"llama3-8b-instruct","choices":[{"index":0,"finish_reason":"","delta":{"content":"e"}}],"usage":{"prompt_tokens":0,"completion_tokens":0,"total_tokens":0}}
api-1  | 
api-1  | 6:10PM DBG Sending chunk: {"created":1721758237,"object":"chat.completion.chunk","id":"3dee1239-0adb-4b4e-bce3-182739ec8fd3","model":"llama3-8b-instruct","choices":[{"index":0,"finish_reason":"","delta":{"content":"l"}}],"usage":{"prompt_tokens":0,"completion_tokens":0,"total_tokens":0}}

@titlestad

What do you mean it's working? @mudler, it's not working here. Can you say why it is working so I may be able to fix my setup?
