In `ggml-cuda.cu`, the line `struct ggml_tensor_extra_gpu * extra = new ggml_tensor_extra_gpu;` allocates CPU memory that is not freed anywhere.
Steps to Reproduce
# docker image capable of running llama.cpp; you could also run the remaining instructions on your local machine
docker run -it --rm --ipc host --network host --gpus all nvcr.io/nvidia/pytorch:22.11-py3 bash
# install
git clone https://github.com/abetlen/llama-cpp-python.git
cd llama-cpp-python/vendor
rmdir llama.cpp/  # the vendored submodule directory is empty, so rmdir suffices
git clone https://github.com/ggerganov/llama.cpp.git
cd ../..
# download model
python3 llama-cpp-python/docker/open_llama/hug_model.py -a SlyEcho -s open_llama_3b -f "q5_1"
# build
CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python/
Expected Behavior
llama.cpp should not leak memory when compiled with LLAMA_CUBLAS=1
Current Behavior
llama.cpp leaks memory when compiled with LLAMA_CUBLAS=1
Environment and Context
You can find my environment below, but we were able to reproduce this issue on multiple machines.
CPU: AMD Ryzen 7 3700X 8-Core Processor
GPU: NVIDIA GeForce RTX 2070 Super
OS: Ubuntu 22.04.1
Python3 version: 3.10.6
Make version: 4.3
g++ version: 11.3.0
Failure Information
In `ggml-cuda.cu`, the functions `ggml_cuda_transform_tensor` (llama.cpp/ggml-cuda.cu, line 3110 in 061f5f8) and `ggml_cuda_assign_buffers_impl` (llama.cpp/ggml-cuda.cu, line 3194 in 061f5f8) allocate CPU memory that is not freed anywhere.
Steps to Reproduce
Then run the following python script:
You will notice that the CPU memory used by the program increases slowly but steadily.
Proposed fix
Allocate `extra` (the `extra` field in llama.cpp/ggml.h, line 430 in 061f5f8) in a way that allows it to be freed. See #2146.