
GPU memory not cleaned up after off-loading layers to GPU using n_gpu_layers #223

Open
4 tasks done
nidhishs opened this issue May 17, 2023 · 36 comments
Labels
bug Something isn't working hardware Hardware specific issue llama.cpp Problem with llama.cpp shared lib

Comments

@nidhishs

Prerequisites

Please answer the following questions for yourself before submitting an issue.

  • I am running the latest code. Development is very rapid so there are no tagged versions as of now.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new bug or useful enhancement to share.

Expected Behavior

Please provide a detailed written description of what you were trying to do, and what you expected llama-cpp-python to do.

def generate_text(model_id:str, prompt:str) -> str:
    llm = LlamaCpp(model_path=f'./weights/{model_id}.bin', n_gpu_layers=40)
    output = llm(prompt)
    return output

After this function returns, the llm object should go out of scope and release the GPU memory it occupies.

Current Behavior

Please provide a detailed written description of what llama-cpp-python did, instead.
The llm object does not clean up after itself: it still occupies GPU memory after the function returns, and the memory is only released when the Python process terminates.

Environment and Context

Please provide detailed information about your computer setup. This is important in case the issue is not reproducible except for under certain specific conditions.

  • Physical (or virtual) hardware you are using, e.g. for Linux:
Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         48 bits physical, 48 bits virtual
  Byte Order:            Little Endian
CPU(s):                  16
  On-line CPU(s) list:   0-15
Vendor ID:               AuthenticAMD
  Model name:            AMD Ryzen 7 5800X 8-Core Processor
    CPU family:          25
    Model:               33
    Thread(s) per core:  2
    Core(s) per socket:  8
    Socket(s):           1
    Stepping:            2
    Frequency boost:     enabled
    CPU(s) scaling MHz:  58%
    CPU max MHz:         4850.1948
    CPU min MHz:         2200.0000
    BogoMIPS:            7602.85
    Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr rdpru wbnoinvd arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif v_spec_ctrl umip pku ospke vaes vpclmulqdq rdpid overflow_recov succor smca fsrm
Virtualization features: 
  Virtualization:        AMD-V
Caches (sum of all):     
  L1d:                   256 KiB (8 instances)
  L1i:                   256 KiB (8 instances)
  L2:                    4 MiB (8 instances)
  L3:                    32 MiB (1 instance)
NUMA:                    
  NUMA node(s):          1
  NUMA node0 CPU(s):     0-15
Vulnerabilities:         
  Itlb multihit:         Not affected
  L1tf:                  Not affected
  Mds:                   Not affected
  Meltdown:              Not affected
  Mmio stale data:       Not affected
  Retbleed:              Not affected
  Spec store bypass:     Mitigation; Speculative Store Bypass disabled via prctl
  Spectre v1:            Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2:            Mitigation; Retpolines, IBPB conditional, IBRS_FW, STIBP always-on, RSB filling, PBRSB-eIBRS Not affected
  Srbds:                 Not affected
  Tsx async abort:       Not affected
  • Operating System, e.g. for Linux:

Linux name 6.2.13-arch1-1 #1 SMP PREEMPT_DYNAMIC Wed, 26 Apr 2023 20:50:14 +0000 x86_64 GNU/Linux

@gjmulder gjmulder added bug Something isn't working hardware Hardware specific issue llama.cpp Problem with llama.cpp shared lib labels May 17, 2023
@iactix

iactix commented May 21, 2023

I am encountering this too. Windows, CuBLAS, AMD CPU, RTX 1080. When the llm model is destroyed you get the RAM back, but the VRAM stays occupied until the whole Python script using it quits. That means you run out of memory doing a second inference this way, which makes GPU acceleration unusable for me currently. Kind of a big deal, I think.

@gjmulder
Contributor

gjmulder commented May 21, 2023

Is this a bug then in how llama-cpp-python is managing the shared lib libllama or a bug within libllama?

I suspect the latter, which means a bug needs to be logged with llama.cpp that reproduces the issue as simply as possible.

Does someone have a few lines of python to reproduce the problem? Ideally with informative output of nvidia-smi before, during and after the bug occurrence.

EDIT: Sorry, I see that the OP did provide some code. What happens if you do a:

del llm

before returning? My python-fu isn't that strong, but I suspect you need to explicitly destroy the object to destroy the reference to the code running on the GPU. Of course this looks horribly inefficient as every time the method is being called the model needs to be reloaded on the GPU.
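For reference, here is a minimal sketch of that suggestion applied to the OP's wrapper (assuming the same LlamaCpp class as in the original snippet; whether its finalizer actually frees VRAM is exactly what this issue is about):

import gc

def generate_text(model_id: str, prompt: str) -> str:
    llm = LlamaCpp(model_path=f'./weights/{model_id}.bin', n_gpu_layers=40)
    output = llm(prompt)
    # Drop the only reference and force a collection so the object's
    # finalizer (and the underlying llama.cpp context) can run right away.
    del llm
    gc.collect()
    return output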

@nidhishs
Author

I’ll make a minimal example and update you.

@iactix

iactix commented May 21, 2023

By now I am doing

llm.reset()
llm.set_cache(None)
llm = None
del llm
llm = None

Changes nothing. RAM goes down, but VRAM stays up.

ggerganov/llama.cpp#1456

@nidhishs
Author

This snippet should do the job:

from llama_cpp import Llama
import gc
import os

def measure_resources(func):
    def get_ram_usage(pid):
        ram = os.popen(f'pmap {pid} | tail -1').read().strip()
        return ram.split(' ')[-1]
    
    def get_gpu_usage(pid):
        gpu = os.popen(f'nvidia-smi --query-compute-apps=pid,used_memory --format=csv | grep {pid}').read().strip()
        return gpu.split(', ')[-1] if gpu else '0 MiB'

    def wrapper():
        pid = os.getpid()
        print('pid:', pid)
        pre_ram, pre_gpu = get_ram_usage(pid), get_gpu_usage(pid)
        print('pre_ram:', pre_ram, 'pre_gpu:', pre_gpu)
        func()
        post_ram, post_gpu = get_ram_usage(pid), get_gpu_usage(pid)
        print('post_ram:', post_ram, 'post_gpu:', post_gpu)

    return wrapper

@measure_resources
def generate_text():
    llm = Llama(model_path='./weights/oasst-30b.bin', n_gpu_layers=40)
    del llm
    gc.collect()

if __name__ == '__main__':
    generate_text()

Output:

pid: 13121
pre_ram: 720676K pre_gpu: 0 MiB
llama.cpp: loading model from ./weights/oasst-30b.bin
llama_model_load_internal: format     = ggjt v2 (latest)
llama_model_load_internal: n_vocab    = 32016
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 6656
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 52
llama_model_load_internal: n_layer    = 60
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 9 (mostly Q5_1)
llama_model_load_internal: n_ff       = 17920
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 30B
llama_model_load_internal: ggml ctx size = 135.75 KB
llama_model_load_internal: mem required  = 25573.29 MB (+ 3124.00 MB per state)
llama_model_load_internal: [cublas] offloading 40 layers to GPU
llama_model_load_internal: [cublas] total VRAM used: 15307 MB
llama_init_from_file: kv self size  =  780.00 MB
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 | 
post_ram: 25209048K post_gpu: 16074 MiB

Interestingly, RAM usage also doesn't go down until the process is terminated? Perhaps I'm missing something.

@gjmulder
Contributor

gjmulder commented May 21, 2023

Good repro! I patched it to use $MODEL.

You probably want to append to the bug llama.cpp/issues/1456, but they may ask which llama.cpp commit, which is non-obvious from the:

$ pip install --force-reinstall --ignore-installed --no-cache-dir --verbose llama-cpp-python

$ git diff repro*py
diff --git a/repro.py b/repro2.py
index 03f365a..d628b1d 100644
--- a/repro.py
+++ b/repro2.py
@@ -24,7 +24,7 @@ def measure_resources(func):
 
 @measure_resources
 def generate_text():
-    llm = Llama(model_path='./weights/oasst-30b.bin', n_gpu_layers=40)
+    llm = Llama(model_path=os.environ.get("MODEL"), n_gpu_layers=40)
     del llm
     gc.collect()
 
$ pip list | grep llama-cpp-python
llama-cpp-python              0.1.52

$ python ./repro2.py 
pid: 1750642
pre_ram: 718468K pre_gpu: 0 MiB
llama.cpp: loading model from /data/llama/7B/ggml-model-f16.bin
llama_model_load_internal: format     = ggjt v1 (pre #1405)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 1 (mostly F16)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =  72.75 KB
llama_model_load_internal: mem required  = 14645.09 MB (+ 1026.00 MB per state)
llama_model_load_internal: [cublas] offloading 32 layers to GPU
llama_model_load_internal: [cublas] offloading output layer to GPU
llama_model_load_internal: [cublas] total VRAM used: 12602 MB
llama_init_from_file: kv self size  =  256.00 MB
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 | 
post_ram: 26360776K post_gpu: 13236 MiB

@nidhishs
Author

oobabooga/text-generation-webui/#2087 was able to fix the RAM not being released. Have we already integrated this change?

@iactix

iactix commented May 26, 2023

oobabooga/text-generation-webui/#2087 was able to fix the RAM not being released. Have we already integrated this change?

It seems to me they only fixed the RAM, not the VRAM? RAM was already freed when I tried this. Anyway, it clearly seems to be a llama.cpp problem, and I don't know how this can still be open after a week or so; the fix must be a single line or something over there. As a workaround, I could imagine wrapping the model usage entirely in a thread and killing that after use to force it to free everything, like it does when the Python script exits, but I have not tested it. I doubt it is a fix llama-cpp-python could actually implement, though.

@nidhishs
Author

Yeah, they fixed the RAM issue. I have been following the thread on llama.cpp, and it seems the author of the GPU implementation was able to fix it after reproducing it with my snippet. Not sure when the fix will be pushed, though.

@iactix

iactix commented May 26, 2023

But it is a VRAM issue.

@nidhishs
Author

Yes, the author was able to clean up the VRAM. Check the thread in issue #1456.

@iactix

iactix commented May 26, 2023

llama_free function works well for cpu ram.
For vram, still not work.

Is the last thing I see there

Edit: It is possible I misinterpreted the most recent comment there, I don't know what they tested. The guy above indeed says the issue is fixed in his branch. Hopefully there will just be a fix in llama.cpp soon?

@gjmulder
Contributor

gjmulder commented May 28, 2023

The fix looks to be available in an upstream PR, which also adds a new llama.cpp CLI arg --tensor-split that will need to be supported by llama-cpp-python:

@JohannesGaessler:
I added a fix in this PR ggerganov/llama.cpp#1607 where I'm refactoring the CUDA code. However, I added a new CLI argument --tensor-split and because of that the Python script that I used to reproduce the memory leak seems to now be broken

CUDA error, out of memory when reload

@edp1096

edp1096 commented May 29, 2023

llama_free function works well for cpu ram.
For vram, still not work.

Is the last thing I see there

Edit: It is possible I misinterpreted the most recent comment there, I don't know what they tested. The guy above indeed says the issue is fixed in his branch. Hopefully there will just be a fix in llama.cpp soon?

I've tested save-load-state with the ngl param added, and my personal Go binding, on my single GPU.
Although I've not tested llama-cpp-python, I think the current version of llama-cpp-python should also work with only the dll/so file changed, regardless of whether the --tensor-split arg is supported.

@kdanielive

I tried manually installing llama-cpp-python with the llama.cpp from the PR here, both with and without the --tensor-split arg, but both resulted in a segmentation fault while loading the model.

Until llama-cpp-python gets updated, the best strategy when you need to reload multiple models right now might be to use the subprocess package to execute a separate Python script that loads the llama.cpp model and outputs the results. This successfully releases the GPU VRAM.
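A minimal sketch of that workaround (run_llama.py is a hypothetical helper script that loads the model, runs the prompt and prints the completion; the script name and arguments are illustrative, not part of llama-cpp-python):

import subprocess
import sys

def generate_in_subprocess(model_path: str, prompt: str) -> str:
    # All VRAM allocated by llama.cpp lives in the child process,
    # so it is released as soon as the child exits.
    result = subprocess.run(
        [sys.executable, 'run_llama.py', model_path, prompt],
        capture_output=True, text=True, check=True,
    )
    return result.stdout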

@chen369

chen369 commented May 31, 2023

I tried manually installing llama-cpp-python with the llama.cpp from the PR here, both with and without the --tensor-split arg, but both resulted in a segmentation fault while loading the model.

Until llama-cpp-python gets updated, the best strategy when you need to reload multiple models right now might be to use the subprocess package to execute a separate Python script that loads the llama.cpp model and outputs the results. This successfully releases the GPU VRAM.

Yeah, this is what I was doing as a workaround.
Thanks,

@iactix

iactix commented Jun 11, 2023

This may be somewhat fixed in the latest llama-cpp-python version. The VRAM goes down when the model is unloaded. However, the dedicated GPU memory usage does not return to the same level as before the first load, and it still goes down further when the Python script terminates. But when loading the model again, it at least returns to the same usage as before, so it should not run out of VRAM anymore, as far as I can tell. Really, though, the VRAM usage should go away completely when unloading; the reason for unloading is that you want to make that VRAM available to something else.

@JohannesGaessler

The llama.cpp CUDA code allocates static memory buffers for holding temporary results. This is done to avoid having to allocate memory during the computation, which would be much slower. So that is most likely the reason VRAM is not completely freed until the process exits. The static buffers currently scale with batch size, so that parameter can be lowered to reduce VRAM usage.
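If that is the cause, lowering the batch size when constructing the model should shrink those buffers; a minimal sketch using the n_batch parameter of llama-cpp-python's Llama constructor (the exact saving depends on the model and build):

from llama_cpp import Llama

# A smaller n_batch means smaller temporary CUDA buffers,
# at the cost of slower prompt processing.
llm = Llama(model_path='./weights/oasst-30b.bin', n_gpu_layers=40, n_batch=128)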

@iactix

iactix commented Jun 11, 2023

I don't understand how it can even keep any buffers if I delete the model, and even if that is possible, it should not be allowed to do it. I realize it keeps its memory while I have the model created, but when I do not, there should not be any trace of me even using llama-cpp-python.

So, maybe a use case helps. My AI server runs all the time, but I kick the model out of memory if I haven't used it for 10 minutes. If it keeps stuff in memory (RAM or VRAM), this is a problem when I want to play Diablo 4.

@JohannesGaessler

It has to do with how the CUDA code works. The memory is not tied to a specific model object but rather it is tied to global static variables.

@iactix

iactix commented Jun 11, 2023

I see. I hope it can be forced to release its memory without relying on the process quitting; otherwise that sounds pretty incompetent on NVIDIA's side. I really don't want to wrap it in its own process just to work around what I would consider a serious memory leak. Or maybe there is something llama-cpp-python can stop holding on to, to trigger full destruction, I don't know.

@JohannesGaessler

It doesn't have anything to do with what NVIDIA did; it's a consequence of the llama.cpp code. There is a global memory buffer pool and a global scratch buffer that are not tied to a specific model.
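As a rough Python analogy of what is being described (this is not the llama.cpp code, just an illustration of memory owned by module-level globals surviving deletion of the objects that used it):

import gc

_buffer_pool = []  # module-level pool, analogous to llama.cpp's global CUDA buffers

class Model:
    def __init__(self):
        # The buffer is stored in the global pool, not on the instance.
        _buffer_pool.append(bytearray(100 * 1024 * 1024))

m = Model()
del m
gc.collect()
# The 100 MB buffer is still referenced by _buffer_pool, so it is only
# released when the pool is cleared explicitly or the process exits.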

@iactix

iactix commented Jun 11, 2023

That's great, that makes it solvable. May I suggest a "cleanup" call in the API or something?

@JohannesGaessler

JohannesGaessler commented Jun 11, 2023

Right now using multiple models in the same process won't work correctly anyways. I'll include a fix that just frees the buffers upon model deletion the next time I make a PR.

@iactix

iactix commented Jun 11, 2023

Thank you, sounds great! <3

@iactix

iactix commented Jun 30, 2023

Still basically a memory leak issue for 1 1/2 months now.

@eugen-ajechiloae-clearml

Hey guys, in case you have CPU memory issues, check out this issue ggerganov/llama.cpp#2145.
As a temporary workaround until this is fixed officially, you could use this fork https://github.com/eugen-ajechiloae-clearml/llama.cpp:

git clone https://github.com/abetlen/llama-cpp-python.git
cd llama-cpp-python/vendor
rmdir llama.cpp/
git clone https://github.com/eugen-ajechiloae-clearml/llama.cpp
cd ../..
CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python/

@iactix

iactix commented Jul 19, 2023

It is a GPU memory issue. VRAM rises just from importing llama-cpp-python. It is not a lot, but in my book that's a no-go already. Then when I load a model with BLAS (CUDA) and a few layers and do inference, VRAM goes to 5 GB. Fine. Then I delete/unload the model and it goes down to 2.5 GB VRAM usage. Terminate the Python process and it goes down to 1.1 GB VRAM usage, when nothing should go down at that point because the model was already deleted. Does it really take two months to add some function that frees some kind of sloppy shared resources? By now I am getting the impression this is deemed perfectly cool behavior by llama.cpp, and llama-cpp-python probably can't do anything about it due to how the bindings work.

@JohannesGaessler

VRAM rises just from importing llama-cpp-python.

That is 100% a llama-cpp-python issue, most likely from the eager initialization of the CUDA backend.

Then I delete/unload the model and it goes down to 2.5 GB VRAM usage. Terminate the Python process and it goes down to 1.1 GB VRAM usage.

That is very likely due to the buffer pool for prompt processing, see ggerganov/llama.cpp#1935 , will be fixed by ggerganov/llama.cpp#2160 .

Does it really take two months to add some function that frees some kind of sloppy shared resources? By now I am getting the impression this is deemed perfectly cool behavior by llama.cpp

I'm doing this as a hobby and I don't particularly care about the use cases of other people. I personally only use llama.cpp from the command line or the native server. So I'm not going to spend my time on a temporary fix that manages the deallocation of the buffer pool when the proper fix would be to implement kernels that don't need temporary buffers in the first place. If someone else does care they can make a PR for it.

@iactix

iactix commented Jul 19, 2023

I mean, if there is a global buffer pool, would it even be sloppy to give it something like a "flush" function that llama-cpp-python could call?

@JohannesGaessler

Feel free to implement it. As I said, I'm not going to spend my time on a temporary solution.

@iactix

iactix commented Jul 19, 2023

Yeah, I understand that completely. I didn't mean to sound ungrateful either, and I know I can't demand anything if I'm not going to do it myself. But there are also people in better positions to do it with less effort, and I guess I just don't understand why such a well-managed project doesn't prioritize fixing something like that. I know I can't technically call it a leak on llama.cpp's end, but apparently the bindings can't fix it either, and in combination it's pretty much a GPU memory leak. Also, thanks for your GPU inference work, it's pretty cool.

@JohannesGaessler

I just don't understand why such a well-managed project doesn't prioritize fixing something like that.

It's just a matter of manpower. I just work on the things that I want for myself when I feel like it and the only significant CUDA infrastructure contributor other than me is slaren.

@iactix

iactix commented Jul 19, 2023

I see. I mean I have never made a pull request in my life but maybe I will actually look into it.

@JohannesGaessler

If you do you should invoke clearing the buffer pool at the same time that the VRAM scratch buffer gets deallocated.

@iactix

iactix commented Jul 22, 2023

I am sorry to report that I did in fact opt not to go for a temporary fix, since who knows what the next tool I use decides to keep in a global buffer. So I wrapped all my llama-cpp-python stuff in a process wrapper. Here is the code, in case a second person wants to use the VRAM for Stable Diffusion or something:

import multiprocessing
import time

#example worker, where you would put your stuff and report back
def worker_func(input_queue, output_queue):
    while True:
        task = input_queue.get()
        if task["command"] == "exit":
            break
        elif task["command"] == "input":
            result = task["data"] #whatever
            output_queue.put({"command":"result", "data": result})
            if input_queue.empty():
                output_queue.put({"command":"status", "data": "idle"})

class WorkerProcess:
    def __init__(self, worker_func, idle_timeout):
        self.worker_func = worker_func
        self.process = None
        self.status = "stopped"
        self.input_queue = multiprocessing.Queue()
        self.output_queue = multiprocessing.Queue()
        self.results = []
        self.last_task_start_time = None  # set when a task starts; used for the idle timeout check in update()
        self.idle_timeout = idle_timeout  # The maximum allowed idle time in seconds

    def start_worker(self):
        if self.status != "stopped":
            return
        if self.process == None:
            self.process = multiprocessing.Process(target=self.worker_func, args=(self.input_queue, self.output_queue))
            self.process.start()
            self.status = "idle"

    def stop_worker(self):
        if self.status != "idle":
            # only idle worker can be stopped to prevent blocking for now
            return
        if self.process != None:
            if self.process.is_alive():
                self.input_queue.put({"command":"exit"})
            self.process.join()
            self.process = None
            self.status = "stopped"

    def start_task(self, input_data):
        if self.status == "busy":
            return
        if self.status == "stopped":
            self.start_worker()
        self.input_queue.put({"command":"input", "data": input_data})
        self.status = "busy"
        self.last_task_start_time = time.time()  # Update the last task start time

    def get_status(self):
        return self.status
    
    def update(self):
        if self.process != None and not self.process.is_alive():
            self.process.join()
            self.process = None

        while self.output_queue.empty() != True:
            o = self.output_queue.get()
            if o["command"] == "status":
                if self.status != "stopped":
                    self.status = o["data"]
            if o["command"] == "result":
                self.results.append(o["data"])
        
        # Check if the worker has been idle for longer than the allowed timeout
        if self.status == "idle" and self.last_task_start_time is not None and time.time() - self.last_task_start_time > self.idle_timeout:
            self.stop_worker()

    def has_result(self):
        if len(self.results) > 0:
            return True
        return False
        
    def get_result(self):
        if len(self.results) > 0:
            return self.results.pop(0)
        return None

I guess it should work for streaming updates too once that works correctly.
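For completeness, a possible driver loop for the wrapper above (illustrative only; the example worker_func just echoes its input):

if __name__ == '__main__':
    wp = WorkerProcess(worker_func, idle_timeout=600)  # stop the worker after 10 minutes idle
    wp.start_task('hello')
    while not wp.has_result():
        wp.update()
        time.sleep(0.1)
    print(wp.get_result())
    wp.update()       # drain any remaining status messages
    wp.stop_worker()  # or leave it to the idle timeout handled in update()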
