
feat: Enable GPU acceleration #425

Closed · wants to merge 2 commits

Conversation

@maozdemir (Contributor) commented May 23, 2023

This pull request enables the GPU to be used with privateGPT (along with some markdownlint enhancements).

Fixes #121
Fixes #306
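For illustration, the change boils down to reading a GPU flag from the .env file and passing an n_gpu_layers value through to langchain's LlamaCpp wrapper. A minimal sketch of that wiring (not the exact diff; the IS_GPU_ENABLED variable name and the fixed layer count are illustrative, while calculate_layer_count is the helper that comes up later in this thread):

# Sketch only: shows the shape of the change, not the exact code in this PR.
import os

from dotenv import load_dotenv
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
from langchain.llms import LlamaCpp

load_dotenv()

model_path = os.environ.get("MODEL_PATH")
model_n_ctx = int(os.environ.get("MODEL_N_CTX", 1000))
is_gpu_enabled = os.environ.get("IS_GPU_ENABLED", "false").lower() == "true"


def calculate_layer_count() -> int:
    # Placeholder for the PR's free-VRAM-based estimate (sketched further down).
    return 20


callbacks = [StreamingStdOutCallbackHandler()]
llm = LlamaCpp(
    model_path=model_path,
    n_ctx=model_n_ctx,
    callbacks=callbacks,
    verbose=False,
    n_gpu_layers=calculate_layer_count() if is_gpu_enabled else 0,
)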

@maozdemir (Contributor Author)

I did notice the duplicate in README.md. Will correct that.

@johnbrisbin

That looks great. I have a few questions though.

  1. Rather than scraping nvidia-smi, have you considered using pycuda? It is simple to get free memory as a plain number from the API, though admittedly the scraping is very concise, if a little brittle. (A sketch of this, together with point 8, follows after this comment.)
  2. Similarly, you can conditionalize is_gpu_enabled on finding a working cuda interface since the underlying LLM code might misbehave if it found no GPU but the flag was set.
  3. GPT4ALL: I would print the warning but allow the GPT4ALL to continue execution rather than failing because the GPU was flagged on.
  4. I would consider having two models in the prefs. One for use with GPU and a more modest one for CPU based ops. That way you could put the program and your data on a drive and take it with you, with no reconfiguration, running it on your laptop or other GPU limited machine when away from home.
  5. The estimation of memory used is one way to handle it, but as you noted in the comment you are depending on an estimate for a particular model that probably won't transfer all that well to other models either for the layers or the base memory cost of the LLM.
  6. BTW, which model(s) have you successfully tested with this code?
  7. Did you try the Embeddings GPU setting with the ingest process? It should make a big difference. I hope.
  8. For the LlamaCpp there is a threads parameter that you can also pass that helps quite a bit with the performance of the part that is not accelerated by the GPU or if the GPU is not used at all. Adding -> n_threads=psutil.cpu_count(logical=False) <- to the parameter list will allow it to use the number of cores (not threads) as the thread count for the LLM.
  9. Good addition for the ## Using GPU acceleration. You picked an 11.8 version of cuda-kit which is a few steps back from what Nvidia would have you install from their site. Is there a reason to prefer this one?

Ok, that is more than a few questions, but I really do like what you have done.
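For reference, a minimal sketch of what points 1 and 8 above could look like in code (pycuda for free VRAM instead of shelling out to nvidia-smi, psutil for the physical core count; the helper name is illustrative):

# Sketch of points 1 and 8: query free VRAM via the CUDA driver API instead of
# parsing nvidia-smi output, and derive a conservative thread count from
# physical cores rather than logical threads.
import psutil
import pycuda.driver as cuda


def free_vram_mib(device_index: int = 0) -> int:
    # Free VRAM in MiB, straight from the driver, no shelling out.
    cuda.init()
    ctx = cuda.Device(device_index).make_context()
    try:
        free_bytes, _total_bytes = cuda.mem_get_info()
    finally:
        ctx.pop()
    return free_bytes // (1024 * 1024)


# Real core count, not virtual thread count (point 8).
n_threads = psutil.cpu_count(logical=False)

if __name__ == "__main__":
    print(f"free VRAM: {free_vram_mib()} MiB, n_threads: {n_threads}")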

@johnbrisbin commented May 24, 2023

In your README.md section, it looks like you need a:
cd llama-cpp-python
just before the:
$Env:CMAKE_ARGS="-DLLAMA_CUBLAS=on"; $Env:FORCE_CMAKE=1; py ./setup.py install
else setup.py is not found.
The build that was to occur there also failed because it did not like my cmake version. It failed to enlighten me as to the necessary version, however.
Updating to the current version of cmake was sufficient to get a good build of "llama-cpp-python==0.1.54"

@maozdemir (Contributor Author) commented May 24, 2023

@johnbrisbin

  1. Rather than scraping nvidia-smi, have you considered using pycuda? It is simple to get free memory as a plain number from the API, though admittedly the scraping is very concise, if a little brittle.

That could be done, but that would also add yet another requirement to the project, which I imagine is not great for users who won't be utilising their GPU.

  2. Similarly, you can conditionalize is_gpu_enabled on finding a working cuda interface since the underlying LLM code might misbehave if it found no GPU but the flag was set.

That's a good point, I'll be looking into foolproofing that.

  3. GPT4ALL: I would print the warning but allow the GPT4ALL to continue execution rather than failing because the GPU was flagged on.

If the user wants to run a GPU environment, running GPT4All is simply pointless, so maybe I should even move the check to the beginning of the file to prevent time lost loading the embeddings etc. It's more than a warning: it deliberately fails the script to get the user's attention.

  4. I would consider having two models in the prefs. One for use with GPU and a more modest one for CPU based ops. That way you could put the program and your data on a drive and take it with you, with no reconfiguration, running it on your laptop or other GPU limited machine when away from home.

That is also a good idea; I'm thinking of adding GPU_MODEL_PATH to the environment variables file to accomplish that.
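A sketch of what that could look like (GPU_MODEL_PATH is the proposed, not-yet-merged variable name; the dotenv usage mirrors how privateGPT already reads its settings):

# Sketch: pick a GPU-specific model when one is configured and the GPU flag is
# on, otherwise fall back to the regular MODEL_PATH.
import os

from dotenv import load_dotenv

load_dotenv()

is_gpu_enabled = os.environ.get("IS_GPU_ENABLED", "false").lower() == "true"
cpu_model_path = os.environ.get("MODEL_PATH")
gpu_model_path = os.environ.get("GPU_MODEL_PATH")  # proposed variable

model_path = gpu_model_path if (is_gpu_enabled and gpu_model_path) else cpu_model_path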

  5. The estimation of memory used is one way to handle it, but as you noted in the comment you are depending on an estimate for a particular model that probably won't transfer all that well to other models either for the layers or the base memory cost of the LLM.

I don't think there is any other feasible way of accomplishing it, or of calculating and adjusting the GPU layers, without rewriting langchain's llama-cpp implementation, though I do understand that this should not be overlooked. The current implementation in the proposal tries to get the most out of the GPU in an admittedly questionable way.
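For what it's worth, the estimate amounts to something like the following (a sketch only: the base-cost and per-layer figures are model-specific guesses, which is exactly the weakness discussed above):

# Rough sketch of a calculate_layer_count()-style heuristic: reserve an
# estimated base cost from the free VRAM and divide the remainder by an
# estimated per-layer cost. Both constants are guesses tuned against one model
# and will not transfer well to other models.
def calculate_layer_count(free_vram_mib: int,
                          base_cost_mib: int = 1500,
                          per_layer_mib: int = 160,
                          max_layers: int = 40) -> int:
    usable = free_vram_mib - base_cost_mib
    if usable <= 0:
        return 0
    return min(max_layers, usable // per_layer_mib)


print(calculate_layer_count(free_vram_mib=4096))  # -> 16 with these guesses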

  6. BTW, which model(s) have you successfully tested with this code?

I have tested with:

All of the models here that have the GGML tag should work.

  7. Did you try the Embeddings GPU setting with the ingest process? It should make a big difference. I hope.

It makes almost no difference.
With CUDA enabled:

TotalSeconds      : 173.907001
TotalMilliseconds : 173907.001

Without CUDA enabled:

TotalSeconds      : 192.4191741
TotalMilliseconds : 192419.1741

This might be down to my cute little GTX 965M, though. I've implemented it in 76f042a regardless. Further testing is welcome.
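For reference, the ingest-side change is essentially just a device argument on the embeddings, along these lines (a sketch assuming privateGPT's usual env-variable names; the actual change is in 76f042a):

# Sketch: run the SentenceTransformer embeddings on CUDA when the GPU flag is
# set, otherwise on the CPU. How much this helps depends on the GPU and on how
# large the ingest really is.
import os

from dotenv import load_dotenv
from langchain.embeddings import HuggingFaceEmbeddings

load_dotenv()

embeddings_model_name = os.environ.get("EMBEDDINGS_MODEL_NAME", "all-MiniLM-L6-v2")
device = "cuda" if os.environ.get("IS_GPU_ENABLED", "false").lower() == "true" else "cpu"

embeddings = HuggingFaceEmbeddings(
    model_name=embeddings_model_name,
    model_kwargs={"device": device},
)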

  8. For the LlamaCpp there is a threads parameter that you can also pass that helps quite a bit with the performance of the part that is not accelerated by the GPU or if the GPU is not used at all. Adding -> n_threads=psutil.cpu_count(logical=False) <- to the parameter list will allow it to use the number of cores (not threads) as the thread count for the LLM.

I avoided adding that because people running on low resources would be affected. On low-end computers like mine (with an i7-6700HQ) the machine can become nearly unusable (my laptop crashes when it sits at 100% for too long, and no, it's not a cooling issue). Plus, since this PR is mostly about GPU acceleration/utilisation, I doubt this is the place to implement that?

  9. Good addition for the ## Using GPU acceleration. You picked an 11.8 version of cuda-kit which is a few steps back from what Nvidia would have you install from their site. Is there a reason to prefer this one?

PyTorch currently supports only CUDA 11.7 and 11.8. In order not to break anything (like user environments), 11.8 was the pick.

@maozdemir (Contributor Author)

In your README.md section, it looks like you need a: cd llama-cpp-python just before the: $Env:CMAKE_ARGS="-DLLAMA_CUBLAS=on"; $Env:FORCE_CMAKE=1; py ./setup.py install else setup.py is not found. The build that was to occur there also failed because it did not like my cmake version. It failed to enlighten me as to the necessary version, however. Updating to the current version of cmake was sufficient to get a good build of "llama-cpp-python==0.1.54"

Apparently I cd'ed back one directory too many... Thanks for the feedback.

@johnbrisbin

@maozdemir, thanks for responding. Looks like you might be in a different time zone.

  1. Rather than scraping nvidia-smi, have you considered using pycuda? It is simple to get free memory as a plain number from the API, though admittedly the scraping is very concise, if a little brittle.

That could be done, but that would also add yet another requirement to the project, which I imagine is not great for users who won't be utilising their GPU.

If the project is to use GPUs generally or well, pycuda is inevitable, if not strictly required today. In my LTH opinion, calling a library function is always preferable to invoking a shell; invoking a shell has a long history of introducing both security vulnerabilities and failure opportunities.

3. GPT4ALL: I would print the warning but allow the GPT4ALL to continue execution rather than failing because the GPU was flagged on.

If the user wants to run a GPU environment, running GPT4All is simply pointless, so maybe I should even move the check to the beginning of the file to prevent time lost loading the embeddings etc. It's more than a warning: it deliberately fails the script to get the user's attention.

I mention the two pref sets, CPU and GPU, later. Until we get somewhere like that, the user would be able to switch to GPU and back to CPU just by changing the model name, while the fail-it strategy will require two modifications each time, with the usual reduction in the likelihood of success.

8. For the LlamaCpp there is a threads parameter that you can also pass that helps quite a bit with the performance of the part that is not accelerated by the GPU or if the GPU is not used at all. Adding -> n_threads=psutil.cpu_count(logical=False) <- to the parameter list will allow it to use the number of cores (not threads) as the thread count for the LLM.

I avoided adding that because people running on low resources would be affected. On low-end computers like mine (with an i7-6700HQ) the machine can become nearly unusable (my laptop crashes when it sits at 100% for too long, and no, it's not a cooling issue). Plus, since this PR is mostly about GPU acceleration/utilisation, I doubt this is the place to implement that?

Since the thread-count setting proposed is tied to the capabilities of the machine (and set conservatively), it should not choke a machine that is capable of running an LLM with same-day service. Note that this is set to real core count, not virtual thread count; thus, on your machine it would use 4 threads. You could also add an optional env value for max threads... And it is a huge win on the machines that I run on.
As to the 'topicality' of the use-of-threads enhancement, you could always take advantage of the 'same line in the diff' rule (like the '5-second' rule) to ask for a dispensation.
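If the thread-count idea does go in, a conservative version with an escape hatch might look like this (MAX_THREADS is a hypothetical variable name, not something in the PR):

# Sketch: default to the physical core count, but let low-resource users cap
# it with an optional MAX_THREADS environment variable (hypothetical name).
import os

import psutil

physical_cores = psutil.cpu_count(logical=False) or 1
max_threads = os.environ.get("MAX_THREADS")
n_threads = min(physical_cores, int(max_threads)) if max_threads else physical_cores

# ...then pass n_threads=n_threads when constructing LlamaCpp(...).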

BTW, I have an idea for accurate determination of memory requirements on the GPU for setting the layer counts. I will let you know if it actually works.

Thanks for listening to my questions and considering my suggestions.

@Kaszanas commented May 24, 2023

I would love further instructions on how to exactly specify the model for GPU usage in the .env file.

When trying to run the GPU version, the ingest works fine but this does not:

python .\privateGPT.py
Using embedded DuckDB with persistence: data will be stored in: db
llama.cpp: loading model from ./models/ggml-gpt4all-j-v1.3-groovy.bin
Traceback (most recent call last):
  File "G:\Projects\1_Python\privateGPT\privateGPT.py", line 105, in <module>
    main()
  File "G:\Projects\1_Python\privateGPT\privateGPT.py", line 60, in main
    llm = LlamaCpp(model_path=model_path, n_ctx=model_n_ctx, callbacks=callbacks, verbose=False, n_gpu_layers=calculate_layer_count())
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "pydantic\main.py", line 341, in pydantic.main.BaseModel.__init__
pydantic.error_wrappers.ValidationError: 1 validation error for LlamaCpp
__root__
  Could not load Llama model from path: ./models/ggml-gpt4all-j-v1.3-groovy.bin. Received error [WinError -529697949] Windows Error 0xe06d7363 (type=value_error)
Exception ignored in: <function Llama.__del__ at 0x000001F48ED9B4C0>
Traceback (most recent call last):
  File "G:\Projects\1_Python\privateGPT\venv\Lib\site-packages\llama_cpp\llama.py", line 1076, in __del__
    if self.ctx is not None:
       ^^^^^^^^
AttributeError: 'Llama' object has no attribute 'ctx'


I have a feeling that there needs to be clear documentation for that.

@maozdemir (Contributor Author)

@Kaszanas Check the repo's README. https://github.com/maozdemir/privateGPT/tree/gpu

@johnbrisbin

I encountered another issue:
The version of torch required is higher than the one needed for non-CUDA support: you need a 2.x version to support CUDA 11.8.

@Kaszanas

@Kaszanas Check the repo's README. https://github.com/maozdemir/privateGPT/tree/gpu

Unfortunately the README doesn't explain that very well, sorry.

@maozdemir (Contributor Author)

@Kaszanas probably something went wrong during the compilation of llama-cpp-python; can you try uninstalling and reinstalling it?

@maozdemir (Contributor Author) commented May 24, 2023

@johnbrisbin can you use this wizard? https://pytorch.org/get-started/locally/

Also, I'll read your comment when I have time; I'm not ignoring it. :)

@Kaszanas

@maozdemir Compilation ran successfully, GPU ingest works as intended. This issue is only present when trying to run the privateGPT script. I could try and show you step by step but I don't know if I will be able to find the time.

Will let you know if I do.

@maozdemir (Contributor Author) commented May 24, 2023

@maozdemir Compilation ran successfully, GPU ingest works as intended. This issue is only present when trying to run the privateGPT script. I could try and show you step by step but I don't know if I will be able to find the time.

Will let you know if I do.

@Kaszanas well, the only time I saw that error was when I cloned the llama.cpp repo into the wrong directory... I'll be waiting for your feedback.

GPU ingesting is not related to the llama-cpp-python package or to llama.cpp; it uses Hugging Face's CUDA implementation. llama.cpp uses cuBLAS, which is run from privateGPT.py.

@Kaszanas

@maozdemir Compilation ran successfully, GPU ingest works as intended. This issue is only present when trying to run the privateGPT script. I could try and show you step by step but I don't know if I will be able to find the time.

Will let you know if I do.

@Kaszanas well, the only time I saw that error was when I cloned the llama.cpp repo into the wrong directory... I'll be waiting for your feedback.

I ran the commands straight from the README.
For GPU support it could be worthwhile to add information about installing PyTorch with CUDA enabled, as it seems to be required as well, and the requirements only include a CPU-enabled version. This is another step in the GPU setup process, I would imagine.

@StephenDWright

First of all, great contribution; I was looking out for this and was excited to see someone put it together so quickly. Unfortunately I haven't gotten it to use my GPU. I've deleted and re-pulled everything so many times, made sure to make the adjustments to .env and the script, and made sure to pull and build following your instructions. Everything goes smoothly, but it still uses my CPU instead of my GPU.

@maozdemir (Contributor Author)

First of all, great contribution; I was looking out for this and was excited to see someone put it together so quickly. Unfortunately I haven't gotten it to use my GPU. I've deleted and re-pulled everything so many times, made sure to make the adjustments to .env and the script, and made sure to pull and build following your instructions. Everything goes smoothly, but it still uses my CPU instead of my GPU.

Are you on an NVidia GPU?

@StephenDWright

Yes I am, currently a 12 GB 3060. I know you had to ask, because there will always be someone who will try to run it on a Radeon graphics card lol.

@maozdemir (Contributor Author)

@maozdemir Compilation ran successfully, GPU ingest works as intended. This issue is only present when trying to run the privateGPT script. I could try and show you step by step but I don't know if I will be able to find the time.
Will let you know if I do.

@Kaszanas well, the only time I saw that error was when I cloned the llama.cpp repo into the wrong directory... I'll be waiting for your feedback.

I ran the commands straight from the README. For GPU support it could be worthwhile to add information about installing PyTorch with CUDA enabled, as it seems to be required as well, and the requirements only include a CPU-enabled version. This is another step in the GPU setup process, I would imagine.

You are right, I should add:
pip3 install -U torch torchvision --index-url https://download.pytorch.org/whl/cu118

I am still investigating the issue you are having, testing on fresh Windows installs.

@StephenDWright: when you launch privateGPT.py, do you see CUBLAS=1 or CUBLAS=0 at the bottom of the model properties?

@StephenDWright commented May 25, 2023

@maozdemir I see BLAS = 0, which I assume is what you are referring to. This is the output to the terminal. Thanks for taking the time to troubleshoot, btw.

Using embedded DuckDB with persistence: data will be stored in: db
llama.cpp: loading model from models/ggml-vic13b-q5_1.bin
llama_model_load_internal: format = ggjt v2 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 1000
llama_model_load_internal: n_embd = 5120
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 40
llama_model_load_internal: n_layer = 40
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 9 (mostly Q5_1)
llama_model_load_internal: n_ff = 13824
llama_model_load_internal: n_parts = 1
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size = 90.75 KB
llama_model_load_internal: mem required = 11359.05 MB (+ 1608.00 MB per state)
llama_init_from_file: kv self size = 781.25 MB
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |

@maozdemir (Contributor Author)

@StephenDWright you're welcome; this will help me with writing a better README too :) so thanks for your feedback. The likely cause is that your llama-cpp-python was not compiled with cuBLAS. Can you try uninstalling the existing package and then reinstalling with the current instructions (with those environment variables etc.)?

I am not sure why people are having trouble; I have actually run this successfully on a clean Windows install, and also on several Linux machines...

@StephenDWright

Before I do that: I did it again yesterday, and this was some of the output from the build after running this command:
$Env:CMAKE_ARGS="-DLLAMA_CUBLAS=on"; $Env:FORCE_CMAKE=1; py ./setup.py install

I took this output to mean it was compiling with CUBLAS.

Extract of Terminal Output:

Not searching for unused variables given on the command line.
-- The C compiler identification is MSVC 19.35.32215.0
-- The CXX compiler identification is MSVC 19.35.32215.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: C:/Program Files (x86)/Microsoft Visual Studio/2022/BuildTools/VC/Tools/MSVC/14.35.32215/bin/Hostx86/x64/cl.exe - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: C:/Program Files (x86)/Microsoft Visual Studio/2022/BuildTools/VC/Tools/MSVC/14.35.32215/bin/Hostx86/x64/cl.exe - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Found Git: C:/Program Files/Git/cmd/git.exe (found version "2.39.1.windows.1")
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Failed
-- Looking for pthread_create in pthreads
-- Looking for pthread_create in pthreads - not found
-- Looking for pthread_create in pthread
-- Looking for pthread_create in pthread - not found
-- Found Threads: TRUE
-- Found CUDAToolkit: C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v11.8/include (found version "11.8.89")
-- cuBLAS found
-- The CUDA compiler identification is NVIDIA 11.8.89
-- Detecting CUDA compiler ABI info
-- Detecting CUDA compiler ABI info - done
-- Check for working CUDA compiler: C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v11.8/bin/nvcc.exe - skipped
-- Detecting CUDA compile features
-- Detecting CUDA compile features - done
-- CMAKE_SYSTEM_PROCESSOR: AMD64
-- x86 detected
-- GGML CUDA sources found, configuring CUDA architecture
-- Configuring done (13.4s)
-- Generating done (0.0s)
-- Build files have been written to: C:/Users/Stephen/Programming/PGPT/privateGPT/llama-cpp-python/_skbuild/win-amd64-3.11/cmake-build
[1/6] Generating build details from Git
-- Found Git: C:/Program Files/Git/cmd/git.exe (found version "2.39.1.windows.1")
[4/6] Building CUDA object vendor\llama.cpp\CMakeFiles\ggml.dir\ggml-cuda.cu.obj
ggml-cuda.cu
[5/6] Install the project...-- Install configuration: "Release"
-- Installing: C:/Users/Stephen/Programming/PGPT/privateGPT/llama-cpp-python/_skbuild/win-amd64-3.11/cmake-install/llama_cpp/llama.dll

copying llama_cpp\llama.py -> _skbuild\win-amd64-3.11\cmake-install\llama_cpp\llama.py
copying llama_cpp\llama_cpp.py -> _skbuild\win-amd64-3.11\cmake-install\llama_cpp\llama_cpp.py
copying llama_cpp\llama_types.py -> _skbuild\win-amd64-3.11\cmake-install\llama_cpp\llama_types.py
copying llama_cpp\__init__.py -> _skbuild\win-amd64-3.11\cmake-install\llama_cpp\__init__.py
creating directory _skbuild\win-amd64-3.11\cmake-install\llama_cpp/server
copying llama_cpp/server\app.py -> _skbuild\win-amd64-3.11\cmake-install\llama_cpp/server\app.py
copying llama_cpp/server\__init__.py -> _skbuild\win-amd64-3.11\cmake-install\llama_cpp/server\__init__.py
copying llama_cpp/server\__main__.py -> _skbuild\win-amd64-3.11\cmake-install\llama_cpp/server\__main__.py

running install

@maozdemir (Contributor Author) commented May 25, 2023

@StephenDWright alright, that doesn't seem to be the issue. Assuming you already have the CUDA drivers installed, the only thing that comes to mind is torch: pip3 install -U torch torchvision --index-url https://download.pytorch.org/whl/cu118

@johnbrisbin commented May 25, 2023

@johnbrisbin can you use this wizard? https://pytorch.org/get-started/locally/

Yes, I used that prior to commenting, and it worked. I was just pointing out an implicit requirement beyond the current privateGPT: a pre-2.0 PyTorch worked before, but for GPU support with CUDA 11.8 you need a 2.x version.
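A quick way to confirm which torch build is actually active in an environment (plain PyTorch API, nothing specific to this PR):

# Sanity check: which torch build is installed, and can it see the GPU?
import torch

print("torch version:", torch.__version__)       # 2.x is needed for the cu118 wheels
print("built for CUDA:", torch.version.cuda)     # e.g. "11.8"; None on CPU-only builds
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))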

@johnbrisbin commented May 25, 2023

@StephenDWright I worked through a similar problem yesterday.
The output spew when the model is loaded shows you do not have the correct LlamaCpp installed in your running context. When the right version is running, you will see cuBLAS info at the bottom, like this:
llama_model_load_internal: mem required = 9089.91 MB (+ 1608.00 MB per state)
llama_model_load_internal: [cublas] offloading 10 layers to GPU
llama_model_load_internal: [cublas] total VRAM used: 2269 MB

The cuBLAS lines will show even when the GPU is not active, as long as the right version of LlamaCpp is running.

  1. In my case, I was using VS Code and had some difficulty getting the LlamaCpp built, installed and active in the venv context.
  2. One thing that I encountered inside and outside the venv, was the need to uninstall the existing llama-cpp-python module and also to avoid reinstalling the old module from its cache.
    It seemed best to do the uninstall, then build and install the LlamaCpp with the disable cache flag.
    Use a command similar to this:
    pip install llama-cpp-python --no-cache-dir
  3. Also, there are a lot of legacy python3 and pip3 references going around. These will screw you if you are not careful. At one time they were necessary to ensure you did not get the python2 versions. Now, standard installs of Python do not create python3 and pip3 aliases, so those commands fall further down the path to some older installation that did. The end result is that you don't get things installed into, or built with, the right Python version.
    You can work around that in two ways: create a link (or duplicate) of the existing pip and python executables as pip3 and python3 in the active installation, or drop the 3 from any commands you are executing.

Those three things bit my hindquarters yesterday.

@johnbrisbin

I am not sure why people are having trouble; I have actually run this successfully on a clean Windows install, and also on several Linux machines...

Clean Windows? That is the definition of an oxymoron.
Seriously though, that is what you can test, and you should test clean installations, but many of these issues (or at least mine) have to do with unclean installs: all the old Python versions installed one place or another, all the subtle context interactions with IDEs, the slightly different versions installed for other packages or explorations, and occasional lapses in the ability to follow instructions.
That is going to happen, and about all you can do is perfect your instructions, embrace the suck, and get an installer going. The installer is good for about 80% of these issues (by the 2nd or 3rd revision, anyway).

@StephenDWright

@johnbrisbin Thank you for the feedback. I am also trying to run it in VS code, in a venv. I have deleted the folder and environment and cloned so many times to start over the process.😤 I will try what you suggested regarding the cache. At least I know what I am looking for if it ever works. So you are saying using python3 and pip3 sometimes and then using python and pip can actually cause problems. Interesting. Thanks again.

@johnbrisbin

@johnbrisbin Thank you for the feedback. I am also trying to run it in VS code, in a venv. I have deleted the folder and environment and cloned so many times to start over the process.😤 I will try what you suggested regarding the cache. At least I know what I am looking for if it ever works. So you are saying using python3 and pip3 sometimes and then using python and pip can actually cause problems. Interesting. Thanks again.

@StephenDWright, I would suggest you try 'where python' and 'where python3' in the venv terminal to check that. But for me, an active virtual environment seems to disable the where command so it outputs nothing. I had to run a simple script that imports sys and prints sys.argv[0] to find where the pythons are really located. And they were different.
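A small script version of that check (sys.executable points at the interpreter binary itself, which tends to be even more direct than sys.argv[0] for this purpose):

# Print which interpreter is actually running and which site-packages it uses.
import site
import sys

print("interpreter:", sys.executable)
print("script path:", sys.argv[0])
print("site-packages:", site.getsitepackages())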

@imartinez (Collaborator) left a comment

This contribution is massive, the community has been asking for it. Thanks a lot! Please take a look at my comments and let me know if you feel it is ready to merge @maozdemir

ingest.py: review comment (outdated, resolved)
privateGPT.py: review comment (outdated, resolved)
@StephenDWright

Really brilliant. Even though I am about to give up on getting the GPU to work for now after an evening of trying, it is still a great addition. 👍👍

@johnbrisbin

7. Did you try the Embeddings GPU setting with the ingest process? It should make a big difference. I hope.

It makes almost no difference. With CUDA enabled:

TotalSeconds      : 173.907001
TotalMilliseconds : 173907.001

Without CUDA enabled:

TotalSeconds      : 192.4191741
TotalMilliseconds : 192419.1741

This might be down to my cute little GTX 965M, though. I've implemented it in 76f042a regardless. Further testing is welcome.

Some really good news. Just turning on the CUDA option made a huge improvement for me.

I have a collection of 1900+ epub books. I have ingested them more than once. It took 15 hours straight to ingest 1500 of them on a 16 core/32 thread 64GB machine at about 100 per hour.
Turning on CUDA in the embeddings initialization, with a GTX 1660 Super installed, reduced the time for the first 100 from 1 hour to 8m57s with only 2 threads of CPU load. The second 100 maintained the same pace, as did the third.

It looks like your very short test was dominated by initialization time. With a real load (the whole 1900 books amount to 3.75 million chunks) the benefits are huge! 7x faster. Since the machine I have is very fast for CPU ops, the benefits for people with less capable main processors will be even better assuming a normal video card.

Congratulations, @maozdemir

@gael-vanderlee

Would this work with AMD GPUs if PyTorch is configured with ROCm?

@johnbrisbin

Would this work with AMD GPUs if PyTorch is configured with ROCm?

I looked into this recently and... indications are not good. There is not a one-to-one relationship between the CUDA and ROCm APIs, so it looks like a simple translation is out.
If, however, PyTorch is generating code for CUDA to accomplish an algorithm, then presumably it could generate corresponding code for ROCm. I don't know which bits of the CUDA-supporting code used in privateGPT might be using PyTorch as a code generator. Could be fun to try on one of the simpler bits.

@StephenDWright commented May 28, 2023

@maozdemir and @johnbrisbin I finally got it to work with the GPU. Sharing so it can hopefully help with troubleshooting in the future. I encountered the following issue while setting up a virtual environment in VS Code:

Problem: Despite manually preventing the llama.cpp installation from the requirements file, installing version 0.1.54, and detecting CUDA references during the compilation process, the GPU was not being utilized. The NVIDIA toolkit was detected, and the compilation seemed successful, but BLAS was at 0 and there were no indications of GPU offloading.

Action: I found two folders in my environment's site-packages: "llama_cpp" and "llama_cpp_python-0.1.54-py3.11-win-amd64.egg (0.1.54)". I deleted the "llama_cpp" folder and replaced it with the same folder from the "..win-amd..(0.1.54)" directory. It's unclear if copying the folder was necessary; deleting the folder might have resolved the issue alone.

The GPU is now being utilized. Regardless of the exact fix, it's evident that the problem stemmed from using the incorrect version of "llama_cpp," despite my attempts to manually install the correct one.
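For anyone hitting the same thing, a quick check of which llama_cpp build the interpreter actually picks up (standard module attributes) can save a lot of guessing:

# Verify which llama_cpp is imported, and from where.
import llama_cpp

print("llama_cpp version:", llama_cpp.__version__)
print("loaded from:", llama_cpp.__file__)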

@maozdemir (Contributor Author)

Thanks!

@imartinez, I'll have to rewrite a good README with clearer instructions for enabling the GPU, then it'll be ready to merge :)

@imartinez (Collaborator)

Thanks @StephenDWright for sharing your experience.

And thanks @maozdemir! Let me know when you are done for a final review and merge. This first GPU support could be explicitly marked as experimental in the readme, and only for experienced users given the complexity of the installation.

@yaslack commented May 28, 2023

I made it work with my AMD GPU (RX 6950 XT):
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm5.4.2
but I don't really see acceleration; the model still loads on my GPU, and I get 92 sec.

@darrinh commented May 28, 2023

Getting this message when attempting to build GPU support:

Not searching for unused variables given on the command line.
-- cuBLAS found
CMake Error at /home/chatbot/privateGPT/lib/python3.10/site-packages/cmake/data/share/cmake-3.26/Modules/CMakeDetermineCUDACompiler.cmake:277 (message):
  CMAKE_CUDA_ARCHITECTURES must be non-empty if set.
Call Stack (most recent call first):
  vendor/llama.cpp/CMakeLists.txt:184 (enable_language)

Ubuntu 20.04
cmake 3.26.3
cuda-nvcc-11-8

What am I missing?

EDIT
If I do

CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1
python3 ./setup.py install

On separate lines, then it appears to build successfully.

EDIT2: not sure if the above worked, I get the following when starting privateGPT.py:

(privateGPT) chatbot@chatbot:~/privateGPT$ ./privateGPT.py 
/home/chatbot/privateGPT/lib/python3.10/site-packages/torch/cuda/__init__.py:107: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 803: system has unsupported display driver / cuda driver combination (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:109.)
  return torch._C._cuda_getDeviceCount() > 0
Using embedded DuckDB with persistence: data will be stored in: db
Traceback (most recent call last):
  File "/home/chatbot/privateGPT/./privateGPT.py", line 78, in <module>
    main()
  File "/home/chatbot/privateGPT/./privateGPT.py", line 38, in main
    llm = GPT4All(model=model_path, n_ctx=model_n_ctx, backend='gptj', callbacks=callbacks, verbose=False)
  File "pydantic/main.py", line 341, in pydantic.main.BaseModel.__init__
pydantic.error_wrappers.ValidationError: 1 validation error for GPT4All
__root__
  Invalid model directory (type=value_error)

README.md: review comment (outdated, resolved)
@maozdemir maozdemir reopened this May 29, 2023
@maozdemir (Contributor Author) commented May 29, 2023

@imartinez;

Would you like to review the current state of this PR? (sorry for the force push...)

@Kaszanas commented Jun 1, 2023

If possible, could you verify whether GPU-accelerated inference works with GPT4All? If this is not the case, then adding additional information to the README might be needed. Last time I ran this and compiled stuff by hand, the embedding ran fine with the GPU, but inference failed. Don't remember why.

@maozdemir (Contributor Author) commented Jun 2, 2023

If possible, could you verify whether GPU-accelerated inference works with GPT4All? If this is not the case, then adding additional information to the README might be needed. Last time I ran this and compiled stuff by hand, the embedding ran fine with the GPU, but inference failed. Don't remember why.

@Kaszanas, the thing is there is no way of making it work with GPT4All, at least not that I know of.
Also, I think @imartinez prefers #521 over this. If he agrees, I'll be closing this PR.

@imartinez imartinez added the primordial Related to the primordial version of PrivateGPT, which is now frozen in favour of the new PrivateGPT label Oct 19, 2023
@imartinez imartinez closed this Dec 4, 2023
Successfully merging this pull request may close these issues: "Improvement of query response time with GPU" and "How to utilize GPU in Windows?"