
backend: rebase llama.cpp submodule on latest upstream #2694

Merged
merged 7 commits into main on Jul 19, 2024

Conversation

cebtenzzre (Member) commented Jul 18, 2024

This PR updates the llama.cpp submodule to a version rebased on upstream commit ggerganov/llama.cpp@a15ef8f, later updated to ggerganov/llama.cpp@87e397d.

To rebase successfully:

  • The commented-out CUDA log statements were replaced with a call to ggml_backend_cuda_log_set_callback
  • llama.cpp now only inserts a leading space after BOS, so the hacks in llama.cpp and GPT4All that worked around concatenated calls to llama_tokenize were removed (a sketch of the scenario follows this list)
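
For illustration, here is a minimal sketch of the chunked tokenization the removed hacks dealt with (the helper name and details are hypothetical, not GPT4All's actual code):

```cpp
#include "llama.h"

#include <string>
#include <vector>

// Sketch: tokenize one chunk of a prompt that is fed to the model in pieces.
// With the rebased llama.cpp, a leading space is inserted only after BOS
// (i.e. when add_special is true), so continuation chunks tokenized with
// add_special = false no longer need a workaround that strips a spurious
// space token.
static std::vector<llama_token> tokenizeChunk(const llama_model *model,
                                              const std::string &text,
                                              bool isFirstChunk) {
    std::vector<llama_token> tokens(text.size() + 8); // rough upper bound
    int32_t n = llama_tokenize(model, text.c_str(), (int32_t) text.size(),
                               tokens.data(), (int32_t) tokens.size(),
                               /*add_special*/ isFirstChunk,
                               /*parse_special*/ false);
    if (n < 0) { // buffer too small; -n is the required token count
        tokens.resize(-n);
        n = llama_tokenize(model, text.c_str(), (int32_t) text.size(),
                           tokens.data(), (int32_t) tokens.size(),
                           isFirstChunk, false);
    }
    tokens.resize(n);
    return tokens;
}
```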

To get it to compile:

  • llama.cpp.cmake had to be updated since the repo was reorganized, and the OpenCL backend was removed
  • llama_token_to_piece calls were updated to pass the new lstrip argument (set to zero); a sketch follows this list
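
A sketch of what the updated call sites look like (simplified and illustrative; the special argument shown here is an assumption, not necessarily what GPT4All passes):

```cpp
#include "llama.h"

#include <string>
#include <vector>

// Sketch: decode a single token to text. The rebased llama.cpp adds an
// lstrip argument to llama_token_to_piece; passing 0 strips no leading
// spaces from the decoded piece, preserving the previous behavior.
static std::string tokenToPiece(const llama_model *model, llama_token token) {
    std::vector<char> buf(64);
    int32_t n = llama_token_to_piece(model, token, buf.data(),
                                     (int32_t) buf.size(),
                                     /*lstrip*/ 0, /*special*/ true);
    if (n < 0) { // buffer too small; -n is the required length
        buf.resize(-n);
        n = llama_token_to_piece(model, token, buf.data(),
                                 (int32_t) buf.size(), 0, true);
    }
    return std::string(buf.data(), n);
}
```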

There were runtime complaints about the NONE op not being supported, so I ran test-backend-ops. It crashed partway through, which led to significant changes to the way we initialize and destroy resources with Kompute. The results of this are:

  • We are now running all 243 passing tests instead of only 150, by allowing no-ops on all tensor data types (this also fixes the runtime complaints)
  • We now free the device and Vulkan instance as late as possible, since doing it too eagerly can cause repeated re-initialization, which eventually caused the crash during test-backend-ops. Resources are instead freed by calling Kompute's Manager::clear, which has been modified to loop until there is nothing left to clean up (a sketch of this idea follows the list).
  • An old bug where we did not unref the GPU device on allocation failure was fixed. I believe this could have caused unnecessary VRAM usage after OOM, which would only be freed after a successful model load/unload with Kompute.
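
A minimal sketch of the "loop until nothing is left to clean up" idea in Manager::clear (generic C++, assuming weak references to managed resources; names and structure are illustrative, not the actual Kompute code):

```cpp
#include <algorithm>
#include <cstddef>
#include <memory>
#include <vector>

struct Resource {
    virtual ~Resource() = default;
    // Releases GPU handles and any strong references this resource holds
    // to other resources; assumed safe to call more than once.
    virtual void destroy() = 0;
};

// Destroying one resource can drop the last strong reference to another,
// expiring more entries, so keep sweeping until a pass makes no progress.
static void clearManaged(std::vector<std::weak_ptr<Resource>> &managed) {
    std::size_t before;
    do {
        before = managed.size();
        for (auto &weak : managed)
            if (auto res = weak.lock())
                res->destroy();
        managed.erase(std::remove_if(managed.begin(), managed.end(),
                                     [](const std::weak_ptr<Resource> &w) {
                                         return w.expired();
                                     }),
                      managed.end());
    } while (!managed.empty() && managed.size() != before);
}
```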

Signed-off-by: Jared Van Bortel <jared@nomic.ai>
@cebtenzzre cebtenzzre requested review from manyoso and removed request for manyoso July 18, 2024 22:21
This fixes CUDA symbol lookup errors caused by missing parts of the
build script.

Signed-off-by: Jared Van Bortel <jared@nomic.ai>
cebtenzzre (Member, Author) commented Jul 19, 2024

New changes:

  • Fix potential crash and memory leaks on OOM in Kompute
  • Enable GPT4All support for GPT-NeoX, Gemma 2, OpenELM, ChatGLM, and Jais architectures (all with Kompute support)
  • Also enable Kompute support for StarCoder2, XVERSE, Command R, and OLMo

Gemma 2 still doesn't work perfectly: familiar issues with special tokens in the output cause an assertion failure in the UI code, likely due to an unrecognized EOS token.

@cebtenzzre cebtenzzre marked this pull request as ready for review July 19, 2024 03:28
@cebtenzzre cebtenzzre requested a review from manyoso July 19, 2024 03:28
manyoso (Collaborator) commented Jul 19, 2024

Have tested a bit and can't see anything that breaks

Signed-off-by: Jared Van Bortel <jared@nomic.ai>
@cebtenzzre cebtenzzre marked this pull request as draft July 19, 2024 18:46
@cebtenzzre cebtenzzre marked this pull request as ready for review July 19, 2024 18:51
@cebtenzzre cebtenzzre merged commit 290c629 into main Jul 19, 2024
6 of 20 checks passed
cebtenzzre added a commit that referenced this pull request Jul 30, 2024
When llama.cpp was updated, I removed the space-removal logic, but it
turns out it is still needed. It is now a proper parameter, since we
specifically only want to disable the *leading* space when tokenizing
input that comes after a normal token.

This fixes a regression in commit 290c629 ("backend: rebase llama.cpp
submodule on latest upstream (#2694)").

Signed-off-by: Jared Van Bortel <jared@nomic.ai>