
backend: rebase llama.cpp submodule on latest upstream #2694

Merged
merged 7 commits into main on Jul 19, 2024

Conversation

cebtenzzre (Member) commented Jul 18, 2024

This PR updates the llama.cpp submodule to a version rebased on upstream commit ggerganov/llama.cpp@a15ef8f, later updated to ggerganov/llama.cpp@87e397d.

To rebase successfully:

  • The commented-out CUDA log statements were replaced with a call to ggml_backend_cuda_log_set_callback
  • llama.cpp now only inserts a leading space after BOS, so the hacks in llama.cpp and GPT4All that worked around concatenated calls to llama_tokenize were removed (a sketch of the scenario follows this list)
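
For illustration, here is a minimal sketch of the chunked tokenization the removed hacks dealt with (the helper name and details are hypothetical, not GPT4All's actual code):

```cpp
#include "llama.h"

#include <string>
#include <vector>

// Sketch: tokenize one chunk of a prompt that is fed to the model in pieces.
// With the rebased llama.cpp, a leading space is inserted only after BOS
// (i.e. when add_special is true), so continuation chunks tokenized with
// add_special = false no longer need a workaround that strips a spurious
// space token.
static std::vector<llama_token> tokenizeChunk(const llama_model *model,
                                              const std::string &text,
                                              bool isFirstChunk) {
    std::vector<llama_token> tokens(text.size() + 8); // rough upper bound
    int32_t n = llama_tokenize(model, text.c_str(), (int32_t) text.size(),
                               tokens.data(), (int32_t) tokens.size(),
                               /*add_special*/ isFirstChunk,
                               /*parse_special*/ false);
    if (n < 0) { // buffer too small; -n is the required token count
        tokens.resize(-n);
        n = llama_tokenize(model, text.c_str(), (int32_t) text.size(),
                           tokens.data(), (int32_t) tokens.size(),
                           isFirstChunk, false);
    }
    tokens.resize(n);
    return tokens;
}
```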

To get it to compile:

  • llama.cpp.cmake had to be updated since the repo was reorganized, and the OpenCL backend was removed
  • llama_token_to_piece calls were updated to pass the new lstrip argument (set to zero); a sketch follows this list
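
A sketch of what the updated call sites look like (simplified and illustrative; the special argument shown here is an assumption, not necessarily what GPT4All passes):

```cpp
#include "llama.h"

#include <string>
#include <vector>

// Sketch: decode a single token to text. The rebased llama.cpp adds an
// lstrip argument to llama_token_to_piece; passing 0 strips no leading
// spaces from the decoded piece, preserving the previous behavior.
static std::string tokenToPiece(const llama_model *model, llama_token token) {
    std::vector<char> buf(64);
    int32_t n = llama_token_to_piece(model, token, buf.data(),
                                     (int32_t) buf.size(),
                                     /*lstrip*/ 0, /*special*/ true);
    if (n < 0) { // buffer too small; -n is the required length
        buf.resize(-n);
        n = llama_token_to_piece(model, token, buf.data(),
                                 (int32_t) buf.size(), 0, true);
    }
    return std::string(buf.data(), n);
}
```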

There were runtime complaints about the NONE op not being supported, so I ran test-backend-ops. It crashed partway through, which led to significant changes to the way we initialize and destroy resources with Kompute. The results of this are:

  • We are now running all 243 passing tests instead of only 150, by allowing no-ops on all tensor data types (this also fixes the runtime complaints)
  • We now free the device and Vulkan instance as late as possible, since doing it too eagerly can cause repeated re-initialization, which eventually caused the crash during test-backend-ops. Resources are instead freed by calling Kompute's Manager::clear, which has been modified to loop until there is nothing left to clean up (a sketch of this idea follows the list).
  • An old bug where we did not unref the GPU device on allocation failure was fixed. I believe this could have caused unnecessary VRAM usage after OOM, which would only be freed after a successful model load/unload with Kompute.
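
A minimal sketch of the "loop until nothing is left to clean up" idea in Manager::clear (generic C++, assuming weak references to managed resources; names and structure are illustrative, not the actual Kompute code):

```cpp
#include <algorithm>
#include <cstddef>
#include <memory>
#include <vector>

struct Resource {
    virtual ~Resource() = default;
    // Releases GPU handles and any strong references this resource holds
    // to other resources; assumed safe to call more than once.
    virtual void destroy() = 0;
};

// Destroying one resource can drop the last strong reference to another,
// expiring more entries, so keep sweeping until a pass makes no progress.
static void clearManaged(std::vector<std::weak_ptr<Resource>> &managed) {
    std::size_t before;
    do {
        before = managed.size();
        for (auto &weak : managed)
            if (auto res = weak.lock())
                res->destroy();
        managed.erase(std::remove_if(managed.begin(), managed.end(),
                                     [](const std::weak_ptr<Resource> &w) {
                                         return w.expired();
                                     }),
                      managed.end());
    } while (!managed.empty() && managed.size() != before);
}
```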

Signed-off-by: Jared Van Bortel <jared@nomic.ai>
@cebtenzzre cebtenzzre requested review from manyoso and removed request for manyoso July 18, 2024 22:21
This fixes CUDA symbol lookup errors caused by missing parts of the
build script.

Signed-off-by: Jared Van Bortel <jared@nomic.ai>
cebtenzzre (Member, Author) commented Jul 19, 2024

New changes:

  • Fix potential crash and memory leaks on OOM in Kompute
  • Enable GPT4All support for GPT-NeoX, Gemma 2, OpenELM, ChatGLM, and Jais architectures (all with Kompute support)
  • Also enable Kompute support for StarCoder2, XVERSE, Command R, and OLMo

Gemma 2 still doesn't work perfectly: familiar issues with special tokens in the output cause an assertion failure in the UI code, likely due to an unrecognized EOS token.

@cebtenzzre cebtenzzre marked this pull request as ready for review July 19, 2024 03:28
@cebtenzzre cebtenzzre requested a review from manyoso July 19, 2024 03:28
manyoso (Collaborator) commented Jul 19, 2024

Have tested a bit and can't see anything that breaks

Signed-off-by: Jared Van Bortel <jared@nomic.ai>
@cebtenzzre cebtenzzre marked this pull request as draft July 19, 2024 18:46
@cebtenzzre cebtenzzre marked this pull request as ready for review July 19, 2024 18:51
@cebtenzzre cebtenzzre merged commit 290c629 into main Jul 19, 2024
6 of 20 checks passed
cebtenzzre added a commit that referenced this pull request Jul 30, 2024
When llama.cpp was updated, I removed the space-removal logic, but it
turns out it is still needed. It is now a proper parameter, since we
specifically only want to disable the *leading* space when tokenizing
input that comes after a normal token.

This fixes a regression in commit 290c629 ("backend: rebase llama.cpp
submodule on latest upstream (#2694)").

Signed-off-by: Jared Van Bortel <jared@nomic.ai>