Bug: Erroneous Output in llama-cli #9848
Comments
Does it work after applying this patch:

diff --git a/src/llama.cpp b/src/llama.cpp
index da7afb1e..fde09bec 100644
--- a/src/llama.cpp
+++ b/src/llama.cpp
@@ -9517,20 +9517,14 @@ static struct ggml_tensor * llm_build_kqv(
         cur = ggml_flash_attn_ext(ctx, q, k, v, kq_mask, kq_scale, hparams.f_max_alibi_bias,
                                   hparams.attn_soft_cap ? hparams.f_attn_logit_softcapping : 0.0f);
 
-        if (model.arch == LLM_ARCH_PHI2 || model.arch == LLM_ARCH_PHI3 || model.arch == LLM_ARCH_GPTNEOX || model.arch == LLM_ARCH_GEMMA2) {
-            ggml_flash_attn_ext_set_prec(cur, GGML_PREC_F32);
-        }
+        ggml_flash_attn_ext_set_prec(cur, GGML_PREC_F32);
 
         cur = ggml_reshape_2d(ctx, cur, n_embd_head_v*n_head, n_tokens);
     } else {
         struct ggml_tensor * kq = ggml_mul_mat(ctx, k, q);
         cb(kq, "kq", il);
 
-        if (model.arch == LLM_ARCH_PHI2 || model.arch == LLM_ARCH_PHI3 || model.arch == LLM_ARCH_GPTNEOX || model.arch == LLM_ARCH_QWEN2 || model.arch == LLM_ARCH_NEMOTRON || model.arch == LLM_ARCH_CHATGLM) {
-            // for this arch, we need to perform the KQ multiplication with F32 precision, otherwise we get NaNs
-            // ref: https://github.com/ggerganov/llama.cpp/pull/4490#issuecomment-1859055847
-            ggml_mul_mat_set_prec(kq, GGML_PREC_F32);
-        }
+        ggml_mul_mat_set_prec(kq, GGML_PREC_F32);
 
         if (model.arch == LLM_ARCH_GROK) {
             // need to do the following:
Looks like a duplicate of #9838. The issue seems to be related to using a container.
@bmahabirbu could you test this out?
Yes, it is a duplicate, sorry; I wasn't aware @bmahabirbu had logged it.
@ericcurtin my apologies for not referencing you in the first issue. I'll try this patch, @ggerganov, thank you!
Unfortunately, the patch did not work. I have a feeling it's something to do with WSL2 not giving the necessary resources to the container.
If it works correctly without the container, then the container is the cause. You can try using the official Dockerfile instead and see if it works with that.
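For reference, building one of the official CUDA images looks roughly like the sketch below; the .devops/full-cuda.Dockerfile path is taken from the llama.cpp repository of that period and may have been renamed since, so treat it as an assumption rather than a guaranteed recipe.

# from the root of a llama.cpp checkout
docker build -t local/llama.cpp:full-cuda -f .devops/full-cuda.Dockerfile .
# then run with GPU access, mounting the host model directory into the container
docker run --gpus all -v /var/lib/ramalama/models:/models local/llama.cpp:full-cuda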
It's also worth noting that using the official Ollama container works.
As @JohannesGaessler already pointed out in the other issue, the problem may be the use of
Makes sense! Originally I used make because docker build couldn't find libcuda when using cmake.
@ggerganov @slaren @JohannesGaessler A somewhat related question, though. llama-cli and Ollama are two tools that use llama.cpp as a library. Using the exact same .gguf files with both, Ollama seems to produce higher-quality responses in general, even when problems like the above aren't encountered. We tend to call llama-cli like this in the RamaLama project:
What is it about the way Ollama uses llama.cpp that seems to generate better responses?
Likely different sampling parameters. These can have a high impact on the quality of the generated text. Try to match the settings between the two tools and see if this resolves the discrepancy.
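As an illustration of what matching the settings could involve, the llama-cli sampling flags can be set explicitly. The values below mirror Ollama's documented Modelfile defaults (temperature 0.8, top_k 40, top_p 0.9, repeat_penalty 1.1); whether these are the right values for RamaLama to pass is an assumption, not the project's actual configuration.

# hypothetical invocation with explicit sampling parameters
llama-cli -m /var/lib/ramalama/models/ollama/granite-code:latest \
    --temp 0.8 --top-k 40 --top-p 0.9 --repeat-penalty 1.1 \
    -c 2048 -ngl 50 -p "You are a helpful assistant" -cnv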
Thanks a bunch @slaren! This is what fixed the issue! For reference, this is what the relevant part of my new Containerfile looks like:
I also targeted docker_arch for my GPU instead of the default.
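The Containerfile snippet itself isn't reproduced here; a minimal sketch of that kind of CMake-based CUDA build, assuming the GGML_CUDA option used by llama.cpp builds of that era and compute capability 8.6 for an RTX 3080, might look like this.

# inside an nvidia/cuda:12.6.1-devel-ubi9 build stage, from the llama.cpp source directory
cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=86
cmake --build build --config Release -j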
@bmahabirbu if you could test that this also works with podman and open a PR on ramalama, that would be great!
This issue was closed because it has been inactive for 14 days since being marked as stale.
What happened?
When using llama.cpp models (e.g., granite-code and llama3) with Nvidia GPU acceleration (nvidia/cuda:12.6.1-devel-ubi9 and RTX 3080 10GB VRAM), the models occasionally return nonsensical or garbled output after a few valid responses. This occurs even when the input prompts are simple, like basic arithmetic or listing prime numbers. Running the model using -ngl 50 in both configurations leads to the issue, suggesting it could be related to VRAM usage or GPU settings. This problem does not occur with Ollama’s GPU-accelerated version of llama3 using the exact same .gguf files.
The llama-cli command used is:
llama-cli -m /var/lib/ramalama/models/ollama/granite-code:latest --in-prefix --in-suffix --no-display-prompt -ngl 50 -p "You are a helpful assistant" -c 2048 -cnv
RamaLama project issue:
containers/ramalama#247
I don't think this kind of issue is Nvidia-specific; in general, Ollama seems to produce higher-quality responses than llama-cli.
Name and Version
$ llama-cli --version
version: 3821 (70392f1)
built with cc (GCC) 11.4.1 20231218 (Red Hat 11.4.1-3) for x86_64-redhat-linux
What operating system are you seeing the problem on?
Linux
Relevant log output