Bug: Erroneous Output in llama-cli #9848

Closed
ericcurtin opened this issue Oct 11, 2024 · 15 comments
Labels
bug-unconfirmed · high severity · stale

Comments

@ericcurtin
Collaborator

What happened?

When using llama.cpp models (e.g., granite-code and llama3) with Nvidia GPU acceleration (nvidia/cuda:12.6.1-devel-ubi9 and RTX 3080 10GB VRAM), the models occasionally return nonsensical or garbled output after a few valid responses. This occurs even when the input prompts are simple, like basic arithmetic or listing prime numbers. Running the model using -ngl 50 in both configurations leads to the issue, suggesting it could be related to VRAM usage or GPU settings. This problem does not occur with Ollama’s GPU-accelerated version of llama3 using the exact same .gguf files.

The llama-cli command used is:

llama-cli -m /var/lib/ramalama/models/ollama/granite-code:latest --in-prefix '' --in-suffix '' --no-display-prompt -ngl 50 -p "You are a helpful assistant" -c 2048 -cnv

RamaLama project issue:

containers/ramalama#247

I don't think this kind of issue is Nvidia-specific; in general, Ollama seems to produce higher-quality responses than llama-cli.

Name and Version

$ llama-cli --version
version: 3821 (70392f1)
built with cc (GCC) 11.4.1 20231218 (Red Hat 11.4.1-3) for x86_64-redhat-linux

What operating system are you seeing the problem on?

Linux

Relevant log output

llama-cli -m /var/lib/ramalama/models/ollama/granite-code:latest --in-prefix '' --in-suffix '' --no-display-prompt -ngl 50 -p "You are a helpful assistant" -c 2048 -cnv
> what is 2+2 answer the question only
444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444

llama-cli -m /var/lib/ramalama/models/ollama/llama3:latest --in-prefix '' --in-suffix '' --no-display-prompt -ngl 50 -p "You are a helpful assistant" -c 2048 -cnv
> what is 2+2 answer the question and no more
GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
@ericcurtin added the bug-unconfirmed and high severity labels on Oct 11, 2024
@ggerganov
Owner

Does it work after applying this patch:

diff --git a/src/llama.cpp b/src/llama.cpp
index da7afb1e..fde09bec 100644
--- a/src/llama.cpp
+++ b/src/llama.cpp
@@ -9517,20 +9517,14 @@ static struct ggml_tensor * llm_build_kqv(
         cur = ggml_flash_attn_ext(ctx, q, k, v, kq_mask, kq_scale, hparams.f_max_alibi_bias,
                                   hparams.attn_soft_cap ? hparams.f_attn_logit_softcapping : 0.0f);
 
-        if (model.arch == LLM_ARCH_PHI2 || model.arch == LLM_ARCH_PHI3 || model.arch == LLM_ARCH_GPTNEOX || model.arch == LLM_ARCH_GEMMA2) {
-            ggml_flash_attn_ext_set_prec(cur, GGML_PREC_F32);
-        }
+        ggml_flash_attn_ext_set_prec(cur, GGML_PREC_F32);
 
         cur = ggml_reshape_2d(ctx, cur, n_embd_head_v*n_head, n_tokens);
     } else {
         struct ggml_tensor * kq = ggml_mul_mat(ctx, k, q);
         cb(kq, "kq", il);
 
-        if (model.arch == LLM_ARCH_PHI2 || model.arch == LLM_ARCH_PHI3 || model.arch == LLM_ARCH_GPTNEOX || model.arch == LLM_ARCH_QWEN2 || model.arch == LLM_ARCH_NEMOTRON || model.arch == LLM_ARCH_CHATGLM) {
-            // for this arch, we need to perform the KQ multiplication with F32 precision, otherwise we get NaNs
-            // ref: https://github.com/ggerganov/llama.cpp/pull/4490#issuecomment-1859055847
-            ggml_mul_mat_set_prec(kq, GGML_PREC_F32);
-        }
+        ggml_mul_mat_set_prec(kq, GGML_PREC_F32);
 
         if (model.arch == LLM_ARCH_GROK) {
             // need to do the following:

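For reference, one way to apply the diff and rebuild, assuming it is saved as kqv-f32.patch (filename illustrative) at the root of the llama.cpp checkout; add whatever build flags you normally use:

git apply kqv-f32.patch
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j$(nproc)
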
@slaren
Collaborator

slaren commented Oct 11, 2024

Looks like a duplicate of #9838. The issue seems to be related to using a container.

@ericcurtin
Collaborator Author

@bmahabirbu could you test this out?

@ericcurtin
Collaborator Author

Yes, it is a duplicate, sorry; I wasn't aware @bmahabirbu had logged it.

@bmahabirbu

@ericcurtin my apologies for not referencing you in the original issue. I'll try this patch, @ggerganov, thank you!

@bmahabirbu

Unfortunately, the patch did not work. I have a feeling it's something to do with WSL2 not giving the necessary resources to the container.

@slaren
Collaborator

slaren commented Oct 11, 2024

If it works correctly without the container, then the container is the cause. You can try using the official dockerfile instead and see if it works with that.
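For reference, building the official CUDA image looks roughly like this (the Dockerfile path is the one referenced later in this thread and may differ in newer revisions):

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
docker build -t local/llama.cpp:full-cuda -f .devops/full-cuda.Dockerfile .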

@bmahabirbu

It's also worth noting that the official Ollama container works.

@slaren
Collaborator

slaren commented Oct 11, 2024

As @JohannesGaessler already pointed out in the other issue, the problem may be the use of CUDA_ARCH, which is not the correct way to set the CUDA architectures. My suggestion would be to switch to using cmake to build llama.cpp, as in the official Dockerfile, since it has much better defaults for the CUDA archs.

@bmahabirbu

Makes sense! Originally I used make because docker build couldn't find libcuda when using cmake.

@ericcurtin
Collaborator Author

@ggerganov @slaren @JohannesGaessler

A somewhat related question: llama-cli and Ollama are two tools that use llama.cpp as a library. Using the exact same .gguf files with both, Ollama seems to produce higher-quality responses in general, even when problems like the above aren't encountered. We tend to call llama-cli like this in the RamaLama project:

llama-cli -m /var/lib/ramalama/models/ollama/llama3:latest --in-prefix '' --in-suffix '' --no-display-prompt -p "You are a helpful assistant" -c 2048 -cnv

What is it about the way Ollama uses llama.cpp that seems to generate better responses?

@ggerganov
Owner

Likely different sampling parameters. These can have a high impact on the quality of the generated text. Try to match the settings between the two tools and see if this resolves the discrepancy.
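For example, the sampler settings can be set explicitly on the llama-cli command line; the values below are illustrative, not necessarily what Ollama uses for this model:

llama-cli -m /var/lib/ramalama/models/ollama/llama3:latest --no-display-prompt -ngl 50 -c 2048 -cnv \
    --temp 0.8 --top-k 40 --top-p 0.9 --repeat-penalty 1.1 \
    -p "You are a helpful assistant"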

@bmahabirbu

As @JohannesGaessler already pointed out in the other issue, the problem may be the use of CUDA_ARCH, which is not the correct way to set the CUDA architectures. My suggestion would be to switch to using cmake to build llama.cpp, as in the official Dockerfile, since it has much better defaults for the CUDA archs.

Thanks a bunch @slaren! This is what fixed the issue! For reference, this is what the relevant part of my new Containerfile looks like:

# CMAKE_CUDA_ARCHITECTURES values:
# Turing GPUs (e.g., RTX 20 Series, GTX 16 Series): Use 75
# Ampere GPUs (e.g., RTX 30 Series, A100): Use 80
# Hopper GPUs (e.g., H100): Use 90
# Volta GPUs (e.g., V100): Use 70
# Pascal GPUs (e.g., GTX 10 Series): Use 61
# Kepler GPUs (e.g., GTX 600 and 700 Series): Use 35

# Followed https://github.com/ggerganov/llama.cpp/blob/master/.devops/full-cuda.Dockerfile
# for reference to build llama.cpp with cuda using cmake

RUN git clone https://github.com/ggerganov/llama.cpp && \
    cd llama.cpp && \
    git reset --hard ${LLAMA_CPP_SHA} && \
    cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=80 -DCMAKE_EXE_LINKER_FLAGS=-Wl,--allow-shlib-undefined . && \
    cmake --build build --config Release -j$(nproc) && \
    cd build/bin && \
    mv llama-cli /usr/bin/llama-cli && \
    mv llama-server /usr/bin/llama-server && \
    cd / && \
    rm -rf llama.cpp

I also targeted the CUDA architecture for my GPU instead of the default.
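A quick way to look up the compute capability of the installed GPU, assuming a reasonably recent driver (for example, 8.6 corresponds to an architecture value of 86):

nvidia-smi --query-gpu=name,compute_cap --format=csv,noheader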

@ericcurtin
Collaborator Author

@bmahabirbu if you could test that this also works with Podman and open a PR on RamaLama, that would be great!


github-actions bot commented Dec 1, 2024

This issue was closed because it has been inactive for 14 days since being marked as stale.

github-actions bot closed this as completed on Dec 1, 2024