Bug: Erroneous Output in llama-cli #9848

Closed
ericcurtin opened this issue Oct 11, 2024 · 15 comments
Labels
bug-unconfirmed · high severity · stale

Comments

@ericcurtin
Collaborator

What happened?

When using llama.cpp models (e.g., granite-code and llama3) with Nvidia GPU acceleration (nvidia/cuda:12.6.1-devel-ubi9 and RTX 3080 10GB VRAM), the models occasionally return nonsensical or garbled output after a few valid responses. This occurs even when the input prompts are simple, like basic arithmetic or listing prime numbers. Running the model using -ngl 50 in both configurations leads to the issue, suggesting it could be related to VRAM usage or GPU settings. This problem does not occur with Ollama’s GPU-accelerated version of llama3 using the exact same .gguf files.

The llama-cli command used is:

llama-cli -m /var/lib/ramalama/models/ollama/granite-code:latest --in-prefix '' --in-suffix '' --no-display-prompt -ngl 50 -p "You are a helpful assistant" -c 2048 -cnv

RamaLama project issue:

containers/ramalama#247

I don't think this kind of issue is Nvidia-specific; in general, Ollama seems to produce higher-quality responses than llama-cli.

Name and Version

$ llama-cli --version
version: 3821 (70392f1)
built with cc (GCC) 11.4.1 20231218 (Red Hat 11.4.1-3) for x86_64-redhat-linux

What operating system are you seeing the problem on?

Linux

Relevant log output

llama-cli -m /var/lib/ramalama/models/ollama/granite-code:latest --in-prefix '' --in-suffix '' --no-display-prompt -ngl 50 -p "You are a helpful assistant" -c 2048 -cnv
> what is 2+2 answer the question only
444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444

llama-cli -m /var/lib/ramalama/models/ollama/llama3:latest --in-prefix '' --in-suffix '' --no-display-prompt -ngl 50 -p "You are a helpful assistant" -c 2048 -cnv
> what is 2+2 answer the question and no more
GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
@ericcurtin added the bug-unconfirmed and high severity labels on Oct 11, 2024
@ggerganov
Owner

Does it work after applying this patch:

diff --git a/src/llama.cpp b/src/llama.cpp
index da7afb1e..fde09bec 100644
--- a/src/llama.cpp
+++ b/src/llama.cpp
@@ -9517,20 +9517,14 @@ static struct ggml_tensor * llm_build_kqv(
         cur = ggml_flash_attn_ext(ctx, q, k, v, kq_mask, kq_scale, hparams.f_max_alibi_bias,
                                   hparams.attn_soft_cap ? hparams.f_attn_logit_softcapping : 0.0f);
 
-        if (model.arch == LLM_ARCH_PHI2 || model.arch == LLM_ARCH_PHI3 || model.arch == LLM_ARCH_GPTNEOX || model.arch == LLM_ARCH_GEMMA2) {
-            ggml_flash_attn_ext_set_prec(cur, GGML_PREC_F32);
-        }
+        ggml_flash_attn_ext_set_prec(cur, GGML_PREC_F32);
 
         cur = ggml_reshape_2d(ctx, cur, n_embd_head_v*n_head, n_tokens);
     } else {
         struct ggml_tensor * kq = ggml_mul_mat(ctx, k, q);
         cb(kq, "kq", il);
 
-        if (model.arch == LLM_ARCH_PHI2 || model.arch == LLM_ARCH_PHI3 || model.arch == LLM_ARCH_GPTNEOX || model.arch == LLM_ARCH_QWEN2 || model.arch == LLM_ARCH_NEMOTRON || model.arch == LLM_ARCH_CHATGLM) {
-            // for this arch, we need to perform the KQ multiplication with F32 precision, otherwise we get NaNs
-            // ref: https://github.com/ggerganov/llama.cpp/pull/4490#issuecomment-1859055847
-            ggml_mul_mat_set_prec(kq, GGML_PREC_F32);
-        }
+        ggml_mul_mat_set_prec(kq, GGML_PREC_F32);
 
         if (model.arch == LLM_ARCH_GROK) {
             // need to do the following:

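For reference, one way to apply the diff and rebuild, assuming it is saved as kqv-f32.patch (filename illustrative) at the root of the llama.cpp checkout; add whatever build flags you normally use:

git apply kqv-f32.patch
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j$(nproc)
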
@slaren
Collaborator

slaren commented Oct 11, 2024

Looks like a duplicate of #9838. The issue seems to be related to using a container.

@ericcurtin
Collaborator Author

@bmahabirbu could you test this out?

@ericcurtin
Collaborator Author

Yes, it is a duplicate, sorry; I wasn't aware @bmahabirbu had logged it.

@bmahabirbu

@ericcurtin my apologies for not referencing you in the original issue. I'll try this patch, @ggerganov, thank you!

@bmahabirbu

Unfortunately, the patch did not work. I have a feeling it's something to do with WSL2 not giving the necessary resources to the container.

@slaren
Collaborator

slaren commented Oct 11, 2024

If it works correctly without the container, then the container is the cause. You can try using the official dockerfile instead and see if it works with that.
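For reference, building the official CUDA image looks roughly like this (the Dockerfile path is the one referenced later in this thread and may differ in newer revisions):

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
docker build -t local/llama.cpp:full-cuda -f .devops/full-cuda.Dockerfile .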

@bmahabirbu

It's also worth noting that the official Ollama container works.

@slaren
Collaborator

slaren commented Oct 11, 2024

As @JohannesGaessler already pointed out in the other issue, the problem may be the use of CUDA_ARCH, which is not the correct way to set the CUDA architectures. My suggestion would be to switch to using cmake to build llama.cpp, as in the official Dockerfile, since it has much better defaults for the CUDA archs.

@bmahabirbu

Makes sense! Originally I used make because docker build couldn't find libcuda when using cmake.

@ericcurtin
Collaborator Author

@ggerganov @slaren @JohannesGaessler

A somewhat related question: llama-cli and Ollama are two tools that use llama.cpp as a library. Using the exact same .gguf files with both, Ollama seems to produce higher-quality responses in general, even when problems like the above aren't encountered. We tend to call llama-cli like this in the RamaLama project:

llama-cli -m /var/lib/ramalama/models/ollama/llama3:latest --in-prefix '' --in-suffix '' --no-display-prompt -p "You are a helpful assistant" -c 2048 -cnv

What is it about the way Ollama uses llama.cpp that seems to generate better responses?

@ggerganov
Owner

Likely different sampling parameters. These can have a high impact on the quality of the generated text. Try to match the settings between the two tools and see if this resolves the discrepancy.
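For example, the sampler settings can be set explicitly on the llama-cli command line; the values below are illustrative, not necessarily what Ollama uses for this model:

llama-cli -m /var/lib/ramalama/models/ollama/llama3:latest --no-display-prompt -ngl 50 -c 2048 -cnv \
    --temp 0.8 --top-k 40 --top-p 0.9 --repeat-penalty 1.1 \
    -p "You are a helpful assistant"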

@bmahabirbu

As @JohannesGaessler already pointed out in the other issue, the problem may be the use of CUDA_ARCH, which is not the correct way to set the CUDA architectures. My suggestion would be to switch to using cmake to build llama.cpp, as in the official Dockerfile, since it has much better defaults for the CUDA archs.

Thanks a bunch @slaren! This is what fixed the issue! For reference, this is what the relevant part of my new Containerfile looks like:

# CMAKE_CUDA_ARCHITECTURES values:
# Turing GPUs (e.g., RTX 20 Series, GTX 16 Series): Use 75
# Ampere GPUs (e.g., RTX 30 Series, A100): Use 80
# Hopper GPUs (e.g., H100): Use 90
# Volta GPUs (e.g., V100): Use 70
# Pascal GPUs (e.g., GTX 10 Series): Use 61
# Kepler GPUs (e.g., GTX 600 and 700 Series): Use 35

# Followed https://github.com/ggerganov/llama.cpp/blob/master/.devops/full-cuda.Dockerfile
# for reference to build llama.cpp with cuda using cmake

RUN git clone https://github.com/ggerganov/llama.cpp && \
    cd llama.cpp && \
    git reset --hard ${LLAMA_CPP_SHA} && \
    cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=80 -DCMAKE_EXE_LINKER_FLAGS=-Wl,--allow-shlib-undefined . && \
    cmake --build build --config Release -j$(nproc) && \
    cd build/bin && \
    mv llama-cli /usr/bin/llama-cli && \
    mv llama-server /usr/bin/llama-server && \
    cd / && \
    rm -rf llama.cpp

I also targeted the CUDA architecture for my GPU instead of the default.
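A quick way to look up the compute capability of the installed GPU, assuming a reasonably recent driver (for example, 8.6 corresponds to an architecture value of 86):

nvidia-smi --query-gpu=name,compute_cap --format=csv,noheader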

@ericcurtin
Collaborator Author

@bmahabirbu if you could test that this also works with Podman and open a PR on RamaLama, that would be great!


github-actions bot commented Dec 1, 2024

This issue was closed because it has been inactive for 14 days since being marked as stale.

github-actions bot closed this as completed on Dec 1, 2024