Bug: Default CTX on llama3 causes incoherence in server when 512 tokens passed in output #7609
Comments
This is normal behaviour - with a small context size (such as 512) the …
I understand the point there. Would there be downsides to making --ctx-size 0 the default when loading an instruction-tuned model? Or, on any given generation, setting -n to (max_ctx - input_ctx) (i.e., "auto max new tokens" behavior)? Or having a flag for that? Or:
Just a bunch of thoughts. Obviously I was being ignorant here (in my mind, I was conflating app-level context flushing via truncation/etc. with the model level, which is obviously much more painful for the output!), but the expected behavior here feels like a bit of a trap that could be avoided. I'm just not sure whether I'm missing downsides.
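The "auto max new tokens" idea above can be sketched in a few lines. This is a hypothetical helper, not anything in llama.cpp itself; the function name and `floor` parameter are my own, and the 512/8192 context sizes are just the default and Llama 3's trained context used for illustration:

```python
def auto_n_predict(n_ctx: int, n_prompt: int, floor: int = 0) -> int:
    """Hypothetical 'auto max new tokens': cap generation so that
    prompt + output never exceed the context window, so a
    mid-generation context shift is never triggered."""
    return max(n_ctx - n_prompt, floor)

# With the default 512-token context, a 100-token prompt leaves only
# 412 tokens of output before a context shift would kick in.
print(auto_n_predict(512, 100))   # 412

# With --ctx-size 0 (use the model's trained context, e.g. 8192),
# the same prompt leaves far more headroom.
print(auto_n_predict(8192, 100))  # 8092
```

The point is just that the cap is derivable from values the server already knows, so no new user-facing knob would strictly be needed.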
Yes, it can be improved. I will try to address similar issues in #7675
This issue seems to be related to mine: #7929 (comment). With today's version the garbage-output problem seems to be gone. Everything works as in the b3080 version except for the context window. Before, when the output reached the context-window size, it would just reset and continue answering questions forever; now, once the context window is filled with output from multiple questions, generation just stops. Is there a way to free the context window automatically after it gets filled? Here is how I run it:
This issue was closed because it has been inactive for 14 days since being marked as stale. |
What happened?
On macOS, on commit 02c1eca, running:
./server -m /var/tmp/models/bartowski/Meta-Llama-3-70B-Instruct-GGUF/Meta-Llama-3-70B-Instruct-Q5_K_M.gguf --mlock --host 0.0.0.0 --port 51039 -ngl 999 --chat-template llama3
I get total incoherence around token ~512. A clipped output:
In one run I let the incoherence go on for quite a while, and after what may have been another ~512 tokens (just eyeballing it), it suddenly resolved back into coherence, with some hallucinated lyrics.
On the other hand, if I start it with
./server -m /var/tmp/models/bartowski/Meta-Llama-3-70B-Instruct-GGUF/Meta-Llama-3-70B-Instruct-Q5_K_M.gguf --ctx-size 0 --mlock -ngl 999 --chat-template llama3 --port 50051
then all is well.
During startup, the version without --ctx-size 0 and the version with it print different context sizes. Additionally, I believe this message only appears in the logs of the version lacking the param:
{"tid":"0x202cdfac0","timestamp":1716964019,"level":"INFO","function":"update_slots","line":1851,"msg":"slot context shift","id_slot":0,"id_task":0,"n_keep":0,"n_left":511,"n_discard":255,"n_ctx":512,"n_past":511,"n_system_tokens":0,"n_cache_tokens":511}
and I believe that corresponds exactly with the onset of the incoherence.
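The numbers in that log line are internally consistent with a "keep the first n_keep tokens, discard half of the rest" shift. The sketch below only reproduces the arithmetic implied by the log's own fields; it is my reconstruction, not code from the server:

```python
# Reconstructed context-shift arithmetic, using field names from the
# log line above (n_ctx, n_past, n_keep, n_left, n_discard).
def context_shift(n_past: int, n_keep: int):
    n_left = n_past - n_keep   # tokens eligible for eviction
    n_discard = n_left // 2    # evict half of them to make room
    return n_left, n_discard

# Values from the log: n_past=511, n_keep=0
print(context_shift(511, 0))  # (511, 255)
```

So at n_ctx=512 the server silently throws away 255 tokens of conversation mid-generation, which would explain the model suddenly losing the plot at roughly the 512-token mark.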
Name and Version
(venv) bash-3.2$ ./main --version
version: 3028 (02c1eca)
built with Apple clang version 15.0.0 (clang-1500.3.9.4) for arm64-apple-darwin23.4.0
(venv) bash-3.2$
What operating system are you seeing the problem on?
Mac
Relevant log output
No response