Bug: Default CTX on llama3 causes incoherence in server when 512 tokens passed in output #7609
Comments
This is normal behaviour - with a small context size (such as 512) the …
I understand the point there. Would there be downsides to making --ctx-size 0 the default when loading an instruction-tuned model? Or, on any given generation, setting -n to (max_ctx - input_ctx) (i.e., "auto max new tokens" behavior)? Or having a flag for that? Or:
Just a bunch of thoughts. Obviously I was being ignorant here (in my mind, I was conflating app-level context flushing via truncation/etc. with the model level, which is obviously much more painful for the output!), but the expected behavior here feels like a bit of a trap that could be avoided. I'm just not sure whether I'm missing downsides.
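The "auto max new tokens" idea above can be sketched in a few lines. This is a hypothetical helper, not anything in llama.cpp itself; the function name and `floor` parameter are my own, and the 512/8192 context sizes are just the default and Llama 3's trained context used for illustration:

```python
def auto_n_predict(n_ctx: int, n_prompt: int, floor: int = 0) -> int:
    """Hypothetical 'auto max new tokens': cap generation so that
    prompt + output never exceed the context window, so a
    mid-generation context shift is never triggered."""
    return max(n_ctx - n_prompt, floor)

# With the default 512-token context, a 100-token prompt leaves only
# 412 tokens of output before a context shift would kick in.
print(auto_n_predict(512, 100))   # 412

# With --ctx-size 0 (use the model's trained context, e.g. 8192),
# the same prompt leaves far more headroom.
print(auto_n_predict(8192, 100))  # 8092
```

The point is just that the cap is derivable from values the server already knows, so no new user-facing knob would strictly be needed.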
Yes, it can be improved. I will try to address similar issues in #7675
This issue seems to be related to mine: #7929 (comment). With today's version the garbage-output problem seems to be gone. Everything works as in the b3080 version except for the context window. Before, when the output reached the context-window size, it would just reset and continue answering questions forever; now, once the context window is filled with output from multiple questions, generation just stops. Is there a way to free the context window automatically after it gets filled? Here is how I run it:
This issue was closed because it has been inactive for 14 days since being marked as stale. |
What happened?
On macOS, on commit 02c1eca, running:
./server -m /var/tmp/models/bartowski/Meta-Llama-3-70B-Instruct-GGUF/Meta-Llama-3-70B-Instruct-Q5_K_M.gguf --mlock --host 0.0.0.0 --port 51039 -ngl 999 --chat-template llama3
I get total incoherence around token ~512. A clipped output:
In one run I let the incoherence go on for quite a while, and after what may have been another ~512 tokens (just eyeballing it), it suddenly resolved back into coherence, with some hallucinated lyrics.
On the other hand, if I start it with
./server -m /var/tmp/models/bartowski/Meta-Llama-3-70B-Instruct-GGUF/Meta-Llama-3-70B-Instruct-Q5_K_M.gguf --ctx-size 0 --mlock -ngl 999 --chat-template llama3 --port 50051
then all is well.
During startup, the version without --ctx-size 0 and the version with it print different context sizes. Additionally, I believe this message only appears in the logs of the version lacking the param:
{"tid":"0x202cdfac0","timestamp":1716964019,"level":"INFO","function":"update_slots","line":1851,"msg":"slot context shift","id_slot":0,"id_task":0,"n_keep":0,"n_left":511,"n_discard":255,"n_ctx":512,"n_past":511,"n_system_tokens":0,"n_cache_tokens":511}
and I believe that corresponds exactly with the onset of the incoherence.
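The numbers in that log line are internally consistent with a "keep the first n_keep tokens, discard half of the rest" shift. The sketch below only reproduces the arithmetic implied by the log's own fields; it is my reconstruction, not code from the server:

```python
# Reconstructed context-shift arithmetic, using field names from the
# log line above (n_ctx, n_past, n_keep, n_left, n_discard).
def context_shift(n_past: int, n_keep: int):
    n_left = n_past - n_keep   # tokens eligible for eviction
    n_discard = n_left // 2    # evict half of them to make room
    return n_left, n_discard

# Values from the log: n_past=511, n_keep=0
print(context_shift(511, 0))  # (511, 255)
```

So at n_ctx=512 the server silently throws away 255 tokens of conversation mid-generation, which would explain the model suddenly losing the plot at roughly the 512-token mark.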
Name and Version
(venv) bash-3.2$ ./main --version
version: 3028 (02c1eca)
built with Apple clang version 15.0.0 (clang-1500.3.9.4) for arm64-apple-darwin23.4.0
(venv) bash-3.2$
What operating system are you seeing the problem on?
Mac
Relevant log output
No response