
[BUG]: llama3.1 8B Context Size Max Tokens Ignored in Both Performance Modes #2442

Closed
rurhrlaub opened this issue Oct 8, 2024 · 6 comments · Fixed by #2874
Assignees
timothycarambat

Labels
core-team-only
investigating: Core team or maintainer will or is currently looking into this issue
needs info / can't replicate: Issues that require additional information and/or cannot currently be replicated, but possible bug
possible bug: Bug was reported but is not confirmed or is unable to be replicated.

Comments

@rurhrlaub

How are you running AnythingLLM?

AnythingLLM desktop app

What happened?

(screenshot attached: AnythingLLM context-size settings)

When using "Base" as the "Performance Mode", the Max Tokens setting is ignored and Llama 3.1 is invoked with an 8K context size. When setting Performance Mode to "Maximum", the Max Tokens setting is again ignored and Llama 3.1 is invoked with a 128K context size. I created a Modelfile to enforce a 32K context size, but the result was still 128K. The workspace was set to use the system-defined LLM settings.

Are there known steps to reproduce?

See above
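
For reference, a minimal Modelfile of the kind described above; the base model tag and the exact num_ctx value are assumptions based on the report, not copied from the issue:

FROM llama3.1:8b
PARAMETER num_ctx 32768

A model built from this (e.g. ollama create llama3.1-32k -f Modelfile, where the name llama3.1-32k is illustrative) should request a 32K context unless the caller overrides num_ctx at request time.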

@rurhrlaub added the "possible bug" label Oct 8, 2024
@rurhrlaub (Author)

Anything LLM v1.6.7

@timothycarambat (Member)

> When using "Base" as the "Performance Mode", the Max Tokens setting is ignored and Llama 3.1 is invoked with an 8K context size

This is normal and expected. See this from Ollama: ollama/ollama#1005 (comment)
And you can see what we did in this conversation: #1991

> When setting Performance Mode to "Maximum", the Max Tokens setting is ignored and Llama 3.1 is invoked with a 128K context size. I created a Modelfile to enforce a 32K context size but the result was 128K

Any parameters passed into the API will override whatever is in a Modelfile in Ollama:
https://github.com/Mintplex-Labs/anything-llm/pull/2014/files#diff-df0e7523cd11db44d61e29cfb54f0bdc2ace72ffcf18abeca888d299efd2d738R37-R40

So here, we would be passing in whatever value you have for Max Tokens in the UI. How do you see 128K and where are you seeing that?
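
As a minimal sketch of that override behavior, assuming Ollama's standard /api/chat endpoint (the model tag and the 32K value below are illustrative, not taken from AnythingLLM's code):

// options.num_ctx sent with the request takes precedence over any
// num_ctx baked into the model's Modelfile.
const res = await fetch("http://127.0.0.1:11434/api/chat", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    model: "llama3.1:8b",                 // illustrative model tag
    messages: [{ role: "user", content: "Hello" }],
    stream: false,
    options: { num_ctx: 32768 },          // overrides the Modelfile value
  }),
});
const data = await res.json();
console.log(data.message.content);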

@timothycarambat added the "needs info / can't replicate" label Oct 9, 2024
@rurhrlaub (Author)

Looking at the --ctx-size parameter in the shell, it is always 8K or 128K, never 32K. The 128K setting is too large to execute, and 8K always truncates the data in context, producing incomplete results:

814339519 42877 42374 4004 0 31 0 35095332 14060 - S 0 ?? 0:01.61 /Applications/AnythingLLM.app/Contents/Resources/ollama/llm serve
814339519 63122 42877 4004 0 31 0 40038304 5776448 - S 0 ?? 17:02.59 /var/folders/p2/4xbgs9lx7lvdxsffq0v123h0r8lndz/T/ollama4011639316/runners/cpu_avx2/ollama_llama_server --model /Users/ruhrlaub/Library/Application Support/anythingllm-desktop/storage/models/ollama/blobs/sha256-87048bcd55216712ef14c11c2c303728463207b165bf18440b9b84b07ec00f87 --ctx-size 8192 --batch-size 512 --embedding --log-disable --no-mmap --mlock --parallel 4 --port 57041

@rurhrlaub (Author)

Any progress or status on this? It is blocking development of the next version of the Anything LLM content pack in our marketplace.

@timothycarambat self-assigned this Oct 24, 2024
@blaineatnoeonai

We're having the same issue. We'd also like to run Ollama with a mid-sized context (128k is too much, 8k is too little).

@timothycarambat added the "core-team-only" and "investigating" labels Dec 18, 2024
@timothycarambat (Member)

The --ctx-size in the log line

87048bcd55216712ef14c11c2c303728463207b165bf18440b9b84b07ec00f87 --ctx-size 8192 --batch-size 512 --embedding --log-disable --no-mmap --mlock --parallel 4 --port 57041

is the total n_ctx shared across all parallel slots, not the per-request value. Look lower in the logs for the real per-request n_ctx (n_ctx_per_seq) to see whether your setting is applied.
For example, llama3.2:3b with a 1000-token limit:

llama_new_context_with_model: n_ctx_per_seq (1000) < n_ctx_train (xxxxx) -- the full capacity of the model will not be utilized
.....
llm_load_tensors:          CPU model buffer size =  1918.35 MiB
llama_new_context_with_model: n_seq_max     = 4
llama_new_context_with_model: n_ctx         = 4000 # because 1,000*4 slots - this is what ollama shows for --ctx-size
llama_new_context_with_model: n_ctx_per_seq = 1000 # your original input
llama_new_context_with_model: n_batch       = 2048
llama_new_context_with_model: n_ubatch      = 512
llama_new_context_with_model: flash_attn    = 0
llama_new_context_with_model: freq_base     = 500000.0
llama_new_context_with_model: freq_scale    = 1
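
A small sketch of the arithmetic at work (values taken from the log above):

// The server-level n_ctx is the per-request context multiplied by
// the number of parallel slots, which is why --ctx-size looks inflated.
const nCtxPerSeq = 1000; // your Max Tokens input (n_ctx_per_seq)
const nSeqMax = 4;       // parallel slots (n_seq_max / --parallel)
const nCtx = nCtxPerSeq * nSeqMax;
console.log(nCtx);       // 4000, what Ollama shows for --ctx-size

Read the other way, the --ctx-size 8192 with --parallel 4 in the earlier ps output corresponds to 2048 tokens per request.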
