
[BUG]: llama3.1 8B Context Size Max Tokens Ignored in Both Performance Modes #2442

Closed
rurhrlaub opened this issue Oct 8, 2024 · 6 comments · Fixed by #2874
Assignees
timothycarambat

Labels
core-team-only
investigating: Core team or maintainer will or is currently looking into this issue
needs info / can't replicate: Issues that require additional information and/or cannot currently be replicated, but possible bug
possible bug: Bug was reported but is not confirmed or is unable to be replicated.

Comments

@rurhrlaub

How are you running AnythingLLM?

AnythingLLM desktop app

What happened?

(screenshot attached: AnythingLLM context-size settings)

When using "Base" as the "Performance Mode", the Max Tokens setting is ignored and Llama 3.1 is invoked with an 8K context size. When setting Performance Mode to "Maximum", the Max Tokens setting is again ignored and Llama 3.1 is invoked with a 128K context size. I created a Modelfile to enforce a 32K context size, but the result was still 128K. The workspace was set to use the system-defined LLM settings.

Are there known steps to reproduce?

See above
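
For reference, a minimal Modelfile of the kind described above; the base model tag and the exact num_ctx value are assumptions based on the report, not copied from the issue:

FROM llama3.1:8b
PARAMETER num_ctx 32768

A model built from this (e.g. ollama create llama3.1-32k -f Modelfile, where the name llama3.1-32k is illustrative) should request a 32K context unless the caller overrides num_ctx at request time.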

@rurhrlaub added the "possible bug" label Oct 8, 2024
@rurhrlaub (Author)

Anything LLM v1.6.7

@timothycarambat (Member)

> When using "Base" as the "Performance Mode", the Max Tokens setting is ignored and Llama 3.1 is invoked with an 8K context size

This is normal and expected. See this from Ollama: ollama/ollama#1005 (comment)
And you can see what we did in this conversation: #1991

> When setting Performance Mode to "Maximum", the Max Tokens setting is ignored and Llama 3.1 is invoked with a 128K context size. I created a Modelfile to enforce a 32K context size but the result was 128K

Any parameters passed into the API will override whatever is in a Modelfile in Ollama:
https://github.com/Mintplex-Labs/anything-llm/pull/2014/files#diff-df0e7523cd11db44d61e29cfb54f0bdc2ace72ffcf18abeca888d299efd2d738R37-R40

So here, we would be passing in whatever value you have for Max Tokens in the UI. How do you see 128K and where are you seeing that?
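
As a minimal sketch of that override behavior, assuming Ollama's standard /api/chat endpoint (the model tag and the 32K value below are illustrative, not taken from AnythingLLM's code):

// options.num_ctx sent with the request takes precedence over any
// num_ctx baked into the model's Modelfile.
const res = await fetch("http://127.0.0.1:11434/api/chat", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    model: "llama3.1:8b",                 // illustrative model tag
    messages: [{ role: "user", content: "Hello" }],
    stream: false,
    options: { num_ctx: 32768 },          // overrides the Modelfile value
  }),
});
const data = await res.json();
console.log(data.message.content);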

@timothycarambat added the "needs info / can't replicate" label Oct 9, 2024
@rurhrlaub (Author)

Looking at the --ctx-size parameter in the shell, it is always 8K or 128K, never 32K. The 128K setting is too large to execute, and 8K always truncates the data in context, producing incomplete results:

814339519 42877 42374 4004 0 31 0 35095332 14060 - S 0 ?? 0:01.61 /Applications/AnythingLLM.app/Contents/Resources/ollama/llm serve
814339519 63122 42877 4004 0 31 0 40038304 5776448 - S 0 ?? 17:02.59 /var/folders/p2/4xbgs9lx7lvdxsffq0v123h0r8lndz/T/ollama4011639316/runners/cpu_avx2/ollama_llama_server --model /Users/ruhrlaub/Library/Application Support/anythingllm-desktop/storage/models/ollama/blobs/sha256-87048bcd55216712ef14c11c2c303728463207b165bf18440b9b84b07ec00f87 --ctx-size 8192 --batch-size 512 --embedding --log-disable --no-mmap --mlock --parallel 4 --port 57041

@rurhrlaub (Author)

Any progress or status on this? It is blocking development of the next version of the Anything LLM content pack in our marketplace.

@timothycarambat self-assigned this Oct 24, 2024
@blaineatnoeonai

We're having the same issue. We'd also like to run Ollama with a mid-sized context (128k is too much, 8k is too little).

@timothycarambat added the "core-team-only" and "investigating" labels Dec 18, 2024
@timothycarambat (Member)

The --ctx-size in the log line

87048bcd55216712ef14c11c2c303728463207b165bf18440b9b84b07ec00f87 --ctx-size 8192 --batch-size 512 --embedding --log-disable --no-mmap --mlock --parallel 4 --port 57041

is the total n_ctx shared across all parallel slots, not the per-request value. Look lower in the logs for the real per-request n_ctx (n_ctx_per_seq) to see whether your setting is applied.
For example, llama3.2:3b with a 1000-token limit:

llama_new_context_with_model: n_ctx_per_seq (1000) < n_ctx_train (xxxxx) -- the full capacity of the model will not be utilized
.....
llm_load_tensors:          CPU model buffer size =  1918.35 MiB
llama_new_context_with_model: n_seq_max     = 4
llama_new_context_with_model: n_ctx         = 4000 # because 1,000*4 slots - this is what ollama shows for --ctx-size
llama_new_context_with_model: n_ctx_per_seq = 1000 # your original input
llama_new_context_with_model: n_batch       = 2048
llama_new_context_with_model: n_ubatch      = 512
llama_new_context_with_model: flash_attn    = 0
llama_new_context_with_model: freq_base     = 500000.0
llama_new_context_with_model: freq_scale    = 1
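
A small sketch of the arithmetic at work (values taken from the log above):

// The server-level n_ctx is the per-request context multiplied by
// the number of parallel slots, which is why --ctx-size looks inflated.
const nCtxPerSeq = 1000; // your Max Tokens input (n_ctx_per_seq)
const nSeqMax = 4;       // parallel slots (n_seq_max / --parallel)
const nCtx = nCtxPerSeq * nSeqMax;
console.log(nCtx);       // 4000, what Ollama shows for --ctx-size

Read the other way, the --ctx-size 8192 with --parallel 4 in the earlier ps output corresponds to 2048 tokens per request.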
