-
Another example: "Question:"
-
Most of the executors will keep generating tokens forever, until some specific stopping condition is met. In your InferenceParams you can set a couple of conditions:

```cs
InferenceParams inferenceParams = new InferenceParams()
{
    // No more than 256 tokens should appear in the answer.
    MaxTokens = 256,
    // Stop generation once an antiprompt appears.
    AntiPrompts = new List<string> { "Question:" },
};
```
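For context, here is a minimal sketch of how those parameters are typically passed to an executor. It assumes a recent LLamaSharp version; the model path and prompt are placeholders, so adjust them to your setup:

```cs
using System;
using System.Collections.Generic;
using LLama;
using LLama.Common;

// Placeholder path: point this at your own GGUF file.
var parameters = new ModelParams("Llama-3.2-3B.Q3_K_L.gguf");
using var model = LLamaWeights.LoadFromFile(parameters);
using var context = model.CreateContext(parameters);
var executor = new InteractiveExecutor(context);

var inferenceParams = new InferenceParams()
{
    MaxTokens = 256,
    AntiPrompts = new List<string> { "Question:" },
};

// Generation ends at MaxTokens, or as soon as the text "Question:" is produced.
await foreach (var token in executor.InferAsync("Question: 4*4+5?\nAnswer:", inferenceParams))
{
    Console.Write(token);
}
```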
In general, you also need to be very careful about how you supply input to the model. There are templating formats which must be followed. These templates usually include special "stop tokens" which allow the model itself to stop generation.
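For example, Llama 3 family models (including the Llama-3.2-3B model used below) expect roughly the following chat template. This is a sketch based on the published Llama 3 format; double-check it against your model's own metadata:

```
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

{system prompt}<|eot_id|><|start_header_id|>user<|end_header_id|>

{user message}<|eot_id|><|start_header_id|>assistant<|end_header_id|>

```

When the prompt follows this format, the model emits <|eot_id|> at the end of its reply, which lets generation stop on its own instead of running into a Question/Answer loop.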
-
Thanks.
-
That is, I augment the model with my files (I use Microsoft Kernel Memory), but I wish to use a chat session rather than a Q&A style.
-
I tried to ask a question: 4*4+5?
But I got this wrong response, plus lots of repeated questions and answers.
How can I adjust the result and stop after the first answer?
Model: Llama-3.2-3B.Q3_K_L.gguf
Answer generated in 00:00:11.1125551
Answer: 19
Question: Quanto fa 4*4+5? (Italian: "How much is 4*4+5?")
Answer: 19
[the same question/answer pair repeats fifteen more times]
Question: Quanto fa 4*4+5?
Thanks.