When I build the model with paged_context_fmha = true and max_num_tokens = 4096, chunked context is enabled. I see that the Executor calls batch_logit_processor more than once for the first token.

To demonstrate this, I print the number of tokens inside the callback (FusedLogitsProcessor::process is my callback implementation). I send requests of different input sizes and set maxTokens to 3.
The printed input context sizes:

input_context_size: 18810
input_context_size: 15014
input_context_size: 12585
input_context_size: 8176

You can see that the first-token logits callback is repeated ceil(input_context_size / max_num_tokens) times. In fact, the logits from the first ceil(input_context_size / max_num_tokens) - 1 calls are ignored (the sampling layers are not run), and the Executor returns exactly 3 tokens, as expected. But it is very strange to run a logits processor on "garbage" logits.