[bug] unnecessary batch logits post processor calls #2439

Open
akhoroshev opened this issue Nov 12, 2024 · 2 comments
Labels: triaged (Issue has been triaged by maintainers)

Comments

@akhoroshev
Contributor

akhoroshev commented Nov 12, 2024

version

When I build the model with paged_context_fmha = true and max_num_tokens = 4096, chunked context is enabled. I see that the Executor calls the batch logits post processor more than once for the first token.

To demonstrate this, I print the number of tokens inside the callback (FusedLogitsProcessor::process is my implementation of the callback).

I send requests with different input sizes and set maxTokens to 3.
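
For reference, the callback itself does nothing unusual: it only logs how many tokens beam 0 currently has on every invocation. Below is a minimal sketch of such a logging callback; the tensorrt_llm::executor types and the exact callback signature are stand-ins here and may differ between versions.

```cpp
#include <cstdint>
#include <cstdio>
#include <optional>
#include <vector>

// Placeholder aliases standing in for the tensorrt_llm::executor types
// (illustrative only; the real definitions live in the executor API).
using TokenIdType = std::int32_t;
using VecTokens = std::vector<TokenIdType>;
using BeamTokens = std::vector<VecTokens>; // one token vector per beam
using IdType = std::uint64_t;
struct Tensor {}; // stand-in for executor::Tensor (the logits)
struct Stream {}; // stand-in for the CUDA stream wrapper

// Sketch of the per-request callback that produced the logs below:
// it only reports how many tokens beam 0 currently has.
void processSketch(IdType /*reqId*/, Tensor& /*logits*/, BeamTokens const& beamTokens,
    Stream const& /*stream*/, std::optional<IdType> /*clientId*/)
{
    // With chunked context enabled this fires once per context chunk for the
    // first token, so the same size shows up several times in the logs.
    std::fprintf(stderr, "FusedLogitsProcessor::process, beamToken.size() %zu\n",
        beamTokens.front().size());
}
```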

input_context_size: 18810

[TensorRT-LLM][ERROR] FusedLogitsProcessor::process, beamToken.size() 18810
[TensorRT-LLM][ERROR] FusedLogitsProcessor::process, beamToken.size() 18810
[TensorRT-LLM][ERROR] FusedLogitsProcessor::process, beamToken.size() 18810
[TensorRT-LLM][ERROR] FusedLogitsProcessor::process, beamToken.size() 18810
[TensorRT-LLM][ERROR] FusedLogitsProcessor::process, beamToken.size() 18810
[TensorRT-LLM][ERROR] FusedLogitsProcessor::process, beamToken.size() 18811
[TensorRT-LLM][ERROR] FusedLogitsProcessor::process, beamToken.size() 18812

input_context_size: 15014

[TensorRT-LLM][ERROR] FusedLogitsProcessor::process, beamToken.size() 15014
[TensorRT-LLM][ERROR] FusedLogitsProcessor::process, beamToken.size() 15014
[TensorRT-LLM][ERROR] FusedLogitsProcessor::process, beamToken.size() 15014
[TensorRT-LLM][ERROR] FusedLogitsProcessor::process, beamToken.size() 15014
[TensorRT-LLM][ERROR] FusedLogitsProcessor::process, beamToken.size() 15015
[TensorRT-LLM][ERROR] FusedLogitsProcessor::process, beamToken.size() 15016

input_context_size: 12585

[TensorRT-LLM][ERROR] FusedLogitsProcessor::process, beamToken.size() 12585
[TensorRT-LLM][ERROR] FusedLogitsProcessor::process, beamToken.size() 12585
[TensorRT-LLM][ERROR] FusedLogitsProcessor::process, beamToken.size() 12585
[TensorRT-LLM][ERROR] FusedLogitsProcessor::process, beamToken.size() 12585
[TensorRT-LLM][ERROR] FusedLogitsProcessor::process, beamToken.size() 12586
[TensorRT-LLM][ERROR] FusedLogitsProcessor::process, beamToken.size() 12587

input_context_size: 8176

[TensorRT-LLM][ERROR] FusedLogitsProcessor::process, beamToken.size() 8176
[TensorRT-LLM][ERROR] FusedLogitsProcessor::process, beamToken.size() 8176
[TensorRT-LLM][ERROR] FusedLogitsProcessor::process, beamToken.size() 8177
[TensorRT-LLM][ERROR] FusedLogitsProcessor::process, beamToken.size() 8178

You can see that the first-token logits callback is repeated ceil(input_context_size / max_num_tokens) times. In fact, the logits from the first ceil(input_context_size / max_num_tokens) - 1 calls are ignored (the sampling layers are not called), and the Executor returns exactly 3 tokens, as expected. But it is very strange to run a logits processor on "garbage" logits.
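
As a sanity check, the call counts in the logs match this formula for max_num_tokens = 4096. A throwaway snippet (not part of TensorRT-LLM) that reproduces the expected counts for the input sizes above:

```cpp
#include <cstdio>

int main()
{
    constexpr int maxNumTokens = 4096;
    int const contextSizes[] = {18810, 15014, 12585, 8176}; // input sizes from the logs above

    for (int const size : contextSizes)
    {
        // Number of context chunks = ceil(input_context_size / max_num_tokens).
        int const contextCalls = (size + maxNumTokens - 1) / maxNumTokens;
        // Plus two pure generation steps, since maxTokens = 3 and the first
        // token is produced together with the last context chunk.
        std::printf("context %5d -> %d context-chunk calls, %d callback calls total\n",
            size, contextCalls, contextCalls + 2);
    }
    return 0;
}
```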

@akhoroshev
Contributor Author

It would be great if you called the logits post processor for a request only if isLastContextChunk() || isGenerationInProgressState(). A hypothetical sketch of such a guard is shown below.
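
This is only an illustration of where the check could live in the executor's step loop; the request type, the invokeLogitsPostProcessor helper, and the loop itself are made up here, and only the two predicates come from the suggestion above.

```cpp
#include <memory>
#include <vector>

// Illustrative request type: only the two predicates suggested above are modelled;
// the real tensorrt_llm request class has a much richer interface.
struct LlmRequestSketch
{
    bool lastContextChunk{false};
    bool generationInProgress{false};

    bool isLastContextChunk() const { return lastContextChunk; }
    bool isGenerationInProgressState() const { return generationInProgress; }
};

// Hypothetical hook standing in for the executor's logits post processor invocation.
void invokeLogitsPostProcessor(LlmRequestSketch const& /*req*/) {}

void runLogitsPostProcessors(std::vector<std::shared_ptr<LlmRequestSketch>> const& scheduledRequests)
{
    for (auto const& llmRequest : scheduledRequests)
    {
        // Skip intermediate context chunks: their logits are discarded anyway
        // (sampling is not run for them), so the callback is pure overhead there.
        if (llmRequest->isLastContextChunk() || llmRequest->isGenerationInProgressState())
        {
            invokeLogitsPostProcessor(*llmRequest);
        }
    }
}
```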

hello-11 added the triaged label (Issue has been triaged by maintainers) on Nov 14, 2024
@amukkara

@akhoroshev thanks for pointing this out.

We will make the change to invoke the logits post processor only for the last context chunk.
