[bug] unnecessary batch logits post processor calls #2439

Open
akhoroshev opened this issue Nov 12, 2024 · 2 comments
Labels: triaged (Issue has been triaged by maintainers)

Comments

@akhoroshev
Contributor

akhoroshev commented Nov 12, 2024

version

When I build the model with paged_context_fmha = true and max_num_tokens = 4096, chunked context is enabled. I see that the Executor calls the batch logits post processor more than once for the first token.

To demonstrate this, I print the number of tokens inside the callback (FusedLogitsProcessor::process is my implementation of the callback).

I send requests with different input sizes and set maxTokens to 3.
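
For reference, the callback itself does nothing unusual: it only logs how many tokens beam 0 currently has on every invocation. Below is a minimal sketch of such a logging callback; the tensorrt_llm::executor types and the exact callback signature are stand-ins here and may differ between versions.

```cpp
#include <cstdint>
#include <cstdio>
#include <optional>
#include <vector>

// Placeholder aliases standing in for the tensorrt_llm::executor types
// (illustrative only; the real definitions live in the executor API).
using TokenIdType = std::int32_t;
using VecTokens = std::vector<TokenIdType>;
using BeamTokens = std::vector<VecTokens>; // one token vector per beam
using IdType = std::uint64_t;
struct Tensor {}; // stand-in for executor::Tensor (the logits)
struct Stream {}; // stand-in for the CUDA stream wrapper

// Sketch of the per-request callback that produced the logs below:
// it only reports how many tokens beam 0 currently has.
void processSketch(IdType /*reqId*/, Tensor& /*logits*/, BeamTokens const& beamTokens,
    Stream const& /*stream*/, std::optional<IdType> /*clientId*/)
{
    // With chunked context enabled this fires once per context chunk for the
    // first token, so the same size shows up several times in the logs.
    std::fprintf(stderr, "FusedLogitsProcessor::process, beamToken.size() %zu\n",
        beamTokens.front().size());
}
```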

input_context_size: 18810

[TensorRT-LLM][ERROR] FusedLogitsProcessor::process, beamToken.size() 18810
[TensorRT-LLM][ERROR] FusedLogitsProcessor::process, beamToken.size() 18810
[TensorRT-LLM][ERROR] FusedLogitsProcessor::process, beamToken.size() 18810
[TensorRT-LLM][ERROR] FusedLogitsProcessor::process, beamToken.size() 18810
[TensorRT-LLM][ERROR] FusedLogitsProcessor::process, beamToken.size() 18810
[TensorRT-LLM][ERROR] FusedLogitsProcessor::process, beamToken.size() 18811
[TensorRT-LLM][ERROR] FusedLogitsProcessor::process, beamToken.size() 18812

input_context_size: 15014

[TensorRT-LLM][ERROR] FusedLogitsProcessor::process, beamToken.size() 15014
[TensorRT-LLM][ERROR] FusedLogitsProcessor::process, beamToken.size() 15014
[TensorRT-LLM][ERROR] FusedLogitsProcessor::process, beamToken.size() 15014
[TensorRT-LLM][ERROR] FusedLogitsProcessor::process, beamToken.size() 15014
[TensorRT-LLM][ERROR] FusedLogitsProcessor::process, beamToken.size() 15015
[TensorRT-LLM][ERROR] FusedLogitsProcessor::process, beamToken.size() 15016

input_context_size: 12585

[TensorRT-LLM][ERROR] FusedLogitsProcessor::process, beamToken.size() 12585
[TensorRT-LLM][ERROR] FusedLogitsProcessor::process, beamToken.size() 12585
[TensorRT-LLM][ERROR] FusedLogitsProcessor::process, beamToken.size() 12585
[TensorRT-LLM][ERROR] FusedLogitsProcessor::process, beamToken.size() 12585
[TensorRT-LLM][ERROR] FusedLogitsProcessor::process, beamToken.size() 12586
[TensorRT-LLM][ERROR] FusedLogitsProcessor::process, beamToken.size() 12587

input_context_size: 8176

[TensorRT-LLM][ERROR] FusedLogitsProcessor::process, beamToken.size() 8176
[TensorRT-LLM][ERROR] FusedLogitsProcessor::process, beamToken.size() 8176
[TensorRT-LLM][ERROR] FusedLogitsProcessor::process, beamToken.size() 8177
[TensorRT-LLM][ERROR] FusedLogitsProcessor::process, beamToken.size() 8178

You can see that the first-token logits callback is repeated ceil(input_context_size / max_num_tokens) times. In fact, the logits from the first ceil(input_context_size / max_num_tokens) - 1 calls are ignored (the sampling layers are not called), and the Executor returns exactly 3 tokens, as expected. But it is very strange to run a logits processor on "garbage" logits.
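
As a sanity check, the call counts in the logs match this formula for max_num_tokens = 4096. A throwaway snippet (not part of TensorRT-LLM) that reproduces the expected counts for the input sizes above:

```cpp
#include <cstdio>

int main()
{
    constexpr int maxNumTokens = 4096;
    int const contextSizes[] = {18810, 15014, 12585, 8176}; // input sizes from the logs above

    for (int const size : contextSizes)
    {
        // Number of context chunks = ceil(input_context_size / max_num_tokens).
        int const contextCalls = (size + maxNumTokens - 1) / maxNumTokens;
        // Plus two pure generation steps, since maxTokens = 3 and the first
        // token is produced together with the last context chunk.
        std::printf("context %5d -> %d context-chunk calls, %d callback calls total\n",
            size, contextCalls, contextCalls + 2);
    }
    return 0;
}
```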

@akhoroshev
Contributor Author

It would be great if you called the logits post processor for a request only if isLastContextChunk() || isGenerationInProgressState(). A hypothetical sketch of such a guard is shown below.
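
This is only an illustration of where the check could live in the executor's step loop; the request type, the invokeLogitsPostProcessor helper, and the loop itself are made up here, and only the two predicates come from the suggestion above.

```cpp
#include <memory>
#include <vector>

// Illustrative request type: only the two predicates suggested above are modelled;
// the real tensorrt_llm request class has a much richer interface.
struct LlmRequestSketch
{
    bool lastContextChunk{false};
    bool generationInProgress{false};

    bool isLastContextChunk() const { return lastContextChunk; }
    bool isGenerationInProgressState() const { return generationInProgress; }
};

// Hypothetical hook standing in for the executor's logits post processor invocation.
void invokeLogitsPostProcessor(LlmRequestSketch const& /*req*/) {}

void runLogitsPostProcessors(std::vector<std::shared_ptr<LlmRequestSketch>> const& scheduledRequests)
{
    for (auto const& llmRequest : scheduledRequests)
    {
        // Skip intermediate context chunks: their logits are discarded anyway
        // (sampling is not run for them), so the callback is pure overhead there.
        if (llmRequest->isLastContextChunk() || llmRequest->isGenerationInProgressState())
        {
            invokeLogitsPostProcessor(*llmRequest);
        }
    }
}
```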

hello-11 added the triaged label (Issue has been triaged by maintainers) on Nov 14, 2024
@amukkara

@akhoroshev thanks for pointing this out.

We will make the change to invoke the logits post processor only for the last context chunk.
