Make prefilling return first token for loadgen integration #143
Conversation
R: @vipannalla
LGTM, I'll let others review too
```diff
@@ -278,7 +278,7 @@ def test_llama_e2e_two_addtional_tokens(self):
     slot = 0

     # pylint: disable-next=all
-    prefill_result = engine.prefill(
+    prefill_result, _ = engine.prefill(
```
Please add the `first_token` to the `out_tokens` array (Line 288), otherwise `out_tokens` will not equal `expected_output_tokens`. Same for all other calls.
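A minimal sketch of the requested test change (the names `first_token` and `out_tokens` follow the review comment; the helper and the toy stand-ins for the engine are hypothetical): the token returned by prefill must seed `out_tokens` before the decode loop appends the rest, or the collected output will be one token short of `expected_output_tokens`.

```python
# Hypothetical sketch: seed out_tokens with the token returned by prefill,
# then append the tokens produced by each decode step.
def collect_tokens(prefill_fn, decode_fn, num_decode_steps):
    """Run one prefill followed by decode steps; return all output tokens."""
    state, first_token = prefill_fn()   # prefill now returns the first token
    out_tokens = [first_token]          # seed with the prefill token
    for _ in range(num_decode_steps):
        state, token = decode_fn(state)
        out_tokens.append(token)
    return out_tokens

# Toy stand-ins for engine.prefill / engine.generate:
tokens = [101, 7, 8, 9]
prefill = lambda: ("state0", tokens[0])
decode = lambda s, _i=iter(tokens[1:]): (s, next(_i))
print(collect_tokens(prefill, decode, 3))  # [101, 7, 8, 9]
```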
+1
jetstream_pt/engine.py (Outdated)
```python
tokens_idx=(0, 1),
valid_idx=(1, 2),
length_idx=(2, 3),
samples_per_slot=1,
```
Suggestion: instead of magic numbers, use the same logic as in `generate` so it's clearer to the reader:

```python
length = token_out.shape[1]
result_tokens = engine_api.ResultTokens(
    data=data,
    tokens_idx=(0, length),
    valid_idx=(length, 2 * length),
    length_idx=(2 * length, 2 * length + 1),
    samples_per_slot=1,
)
```
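For intuition (a hypothetical check, not code from the PR): with the suggested formulas, a single-token prefill (`length == 1`) reduces exactly to the literal indices being replaced, so the change is behavior-preserving for the current case while generalizing to longer outputs.

```python
# Sketch: the index arithmetic from the suggestion, isolated for inspection.
def result_token_indices(length):
    return {
        "tokens_idx": (0, length),
        "valid_idx": (length, 2 * length),
        "length_idx": (2 * length, 2 * length + 1),
    }

print(result_token_indices(1))
# {'tokens_idx': (0, 1), 'valid_idx': (1, 2), 'length_idx': (2, 3)}
```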
Can you run an E2E test and share the output token result?
We should also change the decode return-token logic. My concern is that with the current implementation, the first decode token is the same as the prefill token.
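The concern can be illustrated with a toy model (all names hypothetical, not the engine API): if the decode loop starts from the same position prefill already sampled, the first token appears twice in the combined output.

```python
def emitted_tokens(sampled, prefill_consumes_first):
    """Toy model: prefill emits sampled[0]; the decode loop emits from
    index 1 if it accounts for the prefill token, else from index 0
    (which duplicates the first token in the combined output)."""
    out = [sampled[0]]                       # token returned by prefill
    start = 1 if prefill_consumes_first else 0
    out.extend(sampled[start:])              # tokens from the decode loop
    return out

print(emitted_tokens([5, 6, 7], prefill_consumes_first=False))  # [5, 5, 6, 7]
print(emitted_tokens([5, 6, 7], prefill_consumes_first=True))   # [5, 6, 7]
```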
* Make prefilling return first token for loadgen integration
* minor fix and lint
* enable passing of max_decode_length as a flag
No description provided.