
Make prefilling return first token for loadgen integration #143

Merged 4 commits into main on Jul 10, 2024
Conversation

sixiang-google
Collaborator

No description provided.

@sixiang-google
Collaborator Author

R: @vipannalla


@vipannalla vipannalla left a comment


LGTM, I'll let others review too

@@ -278,7 +278,7 @@ def test_llama_e2e_two_addtional_tokens(self):
     slot = 0

     # pylint: disable-next=all
-    prefill_result = engine.prefill(
+    prefill_result, _ = engine.prefill(

Please add the first_token to the out_tokens array (Line 288); otherwise out_tokens will not equal expected_output_tokens. Same for all other calls.
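The fix the reviewer is asking for can be sketched as below. This is a hypothetical, self-contained stand-in (fake_prefill is not the repo's engine; the dummy next-token rule is invented for illustration) showing the test-side pattern: since prefill now returns the first generated token, the test must record it before the decode loop, or out_tokens ends up one token short.

```python
# Hypothetical sketch: prefill now returns (state, first_token), so the
# test must append first_token to out_tokens before decoding.
def fake_prefill(prompt_tokens):
    # Stand-in for engine.prefill: some opaque state plus the first
    # generated token (dummy "next token is last + 1" rule).
    state = {"cache": list(prompt_tokens)}
    first_token = prompt_tokens[-1] + 1
    return state, first_token

out_tokens = []
prefill_result, first_token = fake_prefill([1, 2, 3])
out_tokens.append(first_token)  # the reviewer's requested fix
# ... the decode loop would append subsequent tokens here ...
```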

+1

Comment on lines 282 to 285
tokens_idx = (0, 1),
valid_idx = (1, 2),
length_idx = (2, 3),
samples_per_slot = 1,

Suggestion: instead of magic numbers, use the same logic as in generate so it's clearer to the reader:

   length = token_out.shape[1]
   result_tokens = engine_api.ResultTokens(
       data=data,
       tokens_idx=(0, length),
       valid_idx=(length, 2 * length),
       length_idx=(2 * length, 2 * length + 1),
       samples_per_slot=1,
   )
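To see why those index ranges line up, here is a minimal sketch of the layout the suggestion assumes: tokens, validity flags, and lengths concatenated along axis 1, so each (start, stop) pair slices out one field. The array values and field names are illustrative, not taken from the repo.

```python
import numpy as np

# Hypothetical ResultTokens-style layout: tokens, validity flags, and
# per-row lengths are concatenated along axis 1.
token_out = np.array([[5, 7, 9]])      # shape (1, length)
length = token_out.shape[1]
valid = np.ones_like(token_out)        # shape (1, length)
lengths = np.array([[length]])         # shape (1, 1)
data = np.concatenate([token_out, valid, lengths], axis=1)

# The same index arithmetic as in the suggestion above.
tokens_idx = (0, length)
valid_idx = (length, 2 * length)
length_idx = (2 * length, 2 * length + 1)

# Each range recovers exactly the field it names.
assert (data[:, tokens_idx[0]:tokens_idx[1]] == token_out).all()
assert (data[:, valid_idx[0]:valid_idx[1]] == valid).all()
assert data[:, length_idx[0]:length_idx[1]][0, 0] == length
```

Deriving the indices from token_out.shape[1] keeps the test correct even if the number of tokens per step changes.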

@vipannalla vipannalla requested review from qihqi and FanhaiLu1 July 8, 2024 20:57

@FanhaiLu1 FanhaiLu1 left a comment


Can you run an E2E test and share the output token result?

We should also change the decode return-token logic. My concern is that in the current implementation the first decoded token is the same as the prefill token.
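The concern above can be illustrated with a minimal sketch (hypothetical prefill/decode functions, not the repo's engine): once prefill emits the first token, decode's first step must advance past it rather than re-emit it, or the output stream contains a duplicate.

```python
# Hypothetical sketch of the duplicated-token concern. Dummy
# "next token is last + 1" rule stands in for the real model.
def prefill(prompt):
    first_token = prompt[-1] + 1
    state = {"pos": len(prompt), "last": first_token}
    return state, first_token

def decode(state):
    # Correct behavior: advance past the token prefill already returned.
    next_token = state["last"] + 1
    state["last"] = next_token
    return state, next_token

state, t1 = prefill([10, 20])
state, t2 = decode(state)
assert t2 != t1  # no duplicated first token in the output stream
```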

@qihqi qihqi merged commit 50a6d10 into main Jul 10, 2024
4 checks passed
wang2yn84 pushed a commit that referenced this pull request Jul 18, 2024
* Make prefilling return first token for loadgen integration

* minor fix and lint

* enable passing of max_decode_length as a flag