
Make number of generated tokens consistent with CLI #1690

Merged (2 commits, Aug 21, 2024)
Conversation

@rasbt (Collaborator) commented Aug 21, 2024

Updated the llm.benchmark() method so that the number of generated tokens used for the calculation is consistent with the CLI (litgpt generate and litgpt chat).
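
For context, a minimal sketch of how the method touched here is called from the Python API (the checkpoint name and prompt are placeholders, and the exact benchmark() keyword arguments and return value are assumptions modeled on generate(); only the method name comes from this PR):

from litgpt import LLM

# Placeholder checkpoint; any downloaded litgpt checkpoint would do.
llm = LLM.load("microsoft/phi-2")

# benchmark() measures generation speed; after this PR the token count it uses
# should match what `litgpt generate` / `litgpt chat` report for the same settings.
# The keyword arguments below mirror generate() and are assumed, not the exact signature.
metrics = llm.benchmark(prompt="What do llamas eat?", max_new_tokens=50)
print(metrics)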

@rasbt enabled auto-merge (squash) August 21, 2024 16:07
@rasbt disabled auto-merge August 21, 2024 16:11
@rasbt (Collaborator, Author) commented Aug 21, 2024

Actually, I don't think the stream=False vs stream=True difference is an issue in the LLM API. It's reproducible for the LitGPT CLI as well. @Andrei-Aksionov @apaz-cli

Here are screenshots to show that I am not making this up.

1) Stream=False

[Screenshot: benchmark output with stream=False]
  • For a fair comparison, you need to set this to False:

include_prompt: bool = True,

2) Stream=True

[Screenshot: benchmark output with stream=True]
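
For reference, the two code paths being compared can also be exercised from the Python API; a minimal sketch (placeholder checkpoint and prompt; stream is the generate() argument that switches between the non-streaming and streaming paths):

from litgpt import LLM

llm = LLM.load("microsoft/phi-2")  # placeholder checkpoint

# Non-streaming path: returns the fully decoded string at once
text = llm.generate("What do llamas eat?", max_new_tokens=50)

# Streaming path: returns a generator that yields decoded pieces
pieces = llm.generate("What do llamas eat?", max_new_tokens=50, stream=True)
streamed_text = "".join(pieces)

# Timing each path around these calls is the kind of comparison shown above.
print(len(text), len(streamed_text))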

TLDR

So if we see this difference between streaming and non-streaming, I think this is a legacy difference, not something introduced by the LLM API or the benchmark code. Maybe the unified code in #1675 will address this. Otherwise, this is a future issue to look into.

@Andrei-Aksionov (Collaborator) commented
Interesting. Thanks for the comparison @rasbt
That would be an interesting task to solve. Maybe one day.

Or perhaps @apaz-cli might have noticed something regarding this issue while working on his PR.

@apaz-cli (Contributor) commented
@rasbt I tested both streaming and non-streaming against the new impl. They both definitely call next_token() the same number of times, with the same args. Both also make the same call to cat().

Literally the only difference is

# Chat
# Scalarize the tensor for all iterations beyond the first
input_pos[-1:].add_(1)

vs

# Generate, new impl
# Post-prefill create scalar tensor directly
input_pos = input_pos.add_(1)

But this difference being significant doesn't line up with the theory. I don't know.
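
A toy illustration of the two update styles quoted above (plain PyTorch, not the litgpt internals): both advance the KV-cache position by one per decoded token; they differ only in whether a one-element slice or the scalar tensor itself is updated in place.

import torch

T = 7  # hypothetical prompt length after prefill

# "Chat" style: keep a one-element tensor and bump its last slot in place
input_pos_chat = torch.tensor([T])
input_pos_chat[-1:].add_(1)            # -> tensor([8])

# "Generate, new impl" style: keep a scalar tensor and bump it in place
input_pos_gen = torch.tensor(T)
input_pos_gen = input_pos_gen.add_(1)  # -> tensor(8)

assert input_pos_chat.item() == input_pos_gen.item()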

@rasbt enabled auto-merge (squash) August 21, 2024 17:41
@rasbt merged commit aaed893 into main Aug 21, 2024
8 of 9 checks passed
@rasbt deleted the consistent-tokens branch August 21, 2024 17:55