
Auto e2e benchmarker. #372

Open · wants to merge 6 commits into main

Conversation

raikonenfnu (Member) commented Jan 25, 2024

Modifies SharkLLM and adds a benchmarking script to track the performance of SHARK-2.0 LLM models. Here is a sample output from the benchmarking script: https://gist.github.com/raikonenfnu/4120ddfdcb2964608c89d31079594d05
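For context, a minimal sketch of the shape of such an e2e benchmark loop, assuming a hypothetical `llm.prefill`/`llm.decode` API (the actual SharkLLM interface may differ; the result keys mirror the test assertions further down):

```python
import time

def benchmark_llm(llm, prompt_tokens, num_decode_tokens=25, num_iterations=1):
    # E2e benchmark: time the real prefill + decode path through Python,
    # rather than microbenchmarking a single compiled dispatch.
    results = []
    for _ in range(num_iterations):
        start = time.perf_counter()
        state = llm.prefill(prompt_tokens)  # assumed API
        prefill_time = time.perf_counter() - start

        start = time.perf_counter()
        for _ in range(num_decode_tokens):
            state = llm.decode(state)  # assumed API
        decode_time = time.perf_counter() - start

        results.append({
            "decoded_tokens": num_decode_tokens,
            "num_iterations": num_iterations,
            "prefill_speed(tok/s)": len(prompt_tokens) / prefill_time,
            "decode_speed(tok/s)": num_decode_tokens / decode_time,
        })
    return results
```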

IanNod (Contributor) commented Jan 25, 2024

I like the idea of this, but I'm concerned it took over 35 min for Test Turbine Models. Maybe this belongs more in a nightly than in every patch?

raikonenfnu (Member, Author) replied:

> I like the idea of this, but I'm concerned it took over 35 min for Test Turbine Models. Maybe this belongs more in a nightly than in every patch?

Ah, thanks for the suggestion, Ian. I think a good portion of the time is spent compiling the stateless Llama; let me try to make it reuse the vmfb when possible. If that doesn't work, I can move it into a nightly action. :)
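A minimal sketch of the vmfb reuse being proposed here, where the `compile_fn` helper and the file paths are illustrative assumptions rather than the actual script's:

```python
import os

def get_or_compile_vmfb(mlir_path, vmfb_path, compile_fn):
    # Skip the expensive compile step when a vmfb already exists and is
    # at least as new as the MLIR input it was compiled from.
    if os.path.exists(vmfb_path) and os.path.getmtime(
        vmfb_path
    ) >= os.path.getmtime(mlir_path):
        return vmfb_path
    compile_fn(mlir_path, vmfb_path)  # assumed compile helper
    return vmfb_path
```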

dan-garvey (Member) commented:

@saienduri, before you left your internship you were working on a benchmark following Ben's fancy double-vmfb thing. What happened to that?

raikonenfnu (Member, Author) replied:

> @saienduri, before you left your internship you were working on a benchmark following Ben's fancy double-vmfb thing. What happened to that?

Hey Dan, I think it's there, but it's using benchmark-module, which is good for microbenchmarking, as opposed to this one, which tests perf on an actual workload plus e2e Python.
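For contrast, a microbenchmark of a single compiled function, presumably via IREE's iree-benchmark-module tool, looks roughly like the sketch below; the module name, exported function, and input shape are illustrative assumptions:

```python
import subprocess

# Microbenchmark: times one exported vmfb function in isolation,
# bypassing tokenization and the e2e Python serving path entirely.
subprocess.run(
    [
        "iree-benchmark-module",
        "--module=stateless_llama.vmfb",  # assumed artifact name
        "--function=run_forward",         # assumed exported function
        "--device=local-task",
        "--input=1x1xi64=0",              # illustrative input
    ],
    check=True,
)
```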

raikonenfnu (Member, Author) replied:

> I like the idea of this, but I'm concerned it took over 35 min for Test Turbine Models. Maybe this belongs more in a nightly than in every patch?

@IanNod I brought it down to 23 minutes. I think before this test it was ~18 minutes. What do you think?

IanNod (Contributor) commented Jan 27, 2024

> > I like the idea of this, but I'm concerned it took over 35 min for Test Turbine Models. Maybe this belongs more in a nightly than in every patch?
>
> @IanNod I brought it down to 23 minutes. I think before this test it was ~18 minutes. What do you think?

Huh, it used to be ~10 mins. I wonder what brought it up to almost double that. I still feel this belongs more in a nightly, but I'm fine with it for now, as we have a lot of ramping up to do on CI work.

hf_auth_token=None,
compile_to="vmfb",
external_weights="safetensors",
# external_weight_file="Llama-2-7b-chat-hf-function-calling-v2_f16_int4.safetensors", Do not export weights because this doesn't get quantized

Review comment from a Contributor on the lines above:

Nit: remove commented code

assert benchmark_result[1]["decoded_tokens"] == 25
assert benchmark_result[1]["num_iterations"] == 1
assert benchmark_result[1]["decode_speed(tok/s)"] > 0
assert benchmark_result[1]["prefill_speed(tok/s)"] > 0

Review comment from a Contributor on the lines above:

Doesn't really test for regressions, just that it ran, right?
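One way to turn these assertions into an actual regression gate would be a threshold check against a stored baseline; a minimal sketch, where the baseline file name and tolerance are illustrative assumptions:

```python
import json

TOLERANCE = 0.10  # allow a 10% slowdown before failing (illustrative)

def check_regression(benchmark_result, baseline_path="benchmark_baseline.json"):
    # Compare measured speeds against a stored baseline instead of only
    # asserting they are positive, so CI can flag real slowdowns.
    with open(baseline_path) as f:
        baseline = json.load(f)
    for key in ("decode_speed(tok/s)", "prefill_speed(tok/s)"):
        measured = benchmark_result[1][key]
        expected = baseline[key]
        assert measured >= expected * (1 - TOLERANCE), (
            f"{key} regressed: {measured:.2f} vs baseline {expected:.2f} tok/s"
        )
```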
