Add time to first token for llama runner #2141
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/2141
Note: links to docs will display an error until the docs builds have completed.
✅ No failures as of commit 6c28e7f with merge base 3e414fb.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
This pull request was exported from Phabricator. Differential Revision: D54223564
Summary: Add time to first generated token and other timing features. Since we're measuring the first token time, the token rate is measured both at the first token and over the remaining tokens.

* Model Load Time - just a timer around `ET_CHECK_OK_OR_RETURN_ERROR(load());`
* Total inference time - from immediately after model load until the end of the inference loop
  * First token time - from immediately after model load until the first generated (not prompt) token is printed
    * Prompt eval (comparable to llama.cpp `prompt_eval_time`) - prompt array allocation and tokenization; ends right before the inference loop starts
  * Remaining tokens - from immediately after the first token is emitted until the end of the inference loop
  * Net eval time (comparable to llama.cpp `eval_time`) - total time spent generating tokens

To implement:
* Sample time - amount of time spent sampling per token (present in llama.cpp)

Differential Revision: D54223564
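For illustration, here is a minimal C++ sketch of where these timestamps could be taken in a simplified generation loop. This is not the PR's actual code: `time_in_ms()`, `load_model()`, `tokenize_prompt()`, and `decode_one_token()` are hypothetical stand-ins for the runner's real entry points (the real load call is the `ET_CHECK_OK_OR_RETURN_ERROR(load())` named in the summary).

```cpp
#include <chrono>
#include <cstdint>
#include <cstdio>

// Hypothetical millisecond clock; the actual runner may use a
// different timing facility.
static int64_t time_in_ms() {
  return std::chrono::duration_cast<std::chrono::milliseconds>(
             std::chrono::steady_clock::now().time_since_epoch())
      .count();
}

// Illustrative stand-ins for the runner's real entry points.
static void load_model() { /* e.g. ET_CHECK_OK_OR_RETURN_ERROR(load()); */ }
static void tokenize_prompt() { /* prompt array allocation + tokenization */ }
static bool decode_one_token() { /* one step of the inference loop */ return true; }

int main() {
  const int64_t start_ms = time_in_ms();
  load_model();
  const int64_t load_end_ms = time_in_ms();         // Model Load Time ends here

  tokenize_prompt();
  const int64_t prompt_eval_end_ms = time_in_ms();  // Prompt eval ends here

  int64_t first_token_ms = 0;
  int generated = 0;
  const int max_new_tokens = 32;
  while (generated < max_new_tokens && decode_one_token()) {
    ++generated;
    if (generated == 1) {
      first_token_ms = time_in_ms();                // First token time ends here
    }
  }
  const int64_t end_ms = time_in_ms();              // Total inference time ends here

  std::printf("model load:       %lld ms\n", (long long)(load_end_ms - start_ms));
  std::printf("prompt eval:      %lld ms\n", (long long)(prompt_eval_end_ms - load_end_ms));
  std::printf("first token:      %lld ms\n", (long long)(first_token_ms - load_end_ms));
  std::printf("remaining tokens: %lld ms (%d tokens)\n",
              (long long)(end_ms - first_token_ms), generated - 1);
  std::printf("total inference:  %lld ms\n", (long long)(end_ms - load_end_ms));
  return 0;
}
```

With these timestamps, a remaining-tokens rate in the spirit of the summary would be `(generated - 1) / ((end_ms - first_token_ms) / 1000.0)` tokens per second, kept separate from the first-token window so prompt processing does not skew the steady-state rate.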
Force-pushed from 06083b3 to e478f8b.
Force-pushed from e478f8b to abd98c1.
Force-pushed from abd98c1 to 52545fe.
Force-pushed from 52545fe to 1936568.
Force-pushed from 5f0067c to d0e6269.
Force-pushed from d0e6269 to 1734687.
Force-pushed from 1734687 to 8c6cb45.
Force-pushed from 8c6cb45 to ff13f01.
Force-pushed from ff13f01 to 0aa3659.
Force-pushed from 0aa3659 to 6c28e7f.
Reviewed By: digantdesai, Jack-Khuu
This pull request has been merged in caee336.