Add time to first token for llama runner #2141

Closed · wants to merge 1 commit

Commits on Mar 14, 2024

  1. Add time to first token for llama runner (pytorch#2141)

    Summary:
    
Add time to first generated token and other timing metrics
    
    
    
Since we're measuring the first token time, the token rate is reported both with and without the first generated token included (a sketch of the timer placement follows the list below). The intervals measured are:

* Model Load Time: a timer around `ET_CHECK_OK_OR_RETURN_ERROR(load());`
* Total inference time: from immediately after model load until the end of the inference loop
  * First token time: from immediately after model load until the first generated (not prompt) token is printed
    * Prompt eval: (comparable to llama.cpp `prompt_eval_time`) prompt array allocation and tokenization; ends right before the inference loop starts
  * Remaining tokens: from immediately after the first token is emitted until the end of the inference loop
  * Net eval time: (comparable to llama.cpp `eval_time`) total time spent generating tokens
* Sample time: time spent sampling per token (also present in llama.cpp)
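
For reference, here is a minimal sketch of how these measurement points could sit around the generation loop. This is not the actual diff: `time_in_ms`, `generate_with_timers`, the loop shape, and `max_new_tokens` are illustrative assumptions, and the real model-load call appears only as a comment.

```cpp
#include <chrono>
#include <cstdint>
#include <cstdio>

// Hypothetical millisecond wall-clock helper; the actual runner may use a
// different timer utility.
static int64_t time_in_ms() {
  return std::chrono::duration_cast<std::chrono::milliseconds>(
             std::chrono::steady_clock::now().time_since_epoch())
      .count();
}

void generate_with_timers() {
  const int64_t start = time_in_ms();
  // ET_CHECK_OK_OR_RETURN_ERROR(load());  // Model Load Time wraps this call.
  const int64_t model_load_end = time_in_ms();

  // Prompt eval: prompt array allocation and tokenization, ending right
  // before the inference loop starts.
  const int64_t prompt_eval_end = time_in_ms();

  int64_t first_token_end = 0;  // set when the first generated token prints
  int64_t sample_time = 0;      // accumulated per-token sampling time
  int num_generated = 0;
  const int max_new_tokens = 128;  // assumed loop bound

  while (num_generated < max_new_tokens) {
    // ... forward pass ...
    const int64_t sample_start = time_in_ms();
    // ... sample next token ...
    sample_time += time_in_ms() - sample_start;
    // ... print token ...
    if (num_generated == 0) {
      first_token_end = time_in_ms();  // first generated (not prompt) token
    }
    ++num_generated;
  }
  const int64_t inference_end = time_in_ms();

  printf("Model load time:       %lld ms\n", (long long)(model_load_end - start));
  printf("Total inference time:  %lld ms\n", (long long)(inference_end - model_load_end));
  printf("Time to first token:   %lld ms\n", (long long)(first_token_end - model_load_end));
  printf("Prompt eval time:      %lld ms\n", (long long)(prompt_eval_end - model_load_end));
  printf("Remaining tokens time: %lld ms\n", (long long)(inference_end - first_token_end));
  printf("Sample time:           %lld ms\n", (long long)sample_time);

  // Token rate both with and without the first generated token included.
  if (num_generated > 1 && inference_end > first_token_end &&
      inference_end > model_load_end) {
    const double rate_incl =
        num_generated * 1000.0 / (inference_end - model_load_end);
    const double rate_excl =
        (num_generated - 1) * 1000.0 / (inference_end - first_token_end);
    printf("Tokens/s (incl. first): %.2f, (excl. first): %.2f\n",
           rate_incl, rate_excl);
  }
}
```

Reporting the rate both ways is useful because prompt processing inflates the first token's latency, so the excluding-first-token rate better reflects steady-state decode speed.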
    
    bypass-github-executorch-ci-checks
    bypass-github-pytorch-ci-checks
    
    Reviewed By: digantdesai, Jack-Khuu
    
    Differential Revision: D54223564
    Varun Puri authored and facebook-github-bot committed Mar 14, 2024
Commit: 6c28e7f