Add time to first token for llama runner #2141

Closed · wants to merge 1 commit

Commits on Mar 14, 2024

  1. Add time to first token for llama runner (pytorch#2141)

    Summary:
    
Add time to first generated token and other timing metrics
    
    
    
Since we're measuring the first token time, the token rate is reported both with and without the first generated token included (a sketch of the timer placement follows the list below). The intervals measured are:

* Model Load Time: a timer around `ET_CHECK_OK_OR_RETURN_ERROR(load());`
* Total inference time: from immediately after model load until the end of the inference loop
  * First token time: from immediately after model load until the first generated (not prompt) token is printed
    * Prompt eval: (comparable to llama.cpp `prompt_eval_time`) prompt array allocation and tokenization; ends right before the inference loop starts
  * Remaining tokens: from immediately after the first token is emitted until the end of the inference loop
  * Net eval time: (comparable to llama.cpp `eval_time`) total time spent generating tokens
* Sample time: time spent sampling per token (also present in llama.cpp)
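
For reference, here is a minimal sketch of how these measurement points could sit around the generation loop. This is not the actual diff: `time_in_ms`, `generate_with_timers`, the loop shape, and `max_new_tokens` are illustrative assumptions, and the real model-load call appears only as a comment.

```cpp
#include <chrono>
#include <cstdint>
#include <cstdio>

// Hypothetical millisecond wall-clock helper; the actual runner may use a
// different timer utility.
static int64_t time_in_ms() {
  return std::chrono::duration_cast<std::chrono::milliseconds>(
             std::chrono::steady_clock::now().time_since_epoch())
      .count();
}

void generate_with_timers() {
  const int64_t start = time_in_ms();
  // ET_CHECK_OK_OR_RETURN_ERROR(load());  // Model Load Time wraps this call.
  const int64_t model_load_end = time_in_ms();

  // Prompt eval: prompt array allocation and tokenization, ending right
  // before the inference loop starts.
  const int64_t prompt_eval_end = time_in_ms();

  int64_t first_token_end = 0;  // set when the first generated token prints
  int64_t sample_time = 0;      // accumulated per-token sampling time
  int num_generated = 0;
  const int max_new_tokens = 128;  // assumed loop bound

  while (num_generated < max_new_tokens) {
    // ... forward pass ...
    const int64_t sample_start = time_in_ms();
    // ... sample next token ...
    sample_time += time_in_ms() - sample_start;
    // ... print token ...
    if (num_generated == 0) {
      first_token_end = time_in_ms();  // first generated (not prompt) token
    }
    ++num_generated;
  }
  const int64_t inference_end = time_in_ms();

  printf("Model load time:       %lld ms\n", (long long)(model_load_end - start));
  printf("Total inference time:  %lld ms\n", (long long)(inference_end - model_load_end));
  printf("Time to first token:   %lld ms\n", (long long)(first_token_end - model_load_end));
  printf("Prompt eval time:      %lld ms\n", (long long)(prompt_eval_end - model_load_end));
  printf("Remaining tokens time: %lld ms\n", (long long)(inference_end - first_token_end));
  printf("Sample time:           %lld ms\n", (long long)sample_time);

  // Token rate both with and without the first generated token included.
  if (num_generated > 1 && inference_end > first_token_end &&
      inference_end > model_load_end) {
    const double rate_incl =
        num_generated * 1000.0 / (inference_end - model_load_end);
    const double rate_excl =
        (num_generated - 1) * 1000.0 / (inference_end - first_token_end);
    printf("Tokens/s (incl. first): %.2f, (excl. first): %.2f\n",
           rate_incl, rate_excl);
  }
}
```

Reporting the rate both ways is useful because prompt processing inflates the first token's latency, so the excluding-first-token rate better reflects steady-state decode speed.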
    
    bypass-github-executorch-ci-checks
    bypass-github-pytorch-ci-checks
    
    Reviewed By: digantdesai, Jack-Khuu
    
    Differential Revision: D54223564
    Varun Puri authored and facebook-github-bot committed Mar 14, 2024
Commit: 6c28e7f