qnn end to end flow for stories model (#3038) #3182

cccclai · 2024-04-19T23:24:26Z

Summary:
Pull Request resolved: #3038

Patch a few changes including:

- support bool tensor type
- support fp16 and fix the 8w8a quantization
- add two non-supported ops (slice_scatter and index_put) in common_defs.py

stories model working end to end:
AOT:
fp16:

python -m examples.models.llama2.export_llama -kv --qnn -c stories110M.pt -p params.json

quantize:

python -m examples.models.llama2.export_llama -kv --qnn --pt2e_quantize qnn_8a8w -c stories110M.pt -p params.json
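The two AOT invocations differ only in the `--pt2e_quantize qnn_8a8w` flag. As a rough sketch, a small helper can assemble either command line. `build_export_cmd` is a hypothetical name, not part of the repository, and it uses only the flags shown above.

```python
import shlex
import sys


def build_export_cmd(checkpoint, params, quantize=None):
    """Assemble the export_llama command line (hypothetical helper)."""
    cmd = [sys.executable, "-m", "examples.models.llama2.export_llama", "-kv", "--qnn"]
    if quantize is not None:
        # e.g. "qnn_8a8w" for the quantized flow; omit for fp16.
        cmd += ["--pt2e_quantize", quantize]
    cmd += ["-c", checkpoint, "-p", params]
    return cmd


fp16_cmd = build_export_cmd("stories110M.pt", "params.json")
quant_cmd = build_export_cmd("stories110M.pt", "params.json", quantize="qnn_8a8w")

# Print the commands instead of running them; running them needs the
# checkpoint and params files to be present.
print(shlex.join(fp16_cmd))
print(shlex.join(quant_cmd))
```

Passing either list to `subprocess.run(cmd, check=True)` would launch the export, assuming the working directory and environment are already set up for the module to be importable.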

Runtime:

/llama_main --model_path=llama2_fp16_qnn_2.21.pte  --tokenizer_path=tokenizer.bin --prompt="Once"

Output:

Once upon a time, there was a little girl named Lily. She loved to play outside and explore the world around her. One day, she went on a walk with her mommy and they found a beautiful landscape with lots of trees and flowers.
Lily said, "Mommy, this place is so pretty! Can we take a picture?"
Mommy replied, "Of course, Lily! Let's take a picture to remember the original place we found."
After they took the picture, they continued their walk and saw a bird flying in the sky. Lily said, "MomPyTorchObserver {"prompt_tokens":2,"generated_tokens":125,"model_load_start_ms":1713226585936,"model_load_end_ms":1713226586909,"inference_start_ms":1713226586909,"inference_end_ms":1713226590363,"prompt_eval_end_ms":1713226586966,"first_token_ms":1713226586994,"aggregate_sampling_time_ms":23,"SCALING_FACTOR_UNITS_PER_SECOND":1000}
I 00:00:04.436699 executorch:runner.cpp:414] 	Prompt Tokens: 2    Generated Tokens: 125
I 00:00:04.436703 executorch:runner.cpp:420] 	Model Load Time:		0.973000 (seconds)
I 00:00:04.436732 executorch:runner.cpp:430] 	Total inference time:		3.454000 (seconds)		 Rate: 	36.189925 (tokens/second)
I 00:00:04.436735 executorch:runner.cpp:438] 		Prompt evaluation:	0.057000 (seconds)		 Rate: 	35.087719 (tokens/second)
I 00:00:04.436739 executorch:runner.cpp:449] 		Generated 125 tokens:	3.397000 (seconds)		 Rate: 	36.797174 (tokens/second)
I 00:00:04.436742 executorch:runner.cpp:457] 	Time to first generated token:	0.085000 (seconds)
I 00:00:04.436744 executorch:runner.cpp:464] 	Sampling time over 127 tokens:	0.023000 (seconds)
[INFO] [Qnn ExecuTorch]: Destroy Qnn backend parameters
[INFO] [Qnn ExecuTorch]: Destroy Qnn context
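The summary lines in the log can be re-derived from the PyTorchObserver record. The sketch below is a minimal illustration, not the runner's implementation: `summarize` and the keys of its result are hypothetical names, and the formulas are inferred from the numbers logged above.

```python
import json

# PyTorchObserver record copied from the run above.
OBSERVER = (
    '{"prompt_tokens":2,"generated_tokens":125,'
    '"model_load_start_ms":1713226585936,"model_load_end_ms":1713226586909,'
    '"inference_start_ms":1713226586909,"inference_end_ms":1713226590363,'
    '"prompt_eval_end_ms":1713226586966,"first_token_ms":1713226586994,'
    '"aggregate_sampling_time_ms":23,"SCALING_FACTOR_UNITS_PER_SECOND":1000}'
)


def summarize(record):
    """Derive timings (seconds) and rates (tokens/second) from an observer record."""
    s = json.loads(record)
    scale = s["SCALING_FACTOR_UNITS_PER_SECOND"]  # 1000, so timestamps are in ms

    def seconds(start_key, end_key):
        return (s[end_key] - s[start_key]) / scale

    total = seconds("inference_start_ms", "inference_end_ms")
    prompt_eval = seconds("inference_start_ms", "prompt_eval_end_ms")
    generation = seconds("prompt_eval_end_ms", "inference_end_ms")
    return {
        "model_load_s": seconds("model_load_start_ms", "model_load_end_ms"),
        "total_inference_s": total,
        "total_rate": s["generated_tokens"] / total,
        "prompt_eval_s": prompt_eval,
        "prompt_rate": s["prompt_tokens"] / prompt_eval,
        "generation_s": generation,
        "generation_rate": s["generated_tokens"] / generation,
        "time_to_first_token_s": seconds("inference_start_ms", "first_token_ms"),
    }


stats = summarize(OBSERVER)
for name, value in stats.items():
    print(f"{name}: {value:.6f}")
```

Under these assumptions the derived values agree with the logged model load time, inference times, rates, and time to first generated token.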

Stories model is too small and sensitive to quantization.

ghstack-source-id: 223199545
exported-using-ghexport

Reviewed By: mergennachin, kirklandsign

Differential Revision: D56119738

fbshipit-source-id: daf5563fe51a677f302e09ae8a9fb80e6bda72c5

(cherry picked from commit 3257c66)


pytorch-bot · 2024-04-19T23:24:28Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/3182

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure

As of commit 7214dff with merge base d3326a2:

NEW FAILURE - The following job has failed:

pull / test-models-linux (cmake, llava_encoder, portable, linux.4xlarge, 90) / linux-job (gh)
RuntimeError: Command docker exec -t effb3c8e894feb931897951328f7dd54ed3ff0a6d86f37412436fd07cda0a0d6 /exec failed with exit code 1

This comment was automatically generated by Dr. CI and updates every 15 minutes.

facebook-github-bot added the CLA Signed label Apr 19, 2024

cccclai mentioned this pull request Apr 19, 2024

[v0.2.0] Release Tracker #2666

Closed

guangy10 approved these changes Apr 20, 2024

guangy10 merged commit 7b29ad2 into release/0.2 Apr 20, 2024
34 of 35 checks passed

mergennachin mentioned this pull request Apr 25, 2024

fix llama readme #3339

Closed
