qnn end to end flow for stories model (#3038) #3182

Merged
merged 1 commit into from
Apr 20, 2024

@cccclai (Contributor) commented Apr 19, 2024

Summary:
Pull Request resolved: #3038

Patch a few changes including:
- support bool tensor type
- support fp16 and fix the 8a8w quantization
- add two unsupported ops (slice_scatter and index_put) in common_defs.py

stories model working end to end:
AOT:
fp16:
```
python -m examples.models.llama2.export_llama -kv --qnn -c stories110M.pt -p params.json
```
quantize:
```
python -m examples.models.llama2.export_llama -kv --qnn --pt2e_quantize qnn_8a8w -c stories110M.pt -p params.json
```
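The two export invocations above differ only in the `--pt2e_quantize qnn_8a8w` flag. A minimal sketch of a wrapper that builds either command line, assuming only the flags shown in this PR; `build_export_cmd` is a hypothetical helper for illustration, not part of the repository:

```python
import shlex


def build_export_cmd(checkpoint="stories110M.pt", params="params.json", quantize=None):
    """Build the export_llama command line for the QNN flow.

    quantize=None gives the fp16 export; quantize="qnn_8a8w" gives the
    quantized export. Only flags that appear in this PR are used.
    """
    cmd = ["python", "-m", "examples.models.llama2.export_llama", "-kv", "--qnn"]
    if quantize:
        cmd += ["--pt2e_quantize", quantize]
    cmd += ["-c", checkpoint, "-p", params]
    return cmd


if __name__ == "__main__":
    # Print both variants; pass the list to subprocess.run(...) to execute one.
    print(shlex.join(build_export_cmd()))
    print(shlex.join(build_export_cmd(quantize="qnn_8a8w")))
```

Returning an argument list rather than a string keeps the command safe to hand to `subprocess.run` without shell quoting.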

Runtime:
```
/llama_main --model_path=llama2_fp16_qnn_2.21.pte  --tokenizer_path=tokenizer.bin --prompt="Once"
```
Output:
```
Once upon a time, there was a little girl named Lily. She loved to play outside and explore the world around her. One day, she went on a walk with her mommy and they found a beautiful landscape with lots of trees and flowers.
Lily said, "Mommy, this place is so pretty! Can we take a picture?"
Mommy replied, "Of course, Lily! Let's take a picture to remember the original place we found."
After they took the picture, they continued their walk and saw a bird flying in the sky. Lily said, "Mom
PyTorchObserver {"prompt_tokens":2,"generated_tokens":125,"model_load_start_ms":1713226585936,"model_load_end_ms":1713226586909,"inference_start_ms":1713226586909,"inference_end_ms":1713226590363,"prompt_eval_end_ms":1713226586966,"first_token_ms":1713226586994,"aggregate_sampling_time_ms":23,"SCALING_FACTOR_UNITS_PER_SECOND":1000}
I 00:00:04.436699 executorch:runner.cpp:414] 	Prompt Tokens: 2    Generated Tokens: 125
I 00:00:04.436703 executorch:runner.cpp:420] 	Model Load Time:		0.973000 (seconds)
I 00:00:04.436732 executorch:runner.cpp:430] 	Total inference time:		3.454000 (seconds)		 Rate: 	36.189925 (tokens/second)
I 00:00:04.436735 executorch:runner.cpp:438] 		Prompt evaluation:	0.057000 (seconds)		 Rate: 	35.087719 (tokens/second)
I 00:00:04.436739 executorch:runner.cpp:449] 		Generated 125 tokens:	3.397000 (seconds)		 Rate: 	36.797174 (tokens/second)
I 00:00:04.436742 executorch:runner.cpp:457] 	Time to first generated token:	0.085000 (seconds)
I 00:00:04.436744 executorch:runner.cpp:464] 	Sampling time over 127 tokens:	0.023000 (seconds)
[INFO] [Qnn ExecuTorch]: Destroy Qnn backend parameters
[INFO] [Qnn ExecuTorch]: Destroy Qnn context
```
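The timing summary in the log can be recomputed from the `PyTorchObserver` JSON alone. The sketch below does that arithmetic; the field names come from the output above, while `summarize` and the choice of which timestamps to subtract are assumptions that happen to be consistent with the logged numbers, not a copy of the code in `runner.cpp`:

```python
import json

# Observer line copied from the runtime output above.
OBSERVER_LINE = (
    'PyTorchObserver {"prompt_tokens":2,"generated_tokens":125,'
    '"model_load_start_ms":1713226585936,"model_load_end_ms":1713226586909,'
    '"inference_start_ms":1713226586909,"inference_end_ms":1713226590363,'
    '"prompt_eval_end_ms":1713226586966,"first_token_ms":1713226586994,'
    '"aggregate_sampling_time_ms":23,"SCALING_FACTOR_UNITS_PER_SECOND":1000}'
)


def summarize(observer_line):
    """Recompute durations (seconds) and rates (tokens/second) from one observer line."""
    stats = json.loads(observer_line.split("PyTorchObserver", 1)[1])
    units = stats["SCALING_FACTOR_UNITS_PER_SECOND"]  # 1000, so timestamps are in ms

    def seconds(start_key, end_key):
        return (stats[end_key] - stats[start_key]) / units

    total = seconds("inference_start_ms", "inference_end_ms")
    prompt = seconds("inference_start_ms", "prompt_eval_end_ms")
    generation = seconds("prompt_eval_end_ms", "inference_end_ms")
    return {
        "model_load_s": seconds("model_load_start_ms", "model_load_end_ms"),
        "total_inference_s": total,
        "total_rate": stats["generated_tokens"] / total,
        "prompt_eval_s": prompt,
        "prompt_rate": stats["prompt_tokens"] / prompt,
        "generation_s": generation,
        "generation_rate": stats["generated_tokens"] / generation,
        "time_to_first_token_s": seconds("inference_start_ms", "first_token_ms"),
    }


if __name__ == "__main__":
    for name, value in summarize(OBSERVER_LINE).items():
        print(f"{name}: {value:.6f}")
```

Run on the line above, this should reproduce the logged figures: 0.973 s model load, 3.454 s total inference at about 36.19 tokens/second, and 0.085 s to the first generated token.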

The stories model is too small and sensitive to quantization.
ghstack-source-id: 223199545
exported-using-ghexport

Reviewed By: mergennachin, kirklandsign

Differential Revision: D56119738

fbshipit-source-id: daf5563fe51a677f302e09ae8a9fb80e6bda72c5
(cherry picked from commit 3257c66)
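
For context on the quantization-sensitivity remark in the summary, here is a generic sketch of an 8-bit affine quantize/dequantize round trip. It is not the QNN quantizer or the `qnn_8a8w` configuration; `quantize_dequantize` and the sample values are hypothetical and only illustrate the rounding error that 8-bit activations and weights introduce:

```python
def quantize_dequantize(values, num_bits=8):
    """Affine quantization round trip using one range for the whole list.

    Returns the dequantized values and the step size (scale).
    """
    qmin, qmax = 0, 2 ** num_bits - 1
    # Extend the range to include 0.0 so that zero is exactly representable.
    lo = min(min(values), 0.0)
    hi = max(max(values), 0.0)
    scale = (hi - lo) / (qmax - qmin) or 1.0  # avoid a zero step for all-zero input
    zero_point = round(qmin - lo / scale)
    restored = []
    for v in values:
        q = round(v / scale) + zero_point
        q = min(qmax, max(qmin, q))  # clamp to the integer range
        restored.append((q - zero_point) * scale)
    return restored, scale


if __name__ == "__main__":
    sample = [-1.0, -0.25, 0.0, 0.1, 0.5, 2.0]
    for bits in (8, 16):
        restored, scale = quantize_dequantize(sample, num_bits=bits)
        worst = max(abs(a - b) for a, b in zip(sample, restored))
        print(f"{bits}-bit: step={scale:.6g} worst_error={worst:.6g}")
```

Each value is off by up to about half a step after the round trip, and the step grows with the value range, so a model with little redundancy has less room to absorb that error.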

pytorch-bot bot commented Apr 19, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/3182

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure

As of commit 7214dff with merge base d3326a2:

NEW FAILURE - The following job has failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot added the CLA Signed label Apr 19, 2024
@guangy10 merged commit 7b29ad2 into release/0.2 Apr 20, 2024
34 of 35 checks passed
@mergennachin mentioned this pull request Apr 25, 2024