
examples: change GPT-2 KV cache to fp16 to take advantage of tensor cores #1142

Merged: 2 commits, Mar 13, 2025

Conversation

@bssrdf (Contributor) commented on Mar 12, 2025

This PR changes the default KV cache type to FP16 in order to use tensor cores when available. This brings the inference performance much closer to llama.cpp.
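For context, a minimal sketch of the kind of change being described, not the PR's verbatim diff: the KV cache tensors in examples/gpt-2 allocated with an F16 type so that matrix multiplications against the cache can use tensor cores on supported GPUs. The helper name and parameters below (alloc_kv_cache, n_ctx, n_embd, n_layer) are illustrative assumptions, not the example's actual identifiers.

```c
#include "ggml.h"

// Sketch: allocate the key/value cache for a GPT-2-style model.
static void alloc_kv_cache(struct ggml_context * ctx,
                           struct ggml_tensor ** memory_k,
                           struct ggml_tensor ** memory_v,
                           int n_ctx, int n_embd, int n_layer) {
    // GGML_TYPE_F16 halves the cache size and lets CUDA mat-muls against
    // the cache use tensor cores; GGML_TYPE_F32 keeps full precision.
    // Either type works here.
    const enum ggml_type kv_type = GGML_TYPE_F16;

    const int64_t n_elements = (int64_t) n_ctx * n_embd * n_layer;

    *memory_k = ggml_new_tensor_1d(ctx, kv_type, n_elements);
    *memory_v = ggml_new_tensor_1d(ctx, kv_type, n_elements);
}
```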

@ggerganov (Member) left a comment

This is a bit irrelevant for the purpose of this example, because it is not meant to deliver optimal performance, but rather to demonstrate ggml usage. I think we can simply add a comment here that either GGML_TYPE_F16 or GGML_TYPE_F32 can be used.

@bssrdf (Contributor, Author) commented on Mar 13, 2025

@ggerganov, fair enough! I have reverted back to FP32 and added a comment. Thanks for reviewing.
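For reference, the merged state presumably looks something like the following minimal sketch (an assumed form, not a verbatim quote of commit ef09452): the default stays FP32, with a comment noting that FP16 also works. The variable names are assumptions based on the typical layout of the ggml gpt-2 example.

```c
// either GGML_TYPE_F32 or GGML_TYPE_F16 can be used for the KV cache;
// F16 is faster on GPUs with tensor cores, F32 keeps full precision
model.memory_k = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, n_elements);
model.memory_v = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, n_elements);
```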

@ggerganov merged commit ef09452 into ggml-org:master on Mar 13, 2025
3 checks passed
@bssrdf deleted the make-gpt-2-peform-like-llama-cpp branch on March 13, 2025 at 19:01