
examples: change GPT-2 KV cache to fp16 to take advantage of tensor cores #1142

Merged: 2 commits, Mar 13, 2025

Conversation

@bssrdf (Contributor) commented on Mar 12, 2025

This PR changes the default KV cache type to FP16 in order to use tensor cores when available. This brings the inference performance much closer to llama.cpp.
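For context, a minimal sketch of the kind of change being described, not the PR's verbatim diff: the KV cache tensors in examples/gpt-2 allocated with an F16 type so that matrix multiplications against the cache can use tensor cores on supported GPUs. The helper name and parameters below (alloc_kv_cache, n_ctx, n_embd, n_layer) are illustrative assumptions, not the example's actual identifiers.

```c
#include "ggml.h"

// Sketch: allocate the key/value cache for a GPT-2-style model.
static void alloc_kv_cache(struct ggml_context * ctx,
                           struct ggml_tensor ** memory_k,
                           struct ggml_tensor ** memory_v,
                           int n_ctx, int n_embd, int n_layer) {
    // GGML_TYPE_F16 halves the cache size and lets CUDA mat-muls against
    // the cache use tensor cores; GGML_TYPE_F32 keeps full precision.
    // Either type works here.
    const enum ggml_type kv_type = GGML_TYPE_F16;

    const int64_t n_elements = (int64_t) n_ctx * n_embd * n_layer;

    *memory_k = ggml_new_tensor_1d(ctx, kv_type, n_elements);
    *memory_v = ggml_new_tensor_1d(ctx, kv_type, n_elements);
}
```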

@ggerganov (Member) left a comment

This is a bit irrelevant for the purpose of this example, because it is not meant to deliver optimal performance, but rather to demonstrate ggml usage. I think we can simply add a comment here that either GGML_TYPE_F16 or GGML_TYPE_F32 can be used.

@bssrdf (Contributor, Author) commented on Mar 13, 2025

@ggerganov, fair enough! I have reverted back to FP32 and added a comment. Thanks for reviewing.
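For reference, the merged state presumably looks something like the following minimal sketch (an assumed form, not a verbatim quote of commit ef09452): the default stays FP32, with a comment noting that FP16 also works. The variable names are assumptions based on the typical layout of the ggml gpt-2 example.

```c
// either GGML_TYPE_F32 or GGML_TYPE_F16 can be used for the KV cache;
// F16 is faster on GPUs with tensor cores, F32 keeps full precision
model.memory_k = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, n_elements);
model.memory_v = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, n_elements);
```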

@ggerganov merged commit ef09452 into ggml-org:master on Mar 13, 2025
3 checks passed
@bssrdf deleted the make-gpt-2-peform-like-llama-cpp branch on March 13, 2025 at 19:01