
Conversation

@ORippler (Contributor) commented Sep 8, 2025

This PR applies the fastdiv and fastmodulo helpers introduced in #15715 to k_bin_bcast and k_bin_bcast_unravel, giving around a 1-3% end-to-end (E2E) speedup on Ada Lovelace and Blackwell GPUs.

While changing the host logic in launch_bin_bcast_pack I was surprised to see that we keep ne* in 64-bit precision but use only the 32 least significant bits in the actual kernel. Could this lead to semantic bugs where we do not iterate over all elements of src0/src1, or am I missing something here?
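For readers unfamiliar with the trick: fastdiv replaces a runtime integer division by a divisor that is fixed at launch time with a multiply-and-high-shift against a precomputed reciprocal. A minimal host-side sketch of the idea follows, using Lemire's round-up-reciprocal scheme; the names and layout are illustrative, not the actual ggml CUDA helpers from #15715.

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of fastdiv/fastmodulo: precompute ceil(2^64 / d) once on the host,
 * then each division becomes a 64x32-bit multiply plus a shift.
 * Illustrative only -- not the real ggml implementation. */

typedef struct {
    uint64_t m;  /* UINT64_MAX / d + 1, i.e. a rounded-up reciprocal of d */
    uint32_t d;  /* the divisor itself, kept for the d == 1 edge case     */
} fastdiv_t;

static fastdiv_t fastdiv_init(uint32_t d) {
    fastdiv_t f = { UINT64_MAX / d + 1, d };
    return f;
}

static uint32_t fastdiv(uint32_t n, fastdiv_t f) {
    if (f.d == 1) return n;  /* the reciprocal wraps to 0 for d == 1 */
    return (uint32_t)(((__uint128_t)f.m * n) >> 64);
}

static uint32_t fastmod(uint32_t n, fastdiv_t f) {
    uint64_t lowbits = f.m * n;  /* fractional part of n / d, scaled by 2^64 */
    return (uint32_t)(((__uint128_t)lowbits * f.d) >> 64);
}
```

In k_bin_bcast the flattened thread index is unraveled into four tensor coordinates with several `/` and `%` operations per element, which is why precomputing the reciprocals once on the host pays off inside the kernel.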


Perf numbers

| GPU | Model | Test | t/s master (a885dcf) | t/s this PR (956a1d0) | Speedup |
| --- | --- | --- | --- | --- | --- |
| 4000 SFF | gemma3 12B Q8_0 | pp100@d100 | 1180.05 | 1177.70 | 1.00 |
| 4000 SFF | gemma3 12B Q8_0 | tg100@d100 | 19.46 | 19.46 | 1.00 |
| 4000 SFF | gemma3n E2B Q8_0 | pp100@d100 | 2833.22 | 2859.22 | 1.01 |
| 4000 SFF | gemma3n E2B Q8_0 | tg100@d100 | 72.75 | 72.84 | 1.00 |
| 4000 SFF | gemma3n E4B Q8_0 | pp100@d100 | 1861.74 | 1893.85 | 1.02 |
| 4000 SFF | gemma3n E4B Q8_0 | tg100@d100 | 43.64 | 43.78 | 1.00 |
| 4000 SFF | gpt-oss 20B MXFP4 MoE | pp100@d100 | 1124.82 | 1130.57 | 1.01 |
| 4000 SFF | gpt-oss 20B MXFP4 MoE | tg100@d100 | 79.67 | 80.01 | 1.00 |
| 4000 SFF | llama 3B Q4_K_M | pp100@d100 | 3918.87 | 3925.05 | 1.00 |
| 4000 SFF | llama 3B Q4_K_M | tg100@d100 | 103.20 | 103.40 | 1.00 |
| 4000 SFF | qwen3 4B Q4_K_M | pp100@d100 | 2929.98 | 2947.50 | 1.01 |
| 4000 SFF | qwen3 4B Q4_K_M | tg100@d100 | 77.96 | 78.09 | 1.00 |
| 4000 SFF | qwen3moe 30B.A3B Q3_K_S | pp100@d100 | 782.50 | 783.54 | 1.00 |
| 4000 SFF | qwen3moe 30B.A3B Q3_K_S | tg100@d100 | 75.74 | 75.86 | 1.00 |
| PRO 4500 | gemma3 12B Q8_0 | pp100@d100 | 2663.22 | 2674.80 | 1.00 |
| PRO 4500 | gemma3 12B Q8_0 | tg100@d100 | 53.33 | 53.50 | 1.00 |
| PRO 4500 | gemma3n E2B Q8_0 | pp100@d100 | 4699.09 | 4842.60 | 1.03 |
| PRO 4500 | gemma3n E2B Q8_0 | tg100@d100 | 146.19 | 147.74 | 1.01 |
| PRO 4500 | gemma3n E4B Q8_0 | pp100@d100 | 3595.80 | 3678.35 | 1.02 |
| PRO 4500 | gemma3n E4B Q8_0 | tg100@d100 | 98.76 | 99.48 | 1.01 |
| PRO 4500 | gpt-oss 20B MXFP4 MoE | pp100@d100 | 2682.57 | 2696.05 | 1.01 |
| PRO 4500 | gpt-oss 20B MXFP4 MoE | tg100@d100 | 187.95 | 189.45 | 1.01 |
| PRO 4500 | llama 3B Q4_K_M | pp100@d100 | 7103.16 | 7172.19 | 1.01 |
| PRO 4500 | llama 3B Q4_K_M | tg100@d100 | 253.66 | 254.50 | 1.00 |
| PRO 4500 | qwen3 4B Q4_K_M | pp100@d100 | 5688.40 | 5689.52 | 1.00 |
| PRO 4500 | qwen3 4B Q4_K_M | tg100@d100 | 190.86 | 191.15 | 1.00 |
| PRO 4500 | qwen3moe 30B.A3B Q3_K_S | pp100@d100 | 1585.78 | 1588.32 | 1.00 |
| PRO 4500 | qwen3moe 30B.A3B Q3_K_S | tg100@d100 | 159.09 | 159.61 | 1.00 |
| PRO 6000 Max-Q | gemma3 12B Q8_0 | pp100@d100 | 3289.90 | 3340.18 | 1.02 |
| PRO 6000 Max-Q | gemma3 12B Q8_0 | tg100@d100 | 87.69 | 88.61 | 1.01 |
| PRO 6000 Max-Q | gemma3n E2B Q8_0 | pp100@d100 | 4961.39 | 5003.11 | 1.01 |
| PRO 6000 Max-Q | gemma3n E2B Q8_0 | tg100@d100 | 170.01 | 172.68 | 1.02 |
| PRO 6000 Max-Q | gemma3n E4B Q8_0 | pp100@d100 | 3871.13 | 3952.21 | 1.02 |
| PRO 6000 Max-Q | gemma3n E4B Q8_0 | tg100@d100 | 126.80 | 129.27 | 1.02 |
| PRO 6000 Max-Q | gpt-oss 20B MXFP4 MoE | pp100@d100 | 3662.91 | 3675.03 | 1.00 |
| PRO 6000 Max-Q | gpt-oss 20B MXFP4 MoE | tg100@d100 | 245.33 | 248.77 | 1.01 |
| PRO 6000 Max-Q | llama 3B Q4_K_M | pp100@d100 | 7410.34 | 7432.80 | 1.00 |
| PRO 6000 Max-Q | llama 3B Q4_K_M | tg100@d100 | 333.30 | 335.90 | 1.01 |
| PRO 6000 Max-Q | qwen3 4B Q4_K_M | pp100@d100 | 6205.96 | 6189.10 | 1.00 |
| PRO 6000 Max-Q | qwen3 4B Q4_K_M | tg100@d100 | 251.95 | 252.77 | 1.00 |
| PRO 6000 Max-Q | qwen3moe 30B.A3B Q3_K_S | pp100@d100 | 2077.16 | 2081.61 | 1.00 |
| PRO 6000 Max-Q | qwen3moe 30B.A3B Q3_K_S | tg100@d100 | 170.76 | 173.18 | 1.01 |

@JohannesGaessler (Collaborator) commented:

> I was surprised to see we keep ne* in 64 bit precision, but use only the 32 least significant bits in the actual kernel.

The GGML specification for the tensor dimensions is int64_t. When work on the CUDA code was first started, int was used for everything, largely out of ignorance. You are correct that this could potentially cause issues, but unlike e.g. PyTorch, ggml is currently not really used by end users directly, so the tensor dimensions encountered in practice are very limited. Most kernels are severely I/O-bound with negligible register pressure, so for those it would be better to just use 64-bit arguments. For the kernels with non-negligible register pressure more care must be taken, but for 32-bit arguments asserts should be added to detect overflows. As these changes are fairly low-priority, no one has gotten around to doing this.
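A host-side guard along the lines suggested above could be sketched as follows. The names are hypothetical, not existing ggml API: before passing int64_t GGML extents to a kernel that takes 32-bit arguments, assert that each extent, and the flattened element count, actually fit.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical overflow guard -- illustrative, not actual ggml code. */
static int fits_in_uint32(int64_t v) {
    return v >= 0 && v <= (int64_t)UINT32_MAX;
}

/* Check each extent and the running element count before a launch that
 * truncates them to 32 bits. Asserting the running product at every step
 * keeps the uint64_t multiply below 2^64, so it cannot wrap undetected. */
static void assert_dims_fit_32bit(const int64_t ne[4]) {
    uint64_t total = 1;
    for (int i = 0; i < 4; i++) {
        assert(fits_in_uint32(ne[i]));
        total *= (uint64_t)ne[i];
        assert(total <= UINT32_MAX);  /* flat index must also fit */
    }
}
```

In debug builds this catches the truncation the question above worries about; in release builds the asserts compile away, so the kernel arguments stay cheap.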

@ORippler changed the title from "CUDA: Add fastdiv and fastmodulo to k_bin_bcast*, giving 1-3% E2E performance" to "CUDA: Add fastdiv to k_bin_bcast*, giving 1-3% E2E performance" on Sep 10, 2025.
@JohannesGaessler merged commit 00681df into ggml-org:master on Sep 10, 2025 (48 checks passed).
@ORippler deleted the osimons/add_fastdiv_to_k_bin_bcast branch on September 10, 2025.