
Conversation

@ORippler (Contributor) commented Sep 8, 2025

This PR applies the fastdiv and fastmodulo helpers introduced in #15715 to k_bin_bcast and k_bin_bcast_unravel, giving around a 1-3% end-to-end (E2E) speedup on Ada Lovelace and Blackwell GPUs.

While changing the host logic in launch_bin_bcast_pack I was surprised to see that we keep ne* in 64-bit precision but use only the 32 least significant bits in the actual kernel. Could this lead to semantic bugs where we do not iterate over all elements of src0/src1, or am I missing something here?
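For readers unfamiliar with the trick: fastdiv replaces a runtime integer division by a divisor that is fixed at launch time with a multiply-and-high-shift against a precomputed reciprocal. A minimal host-side sketch of the idea follows, using Lemire's round-up-reciprocal scheme; the names and layout are illustrative, not the actual ggml CUDA helpers from #15715.

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of fastdiv/fastmodulo: precompute ceil(2^64 / d) once on the host,
 * then each division becomes a 64x32-bit multiply plus a shift.
 * Illustrative only -- not the real ggml implementation. */

typedef struct {
    uint64_t m;  /* UINT64_MAX / d + 1, i.e. a rounded-up reciprocal of d */
    uint32_t d;  /* the divisor itself, kept for the d == 1 edge case     */
} fastdiv_t;

static fastdiv_t fastdiv_init(uint32_t d) {
    fastdiv_t f = { UINT64_MAX / d + 1, d };
    return f;
}

static uint32_t fastdiv(uint32_t n, fastdiv_t f) {
    if (f.d == 1) return n;  /* the reciprocal wraps to 0 for d == 1 */
    return (uint32_t)(((__uint128_t)f.m * n) >> 64);
}

static uint32_t fastmod(uint32_t n, fastdiv_t f) {
    uint64_t lowbits = f.m * n;  /* fractional part of n / d, scaled by 2^64 */
    return (uint32_t)(((__uint128_t)lowbits * f.d) >> 64);
}
```

In k_bin_bcast the flattened thread index is unraveled into four tensor coordinates with several `/` and `%` operations per element, which is why precomputing the reciprocals once on the host pays off inside the kernel.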


Perf numbers

| GPU | Model | Test | t/s master (a885dcf) | t/s this PR (956a1d0) | Speedup |
| --- | --- | --- | --- | --- | --- |
| 4000 SFF | gemma3 12B Q8_0 | pp100@d100 | 1180.05 | 1177.70 | 1.00 |
| 4000 SFF | gemma3 12B Q8_0 | tg100@d100 | 19.46 | 19.46 | 1.00 |
| 4000 SFF | gemma3n E2B Q8_0 | pp100@d100 | 2833.22 | 2859.22 | 1.01 |
| 4000 SFF | gemma3n E2B Q8_0 | tg100@d100 | 72.75 | 72.84 | 1.00 |
| 4000 SFF | gemma3n E4B Q8_0 | pp100@d100 | 1861.74 | 1893.85 | 1.02 |
| 4000 SFF | gemma3n E4B Q8_0 | tg100@d100 | 43.64 | 43.78 | 1.00 |
| 4000 SFF | gpt-oss 20B MXFP4 MoE | pp100@d100 | 1124.82 | 1130.57 | 1.01 |
| 4000 SFF | gpt-oss 20B MXFP4 MoE | tg100@d100 | 79.67 | 80.01 | 1.00 |
| 4000 SFF | llama 3B Q4_K_M | pp100@d100 | 3918.87 | 3925.05 | 1.00 |
| 4000 SFF | llama 3B Q4_K_M | tg100@d100 | 103.20 | 103.40 | 1.00 |
| 4000 SFF | qwen3 4B Q4_K_M | pp100@d100 | 2929.98 | 2947.50 | 1.01 |
| 4000 SFF | qwen3 4B Q4_K_M | tg100@d100 | 77.96 | 78.09 | 1.00 |
| 4000 SFF | qwen3moe 30B.A3B Q3_K_S | pp100@d100 | 782.50 | 783.54 | 1.00 |
| 4000 SFF | qwen3moe 30B.A3B Q3_K_S | tg100@d100 | 75.74 | 75.86 | 1.00 |
| PRO 4500 | gemma3 12B Q8_0 | pp100@d100 | 2663.22 | 2674.80 | 1.00 |
| PRO 4500 | gemma3 12B Q8_0 | tg100@d100 | 53.33 | 53.50 | 1.00 |
| PRO 4500 | gemma3n E2B Q8_0 | pp100@d100 | 4699.09 | 4842.60 | 1.03 |
| PRO 4500 | gemma3n E2B Q8_0 | tg100@d100 | 146.19 | 147.74 | 1.01 |
| PRO 4500 | gemma3n E4B Q8_0 | pp100@d100 | 3595.80 | 3678.35 | 1.02 |
| PRO 4500 | gemma3n E4B Q8_0 | tg100@d100 | 98.76 | 99.48 | 1.01 |
| PRO 4500 | gpt-oss 20B MXFP4 MoE | pp100@d100 | 2682.57 | 2696.05 | 1.01 |
| PRO 4500 | gpt-oss 20B MXFP4 MoE | tg100@d100 | 187.95 | 189.45 | 1.01 |
| PRO 4500 | llama 3B Q4_K_M | pp100@d100 | 7103.16 | 7172.19 | 1.01 |
| PRO 4500 | llama 3B Q4_K_M | tg100@d100 | 253.66 | 254.50 | 1.00 |
| PRO 4500 | qwen3 4B Q4_K_M | pp100@d100 | 5688.40 | 5689.52 | 1.00 |
| PRO 4500 | qwen3 4B Q4_K_M | tg100@d100 | 190.86 | 191.15 | 1.00 |
| PRO 4500 | qwen3moe 30B.A3B Q3_K_S | pp100@d100 | 1585.78 | 1588.32 | 1.00 |
| PRO 4500 | qwen3moe 30B.A3B Q3_K_S | tg100@d100 | 159.09 | 159.61 | 1.00 |
| PRO 6000 Max-Q | gemma3 12B Q8_0 | pp100@d100 | 3289.90 | 3340.18 | 1.02 |
| PRO 6000 Max-Q | gemma3 12B Q8_0 | tg100@d100 | 87.69 | 88.61 | 1.01 |
| PRO 6000 Max-Q | gemma3n E2B Q8_0 | pp100@d100 | 4961.39 | 5003.11 | 1.01 |
| PRO 6000 Max-Q | gemma3n E2B Q8_0 | tg100@d100 | 170.01 | 172.68 | 1.02 |
| PRO 6000 Max-Q | gemma3n E4B Q8_0 | pp100@d100 | 3871.13 | 3952.21 | 1.02 |
| PRO 6000 Max-Q | gemma3n E4B Q8_0 | tg100@d100 | 126.80 | 129.27 | 1.02 |
| PRO 6000 Max-Q | gpt-oss 20B MXFP4 MoE | pp100@d100 | 3662.91 | 3675.03 | 1.00 |
| PRO 6000 Max-Q | gpt-oss 20B MXFP4 MoE | tg100@d100 | 245.33 | 248.77 | 1.01 |
| PRO 6000 Max-Q | llama 3B Q4_K_M | pp100@d100 | 7410.34 | 7432.80 | 1.00 |
| PRO 6000 Max-Q | llama 3B Q4_K_M | tg100@d100 | 333.30 | 335.90 | 1.01 |
| PRO 6000 Max-Q | qwen3 4B Q4_K_M | pp100@d100 | 6205.96 | 6189.10 | 1.00 |
| PRO 6000 Max-Q | qwen3 4B Q4_K_M | tg100@d100 | 251.95 | 252.77 | 1.00 |
| PRO 6000 Max-Q | qwen3moe 30B.A3B Q3_K_S | pp100@d100 | 2077.16 | 2081.61 | 1.00 |
| PRO 6000 Max-Q | qwen3moe 30B.A3B Q3_K_S | tg100@d100 | 170.76 | 173.18 | 1.01 |

@JohannesGaessler (Collaborator) commented:

> I was surprised to see we keep ne* in 64 bit precision, but use only the 32 least significant bits in the actual kernel.

The GGML specification for the tensor dimensions is int64_t. When work on the CUDA code was first started, int was used for everything, largely out of ignorance. You are correct that this could potentially cause issues, but unlike e.g. PyTorch, ggml is currently not really used by end users directly, so the tensor dimensions encountered in practice are very limited. Most kernels are severely I/O-bound with negligible register pressure, so for those it would be better to just use 64-bit arguments. For the kernels with non-negligible register pressure more care must be taken, but for 32-bit arguments asserts should be added to detect overflows. As these changes are fairly low-priority, no one has gotten around to doing this.
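A host-side guard along the lines suggested above could be sketched as follows. The names are hypothetical, not existing ggml API: before passing int64_t GGML extents to a kernel that takes 32-bit arguments, assert that each extent, and the flattened element count, actually fit.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical overflow guard -- illustrative, not actual ggml code. */
static int fits_in_uint32(int64_t v) {
    return v >= 0 && v <= (int64_t)UINT32_MAX;
}

/* Check each extent and the running element count before a launch that
 * truncates them to 32 bits. Asserting the running product at every step
 * keeps the uint64_t multiply below 2^64, so it cannot wrap undetected. */
static void assert_dims_fit_32bit(const int64_t ne[4]) {
    uint64_t total = 1;
    for (int i = 0; i < 4; i++) {
        assert(fits_in_uint32(ne[i]));
        total *= (uint64_t)ne[i];
        assert(total <= UINT32_MAX);  /* flat index must also fit */
    }
}
```

In debug builds this catches the truncation the question above worries about; in release builds the asserts compile away, so the kernel arguments stay cheap.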

@ORippler changed the title from "CUDA: Add fastdiv and fastmodulo to k_bin_bcast*, giving 1-3% E2E performance" to "CUDA: Add fastdiv to k_bin_bcast*, giving 1-3% E2E performance" on Sep 10, 2025.
@JohannesGaessler merged commit 00681df into ggml-org:master on Sep 10, 2025 (48 checks passed).
@ORippler deleted the osimons/add_fastdiv_to_k_bin_bcast branch on September 10, 2025.