vulkan: matmul dequantization improvements #12015
Conversation
I did a quick run on RTX 4070 using the KHR_coopmat path (GGML_VK_DISABLE_COOPMAT2=1). Perf is about neutral on average, maybe down a tiny bit?
The backend tests all passed.
Interesting. Let's wait for some more results.
Here are my results (Nvidia RTX 3090, AMD Radeon Pro VII, Intel A770):
Looks like it's mostly perf-neutral on AMD and Intel, probably because they are compute-limited, with some minor improvements. But on the RTX 3090 it makes a much larger difference. Legacy and k-quants look good, but the iq quants seem to regress. Any ideas?
I mean I didn't touch iq1_s and iq4_xs at all, so I don't know what's going on there. My change only affects the legacy quants and iq4_nl. Anyway, I pushed a new update for iq4_nl which might help a bit, though it's only giving me a 1% improvement on my end. On AMD, unpack8 is normally faster since it compiles to an instruction that converts an unsigned byte into a float in a single cycle. Since iq4_nl has no float conversion, the unpack compiles to a regular bitfieldExtract, and we also add the overhead of all the bit fiddling. Then again, I'll be rolling my eyes if this causes a 30% difference, and I don't know if this is even relevant for Nvidia...
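To illustrate the tradeoff being described (a sketch only, not the PR's actual shader code: `data_a_packed32`, `ib`, and `iqs` are hypothetical names, and `kvalues_iq4nl` is assumed to be the 16-entry iq4_nl lookup table from the ggml sources):

```glsl
#extension GL_EXT_shader_explicit_arithmetic_types_int8 : require

// One 32-bit load fetches 8 packed 4-bit lookup-table indices (iq4_nl-style).
const uint qs = data_a_packed32[ib].qs[iqs];  // hypothetical packed buffer view

// Path A: unpack8. On AMD this pays off when the bytes feed a float
// conversion (byte -> float is a single instruction). iq4_nl has no such
// conversion, only a table lookup, so the compiler lowers this to plain
// bit extraction anyway, plus the extra masking.
const u8vec4 lo = unpack8(qs & 0x0F0F0F0Fu);
const float v0 = float(kvalues_iq4nl[uint(lo.x)]);

// Path B: a direct bitfieldExtract, which is what Path A effectively
// becomes here, without the masking overhead.
const float v1 = float(kvalues_iq4nl[bitfieldExtract(qs, 0, 4)]);
```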
I reran the benchmarks and it looks good now. The RTX 3090 coopmat results seem to vary a lot between runs, so the negative delta was probably just variance.
It might be faster, but it's also not working reliably on RDNA1 right now. |
This reverts commit fbeda90.
* faster dequant for old quants
* don't use unpack for iq4_nl
* vec2 unpack for q8
This basically makes the mul_mm shaders load and dequantize 4 or 8 values at a time, the way it's already done in mat_vec (old quants only).
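As a rough illustration of the idea for q4_0 (a minimal sketch under assumed names: the packed buffer view `data_a_packed32` and the indices `ib`/`iqs` are placeholders, not the PR's actual code):

```glsl
#extension GL_EXT_shader_explicit_arithmetic_types_int8 : require

// A q4_0 block stores a float scale `d` plus 32 packed 4-bit quants.
// Loading the quants through a 32-bit view brings in 8 of them at once
// instead of one value per load.
const uint  qs = data_a_packed32[ib].qs[iqs];   // hypothetical packed view
const float d  = float(data_a_packed32[ib].d);

// q4_0 dequant: value = d * (q - 8). Masking first isolates the low
// nibble of each byte; shifting first exposes the high nibbles. unpack8
// then splits each word into 4 bytes, converted to float four at a time.
const vec4 v_lo = (vec4(unpack8(qs & 0x0F0F0F0Fu)) - 8.0f) * d;
const vec4 v_hi = (vec4(unpack8((qs >> 4) & 0x0F0F0F0Fu)) - 8.0f) * d;
```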
Results on my RX 470 (PR vs. master):
I'm only seeing a small improvement as most of the GPU time is spent doing the actual multiplication, and I think we'll see better results on something that supports coopmat.