AVX2 is dimwitted compared to AVX512 #23
Interesting, thanks for making us aware. I see that the Highway targets used are AVX3_ZEN4 vs AVX2. The likeliest cause that comes to mind is native bf16 in the former, whereas we are using emulated bf16 with truncation in the latter. google/highway#1962 changes this to proper rounding, but unfortunately merging is delayed due to a compiler bug/crash. Would appreciate it if you could test with that patched in, and/or after it lands :)
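The truncation-vs-rounding difference mentioned above can be illustrated in plain C++. This is a generic sketch of bf16 emulation, not gemma.cpp's or Highway's actual code, and it ignores NaN/Inf handling:

```cpp
#include <cstdint>
#include <cstring>

// Emulated bf16 keeps only the top 16 bits of the fp32 representation.
// Truncation simply drops the low 16 bits; round-to-nearest-even adds a
// rounding bias first, as native bf16 conversion does.
static float bf16_truncate(float f) {
  uint32_t bits;
  std::memcpy(&bits, &f, sizeof(bits));
  bits &= 0xFFFF0000u;  // drop the low 16 mantissa bits
  std::memcpy(&f, &bits, sizeof(bits));
  return f;
}

static float bf16_round_nearest_even(float f) {
  uint32_t bits;
  std::memcpy(&bits, &f, sizeof(bits));
  // Standard RNE bias: halfway cases round toward an even (zero) LSB.
  const uint32_t lsb = (bits >> 16) & 1u;
  bits += 0x7FFFu + lsb;
  bits &= 0xFFFF0000u;
  std::memcpy(&f, &bits, sizeof(bits));
  return f;
}
```

For a value exactly halfway between two bf16-representable numbers (e.g. 1 + 2^-7 + 2^-8), the two schemes produce different results, which is enough to nudge a borderline logit.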
I changed …
Bummer, thanks for confirming. I also tried with AVX3 (Skylake, so no native bf16) and got the better answer.
What I want to do is add code to the end of your …
Reading your codebase has been a fun learning experience so far. I think your trick for supporting multiple microarchitectures by having a file repeatedly …

Anyway, here's my first attempt at analyzing what's different about the data under avx512 versus avx2: https://github.com/jart/gemma3/blob/main/report1.txt

So far they appear to be somewhat different, although there are still numerous things I need to confirm to make sure I'm measuring this right. I'm still in the process of understanding, but I'll post updates here as I learn more.
Great idea! I very much appreciate you looking into this. To go from T* to float, you can call the following:
MatT is your T (e.g. SfpStream); kCapacity is an upper bound on how many elements; `compressed` is a thin wrapper over std::array; set compressed_ofs = 0; `out` is your float*; and `num` is how many elements to actually decompress.
Thank you :) Having once fiddled with PE internals, I also respect what you have achieved with the single portable binary :) I suspect many libm functions are based on Cephes, which is quite old and might benefit from a redesign.

BTW, you can generate AVX2 outputs on a newer machine by calling hwy::DisableTargets(HWY_AVX2 - 1) in main() or before the first dispatch.

I just had an idea: it might not be the instructions (you ruled out BF16 already), but the vector length. More per-lane accumulators can change the numerics. In gemma.cc there is one …

Your results do not necessarily look like destructive cancellation, though: …
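The per-lane-accumulator point above can be demonstrated without any SIMD at all. The following scalar sketch (not gemma.cpp code) models a width-L vector loop as L independent accumulators reduced at the end; the data is chosen so a large value absorbs small addends in one ordering but not the other:

```cpp
#include <cstddef>

// Plain sequential fp32 sum.
static float sum_sequential(const float* x, size_t n) {
  float acc = 0.0f;
  for (size_t i = 0; i < n; ++i) acc += x[i];
  return acc;
}

// Scalar model of a width-`lanes` vector loop: independent per-lane
// accumulators, reduced at the end. Same inputs, different rounding order.
static float sum_lanes(const float* x, size_t n, size_t lanes) {
  float acc[64] = {0.0f};  // supports lanes <= 64
  for (size_t i = 0; i < n; ++i) acc[i % lanes] += x[i];
  float total = 0.0f;
  for (size_t l = 0; l < lanes; ++l) total += acc[l];
  return total;
}
```

With 1e8f followed by 64 ones, the sequential sum absorbs every 1.0f (the ulp at 1e8 is 8), while the 8-lane version accumulates most of the ones in lanes that never see the large value. The results differ by 56, from identical fp32 inputs.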
An idea: I notice some of the lines in your output file have a low discrepancy, so it's not just a case of accumulating over time. It may be helpful to segregate by call site, i.e., which MatVec, to understand which are more sensitive/broken. Is it feasible to move your logging to the call site, or should we pass through some kind of caller/line number into MatVec itself?
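One way to sketch the "pass a caller tag into MatVec" idea: the function names and signature below are hypothetical (this is not gemma.cpp's MatVec), but they show per-site instrumentation that records the worst divergence from a float64 reference:

```cpp
#include <cmath>
#include <cstddef>
#include <map>
#include <string>

// Hypothetical instrumentation: per call site, record the largest
// absolute difference between the fp32 dot product and a float64
// reference. The `caller` tag is passed down from each call site.
static std::map<std::string, double>& SiteMaxDiff() {
  static std::map<std::string, double> m;
  return m;
}

static float DotTagged(const float* a, const float* b, size_t n,
                       const char* caller) {
  float acc = 0.0f;
  double ref = 0.0;
  for (size_t i = 0; i < n; ++i) {
    acc += a[i] * b[i];
    ref += static_cast<double>(a[i]) * static_cast<double>(b[i]);
  }
  const double diff = std::fabs(static_cast<double>(acc) - ref);
  double& best = SiteMaxDiff()[caller];
  if (diff > best) best = diff;
  return acc;
}
```

Dumping the map after a run would show which sites (attention vs FFW, for example) contribute the large discrepancies.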
An update: even with CoT prompting (append "Think step by step and check your work"), we're currently seeing the incorrect 15 days also with AVX3. I plan to experiment with higher-precision arithmetic.
Compensated/cascaded summation turns out not to help because we are already using fp32. We found, and fixed in #194, a bug that may be related: behavior changes depending on vector length. Thanks @szabadka for figuring this out! The tail end of vectors was not being masked off, but this should only bite us for array lengths that are not a multiple of 16, which are rare here.
In the attention layer the array length is the context size, which is typically not a multiple of 16 (unless we are past the local-attention window in recurrent gemma).
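A scalar model of the tail-masking bug described above (this is illustrative C++, not the actual gemma.cpp/Highway kernel): a width-4 "vector" loop over a padded buffer silently sums the padding unless the final partial chunk is masked off.

```cpp
#include <cstddef>
#include <vector>

// Buggy version: rounds n up to a whole number of 4-wide vectors, so the
// final chunk also reads whatever sits in the padding past index n - 1.
static float sum_unmasked(const std::vector<float>& padded, size_t n) {
  const size_t kLanes = 4;
  const size_t rounded_up = (n + kLanes - 1) / kLanes * kLanes;
  float acc = 0.0f;
  for (size_t i = 0; i < rounded_up; ++i) acc += padded[i];
  return acc;
}

// Fixed version: full vectors first, then a masked (here: scalar)
// remainder loop that stops exactly at n.
static float sum_masked(const std::vector<float>& padded, size_t n) {
  const size_t kLanes = 4;
  float acc = 0.0f;
  size_t i = 0;
  for (; i + kLanes <= n; i += kLanes)
    for (size_t l = 0; l < kLanes; ++l) acc += padded[i + l];
  for (; i < n; ++i) acc += padded[i];  // remainder lanes masked off
  return acc;
}
```

With n = 10 (not a multiple of the vector width), the unmasked sum includes the two padding slots, which matches the observation that the bug only bites for lengths like a context size that aren't a multiple of the lane count.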
Wow, thank you for solving it! This must have been a difficult find!
Hi @jart, thank you for the confirmation; closing this issue.
On a $10,000 AMD Ryzen 7995WX (znver4 avx512) Gemma 7b instruct sfp is able to solve mathematical riddles.
But on a $600 Intel i9-14900K (raptorlake avx2) the same Gemma model gives the fool's answer.
I expected both machines to produce an identical response since I set the temperature to zero. However, the behavior of gemma.cpp appears to differ in a pernicious way depending on the ISA. It'd be great if people without AVX512 privilege could experience the same level of impressive brilliance from Gemma that I'm seeing on my Threadripper.
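Why a temperature of zero still diverges across ISAs: greedy decoding is an argmax over the logits, so a one-ulp difference in a borderline logit flips the chosen token, and every subsequent token then diverges. A minimal illustration (not gemma.cpp's sampler):

```cpp
#include <algorithm>
#include <cmath>

// At temperature zero, decoding picks the index of the largest logit.
static int Argmax(const float* logits, int n) {
  return static_cast<int>(std::max_element(logits, logits + n) - logits);
}
```

If two near-tied logits end up one ulp apart in opposite orders on the two machines, the argmax (and thus the generated answer) differs, even though every intermediate value agrees to seven significant digits.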