10+% performance improvement of ggml_vec_dot_q4_0 on AVX2 #654

Merged

@SebastianApel (Contributor) commented Mar 31, 2023

UPDATE:

@slaren @rabidcopy @sw @howard0su

First: let me say "sorry" for the confusion: I was not clear in my original post about what the baseline was. This was mainly because #642 was merged AFTER I had branched to run the tests whose results I reported.

When I published this pull request, I had not realized that #642 existed, let alone that it had already been merged. My results were therefore based on the pre-#642 code.

I have re-run the benchmarks now with three versions: the "pre-PR642" AVX2 code, the AVX2 code from #642 and the AVX2 code from this PR (#654).

At least on my machine I get (for 100 iterations of the benchmark from #653) a 14% performance increase, which is similar to what @jart reported in #654 (comment).

I have therefore changed the title of the PR. Again, sorry for the confusion.

Please let me know your thoughts.

Results Summary (single thread performance)

| Codebase | FLOPS (single thread) | % compared to pre-#642 | % compared to #642 |
| --- | --- | --- | --- |
| pre #642 | 14769 FLOPS | 100% | 76% |
| with #642 | 19443 FLOPS | 132% | 100% |
| with #654 | 22217 FLOPS | 150% | 114% |

Please find the raw data of the benchmark runs here: benchmark-data.csv

Boxplot of the results:

[boxplot image]

Technical details

  • Machine: Intel(R) Core(TM) i5-4300M CPU @ 2.60GHz, 16 GB RAM
  • Compiler & switches:
I CFLAGS:   -I.              -O3 -DNDEBUG -std=c11   -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -Wno-unused-function -pthread -mavx -mavx2 -mfma -mf16c -msse3
I CXXFLAGS: -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -pthread
I LDFLAGS:  
I CC:       cc (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0
I CXX:      g++ (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0

PREVIOUS POST:

(I leave it here for the sake of transparency.)
This change produces a ~1.5x performance increase of ggml_vec_dot_q4_0 on AVX2.

  • Avg. FLOPS per uSecond: 14797 (100%) for existing code
  • Avg. FLOPS per uSecond: 21828 (147%) for code in this pull request

Root causes for performance improvement:

  • Existing code merges two __m256i vectors in bytesFromNibbles and then separates them again in ggml_vec_dot_q4_0 (see the sketch right after this list)
  • Existing code uses C inline functions, which are apparently harder for the compiler to optimize
  • Code in this pull request uses a combination of C macros and "by-hand" loop unrolling
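To illustrate the first point: a minimal sketch (assuming a q4_0-style layout of packed 4-bit values; this is illustrative, not the exact code from this PR) of keeping the low and high nibbles in separate registers from the start, instead of interleaving them in a helper and pulling them apart again in the dot product.

```c
#include <immintrin.h>
#include <stdint.h>

// Illustrative only: load 32 bytes (64 packed 4-bit values) and split them
// into two 256-bit registers of low and high nibbles, without the
// interleave/de-interleave round trip of the old bytesFromNibbles helper.
static inline void nibbles_split(const uint8_t *p, __m256i *lo, __m256i *hi)
{
    const __m256i packed = _mm256_loadu_si256((const __m256i *)p);
    const __m256i mask   = _mm256_set1_epi8(0x0F);

    *lo = _mm256_and_si256(packed, mask);                        // low 4 bits of each byte
    *hi = _mm256_and_si256(_mm256_srli_epi16(packed, 4), mask);  // high 4 bits of each byte
}
```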

Benchmark data for the code in the master branch (commit 02c5b27), produced with the benchmark tool from pull request #653.

| Iteration | NThreads | SizeX | SizeY | SizeZ | Required FLOPS | Elapsed uSeconds | FLOPS per uSecond |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | 11008 | 4096 | 128 | 11542724608 | 783404 | 14734.06 |
| 1 | 1 | 11008 | 4096 | 128 | 11542724608 | 781391 | 14772.02 |
| 2 | 1 | 11008 | 4096 | 128 | 11542724608 | 783799 | 14726.64 |
| 3 | 1 | 11008 | 4096 | 128 | 11542724608 | 777324 | 14849.31 |
| 4 | 1 | 11008 | 4096 | 128 | 11542724608 | 781041 | 14778.64 |
| 5 | 1 | 11008 | 4096 | 128 | 11542724608 | 778367 | 14829.41 |
| 6 | 1 | 11008 | 4096 | 128 | 11542724608 | 777931 | 14837.72 |
| 7 | 1 | 11008 | 4096 | 128 | 11542724608 | 780476 | 14789.34 |
| 8 | 1 | 11008 | 4096 | 128 | 11542724608 | 778450 | 14827.83 |
| 9 | 1 | 11008 | 4096 | 128 | 11542724608 | 778300 | 14830.69 |

Benchmark data for code in this pull request:

| Iteration | NThreads | SizeX | SizeY | SizeZ | Required FLOPS | Elapsed uSeconds | FLOPS per uSecond |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | 11008 | 4096 | 128 | 11542724608 | 552244 | 20901.49 |
| 1 | 1 | 11008 | 4096 | 128 | 11542724608 | 540733 | 21346.44 |
| 2 | 1 | 11008 | 4096 | 128 | 11542724608 | 555696 | 20771.65 |
| 3 | 1 | 11008 | 4096 | 128 | 11542724608 | 553204 | 20865.22 |
| 4 | 1 | 11008 | 4096 | 128 | 11542724608 | 546703 | 21113.34 |
| 5 | 1 | 11008 | 4096 | 128 | 11542724608 | 505609 | 22829.35 |
| 6 | 1 | 11008 | 4096 | 128 | 11542724608 | 505045 | 22854.84 |
| 7 | 1 | 11008 | 4096 | 128 | 11542724608 | 525141 | 21980.24 |
| 8 | 1 | 11008 | 4096 | 128 | 11542724608 | 506639 | 22782.94 |
| 9 | 1 | 11008 | 4096 | 128 | 11542724608 | 505468 | 22835.72 |

@jart (Contributor) commented Mar 31, 2023

This PR would change the behavior of evaluation. It makes the output of LLaMA extremely different from what Apple M1 produces on the same inputs. Please compare your change with 02c5b27 which was fine.

What's interesting is that this PR causes behavior to regress in the exact same way #642 did, which was merged a moment ago.. I raised the same concern there too #642 (comment)

I've confirmed that this PR would cause an 11% performance boost on inference, which I measured on an Intel(R) Core(TM) i9-9900 CPU @ 3.10GHz.. Would it be possible to get that performance boost without changing the behavior of LLaMA? If so, why is it not possible to preserve consistency of behavior?

@rabidcopy (Contributor) commented Mar 31, 2023

> This PR would change the behavior of evaluation. It makes the output of LLaMA extremely different from what Apple M1 produces on the same inputs. Please compare your change with 02c5b27 which was fine.
>
> What's interesting is that this PR causes behavior to regress in the exact same way #642 did, which was merged a moment ago.. I raised the same concern there too #642 (comment)
>
> I've confirmed that this PR would cause an 11% performance boost on inference, which I measured on an Intel(R) Core(TM) i9-9900 CPU @ 3.10GHz.. Would it be possible to get that performance boost without changing the behavior of LLaMA? If so, why is it not possible to preserve consistency of behavior?

Edit: A lot of what I said doesn't seem particularly relevant. Here's a comparison between #642 without #617 and this PR.
Left is #642 minus #617, right is this PR. This PR squeezes out a little bit more speed, but the difference may be within the margin of error.
[screenshot comparison]
Then this is the same seed/prompt on the latest master, which has both #642 and #617 merged, and which strangely produces different output and a performance regression.

[screenshot comparison]

@x02Sylvie commented:

There are issues compiling this pull request on MSVC:

error C2065: '__m128i_u': undeclared identifier

It's probably resolvable as mentioned here: https://stackoverflow.com/questions/68939552/can-i-assign-the-result-of-intrinsic-that-returns-m128i-to-variable-of-the-typ
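For what it's worth, the usual workaround is to avoid the GCC-specific __m128i_u type and go through the unaligned-load intrinsic instead; a sketch (not necessarily how this PR ends up fixing it):

```c
#include <immintrin.h>
#include <stdint.h>

// GCC/Clang only - __m128i_u is not defined by MSVC:
//   __m128i v = *(const __m128i_u *)p;   // error C2065 on MSVC
//
// Portable alternative - works on GCC, Clang and MSVC:
static inline __m128i load_unaligned_128(const uint8_t *p)
{
    return _mm_loadu_si128((const __m128i *)p);
}
```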

@SebastianApel (Contributor, Author) commented Mar 31, 2023

@rabidcopy @jart

Could you elaborate a bit more on what you mean by "changes the behavior of the evaluation" and how you produced that outcome?

I tried to replicate your findings on my machine, but I was not successful.

I've tried to reproduce this with three execution runs, all with the same seed:

The text output was identical; the only difference was the improved timings (751.68 ms per run -> 516.78 ms per run), see below.

$ cat /proc/cpuinfo | grep name | uniq
model name	: Intel(R) Core(TM) i5-4300M CPU @ 2.60GHz

$ cat /sys/devices/cpu/caps/pmu_name
haswell

$ gcc -v | tail -1
gcc version 11.3.0 (Ubuntu 11.3.0-1ubuntu1~22.04) 

$ ./main -m /tmp/ggml-alpaca-7b-q4-ggjt.bin -p "Building a website can be done in 10 simple steps:" -n 100 -t 2 --seed 1

 Building a website can be done in 10 simple steps:
Describe your goal.
Brainstorm keywords/phrases.
Choose a domain name.
Set up hosting.
Plan out content structure.
Create Content for your site.
Test and debug your site.
Launch your site!
Market and Promote your site.
1. Describe Your Goal
The first step to creating a website is to clearly describe what you want your website to accomplish. This will help you figure out which features to include,
llama_print_timings:        load time =  6459.39 ms
llama_print_timings:      sample time =   108.39 ms /   100 runs   (    1.08 ms per run)
llama_print_timings: prompt eval time =  7418.28 ms /    14 tokens (  529.88 ms per token)
llama_print_timings:        eval time = 74415.93 ms /    99 runs   (  751.68 ms per run)
llama_print_timings:       total time = 83421.13 ms

$ diff ExecutionRun1.txt ExecutionRun2.txt
13,18c13,17
< llama_print_timings:        load time =  6459.39 ms
< llama_print_timings:      sample time =   108.39 ms /   100 runs   (    1.08 ms per run)
< llama_print_timings: prompt eval time =  7418.28 ms /    14 tokens (  529.88 ms per token)
< llama_print_timings:        eval time = 74415.93 ms /    99 runs   (  751.68 ms per run)
< llama_print_timings:       total time = 83421.13 ms
< 
---
> llama_print_timings:        load time =  6778.95 ms
> llama_print_timings:      sample time =    83.37 ms /   100 runs   (    0.83 ms per run)
> llama_print_timings: prompt eval time =  8431.62 ms /    14 tokens (  602.26 ms per token)
> llama_print_timings:        eval time = 51160.91 ms /    99 runs   (  516.78 ms per run)
> llama_print_timings:       total time = 60553.50 ms

$ LLAMA_OPENBLAS=ON make

I llama.cpp build info: 
I UNAME_S:  Linux
I UNAME_P:  x86_64
I UNAME_M:  x86_64
I CFLAGS:   -I.              -O3 -DNDEBUG -std=c11   -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -Wno-unused-function -pthread -mavx -mavx2 -mfma -mf16c -msse3 -DGGML_USE_OPENBLAS -I/usr/local/include/openblas
I CXXFLAGS: -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -pthread
I LDFLAGS:  -lopenblas
I CC:       cc (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0
I CXX:      g++ (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0



$ diff ExecutionRun1.txt ExecutionRun3.txt
13,17c13,17
< llama_print_timings:        load time =  6459.39 ms
< llama_print_timings:      sample time =   108.39 ms /   100 runs   (    1.08 ms per run)
< llama_print_timings: prompt eval time =  7418.28 ms /    14 tokens (  529.88 ms per token)
< llama_print_timings:        eval time = 74415.93 ms /    99 runs   (  751.68 ms per run)
< llama_print_timings:       total time = 83421.13 ms
---
> llama_print_timings:        load time =  3867.08 ms
> llama_print_timings:      sample time =    80.33 ms /   100 runs   (    0.80 ms per run)
> llama_print_timings: prompt eval time =  5218.40 ms /    14 tokens (  372.74 ms per token)
> llama_print_timings:        eval time = 45214.81 ms /    99 runs   (  456.72 ms per run)
> llama_print_timings:       total time = 51430.97 ms

@rabidcopy (Contributor) commented Mar 31, 2023

For some reason the current master cancels out a fair portion of #642's original performance gains and gives different outputs in my case. Either way, I get a slightly higher uplift with this PR compared to #642, both with and without #617. System is Linux, compiling with LLAMA_OPENBLAS=ON make in all my findings. Might be worth compiling without BLAS. Edit: Same findings without BLAS, aside from the uplift that BLAS itself provides.

@sw (Collaborator) commented Mar 31, 2023

Joining the discussion here, coming from #642...

I'm seeing a tiny speed improvement of ~2% here, certainly not 1.5x. This is on an i3-8100.

No change in output compared to master after merge of #642, which is in line with @jart's observation:

> What's interesting is that this PR causes behavior to regress in the exact same way #642 did

@SebastianApel (Contributor, Author) commented:

> That's very strange. For some reason current master cancels out a fair portion of #642's original performance gains and gives different outputs in my case. Either way, I get a slightly higher uplift with this PR compared to #642 with and without #617. System is Linux, compiling with LLAMA_OPENBLAS=ON make in all my findings. Might be worth compiling without BLAS..

I've also tried to replicate with LLAMA_OPENBLAS=ON make: the output is still identical, and the performance is even better. See my comment above.

@SebastianApel (Contributor, Author) commented Mar 31, 2023

> Joining the discussion here, coming from #642...
>
> I'm seeing a tiny speed improvement of ~2% here, certainly not 1.5x. This is on an i3-8100.
>
> No change in output with master after merge of #642, which is in line with @jart's observation:
>
> > What's interesting is that this PR causes behavior to regress in the exact same way #642 did

@sw That's very interesting; apparently the performance is highly dependent on the specific combination of CPU, compiler, operating system, etc.

Would it be possible for you to try compiling with the compiler switch -march=native appended to lines 34 and 35 of the Makefile?
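For reference, a sketch of that change (assuming lines 34 and 35 hold the CFLAGS and CXXFLAGS definitions; appending with += has the same effect as editing the lines directly):

```make
# illustrative - append to the existing flag definitions in the Makefile
CFLAGS   += -march=native
CXXFLAGS += -march=native
```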

@sw (Collaborator) commented Mar 31, 2023

-march=native is very slightly faster (again ~2%), but I'm not seeing a significant difference between master and this PR. This seems more and more like a wild goose chase...

@rabidcopy (Contributor) commented:

Please disregard most if not all of what I said regarding #617 and #642. I had folders mixed up. There's no mystery performance regression. Performance is still a tiny bit better with this PR, but only by 10-15 ms for me.

@SebastianApel (Contributor, Author) commented:

@sw @rabidcopy

The problem (at least on my machine) is: even with the SAME executable, I get a lot of variance in the "ms per run" eval timings. Between the minimum (448 ms per run) and the maximum (598 ms per run) there is a delta of ~150 ms, which is about 33% of the time per run (see below).

My hypothesis: it depends heavily on what else is happening on the system.

This means it is going to be very hard to draw any conclusion from a single run of ./main.

I think we need a consistent "test harness" with identical settings and compiler switches that executes several runs to get a statistical view. I started to work towards something like this in PR #653; a minimal sketch of the idea follows the timings below.

llama_print_timings:        eval time = 56604.60 ms /    99 runs   (  571.76 ms per run)
llama_print_timings:        eval time = 46594.74 ms /    99 runs   (  470.65 ms per run)
llama_print_timings:        eval time = 46408.88 ms /    99 runs   (  468.78 ms per run)
llama_print_timings:        eval time = 59202.12 ms /    99 runs   (  598.00 ms per run)   <- MAX
llama_print_timings:        eval time = 48923.75 ms /    99 runs   (  494.18 ms per run)
llama_print_timings:        eval time = 46817.46 ms /    99 runs   (  472.90 ms per run)
llama_print_timings:        eval time = 44402.43 ms /    99 runs   (  448.51 ms per run)    <- MIN
llama_print_timings:        eval time = 47603.48 ms /    99 runs   (  480.84 ms per run)
llama_print_timings:        eval time = 47156.63 ms /    99 runs   (  476.33 ms per run)
llama_print_timings:        eval time = 47849.84 ms /    99 runs   (  483.33 ms per run)
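For illustration, a minimal sketch of such a harness (the workload function is a stand-in so the example is self-contained; a real harness would time the actual ggml kernel or a full model eval instead):

```c
#define _POSIX_C_SOURCE 199309L
#include <math.h>
#include <stdio.h>
#include <time.h>

// Stand-in workload so the sketch compiles on its own; a real harness
// would call the kernel under test (e.g. the q4_0 dot product) instead.
static double workload(void)
{
    volatile double acc = 0.0;
    for (int i = 0; i < 20000000; i++) acc += i * 1e-9;
    return acc;
}

// build: cc -O2 bench_sketch.c -lm && ./a.out
int main(void)
{
    enum { RUNS = 10 };
    double ms[RUNS], sum = 0.0, sum_sq = 0.0;

    for (int i = 0; i < RUNS; i++) {
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        workload();
        clock_gettime(CLOCK_MONOTONIC, &t1);
        ms[i] = (t1.tv_sec - t0.tv_sec) * 1e3 + (t1.tv_nsec - t0.tv_nsec) / 1e6;
        sum    += ms[i];
        sum_sq += ms[i] * ms[i];
        printf("run %2d: %8.2f ms\n", i, ms[i]);
    }

    const double avg = sum / RUNS;
    const double sd  = sqrt(sum_sq / RUNS - avg * avg);
    printf("avg %.2f ms, stddev %.2f ms over %d runs\n", avg, sd, RUNS);
    return 0;
}
```

Repeating the measurement inside one process and reporting the average and spread makes it much easier to tell a real 10-15% improvement from scheduler noise.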

@rabidcopy (Contributor) commented:

Just a small comparison on a 13B model. Left is PR, right is master. Compiled with BLAS and -march=native.
[screenshot comparison]

@howard0su (Collaborator) commented:

To help, I ran a stats calculation based on FLOPS per µs:

ministat.exe a.txt b.txt
x a.txt
+ b.txt
+--------------------------------------------------------------------------+
|xx                                                                       +|
|xx                                                     +                 +|
|xx                                                    ++ + +     +      ++|
||A                                                     |________AM______| |
+--------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x  10      14726.64      14849.31      14827.83     14797.566     44.044047
+  10      20771.65      22854.84      21980.24     21828.123     922.04901
Difference at 95.0% confidence
        7030.56 +/- 613.303
        47.5116% +/- 4.14462%
        (Student's t, pooled s = 652.731)

@SebastianApel (Contributor, Author) commented Apr 1, 2023

> To help, I ran a stats calculation based on FLOPS per µs:

@howard0su Nice! What do "a" and "b" stand for in your comment?

@SebastianApel changed the title from "~1.5x performance improvement of ggml_vec_dot_q4_0 on AVX2" to "10+% performance improvement of ggml_vec_dot_q4_0 on AVX2" on Apr 1, 2023
@Ameobea (Contributor) commented Apr 2, 2023

Impressive work! I honestly don't fully understand how this is providing performance improvements over the already optimized stuff in #642, but I can replicate the improvement locally on my machine as well.

It seems that after these changes, the AVX512 version can't keep pace and it's actually significantly slower to use that version now, even on AVX512-capable hardware. I've made some attempts to port your changes over to the AVX512 code as well, but without any success so far.

My best guess for the reason behind this is that the extremely dense 4-bit data has more overhead than benefit for AVX512 at this point. For example, there is no single instruction to broadcast a 32-bit value from memory to a zmm/512-bit register; _mm512_set1_ps seems to produce an expensive combination of a broadcast into a ymm followed by a vshuff32x4.

Additionally, I've noticed that while shorter in instruction count, the AVX512 code actually has a longer latency due to data dependencies. The CPU actually seems to be able to parallelize work across three execution ports in your improved AVX2 version (AVX2 execution trace) vs. only two in the AVX512 version (AVX512 execution trace)

I've created a godbolt comparison between your AVX2 version and my attempt at porting it to AVX512. You can switch between them by toggling the define on line 20.

Anyway, I figure there's a good chance you don't have access to an AVX512-enabled CPU yourself, so if you're not able to replicate or continue on with the AVX512 porting of your improvements, I think the best thing to do for now would be to remove or disable the AVX512 codepath and default back to AVX2 for all CPUs that support it.


EDIT: I've done some more investigation, and it seems that clang-13/clang++-13 is producing significantly better code than gcc for both your current AVX2 and my adapted AVX512 version. AVX512 is still significantly slower, but yeah I think that's something worth investigating if we're really chasing down perf.

Updated godbolt comparison using clang 13: https://c.godbolt.org/z/Yv5szs6Kh

clang-13 AVX2 execution trace: https://uica.uops.info/tmp/63000579f2b24ee182d450170a4f5dd8_trace.html

clang-13 AVX512 execution trace: https://uica.uops.info/tmp/d65cfb55599a41ffbb0cbcbe22c8dcbe_trace.html

With clang-13 on my 7950x CPU, I'm seeing:

  • 231.11 ms per run with AVX2
  • 298.52 ms per run with my ported AVX512 version.

With gcc-12, I'm seeing:

  • 247.57 ms per run with AVX2
  • 329.70 ms per run with AVX512

@Ameobea (Contributor) commented Apr 2, 2023

I've wasted a couple more hours on this. I've realized that once I get ~6 threads going, there are no detectable differences between the AVX2 and AVX512 versions. The whole thing becomes completely memory bound and no changes to compiler or AVX2 vs. AVX512 move more than a percent or two away from 100ms/token.

I'm not going to spend any more time trying to optimize this myself at this point. My opinion is that we should just drop the AVX512 implementation. It provides no detectable benefit for me in the best case, and makes things slower in the worst case.

@ggerganov (Owner) left a review comment:

I agree with @Ameobea that we should drop AVX512 for now until we have an implementation that is actually benefiting from these instructions. Moreover, we now use -march=native by default.

Regarding reproducibility of the results:
It is totally expected that the produced results vary between different SIMD implementations, since we are dealing with floating-point round-off errors and the model is auto-regressive, meaning that small variability at the start can cause totally different results at the end.
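As a tiny illustration of that round-off point (the values are arbitrary, chosen only to make the effect visible in one line of output):

```c
#include <stdio.h>

// Summing the same three floats in a different order gives a different
// result; SIMD implementations effectively reorder such sums, so their
// outputs can legitimately diverge token by token.
int main(void)
{
    const float a = 1e8f, b = -1e8f, c = 1.0f;
    printf("(a + b) + c = %.1f\n", (a + b) + c);  // 1.0
    printf("a + (b + c) = %.1f\n", a + (b + c));  // 0.0 - c is absorbed into b
    return 0;
}
```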

The thing we are interested in keeping approximately the same is perplexity.
This is the quality metric we currently have, and as long as it is not degraded, we can accept changes to the SIMD implementation that improve performance.

I haven't tested this PR on x86 since it is more tedious for me, but if you observe good perplexity values and performance improvement, please go ahead and merge these changes.

Important
When you evaluate the perplexity, remember that for batch sizes >= 32, ggml will currently switch to OpenBLAS sgemm if you are linking against it, and this piece of code will not be exercised at all. So make sure to either use a smaller batch size or, better, disable OpenBLAS when evaluating the perplexity.
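A hedged example of how that could look in practice (the -b batch-size and -f prompt-file flags are assumed from the llama.cpp example programs of that time; model and data paths are placeholders):

```sh
# rebuild without OpenBLAS so the ggml q4_0 path is what actually gets measured
make clean && make

# or keep OpenBLAS but force a batch size below 32 so sgemm is not used
./perplexity -m ./models/7B/ggml-model-q4_0.bin -f ./wiki.test.raw -b 16
```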

ggml.c: review comment (outdated, resolved)
@SebastianApel (Contributor, Author) commented:

> This is the quality metric we currently have, and as long as it is not degraded, we can accept changes to the SIMD implementation that improve performance.

I would be very happy to verify this. But: I started the perplexity binary and it tells me: 223.28 seconds per pass - ETA 40.63 hours. This means it would take nearly 4 days until I have a result (2 runs, one for #642, one for #654).

Is there someone with a beefier machine who could help out?

@SebastianApel (Contributor, Author) commented:

> we should drop AVX512 for now until we have an implementation that is actually benefiting from these instructions.

@ggerganov I would suggest moving that to a separate PR and keeping the removal in a single commit. This would help keep things clean, and it would make it easier to revert the commit if the AVX512 code is still wanted by someone.

@SebastianApel (Contributor, Author) commented:

@Ameobea

> Additionally, I've noticed that while shorter in instruction count, the AVX512 code actually has a longer latency due to data dependencies. The CPU actually seems to be able to parallelize work across three execution ports in your improved AVX2 version (AVX2 execution trace) vs. only two in the AVX512 version (AVX512 execution trace)

Awesome, thank you for the links - I did not know that existed; it's super helpful.

> Anyway, I figure there's a good chance you don't have access to an AVX512-enabled CPU yourself,

Yes, I currently do not have access to an AVX512 CPU. This is why I suggested moving that into a separate issue here: #654 (comment)

> I've done some more investigation, and it seems that clang-13/clang++-13 is producing significantly better code than gcc for both your current AVX2

I was also under the impression that clang seems to produce slightly better assembly. But I am not sure whether I was comparing it against gcc without -march=native.

@SebastianApel force-pushed the performance_improvement_q4_0_AVX2 branch from cb5cc71 to d8acf29 (April 2, 2023 20:03)
@SebastianApel requested a review from sw (April 2, 2023 20:04)
@SebastianApel force-pushed the performance_improvement_q4_0_AVX2 branch from d8acf29 to e621f62 (April 2, 2023 20:08)
@SebastianApel force-pushed the performance_improvement_q4_0_AVX2 branch from 9e62f03 to 69ef03d (April 2, 2023 20:13)
ggml.c: review comment (outdated, resolved)
@rabidcopy (Contributor) commented:

After some short but repeated testing of the latest commits, I see about the same performance as with the initial commit of this PR. Compared to current master it is, on average, 190 ms per run versus 165 ms per run, with this PR being faster. So a 10-15% performance increase is still pretty accurate. The output remains the same.

@SebastianApel requested a review from sw (April 3, 2023 07:37)
@sw (Collaborator) left a review comment:

Given @ggerganov's approval on a previous revision, and the detailed statistics provided by @SebastianApel (thanks!), I'm merging this even though I see no improvement on my machine.

@sw merged commit 437e778 into ggerganov:master on Apr 3, 2023
@rabidcopy (Contributor) commented:

Glad to see this merged. Little by little inference is getting faster.

@sw (Collaborator) commented Apr 4, 2023

This brought in a bunch of trailing whitespace, sorry for not catching that. @SebastianApel: please configure your editor properly.
