
Add support for batch size to --perplexity #407

Merged (10 commits) on Apr 13, 2023

Conversation

@glinscott (Collaborator) commented Mar 22, 2023

Adds support for batch size to perplexity. This allows you to run perplexity on context sizes of 2048 (although still limited to a batch size of 512 there, as that's the max currently supported). We set the batch size to the minimum of the user-defined batch size and the context size.
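
Conceptually the clamp is a one-liner in the perplexity setup; a minimal sketch, assuming the usual gpt_params fields n_batch and n_ctx (names may differ slightly from the actual patch):

// never evaluate more tokens per llama_eval call than fit in the context window
params.n_batch = std::min(params.n_batch, params.n_ctx);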

Note, though, that small batch sizes like 8 are significantly slower than 512, and perplexity also takes a hit (from switching off BLAS):

$ ./perplexity -m models/7B/ggml-model-q4_0.bin -f wiki.test.raw
perplexity : calculating perplexity over 655 chunks, batch_size=8
19.95 seconds per pass - ETA 3.63 hours
[1]4.5949,

$ ./perplexity -m models/7B/ggml-model-q4_0.bin -f wiki.test.raw -b 512
13.41 seconds per pass - ETA 2.44 hours
[1]4.3800,

For larger batch sizes, though, the results match well:

$ ./perplexity -m models/7B/ggml-model-q4_0.bin -f wiki.test.raw -c 2048 -b 512
68.21 seconds per pass - ETA 3.09 hours                                                                                               
[1]4.0474,

$ ./perplexity -m models/7B/ggml-model-q4_0.bin -f wiki.test.raw -c 2048 -b 64
208.50 seconds per pass - ETA 9.44 hours
[1]4.0474,

This also fixes the batch size passed into llama_eval so it's no longer off by one, although I didn't notice much of a speed difference for large batch sizes.
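
For illustration, a rough sketch of the batched evaluation loop described above (hedged: it assumes the llama_eval(ctx, tokens, n_tokens, n_past, n_threads) API of that era plus surrounding variables like tokens, start, and params; it is not copied from the patch, and it needs <algorithm> for std::min):

// evaluate one context-sized chunk in slices of at most n_batch tokens
for (int j = 0; j < params.n_ctx; j += params.n_batch) {
    // the last slice of the chunk may be shorter than n_batch
    const int n_eval = std::min((int) params.n_batch, (int) params.n_ctx - j);
    // n_past is the number of tokens of this chunk already evaluated
    if (llama_eval(ctx, tokens.data() + start + j, n_eval, j, params.n_threads)) {
        fprintf(stderr, "failed to eval\n");
        return;
    }
}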

@glinscott changed the title from "Add support to batch size for perplexity" to "Add support for batch size to --perplexity" on Mar 22, 2023
@glinscott (Collaborator, Author)

Ok, very interesting result. From #406 (reply in thread), there was a delta between 10 and 32 threads, so I tried rerunning my experiment with 1 thread. It's perfectly consistent across the different batch sizes now! Interestingly, the 1-thread results both match the batch size 512 result (with 32 threads).

@gjmulder added the "enhancement" and "generation quality" labels on Mar 23, 2023
@ggerganov (Owner) commented Mar 23, 2023

I can confirm the observation - looking into this.

Edit: The source of the variation looks to be in the "self-attention" section of llama_eval(). Could be some numerical instability.

Edit2: Pretty sure I've pinpointed the cause:

struct ggml_tensor * KQV = ggml_mul_mat(ctx0, V_trans, KQ_soft_max);

This matrix multiplication z = x * y goes through the branch where x (i.e. V_trans) is not contiguous in memory, i.e. we have it transposed via the ggml_permute() call on the previous line. The simple fix is to make a copy into a contiguous buffer, but I want to see if I can find the instability in this branch and try to fix it.
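
For reference, a minimal sketch of the "copy into a contiguous buffer" workaround (assuming ggml's ggml_cpy and ggml_new_tensor_3d helpers; the tensor dimensions are illustrative, not lifted from the eventual fix):

// copy the permuted (non-contiguous) V into a fresh contiguous F32 tensor,
// so the KQV multiplication takes the contiguous branch
struct ggml_tensor * V_cont = ggml_cpy(ctx0, V_trans,
        ggml_new_tensor_3d(ctx0, GGML_TYPE_F32, n_past + N, n_embd/n_head, n_head));

struct ggml_tensor * KQV = ggml_mul_mat(ctx0, V_cont, KQ_soft_max);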

@ggerganov (Owner) commented Mar 23, 2023

@glinscott
Here is a quick fix to not block you:

#439

Seems like the "transposed X" branch is more efficient, but not numerically stable. I will try to see if I can resolve it, and if not, I will simply remove it altogether from ggml.

@Green-Sky (Collaborator) commented Mar 23, 2023

This kind of collides with #438. Also, the batch size in your case has the opposite meaning of what it normally does (in non-perplexity mode), which is quite deceptive.

@glinscott (Collaborator, Author)

@ggerganov awesome! Thank you, very nice find.

@Green-Sky ah, I must admit I don't quite understand the main-mode batch size parameter. I thought it triggers evaluation after every batch of tokens, which is what this is intending to do. Also, I thought it would save a significant amount of RAM, but in practice that seems not to be the case, so I'm not sure it's actually useful.

@glinscott (Collaborator, Author)

@ggerganov ./main --perplexity -m models/7B/ggml-model-q4_0.bin -f wiki.test.raw runs out of buffer space with 4870e45. If I go back to 483bab2 it works well.

@glinscott (Collaborator, Author)

Ok, so after 483bab2, I see that results are consistent for a given batch_size, but different across batch_sizes.

E.g., I tested $ ./main --perplexity -m models/7B/ggml-model-q4_0.bin -f wiki.test.raw -b 8 with 1 thread, 8 threads, and 32 threads, and always got [1]4.6257.

Then I tested $ ./main --perplexity -m models/7B/ggml-model-q4_0.bin -f wiki.test.raw -b 512 with 1, 8, and 32 threads and always got [1]4.5690.

Then, did a few more:
batch_size=256, threads=32 -> 4.5690
batch_size=64, threads=32 -> 4.5690
batch_size=32, threads=32 -> 4.5690
batch_size=16, threads=32 -> 4.5903 ** first delta
batch_size=16, threads=16 -> 4.5903

@Green-Sky (Collaborator)

> Also, I thought it would save a significant amount of RAM, but in practice, that seems to not be the case, so I'm not sure it's actually useful.

It would, if the memory management actually took it into account.

> @ggerganov ./main --perplexity -m models/7B/ggml-model-q4_0.bin -f wiki.test.raw runs out of buffer space with 4870e45. If I go back to 483bab2 it works well.

See #438 for a competing attempt.

@ggerganov (Owner)

@Green-Sky I think I know how to fix the memory issues and reduce token memory usage drastically. I will try to do this later today.

@glinscott Are you on a Mac? I think at batch size >= 32 the BLAS Accelerate mul_mat branch is triggered:

llama.cpp/ggml.c, lines 5722 to 5729 at 3cd8dde:

// TODO: find the optimal values for these
if (ggml_is_contiguous(src0) &&
ggml_is_contiguous(src1) && ((ne0 >= 32 && ne1 >= 32 && ne10 >= 32))) {
//printf("BLAS: %d %d %d\n", ne0, ne1, ne10);
return true;
}

If you disable BLAS with make clean && LLAMA_NO_ACCELERATE=1 make, you should maybe get the same results?

@glinscott (Collaborator, Author)

@ggerganov thanks for the suggestion - I'm on an AMD 5950X. I did try building with LLAMA_NO_ACCELERATE=1, but got the same results. It is interesting that the results switch at batch size 16 rather than 32, though.

@glinscott (Collaborator, Author)

I'm doing a run to compare batch sizes 8 and 512 at the default context size with BLAS on, and if that looks close, this is ready to go. Otherwise, I'd probably default the batch size to min(512, context_size) automatically.

@glinscott (Collaborator, Author)

Some very interesting results here. I'm currently building with LLAMA_OPENBLAS=1 make -j4. I think this means the batch_size 8 run is not using BLAS, while the larger one is, and that results in a huge perplexity decrease! I will have to try switching to the BLAS version even for smaller sizes and see how it goes.

$ ./perplexity -m models/7B/ggml-model-q4_0.bin -f wiki.test.raw -t 16
perplexity : calculating perplexity over 655 chunks, batch_size=8
18.83 seconds per pass - ETA 3.43 hours 
[655]6.6016,

$ ./perplexity -m models/7B/ggml-model-q4_0.bin -f wiki.test.raw -t 16 -b 512  
perplexity : calculating perplexity over 655 chunks, batch_size=512
12.96 seconds per pass - ETA 2.36 hours   
[655]6.2838

@glinscott (Collaborator, Author)

Indeed, hardcoding ggml_compute_forward_mul_mat_use_blas to return true results in excellent perplexity, but it's incredibly slow:

perplexity : calculating perplexity over 655 chunks, batch_size=16
847.81 seconds per pass - ETA 154.26 hours
[1]4.3801,
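
(For reference, the "hardcode" experiment amounts to forcing the dispatch helper quoted earlier to always return true, i.e. a temporary hack along these lines in ggml.c; parameters elided, not meant to be merged:)

static bool ggml_compute_forward_mul_mat_use_blas(/* ... */) {
    return true; // force every mul_mat through the dequantize + sgemm path
}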

@ivanstepanovftw (Collaborator)

Hi @glinscott, I have tested "hardcoding ggml_compute_forward_mul_mat_use_blas to return true" using Intel MKL:

 ## Question: What is best in life? ## Jeeves: Biotechnology and Society.2015, v.35(4), p.678-692.
The Human Genome Project (HGP) was
llama_print_timings:        load time =  9084.05 ms
llama_print_timings:      sample time =    21.13 ms /    40 runs   (    0.53 ms per run)
llama_print_timings: prompt eval time = 16373.49 ms /    16 tokens ( 1023.34 ms per token)
llama_print_timings:        eval time = 157398.71 ms /    39 runs   ( 4035.86 ms per run)
llama_print_timings:       total time = 174428.73 ms

And OpenBLAS:

 ## Question: What is best in life? ## Jeeves: Biotechnology and Society.2015, v.35(4), p.678-692.
The Human Genome Project (HGP) was
llama_print_timings:        load time =  8909.49 ms
llama_print_timings:      sample time =    24.11 ms /    40 runs   (    0.60 ms per run)
llama_print_timings: prompt eval time = 16349.56 ms /    16 tokens ( 1021.85 ms per token)
llama_print_timings:        eval time = 288016.53 ms /    39 runs   ( 7385.04 ms per run)
llama_print_timings:       total time = 305045.43 ms

Also, the perplexity is slightly better ([3]5.8269 vs [3]5.8271).

@ggerganov (Owner)

@glinscott

Yes, using BLAS during perplexity computation can be deceiving (I think I noted this somewhere earlier).
I think the explanation is the following:

Let's take the matrix multiplication Z = X*Y, where:

  • X is 4-bit quantized
  • Y is FP32
  • Z is FP32

When using BLAS, ggml will dequantize X into FP32 and use BLAS's sgemm to do the matrix multiplication.
When not using BLAS, ggml will quantize Y to 4-bit and use the SIMD routines for 4-bit dot product.

I think the BLAS computation will be more precise, because we lose precision when quantizing Y in the latter case.
Additionally, I am not super confident that the current dot product routines accumulate the floating-point values optimally - I think there might be things to improve here.
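
To make the precision argument concrete, here is a small self-contained toy (not ggml code; just an illustration of the idea) comparing a full-FP32 dot product with one where the second operand is first quantized to 4-bit levels, as in the non-BLAS path:

#include <algorithm>
#include <cmath>
#include <cstdint>
#include <cstdio>
#include <vector>

int main() {
    const int n = 256;
    std::vector<float> x(n), y(n);
    for (int i = 0; i < n; ++i) {
        x[i] = std::sin(0.1f * i);
        y[i] = std::cos(0.3f * i);
    }

    // reference: FP32 dot product (roughly what the BLAS path computes after dequantizing X)
    double ref = 0.0;
    for (int i = 0; i < n; ++i) ref += (double) x[i] * y[i];

    // non-BLAS path (simplified): quantize y to 4-bit signed levels with a single scale
    float amax = 0.0f;
    for (int i = 0; i < n; ++i) amax = std::max(amax, std::fabs(y[i]));
    const float scale = amax / 7.0f; // 4-bit signed levels in [-7, 7]
    double approx = 0.0;
    for (int i = 0; i < n; ++i) {
        const int8_t q = (int8_t) std::round(y[i] / scale);
        approx += (double) x[i] * (q * scale); // dot product against the quantized y
    }

    std::printf("fp32 dot = %.6f, q4 dot = %.6f, abs err = %.6f\n", ref, approx, std::fabs(ref - approx));
    return 0;
}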

@ggerganov added the "high priority" label on Apr 13, 2023
@ggerganov (Owner)

It would be nice to get this PR finalized.

@glinscott marked this pull request as ready for review on April 13, 2023 at 15:21
@glinscott (Collaborator, Author)

Sorry for the delay - I've updated it so the batch size defaults to 512, which is much faster. Ready to go!

@ggerganov merged commit be87b6e into ggerganov:master on Apr 13, 2023
AAbushady pushed a commit to AAbushady/llama.cpp that referenced this pull request on Jan 27, 2024:

Adds the hipBLAS gpu_target $(shell $(ROCM_PATH)/llvm/bin/amdgpu-arch) back to the gpu_target line, possibly allowing misc GPU archs like gfx1031 or gfx1032 to be built.