
Add support for batch size to --perplexity #407

Merged (10 commits) on Apr 13, 2023

Conversation

@glinscott (Collaborator) commented Mar 22, 2023

Adds support for batch size to perplexity. This allows you to run perplexity on context sizes of 2048 (although still limited to a batch size of 512 there, as that's the max currently supported). We set the batch size to the minimum of the user-defined batch size and the context size.
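
Conceptually the clamp is a one-liner in the perplexity setup; a minimal sketch, assuming the usual gpt_params fields n_batch and n_ctx (names may differ slightly from the actual patch):

// never evaluate more tokens per llama_eval call than fit in the context window
params.n_batch = std::min(params.n_batch, params.n_ctx);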

Note, though, that small batch sizes like 8 are significantly slower than 512, and perplexity also takes a hit (from switching off BLAS):

$ ./perplexity -m models/7B/ggml-model-q4_0.bin -f wiki.test.raw
perplexity : calculating perplexity over 655 chunks, batch_size=8
19.95 seconds per pass - ETA 3.63 hours
[1]4.5949,

$ ./perplexity -m models/7B/ggml-model-q4_0.bin -f wiki.test.raw -b 512
13.41 seconds per pass - ETA 2.44 hours
[1]4.3800,

For larger batch sizes, though, the results match well:

$ ./perplexity -m models/7B/ggml-model-q4_0.bin -f wiki.test.raw -c 2048 -b 512
68.21 seconds per pass - ETA 3.09 hours                                                                                               
[1]4.0474,

$ ./perplexity -m models/7B/ggml-model-q4_0.bin -f wiki.test.raw -c 2048 -b 64
208.50 seconds per pass - ETA 9.44 hours
[1]4.0474,

This also fixes the batch size passed into llama_eval so it's no longer off by one, although I didn't notice much of a speed difference for large batch sizes.
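
For illustration, a rough sketch of the batched evaluation loop described above (hedged: it assumes the llama_eval(ctx, tokens, n_tokens, n_past, n_threads) API of that era plus surrounding variables like tokens, start, and params; it is not copied from the patch, and it needs <algorithm> for std::min):

// evaluate one context-sized chunk in slices of at most n_batch tokens
for (int j = 0; j < params.n_ctx; j += params.n_batch) {
    // the last slice of the chunk may be shorter than n_batch
    const int n_eval = std::min((int) params.n_batch, (int) params.n_ctx - j);
    // n_past is the number of tokens of this chunk already evaluated
    if (llama_eval(ctx, tokens.data() + start + j, n_eval, j, params.n_threads)) {
        fprintf(stderr, "failed to eval\n");
        return;
    }
}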

@glinscott changed the title from "Add support to batch size for perplexity" to "Add support for batch size to --perplexity" on Mar 22, 2023
@glinscott (Collaborator, Author)

Ok, very interesting result. From #406 (reply in thread), there was a delta between 10 and 32 threads, so I tried rerunning my experiment with 1 thread. It's perfectly consistent across the different batch sizes now! Interestingly, the 1-thread results both match the batch size 512 result (with 32 threads).

@gjmulder added the "enhancement" and "generation quality" labels on Mar 23, 2023
@ggerganov (Owner) commented Mar 23, 2023

I can confirm the observation - looking into this.

Edit: The source of the variation looks to be in the "self-attention" section of llama_eval(). Could be some numerical instability.

Edit2: Pretty sure I've pinpointed the cause:

struct ggml_tensor * KQV = ggml_mul_mat(ctx0, V_trans, KQ_soft_max);

This matrix multiplication z = x * y goes through the branch where x (i.e. V_trans) is not contiguous in memory, i.e. we have it transposed via the ggml_permute() call on the previous line. The simple fix is to make a copy into a contiguous buffer, but I want to see if I can find the instability in this branch and try to fix it.
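
For reference, a minimal sketch of the "copy into a contiguous buffer" workaround (assuming ggml's ggml_cpy and ggml_new_tensor_3d helpers; the tensor dimensions are illustrative, not lifted from the eventual fix):

// copy the permuted (non-contiguous) V into a fresh contiguous F32 tensor,
// so the KQV multiplication takes the contiguous branch
struct ggml_tensor * V_cont = ggml_cpy(ctx0, V_trans,
        ggml_new_tensor_3d(ctx0, GGML_TYPE_F32, n_past + N, n_embd/n_head, n_head));

struct ggml_tensor * KQV = ggml_mul_mat(ctx0, V_cont, KQ_soft_max);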

@ggerganov (Owner) commented Mar 23, 2023

@glinscott
Here is a quick fix to not block you:

#439

Seems like the "transposed X" branch is more efficient, but not numerically stable. I will try to see if I can resolve it, and if not, I will simply remove it altogether from ggml.

@Green-Sky (Collaborator) commented Mar 23, 2023

This kind of collides with #438. Also, the batch size in your case has the opposite meaning of what it normally does (in non-perplexity mode), which is quite deceptive.

@glinscott (Collaborator, Author)

@ggerganov awesome! Thank you, very nice find.

@Green-Sky ah, I must admit I don't quite understand the main-mode batch size parameter. I thought it triggers evaluation after every batch of tokens, which is what this is intending to do. Also, I thought it would save a significant amount of RAM, but in practice that seems not to be the case, so I'm not sure it's actually useful.

@glinscott (Collaborator, Author)

@ggerganov ./main --perplexity -m models/7B/ggml-model-q4_0.bin -f wiki.test.raw runs out of buffer space with 4870e45. If I go back to 483bab2 it works well.

@glinscott (Collaborator, Author)

Ok, so after 483bab2, I see that results are consistent for a given batch_size, but different across batch_sizes.

E.g., I tested $ ./main --perplexity -m models/7B/ggml-model-q4_0.bin -f wiki.test.raw -b 8 with 1 thread, 8 threads, and 32 threads, and always got [1]4.6257.

Then I tested $ ./main --perplexity -m models/7B/ggml-model-q4_0.bin -f wiki.test.raw -b 512 with 1, 8, and 32 threads and always got [1]4.5690.

Then, did a few more:
batch_size=256, threads=32 -> 4.5690
batch_size=64, threads=32 -> 4.5690
batch_size=32, threads=32 -> 4.5690
batch_size=16, threads=32 -> 4.5903 ** first delta
batch_size=16, threads=16 -> 4.5903

@Green-Sky (Collaborator)

> Also, I thought it would save a significant amount of RAM, but in practice, that seems to not be the case, so I'm not sure it's actually useful.

It would, if the memory management actually took it into account.

> @ggerganov ./main --perplexity -m models/7B/ggml-model-q4_0.bin -f wiki.test.raw runs out of buffer space with 4870e45. If I go back to 483bab2 it works well.

See #438 for a competing attempt.

@ggerganov (Owner)

@Green-Sky I think I know how to fix the memory issues and reduce token memory usage drastically. I will try to do this later today.

@glinscott Are you on a Mac? I think at batch size >= 32 the BLAS Accelerate mul_mat branch is triggered:

llama.cpp/ggml.c, lines 5722 to 5729 at 3cd8dde:

// TODO: find the optimal values for these
if (ggml_is_contiguous(src0) &&
ggml_is_contiguous(src1) && ((ne0 >= 32 && ne1 >= 32 && ne10 >= 32))) {
//printf("BLAS: %d %d %d\n", ne0, ne1, ne10);
return true;
}

If you disable BLAS with make clean && LLAMA_NO_ACCELERATE=1 make, you should maybe get the same results?

@glinscott (Collaborator, Author)

@ggerganov thanks for the suggestion - I'm on an AMD 5950X. I did try building with LLAMA_NO_ACCELERATE=1, but got the same results. It is interesting that the results switch at batch size 16 rather than 32, though.

@glinscott (Collaborator, Author)

I'm doing a run to compare batch sizes 8 and 512 at the default context size with BLAS on, and if that looks close, this is ready to go. Otherwise, I'd probably default the batch size to min(512, context_size) automatically.

@glinscott (Collaborator, Author)

Some very interesting results here. I'm currently building with LLAMA_OPENBLAS=1 make -j4. I think this means the batch_size 8 run is not using BLAS, while the larger one is, and that results in a huge perplexity decrease! I will have to try switching to the BLAS version even for smaller sizes and see how it goes.

$ ./perplexity -m models/7B/ggml-model-q4_0.bin -f wiki.test.raw -t 16
perplexity : calculating perplexity over 655 chunks, batch_size=8
18.83 seconds per pass - ETA 3.43 hours 
[655]6.6016,

$ ./perplexity -m models/7B/ggml-model-q4_0.bin -f wiki.test.raw -t 16 -b 512  
perplexity : calculating perplexity over 655 chunks, batch_size=512
12.96 seconds per pass - ETA 2.36 hours   
[655]6.2838

@glinscott (Collaborator, Author)

Indeed, hardcoding ggml_compute_forward_mul_mat_use_blas to return true results in excellent perplexity, but it's incredibly slow:

perplexity : calculating perplexity over 655 chunks, batch_size=16
847.81 seconds per pass - ETA 154.26 hours
[1]4.3801,
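
(For reference, the "hardcode" experiment amounts to forcing the dispatch helper quoted earlier to always return true, i.e. a temporary hack along these lines in ggml.c; parameters elided, not meant to be merged:)

static bool ggml_compute_forward_mul_mat_use_blas(/* ... */) {
    return true; // force every mul_mat through the dequantize + sgemm path
}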

@ivanstepanovftw (Collaborator)

Hi @glinscott, I have tested "hardcoding ggml_compute_forward_mul_mat_use_blas to return true" using Intel MKL:

 ## Question: What is best in life? ## Jeeves: Biotechnology and Society.2015, v.35(4), p.678-692.
The Human Genome Project (HGP) was
llama_print_timings:        load time =  9084.05 ms
llama_print_timings:      sample time =    21.13 ms /    40 runs   (    0.53 ms per run)
llama_print_timings: prompt eval time = 16373.49 ms /    16 tokens ( 1023.34 ms per token)
llama_print_timings:        eval time = 157398.71 ms /    39 runs   ( 4035.86 ms per run)
llama_print_timings:       total time = 174428.73 ms

And OpenBLAS:

 ## Question: What is best in life? ## Jeeves: Biotechnology and Society.2015, v.35(4), p.678-692.
The Human Genome Project (HGP) was
llama_print_timings:        load time =  8909.49 ms
llama_print_timings:      sample time =    24.11 ms /    40 runs   (    0.60 ms per run)
llama_print_timings: prompt eval time = 16349.56 ms /    16 tokens ( 1021.85 ms per token)
llama_print_timings:        eval time = 288016.53 ms /    39 runs   ( 7385.04 ms per run)
llama_print_timings:       total time = 305045.43 ms

Also, the perplexity is slightly better ([3]5.8269 vs [3]5.8271).

@ggerganov (Owner)

@glinscott

Yes, using BLAS during perplexity computation can be deceiving (I think I noted this somewhere earlier).
I think the explanation is the following:

Let's take the matrix multiplication Z = X*Y, where:

  • X is 4-bit quantized
  • Y is FP32
  • Z is FP32

When using BLAS, ggml will dequantize X into FP32 and use BLAS's sgemm to do the matrix multiplication.
When not using BLAS, ggml will quantize Y to 4-bit and use the SIMD routines for 4-bit dot product.

I think the BLAS computation will be more precise, because we lose precision when quantizing Y in the latter case.
Additionally, I am not super confident that the current dot product routines accumulate the floating-point values optimally - I think there might be things to improve here.
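
To make the precision argument concrete, here is a small self-contained toy (not ggml code; just an illustration of the idea) comparing a full-FP32 dot product with one where the second operand is first quantized to 4-bit levels, as in the non-BLAS path:

#include <algorithm>
#include <cmath>
#include <cstdint>
#include <cstdio>
#include <vector>

int main() {
    const int n = 256;
    std::vector<float> x(n), y(n);
    for (int i = 0; i < n; ++i) {
        x[i] = std::sin(0.1f * i);
        y[i] = std::cos(0.3f * i);
    }

    // reference: FP32 dot product (roughly what the BLAS path computes after dequantizing X)
    double ref = 0.0;
    for (int i = 0; i < n; ++i) ref += (double) x[i] * y[i];

    // non-BLAS path (simplified): quantize y to 4-bit signed levels with a single scale
    float amax = 0.0f;
    for (int i = 0; i < n; ++i) amax = std::max(amax, std::fabs(y[i]));
    const float scale = amax / 7.0f; // 4-bit signed levels in [-7, 7]
    double approx = 0.0;
    for (int i = 0; i < n; ++i) {
        const int8_t q = (int8_t) std::round(y[i] / scale);
        approx += (double) x[i] * (q * scale); // dot product against the quantized y
    }

    std::printf("fp32 dot = %.6f, q4 dot = %.6f, abs err = %.6f\n", ref, approx, std::fabs(ref - approx));
    return 0;
}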

@ggerganov added the "high priority" label on Apr 13, 2023
@ggerganov (Owner)

It would be nice to get this PR finalized.

@glinscott marked this pull request as ready for review on April 13, 2023 at 15:21
@glinscott (Collaborator, Author)

Sorry for the delay - I've updated it so the batch size defaults to 512, which is much faster. Ready to go!

@ggerganov merged commit be87b6e into ggerganov:master on Apr 13, 2023
AAbushady pushed a commit to AAbushady/llama.cpp that referenced this pull request on Jan 27, 2024:

Adds the hipBLAS gpu_target $(shell $(ROCM_PATH)/llvm/bin/amdgpu-arch) back to the gpu_target line, possibly allowing misc GPU archs like gfx1031 or gfx1032 to be built.