pipeline parallelism demo #4918
With 3x P40 the improvement on my system is negligible. There is some improvement for large batch sizes but the best speed at 512 is barely affected. Also token generation seems to become slower. For comparison, these are the results I get with
The copy between GPUs is still synchronous, and this limits the parallelism to two GPUs (the last split of the previous micro-batch runs simultaneously with the first split of the next one, but other splits are synchronized). The micro batch size also has a large impact on the performance as expected, 256 works well for me, but the P40 may need larger batches. |
Should be fixed now. This also improved performance for me with two GPUs.
Outstanding! How do you explain that microbatch size of 256 is better than 512 when for these GPUs individually a batch size of 512 is optimal? What else is required for this to become mergeable, apart from the |
In my system, the difference between batch size 256 and 512 is very small (master): Device 0: NVIDIA GeForce RTX 3090 Ti, compute capability 8.6, VMM: yes
build: 4be5ef5 (1861)
We need to figure what to do with the |
With the latest commit (af789e7) and n_microbatch = 512:
You get higher GPU utilization with microbatching because the second GPU can start after the first GPU has processed 256 tokens instead of having to wait for the full batch size.
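The effect described above can be estimated with a toy model. The sketch below is only an illustration under strong assumptions: the layers are split evenly across GPUs, every token costs the same regardless of batch size, and copies between GPUs are free. The function name `pipeline_time` is made up for this example and is not part of llama.cpp.

```cpp
#include <cassert>

// Idealized pipeline model (assumption: not how llama.cpp measures anything).
// Time unit: one GPU running the *whole* model on one token.
// Each GPU holds 1/n_gpus of the layers, so one stage needs
// n_microbatch / n_gpus time units per micro-batch.
double pipeline_time(long n_tokens, long n_microbatch, int n_gpus) {
    assert(n_tokens > 0 && n_microbatch > 0 && n_gpus > 0);
    assert(n_tokens % n_microbatch == 0); // keep the model simple: equal micro-batches

    const long   n_mb       = n_tokens / n_microbatch;          // number of micro-batches
    const double stage_time = (double) n_microbatch / n_gpus;   // one stage, one micro-batch

    // Classic pipeline fill/drain: the last micro-batch leaves the last stage
    // after (n_mb + n_gpus - 1) stage steps.
    return (n_mb + n_gpus - 1) * stage_time;
}
```

Under these assumptions, 512 tokens on two GPUs take 512 units as a single batch (no overlap at all) and 384 units with micro-batches of 256. The model ignores that smaller batches are less efficient per token, which is exactly the trade-off discussed in this thread.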
To make microbatching actually perform well I think it will be necessary to write better dequantization kernels. Presumably the reason slaren used FP16 is because in that case the weight matrices do not need to be dequantized so the performance for small batches is comparatively good. The kernel I wrote in #4895 could presumably be adapted for other formats but templating it will be difficult. With MMQ the weight matrices do not need to be dequantized but then the baseline performance for Volta or newer is lower so the utility is questionable. If #4801 works out it would also help a lot since dequantizing to int8 needs only half as much memory bandwidth as dequantizing to |
@slaren In your results, why does the performance drop for pp 4096 compared to pp 2048? My expectation would be that with this parallelism the performance should flatten out at some pp and not drop. Edit: Ah, it's because the attention compute grows since it is a single sequence. Got it. |
The performance is best with F16, but there is still a good speedup with Q4_0.
Performance always drops when the context is larger, but the speedup relative to master is higher with pp 4096 (it's almost 2x with F16). |
Just for fun, here are results on 8x RTX 4090 with ggml_init_cublas: GGML_CUDA_FORCE_MMQ: no
build: af789e7 (1861) 13B and 34B data
build: af789e7 (1861)
build: af789e7 (1861) And some more data points on 8x A100: ggml_init_cublas: GGML_CUDA_FORCE_MMQ: no
build: af789e7 (1861) 13B data
build: af789e7 (1861) |
Very large batches already have this behavior on master. The main problem is that the compute needed for soft max scales with batch size. This could be mitigated by writing a softmax kernel specifically for a diagonal infinite mask. You could potentially also save compute by not computing those elements that are later going to be masked anyways. |
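As a rough CPU-side illustration of the idea of not computing masked elements, the sketch below applies softmax to one row of KQ scores and only touches the columns that a causal mask leaves visible. It is not the CUDA kernel from this repository; the function name and the `n_past`/`i` parameters are assumptions chosen for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Softmax over one row of attention scores under a causal mask.
// Token i of the current batch (with n_past tokens already in the cache)
// may attend to columns [0, n_past + i]; everything after that would be
// masked with -inf, so it is skipped and simply set to 0.
void causal_softmax_row(std::vector<float> & row, int n_past, int i) {
    const int n_valid = n_past + i + 1;
    assert(n_valid >= 1 && n_valid <= (int) row.size());

    // Max over the visible part only (for numerical stability).
    float max_val = row[0];
    for (int j = 1; j < n_valid; ++j) {
        max_val = std::max(max_val, row[j]);
    }

    float sum = 0.0f;
    for (int j = 0; j < n_valid; ++j) {
        row[j] = std::exp(row[j] - max_val);
        sum += row[j];
    }
    for (int j = 0; j < n_valid; ++j) {
        row[j] /= sum;
    }

    // Masked columns: no exp() is evaluated for them.
    std::fill(row.begin() + n_valid, row.end(), 0.0f);
}
```

For the first token of a batch with an empty cache only one element is processed instead of the full row, so on average roughly half of the exponentials are saved for a long single-sequence prompt.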
We also lack flash attention which results in 2 extra writes and reads of the KQ data to global memory |
Awesome, 20000 tokens/sec .. that's quite a change to where we've been a month ago ;) I think it was 280/sec in that configuration. |
This is interesting in my use case. |
get_rows doesn't seem to be working right now. |
It is expected, the CUDA get_rows implementation does not support k-quants. |
Short update about this: I realized that there is a possible data race when copying data between backends, and it is not enough to create multiple copies of the CPU compute buffer. A possible solution would be to create multiple copies of the GPU compute buffers (that's what the current version does), but the cost in VRAM is too high to do this. The only tensors that really need to be duplicated are the tensors copied between backends at the start of each split, and this requires a lot less memory than duplicating the entire compute buffer, so that's what I am working on. This will require implementing all of the logic to handle this in |
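The idea of duplicating only the tensors copied between backends can be pictured as a small ring of input buffers per split. The sketch below is purely hypothetical: `SplitInputs` and `for_microbatch` are invented names, plain `std::vector` stands in for backend buffers, and it does not reflect the actual ggml scheduler code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical illustration: instead of duplicating the whole compute buffer,
// keep n_copies of just the input tensor that is copied in at the start of a split.
struct SplitInputs {
    std::vector<std::vector<float>> copies;

    SplitInputs(int n_copies, size_t n_elements)
        : copies(n_copies, std::vector<float>(n_elements)) {}

    // Micro-batch k uses copy k % n_copies. As long as at most n_copies
    // micro-batches are in flight, the copy for micro-batch k+1 never
    // overwrites data that micro-batch k is still reading.
    std::vector<float> & for_microbatch(int k) {
        assert(k >= 0 && !copies.empty());
        return copies[k % copies.size()];
    }
};
```

With two copies, consecutive micro-batches get distinct buffers and the buffer of micro-batch 0 is reused by micro-batch 2, which is the usual double-buffering pattern.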
Sounds very promising, I'm happy my little suggestion goes so big. Very nice work so far |
Technically it is possible, there is no reason to treat the CPU backend differently from any other backend. But realistically, the CPU is so much slower than the GPU that I wouldn't expect any meaningful improvement in performance, and as it is now, most matrix multiplications are always done on the GPU during prompt processing anyway. |
It is true that the GPU is used for matrix multiplications with batch sizes >= 32 anyways. But for those matrix multiplications most of the runtime goes towards CPU<->GPU data transfers which can be executed in parallel with GPU computations. So it should still help quite a lot. |
Hi @slaren, thanks a lot for your effort on the CUDA backend. Here are the results on my infra (2x A100 80GB). What do you think? build with: Models: If it can help, I see a lot of Note: @ggerganov I feel like the model name for mixtral8x7b is misleading, maybe we should include the MoE config in the model name. I will have a look. |
The results look reasonable. Mixtral does not work with pipeline parallelism due to the way the This branch is very outdated and the final implementation will be very different, and at this point there is no need to run more tests on this branch. I'll close this PR to avoid confusion. |
There isn't much synchronization required, just splitting the prompt into multiple micro-batches and queueing them in the CUDA streams is enough.
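A minimal sketch of the splitting step, assuming the prompt is just a range of token positions: the helper below returns the (offset, length) of each micro-batch. The name `split_microbatches` is made up for illustration; the queueing into CUDA streams is not shown.

```cpp
#include <algorithm>
#include <cassert>
#include <utility>
#include <vector>

// Split n_tokens into consecutive micro-batches of at most n_microbatch tokens.
// Returns (offset, length) pairs; the last micro-batch may be shorter.
std::vector<std::pair<int, int>> split_microbatches(int n_tokens, int n_microbatch) {
    assert(n_tokens >= 0 && n_microbatch > 0);

    std::vector<std::pair<int, int>> out;
    for (int off = 0; off < n_tokens; off += n_microbatch) {
        out.emplace_back(off, std::min(n_microbatch, n_tokens - off));
    }
    return out;
}
```

For example, a 600-token prompt with a micro-batch size of 256 yields three micro-batches of 256, 256 and 88 tokens, each of which could then be submitted in order.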
The micro-batch size is not configurable at the moment, it needs to be changed in n_microbatch in llama.cpp.
Incidentally, this also adds the ability to split batches into multiple micro-batches, so it is possible to call llama_decode with a batch larger than n_batch. I think the best way to implement this would be to use n_batch as the microbatch size, and modify the applications to ignore n_batch and submit the entire prompt or batch in a single call to llama_decode.
Offloading tok_embd improves performance significantly in this case.
3090Ti+3080, n_microbatch=256, tok_embd on GPU:
Master
MASTER, SINGLE GPU:
Device 0: NVIDIA GeForce RTX 3090 Ti, compute capability 8.6, VMM: yes
MASTER, TWO GPU:
Device 0: NVIDIA GeForce RTX 3090 Ti, compute capability 8.6, VMM: yes
Device 1: NVIDIA GeForce RTX 3080, compute capability 8.6, VMM: yes