
demo : per-layer KV / partial offloading of KV cache #3457

Closed
slaren wants to merge 3 commits into master from per-layer-kv

Conversation

slaren
Collaborator

slaren commented Oct 3, 2023

Currently, the entire KV cache is allocated as a single tensor for all the layers. As a consequence, the KV cache is either fully on the CPU, or fully offloaded to the GPU.

With this change, the KV cache is allocated as a separate tensor per layer. The result is more granular control over which parts of the KV cache are offloaded to the GPU.

In this demo, when partially offloading a model, the KV cache corresponding to the offloaded layers is also offloaded. This increases performance at the expense of more VRAM.

Is it worth it compared to just offloading more layers? I am not sure, but it probably wouldn't hurt to have more flexibility.

Note: only implemented for llama models. CUDA only.

Edit: removed a few unnecessary copies that caused performance to degrade.
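
To make the idea concrete, here is a minimal C++ sketch of the allocation scheme described above, not the actual diff: one K and one V tensor is created per layer, and the KV tensors of the layers that are offloaded to the GPU are offloaded as well. The helper `offload_kv_layer` is hypothetical and stands in for the ggml-cuda offloading internals.

```cpp
// Hedged sketch of per-layer KV allocation; assumes an existing ggml context.
#include <cstdint>
#include <vector>
#include "ggml.h"

struct kv_layer {
    struct ggml_tensor * k;
    struct ggml_tensor * v;
};

// Hypothetical stand-in for assigning a layer's KV tensors to the GPU backend;
// the real PR uses llama.cpp / ggml-cuda internals for this.
static void offload_kv_layer(kv_layer & layer) {
    (void) layer; // no-op in this sketch
}

static std::vector<kv_layer> init_kv_cache(struct ggml_context * ctx,
                                           int n_layer, int n_gpu_layers,
                                           int64_t n_embd, int64_t n_ctx) {
    std::vector<kv_layer> cache(n_layer);
    for (int il = 0; il < n_layer; ++il) {
        // one K and one V tensor per layer instead of a single big KV tensor
        cache[il].k = ggml_new_tensor_1d(ctx, GGML_TYPE_F16, n_embd*n_ctx);
        cache[il].v = ggml_new_tensor_1d(ctx, GGML_TYPE_F16, n_embd*n_ctx);
        // offload the KV of the layers that are themselves offloaded
        if (il >= n_layer - n_gpu_layers) {
            offload_kv_layer(cache[il]);
        }
    }
    return cache;
}
```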


Llama2 70B on a single 24 GB GPU:
Device 0: NVIDIA GeForce RTX 3090 Ti, compute capability 8.6

| model | size | params | ngl | test | master t/s | PR t/s | speedup |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 70B mostly Q2_K | 27.27 GiB | 68.98 B | 60 | pp 512 | 83.10 ± 0.26 | 115.29 ± 0.33 | 1.39 |
| 70B mostly Q2_K | 27.27 GiB | 68.98 B | 60 | tg 128 | 4.21 ± 0.05 | 4.78 ± 0.05 | 1.14 |
| 70B mostly Q2_K | 27.27 GiB | 68.98 B | 61 | pp 512 | 84.29 ± 0.49 | 118.47 ± 0.19 | 1.41 |
| 70B mostly Q2_K | 27.27 GiB | 68.98 B | 61 | tg 128 | 4.35 ± 0.03 | 5.01 ± 0.04 | 1.15 |
| 70B mostly Q2_K | 27.27 GiB | 68.98 B | 62 | pp 512 | 85.28 ± 0.23 | 121.99 ± 0.38 | 1.43 |
| 70B mostly Q2_K | 27.27 GiB | 68.98 B | 62 | tg 128 | 4.47 ± 0.05 | 5.67 ± 0.15 | 1.27 |
| 70B mostly Q2_K | 27.27 GiB | 68.98 B | 63 | pp 512 | 86.71 ± 0.23 | 125.13 ± 0.21 | 1.44 |
| 70B mostly Q2_K | 27.27 GiB | 68.98 B | 63 | tg 128 | 4.61 ± 0.03 | 6.14 ± 0.01 | 1.33 |
| 70B mostly Q2_K | 27.27 GiB | 68.98 B | 64 | pp 512 | 87.99 ± 0.30 | - | 1.42 (63) |
| 70B mostly Q2_K | 27.27 GiB | 68.98 B | 64 | tg 128 | 4.74 ± 0.04 | - | 1.30 (63) |
| 70B mostly Q2_K | 27.27 GiB | 68.98 B | 65 | pp 512 | 89.19 ± 0.23 | - | 1.40 (63) |
| 70B mostly Q2_K | 27.27 GiB | 68.98 B | 65 | tg 128 | 5.00 ± 0.05 | - | 1.23 (63) |
v1 (earlier benchmark charts)

@ggerganov
Owner

Regardless of the performance effects, this is a good change since it makes the KV cache addressing more intuitive

slaren changed the title from "demo: per-layer KV" to "demo : per-layer KV / partial offloading of KV cache" on Oct 4, 2023
@Dampfinchen

Definitely worth it to offload fewer layers but get higher prompt processing speed out of it.

slaren added the "demo" label (Demonstrate some concept or idea, not intended to be merged) on Oct 11, 2023
@oobabooga
Contributor

Could this PR, when combined with the performance gains in #3776, allow 70b models in q4_K_M / q4_K_S precision to run on a 3090 at more than 1-2 tokens/second?

BarfingLemurs mentioned this pull request on Nov 30, 2023
@ggerganov
Owner

I will try to update this PR to latest master and merge

ggerganov self-assigned this on Dec 3, 2023
@slaren
Collaborator Author

slaren commented Dec 3, 2023

Ok, some notes:

  • The reason I didn't continue working on this is the two copies of KQ_mask and KQ_pos needed for the CPU and the GPU (see the sketch after this list). I was hoping to handle this automatically with ggml-backend, but after your graph building refactoring it may be possible to do it cleanly now.
  • The KV cache can get quite big, so it should still be possible to choose not to offload it in low-VRAM situations.
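
As a rough illustration of the first point (illustrative names, not the PR's code): the mask/position inputs would exist once per backend, and each layer's graph would pick the copy that matches where that layer's KV cache lives.

```cpp
// Illustrative sketch only: per-backend copies of KQ_mask / KQ_pos,
// selected per layer depending on where that layer's KV cache lives.
#include "ggml.h"

struct kq_inputs {
    struct ggml_tensor * KQ_mask_cpu;
    struct ggml_tensor * KQ_pos_cpu;
    struct ggml_tensor * KQ_mask_gpu;
    struct ggml_tensor * KQ_pos_gpu;
};

static struct ggml_tensor * kq_mask_for_layer(const kq_inputs & in, bool layer_offloaded) {
    return layer_offloaded ? in.KQ_mask_gpu : in.KQ_mask_cpu;
}

static struct ggml_tensor * kq_pos_for_layer(const kq_inputs & in, bool layer_offloaded) {
    return layer_offloaded ? in.KQ_pos_gpu : in.KQ_pos_cpu;
}
```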

ggerganov mentioned this pull request on Dec 3, 2023
@ggerganov
Owner

Will leave this PR intact for reference. Opened a new PR: #4309

@oobabooga and anyone else who is interested: it would be nice to run some tests with #4309 to make sure it works as expected

slaren closed this on Dec 7, 2023
slaren deleted the per-layer-kv branch on Dec 7, 2023