
llama : per-layer KV cache #4309

Merged: 15 commits merged into master from gg/per-layer-kv on Dec 7, 2023

Commits on Oct 3, 2023

  1. per-layer KV (e9bcf66)

    slaren committed Oct 3, 2023

  2. remove unnecessary copies (55f2f2f)

    slaren committed Oct 3, 2023

Commits on Oct 6, 2023

  1. Commit f4f9367

Commits on Dec 3, 2023

  1. Commit c294c78
  2. Commit 986b3da
  3. Commit f3dbfb9
  4. Commit 3d3e6bd
  5. Commit 1fa91a4
  6. Commit c44bc1e
  7. Commit c80b8a2
  8. Commit e262947
  9. Commit 66aaac9

Commits on Dec 6, 2023

  1. llama : support quantum K cache (#4312) (1a1a1c3)

    * llama : support quantum K cache (wip)
    
    * metal : add F32 -> Q8_0 copy kernel
    
    * cuda : add F32 -> Q8_0 copy kernel
    
    ggml-ci
    
    * cuda : use mmv kernel for quantum cache ops
    
    * llama : pass KV cache type through API
    
    * llama : fix build
    
    ggml-ci
    
    * metal : add F32 -> Q4_0 copy kernel
    
    * metal : add F32 -> Q4_1 copy kernel
    
    * cuda : wip
    
    * cuda : add F32 -> Q4_0 and F32 -> Q4_1 copy kernels
    
    * llama-bench : support type_k/type_v
    
    * metal : use mm kernel only for quantum KV cache
    
    * cuda : add comment
    
    * llama : remove memory_f16 and kv_f16 flags
    
    ---------
    
    Co-authored-by: slaren <slarengh@gmail.com>
    ggerganov and slaren authored Dec 6, 2023
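
    The commit above wires the cache type through the public context parameters ("llama : pass KV cache type through API"). As a rough illustration only, the minimal C sketch below shows how a caller might request a quantized K cache: the type_k/type_v field names are taken from the commit message, the model path is a placeholder, and exact names and defaults should be verified against llama.h at this revision.

        #include "llama.h"

        int main(void) {
            struct llama_model_params   mparams = llama_model_default_params();
            struct llama_context_params cparams = llama_context_default_params();

            // Ask for a Q8_0 K cache instead of the default F16; keep V as F16.
            // (Field names assumed from "llama-bench : support type_k/type_v".)
            cparams.type_k = GGML_TYPE_Q8_0;
            cparams.type_v = GGML_TYPE_F16;

            struct llama_model *model =
                llama_load_model_from_file("model.gguf", mparams); // placeholder path
            if (model == NULL) {
                return 1;
            }

            struct llama_context *ctx = llama_new_context_with_model(model, cparams);

            /* ... evaluate tokens as usual; K values are quantized on write,
               which is what the F32 -> Q8_0 / Q4_0 / Q4_1 copy kernels listed
               above enable on the Metal and CUDA backends ... */

            llama_free(ctx);
            llama_free_model(model);
            return 0;
        }

    The same pair of types is what the new llama-bench type_k/type_v support exercises, so different cache types can be compared without code changes.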

Commits on Dec 7, 2023

  1. Merge branch 'master' into gg/per-layer-kv (680a99e)

    ggml-ci

    ggerganov committed Dec 7, 2023

  2. Commit fc5f334