llama : pad KV cache size to 32
ggerganov committed Dec 1, 2023
1 parent 5a7d312 commit 3e68df8
Showing 1 changed file with 1 addition and 2 deletions.
3 changes: 1 addition & 2 deletions llama.cpp
@@ -5744,8 +5744,7 @@ static int llama_decode_internal(
         // a heuristic, to avoid attending the full cache if it is not yet utilized
         // after enough generations, the benefit from this heuristic disappears
         // if we start defragmenting the cache, the benefit from this will be more important
-        //kv_self.n = std::max(32, GGML_PAD(llama_kv_cache_cell_max(kv_self), 32)); // TODO: this might be better for CUDA?
-        kv_self.n = std::min((int32_t) cparams.n_ctx, std::max(32, llama_kv_cache_cell_max(kv_self)));
+        kv_self.n = std::min((int32_t) cparams.n_ctx, std::max(32, GGML_PAD(llama_kv_cache_cell_max(kv_self), 32)));
 
         //printf("kv_self.n = %5d, kv_self.used = %5d, kv_self.head = %5d\n", kv_self.n, kv_self.used, kv_self.head);
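
To make the effect of the new expression concrete, here is a minimal standalone sketch of the arithmetic. It is not code from the commit: `PAD_TO` is a local stand-in written to behave like `GGML_PAD` for power-of-two alignments, and `padded_kv_n` is a hypothetical helper whose parameters take the place of `llama_kv_cache_cell_max(kv_self)` and `cparams.n_ctx`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Local stand-in for GGML_PAD (an assumption, not copied from the commit):
// round x up to the next multiple of n. The bit mask only works when n is
// a power of two, which holds for the value 32 used here.
#define PAD_TO(x, n) (((x) + (n) - 1) & ~((n) - 1))

// Hypothetical helper mirroring the added line in the diff:
//   cell_max stands in for llama_kv_cache_cell_max(kv_self)
//   n_ctx    stands in for (int32_t) cparams.n_ctx
constexpr int32_t padded_kv_n(int32_t cell_max, int32_t n_ctx) {
    // 1. pad the highest used cell index up to a multiple of 32
    // 2. never go below 32
    // 3. never exceed the context size
    return std::min<int32_t>(n_ctx, std::max<int32_t>(32, PAD_TO(cell_max, 32)));
}

// Compile-time checks of a few representative inputs.
static_assert(padded_kv_n(0,   512) == 32,  "empty cache still attends 32 cells");
static_assert(padded_kv_n(32,  512) == 32,  "exact multiples are left unchanged");
static_assert(padded_kv_n(33,  512) == 64,  "33 is rounded up to 64");
static_assert(padded_kv_n(100, 512) == 128, "100 is rounded up to 128");
static_assert(padded_kv_n(500, 500) == 500, "padding to 512 is clamped to n_ctx");
```

The last case shows why the `std::min` against the context size is kept alongside the padding: when the context size is not a multiple of 32, rounding up alone could yield a value larger than the cache.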
