
Misc. bug: --cache-reuse no longer seems to be caching prompt prefixes #15082

@ghnp5

Description


This is a re-open of #14113

Name and Version

Affected:
Version at commit: b7a1746

Not affected:
Version at commit: c6a2c9e

Operating systems

Linux

Which llama.cpp modules do you know to be affected?

llama-server

Problem description & steps to reproduce

I originally opened this bug against oobabooga/text-generation-webui: oobabooga/text-generation-webui#7060

The issue is that prompt prefixes were no longer being reused by subsequent requests.

I confirmed that --cache-reuse 1 was being passed, so that wasn't the issue.
After reverting to the previous version of the WebUI (which ships an older version of llama.cpp), prompts started being cached again.

So, this seems to point to being a bug with llama.cpp.

First Bad Commit

It looks like there may have been a commit between c6a2c9e and b7a1746 that broke --cache-reuse, or that changed its behavior.
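
If it helps narrow this down, bisecting between those two commits should identify the offending change. A minimal sketch (assuming a local clone of llama.cpp and the two-request repro described further below):

git bisect start b7a1746 c6a2c9e   # bad commit first, then the known-good one
# at each step: rebuild llama-server, send two requests that share a long
# identical prefix, and check whether the second one reuses the cached prefix
git bisect bad      # prefix was reprocessed from the start
git bisect good     # prefix was reused as expected
git bisect reset    # when finished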


I use the oobabooga/text-generation-webui project, which bundles a snapshot of llama.cpp (the commits mentioned above). From what I saw, this is the command it runs:

llama-server --model user_data/models/gemma-3-12b-it-qat-UD-Q6_K_XL.gguf --ctx-size 32768 --gpu-layers 49 --batch-size 256 --port 60033 --no-webui --threads 10 --threads-batch 10 --rope-freq-scale 0.125 --rope-freq-base 1000000.0 --cache-reuse 1

The model can be downloaded from: https://huggingface.co/unsloth/gemma-3-12b-it-qat-GGUF/tree/main

I confirmed that --cache-reuse 1 is present in the parameters even when the cache isn't working.

I don't know exactly how a Chat Completion call is made to llama-server, but essentially: if I make two chat-instruct calls where the first 1000 characters or so are identical, the prompt is processed from byte 0 on the second call when using the affected commit mentioned above.

I'm not providing the cache_prompt parameter, so it should default to true.
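
For reference, here is a minimal reproduction sketch against llama-server's OpenAI-compatible endpoint. The exact payload the WebUI sends may differ; the prefix text and messages below are just placeholders, and the port matches the command above. With a working --cache-reuse 1, the second request should reuse the cached prefix rather than reprocessing the prompt from byte 0:

# build a shared prefix of roughly 1000 identical characters (placeholder content)
PREFIX=$(printf 'You are a helpful assistant. %.0s' {1..40})

# first request: processes the full prompt and populates the cache
# (cache_prompt is not set, so it stays at its default of true)
curl -s http://127.0.0.1:60033/v1/chat/completions -H 'Content-Type: application/json' -d "{
  \"messages\": [
    {\"role\": \"system\", \"content\": \"$PREFIX\"},
    {\"role\": \"user\", \"content\": \"First question\"}
  ]
}"

# second request: identical prefix, different final user message -- the shared
# prefix should not need to be evaluated again
curl -s http://127.0.0.1:60033/v1/chat/completions -H 'Content-Type: application/json' -d "{
  \"messages\": [
    {\"role\": \"system\", \"content\": \"$PREFIX\"},
    {\"role\": \"user\", \"content\": \"A different question\"}
  ]
}"

Whether the prefix was reused should be visible from how many prompt tokens the server reports processing for the second request.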

I hope this helps!


This is still happening when using the version at https://github.com/ggml-org/llama.cpp/tree/90083283ec254fa8d33897746dea229aee401b37

It appears that the fix from #14163 did not resolve this caching issue.

Thank you.
