
Misc. bug: --cache-reuse no longer seems to be caching prompt prefixes #15082

@ghnp5

Description


This is a re-open of #14113

Name and Version

Affected:
Version at commit: b7a1746

Not affected:
Version at commit: c6a2c9e

Operating systems

Linux

Which llama.cpp modules do you know to be affected?

llama-server

Problem description & steps to reproduce

I originally opened this bug against oobabooga/text-generation-webui: oobabooga/text-generation-webui#7060

The issue is that prompt prefixes were no longer being reused by subsequent requests.

I confirmed that --cache-reuse 1 was being passed, so that wasn't the issue.
After reverting to the previous version of the WebUI (which ships an older version of llama.cpp), prompts started being cached again.

So, this seems to point to being a bug with llama.cpp.

First Bad Commit

It looks like there may have been a commit between c6a2c9e and b7a1746 that broke --cache-reuse, or that changed its behavior.
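
If it helps narrow this down, bisecting between those two commits should identify the offending change. A minimal sketch (assuming a local clone of llama.cpp and the two-request repro described further below):

git bisect start b7a1746 c6a2c9e   # bad commit first, then the known-good one
# at each step: rebuild llama-server, send two requests that share a long
# identical prefix, and check whether the second one reuses the cached prefix
git bisect bad      # prefix was reprocessed from the start
git bisect good     # prefix was reused as expected
git bisect reset    # when finished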


I use the oobabooga/text-generation-webui project, which bundles a snapshot of llama.cpp (the commits mentioned above). From what I saw, this is the command it runs:

llama-server --model user_data/models/gemma-3-12b-it-qat-UD-Q6_K_XL.gguf --ctx-size 32768 --gpu-layers 49 --batch-size 256 --port 60033 --no-webui --threads 10 --threads-batch 10 --rope-freq-scale 0.125 --rope-freq-base 1000000.0 --cache-reuse 1

The model can be downloaded from: https://huggingface.co/unsloth/gemma-3-12b-it-qat-GGUF/tree/main

I confirmed that --cache-reuse 1 is present in the parameters even when the cache isn't working.

I don't know exactly how a Chat Completion call is made to llama-server, but essentially: if I make two chat-instruct calls where the first 1000 characters or so are identical, the prompt is processed from byte 0 on the second call when using the affected commit mentioned above.

I'm not providing the cache_prompt parameter, so it should default to true.
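
For reference, here is a minimal reproduction sketch against llama-server's OpenAI-compatible endpoint. The exact payload the WebUI sends may differ; the prefix text and messages below are just placeholders, and the port matches the command above. With a working --cache-reuse 1, the second request should reuse the cached prefix rather than reprocessing the prompt from byte 0:

# build a shared prefix of roughly 1000 identical characters (placeholder content)
PREFIX=$(printf 'You are a helpful assistant. %.0s' {1..40})

# first request: processes the full prompt and populates the cache
# (cache_prompt is not set, so it stays at its default of true)
curl -s http://127.0.0.1:60033/v1/chat/completions -H 'Content-Type: application/json' -d "{
  \"messages\": [
    {\"role\": \"system\", \"content\": \"$PREFIX\"},
    {\"role\": \"user\", \"content\": \"First question\"}
  ]
}"

# second request: identical prefix, different final user message -- the shared
# prefix should not need to be evaluated again
curl -s http://127.0.0.1:60033/v1/chat/completions -H 'Content-Type: application/json' -d "{
  \"messages\": [
    {\"role\": \"system\", \"content\": \"$PREFIX\"},
    {\"role\": \"user\", \"content\": \"A different question\"}
  ]
}"

Whether the prefix was reused should be visible from how many prompt tokens the server reports processing for the second request.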

I hope this helps!


This is still happening when using the version at https://github.com/ggml-org/llama.cpp/tree/90083283ec254fa8d33897746dea229aee401b37

It appears that the fix from #14163 did not resolve this caching issue.

Thank you.
