@ggerganov ggerganov commented Oct 2, 2025

rel #16117

Initial version of automatic offloading of prompt caches to host memory, with extended logic for minimizing prompt reprocessing. The host-memory prompt cache acts as a set of "extra slots": we calculate the prefix similarity between the incoming prompt and each cached prompt, and hot-swap a cached prompt into the llama_context if doing so would reduce the amount of processing.
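To illustrate the selection idea described above, here is a minimal sketch, not the code from this PR. It assumes similarity is simply the length of the common token prefix, and that a cached prompt is worth swapping in only when it shares a strictly longer prefix with the request than the tokens already in the slot. The names `token`, `common_prefix_len`, and `pick_cached_prompt` are hypothetical and exist only for this example; the actual metric and data structures in the server may differ.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the token type used by the server.
using token = int;

// Number of leading tokens that two prompts have in common.
static size_t common_prefix_len(const std::vector<token> & a,
                                const std::vector<token> & b) {
    const size_t n = std::min(a.size(), b.size());
    size_t i = 0;
    while (i < n && a[i] == b[i]) {
        ++i;
    }
    return i;
}

// Returns the index of the host-cached prompt that should be hot-swapped
// into the slot, or -1 if the slot's current tokens are already the best
// match (assumption: longest common prefix wins, ties keep the slot as-is).
static int pick_cached_prompt(const std::vector<token> & request,
                              const std::vector<token> & slot_tokens,
                              const std::vector<std::vector<token>> & host_cache) {
    size_t best_len = common_prefix_len(request, slot_tokens);
    int    best_idx = -1;

    for (size_t i = 0; i < host_cache.size(); ++i) {
        const size_t len = common_prefix_len(request, host_cache[i]);
        if (len > best_len) {
            best_len = len;
            best_idx = static_cast<int>(i);
        }
    }
    return best_idx;
}
```

In this sketch, a return value of -1 means the request continues with the slot as it is, and any other value means the corresponding cached state would be restored first so that only the tokens after the shared prefix need to be processed.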

Still WIP, but it should probably be usable already.

Note: the mtmd workarounds are starting to cause some headaches. For example, server_tokens is not copyable, which complicates the cache logic and makes the prompt caching feature incompatible with mtmd.

TODOs

  • Set memory limit for the host-memory cache from CLI
  • Clean up the implementation
  • Test with agentic workflows
  • Multi-slot tests

@ggerganov ggerganov force-pushed the gg/prompt-cache-ext branch from 0787f03 to 5c0cec4 Compare October 3, 2025 18:49