server : host-memory prompt caching #16391

ggerganov · 2025-10-02T19:48:56Z

Initial version of automatic offloading of prompt state to host memory, with extended logic for minimizing prompt reprocessing. The host-memory prompt cache acts as a set of "extra slots": we calculate the prefix similarity between the incoming prompt and each cached prompt, and hot-swap a cached prompt into the llama_context if doing so would reduce the amount of processing.
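To make the "extra slots" idea concrete, here is a minimal sketch of the selection step. It is not the code from this PR: `token`, `cached_prompt`, `common_prefix_len`, `n_to_process` and `pick_cached_prompt` are hypothetical names, and the sketch assumes that similarity is measured as plain common-prefix length, which may differ from the measure the server actually uses.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using token = std::int32_t; // stand-in for the server's token type (assumption)

// Hypothetical cache entry. A real entry would also hold the saved context
// state for these tokens; only the token list matters for selection.
struct cached_prompt {
    std::vector<token> tokens;
};

// Number of leading tokens shared by two token sequences.
static std::size_t common_prefix_len(const std::vector<token> & a,
                                     const std::vector<token> & b) {
    std::size_t n = 0;
    while (n < a.size() && n < b.size() && a[n] == b[n]) {
        ++n;
    }
    return n;
}

// Tokens that still need processing when the first n_reused tokens are reused.
static std::size_t n_to_process(const std::vector<token> & request,
                                std::size_t n_reused) {
    assert(n_reused <= request.size());
    return request.size() - n_reused;
}

// Returns the index of the cached prompt worth swapping in, or -1 when the
// slot's current contents already share at least as long a prefix with the
// request as every cached prompt does.
static int pick_cached_prompt(const std::vector<token> & slot_tokens,
                              const std::vector<cached_prompt> & cache,
                              const std::vector<token> & request) {
    std::size_t best_len = common_prefix_len(slot_tokens, request);
    int best = -1;
    for (std::size_t i = 0; i < cache.size(); ++i) {
        const std::size_t len = common_prefix_len(cache[i].tokens, request);
        if (len > best_len) { // swap only if it strictly reduces processing
            best_len = len;
            best = static_cast<int>(i);
        }
    }
    return best;
}
```

With a slot holding `{1, 7}`, cached prompts `{1, 2, 3, 4}` and `{1, 2, 9}`, and a request `{1, 2, 3, 4, 5}`, the sketch picks the first cached prompt, so one token is left to process instead of four. The strict comparison means a swap happens only when it actually saves work, which avoids needless copies between host memory and the context.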

Still WIP, but it should probably be usable already.

Note: the mtmd workarounds are starting to cause some headaches. For example, server_tokens is not copyable, which complicates the cache logic and makes the prompt caching feature incompatible with mtmd.

TODOs

- Set memory limit for the host-memory cache from CLI
- Clean up the implementation
- Test with agentic workflows
- Multi-slot tests

github-actions bot added the examples and server labels on Oct 2, 2025

ggerganov force-pushed the gg/prompt-cache-ext branch from 4127199 to 0787f03 on October 2, 2025 19:54

ggerganov added 3 commits October 3, 2025 21:39

minor : code style

8d616c8

server : fix prompt similarity calculation

f90e6f1

server : initial host-memory prompt caching

5c0cec4

ggerganov force-pushed the gg/prompt-cache-ext branch from 0787f03 to 5c0cec4 on October 3, 2025 18:49
