LLMCache

A simple, lightweight LLM cache integrated with a command-line LLM chatbot. It uses Redis-backed vector similarity search to store and retrieve responses efficiently, reducing API calls and improving response times.

Overview

LLM Cache system architecture (diagram)

System Components:

  1. Core: Orchestrates the interaction between the user and the cache/LLM (see the sketch after this list)
  2. LLM Cache: Object to store, search and retrieve cached responses
  3. Post Processing: Reranking, filtering and formatting of responses
  4. LLM: Handles communication with language models (currently supports only Hugging Face models)
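
The flow between these components follows a cache-aside pattern: check the cache first, fall back to the LLM on a miss, then store the new pair. The sketch below only illustrates the idea; apart from search_cache and store_query_response (documented further down), the method and field names are assumptions, not the project's actual API.

# Illustrative cache-aside flow; chat() on the model and the "response" field are assumed names
def answer(query, cache, llm_model, top_k=3, similarity_threshold=0.8):
    # 1. Look for semantically similar cached queries
    hits = cache.search_cache(query_text=query,
                              top_k=top_k,
                              similarity_threshold=similarity_threshold)
    if hits:
        # 2a. Cache hit: return the best cached response (reranking/filtering would happen here)
        return hits[0]["response"]
    # 2b. Cache miss: ask the LLM, then cache the new question/response pair
    response = llm_model.chat(query)
    cache.store_query_response(question=query, response=response)
    return response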

Other features/highlights:

  • Vector-based search: Uses embeddings to store and find semantically similar queries (illustrated below)
  • Redis Integration: Leverages Redis for fast vector similarity search
  • Semantic Reranking: Optional Cohere-based reranking for better match quality between the query and cached responses
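
To make the vector-based search concrete: queries are compared by embedding them and measuring vector similarity rather than matching strings. The snippet below uses sentence-transformers with all-MiniLM-L6-v2, which produces 384-dimensional vectors matching the default embedding_dimension; whether LLMCache uses that exact model is an assumption.

# Standalone illustration of semantic similarity between two queries
# (the embedding model here is an assumption, chosen because it outputs 384-dim vectors)
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
a, b = model.encode(["What is Python?", "Tell me about Python"])
similarity = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
print(similarity)  # semantically close queries score near 1, unrelated ones much lower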

Installation

Install dependencies

pip install -r requirements.txt

Or, with conda:

conda env create -f environment.yml

Quickstart

Make sure a Redis server is running (see the Redis docs for details).
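
A quick way to confirm the server is reachable from Python is a ping with redis-py. Note that vector similarity search relies on the RediSearch module, so a Redis Stack (or otherwise RediSearch-enabled) server is likely required; treat that as an assumption about this project's setup.

import redis

r = redis.Redis(host="localhost", port=6379)
print(r.ping())  # True if the server is up and reachable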

from core import Core
from hf import HuggingFaceChat
from llmcache import LLMCache


# Initialize components
llm_model = HuggingFaceChat(model_name="meta-llama/Llama-3.2-3B-Instruct")
cache = LLMCache(enable_rerank=True)
core = Core(llm_model, cache)

# Start the chat
core.chat()

Configuration

LLM Cache Settings

cache = LLMCache(
    redis_conn=None,                 # Optional Redis connection
    embedding_dimension=384,         # Embedding vector size
    ttl_seconds=3600,                # Cache entry lifetime
    eviction_policy="allkeys-lru",   # Redis eviction policy
    max_memory_bytes=1_000_000_000,  # Redis memory limit
    enable_rerank=True               # Enable Cohere reranking
)
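
For reference, eviction_policy and max_memory_bytes correspond to standard Redis server settings, and ttl_seconds to a per-key expiry. A rough sketch of what those settings translate to with redis-py (not the library's actual code, and the key name is hypothetical):

import redis

r = redis.Redis()
# Server-wide memory cap and eviction policy
r.config_set("maxmemory", 1_000_000_000)
r.config_set("maxmemory-policy", "allkeys-lru")
# Per-entry lifetime: each cached key expires after ttl_seconds
r.expire("llmcache:some_entry", 3600)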

Core Settings

core = Core(
    llm_model=llm_model,
    cache=cache,
    top_k=3,  # Number of similar cache entries to consider
    similarity_threshold=0.8  # Minimum similarity score for cache hits
)

LLM Interface Settings

llm_model = HuggingFaceChat(
    model_name="meta-llama/Llama-3.2-3B-Instruct"
)
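
Llama models on the Hugging Face Hub are gated, so if HuggingFaceChat pulls the model from the Hub you will likely need to accept the model license and authenticate first; the exact mechanism the class uses is an assumption.

from huggingface_hub import login

login(token="hf_...")  # or set the HF_TOKEN environment variable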

Store a response

cache_key = cache.store_query_response(question="What is Python?", response="Python is...")

Search cache

results = cache.search_cache(
    query_text="Tell me about Python",
    top_k=3,
    similarity_threshold=0.8
)
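
The structure of results is not shown here; assuming each entry carries the cached question, response, and a similarity score (the field names below are guesses for illustration only), a caller might use it like this:

# Field names are assumptions made for illustration, not the documented return format
for hit in results:
    print(hit.get("question"), hit.get("similarity"))
best_response = results[0]["response"] if results else None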

Clone the repository

git clone https://github.com/greninja/llm_cache.git
cd llm_cache

Contributing

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

License

This project is licensed under the MIT License © 2025 Shadab Shaikh.

Acknowledgments

Much of my implementation was inspired by Zilliz's GPTCache. My motivation for building LLMCache stemmed from a deep curiosity to explore and understand the various system components that power fast LLM inference, with the goal of meaningfully reducing latency and costs.

Disclaimer

This project is a work-in-progress and is intended for educational and research purposes only. While it may serve as a helpful starting point or reference, it is not optimized or tested for production-grade use.

There are more robust and battle-tested LLM caching systems available (Zilliz's GPTCache for instance).
