A simple LLM cache integrated with a command-line LLM chatbot. It uses vector similarity search to store and retrieve responses efficiently, reducing API calls and improving response times.
- Core: Orchestrates the interaction between the user and Cache/LLM
- LLM Cache: Object to store, search and retrieve cached responses
- Post Processing: Reranking, filtering and formatting of responses
- LLM: Handles communication with language models (currently supports only Hugging Face models)
- Vector-based search: Uses embeddings to store and find semantically similar queries (see the sketch after this list)
- Redis Integration: Leverages Redis for fast vector similarity search
- Semantic Reranking: Optional Cohere-based reranking for better match quality between query and cached responses
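To make the vector-based search concrete, here is a minimal sketch of the idea: embed the incoming query, compare it against cached embeddings, and treat anything above a similarity threshold as a hit. It assumes a sentence-transformers model for the 384-dimensional embeddings and plain cosine similarity; it is not the project's actual code.

```python
# Minimal sketch of the semantic-lookup idea (not the project's code).
# Assumes sentence-transformers for 384-dim embeddings and cosine similarity.
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # produces 384-dim vectors

def lookup(query: str, cached: dict, threshold: float = 0.8):
    """Return the best-matching cached key if its similarity clears the
    threshold, otherwise None (a miss that should fall through to the LLM)."""
    q = embedder.encode(query)
    best_key, best_score = None, -1.0
    for key, vec in cached.items():
        score = float(np.dot(q, vec) / (np.linalg.norm(q) * np.linalg.norm(vec)))
        if score > best_score:
            best_key, best_score = key, score
    return best_key if best_score >= threshold else None
```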
Install dependencies:

```bash
pip install -r requirements.txt
```

Or, with conda:

```bash
conda env create -f environment.yml
```
Make sure a Redis server is running (see the Redis docs for details).
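A quick way to verify the connection from Python, using redis-py and assuming the default localhost:6379 (adjust if your setup differs):

```python
# Connectivity check with redis-py; assumes a local server on the default port.
import redis

r = redis.Redis(host="localhost", port=6379)
r.ping()  # raises redis.exceptions.ConnectionError if the server is unreachable
```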
```python
from core import Core
from hf import HuggingFaceChat
from llmcache import LLMCache

# Initialize components
llm_model = HuggingFaceChat(model_name="meta-llama/Llama-3.2-3B-Instruct")
cache = LLMCache(enable_rerank=True)
core = Core(llm_model, cache)

# Start the chat
core.chat()
```
```python
cache = LLMCache(
    redis_conn=None,                 # Optional Redis connection
    embedding_dimension=384,         # Embedding vector size
    ttl_seconds=3600,                # Cache entry lifetime
    eviction_policy="allkeys-lru",   # Redis eviction policy
    max_memory_bytes=1_000_000_000,  # Redis memory limit
    enable_rerank=True,              # Enable Cohere reranking
)
```
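The `ttl_seconds`, `eviction_policy`, and `max_memory_bytes` parameters line up with standard Redis settings. Purely as an illustration (the cache presumably applies these for you), the equivalent server-side configuration via redis-py would be:

```python
# Illustrative only: the Redis settings these parameters correspond to.
# The cache presumably applies them itself; shown here via redis-py.
import redis

r = redis.Redis()
r.config_set("maxmemory", 1_000_000_000)         # max_memory_bytes
r.config_set("maxmemory-policy", "allkeys-lru")  # eviction_policy
r.set("llmcache:example", "cached response", ex=3600)  # per-entry TTL (ttl_seconds)
```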
```python
core = Core(
    llm_model=llm_model,
    cache=cache,
    top_k=3,                   # Number of similar cache entries to consider
    similarity_threshold=0.8,  # Minimum similarity score for cache hits
)
```
```python
llm_model = HuggingFaceChat(
    model_name="meta-llama/Llama-3.2-3B-Instruct"
)
```
```python
cache_key = cache.store_query_response(
    question="What is Python?",
    response="Python is...",
)

results = cache.search_cache(
    query_text="Tell me about Python",
    top_k=3,
    similarity_threshold=0.8,
)
```
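A typical cache-aside pattern built on those two calls might look like the sketch below. The shape of the returned matches and the `llm_model.chat()` call are assumptions for illustration, not documented API.

```python
# Cache-aside sketch: serve from the cache on a hit, otherwise call the LLM
# and store the fresh answer. The result shape and llm_model.chat() are
# assumptions for illustration; check the actual return format.
question = "Tell me about Python"
matches = cache.search_cache(query_text=question, top_k=1, similarity_threshold=0.8)

if matches:
    answer = matches[0]  # best cached match
else:
    answer = llm_model.chat(question)  # hypothetical method name
    cache.store_query_response(question=question, response=answer)
```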
```bash
git clone https://github.com/yourusername/llm_cache.git
cd llm_cache
```
- Fork the repository
- Create your feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
This project is licensed under the MIT License © 2025 Shadab Shaikh.
Much of my implementation was inspired by Zilliz's GPTCache. My motivation for building LLMCache stemmed from a deep curiosity to explore and understand the various system components that power fast LLM inference, with the goal of meaningfully reducing latency and costs.
This project is a work-in-progress and is intended for educational and research purposes only. While it may serve as a helpful starting point or reference, it is not optimized or tested for production-grade use.
There are more robust and battle-tested LLM caching systems available (Zilliz's GPTCache for instance).