This project evaluates the retrieval component of RAG (Retrieval-Augmented Generation) systems. A RAG pipeline consists of both retrieval and generation, but this benchmark measures only the quality of document retrieval. It implements semantic search evaluation using local vector storage with FAISS rather than the in-memory approach of the original BEIR benchmark; storing embeddings on disk makes large-scale evaluations practical.
This benchmark evaluates:
- Document embedding quality
- Semantic search accuracy
- Retrieval metrics (NDCG@k, Precision@k, Recall@k)
- Ranking effectiveness
- Vector similarity search performance

It does not evaluate the generation side of RAG:
- Text generation quality
- Answer synthesis
- Hallucination detection
- Response accuracy
- Context integration
Retrieval is evaluated separately for three reasons:

- Foundation of RAG
  - Retrieval quality directly impacts generation quality
  - Poor retrieval cannot be compensated for by good generation
  - Efficient retrieval is crucial for overall system performance
- Quantitative Metrics
  - Retrieval can be evaluated with standard IR metrics
  - NDCG@10 provides a clear comparison between models
  - Results are reproducible and objective
- Separation of Concerns
  - Allows focused optimization of retrieval models
  - Independent of the LLM chosen for generation
  - Clearer identification of performance bottlenecks
The system evaluates semantic search models by:
- Computing document embeddings using transformer models
- Storing vectors locally using FAISS indices
- Computing retrieval metrics (NDCG@k, Precision@k, Recall@k)
- Aggregating results across multiple datasets
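A minimal sketch of that pipeline, assuming `sentence-transformers` and `faiss` are installed (the model name, corpus, and queries are placeholders, not the project's actual code):

```python
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

# Placeholder corpus and queries; the benchmark loads these from evaluation datasets.
corpus = {"d1": "FAISS is a library for efficient similarity search.",
          "d2": "Transformer models compute contextual embeddings for text."}
queries = {"q1": "library for vector similarity search"}

model = SentenceTransformer("intfloat/multilingual-e5-base")  # any model from models.txt

doc_ids = list(corpus.keys())
doc_emb = model.encode([corpus[d] for d in doc_ids],
                       convert_to_numpy=True, normalize_embeddings=True)

index = faiss.IndexFlatIP(doc_emb.shape[1])  # inner product == cosine on normalized vectors
index.add(doc_emb.astype(np.float32))

query_emb = model.encode(list(queries.values()),
                         convert_to_numpy=True, normalize_embeddings=True)
scores, idx = index.search(query_emb.astype(np.float32), k=2)  # top-k documents per query
print([doc_ids[i] for i in idx[0]], scores[0])
```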
Key features:
- Local vector storage using FAISS
- Support for both CPU and GPU computation
- Batch processing to handle memory constraints
- Comprehensive evaluation metrics
- Model-specific vector storage organization
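For the CPU/GPU and batch-processing points above, a hedged sketch of how encoding might be kept within memory bounds (the model name and chunking scheme are illustrative, not the project's actual code):

```python
import numpy as np
import torch
from sentence_transformers import SentenceTransformer

# Use a GPU when available, otherwise fall back to CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = SentenceTransformer("BAAI/bge-m3", device=device)

def encode_in_batches(texts, batch_size=32):
    """Encode texts chunk by chunk so peak memory stays bounded.

    Each chunk could also be added to a FAISS index immediately
    instead of being accumulated in memory.
    """
    parts = []
    for start in range(0, len(texts), batch_size):
        chunk = texts[start:start + batch_size]
        parts.append(model.encode(chunk, batch_size=batch_size,
                                  convert_to_numpy=True, normalize_embeddings=True))
    return np.vstack(parts)
```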
Install the required packages:

```
pip install sentence-transformers
pip install torch
pip install datasets
pip install pandas
pip install numpy
pip install tqdm
```
Install one FAISS build depending on your hardware.

For CPU-only:

```
pip install faiss-cpu
```

For GPU support:

```
pip install faiss-gpu
```

Note: `faiss-gpu` requires CUDA to be installed on your system.
Run the benchmark with:

```
python benchmark.py --models_file models.txt --output results.csv --batch_size 32
```

Arguments:
- `--models_file`: Path to the file containing model names (default: `models.txt`)
- `--batch_size`: Batch size for encoding (default: 32)
- `--output`: Path to the output file (default: `results.csv`)
- `--force_recompute`: Force recomputation of vectors
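For example, a run that forces re-encoding with a larger batch size could look like this (the flag values are illustrative; only the documented flags are used):

```
python benchmark.py --models_file models.txt --output results.csv --batch_size 64 --force_recompute
```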
Create a `models.txt` file with model names, one per line:

```
BAAI/bge-m3
intfloat/multilingual-e5-base
Snowflake/snowflake-arctic-embed-l-v2.0
```
Unlike the original BEIR implementation, which keeps all vectors in memory, this implementation:
- Computes embeddings in batches
- Stores vectors in FAISS indices on disk
- Creates separate storage directories for each model
- Loads vectors only when needed for evaluation
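A minimal sketch of this store-then-load flow, assuming `faiss` is installed (the helper names are illustrative, and the file naming mirrors the layout shown below):

```python
import os
import pickle

import faiss

def store_vectors(model_name, dataset_name, index, corpus_ids, root="vector_store"):
    """Persist a FAISS index and its corpus ids under a per-model directory."""
    # Sanitizing the model name into a directory name is an illustrative choice.
    store_dir = os.path.join(root, model_name.replace("/", "_"))
    os.makedirs(store_dir, exist_ok=True)
    faiss.write_index(index, os.path.join(store_dir, f"{dataset_name}_vectors.faiss"))
    with open(os.path.join(store_dir, f"{dataset_name}_corpus_ids.pkl"), "wb") as f:
        pickle.dump(corpus_ids, f)

def load_vectors(model_name, dataset_name, root="vector_store"):
    """Load an index and its corpus ids only when a dataset is actually evaluated."""
    store_dir = os.path.join(root, model_name.replace("/", "_"))
    index = faiss.read_index(os.path.join(store_dir, f"{dataset_name}_vectors.faiss"))
    with open(os.path.join(store_dir, f"{dataset_name}_corpus_ids.pkl"), "rb") as f:
        corpus_ids = pickle.load(f)
    return index, corpus_ids
```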
Vectors and results are organized on disk as follows:

```
vector_store/
    model1_name/
        dataset1_vectors.faiss
        dataset1_corpus_ids.pkl
    model2_name/
        dataset2_vectors.faiss
        dataset2_corpus_ids.pkl
results/
    model1_results_timestamp.csv
    model2_results_timestamp.csv
    final_results.csv
```
The average NDCG@10 score is calculated by:
- Computing NDCG@10 for each query in each dataset
- Averaging NDCG@10 scores across all queries in a dataset
- Computing the mean across all datasets for each model
- Converting to percentage (multiplying by 100)
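A short sketch of that aggregation, assuming per-query NDCG@10 scores have already been computed (the nested dictionary and its values are made up for illustration):

```python
import numpy as np

# ndcg_per_query[dataset][query_id] -> NDCG@10 for that query (illustrative values)
ndcg_per_query = {
    "dataset_a": {"q1": 0.72, "q2": 0.55},
    "dataset_b": {"q1": 0.64, "q2": 0.81, "q3": 0.49},
}

# 1) Average across queries within each dataset
per_dataset = {ds: float(np.mean(list(scores.values())))
               for ds, scores in ndcg_per_query.items()}

# 2) Mean across datasets, 3) expressed as a percentage
average_ndcg_at_10 = 100 * float(np.mean(list(per_dataset.values())))
print(f"{average_ndcg_at_10:.2f}%")
```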
Example evaluation results for different models:
| Model | Average NDCG@10 |
|---|---|
| BAAI/bge-m3 | 65.80% |
| Snowflake/snowflake-arctic-embed-l-v2.0 | 64.52% |
| intfloat/multilingual-e5-base | 61.38% |
| intfloat/multilingual-e5-small | 58.98% |
| sentence-transformers/LaBSE | 51.51% |
| LocalDoc/TEmA-small | 50.68% |
MIT License
LocalDoc Team
For more information about the underlying technologies:
- FAISS: https://github.com/facebookresearch/faiss
- Sentence Transformers: https://www.sbert.net
- BEIR benchmark: https://github.com/beir-cellar/beir