This project evaluates the retrieval component of RAG (Retrieval-Augmented Generation) systems. A RAG pipeline consists of both retrieval and generation, but this benchmark measures only the quality of document retrieval. It implements semantic search evaluation using local vector storage with FAISS rather than the in-memory approach of the original BEIR benchmark; storing embeddings on disk makes large-scale evaluations practical.
This benchmark evaluates:
- Document embedding quality
- Semantic search accuracy
- Retrieval metrics (NDCG@k, Precision@k, Recall@k)
- Ranking effectiveness
- Vector similarity search performance

It does not evaluate the generation side of RAG:
- Text generation quality
- Answer synthesis
- Hallucination detection
- Response accuracy
- Context integration
Retrieval is evaluated separately for three reasons:

- Foundation of RAG
  - Retrieval quality directly impacts generation quality
  - Poor retrieval cannot be compensated for by good generation
  - Efficient retrieval is crucial for overall system performance
- Quantitative Metrics
  - Retrieval can be evaluated with standard IR metrics
  - NDCG@10 provides a clear comparison between models
  - Results are reproducible and objective
- Separation of Concerns
  - Allows focused optimization of retrieval models
  - Independent of the LLM chosen for generation
  - Clearer identification of performance bottlenecks
The system evaluates semantic search models by:
- Computing document embeddings using transformer models
- Storing vectors locally using FAISS indices
- Computing retrieval metrics (NDCG@k, Precision@k, Recall@k)
- Aggregating results across multiple datasets
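A minimal sketch of that pipeline, assuming `sentence-transformers` and `faiss` are installed (the model name, corpus, and queries are placeholders, not the project's actual code):

```python
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

# Placeholder corpus and queries; the benchmark loads these from evaluation datasets.
corpus = {"d1": "FAISS is a library for efficient similarity search.",
          "d2": "Transformer models compute contextual embeddings for text."}
queries = {"q1": "library for vector similarity search"}

model = SentenceTransformer("intfloat/multilingual-e5-base")  # any model from models.txt

doc_ids = list(corpus.keys())
doc_emb = model.encode([corpus[d] for d in doc_ids],
                       convert_to_numpy=True, normalize_embeddings=True)

index = faiss.IndexFlatIP(doc_emb.shape[1])  # inner product == cosine on normalized vectors
index.add(doc_emb.astype(np.float32))

query_emb = model.encode(list(queries.values()),
                         convert_to_numpy=True, normalize_embeddings=True)
scores, idx = index.search(query_emb.astype(np.float32), k=2)  # top-k documents per query
print([doc_ids[i] for i in idx[0]], scores[0])
```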
Key features:
- Local vector storage using FAISS
- Support for both CPU and GPU computation
- Batch processing to handle memory constraints
- Comprehensive evaluation metrics
- Model-specific vector storage organization
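For the CPU/GPU and batch-processing points above, a hedged sketch of how encoding might be kept within memory bounds (the model name and chunking scheme are illustrative, not the project's actual code):

```python
import numpy as np
import torch
from sentence_transformers import SentenceTransformer

# Use a GPU when available, otherwise fall back to CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = SentenceTransformer("BAAI/bge-m3", device=device)

def encode_in_batches(texts, batch_size=32):
    """Encode texts chunk by chunk so peak memory stays bounded.

    Each chunk could also be added to a FAISS index immediately
    instead of being accumulated in memory.
    """
    parts = []
    for start in range(0, len(texts), batch_size):
        chunk = texts[start:start + batch_size]
        parts.append(model.encode(chunk, batch_size=batch_size,
                                  convert_to_numpy=True, normalize_embeddings=True))
    return np.vstack(parts)
```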
Install the required packages:

```
pip install sentence-transformers
pip install torch
pip install datasets
pip install pandas
pip install numpy
pip install tqdm
```
Install one FAISS build depending on your hardware.

For CPU-only:

```
pip install faiss-cpu
```

For GPU support:

```
pip install faiss-gpu
```

Note: `faiss-gpu` requires CUDA to be installed on your system.
Run the benchmark with:

```
python benchmark.py --models_file models.txt --output results.csv --batch_size 32
```

Arguments:
- `--models_file`: Path to the file containing model names (default: `models.txt`)
- `--batch_size`: Batch size for encoding (default: 32)
- `--output`: Path to the output file (default: `results.csv`)
- `--force_recompute`: Force recomputation of vectors
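For example, a run that forces re-encoding with a larger batch size could look like this (the flag values are illustrative; only the documented flags are used):

```
python benchmark.py --models_file models.txt --output results.csv --batch_size 64 --force_recompute
```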
Create a `models.txt` file with model names, one per line:

```
BAAI/bge-m3
intfloat/multilingual-e5-base
Snowflake/snowflake-arctic-embed-l-v2.0
```
Unlike the original BEIR implementation, which keeps all vectors in memory, this implementation:
- Computes embeddings in batches
- Stores vectors in FAISS indices on disk
- Creates separate storage directories for each model
- Loads vectors only when needed for evaluation
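A minimal sketch of this store-then-load flow, assuming `faiss` is installed (the helper names are illustrative, and the file naming mirrors the layout shown below):

```python
import os
import pickle

import faiss

def store_vectors(model_name, dataset_name, index, corpus_ids, root="vector_store"):
    """Persist a FAISS index and its corpus ids under a per-model directory."""
    # Sanitizing the model name into a directory name is an illustrative choice.
    store_dir = os.path.join(root, model_name.replace("/", "_"))
    os.makedirs(store_dir, exist_ok=True)
    faiss.write_index(index, os.path.join(store_dir, f"{dataset_name}_vectors.faiss"))
    with open(os.path.join(store_dir, f"{dataset_name}_corpus_ids.pkl"), "wb") as f:
        pickle.dump(corpus_ids, f)

def load_vectors(model_name, dataset_name, root="vector_store"):
    """Load an index and its corpus ids only when a dataset is actually evaluated."""
    store_dir = os.path.join(root, model_name.replace("/", "_"))
    index = faiss.read_index(os.path.join(store_dir, f"{dataset_name}_vectors.faiss"))
    with open(os.path.join(store_dir, f"{dataset_name}_corpus_ids.pkl"), "rb") as f:
        corpus_ids = pickle.load(f)
    return index, corpus_ids
```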
Vectors and results are organized on disk as follows:

```
vector_store/
    model1_name/
        dataset1_vectors.faiss
        dataset1_corpus_ids.pkl
    model2_name/
        dataset2_vectors.faiss
        dataset2_corpus_ids.pkl
results/
    model1_results_timestamp.csv
    model2_results_timestamp.csv
    final_results.csv
```
The average NDCG@10 score is calculated by:
- Computing NDCG@10 for each query in each dataset
- Averaging NDCG@10 scores across all queries in a dataset
- Computing the mean across all datasets for each model
- Converting to percentage (multiplying by 100)
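A short sketch of that aggregation, assuming per-query NDCG@10 scores have already been computed (the nested dictionary and its values are made up for illustration):

```python
import numpy as np

# ndcg_per_query[dataset][query_id] -> NDCG@10 for that query (illustrative values)
ndcg_per_query = {
    "dataset_a": {"q1": 0.72, "q2": 0.55},
    "dataset_b": {"q1": 0.64, "q2": 0.81, "q3": 0.49},
}

# 1) Average across queries within each dataset
per_dataset = {ds: float(np.mean(list(scores.values())))
               for ds, scores in ndcg_per_query.items()}

# 2) Mean across datasets, 3) expressed as a percentage
average_ndcg_at_10 = 100 * float(np.mean(list(per_dataset.values())))
print(f"{average_ndcg_at_10:.2f}%")
```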
Example evaluation results for different models:
| Model | Average NDCG@10 |
|---|---|
| BAAI/bge-m3 | 65.80% |
| Snowflake/snowflake-arctic-embed-l-v2.0 | 64.52% |
| intfloat/multilingual-e5-base | 61.38% |
| intfloat/multilingual-e5-small | 58.98% |
| sentence-transformers/LaBSE | 51.51% |
| LocalDoc/TEmA-small | 50.68% |
MIT License
LocalDoc Team
For more information about the underlying technologies:
- FAISS: https://github.com/facebookresearch/faiss
- Sentence Transformers: https://www.sbert.net
- BEIR benchmark: https://github.com/beir-cellar/beir