This repository supports research and experimentation around understanding and mitigating hallucinations in embeddings — specifically how embeddings can fail to capture human-like understanding.
- Compare Embeddings: Measure similarity between sentences using cosine similarity (or other metrics such as dot product and Euclidean distance).
- Fine-Tune Embeddings: Fine-tune SentenceTransformer models to reduce hallucinations and improve semantic understanding.
```
.
├── data/                      # Training, validation, and test data
├── fine-tuning/
│   ├── embedding-fine-tune.py # Fine-tunes embedding models
│   └── eval.py                # Evaluates and compares model embeddings
├── outputs/                   # Outputs of similarity scoring between sentence pairs
├── results/                   # Results of evaluation and comparisons
├── requirements.txt           # Python dependencies
└── README.md                  # Project documentation
```
You can create an environment using any of the following:

Using venv:

```bash
python -m venv halluc-env
source halluc-env/bin/activate  # On Windows: halluc-env\Scripts\activate
pip install -r requirements.txt
```

Using conda:

```bash
conda create -n halluc-env python=3.10
conda activate halluc-env
pip install -r requirements.txt
```

Using uv:

```bash
uv venv halluc-env
source halluc-env/bin/activate
uv pip install -r requirements.txt
```
Create a `.env` file in the root directory to set environment variables for your project. This is useful for managing sensitive information such as API keys and Azure OpenAI connection details.
```
AZURE_OPENAI_ENDPOINT=
AZURE_OPENAI_API_KEY=
API_VERSION=2024-10-21
AZURE_DEPLOYMENT=
MODEL_NAME=
TEMPERATURE=0.0
```
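A minimal sketch of reading these variables at runtime, assuming `python-dotenv` is available (whether it is listed in `requirements.txt` is not confirmed here):

```python
import os

from dotenv import load_dotenv  # assumes python-dotenv is installed

# Load key/value pairs from the .env file into the process environment.
load_dotenv()

endpoint = os.getenv("AZURE_OPENAI_ENDPOINT")
api_key = os.getenv("AZURE_OPENAI_API_KEY")
api_version = os.getenv("API_VERSION", "2024-10-21")

if not endpoint or not api_key:
    raise RuntimeError("Azure OpenAI settings are missing from .env")
```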
Fine-tune a SentenceTransformer model using the provided training data to reduce hallucinations:
```bash
python ./fine-tuning/embedding-fine-tune.py
```
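For orientation, here is a minimal sketch of the kind of SentenceTransformer fine-tuning loop such a script might run; the base model, loss, and training pairs below are illustrative assumptions, not the script's actual configuration:

```python
from torch.utils.data import DataLoader
from sentence_transformers import InputExample, SentenceTransformer, losses

# Illustrative base checkpoint; the script may use a different one.
model = SentenceTransformer("all-MiniLM-L6-v2")

# Hypothetical sentence pairs with target similarity labels in [0, 1].
train_examples = [
    InputExample(texts=["A man is eating food.", "A man is eating a meal."], label=0.9),
    InputExample(texts=["A man is eating food.", "The sky is blue."], label=0.1),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)

# CosineSimilarityLoss pushes each pair's cosine similarity toward its label.
train_loss = losses.CosineSimilarityLoss(model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    warmup_steps=100,
    output_path="fine-tuned-model",
)
```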
Compare a fine-tuned model against a foundational model using evaluation datasets:
```bash
python ./fine-tuning/eval.py
```
This writes evaluation results to the `results/` directory.
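As a rough illustration, a comparison like this can be expressed with the sentence-transformers evaluation utilities; the sentence pairs, gold scores, and model names below are hypothetical:

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator

# Hypothetical evaluation pairs with human-annotated similarity scores.
sentences1 = ["A man is eating food.", "A plane is taking off."]
sentences2 = ["A man is eating a meal.", "The sky is blue."]
gold_scores = [0.9, 0.1]

evaluator = EmbeddingSimilarityEvaluator(sentences1, sentences2, gold_scores)

# A higher correlation with the gold scores indicates better embeddings.
for name in ("all-MiniLM-L6-v2", "fine-tuned-model"):  # illustrative names
    model = SentenceTransformer(name)
    print(name, evaluator(model))
```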
Use the sentence similarity utility to compute the semantic similarity between any two sentences; cosine similarity is used by default. Results are stored in the `outputs/` directory.
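A minimal sketch of the underlying computation using `sentence_transformers.util` (the model name here is an illustrative choice):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model choice

# Encode both sentences into dense vectors.
emb = model.encode(
    ["The bank raised interest rates.", "The river bank was flooded."],
    convert_to_tensor=True,
)

# Cosine similarity lies in [-1, 1]; higher means more semantically similar.
score = util.cos_sim(emb[0], emb[1]).item()
print(f"cosine similarity: {score:.4f}")
```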
Currently implemented:
- ✅ Cosine Similarity (default)
You can easily switch to other metrics such as:
- Dot Product
- Euclidean Distance
These metrics are applied on the sentence embeddings generated using SentenceTransformer models.
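For example, the same embeddings can be scored with a dot product or a Euclidean distance instead; a sketch, where `util.dot_score` comes from sentence-transformers and the distance is plain PyTorch:

```python
import torch
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model choice
emb = model.encode(
    ["A cat sits outside.", "A feline rests outdoors."],
    convert_to_tensor=True,
)

# Dot product: unnormalized similarity, sensitive to vector magnitude.
dot = util.dot_score(emb[0], emb[1]).item()

# Euclidean distance: lower means more similar, unlike the similarity metrics.
euclidean = torch.dist(emb[0], emb[1]).item()

print(f"dot product: {dot:.4f}, euclidean distance: {euclidean:.4f}")
```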
- `data/`: Contains training, validation, and test sets used for fine-tuning.
- `results/`: Contains evaluation output comparing foundational and fine-tuned models.
- `outputs/`: Contains similarity scores between sentence pairs.
This project is part of research for the paper:
"Hallucination by Design: How Embeddings Fail Understanding Human Language"
It explores:
- Where and why embeddings hallucinate
- How fine-tuning helps mitigate such hallucinations
- Benchmarks for measuring improvements
This project is released under the MIT License.
Feel free to fork, experiment, and contribute via pull requests or discussions.
- Sentence-Transformers
- The open-source community supporting transparency in embedding evaluation and interpretability