Embedding models are a core component of modern natural language processing and the foundation of many applications, most notably Retrieval-Augmented Generation (RAG) systems. They transform text into dense vector representations, enabling efficient semantic search and comparison. However, despite the strong generalization abilities of state-of-the-art embedding models, their performance can still be improved for specific tasks or domains.
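As a minimal illustration of what such a model does, the sketch below embeds a query and a small corpus and ranks the corpus by cosine similarity. The checkpoint name and the example sentences are placeholders, not the exact setup used in this repository:

```python
# Minimal semantic-search sketch with sentence-transformers.
# "thenlper/gte-small" is just an example general-purpose checkpoint.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("thenlper/gte-small")

corpus = [
    "Vitamin D deficiency is associated with bone disorders.",
    "The transformer architecture relies on self-attention.",
]
query = "What health problems are linked to low vitamin D?"

# Encode texts into dense vectors and rank the corpus by cosine similarity.
corpus_emb = model.encode(corpus, convert_to_tensor=True)
query_emb = model.encode(query, convert_to_tensor=True)
scores = util.cos_sim(query_emb, corpus_emb)[0]
best = scores.argmax().item()
print(corpus[best], float(scores[best]))
```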
This repository provides short examples and explanations of fine-tuning methods for embedding models. The dataset used is SciFact, a collection of scientific texts and corresponding queries used for a retrieval task that is also part of the MTEB benchmark. The notebook `main.ipynb` explains the methods, hyperparameters, and training data.
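For orientation, a typical fine-tuning setup for such a retrieval task pairs queries with their relevant passages and trains with an in-batch contrastive loss. The sketch below uses `MultipleNegativesRankingLoss` from sentence-transformers; the example pairs and hyperparameters are placeholders, not the exact configuration from `main.ipynb`:

```python
# Hedged sketch of contrastive fine-tuning on (query, passage) pairs.
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("thenlper/gte-small")  # example base checkpoint

# In the notebook these pairs come from SciFact; shown here as dummy data.
train_examples = [
    InputExample(texts=["query about vitamin D", "passage answering it"]),
    InputExample(texts=["query about antibiotics", "matching abstract"]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)

# Treats other in-batch passages as negatives for each query.
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    warmup_steps=10,
)
```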
*The gte model fine-tuned with different methods and evaluated on the SciFact retrieval task.*
Embedding models are often used off-the-shelf in their general-purpose form, even in specialized domains. However, even this very simple fine-tuning on a small dataset (10k sentences) already yields a slight improvement in retrieval performance.
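The improvement can be measured by running the fine-tuned checkpoint on the SciFact task via the `mteb` library, roughly as sketched below; the output folder name is arbitrary:

```python
# Sketch of evaluating a (fine-tuned) model on the SciFact retrieval task.
from mteb import MTEB
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("thenlper/gte-small")  # or a fine-tuned checkpoint
evaluation = MTEB(tasks=["SciFact"])
evaluation.run(model, output_folder="results/gte-small-scifact")
```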