Releases: UKPLab/sentence-transformers
v2.0.0 - Integration into Huggingface Model Hub
Models hosted on the hub
All pre-trained models are now hosted on the Huggingface Models hub.
Our pre-trained models can be found here: https://huggingface.co/sentence-transformers
You can also easily share your own sentence-transformers model on the hub so that other people can access it. Simply upload the folder and have people load it via:
model = SentenceTransformer('[your_username]/[model_name]')
For more information, see: Sentence Transformers in the Hugging Face Hub
Breaking changes
There should be no breaking changes. Old models can still be loaded from disk. However, if you use one of the provided pre-trained models, it will be downloaded again in version 2 of sentence-transformers, as the cache path has slightly changed.
Find sentence-transformer models on the Hub
You can filter the hub for sentence-transformers models: https://huggingface.co/models?filter=sentence-transformers
Add the sentence-transformers tag to your model card so that others can find your model.
Widget & Inference API
A widget was added to sentence-transformers models on the hub that lets you interact with the model directly on its model page:
https://huggingface.co/sentence-transformers/paraphrase-MiniLM-L6-v2
Further, models can now be used with the Accelerated Inference API: Send your sentences to the API and get back the embeddings from the respective model.
Save Model to Hub
A new method was added to the SentenceTransformer class: save_to_hub. Provide the model name and the model is saved on the hub.
Here you can find the explanation from transformers on how the hub works: Model sharing and uploading
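As a rough sketch of the workflow (the repository name is a placeholder, and you must be logged in to the hub, e.g. via huggingface-cli login):

```python
from sentence_transformers import SentenceTransformer

# Load a model (a pre-trained one or your own fine-tuned model)
model = SentenceTransformer('paraphrase-MiniLM-L6-v2')

# Upload it to the Hugging Face Hub under your account;
# 'my-new-model' is a placeholder repository name
model.save_to_hub('my-new-model')
```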
Automatic Model Card
When you save a model with save or save_to_hub, a README.md (also known as a model card) is automatically generated with basic information about the respective SentenceTransformer model.
New Models
- Several new sentence embedding models have been added, which are much better than the previous models: Sentence Embedding Models
- Some new models for semantic search based on MS MARCO have been added: MSMARCO Models
- The training script for these MS MARCO models has been released as well: Train MS MARCO Bi-Encoder v3
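As a rough sketch of how such a semantic search model can be used (the model name is one of the MS MARCO models referenced in these notes; the corpus and query are placeholders):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('msmarco-distilbert-base-tas-b')

corpus = ['Python is a programming language.',
          'The Eiffel Tower is located in Paris.',
          'Transformers are a neural network architecture.']
query = 'Where is the Eiffel Tower?'

corpus_emb = model.encode(corpus, convert_to_tensor=True)
query_emb = model.encode(query, convert_to_tensor=True)

# This model was trained with dot-product scoring, so we pass util.dot_score
hits = util.semantic_search(query_emb, corpus_emb, top_k=2, score_function=util.dot_score)
print(hits)
```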
v1.2.1 - Forward compatibility with version 2
Final release of version 1: Makes v1 of sentence-transformers forward compatible with models from version 2 of sentence-transformers.
v1.2.0 - Unsupervised Learning, New Training Examples, Improved Models
Unsupervised Sentence Embedding Learning
New methods have been integrated to train sentence embedding models without labeled data. See Unsupervised Learning for an overview of all existing methods.
New methods:
- CT: Integration of Semantic Re-Tuning With Contrastive Tension (CT) to tune models without labeled data
- CT_In-Batch_Negatives: A modification of CT using in-batch negatives
- SimCSE: An unsupervised sentence embedding learning method by Gao et al.
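As a rough illustration of the SimCSE idea (each sentence is paired with itself and the model's dropout acts as the noise), a minimal sketch using MultipleNegativesRankingLoss; the base model and sentences are placeholders:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Placeholder base model; mean pooling is added automatically
model = SentenceTransformer('bert-base-uncased')

# Unlabeled sentences: each sentence serves as its own positive pair
sentences = ['Your set of unlabeled sentences', 'Another unlabeled sentence']
train_examples = [InputExample(texts=[s, s]) for s in sentences]

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=32)
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, show_progress_bar=True)
```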
Pre-Training Methods
- MLM: An example script to run Masked-Language-Modeling (MLM). Running MLM on your custom data before supervised training can significantly improve performance. Further, MLM also works well for domain transfer: You first train on your custom data, and then train with e.g. NLI or STS data, as sketched below.
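A rough sketch of what such an MLM pre-training step can look like with the Hugging Face transformers Trainer; the model name, file path and hyperparameters are placeholders, and the released example script may differ in detail:

```python
from transformers import (AutoTokenizer, AutoModelForMaskedLM, DataCollatorForLanguageModeling,
                          LineByLineTextDataset, Trainer, TrainingArguments)

model_name = 'bert-base-uncased'  # placeholder base model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# One sentence per line in your domain-specific text file (placeholder path)
dataset = LineByLineTextDataset(tokenizer=tokenizer, file_path='my_domain_corpus.txt', block_size=256)
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

training_args = TrainingArguments(output_dir='mlm_output', num_train_epochs=1,
                                  per_device_train_batch_size=16)
trainer = Trainer(model=model, args=training_args,
                  data_collator=data_collator, train_dataset=dataset)
trainer.train()
trainer.save_model('mlm_output')  # use this checkpoint as the base for supervised training
```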
Training Examples
- Paraphrase Data: In our paper Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation we have shown that training on paraphrase data is powerful. In that folder we provide collections of different paraphrase datasets and scripts to train on them.
- NLI with MultipleNegativesRankingLoss: A dedicated example of how to use MultipleNegativesRankingLoss for training with NLI data, which leads to a significant performance boost. A minimal sketch follows below.
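A minimal sketch of the underlying idea (the published training_nli_v2.py script builds (anchor, entailment, contradiction) triplets from the full NLI data; the triplet and base model here are placeholders):

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer('bert-base-uncased')  # placeholder base model

# (anchor, entailment, contradiction) triplets derived from NLI data
train_examples = [InputExample(texts=['A man is playing a guitar.',
                                      'Someone is making music.',
                                      'The man is sleeping.'])]

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=32)
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1)
```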
New models
- New NLI & STS models: Following the Paraphrase Data training example, we published new models trained on NLI and NLI+STS data. Training code is available: training_nli_v2.py.

| Model Name | STSb-test performance |
| --- | --- |
| *Previous best models* | |
| nli-bert-large | 79.19 |
| stsb-roberta-large | 86.39 |
| *New v2 models* | |
| nli-mpnet-base-v2 | 86.53 |
| stsb-mpnet-base-v2 | 88.57 |

- New MS MARCO model for Semantic Search: Hofstätter et al. optimized the training procedure on the MS MARCO dataset. The resulting model is integrated as msmarco-distilbert-base-tas-b and improves the performance on the MS MARCO dataset from 33.13 to 34.43 MRR@10.
New Functions
- SentenceTransformer.fit() checkpoints: The fit() method now allows saving checkpoints during training at a fixed number of steps. More info
- Pooling mode as string: You can now pass the pooling mode to models.Pooling() as a string. Valid values are mean/max/cls:
  pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension(), pooling_mode='mean')
- NoDuplicatesDataLoader: When using MultipleNegativesRankingLoss, one should avoid having duplicate sentences in the same batch. This data loader simplifies this task and ensures that no duplicate entries are in the same batch. A combined sketch of these additions follows below.
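A minimal sketch combining these additions; the checkpoint argument names and the NoDuplicatesDataLoader import path are assumptions based on this release and should be checked against the documentation:

```python
from sentence_transformers import SentenceTransformer, InputExample, losses, models
from sentence_transformers.datasets import NoDuplicatesDataLoader

word_embedding_model = models.Transformer('bert-base-uncased')  # placeholder base model
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension(), pooling_mode='mean')
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])

train_examples = [InputExample(texts=['What is Python?', 'Python is a programming language.']),
                  InputExample(texts=['Where is Paris?', 'Paris is the capital of France.'])]

# Ensures no duplicate sentences end up in the same batch
train_dataloader = NoDuplicatesDataLoader(train_examples, batch_size=2)
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(train_objectives=[(train_dataloader, train_loss)],
          epochs=1,
          checkpoint_path='checkpoints/',   # save checkpoints during training
          checkpoint_save_steps=500)        # every 500 training steps
```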
Unsupervised Sentence Embedding Learning
This release integrates methods that allow learning sentence embeddings without labeled data:
- TSDAE: TSDAE uses a denoising auto-encoder to learn sentence embeddings. The method has been presented in our recent paper and achieves state-of-the-art performance for several tasks. A rough training sketch follows after this list.
- GenQ: GenQ uses a pre-trained T5 model to generate queries for a given passage. It was presented in our recent BEIR paper and works well for domain adaptation for [semantic search](https://www.sbert.net/examples/applications/semantic-search/README.html)
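A rough TSDAE training sketch, assuming the DenoisingAutoEncoderDataset and DenoisingAutoEncoderLoss classes added for this method; the base model and sentences are placeholders:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, models, losses
from sentence_transformers.datasets import DenoisingAutoEncoderDataset

model_name = 'bert-base-uncased'  # placeholder base model
word_embedding_model = models.Transformer(model_name)
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension(), 'cls')
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])

# Unlabeled sentences; the dataset adds noise (word deletion) to create the denoising task
train_sentences = ['Your set of unlabeled sentences', 'Another unlabeled sentence']
train_dataset = DenoisingAutoEncoderDataset(train_sentences)
train_dataloader = DataLoader(train_dataset, batch_size=8, shuffle=True)

# The decoder is tied to the encoder weights, as described in the TSDAE paper
train_loss = losses.DenoisingAutoEncoderLoss(model, decoder_name_or_path=model_name,
                                             tie_encoder_decoder=True)
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1)
```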
New Models - SentenceTransformer
- MSMARCO Dot-Product Models: We trained models using the dot-product instead of cosine similarity as similarity function. As shown in our recent BEIR paper, models with cosine-similarity prefer the retrieval of short documents, while models with dot-product prefer retrieval of longer documents. Now you can choose what is most suitable for your task.
- MSMARCO MiniLM Models: We uploaded some models based on MiniLM: It uses just 384 dimensions, is faster than previous models, and achieves nearly the same performance.
New Models - CrossEncoder
- MSMARCO Re-ranking Models v2: We trained new, significantly faster and significantly better CrossEncoder re-ranking models on the MSMARCO dataset. They outperform BERT-large models in terms of accuracy while being 18 times faster. Training code is available.
New Features
- You can now pass a default_activation_function to the CrossEncoder class, which is applied on top of the output logits generated by the class. A minimal sketch follows below.
- You can now pre-process images for the CLIP Model. A tutorial on how to fine-tune the CLIP Model with your data will be released soon.
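A minimal sketch of the new parameter, assuming a sigmoid should be applied on top of the logits; the model name is one of the MS MARCO re-rankers under the cross-encoder namespace and is given only as an example:

```python
import torch
from sentence_transformers import CrossEncoder

# Apply a sigmoid on top of the raw logits so that scores land in [0, 1]
model = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2',
                     default_activation_function=torch.nn.Sigmoid())

scores = model.predict([('How many people live in Berlin?',
                         'Berlin has a population of around 3.7 million.')])
print(scores)
```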
v1.0.4 - Patch CLIPModel.save
It was not possible to fine-tune and save the CLIPModel. This release fixes it. The CLIPModel can now be saved like any other model by calling model.save(path).
v1.0.3 - Patch util.paraphrase_mining
v1.0.3 - Patch for util.paraphrase_mining method
v1.0.2 - Patch CLIPModel
v1.0.2 - Patch for CLIPModel, new Image Examples
- Bugfix in CLIPModel: Inputs that were too long raised a RuntimeError; they are now truncated.
- New util function: util.paraphrase_mining_embeddings, to find most similar embeddings in a matrix
- Image Clustering and Duplicate Image Detection examples added: more info
v1.0.0 - Improvements, New Models, Text-Image Models
This release brings many improvements and new features. Also, the version numbering scheme is updated: We now use the format x.y.z, where x denotes major releases, y smaller releases with new features, and z bugfixes.
Text-Image-Model CLIP
You can now encode text and images in the same vector space using the OpenAI CLIP Model. You can use the model like this:
from sentence_transformers import SentenceTransformer, util
from PIL import Image
#Load CLIP model
model = SentenceTransformer('clip-ViT-B-32')
#Encode an image:
img_emb = model.encode(Image.open('two_dogs_in_snow.jpg'))
#Encode text descriptions
text_emb = model.encode(['Two dogs in the snow', 'A cat on a table', 'A picture of London at night'])
#Compute cosine similarities
cos_scores = util.cos_sim(img_emb, text_emb)
print(cos_scores)
More Information
IPython Demo
Colab Demo
Examples of how to train the CLIP model on your data will be added soon.
New Models
- Add v3 models trained for semantic search on MS MARCO: MS MARCO Models v3
- First models trained on Natural Questions dataset for Q&A Retrieval: Natural Questions Models v1
- Add DPR Models from Facebook for Q&A Retrieval: DPR-Models
New Features
- The Asym Model can now be used as the first model in a SentenceTransformer modules list.
- Sorting when encoding changed: Previously, we encoded from short to long sentences. Now we encode from long to short sentences, so out-of-memory errors happen at the start. Also, the estimate of the duration of the encoding process is now more precise.
- Improvement of the util.semantic_search method: It now uses the much faster torch.topk function. Further, you can define which scoring function should be used.
- New util methods: util.dot_score computes the dot product of two embedding matrices. util.normalize_embeddings normalizes embeddings to unit length.
- New parameter for the SentenceTransformer.encode method: normalize_embeddings. If set to True, it normalizes embeddings to unit length. In that case, the faster util.dot_score can be used instead of util.cos_sim to compute cosine similarity scores. A short sketch follows after this list.
- If you specify models.Transformer(do_lower_case=True) when creating a new SentenceTransformer, all input will be lowercased.
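A short sketch of how these additions fit together; the model name and sentences are placeholders:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('stsb-bert-base')  # placeholder model name

sentences1 = ['A man is playing a guitar.', 'A cat sits on the mat.']
sentences2 = ['Someone is making music.', 'A dog runs in the park.']

# normalize_embeddings=True returns unit-length vectors ...
emb1 = model.encode(sentences1, convert_to_tensor=True, normalize_embeddings=True)
emb2 = model.encode(sentences2, convert_to_tensor=True, normalize_embeddings=True)

# ... so the dot product equals the cosine similarity, but is faster to compute
scores = util.dot_score(emb1, emb2)
print(scores)
```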
New Examples
- Add example for model quantization on CPUs (smaller models, faster run-time): model_quantization.py
- Started to add examples of how to train SBERT models without training data: unsupervised learning. We start with an example for Query Generation to train a semantic search model.
Bugfixes
- The encode method now correctly returns token embeddings if output_value='token_embeddings' is set.
- Bugfix of the LabelAccuracyEvaluator.
- Bugfix: Tensors were moved to the CPU when encode(sent, convert_to_tensor=True) was specified. They now stay on the GPU.
Breaking changes:
- SentenceTransformer.encode method: Removed deprecated parameters is_pretokenized and num_workers
v0.4.1 - Faster Tokenization & Asymmetric Models
Refactored Tokenization
- Faster tokenization speed: Using batched tokenization for training & inference - now, all sentences in a batch are tokenized simultaneously.
- Usage of SentencesDataset is no longer needed for training. You can pass your train examples directly to the DataLoader (a complete training sketch follows after this list):
from torch.utils.data import DataLoader
from sentence_transformers import InputExample

train_examples = [InputExample(texts=['My first sentence', 'My second sentence'], label=0.8),
                  InputExample(texts=['Another pair', 'Unrelated sentence'], label=0.3)]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
- If you use a custom torch Dataset class: The dataset class must now return InputExample objects instead of tokenized texts.
- The SentenceLabelDataset class has been updated to the new tokenization flow: It always returns two or more InputExamples with the same label.
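A minimal end-to-end training sketch built on such a DataLoader; the model name is a placeholder, and CosineSimilarityLoss is chosen only because the examples above carry float similarity labels:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer('distilbert-base-nli-mean-tokens')  # placeholder model name

train_examples = [InputExample(texts=['My first sentence', 'My second sentence'], label=0.8),
                  InputExample(texts=['Another pair', 'Unrelated sentence'], label=0.3)]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)

# Float labels in [0, 1] pair naturally with CosineSimilarityLoss
train_loss = losses.CosineSimilarityLoss(model)
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=100)
```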
Asymmetric Models
Added a new models.Asym class that allows different encoding of sentences based on some tag (e.g. query vs. paragraph). Minimal example:
from torch import nn
from sentence_transformers import SentenceTransformer, InputExample, models

base_model = 'bert-base-uncased'  # name/path of any transformer model
word_embedding_model = models.Transformer(base_model, max_seq_length=250)
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension())
d1 = models.Dense(word_embedding_model.get_word_embedding_dimension(), 256, bias=False, activation_function=nn.Identity())
d2 = models.Dense(word_embedding_model.get_word_embedding_dimension(), 256, bias=False, activation_function=nn.Identity())
asym_model = models.Asym({'QRY': [d1], 'DOC': [d2]})
model = SentenceTransformer(modules=[word_embedding_model, pooling_model, asym_model])

# Your input examples have to look like this:
inp_example = InputExample(texts=[{'QRY': 'your query'}, {'DOC': 'your document text'}], label=1)

# Encoding (Note: Mixed inputs are not allowed)
model.encode([{'QRY': 'your query1'}, {'QRY': 'your query2'}])
Inputs that have the key 'QRY' will be passed through the d1 dense layer, while inputs with the key 'DOC' will be passed through the d2 dense layer.
More documentation on how to design asymmetric models will follow soon.
New Namespace & Models for Cross-Encoder
Cross-Encoder models are now hosted at https://huggingface.co/cross-encoder. Also, new pre-trained models have been added for NLI & QNLI.
Logging
Log messages now use a custom logger from logging, thanks to PR #623. This allows you to choose which log messages you want to see from which components.
Unit tests
A lot more unit tests have been added, which test the different components of the framework.
v0.4.0 - Upgrade Transformers Version
- Updated the dependencies so that it works with Huggingface Transformers version 4. Sentence-Transformers still works with huggingface transformers version 3, but an update to version 4 of transformers is recommended. Future changes might break with transformers version 3.
- New naming of pre-trained models. Models will be named: {task}-{transformer_model}. So 'bert-base-nli-stsb-mean-tokens' becomes 'stsb-bert-base'. Models will still be available under their old names, but newer models will follow the updated naming scheme.
- New application examples for information retrieval and question answering retrieval, together with respective pre-trained models.