Models hub (#13926)
---------

Co-authored-by: Maziyar Panahi <maziyar.panahi@iscpif.fr>

* 2023-08-15-gte_base_en (#13922)

* Add model 2023-08-15-gte_base_en

* Add model 2023-08-15-gte_large_en

* Add model 2023-08-15-gte_small_en

---------

Co-authored-by: maziyarpanahi <maziyar.panahi@iscpif.fr>

* 2023-08-15-bge_small_en (#13923)

* Add model 2023-08-15-bge_small_en

* Add model 2023-08-15-bge_base_en

* Add model 2023-08-15-bge_large_en

---------

Co-authored-by: maziyarpanahi <maziyar.panahi@iscpif.fr>

---------

Co-authored-by: jsl-models <74001263+jsl-models@users.noreply.github.com>
maziyarpanahi and jsl-models authored Aug 16, 2023
1 parent 85d06c5 commit 123bddf
Showing 7 changed files with 707 additions and 0 deletions.
@@ -0,0 +1,107 @@
---
layout: model
title: Bart Zero Shot Classifier Large - MNLI (bart_large_zero_shot_classifier_mnli)
author: John Snow Labs
name: bart_large_zero_shot_classifier_mnli
date: 2023-08-07
tags: [bart, zero_shot, en, open_source, tensorflow]
task: Zero-Shot Classification
language: en
edition: Spark NLP 5.1.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: BartForZeroShotClassification
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This model is intended for zero-shot text classification, especially in English. It was fine-tuned on the MNLI dataset using the large BART model.

BartForZeroShotClassification uses a ModelForSequenceClassification trained on MNLI tasks. It is equivalent to the BartForSequenceClassification models, except that it does not require a hardcoded number of potential classes: the candidate labels can be chosen at runtime. This usually makes it slower, but much more flexible.

We used TFBartForSequenceClassification to train this model and the BartForZeroShotClassification annotator in Spark NLP 🚀 for prediction at scale!

## Predicted Entities



{:.btn-box}
<button class="button button-orange" disabled>Live Demo</button>
<button class="button button-orange" disabled>Open in Colab</button>
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bart_large_zero_shot_classifier_mnli_en_5.1.0_3.0_1691369930633.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bart_large_zero_shot_classifier_mnli_en_5.1.0_3.0_1691369930633.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use



<div class="tabs-box" markdown="1">
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document_assembler = DocumentAssembler() \
    .setInputCol('text') \
    .setOutputCol('document')

tokenizer = Tokenizer() \
    .setInputCols(['document']) \
    .setOutputCol('token')

zeroShotClassifier = BartForZeroShotClassification \
    .pretrained('bart_large_zero_shot_classifier_mnli', 'en') \
    .setInputCols(['token', 'document']) \
    .setOutputCol('class') \
    .setCaseSensitive(True) \
    .setMaxSentenceLength(512) \
    .setCandidateLabels(["urgent", "mobile", "travel", "movie", "music", "sport", "weather", "technology"])

pipeline = Pipeline(stages=[
    document_assembler,
    tokenizer,
    zeroShotClassifier
])

example = spark.createDataFrame([['I have a problem with my iphone that needs to be resolved asap!!']]).toDF("text")
result = pipeline.fit(example).transform(example)
```
```scala
val document_assembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val tokenizer = new Tokenizer()
  .setInputCols("document")
  .setOutputCol("token")

val zeroShotClassifier = BartForZeroShotClassification.pretrained("bart_large_zero_shot_classifier_mnli", "en")
  .setInputCols("document", "token")
  .setOutputCol("class")
  .setCaseSensitive(true)
  .setMaxSentenceLength(512)
  .setCandidateLabels(Array("urgent", "mobile", "travel", "movie", "music", "sport", "weather", "technology"))

val pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, zeroShotClassifier))

val example = Seq("I have a problem with my iphone that needs to be resolved asap!!").toDS.toDF("text")

val result = pipeline.fit(example).transform(example)
```
</div>
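To inspect the predictions, select the `result` field of the output annotation column. A minimal sketch against the Python pipeline above (the `class` column name comes from `setOutputCol`; the exact metadata layout is an assumption and may vary across Spark NLP versions):

```python
# Show the winning label for each input row.
result.select("text", "class.result").show(truncate=False)

# Per-label confidence scores live in the annotation metadata.
result.selectExpr("explode(class) as prediction") \
      .select("prediction.result", "prediction.metadata") \
      .show(truncate=False)
```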

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|bart_large_zero_shot_classifier_mnli|
|Compatibility:|Spark NLP 5.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[token, document]|
|Output Labels:|[label]|
|Language:|en|
|Size:|467.1 MB|
|Case sensitive:|true|
97 changes: 97 additions & 0 deletions docs/_posts/maziyarpanahi/2023-08-15-bge_base_en.md
@@ -0,0 +1,97 @@
---
layout: model
title: BAAI general embedding English (bge_base)
author: John Snow Labs
name: bge_base
date: 2023-08-15
tags: [open_source, bert, embeddings, english, en, onnx]
task: Embeddings
language: en
edition: Spark NLP 5.0.2
spark_version: 3.0
supported: true
engine: onnx
annotator: BertEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

FlagEmbedding can map any text to a low-dimensional dense vector, which can be used for tasks like retrieval, classification, clustering, or semantic search. It can also be used in vector databases for LLMs.

`bge` is short for `BAAI general embedding`.

| Model | Language | Description | Query instruction for retrieval\* |
|:-------------------------------|:--------:| :--------:| :--------:|
| [BAAI/bge-large-en](https://huggingface.co/BAAI/bge-large-en) | English | ranks **1st** on the [MTEB](https://huggingface.co/spaces/mteb/leaderboard) leaderboard | `Represent this sentence for searching relevant passages: ` |
| [BAAI/bge-base-en](https://huggingface.co/BAAI/bge-base-en) | English | ranks **2nd** on the [MTEB](https://huggingface.co/spaces/mteb/leaderboard) leaderboard | `Represent this sentence for searching relevant passages: ` |
| [BAAI/bge-small-en](https://huggingface.co/BAAI/bge-small-en) | English | a small-scale model with competitive performance | `Represent this sentence for searching relevant passages: ` |
| [BAAI/bge-large-zh](https://huggingface.co/BAAI/bge-large-zh) | Chinese | ranks **1st** on the [C-MTEB](https://github.com/FlagOpen/FlagEmbedding/tree/master/C_MTEB) benchmark | `为这个句子生成表示以用于检索相关文章:` |
| [BAAI/bge-large-zh-noinstruct](https://huggingface.co/BAAI/bge-large-zh-noinstruct) | Chinese | trained without instruction; ranks **2nd** on the [C-MTEB](https://github.com/FlagOpen/FlagEmbedding/tree/master/C_MTEB) benchmark | |
| [BAAI/bge-base-zh](https://huggingface.co/BAAI/bge-base-zh) | Chinese | a base-scale model with capability similar to `bge-large-zh` | `为这个句子生成表示以用于检索相关文章:` |
| [BAAI/bge-small-zh](https://huggingface.co/BAAI/bge-small-zh) | Chinese | a small-scale model with competitive performance | `为这个句子生成表示以用于检索相关文章:` |
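For retrieval, the FlagEmbedding convention is to prepend the query instruction above to the query text only; passages are embedded as-is. A minimal sketch of that convention (the helper name and example strings are illustrative, not part of this card):

```python
# BGE retrieval convention: add the instruction to queries, not to passages.
INSTRUCTION = "Represent this sentence for searching relevant passages: "

def as_query(text: str) -> str:
    # Prepend the retrieval instruction before embedding a search query.
    return INSTRUCTION + text

queries = [as_query("what is a vector database?")]
passages = ["A vector database indexes dense embeddings for similarity search."]
# Embed both lists with the pipeline below and rank passages by cosine similarity.
```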

{:.btn-box}
<button class="button button-orange" disabled>Live Demo</button>
<button class="button button-orange" disabled>Open in Colab</button>
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bge_base_en_5.0.2_3.0_1692109953168.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bge_base_en_5.0.2_3.0_1692109953168.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use



<div class="tabs-box" markdown="1">
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

embeddings = BertEmbeddings.pretrained("bge_base", "en") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings")
```
```scala
val document = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val tokenizer = new Tokenizer()
  .setInputCols("document")
  .setOutputCol("token")

val embeddings = BertEmbeddings.pretrained("bge_base", "en")
  .setInputCols("document", "token")
  .setOutputCol("embeddings")
```
</div>
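The snippet above only declares the stages. A minimal end-to-end sketch that assembles them into a pipeline and embeds a toy DataFrame (the example text and the `spark` session are assumptions, not part of this card):

```python
from pyspark.ml import Pipeline

pipeline = Pipeline(stages=[document, tokenizer, embeddings])

data = spark.createDataFrame([["A vector database indexes dense embeddings."]]).toDF("text")
result = pipeline.fit(data).transform(data)

# Each token annotation carries its vector in the `embeddings` field.
result.selectExpr("explode(embeddings) as emb") \
      .select("emb.result", "emb.embeddings") \
      .show(truncate=80)
```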

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|bge_base|
|Compatibility:|Spark NLP 5.0.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[embeddings]|
|Language:|en|
|Size:|258.8 MB|
|Case sensitive:|true|

## References

The original models are published by [BAAI](https://huggingface.co/BAAI).
97 changes: 97 additions & 0 deletions docs/_posts/maziyarpanahi/2023-08-15-bge_large_en.md
@@ -0,0 +1,97 @@
---
layout: model
title: BAAI general embedding English (bge_large)
author: John Snow Labs
name: bge_large
date: 2023-08-15
tags: [open_source, bert, embeddings, english, en, onnx]
task: Embeddings
language: en
edition: Spark NLP 5.0.2
spark_version: 3.0
supported: true
engine: onnx
annotator: BertEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

FlagEmbedding can map any text to a low-dimensional dense vector, which can be used for tasks like retrieval, classification, clustering, or semantic search. It can also be used in vector databases for LLMs.

`bge` is short for `BAAI general embedding`.

| Model | Language | Description | Query instruction for retrieval\* |
|:-------------------------------|:--------:| :--------:| :--------:|
| [BAAI/bge-large-en](https://huggingface.co/BAAI/bge-large-en) | English | ranks **1st** on the [MTEB](https://huggingface.co/spaces/mteb/leaderboard) leaderboard | `Represent this sentence for searching relevant passages: ` |
| [BAAI/bge-base-en](https://huggingface.co/BAAI/bge-base-en) | English | ranks **2nd** on the [MTEB](https://huggingface.co/spaces/mteb/leaderboard) leaderboard | `Represent this sentence for searching relevant passages: ` |
| [BAAI/bge-small-en](https://huggingface.co/BAAI/bge-small-en) | English | a small-scale model with competitive performance | `Represent this sentence for searching relevant passages: ` |
| [BAAI/bge-large-zh](https://huggingface.co/BAAI/bge-large-zh) | Chinese | ranks **1st** on the [C-MTEB](https://github.com/FlagOpen/FlagEmbedding/tree/master/C_MTEB) benchmark | `为这个句子生成表示以用于检索相关文章:` |
| [BAAI/bge-large-zh-noinstruct](https://huggingface.co/BAAI/bge-large-zh-noinstruct) | Chinese | trained without instruction; ranks **2nd** on the [C-MTEB](https://github.com/FlagOpen/FlagEmbedding/tree/master/C_MTEB) benchmark | |
| [BAAI/bge-base-zh](https://huggingface.co/BAAI/bge-base-zh) | Chinese | a base-scale model with capability similar to `bge-large-zh` | `为这个句子生成表示以用于检索相关文章:` |
| [BAAI/bge-small-zh](https://huggingface.co/BAAI/bge-small-zh) | Chinese | a small-scale model with competitive performance | `为这个句子生成表示以用于检索相关文章:` |

{:.btn-box}
<button class="button button-orange" disabled>Live Demo</button>
<button class="button button-orange" disabled>Open in Colab</button>
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bge_large_en_5.0.2_3.0_1692109963281.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bge_large_en_5.0.2_3.0_1692109963281.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use



<div class="tabs-box" markdown="1">
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
document = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

embeddings = BertEmbeddings.pretrained("bge_large", "en") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings")
```
```scala
val document = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val tokenizer = new Tokenizer()
  .setInputCols("document")
  .setOutputCol("token")

val embeddings = BertEmbeddings.pretrained("bge_large", "en")
  .setInputCols("document", "token")
  .setOutputCol("embeddings")
```
</div>
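The stages above produce token-level vectors. One way to get a single vector per document is to pool them with Spark NLP's SentenceEmbeddings annotator; the sketch below is illustrative (AVERAGE pooling and the example text are assumptions, not this card's recommendation):

```python
from pyspark.ml import Pipeline
from sparknlp.annotator import SentenceEmbeddings

# Pool token-level vectors into one vector per document.
sentence_embeddings = SentenceEmbeddings() \
    .setInputCols(["document", "embeddings"]) \
    .setOutputCol("sentence_embeddings") \
    .setPoolingStrategy("AVERAGE")

pipeline = Pipeline(stages=[document, tokenizer, embeddings, sentence_embeddings])

data = spark.createDataFrame([["Semantic search with BGE embeddings."]]).toDF("text")
result = pipeline.fit(data).transform(data)

# One pooled vector per input document.
result.selectExpr("explode(sentence_embeddings) as s") \
      .select("s.embeddings") \
      .show(truncate=80)
```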

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|bge_large|
|Compatibility:|Spark NLP 5.0.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[embeddings]|
|Language:|en|
|Size:|794.2 MB|
|Case sensitive:|true|

## References

The original models are published by [BAAI](https://huggingface.co/BAAI).