---------
Co-authored-by: Maziyar Panahi <maziyar.panahi@iscpif.fr>

* 2023-08-15-gte_base_en (#13922)
* Add model 2023-08-15-gte_base_en
* Add model 2023-08-15-gte_large_en
* Add model 2023-08-15-gte_small_en

---------
Co-authored-by: maziyarpanahi <maziyar.panahi@iscpif.fr>

* 2023-08-15-bge_small_en (#13923)
* Add model 2023-08-15-bge_small_en
* Add model 2023-08-15-bge_base_en
* Add model 2023-08-15-bge_large_en

---------
Co-authored-by: maziyarpanahi <maziyar.panahi@iscpif.fr>

---------
Co-authored-by: jsl-models <74001263+jsl-models@users.noreply.github.com>
1 parent 85d06c5, commit 123bddf
Showing 7 changed files with 707 additions and 0 deletions.
107 changes: 107 additions & 0 deletions
docs/_posts/ahmedlone127/2023-08-07-bart_large_zero_shot_classifier_mnli_en.md
@@ -0,0 +1,107 @@
---
layout: model
title: Bart Zero Shot Classifier Large -MNLI (bart_large_zero_shot_classifier_mnli)
author: John Snow Labs
name: bart_large_zero_shot_classifier_mnli
date: 2023-08-07
tags: [bart, zero_shot, en, open_source, tensorflow]
task: Zero-Shot Classification
language: en
edition: Spark NLP 5.1.0
spark_version: 3.0
supported: true
engine: tensorflow
annotator: BartForZeroShotClassification
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This model is intended for zero-shot text classification, especially in English. It is a large BART model fine-tuned on MNLI.

BartForZeroShotClassification uses a ModelForSequenceClassification trained on the MNLI task. It is the equivalent of the BartForSequenceClassification models, except that the set of candidate classes is not hardcoded and can be chosen at runtime. This usually makes it slower, but much more flexible.

We used TFBartForSequenceClassification to train this model and the BartForZeroShotClassification annotator in Spark NLP 🚀 for prediction at scale!
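
To make the "classes chosen at runtime" idea concrete, here is a minimal sketch of how an NLI-trained model is commonly reused for zero-shot classification: each candidate label is turned into a hypothesis sentence, the model scores how strongly the input text entails each hypothesis, and the entailment scores are normalized across labels. The hypothesis template, the helper names, and the logit values below are assumptions for illustration only; they are not taken from the Spark NLP implementation, which may differ.

```python
import math


def build_hypotheses(labels, template="This example is {}."):
    """Turn each candidate label into an NLI hypothesis sentence.

    The template is an assumption, not necessarily what Spark NLP uses.
    """
    return [template.format(label) for label in labels]


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def rank_labels(labels, entailment_logits):
    """Pair each label with its normalized score, best label first."""
    probs = softmax(entailment_logits)
    return sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)


labels = ["urgent", "mobile", "travel"]
hypotheses = build_hypotheses(labels)

# Hypothetical entailment logits, one per (text, hypothesis) pair.
# A real run would obtain these from the MNLI-trained model.
logits = [2.0, 1.0, -1.0]

ranking = rank_labels(labels, logits)
print(hypotheses[0])   # This example is urgent.
print(ranking[0][0])   # urgent
```

Because the label list is only used to build hypotheses, it can change between calls without retraining, at the cost of one model pass per candidate label.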

## Predicted Entities

{:.btn-box}
<button class="button button-orange" disabled>Live Demo</button>
<button class="button button-orange" disabled>Open in Colab</button>
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bart_large_zero_shot_classifier_mnli_en_5.1.0_3.0_1691369930633.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bart_large_zero_shot_classifier_mnli_en_5.1.0_3.0_1691369930633.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use

<div class="tabs-box" markdown="1">
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import Tokenizer, BartForZeroShotClassification

# Assumes `spark` is an active SparkSession with Spark NLP loaded,
# for example one returned by sparknlp.start().

# Wrap the raw text column into document annotations.
document_assembler = DocumentAssembler() \
    .setInputCol('text') \
    .setOutputCol('document')

# Split each document into tokens.
tokenizer = Tokenizer() \
    .setInputCols(['document']) \
    .setOutputCol('token')

# Load the pretrained model and choose the candidate labels at runtime.
zeroShotClassifier = BartForZeroShotClassification \
    .pretrained('bart_large_zero_shot_classifier_mnli', 'en') \
    .setInputCols(['token', 'document']) \
    .setOutputCol('class') \
    .setCaseSensitive(True) \
    .setMaxSentenceLength(512) \
    .setCandidateLabels(["urgent", "mobile", "travel", "movie", "music", "sport", "weather", "technology"])

pipeline = Pipeline(stages=[
    document_assembler,
    tokenizer,
    zeroShotClassifier
])

example = spark.createDataFrame([['I have a problem with my iphone that needs to be resolved asap!!']]).toDF("text")
result = pipeline.fit(example).transform(example)
```
```scala
import com.johnsnowlabs.nlp.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.Tokenizer
import com.johnsnowlabs.nlp.annotators.classifier.dl.BartForZeroShotClassification
import org.apache.spark.ml.Pipeline
// Assumes `spark` is an active SparkSession; the implicits provide .toDS.
import spark.implicits._

val document_assembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val tokenizer = new Tokenizer()
  .setInputCols("document")
  .setOutputCol("token")

// Use the zero-shot annotator so candidate labels can be set at runtime.
val zeroShotClassifier = BartForZeroShotClassification.pretrained("bart_large_zero_shot_classifier_mnli", "en")
  .setInputCols("document", "token")
  .setOutputCol("class")
  .setCaseSensitive(true)
  .setMaxSentenceLength(512)
  .setCandidateLabels(Array("urgent", "mobile", "travel", "movie", "music", "sport", "weather", "technology"))

val pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, zeroShotClassifier))

val example = Seq("I have a problem with my iphone that needs to be resolved asap!!").toDS.toDF("text")

val result = pipeline.fit(example).transform(example)
```
</div>

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|bart_large_zero_shot_classifier_mnli|
|Compatibility:|Spark NLP 5.1.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[token, document]|
|Output Labels:|[label]|
|Language:|en|
|Size:|467.1 MB|
|Case sensitive:|true|
@@ -0,0 +1,97 @@
---
layout: model
title: BAAI general embedding English (bge_base)
author: John Snow Labs
name: bge_base
date: 2023-08-15
tags: [open_source, bert, embeddings, english, en, onnx]
task: Embeddings
language: en
edition: Spark NLP 5.0.2
spark_version: 3.0
supported: true
engine: onnx
annotator: BertEmbeddings
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

FlagEmbedding can map any text to a low-dimensional dense vector that can be used for tasks such as retrieval, classification, clustering, or semantic search.
It can also be used in vector databases for LLMs.

`bge` is short for `BAAI general embedding`.

| Model | Language | Description | Query instruction for retrieval\* |
|:-------------------------------|:--------:| :--------:| :--------:|
| [BAAI/bge-large-en](https://huggingface.co/BAAI/bge-large-en) | English | ranks **1st** on the [MTEB](https://huggingface.co/spaces/mteb/leaderboard) leaderboard | `Represent this sentence for searching relevant passages: ` |
| [BAAI/bge-base-en](https://huggingface.co/BAAI/bge-base-en) | English | ranks **2nd** on the [MTEB](https://huggingface.co/spaces/mteb/leaderboard) leaderboard | `Represent this sentence for searching relevant passages: ` |
| [BAAI/bge-small-en](https://huggingface.co/BAAI/bge-small-en) | English | a small-scale model with competitive performance | `Represent this sentence for searching relevant passages: ` |
| [BAAI/bge-large-zh](https://huggingface.co/BAAI/bge-large-zh) | Chinese | ranks **1st** on the [C-MTEB](https://github.com/FlagOpen/FlagEmbedding/tree/master/C_MTEB) benchmark | `为这个句子生成表示以用于检索相关文章:` |
| [BAAI/bge-large-zh-noinstruct](https://huggingface.co/BAAI/bge-large-zh-noinstruct) | Chinese | trained without instruction; ranks **2nd** on the [C-MTEB](https://github.com/FlagOpen/FlagEmbedding/tree/master/C_MTEB) benchmark | |
| [BAAI/bge-base-zh](https://huggingface.co/BAAI/bge-base-zh) | Chinese | a base-scale model with ability similar to `bge-large-zh` | `为这个句子生成表示以用于检索相关文章:` |
| [BAAI/bge-small-zh](https://huggingface.co/BAAI/bge-small-zh) | Chinese | a small-scale model with competitive performance | `为这个句子生成表示以用于检索相关文章:` |
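
The query instruction in the table is typically applied on the query side only: the instruction string is prepended to a short search query before embedding, while passages are embedded as they are. The sketch below illustrates that convention and a cosine-similarity ranking step. The helper names and the toy two-dimensional vectors are assumptions for illustration; real vectors would come from the embedding model.

```python
import math

# Query instruction listed in the table for the English bge models.
QUERY_INSTRUCTION = "Represent this sentence for searching relevant passages: "


def prepare_query(query):
    """Prepend the retrieval instruction to a search query (not to passages)."""
    return QUERY_INSTRUCTION + query


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_passages(query_vec, passage_vecs):
    """Return (passage_index, score) pairs, most similar passage first."""
    scores = [(i, cosine_similarity(query_vec, vec)) for i, vec in enumerate(passage_vecs)]
    return sorted(scores, key=lambda item: item[1], reverse=True)


# Toy vectors standing in for model output (hypothetical values).
query_vec = [1.0, 0.0]
passage_vecs = [[0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]

print(prepare_query("how to reset an iphone"))
print([index for index, _ in rank_passages(query_vec, passage_vecs)])  # [2, 1, 0]
```

Cosine similarity ignores vector length, which is why the passage `[2.0, 0.0]` ranks first here despite having a different magnitude from the query.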

{:.btn-box}
<button class="button button-orange" disabled>Live Demo</button>
<button class="button button-orange" disabled>Open in Colab</button>
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bge_base_en_5.0.2_3.0_1692109953168.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bge_base_en_5.0.2_3.0_1692109953168.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use

<div class="tabs-box" markdown="1">
{% include programmingLanguageSelectScalaPythonNLU.html %}
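```python
from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import Tokenizer, BertEmbeddings

# Wrap the raw text column into document annotations.
document = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

# Split each document into tokens.
tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

# Load the pretrained bge_base model.
embeddings = BertEmbeddings.pretrained("bge_base", "en") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings")

# Chain the stages; fit and transform on a DataFrame with a "text" column.
pipeline = Pipeline(stages=[document, tokenizer, embeddings])
```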
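```scala
import com.johnsnowlabs.nlp.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.Tokenizer
import com.johnsnowlabs.nlp.embeddings.BertEmbeddings
import org.apache.spark.ml.Pipeline

val document = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val tokenizer = new Tokenizer()
  .setInputCols("document")
  .setOutputCol("token")

val embeddings = BertEmbeddings.pretrained("bge_base", "en")
  .setInputCols("document", "token")
  .setOutputCol("embeddings")

// Chain the stages; fit and transform on a DataFrame with a "text" column.
val pipeline = new Pipeline().setStages(Array(document, tokenizer, embeddings))
```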
</div>

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|bge_base|
|Compatibility:|Spark NLP 5.0.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[embeddings]|
|Language:|en|
|Size:|258.8 MB|
|Case sensitive:|true|

## References

BAAI models are from [BAAI](https://huggingface.co/BAAI)
@@ -0,0 +1,97 @@
---
layout: model
title: BAAI general embedding English (bge_large)
author: John Snow Labs
name: bge_large
date: 2023-08-15
tags: [open_source, bert, embeddings, english, en, onnx]
task: Embeddings
language: en
edition: Spark NLP 5.0.2
spark_version: 3.0
supported: true
engine: onnx
annotator: BertEmbeddings
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

FlagEmbedding can map any text to a low-dimensional dense vector that can be used for tasks such as retrieval, classification, clustering, or semantic search.
It can also be used in vector databases for LLMs.

`bge` is short for `BAAI general embedding`.

| Model | Language | Description | Query instruction for retrieval\* |
|:-------------------------------|:--------:| :--------:| :--------:|
| [BAAI/bge-large-en](https://huggingface.co/BAAI/bge-large-en) | English | ranks **1st** on the [MTEB](https://huggingface.co/spaces/mteb/leaderboard) leaderboard | `Represent this sentence for searching relevant passages: ` |
| [BAAI/bge-base-en](https://huggingface.co/BAAI/bge-base-en) | English | ranks **2nd** on the [MTEB](https://huggingface.co/spaces/mteb/leaderboard) leaderboard | `Represent this sentence for searching relevant passages: ` |
| [BAAI/bge-small-en](https://huggingface.co/BAAI/bge-small-en) | English | a small-scale model with competitive performance | `Represent this sentence for searching relevant passages: ` |
| [BAAI/bge-large-zh](https://huggingface.co/BAAI/bge-large-zh) | Chinese | ranks **1st** on the [C-MTEB](https://github.com/FlagOpen/FlagEmbedding/tree/master/C_MTEB) benchmark | `为这个句子生成表示以用于检索相关文章:` |
| [BAAI/bge-large-zh-noinstruct](https://huggingface.co/BAAI/bge-large-zh-noinstruct) | Chinese | trained without instruction; ranks **2nd** on the [C-MTEB](https://github.com/FlagOpen/FlagEmbedding/tree/master/C_MTEB) benchmark | |
| [BAAI/bge-base-zh](https://huggingface.co/BAAI/bge-base-zh) | Chinese | a base-scale model with ability similar to `bge-large-zh` | `为这个句子生成表示以用于检索相关文章:` |
| [BAAI/bge-small-zh](https://huggingface.co/BAAI/bge-small-zh) | Chinese | a small-scale model with competitive performance | `为这个句子生成表示以用于检索相关文章:` |

{:.btn-box}
<button class="button button-orange" disabled>Live Demo</button>
<button class="button button-orange" disabled>Open in Colab</button>
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bge_large_en_5.0.2_3.0_1692109963281.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bge_large_en_5.0.2_3.0_1692109963281.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use

<div class="tabs-box" markdown="1">
{% include programmingLanguageSelectScalaPythonNLU.html %}
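```python
from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import Tokenizer, BertEmbeddings

# Wrap the raw text column into document annotations.
document = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

# Split each document into tokens.
tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

# Load the pretrained bge_large model.
embeddings = BertEmbeddings.pretrained("bge_large", "en") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings")

# Chain the stages; fit and transform on a DataFrame with a "text" column.
pipeline = Pipeline(stages=[document, tokenizer, embeddings])
```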
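```scala
import com.johnsnowlabs.nlp.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.Tokenizer
import com.johnsnowlabs.nlp.embeddings.BertEmbeddings
import org.apache.spark.ml.Pipeline

val document = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val tokenizer = new Tokenizer()
  .setInputCols("document")
  .setOutputCol("token")

val embeddings = BertEmbeddings.pretrained("bge_large", "en")
  .setInputCols("document", "token")
  .setOutputCol("embeddings")

// Chain the stages; fit and transform on a DataFrame with a "text" column.
val pipeline = new Pipeline().setStages(Array(document, tokenizer, embeddings))
```
</div>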
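
The pipeline above takes `token` as an input, and `BertEmbeddings` produces one vector per token. Retrieval or clustering usually needs a single vector per text, so the token vectors have to be pooled. The sketch below shows mean pooling followed by L2 normalization in plain Python. The pooling strategy is an assumption chosen for illustration and may not match how the original bge models derive their sentence vectors; the toy token vectors are hypothetical.

```python
import math


def mean_pool(token_vectors):
    """Average a list of equal-length token vectors into one vector."""
    if not token_vectors:
        raise ValueError("at least one token vector is required")
    count = len(token_vectors)
    dim = len(token_vectors[0])
    return [sum(vec[i] for vec in token_vectors) / count for i in range(dim)]


def l2_normalize(vector):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        return list(vector)
    return [x / norm for x in vector]


# Toy token vectors standing in for the "embeddings" column (hypothetical values).
token_vectors = [[1.0, 2.0], [3.0, 4.0], [5.0, 0.0]]

pooled = mean_pool(token_vectors)
sentence_vector = l2_normalize(pooled)
print(pooled)  # [3.0, 2.0]
```

After normalization, the dot product of two sentence vectors equals their cosine similarity, which keeps downstream ranking code simple.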

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|bge_large|
|Compatibility:|Spark NLP 5.0.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document, token]|
|Output Labels:|[embeddings]|
|Language:|en|
|Size:|794.2 MB|
|Case sensitive:|true|

## References

BAAI models are from [BAAI](https://huggingface.co/BAAI)