Skip to content

Commit

Permalink
[SPARKNLP-1091] AutoGGUFModel embeddings support (#14433)
Browse files Browse the repository at this point in the history
* Split HasLlamaCppProperties to HasLlamaCppModelProperties and HasLlamaCppInferenceProperties

* Refactor automatic gpu support

* [SPARKNLP-1091] AutoGGUFEmbeddings scala side

- also adds embedding support for AutoGGUFModel, as it already has the parameter

* [SPARKNLP-1091] AutoGGUFEmbeddings python side

* [SPARKNLP-1091] Update Documentation

* [SPARKNLP-1091] Update AutoGGUFEmbeddingsTests

* [SPARKNLP-1091] AutoGGUFEmbeddings python side

- Also adds example notebook

* [SPARKNLP-1080] AutoGGUFEmbeddings change default pretrained model

* [SPARKNLP-1091] AutoGGUFEmbeddings Set Defaults

---------

Co-authored-by: Maziyar Panahi <maziyar.panahi@iscpif.fr>
  • Loading branch information
DevinTDHa and maziyarpanahi authored Dec 9, 2024
1 parent 6d3b273 commit 978f55b
Show file tree
Hide file tree
Showing 19 changed files with 3,025 additions and 1,361 deletions.
123 changes: 123 additions & 0 deletions docs/en/annotator_entries/AutoGGUFEmbeddings.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,123 @@
{%- capture title -%}
AutoGGUFEmbeddings
{%- endcapture -%}

{%- capture description -%}
Annotator that uses the llama.cpp library to generate text embeddings with large language
models.

The type of embedding pooling can be set with the `setPoolingType` method. The default is
`"MEAN"`. The available options are `"NONE"`, `"MEAN"`, `"CLS"`, and `"LAST"`.

If the parameters are not set, the annotator will default to use the parameters provided by
the model.

Pretrained models can be loaded with `pretrained` of the companion object:

```scala
val autoGGUFEmbeddings = AutoGGUFEmbeddings.pretrained()
.setInputCols("document")
.setOutputCol("embeddings")
```

The default model is `"nomic-embed-text-v1.5.Q8_0.gguf"`, if no name is provided.

For available pretrained models please see the [Models Hub](https://sparknlp.org/models).

For extended examples of usage, see the
[AutoGGUFEmbeddingsTest](https://github.com/JohnSnowLabs/spark-nlp/tree/master/src/test/scala/com/johnsnowlabs/nlp/annotators/seq2seq/AutoGGUFEmbeddingsTest.scala)
and the
[example notebook](https://github.com/JohnSnowLabs/spark-nlp/tree/master/examples/python/llama.cpp/llama.cpp_in_Spark_NLP_AutoGGUFEmbeddings.ipynb).

**Note**: To use GPU inference with this annotator, make sure to use the Spark NLP GPU package and set
the number of GPU layers with the `setNGpuLayers` method.

When using larger models, we recommend adjusting GPU usage with `setNCtx` and `setNGpuLayers`
according to your hardware to avoid out-of-memory errors.
{%- endcapture -%}

{%- capture input_anno -%}
DOCUMENT
{%- endcapture -%}

{%- capture output_anno -%}
SENTENCE_EMBEDDINGS
{%- endcapture -%}

{%- capture python_example -%}
>>> import sparknlp
>>> from sparknlp.base import *
>>> from sparknlp.annotator import *
>>> from pyspark.ml import Pipeline
>>> document = DocumentAssembler() \
... .setInputCol("text") \
... .setOutputCol("document")
>>> autoGGUFEmbeddings = AutoGGUFEmbeddings.pretrained() \
... .setInputCols(["document"]) \
... .setOutputCol("completions") \
... .setBatchSize(4) \
... .setNGpuLayers(99) \
... .setPoolingType("MEAN")
>>> pipeline = Pipeline().setStages([document, autoGGUFEmbeddings])
>>> data = spark.createDataFrame([["The moons of Jupiter are 77 in total, with 79 confirmed natural satellites and 2 man-made ones."]]).toDF("text")
>>> result = pipeline.fit(data).transform(data)
>>> result.select("completions").show()
+--------------------------------------------------------------------------------+
| embeddings|
+--------------------------------------------------------------------------------+
|[[-0.034486726, 0.07770534, -0.15982522, -0.017873349, 0.013914132, 0.0365736...|
+--------------------------------------------------------------------------------+
{%- endcapture -%}

{%- capture scala_example -%}
import com.johnsnowlabs.nlp.base._
import com.johnsnowlabs.nlp.annotator._
import org.apache.spark.ml.Pipeline
import spark.implicits._

val document = new DocumentAssembler().setInputCol("text").setOutputCol("document")

val autoGGUFEmbeddings = AutoGGUFEmbeddings
.pretrained()
.setInputCols("document")
.setOutputCol("embeddings")
.setBatchSize(4)
.setPoolingType("MEAN")

val pipeline = new Pipeline().setStages(Array(document, autoGGUFEmbeddings))

val data = Seq(
"The moons of Jupiter are 77 in total, with 79 confirmed natural satellites and 2 man-made ones.")
.toDF("text")
val result = pipeline.fit(data).transform(data)
result.select("embeddings.embeddings").show(1, truncate=80)
+--------------------------------------------------------------------------------+
| embeddings|
+--------------------------------------------------------------------------------+
|[[-0.034486726, 0.07770534, -0.15982522, -0.017873349, 0.013914132, 0.0365736...|
+--------------------------------------------------------------------------------+
{%- endcapture -%}

{%- capture api_link -%}
[AutoGGUFEmbeddings](/api/com/johnsnowlabs/nlp/embeddings/AutoGGUFEmbeddings)
{%- endcapture -%}

{%- capture python_api_link -%}
[AutoGGUFEmbeddings](/api/python/reference/autosummary/sparknlp/annotator/embeddings/auto_gguf_embeddings/index.html)
{%- endcapture -%}

{%- capture source_link -%}
[AutoGGUFEmbeddings](https://github.com/JohnSnowLabs/spark-nlp/tree/master/src/main/scala/com/johnsnowlabs/nlp/embeddings/AutoGGUFEmbeddings.scala)
{%- endcapture -%}

{% include templates/anno_template.md
title=title
description=description
input_anno=input_anno
output_anno=output_anno
python_example=python_example
scala_example=scala_example
api_link=api_link
python_api_link=python_api_link
source_link=source_link
%}
File renamed without changes.
1 change: 1 addition & 0 deletions docs/en/annotators.md
Original file line number Diff line number Diff line change
Expand Up @@ -45,6 +45,7 @@ There are two types of Annotators:
{:.table-model-big}
|Annotator|Description|Version |
|---|---|---|
{% include templates/anno_table_entry.md path="" name="AutoGGUFEmbeddings" summary="Annotator that uses the llama.cpp library to generate text embeddings with large language models."%}
{% include templates/anno_table_entry.md path="" name="AutoGGUFModel" summary="Annotator that uses the llama.cpp library to generate text completions with large language models."%}
{% include templates/anno_table_entry.md path="" name="BGEEmbeddings" summary="Sentence embeddings using BGE."%}
{% include templates/anno_table_entry.md path="" name="BigTextMatcher" summary="Annotator to match exact phrases (by token) provided in a file against a Document."%}
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -251,7 +251,7 @@
"provenance": []
},
"kernelspec": {
"display_name": "Python 3",
"display_name": "sparknlp_dev",
"language": "python",
"name": "python3"
},
Expand All @@ -264,7 +264,8 @@
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3"
"pygments_lexer": "ipython3",
"version": "3.10.12"
}
},
"nbformat": 4,
Expand Down
Loading

0 comments on commit 978f55b

Please sign in to comment.