Commit 6b400b4

MthwRobinson, fzowl, and Liuhong99 authored
feat: add VoyageAI embeddings (#3069) (#3099)
Original PR was #3069. It was merged into a feature branch to fix dependency and linting issues. Application code changes from the original PR were already reviewed and approved.

------------

Original PR description:

Adding VoyageAI embeddings

Voyage AI’s embedding models and rerankers are state-of-the-art in retrieval accuracy.

---------

Co-authored-by: fzowl <160063452+fzowl@users.noreply.github.com>
Co-authored-by: Liuhong99 <39693953+Liuhong99@users.noreply.github.com>
1 parent 32df4ee commit 6b400b4

41 files changed, +20601 -56 lines changed

CHANGELOG.md

+2 -1
@@ -1,9 +1,10 @@
-## 0.14.3-dev4
+## 0.14.3-dev5
 
 ### Enhancements
 
 * **Move `category` field from Text class to Element class.**
 * **`partition_docx()` now supports pluggable picture sub-partitioners.** A subpartitioner that accepts a DOCX `Paragraph` and generates elements is now supported. This allows adding a custom sub-partitioner that extracts images and applies OCR or summarization for the image.
+* **Add VoyageAI embedder** Adds VoyageAI embeddings to support embedding via Voyage AI.
 
 ### Features
 

examples/embed/example_voyageai.py

+25
@@ -0,0 +1,25 @@
+import os
+
+from unstructured.documents.elements import Text
+from unstructured.embed.voyageai import VoyageAIEmbeddingConfig, VoyageAIEmbeddingEncoder
+
+# To use Voyage AI you will need to pass
+# Voyage AI API Key (obtained from https://dash.voyageai.com/)
+# as the ``api_key`` parameter.
+#
+# The ``model_name`` parameter is mandatory, please check the available models
+# at https://docs.voyageai.com/docs/embeddings
+
+embedding_encoder = VoyageAIEmbeddingEncoder(
+    config=VoyageAIEmbeddingConfig(api_key=os.environ["VOYAGE_API_KEY"], model_name="voyage-law-2")
+)
+elements = embedding_encoder.embed_documents(
+    elements=[Text("This is sentence 1"), Text("This is sentence 2")],
+)
+
+query = "This is the query"
+query_embedding = embedding_encoder.embed_query(query=query)
+
+[print(e, e.embeddings) for e in elements]
+print(query, query_embedding)
+print(embedding_encoder.is_unit_vector, embedding_encoder.num_of_dimensions)
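
The example above stops at printing the embeddings. A typical next step is to compare the query embedding with each element embedding. The following is a minimal sketch of that comparison, not part of this commit: `cosine_similarity` and `rank_by_similarity` are hypothetical helper names, and the vectors are made-up stand-ins for `query_embedding` and `e.embeddings` so that the snippet runs without an API key.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_embedding, doc_embeddings):
    """Return (index, score) pairs sorted from most to least similar."""
    scores = [
        (index, cosine_similarity(query_embedding, embedding))
        for index, embedding in enumerate(doc_embeddings)
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Illustrative vectors only; real embeddings would come from
# embed_query() and from the `embeddings` attribute of each element.
fake_query = [1.0, 0.0]
fake_docs = [[0.0, 1.0], [1.0, 0.0], [0.6, 0.8]]
for index, score in rank_by_similarity(fake_query, fake_docs):
    print(index, round(score, 3))
```

If `is_unit_vector` reports that the encoder returns unit-length vectors, the two norms are 1 and the score reduces to a plain dot product.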

requirements/base.txt

+1 -1
@@ -86,7 +86,7 @@ tabulate==0.9.0
     # via -r ./base.in
 tqdm==4.66.4
     # via nltk
-typing-extensions==4.11.0
+typing-extensions==4.12.0
     # via
     #   -r ./base.in
     #   emoji

requirements/deps/constraints.txt

+4 -1
@@ -57,7 +57,10 @@ unstructured-client<=0.18.0
 
 fsspec==2024.5.0
 
-# python 3.12 support
+# python 3.12 support
 numpy>=1.26.0
 wrapt>=1.14.0
 
+
+# NOTE(robinson): for compatiblity with voyage embeddings
+langsmith==0.1.62
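
The new `langsmith==0.1.62` line is an exact pin, unlike the range constraints around it. As a rough sketch of how such a pin could be checked against an installed environment, the snippet below reads constraint lines and reports exact pins that are not met. It is not part of this commit: `parse_pin` and `unmet_pins` are hypothetical helpers, only `name==version` lines are considered, and versions are compared as plain strings without normalization.

```python
import re
from importlib import metadata

# Matches exact pins such as "langsmith==0.1.62"; range constraints
# like "numpy>=1.26.0" and comment lines are deliberately not matched.
_PIN = re.compile(r"^([A-Za-z0-9][A-Za-z0-9._-]*)==([^\s#;]+)")


def parse_pin(line):
    """Return (name, version) for an exact pin, or None for any other line."""
    match = _PIN.match(line.strip())
    return (match.group(1), match.group(2)) if match else None


def unmet_pins(lines, installed_version=metadata.version):
    """List (name, pinned, installed) for exact pins the environment does not meet."""
    problems = []
    for line in lines:
        pin = parse_pin(line)
        if pin is None:
            continue
        name, pinned = pin
        try:
            installed = installed_version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed at all
        if installed != pinned:
            problems.append((name, pinned, installed))
    return problems


if __name__ == "__main__":
    sample = ["fsspec==2024.5.0", "numpy>=1.26.0", "langsmith==0.1.62"]
    for name, pinned, installed in unmet_pins(sample):
        print(f"{name}: pinned {pinned}, installed {installed}")
```

The `installed_version` parameter exists so the lookup can be swapped out, for example to check a lock file instead of the live environment.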

requirements/dev.txt

+3 -3
@@ -151,7 +151,7 @@ jsonschema-specifications==2023.12.1
     #   jsonschema
 jupyter==1.0.0
     # via -r ./dev.in
-jupyter-client==8.6.1
+jupyter-client==8.6.2
     # via
     #   ipykernel
     #   jupyter-console
@@ -185,7 +185,7 @@ jupyter-server==2.14.0
     #   notebook-shim
 jupyter-server-terminals==0.5.3
     # via jupyter-server
-jupyterlab==4.2.0
+jupyterlab==4.2.1
     # via notebook
 jupyterlab-pygments==0.3.0
     # via nbconvert
@@ -392,7 +392,7 @@ traitlets==5.14.3
     #   qtconsole
 types-python-dateutil==2.9.0.20240316
     # via arrow
-typing-extensions==4.11.0
+typing-extensions==4.12.0
     # via
     #   -c ./base.txt
     #   -c ./test.txt

requirements/extra-docx.txt

+1 -1
@@ -12,7 +12,7 @@ python-docx==1.1.2
     # via
     #   -c ././deps/constraints.txt
     #   -r ./extra-docx.in
-typing-extensions==4.11.0
+typing-extensions==4.12.0
     # via
     #   -c ./base.txt
     #   python-docx

requirements/extra-odt.txt

+1 -1
@@ -14,7 +14,7 @@ python-docx==1.1.2
     # via
     #   -c ././deps/constraints.txt
     #   -r ./extra-odt.in
-typing-extensions==4.11.0
+typing-extensions==4.12.0
     # via
     #   -c ./base.txt
     #   python-docx

requirements/extra-paddleocr.txt

+3 -3
@@ -8,7 +8,7 @@ attrdict==2.0.1
     # via unstructured-paddleocr
 babel==2.15.0
     # via flask-babel
-bce-python-sdk==0.9.10
+bce-python-sdk==0.9.11
     # via visualdl
 blinker==1.8.2
     # via flask
@@ -45,7 +45,7 @@ flask==3.0.3
     #   visualdl
 flask-babel==4.0.0
     # via visualdl
-fonttools==4.51.0
+fonttools==4.52.1
     # via matplotlib
 future==1.0.0
     # via bce-python-sdk
@@ -200,7 +200,7 @@ six==1.16.0
     #   imgaug
     #   python-dateutil
     #   visualdl
-tifffile==2024.5.10
+tifffile==2024.5.22
     # via scikit-image
 tqdm==4.66.4
     # via

requirements/extra-pdf-image.txt

+3 -3
@@ -39,7 +39,7 @@ filelock==3.14.0
     #   transformers
 flatbuffers==24.3.25
     # via onnxruntime
-fonttools==4.51.0
+fonttools==4.52.1
     # via matplotlib
 fsspec==2024.5.0
     # via
@@ -118,7 +118,7 @@ numpy==1.26.4
     #   transformers
 omegaconf==2.3.0
     # via effdet
-onnx==1.16.0
+onnx==1.16.1
     # via
     #   -r ./extra-pdf-image.in
     #   unstructured-inference
@@ -278,7 +278,7 @@ tqdm==4.66.4
     #   transformers
 transformers==4.41.1
     # via unstructured-inference
-typing-extensions==4.11.0
+typing-extensions==4.12.0
     # via
     #   -c ./base.txt
     #   huggingface-hub

requirements/huggingface.txt

+1 -1
@@ -102,7 +102,7 @@ tqdm==4.66.4
     #   transformers
 transformers==4.41.1
     # via -r ./huggingface.in
-typing-extensions==4.11.0
+typing-extensions==4.12.0
     # via
     #   -c ./base.txt
     #   huggingface-hub

requirements/ingest/airtable.txt

+1 -1
@@ -31,7 +31,7 @@ requests==2.32.2
     # via
     #   -c ./ingest/../base.txt
     #   pyairtable
-typing-extensions==4.11.0
+typing-extensions==4.12.0
     # via
     #   -c ./ingest/../base.txt
     #   pyairtable

requirements/ingest/azure-cognitive-search.txt

+1 -1
@@ -34,7 +34,7 @@ six==1.16.0
     #   -c ./ingest/../base.txt
     #   azure-core
     #   isodate
-typing-extensions==4.11.0
+typing-extensions==4.12.0
     # via
     #   -c ./ingest/../base.txt
     #   azure-core

requirements/ingest/azure.txt

+1 -1
@@ -93,7 +93,7 @@ six==1.16.0
     #   -c ./ingest/../base.txt
     #   azure-core
     #   isodate
-typing-extensions==4.11.0
+typing-extensions==4.12.0
     # via
     #   -c ./ingest/../base.txt
     #   azure-core

requirements/ingest/chroma.txt

+1 -1
@@ -198,7 +198,7 @@ typer==0.9.0
     # via
     #   -r ./ingest/chroma.in
     #   chromadb
-typing-extensions==4.11.0
+typing-extensions==4.12.0
     # via
     #   -c ./ingest/../base.txt
     #   chromadb

requirements/ingest/databricks-volumes.txt

+1 -1
@@ -15,7 +15,7 @@ charset-normalizer==3.3.2
     # via
     #   -c ./ingest/../base.txt
     #   requests
-databricks-sdk==0.27.1
+databricks-sdk==0.28.0
     # via -r ./ingest/databricks-volumes.in
 google-auth==2.29.0
     # via databricks-sdk

requirements/ingest/elasticsearch.txt

+1 -1
@@ -11,7 +11,7 @@ certifi==2024.2.2
     #   elastic-transport
 elastic-transport==8.13.0
     # via elasticsearch
-elasticsearch==8.13.1
+elasticsearch==8.13.2
     # via -r ./ingest/elasticsearch.in
 urllib3==1.26.18
     # via

requirements/ingest/embed-aws-bedrock.txt

+5 -5
@@ -37,7 +37,6 @@ charset-normalizer==3.3.2
 dataclasses-json==0.6.6
     # via
     #   -c ./ingest/../base.txt
-    #   langchain
     #   langchain-community
 frozenlist==1.4.1
     # via
@@ -56,9 +55,9 @@ jsonpatch==1.33
     # via langchain-core
 jsonpointer==2.4
     # via jsonpatch
-langchain==0.2.0
+langchain==0.2.1
     # via langchain-community
-langchain-community==0.2.0
+langchain-community==0.2.1
     # via -r ./ingest/embed-aws-bedrock.in
 langchain-core==0.2.1
     # via
@@ -67,8 +66,9 @@ langchain-core==0.2.1
     #   langchain-text-splitters
 langchain-text-splitters==0.2.0
     # via langchain
-langsmith==0.1.61
+langsmith==0.1.62
     # via
+    #   -c ./ingest/../deps/constraints.txt
     #   langchain
     #   langchain-community
     #   langchain-core
@@ -135,7 +135,7 @@ tenacity==8.3.0
     #   langchain
     #   langchain-community
     #   langchain-core
-typing-extensions==4.11.0
+typing-extensions==4.12.0
     # via
     #   -c ./ingest/../base.txt
     #   pydantic

requirements/ingest/embed-huggingface.txt

+5 -5
@@ -30,7 +30,6 @@ charset-normalizer==3.3.2
 dataclasses-json==0.6.6
     # via
     #   -c ./ingest/../base.txt
-    #   langchain
     #   langchain-community
 filelock==3.14.0
     # via
@@ -68,9 +67,9 @@ jsonpatch==1.33
     # via langchain-core
 jsonpointer==2.4
     # via jsonpatch
-langchain==0.2.0
+langchain==0.2.1
     # via langchain-community
-langchain-community==0.2.0
+langchain-community==0.2.1
     # via -r ./ingest/embed-huggingface.in
 langchain-core==0.2.1
     # via
@@ -79,8 +78,9 @@ langchain-core==0.2.1
     #   langchain-text-splitters
 langchain-text-splitters==0.2.0
     # via langchain
-langsmith==0.1.61
+langsmith==0.1.62
     # via
+    #   -c ./ingest/../deps/constraints.txt
     #   langchain
     #   langchain-community
     #   langchain-core
@@ -188,7 +188,7 @@ tqdm==4.66.4
     #   transformers
 transformers==4.41.1
     # via sentence-transformers
-typing-extensions==4.11.0
+typing-extensions==4.12.0
     # via
     #   -c ./ingest/../base.txt
     #   huggingface-hub

requirements/ingest/embed-octoai.txt

+2 -2
@@ -38,7 +38,7 @@ idna==3.7
     #   anyio
     #   httpx
     #   requests
-openai==1.30.1
+openai==1.30.3
     # via -r ./ingest/embed-octoai.in
 pydantic==2.7.1
     # via openai
@@ -63,7 +63,7 @@ tqdm==4.66.4
     # via
     #   -c ./ingest/../base.txt
     #   openai
-typing-extensions==4.11.0
+typing-extensions==4.12.0
     # via
     #   -c ./ingest/../base.txt
     #   openai

requirements/ingest/embed-openai.txt

+6 -6
@@ -37,7 +37,6 @@ charset-normalizer==3.3.2
 dataclasses-json==0.6.6
     # via
     #   -c ./ingest/../base.txt
-    #   langchain
     #   langchain-community
 distro==1.9.0
     # via openai
@@ -64,9 +63,9 @@ jsonpatch==1.33
     # via langchain-core
 jsonpointer==2.4
     # via jsonpatch
-langchain==0.2.0
+langchain==0.2.1
     # via langchain-community
-langchain-community==0.2.0
+langchain-community==0.2.1
     # via -r ./ingest/embed-openai.in
 langchain-core==0.2.1
     # via
@@ -75,8 +74,9 @@ langchain-core==0.2.1
     #   langchain-text-splitters
 langchain-text-splitters==0.2.0
     # via langchain
-langsmith==0.1.61
+langsmith==0.1.62
     # via
+    #   -c ./ingest/../deps/constraints.txt
     #   langchain
     #   langchain-community
     #   langchain-core
@@ -98,7 +98,7 @@ numpy==1.26.4
     #   -c ./ingest/../deps/constraints.txt
     #   langchain
     #   langchain-community
-openai==1.30.1
+openai==1.30.3
     # via -r ./ingest/embed-openai.in
 orjson==3.10.3
     # via langsmith
@@ -152,7 +152,7 @@ tqdm==4.66.4
     # via
     #   -c ./ingest/../base.txt
     #   openai
-typing-extensions==4.11.0
+typing-extensions==4.12.0
     # via
     #   -c ./ingest/../base.txt
     #   openai
