Commit 31a53c8

Fix: Chroma Upsert instead of Add (#3086)
Thanks to @0xjgv, the Chroma connector now upserts instead of adding, which prevents duplicate embeddings. Also including a Hugging Face example; we already had examples for all the other embedders.
1 parent 47d2861 commit 31a53c8

2 files changed: +6 -2 lines changed


CHANGELOG.md

+2
@@ -13,6 +13,8 @@
   to avoid text being dynamically injected into the XML document.
 * Add the missing `form_extraction_skip_tables` argument to the `partition_pdf_or_image` call.
 
+* **Chromadb change from Add to Upsert using element_id to make idempotent**
+
 ## 0.14.2
 
 ### Enhancements
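
To illustrate what "idempotent" means in the changelog entry above, here is a minimal sketch. It does not use Chroma: `ToyCollection` is a hypothetical in-memory stand-in, and the two sample elements are invented for the example. It assumes only that an upsert overwrites a record whose id already exists. Running the same ingest twice with fresh random ids doubles the stored records, while reusing `element_id` leaves the count unchanged.

```python
import uuid


class ToyCollection:
    """Hypothetical in-memory stand-in for a collection (not Chroma's API)."""

    def __init__(self):
        self.records = {}

    def upsert(self, ids, documents):
        # Insert new ids and overwrite records whose id already exists.
        for record_id, document in zip(ids, documents):
            self.records[record_id] = document


# Invented sample elements; only the "element_id" and "text" keys matter here.
elements = [
    {"element_id": "a1", "text": "first"},
    {"element_id": "b2", "text": "second"},
]

random_ids = ToyCollection()
stable_ids = ToyCollection()

for _run in range(2):  # simulate running the same ingest twice
    random_ids.upsert(
        ids=[str(uuid.uuid4()) for _element in elements],
        documents=[element["text"] for element in elements],
    )
    stable_ids.upsert(
        ids=[element["element_id"] for element in elements],
        documents=[element["text"] for element in elements],
    )

print(len(random_ids.records))  # 4: every run created new records
print(len(stable_ids.records))  # 2: the second run overwrote the first
```

The sketch also suggests why both halves of the change matter: upserting with a fresh random id on every run would still create new records, so the stable id is what makes a repeated write a no-op.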

unstructured/ingest/connector/chroma.py

+4 -2
@@ -111,7 +111,8 @@ def upsert_batch(self, batch):
 
         try:
             # Chroma wants lists even if there is only one element
-            collection.add(
+            # Upserting to prevent duplicates
+            collection.upsert(
                 ids=batch["ids"],
                 documents=batch["documents"],
                 embeddings=batch["embeddings"],
@@ -147,8 +148,9 @@ def write_dict(self, *args, elements_dict: t.List[t.Dict[str, t.Any]], **kwargs)
             self.upsert_batch(self.prepare_chroma_list(chunk))
 
     def normalize_dict(self, element_dict: dict) -> dict:
+        element_id = element_dict.get("element_id", str(uuid.uuid4()))
         return {
-            "id": str(uuid.uuid4()),
+            "id": element_id,
             "embedding": element_dict.pop("embeddings", None),
             "document": element_dict.pop("text", None),
             "metadata": flatten_dict(

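The id lookup added to `normalize_dict` can be exercised on its own. The sketch below isolates that one line in a hypothetical helper, `pick_id`, which is not part of the connector. The sample dictionaries are invented, and the metadata handling is left out because the diff truncates it.

```python
import uuid


def pick_id(element_dict: dict) -> str:
    # Same expression as the added line in the diff: reuse element_id when
    # present, otherwise fall back to a random UUID string.
    return element_dict.get("element_id", str(uuid.uuid4()))


with_id = {"element_id": "abc123", "text": "hello"}
without_id = {"text": "hello"}

# An element that carries an element_id always maps to the same record id.
print(pick_id(with_id) == pick_id(with_id))  # True

# .get() reads the key without removing it, unlike the .pop() calls used
# for "embeddings" and "text" in the diff.
print("element_id" in with_id)  # True

# Without an element_id, each call produces a different random id, so such
# elements would still be written as new records on every run.
print(pick_id(without_id) == pick_id(without_id))  # False
```

One detail worth noting: the default argument `str(uuid.uuid4())` is evaluated on every call, even when `element_id` is present. The cost is negligible, but the fallback is not lazy.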