[NOID] Fixes #4087 #4080 #4231: Add vector info procedures (#4142) and added Milvus and Pinecone support #4264

Merged 10 commits on Jan 21, 2025
1 change: 1 addition & 0 deletions LICENSES.txt
@@ -3061,6 +3061,7 @@ MIT
jnr-x86asm-1.0.2.jar
jsoup-1.15.3.jar
localstack-1.17.6.jar
milvus-1.19.7.jar
mockito-core-3.12.4.jar
mssql-jdbc-6.2.1.jre7.jar
mysql-1.17.6.jar
1 change: 1 addition & 0 deletions NOTICE.txt
@@ -462,6 +462,7 @@ MIT
jnr-x86asm-1.0.2.jar
jsoup-1.15.3.jar
localstack-1.17.6.jar
milvus-1.19.7.jar
mockito-core-3.12.4.jar
mssql-jdbc-6.2.1.jre7.jar
mysql-1.17.6.jar
@@ -7,6 +7,7 @@ note that the list and the signature procedures are consistent with the others,
[opts=header, cols="1, 3"]
|===
| name | description
| apoc.vectordb.chroma.info(hostOrKey, collection, $config) | Gets information about the specified existing collection, or throws an error (HTTP response code 500) if it does not exist
| apoc.vectordb.chroma.createCollection(hostOrKey, collection, similarity, size, $config) |
Creates a collection, with the name specified in the 2nd parameter, and with the specified `similarity` and `size`.
The default endpoint is `<hostOrKey param>/api/v1/collections`.
@@ -38,6 +39,19 @@ With hostOrKey=null, the default is 'http://localhost:8000'.

=== Examples

.Get collection info (it leverages https://docs.trychroma.com/reference/py-client#get_collection[this API])
[source,cypher]
----
CALL apoc.vectordb.chroma.info(hostOrKey, 'test_collection', {<optional config>})
----

.Example results
[opts="header"]
|===
| value
| {"name": "test_collection", "metadata": {"size": 4, "hnsw:space": "cosine"}, "database": "default_database", "id": "74ebe008-1ccb-4d3d-8c5d-cdd7cfa526c2", "tenant": "default_tenant"}
|===

.Create a collection (it leverages https://docs.trychroma.com/usage-guide#creating-inspecting-and-deleting-collections[this API])
[source,cypher]
----
@@ -49,15 +49,17 @@ See the following pages for more details on specific vector db procedures
- xref:./qdrant.adoc[Qdrant]
- xref:./chroma.adoc[ChromaDB]
- xref:./weaviate.adoc[Weaviate]
- xref:./pinecone.adoc[Pinecone]
- xref:./milvus.adoc[Milvus]


== Store Vector db info (i.e. `apoc.vectordb.configure`)

We can save some info in the System Database to be reused later, that is the host, login credentials, and mapping,
to be used in the `*.get` and `*.query` procedures, except for the `apoc.vectordb.custom.get` one.

Therefore, to store the vector info, we can execute the `CALL apoc.vectordb.configure(vectorName, keyConfig, databaseName, $configMap)`,
where `vectorName` can be "QDRANT", "CHROMA" or "WEAVIATE",
where `vectorName` can be "QDRANT", "CHROMA", "PINECONE", "MILVUS" or "WEAVIATE",
that indicates info to be reused respectively by `apoc.vectordb.qdrant.*`, `apoc.vectordb.chroma.*`, `apoc.vectordb.pinecone.*`, `apoc.vectordb.milvus.*` and `apoc.vectordb.weaviate.*`.

Then `keyConfig` is the configuration name, `databaseName` is the database where the config will be set,
@@ -6,6 +6,7 @@ note that the list and the signature procedures are consistent with the others,
[opts=header, cols="1, 3"]
|===
| name | description
| apoc.vectordb.qdrant.info(hostOrKey, collection, $config) | Gets information about the specified existing collection, or throws a FileNotFoundException if it does not exist
| apoc.vectordb.qdrant.createCollection(hostOrKey, collection, similarity, size, $config) |
Creates a collection, with the name specified in the 2nd parameter, and with the specified `similarity` and `size`.
The default endpoint is `<hostOrKey param>/collections/<collection param>`.
@@ -38,6 +39,29 @@ With hostOrKey=null, the default is 'http://localhost:6333'.

=== Examples

.Get collection info (it leverages https://qdrant.github.io/qdrant/redoc/index.html#tag/collections/operation/get_collection[this API])
[source,cypher]
----
CALL apoc.vectordb.qdrant.info(hostOrKey, 'test_collection', {<optional config>})
----

.Example results
[opts="header"]
|===
| value
| {"result": {"optimizer_status": "ok", "points_count": 2, "vectors_count": 2, "segments_count": 8, "indexed_vectors_count": 0,
"config": {"params": {"on_disk_payload": true, "vectors": {"size": 4, "distance": "Cosine"}, "shard_number": 1, "replication_factor": 1, "write_consistency_factor": 1},
"optimizer_config": {"max_optimization_threads": 1, "indexing_threshold": 20000, "deleted_threshold": 0.2, "flush_interval_sec": 5, "memmap_threshold": null, "default_segment_number": 0, "max_segment_size": null, "vacuum_min_vector_number": 1000}, "quantization_config": null,
"hnsw_config": {"max_indexing_threads": 0, "full_scan_threshold": 10000, "ef_construct": 100, "m": 16, "on_disk": false},
"wal_config": {"wal_segments_ahead": 0, "wal_capacity_mb": 32}
},
"status": green,
"payload_schema": {}
},
"time": 1.2725E-4, "status": ok
}
|===

.Create a collection (it leverages https://qdrant.github.io/qdrant/redoc/index.html#tag/collections/operation/create_collection[this API])
[source,cypher]
----
@@ -6,6 +6,7 @@ note that the list and the signature procedures are consistent with the others,
[opts=header, cols="1, 3"]
|===
| name | description
| apoc.vectordb.weaviate.info($host, $collectionName, $config) | Gets information about the specified existing collection, or throws a FileNotFoundException if it does not exist
| apoc.vectordb.weaviate.createCollection(hostOrKey, collection, similarity, size, $config) |
Creates a collection, with the name specified in the 2nd parameter, and with the specified `similarity` and `size`.
The default endpoint is `<hostOrKey param>/schema`.
@@ -39,6 +40,33 @@ With hostOrKey=null, the default is 'http://localhost:8080/v1'.

=== Examples

.Get collection info (it leverages https://weaviate.io/developers/weaviate/api/rest#tag/schema/get/schema/{className}[this API])
[source, cypher]
----
CALL apoc.vectordb.weaviate.info($host, 'test_collection', {<optional config>})
----

.Example results
[opts="header"]
|===
| value
| {"vectorizer": "none",
"invertedIndexConfig": {"bm25": {"b": 0.75, "k1": 1.2}, "stopwords": {"additions": null, "removals": null, "preset": en}, "cleanupIntervalSeconds": 60},
"vectorIndexConfig": {"ef": -1, "dynamicEfMin": 100, "pq": {"centroids": 256, "trainingLimit": 100000, "encoder": {"type": "kmeans", "distribution": "log-normal"},
"enabled": false, "bitCompression": false, "segments": 0
},
"distance": cosine, "skip": false, "dynamicEfFactor": 8, "bq": {"enabled": false},
"vectorCacheMaxObjects": 1000000000000, "cleanupIntervalSeconds": 300, "dynamicEfMax": 500, "efConstruction": 128, "flatSearchCutoff": 40000, "maxConnections": 64},
"multiTenancyConfig": {"enabled": false},
"vectorIndexType": "hnsw", "replicationConfig": {"factor": 1},
"shardingConfig": {"desiredVirtualCount": 128, "desiredCount": 1, "actualCount": 1, "function": "murmur3", "virtualPerPhysical": 128, "strategy": "hash", "actualVirtualCount": 128, "key": "_id"},
"class": "TestCollection",
"properties": [{"name": "city", "description": "This property was generated by Weaviate's auto-schema feature on Wed Jul 10 12:50:18 2024", "indexFilterable": true, "tokenization": "word", "indexSearchable": true, "dataType": ["text"]},
{"name": "foo", "description": "This property was generated by Weaviate's auto-schema feature on Wed Jul 10 12:50:18 2024", "indexFilterable": true, "tokenization": word, "indexSearchable": true, "dataType": ["text"]}
]
}
|===

.Create a collection (it leverages https://weaviate.io/developers/weaviate/api/rest#tag/schema/post/schema[this API])
[source,cypher]
----
112 changes: 78 additions & 34 deletions full-it/src/test/java/apoc/full/it/vectordb/ChromaDbTest.java
@@ -1,25 +1,35 @@
package apoc.full.it.vectordb;

import static apoc.ml.Prompt.API_KEY_CONF;
import static apoc.ml.RestAPIConfig.HEADERS_KEY;
import static apoc.util.ExtendedTestUtil.assertFails;
import static apoc.util.MapUtil.map;
import static apoc.util.TestUtil.testCall;
import static apoc.util.TestUtil.testResult;
import static apoc.vectordb.VectorDbHandler.Type.CHROMA;
import static apoc.vectordb.VectorDbTestUtil.EntityType.*;
import static apoc.vectordb.VectorDbTestUtil.EntityType.FALSE;
import static apoc.vectordb.VectorDbTestUtil.EntityType.NODE;
import static apoc.vectordb.VectorDbTestUtil.EntityType.REL;
import static apoc.vectordb.VectorDbTestUtil.assertBerlinResult;
import static apoc.vectordb.VectorDbTestUtil.assertLondonResult;
import static apoc.vectordb.VectorDbTestUtil.assertNodesCreated;
import static apoc.vectordb.VectorDbTestUtil.assertReadOnlyProcWithMappingResults;
import static apoc.vectordb.VectorDbTestUtil.assertRelsCreated;
import static apoc.vectordb.VectorDbTestUtil.dropAndDeleteAll;
import static apoc.vectordb.VectorDbTestUtil.getAuthHeader;
import static apoc.vectordb.VectorDbTestUtil.ragSetup;
import static apoc.vectordb.VectorDbUtil.ERROR_READONLY_MAPPING;
import static apoc.vectordb.VectorEmbeddingConfig.ALL_RESULTS_KEY;
import static apoc.vectordb.VectorEmbeddingConfig.MAPPING_KEY;
import static apoc.vectordb.VectorMappingConfig.*;
import static apoc.vectordb.VectorMappingConfig.EMBEDDING_KEY;
import static apoc.vectordb.VectorMappingConfig.ENTITY_KEY;
import static apoc.vectordb.VectorMappingConfig.METADATA_KEY;
import static apoc.vectordb.VectorMappingConfig.MODE_KEY;
import static apoc.vectordb.VectorMappingConfig.MappingMode;
import static apoc.vectordb.VectorMappingConfig.NODE_LABEL;
import static apoc.vectordb.VectorMappingConfig.REL_TYPE;
import static org.junit.Assert.assertEquals;
import static org.junit.Assert.assertNotNull;
import static org.junit.Assert.assertNull;
import static org.junit.Assert.fail;
import static org.neo4j.configuration.GraphDatabaseSettings.DEFAULT_DATABASE_NAME;
import static org.neo4j.configuration.GraphDatabaseSettings.SYSTEM_DATABASE_NAME;

@@ -31,7 +41,6 @@
import java.util.List;
import java.util.Map;
import java.util.concurrent.atomic.AtomicReference;
import org.assertj.core.api.Assertions;
import org.junit.AfterClass;
import org.junit.Before;
import org.junit.BeforeClass;
@@ -46,6 +55,9 @@
public class ChromaDbTest {
private static final AtomicReference<String> COLL_ID = new AtomicReference<>();
private static final ChromaDBContainer CHROMA_CONTAINER = new ChromaDBContainer("chromadb/chroma:0.4.25.dev137");
private static final String READONLY_KEY = "my_readonly_api_key";
private static final Map<String, String> READONLY_AUTHORIZATION = getAuthHeader(READONLY_KEY);
private static final String COLLECTION_NAME = "test_collection";

private static String HOST;

@@ -79,10 +91,10 @@ public static void setUp() throws Exception {

testCall(
db,
"CALL apoc.vectordb.chroma.upsert($host, $collection,\n" + " [\n"
+ " {id: '1', vector: [0.05, 0.61, 0.76, 0.74], metadata: {city: \"Berlin\", foo: \"one\"}, text: 'ajeje'},\n"
+ " {id: '2', vector: [0.19, 0.81, 0.75, 0.11], metadata: {city: \"London\", foo: \"two\"}, text: 'brazorf'}\n"
+ " ])",
"CALL apoc.vectordb.chroma.upsert($host, $collection,\n" + "[\n"
+ " {id: '1', vector: [0.05, 0.61, 0.76, 0.74], metadata: {city: \"Berlin\", foo: \"one\"}, text: 'ajeje'},\n"
+ " {id: '2', vector: [0.19, 0.81, 0.75, 0.11], metadata: {city: \"London\", foo: \"two\"}, text: 'brazorf'}\n"
+ "])",
map("host", HOST, "collection", COLL_ID.get()),
r -> {
assertNull(r.get("value"));
@@ -105,6 +117,27 @@ public void before() {
dropAndDeleteAll(db);
}

@Test
public void getInfo() {
testResult(
db,
"CALL apoc.vectordb.chroma.info($host, $collection, $conf) ",
map("host", HOST, "collection", COLLECTION_NAME, "conf", map(ALL_RESULTS_KEY, true)),
r -> {
Map<String, Object> row = (Map<String, Object>) r.next().get("value");
assertEquals(COLLECTION_NAME, row.get("name"));
});
}

@Test
public void getInfoNotExistentCollection() {
assertFails(
db,
"CALL apoc.vectordb.chroma.info($host, 'wrong_collection', $conf) ",
map("host", HOST, "collection", COLLECTION_NAME, "conf", map(ALL_RESULTS_KEY, true)),
"Server returned HTTP response code: 500");
}

@Test
public void getVectors() {
testResult(
@@ -257,8 +290,8 @@ public void queryVectorsWithCreateNode() {
"myId",
METADATA_KEY,
"foo",
CREATE_KEY,
true));
MODE_KEY,
MappingMode.CREATE_IF_MISSING.toString()));

testResult(
db,
@@ -328,17 +361,17 @@ public void getVectorsWithCreateNodeUsingExistingNode() {

@Test
public void getReadOnlyVectorsWithMapping() {
Map<String, Object> conf = map(ALL_RESULTS_KEY, true, MAPPING_KEY, map(EMBEDDING_KEY, "vect"));

try {
testCall(
db,
"CALL apoc.vectordb.chroma.get($host, $collection, [1, 2], $conf)",
map("host", HOST, "collection", COLL_ID.get(), "conf", conf),
r -> fail());
} catch (RuntimeException e) {
Assertions.assertThat(e.getMessage()).contains(ERROR_READONLY_MAPPING);
}
db.executeTransactionally("CREATE (:Test {readID: 'one'}), (:Test {readID: 'two'})");

Map<String, Object> conf = map(
ALL_RESULTS_KEY, true, MAPPING_KEY, map(NODE_LABEL, "Test", ENTITY_KEY, "readID", METADATA_KEY, "foo"));

testResult(
db,
"CALL apoc.vectordb.chroma.get($host, $collection, ['1', '2'], $conf) "
+ "YIELD vector, id, metadata, node RETURN * ORDER BY id",
map("host", HOST, "collection", COLL_ID.get(), "conf", conf),
r -> assertReadOnlyProcWithMappingResults(r, "node"));
}

@Test
@@ -372,17 +405,23 @@ public void queryVectorsWithCreateNodeUsingExistingNode() {

@Test
public void queryReadOnlyVectorsWithMapping() {
Map<String, Object> conf = map(ALL_RESULTS_KEY, true, MAPPING_KEY, map(EMBEDDING_KEY, "vect"));

try {
testCall(
db,
"CALL apoc.vectordb.chroma.query($host, $collection, [0.2, 0.1, 0.9, 0.7], {}, 5, $conf)",
map("host", HOST, "collection", COLL_ID.get(), "conf", conf),
r -> fail());
} catch (RuntimeException e) {
Assertions.assertThat(e.getMessage()).contains(ERROR_READONLY_MAPPING);
}
db.executeTransactionally(
"CREATE (:Start)-[:TEST {readID: 'one'}]->(:End), (:Start)-[:TEST {readID: 'two'}]->(:End)");

Map<String, Object> conf = map(
ALL_RESULTS_KEY,
true,
MAPPING_KEY,
map(
REL_TYPE, "TEST",
ENTITY_KEY, "readID",
METADATA_KEY, "foo"));

testResult(
db,
"CALL apoc.vectordb.chroma.query($host, $collection, [0.2, 0.1, 0.9, 0.7], {}, 5, $conf)",
map("host", HOST, "collection", COLL_ID.get(), "conf", conf),
r -> assertReadOnlyProcWithMappingResults(r, "rel"));
}

@Test
@@ -462,7 +501,12 @@ public void queryVectorsWithRag() {
String openAIKey = ragSetup(db);

Map<String, Object> conf = map(
ALL_RESULTS_KEY, true, MAPPING_KEY, map(NODE_LABEL, "Rag", ENTITY_KEY, "readID", METADATA_KEY, "foo"));
ALL_RESULTS_KEY,
true,
HEADERS_KEY,
READONLY_AUTHORIZATION,
MAPPING_KEY,
map(NODE_LABEL, "Rag", ENTITY_KEY, "readID", METADATA_KEY, "foo"));

testResult(
db,