[YSQL] Make YSQL support for ANN vector indexes #22195

tanujnay112 · 2024-04-30T09:47:42Z

Description

Add initial YSQL code to make ANN vector indexes.

Issue Type

kind/enhancement

Warning: Please confirm that this issue does not contain any sensitive information

I confirm this issue does not contain any sensitive information.

Summary: The ybgin and lsm access methods repeat a lot of logic during column binding. This diff deduplicates that logic.

This diff also adds the `yb_is_supported` method to the IndexAMHandler interface to simplify the areas of the code that check whether an index is supported by YB. It also adds the `yb_am_bind_schema` method to the same interface. This method allows YB AMs to customize the underlying DocDB schema creation and to pass along any extra metadata. Right now, the two YB AMs, yblsm and ybgin, implement the same logic for this method.

These changes prepare for upcoming work where new index types are added by users or extensions, more specifically the upcoming addition of a vector-based AM.

Jira: DB-11118
Test Plan: Jenkins
Reviewers: jason
Reviewed By: jason
Subscribers: yql
Differential Revision: https://phorge.dev.yugabyte.com/D34676
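As a rough illustration of the shape of this refactoring, the sketch below models an AM handler as a record with two optional callbacks. It is a toy Python model, not the actual C code: `IndexAmHandler`, `bind_key_columns`, `is_supported`, and the dict-based schema are hypothetical stand-ins. Only the callback names `yb_is_supported` and `yb_am_bind_schema` come from the summary.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

# Hypothetical stand-in for a DocDB index schema: column name -> role.
Schema = Dict[str, str]


def bind_key_columns(columns: List[str]) -> Schema:
    """Shared binding logic: every indexed column becomes a key column."""
    return {name: "key" for name in columns}


@dataclass
class IndexAmHandler:
    """Toy model of an access-method handler with YB-specific callbacks."""
    name: str
    # Returns True when YB can handle this AM; None means no callback is set.
    yb_is_supported: Optional[Callable[[], bool]] = None
    # Builds the underlying schema for the indexed columns.
    yb_am_bind_schema: Optional[Callable[[List[str]], Schema]] = None


def is_supported(handler: IndexAmHandler) -> bool:
    """One place to ask whether an index AM is supported."""
    return handler.yb_is_supported is not None and handler.yb_is_supported()


# Both YB AMs reuse the same binding function instead of duplicating it.
yblsm = IndexAmHandler("yblsm", lambda: True, bind_key_columns)
ybgin = IndexAmHandler("ybgin", lambda: True, bind_key_columns)
btree = IndexAmHandler("btree")  # no YB callbacks registered in this toy model
```

In this model a new AM, such as a vector-based one, would only need to supply its own two callbacks rather than touch every call site.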

Summary: This change implements the YSQL side of vector index creation. This diff adds support for index creation statements with a dummy ANN method, called `ybdummyann` for now, in the form

```
create extension vector;
CREATE TABLE items (id bigserial PRIMARY KEY, embedding vector(3));
CREATE INDEX ON items USING ybdummyann (embedding vector_l2_ops);
```

This creates an inverted index in DocDB with a schema that looks like `BaseYBCTID | embedding |`, with only `BaseYBCTID` as the key.

We can do an index ANN scan based on a given query vector, such as

`SELECT * FROM items ORDER BY embedding <-> '[1.0, 0.4, 0.35]' LIMIT 5;`

or an index-only scan, such as

`SELECT embedding FROM items ORDER BY embedding <-> '[1.0, 0.4, 0.35]' LIMIT 5;`

Note that the results from a `ybdummyann` index won't actually be sorted by their distance from the given query vector, because the DocDB side of vector indexing has not been implemented. This is made clear by the following client warning when such an index is created:

```
WARNING: ybdummyann is meant for internal-testing only. It does not yield ordered results.
```

In the future, once we have full end-to-end support for vector indexing, we will add index AMs such as `hnsw` and `ivfflat` meant for external usage.

When a vector index is created, a message of type `PgVectorIdxOptionsPB`, found in `common.proto`, is populated into `IndexInfo`. A log message has been inserted into `tablet.cc` to show how this can be accessed. Vector index scans populate a field of type `PgVectorReadOptionsPB` in the `PgsqlReadRequestPB`.

The relcache preloader is adjusted to not load index relations whose user-defined AM handler procs might not be loaded yet.

A new access method handler called `ybdummyannhandler` is created by this diff. Any future vector index AM/AM handler will share functionality very similar to `ybdummyann`. For this reason, this common functionality is all placed in `src/ybvector/ybvector*`.
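Because `ybdummyann` does not yield ordered results, it can help to have an exact reference for what a correctly ordered answer looks like. The sketch below is a brute-force Python illustration, not part of the diff: it ranks rows by Euclidean distance, which is the metric behind pgvector's `<->` operator. The function names and the sample `items` rows are hypothetical.

```python
import heapq
import math
from typing import List, Sequence, Tuple

Row = Tuple[int, List[float]]  # (id, embedding)


def l2_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean (L2) distance between two vectors of equal dimension."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def exact_top_k(rows: List[Row], query: Sequence[float], k: int) -> List[Row]:
    """Return the k rows closest to the query vector, nearest first."""
    return heapq.nsmallest(k, rows, key=lambda row: l2_distance(row[1], query))


# Hypothetical sample data mirroring the `items` table shape.
items = [
    (1, [1.0, 0.4, 0.35]),
    (2, [0.0, 0.0, 0.0]),
    (3, [1.0, 0.5, 0.35]),
    (4, [3.0, 3.0, 3.0]),
]
nearest = exact_top_k(items, [1.0, 0.4, 0.35], k=2)
```

Comparing an index scan's output against a reference like this is one way to tell an unordered result set from a genuinely nearest-first one.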
The main remaining TODOs after this change are:
- Build out the DocDB side.
- Add the capability to merge-sort rows from tablets based on their distance from the query vector.
- Add an extra key column to denote future sharding information of each row.
- Allow included values.
- Allow a mix of vector and non-vector key attributes.

**Upgrade/Rollback safety:** This adds vector index protobuf fields that should not be used by any production customer right now.

Jira: DB-11118
Test Plan: ./yb_build.sh --java-test 'org.yb.pgsql.TestPgRegressThirdPartyExtensionsPgvector'
Reviewers: timur, jason, mbautin, sergei
Reviewed By: timur, jason
Subscribers: yql, ybase
Differential Revision: https://phorge.dev.yugabyte.com/D34200
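The merge-sort TODO could take roughly the following shape. This is a minimal Python sketch under a stated assumption: each tablet returns its hits already sorted by distance to the query vector. The `Hit` tuple layout and `merge_tablet_hits` are hypothetical names, not anything from the diff.

```python
import heapq
from itertools import islice
from typing import Iterable, List, Tuple

# (distance to the query vector, row id); hypothetical per-tablet result row.
Hit = Tuple[float, int]


def merge_tablet_hits(per_tablet: Iterable[List[Hit]], limit: int) -> List[Hit]:
    """Merge per-tablet hit lists, each already sorted by ascending distance."""
    merged = heapq.merge(*per_tablet)  # lazy k-way merge, smallest distance first
    return list(islice(merged, limit))


tablet_a = [(0.0, 1), (0.9, 7)]
tablet_b = [(0.1, 3), (0.4, 5)]
tablet_c = [(0.2, 9)]
top3 = merge_tablet_hits([tablet_a, tablet_b, tablet_c], limit=3)
```

Because the merge is lazy, a `LIMIT 5` query would only pull as many rows from each tablet's list as are needed to fill the limit.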

tanujnay112 added area/ysql Yugabyte SQL (YSQL) status/awaiting-triage Issue awaiting triage labels Apr 30, 2024

tanujnay112 self-assigned this Apr 30, 2024

yugabyte-ci added kind/enhancement This is an enhancement of an existing feature priority/medium Medium priority issue labels Apr 30, 2024

tanujnay112 mentioned this issue Jun 11, 2024

[DocDB] Vector Index Support #22828

Open

7 tasks

sushantrmishra removed the status/awaiting-triage Issue awaiting triage label Sep 23, 2024

lingamsandeep closed this as completed Sep 26, 2024
