[YSQL] Make YSQL support for ANN vector indexes #22195
Labels
area/ysql
Yugabyte SQL (YSQL)
kind/enhancement
This is an enhancement of an existing feature
priority/medium
Medium priority issue
Comments
tanujnay112 added the area/ysql (Yugabyte SQL (YSQL)) and status/awaiting-triage (Issue awaiting triage) labels on Apr 30, 2024
yugabyte-ci added the kind/enhancement (This is an enhancement of an existing feature) and priority/medium (Medium priority issue) labels on Apr 30, 2024
tanujnay112 added a commit that referenced this issue on May 3, 2024
Summary: The ybgin and lsm access methods repeat a lot of logic during column binding. This diff deduplicates that logic.

This diff also adds the `yb_is_supported` method to the IndexAMHandler interface to simplify the areas of the code that check whether an index is supported by YB. It also adds the `yb_am_bind_schema` method to the same interface. This method allows YB AMs to customize the underlying DocDB schema creation and to pass along any extra metadata. Right now, the two YB AMs, yblsm and ybgin, implement the same logic for this method.

These changes prepare for upcoming work in which new index types are added by users or extensions, specifically the upcoming addition of a vector-based AM.

Jira: DB-11118
Test Plan: Jenkins
Reviewers: jason
Reviewed By: jason
Subscribers: yql
Differential Revision: https://phorge.dev.yugabyte.com/D34676
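For readers unfamiliar with the pattern, the following is a minimal C sketch of how two access methods could share one binding routine through callbacks on a handler table. It is not the actual YugabyteDB code: only the names `yb_is_supported` and `yb_am_bind_schema` come from the commit message, while the struct layouts, the callback signatures, and every `sketch_`/`Sketch` identifier are assumptions made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_MAX_COLUMNS 8

/* Hypothetical stand-in for the DocDB schema being built for an index. */
typedef struct SketchSchema {
    const char *columns[SKETCH_MAX_COLUMNS];
    int         ncolumns;
} SketchSchema;

/*
 * Hypothetical handler table. The two callback names mirror the commit
 * message; their signatures are assumptions, not the real interface.
 */
typedef struct SketchAmHandler {
    const char *amname;
    bool (*yb_is_supported)(void);
    void (*yb_am_bind_schema)(SketchSchema *schema, const char **cols, int ncols);
} SketchAmHandler;

static bool sketch_supported(void)
{
    return true;
}

/* Shared column-binding logic, written once and reused by both handlers. */
static void sketch_bind_columns(SketchSchema *schema, const char **cols, int ncols)
{
    assert(schema != NULL && cols != NULL);
    for (int i = 0; i < ncols && schema->ncolumns < SKETCH_MAX_COLUMNS; i++)
        schema->columns[schema->ncolumns++] = cols[i];
}

/* Both YB access methods point at the same binding routine. */
static const SketchAmHandler sketch_lsm   = {"lsm", sketch_supported, sketch_bind_columns};
static const SketchAmHandler sketch_ybgin = {"ybgin", sketch_supported, sketch_bind_columns};

/* A handler that provides no YB callbacks is treated as unsupported. */
static const SketchAmHandler sketch_other = {"other_am", NULL, NULL};

/* Callers ask the handler instead of comparing access method names. */
static bool sketch_index_supported(const SketchAmHandler *am)
{
    return am->yb_is_supported != NULL && am->yb_is_supported();
}
```

In this sketch, a caller would check `sketch_index_supported(&sketch_lsm)` and then invoke `sketch_lsm.yb_am_bind_schema(...)` to fill in the schema. A future vector AM would supply its own binding callback without the caller changing.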
tanujnay112 added a commit that referenced this issue on May 21, 2024
Summary: This change implements the YSQL side of vector index creation. For now, this diff adds support for index creation statements that use a dummy ANN method called `ybdummyann`, in the form

```
create extension vector;
CREATE TABLE items (id bigserial PRIMARY KEY, embedding vector(3));
CREATE INDEX ON items USING ybdummyann (embedding vector_l2_ops);
```

This creates an inverted index in DocDB with a schema that looks like `BaseYBCTID | embedding |`, with only `BaseYBCTID` as the key.

We can do an index ANN scan based on a given query vector, such as
`SELECT * FROM items ORDER BY embedding <-> '[1.0, 0.4, 0.35]' LIMIT 5;`
or an index-only scan, such as
`SELECT embedding FROM items ORDER BY embedding <-> '[1.0, 0.4, 0.35]' LIMIT 5;`

Note that the results from a `ybdummyann` index won't actually be sorted by their distance from the given query vector, because the DocDB side of vector indexing has not been implemented. This is made clear by the following client warning, which is shown when such an index is created. In the future, once we have full end-to-end support for vector indexing, we will add index AMs such as `hnsw` and `ivfflat` that are meant for external usage.

```
WARNING: ybdummyann is meant for internal-testing only. It does not yield ordered results.
```

When a vector index is created, a message of type `PgVectorIdxOptionsPB`, found in `common.proto`, is populated into `IndexInfo`. A log message has been inserted into `tablet.cc` to show how this can be accessed. Vector index scans populate a field of type `PgVectorReadOptionsPB` in the `PgsqlReadRequestPB`.

The relcache preloader is adjusted so that it does not load index relations whose user-defined AM handler procs might not be loaded yet.

A new access method handler called `ybdummyannhandler` is created by this diff. Any future vector index AM/AM handler will share functionality very similar to `ybdummyann`. For this reason, this common functionality is all placed in `src/ybvector/ybvector*`.

The main remaining TODOs after this change are:
- Build out the DocDB side.
- Add capabilities to mergesort rows from tablets based on their distance from the query vector.
- Add an extra key column to denote future sharding information of each row.
- Allow included values.
- Allow a mix of vector and non-vector key attributes.

**Upgrade/Rollback safety:**
This adds vector index protobuf fields that should not be used by any production customer right now.

Jira: DB-11118
Test Plan: ./yb_build.sh --java-test 'org.yb.pgsql.TestPgRegressThirdPartyExtensionsPgvector'
Reviewers: timur, jason, mbautin, sergei
Reviewed By: timur, jason
Subscribers: yql, ybase
Differential Revision: https://phorge.dev.yugabyte.com/D34200
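To make the "mergesort rows from tablets" TODO concrete, here is a minimal C sketch of merging per-tablet result streams by distance to the query vector. It is an illustration only and not the planned implementation: it assumes each tablet already returns its rows sorted by distance, it uses the squared L2 distance (which orders rows the same way as the L2 distance behind the `<->` operator with `vector_l2_ops`), and all `Sketch`/`sketch_` names and the fixed dimension of 3 are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_DIM 3 /* matches vector(3) in the example table */

typedef struct SketchRow {
    long  id;
    float embedding[SKETCH_DIM];
} SketchRow;

/* Squared L2 distance; monotone in L2 distance, so the ordering is the same. */
static double sketch_l2_squared(const float *a, const float *b)
{
    double sum = 0.0;
    for (int i = 0; i < SKETCH_DIM; i++) {
        double d = (double) a[i] - (double) b[i];
        sum += d * d;
    }
    return sum;
}

/* One tablet's rows, assumed to be pre-sorted by distance to the query. */
typedef struct SketchTabletResult {
    const SketchRow *rows;
    size_t           nrows;
    size_t           next; /* index of the next unconsumed row */
} SketchTabletResult;

/*
 * Repeatedly take the closest unconsumed head row across all tablets.
 * Writes at most `limit` row ids to out_ids and returns how many were written.
 */
static size_t sketch_merge_by_distance(SketchTabletResult *tablets, size_t ntablets,
                                       const float *query, long *out_ids, size_t limit)
{
    size_t written = 0;

    assert(tablets != NULL && query != NULL && out_ids != NULL);
    while (written < limit) {
        size_t best = ntablets; /* sentinel: no candidate found yet */
        double best_dist = 0.0;

        for (size_t t = 0; t < ntablets; t++) {
            if (tablets[t].next >= tablets[t].nrows)
                continue; /* this tablet is exhausted */
            double d = sketch_l2_squared(tablets[t].rows[tablets[t].next].embedding, query);
            if (best == ntablets || d < best_dist) {
                best = t;
                best_dist = d;
            }
        }
        if (best == ntablets)
            break; /* every tablet is exhausted */
        out_ids[written++] = tablets[best].rows[tablets[best].next++].id;
    }
    return written;
}
```

A linear scan over the tablet heads keeps the sketch short; with many tablets, a min-heap keyed on distance would be the usual choice.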
svarnau pushed a commit that referenced this issue on May 25, 2024
Summary: Identical to the May 3, 2024 commit message above (Differential Revision: https://phorge.dev.yugabyte.com/D34676).
svarnau pushed a commit that referenced this issue on May 25, 2024
Summary: Identical to the May 21, 2024 commit message above (Differential Revision: https://phorge.dev.yugabyte.com/D34200).
Jira Link: DB-11118
Description
Add initial YSQL code to make ANN vector indexes.
Issue Type
kind/enhancement