[YSQL] Make YSQL support for ANN vector indexes #22195

Closed
1 task done
tanujnay112 opened this issue Apr 30, 2024 · 0 comments
Labels: area/ysql, kind/enhancement, priority/medium


tanujnay112 commented Apr 30, 2024

Jira Link: DB-11118

Description

Add initial YSQL code to make ANN vector indexes.

Issue Type

kind/enhancement

Warning: Please confirm that this issue does not contain any sensitive information

  • I confirm this issue does not contain any sensitive information.
@tanujnay112 tanujnay112 added area/ysql Yugabyte SQL (YSQL) status/awaiting-triage Issue awaiting triage labels Apr 30, 2024
@tanujnay112 tanujnay112 self-assigned this Apr 30, 2024
@yugabyte-ci yugabyte-ci added kind/enhancement This is an enhancement of an existing feature priority/medium Medium priority issue labels Apr 30, 2024
tanujnay112 added a commit that referenced this issue May 3, 2024
Summary:
The ybgin and lsm access methods repeat a lot of logic during column binding. This diff deduplicates that logic. It also adds the `yb_is_supported` method to the IndexAMHandler interface to simplify the code paths that check whether an index is supported by YB, and adds the `yb_am_bind_schema` method to the same interface. The latter lets YB AMs customize the underlying DocDB schema creation and pass along any extra metadata. Right now, the two YB AMs, yblsm and ybgin, implement the same logic for this method.

These changes are useful for upcoming changes where new index types are added by users or extensions, more specifically for the upcoming addition of a vector-based AM.
Jira: DB-11118

Test Plan: Jenkins

Reviewers: jason

Reviewed By: jason

Subscribers: yql

Differential Revision: https://phorge.dev.yugabyte.com/D34676
tanujnay112 added a commit that referenced this issue May 21, 2024
Summary:
This change implements the YSQL side of vector index creation. For now, it adds support for index creation statements that use a dummy ANN method called `ybdummyann`, in the form:

```
create extension vector;
CREATE TABLE items (id bigserial PRIMARY KEY, embedding vector(3));
CREATE INDEX ON items USING ybdummyann (embedding vector_l2_ops);
```

This creates an inverted index in DocDB with a schema that looks like
`BaseYBCTID | embedding |`

With only `BaseYBCTID` as the key.
We can do an index ANN scan based on a given query vector, such as
` SELECT * FROM items ORDER BY embedding <-> '[1.0, 0.4, 0.35]' LIMIT 5; `

or an index only scan such as
` SELECT embedding FROM items ORDER BY embedding <-> '[1.0, 0.4, 0.35]' LIMIT 5; `

Note that the results from a `ybdummyann` index won't actually be sorted by their distance from the given query vector, because the DocDB side of vector indexing has not been implemented yet. The following client warning, emitted when such an index is created, makes this clear. In the future, once vector indexing is supported end to end, we will add index AMs such as `hnsw` and `ivfflat` meant for external usage.
```
WARNING:  ybdummyann is meant for internal-testing only. It does not yield ordered results.
```
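
For comparison, the ordering that a fully implemented index would approximate can be computed exactly by brute force. The following is a minimal Python sketch, not part of this diff: it assumes the L2 (Euclidean) distance that pgvector's `<->` operator computes, and the sample rows and the helper names `l2_distance` and `exact_top_k` are hypothetical.

```python
import heapq
import math


def l2_distance(a, b):
    """Euclidean (L2) distance between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def exact_top_k(rows, query, k):
    """Return the k (id, embedding) rows closest to `query`, nearest first."""
    return heapq.nsmallest(k, rows, key=lambda row: l2_distance(row[1], query))


# Hypothetical sample data standing in for the `items` table.
items = [
    (1, [1.0, 0.0, 0.0]),
    (2, [1.0, 0.4, 0.3]),
    (3, [0.0, 1.0, 0.0]),
    (4, [1.0, 0.5, 0.5]),
]
query = [1.0, 0.4, 0.35]

nearest_ids = [row_id for row_id, _ in exact_top_k(items, query, 2)]
```

A check like this could serve as a reference when validating a real ANN index later; with `ybdummyann`, the returned order is not expected to match it.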

When a vector index is created, a message of type `PgVectorIdxOptionsPB` found in `common.proto` is populated into `IndexInfo`. A log message has been inserted into `tablet.cc` to show how this can be accessed. Vector index scans populate a field of type `PgVectorReadOptionsPB` in the `PgsqlReadRequestPB`.

The relcache preloader is adjusted to not load index relations whose user-defined AM handler procs might not be loaded yet.

A new access method handler called `ybdummyannhandler` is created by this diff.
Any future vector index AM/AM handler will share functionality very similar to `ybdummyann`. For this reason, this common functionality is all placed in `src/ybvector/ybvector*`.

The main remaining TODOs after this change are:
- Build out DocDB side.
- Add capabilities to mergesort rows from tablets based on their distance from the query vector.
- Add an extra key column to denote future sharding information of each row.
- Allow included values.
- Allow a mix of vector and non-vector key attributes.
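
The mergesort TODO above can be pictured as a k-way merge. The sketch below is only an illustration of the idea in Python, not the planned DocDB or YSQL implementation: it assumes each tablet returns `(distance, row_id)` pairs already sorted by ascending distance, and the name `merge_tablet_results` and the sample values are hypothetical.

```python
import heapq
from itertools import islice


def merge_tablet_results(per_tablet, k):
    """Merge per-tablet result lists and return the k nearest rows.

    Each input list holds (distance, row_id) tuples sorted by distance,
    so a lazy k-way merge yields rows in globally ascending order.
    """
    merged = heapq.merge(*per_tablet)  # tuples compare by distance first
    return list(islice(merged, k))


# Hypothetical per-tablet results for one query vector.
tablet_a = [(0.05, 2), (0.53, 1)]
tablet_b = [(0.18, 4), (1.22, 3)]

top = merge_tablet_results([tablet_a, tablet_b], 3)
```

Because the merge is lazy, only as many rows as the `LIMIT` requires are consumed from each tablet's result stream.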

**Upgrade/Rollback safety:**
This adds vector index protobuf fields that should not be used by any production customer right now.

Jira: DB-11118

Test Plan: ./yb_build.sh --java-test 'org.yb.pgsql.TestPgRegressThirdPartyExtensionsPgvector'

Reviewers: timur, jason, mbautin, sergei

Reviewed By: timur, jason

Subscribers: yql, ybase

Differential Revision: https://phorge.dev.yugabyte.com/D34200
svarnau pushed a commit that referenced this issue May 25, 2024
svarnau pushed a commit that referenced this issue May 25, 2024
@sushantrmishra sushantrmishra removed the status/awaiting-triage Issue awaiting triage label Sep 23, 2024