Support latest vectorsearch (dev branch) and hybrid queries #1980

Merged (5 commits) on May 10, 2024

@snej snej commented Mar 14, 2024

  1. Added the official SQLite carray extension, because the latest vectorsearch library requires it.
  2. Added support for hybrid vector queries, where the vector search is combined with other criteria on the collection:
     • When vector_match() is the only criterion in the WHERE clause, or when an explicit max_results arg is given, it's a "plain" query, as before.
     • Otherwise it's a "hybrid" query, which invokes the vectorsearch extension differently (with a JOIN constraint on its rowid column). This is less efficient, but it computes distances for all the rows selected by the other WHERE tests instead of just finding the closest docs in the whole collection, so it gives more accurate results.
  3. In a plain query with no max_results given but a LIMIT on the query itself, the LIMIT is used as the max_results for the vector query. This is intuitive, and it means you only need max_results when you want to force a plain vector query in combination with other conditions.
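
The rules above can be sketched as a small decision function. This is only an illustration of the description in this PR, not LiteCore's actual code: `planVectorQuery`, `VectorQueryPlan`, `VectorQueryKind`, and all parameter names are hypothetical.

```cpp
#include <optional>

// Hypothetical types; they mirror the PR description, not LiteCore internals.
enum class VectorQueryKind { Plain, Hybrid };

struct VectorQueryPlan {
    VectorQueryKind kind;
    std::optional<int> maxResults;  // empty: no cap is derived by this sketch
};

VectorQueryPlan planVectorQuery(bool vectorMatchIsOnlyCriterion,
                                std::optional<int> explicitMaxResults,
                                std::optional<int> queryLimit) {
    // An explicit max_results always forces a plain query,
    // even when other WHERE conditions are present.
    if (explicitMaxResults) {
        return {VectorQueryKind::Plain, explicitMaxResults};
    }
    // vector_match() alone in the WHERE clause: plain query, and the
    // query's own LIMIT (if any) doubles as max_results.
    if (vectorMatchIsOnlyCriterion) {
        return {VectorQueryKind::Plain, queryLimit};
    }
    // Other criteria and no max_results: hybrid query. The LIMIT still
    // bounds the final result set, but it is not passed as max_results.
    return {VectorQueryKind::Hybrid, std::nullopt};
}
```

Under these assumptions, a query with other WHERE conditions and only a LIMIT is planned as hybrid, while the same query with an explicit max_results is planned as plain.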

snej added 2 commits March 12, 2024 15:13
This extension is part of the SQLite source tree but not built in.

The primary impetus for adding it is that the vector-search extension
now uses it. Being an extension, it can't bundle the code itself, so it
requires the owner of the SQLite handle to load it.

It's a great optimization for passing lots of values into a query in
an `IN(...)` clause, without having to encode every single value into
the SQL string. I added it to the SQLiteCpp library, along with a C++
API for using it when binding parameters.

The one place we can immediately use it in LiteCore is
SQLiteKeyStore::withDocBodies(), so I updated that method. It might
give us a tiny boost in replication performance...
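
To illustrate the difference in SQL text, here is a minimal sketch that only builds the two query strings; it does not talk to SQLite. The table `docs`, its columns, and both function names are assumptions made for the example. With the real extension, the array would be attached to parameter 1 through carray's C binding function `sqlite3_carray_bind()`; the SQLiteCpp wrapper mentioned above is not shown because its API is not described here.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Without carray: every value is encoded into the SQL string, so the text
// grows with the input and differs for every distinct set of values.
std::string inClauseWithLiterals(const std::vector<long long>& rowids) {
    std::string sql = "SELECT body FROM docs WHERE rowid IN (";
    for (std::size_t i = 0; i < rowids.size(); ++i) {
        if (i > 0) sql += ",";
        sql += std::to_string(rowids[i]);
    }
    sql += ")";
    return sql;
}

// With carray: the SQL text is constant, and the whole array is bound
// to a single parameter at execution time.
std::string inClauseWithCarray() {
    return "SELECT body FROM docs WHERE rowid IN carray(?1)";
}
```

Because the carray form never changes, one prepared statement can be reused for any number of values.
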
- When vector_match() is the only criterion in the WHERE clause,
  or when an explicit max_results arg is given, it's a "plain" query,
  as before.
- Otherwise it's a "hybrid" query, which invokes the vectorsearch
  extension differently (with a JOIN constraint on its rowid column).
  This is less efficient, but it computes distances for all the rows
  selected by the other WHERE tests, instead of just finding the
  closest docs in the whole collection, so it gives more accurate
  results.
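
Why the hybrid form gives more accurate results can be shown with a toy in-memory model. This is a sketch of one reading of the description above, not of how the vectorsearch extension works internally: `Doc`, `plainQuery`, and `hybridQuery` are hypothetical, and each document's distance to the query vector is assumed to be precomputed.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

struct Doc {
    int id;
    float distance;             // assumed precomputed distance to the query vector
    bool matchesOtherCriteria;  // result of the other WHERE tests
};

static void sortByDistance(std::vector<Doc>& docs) {
    std::sort(docs.begin(), docs.end(),
              [](const Doc& a, const Doc& b) { return a.distance < b.distance; });
}

// Plain: find the closest docs in the whole collection first,
// then apply the other WHERE tests to that short list.
std::vector<int> plainQuery(std::vector<Doc> docs, std::size_t maxResults) {
    sortByDistance(docs);
    if (docs.size() > maxResults) {
        docs.erase(docs.begin() + static_cast<std::ptrdiff_t>(maxResults), docs.end());
    }
    std::vector<int> ids;
    for (const Doc& d : docs) {
        if (d.matchesOtherCriteria) ids.push_back(d.id);
    }
    return ids;
}

// Hybrid: apply the other WHERE tests first, then rank every
// surviving row by distance.
std::vector<int> hybridQuery(std::vector<Doc> docs) {
    docs.erase(std::remove_if(docs.begin(), docs.end(),
                              [](const Doc& d) { return !d.matchesOtherCriteria; }),
               docs.end());
    sortByDistance(docs);
    std::vector<int> ids;
    for (const Doc& d : docs) ids.push_back(d.id);
    return ids;
}
```

In this model, if the nearest documents all fail the other criteria, the plain query can return few or no rows, while the hybrid query still returns every matching row in distance order.
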
@snej snej force-pushed the feature/more-vector-search branch from b4da378 to 3d033f7 Compare March 15, 2024 16:17

pasin commented Apr 23, 2024

@snej Can you fix the Windows build issue so the PR can get reviewed?

cbl-bot commented May 9, 2024

Code Coverage Results:

| Type | Percentage |
|---|---|
| branches | 69.51 |
| functions | 79.88 |
| instantiations | 35.55 |
| lines | 80.06 |
| regions | 76.56 |

@jianminzhao jianminzhao requested review from jianminzhao and pasin May 10, 2024 16:12

@pasin pasin left a comment

In betas 1 and 2, the default limit for vector_match() is 3.

For a hybrid search that doesn't specify a limit, the tests suggest that the default limit of 3 is no longer applied. If that is correct, it is a behavior change that needs to be documented, and we should probably check whether it will cause confusion about when the default limit is applied.

pasin commented May 10, 2024

I have chatted with @jianminzhao to confirm my understanding. Since the default limit will not be applied in hybrid queries, we will need to explain this in the documentation (maybe with some examples). The behavior is intuitive, so I hope users won't find it hard to follow.

@jianminzhao jianminzhao merged commit 374d485 into master May 10, 2024
8 of 9 checks passed
@jianminzhao jianminzhao deleted the feature/more-vector-search branch May 10, 2024 17:59
jianminzhao added a commit that referenced this pull request May 23, 2024
CBL-5629: Update zlib to 1.3.1 (#2032)
CBL-5627: Update min MacOS version to 12.0 (#2033)
CBL-5539: Add an API to check if a vector index is trained or not (#2035)
CBL-5628: Update mbedtls to 2.28.8 (#2027)
374d485 Support latest vectorsearch (dev branch) and hybrid queries (#1980)
5c3c854 Lazy vector index updating (#1949)
CBL-5522: Port - N1QL Parser has exponential slowdown for redundant parentheses (#1984)
ab19634 Part of CBL 5579 in order to facilitate VS on .NET Android (#1993)
CBL-5507: Fix index-past-end in CookieStore (#1982)
CBL-5591: Binary Decoder to account for the new Logging object path (#1995)
294c3f8 Define _LIBCPP_REMOVE_TRANSITIVE_INCLUDES (#1987)
CBL-5438: DateTime standard format parser (#1977)
CBL-5498: Util changes for ConnectedClient (#1978)
CBL-5450: Remote rev KeepBody flag could be cleared accidentally
f8a8de2 Remove UWP builds from build scripts (#1954)
CBL-5425: Binary Encoder to encode the (Logging) object path (#1986)
CBL-4661: Fix ROUND_EVEN. (#1981)
jianminzhao added a commit that referenced this pull request Aug 26, 2024