[RFC] Integrating KNNVectorsFormat in Native Vector Search Engine #1853
Comments
Adding the tasks in this comment, as the issue is already quite large. Tasks
Reopening this issue; it was marked resolved as the PRs were getting merged.
Reopening this issue. It keeps getting closed as the PRs are merged.
The exact search experience improvement will be taken up in the 2.18 release.
I am closing this issue. The exact search experience changes will be taken up in the 2.18/2.19 version of OpenSearch.
Introduction
This issue provides a detailed design for integrating KNNVectorsFormat with the native vector search engines (such as NMSLib and FAISSLib). Beyond moving to the new vector format, the document also proposes improvements to the exact search user experience. It closes with an implementation plan that allows the work to be delivered iteratively.
Background
The k-NN plugin (aka Vector Engine) was added to OpenSearch back in 2019. At that time Lucene did not support any native vector format. To work around this, the decision was made to represent vectors as Binary DocValues and to override the BinaryDocValuesFormat to store vectors and build the vector data structures. Lucene 9.0 added a new format optimized for vectors. Since then the format has evolved and gained features such as iterative graph builds, built-in scalar quantization, and optimized support for reading vectors from disk.
Earlier Investigation
In September 2023 we investigated (thanks to @heemin32, who did the investigation) what it would take to move from BinaryDocValuesFormat to KNNVectorsFormat. The main concerns from that earlier deep-dive are summarized below.
Summarizing the Main Concerns from the earlier investigation (ref: Cons section of this)
Benefits of Moving to KNNVectorsFormat
Below are the top benefits of moving to KNNVectorsFormat:
What about earlier concerns?
Solution
High Level Design
Indexing Flow
Below is the high-level indexing flow. In the KNNFieldsMapper, the k-NN plugin will decide (see later sections for how this decision is made) which VectorField to add to the Lucene document. This field then determines which VectorFormat is used. If we go with the k-NN plugin's VectorField, we will use BinaryDocValues; if we use Lucene's FloatVectorField/ByteVectorField for native engines, we will use NativeEngineKNNVectorsFormat.
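The decision described above can be sketched roughly as follows. This is a minimal illustration in Python rather than the plugin's Java code; the `use_vectors_format` flag is a hypothetical stand-in for the decision covered in later sections, and `FieldChoice` and `choose_field` are made-up names.

```python
from dataclasses import dataclass

NATIVE_ENGINES = {"nmslib", "faiss"}


@dataclass(frozen=True)
class FieldChoice:
    field_type: str  # field the mapper adds to the Lucene document
    storage: str     # format that ends up writing the vectors


def choose_field(engine: str, use_vectors_format: bool,
                 data_type: str = "float") -> FieldChoice:
    """Pick the field type; the field type in turn fixes the format."""
    if engine not in NATIVE_ENGINES:
        raise ValueError(f"engine not covered by this sketch: {engine}")
    if use_vectors_format:
        # New path: Lucene vector fields, written by the native-engine format.
        field_type = "ByteVectorField" if data_type == "byte" else "FloatVectorField"
        return FieldChoice(field_type, "NativeEngineKNNVectorsFormat")
    # Legacy path: the plugin's own VectorField backed by BinaryDocValues.
    return FieldChoice("VectorField", "BinaryDocValues")
```

The point of the sketch is that the mapper is the single place where the choice is made; everything downstream only looks at the field type.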
Components Definition:
Search Flow
There will be no major changes to the search flow for either exact search or approximate nearest neighbor search. The only anticipated change is with efficient filters: when exact search runs as part of efficient filtering, we need to switch between Binary DocValues and KNNVectorValues depending on which values are available for the field. See the next sections for how this will be done seamlessly.
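As a rough sketch of that switch, assume each segment exposes at most one of the two representations for a field. The code below models both as plain dictionaries from doc id to vector, which is a simplification; `pick_source` and `exact_search` are hypothetical names, and squared L2 distance is used only as an example metric.

```python
import heapq
from typing import Dict, Iterable, List, Optional, Sequence, Tuple

Vector = Sequence[float]


def pick_source(knn_vector_values: Optional[Dict[int, Vector]],
                binary_doc_values: Optional[Dict[int, Vector]]) -> Dict[int, Vector]:
    """Use whichever representation the segment actually has for the field."""
    if knn_vector_values is not None:
        return knn_vector_values
    if binary_doc_values is not None:
        return binary_doc_values
    raise LookupError("segment has no vectors for this field")


def exact_search(query: Vector, filtered_doc_ids: Iterable[int],
                 source: Dict[int, Vector], k: int) -> List[Tuple[float, int]]:
    """Brute-force top-k by squared L2 distance over the filtered doc ids."""
    def squared_l2(vector: Vector) -> float:
        return sum((q - v) ** 2 for q, v in zip(query, vector))

    scored = ((squared_l2(source[doc_id]), doc_id)
              for doc_id in filtered_doc_ids if doc_id in source)
    return heapq.nsmallest(k, scored)
```

Because the scoring loop only sees `source`, a segment written before the change and one written after it go through the same exact search code.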
Pros
Cons
Alternatives
Alternative 1: Improve current KNNDocValuesFormat to bridge the feature gap
Improving KNNDocValuesFormat is another option: we could invest in the doc values format so that it supports iterative graph builds and efficient reading of float values. I did a deep-dive on both. For iterative graph builds, Lucene has no support for DocValues; this support exists only for VectorsFormat. For reading floats efficiently, I looked into Lucene's MemorySegmentIndex API and the DocValues reader, and all the relevant classes are marked either package-private or final.
User Experience (No Change in ANN Search, improvements for Exact Search)
The user experience for creating the index and doing approximate nearest neighbor search will remain unchanged. To use the full potential of KNNVectorValues for other use cases, we propose the changes below. With these changes we will also be able to resolve these enhancements (ref1, ref2).
Exact Search and Training Index Creation
Previously, for training indices and exact search indices, we set index.knn to false. As a result the index used the default codec rather than the KNNCodec, and because the default codec does not override the DocValuesFormat, no graphs were created. Looking closely, we can achieve the same behavior with another parameter that is available on every field in OpenSearch: index: true/false. Currently kNNFieldMapper does not take advantage of this parameter, but we can start using it to set a new attribute on the field and then read that attribute later in the codec to decide whether KNN data structures need to be created.
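A minimal sketch of that idea, written in Python for brevity. The mapping body, the attribute name `knn_build_graph`, and both function names are hypothetical, and the default of `index: true` when the parameter is omitted is an assumption carried over from other field types.

```python
# Hypothetical mapping: "index": False marks the field as exact-search only.
training_mapping = {
    "properties": {
        "train_vector": {"type": "knn_vector", "dimension": 128, "index": False},
    }
}

BUILD_GRAPH_ATTRIBUTE = "knn_build_graph"  # hypothetical attribute name


def field_attributes(field_mapping: dict) -> dict:
    """Mapper side: translate the mapping's `index` flag into a field attribute."""
    build = field_mapping.get("index", True)  # assumed default: index the field
    # Attribute values are kept as strings, like Lucene field attributes.
    return {BUILD_GRAPH_ATTRIBUTE: "true" if build else "false"}


def should_build_knn_structures(attributes: dict) -> bool:
    """Codec side: read the attribute back; vectors are stored either way."""
    return attributes.get(BUILD_GRAPH_ATTRIBUTE, "true") == "true"
```

With this split the codec never has to look at index settings: the field itself carries the decision.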
Old
New Proposed Exact Search Optimized Interface, the old interface will still be supported. With this new interface
Exact Search Query Experience
Old
New Experience
The new experience is similar to the ANN search experience. The difference is that if the customer has specified index: false in the field mapping, the Vector Engine will be intelligent enough to switch to exact search behavior.
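On the query side, the switch could look roughly like the sketch below: the request keeps the usual `knn` query shape and only the field mapping decides the execution path. `plan_knn_query` and the returned dictionary are hypothetical, as is the default of `index: true`.

```python
def plan_knn_query(mappings: dict, query: dict) -> dict:
    """Choose the execution path for a knn query from the field mapping alone."""
    field, params = next(iter(query["knn"].items()))
    indexed = mappings["properties"][field].get("index", True)  # assumed default
    return {
        "field": field,
        "strategy": "ann" if indexed else "exact",
        "k": params["k"],
    }


# The same request body is used for both kinds of field.
request = {"knn": {"my_vector": {"vector": [1.0, 2.0], "k": 2}}}
exact_only = {"properties": {"my_vector": {"type": "knn_vector", "dimension": 2, "index": False}}}
with_graph = {"properties": {"my_vector": {"type": "knn_vector", "dimension": 2}}}
```

Here `plan_knn_query(exact_only, request)` selects the exact path and `plan_knn_query(with_graph, request)` selects ANN, without any change to the request.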
Low Level Design
The major Low Level changes are explained below.
New KNNVectorsFormat for Native Engines (aka NativeEngineKNNVectorsFormat)
To use KNNVectorsFormat we will add a new VectorsFormat specifically for the native engines (nmslib and faiss), named NativeEngineKNNVectorsFormat. This KNNVectorsFormat will be used for writing (via NativeEngineKNNVectorsFormatWriter) and reading (via NativeEngineKNNVectorsFormatReader) vector fields when native engines are used. See the class diagram below for more detail and the working POC here.
Common Interface for interacting with StoredVectors(aka BinaryDocValues, FloatVectorValues and ByteVectorValues)
A new KNNVectorValues interface will be added as an abstraction layer on top of FloatVectorValues, ByteVectorValues and BinaryDocValues. KNNVectorValues can then be used in different places, such as the codec and queries (in filters), to iterate over vectors from segments and segment readers. A working POC can be found here.
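To illustrate the shape of such an abstraction (this is not the POC code), here is a small Python sketch with two adapters that yield the same `(doc_id, vector)` pairs. The class names are hypothetical, a ByteVectorValues adapter is left out for brevity, and the big-endian float32 encoding of the binary values is an assumption made for the example, not the plugin's actual serialization.

```python
import struct
from abc import ABC, abstractmethod
from typing import Dict, Iterator, List, Tuple


class KNNVectorValuesSketch(ABC):
    """Uniform (doc_id, vector) iteration regardless of how vectors are stored."""

    @abstractmethod
    def __iter__(self) -> Iterator[Tuple[int, List[float]]]:
        ...


class FromFloatVectorValues(KNNVectorValuesSketch):
    def __init__(self, values: Dict[int, List[float]]):
        self._values = values

    def __iter__(self) -> Iterator[Tuple[int, List[float]]]:
        for doc_id in sorted(self._values):
            yield doc_id, list(self._values[doc_id])


class FromBinaryDocValues(KNNVectorValuesSketch):
    def __init__(self, values: Dict[int, bytes]):
        self._values = values

    def __iter__(self) -> Iterator[Tuple[int, List[float]]]:
        for doc_id in sorted(self._values):
            raw = self._values[doc_id]
            count = len(raw) // 4  # 4 bytes per float32 (assumed encoding)
            yield doc_id, list(struct.unpack(f">{count}f", raw))
```

Callers in the codec or in a filter query would iterate over a `KNNVectorValuesSketch` and never need to know which storage sits underneath.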
Backward Compatibility
To maintain backward compatibility, the new KNNVectorsFormat will be enabled only for indices created on or after a specific OpenSearch version, in this case 2.17 (we are targeting the 2.17 release for this feature). Every index in OpenSearch records the OpenSearch version with which it was created, and we will leverage that parameter here. We already used this parameter when we changed the default hyperparameter values of the HNSW algorithm, so we have high confidence that this will work.
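The version gate amounts to a single comparison. The sketch below uses dotted version strings for readability, which is a simplification of how OpenSearch represents the index-created version internally; the function and constant names are hypothetical.

```python
from typing import Tuple

MIN_VERSION_FOR_VECTORS_FORMAT = (2, 17, 0)


def parse_version(text: str) -> Tuple[int, int, int]:
    """Turn 'major.minor.patch' into a tuple that compares numerically."""
    major, minor, patch = (int(part) for part in text.split("."))
    return major, minor, patch


def uses_knn_vectors_format(index_created_version: str) -> bool:
    """New format only for indices created on or after the cutoff version."""
    return parse_version(index_created_version) >= MIN_VERSION_FOR_VECTORS_FORMAT
```

Comparing parsed tuples rather than raw strings matters: as strings, "2.9.0" would sort after "2.17.0" and an old index would wrongly be treated as new.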
Feasibility Study
I did a small POC with KNNVectorValues and ran all the BWC tests; there were no failures. Here is the POC1 code for that. The benchmarks below were run with this POC code. So we can confirm that the new format works, is backward compatible, and is performant.
Benchmarking
We will use our nightly runs to benchmark the performance of this change. No special benchmarking is required apart from a sanity test with a 1M-vector, 768-dimension dataset on a configuration similar to that of the nightly runs.
Testing Strategy
Backward Compatibility Testing Plan
We will use the BWC rolling-upgrade and restart-upgrade tests to test backward compatibility for this change. No separate changes are required, as these tests cover both indexing and search.
Integration Testing Plan
Future Improvements/Ideas
Below are some future improvements that I think could be added after this implementation.
FAQ
What is BinaryDocValuesFormat?
This format defines how to read and write a field whose doc values are stored in binary form. Before this change, the k-NN plugin used BinaryDocValuesFormat to index vectors.
What is KNNVectorsFormat?
This is a format introduced in Lucene 9.0 that is tailor-made for indexing and retrieving dense vectors.
Appendix
Appendix A
Benchmarks sift-128
Updated Code
Baseline
Benchmarks cohere-768
Updated Code
Baseline code
Reference