Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Indexing Support for Lucene Byte Sized Vector #937

Conversation

naveentatikonda
Copy link
Member

Description

This PR contains changes which adds indexing support to lucene byte sized vector and corresponding tests to validate it. It helps users to index vectors as byte sized vectors(which theoretically saves 75% of memory when compared to float vectors) by setting the optional data_type field as byte while creating the index. As we are not adding support for Quantization Techniques, users are expected to index vectors that are in the byte range [-128 to 127] without any decimal values. Also, right now this feature is only supported for lucene engine.

Issues Resolved

#812

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • Commits are signed as per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: Naveen Tatikonda <navtat@amazon.com>
Signed-off-by: Naveen Tatikonda <navtat@amazon.com>
@naveentatikonda naveentatikonda added Features Introduces a new unit of functionality that satisfies a requirement v2.9.0 labels Jun 15, 2023
@naveentatikonda naveentatikonda self-assigned this Jun 15, 2023
@codecov
Copy link

codecov bot commented Jun 15, 2023

Codecov Report

Merging #937 (5c73aa2) into feature/lucene_byte_vector (a2ee834) will decrease coverage by 0.32%.
The diff coverage is 80.91%.

@@                       Coverage Diff                        @@
##             feature/lucene_byte_vector     #937      +/-   ##
================================================================
- Coverage                         85.32%   85.00%   -0.32%     
- Complexity                         1117     1129      +12     
================================================================
  Files                               152      154       +2     
  Lines                              4519     4616      +97     
  Branches                            405      412       +7     
================================================================
+ Hits                               3856     3924      +68     
- Misses                              480      502      +22     
- Partials                            183      190       +7     
Impacted Files Coverage Δ
...opensearch/knn/index/mapper/LuceneFieldMapper.java 71.15% <64.00%> (-3.27%) ⬇️
...ain/java/org/opensearch/knn/index/VectorField.java 71.42% <66.66%> (-3.58%) ⬇️
...nsearch/knn/index/mapper/KNNVectorFieldMapper.java 84.10% <80.85%> (+0.29%) ⬆️
...rch/knn/index/mapper/KNNVectorFieldMapperUtil.java 86.84% <86.84%> (ø)
...n/java/org/opensearch/knn/common/KNNConstants.java 93.75% <100.00%> (+0.41%) ⬆️
.../java/org/opensearch/knn/index/VectorDataType.java 100.00% <100.00%> (ø)

... and 3 files with indirect coverage changes

@naveentatikonda
Copy link
Member Author

The test failures are not related to this. The CI is failing on Windows when it is trying to refresh the env after setting mingw path. Created an issue to track it.

Copy link
Collaborator

@navneet1v navneet1v left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add a unit test for doing exact search with byte vector using K-NN painless. Since we are changing the doc values I want to make sure that all cases are covered.

@navneet1v
Copy link
Collaborator

A high level comment,
with byte size vector and Float type, the doc value serialization and de-serilization should be different. I am not seeing that code in this PR. Can you please add will it be present in follow-up PRs?

@naveentatikonda
Copy link
Member Author

A high level comment, with byte size vector and Float type, the doc value serialization and de-serilization should be different. I am not seeing that code in this PR. Can you please add will it be present in follow-up PRs?

This is where the serialization process is done for docValues.

Signed-off-by: Naveen Tatikonda <navtat@amazon.com>
src/main/java/org/opensearch/knn/common/KNNConstants.java Outdated Show resolved Hide resolved
src/main/java/org/opensearch/knn/index/VectorDataType.java Outdated Show resolved Hide resolved
src/main/java/org/opensearch/knn/index/VectorDataType.java Outdated Show resolved Hide resolved
src/main/java/org/opensearch/knn/index/VectorDataType.java Outdated Show resolved Hide resolved
src/main/java/org/opensearch/knn/index/VectorDataType.java Outdated Show resolved Hide resolved
return Optional.empty();
}
validateVectorDimension(dimension, vector.size());
byte[] array = new byte[vector.size()];
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we use byte[] from the start instead of converting from ArrayList to [] later?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Initially, we don't know what will be the size of the vector provided by user. So, we are using an ArrayList. After parsing everything then we are validating it's length against the dimension value.


vector.add(value);
validateByteVectorValues(value);
vector.add((byte) value);
token = context.parser().nextToken();
}
} else if (token == XContentParser.Token.VALUE_NUMBER) {
value = context.parser().floatValue();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not read intValue instead of floatValue?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For bytes user is expected not to provide decimal values. If we try to parse it as intValue then if user provides any decimal values then it will be trimmed and we can't validate that scenario.

Copy link
Member

@martin-gaievski martin-gaievski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At the first glance I think we need BWC for this, say today index has fields and data with float32 then we need to make sure we still can read that data

src/main/java/org/opensearch/knn/index/VectorDataType.java Outdated Show resolved Hide resolved
src/main/java/org/opensearch/knn/index/VectorDataType.java Outdated Show resolved Hide resolved
src/main/java/org/opensearch/knn/index/VectorDataType.java Outdated Show resolved Hide resolved
src/main/java/org/opensearch/knn/index/VectorDataType.java Outdated Show resolved Hide resolved
src/main/java/org/opensearch/knn/index/VectorDataType.java Outdated Show resolved Hide resolved
}

// Create an index with byte vector data_type and add a doc with decimal values which should throw exception
public void testInvalidVectorData() throws Exception {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this test case should be in unit test for mapper as well, maybe with some more edge cases

@naveentatikonda naveentatikonda force-pushed the feature/lucene_8_bit branch 3 times, most recently from 96140ee to c931a77 Compare July 1, 2023 01:23
@naveentatikonda
Copy link
Member Author

At the first glance I think we need BWC for this, say today index has fields and data with float32 then we need to make sure we still can read that data

Sure, will raise a separate PR for BWC Tests and any other missing tests

@naveentatikonda naveentatikonda force-pushed the feature/lucene_8_bit branch 2 times, most recently from 56f3930 to ffd2ffa Compare July 3, 2023 22:29
Signed-off-by: Naveen Tatikonda <navtat@amazon.com>
@naveentatikonda naveentatikonda merged commit 1386519 into opensearch-project:feature/lucene_byte_vector Jul 6, 2023
9 of 37 checks passed
naveentatikonda added a commit to naveentatikonda/k-NN that referenced this pull request Jul 12, 2023
…#937)

* Add Indexing Support for Lucene Byte Sized Vector

Signed-off-by: Naveen Tatikonda <navtat@amazon.com>

* Add tests for Indexing

Signed-off-by: Naveen Tatikonda <navtat@amazon.com>

* Add CHANGELOG

Signed-off-by: Naveen Tatikonda <navtat@amazon.com>

* Address Review Comments

Signed-off-by: Naveen Tatikonda <navtat@amazon.com>

---------

Signed-off-by: Naveen Tatikonda <navtat@amazon.com>
naveentatikonda added a commit that referenced this pull request Jul 12, 2023
* Add Indexing Support for Lucene Byte Sized Vector (#937)

* Add Indexing Support for Lucene Byte Sized Vector

Signed-off-by: Naveen Tatikonda <navtat@amazon.com>

* Add tests for Indexing

Signed-off-by: Naveen Tatikonda <navtat@amazon.com>

* Add CHANGELOG

Signed-off-by: Naveen Tatikonda <navtat@amazon.com>

* Address Review Comments

Signed-off-by: Naveen Tatikonda <navtat@amazon.com>

---------

Signed-off-by: Naveen Tatikonda <navtat@amazon.com>

* Add Querying Support to Lucene Byte Sized Vector (#956)

* Add Querying Support to Lucene Byte Sized Vector

Signed-off-by: Naveen Tatikonda <navtat@amazon.com>

* Add CHANGELOG

Signed-off-by: Naveen Tatikonda <navtat@amazon.com>

* Address Review Comments

Signed-off-by: Naveen Tatikonda <navtat@amazon.com>

---------

Signed-off-by: Naveen Tatikonda <navtat@amazon.com>

* Add DocValues Support for Lucene Byte Sized Vector (#953)

Signed-off-by: Naveen Tatikonda <navtat@amazon.com>

* Update Release Notes

Signed-off-by: Naveen Tatikonda <navtat@amazon.com>

---------

Signed-off-by: Naveen Tatikonda <navtat@amazon.com>
naveentatikonda added a commit that referenced this pull request Jul 12, 2023
* Add Indexing Support for Lucene Byte Sized Vector (#937)

* Add Indexing Support for Lucene Byte Sized Vector

Signed-off-by: Naveen Tatikonda <navtat@amazon.com>

* Add tests for Indexing

Signed-off-by: Naveen Tatikonda <navtat@amazon.com>

* Add CHANGELOG

Signed-off-by: Naveen Tatikonda <navtat@amazon.com>

* Address Review Comments

Signed-off-by: Naveen Tatikonda <navtat@amazon.com>

---------

Signed-off-by: Naveen Tatikonda <navtat@amazon.com>

* Add Querying Support to Lucene Byte Sized Vector (#956)

* Add Querying Support to Lucene Byte Sized Vector

Signed-off-by: Naveen Tatikonda <navtat@amazon.com>

* Add CHANGELOG

Signed-off-by: Naveen Tatikonda <navtat@amazon.com>

* Address Review Comments

Signed-off-by: Naveen Tatikonda <navtat@amazon.com>

---------

Signed-off-by: Naveen Tatikonda <navtat@amazon.com>

* Add DocValues Support for Lucene Byte Sized Vector (#953)

Signed-off-by: Naveen Tatikonda <navtat@amazon.com>

* Update Release Notes

Signed-off-by: Naveen Tatikonda <navtat@amazon.com>

---------

Signed-off-by: Naveen Tatikonda <navtat@amazon.com>
(cherry picked from commit bf04854)
naveentatikonda added a commit that referenced this pull request Jul 12, 2023
* Add Indexing Support for Lucene Byte Sized Vector (#937)

* Add Indexing Support for Lucene Byte Sized Vector

Signed-off-by: Naveen Tatikonda <navtat@amazon.com>

* Add tests for Indexing

Signed-off-by: Naveen Tatikonda <navtat@amazon.com>

* Add CHANGELOG

Signed-off-by: Naveen Tatikonda <navtat@amazon.com>

* Address Review Comments

Signed-off-by: Naveen Tatikonda <navtat@amazon.com>

---------

Signed-off-by: Naveen Tatikonda <navtat@amazon.com>

* Add Querying Support to Lucene Byte Sized Vector (#956)

* Add Querying Support to Lucene Byte Sized Vector

Signed-off-by: Naveen Tatikonda <navtat@amazon.com>

* Add CHANGELOG

Signed-off-by: Naveen Tatikonda <navtat@amazon.com>

* Address Review Comments

Signed-off-by: Naveen Tatikonda <navtat@amazon.com>

---------

Signed-off-by: Naveen Tatikonda <navtat@amazon.com>

* Add DocValues Support for Lucene Byte Sized Vector (#953)

Signed-off-by: Naveen Tatikonda <navtat@amazon.com>

* Update Release Notes

Signed-off-by: Naveen Tatikonda <navtat@amazon.com>

---------

Signed-off-by: Naveen Tatikonda <navtat@amazon.com>
(cherry picked from commit bf04854)
naveentatikonda added a commit that referenced this pull request Jul 12, 2023
* Add Indexing Support for Lucene Byte Sized Vector (#937)

* Add Indexing Support for Lucene Byte Sized Vector

Signed-off-by: Naveen Tatikonda <navtat@amazon.com>

* Add tests for Indexing

Signed-off-by: Naveen Tatikonda <navtat@amazon.com>

* Add CHANGELOG

Signed-off-by: Naveen Tatikonda <navtat@amazon.com>

* Address Review Comments

Signed-off-by: Naveen Tatikonda <navtat@amazon.com>

---------

Signed-off-by: Naveen Tatikonda <navtat@amazon.com>

* Add Querying Support to Lucene Byte Sized Vector (#956)

* Add Querying Support to Lucene Byte Sized Vector

Signed-off-by: Naveen Tatikonda <navtat@amazon.com>

* Add CHANGELOG

Signed-off-by: Naveen Tatikonda <navtat@amazon.com>

* Address Review Comments

Signed-off-by: Naveen Tatikonda <navtat@amazon.com>

---------

Signed-off-by: Naveen Tatikonda <navtat@amazon.com>

* Add DocValues Support for Lucene Byte Sized Vector (#953)

Signed-off-by: Naveen Tatikonda <navtat@amazon.com>

* Update Release Notes

Signed-off-by: Naveen Tatikonda <navtat@amazon.com>

---------

Signed-off-by: Naveen Tatikonda <navtat@amazon.com>
(cherry picked from commit bf04854)
Signed-off-by: Naveen Tatikonda <navtat@amazon.com>
naveentatikonda added a commit that referenced this pull request Jul 12, 2023
* Add Indexing Support for Lucene Byte Sized Vector (#937)

* Add Indexing Support for Lucene Byte Sized Vector

Signed-off-by: Naveen Tatikonda <navtat@amazon.com>

* Add tests for Indexing

Signed-off-by: Naveen Tatikonda <navtat@amazon.com>

* Add CHANGELOG

Signed-off-by: Naveen Tatikonda <navtat@amazon.com>

* Address Review Comments

Signed-off-by: Naveen Tatikonda <navtat@amazon.com>

---------

Signed-off-by: Naveen Tatikonda <navtat@amazon.com>

* Add Querying Support to Lucene Byte Sized Vector (#956)

* Add Querying Support to Lucene Byte Sized Vector

Signed-off-by: Naveen Tatikonda <navtat@amazon.com>

* Add CHANGELOG

Signed-off-by: Naveen Tatikonda <navtat@amazon.com>

* Address Review Comments

Signed-off-by: Naveen Tatikonda <navtat@amazon.com>

---------

Signed-off-by: Naveen Tatikonda <navtat@amazon.com>

* Add DocValues Support for Lucene Byte Sized Vector (#953)

Signed-off-by: Naveen Tatikonda <navtat@amazon.com>

* Update Release Notes

Signed-off-by: Naveen Tatikonda <navtat@amazon.com>

---------

Signed-off-by: Naveen Tatikonda <navtat@amazon.com>
(cherry picked from commit bf04854)
Signed-off-by: Naveen Tatikonda <navtat@amazon.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Features Introduces a new unit of functionality that satisfies a requirement v2.9.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants