-
Notifications
You must be signed in to change notification settings - Fork 25.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Script: Fields API for Dense Vector #83550
Script: Fields API for Dense Vector #83550
Conversation
Adds the fields API for `dense_vector` field mapper. Adds a `DenseVector` interface for the value type. Implemented by: * `KnnDenseVector` which wraps a decoded float array from `VectorValues` * `BinaryDenseVector` which lazily decodes a `BytesRef` from `BinaryDocValues` The `DenseVector` API is simliar to `BinaryDenseVectorScriptDocValues`. ``` float getMagnitude(); double dotProduct(float[]); double l1Norm(float[]); double l2Norm(float[]); float[] getVector(); int dims(); boolean isEmpty(); // does the value exist int size(); // 0 if isEmpty(), 1 otherwise PrimitiveIterator.OfDouble iterator() ``` The `DenseVectorDocValuesField` abstract class contains two getter APIS. It is implemented by `KnnDenseVectorDocValuesField` and `BinaryDenseVectorDocValuesField`. ``` DenseVector get() DenseVector get(DenseVector) ``` The `get()` method is included because there isn't a good default dense vector, so that API returns an empty `DenseVector` which throws an `IllegalArgumentException` for all method calls other than `isEmpty()`, `size()` and `iterator()`. The empty dense vector will always be `DenseVector.EMPTY` in case users want to use equality checks. Refs: elastic#79105
Hi @stu-elastic, I've created a changelog YAML for you. |
…ticsearch into pain_field_dense_vector
…ield_dense_vector
Pinging @elastic/es-core-infra (Team:Core/Infra) |
@jtibshirani can you particularly take a look to make sure I moved the vector operations |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good to me from the scripting perspective. Left some comments. Please wait for @jtibshirani to review before committing.
import java.util.Iterator; | ||
import java.util.NoSuchElementException; | ||
|
||
public interface DenseVector extends Iterable<Float> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should cosineSimilarity
be part of this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I ported over the ScriptDocValues
API which was also missing this. We'd have to normalize the query vector. @jtibshirani, what are your thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree we should include cosineSimilarity
, this was available in the old API and is something people will expect.
Some other notes on API:
- I don't think it's helpful to extend
Iterable<Float>
, since you can just retrieve the vector throughgetVector
. - I liked the old name
vectorValue
a little more. Currently you'd writeDenseVector#getVector
which seems kind of weird since the object is already a vector?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
double dotProduct(QueryVector queryVector); | ||
|
||
double l1Norm(QueryVector queryVector); | ||
|
||
double l2Norm(QueryVector queryVector); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should any of these take an additional boolean parameter to normalize the QueryVector? Or should we document that these expect the QueryVector to be normalized?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Those don't need to be normalized, is my understanding.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's right, they do not expect the query vector to be normalized. Only cosine similarity needs to normalize the vector (and this should happen "behind the scenes").
public interface DenseVector extends Iterable<Float> { | ||
float getMagnitude(); | ||
|
||
// The implementations of these vector operations |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not a helpful comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed.
/** | ||
* Get the DenseVector for a document if one exists, DenseVector.EMPTY otherwise | ||
*/ | ||
public abstract DenseVector get(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we should we want to add this now? It may be beneficial to figure out how we plan to implement implicit default values for the scripting fields API in general first since deprecation is a long-term process if we decide to do something different.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Based on discussions with the search team, the only sensible defaults are the empty vector I've added and null
. That is what makes this field type an outlier. I think it's worth keeping it.
this.dims = dims; | ||
} | ||
|
||
public int dims() { | ||
return dims; | ||
} | ||
|
||
private DenseVector getVectorChecked() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe rename getCheckedVector
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Renamed.
throw new IllegalArgumentException(badElement(element, i)); | ||
} | ||
|
||
protected static String badElement(Object element, int index) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like this is only used a single time so please just move the message to the get
method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved.
return new ListQueryVector(vector); | ||
} | ||
|
||
public static QueryVector fromObject(Object vector) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
May want a comment that is also for Painless support as we do not have type overloading.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added.
@@ -64,30 +58,14 @@ public SortedBinaryDocValues getBytesValues() { | |||
if (indexed) { | |||
VectorValues values = reader.getVectorValues(field); | |||
if (values == null || values == VectorValues.EMPTY) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You don't need the null check here anymore.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed.
// There's no way for KnnDenseVectorDocValuesField to reliably differentiate between VectorValues.EMPTY and | ||
// values that can be iterated through. Since VectorValues.EMPTY throws on docID(), pass a null instead. | ||
values = null; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I find this behavior to be a bit strange in that VectorValues.EMPTY
will throw an exception, but null
does not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
VectorValues.EMPTY
is part of the Lucene API, so it cannot be easily changed as part of this PR
# implementation of DenseVectorDocValuesField | ||
class org.elasticsearch.xpack.vectors.query.KnnDenseVectorDocValuesField { | ||
} | ||
|
||
# implementation of DenseVectorDocValuesField | ||
class org.elasticsearch.xpack.vectors.query.BinaryDenseVectorDocValuesField { | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While not necessary as these are top-level field classes, for consistency we have duck-typed all of them so far.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @stu-elastic, this looks good to me overall. I had one high-level question about the QueryVector
class -- why do we need this class? Can we just force the user to pass in a float array? Or always convert to an array up-front? In our testing we found that converting to a float array upfront was good for performance, since you avoid the unboxing costs of iterating over a List<Float>
.
*/ | ||
public abstract DenseVector get(); | ||
|
||
public abstract DenseVector get(DenseVector defaultValue); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm really doubtful people will use this API because there's no natural value for a "default" high dimensional vector. I think we could drop it (if that fits with how the rest of the API is designed).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We'd like it to be consistent. There'd only be two "defaults" available right now, DenseVector.EMPTY
and null
, which is unfortunate but the nature of this type.
import java.util.Iterator; | ||
import java.util.NoSuchElementException; | ||
|
||
public interface DenseVector extends Iterable<Float> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree we should include cosineSimilarity
, this was available in the old API and is something people will expect.
Some other notes on API:
- I don't think it's helpful to extend
Iterable<Float>
, since you can just retrieve the vector throughgetVector
. - I liked the old name
vectorValue
a little more. Currently you'd writeDenseVector#getVector
which seems kind of weird since the object is already a vector?
double dotProduct(QueryVector queryVector); | ||
|
||
double l1Norm(QueryVector queryVector); | ||
|
||
double l2Norm(QueryVector queryVector); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's right, they do not expect the query vector to be normalized. Only cosine similarity needs to normalize the vector (and this should happen "behind the scenes").
public BinaryDenseVector(BytesRef value, int dims, Version indexVersion) { | ||
this.value = value; | ||
this.indexVersion = indexVersion; | ||
this.vector = new float[dims]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should avoid eagerly creating this vector -- it can happen on every doc and the allocation is significant (the vector dims
can be as high as 2048). Maybe we can put it off to getVector()
and cache the value?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
The Field interface extends
Users rarely see the class name in painless, check out the rest tests to see how that ends up: I can change it to
cosine similarity is: The query vector will be normalized on every API call in this case. I can do the normalization without allocating a new
|
Re:
My thinking around this class was if you are using the query vector once, it's the same cost, minus the array allocation, to do the conversion in line. Users of the class bindings in Users of this API have to pay that cost on every execution unless we add new painless features. I could remove this class and add a If we go this route, I'd add those augmentations in a different pr (we'd want ones for |
Thanks for the explanations! I agree Maybe we could catch up offline about QueryVector to discuss the alternatives? |
Based on offline discussion with @jtibshirani, A implies
|
…kip, lazy allocate BinaryDenseVector.vector array
@jdconrad and @jtibshirani I've updated the PR based on the comments above, can you take another look?
|
…ield_dense_vector
Do we need |
Indeed this is useful to have, people were using it as part of custom calculations (#51964). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left some last small comments, apart from these it looks good to me!
return dims; | ||
} | ||
|
||
protected static ByteBuffer wrap(BytesRef dv) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this need to be protected?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Made wrap
private.
/** | ||
* dotProduct of doc vector and query vector that normalizes the query vector while performing the calculation. | ||
*/ | ||
protected double dotProduct(QueryVector qv, float qvMagnitude) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we need these special methods that take qvMagnitude
, because this calculation is the equivalent to just doing a normal dotProduct
, then dividing the result by qvMagnitude
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed these methods and updated cosSim implementation.
checkDimensions(getDims(), qv.size()); | ||
return dotProduct(qv); | ||
|
||
} else if (queryVector instanceof List<?> list) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could this just be instanceof List<Number>
? When we wrote the first version of vectors in painless, I think we could assume the param would have this type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can't due to type erasure. We can check instanceof List<?>
and then do an unchecked cast below, adding @SuppressWarnings("unchecked")
.
If they send in a bad type, they'll get a different reason message. For the test
- do:
catch: bad_request
search:
body:
query:
script_score:
query: { "exists": { "field": "bdv" } }
script:
source: |
(int) field(params.field).get().l2Norm(params.query)
params:
query: [1, 2, "three"]
field: bdv
the error will look like
"failed_shards" : [
{
"shard" : 0,
"index" : "test-index",
"node" : "ueT4QigUSbK_ngYxjFfLZA",
"reason" : {
"type" : "script_exception",
"reason" : "runtime error",
"script_stack" : [
"org.elasticsearch.xpack.vectors.query.BinaryDenseVector.l2Norm(BinaryDenseVector.java:127)",
"org.elasticsearch.xpack.vectors.query.DenseVector.l2Norm(DenseVector.java:89)",
"(int) field(params.field).get().l2Norm(params.query)
",
" ^---- HERE"
],
"script" : "(int) field(params.field).get().l2Norm(params.query) ...",
"lang" : "painless",
"position" : {
"offset" : 45,
"start" : 0,
"end" : 53
},
"caused_by" : {
"type" : "class_cast_exception",
"reason" : "class java.lang.String cannot be cast to class java.lang.Number (java.lang.String and java.lang.Number are in module java.base of loader 'bootstrap')",
"stack_trace" : "java.lang.ClassCastException: class java.lang.String cannot be cast to class java.lang.Number (java.lang.String and java.lang.Number are in module java.base of loader 'bootstrap')
at org.elasticsearch.xpack.vectors.query.BinaryDenseVector.l2Norm(BinaryDenseVector.java:127)
currently, the error looks like
"failed_shards" : [
{
"shard" : 0,
"index" : "test-index",
"node" : "Fty-bHa-RC6HTekobvpLrw",
"reason" : {
"type" : "script_exception",
"reason" : "runtime error",
"script_stack" : [
"org.elasticsearch.xpack.vectors.query.QueryVector.get(QueryVector.java:44)",
"org.elasticsearch.xpack.vectors.query.BinaryDenseVector.l2Norm(BinaryDenseVector.java:126)",
"org.elasticsearch.xpack.vectors.query.DenseVector.l2Norm(DenseVector.java:89)",
"(int) field(params.field).get().l2Norm(params.query)
",
" ^---- HERE"
],
"script" : "(int) field(params.field).get().l2Norm(params.query) ...",
"lang" : "painless",
"position" : {
"offset" : 45,
"start" : 0,
"end" : 53
},
"caused_by" : {
"type" : "illegal_argument_exception",
"reason" : "Cannot treat [three] at index [2] of type [java.lang.String] as Number",
"stack_trace" : "java.lang.IllegalArgumentException: Cannot treat [three] at index [2] of type [java.lang.String] as Number
at org.elasticsearch.xpack.vectors.query.QueryVector.get(QueryVector.java:44)
ie the reason will be
class java.lang.String cannot be cast to class java.lang.Number (java.lang.String and java.lang.Number are in module java.base of loader 'bootstrap')
vs
Cannot treat [three] at index [2] of type [java.lang.String] as Number
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reading quickly, I actually find the first one easier to understand, so happy to go with that if it helps simplify the code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Works for me.
* Provide consistent access to query vector elements from a backing list or float array. Used to provide | ||
* a flexible API for DenseVector in painless, which does not have method overloading. | ||
*/ | ||
public class QueryVector { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there still a benefit to introducing this wrapper class, or could we just use List<Number>
directly in the vector functions? (Sorry if I missed this in our conversations)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The benefit is avoiding an unchecked cast and a nicer error message if the user sends in a mixed type List
.
We could avoid the wrapper and use a helper instead method, that would still give a nice error and remove the QueryVector
helper.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed QueryVector
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All the changes look good to me! Thanks for the work on this @stu-elastic
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for all the iterations! We have rally nightly benchmarks for dense_vector
performance, I will keep an eye on them to double-check there's no performance regression with the original docvalues syntax.
@@ -99,15 +81,15 @@ public double l2Norm(QueryVector queryVector) { | |||
@Override | |||
public double cosineSimilarity(float[] queryVector, boolean normalizeQueryVector) { | |||
if (normalizeQueryVector) { | |||
return dotProduct(queryVector, DenseVector.getMagnitude(queryVector)) / getMagnitude(); | |||
return (dotProduct(queryVector) / DenseVector.getMagnitude(queryVector)) / getMagnitude(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Small comment, we usually would do dotProduct(queryVector) / (DenseVector.getMagnitude(queryVector) * getMagnitude());
since it's more readable (similar to how you'd write it mathematically) and potentially a little faster.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense, changed.
@elasticmachine run elasticsearch-ci/part-2 |
…ijun/elasticsearch into fix-none-tsdb-index-dimension-tests * 'fix-none-tsdb-index-dimension-tests' of github.com:weizijun/elasticsearch: (37 commits) [docs] Mention JDK 17 in the Contributing docs (elastic#84018) Fix GeoIpDownloader startup during rolling upgrade (elastic#84000) Script: Fields API for Dense Vector (elastic#83550) Move InferenceConfigUpdate under VersionedNamedWriteable (elastic#84022) [ML] Fix license feature test cleanup (elastic#84020) Replace deprecated api in artifact transforms (elastic#84015) QL: Add leniency option to SQL CLI (elastic#83795) [Stack Monitoring] add kibana_stats version alias to -mb template (elastic#83930) Optimize spliterator for ImmutableOpenMap (elastic#83899) Feature usage actions for archive (elastic#83931) Use latch to speedup multi feature migration test (elastic#84007) Make action names available in NodeClient (elastic#83919) [DOCS] Re-add HTTP proxy setings from elastic#82737 (elastic#84001) Add CI matrix configuration for snapshot BWC versions (elastic#83990) Update YAML Rest tests to check for product header on all responses (elastic#83290) TSDB: Add time series aggs cancellation (elastic#83492) [DOCS] Fix percolate query headings (elastic#83988) [DOCS] Move tip for percolate query example (elastic#83972) Simplify LocalExporter cleaner function to fix failing tests (elastic#83812) [GCE Discovery] Correcly handle large zones with 500 or more instances (elastic#83785) ...
Adds the fields API for `dense_vector` field mapper. Adds a `DenseVector` interface for the value type. Implemented by: * `KnnDenseVector` which wraps a decoded float array from `VectorValues` * `BinaryDenseVector` which lazily decodes a `BytesRef` from `BinaryDocValues` The vector operations have moved into those implements from `BinaryDenseVectorScriptDocValues.java` and `KnnDenseVectorScriptDocValues.java`, respectively. The `DenseVector` API is: ``` float getMagnitude(); double dotProduct(float[] | List); double l1Norm(float[] | List); double l2Norm(float[] | List); float[] getVector(); int dims(); boolean isEmpty(); // does the value exist int size(); // 0 if isEmpty(), 1 otherwise Iterator<Float> iterator() ``` `dotProduct`, `l1Norm` and `l2Norm` take a `float[]` or a `List` via the a delegating `default` method on the `DenseVector` interface. The `DenseVectorDocValuesField` abstract class contains two getter APIS. It is implemented by `KnnDenseVectorDocValuesField` and `BinaryDenseVectorDocValuesField`. ``` DenseVector get() DenseVector get(DenseVector defaultValue) ``` The `get()` method is included because there isn't a good default dense vector, so that API returns an empty `DenseVector` which throws an `IllegalArgumentException` for all method calls other than `isEmpty()`, `size()` and `iterator()`. The empty dense vector will always be `DenseVector.EMPTY` in case users want to use equality checks. Refs: elastic#79105
Adds the fields API for
dense_vector
field mapper.Adds a
DenseVector
interface for the value type.Implemented by:
KnnDenseVector
which wraps a decoded float array fromVectorValues
BinaryDenseVector
which lazily decodes aBytesRef
fromBinaryDocValues
The vector operations have moved into those implements from
BinaryDenseVectorScriptDocValues.java
andKnnDenseVectorScriptDocValues.java
, respectively.The
DenseVector
API is:dotProduct
,l1Norm
andl2Norm
take afloat[]
or aList
via thea delegating
default
method on theDenseVector
interface.The
DenseVectorDocValuesField
abstract class contains two getter APIS.It is implemented by
KnnDenseVectorDocValuesField
andBinaryDenseVectorDocValuesField
.The
get()
method is included because there isn't a good default dense vector,so that API returns an empty
DenseVector
which throws anIllegalArgumentException
for all method calls other thanisEmpty()
,size()
anditerator()
.The empty dense vector will always be
DenseVector.EMPTY
in case users wantto use equality checks.
Refs: #79105