-
Notifications
You must be signed in to change notification settings - Fork 25k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add per-field metadata. #49419
Add per-field metadata. #49419
Conversation
This PR adds per-field metadata that can be set in the mappings and is later returned by the field capabilities API. This metadata is completely opaque to Elasticsearch but may be used by tools that index data in Elasticsearch to communicate metadata about fields with tools that then search this data. A typical example that has been requested in the past is the ability to attach a unit to a numeric field. In order to not bloat the cluster state, Elasticsearch requires that this metadata be small: - keys can't be longer than 20 chars, - values can only be numbers or strings of no more than 50 chars - no inner arrays or objects, - the metadata can't have more than 5 keys in total. Given that metadata is opaque to Elasticsearch, field capabilities don't try to do anything smart when merging metadata about multiple indices, the union of all field metadatas is returned. Here is how the meta might look like in mappings: ```json { "properties": { "latency": { "type": "long", "meta": { "unit": "ms" } } } } ``` And then in the field capabilities response: ```json { "latency": { "long": { "searchable": true, "aggreggatable": true, "meta": { "unit": [ "ms" ] } } } } ``` When there are no conflicts, values are arrays of size 1, but when there are conflicts, Elasticsearch includes all unique values in this array, without giving ways to know which index has which metadata value: ```json { "latency": { "long": { "searchable": true, "aggreggatable": true, "meta": { "unit": [ "ms", "ns" ] } } } } ``` Closes elastic#33267
Pinging @elastic/es-search (:Search/Mapping) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks @jpountz, looks like a very useful addition
server/src/main/java/org/elasticsearch/action/fieldcaps/FieldCapabilities.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/elasticsearch/action/fieldcaps/FieldCapabilities.java
Outdated
Show resolved
Hide resolved
rest-api-spec/src/main/resources/rest-api-spec/test/field_caps/20_meta.yml
Show resolved
Hide resolved
The limitations you put in place look good to me. I can't think of an example at the moment where we would have more than 2 or max 3 keys. Also the key length and value length should be more than enough. I assume the meta information of a field can be updated without having to update the data? Assuming I have a field |
@ruflin Your assumption is correct, metadata can be updated on an existing field. |
My concerns are regarding the limitations and how this feature might be used. 5 keys will get crowded very quickly if more than one solution needs to use it. Practically speaking you might need each solution to be limited to one key. I was curious if per-field metadata might be a good solution for replacing index pattern objects but it doesn't seem quite flexible enough - which is fine, the idea wasn't past the The example application - setting units for a given field - is simple and straight forward. Are there other planned uses for this feature? |
If a |
Can you explain what index pattern objects are and how metadata might help?
I think @ruflin is thinking about using it to differentiate counters from gauges as well. These are the two use-cases I know about for now. I guess some people might be tempted to use it to help for e.g. internationalization by storing a field name for every language but I think this would cause more trouble than it would help due to how it would increase the size of the cluster state. This is why limits have been put in place. I'm surprised by this last statement, which suggests you can't think of many use-cases, versus your first statement where you are afraid that 5 keys might not be enough? |
There would be only one value in that case. But it could be changed to |
@jpountz and I had a conversation regarding this pr, some of which I'll summarize and some of which I'll expand upon. Index pattern objects in kibana perform a number of tasks, one of which can be viewed as providing metadata. Docs - https://www.elastic.co/guide/en/kibana/current/index-patterns.html One of the flaws of index pattern objects is that a field list and associated field data is generated upon index pattern object creation (which requires a document to be present) and manual refresh. It would be nice if we could query elasticsearch directly for this data. Field formatters could potentially be stored via metadata, although the field length limit likely gets in the way. There's also field popularity but thats only used by Discover. I thought there would be more to list here but formatters would either use more than one key OR save multiple values to a single field. While it seems like there's the potential to address some of these concerns with field level metadata I'm not as confident as I'd like to be since I'm focused on other work. My aim was to find if there's common ground but our needs might not be close enough. I'm still curious about the beats use case - using field metadata to store units - since it seems similar to something we might store in index patterns. (...but are glad not to, we're trying to limit complexity) Index patterns are a messy problem that require focused attention and consensus building and we're still in the early stages. Rant over, thank you for listening. |
@mattkime I think field formatters would be a good use-case for this feature. It seems that Kibana has a number of built-in formatters that don't require any configuration and could use this feature directly, e.g.
It also seems to me that some of the formatters could be directly inferred from units when specified. E.g. For the more advanced formatters that require configuration, maybe a good compromise would be to define the formatter at the index level, and then only refer to them at the field level. Something like that:
Top-level metadata doesn't have size/length limits today, which we might want to address in the future, though I'm less concerned about the size of this object since it is per-index, as opposed to per-index per-field. If we were to go that route, we'd need to include top-level metadata in the |
That sounds like a nice solution, much appreciated! |
@elasticmachine update branch |
merge conflict between base and head |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm sorry for jumping in really late with a review. I left some small comments and also had a couple higher-level ones:
- Do we intend to allow metadata to be removed by passing
"key": null
? I tried it out but we currently throw aNullPointerException
. We could leave it as a potential follow-up if you'd prefer, but it'd be good to give a nice error message onnull
instead of throwing an exception. - Would it make sense restrict the values to strings for now, unless we have a use case in mind for numbers? Supporting numbers could make it tempting to add a piece of metadata that is updated frequently (like a counter). I just have a vague intuition here and don't feel strongly, happy to go with what you prefer.
- I will start a separate discussion around this, but I'm starting to find the behavior of the field caps API a bit confusing. Apart from the field type, we always try to merge capabilities across indices. For some of our newer additions like
meta
and the proposedsource_path
, the user can’t tell what each index actually contained. For my context, do we know how themeta
part of the field caps response will be used? Will Kibana claim that two field types are conflicting in the index pattern if their meta information is different?
@@ -86,3 +87,5 @@ include::params/similarity.asciidoc[] | |||
include::params/store.asciidoc[] | |||
|
|||
include::params/term-vector.asciidoc[] | |||
|
|||
include::params/meta.asciidoc[] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like the other parameters are listed in alphabetical order.
@@ -72,6 +74,7 @@ | |||
private Object nullValue; | |||
private String nullValueAsString; // for sending null value to _all field | |||
private boolean eagerGlobalOrdinals; | |||
private Map<String,Object> meta; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
missing space: Map<String, Object>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 I'll also make it a Map<String, String> based on your other comment.
@@ -483,4 +483,5 @@ private BytesReference createIndexRequest(Object value) throws IOException { | |||
return BytesReference.bytes(XContentFactory.jsonBuilder().startObject().field("field", value).endObject()); | |||
} | |||
} | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
extra edit here
@@ -87,7 +87,12 @@ protected final T assertSerialization(T testInstance, Version version) throws IO | |||
|
|||
protected void assertEqualInstances(T expectedInstance, T newInstance) { | |||
assertNotSame(newInstance, expectedInstance); | |||
try { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Was this accidentally left over from debugging?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wooops yes indeed
Null pointer exceptions are always bugs, let's fix it in this PR. Thanks for catching it!
Doing a put-mapping currently replaces the metadata. For instance if you have a field mapped as
and want to remove the
Unlike some properties that are treated as patches by mapping updates (typically those that are guarded by an
+1 I'll restrict metadata to string values. I don't have any use-case in mind for numerics.
Please ping me on that other thread, I have been thinking about this too. I wonder that we should treat differently properties that need to be consistent across indices of the index pattern (like whether the field is aggregatable) and properties that don't have to be consistent (like how the field could be retrieved from the
The goal is to help the software that ships the data, typically Beats, share information with the software that consumes the data, typically Kibana, in order to provide a better out-of-the-box experience. For instance, imagine that you are deploying Kibana and start building dashboards on existing data. One field is called
It might depend on which metadata is conflicting, but if metadata reports that some indices record milliseconds and other indices record nanoseconds for the same field, then it would make sense to me for Kibana to complain about it. |
Oops my comment was unclear -- I agree that the NPE should be fixed in this PR, thanks!
Got it, this seems nicer to me than nulling out keys. I think it would be helpful to add a note about update behavior in the
Will do, it would be nice to have a sense of the overall strategy/ design for adding new information to field caps. Perhaps we can quickly discuss the general field caps question before merging this PR, in case we want to change the approach. |
@elasticmachine run elasticsearch-ci/2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me, thanks for the additional changes.
@jpountz and I discussed offline and are okay with current field caps format. To note something we touched on but hasn't been discussed on this PR yet: there could be a situation where in one index a field like latency
contains a piece of metadata, but in another index the metadata is missing. In the field caps response, this would not be possible to distinguish from a case where latency
has the same metadata in both indices:
{
"latency": {
"long": {
"searchable": true,
"aggregatable": true,
"meta": {
"unit": [ "ms" ]
}
}
}
}
This leniency seems okay to me, but just wanted to highlight it in case @ruflin or @mattkime had an opinion based on their use cases.
Finally found some time to play around with this. It all seems to work. One additional thing I tried is the following:
I was curious if the meta information could also be set on an object to indicate that the properties below have a certain structure. This would be useful in cases where a certain type does not exist yet in Elasticsearch like histogram / summary but we would still give the UI an indication on what it is. Probably best to take this discussion to an other issue, just wanted to mention it. @jpountz Do we plan to release this as GA directly or should we do a first round of beta release? @jtibshirani This trade off SGTM. |
@ruflin The docs don't add an experimental warning so this is going GA directly. I don't think an experimental label would help us much here as we are adding this feature to our most used fields (keyword, numbers, boolean, ...) which we can't afford to break, so if we want to evolve or remove this functionality in the future we will need to go through the same process anyway regardless of whether this functionality is considered GA or experimental. |
Wohoo, thanks for getting this in @jpountz ! Will follow up ;-) |
This PR adds per-field metadata that can be set in the mappings and is later returned by the field capabilities API. This metadata is completely opaque to Elasticsearch but may be used by tools that index data in Elasticsearch to communicate metadata about fields with tools that then search this data. A typical example that has been requested in the past is the ability to attach a unit to a numeric field. In order to not bloat the cluster state, Elasticsearch requires that this metadata be small: - keys can't be longer than 20 chars, - values can only be numbers or strings of no more than 50 chars - no inner arrays or objects, - the metadata can't have more than 5 keys in total. Given that metadata is opaque to Elasticsearch, field capabilities don't try to do anything smart when merging metadata about multiple indices, the union of all field metadatas is returned. Here is how the meta might look like in mappings: ```json { "properties": { "latency": { "type": "long", "meta": { "unit": "ms" } } } } ``` And then in the field capabilities response: ```json { "latency": { "long": { "searchable": true, "aggreggatable": true, "meta": { "unit": [ "ms" ] } } } } ``` When there are no conflicts, values are arrays of size 1, but when there are conflicts, Elasticsearch includes all unique values in this array, without giving ways to know which index has which metadata value: ```json { "latency": { "long": { "searchable": true, "aggreggatable": true, "meta": { "unit": [ "ms", "ns" ] } } } } ``` Closes elastic#33267
This PR adds per-field metadata that can be set in the mappings and is later returned by the field capabilities API. This metadata is completely opaque to Elasticsearch but may be used by tools that index data in Elasticsearch to communicate metadata about fields with tools that then search this data. A typical example that has been requested in the past is the ability to attach a unit to a numeric field. In order to not bloat the cluster state, Elasticsearch requires that this metadata be small: - keys can't be longer than 20 chars, - values can only be numbers or strings of no more than 50 chars - no inner arrays or objects, - the metadata can't have more than 5 keys in total. Given that metadata is opaque to Elasticsearch, field capabilities don't try to do anything smart when merging metadata about multiple indices, the union of all field metadatas is returned. Here is how the meta might look like in mappings: ```json { "properties": { "latency": { "type": "long", "meta": { "unit": "ms" } } } } ``` And then in the field capabilities response: ```json { "latency": { "long": { "searchable": true, "aggreggatable": true, "meta": { "unit": [ "ms" ] } } } } ``` When there are no conflicts, values are arrays of size 1, but when there are conflicts, Elasticsearch includes all unique values in this array, without giving ways to know which index has which metadata value: ```json { "latency": { "long": { "searchable": true, "aggreggatable": true, "meta": { "unit": [ "ms", "ns" ] } } } } ``` Closes elastic#33267
This PR adds per-field metadata that can be set in the mappings and is later
returned by the field capabilities API. This metadata is completely opaque to
Elasticsearch but may be used by tools that index data in Elasticsearch to
communicate metadata about fields with tools that then search this data. A
typical example that has been requested in the past is the ability to attach
a unit to a numeric field.
In order to not bloat the cluster state, Elasticsearch requires that this
metadata be small:
arrays or objects,
Given that metadata is opaque to Elasticsearch, field capabilities don't try to
do anything smart when merging metadata about multiple indices, the union of
all field metadatas is returned.
Here is how the meta might look like in mappings:
And then in the field capabilities response:
When there are no conflicts, values are arrays of size 1, but when there are
conflicts, Elasticsearch includes all unique values in this array, without
giving ways to know which index has which metadata value:
Closes #33267