[Regression] Telemetry Usage Stats API no longer honor timeRange parameter #109960

Closed
ycombinator opened this issue Aug 24, 2021 · 11 comments
Labels
bug Fixes for quality problems that affect the customer experience Feature:Telemetry Team:Core Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc

Comments

@ycombinator
Contributor

ycombinator commented Aug 24, 2021

Kibana version:

7.11.0

Elasticsearch version:

7.11.0

Describe the bug:

Up through version 7.10.0, the POST /api/telemetry/v2/clusters/_stats API used to accept a timeRange parameter in the request body like so:

{
  "timeRange": {
    "min": "2021-08-24T10:20:08-07:00",
    "max": "2021-08-24T11:19:31-07:00"
  }
}

Starting from version 7.11.0, however, the same API call returns the following error response:

{"statusCode":400,"error":"Bad Request","message":"[request body.timeRange]: definition for this key is missing"}

Steps to reproduce:

curl -X POST "https://$HOSTPORT/api/telemetry/v2/clusters/_stats" -u elastic:$PASSWORD -H 'kbn-xsrf: true' -H 'Content-Type: application/json' -d '{"unencrypted":true,"timeRange":{"max":"2021-08-24T11:17:30-07:00","min":"2021-08-24T10:20:08-07:00"}}'

Expected behavior:

API honors timeRange parameter as before and returns usage stats.

Context:

The Cloud Billing team calls this API on a periodic (roughly hourly) basis on every Kibana instance running in Cloud. We are not currently using the data returned by this API but it would be good to know whether this is indeed a regression that will be fixed or if this was a deliberate change. If it's the latter, it would also be good to know if there's an alternate way to make the equivalent request starting with Kibana version 7.11.0.

@ycombinator ycombinator added bug Fixes for quality problems that affect the customer experience Team:KibanaTelemetry labels Aug 24, 2021
@elasticmachine
Contributor

Pinging @elastic/kibana-telemetry (Team:KibanaTelemetry)

@afharo
Member

afharo commented Aug 26, 2021

@ycombinator sorry about the confusion. It's indeed an intentional change: #81579

We realized that each collector internally needed different ranges (#55171), no matter what the from and to values were, in order to report meaningful data. So we thought that it made sense that only one timestamp was provided.

If it's not causing too much of a problem, I'd close this issue with the "works as designed" flag.

@ycombinator
Contributor Author

ycombinator commented Aug 26, 2021

Yeah, it's fine to close this issue with the "works as designed" flag, but it would be helpful if you could post an example or two here of what the equivalent calls should be starting with Kibana version 7.11.0. AFAICT, this API is not documented (probably since it's an Elastic-internal API), so I can't easily tell what the new contract is. This will help us (Cloud Billing) write code to call the right APIs depending on the Kibana version, since there could be a variety of Kibana instance versions running in Cloud at any time.

@afharo
Member

afharo commented Aug 31, 2021

@ycombinator this is the new contract:

POST /api/telemetry/v2/clusters/_stats
{
  "unencrypted": true
}

As you may have noticed, there is no concept of timestamp. This is because we could only provide historical data when the source of the telemetry stats was the monitoring indices. However, for local telemetry (when Kibana actively runs the collectors), it's always now (the code has always dismissed the timeRange for this type of collection).
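
For callers that have to support a mix of Kibana versions, the two contracts described in this thread could be selected by version roughly as follows. This is a minimal sketch, not an official client: `parse_version` and `build_stats_request_body` are hypothetical helper names, and only the `unencrypted` flag, the `timeRange` shape, and the 7.11.0 cut-off come from this issue. Whether `timeRange` was mandatory before 7.11 is not stated here, so the sketch simply omits it when no range is given.

```python
import json
from typing import Optional


def parse_version(version: str) -> tuple:
    """Turn '7.11.0' into (7, 11, 0); drops suffixes such as '-SNAPSHOT'."""
    core = version.split("-", 1)[0]
    parts = [int(part) for part in core.split(".")]
    parts += [0] * (3 - len(parts))  # pad '7.11' to (7, 11, 0)
    return tuple(parts[:3])


def build_stats_request_body(kibana_version: str,
                             time_min: Optional[str] = None,
                             time_max: Optional[str] = None) -> dict:
    """Hypothetical helper: body for POST /api/telemetry/v2/clusters/_stats."""
    body = {"unencrypted": True}
    # Per this thread, timeRange is accepted up through 7.10.x and rejected
    # with a 400 from 7.11.0 onward, so it is only added for older versions.
    if parse_version(kibana_version) < (7, 11, 0) and time_min and time_max:
        body["timeRange"] = {"min": time_min, "max": time_max}
    return body


# Same range for both calls; only the pre-7.11 body carries it.
old = build_stats_request_body("7.10.0",
                               "2021-08-24T10:20:08-07:00",
                               "2021-08-24T11:19:31-07:00")
new = build_stats_request_body("7.11.0",
                               "2021-08-24T10:20:08-07:00",
                               "2021-08-24T11:19:31-07:00")
print(json.dumps(old))
print(json.dumps(new))  # {"unencrypted": true}
```

Comparing version tuples rather than raw strings avoids the trap where "7.9.0" sorts after "7.11.0" lexicographically.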

Sorry! We didn't know you used it for Cloud Billing purposes. We'll keep you in mind for any changes we may need to introduce in the future (e.g., #96538).

cc @elastic/kibana-core

@ycombinator
Contributor Author

ycombinator commented Aug 31, 2021

Thanks @afharo! I'll look into how this change will impact the Cloud Billing code and file any follow-up issues if necessary. But for now, we're good to close this issue here.

[EDIT] And thanks for linking to #96538 as well. I have subscribed to it now.

@ycombinator
Contributor Author

ycombinator commented Aug 31, 2021

> As you may have noticed, there is no concept of timestamp. This is because we could only provide historical data when the source of the telemetry stats was the monitoring indices. However, for local telemetry (when Kibana actively runs the collectors), it's always now (the code has always dismissed the timeRange for this type of collection).

@afharo I want to clarify this part a bit.

Cloud Billing's use case for this API is to know which features, e.g. Reporting, were being used between a start and end timestamp. This is because there is a process we run (roughly every hour, but the exact interval could vary a bit) that calls this API and asks the question: give me the telemetry usage stats between the last time I ran (timeRange.min) and now (timeRange.max).

Starting with 7.11, since there is no concept of timeRange for this API any more, what time range will the metrics in the response cover? I know you said, "it's always now" but does that apply to timeRange.max? If so, what's timeRange.min assumed to be?

Also, under what conditions does this API use monitoring indices as the source vs. local collection as the source? Is this something that can vary, depending on when the API is called?

@afharo
Member

afharo commented Sep 1, 2021

> Cloud Billing's use case for this API is to know which features, e.g. Reporting, were being used between a start and end timestamp. This is because there is a process we run (roughly every hour, but the exact interval could vary a bit) that calls this API and asks the question: give me the telemetry usage stats between the last time I ran (timeRange.min) and now (timeRange.max).

That's kind of the reason we preferred to remove the concept of time from this API. AFAIK, this API has never returned the usage stats between min and max (as in the delta between min and max). It always returns the full snapshot. In the Reporting use case, the collector reports a combination of all-time usage (the "all" key) and usage over the last 7 days. Triggering a request periodically will only ever show increasing numbers for "all" (the last-7-days numbers might decrease if the feature was not used in that period).
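
Since the response is a snapshot rather than a delta, a periodic caller that wants "usage since my last run" would have to keep the previous snapshot and subtract. The following is a minimal sketch under stated assumptions: the caller has already flattened each response into a mapping of cumulative counter names to integers, and the counter names used here are made up, because the real response nesting is not shown in this thread. It only makes sense for all-time counters such as "all", not for rolling windows such as the last 7 days.

```python
def usage_delta(previous: dict, current: dict) -> dict:
    """Per-counter increase between two cumulative snapshots."""
    delta = {}
    for name, value in current.items():
        before = previous.get(name, 0)
        # Assumption: a cumulative counter should not go down. If it does
        # (for example, the value was reset), count the current value as
        # the whole increase instead of reporting a negative number.
        delta[name] = value - before if value >= before else value
    return delta


# Hypothetical counter names, for illustration only.
last_run = {"reporting_all": 10, "other_all": 4}
this_run = {"reporting_all": 13, "other_all": 1, "new_all": 2}
print(usage_delta(last_run, this_run))
# {'reporting_all': 3, 'other_all': 1, 'new_all': 2}
```

Treating a drop as a reset is a design choice of this sketch, not documented behavior of the API.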

> Also, under what conditions does this API use monitoring indices as the source vs. local collection as the source? Is this something that can vary, depending on when the API is called?

Pre-7.11, if it found any data in the .monitoring-* indices (on Monitoring clusters), it would return telemetry from there. Otherwise, it would fall back to the local collection. The timeRange would only affect the time constraints applied to the query in the .monitoring-* indices. However, as you may already know, usage stats were only retrieved every 24h, so the value of timeRange.min was always corrected to ensure a 24h span between min and max.

In addition to that, the data available in the .monitoring-* indices had the same structure as the local collection, so the snapshot/no-delta principles apply here as well. The only difference is that you could get historical snapshot data.
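
To make the pre-7.11 correction concrete: whatever `timeRange.min` the caller sent, the monitoring query effectively covered the 24 hours ending at `timeRange.max`. A small sketch of that arithmetic follows; `effective_monitoring_range` is a hypothetical name, and this mirrors the description above rather than Kibana's actual implementation.

```python
from datetime import datetime, timedelta


def effective_monitoring_range(time_max: str) -> tuple:
    """Return (min, max) with min forced to 24h before max, as described above."""
    max_dt = datetime.fromisoformat(time_max)
    return max_dt - timedelta(hours=24), max_dt


# The caller's own min (10:20:08 the same day in the original report) is ignored.
eff_min, eff_max = effective_monitoring_range("2021-08-24T11:19:31-07:00")
print(eff_min.isoformat())  # 2021-08-23T11:19:31-07:00
print(eff_max.isoformat())  # 2021-08-24T11:19:31-07:00
```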

From 7.11, we stopped shipping Kibana's usage to the .monitoring indices (we may still report usage from Logstash & Beats), so all the Kibana usage will likely come from the local collection (the first item of the array in the response), and usage from Logstash and Beats might appear in the following items.

To identify the source of the data, if you can see collection: 'local' in the root of the object, you're looking at a locally sourced collection. And it means that the response is as fresh as it can get.
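
The check described above could be applied to the response array along these lines. This is a sketch assuming the response has already been parsed from JSON into a list of dicts; `split_by_source` is a hypothetical name, and every key in the sample payload other than `collection` is a placeholder rather than a real response field.

```python
def split_by_source(response: list) -> tuple:
    """Separate locally collected items from monitoring-sourced ones."""
    local, monitored = [], []
    for item in response:
        # Per this thread, a root-level collection == 'local' marks an item
        # that Kibana collected itself at request time.
        if isinstance(item, dict) and item.get("collection") == "local":
            local.append(item)
        else:
            monitored.append(item)
    return local, monitored


# Placeholder payload: only the "collection" key is taken from the thread.
sample = [
    {"collection": "local", "placeholder_id": "kibana-local"},
    {"placeholder_id": "monitored-cluster-one"},
]
local_items, monitored_items = split_by_source(sample)
print(len(local_items), len(monitored_items))  # 1 1
```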

I've tried to summarize all the above in the table below:

| Kibana's version | Type of collection | Time range |
| --- | --- | --- |
| Pre-7.11 | Monitoring-sourced, falling back to "local" if the cluster does not have .monitoring indices (it's not a Monitoring cluster) | Used to retrieve the snapshot reported between max-24h and max when querying the .monitoring-* indices. Always now for "local". |
| Post-7.11 | "local" AND monitoring-sourced (if available). As in response = [{...localUsage}, {...monitoredClusterOne}, {...monitoredClusterTwo}, ...] | now for "local" and now-20min to now when querying .monitoring-* indices |

I'll cc @Bamieh just in case he wants to add anything.

@ycombinator
Contributor Author

Thanks for the detailed explanation and the summary table at the end, @afharo!

> That's kind of the reason we preferred to remove the concept of time from this API. AFAIK, this API has never returned the usage stats between min and max (as in the delta between min and max). It always returns the full snapshot. In the Reporting use case, the collector reports a combination of all-time usage (the "all" key) and usage over the last 7 days. Triggering a request periodically will only ever show increasing numbers for "all" (the last-7-days numbers might decrease if the feature was not used in that period).

When you say "full snapshot", you mean the duration between when the Kibana server started up and the time of the API request (now), right?

@afharo
Member

afharo commented Sep 2, 2021

> When you say "full snapshot", you mean the duration between when the Kibana server started up and the time of the API request (now), right?

As always, IT depends 🙃
In the snapshot, some metrics are of the size_of_the_index/saved objects count kind (usually growing ever since the cluster was created), and others cover the "last day/7/30/90 days". However, some others are only kept in memory (like the ops metrics' request statuses), so restarts may affect them.

@ycombinator
Contributor Author

ycombinator commented Sep 2, 2021

That makes sense, thanks @afharo.

At the moment the only feature we're looking at from this API's response is Reporting. That will definitely change in the future. For Reporting (the "all" key), are the metrics from the time the cluster was created or from the time of the latest restart?

@afharo
Member

afharo commented Sep 2, 2021

I'll defer to @elastic/kibana-reporting-services to fully confirm. But, looking at the implementation, I'd say it's for the entire life of the cluster.

@lukeelmers lukeelmers added Team:Core Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc Feature:Telemetry and removed Team:KibanaTelemetry labels Oct 1, 2021