[Telemetry] Snapshot collection may skip days #142058

Closed
afharo opened this issue Sep 28, 2022 · 4 comments · Fixed by #144132

afharo commented Sep 28, 2022

When shipping Snapshot telemetry from the server, we check if we should send telemetry every 12h.

This approach may skip some days without sending telemetry. Here's a scenario that explains it:

  1. Kibana checks today at 09:00Z, but the last time it sent any telemetry was yesterday at 21:01Z. That's less than 24h ago, so it skips.
  2. Kibana checks again 12h later, at 21:00Z. The last time is still yesterday at 21:01Z, which is still less than 24h ago.
  3. Kibana checks again 12h later, tomorrow at 09:00Z. Now it can send telemetry, but we missed the data for today.
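
The scenario above can be reproduced with a small simulation. This is only a sketch: `shouldSend` and `simulateChecks` are hypothetical names, and the "at least 24h since the last report" rule is an assumption based on the description in this issue, not Kibana's actual code. The dates are arbitrary examples.

```typescript
const HOUR_MS = 60 * 60 * 1000;
const DAY_MS = 24 * HOUR_MS;

// Assumed rule: only send when the last report is at least 24h old.
function shouldSend(lastReportedAt: number, now: number): boolean {
  return now - lastReportedAt >= DAY_MS;
}

// Runs `checks` periodic checks and returns the ISO timestamps at which a report was sent.
function simulateChecks(
  lastReportedAt: number,
  firstCheck: number,
  checks: number,
  intervalMs: number
): string[] {
  const sentAt: string[] = [];
  let last = lastReportedAt;
  for (let i = 0; i < checks; i++) {
    const now = firstCheck + i * intervalMs;
    if (shouldSend(last, now)) {
      sentAt.push(new Date(now).toISOString());
      last = now;
    }
  }
  return sentAt;
}

// Last report: Sep 27 at 21:01Z. Checks: Sep 28 at 09:00Z, Sep 28 at 21:00Z, Sep 29 at 09:00Z.
const sent = simulateChecks(
  Date.UTC(2022, 8, 27, 21, 1),
  Date.UTC(2022, 8, 28, 9, 0),
  3,
  12 * HOUR_MS
);
// Only the third check sends, so no report carries a Sep 28 timestamp.
```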

Why could it have 21:01Z as the last time it sent telemetry?

a. It takes time to generate the telemetry report, and we only store the lastReportedAt date once the data has been reported successfully.
b. Browsers may send the report if the server didn't. Since the server only checks every 12h, there's a good chance that a browser kicks in first, which shifts the time the data was sent.

Potential fix:

Use smarter timer logic: instead of a fixed interval, set the timer to lastReportedAt plus a random delay, to avoid race conditions when multiple instances report at the same time.
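
A minimal sketch of what such timer logic might look like. `msUntilNextReport` is a hypothetical helper, and the 24h reporting period and the jitter bound are assumptions made for illustration; this is not the fix that was eventually merged.

```typescript
const ONE_DAY_MS = 24 * 60 * 60 * 1000;

// Hypothetical helper: how long to wait before the next report attempt.
// Assumption: a report is due 24h after lastReportedAt; the random jitter
// spreads out instances that would otherwise fire at the same moment.
function msUntilNextReport(
  lastReportedAt: number,
  now: number,
  maxJitterMs: number,
  random: () => number = Math.random
): number {
  const dueAt = lastReportedAt + ONE_DAY_MS;
  const jitter = Math.floor(random() * maxJitterMs);
  // If the report is already overdue, wait only for the jitter.
  return Math.max(0, dueAt - now) + jitter;
}

// Last report yesterday at 21:01Z, current time today at 09:00Z.
// The jitter is pinned to 0 here so the result is deterministic.
const exampleDelay = msUntilNextReport(
  Date.UTC(2022, 8, 27, 21, 1),
  Date.UTC(2022, 8, 28, 9, 0),
  5 * 60 * 1000,
  () => 0
);
// exampleDelay is 12h 1min in milliseconds, so the next attempt lands at 21:01Z today.
```

A caller could pass the result to `setTimeout` and recompute it after every attempt, so the schedule follows lastReportedAt even when a browser reports in between.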

afharo added the bug, Team:Core, and Feature:Telemetry labels on Sep 28, 2022
@elasticmachine

Pinging @elastic/kibana-core (Team:Core)

@amitkanfer

@afharo thank you for reporting this. This might explain issues we're seeing with ingest telemetry. Can you please help me understand why Kibana checks if it needs to send data every 12 hours? Why not check every hour?


afharo commented Oct 27, 2022

> Can you please help me understand why Kibana checks if it needs to send data every 12 hours?

I think the reason is the initial implementation: we only stored lastReportedAt in a Saved Object on the server (the browser checked the browser's local storage). This means we used to send once per Kibana instance (if there was connectivity from the server) + once for each browser accessing Kibana.

This was fixed in #121656.

The 12h interval stayed, though, because that's how often we check for connectivity from the server (if we fail to report) until we give up after a couple of retries.

In any case, #121656 added this randomization to the lastReportedAt value. That leads to the situation described here: on a day when no user logs in to Kibana (so no browser sends data), the lastReportedAt value doesn't line up with the 12h interval the server checks on.

The fix is definitely to improve this delay, although I'm not sure a 1h interval would work either, because there's still an edge case that may make it skip a day (for the last hour of the UTC day).
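
One possible reading of that edge case, as a sketch. It assumes the same "at least 24h since the last report" rule and hourly checks, and `utcDaysWithReport` is a hypothetical name. If the last report landed at 23:30Z and the hourly checks fire at a quarter past the hour, the first eligible check falls just after midnight two days later, so the UTC day in between gets no report.

```typescript
const MS_PER_HOUR = 60 * 60 * 1000;

// Returns the UTC dates (YYYY-MM-DD) on which an hourly check would send a report,
// assuming a report is only sent when the previous one is at least 24h old.
function utcDaysWithReport(
  lastReportedAt: number,
  firstCheck: number,
  checks: number
): string[] {
  const days: string[] = [];
  let last = lastReportedAt;
  for (let i = 0; i < checks; i++) {
    const now = firstCheck + i * MS_PER_HOUR;
    if (now - last >= 24 * MS_PER_HOUR) {
      days.push(new Date(now).toISOString().slice(0, 10));
      last = now;
    }
  }
  return days;
}

// Last report on Oct 25 at 23:30Z; hourly checks from Oct 26 at 00:15Z for 48 hours.
const reportedDays = utcDaysWithReport(
  Date.UTC(2022, 9, 25, 23, 30),
  Date.UTC(2022, 9, 26, 0, 15),
  48
);
// The first eligible check is Oct 27 at 00:15Z, so Oct 26 never gets a report.
```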

@amitkanfer
Copy link

I think I understand the situation. I'm sure we can find a solution to avoid these race conditions and send telemetry exactly once a day (not more, not less). We look forward to having this fixed. If safe, let's also backport the future fix to 8.5.x, as we prefer not to wait for 8.6 to see if there's another problem somewhere that we need to fix.
