-
Notifications
You must be signed in to change notification settings - Fork 8.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Exponentially backs off retry attempts for sending usage data #117955
Exponentially backs off retry attempts for sending usage data #117955
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
self review
@@ -53,13 +53,22 @@ export class TelemetrySender { | |||
return false; | |||
}; | |||
|
|||
private delayRetrySend = () => { | |||
return 60 * (1000 * Math.min(Math.pow(2, this.retryCount), 64)); // 120s, 240s, 480s, 960s, 1920s, 3840s, 3840s, 3840s |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This could be written as 60000 * Math.min(.....)
but writing it this way makes it easier to read at a glance.
We limit the backoff period to just over once per hour.
private sendIfDue = async (): Promise<void> => { | ||
if (this.isSending || !this.shouldSendReport()) { | ||
if (!this.shouldSendReport()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Exit early if and only if the next report isn't due to be sent. This backs-off on the once-per minute fetch and send request calls we used to make.
return; | ||
} | ||
// optimistically update the report date and reset the retry counter for a new time report interval window | ||
this.lastReported = `${Date.now()}`; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have a behavior change here where we're now updating the report date before sending the data vs only after a successful send. We need to do this to ensure we're backing off the retry attempts.
private sendIfDue = async (): Promise<void> => { | ||
if (this.isSending || !this.shouldSendReport()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given the removal of isSending
, could we add unit tests where sendIfDue
is called multiple time in a row without delay just to be sure that shouldSendReport
properly handles it (aka sendUsageData
should only be called once)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sendIfDue
is private and we’re the only ones calling the method. It will never be called without a delay.
window.setTimeout(this.sendUsageData, this.delayRetrySend()); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NIT: maybe we want to add else { console.log(something) }
to at least surface the error in the console logs? I feel like it's better than just trapping the error.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done: https://github.com/elastic/kibana/pull/117955/files#diff-5a45ad5f1ca947b4b167609152746b1599cf51aac704cd7b181bb73cec63e208R100-R102 with tests updated.
Thanks!
const telemetryService = mockTelemetryService(); | ||
const telemetrySender = new TelemetrySender(telemetryService); | ||
telemetrySender['shouldSendReport'] = jest.fn().mockReturnValue(false); | ||
telemetrySender['retryCount'] = 0; | ||
await telemetrySender['sendIfDue'](); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(Thinking out loud, not related to the PR as it was already present) This is a code smell of bad isolation. Proper unit testing should never require to mock the blackboxed component.
e.g shouldSendReport
mocking could be replaced with telemetryService.canSendTelemetry
mocking (which would be the correct way to test the method's behavior btw). Here, we need to mock some of the TelemetrySender
's methods to test the behavior of other methods of the class, which is not unit testing the class.
this.retryCount = this.retryCount + 1; | ||
if (this.retryCount < 20) { | ||
// exponentially backoff the time between subsequent retries to up to 19 attempts, after which we give up until the next report is due | ||
window.setTimeout(this.sendUsageData, this.delayRetrySend()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NIT/optional: delayRetrySend
can be extracted as a static function getRetryDelay(retryCount: number)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Excellent suggestion, thank you! It also made the testing much easier.
💚 Build Succeeded
Metrics [docs]Page load bundle
History
To update your PR or re-run it, just comment with: |
… (#118900) Co-authored-by: Christiane (Tina) Heiligers <christiane.heiligers@elastic.co>
… (#118901) Co-authored-by: Christiane (Tina) Heiligers <christiane.heiligers@elastic.co>
Resolves #115221
In some cases, a call to fetch all the usage data for a specific report time period might take a long time and add additional strain to a server already under a high load.
This PR exponential backs off retry fetch and send attempts within a report time window and changes the once-per-minute checks to only check if the next report is due.
In this PR we're changing the behavior of when we update the report date, from after a successful send request to before the first send request is attempted. Optimistically updating the report date before the calls avoids subsequent retries every minute.
This has the risk of ever increasing payloads that might eventually also put a large strain on the browser if any usage collector was previously relying on an at-most 24 hour usage data fetch window.
Checklist
Delete any items that are not applicable to this PR.
Risk Matrix
For maintainers