
increase and simplify backoff #827

Merged
felixge merged 1 commit into v1 from felix.geisendoerfer/increase-backoff on Feb 10, 2021

Conversation

Member

@felixge felixge commented Jan 28, 2021

The current backoffDuration is designed to deal with optimistic
concurrency control, which is not what we're trying to do. More
importantly, we're only doing a single retry right now if the first
attempt fails, so the backoffDuration() call always returns 100ms +/- some
jitter. This means that the moment there is an upstream issue we
immediately spam the upstream service with a second request, which
is usually counter-productive.

This patch simply sleeps for rand(0, profilePeriod). The profile period
defaults to 60s, so on average we'll sleep for 30s, i.e. we evenly space
out our attempts to upload the profile.
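
To make that concrete, here is a minimal Go sketch of the idea. This is not the actual dd-trace-go code; uploadWithRetry and its parameters are made up for illustration, but the wait is the same rand(0, profilePeriod) approach described above.

```go
// Illustrative sketch only, not the dd-trace-go implementation.
// Instead of a fixed ~100ms backoff, the single retry waits a
// uniformly random fraction of the profiling period, so on average
// half the period (30s with the default 60s period).
package main

import (
	"errors"
	"fmt"
	"math/rand"
	"time"
)

func uploadWithRetry(upload func() error, period time.Duration) error {
	if err := upload(); err != nil {
		// Sleep for rand(0, period) before the single retry, spreading
		// retries evenly across the profiling period instead of
		// hitting the backend again right after a failure.
		wait := time.Duration(rand.Int63n(int64(period)))
		fmt.Printf("upload failed (%v), retrying in %s\n", err, wait)
		time.Sleep(wait)
		return upload()
	}
	return nil
}

func main() {
	// Simulated upload that always fails, using a short period so the
	// demo finishes quickly.
	failing := func() error { return errors.New("upstream unavailable") }
	_ = uploadWithRetry(failing, 3*time.Second)
}
```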

Attached to the PR are two illustrations that show the old and new
behavior in a simple simulation.

Illustration

Each dot is one upload attempt. The vertical lines mark the beginning/end of an upstream outage.

Before (note how the retry attempt immediately follows the initial attempt):

[image: simulated upload attempts before the change]

After (note how the retry attempt now sits smoothly between the first attempt and the next profile upload):

[image: simulated upload attempts after the change]

More thoughts

This is just a proposal for something simple we could do to make the client behave less aggressively when there is an upstream issue. That being said, we're still doing 2x the number of requests during an outage right now, so this won't solve the issue completely. However, when discussing this with @AlexJF, he was concerned about the tradeoff between data safety and rate limiting, so this change tries to keep that balance the same as before.

@felixge felixge requested a review from AlexJF January 28, 2021 18:14
@felixge felixge marked this pull request as draft January 28, 2021 18:14
@felixge felixge force-pushed the felix.geisendoerfer/increase-backoff branch from bdfb007 to 100f834 Compare January 29, 2021 12:30
@felixge felixge added this to the Triage milestone Jan 29, 2021
Contributor

@AlexJF AlexJF left a comment


I think tying this to the period makes sense. We will potentially add non-negligible latency when the failure was due to a transient network error, but this should not be a frequent occurrence, and given the small number of retries we do, it increases the chance that whatever failed has gone away (or at least that immediate retrying won't make the problem worse!).

There's plenty we can do here in follow-up steps, but I think this goes in a good direction. I'd like @dbenamydd's view on this as well though.

@flodav

flodav commented Feb 2, 2021

Spreading the retry over the sampling period is great for avoiding a storm of retries to logs-intake. Retrying only once is a bit limited, but I understand from Alex's comment that it's more important not to worsen the situation than to avoid data loss. In light of recent events, I also agree with that approach. Nice work 👍!

I just have a little nit: in the unfortunate case where the retry happens at the end of the sampling period, the current profile will still remain in memory while a new one is created. In the worst case, where the retry upload times out (the default HTTP timeout is 10 seconds), 2/3 of the sampling data for the new period will already have been gathered by the profiler (we sample over 15 seconds). If the process memory consumption is already high, we could end up OOM.

This is a very small corner case which would most likely be due to a poor memory configuration of the process (the amount of data gathered by the profiler during each sampling period is not bounded anyway), but reducing the retry period to 50 seconds to keep the memory overhead the same seems like a small price to pay for a little bit more safety.

@felixge
Member Author

felixge commented Feb 2, 2021

Spreading the retry over the sampling period is great for avoiding a storm of retries to logs-intake. Retrying only once is a bit limited, but I understand from Alex's comment that it's more important not to worsen the situation than to avoid data loss. In light of recent events, I also agree with that approach. Nice work 👍!

Thanks for taking a look! : )

I just have a little nit: in the unfortunate case where the retry happens at the end of the sampling period, the current profile will still remain in memory while a new one is created. In the worst case, where the retry upload times out (the default HTTP timeout is 10 seconds), 2/3 of the sampling data for the new period will already have been gathered by the profiler (we sample over 15 seconds). If the process memory consumption is already high, we could end up OOM.

This is a very small corner case which would most likely be due to a poor memory configuration of the process (the amount of data gathered by the profiler during each sampling period is not bounded anyway), but reducing the retry period to 50 seconds to keep the memory overhead the same seems like a small price to pay for a little bit more safety.

The Go profiler currently buffers a maximum of 6 profiles in memory. Arguably that might be too much, but as far as I can tell it's an orthogonal issue to this PR. I.e. changing the retry backoff mechanism has no impact on the high-watermark for profiler memory usage. The illustration below captures the key aspects of the current implementation:

[image: key aspects of the current profiler upload/buffering implementation]

see https://github.com/felixge/go-profiler-notes/blob/main/datadog.md#operation-details
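
For context, here is a minimal sketch of the kind of bounded buffering described above, assuming a simple drop-oldest policy. The type names and the capacity constant are illustrative, not the actual dd-trace-go internals; the point is that the retry backoff does not change the memory high-watermark, because the buffer never holds more than a fixed number of profiles.

```go
// Bounded profile buffer with a drop-oldest policy (illustrative only).
package main

import "fmt"

type profile struct{ data []byte }

// maxBuffered is an assumed cap for the sketch.
const maxBuffered = 6

func enqueue(queue chan profile, p profile) {
	for {
		select {
		case queue <- p:
			return // enqueued successfully
		default:
			// Buffer is full: drop the oldest profile to make room,
			// then try again.
			select {
			case old := <-queue:
				fmt.Printf("dropping profile of %d bytes\n", len(old.data))
			default:
			}
		}
	}
}

func main() {
	queue := make(chan profile, maxBuffered)
	for i := 0; i < 10; i++ {
		enqueue(queue, profile{data: make([]byte, i)})
	}
	fmt.Println("buffered:", len(queue)) // never exceeds maxBuffered
}
```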

PTAL and let me know if that makes sense or not.

@felixge felixge marked this pull request as ready for review February 10, 2021 12:14
@felixge felixge modified the milestones: Triage, 1.29.0 Feb 10, 2021
@felixge felixge merged commit a85b65f into v1 Feb 10, 2021
@felixge felixge deleted the felix.geisendoerfer/increase-backoff branch February 10, 2021 12:30
This was referenced Mar 11, 2021
This was referenced Mar 15, 2021
nsrip-dd added a commit that referenced this pull request Jul 21, 2022
PR #827 introduced a randomized wait duration before retrying a failed
profile upload. This was done to reduce the likelihood of flooding the
backend with uploads. The wait averaged half the profiling period.
However, PR #961 lost this behavior when introducing interruptible
sleeps. This commit restores the lost behavior.

Fixes #1395
nsrip-dd added a commit that referenced this pull request Jul 21, 2022
PR #827 introduced a randomized wait duration before retrying a failed
profile upload. This was done to reduce the likelihood of flooding the
backend with uploads. The wait averaged half the profiling period.
However, PR #961 lost this behavior when introducing interruptible
sleeps. This commit restores the lost behavior.

Fixes #1395
Kyle-Verhoog pushed a commit that referenced this pull request Jul 21, 2022
PR #827 introduced a randomized wait duration before retrying a failed
profile upload. This was done to reduce the likelihood of flooding the
backend with uploads. The wait averaged half the profiling period.
However, PR #961 lost this behavior when introducing interruptible
sleeps. This commit restores the lost behavior.

Fixes #1395
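
The commit messages above mention combining the randomized wait with interruptible sleeps. As a rough sketch, an interruptible randomized retry wait can look like the following; the function name and the exit-channel mechanism are assumptions for illustration, not the actual dd-trace-go implementation.

```go
// Interruptible randomized retry wait (illustrative sketch). The wait
// is still rand(0, period), averaging half the profiling period, but
// it can be cut short when the profiler stops.
package main

import (
	"errors"
	"fmt"
	"math/rand"
	"time"
)

func interruptibleWait(period time.Duration, exit <-chan struct{}) error {
	wait := time.Duration(rand.Int63n(int64(period)))
	t := time.NewTimer(wait)
	defer t.Stop()
	select {
	case <-t.C:
		return nil // waited the full randomized duration
	case <-exit:
		return errors.New("profiler stopped during retry wait")
	}
}

func main() {
	exit := make(chan struct{})
	// Stop shortly after starting, interrupting the wait.
	go func() { time.Sleep(100 * time.Millisecond); close(exit) }()
	fmt.Println(interruptibleWait(60*time.Second, exit))
}
```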