
increase and simplify backoff #827

Merged
felixge merged 1 commit into v1 from felix.geisendoerfer/increase-backoff on Feb 10, 2021

Conversation

Member

@felixge felixge commented Jan 28, 2021

The current backoffDuration is designed to deal with optimistic
concurrency control, which is not what we're trying to do. More
importantly, we're only doing a single retry right now if the first
attempt fails, so the backoffDuration() call always returns 100ms +/- some
jitter. This means that the moment there is an upstream issue we
immediately spam the upstream service with a second request, which
is usually counter-productive.

This patch simply sleeps for rand(0, profilePeriod). The profile period
defaults to 60s, so on average we'll sleep for 30s, i.e. we evenly space
out our attempts to upload the profile.
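
To make that concrete, here is a minimal Go sketch of the idea. This is not the actual dd-trace-go code; uploadWithRetry and its parameters are made up for illustration, but the wait is the same rand(0, profilePeriod) approach described above.

```go
// Illustrative sketch only, not the dd-trace-go implementation.
// Instead of a fixed ~100ms backoff, the single retry waits a
// uniformly random fraction of the profiling period, so on average
// half the period (30s with the default 60s period).
package main

import (
	"errors"
	"fmt"
	"math/rand"
	"time"
)

func uploadWithRetry(upload func() error, period time.Duration) error {
	if err := upload(); err != nil {
		// Sleep for rand(0, period) before the single retry, spreading
		// retries evenly across the profiling period instead of
		// hitting the backend again right after a failure.
		wait := time.Duration(rand.Int63n(int64(period)))
		fmt.Printf("upload failed (%v), retrying in %s\n", err, wait)
		time.Sleep(wait)
		return upload()
	}
	return nil
}

func main() {
	// Simulated upload that always fails, using a short period so the
	// demo finishes quickly.
	failing := func() error { return errors.New("upstream unavailable") }
	_ = uploadWithRetry(failing, 3*time.Second)
}
```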

Attached to the PR are two illustrations that show the old and new
behavior in a simple simulation.

Illustration

Each dot is one upload attempt. The vertical lines mark the beginning/end of an upstream outage.

Before (note how the retry attempt immediately follows the initial attempt):

[image: simulated upload attempts before the change]

After (note how the retry attempt now sits smoothly between the first attempt and the next profile upload):

[image: simulated upload attempts after the change]

More thoughts

This is just a proposal for something simple we could do to make the client behave less aggressively when there is an upstream issue. That being said, we're still doing 2x the number of requests during an outage right now, so this won't solve the issue completely. However, when discussing this with @AlexJF, he was concerned about the tradeoff between data safety and rate limiting, so this change tries to keep that balance the same as before.

@felixge felixge requested a review from AlexJF January 28, 2021 18:14
@felixge felixge marked this pull request as draft January 28, 2021 18:14
@felixge felixge force-pushed the felix.geisendoerfer/increase-backoff branch from bdfb007 to 100f834 Compare January 29, 2021 12:30
@felixge felixge added this to the Triage milestone Jan 29, 2021
Contributor

@AlexJF AlexJF left a comment


I think tying this to the period makes sense. We will potentially add non-negligible latency when the failure was due to a transient network error, but this should not be a frequent occurrence, and given the small number of retries we do, it increases the chance that whatever failed has gone away (or at least that immediate retrying won't make the problem worse!).

There's plenty we can do here in follow-up steps, but I think this goes in a good direction. I'd like @dbenamydd's view on this as well though.

@flodav

flodav commented Feb 2, 2021

Spreading the retry over the sampling period is great for avoiding a storm of retries to logs-intake. Retrying only once is a bit limited, but I understand from Alex's comment that it's more important not to worsen the situation than to avoid data loss. In light of recent events, I also agree with that approach. Nice work 👍!

I just have a little nit: in the unfortunate case where the retry happens at the end of the sampling period, the current profile will still remain in memory while a new one is created. In the worst case, where the retry upload times out (the default HTTP timeout is 10 seconds), 2/3 of the sampling data for the new period will already have been gathered by the profiler (we sample over 15 seconds). If the process memory consumption is already high, we could end up OOM.

This is a very small corner case which would most likely be due to a poor memory configuration of the process (the amount of data gathered by the profiler during each sampling period is not bounded anyway), but reducing the retry period to 50 seconds to keep the memory overhead the same seems like a small price to pay for a little bit more safety.

@felixge
Member Author

felixge commented Feb 2, 2021

Spreading the retry over the sampling period is great for avoiding a storm of retries to logs-intake. Retrying only once is a bit limited, but I understand from Alex's comment that it's more important not to worsen the situation than to avoid data loss. In light of recent events, I also agree with that approach. Nice work 👍!

Thanks for taking a look! : )

I just have a little nit: in the unfortunate case where the retry happens at the end of the sampling period, the current profile will still remain in memory while a new one is created. In the worst case, where the retry upload times out (the default HTTP timeout is 10 seconds), 2/3 of the sampling data for the new period will already have been gathered by the profiler (we sample over 15 seconds). If the process memory consumption is already high, we could end up OOM.

This is a very small corner case which would most likely be due to a poor memory configuration of the process (the amount of data gathered by the profiler during each sampling period is not bounded anyway), but reducing the retry period to 50 seconds to keep the memory overhead the same seems like a small price to pay for a little bit more safety.

The Go profiler currently buffers a maximum of 6 profiles in memory. Arguably that might be too much, but as far as I can tell it's an orthogonal issue to this PR. I.e. changing the retry backoff mechanism has no impact on the high-watermark for profiler memory usage. The illustration below captures the key aspects of the current implementation:

[image: key aspects of the current profiler upload/buffering implementation]

see https://github.com/felixge/go-profiler-notes/blob/main/datadog.md#operation-details
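
For context, here is a minimal sketch of the kind of bounded buffering described above, assuming a simple drop-oldest policy. The type names and the capacity constant are illustrative, not the actual dd-trace-go internals; the point is that the retry backoff does not change the memory high-watermark, because the buffer never holds more than a fixed number of profiles.

```go
// Bounded profile buffer with a drop-oldest policy (illustrative only).
package main

import "fmt"

type profile struct{ data []byte }

// maxBuffered is an assumed cap for the sketch.
const maxBuffered = 6

func enqueue(queue chan profile, p profile) {
	for {
		select {
		case queue <- p:
			return // enqueued successfully
		default:
			// Buffer is full: drop the oldest profile to make room,
			// then try again.
			select {
			case old := <-queue:
				fmt.Printf("dropping profile of %d bytes\n", len(old.data))
			default:
			}
		}
	}
}

func main() {
	queue := make(chan profile, maxBuffered)
	for i := 0; i < 10; i++ {
		enqueue(queue, profile{data: make([]byte, i)})
	}
	fmt.Println("buffered:", len(queue)) // never exceeds maxBuffered
}
```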

PTAL and let me know if that makes sense or not.

@felixge felixge marked this pull request as ready for review February 10, 2021 12:14
@felixge felixge modified the milestones: Triage, 1.29.0 Feb 10, 2021
@felixge felixge merged commit a85b65f into v1 Feb 10, 2021
@felixge felixge deleted the felix.geisendoerfer/increase-backoff branch February 10, 2021 12:30
This was referenced Mar 11, 2021
This was referenced Mar 15, 2021
nsrip-dd added a commit that referenced this pull request Jul 21, 2022
PR #827 introduced a randomized wait duration before retrying a failed
profile upload. This was done to reduce the likelihood of flooding the
backend with uploads. The wait averaged half the profiling period.
However, PR #961 lost this behavior when introducing interruptible
sleeps. This commit restores the lost behavior.

Fixes #1395
nsrip-dd added a commit that referenced this pull request Jul 21, 2022
PR #827 introduced a randomized wait duration before retrying a failed
profile upload. This was done to reduce the likelihood of flooding the
backend with uploads. The wait averaged half the profiling period.
However, PR #961 lost this behavior when introducing interruptible
sleeps. This commit restores the lost behavior.

Fixes #1395
Kyle-Verhoog pushed a commit that referenced this pull request Jul 21, 2022
PR #827 introduced a randomized wait duration before retrying a failed
profile upload. This was done to reduce the likelihood of flooding the
backend with uploads. The wait averaged half the profiling period.
However, PR #961 lost this behavior when introducing interruptible
sleeps. This commit restores the lost behavior.

Fixes #1395
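
The commit messages above mention combining the randomized wait with interruptible sleeps. As a rough sketch, an interruptible randomized retry wait can look like the following; the function name and the exit-channel mechanism are assumptions for illustration, not the actual dd-trace-go implementation.

```go
// Interruptible randomized retry wait (illustrative sketch). The wait
// is still rand(0, period), averaging half the profiling period, but
// it can be cut short when the profiler stops.
package main

import (
	"errors"
	"fmt"
	"math/rand"
	"time"
)

func interruptibleWait(period time.Duration, exit <-chan struct{}) error {
	wait := time.Duration(rand.Int63n(int64(period)))
	t := time.NewTimer(wait)
	defer t.Stop()
	select {
	case <-t.C:
		return nil // waited the full randomized duration
	case <-exit:
		return errors.New("profiler stopped during retry wait")
	}
}

func main() {
	exit := make(chan struct{})
	// Stop shortly after starting, interrupting the wait.
	go func() { time.Sleep(100 * time.Millisecond); close(exit) }()
	fmt.Println(interruptibleWait(60*time.Second, exit))
}
```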