increase and simplify backoff #827
Conversation
The current backoffDuration is designed to deal with optimistic concurrency control, which is not what we're trying to do. More importantly, we're only doing a single retry right now if the first attempt failed, so the backoffDuration() call always returns 100ms +/- some jitter. This means that the moment there is an upstream issue, we immediately spam the upstream service with a second request, which is usually counter-productive.

This patch simply sleeps for rand(0, profilePeriod). The profile period is 60s by default, so on average we'll sleep for 30s, i.e. evenly space out our attempts to upload the profile.

Attached to the PR are two illustrations that show the old and new behavior in a simple simulation.
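For illustration, here's a minimal Go sketch of the scheme described above. The names profilePeriod, doUpload, and uploadWithRetry are placeholders for this example, not the actual dd-trace-go identifiers:

```go
package main

import (
	"errors"
	"fmt"
	"math/rand"
	"time"
)

const profilePeriod = 60 * time.Second // default profiling period

// uploadWithRetry makes one attempt and, on failure, retries exactly once
// after sleeping for a uniformly random duration in [0, profilePeriod),
// i.e. 30s on average with the default 60s period.
func uploadWithRetry(doUpload func() error) error {
	if err := doUpload(); err == nil {
		return nil
	}
	time.Sleep(time.Duration(rand.Int63n(int64(profilePeriod))))
	return doUpload()
}

func main() {
	err := uploadWithRetry(func() error { return errors.New("upstream unavailable") })
	fmt.Println("upload result:", err)
}
```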
I think tying this to the period makes sense. We will potentially add non-negligible latency when the failure was due to a transient network error, but this should not be a frequent occurrence, and given the small number of retries we do, it increases the chance that whatever failed has gone away (or at least that immediate retrying won't make the problem worse!).
There's plenty we can do here in follow-up steps, but I think this goes in a good direction. I'd like @dbenamydd's view on this as well though.
Spreading the retries over the sampling period is great to avoid a storm of retries to logs-intake. Retrying only once is a bit limited, but I understand from Alex's comment that it's more important not to worsen the situation than to avoid data loss. In light of recent events, I also agree with that approach. Nice work 👍!

I just see a little nit: in the unfortunate case where the retry happens at the end of the sampling period, the current profile will still remain in memory while a new one is created. In the worst case, where the retry upload times out (the default HTTP timeout value is 10 sec), 2/3 of the sampling data will already have been gathered by the profiler for the new period (we sample over 15 seconds). If the process memory consumption is already high, we could end up OOM. This is a very small corner case which will most likely be due to a poor memory configuration of the process (the amount of data gathered by the profiler during each sampling period is not bounded anyway), but reducing the retry period to 50 seconds to keep the memory overhead the same seems to be a small price to pay for a little more safety.
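To make the nit concrete, here is a hypothetical variant that caps the random delay at the sampling period minus the upload timeout; all names here are illustrative, not from the actual patch:

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

const (
	profilePeriod = 60 * time.Second
	uploadTimeout = 10 * time.Second // the default HTTP timeout mentioned above
)

// retryDelay draws the random retry delay from a 50s window instead of the
// full 60s period, so even a retry that runs into the full upload timeout
// finishes before the next profile is ready.
func retryDelay() time.Duration {
	window := profilePeriod - uploadTimeout
	return time.Duration(rand.Int63n(int64(window)))
}

func main() {
	fmt.Println("retrying in:", retryDelay())
}
```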
Thanks for taking a look! : )
The Go profiler currently buffers a maximum of 6 profiles in memory. Arguably that might be too much, but as far as I can tell it's an orthogonal issue to this PR, i.e. changing the retry backoff mechanism has no impact on the high-water mark for profiler memory usage. The illustration below captures the key aspects of the current implementation: see https://github.com/felixge/go-profiler-notes/blob/main/datadog.md#operation-details PTAL and let me know if that makes sense or not.
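As a rough sketch of what bounding buffered profiles can look like in Go, here is a buffered channel with drop-on-full semantics. The drop behavior is an assumption for illustration, not necessarily how dd-trace-go implements its 6-profile limit:

```go
package main

import "fmt"

type profile struct{ data []byte }

func main() {
	// Bounded buffer: at most 6 profiles are held in memory at once.
	queue := make(chan profile, 6)
	enqueue := func(p profile) {
		select {
		case queue <- p: // room left: buffer the profile
		default: // buffer full: drop it to cap memory usage
			fmt.Println("dropping profile, buffer full")
		}
	}
	for i := 0; i < 8; i++ {
		enqueue(profile{data: []byte("pprof bytes")})
	}
	fmt.Println("buffered profiles:", len(queue))
}
```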
PR #827 introduced a randomized wait duration before retrying a failed profile upload. This was done to reduce the likelihood of flooding the backend with uploads. The wait averaged half the profiling period. However, PR #961 lost this behavior when introducing interruptible sleeps. This commit restores the lost behavior. Fixes #1395
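A minimal sketch of how the randomized wait can be combined with an interruptible sleep; the function name and the context-based cancellation are illustrative assumptions, not the dd-trace-go API:

```go
package main

import (
	"context"
	"fmt"
	"math/rand"
	"time"
)

// sleepInterruptibly waits for a uniformly random duration in [0, period),
// but returns early if ctx is cancelled (e.g. on profiler shutdown).
func sleepInterruptibly(ctx context.Context, period time.Duration) error {
	t := time.NewTimer(time.Duration(rand.Int63n(int64(period))))
	defer t.Stop()
	select {
	case <-t.C:
		return nil // waited the full randomized backoff
	case <-ctx.Done():
		return ctx.Err() // interrupted before the backoff elapsed
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 100*time.Millisecond)
	defer cancel()
	fmt.Println(sleepInterruptibly(ctx, 60*time.Second))
}
```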
Illustration
Each dot is one upload attempt. The vertical lines mark the beginning/end of an upstream outage.
Before (note how the second retry attempt immediately follows the initial attempt):
After (note how the retry attempt is now smoothly sitting in between the first attempt and the next profile upload):
More thoughts
This is just a proposal for something simple we could do to make the client behave less aggressively when there is an upstream issue. That being said, we're still doing 2x the number of requests during an outage, so this won't solve the issue completely. However, when discussing this with @AlexJF, he was concerned about the tradeoff between data safety and rate limiting, so this change tries to keep that balance the same as it was before.