increase and simplify backoff
The current backoffDuration is designed for optimistic concurrency
control, which is not what we're trying to do. More importantly, we
currently retry only once if the first attempt fails, so the
backoffDuration() call always returns 100ms +/- some jitter. This means
that the moment there is an upstream issue (e.g. an S3 rate limit being
hit), we immediately spam the upstream service with a second request,
which is usually counterproductive.

This patch simply sleeps for rand(0, profilePeriod). The profile period
is 60s by default, so on average we'll sleep for 30s, i.e. evenly space
out our attempts to upload the profile.

Attached to the PR are two illustrations that show the old and new
behavior in a simple simulation.
felixge committed Jan 28, 2021
1 parent f35bc1d commit bdfb007
Showing 1 changed file with 1 addition and 14 deletions.
15 changes: 1 addition & 14 deletions profiler/upload.go
@@ -10,7 +10,6 @@ import (
 	"errors"
 	"fmt"
 	"io"
-	"math"
 	"math/rand"
 	"mime/multipart"
 	"net/http"
@@ -25,18 +24,6 @@ const maxRetries = 2
 var errOldAgent = errors.New("Datadog Agent is not accepting profiles. Agent-based profiling deployments " +
 	"require Datadog Agent >= 7.20")
 
-// backoffDuration calculates the backoff duration given an attempt number and max duration
-func backoffDuration(attempt int, max time.Duration) time.Duration {
-	// https://aws.amazon.com/blogs/architecture/exponential-backoff-and-jitter/
-	if attempt == 0 {
-		return 0
-	}
-	maxPow := float64(max / 100 * time.Millisecond)
-	pow := math.Min(math.Pow(2, float64(attempt)), maxPow)
-	ns := int64(float64(100*time.Millisecond) * pow)
-	return time.Duration(rand.Int63n(ns))
-}
-
 // upload tries to upload a batch of profiles. It has retry and backoff mechanisms.
 func (p *profiler) upload(bat batch) error {
 	statsd := p.cfg.statsd
@@ -45,7 +32,7 @@ func (p *profiler) upload(bat batch) error {
 		err = p.doRequest(bat)
 		if rerr, ok := err.(*retriableError); ok {
 			statsd.Count("datadog.profiler.go.upload_retry", 1, nil, 1)
-			wait := backoffDuration(i+1, p.cfg.cpuDuration)
+			wait := time.Duration(rand.Int63n(p.cfg.period.Nanoseconds()))
 			log.Error("Uploading profile failed: %v. Trying again in %s...", rerr, wait)
 			time.Sleep(wait)
 			continue
