profiler: increase and simplify backoff (#827)
The current backoffDuration is designed for optimistic concurrency
control, which is not what we're trying to do. More importantly, we
currently retry only once if the first attempt fails, so the
backoffDuration() call always returns 100ms +/- some jitter. This means
that the moment there is an upstream issue, we immediately spam the
upstream service with a second request, which is usually
counterproductive.

This patch simply sleeps for rand(0, profilePeriod). The profile period
is 60s by default, so on average we'll sleep for 30s, i.e. we evenly
space out our attempts to upload the profile.

Attached to the PR are two illustrations that show the old and new
behavior in a simple simulation.
felixge authored Feb 10, 2021
1 parent 40012f0 commit a85b65f
Showing 1 changed file with 1 addition and 14 deletions.
15 changes: 1 addition & 14 deletions profiler/upload.go
@@ -10,7 +10,6 @@ import (
 	"errors"
 	"fmt"
 	"io"
-	"math"
 	"math/rand"
 	"mime/multipart"
 	"net/http"
@@ -25,18 +24,6 @@ const maxRetries = 2
 var errOldAgent = errors.New("Datadog Agent is not accepting profiles. Agent-based profiling deployments " +
 	"require Datadog Agent >= 7.20")
 
-// backoffDuration calculates the backoff duration given an attempt number and max duration
-func backoffDuration(attempt int, max time.Duration) time.Duration {
-	// https://aws.amazon.com/blogs/architecture/exponential-backoff-and-jitter/
-	if attempt == 0 {
-		return 0
-	}
-	maxPow := float64(max / 100 * time.Millisecond)
-	pow := math.Min(math.Pow(2, float64(attempt)), maxPow)
-	ns := int64(float64(100*time.Millisecond) * pow)
-	return time.Duration(rand.Int63n(ns))
-}
-
 // upload tries to upload a batch of profiles. It has retry and backoff mechanisms.
 func (p *profiler) upload(bat batch) error {
 	statsd := p.cfg.statsd
@@ -45,7 +32,7 @@ func (p *profiler) upload(bat batch) error {
 		err = p.doRequest(bat)
 		if rerr, ok := err.(*retriableError); ok {
 			statsd.Count("datadog.profiler.go.upload_retry", 1, nil, 1)
-			wait := backoffDuration(i+1, p.cfg.cpuDuration)
+			wait := time.Duration(rand.Int63n(p.cfg.period.Nanoseconds()))
 			log.Error("Uploading profile failed: %v. Trying again in %s...", rerr, wait)
 			time.Sleep(wait)
 			continue
