Does Veneur retry failed datadog flushes? #560
Comments
Veneur itself doesn't retry flushes to Datadog (though you could use an HTTP proxy for that, if you wanted). The entire pipeline is assumed to be mildly lossy, given that metrics are themselves received over UDP, which provides no delivery guarantees. Sporadic, occasional metric failures are tolerated. That said, if you're seeing a lot of timeouts, something is probably up. We ourselves don't see many timeouts running Veneur at scale, so I'm curious what's going on here. Is your outbound network connection spotty? Are you sending a particularly large payload with each flush (a lot of metrics, or a long flush cycle)?
We're seeing a small number of sustained errors. I've got flush_max_per_body set to 25000, which is the default in the example config. I don't know if this is in line with what you're seeing, but it's 10 to 15 errors every 15 minutes across the various DCs I've deployed Veneur to (broken down by DC). These are servers from all over the world going to aws us-east-1, so I'm expecting some errors, just not sure how many.
Digging into the native Datadog agent, it does look like it has some retry logic here: https://github.com/DataDog/datadog-agent/blob/d3e74927d78a5982d9978ed8540bd6b2c61ab437/pkg/forwarder/transaction.go#L144 under certain failure cases, namely request errors such as timeouts.
Yeah, that's definitely not in line with what we've experienced. We're not using Datadog ourselves at the moment, so I can't compare against current data, but timeouts in Veneur are quite rare (less than one per day) except during a Datadog outage (and their status page is green right now).

Just to clarify: when you say that this is from servers all around the world going to aws us-east-1, is that from tracing the location that the Datadog endpoint resolves to?

We do use haproxy for external egress from our network, and haproxy does have built-in retries. So it's possible that we wouldn't have noticed connection timeouts within Veneur, if haproxy was retrying and the success rate of the retried requests was high enough. As a quick test, I'd recommend running requests through a proxy like haproxy and seeing if that fixes the issue.
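For anyone trying that, a minimal haproxy sketch along those lines might look like the following. The listener address, timeouts, and TLS handling here are assumptions, not a vetted config, and note that haproxy's `retries` only covers failed connection attempts, not timeouts on an established request:

```
defaults
    mode http
    retries 3            # retry failed connection attempts to the backend
    option redispatch
    timeout connect 5s
    timeout client  60s
    timeout server  60s

listen datadog_egress
    # Local plaintext listener; haproxy re-encrypts toward Datadog.
    bind 127.0.0.1:3128
    http-request set-header Host app.datadoghq.com
    server datadog app.datadoghq.com:443 ssl verify none
```

Veneur would then need to be pointed at the local listener instead of app.datadoghq.com; the exact wiring depends on your config.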
Yep. Every DC we have resolved to us-east-1 ELBs. We're talking directly to Datadog without a proxy. It seems the path forward is for me to add some basic retry logic into the Datadog requests. We're not moving away from Datadog anytime soon, so retry logic is desired. I've put in a fair amount of effort to get our dogstatsd pipeline reliable, so I'm not stopping now.
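Something like this minimal sketch is what I have in mind (not the actual change; the helper name, URL, and backoff here are placeholders, not Veneur's real flush code):

```go
package main

import (
	"bytes"
	"fmt"
	"net/http"
	"time"
)

// postWithRetry is a hypothetical helper illustrating the kind of retry
// logic being discussed: re-send the same body a fixed number of times,
// backing off between attempts, treating transport errors and 5xx
// responses as retryable and everything else as final.
func postWithRetry(client *http.Client, url string, body []byte, maxAttempts int) (*http.Response, error) {
	var lastErr error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		resp, err := client.Post(url, "application/json", bytes.NewReader(body))
		if err == nil && resp.StatusCode < 500 {
			return resp, nil // success, or a non-retryable 4xx
		}
		if err != nil {
			lastErr = err
		} else {
			resp.Body.Close()
			lastErr = fmt.Errorf("server returned %d", resp.StatusCode)
		}
		time.Sleep(time.Duration(attempt) * time.Second) // simple linear backoff
	}
	return nil, fmt.Errorf("giving up after %d attempts: %w", maxAttempts, lastErr)
}

func main() {
	client := &http.Client{Timeout: 10 * time.Second}
	// URL and payload are placeholders for illustration only.
	_, err := postWithRetry(client, "https://app.datadoghq.com/api/v1/series", []byte(`{}`), 3)
	if err != nil {
		fmt.Println("flush failed:", err)
	}
}
```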
Opened #561 for an I'm-new-to-Go-and-I-think-this-is-a-valid-fix fix.
I'm not that good at Go, so I couldn't find my answer in the code.
Does Veneur retry sending failed Datadog flushes? I've seen a fair amount of flushes fail due to HTTP errors (timeouts) and I'm wondering if those are just warnings or could be data loss.
...: time="2018-10-10T10:22:44-05:00" level=warning msg="Could not execute request" action=flush error="net/http: request canceled (Client.Timeout exceeded while awaiting headers)" host=app.datadoghq.com path=/api/v1/series
Are these requests retried? And if so, how many retries before the segment is lost?