Does Veneur retry failed datadog flushes? #560
Comments
Veneur itself doesn't retry flushes to Datadog (though you could use an HTTP proxy for that, if you wanted). The entire pipeline is assumed to be mildly lossy, given that metrics are themselves received over UDP, which provides no delivery guarantees. Sporadic, occasional metric failures are tolerated. That said, if you're seeing a lot of timeouts, something is probably up. We ourselves don't see many timeouts running Veneur at scale, so I'm curious what's going on here. Is your outbound network connection spotty? Are you sending a particularly large payload with each flush (a lot of metrics, or a long flush cycle)?
We're seeing a small number of sustained errors. I've got flush_max_per_body set to 25000, which is the default in the example config. I don't know if this is in line with what you're seeing, but it's 10 to 15 errors every 15 minutes across the various DCs I've deployed Veneur to (broken down by DC). These are servers from all over the world going to aws us-east-1, so I'm expecting some errors, just not sure how many.
Digging into the native Datadog agent, it does look like it has some retry logic here: https://github.com/DataDog/datadog-agent/blob/d3e74927d78a5982d9978ed8540bd6b2c61ab437/pkg/forwarder/transaction.go#L144 under certain failure cases, namely request errors such as timeouts.
Yeah, that's definitely not in line with what we've experienced. We're not using Datadog ourselves at the moment, so I can't compare against current data, but timeouts in Veneur are quite rare (less than one per day) except during a Datadog outage (and their status page is green right now).

Just to clarify: when you say that this is from servers all around the world going to aws us-east-1, is that from tracing the location that the Datadog endpoint resolves to?

We do use haproxy for external egress from our network, and haproxy does have built-in retries. So it's possible that we wouldn't have noticed connection timeouts within Veneur, if haproxy was retrying and the success rate of the retried requests was high enough. As a quick test, I'd recommend running requests through a proxy like haproxy and seeing if that fixes the issue.
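For anyone trying that, a minimal haproxy sketch along those lines might look like the following. The listener address, timeouts, and TLS handling here are assumptions, not a vetted config, and note that haproxy's `retries` only covers failed connection attempts, not timeouts on an established request:

```
defaults
    mode http
    retries 3            # retry failed connection attempts to the backend
    option redispatch
    timeout connect 5s
    timeout client  60s
    timeout server  60s

listen datadog_egress
    # Local plaintext listener; haproxy re-encrypts toward Datadog.
    bind 127.0.0.1:3128
    http-request set-header Host app.datadoghq.com
    server datadog app.datadoghq.com:443 ssl verify none
```

Veneur would then need to be pointed at the local listener instead of app.datadoghq.com; the exact wiring depends on your config.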
Yep. Every DC we have resolved to us-east-1 ELBs. We're talking directly to Datadog without a proxy. It seems the path forward is for me to add some basic retry logic into the Datadog requests. We're not moving away from Datadog anytime soon, so retry logic is desired. I've put in a fair amount of effort to get our dogstatsd pipeline reliable, so I'm not stopping now.
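Something like this minimal sketch is what I have in mind (not the actual change; the helper name, URL, and backoff here are placeholders, not Veneur's real flush code):

```go
package main

import (
	"bytes"
	"fmt"
	"net/http"
	"time"
)

// postWithRetry is a hypothetical helper illustrating the kind of retry
// logic being discussed: re-send the same body a fixed number of times,
// backing off between attempts, treating transport errors and 5xx
// responses as retryable and everything else as final.
func postWithRetry(client *http.Client, url string, body []byte, maxAttempts int) (*http.Response, error) {
	var lastErr error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		resp, err := client.Post(url, "application/json", bytes.NewReader(body))
		if err == nil && resp.StatusCode < 500 {
			return resp, nil // success, or a non-retryable 4xx
		}
		if err != nil {
			lastErr = err
		} else {
			resp.Body.Close()
			lastErr = fmt.Errorf("server returned %d", resp.StatusCode)
		}
		time.Sleep(time.Duration(attempt) * time.Second) // simple linear backoff
	}
	return nil, fmt.Errorf("giving up after %d attempts: %w", maxAttempts, lastErr)
}

func main() {
	client := &http.Client{Timeout: 10 * time.Second}
	// URL and payload are placeholders for illustration only.
	_, err := postWithRetry(client, "https://app.datadoghq.com/api/v1/series", []byte(`{}`), 3)
	if err != nil {
		fmt.Println("flush failed:", err)
	}
}
```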
Opened #561 for an I'm-new-to-Go-and-I-think-this-is-a-valid-fix fix.
I'm not that good at Go, so I couldn't find my answer in the code.
Does Veneur retry sending failed Datadog flushes? I've seen a fair amount of flushes fail due to HTTP errors (timeouts) and I'm wondering if those are just warnings or could be data loss.
...: time="2018-10-10T10:22:44-05:00" level=warning msg="Could not execute request" action=flush error="net/http: request canceled (Client.Timeout exceeded while awaiting headers)" host=app.datadoghq.com path=/api/v1/series
Are these requests retried? And if so, how many retries before the segment is lost?