Added simple retry logic for failed post requests (#560) #561

volfco · 2018-10-10T19:56:48Z

Summary

Added a for loop in the PostHelper method to retry the request if the first request failed due to an IO exception, it'd try 3 more times with a 50ms pause after each retry. If these 3 requests fails, it executes the existing failure logic. If one of these 3 requests succeeds, the for loop is broken and execution resumes as normal.

Motivation

I'm seeing failing requests to Datadog in production, and less dataloss is always something that should be desired.

Test plan

I added a print statement in my local code to print when the post failed. I routed the datadog url to localhost and got....

Ta Da!

Rollout/monitoring/revert plan

Nothing should need to be done other than watch errors drop

ChimeraCoder · 2018-10-10T20:43:54Z

First, congratulations on your first (I think) Go pull request! PostHelper would indeed be the place to add this, so you found the right spot. 😄

That said, from an architectural standpoint, I'm averse to adding retrial logic within Veneur itself. It would intersect in a cumbersome way with the proxies we already have, which could cause problems that are difficult to track down.

Proxying logic can be complex, with a lot of edge cases that can come up. Dedicated proxies like haproxy handle this pretty well. I suspect that's why we've never noticed these timeouts as an issue during normal use at Stripe, because the proxy would retry it for us, so we only ever run into timeouts when the upstream service is completely down for an extended period of time.

I'd strongly secommend standing up a proxy instead of relying on Veneur for retrial logic. That's how we handle thing for ourselves internally at Stripe. Other companies that are running Veneur at scale take that approach as well, using haproxy, Envoy, Bandaid, etc.

That said, thanks for putting this together (and reporting the issue in the first place), and let us know if you run into any issues with the above!

P.S. (If you're wondering why I'd draw the line here, and not the other stuff that's in PostHelper, the answer is: that stuff shouldn't really be in PostHelper either, and we've talked about stripping it out as well, in favor of offloading that logic to a dedicated traffic proxy.)

Added simple retry logic for failed post requests (#560)

677b588

volfco mentioned this pull request Oct 10, 2018

Does Veneur retry failed datadog flushes? #560

Closed

ChimeraCoder closed this Oct 10, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Added simple retry logic for failed post requests (#560) #561

Added simple retry logic for failed post requests (#560) #561

volfco commented Oct 10, 2018

ChimeraCoder commented Oct 10, 2018

Added simple retry logic for failed post requests (#560) #561

Added simple retry logic for failed post requests (#560) #561

Conversation

volfco commented Oct 10, 2018

Summary

Motivation

Test plan

Rollout/monitoring/revert plan

ChimeraCoder commented Oct 10, 2018