Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added simple retry logic for failed post requests (#560) #561

Closed
wants to merge 1 commit into from
Closed

Added simple retry logic for failed post requests (#560) #561

wants to merge 1 commit into from

Conversation

volfco
Copy link

@volfco volfco commented Oct 10, 2018

Summary

Added a for loop in the PostHelper method to retry the request if the first request failed due to an IO exception, it'd try 3 more times with a 50ms pause after each retry. If these 3 requests fails, it executes the existing failure logic. If one of these 3 requests succeeds, the for loop is broken and execution resumes as normal.

Motivation

I'm seeing failing requests to Datadog in production, and less dataloss is always something that should be desired.

Test plan

I added a print statement in my local code to print when the post failed. I routed the datadog url to localhost and got....

image

Ta Da!

Rollout/monitoring/revert plan

Nothing should need to be done other than watch errors drop

@ChimeraCoder
Copy link
Contributor

First, congratulations on your first (I think) Go pull request! PostHelper would indeed be the place to add this, so you found the right spot. 😄

That said, from an architectural standpoint, I'm averse to adding retrial logic within Veneur itself. It would intersect in a cumbersome way with the proxies we already have, which could cause problems that are difficult to track down.

Proxying logic can be complex, with a lot of edge cases that can come up. Dedicated proxies like haproxy handle this pretty well. I suspect that's why we've never noticed these timeouts as an issue during normal use at Stripe, because the proxy would retry it for us, so we only ever run into timeouts when the upstream service is completely down for an extended period of time.

I'd strongly secommend standing up a proxy instead of relying on Veneur for retrial logic. That's how we handle thing for ourselves internally at Stripe. Other companies that are running Veneur at scale take that approach as well, using haproxy, Envoy, Bandaid, etc.

That said, thanks for putting this together (and reporting the issue in the first place), and let us know if you run into any issues with the above!

P.S. (If you're wondering why I'd draw the line here, and not the other stuff that's in PostHelper, the answer is: that stuff shouldn't really be in PostHelper either, and we've talked about stripping it out as well, in favor of offloading that logic to a dedicated traffic proxy.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants