Reverse proxy upstream keepalive #6452
We don't really deal with keepalives ourselves; we just pass the value on to the Go standard library's HTTP transport: https://pkg.go.dev/net/http#Transport

To be clear, you're talking about HTTP keep-alives, right? Not TCP keepalive?

Before making any changes, I'd like to reproduce the issue as minimally as possible to understand it. Do you have a minimal reproducer? Ideally, if the backend can also be Caddy, that would help me eliminate Node as a possible source of confusion/bugs/problems.
Thanks for the feedback. I've tried reproducing locally and managed to get a handful of 502s, but it's harder than I thought because it's a timing issue.

After further reading on the topic, the cause is slightly different from what I assumed above and is apparently well known to HTTP client implementers: it's a race condition between the moment the server closes the idle socket and the moment the client detects it (so in the example above, the affected requests are not those in the range 5s < x < 2min, but more like 5s < x < 5.1s). It seems to be a known issue in all HTTP clients, one that doesn't have a clean solution.

Go's library does not use a heuristic to detect that situation and instead does what the HTTP spec mandates in the general case: only idempotent request verbs are retried. In my case, POST requests without idempotency headers are excluded and therefore fail. Other clients, such as browsers, tend to retry in all cases instead, going against the spec to some extent. The issue was reported and closed for Go's HTTP library in golang/go#22158, which brought support for idempotency headers.

I think it's still worth stating this in the docs, as it does cause issues and the root cause is hard to figure out. Something along the lines of: "HTTP keep-alive is enabled by default with a long timeout. If your upstream server is configured to close idle connections after a shorter delay, HTTP 502s may be returned for some non-idempotent requests."

Further reading:
That's a great idea. Thanks for looking deeply into this; it would have driven me bonkers too. So you're saying that a POST request (for example) with an Idempotency-Key header (for example) would be retried? We can add that to the docs. I agree it's worth it.
Hello, and thanks for Caddy; it's a pleasure to use thanks to its easy config and its "it just works" approach.
Issue
I've spent hours trying to understand why Caddy was randomly returning 502s for some calls proxied to my Node.js backend. It was logging the following message, which made me pointlessly look for issues in my network stack:
Resolution
I have finally identified the root cause: the default HTTP transport keep-alive duration in Caddy (2 minutes) far exceeds the default Node.js HTTP server keepAliveTimeout setting (5 seconds).
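One workaround, sketched here under the assumption that the upstream is a Node.js server with the default 5-second keepAliveTimeout, is to set Caddy's idle keep-alive below the upstream's timeout so that the proxy, not the backend, is the side that closes idle connections first. The `keepalive` subdirective of `transport http` controls this; the site address and port are placeholders:

```caddyfile
example.com {
	reverse_proxy localhost:3000 {
		transport http {
			# Keep Caddy's idle timeout below Node's default
			# keepAliveTimeout of 5s, so Caddy never tries to reuse
			# a connection the backend has already closed.
			keepalive 4s
		}
	}
}
```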
Improvements
I believe it would be worth doing the following:
What do you think?
Rationale
This issue is not specific to Caddy; you can find many articles online about people spending lots of time on similar issues, for each proxy out there that ships with a long keep-alive enabled by default. So it would fit Caddy's philosophy to provide something that works better out of the box.
This issue tends to appear for small apps that suddenly gain users. Backends that get very little activity hit Caddy's timeout (which hides the problem), while those with a lot of activity never hit Node's timeout (which may hide the problem as well). But for a small server that sees a few bursts of activity, a lot of unexplainable 502s suddenly pop up, as many calls fall in the 5s < x < 2min gap.