-
Notifications
You must be signed in to change notification settings - Fork 17.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
x/net/http2: raciness in stream close prevents closing of failed connections, blocking subsequent requests #59690
Comments
I think the simplest way to address this would be to have that I don't know whether it's the best method though - HTTP/2's multiplexing complicates things. At the As a result, receiving a timeout doesn't necessarily mean that the connection itself is dead: tearing it down might therefore impact working requests transiting the same connection via other streams. Setting It'd probably be tidier to extend |
Refusing to reuse a connection that's not successfully handling requests seems reasonable to me. False positives (a request timing out because of a slow server handler, not a bad connection, say) will result in additional connections being unnecessarily created, which seems less bad than continuing to send requests to an unresponsive connection. |
This commit mitigates some raciness which can occur when an established connection fails. Without it, the connection can continue to be drawn from the connection pool, leading to subsequent requests failing. Fixes golang/go#59690
Change https://go.dev/cl/486156 mentions this issue: |
This adds @dniel's changes from https://go-review.googlesource.com/c/net/+/486156 When a request on a connection fails to complete successfully, mark the conn as doNotReuse. It's possible for requests to fail for reasons unrelated to connection health, but opening a new connection unnecessarily is less of an impact than reusing a dead connection. Fixes golang/go#59690
This adds @dniel's changes from https://go-review.googlesource.com/c/net/+/486156 When a request on a connection fails to complete successfully, mark the conn as doNotReuse. It's possible for requests to fail for reasons unrelated to connection health, but opening a new connection unnecessarily is less of an impact than reusing a dead connection. Fixes golang/go#59690
Change https://go.dev/cl/490335 mentions this issue: |
@gopherbot Please open backport to 1.19. Rationale quoting #60301 (which is for the other open minor release): When encountered, this is a serious issue: requests are blocked until the kernel gives up on the socket (with default settings on a Linux host, that's around 17 minutes). If the operator restarts the application as a result, data loss may occur. Although the fix is in x/net v0.10.0 it will not be available to most developers who likely import net/http (and so rely on the h2 bundle rolled into Go's releases) The only viable workaround is to disable the H2 client entirely via environment variable. |
Backport issue(s) opened: #60301 (for 1.20), #60662 (for 1.19). Remember to create the cherry-pick CL(s) as soon as the patch is submitted to master, according to https://go.dev/wiki/MinorReleases. |
Change https://go.dev/cl/507395 mentions this issue: |
https://go.dev/cl/486156 was problematic, breaking applications like load balancers where cancellations are normal. We tried reverting only part of that change but it turns out all three of the cancellation paths tainting the connection were wrong and could ultimately be triggered by a request cancellation. We need a different solution for golang/go#59690 which we'll now reopen. This isn't a straight revert of CL 486156 because subsequent changes (CL 496335) depended on its cancelRequest closure, so this was reverted by hand. Fixes golang/go#60818 Updates golang/go#59690 (no longer fixed) Change-Id: Ic83b746d7cf89d07655d9efbd04b4d957487cb35 Reviewed-on: https://go-review.googlesource.com/c/net/+/507395 Reviewed-by: Damien Neil <dneil@google.com> Run-TryBot: Brad Fitzpatrick <bradfitz@golang.org> Reviewed-by: Dmitri Shuralyov <dmitshur@google.com> TryBot-Result: Gopher Robot <gobot@golang.org>
I have noticed that HTTP/2 tends to be unreliable in certain circumstances due to bugs in Go's HTTP/2 library. See the following upstream issues: golang/go#36026 golang/go#59690 Let's add a configuration option here, so that HTTP/2 can be disabled on an individual basis if needed.
@btasker is http2Trans, err := http2.ConfigureTransports(transport)
if err == nil {
http2Trans.ReadIdleTimeout = time.Duration(ReadIdleTimeout)
} still a valid workaround for go 1.20.x? (I assume yes since I see the fix planned for 1.20.7 Milestone now) Is the ReadIdleTimeout of the http2 transport redundant with https://pkg.go.dev/net#Dialer.KeepAlive? |
Hey @serbrech
Yep, still valid for both.
I've not checked the interaction, but I suspect it's probably not redundant, no (as the dialer keepalive should be operating at a lower level). |
Sometimes we're seeing HTTP/2 connections be reused for up to ~16 minutes despite being completely defunct -- i.e. we're receiving no response from the remote end. We believe that this is in part a result of Go bug golang/go#59690 in which failed connections are not correctly discarded under certain circumstances. This commit enables the http2 transport's `ReadIdleTimeout` function, which will send an h2 "ping" frame on any connection that's been idle longer than the value of `ReadIdleTimeout`. If it doesn't hear back in `PingTimeout`, it will throw away the connection. This seems like a reasonable thing to leave on in general.
Sometimes we're seeing HTTP/2 connections be reused for up to ~16 minutes despite being completely defunct -- i.e. we're receiving no response from the remote end. We believe that this is in part a result of Go bug golang/go#59690 in which failed connections are not correctly discarded under certain circumstances. This commit enables the http2 transport's `ReadIdleTimeout` function, which will send an h2 "ping" frame on any connection that's been idle longer than the value of `ReadIdleTimeout`. If it doesn't hear back in `PingTimeout`, it will throw away the connection. This seems like a reasonable thing to leave on in general.
**Description:** This PR introduces options to enable `http/2` health check by exposing `HTTP2ReadIdleTimeout` and `HTTP2PingTimeout` The golang issues are: golang/go#59690 golang/go#36026 In summary, if due to environmental issue the underlying tcp connection used by the http/2 client in the exporter became unstable/unusable/unreachable, unlike http/1, the http/2 client does not forcibly close the connection and redial a new one, instead it keeps using it for 15 minutes (default value of OS `tcp_retries2` ) until the OS cleans it up and a new tcp connection gets established. From OTEL user perspective, one will see a spike in export failures/timeouts for ~15 minutes, this will happen for every connection that got into a bad state, after 15 minutes things will recover until next time the tcp connection gets into a bad state. **Testing:** - Run OTEL with one of the exporters that uses HTTP/2 client, example `signalfx` exporter - For simplicity use a single pipeline/exporter - In a different shell, run this to watch the tcp state of the established connection ``` while (true); do echo date; sudo netstat -anp | grep -E '<endpoin_ip_address(es)>' | sort -k 5; sleep 2; done ``` - From the netstat, take a note of the source port and the source IP address - replace <> from previous step `sudo iptables -A OUTPUT -s <source_IP> -p tcp --sport <source_Port> -j DROP` - Note how the OTEL exporter export starts timing out Expected Result: - A new connection should be established, similarly to http/1 and exports should succeed Actual Result: - The exports keep failing for ~ 15 minutes or for whatever the OS `tcp_retries2` is configured to - After 15 minutes, a new tcp connection is created and exports start working **Documentation:** Readme is updated with new settings Signed-off-by: Dani Louca <dlouca@splunk.com>
**Description:** This PR introduces options to enable `http/2` health check by exposing `HTTP2ReadIdleTimeout` and `HTTP2PingTimeout` The golang issues are: golang/go#59690 golang/go#36026 In summary, if due to environmental issue the underlying tcp connection used by the http/2 client in the exporter became unstable/unusable/unreachable, unlike http/1, the http/2 client does not forcibly close the connection and redial a new one, instead it keeps using it for 15 minutes (default value of OS `tcp_retries2` ) until the OS cleans it up and a new tcp connection gets established. From OTEL user perspective, one will see a spike in export failures/timeouts for ~15 minutes, this will happen for every connection that got into a bad state, after 15 minutes things will recover until next time the tcp connection gets into a bad state. **Testing:** - Run OTEL with one of the exporters that uses HTTP/2 client, example `signalfx` exporter - For simplicity use a single pipeline/exporter - In a different shell, run this to watch the tcp state of the established connection ``` while (true); do echo date; sudo netstat -anp | grep -E '<endpoin_ip_address(es)>' | sort -k 5; sleep 2; done ``` - From the netstat, take a note of the source port and the source IP address - replace <> from previous step `sudo iptables -A OUTPUT -s <source_IP> -p tcp --sport <source_Port> -j DROP` - Note how the OTEL exporter export starts timing out Expected Result: - A new connection should be established, similarly to http/1 and exports should succeed Actual Result: - The exports keep failing for ~ 15 minutes or for whatever the OS `tcp_retries2` is configured to - After 15 minutes, a new tcp connection is created and exports start working **Documentation:** Readme is updated with new settings Signed-off-by: Dani Louca <dlouca@splunk.com>
….91.0 (#33) [![Mend Renovate](https://app.renovatebot.com/images/banner.svg)](https://renovatebot.com) This PR contains the following updates: | Package | Type | Update | Change | |---|---|---|---| | [go.opentelemetry.io/collector/exporter](https://togithub.com/open-telemetry/opentelemetry-collector) | require | minor | `v0.90.0` -> `v0.91.0` | --- ### Release Notes <details> <summary>open-telemetry/opentelemetry-collector (go.opentelemetry.io/collector/exporter)</summary> ### [`v0.91.0`](https://togithub.com/open-telemetry/opentelemetry-collector/blob/HEAD/CHANGELOG.md#v0910) [Compare Source](https://togithub.com/open-telemetry/opentelemetry-collector/compare/v0.90.1...v0.91.0) ##### 💡 Enhancements 💡 - `statusreporting`: Automates status reporting upon the completion of component.Start(). ([#​7682](https://togithub.com/open-telemetry/opentelemetry-collector/issues/7682)) - `service`: add resource attributes as labels to otel metrics to ensures backwards compatibility with OpenCensus metrics. ([#​9029](https://togithub.com/open-telemetry/opentelemetry-collector/issues/9029)) - `semconv`: Generated Semantic conventions 1.21. ([#​9056](https://togithub.com/open-telemetry/opentelemetry-collector/issues/9056)) - `config/confighttp`: Exposes http/2 transport settings to enable health check and workaround golang http/2 issue [https://github.com/golang/go/issues/59690](https://togithub.com/golang/go/issues/59690) ([#​9022](https://togithub.com/open-telemetry/opentelemetry-collector/issues/9022)) - `cmd/builder`: running builder version on binaries installed with `go install` will output the version specified at the suffix. ([#​8770](https://togithub.com/open-telemetry/opentelemetry-collector/issues/8770)) ##### 🧰 Bug fixes 🧰 - `exporterhelper`: fix missed metric aggregations ([#​9048](https://togithub.com/open-telemetry/opentelemetry-collector/issues/9048)) This ensures that context cancellation in the exporter doesn't interfere with metric aggregation. The OTel SDK currently returns if there's an error in the context used in `Add`. This means that if there's a cancelled context in an export, the metrics are now recorded. - `service`: Fix bug where MutatesData would not correctly propagate through connectors. ([#​9053](https://togithub.com/open-telemetry/opentelemetry-collector/issues/9053)) ### [`v0.90.1`](https://togithub.com/open-telemetry/opentelemetry-collector/blob/HEAD/CHANGELOG.md#v0901) [Compare Source](https://togithub.com/open-telemetry/opentelemetry-collector/compare/v0.90.0...v0.90.1) ##### 🧰 Bug fixes 🧰 - `exporterhelper`: Remove noisy log ([#​9017](https://togithub.com/open-telemetry/opentelemetry-collector/issues/9017)) </details> --- ### Configuration 📅 **Schedule**: Branch creation - At any time (no schedule defined), Automerge - At any time (no schedule defined). 🚦 **Automerge**: Disabled by config. Please merge this manually once you are satisfied. ♻ **Rebasing**: Whenever PR becomes conflicted, or you tick the rebase/retry checkbox. 🔕 **Ignore**: Close this PR and you won't be reminded about this update again. --- - [ ] <!-- rebase-check -->If you want to rebase/retry this PR, check this box --- This PR has been generated by [Mend Renovate](https://www.mend.io/free-developer-tools/renovate/). View repository job log [here](https://developer.mend.io/github/ymotongpoo/opentelemetry-collector-extra). <!--renovate-debug:eyJjcmVhdGVkSW5WZXIiOiIzNy41OS44IiwidXBkYXRlZEluVmVyIjoiMzcuODcuMiIsInRhcmdldEJyYW5jaCI6Im1haW4ifQ==--> Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
….91.0 (#32) [![Mend Renovate](https://app.renovatebot.com/images/banner.svg)](https://renovatebot.com) This PR contains the following updates: | Package | Change | Age | Adoption | Passing | Confidence | |---|---|---|---|---|---| | [go.opentelemetry.io/collector/consumer](https://togithub.com/open-telemetry/opentelemetry-collector) | `v0.90.0` -> `v0.91.0` | [![age](https://developer.mend.io/api/mc/badges/age/go/go.opentelemetry.io%2fcollector%2fconsumer/v0.91.0?slim=true)](https://docs.renovatebot.com/merge-confidence/) | [![adoption](https://developer.mend.io/api/mc/badges/adoption/go/go.opentelemetry.io%2fcollector%2fconsumer/v0.91.0?slim=true)](https://docs.renovatebot.com/merge-confidence/) | [![passing](https://developer.mend.io/api/mc/badges/compatibility/go/go.opentelemetry.io%2fcollector%2fconsumer/v0.90.0/v0.91.0?slim=true)](https://docs.renovatebot.com/merge-confidence/) | [![confidence](https://developer.mend.io/api/mc/badges/confidence/go/go.opentelemetry.io%2fcollector%2fconsumer/v0.90.0/v0.91.0?slim=true)](https://docs.renovatebot.com/merge-confidence/) | --- ### Release Notes <details> <summary>open-telemetry/opentelemetry-collector (go.opentelemetry.io/collector/consumer)</summary> ### [`v0.91.0`](https://togithub.com/open-telemetry/opentelemetry-collector/blob/HEAD/CHANGELOG.md#v0910) [Compare Source](https://togithub.com/open-telemetry/opentelemetry-collector/compare/v0.90.1...v0.91.0) ##### 💡 Enhancements 💡 - `statusreporting`: Automates status reporting upon the completion of component.Start(). ([#​7682](https://togithub.com/open-telemetry/opentelemetry-collector/issues/7682)) - `service`: add resource attributes as labels to otel metrics to ensures backwards compatibility with OpenCensus metrics. ([#​9029](https://togithub.com/open-telemetry/opentelemetry-collector/issues/9029)) - `semconv`: Generated Semantic conventions 1.21. ([#​9056](https://togithub.com/open-telemetry/opentelemetry-collector/issues/9056)) - `config/confighttp`: Exposes http/2 transport settings to enable health check and workaround golang http/2 issue [https://github.com/golang/go/issues/59690](https://togithub.com/golang/go/issues/59690) ([#​9022](https://togithub.com/open-telemetry/opentelemetry-collector/issues/9022)) - `cmd/builder`: running builder version on binaries installed with `go install` will output the version specified at the suffix. ([#​8770](https://togithub.com/open-telemetry/opentelemetry-collector/issues/8770)) ##### 🧰 Bug fixes 🧰 - `exporterhelper`: fix missed metric aggregations ([#​9048](https://togithub.com/open-telemetry/opentelemetry-collector/issues/9048)) This ensures that context cancellation in the exporter doesn't interfere with metric aggregation. The OTel SDK currently returns if there's an error in the context used in `Add`. This means that if there's a cancelled context in an export, the metrics are now recorded. - `service`: Fix bug where MutatesData would not correctly propagate through connectors. ([#​9053](https://togithub.com/open-telemetry/opentelemetry-collector/issues/9053)) ### [`v0.90.1`](https://togithub.com/open-telemetry/opentelemetry-collector/blob/HEAD/CHANGELOG.md#v0901) [Compare Source](https://togithub.com/open-telemetry/opentelemetry-collector/compare/v0.90.0...v0.90.1) ##### 🧰 Bug fixes 🧰 - `exporterhelper`: Remove noisy log ([#​9017](https://togithub.com/open-telemetry/opentelemetry-collector/issues/9017)) </details> --- ### Configuration 📅 **Schedule**: Branch creation - At any time (no schedule defined), Automerge - At any time (no schedule defined). 🚦 **Automerge**: Disabled by config. Please merge this manually once you are satisfied. ♻ **Rebasing**: Whenever PR becomes conflicted, or you tick the rebase/retry checkbox. 🔕 **Ignore**: Close this PR and you won't be reminded about this update again. --- - [ ] <!-- rebase-check -->If you want to rebase/retry this PR, check this box --- This PR has been generated by [Mend Renovate](https://www.mend.io/free-developer-tools/renovate/). View repository job log [here](https://developer.mend.io/github/ymotongpoo/opentelemetry-collector-extra). <!--renovate-debug:eyJjcmVhdGVkSW5WZXIiOiIzNy41OS44IiwidXBkYXRlZEluVmVyIjoiMzcuMTAzLjEiLCJ0YXJnZXRCcmFuY2giOiJtYWluIn0=--> Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
….91.0 (#33) [![Mend Renovate](https://app.renovatebot.com/images/banner.svg)](https://renovatebot.com) This PR contains the following updates: | Package | Type | Update | Change | |---|---|---|---| | [go.opentelemetry.io/collector/exporter](https://togithub.com/open-telemetry/opentelemetry-collector) | require | minor | `v0.90.0` -> `v0.91.0` | --- ### Release Notes <details> <summary>open-telemetry/opentelemetry-collector (go.opentelemetry.io/collector/exporter)</summary> ### [`v0.91.0`](https://togithub.com/open-telemetry/opentelemetry-collector/blob/HEAD/CHANGELOG.md#v0910) [Compare Source](https://togithub.com/open-telemetry/opentelemetry-collector/compare/v0.90.1...v0.91.0) ##### 💡 Enhancements 💡 - `statusreporting`: Automates status reporting upon the completion of component.Start(). ([#​7682](https://togithub.com/open-telemetry/opentelemetry-collector/issues/7682)) - `service`: add resource attributes as labels to otel metrics to ensures backwards compatibility with OpenCensus metrics. ([#​9029](https://togithub.com/open-telemetry/opentelemetry-collector/issues/9029)) - `semconv`: Generated Semantic conventions 1.21. ([#​9056](https://togithub.com/open-telemetry/opentelemetry-collector/issues/9056)) - `config/confighttp`: Exposes http/2 transport settings to enable health check and workaround golang http/2 issue [https://github.com/golang/go/issues/59690](https://togithub.com/golang/go/issues/59690) ([#​9022](https://togithub.com/open-telemetry/opentelemetry-collector/issues/9022)) - `cmd/builder`: running builder version on binaries installed with `go install` will output the version specified at the suffix. ([#​8770](https://togithub.com/open-telemetry/opentelemetry-collector/issues/8770)) ##### 🧰 Bug fixes 🧰 - `exporterhelper`: fix missed metric aggregations ([#​9048](https://togithub.com/open-telemetry/opentelemetry-collector/issues/9048)) This ensures that context cancellation in the exporter doesn't interfere with metric aggregation. The OTel SDK currently returns if there's an error in the context used in `Add`. This means that if there's a cancelled context in an export, the metrics are now recorded. - `service`: Fix bug where MutatesData would not correctly propagate through connectors. ([#​9053](https://togithub.com/open-telemetry/opentelemetry-collector/issues/9053)) ### [`v0.90.1`](https://togithub.com/open-telemetry/opentelemetry-collector/blob/HEAD/CHANGELOG.md#v0901) [Compare Source](https://togithub.com/open-telemetry/opentelemetry-collector/compare/v0.90.0...v0.90.1) ##### 🧰 Bug fixes 🧰 - `exporterhelper`: Remove noisy log ([#​9017](https://togithub.com/open-telemetry/opentelemetry-collector/issues/9017)) </details> --- ### Configuration 📅 **Schedule**: Branch creation - At any time (no schedule defined), Automerge - At any time (no schedule defined). 🚦 **Automerge**: Disabled by config. Please merge this manually once you are satisfied. ♻ **Rebasing**: Whenever PR becomes conflicted, or you tick the rebase/retry checkbox. 🔕 **Ignore**: Close this PR and you won't be reminded about this update again. --- - [ ] <!-- rebase-check -->If you want to rebase/retry this PR, check this box --- This PR has been generated by [Mend Renovate](https://www.mend.io/free-developer-tools/renovate/). View repository job log [here](https://developer.mend.io/github/ymotongpoo/opentelemetry-collector-extra). <!--renovate-debug:eyJjcmVhdGVkSW5WZXIiOiIzNy41OS44IiwidXBkYXRlZEluVmVyIjoiMzcuODcuMiIsInRhcmdldEJyYW5jaCI6Im1haW4ifQ==--> Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
….91.0 (#32) [![Mend Renovate](https://app.renovatebot.com/images/banner.svg)](https://renovatebot.com) This PR contains the following updates: | Package | Change | Age | Adoption | Passing | Confidence | |---|---|---|---|---|---| | [go.opentelemetry.io/collector/consumer](https://togithub.com/open-telemetry/opentelemetry-collector) | `v0.90.0` -> `v0.91.0` | [![age](https://developer.mend.io/api/mc/badges/age/go/go.opentelemetry.io%2fcollector%2fconsumer/v0.91.0?slim=true)](https://docs.renovatebot.com/merge-confidence/) | [![adoption](https://developer.mend.io/api/mc/badges/adoption/go/go.opentelemetry.io%2fcollector%2fconsumer/v0.91.0?slim=true)](https://docs.renovatebot.com/merge-confidence/) | [![passing](https://developer.mend.io/api/mc/badges/compatibility/go/go.opentelemetry.io%2fcollector%2fconsumer/v0.90.0/v0.91.0?slim=true)](https://docs.renovatebot.com/merge-confidence/) | [![confidence](https://developer.mend.io/api/mc/badges/confidence/go/go.opentelemetry.io%2fcollector%2fconsumer/v0.90.0/v0.91.0?slim=true)](https://docs.renovatebot.com/merge-confidence/) | --- ### Release Notes <details> <summary>open-telemetry/opentelemetry-collector (go.opentelemetry.io/collector/consumer)</summary> ### [`v0.91.0`](https://togithub.com/open-telemetry/opentelemetry-collector/blob/HEAD/CHANGELOG.md#v0910) [Compare Source](https://togithub.com/open-telemetry/opentelemetry-collector/compare/v0.90.1...v0.91.0) ##### 💡 Enhancements 💡 - `statusreporting`: Automates status reporting upon the completion of component.Start(). ([#​7682](https://togithub.com/open-telemetry/opentelemetry-collector/issues/7682)) - `service`: add resource attributes as labels to otel metrics to ensures backwards compatibility with OpenCensus metrics. ([#​9029](https://togithub.com/open-telemetry/opentelemetry-collector/issues/9029)) - `semconv`: Generated Semantic conventions 1.21. ([#​9056](https://togithub.com/open-telemetry/opentelemetry-collector/issues/9056)) - `config/confighttp`: Exposes http/2 transport settings to enable health check and workaround golang http/2 issue [https://github.com/golang/go/issues/59690](https://togithub.com/golang/go/issues/59690) ([#​9022](https://togithub.com/open-telemetry/opentelemetry-collector/issues/9022)) - `cmd/builder`: running builder version on binaries installed with `go install` will output the version specified at the suffix. ([#​8770](https://togithub.com/open-telemetry/opentelemetry-collector/issues/8770)) ##### 🧰 Bug fixes 🧰 - `exporterhelper`: fix missed metric aggregations ([#​9048](https://togithub.com/open-telemetry/opentelemetry-collector/issues/9048)) This ensures that context cancellation in the exporter doesn't interfere with metric aggregation. The OTel SDK currently returns if there's an error in the context used in `Add`. This means that if there's a cancelled context in an export, the metrics are now recorded. - `service`: Fix bug where MutatesData would not correctly propagate through connectors. ([#​9053](https://togithub.com/open-telemetry/opentelemetry-collector/issues/9053)) ### [`v0.90.1`](https://togithub.com/open-telemetry/opentelemetry-collector/blob/HEAD/CHANGELOG.md#v0901) [Compare Source](https://togithub.com/open-telemetry/opentelemetry-collector/compare/v0.90.0...v0.90.1) ##### 🧰 Bug fixes 🧰 - `exporterhelper`: Remove noisy log ([#​9017](https://togithub.com/open-telemetry/opentelemetry-collector/issues/9017)) </details> --- ### Configuration 📅 **Schedule**: Branch creation - At any time (no schedule defined), Automerge - At any time (no schedule defined). 🚦 **Automerge**: Disabled by config. Please merge this manually once you are satisfied. ♻ **Rebasing**: Whenever PR becomes conflicted, or you tick the rebase/retry checkbox. 🔕 **Ignore**: Close this PR and you won't be reminded about this update again. --- - [ ] <!-- rebase-check -->If you want to rebase/retry this PR, check this box --- This PR has been generated by [Mend Renovate](https://www.mend.io/free-developer-tools/renovate/). View repository job log [here](https://developer.mend.io/github/ymotongpoo/opentelemetry-collector-extra). <!--renovate-debug:eyJjcmVhdGVkSW5WZXIiOiIzNy41OS44IiwidXBkYXRlZEluVmVyIjoiMzcuMTAzLjEiLCJ0YXJnZXRCcmFuY2giOiJtYWluIn0=--> Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
Change https://go.dev/cl/617655 mentions this issue: |
Consider the case of an unresponsive client connection, where the server has stopped responding. We send an infinite sequence of requests to the connection in sequence, each with a timeout. Each request counts against the concurrency limit for the connection while active, but when a request times out we send a RST_STREAM and free up the concurrency slot it was using. We continue to try to send requests to the connection forever (or until the kernel closes the underlying TCP connection, or until ReadIdleTimeout/WriteByteTimeout results in us closing the connection). Defend against this scenario by counting a canceled request against the connection concurrency limit until we confirm the server is responding. Specifically: Track the number of in-flight request cancellations in cc.pendingResets. This total counts against the connection concurrency limit. When sending a RST_STREAM for a canceled request, increment cc.pendingResets. Send a PING frame to the server, unless a PING is already in flight. When receiving a PING response, set cc.pendingResets to 0. A hung connection will be used for at most SETTINGS_MAX_CONCURRENT_STREAMS requests. When StrictMaxConcurrentStreams is false, we will create a new connection after reaching the concurrency limit for a hung one. When StrictMaxConcurrentStreams is true, we will continue to wait for the existing connection until some timeout closes it or it becomes responsive again. For golang/go#59690 Change-Id: I0151f9a594af14b32bcb6005a239fa19eb103704 Reviewed-on: https://go-review.googlesource.com/c/net/+/617655 LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org> Reviewed-by: Jonathan Amsterdam <jba@google.com> Reviewed-by: Carlos Amedee <carlos@golang.org>
What version of Go are you using (
go version
)?Does this issue reproduce with the latest release?
Yes
What operating system and processor architecture are you using (
go env
)?go env
OutputWhat did you do?
A connection is established to a remote server (using HTTP/2 where available) and requests are sent periodically over this connection - it's essentially a write pipeline working through a queue sequentially.
This being the internet, sometimes bad things happen to individual connections and packets go awry.
If a timeout occurs, the connection will (well, should) be closed:
(that
CloseIdleConnections()
call is sourced from #36026)What did you expect to see?
On the next request (generally a retry), a new connection should be established (if possible) and writes should resume.
What did you see instead?
The underlying TCP connection does not get closed - the kernel sits retransmitting the most recent packets (until
net.ipv4.tcp_retries2
is reached about 17mins later).Subsequent requests all attempt to use the dead connection and so time out.
Repro
The following can be used to help repro the behaviour
If you build and run that, you should see
403
printed on a new line every second or so (it being 403 isn't pertinent here, it's just easy/convenient to generate)Check which source port is being used
and then block it in
iptables
to break the return pathThe script should stop printing 403s and after around 5s will print
Timed out
.Although the code calls
CloseIdleConnections()
, it'll silently fail to do anything to the connection (more below)Subsequent attempts will also print
Time out
and re-runningnetstat
will show that same connection still in anESTABLISHED
state. If you take a packet capture it'll show re-transmission of whatever packet we interrupted by adding the rule.If you remove the
iptables
rule once the remote server starts sending resets, the kernel will tear down the connection and requests will go back to behaving as they should.Hacking it out
It is actually possible to work around this at the caller's level, although it's not exactly pleasant to look at.
This works because
net/http2
specifically checks for aConnection: close
header before placing a request, and set'sdoNotReuse
on theClientConn
, so when the next request comes around, the connection won't be used even if it is otherwise considered still active.Underlying Issue
The underlying seems to be that there's some raciness between normal stream closure and
closeIdleConnections()
.The crucial check is here
If the map
cc.streams
contains any entries, the connection isn't considered idle and therefore can't be closed.Dropping some prints in as a quick test shows it's this conditional we're failing at
Gives
net/http2
does actually try to tidy up the streams,doRequest
callscleanupWriteRequest
which ultimately drops the broken stream.Adding some more prints though, shows that this only tends to happen after the caller has had opportunity to call
closeIdleConnections()
:The string
Forgetting ID
was added here.Raciness
Timings definitely seem to be important here - if the
time.Sleep
call in the repro is adjusted to wait 5 seconds, you'll sometimes see a string of timeouts followed by writes recovering.That's not the only raciness though.
Stream closures/timeouts etc are handled by a
select
statement hereSometimes this is triggered by
cs.abort
and sometimes byctx.Done()
. This appears to be as a result of an attempt to abort the stream when the context hits deadline - so there's an element of potluck in terms of which gets through (I've not dug into this in too much depth though).TL:DR
CloseIdleConnections()
relies oncloseIfIdle()
which will close a connection only if there are no active streams on that connection.However, when a timeout occurs, there may be a delay before the stream is closed (leaving it visible to
closeIfIdle()
).By the time the stream closure occurs, the timeout has already been bubbled up to the caller, who may then choose to call
CloseIdleConnections()
knowing that the connection has failed.Because the dead connection does not get closed by this call, it remains available for use by the next request.
The result is that requests will fail until the kernel itself gives up on the connection, or a RST or FIN makes it through from the remote end.
The text was updated successfully, but these errors were encountered: