Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

x/net/http2: raciness in stream close prevents closing of failed connections, blocking subsequent requests #59690

Open
btasker opened this issue Apr 18, 2023 · 11 comments · May be fixed by golang/net#171
Assignees
Labels
NeedsFix The path to resolution is known, but the work has not been done.
Milestone

Comments

@btasker
Copy link

btasker commented Apr 18, 2023

What version of Go are you using (go version)?

go version go1.20.3 linux/amd6

Does this issue reproduce with the latest release?

Yes

What operating system and processor architecture are you using (go env)?

go env Output
$ go env

GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/root/.cache/go-build"
GOENV="/root/.config/go/env"
GOEXE=""
GOEXPERIMENT=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOINSECURE=""
GOMODCACHE="/go/pkg/mod"
GONOPROXY=""
GONOSUMDB=""
GOOS="linux"
GOPATH="/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/usr/local/go"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/usr/local/go/pkg/tool/linux_amd64"
GOVCS=""
GOVERSION="go1.20.3"
GCCGO="gccgo"
GOAMD64="v1"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD="/dev/null"
GOWORK=""
CGO_CFLAGS="-O2 -g"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-O2 -g"
CGO_FFLAGS="-O2 -g"
CGO_LDFLAGS="-O2 -g"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -Wl,--no-gc-sections -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build2968508177=/tmp/go-build -gno-record-gcc-switches"

What did you do?

A connection is established to a remote server (using HTTP/2 where available) and requests are sent periodically over this connection - it's essentially a write pipeline working through a queue sequentially.

This being the internet, sometimes bad things happen to individual connections and packets go awry.

If a timeout occurs, the connection will (well, should) be closed:

resp, err := client.Do(req.WithContext(ctx))
if err != nil {
            // Attempt to tear down the connection
            fmt.Println("Timed out")
            client.CloseIdleConnections()
        }

(that CloseIdleConnections() call is sourced from #36026)

What did you expect to see?

On the next request (generally a retry), a new connection should be established (if possible) and writes should resume.

What did you see instead?

The underlying TCP connection does not get closed - the kernel sits retransmitting the most recent packets (until net.ipv4.tcp_retries2 is reached about 17mins later).

Subsequent requests all attempt to use the dead connection and so time out.

Repro

The following can be used to help repro the behaviour

package main

import (
    "context"
    "fmt"
    "net/http"
    "time"
    "strings"
)

func main() {

    // The url to place requests to
    url := "https://www.bentasker.co.uk/"

    // Create an empty context
    ctx := context.Background()

    // Create an io.Reader from a string to act as the request body
    bodystring := "foo\nbar\nsed"
    myReader := strings.NewReader(bodystring)

    // build client and request
    client := http.Client{
        Timeout: 5 * time.Second,
    }

    // Set up an infinite loop to place requests from
    for {
        req, err := http.NewRequest("POST", url, myReader)
        if err != nil {
            fmt.Println("boo")
        }
        req.Header.Set("Content-Type", "text/plain; charset=utf-8")

        resp, err := client.Do(req.WithContext(ctx))

        if err != nil {
            // Attempt to tear down the connection
            fmt.Println("Timed out")
            client.CloseIdleConnections()
        } else {
            fmt.Println(resp.StatusCode)
        }

        time.Sleep(time.Second)
    }
}

If you build and run that, you should see 403 printed on a new line every second or so (it being 403 isn't pertinent here, it's just easy/convenient to generate)

Check which source port is being used

root@ratchett:/home/ben/tmp/go_repro# netstat -anp | grep main
tcp        0   2185 192.168.3.217:50446     143.244.38.137:443      ESTABLISHED 2389361/./main   

and then block it in iptables to break the return path

iptables -I INPUT -p tcp --dport 50446 -j DROP

The script should stop printing 403s and after around 5s will print Timed out.

Although the code calls CloseIdleConnections(), it'll silently fail to do anything to the connection (more below)

Subsequent attempts will also print Time out and re-running netstat will show that same connection still in an ESTABLISHED state. If you take a packet capture it'll show re-transmission of whatever packet we interrupted by adding the rule.

If you remove the iptables rule once the remote server starts sending resets, the kernel will tear down the connection and requests will go back to behaving as they should.


Hacking it out

It is actually possible to work around this at the caller's level, although it's not exactly pleasant to look at.

        if err != nil {
            // Attempt to tear down the connection
            fmt.Println("Timed out")

            // prompt the handler into marking the connection as not
           // being reusable
            req.Header.Set("Connection", "close")
            client.Do(req.WithContext(ctx))

            // this is no longer required
            // client.CloseIdleConnections()
        } else {
            fmt.Println(resp.StatusCode)
        }

This works because net/http2 specifically checks for a Connection: close header before placing a request, and set's doNotReuse on the ClientConn, so when the next request comes around, the connection won't be used even if it is otherwise considered still active.


Underlying Issue

The underlying seems to be that there's some raciness between normal stream closure and closeIdleConnections().

The crucial check is here

	if len(cc.streams) > 0 || cc.streamsReserved > 0 {
		cc.mu.Unlock()
		return
	}

If the map cc.streams contains any entries, the connection isn't considered idle and therefore can't be closed.

Dropping some prints in as a quick test shows it's this conditional we're failing at

@@ -7984,6 +7990,8 @@ func (cc *http2ClientConn) forceCloseConn() {
 func (cc *http2ClientConn) closeIfIdle() {
        cc.mu.Lock()
        if len(cc.streams) > 0 || cc.streamsReserved > 0 {
+                fmt.Println("Refusing to close - streams:", len(cc.streams), " reserved:", cc.streamsReserved)
+
                cc.mu.Unlock()
                return
        }

Gives

Refusing to close - streams: 1  reserved: 0

net/http2 does actually try to tidy up the streams, doRequest calls cleanupWriteRequest which ultimately drops the broken stream.

Adding some more prints though, shows that this only tends to happen after the caller has had opportunity to call closeIdleConnections():

Refusing to close - streams: 1  reserved: 0
Doing cleanup context deadline exceeded
Cleanup stream 5 got err context deadline exceeded
Must close body:  false
Got err Sent headers
Writing reset
Forgetting ID  5
werr  <nil>
closing

The string Forgetting ID was added here.


Raciness

Timings definitely seem to be important here - if the time.Sleep call in the repro is adjusted to wait 5 seconds, you'll sometimes see a string of timeouts followed by writes recovering.

That's not the only raciness though.

Stream closures/timeouts etc are handled by a select statement here

	for {
		select {
		case <-cs.peerClosed:
			return nil
		case <-respHeaderTimer:
			return errTimeout
		case <-respHeaderRecv:
			respHeaderRecv = nil
			respHeaderTimer = nil // keep waiting for END_STREAM
		case <-cs.abort:
			return cs.abortErr
		case <-ctx.Done():
			return ctx.Err()
		case <-cs.reqCancel:
			return errRequestCanceled
		}
	}

Sometimes this is triggered by cs.abort and sometimes by ctx.Done(). This appears to be as a result of an attempt to abort the stream when the context hits deadline - so there's an element of potluck in terms of which gets through (I've not dug into this in too much depth though).


TL:DR

CloseIdleConnections() relies on closeIfIdle() which will close a connection only if there are no active streams on that connection.

However, when a timeout occurs, there may be a delay before the stream is closed (leaving it visible to closeIfIdle()).

By the time the stream closure occurs, the timeout has already been bubbled up to the caller, who may then choose to call CloseIdleConnections() knowing that the connection has failed.

Because the dead connection does not get closed by this call, it remains available for use by the next request.

The result is that requests will fail until the kernel itself gives up on the connection, or a RST or FIN makes it through from the remote end.

@btasker
Copy link
Author

btasker commented Apr 18, 2023

I think the simplest way to address this would be to have that select statement check whether the current stream is the only one in the ClientConn and if so, set DoNotReuse.

I don't know whether it's the best method though - HTTP/2's multiplexing complicates things.

At the net/http2 level there doesn't seem to be a way to tell whether a timeout is a connection problem or just that the other end is being slow (or the caller setting an overly low timeout).

As a result, receiving a timeout doesn't necessarily mean that the connection itself is dead: tearing it down might therefore impact working requests transiting the same connection via other streams.

Setting doNotReuse on the connection feels like an acceptable compromise: it allows other streams to carry on unimpeded (or hit their own timeouts), whilst ensuring that a potentially problematic connection is not reused.

It'd probably be tidier to extend ClientConn to include a hasErrored property and achieve the same end, but I'll link a PR in in a minute using doNotReuse - I'm happy to go the other way, but it didn't seem worth the effort until it's confirmed that this is the right path.

@ianlancetaylor
Copy link
Member

CC @neild @bradfitz

@neild
Copy link
Contributor

neild commented Apr 18, 2023

Refusing to reuse a connection that's not successfully handling requests seems reasonable to me.

False positives (a request timing out because of a slow server handler, not a bad connection, say) will result in additional connections being unnecessarily created, which seems less bad than continuing to send requests to an unresponsive connection.

btasker added a commit to btasker/net that referenced this issue Apr 18, 2023
This commit mitigates some raciness which can occur when an established
connection fails. Without it, the connection can continue to be drawn
from the connection pool, leading to subsequent requests failing.

Fixes golang/go#59690
@neild neild self-assigned this Apr 18, 2023
@gopherbot
Copy link
Contributor

Change https://go.dev/cl/486156 mentions this issue: http2: don't reuse connections that are experiencing errors

btasker added a commit to btasker/net that referenced this issue Apr 28, 2023
This adds @dniel's changes from https://go-review.googlesource.com/c/net/+/486156

When a request on a connection fails to complete successfully,
mark the conn as doNotReuse. It's possible for requests to
fail for reasons unrelated to connection health,
but opening a new connection unnecessarily is less of an
impact than reusing a dead connection.

Fixes golang/go#59690
btasker added a commit to btasker/net that referenced this issue Apr 28, 2023
This adds @dniel's changes from https://go-review.googlesource.com/c/net/+/486156

When a request on a connection fails to complete successfully,
mark the conn as doNotReuse. It's possible for requests to
fail for reasons unrelated to connection health,
but opening a new connection unnecessarily is less of an
impact than reusing a dead connection.

Fixes golang/go#59690
@gopherbot
Copy link
Contributor

Change https://go.dev/cl/490335 mentions this issue: net/http2: don't re-use connections that are experiencing errors

@dmitshur dmitshur added the NeedsFix The path to resolution is known, but the work has not been done. label May 24, 2023
@dmitshur dmitshur added this to the Unreleased milestone May 24, 2023
@ianlancetaylor
Copy link
Member

@gopherbot Please open backport to 1.19.

Rationale quoting #60301 (which is for the other open minor release):

When encountered, this is a serious issue: requests are blocked until the kernel gives up on the socket (with default settings on a Linux host, that's around 17 minutes). If the operator restarts the application as a result, data loss may occur.

Although the fix is in x/net v0.10.0 it will not be available to most developers who likely import net/http (and so rely on the h2 bundle rolled into Go's releases)

The only viable workaround is to disable the H2 client entirely via environment variable.

@gopherbot
Copy link
Contributor

gopherbot commented Jun 7, 2023

Backport issue(s) opened: #60301 (for 1.20), #60662 (for 1.19).

Remember to create the cherry-pick CL(s) as soon as the patch is submitted to master, according to https://go.dev/wiki/MinorReleases.

@gopherbot
Copy link
Contributor

Change https://go.dev/cl/507395 mentions this issue: http2: revert Transport change from CL 486156

@neild neild reopened this Jun 30, 2023
gopherbot pushed a commit to golang/net that referenced this issue Jul 1, 2023
https://go.dev/cl/486156 was problematic, breaking applications
like load balancers where cancellations are normal. We tried
reverting only part of that change but it turns out all three of
the cancellation paths tainting the connection were wrong and
could ultimately be triggered by a request cancellation.

We need a different solution for golang/go#59690 which we'll now reopen.

This isn't a straight revert of CL 486156 because subsequent changes
(CL 496335) depended on its cancelRequest closure, so this was
reverted by hand.

Fixes golang/go#60818
Updates golang/go#59690 (no longer fixed)

Change-Id: Ic83b746d7cf89d07655d9efbd04b4d957487cb35
Reviewed-on: https://go-review.googlesource.com/c/net/+/507395
Reviewed-by: Damien Neil <dneil@google.com>
Run-TryBot: Brad Fitzpatrick <bradfitz@golang.org>
Reviewed-by: Dmitri Shuralyov <dmitshur@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
EdSchouten added a commit to buildbarn/bb-storage that referenced this issue Aug 30, 2023
I have noticed that HTTP/2 tends to be unreliable in certain
circumstances due to bugs in Go's HTTP/2 library. See the following
upstream issues:

golang/go#36026
golang/go#59690

Let's add a configuration option here, so that HTTP/2 can be disabled on
an individual basis if needed.
@serbrech
Copy link

@btasker is

http2Trans, err := http2.ConfigureTransports(transport)
if err == nil {
   http2Trans.ReadIdleTimeout = time.Duration(ReadIdleTimeout)
}

still a valid workaround for go 1.20.x? (I assume yes since I see the fix planned for 1.20.7 Milestone now)
What about 1.21?

Is the ReadIdleTimeout of the http2 transport redundant with https://pkg.go.dev/net#Dialer.KeepAlive?

@btasker
Copy link
Author

btasker commented Sep 26, 2023

Hey @serbrech

still a valid workaround for go 1.20.x? (I assume yes since I see the fix planned for 1.20.7 Milestone now)
What about 1.21?

Yep, still valid for both.

Is the ReadIdleTimeout of the http2 transport redundant with https://pkg.go.dev/net#Dialer.KeepAlive?

I've not checked the interaction, but I suspect it's probably not redundant, no (as the dialer keepalive should be operating at a lower level).

@dmitshur dmitshur changed the title x/net/http2: Raciness in stream close prevents closing of failed connections, blocking subsequent requests x/net/http2: raciness in stream close prevents closing of failed connections, blocking subsequent requests Oct 2, 2023
nickstenning added a commit to replicate/go that referenced this issue Nov 27, 2023
Sometimes we're seeing HTTP/2 connections be reused for up to ~16
minutes despite being completely defunct -- i.e. we're receiving no
response from the remote end.

We believe that this is in part a result of Go bug

  golang/go#59690

in which failed connections are not correctly discarded under certain
circumstances.

This commit enables the http2 transport's `ReadIdleTimeout` function,
which will send an h2 "ping" frame on any connection that's been idle
longer than the value of `ReadIdleTimeout`. If it doesn't hear back in
`PingTimeout`, it will throw away the connection. This seems like a
reasonable thing to leave on in general.
nickstenning added a commit to replicate/go that referenced this issue Nov 27, 2023
Sometimes we're seeing HTTP/2 connections be reused for up to ~16
minutes despite being completely defunct -- i.e. we're receiving no
response from the remote end.

We believe that this is in part a result of Go bug

  golang/go#59690

in which failed connections are not correctly discarded under certain
circumstances.

This commit enables the http2 transport's `ReadIdleTimeout` function,
which will send an h2 "ping" frame on any connection that's been idle
longer than the value of `ReadIdleTimeout`. If it doesn't hear back in
`PingTimeout`, it will throw away the connection. This seems like a
reasonable thing to leave on in general.
dmitryax pushed a commit to open-telemetry/opentelemetry-collector that referenced this issue Dec 8, 2023
**Description:** 
This PR introduces options to enable `http/2` health check by exposing
`HTTP2ReadIdleTimeout` and `HTTP2PingTimeout`

The golang issues are:
golang/go#59690
golang/go#36026

In summary, if due to environmental issue the underlying tcp connection
used by the http/2 client in the exporter became
unstable/unusable/unreachable, unlike http/1, the http/2 client does not
forcibly close the connection and redial a new one, instead it keeps
using it for 15 minutes (default value of OS `tcp_retries2` ) until the
OS cleans it up and a new tcp connection gets established.
From OTEL user perspective, one will see a spike in export
failures/timeouts for ~15 minutes, this will happen for every connection
that got into a bad state, after 15 minutes things will recover until
next time the tcp connection gets into a bad state.


**Testing:**
- Run OTEL with one of the exporters that uses HTTP/2 client, example
`signalfx` exporter
- For simplicity use a single pipeline/exporter
- In a different shell, run this to watch the tcp state of the
established connection
```
 while (true); do echo date; sudo netstat -anp | grep -E '<endpoin_ip_address(es)>' | sort -k 5; sleep 2; done
 ```  
- From the netstat, take a note of the source port and the source IP address
- replace <> from previous step
`sudo iptables -A OUTPUT -s <source_IP> -p tcp --sport <source_Port> -j DROP`
- Note how the OTEL exporter export starts timing out

Expected Result:
- A new connection should be established, similarly to http/1 and exports should succeed

Actual Result: 
- The exports keep failing for  ~ 15 minutes or for whatever the OS `tcp_retries2` is configured to
- After 15 minutes, a new tcp connection is created and exports start working

**Documentation:** 
Readme is updated with new settings

Signed-off-by: Dani Louca <dlouca@splunk.com>
pantuza pushed a commit to pantuza/opentelemetry-collector that referenced this issue Dec 8, 2023
**Description:** 
This PR introduces options to enable `http/2` health check by exposing
`HTTP2ReadIdleTimeout` and `HTTP2PingTimeout`

The golang issues are:
golang/go#59690
golang/go#36026

In summary, if due to environmental issue the underlying tcp connection
used by the http/2 client in the exporter became
unstable/unusable/unreachable, unlike http/1, the http/2 client does not
forcibly close the connection and redial a new one, instead it keeps
using it for 15 minutes (default value of OS `tcp_retries2` ) until the
OS cleans it up and a new tcp connection gets established.
From OTEL user perspective, one will see a spike in export
failures/timeouts for ~15 minutes, this will happen for every connection
that got into a bad state, after 15 minutes things will recover until
next time the tcp connection gets into a bad state.


**Testing:**
- Run OTEL with one of the exporters that uses HTTP/2 client, example
`signalfx` exporter
- For simplicity use a single pipeline/exporter
- In a different shell, run this to watch the tcp state of the
established connection
```
 while (true); do echo date; sudo netstat -anp | grep -E '<endpoin_ip_address(es)>' | sort -k 5; sleep 2; done
 ```  
- From the netstat, take a note of the source port and the source IP address
- replace <> from previous step
`sudo iptables -A OUTPUT -s <source_IP> -p tcp --sport <source_Port> -j DROP`
- Note how the OTEL exporter export starts timing out

Expected Result:
- A new connection should be established, similarly to http/1 and exports should succeed

Actual Result: 
- The exports keep failing for  ~ 15 minutes or for whatever the OS `tcp_retries2` is configured to
- After 15 minutes, a new tcp connection is created and exports start working

**Documentation:** 
Readme is updated with new settings

Signed-off-by: Dani Louca <dlouca@splunk.com>
ymotongpoo referenced this issue in ymotongpoo/opentelemetry-collector-extra Dec 22, 2023
….91.0 (#33)

[![Mend
Renovate](https://app.renovatebot.com/images/banner.svg)](https://renovatebot.com)

This PR contains the following updates:

| Package | Type | Update | Change |
|---|---|---|---|
|
[go.opentelemetry.io/collector/exporter](https://togithub.com/open-telemetry/opentelemetry-collector)
| require | minor | `v0.90.0` -> `v0.91.0` |

---

### Release Notes

<details>
<summary>open-telemetry/opentelemetry-collector
(go.opentelemetry.io/collector/exporter)</summary>

###
[`v0.91.0`](https://togithub.com/open-telemetry/opentelemetry-collector/blob/HEAD/CHANGELOG.md#v0910)

[Compare
Source](https://togithub.com/open-telemetry/opentelemetry-collector/compare/v0.90.1...v0.91.0)

##### 💡 Enhancements 💡

- `statusreporting`: Automates status reporting upon the completion of
component.Start().
([#&#8203;7682](https://togithub.com/open-telemetry/opentelemetry-collector/issues/7682))
- `service`: add resource attributes as labels to otel metrics to
ensures backwards compatibility with OpenCensus metrics.
([#&#8203;9029](https://togithub.com/open-telemetry/opentelemetry-collector/issues/9029))
- `semconv`: Generated Semantic conventions 1.21.
([#&#8203;9056](https://togithub.com/open-telemetry/opentelemetry-collector/issues/9056))
- `config/confighttp`: Exposes http/2 transport settings to enable
health check and workaround golang http/2 issue
[https://github.com/golang/go/issues/59690](https://togithub.com/golang/go/issues/59690)
([#&#8203;9022](https://togithub.com/open-telemetry/opentelemetry-collector/issues/9022))
- `cmd/builder`: running builder version on binaries installed with `go
install` will output the version specified at the suffix.
([#&#8203;8770](https://togithub.com/open-telemetry/opentelemetry-collector/issues/8770))

##### 🧰 Bug fixes 🧰

- `exporterhelper`: fix missed metric aggregations
([#&#8203;9048](https://togithub.com/open-telemetry/opentelemetry-collector/issues/9048))
This ensures that context cancellation in the exporter doesn't interfere
with metric aggregation. The OTel
SDK currently returns if there's an error in the context used in `Add`.
This means that if there's a
    cancelled context in an export, the metrics are now recorded.

- `service`: Fix bug where MutatesData would not correctly propagate
through connectors.
([#&#8203;9053](https://togithub.com/open-telemetry/opentelemetry-collector/issues/9053))

###
[`v0.90.1`](https://togithub.com/open-telemetry/opentelemetry-collector/blob/HEAD/CHANGELOG.md#v0901)

[Compare
Source](https://togithub.com/open-telemetry/opentelemetry-collector/compare/v0.90.0...v0.90.1)

##### 🧰 Bug fixes 🧰

- `exporterhelper`: Remove noisy log
([#&#8203;9017](https://togithub.com/open-telemetry/opentelemetry-collector/issues/9017))

</details>

---

### Configuration

📅 **Schedule**: Branch creation - At any time (no schedule defined),
Automerge - At any time (no schedule defined).

🚦 **Automerge**: Disabled by config. Please merge this manually once you
are satisfied.

♻ **Rebasing**: Whenever PR becomes conflicted, or you tick the
rebase/retry checkbox.

🔕 **Ignore**: Close this PR and you won't be reminded about this update
again.

---

- [ ] <!-- rebase-check -->If you want to rebase/retry this PR, check
this box

---

This PR has been generated by [Mend
Renovate](https://www.mend.io/free-developer-tools/renovate/). View
repository job log
[here](https://developer.mend.io/github/ymotongpoo/opentelemetry-collector-extra).

<!--renovate-debug:eyJjcmVhdGVkSW5WZXIiOiIzNy41OS44IiwidXBkYXRlZEluVmVyIjoiMzcuODcuMiIsInRhcmdldEJyYW5jaCI6Im1haW4ifQ==-->

Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
ymotongpoo referenced this issue in ymotongpoo/opentelemetry-collector-extra Dec 22, 2023
….91.0 (#32)

[![Mend
Renovate](https://app.renovatebot.com/images/banner.svg)](https://renovatebot.com)

This PR contains the following updates:

| Package | Change | Age | Adoption | Passing | Confidence |
|---|---|---|---|---|---|
|
[go.opentelemetry.io/collector/consumer](https://togithub.com/open-telemetry/opentelemetry-collector)
| `v0.90.0` -> `v0.91.0` |
[![age](https://developer.mend.io/api/mc/badges/age/go/go.opentelemetry.io%2fcollector%2fconsumer/v0.91.0?slim=true)](https://docs.renovatebot.com/merge-confidence/)
|
[![adoption](https://developer.mend.io/api/mc/badges/adoption/go/go.opentelemetry.io%2fcollector%2fconsumer/v0.91.0?slim=true)](https://docs.renovatebot.com/merge-confidence/)
|
[![passing](https://developer.mend.io/api/mc/badges/compatibility/go/go.opentelemetry.io%2fcollector%2fconsumer/v0.90.0/v0.91.0?slim=true)](https://docs.renovatebot.com/merge-confidence/)
|
[![confidence](https://developer.mend.io/api/mc/badges/confidence/go/go.opentelemetry.io%2fcollector%2fconsumer/v0.90.0/v0.91.0?slim=true)](https://docs.renovatebot.com/merge-confidence/)
|

---

### Release Notes

<details>
<summary>open-telemetry/opentelemetry-collector
(go.opentelemetry.io/collector/consumer)</summary>

###
[`v0.91.0`](https://togithub.com/open-telemetry/opentelemetry-collector/blob/HEAD/CHANGELOG.md#v0910)

[Compare
Source](https://togithub.com/open-telemetry/opentelemetry-collector/compare/v0.90.1...v0.91.0)

##### 💡 Enhancements 💡

- `statusreporting`: Automates status reporting upon the completion of
component.Start().
([#&#8203;7682](https://togithub.com/open-telemetry/opentelemetry-collector/issues/7682))
- `service`: add resource attributes as labels to otel metrics to
ensures backwards compatibility with OpenCensus metrics.
([#&#8203;9029](https://togithub.com/open-telemetry/opentelemetry-collector/issues/9029))
- `semconv`: Generated Semantic conventions 1.21.
([#&#8203;9056](https://togithub.com/open-telemetry/opentelemetry-collector/issues/9056))
- `config/confighttp`: Exposes http/2 transport settings to enable
health check and workaround golang http/2 issue
[https://github.com/golang/go/issues/59690](https://togithub.com/golang/go/issues/59690)
([#&#8203;9022](https://togithub.com/open-telemetry/opentelemetry-collector/issues/9022))
- `cmd/builder`: running builder version on binaries installed with `go
install` will output the version specified at the suffix.
([#&#8203;8770](https://togithub.com/open-telemetry/opentelemetry-collector/issues/8770))

##### 🧰 Bug fixes 🧰

- `exporterhelper`: fix missed metric aggregations
([#&#8203;9048](https://togithub.com/open-telemetry/opentelemetry-collector/issues/9048))
This ensures that context cancellation in the exporter doesn't interfere
with metric aggregation. The OTel
SDK currently returns if there's an error in the context used in `Add`.
This means that if there's a
    cancelled context in an export, the metrics are now recorded.

- `service`: Fix bug where MutatesData would not correctly propagate
through connectors.
([#&#8203;9053](https://togithub.com/open-telemetry/opentelemetry-collector/issues/9053))

###
[`v0.90.1`](https://togithub.com/open-telemetry/opentelemetry-collector/blob/HEAD/CHANGELOG.md#v0901)

[Compare
Source](https://togithub.com/open-telemetry/opentelemetry-collector/compare/v0.90.0...v0.90.1)

##### 🧰 Bug fixes 🧰

- `exporterhelper`: Remove noisy log
([#&#8203;9017](https://togithub.com/open-telemetry/opentelemetry-collector/issues/9017))

</details>

---

### Configuration

📅 **Schedule**: Branch creation - At any time (no schedule defined),
Automerge - At any time (no schedule defined).

🚦 **Automerge**: Disabled by config. Please merge this manually once you
are satisfied.

♻ **Rebasing**: Whenever PR becomes conflicted, or you tick the
rebase/retry checkbox.

🔕 **Ignore**: Close this PR and you won't be reminded about this update
again.

---

- [ ] <!-- rebase-check -->If you want to rebase/retry this PR, check
this box

---

This PR has been generated by [Mend
Renovate](https://www.mend.io/free-developer-tools/renovate/). View
repository job log
[here](https://developer.mend.io/github/ymotongpoo/opentelemetry-collector-extra).

<!--renovate-debug:eyJjcmVhdGVkSW5WZXIiOiIzNy41OS44IiwidXBkYXRlZEluVmVyIjoiMzcuMTAzLjEiLCJ0YXJnZXRCcmFuY2giOiJtYWluIn0=-->

Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
ymotongpoo referenced this issue in ymotongpoo/opentelemetry-collector-extra Dec 22, 2023
….91.0 (#33)

[![Mend
Renovate](https://app.renovatebot.com/images/banner.svg)](https://renovatebot.com)

This PR contains the following updates:

| Package | Type | Update | Change |
|---|---|---|---|
|
[go.opentelemetry.io/collector/exporter](https://togithub.com/open-telemetry/opentelemetry-collector)
| require | minor | `v0.90.0` -> `v0.91.0` |

---

### Release Notes

<details>
<summary>open-telemetry/opentelemetry-collector
(go.opentelemetry.io/collector/exporter)</summary>

###
[`v0.91.0`](https://togithub.com/open-telemetry/opentelemetry-collector/blob/HEAD/CHANGELOG.md#v0910)

[Compare
Source](https://togithub.com/open-telemetry/opentelemetry-collector/compare/v0.90.1...v0.91.0)

##### 💡 Enhancements 💡

- `statusreporting`: Automates status reporting upon the completion of
component.Start().
([#&#8203;7682](https://togithub.com/open-telemetry/opentelemetry-collector/issues/7682))
- `service`: add resource attributes as labels to otel metrics to
ensures backwards compatibility with OpenCensus metrics.
([#&#8203;9029](https://togithub.com/open-telemetry/opentelemetry-collector/issues/9029))
- `semconv`: Generated Semantic conventions 1.21.
([#&#8203;9056](https://togithub.com/open-telemetry/opentelemetry-collector/issues/9056))
- `config/confighttp`: Exposes http/2 transport settings to enable
health check and workaround golang http/2 issue
[https://github.com/golang/go/issues/59690](https://togithub.com/golang/go/issues/59690)
([#&#8203;9022](https://togithub.com/open-telemetry/opentelemetry-collector/issues/9022))
- `cmd/builder`: running builder version on binaries installed with `go
install` will output the version specified at the suffix.
([#&#8203;8770](https://togithub.com/open-telemetry/opentelemetry-collector/issues/8770))

##### 🧰 Bug fixes 🧰

- `exporterhelper`: fix missed metric aggregations
([#&#8203;9048](https://togithub.com/open-telemetry/opentelemetry-collector/issues/9048))
This ensures that context cancellation in the exporter doesn't interfere
with metric aggregation. The OTel
SDK currently returns if there's an error in the context used in `Add`.
This means that if there's a
    cancelled context in an export, the metrics are now recorded.

- `service`: Fix bug where MutatesData would not correctly propagate
through connectors.
([#&#8203;9053](https://togithub.com/open-telemetry/opentelemetry-collector/issues/9053))

###
[`v0.90.1`](https://togithub.com/open-telemetry/opentelemetry-collector/blob/HEAD/CHANGELOG.md#v0901)

[Compare
Source](https://togithub.com/open-telemetry/opentelemetry-collector/compare/v0.90.0...v0.90.1)

##### 🧰 Bug fixes 🧰

- `exporterhelper`: Remove noisy log
([#&#8203;9017](https://togithub.com/open-telemetry/opentelemetry-collector/issues/9017))

</details>

---

### Configuration

📅 **Schedule**: Branch creation - At any time (no schedule defined),
Automerge - At any time (no schedule defined).

🚦 **Automerge**: Disabled by config. Please merge this manually once you
are satisfied.

♻ **Rebasing**: Whenever PR becomes conflicted, or you tick the
rebase/retry checkbox.

🔕 **Ignore**: Close this PR and you won't be reminded about this update
again.

---

- [ ] <!-- rebase-check -->If you want to rebase/retry this PR, check
this box

---

This PR has been generated by [Mend
Renovate](https://www.mend.io/free-developer-tools/renovate/). View
repository job log
[here](https://developer.mend.io/github/ymotongpoo/opentelemetry-collector-extra).

<!--renovate-debug:eyJjcmVhdGVkSW5WZXIiOiIzNy41OS44IiwidXBkYXRlZEluVmVyIjoiMzcuODcuMiIsInRhcmdldEJyYW5jaCI6Im1haW4ifQ==-->

Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
ymotongpoo referenced this issue in ymotongpoo/opentelemetry-collector-extra Dec 22, 2023
….91.0 (#32)

[![Mend
Renovate](https://app.renovatebot.com/images/banner.svg)](https://renovatebot.com)

This PR contains the following updates:

| Package | Change | Age | Adoption | Passing | Confidence |
|---|---|---|---|---|---|
|
[go.opentelemetry.io/collector/consumer](https://togithub.com/open-telemetry/opentelemetry-collector)
| `v0.90.0` -> `v0.91.0` |
[![age](https://developer.mend.io/api/mc/badges/age/go/go.opentelemetry.io%2fcollector%2fconsumer/v0.91.0?slim=true)](https://docs.renovatebot.com/merge-confidence/)
|
[![adoption](https://developer.mend.io/api/mc/badges/adoption/go/go.opentelemetry.io%2fcollector%2fconsumer/v0.91.0?slim=true)](https://docs.renovatebot.com/merge-confidence/)
|
[![passing](https://developer.mend.io/api/mc/badges/compatibility/go/go.opentelemetry.io%2fcollector%2fconsumer/v0.90.0/v0.91.0?slim=true)](https://docs.renovatebot.com/merge-confidence/)
|
[![confidence](https://developer.mend.io/api/mc/badges/confidence/go/go.opentelemetry.io%2fcollector%2fconsumer/v0.90.0/v0.91.0?slim=true)](https://docs.renovatebot.com/merge-confidence/)
|

---

### Release Notes

<details>
<summary>open-telemetry/opentelemetry-collector
(go.opentelemetry.io/collector/consumer)</summary>

###
[`v0.91.0`](https://togithub.com/open-telemetry/opentelemetry-collector/blob/HEAD/CHANGELOG.md#v0910)

[Compare
Source](https://togithub.com/open-telemetry/opentelemetry-collector/compare/v0.90.1...v0.91.0)

##### 💡 Enhancements 💡

- `statusreporting`: Automates status reporting upon the completion of
component.Start().
([#&#8203;7682](https://togithub.com/open-telemetry/opentelemetry-collector/issues/7682))
- `service`: add resource attributes as labels to otel metrics to
ensures backwards compatibility with OpenCensus metrics.
([#&#8203;9029](https://togithub.com/open-telemetry/opentelemetry-collector/issues/9029))
- `semconv`: Generated Semantic conventions 1.21.
([#&#8203;9056](https://togithub.com/open-telemetry/opentelemetry-collector/issues/9056))
- `config/confighttp`: Exposes http/2 transport settings to enable
health check and workaround golang http/2 issue
[https://github.com/golang/go/issues/59690](https://togithub.com/golang/go/issues/59690)
([#&#8203;9022](https://togithub.com/open-telemetry/opentelemetry-collector/issues/9022))
- `cmd/builder`: running builder version on binaries installed with `go
install` will output the version specified at the suffix.
([#&#8203;8770](https://togithub.com/open-telemetry/opentelemetry-collector/issues/8770))

##### 🧰 Bug fixes 🧰

- `exporterhelper`: fix missed metric aggregations
([#&#8203;9048](https://togithub.com/open-telemetry/opentelemetry-collector/issues/9048))
This ensures that context cancellation in the exporter doesn't interfere
with metric aggregation. The OTel
SDK currently returns if there's an error in the context used in `Add`.
This means that if there's a
    cancelled context in an export, the metrics are now recorded.

- `service`: Fix bug where MutatesData would not correctly propagate
through connectors.
([#&#8203;9053](https://togithub.com/open-telemetry/opentelemetry-collector/issues/9053))

###
[`v0.90.1`](https://togithub.com/open-telemetry/opentelemetry-collector/blob/HEAD/CHANGELOG.md#v0901)

[Compare
Source](https://togithub.com/open-telemetry/opentelemetry-collector/compare/v0.90.0...v0.90.1)

##### 🧰 Bug fixes 🧰

- `exporterhelper`: Remove noisy log
([#&#8203;9017](https://togithub.com/open-telemetry/opentelemetry-collector/issues/9017))

</details>

---

### Configuration

📅 **Schedule**: Branch creation - At any time (no schedule defined),
Automerge - At any time (no schedule defined).

🚦 **Automerge**: Disabled by config. Please merge this manually once you
are satisfied.

♻ **Rebasing**: Whenever PR becomes conflicted, or you tick the
rebase/retry checkbox.

🔕 **Ignore**: Close this PR and you won't be reminded about this update
again.

---

- [ ] <!-- rebase-check -->If you want to rebase/retry this PR, check
this box

---

This PR has been generated by [Mend
Renovate](https://www.mend.io/free-developer-tools/renovate/). View
repository job log
[here](https://developer.mend.io/github/ymotongpoo/opentelemetry-collector-extra).

<!--renovate-debug:eyJjcmVhdGVkSW5WZXIiOiIzNy41OS44IiwidXBkYXRlZEluVmVyIjoiMzcuMTAzLjEiLCJ0YXJnZXRCcmFuY2giOiJtYWluIn0=-->

Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
patrobinson added a commit to buildkite/agent that referenced this issue Sep 23, 2024
DrJosh9000 added a commit to buildkite/agent that referenced this issue Sep 23, 2024
DrJosh9000 added a commit to buildkite/agent that referenced this issue Sep 23, 2024
DrJosh9000 added a commit to buildkite/agent that referenced this issue Sep 23, 2024
DrJosh9000 added a commit to buildkite/agent that referenced this issue Sep 23, 2024
DrJosh9000 added a commit to buildkite/agent that referenced this issue Sep 23, 2024
DrJosh9000 added a commit to buildkite/agent that referenced this issue Sep 23, 2024
DrJosh9000 added a commit to buildkite/agent that referenced this issue Sep 23, 2024
patrobinson added a commit to buildkite/agent that referenced this issue Sep 23, 2024
@gopherbot
Copy link
Contributor

Change https://go.dev/cl/617655 mentions this issue: http2: detect hung client connections by confirming stream resets

gopherbot pushed a commit to golang/net that referenced this issue Nov 1, 2024
Consider the case of an unresponsive client connection, where
the server has stopped responding. We send an infinite sequence of
requests to the connection in sequence, each with a timeout.
Each request counts against the concurrency limit for the
connection while active, but when a request times out we send
a RST_STREAM and free up the concurrency slot it was using.

We continue to try to send requests to the connection forever (or
until the kernel closes the underlying TCP connection, or until
ReadIdleTimeout/WriteByteTimeout results in us closing the connection).

Defend against this scenario by counting a canceled request
against the connection concurrency limit until we confirm the
server is responding. Specifically:

Track the number of in-flight request cancellations in cc.pendingResets.
This total counts against the connection concurrency limit.

When sending a RST_STREAM for a canceled request, increment
cc.pendingResets. Send a PING frame to the server, unless a PING
is already in flight.

When receiving a PING response, set cc.pendingResets to 0.

A hung connection will be used for at most
SETTINGS_MAX_CONCURRENT_STREAMS requests.

When StrictMaxConcurrentStreams is false, we will create a
new connection after reaching the concurrency limit for a hung one.

When StrictMaxConcurrentStreams is true, we will continue to
wait for the existing connection until some timeout closes it
or it becomes responsive again.

For golang/go#59690

Change-Id: I0151f9a594af14b32bcb6005a239fa19eb103704
Reviewed-on: https://go-review.googlesource.com/c/net/+/617655
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
Reviewed-by: Jonathan Amsterdam <jba@google.com>
Reviewed-by: Carlos Amedee <carlos@golang.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
NeedsFix The path to resolution is known, but the work has not been done.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants