Close Session when there's failures to write to a Stream #143

Closed
lattwood wants to merge 4 commits

Conversation

@lattwood (Contributor) commented Jan 14, 2025

This pull request introduces a new test to detect a deadlock bug and includes modifications to handle connection write timeouts more robustly. The changes span multiple files, with the primary focus on ensuring that sessions are closed properly when a write timeout occurs.

New Test for Deadlock Detection:

  • deadlock_test.go: Added a comprehensive test, TestTimeoutParallel, to simulate and detect a deadlock scenario in yamux streams. This test sets up two streams, stalls one to fill the send channel, and verifies that the session closes correctly on write timeout.

Improvements in Connection Write Timeout Handling:

  • session.go: Enhanced the waitForSendErr and sendNoWait functions to ensure that the session is properly closed when a connection write timeout or session shutdown occurs. The changes include:
    • Adding deferred functions that call s.exitErr(err) if such an error occurs (a simplified sketch follows).
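
A minimal sketch of the pattern, assuming the general shape of yamux's internals (simplified, not the exact diff in this PR):

func (s *Session) waitForSendErr(hdr header, body []byte, errCh chan error) (err error) {
	defer func() {
		// Added behaviour: a write timeout leaves the frame stuck in sendCh,
		// so treat the Session as broken instead of silently carrying on.
		if err == ErrConnectionWriteTimeout || err == ErrSessionShutdown {
			s.exitErr(err)
		}
	}()

	timer := time.NewTimer(s.config.ConnectionWriteTimeout)
	defer timer.Stop()

	ready := &sendReady{Hdr: hdr, Body: body, Err: errCh}
	select {
	case s.sendCh <- ready: // queued for the send loop
	case <-s.shutdownCh:
		return ErrSessionShutdown
	case <-timer.C:
		return ErrConnectionWriteTimeout
	}

	select {
	case err = <-errCh: // result from the send loop
		return err
	case <-s.shutdownCh:
		return ErrSessionShutdown
	case <-timer.C:
		return ErrConnectionWriteTimeout
	}
}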

Test Adjustments:

  • session_test.go: Modified TestSession_PingOfDeath to handle the expected ErrSessionShutdown error when a write times out, ensuring the test accounts for the new session shutdown behaviour.

Closes #142.

I humbly request a callout in changelogs when this bugfix ships :)

@schmichael (Member) left a comment

I think I see what's going on here: the writing side of a unidirectional Stream (streamBserver in this case) must close its connection for the reading side to be notified that it has gone away. Your change makes it so Sessions close themselves on any send error, and that appears to work as designed.

In theory yamux supports retrying Writes on Streams which receive an ErrConnectionWriteTimeout error. However, in practice I highly doubt anyone does this, because the Write that timed out is still in the sendCh buffer and could succeed later. What is a Stream to do? Wait for a Response just in case the send succeeds? Try the Write again if the application layer protocol is idempotent? I highly doubt any consumers of yamux do anything but Stream.Close() on any error from a Stream operation, and I think that's probably for the best.

This PR

I'm still hesitant to accept this PR because I don't think it's fixing a bug as much as being defensive against coding errors (a missing Stream.Close() in this case). I could not find an example of this coding error in Nomad, but I intend to keep looking. We do take these things very seriously!

Speaking of Nomad, another reason I'm hesitant to close the whole Session whenever a single send times out is that Nomad doesn't always know what kind of net.Conn it's using. It might be a TCP connection, a TLS session, or a yamux Stream. If a Write to any of these connections fails, Nomad will want to explicitly close the connection and create a new one. For example, Go's crypto/tls says:

After a Write has timed out, the TLS state is corrupt and all future writes will return the same error.

So clearly the connection is useless, but do you need to explicitly close it? The examples do, and I have to imagine always explicitly closing a connection that will not be reused is the most robust choice.

So if users should be explicitly closing non-yamux.Stream net.Conns on error, I'd prefer they stick to that idiom with yamux.Streams as well.
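
For illustration, the close-on-error idiom being described might look like this on the caller side (a hedged sketch, not code from this PR; it assumes the hashicorp/yamux package):

package example

import (
	"github.com/hashicorp/yamux"
)

// sendRequest treats a yamux Stream like any other net.Conn: on any error,
// close the Stream and let the caller establish a new one.
func sendRequest(sess *yamux.Session, payload []byte) error {
	stream, err := sess.Open()
	if err != nil {
		return err
	}
	defer stream.Close() // always release the Stream, success or failure

	if _, err := stream.Write(payload); err != nil {
		// This covers ErrConnectionWriteTimeout too: don't retry on the
		// same Stream, just let the defer close it and report the error.
		return err
	}

	buf := make([]byte, 1024)
	_, err = stream.Read(buf)
	return err
}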

And finally I think there are some risks to throwing away a whole Session on a single Stream's error. In your test Stream A is fine because it doesn't Write during the partition. But with your code change, Stream A, and any other Streams on a Session that experiences any kind of error, get thrown away. This could cause a lot of unnecessary disruption. For example in Nomad blocking queries may sit in a Read on an idle connection for minutes. The heartbeat mechanism is what ensures those connections don't miss a disconnect, and as far as I can tell the heartbeat mechanism isn't in question.

Test Variants

To arrive at these conclusions I spent quite a lot of time altering your very useful test case. You can see 2 variants in this gist: https://gist.github.com/schmichael/85bf905c95e0c7b03487b27832e8351c

The first tries to be a minimal variant with 2 changes:

  1. It closes streams on errors.
  2. It periodically prints received messages to demonstrate Stream A unpauses/unblocks as expected.

The second variant is much more drastic and tries to represent a more idiomatic approach to using yamux. It was extremely useful to play around with, but I will admit the end state has diverged sufficiently from the original that I may not be exercising the same cases.

Conclusion

Both test variants lead me to believe that in the absence of coding errors yamux and tools that use it like Nomad do not risk DoS in the face of transient network partitions. Either heartbeat failures or explicit Close calls on timed out Streams will eventually free both sides of a connection.

Next week I intend to audit Nomad's use of yamux Streams to ensure we are explicitly closing Streams on errors.

In addition the Nomad team intends to do more testing of full Client and Server agents that communicate through a proxy of our control so we can simulate various flaky conditions.

THANK YOU!

We take the security and reliability of yamux and Nomad seriously and greatly appreciate your contributions! For potential security vulnerabilities we prefer to hear about them directly first, as we lay out here: https://www.hashicorp.com/trust/security/vulnerability-management

That triggers a much more robust internal process than "schmichael promises to check out that suspicious issue someday (if he doesn't get distracted first)." 😬

I also know you've been battling hashicorp/nomad#23305 for a long time, and hopefully this latest attempt will help us find the root cause!

Let me know what you think, and expect to hear back from us on this PR or the original issue soon.

Comment on lines +163 to +167
// Start A writer (server side)
go func() {
	wg.Add(1)
	defer wg.Done()
	streamAserver, _ := serverSes.Open()

Usually when only one peer Opens streams and the other peer only Accepts streams, the Opener/initiator is considered the "client" and the Accepter is considered the "server." Not a functional problem though.

Comment on lines +190 to +194
streamBserver, _ := serverSes.Open()
err := monotonicWriter(t, streamBserver, "B", nil)
if err != nil && !errors.Is(err, ErrConnectionWriteTimeout) {
	failNow(errCh, "[B-writer-server] received unexpected error", err)
}

You're not calling streamBserver.Close() which I would expect to always be done (usually in a defer after a successful Open) regardless of whether Stream.Write() last returned an error or not.
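
For illustration, the suggested shape of that hunk with the Close added (a hedged sketch reusing the test's own helpers):

streamBserver, err := serverSes.Open()
if err != nil {
	failNow(errCh, "[B-writer-server] open failed", err)
	return
}
defer streamBserver.Close() // close the Stream no matter how the writer exits

if err := monotonicWriter(t, streamBserver, "B", nil); err != nil && !errors.Is(err, ErrConnectionWriteTimeout) {
	failNow(errCh, "[B-writer-server] received unexpected error", err)
}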

Comment on lines +32 to +33
label string
val int

Since these fields are unexported, they are never JSON encoded/decoded.
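
For the fields to round-trip through encoding/json they would need to be exported, for example (illustrative only):

type payload struct {
	Label string `json:"label"`
	Val   int    `json:"val"`
}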

Comment on lines +175 to +178
go func() {
	wg.Add(1)
	defer wg.Done()
	streamAclient, _ := clientSes.Accept()

Because you are Accept()ing Streams concurrently in two distinct goroutines, streamAserver may be accepted by streamBclient and vice versa.
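
One way to make the pairing deterministic (an illustrative sketch, not part of the PR) is to have each opener write a one-byte label first and have the accepting side read it before deciding which reader the Stream belongs to:

stream, err := clientSes.Accept()
if err != nil {
	failNow(errCh, "[client] accept failed", err)
	return
}
label := make([]byte, 1)
if _, err := io.ReadFull(stream, label); err != nil { // requires the io import
	failNow(errCh, "[client] reading stream label failed", err)
	return
}
switch label[0] {
case 'A':
	// hand off to the A reader
case 'B':
	// hand off to the B reader
}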

@lattwood (Contributor, author) commented Feb 2, 2025 via email

@lattwood (Contributor, author) commented Feb 4, 2025

Both test variants lead me to believe that in the absence of coding errors yamux and tools that use it like Nomad do not risk DoS in the face of transient network partitions. Either heartbeat failures or explicit Close calls on timed out Streams will eventually free both sides of a connection.

ಠ_ಠ (slightly tongue in cheek)

One could argue that libraries should be defensive (the default keepalive/ping interval should be set below the write timeout), and the case is made by HashiCorp's own lack of handling for write timeouts when using HashiCorp's own connection multiplexing library. If HashiCorp couldn't get it right initially, it's not reasonable to expect others to.

"In the absence of coding errors" I agree with, which means the related Go vulndb report will need to be amended, or another one created, so folks using Nomad or other tools that aren't properly handling write timeouts (causing CWE-1088) are notified. I empathize with the fact that this likely affects every Nomad version that has been released (and potentially Terraform; I can think of some edge cases that would impact Terraform Plugins if you had an overly aggressive EDR installed), making for a rather widely scoped notification.

[closing the Session when there's Stream write failures] could cause a lot of unnecessary disruption. For example in Nomad blocking queries may sit in a Read on an idle connection for minutes.

My initial thought was that a naive read timeout, akin to the write timeout, would fix the issue, but it would then cause blocking queries to return prematurely; I think that behaviour is somewhat core. Had read and write timeouts been functionally equivalent, there wouldn't have been the long-polling/blocking issue in the first place.

With regard to "unnecessary disruption": this issue has been plaguing my team since at least Q4 2023, though it took us until the middle of last year to gather enough information to file an issue. It caused constant deployment failures that hurt developer velocity and required SRE involvement every time to go around restarting Nomad nodes, duplicate containers running on hosts with constraints that should prevent that, and outright failure of nodes to start their assigned containers (i.e. system jobs as well as regular ones).

Compared to that context, there really shouldn't be that much overhead to establish fresh connections in the event of write timeouts.


I could not find an example of this coding error in Nomad, but I intend to keep looking. We do take these things very seriously!

I pulled the thread on this after writing the above and didn't want to rewrite it, but some of it is still relevant, so take that into consideration.

Nomad uses net/rpc, which uses a codec to send responses over the wire; in Nomad's case, that codec is msgpack. As you can see, the error from calling WriteResponse is never returned to anything, and is only logged when a package-private variable, debugLog, is set to true. The msgpack RPC codec HashiCorp has written also doesn't do anything special when WriteResponse is called, or when the package-private write method is called.

This means when net/rpc Writes fail, they fail silently. ಠ_ಠ (not tongue in cheek, mostly at the stdlib's implementation)

As far as I can tell, this means all Write errors encountered by Nomad when communicating by way of net/rpc are susceptible to this bug/issue, and additionally, as the libraries (both standard and Hashicorp) are currently written, it isn't possible to address this issue solely in Nomad.

It looks like handling this issue would require changing the behaviour of the standard library's net/rpc to propagate the write error, but looking at net/rpc's documentation, it doesn't appear there's anywhere to propagate the error TO. Based on that, regardless of the disruption concerns, it appears as though yamux needs to independently recover from this condition.

edit: https://github.com/golang/go/blob/b07b20fbb591ac77158e1089299ce5acad71ffde/src/net/rpc/server.go#L408-L428

Another option is to make the custom msgpack encoder close the Stream like the gob encoder does in the event of write errors.
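
A hedged sketch of that option (a hypothetical wrapper, not the actual go-msgpack or net-rpc-msgpackrpc code): wrap the net/rpc ServerCodec so a failed WriteResponse closes the codec, and with it the underlying yamux Stream, mirroring what the stdlib gob codec does on encode errors.

package rpcutil // hypothetical package

import "net/rpc"

// closeOnWriteErrCodec wraps any net/rpc ServerCodec and closes it the first
// time WriteResponse returns an error, since net/rpc itself only logs that
// error and otherwise carries on.
type closeOnWriteErrCodec struct {
	rpc.ServerCodec
	closed bool
}

func (c *closeOnWriteErrCodec) WriteResponse(r *rpc.Response, body interface{}) error {
	err := c.ServerCodec.WriteResponse(r, body)
	if err != nil && !c.closed {
		c.closed = true
		c.ServerCodec.Close() // tears down the underlying conn/Stream
	}
	return err
}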

@lattwood (Contributor, author) commented Feb 4, 2025

Go stdlib issue opened for net/rpc error issues- golang/go#71559

@tgross (Member) commented Feb 4, 2025

@lattwood just wanted to give you a heads up that I'm working on building out a reproduction of the Nomad issue in hopes of getting some empirical insights into the underlying problem here. I'll report back on that soon.

@lattwood (Contributor, author) commented Feb 4, 2025

@tgross fantastic, thank you (and good luck).

We have a control plane in eu-central-1, agents in Southeast Asia, and regularly see large raft update messages.

nomad job status -json line-of-business-app | jq -c | wc -c
 18164148

that's the size of the json for the job that is impacted the worst, which I'm including to underscore the sheer size of the writes going between servers and agents.

> nomad job status -json | jq -c | wc -c
 60789391

full cluster job status.

We have 3 masters, and 2533 nodes.

[screenshot omitted; described below]

Here's the number of nodes that have been impacted by this through inference (they have a pending allocation due to this bug). It does not include nodes that have the hung read and haven't had an allocation updated though. It's a partial screenshot from the Job summary page in the UI.

@lattwood (Contributor, author) commented Feb 4, 2025

This is how I reconcile the broken nodes with pending allocations. This doesn't capture nodes that have a hung read and haven't had an allocation change state.

parallel -j16 'ssh -F /dev/null -i ~/.ssh/REDACTED -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null root@{} '"'killall -9 nomad'" ::: `nomad job status -json | jq -r '.[].Allocations[] | select(.ClientStatus == "pending") | .NodeName' | sort -u`

This is how I get the node_id's for everything with a hung read in yamux. I get goroutine stacks via the pprof endpoint, then use overlapping grep commands to find any hosts that have been blocked in yamux for at least 10 minutes.

nomad node status -json | jq -r '.[] | select(.Status=="ready") | .ID' | parallel -j16 "curl -s -k --cacert $NOMAD_CACERT --cert $NOMAD_CLIENT_CERT --key $NOMAD_CLIENT_KEY --header X-Nomad-Token:\ $NOMAD_TOKEN https://REDACTED:4646/v1/agent/pprof/goroutine\?debug=2\&node_id={} | grep -B2 yamux | grep -A2 -E ' [0-9]{2,} minutes' | grep -B1 Read | grep minutes > {}.out"

From the node_ids, I look them up with nomad node status, ssh in, and kill nomad again.

@tgross (Member) commented Feb 4, 2025

Ok, I'm at the end of my day here but I want to brain-dump what I've got so far...

I don't have a reproduction yet but I have some insights as to why this has been challenging to reproduce. My test rig is a server and a client each in their own network namespace. The structure of the test is to induce traffic from one to the other, and then intervene with either iptables or tc to simulate network splits/congestion.

(I also discovered another strange bug in the Nomad client where, with an iptables split, I was able to induce the client to get stuck on a much shorter "wait" for its blocking RPC than the 5min it should be waiting. We set a shorter wait when retrying and must not be resetting that properly. I'll get an issue open in the Nomad repo for that in the next day or so.)

Inducing the ErrConnectionWriteTimeout instead of some other error has proven surprisingly hard, and I think that's because of buffering at the TCP level. The congestion is on the network, but the write timeout we're worried about is in yamux. So if the underlying TCP connection buffer is big enough for yamux to write the header+body to it (via the msgpack codec), then we return without error to yamux's write. Nothing else (in Nomad) will write on that stream till we get a response. But the body can just be sitting in the TCP buffers at that point.

To give myself the best odds of detecting the bug we care about here, I'm doing the following (a rough sketch of this setup appears after the list):

  • Setting Nomad's yamux keepalive window to 120s.
  • Setting Nomad's yamux write timeout to 1s.
  • Setting the TCP connection's write buffer to 1 byte.
  • Adding logging to yamux so that it reports when it sends a keepalive, so I can time when to interrupt traffic.
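
A rough sketch of that setup (assumed values and a hypothetical helper name, not the actual test rig):

package testrig // hypothetical

import (
	"net"
	"time"

	"github.com/hashicorp/yamux"
)

func newLossSensitiveSession(conn *net.TCPConn) (*yamux.Session, error) {
	// Shrink the kernel-side send buffer so writes back up into yamux's own
	// buffering (and hit its timeout) instead of vanishing into TCP.
	if err := conn.SetWriteBuffer(1); err != nil {
		return nil, err
	}

	cfg := yamux.DefaultConfig()
	cfg.EnableKeepAlive = true
	cfg.KeepAliveInterval = 120 * time.Second    // keepalive fires long after...
	cfg.ConnectionWriteTimeout = 1 * time.Second // ...the write timeout does

	return yamux.Client(conn, cfg)
}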

After a dozen variations I've managed to trigger the ErrConnectionWriteTimeout at session.go#L441 but not the one earlier in the waitForSendErr. Unfortunately that test did not trigger the condition where (*Stream).Read hangs longer than the network interruption. Once connectivity is restored, the read unblocks.

However, the RPC handler is not getting an error after the ErrConnectionWriteTimeout; it's hanging until either the keepalive fails or the netsplit heals! Which is certainly a bug even if it doesn't quite match the behavior described in #142. I think your observation about net/rpc is likely correct, but that would be a bug in the msgpack library and we should fix it there by closing on error, as was done in gob (ref golang/go#7689). But before we go running off to do that, I'd really like to have an exact repro of the behavior you're seeing so we can be confident we've put this bug to rest. I'll take another run at it tomorrow and will report back after that and/or open a PR in the msgpack library.

(Please hold off on opening an issue/PR there, as another team who isn't in the loop on this bug and is in a far-flung TZ will get notifications and that'll cause them a bit of scramble. We're making progress! 😁 )

@lattwood (Contributor, author) commented Feb 4, 2025 via email

@lattwood (Contributor, author) commented Feb 5, 2025

After a dozen variations I've managed to trigger the ErrConnectionWriteTimeout at session.go#L441 but not the one earlier in the waitForSendErr. Unfortunately that test did not trigger the condition where (*Stream).Read hangs longer than the network interruption. Once connectivity is restored, the read unblocks.

Yeah, this is the insidious part. You need to get write timeouts on the connection, and then never write to it again, to replicate the behaviour of the msgpack codec.

sendCh: make(chan *sendReady, 64),

What you need to do (which sounds insane) is fill that buffered channel to reproduce the hanging read, and keep it full for the entire write timeout. It looks like you managed to get the OS buffers full but not yamux's, so almost there!
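
A toy illustration of that mechanism (not yamux code): once a buffered channel of capacity 64 is full and nothing drains it, the next send blocks, and a select with a timer, like yamux's waitForSendErr, gives up after the write timeout while everything already queued stays stuck.

package main

import (
	"fmt"
	"time"
)

func main() {
	sendCh := make(chan int, 64) // same capacity as yamux's sendCh
	for i := 0; i < 64; i++ {
		sendCh <- i // fill the buffer; nothing is draining it
	}

	timer := time.NewTimer(1 * time.Second) // stand-in for ConnectionWriteTimeout
	defer timer.Stop()

	select {
	case sendCh <- 64:
		fmt.Println("queued (cannot happen while nothing drains the channel)")
	case <-timer.C:
		fmt.Println("write timed out; 64 frames are still stuck in the buffer")
	}
}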

I'm going to open up some PRs to fix their codec behaviour 🤦🏻‍♂️ , and I'll create a branch and link it here for the msgpack stuff if you don't get to that before I do. I won't actually create the PR until later due to the aforementioned TZ stuff.

@tgross (Member) commented Feb 5, 2025

I'm going to open up some PRs to fix their codec behaviour 🤦🏻‍♂️ , and I'll create a branch and link it here for the msgpack stuff if you don't get to that before I do. I won't actually create the PR until later due to the aforementioned TZ stuff.

"Their" who? The msgpack codec repo? That's just as much "us" as this one in terms of implementation expertise. But there's another "IP compliance" team that gets pinged on these repos, especially when there's an industry-wide scramble created by the spurious vuln DB filing. Please just hold on, @lattwood, and let us work here.

@lattwood (Contributor, author) commented Feb 5, 2025

@lattwood (Contributor, author) commented Feb 5, 2025

One other realization: it's been stated numerous times that using mTLS with Nomad prevents direct exploitation of this issue.

But starting with 1.10, official guidance is to turn off mTLS when using Workload Identity with Vault.

@tgross (Member) commented Feb 5, 2025

But starting with 1.10, official guidance is to turn off mTLS when using Workload Identity with Vault.

For HTTP, not RPC.

@finnigja commented Feb 5, 2025

👋 from HashiCorp Security. Based on investigations detailed above, we do not believe that Yamux is exposed to a denial of service vulnerability as described or that the proposed changes are necessary to address such a vulnerability.

Yamux requires that dependent applications explicitly close streams on write errors. Coding errors may cause network connectivity issues, but Yamux does not attempt to defend against these errors by design.

We’ll continue to investigate the Nomad-specific symptoms that initially pointed the PR submitter in this direction.

@tgross (Member) commented Feb 5, 2025

I've got a follow-up on where we're at on this, @lattwood, but give me a little bit to write it up and vet it with @schmichael. Will post back here shortly.

@tgross closed this Feb 5, 2025

@tgross (Member) commented Feb 5, 2025

I've been working to build out a reproduction of the behavior you've described and I'm still unsuccessful after 2 days.

With a hacked-up API client that starts a stream (via alloc exec), pauses with it open long enough for me to create network congestion, and then sends an endless stream of base64 garbage, I can regularly trigger the ErrConnectionWriteTimeout as described. But this isn't resulting in the client getting stuck no matter how I tune buffers and timeouts. In all cases, the writer closes either the stream or session (depending on details). Where the reader side doesn't close the stream it's returned to the pool where the window gets resized.

Logging I added in yamux, go-msgpack, and net-rpc-msgpackrpc suggests that I'm still not hitting the code path you've identified where net/rpc swallows the error. Because of code generation in release builds, it gets a little hairy exactly which code path we take to WriteResponse, as both libraries implement it.

That being said, the gob fix in golang/go@cd5b785 also seems reasonably safe. This also lines up with libp2p's yamux fork not auto-closing on errors: they don't use our msgpack RPC, so they don't have this problem! This problem is only likely to be observed in RPC implementations like this, where the actual Stream.Write is buried in a black box like net/rpc. (Also note that libp2p's fork may not be protocol-compatible, so just dropping it in may be a dead end.)

So here's what I'd like to propose:

  • Go ahead and open the gob fix as PRs on the go-msgpack and net-rpc-msgpackrpc libraries. Reference "pending allocations stuck in pending state after adoption by a new deployment" (nomad#23305) and tag @schmichael and me in the description. We'll pop in and let other HashiCorp folks who might jump in know that this is work-in-progress.
  • Build yourself a binary with that fix and deploy it to some chunk of your cluster.
  • If the problem goes away, great! We'll merge the PRs and ship them.
  • If not, let's head back to #23305 and @schmichael and I will help debug from there.
    • Bring data! Ideally we'll get logs and goroutine stacks from both ends of the problem.
    • If it becomes possible for you to provide a minimal reproduction, that'd be even better.
    • Let's try to keep the discussion centralized in #23305 until we have that smoking gun. My gut tells me that the symptoms you're having may extend beyond networking/RPC issues, so it'll help if we can dissect things together in one place.


Successfully merging this pull request may close these issues.

DefaultConfig has dangerous defaults causing hung Reads