Close Session when there's failures to write to a Stream #143

Closed
lattwood wants to merge 4 commits

Conversation

@lattwood (Contributor) commented Jan 14, 2025

This pull request introduces a new test to detect a deadlock bug and includes modifications to handle connection write timeouts more robustly. The changes span multiple files, with the primary focus on ensuring that sessions are closed properly when a write timeout occurs.

New Test for Deadlock Detection:

  • deadlock_test.go: Added a comprehensive test, TestTimeoutParallel, to simulate and detect a deadlock scenario in yamux streams. This test sets up two streams, stalls one to fill the send channel, and verifies that the session closes correctly on write timeout.

Improvements in Connection Write Timeout Handling:

  • session.go: Enhanced the waitForSendErr and sendNoWait functions to ensure that the session is properly closed when a connection write timeout or session shutdown occurs. The changes include:
    • Adding deferred functions that call s.exitErr(err) if such an error occurs (a simplified sketch follows).
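
A minimal sketch of the pattern, assuming the general shape of yamux's internals (simplified, not the exact diff in this PR):

func (s *Session) waitForSendErr(hdr header, body []byte, errCh chan error) (err error) {
	defer func() {
		// Added behaviour: a write timeout leaves the frame stuck in sendCh,
		// so treat the Session as broken instead of silently carrying on.
		if err == ErrConnectionWriteTimeout || err == ErrSessionShutdown {
			s.exitErr(err)
		}
	}()

	timer := time.NewTimer(s.config.ConnectionWriteTimeout)
	defer timer.Stop()

	ready := &sendReady{Hdr: hdr, Body: body, Err: errCh}
	select {
	case s.sendCh <- ready: // queued for the send loop
	case <-s.shutdownCh:
		return ErrSessionShutdown
	case <-timer.C:
		return ErrConnectionWriteTimeout
	}

	select {
	case err = <-errCh: // result from the send loop
		return err
	case <-s.shutdownCh:
		return ErrSessionShutdown
	case <-timer.C:
		return ErrConnectionWriteTimeout
	}
}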

Test Adjustments:

  • session_test.go: Modified TestSession_PingOfDeath to handle the expected ErrSessionShutdown error when a write times out, ensuring the test accounts for the new session shutdown behaviour.

Closes #142.

I humbly request a callout in changelogs when this bugfix ships :)

@schmichael (Member) left a comment

I think I see what's going on here: the writing side of a unidirectional Stream (streamBserver in this case) must close its connection for the reading side to be notified that it has gone away. Your change makes it so Sessions close themselves on any send error, and that appears to work as designed.

In theory yamux supports retrying Writes on Streams which receive an ErrConnectionWriteTimeout error. However, in practice I highly doubt anyone does this, because the Write that timed out is still in the sendCh buffer and could succeed later. What is a Stream to do? Wait for a Response just in case the send succeeds? Try the Write again if the application layer protocol is idempotent? I highly doubt any consumers of yamux do anything but Stream.Close() on any error from a Stream operation, and I think that's probably for the best.

This PR

I'm still hesitant to accept this PR because I don't think it's fixing a bug as much as being defensive against coding errors (a missing Stream.Close() in this case). I could not find an example of this coding error in Nomad, but I intend to keep looking. We do take these things very seriously!

Speaking of Nomad, another reason I'm hesitant to close the whole Session whenever a single send times out is that Nomad doesn't always know what kind of net.Conn it's using. It might be a TCP connection, a TLS session, or a yamux Stream. If a Write to any of these connections fails, Nomad will want to explicitly close the connection and create a new one. For example, Go's crypto/tls says:

After a Write has timed out, the TLS state is corrupt and all future writes will return the same error.

So clearly the connection is useless, but do you need to explicitly close it? The examples do, and I have to imagine always explicitly closing a connection that will not be reused is the most robust choice.

So if users should be explicitly closing non-yamux.Stream net.Conns on error, I'd prefer they stick to that idiom with yamux.Streams as well.
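
For illustration, the close-on-error idiom being described might look like this on the caller side (a hedged sketch, not code from this PR; it assumes the hashicorp/yamux package):

package example

import (
	"github.com/hashicorp/yamux"
)

// sendRequest treats a yamux Stream like any other net.Conn: on any error,
// close the Stream and let the caller establish a new one.
func sendRequest(sess *yamux.Session, payload []byte) error {
	stream, err := sess.Open()
	if err != nil {
		return err
	}
	defer stream.Close() // always release the Stream, success or failure

	if _, err := stream.Write(payload); err != nil {
		// This covers ErrConnectionWriteTimeout too: don't retry on the
		// same Stream, just let the defer close it and report the error.
		return err
	}

	buf := make([]byte, 1024)
	_, err = stream.Read(buf)
	return err
}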

And finally I think there are some risks to throwing away a whole Session on a single Stream's error. In your test Stream A is fine because it doesn't Write during the partition. But with your code change, Stream A, and any other Streams on a Session that experiences any kind of error, get thrown away. This could cause a lot of unnecessary disruption. For example in Nomad blocking queries may sit in a Read on an idle connection for minutes. The heartbeat mechanism is what ensures those connections don't miss a disconnect, and as far as I can tell the heartbeat mechanism isn't in question.

Test Variants

To arrive at these conclusions I spent quite a lot of time altering your very useful test case. You can see 2 variants in this gist: https://gist.github.com/schmichael/85bf905c95e0c7b03487b27832e8351c

The first tries to be a minimal variant with 2 changes:

  1. It closes streams on errors.
  2. It periodically prints received messages to demonstrate Stream A unpauses/unblocks as expected.

The second variant is much more drastic and tries to represent a more idiomatic approach to using yamux. It was extremely useful to play around with, but I will admit the end state has diverged sufficiently from the original that I may not be exercising the same cases.

Conclusion

Both test variants lead me to believe that in the absence of coding errors yamux and tools that use it like Nomad do not risk DoS in the face of transient network partitions. Either heartbeat failures or explicit Close calls on timed out Streams will eventually free both sides of a connection.

Next week I intend to audit Nomad's use of yamux Streams to ensure we are explicitly closing Streams on errors.

In addition the Nomad team intends to do more testing of full Client and Server agents that communicate through a proxy of our control so we can simulate various flaky conditions.

THANK YOU!

We take the security and reliability of yamux and Nomad seriously and greatly appreciate your contributions! For potential security vulnerabilities we prefer to hear about them directly first, as we lay out here: https://www.hashicorp.com/trust/security/vulnerability-management

That triggers a much more robust internal process than "schmichael promises to check out that suspicious issue someday (if he doesn't get distracted first)." 😬

I also know you've been battling hashicorp/nomad#23305 for a long time, and hopefully this latest attempt will help us find the root cause!

Let me know what you think, and expect to hear back from us on this PR or the original issue soon.

Comment on lines +163 to +167
// Start A writer (server side)
go func() {
	wg.Add(1)
	defer wg.Done()
	streamAserver, _ := serverSes.Open()

Usually when only one peer Opens streams and the other peer only Accepts streams, the Opener/initiator is considered the "client" and the Accepter is considered the "server." Not a functional problem though.

Comment on lines +190 to +194
streamBserver, _ := serverSes.Open()
err := monotonicWriter(t, streamBserver, "B", nil)
if err != nil && !errors.Is(err, ErrConnectionWriteTimeout) {
	failNow(errCh, "[B-writer-server] received unexpected error", err)
}

You're not calling streamBserver.Close() which I would expect to always be done (usually in a defer after a successful Open) regardless of whether Stream.Write() last returned an error or not.
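
For illustration, the suggested shape of that hunk with the Close added (a hedged sketch reusing the test's own helpers):

streamBserver, err := serverSes.Open()
if err != nil {
	failNow(errCh, "[B-writer-server] open failed", err)
	return
}
defer streamBserver.Close() // close the Stream no matter how the writer exits

if err := monotonicWriter(t, streamBserver, "B", nil); err != nil && !errors.Is(err, ErrConnectionWriteTimeout) {
	failNow(errCh, "[B-writer-server] received unexpected error", err)
}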

Comment on lines +32 to +33
label string
val int

Since these fields are unexported, they are never JSON encoded/decoded.
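
For the fields to round-trip through encoding/json they would need to be exported, for example (illustrative only):

type payload struct {
	Label string `json:"label"`
	Val   int    `json:"val"`
}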

Comment on lines +175 to +178
go func() {
	wg.Add(1)
	defer wg.Done()
	streamAclient, _ := clientSes.Accept()

Because you are Accept()ing Streams concurrently in two distinct goroutines, streamAserver may be accepted by streamBclient and vice versa.
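
One way to make the pairing deterministic (an illustrative sketch, not part of the PR) is to have each opener write a one-byte label first and have the accepting side read it before deciding which reader the Stream belongs to:

stream, err := clientSes.Accept()
if err != nil {
	failNow(errCh, "[client] accept failed", err)
	return
}
label := make([]byte, 1)
if _, err := io.ReadFull(stream, label); err != nil { // requires the io import
	failNow(errCh, "[client] reading stream label failed", err)
	return
}
switch label[0] {
case 'A':
	// hand off to the A reader
case 'B':
	// hand off to the B reader
}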

@lattwood (Contributor, author) commented Feb 2, 2025 via email

@lattwood (Contributor, author) commented Feb 4, 2025

Both test variants lead me to believe that in the absence of coding errors yamux and tools that use it like Nomad do not risk DoS in the face of transient network partitions. Either heartbeat failures or explicit Close calls on timed out Streams will eventually free both sides of a connection.

ಠ_ಠ (slightly tongue in cheek)

One could argue that libraries should be defensive (the default keepalive/ping interval should be set below the write timeout), and the case is made by HashiCorp's own lack of handling for write timeouts when using HashiCorp's own connection multiplexing library. If HashiCorp couldn't get it right initially, it's not reasonable to expect others to.

"In the absence of coding errors" I agree with, which means the related Go vulndb report will need to be amended, or another one created, so folks using Nomad or other tools that aren't properly handling write timeouts (causing CWE-1088) are notified. I empathize with the fact that this likely affects every Nomad version that has been released (and potentially Terraform; I can think of some edge cases that would impact Terraform Plugins if you had an overly aggressive EDR installed), making for a rather widely scoped notification.

[closing the Session when there's Stream write failures] could cause a lot of unnecessary disruption. For example in Nomad blocking queries may sit in a Read on an idle connection for minutes.

My initial thought was that a naive read timeout, akin to the write timeout, would fix the issue, but it would then cause blocking queries to return prematurely; I think that behaviour is somewhat core. Had read and write timeouts been functionally equivalent, there wouldn't have been the long-polling/blocking issue in the first place.

With regard to "unnecessary disruption": this issue has been plaguing my team since at least Q4 2023, though it took us until the middle of last year to gather enough information to file an issue. It caused constant deployment failures that hurt developer velocity and required SRE involvement every time to go around restarting Nomad nodes, duplicate containers running on hosts with constraints that should prevent that, and outright failure of nodes to start their assigned containers (i.e. system jobs as well as regular ones).

Compared to that context, there really shouldn't be that much overhead to establish fresh connections in the event of write timeouts.


I could not find an example of this coding error in Nomad, but I intend to keep looking. We do take these things very seriously!

I pulled the thread on this after writing the above and didn't want to rewrite it, but some of it is still relevant, so take that into consideration.

Nomad uses net/rpc, which uses a codec to send responses over the wire; in Nomad's case, that codec is msgpack. As you can see, the error from calling WriteResponse is never returned to anything, and is only logged when a package-private variable, debugLog, is set to true. The msgpack RPC codec HashiCorp has written also doesn't do anything special when WriteResponse is called, or when the package-private write method is called.

This means when net/rpc Writes fail, they fail silently. ಠ_ಠ (not tongue in cheek, mostly at the stdlib's implementation)

As far as I can tell, this means all Write errors encountered by Nomad when communicating by way of net/rpc are susceptible to this bug/issue, and additionally, as the libraries (both standard and Hashicorp) are currently written, it isn't possible to address this issue solely in Nomad.

It looks like handling this issue would require changing the behaviour of the standard library's net/rpc to propagate the write error, but looking at net/rpc's documentation, it doesn't appear there's anywhere to propagate the error TO. Based on that, regardless of the disruption concerns, it appears as though yamux needs to independently recover from this condition.

edit: https://github.com/golang/go/blob/b07b20fbb591ac77158e1089299ce5acad71ffde/src/net/rpc/server.go#L408-L428

Another option is to make the custom msgpack encoder close the Stream like the gob encoder does in the event of write errors.
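
A hedged sketch of that option (a hypothetical wrapper, not the actual go-msgpack or net-rpc-msgpackrpc code): wrap the net/rpc ServerCodec so a failed WriteResponse closes the codec, and with it the underlying yamux Stream, mirroring what the stdlib gob codec does on encode errors.

package rpcutil // hypothetical package

import "net/rpc"

// closeOnWriteErrCodec wraps any net/rpc ServerCodec and closes it the first
// time WriteResponse returns an error, since net/rpc itself only logs that
// error and otherwise carries on.
type closeOnWriteErrCodec struct {
	rpc.ServerCodec
	closed bool
}

func (c *closeOnWriteErrCodec) WriteResponse(r *rpc.Response, body interface{}) error {
	err := c.ServerCodec.WriteResponse(r, body)
	if err != nil && !c.closed {
		c.closed = true
		c.ServerCodec.Close() // tears down the underlying conn/Stream
	}
	return err
}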

@lattwood (Contributor, author) commented Feb 4, 2025

Go stdlib issue opened for net/rpc error issues- golang/go#71559

@tgross (Member) commented Feb 4, 2025

@lattwood just wanted to give you a heads up that I'm working on building out a reproduction of the Nomad issue in hopes of getting some empirical insights into the underlying problem here. I'll report back on that soon.

@lattwood (Contributor, author) commented Feb 4, 2025

@tgross fantastic, thank you (and good luck).

We have a control plane in eu-central-1, agents in Southeast Asia, and regularly see large raft update messages.

nomad job status -json line-of-business-app | jq -c | wc -c
 18164148

that's the size of the json for the job that is impacted the worst, which I'm including to underscore the sheer size of the writes going between servers and agents.

> nomad job status -json | jq -c | wc -c
 60789391

full cluster job status.

We have 3 masters, and 2533 nodes.

[screenshot omitted; described below]

Here's the number of nodes that have been impacted by this through inference (they have a pending allocation due to this bug). It does not include nodes that have the hung read and haven't had an allocation updated though. It's a partial screenshot from the Job summary page in the UI.

@lattwood (Contributor, author) commented Feb 4, 2025

This is how I reconcile the broken nodes with pending allocations. This doesn't capture nodes that have a hung read and haven't had an allocation change state.

parallel -j16 'ssh -F /dev/null -i ~/.ssh/REDACTED -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null root@{} '"'killall -9 nomad'" ::: `nomad job status -json | jq -r '.[].Allocations[] | select(.ClientStatus == "pending") | .NodeName' | sort -u`

This is how I get the node_id's for everything with a hung read in yamux. I get goroutine stacks via the pprof endpoint, then use overlapping grep commands to find any hosts that have been blocked in yamux for at least 10 minutes.

nomad node status -json | jq -r '.[] | select(.Status=="ready") | .ID' | parallel -j16 "curl -s -k --cacert $NOMAD_CACERT --cert $NOMAD_CLIENT_CERT --key $NOMAD_CLIENT_KEY --header X-Nomad-Token:\ $NOMAD_TOKEN https://REDACTED:4646/v1/agent/pprof/goroutine\?debug=2\&node_id={} | grep -B2 yamux | grep -A2 -E ' [0-9]{2,} minutes' | grep -B1 Read | grep minutes > {}.out"

From the node_ids, I look them up with nomad node status, ssh in, and kill nomad again.

@tgross (Member) commented Feb 4, 2025

Ok, I'm at the end of my day here but I want to brain-dump what I've got so far...

I don't have a reproduction yet but I have some insights as to why this has been challenging to reproduce. My test rig is a server and a client each in their own network namespace. The structure of the test is to induce traffic from one to the other, and then intervene with either iptables or tc to simulate network splits/congestion.

(I also discovered another strange bug in the Nomad client where, with an iptables split, I was able to induce the client to get stuck on a much shorter "wait" for its blocking RPC than the 5min it should be waiting. We set a shorter wait when retrying and must not be resetting that properly. I'll get an issue open in the Nomad repo for that in the next day or so.)

Inducing the ErrConnectionWriteTimeout instead of some other error has proven surprisingly hard, and I think that's because of buffering at the TCP level. The congestion is on the network, but the write timeout we're worried about is in yamux. So if the underlying TCP connection buffer is big enough for yamux to write the header+body to it (via the msgpack codec), then we return without error to yamux's write. Nothing else (in Nomad) will write on that stream till we get a response. But the body can just be sitting in the TCP buffers at that point.

To give myself the best odds of detecting the bug we care about here, I'm doing the following (a rough sketch of this setup appears after the list):

  • Setting Nomad's yamux keepalive window to 120s.
  • Setting Nomad's yamux write timeout to 1s.
  • Setting the TCP connection's write buffer to 1 byte.
  • Adding logging to yamux so that it reports when it sends a keepalive, so I can time when to interrupt traffic.
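
A rough sketch of that setup (assumed values and a hypothetical helper name, not the actual test rig):

package testrig // hypothetical

import (
	"net"
	"time"

	"github.com/hashicorp/yamux"
)

func newLossSensitiveSession(conn *net.TCPConn) (*yamux.Session, error) {
	// Shrink the kernel-side send buffer so writes back up into yamux's own
	// buffering (and hit its timeout) instead of vanishing into TCP.
	if err := conn.SetWriteBuffer(1); err != nil {
		return nil, err
	}

	cfg := yamux.DefaultConfig()
	cfg.EnableKeepAlive = true
	cfg.KeepAliveInterval = 120 * time.Second    // keepalive fires long after...
	cfg.ConnectionWriteTimeout = 1 * time.Second // ...the write timeout does

	return yamux.Client(conn, cfg)
}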

After a dozen variations I've managed to trigger the ErrConnectionWriteTimeout at session.go#L441 but not the one earlier in the waitForSendErr. Unfortunately that test did not trigger the condition where (*Stream).Read hangs longer than the network interruption. Once connectivity is restored, the read unblocks.

However, the RPC handler is not getting an error after the ErrConnectionWriteTimeout; it's hanging until either the keepalive fails or the netsplit heals! Which is certainly a bug even if it doesn't quite match the behavior described in #142. I think your observation about net/rpc is likely correct, but that would be a bug in the msgpack library and we should fix it there by closing on error, as was done in gob (ref golang/go#7689). But before we go running off to do that, I'd really like to have an exact repro of the behavior you're seeing so we can be confident we've put this bug to rest. I'll take another run at it tomorrow and will report back after that and/or open a PR in the msgpack library.

(Please hold off on opening an issue/PR there, as another team who isn't in the loop on this bug and is in a far-flung TZ will get notifications and that'll cause them a bit of scramble. We're making progress! 😁 )

@lattwood (Contributor, author) commented Feb 4, 2025 via email

@lattwood (Contributor, author) commented Feb 5, 2025

After a dozen variations I've managed to trigger the ErrConnectionWriteTimeout at session.go#L441 but not the one earlier in the waitForSendErr. Unfortunately that test did not trigger the condition where (*Stream).Read hangs longer than the network interruption. Once connectivity is restored, the read unblocks.

Yeah, this is the insidious part. You need to get write timeouts on the connection, and then never write to it again, to replicate the behaviour of the msgpack codec.

sendCh: make(chan *sendReady, 64),

What you need to do (which sounds insane) is fill that buffered channel to reproduce the hanging read, and keep it full for the entire write timeout. It looks like you managed to get the OS buffers full but not yamux's, so almost there!
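
A toy illustration of that mechanism (not yamux code): once a buffered channel of capacity 64 is full and nothing drains it, the next send blocks, and a select with a timer, like yamux's waitForSendErr, gives up after the write timeout while everything already queued stays stuck.

package main

import (
	"fmt"
	"time"
)

func main() {
	sendCh := make(chan int, 64) // same capacity as yamux's sendCh
	for i := 0; i < 64; i++ {
		sendCh <- i // fill the buffer; nothing is draining it
	}

	timer := time.NewTimer(1 * time.Second) // stand-in for ConnectionWriteTimeout
	defer timer.Stop()

	select {
	case sendCh <- 64:
		fmt.Println("queued (cannot happen while nothing drains the channel)")
	case <-timer.C:
		fmt.Println("write timed out; 64 frames are still stuck in the buffer")
	}
}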

I'm going to open up some PRs to fix their codec behaviour 🤦🏻‍♂️ , and I'll create a branch and link it here for the msgpack stuff if you don't get to that before I do. I won't actually create the PR until later due to the aforementioned TZ stuff.

@tgross (Member) commented Feb 5, 2025

I'm going to open up some PRs to fix their codec behaviour 🤦🏻‍♂️ , and I'll create a branch and link it here for the msgpack stuff if you don't get to that before I do. I won't actually create the PR until later due to the aforementioned TZ stuff.

"Their" who? The msgpack codec repo? That's just as much "us" as this one in terms of implementation expertise. But there's another "IP compliance" team that gets pinged on these repos, especially when there's an industry-wide scramble created by the spurious vuln DB filing. Please just hold on, @lattwood, and let us work here.

@lattwood (Contributor, author) commented Feb 5, 2025

@lattwood (Contributor, author) commented Feb 5, 2025

One other realization: it's been stated numerous times that using mTLS with Nomad prevents direct exploitation of this issue.

But starting with 1.10, official guidance is to turn off mTLS when using Workload Identity with Vault.

@tgross (Member) commented Feb 5, 2025

But starting with 1.10, official guidance is to turn off mTLS when using Workload Identity with Vault.

For HTTP, not RPC.

@finnigja commented Feb 5, 2025

👋 from HashiCorp Security. Based on investigations detailed above, we do not believe that Yamux is exposed to a denial of service vulnerability as described or that the proposed changes are necessary to address such a vulnerability.

Yamux requires that dependent applications explicitly close streams on write errors. Coding errors may cause network connectivity issues, but Yamux does not attempt to defend against these errors by design.

We’ll continue to investigate the Nomad-specific symptoms that initially pointed the PR submitter in this direction.

@tgross (Member) commented Feb 5, 2025

I've got a follow-up on where we're at on this, @lattwood, but give me a little bit to write it up and vet it with @schmichael. Will post back here shortly.

@tgross closed this Feb 5, 2025

@tgross (Member) commented Feb 5, 2025

I've been working to build out a reproduction of the behavior you've described and I'm still unsuccessful after 2 days.

With a hacked-up API client that starts a stream (via alloc exec), pauses with it open long enough for me to create network congestion, and then sends an endless stream of base64 garbage, I can regularly trigger the ErrConnectionWriteTimeout as described. But this isn't resulting in the client getting stuck no matter how I tune buffers and timeouts. In all cases, the writer closes either the stream or session (depending on details). Where the reader side doesn't close the stream it's returned to the pool where the window gets resized.

Logging I added in yamux, go-msgpack, and net-rpc-msgpackrpc suggests that I'm still not hitting the code path you've identified where net/rpc swallows the error. Because of code generation in release builds, it gets a little hairy exactly which code path we take to WriteResponse, as both libraries implement it.

That being said, the gob fix in golang/go@cd5b785 also seems reasonably safe. This also lines up with libp2p's yamux fork not auto-closing on errors: they don't use our msgpack RPC, so they don't have this problem! This problem is only likely to be observed in RPC implementations like this, where the actual Stream.Write is buried in a black box like net/rpc. (Also note that libp2p's fork may not be protocol-compatible, so just dropping it in may be a dead end.)

So here's what I'd like to propose:

  • Go ahead and open the gob fix as PRs on the go-msgpack and net-rpc-msgpackrpc libraries. Reference "pending allocations stuck in pending state after adoption by a new deployment" (nomad#23305) and tag @schmichael and me in the description. We'll pop in and let other HashiCorp folks who might jump in know that this is work-in-progress.
  • Build yourself a binary with that fix and deploy it to some chunk of your cluster.
  • If the problem goes away, great! We'll merge the PRs and ship them.
  • If not, let's head back to #23305 and @schmichael and I will help debug from there.
    • Bring data! Ideally we'll get logs and goroutine stacks from both ends of the problem.
    • If it becomes possible for you to provide a minimal reproduction, that'd be even better.
    • Let's try to keep the discussion centralized in #23305 until we have that smoking gun. My gut tells me that the symptoms you're having may extend beyond networking/RPC issues, so it'll help if we can dissect things together in one place.


Successfully merging this pull request may close these issues.

DefaultConfig has dangerous defaults causing hung Reads