Close Session when there's failures to write to a Stream #143
Conversation
I think I see what's going on here: the writing side of a unidirectional Stream (streamBserver in this case) must close its connection for the reading side to be notified it has gone away. Your change makes it so Sessions close themselves on any send error, and it appears to work as designed.
In theory yamux supports retrying Writes on Streams which receive an ErrConnectionWriteTimeout error. However, in practice I highly doubt anyone does this, because the Write that timed out is still in the sendCh buffer and could succeed later. What is a Stream to do? Wait for a Response just in case the send succeeds? Try the Write again if the application-layer protocol is idempotent? I highly doubt any consumers of yamux do anything but Stream.Close() on any error from a Stream operation, and I think that's probably for the best.
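For illustration, a minimal sketch of that close-on-any-error idiom (the helper name and payload argument are made up; only the yamux types and error are real):

```go
package main

import "github.com/hashicorp/yamux"

// writeOrClose treats any error from a yamux Stream operation as fatal for
// that Stream: it closes the Stream instead of retrying the Write.
func writeOrClose(stream *yamux.Stream, payload []byte) error {
	if _, err := stream.Write(payload); err != nil {
		// Whether this is yamux.ErrConnectionWriteTimeout or anything else,
		// closing the Stream is what lets the peer's blocked Read return.
		_ = stream.Close()
		return err
	}
	return nil
}
```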
This PR
I'm still hesitant to accept this PR because I don't think it's fixing a bug as much as being defensive against coding errors (a missing Stream.Close()
in this case). I could not find an example of this coding error in Nomad, but I intend to keep looking. We do take these things very seriously!
Speaking of Nomad, another reason I'm hesitant to close the whole Session whenever a single send times out is that Nomad doesn't always know what kind of net.Conn it's using. It might be a TCP connection, a TLS session, or a yamux Stream. If a Write to any of these connections fails, Nomad will want to explicitly close the connection and create a new one. For example, Go's crypto/tls says:
After a Write has timed out, the TLS state is corrupt and all future writes will return the same error.
So clearly the connection is useless, but do you need to explicitly close it? The examples do, and I have to imagine always explicitly closing a connection that will not be reused is the most robust choice.
So if users should be explicitly closing non-yamux.Stream net.Conns on error, I'd prefer they stick to that idiom with yamux.Streams as well.
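The same idiom sketched generically over a net.Conn (illustrative only; the helper and the caller-supplied redial callback are assumptions, not Nomad code):

```go
package main

import "net"

// writeOrReplace writes payload; on any error it closes the (possibly corrupt)
// connection and dials a replacement via the caller-supplied redial function.
// The caller decides whether retrying the payload is safe for its protocol.
func writeOrReplace(conn net.Conn, redial func() (net.Conn, error), payload []byte) (net.Conn, error) {
	if _, err := conn.Write(payload); err == nil {
		return conn, nil
	}
	// Whether conn is TCP, TLS, or a yamux Stream, a failed or timed-out
	// Write means it should not be reused: close it explicitly.
	conn.Close()
	return redial()
}
```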
And finally I think there are some risks to throwing away a whole Session on a single Stream's error. In your test Stream A is fine because it doesn't Write during the partition. But with your code change, Stream A, and any other Streams on a Session that experiences any kind of error, get thrown away. This could cause a lot of unnecessary disruption. For example in Nomad blocking queries may sit in a Read on an idle connection for minutes. The heartbeat mechanism is what ensures those connections don't miss a disconnect, and as far as I can tell the heartbeat mechanism isn't in question.
Test Variants
To arrive at these conclusions I spent quite a lot of time altering your very useful test case. You can see 2 variants in this gist: https://gist.github.com/schmichael/85bf905c95e0c7b03487b27832e8351c
The first tries to be a minimal variant with 2 changes:
- It closes streams on errors.
- It periodically prints received messages to demonstrate Stream A unpauses/unblocks as expected.
The second variant is much more drastic and tries to represent a more idiomatic approach to using yamux. It was extremely useful to play around with, but I will admit the end state has diverged sufficiently from the original that I may not be exercising the same cases.
Conclusion
Both test variants lead me to believe that in the absence of coding errors yamux and tools that use it like Nomad do not risk DoS in the face of transient network partitions. Either heartbeat failures or explicit Close calls on timed out Streams will eventually free both sides of a connection.
Next week I intend to audit Nomad's use of yamux Streams to ensure we are explicitly closing Streams on errors.
In addition, the Nomad team intends to do more testing of full Client and Server agents that communicate through a proxy under our control so we can simulate various flaky conditions.
THANK YOU!
We take security and reliability of yamux and Nomad seriously and greatly appreciate your contributions! For potential security vulnerabilities, we prefer to hear about them directly first, as we lay out here: https://www.hashicorp.com/trust/security/vulnerability-management
That triggers a much more robust internal process than "schmichael promises to check out that suspicious issue someday (if he doesn't get distracted first)." 😬
I also know you've been battling hashicorp/nomad#23305 for a long time, and hopefully this latest attempt will help us find the root cause!
Let me know what you think, and expect to hear back from us on this PR or the original issue soon.
```go
// Start A writer (server side)
go func() {
	wg.Add(1)
	defer wg.Done()
	streamAserver, _ := serverSes.Open()
```
Usually when only one peer Opens streams and the other peer only Accepts streams, the Opener/initiator is considered the "client" and the Accepter is considered the "server." Not a functional problem though.
```go
streamBserver, _ := serverSes.Open()
err := monotonicWriter(t, streamBserver, "B", nil)
if err != nil && !errors.Is(err, ErrConnectionWriteTimeout) {
	failNow(errCh, "[B-writer-server] received unexpected error", err)
}
```
You're not calling streamBserver.Close(), which I would expect to always be done (usually in a defer after a successful Open) regardless of whether Stream.Write() last returned an error or not.
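For illustration, a sketch of the Open-then-defer pattern being suggested, reusing the test's own helpers (this is not the PR's code):

```go
streamBserver, err := serverSes.Open()
if err != nil {
	failNow(errCh, "[B-writer-server] open failed", err)
	return
}
// Always release the stream, even if a later Write times out or errors.
defer streamBserver.Close()

if err := monotonicWriter(t, streamBserver, "B", nil); err != nil && !errors.Is(err, ErrConnectionWriteTimeout) {
	failNow(errCh, "[B-writer-server] received unexpected error", err)
}
```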
```go
label string
val   int
```
Since these are unexported, they are never being json encoded/decoded.
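For example, a sketch of what the fields would need to look like for encoding/json to pick them up (illustrative; the tag names are assumptions):

```go
// Exported fields (optionally with tags) are required for encoding/json to
// marshal/unmarshal them; unexported ones are silently skipped.
type message struct {
	Label string `json:"label"`
	Val   int    `json:"val"`
}
```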
```go
go func() {
	wg.Add(1)
	defer wg.Done()
	streamAclient, _ := clientSes.Accept()
```
Because you are Accept()ing Streams concurrently in two distinct goroutines, streamAserver may be accepted by streamBclient and vice versa.
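One illustrative way to remove the ambiguity (not part of the PR, written in the test's in-package style and reusing its failNow/errCh helpers, with "io" assumed to be imported): have each opener write a one-byte label right after Open, and accept both streams from a single goroutine.

```go
// Illustrative: the opener writes a single label byte ('A' or 'B') right
// after Open; the acceptor reads it to tell the two streams apart.
streams := make(map[byte]*Stream)
for i := 0; i < 2; i++ {
	st, err := clientSes.AcceptStream()
	if err != nil {
		failNow(errCh, "[client] accept failed", err)
		return
	}
	lbl := make([]byte, 1)
	if _, err := io.ReadFull(st, lbl); err != nil {
		failNow(errCh, "[client] reading stream label failed", err)
		return
	}
	streams[lbl[0]] = st
}
streamAclient, streamBclient := streams['A'], streams['B']
```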
Will adjust the PR based on your feedback when I'm back from PTO.
I realized the security implications after filing the issue, and this was the response I was given after asking about the status, a week after I got an initial response, which seemed to indicate they either didn't see the issue, or (mistakenly) thought that mTLS mitigated it:
```
Thanks again for reaching out and all the details in your PR and GitHub
issue.
This issue was also identified in the past but due to mitigating control
(use of mTLS in clusters following HashiCorp guidance / good practices &
the Nomad security model
<https://developer.hashicorp.com/nomad/docs/concepts/security>), this has
been handled as a longer-lived security / defense-in-depth improvement
rather than a vulnerability that we had in our roadmap.
I'll revive the topic with the Nomad team and have your PR reviewed and
hopefully merged in a future release.
Thanks a lot for the contribution, it's greatly appreciated.
```
So in this case I don't think the more in-depth review was triggered via that mechanism. In fact, it was the presence of mTLS that made me realize the missing write was never making it onto the wire, and I started looking into the issue from the write perspective instead of the read perspective.
One could argue that libraries should be defensive (the default keepalive/ping interval should be adjusted to below the write timeout), and the case is made by the lack of write-timeout handling by HashiCorp when using HashiCorp's own connection-multiplexing library. If HashiCorp couldn't get it right initially, it's not reasonable to expect others to get it right either. I agree with "in the absence of coding errors", which means the related Go vulndb report will need to be amended, or another one created, so folks using Nomad or other tools that aren't properly handling write timeouts (causing CWE-1088) are notified. I empathize with the fact that this likely affects every Nomad version that has been released (and potentially Terraform; I can think of some edge cases that would impact Terraform plugins if you had an overly aggressive EDR installed), making for a rather widely scoped notification.
My initial thoughts on "unnecessary disruption" may be a bit naive, but this issue has been plaguing my team since at least Q4 2023, though it took us until the middle of last year to gather enough information to file an issue. It caused constant deployment failures that hurt developer velocity and required SRE involvement every time: going around restarting Nomad nodes to address this bug, duplicate containers running on hosts with constraints that should prevent that, and outright failure of nodes to start their assigned containers (system jobs as well as regular ones). Compared to that context, there really shouldn't be that much overhead to establishing fresh connections in the event of write timeouts.
I pulled the thread on this after writing the above and didn't want to rewrite it, but some of it is still relevant, so take that into consideration. Nomad uses net/rpc with a custom msgpack codec. As far as I can tell, this means all Write errors encountered by Nomad when communicating over yamux Streams are swallowed by the RPC layer, and the Stream is never closed. It looks like handling this issue would require changing the behaviour of the standard library's net/rpc. Another option is to make the custom msgpack encoder close the Stream in the event of write errors, like the gob encoder does.
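As a rough sketch of the shape of that second option (a generic wrapper around an rpc.ServerCodec rather than the actual msgpack codec change; the type name is made up):

```go
package main

import "net/rpc"

// closeOnWriteErr wraps an rpc.ServerCodec and closes it the first time
// WriteResponse fails, mirroring what net/rpc's gob server codec does.
type closeOnWriteErr struct {
	rpc.ServerCodec
}

func (c closeOnWriteErr) WriteResponse(r *rpc.Response, body any) error {
	if err := c.ServerCodec.WriteResponse(r, body); err != nil {
		// Tear down the underlying Stream so the peer's Read unblocks
		// instead of hanging until a keepalive failure.
		_ = c.ServerCodec.Close()
		return err
	}
	return nil
}
```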
Go stdlib issue opened for net/rpc.
@lattwood just wanted to give you a heads up that I'm working on building out a reproduction of the Nomad issue in hopes of getting some empirical insights into the underlying problem here. I'll report back on that soon.
@tgross fantastic, thank you (and good luck). We have a control plane in eu-central-1, agents in Southeast Asia, and regularly see large raft update messages.
That's the size of the JSON for the job that is impacted the worst, which I'm including to underscore the sheer size of the writes going between servers and agents.
Full cluster job status: we have 3 masters and 2533 nodes. Here's the number of nodes that have been impacted by this, inferred from having a pending allocation due to this bug; it does not include nodes that have the hung read but haven't had an allocation updated. It's a partial screenshot from the Job summary page in the UI.
This is how I reconcile the broken nodes with pending allocations. This doesn't capture nodes that have a hung read and haven't had an allocation change state.
This is how I get the node_id's for everything with a hung read in yamux. I get goroutine stacks via the pprof endpoint, then use overlapping grep commands to find any hosts that have been blocked in yamux for at least 10 minutes.
From the node_id's, I look them up w/ …
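A rough Go equivalent of that overlapping-grep approach (illustrative only; it assumes the standard net/http/pprof endpoint is exposed and uses the 10-minute threshold mentioned above):

```go
package main

import (
	"io"
	"net/http"
	"regexp"
	"strconv"
	"strings"
)

var blockedFor = regexp.MustCompile(`\[[^\]]*?(\d+) minutes\]`)

// hasHungYamuxRead fetches a full goroutine dump from a node's pprof endpoint
// and reports whether any goroutine has been parked inside yamux for 10+ minutes.
func hasHungYamuxRead(addr string) (bool, error) {
	resp, err := http.Get("http://" + addr + "/debug/pprof/goroutine?debug=2")
	if err != nil {
		return false, err
	}
	defer resp.Body.Close()
	dump, err := io.ReadAll(resp.Body)
	if err != nil {
		return false, err
	}
	// debug=2 output is one block per goroutine, separated by blank lines; a
	// long-blocked goroutine header looks like "goroutine 42 [IO wait, 15 minutes]:".
	for _, block := range strings.Split(string(dump), "\n\n") {
		if !strings.Contains(block, "github.com/hashicorp/yamux") {
			continue
		}
		if m := blockedFor.FindStringSubmatch(block); m != nil {
			if mins, _ := strconv.Atoi(m[1]); mins >= 10 {
				return true, nil
			}
		}
	}
	return false, nil
}
```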
Ok, I'm at the end of my day here but I want to brain-dump what I've got so far...
I don't have a reproduction yet but I have some insights as to why this has been challenging to reproduce. My test rig is a server and a client each in their own network namespace. The structure of the test is to induce traffic from one to the other, and then intervene with either iptables or tc (https://www.man7.org/linux/man-pages/man8/tc.8.html) to simulate network splits/congestion.
(I also discovered another strange bug in the Nomad client where with an iptables split I was able to induce the client to get stuck on a much shorter "wait" for its blocking RPC than the 5min it should be waiting. We set a shorter wait when retrying and must not be resetting that properly. I'll get an issue open in the Nomad repo for that in the next day or so.)
Inducing the ErrConnectionWriteTimeout instead of some other error has proven surprisingly hard, and I think that's because of buffering at the TCP level. The congestion is on the network, but the write timeout we're worried about is in yamux. So if the underlying TCP connection buffer is big enough for yamux to write the header+body to it (via the msgpack codec), then we return without error to yamux's write. Nothing else (in Nomad) will write on that stream till we get a response. But the body can just be sitting in the TCP buffers at that point.
To give myself the best odds of detecting the bug we care about here I'm:
- Setting Nomad's yamux keepalive window to 120s.
- Setting Nomad's yamux write timeout to 1s.
- Setting the TCP connection's write buffer to 1 byte.
- Adding logging to yamux so that it reports sending keepalives, so I can time when to interrupt traffic.
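For reference, a sketch of what the yamux side of those settings looks like in code (illustrative only; the function name is made up, conn is assumed to be an already-established connection, and the TCP-buffer and keepalive-logging tweaks live outside yamux):

```go
package main

import (
	"log"
	"net"
	"time"

	"github.com/hashicorp/yamux"
)

// newTestSession builds a client-side yamux session tuned so that a network
// interruption is more likely to surface ErrConnectionWriteTimeout before a
// keepalive failure. conn is an already-established net.Conn.
func newTestSession(conn net.Conn) *yamux.Session {
	cfg := yamux.DefaultConfig()
	cfg.EnableKeepAlive = true
	cfg.KeepAliveInterval = 120 * time.Second    // long keepalive so it doesn't mask the bug
	cfg.ConnectionWriteTimeout = 1 * time.Second // short write timeout so it fires first
	sess, err := yamux.Client(conn, cfg)
	if err != nil {
		log.Fatalf("yamux client: %v", err)
	}
	return sess
}
```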
After a dozen variations I've managed to trigger the ErrConnectionWriteTimeout at session.go#L441 (https://github.com/hashicorp/yamux/blob/v0.1.2/session.go#L441) but not the one earlier in waitForSendErr. Unfortunately, that test did not trigger the condition where (*Stream).Read hangs longer than the network interruption. Once connectivity is restored, the read unblocks.
However, the RPC handler is not getting an error after the ErrConnectionWriteTimeout; it's hanging until either the keepalive or the netsplit heals! Which is certainly a bug even if it doesn't quite match the behavior described in #142.
I think your observation about net/rpc is likely correct, but that would be a bug in the msgpack library and we should fix it there by closing on error as was done in gob (ref golang/go#7689). But before we go running off to do that, I'd really like to have an exact repro of the behavior you're seeing so we can be confident we've put this bug to rest. I'll take another run at it tomorrow and will report back after that and/or open a PR in the msgpack library.
(Please hold off on opening an issue/PR there, as another team who isn't in the loop on this bug and is in a far-flung TZ will get notifications and that'll cause them a bit of scramble. We're making progress! 😁 )
Awesome. If you can find yourself on the community "halihax" tomorrow, I
can also grab arbitrary goroutine stacks, poke around hung agents with
delve, etc.
Yeah, this is the insidious part. You need to get Write Timeouts on the connection, and then never write to it again, to replicate the behaviour of the msgpack codec. (Line 109 in 9e980a5: the buffered sendCh.)
What you need to do (which sounds insane) is fill that buffered channel to reproduce the hanging read, and keep it full for the entire Write timeout. It looks like you managed to get the OS buffers full but not yamux's, so almost there! I'm going to open up some PRs to fix their codec behaviour 🤦🏻‍♂️, and I'll create a branch and link it here for the msgpack stuff if you don't get to that before I do. I won't actually create the PR until later due to the aforementioned TZ stuff.
"Their" who? The msgpack codec repo? That's just as much "us" as this one in terms of implementation expertise. But there's another "IP compliance" team that gets pinged on these repos, especially when there's an industry-wide scramble created by the spurious vuln DB filing. Please just hold on, @lattwood, and let us work here.
@tgross ack. If you're curious: https://chatgpt.com/share/67a38308-f618-800e-b1ea-6df285d2f2cf
One other realization: it's been stated numerous times that using mTLS with Nomad prevents direct exploitation of this issue. But starting with 1.10, official guidance is to turn off mTLS when using Workload Identity with Vault.
For HTTP, not RPC.
👋 from HashiCorp Security. Based on investigations detailed above, we do not believe that Yamux is exposed to a denial of service vulnerability as described or that the proposed changes are necessary to address such a vulnerability. Yamux requires that dependent applications explicitly close streams on write errors. Coding errors may cause network connectivity issues, but Yamux does not attempt to defend against these errors by design. We'll continue to investigate the Nomad-specific symptoms that initially pointed the PR submitter in this direction.
I've got a follow-up on where we're at on this @lattwood, but give me a little bit to write it up and vet it with @schmichael. Will post back here shortly.
I've been working to build out a reproduction of the behavior you've described and I'm still unsuccessful after 2 days. With a hacked-up API client that starts a stream (via …) … Added logging in … That being said, the … So here's what I'd like to propose: …
This pull request introduces a new test to detect a deadlock bug and includes modifications to handle connection write timeouts more robustly. The changes span multiple files, with the primary focus on ensuring that sessions are closed properly when a write timeout occurs.
New Test for Deadlock Detection:
- deadlock_test.go: Added a comprehensive test, TestTimeoutParallel, to simulate and detect a deadlock scenario in yamux streams. The test sets up two streams, stalls one to fill the send channel, and verifies that the session closes correctly on write timeout.
Improvements in Connection Write Timeout Handling:
- session.go: Enhanced the waitForSendErr and sendNoWait functions to ensure that the session is properly closed when a connection write timeout or session shutdown occurs. The changes include calling s.exitErr(err) if an error occurs.
Test Adjustments:
- session_test.go: Modified TestSession_PingOfDeath to handle the expected ErrSessionShutdown error when a write times out, ensuring the test accounts for the new session shutdown behaviour.
Closes #142.
I humbly request a callout in changelogs when this bugfix ships :)