
Hanging RPC caused agent's state lock to be held for hours #8504

Closed
a-robinson opened this issue Aug 12, 2020 · 1 comment · Fixed by #11500
Labels
theme/internals (Serf, Raft, SWIM, Lifeguard, Anti-Entropy, locking topics) · type/bug (Feature does not function as expected)

Comments

a-robinson commented Aug 12, 2020

Overview of the Issue

We had a Consul agent get stuck today such that it wouldn't respond to any HTTP API requests. Going by the goroutine dump we got from the process, a goroutine was holding the State lock while waiting on syncCheck, which was waiting on an RPC to the server. This one goroutine holding the lock was causing more than 400 others to block attempting to grab the RWMutex in read mode. The stuck goroutine was here, and as you can see had been blocked for more than 7 hours:

goroutine 32 [select, 445 minutes]:
github.com/hashicorp/yamux.(*Stream).Read(0xc00071a780, 0xc000a78000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
        /cfsetup_build/consul16/tmp/build/pkg/mod/github.com/hashicorp/yamux@v0.0.0-20181012175058-2f1d1f20f75d/stream.go:133 +0x2a4
bufio.(*Reader).fill(0xc000216c60)
        /usr/local/go/src/bufio/bufio.go:100 +0x103
bufio.(*Reader).ReadByte(0xc000216c60, 0x457, 0xc001496000, 0xc00095fda0)
        /usr/local/go/src/bufio/bufio.go:252 +0x39
github.com/hashicorp/go-msgpack/codec.(*ioDecReader).readn1(0xc0007f00f0, 0x42ea61)
        /cfsetup_build/consul16/tmp/build/pkg/mod/github.com/hashicorp/go-msgpack@v0.5.5/codec/decode.go:90 +0x3d
github.com/hashicorp/go-msgpack/codec.(*msgpackDecDriver).initReadNext(0xc0012f40a0)
        /cfsetup_build/consul16/tmp/build/pkg/mod/github.com/hashicorp/go-msgpack@v0.5.5/codec/msgpack.go:540 +0x43
github.com/hashicorp/go-msgpack/codec.(*Decoder).decode(0xc000216d80, 0x2549960, 0xc0011694d0)
        /cfsetup_build/consul16/tmp/build/pkg/mod/github.com/hashicorp/go-msgpack@v0.5.5/codec/decode.go:635 +0x4c
github.com/hashicorp/go-msgpack/codec.(*Decoder).Decode(0xc000216d80, 0x2549960, 0xc0011694d0, 0x0, 0x0)
        /cfsetup_build/consul16/tmp/build/pkg/mod/github.com/hashicorp/go-msgpack@v0.5.5/codec/decode.go:630 +0x7b
github.com/hashicorp/net-rpc-msgpackrpc.(*MsgpackCodec).read(0xc0015f8100, 0x2549960, 0xc0011694d0, 0xc0015f8101, 0xc0011694d0)
        /cfsetup_build/consul16/tmp/build/pkg/mod/github.com/hashicorp/net-rpc-msgpackrpc@v0.0.0-20151116020338-a14192a58a69/codec.go:121 +0x55
github.com/hashicorp/net-rpc-msgpackrpc.(*MsgpackCodec).ReadResponseHeader(0xc0015f8100, 0xc0011694d0, 0x2987840, 0xc00028c000)
        /cfsetup_build/consul16/tmp/build/pkg/mod/github.com/hashicorp/net-rpc-msgpackrpc@v0.0.0-20151116020338-a14192a58a69/codec.go:74 +0x41
github.com/hashicorp/net-rpc-msgpackrpc.CallWithCodec(0x32687c0, 0xc0015f8100, 0x2d72640, 0x10, 0x2987840, 0xc00028c000, 0x254f720, 0x4af5458, 0xc0012f4100, 0x0)
        /cfsetup_build/consul16/tmp/build/pkg/mod/github.com/hashicorp/net-rpc-msgpackrpc@v0.0.0-20151116020338-a14192a58a69/client.go:29 +0xee
github.com/hashicorp/consul/agent/pool.(*ConnPool).rpc(0xc0002b1c70, 0xc000449c31, 0x5, 0x322f680, 0xc000768540, 0x2, 0x2d72640, 0x10, 0x1, 0x2987840, ...)
        /cfsetup_build/consul16/tmp/build/src/github.com/hashicorp/consul/agent/pool/pool.go:467 +0x1d0
github.com/hashicorp/consul/agent/pool.(*ConnPool).RPC(0xc0002b1c70, 0xc000449c31, 0x5, 0x322f680, 0xc000768540, 0x2, 0x2d72640, 0x10, 0x203001, 0x2987840, ...)
        /cfsetup_build/consul16/tmp/build/src/github.com/hashicorp/consul/agent/pool/pool.go:431 +0xfa
github.com/hashicorp/consul/agent/consul.(*Client).RPC(0xc0002b1ce0, 0x2d72640, 0x10, 0x2987840, 0xc00028c000, 0x254f720, 0x4af5458, 0x0, 0x0)
        /cfsetup_build/consul16/tmp/build/src/github.com/hashicorp/consul/agent/consul/client.go:307 +0x1f1
github.com/hashicorp/consul/agent/local.(*State).syncCheck(0xc0002f41c0, 0xc00049ede0, 0x15, 0xc00108fde8, 0x1)
        /cfsetup_build/consul16/tmp/build/src/github.com/hashicorp/consul/agent/local/state.go:1186 +0x269
github.com/hashicorp/consul/agent/local.(*State).SyncChanges(0xc0002f41c0, 0x0, 0x0)
        /cfsetup_build/consul16/tmp/build/src/github.com/hashicorp/consul/agent/local/state.go:1003 +0x33c
github.com/hashicorp/consul/agent/ae.(*StateSyncer).nextFSMState(0xc00023c980, 0x2d671c6, 0xb, 0x2d671c6, 0xb)
        /cfsetup_build/consul16/tmp/build/src/github.com/hashicorp/consul/agent/ae/ae.go:197 +0x2c8
github.com/hashicorp/consul/agent/ae.(*StateSyncer).runFSM(0xc00023c980, 0x2d60f2a, 0x8, 0xc00108ffb8)
        /cfsetup_build/consul16/tmp/build/src/github.com/hashicorp/consul/agent/ae/ae.go:153 +0x3a
github.com/hashicorp/consul/agent/ae.(*StateSyncer).Run(0xc00023c980)
        /cfsetup_build/consul16/tmp/build/src/github.com/hashicorp/consul/agent/ae/ae.go:147 +0x6e
created by github.com/hashicorp/consul/agent.(*Agent).StartSync
        /cfsetup_build/consul16/tmp/build/src/github.com/hashicorp/consul/agent/agent.go:1858 +0x43
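
For illustration, here is a minimal Go sketch of the pattern visible in the trace above (the type and method names are invented for this example, not Consul's actual code): one goroutine takes the write lock and then blocks indefinitely on a connection read that has no deadline, and every reader that needs the lock piles up behind it.

package main

import (
	"fmt"
	"net"
	"sync"
	"time"
)

// State mimics the shape of the problem: a mutex that guards local agent
// state but is also held across a blocking network call. These names are
// illustrative only.
type State struct {
	mu     sync.RWMutex
	checks map[string]string
}

// SyncChanges holds the write lock while waiting on the network. With no
// read deadline on the connection, a silent peer makes this block forever,
// and the lock is never released.
func (s *State) SyncChanges(conn net.Conn) error {
	s.mu.Lock()
	defer s.mu.Unlock()

	buf := make([]byte, 1)
	_, err := conn.Read(buf) // blocks indefinitely without a deadline
	return err
}

// Checks is what an HTTP handler would call; it blocks in RLock() for as
// long as SyncChanges holds the write lock.
func (s *State) Checks() map[string]string {
	s.mu.RLock()
	defer s.mu.RUnlock()
	return s.checks
}

func main() {
	// A listener that accepts but never responds, standing in for a hung RPC peer.
	ln, _ := net.Listen("tcp", "127.0.0.1:0")
	go func() {
		if _, err := ln.Accept(); err == nil {
			select {} // hold the connection open without ever writing
		}
	}()

	conn, _ := net.Dial("tcp", ln.Addr().String())
	s := &State{checks: map[string]string{}}

	go s.SyncChanges(conn) // holds the write lock while the read hangs

	time.Sleep(100 * time.Millisecond)
	fmt.Println("reader about to block on the state lock...")
	_ = s.Checks() // never returns: this program hangs here by design
}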

As for why the RPC was hanging for so long, I don't know exactly. The consul servers were fine, and other agents in the cluster were behaving normally. The notable thing that happened on the affected agent was memory pressure on the machine that may have caused memory allocation failures. According to our collection of metrics from the agent, the agent had a GC pause of 13.8 seconds around this time. The memory pressure went away, but from this point on the agent was effectively stuck.

Reproduction Steps

I don't know how to reproduce the issue. I've only seen it happen once. It's plausible that you could reproduce it with enough memory pressure on a machine, but I don't know the odds of hitting it again.

Consul info for both Client and Server

Both client and server are running v1.6.3.

Operating system and Environment details

Debian Stretch

Log Fragments

Full goroutine dump from sending the process SIGQUIT: https://gist.github.com/a-robinson/ec5425d98cc8e0bcac9fc6298cf7a9a3

This may be related to #6616, but the big difference here is that the RPC attempt never timed out.

dnephin commented Feb 7, 2022

Thank you for creating this issue! Sorry I did not comment when you first opened it. #6616 is related to this issue as well, and there is some discussion in that PR about it.

The reason this blocks for so long is that we don't have read timeouts on these connections!

We can see in the stack trace that it's blocked reading from the connection.
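
As a rough sketch of what a fix along those lines could look like (a plain net.Conn example, not necessarily what #11500 actually does), the read can be bounded with a deadline so a silent server surfaces as a timeout error instead of an indefinite wait:

package main

import (
	"bufio"
	"fmt"
	"net"
	"time"
)

// readWithDeadline sets a read deadline on the connection before blocking
// on it, so a peer that never answers produces a timeout error rather than
// hanging forever. Generic sketch only, not Consul's actual change.
func readWithDeadline(conn net.Conn, timeout time.Duration) (byte, error) {
	if err := conn.SetReadDeadline(time.Now().Add(timeout)); err != nil {
		return 0, err
	}
	b, err := bufio.NewReader(conn).ReadByte()
	if err != nil {
		// A net.Error with Timeout() == true means the deadline fired.
		if ne, ok := err.(net.Error); ok && ne.Timeout() {
			return 0, fmt.Errorf("rpc read timed out after %s: %w", timeout, err)
		}
		return 0, err
	}
	return b, nil
}

func main() {
	// A listener that accepts but never writes, to show the deadline firing.
	ln, _ := net.Listen("tcp", "127.0.0.1:0")
	go func() {
		if _, err := ln.Accept(); err == nil {
			select {} // hold the connection open without responding
		}
	}()

	conn, _ := net.Dial("tcp", ln.Addr().String())
	_, err := readWithDeadline(conn, 2*time.Second)
	fmt.Println(err) // reports a timeout instead of blocking for hours
}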
