Hanging RPC caused agent's state lock to be held for hours #8504
Labels
theme/internals
Serf, Raft, SWIM, Lifeguard, Anti-Entropy, locking topics
type/bug
Feature does not function as expected
Overview of the Issue
We had a Consul agent get stuck today such that it wouldn't respond to any HTTP API requests. Going by the goroutine dump we got from the process, a goroutine was holding the `State` lock while waiting on `syncCheck`, which was waiting on an RPC to the server. This one goroutine holding the lock was causing more than 400 others to block attempting to grab the `RWMutex` in read mode. The stuck goroutine had been blocked for more than 7 hours; its stack trace is in the goroutine dump linked under Log Fragments.

As for why the RPC was hanging for so long, I don't know exactly. The Consul servers were fine, and other agents in the cluster were behaving normally. The notable thing that happened on the affected agent was memory pressure on the machine that may have caused memory allocation failures. According to the metrics we collect from the agent, it had a GC pause of 13.8 seconds around this time. The memory pressure went away, but from this point on the agent was effectively stuck.
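To make the failure mode concrete, here is a minimal sketch of the locking pattern described above. It is not Consul's code: `localState`, `syncHoldingLock`, `syncWithSnapshot`, `readerBlocked`, and `probe` are hypothetical names chosen for illustration. The first method holds the write lock for the whole duration of a remote call, so readers stall for as long as the call hangs. The second copies what it needs under the lock and releases it before the call, which is one possible way to avoid the stall.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// localState is a hypothetical stand-in for the agent's state, not Consul's type.
type localState struct {
	mu     sync.RWMutex
	checks map[string]string
}

// syncHoldingLock mimics the problematic pattern: the write lock is held
// for the full duration of the (possibly hanging) remote call.
func (s *localState) syncHoldingLock(rpc func(map[string]string)) {
	s.mu.Lock()
	defer s.mu.Unlock()
	rpc(s.checks)
}

// syncWithSnapshot copies the data under the read lock, releases the lock,
// and only then makes the remote call.
func (s *localState) syncWithSnapshot(rpc func(map[string]string)) {
	s.mu.RLock()
	snapshot := make(map[string]string, len(s.checks))
	for k, v := range s.checks {
		snapshot[k] = v
	}
	s.mu.RUnlock()
	rpc(snapshot)
}

// readerBlocked reports whether a reader fails to take the read lock within wait.
func (s *localState) readerBlocked(wait time.Duration) bool {
	acquired := make(chan struct{})
	go func() {
		s.mu.RLock()
		s.mu.RUnlock()
		close(acquired)
	}()
	select {
	case <-acquired:
		return false
	case <-time.After(wait):
		return true
	}
}

// probe runs syncFn with an RPC that hangs until released and reports
// whether a concurrent reader is blocked while the RPC is in flight.
func probe(syncFn func(*localState, func(map[string]string))) bool {
	s := &localState{checks: map[string]string{"web": "passing"}}
	started := make(chan struct{})
	release := make(chan struct{})
	done := make(chan struct{})
	go func() {
		defer close(done)
		syncFn(s, func(map[string]string) {
			close(started) // the simulated RPC is now in flight
			<-release      // hang until the probe lets it finish
		})
	}()
	<-started
	blocked := s.readerBlocked(100 * time.Millisecond)
	close(release)
	<-done
	return blocked
}

func main() {
	fmt.Println("reader blocked, lock held across RPC:", probe((*localState).syncHoldingLock))
	fmt.Println("reader blocked, snapshot before RPC:", probe((*localState).syncWithSnapshot))
}
```

In this sketch the first probe reports a blocked reader and the second does not. The snapshot approach trades that for the possibility that the data sent is slightly stale by the time the call completes, so whether it suits the real agent code is a separate question.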
Reproduction Steps
I don't know how to reproduce the issue. I've only seen it happen once. It's plausible that you could reproduce it with enough memory pressure on a machine, but I don't know the odds of hitting it again.
Consul info for both Client and Server
Both client and server are running v1.6.3.
Operating system and Environment details
Debian Stretch
Log Fragments
Full goroutine dump from sending the process SIGQUIT: https://gist.github.com/a-robinson/ec5425d98cc8e0bcac9fc6298cf7a9a3
This may be related to #6616, but the big difference here is that the RPC attempt never timed out.