Hanging RPC caused agent's state lock to be held for hours #8504

a-robinson · 2020-08-12T21:43:02Z

Overview of the Issue

We had a Consul agent get stuck today such that it wouldn't respond to any HTTP API requests. Going by the goroutine dump we got from the process, a goroutine was holding the State lock while waiting on syncCheck, which was waiting on an RPC to the server. This one goroutine holding the lock was causing more than 400 others to block attempting to grab the RWMutex in read mode. The stuck goroutine was here, and as you can see had been blocked for more than 7 hours:

goroutine 32 [select, 445 minutes]:
github.com/hashicorp/yamux.(*Stream).Read(0xc00071a780, 0xc000a78000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
        /cfsetup_build/consul16/tmp/build/pkg/mod/github.com/hashicorp/yamux@v0.0.0-20181012175058-2f1d1f20f75d/stream.go:133 +0x2a4
bufio.(*Reader).fill(0xc000216c60)
        /usr/local/go/src/bufio/bufio.go:100 +0x103
bufio.(*Reader).ReadByte(0xc000216c60, 0x457, 0xc001496000, 0xc00095fda0)
        /usr/local/go/src/bufio/bufio.go:252 +0x39
github.com/hashicorp/go-msgpack/codec.(*ioDecReader).readn1(0xc0007f00f0, 0x42ea61)
        /cfsetup_build/consul16/tmp/build/pkg/mod/github.com/hashicorp/go-msgpack@v0.5.5/codec/decode.go:90 +0x3d
github.com/hashicorp/go-msgpack/codec.(*msgpackDecDriver).initReadNext(0xc0012f40a0)
        /cfsetup_build/consul16/tmp/build/pkg/mod/github.com/hashicorp/go-msgpack@v0.5.5/codec/msgpack.go:540 +0x43
github.com/hashicorp/go-msgpack/codec.(*Decoder).decode(0xc000216d80, 0x2549960, 0xc0011694d0)
        /cfsetup_build/consul16/tmp/build/pkg/mod/github.com/hashicorp/go-msgpack@v0.5.5/codec/decode.go:635 +0x4c
github.com/hashicorp/go-msgpack/codec.(*Decoder).Decode(0xc000216d80, 0x2549960, 0xc0011694d0, 0x0, 0x0)
        /cfsetup_build/consul16/tmp/build/pkg/mod/github.com/hashicorp/go-msgpack@v0.5.5/codec/decode.go:630 +0x7b
github.com/hashicorp/net-rpc-msgpackrpc.(*MsgpackCodec).read(0xc0015f8100, 0x2549960, 0xc0011694d0, 0xc0015f8101, 0xc0011694d0)
        /cfsetup_build/consul16/tmp/build/pkg/mod/github.com/hashicorp/net-rpc-msgpackrpc@v0.0.0-20151116020338-a14192a58a69/codec.go:121 +0x55
github.com/hashicorp/net-rpc-msgpackrpc.(*MsgpackCodec).ReadResponseHeader(0xc0015f8100, 0xc0011694d0, 0x2987840, 0xc00028c000)
        /cfsetup_build/consul16/tmp/build/pkg/mod/github.com/hashicorp/net-rpc-msgpackrpc@v0.0.0-20151116020338-a14192a58a69/codec.go:74 +0x41
github.com/hashicorp/net-rpc-msgpackrpc.CallWithCodec(0x32687c0, 0xc0015f8100, 0x2d72640, 0x10, 0x2987840, 0xc00028c000, 0x254f720, 0x4af5458, 0xc0012f4100, 0x0)
        /cfsetup_build/consul16/tmp/build/pkg/mod/github.com/hashicorp/net-rpc-msgpackrpc@v0.0.0-20151116020338-a14192a58a69/client.go:29 +0xee
github.com/hashicorp/consul/agent/pool.(*ConnPool).rpc(0xc0002b1c70, 0xc000449c31, 0x5, 0x322f680, 0xc000768540, 0x2, 0x2d72640, 0x10, 0x1, 0x2987840, ...)
        /cfsetup_build/consul16/tmp/build/src/github.com/hashicorp/consul/agent/pool/pool.go:467 +0x1d0
github.com/hashicorp/consul/agent/pool.(*ConnPool).RPC(0xc0002b1c70, 0xc000449c31, 0x5, 0x322f680, 0xc000768540, 0x2, 0x2d72640, 0x10, 0x203001, 0x2987840, ...)
        /cfsetup_build/consul16/tmp/build/src/github.com/hashicorp/consul/agent/pool/pool.go:431 +0xfa
github.com/hashicorp/consul/agent/consul.(*Client).RPC(0xc0002b1ce0, 0x2d72640, 0x10, 0x2987840, 0xc00028c000, 0x254f720, 0x4af5458, 0x0, 0x0)
        /cfsetup_build/consul16/tmp/build/src/github.com/hashicorp/consul/agent/consul/client.go:307 +0x1f1
github.com/hashicorp/consul/agent/local.(*State).syncCheck(0xc0002f41c0, 0xc00049ede0, 0x15, 0xc00108fde8, 0x1)
        /cfsetup_build/consul16/tmp/build/src/github.com/hashicorp/consul/agent/local/state.go:1186 +0x269
github.com/hashicorp/consul/agent/local.(*State).SyncChanges(0xc0002f41c0, 0x0, 0x0)
        /cfsetup_build/consul16/tmp/build/src/github.com/hashicorp/consul/agent/local/state.go:1003 +0x33c
github.com/hashicorp/consul/agent/ae.(*StateSyncer).nextFSMState(0xc00023c980, 0x2d671c6, 0xb, 0x2d671c6, 0xb)
        /cfsetup_build/consul16/tmp/build/src/github.com/hashicorp/consul/agent/ae/ae.go:197 +0x2c8
github.com/hashicorp/consul/agent/ae.(*StateSyncer).runFSM(0xc00023c980, 0x2d60f2a, 0x8, 0xc00108ffb8)
        /cfsetup_build/consul16/tmp/build/src/github.com/hashicorp/consul/agent/ae/ae.go:153 +0x3a
github.com/hashicorp/consul/agent/ae.(*StateSyncer).Run(0xc00023c980)
        /cfsetup_build/consul16/tmp/build/src/github.com/hashicorp/consul/agent/ae/ae.go:147 +0x6e
created by github.com/hashicorp/consul/agent.(*Agent).StartSync
        /cfsetup_build/consul16/tmp/build/src/github.com/hashicorp/consul/agent/agent.go:1858 +0x43

As for why the RPC was hanging for so long, I don't know exactly. The consul servers were fine, and other agents in the cluster were behaving normally. The notable thing that happened on the affected agent was memory pressure on the machine that may have caused memory allocation failures. According to our collection of metrics from the agent, the agent had a GC pause of 13.8 seconds around this time. The memory pressure went away, but from this point on the agent was effectively stuck.

Reproduction Steps

I don't know how to reproduce the issue. I've only seen it happen once. It's plausible that you could reproduce it with enough memory pressure on a machine, but I don't know the odds of hitting it again.

Consul info for both Client and Server

Both client and server are running v1.6.3.

Operating system and Environment details

Debian Stretch

Log Fragments

Full goroutine dump from sending the process SIGQUIT: https://gist.github.com/a-robinson/ec5425d98cc8e0bcac9fc6298cf7a9a3

This may be related to #6616, but the big difference here is that the RPC attempt never timed out.

The text was updated successfully, but these errors were encountered:

dnephin · 2022-02-07T22:57:37Z

Thank you for creating this issue! Sorry I did not comment when you first opened it. #6616 is related to this issue as well, and there is some discussion in that PR about it.

The reason this blocks for so long as because we don't have read timeouts on these connections!

We can see in the stack trace that it's blocked reading from the connection.

dnephin added theme/internals Serf, Raft, SWIM, Lifeguard, Anti-Entropy, locking topics type/bug Feature does not function as expected labels Aug 12, 2020

tantra35 mentioned this issue Sep 8, 2020

Consul stop answer on 8500 port when many wait(blocking) api queries are made #8628

Open

dnephin mentioned this issue May 11, 2021

Release state lock before catalog operations in State.SyncChanges #6616

Open

wjordan mentioned this issue Nov 5, 2021

Add timeout to Client RPC calls #11500

Merged

wjordan mentioned this issue Mar 2, 2022

Release state lock before catalog operations in State.SyncChanges #12485

Closed

kisunji closed this as completed in #11500 Apr 21, 2022

eculver mentioned this issue Sep 22, 2022

docs: fix missing bugfix changelog entries #14720

Merged

obourdon mentioned this issue Oct 18, 2022

Consul connection error on port 8300 #14464

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Hanging RPC caused agent's state lock to be held for hours #8504

Hanging RPC caused agent's state lock to be held for hours #8504

a-robinson commented Aug 12, 2020 •

edited

Loading

dnephin commented Feb 7, 2022

Hanging RPC caused agent's state lock to be held for hours #8504

Hanging RPC caused agent's state lock to be held for hours #8504

Comments

a-robinson commented Aug 12, 2020 • edited Loading

Overview of the Issue

Reproduction Steps

Consul info for both Client and Server

Operating system and Environment details

Log Fragments

dnephin commented Feb 7, 2022

a-robinson commented Aug 12, 2020 •

edited

Loading