roachtest: perturbation/full/decommission failed #137688
Comments
roachtest.perturbation/full/decommission failed with artifacts on master @ 951d890d1af51eefef402034c0506b5d6a9f014e:
Parameters:
From the last failure, the delay appears to be on n10, where a goroutine is not scheduled by the Go scheduler for > 1s: 2024-12-20T14_04_42Z-UPSERT-1223.171ms.zip. It's not clear why this node is so overloaded during this time window, since it isn't seeing additional read or write bandwidth. However, after this point it does slowly get into IO overload.
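As an aside (not part of the test harness), the scheduler latency signal referenced here comes from the Go runtime itself; a minimal sketch of reading the same histogram via runtime/metrics, where buckets at or above 1s correspond to the stall described above:

```go
package main

import (
	"fmt"
	"runtime/metrics"
)

func main() {
	// The Go runtime exposes the distribution of goroutine scheduling
	// latencies; the "go scheduler latency" graphs are derived from the
	// same underlying data.
	samples := []metrics.Sample{{Name: "/sched/latencies:seconds"}}
	metrics.Read(samples)

	h := samples[0].Value.Float64Histogram()
	// Take the lower bound of the highest non-empty bucket as a rough
	// worst-case scheduling delay observed by this process so far.
	var worst float64
	for i, count := range h.Counts {
		if count > 0 {
			worst = h.Buckets[i]
		}
	}
	fmt.Printf("worst observed scheduling latency: >= %.6fs\n", worst)
}
```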
roachtest.perturbation/full/decommission failed with artifacts on master @ b63a736c85cfc1a968b74863b7f99c89ddebc1d3:
Parameters:
roachtest.perturbation/full/decommission failed with artifacts on master @ f9df57e2bebd963d10ffd7fa52e4d37cf01b80df:
Parameters:
roachtest.perturbation/full/decommission failed with artifacts on master @ 47699f3887ad5d1b8c7c5905eb5c49628aa59bbe:
Parameters:
@andrewbaptist I looked at a trace from this earlier today. This bit jumps out:
Specifically, it seems like 1 (the leader) took ~115ms to ack a raft proposal. The trace event is from n2, which is presumably the leaseholder. The raft IDs indicate that 1 is the leader -- is this a leader/leaseholder split? Or am I misunderstanding something about raft tracing here? Either way, I didn't dig into what caused this 115ms delay.
Yes - that is a correct reading of this. The slow local ack implies that the disk is overloaded and has a slow sync. Specifically, this happens when there is a lot of elastic work occurring at the same time and it's not appropriately throttled. I'm continuing to look at how to handle these types of cases, as they are not part of what AC or RACv2 cover.
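For anyone reproducing this, a throwaway probe along these lines (not part of the harness; the iteration count and sleep are made up) is one way to confirm the slow-sync hypothesis directly on the suspect node's store disk:

```go
package main

import (
	"fmt"
	"os"
	"time"
)

func main() {
	// Run from the suspect node's store directory so the sync hits the
	// same disk as the raft log.
	f, err := os.CreateTemp(".", "fsync-probe-*")
	if err != nil {
		panic(err)
	}
	defer os.Remove(f.Name())
	defer f.Close()

	buf := make([]byte, 4096) // matches the test's blockSize=4096
	var worst time.Duration
	for i := 0; i < 100; i++ {
		if _, err := f.Write(buf); err != nil {
			panic(err)
		}
		start := time.Now()
		if err := f.Sync(); err != nil {
			panic(err)
		}
		if d := time.Since(start); d > worst {
			worst = d
		}
		time.Sleep(10 * time.Millisecond)
	}
	fmt.Printf("worst fsync latency over 100 syncs: %s\n", worst)
}
```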
Looking at the last test, for some reason n11 gets into a bad state at ~14:49:30. I don't fully understand what happened as it doesn't appear to be doing extra work, but its fsync latency spikes on both disks and IO AC kicks in (although I'm not 100% sure why). The disk bandwidth goes up slightly, but there isn't any clear cause. Fairly quickly we see very slow requests due to AC throttling:
The decommission had just started before this, so it is likely getting a lot of snapshot traffic, but it doesn't seem sufficient to push it to this level.
I'm going to assign this to storage/AC to see if they can make anything of it, since this test has become a lot more erratic since ~Dec 22nd. It's possible something changed in the underlying cloud, but more likely something has changed in our software that is pushing things harder.
This issue has multiple T-eam labels. Please make sure it only has one, or else issue synchronization will not work correctly. 🦉 Hoot! I am Blathers, a bot for CockroachDB. My owner is dev-inf.
One thing I noticed on all the failures is that the raft scheduler latency gets high on one node around the time of the failure. My guess is that the scheduler threads are blocked on disk reads. As an example, here are the graphs from the recent failure. Note that Go scheduler latency is low during this time (under 5ms) and CPU is only at ~25%. I'm still unclear what is driving the fsync latency and raft scheduler latency so high.
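A quick way to pull the same series outside Grafana is to scrape the node's Prometheus endpoint; the sketch below assumes the HTTP port is 8080 and that the series is exported under a name containing raft_scheduler_latency (both are assumptions, adjust for the actual cluster):

```go
package main

import (
	"bufio"
	"fmt"
	"net/http"
	"strings"
)

func main() {
	// CockroachDB serves Prometheus-format metrics on /_status/vars.
	resp, err := http.Get("http://localhost:8080/_status/vars")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	sc := bufio.NewScanner(resp.Body)
	sc.Buffer(make([]byte, 1<<20), 1<<20) // some metric lines are long
	for sc.Scan() {
		line := sc.Text()
		// Assumed metric name; grep loosely to catch histogram suffixes.
		if strings.Contains(line, "raft_scheduler_latency") {
			fmt.Println(line)
		}
	}
	if err := sc.Err(); err != nil {
		panic(err)
	}
}
```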
roachtest.perturbation/full/decommission failed with artifacts on master @ 22b262749c502d07ec7ccec5b76abd67c361ae4d:
Parameters:
Same failure on other branches
roachtest.perturbation/full/decommission failed with artifacts on master @ 51693691ed763f700dd06fa2d001cce1ffd42203:
Parameters:
acMode=defaultOption
arch=amd64
blockSize=4096
cloud=gce
coverageBuild=false
cpu=16
diskBandwidthLimit=0
disks=2
encrypted=false
fillDuration=10m0s
fs=ext4
leaseType=epoch
localSSD=true
mem=standard
numNodes=12
numWorkloadNodes=1
perturbationDuration=10m0s
ratioOfMax=0.5
runtimeAssertionsBuild=false
seed=0
splits=10000
ssd=2
validationDuration=5m0s
vcpu=16
Help
See: roachtest README
See: How To Investigate (internal)
See: Grafana
This test on roachdash | Improve this report!
Jira issue: CRDB-45700