
roachtest: perturbation/full/decommission failed #137688

Open
cockroach-teamcity opened this issue Dec 18, 2024 · 11 comments
Assignees
  • andrewbaptist
Labels
  • A-admission-control
  • branch-master: Failures and bugs on the master branch.
  • C-bug: Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
  • C-test-failure: Broken test (automatically or manually discovered).
  • O-roachtest
  • O-robot: Originated from a bot.
  • T-storage: Storage Team

Comments

@cockroach-teamcity (Member) commented Dec 18, 2024

roachtest.perturbation/full/decommission failed with artifacts on master @ 51693691ed763f700dd06fa2d001cce1ffd42203:

(assertions.go:363).Fail: 
	Error Trace:	pkg/cmd/roachtest/tests/perturbation/framework.go:673
	            				pkg/cmd/roachtest/test_runner.go:1314
	            				src/runtime/asm_amd64.s:1695
	Error:      	Should be true
	Test:       	perturbation/full/decommission
	Messages:   	FAILURE: follower-read  : Increase 455873.9934 > 5.0000 BASE: 7.896919ms SCORE: 1h0m0s
	            	
	            	FAILURE: read           : Increase 464598.7377 > 5.0000 BASE: 7.748622ms SCORE: 1h0m0s
	            	
	            	FAILURE: write          : Increase 445918.1765 > 5.0000 BASE: 8.07323ms SCORE: 1h0m0s
(require.go:1950).True: FailNow called
test artifacts and logs in: /artifacts/perturbation/full/decommission/run_1
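
For context on how to read these messages: the test fails a metric when its measured latency during the perturbation (SCORE) exceeds the pre-perturbation baseline (BASE) by more than the allowed ratio, 5.0 here; the 1h0m0s scores above suggest the workload effectively stalled during the measurement window. A minimal, illustrative sketch of that check as implied by the failure messages (not the actual framework code; names are made up):

```go
package main

import (
	"fmt"
	"time"
)

// maxAllowedIncrease mirrors the 5.0000 threshold in the failure messages.
const maxAllowedIncrease = 5.0

// checkMetric is a hypothetical stand-in for the framework's assertion: it
// fails a metric when SCORE/BASE exceeds the allowed ratio.
func checkMetric(name string, base, score time.Duration) bool {
	increase := float64(score) / float64(base)
	if increase > maxAllowedIncrease {
		fmt.Printf("FAILURE: %-14s: Increase %.4f > %.4f BASE: %s SCORE: %s\n",
			name, increase, maxAllowedIncrease, base, score)
		return false
	}
	return true
}

func main() {
	// Values from the follower-read line above: a 1h score against a ~7.9ms baseline.
	checkMetric("follower-read", 7896919*time.Nanosecond, time.Hour)
}
```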

Parameters:

  • acMode=defaultOption
  • arch=amd64
  • blockSize=4096
  • cloud=gce
  • coverageBuild=false
  • cpu=16
  • diskBandwidthLimit=0
  • disks=2
  • encrypted=false
  • fillDuration=10m0s
  • fs=ext4
  • leaseType=epoch
  • localSSD=true
  • mem=standard
  • numNodes=12
  • numWorkloadNodes=1
  • perturbationDuration=10m0s
  • ratioOfMax=0.5
  • runtimeAssertionsBuild=false
  • seed=0
  • splits=10000
  • ssd=2
  • validationDuration=5m0s
  • vcpu=16
Help

See: roachtest README

See: How To Investigate (internal)

See: Grafana

/cc @cockroachdb/kv-triage

This test on roachdash | Improve this report!

Jira issue: CRDB-45700

@cockroach-teamcity cockroach-teamcity added branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. T-kv KV Team labels Dec 18, 2024
@cockroach-teamcity (Member Author)

roachtest.perturbation/full/decommission failed with artifacts on master @ 951d890d1af51eefef402034c0506b5d6a9f014e:

(assertions.go:363).Fail: 
	Error Trace:	pkg/cmd/roachtest/tests/perturbation/framework.go:673
	            				pkg/cmd/roachtest/test_runner.go:1314
	            				src/runtime/asm_amd64.s:1695
	Error:      	Should be true
	Test:       	perturbation/full/decommission
	Messages:   	FAILURE: follower-read  : Increase 9.9893 > 5.0000 BASE: 6.481615ms SCORE: 64.746521ms
	            	
	            	FAILURE: read           : Increase 30.5676 > 5.0000 BASE: 7.016868ms SCORE: 214.489101ms
	            	
	            	FAILURE: write          : Increase 10.6101 > 5.0000 BASE: 18.82548ms SCORE: 199.740274ms
(require.go:1950).True: FailNow called
test artifacts and logs in: /artifacts/perturbation/full/decommission/run_1

Parameters:

  • acMode=defaultOption
  • arch=amd64
  • blockSize=4096
  • cloud=gce
  • coverageBuild=false
  • cpu=16
  • diskBandwidthLimit=0
  • disks=2
  • encrypted=false
  • fillDuration=10m0s
  • leaseType=epoch
  • mem=standard
  • numNodes=12
  • numWorkloadNodes=1
  • perturbationDuration=10m0s
  • ratioOfMax=0.5
  • runtimeAssertionsBuild=false
  • seed=0
  • splits=10000
  • ssd=2
  • validationDuration=5m0s
  • vcpu=16
Help

See: roachtest README

See: How To Investigate (internal)

See: Grafana

This test on roachdash | Improve this report!

@andrewbaptist (Contributor)

From the last failure, the delay appears to be on n10, where the Go scheduler doesn't schedule the goroutine for > 1s.

2024-12-20T14_04_42Z-UPSERT-1223.171ms.zip

It's not clear why this node is so overloaded during this time window, since it doesn't have additional read or write bandwidth. However, after this point it does slowly get into IO overload.

@cockroach-teamcity (Member Author)

roachtest.perturbation/full/decommission failed with artifacts on master @ b63a736c85cfc1a968b74863b7f99c89ddebc1d3:

(assertions.go:363).Fail: 
	Error Trace:	pkg/cmd/roachtest/tests/perturbation/framework.go:673
	            				pkg/cmd/roachtest/test_runner.go:1314
	            				src/runtime/asm_amd64.s:1695
	Error:      	Should be true
	Test:       	perturbation/full/decommission
	Messages:   	FAILURE: write          : Increase 6.1470 > 5.0000 BASE: 11.961031ms SCORE: 73.524882ms
(require.go:1950).True: FailNow called
test artifacts and logs in: /artifacts/perturbation/full/decommission/run_1

Parameters:

  • acMode=defaultOption
  • arch=amd64
  • blockSize=4096
  • cloud=gce
  • coverageBuild=false
  • cpu=16
  • diskBandwidthLimit=0
  • disks=2
  • encrypted=false
  • fillDuration=10m0s
  • fs=ext4
  • leaseType=epoch
  • localSSD=true
  • mem=standard
  • numNodes=12
  • numWorkloadNodes=1
  • perturbationDuration=10m0s
  • ratioOfMax=0.5
  • runtimeAssertionsBuild=false
  • seed=0
  • splits=10000
  • ssd=2
  • validationDuration=5m0s
  • vcpu=16
Help

See: roachtest README

See: How To Investigate (internal)

See: Grafana

This test on roachdash | Improve this report!

@cockroach-teamcity (Member Author)

roachtest.perturbation/full/decommission failed with artifacts on master @ f9df57e2bebd963d10ffd7fa52e4d37cf01b80df:

(assertions.go:363).Fail: 
	Error Trace:	pkg/cmd/roachtest/tests/perturbation/framework.go:673
	            				pkg/cmd/roachtest/test_runner.go:1314
	            				src/runtime/asm_amd64.s:1695
	Error:      	Should be true
	Test:       	perturbation/full/decommission
	Messages:   	FAILURE: follower-read  : Increase 5.5666 > 5.0000 BASE: 6.318174ms SCORE: 35.170608ms
	            	
	            	FAILURE: read           : Increase 5.8903 > 5.0000 BASE: 6.166364ms SCORE: 36.321491ms
	            	
	            	FAILURE: write          : Increase 8.0275 > 5.0000 BASE: 14.381ms SCORE: 115.443257ms
(require.go:1950).True: FailNow called
test artifacts and logs in: /artifacts/perturbation/full/decommission/run_1

Parameters:

  • acMode=defaultOption
  • arch=amd64
  • blockSize=4096
  • cloud=gce
  • coverageBuild=false
  • cpu=16
  • diskBandwidthLimit=0
  • disks=2
  • encrypted=false
  • fillDuration=10m0s
  • fs=ext4
  • leaseType=epoch
  • localSSD=true
  • mem=standard
  • numNodes=12
  • numWorkloadNodes=1
  • perturbationDuration=10m0s
  • ratioOfMax=0.5
  • runtimeAssertionsBuild=false
  • seed=0
  • splits=10000
  • ssd=2
  • validationDuration=5m0s
  • vcpu=16
Help

See: roachtest README

See: How To Investigate (internal)

See: Grafana

This test on roachdash | Improve this report!

@cockroach-teamcity (Member Author)

roachtest.perturbation/full/decommission failed with artifacts on master @ 47699f3887ad5d1b8c7c5905eb5c49628aa59bbe:

(assertions.go:363).Fail: 
	Error Trace:	pkg/cmd/roachtest/tests/perturbation/framework.go:673
	            				pkg/cmd/roachtest/test_runner.go:1314
	            				src/runtime/asm_amd64.s:1695
	Error:      	Should be true
	Test:       	perturbation/full/decommission
	Messages:   	FAILURE: follower-read  : Increase 44.7127 > 5.0000 BASE: 6.576626ms SCORE: 294.059018ms
	            	
	            	FAILURE: read           : Increase 43.2523 > 5.0000 BASE: 6.658326ms SCORE: 287.987715ms
	            	
	            	FAILURE: write          : Increase 21.6648 > 5.0000 BASE: 17.523226ms SCORE: 379.636316ms
	            	
	            	FAILURE: follower-read  : Increase 11.8065 > 5.0000 BASE: 6.576626ms SCORE: 77.646735ms
	            	
	            	FAILURE: read           : Increase 11.3909 > 5.0000 BASE: 6.658326ms SCORE: 75.844281ms
(require.go:1950).True: FailNow called
test artifacts and logs in: /artifacts/perturbation/full/decommission/run_1

Parameters:

  • acMode=defaultOption
  • arch=amd64
  • blockSize=4096
  • cloud=gce
  • coverageBuild=false
  • cpu=16
  • diskBandwidthLimit=0
  • disks=2
  • encrypted=false
  • fillDuration=10m0s
  • leaseType=epoch
  • mem=standard
  • numNodes=12
  • numWorkloadNodes=1
  • perturbationDuration=10m0s
  • ratioOfMax=0.5
  • runtimeAssertionsBuild=false
  • seed=0
  • splits=10000
  • ssd=2
  • validationDuration=5m0s
  • vcpu=16
Help

See: roachtest README

See: How To Investigate (internal)

See: Grafana

This test on roachdash | Improve this report!

@andrewbaptist andrewbaptist added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. and removed release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Jan 8, 2025
@andrewbaptist andrewbaptist self-assigned this Jan 8, 2025
@arulajmani (Collaborator)

@andrewbaptist I looked at a trace from this earlier today. This bit jumps out:

    32.575ms     30.987ms                                    event:kv/kvserver/replica_proposal_buf.go:575 [n2,s4,r9550/1:/Table/106/1/16{88630…-90475…}] flushing proposal to Raft
    32.585ms      0.010ms                                        === operation:raft trace _verbose:1 node:2 store:4 range:9550/1:/Table/106/1/16{88630…-90475…}
    32.614ms      0.029ms                                        event:kv/kvserver/replica_proposal_buf.go:1199 [n2,s4,r9550/1:/Table/106/1/16{88630…-90475…}] registering local trace i3960/531d.5211
    32.658ms      0.044ms                                        event:kv/kvserver/replica_raft.go:2013 [n2,s4,r9550/1:/Table/106/1/16{88630…-90475…}] 1->4 MsgApp Term:9 Log:9/3959 Entries:[3960-3960]
    32.691ms      0.034ms                                        event:kv/kvserver/replica_raft.go:2013 [n2,s4,r9550/1:/Table/106/1/16{88630…-90475…}] 1->3 MsgApp Term:9 Log:9/3959 Entries:[3960-3960]
    32.758ms      0.067ms                                        event:kv/kvserver/replica_raft.go:1210 [n2,s4,r9550/1:/Table/106/1/16{88630…-90475…}] 1->AppendThread MsgStorageAppend Term:0 Log:9/3960 Entries:[3960-3960]
    34.899ms      2.141ms                                        event:kv/kvserver/replica_raft.go:646 [n2,s4,r9550/1:/Table/106/1/16{88630…-90475…}] 3->1 MsgAppResp Term:9 Index:3960
    45.235ms     10.335ms                                        event:kv/kvserver/replica_raft.go:646 [n2,s4,r9550/1:/Table/106/1/16{88630…-90475…}] 4->1 MsgAppResp Term:9 Index:3960
   160.720ms    115.486ms                                        event:kv/kvserver/replica_raft.go:1989 [n2,s4,r9550/1:/Table/106/1/16{88630…-90475…}] 1->1 MsgAppResp Term:9 Index:3960
   160.742ms      0.022ms                                        event:kv/kvserver/replica_raft.go:1989 [n2,s4,r9550/1:/Table/106/1/16{88630…-90475…}] AppendThread->1 MsgStorageAppendResp Log:9/3960
   160.836ms      0.094ms                                        event:kv/kvserver/replica_raft.go:1239 [n2,s4,r9550/1:/Table/106/1/16{88630…-90475…}] 1->ApplyThread MsgStorageApply Term:0 Log:0/0 Entries:[3960-3960]
   160.972ms      0.136ms                                        event:kv/kvserver/replica_raft.go:1989 [n2,s4,r9550/1:/Table/106/1/16{88630…-90475…}] ApplyThread->1 MsgStorageApplyResp LastEntry:9/3960
   160.995ms      0.023ms                                        event:kv/kvserver/replica_raft.go:1989 [n2,s4,r9550/1:/Table/106/1/16{88630…-90475…}] unregistered log index 3960 from tracing
   160.771ms     -0.224ms                                        === operation:local proposal _verbose:1 node:2 store:4 range:9550/1:/Table/106/1/16{88630…-90475…} raft:
   160.807ms      0.036ms                                        event:kv/kvserver/replica_application_cmd.go:143 [n2,s4,r9550/1:/Table/106/1/16{88630…-90475…},raft] ack-ing replication success to the client; application will continue async w.r.t. the client
   160.856ms      0.049ms                                        event:kv/kvserver/app_batch.go:116 [n2,s4,r9550/1:/Table/106/1/16{88630…-90475…},raft] applying command
   160.922ms      0.067ms                                        event:kv/kvserver/replica_application_state_machine.go:179 [n2,s4,r9550/1:/Table/106/1/16{88630…-90475…},raft] LocalResult (reply: (err: <nil>), *kvpb.PutResponse, *kvpb.EndTxnResponse, #encountered intents: 0, #acquired locks: 0, #resolved locks: 1#updated txns: 1 #end txns: 0, PopulateBarrierResponse:false GossipFirstRange:false MaybeGossipSystemConfig:false MaybeGossipSystemConfigIfHaveFailure:false MaybeAddToSplitQueue:false MaybeGossipNodeLiveness:<nil> 

Specifically, it seems like 1 (the leader) took ~115ms to ack a raft proposal. The trace event is from n2, which is presumably the leaseholder. The raft IDs indicate that 1 is the leader -- is this a leader leaseholder split? Or am I misunderstanding something about raft tracing here?

Either way, I didn't dig into what caused this 115ms delay.
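
Reading the trace numerically: the second column is the delta from the previous event, so the ~115ms is the gap between follower 4's MsgAppResp (45.235ms) and the leader's own 1->1 MsgAppResp (160.720ms); measured from the MsgStorageAppend handoff at 32.758ms, the leader's local append took ~128ms while followers 3 and 4 answered within ~2ms and ~12ms. A small sketch recomputing those gaps from the timestamps above (illustrative only, not CockroachDB code):

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// Cumulative timestamps copied from the raft trace above (relative to trace start).
	storageAppend := 32758 * time.Microsecond // 1->AppendThread MsgStorageAppend
	ackFromR3 := 34899 * time.Microsecond     // 3->1 MsgAppResp
	ackFromR4 := 45235 * time.Microsecond     // 4->1 MsgAppResp
	localAck := 160720 * time.Microsecond     // 1->1 MsgAppResp (leader's own append)

	fmt.Println("replica 3 ack after append handoff:", ackFromR3-storageAppend)        // ~2.141ms
	fmt.Println("replica 4 ack after append handoff:", ackFromR4-storageAppend)        // ~12.477ms
	fmt.Println("leader local ack after append handoff:", localAck-storageAppend)      // ~127.962ms
	fmt.Println("gap between last follower ack and local ack:", localAck-ackFromR4)    // matches the ~115ms delta in the trace
}
```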

@andrewbaptist (Contributor)

Specifically, it seems like 1 (the leader) took ~115ms to ack a raft proposal. The trace event is from n2, which is presumably the leaseholder. The raft IDs indicate that 1 is the leader -- is this a leader leaseholder split? Or am I misunderstanding something about raft tracing here?

Yes - that is a correct reading of this. The slow local ack implies that the disk is overloaded and has a slow sync. Specifically, this happens when there is a lot of elastic work occurring at the same time and it's not appropriately throttled. I'm continuing to look at how to handle these types of cases, as they are not part of what AC or RACv2 cover.

@andrewbaptist (Contributor)

Looking at the last test, for some reason n11 gets into a bad state at ~14:49:30. I don't fully understand what happened, as it doesn't appear to be doing extra work, but its fsync latency spikes on both disks and IO AC kicks in (although I'm not 100% sure why).

https://grafana.testeng.crdb.io/d/CAvWxELVq/l2?from=1735742940000&to=1735743120000&var-cluster=teamcity-18346412-1735713837-148-n13cpu16sm&orgId=1

The disk bandwidth goes up slightly, but there isn't any clear cause.

Fairly quickly we see very slow requests due to AC throttling:

     0.923ms      0.005ms                                    === operation:admissionWorkQueueWait _verbose:1 node:11
  3143.476ms   3142.552ms                                    event:wait_duration_nanos:3142528376 queue_kind:"kv-regular-store-queue" work_priority:"normal-pri" 
  3143.455ms     -0.020ms                                    structured:{"@type":"type.googleapis.com/cockroach.util.admission.admissionpb.AdmissionWorkQueueStats","waitDurationNanos":"3.142528376s","queueKind":"kv-regular-store-queue","workPriority":"normal-pri"}
  3143.531ms   3142.613ms                                event:kv/kvserver/store_send.go:149 [n11,s21] executing Put [/Table/106/1/-4941528009249908907/0], EndTxn(parallel commit) [/Table/106/1/-4941528009249908907/0], [txn: 2a9f0995], [can-forward-ts]

The decommission had just started before this, so the node is likely receiving a lot of snapshot traffic, but that doesn't seem sufficient to push it to this level.

2025/01/01 14:49:11 cluster.go:2501: running cmd `./cockroach node decommissi...` on nodes [:12]
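
As an aside on the trace above: the 3.14s wait shows up both as the delta on the wait_duration event and in the structured AdmissionWorkQueueStats payload. A minimal sketch of pulling the wait out of that JSON rendering (field names taken from the payload as shown above; this is not CockroachDB code):

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// admissionWaitEvent mirrors the fields of the structured payload as rendered
// in the trace above; the duration is assumed to be a string like "3.142528376s".
type admissionWaitEvent struct {
	WaitDurationNanos string `json:"waitDurationNanos"`
	QueueKind         string `json:"queueKind"`
	WorkPriority      string `json:"workPriority"`
}

func main() {
	raw := `{"@type":"type.googleapis.com/cockroach.util.admission.admissionpb.AdmissionWorkQueueStats","waitDurationNanos":"3.142528376s","queueKind":"kv-regular-store-queue","workPriority":"normal-pri"}`

	var ev admissionWaitEvent
	if err := json.Unmarshal([]byte(raw), &ev); err != nil {
		panic(err)
	}
	wait, err := time.ParseDuration(ev.WaitDurationNanos)
	if err != nil {
		panic(err)
	}
	// Flag anything that spent more than 100ms queued in admission control.
	if wait > 100*time.Millisecond {
		fmt.Printf("slow admission wait: %s in %s (%s)\n", wait, ev.QueueKind, ev.WorkPriority)
	}
}
```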

I'm going to assign this to storage/AC to see if they can make anything of it, since this test has become a lot more erratic since ~Dec 22nd.

It's possible something changed on the underlying cloud, but more likely something has changed in our software that is pushing some things harder.

blathers-crl bot commented Jan 15, 2025

This issue has multiple T-eam labels. Please make sure it only has one, or else issue synchronization will not work correctly.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@blathers-crl blathers-crl bot added the A-storage Relating to our storage engine (Pebble) on-disk storage. label Jan 15, 2025
@andrewbaptist andrewbaptist removed A-storage Relating to our storage engine (Pebble) on-disk storage. T-kv KV Team labels Jan 15, 2025
@andrewbaptist (Contributor)

One thing I noticed on all the failures is that the raft scheduler latency gets high on one node around the time of the failure. My guess is that the scheduler threads are blocked on disk reads. As an example, here are the graphs from the most recent failure:

https://grafana.testeng.crdb.io/d/J-yAVzkVddd/detailed?from=1735742869200&to=1735743292916&var-cluster=teamcity-18346412-1735713837-148-n13cpu16sm&orgId=1

[Image: Grafana graphs for the failure window]

Note that Go scheduler latency is low during this time (under 5ms) and CPU is only at ~25%.

I'm still unclear on what is driving the fsync latency and raft scheduler latency so high.

@cockroach-teamcity (Member Author)

roachtest.perturbation/full/decommission failed with artifacts on master @ 22b262749c502d07ec7ccec5b76abd67c361ae4d:

(assertions.go:363).Fail: 
	Error Trace:	pkg/cmd/roachtest/tests/perturbation/framework.go:673
	            				pkg/cmd/roachtest/test_runner.go:1333
	            				src/runtime/asm_amd64.s:1695
	Error:      	Should be true
	Test:       	perturbation/full/decommission
	Messages:   	FAILURE: write          : Increase 5.2349 > 5.0000 BASE: 11.605528ms SCORE: 60.754157ms
(require.go:1950).True: FailNow called
test artifacts and logs in: /artifacts/perturbation/full/decommission/run_1

Parameters:

  • acMode=defaultOption
  • arch=amd64
  • blockSize=4096
  • cloud=gce
  • coverageBuild=false
  • cpu=16
  • diskBandwidthLimit=0
  • disks=2
  • encrypted=false
  • fillDuration=10m0s
  • leaseType=epoch
  • mem=standard
  • metamorphicBufferedSender=false
  • numNodes=12
  • numWorkloadNodes=1
  • perturbationDuration=10m0s
  • ratioOfMax=0.5
  • runtimeAssertionsBuild=false
  • seed=0
  • splits=10000
  • ssd=2
  • validationDuration=5m0s
  • vcpu=16
Help

See: roachtest README

See: How To Investigate (internal)

See: Grafana

Same failure on other branches

This test on roachdash | Improve this report!
