
roachtest: perturbation/full/decommission failed #137688

Open
cockroach-teamcity opened this issue Dec 18, 2024 · 11 comments
Assignees
  • andrewbaptist
Labels
  • A-admission-control
  • branch-master: Failures and bugs on the master branch.
  • C-bug: Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
  • C-test-failure: Broken test (automatically or manually discovered).
  • O-roachtest
  • O-robot: Originated from a bot.
  • T-storage: Storage Team

Comments

@cockroach-teamcity (Member) commented Dec 18, 2024

roachtest.perturbation/full/decommission failed with artifacts on master @ 51693691ed763f700dd06fa2d001cce1ffd42203:

(assertions.go:363).Fail: 
	Error Trace:	pkg/cmd/roachtest/tests/perturbation/framework.go:673
	            				pkg/cmd/roachtest/test_runner.go:1314
	            				src/runtime/asm_amd64.s:1695
	Error:      	Should be true
	Test:       	perturbation/full/decommission
	Messages:   	FAILURE: follower-read  : Increase 455873.9934 > 5.0000 BASE: 7.896919ms SCORE: 1h0m0s
	            	
	            	FAILURE: read           : Increase 464598.7377 > 5.0000 BASE: 7.748622ms SCORE: 1h0m0s
	            	
	            	FAILURE: write          : Increase 445918.1765 > 5.0000 BASE: 8.07323ms SCORE: 1h0m0s
(require.go:1950).True: FailNow called
test artifacts and logs in: /artifacts/perturbation/full/decommission/run_1
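
For context on how to read these messages: the test fails a metric when its measured latency during the perturbation (SCORE) exceeds the pre-perturbation baseline (BASE) by more than the allowed ratio, 5.0 here; the 1h0m0s scores above suggest the workload effectively stalled during the measurement window. A minimal, illustrative sketch of that check as implied by the failure messages (not the actual framework code; names are made up):

```go
package main

import (
	"fmt"
	"time"
)

// maxAllowedIncrease mirrors the 5.0000 threshold in the failure messages.
const maxAllowedIncrease = 5.0

// checkMetric is a hypothetical stand-in for the framework's assertion: it
// fails a metric when SCORE/BASE exceeds the allowed ratio.
func checkMetric(name string, base, score time.Duration) bool {
	increase := float64(score) / float64(base)
	if increase > maxAllowedIncrease {
		fmt.Printf("FAILURE: %-14s: Increase %.4f > %.4f BASE: %s SCORE: %s\n",
			name, increase, maxAllowedIncrease, base, score)
		return false
	}
	return true
}

func main() {
	// Values from the follower-read line above: a 1h score against a ~7.9ms baseline.
	checkMetric("follower-read", 7896919*time.Nanosecond, time.Hour)
}
```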

Parameters:

  • acMode=defaultOption
  • arch=amd64
  • blockSize=4096
  • cloud=gce
  • coverageBuild=false
  • cpu=16
  • diskBandwidthLimit=0
  • disks=2
  • encrypted=false
  • fillDuration=10m0s
  • fs=ext4
  • leaseType=epoch
  • localSSD=true
  • mem=standard
  • numNodes=12
  • numWorkloadNodes=1
  • perturbationDuration=10m0s
  • ratioOfMax=0.5
  • runtimeAssertionsBuild=false
  • seed=0
  • splits=10000
  • ssd=2
  • validationDuration=5m0s
  • vcpu=16
Help

See: roachtest README

See: How To Investigate (internal)

See: Grafana

/cc @cockroachdb/kv-triage

This test on roachdash | Improve this report!

Jira issue: CRDB-45700

@cockroach-teamcity cockroach-teamcity added branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. T-kv KV Team labels Dec 18, 2024
@cockroach-teamcity (Member Author)

roachtest.perturbation/full/decommission failed with artifacts on master @ 951d890d1af51eefef402034c0506b5d6a9f014e:

(assertions.go:363).Fail: 
	Error Trace:	pkg/cmd/roachtest/tests/perturbation/framework.go:673
	            				pkg/cmd/roachtest/test_runner.go:1314
	            				src/runtime/asm_amd64.s:1695
	Error:      	Should be true
	Test:       	perturbation/full/decommission
	Messages:   	FAILURE: follower-read  : Increase 9.9893 > 5.0000 BASE: 6.481615ms SCORE: 64.746521ms
	            	
	            	FAILURE: read           : Increase 30.5676 > 5.0000 BASE: 7.016868ms SCORE: 214.489101ms
	            	
	            	FAILURE: write          : Increase 10.6101 > 5.0000 BASE: 18.82548ms SCORE: 199.740274ms
(require.go:1950).True: FailNow called
test artifacts and logs in: /artifacts/perturbation/full/decommission/run_1

Parameters:

  • acMode=defaultOption
  • arch=amd64
  • blockSize=4096
  • cloud=gce
  • coverageBuild=false
  • cpu=16
  • diskBandwidthLimit=0
  • disks=2
  • encrypted=false
  • fillDuration=10m0s
  • leaseType=epoch
  • mem=standard
  • numNodes=12
  • numWorkloadNodes=1
  • perturbationDuration=10m0s
  • ratioOfMax=0.5
  • runtimeAssertionsBuild=false
  • seed=0
  • splits=10000
  • ssd=2
  • validationDuration=5m0s
  • vcpu=16
Help

See: roachtest README

See: How To Investigate (internal)

See: Grafana

This test on roachdash | Improve this report!

@andrewbaptist (Contributor)

From the last failure, the delay appears to be on n10, where the Go scheduler doesn't schedule the goroutine for > 1s.

2024-12-20T14_04_42Z-UPSERT-1223.171ms.zip

It's not clear why this node is so overloaded during this time window, since it doesn't have additional read or write bandwidth. However, after this point it does slowly get into IO overload.

@cockroach-teamcity (Member Author)

roachtest.perturbation/full/decommission failed with artifacts on master @ b63a736c85cfc1a968b74863b7f99c89ddebc1d3:

(assertions.go:363).Fail: 
	Error Trace:	pkg/cmd/roachtest/tests/perturbation/framework.go:673
	            				pkg/cmd/roachtest/test_runner.go:1314
	            				src/runtime/asm_amd64.s:1695
	Error:      	Should be true
	Test:       	perturbation/full/decommission
	Messages:   	FAILURE: write          : Increase 6.1470 > 5.0000 BASE: 11.961031ms SCORE: 73.524882ms
(require.go:1950).True: FailNow called
test artifacts and logs in: /artifacts/perturbation/full/decommission/run_1

Parameters:

  • acMode=defaultOption
  • arch=amd64
  • blockSize=4096
  • cloud=gce
  • coverageBuild=false
  • cpu=16
  • diskBandwidthLimit=0
  • disks=2
  • encrypted=false
  • fillDuration=10m0s
  • fs=ext4
  • leaseType=epoch
  • localSSD=true
  • mem=standard
  • numNodes=12
  • numWorkloadNodes=1
  • perturbationDuration=10m0s
  • ratioOfMax=0.5
  • runtimeAssertionsBuild=false
  • seed=0
  • splits=10000
  • ssd=2
  • validationDuration=5m0s
  • vcpu=16
Help

See: roachtest README

See: How To Investigate (internal)

See: Grafana

This test on roachdash | Improve this report!

@cockroach-teamcity (Member Author)

roachtest.perturbation/full/decommission failed with artifacts on master @ f9df57e2bebd963d10ffd7fa52e4d37cf01b80df:

(assertions.go:363).Fail: 
	Error Trace:	pkg/cmd/roachtest/tests/perturbation/framework.go:673
	            				pkg/cmd/roachtest/test_runner.go:1314
	            				src/runtime/asm_amd64.s:1695
	Error:      	Should be true
	Test:       	perturbation/full/decommission
	Messages:   	FAILURE: follower-read  : Increase 5.5666 > 5.0000 BASE: 6.318174ms SCORE: 35.170608ms
	            	
	            	FAILURE: read           : Increase 5.8903 > 5.0000 BASE: 6.166364ms SCORE: 36.321491ms
	            	
	            	FAILURE: write          : Increase 8.0275 > 5.0000 BASE: 14.381ms SCORE: 115.443257ms
(require.go:1950).True: FailNow called
test artifacts and logs in: /artifacts/perturbation/full/decommission/run_1

Parameters:

  • acMode=defaultOption
  • arch=amd64
  • blockSize=4096
  • cloud=gce
  • coverageBuild=false
  • cpu=16
  • diskBandwidthLimit=0
  • disks=2
  • encrypted=false
  • fillDuration=10m0s
  • fs=ext4
  • leaseType=epoch
  • localSSD=true
  • mem=standard
  • numNodes=12
  • numWorkloadNodes=1
  • perturbationDuration=10m0s
  • ratioOfMax=0.5
  • runtimeAssertionsBuild=false
  • seed=0
  • splits=10000
  • ssd=2
  • validationDuration=5m0s
  • vcpu=16
Help

See: roachtest README

See: How To Investigate (internal)

See: Grafana

This test on roachdash | Improve this report!

@cockroach-teamcity (Member Author)

roachtest.perturbation/full/decommission failed with artifacts on master @ 47699f3887ad5d1b8c7c5905eb5c49628aa59bbe:

(assertions.go:363).Fail: 
	Error Trace:	pkg/cmd/roachtest/tests/perturbation/framework.go:673
	            				pkg/cmd/roachtest/test_runner.go:1314
	            				src/runtime/asm_amd64.s:1695
	Error:      	Should be true
	Test:       	perturbation/full/decommission
	Messages:   	FAILURE: follower-read  : Increase 44.7127 > 5.0000 BASE: 6.576626ms SCORE: 294.059018ms
	            	
	            	FAILURE: read           : Increase 43.2523 > 5.0000 BASE: 6.658326ms SCORE: 287.987715ms
	            	
	            	FAILURE: write          : Increase 21.6648 > 5.0000 BASE: 17.523226ms SCORE: 379.636316ms
	            	
	            	FAILURE: follower-read  : Increase 11.8065 > 5.0000 BASE: 6.576626ms SCORE: 77.646735ms
	            	
	            	FAILURE: read           : Increase 11.3909 > 5.0000 BASE: 6.658326ms SCORE: 75.844281ms
(require.go:1950).True: FailNow called
test artifacts and logs in: /artifacts/perturbation/full/decommission/run_1

Parameters:

  • acMode=defaultOption
  • arch=amd64
  • blockSize=4096
  • cloud=gce
  • coverageBuild=false
  • cpu=16
  • diskBandwidthLimit=0
  • disks=2
  • encrypted=false
  • fillDuration=10m0s
  • leaseType=epoch
  • mem=standard
  • numNodes=12
  • numWorkloadNodes=1
  • perturbationDuration=10m0s
  • ratioOfMax=0.5
  • runtimeAssertionsBuild=false
  • seed=0
  • splits=10000
  • ssd=2
  • validationDuration=5m0s
  • vcpu=16
Help

See: roachtest README

See: How To Investigate (internal)

See: Grafana

This test on roachdash | Improve this report!

@andrewbaptist andrewbaptist added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. and removed release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Jan 8, 2025
@andrewbaptist andrewbaptist self-assigned this Jan 8, 2025
@arulajmani (Collaborator)

@andrewbaptist I looked at a trace from this earlier today. This bit jumps out:

    32.575ms     30.987ms                                    event:kv/kvserver/replica_proposal_buf.go:575 [n2,s4,r9550/1:/Table/106/1/16{88630…-90475…}] flushing proposal to Raft
    32.585ms      0.010ms                                        === operation:raft trace _verbose:1 node:2 store:4 range:9550/1:/Table/106/1/16{88630…-90475…}
    32.614ms      0.029ms                                        event:kv/kvserver/replica_proposal_buf.go:1199 [n2,s4,r9550/1:/Table/106/1/16{88630…-90475…}] registering local trace i3960/531d.5211
    32.658ms      0.044ms                                        event:kv/kvserver/replica_raft.go:2013 [n2,s4,r9550/1:/Table/106/1/16{88630…-90475…}] 1->4 MsgApp Term:9 Log:9/3959 Entries:[3960-3960]
    32.691ms      0.034ms                                        event:kv/kvserver/replica_raft.go:2013 [n2,s4,r9550/1:/Table/106/1/16{88630…-90475…}] 1->3 MsgApp Term:9 Log:9/3959 Entries:[3960-3960]
    32.758ms      0.067ms                                        event:kv/kvserver/replica_raft.go:1210 [n2,s4,r9550/1:/Table/106/1/16{88630…-90475…}] 1->AppendThread MsgStorageAppend Term:0 Log:9/3960 Entries:[3960-3960]
    34.899ms      2.141ms                                        event:kv/kvserver/replica_raft.go:646 [n2,s4,r9550/1:/Table/106/1/16{88630…-90475…}] 3->1 MsgAppResp Term:9 Index:3960
    45.235ms     10.335ms                                        event:kv/kvserver/replica_raft.go:646 [n2,s4,r9550/1:/Table/106/1/16{88630…-90475…}] 4->1 MsgAppResp Term:9 Index:3960
   160.720ms    115.486ms                                        event:kv/kvserver/replica_raft.go:1989 [n2,s4,r9550/1:/Table/106/1/16{88630…-90475…}] 1->1 MsgAppResp Term:9 Index:3960
   160.742ms      0.022ms                                        event:kv/kvserver/replica_raft.go:1989 [n2,s4,r9550/1:/Table/106/1/16{88630…-90475…}] AppendThread->1 MsgStorageAppendResp Log:9/3960
   160.836ms      0.094ms                                        event:kv/kvserver/replica_raft.go:1239 [n2,s4,r9550/1:/Table/106/1/16{88630…-90475…}] 1->ApplyThread MsgStorageApply Term:0 Log:0/0 Entries:[3960-3960]
   160.972ms      0.136ms                                        event:kv/kvserver/replica_raft.go:1989 [n2,s4,r9550/1:/Table/106/1/16{88630…-90475…}] ApplyThread->1 MsgStorageApplyResp LastEntry:9/3960
   160.995ms      0.023ms                                        event:kv/kvserver/replica_raft.go:1989 [n2,s4,r9550/1:/Table/106/1/16{88630…-90475…}] unregistered log index 3960 from tracing
   160.771ms     -0.224ms                                        === operation:local proposal _verbose:1 node:2 store:4 range:9550/1:/Table/106/1/16{88630…-90475…} raft:
   160.807ms      0.036ms                                        event:kv/kvserver/replica_application_cmd.go:143 [n2,s4,r9550/1:/Table/106/1/16{88630…-90475…},raft] ack-ing replication success to the client; application will continue async w.r.t. the client
   160.856ms      0.049ms                                        event:kv/kvserver/app_batch.go:116 [n2,s4,r9550/1:/Table/106/1/16{88630…-90475…},raft] applying command
   160.922ms      0.067ms                                        event:kv/kvserver/replica_application_state_machine.go:179 [n2,s4,r9550/1:/Table/106/1/16{88630…-90475…},raft] LocalResult (reply: (err: <nil>), *kvpb.PutResponse, *kvpb.EndTxnResponse, #encountered intents: 0, #acquired locks: 0, #resolved locks: 1#updated txns: 1 #end txns: 0, PopulateBarrierResponse:false GossipFirstRange:false MaybeGossipSystemConfig:false MaybeGossipSystemConfigIfHaveFailure:false MaybeAddToSplitQueue:false MaybeGossipNodeLiveness:<nil> 

Specifically, it seems like 1 (the leader) took ~115ms to ack a raft proposal. The trace event is from n2, which is presumably the leaseholder. The raft IDs indicate that 1 is the leader -- is this a leader leaseholder split? Or am I misunderstanding something about raft tracing here?

Either way, I didn't dig into what caused this 115ms delay.
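
Reading the trace numerically: the second column is the delta from the previous event, so the ~115ms is the gap between follower 4's MsgAppResp (45.235ms) and the leader's own 1->1 MsgAppResp (160.720ms); measured from the MsgStorageAppend handoff at 32.758ms, the leader's local append took ~128ms while followers 3 and 4 answered within ~2ms and ~12ms. A small sketch recomputing those gaps from the timestamps above (illustrative only, not CockroachDB code):

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// Cumulative timestamps copied from the raft trace above (relative to trace start).
	storageAppend := 32758 * time.Microsecond // 1->AppendThread MsgStorageAppend
	ackFromR3 := 34899 * time.Microsecond     // 3->1 MsgAppResp
	ackFromR4 := 45235 * time.Microsecond     // 4->1 MsgAppResp
	localAck := 160720 * time.Microsecond     // 1->1 MsgAppResp (leader's own append)

	fmt.Println("replica 3 ack after append handoff:", ackFromR3-storageAppend)        // ~2.141ms
	fmt.Println("replica 4 ack after append handoff:", ackFromR4-storageAppend)        // ~12.477ms
	fmt.Println("leader local ack after append handoff:", localAck-storageAppend)      // ~127.962ms
	fmt.Println("gap between last follower ack and local ack:", localAck-ackFromR4)    // matches the ~115ms delta in the trace
}
```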

@andrewbaptist (Contributor)

Specifically, it seems like 1 (the leader) took ~115ms to ack a raft proposal. The trace event is from n2, which is presumably the leaseholder. The raft IDs indicate that 1 is the leader -- is this a leader leaseholder split? Or am I misunderstanding something about raft tracing here?

Yes - that is a correct reading of this. The slow local ack implies that the disk is overloaded and has a slow sync. Specifically, this happens when there is a lot of elastic work occurring at the same time and it's not appropriately throttled. I'm continuing to look at how to handle these types of cases, as they are not part of what AC or RACv2 cover.

@andrewbaptist (Contributor)

Looking at the last test, for some reason n11 gets into a bad state at ~14:49:30. I don't fully understand what happened, as it doesn't appear to be doing extra work, but its fsync latency spikes on both disks and IO AC kicks in (although I'm not 100% sure why).

https://grafana.testeng.crdb.io/d/CAvWxELVq/l2?from=1735742940000&to=1735743120000&var-cluster=teamcity-18346412-1735713837-148-n13cpu16sm&orgId=1

The disk bandwidth goes up slightly, but there isn't any clear cause.

Fairly quickly we see very slow requests due to AC throttling:

     0.923ms      0.005ms                                    === operation:admissionWorkQueueWait _verbose:1 node:11
  3143.476ms   3142.552ms                                    event:wait_duration_nanos:3142528376 queue_kind:"kv-regular-store-queue" work_priority:"normal-pri" 
  3143.455ms     -0.020ms                                    structured:{"@type":"type.googleapis.com/cockroach.util.admission.admissionpb.AdmissionWorkQueueStats","waitDurationNanos":"3.142528376s","queueKind":"kv-regular-store-queue","workPriority":"normal-pri"}
  3143.531ms   3142.613ms                                event:kv/kvserver/store_send.go:149 [n11,s21] executing Put [/Table/106/1/-4941528009249908907/0], EndTxn(parallel commit) [/Table/106/1/-4941528009249908907/0], [txn: 2a9f0995], [can-forward-ts]

The decommission had just started before this, so the node is likely receiving a lot of snapshot traffic, but that doesn't seem sufficient to push it to this level.

2025/01/01 14:49:11 cluster.go:2501: running cmd `./cockroach node decommissi...` on nodes [:12]
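
As an aside on the trace above: the 3.14s wait shows up both as the delta on the wait_duration event and in the structured AdmissionWorkQueueStats payload. A minimal sketch of pulling the wait out of that JSON rendering (field names taken from the payload as shown above; this is not CockroachDB code):

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// admissionWaitEvent mirrors the fields of the structured payload as rendered
// in the trace above; the duration is assumed to be a string like "3.142528376s".
type admissionWaitEvent struct {
	WaitDurationNanos string `json:"waitDurationNanos"`
	QueueKind         string `json:"queueKind"`
	WorkPriority      string `json:"workPriority"`
}

func main() {
	raw := `{"@type":"type.googleapis.com/cockroach.util.admission.admissionpb.AdmissionWorkQueueStats","waitDurationNanos":"3.142528376s","queueKind":"kv-regular-store-queue","workPriority":"normal-pri"}`

	var ev admissionWaitEvent
	if err := json.Unmarshal([]byte(raw), &ev); err != nil {
		panic(err)
	}
	wait, err := time.ParseDuration(ev.WaitDurationNanos)
	if err != nil {
		panic(err)
	}
	// Flag anything that spent more than 100ms queued in admission control.
	if wait > 100*time.Millisecond {
		fmt.Printf("slow admission wait: %s in %s (%s)\n", wait, ev.QueueKind, ev.WorkPriority)
	}
}
```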

I'm going to assign this to storage/AC to see if they can make anything of it, since this test has become a lot more erratic since ~Dec 22nd.

It's possible something changed on the underlying cloud, but more likely something has changed in our software that is pushing some things harder.

blathers-crl bot commented Jan 15, 2025

This issue has multiple T-eam labels. Please make sure it only has one, or else issue synchronization will not work correctly.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@blathers-crl blathers-crl bot added the A-storage Relating to our storage engine (Pebble) on-disk storage. label Jan 15, 2025
@andrewbaptist andrewbaptist removed A-storage Relating to our storage engine (Pebble) on-disk storage. T-kv KV Team labels Jan 15, 2025
@andrewbaptist (Contributor)

One thing I noticed on all the failures is that the raft scheduler latency gets high on one node around the time of the failure. My guess is that the scheduler threads are blocked on disk reads. As an example, here are the graphs from the most recent failure:

https://grafana.testeng.crdb.io/d/J-yAVzkVddd/detailed?from=1735742869200&to=1735743292916&var-cluster=teamcity-18346412-1735713837-148-n13cpu16sm&orgId=1

[Image: Grafana graphs for the failure window]

Note that Go scheduler latency is low during this time (under 5ms) and CPU is only at ~25%.

I'm still unclear on what is driving the fsync latency and raft scheduler latency so high.

@cockroach-teamcity (Member Author)

roachtest.perturbation/full/decommission failed with artifacts on master @ 22b262749c502d07ec7ccec5b76abd67c361ae4d:

(assertions.go:363).Fail: 
	Error Trace:	pkg/cmd/roachtest/tests/perturbation/framework.go:673
	            				pkg/cmd/roachtest/test_runner.go:1333
	            				src/runtime/asm_amd64.s:1695
	Error:      	Should be true
	Test:       	perturbation/full/decommission
	Messages:   	FAILURE: write          : Increase 5.2349 > 5.0000 BASE: 11.605528ms SCORE: 60.754157ms
(require.go:1950).True: FailNow called
test artifacts and logs in: /artifacts/perturbation/full/decommission/run_1

Parameters:

  • acMode=defaultOption
  • arch=amd64
  • blockSize=4096
  • cloud=gce
  • coverageBuild=false
  • cpu=16
  • diskBandwidthLimit=0
  • disks=2
  • encrypted=false
  • fillDuration=10m0s
  • leaseType=epoch
  • mem=standard
  • metamorphicBufferedSender=false
  • numNodes=12
  • numWorkloadNodes=1
  • perturbationDuration=10m0s
  • ratioOfMax=0.5
  • runtimeAssertionsBuild=false
  • seed=0
  • splits=10000
  • ssd=2
  • validationDuration=5m0s
  • vcpu=16
Help

See: roachtest README

See: How To Investigate (internal)

See: Grafana

Same failure on other branches

This test on roachdash | Improve this report!
