Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pd client panics with "[tso] timestamp fallback" #8836

Closed
overvenus opened this issue Nov 21, 2024 · 6 comments · Fixed by #8837
Closed

pd client panics with "[tso] timestamp fallback" #8836

overvenus opened this issue Nov 21, 2024 · 6 comments · Fixed by #8837
Labels
affects-8.5 This bug affects the 8.5.x(LTS) versions. impact/panic severity/major type/bug The issue is confirmed as a bug.

Comments

@overvenus
Copy link
Member

Bug Report

store/TiDB master bank-multitable auto-retry auto-retry-limit :default select FOR UPDATE txn-mode pessimistic isolation :repeatable-read nemesis kill,partition-one,schedules,stop/20241119T091111.000Z/no
de-0.node-peer.jepsen-tps-7527309-1-659/db.log:[2024/11/19 09:22:35.584 +00:00] [FATAL] [tso_stream.go:330] ["tsoStream.recvLoop internal panic"] [stacktrace="[github.com/tikv/pd/client.(*tsoStream).recv](http://github.com/tikv/pd/client.(*tsoStream).recv)
Loop.func1\n\t/root/go/pkg/mod/[github.com/tikv/pd/client@v0.0.0-20241016064947-b70107ec31e6/tso_stream.go:330\nruntime.gopanic\n\t/usr/local/go/src/runtime/panic.go:785\ngo.uber.org/zap/zapcore.CheckWri](http://github.com/tikv/pd/client@v0.0.0-20241016064947-b70107ec31e6/tso_stream.go:330/nruntime.gopanic/n/t/usr/local/go/src/runtime/panic.go:785/ngo.uber.org/zap/zapcore.CheckWri)
teAction.OnWrite\n\t/root/go/pkg/mod/[go.uber.org/zap@v1.27.0/zapcore/entry.go:196\ngo.uber.org/zap/zapcore.(*CheckedEntry).Write\n\t/root/go/pkg/mod/go.uber.org/zap@v1.27.0/zapcore/entry.go:262\ngo.uber](http://go.uber.org/zap@v1.27.0/zapcore/entry.go:196/ngo.uber.org/zap/zapcore.(*CheckedEntry).Write/n/t/root/go/pkg/mod/go.uber.org/zap@v1.27.0/zapcore/entry.go:262/ngo.uber)
.org/zap.(*Logger).Panic\n\t/root/go/pkg/mod/[go.uber.org/zap@v1.27.0/logger.go:285\ngithub.com/pingcap/log.Panic\n\t/root/go/pkg/mod/github.com/pingcap/log@v1.1.1-0.20240314023424-862ccc32f18d/global.go](http://go.uber.org/zap@v1.27.0/logger.go:285/ngithub.com/pingcap/log.Panic/n/t/root/go/pkg/mod/github.com/pingcap/log@v1.1.1-0.20240314023424-862ccc32f18d/global.go)
:54\[ngithub.com/tikv/pd/client.(*tsoDispatcher).checkMonotonicity\n\t/root/go/pkg/mod/github.com/tikv/pd/client@v0.0.0-20241016064947-b70107ec31e6/tso_dispatcher.go:647\ngithub.com/tikv/pd/client](http://ngithub.com/tikv/pd/client.(*tsoDispatcher).checkMonotonicity/n/t/root/go/pkg/mod/github.com/tikv/pd/client@v0.0.0-20241016064947-b70107ec31e6/tso_dispatcher.go:647/ngithub.com/tikv/pd/client).(*tsoD
ispatcher).processRequests.func2\n\t/root/go/pkg/mod/[github.com/tikv/pd/client@v0.0.0-20241016064947-b70107ec31e6/tso_dispatcher.go:597\ngithub.com/tikv/pd/client.(*tsoStream).recvLoop\n\t/root/go/pkg/m](http://github.com/tikv/pd/client@v0.0.0-20241016064947-b70107ec31e6/tso_dispatcher.go:597/ngithub.com/tikv/pd/client.(*tsoStream).recvLoop/n/t/root/go/pkg/m)
od/[github.com/tikv/pd/client@v0.0.0-20241016064947-b70107ec31e6/tso_stream.go:444](http://github.com/tikv/pd/client@v0.0.0-20241016064947-b70107ec31e6/tso_stream.go:444)"] [panicMessage="[tso] timestamp fallback"] [stack="[github.com/tikv/pd/client.(*tsoStream).recvLoop.func1\n\t/root/go/pk](http://github.com/tikv/pd/client.(*tsoStream).recvLoop.func1/n/t/root/go/pk)
g/mod/[github.com/tikv/pd/client@v0.0.0-20241016064947-b70107ec31e6/tso_stream.go:330\nruntime.gopanic\n\t/usr/local/go/src/runtime/panic.go:785\ngo.uber.org/zap/zapcore.CheckWriteAction.OnWrite\n\t/root](http://github.com/tikv/pd/client@v0.0.0-20241016064947-b70107ec31e6/tso_stream.go:330/nruntime.gopanic/n/t/usr/local/go/src/runtime/panic.go:785/ngo.uber.org/zap/zapcore.CheckWriteAction.OnWrite/n/t/root)
/go/pkg/mod/[go.uber.org/zap@v1.27.0/zapcore/entry.go:196\ngo.uber.org/zap/zapcore.(*CheckedEntry).Write\n\t/root/go/pkg/mod/go.uber.org/zap@v1.27.0/zapcore/entry.go:262\ngo.uber.org/zap.(*Logger).Panic](http://go.uber.org/zap@v1.27.0/zapcore/entry.go:196/ngo.uber.org/zap/zapcore.(*CheckedEntry).Write/n/t/root/go/pkg/mod/go.uber.org/zap@v1.27.0/zapcore/entry.go:262/ngo.uber.org/zap.(*Logger).Panic)\
n\t/root/go/pkg/mod/[go.uber.org/zap@v1.27.0/logger.go:285\ngithub.com/pingcap/log.Panic\n\t/root/go/pkg/mod/github.com/pingcap/log@v1.1.1-0.20240314023424-862ccc32f18d/global.go:54\ngithub.com/tikv/pd/c](http://go.uber.org/zap@v1.27.0/logger.go:285/ngithub.com/pingcap/log.Panic/n/t/root/go/pkg/mod/github.com/pingcap/log@v1.1.1-0.20240314023424-862ccc32f18d/global.go:54/ngithub.com/tikv/pd/c)
lient.(*tsoDispatcher).checkMonotonicity\n\t/root/go/pkg/mod/[github.com/tikv/pd/client@v0.0.0-20241016064947-b70107ec31e6/tso_dispatcher.go:647\ngithub.com/tikv/pd/client.(*tsoDispatcher).processRequest](http://github.com/tikv/pd/client@v0.0.0-20241016064947-b70107ec31e6/tso_dispatcher.go:647/ngithub.com/tikv/pd/client.(*tsoDispatcher).processRequest)
s.func2\n\t/root/go/pkg/mod/[github.com/tikv/pd/client@v0.0.0-20241016064947-b70107ec31e6/tso_dispatcher.go:597\ngithub.com/tikv/pd/client.(*tsoStream).recvLoop\n\t/root/go/pkg/mod/github.com/tikv/pd/cli](http://github.com/tikv/pd/client@v0.0.0-20241016064947-b70107ec31e6/tso_dispatcher.go:597/ngithub.com/tikv/pd/client.(*tsoStream).recvLoop/n/t/root/go/pkg/mod/github.com/tikv/pd/cli)
ent@v0.0.0-20241016064947-b70107ec31e6/tso_stream.go:444"]

Test: https://tcms.pingcap.net/dashboard/executions/plan/7527309

The issue was found by @SpadeA-Tang, thank you!

What did you do?

Run jepsen tests.

What did you expect to see?

No panic.

What version of PD are you using (pd-server -V)?

TiDB: pingcap/tidb@5448f9f
PD: 60b960a

@overvenus overvenus added the type/bug The issue is confirmed as a bug. label Nov 21, 2024
@lhy1024
Copy link
Contributor

lhy1024 commented Nov 21, 2024

maybe #8781? Depending on the version of the pd server, it is possible to happen.

@lhy1024
Copy link
Contributor

lhy1024 commented Nov 21, 2024

First time became leader

[2024/11/19 09:11:44.012 +00:00] [INFO] [tso.go:161] ["start to sync timestamp"] []
[2024/11/19 09:11:44.014 +00:00] [INFO] [tso.go:221] ["sync and save timestamp"] [] [last=0001/01/01 00:00:00.000 +00:00] [last-saved=0001/01/01 00:00:00.000 +00:00] [save=2024/11/19 09:11:45.013 +00:00] [next=2024/11/19 09:11:44.013 +00:00]
[2024/11/19 09:11:44.015 +00:00] [WARN] [cluster.go:340] ["cluster is not bootstrapped"]
[2024/11/19 09:11:44.016 +00:00] [INFO] [id.go:175] ["idAllocator allocates a new id"] [new-end=1000] [new-base=0] [label=idalloc] [check-curr-end=true]
[2024/11/19 09:11:44.016 +00:00] [INFO] [util.go:50] ["load pd and cluster version"] [pd-version=8.5.0-alpha-3-g60b960a4] [cluster-version=0.0.0]
[2024/11/19 09:11:44.016 +00:00] [INFO] [server.go:1768] ["PD leader is ready to serve"] [leader-name=pd1]

But meet cluster is not bootstrapped, so tso sync but not run runServiceCheckJob and also not found raft cluster is stopped, because running is false.

Quickly became follower

[2024/11/19 09:11:53.660 +00:00] [INFO] [server.go:1648] ["start to watch pd leader"] 
[2024/11/19 09:11:46.377 +00:00] [WARN] [tso.go:335] ["clock offset"] [] [jet-lag=328.003313ms] [prev-physical=2024/11/19 09:11:46.049 +00:00] [now=2024/11/19 09:11:46.377 +00:00] [update-physical-interval=50ms]
[2024/11/19 09:11:46.416 +00:00] [INFO] [server.go:1792] ["etcd leader changed, resigns pd leadership"] [old-pd-leader-name=pd1]

The second time it becomes a leader, but it doesn't immediately synchronize tso

[2024/11/19 09:22:34.208 +00:00] [INFO] [server.go:1717] ["campaign PD leader ok"] [campaign-leader-name=pd1]
[2024/11/19 09:22:34.250 +00:00] [WARN] [tso.go:335] ["clock offset"] [] [jet-lag=10m47.850666159s] [prev-physical=2024/11/19 09:11:46.399 +00:00] [now=2024/11/19 09:22:34.250 +00:00] [update-physical-interval=50ms]

@lhy1024
Copy link
Contributor

lhy1024 commented Nov 21, 2024

After discussion with @rleungx , we should not move checkTSOService, which is needed by Tikv.

So we should call stopTSOJobsIfNeeded whatever running is true or false.

@overvenus
Copy link
Member Author

It happens again: https://tcms.pingcap.net/dashboard/executions/case/12789276

pd-server -V
Release Version: v8.5.0-alpha-6-gac87a0083
Edition: Community
Git Commit Hash: ac87a0083784a18ddde8173f52a28bb990775b2a
Git Branch: release-8.5
UTC Build Time:  2024-11-26 06:24:12

tidb-server -V
Release Version: v8.5.0-alpha-115-gf89b6ab823
Edition: Community
Git Commit Hash: f89b6ab823d2bd3cfa1c52546dfeac5d670a7dc7
Git Branch: release-8.5
UTC Build Time: 2024-11-27 02:50:52
GoVersion: go1.23.3
Race Enabled: false
Check Table Before Drop: false
Store: unistore
/opt/tidb/db.log:[2024/12/07 06:08:24.193 +00:00] [FATAL] [tso_stream.go:330] ["tsoStream.recvLoop internal panic"] [stacktrace="github.com/tikv/pd/client.(*tsoStream).recvLoop.func1
	/home/jenkins/agent/workspace/ld_tidb_multi_branch_release-8.5/go/pkg/mod/github.com/tikv/pd/client@v0.0.0-20241111073742-238d4d79ea31/tso_stream.go:330
runtime.gopanic
	/usr/local/go/src/runtime/panic.go:785
go.uber.org/zap/zapcore.CheckWriteAction.OnWrite
	/home/jenkins/agent/workspace/ld_tidb_multi_branch_release-8.5/go/pkg/mod/go.uber.org/zap@v1.27.0/zapcore/entry.go:196
go.uber.org/zap/zapcore.(*CheckedEntry).Write
	/home/jenkins/agent/workspace/ld_tidb_multi_branch_release-8.5/go/pkg/mod/go.uber.org/zap@v1.27.0/zapcore/entry.go:262
go.uber.org/zap.(*Logger).Panic
	/home/jenkins/agent/workspace/ld_tidb_multi_branch_release-8.5/go/pkg/mod/go.uber.org/zap@v1.27.0/logger.go:285
github.com/pingcap/log.Panic
	/home/jenkins/agent/workspace/ld_tidb_multi_branch_release-8.5/go/pkg/mod/github.com/pingcap/log@v1.1.1-0.20240314023424-862ccc32f18d/global.go:54
github.com/tikv/pd/client.(*tsoDispatcher).checkMonotonicity
	/home/jenkins/agent/workspace/ld_tidb_multi_branch_release-8.5/go/pkg/mod/github.com/tikv/pd/client@v0.0.0-20241111073742-238d4d79ea31/tso_dispatcher.go:647
github.com/tikv/pd/client.(*tsoDispatcher).processRequests.func2
	/home/jenkins/agent/workspace/ld_tidb_multi_branch_release-8.5/go/pkg/mod/github.com/tikv/pd/client@v0.0.0-20241111073742-238d4d79ea31/tso_dispatcher.go:597
github.com/tikv/pd/client.(*tsoStream).recvLoop
	/home/jenkins/agent/workspace/ld_tidb_multi_branch_release-8.5/go/pkg/mod/github.com/tikv/pd/client@v0.0.0-20241111073742-238d4d79ea31/tso_stream.go:444"] [panicMessage="[tso] timestamp fallback"] [stack="github.com/tikv/pd/client.(*tsoStream).recvLoop.func1
	/home/jenkins/agent/workspace/ld_tidb_multi_branch_release-8.5/go/pkg/mod/github.com/tikv/pd/client@v0.0.0-20241111073742-238d4d79ea31/tso_stream.go:330
runtime.gopanic
	/usr/local/go/src/runtime/panic.go:785
go.uber.org/zap/zapcore.CheckWriteAction.OnWrite
	/home/jenkins/agent/workspace/ld_tidb_multi_branch_release-8.5/go/pkg/mod/go.uber.org/zap@v1.27.0/zapcore/entry.go:196
go.uber.org/zap/zapcore.(*CheckedEntry).Write
	/home/jenkins/agent/workspace/ld_tidb_multi_branch_release-8.5/go/pkg/mod/go.uber.org/zap@v1.27.0/zapcore/entry.go:262
go.uber.org/zap.(*Logger).Panic
	/home/jenkins/agent/workspace/ld_tidb_multi_branch_release-8.5/go/pkg/mod/go.uber.org/zap@v1.27.0/logger.go:285
github.com/pingcap/log.Panic
	/home/jenkins/agent/workspace/ld_tidb_multi_branch_release-8.5/go/pkg/mod/github.com/pingcap/log@v1.1.1-0.20240314023424-862ccc32f18d/global.go:54
github.com/tikv/pd/client.(*tsoDispatcher).checkMonotonicity
	/home/jenkins/agent/workspace/ld_tidb_multi_branch_release-8.5/go/pkg/mod/github.com/tikv/pd/client@v0.0.0-20241111073742-238d4d79ea31/tso_dispatcher.go:647
github.com/tikv/pd/client.(*tsoDispatcher).processRequests.func2
	/home/jenkins/agent/workspace/ld_tidb_multi_branch_release-8.5/go/pkg/mod/github.com/tikv/pd/client@v0.0.0-20241111073742-238d4d79ea31/tso_dispatcher.go:597
github.com/tikv/pd/client.(*tsoStream).recvLoop
	/home/jenkins/agent/workspace/ld_tidb_multi_branch_release-8.5/go/pkg/mod/github.com/tikv/pd/client@v0.0.0-20241111073742-238d4d79ea31/tso_stream.go:444"]

@rleungx
Copy link
Member

rleungx commented Dec 10, 2024

[2024/12/09 23:49:44.543 +00:00] [INFO] [cluster.go:489] ["initializing the global TSO allocator"]

[2024/12/09 23:49:55.085 +00:00] [ERROR] [server.go:1744] ["failed to create raft cluster"] [error="[PD:etcd:ErrEtcdKVGet]context deadline exceeded: context deadline exceeded"]

At 23:49:44.543, we initiated tso, but the leader failed to create the raft cluster which resulted in tso not being reset.

[2024/12/09 23:51:23.755 +00:00] [INFO] [server.go:1717] ["campaign PD leader ok"] [campaign-leader-name=pd2]
[2024/12/09 23:51:24.234 +00:00] [WARN] [cluster.go:497] ["the global TSO allocator is already initialized before, but the cluster is not running"]

At 23:51:24.234, it becomes leader and skips sync timestamp from etcd.

TiDB report tso fallback

[2024/12/09 23:51:23.980 +00:00] [FATAL] [tso_stream.go:330] ["tsoStream.recvLoop internal panic"] [stacktrace="github.com/tikv/pd/client.(*tsoStream).recvLoop.func1\n\t/home/jenkins/agent/workspace/ld_tidb_multi_branch_release-8.5/go/pkg/mod/github.com/tikv/pd/client@v0.0.0-20241111073742-238d4d79ea31/tso_stream.go:330\nruntime.gopanic\n\t/usr/local/go/src/runtime/panic.go:785\ngo.uber.org/zap/zapcore.CheckWriteAction.OnWrite\n\t/home/jenkins/agent/workspace/ld_tidb_multi_branch_release-8.5/go/pkg/mod/go.uber.org/zap@v1.27.0/zapcore/entry.go:196\ngo.uber.org/zap/zapcore.(*CheckedEntry).Write\n\t/home/jenkins/agent/workspace/ld_tidb_multi_branch_release-8.5/go/pkg/mod/go.uber.org/zap@v1.27.0/zapcore/entry.go:262\ngo.uber.org/zap.(*Logger).Panic\n\t/home/jenkins/agent/workspace/ld_tidb_multi_branch_release-8.5/go/pkg/mod/go.uber.org/zap@v1.27.0/logger.go:285\ngithub.com/pingcap/log.Panic\n\t/home/jenkins/agent/workspace/ld_tidb_multi_branch_release-8.5/go/pkg/mod/github.com/pingcap/log@v1.1.1-0.20240314023424-862ccc32f18d/global.go:54\ngithub.com/tikv/pd/client.(*tsoDispatcher).checkMonotonicity\n\t/home/jenkins/agent/workspace/ld_tidb_multi_branch_release-8.5/go/pkg/mod/github.com/tikv/pd/client@v0.0.0-20241111073742-238d4d79ea31/tso_dispatcher.go:647\ngithub.com/tikv/pd/client.(*tsoDispatcher).processRequests.func2\n\t/home/jenkins/agent/workspace/ld_tidb_multi_branch_release-8.5/go/pkg/mod/github.com/tikv/pd/client@v0.0.0-20241111073742-238d4d79ea31/tso_dispatcher.go:597\ngithub.com/tikv/pd/client.(*tsoStream).recvLoop\n\t/home/jenkins/agent/workspace/ld_tidb_multi_branch_release-8.5/go/pkg/mod/github.com/tikv/pd/client@v0.0.0-20241111073742-238d4d79ea31/tso_stream.go:444"] [panicMessage="[tso] timestamp fallback"] [stack="github.com/tikv/pd/client.(*tsoStream).recvLoop.func1\n\t/home/jenkins/agent/workspace/ld_tidb_multi_branch_release-8.5/go/pkg/mod/github.com/tikv/pd/client@v0.0.0-20241111073742-238d4d79ea31/tso_stream.go:330\nruntime.gopanic\n\t/usr/local/go/src/runtime/panic.go:785\ngo.uber.org/zap/zapcore.CheckWriteAction.OnWrite\n\t/home/jenkins/agent/workspace/ld_tidb_multi_branch_release-8.5/go/pkg/mod/go.uber.org/zap@v1.27.0/zapcore/entry.go:196\ngo.uber.org/zap/zapcore.(*CheckedEntry).Write\n\t/home/jenkins/agent/workspace/ld_tidb_multi_branch_release-8.5/go/pkg/mod/go.uber.org/zap@v1.27.0/zapcore/entry.go:262\ngo.uber.org/zap.(*Logger).Panic\n\t/home/jenkins/agent/workspace/ld_tidb_multi_branch_release-8.5/go/pkg/mod/go.uber.org/zap@v1.27.0/logger.go:285\ngithub.com/pingcap/log.Panic\n\t/home/jenkins/agent/workspace/ld_tidb_multi_branch_release-8.5/go/pkg/mod/github.com/pingcap/log@v1.1.1-0.20240314023424-862ccc32f18d/global.go:54\ngithub.com/tikv/pd/client.(*tsoDispatcher).checkMonotonicity\n\t/home/jenkins/agent/workspace/ld_tidb_multi_branch_release-8.5/go/pkg/mod/github.com/tikv/pd/client@v0.0.0-20241111073742-238d4d79ea31/tso_dispatcher.go:647\ngithub.com/tikv/pd/client.(*tsoDispatcher).processRequests.func2\n\t/home/jenkins/agent/workspace/ld_tidb_multi_branch_release-8.5/go/pkg/mod/github.com/tikv/pd/client@v0.0.0-20241111073742-238d4d79ea31/tso_dispatcher.go:597\ngithub.com/tikv/pd/client.(*tsoStream).recvLoop\n\t/home/jenkins/agent/workspace/ld_tidb_multi_branch_release-8.5/go/pkg/mod/github.com/tikv/pd/client@v0.0.0-20241111073742-238d4d79ea31/tso_stream.go:444"]

@rleungx
Copy link
Member

rleungx commented Dec 10, 2024

tracked by #8889

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
affects-8.5 This bug affects the 8.5.x(LTS) versions. impact/panic severity/major type/bug The issue is confirmed as a bug.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants