Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

changefeed task failed after inject s3 failure last for 1mins #10710

Closed
Lily2025 opened this issue Mar 5, 2024 · 3 comments · Fixed by #10745
Closed

changefeed task failed after inject s3 failure last for 1mins #10710

Lily2025 opened this issue Mar 5, 2024 · 3 comments · Fixed by #10745
Assignees
Labels

Comments

@Lily2025
Copy link

Lily2025 commented Mar 5, 2024

What did you do?

1、br restore data
2、create changefeed with redo log
config: "memory-quota = 2147483648\n[consistent]\nlevel = "eventual"\nflush-interval = 500\nstorage = "s3://tmp/test-changefeed?access-key=xxxn&secret-access-key=xxx&endpoint=http%3a%2f%2fminio-peer%3a9000""

3、inject s3 failure last for 1mins
[2024/03/04 21:00:07.457 +08:00] [INFO] [chaos.go:74] ["Run chaos"] [chaosId="ns=endless-ha-test-ticdc-tps-7170215-1-752,kind=failure,name=pod-failure-dkneisww,spec=&k8s.ChaosIdentifier{Namespace:"endless-ha-test-ticdc-tps-7170215-1-752", Name:"pod-failure-dkneisww", Spec:FailureExperimentSpec{Duration: "", Scheduler: }}"] [name=failure] [selectors="[endless-ha-test-ticdc-tps-7170215-1-752/minio-0]"] [selectorsRetainPolicy(selectors)="[endless-ha-test-ticdc-tps-7170215-1-752/minio-0]"] [targetSelectors="[nil]"] [targetSelectorsRetainPolicy(targetSelectors)="[nil]"] [experimentSpec="FailureExperimentSpec{Duration: "", Scheduler: }"]
[2024/03/04 21:01:07.458 +08:00] [INFO] [chaos.go:92] ["Clean chaos"] [name=failure] [chaosId="ns=endless-ha-test-ticdc-tps-7170215-1-752,kind=failure,name=pod-failure-dkneisww,spec=&k8s.ChaosIdentifier{Namespace:"endless-ha-test-ticdc-tps-7170215-1-752", Name:"pod-failure-dkneisww", Spec:FailureExperimentSpec{Duration: "", Scheduler: }}"]

ticdc logs:
ticdc-1.tar.gz

What did you expect to see?

changefeed can recover automatically after fault recover

What did you see instead?

changefeed task failed after inject s3 failure last for 1mins

{
"upstream_id": 7342169415863000245,
"namespace": "default",
"id": "ticdc-task",
"sink_uri": "mysql://root:xxxxx@cdc-downstream-tc-tidb.endless-ha-test-ticdc-tps-7170215-1-752:4000",
"create_time": "2024-03-04 00:42:24.389",
"start_ts": 448131011512107011,
"resolved_ts": 0,
"target_ts": 0,
"checkpoint_tso": 448150164062077038,
"checkpoint_time": "2024-03-04 21:00:05.437",
"sort_engine": "unified",
"state": "failed",
"error": null,
"error_history": null,
"creator_version": "v8.0.0-alpha"
}

ticdc logs:
[2024/03/04 21:00:08.356 +08:00] [ERROR] [manager.go:269] ["redo: failed to create redo log writer"] [namespace=default] [changefeed=ticdc-task] [duration=1.046949ms] [error="[CDC:ErrChangefeedUnretryable]changefeed is in unretryable state, please check the error message, and you should manually handle it%!(EXTRA *errors.withStack=[DFLOW:ErrFailToCreateExternalStorage]failed to create external storage%!(EXTRA string=creating ExternalStorage for s3): failed to get region of bucket tmp: RequestError: send request failed\ncaused by: dial tcp 10.233.78.161:9000: connect: connection refused): [CDC:ErrStorageInitialize]fail to open storage for redo log"] [errorVerbose="[CDC:ErrChangefeedUnretryable]changefeed is in unretryable state, please check the error message, and you should manually handle it%!(EXTRA *errors.withStack=[DFLOW:ErrFailToCreateExternalStorage]failed to create external storage%!(EXTRA string=creating ExternalStorage for s3): failed to get region of bucket tmp: RequestError: send request failed\ncaused by: dial tcp 10.233.78.161:9000: connect: connection refused): [CDC:ErrStorageInitialize]fail to open storage for redo log\ngithub.com/pingcap/errors.AddStack\n\tgithub.com/pingcap/errors@v0.11.5-0.20231212100244-799fae176cfb/errors.go:174\ngithub.com/pingcap/errors.(*Error).GenWithStackByArgs\n\tgithub.com/pingcap/errors@v0.11.5-0.20231212100244-799fae176cfb/normalize.go:164\ngithub.com/pingcap/tiflow/pkg/errors.WrapError\n\tgithub.com/pingcap/tiflow/pkg/errors/helper.go:34\ngithub.com/pingcap/tiflow/pkg/errors.WrapChangefeedUnretryableErr\n\tgithub.com/pingcap/tiflow/pkg/errors/helper.go:155\ngithub.com/pingcap/tiflow/pkg/redo.glob..func1\n\tgithub.com/pingcap/tiflow/pkg/redo/config.go:195\ngithub.com/pingcap/tiflow/cdc/redo/writer/memory.NewLogWriter\n\tgithub.com/pingcap/tiflow/cdc/redo/writer/memory/mem_log_writer.go:53\ngithub.com/pingcap/tiflow/cdc/redo/writer/factory.NewRedoLogWriter\n\tgithub.com/pingcap/tiflow/cdc/redo/writer/factory/factory.go:53\ngithub.com/pingcap/tiflow/cdc/redo.(*logManager).Run\n\tgithub.com/pingcap/tiflow/cdc/redo/manager.go:267\ngithub.com/pingcap/tiflow/cdc/owner.(*changefeed).initialize.func5\n\tgithub.com/pingcap/tiflow/cdc/owner/changefeed.go:654\nruntime.goexit\n\truntime/asm_amd64.s:1650"]
[2024/03/04 21:00:08.357 +08:00] [INFO] [manager.go:590] ["redo manager closed"] [namespace=default] [changefeed=ticdc-task]
[2024/03/04 21:00:08.357 +08:00] [INFO] [shared_stream.go:149] ["event feed going to create grpc stream"] [namespace=default] [changefeed=ticdc-task_owner_ddl_puller] [streamID=13] [storeID=1] [addr=tc-tikv-0.tc-tikv-peer.endless-ha-test-ticdc-tps-7170215-1-752.svc:20160]
[2024/03/04 21:00:08.357 +08:00] [INFO] [shared_stream.go:149] ["event feed going to create grpc stream"] [namespace=default] [changefeed=ticdc-task_owner_ddl_puller] [streamID=14] [storeID=4] [addr=tc-tikv-2.tc-tikv-peer.endless-ha-test-ticdc-tps-7170215-1-752.svc:20160]
[2024/03/04 21:00:08.357 +08:00] [INFO] [changefeed.go:384] ["since the downstream is not a MQ, remove MQ only fields"] [namespace=default] [changefeed=ticdc-task]
[2024/03/04 21:00:08.357 +08:00] [INFO] [changefeed.go:384] ["since the downstream is not a MQ, remove MQ only fields"] [namespace=default] [changefeed=ticdc-task]
[2024/03/04 21:00:08.357 +08:00] [INFO] [changefeed.go:384] ["since the downstream is not a MQ, remove MQ only fields"] [namespace=default] [changefeed=ticdc-task]
[2024/03/04 21:00:08.373 +08:00] [INFO] [tz.go:34] ["Use the timezone of the TiCDC server machine"] [timezoneName=System] [timezone=Asia/Shanghai]
[2024/03/04 21:00:08.379 +08:00] [ERROR] [changefeed.go:299] ["changefeed tick failed"] [error="[CDC:ErrChangefeedUnretryable]changefeed is in unretryable state, please check the error message, and you should manually handle it%!(EXTRA *errors.withStack=[DFLOW:ErrFailToCreateExternalStorage]failed to create external storage%!(EXTRA string=creating ExternalStorage for s3): failed to get region of bucket tmp: RequestError: send request failed\ncaused by: dial tcp 10.233.78.161:9000: connect: connection refused): [CDC:ErrStorageInitialize]fail to open storage for redo log"] [errorVerbose="[CDC:ErrChangefeedUnretryable]changefeed is in unretryable state, please check the error message, and you should manually handle it%!(EXTRA *errors.withStack=[DFLOW:ErrFailToCreateExternalStorage]failed to create external storage%!(EXTRA string=creating ExternalStorage for s3): failed to get region of bucket tmp: RequestError: send request failed\ncaused by: dial tcp 10.233.78.161:9000: connect: connection refused): [CDC:ErrStorageInitialize]fail to open storage for redo log\ngithub.com/pingcap/errors.AddStack\n\tgithub.com/pingcap/errors@v0.11.5-0.20231212100244-799fae176cfb/errors.go:174\ngithub.com/pingcap/errors.(*Error).GenWithStackByArgs\n\tgithub.com/pingcap/errors@v0.11.5-0.20231212100244-799fae176cfb/normalize.go:164\ngithub.com/pingcap/tiflow/pkg/errors.WrapError\n\tgithub.com/pingcap/tiflow/pkg/errors/helper.go:34\ngithub.com/pingcap/tiflow/pkg/errors.WrapChangefeedUnretryableErr\n\tgithub.com/pingcap/tiflow/pkg/errors/helper.go:155\ngithub.com/pingcap/tiflow/pkg/redo.glob..func1\n\tgithub.com/pingcap/tiflow/pkg/redo/config.go:195\ngithub.com/pingcap/tiflow/cdc/redo.(*metaManager).preStart\n\tgithub.com/pingcap/tiflow/cdc/redo/meta_manager.go:140\ngithub.com/pingcap/tiflow/cdc/redo.(*metaManager).Run\n\tgithub.com/pingcap/tiflow/cdc/redo/meta_manager.go:170\ngithub.com/pingcap/tiflow/cdc/owner.(*changefeed).initialize.func6\n\tgithub.com/pingcap/tiflow/cdc/owner/changefeed.go:663\nruntime.goexit\n\truntime/asm_amd64.s:1650"]
[2024/03/04 21:00:08.379 +08:00] [ERROR] [changefeed.go:306] ["an error occurred in Owner"] [namespace=default] [changefeed=ticdc-task] [error="[CDC:ErrChangefeedUnretryable]changefeed is in unretryable state, please check the error message, and you should manually handle it%!(EXTRA *errors.withStack=[DFLOW:ErrFailToCreateExternalStorage]failed to create external storage%!(EXTRA string=creating ExternalStorage for s3): failed to get region of bucket tmp: RequestError: send request failed\ncaused by: dial tcp 10.233.78.161:9000: connect: connection refused): [CDC:ErrStorageInitialize]fail to open storage for redo log"] [errorVerbose="[CDC:ErrChangefeedUnretryable]changefeed is in unretryable state, please check the error message, and you should manually handle it%!(EXTRA *errors.withStack=[DFLOW:ErrFailToCreateExternalStorage]failed to create external storage%!(EXTRA string=creating ExternalStorage for s3): failed to get region of bucket tmp: RequestError: send request failed\ncaused by: dial tcp 10.233.78.161:9000: connect: connection refused): [CDC:ErrStorageInitialize]fail to open storage for redo log\ngithub.com/pingcap/errors.AddStack\n\tgithub.com/pingcap/errors@v0.11.5-0.20231212100244-799fae176cfb/errors.go:174\ngithub.com/pingcap/errors.(*Error).GenWithStackByArgs\n\tgithub.com/pingcap/errors@v0.11.5-0.20231212100244-799fae176cfb/normalize.go:164\ngithub.com/pingcap/tiflow/pkg/errors.WrapError\n\tgithub.com/pingcap/tiflow/pkg/errors/helper.go:34\ngithub.com/pingcap/tiflow/pkg/errors.WrapChangefeedUnretryableErr\n\tgithub.com/pingcap/tiflow/pkg/errors/helper.go:155\ngithub.com/pingcap/tiflow/pkg/redo.glob..func1\n\tgithub.com/pingcap/tiflow/pkg/redo/config.go:195\ngithub.com/pingcap/tiflow/cdc/redo.(*metaManager).preStart\n\tgithub.com/pingcap/tiflow/cdc/redo/meta_manager.go:140\ngithub.com/pingcap/tiflow/cdc/redo.(*metaManager).Run\n\tgithub.com/pingcap/tiflow/cdc/redo/meta_manager.go:170\ngithub.com/pingcap/tiflow/cdc/owner.(*changefeed).initialize.func6\n\tgithub.com/pingcap/tiflow/cdc/owner/changefeed.go:663\nruntime.goexit\n\truntime/asm_amd64.s:1650"]
[2024/03/04 21:00:08.379 +08:00] [INFO] [ddl_puller.go:756] ["close the ddl puller"] [namespace=default] [changefeed=ticdc-task_owner_ddl_puller]
[2024/03/04 21:00:08.379 +08:00] [INFO] [ddl_sink.go:281] ["owner ddl sink background loop exits"] [namespace=default] [changefeed=ticdc-task] [error="context canceled"]
[2024/03/04 21:00:08.379 +08:00] [INFO] [shared_stream.go:157] ["event feed grpc stream exits"] [namespace=default] [changefeed=ticdc-task_owner_ddl_puller] [streamID=13] [storeID=1] [addr=tc-tikv-0.tc-tikv-peer.endless-ha-test-ticdc-tps-7170215-1-752.svc:20160] [canceled=true]
[2024/03/04 21:00:08.379 +08:00] [INFO] [shared_stream.go:157] ["event feed grpc stream exits"] [namespace=default] [changefeed=ticdc-task_owner_ddl_puller] [streamID=14] [storeID=4] [addr=tc-tikv-2.tc-tikv-peer.endless-ha-test-ticdc-tps-7170215-1-752.svc:20160] [canceled=true]
[2024/03/04 21:00:08.379 +08:00] [INFO] [shared_client.go:268] ["event feed exits"] [namespace=default] [changefeed=ticdc-task_owner_ddl_puller]
[2024/03/04 21:00:08.379 +08:00] [INFO] [multiplexing_puller.go:242] ["MultiplexingPuller exits"] [namespace=default] [changefeed=ticdc-task_owner_ddl_puller]
[2024/03/04 21:00:08.380 +08:00] [WARN] [server.go:499] ["topic handler returned error"] [error="[CDC:ErrWorkerPoolHandleCancelled]workerpool handle is cancelled"] [errorVerbose="[CDC:ErrWorkerPoolHandleCancelled]workerpool handle is cancelled\ngithub.com/pingcap/errors.AddStack\n\tgithub.com/pingcap/errors@v0.11.5-0.20231212100244-799fae176cfb/errors.go:174\ngithub.com/pingcap/errors.(*Error).GenWithStackByArgs\n\tgithub.com/pingcap/errors@v0.11.5-0.20231212100244-799fae176cfb/normalize.go:164\ngithub.com/pingcap/tiflow/pkg/workerpool.(*defaultEventHandle).GracefulUnregister.func1\n\tgithub.com/pingcap/tiflow/pkg/workerpool/pool_impl.go:230\ngithub.com/pingcap/tiflow/pkg/workerpool.(*defaultEventHandle).GracefulUnregister\n\tgithub.com/pingcap/tiflow/pkg/workerpool/pool_impl.go:252\ngithub.com/pingcap/tiflow/pkg/p2p.(*MessageServer).run.func2\n\tgithub.com/pingcap/tiflow/pkg/p2p/server.go:275\nruntime.goexit\n\truntime/asm_amd64.s:1650"]
[2024/03/04 21:00:08.380 +08:00] [INFO] [coordinator.go:261] ["schedulerv3: coordinator closed"] [ownerRev="{"revision":244137}"] [namespace=default] [changefeed=ticdc-task]
[2024/03/04 21:00:08.380 +08:00] [INFO] [changefeed.go:780] ["changefeed closed"] [namespace=default] [changefeed=ticdc-task] [isRemoved=false]
[2024/03/04 21:00:08.380 +08:00] [WARN] [server.go:297] ["handler not found"] [topic=changefeed/default/ticdc-task/scheduler]
[2024/03/04 21:00:08.423 +08:00] [INFO] [processor.go:854] ["Processor drop sort engine successfully"] [namespace=default] [changefeed=ticdc-task]
[2024/03/04 21:00:08.423 +08:00] [INFO] [processor.go:860] ["Processor try to close agent"] [namespace=default] [changefeed=ticdc-task]
[2024/03/04 21:00:08.423 +08:00] [WARN] [server.go:499] ["topic handler returned error"] [error="[CDC:ErrWorkerPoolHandleCancelled]workerpool handle is cancelled"] errorVerbose="[CDC:ErrWorkerPoolHandleCancelled]workerpool handle is cancelled[ngithub.com/pingcap/errors.AddStack\n\tgithub.com/pingcap/errors@v0.11.5-0.20231212100244-799fae176cfb/errors.go:174\ngithub.com/pingcap/errors.(*Error).GenWithStackByArgs\n\tgithub.com/pingcap/errors@v0.11.5-0.20231212100244-799fae176cfb/normalize.go:164\ngithub.com/pingcap/tiflow/pkg/workerpool.(*defaultEventHandle).GracefulUnregister.func1\n\tgithub.com/pingcap/tiflow/pkg/workerpool/pool_impl.go:230\ngithub.com/pingcap/tiflow/pkg/workerpool.(*defaultEventHandle).GracefulUnregister\n\tgithub.com/pingcap/tiflow/pkg/workerpool/pool_impl.go:252\ngithub.com/pingcap/tiflow/pkg/p2p.(*MessageServer).run.func2\n\tgithub.com/pingcap/tiflow/pkg/p2p/server.go:275\nruntime.goexit\n\truntime/asm_amd64.s:1650"]
[2024/03/04 21:00:08.423 +08:00] [INFO] [processor.go:869] ["Processor closed agent successfully"] [namespace=default] [changefeed=ticdc-task]
[2024/03/04 21:00:08.423 +08:00] [INFO] [processor.go:878] ["processor closed"] [namespace=default] [changefeed=ticdc-task]
[2024/03/04 21:00:08.423 +08:00] [WARN] [server.go:297] ["handler not found"] [topic=changefeed/default/ticdc-task/agent]

Versions of the cluster

./cdc version
Release Version: v8.0.0-alpha
Git Commit Hash: 31dd24f
Git Branch: heads/refs/tags/v8.0.0-alpha
UTC Build Time: 2024-03-01 11:38:18
Go Version: go version go1.21.6 linux/amd64
Failpoint Build: false

current status of DM cluster (execute query-status <task-name> in dmctl)

No response

@Lily2025 Lily2025 added area/dm Issues or PRs related to DM. type/bug The issue is confirmed as a bug. labels Mar 5, 2024
@Lily2025
Copy link
Author

Lily2025 commented Mar 5, 2024

/remove-area dm
/area ticdc

@ti-chi-bot ti-chi-bot bot added area/ticdc Issues or PRs related to TiCDC. and removed area/dm Issues or PRs related to DM. labels Mar 5, 2024
@Lily2025
Copy link
Author

Lily2025 commented Mar 5, 2024

/type bug
/assign sdojjy

@fubinzh
Copy link

fubinzh commented Mar 7, 2024

/severity major

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
4 participants