ErrSnapshotSchemaNotFound or ErrSchemaStorageGCed after migrating tables #1067

Closed
liuzix opened this issue Nov 10, 2020 · 2 comments · Fixed by #1069
Comments

liuzix (Contributor) commented Nov 10, 2020

Bug Report

This issue is part of the problem described in #1056

Please answer these questions before submitting your issue. Thanks!

  1. What did you do? If possible, provide a recipe for reproducing the error.
    Run a changefeed on three CDC nodes, with five tables and a total QPS of around 5000. The changefeed stops with ErrSnapshotSchemaNotFound or ErrSchemaStorageGCed after scaling in and then scaling out the CDC cluster.

  2. What did you expect to see?
    That the changefeed runs normally.

  3. What did you see instead?
    The errors ErrSnapshotSchemaNotFound or ErrSchemaStorageGCed.

  4. Versions of the cluster
    All components are at v4.0.8.

Cause Analysis

This problem is caused by bugs in handling table migrations. Because stopping a table is not instantaneous, the checkpoint advances a little beyond the checkpoint at which the owner started the migration, yet the owner has already recorded that earlier checkpoint in replica-info.startTs. When the table is restarted on another capture that has not previously run this changefeed, that capture's SchemaStorage has no information about schemas before the current checkpoint, but the table pullers fetch data starting from replica-info.startTs, which lies before the current checkpoint. Hence the errors.
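
To make the failure mode concrete, here is a minimal Go sketch. It is not TiCDC's actual API; the `schemaStorage` type and its `gcTs`/`resolvedTs` fields are hypothetical, standing in for a per-capture schema cache that only covers timestamps from the current checkpoint onward:

```go
package main

import "fmt"

// schemaStorage stands in for a per-capture schema cache that only covers
// the window [gcTs, resolvedTs]; snapshots outside that window are unavailable.
type schemaStorage struct {
	gcTs       uint64 // everything below this has been garbage-collected
	resolvedTs uint64 // schema versions are known up to this ts
}

// GetSnapshot fails for timestamps outside the covered window, loosely
// mirroring the two errors seen in this issue.
func (s *schemaStorage) GetSnapshot(ts uint64) error {
	if ts < s.gcTs {
		return fmt.Errorf("ErrSchemaStorageGCed: ts %d < gcTs %d", ts, s.gcTs)
	}
	if ts > s.resolvedTs {
		return fmt.Errorf("ErrSnapshotSchemaNotFound: no snapshot at ts %d", ts)
	}
	return nil
}

func main() {
	// On a capture that has never run this changefeed, the storage only
	// covers timestamps at or after the current global checkpoint.
	storage := &schemaStorage{gcTs: 430_000, resolvedTs: 430_100}

	// replica-info.startTs was recorded before the table fully stopped,
	// so it lags behind the checkpoint the new capture starts from.
	staleStartTs := uint64(429_900)

	fmt.Println(storage.GetSnapshot(staleStartTs)) // ErrSchemaStorageGCed
}
```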

@liuzix liuzix added priority/P0 The issue has P0 priority. component/scheduler TiCDC inner scheduler component. bug-from-user Bugs found by users. labels Nov 10, 2020
@liuzix liuzix added this to the v4.0.9 milestone Nov 10, 2020
@liuzix liuzix self-assigned this Nov 10, 2020
liuzix (Contributor, Author) commented Nov 10, 2020

My thoughts on how to fix: I'm wondering if we can just let the global checkpoint override the startTs of a table, i.e. we always start the table puller with the current global checkpoint as the starting point. @amyangfei @leoppro

@amyangfei amyangfei added the type/bug The issue is confirmed as a bug. label Nov 11, 2020
amyangfei (Contributor) commented Nov 11, 2020

My thoughts on how to fix: I'm wondering if we can just let the global checkpoint override the startTs of a table, i.e. we always start the table puller with the current global checkpoint as the starting point. @amyangfei @leoppro

Maybe this bug can be fixed in one of the following ways:

  1. as you described, use max(startTs, globalCheckpointTs) to start the puller (sketched below);
  2. or change the logic of the schema snapshot, such as amending the missing schema versions, which is more complex.

IMO the first solution is fine; we could add an integration test and give it a try @liuzix
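
For reference, a minimal sketch of option 1. The names (effectiveStartTs, replicaStartTs, globalCheckpointTs) are hypothetical and this is not the actual TiCDC processor code, just the clamping idea:

```go
package main

import "fmt"

// effectiveStartTs clamps the recorded table startTs to the changefeed's
// current global checkpoint, so the puller never requests data (or schema
// snapshots) older than what a freshly joined capture can serve.
func effectiveStartTs(replicaStartTs, globalCheckpointTs uint64) uint64 {
	if globalCheckpointTs > replicaStartTs {
		return globalCheckpointTs
	}
	return replicaStartTs
}

func main() {
	// Stale startTs recorded by the owner before the table fully stopped.
	replicaStartTs := uint64(429_900)
	// Checkpoint the changefeed has actually advanced to.
	globalCheckpointTs := uint64(430_000)

	fmt.Println(effectiveStartTs(replicaStartTs, globalCheckpointTs)) // 430000
}
```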

@AkiraXie AkiraXie added the area/ticdc Issues or PRs related to TiCDC. label Mar 9, 2022