ErrSnapshotSchemaNotFound or ErrSchemaStorageGCed after migrating tables #1067

Closed
liuzix opened this issue Nov 10, 2020 · 2 comments · Fixed by #1069
Comments

liuzix (Contributor) commented Nov 10, 2020

Bug Report

This issue is part of the problem described in #1056

Please answer these questions before submitting your issue. Thanks!

  1. What did you do? If possible, provide a recipe for reproducing the error.
    Run a changefeed on three CDC nodes, with five tables and a total QPS of around 5000. The changefeed stops with ErrSnapshotSchemaNotFound or ErrSchemaStorageGCed after scaling in and then scaling out the CDC cluster.

  2. What did you expect to see?
    That the changefeed runs normally.

  3. What did you see instead?
    The errors ErrSnapshotSchemaNotFound or ErrSchemaStorageGCed.

  4. Versions of the cluster
    All components are at v4.0.8.

Cause Analysis

This problem is caused by bugs in handling table migrations. Because stopping a table is not instantaneous, the checkpoint advances a little beyond the checkpoint at which the owner started the migration, yet the owner has already recorded that earlier checkpoint in replica-info.startTs. When the table is restarted on another capture that has not previously run this changefeed, that capture's SchemaStorage has no information about schemas before the current checkpoint, but the table pullers fetch data starting from replica-info.startTs, which lies before the current checkpoint. Hence the errors.
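
To make the failure mode concrete, here is a minimal Go sketch. It is not TiCDC's actual API; the `schemaStorage` type and its `gcTs`/`resolvedTs` fields are hypothetical, standing in for a per-capture schema cache that only covers timestamps from the current checkpoint onward:

```go
package main

import "fmt"

// schemaStorage stands in for a per-capture schema cache that only covers
// the window [gcTs, resolvedTs]; snapshots outside that window are unavailable.
type schemaStorage struct {
	gcTs       uint64 // everything below this has been garbage-collected
	resolvedTs uint64 // schema versions are known up to this ts
}

// GetSnapshot fails for timestamps outside the covered window, loosely
// mirroring the two errors seen in this issue.
func (s *schemaStorage) GetSnapshot(ts uint64) error {
	if ts < s.gcTs {
		return fmt.Errorf("ErrSchemaStorageGCed: ts %d < gcTs %d", ts, s.gcTs)
	}
	if ts > s.resolvedTs {
		return fmt.Errorf("ErrSnapshotSchemaNotFound: no snapshot at ts %d", ts)
	}
	return nil
}

func main() {
	// On a capture that has never run this changefeed, the storage only
	// covers timestamps at or after the current global checkpoint.
	storage := &schemaStorage{gcTs: 430_000, resolvedTs: 430_100}

	// replica-info.startTs was recorded before the table fully stopped,
	// so it lags behind the checkpoint the new capture starts from.
	staleStartTs := uint64(429_900)

	fmt.Println(storage.GetSnapshot(staleStartTs)) // ErrSchemaStorageGCed
}
```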

@liuzix liuzix added priority/P0 The issue has P0 priority. component/scheduler TiCDC inner scheduler component. bug-from-user Bugs found by users. labels Nov 10, 2020
@liuzix liuzix added this to the v4.0.9 milestone Nov 10, 2020
@liuzix liuzix self-assigned this Nov 10, 2020
liuzix (Contributor, Author) commented Nov 10, 2020

My thoughts on how to fix: I'm wondering if we can just let the global checkpoint override the startTs of a table, i.e. we always start the table puller with the current global checkpoint as the starting point. @amyangfei @leoppro

@amyangfei amyangfei added the type/bug The issue is confirmed as a bug. label Nov 11, 2020
amyangfei (Contributor) commented Nov 11, 2020

My thoughts on how to fix: I'm wondering if we can just let the global checkpoint override the startTs of a table, i.e. we always start the table puller with the current global checkpoint as the starting point. @amyangfei @leoppro

Maybe this bug can be fixed in one of the following ways:

  1. as you described, use max(startTs, globalCheckpointTs) to start the puller (sketched below);
  2. or change the logic of the schema snapshot, such as amending the missing schema versions, which is more complex.

IMO the first solution is fine; we could add an integration test and give it a try @liuzix
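
For reference, a minimal sketch of option 1. The names (effectiveStartTs, replicaStartTs, globalCheckpointTs) are hypothetical and this is not the actual TiCDC processor code, just the clamping idea:

```go
package main

import "fmt"

// effectiveStartTs clamps the recorded table startTs to the changefeed's
// current global checkpoint, so the puller never requests data (or schema
// snapshots) older than what a freshly joined capture can serve.
func effectiveStartTs(replicaStartTs, globalCheckpointTs uint64) uint64 {
	if globalCheckpointTs > replicaStartTs {
		return globalCheckpointTs
	}
	return replicaStartTs
}

func main() {
	// Stale startTs recorded by the owner before the table fully stopped.
	replicaStartTs := uint64(429_900)
	// Checkpoint the changefeed has actually advanced to.
	globalCheckpointTs := uint64(430_000)

	fmt.Println(effectiveStartTs(replicaStartTs, globalCheckpointTs)) // 430000
}
```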

@AkiraXie AkiraXie added the area/ticdc Issues or PRs related to TiCDC. label Mar 9, 2022