puller(ticdc): fix wrong update splitting behavior after table scheduling (#11269) #11281

Merged

Conversation

ti-chi-bot
Member

This is an automated cherry-pick of #11269

What problem does this PR solve?

Issue Number: close #11219

What is changed and how it works?

In #11030, we introduced a mechanism that fetches the current timestamp thresholdTS from PD when a changefeed starts, and splits every update kv entry whose commitTS is smaller than thresholdTS.
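
As a rough illustration (hypothetical types and names, not the actual TiCDC code), the rule from #11030 boils down to comparing an update entry's commitTS with the thresholdTS fetched from PD:

```go
package main

import "fmt"

// updateEntry is a simplified stand-in for an update kv entry seen by the puller.
type updateEntry struct {
	commitTS uint64
	oldRow   string
	newRow   string
}

// splitUpdate turns an update whose commitTS is below thresholdTS into a
// delete of the old row followed by an insert of the new row; otherwise the
// update is emitted unchanged.
func splitUpdate(e updateEntry, thresholdTS uint64) []string {
	if e.commitTS < thresholdTS {
		return []string{"DELETE " + e.oldRow, "INSERT " + e.newRow}
	}
	return []string{"UPDATE " + e.oldRow + " -> " + e.newRow}
}

func main() {
	thresholdTS := uint64(100) // fetched from PD at changefeed start
	fmt.Println(splitUpdate(updateEntry{90, "r1-old", "r1-new"}, thresholdTS))  // split
	fmt.Println(splitUpdate(updateEntry{120, "r2-old", "r2-new"}, thresholdTS)) // not split
}
```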

This mechanism has the following problem:

  1. There are two CDC nodes A and B, and B starts before A, so thresholdTS_B < thresholdTS_A;
  2. The sync task of table t is first scheduled to node A;
  3. Table t has an update event whose commitTS is smaller than thresholdTS_A and larger than thresholdTS_B, so node A splits the update event into a delete event and an insert event;
  4. The delete event and the insert event cannot be sent downstream atomically. If the sync task of table t is scheduled to node B after the delete event has been sent downstream but before the insert event is sent, node B receives the update event again;
  5. Node B does not split the update event because its commitTS is larger than thresholdTS_B, so node B sends a plain UPDATE SQL downstream, which causes data inconsistency (see the sketch after this list).
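
A minimal sketch of this race (hypothetical values, not TiCDC code): with thresholdTS_B < commitTS < thresholdTS_A, the two nodes make opposite splitting decisions for the same event:

```go
package main

import "fmt"

// decide reports how a node with the given thresholdTS emits an update event
// with the given commitTS.
func decide(node string, thresholdTS, commitTS uint64) {
	if commitTS < thresholdTS {
		fmt.Printf("%s: commitTS %d < thresholdTS %d -> split into DELETE + INSERT\n", node, commitTS, thresholdTS)
		return
	}
	fmt.Printf("%s: commitTS %d >= thresholdTS %d -> plain UPDATE\n", node, commitTS, thresholdTS)
}

func main() {
	thresholdTSB := uint64(100) // node B started earlier, so its thresholdTS is smaller
	thresholdTSA := uint64(200)
	commitTS := uint64(150) // thresholdTS_B < commitTS < thresholdTS_A

	decide("node A", thresholdTSA, commitTS) // split: DELETE then INSERT
	// If table t is rescheduled to node B after the DELETE is sent downstream
	// but before the INSERT, node B re-sends the same event without splitting it:
	decide("node B", thresholdTSB, commitTS) // plain UPDATE -> downstream inconsistency
}
```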

Another thing to note: after scheduling, node B will send some events downstream that node A has already sent, so node B must send these events in an idempotent way.
Previously, this was handled by fetching a replicateTS in the sink module when the sink started and splitting the events whose commitTS is smaller than replicateTS, but that mechanism was also removed in #11030. So we need to handle this case in the puller too.

In this PR, instead of maintaining a separate thresholdTS in the source manager, the puller fetches the replicateTS from the sink whenever it needs to decide whether to split an update event.
Since the puller module starts working before the sink module, replicateTS is given a default value of MaxUint64, which means all update events are split. After the sink starts working, replicateTS is set to the correct value.
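
A minimal sketch of that rule (assumed names, not the actual TiCDC API): the puller asks the sink for its replicateTS, and before the sink has started the value defaults to MaxUint64 so every update is split:

```go
package main

import (
	"fmt"
	"math"
	"sync/atomic"
)

// sinkState is a hypothetical stand-in for the state the sink exposes to the puller.
type sinkState struct {
	replicateTS atomic.Uint64
}

func newSinkState() *sinkState {
	s := &sinkState{}
	s.replicateTS.Store(math.MaxUint64) // sink not started yet: split everything
	return s
}

// start simulates the sink starting up and recording the real replicateTS.
func (s *sinkState) start(ts uint64) { s.replicateTS.Store(ts) }

// shouldSplitUpdate is the decision the puller makes for each update event.
func shouldSplitUpdate(commitTS uint64, s *sinkState) bool {
	return commitTS < s.replicateTS.Load()
}

func main() {
	sink := newSinkState()
	fmt.Println(shouldSplitUpdate(150, sink)) // true: sink has not started, default MaxUint64
	sink.start(100)                           // sink starts, real replicateTS is known
	fmt.Println(shouldSplitUpdate(150, sink)) // false: commitTS >= replicateTS
}
```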

One last thing to note: when the sink restarts due to an error, it may send some events downstream that were already sent before the restart. These events also need to be sent in an idempotent way, but they are already in the sorter, so merely restarting the sink cannot achieve this. Therefore this PR forbids restarting the sink alone and restarts the whole changefeed when an error occurs.
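
A simplified sketch of that error-handling choice (hypothetical helpers, not the real code path): a sink error is propagated upward so the whole changefeed restarts, rather than rebuilding only the sink:

```go
package main

import (
	"errors"
	"fmt"
)

// handleSinkError propagates the error to the changefeed level instead of
// restarting only the sink, because events already in the sorter would
// otherwise be re-sent without the idempotent split.
func handleSinkError(err error, restartChangefeed func(error)) {
	if err == nil {
		return
	}
	restartChangefeed(err) // restart the whole changefeed, not just the sink
}

func main() {
	handleSinkError(errors.New("downstream write failed"), func(err error) {
		fmt.Println("restarting changefeed due to:", err)
	})
}
```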

Check List

Tests

  • Manual test (add detailed scripts or steps below)
  1. Deploy a cluster with three CDC nodes;
  2. Kill nodes from time to time while running the workload and check whether the data is consistent;

Questions

Will it cause performance regression or break compatibility?
Do you need to update user documentation, design documentation or monitoring documentation?

Release note

None

Signed-off-by: ti-chi-bot <ti-community-prow-bot@tidb.io>
@ti-chi-bot ti-chi-bot added lgtm release-note-none Denotes a PR that doesn't merit a release note. labels Jun 11, 2024
@ti-chi-bot ti-chi-bot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. type/cherry-pick-for-release-7.1 This PR is cherry-picked to release-7.1 from a source PR. labels Jun 11, 2024
@ti-chi-bot ti-chi-bot bot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jun 11, 2024
@ti-chi-bot ti-chi-bot bot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Jun 11, 2024
@lidezhu
Collaborator

lidezhu commented Jun 11, 2024

/retest

Contributor

ti-chi-bot bot commented Jun 11, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: lidezhu

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot ti-chi-bot bot added the approved label Jun 11, 2024
@ti-chi-bot ti-chi-bot added the cherry-pick-approved Cherry pick PR approved by release team. label Jun 11, 2024
@ti-chi-bot ti-chi-bot bot merged commit f203805 into pingcap:release-7.1 Jun 11, 2024
11 of 12 checks passed
@lidezhu lidezhu deleted the cherry-pick-11269-to-release-7.1 branch June 11, 2024 12:15