Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FLINK-36086] Set isSnapshotCompleted only immediately after snapshot #3552

Closed
wants to merge 1 commit into from

Conversation

morozov
Copy link
Contributor

@morozov morozov commented Aug 18, 2024

If the SQL Server source connector is restarted while handling updates from a transaction with multiple updates, upon restart, it will skip the non-processed changes and proceed from the next transaction.

This is an analog of DBZ-1128 but reproducible only in Flink CDC.

This is a regression introduced in #2176.

Lower-level details

The isSnapshotCompleted offset context flag in Debezium tells the source connector to jump one transaction ahead in the beginning of the streaming phase. This is only necessary during the transition from the initial state snapshot to streaming. Without this flag set, the streaming change data source would stream the changes from the transaction that was already included in the snapshot. This is is the issue that #2176 attempted to address.

The following line sets this flag unconditionally for the stream split:

It makes Debezium jump one transaction ahead not only after completing the initial state snapshot but always during the start of streaming. This way, it may potentially skip changes from a transaction that wasn't fully captured prior to the restart.

The proposed solution is to set it only during the transition from the initial state snapshot to streaming.

Testing considerations

  1. In order to implement a test for the scenario in question, I would need a way to restart the connector in the middle of processing a transaction (something like start(..., Predicate<SourceRecord> isStopRecord) in the Debezium testing framework). How could I implement a similar case in Flink CDC?
  2. The test implemented in [sqlserver] Fix old change data that will be captured when the latest… #2176 doesn't fail even if the fix for the issue it covers is reverted, so at this point it's not clear how to reproduce the original issue.

@morozov morozov marked this pull request as ready for review August 18, 2024 21:38
@leonardBang leonardBang requested a review from GOODBOY008 August 19, 2024 13:11
@leonardBang
Copy link
Contributor

@GOODBOY008 Would you like to take a look at this PR when you have time?

@morozov
Copy link
Contributor Author

morozov commented Aug 28, 2024

@GOODBOY008 do you have an idea why the test introduced in #2176 passes even with the fix reverted? I'd like to make sure that I'm not breaking the original fix. I would also appreciate advise on testing the restart-mid-transaction scenario, if that's possible.

@GOODBOY008
Copy link
Member

GOODBOY008 commented Aug 29, 2024

@GOODBOY008 do you have an idea why the test introduced in #2176 passes even with the fix reverted? I'd like to make sure that I'm not breaking the original fix. I would also appreciate advise on testing the restart-mid-transaction scenario, if that's possible.

@morozov , I had try remove sourceFetchContext.getOffsetContext().preSnapshotCompletion(); and unit test also passed. I'm researching on it.

Copy link

This pull request has been automatically marked as stale because it has not had recent activity for 60 days. It will be closed in 30 days if no further activity occurs.

@github-actions github-actions bot added the Stale label Oct 29, 2024
@morozov
Copy link
Contributor Author

morozov commented Oct 29, 2024

@leonardBang, @GOODBOY008, what are the next steps to get this merged?

@leonardBang
Copy link
Contributor

Hey @GOODBOY008 Would you like to review this PR when you have time ?

@morozov
Copy link
Contributor Author

morozov commented Jan 15, 2025

@leonardBang this PR is getting merge conflicts and may require extra care. Are you interested in getting this applied?

@GOODBOY008
Copy link
Member

@morozov I will design and add unit tests to verify the effectiveness of the changes.

@morozov
Copy link
Contributor Author

morozov commented Feb 22, 2025

Fixed in #3873.

@morozov morozov closed this Feb 22, 2025
@morozov morozov deleted the FLINK-36086 branch February 22, 2025 08:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants