Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[hotfix] Fix schema registry flush state switch logic with multiple parallelism #3567

Merged
merged 1 commit into from
Aug 23, 2024

Conversation

yuxiqian
Copy link
Contributor

@yuxiqian yuxiqian commented Aug 23, 2024

Currently, SchemaRegistryRequestHandler regards Flush success when it has collected all FlushSuccessEvent from all registered subTasks.

if (flushedSinkWriters.equals(activeSinkWriters)) {
    // ...
}

However, some subTasks might not be registered early enough, which means flush success condition might be falsely triggered:

// 6 early came sink tasks have registered & flushed successfully
SchemaRegistry - Sink subtask 5 succeed flushing for table default_namespace.default_schema.table1. Attempt number: 0
SchemaRegistry - Sink subtask 0 succeed flushing for table default_namespace.default_schema.table1. Attempt number: 0
SchemaRegistry - Sink subtask 1 succeed flushing for table default_namespace.default_schema.table1. Attempt number: 0
SchemaRegistry - Sink subtask 2 succeed flushing for table default_namespace.default_schema.table1. Attempt number: 0
SchemaRegistry - Sink subtask 4 succeed flushing for table default_namespace.default_schema.table1. Attempt number: 0
SchemaRegistry - Sink subtask 7 succeed flushing for table default_namespace.default_schema.table1. Attempt number: 0

// Mistake here: Schema registry thought it has collected enough flush events
schemaChangeStatus WAITING_FOR_FLUSH -> APPLYING for request [CreateTableEvent{tableId=default_namespace.default_schema.table1, schema=columns={`col1` STRING,`col2` STRING}, primaryKeys=col1, options=()}]. Collected enough flush writers [0, 1, 2, 4, 5, 7] of [0, 1, 2, 4, 5, 7]

// This is not correct: some subtasks have not been registered yet
SchemaRegistryRequestHandler - All sink subtask have flushed for table default_namespace.default_schema.table1. Start to apply schema change.

// Late-coming flush events will wrongly trigger transition again
SchemaRegistryRequestHandler - Register sink subtask 3.
SchemaRegistryRequestHandler - Register sink subtask 6.
SchemaRegistry - Sink subtask 3 succeed flushing for table default_namespace.default_schema.table1. Attempt number: 0
SchemaRegistry - Sink subtask 6 succeed flushing for table default_namespace.default_schema.table1. Attempt number: 0

// Causing schema registry crash
java.lang.IllegalStateException: Illegal schemaChangeStatus state: should be WAITING_FOR_FLUSH before collecting enough FlushEvents, not FINISHED. When receiving sink success from 6 with table default_namespace.default_schema.table1. Current flushedSinkWriters: [0, 1, 2, 3, 4, 5, 6, 7]; activeSinkWriters: [0, 1, 2, 3, 4, 5, 6, 7]

After this change, the WAITING_FOR_FLUSH -> APPLYING transition also verifies if all subTasks has been registered (should be >= parallelism).

SchemaRegistry - Sink subtask 4 succeed flushing for table default_namespace.default_schema.table1. Attempt number: 0
SchemaRegistryRequestHandler - Not all active sink writers have been registered.
SchemaRegistry - Sink subtask 2 succeed flushing for table default_namespace.default_schema.table1. Attempt number: 0
SchemaRegistryRequestHandler - Not all active sink writers have been registered.
SchemaRegistry - Sink subtask 6 succeed flushing for table default_namespace.default_schema.table1. Attempt number: 0
SchemaRegistryRequestHandler - Not all active sink writers have been registered.
SchemaRegistry - Sink subtask 3 succeed flushing for table default_namespace.default_schema.table1. Attempt number: 0
SchemaRegistryRequestHandler - Not all active sink writers have been registered.
SchemaRegistry - Sink subtask 0 succeed flushing for table default_namespace.default_schema.table1. Attempt number: 0
SchemaRegistryRequestHandler - Not all active sink writers have been registered.
SchemaRegistry - Sink subtask 1 succeed flushing for table default_namespace.default_schema.table1. Attempt number: 0
SchemaRegistryRequestHandler - Not all active sink writers have been registered.
SchemaRegistry - Sink subtask 5 succeed flushing for table default_namespace.default_schema.table1. Attempt number: 0
SchemaRegistryRequestHandler - Not all active sink writers have been registered.

// Will not start schema change until all sinks registered and flushed
SchemaRegistryRequestHandler - Register sink subtask 7.
SchemaRegistry - Sink subtask 7 succeed flushing for table default_namespace.default_schema.table1. Attempt number: 0

// OK now
SchemaRegistryRequestHandler - SchemaChangeStatus WAITING_FOR_FLUSH -> APPLYING
SchemaRegistryRequestHandler - All sink subtask have flushed for table default_namespace.default_schema.table1. Start to apply schema change.

@yuxiqian
Copy link
Contributor Author

@leonardBang @ruanhang1993 PTAL

@yuxiqian yuxiqian force-pushed the hotfix/schema-registry-hangling branch from 62ee539 to 1c940d8 Compare August 23, 2024 04:00
@yuxiqian yuxiqian changed the title [hotfix] Fix schema registry hanging in multiple parallelism [hotfix] Fix schema registry flush state switch logic with multiple parallelism Aug 23, 2024
Copy link
Contributor

@leonardBang leonardBang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, thanks @yuxiqian for the hotfix

@leonardBang leonardBang merged commit 060d203 into apache:master Aug 23, 2024
17 of 21 checks passed
qiaozongmi pushed a commit to qiaozongmi/flink-cdc that referenced this pull request Sep 23, 2024
ChaomingZhangCN pushed a commit to ChaomingZhangCN/flink-cdc that referenced this pull request Jan 13, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants