
bug: dedicated cdc source writes the same full key more than once in an epoch #16235

Closed
hzxa21 opened this issue Apr 10, 2024 · 5 comments · Fixed by #18553
Labels
priority/high, type/bug
Milestone

Comments

@hzxa21
Collaborator

hzxa21 commented Apr 10, 2024

Describe the bug

Recently, two users reported the following assertion being triggered during compaction, both on data related to the cdc source state table:

// Epoch from the same user key should be monotonically decreasing
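
For context, here is a minimal Rust sketch of what that assertion enforces; the type and function names are illustrative placeholders, not the actual code at hummock_sdk/src/key.rs:1066. When entries are iterated in sorted order during compaction, consecutive entries sharing a user key must carry strictly decreasing epochs, so two entries with the same user key and the same epoch trip the check.

    // Hypothetical sketch of the invariant (names and types are illustrative,
    // not taken from the RisingWave codebase).
    #[derive(Clone, PartialEq, Eq, Debug)]
    struct SketchFullKey {
        user_key: Vec<u8>,
        epoch: u64,
    }

    fn check_epoch_monotonicity(sorted_entries: &[SketchFullKey]) {
        for pair in sorted_entries.windows(2) {
            let (prev, cur) = (&pair[0], &pair[1]);
            if prev.user_key == cur.user_key {
                // Epoch from the same user key should be monotonically decreasing,
                // so an equal epoch (a duplicate write of the same full key) also panics.
                assert!(
                    cur.epoch < prev.epoch,
                    "key {:?} epoch {} >= prev epoch {}",
                    cur.user_key,
                    cur.epoch,
                    prev.epoch
                );
            }
        }
    }

    fn main() {
        // Two entries with the same user key and the same epoch, like the ones in
        // the sst dump below, will trigger the assertion.
        let k = SketchFullKey {
            user_key: b"demo-key".to_vec(),
            epoch: 6258387423461376,
        };
        check_epoch_monotonicity(&[k.clone(), k]); // panics
    }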

Here is some info from the two user reports:

  1. v1.7.2: brand new cluster and compactor panics after table creation. No more information provided.
  2. v1.8.0: brand new cluster and compactor panics after table creation (using dedicated cdc source).
    • Log:
    key UserKey { 13, TableKey { 00000001313200000000000002 } } epoch EpochWithGap(6258454417244160) >= prev epoch EpochWithGap(6258454417244160)
    stack backtrace:
    thread 'rw-compaction' panicked at /risingwave/src/storage/hummock_sdk/src/key.rs:1066:21:
    key UserKey { 13, TableKey { 00000001313200000000000002 } } epoch EpochWithGap(6258387423461376) >= prev epoch EpochWithGap(6258387423461376)
    
    • rw_internal_tables:
        select * from rw_internal_tables where id = 13;
        -[ RECORD 1 ]------------------+-------------------------------------------------------------------------------
        id                             | 13
        name                           | __internal_****_2_source_0
        schema_id                      | 7
        owner                          | 1
        definition                     | 
        acl                            | {****}
        initialized_at                 | 2024-04-09 13:45:42+00:00
        created_at                     | 2024-04-09 13:45:42+00:00
        initialized_at_cluster_version | PostgreSQL 13.14.0-RisingWave-1.8.0 (96c76cae54de990d310d243018dfd4b054118e3e)
        created_at_cluster_version     | PostgreSQL 13.14.0-RisingWave-1.8.0 (96c76cae54de990d310d243018dfd4b054118e3e)
    
    
    • rw_hummock_sstables: All SSTs related to table 13 are in level 0, and only sub_level 6258454417244160 (the epoch in the panic log) has two SSTs. That means two CNs wrote files in the same checkpoint epoch for this table id, which is strange because a dedicated cdc source has a parallelism of one, so there should only be one CN writing data to table 13 in this epoch (a small sketch of this check appears after this list).
        compaction_group_id | level_id |   sub_level_id   | sstable_id | file_size 
        ---------------------+----------+------------------+------------+-----------
       ....
        2 |        0 | 6258387423461376 |      23250 |       788
        2 |        0 | 6258387423461376 |      23240 |       788
       ....
    
    • The sst dumps of ssts 23250 and 23240 show that these two SSTs each contain only a single entry with FullKey { UserKey { 13, TableKey { 00000001313200000000000002 } }, epoch: 6258387423461376, epoch_with_gap: 6258387423461376, spill_offset: 0}, len=25. The full sst dump result can be found here.
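
To make the diagnostic above concrete, here is a small hypothetical Rust sketch (not part of the RisingWave codebase) that groups SSTs by sub-level and flags any sub-level holding more than one SST, which is what the query result above shows for table 13.

    use std::collections::HashMap;

    // (sub_level_id, sstable_id) pairs, e.g. copied from the rw_hummock_sstables query above.
    fn sub_levels_with_multiple_ssts(ssts: &[(u64, u64)]) -> Vec<(u64, Vec<u64>)> {
        let mut by_sub_level: HashMap<u64, Vec<u64>> = HashMap::new();
        for &(sub_level_id, sstable_id) in ssts {
            by_sub_level.entry(sub_level_id).or_default().push(sstable_id);
        }
        by_sub_level
            .into_iter()
            .filter(|(_, ids)| ids.len() > 1)
            .collect()
    }

    fn main() {
        // Values mirror the rows shown above; with a single-parallelism cdc source
        // there should be at most one SST per sub-level for this table's data.
        let ssts = [(6258387423461376u64, 23250u64), (6258387423461376, 23240)];
        for (sub_level, ids) in sub_levels_with_multiple_ssts(&ssts) {
            println!("sub_level {sub_level} has multiple SSTs: {ids:?}");
        }
    }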

Error message/log

No response

To Reproduce

No response

Expected behavior

No response

How did you deploy RisingWave?

No response

The version of RisingWave

No response

Additional context

No response

hzxa21 added the type/bug label on Apr 10, 2024
github-actions bot added this to the release-1.9 milestone on Apr 10, 2024
@hzxa21
Collaborator Author

hzxa21 commented Apr 10, 2024

@yezizp2012 @StrikeW We may need to check whether this is a bug in cdc source or in meta. If there is a race in actor assignment in meta, it may affect other use cases as well.

@tcodehuber

tcodehuber commented Apr 20, 2024

I met the same issue after I rebooted the RisingWave cluster.
The error log was found in the compactor pod:

thread 'rw-compaction' panicked at /risingwave/src/storage/hummock_sdk/src/key.rs:1066:21: key UserKey { 246, TableKey { 00000001312d30000000000003 } } epoch EpochWithGap(6317061581963264) >= prev epoch EpochWithGap(6317061581963264)
2024-04-20T15:34:02.883393802Z INFO risingwave_storage::hummock::compactor::compactor_runner: Ready to handle compaction group 2 task: 147635 compact_task_statistics CompactTaskStatistics { total_file_count: 44, total_key_count: 46, total_file_size: 25004, total_uncompressed_file_size: 24590 } target_level 0 compression_algorithm 0 table_ids [76, 86, 96, 106, 111, 131, 141, 196, 201, 226, 231, 246, 256] parallelism 1

@hzxa21
Collaborator Author

hzxa21 commented Jul 10, 2024

Another occurrence that is probably related to this issue: https://risingwave-community.slack.com/archives/C03BW71523T/p1719592780560509

@zwang28
Contributor

zwang28 commented Jul 24, 2024

Another occurrence in v1.10.0-rc3. However, because the cluster had already been reset, we don't know which table was problematic.

@hzxa21
Collaborator Author

hzxa21 commented Aug 2, 2024
