
bug: dedicated cdc source writes the same full key more than once in an epoch #16235

Closed
hzxa21 opened this issue Apr 10, 2024 · 5 comments · Fixed by #18553
Labels
priority/high, type/bug
Milestone

Comments

@hzxa21
Collaborator

hzxa21 commented Apr 10, 2024

Describe the bug

Recently, two users reported the following assertion being triggered during compaction, both on data related to the cdc source state table:

// Epoch from the same user key should be monotonically decreasing
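
For context, here is a minimal Rust sketch of what that assertion enforces; the type and function names are illustrative placeholders, not the actual code at hummock_sdk/src/key.rs:1066. When entries are iterated in sorted order during compaction, consecutive entries sharing a user key must carry strictly decreasing epochs, so two entries with the same user key and the same epoch trip the check.

    // Hypothetical sketch of the invariant (names and types are illustrative,
    // not taken from the RisingWave codebase).
    #[derive(Clone, PartialEq, Eq, Debug)]
    struct SketchFullKey {
        user_key: Vec<u8>,
        epoch: u64,
    }

    fn check_epoch_monotonicity(sorted_entries: &[SketchFullKey]) {
        for pair in sorted_entries.windows(2) {
            let (prev, cur) = (&pair[0], &pair[1]);
            if prev.user_key == cur.user_key {
                // Epoch from the same user key should be monotonically decreasing,
                // so an equal epoch (a duplicate write of the same full key) also panics.
                assert!(
                    cur.epoch < prev.epoch,
                    "key {:?} epoch {} >= prev epoch {}",
                    cur.user_key,
                    cur.epoch,
                    prev.epoch
                );
            }
        }
    }

    fn main() {
        // Two entries with the same user key and the same epoch, like the ones in
        // the sst dump below, will trigger the assertion.
        let k = SketchFullKey {
            user_key: b"demo-key".to_vec(),
            epoch: 6258387423461376,
        };
        check_epoch_monotonicity(&[k.clone(), k]); // panics
    }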

Here is some info from the two user reports:

  1. v1.7.2: brand new cluster and compactor panics after table creation. No more information provided.
  2. v1.8.0: brand new cluster and compactor panics after table creation (using dedicated cdc source).
    • Log:
    key UserKey { 13, TableKey { 00000001313200000000000002 } } epoch EpochWithGap(6258454417244160) >= prev epoch EpochWithGap(6258454417244160)
    stack backtrace:
    thread 'rw-compaction' panicked at /risingwave/src/storage/hummock_sdk/src/key.rs:1066:21:
    key UserKey { 13, TableKey { 00000001313200000000000002 } } epoch EpochWithGap(6258387423461376) >= prev epoch EpochWithGap(6258387423461376)
    
    • rw_internal_tables:
        select * from rw_internal_tables where id = 13;
        -[ RECORD 1 ]------------------+-------------------------------------------------------------------------------
        id                             | 13
        name                           | __internal_****_2_source_0
        schema_id                      | 7
        owner                          | 1
        definition                     | 
        acl                            | {****}
        initialized_at                 | 2024-04-09 13:45:42+00:00
        created_at                     | 2024-04-09 13:45:42+00:00
        initialized_at_cluster_version | PostgreSQL 13.14.0-RisingWave-1.8.0 (96c76cae54de990d310d243018dfd4b054118e3e)
        created_at_cluster_version     | PostgreSQL 13.14.0-RisingWave-1.8.0 (96c76cae54de990d310d243018dfd4b054118e3e)
    
    
    • rw_hummock_sstables: All SSTs related to table 13 are in level 0, and only sub_level 6258454417244160 (the epoch in the panic log) has two SSTs. That means two CNs wrote files in the same checkpoint epoch for this table id, which is strange because a dedicated cdc source has a parallelism of one, so there should only be one CN writing data to table 13 in this epoch (a small sketch of this check appears after this list).
        compaction_group_id | level_id |   sub_level_id   | sstable_id | file_size 
        ---------------------+----------+------------------+------------+-----------
       ....
        2 |        0 | 6258387423461376 |      23250 |       788
        2 |        0 | 6258387423461376 |      23240 |       788
       ....
    
    • The sst dumps of ssts 23250 and 23240 show that these two SSTs each contain only a single entry with FullKey { UserKey { 13, TableKey { 00000001313200000000000002 } }, epoch: 6258387423461376, epoch_with_gap: 6258387423461376, spill_offset: 0}, len=25. The full sst dump result can be found here.
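
To make the diagnostic above concrete, here is a small hypothetical Rust sketch (not part of the RisingWave codebase) that groups SSTs by sub-level and flags any sub-level holding more than one SST, which is what the query result above shows for table 13.

    use std::collections::HashMap;

    // (sub_level_id, sstable_id) pairs, e.g. copied from the rw_hummock_sstables query above.
    fn sub_levels_with_multiple_ssts(ssts: &[(u64, u64)]) -> Vec<(u64, Vec<u64>)> {
        let mut by_sub_level: HashMap<u64, Vec<u64>> = HashMap::new();
        for &(sub_level_id, sstable_id) in ssts {
            by_sub_level.entry(sub_level_id).or_default().push(sstable_id);
        }
        by_sub_level
            .into_iter()
            .filter(|(_, ids)| ids.len() > 1)
            .collect()
    }

    fn main() {
        // Values mirror the rows shown above; with a single-parallelism cdc source
        // there should be at most one SST per sub-level for this table's data.
        let ssts = [(6258387423461376u64, 23250u64), (6258387423461376, 23240)];
        for (sub_level, ids) in sub_levels_with_multiple_ssts(&ssts) {
            println!("sub_level {sub_level} has multiple SSTs: {ids:?}");
        }
    }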

Error message/log

No response

To Reproduce

No response

Expected behavior

No response

How did you deploy RisingWave?

No response

The version of RisingWave

No response

Additional context

No response

hzxa21 added the type/bug label on Apr 10, 2024
github-actions bot added this to the release-1.9 milestone on Apr 10, 2024
@hzxa21
Collaborator Author

hzxa21 commented Apr 10, 2024

@yezizp2012 @StrikeW We may need to check whether this is a bug in cdc source or in meta. If there is a race in actor assignment in meta, it may affect other use cases as well.

@tcodehuber

tcodehuber commented Apr 20, 2024

I met the same issue after I rebooted the RisingWave cluster.
The error log was found in the compactor pod:

thread 'rw-compaction' panicked at /risingwave/src/storage/hummock_sdk/src/key.rs:1066:21: key UserKey { 246, TableKey { 00000001312d30000000000003 } } epoch EpochWithGap(6317061581963264) >= prev epoch EpochWithGap(6317061581963264)
2024-04-20T15:34:02.883393802Z INFO risingwave_storage::hummock::compactor::compactor_runner: Ready to handle compaction group 2 task: 147635 compact_task_statistics CompactTaskStatistics { total_file_count: 44, total_key_count: 46, total_file_size: 25004, total_uncompressed_file_size: 24590 } target_level 0 compression_algorithm 0 table_ids [76, 86, 96, 106, 111, 131, 141, 196, 201, 226, 231, 246, 256] parallelism 1

@hzxa21
Collaborator Author

hzxa21 commented Jul 10, 2024

Another occurrence that is probably related to this issue: https://risingwave-community.slack.com/archives/C03BW71523T/p1719592780560509

@zwang28
Contributor

zwang28 commented Jul 24, 2024

Another occurrence in v1.10.0-rc3. However, because the cluster had already been reset, we don't know which table was problematic.

@hzxa21
Collaborator Author

hzxa21 commented Aug 2, 2024
