Describe the bug
When a standalone instance on Cloud (with only 1 CPU core) restarts while the catalog contains many Kafka sources whose brokers are down, it fails to enter the recovery phase and thus cannot function or serve any queries at all.
Error message/log
Various errors can prevent the meta service from starting up; almost all of them are some sort of "timeout", including:
Unable to start leader services: BackupStorage error: s3 error: dispatch failure: other: identity resolver timed out after 5s
lease ... keep alive timeout (lose leader)
recover mview progress should not fail: Failed to acquire connection from pool: Connection pool timed out
To Reproduce
1. export TOKIO_WORKER_THREADS=1 to simulate the resource limit of 1 CPU core in Cloud.
2. Start risingwave and a healthy Kafka cluster.
3. Create 10 Kafka sources.
4. Kill risingwave and the Kafka cluster.
5. Run nc -k -l <broker-port> to simulate a Kafka broker that never refuses a connection but also never responds.
6. Start risingwave again.
7. Observe a bunch of error logs from rdkafka.
8. Find that it is unable to serve queries and panics after a few seconds.
Expected behavior
It should at least start the meta service on port 5690 and then the frontend service, allowing users to run DROP SOURCE to drop the problematic sources.
How did you deploy RisingWave?
Single-node or standalone mode, with only 1 tokio worker thread.
Or, the free tier in RisingWave Cloud.
The version of RisingWave
No response
Additional context
By setting TOKIO_WORKER_THREADS to a larger number (like 4), the problem is addressed.
By attaching a debugger, I believe it's caused by the synchronous interfaces in rust-rdkafka (issue: fede1024/rust-rdkafka#358) called in ConnectorSourceWorker::run:
Note that fetch_metadata is actually a synchronous interface. It's marked as async only to be compatible with the madsim mocked interfaces, and its timeout is also implemented synchronously. When something is wrong with the connection, the call blocks the thread.
Since there's only 1 tokio worker thread, a blocked thread means the whole RW service is blocked. As a result, all kinds of odd timeout errors can occur.
I also believe that using multiple worker threads for deployments with only a single CPU core, especially in standalone or single-node mode, would generally be better, as it reduces the chance of encountering such problems.
For reference, the call sites are risingwave/src/meta/src/stream/source_manager.rs (lines 187 to 191 at 90eb6f1) and then risingwave/src/connector/src/source/kafka/enumerator/client.rs (lines 376 to 381 at 90eb6f1).
As a partial fix, madsim-rs/madsim#196 wraps some interfaces, such as fetch_watermarks, in spawn_blocking, which is the right way to work around this problem. However, the remaining unwrapped interfaces could still be problematic.