Describe the bug
When a standalone instance on Cloud (with only 1 CPU core) restarts while the catalog contains many Kafka sources whose brokers are down, it fails to enter the recovery phase and thus cannot function or serve any queries at all.
Error message/log
Various errors can prevent the meta service from starting up; almost all of them are some sort of "timeout", including:
Unable to start leader services: BackupStorage error: s3 error: dispatch failure: other: identity resolver timed out after 5s
lease ... keep alive timeout (lose leader)
recover mview progress should not fail: Failed to acquire connection from pool: Connection pool timed out
To Reproduce
1. export TOKIO_WORKER_THREADS=1 to simulate the resource limit of 1 CPU core in Cloud.
2. Start risingwave and a healthy Kafka cluster.
3. Create 10 Kafka sources.
4. Kill risingwave and the Kafka cluster.
5. Run nc -k -l <broker-port> to simulate a Kafka broker that never refuses a connection but also never responds.
6. Start risingwave again.
7. Observe a bunch of error logs from rdkafka.
8. Find that it is unable to serve queries and panics after a few seconds.
Expected behavior
It should at least start the meta service on port 5690 and then the frontend service, allowing users to run DROP SOURCE to drop the problematic sources.
How did you deploy RisingWave?
Single-node or standalone mode, with only 1 tokio worker thread.
Or, the free tier in RisingWave Cloud.
The version of RisingWave
No response
Additional context
By setting TOKIO_WORKER_THREADS to a larger number (like 4), the problem is addressed.
By attaching a debugger, I believe it's caused by the synchronous interfaces in rust-rdkafka (issue: fede1024/rust-rdkafka#358) called in ConnectorSourceWorker::run:
Note that fetch_metadata is actually a synchronous interface. It's marked as async only to be compatible with the madsim mocked interfaces, and its timeout is also implemented synchronously. When something is wrong with the connection, the call blocks the thread.
Since there's only 1 tokio worker thread, a blocked thread means the whole RW service is blocked. As a result, all kinds of odd timeout errors can occur.
I also believe that using multiple worker threads for deployments with only a single CPU core, especially in standalone or single-node mode, would generally be better, as it reduces the chance of encountering such problems.
For reference, the call sites are risingwave/src/meta/src/stream/source_manager.rs (lines 187 to 191 at 90eb6f1) and then risingwave/src/connector/src/source/kafka/enumerator/client.rs (lines 376 to 381 at 90eb6f1).
As a partial fix, madsim-rs/madsim#196 wraps some interfaces, such as fetch_watermarks, in spawn_blocking, which is the right way to work around this problem. However, the remaining unwrapped interfaces could still be problematic.