Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sqlsmith bug hunt: channels not propagating errors #3908

Closed
Tracked by #3896 ...
kwannoel opened this issue Jul 15, 2022 · 9 comments · Fixed by #6566
Closed
Tracked by #3896 ...

sqlsmith bug hunt: channels not propagating errors #3908

kwannoel opened this issue Jul 15, 2022 · 9 comments · Fixed by #6566
Assignees
Labels
help wanted Issues that need help from contributors type/bug Something isn't working

Comments

@kwannoel
Copy link
Contributor

kwannoel commented Jul 15, 2022

Some expressions might only fail during runtime, for instance if they only fetch columns.
Errors should be propagated. In the following cases, the underlying error is NumericOutOfRangeError, but it is not propagated to downstream executors.

dev=> create table t(v int);                                                                                                                                                                                                    CREATE_TABLE
dev=> insert into t values (33);                                                                                                                                                                                                             INSERT 0 1
dev=> select (SMALLINT '22' << v) from t;                                                                                                                                                                                                    ERROR:  RPC error: Status { code: Internal, message: "internal error: broken fifo_channel", metadata: MetadataMap { headers: {"risingwave-error-bin": "CAESI2ludGVybmFsIGVycm9yOiBicm9rZW4gZmlmb19jaGFubmVs"} }, source: None }
dev=> create table t(v int);                                                                                                                                                                                                    CREATE_TABLE
dev=> insert into t values (33);                                                                                                                                                                                                             INSERT 0 1
dev=> select (SMALLINT '22' << v) from t;                                                                                                                                                                                                    ERROR:  RPC error: Status { code: Internal, message: "internal error: broken fifo_channel", metadata: MetadataMap { headers: {"risingwave-error-bin": "CAESI2ludGVybmFsIGVycm9yOiBicm9rZW4gZmlmb19jaGFubmVs"} }, source: None }
dev => insert into t values(0);
INSERT 0 1
dev=> select (SMALLINT '33' / SMALLINT v) from t;
ERROR:  RPC error: Status { code: Internal, message: "internal error: broken fifo_channel", metadata: MetadataMap { headers: {"risingwave-error-bin": "CAESI2ludGVybmFsIGVycm9yOiBicm9rZW4gZmlmb19jaGFubmVs"} }, source: None }

@kwannoel kwannoel changed the title Expr: handle evaluation of invalid numeric expressions during runtime sqlsmith bug hunt: handle evaluation of invalid numeric expressions Jul 15, 2022
@kwannoel kwannoel self-assigned this Jul 15, 2022
@kwannoel kwannoel changed the title sqlsmith bug hunt: handle evaluation of invalid numeric expressions sqlsmith bug hunt: channels not propagating errors Jul 17, 2022
@kwannoel kwannoel added the help wanted Issues that need help from contributors label Jul 17, 2022
@kwannoel
Copy link
Contributor Author

This is because the channel we use between executors passes Optionals around:
https://github.com/singularity-data/risingwave/blob/b64d8294c091413fa517eb70a0659ae3c6bcd3d0/src/batch/src/task/fifo_channel.rs#L31

And so when we read from the channel, the errors are not propagated:
https://github.com/singularity-data/risingwave/blob/b64d8294c091413fa517eb70a0659ae3c6bcd3d0/src/batch/src/task/fifo_channel.rs#L47-L59

Am thinking of changing the Item type in fifo channel to Result for error propagation. Any other suggestions?

@skyzh
Copy link
Contributor

skyzh commented Jul 17, 2022

Result might not be serializable and cannot be sent over the network. I think the current implementation is okay. If there's an error, we can find it on upstream compute node?

We can modify the error message to be more user friendly, like "[E2333] internal error when computing", where E2333 = fifo channel broken. Then developers can go into the log dashboard (maybe ELK in the future) to find what's happening.

@kwannoel
Copy link
Contributor Author

kwannoel commented Jul 17, 2022

Result might not be serializable and cannot be sent over the network. I think the current implementation is okay. If there's an error, we can find it on upstream compute node?

We can modify the error message to be more user friendly, like "[E2333] internal error when computing", where E2333 = fifo channel broken. Then developers can go into the log dashboard (maybe ELK in the future) to find what's happening.

That makes sense, the above behaviour should be undefined, following PostgreSQL, so it is alright to return internal error.

Instead our fuzzers need to generate queries w/o undefined behaviour, since those may raise internal errors as above (which we can't distinguish with actual internal system errors).

@kwannoel
Copy link
Contributor Author

kwannoel commented Jul 17, 2022

Actually what about other defined errors? E.g. division by zero. These errors are also not propagated in RisingWave but in postgres they return the relevant error code. 🤔

RisingWave

dev=> create table t(v int);                                                                                                                                                                                                    CREATE_TABLE
dev => insert into t values(0);
INSERT 0 1
dev=> select (SMALLINT '33' / SMALLINT v) from t;
ERROR:  RPC error: Status { code: Internal, message: "internal error: broken fifo_channel", metadata: MetadataMap { headers: {"risingwave-error-bin": "CAESI2ludGVybmFsIGVycm9yOiBicm9rZW4gZmlmb19jaGFubmVs"} }, source: None }

Postgres should throw division by zero error:

postgres=# create table t(v int);
CREATE TABLE
postgres=# insert into t values(0);
INSERT 0 1
postgres=# select (1/v) from t;
ERROR:  division by zero

Related: #3939

@kwannoel kwannoel reopened this Jul 17, 2022
@Graphcalibur
Copy link
Contributor

Related issue: #2473

@kwannoel
Copy link
Contributor Author

kwannoel commented Jul 18, 2022

Had an offline discussion with @lmatz and he suggested the following:

Multiple upstream executors may yield multiple errors, causing the errors to accumulate. Does not make sense to pass all these downstream.
One way to deal with this is by having a separate mechanism for tracking all errors.
We can push all errors to some error queue as they are encountered, then frontend can use this to view errors.

This currently blocks sqlsmith work, since we can't know what the source of broken fifo channel error is: Runtime error? System internal error? etc... This means we won't know if a query is failing due to runtime error or system internal error.

This issue will take a fair amount of work to resolve.
As a workaround to unblock SqlSmith work, we let broken fifo channel errors pass through for now.

@jon-chuang
Copy link
Contributor

jon-chuang commented Jul 18, 2022

For stream, we can collect the actor errors here: https://github.com/singularity-data/risingwave/blob/b64d8294c091413fa517eb70a0659ae3c6bcd3d0/src/stream/src/task/stream_manager.rs#L621 and then pipe it directly from the node to the frontend.

@kwannoel kwannoel added the type/bug Something isn't working label Jul 19, 2022
@xxchan
Copy link
Member

xxchan commented Jul 25, 2022

related issue #3730

@BowenXiao1999
Copy link
Contributor

Create a issue relates to this: #4811

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
help wanted Issues that need help from contributors type/bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants