
bug: flaky logic test failed in macOS #6614

Closed · 1 of 2 tasks · Tracked by #6733
ZeaLoVe opened this issue Jul 14, 2022 · 11 comments · Fixed by #6816
Labels: C-bug (Category: something isn't working)

Comments

ZeaLoVe (Contributor) commented Jul 14, 2022

Search before asking

  • I had searched in the issues and found no similar issues.

Version

currently

What's Wrong?

The logic test frequently fails in the macOS CI, e.g. https://github.com/datafuselabs/databend/runs/7322628730?check_suite_focus=true

SQL: insert into tbl_01_0002 values(2);
http handler response: INFO:log:response content: {'id': '', 'session_id': None, 'schema': None, 'data': [], 'state': 'Failed', 'error': {'code': 1006, 'message': 'last query on the session not finished'}, 'stats': {'scan_progress': {'rows': 0, 'bytes': 0}, 'write_progress': {'rows': 0, 'bytes': 0}, 'result_progress': {'rows': 0, 'bytes': 0}, 'running_time_ms': 0.0}, 'stats_uri': None, 'final_uri': None, 'next_uri': None, 'kill_uri': None}

cc @youngsofun please take a look.
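
For context, error 1006 above means the HTTP handler refuses to start a new statement while the previous statement on the same session has not reached a final state. Below is a rough Rust sketch of the polling a client has to do, written against the state/next_uri/error fields visible in the response quoted above; the /v1/query endpoint path, the base URL, and the omission of auth headers are assumptions, not taken from the actual test harness.

// Assumed Cargo dependencies: reqwest (blocking + json features) and serde_json.
use reqwest::blocking::Client;
use serde_json::{json, Value};

/// Submit one SQL statement and poll until the query reaches a final state.
/// Firing the next statement on the same session before this loop finishes is
/// exactly what produces "last query on the session not finished".
fn run_sql(client: &Client, base: &str, sql: &str) -> Result<Value, String> {
    let mut resp: Value = client
        .post(format!("{base}/v1/query"))
        .json(&json!({ "sql": sql }))
        .send()
        .and_then(|r| r.json())
        .map_err(|e| e.to_string())?;

    // Keep following next_uri until the server stops handing one out.
    loop {
        let next = match resp["next_uri"].as_str() {
            Some(uri) => format!("{base}{uri}"),
            None => break,
        };
        resp = client
            .get(next)
            .send()
            .and_then(|r| r.json())
            .map_err(|e| e.to_string())?;
    }

    if resp["state"] == "Failed" {
        return Err(resp["error"].to_string());
    }
    Ok(resp["data"].clone())
}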

How to Reproduce?

macOS CI is optional now; see it in the Actions pages.

Are you willing to submit PR?

  • Yes I am willing to submit a PR!
ZeaLoVe added the C-bug label on Jul 14, 2022
youngsofun (Member) commented Jul 20, 2022

select written_rows from system.query_log where query_text='insert into  tbl_01_0002 values(1)' and written_rows != 0;

This SQL cannot finish, and the client's polling shows the scan progress keeps growing:

'scan_progress': {'rows': 29, 'bytes': 30083}
...
'scan_progress': {'rows': 349, 'bytes': 373706}

@everpcpc

I vaguely remember you mentioned something like this in a daily meeting; has it been fixed?

This happened 7 days ago.

everpcpc (Member) commented:

You mean the warehouse for query history? I think it's not related to this.

youngsofun (Member) commented Jul 20, 2022

Another related failure in the ClickHouse handler, in the k3d test:

Orig exception: {"error":{"kind":"InternalError","message":"EOF"}}

select written_rows from system.query_log where query_text='insert into  tbl_01_0002 values(1)' and written_rows != 0;
 
insert into  tbl_01_0002 values(2);

select count(*) from tbl_01_0002 where a > 1;

youngsofun (Member) commented Jul 21, 2022

sundy-li (Member) commented Jul 22, 2022

I think we should improve system.query_log, the lock is too heavy for reading & writing.
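
This is not Databend's actual implementation, just a minimal sketch of the direction that suggestion points in, assuming the in-memory query log is guarded by one exclusive lock today: with an RwLock, reads of system.query_log share the lock and no longer serialize against each other, while appends hold the exclusive lock only long enough to push one entry. Type and field names below are illustrative.

use std::sync::RwLock;

/// Hypothetical log entry; only the two fields used in the SQL above.
struct QueryLogEntry {
    query_text: String,
    written_rows: u64,
}

/// Readers share the lock; writers take it exclusively and briefly.
struct QueryLog {
    entries: RwLock<Vec<QueryLogEntry>>,
}

impl QueryLog {
    fn append(&self, entry: QueryLogEntry) {
        // Exclusive lock only for the duration of one push.
        self.entries.write().unwrap().push(entry);
    }

    /// Roughly what `select written_rows from system.query_log where
    /// query_text = ... and written_rows != 0` needs: a shared read pass.
    fn written_rows_for(&self, text: &str) -> Vec<u64> {
        self.entries
            .read()
            .unwrap()
            .iter()
            .filter(|e| e.query_text == text && e.written_rows != 0)
            .map(|e| e.written_rows)
            .collect()
    }
}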

sundy-li (Member) commented:

A reproducible case on macOS:

  1. start databend-query
  2. use curl to query via the ClickHouse HTTP handler:
curl 'http://default@localhost:8125?enable_planner_v2=1' -d'drop table if exists tbl_01_0002;'
curl 'http://default@localhost:8125?enable_planner_v2=1' -d'create table tbl_01_0002(a int)'
curl 'http://default@localhost:8125?enable_planner_v2=1' -d'insert into  tbl_01_0002 values(1);'
curl 'http://default@localhost:8125?enable_planner_v2=1' -d'select count(*) from tbl_01_0002 where a > 0'
curl 'http://default@localhost:8125?enable_planner_v2=1' -d'drop table tbl_01_0002;'

Server hangs on querying select count(*) from tbl_01_0002 where a > 0.

But it only occurs once, right after databend-query starts.
I debugged it and found it was hanging inside the pipeline, but failed to find the reason. cc @zhang2014 may help.
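
If someone wants to confirm where it hangs, a generic way to get the kind of poll trace shown in the following comments is to wrap the suspect future in a small logging adapter. This is only a sketch built on the standard Future trait, not the debugging code that was actually added:

use std::future::Future;
use std::pin::Pin;
use std::task::{Context, Poll};

/// Adapter that prints a line every time the wrapped future is polled and
/// whether it returned Pending or Ready, similar to the
/// "ProcessorAsyncTask poll ... return pending" lines below.
struct LogPoll<F> {
    name: &'static str,
    inner: Pin<Box<F>>,
}

impl<F: Future> LogPoll<F> {
    fn new(name: &'static str, inner: F) -> Self {
        LogPoll { name, inner: Box::pin(inner) }
    }
}

impl<F: Future> Future for LogPoll<F> {
    type Output = F::Output;

    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<F::Output> {
        eprintln!("{} poll", self.name);
        let result = self.inner.as_mut().poll(cx);
        if result.is_pending() {
            eprintln!("{} return pending", self.name);
        } else {
            eprintln!("{} return ready", self.name);
        }
        result
    }
}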

zhang2014 self-assigned this on Jul 23, 2022
zhang2014 (Member) commented, quoting sundy-li's reproducible case above:

It only occurs with MinIO and S3. After adding some debugging code, I found that when using async to call opendal to read objects, the future is not scheduled again after returning its first Pending. Please refer to the following output (same query). CC: @Xuanwo may help.

Unsuccessful:

ProcessorAsyncTask poll
async process before await
return pending
2022-07-26T04:10:54.253060Z DEBUG opendal::services::s3::backend: object 1/90/_b/c064c73b76664262adaed20a7e9eeebe_v0.parquet read start: offset Some(4), size Some(27)    
2022-07-26T04:10:54.253202Z DEBUG reqsign::services::aws::v4: calculated canonical request: GET
/test/1/90/_b/c064c73b76664262adaed20a7e9eeebe_v0.parquet

host:127.0.0.1:9900
range:bytes=4-30
x-amz-content-sha256:UNSIGNED-PAYLOAD
x-amz-date:20220726T041054Z

host;range;x-amz-content-sha256;x-amz-date
UNSIGNED-PAYLOAD    
2022-07-26T04:10:54.253356Z DEBUG reqsign::services::aws::v4: calculated scope: 20220726/us-east-1/s3/aws4_request    
2022-07-26T04:10:54.253424Z DEBUG reqsign::services::aws::v4: calculated string to sign: AWS4-HMAC-SHA256
20220726T041054Z
20220726/us-east-1/s3/aws4_request
f0fa8d0e09cc59c36c12b3bfa88a7a6dff7bda33190a46ef03d7fe1e26e86a5a    
2022-07-26T04:10:54.253690Z DEBUG hyper::client::pool: reuse idle connection for ("http", 127.0.0.1:9900)

Successful:

ProcessorAsyncTask poll
async process before await
return pending
2022-07-26T04:16:29.416253Z DEBUG opendal::services::s3::backend: object 1/90/_b/c064c73b76664262adaed20a7e9eeebe_v0.parquet read start: offset Some(4), size Some(27)    
2022-07-26T04:16:29.416518Z DEBUG reqsign::services::aws::v4: calculated canonical request: GET
/test/1/90/_b/c064c73b76664262adaed20a7e9eeebe_v0.parquet

host:127.0.0.1:9900
range:bytes=4-30
x-amz-content-sha256:UNSIGNED-PAYLOAD
x-amz-date:20220726T041629Z

host;range;x-amz-content-sha256;x-amz-date
UNSIGNED-PAYLOAD    
2022-07-26T04:16:29.416708Z DEBUG reqsign::services::aws::v4: calculated scope: 20220726/us-east-1/s3/aws4_request    
2022-07-26T04:16:29.416786Z DEBUG reqsign::services::aws::v4: calculated string to sign: AWS4-HMAC-SHA256
20220726T041629Z
20220726/us-east-1/s3/aws4_request
17312e70c0c5fe4aacda1c19ad240ee5bf5a67b9aa9d91fb05d8e3db0d8ba00e    
2022-07-26T04:16:29.417169Z DEBUG hyper::client::connect::http: connecting to 127.0.0.1:9900
2022-07-26T04:16:29.417530Z DEBUG hyper::client::connect::http: connected to 127.0.0.1:9900
2022-07-26T04:16:29.417804Z DEBUG hyper::proto::h1::io: flushed 409 bytes
2022-07-26T04:16:29.418601Z DEBUG hyper::proto::h1::io: parsed 16 headers
2022-07-26T04:16:29.418659Z DEBUG hyper::proto::h1::conn: incoming body is content-length (27 bytes)
2022-07-26T04:16:29.418726Z DEBUG hyper::proto::h1::conn: incoming body completed
2022-07-26T04:16:29.418818Z DEBUG hyper::client::pool: pooling idle connection for ("http", 127.0.0.1:9900)
2022-07-26T04:16:29.418884Z DEBUG opendal::services::s3::backend: object 1/90/_b/c064c73b76664262adaed20a7e9eeebe_v0.parquet reader created: offset Some(4), size Some(27)    
ProcessorAsyncTask poll
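
To make the symptom concrete: an async task is only re-polled if something eventually invokes the Waker it was handed, so if that wakeup never arrives, the executor simply waits forever. Below is a minimal, self-contained sketch of that generic failure mode, not the opendal or hyper code; awaiting it reproduces exactly the shape of the unsuccessful trace above (one poll, one Pending, then silence).

use std::future::Future;
use std::pin::Pin;
use std::task::{Context, Poll};

/// Returns Pending once but never arranges for its Waker to be called,
/// so no executor will ever poll it a second time. Awaiting it hangs.
struct LostWakeup {
    polled: bool,
}

impl Future for LostWakeup {
    type Output = ();

    fn poll(mut self: Pin<&mut Self>, _cx: &mut Context<'_>) -> Poll<()> {
        if self.polled {
            Poll::Ready(())
        } else {
            self.polled = true;
            // Bug: nothing ever calls _cx.waker().wake(), so the executor
            // has no reason to schedule this task again.
            Poll::Pending
        }
    }
}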

Xuanwo (Member) commented Jul 26, 2022

I can reproduce this on Arch Linux; digging into the cause now.

Xuanwo (Member) commented Jul 26, 2022

After disabling hyper's connection pool, this bug seems resolved:

2022-07-26T06:10:31.038164Z DEBUG databend_query::pipelines::executor::processor_async_task: start poll on worker 0
2022-07-26T06:10:31.038303Z DEBUG databend_query::pipelines::executor::processor_async_task: worker 0 returns Ready
2022-07-26T06:10:31.039863Z DEBUG databend_query::sessions::session_ref: Destroy session cf26978f-c7d1-4c68-8db5-9bbdbbe40778
2022-07-26T06:10:31.040271Z DEBUG databend_query::sessions::query_ctx: Destroy QueryContext
2022-07-26T06:10:31.040543Z DEBUG hyper::proto::h1::io: flushed 152 bytes
2022-07-26T06:10:31.040709Z DEBUG hyper::proto::h1::conn: read eof

How about applying this workaround until we find out what happened?

impl HttpClient {
    /// Create a new http client.
    pub fn new() -> Self {
        HttpClient(
            hyper::Client::builder()
+                .pool_max_idle_per_host(0)
                .build(hyper_tls::HttpsConnector::new()),
        )
    }
}
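
For reference, pool_max_idle_per_host(0) is a standard hyper Client builder option: it tells the client to keep no idle connections per host, so every request opens a fresh connection instead of reusing a pooled one. That trades extra connection setup for sidestepping whatever goes wrong with the wakeup on a reused idle connection, which is why it works as a stopgap until the root cause is fixed upstream.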

Xuanwo (Member) commented Jul 26, 2022

Will be fixed by apache/opendal#474

Xuanwo (Member) commented Jul 26, 2022

Cc @youngsofun @sundy-li @zhang2014, please verify whether #6816 fixes this issue.

mergify bot closed this as completed in #6816 on Jul 26, 2022