refactor(query): csv reader support prefetch #14983
Conversation
You can follow this issue: #14973
@zhang2014 Description updated. Please review again.
Co-authored-by: Winter Zhang <coswde@gmail.com>
It now costs 11m36s, much faster than before (24-32 min), but still slower than the expected 9m12s. (My last run with this image is right before ending; the progress is correct now: 99,997,497 rows, 15.47 GB.)
I think the time cost is OK. Two questions:
The new implementation is much easier to maintain. The logic is in fact the same, but nearly half of the code was rewritten, so I am not sure there are no problems.
Yes, we can merge it first. If it turns out not to be acceptable, we can set `enable_new_copy_for_text_formats=0` by default until it is improved.
This is the real time cost.
* chore: set enable_new_copy_for_text_formats=0 by default. Reason: #14983
* fix: wrong CSV row id for the last row.
* Update src/query/settings/src/settings_default.rs
I hereby agree to the terms of the CLA available at: https://docs.databend.com/dev/policies/cla/
Summary
Support prefetch in the CSV reader, so that for a large file the read of the next batch can overlap with the processing of the previous one. This makes the whole run shorter, especially on relatively slow storage.
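As a rough illustration of that overlap (this is not Databend's pipeline code; `read_batch`, `process_batch`, and the channel capacity are hypothetical stand-ins), a bounded channel lets a background reader stay a few batches ahead of the consumer, so I/O and processing proceed concurrently:

```rust
use std::sync::mpsc::sync_channel;
use std::thread;
use std::time::Duration;

// Hypothetical stand-ins for the CSV reader and the downstream processor;
// the sleeps simulate storage latency and CPU work.
fn read_batch(i: usize) -> Vec<u8> {
    thread::sleep(Duration::from_millis(10)); // simulate a slow storage read
    vec![i as u8; 4]
}

fn process_batch(batch: &[u8]) -> usize {
    thread::sleep(Duration::from_millis(10)); // simulate parsing/processing
    batch.len()
}

// Read in a background thread through a bounded channel whose capacity is
// the prefetch goal, so reading batch i+1 overlaps processing of batch i.
fn run_pipeline(batches: usize, prefetch_goal: usize) -> usize {
    let (tx, rx) = sync_channel(prefetch_goal);
    let reader = thread::spawn(move || {
        for i in 0..batches {
            // send() blocks once `prefetch_goal` batches are queued,
            // bounding how far ahead the reader can run.
            tx.send(read_batch(i)).unwrap();
        }
    });
    let mut total = 0;
    for batch in rx {
        total += process_batch(&batch);
    }
    reader.join().unwrap();
    total
}

fn main() {
    // 8 batches of 4 bytes each.
    println!("total bytes = {}", run_pipeline(8, 2));
}
```

With overlap, wall time approaches max(read time, process time) plus one batch, rather than their sum, which is why the gain is largest when storage is the bottleneck.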
- Add `PrefetchAsyncSourcer`, which keeps returning `Async` (instead of `NEED_CONSUME`) until the prefetch goal is reached or the source is finished.
- This relies on some implementation details (correct me if I am wrong @zhang2014): the triggering of the downstream processor depends on `push_data`, not on the `NEED_CONSUME` event; `NEED_CONSUME` is only used to change the processor state to `IDLE`.

Update
The reason for #14973 is simple: the number of processors was not correctly set. That is fixed, but the prefetching still makes sense on a slow backend.
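A minimal sketch of the event decision described above. The names `Async`, `NEED_CONSUME`, and `IDLE` come from the PR text; the struct, its fields, and `next_event` are simplified illustrations, not Databend's actual processor API:

```rust
// Simplified stand-ins for the processor events mentioned in the PR;
// the real pipeline Event type has more variants.
#[derive(Debug, PartialEq)]
enum Event {
    Async,       // schedule another async read
    NeedConsume, // let downstream consume; this processor goes IDLE
    Finished,
}

// Illustrative sketch of a prefetching sourcer's decision logic.
struct PrefetchSourcer {
    queued: usize,  // batches already read and pushed downstream
    goal: usize,    // prefetch goal: how many batches to stay ahead
    finished: bool, // underlying reader is exhausted
}

impl PrefetchSourcer {
    fn next_event(&self) -> Event {
        if self.finished {
            Event::Finished
        } else if self.queued < self.goal {
            // Keep returning Async so the scheduler issues the next read
            // instead of switching to NEED_CONSUME after every batch.
            Event::Async
        } else {
            Event::NeedConsume
        }
    }
}

fn main() {
    let early = PrefetchSourcer { queued: 0, goal: 2, finished: false };
    println!("{:?}", early.next_event()); // goal not yet reached: Async
    let full = PrefetchSourcer { queued: 2, goal: 2, finished: false };
    println!("{:?}", full.next_event()); // goal reached: NeedConsume
}
```

The key point is that a plain async sourcer would alternate read / consume on every batch, while this shape keeps issuing reads until the goal is met.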