Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor(query): csv reader support prefetch #14983

Merged
merged 7 commits into from
Mar 19, 2024

Conversation

youngsofun
Copy link
Member

@youngsofun youngsofun commented Mar 17, 2024

I hereby agree to the terms of the CLA available at: https://docs.databend.com/dev/policies/cla/

Summary

so for a large file read of next batch and processing of the prev can overlap.
make the whole time shorter especially for relative slow storage.

add PrefetchAsyncSourcer, keep return Async (instead of NEED_CONSUME) until prefetch goal is reached or finished.
rely on some implementation details (correct me if I am wrong @zhang2014 ): trigger of downstream processor dependent on push_data, not NEED_CONSUME event, NEED_CONSUME is only used to change processor state to IDLE

update

the reason for #14973 is simple: the num of processors is not correctly set.

fixed.

but the prefetching still make sense on slow backend.

Tests

  • Unit Test
  • Logic Test
  • Benchmark Test
  • No Test - Explain why

Type of change

  • Bug Fix (non-breaking change which fixes an issue)
  • New Feature (non-breaking change which adds functionality)
  • Breaking Change (fix or feature that could cause existing functionality not to work as expected)
  • Documentation Update
  • Refactoring
  • Performance Improvement
  • Other (please describe):

This change is Reviewable

@github-actions github-actions bot added the pr-refactor this PR changes the code base without new features or bugfix label Mar 17, 2024
@youngsofun youngsofun changed the title refactor: csv reader support prefetch. refactor(query): csv reader support prefetch. Mar 17, 2024
@youngsofun youngsofun changed the title refactor(query): csv reader support prefetch. refactor(query): csv reader support prefetch Mar 17, 2024
@databendlabs databendlabs deleted a comment from github-actions bot Mar 17, 2024
@databendlabs databendlabs deleted a comment from github-actions bot Mar 17, 2024
@youngsofun youngsofun marked this pull request as draft March 17, 2024 10:33
@youngsofun youngsofun requested a review from zhang2014 March 17, 2024 11:29
@youngsofun youngsofun marked this pull request as ready for review March 17, 2024 11:29
@BohuTANG BohuTANG added the ci-cloud Build docker image for cloud test label Mar 17, 2024
Copy link
Contributor

Docker Image for PR

  • tag: pr-14983-400e01f

note: this image tag is only available for internal use,
please check the internal doc for more details.

@BohuTANG

This comment was marked as outdated.

@youngsofun youngsofun enabled auto-merge March 17, 2024 15:00
@youngsofun youngsofun disabled auto-merge March 17, 2024 15:00
@youngsofun
Copy link
Member Author

I have test this PR, the csv copy speed from 0.06M/s to 0.09M/s, but still slow than the tsv(0.16M/s):

image

It seems rows/s for the CSV is not correct, now the COPY of CSV cost is same as TSV now, about 10minutes.

how many rows in this file?

@BohuTANG
Copy link
Member

Update:
CSV COPY is also slow than TSV:
image

TSV will cost 9 minute:
image

@BohuTANG
Copy link
Member

It seems this PR is not help for CSV:
image

@BohuTANG
Copy link
Member

how many rows in this file?

You can try follow this issue: #14973

@youngsofun
Copy link
Member Author

@zhang2014 description updated. plz review again.

@BohuTANG BohuTANG added ci-cloud Build docker image for cloud test and removed ci-cloud Build docker image for cloud test labels Mar 18, 2024
Copy link
Contributor

Docker Image for PR

  • tag: pr-14983-d047095

note: this image tag is only available for internal use,
please check the internal doc for more details.

@BohuTANG
Copy link
Member

Still not working:/
image

@youngsofun
Copy link
Member Author

@BohuTANG

now cost 11m36, much faster than before (24 - 32min), but still slower than expected 9min12 (my last run with enable_new_copy_for_text_formats=0)

this image is right before ending, the progress is correct now (99997497 rows, 15.47G)

image

@BohuTANG
Copy link
Member

BohuTANG commented Mar 19, 2024

now cost 11m36,

I think the cost time is ok.

Two questions:

  1. Why we should do this setting: enable_new_copy_for_text_formats=0, does it mean the new copy impl is still need to improve?
  2. This PR is helpful but is one part of the improvement(Means we can merge this PR first)?

@youngsofun
Copy link
Member Author

Why we should do this setting: enable_new_copy_for_text_formats=0, does it mean the new copy impl is still need to improve?

the new impl is much better to maintain, logic is infact the same, but nearly half of the codes are rewritten, I`m not sure there are no problem.

This PR is helpful but is one part of the improvement(Means we can merge this PR first)?

yes, we can merge first


it is strange that the Duraition in the history is diff from the time. it is longer :

image

if it is not acceptable, we can set enable_new_copy_for_text_formats=0 by default until it is improved.

@BohuTANG
Copy link
Member

Why we should do this setting: enable_new_copy_for_text_formats=0, does it mean the new copy impl is still need to improve?

the new impl is much better to maintain, logic is infact the same, but nearly half of the codes are rewritten, I`m not sure there are no problem.

This PR is helpful but is one part of the improvement(Means we can merge this PR first)?

yes, we can merge first

it is strange that the Duraition in the history is diff from the time. it is longer :

This is the real time which cost.

image

if it is not acceptable, we can set enable_new_copy_for_text_formats=0 by default until it is improved.

@BohuTANG BohuTANG merged commit 623b536 into databendlabs:main Mar 19, 2024
77 checks passed
youngsofun added a commit to youngsofun/databend that referenced this pull request Mar 19, 2024
@youngsofun youngsofun mentioned this pull request Mar 19, 2024
11 tasks
youngsofun added a commit to youngsofun/databend that referenced this pull request Mar 19, 2024
youngsofun added a commit to youngsofun/databend that referenced this pull request Mar 20, 2024
BohuTANG pushed a commit that referenced this pull request Mar 20, 2024
* chore: enable_new_copy_for_text_formats=0 by default.

reason: #14983

* fix: wrong csv row id for the last row.

* Update src/query/settings/src/settings_default.rs
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ci-cloud Build docker image for cloud test pr-refactor this PR changes the code base without new features or bugfix
Projects
None yet
Development

Successfully merging this pull request may close these issues.

copy tsv.gz 3X faster than csv.gz
3 participants