refactor(query): csv reader support prefetch #14983

youngsofun · 2024-03-17T10:24:28Z

I hereby agree to the terms of the CLA available at: https://docs.databend.com/dev/policies/cla/

Summary

Fixes copy tsv.gz 3X faster than csv.gz #14973

For a large file, reading the next batch can now overlap with processing the previous one.
This shortens the total time, especially on relatively slow storage.
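
As a rough illustration of the overlap described above (not the code in this PR), a reader thread can fetch batches ahead of the consumer through a bounded channel. The names `Batch` and `process_with_prefetch` and the fake 4-byte batches are assumptions made for this sketch only.

```rust
use std::sync::mpsc::sync_channel;
use std::thread;

/// Hypothetical stand-in for one batch read from storage.
type Batch = Vec<u8>;

/// Reads `num_batches` batches on a background thread while the caller
/// processes them; at most `prefetch` batches are buffered ahead.
fn process_with_prefetch(num_batches: usize, prefetch: usize) -> usize {
    // Bounded channel: the reader blocks once `prefetch` batches are waiting.
    let (tx, rx) = sync_channel::<Batch>(prefetch);

    let reader = thread::spawn(move || {
        for i in 0..num_batches {
            // Stand-in for a slow storage read.
            let batch = vec![i as u8; 4];
            if tx.send(batch).is_err() {
                break; // the consumer went away
            }
        }
        // `tx` is dropped here, which closes the channel and ends the consumer loop.
    });

    let mut bytes = 0;
    for batch in rx {
        // Stand-in for parsing; the reader may already be fetching the next batch.
        bytes += batch.len();
    }
    reader.join().expect("reader thread panicked");
    bytes
}
```

Calling `process_with_prefetch(8, 2)` reads eight fake batches with at most two buffered ahead of the consumer. The bounded capacity plays the role of the prefetch goal: it caps memory use while still hiding read latency.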

Add PrefetchAsyncSourcer, which keeps returning Async (instead of NEED_CONSUME) until the prefetch goal is reached or the source is finished.
This relies on some implementation details (correct me if I am wrong, @zhang2014): the downstream processor is triggered by push_data, not by the NEED_CONSUME event; NEED_CONSUME is only used to change the processor state to IDLE.
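
The event logic described above could be sketched roughly as follows. This is a simplified model, not the actual PrefetchAsyncSourcer: the `Event` enum, the `PrefetchSource` struct, and all of its fields and methods are hypothetical stand-ins, and the real pipeline pushes data to an output port rather than buffering it in a queue.

```rust
use std::collections::VecDeque;

/// Simplified stand-in for a pipeline processor event (not Databend's real enum).
#[derive(Debug, PartialEq)]
enum Event {
    Async,       // schedule another asynchronous read
    NeedConsume, // buffered output is waiting for downstream
    Finished,    // nothing left to read or hand over
}

/// Hypothetical prefetching source that buffers up to `prefetch_goal` blocks.
struct PrefetchSource {
    remaining: usize,        // blocks the underlying reader can still produce
    buffered: VecDeque<u32>, // blocks read but not yet taken by downstream
    prefetch_goal: usize,
    next_id: u32,
}

impl PrefetchSource {
    fn new(total_blocks: usize, prefetch_goal: usize) -> Self {
        Self { remaining: total_blocks, buffered: VecDeque::new(), prefetch_goal, next_id: 0 }
    }

    /// Keep asking for async work until the prefetch goal is reached
    /// or the underlying reader has nothing left.
    fn event(&self) -> Event {
        if self.remaining > 0 && self.buffered.len() < self.prefetch_goal {
            Event::Async
        } else if !self.buffered.is_empty() {
            Event::NeedConsume
        } else {
            Event::Finished
        }
    }

    /// Stand-in for one asynchronous read that yields one block.
    fn async_process(&mut self) {
        if self.remaining > 0 {
            self.remaining -= 1;
            self.buffered.push_back(self.next_id);
            self.next_id += 1;
        }
    }

    /// Downstream takes the oldest buffered block, if any.
    fn pull(&mut self) -> Option<u32> {
        self.buffered.pop_front()
    }
}
```

In this model, a source created with `PrefetchSource::new(3, 2)` reports `Event::Async` until two blocks are buffered, then `Event::NeedConsume`; once downstream pulls a block, it goes back to `Event::Async` while blocks remain to be read.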

update

The reason for #14973 is simple: the number of processors was not set correctly.

Fixed.

But prefetching still makes sense on a slow backend.

Tests

Unit Test
Logic Test
Benchmark Test
No Test - Explain why

Type of change

Bug Fix (non-breaking change which fixes an issue)
New Feature (non-breaking change which adds functionality)
Breaking Change (fix or feature that could cause existing functionality not to work as expected)
Documentation Update
Refactoring
Performance Improvement
Other (please describe):

github-actions · 2024-03-17T14:37:36Z

Docker Image for PR

tag: pr-14983-400e01f

note: this image tag is only available for internal use,
please check the internal doc for more details.

youngsofun · 2024-03-17T15:04:38Z

I have tested this PR; the CSV copy speed went from 0.06M/s to 0.09M/s, but it is still slower than TSV (0.16M/s):

It seems the rows/s shown for CSV is not correct; the CSV COPY now takes about the same time as TSV, about 10 minutes.

How many rows are in this file?

BohuTANG · 2024-03-17T15:05:01Z

Update:
CSV COPY is also slower than TSV:

TSV takes 9 minutes:

BohuTANG · 2024-03-17T15:10:34Z

It seems this PR does not help for CSV:

BohuTANG · 2024-03-17T15:11:56Z

How many rows are in this file?

You can try by following this issue: #14973

youngsofun · 2024-03-18T10:41:59Z

@zhang2014 description updated. plz review again.

github-actions · 2024-03-18T13:56:13Z

Docker Image for PR

tag: pr-14983-d047095

note: this image tag is only available for internal use,
please check the internal doc for more details.

BohuTANG · 2024-03-18T14:00:12Z

Still not working:/

youngsofun · 2024-03-19T02:15:38Z

@BohuTANG

It now takes 11m36, much faster than before (24 to 32 min), but still slower than the expected 9m12 (my last run with enable_new_copy_for_text_formats=0).

This image was taken right before the end; the progress is correct now (99997497 rows, 15.47G).

BohuTANG · 2024-03-19T06:16:33Z

now cost 11m36,

I think that time is OK.

Two questions:

Why should we use this setting: enable_new_copy_for_text_formats=0? Does it mean the new copy implementation still needs improvement?
This PR is helpful but is only one part of the improvement (meaning we can merge this PR first)?

youngsofun · 2024-03-19T07:31:40Z

Why should we use this setting: enable_new_copy_for_text_formats=0? Does it mean the new copy implementation still needs improvement?

The new implementation is much easier to maintain. The logic is in fact the same, but nearly half of the code was rewritten, so I'm not sure there are no problems.

This PR is helpful but is only one part of the improvement (meaning we can merge this PR first)?

Yes, we can merge first.

It is strange that the Duration in the history differs from the time; it is longer:

If that is not acceptable, we can set enable_new_copy_for_text_formats=0 by default until it is improved.

BohuTANG · 2024-03-19T07:45:42Z

Why should we use this setting: enable_new_copy_for_text_formats=0? Does it mean the new copy implementation still needs improvement?

The new implementation is much easier to maintain. The logic is in fact the same, but nearly half of the code was rewritten, so I'm not sure there are no problems.

This PR is helpful but is only one part of the improvement (meaning we can merge this PR first)?

Yes, we can merge first.

It is strange that the Duration in the history differs from the time; it is longer:

This is the real time it cost.

If that is not acceptable, we can set enable_new_copy_for_text_formats=0 by default until it is improved.

github-actions bot added the pr-refactor this PR changes the code base without new features or bugfix label Mar 17, 2024

youngsofun changed the title ~~refactor: csv reader support prefetch.~~ refactor(query): csv reader support prefetch. Mar 17, 2024

youngsofun changed the title ~~refactor(query): csv reader support prefetch.~~ refactor(query): csv reader support prefetch Mar 17, 2024

databendlabs deleted a comment from github-actions bot Mar 17, 2024

youngsofun marked this pull request as draft March 17, 2024 10:33

youngsofun requested a review from zhang2014 March 17, 2024 11:29

youngsofun marked this pull request as ready for review March 17, 2024 11:29

BohuTANG added the ci-cloud Build docker image for cloud test label Mar 17, 2024

youngsofun enabled auto-merge March 17, 2024 15:00

youngsofun disabled auto-merge March 17, 2024 15:00

csv reader support prefetch.

4a6f594

youngsofun force-pushed the fixcsv branch from d607494 to e41b20d Compare March 18, 2024 10:38

fix num of processor in CSV read pipeline.

2fdd965

youngsofun force-pushed the fixcsv branch from e41b20d to 2fdd965 Compare March 18, 2024 11:45

Merge branch 'main' into fixcsv

1a0bb5a

zhang2014 reviewed Mar 18, 2024

src/query/pipeline/sources/src/prefetch_async_source.rs (outdated)

zhang2014 approved these changes Mar 18, 2024

youngsofun and others added 3 commits March 18, 2024 20:23

Update src/query/pipeline/sources/src/prefetch_async_source.rs

120eb5b

Co-authored-by: Winter Zhang <coswde@gmail.com>

Merge branch 'main' into fixcsv

1904c09

Update prefetch_async_source.rs

cf21a1f

youngsofun commented Mar 18, 2024

src/query/pipeline/sources/src/prefetch_async_source.rs (outdated)

Update src/query/pipeline/sources/src/prefetch_async_source.rs

645bc67

BohuTANG added ci-cloud Build docker image for cloud test and removed ci-cloud Build docker image for cloud test labels Mar 18, 2024

BohuTANG approved these changes Mar 19, 2024

BohuTANG merged commit 623b536 into databendlabs:main Mar 19, 2024
77 checks passed

youngsofun added a commit to youngsofun/databend that referenced this pull request Mar 19, 2024

chore: enable_new_copy_for_text_formats=0 by default.

fa7e9d5

reason: databendlabs#14983

youngsofun mentioned this pull request Mar 19, 2024

fix: wrong row id #15018

Merged

11 tasks

youngsofun added a commit to youngsofun/databend that referenced this pull request Mar 19, 2024

chore: enable_new_copy_for_text_formats=0 by default.

6278d03

reason: databendlabs#14983

youngsofun added a commit to youngsofun/databend that referenced this pull request Mar 20, 2024

chore: enable_new_copy_for_text_formats=0 by default.

b51e480

reason: databendlabs#14983

BohuTANG pushed a commit that referenced this pull request Mar 20, 2024

fix: wrong row id (#15018)

722ae22

* chore: enable_new_copy_for_text_formats=0 by default. reason: #14983 * fix: wrong csv row id for the last row. * Update src/query/settings/src/settings_default.rs

This was referenced Jul 18, 2024

Link Checker Report databendlabs/databend-docs#978

Closed

Link Checker Report databendlabs/databend-docs#991

Closed
