feat(query): unify pipeline for all inputs with format. #7613
4c75403 to 998ecf0
This comment was marked as resolved.
I regret... The failing tests:
src/query/pipeline/sources/src/processors/sources/input_formats/impls/input_format_csv.rs
Explanation for commit 6749386: according to https://clickhouse.com/docs/en/interfaces/formats/#tabseparated-data-formatting, the TSV parser should unescape its input. The TSV format and the hits dataset are fixed. I will fix CSV tomorrow.
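For illustration, a minimal sketch of what unescaping a TSV field can look like. The function name `unescape_tsv_field` is hypothetical and this is not the code from the commit; it handles the backslash escapes listed in the linked ClickHouse section, and passing unknown escapes through unchanged is an assumption.

```rust
/// Unescape one TSV field (hypothetical helper, not Databend's implementation).
/// Handles the backslash escapes \t \n \r \0 \b \f \\ and \'.
fn unescape_tsv_field(field: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(field.len());
    let mut iter = field.iter().copied();
    while let Some(b) = iter.next() {
        if b != b'\\' {
            out.push(b);
            continue;
        }
        // A backslash was seen: decode the byte that follows it.
        match iter.next() {
            Some(b't') => out.push(b'\t'),
            Some(b'n') => out.push(b'\n'),
            Some(b'r') => out.push(b'\r'),
            Some(b'0') => out.push(0),
            Some(b'b') => out.push(0x08),
            Some(b'f') => out.push(0x0c),
            Some(b'\\') => out.push(b'\\'),
            Some(b'\'') => out.push(b'\''),
            // Assumption: an unknown escape is kept as-is (both bytes).
            Some(other) => {
                out.push(b'\\');
                out.push(other);
            }
            // Assumption: a trailing lone backslash is kept.
            None => out.push(b'\\'),
        }
    }
    out
}
```

Calling it on the four bytes `a`, `\`, `t`, `b` yields `a`, a real tab, and `b`.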
Failed to start the cluster: https://github.com/datafuselabs/databend/actions/runs/3086922753/jobs/4991831923
Re-run the job :)
The only remaining error is 04_0001_mini_hits.
I cannot reproduce it if I copy from S3 (MinIO).
Do you know why reading https://repo.databend.rs/dataset/stateful/hits_100k.tsv is so slow? It takes over 40s to read a 1MB batch on my Mac (the file is about 80MB). The read is done here; is there anything that needs to be improved? Solved by changing the HTTP port to 80.
I hereby agree to the terms of the CLA available at: https://databend.rs/dev/policies/cla/
Summary
part of #7732
Unify the pipeline for all inputs (copy into, streaming load, ClickHouse insert with format).
The insight is that:
The sync Deserializer cannot do async reads, so it is better to feed it aligned RowBatches (including those in column format, like a RowGroup in Parquet). These RowBatches are independent.
Prepare for a distributed copy: split files into splits early.
For row-based formats:
Use an mpmc channel between them for task sharing/balancing.
Parquet files in streaming load fit into this pattern too.
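The task-sharing idea can be sketched with the standard library alone. This is an illustrative sketch, not the PR's code: `count_rows_parallel` is a hypothetical name, counting `\n` bytes stands in for real deserialization, and since stable `std` only provides an mpsc channel, the single receiver is shared behind `Arc<Mutex<..>>` to emulate multiple consumers.

```rust
use std::sync::{mpsc, Arc, Mutex};
use std::thread;

/// Hand independent row batches to `workers` threads over a shared channel
/// and return the total number of rows (here: number of `\n` bytes).
/// Hypothetical helper for illustration only.
fn count_rows_parallel(batches: Vec<Vec<u8>>, workers: usize) -> usize {
    let (tx, rx) = mpsc::channel::<Vec<u8>>();
    // Emulate an mpmc channel: every worker pulls from the same receiver.
    let rx = Arc::new(Mutex::new(rx));

    let handles: Vec<_> = (0..workers)
        .map(|_| {
            let rx = Arc::clone(&rx);
            thread::spawn(move || {
                let mut rows = 0usize;
                loop {
                    // The lock is released at the end of this statement,
                    // so the batch is processed without holding it.
                    let batch = match rx.lock().unwrap().recv() {
                        Ok(b) => b,
                        Err(_) => break, // sender dropped: no more batches
                    };
                    rows += batch.iter().filter(|&&b| b == b'\n').count();
                }
                rows
            })
        })
        .collect();

    for batch in batches {
        tx.send(batch).unwrap();
    }
    drop(tx); // close the channel so the workers exit their loops

    handles.into_iter().map(|h| h.join().unwrap()).sum()
}
```

Because the batches are independent, whichever worker is idle takes the next one, which is what gives the balancing effect.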
We end up with 2 kinds of pipelines:
a 1-stage pipeline (Deserializer only) for Parquet/ORC/Arrow in copy into,
and a 2-stage pipeline (Aligner and Deserializer) for other cases (all formats in streaming load).
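A rough sketch of what the Aligner stage does for a row-based format, under simplifying assumptions: the `Aligner` type and its methods below are hypothetical, rows are assumed to end with `\n`, and quoted newlines (as in CSV) are ignored. It buffers raw chunks and only emits batches that end on a row boundary, so a synchronous deserializer never sees a partial row.

```rust
/// Hypothetical aligner: buffers partial rows and emits only whole rows.
struct Aligner {
    /// Bytes received so far that do not yet end in a complete row.
    tail: Vec<u8>,
}

impl Aligner {
    fn new() -> Self {
        Aligner { tail: Vec::new() }
    }

    /// Feed a raw chunk. Returns an aligned batch (ending with `\n`)
    /// if at least one complete row is available, otherwise `None`.
    fn push(&mut self, chunk: &[u8]) -> Option<Vec<u8>> {
        self.tail.extend_from_slice(chunk);
        // Position of the last row terminator in the buffered bytes.
        let last_nl = self.tail.iter().rposition(|&b| b == b'\n')?;
        // Keep the incomplete row for the next call, emit the rest.
        let rest = self.tail.split_off(last_nl + 1);
        Some(std::mem::replace(&mut self.tail, rest))
    }

    /// End of input: the remaining bytes form the final, possibly
    /// unterminated, row.
    fn finish(self) -> Option<Vec<u8>> {
        if self.tail.is_empty() {
            None
        } else {
            Some(self.tail)
        }
    }
}
```

Feeding the chunks `1,2\n3,` and then `4\n` yields the batches `1,2\n` and `3,4\n`; each batch can then be deserialized independently of the others.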
Other optimizations:
\r\n, look for \n only
This PR (migrate the existing capabilities):