Improve the COPY from external location performance #4308
Comments
/assignme
@BohuTANG Since I am not quite familiar with Rust threads, I have some questions about this issue. I think the source stream items will be produced by the pipeline executor, so they are already on other threads, right? And the consumer runs on the current thread?
I think we could transform the Copy Plan into a pipeline.
It's ready now. We can build the pipeline through the new processor framework; see the example. Feel free to contact me if you need any help.
@sundy-li Got it.
@sundy-li Should we use a sink pipe and a source pipe to create a complete pipeline for the refactoring?
Yes!
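A minimal structural sketch of that source-pipe/sink-pipe shape, written against made-up types (`Block`, `StageSource`, `TableSink`, `Pipeline` are placeholders, not Databend's actual processor API):

```rust
struct Block; // stand-in for a data block read from the stage file

trait Source {
    fn generate(&mut self) -> Option<Block>;
}
trait Sink {
    fn consume(&mut self, block: Block);
    fn finish(&mut self);
}

struct StageSource { remaining: usize }
impl Source for StageSource {
    fn generate(&mut self) -> Option<Block> {
        // Hypothetical: read the next block from the stage file; None means EOF.
        if self.remaining == 0 {
            return None;
        }
        self.remaining -= 1;
        Some(Block)
    }
}

struct TableSink { appended: usize }
impl Sink for TableSink {
    fn consume(&mut self, _block: Block) {
        // Hypothetical: append the block to the target table.
        self.appended += 1;
    }
    fn finish(&mut self) {
        // Commit once, after the source is exhausted.
        println!("committed {} blocks", self.appended);
    }
}

struct Pipeline {
    source: Box<dyn Source>,
    sink: Box<dyn Sink>,
}
impl Pipeline {
    // A real executor would schedule the two pipes on different worker threads;
    // this toy version just drives them in a loop to show the shape.
    fn execute(mut self) {
        while let Some(block) = self.source.generate() {
            self.sink.consume(block);
        }
        self.sink.finish();
    }
}

fn main() {
    let pipeline = Pipeline {
        source: Box::new(StageSource { remaining: 3 }),
        sink: Box::new(TableSink { appended: 0 }),
    };
    pipeline.execute();
}
```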
Run databend-query with disk:
Why is there
OK. I also found some tests in the code that seem to use minio without detailed setup steps. And why is the interpreter_copy unit test missing? Is it difficult to mock, or is there another reason?
Uploading a file to the stage with curl -H "stage_name:my_internal_stage" -F "upload=@./books.csv" -XPUT http://localhost:8081/v1/upload_to_stage fails with this error: unexpected: (op: write, path: /Users/kaichen/Documents/projects/databend/target/debug/benddata/datas/stage/my_internal_stage, source: File exists (os error 17))
The log shows
-rw-r--r-- 1 kaichen staff 0 Apr 3 19:47 my_internal_stage
-rw-r--r-- 1 kaichen staff 0 Apr 3 20:12 test
-rw-r--r-- 1 kaichen staff 0 Apr 3 20:26 test1
-rw-r--r-- 1 kaichen staff 0 Apr 3 20:54 test2
Yes, it should be a bug, since I already tested this successfully using minio. Let me try to fix it.
#4783 (comment) @zhang2014 @sundy-li @BohuTANG Let me come back here. I looked at the MemoryTable code and I see that we can use one pipe to implement it. However, I am not quite sure about using S3StageTable, because if I understand correctly this copy operation should also handle internal stages, which do not use S3 storage. What is your suggestion?
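For what it's worth, here is a hedged sketch of one way the same source pipe could cover both internal stages (local disk) and external stages (S3): hide the block reads behind a trait and choose the backend when the pipeline is built. `StageReader`, `LocalStageReader`, `S3StageReader` and `read_block` are hypothetical names, not the real S3StageTable API.

```rust
trait StageReader {
    fn read_block(&mut self) -> Option<Vec<u8>>;
}

struct LocalStageReader { blocks_left: usize }
impl StageReader for LocalStageReader {
    fn read_block(&mut self) -> Option<Vec<u8>> {
        // Hypothetical: read the next block from a file under the local stage directory.
        if self.blocks_left == 0 { return None; }
        self.blocks_left -= 1;
        Some(vec![0u8; 8])
    }
}

struct S3StageReader { blocks_left: usize }
impl StageReader for S3StageReader {
    fn read_block(&mut self) -> Option<Vec<u8>> {
        // Hypothetical: fetch the next range of the S3 object.
        if self.blocks_left == 0 { return None; }
        self.blocks_left -= 1;
        Some(vec![1u8; 8])
    }
}

fn build_reader(is_external: bool) -> Box<dyn StageReader> {
    // Pick the backend when the COPY pipeline is built.
    if is_external {
        Box::new(S3StageReader { blocks_left: 2 })
    } else {
        Box::new(LocalStageReader { blocks_left: 2 })
    }
}

fn main() {
    // The source pipe only sees `StageReader`, so the rest of the COPY pipeline
    // is identical for internal and external stages.
    let mut reader = build_reader(true);
    while let Some(block) = reader.read_block() {
        println!("got block of {} bytes", block.len());
    }
}
```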
I finally got actually
Summary
If we COPY an S3 file and insert it into a table, the steps are:
S1. Read the S3 file in blocks from the S3 location
S2. Write the block stream to table t1
S3. Commit
https://github.com/datafuselabs/databend/blob/16e06e414c4680f0d640abada631af89369be877/query/src/interpreters/interpreter_copy.rs#L83-L102
S1 and S2 run in the same thread; it looks like we can run them in parallel.
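A small sketch of how S1 and S2 could overlap (not the actual interpreter code): a reader thread pulls blocks from the S3 location while a writer thread appends them to the table, connected by a bounded channel so the reader cannot run far ahead. `fetch_block_from_s3` and `append_block_to_table` are hypothetical helpers.

```rust
use std::sync::mpsc::sync_channel;
use std::thread;

fn fetch_block_from_s3(i: usize) -> Option<Vec<u8>> {
    // Hypothetical: pretend the S3 file yields four blocks, then EOF.
    if i < 4 { Some(vec![i as u8; 1024]) } else { None }
}

fn append_block_to_table(block: &[u8]) {
    // Hypothetical: write one block into table t1.
    println!("appended {} bytes", block.len());
}

fn main() {
    // A small bound keeps memory flat: the reader can only run a couple of
    // blocks ahead of the writer (back-pressure).
    let (tx, rx) = sync_channel::<Vec<u8>>(2);

    // S1: read blocks from the S3 location on its own thread.
    let reader = thread::spawn(move || {
        let mut i = 0;
        while let Some(block) = fetch_block_from_s3(i) {
            if tx.send(block).is_err() { break; } // writer hung up
            i += 1;
        }
        // Dropping `tx` closes the channel and signals EOF to the writer.
    });

    // S2: write blocks to the table while the reader keeps fetching.
    let writer = thread::spawn(move || {
        for block in rx {
            append_block_to_table(&block);
        }
        // S3 (commit) would happen here, once all blocks are appended.
    });

    reader.join().unwrap();
    writer.join().unwrap();
}
```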