
Support stream level parallel insert #2921

Closed
ygf11 opened this issue Nov 20, 2021 · 8 comments
Labels
C-improvement Category: improvement C-performance Category: Performance

Comments

@ygf11
Contributor

ygf11 commented Nov 20, 2021

Summary
For INSERT INTO ... SELECT, the SelectPlan may generate multiple streams, but currently these are all merged into one.
It would be better to avoid the merge and enable stream-level parallel insert.

https://github.com/datafuselabs/databend/blob/73ed5a38ff04caee65bc91de2551d59cd9b9b4c9/query/src/pipelines/processors/pipeline.rs#L156-L161

related: #2765 (review)
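To make the difference concrete, here is a minimal, self-contained sketch (illustrative only; DataBlock and append_blocks are stand-ins, not the real Databend pipeline or sink types) contrasting "merge then insert" with one sink per stream inside a single query node:

```rust
use std::thread;

// Stand-in for a block of rows flowing through one stream.
#[derive(Clone)]
struct DataBlock(Vec<i64>);

// Stand-in for appending one stream's blocks to table storage.
fn append_blocks(stream_id: usize, blocks: Vec<DataBlock>) {
    let rows: usize = blocks.iter().map(|b| b.0.len()).sum();
    println!("stream {stream_id}: appended {rows} rows");
}

fn main() {
    // Pretend the select plan produced three parallel streams of blocks.
    let streams: Vec<Vec<DataBlock>> =
        (0..3).map(|i| vec![DataBlock(vec![i; 1024])]).collect();

    // Current behaviour: merge every stream into one, then a single sink appends.
    let merged: Vec<DataBlock> = streams.iter().flatten().cloned().collect();
    append_blocks(0, merged);

    // Proposed behaviour: one sink per stream, so appends run on parallel
    // threads inside the same query node.
    let handles: Vec<_> = streams
        .into_iter()
        .enumerate()
        .map(|(id, blocks)| thread::spawn(move || append_blocks(id, blocks)))
        .collect();
    for handle in handles {
        handle.join().unwrap();
    }
}
```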

@ygf11
Contributor Author

ygf11 commented Nov 20, 2021

Seems this does not conflict with #2914.

@sundy-li @dantengsky Should this be done now, or after #2914?

@dantengsky
Member

hi @ygf11, great to see u again :D

Seems this does not conflict with #2914.

I am not sure if there will be conflicts; would you please share with us your ideas for implementing this feature?

btw, I think (@sundy-li, correct me if I am mistaken) the comment that @sundy-li gave in #2765 (review) is the key hint: it indicates that the last_pipe is "converged". But I am afraid that by only adjusting Pipeline::execute, this feature might not be fully implemented:

  1. currently, Table::append_data will commit a new snapshot to the metaserver whenever it has been executed successfully;
    thus the table will be appended to concurrently, or even worse, there will be commit conflicts (see the sketch below)
  2. we may need to tweak other components as well, e.g. PlanScheduler, so that the pipeline / scheduler components could help us execute the logic we need, but in a general way

hope this comment helps; if any further clarifications are needed, please let me know.
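To illustrate point 1 above, here is a tiny, self-contained sketch of the kind of commit conflict concurrent appends would run into (hypothetical types; the real meta-server interaction is more involved). Each committer tries to move the table snapshot from the version it read to the next one, so when two sinks commit on the same base version, one of them must fail:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Stand-in for the metaserver's view of a table's snapshot version.
struct MetaServer {
    snapshot_version: AtomicU64,
}

impl MetaServer {
    /// Compare-and-swap style commit: succeeds only if nobody else has
    /// committed since `base_version` was read.
    fn commit_snapshot(&self, base_version: u64) -> Result<u64, String> {
        self.snapshot_version
            .compare_exchange(base_version, base_version + 1, Ordering::SeqCst, Ordering::SeqCst)
            .map_err(|current| format!("commit conflict: expected version {base_version}, found {current}"))
    }
}

fn main() {
    let meta = MetaServer { snapshot_version: AtomicU64::new(7) };

    // Two parallel sinks both finished their append on top of version 7.
    let base = meta.snapshot_version.load(Ordering::SeqCst);
    assert!(meta.commit_snapshot(base).is_ok());  // the first commit succeeds
    assert!(meta.commit_snapshot(base).is_err()); // the second one conflicts
}
```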

@sundy-li
Member

Yes, what @dantengsky said is absolutely right.

Can Table::append_data leave the parts in a pre-commit stage? Then, at the last merge pipe, it'll commit all the pre-committed parts.


sink --> pre-commit --\
                       \
sink --> pre-commit ----->  converged merge, then commit the pre-committed parts
                       /
sink --> pre-commit --/

@dantengsky
Member

Can Table::append_data leave the parts in a pre-commit stage? Then, at the last merge pipe, it'll commit all the pre-committed parts.

yes, we can do that
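A minimal sketch of that split, assuming a hypothetical pre_commit / commit pair rather than the real Table API: every parallel sink only returns the parts it wrote, and the converged merge pipe performs the single commit:

```rust
// A pre-committed data part: written to storage, not yet visible in a snapshot.
struct Part {
    location: String,
    rows: usize,
}

/// Hypothetical split of Table::append_data into two phases.
trait AppendTable {
    /// Runs in every parallel sink: write the data blocks, but do NOT talk to
    /// the metaserver; just return the parts that were produced.
    fn pre_commit(&self, stream_id: usize) -> Vec<Part>;

    /// Runs once in the converged merge pipe: commit all pre-committed parts
    /// in a single snapshot, so one INSERT produces exactly one commit.
    fn commit(&self, parts: Vec<Part>);
}

struct DemoTable;

impl AppendTable for DemoTable {
    fn pre_commit(&self, stream_id: usize) -> Vec<Part> {
        vec![Part { location: format!("_data/part-{stream_id}"), rows: 1024 }]
    }

    fn commit(&self, parts: Vec<Part>) {
        let rows: usize = parts.iter().map(|p| p.rows).sum();
        for part in &parts {
            println!("commit part {} ({} rows)", part.location, part.rows);
        }
        println!("committed {} parts ({rows} rows) in one snapshot", parts.len());
    }
}

fn main() {
    let table = DemoTable;
    // Three parallel sinks each pre-commit the parts they wrote...
    let parts: Vec<Part> = (0..3).flat_map(|id| table.pre_commit(id)).collect();
    // ...and the converged pipe commits them all at once.
    table.commit(parts);
}
```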

@ygf11
Contributor Author

ygf11 commented Nov 21, 2021

I am not sure if there will be conflicts; would you please share with us your ideas for implementing this feature?

hi @dantengsky, thanks for the comment :). Sorry I didn't make it clear before.

Seems this does not conflict with #2914.

I mean the two features are not the same, so they can both make insert faster.

  1. in Proposal: supports parallel insertion #2914, we want to do parallel insertion at the node (process) level (I'm not sure I fully understand this proposal, please correct me if I'm mistaken):
queryNode0 ---> SourceTransform ---> SinkTransform(do append in node0)-->
                                                                        | 
queryNode1 ---> SourceTransform ---> SinkTransform(do append in node1)-->  QueryNode0(Receive Commits) -> Client
                                                                        | 
queryNode2 ---> SourceTransform ---> SinkTransform(do append in node2)-->
  2. here, we want to parallelize at the stream (thread) level within a single query node.

but I am afraid that by only adjusting Pipeline::execute, this feature might not be fully implemented:

You're right, besides Pipeline this needs adjustments to other parts. Maybe this can be discussed and done after #2914 is finished, because that is base work and more important. Also, I'm not familiar with the cluster operations yet :)

@BohuTANG BohuTANG added C-improvement Category: improvement C-performance Category: Performance labels Nov 21, 2021
@dantengsky
Member

Maybe this can be discussed and done after #2914 is finished

Sure :-)

@BohuTANG
Member

BohuTANG commented Dec 1, 2021

#2945 checked in, closed

@BohuTANG BohuTANG closed this as completed Dec 1, 2021
@dantengsky
Member

hi @ygf11, sorry for the delay in responding

here, we want to parallelize at the stream (thread) level within a single query node.

Hope I got it right that the suggestion here is to parallelize the insertions even within a single query node.

As PR #2945 has been merged, I think the above issue is addressed.

But if I misunderstood your suggestion, please correct me; other suggestions are sincerely welcome.

BTW: you might be interested in the following code snippet

https://github.com/datafuselabs/databend/blob/1f1e7ff29fb43ce219fb664a9637db996fd42b48/query/src/pipelines/processors/pipeline_builder.rs#L311-L325
