
Support stream level parallel insert #2921

Closed
ygf11 opened this issue Nov 20, 2021 · 8 comments
Labels
C-improvement Category: improvement C-performance Category: Performance

Comments

@ygf11
Contributor

ygf11 commented Nov 20, 2021

Summary
For INSERT INTO ... SELECT, the SelectPlan may generate multiple streams, but currently these are all merged into one.
It would be better to avoid the merge and enable stream-level parallel insert.

https://github.com/datafuselabs/databend/blob/73ed5a38ff04caee65bc91de2551d59cd9b9b4c9/query/src/pipelines/processors/pipeline.rs#L156-L161

related: #2765 (review)
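To make the difference concrete, here is a minimal, self-contained sketch (illustrative only; DataBlock and append_blocks are stand-ins, not the real Databend pipeline or sink types) contrasting "merge then insert" with one sink per stream inside a single query node:

```rust
use std::thread;

// Stand-in for a block of rows flowing through one stream.
#[derive(Clone)]
struct DataBlock(Vec<i64>);

// Stand-in for appending one stream's blocks to table storage.
fn append_blocks(stream_id: usize, blocks: Vec<DataBlock>) {
    let rows: usize = blocks.iter().map(|b| b.0.len()).sum();
    println!("stream {stream_id}: appended {rows} rows");
}

fn main() {
    // Pretend the select plan produced three parallel streams of blocks.
    let streams: Vec<Vec<DataBlock>> =
        (0..3).map(|i| vec![DataBlock(vec![i; 1024])]).collect();

    // Current behaviour: merge every stream into one, then a single sink appends.
    let merged: Vec<DataBlock> = streams.iter().flatten().cloned().collect();
    append_blocks(0, merged);

    // Proposed behaviour: one sink per stream, so appends run on parallel
    // threads inside the same query node.
    let handles: Vec<_> = streams
        .into_iter()
        .enumerate()
        .map(|(id, blocks)| thread::spawn(move || append_blocks(id, blocks)))
        .collect();
    for handle in handles {
        handle.join().unwrap();
    }
}
```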

@ygf11
Contributor Author

ygf11 commented Nov 20, 2021

Seems this does not conflict with #2914.

@sundy-li @dantengsky Should this be done now, or after #2914?

@dantengsky
Member

hi @ygf11, great to see u again :D

Seems this does not conflict with #2914.

I am not sure if there will be conflicts; would you please share with us your ideas for implementing this feature?

btw, I think (@sundy-li, correct me if I am mistaken) the comment that @sundy-li gave in #2765 (review) is the key hint: it indicates that the last_pipe is "converged". But I am afraid that by only adjusting Pipeline::execute, this feature might not be fully implemented:

  1. currently, Table::append_data will commit a new snapshot to the metaserver whenever it has been executed successfully;
    thus the table will be appended to concurrently, or even worse, there will be commit conflicts (see the sketch below)
  2. we may need to tweak other components as well, e.g. PlanScheduler, so that the pipeline / scheduler components could help us execute the logic we need, but in a general way

hope this comment helps; if any further clarifications are needed, please let me know.
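To illustrate point 1 above, here is a tiny, self-contained sketch of the kind of commit conflict concurrent appends would run into (hypothetical types; the real meta-server interaction is more involved). Each committer tries to move the table snapshot from the version it read to the next one, so when two sinks commit on the same base version, one of them must fail:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Stand-in for the metaserver's view of a table's snapshot version.
struct MetaServer {
    snapshot_version: AtomicU64,
}

impl MetaServer {
    /// Compare-and-swap style commit: succeeds only if nobody else has
    /// committed since `base_version` was read.
    fn commit_snapshot(&self, base_version: u64) -> Result<u64, String> {
        self.snapshot_version
            .compare_exchange(base_version, base_version + 1, Ordering::SeqCst, Ordering::SeqCst)
            .map_err(|current| format!("commit conflict: expected version {base_version}, found {current}"))
    }
}

fn main() {
    let meta = MetaServer { snapshot_version: AtomicU64::new(7) };

    // Two parallel sinks both finished their append on top of version 7.
    let base = meta.snapshot_version.load(Ordering::SeqCst);
    assert!(meta.commit_snapshot(base).is_ok());  // the first commit succeeds
    assert!(meta.commit_snapshot(base).is_err()); // the second one conflicts
}
```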

@sundy-li
Member

Yes, what @dantengsky said is absolutely right.

Can Table::append_data leave the parts in a pre-commit stage? Then, at the last merge pipe, it'll commit all the pre-committed parts.


sink --> pre-commit --\
                       \
sink --> pre-commit ----->  converged merge, then commit the pre-committed parts
                       /
sink --> pre-commit --/

@dantengsky
Member

Can Table::append_data leave the parts in a pre-commit stage? Then, at the last merge pipe, it'll commit all the pre-committed parts.

yes, we can do that
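A minimal sketch of that split, assuming a hypothetical pre_commit / commit pair rather than the real Table API: every parallel sink only returns the parts it wrote, and the converged merge pipe performs the single commit:

```rust
// A pre-committed data part: written to storage, not yet visible in a snapshot.
struct Part {
    location: String,
    rows: usize,
}

/// Hypothetical split of Table::append_data into two phases.
trait AppendTable {
    /// Runs in every parallel sink: write the data blocks, but do NOT talk to
    /// the metaserver; just return the parts that were produced.
    fn pre_commit(&self, stream_id: usize) -> Vec<Part>;

    /// Runs once in the converged merge pipe: commit all pre-committed parts
    /// in a single snapshot, so one INSERT produces exactly one commit.
    fn commit(&self, parts: Vec<Part>);
}

struct DemoTable;

impl AppendTable for DemoTable {
    fn pre_commit(&self, stream_id: usize) -> Vec<Part> {
        vec![Part { location: format!("_data/part-{stream_id}"), rows: 1024 }]
    }

    fn commit(&self, parts: Vec<Part>) {
        let rows: usize = parts.iter().map(|p| p.rows).sum();
        for part in &parts {
            println!("commit part {} ({} rows)", part.location, part.rows);
        }
        println!("committed {} parts ({rows} rows) in one snapshot", parts.len());
    }
}

fn main() {
    let table = DemoTable;
    // Three parallel sinks each pre-commit the parts they wrote...
    let parts: Vec<Part> = (0..3).flat_map(|id| table.pre_commit(id)).collect();
    // ...and the converged pipe commits them all at once.
    table.commit(parts);
}
```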

@ygf11
Contributor Author

ygf11 commented Nov 21, 2021

I am not sure if there will be conflicts; would you please share with us your ideas for implementing this feature?

hi @dantengsky, thanks for the comment :). Sorry I didn't make it clear before.

Seems this does not conflict with #2914.

I mean the two features are not the same, so they can both make insert faster.

  1. in Proposal: supports parallel insertion #2914, we want to do parallel insertion at the node (process) level (I'm not sure I fully understand this proposal, please correct me if I'm mistaken):
queryNode0 ---> SourceTransform ---> SinkTransform(do append in node0)-->
                                                                        | 
queryNode1 ---> SourceTransform ---> SinkTransform(do append in node1)-->  QueryNode0(Receive Commits) -> Client
                                                                        | 
queryNode2 ---> SourceTransform ---> SinkTransform(do append in node2)-->
  2. here, we want to parallelize at the stream (thread) level within a single query node.

but I am afraid that by only adjusting Pipeline::execute, this feature might not be fully implemented:

You're right, besides Pipeline this needs adjustments to other parts. Maybe this can be discussed and done after #2914 is finished, because that is base work and more important. Also, I'm not familiar with the cluster operations yet :)

@BohuTANG BohuTANG added C-improvement Category: improvement C-performance Category: Performance labels Nov 21, 2021
@dantengsky
Member

Maybe this can be discussed and done after #2914 is finished

Sure :-)

@BohuTANG
Member

BohuTANG commented Dec 1, 2021

#2945 checked in, closed

@BohuTANG BohuTANG closed this as completed Dec 1, 2021
@dantengsky
Member

hi @ygf11, sorry for the delay in responding

here, we want to parallelize at the stream (thread) level within a single query node.

Hope I got it right that the suggestion here is to parallelize the insertions even within a single query node.

As PR #2945 has been merged, I think the above issue is addressed.

But if I misunderstood your suggestion, please correct me; other suggestions are sincerely welcome.

BTW: you might be interested in the following code snippet

https://github.com/datafuselabs/databend/blob/1f1e7ff29fb43ce219fb664a9637db996fd42b48/query/src/pipelines/processors/pipeline_builder.rs#L311-L325
