Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support logical plan compilation #2648

Closed
wants to merge 7 commits into from
Closed

Conversation

waynexia
Copy link
Member

Which issue does this PR close?

Closes #.

Rationale for this change

This PR implements logical plan compilation and executes the compiled plan on record batch.

But this implementation only covers a very few types. Only i64 & f64 array's batch and projection plan are supported. It wouldn't be difficult to extend to all primitive types. However, other complex array types and other plans that may require variable output length (filter) and complex algorithms (join, aggr etc) are hard to implement in pure cranelift IR. It would be unavoidable to call into rust code. This is acceptable in my perspective as the overhead can be alleviated or eliminated (via LLVM, I guess).

What changes are included in this PR?

This PR adds serval JIT-related structs. The most important is JITContext and JITExecutionPlan.

JITContext supports compile logical plan into executable JITExecutionPlan. And JITExecutionPlan is a minimal ExecutionPlan which support execute() on RecordBatch.

Now we can do something like

let ctx = JITContext::default();
let exec_plan = ctx.compile_logical_plan(logical_plan);
let output = exec_plan.execute(input_record_batch);

Are there any user-facing changes?

No

Does this PR break compatibility with Ballista?

No

Signed-off-by: Ruihang Xia <waynestxia@gmail.com>
Signed-off-by: Ruihang Xia <waynestxia@gmail.com>
@github-actions github-actions bot added the datafusion Changes in the datafusion crate label May 29, 2022
@waynexia

This comment was marked as outdated.

@andygrove
Copy link
Member

Hi @waynexia and thanks for the contribution. I understand using the JIT to compile expression trees but I don't understand why we would want to compile a logical plan rather than leverage the JIT in the physical plan? Could you explain that some more? The logical plan has the concept of a join but we don't know whether that is a HashJoin, CrossJoin, or SortMergeJoin until we get to the physical plan so I don't see how we could compile anything from the logical plan for joins as one example.

@waynexia
Copy link
Member Author

Hi @andygrove, thanks for your review.

My proposal is to leverage JIT in the phase we convert logical plan into physical execution plan (in the PhysicalPlanner). And the compiled JIT plan is one kind of physical plan. This new flow looks like:

       Logical Plan
            │
   ┌────────▼──────────┐
   │ PhysicalPlanner   │
   │ (with JITContext) │
   └───┬────────────┬──┘
       │            │
       ▼            ▼
JITExecPlan     other ExecutionPlan

I suppose in this phase we have enough information on how to "physically" execute a SQL. JIT module should only replace some previous ExecutionPlan with their compiled variant. Like physical projection to JIT projection or physical HashJoin to JIT HashJoin.

And in another perspective, I think it's not easy to compile a physical plan. One main reason is we can translate logical expr to JIT expr, but not physical expr to JIT expr. Logical expr is a sort of AST which is easy to compile, and physical expr is an "opaque" operation.

However this is not infeasible at all. As showed in this paper. One of the idea is to let those operators used to operate on data to generate IR, and then use the IR to operate data. We could achieve something like

// add a new method to this existing trait
trait ExecutionPlan {
    fn jit_compile(&self, jit_ctx: JITContext) {
        // add the logic to JIT Context. E.g. a filter plan:
        jit_ctx.expr(
            // let column_c = column_a + column_b
            // if column_c > 0, materialize column_d to output
        )
    }
}

fn jit_exec(plan: ExecutionPlan, ctx: JITContext) {
    // let every physical plans to register their logic to context
    // and finalize these logic to executable program.
    ctx.compile(plan);
    ctx.execute()
}

I haven't considered and compared these two ways deeply. But in the current stage, I think they only differ on how to structure our implementation. But they may have an influence on the future topic, like minimizing memory footprint or optimizing (on our side) generated code.

Signed-off-by: Ruihang Xia <waynestxia@gmail.com>
Signed-off-by: Ruihang Xia <waynestxia@gmail.com>
@waynexia
Copy link
Member Author

Ok I think it is ready for review. The generated code and test preparation are verbose 😢 Maybe there is some way to do it better?

@waynexia waynexia marked this pull request as ready for review May 31, 2022 16:37
Signed-off-by: Ruihang Xia <waynestxia@gmail.com>
@waynexia waynexia changed the title WIP: Support logical plan compilation Support logical plan compilation May 31, 2022
@tustvold
Copy link
Contributor

tustvold commented Jun 1, 2022

Hi @waynexia I think it would help move this effort forward if we could have some sort of high-level issue with a brief overview of the issues this is seeking to tackle, along with a link to some sort of design document. Recent examples of this might be, #2502 or #2199. This will help provide context for reviewers, and allow discussion of different approaches with all the necessary information available.

On a related note, I wonder if perhaps this could live as a repo in https://github.com/datafusion-contrib/ ?

@waynexia
Copy link
Member Author

waynexia commented Jun 1, 2022

Appreciate your advice and suggestion @tustvold! I'm drafting the proposal, hope I could put it into discussion this week.

On a related note, I wonder if perhaps this could live as a repo in https://github.com/datafusion-contrib/ ?

I'm not very familiar with this mechanism. Which aspects are different for living in the main repo or as a standalone repo

@andygrove
Copy link
Member

Appreciate your advice and suggestion @tustvold! I'm drafting the proposal, hope I could put it into discussion this week.

On a related note, I wonder if perhaps this could live as a repo in https://github.com/datafusion-contrib/ ?

I'm not very familiar with this mechanism. Which aspects are different for living in the main repo or as a standalone repo

datafusion-contrib is not under Apache governance so there is more freedom there to move fast when prototyping. You can merge your own PRs for example while you are the only person working on a crate there. You can also release to crates.io without a formal vote. If/when we want to move any code into the official repo then we would go through the usual IP clearance process.

@andygrove andygrove removed the datafusion Changes in the datafusion crate label Jun 3, 2022
@waynexia
Copy link
Member Author

waynexia commented Jun 5, 2022

datafusion-contrib is not under Apache governance so there is more freedom there to move fast when prototyping. You can merge your own PRs for example while you are the only person working on a crate there. You can also release to crates.io without a formal vote. If/when we want to move any code into the official repo then we would go through the usual IP clearance process.

This is a good choice, I'm ok with both. What are others' opinions?

And by the way, I have drafted a proposal for JIT. Please let me know what do you think on this too 😃
https://docs.google.com/document/d/1K7gYog2qsmVTw0VKlivlETM9MXK8UVGCGc_X7ykc_xk/edit?usp=sharing

cc @yjshen @Dandandan @alamb @houqp @viirya (sorry for disturbing)

@tustvold
Copy link
Contributor

tustvold commented Jun 6, 2022

Thank you for writing this up, I've left some comments on the document.

Some high-level thoughts:

  • datafusion-contrib is probably the right place to experiment with plan-level JIT. I suspect it will need to go through many iterations before it starts to yield discernible performance improvements, and I suspect it will otherwise likely get held up by the limited review capacity on this repo
  • My suggestion would be to first focus on simpler applications of JIT so that we can all get experience with how the various technologies interplay, before undertaking a more complex effort. Something like Improve Sorting / Merge performance  #2427 might be a good starting point. I'm sure @alamb would be more than happy to provide guidance should you want to take this on

@alamb
Copy link
Contributor

alamb commented Jun 6, 2022

I agree that plan level JIT is a great idea, thank you @waynexia for writing up the document as well as this PR. I am sorry it took so long to review it.

TLDR:

  1. I think JIT'ing Exprs (JIT-Compile DataFusion Expressions to create RecordBatches #2122) is a required step for fully JIT'ing plans
  2. datafusion-contrib is likely a good place for this kind of work until it is ready -- I already feel like we have several partly done features (JIT exprs, scheduler) in the core. However, given most of this PR is in the jit crate I am not opposed to adding it too,
  3. I would love to see some performance experiments showing the effect of this work

I am not sure if you have seen the following paper, but it gives a good treatment on the various tradeoffs between vectorized and JIT's compilation of query plans and I think it is quite relevant to this discussion: https://db.in.tum.de/~kersten/vectorization_vs_compilation.pdf?lang=de

Here is the canonical plan I think of that benefits from JIT'ing:

            ▲                                      ▲               
            │                                      │               
            │                                      │               
┌───────────────────────┐          ┌──────────────────────────────┐
│     HashAggregate     │          │        Compiled Node         │
│        gby: x         │          │         JIT'ed code:         │
│     agg: SUM(y+5)     │          │                              │
└───────────────────────┘          │    if x > 5 and y != 10:     │
            ▲                      │      hash (x)                │
            │                      │      probe hash table        │
            │                      │      update SUM              │
┌───────────────────────┐          │                              │
│        Filter:        │          └──────────────────────────────┘
│       x > 5 AND       │                          ▲               
│        y != 10        │                          │               
└───────────────────────┘                          │               
            ▲                      ┌──────────────────────────────┐
            │                      │             Scan             │
            │                      └──────────────────────────────┘
┌───────────────────────┐                                          
│         Scan          │                                          
└───────────────────────┘                                          

In this case, the code to filter and update the hashtable are compiled together so that input rows from the data source directly update the hash table without ever leaving registers. I believe this kind of plan can be made super fast and results state of the art in terms of query performance.

The idea of JIT'ing multiple plan nodes together is a necessary part of this for sure, but one of the last ones. The first step is fully JIT'ing the expr evaluation.

I really like the idea of using DataFusion's extensibility model to develop / prototype this approach in a datafusion-contrib or other repo until it is mature enough to bring into the core codebase. Though to be honest, perhaps the same approach could/should be used for the new scheduler @tustvold is working on too 🤔

In terms of the cache invalidation issues, I think #2199 will help the current vectorized approach to minimize the number of times a batch

I will also try to update #2122 with some more specifics

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer if this was in datafusion-contrib but I think in the jit module is OK too

let jit_exec_plan = jit_ctx.compile_logical_plan(projection_plan).unwrap();

// execute
let output_batch = jit_exec_plan.execute(input_batch).unwrap();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is pretty neat

Signed-off-by: Ruihang Xia <waynestxia@gmail.com>
@waynexia
Copy link
Member Author

waynexia commented Jun 6, 2022

Thank you all for the reviews ❤️

datafusion-contrib is probably the right place to experiment with plan-level JIT

How should I get started in datafusion-contrib, I think I should ask someone permitted to create a repo? And should we move the entire jit sub-crate or just copy it. The row format will still need the JIT utils.

Something like #2427 might be a good starting point.

Looks cool. I'll try to get involved. And I really like @alamb's reason 1!

The idea of JIT'ing multiple plan nodes together is a necessary part of this for sure, but one of the last ones. The first step is fully JIT'ing the expr evaluation.

This makes sense. I shall revisit the task list. That paper also gives some interesting data, like compilation sometimes will miss more branches. And besides the conclusion, it shows (me) a concrete way to measure cost and improvement.

@alamb
Copy link
Contributor

alamb commented Jun 6, 2022

I have spent some time thinking about this and I am not as sure JIT'ing expressions other than multi-column row comparisons will really improve performance over Arrow's optimized kernels -- I think as @waynexia has suggested, the only way we will ever really know is to try it

@alamb alamb marked this pull request as draft June 6, 2022 19:07
@alamb
Copy link
Contributor

alamb commented Jun 6, 2022

Marking as draft so we don't accidentally merge this

@waynexia
Copy link
Member Author

Close this. I'll continue under a separate repo, appreciate it again ❤️

@waynexia waynexia closed this Jun 13, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants