[EPIC] JIT support for `DataFusion` #2703

alamb · 2022-06-06T12:38:16Z

Summary
TLDR: The key focus of this work is to speed up fundamentally row oriented operations like hash table lookup or comparisons (e.g. #2427)

Background

DataFusion, like many Arrow systems, is a classic "vectorized computation engine" which works quite well for many common operations. The following paper, gives a good treatment on the various tradeoffs between vectorized and JIT's compilation of query plans: https://db.in.tum.de/~kersten/vectorization_vs_compilation.pdf?lang=de

As mentioned in the paper, there are some fundamentally "row oriented" operations in a database that are not typically amenable to vectorization. The "classics" are: Hash table updates in Joins and Hash Aggregates, as well as comparing tuples in sort.

Another example can be found in these slides from this presentation

@yjshen added initial support for JIT'ing in #1849 and it currently lives in https://github.com/apache/arrow-datafusion/tree/master/datafusion/jit. He also added partial support for aggregates in #2375

This ticket aims to be a central location for tracking the status of JIT compiling expressions for anyone who wants to contribute to this effort

Describe the solution you'd like

The text was updated successfully, but these errors were encountered:

alamb · 2022-06-06T12:44:05Z

Actually, this is basically the same as #1861 so closing in favor of that ticket

leoluan2009 · 2024-05-16T06:31:59Z

Hi @alamb , do we need to support expression JIT for performance like ClickHouse: https://clickhouse.com/blog/clickhouse-just-in-time-compiler-jit

alamb · 2024-05-20T16:21:11Z

Hi @leoluan2009

In my opinion, I don't think DataFusion needs JIT to get good performance.

In general, I find the paper "Everything You Always Wanted to Know About Compiled and Vectorized Queries But Were Afraid to Ask" to explain the tradeoffs well

DataFusion is a vectorized engine and we haven't found areas where JIt would be compelling compared to vectorized code. The only area I can really think of would be to implement type specialized comparisons for sorting (to avoid the RowFormat) but we would need to have a pretty compelling benchmark showing improvements to justify I think

faucct · 2024-05-24T13:26:17Z

Though the paper that you have mentioned admits that JIT-compilation is beneficial for OLTP workloads:

Besides OLAP performance, other factors also play an important role. Compilation-based engines have advantages in
< OLTP as they can create fast stored procedures

If DataFusion would have JIT, then it could be useful for building Online ML Feature Store engines.

alamb · 2024-05-24T14:52:04Z

If DataFusion would have JIT, then it could be useful for building Online ML Feature Store engines.

FWIW there is no reason you couldn't JIT compile arbitrary expressions and run them as UDFs. The API is now basically complete: https://docs.rs/datafusion/latest/datafusion/logical_expr/trait.ScalarUDFImpl.html

It would also be possible to do the same with aggregates / windows / etc

faucct · 2024-05-24T16:09:51Z

I think that compiling SQL-expressions to UDFs by hand would kinda kill the whole point of the framework, but it seems like most of the framework would be irrelevant for the in-memory transformation of Online Features, so I guess it would be easier to build the same thing from scratch, though using the same ideas, like for example SQL and Arrow format for intermediate data representation.

alamb · 2024-05-24T19:38:18Z

FIW there is a lot more to SQL evaluation than just the expression evaluation, so that might be a reason to use DataFusion even if you had to implement your own expressions 🤔

alamb added the enhancement New feature or request label Jun 6, 2022

alamb changed the title ~~[EPIC] Full JIT support for DataFusion~~ [EPIC] JIT support for DataFusion Jun 6, 2022

alamb closed this as completed Jun 6, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[EPIC] JIT support for `DataFusion` #2703

[EPIC] JIT support for `DataFusion` #2703

alamb commented Jun 6, 2022 •

edited

Loading

alamb commented Jun 6, 2022

leoluan2009 commented May 16, 2024

alamb commented May 20, 2024

faucct commented May 24, 2024

alamb commented May 24, 2024

faucct commented May 24, 2024 •

edited

Loading

alamb commented May 24, 2024

[EPIC] JIT support for DataFusion #2703

[EPIC] JIT support for DataFusion #2703

Comments

alamb commented Jun 6, 2022 • edited Loading

alamb commented Jun 6, 2022

leoluan2009 commented May 16, 2024

alamb commented May 20, 2024

faucct commented May 24, 2024

alamb commented May 24, 2024

faucct commented May 24, 2024 • edited Loading

alamb commented May 24, 2024

[EPIC] JIT support for `DataFusion` #2703

[EPIC] JIT support for `DataFusion` #2703

alamb commented Jun 6, 2022 •

edited

Loading

faucct commented May 24, 2024 •

edited

Loading