Evaluate JIT'd expression over arrays #2587

Merged (5 commits) on May 24, 2022

Conversation

@waynexia (Member) commented May 22, 2022

Which issue does this PR close?

Related to #2122

Rationale for this change

This PR tries out a way to perform JIT'd computation over Arrow arrays.

As I understand it, we have (at least) two ways to JIT the query execution. One is to glue all the low-level compute functions together (those in the Arrow compute kernel); the other is the approach in this PR, which performs all of the computation inside the JIT engine.

The first way is easier to implement (compared to the second) and can gain performance from eliminated dispatch and branches. However, the second, fully compiled way will take a lot of effort, as it requires a JIT version of the compute kernel. Gandiva in Arrow C++ is an LLVM-based compute kernel that might help, but I'm not very familiar with it. In any case, being able to combine both ways should be the better situation 😆.

Back to this PR: it generates a loop like the one @Dandandan presented here. I haven't inspected whether the compiler will vectorize it. Currently it only wraps a single expr, but we could explore compiling multiple plans into one loop, like here. The row format for pipeline breakers is also significant for fully JIT'd execution. A sketch of the generated loop is shown below.
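
For illustration only, here is roughly the loop shape in plain Rust (a sketch; the actual function is emitted as Cranelift IR, and the name calc_fn_0 mirrors the pseudocode shown later in this thread):

// Hypothetical plain-Rust rendering of the generated loop; the real code
// is emitted as Cranelift IR, not written in Rust.
unsafe fn calc_fn_0(a: *const i64, b: *const i64, result: *mut i64, len: i64) {
    let mut index: isize = 0;
    while (index as i64) < len {
        // result[index] = a[index] + b[index]
        *result.offset(index) = *a.offset(index) + *b.offset(index);
        index += 1;
    }
}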

This PR only implements a very early-stage "example", with many hard-coded parts such as types and function signatures. Please let me know what you think of it!

What changes are included in this PR?

Are there any user-facing changes?

Signed-off-by: Ruihang Xia <waynestxia@gmail.com>
@github-actions bot added the datafusion (Changes in the datafusion crate) label on May 22, 2022
@alamb (Contributor) commented May 22, 2022

This looks very cool @waynexia -- I will give it a good look tomorrow.

> The first way is easier to implement (compared to the second one). And can get performance improvement from eliminated dispatch and branch. However, the second fully compiled way will take lots of effort as it requires a JIT version of compute kernel

The other reason that JIT'd execution for DataFusion is interesting is a few operations that are fundamentally row-oriented (and thus not amenable to vectorized execution), the key one being a multi-tuple comparison (not just equality) that appears in sorting, grouping, and joining.
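
To make that concrete, here is a hypothetical sketch (not code from this PR) of such a multi-column row comparison; the per-row, data-dependent early exit is what makes it hard to vectorize and attractive to specialize at JIT time:

use std::cmp::Ordering;

// Hypothetical sketch: lexicographic comparison of two rows across several
// sort columns. An interpreter must dispatch on column type for every row;
// a JIT can specialize this into straight-line code per schema.
fn compare_rows(left: &[i64], right: &[i64]) -> Ordering {
    for (l, r) in left.iter().zip(right.iter()) {
        match l.cmp(r) {
            Ordering::Equal => continue,
            other => return other, // data-dependent early exit
        }
    }
    Ordering::Equal
}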

@waynexia (Member, Author) commented:

I got this from the compiled function:

function u0:0(i64, i64, i64, i64) system_v {
block0(v0: i64, v1: i64, v2: i64, v3: i64):
    v11 -> v0
    v14 -> v1
    v17 -> v2
    v6 -> v3
    v4 = iconst.i64 0
    jump block1(v4)

block1(v5: i64):
    v7 = icmp slt v5, v6
    v8 = bint.i64 v7
    brz v8, block3
    jump block2

block2:
    v9 = iconst.i64 8
    v10 = imul.i64 v5, v9
    v12 = iadd.i64 v11, v10
    v13 = load.i64 v12
    v15 = iadd.i64 v14, v10
    v16 = load.i64 v15
    v18 = iadd.i64 v17, v10
    v19 = iadd v13, v16
    store v19, v18
    v20 = iconst.i64 1
    v21 = iadd.i64 v5, v20
    jump block1(v21)

block3:
    return
}

It looks like a bare translation of what we build, so I suspect no vectorization happens here (after translation). Further, I found the I64X8 type in the documentation (we are currently using I64). Perhaps this means that we need to manually vectorize our computation.
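
For reference, manual vectorization would mean restructuring the loop to process several lanes per iteration. A hypothetical plain-Rust sketch of that shape (in Cranelift this would use the I64X8 vector type and SIMD instructions instead of a scalar inner loop):

// Hypothetical sketch: 8-lane addition with a scalar tail, the classic
// structure of a manually vectorized loop.
fn add_vectorized(a: &[i64], b: &[i64], out: &mut [i64]) {
    let len = a.len().min(b.len()).min(out.len());
    let vector_end = len - len % 8;
    // Main loop: 8 additions per iteration map naturally onto one SIMD op.
    for i in (0..vector_end).step_by(8) {
        for lane in 0..8 {
            out[i + lane] = a[i + lane] + b[i + lane];
        }
    }
    // Scalar tail for the leftover elements.
    for i in vector_end..len {
        out[i] = a[i] + b[i];
    }
}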

@alamb (Contributor) commented May 23, 2022

> It just looks like a bare translation of what we build. So I suspect the vectorization is not done here (after translation). Further, I find this I64X8 type from the document (we are currently using I64). Perhaps this means that we need to manually vectorize our computation.

I believe this is a known limitation with cranelift -- we can also potentially consider using llvm in the future, but that would likely result in longer query planning times. I believe @tustvold has thought about this as well.

Writing some sort of basic vectorization optimizations in cranelift (or in datafusion) is also a possibility. @Dandandan discussed it a bit in #2124 (comment).

);

let result = run_df_expr(df_expr, schema, array_a, array_b).unwrap();
assert_eq!(result, expected);
Review comment (Contributor):

Whoo, really nice 🎉

@tustvold (Contributor) commented May 23, 2022

I think it is important to understand what cranelift is, and what it isn't. Cranelift is a code generator originally intended to take optimised WASM and convert it to native code. It is not an optimising compiler like LLVM.

I could see it being very well suited for doing runtime monomorphisation, i.e. removing conditional branches, especially for row-oriented operators. I think it will struggle to outperform the columnar kernels in arrow-rs, some of which are hand-rolled and all of which benefit from LLVM compilation.
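
To illustrate the runtime-monomorphisation point with a hypothetical sketch (names invented; not DataFusion code):

// The interpreted path re-dispatches on the operator for every element,
// while a JIT can bake the operator in and emit a branch-free loop body.
enum Op {
    Add,
    Mul,
}

// Interpreted: one `match` per element.
fn eval_interpreted(op: &Op, a: &[i64], b: &[i64], out: &mut [i64]) {
    for i in 0..out.len() {
        out[i] = match op {
            Op::Add => a[i] + b[i],
            Op::Mul => a[i] * b[i],
        };
    }
}

// Monomorphised: the JIT would emit the equivalent of this per operator.
fn eval_add(a: &[i64], b: &[i64], out: &mut [i64]) {
    for i in 0..out.len() {
        out[i] = a[i] + b[i];
    }
}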

As for using LLVM as a JIT: it is possible, and there are fairly mature FFI bindings for doing this. However, it would be a significant increase in implementation and runtime complexity, and at the moment I suspect there is significantly lower-hanging fruit in terms of optimisation...

@alamb (Contributor) left a review:

This is a very exciting step forward 🎉 Thank you very much @waynexia

I wonder why you chose the name deref rather than store? The underlying cranelift library seems to use load and store, which I think are the more common terms in compilers for what Rust calls deref.

datafusion/jit/src/api.rs (review threads resolved)
Comment on lines 111 to 112
let code_fn =
unsafe { core::mem::transmute::<_, fn(i64, i64, i64, i64) -> ()>(code_ptr) };
Review comment (Contributor):

code like this is some of my favorite ❤️ -- cast some memory to a function pointer and call it ;)

Extremely unsafe but feels very powerful

Review comment (Contributor):

I wonder why not cast to the types you really want, like fn(*i64, *i64, *i64, i64) and then you can avoid the as i64 stuff below

@waynexia (Member, Author) replied:

Good suggestion 👍

> Extremely unsafe but feels very powerful

Same!
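
For reference, the typed cast suggested above might look like this (a sketch mirroring the snippet from the diff; the mutability of the third pointer is an assumption based on it being the output array):

// Sketch only: cast directly to the pointer-typed signature so callers do
// not need the `as i64` conversions on each argument.
let code_fn = unsafe {
    core::mem::transmute::<_, fn(*const i64, *const i64, *mut i64, i64)>(code_ptr)
};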


use super::*;

fn run_df_expr(
Review comment (Contributor):

In the longer term I would like to see this type of logic encapsulated somehow.

So we would have a function or struct that took an Expr and several ArrayRefs and then called a JIT or non-JIT version of evaluation depending on flags or options.
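
Something like this hypothetical shape, for illustration (every name here is invented; Expr, ArrayRef, and DataFusionError stand in for the existing DataFusion/Arrow types, and the helpers are placeholders):

// Hypothetical sketch of the encapsulation: one entry point that picks the
// JIT'd or interpreted path based on options.
struct JitOptions {
    use_jit: bool,
}

fn evaluate_expr(
    expr: &Expr,
    inputs: &[ArrayRef],
    options: &JitOptions,
) -> Result<ArrayRef, DataFusionError> {
    if options.use_jit && expr_is_jit_compatible(expr) {
        // Compile `expr` once, then run the generated loop over `inputs`.
        evaluate_with_jit(expr, inputs)
    } else {
        // Fall back to the existing interpreted, kernel-based evaluation.
        evaluate_interpreted(expr, inputs)
    }
}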

@waynexia (Member, Author) replied:

Definitely!

I'm going to figure out how to handle variable-length parameter lists and type casting, and then encapsulate a plan-node-level interface (maybe taking a LogicalPlan and data as input; I haven't thought it through in detail) as the next step.


#[test]
fn array_add() {
let array_a: PrimitiveArray<Int64Type> =
Review comment (Contributor):

I recommend using different values for array_a and array_b so issues in argument handling would be evident

Like maybe

        let array_b: PrimitiveArray<Int64Type> =
            PrimitiveArray::from_iter_values((10..20).map(|x| x + 1));

@waynexia (Member, Author) replied:

Nice catch 😆

let expected = r#"fn calc_fn_0(table1.a_array: i64, table1.b_array: i64, result: i64, len: i64) -> () {
let index: i64;
index = 0;
while index < len {
Review comment (Contributor):

I looked at this code and it looks 👍 to me

format!("{}_ptr", input),
w.add(w.id(format!("{}_array", input))?, w.id("offset")?)?,
)?;
w.declare_as(input, w.deref(w.id(format!("{}_ptr", input))?, I64)?)?;
Review comment (Member):

Hmm, so the generated func only accepts i64 arrays?

@waynexia (Member, Author) replied:

Changed it to a parameter 😉

let mut fn_body = builder.enter_block();
fn_body.declare_as("index", fn_body.lit_i(0))?;
fn_body.while_block(
|cond| cond.lt(cond.id("index")?, cond.id("len")?),
Review comment (Member):

Hmm, do we need sanity check for equal array lengths?

@waynexia (Member, Author) replied:

The generated code works on pointers directly, so it might be hard to do these checks inside it. I check the inputs' lengths before passing them to the generated code, here:

        if lhs.len() != rhs.len() {
            return Err(DataFusionError::NotImplemented(
                "Computing on different length arrays not yet supported".to_string(),
            ));
        }

But I agree that we should consider how to improve the safety of the generated code as its logic gets more complicated.

Review comment (Contributor):

I think in general the idea is that all the safety checks are done during JIT generation, as suggested by @waynexia.

waynexia and others added 2 commits May 24, 2022 12:53
Add doc for `deref()` and `store()`

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Signed-off-by: Ruihang Xia <waynestxia@gmail.com>
@waynexia (Member, Author) commented:

Thanks for all the reviews and information!

> I think it is important to understand what cranelift is, and what it isn't. Cranelift is a code generator originally intended to take optimised WASM and convert it to native code. It is not an optimising compiler like LLVM.

That makes sense. I can see a long way to go to implement this JIT framework, and (another long way) to make it outperform the existing interpreted executor 🤣

> I wonder why you chose the name deref rather than store? The underlying cranelift library seems to use load and store which I think are the more common terms in compilers for what Rust terms deref

I also hesitated over the naming; I have now changed it to load().

@alamb (Contributor) commented May 24, 2022

I also agree with @tustvold that it will be very hard to beat the performance of the optimized, vectorized kernels in arrow-rs, and in general I think we should continue to use them whenever possible.

As I mentioned in #2587 (comment), I think the really valuable use case for JIT'ed Exprs is operations that can't easily (or at all) be vectorized, namely those doing multi-column comparisons during Merge / Group.
