Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Create new datafusion-comet-spark-expr crate containing Spark-compatible DataFusion expressions #638

Merged
merged 19 commits into from
Jul 10, 2024

Conversation

andygrove
Copy link
Member

@andygrove andygrove commented Jul 7, 2024

Which issue does this PR close?

Part of #630

Also contains workaround for #642 that was discovered as part of this PR

Rationale for this change

This PR demonstrates how we can start moving expressions into a separate crate that other Rust projects can use, without having dependencies on JVM or Spark.

What changes are included in this PR?

  • New crate
  • Move abs expression to new crate as ther first example. If this approach is acceptable, we can follow up with PRs to move the other functions over
  • Added Rust tests for abs
  • Fall back to Spark for abs in legacy mode because behavior is incorrect (see abs implementation seems incorrect #642)

How are these changes tested?

Existing tests + new tests

@andygrove andygrove marked this pull request as ready for review July 8, 2024 15:56
# under the License.

[package]
name = "datafusion-comet-expr"
Copy link
Member Author

@andygrove andygrove Jul 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wasn't sure whether to name this datafusion-spark-expr or datafusion-comet-expr. I think either could work.

I went with the latter because the versioning of this crate will be different from the version numbers used for other datafusion-* crates from the core project and thought that may cause confusion.

I also think it is unlikely that we would move this crate to the core project because we would want the implementation and tests to live in the same repo and the tests depend on Apache Spark, which would be a huge burden to add to the core project.

cc @alamb in case you have any opinions on this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think datafusion-comet-expr sounds good to me

You could get the best of both worlds like datafusion-comet-spark-expr but that is probably only useful if people end up using this crate outside the context of comet

Maybe if this crate really becomes a great set of spark compatible functions, that would be a good time to consider moving it to the core repo and releasing it along side the others

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer datafusion-spark-expr as it's a more general name so that the potential third party could find this crate more easily.

I agree that we may never move this crate to the core of the DataFusion project, but it's still possible we may want to move this crate to a datafusion-contrib repo once the supported expressions are stable and mature.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have been thinking more about this, and I also now think we should name it datafusion-spark-expr because the functions are not specific to the Comet project. I will make that change.

I don't think we would ever want to move the code into datafusion-contrib because that is outside of Apache Software Foundation governance.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we would ever want to move the code into datafusion-contrib because that is outside of Apache Software Foundation governance.

Thanks for your explanation, I think it makes total sense and agree with that.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like datafusion-comet-spark-expr fwiw. It would be nice for all things versioned together to have the sama datafusion-comet prefix, and as the reason to use these exprs over DF’s inbuilt one is to align with spark, having that in the name would make sense to me as well.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have renamed the crate to datafusion-comet-spark-expr.

@advancedxy @viirya @huaxingao what do you think?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks okay to me.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good to me.

I have renamed the crate to datafusion-comet-spark-expr.

@advancedxy @viirya @huaxingao what do you think?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

@@ -68,17 +71,15 @@ impl ScalarUDFImpl for CometAbsFunc {

fn invoke(&self, args: &[ColumnarValue]) -> Result<ColumnarValue, DataFusionError> {
match self.inner_abs_func.invoke(args) {
Err(DataFusionError::ArrowError(ArrowError::ComputeError(msg), trace))
Err(DataFusionError::ArrowError(ArrowError::ComputeError(msg), _trace))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why this change, because _trace is never used?

We can simply change it to _?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, if we get an overflow error then we don't care about the trace. I have renamed it to _.

# under the License.

[package]
name = "datafusion-comet-expr"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer datafusion-spark-expr as it's a more general name so that the potential third party could find this crate more easily.

I agree that we may never move this crate to the core of the DataFusion project, but it's still possible we may want to move this crate to a datafusion-contrib repo once the supported expressions are stable and mature.

@andygrove andygrove changed the title feat: Create new datafusion-comet-expr crate containing Spark-compatible DataFusion expressions feat: Create new datafusion-spark-expr crate containing Spark-compatible DataFusion expressions Jul 9, 2024
Copy link
Contributor

@advancedxy advancedxy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Pending CI passes.

BTW, I'm not familiar with the cargo related changes, so it would be great that others can take a look at that.

datafusion-functions = { workspace = true }

[lib]
name = "datafusion_spark_expr"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I wonder why we don't have comet in the library name? Are we planning to move these to datafusion? I personally prefer to keep spark specified stuffs in Comet.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a discussion about this at #638 (comment)


[package]
name = "datafusion-spark-expr"
description = "DataFusion expressions that emulate Apache Spark's behavior"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
description = "DataFusion expressions that emulate Apache Spark's behavior"
description = "DataFusion Comet expressions that emulate Apache Spark's behavior"

@andygrove andygrove changed the title feat: Create new datafusion-spark-expr crate containing Spark-compatible DataFusion expressions feat: Create new datafusion-comet-spark-expr crate containing Spark-compatible DataFusion expressions Jul 9, 2024
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry forgot to submit my review

# under the License.

[package]
name = "datafusion-comet-expr"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think datafusion-comet-expr sounds good to me

You could get the best of both worlds like datafusion-comet-spark-expr but that is probably only useful if people end up using this crate outside the context of comet

Maybe if this crate really becomes a great set of spark compatible functions, that would be a good time to consider moving it to the core repo and releasing it along side the others

datafusion-functions = { workspace = true }

[lib]
name = "datafusion_comet_spark_expr"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice!

@andygrove andygrove merged commit 10347b9 into apache:main Jul 10, 2024
73 checks passed
@andygrove andygrove deleted the spark-expr-crate branch July 10, 2024 19:32
himadripal pushed a commit to himadripal/datafusion-comet that referenced this pull request Sep 7, 2024
…-compatible DataFusion expressions (apache#638)

* convert into workspace project

* update GitHub actions

* update Makefile

* fix regression

* update target path

* update protobuf path in pom.xml

* update more paths

* add new datafusion-comet-expr crate

* rename CometAbsFunc to Abs and add documentation

* fix error message

* improve error handling

* update crate description

* remove unused dep

* address feedback

* finish renaming crate

* update README for datafusion-spark-expr

* rename crate to datafusion-comet-spark-expr
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants