Discussion: what new functions should and should not be accepted into DataFusion #14777

Open
Omega359 opened this issue Feb 19, 2025 · 1 comment

Comments

@Omega359
Contributor

Omega359 commented Feb 19, 2025

With the extraction of builtin functions to UDFs last year, it has become much easier to add new functions to DataFusion, and if @findepi's simple functions work is merged into main in the future, it will only become easier.

There have been a number of comments in the past concerning functions in DataFusion and what should and should not be in core and what should likely be in an external repository. The only guidance right now in the contributor guide is:

Contributions that will likely involve more discussion (see Discussing New Features above) prior to acceptance include:

* Major new functionality (even if it is part of the “standard SQL”)
* New functions, especially if they aren’t part of “standard SQL”
* New data sources (e.g. support for Apache ORC)

That's great, except that there is no definition of what 'standard SQL' is (typically I would think we would point to PostgreSQL as the 'standard'), and there are many useful functions in other systems such as DuckDB, SingleStore, Spark, etc. that could be candidates for inclusion.

There is already a PR, spearheaded by the Comet team, to include Spark functions in DataFusion as a separate crate.

The concern with having so many functions in core is the maintenance burden it incurs. The community has been able to handle it so far, but if we keep adding more functions that may no longer be the case.

I would like us to reach a consensus on more specifically worded guidance about what we would accept. Perhaps more importantly, can we provide a home for non-core functions where the community could maintain them outside of DataFusion core?

My proposal is:

  • New functions will only be accepted in DataFusion if they fill a gap compared to PostgreSQL, or fill a gap identified by the community compared to an alternative system such as DuckDB. In the latter case the functions should be contributed as a group within an epic that fills out the specified gap (for example, the union* functions in DuckDB), not as single functions coming in piecemeal.
  • A new Apache repository is set up (datafusion-additional-functions ?) where we provide the framework for adding, testing and packaging new functions, but with the explicitly stated understanding that the functions it contains have lower maintenance priority for the DataFusion team and that releases may or may not coincide with DataFusion releases.

Note that there is already datafusion-functions-extra in datafusion-contrib, but that is not an official Apache release. Is that sufficient instead of point #2 above?

Thoughts?

@findepi
Member

findepi commented Feb 19, 2025

can we provide a home for non-core functions where the community could maintain them outside of DataFusion core?

You mean something like https://github.com/datafusion-contrib/datafusion-functions-extra?
It has some downsides too (being non-Apache limits contributions from corporations, and it has unpredictable release cycles).
It is not somewhere I personally feel encouraged to contribute.

But I'm not advocating for an "open for all" approach either.

New functions will only be accepted in DataFusion if they fill a gap compared to PostgreSQL, or fill a gap identified by the community compared to an alternative system such as DuckDB. In the latter case the functions should be contributed as a group within an epic that fills out the specified gap (for example, the union* functions in DuckDB), not as single functions coming in piecemeal.

I agree it makes sense to add functions that close the functionality gap with established popular systems.
What exactly is an "established popular system"? For every potential contributor it will be their system of choice, so we need to apply some judgement, perhaps based on "aggregated request rate" or "common themes".
Spark compatibility is so clearly a common theme that it makes sense to maintain Spark functions as part of this repo.
PostgreSQL being our reference implementation: same.
DuckDB being the role model we look up to for arrays: same.
I guess there would be a few more on this list where we can expect some low-profile but sustained interest.

A new Apache repository is set up (datafusion-additional-functions ?) where we provide the framework for adding, testing and packaging new functions, but with the explicitly stated understanding that the functions it contains have lower maintenance priority for the DataFusion team and that releases may or may not coincide with DataFusion releases.

Apache projects need to have PMC members who perform and vote on the releases. It's their duty to do releases, with all the burden this entails (#14428).

Is the release burden proportional to the amount of code being shipped? Is it proportional to the number of releases being made? Can the burden be reduced to pure automation? It's the 21st century...

I would prefer separate crates within this repository for the most popular function collections. No more than 6.
Alternatively, we could have a separate repository for each collection (including Spark's), so that interested community members can step up, review, and eventually become subproject maintainers.


We won't really know what the actual cost is until we try things out. So whatever model we feel is best, we should try it with an open mind, accepting that we can change it later.
