Discussion: what new functions should and should not be accepted into DataFusion #14777
Comments
You mean something like https://github.com/datafusion-contrib/datafusion-functions-extra? But I'm not advocating for an "open for all" approach either.
I agree it makes sense to add functions that close a functionality gap with established popular systems.
Apache projects need to have PMC members who do and vote on the releases. It's their duty to do releases with all the burden this entails (#14428). Is the release burden proportional to the amount of code being shipped? Is it proportional to the number of releases being made? Can the burden be minimized into pure automation? It's the 21st century...
I would prefer separate crates within this repository for the most popular function collections. No more than 6.
We won't really know what the actual cost is until we try things out. So whatever we feel is the best model, we should try it out with an open mind, accepting that we can change it later.
With the extraction of builtin functions to UDFs last year, it has become much easier to add new functions to DataFusion, and if @findepi's simple functions is merged into main in the future, it will only become easier.
There have been a number of comments in the past concerning functions in DataFusion and what should and should not be in core and what should likely be in an external repository. The only guidance right now in the contributor guide is:
That's great, except that there is no definition of what 'standard SQL' is (typically I would think we would point to PostgreSQL as the 'standard'), and there are many, many useful functions in other systems such as DuckDB, SingleStore, Spark, etc. that could be candidates for inclusion.
There is already a PR, spearheaded by the Comet team, to include Spark functions in DataFusion as a separate crate.
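To make the "separate crate" idea a bit more concrete, here is a rough sketch in plain Rust of how function packages could register themselves independently of core. It deliberately does not use DataFusion's real API: `ToyRegistry`, `register_core`, `register_spark`, and the two example functions are all hypothetical names made up for illustration, and a real scalar function would operate on Arrow arrays rather than `f64` slices.

```rust
use std::collections::HashMap;

// Hypothetical stand-in for a scalar function signature (assumption:
// real functions work on Arrow arrays, not plain f64 slices).
type ScalarFn = fn(&[f64]) -> Option<f64>;

/// Toy registry for illustration only; NOT DataFusion's actual registry.
/// Maps a function name to (owning package, implementation).
#[derive(Default)]
struct ToyRegistry {
    functions: HashMap<String, (String, ScalarFn)>,
}

impl ToyRegistry {
    /// Record a function under the package that maintains it.
    fn register(&mut self, package: &str, name: &str, f: ScalarFn) {
        self.functions
            .insert(name.to_string(), (package.to_string(), f));
    }

    /// Look up a function by name and call it; None if it is not registered.
    fn call(&self, name: &str, args: &[f64]) -> Option<f64> {
        let (_, f) = self.functions.get(name)?;
        f(args)
    }

    /// Which package provides this function, if any.
    fn package_of(&self, name: &str) -> Option<&str> {
        self.functions.get(name).map(|(p, _)| p.as_str())
    }
}

fn core_abs(args: &[f64]) -> Option<f64> {
    args.first().map(|x| x.abs())
}

fn spark_expm1(args: &[f64]) -> Option<f64> {
    args.first().map(|x| x.exp_m1())
}

// Each "crate" exposes a single registration entry point, so users
// opt in to a function collection instead of getting everything.
fn register_core(reg: &mut ToyRegistry) {
    reg.register("core", "abs", core_abs);
}

fn register_spark(reg: &mut ToyRegistry) {
    reg.register("spark", "expm1", spark_expm1);
}

fn main() {
    let mut reg = ToyRegistry::default();
    register_core(&mut reg);
    register_spark(&mut reg); // opt in to the extra collection
    println!("abs(-2) = {:?}", reg.call("abs", &[-2.0]));
    println!("expm1 comes from {:?}", reg.package_of("expm1"));
}
```

The point of the sketch is only that each collection owns one registration entry point, so what ships in core and what is opt-in becomes a packaging decision rather than a code change.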
The concern with having so many functions in core is the maintenance burden it incurs. The community has been able to handle it so far, but if we keep adding more functions in the future, that may no longer be the case.
I would like to come to a consensus on what we would accept, stated in more specific terms. Perhaps more importantly, can we provide a home for non-core functions where the community could maintain them outside of DataFusion core?
My proposal is:
Note that there is already datafusion-functions-extra in datafusion-contrib, but that is not an official Apache release. Is that sufficient instead of point #2 above?
Thoughts?