Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add extension plugin to parse SQL into logical plan / user defined SQL parser #533

Closed
jorgecarleitao opened this issue Jun 10, 2021 · 14 comments · Fixed by #9729
Closed
Labels
enhancement New feature or request

Comments

@jorgecarleitao
Copy link
Member

jorgecarleitao commented Jun 10, 2021

As a user of DataFusion, I would like to be able to install custom parsing rules of SQL to DataFusion, so that I can plan custom nodes from SQL.

This would allow me to extend datafusions' core capabilities beyond its supported SQL.

Examples:

  • OPTIMIZE, VACUUM
  • select * from t version as of n (delta lake)

I would like support for 3 main cases:

  • Parse entire SQL statements (e.g. select * from t version as of n) into a logical node
  • Parse single SQL expressions (e.g. my_custom_expr in select my_custom_expr(t) from table1) a custom logical expression
  • parse a table format into a Custom Table Provider logical node
@jorgecarleitao jorgecarleitao added the enhancement New feature or request label Jun 10, 2021
@adsharma
Copy link
Contributor

select * from t version as of n

Is preventing a query such as the following desirable?

select * from t1 version as of n1
union  all
select * from t2 version as of n2

or is this by design?

@houqp
Copy link
Member

houqp commented Jun 13, 2021

@adsharma I can't think of a reason why we would want to prevent the union query you mentioned. I don't think what @jorgecarleitao wrote in the issue description implies this?

@jorgecarleitao
Copy link
Member Author

I also do not see the issue with the example above, but I would say that, In general, custom SQL parsers effectively modify the SQL dialect that is being used and therefore the responsibility to document variations, including any limitation that they may introduce to the "default" postgres dialect, lays to the applications that use/install custom parsers.

@adsharma
Copy link
Contributor

adsharma commented Jun 14, 2021

I don't have much context about the proposal. Trying to understand things better. Please bear with me.

The reason why SQL doesn't have select * from t version as of n is that it violates the isolation property in ACID, by allowing a query to read data from two different points in time (or logical sequence numbers if you prefer).

Instead, they prefer a flow such as:

BEGIN // implicitly selects a version and all queries following would read from that version
select * from t1
union all
select * from t2;
COMMIT

I can also imagine a variant such as:

BEGIN TRANSACTION @n1
...
COMMIT

which has the same effect. The benefit of these variants is that it makes it harder to write SQL that violates isolation properties.

@houqp
Copy link
Member

houqp commented Jun 21, 2021

version as of n is a deltalake specific SQL extension. It's better to think of t version as of n as a different table. Datafusion is not a transactional query engine, so querying the same table should always return the same result.

@adsharma
Copy link
Contributor

adsharma commented Jun 21, 2021 via email

@jorgecarleitao
Copy link
Member Author

For avoidance of doubt, the goal of this issue is not to support these extension languages in DataFusion itself, but to allow users to plugin their own custom extensions, so that they can support them themselves.

@alamb alamb changed the title Add extension plugin to parse SQL into logical plan Add extension plugin to parse SQL into logical plan / user defined SQL Nov 12, 2021
@alamb
Copy link
Contributor

alamb commented Aug 20, 2022

An update here:

There are several examples of people who have implemented their own custom query language / domain specific language using DataFusion. For example vega fusion with the vega language, and InfluxDB IOx with the language that underlies flux and influxql.

Typically the approach is to use the LogialPlanBuilder and skip the datafusion SQLPLanner entirely

One way we might be able to accomplish this extension point would be like:

/// Default SQL planner 
pub trait  SQLPlanner :
  fn plan_query(&mut self, query: sqlparser::ast::Query) -> LogicalPlan {
    // call free function 
    plan_query(self, query)
   }

  fn plan_expr(&mut self, expr: sqlparser::ast::Expr) -> Expr {
   // existing logic work
    plan_expr(self, expr)
  }
}

// Free functions that do the work (as you can't call default trait impls)
// https://stackoverflow.com/questions/43849041/reuse-default-method-implementations-in-trait-impls

fn plan_query<P: SQLPlanner>(planner: &mut P, query: sqlparser::ast::Query) -> LogicalPlan {
    // existing logic
 }

fn plan_expr<P: SQLPlanner>(planner: &mut P, expr: sqlparser::ast::Expr) -> Expr {
   // existing logic work
}

I was able to make this type of pattern work for "sql rewriting" where the user is allowed to rewrite an SQL query prior to planning with the existing datafusion planner. You can see a sketch of such a solution in https://github.com/influxdata/influxdb_iox/pull/5438.

@alamb
Copy link
Contributor

alamb commented Aug 20, 2022

An extension planner then looks like

struct MySqlPlanner {
}

impl QueryPlanner for MySqlPlanner {
  // override expression planning
  fn plan_expr(&mut self, expr: sqlparser::ast::Expr) -> Expr {
    if need_to_rewrite(expr) {
        // custom planning logic
     } else {
       // fall back to default impl
       plan_expr(self, expr)
     }
  }
}

The issue with this approach is that it complicates the sql planner non trivially , so I am not sure it is a great idea

@philippemnoel
Copy link
Contributor

An update here:

There are several examples of people who have implemented their own custom query language / domain specific language using DataFusion. For example vega fusion with the vega language, and InfluxDB IOx with the language that underlies flux and influxql.

Typically the approach is to use the LogialPlanBuilder and skip the datafusion SQLPLanner entirely

One way we might be able to accomplish this extension point would be like:

/// Default SQL planner 
pub trait  SQLPlanner :
  fn plan_query(&mut self, query: sqlparser::ast::Query) -> LogicalPlan {
    // call free function 
    plan_query(self, query)
   }

  fn plan_expr(&mut self, expr: sqlparser::ast::Expr) -> Expr {
   // existing logic work
    plan_expr(self, expr)
  }
}

// Free functions that do the work (as you can't call default trait impls)
// https://stackoverflow.com/questions/43849041/reuse-default-method-implementations-in-trait-impls

fn plan_query<P: SQLPlanner>(planner: &mut P, query: sqlparser::ast::Query) -> LogicalPlan {
    // existing logic
 }

fn plan_expr<P: SQLPlanner>(planner: &mut P, expr: sqlparser::ast::Expr) -> Expr {
   // existing logic work
}

I was able to make this type of pattern work for "sql rewriting" where the user is allowed to rewrite an SQL query prior to planning with the existing datafusion planner. You can see a sketch of such a solution in https://github.com/influxdata/influxdb_iox/pull/5438.

That PR no longer links correctly, perhaps due to a rename. Does it still exist? I'd be curious look at the example

@alamb
Copy link
Contributor

alamb commented Nov 21, 2023

Hi @philippemnoel -- I am not sure how useful that PR would be, but I can dig up the diff if you would like

What are you trying to do?

You can see an example of using DataFusion to implement your own language here https://github.com/influxdata/influxdb/tree/main/iox_query_influxql/src (our implementation of InfluxQL)

I believe Greptime DB has their own frontend for promql here: https://github.com/GreptimeTeam/greptimedb/tree/9ff7670adfb56a80fe6ffeab8bdab9bcfe55543c/src/promql

There are several ways to extend DataFusion, such as with scalar / aggregate / window functions , documented here
https://arrow.apache.org/datafusion/library-user-guide/adding-udfs.html

As well as custom operators and optimizer passes

@alamb alamb changed the title Add extension plugin to parse SQL into logical plan / user defined SQL Add extension plugin to parse SQL into logical plan / user defined SQL parser Mar 21, 2024
@alamb
Copy link
Contributor

alamb commented Mar 21, 2024

@tshauck mentioned this in discord: https://discord.com/channels/885562378132000778/1166447479609376850/1219782560436191342


Hi, I'm wondering if someone could give some advice on how to use DFParser in my own parser. It's impl is:

pub struct DFParser<'a> {
    parser: Parser<'a>,
}

And basically I want to do:

pub struct MyParser<'a> {
    parser: DFParser<'a>,
}

so I can reuse as much as possible (e.g. parse_statement), but implement my own parsing logic on Keyword::COPY. The issue I'm having is that it doesn't seem possible to access the underlying sqlparser Parser as it's private to DFParser. At least with this approach I'd need access to it so I could call next_token() and other methods.

Is there a better approach to adding something like that on top of the DFParser, or would the maintainers be open to making parser available for modification on DFParser struct.

Put another way, the end goal is to customize how the SQL becomes a logical plan so I can use CopyTo and a user defined logical node to write to a custom file format.

Thanks!

For reference, here's parse_statement: https://github.com/apache/arrow-datafusion/blob/8074ca1e758470319699a562074290906003b312/datafusion/sql/src/parser.rs#L352-L384

@alamb
Copy link
Contributor

alamb commented Mar 21, 2024

@tshauck I think this feature has been missing for a while and it would be great if you help / propose improvements

I personally think exposing the underlying Parser would be fine

What I think would be the most compelling would be a new example in https://github.com/apache/arrow-datafusion/tree/main/datafusion-examples/examples showing how to do what you are describing above (aka support a custom file format).

As part of that PR, you would likely have to add / modify the DFParser API, and that modification would be easy to review / understand along with the example.

What do you think?

@tshauck
Copy link
Contributor

tshauck commented Mar 21, 2024

@alamb Cool yeah, happy to give this a go. I'll start w/ the example + making Parser public as you suggest.

FWIW, I did a spike yesterday to see if I could go the encapsulation route, but perhaps unsurprisingly, you need somewhat lower level control for a parser, so it ended up with a lot of methods on DFParser that mirrored Parser just to delegate to it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants