-
Notifications
You must be signed in to change notification settings - Fork 92
Bridging Druid Results into the Spark Data pipeline
The DruidQueryBuilder maintains a map from the Druid Query Result columnName to the triple: (Expression, spark DataType, druid DataType):
- Expression is the Catalyst Expression from the original Plan that the Druid column in the Result row represents.
- The DataType of the Expression in the original SQL plan.
- The DataType of the value returned by Druid. The 2 datatypes need not match; during rewrite a check is made to see if the conversion from the Druid datatype to Spark Expression datatype is valid. If not, the rewrite doesn't happen.
This map as populated as expressions from the Aggregate Operator are added to the DruidQueryBuilder.
The rewritten Physical Plan contains a PhysicalRDD Operator at its base that wraps the DruidRDD build by the DruidRelation. More on the overall translation process can be found here ??? The schema for the PhysicalRDD Operator is formed by creating a StructType from each of the columns in the output Map maintained by the DruidQueryBuilder. For Grouping Expressions that were AttributeReferences in the original Plan, we reuse their ExprIds; for non AttributeReferences new ExprIds are generated. This way any resolved AttributeReferences above the replaced Plan SubTree are still valid and point to the correct child Attribute in the rewritten Plan.
A Projection Operator is added above the PhysicalRDD Operator to:
- provide the same schema as the original Aggregate Operator. (or the Ordering/Filter Operator above the Agg.Op in case of having/order/limit rewrites)
- To ensure Attribute names, ExprId and DataTypes match what was in the original Operator.
The ProjectionList is formed from the aggregation expressions of the original Agg. Operator. Any expressions that were mapped to Druid Result columns are replaced by AttributeReferences to the child PhysicalRDD Attributes. The following rules are followed:
- If needed the AttributeReference is wrapped in a cast to convert to the original Spark Plan's dataType.
- AttributeReferences in the original Plan carry the original ExprId, so that references above this Operator remain valid. Names from the original AttributeReference are also maintained by wrapping the new AttributeReference in an Alias.
- Overview
- Quick Start
-
User Guide
- [Defining a DataSource on a Flattened Dataset](https://github.com/SparklineData/spark-druid-olap/wiki/Defining-a Druid-DataSource-on-a-Flattened-Dataset)
- Defining a Star Schema
- Sample Queries
- Approximate Count and Spatial Queries
- Druid Datasource Options
- Sparkline SQLContext Options
- Using Tableau with Sparkline
- How to debug a Query Plan?
- Running the ThriftServer with Sparklinedata components
- [Setting up multiple Sparkline ThriftServers - Load Balancing & HA] (https://github.com/SparklineData/spark-druid-olap/wiki/Setting-up-multiple-Sparkline-ThriftServers-(Load-Balancing-&-HA))
- Runtime Views
- Sparkline SQL extensions
- Sparkline Pluggable Modules
- Dev. Guide
- Reference Architectures
- Releases
- Cluster Spinup Tool
- TPCH Benchmark