Defining a Star Schema
Define a Star Schema using the `create|alter star schema` command. A Star Schema is defined on the Fact Table, and is the basis for defining OLAP Indexes and other Metadata. Internally, the Star Schema specification is stored as Table Properties on the Fact Table.

For example, for the TPCH schema the create command is:
```sql
create star schema on lineitembase
as many_to_one join of lineitembase with orders on l_orderkey = o_orderkey
   many_to_one join of lineitembase with partsupp on
       l_partkey = ps_partkey and l_suppkey = ps_suppkey
   many_to_one join of partsupp with part on ps_partkey = p_partkey
   many_to_one join of partsupp with supplier on ps_suppkey = s_suppkey
   many_to_one join of orders with customer on o_custkey = c_custkey
   many_to_one join of customer with custnation on c_nationkey = cn_nationkey
   many_to_one join of custnation with custregion on cn_regionkey = cr_regionkey
   many_to_one join of supplier with suppnation on s_nationkey = sn_nationkey
   many_to_one join of suppnation with suppregion on sn_regionkey = sr_regionkey
```
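The alter form is not spelled out on this page; a minimal sketch, assuming it accepts the same join-list body as create:

```sql
-- assumption: the alter form mirrors the create form, replacing the
-- Star Schema specification stored on the Fact Table
alter star schema on lineitembase
as many_to_one join of lineitembase with orders on l_orderkey = o_orderkey
```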
You create the Star Schema on the Fact Table, specifying how the dimension tables are related (primary key - foreign key relationships). Each relation definition specifies the tables involved (leftTable, rightTable), the type of relation (1-1 or n-1), and the joining condition (specified as a conjunction of equality predicates).
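Since the specification is stored as Table Properties on the Fact Table, you can inspect what was persisted with standard Spark/Hive SQL (the exact property keys used internally are not documented here):

```sql
-- list the Table Properties on the Fact Table; the Star Schema
-- specification is stored among them
SHOW TBLPROPERTIES lineitembase;
```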
The **Star Schemas** we support have the following constraints:
- We only support **one-one** or **many-one** relations between entities.
- A table can be related to the **Fact Table** via only one unique path.
- The *column names* across the Star Schema must be unique, so no two tables in the Star Schema can have columns with the same name.
The first two constraints are an issue only in the most involved star schema models; for example, we show how TPCH can be modeled below. The third restriction is an implementation issue: when performing Query Plan rewrites we don't have access to the table an Attribute belongs to, so for now we get around this by forcing column names to be unique across the Star Schema. So for the TPCH model we define the Star Schema as:
```
FactTable = LineItem
StarRelations: [
  LineItem   - n:1 - Order      => [[li_orderkey], [o_orderkey]]
  LineItem   - n:1 - PartSupp   => [[li_partkey, li_suppkey], [ps_partkey, ps_suppkey]]
  Order      - n:1 - Customer   => [[o_custkey], [c_custkey]]
  PartSupp   - n:1 - Part       => [[ps_partkey], [p_partkey]]
  PartSupp   - n:1 - Supplier   => [[ps_suppkey], [s_suppkey]]
  Customer   - n:1 - CustNation => [[c_nationkey], [cn_nationkey]]
  CustNation - n:1 - CustRegion => [[cn_regionkey], [cr_regionkey]]
  Supplier   - n:1 - SuppNation => [[s_nationkey], [sn_nationkey]]
  SuppNation - n:1 - SuppRegion => [[sn_regionkey], [sr_regionkey]]
]
```
Because of these restrictions we have had to model the Nation table as separate CustNation and SuppNation tables; a similar separation has to be done for CustRegion and SuppRegion. Having to set up separate entities for the Supplier and Customer Nation is not atypical when writing SQL directly; normally these would be views on the same Nation dimension table. Currently we are more restrictive than this: we require the two views to be tables in the Metastore (this is because during Plan rewrite we lose the Table association in __Catalyst AttributeReferences__). But note, this doesn't require the data to be copied; both tables can point to the same underlying data in the storage layer.

We also have to rename the column names in the two Nation (and Region) tables, so that we can infer the Attribute-to-Table (in the Star Schema) associations in a Query Plan.
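A minimal sketch of how the two Nation tables might be set up over the same underlying data, assuming a Hive-compatible Metastore; the path, delimiter, and column list here are assumptions for illustration:

```sql
-- two Metastore tables over the same Nation data; only the column names
-- differ, keeping column names unique across the Star Schema
CREATE EXTERNAL TABLE custnation(
  cn_nationkey int, cn_name string, cn_regionkey int, cn_comment string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
LOCATION '/data/tpch/nation';

CREATE EXTERNAL TABLE suppnation(
  sn_nationkey int, sn_name string, sn_regionkey int, sn_comment string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
LOCATION '/data/tpch/nation';
```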
For backward compatibility, we will continue to support specifying the Star Schema for a Druid DataSource backed DataFrame. It is specified as an option, named `starSchema`, in the DataSource definition; the option value must be a JSON string. For example, for the TPCH schema the definition is:
```json
{
  "factTable": "lineitem",
  "relations": [
    {
      "leftTable": "lineitem",
      "rightTable": "orders",
      "relationType": "n-1",
      "joinCondition": [
        { "leftAttribute": "l_orderkey", "rightAttribute": "o_orderkey" }
      ]
    },
    {
      "leftTable": "lineitem",
      "rightTable": "partsupp",
      "relationType": "n-1",
      "joinCondition": [
        { "leftAttribute": "l_partkey", "rightAttribute": "ps_partkey" },
        { "leftAttribute": "l_suppkey", "rightAttribute": "ps_suppkey" }
      ]
    },
    {
      "leftTable": "partsupp",
      "rightTable": "part",
      "relationType": "n-1",
      "joinCondition": [
        { "leftAttribute": "ps_partkey", "rightAttribute": "p_partkey" }
      ]
    },
    {
      "leftTable": "partsupp",
      "rightTable": "supplier",
      "relationType": "n-1",
      "joinCondition": [
        { "leftAttribute": "ps_suppkey", "rightAttribute": "s_suppkey" }
      ]
    },
    {
      "leftTable": "orders",
      "rightTable": "customer",
      "relationType": "n-1",
      "joinCondition": [
        { "leftAttribute": "o_custkey", "rightAttribute": "c_custkey" }
      ]
    },
    {
      "leftTable": "customer",
      "rightTable": "custnation",
      "relationType": "n-1",
      "joinCondition": [
        { "leftAttribute": "c_nationkey", "rightAttribute": "cn_nationkey" }
      ]
    },
    {
      "leftTable": "custnation",
      "rightTable": "custregion",
      "relationType": "n-1",
      "joinCondition": [
        { "leftAttribute": "cn_regionkey", "rightAttribute": "cr_regionkey" }
      ]
    },
    {
      "leftTable": "supplier",
      "rightTable": "suppnation",
      "relationType": "n-1",
      "joinCondition": [
        { "leftAttribute": "s_nationkey", "rightAttribute": "sn_nationkey" }
      ]
    },
    {
      "leftTable": "suppnation",
      "rightTable": "suppregion",
      "relationType": "n-1",
      "joinCondition": [
        { "leftAttribute": "sn_regionkey", "rightAttribute": "sr_regionkey" }
      ]
    }
  ]
}
```
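A minimal sketch of passing this option in a DataSource definition; only the `starSchema` option name is confirmed above, while the `USING` class and the other option names shown are assumptions based on a typical Druid DataSource definition:

```sql
-- hypothetical DataSource definition; the JSON above would be inlined
-- (single-quoted) as the starSchema option value
CREATE TEMPORARY TABLE lineitem_druid
USING org.sparklinedata.druid
OPTIONS (
  sourceDataframe "lineitembase",
  druidDatasource "tpch",
  starSchema '{ "factTable": "lineitem", "relations": [ ... ] }'
);
```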