Defining a Star Schema
Define a Star Schema using the `create|alter star schema` command. A Star Schema is defined on the Fact Table, and is the basis for defining OLAP Indexes and other Metadata. Internally, the Star Schema specification is stored as Table Properties on the Fact Table.

For example, for the TPCH schema the create command is:
```sql
create star schema on lineitembase
as many_to_one join of lineitembase with orders on l_orderkey = o_orderkey
   many_to_one join of lineitembase with partsupp on
       l_partkey = ps_partkey and l_suppkey = ps_suppkey
   many_to_one join of partsupp with part on ps_partkey = p_partkey
   many_to_one join of partsupp with supplier on ps_suppkey = s_suppkey
   many_to_one join of orders with customer on o_custkey = c_custkey
   many_to_one join of customer with custnation on c_nationkey = cn_nationkey
   many_to_one join of custnation with custregion on cn_regionkey = cr_regionkey
   many_to_one join of supplier with suppnation on s_nationkey = sn_nationkey
   many_to_one join of suppnation with suppregion on sn_regionkey = sr_regionkey
```
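The alter form is not spelled out on this page; a minimal sketch, assuming it accepts the same join-list body as create:

```sql
-- assumption: the alter form mirrors the create form, replacing the
-- Star Schema specification stored on the Fact Table
alter star schema on lineitembase
as many_to_one join of lineitembase with orders on l_orderkey = o_orderkey
```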
You create the Star Schema on the Fact Table, specifying how the dimension tables are related (primary key - foreign key relationships). Each relation definition specifies the tables involved (leftTable, rightTable), the type of relation (1-1 or n-1), and the joining condition (specified as a conjunction of equality predicates).
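Since the specification is stored as Table Properties on the Fact Table, you can inspect what was persisted with standard Spark/Hive SQL (the exact property keys used internally are not documented here):

```sql
-- list the Table Properties on the Fact Table; the Star Schema
-- specification is stored among them
SHOW TBLPROPERTIES lineitembase;
```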
The **Star Schemas** we support have the following constraints:
- We only support **one-one** or **many-one** relations between entities.
- A table can be related to the **Fact Table** via only one unique path.
- The *column names* across the Star Schema must be unique, so no two tables in the Star Schema can have columns with the same name.
The first two constraints are an issue only in the most involved star schema models; for example, we show how TPCH can be modeled below. The third restriction is an implementation issue: when performing Query Plan rewrites we don't have access to the table an Attribute belongs to, so for now we get around this by forcing column names to be unique across the Star Schema. So for the TPCH model we define the Star Schema as:
```
FactTable = LineItem
StarRelations: [
  LineItem   - n:1 - Order      => [[li_orderkey], [o_orderkey]]
  LineItem   - n:1 - PartSupp   => [[li_partkey, li_suppkey], [ps_partkey, ps_suppkey]]
  Order      - n:1 - Customer   => [[o_custkey], [c_custkey]]
  PartSupp   - n:1 - Part       => [[ps_partkey], [p_partkey]]
  PartSupp   - n:1 - Supplier   => [[ps_suppkey], [s_suppkey]]
  Customer   - n:1 - CustNation => [[c_nationkey], [cn_nationkey]]
  CustNation - n:1 - CustRegion => [[cn_regionkey], [cr_regionkey]]
  Supplier   - n:1 - SuppNation => [[s_nationkey], [sn_nationkey]]
  SuppNation - n:1 - SuppRegion => [[sn_regionkey], [sr_regionkey]]
]
```
Because of these restrictions we have had to model the Nation table as separate CustNation and SuppNation tables; a similar separation has to be done for CustRegion and SuppRegion. Having to set up separate entities for the Supplier and Customer Nation is not atypical when writing SQL directly; normally these would be views on the same Nation dimension table. Currently we are more restrictive than this: we require the two views to be tables in the Metastore (this is because during Plan rewrite we lose the Table association in __Catalyst AttributeReferences__). But note, this doesn't require the data to be copied; both tables can point to the same underlying data in the storage layer.

We also have to rename the column names in the two Nation (and Region) tables, so that we can infer the Attribute-to-Table (in the Star Schema) associations in a Query Plan.
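A minimal sketch of how the two Nation tables might be set up over the same underlying data, assuming a Hive-compatible Metastore; the path, delimiter, and column list here are assumptions for illustration:

```sql
-- two Metastore tables over the same Nation data; only the column names
-- differ, keeping column names unique across the Star Schema
CREATE EXTERNAL TABLE custnation(
  cn_nationkey int, cn_name string, cn_regionkey int, cn_comment string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
LOCATION '/data/tpch/nation';

CREATE EXTERNAL TABLE suppnation(
  sn_nationkey int, sn_name string, sn_regionkey int, sn_comment string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
LOCATION '/data/tpch/nation';
```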
For backward compatibility, we will continue to support specifying the Star Schema for a Druid DataSource backed DataFrame. It is specified as an option, named `starSchema`, in the DataSource definition; the option value must be a JSON string. For example, for the TPCH schema the definition is:
```json
{
  "factTable": "lineitem",
  "relations": [
    {
      "leftTable": "lineitem",
      "rightTable": "orders",
      "relationType": "n-1",
      "joinCondition": [
        { "leftAttribute": "l_orderkey", "rightAttribute": "o_orderkey" }
      ]
    },
    {
      "leftTable": "lineitem",
      "rightTable": "partsupp",
      "relationType": "n-1",
      "joinCondition": [
        { "leftAttribute": "l_partkey", "rightAttribute": "ps_partkey" },
        { "leftAttribute": "l_suppkey", "rightAttribute": "ps_suppkey" }
      ]
    },
    {
      "leftTable": "partsupp",
      "rightTable": "part",
      "relationType": "n-1",
      "joinCondition": [
        { "leftAttribute": "ps_partkey", "rightAttribute": "p_partkey" }
      ]
    },
    {
      "leftTable": "partsupp",
      "rightTable": "supplier",
      "relationType": "n-1",
      "joinCondition": [
        { "leftAttribute": "ps_suppkey", "rightAttribute": "s_suppkey" }
      ]
    },
    {
      "leftTable": "orders",
      "rightTable": "customer",
      "relationType": "n-1",
      "joinCondition": [
        { "leftAttribute": "o_custkey", "rightAttribute": "c_custkey" }
      ]
    },
    {
      "leftTable": "customer",
      "rightTable": "custnation",
      "relationType": "n-1",
      "joinCondition": [
        { "leftAttribute": "c_nationkey", "rightAttribute": "cn_nationkey" }
      ]
    },
    {
      "leftTable": "custnation",
      "rightTable": "custregion",
      "relationType": "n-1",
      "joinCondition": [
        { "leftAttribute": "cn_regionkey", "rightAttribute": "cr_regionkey" }
      ]
    },
    {
      "leftTable": "supplier",
      "rightTable": "suppnation",
      "relationType": "n-1",
      "joinCondition": [
        { "leftAttribute": "s_nationkey", "rightAttribute": "sn_nationkey" }
      ]
    },
    {
      "leftTable": "suppnation",
      "rightTable": "suppregion",
      "relationType": "n-1",
      "joinCondition": [
        { "leftAttribute": "sn_regionkey", "rightAttribute": "sr_regionkey" }
      ]
    }
  ]
}
```
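A minimal sketch of passing this option in a DataSource definition; only the `starSchema` option name is confirmed above, while the `USING` class and the other option names shown are assumptions based on a typical Druid DataSource definition:

```sql
-- hypothetical DataSource definition; the JSON above would be inlined
-- (single-quoted) as the starSchema option value
CREATE TEMPORARY TABLE lineitem_druid
USING org.sparklinedata.druid
OPTIONS (
  sourceDataframe "lineitembase",
  druidDatasource "tpch",
  starSchema '{ "factTable": "lineitem", "relations": [ ... ] }'
);
```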