
Quick Start Guide


Install and Set Up Druid

You can follow the instructions here.

Download and unzip Spark

Download a Spark release and unzip it. As of this writing, we have tested with spark-1.6.0.

Retail dataset

This demo provides an ultra-quick start, but it does not have the depth of scenarios or the scale of the TPCH dataset below.

The TPCH dataset

We have included a sample of the TPCH scale1 dataset in the quickstart folder. It contains a sample of facts from the datascale1 dataset. Note that the rows in the flattened dataset and the original lineitem table may not be the same, so don't use these datasets to compare result values.

Index the sample TPCH dataset

Follow the instructions here to create the Druid index for the flattened dataset. Copy and edit tpch_index_task.json.template; you need to change the baseDir location to match your setup.

Download the Sparkline BI Accelerator jar (use version 0.1)

The jar is available on the releases page or via the Download link.

Starting the spark-shell

When starting the spark-shell, include the SparklineData packages and set up the DruidPlanner.

For example, here is how you can start the spark-shell:

bin/spark-shell \
--packages com.databricks:spark-csv_2.10:1.1.0,SparklineData:spark-datetime:0.0.2 \
--jars <location of downloaded assembly jar>

In the shell, issue the following setup statements. These register the Sparklinedata datetime functions and the rewrite rules of the Sparklinedata DruidPlanner with the Spark SQLContext.

import org.sparklinedata.spark.dateTime.Functions._
import org.apache.spark.sql.sources.druid.DruidPlanner

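// register the Sparklinedata datetime functions with the SQLContext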
register(sqlContext)
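// add the DruidPlanner rewrite rules to the SQLContext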
DruidPlanner(sqlContext)

Queries in the Spark Shell

Set up the datasets in Spark:

  • You need to change the path option below to point to your folder for the orderLineItemPartSupplierCustomer flattened dataset.
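// register the raw flattened dataset as a Spark table backed by the CSV files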
sql("""
CREATE TABLE orderLineItemPartSupplierBase(o_orderkey integer,
             o_custkey integer,
      o_orderstatus string, o_totalprice double, o_orderdate string, o_orderpriority string,
      o_clerk string,
      o_shippriority integer, o_comment string, l_partkey integer, l_suppkey integer,
      l_linenumber integer,
      l_quantity double, l_extendedprice double, l_discount double, l_tax double,
      l_returnflag string,
      l_linestatus string, l_shipdate string, l_commitdate string, l_receiptdate string,
      l_shipinstruct string,
      l_shipmode string, l_comment string, order_year string, ps_partkey integer,
      ps_suppkey integer,
      ps_availqty integer, ps_supplycost double, ps_comment string, s_name string, s_address string,
      s_phone string, s_acctbal double, s_comment string, s_nation string,
      s_region string, p_name string,
      p_mfgr string, p_brand string, p_type string, p_size integer, p_container string,
      p_retailprice double,
      p_comment string, c_name string , c_address string , c_phone string , c_acctbal double ,
      c_mktsegment string , c_comment string , c_nation string , c_region string)
      USING com.databricks.spark.csv
      OPTIONS (path "<location of orderLineItemPartSupplierCustomer>",
      header "false", delimiter "|")
""".stripMargin
)

// now create the orderLineItemPartSupplier DataFrame that links the raw data with the Druid Index
sql("""
CREATE TABLE orderLineItemPartSupplier
      USING org.sparklinedata.druid
      OPTIONS (sourceDataframe "orderLineItemPartSupplierBase",
      timeDimensionColumn "l_shipdate",
      druidDatasource "tpch",
      druidHost "<zookeeper host for the Druid services>",
      columnMapping '{
        "l_quantity" : "sum_l_quantity", "ps_availqty" : "sum_ps_availqty",
        "cn_name" : "c_nation", "cr_name" : "c_region",
        "sn_name" : "s_nation", "sr_name" : "s_region" }',
      functionalDependencies '[
        {"col1" : "c_name", "col2" : "c_address", "type" : "1-1"},
        {"col1" : "c_phone", "col2" : "c_address", "type" : "1-1"},
        {"col1" : "c_name", "col2" : "c_mktsegment", "type" : "n-1"},
        {"col1" : "c_name", "col2" : "c_comment", "type" : "1-1"},
        {"col1" : "c_name", "col2" : "c_nation", "type" : "n-1"},
        {"col1" : "c_nation", "col2" : "c_region", "type" : "n-1"} ]',
      starSchema '{ "factTable" : "orderLineItemPartSupplier", "relations" : [] }')
""".stripMargin
)

Now you can run queries against orderLineItemPartSupplier; they will be rewritten to use the Druid index. For example:

sql("""
select l_returnflag as r, l_linestatus as ls, 
  count(*), sum(l_extendedprice) as s, max(ps_supplycost) as m, avg(ps_availqty) as a  
from orderLineItemPartSupplier 
group by l_returnflag, l_linestatus 
order by s, ls, r 
limit 3""".stripMargin
).show()
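
To check that a query was actually rewritten, you can inspect its plan with explain. The sketch below assumes the setup above; if the rewrite applied, the physical plan should reference the Druid datasource rather than a scan of the CSV-backed base table.

// print the logical and physical plans; a rewritten query should reference the
// tpch Druid datasource instead of scanning the orderLineItemPartSupplierBase files
sql("""
select l_returnflag, l_linestatus, count(*)
from orderLineItemPartSupplier
group by l_returnflag, l_linestatus""".stripMargin
).explain(true)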

You can compare this with the equivalent non-rewritten query against the raw flattened table:

sql("""
select l_returnflag as r, l_linestatus as ls, 
  count(*), sum(l_extendedprice) as s, max(ps_supplycost) as m, avg(ps_availqty) as a  
from orderLineItemPartSupplierBase
group by l_returnflag, l_linestatus 
order by s, ls, r 
limit 3""".stripMargin
).show()
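
To get a rough feel for the speedup, you can time both variants from the shell. This is just a minimal wall-clock sketch, not a benchmark; the numbers will vary with caching and cluster load.

// crude wall-clock timer: runs the block once and prints the elapsed seconds
def time[T](label: String)(body: => T): T = {
  val start = System.nanoTime()
  val result = body
  println(s"$label: ${(System.nanoTime() - start) / 1e9}s")
  result
}

time("rewritten (Druid)") {
  sql("select l_returnflag, count(*) from orderLineItemPartSupplier group by l_returnflag").show()
}

time("raw (CSV)") {
  sql("select l_returnflag, count(*) from orderLineItemPartSupplierBase group by l_returnflag").show()
}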