Quick Start Guide
You can follow the instructions here
Download a Spark version. As of this writing, we have tested with spark-1.6.0.
This demo provides an ultra-quick start, but it does not have the depth of scenarios or the scale of the TPCH dataset described below.
We have included a sample of the TPCH scale1 dataset in the quickstart folder. It contains a sample of facts from the datascale1 dataset. Note that the rows in the flattened dataset and the original lineitem table may not be the same, so don't use these datasets to compare result values.
Follow the instructions here to create the Druid Index for the flattened dataset. Copy and edit tpch_index_task.json.template; you need to change the baseDir location to match your configuration.
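If you prefer to script that edit from the spark-shell, a small Scala snippet can rewrite the baseDir value in the template. This is only an illustrative sketch: the template location and the target path below are assumptions about your checkout and data layout, so adjust them before running.

// illustrative sketch: copy tpch_index_task.json.template and point baseDir at your data
// (the template location and the target path are assumptions -- edit for your setup)
import java.nio.file.{Files, Paths}

val template = new String(
  Files.readAllBytes(Paths.get("quickstart/tpch_index_task.json.template")), "UTF-8")

// replace whatever value baseDir currently has with your local TPCH datascale1 folder
val task = template.replaceAll(
  "\"baseDir\"\\s*:\\s*\"[^\"]*\"",
  "\"baseDir\" : \"/path/to/tpch/datascale1\"")

Files.write(Paths.get("tpch_index_task.json"), task.getBytes("UTF-8"))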
The jar is available on the releases page, here, or via the Download link.
When starting the spark-shell, include the SparklineData packages and set up the DruidPlanner. For example, here is how you can start the spark-shell:
bin/spark-shell \
--packages com.databricks:spark-csv_2.10:1.1.0,SparklineData:spark-datetime:0.0.2 \
--jars <location of downloaded assembly jar>
In the shell, issue the following setup statements. These register the Sparklinedata datetime functions and the rewrite rules of the Sparklinedata DruidPlanner with the Spark SQLContext.
import org.sparklinedata.spark.dateTime.Functions._
import org.apache.spark.sql.sources.druid.DruidPlanner

// register the Sparklinedata datetime functions with the SQLContext
register(sqlContext)
// register the DruidPlanner rewrite rules with the SQLContext
DruidPlanner(sqlContext)
Queries in the Spark Shell
Set up the datasets in Spark:
- You need to change the path below to point to your folder for the orderLineItemPartSupplierCustomer flattened dataset.
sql("""
CREATE TABLE orderLineItemPartSupplierBase(o_orderkey integer,
o_custkey integer,
o_orderstatus string, o_totalprice double, o_orderdate string, o_orderpriority string,
o_clerk string,
o_shippriority integer, o_comment string, l_partkey integer, l_suppkey integer,
l_linenumber integer,
l_quantity double, l_extendedprice double, l_discount double, l_tax double,
l_returnflag string,
l_linestatus string, l_shipdate string, l_commitdate string, l_receiptdate string,
l_shipinstruct string,
l_shipmode string, l_comment string, order_year string, ps_partkey integer,
ps_suppkey integer,
ps_availqty integer, ps_supplycost double, ps_comment string, s_name string, s_address string,
s_phone string, s_acctbal double, s_comment string, s_nation string,
s_region string, p_name string,
p_mfgr string, p_brand string, p_type string, p_size integer, p_container string,
p_retailprice double,
p_comment string, c_name string , c_address string , c_phone string , c_acctbal double ,
c_mktsegment string , c_comment string , c_nation string , c_region string)
USING com.databricks.spark.csv
OPTIONS (path "<location of orderLineItemPartSupplierCustomer>",
header "false", delimiter "|")
""".stripMargin
)
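As an optional sanity check (not part of the required setup), you can peek at a few rows of the CSV-backed base table and confirm the registered schema:

// optional check: look at a few raw rows of the CSV-backed table
sql("select o_orderkey, l_shipdate, l_extendedprice from orderLineItemPartSupplierBase limit 5").show()
// and confirm the registered column names and types
sql("describe orderLineItemPartSupplierBase").show(100, false)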
// now create the orderLineItemPartSupplier DataFrame that links the raw data with the Druid Index
sql("""
CREATE TABLE orderLineItemPartSupplier
USING org.sparklinedata.druid
OPTIONS (sourceDataframe "orderLineItemPartSupplierBase",
timeDimensionColumn "l_shipdate",
druidDatasource "tpch",
druidHost "<zookeeper host for the Druid services>",
columnMapping '{ "l_quantity" : "sum_l_quantity", "ps_availqty" : "sum_ps_availqty", "cn_name" : "c_nation", "cr_name" : "c_region", "sn_name" : "s_nation", "sr_name" : "s_region" } ',
functionalDependencies '[ {"col1" : "c_name", "col2" : "c_address", "type" : "1-1"}, {"col1" : "c_phone", "col2" : "c_address", "type" : "1-1"}, {"col1" : "c_name", "col2" : "c_mktsegment", "type" : "n-1"}, {"col1" : "c_name", "col2" : "c_comment", "type" : "1-1"}, {"col1" : "c_name", "col2" : "c_nation", "type" : "n-1"}, {"col1" : "c_nation", "col2" : "c_region", "type" : "n-1"} ] ',
starSchema ' { "factTable" : "orderLineItemPartSupplier", "relations" : [] } ')
""".stripMargin
)
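As a quick, illustrative check that the Druid-backed table is usable, you can run a small grouped aggregate against it; queries of this shape are the kind the DruidPlanner rewrites to the Druid Index rather than scanning the raw files.

// a simple grouped aggregate over the Druid-backed table
sql("select l_returnflag, count(*) from orderLineItemPartSupplier group by l_returnflag").show()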
Now you can run queries against orderLineItemPartSupplier; these will be rewritten to use the Druid Index. For example:
sql("""
select l_returnflag as r, l_linestatus as ls,
count(*), sum(l_extendedprice) as s, max(ps_supplycost) as m, avg(ps_availqty) as a
from orderLineItemPartSupplier
group by l_returnflag, l_linestatus
order by s, ls, r
limit 3""".stripMargin
).show()
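To see whether the rewrite actually kicked in, inspect the query plan. When the rewrite applies, the physical plan should reference the Druid datasource rather than a scan of orderLineItemPartSupplierBase (the exact operator names depend on the release you are running):

sql("""
select l_returnflag as r, l_linestatus as ls,
count(*), sum(l_extendedprice) as s, max(ps_supplycost) as m, avg(ps_availqty) as a
from orderLineItemPartSupplier
group by l_returnflag, l_linestatus
order by s, ls, r
limit 3""".stripMargin
).explain(true)   // prints the logical and physical plans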
You can compare this with the non-rewritten query that runs against the raw flattened table:
sql("""
select l_returnflag as r, l_linestatus as ls,
count(*), sum(l_extendedprice) as s, max(ps_supplycost) as m, avg(ps_availqty) as a
from orderLineItemPartSupplierBase
group by l_returnflag, l_linestatus
order by s, ls, r
limit 3""".stripMargin
).show()
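If you want a rough feel for the difference, a crude wall-clock timer in the shell is enough. This is only an informal comparison (single run, no warm-up), not a benchmark, and the helper below is just an illustration:

// crude wall-clock timer for comparing queries in the shell (illustrative only)
def time[T](label: String)(body: => T): T = {
  val start = System.nanoTime()
  val result = body
  println(f"$label took ${(System.nanoTime() - start) / 1e9}%.2f s")
  result
}

time("druid rewritten") {
  sql("select l_returnflag, count(*) from orderLineItemPartSupplier group by l_returnflag").show()
}
time("raw csv scan") {
  sql("select l_returnflag, count(*) from orderLineItemPartSupplierBase group by l_returnflag").show()
}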
- Overview
- Quick Start
- User Guide
- [Defining a DataSource on a Flattened Dataset](https://github.com/SparklineData/spark-druid-olap/wiki/Defining-a-Druid-DataSource-on-a-Flattened-Dataset)
- Defining a Star Schema
- Sample Queries
- Approximate Count and Spatial Queries
- Druid Datasource Options
- Sparkline SQLContext Options
- Using Tableau with Sparkline
- How to debug a Query Plan?
- Running the ThriftServer with Sparklinedata components
- [Setting up multiple Sparkline ThriftServers - Load Balancing & HA](https://github.com/SparklineData/spark-druid-olap/wiki/Setting-up-multiple-Sparkline-ThriftServers-%28Load-Balancing-&-HA%29)
- Runtime Views
- Sparkline SQL extensions
- Sparkline Pluggable Modules
- Dev. Guide
- Reference Architectures
- Releases
- Cluster Spinup Tool
- TPCH Benchmark