-
Notifications
You must be signed in to change notification settings - Fork 92
Quick Start Guide
There are 3 options of quickly get started.
- You can use our Cluster Spin Tool to spin up an AWS EMR cluster with Druid daemons setup and the TPCH(datascale 1 or datascale 100) datasets and Table metadata setup. Once started look at our Sample Queries for examples; the User Guide has more details on how to configure and understand query processing.
- If you want to setup an environment on your own, follow the instructions on this page. We provide you with instructions to setup a dev. environment and instructions on how to index datasets. We have provide a sample for the TPCH dataset and a demo Retail dataset.
You can follow the instructions here
Download a spark version. As of this writing, we have tested with spark-1.6.0
We included a sample of the TPCH dataset scale1 in the quickstart folder. This is includes a sample of facts from the datascale1 dataset. Note the rows in the flattened dataset and the original lineitem table may not be the same; so don't use these datasets to compare result data values.
Follow the instructions here to create the Druid Index for the flattened dataset. Copy and edit the tpch_index_task.json.template. You need to change the location of the baseDir in your configuration.
As of this writing the current release version(0.2.0) is available here
Following instructions on this page
For example:
sbin/start-sparklinedatathriftserver.sh ~/sparkline/spark-druid-olap/target/scala-2.10/accelerator-assembly-0.2.0.jar
start-sparklinedatathriftserver.sh
is a thin wrapper on spark scripts, so all spark command line options work when starting the Sparkline Enhanced ThriftServer.
- connect to the thriftserver, connectivity is the sames as connecting to Spark; if you are new to Spark detailed instructions are available here
- create the Spark underlying and Druid Datasource tables. For TPCH these would look like th following:
CREATE TABLE orderLineItemPartSupplierBase(
o_orderkey integer, o_custkey integer,
o_orderstatus string, o_totalprice double, o_orderdate string, o_orderpriority string,
o_clerk string,
o_shippriority integer, o_comment string, l_partkey integer, l_suppkey integer,
l_linenumber integer,
l_quantity double, l_extendedprice double, l_discount double, l_tax double,
l_returnflag string,
l_linestatus string, l_shipdate string, l_commitdate string, l_receiptdate string,
l_shipinstruct string,
l_shipmode string, l_comment string, order_year string, ps_partkey integer,
ps_suppkey integer,
ps_availqty integer, ps_supplycost double, ps_comment string, s_name string, s_address string,
s_phone string, s_acctbal double, s_comment string, s_nation string,
s_region string, p_name string,
p_mfgr string, p_brand string, p_type string, p_size integer, p_container string,
p_retailprice double,
p_comment string, c_name string , c_address string , c_phone string , c_acctbal double ,
c_mktsegment string , c_comment string , c_nation string , c_region string)
USING com.databricks.spark.csv
OPTIONS (path "<location of orderLineItemPartSupplierCustomer>",
header "false", delimiter "|")
;
-- now create the orderLineItemPartSupplier DataFrame that links the raw data with the Druid Index
CREATE TABLE if not exists orderLineItemPartSupplier
USING org.sparklinedata.druid
OPTIONS (sourceDataframe "orderLineItemPartSupplierBase",
timeDimensionColumn "l_shipdate",
druidDatasource "tpch",
druidHost "localhost",
zkQualifyDiscoveryNames "true",
columnMapping '{ "l_quantity" : "sum_l_quantity", "ps_availqty" : "sum_ps_availqty", "cn_name" : "c_nation", "cr_name" : "c_region", "sn_name" : "s_nation", "sr_name" : "s_region" } ',
numProcessingThreadsPerHistorical '1',
starSchema ' { "factTable" : "orderLineItemPartSupplier", "relations" : [] } ')
;
Now you can run queries against orderLineItemPartSupplier Druid DataSource Table. The Sparkline Planner will generate optimized query plans for queries against this table. For example:
explain
select l_returnflag as r, sum(l_extendedprice) as s
from orderLineItemPartSupplier
group by l_returnflag
;
== Physical Plan ==
Project [l_returnflag#75 AS r#126,alias-1#129 AS s#127]
+- Scan DruidQuery(1581882452): {
"q" : {
"jsonClass" : "GroupByQuerySpec",
"queryType" : "groupBy",
"dataSource" : "tpch",
"dimensions" : [ {
"jsonClass" : "DefaultDimensionSpec",
"type" : "default",
"dimension" : "l_returnflag",
"outputName" : "l_returnflag"
} ],
"granularity" : "all",
"aggregations" : [ {
"jsonClass" : "FunctionAggregationSpec",
"type" : "doubleSum",
"name" : "alias-1",
"fieldName" : "l_extendedprice"
} ],
"intervals" : [ "1993-01-01T00:00:00.000Z/1997-12-31T00:00:01.000Z" ]
},
"queryHistoricalServer" : false,
"numSegmentsPerQuery" : -1,
"intervalSplits" : [ {
"start" : 725846400000,
"end" : 883526401000
} ],
"outputAttrSpec" : [ {
"exprId" : {
"id" : 75,
"jvmId" : { }
},
"name" : "l_returnflag",
"dataType" : { },
"tf" : "toString"
}, {
"exprId" : {
"id" : 129,
"jvmId" : { }
},
"name" : "alias-1",
"dataType" : { },
"tf" : ""
} ]
}[l_returnflag#75,alias-1#129]
This gets run as Druid Query with a Spark Project Operator. The Aggregation is performed using the Druid Index, which is significantly faster than performing an Aggregation in Spark. The same query against the underlying Plan has a traditional Spark Plan:
explain
select l_returnflag as r, sum(l_extendedprice) as s
from orderLineItemPartSupplierBase
group by l_returnflag
== Physical Plan ==
TungstenAggregate(key=[l_returnflag#22], functions=[(sum(l_extendedprice#19),mode=Final,isDistinct=false)], output=[r#130,s#131])
+- TungstenExchange hashpartitioning(l_returnflag#22,200), None
+- TungstenAggregate(key=[l_returnflag#22], functions=[(sum(l_extendedprice#19),mode=Partial,isDistinct=false)], output=[l_returnflag#22,sum#135])
+- Project [l_returnflag#22,l_extendedprice#19]
+- Scan CsvRelation(file:///Users/hbutani/tpch/datascale1/orderLineItemPartSupplierCustomer.small/,false,|,"...
More examples on the TPCH dataset are in the Sample Queries page; the User Guide has more details on how to configure and understand query processing
This demo provides an ultra quick start but does not have the depth of scenarios and scale as the tpch dataset below.
- Overview
- Quick Start
-
User Guide
- [Defining a DataSource on a Flattened Dataset](https://github.com/SparklineData/spark-druid-olap/wiki/Defining-a Druid-DataSource-on-a-Flattened-Dataset)
- Defining a Star Schema
- Sample Queries
- Approximate Count and Spatial Queries
- Druid Datasource Options
- Sparkline SQLContext Options
- Using Tableau with Sparkline
- How to debug a Query Plan?
- Running the ThriftServer with Sparklinedata components
- [Setting up multiple Sparkline ThriftServers - Load Balancing & HA] (https://github.com/SparklineData/spark-druid-olap/wiki/Setting-up-multiple-Sparkline-ThriftServers-(Load-Balancing-&-HA))
- Runtime Views
- Sparkline SQL extensions
- Sparkline Pluggable Modules
- Dev. Guide
- Reference Architectures
- Releases
- Cluster Spinup Tool
- TPCH Benchmark