This repository has been archived by the owner on Mar 30, 2021. It is now read-only.
Generating Denormalized TPCH Dataset
hbutani edited this page Jan 14, 2016
These instructions are for creating the flattened dataset when running locally (in a developer environment).
- Use the TPCH DBGen tool to generate the TPCH dataset for a given data scale. Data scale 1 should be more than enough for a dev environment.
- Clone and build the tpch utils package. To build, issue the commands:
cd tpchData; sbt clean compile package
(You need sbt installed for this.)
- Download a Spark version. As of this writing, we have tested with spark-1.5.2.
- Issue the following to create the flattened dataset:
bin/spark-submit \
--packages com.databricks:spark-csv_2.10:1.1.0,SparklineData:spark-datetime:0.0.2,SparklineData:spark-druid-olap:0.0.2 \
--class org.sparklinedata.tpch.TpchGenMain \
/Users/hbutani/sparkline/tpch-spark-druid/tpchData/target/scala-2.10/tpchdata-assembly-0.0.1.jar \
--baseDir /Users/hbutani/tpch/ --scale 1
where:
- "/Users/hbutani/sparkline/tpch-spark-druid/tpchData/target/scala-2.10/tpchdata-assembly-0.0.1.jar" is the location of the tpch-utils jar.
- "/Users/hbutani/tpch/" is the location of the TPCH data. Under this folder there are one or more datascale folders whose names are of the form 'datascale%n' (e.g. 'datascale1').
- The flattened dataset is written to a subfolder named 'orderLineItemPartSupplierCustomer' under the datascale folder.
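The folder layout described above can be sketched as a small dry-run script. This is only an illustration: the variable names (`BASE_DIR`, `SCALE`, `TPCH_JAR`) and the helper functions `scale_dir` and `flattened_dir` are hypothetical and not part of the tpch utils package. The script does not run Spark. It only derives the input and output folders from the naming rules above and prints the `spark-submit` command for review.

```shell
#!/bin/sh
# Sketch only: these names are illustrative assumptions, not project APIs.
BASE_DIR="${BASE_DIR:-/Users/hbutani/tpch/}"   # value passed as --baseDir
SCALE="${SCALE:-1}"                            # value passed as --scale
TPCH_JAR="${TPCH_JAR:-/Users/hbutani/sparkline/tpch-spark-druid/tpchData/target/scala-2.10/tpchdata-assembly-0.0.1.jar}"

# Datascale folder: <baseDir>/datascale<n> (one trailing slash on baseDir is dropped).
scale_dir() {
  printf '%s/datascale%s\n' "${1%/}" "$2"
}

# Folder the flattened dataset is written to, under the datascale folder.
flattened_dir() {
  printf '%s/orderLineItemPartSupplierCustomer\n' "$(scale_dir "$1" "$2")"
}

in_dir=$(scale_dir "$BASE_DIR" "$SCALE")
if [ ! -d "$in_dir" ]; then
  echo "warning: $in_dir not found; generate the data with DBGen first" >&2
fi
echo "flattened output expected at: $(flattened_dir "$BASE_DIR" "$SCALE")"

# Print (do not execute) the command; run it from the Spark install folder.
cat <<EOF
bin/spark-submit \\
  --packages com.databricks:spark-csv_2.10:1.1.0,SparklineData:spark-datetime:0.0.2,SparklineData:spark-druid-olap:0.0.2 \\
  --class org.sparklinedata.tpch.TpchGenMain \\
  $TPCH_JAR \\
  --baseDir $BASE_DIR --scale $SCALE
EOF
```

Overriding the defaults, for example `BASE_DIR=/data/tpch SCALE=1 sh script.sh`, prints the same command with your own paths substituted.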