Johnny Foulds edited this page Aug 6, 2019 · 10 revisions

For this project the dataset is explored with Apache Spark using the Scala programming language. Because data exploration is experimental by nature, Apache Zeppelin was chosen to work interactively with the data while also making it easy to produce data visualizations on the fly.

Installing Zeppelin

The Zeppelin notebooks will be run from a Docker container, which allows a development environment to be spun up quickly as needed to work on data that does not require massive processing power.

Instead of building a custom Dockerfile for Zeppelin, the official Apache image from Docker Hub will be used.

docker pull apache/zeppelin:0.8.1

docker run -p 8080:8080 --rm --name zeppelin apache/zeppelin:0.8.1
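The web UI may take a little while to respond after the container starts. A minimal sketch for polling it from a second terminal, assuming `curl` is installed and the `8080:8080` port mapping above is used; `wait_for_url` is a hypothetical helper name, not part of Docker or Zeppelin:

```shell
#!/bin/sh
# wait_for_url URL [ATTEMPTS] [DELAY_SECONDS]
# Hypothetical helper: polls URL until it answers with a successful HTTP status.
# Returns 0 once the URL responds, 1 if every attempt fails.
wait_for_url() {
  url="$1"
  attempts="${2:-30}"   # default number of attempts (assumed value)
  delay="${3:-2}"       # default pause between attempts, in seconds (assumed value)
  i=0
  while [ "$i" -lt "$attempts" ]; do
    # -s: quiet, -f: treat HTTP errors as failure, --max-time: cap each request
    if curl -sf --max-time 2 -o /dev/null "$url"; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}
```

It could be called as `wait_for_url http://localhost:8080 && echo "Zeppelin is up"`. If Docker runs inside a virtual machine, the host name may need to be the VM's address instead of `localhost`.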

To specify the log and notebook directories, mount them as volumes and set the matching environment variables:

docker run -p 8080:8080 --rm \
-v $PWD/logs:/logs \
-v $PWD/notebook:/notebook \
-e ZEPPELIN_LOG_DIR='/logs' \
-e ZEPPELIN_NOTEBOOK_DIR='/notebook' \
--name zeppelin apache/zeppelin:<release-version> # e.g '0.7.1'

docker run -p 8080:8080 --rm -v $PWD/logs:/logs -v $PWD/notebook:/notebook -e ZEPPELIN_LOG_DIR='/logs' -e ZEPPELIN_NOTEBOOK_DIR='/notebook' --name zeppelin apache/zeppelin:0.7.2
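The bind mounts above expect `logs` and `notebook` directories under the current working directory. If they are missing, Docker typically creates them itself, often owned by root on Linux. The following sketch creates them first and prints the resulting command for review. `ZEPPELIN_TAG`, `WORK_DIR` and `print_zeppelin_cmd` are illustrative names chosen for this example, and the default tag is simply the one pulled earlier on this page:

```shell
#!/bin/sh
# Sketch only: ZEPPELIN_TAG, WORK_DIR and print_zeppelin_cmd are illustrative
# names for this example, not part of Zeppelin or Docker.
ZEPPELIN_TAG="${ZEPPELIN_TAG:-0.8.1}"   # image tag; defaults to the one pulled earlier
WORK_DIR="${WORK_DIR:-$PWD}"            # host directory that holds logs/ and notebook/

# Create the host directories up front so they belong to the current user.
mkdir -p "$WORK_DIR/logs" "$WORK_DIR/notebook"

# Print the docker run command so it can be reviewed before it is executed.
print_zeppelin_cmd() {
  printf 'docker run -p 8080:8080 --rm'
  printf ' -v "%s/logs:/logs"' "$WORK_DIR"
  printf ' -v "%s/notebook:/notebook"' "$WORK_DIR"
  printf ' -e ZEPPELIN_LOG_DIR=/logs -e ZEPPELIN_NOTEBOOK_DIR=/notebook'
  printf ' --name zeppelin apache/zeppelin:%s\n' "$ZEPPELIN_TAG"
}

print_zeppelin_cmd
```

Printing rather than executing keeps the script safe to run anywhere; the printed line can then be pasted into the terminal once it looks right.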

Note: if the Docker Quickstart Terminal is used, remember to set up port forwarding for port 8080 in VirtualBox.

Simple test:

val nums = Array(1,2,3,5,6)
val rdd = sc.parallelize(nums)

import spark.implicits._
val df = rdd.toDF("num")

df.show()