
Apache Spark Standalone Cluster on Docker

The project was featured in an article on MongoDB's official tech blog! 😱

The project also has its own article on the Towards Data Science Medium blog! ✨

Introduction

This project gives you an Apache Spark cluster in standalone mode with a JupyterLab interface built on top of Docker. Learn Apache Spark through its Scala, Python (PySpark) and R (SparkR) API by running the Jupyter notebooks with examples on how to read, process and write data.
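
As a taste of what the notebooks cover, here is a minimal PySpark sketch of the read, process and write flow; the file paths and the aggregation below are illustrative placeholders, not taken from the bundled notebooks.

# Minimal PySpark sketch of the read -> process -> write flow.
# The input and output paths below are hypothetical placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("intro-example").getOrCreate()

# Read: load a CSV file into a DataFrame.
df = spark.read.csv("data/example.csv", header=True, inferSchema=True)

# Process: count rows per value of the first column.
summary = df.groupBy(df.columns[0]).agg(F.count("*").alias("rows"))

# Write: persist the result as Parquet.
summary.write.mode("overwrite").parquet("data/example-summary.parquet")

spark.stop()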


TL;DR

curl -LO https://raw.githubusercontent.com/cluster-apps-on-docker/spark-standalone-cluster-on-docker/master/docker-compose.yml
docker-compose up

Contents

  • Quick Start
  • Tech Stack
  • Metrics
  • Contributing
  • Contributors
  • Support

Quick Start

Cluster overview

Application      URL             Description
JupyterLab       localhost:8888  Cluster interface with built-in Jupyter notebooks
Spark Driver     localhost:4040  Spark Driver web UI
Spark Master     localhost:8080  Spark Master node
Spark Worker I   localhost:8081  Spark Worker node with 1 core and 512m of memory (default)
Spark Worker II  localhost:8082  Spark Worker node with 1 core and 512m of memory (default)
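
If you build your own SparkSession inside JupyterLab, it can be pointed at the standalone master and sized to fit the default workers listed above. The spark://spark-master:7077 URL below is an assumption based on a typical docker-compose service name for this setup; check your docker-compose.yml for the actual host name.

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("cluster-overview-example")
    .master("spark://spark-master:7077")      # assumed standalone master URL
    .config("spark.executor.memory", "512m")  # matches the default worker memory
    .config("spark.cores.max", "2")           # one core on each of the two workers
    .getOrCreate()
)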

Prerequisites

Install Docker and Docker Compose; see the supported infra versions under Tech Stack.

Download from Docker Hub (easier)

  1. Download the docker compose file;
curl -LO https://raw.githubusercontent.com/cluster-apps-on-docker/spark-standalone-cluster-on-docker/master/docker-compose.yml
  2. Edit the docker compose file with your favorite tech stack version, checking the supported app versions under Tech Stack;
  3. Start the cluster;
docker-compose up
  4. Run Apache Spark code using the provided Jupyter notebooks with Scala, PySpark and SparkR examples (a minimal PySpark sketch follows this list);
  5. Stop the cluster by typing ctrl+c on the terminal;
  6. Run step 3 to restart the cluster.
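
Once the cluster is up, a quick way to confirm that jobs are really scheduled on the workers is to run a small action from a notebook and watch it in the Spark Driver UI at localhost:4040. This assumes a SparkSession named spark, created as in the sketches above.

# Tiny job that forces tasks onto the worker nodes.
spark.range(0, 1_000_000, numPartitions=4).selectExpr("sum(id)").show()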

Build from your local machine

Note: local builds are currently supported only on Linux distributions.

  1. Download the source code or clone the repository;
  2. Move to the build directory;
cd build
  3. Edit the build.yml file with your favorite tech stack version;
  4. Match those versions in the docker compose file;
  5. Build the images;
chmod +x build.sh ; ./build.sh
  6. Start the cluster;
docker-compose up
  7. Run Apache Spark code using the provided Jupyter notebooks with Scala, PySpark and SparkR examples;
  8. Stop the cluster by typing ctrl+c on the terminal;
  9. Run step 6 to restart the cluster.

Tech Stack

  • Infra

Component       Version
Docker Engine   1.13.0+
Docker Compose  1.10.0+

  • Languages and Kernels

Spark  Hadoop  Scala    Scala Kernel  Python  Python Kernel  R      R Kernel
3.x    3.2     2.12.10  0.10.9        3.7.3   7.19.0         3.5.2  1.1.1
2.x    2.7     2.11.12  0.6.0         3.7.3   7.19.0         3.5.2  1.1.1

  • Apps

Component     Version                Docker Tag
Apache Spark  2.4.0 | 2.4.4 | 3.0.0  <spark-version>
JupyterLab    2.1.4 | 3.0.0          <jupyterlab-version>-spark-<spark-version>

Metrics

Docker image size and download counts for the JupyterLab, Spark Master and Spark Worker images are tracked via Docker Hub badges.

Contributing

We'd love some help. To contribute, please read the contributing guidelines file.

Contributors

A list of the amazing people who have contributed to the project can be found in the contributors file. This project is maintained by:

André Perez - dekoperez - andre.marcos.perez@gmail.com

Support

Support us on GitHub by starring this project ⭐

Support us on Patreon. 💖