Dockerfile for Spark
- Base image: reetawwsum/hadoop
- Spark 2.1.0
Pull the Docker image from Docker Hub:
$ docker pull reetawwsum/spark
To run a Spark application using Jupyter notebooks (ports: 8888 Jupyter, 50070 HDFS NameNode UI, 8088 YARN ResourceManager, 8042 NodeManager, 4040 Spark UI):
$ docker run --rm -t -i --name spark -p 8888:8888 -p 50070:50070 -p 8088:8088 -p 8042:8042 -p 4040:4040 reetawwsum/spark --ip=0.0.0.0
To run a Spark application using Jupyter notebooks, mounting the current directory as the notebooks directory:
$ docker run --rm -t -i --name spark -p 8888:8888 -p 50070:50070 -p 8088:8088 -p 8042:8042 -p 4040:4040 -v $PWD:/usr/local/src/notebooks reetawwsum/spark --ip=0.0.0.0
To open a shell after launching the Jupyter notebook:
$ docker exec -t -i spark /bin/bash
To view the status of the Hadoop processes:
$ jps
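If the Hadoop daemons started correctly, `jps` typically lists entries along these lines (process IDs will differ, and the exact set of daemons depends on the image's Hadoop configuration):

```
2145 NameNode
2270 DataNode
2441 SecondaryNameNode
2603 ResourceManager
2721 NodeManager
3012 Jps
```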
To run the SimpleApp (Scala) self-contained application:
$ spark-submit --class "SimpleApp" --master local[4] Simple-Project/target/scala-2.11/simple-project_2.11-1.0.jar
To run the SimpleApp (Python) self-contained application:
$ spark-submit --master local[4] SimpleApp.py
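A minimal sketch of what SimpleApp.py might contain, modeled on the Spark quickstart line-count example (the README.md path and the pure `count_lines_with` helper are illustrative assumptions, not the repo's actual file):

```python
# SimpleApp.py - minimal self-contained PySpark application (sketch).

def count_lines_with(letter, lines):
    """Pure helper: count how many lines contain the given letter."""
    return sum(1 for line in lines if letter in line)

def main():
    # Import here so the helper above is usable even without Spark installed;
    # spark-submit puts pyspark on the path when it runs this file.
    try:
        from pyspark.sql import SparkSession
    except ImportError:
        print("pyspark not available; run this file with spark-submit")
        return

    spark = SparkSession.builder.appName("SimpleApp").getOrCreate()

    # Assumes a README.md exists in the working directory.
    log_data = spark.read.text("README.md").cache()
    num_a = log_data.filter(log_data.value.contains("a")).count()
    num_b = log_data.filter(log_data.value.contains("b")).count()
    print("Lines with a: %d, lines with b: %d" % (num_a, num_b))

    spark.stop()

if __name__ == "__main__":
    main()
```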
Clone this repo:
$ git clone https://github.com/reetawwsum/Spark-Dockerfile.git
$ cd Spark-Dockerfile
Then build the image from the Dockerfile:
$ docker build -t spark .
To build a PySpark script in the current directory from Sublime Text 3, copy the build file into your user packages folder:
$ cp PySpark.sublime-build [user-packages folder]
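A Sublime Text build file for this purpose typically wires spark-submit to the current file; a sketch of what PySpark.sublime-build might contain (the actual file in the repo may differ):

```json
{
    "cmd": ["spark-submit", "--master", "local[4]", "$file"],
    "selector": "source.python"
}
```

With this in place, Tools > Build (Ctrl+B) submits the open Python file to a local Spark instance.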