This Docker container is meant for learning PySpark programming. It has the following components:
- Hadoop v3.2.1
- Spark v2.4.4
- Conda 3 with Python v3.7
After running the container, you may visit the following pages (the ports match the -p mappings in the docker run commands below and are the standard defaults for each component).
- http://localhost:9870 - HDFS NameNode UI
- http://localhost:9864 - HDFS DataNode UI
- http://localhost:8088 - YARN ResourceManager UI
- http://localhost:8080 - Spark Master UI
- http://localhost:18080 - Spark History Server UI
- http://localhost:8888 - Jupyter Lab

An example notebook is mounted at /root/ipynb and shows up in Jupyter Lab on port 8888. Before the PySpark code in that notebook will run, you have to upload the data.csv file to HDFS first; see the example notebook for the details.
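For example, after copying the file into HDFS from a shell inside the container (e.g. hdfs dfs -put data.csv /data.csv), a notebook cell along these lines can read it back. This is only a sketch: the HDFS destination path /data.csv and the namenode address localhost:9000 are assumptions based on the port mappings below, not values taken from the example notebook.

# minimal PySpark sketch for reading the uploaded CSV; path and namenode address are assumptions
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("read-data-csv")
         .master("spark://localhost:7077")
         .getOrCreate())

df = spark.read.csv("hdfs://localhost:9000/data.csv", header=True, inferSchema=True)
df.printSchema()
df.show(5)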
To run the container:
docker run -it \
-p 9870:9870 \
-p 8088:8088 \
-p 8080:8080 \
-p 18080:18080 \
-p 9000:9000 \
-p 8888:8888 \
-p 9864:9864 \
-v $HOME/git/docker-containers/spark-jupyter/ubuntu/root/ipynb:/root/ipynb \
-e PYSPARK_MASTER=spark://localhost:7077 \
spark-jupyter:local
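The -v flag mounts the repository's example notebooks into /root/ipynb inside the container (adjust the host path to wherever you cloned this repository), and PYSPARK_MASTER is presumably what the image uses to point PySpark at the standalone Spark master listening on port 7077.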
To run the container with a Jupyter notebook password:
docker run -it \
-p 9870:9870 \
-p 8088:8088 \
-p 8080:8080 \
-p 18080:18080 \
-p 9000:9000 \
-p 8888:8888 \
-p 9864:9864 \
-v $HOME/git/docker-containers/spark-jupyter/ubuntu/root/ipynb:/root/ipynb \
-e PYSPARK_MASTER=spark://localhost:7077 \
-e NOTEBOOK_PASSWORD=sha1:6676da7235c8:9c7d402c01e330b9368fa9e1637233748be11cc5 \
spark-jupyter:local
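NOTEBOOK_PASSWORD takes a hashed Jupyter password rather than the plain text. To generate a hash for your own password you can run something like the following inside the container; this is a sketch that assumes the classic notebook package (which provides notebook.auth.passwd) is installed in the Conda environment.

# prints a salted hash in the sha1:<salt>:<hash> format used above
# (assumes the classic Jupyter notebook package is installed)
from notebook.auth import passwd
print(passwd("my-secret-password", algorithm="sha1"))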
Commands to try after getting a shell inside the running container, e.g. docker exec -it <id> /bin/bash:
# test yarn
yarn jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.1.jar pi 1 50
# test spark against yarn
$SPARK_HOME/bin/spark-submit --class org.apache.spark.examples.SparkPi \
--master yarn \
$SPARK_HOME/examples/jars/spark-examples*.jar \
100
# test spark standalone
$SPARK_HOME/bin/spark-submit --class org.apache.spark.examples.SparkPi \
--master spark://localhost:7077 \
$SPARK_HOME/examples/jars/spark-examples*.jar \
100
# start a scala spark shell
$SPARK_HOME/bin/spark-shell --master spark://localhost:7077
# start a python spark shell
pyspark --master spark://localhost:7077 > /tmp/jupyter.log 2>&1 &
# start a python spark shell against yarn
pyspark \
--driver-memory 2g \
--executor-memory 2g \
--num-executors 1 \
--executor-cores 1 \
--conf spark.driver.maxResultSize=8g \
--conf spark.network.timeout=2000 \
--queue default \
--master yarn > /tmp/jupyter.log 2>&1 &
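As a quick end-to-end check from Python, the snippet below runs a rough Pi estimate against the standalone master, loosely mirroring the SparkPi jobs submitted above. It is a sketch, not the bundled example: the master URL and partition count simply reuse values that appear elsewhere in this README.

# rough PySpark analogue of the SparkPi smoke tests above
import random
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("pyspark-pi")
         .master("spark://localhost:7077")
         .getOrCreate())

def inside(_):
    x, y = random.random(), random.random()
    return x * x + y * y < 1.0

n = 100 * 10000  # total random samples
count = spark.sparkContext.parallelize(range(n), 100).filter(inside).count()
print("Pi is roughly %f" % (4.0 * count / n))
spark.stop()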