- Create a VM with 2 CPUs and 8 GB of memory running Debian 7 (wheezy).
- Install prerequisite binaries:
- (Optional) Install Emacs
$ sudo apt-get install emacs
- (Optional) Install Byobu
$ sudo apt-get install byobu
- Attach to byobu session:
$ byobu
- Install Git
$ sudo apt-get install git
- Install sbt
$ sudo apt-get install apt-transport-https
$ echo "deb https://dl.bintray.com/sbt/debian /" | sudo tee -a /etc/apt/sources.list.d/sbt.list
$ sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv 642AC823
$ sudo apt-get update
$ sudo apt-get install sbt
- Install Java 7 (OpenJDK is fine):
$ sudo apt-get install openjdk-7-jre
- Export JAVA_HOME to the environment by appending the following to ~/.bashrc:
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64/jre
- Apply the contents of ~/.bashrc by running:
$ source ~/.bashrc
- Download Hadoop 2.7.1 and install.
$ wget http://mirrors.gigenet.com/apache/hadoop/common/hadoop-2.7.1/hadoop-2.7.1.tar.gz
$ tar -xzf hadoop-2.7.1.tar.gz
$ cd hadoop-2.7.1
Export the Hadoop environment variables by appending them to ~/.bashrc. The start and stop commands later in this guide rely on HADOOP_PREFIX.
- Make sure the path points to the directory in which Hadoop is installed.
- Apply the contents of ~/.bashrc by running:
$ source ~/.bashrc
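Before moving on, it can help to confirm that the exported JAVA_HOME is usable. The function below is a minimal sketch, not part of Hadoop: the name check_java_home is hypothetical, and it only assumes that a working Java installation has an executable bin/java under JAVA_HOME.

```shell
# Hypothetical helper (not shipped with Hadoop): succeeds only if JAVA_HOME
# is set and contains an executable bin/java.
check_java_home() {
    if [ -z "${JAVA_HOME:-}" ]; then
        echo "JAVA_HOME is not set" >&2
        return 1
    fi
    if [ ! -x "$JAVA_HOME/bin/java" ]; then
        echo "no executable java under $JAVA_HOME/bin" >&2
        return 1
    fi
    echo "JAVA_HOME looks usable: $JAVA_HOME"
}
```

One way to use it from the Hadoop directory is `check_java_home && bin/hadoop version`, which prints the Hadoop version only when the Java check passes.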
Test Hadoop in Standalone mode.
$ mkdir input
$ cp etc/hadoop/*.xml input
$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.1.jar grep input output 'dfs[a-z.]+'
$ cat output/*
- You will see output that looks like:
1 dfsadmin
Configure HDFS. (http://www.alexjf.net/blog/distributed-systems/hadoop-yarn-installation-definitive-guide/)
Edit the etc/hadoop/hdfs-site.xml file to have the following:
<configuration>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///home/yosub_shin_0/hadoop-2.7.1/hdfs/datanode</value>
    <description>Comma separated list of paths on the local filesystem of a DataNode where it should store its blocks.</description>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///home/yosub_shin_0/hadoop-2.7.1/hdfs/namenode</value>
    <description>Path on the local filesystem where the NameNode stores the namespace and transaction logs persistently.</description>
  </property>
</configuration>
- Make sure to edit the paths above to point to where Hadoop is installed.
Also, add the following to etc/hadoop/core-site.xml:
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost/</value>
    <description>NameNode URI</description>
  </property>
</configuration>
- To deploy a cluster with more than one machine, follow the directions in the 'Cluster Installation' section.
Configure YARN.
You can skip this section if you will run Spark in Standalone cluster mode.
Edit etc/hadoop/yarn-site.xml as follows:
<configuration>
  <property>
    <name>yarn.scheduler.minimum-allocation-mb</name>
    <value>128</value>
    <description>Minimum limit of memory to allocate to each container request at the Resource Manager.</description>
  </property>
  <property>
    <name>yarn.scheduler.maximum-allocation-mb</name>
    <value>2048</value>
    <description>Maximum limit of memory to allocate to each container request at the Resource Manager.</description>
  </property>
  <property>
    <name>yarn.scheduler.minimum-allocation-vcores</name>
    <value>1</value>
    <description>The minimum allocation for every container request at the RM, in terms of virtual CPU cores. Requests lower than this won't take effect, and the specified value will get allocated the minimum.</description>
  </property>
  <property>
    <name>yarn.scheduler.maximum-allocation-vcores</name>
    <value>2</value>
    <description>The maximum allocation for every container request at the RM, in terms of virtual CPU cores. Requests higher than this won't take effect, and will get capped to this value.</description>
  </property>
  <property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>6144</value>
    <description>Physical memory, in MB, to be made available to running containers</description>
  </property>
  <property>
    <name>yarn.nodemanager.resource.cpu-vcores</name>
    <value>2</value>
    <description>Number of CPU cores that can be allocated for containers.</description>
  </property>
</configuration>
- To deploy a cluster with more than one machine, follow the directions in the 'Cluster Installation' section.
Start everything with the following script:
## Start HDFS daemons
# Format the namenode directory (DO THIS ONLY ONCE, THE FIRST TIME)
bin/hdfs namenode -format
# Start the namenode daemon
sbin/hadoop-daemon.sh start namenode
# Start the datanode daemon
sbin/hadoop-daemon.sh start datanode
## Start YARN daemons
# Start the resourcemanager daemon
sbin/yarn-daemon.sh start resourcemanager
# Start the nodemanager daemon
sbin/yarn-daemon.sh start nodemanager
- To make sure all daemons are up and running, use:
$ ps aux | grep java
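Reading the raw process listing by eye is error-prone, so the check can be narrowed to the four daemons started above. The sketch below is not part of Hadoop: missing_daemons is a hypothetical helper that assumes each daemon shows up in the listing under its main class name (for example org.apache.hadoop.hdfs.server.namenode.NameNode).

```shell
# Hypothetical helper: reads a process listing on stdin and prints the name
# of every expected Hadoop daemon that does not appear in it.
missing_daemons() {
    listing=$(cat)
    for daemon in NameNode DataNode ResourceManager NodeManager; do
        # Match the class name as the last dotted component, so that
        # SecondaryNameNode is not mistaken for NameNode.
        if ! printf '%s\n' "$listing" | grep -Eq "\.${daemon}( |\$)"; then
            echo "$daemon"
        fi
    done
}
```

Running `ps aux | missing_daemons` prints nothing when all four daemons are found, and otherwise prints one missing daemon name per line.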
Test Hadoop with:
$ bin/hadoop jar share/hadoop/yarn/hadoop-yarn-applications-distributedshell-2.7.1.jar org.apache.hadoop.yarn.applications.distributedshell.Client --jar share/hadoop/yarn/hadoop-yarn-applications-distributedshell-2.7.1.jar --shell_command date --num_containers 2 --master_memory 1024
Run the following command, substituting the application ID from your own run:
$ grep "" logs/userlogs/application_1442404936521_0001/**/stdout
- Each container reports the system time on its own line:
logs/userlogs/application_1442404936521_0001/container_1442404936521_0001_01_000002/stdout:Wed Sep 16 17:35:47 UTC 2015
logs/userlogs/application_1442404936521_0001/container_1442404936521_0001_01_000003/stdout:Wed Sep 16 17:35:48 UTC 2015
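The application ID in the grep command above changes on every run, so it can be convenient to look it up instead of typing it. The following is a rough sketch under stated assumptions: latest_app_dir is a hypothetical helper, and it relies on application directory names of equal length sorting lexically in submission order.

```shell
# Hypothetical helper: prints the newest application_* directory under the
# given userlogs directory (default: logs/userlogs).
latest_app_dir() {
    userlogs=${1:-logs/userlogs}
    # Application IDs embed a timestamp and a zero-padded sequence number,
    # so for IDs of equal length a lexical sort puts the newest one last.
    ls -d "$userlogs"/application_* 2>/dev/null | sort | tail -n 1
}
```

With that in place, `grep "" "$(latest_app_dir)"/*/stdout` shows the container output of the most recent application.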
Cluster Installation
For a cluster setup, we do the same thing, but ResourceManager and NameNode run on only one machine, whereas DataNode and NodeManager run on all of the machines.
HDFS Configuration: Change etc/hadoop/core-site.xml so that fs.defaultFS points at the NameNode host (test-01 here):
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://test-01/</value>
    <description>NameNode URI</description>
  </property>
</configuration>
YARN Configuration: Change etc/hadoop/yarn-site.xml to set the ResourceManager hostname:
<configuration>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>test-01</value>
    <description>The hostname of the RM.</description>
  </property>
</configuration>
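Because the same NameNode URI has to be present on every machine, generating the file from the master hostname avoids copy-and-paste mistakes. This is only a sketch: render_core_site is a hypothetical function, and it assumes the fs.defaultFS property lives in etc/hadoop/core-site.xml, which is where Hadoop 2.x reads it from.

```shell
# Hypothetical helper: prints a core-site.xml whose fs.defaultFS points at
# the given master hostname (the machine running the NameNode).
render_core_site() {
    master=$1
    cat <<EOF
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://${master}/</value>
    <description>NameNode URI</description>
  </property>
</configuration>
EOF
}
```

For example, `render_core_site test-01 > etc/hadoop/core-site.xml` overwrites the file on the current machine; any other properties already in that file would need to be merged by hand.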
Automatic deployment script setup
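- In order to conveniently deploy HDFS and YARN, we can use the start and stop scripts under sbin/.
- But before that, we need to set up the files those scripts read. In Hadoop 2.7, the list of worker hosts is etc/hadoop/slaves, one hostname per line.
- Also, we need to set JAVA_HOME correctly in the libexec/hadoop-config.sh file as follows (Hadoop is unable to retrieve environment variables from ~/.bashrc):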
# Newer versions of glibc use an arena memory allocator that causes virtual
# memory usage to explode. This interacts badly with the many threads that
# we use in Hadoop. Tune the variable down to prevent vmem explosion.
# Add this line here
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64/jre
# Attempt to set JAVA_HOME if it is not set
if [[ -z $JAVA_HOME ]]; then
- Then, we can start and stop HDFS and YARN as follows:
$ $HADOOP_PREFIX/sbin/start-dfs.sh
$ $HADOOP_PREFIX/sbin/start-yarn.sh
$ $HADOOP_PREFIX/sbin/stop-yarn.sh
$ $HADOOP_PREFIX/sbin/stop-dfs.sh
- Download Spark 1.5.0 and install (Choose 'Pre-built for Hadoop 2.6 and later' option).
$ cd ~/
$ wget http://apache.mirrors.ionfish.org/spark/spark-1.5.0/spark-1.5.0-bin-hadoop2.6.tgz
$ tar -xzf spark-1.5.0-bin-hadoop2.6.tgz
$ cd spark-1.5.0-bin-hadoop2.6
$ cp conf/spark-env.sh.template conf/spark-env.sh
- Open conf/spark-env.sh in an editor:
$ emacs conf/spark-env.sh
- Append the following to it:
export SPARK_DIST_CLASSPATH=$(/path/to/hadoop/bin/hadoop classpath)
export SPARK_PUBLIC_DNS=$(curl http://metadata/computeMetadata/v1/instance/network-interfaces/0/access-configs/0/external-ip -H "Metadata-Flavor: Google")
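- Note that you should change /path/to/hadoop to your home directory path followed by hadoop-2.7.1 (for example, /home/yosub_shin_0/hadoop-2.7.1).
- The SPARK_PUBLIC_DNS line queries the Google Compute Engine metadata server, so it only applies to VMs running on GCE.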
- Try to run a sample job using YARN cluster mode.
./bin/spark-submit --class org.apache.spark.examples.SparkPi \
--master yarn-cluster \
--num-executors 2 \
--driver-memory 1g \
--executor-memory 1g \
--executor-cores 1 \
lib/spark-examples*.jar
- Or run an interactive shell as follows:
$ ./bin/spark-shell --master yarn-client
- To run Spark without YARN, in Standalone mode:
- Install the Spark binary on all of the machines.
- On the master node, create the conf/slaves file and list all hosts that will spawn a worker, one per line.
- For this to work, enable passwordless SSH: generate a public/private key pair and add the public key to ~/.ssh/authorized_keys on all hosts.
- On the master node, run:
$ ./sbin/start-all.sh
- Try to run a sample job under Standalone cluster mode:
./bin/spark-submit --class org.apache.spark.examples.SparkPi \
--master spark://test-01:7077 \
--num-executors 4 \
--driver-memory 1g \
--executor-memory 1g \
--executor-cores 1 \
lib/spark-examples*.jar
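The passwordless SSH requirement for Standalone mode can be scripted once the worker host list exists. The functions below are a sketch rather than anything shipped with Spark: list_worker_hosts and copy_key_to_workers are hypothetical names, the host file is assumed to hold one hostname per line, and a key pair is assumed to exist already (for example, created with ssh-keygen).

```shell
# Hypothetical helper: prints the hosts listed in a slaves-style file,
# skipping blank lines and lines that start with '#'.
list_worker_hosts() {
    grep -v -e '^[[:space:]]*$' -e '^[[:space:]]*#' "$1"
}

# Hypothetical helper: copies the local public key to every listed host.
# Assumes password login still works for this first connection.
copy_key_to_workers() {
    for host in $(list_worker_hosts "$1"); do
        ssh-copy-id "$host"
    done
}
```

On the master node this could be run as `copy_key_to_workers conf/slaves`, after which `ssh <host>` should no longer prompt for a password.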