Get up and running in less than 5 minutes
This program demonstrates Hadoop's MapReduce concept in Java using a very simple example. The input is a set of raw data files listing earthquakes by region, magnitude, and other information. A sample record:
nc,71920701,1,"Saturday, January 12, 2013 19:43:18 UTC",38.7865,-122.7630,1.5,1.10,27,"Northern California"
In this record, 1.5 is the magnitude of the quake and "Northern California" is the name of the region where the reading was taken. The goal is to process all input files and find the maximum magnitude reading for every region listed. The output is in the form:
"region_name" <maximum magnitude of earthquake recorded>
The raw data files are in the input/ folder.
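The per-region maximum described above can be sketched in plain Java, without any Hadoop classes. This is only an illustration of the idea, not the project's actual Mapper and Reducer: the class name QuakeMaxSketch and its methods are hypothetical, and the column positions (magnitude at index 6, region at index 9 after quote-aware splitting) are assumptions read off the sample record above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical plain-Java illustration; not part of the project.
class QuakeMaxSketch {

    // Splits one record on commas that are outside double quotes, so the
    // quoted date field (which itself contains commas) stays in one piece.
    static List<String> splitCsv(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (char c : line.toCharArray()) {
            if (c == '"') {
                inQuotes = !inQuotes;            // toggle state, drop the quote itself
            } else if (c == ',' && !inQuotes) {
                fields.add(current.toString());  // field boundary
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString());
        return fields;
    }

    // "Map": extract (region, magnitude) from each record.
    // "Reduce": keep the largest magnitude seen for each region.
    static Map<String, Double> maxByRegion(List<String> lines) {
        Map<String, Double> max = new TreeMap<>();
        for (String line : lines) {
            List<String> f = splitCsv(line);
            if (f.size() < 10) {
                continue;                        // skip malformed records
            }
            double magnitude;
            try {
                magnitude = Double.parseDouble(f.get(6)); // assumed magnitude column
            } catch (NumberFormatException e) {
                continue;                        // skip header or non-numeric rows
            }
            String region = f.get(9);            // assumed region column
            max.merge(region, magnitude, Math::max);
        }
        return max;
    }

    public static void main(String[] args) {
        String sample = "nc,71920701,1,\"Saturday, January 12, 2013 19:43:18 UTC\","
                + "38.7865,-122.7630,1.5,1.10,27,\"Northern California\"";
        maxByRegion(Arrays.asList(sample)).forEach(
                (region, magnitude) -> System.out.println("\"" + region + "\" " + magnitude));
        // prints: "Northern California" 1.5
    }
}
```

In a real Hadoop job the grouping by region is done by the framework between the map and reduce phases; the single loop here merges both phases only to keep the sketch short.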
1. Download the Hadoop 1.1.1 binary from a mirror.
2. Extract it to a folder on your computer:
   $ tar xvfz hadoop-1.1.1.tar.gz
3. Set the JAVA_HOME environment variable to point to the directory where Java is installed. On my Mac OS X machine, I did the following:
   $ export JAVA_HOME=/System/Library/Frameworks/JavaVM.framework/Versions/1.6.0/Home
   Note: If you are running OS X Lion, you may want to set JAVA_HOME from the java_home command, which prints Java's home directory:
   $ export JAVA_HOME=$(/usr/libexec/java_home)
4. Set the HADOOP_INSTALL environment variable to point to the directory where you extracted the Hadoop binary in step 2:
   $ export HADOOP_INSTALL=/Users/umermansoor/Documents/hadoop-1.1.1
5. Add Hadoop's bin directory to the PATH environment variable:
   $ export PATH=$PATH:$HADOOP_INSTALL/bin
   Alternatively, add these variables to your shell startup script, for example ~/.bash_profile on Mac OS X.
6. Clone the project:
   $ git clone git@github.com:umermansoor/hadoop-java-example.git
7. Change to the project directory:
   $ cd hadoop-java-example
8. Build the project:
   $ mvn clean install
9. Set the HADOOP_CLASSPATH environment variable to tell Hadoop where to find the Java classes for the sample:
   $ export HADOOP_CLASSPATH=target/classes/
10. Run the sample. The output directory must not already exist; otherwise the run will fail.
    $ hadoop com.umermansoor.App input/ output
    Note: The output goes to the output/ folder, which Hadoop creates during the run, in a file called part-r-00000.
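Once the job has finished, the result file can be read back with a few lines of Java. The sketch below is a hypothetical helper, not part of the project. It assumes that each line of part-r-00000 holds the region and the magnitude separated by a tab (the default key/value separator of Hadoop's text output format) and that the region may still carry its double quotes.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for inspecting the job output; not part of the project.
class ReadJobOutputSketch {

    // Parses lines of the assumed form: region<TAB>magnitude.
    static Map<String, Double> parse(List<String> lines) {
        Map<String, Double> result = new LinkedHashMap<>();
        for (String line : lines) {
            int tab = line.lastIndexOf('\t');
            if (tab < 0) {
                continue;                        // no separator: skip the line
            }
            // Strip the quotes that may surround the region name.
            String region = line.substring(0, tab).replace("\"", "").trim();
            try {
                double magnitude = Double.parseDouble(line.substring(tab + 1).trim());
                result.put(region, magnitude);
            } catch (NumberFormatException e) {
                // value is not a number: skip the line
            }
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        // Default path matches the output folder used in the run step.
        Path file = Paths.get(args.length > 0 ? args[0] : "output/part-r-00000");
        Map<String, Double> maxima = parse(Files.readAllLines(file, StandardCharsets.UTF_8));
        maxima.forEach((region, magnitude) -> System.out.println(region + ": " + magnitude));
    }
}
```

If the job is configured with a different separator or output format, the parsing in parse would need to change accordingly.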
- Exception: java.lang.NoClassDefFoundError
  Cause: The HADOOP_CLASSPATH environment variable is not set, so Hadoop cannot find the Java classes for the sample.
  Resolution: Set HADOOP_CLASSPATH to point to the target/classes/ folder:
  $ export HADOOP_CLASSPATH=target/classes/
- Exception: org.apache.hadoop.mapred.FileAlreadyExistsException or 'Output directory output already exists'
  Cause: The output directory already exists. Hadoop requires that the output directory does not exist when the job starts.
  Resolution: Change the output directory or remove the existing one:
  $ hadoop com.umermansoor.App input/input.csv output_new
  Note: Hadoop failing when the output folder already exists is a good thing: it ensures that you don't accidentally overwrite previous output, which matters because typical Hadoop jobs take hours to complete.
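Choosing a new output directory, as suggested in the resolution above, can also be automated. The following is a minimal sketch, assuming the job writes to the local file system as in the steps of this README; the class OutputDirSketch and the "_1", "_2" naming scheme are hypothetical. It returns the requested directory if it is free and otherwise the first unused sibling name, without deleting anything.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper; not part of the project or of Hadoop.
class OutputDirSketch {

    // Returns 'requested' if nothing exists at that path, otherwise the first
    // of requested_1, requested_2, ... that does not exist yet.
    // Assumes 'requested' has a file name (it is not a file system root).
    static Path freshOutputDir(Path requested) {
        if (!Files.exists(requested)) {
            return requested;
        }
        for (int i = 1; ; i++) {
            Path candidate = requested.resolveSibling(requested.getFileName() + "_" + i);
            if (!Files.exists(candidate)) {
                return candidate;
            }
        }
    }

    public static void main(String[] args) {
        // Prints a directory name that is safe to pass as the job's output argument.
        Path out = freshOutputDir(Paths.get(args.length > 0 ? args[0] : "output"));
        System.out.println(out);
    }
}
```

Keeping the old directory and picking a new name follows the same reasoning as the note above: earlier results are never overwritten by accident.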