This is a Java program that uses Hadoop MapReduce and Google Dataproc to build an inverted index of the words occurring in a large set of documents extracted from web pages.
As our dataset, we use a subset of 74 files out of a total of 408 files (text extracted from HTML tags) derived from the Stanford WebBase project, which is available here. The data was obtained from a web crawl done in February 2007 and is one of the largest such collections, totaling more than 100 million web pages from more than 50,000 websites. This version has already been cleaned.
In this project we first set up a sample Hadoop cluster in Local (Standalone) Mode and then implement the actual project in Fully-Distributed Mode on Google Dataproc on the actual data set.
We need to set up the project on GNU/Linux, as it is the supported development and production platform for Hadoop. To get a Hadoop distribution, download a recent stable release from one of the Apache Download Mirrors. Please note that this project was deployed and tested using hadoop-3.1.1. Unpack the downloaded Hadoop distribution. In the distribution folder, edit the file etc/hadoop/hadoop-env.sh to define environment variables as follows:
# add these lines to hadoop-env.sh or set them in the terminal
export JAVA_HOME=/usr/java/latest
export PATH=${JAVA_HOME}/bin:${PATH}
export HADOOP_CLASSPATH=${JAVA_HOME}/lib/tools.jar
Then try the following command:
$ bin/hadoop
This should display the usage documentation for the Hadoop script. Now you are ready to start your Hadoop cluster in the standalone mode.
This simple Hadoop job takes two text files from the "input" folder as the input to the Mapper.
#file01
5722018411 Hello World Bye World
#file02
6722018415 Hello Hadoop Goodbye Hadoop
Submitting the Hadoop job and applying the Reduce step then generates an inverted index like the one below:
bye 5722018411:1
goodbye 6722018415:1
hadoop 6722018415:2
hello 5722018411:1 6722018415:1
world 5722018411:2
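For reference, here is a minimal sketch of how such a job can be written with the Hadoop MapReduce API. It is illustrative only: the class and method structure (TokenizerMapper, IndexReducer) and the tokenization details are assumptions, and the actual InvertedIndex.java shipped with this project may differ.

// Illustrative sketch of an inverted-index MapReduce job.
// The project's own InvertedIndex.java may differ in details.
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class InvertedIndex {

  // Each input line starts with a document ID followed by its text.
  // The mapper emits (word, docID) for every word occurrence.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, Text> {
    private final Text word = new Text();
    private final Text docId = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      if (!itr.hasMoreTokens()) return;
      docId.set(itr.nextToken());          // first token is the document ID
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken().toLowerCase());
        context.write(word, docId);
      }
    }
  }

  // The reducer counts occurrences per document and emits
  // "word <TAB> docID1:count1 docID2:count2 ...".
  public static class IndexReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    public void reduce(Text key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
      Map<String, Integer> counts = new HashMap<>();
      for (Text docId : values) {
        counts.merge(docId.toString(), 1, Integer::sum);
      }
      StringBuilder postings = new StringBuilder();
      for (Map.Entry<String, Integer> e : counts.entrySet()) {
        if (postings.length() > 0) postings.append(' ');
        postings.append(e.getKey()).append(':').append(e.getValue());
      }
      context.write(key, new Text(postings.toString()));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "inverted index");
    job.setJarByClass(InvertedIndex.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setReducerClass(IndexReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}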
To submit a Hadoop job, the MapReduce implementation should be packaged as a jar file. To do so, copy the InvertedIndex.java file of this project to the Hadoop distribution's root folder and, while you are still there, run the following commands to compile InvertedIndex.java and create a jar file.
$ bin/hadoop com.sun.tools.javac.Main InvertedIndex.java
$ jar cf invertedindex.jar InvertedIndex*.class
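To confirm that the classes were packaged correctly, you can list the jar's contents:
$ jar tf invertedindex.jar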
Copy input/file01 and input/file02 of this project and place them inside the input folder of the Hadoop distribution folder. While you are still there, run the following command to submit the job, read the input files from the input folder, generate the inverted index, and store the output in the output folder:
$ bin/hadoop jar invertedindex.jar InvertedIndex input output
Finally, to see the output, run the command below:
$ bin/hadoop fs -cat output/part-r-00000
In this section we create a cluster with 3 worker nodes on Google Dataproc and run the invertedindex.jar job on the actual data set.
First we need an account on Google Cloud Platform. You can sign up for a trial with $300 of free credits if you don't have one already.
In the Google Cloud Console, either create a new project or select an existing one. For this exercise we will use Dataproc. Using Dataproc, we can quickly create a cluster of compute instances running Hadoop. The alternative would be to individually set up each compute node, install Hadoop on it, set up HDFS, set up the master node, and so on. Dataproc automates this grueling process for us.
In the Google Cloud Console, select Dataproc from the navigation list on the left. If this is the first time you're using Dataproc, you might need to enable the Dataproc API first. Clicking on 'Create Cluster' will take you to the cluster configuration section. Give your cluster any unique name and select a desired zone. You need to create a master and 3 worker nodes. Select the default configuration processors (n1-standard-4, 4 vCPU, 15 GB memory) for the master and each worker, and reduce the storage to 32 GB HDD. Change the number of worker nodes to 3. Leave everything else at its default and click on 'Create'.
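If you prefer the command line over the console, roughly the same cluster can be created with the gcloud CLI; the cluster name and region below are placeholders:
$ gcloud dataproc clusters create invertedindex-cluster \
    --region us-central1 \
    --master-machine-type n1-standard-4 \
    --worker-machine-type n1-standard-4 \
    --master-boot-disk-size 32GB \
    --worker-boot-disk-size 32GB \
    --num-workers 3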
Now that the cluster is set up, we'll have to configure it a little before we can run jobs on it. Select the cluster you just created from the list of clusters under the Cloud Dataproc section of your console. Go to the VM Instances tab and click on the SSH button next to the instance with the Master role.
Clicking on the SSH button will take you to a command-line interface (CLI) like an xTerm or Terminal. All the commands in the following steps are to be entered in this CLI. There is no home directory on HDFS for the current user, so we'll have to set this up before proceeding further. (To find out your user name, run whoami.)
$ hadoop fs -mkdir -p /user/<your username here>
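If you prefer, the user name can be filled in automatically (assuming a POSIX shell):
$ hadoop fs -mkdir -p /user/$(whoami)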
JAVA_HOME has already been set up, so we don't need to set it again. Please note that the exports below have to be done each time you open a new SSH terminal. To eliminate this step, you can instead set JAVA_HOME, PATH, and HADOOP_CLASSPATH in etc/hadoop/hadoop-env.sh.
$ export PATH=${JAVA_HOME}/bin:${PATH}
$ export HADOOP_CLASSPATH=${JAVA_HOME}/lib/tools.jar
Now run:
$ hadoop fs -ls
If there is no error, your cluster was set up successfully. If you do encounter an error, it's most likely due to a missing environment variable or the user home directory not being set up correctly.
To ensure that the environment variables are set, run the command env.
NOTE: Please disable the billing for the cluster when you are not using it. Leaving it running will cost extra credits.
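One simple way to avoid charges when you are done is to delete the cluster entirely (it can be re-created later). Assuming the gcloud CLI and the placeholder names used above:
$ gcloud dataproc clusters delete invertedindex-cluster --region us-central1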
Download the dataset from this link and unzip the contents. You will find two folders inside named 'development' and 'full data'. 'development' data can be used for development and testing purposes.
Click on 'Dataproc' in the left navigation menu. Next, locate the address of the default Google cloud storage staging bucket for your cluster.
Go to the Storage section in the left navigation bar and select your cluster's default bucket from the list of buckets. Click on the Upload Folder button and upload the devdata and fulldata folders individually.
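Alternatively, assuming the gsutil CLI is installed and authenticated on your own machine, the folders can be uploaded from the command line (replace the bucket path with your cluster's default bucket):
$ gsutil -m cp -r devdata fulldata gs://dataproc-69070.../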
Now we are ready to submit a Hadoop job to run our MapReduce implementation on the actual data. Either copy InvertedIndex.java over SSH or create it with nano/vi on the cluster's master node.
Run the following commands to create the jar file:
$ hadoop com.sun.tools.javac.Main InvertedIndex.java
$ jar cf invertedindex.jar InvertedIndex*.class
Now you have a jar file for your job. You need to place this jar file in the default cloud bucket of your cluster. Just create a folder called JAR on your bucket and upload it to that folder. If you created your jar file on the cluster’s master node itself use the following commands to copy it to the JAR folder.
$ hadoop fs -copyFromLocal ./invertedindex.jar
$ hadoop fs -cp ./invertedindex.jar gs://dataproc-69070.../JAR
The gs://dataproc-69070... part is the default bucket of your cluster. It needs to be prefixed with gs:// to tell the Hadoop environment that it is a bucket and not a regular location on the filesystem.
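If the jar file is on your own machine instead, it can be uploaded directly to the bucket with gsutil (bucket path as above):
$ gsutil cp invertedindex.jar gs://dataproc-69070.../JAR/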
Go to the 'Jobs' section in the left navigation bar of the Dataproc page and click on 'Submit job'. Fill the job parameters as follows:
- Cluster: Select the cluster you created
- Job Type: Hadoop
- Jar File: The full path to the jar file you uploaded earlier to the Google storage bucket, e.g. gs://dataproc-69070.../JAR/invertedindex.jar.
- Main Class or jar: The name of the Java class containing the mapper and reducer (InvertedIndex).
- Arguments: This takes two arguments:
  - Input: the path to the input data you uploaded, e.g. gs://dataproc-69070.../fulldata.
  - Output: the path to the storage bucket followed by a new folder name, e.g. gs://dataproc-69070.../fulloutput. The folder is created during execution; you will get an error if you give the name of an existing folder.
- Leave the rest at their default settings. (An equivalent gcloud command is shown after this list.)
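As referenced above, the same job can also be submitted from the command line with the gcloud CLI; the cluster name and region are the placeholders used earlier, and the bucket paths follow the examples above:
$ gcloud dataproc jobs submit hadoop \
    --cluster invertedindex-cluster \
    --region us-central1 \
    --class InvertedIndex \
    --jars gs://dataproc-69070.../JAR/invertedindex.jar \
    -- gs://dataproc-69070.../fulldata gs://dataproc-69070.../fulloutput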
Now submit the job. You can see the log while it is running.
NOTE: If you encounter a java.lang.InterruptedException, you can safely ignore it. Your job will still execute.
The output files will be stored in the output folder in the bucket. If you open this folder you'll notice that the inverted index is split into several segments. (Delete the _SUCCESS file in the folder before merging all the output files.) To merge the files and create a final text output, run the following commands:
$ hadoop fs -getmerge gs://dataproc-69070.../fulloutput ./output.txt
$ hadoop fs -copyFromLocal ./output.txt
$ hadoop fs -cp ./output.txt gs://dataproc-69070.../output.txt
Now you have successfully created an inverted index of the whole data set. You can use grep to see the index entries for any specific word:
$ grep -w '^peace' output.txt
If you want to re-compile and submit a new job again, you can remove the old .jar, .class, .java, and output files using the commands below as appropriate.
$ hadoop fs -rm ./invertedindex.jar ./output.txt InvertedIndex*.class
$ hadoop fs -rm gs://dataproc-69070.../JAR/invertedindex.jar gs://dataproc-69070.../output.txt
$ hadoop fs -rm gs://dataproc-69070.../fulloutput/*
$ hadoop fs -rmdir gs://dataproc-69070.../fulloutput
This work is licensed under a Creative Commons Attribution 4.0 International License.