
Running a Cascalog Job on Elastic MapReduce

by Alex Robbins

Problem

You have a large amount of data to process, but you don’t have a Hadoop cluster.

Solution

Amazon’s Elastic MapReduce (EMR) provides on-demand Hadoop clusters. You’ll need an Amazon Web Services account to use EMR.

First, write a Cascalog job as you normally would. A number of recipes in this chapter can help you create a complete Cascalog job. If you don’t have one of your own, you can clone the sample project from [sec_cascalog_etl]:

$ git clone https://github.com/clojure-cookbook/cascalog-samples.git
$ cd cascalog-samples
$ git checkout etl-sample

Once you have a Cascalog project, package it into an uberjar:

$ lein compile
$ lein uberjar

Next, upload the generated JAR (target/cookbook-0.1.0-SNAPSHOT-standalone.jar, if you’re following along with the ETL sample) to S3. If you have never uploaded a file to S3, follow the S3 documentation to "Create a Bucket" and "Add an Object to a Bucket". Repeat the process to upload your input data. Take note of the S3 paths to both the JAR and the input data.
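
If you prefer the command line to the web console, the AWS CLI can perform the same uploads. This is a minimal sketch: the bucket name my-cascalog-bucket and the input file input.tsv are placeholders for your own.

# Create a bucket, then copy the uberjar and the input data into it
$ aws s3 mb s3://my-cascalog-bucket
$ aws s3 cp target/cookbook-0.1.0-SNAPSHOT-standalone.jar s3://my-cascalog-bucket/
$ aws s3 cp input.tsv s3://my-cascalog-bucket/input/input.tsv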

To create your MapReduce job, visit https://console.aws.amazon.com/elasticmapreduce/ and select "Create New Job Flow" (see Figure 1). In the new job flow wizard, choose the "Custom JAR" job type, select "Continue", and enter your JAR’s location and arguments. "JAR Location" is the S3 path you noted earlier. "JAR Arguments" are the arguments you would normally pass when executing the JAR. For the Cascalog samples repository, those are the fully qualified name of the class to execute, cookbook.etl.Main, followed by an s3n:// URI for the input data and an s3n:// URI for the output.
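
For example, with the placeholder bucket from earlier, the "JAR Arguments" field would contain something like:

cookbook.etl.Main s3n://my-cascalog-bucket/input/ s3n://my-cascalog-bucket/output/

Note that the output location must not already exist; Hadoop refuses to start a job whose output directory is already present.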

The next few wizard windows allow you to specify additional configuration options for the job. Select "Continue" until you reach the review step, then start your job.
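
If you would rather script the whole submission than click through the wizard, the AWS CLI’s aws emr create-cluster command can create an equivalent job flow. The following is a sketch, not a tested incantation: the bucket paths reuse the placeholders from earlier, and the release label, instance type, and instance count are illustrative values you should adjust.

# Spin up a cluster, run the single custom JAR step, and shut down when done
$ aws emr create-cluster \
    --name "cascalog-etl" \
    --release-label emr-5.0.0 \
    --use-default-roles \
    --instance-type m3.xlarge \
    --instance-count 3 \
    --log-uri s3://my-cascalog-bucket/logs/ \
    --auto-terminate \
    --steps 'Type=CUSTOM_JAR,Name=etl,Jar=s3://my-cascalog-bucket/cookbook-0.1.0-SNAPSHOT-standalone.jar,Args=[cookbook.etl.Main,s3n://my-cascalog-bucket/input/,s3n://my-cascalog-bucket/output/]'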

After your job has run, you should be able to retrieve the results from S3. Elastic MapReduce also lets you set up a logging path to help with debugging if your job doesn’t complete as you expect.
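
Once the job finishes, you can download the output through the S3 console or, assuming the AWS CLI and the placeholder bucket from earlier, in one command:

$ aws s3 cp s3://my-cascalog-bucket/output/ ./output/ --recursive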

Figure 1. Specifying parameters in the new job flow wizard

Discussion

Amazon’s EMR is a great solution when you have big Cascalog jobs but don’t need to run them very often. Maintaining your own Hadoop cluster takes a fair amount of time and money. If you can keep a cluster busy, it is a great investment; if you only need one a couple of times a month, you’re better off using EMR’s on-demand Hadoop clusters.

See Also