by Alex Robbins
You love testing your code. You love writing Cascalog jobs. You hate trying to test your Cascalog jobs.
Midje-Cascalog provides a small amount of extra functionality that makes writing tests for Cascalog jobs quite easy.
Add the following entries to the dev profile in your project.clj or your personal lein profile:
[lein-midje "3.0.1"]
[cascalog/midje-cascalog "1.10.2"]
Given a query like the following:
(ns cookbook.midje-cascalog
(:require [cascalog.api :refer :all]))
(defn capitalize [s]
(.toUpperCase s))
(defn capitalize-authors-query [author-path]
(<- [?capitalized-author]
((hfs-textline author-path) ?author)
(capitalize ?author :> ?capitalized-author)))
The Midje-Cascalog test would look like this:
(ns cookbook.midje-cascalog-test
(:require [cookbook.midje-cascalog :refer :all]
[midje
[sweet :refer :all]
[cascalog :refer :all]]))
(fact "Query should return capitalized versions of the input names."
(capitalize-authors-query :author-path) => (produces [["LUKE VANDERHART"]
["RYAN NEUFELD"]])
(provided
(hfs-textline :author-path) => [["Luke Vanderhart"]
["Ryan Neufeld"]]))
Unit testing is an important aspect of software craftsmanship. However, unit testing Hadoop workflows is difficult, to say the least. Most distributed computing development is done using trial and error, with only limited manual testing happening before the workflow is "good enough" and put into production use. You shouldn’t let your code quality slip, but testing distributed code can be difficult. Midje-Cascalog makes it easy to test different parts of your Cascalog workflow by making it dead simple to mock out the results of subqueries.
In the solution outlined above, you are testing a simple query. It reads lines from the input path, capitalizes and outputs them. Normally, you’d need to make sure part of the test wrote some test data into a file, then reference that file in the test, then cleanup and delete the file. Instead, using Midje-Cascalog, you mock the hfs-textline call.
fact is provided by the midje library, which is well worth learning on its own. It is an alternative to deftest from clojure.test. Here, you state the test as a call, an arrow, and then the produces function. produces lets you write out the results of a query as a vector of vectors. Having established the test, you use provided to outline the functions you want to mock. This lets you test only the function in question, and not the functions it depends on.
Testing your Cascalog workflows is as important as testing any other part of your application. With Midje-Cascalog, you can no longer excuse your lack of tests.