Skip to content
jpatanooga edited this page Nov 15, 2012 · 16 revisions

Why build Knitting Boar?

We started experimenting with building a custom YARN application on Hadoop's next gen YARN framework. We chose SGD as an example since its a very common machine learning technique. Certain parallel strategies work in MapReduce while others required a BSP-style framework. IterativeReduce was our answer to a BSP-style framework built on top of Hadoop's YARN plumbing.

Jeff Dean (Google) on scaling learning systems like SGD:

If you have more data you should build bigger models ... You just parallelize the heck out of everything ... You want to train models as fast as you possibly can ... I want to train the biggest model I possibly can in [time period N]

IterativeReduce and Knitting Boar should be considered experimental . They are intended to be used to develop new prototypes of parallel iterative algorithms on YARN. What you should take from this is that this code base is for exploring the parallel iterative algorithm space on top of Hadoop.

What Paper is this Implementation of SGD Based Around?

McDonald, Hall, and Mann (2010), "Distributed Training Strategies for the Structured Perceptron"

What was interesting about their approach?

Some notes:

  1. Iterative parameter mixing achieves performance as good as or better than training serially on all the data
  2. Distributed algorithms return better classifiers much quicker than training serially on all the data

What are the common parallel SGD approaches?

  • asynchronous optimization via gradient descent
    • multiple machines run stochastic gradient descent simultaneously as they update and read from a shared parameter vector asynchronously
    • The asynchronous algorithms in these studies require shared memory between the distributed computations and are less suitable to the more common cluster computing environment (Zinkevich et al, 2009)

Couldn't you have done this with MapReduce?

McDonald (et al) talk about this in their paper:

It is easy to see how this can be implemented on a cluster through a map-reduce framework, i.e., the map step trains the individual models in parallel and the reduce step mixes their parameters.

going on to extol the virtues of parameter mixing:

The advantages of parameter mixing are: 1) that it is parallel, making it possibly to scale to extremely large data sets, and 2) it is resource efficient, in particular with respect to network usage as parameters are not repeatedly passed across the network as is often the case for exact distributed training strategies.

McDonald then lays out their approach "Iterative Parameter Mixing":

Previously, each parallel perceptron was trained to convergence before the parameter mixing step. Instead, > shard the data as before, but train a single epoch of the perceptron algorithm for each shard (in
parallel) and mix the model weights. This mixed weight vector is then re-sent to each shard and the
perceptrons on those shards reset their weights to the new mixed weights.

Given these properties its easy to see how multiple passes of SGD could be done in MapReduce. However, there are tremendous setup and teardown costs involved in MapReduce passes. Given this overhead we developed IterativeReduce on Hadoop's YARN to provide a better platform for parallel iterative algorithms.

If we want to do 100 passes of our dataset, and the minimum MapReduce scheduling setup/teardown cost is 30 seconds then our total over head slack time is

30sec x 100passes == 3000 seconds

and we're just wasting 50 minutes. In this scenario IterativeReduce becomes attractive because it can do parameter mixing in batches or at the end of each epoch. Either way, the MapReduce setup/teardown is eliminated allowing us to build bigger models faster.

Do you randomize the input shards between passes over the dataset?

Comment from the Ryan McDonald on this:

"The results in the paper do not shuffle every epoch. However, we did run those experiments and found little difference. In particular, averaging seems to ameliorate any losses from not shuffling the data every epoch."

Given the notes from Ryan we decided to not shuffle between passes.

Iterative Reduce in comparison to Spark / Giraph / Graphlab ?

Iterative Reduce came about incidentally as we were parallelizing Stochastic Gradient Descent on top of YARN. As we began to abstract away the "plumbing" we progressively ended up with the common patterns which became Iterative Reduce.

Initially we considered Spark due to its parallel iterative nature but it was introducing extra dependencies along with the requirement (until 0.6) that you use Scala. Spark at the time only had experimental YARN support and we knew this facet would involve development overhead. The Giraph framework was a first class Hadoop citizen on a modified MapReduce but it required the problem to be abstracted into a graphical model which we didnt want to deal with. Graphlab is very fast and in C++ but it has only basic HDFS support.

Creating a first class YARN / Hadoop citizen allows Iterative Reduce applications to be easily integrated into existing Hadoop based tool suites which lower the cost of adoption and usage by mainline commercial enterprises.

Is Iterative Reduce's API really similar to Giraph's API?

Yes. Initially that's not what we were going for, but that's where we ended up. After we had the framework running and we looked at what we had, we realized they were really similar.

Doesn't Spark already do parallel iterative algorithms?

Yes. We don't use a ton of Scala and the java bindings were only recently released. We wanted a YARN-first first class citizen for parallel iterative algorithms in java. Spark is a tremendous system and we really respect the work that Matei and Co. have produced.

Clone this wiki locally