This repository has been archived by the owner on Apr 8, 2021. It is now read-only.

Quick Start example in README. #95

Open
wants to merge 4 commits into
base: master
152 changes: 118 additions & 34 deletions README.md
@@ -1,11 +1,122 @@
![Brushfire](brushfire.png)

Brushfire
What is Brushfire?
=========

Brushfire is a framework for distributed supervised learning of decision tree ensemble models in Scala.
Brushfire is a framework developed at [Stripe](http://stripe.com) for distributed [supervised learning](https://en.wikipedia.org/wiki/Supervised_learning) of [decision trees](https://en.wikipedia.org/wiki/Decision_tree_learning) in [Scala](http://www.scala-lang.org/) using [ensemble models](https://en.wikipedia.org/wiki/Ensemble_learning).

<img src="brushfire.png" width="400">

# Quick start

Brushfire requires Scala and [SBT](http://www.scala-sbt.org/) (Scala's interactive build tool). Once both are installed, you can run the example code, which uses Brushfire to build decision tree models from the [Iris dataset](https://archive.ics.uci.edu/ml/datasets/Iris).

You can run it on your local machine (this pulls down all the required dependencies):

```bash
$ git clone https://github.com/stripe/brushfire.git
$ cd brushfire
$ ./quick-start
```

Or you can run it on a Hadoop cluster (this pulls down all the required dependencies except the Hadoop jars, which the Hadoop execution environment provides):


```bash
$ git clone https://github.com/stripe/brushfire.git
$ cd brushfire
$ sbt brushfireScalding/assembly
$ cd example
$ ./iris
```

The `example/iris.output` directory will be created. Inside it are 4 versions of a decision tree, represented as JSON, for classifying irises:

The basic approach to distributed tree learning is inspired by Google's [PLANET](http://static.googleusercontent.com/media/research.google.com/en/us/pubs/archive/36296.pdf), but considerably generalized thanks to Scala's type parameterization and Algebird's aggregation abstractions.
```bash
$ ls iris.output | grep step_
step_00
step_01
step_02
step_03
```

Here's an example decision tree output from `step_03`:

```json
{
  "key":"petal-width",
  "predicate":{
    "lt":0.6015625
  },
  "left":{
    "leaf":0,
    "distribution":{
      "Iris-setosa":42
    }
  },
  "right":{
    "key":"petal-width",
    "predicate":{
      "lt":1.703125
    },
    "left":{
      "key":"petal-length",
      "predicate":{
        "lt":5.09375
      },
      "left":{
        "leaf":1,
        "distribution":{
          "Iris-virginica":1,
          "Iris-versicolor":38
        }
      },
      "right":{
        "leaf":2,
        "distribution":{
          "Iris-virginica":2
        }
      }
    },
    "right":{
      "key":"sepal-width",
      "predicate":{
        "lt":2.703125
      },
      "left":{
        "leaf":3,
        "distribution":{
          "Iris-virginica":4
        }
      },
      "right":{
        "leaf":4,
        "distribution":{
          "Iris-virginica":29
        }
      }
    }
  }
}
```
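
As a rough illustration of how to read this output, the sketch below rebuilds the `step_03` tree by hand and walks it for a single row. The `Node`, `Split` and `Leaf` types and the `TreeWalk` object are hypothetical names invented for this example; they are not Brushfire classes. The sketch also assumes that a row goes to `left` when its feature value satisfies the `lt` predicate, and to `right` otherwise.

```scala
// Hypothetical types mirroring the JSON shape above; NOT part of Brushfire's API.
sealed trait Node
final case class Split(key: String, lt: Double, left: Node, right: Node) extends Node
final case class Leaf(index: Int, distribution: Map[String, Long]) extends Node

object TreeWalk {
  // Assumption: go left when row(key) < lt, otherwise go right.
  @annotation.tailrec
  def distributionFor(node: Node, row: Map[String, Double]): Map[String, Long] =
    node match {
      case Leaf(_, dist) => dist
      case Split(key, lt, left, right) =>
        if (row(key) < lt) distributionFor(left, row)
        else distributionFor(right, row)
    }

  // Predict the most frequent label in the leaf the row lands in.
  def predict(node: Node, row: Map[String, Double]): String =
    distributionFor(node, row).maxBy(_._2)._1

  // The step_03 tree from the JSON above, transcribed by hand.
  val tree: Node =
    Split("petal-width", 0.6015625,
      Leaf(0, Map("Iris-setosa" -> 42L)),
      Split("petal-width", 1.703125,
        Split("petal-length", 5.09375,
          Leaf(1, Map("Iris-virginica" -> 1L, "Iris-versicolor" -> 38L)),
          Leaf(2, Map("Iris-virginica" -> 2L))),
        Split("sepal-width", 2.703125,
          Leaf(3, Map("Iris-virginica" -> 4L)),
          Leaf(4, Map("Iris-virginica" -> 29L)))))

  def main(args: Array[String]): Unit = {
    // An illustrative row, not taken from the dataset file.
    val row = Map("petal-width" -> 1.3, "petal-length" -> 4.0, "sepal-width" -> 2.8)
    println(predict(tree, row)) // prints "Iris-versicolor"
  }
}
```

The example row fails the first `petal-width` test, passes the second, and passes the `petal-length` test, so it lands in leaf 1, where `Iris-versicolor` dominates the distribution.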

To use brushfire in your own SBT project, add the following to your `build.sbt`:

```scala
libraryDependencies += "com.stripe" %% "brushfire" % "0.6.3"
```

To use brushfire as a jar in your own Maven project, add the following to your POM file:

```xml
<dependency>
  <groupId>com.stripe</groupId>
  <artifactId>brushfire_${scala.binary.version}</artifactId>
  <version>0.6.3</version>
</dependency>
```

# Background

The basic approach to distributed tree learning is inspired by Google's [PLANET](http://static.googleusercontent.com/media/research.google.com/en/us/pubs/archive/36296.pdf), but considerably generalized thanks to Scala's type parameterization and [Algebird's](https://github.com/twitter/algebird) aggregation abstractions.

Brushfire currently supports:
* binary and multi-class classifiers
@@ -25,9 +136,7 @@ In the future we plan to add support for:

# Authors

* Avi Bryant <http://twitter.com/avibryant>

Thanks for assistance and contributions:
Avi Bryant <http://twitter.com/avibryant> with assistance and contributions from:

* Edwin Chen <https://twitter.com/echen>
* Dan Frank <http://twitter.com/danielhfrank>
@@ -38,32 +147,7 @@ Thanks for assistance and contributions:
* Erik Osheim <http://twitter.com/d6>
* Tom Switzer <https://twitter.com/tixxit>

# Quick start

````
sbt brushfireScalding/assembly
cd example
./iris
cat iris.output/step_03
````

If it worked, you should see a JSON representation of 4 versions of a decision tree for classifying irises.

To use brushfire in your own SBT project, add the following to your `build.sbt`:

```scala
libraryDependencies += "com.stripe" %% "brushfire" % "0.6.3"
```

To use brushfire as a jar in your own Maven project, add the following to your POM file:

```
<dependency>
  <groupId>com.stripe</groupId>
  <artifactId>brushfire_${scala.binary.version}</artifactId>
  <version>0.6.3</version>
</dependency>
```

# Using Brushfire with Scalding

@@ -163,4 +247,4 @@ Brushfire is designed to be extremely pluggable. Some ways you might want to extend it are:
* Add a new evaluation strategy (such as log-likelihood or entropy): define a new [Evaluator](http://stripe.github.io/brushfire/#com.stripe.brushfire.Evaluator)
* Add a new feature type, or a new way of binning an existing feature type (such as log-binning real numbers): define a new [Splitter](http://stripe.github.io/brushfire/#com.stripe.brushfire.Splitter)
* Add a new target type (such as real-valued targets for regression trees): define a new [Evaluator](http://stripe.github.io/brushfire/#com.stripe.brushfire.Evaluator), a new [Stopper](http://stripe.github.io/brushfire/#com.stripe.brushfire.Stopper) and quite likely also a new [Splitter](http://stripe.github.io/brushfire/#com.stripe.brushfire.Splitter) for any continuous or sparse feature types you want to be able to use.
* Add a new distributed computation platform: define a new equivalent of [Trainer](http://stripe.github.io/brushfire/#com.stripe.brushfire.scalding.Trainer), idiomatically to the platform you're using. (There's no specific interface this should implement.)
* Add a new distributed computation platform: define a new equivalent of [Trainer](http://stripe.github.io/brushfire/#com.stripe.brushfire.scalding.Trainer), idiomatically to the platform you're using. (There's no specific interface this should implement.)
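
The list above mentions entropy as a possible evaluation strategy. As a rough sketch of the arithmetic only, the code below scores a candidate split by the size-weighted Shannon entropy of its child distributions. It deliberately does not implement Brushfire's `Evaluator` interface, whose exact signature is not shown here; `SplitEntropy` and its methods are hypothetical names, and distributions are assumed to be plain label-to-count maps like those in the JSON output.

```scala
// Hypothetical helper, NOT Brushfire's Evaluator interface.
object SplitEntropy {
  // Shannon entropy, in bits, of a label -> count distribution.
  def entropy(dist: Map[String, Long]): Double = {
    val total = dist.values.sum.toDouble
    dist.values.filter(_ > 0).map { count =>
      val p = count / total
      -p * math.log(p) / math.log(2)
    }.sum
  }

  // Size-weighted average entropy of a split's children; lower means purer children.
  def weightedEntropy(children: Seq[Map[String, Long]]): Double = {
    val sizes = children.map(_.values.sum)
    val total = sizes.sum.toDouble
    if (total == 0) 0.0
    else children.zip(sizes).map { case (dist, n) => n / total * entropy(dist) }.sum
  }

  def main(args: Array[String]): Unit = {
    // Children of the petal-length split in the step_03 example tree.
    val left  = Map("Iris-virginica" -> 1L, "Iris-versicolor" -> 38L)
    val right = Map("Iris-virginica" -> 2L)
    println(f"left=${entropy(left)}%.4f weighted=${weightedEntropy(Seq(left, right))}%.4f")
  }
}
```

A pure leaf has entropy 0 and an even two-way mix has entropy 1 bit, so an evaluator built on this idea would prefer splits with a lower weighted score.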
5 changes: 4 additions & 1 deletion brushfire-scalding/build.sbt
@@ -9,7 +9,10 @@ libraryDependencies ++= Seq(

mainClass := Some("com.twitter.scalding.Tool")

run in Compile <<= Defaults.runTask(fullClasspath in Compile, mainClass in (Compile, run), runner in (Compile, run))

runMain in Compile <<= Defaults.runMainTask(fullClasspath in Compile, runner in(Compile, run))

Publish.settings

MakeJar.settings

2 changes: 1 addition & 1 deletion example/iris
@@ -1,5 +1,5 @@
#!/bin/sh
java -Xmx2G -cp ../brushfire-scalding/target/scala-2.11/brushfire-scalding-0.7.3-SNAPSHOT-jar-with-dependencies.jar \
java -Xmx2G -cp ../brushfire-scalding/target/scala-2.11/brushfire-scalding-0.7.5-SNAPSHOT-jar-with-dependencies.jar \
com.twitter.scalding.Tool \
com.stripe.brushfire.scalding.IrisJob \
--local \
2 changes: 1 addition & 1 deletion project/plugins.sbt
@@ -1,5 +1,5 @@
addSbtPlugin("com.eed3si9n" % "sbt-unidoc" % "0.3.2")
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.13.0")
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.3")
addSbtPlugin("com.typesafe.sbt" % "sbt-native-packager" % "1.0.0-RC1")
addSbtPlugin("com.jsuereth" % "sbt-pgp" % "1.0.0")
addSbtPlugin("no.arktekk.sbt" % "aether-deploy" % "0.14")
3 changes: 3 additions & 0 deletions quick-start
@@ -0,0 +1,3 @@
#!/bin/sh

sbt "brushfireScalding/runMain com.twitter.scalding.Tool com.stripe.brushfire.scalding.IrisJob --local --input example/iris.data --output example/iris.output"