Overview: estimators, transformers and pipelines - spark.ml
The `spark.ml` package aims to provide a uniform set of high-level APIs built on top of DataFrames that help users create and tune practical machine learning pipelines.
See the algorithm guides section below for guides on sub-packages of `spark.ml`, including feature transformers unique to the Pipelines API, ensembles, and more.
Spark ML standardizes APIs for machine learning algorithms to make it easier to combine multiple algorithms into a single pipeline, or workflow. This section covers the key concepts introduced by the Spark ML API, where the pipeline concept is mostly inspired by the scikit-learn project.
- `DataFrame`: Spark ML uses `DataFrame` from Spark SQL as an ML dataset, which can hold a variety of data types. E.g., a `DataFrame` could have different columns storing text, feature vectors, true labels, and predictions.
- `Transformer`: A `Transformer` is an algorithm which can transform one `DataFrame` into another `DataFrame`. E.g., an ML model is a `Transformer` which transforms a `DataFrame` with features into a `DataFrame` with predictions.
- `Estimator`: An `Estimator` is an algorithm which can be fit on a `DataFrame` to produce a `Transformer`. E.g., a learning algorithm is an `Estimator` which trains on a `DataFrame` and produces a model.
- `Pipeline`: A `Pipeline` chains multiple `Transformer`s and `Estimator`s together to specify an ML workflow.
- `Parameter`: All `Transformer`s and `Estimator`s now share a common API for specifying parameters.
Machine learning can be applied to a wide variety of data types, such as vectors, text, images, and structured data.
Spark ML adopts the `DataFrame` from Spark SQL in order to support a variety of data types.
`DataFrame` supports many basic and structured types; see the Spark SQL datatype reference for a list of supported types.
In addition to the types listed in the Spark SQL guide, `DataFrame` can use ML `Vector` types.
A `DataFrame` can be created either implicitly or explicitly from a regular `RDD`. See the code examples below and the Spark SQL programming guide for examples.
Columns in a `DataFrame` are named. The code examples below use names such as "text," "features," and "label."
A `Transformer` is an abstraction that includes feature transformers and learned models.
Technically, a `Transformer` implements a method `transform()`, which converts one `DataFrame` into another, generally by appending one or more columns.
For example:
- A feature transformer might take a `DataFrame`, read a column (e.g., text), map it into a new column (e.g., feature vectors), and output a new `DataFrame` with the mapped column appended.
- A learning model might take a `DataFrame`, read the column containing feature vectors, predict the label for each feature vector, and output a new `DataFrame` with predicted labels appended as a column.
An `Estimator` abstracts the concept of a learning algorithm or any algorithm that fits or trains on data.
Technically, an `Estimator` implements a method `fit()`, which accepts a `DataFrame` and produces a `Model`, which is a `Transformer`.
For example, a learning algorithm such as `LogisticRegression` is an `Estimator`, and calling `fit()` trains a `LogisticRegressionModel`, which is a `Model` and hence a `Transformer`.
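The relationship between `fit()` and `transform()` can be sketched in plain Python. This is a toy illustration, not the Spark ML API: the class names `MeanThresholdEstimator` and `MeanThresholdModel` are hypothetical, and a `DataFrame` is modeled as a list of dictionaries, one per row.

```python
# Toy illustration only: these classes are hypothetical and are NOT the Spark ML API.
# A "DataFrame" is modeled here as a list of dicts (one dict per row).


class MeanThresholdModel:
    """A fitted model: a Transformer that appends a 'prediction' column."""

    def __init__(self, threshold):
        self.threshold = threshold

    def transform(self, rows):
        # Return new rows with the extra column; the input rows are not modified.
        return [
            {**row, "prediction": 1.0 if row["feature"] > self.threshold else 0.0}
            for row in rows
        ]


class MeanThresholdEstimator:
    """An Estimator: fit() learns a threshold from data and returns a model."""

    def fit(self, rows):
        mean = sum(row["feature"] for row in rows) / len(rows)
        return MeanThresholdModel(threshold=mean)


training = [{"feature": 1.0}, {"feature": 2.0}, {"feature": 6.0}]
model = MeanThresholdEstimator().fit(training)  # Estimator produces a model
predictions = model.transform(training)         # the model is a Transformer
```

The estimator holds no learned state itself; everything learned during `fit()` lives in the returned model, which mirrors the `LogisticRegression` and `LogisticRegressionModel` split described above.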
`Transformer.transform()` and `Estimator.fit()` are both stateless. In the future, stateful algorithms may be supported via alternative concepts.
Each instance of a `Transformer` or `Estimator` has a unique ID, which is useful in specifying parameters (discussed below).
In machine learning, it is common to run a sequence of algorithms to process and learn from data. E.g., a simple text document processing workflow might include several stages:
- Split each document's text into words.
- Convert each document's words into a numerical feature vector.
- Learn a prediction model using the feature vectors and labels.
Spark ML represents such a workflow as a `Pipeline`, which consists of a sequence of `PipelineStage`s (`Transformer`s and `Estimator`s) to be run in a specific order.
We will use this simple workflow as a running example in this section.
A `Pipeline` is specified as a sequence of stages, and each stage is either a `Transformer` or an `Estimator`.
These stages are run in order, and the input `DataFrame` is transformed as it passes through each stage.
For `Transformer` stages, the `transform()` method is called on the `DataFrame`.
For `Estimator` stages, the `fit()` method is called to produce a `Transformer` (which becomes part of the `PipelineModel`, or fitted `Pipeline`), and that `Transformer`'s `transform()` method is called on the `DataFrame`.
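The stage-by-stage behavior described above can be sketched as a short loop. This is a minimal sketch under stated assumptions, not Spark's implementation: `fit_pipeline`, `SplitWords`, `VocabularyEstimator`, and `KnownWordCounter` are hypothetical names, rows are plain dictionaries, and a stage counts as an estimator simply because it has a `fit` method.

```python
# Toy illustration only: hypothetical helpers, NOT the Spark ML implementation.
# A stage is treated as an Estimator if it has fit(), otherwise as a Transformer.


def fit_pipeline(stages, rows):
    """Run the stages in order and return the fitted, transform-only stages."""
    fitted = []
    for stage in stages:
        if hasattr(stage, "fit"):
            stage = stage.fit(rows)   # Estimator stage: fit() yields a Transformer
        # For simplicity this sketch transforms after every stage, including the last.
        rows = stage.transform(rows)
        fitted.append(stage)
    return fitted


class SplitWords:
    """Transformer: adds a 'words' column derived from the 'text' column."""

    def transform(self, rows):
        return [{**r, "words": r["text"].split()} for r in rows]


class KnownWordCounter:
    """Fitted model: counts how many words of each row are in the vocabulary."""

    def __init__(self, vocab):
        self.vocab = vocab

    def transform(self, rows):
        return [{**r, "known": sum(w in self.vocab for w in r["words"])} for r in rows]


class VocabularyEstimator:
    """Estimator: learns the set of words seen during fitting."""

    def fit(self, rows):
        return KnownWordCounter({w for r in rows for w in r["words"]})


docs = [{"text": "spark ml"}, {"text": "spark sql guide"}]
fitted_stages = fit_pipeline([SplitWords(), VocabularyEstimator()], docs)
```

The returned list plays the role of a fitted pipeline: it has one entry per original stage, and every entry exposes only `transform()`.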
We illustrate this for the simple text document workflow. The figure below is for the training time usage of a `Pipeline`.
Above, the top row represents a `Pipeline` with three stages.
The first two (`Tokenizer` and `HashingTF`) are `Transformer`s (blue), and the third (`LogisticRegression`) is an `Estimator` (red).
The bottom row represents data flowing through the pipeline, where cylinders indicate `DataFrame`s.
The `Pipeline.fit()` method is called on the original `DataFrame`, which has raw text documents and labels.
The `Tokenizer.transform()` method splits the raw text documents into words, adding a new column with words to the `DataFrame`.
The `HashingTF.transform()` method converts the words column into feature vectors, adding a new column with those vectors to the `DataFrame`.
Now, since `LogisticRegression` is an `Estimator`, the `Pipeline` first calls `LogisticRegression.fit()` to produce a `LogisticRegressionModel`.
If the `Pipeline` had more stages, it would call the `LogisticRegressionModel`'s `transform()` method on the `DataFrame` before passing the `DataFrame` to the next stage.
A `Pipeline` is an `Estimator`.
Thus, after a `Pipeline`'s `fit()` method runs, it produces a `PipelineModel`, which is a `Transformer`.
This `PipelineModel` is used at test time; the figure below illustrates this usage.
In the figure above, the `PipelineModel` has the same number of stages as the original `Pipeline`, but all `Estimator`s in the original `Pipeline` have become `Transformer`s.
When the `PipelineModel`'s `transform()` method is called on a test dataset, the data are passed through the fitted pipeline in order.
Each stage's `transform()` method updates the dataset and passes it to the next stage.
`Pipeline`s and `PipelineModel`s help to ensure that training and test data go through identical feature processing steps.
DAG `Pipeline`s: A `Pipeline`'s stages are specified as an ordered array. The examples given here are all for linear `Pipeline`s, i.e., `Pipeline`s in which each stage uses data produced by the previous stage. It is possible to create non-linear `Pipeline`s as long as the data flow graph forms a Directed Acyclic Graph (DAG). This graph is currently specified implicitly based on the input and output column names of each stage (generally specified as parameters). If the `Pipeline` forms a DAG, then the stages must be specified in topological order.
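To make the topological-order requirement concrete, the sketch below checks whether every stage's input columns already exist by the time that stage runs. It is an illustration only: the function `first_out_of_order_stage`, the tuple layout, and the stage names `lengthFeature` and `assembler` are hypothetical and do not correspond to Spark ML objects.

```python
# Toy illustration only: hypothetical stage descriptions, NOT Spark ML objects.
# Each stage is described as (name, input_columns, output_column).


def first_out_of_order_stage(stages, initial_columns):
    """Return the name of the first stage whose inputs are not yet available, else None."""
    available = set(initial_columns)
    for name, inputs, output in stages:
        if not set(inputs) <= available:
            return name
        available.add(output)
    return None


stages = [
    ("tokenizer", ["text"], "words"),
    ("hashingTF", ["words"], "features"),
    ("lengthFeature", ["text"], "length"),  # branches off the raw text: non-linear
    ("assembler", ["features", "length"], "allFeatures"),
]

in_order = first_out_of_order_stage(stages, ["text", "label"])
out_of_order = first_out_of_order_stage(list(reversed(stages)), ["text", "label"])
```

With the stages listed in topological order the check finds nothing to report, while reversing the list makes the stage that combines two columns run before either column exists.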
Runtime checking: Since `Pipeline`s can operate on `DataFrame`s with varied types, they cannot use compile-time type checking.
`Pipeline`s and `PipelineModel`s instead do runtime checking before actually running the `Pipeline`.
This type checking is done using the `DataFrame` schema, a description of the data types of columns in the `DataFrame`.
Unique Pipeline stages: A `Pipeline`'s stages should be unique instances. E.g., the same instance `myHashingTF` should not be inserted into the `Pipeline` twice since `Pipeline` stages must have unique IDs. However, different instances `myHashingTF1` and `myHashingTF2` (both of type `HashingTF`) can be put into the same `Pipeline` since different instances will be created with different IDs.
Spark ML `Estimator`s and `Transformer`s use a uniform API for specifying parameters.
A `Param` is a named parameter with self-contained documentation.
A `ParamMap` is a set of (parameter, value) pairs.
There are two main ways to pass parameters to an algorithm:
- Set parameters for an instance. E.g., if `lr` is an instance of `LogisticRegression`, one could call `lr.setMaxIter(10)` to make `lr.fit()` use at most 10 iterations. This API resembles the API used in the `spark.mllib` package.
- Pass a `ParamMap` to `fit()` or `transform()`. Any parameters in the `ParamMap` will override parameters previously specified via setter methods.
Parameters belong to specific instances of `Estimator`s and `Transformer`s.
For example, if we have two `LogisticRegression` instances `lr1` and `lr2`, then we can build a `ParamMap` with both `maxIter` parameters specified: `ParamMap(lr1.maxIter -> 10, lr2.maxIter -> 20)`.
This is useful if there are two algorithms with the `maxIter` parameter in a `Pipeline`.
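The override rule and the per-instance ownership of parameters can be mimicked with a dictionary keyed by instance ID and parameter name. This is a sketch under stated assumptions, not the Spark ML `Params` API: `ToyLogisticRegression`, `set_max_iter`, `effective_params`, the ID format, and the default value of 100 are all hypothetical.

```python
# Toy illustration only: hypothetical class, NOT the Spark ML Params API.
import itertools


class ToyLogisticRegression:
    _ids = itertools.count()

    def __init__(self):
        self.uid = f"logreg_{next(self._ids)}"  # unique per instance
        self.params = {"maxIter": 100}          # default chosen arbitrarily here

    def set_max_iter(self, value):
        self.params["maxIter"] = value
        return self

    def effective_params(self, param_map=None):
        """Instance-level settings, overridden by matching entries in param_map."""
        merged = dict(self.params)
        for (uid, name), value in (param_map or {}).items():
            if uid == self.uid:  # entries for other instances are ignored
                merged[name] = value
        return merged


lr1 = ToyLogisticRegression().set_max_iter(5)
lr2 = ToyLogisticRegression()

# One map can carry a different maxIter for each instance.
param_map = {(lr1.uid, "maxIter"): 10, (lr2.uid, "maxIter"): 20}
```

Here `lr1.effective_params(param_map)` resolves `maxIter` to the map's value rather than the value given to the setter, and each instance only picks up the entry keyed by its own ID.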
It is often worthwhile to save a model or a pipeline to disk for later use. In Spark 1.6, model import/export functionality was added to the Pipeline API. Most basic transformers are supported, as well as some of the more basic ML models. Refer to each algorithm's API documentation to see whether saving and loading is supported.
This section gives code examples illustrating the functionality discussed above.
For more info, please refer to the API documentation (Scala, Java, and Python).
Some Spark ML algorithms are wrappers for `spark.mllib` algorithms, and the MLlib programming guide has details on specific algorithms.
This example covers the concepts of `Estimator`, `Transformer`, and `Param`.
This example follows the simple text document `Pipeline` illustrated in the figures above.
An important task in ML is model selection, or using data to find the best model or parameters for a given task. This is also called tuning.
`Pipeline`s facilitate model selection by making it easy to tune an entire `Pipeline` at once, rather than tuning each element in the `Pipeline` separately.
Currently, `spark.ml` supports model selection using the `CrossValidator` class, which takes an `Estimator`, a set of `ParamMap`s, and an `Evaluator`.
`CrossValidator` begins by splitting the dataset into a set of folds which are used as separate training and test datasets; e.g., with $k=3$ folds, `CrossValidator` will generate 3 (training, test) dataset pairs, each of which uses 2/3 of the data for training and 1/3 for testing.
`CrossValidator` iterates through the set of `ParamMap`s. For each `ParamMap`, it trains the given `Estimator` and evaluates it using the given `Evaluator`.
The `Evaluator` can be a `RegressionEvaluator` for regression problems, a `BinaryClassificationEvaluator` for binary data, or a `MulticlassClassificationEvaluator` for multiclass problems. The default metric used to choose the best `ParamMap` can be overridden by the `setMetricName` method in each of these evaluators.
The `ParamMap` which produces the best evaluation metric (averaged over the $k$ folds) is selected as the best model.
`CrossValidator` finally fits the `Estimator` using the best `ParamMap` and the entire dataset.
The following example demonstrates using `CrossValidator` to select from a grid of parameters.
To help construct the parameter grid, we use the `ParamGridBuilder` utility.
Note that cross-validation over a grid of parameters is expensive.
E.g., in the example below, the parameter grid has 3 values for `hashingTF.numFeatures` and 2 values for `lr.regParam`, and `CrossValidator` uses 2 folds. This multiplies out to $(3 \times 2) \times 2 = 12$ different models being trained.
In realistic settings, it can be common to try many more parameters and use more folds ($k=3$ and $k=10$ are common).
In other words, using `CrossValidator` can be very expensive.
However, it is also a well-established method for choosing parameters which is more statistically sound than heuristic hand-tuning.
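The cost estimate above is a product of the grid size and the number of folds, which can be reproduced directly. The concrete grid values in this sketch are illustrative assumptions; only the counts (3 values, 2 values, 2 folds) come from the text.

```python
import itertools

# Grid values are illustrative assumptions; only the counts come from the text.
num_features_values = [10, 100, 1000]  # 3 candidate values for hashingTF.numFeatures
reg_param_values = [0.1, 0.01]         # 2 candidate values for lr.regParam
num_folds = 2

# Every combination of the two parameters is one point in the grid.
grid = list(itertools.product(num_features_values, reg_param_values))

# Each grid point is trained once per fold.
models_trained = len(grid) * num_folds
```

This count covers only the fits made during the search; the final fit on the entire dataset with the best `ParamMap` is in addition to it.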
{% include_example python/ml/cross_validator.py %}
In addition to `CrossValidator`, Spark also offers `TrainValidationSplit` for hyper-parameter tuning.
`TrainValidationSplit` evaluates each combination of parameters only once, as opposed to $k$ times in the case of `CrossValidator`. It is therefore less expensive, but will not produce as reliable results when the training dataset is not sufficiently large.
`TrainValidationSplit` takes an `Estimator`, a set of `ParamMap`s provided in the `estimatorParamMaps` parameter, and an `Evaluator`.
It begins by splitting the dataset into two parts, according to the `trainRatio` parameter, which are used as separate training and validation datasets. For example, with `trainRatio=0.75` (the default), `TrainValidationSplit` will generate a training and validation dataset pair where 75% of the data is used for training and 25% for validation.
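A single split by ratio can be sketched as follows. This is a toy illustration and not the `TrainValidationSplit` implementation: the function `train_validation_split`, its `seed` argument, and the shuffle-then-cut strategy are assumptions made for the example.

```python
# Toy illustration only: NOT the TrainValidationSplit implementation.
import random


def train_validation_split(rows, train_ratio=0.75, seed=0):
    """Shuffle the rows once, then cut them into (training, validation) parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]


train, validation = train_validation_split(range(100))
```

With 100 rows and a ratio of 0.75, this yields 75 training rows and 25 validation rows, and every parameter combination would be evaluated on that one pair.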
Similar to `CrossValidator`, `TrainValidationSplit` also iterates through the set of `ParamMap`s.
For each combination of parameters, it trains the given `Estimator` and evaluates it using the given `Evaluator`.
The `ParamMap` which produces the best evaluation metric is selected as the best option.
`TrainValidationSplit` finally fits the `Estimator` using the best `ParamMap` and the entire dataset.