Proposal: interface for JuliaStats packages (Statistics and Machine Learning)
`StatisticalModel` is at the top of the type hierarchy.
There is one universal method:
- Every statistical model can be fit to data. This takes data as input: `fit(obj::StatisticalModel, data...)`. `fit` should be able to handle data in the form of `DataFrame` or `Matrix`, and it should return another instance of `StatisticalModel` (or perhaps `FittedModel`). Alternatively, one can call `fit!`, which would update `obj` instead of creating a new `StatisticalModel` (a minimal sketch follows below).
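As a rough illustration of this contract, here is a minimal sketch. `MeanModel`, its fields, and the single-vector `fit` signature are hypothetical; only the `StatisticalModel`, `fit`, and `fit!` names come from the proposal above.

```julia
# Minimal sketch of the proposed fit/fit! contract.
# MeanModel and its fields are hypothetical examples.
abstract type StatisticalModel end

mutable struct MeanModel <: StatisticalModel
    mu::Float64   # estimated mean
    n::Int        # number of observations seen
end
MeanModel() = MeanModel(0.0, 0)

# fit returns a new fitted instance, leaving `obj` untouched.
fit(obj::MeanModel, data::AbstractVector{<:Real}) =
    MeanModel(sum(data) / length(data), length(data))

# fit! updates `obj` in place instead of allocating a new model.
function fit!(obj::MeanModel, data::AbstractVector{<:Real})
    obj.mu = sum(data) / length(data)
    obj.n = length(data)
    return obj
end
```

Supporting `DataFrame` and `Matrix` inputs would then amount to adding further `fit`/`fit!` methods for those argument types.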
This would be a useful addition:
- `fit_more!(obj::StatisticalModel, extra_data...)` would be implemented in models that have online updates; this includes all exponential family models (see the sketch below).
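Continuing the hypothetical `MeanModel` from the sketch above, an online update might look like the following; the incremental-mean formula is just an illustration of what an online update means here.

```julia
# Hypothetical online update: fold extra observations into the running
# mean without refitting from scratch (reuses MeanModel from above).
function fit_more!(obj::MeanModel, extra_data::AbstractVector{<:Real})
    total = obj.mu * obj.n + sum(extra_data)
    obj.n += length(extra_data)
    obj.mu = total / obj.n
    return obj
end
```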
Now, we make a distinction between supervised and unsupervised learning.
- Every supervised model can make "predictions" using `predict(FittedModel, newX::SamplePoint)`. Beware: predictions may not live in the sample space; e.g. in the case of logistic regression, predictions live on the interval [0, 1], but the sample space is the binary set {0, 1} (see the sketch below).
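For concreteness, a hypothetical fitted logistic-regression-like model could implement `predict` as follows. `LogisticModel`, its field, and `sigmoid` are illustrative names; the abstract type is reused from the first sketch.

```julia
# Hypothetical supervised model: predictions live in [0, 1] even though
# the sample space is {0, 1}, as noted above.
struct LogisticModel <: StatisticalModel
    beta::Vector{Float64}   # coefficients, assumed already fit
end

sigmoid(t) = 1 / (1 + exp(-t))

# `newX` is a single sample point (feature vector).
predict(m::LogisticModel, newX::AbstractVector{<:Real}) =
    sigmoid(sum(m.beta .* newX))
```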
Unsupervised learning comprises methods for clustering and transformation of the data, often with dimension reduction.
- `project(FittedModel, X::SamplePoint)` for unsupervised models is analogous to `predict` for supervised models. It returns the "compressed" value of the observation: in the case of PCA with a given k, it projects to the subspace spanned by the top k eigenvectors; in the case of clustering, it projects each observation to its cluster center. Both the input and the output live in the sample space.
- `transform` (can we find a less general name?) takes in points in the sample space and returns points in the transformed space (typically lower-dimensional), e.g. PCA loadings, or cluster assignments.
- We should also have a way to access the output of `fit`, e.g. the subspace as represented by PCA "scores", or the set of cluster centers (a combined sketch of all three follows below).
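To make these three pieces concrete, here is one possible sketch built around a PCA-like model. `PCAModel`, its fields, and the `fitted_subspace` accessor are hypothetical names; the abstract type again comes from the first sketch.

```julia
using LinearAlgebra

# Hypothetical PCA-like model: W holds the top-k eigenvectors (d×k),
# mu the column means used for centering.
struct PCAModel <: StatisticalModel
    W::Matrix{Float64}
    mu::Vector{Float64}
end

# transform: sample space (d-dimensional) -> transformed space (k scores)
transform(m::PCAModel, x::AbstractVector{<:Real}) = m.W' * (x .- m.mu)

# project: sample space -> sample space; reconstructs x on the fitted
# subspace, analogous to predict for supervised models.
project(m::PCAModel, x::AbstractVector{<:Real}) =
    m.mu .+ m.W * (m.W' * (x .- m.mu))

# Accessor for the output of fit (here, the fitted subspace basis);
# a clustering model would instead expose its cluster centers, with
# transform returning cluster assignments.
fitted_subspace(m::PCAModel) = m.W
```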