This repository has been archived by the owner on Apr 19, 2019. It is now read-only.

Proposal: interface for JuliaStats packages (Statistics and Machine Learning)

Gustavo Lacerda edited this page Jul 16, 2014 · 14 revisions

StatisticalModel is at the top of the type hierarchy.

There is one universal method:

  • Every statistical model can be fit to data: fit(obj::StatisticalModel, data, tuning_parameters). fit should accept data in the form of a DataFrame or a Matrix, and it should return another instance of StatisticalModel (or perhaps FittedModel). Alternatively, one can call fit!, which updates obj in place instead of creating a new StatisticalModel.
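A minimal sketch of this fit/fit! contract, assuming a hypothetical OLSModel type (illustrative only, not part of any JuliaStats package):

```julia
# Hypothetical sketch of the proposed fit/fit! pair.
abstract type StatisticalModel end

mutable struct OLSModel <: StatisticalModel
    coef::Vector{Float64}   # empty until fitted
end
OLSModel() = OLSModel(Float64[])

# fit returns a new fitted model; fit! mutates obj in place.
fit(obj::OLSModel, X::Matrix{Float64}, y::Vector{Float64}) = OLSModel(X \ y)

function fit!(obj::OLSModel, X::Matrix{Float64}, y::Vector{Float64})
    obj.coef = X \ y   # least-squares solve
    obj
end

X = [1.0 1.0; 1.0 2.0; 1.0 3.0]
y = [2.0, 3.0, 4.0]
m = fit(OLSModel(), X, y)   # coef ≈ [1.0, 1.0]
```

The non-mutating version keeps the unfitted specification reusable; the mutating version avoids a copy when the model object is large.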

This would be a useful addition:

  • fit_more!(obj::StatisticalModel, extra_data...) would be implemented by models that support online updates. This includes all exponential-family models, whose sufficient statistics can be updated incrementally.
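To illustrate why exponential-family models admit online updates, here is a hedged sketch using a hypothetical NormalMean type (names are illustrative): the fitted state is just the sufficient statistics, so folding in extra data is cheap.

```julia
# Hypothetical sketch of fit_more! via sufficient statistics.
mutable struct NormalMean
    n::Int          # number of observations seen
    sumx::Float64   # running sum (the sufficient statistic for the mean)
end
NormalMean() = NormalMean(0, 0.0)

mean_est(m::NormalMean) = m.sumx / m.n

# fit_more! costs O(length(extra_data)), independent of past data size.
function fit_more!(m::NormalMean, extra_data::Vector{Float64})
    m.n += length(extra_data)
    m.sumx += sum(extra_data)
    m
end

m = NormalMean()
fit_more!(m, [1.0, 2.0, 3.0])
fit_more!(m, [4.0])
mean_est(m)   # 2.5
```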

Also universal is the notion of a loss function, used to compute training error and test error.

  • cv(obj::StatisticalModel, data, k) would perform k-fold cross-validation, i.e. call fit k times and return the list of fits, or at least the average loss. In practice, one would call cv multiple times with different values of the tuning parameters / hyperparameters.
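The cv loop above can be sketched as follows. This is a minimal assumption-laden version: the fitting routine (`fitfn`) and loss function (`lossfn`) are passed in explicitly, since the proposal leaves both model-specific.

```julia
# Hypothetical sketch of k-fold cross-validation in terms of fit and a loss.
function cv(fitfn, lossfn, X::Matrix{Float64}, y::Vector{Float64}, k::Int)
    n = size(X, 1)
    folds = [i:k:n for i in 1:k]   # simple interleaved fold assignment
    losses = Float64[]
    for test_idx in folds
        train_idx = setdiff(1:n, test_idx)
        model = fitfn(X[train_idx, :], y[train_idx])
        push!(losses, lossfn(model, X[test_idx, :], y[test_idx]))
    end
    sum(losses) / k                # average held-out loss
end

fitfn(X, y) = X \ y                                   # least-squares coefficients
lossfn(b, X, y) = sum(abs2, X * b - y) / length(y)    # mean squared error
X = hcat(ones(10), 1.0:10.0)
y = X * [0.5, 2.0]                                    # noiseless linear data
avg_loss = cv(fitfn, lossfn, X, y, 5)                 # ≈ 0 on noiseless data
```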

Now, we make a distinction between supervised and unsupervised learning.

Supervised learning

  • Every supervised model can make "predictions" using predict(FittedModel, newX::SamplePoint). Beware: predictions may not live in the sample space; e.g. in logistic regression, predictions live in the interval [0,1], while the sample space is the binary set {0,1}.

  • We should be able to compute training_error(FittedModel) and test_error(FittedModel, test_data).
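The supervised half of the interface can be sketched as below. LinearFit and the squared-error loss are illustrative assumptions; the point is that training_error needs no extra arguments (the fitted model retains its training data), while test_error takes held-out data.

```julia
# Hypothetical sketch of predict / training_error / test_error.
struct LinearFit
    coef::Vector{Float64}
    trainX::Matrix{Float64}   # training data kept so training_error needs no args
    trainy::Vector{Float64}
end

predict(m::LinearFit, newX::Matrix{Float64}) = newX * m.coef

mse(m::LinearFit, X, y) = sum(abs2, predict(m, X) - y) / length(y)

training_error(m::LinearFit) = mse(m, m.trainX, m.trainy)
test_error(m::LinearFit, testX, testy) = mse(m, testX, testy)

X = [1.0 0.0; 1.0 1.0; 1.0 2.0]
y = [1.0, 2.0, 3.0]                    # exactly y = 1 + x
m = LinearFit(X \ y, X, y)
training_error(m)                      # ≈ 0 on this exact fit
test_error(m, [1.0 3.0], [4.0])        # ≈ 0: the fitted line extrapolates exactly
```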

Unsupervised learning

Unsupervised learning comprises methods for clustering and transformation of the data, often with dimension reduction.

  • transform (can we find a less general name?) takes points in the sample space and returns points in the transformed space (typically lower-dimensional), e.g. PCA loadings or cluster assignments.

  • project(FittedModel, X::SamplePoint) for unsupervised models is analogous to predict for supervised models. It returns the "compressed" value of the observation: in the case of PCA with a given k, it projects to the subspace spanned by the top k eigenvectors; in the case of clustering, it projects each observation to its cluster center. Both the input and the output live in the sample space.

  • We should also have a way to access the output of fit, e.g. the subspace as represented by the PCA "scores", or the set of cluster centers.
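The three unsupervised operations above can be sketched for PCA. PCAFit and fit_pca are hypothetical names; the sketch shows the distinction the proposal draws: transform maps into the reduced coordinate space, project maps back into the sample space, and the fit output (the basis) is accessible as a field.

```julia
using LinearAlgebra

# Hypothetical sketch of the unsupervised triple for PCA.
struct PCAFit
    mean::Vector{Float64}
    basis::Matrix{Float64}   # d × k matrix of top-k principal directions
end

function fit_pca(X::Matrix{Float64}, k::Int)
    mu = vec(sum(X, dims=1)) / size(X, 1)
    Xc = X .- mu'                 # center the data
    U, S, V = svd(Xc)
    PCAFit(mu, V[:, 1:k])         # fit output: the fitted subspace
end

# transform: sample space → k-dimensional coordinate space
transform(m::PCAFit, X::Matrix{Float64}) = (X .- m.mean') * m.basis

# project: sample space → sample space, snapped onto the fitted subspace
project(m::PCAFit, X::Matrix{Float64}) =
    transform(m, X) * m.basis' .+ m.mean'

X = [1.0 1.0; 2.0 2.0; 3.0 3.0; 4.0 4.0]   # rank-1 data on the line y = x
m = fit_pca(X, 1)
project(m, X)   # reconstructs X up to rounding, since the data lie in the subspace
```

For clustering, the analogous project would replace each observation by its cluster center, and transform would return the cluster index.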

Discussions

https://github.com/JuliaStats/Roadmap.jl/issues/4