Proposal: interface for JuliaStats packages (Statistics and Machine Learning)
`StatisticalModel` is at the top of the type hierarchy.
There is one universal method:
- Every statistical model can be fit to data. This takes data as input: `fit(obj::StatisticalModel, data...)`. `fit` should be able to handle data in the form of `DataFrame` or `Matrix`, and it should return another instance of `StatisticalModel` (or perhaps `FittedModel`). Alternatively, one can call `fit!`, which would update `obj` instead of creating a new `StatisticalModel` (a minimal sketch follows below).
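As a rough illustration of this contract, here is a minimal sketch. `MeanModel`, its fields, and the single-vector `fit` signature are hypothetical; only the `StatisticalModel`, `fit`, and `fit!` names come from the proposal above.

```julia
# Minimal sketch of the proposed fit/fit! contract.
# MeanModel and its fields are hypothetical examples.
abstract type StatisticalModel end

mutable struct MeanModel <: StatisticalModel
    mu::Float64   # estimated mean
    n::Int        # number of observations seen
end
MeanModel() = MeanModel(0.0, 0)

# fit returns a new fitted instance, leaving `obj` untouched.
fit(obj::MeanModel, data::AbstractVector{<:Real}) =
    MeanModel(sum(data) / length(data), length(data))

# fit! updates `obj` in place instead of allocating a new model.
function fit!(obj::MeanModel, data::AbstractVector{<:Real})
    obj.mu = sum(data) / length(data)
    obj.n = length(data)
    return obj
end
```

Supporting `DataFrame` and `Matrix` inputs would then amount to adding further `fit`/`fit!` methods for those argument types.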
This would be a useful addition:
- `fit_more!(obj::StatisticalModel, extra_data...)` would be implemented in models that have online updates; this includes all exponential family models (see the sketch below).
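Continuing the hypothetical `MeanModel` from the sketch above, an online update might look like the following; the incremental-mean formula is just an illustration of what an online update means here.

```julia
# Hypothetical online update: fold extra observations into the running
# mean without refitting from scratch (reuses MeanModel from above).
function fit_more!(obj::MeanModel, extra_data::AbstractVector{<:Real})
    total = obj.mu * obj.n + sum(extra_data)
    obj.n += length(extra_data)
    obj.mu = total / obj.n
    return obj
end
```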
Now, we make a distinction between supervised and unsupervised learning.
- Every supervised model can make "predictions" using `predict(FittedModel, newX::SamplePoint)`. Beware: predictions may not live in the sample space; e.g. in the case of logistic regression, predictions live on the interval [0, 1], but the sample space is the binary set {0, 1} (see the sketch below).
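For concreteness, a hypothetical fitted logistic-regression-like model could implement `predict` as follows. `LogisticModel`, its field, and `sigmoid` are illustrative names; the abstract type is reused from the first sketch.

```julia
# Hypothetical supervised model: predictions live in [0, 1] even though
# the sample space is {0, 1}, as noted above.
struct LogisticModel <: StatisticalModel
    beta::Vector{Float64}   # coefficients, assumed already fit
end

sigmoid(t) = 1 / (1 + exp(-t))

# `newX` is a single sample point (feature vector).
predict(m::LogisticModel, newX::AbstractVector{<:Real}) =
    sigmoid(sum(m.beta .* newX))
```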
Unsupervised learning comprises methods for clustering and transformation of the data, often with dimension reduction.
- `project(FittedModel, X::SamplePoint)` for unsupervised models is analogous to `predict` for supervised models. It returns the "compressed" value of the observation: in the case of PCA with a given k, it projects to the subspace spanned by the top k eigenvectors; in the case of clustering, it projects each observation to its cluster center. Both the input and the output live in the sample space.
- `transform` (can we find a less general name?) takes in points in the sample space and returns points in the transformed space (typically lower-dimensional), e.g. PCA loadings, or cluster assignments.
- We should also have a way to access the output of `fit`, e.g. the subspace as represented by PCA "scores", or the set of cluster centers (a combined sketch of all three follows below).
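To make these three pieces concrete, here is one possible sketch built around a PCA-like model. `PCAModel`, its fields, and the `fitted_subspace` accessor are hypothetical names; the abstract type again comes from the first sketch.

```julia
using LinearAlgebra

# Hypothetical PCA-like model: W holds the top-k eigenvectors (d×k),
# mu the column means used for centering.
struct PCAModel <: StatisticalModel
    W::Matrix{Float64}
    mu::Vector{Float64}
end

# transform: sample space (d-dimensional) -> transformed space (k scores)
transform(m::PCAModel, x::AbstractVector{<:Real}) = m.W' * (x .- m.mu)

# project: sample space -> sample space; reconstructs x on the fitted
# subspace, analogous to predict for supervised models.
project(m::PCAModel, x::AbstractVector{<:Real}) =
    m.mu .+ m.W * (m.W' * (x .- m.mu))

# Accessor for the output of fit (here, the fitted subspace basis);
# a clustering model would instead expose its cluster centers, with
# transform returning cluster assignments.
fitted_subspace(m::PCAModel) = m.W
```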