Proposal: interface for JuliaStats packages (Statistics and Machine Learning)
`StatisticalModel` is at the top of the type hierarchy. There is one universal method:
- Every statistical model can be fit to data. This takes data as input: `fit(obj::StatisticalModel, data, tuning_parameters)`. `fit` should be able to handle data in the form of a `DataFrame` or a `Matrix`, and it should return another instance of `StatisticalModel` (or perhaps `FittedModel`). Alternatively, one can call `fit!`, which would update `obj` in place instead of creating a new `StatisticalModel`.
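As a rough illustration, the relationship between `fit` and `fit!` could look like the sketch below; the names follow the proposal above and are illustrative, not an existing JuliaStats API, and the assumption is that each concrete model only needs to implement the in-place `fit!`.

```julia
# Minimal sketch of the proposed abstract type and fit/fit! pair
# (illustrative names, not an existing package API).
abstract type StatisticalModel end

# Non-mutating fit: work on a copy and return the fitted model.
function fit(obj::StatisticalModel, data, tuning_parameters...)
    fitted = deepcopy(obj)
    fit!(fitted, data, tuning_parameters...)
    return fitted
end

# Each concrete model type implements the in-place version.
fit!(obj::StatisticalModel, data, tuning_parameters...) =
    error("fit! not implemented for $(typeof(obj))")
```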
This would be a useful addition:
- `fit_more!(obj::StatisticalModel, extra_data...)` would be implemented in models that have online updates. This includes all exponential family models.
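Continuing the sketch above, here is one way a model with online updates might look. `OnlineMean` and `mean_estimate` are hypothetical names used only for illustration: the model keeps sufficient statistics (running sum and count), so extra observations can be folded in with `fit_more!` instead of refitting from scratch.

```julia
# Hypothetical model supporting online updates via fit_more!.
mutable struct OnlineMean <: StatisticalModel
    total::Float64
    n::Int
end
OnlineMean() = OnlineMean(0.0, 0)

# Initial fit from a batch of observations.
fit!(obj::OnlineMean, data::AbstractVector{<:Real}) =
    (obj.total = sum(data); obj.n = length(data); obj)

# Fold in extra observations without touching the earlier ones.
fit_more!(obj::OnlineMean, extra_data::AbstractVector{<:Real}) =
    (obj.total += sum(extra_data); obj.n += length(extra_data); obj)

# The fitted quantity itself.
mean_estimate(obj::OnlineMean) = obj.total / obj.n
```

Usage: `m = fit!(OnlineMean(), [1.0, 2.0]); fit_more!(m, [3.0]); mean_estimate(m)` returns `2.0`.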
Also universal is the notion of loss functions, used to calculate training error and test error.
- `cv(obj::StatisticalModel, data, k)` would perform cross-validation, i.e. call `fit` k times, and return the list of fits, or at least the average loss. In practice, one would call `cv` multiple times, for different values of the tuning parameters / hyperparameters.
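A rough sketch of how `cv` could sit on top of the proposed `fit` interface follows. It assumes rows of `data` are observations and that each model provides a `loss(fitted, holdout)` method; neither assumption comes from an existing package.

```julia
# k-fold cross-validation sketch built on the fit/loss interface above.
function cv(obj::StatisticalModel, data::AbstractMatrix, k::Integer)
    n = size(data, 1)
    losses = Float64[]
    for i in 1:k
        test_idx  = i:k:n                    # simple interleaved folds
        train_idx = setdiff(1:n, test_idx)
        fitted = fit(obj, data[train_idx, :])
        push!(losses, loss(fitted, data[test_idx, :]))   # assumed per-model loss
    end
    return losses, sum(losses) / k           # all fold losses and their average
end
```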
Now, we make a distinction between supervised and unsupervised learning.
- Every supervised model can make "predictions" using `predict(FittedModel, newX::SamplePoint)`. Beware: predictions may not live in the sample space, e.g. in the case of logistic regression, predictions live on the interval [0,1], but the sample space is the binary set {0,1}. (See the sketch after this list.)
- We should be able to see `training_error(FittedModel)` and `test_error(FittedModel, test_data)`.
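A hedged sketch of generic fallbacks for the supervised methods: it assumes each model defines `predict` and a per-model `loss`, and that a fitted model keeps a reference to its training data (the field name `training_data` is hypothetical).

```julia
# Generic test error: predict on the held-out inputs, then score with the
# model's loss (both predict and loss are assumed to be defined per model).
function test_error(obj::StatisticalModel, test_data)
    X, y = test_data                 # e.g. a (features, targets) pair
    return loss(y, predict(obj, X))
end

# Training error is the test error evaluated on the data the model was fit to.
training_error(obj::StatisticalModel) = test_error(obj, obj.training_data)
```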
Unsupervised learning comprises methods for clustering and transformation of the data, often with dimension reduction.
- `transform` (can we find a less general name?) takes in points in the sample space and returns points in the transformed space (typically lower-dimensional), e.g. PCA scores, or cluster assignments.
- `project(FittedModel, X::SamplePoint)` for unsupervised models is analogous to `predict` for supervised models. It returns the "compressed" value of the observation: in the case of PCA with a given k, it projects to the subspace spanned by the top k eigenvectors; in the case of clustering, it projects each observation to its cluster center. Both the input and the output live in the sample space. (See the sketch after this list.)
- We should also have a way to access the output of `fit`, e.g. the subspace as represented by the PCA loadings, or the set of cluster centers.
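To make the distinction concrete, here is an illustrative-only `KMeansModel` (a hypothetical type, not an existing package) showing how `transform`, `project`, and access to the output of `fit` could differ: the cluster centers are the output of `fit`, `transform` returns a cluster assignment (a point in the transformed space), and `project` returns the nearest center, which still lives in the sample space.

```julia
# Hypothetical clustering model used to illustrate the unsupervised interface.
mutable struct KMeansModel <: StatisticalModel
    centers::Matrix{Float64}         # one column per cluster center
end

# transform: index of the nearest center, i.e. the cluster assignment.
transform(obj::KMeansModel, x::AbstractVector{<:Real}) =
    argmin([sum(abs2, x .- c) for c in eachcol(obj.centers)])

# project: the "compressed" observation, its cluster center, in the sample space.
project(obj::KMeansModel, x::AbstractVector{<:Real}) =
    obj.centers[:, transform(obj, x)]

# Accessor for the output of fit (here, the set of cluster centers).
centers(obj::KMeansModel) = obj.centers
```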