[API] `fit` API design - separating data from model specification #23
Comments
Thanks for the extensive comment! I generally like the idea, and I see the additional benefit that this generalizes quite well to supporting [...]. To a certain extent it will also reduce the memory overhead, as we (externally) don't need multiple copies of `X`. @BerriJ, what do you think?
If you have a single `X` and only take column subsets of it, numpy does not necessarily copy the data - subsetting can produce views rather than copies. See here: https://numpy.org/doc/stable/user/basics.copies.html
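A small illustration of the views-versus-copies point (not from the thread, just numpy basics): contiguous column slices give views, while indexing with a list of column indices - as a dict-based subset specification presumably would - triggers a copy:

```python
import numpy as np

X = np.arange(12.0).reshape(4, 3)

# Basic slicing returns a view: the data buffer is shared, not copied.
view = X[:, 0:2]
print(np.shares_memory(X, view))  # True

# Integer-array ("fancy") indexing returns a copy.
copy = X[:, [0, 2]]
print(np.shares_memory(X, copy))  # False
```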
Yes, but for each distribution parameter we need to store the Gramians G and H, e.g.

`G_p = X[:, subset_p]^T X[:, subset_p]`

where `subset_p` is coming from the dict. The Gram will occupy memory (though much less, since it is k_p x k_p rather than n x k_p).
Very silly remark perhaps, but isn't G then the same for every distribution parameter anyway?
Yes, without weights that's true, but we have the weights needed for the iteratively reweighted least squares, and they're different for each distribution parameter. So the full specification of the Gram is

`G_p = X[:, subset_p]^T W_p X[:, subset_p]`

where `W_p` is the diagonal matrix of working weights for distribution parameter p. We could also introduce different forgetting for different distribution parameters, e.g. allowing the volatility to adapt faster than the mean estimation - helpful if we use this for volatility tracking with an intercept only.
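A minimal sketch of why the Gramians differ per parameter even with a single shared `X` (the names `weights` and `subsets` are hypothetical, chosen only to mirror the discussion):

```python
import numpy as np

def weighted_grams(X, weights, subsets):
    """Compute G_p = X[:, s_p]^T W_p X[:, s_p] for each distribution parameter.

    weights: dict of parameter name -> working-weight vector of length n (from IRLS)
    subsets: dict of parameter name -> column indices within X
    """
    grams = {}
    for p, cols in subsets.items():
        Xp = X[:, cols]                      # column subset for parameter p
        w = weights[p]                       # parameter-specific working weights
        grams[p] = Xp.T @ (w[:, None] * Xp)  # k_p x k_p, much smaller than n x k_p
    return grams
```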
ok, I see - though also in the weighted case I do not see an advantage from passing multiple copies of column subsets of `X`.
Yes, that's still a valid point.
closed with #25 :)
Original post

Thanks a lot for the interesting presentation in the `sktime` meetup today!

Following on from the presentation of the interface, I believe you have `model.fit(y, x1, x2, ...)`, where `y` is the target matrix, and the `xi` are regressors that you regress on different parameters of a parametric distribution, their number varying with the distribution.

I have a strong, firm opinion that this is not a good interface design, as it mixes model specification into the `fit` call - namely, the selection of the regressors, which is clearly a modelling choice.

In sklearn-like framework interfaces, it is a strict, nearly absolute API design line that model specification must happen in `__init__`, and this would also be a major friction point with an `skpro` interface.

Instead, I would design the interface as follows (see the sketch after this list):
- `fit` has two args, just `X` and `y`, and in this sequence (not `y` first); similarly `update(X, y)` and `predict(X)`.
- The selection of regressors moves to `__init__` (not `fit` etc), as a parameter `regressors` or `vars` or similar, which is a `dict` with keys being distribution parameter names, and values being an iterable specifying the columns within `X`. The default is `None`, meaning that the `X` used internally for fitting is the same as the `X` seen as input to the methods, for all parameters.

I am very, very certain about not using the current design, but less (yet still quite) certain about the "optimal" interface here.
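A rough sketch of how the proposed design could look (class name, parameter names, and the `distribution` argument are all illustrative, not the package's actual API):

```python
class OnlineDistributionalRegressor:
    """Illustrative sketch of the proposed interface, not a real implementation."""

    def __init__(self, distribution="normal", regressors=None):
        # Model specification lives in __init__: regressors is a dict
        # {distribution parameter name: iterable of column indices within X},
        # or None to use all columns of X for every parameter.
        self.distribution = distribution
        self.regressors = regressors

    def fit(self, X, y):
        # X first, then y, matching the sklearn convention.
        ...
        return self

    def update(self, X, y):
        # Same signature as fit, for online updating.
        ...
        return self

    def predict(self, X):
        # Predictive mean, so the estimator composes with sklearn tooling.
        ...

    def predict_proba_params(self, X):
        # Predicted distribution parameters, kept out of predict.
        ...
```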
If you are worried about specifying cross-effects or inter-dependencies, imo this does not impact the above, as those are modelling choices too, so in `sklearn`-like API designs they should go in some `__init__` and not in the `fit` either.

Minor added details, I feel less strongly about those (a usage example follows the list):
- Use `predict(X)` for the predictive mean; then the estimator can be used as an `sklearn` estimator.
- The distribution parameters are returned not by `predict`, but instead by `predict_proba_params` or similar.
or similarThe text was updated successfully, but these errors were encountered: