Remove intercept coming from workflows based on engine encodings #1033

hfrick · 2023-12-01T20:33:23Z

parsnip captures in its encodings (example) what to do about the intercept for a specific engine. In a "pure parsnip use case" this only matters when parsnip has to convert data between its own formula interface and the engine's matrix interface, covered in .convert_form_to_xy(). (The reason why we might have both include_intercept and remove_intercept as TRUE is because the indicators for factor variables are affected by the absence/presence of the intercept.)

When a parsnip model is used inside of a workflow together with a formula as the preprocessor, the encodings matter again (they do not matter for other kinds of preprocessors):

In workflow's pre stage:

finalize_blueprint() encodes include_intercept in the blueprint (if not supplied by the user, which is the default)
fit.action_formula() uses the (preprocessing) formula and that blueprint to make the mold (containing the outcomes and the predictors) - this may now contain an intercept based on the encodings.

In workflow's fit stage:

workflows sends the data (with the potential intercept) off to parsnip:
- to fit() if the workflow contains a model formula
- to fix_xy() otherwise

In parsnip:

one of four fit helpers will bridge between the parsnip interface and the engine interface (and actually get the fitted model):
- xy_xy()
- xy_form()
- form_form()
- form_xy()

That was all for fitting a model - they matter again when predicting with a model.
For "pure parsnip" and the formula-to-matrix conversion case, .convert_form_to_xy_new() deals with the intercept based on the encodings.
In workflows, predictors are forge()ed based on the blueprint generated during the fit process, containing the information on whether to include the intercept.

We've reckoned with that intercept coming in from workflows previously:

For xy_form(): Add argument for one hot encoding to parsnip #332, among many other things, adds a step inside of .convert_xy_to_form_fit() to remove the intercept from the predictor matrix (as always, based on the encodings) before putting it together with the outcome into a data frame and sending it off with a y ~ . formula. That dot will mean the engine may add its own intercept as appropriate.
For xy_xy(), remove double intercepts for glmnet models #352 adds that encoding-based removal step.

And then we touched it once more, in the follow-up PR #353 where we added that step to prepare_data() which is used in parsnip's predict call stack, to remove the intercept added by workflow's predict method.

Leaves us with the form_xy() and form_form() cases. We've now found that one-too-many intercept for form_xy() in this comment and for form_form() in tidymodels/workflows#210.

So, after a lot of sleuthing and diagramming, this PR adds that encoding-based removal step to both those cases!

As for prediction: we don't need to touch prepare_data() because for

xy-to-form (aka the xy_form() case for fitting): .convert_xy_to_form_new() subsets on x_var from $preproc which was generated after the intercept was removed.
form-to-xy: .convert_form_to_xy_new() makes the model frame with the terms from $preproce which were (with this PR) now also generated after the intercept from workflows was removed.
xy-to-xy and form-to-form are both seeing the workflows intercept being removed by the step added in remove intercept during predict #353.

Tests are in tidymodels/extratests#151

topepo · 2023-12-07T14:05:28Z

R/convert_data.R

@@ -45,6 +45,10 @@
    rlang::abort("`composition` should be either 'data.frame' or 'matrix'.")
  }

+  if (remove_intercept) {
+    data <- data[, colnames(data) != "(Intercept)", drop = FALSE]


Should we use a a regex or more general method for determining what should be taken out?

I think that it is unlikely to happen but, depending on how you make the data, you could get multiple columns with nearly the same names:

mat_1 <- model.matrix( ~ ., data = mtcars[, 2:4]) colnames(mat_1) #> [1] "(Intercept)" "cyl" "disp" "hp" mat_2 <- model.matrix( ~ ., data = as.data.frame(mat_1)) colnames(mat_2) #> [1] "(Intercept)" "`(Intercept)`" "cyl" "disp" #> [5] "hp" colnames(mat_2) != "(Intercept)" #> [1] FALSE TRUE TRUE TRUE TRUE

^{Created on 2023-12-07 with reprex v2.0.2}

Maybe something like:

col_names <- c("(Intercept)", "`(Intercept)`", "cyl", "disp", "hp") grepl("^[[:punct:]]+Intercept[[:punct:]]+", col_names) #> [1] TRUE TRUE FALSE FALSE FALSE

^{Created on 2023-12-07 with reprex v2.0.2}

I did run into this.

I think the only time that we would trigger this (as opposed to somebody including a predictor named (Intercept) or the like) is when we send the intercept coming from workflows into model.matrix(). We only do that in .convert_form_to_xy_fit().

We could let that happen and just clean up "more thoroughly"via such a regrex afterwards. That is not a good idea because then that intercept-as-predictor is stored in the terms which are re-used during prediction. That's what caused tidymodels/censored#272. Therefore, we need to remove the intercept coming from workflows before we make the terms.

As for generally beefing up and using the regrex instead of the direct name match as we have it now: I'm inclined to leave it as is because I think it might be helpful if such a case errors rather than "gets fixed". I am relatively confident that we now properly clean up after workflows. If not, I'd like to see it fail and investigate. If there is an additional intercept to clean up via the regrex that's not coming from workflows, maybe that should error too?

github-actions · 2023-12-22T00:51:19Z

This pull request has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.

hfrick added 2 commits December 1, 2023 20:21

for intercepts coming from workflows (details in the PR description)

e479ba0

bump version for extratests

c166281

hfrick mentioned this pull request Dec 1, 2023

No double intercepts when using a workflow tidymodels/extratests#151

Merged

update NEWS

b54dcd1

hfrick requested a review from topepo December 1, 2023 21:39

hfrick mentioned this pull request Dec 4, 2023

predict() on a glmnet workflow with a model formula fails tidymodels/censored#272

Closed

topepo reviewed Dec 7, 2023

View reviewed changes

topepo merged commit 0372970 into main Dec 7, 2023
7 checks passed

topepo deleted the workflows-intercept branch December 7, 2023 17:42

This was referenced Dec 8, 2023

Don't modify spark data in form_form() #1035

Merged

Double intercept for preprocessing formula with dot notation in model formula tidymodels/workflows#210

Closed

adapt test on interaction w/ parsnip re intercept tidymodels/workflows#214

Merged

github-actions bot locked and limited conversation to collaborators Dec 22, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Remove intercept coming from workflows based on engine encodings #1033

Remove intercept coming from workflows based on engine encodings #1033

hfrick commented Dec 1, 2023 •

edited

Loading

topepo Dec 7, 2023

hfrick Dec 7, 2023 •

edited

Loading

github-actions bot commented Dec 22, 2023

Remove intercept coming from workflows based on engine encodings #1033

Remove intercept coming from workflows based on engine encodings #1033

Conversation

hfrick commented Dec 1, 2023 • edited Loading

topepo Dec 7, 2023

Choose a reason for hiding this comment

hfrick Dec 7, 2023 • edited Loading

Choose a reason for hiding this comment

github-actions bot commented Dec 22, 2023

hfrick commented Dec 1, 2023 •

edited

Loading

hfrick Dec 7, 2023 •

edited

Loading