Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove intercept coming from workflows based on engine encodings #1033

Merged
merged 3 commits into from
Dec 7, 2023

Conversation

hfrick
Copy link
Member

@hfrick hfrick commented Dec 1, 2023

closes #1032

parsnip captures in its encodings (example) what to do about the intercept for a specific engine. In a "pure parsnip use case" this only matters when parsnip has to convert data between its own formula interface and the engine's matrix interface, covered in .convert_form_to_xy(). (The reason why we might have both include_intercept and remove_intercept as TRUE is because the indicators for factor variables are affected by the absence/presence of the intercept.)

When a parsnip model is used inside of a workflow together with a formula as the preprocessor, the encodings matter again (they do not matter for other kinds of preprocessors):

In workflow's pre stage:

  • finalize_blueprint() encodes include_intercept in the blueprint (if not supplied by the user, which is the default)
  • fit.action_formula() uses the (preprocessing) formula and that blueprint to make the mold (containing the outcomes and the predictors) - this may now contain an intercept based on the encodings.

In workflow's fit stage:

  • workflows sends the data (with the potential intercept) off to parsnip:
    • to fit() if the workflow contains a model formula
    • to fix_xy() otherwise

In parsnip:

  • one of four fit helpers will bridge between the parsnip interface and the engine interface (and actually get the fitted model):
    • xy_xy()
    • xy_form()
    • form_form()
    • form_xy()

That was all for fitting a model - they matter again when predicting with a model.
For "pure parsnip" and the formula-to-matrix conversion case, .convert_form_to_xy_new() deals with the intercept based on the encodings.
In workflows, predictors are forge()ed based on the blueprint generated during the fit process, containing the information on whether to include the intercept.

We've reckoned with that intercept coming in from workflows previously:

  • For xy_form(): Add argument for one hot encoding to parsnip #332, among many other things, adds a step inside of .convert_xy_to_form_fit() to remove the intercept from the predictor matrix (as always, based on the encodings) before putting it together with the outcome into a data frame and sending it off with a y ~ . formula. That dot will mean the engine may add its own intercept as appropriate.
  • For xy_xy(), remove double intercepts for glmnet models #352 adds that encoding-based removal step.

And then we touched it once more, in the follow-up PR #353 where we added that step to prepare_data() which is used in parsnip's predict call stack, to remove the intercept added by workflow's predict method.

Leaves us with the form_xy() and form_form() cases. We've now found that one-too-many intercept for form_xy() in this comment and for form_form() in tidymodels/workflows#210.

So, after a lot of sleuthing and diagramming, this PR adds that encoding-based removal step to both those cases!

As for prediction: we don't need to touch prepare_data() because for

  • xy-to-form (aka the xy_form() case for fitting): .convert_xy_to_form_new() subsets on x_var from $preproc which was generated after the intercept was removed.
  • form-to-xy: .convert_form_to_xy_new() makes the model frame with the terms from $preproce which were (with this PR) now also generated after the intercept from workflows was removed.
  • xy-to-xy and form-to-form are both seeing the workflows intercept being removed by the step added in remove intercept during predict #353.

Tests are in tidymodels/extratests#151

workflows-parsnip-intercept

@@ -45,6 +45,10 @@
rlang::abort("`composition` should be either 'data.frame' or 'matrix'.")
}

if (remove_intercept) {
data <- data[, colnames(data) != "(Intercept)", drop = FALSE]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we use a a regex or more general method for determining what should be taken out?

I think that it is unlikely to happen but, depending on how you make the data, you could get multiple columns with nearly the same names:

mat_1 <- model.matrix( ~ ., data = mtcars[, 2:4])
colnames(mat_1)
#> [1] "(Intercept)" "cyl"         "disp"        "hp"

mat_2 <- model.matrix( ~ ., data = as.data.frame(mat_1))
colnames(mat_2)
#> [1] "(Intercept)"   "`(Intercept)`" "cyl"           "disp"         
#> [5] "hp"

colnames(mat_2) != "(Intercept)"
#> [1] FALSE  TRUE  TRUE  TRUE  TRUE

Created on 2023-12-07 with reprex v2.0.2

Maybe something like:

col_names <- c("(Intercept)", "`(Intercept)`", "cyl", "disp", "hp")
grepl("^[[:punct:]]+Intercept[[:punct:]]+", col_names)
#> [1]  TRUE  TRUE FALSE FALSE FALSE

Created on 2023-12-07 with reprex v2.0.2

Copy link
Member Author

@hfrick hfrick Dec 7, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did run into this.

I think the only time that we would trigger this (as opposed to somebody including a predictor named (Intercept) or the like) is when we send the intercept coming from workflows into model.matrix(). We only do that in .convert_form_to_xy_fit().

We could let that happen and just clean up "more thoroughly"via such a regrex afterwards. That is not a good idea because then that intercept-as-predictor is stored in the terms which are re-used during prediction. That's what caused tidymodels/censored#272. Therefore, we need to remove the intercept coming from workflows before we make the terms.

As for generally beefing up and using the regrex instead of the direct name match as we have it now: I'm inclined to leave it as is because I think it might be helpful if such a case errors rather than "gets fixed". I am relatively confident that we now properly clean up after workflows. If not, I'd like to see it fail and investigate. If there is an additional intercept to clean up via the regrex that's not coming from workflows, maybe that should error too?

Copy link

This pull request has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.

@github-actions github-actions bot locked and limited conversation to collaborators Dec 22, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Double intercept when used in a workflow with a formula as preprocessor
2 participants