
resolving aliased argument names with lightgbm #53

Closed
simonpcouch opened this issue Nov 4, 2022 · 5 comments · Fixed by #55

@simonpcouch Contributor

Following up on @jameslamb's comment here—thank you for being willing to discuss. :)


Some background, for the GitHub archeologists:

lightgbm allows passing many of its arguments with aliases. On the parsnip side, the affected arguments include both main and engine arguments to boost_tree(), among them the now-tunable engine argument num_leaves. On the lightgbm side, they include both "core" and "control" arguments.

As of now, any aliases supplied to set_engine() are passed through the dots of bonsai::train_lightgbm() to the dots of lightgbm::lgb.train(). lightgbm's machinery takes care of resolving aliases, with some rules that generally prevent silent failures while tuning:

https://github.com/microsoft/LightGBM/blob/e45fc48405e9877138ffb5f7e1fd4c449752d323/R-package/R/utils.R#L176-L181
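Roughly, the precedence those rules implement looks like the following. This is an illustrative paraphrase rather than LightGBM's verbatim source, and resolve_param_value() is a made-up name:

# Illustrative paraphrase of lightgbm's alias precedence; not the
# actual implementation. The value under the main parameter name wins;
# otherwise the value under the first supplied alias is used.
resolve_param_value <- function(main_name, aliases, params, default) {
  if (!is.null(params[[main_name]])) {
    return(params[[main_name]])
  }
  supplied <- setdiff(intersect(aliases, names(params)), main_name)
  if (length(supplied) > 0L) {
    return(params[[supplied[1L]]])
  }
  default
}

# e.g., mirroring the third reprex below: the main name min_data_in_leaf
# is present, so the alias min_data_per_leaf is ignored
# resolve_param_value("min_data_in_leaf",
#                     c("min_data_per_leaf", "min_data", "min_child_samples"),
#                     list(min_data_in_leaf = 13, min_data_per_leaf = 1), 20)
# #> [1] 13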

  • If a main argument is marked for tuning and its main translation (i.e. lightgbm's non-alias argument name) is supplied as an engine arg, parsnip machinery will throw warnings that the engine argument will be ignored. e.g. for min_n -> min_data_in_leaf:
! Bootstrap1: preprocessor 1/1, model 1/3: The following arguments cannot be manually modified and were removed: min_data_in_leaf
...
  • If a main argument is marked for tuning and a lightgbm alias is supplied as an engine arg, we ignore the alias silently. (Note that bonsai::train_lightgbm() sets lgb.train()'s verbose argument to 1L if one isn't supplied.)

  • The scariest issue I'd anticipate is the user not touching the main argument (that will be translated to the main, non-alias lgb.train argument), but setting the alias in set_engine(). In that case, the bonsai::train_lightgbm() default kicks in, and the user-supplied engine argument is silently ignored in favor of the default supplied as the non-alias lightgbm argument.🫣

Reprex:
library(tidymodels)
library(bonsai)
library(testthat)

data("penguins", package = "modeldata")

penguins <- penguins[complete.cases(penguins),]
penguins_split <- initial_split(penguins)
set.seed(1)
boots <- bootstraps(training(penguins_split), 3)
base_wf <- workflow() %>% add_formula(bill_length_mm ~ .)

Marking a main argument for tuning, as usual:

bt_spec <-
  boost_tree(min_n = tune()) %>%
  set_engine("lightgbm") %>%
  set_mode("regression")

bt_wf <-
  base_wf %>%
  add_model(bt_spec)

set.seed(1)
bt_res_correct <- tune_grid(bt_wf, boots, grid = 3, control = control_grid(save_pred = TRUE))

bt_res_correct
#> # Tuning results
#> # Bootstrap sampling 
#> # A tibble: 3 × 5
#>   splits           id         .metrics         .notes           .predictions
#>   <list>           <chr>      <list>           <list>           <list>      
#> 1 <split [249/93]> Bootstrap1 <tibble [6 × 5]> <tibble [0 × 3]> <tibble>    
#> 2 <split [249/93]> Bootstrap2 <tibble [6 × 5]> <tibble [0 × 3]> <tibble>    
#> 3 <split [249/97]> Bootstrap3 <tibble [6 × 5]> <tibble [0 × 3]> <tibble>

Marking a main argument for tuning, and supplying its non-alias translation as engine arg:

bt_spec <-
  boost_tree(min_n = tune()) %>%
  set_engine("lightgbm", min_data_in_leaf = 1) %>%
  set_mode("regression")

bt_wf <-
  base_wf %>%
  add_model(bt_spec)

set.seed(1)
bt_res_both <- tune_grid(bt_wf, boots, grid = 3)
#> ! Bootstrap1: preprocessor 1/1, model 1/3: The following arguments cannot be manually modified and were removed: mi...
#> ! Bootstrap1: preprocessor 1/1, model 2/3: The following arguments cannot be manually modified and were removed: mi...
#> ! Bootstrap1: preprocessor 1/1, model 3/3: The following arguments cannot be manually modified and were removed: mi...
#> ! Bootstrap2: preprocessor 1/1, model 1/3: The following arguments cannot be manually modified and were removed: mi...
#> ! Bootstrap2: preprocessor 1/1, model 2/3: The following arguments cannot be manually modified and were removed: mi...
#> ! Bootstrap2: preprocessor 1/1, model 3/3: The following arguments cannot be manually modified and were removed: mi...
#> ! Bootstrap3: preprocessor 1/1, model 1/3: The following arguments cannot be manually modified and were removed: mi...
#> ! Bootstrap3: preprocessor 1/1, model 2/3: The following arguments cannot be manually modified and were removed: mi...
#> ! Bootstrap3: preprocessor 1/1, model 3/3: The following arguments cannot be manually modified and were removed: mi...

bt_res_both
#> # Tuning results
#> # Bootstrap sampling 
#> # A tibble: 3 × 4
#>   splits           id         .metrics         .notes          
#>   <list>           <chr>      <list>           <list>          
#> 1 <split [249/93]> Bootstrap1 <tibble [6 × 5]> <tibble [3 × 3]>
#> 2 <split [249/93]> Bootstrap2 <tibble [6 × 5]> <tibble [3 × 3]>
#> 3 <split [249/97]> Bootstrap3 <tibble [6 × 5]> <tibble [3 × 3]>
#> 
#> There were issues with some computations:
#> 
#>   - Warning(s) x9: The following arguments cannot be manually modified and were remo...
#> 
#> Run `show_notes(.Last.tune.result)` for more information.

Marking a main argument for tuning, and supplying a lightgbm alias of that argument as an engine arg:

set.seed(1)
bt_spec <-
  boost_tree(min_n = tune()) %>%
  set_engine("lightgbm", min_data_per_leaf = 1) %>%
  set_mode("regression")

bt_wf <-
  base_wf %>%
  add_model(bt_spec)

bt_res_alias <- 
  tune_grid(
    bt_wf, boots, grid = 3, 
    control = control_grid(extract = extract_fit_engine, save_pred = TRUE)
  )

Note that both params end up in the resulting object, though only one is referenced when making predictions.

bt_res_alias %>%
  pull(.extracts) %>%
  `[[`(1) 
#> # A tibble: 3 × 3
#>   min_n .extracts  .config             
#>   <int> <list>     <chr>               
#> 1    13 <lgb.Bstr> Preprocessor1_Model1
#> 2    33 <lgb.Bstr> Preprocessor1_Model2
#> 3    25 <lgb.Bstr> Preprocessor1_Model3

lgb_fit <- bt_res_alias %>%
  pull(.extracts) %>%
  `[[`(1) %>%
  pull(.extracts) %>%
  `[[`(1)

lgb_fit$params$min_data_in_leaf
#> [1] 13
lgb_fit$params$min_data_per_leaf
#> [1] 1

# all good
expect_equal(
  collect_predictions(bt_res_correct),
  collect_predictions(bt_res_alias)
)

bt_mets_correct <- 
  bt_res_correct %>%
  select_best("rmse") %>%
  finalize_workflow(bt_wf, parameters = .) %>%
  last_fit(penguins_split)

bt_mets_alias <- 
  bt_res_alias %>%
  select_best("rmse") %>%
  finalize_workflow(bt_wf, parameters = .) %>%
  last_fit(penguins_split)

# all good
expect_equal(
  bt_mets_correct$.metrics,
  bt_mets_alias$.metrics
)

Created on 2022-11-04 with reprex v2.0.2


I think the best approach here would be to raise a warning or error whenever an alias that maps to a main boost_tree() argument is supplied, and note that it can be resolved by passing it as a main argument to boost_tree(). Otherwise, passing aliases as engine arguments (i.e. aliases that don't map to main arguments) seems unproblematic to me. Another option is to set verbose to a setting that allows lightgbm to propagate its own prompts about duplicated aliases whenever any alias is supplied, though this feels like it might obscure train_lightgbm()'s role in passing a non-aliased argument. Either way, this requires being able to detect when an alias is supplied; a sketch of what that check could look like follows.
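For illustration, a check along those lines could run over the engine-argument dots in train_lightgbm() before they're spliced into lgb.train(). A sketch only: check_lightgbm_aliases() and main_param_aliases are hypothetical names, not anything that exists in bonsai today.

# Hypothetical sketch: warn about engine arguments that are lightgbm
# aliases of parameters parsnip already exposes as main boost_tree()
# arguments. `main_param_aliases` maps parsnip argument names to
# character vectors of lightgbm aliases.
check_lightgbm_aliases <- function(dots, main_param_aliases) {
  for (parsnip_arg in names(main_param_aliases)) {
    hits <- intersect(main_param_aliases[[parsnip_arg]], names(dots))
    if (length(hits) > 0L) {
      rlang::warn(paste0(
        "The engine argument(s) ",
        paste0("`", hits, "`", collapse = ", "),
        " will be ignored; pass `", parsnip_arg,
        "` as a main argument to `boost_tree()` instead."
      ))
      dots[hits] <- NULL
    }
  }
  dots
}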

A question for you, James, if you're up for it—is there any sort of dictionary that we could reference that would contain these mappings? A list like the one currently output by lightgbm:::.PARAMETER_ALIASES() would be perfect, though that list also contains the parameters listed under "Learning Control Parameters".

We could also put that together ourselves—we'd just need the mappings for 8 of them:

library(tidymodels)
library(bonsai)

get_from_env("boost_tree_args") %>%
  filter(engine == "lightgbm")
#> # A tibble: 8 × 5
#>   engine   parsnip        original                func             has_submodel
#>   <chr>    <chr>          <chr>                   <list>           <lgl>       
#> 1 lightgbm tree_depth     max_depth               <named list [2]> FALSE       
#> 2 lightgbm trees          num_iterations          <named list [2]> TRUE        
#> 3 lightgbm learn_rate     learning_rate           <named list [2]> FALSE       
#> 4 lightgbm mtry           feature_fraction_bynode <named list [2]> FALSE       
#> 5 lightgbm min_n          min_data_in_leaf        <named list [2]> FALSE       
#> 6 lightgbm loss_reduction min_gain_to_split       <named list [2]> FALSE       
#> 7 lightgbm sample_size    bagging_fraction        <named list [2]> FALSE       
#> 8 lightgbm stop_iter      early_stopping_rounds   <named list [2]> FALSE

Created on 2022-11-04 with reprex v2.0.2

@jameslamb Contributor

Sorry for the delayed response @simonpcouch . I don't know a lot about {parsnip} so I don't know what an "engine argument" is (I can go read docs, just haven't yet).

But I do think I can answer your core question. Yes, the information in lightgbm:::.PARAMETER_ALIASES() is exactly what you want.

Given that:

  • aliases are never removed, and only added very very rarely
    • (on the order of one every 2 years for some parameters, and never for at least half of the parameters)
  • .PARAMETER_ALIASES is not exported in the library's public API
  • even if we did decide to export it, that wouldn't happen until LightGBM v4.0.0, which has an uncertain release date

I recommend the following:

  1. build the latest development version of {lightgbm} locally
  2. run lightgbm:::.PARAMETER_ALIASES() to get all the mappings
  3. hard-code the mappings for the parameters you care about in {bonsai}'s source code
    • note that that list's keys are the LightGBM "main" arguments and its values are all the aliases
  4. do whatever you want for parameter resolution inside {bonsai}

I know that's not ideal, but the list of parameters and their aliases changes so infrequently, and LightGBM's release cycle is so unreliable, that I think it's the best way forward for what you're working on here.
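For concreteness, the hard-coded dictionary could be as simple as a named list in {bonsai}'s source mirroring .PARAMETER_ALIASES() but restricted to the parameters {bonsai} maps. A sketch only; the alias vectors below are abbreviated and would need to be filled in from the development version's output:

# Sketch of a hard-coded alias dictionary, keyed by LightGBM's "main"
# argument names as in .PARAMETER_ALIASES(). Alias vectors abbreviated;
# fill in the full set from lightgbm:::.PARAMETER_ALIASES() on the
# development version.
lightgbm_aliases <- list(
  min_data_in_leaf = c("min_data_per_leaf", "min_data", "min_child_samples"),
  num_leaves = c("num_leaf", "max_leaves", "max_leaf")
  # ..., one entry per mapped parameter
)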

@simonpcouch Contributor Author

Thanks for chiming in, @jameslamb! This is helpful—it's good to have a better sense of these aliases' lifecycles, and a thumbs-up on hard-coding those values in rather than waiting on an export for lightgbm:::.PARAMETER_ALIASES().

To save you some time navigating the parsnip docs (sorry for not clarifying here), loosely: parsnip differentiates between "main" (or "model") arguments and "engine" arguments. lightgbm is one possible "engine" for boosted trees in parsnip, alongside friends like xgboost or C5.0. Main arguments are hyperparameters for boosted trees that are common to (nearly) all boosted tree engines, and have a standardized parsnip argument name and structure to be passed to boost_tree(). Engine arguments are specific to an individual engine, and are passed via the ... in set_engine() with the name and structure supported by the engine. The big idea is that, once a user understands the boost_tree() arguments, they can fit a boosted tree with any engine supported by parsnip:

mtcars$cyl <- as.factor(mtcars$cyl)

library(bonsai)
#> Loading required package: parsnip

bt <- boost_tree(trees = 100, min_n = 5) %>% set_mode("classification")

bt_xgb <- bt %>% set_engine("xgboost") %>% fit(cyl ~ ., mtcars)
bt_c50 <- bt %>% set_engine("C5.0") %>% fit(cyl ~ ., mtcars)
bt_lgb <- bt %>% set_engine("lightgbm") %>% fit(cyl ~ ., mtcars)

predict(bt_xgb, head(mtcars))
#> # A tibble: 6 × 1
#>   .pred_class
#>   <fct>      
#> 1 6          
#> 2 6          
#> 3 4          
#> 4 4          
#> 5 8          
#> 6 6
predict(bt_c50, head(mtcars))
#> # A tibble: 6 × 1
#>   .pred_class
#>   <fct>      
#> 1 6          
#> 2 6          
#> 3 4          
#> 4 6          
#> 5 8          
#> 6 6
predict(bt_lgb, head(mtcars))
#> # A tibble: 6 × 1
#>   .pred_class
#>   <fct>      
#> 1 6          
#> 2 6          
#> 3 4          
#> 4 6          
#> 5 8          
#> 6 6

# to supply an "engine argument":
bt_lgb2 <- bt %>% set_engine("lightgbm", num_leaves = 10) %>% fit(cyl ~ ., mtcars)

predict(bt_lgb2, head(mtcars))
#> # A tibble: 6 × 1
#>   .pred_class
#>   <fct>      
#> 1 6          
#> 2 6          
#> 3 4          
#> 4 6          
#> 5 8          
#> 6 6

Created on 2022-11-28 with reprex v2.0.2

@jameslamb Contributor

Ahhhhh got it, thanks very much for that explanation!

And sorry again for the delayed response. I'll try to be more responsive in the future. @ me any time if you need advice on these sorts of integrating-with-LightGBM questions.

@simonpcouch Contributor Author

No need to apologize—your insight has been much appreciated!

@github-actions

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.

@github-actions github-actions bot locked and limited conversation to collaborators Jan 11, 2023