feat: Voting methods for feature ranking in efs (#112)
* add stability selection article
* add Rcpp code for approval voting feature ranking method
* add citation
* extra check during init()
* update doc + use the Rcpp interface for approval voting
* add templates for params in ArchiveBatchFSelect + updocs
* use testthat expectations (not checkmate ones!)
* add test for newly implemented voting methods
* update test for av
* fix note
* refactor AV_rcpp, add SAV_rcpp
* add norm_score, and SAV R function
* add sav, improve doc
* fix efs test
* update and improve test for AV
* add sav test
* add borda score
* update tests
* add seq and revseq PAV Rcpp methods
* add R functions for the PAV methods
* comment printing
* add tests for PAV methods
* add PAV methods to efs
* refactor: do not use C++ RNGs
* fix startsWith
* updocs
* fix data.table note
* add committee_size parameter, refactor borda score
* add large data test for seq pav
* refactor C++ code, add optimized PAV
* remove revseq-PAV method, use optimized seqPAV
* update tests
* remove suboptimal seqPAV function
* shuffle candidates outside Rcpp functions (same tie-breaking)
* optimize Phragmen a bit => do not randomly select the candidate with min load
* add phragmen's rule in efs
* correct borda score + use phragmens rule
* add tests for Phragmen's rule
* correct weighted Phragmen's rule
* add specific test for phragmen's rule
* run document()
* show data.table result after using ':='
* add n_resamples field + nicer obj print
* cover edge case (eg lasso resulted in no features getting selected)
* updocs
* small styling fix
* add Stabl ref
* more descriptive name
* add embedded ensemble feature selection
* remove print()
* add TOCHECK comment on benchmark design
* use internal valid task
* simplify
* ...
* store_models = FALSE
* ...
* separate the use of inner_measure and measure used in the test sets
* updocs
* update tests
* refactor: expect_vector => expect_numeric
* fix partial arg match
* fix example
* use fastVoteR for feature ranking
* pass named list to callback parameter
* skip test if fastVoteR is not available
* refactor: better handling of inner measure
* add tests for embedded_ensemble_fselect()
* update NEWs
* add active_measure field
* remove Remotes as fastVoteR is now on CRAN :)
* refine doc

--------

Co-authored-by: be-marc <marcbecker@posteo.de>
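Several of the voting rules named above (approval voting, SAV, Borda, Phragmén) reduce, in their simplest form, to tallies over the feature sets selected in each resampling iteration. The sketch below is a minimal, language-neutral illustration of approval voting and satisfaction approval voting (SAV) in Python, not the package's Rcpp/fastVoteR implementation; note that the real code breaks ties by shuffling candidates beforehand, whereas here ties are broken alphabetically for reproducibility.

```python
from collections import Counter

def approval_voting(ballots, candidates):
    """Rank candidates by how many voters (resampling iterations) approved them."""
    counts = Counter(f for ballot in ballots for f in ballot)
    # sort by descending approval count; alphabetical tie-break for determinism
    return sorted(candidates, key=lambda f: (-counts[f], f))

def sav_scores(ballots, candidates):
    """Satisfaction approval voting: each voter's single point is split
    equally among the features on their ballot."""
    scores = dict.fromkeys(candidates, 0.0)
    for ballot in ballots:
        for f in ballot:
            if f in scores:
                scores[f] += 1.0 / len(ballot)
    return scores

# toy "ballots": feature sets selected in 4 resampling iterations
ballots = [{"x1", "x2"}, {"x1"}, {"x1", "x3"}, {"x2", "x3"}]
candidates = ["x1", "x2", "x3", "x4"]
ranking = approval_voting(ballots, candidates)  # x1 approved 3 times, x4 never
```

Here SAV rewards features that appear on small ballots: a feature selected alone by a sparse model (like the lasso edge case mentioned above) earns a full point from that voter, while a feature on a large ballot earns only a fraction.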
Showing 12 changed files with 804 additions and 224 deletions.
@@ -0,0 +1,112 @@
#' @title Embedded Ensemble Feature Selection
#'
#' @include CallbackBatchFSelect.R
#'
#' @description
#' Ensemble feature selection using multiple learners.
#' The ensemble feature selection method identifies the most predictive features of a dataset by leveraging multiple machine learning models and resampling techniques.
#' Returns an [EnsembleFSResult].
#'
#' @details
#' The method begins by applying an initial resampling technique, specified by the user, to create **multiple subsamples** (train/test splits) from the original dataset.
#' This resampling process generates diverse subsets of data for robust feature selection.
#'
#' For each train set generated in the previous step, the method applies learners
#' that support **embedded feature selection**.
#' Each learner is then scored on its ability to predict on the corresponding
#' test set, and the features it selected during training are stored, for each
#' combination of subsample and learner.
#'
#' Results are stored in an [EnsembleFSResult].
#'
#' @param learners (list of [mlr3::Learner])\cr
#'   The learners to be used for feature selection.
#'   All learners must have the `selected_features` property, i.e. implement
#'   embedded feature selection (e.g. regularized models).
#' @param init_resampling ([mlr3::Resampling])\cr
#'   The initial resampling strategy of the data, from which each train set
#'   will be passed on to the learners and each test set will be used for
#'   prediction.
#'   Can only be [mlr3::ResamplingSubsampling] or [mlr3::ResamplingBootstrap].
#' @param measure ([mlr3::Measure])\cr
#'   The measure used to score each learner on the test sets generated by
#'   `init_resampling`.
#'   If `NULL`, default measure is used.
#' @param store_benchmark_result (`logical(1)`)\cr
#'   Whether to store the benchmark result in [EnsembleFSResult] or not.
#'
#' @template param_task
#'
#' @returns an [EnsembleFSResult] object.
#'
#' @source
#' `r format_bib("meinshausen2010", "hedou2024")`
#' @export
#' @examples
#' \donttest{
#' eefsr = embedded_ensemble_fselect(
#'   task = tsk("sonar"),
#'   learners = lrns(c("classif.rpart", "classif.featureless")),
#'   init_resampling = rsmp("subsampling", repeats = 5),
#'   measure = msr("classif.ce")
#' )
#' eefsr
#' }
embedded_ensemble_fselect = function(
  task,
  learners,
  init_resampling,
  measure,
  store_benchmark_result = TRUE
) {
  assert_task(task)
  assert_learners(as_learners(learners), task = task, properties = "selected_features")
  assert_resampling(init_resampling)
  assert_choice(class(init_resampling)[1], choices = c("ResamplingBootstrap", "ResamplingSubsampling"))
  assert_measure(measure, task = task)
  assert_flag(store_benchmark_result)

  init_resampling$instantiate(task)

  design = benchmark_grid(
    tasks = task,
    learners = learners,
    resamplings = init_resampling
  )

  bmr = benchmark(design, store_models = TRUE)

  trained_learners = bmr$score()$learner

  # extract selected features
  features = map(trained_learners, function(learner) {
    learner$selected_features()
  })

  # extract n_features
  n_features = map_int(features, length)

  # extract scores on the test sets
  scores = bmr$score(measure)

  set(scores, j = "features", value = features)
  set(scores, j = "n_features", value = n_features)
  setnames(scores, "iteration", "resampling_iteration")

  # remove R6 objects
  set(scores, j = "learner", value = NULL)
  set(scores, j = "task", value = NULL)
  set(scores, j = "resampling", value = NULL)
  set(scores, j = "prediction_test", value = NULL)
  set(scores, j = "task_id", value = NULL)
  set(scores, j = "nr", value = NULL)
  set(scores, j = "resampling_id", value = NULL)
  set(scores, j = "uhash", value = NULL)

  EnsembleFSResult$new(
    result = scores,
    features = task$feature_names,
    benchmark_result = if (store_benchmark_result) bmr,
    measure = measure
  )
}
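Stripped of mlr3 specifics, the control flow of `embedded_ensemble_fselect()` is: instantiate a subsampling, fit every learner on every train set, record the features each fitted model selected plus its test-set score, and flatten everything into one result table. The Python sketch below mirrors that flow under stated assumptions: `DummyLearner`, the `subsampling()` helper, and the `measure` signature are hypothetical stand-ins for illustration, not mlr3 APIs.

```python
import random

def subsampling(n, repeats, ratio=0.67, seed=42):
    """Train/test index splits, analogous to mlr3's 'subsampling' resampling."""
    rng = random.Random(seed)
    splits = []
    for _ in range(repeats):
        idx = list(range(n))
        rng.shuffle(idx)
        cut = int(ratio * n)
        splits.append((idx[:cut], idx[cut:]))
    return splits

def embedded_ensemble_fselect(data, learners, repeats, measure):
    """For each (subsample, learner) pair: fit, store the selected features,
    and score the fitted model on the held-out test rows."""
    result = []
    for i, (train, test) in enumerate(subsampling(len(data), repeats)):
        for learner in learners:
            model = learner.fit([data[j] for j in train])
            feats = model.selected_features()  # embedded selection
            result.append({
                "learner_id": learner.id,
                "resampling_iteration": i,
                "features": feats,
                "n_features": len(feats),
                "score": measure([data[j] for j in test], model),
            })
    return result

class DummyLearner:
    """Hypothetical learner with 'embedded' selection: reports a fixed feature set."""
    def __init__(self, id, keep):
        self.id = id
        self._keep = keep
    def fit(self, rows):
        return self
    def selected_features(self):
        return list(self._keep)

res = embedded_ensemble_fselect(
    data=list(range(100)),
    learners=[DummyLearner("l1", ["x1", "x2"]), DummyLearner("l2", ["x1"])],
    repeats=3,
    measure=lambda rows, model: 0.0,  # placeholder scoring function
)
```

The flat rows in `res` correspond to the `scores` data.table that the R function passes to `EnsembleFSResult$new()`: one row per learner per resampling iteration, with the heavyweight model objects dropped.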