'DecisionTreeTuningAnalysis' is an R project that generates automated graphical analyses for our paper 'Better Trees: An empirical study on hyperparameter tuning of classification decision trees' [01]. The analysis code handles data generated by our hyperparameter tuning project (HpTuning) but may be easily extended. The main features cover the hyperparameter profile of the decision tree induction algorithms, i.e., they answer the following questions:
- Question 01: Is tuning of trees really necessary?
- Question 02: When performing tuning, which are the most recommended techniques, considering our choices?
- Question 03: Which hyperparameters most impact the induced trees?
- Question 04: In which situations should we tune trees?
Installation is done via git clone. Run the following command in your terminal session:
git clone https://github.com/rgmantovani/DecisionTreeTuningAnalysis
The classification algorithms analyzed must follow the 'mlr' R package implementation [02]. A complete list of the available learners may be found here. The code provides results for two decision tree induction algorithms: J48 (classif.J48) and CART (classif.rpart).
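For reference, the snippet below is only an illustration of how these two learners are named and instantiated with mlr; it assumes the mlr package is installed (classif.J48 additionally needs RWeka, and classif.rpart needs rpart).

```r
# Illustration only: instantiating the two analyzed learners via mlr's
# naming convention.
library(mlr)

j48  <- makeLearner("classif.J48")    # Weka's J48 (C4.5 implementation)
cart <- makeLearner("classif.rpart")  # CART, as implemented in rpart

# To list every classification learner available in mlr:
# listLearners("classif")
```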
Hyperparameter tuning results should be placed in the data/hptuning_full_space/<algorithm.name>/results sub-directory. We did not upload the raw results since they comprise more than 50 GB of data (but you can download them from here). Thus, we developed some scripts to extract useful information from the executed jobs. These scripts are located in the scripts folder. The automated analysis will only work if they have been run beforehand; the automated code also checks this and returns instructions to the user on how to proceed (a minimal illustrative version of such a check is sketched after the list below). There are four auxiliary scripts:
- 01_extractRepResults.R: it extracts all the average performance measures obtained from 30 repetitions of a single job, composed of an algorithm, a dataset, and a tuning technique. Most of the performance plots use this information;
- 02_extractOptPaths.R: it extracts all the optimization paths obtained by the tuning techniques when executed 30 times for each dataset and algorithm configuration. All the convergence and learning curve plots use this information;
- 03_extractModelStats.R: it extracts model statistics (number of leaves, number of rules, tree size, etc.) for each job;
- 04_createFanovaInputs.R: it creates the input files used by the fANOVA framework [03], which computes hyperparameter importance using marginal distributions.
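As mentioned above, the analysis code verifies that the required data is in place before proceeding. The snippet below is a minimal sketch of that kind of check, assuming the directory layout described earlier; the variable names are illustrative and not the project's actual code.

```r
# Minimal sketch (illustrative, not the project's actual code): verify that
# the raw tuning results are available before running the extraction scripts.
algo <- "classif.J48"   # or "classif.rpart"
results.dir <- file.path("data", "hptuning_full_space", algo, "results")

if (!dir.exists(results.dir)) {
  stop("No tuning results found in '", results.dir, "'.\n",
       "Please download the raw results and run the extraction scripts first.")
}
```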
All extraction scripts require the algorithm's name as a parameter (<algorithm.name>). The scripts can be run in any order, but all of them must be executed. The files they generate will later be read, aggregated into data.frame objects, and used by the automated code.
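As a rough illustration of this aggregation step (the '.csv' pattern, file layout, and column contents are assumptions, not the project's actual format), the extracted files could be combined into a single data.frame as follows:

```r
# Illustrative sketch: aggregate the files produced by the extraction scripts
# into one data.frame. The file pattern and recursive search are assumptions.
algo <- "classif.rpart"
base.dir <- file.path("data", "hptuning_full_space", algo)

files <- list.files(path = base.dir, pattern = "\\.csv$",
                    recursive = TRUE, full.names = TRUE)

# read each extracted file and bind all rows together
df.list <- lapply(files, read.csv, stringsAsFactors = FALSE)
full.df <- do.call(rbind, df.list)
str(full.df)
```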
cd scripts
Rscript 01_extractRepResults.R --algo=<algorithm.name> &
# examples:
# Rscript 01_extractRepResults.R --algo="classif.J48" &
# Rscript 01_extractRepResults.R --algo="classif.rpart" &
cd scripts
Rscript 02_extractOptPaths.R --algo=<algorithm.name> &
# examples:
# Rscript 02_extractOptPaths.R --algo="classif.J48" &
# Rscript 02_extractOptPaths.R --algo="classif.rpart" &
cd scripts
Rscript 03_extractModelStats.R --algo=<algorithm.name> &
# examples:
# Rscript 03_extractModelStats.R --algo="classif.J48" &
# Rscript 03_extractModelStats.R --algo="classif.rpart" &
The fANOVA marginal predictions are obtained from an external project [03]. Our script generates input files in the pattern required by the fANOVA Python script. To run it:
cd scripts
Rscript 04_createFanovaInputs.R --algo=<algorithm.name> &
# examples:
# Rscript 04_createFanovaInputs.R --algo="classif.J48" &
# Rscript 04_createFanovaInputs.R --algo="classif.rpart" &
The output will be placed in a folder named data/hptuning_full_space/<algorithm.name>/fanova_input, with one file per dataset. Provide these files to the external project, which will generate one corresponding file per dataset. These new files should then be placed in the data/hptuning_full_space/<algorithm.name>/fanova_output sub-directory.
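The snippet below is a small illustrative helper, assuming the input and output folders above and assuming the external project keeps the same file names (which may not hold in practice), to check that every fANOVA input file has a corresponding output file:

```r
# Illustrative sketch only: check that each fANOVA input file has a matching
# output file produced by the external fANOVA project. Identical file names
# in both folders is an assumption for illustration.
algo <- "classif.J48"
in.dir  <- file.path("data", "hptuning_full_space", algo, "fanova_input")
out.dir <- file.path("data", "hptuning_full_space", algo, "fanova_output")

missing <- setdiff(list.files(in.dir), list.files(out.dir))
if (length(missing) > 0) {
  warning("Missing fANOVA outputs for: ", paste(missing, collapse = ", "))
}
```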
To run the project, call it with the following command:
Rscript 01_mainAnalysis.R --algo=<algorithm.name> &
# examples:
# Rscript 01_mainAnalysis.R --algo="classif.rpart" &
# Rscript 01_mainAnalysis.R --algo="classif.J48" &
Meta-level results are independent and can be generated by:
Rscript 02_metaAnalysis.R &
Rafael Gomes Mantovani (rgmantovani@gmail.com / rafaelmantovani@utfpr.edu.br), Federal University of Technology - Paraná (UTFPR), Apucarana - PR, Brazil.
[01] Rafael Gomes Mantovani, Tomas Horvath, André L. D. Rossi, Ricardo Cerri, Sylvio Barbon Junior, Joaquin Vanschoren, André C. P. L. F. Carvalho. Better Trees: An empirical study on hyperparameter tuning of classification decision trees. Data Min Knowl Disc (2024). https://doi.org/10.1007/s10618-024-01002-5.
[02] B. Bischl, M. Lang, L. Kotthoff, J. Schiffner, J. Richter, E. Studerus, G. Casalicchio, Z. Jones. mlr: Machine Learning in R. Journal of Machine Learning Research, v.17, n.170, 2016, pp. 1-5.
[03] F. Hutter, H. Hoos, K. Leyton-Brown. An Efficient Approach for Assessing Hyperparameter Importance. In: Proceedings of the 31st International Conference on Machine Learning, ICML 2014, Beijing, China, 2014, pp. 754-762.