This repository is partially inspired by benchopt, but aims to be more R-user-friendly, easier to hack, and more statistically oriented in its benchmarking.
The source files are stored in R/, which contains three files:
datasets.R: each dataset (either simulated or real dataset) is defined as a function
For the simulation data, we consider
- different covariance structures of the design matrix
  - Independent
  - AR(1)
  - AR(1)+
  - Factor
- different data-generating models
  - Friedman
  - Checkerboard
  - Linear
  - Max
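As a rough illustration of how one such simulated dataset could be written as a function, the sketch below pairs an AR(1) Gaussian design with the classical Friedman mean function. The function names (`gen_ar1_design`, `friedman_mean`, `sim_friedman_ar1`), the default values, and the choice to apply the Friedman function directly to the Gaussian covariates are assumptions made for this example; the definitions in datasets.R may differ.

```r
# Hypothetical helper: n x p Gaussian design whose columns follow an AR(1)
# covariance, Sigma[i, j] = rho^|i - j|.
gen_ar1_design <- function(n, p, rho = 0.5) {
  Sigma <- rho^abs(outer(seq_len(p), seq_len(p), "-"))
  z <- matrix(rnorm(n * p), n, p)
  # chol() returns upper-triangular R with t(R) %*% R = Sigma,
  # so the rows of z %*% R have covariance Sigma.
  z %*% chol(Sigma)
}

# Classical Friedman mean function; it only uses the first five columns.
friedman_mean <- function(x) {
  10 * sin(pi * x[, 1] * x[, 2]) + 20 * (x[, 3] - 0.5)^2 +
    10 * x[, 4] + 5 * x[, 5]
}

# Hypothetical dataset function: returns a list with design matrix and response.
sim_friedman_ar1 <- function(n = 200, p = 10, rho = 0.5, sigma = 1) {
  stopifnot(p >= 5)
  x <- gen_ar1_design(n, p, rho)
  y <- friedman_mean(x) + rnorm(n, sd = sigma)
  list(x = x, y = y)
}

set.seed(1)
dat <- sim_friedman_ar1(n = 100, p = 10)
str(dat)
```

Returning a plain list with `x` and `y` keeps every dataset function interchangeable, whether it simulates data or loads a real dataset.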
For the real datasets, we consider
Data | Description | n | p | URL |
---|---|---|---|---|
CASP | Physicochemical Properties of Protein Tertiary Structure | 45730 | 9 | 🔗 |
Energy | Appliances Energy Prediction | 19735 | 27 | 🔗 |
AirQuality | Air Quality | 9357 | 12 | 🔗 |
BiasCorrection | Bias correction of numerical prediction model temperature forecast | 7590 | 21 | 🔗 |
ElectricalStability | Electrical Grid Stability Simulated Data | 10000 | 12 | 🔗 |
GasTurbine | Gas Turbine CO and NOx Emission Data Set | 36733 | 10 | 🔗 |
ResidentialBuilding | Residential Building | 372 | 107 | 🔗 |
LungCancerGenomic | Lung cancer genomic data from the Chemores Cohort Study | 123 | 945 | 🔗 |
StructureActivity | Qualitative Structure Activity Relationships (triazines) | 186 | 60 | 🔗 |
BloodBrain | Blood Brain Barrier Data (data("BloodBrain", package = "caret")) | 208 | 134 | 🔗 |
GSE65904 | Whole-genome expression analysis of melanoma tumor biopsies from a population-based cohort. | 210 | 47323 | 🔗 |
methods.R: each method is defined as a function, together with its tuning parameters
Here are the methods we consider:
Method | Software |
---|---|
Bayesian Additive Regression Trees (BART) | |
Accelerated BART (XBART) | |
Random Forests | |
XGBoost | |
Multivariate Adaptive Regression Splines (MARS) | |
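To make the "method as a function with tuning parameters" idea concrete, here is a minimal sketch. It uses ridge regression in base R purely as a stand-in so that the example has no package dependencies; it is not one of the tree-based methods listed above, and the signature `(x, y, xtest, lambda)` is an assumed convention rather than the one used in methods.R.

```r
# Hypothetical method function: fit on (x, y), predict at xtest.
# `lambda` plays the role of the tuning parameter.
ridge_method <- function(x, y, xtest, lambda = 1) {
  x_mean <- colMeans(x)
  y_mean <- mean(y)
  xc <- sweep(x, 2, x_mean)  # center the covariates
  # Ridge coefficients on centered data (the intercept is not penalized).
  beta <- solve(crossprod(xc) + lambda * diag(ncol(x)),
                crossprod(xc, y - y_mean))
  drop(y_mean + sweep(xtest, 2, x_mean) %*% beta)
}

set.seed(1)
x <- matrix(rnorm(100 * 3), 100, 3)
y <- drop(2 + x %*% c(1, -2, 0.5)) + rnorm(100, sd = 0.1)
xtest <- matrix(rnorm(5 * 3), 5, 3)
ridge_method(x, y, xtest, lambda = 0.1)
```

With a shared signature like this, a tree-based method could be wrapped in the same way, with its own tuning parameters taking the place of `lambda`.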
evaluate.R: evaluates each method (with a particular parameter setting) on each dataset, and reports the CV error and running time (the criteria can be customized)
To display the results interactively, we use a Shiny app; the related files are in the folder benchmark-tree-regressions/
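The evaluation step can be sketched roughly as follows: a k-fold cross-validation loop that records the mean squared prediction error and the elapsed time of one method on one dataset. The helper name `evaluate_method`, its arguments, and the ordinary least squares stand-in method are assumptions for illustration only; evaluate.R may organize this differently.

```r
# Hypothetical helper: k-fold CV error and elapsed time for one method.
# `method` must accept (x, y, xtest, ...) and return predictions at xtest.
evaluate_method <- function(method, x, y, nfold = 5, seed = 1, ...) {
  set.seed(seed)
  n <- nrow(x)
  folds <- sample(rep(seq_len(nfold), length.out = n))  # random fold labels
  sq_err <- numeric(n)
  elapsed <- system.time({
    for (k in seq_len(nfold)) {
      test <- folds == k
      pred <- method(x[!test, , drop = FALSE], y[!test],
                     x[test, , drop = FALSE], ...)
      sq_err[test] <- (y[test] - pred)^2
    }
  })[["elapsed"]]
  c(cv_error = mean(sq_err), time = elapsed)
}

# Stand-in method (ordinary least squares), used only to keep the sketch
# free of package dependencies.
ols_method <- function(x, y, xtest) {
  fit <- lm.fit(cbind(1, x), y)
  drop(cbind(1, xtest) %*% fit$coefficients)
}

set.seed(42)
x <- matrix(rnorm(200 * 5), 200, 5)
y <- drop(x %*% c(1, -1, 0.5, 0, 0)) + rnorm(200, sd = 0.1)
res <- evaluate_method(ols_method, x, y)
print(res)
```

Swapping the squared error for another loss inside the loop is one way the evaluation criterion could be customized.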
Note that a typical Shiny app is server-based. With the help of shinylive, we deploy the Shiny website on GitHub Pages at https://hohoweiya.xyz/benchmark.tree.regressions/.
The repository uses renv to manage compatible R package versions. It is also advisable to use the same R version (4.1.2) to avoid compatibility issues.
Then one can install the R package dependencies via
> renv::restore() # Restores packages from the lockfile (if using renv)
Next, you can check the scripts run-local.R and run-action.R.
Due to limited computational resources, run-action.R takes less time to finish.
Specifically, the current local results were obtained on the Yale McCleary HPC cluster:
$ module load R/4.2.0-foss-2020b
$ R
> renv::restore() # restore packages from the lockfile
> system.time({source("run-local.R")})
user system elapsed
518757.236 6565.216 59712.296
The current repository mainly benchmarks various tree-based regression methods, but the structure is flexible enough for other tasks. If you find this structure useful, please feel free to use it as a template for your own benchmarking analysis via the link 🔗.