Benchmarking Tree Regressions (e.g., RF, BART, MARS, etc.) on various simulated and real datasets

szcf-weiya/benchmark.tree.regressions

Benchmarking Tree Regressions

This repository is partially inspired by benchopt, but aims to be more R-user-friendly, easier to hack, and more statistically oriented in its benchmarking.

🌲 What is the Repo Structure?

Source Files

The source files are stored in R/, which contains three files:

datasets.R: each dataset (either simulated or real) is defined as a function

For the simulated data, we consider

  • different covariance structure of the design matrix
    • Independent
    • AR(1)
    • AR(1)+
    • Factor
  • different data generating model
    • Friedman
    • Checkerboard
    • Linear
    • Max
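
As a rough illustration of how one covariance structure and one data generating model can be combined into a dataset function, here is a minimal sketch in base R. It assumes a Gaussian design with AR(1) covariance rho^|i-j| and the standard Friedman #1 mean function; the function names (`gen_x_ar1`, `friedman_mean`, `sim_friedman_ar1`) and default parameters are hypothetical and may differ from the definitions in datasets.R.

```r
# Hypothetical helper (not from datasets.R): n x p Gaussian design whose
# columns have AR(1) covariance, Sigma[i, j] = rho^|i - j|.
gen_x_ar1 <- function(n, p, rho = 0.5) {
  Sigma <- rho^abs(outer(seq_len(p), seq_len(p), "-"))
  Z <- matrix(rnorm(n * p), n, p)
  # chol() returns upper-triangular R with t(R) %*% R = Sigma,
  # so the rows of Z %*% R have covariance Sigma.
  Z %*% chol(Sigma)
}

# Standard Friedman #1 mean function; only the first five columns matter.
friedman_mean <- function(x) {
  10 * sin(pi * x[, 1] * x[, 2]) + 20 * (x[, 3] - 0.5)^2 +
    10 * x[, 4] + 5 * x[, 5]
}

# Assumed dataset layout: a list with design matrix x and response y.
sim_friedman_ar1 <- function(n = 200, p = 10, rho = 0.5, sigma = 1) {
  stopifnot(p >= 5)
  x <- gen_x_ar1(n, p, rho)
  y <- friedman_mean(x) + rnorm(n, sd = sigma)
  list(x = x, y = y)
}

set.seed(1)
dat <- sim_friedman_ar1()
str(dat)
```

Swapping `gen_x_ar1` for another design generator, or `friedman_mean` for another mean function, gives the other combinations in the list above.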

For the real datasets, we consider

Data | Description | n | p | URL
CASP | Physicochemical Properties of Protein Tertiary Structure | 45730 | 9 | 🔗
Energy | Appliances Energy Prediction | 19735 | 27 | 🔗
AirQuality | Air Quality | 9357 | 12 | 🔗
BiasCorrection | Bias correction of numerical prediction model temperature forecast | 7590 | 21 | 🔗
ElectricalStability | Electrical Grid Stability Simulated Data | 10000 | 12 | 🔗
GasTurbine | Gas Turbine CO and NOx Emission Data Set | 36733 | 10 | 🔗
ResidentialBuilding | Residential Building | 372 | 107 | 🔗
LungCancerGenomic | Lung cancer genomic data from the Chemores Cohort Study | 123 | 945 | 🔗
StructureActivity | Qualitative Structure Activity Relationships (triazines) | 186 | 60 | 🔗
BloodBrain | Blood Brain Barrier Data, loaded via data("BloodBrain", package = "caret") | 208 | 134 | 🔗
GSE65904 | Whole-genome expression analysis of melanoma tumor biopsies from a population-based cohort | 210 | 47323 | 🔗

methods.R: each method is defined as a function, together with its tuning parameters

Here are the methods we consider:

Method Software
Bayesian Additive Regression Trees (BART)
Accelerated BART (XBART)
Random Forests
XGBoost
Multivariate Adaptive Regression Splines (MARS)
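
To make the "method as a function with tuning parameters" idea concrete, the following is a minimal sketch of what such a wrapper might look like for Random Forests. The name `rf_method`, its argument layout (training `x`, `y`, and test `xtest`), and the default values are assumptions for illustration, not the definitions in methods.R. It relies on the randomForest package, so the usage part is skipped if that package is not installed.

```r
# Hypothetical wrapper (not from methods.R): fit on (x, y), predict on xtest.
# ntree and mtry are the tuning parameters exposed to the caller.
rf_method <- function(x, y, xtest, ntree = 500,
                      mtry = max(floor(ncol(x) / 3), 1)) {
  fit <- randomForest::randomForest(x = x, y = y, ntree = ntree, mtry = mtry)
  as.numeric(predict(fit, newdata = xtest))
}

# Usage on toy data, only if randomForest is available.
if (requireNamespace("randomForest", quietly = TRUE)) {
  set.seed(1)
  x <- matrix(runif(200 * 5), 200, 5)
  colnames(x) <- paste0("x", 1:5)
  y <- 10 * x[, 1] + rnorm(200)
  pred <- rf_method(x[1:150, ], y[1:150], x[151:200, ], ntree = 100)
  print(mean((y[151:200] - pred)^2))  # test mean squared error
}
```

Giving every method the same signature lets an evaluation routine loop over methods and parameter settings without special cases.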

evaluate.R: evaluates each method (with a particular parameter setting) on each dataset, and reports the CV error and running time (the criteria can be customized)
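
The evaluation step can be sketched as a K-fold cross-validation loop wrapped in a timer. This is only an illustration in base R: the function name `evaluate_cv`, the use of mean squared error as the CV criterion, and the least-squares stand-in method `ols` are all assumptions, not the code in evaluate.R.

```r
# Hypothetical evaluator (not from evaluate.R): K-fold CV error and elapsed
# time for one method. `method` is any function(x, y, xtest, ...) that
# returns predictions for the rows of xtest.
evaluate_cv <- function(method, x, y, nfold = 5, ...) {
  n <- nrow(x)
  folds <- sample(rep(seq_len(nfold), length.out = n))  # random fold labels
  elapsed <- system.time({
    errs <- vapply(seq_len(nfold), function(k) {
      test <- folds == k
      pred <- method(x[!test, , drop = FALSE], y[!test],
                     x[test, , drop = FALSE], ...)
      mean((y[test] - pred)^2)  # assumed criterion: mean squared error
    }, numeric(1))
  })[["elapsed"]]
  c(cv.error = mean(errs), runtime = elapsed)
}

# Stand-in method for the demo: ordinary least squares with an intercept.
ols <- function(x, y, xtest) {
  beta <- coef(lm.fit(cbind(1, x), y))
  drop(cbind(1, xtest) %*% beta)
}

set.seed(42)
x <- matrix(rnorm(100 * 3), 100, 3)
y <- x[, 1] - 2 * x[, 2] + rnorm(100, sd = 0.1)
res <- evaluate_cv(ols, x, y, nfold = 5)
print(res)
```

Replacing the squared-error line with another loss is one way the criteria could be customized.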

Shiny App

To display the results interactively, we use a Shiny app; the related files are in the folder benchmark-tree-regressions/

Note that a typical Shiny app is server-based. With the help of shinylive, we deploy the Shiny website on GitHub Pages at https://hohoweiya.xyz/benchmark.tree.regressions/.

🚀 How to Run Locally?

The repository uses renv to manage compatible R package versions. To be safe, it is also better to use the same R version, 4.1.2.

Then one can install the R package dependencies via

> renv::restore()  # Restores packages from the lockfile (if using renv)

Next, you can check the scripts run-local.R and run-action.R.

Due to limited computational resources, run-action.R takes less time to finish.

Yale McCleary HPC

Specifically, the current local results were run on the Yale McCleary HPC.

$ module load R/4.2.0-foss-2020b
$ R
> renv::restore() # restore packages from the lockfile
> system.time({source("run-local.R")})
      user     system    elapsed
518757.236   6565.216  59712.296

📓 Use the Repo as a Template

The current repository mainly benchmarks various tree-based regressions, but the structure is flexible enough for other tasks. If you find this structure useful, please feel free to use it as a template for your own benchmarking analysis via the link 🔗.
