🧬 Phenotype Prediction Using Tree-Based Models in the UK Biobank

🚀 Motivation

Predicting human phenotypes from genomic and environmental data holds significant promise for personalized medicine, risk stratification, and efficient allocation of healthcare resources. Although genome‐wide association studies (GWAS) have identified numerous relevant genetic variants, integrating these findings with demographic and behavioral data through advanced machine learning approaches can substantially improve prediction accuracy. This project benchmarks a variety of tree-based algorithms on the UK Biobank dataset to assess their predictive performance and interpretability.

📚 Overview

This repository accompanies the IEEE BIBM 2023 paper titled “Assessing Tree-Based Phenotype Prediction on the UK Biobank.” We evaluate a suite of tree ensembles (XGBoost, LightGBM, CatBoost, AdaBoost, Random Forest, and Extra Trees) alongside single decision trees and linear models. Performance is measured with AUC for binary traits and R² for continuous traits, and SHAP (SHapley Additive exPlanations) is used to attribute each prediction to the individual features that drive it.
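To make the idea behind SHAP concrete, the sketch below computes exact Shapley values for a single sample by enumerating every feature ordering. It is a toy illustration only: it does not use the SHAP library, `toy_model` is a hypothetical hand-written rule rather than a model from this repository, and replacing "absent" features with a single baseline value is a simplifying assumption. Enumeration is feasible here only because there are three features.

```python
from itertools import permutations


def shapley_values(predict, x, baseline):
    """Exact Shapley values for one sample, by enumerating feature orderings.

    Features that have not yet been "switched on" keep their baseline value.
    """
    n = len(x)
    totals = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        current = list(baseline)
        previous = predict(current)
        for j in order:
            current[j] = x[j]           # reveal feature j
            value = predict(current)
            totals[j] += value - previous  # marginal contribution of j
            previous = value
    return [t / len(orderings) for t in totals]


# Hypothetical tree-like rule over (variant dosage, age, sex).
# It is not fitted to any real data.
def toy_model(features):
    dosage, age, sex = features
    if dosage >= 1:
        return 0.6 if age >= 55 else 0.4
    return 0.2 if sex == 1 else 0.1


x = [2, 60, 1]          # sample to explain (made-up values)
baseline = [0, 50, 0]   # reference sample (made-up values)
phi = shapley_values(toy_model, x, baseline)

for name, value in zip(["dosage", "age", "sex"], phi):
    print(f"{name}: {value:+.3f}")
```

The attributions always sum to the difference between the prediction for `x` and the prediction for `baseline`, which is the property that makes Shapley-based explanations additive.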

🧬 Data

All genetic and phenotypic information originates from the UK Biobank, a comprehensive biomedical resource comprising over 500,000 participants who were aged 40–69 years at recruitment. We applied for and received access through the UK Biobank application portal. Phenotype selection and data preprocessing were facilitated by the Stanford Biobank Engine, which provides tools for exploring and filtering traits of interest.

⚙️ Methodology

Our predictive framework uses nine tree-based algorithms, ranging from simple decision trees to sophisticated gradient-boosting machines. Each model is trained on a combined feature set of genetic variants, demographic covariates (such as age and sex), and lifestyle factors. Hyperparameters are tuned with multi-objective optimization and five-fold cross-validation, so that each algorithm is compared under a tuned configuration rather than its defaults. To explore the trade-off between prediction accuracy and computational cost, we employ Random Feature Selection (RFS): we vary the number of randomly chosen genetic variants given to the model and record the effect on both predictive performance and runtime.
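The two mechanical pieces of this setup, drawing a random subset of variants and splitting samples into five folds, can be sketched with the standard library alone. Everything named here (`random_feature_selection`, `k_fold_indices`, the `variant_{i}` identifiers, the subset sizes, the sample count) is a hypothetical placeholder rather than code from this repository, and the multi-objective hyperparameter search itself is not shown.

```python
import random


def random_feature_selection(variant_ids, n_variants, covariates, seed=0):
    """Return the covariates plus a random subset of n_variants variant IDs."""
    rng = random.Random(seed)
    chosen = rng.sample(variant_ids, n_variants)  # sampling without replacement
    return list(covariates) + sorted(chosen)


def k_fold_indices(n_samples, n_folds=5, seed=0):
    """Yield (train_idx, val_idx) pairs for shuffled k-fold cross-validation."""
    rng = random.Random(seed)
    indices = list(range(n_samples))
    rng.shuffle(indices)
    # Striding gives folds whose sizes differ by at most one sample.
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for i, val_idx in enumerate(folds):
        train_idx = [j for k, fold in enumerate(folds) if k != i for j in fold]
        yield train_idx, val_idx


# Made-up identifiers and sizes, for illustration only.
variant_ids = [f"variant_{i}" for i in range(1000)]
covariates = ["age", "sex"]

# One feature list per subset size; covariates are always kept.
subsets = {
    k: random_feature_selection(variant_ids, k, covariates, seed=42)
    for k in (10, 100, 500)
}

splits = list(k_fold_indices(n_samples=103, n_folds=5, seed=42))
for fold, (train_idx, val_idx) in enumerate(splits):
    print(f"fold {fold}: {len(train_idx)} train / {len(val_idx)} validation")
```

In a sweep of this kind, each subset size would be paired with the same folds so that differences in score and runtime reflect the number of variants rather than the split.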

📈 Results

We report results for four representative binary and continuous phenotypes to illustrate key findings. Without hyperparameter tuning, gradient-boosting methods such as LightGBM and HGB (histogram-based gradient boosting) outperform linear approaches, while Random Forest is the strongest tree ensemble for binary outcomes. Hyperparameter optimization widens the gap, enabling tree-based models to surpass sparse linear methods like SNPnet, particularly for continuous traits. Incorporating age and sex as covariates yields additional gains in predictive accuracy, with CatBoost, LightGBM, and HGB emerging as the top performers overall.

Model	Phenotype	Metric	Score
XGBoost	Type 2 Diabetes	AUC	0.81
LightGBM	Hypertension	AUC	0.78
RandomForest	Smoking (ever vs never)	AUC	0.76
Decision Trees	Fasting Glucose	R²	0.23
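
For readers unfamiliar with the two metrics reported above, the following dependency-free sketch shows how each is defined. The labels and predictions are made-up numbers, not UK Biobank data, and the pairwise AUC loop is quadratic in the number of samples, so it is meant for illustration rather than for cohort-scale evaluation.

```python
def roc_auc(y_true, y_score):
    """AUC as the probability that a random positive outranks a random negative.

    A tie between a positive and a negative score counts as half a win.
    """
    positives = [s for y, s in zip(y_true, y_score) if y == 1]
    negatives = [s for y, s in zip(y_true, y_score) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Binary trait: made-up case/control labels and predicted risk scores.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(labels, scores)  # 3 of 4 positive/negative pairs are ordered correctly

# Continuous trait: made-up measurements and predictions.
measured = [3.0, -0.5, 2.0, 7.0]
predicted = [2.5, 0.0, 2.0, 8.0]
r2 = r_squared(measured, predicted)

print(f"AUC = {auc:.2f}, R² = {r2:.3f}")
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect ranking, while R² is the fraction of trait variance explained by the predictions, which is why the two columns of scores are not directly comparable.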

📝 Citation

If you use this work, please cite: Meléndez A, López C, Bonet D, Sant G, Marquès F, Rivas M, Mas Montserrat D, Abante J, Ioannidis AG. Assessing Tree-Based Phenotype Prediction on the UK Biobank. In: 2023 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), Istanbul, Turkey; 2023. p. 3804–3810. doi.

🔗 Relevant Links

Access the UK Biobank at https://www.ukbiobank.ac.uk and explore phenotypes via the Stanford Biobank Engine at https://biobankengine.stanford.edu.
