Single-cell-informed polygenic risk scoring
This repository contains the R analysis pipeline for scRNAseq-informed polygenic risk scoring. The pipeline integrates single-cell RNA sequencing (scRNAseq) data with polygenic risk scores (PRS) to predict clinical outcomes.
Ensure that R and the necessary packages are installed. The required packages include:
install.packages(c("data.table", "dplyr", "readxl", "lm.beta", "caret", "e1071", "pROC", "glmnet"))
If your packages are installed in a non-default directory, prepend that directory to the library path:
.libPaths(c("/data/Common_Folder/R/Single_cell_packages/", .libPaths()))
The pipeline follows these major steps:
- Load annotated GWAS genes.
- Load and filter differentially expressed genes (DEGs).
- Merge DEGs with GWAS genes.
- Define SNP column format.
- Load Parkinson's disease GWAS summary statistics.
- Merge SNPs and prepare PRSice2 input.
- Calculate PRS using PRSice2.
- Predict clinical outcomes using linear and logistic regression.
- Perform k-fold cross-validation.
- Apply regularization techniques to address overfitting.
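The data-preparation steps above (filter DEGs, merge with GWAS genes, define a SNP column, build the PRSice2 input) can be sketched roughly as follows. This is a minimal base-R illustration on toy data, not the repository's actual code: every column name, the `chr:pos` SNP format, the 0.05 threshold, and the output column set are assumptions made for the example.

```r
# Toy stand-ins for the real inputs. All names and values are hypothetical.
gwas_genes <- data.frame(
  gene = c("GENE_A", "GENE_B", "GENE_C", "GENE_D"),
  chr  = c(1L, 2L, 3L, 4L),
  pos  = c(1000L, 2000L, 3000L, 4000L),  # integers avoid "1e+05"-style IDs in paste()
  stringsAsFactors = FALSE
)
degs <- data.frame(
  gene      = c("GENE_A", "GENE_B", "GENE_C", "GENE_E"),
  p_val_adj = c(0.001, 0.03, 0.60, 0.004),
  stringsAsFactors = FALSE
)
sumstats <- data.frame(
  chr  = c(1L, 2L, 3L, 4L),
  pos  = c(1000L, 2000L, 3000L, 4000L),
  A1   = c("A", "C", "G", "T"),
  A2   = c("G", "T", "A", "C"),
  BETA = c(0.10, -0.05, 0.02, 0.08),
  P    = c(1e-8, 1e-4, 0.2, 1e-6),
  stringsAsFactors = FALSE
)

# 1. Keep significant DEGs (the 0.05 cutoff is illustrative).
sig_degs <- degs[degs$p_val_adj < 0.05, ]

# 2. Inner join: DEGs that are also annotated GWAS genes.
deg_gwas <- merge(sig_degs, gwas_genes, by = "gene")

# 3. Define the SNP column in an assumed "chr:pos" format on both tables.
deg_gwas$SNP <- paste(deg_gwas$chr, deg_gwas$pos, sep = ":")
sumstats$SNP <- paste(sumstats$chr, sumstats$pos, sep = ":")

# 4. Restrict the summary statistics to SNPs in DEG-overlapping genes.
prs_input <- merge(sumstats, deg_gwas["SNP"], by = "SNP")
prs_input <- prs_input[, c("SNP", "A1", "A2", "BETA", "P")]

# 5. Write a tab-delimited file for the PRS step (temporary path for the demo).
out <- tempfile(fileext = ".txt")
write.table(prs_input, out, sep = "\t", quote = FALSE, row.names = FALSE)
```

With real data, the join keys and the SNP identifier format must match those used in the genotype files that PRSice2 reads; adjust the column selection to whatever your PRSice2 invocation expects.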
To run the analysis, follow these steps:
- Ensure all required input files are available in the specified directories.
- Set the correct paths for input and output files within the script.
- Execute the R script in R or RStudio:
source("path/to/your/script.R")
- Annotated GWAS genes:
/data/dehestani/scPRS_analysis/Annotated_PD_GWAS/gwasgenes
- DEG list:
path/to/DEGs
- PD GWAS summary statistics:
/data/dehestani/scPRS_analysis/Sum_stats/Chang2017_GWAS.tab
- PRSice2 output:
/data/dehestani/scPRS_analysis/PRsice2_output/bestPRS
- Clinical outcomes:
/data/dehestani/scPRS_analysis/Clinical_outcomes/clinical_outcomes.csv
- Covariates file:
/data/dehestani/scPRS_analysis/Clinical_outcomes/Covariates
- PRSice2 input summary statistics:
/data/dehestani/scPRS_analysis/PRsice2_input_sumstats/ODC_test.txt
- Various regression models' outputs and cross-validation results.
- ROC curves and AUC values for model performance evaluation.
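As a rough illustration of how a case/control model and its AUC might be produced, the sketch below fits a logistic regression on simulated data and computes the AUC with the rank-based (Mann-Whitney) formula. The data, the variable names (`prs`, `age`, `status`), and the helper `auc_rank` are hypothetical. The repository lists `pROC` among its packages; base R is used here only to keep the example free of dependencies.

```r
set.seed(1)

# Simulated cohort: a standardized PRS, one covariate, and case/control status.
n   <- 200
prs <- rnorm(n)
age <- rnorm(n, mean = 65, sd = 8)
status <- rbinom(n, size = 1, prob = plogis(0.8 * prs))
dat <- data.frame(status = status, prs = prs, age = age)

# Logistic regression of status on PRS plus covariate.
fit  <- glm(status ~ prs + age, data = dat, family = binomial())
pred <- predict(fit, type = "response")  # predicted case probabilities

# AUC as the probability that a random case scores higher than a random control.
# Hypothetical helper, not part of the repository.
auc_rank <- function(score, label) {
  r  <- rank(score)
  n1 <- sum(label == 1)
  n0 <- sum(label == 0)
  (sum(r[label == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

auc <- auc_rank(pred, dat$status)
```

Note that this AUC is computed on the same data used to fit the model, so it is optimistic; a held-out or cross-validated estimate is the fairer measure of performance.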
The main script contains several sections, each performing specific tasks:
- Loading and Preprocessing Data:
  - Load GWAS genes and DEG list.
  - Filter and merge data.
  - Define SNP column format.
- PRS Calculation:
  - Merge GWAS summary statistics with DEGs.
  - Prepare input for PRSice2.
- Clinical Outcome Prediction:
  - Linear regression models for UPDRS-III, MoCA, and BDI-II.
  - Logistic regression for case/control prediction.
- Cross-validation and Regularization:
  - K-fold cross-validation for regression models.
  - Lasso and Ridge regularization.
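The k-fold cross-validation step for a linear model could look something like the sketch below. It is a hand-rolled base-R version on simulated data, offered only to show the mechanics; the repository lists `caret`, which may be what the actual script uses. The outcome `score`, the predictors, and the choice of `k = 5` are assumptions.

```r
set.seed(42)

# Simulated data: a continuous clinical score driven by PRS and age.
n   <- 120
dat <- data.frame(prs = rnorm(n), age = rnorm(n, mean = 65, sd = 8))
dat$score <- 20 + 3 * dat$prs + 0.1 * dat$age + rnorm(n, sd = 2)

# Assign each row to one of k folds at random, with balanced fold sizes.
k    <- 5
fold <- sample(rep(seq_len(k), length.out = n))

# For each fold: fit on the other k-1 folds, evaluate RMSE on the held-out fold.
rmse <- vapply(seq_len(k), function(i) {
  train <- dat[fold != i, ]
  test  <- dat[fold == i, ]
  fit   <- lm(score ~ prs + age, data = train)
  sqrt(mean((test$score - predict(fit, newdata = test))^2))
}, numeric(1))

cv_rmse <- mean(rmse)  # average out-of-fold error
```

Comparing `cv_rmse` with the in-sample error of a model fitted on all rows gives a quick indication of overfitting, which is the motivation for the regularization step.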
For questions or issues, please contact Mo Dehestani at smdehestani@gmail.com.