GitHub - dmhenke/BioPrimeLASSO: Bio-primed LASSO feature selection for biomarker discovery

Bio-primed machine learning to enhance discovery of relevant biomarkers

Introduction

Precision medicine relies on identifying reliable biomarkers of gene dependencies in order to tailor therapeutic strategies to individual patients. High-throughput technologies offer unprecedented opportunities to explore molecular disease mechanisms, but they also pose challenges because of the high dimensionality of the data and the collinearity among features. Traditional statistical methods often fall short in this setting, which calls for computational approaches that can exploit large-scale bioinformatics data. Here, we introduce a machine learning approach that extends the Least Absolute Shrinkage and Selection Operator (LASSO) regression framework to incorporate biological knowledge, such as protein-protein interaction databases, into the regularization process. This bio-primed approach prioritizes variables that are both statistically significant and biologically relevant. Applying our method to multiple dependency datasets, we identified biomarkers that traditional methods overlooked. Our biologically informed LASSO method effectively identifies relevant biomarkers from high-dimensional collinear data, bridging the gap between statistical rigor and biological insight. The method holds promise for advancing personalized medicine by uncovering novel therapeutic targets and clarifying the interplay of genetic and molecular factors in disease.
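The idea of feeding prior knowledge into the regularization can be written as a weighted LASSO objective. The formula below is a generic sketch and not necessarily the exact formulation used by BioPrimeLASSO; the per-feature weights $w_j$, and the way they would be derived from interaction scores, are assumptions made for illustration.

```latex
\hat{\beta} \;=\; \arg\min_{\beta} \; \frac{1}{2n} \lVert y - X\beta \rVert_2^2 \;+\; \lambda \sum_{j=1}^{p} w_j \, \lvert \beta_j \rvert
```

Setting every $w_j = 1$ recovers the standard LASSO. Giving a smaller $w_j$ to features with strong prior biological support penalizes them less, so they are more likely to be selected when several collinear features explain the response equally well.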

Reproducibility

Analysis code to reproduce the results described in our manuscript can be found in the Reproducibility folder of this repository.

R Package Walkthrough

1) Installation

The BioPrimeLASSO R package requires the R packages glmnet and ggplot2 to be installed.

# Install devtools, then install BioPrimeLASSO from GitHub and load it
install.packages("devtools")
devtools::install_github("dmhenke/BioPrimeLASSO")
library(BioPrimeLASSO)

2) Load toy data (total size ~20 MB)

In this toy example, we use BioPrimeLASSO to discover copy number biomarkers for EGFR dependency. BioPrimeLASSO also uses protein-protein interaction information from STRING DB. Please download the following three files:

Copy number variation (cnv_EGFR.tsv)
Dependency data (demeter2_EGFR.tsv)
Protein-protein interaction network (ppi_w_symbols_EGFR.tsv)

# Read the three tab-separated input files from the working directory
cnv <- read.csv("./cnv_EGFR.tsv", sep = "\t", header = TRUE)
ppi <- read.csv("./ppi_w_symbols_EGFR.tsv", sep = "\t", header = TRUE)
demeter2 <- read.csv("./demeter2_EGFR.tsv", sep = "\t", header = TRUE)
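The later steps index these tables by row name, so it is worth confirming that the cell-line identifiers ended up in `rownames()` after loading. The sketch below uses tiny made-up tables (the cell-line names and values are hypothetical, and the layout of the real files is an assumption) to show the two situations in which `read.csv` yields row names: automatically when the header has one fewer field than the data rows, or explicitly via `row.names = 1`.

```r
# Toy stand-in for a dependency table; the real file layout is an assumption.
tmp <- tempfile(fileext = ".tsv")
writeLines(c(
  "EGFR\tKRAS",               # header has one fewer field than the data rows
  "CELL_A\t-0.8\t0.1",
  "CELL_B\t-0.2\t-0.5"), tmp)

# The first column is used as row names automatically in this case
toy <- read.csv(tmp, sep = "\t", header = TRUE)
rownames(toy)
colnames(toy)

# When the header names every column, request row names explicitly
tmp2 <- tempfile(fileext = ".tsv")
writeLines(c(
  "cell_line\tEGFR\tKRAS",
  "CELL_A\t-0.8\t0.1",
  "CELL_B\t-0.2\t-0.5"), tmp2)
toy2 <- read.csv(tmp2, sep = "\t", header = TRUE, row.names = 1)

# Both routes give the same identifiers
stopifnot(identical(rownames(toy), rownames(toy2)))
```

If `rownames(cnv)` or `rownames(demeter2)` show only `"1", "2", ...` after loading, the identifiers are probably still in the first column and `row.names = 1` may be needed.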

2.1) Load supplemental information (optional)

Next, we use the biomaRt R package to retrieve supplemental information for each gene, including its genomic location.

library(biomaRt)

# Connect to the human gene annotation in Ensembl
mart <- useDataset("hsapiens_gene_ensembl", useMart("ensembl"))
gene_info <- getBM(
  attributes = c("chromosome_name", "start_position", "hgnc_symbol"),
  filters = "hgnc_symbol",
  values = colnames(cnv),
  mart = mart)

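# Keep genes on autosomes (chromosomes 1-22)
chrs <- as.character(1:22)
gene_info <- gene_info[gene_info$chromosome_name %in% chrs, ]
# Keep genes whose symbol maps to exactly one annotation record
uniq <- names(which(table(gene_info$hgnc_symbol) == 1))
gene_info <- gene_info[gene_info$hgnc_symbol %in% uniq, ]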
gene_info$chromosome_name <- factor(
  gene_info$chromosome_name, levels = chrs)

3) Define gene of interest: EGFR

GoI <- "EGFR"

4) Setup data objects for analysis

# Dependency score resource: demeter2
y <- demeter2[,GoI]
names(y) <- rownames(demeter2)

# Identify 'omic information to test against dependency score: cnv
X_omic <- cnv

## Refine population to overlapping cell lines
ok_cells <- intersect(names(y), rownames(X_omic))
X_omic_OK  <- X_omic[ok_cells, ]
y_ok <- y[ok_cells]

## Remove features without variance ####
X_omic_OK <- X_omic_OK[, apply(X_omic_OK, 2, var) > 0]

### Generate scores
# Format: colnames(network) <- c("combined_score","gene1","gene2")
scores <- get_scores(gene=GoI, network=ppi)
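The alignment and filtering steps above can be tried on a small made-up example. This is a sketch on hypothetical data (the cell-line and gene names are invented), and it adds two defensive choices that are not in the walkthrough: `drop = FALSE`, so that a single surviving row or column still yields a matrix, and `na.rm = TRUE`, so that missing values do not turn the variance filter into `NA`.

```r
set.seed(1)

# Hypothetical omic matrix: 4 cell lines x 3 genes, with G3 constant
X <- matrix(rnorm(12), nrow = 4,
            dimnames = list(paste0("CELL_", 1:4), c("G1", "G2", "G3")))
X[, "G3"] <- 1

# Hypothetical dependency scores; CELL_9 has no omic data, CELL_1 has no score
y <- c(CELL_2 = 0.1, CELL_3 = -0.4, CELL_4 = 0.3, CELL_9 = 1.0)

# Restrict both objects to the shared cell lines, in the same order
ok <- intersect(names(y), rownames(X))
X_ok <- X[ok, , drop = FALSE]
y_ok <- y[ok]

# Drop features without variance among the retained cell lines
keep <- apply(X_ok, 2, var, na.rm = TRUE) > 0
X_ok <- X_ok[, keep, drop = FALSE]

dim(X_ok)
colnames(X_ok)
```

Filtering after the intersection matters: a feature can vary across all cell lines yet be constant within the overlapping subset.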

5) Run BioPrimeLASSO

results_omic <- bplasso(
  scale(X_omic_OK), y_ok, scores,
  n_folds = 10, phi_range = seq(0, 1, length.out = 30))

# Add Pearson correlation: cor2score
results_omic$cor2score <- cor(
  X_omic_OK, y_ok,
  use = "pairwise.complete.obs")[,1]

# Save results
file_results <- paste0("./",GoI,"_demeter2_CNV.RData")
save(results_omic,file = file_results)
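The model is fitted on `scale(X_omic_OK)` while the Pearson correlation is computed on the unscaled matrix. That is consistent, because correlation is unaffected by centering and scaling of the features. The following sketch on simulated data (the matrix, gene names, and response are all made up) illustrates this and shows that `cor(X, y)[, 1]` returns one named value per feature.

```r
set.seed(42)

# Simulated feature matrix (10 samples x 4 genes) and response
X <- matrix(rnorm(40), nrow = 10,
            dimnames = list(NULL, paste0("G", 1:4)))
y <- rnorm(10)

# cor() returns a 4 x 1 matrix; [, 1] turns it into a named vector
r_raw    <- cor(X, y, use = "pairwise.complete.obs")[, 1]
r_scaled <- cor(scale(X), y, use = "pairwise.complete.obs")[, 1]

names(r_raw)

# Scaling the features leaves the correlations unchanged
stopifnot(isTRUE(all.equal(r_raw, r_scaled)))
```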

6) Visualize results

## Correlation of Dependency score and CNV for each gene overlaying bio-primed LASSO & baseline LASSO hits
plot_manhattan(gene=GoI,
  resIn=file_results,
  subplotChr=11,
  dependency=demeter2,
  gene_info=gene_info,
  dir_save="./")

Data

For the full analysis and to reproduce the results in our manuscript, please use the following files (total size ~2 GB):

Protein-protein interaction network (ppi_w_symbols.tsv)
Copy number variation (cnv.tsv)
RNA expression (rna.tsv)
Demeter2 dependency data (demeter2.tsv)
Chronos dependency data (chronos.tsv)

These files were originally downloaded from the DepMap web portal (release 22Q2) and the STRING DB website.

License

BioPrimeLASSO is licensed under the GNU General Public License v3 (GPL-3).
