---
title: "mixOmics analysis of PSC patients"
author: "Ghada Nouairia, William Wu"
date: "2024-11-15"
output: 
  html_document:
    toc: TRUE
    toc_float: TRUE
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

# Background

The N-integration framework (DIABLO) from the mixOmics package is used to integrate high-throughput miRNA, protein, and metabolite data. Each data type forms one block, and the blocks are matched on the same samples. The algorithm is block PLS-DA: *block* refers to the multiple data layers, PLS (partial least squares) is a projection method that, like PCA, summarises the data in a few components but chooses them to covary with the outcome, and DA (discriminant analysis) indicates that the dependent variable is categorical.
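
As a rough illustration of what the blocks look like, the sketch below builds three small matrices that share the same sample rows but have different features. The dimensions, names, and random values are made up for illustration and are not the study data.

```r
# Hypothetical toy blocks: same samples (rows), different features (columns)
set.seed(1)
n_samples <- 6
sample_ids <- paste0("patient_", seq_len(n_samples))

make_block <- function(n_features, prefix) {
  m <- matrix(rnorm(n_samples * n_features), nrow = n_samples)
  dimnames(m) <- list(sample_ids, paste0(prefix, "_", seq_len(n_features)))
  m
}

X_toy <- list(miRNA = make_block(5, "mir"),
              proteins = make_block(4, "prot"),
              metabolites = make_block(3, "met"))

# Categorical outcome, one value per sample
Y_toy <- factor(c("no", "yes", "no", "yes", "no", "yes"))

# Every block must have the same rows, in the same order as the outcome
same_rows <- all(sapply(X_toy, function(b) identical(rownames(b), sample_ids)))
sapply(X_toy, dim)
```

The blocks cannot be stacked into one array because they have different numbers of columns, which is why they are passed around as a named list.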

The data is divided into a training and a testing subset, which are used to create and validate the model respectively. The outcome variable is categorical. To visualise the results we plot the grouped individuals and the relationships between the variables.

The parameters that are varied and tuned are described under **Choose parameters**. They consist of the weight of the design matrix, the number of components, and the number of variables to keep.

# Analysis

## Read the libraries

```{r read_libraries}
library(tidyverse)
library(mixOmics)
library(caret)
```

## Load the data

The omics data comes in a uniform format: each row represents a patient/sample and each column a compound/feature/variable, except the first column, which contains the patient IDs.

The metadata has been cleaned for the variables of interest so that they are represented by new binary columns. See the preprocessing script for details.
```{r load_data}
metabolites <- read_csv("results/metabolite_preprocessed.csv")
proteins <- read_csv("results/protein_preprocessed.csv")
miRNA <- read_csv("results/miRNA_preprocessed.csv")
metadata <- read_csv("results/metadata_preprocessed.csv")
```

## Prepare the data

We prepare the data by selecting the PSC patients (n = 33) and filtering out near-zero-variance variables from the miRNA and metabolite data using nearZeroVar(). Thereafter we split the data into a training and a testing set.
```{r prepare_data}
# Assigning test patients by stratification to ensure both outcomes are represented
test_sample <- c(1, 2, 8, 9)

miRNA <- miRNA[2:ncol(miRNA)] %>% 
         dplyr::slice(1:33)
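nzv_miRNA <- caret::nearZeroVar(miRNA, uniqueCut = 30) # flags 1920 miRNA (mature and hairpin)
if (length(nzv_miRNA) > 0) miRNA <- miRNA[, -nzv_miRNA] # guard: -integer(0) would drop every column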
miRNA_train <- miRNA %>% 
  dplyr::slice(-test_sample) %>% 
  as.matrix
miRNA_test <- miRNA %>% 
  dplyr::slice(test_sample) %>% 
  as.matrix
proteins <- proteins[2:ncol(proteins)] %>% 
  dplyr::slice(1:33)
proteins_train <- proteins %>% 
  dplyr::slice(-test_sample) %>% 
  as.matrix
proteins_test <- proteins %>% 
  dplyr::slice(test_sample) %>% 
  as.matrix
metabolites <- metabolites[2:ncol(metabolites)] %>% 
  log2() %>% 
  dplyr::slice(1:33)
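nzv_metabolites <- caret::nearZeroVar(metabolites, uniqueCut = 30) # flags 29 metabolites
if (length(nzv_metabolites) > 0) metabolites <- metabolites[, -nzv_metabolites] # guard: -integer(0) would drop every column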
metabolites_train <- metabolites %>% 
  dplyr::slice(-test_sample) %>% 
  as.matrix
metabolites_test <- metabolites %>% 
  dplyr::slice(test_sample) %>% 
  as.matrix

metadata <- metadata %>% 
  dplyr::slice_head(n = 33)
metadata_train <- metadata %>% 
  dplyr::slice(-test_sample)
metadata_test <- metadata %>% 
  dplyr::slice(test_sample)

# Assigning the block and outcome variable for train and test
X_train <- list(miRNA = miRNA_train,
  proteins = proteins_train,
  metabolites = metabolites_train)

cca_train <- metadata_train$cca_binary
ibd_train <- metadata_train$ibd_binary

X_test <- list(miRNA = miRNA_test,
  proteins = proteins_test,
  metabolites = metabolites_test)

cca_test <- metadata_test$cca_binary
ibd_test <- metadata_test$ibd_binary
```
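
The test indices above are hard-coded. As a sketch of how a stratified test set could be drawn instead, the base R helper below samples a fixed number of rows from each outcome class. The `outcome` vector and its class sizes are made up, and `stratified_test_idx()` is a hypothetical helper, not part of the original analysis.

```r
# Sketch: draw a stratified test set in base R.
# `outcome` is a hypothetical binary vector standing in for metadata$cca_binary.
set.seed(42)
outcome <- c(rep(0, 25), rep(1, 8))   # made-up class sizes, n = 33

stratified_test_idx <- function(y, n_per_class = 2) {
  idx_by_class <- split(seq_along(y), y)
  # Sample row indices within each class so both outcomes are represented
  picked <- lapply(idx_by_class, function(idx) {
    idx[sample.int(length(idx), n_per_class)]
  })
  sort(unlist(picked, use.names = FALSE))
}

test_idx <- stratified_test_idx(outcome)
table(outcome[test_idx])
```

Setting a seed keeps the split reproducible, which matters with so few samples because a different split can change the test results noticeably.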

## Choose parameters {.tabset}

### Design matrix

We start by investigating the pair-wise correlation between the different layers in our block using the PLS algorithm as regression analysis. These values can later be compared to the results from plotDiablo().
```{r correlation_analysis}
# miRNA x Proteins
res1.pls <- pls(X_train$miRNA, X_train$proteins, ncomp = 1)
cor(res1.pls$variates$X, res1.pls$variates$Y)
# miRNA x Metabolites
res2.pls <- pls(X_train$miRNA, X_train$metabolites, ncomp = 1)
cor(res2.pls$variates$X, res2.pls$variates$Y)
# Proteins x Metabolites
res3.pls <- pls(X_train$proteins, X_train$metabolites, ncomp = 1)
cor(res3.pls$variates$X, res3.pls$variates$Y)
```
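
To make clear what the correlation between the first variates measures, here is a minimal base R sketch on simulated data. It assumes that the first pair of PLS weight vectors can be taken as the leading singular vectors of the cross-product of the scaled blocks. pls() uses an iterative algorithm, so this is an approximation for illustration and not the mixOmics implementation. The matrices `A` and `B` are made up.

```r
# Sketch: first-component PLS correlation between two blocks via SVD.
set.seed(1)
n <- 30
latent <- rnorm(n)                       # shared signal (made up)
A <- cbind(latent + rnorm(n, sd = 0.5), rnorm(n), rnorm(n))
B <- cbind(rnorm(n), latent + rnorm(n, sd = 0.5))

first_pls_cor <- function(X, Y) {
  X <- scale(X)                          # centre and scale both blocks
  Y <- scale(Y)
  s <- svd(crossprod(X, Y), nu = 1, nv = 1)
  t_x <- X %*% s$u                       # X variate (component 1)
  t_y <- Y %*% s$v                       # Y variate (component 1)
  as.numeric(cor(t_x, t_y))
}

first_pls_cor(A, B)
```

A value close to 1 means that the two blocks share a strong common component, which is the information used when choosing the design weight.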

**Optimize**: Thereafter we create the design matrix, where a lower weight gives higher prediction accuracy while a higher weight extracts the correlation structure between the blocks more precisely.
```{r create_design_matrix}
design <- matrix(0.1, 
  nrow = length(X_train), 
  ncol = length(X_train), 
  dimnames = list(names(X_train), names(X_train)))

diag(design) <- 0
```
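
The same construction can be wrapped in a small helper to compare different weights. `make_design()` is a hypothetical convenience function written for this sketch, and the two weights are example values only.

```r
# Sketch: a helper that builds a design matrix for arbitrary block names.
make_design <- function(block_names, weight = 0.1) {
  stopifnot(weight >= 0, weight <= 1)
  d <- matrix(weight, nrow = length(block_names), ncol = length(block_names),
              dimnames = list(block_names, block_names))
  diag(d) <- 0   # a block is not linked to itself
  d
}

blocks <- c("miRNA", "proteins", "metabolites")
design_low  <- make_design(blocks, weight = 0.1)  # favours discrimination
design_full <- make_design(blocks, weight = 1)    # favours between-block correlation
design_low
```

The matrix is symmetric with a zero diagonal, so only the off-diagonal weight needs to be chosen.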

### Number of components

To find the optimal number of components I create an initial model with block.plsda() and evaluate it with perf(). The cross-validation is set to leave-one-out since the sample size is small.
```{r calculate_number_of_components}
# Finding the global performance of the model without variable selection
diablo.initial <- block.plsda(X_train, cca_train, ncomp = 5, design = design)

# Leave-one-out cross-validation of the model (repeats add nothing with LOO, the folds are deterministic)
perf.diablo.initial <- perf(diablo.initial, validation = "loo", nrepeat = 10)

# Plotting the error rate
# pdf(file = "results/plots/error_rates_diablo_model.pdf")
plot(perf.diablo.initial)

# Getting the number of components according to weighted vote
perf.diablo.initial$choice.ncomp$WeightedVote

# pdf(file = "results/plots/Figure_3_CCA/correlation_by_ncomp_cca.pdf")
for (i in 1:5) {plotDiablo(diablo.initial, ncomp = i)}
```
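
To illustrate what leave-one-out validation with a centroid distance does, the sketch below runs it in base R on simulated two-dimensional scores. The scores stand in for component scores and are made up. Unlike perf(), this sketch does not refit the components in each fold, so it only shows the principle.

```r
# Sketch: leave-one-out cross-validation with a nearest-centroid rule.
set.seed(7)
n_per_class <- 10
scores <- rbind(matrix(rnorm(n_per_class * 2, mean = 0), ncol = 2),
                matrix(rnorm(n_per_class * 2, mean = 3), ncol = 2))
labels <- factor(rep(c("no", "yes"), each = n_per_class))

predict_centroid <- function(train_x, train_y, new_x) {
  # Class centroids from the training samples only (features x classes)
  centroids <- sapply(levels(train_y), function(cl) {
    colMeans(train_x[train_y == cl, , drop = FALSE])
  })
  # Squared Euclidean distance from the new sample to each centroid
  d <- colSums((centroids - as.numeric(new_x))^2)
  names(which.min(d))
}

# Each sample is held out once and predicted from all the others
loo_pred <- vapply(seq_len(nrow(scores)), function(i) {
  predict_centroid(scores[-i, , drop = FALSE], labels[-i], scores[i, ])
}, character(1))

loo_error <- mean(loo_pred != as.character(labels))
loo_error
```

Because every sample is held out exactly once, repeating the procedure gives the same folds and the same error.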

**Optimize**: Since the outcome variable is binary, the optimal number of components will usually be 1, as it is usually the number of distinct outcomes minus 1. Plotting needs at least 2 components, so the value suggested by perf() is raised to 2 whenever it is lower.
```{r assign_number_of_components}
ncomp <- max(2, perf.diablo.initial$choice.ncomp$WeightedVote["Overall.BER", "centroids.dist"])
```

### Number of variables

To find the optimal number of variables I create a grid with the numbers of variables to test for each omics layer, which can be adjusted depending on how many features each layer contains. Thereafter I tune the model with tune.block.splsda(), which gives an optimal number of variables per layer and component.

**Optimize**: The most important parameters to tweak are test.keepX (start coarse go fine) and nrepeat (generally 10-50 repeats) and the form of validation.

```{r calculate_number_of_variables}
# Selecting a range of amount of variables to keep and storing them as a list
test.keepX <- list(miRNA = c(seq(20, 80, 10)),
                   proteins = c(seq(20, 80, 10)),
                   metabolites = c(seq(20, 80, 10)))

# Tuning with loo-crossvalidation, i.e. choosing the variables to keep
start <- Sys.time()
tune.diablo <- tune.block.splsda(X_train, cca_train, ncomp = ncomp,
                              test.keepX = test.keepX, design = design,
                              validation = 'loo',
                              BPPARAM = BiocParallel::SnowParam(workers = 16),
                              dist = "centroids.dist")
end <- Sys.time()
print(end - start)
list.keepX <- tune.diablo$choice.keepX
```
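
The size of the grid explains most of the running time. The sketch below counts the combinations for the grid used above and shows one way a finer second pass could be set up. `refine()` and the coarse optimum of 40 are hypothetical and only illustrate the "start coarse, go fine" idea.

```r
# Sketch: how many keepX combinations does the grid contain?
test.keepX_toy <- list(miRNA = seq(20, 80, 10),
                       proteins = seq(20, 80, 10),
                       metabolites = seq(20, 80, 10))
grid <- expand.grid(test.keepX_toy)
n_combinations <- nrow(grid)   # 7 * 7 * 7 combinations

# Hypothetical finer pass around a coarse optimum for one block
refine <- function(best, step = 10, by = 2, lower = 1) {
  vals <- seq(best - step, best + step, by = by)
  vals[vals >= lower]          # never test fewer than `lower` variables
}
refine(40)
```

Each extra value per block multiplies the grid size, so a coarse grid followed by a narrow fine grid is usually cheaper than one dense grid.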

Here I hard-code the results of a previous tuning run, since the tuning takes some time and the results may vary between runs.
```{r assign_number_of_variables}
cca_list.keepX <- list(
  miRNA = c(40, 20),
  proteins = c(20, 70),
  metabolites = c(80, 20))

# ibd_list.keepX <- list(
#   miRNA = c(80, 20),
#   proteins = c(50, 20),
#   metabolites = c(20, 20))
```

## Create final model

We create the final model with the training data set and the three parameters that we have defined.
```{r create_final_model}
cca_diablo <- block.splsda(X_train, cca_train, keepX = cca_list.keepX, ncomp = ncomp, design = design)
```


```{r export_variables}
export_variables <- function(diablo) {
  variables1_1 <- selectVar(diablo, block = "miRNA", comp = 1)
  variables1_2 <- selectVar(diablo, block = "miRNA", comp = 2)
  variables2_1 <- selectVar(diablo, block = "proteins", comp = 1)
  variables2_2 <- selectVar(diablo, block = "proteins", comp = 2)
  variables3_1 <- selectVar(diablo, block = "metabolites", comp = 1)
  variables3_2 <- selectVar(diablo, block = "metabolites", comp = 2)
  
  results <- rbind(
    cbind(compounds = variables1_1[["miRNA"]][["name"]], contribution = variables1_1[["miRNA"]][["value"]][["value.var"]], component = variables1_1[["comp"]]), 
    cbind(compounds = variables1_2[["miRNA"]][["name"]], contribution = variables1_2[["miRNA"]][["value"]][["value.var"]], component = variables1_2[["comp"]]), 
    cbind(compounds = variables2_1[["proteins"]][["name"]], contribution = variables2_1[["proteins"]][["value"]][["value.var"]], component = variables2_1[["comp"]]), 
    cbind(compounds = variables2_2[["proteins"]][["name"]], contribution = variables2_2[["proteins"]][["value"]][["value.var"]], component = variables2_2[["comp"]]), 
    cbind(compounds = variables3_1[["metabolites"]][["name"]], contribution = variables3_1[["metabolites"]][["value"]][["value.var"]], component = variables3_1[["comp"]]), 
    cbind(compounds = variables3_2[["metabolites"]][["name"]], contribution = variables3_2[["metabolites"]][["value"]][["value.var"]], component = variables3_2[["comp"]]))
  
  results <- results %>%
    as_tibble() %>%
    mutate(
      contribution = as.numeric(contribution),
      component = as.numeric(component))}
```
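
The function above is written for exactly three blocks and two components. A more general variant could loop over blocks and components, as sketched below. `export_variables_long()` is a hypothetical alternative and adds a `block` column that the original table does not have. `fake_select()` is a stand-in that only mimics the list shape read from selectVar() above, so the sketch can be tried without a fitted model.

```r
# Sketch: collect selected variables for any number of blocks and components.
# `select_fun` defaults to mixOmics::selectVar; it is an argument here only so
# that the function can be tried out with a stand-in.
export_variables_long <- function(model, blocks, comps,
                                  select_fun = mixOmics::selectVar) {
  pieces <- list()
  for (b in blocks) {
    for (k in comps) {
      sel <- select_fun(model, block = b, comp = k)
      pieces[[length(pieces) + 1]] <- data.frame(
        compounds = sel[[b]][["name"]],
        contribution = sel[[b]][["value"]][["value.var"]],
        component = k,
        block = b,
        stringsAsFactors = FALSE)
    }
  }
  do.call(rbind, pieces)
}

# Stand-in with made-up names and loadings, same list shape as used above
fake_select <- function(model, block, comp) {
  out <- list(list(name = paste0(block, "_var", 1:2),
                   value = data.frame(value.var = c(0.8, -0.6))),
              comp = comp)
  names(out)[1] <- block
  out
}

toy <- export_variables_long(NULL, c("miRNA", "proteins"), 1:2,
                             select_fun = fake_select)
toy
```

With a fitted model the call would be something like `export_variables_long(cca_diablo, names(X_train), 1:ncomp)`, assuming the model was fitted with `ncomp` components.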

# Results

## CCA {.tabset}

### Performance

We investigate the error rate of our final model and plot the AUC. Here it can be important to choose the right type of validation and number of folds.
```{r cca_analyse_performance}
# pdf(file = "results/plots/Figure_3_CCA/ROC_cca.pdf")
perf.cca_diablo <- perf(cca_diablo,  validation = 'loo',  
                         nrepeat = 10, dist = 'centroids.dist')

# Performance with Majority vote
perf.cca_diablo$MajorityVote.error.rate

# Performance with Weighted vote
perf.cca_diablo$WeightedVote.error.rate

# AUC plot per block
for (i in c("miRNA", "proteins", "metabolites")) {
  auc.cca_diablo <- auroc(cca_diablo, roc.block = i, roc.comp = 2,
                           print = FALSE)}
```
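
As a reminder of what the AUC summarises, the sketch below computes it from ranks in base R: it is the proportion of positive-negative pairs in which the positive sample gets the higher score. The scores and labels are made up, and `auc_from_scores()` is a hypothetical helper, not the method used by auroc().

```r
# Sketch: AUC from the rank-sum (Mann-Whitney) formulation.
auc_from_scores <- function(score, truth) {
  truth <- as.logical(truth)
  n_pos <- sum(truth)
  n_neg <- sum(!truth)
  r <- rank(score)                       # average ranks handle ties
  (sum(r[truth]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

toy_score <- c(0.1, 0.4, 0.35, 0.8)      # made-up prediction scores
toy_truth <- c(0, 0, 1, 1)               # made-up outcome
auc_from_scores(toy_score, toy_truth)    # 3 of 4 pairs ordered correctly: 0.75
```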

### Prediction

We use our final model to predict the outcome for the test data set and then compare to the actual outcome.
```{r cca_analyse_prediction}
predict.cca_diablo <- predict(cca_diablo, newdata = X_test)

confusion.mat <- get.confusion_matrix(truth = cca_test, 
                     predicted = predict.cca_diablo$WeightedVote$centroids.dist[,2])

dimnames(confusion.mat) <- list(
  c("low cca", "high cca"),
  c("predicted as low cca", "predicted as high cca")
)  

confusion.mat

get.BER(confusion.mat)
```
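
The balanced error rate (BER) averages the error rate of each class, so the larger class does not dominate the result. A minimal base R sketch, using a made-up confusion matrix with truth in rows and predictions in columns:

```r
# Sketch: balanced error rate from a confusion matrix
# (rows = truth, columns = prediction).
balanced_error_rate <- function(conf) {
  stopifnot(nrow(conf) == ncol(conf))
  per_class_error <- 1 - diag(conf) / rowSums(conf)
  mean(per_class_error)
}

# Made-up counts for illustration, not results from this analysis
toy_conf <- matrix(c(3, 0,
                     1, 1),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(c("truth_0", "truth_1"),
                                   c("pred_0", "pred_1")))

balanced_error_rate(toy_conf)   # class errors 0 and 0.5 give 0.25
```

In this toy example the overall error rate is 1/5 = 0.2 while the BER is 0.25, because the single mistake falls in the smaller class.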

### Plot correlation

We plot the correlation between the layers.
```{r cca_plot_correlation}
# pdf(file = "results/plots/Figure_3_CCA/cca_correlation.pdf")
for (i in 1:2) {
  plotDiablo(cca_diablo, ncomp = i)}
```

### Plot individuals

```{r cca_plot_individuals}
# pdf(file = "results/plots/Figure_3_CCA/cca_diablo_blocks_network.pdf")
plotIndiv(cca_diablo, ind.names = FALSE, legend = TRUE, 
          title = 'DIABLO comp 1 - 2')

plotIndiv(cca_diablo, ind.names = FALSE, legend = TRUE,
          title = 'DIABLO comp 1 - 2', block = "average")

plotIndiv(cca_diablo, ind.names = FALSE, legend = TRUE, 
          title = 'DIABLO comp 1 - 2', block = "weighted.average")

# Arrow plot
plotArrow(cca_diablo, ind.names = FALSE, legend = TRUE, 
          title = 'DIABLO comp 1 - 2')
```

### Plot variables

```{r cca_plot_variables}
# pdf(file = "results/plots/Figure_3_CCA/cca_variables_barplot.pdf")
plotVar(cca_diablo, var.names = FALSE, style = 'graphics', legend = TRUE, 
        pch = c(16, 17, 15), cex = c(2,2,2), 
        col = c('#377EB8', '#E7298A', '#1B9E77'),
        title = 'Multiomics analysis in relation to cca level')

# pdf(file = "results/plots/Figure_3_CCA/cca_loadings.pdf")
for (i in 1:2) {plotLoadings(cca_diablo, comp = i, contrib = 'max', method = 'median')}

# pdf(file = "results/plots/Figure_3_CCA/cca_circos_plot.pdf")
circosPlot(cca_diablo, cutoff = 0.7, line = TRUE, 
           color.blocks = c('#377EB8', '#E7298A', '#1B9E77'),
           color.cor = c("#D95F02","#984EA3"), size.labels = 1.5)

# Network, written to a PDF file
pdf("results/plots/Figure_3_CCA/cca_network1.pdf")
network(cca_diablo, blocks = c(1,2,3),
        cutoff = 0.7,
        color.node = c('#377EB8', '#E7298A', '#1B9E77'))
dev.off() # close the device so that later plots are not written to this file

# Clustered image map
# pdf("results/plots/Figure_3_CCA/cca_clustered_map.pdf")
cimDiablo(cca_diablo, color.blocks = c('#377EB8', '#E7298A', '#1B9E77'),
          comp = 1, margin=c(8,20), legend.position = "right")
```

### Export variables

We export a table with all compounds used in the model with columns representing compound name, coefficient, and component.
```{r cca_export}
variables <- export_variables(cca_diablo)

# Save the data as .csv
write_csv(variables, "results/mixOmics_cca_variables.csv")
```


## IBD {.tabset}

The parameters are tuned again for the IBD outcome, after which the final model is created and evaluated in the same way as for CCA.

### Number of components

To find the optimal number of components I create an initial model with block.plsda() and evaluate it with perf(). The cross-validation is set to leave-one-out since the sample size is small.
```{r IBD_calculate_number_of_components}
# Finding the global performance of the model without variable selection
diablo.initial <- block.plsda(X_train, ibd_train, ncomp = 5, design = design)

# Leave-one-out cross-validation of the model (repeats add nothing with LOO, the folds are deterministic)
perf.diablo.initial <- perf(diablo.initial, validation = "loo", nrepeat = 10)

# Plotting the error rate
# pdf(file = "results/plots/error_rates_diablo_model.pdf")
plot(perf.diablo.initial)

# Getting the number of components according to weighted vote
perf.diablo.initial$choice.ncomp$WeightedVote

# pdf(file = "results/plots/Figure_3_CCA/correlation_by_ncomp_cca.pdf")
for (i in 1:5) {plotDiablo(diablo.initial, ncomp = i)}
```

**Optimize**: Since the outcome variable is binary, the optimal number of components will usually be 1, as it is usually the number of distinct outcomes minus 1. Plotting needs at least 2 components, so the value suggested by perf() is raised to 2 whenever it is lower.
```{r IBD_assign_number_of_components}
ncomp <- max(2, perf.diablo.initial$choice.ncomp$WeightedVote["Overall.BER", "centroids.dist"])
```

### Number of variables

To find the optimal number of variables I create a grid with the numbers of variables to test for each omics layer, which can be adjusted depending on how many features each layer contains. Thereafter I tune the model with tune.block.splsda(), which gives an optimal number of variables per layer and component.

**Optimize**: The most important parameters to tweak are test.keepX (start coarse go fine) and nrepeat (generally 10-50 repeats) and the form of validation.

```{r IBD_calculate_number_of_variables}
# Selecting a range of amount of variables to keep and storing them as a list
test.keepX <- list(miRNA = c(seq(20, 80, 10)),
                   proteins = c(seq(20, 80, 10)),
                   metabolites = c(seq(20, 80, 10)))

# Tuning with loo-crossvalidation, i.e. choosing the variables to keep
start <- Sys.time()
tune.diablo <- tune.block.splsda(X_train, ibd_train, ncomp = ncomp,
                              test.keepX = test.keepX, design = design,
                              validation = 'loo',
                              BPPARAM = BiocParallel::SnowParam(workers = 16),
                              dist = "centroids.dist")
end <- Sys.time()
print(end - start)
list.keepX <- tune.diablo$choice.keepX
```

Here I hard-code the results of a previous tuning run, since the tuning takes some time and the results may vary between runs.
```{r IBD_assign_number_of_variables}
ibd_list.keepX <- list(
  miRNA = c(20, 20, 80),
  proteins = c(20, 20, 40),
  metabolites = c(20, 80, 20))
```

### Create final model

We create the final model with the training data set and the three parameters that we have defined.
```{r IBD_create_final_model}
ibd_diablo <- block.splsda(X_train, ibd_train, keepX = ibd_list.keepX, ncomp = ncomp, design = design)
```

### Performance

We investigate the error rate of our final model and plot the AUC. Here it can be important to choose the right type of validation and number of folds.

```{r IBD_analyse_performance}
perf.ibd_diablo <- perf(ibd_diablo,  validation = 'loo',  
                         nrepeat = 10, dist = 'centroids.dist')

# Performance with Majority vote
perf.ibd_diablo$MajorityVote.error.rate

# Performance with Weighted vote
perf.ibd_diablo$WeightedVote.error.rate

# pdf(file = "results/plots/Figure_4_IBD/ROC_ibd.pdf")
for (i in c("miRNA", "proteins", "metabolites")) {
  auc.ibd_diablo <- auroc(ibd_diablo, roc.block = i, roc.comp = 2,
                           print = FALSE)}
```

### Prediction

We use our final model to predict the outcome for the test data set and then compare to the actual outcome.
```{r IBD_analyse_prediction}
predict.ibd_diablo <- predict(ibd_diablo, newdata = X_test)

confusion.mat <- get.confusion_matrix(truth = ibd_test, 
                     predicted = predict.ibd_diablo$WeightedVote$centroids.dist[,2])

dimnames(confusion.mat) <- list(
  c("low IBD", "high IBD"),
  c("predicted as low IBD", "predicted as high IBD")
)  

confusion.mat

get.BER(confusion.mat)
```

### Plot correlation

We plot the correlation between the layers.
```{r IBD_plot_correlation}
# pdf(file = "results/plots/Figure_4_IBD/correlation_omics_ibd.pdf")
for (i in 1:2) {
  plotDiablo(ibd_diablo, ncomp = i)}
```

### Plot individuals

```{r IBD_plot_individuals}
# pdf(file = "results/plots/Figure_4_IBD/components_plots_ibd.pdf")
plotIndiv(ibd_diablo, ind.names = FALSE, legend = TRUE, 
          title = 'DIABLO comp 1 - 2')

plotIndiv(ibd_diablo, ind.names = FALSE, legend = TRUE,
          title = 'DIABLO comp 1 - 2', block = "average")

plotIndiv(ibd_diablo, ind.names = FALSE, legend = TRUE, 
          title = 'DIABLO comp 1 - 2', block = "weighted.average")

# Arrow plot
plotArrow(ibd_diablo, ind.names = FALSE, legend = TRUE, 
          title = 'DIABLO comp 1 - 2')
```

### Plot variables

```{r IBD_plot_variables}
# pdf(file = "results/plots/Figure_4_IBD/variable_plot_ibd.pdf")
plotVar(ibd_diablo, var.names = FALSE, style = 'graphics', legend = TRUE, 
        pch = c(16, 17, 15), cex = c(2,2,2), 
        col = c('#377EB8', '#E7298A', '#1B9E77'),
        title = 'Multiomics analysis in relation to IBD level')

pdf(file = "results/plots/Figure_4_IBD/loadings_ibd.pdf")
for (i in 1:2) {
  plotLoadings(ibd_diablo, comp = i, contrib = 'max', method = 'median')}
dev.off() # close the device so that later plots are not written to this file

# pdf(file = "results/plots/Figure_4_IBD/circos_plot_ibd.pdf")
circosPlot(ibd_diablo, cutoff = 0.7, line = TRUE, 
           color.blocks = c('#377EB8', '#E7298A', '#1B9E77'),
           color.cor = c("#D95F02","#984EA3"), size.labels = 1.5)

# pdf(file = "results/plots/Figure_4_IBD/network_ibd.pdf")
network(ibd_diablo, blocks = c(1,2,3),
        cutoff = 0.7,
        color.node = c('#377EB8', '#E7298A', '#1B9E77'))

# pdf(file = "results/plots/Figure_4_IBD/blocks_ibd.pdf")
cimDiablo(ibd_diablo, color.blocks = c('#377EB8', '#E7298A', '#1B9E77'),
          comp = 1, margin=c(8,20), legend.position = "right")
```

### Export variables

We export a table with all compounds used in the model with columns representing compound name, coefficient, and component.
```{r IBD_export}
variables <- export_variables(ibd_diablo)
write_csv(variables, "results/mixOmics_IBD_binary.csv")
```


# Conclusion