This tutorial provides a guided alignment of pancreatic islet cells sequenced across four prococols: CEL-Seq (GSE81076), CEL-Seq2 (GSE85241), Fluidigm C1 (GSE86469), and Smart-Seq2 (E-MTAB-5061) as described by Seurat. In this tutorial we demonstrate the unsupervised alignment strategy of scAlign
described in Johansen et al, 2018 extended to alignment of multiple datasets (conditions) along with typical analysis utilizing the aligned dataset.
The following is a walkthrough of a multi-way alignment problem with scAlign
and has been designed to provide an overview of data preprocessing, alignment and finally analysis on the alignment space. Here, our primary goals include:
- Train scAlign on a large dataset by using an initial step of dimensionality reduction.
- Learning a low-dimensional cell state space in which cells group by function and type, regardless of condition (sequencing protocol).
The gene count matrices used for this tutorial are hosted here.
First, we perform a typical scRNA preprocessing step using the Seurat
package. Then, reduce to the top 3,000 highly variable genes and perform an initial dimensionality reduction step using CCA improve convergence and reduce computation time.
library(scAlign)
library(Seurat)
library(SingleCellExperiment)
library(gridExtra)
## PMA depends on 'impute' from bioconductor
# if (!requireNamespace("BiocManager", quietly = TRUE))
# install.packages("BiocManager")
#
# BiocManager::install("impute")
library(PMA)
## User paths
working.dir = "." #where the data files are located
results.dir = "." #where the output should be stored
## Load in data
pancreas.data = readRDS(file = file.path(working.dir, "pancreas_expression_matrix.rds"))
metadata = readRDS(file = file.path(working.dir, "pancreas_metadata.rds"))
## Subset combined object into a list containing the four individual datasets
pancreas = CreateSeuratObject(pancreas.data, meta.data = metadata)
pancreas.list = SplitObject(pancreas, split.by = "tech")
## Independently preprocess each datasets
for (i in 1:length(pancreas.list)) {
pancreas.list[[i]] <- NormalizeData(pancreas.list[[i]])
pancreas.list[[i]] <- ScaleData(pancreas.list[[i]], do.scale=T, do.center=T, display.progress=T)
pancreas.list[[i]] <- FindVariableFeatures(pancreas.list[[i]], nFeatures = 3000)
}
## Extract common set of genes across all datasets
genes.use = Reduce(intersect, lapply(pancreas.list, function(seurat.obj) VariableFeatures(seurat.obj)))
The general design of scAlign
makes it agnostic to the input RNA-seq data representation. Thus, the input data can either be
gene-level counts, transformations of those gene level counts or a preliminary step of dimensionality reduction such
as canonical correlates or principal component scores. Here we convert the previously defined
Seurat
objects to SingleCellExperiment
objects in order to create the combined scAlign
object to which we append the results of Seurat's RunMultiCCA
.
## Convert each Seurat object into an SCE object, being sure to retain Seurat's metadata in SCE's colData field
pancreas.sce.list = lapply(pancreas.list,
function(seurat.obj){
SingleCellExperiment(assays = list(counts = seurat.obj@assays$RNA@counts[genes.use,],
logcounts = seurat.obj@assays$RNA@data[genes.use,],
scale.data = seurat.obj@assays$RNA@scale.data[genes.use,]),
colData = seurat.obj@meta.data)
})
## Create the combined scAlign object from list of SCE(s).
scAlignPancreas = scAlignCreateObject(sce.objects = pancreas.sce.list,
genes.use = genes.use,
cca.reduce=T,
ccs.compute=10,
project.name = "scAlign_Pancreatic_Islet")
Now we align the pancreas islets sequenced across different protocols using the results of RunMultiCCA
as input to scAlign
.
## Run scAlign with CCA results as input to the encoder (alignment).
scAlignPancreas = scAlignMulti(scAlignPancreas,
options=scAlignOptions(steps=15000,
log.every=5000,
batch.size=300,
perplexity=30,
norm=TRUE,
batch.norm.layer=FALSE,
architecture="large", ## 3 layer neural network
num.dim=64), ## Number of latent dimensions
encoder.data="MultiCCA",
decoder.data="scale.data",
supervised='none',
run.encoder=TRUE,
run.decoder=TRUE,
log.results=TRUE,
log.dir=file.path('./tmp'),
device="GPU")
After alignment, scAlign
returns a
low-dimensional joint embedding space where the effect of sequencing protocol is removed allowing us to use the complete dataset for downstream analyses such as clustering
or differential expression.
After alignment we can use the resulting embeddings to cluster the cells without dataset specific biases then perform differential expression analysis. Further information about using Seurat to find clusters and differentially expressed genes can be found here.
library(dplyr)
scAlignSeuratObj = as.Seurat(scAlignPancreas, counts="counts", data="logcounts", assay="RNA")
## Cluster the data based on scAlign embeddings, be sure to use all the embedding dimensions!
scAlignSeuratObj <- FindNeighbors(scAlignSeuratObj, dims = 1:64, reduction="ALIGNED-MultiCCA")
scAlignSeuratObj <- FindClusters(scAlignSeuratObj, resolution = 0.4)
table(Idents(scAlignSeuratObj))
## Find markers for each scAlign based cluster compared against the remaining cells.
pancreas.markers <- FindAllMarkers(scAlignSeuratObj, only.pos = TRUE, min.pct = 0.25, logfc.threshold = 0.25)
pancreas.markers %>% group_by(cluster) %>% top_n(n = 2, wt = avg_logFC)
## Alternatively, find markers which are conserved when performing DE on each dataset individually.
pancreas.markers <- FindConservedMarkers(scAlignSeuratObj, ident.1 = 0, ident.2 = 1, grouping.var = "group.by")
```R