Single Cells 2018 Informatics Workshop
Author: Anne Senabouth
Date: 2018-07-17

1. Preparing the data for the workshop

This vignette will use the unprocessed data from the dataset featured in the publication Single cell RNA sequencing of stem cell-derived retinal ganglion cells by Daniszewski et al. 2018.

This dataset consists of 1272 human embryonic stem cell-derived retinal ganglion cells that have been separated into THY1-positive (Sample 1) and THY1-negative (Sample 2) cells via flow cytometry. This protein is a marker for pluripotency.

Please note that this data has been batch equalized with subsampling.

1.1 Loading the data into R

We will be bypassing the manual preparation of the data for this workshop. If you wish to know more about manually loading the data into R, please refer to the ascend vignettes.

Instead, we will use the loadCellRanger command to load the data into R. This command parses the filtered data generated by Cell Ranger and assumes mitochondrial and ribosomal genes will be used as controls. It also parses batch information from the cell identifiers assigned by Cell Ranger's batch aggregation function. Please note that this batch aggregation function has also normalised the dataset via the subsampling method described by Zheng et al. 2016.

library(ascend)

# BiocParallel configuration
library(BiocParallel)
ncores <- 3
register(MulticoreParam(workers = ncores, progressbar=TRUE), default = TRUE)
# Set path to data
em_set <- loadCellRanger("data/")
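
To illustrate how batch information can be read from cell identifiers, here is a minimal sketch in base R. It assumes the batch number is the suffix after the final hyphen of each barcode, as in the identifiers shown later in this vignette; it is not the code used inside loadCellRanger.

```r
# Three barcodes taken from the dataset output shown in this vignette.
barcodes <- c("AAACCTGAGCTGTTCA-1", "AAACCTGCAATTCCTT-1", "TTTGTCATCTTCATGT-2")

# Assumption: the batch is whatever follows the last hyphen.
batch <- as.numeric(sub("^.*-", "", barcodes))

# Assemble a small table resembling the colInfo slot.
cell_info <- data.frame(cell_barcode = barcodes,
                        batch = batch,
                        stringsAsFactors = FALSE)
print(cell_info)
```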

1.2 Exploring the EMSet

We can review the contents of the EMSet by inputting the variable name of the EMSet into the console.

em_set
#> [1] "metadata:"
#> class: EMSet
#> dim: 33020 1272
#> metadata:
#> assays: counts
#> rownames: MIR1302-2, FAM138A, OR4F5, RP11-34P13.7, RP11-34P13.8, RP11-34P13.14, RP11-34P13.9, FO538757.2, FO538757.1, AP006222.2 ...
#> rowInfo: gene_id, ensembl_gene_id, control_group
#> rowData: gene_id, qc_ncounts, qc_ncells, qc_meancounts, qc_topgeneranking, qc_pct_total_expression
#> colnames: AAACCTGAGCTGTTCA-1, AAACCTGCAATTCCTT-1, AAACCTGGTCTACCTC-1, AAACCTGTCGGAGCAA-1, AAACGGGAGTCGATAA-1, AAACGGGCACAAGTAA-1, AAAGATGAGGAGCGAG-1, AAAGATGTCAGCTCTC-1, AAAGATGTCTTGACGA-1, AAAGCAATCTATGTGG-1 ...
#> colInfo: cell_barcode, batch
#> colData: cell_barcode, qc_libsize, qc_ngenes, qc_Rb_ncounts, qc_Rb_pct_counts, qc_Mt_ncounts, qc_Mt_pct_counts, qc_nfeaturecounts, qc_pctfeatures
#> reducedDimNames: 
#> spikeNames: 
#> clusterAnalysis: $set_controls
#> $set_controls$Mt
#>  [1] "MT-ND1"  "MT-ND2"  "MT-CO1"  "MT-CO2"  "MT-ATP8" "MT-ATP6" "MT-CO3" 
#>  [8] "MT-ND3"  "MT-ND4L" "MT-ND4"  "MT-ND5"  "MT-ND6"  "MT-CYB" 
#> 
#> $set_controls$Rb
#>   [1] "RPL22"          "RPL11"          "RPS6KA1"        "RPS8"          
#>   [5] "RPL5"           "RPS27"          "RPS6KC1"        "RPS7"          
#>   [9] "RPS27A"         "RPL31"          "RPL37A"         "RPL32"         
#>  [13] "RPL15"          "RPSA"           "RPL14"          "RPL29"         
#>  [17] "RPL24"          "RPL22L1"        "RPL39L"         "RPL35A"        
#>  [21] "RPL9"           "RPL34-AS1"      "RPL34"          "RPS3A"         
#>  [25] "RPL37"          "RPS23"          "RPS14"          "RPL26L1"       
#>  [29] "RPS18"          "RPS10-NUDT3"    "RPS10"          "RPL10A"        
#>  [33] "RPL7L1"         "RPS12"          "RPS6KA2"        "RPS6KA2-AS1"   
#>  [37] "RPS6KA3"        "RPS4X"          "RPS6KA6"        "RPL36A"        
#>  [41] "RPL36A-HNRNPH2" "RPL39"          "RPL10"          "RPS20"         
#>  [45] "RPL7"           "RPL30"          "RPL8"           "RPS6"          
#>  [49] "RPL35"          "RPL12"          "RPL7A"          "RPLP2"         
#>  [53] "RPL27A"         "RPS13"          "RPS6KA4"        "RPS6KB2"       
#>  [57] "RPS3"           "RPS25"          "RPS24"          "RPS26"         
#>  [61] "RPL41"          "RPL6"           "RPLP0"          "RPL21"         
#>  [65] "RPL10L"         "RPS29"          "RPL36AL"        "RPS6KL1"       
#>  [69] "RPS6KA5"        "RPS27L"         "RPL4"           "RPLP1"         
#>  [73] "RPS17"          "RPL3L"          "RPS2"           "RPS15A"        
#>  [77] "RPL13"          "RPL26"          "RPL23A"         "RPL23"         
#>  [81] "RPL19"          "RPL27"          "RPS6KB1"        "RPL38"         
#>  [85] "RPL17-C18orf32" "RPL17"          "RPS21"          "RPS15"         
#>  [89] "RPL36"          "RPS28"          "RPL18A"         "RPS16"         
#>  [93] "RPS19"          "RPL18"          "RPL13A"         "RPS11"         
#>  [97] "RPS9"           "RPL28"          "RPS5"           "RPS4Y1"        
#> [101] "RPS4Y2"         "RPL3"           "RPS19BP1"      
#> 
#> 
#> $controls
#> [1] TRUE

The summary shows how many genes are in the EMSet, which genes were set as controls and what is kept in the various slots of the EMSet. Some of the slot names may look familiar; this is because this version of the EMSet inherits from the SingleCellExperiment superclass.

The EMSet differs from the SingleCellExperiment class in that it contains separate slots for cell- and gene-related metadata, and additional slots for cluster analysis and log information. The EMSet also generates a set of quality control metrics from the count matrix upon creation, and updates these metrics every time the matrix is changed.

We can review the cell information that is stored in the colInfo slot as follows:

# Cell-related metadata
print("Using colInfo...")
#> [1] "Using colInfo..."
colInfo(em_set)
#> DataFrame with 1272 rows and 2 columns
#>                          cell_barcode     batch
#>                           <character> <numeric>
#> AAACCTGAGCTGTTCA-1 AAACCTGAGCTGTTCA-1         1
#> AAACCTGCAATTCCTT-1 AAACCTGCAATTCCTT-1         1
#> AAACCTGGTCTACCTC-1 AAACCTGGTCTACCTC-1         1
#> AAACCTGTCGGAGCAA-1 AAACCTGTCGGAGCAA-1         1
#> AAACGGGAGTCGATAA-1 AAACGGGAGTCGATAA-1         1
#> ...                               ...       ...
#> TTTATGCCACAGTCGC-2 TTTATGCCACAGTCGC-2         2
#> TTTGCGCGTTGACGTT-2 TTTGCGCGTTGACGTT-2         2
#> TTTGCGCGTTGTCTTT-2 TTTGCGCGTTGTCTTT-2         2
#> TTTGTCAGTGAGTGAC-2 TTTGTCAGTGAGTGAC-2         2
#> TTTGTCATCTTCATGT-2 TTTGTCATCTTCATGT-2         2

# Cell-related metrics
print("Using colData...")
#> [1] "Using colData..."
colData(em_set)
#> DataFrame with 1272 rows and 9 columns
#>                          cell_barcode qc_libsize qc_ngenes qc_Rb_ncounts
#>                           <character>  <numeric> <numeric>     <numeric>
#> AAACCTGAGCTGTTCA-1 AAACCTGAGCTGTTCA-1      13074      3889          3626
#> AAACCTGCAATTCCTT-1 AAACCTGCAATTCCTT-1       7957      3071          1891
#> AAACCTGGTCTACCTC-1 AAACCTGGTCTACCTC-1      18777      4726          5421
#> AAACCTGTCGGAGCAA-1 AAACCTGTCGGAGCAA-1       8320      2368          3299
#> AAACGGGAGTCGATAA-1 AAACGGGAGTCGATAA-1      11085      3415          3277
#> ...                               ...        ...       ...           ...
#> TTTATGCCACAGTCGC-2 TTTATGCCACAGTCGC-2       8365      3173          1521
#> TTTGCGCGTTGACGTT-2 TTTGCGCGTTGACGTT-2      21582      4428          6426
#> TTTGCGCGTTGTCTTT-2 TTTGCGCGTTGTCTTT-2     111166      9996         36060
#> TTTGTCAGTGAGTGAC-2 TTTGTCAGTGAGTGAC-2      16552      4531          3381
#> TTTGTCATCTTCATGT-2 TTTGTCATCTTCATGT-2      15640      3569          6474
#>                    qc_Rb_pct_counts qc_Mt_ncounts qc_Mt_pct_counts
#>                           <numeric>     <numeric>        <numeric>
#> AAACCTGAGCTGTTCA-1 27.7344347560043           213 1.62918770078017
#> AAACCTGCAATTCCTT-1 23.7652381550836           278 3.49377906246073
#> AAACCTGGTCTACCTC-1 28.8704265857166           677 3.60547478297918
#> AAACCTGTCGGAGCAA-1 39.6514423076923           258 3.10096153846154
#> AAACGGGAGTCGATAA-1 29.5624718087506           425  3.8340099233198
#> ...                             ...           ...              ...
#> TTTATGCCACAGTCGC-2 18.1829049611476           155 1.85295875672445
#> TTTGCGCGTTGACGTT-2 29.7748123436197           781 3.61875637104995
#> TTTGCGCGTTGTCTTT-2 32.4379756400338          3387 3.04679488332764
#> TTTGTCAGTGAGTGAC-2 20.4265345577574           655 3.95722571290478
#> TTTGTCATCTTCATGT-2 41.3938618925831           265 1.69437340153453
#>                    qc_nfeaturecounts   qc_pctfeatures
#>                            <numeric>        <numeric>
#> AAACCTGAGCTGTTCA-1              9235 70.6363775432155
#> AAACCTGCAATTCCTT-1              5788 72.7409827824557
#> AAACCTGGTCTACCTC-1             12679 67.5240986313043
#> AAACCTGTCGGAGCAA-1              4763 57.2475961538462
#> AAACGGGAGTCGATAA-1              7383 66.6035182679296
#> ...                              ...              ...
#> TTTATGCCACAGTCGC-2              6689 79.9641362821279
#> TTTGCGCGTTGACGTT-2             14375 66.6064312853304
#> TTTGCGCGTTGTCTTT-2             71719 64.5152294766385
#> TTTGTCAGTGAGTGAC-2             12516 75.6162397293378
#> TTTGTCATCTTCATGT-2              8901 56.9117647058823

# Gene-related metadata
print("Using rowInfo...")
#> [1] "Using rowInfo..."
rowInfo(em_set)
#> DataFrame with 33020 rows and 3 columns
#>                   gene_id ensembl_gene_id control_group
#>               <character>     <character>   <character>
#> MIR1302-2       MIR1302-2 ENSG00000243485            NA
#> FAM138A           FAM138A ENSG00000237613            NA
#> OR4F5               OR4F5 ENSG00000186092            NA
#> RP11-34P13.7 RP11-34P13.7 ENSG00000238009            NA
#> RP11-34P13.8 RP11-34P13.8 ENSG00000239945            NA
#> ...                   ...             ...           ...
#> AC233755.2     AC233755.2 ENSG00000277856            NA
#> AC233755.1     AC233755.1 ENSG00000275063            NA
#> AC240274.1     AC240274.1 ENSG00000271254            NA
#> AC213203.1     AC213203.1 ENSG00000277475            NA
#> FAM231C.1       FAM231C.1 ENSG00000268674            NA

# Gene-related metrics
print("Using rowData...")
#> [1] "Using rowData..."
rowData(em_set)
#> DataFrame with 33020 rows and 6 columns
#>                   gene_id qc_ncounts qc_ncells        qc_meancounts
#>               <character>  <numeric> <numeric>            <numeric>
#> MIR1302-2       MIR1302-2          0         0                    0
#> FAM138A           FAM138A          0         0                    0
#> OR4F5               OR4F5          0         0                    0
#> RP11-34P13.7 RP11-34P13.7          4         4  0.00314465408805031
#> RP11-34P13.8 RP11-34P13.8          1         1 0.000786163522012579
#> ...                   ...        ...       ...                  ...
#> AC233755.2     AC233755.2          0         0                    0
#> AC233755.1     AC233755.1          0         0                    0
#> AC240274.1     AC240274.1         97        94   0.0762578616352201
#> AC213203.1     AC213203.1          0         0                    0
#> FAM231C.1       FAM231C.1          0         0                    0
#>              qc_topgeneranking qc_pct_total_expression
#>                      <integer>               <numeric>
#> MIR1302-2                22278                       0
#> FAM138A                  22279                       0
#> OR4F5                    22280                       0
#> RP11-34P13.7             18061    1.84549157020974e-05
#> RP11-34P13.8             20358    4.61372892552436e-06
#> ...                        ...                     ...
#> AC233755.2               33017                       0
#> AC233755.1               33018                       0
#> AC240274.1               11394    0.000447531705775863
#> AC213203.1               33019                       0
#> FAM231C.1                33020                       0
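
The metrics above can be approximated directly from a count matrix. The sketch below uses a made-up toy matrix and reuses the column names from the output; the formulas are our reading of those names rather than ascend's source code. They are consistent with the values shown: for the first cell, 213 / 13074 is about 1.629%, and for RP11-34P13.7, 4 / 1272 is about 0.00314.

```r
# Toy count matrix (made-up values): genes in rows, cells in columns.
counts <- matrix(c(5, 0, 2,
                   0, 0, 1,
                   3, 4, 0,
                   2, 6, 7),
                 nrow = 4, byrow = TRUE,
                 dimnames = list(c("GENE1", "GENE2", "MT-ND1", "RPL22"),
                                 c("cell1", "cell2", "cell3")))

# Assumed control group for this toy example.
mt_genes <- "MT-ND1"

# Cell-level metrics (one value per column).
qc_libsize <- colSums(counts)            # total counts per cell
qc_ngenes <- colSums(counts > 0)         # number of detected genes per cell
qc_Mt_ncounts <- colSums(counts[mt_genes, , drop = FALSE])
qc_Mt_pct_counts <- 100 * qc_Mt_ncounts / qc_libsize

# Gene-level metrics (one value per row).
qc_ncounts <- rowSums(counts)            # total counts per gene
qc_ncells <- rowSums(counts > 0)         # number of cells expressing the gene
qc_meancounts <- rowMeans(counts)        # mean count across all cells
qc_pct_total_expression <- 100 * qc_ncounts / sum(counts)

print(data.frame(qc_libsize, qc_ngenes, qc_Mt_ncounts, qc_Mt_pct_counts))
print(data.frame(qc_ncounts, qc_ncells, qc_meancounts, qc_pct_total_expression))
```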

2. Quality Control and Filtering

This workflow is similar to other scRNA-seq filtering workflows available through packages such as Seurat and scater. You can read more about quality control through these links:

  1. Analyzing single-cell RNA-seq data containing UMI counts by Lun, McCarthy & Marioni 2018
  2. Cleaning the Expression Matrix by Kiselev, Andrews, McCarthy, Buttner & Hemberg 2018

2.1 Assessing quality of the dataset

First, we will have a look at the quality of the dataset. Some quality control metrics were generated when the EMSet was created; they are stored in the rowData (metrics for genes) and colData (metrics for cells) slots. We can visualise these values with quality control plots.

We can generate all the plots we will need for this step with the plotGeneralQC function.

qc_plots <- plotGeneralQC(em_set)

The two plots below represent metrics related to the total number of UMIs mapped to a cell.

The first plot is a barplot that depicts the library size of each cell and is coloured by the batch. This plot lets us quickly review the distribution of library sizes and if there are any significant batch effects in the dataset.

The second plot is a histogram of library sizes, with library size on the x-axis and the number of cells with that library size on the y-axis. Please note that this dataset comprises cells that have already been filtered from the background based on a minimum threshold value for the number of UMIs present in the cell. There are some cells with significantly larger numbers of UMIs; they may be doublets, where more than one cell ended up in a droplet. We will need to remove these cells.

library(gridExtra)
#> 
#> Attaching package: 'gridExtra'
#> The following object is masked from 'package:Biobase':
#> 
#>     combine
#> The following object is masked from 'package:BiocGenerics':
#> 
#>     combine
#> The following object is masked from 'package:dplyr':
#> 
#>     combine
grid.arrange(qc_plots$libsize_barplot, qc_plots$libsize_histogram, ncol = 1)

plot of chunk libsize_qc

We also want to review the number of genes being expressed in each cell, and what proportion of UMIs are mapped to top gene expressors. The first plot below represents the number of cells expressing a number of genes. The number of expressed genes will differ based on the cell type and other factors such as cell cycle phase.

The second plot represents the proportion of genes mapped to the top 500 most expressed genes (y-axis) per sample, with each point representing the percentage of expression contributed by the 100 most-expressed genes.

grid.arrange(qc_plots$ngenes_hist, qc_plots$topgenes_violin, ncol = 1)

plot of chunk gene_qc

Some cells appear to be dominated by the expression of a small subset of genes. These genes may be mitochondrial and ribosomal genes.

The plots below show the proportion of mitochondrial and ribosomal gene expression relative to total expression. Some cells have a high proportion of ribosomal gene expression, which can indicate that they are of low quality and should be removed from the dataset.

grid.arrange(qc_plots$control_hists$Mt, qc_plots$control_hists$Rb, ncol = 2)

plot of chunk control_qc

grid.arrange(qc_plots$control_violins$Mt, qc_plots$control_violins$Rb, ncol = 1)

plot of chunk control_qc2

2.2 Batch normalisation

Before we start filtering, we need to normalise the counts between the two batches to account for technical variation. The read counts have already been normalised via the Cell Ranger pipeline's subsampling method. The batch normalisation method used here scales UMI counts between batches.

em_set <- normaliseBatches(em_set)
#> [1] "Calculating size factors..."
#> [1] "Scaling counts..."
#> [1] "Storing data in EMSet..."
#> [1] "Re-calculating QC metrics..."
#> [1] "Batch normalisation complete! Returning object..."

2.3 Filtering low quality cells from the dataset

There are three filtering functions in the ascend package.

  1. filterByOutliers removes outliers in terms of library size and control expression based on median absolute deviation (MAD).
  2. filterByControl removes cells where genes in the specified control group account for more than the specified percentage of expression.
  3. filterLowAbundanceGenes removes genes that are expressed in fewer than the specified percentage of cells.

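
The logic behind these three filters can be sketched in base R. This is an illustration under stated assumptions, not ascend's implementation: the helper is_mad_outlier and all values are hypothetical, and R's mad() applies a scale constant of 1.4826 by default, which ascend may or may not use.

```r
# 1. MAD-based outliers: flag values more than `nmads` median absolute
#    deviations away from the median.
is_mad_outlier <- function(x, nmads = 3) {
  deviation <- abs(x - median(x))
  deviation > nmads * mad(x)
}

# Made-up library sizes; the last cell is far larger than the rest.
libsize <- c(9000, 10000, 11000, 9500, 10500, 100000)
print(is_mad_outlier(libsize))

# 2. Control threshold: keep cells where the control group accounts for
#    no more than the given percentage of expression.
mt_pct <- c(2, 5, 25, 3)            # made-up mitochondrial percentages
keep_cells <- mt_pct <= 20
print(keep_cells)

# 3. Low-abundance genes: keep genes detected in at least the given
#    percentage of cells. A 50% threshold is used here only because the
#    toy matrix has four cells.
gene_counts <- matrix(c(1, 0, 0, 0,
                        1, 2, 0, 3,
                        0, 0, 0, 0),
                      nrow = 3, byrow = TRUE,
                      dimnames = list(c("geneA", "geneB", "geneC"),
                                      paste0("cell", 1:4)))
pct_cells_expressing <- 100 * rowMeans(gene_counts > 0)
keep_genes <- pct_cells_expressing >= 50
print(keep_genes)
```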
# Filter library size and controls by MAD
filtered_set <- filterByOutliers(em_set, 
                                 cell.threshold = 3, 
                                 control.threshold = 3)
#> [1] "Identifying outliers..."

# Filter out cells based on proportion of control expression
filtered_set <- filterByControl(filtered_set, control = "Mt", pct.threshold = 20)
filtered_set <- filterByControl(filtered_set, control = "Rb", pct.threshold = 50)
filtered_set <- filterLowAbundanceGenes(filtered_set, pct.threshold = 1)

We can review which cells and genes were removed from the dataset by reviewing the progress log. As thousands of low abundance genes were removed, we'll look at the summary of the log.

str(progressLog(filtered_set))
#> List of 7
#>  $ set_controls            :List of 2
#>   ..$ Mt: chr [1:13] "MT-ND1" "MT-ND2" "MT-CO1" "MT-CO2" ...
#>   ..$ Rb: chr [1:103] "RPL22" "RPL11" "RPS6KA1" "RPS8" ...
#>  $ controls                : logi TRUE
#>  $ normaliseBatches        : logi TRUE
#>  $ filterByOutliers        :List of 4
#>   ..$ CellsFilteredByLibSize      : chr(0) 
#>   ..$ CellsFilteredByLowExpression: chr(0) 
#>   ..$ CellsFilteredByMt           : chr [1:36] "AAAGTAGGTTAGTGGG-2" "AACACGTAGTGGGTTG-1" "AACGTTGTCACGAAGG-1" "AAGACCTGTCTGGAGA-1" ...
#>   ..$ CellsFilteredByRb           : chr [1:9] "AACTGGTTCATAAAGG-1" "ACGATACGTCACACGC-1" "AGCCTAACAGTAACGG-1" "AGTCTTTTCACCACCT-1" ...
#>  $ FilteringLog            :'data.frame':	1 obs. of  7 variables:
#>   ..$ CellsFilteredByLibSize      : int 0
#>   ..$ CellsFilteredByLowExpression: int 0
#>   ..$ CellsFilteredByMt           : int 36
#>   ..$ CellsFilteredByRb           : int 9
#>   ..$ CellsFilteredByMtPct        : int 0
#>   ..$ CellsFilteredByRbPct        : int 29
#>   ..$ FilteredLowAbundanceGenes   : int 17212
#>  $ filterByControl         :List of 2
#>   ..$ Mt: list()
#>   ..$ Rb: chr [1:29] "AACTGGTTCATAAAGG-1" "AAGACCTAGACAATAC-1" "ACCAGTATCGGTTCGG-1" "ACGATACGTCACACGC-1" ...
#>  $ RemovedLowAbundanceGenes: chr [1:17212] "A1CF" "A2ML1" "A2ML1-AS1" "A2ML1-AS2" ...

We can also review the impact of QC on the dataset by regenerating the QC plots.

filtered_qc_plots <- plotGeneralQC(filtered_set)
grid.arrange(filtered_qc_plots$libsize_histogram, filtered_qc_plots$ngenes_hist, 
             filtered_qc_plots$control_hists$Mt, filtered_qc_plots$control_hists$Rb, 
             filtered_qc_plots$control_violins$Mt, filtered_qc_plots$control_violins$Rb,
             ncol = 2)

plot of chunk display_review_qc

3. Cell-cell normalisation

ascend offers a normalisation method based on Relative Log Expression (RLE). This is the method used by DESeq to normalise counts, but it has been adapted for zero-inflated data.

In this method, each cell is treated as one library, and most genes are assumed not to be differentially expressed. Only expression values higher than 0 are used to calculate the geometric mean of a gene, so each gene has a single geometric mean calculated across the cells in which it is detected. For each cell, the expression value of each gene is divided by that gene's geometric mean, giving one ratio per gene in that cell. The median of these ratios across all genes in the cell is the normalisation factor for the cell.

After RLE normalisation, a gene with 0 expression still has 0 expression. A gene with expression higher than 0 will have an expression value equal to the raw expression divided by the normalisation factor calculated for the cell.
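
The procedure described above can be sketched in a few lines of base R. The function rle_normalise is a hypothetical helper written for this illustration and is not ascend's normaliseByRLE; in particular, taking the median only over genes detected in the cell is an assumption on our part.

```r
rle_normalise <- function(counts) {
  # Geometric mean of each gene, using only the cells where it is detected.
  geo_means <- apply(counts, 1, function(x) {
    detected <- x[x > 0]
    if (length(detected) == 0) return(NA_real_)
    exp(mean(log(detected)))
  })

  # Normalisation factor of each cell: the median ratio of expression to
  # geometric mean, taken over the genes detected in that cell (assumption).
  norm_factors <- apply(counts, 2, function(x) {
    detected <- x > 0
    median(x[detected] / geo_means[detected])
  })

  # Divide every cell (column) by its factor; zeros stay zero.
  normcounts <- sweep(counts, 2, norm_factors, "/")
  list(geo_means = geo_means, norm_factors = norm_factors,
       normcounts = normcounts)
}

# Toy data: cell c2 has exactly twice the counts of c1, and c3 matches c1
# except for a dropout in gene g1.
toy <- matrix(c(2, 4, 0,
                4, 8, 4,
                1, 2, 1),
              nrow = 3, byrow = TRUE,
              dimnames = list(c("g1", "g2", "g3"), c("c1", "c2", "c3")))

result <- rle_normalise(toy)
print(result$norm_factors)
print(result$normcounts)
```

In this toy example, c1 and c2 end up with identical normalised profiles because they differ only in sequencing depth, and the dropout in c3 remains 0.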

norm_set <- normaliseByRLE(filtered_set)
#> [1] "Calculating geometric means..."
#> [1] "Calculating normalisation factor..."
#> [1] "Scaling counts..."
#> [1] "Storing normalised counts..."

Original counts are retained in the EMSet. Normalised counts and log-transformed normalised counts are stored in the EMSet as well, accessible via the normcounts and logcounts accessors from the SingleCellExperiment class.

# Raw counts
counts(norm_set)[1:5,1:5]
#>               AAACCTGAGCTGTTCA-1 AAACCTGCAATTCCTT-1 AAACCTGGTCTACCTC-1
#> FO538757.1             0.0000000          0.2222222          0.0000000
#> AP006222.2             0.2222222          0.0000000          0.2222222
#> RP11-206L10.9          0.0000000          0.0000000          0.0000000
#> LINC00115              0.0000000          0.0000000          0.0000000
#> FAM41C                 0.0000000          0.0000000          0.0000000
#>               AAACCTGTCGGAGCAA-1 AAACGGGAGTCGATAA-1
#> FO538757.1                     0          0.2222222
#> AP006222.2                     0          0.0000000
#> RP11-206L10.9                  0          0.0000000
#> LINC00115                      0          0.0000000
#> FAM41C                         0          0.0000000

# Normalised counts
normcounts(norm_set)[1:5,1:5]
#>               AAACCTGAGCTGTTCA-1 AAACCTGCAATTCCTT-1 AAACCTGGTCTACCTC-1
#> FO538757.1             0.0000000          0.2825285          0.0000000
#> AP006222.2             0.2797099          0.0000000          0.2644579
#> RP11-206L10.9          0.0000000          0.0000000          0.0000000
#> LINC00115              0.0000000          0.0000000          0.0000000
#> FAM41C                 0.0000000          0.0000000          0.0000000
#>               AAACCTGTCGGAGCAA-1 AAACGGGAGTCGATAA-1
#> FO538757.1                     0          0.2749356
#> AP006222.2                     0          0.0000000
#> RP11-206L10.9                  0          0.0000000
#> LINC00115                      0          0.0000000
#> FAM41C                         0          0.0000000

# log2(normalised counts + 1)
logcounts(norm_set)[1:5, 1:5]
#>               AAACCTGAGCTGTTCA-1 AAACCTGCAATTCCTT-1 AAACCTGGTCTACCTC-1
#> FO538757.1             0.0000000          0.3589909           0.000000
#> AP006222.2             0.3558168          0.0000000           0.338519
#> RP11-206L10.9          0.0000000          0.0000000           0.000000
#> LINC00115              0.0000000          0.0000000           0.000000
#> FAM41C                 0.0000000          0.0000000           0.000000
#>               AAACCTGTCGGAGCAA-1 AAACGGGAGTCGATAA-1
#> FO538757.1                     0          0.3504243
#> AP006222.2                     0          0.0000000
#> RP11-206L10.9                  0          0.0000000
#> LINC00115                      0          0.0000000
#> FAM41C                         0          0.0000000
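
The logcounts values printed above are consistent with log2(normcounts + 1), where the 1 is added before taking the logarithm so that zeros remain zero. A quick check on two of the printed normalised values:

```r
# Normalised values copied from the normcounts output above, plus a zero.
normalised <- c(0.2825285, 0.2797099, 0)

# Add a pseudocount of 1, then take log2.
log_transformed <- log2(normalised + 1)
print(round(log_transformed, 7))
```

The first two results can be compared against the corresponding logcounts entries for FO538757.1 and AP006222.2.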

We can generate some plots to review the impact of normalisation on the dataset.

norm_qc <- plotNormQC(norm_set)
#> [1] "Plotting GAPDH expression..."
#> [1] "Plotting MALAT1 expression..."
#> [1] "Plotting gene expression box plots..."

This set of plots depicts the library sizes of cells prior to and after RLE normalisation.

grid.arrange(norm_qc$libsize_histograms$count, 
             norm_qc$libsize_histograms$normcount, ncol = 1)

plot of chunk norm_libsize_plots

The following plots depict the impact of RLE normalisation on GAPDH counts.

grid.arrange(norm_qc$sampled_genes$GAPDH$counts,
             norm_qc$sampled_genes$GAPDH$normcounts, ncol = 1)

plot of chunk norm_gapdh

The following plots depict the impact of RLE normalisation on MALAT1 counts.

grid.arrange(norm_qc$sampled_genes$MALAT1$counts,
             norm_qc$sampled_genes$MALAT1$normcounts, ncol = 1)

plot of chunk norm_malat1

The following plots depict the impact of RLE normalisation on all gene counts from 100 randomly-selected cells.

grid.arrange(norm_qc$sampled_cell_gene_expression$counts,
             norm_qc$sampled_cell_gene_expression$normcounts, ncol = 1)

plot of chunk norm_genescatter

We have also included a wrapper for scran's deconvolution method.

4. Control removal

We can review the top expressors by examining the top gene expression boxplot.

filtered_qc_plots$topgenes_boxplot

plot of chunk unnamed-chunk-1

As mitochondrial and ribosomal genes still dominate top expression after filtering, we should remove them from the dataset so we can observe genes of interest.

# Exclude mitochondrial and ribosomal genes
norm_set <- excludeControl(norm_set, control = c("Mt", "Rb"))
print(plotTopGenesBoxplot(norm_set, n = 20))

plot of chunk plotTopGenesNoControls
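
Conceptually, excluding controls means dropping the rows of the expression matrix that belong to a control group. The sketch below selects controls by gene-name prefix, which is an assumption for illustration only; excludeControl works from the control lists stored in the EMSet. Note that a prefix such as RPS also matches genes like RPS6KA1, which appears in the Rb list shown earlier.

```r
# A few gene names from this vignette.
genes <- c("MT-ND1", "RPL22", "RPS8", "GAPDH", "MALAT1", "THY1")

# Assumed patterns: "MT-" for mitochondrial genes, "RPL"/"RPS" for ribosomal.
is_mt <- grepl("^MT-", genes)
is_rb <- grepl("^RP[LS]", genes)

# Keep only genes that are in neither control group.
kept_genes <- genes[!(is_mt | is_rb)]
print(kept_genes)
```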

5. Dimensionality reduction

Dimensionality reduction reduces the number of variables we have to assess, which reduces noise and speeds up subsequent analyses. It is also used to visualise the data in two- or three-dimensional space.

We will transform our data with Principal Component Analysis (PCA).

pca_set <- runPCA(norm_set, ngenes = 1500, scaling = TRUE)
#> [1] "Computing PCA values..."
#> [1] "PCA complete! Loading PCA into EMSet..."
reducedDim(pca_set, "PCA")[1:5,1:5]
#>                          PC1        PC2        PC3       PC4        PC5
#> AAACCTGAGCTGTTCA-1  5.981053 -1.5819663  2.7659207 -3.866405  2.1624718
#> AAACCTGCAATTCCTT-1 19.728636  0.6913078  0.4654700  2.509744 -0.9110175
#> AAACCTGGTCTACCTC-1 -3.853559  2.0534298 -3.2331363  7.292717 -1.6735128
#> AAACCTGTCGGAGCAA-1 18.650776  0.1556006  1.2830011  1.390305 -1.4029335
#> AAACGGGAGTCGATAA-1 12.538416 -0.3576140 -0.5896934  3.218731  0.2581786
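A rough base R equivalent of this step is shown below as a sketch. It assumes that `ngenes` selects the most variable genes and that `scaling = TRUE` corresponds to unit-variance scaling, neither of which is confirmed here; `toy_expr` is hypothetical random data.

```r
set.seed(1)

# Hypothetical expression matrix: 20 cells (rows) x 50 genes (columns)
toy_expr <- matrix(rnorm(20 * 50), nrow = 20,
                   dimnames = list(paste0("cell", 1:20),
                                   paste0("gene", 1:50)))

# Keep the 10 most variable genes
gene_vars <- apply(toy_expr, 2, var)
top_genes <- names(sort(gene_vars, decreasing = TRUE))[1:10]

# Centre and scale each gene, then run PCA
pca <- prcomp(toy_expr[, top_genes], center = TRUE, scale. = TRUE)

# Cell coordinates on the first five principal components
pca$x[1:5, 1:5]
```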

We can use a scree plot to observe how much of the variance in a dataset is explained by each principal component. The first PC always explains the most variance, but the number of PCs that explain a meaningful share of it differs between datasets.

plotPCAVariance(pca_set, n = 50)

plot of chunk plotPCAVariance
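The quantity shown in a scree plot can be computed directly from a PCA result. The following is a minimal sketch using base R's `prcomp` on hypothetical random data (`toy_expr`), not the values stored in the EMSet.

```r
set.seed(1)

# Hypothetical expression matrix: 20 cells (rows) x 10 genes (columns)
toy_expr <- matrix(rnorm(20 * 10), nrow = 20)
pca <- prcomp(toy_expr, center = TRUE, scale. = TRUE)

# Proportion of the total variance explained by each principal component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)

# Simple scree plot
barplot(var_explained, names.arg = seq_along(var_explained),
        xlab = "Principal component", ylab = "Proportion of variance")
```

A common rule of thumb is to keep the PCs before the point where the bars level off.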

As previously mentioned, these values can also be used to represent the data in two dimensions.

plotPCA(pca_set, PCX = 1, PCY = 2, group = "batch")

plot of chunk plotPCABatch

6. Clustering

6.1 Using runCORE

The runCORE function generates a distance matrix from the input and uses it to build a dendrogram. The dendrogram is then cut with the dynamicTreeCut algorithm, which selects clusters based on the shape and size of the branches. The cut is repeated over a range of heights from 0 to 1. The clusters generated at the most robust height, defined as the point where the number of clusters stabilises, are chosen as the optimal clustering result.

clustered_set <- runCORE(pca_set, 
                         conservative = FALSE, 
                         remove_outlier = TRUE,
                         nres = 40)
#> [1] "Calculating distance matrix..."
#> [1] "Generating hclust object..."
#> [1] "Using dynamicTreeCut to generate reference set of clusters..."
#> [1] "Checking if outliers are present..."
#> [1] "Generating clusters by running dynamicTreeCut at different heights..."
#> Warning in split.default(windows, 1:nworkers): data length is not a
#> multiple of split variable
#> 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |======================                                           |  33%
  |                                                                       
  |===========================================                      |  67%
  |                                                                       
  |=================================================================| 100%
#> 
#> [1] "Calculating rand indices..."
#> [1] "Calculating stability values..."
#> [1] "Aggregating data..."
#> [1] "Finding optimal number of clusters..."
#> [1] "Optimal number of clusters found! Returning output..."
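The overall logic of cutting a dendrogram at many heights and looking for a stable cluster count can be sketched in base R. This sketch swaps in fixed-height cuts with `cutree` in place of dynamicTreeCut, and uses the longest run of an unchanged cluster count in place of ascend's Rand-index-based stability score, so it illustrates the idea rather than reproducing runCORE. `toy_pcs` is hypothetical data.

```r
set.seed(1)

# Hypothetical PCA-like coordinates: two well-separated groups of 15 cells
toy_pcs <- rbind(matrix(rnorm(30, mean = 0), ncol = 2),
                 matrix(rnorm(30, mean = 10), ncol = 2))

# Distance matrix and hierarchical clustering
dist_mat <- dist(toy_pcs)
tree <- hclust(dist_mat, method = "ward.D2")

# Cut the tree at 40 relative heights between 0.025 and 1
rel_heights <- seq(0.025, 1, length.out = 40)
cluster_counts <- sapply(rel_heights, function(h) {
  length(unique(cutree(tree, h = h * max(tree$height))))
})

# The cluster count that persists over the longest run of heights
runs <- rle(cluster_counts)
stable_k <- runs$values[which.max(runs$lengths)]
stable_k
```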

runCORE generates a series of clustering-related objects that can be accessed as follows:

cluster_analysis <- clusterAnalysis(clustered_set)

# Number of clusters
cluster_analysis$nClusters
#> [1] 3

# Rand Index values
cluster_analysis$keyStats
#>    Height Stability RandIndex ConsecutiveRI ClusterCount
#> 1   0.025     0.250 1.0000000     1.0000000            5
#> 2    0.05     0.250 1.0000000     1.0000000            5
#> 3   0.075     0.250 1.0000000     1.0000000            5
#> 4     0.1     0.250 1.0000000     1.0000000            5
#> 5   0.125     0.250 1.0000000     1.0000000            5
#> 6    0.15     0.250 1.0000000     1.0000000            5
#> 7   0.175     0.250 1.0000000     1.0000000            5
#> 8     0.2     0.250 1.0000000     1.0000000            5
#> 9   0.225     0.250 1.0000000     1.0000000            5
#> 10   0.25     0.250 1.0000000     1.0000000            5
#> 11  0.275     0.025 0.9029543     0.9029543            4
#> 12    0.3     0.025 0.9029543     1.0000000            4
#> 13  0.325     0.025 0.8942298     0.9911850            3
#> 14   0.35     0.250 0.8942298     1.0000000            3
#> 15  0.375     0.250 0.8942298     1.0000000            3
#> 16    0.4     0.250 0.8942298     1.0000000            3
#> 17  0.425     0.250 0.8942298     1.0000000            3
#> 18   0.45     0.250 0.8942298     1.0000000            3
#> 19  0.475     0.250 0.8942298     1.0000000            3
#> 20    0.5     0.250 0.8942298     1.0000000            3
#> 21  0.525     0.250 0.8942298     1.0000000            3
#> 22   0.55     0.250 0.8942298     1.0000000            3
#> 23  0.575     0.250 0.8942298     1.0000000            3
#> 24    0.6     0.025 0.2304299     0.2773806            2
#> 25  0.625     0.400 0.2304299     1.0000000            2
#> 26   0.65     0.400 0.2304299     1.0000000            2
#> 27  0.675     0.400 0.2304299     1.0000000            2
#> 28    0.7     0.400 0.2304299     1.0000000            2
#> 29  0.725     0.400 0.2304299     1.0000000            2
#> 30   0.75     0.400 0.2304299     1.0000000            2
#> 31  0.775     0.400 0.2304299     1.0000000            2
#> 32    0.8     0.400 0.2304299     1.0000000            2
#> 33  0.825     0.400 0.2304299     1.0000000            2
#> 34   0.85     0.400 0.2304299     1.0000000            2
#> 35  0.875     0.400 0.2304299     1.0000000            2
#> 36    0.9     0.400 0.2304299     1.0000000            2
#> 37  0.925     0.400 0.2304299     1.0000000            2
#> 38   0.95     0.400 0.2304299     1.0000000            2
#> 39  0.975     0.400 0.2304299     1.0000000            2
#> 40      1     0.400 0.2304299     1.0000000            2
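The Rand index measures how often two clusterings agree about whether a pair of cells belongs together. Going by the column names, RandIndex presumably compares each height against the reference clustering and ConsecutiveRI compares it against the previous height, but that reading is an assumption. The sketch below computes the plain (unadjusted) Rand index in base R; `rand_index` is a hypothetical helper, not an ascend function.

```r
rand_index <- function(a, b) {
  stopifnot(length(a) == length(b))
  # For every pair of cells: are they in the same cluster?
  same_a <- outer(a, a, "==")
  same_b <- outer(b, b, "==")
  # Count each unordered pair once
  pairs <- upper.tri(same_a)
  # Fraction of pairs on which the two clusterings agree
  mean(same_a[pairs] == same_b[pairs])
}

reference <- c(1, 1, 1, 2, 2, 2)

# Same partition with swapped labels: perfect agreement
rand_index(reference, c(2, 2, 2, 1, 1, 1))

# A finer partition: partial agreement
rand_index(reference, c(1, 1, 2, 2, 3, 3))
```

Because the index only looks at pairs, it is unaffected by how the clusters are numbered.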

6.2 Visualising clustering results

The plotStabilityDendro and plotStability functions allow us to review the clustering process by showing the other clustering results and their corresponding Rand indices.

plotStabilityDendro(clustered_set)

plot of chunk stability_plot

#> $mar
#> [1] 1 5 0 1
plotStability(clustered_set)

plot of chunk plotStability

plotDendrogram generates a cluster-labelled dendrogram that also displays the size of each cluster.

plotDendrogram(clustered_set)
#> Warning in `labels<-.dendrogram`(dend, value = value, ...): The lengths
#> of the new labels is shorter than the number of leaves in the dendrogram -
#> labels are recycled.

plot of chunk plotDendrogram

Because the cluster assignments have been added to colInfo, they can also be shown on plots.

# Reduce dimensionality with t-SNE
clustered_set <- runTSNE(clustered_set, PCA = TRUE, 
                        dimensions = 2, seed = 1, 
                        perplexity = 30, theta = 0.5)
#> [1] "Running Rtsne..."
#> [1] "Rtsne complete! Returning matrix..."

# Generate a t-SNE plot
tsne_plot <- plotTSNE(clustered_set, Dim1 = 1, Dim2 = 2, group = "cluster")

# Generate a PCA plot
pca_plot <- plotPCA(clustered_set, PCX=1, PCY=2, group = "cluster")

# Generate MDS plot
mds_plot <- plotMDS(clustered_set, Dim1 = 1, Dim2 = 2, group = "cluster")
#> [1] "EMSet has undergone clustering. Retrieving distance matrix..."
#> [1] "Running cmdscale..."
#> [1] "Cmdscale complete! Processing scaled data..."
#> [1] "Generating MDS plot..."
tsne_plot

plot of chunk tsne_plot

pca_plot

plot of chunk cluster_pca_plot

mds_plot

plot of chunk mds_plot
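The console messages above mention cmdscale, which is base R's classical multidimensional scaling function. A minimal sketch of that step on hypothetical data (`toy_pcs`) follows; how ascend prepares and plots the result is not shown here.

```r
set.seed(1)

# Hypothetical PCA-like coordinates: 10 cells x 4 components
toy_pcs <- matrix(rnorm(40), ncol = 4,
                  dimnames = list(paste0("cell", 1:10), NULL))

# Pairwise distances between cells
dist_mat <- dist(toy_pcs)

# Classical MDS: find 2 dimensions that best preserve those distances
mds <- cmdscale(dist_mat, k = 2)
colnames(mds) <- c("Dim1", "Dim2")

plot(mds, xlab = "Dim1", ylab = "Dim2", pch = 19)
```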

7. Differential expression

For differential expression analysis, ascend offers wrappers for the DESeq and DESeq2 packages. However, they are not suited for some datasets and systems.

Let's quickly review our dataset.

col_info <- colInfo(clustered_set)
# Batch numbers
print("Cells per batch")
#> [1] "Cells per batch"
table(col_info$batch)
#> 
#>    1    2 
#> 1010  188

# Cluster numbers
print("Cells per cluster")
#> [1] "Cells per cluster"
table(col_info$cluster)
#> 
#>   1   2   3 
#> 656 428 114

This dataset is unbalanced, as the two batches and the three clusters vary significantly in size. Some models used for differential expression do not account for such imbalances.

For differential expression analysis, ascend implements a method described by McDavid, Finak, Chattopadyay et al. 2013 [https://academic.oup.com/bioinformatics/article-lookup/doi/10.1093/bioinformatics/bts714] that accounts for the bimodal nature of scRNA-seq data.

First, each gene is either on or off in a given cell, which can be represented with a discrete model. Second, when a gene is expressed, its expression level can be represented with a continuous model. The discrete and continuous models are incorporated into a combined Likelihood Ratio Test (LRT). Genes with no variance are assumed to have only a discrete model.
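The two-part test can be sketched for a single gene in base R. This is a simplified illustration that assumes a binomial model for detection and a normal model with a common variance for the expressed cells; it is not the code behind runDiffExpression, and `binom_loglik` and `combined_lrt` are hypothetical helper names.

```r
# Binomial log-likelihood at its maximum, treating 0 * log(0) as 0
binom_loglik <- function(k, n) {
  p <- k / n
  ll <- 0
  if (k > 0) ll <- ll + k * log(p)
  if (k < n) ll <- ll + (n - k) * log(1 - p)
  ll
}

# x, y: log-expression of one gene in two groups; 0 means not detected
combined_lrt <- function(x, y) {
  pos_x <- x[x > 0]
  pos_y <- y[y > 0]

  # Discrete part: do the two groups differ in the fraction of cells "on"?
  ll_alt <- binom_loglik(length(pos_x), length(x)) +
    binom_loglik(length(pos_y), length(y))
  ll_null <- binom_loglik(length(pos_x) + length(pos_y),
                          length(x) + length(y))
  stat <- max(0, 2 * (ll_alt - ll_null))
  df <- 1

  # Continuous part: do the expressed cells differ in mean expression?
  if (length(pos_x) > 1 && length(pos_y) > 1) {
    pooled <- c(pos_x, pos_y)
    rss_null <- sum((pooled - mean(pooled))^2)
    rss_alt <- sum((pos_x - mean(pos_x))^2) + sum((pos_y - mean(pos_y))^2)
    # With no variance, fall back to the discrete model only
    if (rss_alt > 0) {
      stat <- stat + length(pooled) * log(rss_null / rss_alt)
      df <- 2
    }
  }

  list(statistic = stat, df = df,
       p.value = pchisq(stat, df, lower.tail = FALSE))
}

# Gene off in one group and on in the other: only the discrete part applies
combined_lrt(c(0, 0, 0, 0), c(1, 2, 3, 4))
```

Summing the two statistics and comparing against a chi-squared distribution lets a gene be called differentially expressed because of a change in detection rate, a change in expression level, or both.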

# Run combined LRT
cluster1_vs_others <- runDiffExpression(clustered_set, group = "cluster",
                                        condition.a = 1, condition.b = c(2, 3),
                                        subsampling = FALSE, ngenes = 1500)
#> [1] "Identifying genes to retain..."
#> [1] "Running LRT..."
#> 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |======================                                           |  33%
  |                                                                       
  |===========================================                      |  67%
  |                                                                       
  |=================================================================| 100%
#> 
#> [1] "LRT complete! Returning results..."
cluster2_vs_others <- runDiffExpression(clustered_set, group = "cluster",
                                        condition.a = 2, condition.b = c(1, 3),
                                        subsampling = FALSE, ngenes = 1500)
#> [1] "Identifying genes to retain..."
#> [1] "Running LRT..."
#> 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |======================                                           |  33%
  |                                                                       
  |===========================================                      |  67%
  |                                                                       
  |=================================================================| 100%
#> 
#> [1] "LRT complete! Returning results..."
cluster3_vs_others <- runDiffExpression(clustered_set, group = "cluster",
                                        condition.a = 3, condition.b = c(1, 2),
                                        subsampling = FALSE, ngenes = 1500)
#> [1] "Identifying genes to retain..."
#> [1] "Running LRT..."
#> 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |======================                                           |  33%
  |                                                                       
  |===========================================                      |  67%
  |                                                                       
  |=================================================================| 100%
#> 
#> [1] "LRT complete! Returning results..."
# Generate volcano plots
cluster1_volcano <- plotVolcano(cluster1_vs_others, l2fc = 2, 
                                threshold = 5e-3, 
                                labels = TRUE,
                                label.size = 3,
                                check.overlap = TRUE)

cluster2_volcano <- plotVolcano(cluster2_vs_others, l2fc = 2, 
                                threshold = 5e-3, 
                                labels = TRUE,
                                label.size = 3,
                                check.overlap = TRUE)

cluster3_volcano <- plotVolcano(cluster3_vs_others, l2fc = 2, 
                                threshold = 5e-3, 
                                labels = TRUE,
                                label.size = 3,
                                check.overlap = TRUE)

library(ggplot2)
cluster1_volcano <- cluster1_volcano + ggtitle("Cluster 1 vs Cluster 2 and 3")
cluster2_volcano <- cluster2_volcano + ggtitle("Cluster 2 vs Cluster 1 and 3")
cluster3_volcano <- cluster3_volcano + ggtitle("Cluster 3 vs Cluster 1 and 2")
cluster1_volcano

plot of chunk cluster1_volcano

cluster2_volcano

plot of chunk cluster2_volcano

cluster3_volcano

plot of chunk cluster3_volcano
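To make the two cutoffs concrete, the sketch below flags genes in a small hypothetical results table and draws a bare-bones volcano plot with base R. It assumes that `threshold` is a cutoff on the adjusted p-value and `l2fc` a cutoff on the absolute log2 fold change; the table and its column names are illustrative and do not reflect the structure returned by runDiffExpression.

```r
# Hypothetical results table; column names are illustrative only
toy_results <- data.frame(
  gene = c("geneA", "geneB", "geneC", "geneD"),
  log2fc = c(3.1, -2.5, 0.4, 2.2),
  padj = c(1e-6, 1e-4, 1e-8, 0.2)
)

l2fc_cutoff <- 2
p_cutoff <- 5e-3

# A gene is highlighted only if it passes both cutoffs
toy_results$significant <- abs(toy_results$log2fc) >= l2fc_cutoff &
  toy_results$padj <= p_cutoff

# Fold change on the x axis, significance on the y axis
plot(toy_results$log2fc, -log10(toy_results$padj),
     col = ifelse(toy_results$significant, "red", "grey40"), pch = 19,
     xlab = "log2 fold change", ylab = "-log10 adjusted p-value")
abline(v = c(-l2fc_cutoff, l2fc_cutoff), h = -log10(p_cutoff), lty = 2)
```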

8. Branching out with other R packages

In this tutorial, we covered the main ascend workflow, which completes the following tasks:

  1. Quality control
  2. Filtering
  3. Normalisation
  4. Dimensionality reduction
  5. Clustering
  6. Differential expression

As the package covers so many steps, it does not cover the individual steps in as much detail as other packages do. Instead, we have chosen methods that are fast and effective. If you require more detailed analysis, you may export the data stored in the EMSet for use with other packages.

sce_object <- EMSet2SCE(clustered_set)

# Extract data from EMSet
# Expression matrix
count_matrix <- counts(clustered_set)
normalised_matrix <- normcounts(clustered_set)
log_matrix <- logcounts(clustered_set)

Package wrappers

We have created wrappers for specific tasks in scran, DESeq and DESeq2. They provide alternatives for specific stages in the ascend workflow and provide more in-depth analysis.

scran

We used this normalisation method for our publication. This method groups the cells into pools for size factor calculation before deconvolving the pool-based size factors into cell-specific size factors.

# scran normalisation by deconvolution
scran_normalised <- scranNormalise(filtered_set, quickCluster = FALSE, min_mean = 1e-05)
#> 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |=================================================================| 100%
#> 
#> [1] "1198 cells detected. Running computeSumFactors with preset sizes of 40, 60, 80, 100..."
#> [1] "scran's computeSumFactors complete. Adjusting zero sum factors..."
#> [1] "Running scater's normalize method..."
#> [1] "Normalisation complete. Converting SingleCellExperiment back to EMSet..."

DESeq

cluster2_vs_others <- runDESeq(clustered_set, group = "cluster",
                               condition.a = 2, condition.b = c(1, 3),
                               ngenes = 1500)
#> [1] "Identifying genes to retain..."
#> [1] "Running DESeq..."
#> 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |======================                                           |  33%
  |                                                                       
  |===========================================                      |  67%
  |                                                                       
  |=================================================================| 100%
#> 
#> [1] "DESeq complete! Adjusting results..."
#> Warning in runDESeq(clustered_set, group = "cluster", condition.a = 2,
#> condition.b = c(1, : NaNs produced
# Plot the DESeq results generated above
plotVolcano(cluster2_vs_others, labels = TRUE)

plot of chunk DESeq

DESeq2

DESeq2 requires more time than this workshop allows, so we will not run it. The call is shown here for reference.

cluster1_vs_others <- runDESeq2(clustered_set, group = "cluster",
                               condition.a = 1, condition.b = c(2, 3),
                               ngenes = 1500)