PSORT Analysis

All analysis was conducted in R version 4.0.2 using the following script. Computations were performed on a MacBook Pro with 16GB of RAM and an i7 quad-core processor.

If you haven’t already installed the bioplotr package, you’ll need to do so to reproduce the figures below.

# Install bioplotr
devtools::install_github('dswatson/bioplotr')

# Load libraries, register cores
library(bioplotr)
library(data.table)
library(limma)
library(edgeR)
library(splines)
library(qvalue)
library(tidyverse)
library(ggsci)
library(doMC)
registerDoMC(8)

Preliminaries

Following RNA-seqencing of all samples, reads were pseudo-aligned using kallisto (Bray et al., 2016). Our reference genome was GRCh38.p10, Ensembl version 91. Transcript-level reads were aggregated to gene-level using the tximport package (Soneson et al., 2015). We save the raw count matrix for later use.

# Load data
anno <- fread('./Data/Hs.anno.csv')
clin <- fread('./Data/clinical_samples.csv')
t2g <- readRDS('./Data/Hs91.t2g.rds')
files <- file.path('./Aligned', clin$Sample_id, 'abundance.tsv')
txi <- tximport(files, type = 'kallisto', tx2gene = t2g, importer = fread, 
                countsFromAbundance = 'lengthScaledTPM')

# Export counts
cnts <- txi$counts
saveRDS(cnts, 'cnts.rds')

Clinical Exploratory Data Analysis (EDA)

Before examining the transcriptomic data, we analyse the clinical data. Our goal here is to get a sense for general trends in PASI trajectories for patients, as well as to compare performance by drug and check for potential outliers.

Close inspection by our colleagues in Newcastle has revealed mislabelled samples for one patient, whose lesional and nonlesional samples were swapped. We correct the mistake before running EDA or DEA.

# Relabel mislabelled samples
clin[Sample_id %in% c('006041-0005', '023012-1205'), Tissue := 'Nonlesional'
  ][Sample_id %in% c('006041-0006', '023012-1206'), Tissue := 'Lesional']

Baseline vs. 12 Week PASI

One helpful way to visualise the study’s results is with a baseline vs. 12 week PASI scatterplot. We include a black line of equality as well as zero-intercept regression lines for each drug. A perfect treatment would have a horizontal best fit line at (y = 0), representing full improvement regardless of baseline PASI; a completely ineffectual treatment would have a best fit line of (45^{\circ}), representing no improvement over the course of the study.

# Load patient-level clinical data
df <- fread('./Data/clinical_patients.csv')

# Regress 12 week PASI on baseline for each drug
m_a <- lm(PASI_wk12 ~ 0 + PASI_wk00, data = df[Drug == 'Adalimumab'])
m_u <- lm(PASI_wk12 ~ 0 + PASI_wk00, data = df[Drug == 'Ustekinumab'])

# Plot results
cols <- pal_d3()(2)
ggplot(df, aes(PASI_wk00, PASI_wk12, colour = Drug)) + 
  geom_point(size = 1) + 
  geom_abline(intercept = 0, slope = 1) + 
  geom_abline(intercept = 0, slope = coef(m_a), colour = cols[1]) + 
  geom_abline(intercept = 0, slope = coef(m_u), colour = cols[2]) + 
  xlim(0, 35) + ylim(0, 35) + 
  labs(title = 'Baseline vs. 12 Week PASI',
           x = 'Baseline PASI',
           y = '12 Week PASI') + 
  theme_bw() + 
  theme(plot.title = element_text(hjust = .5)) + 
  scale_color_d3()

This plot reveals several a couple of interesting points. First of all, two patients appear to have gotten worse over the course of study. Second, there are a number of high leverage points among the adalimumab patients that increase the slope of their regression line. This is why ustekinumab appears to be the more effective treatment according to a least squares estimate.

Transcriptomic EDA

Before conducting transcriptomic EDA, we merge counts with gene-level annotation data and remove probes with less than one count per million (CPM) in at least 20 libraries. This ensures that every probe is expressed in at least 5% of all samples. Counts are then TMM normalised prior to modelling (Robinson & Oshlack, 2010). See the extensive edgeR package vignette for more details.

# Gene-level annotations 
gene_idx <- tibble(EnsemblID = rownames(cnts)) %>%
  inner_join(anno, by = 'EnsemblID') %>%
  select(EnsemblID, GeneSymbol)
cnts <- cnts[gene_idx$EnsemblID, clin$Sample_id]
y <- DGEList(cnts, genes = gene_idx)

# Filter genes
keep <- rowSums(cpm(y) > 1) >= 20  
y <- DGEList(y[keep, ])
y <- calcNormFactors(y)

# Check dimensionality
dim(y)

## [1] 16172   400

The analysis will proceed with 16,172 probes. We use diagnostic plots and dimensionality reduction techniques to check for outliers, validate our preprocessing pipeline, and search for unsupervised clusters.

Mean-Variance

We begin by examining the data’s mean-variance trend. With this plot, we can examine the relationship between the first two moments of the transriptomic data and visually inspect for outlying probes. The shape of the curve is especially important for RNA-seq data analysed using limma::voom, as results from this LOWESS regression are used to fit weighted least squares models in subsequent differential expression tests (Law et al., 2014).

plot_voom(y)

This mean-variance trend looks about right for these data. The trend is as expected and no clear outliers emerge from this figure.

For the remaining plots, we conduct EDA on the log2-CPM scale of the normalised count matrix. We add 1 to all counts to avoid taking the logarithm of 0.

# Transform counts
mat <- cpm(y, log = TRUE, prior.count = 1)

# Save output
saveRDS(mat, './Data/lcpm_mat.rds')

Density Plot

Plotting expression densities for all samples is a good way to evaluate the overall skew of the data and visually inspect for outlying samples. We color the curves by tissue type to observe potential transcriptomic variation by tissue.

plot_density(mat, group = list(Tissue = clin$Tissue),  
             xlab = expression(log[2]*'-CPM'))

These sample-wise densities look exceptionally clean, and approximately symmetric under log2-CPM transform. The slight kink in the low end of the spectrum is common with count data, and there appears to be some systematic variation by tissue type, as we would expect.

Subject Similarity Matrix

We build a subject similarity matrix by calculating the pairwise Euclidean distance between all samples in the data. This matrix can then be visualised as a heatmap and used to build a hierarchical clustering dendrogram. We annotate the plot with phenotypic information to see if the unsupervised clusters track known clinical groupings.

plot_similarity(mat, group = list(Tissue = clin$Tissue, 
                                    Drug = clin$Drug,
                                    Time = clin$Time), 
                covar = list(PASI = clin$PASI))

Once again, we find systematic variation by tissue type. Differences by other phenotypic markers are considerably less pronounced. For a more informative visualisation, we should probably look within tissue type. We leave that for later, however, as our goal here is simply to confirm overall data quality.

Drivers Plot

Another way to visualise clinical drivers of transcriptomic variation is a drivers plot. To create this figure, we perform a principal component analysis (PCA) of the log2-CPM matrix. We then test associations between principal components of the data and individual clinical features.

plot_drivers(mat, select(clin, -Sample_id, Patient_id, -outcome), 
             alpha = 0.05, p_adj = 'fdr')

We find here that tissue is the overwhelming driver of variation in the first principal component, which accounts for nearly a third of all variation in the transcriptomic data. Time and PASI are also major drivers. Fortunately, it looks like there are little or no batch effects to speak of in these data, which is reassuring. Also note that Drug does not appear to be a major source of variation, which suggests that there was no confounding between omic markers and treatment assignment.

Principal Component Analysis

We plot the samples in a two-dimensional space defined by the first two principal components. This projection allows us to visually inspect for outliers and unsupervised clustering effects. We mark the samples by tissue type and time.

plot_pca(mat, group = list(Tissue = clin$Tissue, Time = clin$Time), size = 2)

Once again, we find a nice separation between lesional and nonlesional samples, although the boundary between the two is not perfectly linear. Note, however, that the areas of apparent overlap tend to occur in week 12, suggesting that effective treatment has blurred the distinction between the two tissue types.

Kernel PCA

Kernel PCA (KPCA) is a nonlinear variant of PCA that uses a Gaussian radial basis function (RBF) kernel to reweight the samples. For details, see Schölkopf et al., 1998. This transformation is designed to preserve local structure, capturing some interactions in the feature space that cannot be captured by linear combinations of the input variables.

plot_kpca(mat, group = list(Tissue = clin$Tissue, Time = clin$Time), size = 2)

This plot once again demonstrates some decent segregation between tissue types along the first PC. There is evidently greater variation among nonlesional samples than lesional samples, as they fan out across the second PC. Note that one baseline nonlesional sample appears somewhat stranded in a sea of lesional points, suggesting possible mislabelling.

t-SNE

Another popular embedding technique is t-distributed stochastic neighbour embedding (t-SNE). This method, originally developed by van der Maaten & Hinton (2008), explicitly attempts to preserve local structure when mapping a large matrix onto a low-dimensional subspace. It is therefore ideal for unsupervised cluster detection.

plot_tsne(mat, group = list(Tissue = clin$Tissue, Time = clin$Time), size = 2)

This plot shows even neater separation between lesional and nonlesional samples across opposing quadrants of the Cartesian plane. Once again, a single nonlesional sample appears to be potentially mislabelled.

Name		Name	Last commit message	Last commit date
Latest commit History 17 Commits
Aggregate_counts		Aggregate_counts
Cell_Type_Correlations		Cell_Type_Correlations
DEA_files		DEA_files
Factor_analysis		Factor_analysis
WGCNA		WGCNA
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

PSORT Analysis

Preliminaries

Clinical Exploratory Data Analysis (EDA)

Baseline vs. 12 Week PASI

Transcriptomic EDA

Mean-Variance

Density Plot

Subject Similarity Matrix

Drivers Plot

Principal Component Analysis

Kernel PCA

t-SNE

About

Releases

Packages

Contributors 6

Languages

C4TB/PSORT

Folders and files

Latest commit

History

Repository files navigation

PSORT Analysis

Preliminaries

Clinical Exploratory Data Analysis (EDA)

Baseline vs. 12 Week PASI

Transcriptomic EDA

Mean-Variance

Density Plot

Subject Similarity Matrix

Drivers Plot

Principal Component Analysis

Kernel PCA

t-SNE

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Contributors 6

Languages

Packages