Skip to content

Latest commit

 

History

History
182 lines (136 loc) · 9.22 KB

27.deu.md

File metadata and controls

182 lines (136 loc) · 9.22 KB

Differential exon usage

NOTE: This section is based on the code provided in the DEXSeq vignette, which can be checked for extra information.

So far we have been focusing on analysing the transcriptome from a gene-centric perspective. However, one of the advantages of RNA sequencing is that it allows us to address questions about alternative splicing in a much easier way than it used to be possible with microarrays. There are a large number of methods to detect significant differences in splicing across conditions (e.g. cuffdiff, mmdiff), many of which rely on the non-trivial task of estimating transcript expression levels. Here we will focus on DEXSeq, a Bioconductor package for the identification of differential exon usage events from exon counts. In other words, DEXSeq reports cases where the expression of a given exon, relative to the expression of the corresponding gene, changes across conditions.

The structure of a DEXSeq analysis is analogous to the one we have seen so far for differential expression with DESeq: count normalisation, dispersion estimation and differential exon usage test. However, there are a couple of things to take into account before getting started with that workflow, as we'll see next.

Preparing the annotation

Exons might be represented multiple times in a standard GTF file if they are shared between multiple transcripts or genes. This overlap can include the entire exon, or just part of it, as illustrated in Figure 1 from the DEXSeq paper. Thus, in order to ensure that each exon is tested only once, DEXSeq requires a slightly modified annotation file, in which exons are flattened into counting bins. We only need to prepare this flattened annotation file once, and this can be achieved by executing the command below:

# do not run - it takes a while
# you'll find the output here: RNAseq_all/reference

cd RNAseq_all/reference
python /path/to/library/DEXSeq/python_scripts/dexseq_prepare_annotation.py \
	--aggregate=no Drosophila_melanogaster.BDGP5.25.62.gtf \
	Drosophila_melanogaster.BDGP5.25.62_dexseq_noaggregate.gff

Exercise: How does the newly generated annotation file differ from the previous one? Try this:

cd RNAseq_all/reference

original=Drosophila_melanogaster.BDGP5.25.62.gtf
grep FBgn0031208 $original | awk '$3=="exon"'

flattened=Drosophila_melanogaster.BDGP5.25.62_dexseq_noaggregate.gff
grep FBgn0031208 $flattened | awk '$3=="exonic_part"'

What is the number of lines obtained in each case? What is the number of counting bins for this gene? Hint - Solution

Exercise: One of the options used in this example to generate the flattened annotation file is --aggregate=no. What does this refer to? Hint:

python /path/to/library/DEXSeq/python_scripts/dexseq_prepare_annotation.py -h

Solution

Each of the exonic parts in the flattened annotation file are the potentially testable bins. However, before we can perform the testing, we first need to know the number of reads that overlap with each of them.

Counting reads overlapping exon bins

Provided that we have aligned our reads to the genome with a splice-aware aligner (e.g. TopHat) we can now proceed to count the number of reads that fall into exon bins in our sample:

# do not run - it takes a while
# you'll find the output here: RNAseq_all/data/mapped
gff_file=Drosophila_melanogaster.BDGP5.25.62_dexseq_noaggregate.gff
bam_file=../data/mapped/untreated3.nsorted.bam
out=$bam_file.dexseq_noaggregate.txt

samtools view $bam_file | python /path/to/library/DEXSeq/inst/python_scripts/dexseq_count.py \
	--paired=yes --stranded=no $gff_file - $out

Exercise: By default, dexseq_count will only consider reads that mapped with a mapping quality of 10 or more. Even though we didn't explicitly set this option in the command above, we can learn about this on the help text:

python /path/to/library/DEXSeq/inst/python_scripts/dexseq_count.py -h

What does this threshold refer to? Solution

Exercise: Previously we have been calculating the number of reads that overlap each gene using htseq-count. What's the difference between these two tools? Hint: think of the previous steps in this section and the input required by this tool. Also, would the gene counts generated with these two tools be the same? Solution

Loading the counts into R

Finally we just need to load the counts into R. Here we'll be working with an example dataset, and thus we'll be loading the counts directly from the pasilla library:

library(DEXSeq)
library("pasilla")

data(pasillaDEXSeqDataSet)
ecs=dxd ## the data loaded at the previous line is stored in an object named `dxd`
head(counts(ecs))

Alternative for your own data

Alternatively, to load the count files for your experiment into R, you should first generate a table summarising your experimental design:

cat sampleTable.txt

#             	countFile           condition
# untreated1	untreated1.counts   control
# untreated2 	untreated2.counts   control
# untreated3 	untreated3.counts   control
# untreated4 	untreated4.counts   control
# treated1      treated1.counts     knockdown
# treated2      treated2.counts     knockdown
# treated3      treated3.counts     knockdown

And then load it in R and generate an expression set object:

library(DEXSeq)

sampleTable=read.table("sampleTable.txt")
annot="Drosophila_melanogaster.BDGP5.25.62_dexseq_noaggregate.gff"

ecs = read.HTSeqCounts(
    	countfiles = sampleTable$countFile,
    	design = sampleTable,
    	flattenedfile = annot )

Count normalisation

Independently on whether you're working with the example counts (available through the pasilla library) or the ones for your own samples, the next essential step of the workflow consists on estimating the size factors for each library, used to take into account variable sequencing depths. This step is common to the one previously followed with DESeq (see the section on Normalising counts):

ecs=estimateSizeFactors(ecs)
sizeFactors(ecs)

Dispersion estimation

Also analogous to the workflow followed with DESeq is the step on estimating the dispersion on the observed counts for each of the exons (see the section on Differential gene expression). This information is used to quantify the variability that we can expect between biological replicates, and will help us in addressing which of the observed differences are big enough to be attributed to a change in the condition.

ecs=estimateDispersions( ecs )

We can next plot the calculated dispersion estimates as a function of the mean normalised counts, just as a sanity test:

out="dexseq_dispersion.pdf"
pdf(file=out)
plotDispEsts( ecs )
dev.off()

Exercise: Exons with a low number of counts tend to have very high variability and will not end up as a significant result. In order to reduce computation time, DEXSeq skips such exons, as specified in the help for the estimateDispersions function:

?estimateDispersions

Differential exon usage test

Finally, we can test for differential exon usage and calculate the fold-changes:

ecs=testForDEU(ecs)
ecs=estimateExonFoldChanges( ecs )

result=DEXSeqResults(ecs)
head( result )

Exercise: Why do we get NA values for FBgn0000256:E006? Solution

Exercise: How many exons do we identify as differentially used (e.g. FDR < 0.1)? How many genes? Solution

Visualisation

It is in general a good practise to visualise the results in the form of an MA-plot:

out="dexseq_ma.pdf"
pdf(file=out)
plotMA( ecs, FDR=0.1, ylim=c(-4,4), cex=0.8 )
dev.off()

In addition, DEXSeq offers a nice way to visualise differential exon usage events for a given gene:

out="dexseq_FBgn0085442.pdf"
pdf(file=out)

plotDEXSeq( result, "FBgn0085442", expression=FALSE, 
	norCounts=TRUE, displayTranscripts=TRUE,
   	legend=TRUE, cex.axis=1.2, cex=1.3, lwd=2 )

dev.off()