# Gene Set Enrichment Analysis
```{r setup, include=FALSE}
knitr::opts_chunk$set(warning = FALSE,
message = FALSE, echo = TRUE)
knitr::opts_chunk$set(tidy.opts = list(width.cutoff = 60),
tidy = TRUE)
sapGO <- readRDS("data/sapGO.rds")
suppressPackageStartupMessages({
library(clusterProfiler)
library(fgsea)
library(enrichplot)
library(annotate)
library(ggplot2)
library(dplyr)
library(DOSE)
library(GOSemSim)
library(ViSEAGO)
library(topGO)
library(org.Sc.sgd.db)
library(org.Hs.eg.db)})
```
The .Rmd file for this chapter can be
found
[here](https://github.com/gurinina/omic_sciences/blob/main/21-GO-enrichment.Rmd).

A good starting point for learning about
GO set enrichment analysis, and the
different methods that can be used to
perform it, is this Nature Protocols paper:

Pathway enrichment analysis and
visualization of omics data using
g:Profiler, GSEA, Cytoscape and
EnrichmentMap

Jüri Reimand, Ruth Isserlin, Veronique
Voisin, Mike Kucera, Christian
Tannus-Lopes, Asha Rostamianfar, Lina
Wadi, Mona Meyer, Jeff Wong, Changjiang
Xu, Daniele Merico and Gary D. Bader.

Pathway enrichment analysis helps
researchers gain mechanistic insight
into gene lists generated from
genome-scale (omics) experiments. This
method identifies biological pathways
that are enriched in a gene list more
than would be expected by chance. We
explain the procedures of pathway
enrichment analysis and present a
practical step-by-step guide to help
interpret gene lists resulting from
RNA-seq and genome-sequencing
experiments. The protocol comprises
three major steps: definition of a gene
list from omics data, determination of
statistically enriched pathways, and
visualization and interpretation of the
results. We describe how to use this
protocol with published examples of
differentially expressed genes and
mutated cancer genes; however, the
principles can be applied to diverse
types of omics data. The protocol
describes innovative visualization
techniques, provides comprehensive
background and troubleshooting
guidelines, and uses freely available
and frequently updated software,
including g:Profiler, Gene Set
Enrichment Analysis (GSEA), Cytoscape
and EnrichmentMap. The complete protocol
can be performed in \~4.5 h and is
designed for use by biologists with no
prior bioinformatics training.

Comprehensive quantification of DNA, RNA
and proteins in biological samples is
now routine. The resulting data are
growing exponentially, and their
analysis helps researchers discover
novel biological functions,
genotype--phenotype relationships and
disease mechanisms(1,2). However, analysis
and interpretation of these data
represent a major challenge for many
researchers. Analyses often result in
long lists of genes that require an
impractically large amount of manual
literature searching to interpret. A
standard approach to addressing this
problem is pathway enrichment analysis,
which summarizes the large gene list as
a smaller list of more easily
interpretable pathways. Pathways are
statistically tested for
over-representation in the experimental
gene list relative to what is expected
by chance, using several common
statistical tests that consider the
number of genes detected in the
experiment, their relative ranking and
the number of genes annotated to a
pathway of interest. For instance,
experimental data containing 40% cell
cycle genes are surprisingly enriched,
given that only 8% of human
protein-coding genes are involved in
this process. In a recent example, we
used pathway enrichment analysis to help
identify histone and DNA methylation by
the polycomb repressive complex (PRC2)
as the first rational therapeutic target
for ependymoma, one of the most
prevalent childhood brain cancers(3). This
pathway is targetable by available drugs
such as 5-azacytidine, which was used on
a compassionate basis in a terminally
ill patient and stopped rapid metastatic
tumor growth(3). In another example, we
analyzed rare copy-number variants (CNVs)
in autism and identified several
significant pathways affected by gene
deletions, whereas few significant hits
were identified with case--control
association tests of single genes or
loci(4,5). These examples illustrate the
useful insights into biological
mechanisms that can be achieved using
pathway enrichment analysis.
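
To make the cell cycle example above concrete, here is a minimal sketch of an over-representation test in base R. The counts are made up (a hypothetical universe of 20,000 genes and a gene list of 100), chosen only to match the 40% and 8% figures; they are not data from the paper. It shows that a one-sided Fisher's exact test and the hypergeometric tail give the same p-value.

```r
# Hypothetical counts chosen to match the 40% vs 8% example; not from the paper.
universe_size   <- 20000  # all genes considered
pathway_size    <- 1600   # 8% of the universe are cell cycle genes
list_size       <- 100    # genes in the experimental gene list
list_in_pathway <- 40     # 40% of the list are cell cycle genes

# 2 x 2 contingency table: list membership vs pathway membership
tab <- matrix(
  c(list_in_pathway, list_size - list_in_pathway,
    pathway_size - list_in_pathway,
    universe_size - pathway_size - (list_size - list_in_pathway)),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("in_list", "not_in_list"),
                  c("in_pathway", "not_in_pathway"))
)

# One-sided Fisher's exact test for over-representation
ft <- fisher.test(tab, alternative = "greater")

# The same p-value from the hypergeometric tail, P(X >= 40)
p_hyper <- phyper(list_in_pathway - 1, pathway_size,
                  universe_size - pathway_size, list_size,
                  lower.tail = FALSE)

ft$p.value
p_hyper
```

With these toy numbers the expected count is 8 cell cycle genes in a list of 100, so observing 40 gives a very small p-value.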
## Application to diverse omics data
This protocol uses RNA-seq data and
somatic mutation data(6) as examples
because these data types are frequently
encountered. However, the general
concepts of pathway enrichment analysis
that we present are applicable to many
types of experiments that can generate
lists of genes, such as single-cell
transcriptomics, CNVs, proteomics,
phosphoproteomics, DNA methylation
and metabolomics. Most data types
require protocol modifications, which we
only briefly discuss here. With certain
data types, specialized computational
methods are required to produce a gene
list that is appropriate for pathway
enrichment analysis, whereas with other
data types, a specialized pathway
enrichment analysis technique is
required.
## Pathway enrichment analysis methods
This protocol recommends the use of
g:Profiler and GSEA software for pathway
enrichment analysis. g:Profiler
analyzes gene lists using Fisher's exact
test and ordered gene lists using a
modified Fisher's test. It provides a
graphical web interface and access via R
and Python programming languages. The
software is frequently updated, and the
gene set database can be downloaded as a
[GMT file](http://biit.cs.ut.ee/gprofiler).
[GSEA](http://software.broadinstitute.org/gsea)
analyzes ranked gene lists using
a permutation-based test. The software
typically runs as a desktop application.
Hundreds of pathway enrichment analysis
tools exist (reviewed in ref. Khatri,
P., Sirota, M. & Butte, A. J. Ten years
of pathway analysis: current approaches
and outstanding challenges. PLoS Comput.
Biol. 8, e1002375 (2012)), although many
rely on out-of-date pathway databases or
lack unique features as compared to the
most commonly used tools; as such, we do
not cover them here. The following are
alternative free pathway enrichment
analysis software tools. Although we do
not cover these tools in our protocol,
we recommend the following, on the basis
of their ease of use, unique features or
advanced programming features.
## Comparison to alternative methods
(See the paper for references.)

Enrichr(37):
This is a web-based enrichment analysis
tool for non-ranked gene lists that is
based on Fisher's exact test. It is easy
to use, has rich interactive reporting
features, and includes \>100 gene set
databases (called libraries), including
\>180,000 gene sets in multiple
categories. Functionality is similar to
that of the g:Profiler web server
described in this protocol.

[Camera(71)](https://bioconductor.org/packages/release/bioc/html/limma.html):
This R Bioconductor package analyzes
gene lists and corrects for inter-gene
correlations such as those apparent in
gene co-expression data. The software is
available as part of the limma package
in Bioconductor (this is an advanced
tool that requires programming
expertise; Supplementary Protocol 3).
It is similar to `mroast` and `roast`,
which are also part of limma.

[GOseq(72)](https://bioconductor.org/packages/release/bioc/html):
This R Bioconductor package analyzes
gene lists from RNA-seq experiments by
correcting for user-selected covariates
such as gene length (this is an advanced
tool that requires programming
expertise).

[Genomic Regions Enrichment of Annotations Tool (GREAT)(67)](http://bejerano.stanford.edu/great/public/html/):
In contrast to common methods that
analyze gene lists, GREAT analyzes
genomic regions such as DNA binding
sites and links these to nearby genes
for pathway enrichment analysis. See the
'Application to diverse omics data'
section.

Source: Nature Protocols, vol. 14, February
2019, 482--517, www.nature.com/npro

## Visualization tools
This protocol recommends the use of
[EnrichmentMap](http://www.baderlab.org/Software/EnrichmentMap)
for pathway enrichment
analysis visualization to aid
interpretation. EnrichmentMap(16) is a
Cytoscape application that visualizes
the results from pathway enrichment
analysis and eases interpretation by
displaying pathways as a network in
which overlapping pathways are clustered
together to identify major biological
themes in the results.

Two useful alternative visualization
tools are:

[ClueGO(40)](https://apps.cytoscape.org/apps/cluego):
This Cytoscape application
is conceptually similar to EnrichmentMap
and provides a network-based
visualization to reduce redundancy of
results from pathway enrichment
analysis. It also includes a pathway
enrichment analysis feature for analysis
of GO annotations using Fisher's exact
tests. However, it currently supports
only GO gene sets.

[PathVisio(49)](https://pathvisio.org/):
This desktop application
provides a complementary visualization
approach to those of EnrichmentMap and
ClueGO. PathVisio enables the user to
visually interpret omics data in the
context of gene and protein interactions
in a pathway of interest.
[PathVisio](https://www.pathvisio.org)
colors pathway genes according to
user-provided omics data. This is the
main advantage of PathVisio as compared
to EnrichmentMap and ClueGO.

## Development of the protocol
This protocol covers pathway enrichment
analysis of large gene lists typically
derived from genome-scale (omics)
technology. The protocol is intended for
experimental biologists who are
interested in interpreting their omics
data. It requires only an ability to
learn and use R programming language and
'point-and-click' computer software,
although advanced users can benefit from
the automatic analysis scripts we
provide as Supplementary Protocols 1--4.
We analyze previously published human
gene expression and somatic mutation
data as examples; however, our
conceptual framework is applicable to
analysis of lists of genes or
biomolecules from any organism derived
from large-scale data, including
proteomics, genomics, epigenomics and
gene-regulation studies. We extensively
use pathway enrichment analysis for many
projects and have evaluated numerous
available tools. The software
packages we cover here have been
selected for their ease of use, free
access, advanced features, extensive
documentation and up-to-date databases,
and they are ones we use daily in our
research and recommend to collaborators
and students. In addition, we have
provided feedback to the developers of
these tools, allowing them to implement
features we have needed in published
analyses. These tools are
g:Profiler(13), GSEA(14), Cytoscape(15)
and EnrichmentMap(16), all freely
available online:
[g:Profiler](https://biit.cs.ut.ee/gprofiler/)
[GSEA](http://software.broadinstitute.org/gsea/)
[Cytoscape](http://www.cytoscape.org/)
[EnrichmentMap](http://www.baderlab.org/Software/EnrichmentMap)
## Overview of the procedure
Pathway enrichment analysis involves three major
stages:

1. Definition of a gene list of interest using omics data.
   An omics experiment comprehensively measures the activity
   of genes in an experimental context. The resulting raw
   dataset generally requires computational processing, such
   as normalization and scoring, to identify genes of interest.
   For example, a list of genes differentially expressed
   between two groups of samples can be derived from
   RNA-seq data(1).
2. Pathway enrichment analysis. A statistical method is used
   to identify pathways enriched in the gene list from stage 1,
   relative to what is expected by chance. All pathways in a
   given database are tested for enrichment in the gene list
   (see Box 2 of the paper for a list of pathway databases).
3. Visualization and interpretation of pathway enrichment
   analysis results. Many enriched pathways can be identified
   in stage 2, often including related versions of the same
   pathway. Visualization can help identify the main biological
   themes and their relationships for in-depth study and
   experimental evaluation.

We now have the ranked (.rnk) file and the .gmt file
from this paper, and they are in the correct
format to run `fgsea` from the `fgsea` package.
The .rnk file is from an RNA-seq experiment comparing
the mesenchymal subtype to the immunoreactive subtype of
ovarian cancer.

`fgsea` requires a rank file and a pathway file in .gmt
format. Once read into R, the .gmt gene sets are a named
list, with the pathways as the names of the list and the
genes as the members of each pathway; in other words, GO
terms and their gene set members. So let's look at that.
The .gmt file is in a package I made for the course,
`GOenrichment`. First we read in the rnk file:
```{r, eval = FALSE}
# Use the raw file URL; the tree/ page would download HTML rather than the data
url <- "https://github.com/gurinina/omic.data/raw/master/csv/STable2_MesenvsImmuno_RNASeq_ranks.rnk"
filename <- "tables/STable2_MesenvsImmuno_RNASeq_ranks.rnk"
library(downloader)
if (!file.exists(filename)) download(url, filename)
```
```{r ranked list}
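# Read the ranked gene list (columns GeneName and rank)
prank <- read.delim("tables/STable2_MesenvsImmuno_RNASeq_ranks.rnk",
                    stringsAsFactors = FALSE, check.names = FALSE)
library(GOenrichment)
# Gene sets as a named list: names are pathways, elements are member genes
pathways <- hGOBP.gmt
pathways[1]
# Named numeric vector of ranks; keep the first entry for duplicated gene names
ranks <- prank$rank
names(ranks) <- prank$GeneName
ranks <- ranks[!duplicated(names(ranks))]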
```
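
If you haven't seen the .gmt format before: each line of the file is one gene set, with tab-separated fields for the set name, a description, and then the member genes. The sketch below uses made-up gene sets and ranks, not the course data, to show how such lines turn into the kind of named list and named rank vector used above. The helper `read_gmt_lines` is hypothetical and written only for this illustration.

```r
# Made-up GMT lines: name <TAB> description <TAB> gene1 <TAB> gene2 ...
toy_gmt_lines <- c(
  "CELL CYCLE\tGO:0007049\tCDK1\tCCNB1\tPLK1",
  "IMMUNE RESPONSE\tGO:0006955\tCD4\tIL2\tIFNG\tCD8A"
)

# Hypothetical helper: split each line on tabs, use field 1 as the set name,
# drop the description (field 2) and keep the remaining fields as genes.
read_gmt_lines <- function(lines) {
  fields <- strsplit(lines, "\t", fixed = TRUE)
  sets <- lapply(fields, function(f) f[-(1:2)])
  names(sets) <- vapply(fields, function(f) f[1], character(1))
  sets
}

toy_pathways <- read_gmt_lines(toy_gmt_lines)
str(toy_pathways)

# A named numeric vector of ranks; keep the first entry per gene name.
toy_rank_table <- data.frame(
  GeneName = c("CDK1", "IL2", "CDK1", "PLK1"),
  rank     = c(3.2, -1.5, 2.9, 0.4),
  stringsAsFactors = FALSE
)
toy_ranks <- setNames(toy_rank_table$rank, toy_rank_table$GeneName)
toy_ranks <- toy_ranks[!duplicated(names(toy_ranks))]
toy_ranks
```

The toy objects use a `toy_` prefix so they cannot be confused with `pathways` and `ranks` from the real data.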
The gene sets in the .gmt come from a variety of sources: they
consist mainly of GO terms, but have also been supplemented with
gene sets from other databases including, for example, MSigDB,
PANTHER, NCI, and Reactome.

Then, to run fgsea:
```{r run fgsea}
fgseaRes = fgsea::fgseaSimple(pathways = hGOBP.gmt,stats=ranks,nperm=1000,maxSize = 200,minSize = 15)
```
Some enrichment plots for a first glance:
```{r plot}
topPathwaysUp <- fgseaRes[ES > 0][head(order(pval), n=5), pathway]
topPathwaysDown <- fgseaRes[ES < 0][head(order(pval), n=5), pathway]
stringWINDOW = function(x, width = 25){
strng = paste(strwrap(x,width = width), collapse="\n")
strng
}
topPathways <- c(topPathwaysUp, rev(topPathwaysDown))
par(cex=0.5)
plotGseaTable(pathways[topPathways], ranks, fgseaRes,colwidths = c(5, 2, 0.8,0,1))
plotEnrichment(pathways[[topPathways[1]]], ranks)
```
## Methodology GSEA
Steps:

1. Sort genes by log fold change, or by any other metric.
2. The score is calculated by walking down the list,
   increasing a running-sum statistic when a gene is in the
   gene set, and decreasing it when the gene is not. The
   magnitude of the increment depends on the correlation of
   the gene with the phenotype. The enrichment score (ES) is
   the maximum deviation from zero encountered in the random
   walk. A large ES means that genes in the set are toward
   the top of the list.
3. Permute subject labels to calculate the significance of
   the score.

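
The steps above can be sketched in a few lines of base R. This is a toy illustration with made-up statistics and a hypothetical gene set, not the `fgsea` implementation. Note one assumption: because only a ranked list is available here, the sketch permutes gene labels rather than subject labels, which is the usual substitute in the pre-ranked setting and differs from step 3.

```r
# Toy ranked statistics (made up), from most positive to most negative.
toy_stats <- c(g1 = 2.5, g2 = 2.0, g3 = 1.2, g4 = 0.3,
               g5 = -0.4, g6 = -1.1, g7 = -1.8, g8 = -2.6)
toy_set <- c("g1", "g2", "g4")  # hypothetical gene set

# Running-sum enrichment score: step up (weighted by |stat|^p) at genes in
# the set, step down by a constant at genes outside the set.
enrichment_score <- function(stats, gene_set, p = 1) {
  stats   <- sort(stats, decreasing = TRUE)
  in_set  <- names(stats) %in% gene_set
  hit     <- ifelse(in_set, abs(stats)^p, 0)
  step    <- ifelse(in_set, hit / sum(hit), -1 / sum(!in_set))
  running <- cumsum(step)
  running[which.max(abs(running))]  # maximum deviation from zero, sign kept
}

toy_es <- enrichment_score(toy_stats, toy_set)
toy_es

# Null distribution by shuffling gene labels (assumption: pre-ranked setting)
set.seed(1)
null_es <- replicate(1000, {
  shuffled <- setNames(toy_stats, sample(names(toy_stats)))
  enrichment_score(shuffled, toy_set)
})

# Empirical p-value, compared against null scores of the same sign
toy_p <- (sum(null_es >= toy_es) + 1) / (sum(null_es >= 0) + 1)
toy_p
```

Here two of the three set members sit at the very top of the list, so the running sum peaks early and the ES is large and positive.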
 Let's look at
the results. I wrote a function here to
save time filtering and sorting outputs
of GSEA unfriendly outputs:
```{r GSEA sort and tidy output}
# Split GSEA results into negative (nres) and positive (pres) enrichment,
# each sorted by enrichment score and then by adjusted p-value
mygseatidy <- function(result) {
  res <- data.frame(result)
  # Rename clusterProfiler-style columns to the fgsea names used below
  nam <- names(res)
  nam[nam == "enrichmentScore"] <- "ES"
  nam[nam == "p.adjust"] <- "padj"
  names(res) <- nam
  pres <- res %>% filter(ES >= 0) %>% arrange(desc(ES), padj, pathway)
  nres <- res %>% filter(ES < 0) %>% arrange(ES, padj, pathway)
  list(nres = nres, pres = pres)
}
```
Let's look at the results:
```{r results}
fnegsea = mygseatidy(fgseaRes)$nres
fnegsea=fnegsea%>%filter(padj <= 0.05)
fposgsea = mygseatidy(fgseaRes)$pres
fposgsea=fposgsea%>%filter(padj <= 0.05)
```
Let's compare this to the output of the
GSEA desktop application covered in the
paper. I would really encourage you to
run it on your own; it's a great
interface and a lot of fun to work with.

Here I am just reading in the
supplementary tables of the positive
and negative GSEA results.
```{r, eval = FALSE}
# Use raw file URLs; the tree/ pages would download HTML rather than the data
url <- "https://github.com/gurinina/omic.data/raw/master/csv/STable8_gsea_report_for_na_pos.txt"
# Save into tables/, where the next chunk reads the file from
filename <- "tables/STable8_gsea_report_for_na_pos.txt"
library(downloader)
if (!file.exists(filename)) download(url, filename)
url <- "https://github.com/gurinina/omic.data/raw/master/csv/STable9_gsea_report_for_na_neg.txt"
filename <- "tables/STable9_gsea_report_for_na_neg.txt"
if (!file.exists(filename)) download(url, filename)
```
```{r}
ppos = read.delim(file = "tables/STable8_gsea_report_for_na_pos.txt" ,stringsAsFactors = F,
check.names = F)
pneg = read.delim(file = "tables/STable9_gsea_report_for_na_neg.txt" ,stringsAsFactors = F,
check.names = F)
ppos=ppos%>% filter(`FDR Q-VAL`<= 0.05)
ppos = ppos %>% arrange(desc(ES),`FDR Q-VAL`,`TERM|SOURCE`)
pneg=pneg%>% filter(`FDR Q-VAL`<= 0.05)
pneg = pneg %>% arrange(ES,`FDR Q-VAL`,`TERM|SOURCE`)
intersect(fposgsea$pathway[1:20],ppos$`TERM|SOURCE`[1:20])
intersect(fnegsea$pathway[1:20],pneg$`TERM|SOURCE`[1:20])
```
We have successfully found the GO enrichments by two methods: the GSEA desktop
application and fgsea.
```{r}
nrow(fposgsea)
nrow(fnegsea)
```
There are over 700 terms for both the
positive and negative GSEA enrichments.
How do we make sense of them all?

You could look at just the genes with the most
extreme scores and test them for over-represented
GO terms, keeping the terms with the lowest
adjusted p-values:
```{r ego, cache = 1}
### GO enrichment analysis
plot(ranks)
abline(v = 3500)
abline(v = 1500)
ego <- clusterProfiler::enrichGO(
gene = names(ranks)[1:1500],
OrgDb = "org.Hs.eg.db",
keyType = 'SYMBOL',
universe = names(ranks),
ont = "BP",
pAdjustMethod = "BH",
minGSSize = 15,
maxGSSize = 200,
pvalueCutoff = 0.05,
qvalueCutoff = 0.05)
goplot(ego)
dotplot(ego, showCategory=10) + ggtitle("positive enrichment")
if (file.exists("data/sapGO.rds")) {
sapGO <- readRDS("data/sapGO.rds")
} else {
sapGO <- godata('org.Hs.eg.db', keytype = "SYMBOL", ont="BP", computeIC=TRUE)
saveRDS(sapGO, "data/sapGO.rds")
}
hx = pairwise_termsim(ego, semData = sapGO)
emapplot(hx, showCategory = 15)
```
We can also look at the negative enrichment, or the repressed end:
```{r nego}
nego <- clusterProfiler::enrichGO(
gene = names(ranks)[13711:15211],
OrgDb = "org.Hs.eg.db",
keyType = 'SYMBOL',
universe = names(ranks),
ont = "BP",
pAdjustMethod = "BH",
minGSSize = 15,
maxGSSize = 200,
pvalueCutoff = 0.05,
qvalueCutoff = 1)
goplot(nego)
dotplot(nego, showCategory=10) + ggtitle("negative enrichment analysis")
hx = pairwise_termsim(nego, semData = sapGO)
emapplot(hx, showCategory = 15)
```
How well do these enrichGO terms agree with the fgsea enrichment terms?
We can look at the intersection, but first, to be fair, we should
run GSEA with only the GO terms and not the supplemented terms
from MSigDB, PANTHER, NCI, Reactome, etc. We can do that either by
filtering the fgsea results (`fposgsea` and `fnegsea`) for GOBP
terms only, or by using the function `gseGO` to generate a new set
of terms. In this particular case it doesn't really matter, so
let's use `gseGO`.

For a simpler comparison, I am including a yeast dataset here.
The data for this are in the `GOenrichment` package. If you list
the contents of the `GOenrichment` package you'll see everything
that is in it (just as you can for any attached package):
```{r}
ls("package:GOenrichment")
```
What you see here are mostly functions, except for `dfGOBP`, `hGOBP.gmt`,
`sampleFitdata` and `yGOBP.gmt`. `sampleFitdata` contains sample
yeast fitness data that we will use to look for enrichments. Unlike
expression data, fitness data measure, in this case, strain fitness:
how much a strain requires a given gene when grown under a particular
stress. Typically, the stress is a drug, and the strain in question is
a yeast deletion strain. So a fitness defect in a deletion strain tells
you that the gene deleted in that strain is important for resistance to
the drug; such genes can include, among others, the drug target. We will
go into this more when we talk about chemogenomics. `yGOBP.gmt` contains
the GO BP terms for yeast, and `dfGOBP` is related; it just carries the
GO IDs for these terms in a lookup table.

OK, so let's get some data from the yeast `sampleFitdata`. This is a matrix
where every column is a sample. The function `compSCORE` returns a data frame
designating significant scores from the input matrix of screening data,
based on the fitness threshold set by `sig` (typically 1).
Just as for fgsea, we need a ranked list; that is the role of `ygene` here:
```{r gse, cache=2}
library(GOenrichment)
dfsig = compSCORE(sampleFitdata,coln = 1, sig = 1)
head(dfsig)
table(dfsig$index)
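# Drop genes with missing scores; logical indexing is safe even when none are missing
dfsig = dfsig[!is.na(dfsig$score), ]
table(dfsig$index)
# gseGO expects a named numeric vector sorted in decreasing order
ygene = dfsig$score
names(ygene) = dfsig$gene
ygene = sort(ygene, decreasing = TRUE)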
gse <- clusterProfiler::gseGO(
geneList = ygene,
ont = "BP",
pvalueCutoff = 0.05,
keyType = "GENENAME",
eps = 0,
minGSSize = 5,
maxGSSize = 150,
verbose = TRUE,
OrgDb = "org.Sc.sgd.db",
pAdjustMethod = "BH",
by = "fgsea"
)
if (file.exists("data/scGO.rds")) {
scGO <- readRDS("data/scGO.rds")
} else {
scGO <- godata('org.Sc.sgd.db', keytype = "GENENAME", ont="BP", computeIC=FALSE)
saveRDS(scGO, "data/scGO.rds")
}
goplot(gse)
x = pairwise_termsim(gse, semData = scGO, showCategory = 200)
emapplot(x, showCategory = 15,cex_category = 1.5)
p2 <- treeplot(x, hclust_method = "average")
p2
upsetplot(gse)
dotplot(gse, showCategory = 5, split = ".sign") + facet_grid(. ~.sign)
# foldChange takes a named numeric vector of scores, not gene names
cnetplot(gse, categorySize="pvalue", foldChange=ygene, showCategory = 3)
gseaplot2(gse, geneSetID = 1:5)
## GSEA
hse <- clusterProfiler::gseGO(
geneList = ranks,
ont = "BP",
keyType = "SYMBOL",
eps = 0,
minGSSize = 15,
maxGSSize = 200,
pAdjustMethod = "BH",
pvalueCutoff = 0.05,
verbose = TRUE,
OrgDb = "org.Hs.eg.db",
by = "fgsea"
)
goplot(hse)
gseaplot2(hse, geneSetID = 1:5)
```
Let's see how many terms are shared between the GSEA results
and the over-representation analysis:
```{r}
length(intersect(hse$Description[hse$enrichmentScore>=0],ego$Description))
length(intersect(hse$Description[hse$enrichmentScore<0],nego$Description))
length(intersect(hse$Description[hse$enrichmentScore>=0],ego$Description))/length(ego$Description)
length(intersect(hse$Description[hse$enrichmentScore>=0],ego$Description))/length(hse$Description[hse$enrichmentScore>=0])
length(intersect(hse$Description[hse$enrichmentScore<0],nego$Description))/length(nego$Description)
length(intersect(hse$Description[hse$enrichmentScore<0],nego$Description))/length(hse$Description[hse$enrichmentScore<0])
```
The 2nd and 4th percentages suggest that by testing just 10% of the genes in an
over-representation GO analysis, we do nearly as well as using the entire list
of genes in a GSEA analysis. This makes some sense, because in GSEA the genes at
the top and bottom of the list are weighted more heavily. Note that the set of
GO terms tested differs from the start: the scope of the over-representation
analysis is limited to the genes in the query set.
## GO set redundancy
This doesn't solve our problem, though, of handling the large number of GO
terms and making sense of them, because we still wound up with ~700 terms,
at least for the positive end of the enrichment scores. Because GO is
organized hierarchically in a parent-child structure, a parent term
can share a large proportion of its genes with its child
terms. This means that many of the enriched GO terms are redundant. So one
thing you can do is collapse the GO terms by semantic similarity.
The package `GOSemSim` allows you to do that. There are two basic ways
of measuring GO term semantic similarity: information content (IC)-based
measures and graph-based measures. IC-based measures are based on the closest
common ancestor of two terms, and graph-based measures are based on the
location and topology of the terms in the hierarchical GO graph. The Wang
metric is a graph-based strategy and encodes GO terms into a numeric format.
So let's try that on our data. The default metric for GOSemSim is Wang, and
it's actually wrapped inside a clusterProfiler function; the clusterProfiler
package is written by the same person as the GOSemSim
package, so that makes sense. There's a really nice book covering both:
[clusterProfiler/GOSemSim packages](https://yulab-smu.top/biomedical-knowledge-mining-book/GOSemSim.html).

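
To make the IC-based idea concrete, here is a minimal sketch of one IC-based measure, Resnik similarity, on a made-up five-term ontology. The term names, ancestor lists and annotation counts are all hypothetical, and this is not how GOSemSim is implemented; it only illustrates that the IC of a term is the negative log of its annotation frequency and that the similarity of two terms is the IC of their most informative common ancestor.

```r
# Hypothetical toy ontology: each term lists itself plus all its ancestors.
toy_ancestors <- list(
  root       = c("root"),
  metabolism = c("metabolism", "root"),
  glycolysis = c("glycolysis", "metabolism", "root"),
  tca_cycle  = c("tca_cycle", "metabolism", "root"),
  signaling  = c("signaling", "root")
)

# Made-up number of genes annotated to each term or any of its descendants.
toy_counts <- c(root = 1000, metabolism = 200, glycolysis = 20,
                tca_cycle = 25, signaling = 300)

# Information content: rarer (more specific) terms carry more information.
toy_ic <- -log(toy_counts / toy_counts[["root"]])

# Resnik similarity: IC of the most informative common ancestor.
resnik <- function(t1, t2) {
  common <- intersect(toy_ancestors[[t1]], toy_ancestors[[t2]])
  max(toy_ic[common])
}

resnik("glycolysis", "tca_cycle")  # shares "metabolism", so IC = -log(0.2)
resnik("glycolysis", "signaling")  # shares only the root, so 0
```

Two sibling terms under a specific parent score higher than two terms that only meet at the root, which is the behaviour we want when collapsing redundant terms.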
The function `simplify` internally calls GOSemSim (Yu et al. 2010) to calculate semantic similarity among GO terms and removes highly similar terms by keeping one representative term. The `simplify()` method applies `select_fun` (which can be a user-defined function) to the feature `by` to select one representative term from redundant terms (those with similarity higher than `cutoff`).
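
The idea of keeping one representative per group of similar terms can be sketched as a simple greedy filter. This is my own toy version under stated assumptions, not the code inside `simplify()`: the term IDs, p-values and similarity matrix are made up, the helper `drop_redundant` is hypothetical, and the cutoff of 0.7 is just a value picked for the example.

```r
# Made-up enrichment results and a made-up symmetric similarity matrix.
toy_terms <- data.frame(
  ID       = c("GO:A", "GO:B", "GO:C", "GO:D"),
  p.adjust = c(0.001, 0.004, 0.010, 0.020),
  stringsAsFactors = FALSE
)
toy_sim <- matrix(
  c(1.0, 0.9, 0.2, 0.1,
    0.9, 1.0, 0.3, 0.2,
    0.2, 0.3, 1.0, 0.8,
    0.1, 0.2, 0.8, 1.0),
  nrow = 4, byrow = TRUE,
  dimnames = list(toy_terms$ID, toy_terms$ID)
)

# Hypothetical greedy filter: walk terms from best to worst adjusted p-value
# and keep a term only if it is not too similar to any term already kept.
drop_redundant <- function(terms, sim, cutoff = 0.7) {
  terms <- terms[order(terms$p.adjust), ]
  kept <- character(0)
  for (id in terms$ID) {
    if (length(kept) == 0 || all(sim[id, kept] <= cutoff)) {
      kept <- c(kept, id)
    }
  }
  terms[terms$ID %in% kept, ]
}

toy_kept <- drop_redundant(toy_terms, toy_sim)
toy_kept$ID
```

In this toy case `GO:B` is dropped because it is too similar to `GO:A`, and `GO:D` is dropped because it is too similar to `GO:C`, leaving one representative for each pair.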
```{r simp}
if (file.exists("data/simp.rds")) {
simp <- readRDS("data/simp.rds")
} else {
simp = clusterProfiler::simplify(hse,semData = sapGO)
saveRDS(simp, "data/simp.rds")
}
simp$Description[1:10]
dim(simp)
```
But we're still left with nearly 500 terms.
Another, more straightforward way of measuring redundancy or similarity
between GO terms is simply to look at the overlap
between their gene sets. There are two metrics for measuring overlap: one is called
the **overlap coefficient**, which is defined as the size of the
intersection between the two gene sets divided by the size of the
smaller gene set. A more stringent approach is the **Jaccard
coefficient**, which is the size of the intersection divided
by the size of the union of the two gene sets. By modifying the simplify
function I created a new function that implements these
measures of redundancy:
```{r}
# Overlap coefficient: size of the intersection / size of the smaller set
overlapcoeff <- function(x, y) {
  int <- length(intersect(x, y))
  int / min(length(x), length(y))
}
# Jaccard coefficient: size of the intersection / size of the union
jaccard <- function(x, y) {
  int <- length(intersect(x, y))
  un <- length(union(x, y))
  int / un
}
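calcOverGOcoeff = function(select_fun = overlapcoeff, res, cutoff = 0.6, showCategory = 200, keytype = "GENENAME", ont = "BP", orgDb = "org.Sc.sgd.db") {
  library(dplyr)
  # keep the original enrichment object so the same class is returned
  gs = res
  res = data.frame(res, stringsAsFactors = FALSE)
  res = res %>% arrange(p.adjust)
  # square matrix over the top `showCategory` terms (most significant first)
  mx = matrix(rep(NA, showCategory * showCategory), nrow = showCategory, ncol = showCategory)
  colnames(mx) = res$ID[1:showCategory]
  rownames(mx) = res$ID[1:showCategory]
  # list mapping each GO ID to its annotated genes
  yGO = clusterProfiler:::get_GO_data(orgDb, ont, keytype)
  ygo = get("PATHID2EXTID", envir = yGO)
  # 888 is an arbitrary placeholder: filling only the upper triangle means
  # melt(na.rm = TRUE) returns each pair of terms exactly once
  mx[upper.tri(mx)] = 888
  m = reshape2::melt(mx, value.name = "overlap", as.is = TRUE, varnames = c("go1", "go2"), na.rm = TRUE)
  # gene-set overlap for every pair of terms
  m1 = match(m$go1, names(ygo))
  m2 = match(m$go2, names(ygo))
  m$overlap = mapply(select_fun, ygo[m1], ygo[m2])
  # keep only the redundant pairs (overlap at or above the cutoff)
  wm = which(m$overlap < cutoff)
  if(length(wm) > 0) m = m[-wm,]
  # adjusted p-values of both terms in each pair
  m1 = match(m$go1, res$ID)
  m$padj1 = res$p.adjust[m1]
  m1 = match(m$go2, res$ID)
  m$padj2 = res$p.adjust[m1]
  # in each pair, mark the term with the larger adjusted p-value for removal
  m = m %>% dplyr::mutate(gotoremove = m %>% dplyr::select(padj1, padj2) %>% apply(MARGIN = 1, FUN = function(x) which.max(x)))
  w1 = which(m$gotoremove == 1)
  w2 = which(m$gotoremove == 2)
  if(length(w1) > 0) m$gotoremove[w1] = m$go1[w1]
  if(length(w2) > 0) m$gotoremove[w2] = m$go2[w2]
  wres = c(w1, w2)
  # drop the marked terms and put the reduced table back into the object
  if(length(wres) > 0) res = res[!res$ID %in% m$gotoremove,]
  gs@result = res
  gs
}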
```
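As a quick sanity check on the two coefficients, here is a small self-contained example on made-up gene sets. The gene names and the helpers `overlap_toy` and `jaccard_toy` are hypothetical stand-ins used only for illustration; they compute the same quantities as the functions defined above.

```r
# Toy gene sets (hypothetical gene names, for illustration only)
parent  <- c("g1", "g2", "g3", "g4", "g5", "g6", "g7", "g8")
child   <- c("g1", "g2", "g3", "g4")     # fully contained in parent
partial <- c("g3", "g4", "g9", "g10")    # shares two genes with child

# intersection size / size of the smaller set
overlap_toy <- function(x, y) length(intersect(x, y)) / min(length(x), length(y))
# intersection size / union size
jaccard_toy <- function(x, y) length(intersect(x, y)) / length(union(x, y))

scores <- data.frame(
  pair    = c("parent-child", "child-partial"),
  overlap = c(overlap_toy(parent, child), overlap_toy(child, partial)),
  jaccard = c(jaccard_toy(parent, child), jaccard_toy(child, partial))
)
print(scores)
```

A child term that is fully contained in its parent gets an overlap coefficient of 1 but a Jaccard coefficient of only 0.5. With a cutoff of 0.6, the overlap coefficient would flag that pair as redundant while the Jaccard coefficient would not, which is the sense in which Jaccard is the more stringent measure.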
Let's try `calcOverGOcoeff` to see how much we can reduce the GO term redundancy in the
gseGO enrichment for both the yeast and the human data. First let's tackle the yeast:
```{r yoverlapGO}
if (file.exists("data/y.rds")) {
y <- readRDS("data/y.rds")
} else {
y = calcOverGOcoeff(res=gse,showCategory = nrow(data.frame(gse)),cutoff=1,select_fun = overlapcoeff,orgDb = "org.Sc.sgd.db",keytype="GENENAME")
saveRDS(y, "data/y.rds")
}
x = pairwise_termsim(y, semData = scGO,showCategory = nrow(y))
treeplot(x, hclust_method = "ward.D2",showCategory =nrow(x), nCluster=6)
```
```{r hoverlapGO}
if (file.exists("data/h.rds")) {
h <- readRDS("data/h.rds")
} else {
h = calcOverGOcoeff(res=hse,showCategory = nrow(data.frame(hse)),cutoff=0.5,select_fun = overlapcoeff,orgDb = "org.Hs.eg.db",keytype="SYMBOL")
saveRDS(h, "data/h.rds")
}
x = pairwise_termsim(h, semData = sapGO,showCategory = nrow(h))
tp= treeplot(x, hclust_method = "ward.D2",showCategory = nrow(x),nCluster=12)
dp=tp$data
# drop rows without a label; a logical index is safe even when no label is NA
dp=dp[!is.na(dp$label),]
s=split(dp$label,dp$group)
s[1]
s[2]
s[3]
s[4]
s[5]
s[6]
s[7]
s[8]
s[9]
s[10]
s[11]
s[12]
```
So the overlap coefficient got us down to just 200 terms. Still too
big for visualization, but we've preserved the top annotations:
```{r}
hse$Description[1:10]
h$Description[1:10]
```
Here is a visualization from topGO
that is useful because you can see the GO
hierarchy. For this I am going to use
the yeast data.
The `bitr` function here is in the clusterProfiler
package, and is just an easy function for
translating gene names.
```{r topGO}
Bioconductor<-ViSEAGO::Bioconductor2GO()
myGENE2GO<-ViSEAGO::annotate(
"org.Sc.sgd.db",
Bioconductor
)
background = dfsig$gene
gene = dfsig$gene[dfsig$index ==1]
gene.df <- bitr(background, fromType = "GENENAME",
toType = c("ENSEMBL", "ENTREZID"),
OrgDb = org.Sc.sgd.db)
w=which(background%in%gene.df$GENENAME)
m = match(background[w],gene.df$GENENAME)
entrez = background[w]
entrez = gene.df$ENTREZID[m]
w=which(gene%in%gene.df$GENENAME)
m = match(gene[w],gene.df$GENENAME)
select = gene[w]
select = gene.df$ENTREZID[m]
BP=ViSEAGO::create_topGOdata(
geneSel=select,
allGenes=entrez,
gene2GO=myGENE2GO,
ont="BP",
nodeSize=5
)
classic<-topGO::runTest(
BP,
algorithm ="classic",
statistic = "fisher"
)
par(cex = 0.3)
showSigOfNodes(BP, score(classic), firstSigNodes = 5, useInfo = 'all')
```
Another way of doing that is with `goplot` from the
enrichplot package.

Another way to filter GO terms is to filter them by GO level.
But this doesn't work that well because of the unevenness of
the GO ontology. For example, we can use this on the output
of the GO over-representation analysis:
```{r ego4}
ego4 = gofilter(ego,level = 4)
hx4 = pairwise_termsim(ego4, semData = sapGO)
treeplot(hx4)
### compare to the previous set of similarities, let it go with
### the default 30 categories
x = pairwise_termsim(h, semData = sapGO)
treeplot(x, hclust_method = "ward.D2",nCluster=12)
ego3 = gofilter(ego,level = 3)
ego3$Description
```
Level 4 GO terms can be useful for filtering enriched terms from
over-representation analysis, though you may miss the very top GO
enrichments. Level 3 is too coarse.
## More on GO redundancy
Yet another way to avoid GO redundancy is to filter GO terms
upfront. This can be done by using what is called a "GO slim",
literally a slimmed-down version of GO that keeps only a subset of
broad terms from the three ontologies, giving a much coarser annotation. We can look
at how well that does by downloading the slim annotations from
BioMart.
```{r biomaRt, cache=3}
library(biomaRt)
hensembl <- useEnsembl(biomart = "genes", dataset = "hsapiens_gene_ensembl",
verbose = TRUE,mirror = 'uswest')
yensembl <- useEnsembl(biomart = "genes", dataset = "scerevisiae_gene_ensembl",
verbose = TRUE,mirror = 'uswest')
# yensembl = useDataset("scerevisiae_gene_ensembl",mart=ensembl)
yslim <- getBM(attributes=c('ensembl_gene_id','external_gene_name',
'sgd_gene', "goslim_goa_accession", "goslim_goa_description" ),
mart = yensembl)
tapp=tapply(yslim$goslim_goa_accession,yslim$goslim_goa_description,length)