21-GO-enrichment.Rmd

# Gene Set Enrichment Analysis

```{r setup, include=FALSE}
knitr::opts_chunk$set(warning = FALSE, 
        message = FALSE, echo = TRUE)
knitr::opts_chunk$set(tidy.opts = list(width.cutoff = 60), 
                      tidy = TRUE)
sapGO <- readRDS("data/sapGO.rds")
suppressPackageStartupMessages({
library(clusterProfiler)
library(fgsea)
library(enrichplot)
library(annotate)
library(ggplot2)
library(dplyr)
library(DOSE)
library(GOSemSim)
library(ViSEAGO)
library(topGO)
library(org.Sc.sgd.db)
library(org.Hs.eg.db)})
```


The .Rmd file for this chapter can be
found
[here](https://github.com/gurinina/omic_sciences/blob/main/21-GO-enrichment.Rmd).
To begin learning about GO set
enrichment analysis and the different
methods that can be utililized to preform
these analysis a good jumping off start
point is this Natural Protocols paper:

Pathway enrichment analysis and
visualization of omics data using
g:Profiler, GSEA, Cytoscape and
EnrichmentMap

Jüri Reimand, Ruth Isserlin, Veronique
Voisin, Mike Kucera, Christian
Tannus-Lopes, Asha Rostamianfar, Lina
Wadi, Mona Meyer, Jeff Wong, Changjiang
Xu, Daniele Merico and Gary D. Bader.

Pathway enrichment analysis helps
researchers gain mechanistic insight
into gene lists generated from
genome-scale (omics) experiments. This
method identifies biological pathways
that are enriched in a gene list more
than would be expected by chance. We
explain the procedures of pathway
enrichment analysis and present a
practical step-by-step guide to help
interpret gene lists resulting from
RNA-seq and genome-sequencing
experiments. The protocol comprises
three major steps: definition of a gene
list from omics data, determination of
statistically enriched pathways, and
visualization and interpretation of the
results. We describe how to use this
protocol with published examples of
differentially expressed genes and
mutated cancer genes; however, the
principles can be applied to diverse
types of omics data. The protocol
describes innovative visualization
techniques, provides comprehensive
background and troubleshooting
guidelines, and uses freely available
and frequently updated software,
including g:Profiler, Gene Set
Enrichment Analysis (GSEA), Cytoscape
and EnrichmentMap. The complete protocol
can be performed in \~4.5 h and is
designed for use by biologists with no
prior bioinformatics training.

Comprehensive quantification of DNA, RNA
and proteins in biological samples is
now routine. The resulting data are
growing exponentially, and their
analysis helps researchers discover
novel biological functions,
genotype--phenotype relationships and
disease mechanisms1,2. However, analysis
and interpretation of these data
represent a major challenge for many
researchers. Analyses often result in
long lists of genes that require an
impractically large amount of manual
literature searching to interpret. A
standard approach to addressing this
problem is pathway enrichment analysis,
which summarizes the large gene list as
a smaller list of more easily
interpretable pathways. Pathways are
statistically tested for
over-representation in the experimental
gene list relative to what is expected
by chance, using several common
statistical tests that consider the
number of genes detected in the
experiment, their relative ranking and
the number of genes annotated to a
pathway of interest. For instance,
experimental data containing 40% cell
cycle genes are surprisingly enriched,
given that only 8% of human
protein-coding genes are involved in
this process. In a recent example, we
used pathway enrichment analysis to help
identify histone and DNA methylation by
the polycomb repressive complex (PRC2)
as the first rational therapeutic target
for ependymoma, one of the most
prevalent childhood brain cancers3. This
pathway is targetable by available drugs
such as 5-azacytidine, which was used on
a compassionate basis in a terminally
ill patient and stopped rapid metastatic
tumor growth3. In another example, we
analyzed rare copynumber variants (CNVs)
in autism and identified several
significant pathways affected by gene
deletions, whereas few significant hits
were identified with case--control
association tests of single genes or
loci4,5. These examples illustrate the
useful insights into biological
mechanisms that can be achieved using
pathway enrichment analysis.
=======
The .Rmd file for this chapter can be found [here](https://github.com/gurinina/omic_sciences/blob/main/35GSEA.Rmd). To begin learning aobuto GO set enrichmet analysis and the different methods that can be utiliazed to preform these analysis a good jumping off start point is this Natural Protocols paper:

As a change a pace I thought we would go through this very helpful paper written...

Pathway enrichment analysis and visualization of omics data using g:Profiler, GSEA, Cytoscape and EnrichmentMap 

Jüri Reimand, Ruth Isserlin, Veronique Voisin, Mike Kucera, Christian Tannus-Lopes, Asha Rostamianfar, Lina Wadi, Mona Meyer, Jeff Wong, Changjiang Xu, Daniele Merico and Gary D. Bader. Pathway enrichment analysis helps researchers gain mechanistic insight into gene lists generated from genome-scale (omics) experiments. This method identifies biological pathways that are enriched in a gene list more than would be expected by chance. We explain the procedures of pathway enrichment analysis and present a practical step-by-step guide to help interpret gene lists resulting from RNA-seq and genome-sequencing experiments. The protocol comprises three major steps: definition of a gene list from omics data, determination of statistically enriched pathways, and visualization and interpretation of the results. We describe how to use this protocol with published examples of differentially expressed genes and mutated cancer genes; however, the principles can be applied to diverse types of omics data. The protocol describes innovative visualization techniques, provides comprehensive background and troubleshooting guidelines, and uses freely available and frequently updated software, including g:Profiler, Gene Set Enrichment Analysis (GSEA), Cytoscape and EnrichmentMap. The complete protocol can be performed in ~4.5 h and is designed for use by biologists with no prior bioinformatics training.

Comprehensive quantification of DNA, RNA and proteins in biological samples is now routine. The resulting data are growing exponentially, and their analysis helps researchers discover novel biological functions, genotype–phenotype relationships and disease mechanisms1,2. However, analysis and interpretation of these data represent a major challenge for many researchers. Analyses often result in long lists of genes that require an impractically large amount of manual literature searching to interpret. A standard approach to addressing this problem is pathway enrichment analysis, which summarizes the large gene list as a smaller list of more easily interpretable pathways. Pathways are statistically tested for over-representation in the experimental gene list relative to what is expected by chance, using several common statistical tests that consider the number of genes detected in the experiment, their relative ranking and the number of genes annotated to a pathway of interest. For instance, experimental data containing 40% cell cycle genes are surprisingly enriched, given that only 8% of human protein-coding genes are involved in this process. In a recent example, we used pathway enrichment analysis to help identify histone and DNA methylation by the polycomb repressive complex (PRC2) as the first rational therapeutic target for ependymoma, one of the most prevalent childhood brain cancers3. This pathway is targetable by available drugs such as 5-azacytidine, which was used on a compassionate basis in a terminally ill patient and stopped rapid metastatic tumor growth3. In another example, we analyzed rare copynumber variants (CNVs) in autism and identified several significant pathways affected by gene deletions, whereas few significant hits were identified with case–control association tests of single genes or loci4,5. These examples illustrate the useful insights into biological mechanisms that can be achieved using pathway enrichment analysis.


## Application to diverse omics data

This protocol uses RNA-seq data and
somatic mutation data6 as examples
because these data types are frequently
encountered. However, the general
concepts of pathway enrichment analysis
that we present are applicable to many
types of experiments that can generate
lists of genes, such as single-cell
transcriptomics, CNVs, proteomics,
phosphoproteomics, DNA methylation
and metabolomics. Most data types
require protocol modifications, which we
only briefly discuss here. With certain
data types, specialized computational
methods are required to produce a gene
list that is appropriate for pathway
enrichment analysis, whereas with other
data types, a specialized pathway
enrichment analysis technique is
required. 

## Pathway enrichment analysis methods

This protocol recommends the use of
g:Profiler and GSEA software for pathway
enrichment analysis. g:Profiler
analyzes gene lists using Fisher's exact
test and ordered gene lists using a
modified Fisher's test. It provides a
graphical web interface and access via R
and Python programming languages. The
software is frequently updated, and the
gene set database can be downloaded as a
[GMT file](http://biit.cs.ut.ee/gprofiler).
GSEA(http://software.broadinstitute.org/gsea)
analyzes ranked gene lists using
a permutation-based test. The software
typically runs as a desktop application.

Hundreds of pathway enrichment analysis
tools exist (reviewed in ref. Khatri,
P., Sirota, M. & Butte, A. J. Ten years
of pathway analysis: current approaches
and outstanding challenges. PLoS Comput.
Biol. 8, e1002375 (2012)), although many
rely on outof-date pathway databases or
lack unique features as compared to the
most commonly used tools; as such, we do
not cover them here. The following are
alternative free pathway enrichment
analysis software tools. Although we do
not cover these tools in our protocol,
we recommend the following, on the basis
of their ease of use, unique features or
advanced programming features.

## Comparison to alternative methods

(See paper for referenes) Enrichr(37):
This is a web-based enrichment analysis
tool for non-ranked gene lists that is
based on Fisher's exact test. It is easy
to use, has rich interactive reporting
features, and includes \>100 gene set
databases (called libraries), including
\>180,000 gene sets in multiple
categories. Functionality is similar to
that of the g:Profiler web server
described in this protocol.

[Camera(71)](https://bioconductor.org/packages/release/bioc/html/limma.html):
This R Bioconductor package analyzes
gene lists and corrects for inter-gene
correlations such as those apparent in
gene co-expression data. The software is
available as part of the limma package
in Bioconductor; (this is an advanced
tool that requires programming
expertise; Supplementary Protocol 3).
(similar to moast and roast)

[GOseq(72)](https://bioconductor.org/packages/release/bioc/html):
This R Bioconductor package analyzes
gene lists from RNA-seq experiments by
correcting for user-selected covariates
such as gene length; this is an advanced
tool that requires programming
expertise).

[Genomic Regions Enrichment of
Annotations Tool
(GREAT)(67)](http://bejerano.stanford.edu/great/public/html/):
In contrast to common methods that
analyze gene lists, GREAT analyzes
genomic regions such as DNA binding
sites and links these to nearby genes
for pathway enrichment analysis . See
'Application to diverse omics data'
section. PROTOCOL NATURE PROTOCOLS 492
NATURE PROTOCOLS \| VOL 14 \| FEBRUARY
2019 \| 482--517 \| www.nature.com/npro

## Visualization tools

This protocol recommends the use of
[EnrichmentMap](http://www.baderlab.org/Software/EnrichmentMap)
for pathway enrichment
analysis visualization to aid
interpretation. EnrichmentMap(16) is a
Cytoscape application that visualizes
the results from pathway enrichment
analysis and eases interpretation by
displaying pathways as a network in
which overlapping pathways are clustered
together to identify major biological
themes in the results.

Two alternative useful visualization
tools are:

[ClueGO(40)](https://apps.cytoscape.org/apps/cluego): 
This Cytoscape application
is conceptually similar to EnrichmentMap
and provides a network-based
visualization to reduce redundancy of
results from pathway enrichment
analysis. It also includes a pathway
enrichment analysis feature for analysis
of GO annotations using Fisher's exact
tests. However, it currently supports
only GO gene sets.

[PathVisio(49)](https://pathvisio.org/): 
This desktop application
provides a complementary visualization
approach to those of EnrichmentMap and
ClueGO. PathVisio enables the user to
visually interpret omics data in the
context of gene and protein interactions
in a pathway of interest.
[PathVisio](https://www.pathvisio.org)
colors pathway genes according to
user-provided omics data . This is the
main advantage of PathVisio as compared
to EnrichmentMap and ClueGO.

## Development of the protocol

This protocol covers pathway enrichment
analysis of large gene lists typically
derived from genomescale (omics)
technology. The protocol is intended for
experimental biologists who are
interested in interpreting their omics
data. It requires only an ability to
learn and use R programming language and
'point-and-click' computer software,
although advanced users can benefit from
the automatic analysis scripts we
provide as Supplementary Protocols 1--4.
We analyze previously published human
gene expression and somatic mutation
data as examples; however, our
conceptual framework is applicable to
analysis of lists of genes or
biomolecules from any organism derived
from large-scale data, including
proteomics, genomics, epigenomics and
gene-regulation studies. We extensively
use pathway enrichment analysis for many
projects and have evaluated numerous
available tools. The software
packages we cover here have been
selected for their ease of use, free
access, advanced features, extensive
documentation and up-to-date databases,
and they are ones we use daily in our
research and recommend to collaborators
and students. In addition, we have
provided feedback to the developers of
these tools, allowing them to implement
features we have needed in published
analyses. These tools are
g:Profiler(13), GSEA(14), Cytoscape(15)
and EnrichmentMap(16), all freely
available online:

[g:Profiler](https://biit.cs.ut.ee/gprofiler/)
[GSEA](http://software.broadinstitute.org/gsea/)
[Cytoscape](http://www.cytoscape.org/)
[EnrichmentMap](http://www.baderlab.org/Software/EnrichmentMap)

## Overview of the procedure

This section outlines the major stages
of pathway enrichment analysis. Pathway
enrichment analysis involves three major
stages. 1. Definition of a gene
list of interest using omics data. An
omics experiment comprehensively
measures the activity of genes in an
experimental context. The resulting raw
dataset generally requires computational
processing, such as normalization and
scoring, to identify genes of interest, 
For example, a list of genes differentially 
expressed between two groups of samples can 
be derived from RNA-seq data1. 2. Pathway 
enrichment analysis. A statistical method 
is used to identify pathways enriched in the 
gene list from stage 1, relative to what is 
expected by chance. All pathways in a given 
database are tested for enrichment in the gene 
list (see Box 2 for a list of pathway databases).
3. Visualization and interpretation of pathway 
enrichment analysis results. Many enriched 
pathways can be identified in stage 2, often 
including related versions of the same pathway. 
Visualization can help identify the main biological 
themes and their relationships for in-depth study 
and experimental evaluation.

Now we have our ranked file from and our gmt
file from this paper; ahd they are in the correct 
format to run `fgsea` from the `fgsea` package.
The .rnk file is from an RNA-seq experiment comparing
a mesochymal subtype to an immunureactive subtype ovarian 
cancer cells.  

`fgsea` requires a rank file and a pathway file in .gmt
format.


The .gmt format is a list, with pathways 
as names of the list, and genes as members of the
pathwyas -- or in other words GO terms and gene set
members. So lets look at that. The .gmt file is in a package
I made foar tahe course, `GOenrichmet`. First we read in the
rnk file:

```{r, eval = FALSE}
url <- "https://github.com/gurinina/omic.data/tree/master/csv/STable2_MesenvsImmuno_RNASeq_ranks.rnk"
filename <- "tables/STable2_MesenvsImmuno_RNASeq_ranks.rnk"
library(downloader)
if (!file.exists(filename)) download(url, filename)

```

```{r ranked list}
prank = read.delim("tables/STable2_MesenvsImmuno_RNASeq_ranks.rnk",stringsAsFactors = F,check.names = F)


library(GOenrichment)
pathways = hGOBP.gmt
pathways[1]
ranks = prank$rank
names(ranks) = prank$GeneName
wdup = which(duplicated(names(ranks)))
if(length(wdup) >0) ranks=ranks[-wdup]
```
The .gmt is from a variety of sources that is mainly comprised  
of GO ontology, but has also been supplemented with ontologies
from other databases including, for example, MSIGDB, PANTHER, 
NCI, and REACTOME.


The to run fgsea:

```{r run fgsea}
fgseaRes = fgsea::fgseaSimple(pathways = hGOBP.gmt,stats=ranks,nperm=1000,maxSize = 200,minSize = 15)
```

Some enrichment plots to have first glance:
```{r plot}
topPathwaysUp <- fgseaRes[ES > 0][head(order(pval), n=5), pathway]
topPathwaysDown <- fgseaRes[ES < 0][head(order(pval), n=5), pathway]
stringWINDOW = function(x, width = 25){
  strng = paste(strwrap(x,width = width), collapse="\n")
  strng
}
topPathways <- c(topPathwaysUp, rev(topPathwaysDown))
par(cex=0.5)
plotGseaTable(pathways[topPathways], ranks, fgseaRes,colwidths = c(5, 2, 0.8,0,1))

plotEnrichment(pathways[[topPathways[1]]], ranks)

```

## Methodology GSEA

Steps: 1. Sort genes by log fold change, or
by any metric. 2. The score is calculated by walking 
down the list, increasing a running-sum statistic 
when a gene in the geneSet, and decreasing it when 
the gene is not. The magnitude of the increment 
depends on the correlation of the gene with the 
phenotype. The enrichment score (ES) is the maximum 
deviation from zero encountered in the random walk. 
A large ES means that genes in the set are
toward top of list. 3. Permute subject labels 
to calculate signficance of the score.

![GSEA](figure/GSEA.jpeg) Let's look at
the results. I wrote a function here to 
save time filtering and sorting outputs 
of GSEA unfriendly outputs:

```{r GSEA sort and tidy output}
mygseatidy = function(result){

nam=names(data.frame(result))
wnam = which(nam== "enrichmentScore")
if(length(wnam)>0) nam[wnam]="ES"
wnam = which(nam== "p.adjust")
if(length(wnam)>0) nam[wnam]="padj"
pres = data.frame(result) %>% filter(ES>=0)%>% arrange(desc(ES),padj,pathway)
nres = data.frame(result) %>% filter(ES<0)%>% arrange(ES,padj,pathway)
return(list(nres=nres,pres=pres))
}
```

let's look at the results:

```{r results}
fnegsea = mygseatidy(fgseaRes)$nres
fnegsea=fnegsea%>%filter(padj <= 0.05)

fposgsea = mygseatidy(fgseaRes)$pres
fposgsea=fposgsea%>%filter(padj <= 0.05)

```

Let's compare this to the output run by
the GSEA desktop version of Cytoscape
covered in the paper. I would really
encourage you to run on your own, it's a
great interface and a lot of fun to work
with.

Here I am just reading in the
supplementarty tables of the negative
and positive GSEA results.

```{r, eval = FALSE}
url <- "https://github.com/gurinina/omic.data/tree/master/csv/STable8_gsea_report_for_na_pos.txt"
filename <- "STable8_gsea_report_for_na_pos.txt"
library(downloader)
if (!file.exists(filename)) download(url, filename)

url <- "https://github.com/gurinina/omic.data/tree/master/csv/STable9_gsea_report_for_na_neg.txt"
filename <- "tables/STable9_gsea_report_for_na_neg.txt"

if (!file.exists(filename)) download(url, filename)
```

```{r}

ppos = read.delim(file = "tables/STable8_gsea_report_for_na_pos.txt" ,stringsAsFactors = F,
                  check.names = F)

pneg = read.delim(file = "tables/STable9_gsea_report_for_na_neg.txt" ,stringsAsFactors = F,
                  check.names = F)
ppos=ppos%>% filter(`FDR Q-VAL`<= 0.05)

ppos = ppos %>% arrange(desc(ES),`FDR Q-VAL`,`TERM|SOURCE`)

pneg=pneg%>% filter(`FDR Q-VAL`<= 0.05)


pneg = pneg %>% arrange(ES,`FDR Q-VAL`,`TERM|SOURCE`)

intersect(fposgsea$pathway[1:20],ppos$`TERM|SOURCE`[1:20])

intersect(fnegsea$pathway[1:20],pneg$`TERM|SOURCE`[1:20])
```

We have successfully found the GO enrihchments by the two methods, one by GSEA desktop
and one by fgsea.

```{r}

nrow(fposgsea)
nrow(fnegsea)

```
There are over 700 terms for both the
positive and negative GSEA enrichments.
How do we make sense of them all? 

You could look at just the most extreme
scores with the lowest adjusted pvalues:

```{r ego, cache = 1}
### GO enrichment analysis
plot(ranks)
abline(v = 3500)
abline(v = 1500)
ego <- clusterProfiler::enrichGO(
        gene  = names(ranks)[1:1500],
        OrgDb = "org.Hs.eg.db",
        keyType  = 'SYMBOL',
        universe = names(ranks),
        ont = "BP",
        pAdjustMethod = "BH",
        minGSSize = 15,
        maxGSSize = 200,
        pvalueCutoff  = 0.05,
        qvalueCutoff = 0.05)

goplot(ego)
dotplot(ego, showCategory=10) + ggtitle("positive enrichment")


if (file.exists("data/sapGO.rds")) {
  sapGO <- readRDS("data/sapGO.rds")
} else {
  sapGO <- godata('org.Hs.eg.db', keytype = "SYMBOL", ont="BP", computeIC=TRUE)
  saveRDS(sapGO, "data/sapGO.rds")
}
hx = pairwise_termsim(ego, semData = sapGO)
emapplot(hx, showCategory = 15)
```
We can also look at athe negative enrichment, or the repressed end:
```{r nego}
nego <- clusterProfiler::enrichGO(
        gene  = names(ranks)[13711:15211],
        OrgDb = "org.Hs.eg.db",
        keyType  = 'SYMBOL',
        universe = names(ranks),
        ont = "BP",
        pAdjustMethod = "BH",
        minGSSize = 15,
        maxGSSize = 200,
        pvalueCutoff  = 0.05,
        qvalueCutoff = 1)

goplot(nego)
dotplot(nego, showCategory=10) + ggtitle("negative enrichment analysis")


hx = pairwise_termsim(nego, semData = sapGO)
emapplot(hx, showCategory = 15)

```

How well do these enrichGO terms agree with the fgsea enrichment terms?

We can look at the intersection but first to be fair we should
run GSEA with only the GO terms and not the supplemented terms
from MSIGDB, PANTHER, NCI, and REACTOME etc. So let's do that by
generating a GSEA enrichment only from the GO terms. We can do
that by filtering the fgeapos resulting only for GOBP or we could
use the function gseGO to regenerated a new set of terms. In this 
particular case, it doesn't really matter so let's use gseGO.

For a simpler comparison, I am including a yeast dataset here.
The data for this is in the GOenrichment package. If you list
the GO enrichment package you'll see everything that is in it
(just like you can do for any package):
```{r}
ls("package:GOenrichment")
```
What you see here are mostly functions except for `dfGOBP`,`hGOBP.gmt`,
`sampleFitdata` and `yGOBP.gmt`. `sampleFitdata` contains sample
yeast fitness data that we will use for looking for enrichments. Unlike
expression data, fitness data is a measure of, in this case, strain
`fitness`, or the requirement when grown in a particular stress. Typically,
the stress is a drug, and the strain in question is a yeast deletion strain.
So a fitness defect in a deletion strain tells you that the gene deleted
in that strain is important for resistance to drug, which can be, among
other genes, the drug target. We will go into this more when we talk
about chemogenogics. `yGOBP.gmt` is the GO BP terms for yeast, and `dfGOBP`
is related; it just carries the GO ID for these terms in a lookup table.
Ok, so let's get some data from the yeast `sampleFitdata`. This is a matrix
where every column is a sample. The function `compSCORE` returns a dataframe 
designating significant scores from the input matrix of screening data 
based on the input fitness threshold cutoff set by `sig`, typically = 1.

Just like for fgsea, we need a rank file, that is how we use ygene here:

```{r gse, cache=2}
library(GOenrichment)
dfsig = compSCORE(sampleFitdata,coln = 1, sig = 1)
head(dfsig)
table(dfsig$index)
wna = which(is.na(dfsig$score))
dfsig = dfsig[-wna,]
table(dfsig$index)


ygene = dfsig$score
names(ygene) = dfsig$gene

gse <- clusterProfiler::gseGO(
geneList = ygene,
ont = "BP",
pvalueCutoff = 0.05,
keyType = "GENENAME",
eps = 0,
minGSSize = 5,
maxGSSize = 150,
verbose = TRUE,
OrgDb = "org.Sc.sgd.db",
pAdjustMethod = "BH",
by = "fgsea"
)

if (file.exists("data/scGO.rds")) {
  scGO <- readRDS("data/scGO.rds")
} else {
  scGO <- godata('org.Sc.sgd.db', keytype = "GENENAME", ont="BP", computeIC=FALSE)
  saveRDS(scGO, "data/scGO.rds")
}

goplot(gse)


x = pairwise_termsim(gse, semData  = scGO, showCategory = 200)
emapplot(x, showCategory = 15,cex_category = 1.5)
p2 <- treeplot(x, hclust_method = "average")
p2
upsetplot(gse)
dotplot(gse, showCategory = 5, split = ".sign") + facet_grid(. ~.sign)
cnetplot(gse, categorySize="pvalue", foldChange=names(ygene)[1:10], showCategory = 3)

gseaplot2(gse, geneSetID = 1:5)


## GSEA
hse <- clusterProfiler::gseGO(
  geneList = ranks,
  ont = "BP",
  keyType = "SYMBOL",
  eps = 0,
  minGSSize = 15,
  maxGSSize = 200,
  pAdjustMethod = "BH",
  pvalueCutoff = 0.05,
  verbose = TRUE,
  OrgDb = "org.Hs.eg.db",
  by = "fgsea"
)
goplot(hse)

gseaplot2(hse, geneSetID = 1:5)
```
Let's see what the intersection of the terms from running GSEA
and running over-representation analysis is:

```{r}
length(intersect(hse$Description[hse$enrichmentScore>=0],ego$Description))
length(intersect(hse$Description[hse$enrichmentScore<0],nego$Description))

length(intersect(hse$Description[hse$enrichmentScore>=0],ego$Description))/length(ego$Description)

length(intersect(hse$Description[hse$enrichmentScore>=0],ego$Description))/length(hse$Description[hse$enrichmentScore>=0])

length(intersect(hse$Description[hse$enrichmentScore<0],nego$Description))/length(nego$Description)

length(intersect(hse$Description[hse$enrichmentScore<0],nego$Description))/length(hse$Description[hse$enrichmentScore<0])


```

The 2nd and the 4th percentages suggest that by testing just 10% of the genes in an over-representation GO analysis, we do nearly as well as using the entire list of genes in a GSEA analysis. This makes some sense because in GSEA the genes at the top
and bottom of the list are weighted more heavily. The GO term set used right
off the bat will be different; the scope of the over-representation analysis
is limited to that of the genes in the querySet.

## GO set redundancy
This doesn't solve our problem though of handing the large number of GO
terms and making sense of them, becausae we still wound up with ~700 terms
at least for the positive end of the enrichment scores. Because GO is 
organized hierarchically in a parent-child structure, a parent term 
can overlap with a large proportion with all its child
terms. This means that many of the enriched GO terms are redundant. So one
thing you can do is to collapse the GO terms by semantic similarity.
The package `GOSemSim` allows you to do that. There are two basic ways
of measurig GO term semanitic similarity, IC, information content
and graph-based measures. IC-based measures are based on the closest
common ancester and graph based measures are based on location and
topology in the hierarchical GO graph. The Wang metric is a graph-based
strategy and encodes GO terms into a numeric format. So lets try that
on our data. The default metric for GOSemSim is Wang, and it's
actually wrappped inside of a clusterProfiler function -- the clusterProfiler
package is actually written by the same person as the GOSemSim
package, so that makes sense. There's a really nice book here
[clusterProfiler/GOSemSim packages](https://yulab-smu.top/biomedical-knowledge-mining-book/GOSemSim.html).


The function `simplify` internally calls GOSemSim (Yu et al. 2010) to calculate semantic similarity among GO terms and remove those highly similar terms by keeping one representative term. The simplify() method apply select_fun (which can be a user defined function) to feature by to select one representative term from redundant terms (which have similarity higher than cutoff).


```{r simp}

if (file.exists("data/simp.rds")) {
  simp <- readRDS("data/simp.rds")
} else {
 simp = clusterProfiler::simplify(hse,semData  = sapGO)
  saveRDS(simp, "data/simp.rds")
}

simp$Description[1:10]
dim(simp)
```

But we're still left with nearly 500 terms.
Another more straightforward way of measuring redundancy or similarity 
between GO terms is simply to look at the overlap 
between geneSets. There are two metrics to measure overlaps: one is called
the **overlap coefficient**, which is defined by the length of the
intersection between two GO terms divided by the length of the
shortest GO term. A more stringent approach is the **Jaccard 
coefficient**,which is the length of the intersection divided 
by the union of the two GO terms. By modifying the simplify 
function I created a new function that implements these
measures of redundancies:


```{r}

overlapcoeff <- function (x, y) {

    lenx = length(x)
    leny = length(y)

    len = c(lenx,leny)
    mn = which.min(len)
    int = length(intersect(x, y))
    overcoeff = int/len[mn]
    overcoeff
}
  jaccard <- function (x, y) {

    lenx = length(x)
    leny = length(y)

    len = c(lenx,leny)
    
    int = length(intersect(x, y))
    un  = length(union(x,y))
    jaccard = int/un
    jaccard
}
  

calcOverGOcoeff= function(select_fun = overlapcoeff,res, cutoff = 0.6, showCategory = 200, keytype = "GENENAME",ont = "BP",orgDb = "org.Sc.sgd.db") { 
    library(dplyr)
  
  
    gs = res
    res = data.frame(res,stringsAsFactors = F)
    res = res %>% arrange(p.adjust)
    mx = matrix(rep(NA,showCategory*showCategory),nrow=showCategory,ncol=showCategory)
    
    colnames(mx) = res$ID[1:showCategory]
    rownames(mx) = res$ID[1:showCategory]
    
    yGO= clusterProfiler:::get_GO_data(orgDb,ont,keytype)
    ygo = get("PATHID2EXTID",envir=yGO)
    
    mx[upper.tri(mx)]=888
    m = reshape2::melt(mx, value.name = "overlap",as.is = TRUE,varnames = c("go1","go2"),na.rm = T)
    m1 = match(m$go1,names(ygo))
    m2 = match(m$go2,names(ygo))
    
    m$overlap = mapply(select_fun,ygo[m1],ygo[m2])
    
    
    wm = which(m$overlap < cutoff)
    if(length(wm) > 0) m = m[-wm,]
    
    m1=match(m$go1,res$ID)
    m$padj1=res$p.adjust[m1]
    m1=match(m$go2,res$ID)
    m$padj2=res$p.adjust[m1]
    m=m%>% dplyr::mutate(gotoremove= m%>%dplyr::select(padj1,padj2)%>% apply(MARGIN = 1, FUN =      function(x) which.max(x)))
    
    
    w1 = which(m$gotoremove == 1)
    w2 = which(m$gotoremove == 2)
    if(length(w1) > 0) m$gotoremove[w1]=m$go1[w1]
    if(length(w2) > 0) m$gotoremove[w2]=m$go2[w2]
    wres = c(w1,w2)
    if(length(wres) > 0) res = res[!res$ID %in% m$gotoremove,]
    
    gs@result = res
    gs
  }


```
Let's try this function to see how we can reduce the GO term redundancy in the
gseGO enrichment for both the yeast and the human. First let's tackle the yeast:

```{r yoverlapGO}

if (file.exists("data/y.rds")) {
  y <- readRDS("data/y.rds")
} else {
  y = calcOverGOcoeff(res=gse,showCategory = nrow(data.frame(gse)),cutoff=1,select_fun = overlapcoeff,orgDb = "org.Sc.sgd.db",keytype="GENENAME")
  saveRDS(y, "data/y.rds")
}

x = pairwise_termsim(y, semData = scGO,showCategory = nrow(y))
treeplot(x, hclust_method = "ward.D2",showCategory =nrow(x), nCluster=6)

```

```{r hoverlapGO}
if (file.exists("data/h.rds")) {
  h <- readRDS("data/h.rds")
} else {
  h = calcOverGOcoeff(res=hse,showCategory = nrow(data.frame(hse)),cutoff=0.5,select_fun = overlapcoeff,orgDb = "org.Hs.eg.db",keytype="SYMBOL")
  saveRDS(h, "data/h.rds")
}


x = pairwise_termsim(h, semData = sapGO,showCategory = nrow(h))
tp= treeplot(x, hclust_method = "ward.D2",showCategory = nrow(x),nCluster=12)
dp=tp$data
wna=which(is.na(dp$label))

dp=dp[-wna,]
s=split(dp$label,dp$group)

s[1]
s[2]
s[3]
s[4]
s[5]
s[6]
s[7]
s[8]
s[9]
s[10]
s[11]
s[12]

```
So the overlap coefficient got us down to just 200 terms. Still too
big for visualization, but we've preserved the top annotations:

```{r}
hse$Description[1:10]
h$Description[1:10]
```
Here is a visualization from topGO
that is useful because you see the GO
hierarchy. For this I am goiong to use
the yeast data.

The `bitr` function here is in the clusterProfiler
package, and is just an easy function for
translating gene names.

```{r topGO}
Bioconductor<-ViSEAGO::Bioconductor2GO()
myGENE2GO<-ViSEAGO::annotate(
"org.Sc.sgd.db",
Bioconductor
)


background = dfsig$gene
gene = dfsig$gene[dfsig$index ==1]
gene.df <- bitr(background, fromType = "GENENAME",
toType = c("ENSEMBL", "ENTREZID"),
OrgDb = org.Sc.sgd.db)

w=which(background%in%gene.df$GENENAME)
m = match(background[w],gene.df$GENENAME)


entrez = background[w]
entrez = gene.df$ENTREZID[m]

w=which(gene%in%gene.df$GENENAME)
m = match(gene[w],gene.df$GENENAME)

select = gene[w]
select = gene.df$ENTREZID[m]

BP=ViSEAGO::create_topGOdata(
geneSel=select,
allGenes=entrez,
gene2GO=myGENE2GO,
ont="BP",
nodeSize=5
)

classic<-topGO::runTest(
BP,
algorithm ="classic",
statistic = "fisher"
)
par(cex = 0.3)
showSigOfNodes(BP, score(classic), firstSigNodes = 5, useInfo = 'all')
```
Another way of  doing that is with goplot from the
enrichplot package

Another way to filter GO terms is to filter them by GO level.
But this doesn't work that well because of the uneveness of 
the GO ontology. For example, we can use this on the output 
of the GO over-representation analysis:

```{r ego4}

ego4 = gofilter(ego,level = 4)

hx4 = pairwise_termsim(ego4, semData = sapGO)
treeplot(hx4)
### compare to the previous set of similariites, let it go with
### the default 30 categories
x = pairwise_termsim(h, semData = sapGO)
treeplot(x, hclust_method = "ward.D2",nCluster=12)
ego3 = gofilter(ego,level = 3)
ego3$Description
```
Level 4 goterms can be useful for filtering enriched terms from
overexpression analysis, though you may miss the very top GO
enrichments. Level 3 is too course.

## More on GO redundancy
Yet another way to avoid GO redundancy is to filter GO terms
upfront. This can be done by using what is called "GO slim",
literally a slimmed version of GO where all three ontologies
are combined into a much broader annotation. We can look
at how well that does by downloading the slim annotations from
Biomart.

```{r biomaRt, cache=3}

library(biomaRt)


hensembl <- useEnsembl(biomart = "genes", dataset = "hsapiens_gene_ensembl",   
           verbose = TRUE,mirror = 'uswest')
yensembl <- useEnsembl(biomart = "genes", dataset = "scerevisiae_gene_ensembl",
            verbose = TRUE,mirror = 'uswest')
# yensembl = useDataset("scerevisiae_gene_ensembl",mart=ensembl)

yslim <- getBM(attributes=c('ensembl_gene_id','external_gene_name',
        'sgd_gene',  "goslim_goa_accession", "goslim_goa_description" ),
        mart = yensembl)

tapp=tapply(yslim$goslim_goa_accession,yslim$goslim_goa_description,length)
m=match(yslim$goslim_goa_description,names(tapp))
table(is.na(m))
yslim$geneSetSize = tapp[m]

length(unique(yslim$goslim_goa_description))


attr = c("ensembl_gene_id", "entrezgene_id","hgnc_symbol" ,
         "ensembl_gene_id_version", "goslim_goa_accession" ,
         "goslim_goa_description")
hslim <- getBM(attributes=attr, mart = hensembl)
tapp=tapply(hslim$goslim_goa_accession,hslim$goslim_goa_description,length)
m=match(hslim$goslim_goa_description,names(tapp))
table(is.na(m))
hslim$geneSetSize = tapp[m]
w=which(hslim$geneSetSize>1000)
length(unique(hslim$goslim_goa_description))
```

So 142 categories for human and 161 for yeast. That's
not alot. Still, the big ones aren't likely to be informative.

```{r}
w = which(hslim$geneSetSize > 2000)
tab = table(hslim$goslim_goa_description[w])
tab = tab[order(tab,decreasing = T)]

w = which(hslim$geneSetSize > 5000)
hslim = hslim[-w,]

w = which(yslim$geneSetSize > 5000)
yslim = yslim[-w,]
```

Let's see how well these agree with the full 
GO BP set:

```{r yslim}
ys=split(yslim$external_gene_name,yslim$goslim_goa_description)
ys=lapply(ys,unique)
ys=lapply(ys,sort)
ygsea= fgseaSimple(pathways = ys,stats=ygene,nperm=1000,maxSize = 5000,minSize = 5)
ypos=mygseatidy(ygsea)$pres
yneg=mygseatidy(ygsea)$nres
ypos = ypos %>% filter(padj < 0.05)
yneg = yneg %>% filter(padj < 0.05)

intersect(ypos$pathway,gse$Description)

intersect(yneg$pathway,gse$Description)


```
So, not so bad. We are getting the major pathways:
mitochondrial translation, which happens to be a large
GO term to begin with, and cellular amino acid metabolic process.
Three of the yeast slim gsea positive and nine of the negative
pathway terms intersect with the yeast gseGO enrichment terms.

Lets see how the human compares. Let's use the GOID this time,
it's probably smarter than using the GO term.

```{r hslim}
hs=split(hslim$hgnc_symbol,hslim$goslim_goa_accession)
hs=lapply(hs,unique)
hs=lapply(hs,sort)
hgsea= fgseaSimple(pathways = hs,stats=ranks,nperm=10000,maxSize = 5000,minSize = 15)
hpos=mygseatidy(hgsea)$pres
hneg=mygseatidy(hgsea)$nres
hpos = hpos %>% filter(padj < 0.05)
hneg = hneg %>% filter(padj < 0.05)

intersect(hpos$pathway,hse$ID)

intersect(hneg$pathway,hse$ID)

```
Now let's just find out what those GOIDs are:

```{r}
m = match(hpos$pathway,hslim$goslim_goa_accession)
hpos$term = hslim$goslim_goa_description[m]
m = match(hneg$pathway,hslim$goslim_goa_accession)
hneg$term = hslim$goslim_goa_description[m]

intersect(hpos$term,hse$Description)

intersect(hneg$term,hse$Description)

```

So again, we have agreemennt, but the human slim enrichment
standing on its own really isn't very informative at all.
Really in both cases we're lacking the granularity and the
richness of the GO BP terms. Let's just look at the top 10
terms of each:

```{r}
hpos$term[1:10]
hneg$term[1:10]


``` 

You can see that we also have other terms that are unfamiliar to us
thus far, suchh as "nuxleotidytransferase activity", "hydrolase
activity, acting on glycosyl bonds" -- these belong to molecular
function or MF GO terms. Others like "ribosome" and "nuclear 
chromosome" belong to cellular component, or CC GO terms. While 
these may be informative the MF and CC GO ontologies are in
general not nearly as rich as the BP GO terms.

So the take home lesson is that in general we don't see 
good agreement when we run GO enrichments from different 
packages even if they are based on the same functions. 
Why? We'd have to look into 
the package details, but the major reason is
likely to be the due to the differences
in geneSets. I did spend soome time
looking into the details. For example, 
the runfgsea in the ViSEAGO is the same function
as the fgsea function in fgsea and in clusterProfiler
gseGO, but the GO geneSets that they retrieve
are all different. fgsea uses the user input,
ViSEAGO uses a variety of databases depending on
how you define your `myGENE2GO` function, and
gseGO uses AnnotationDbi. The other point to make
is that people often carefully filter their
GO geneSets. For example, every gene assigned to
a GO term is associated with an evidence code.

Let's look at a quick example. Using `AnnotationDbi' and
the `select` function, we can pull down the GO annotaations
by querying the yeast database `org.Sc.sgd.db`.

```{r yanno}
if (file.exists("data/yanno.rds")) {
  yanno <- readRDS("data/yanno.rds")
} else {
  
yanno = AnnotationDbi::select(org.Sc.sgd.db,
  keys=dfsig$gene,columns=c("ORF","GENENAME","GOALL",
  "ONTOLOGYALL"),keytype="GENENAME")

   saveRDS(yanno, "data/yanno.rds")
}
names(yanno)


```

So right away we want to filter for "BP",biological process,
and we probably want to get the TERMs for the go ids. One
way to do that is to use the annotate package.

```{r}
yanno = yanno %>% filter(ONTOLOGYALL == "BP")
library(annotate)
gT = getGOTerm(yanno$GOALL)
gT = gT$BP
w=which(yanno$GOALL%in% names(gT))
yanno$term=NA
m=match(yanno$GOALL[w],names(gT))
table(is.na(m))
yanno$term[w]=toupper(gT)[m]
wna=which(is.na(yanno$term))
u=unique(yanno$GOALL[wna])
u[1:5]
```

If you look these terms up in the go ontology website
[amigo](http://amigo.geneontology.org/amigo/landing) you'll find
that most of them are obsolete and have been retired. So these
GO terms are constantly changing -- in fact several of these
are in my GO gene sets, as well as in the GO gene sets of the
functions we have been using. Not to even mention what members in
each set may have changed. 

Let's look at the numbers.
```{r}

nrow(yanno)
nrow(yanno)/length(unique(yanno$GENENAME))

```


Nearly 100 annotations/gene! But they are likely not evenly
distributed. Rather you have some genes with many terms, and
some with very few.

Let's look at the evidence codes.

```{r}
yanno = yanno[-wna,]
tmp = yanno %>% group_by(GENENAME)%>% count(terms= length(term))
hist(tmp$n)
```

## The importance of geneSet sources
Now let's compare this file to the GO annotations 
that I downloaded ~6 months ago from the Gene Ontology
Consortium:

```{r}

library(GOenrichment)
anno=split(yanno$GENENAME,yanno$term)
anno = lapply(anno,unique)
anno = lapply(anno,sort)
anno = anno[sort(names(anno))]
lanno = sapply(anno,length)
lbp = sapply(yGOBP.gmt,length)

wlanno = which(lanno <= max(lbp))

length(lanno[wlanno])
            
i=intersect(names(anno),names(yGOBP.gmt))
length(i)
lanno = sapply(anno[wlanno],length)
length(lbp)
table(lanno[wlanno][i]==lbp[i])

length(i)/length(lbp)
length(i)/length(lanno[wlanno])
table(lanno[wlanno][i]==lbp[i])/sum(table(lanno[wlanno][i]==lbp[i]))
```

About 85% of the GO geneSet downloaded six months ago and 
about 60% of the March 2022 SGD GO geneSet overlaps the
intersection of the two datasets. But there is a further discrepancy
in that there is only a ~35% agreement in the GO term 
memberships. 

So **clearly** one of the most important things to
remember about geneSet analysis is that
geneSets are constantly being updated.
I don't even want to think about looking at
the clusterProfiler version of the GO geneSets.
But that being said, the fact is that the GO enrichments would not change that
much even so -- **if** you have a robust profile.

Many of the R packages we are using, I
have noticed! don't update their files
to reflect this. Therefore the safest
way to ensure you are using the most
up-to-date file is to download it from
the GO ontology consortium website [Gene
Ontology
Consortium](http://current.geneontology.org/products/pages/downloads.html).
The GO ontology constortium updates
their files constantly and have all the
model organism geneSets. Another good
source for Human, woodchuck, rat and
mouse data is Gary Bader's at the
Univerisity of Toronto's website [Gary
Bader Enrichment Map Gene
Sets](http://baderlab.org/GeneSets).
There they have serveral versions of the
gmt files separated into GO ontologies,
plus/minus IEAs, +/ pathway annotations
etc. If you publish a paper, you always nee to cite
the source and the date of down load. The files on 
the GO ontology web site are provided as .GAF files.

Go through GAF file: 1. filter for
ontology: most commonly people use BP 2.
filter for evidence codes: most
commonly, IEA and other
computational codes are filtered out,
but it depends on the organism -- if the
organism is not well annotated, you may
want to choose to leave it in 3.
sometimes people filter for source or
whatever process you prefer, fiORFs,
duplicated genes. filter for IEA.

Ok, let's move on. Here it would be very nice
to have some handy visualization tools
so that we could more easily summarize
the similarities and differences between
these results. 


Finally, I'm going to show you a GO
enrichment tool that we developed in our
lab. I made a paakage for the course that has
all the functions and everything you need.

```{r}
library(GOenrichment)
ls("package:GOenrichment")
```

The GO enrichment itself is basically
over representation analysis, so we need a listdon't need to go over that
bit in detail.

## How network visualizations can help
Here, instead of using a ranked
list however, to run the GO enrichment we use a list of
"significant genes" and compared it to a
set of background genes, instead of
using a ranked list. This mode of running a GO
enrichment is called an over-representation
analysis. This decision is
very much based on our platform. Our
platform is a chemogenomics assay and
the uniqueness of our assay is that
unlike RNA expressioin, for example, the
assay returns a ranked list of genes,
but in this case the ranked list 
is directly informative of the
importance of each gene to the
biological question. In this case, the
biological question is how important is
that gene for resistance to a particular
compound or drug? In expression, a
larger fold change in expression in
response to perturbation does not
necessarily imply that that gene is more
important for adaptation or resistance
to the change in perturbation, but in
our chemogenomic assay, a larger change
in resistance, in this case measured as
a change in fitness, or growth does tell
us exactly that. So that is why we opt
not to use the ranked list approach.
There is still the issue of where to
draw the cutoff, but we have overcome
that after years of experience with the
assay and being intimately familiar with
the level of sensitivity and what level
of sensitivitiy we are comfortable with
calling statistically significant.

So because we know the level of that significance
is approximately one, we can run the GO enrichment.
So we need all the parameters. `bp_input` is the
geneSets and `go_input` just carries along the GOIDs
for the geneSets since they are in .gmt formet, which
we saw with fgsea. mat is the screening data, coln is
the column of the matrix that has the sample of interest,
sig is the significance threshold, fdrThresh is the FDR
threshold, which we set fairly high as it is quite 
stringent.

```{r yGORESP}
go_input = dfGOBP
bp_input = yGOBP.gmt

goresp = runGORESP(fdrThresh = 0.2,curr_exp = colnames(sampleFitdata)[1], 
      mat = sampleFitdata, coln = 1, sig = 1, bp_input = bp_input,go_input = go_input)
```

Lets look at what is returned.

There is an enrichInfo data.frame, and an edgeMat.
enrichInfo has all the enriched terms, their FDR scores, overlapping genes, etc.
Other columns are important for the visualization which we'll get to.
edgeMat has information that links the nodes to eachother depending
on the overlap between terms. edgeMat is strictly for visualiztion.
The overlappCoeff, or overlap between terms gets translated into an edge width,
the larger the overlap, the thicker the edge. We'll see this later.

```{r}
goresp$enrichInfo[3,]

goresp$edgeMat[1,]
```


Ok. Before we get back into that I just
want to go over what GO enrichment does.
So for a GO enrichment with a
significant set of genes and a
background list, in other words an over-representation
GO analysis, we can model the
association between genes and GO class
using a hypergeometric distribution. That is, we can calculate
the probability of a gene belonging to a GO geneSet in our set of 
significant genes compared to the probability of a gene belonging
to that set in the background list (universe).
The hypergeometric distribution is a probability distribution that describes 
the probability of k successes (random draws for which the 
object drawn has a specific feature) in n draws, from a population
of size N that contains exactly K objects with that feature. For 
reference, the binomial distribution describes the probabillity
of k successes in n draws with replacement. For reference, the
binomical distribution describes the probability of k successes
in n draws with replacement. The
classic example for the hypergeometric
is the random selection of k balls in
an urn containing N total balls, m white
and n black balls, and the observation
that the selection contains "x" marked
balls.

To illustrate the way to perform the
hypergeometric test, we will re-analyze
walk through from scratch one of the
enrichment results from the test we just
performed.

So first lets look at the parameters
that were used in `runGORESP` to calculate
the probability of a significant gene
belonging to a particular geneSet. So
within `runGORESP`, there is annother
function called `compSCORE`, that we've
seen already. Lets use it again here.

```{r}

dfsig = compSCORE(sampleFitdata,coln = 1, sig = 1)
head(dfsig)
table(dfsig$index)

```

As we saw before, there are often fitness scores that have
NA values, so lets check for that right
away and get rid of those.

```{r}
wna  = which(is.na(dfsig$score))
length(wna)
dfsig = dfsig[-wna,]
table(dfsig$index)

```

## The hypergeometric test

We define the parameters of the
hypergeometric test in the following
way:

N = total number of balls = total number
of genes in the universe. This means all
the gehes with GO annotation.

m the number of white balls in the
urn,i.e. the number of genes in the
geneSet.

n the number of black balls in the urn

k the number balls drawn from the urm

Lets start with N

N = total number of genes in the screening matrix 

So we need to know how many of the
significant genes have a GO annotation:

```{r}
N = nrow(dfsig)

# moving forward we should also filter out geneSets
# but this is done inside the runGORESP function:
yGOBP = lapply(yGOBP.gmt,intersect,dfsig$gene)

```

The second parameter:

m = white balls, take thise to mean
genes in the geneSet of interest i.e.
total number of genes annotated for the
selected GO term (ARGININE BIOSYNTHETIC
PROCESS).

```{r}
w = which(names(yGOBP)=="ARGININE BIOSYNTHETIC PROCESS")
yGOBP[[w]]
yGOBP.gmt[[w]]
m = length(yGOBP[[w]])
m
n = N-m
n
```

N = 4797 m = 9 n = N−m = 4788 #number of
blackballs, i.e. the number of genes
that have annotation distinct \# from
the GO term (ARGININE BIOSYNTHETIC
PROCESS).

k = random selection of k balls = size
of the query selection, i.e. number of
genes passing the significance threshold
this comes from out score data.frame

```{r}
table(dfsig$index)
k = sum(dfsig$index == 1)
k

```

So from here we can see that k = 191,

the size of the query set.


Finally, x = number k balls (query set)
that overlap with the m, the ARGININE
BIOSYNTHETIC PROCESS gene set

```{r arginine}
gene = dfsig$gene[dfsig$index==1]
arg = yGOBP$`ARGININE BIOSYNTHETIC PROCESS`
int = intersect(arg,gene)
x = length(int)
x
```

Before we go on here, I can't help but notice that the `sampleFitdata` has
a more recent version of gene names than my gmt files does. So I'm going to
cheat a little here for the ARGININE geneset to my favor and 
rename the offending rowname:

```{r}
dfsig$gene[5]
g = grep("ARG5",rownames(sampleFitdata))
rownames(sampleFitdata)[g] = "ARG5,6"

goresp$enrichInfo[1:3,1:10]

```

This correction is more to point out an example of why it's important
to keep geneSets updated. In this case, I know for a fact that none of the
current geneSets we've used in the yeast datasets so far (gseGO,enrichGO) have
updated that gene from "ARG5,6" to its current name "ARG56", neither does the
`org.Sc.sgd.db` have the correct genename. So in this case my error was using
a genename that was too recent! 

In fact I can check that by typing:

```{r}
org.Sc.sgd_dbInfo()
```

And I can see that the last update was in March. The yeast GMT file I generarted
was in January -- that's 6 months ago, so time for me to update that file...

Then let's rerun goresp, and rerun compSCORE, it won't make
too much difference but it serves to bring home the point of
the importance of geneSets being updated, particularly in this
case when there are only a handful of members in the geneSet.

```{r}
go_input = dfGOBP
bp_input = yGOBP.gmt

goresp = runGORESP(fdrThresh = 0.2,curr_exp = colnames(sampleFitdata)[1], 
      mat = sampleFitdata, coln = 1, sig = 1, bp_input = bp_input,go_input = go_input)
```

And them `compSCORE`:

```{r}
dfsig = compSCORE(sampleFitdata,coln = 1, sig = 1)
head(dfsig)
table(dfsig$index)
wna = which(is.na(dfsig$score))
dfsig = dfsig[-wna,]
table(dfsig$index)

```

So now for 'ARGININE BIOSYNTHETIC PROCESS' instead of a 
P-value of 4.76e-11 we get a value of 2.02e-12

```{r}
goresp$enrichInfo[1:3,1:10]
```

Going back to our calculation:
```{r ARGININEBIO}
N = nrow(dfsig)

# moving forward we should also filter out geneSets
# but this is done inside the runGORESP function:
yGOBP = lapply(yGOBP.gmt,intersect,dfsig$gene)
arg = yGOBP$`ARGININE BIOSYNTHETIC PROCESS`
gene = dfsig$gene[dfsig$index==1]
arg = yGOBP$`ARGININE BIOSYNTHETIC PROCESS`
m = length(arg)
int = intersect(arg,gene)
x = length(int)
x
n = N - m
k = 191
```


We can use the phyper function, which gives the
cumulative probability of the hypergeometric
distribution at a specified value where again:

q is the number of white balls drawn without replacment = x-1

m is the number of white balls in the urn

n is the number of black balls in the urn

k is the number of balls drawn from the urn

```{r}
phyper(q = x-1, m, n, k, lower.tail = FALSE)
goresp$enrichInfo$P[2]
goresp$enrichInfo$nOverlap[2]
```

## Fisher's test
So we get the same value. We can look at it
another way. 
This is equivalent to setting up a
contingency table and doing a Fisher's
exact test:

```{r contingency}

  
## q = length overlap-1 #successes in querySet
## m is the number of successes in population = geneSetsize 
## success in population
## n = N - m
## k is length of queryset
## N = universe, nrow in  matrix
## Prepare a two-dimensional contingency table
contingency.table <- data.frame(matrix(nrow=2, ncol=2))
rownames(contingency.table) <- c("significant","not significant")
colnames(contingency.table) <- c("in geneSet","not in geneSet")
 
## Assign the values one by one to make sure we put them in the right
## place (this is not necessary, we could enter the 4 values in a single instruction).
contingency.table[1,1] <- x 
## Number of marked genes in the selection
contingency.table[1,2] <- k - x 
## Number of non-marked genes in the selection
contingency.table[2, 1] <- m - x 
## Number of marked genes outside of the selection
contingency.table[2, 2] <- n - (k - x) ## Number of non-marked genes in the selection
contingency.table
```

From here we can do a fisher's test
```{r}
fish = fisher.test(contingency.table , alternative = "greater")
 
fish$p.value

```

We can fill this table out by adding the margins.

```{r}
(contingency.row.sum = apply(contingency.table, 1, sum))
(contingency.col.sum = apply(contingency.table, 2, sum))
## Create a contingency table with marginal sums
exp.contingency.table <- cbind(contingency.table, contingency.row.sum)
exp.contingency.table <- rbind(exp.contingency.table, apply(exp.contingency.table, 2, sum))
names(exp.contingency.table) <- c(names(contingency.table), "total")
rownames(exp.contingency.table) <- c(rownames(contingency.table), "total")
print(exp.contingency.table)
```

I find that this is a useful way to
grasp the phyper function, where it's
easy to get confused about what the
balls are and what's marked and what's
unmarked. This lays it out very clearly.
Contingency tables also really shine at
highlighting joint probabilities because
each cell displays the number of times
events occurred together. Those cell
values are the joint events for the
numerator. The grand total is the number
of outcomes for the denominator.

Under the hypothesis of independence, we
would expect to find the genes
distributed randomly within the table,
but with the same marginal row and
column sums. We can therefore estimate
the expected number of elements in each
cell by taking the product of the total
entries in the row and column, and
dividing it by the total number of genes
because these joint probabilities must
be additive under the laws of
independence. If you really want to test
for independence these estimates could
then be used to test in a chisq.test.

OK, lets move on.

Just for grins, lets see how the well
the human data agrees if we run a GO
enrichment test vs a GSEA. Instead of
ranks, we'll have to set a threshold.
Lets use a threshold

```{r hGORESP}
library(GOenrichment)

mat = cbind(ranks,ranks)
hresp = GOenrichment::runGORESP(fdrThresh = 0.05,mat=mat,
    coln=1,curr_exp = colnames(mat)[1], sig = ranks[1500], 
    bp_input = hGOBP.gmt,go_input = NULL,minSetSize = 15, 
    maxSetSize = 200)
intersect(fposgsea$pathway[1:20],hresp$enrichInfo$term[1:20])
```

Looks familiar right? That's because we used the same
geneSets.

So now we're going to use the results
from the hypergeometric tests to do some
visualization and construct a network of
GO enrichments. So right now our data is
in the form of an enrichment dataframe
with lots of nice annotation -- FDR
values, geneSetSize and all those nice
details we just looked at.

The other piece of the data that got
generated in this "runGORESP" function
was an edge matrix, and I am not going
to go through all the details here, you
can go through the code on your own
time, its pretty straight forward.
Basically what it does is it takes all
of the enriched geneSets and it walks
throgh them systematically comparing all
pairwise combinations to calculate the
overlap coefficient between them. So
here again is yet another method for
simplifying the output and trying to
minimize the rendundancy between terms.
We saw it with the collapsePathways in
fgsea, and the packages GOSemSim, DOSE,
enrichplot and meshes have implemented a
variety of tools based on semantic
similarity of genes or gene products
using [Gene Ontology, Disease Ontology
and Medical Subject Headings]
(<https://yulab-smu.top/biomedical-knowledge-mining-book/index.html>).
The GO semanitic similarities are
similar to GO slim methods, where child
GO terms are collapsed into more general
terms, but as far as I can tell the data
is not that easy to manage.

```{r}

#####
vis = visSetup(enrichInfo = goresp$enrichInfo,edgeMat = goresp$edgeMat)
############
# generates a network based on node and edge parameters
############

runNetwork(nodes = vis$nodes,edges=vis$edges)
```

Lets do the same for the human results:

```{r}
vis = visSetup(hresp$enrichInfo,hresp$edgeMat)
runNetwork(vis$nodes,vis$edges)
```

The map is very busy, but interestingly you can see 
that there is a major cluster. The major node in this
cluster is defined by HALLMARK_EPITHELIAL_MESENCHYMAL_TRANSITION,
then there are some other large nodes, EXTRACELLULAR MATRIX ORGANIZATION,
EXTRACELLULAR STRUCTURE ORGANIZATION. Because these pink nodes
are all linked by edges, their GO terms overlap by at least 50%.
The HALLMARK_EPITHELIAL_MESENCHYMAL_TRANSITION is related  to cancer.
Cancer cells undergo epitelial mesenchymal transition (EMT) 
-- a loss of polaritiy and cell-cell adhesion and a gain of migratory 
and invasive properties -- during metastasis.
properties. So this analysis really helped clarify the positive
end of the GSEA analysis. We could do the same for the negative end
by just negating the values.

```{r}
library(GOenrichment)

mat[,2]= -mat[,2]
mat = mat[order(mat[,2],decreasing = T),]

hresp = GOenrichment::runGORESP(fdrThresh = 0.05,mat=mat,
    coln=2,curr_exp = colnames(mat)[2], sig = mat[1500,2], 
    bp_input = hGOBP.gmt,go_input = NULL,minSetSize = 15, 
    maxSetSize = 200)
intersect(fnegsea$pathway[1:20],hresp$enrichInfo$term[1:20])
```

```{r}
vis = visSetup(hresp$enrichInfo,hresp$edgeMat)
runNetwork(vis$nodes,vis$edges)
```

It's diffficult to see what's in the cluster. But it
is in the data:

```{r}
table(hresp$enrichInfo$cluster)
w = which(hresp$enrichInfo$cluster == "#9900FF")
hresp$enrichInfo$term[w]

```

## Minimizing GO redundancy
By modifying our calcOverlap function we could get rid of a 
lot of the overlapping terms. So lets do that:

```{r}
calcOverGOcoeff= function(select_fun = overlapcoeff,res, cutoff = 0.6, padj = "p.adjust",term = "ID", goont) { 
    library(dplyr)
  
  
    gs = res
    res = data.frame(res,stringsAsFactors = F)
    
    mx = matrix(rep(NA,nrow(res)*nrow(res)),nrow=nrow(res),ncol = nrow(res))
    
    colnames(mx) = res[,term]
    rownames(mx) = res[,term]
    
    GO = goont
    
    
    mx[upper.tri(mx)]=888
    m = reshape2::melt(mx, value.name = "overlap",as.is = TRUE,varnames = c("go1","go2"),na.rm = T)
    m1 = match(m$go1,names(GO))
    m2 = match(m$go2,names(GO))
    
    m$overlap = mapply(select_fun,GO[m1],GO[m2])
    
    
    wm = which(m$overlap < cutoff)
    if(length(wm) > 0) m = m[-wm,]
    
    m1=match(m$go1,res[,term])
    m$padj1=res[m1,padj]
    m1=match(m$go2,res[,term])
    m$padj2=res[m1,padj]
    m=m%>% dplyr::mutate(gotoremove= m%>%dplyr::select(padj1,padj2)%>% apply(MARGIN = 1, FUN =      function(x) which.max(x)))
    
    
    w1 = which(m$gotoremove == 1)
    w2 = which(m$gotoremove == 2)
    if(length(w1) > 0) m$gotoremove[w1]=m$go1[w1]
    if(length(w2) > 0) m$gotoremove[w2]=m$go2[w2]
    wres = c(w1,w2)
    if(length(wres) > 0) {w = which(res[,term]%in% m$gotoremove)
    res = res[-w,]}
    
    if(class(gs)=="gseaResult"|class(gs)=="enrichResult") {gs@result=res
    return(gs)} else {return(res)}
    
}
```

Use the hresp enrichInfo and plug it into this revised function:
```{r newhresp}


if (file.exists("data/newhresp.rds")) {
  newhresp <- readRDS("data/newhresp.rds")
} else {
  
newhresp = calcOverGOcoeff(select_fun=overlapcoeff,cutoff=0.7,res = hresp$enrichInfo,padj="FDR",term="term",goont=hGOBP.gmt)
  saveRDS(newhresp, "data/newhresp.rds")
}


w=which(hresp$edgeMat$source%in% newhresp$id)
ww=which(hresp$edgeMat$target[w]%in% newhresp$id)
newedge = hresp$edgeMat[w[ww],]

v = visSetup(newhresp,newedge)

runNetwork(v$nodes,v$edges)
```

So we could play those parameters, to get something
pretty. You could also imagine modifying runGORESP such that
it doesn't rely on a significance cutoff and instead
tests iteratively down the ranked list of genes akin 
to GSEA until it reaches the maximum GO enrichment
and then stops there. It would still be an over representation
analysis, it just wouldn't rely on an arbitrary significance
cutoff. I've tried this and it works quite well, but
it takes a lot of memory as you can imagine.

Ok, moving on. So there are lots and lots of GO enrichment 
and GSEA packages, and lots of different algorithms. Most 
use GO enrichment or GSEA. We talked about some of them.

enrichGO, gseGO, enricher, goana,
go_enrich, gost etc. Over representation
analysis: goprofiler2, clusterProfiler,
GOfuncR, limma, goseq, topGO GSEA:
fgsea, GSEA, ViSEAGO, RGSEA 


## Take home
I hope from this discussion you realize that it is important to try different
GO analyses and different geneSets to turn your data around such that you 
end up with results you have confidence in. Meaning the results will generally
agree if you run the same enrichment algorithm using a different function with the
same geneSet. That is not always straightforward as we've seen, because
different functions can call different geneSets. An acceptable solution 
to this is to run two
enrichments and take the intersection between them. Or simply be very
explicit about the enrichment you do run and the parameters that 
were used as long as that geneSet is one broadly used set in the literature. 
Download dates need to be included obviously. I hope you've also recognized 
when it is ok to slim down your GO enrichment terms, and when it 
doesn't make sense and you lose too much information. I would definitely 
keep GSEA positive and GSEA 
negative results in different sets of data to clearly distinguish 
between results. I think clusterProfiler is one of the best packages, that's
why I've highlighted it here. It gives you a lot of options within the same
package and it plays nicely with GOSemSim and enrichplot.
Visualization wherever possible is essential in communicating
your results. GO enrichment is the fun part in my eyes, as it is here where 
you really can extract the meaning from large-scale data analysis...