code/05-Annotations.Rmd

---
title: "05-Annotating DMLs and MACAU loci"
author: "Laura H Spencer"
date: "11/20/2019"
output: html_document
---

### Load libraries 

```{r setup, message=FALSE, warning=FALSE, results=FALSE}
list.of.packages <- c("tidyverse", "reshape2", "here", "scales", "clipr") #add new libraries here 
new.packages <- list.of.packages[!(list.of.packages %in% installed.packages()[,"Package"])]
if(length(new.packages)) install.packages(new.packages)

# Load all libraries 
lapply(list.of.packages, FUN = function(X) {
  do.call("require", list(X)) 
})
sessionInfo()
```

## Use bedtools to see where DMLs and MACAU loci are located.

DMLs between the two Olympia oyster populations, Hood Canal and South Sound, were identified using methylKit. File is: [analyses/dml25.bed](https://github.com/sr320/paper-oly-mbdbs-gen/blob/master/analyses/dml25.bed) 

MACAU was used to identify loci at which methylation is associated with a phenotype, in our case shell length, while controlling for relatedness. 

### Locations of feature files 

[Olurida_v081-20190709.gene.gff](https://raw.githubusercontent.com/sr320/paper-oly-mbdbs-gen/master/genome-features/Olurida_v081-20190709.gene.gff) - genes, gene = "../genome-features/Olurida_v081-20190709.gene.gff"   
[Olurida_v081-20190709.gene.2kbslop.gff](https://raw.githubusercontent.com/sr320/paper-oly-mbdbs-gen/master/genome-features/Olurida_v081-20190709.gene.2kbslop.gff) - genes +/- 2kb, gene2kb = "../genome-features/Olurida_v081-20190709.gene.2kbslop.gff"    
[Olurida_v081-20190709.2kbflank-up.gff](https://github.com/sr320/paper-oly-mbdbs-gen/blob/master/genome-features/Olurida_v081-20190709.2kbflank-up.gff?raw=true) - genes-2kb (aka 2kb upstream, 5' end)  
[Olurida_v081-20190709.2kbflank-down.gff](https://github.com/sr320/paper-oly-mbdbs-gen/blob/master/genome-features/Olurida_v081-20190709.2kbflank-down.gff?raw=true) - genes+2kb (aka 2kb downstream, 3' end)  
[Olurida_v081-20190709.CDS.gff](https://raw.githubusercontent.com/sr320/paper-oly-mbdbs-gen/master/genome-features/Olurida_v081-20190709.CDS.gff) - Coding regions of genes, CDS = "../genome-features/Olurida_v081-20190709.CDS.gff"      
[Olurida_v081-20190709.exon.gff](https://raw.githubusercontent.com/sr320/paper-oly-mbdbs-gen/master/genome-features/Olurida_v081-20190709.exon.gff) - Exons, exon = "../genome-features/Olurida_v081-20190709.exon.gff"     
[Olurida_v081-20190709.mRNA.gff](https://raw.githubusercontent.com/sr320/paper-oly-mbdbs-gen/master/genome-features/Olurida_v081-20190709.mRNA.gff) - mRNA, mRNA = "../genome-features/Olurida_v081-20190709.mRNA.gff"      
[Olurida_v081_TE-Cg.gff](https://raw.githubusercontent.com/sr320/paper-oly-mbdbs-gen/master/genome-features/Olurida_v081_TE-Cg.gff) - Transposable elements, TE = "../genome-features/Olurida_v081_TE-Cg.gff"    
[20190709-Olurida_v081.stringtie.gtf](https://raw.githubusercontent.com/sr320/paper-oly-mbdbs-gen/master/genome-features/20190709-Olurida_v081.stringtie.gtf) - alternative splice variants, ASV = "../genome-features/20190709-Olurida_v081.stringtie.gtf"   

### Background files used for MACAU and DMLs 

AllLociMACAU = "../analyses/macau/macau-all-loci.bed"
AllLociDMLs = "../analyses/DMLs/mydiff-all.bed"

### File with MACAU loci 
macau = "../analyses/macau/macau.sign.length.perc.meth.bed"

### File with DML loci 

DML = "../analyses/DMLs/dml25.bed"

### Identify MACAU and DML loci in each genome feature using Bedtools intersect 

I will use `bedtools` to identify where DML and MACAU loci intersect with known genome features. 

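`bedtools intersect` options to use:  
`-wb` - Write the original entry in B for each overlap, so each output row pairs a locus with the feature it overlaps  
`-v` - Only report those entries in A that have _no_ overlap in B (used here to find loci in unknown regions)  
`-a` - File A  
`-b` - File B  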

## 1. DMLs 
Olurida_v081-20190709.2kbflank-up.gff
Olurida_v081-20190709.2kbflank-down.gff

```{bash}
bedtools intersect -wb -a "../analyses/DMLs/dml25.bed" -b "../genome-features/Olurida_v081-20190709.gene.gff" >  ../analyses/DMLs/DML-gene.bed
bedtools intersect -wb -a "../analyses/DMLs/dml25.bed" -b "../genome-features/Olurida_v081-20190709.gene.2kbslop.gff" >  ../analyses/DMLs/DML-gene2kb.bed
bedtools intersect -wb -a "../analyses/DMLs/dml25.bed" -b "../genome-features/Olurida_v081-20190709.2kbflank-up.gff" >  ../analyses/DMLs/DML-2kbflank-up.bed
bedtools intersect -wb -a "../analyses/DMLs/dml25.bed" -b "../genome-features/Olurida_v081-20190709.2kbflank-down.gff" >  ../analyses/DMLs/DML-2kbflank-down.bed
bedtools intersect -wb -a "../analyses/DMLs/dml25.bed" -b "../genome-features/Olurida_v081-20190709.exon.gff" >  ../analyses/DMLs/DML-exon.bed
bedtools intersect -wb -a "../analyses/DMLs/dml25.bed" -b "../genome-features/Olurida_v081-20190709.CDS.gff" >  ../analyses/DMLs/DML-CDS.bed
bedtools intersect -wb -a "../analyses/DMLs/dml25.bed" -b "../genome-features/Olurida_v081-20190709.mRNA.gff" >  ../analyses/DMLs/DML-mRNA.bed
bedtools intersect -wb -a "../analyses/DMLs/dml25.bed" -b "../genome-features/Olurida_v081_TE-Cg.gff" >  ../analyses/DMLs/DML-TE.bed
bedtools intersect -wb -a "../analyses/DMLs/dml25.bed" -b "../genome-features/20190709-Olurida_v081.stringtie.gtf" >  ../analyses/DMLs/DML-ASV.bed
bedtools intersect -v -a "../analyses/DMLs/dml25.bed" -b "../genome-features/Olurida_v081-20190709.gene.2kbslop.gff" "../genome-features/Olurida_v081-20190709.exon.gff" "../genome-features/Olurida_v081-20190709.CDS.gff" "../genome-features/Olurida_v081-20190709.mRNA.gff" "../genome-features/Olurida_v081_TE-Cg.gff" "../genome-features/20190709-Olurida_v081.stringtie.gtf" >  ../analyses/DMLs/DML-unknown.bed
```
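
The nine `-wb` calls above differ only in the feature file and the output suffix, so they could in principle be generated from a list. The sketch below is a dry run: it only prints the commands with `echo` and does not call `bedtools`. The function name `print_intersect_cmds` and the `label:filename` encoding are illustrative choices, not part of the original notebook.

```bash
#!/usr/bin/env bash
# Dry run: print one `bedtools intersect -wb` command per genome feature.
# The "label:filename" pairs are an illustrative encoding (hypothetical helper).
print_intersect_cmds() {
  local loci="../analyses/DMLs/dml25.bed"
  local features=(
    "gene:Olurida_v081-20190709.gene.gff"
    "gene2kb:Olurida_v081-20190709.gene.2kbslop.gff"
    "2kbflank-up:Olurida_v081-20190709.2kbflank-up.gff"
    "2kbflank-down:Olurida_v081-20190709.2kbflank-down.gff"
    "exon:Olurida_v081-20190709.exon.gff"
    "CDS:Olurida_v081-20190709.CDS.gff"
    "mRNA:Olurida_v081-20190709.mRNA.gff"
    "TE:Olurida_v081_TE-Cg.gff"
    "ASV:20190709-Olurida_v081.stringtie.gtf"
  )
  local pair label file
  for pair in "${features[@]}"; do
    label="${pair%%:*}"   # text before the first colon
    file="${pair#*:}"     # text after the first colon
    echo "bedtools intersect -wb -a \"$loci\" -b \"../genome-features/$file\" > ../analyses/DMLs/DML-$label.bed"
  done
}

print_intersect_cmds
```

Replacing `echo "..."` with the command itself would run the intersections; the `-v` call that produces `DML-unknown.bed` takes several `-b` files at once and would stay a separate line.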

### Background loci features that were used to identify DMLs  

```{bash}
bedtools intersect -wb -a "../analyses/DMLs/mydiff-all.bed" -b "../genome-features/Olurida_v081-20190709.gene.gff" >  ../analyses/DMLs/AllLociDMLs-gene.bed
bedtools intersect -wb -a "../analyses/DMLs/mydiff-all.bed" -b "../genome-features/Olurida_v081-20190709.gene.2kbslop.gff" >  ../analyses/DMLs/AllLociDMLs-gene2kb.bed
bedtools intersect -wb -a "../analyses/DMLs/mydiff-all.bed" -b "../genome-features/Olurida_v081-20190709.2kbflank-up.gff" >  ../analyses/DMLs/AllLociDMLs-2kbflank-up.bed
bedtools intersect -wb -a "../analyses/DMLs/mydiff-all.bed" -b "../genome-features/Olurida_v081-20190709.2kbflank-down.gff" >  ../analyses/DMLs/AllLociDMLs-2kbflank-down.bed
bedtools intersect -wb -a "../analyses/DMLs/mydiff-all.bed" -b "../genome-features/Olurida_v081-20190709.exon.gff" >  ../analyses/DMLs/AllLociDMLs-exon.bed
bedtools intersect -wb -a "../analyses/DMLs/mydiff-all.bed" -b "../genome-features/Olurida_v081-20190709.CDS.gff" >  ../analyses/DMLs/AllLociDMLs-CDS.bed
bedtools intersect -wb -a "../analyses/DMLs/mydiff-all.bed" -b "../genome-features/Olurida_v081-20190709.mRNA.gff" >  ../analyses/DMLs/AllLociDMLs-mRNA.bed
bedtools intersect -wb -a "../analyses/DMLs/mydiff-all.bed" -b "../genome-features/Olurida_v081_TE-Cg.gff" >  ../analyses/DMLs/AllLociDMLs-TE.bed
bedtools intersect -wb -a "../analyses/DMLs/mydiff-all.bed" -b "../genome-features/20190709-Olurida_v081.stringtie.gtf" >  ../analyses/DMLs/AllLociDMLs-ASV.bed
bedtools intersect -v -a "../analyses/DMLs/mydiff-all.bed" -b "../genome-features/Olurida_v081-20190709.gene.2kbslop.gff" "../genome-features/Olurida_v081-20190709.exon.gff" "../genome-features/Olurida_v081-20190709.CDS.gff" "../genome-features/Olurida_v081-20190709.mRNA.gff" "../genome-features/Olurida_v081_TE-Cg.gff" "../genome-features/20190709-Olurida_v081.stringtie.gtf" >  ../analyses/DMLs/AllLociDMLs-unknown.bed
```

## 2. MACAU Loci 

```{bash}
bedtools intersect -wb -a "../analyses/macau/macau.sign.length.perc.meth.bed" -b "../genome-features/Olurida_v081-20190709.gene.gff" >  ../analyses/macau/macau-gene.bed
bedtools intersect -wb -a "../analyses/macau/macau.sign.length.perc.meth.bed" -b "../genome-features/Olurida_v081-20190709.gene.2kbslop.gff" >  ../analyses/macau/macau-gene2kb.bed
bedtools intersect -wb -a "../analyses/macau/macau.sign.length.perc.meth.bed" -b "../genome-features/Olurida_v081-20190709.2kbflank-up.gff" >  ../analyses/macau/macau-2kbflank-up.bed
bedtools intersect -wb -a "../analyses/macau/macau.sign.length.perc.meth.bed" -b "../genome-features/Olurida_v081-20190709.2kbflank-down.gff" >  ../analyses/macau/macau-2kbflank-down.bed
bedtools intersect -wb -a "../analyses/macau/macau.sign.length.perc.meth.bed" -b "../genome-features/Olurida_v081-20190709.exon.gff" >  ../analyses/macau/macau-exon.bed
bedtools intersect -wb -a "../analyses/macau/macau.sign.length.perc.meth.bed" -b "../genome-features/Olurida_v081-20190709.CDS.gff" >  ../analyses/macau/macau-CDS.bed
bedtools intersect -wb -a "../analyses/macau/macau.sign.length.perc.meth.bed" -b "../genome-features/Olurida_v081-20190709.mRNA.gff" >  ../analyses/macau/macau-mRNA.bed
bedtools intersect -wb -a "../analyses/macau/macau.sign.length.perc.meth.bed" -b "../genome-features/Olurida_v081_TE-Cg.gff" >  ../analyses/macau/macau-TE.bed
bedtools intersect -wb -a "../analyses/macau/macau.sign.length.perc.meth.bed" -b "../genome-features/20190709-Olurida_v081.stringtie.gtf" >  ../analyses/macau/macau-ASV.bed
bedtools intersect -v -a "../analyses/macau/macau.sign.length.perc.meth.bed" -b "../genome-features/Olurida_v081-20190709.gene.2kbslop.gff" "../genome-features/Olurida_v081-20190709.exon.gff" "../genome-features/Olurida_v081-20190709.CDS.gff" "../genome-features/Olurida_v081-20190709.mRNA.gff" "../genome-features/Olurida_v081_TE-Cg.gff" "../genome-features/20190709-Olurida_v081.stringtie.gtf" >  ../analyses/macau/macau-unknown.bed
```

###  Background loci used for MACAU  

```{bash}
bedtools intersect -wb -a "../analyses/macau/macau-all-loci.bed" -b "../genome-features/Olurida_v081-20190709.gene.gff" >  ../analyses/macau/AllLociMACAU-gene.bed
bedtools intersect -wb -a "../analyses/macau/macau-all-loci.bed" -b "../genome-features/Olurida_v081-20190709.gene.2kbslop.gff" >  ../analyses/macau/AllLociMACAU-gene2kb.bed
bedtools intersect -wb -a "../analyses/macau/macau-all-loci.bed" -b "../genome-features/Olurida_v081-20190709.2kbflank-up.gff" >  ../analyses/macau/AllLociMACAU-2kbflank-up.bed
bedtools intersect -wb -a "../analyses/macau/macau-all-loci.bed" -b "../genome-features/Olurida_v081-20190709.2kbflank-down.gff" >  ../analyses/macau/AllLociMACAU-2kbflank-down.bed
bedtools intersect -wb -a "../analyses/macau/macau-all-loci.bed" -b "../genome-features/Olurida_v081-20190709.exon.gff" >  ../analyses/macau/AllLociMACAU-exon.bed
bedtools intersect -wb -a "../analyses/macau/macau-all-loci.bed" -b "../genome-features/Olurida_v081-20190709.CDS.gff" >  ../analyses/macau/AllLociMACAU-CDS.bed
bedtools intersect -wb -a "../analyses/macau/macau-all-loci.bed" -b "../genome-features/Olurida_v081-20190709.mRNA.gff" >  ../analyses/macau/AllLociMACAU-mRNA.bed
bedtools intersect -wb -a "../analyses/macau/macau-all-loci.bed" -b "../genome-features/Olurida_v081_TE-Cg.gff" >  ../analyses/macau/AllLociMACAU-TE.bed
bedtools intersect -wb -a "../analyses/macau/macau-all-loci.bed" -b "../genome-features/20190709-Olurida_v081.stringtie.gtf" >  ../analyses/macau/AllLociMACAU-ASV.bed
bedtools intersect -v -a "../analyses/macau/macau-all-loci.bed" -b "../genome-features/Olurida_v081-20190709.gene.2kbslop.gff" "../genome-features/Olurida_v081-20190709.exon.gff" "../genome-features/Olurida_v081-20190709.CDS.gff" "../genome-features/Olurida_v081-20190709.mRNA.gff" "../genome-features/Olurida_v081_TE-Cg.gff" "../genome-features/20190709-Olurida_v081.stringtie.gtf" >  ../analyses/macau/AllLociMACAU-unknown.bed
```

# Summarizing Annotations 

## 1. DMLs - where are differentially methylated loci located? 

```{r, message=FALSE, warning=FALSE}
DMLfiles <- c("DML-CDS.bed", "DML-exon.bed", "DML-gene.bed", "DML-gene2kb.bed", "DML-2kbflank-up.bed", "DML-2kbflank-down.bed", "DML-mRNA.bed", "DML-TE.bed", "DML-ASV.bed", "DML-unknown.bed")
DML.features <- list()
for (i in c(1:10)) {
  DML.features[[i]] <- read_delim(here::here("analyses", "DMLs", DMLfiles[i]), delim = '\t', col_names = FALSE) %>% as_tibble()}
for (i in 1:9) {
  DML.features[[i]] <- DML.features[[i]] %>%
    setNames(c("contig.dml","start.dml","end.dml","score.dml","contig.feat", "source.feat","feature","start.feat","end.feat","unknown1","strand","unknown2","attribute")) %>%
mutate(ID=str_extract(attribute, "ID=(.*?);"),
       Parent=str_extract(attribute, "Parent=(.*?);"),
       Name=str_extract(attribute, "Name=(.*?);"),
       Alias=str_extract(attribute, "Alias=(.*?);"),
       AED=str_extract(attribute, "AED=(.*?);"),
       eAED=str_extract(attribute, "eAED=(.*?);"),
       Note=str_extract(attribute, "Note=(.*?);"),
       Ontology_term=str_extract(attribute, "Ontology_term=(.*?);"),
       Dbxref=str_extract(attribute, "Dbxref=(.*?);"),
       SPID=str_extract(attribute, "SPID=(.*?);")
       ) %>%
mutate_at("feature", as.factor)
}
names(DML.features) <- c("DML.CDS", "DML.exon", "DML.gene","DML.gene2kb", "DML.flank-up", "DML.flank-down","DML.mRNA", "DML.TE", "DML.ASV", "DML.unknown")
DML.features[["DML.unknown"]] <- 
  DML.features[["DML.unknown"]] %>% 
  setNames(c("contig.DML", "start.dml", "end.dml", "score.dml"))

DML.features[["DML.CDS"]] <- DML.features[["DML.CDS"]] %>% mutate_at("unknown2", as.character)
DML.features[["DML.TE"]] <- DML.features[["DML.TE"]] %>% mutate_at("unknown1", as.character)
DML.features[["DML.ASV"]] <- DML.features[["DML.ASV"]] %>% mutate_at("unknown1", as.character)
DML.features[["DML.gene2kb"]]$feature <- "gene2kb"
DML.features[["DML.flank-up"]]$feature <- "flank-up"
DML.features[["DML.flank-down"]]$feature <- "flank-down"
DML.features[["DML.unknown"]]$feature <- "unknown"
DML.features.df <- bind_rows(DML.features[-9]) #don't include alternative splice variants (element 9 = DML.ASV)
print(DML.summary <- table(DML.features.df[c("feature")])) #Note: "similarity" refers to transposable elements; also NO alternative splice variants are included. 

# 
# DML.features[[3]] %>% left_join(uniprot, by=c("contig.feat","start.feat", "end.feat")) 

#save DML loci feature df object to file, to use in notebook #12 
save(DML.features.df, file="../analyses/DMLs/R-objects/DML.features.df")
save(DML.summary, file="../analyses/DMLs/R-objects/DML.summary")

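# dml25 supplies the total number of DMLs, used as the denominator below 
load("../analyses/DMLs/R-objects/dml25")

#How many DMLs are found within known genes? 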
DML.features[["DML.gene"]] %>%
  select(contig.dml, start.dml, end.dml) %>% nrow()
1873/nrow(dml25)

#How many DMLs are found within one or more exons? 
DML.features[["DML.exon"]] %>%
  select(contig.dml, start.dml, end.dml) %>% unique() %>% nrow()
1463/nrow(dml25)

#How many DMLs are found upstream & downstream of genes? 
DML.features[["DML.flank-up"]] %>% 
    select(contig.dml, start.dml, end.dml) %>% nrow()
180/nrow(dml25)

DML.features[["DML.flank-down"]] %>% 
    select(contig.dml, start.dml, end.dml) %>% nrow()
171/nrow(dml25)

#How many DMLs are found within TEs? 
DML.features[["DML.TE"]] %>% 
  select(contig.dml, start.dml, end.dml) %>% unique() %>% nrow()
188/nrow(dml25)

DML.features[["DML.unknown"]] %>% 
  select(contig.DML, start.dml, end.dml) %>% unique() %>% nrow()
497/nrow(dml25)

```
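
Some counts in the chunk above call `unique()` and others do not. Without `-u`, `bedtools intersect` reports every overlap as its own row, so a locus that overlaps two features appears twice in the `-wb` output. The sketch below shows the difference between row counts and distinct-locus counts on a toy file. The coordinates, the `maker` source label, and the helper names `count_rows` and `count_unique_loci` are made up for illustration.

```bash
#!/usr/bin/env bash
# Count overlap rows vs. distinct loci in a `bedtools intersect -wb` output.
# Columns 1-3 are the locus (contig, start, end) taken from file A.
count_rows() { awk 'END { print NR }' "$1"; }
count_unique_loci() { cut -f1-3 "$1" | sort -u | awk 'END { print NR }'; }

# Toy stand-in for DML-exon.bed (made-up values): the first locus overlaps two exons.
toy=$(mktemp)
printf 'Contig1\t100\t102\t30\tContig1\tmaker\texon\t50\t150\t.\t+\t.\tID=exonA;\n'   > "$toy"
printf 'Contig1\t100\t102\t30\tContig1\tmaker\texon\t90\t200\t.\t+\t.\tID=exonB;\n'  >> "$toy"
printf 'Contig2\t500\t502\t-28\tContig2\tmaker\texon\t400\t600\t.\t-\t.\tID=exonC;\n' >> "$toy"

echo "rows: $(count_rows "$toy")"
echo "unique loci: $(count_unique_loci "$toy")"
```

If a feature type can overlap itself (for example nested or overlapping genes), the undeduplicated `nrow()` counts may overstate the number of loci, so comparing both numbers is a quick sanity check.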

## 2. MACAU - where are size-associated methylated loci located? 

```{r, message=FALSE, warning=FALSE}
macau.files <- c("macau-CDS.bed", "macau-exon.bed", "macau-gene.bed", "macau-gene2kb.bed", "macau-2kbflank-up.bed", "macau-2kbflank-down.bed", "macau-mRNA.bed", "macau-ASV.bed", "macau-TE.bed", "macau-unknown.bed")
macau.features <- list()
for (i in c(1:10)) {
  macau.features[[i]] <- read_delim(here::here("analyses", "macau", macau.files[i]), delim = '\t', col_names = FALSE) %>% as_tibble()}
for (i in c(1:4, 6:8)) {
  macau.features[[i]] <- macau.features[[i]] %>%
    setNames(c("contig.macau","start.macau","end.macau","contig.feat", "source.feat","feature","start.feat","end.feat","unknown1","strand","unknown2","attribute")) %>%
mutate(ID=str_extract(attribute, "ID=(.*?);"),
       Parent=str_extract(attribute, "Parent=(.*?);"),
       Name=str_extract(attribute, "Name=(.*?);"),
       Alias=str_extract(attribute, "Alias=(.*?);"),
       AED=str_extract(attribute, "AED=(.*?);"),
       eAED=str_extract(attribute, "eAED=(.*?);"),
       Note=str_extract(attribute, "Note=(.*?);"),
       Ontology_term=str_extract(attribute, "Ontology_term=(.*?);"),
       Dbxref=str_extract(attribute, "Dbxref=(.*?);"),
       SPID=str_extract(attribute, "SPID=(.*?);")
       ) %>%
mutate_at("feature", as.factor)
}
names(macau.features) <- c("macau.CDS", "macau.exon","macau.gene", "macau.gene2kb","macau.flank-up","macau.flank-down","macau.mRNA", "macau.ASV", "macau.TE", "macau.unknown")
macau.features[["macau.CDS"]] <- macau.features[["macau.CDS"]] %>% mutate_at("unknown2", as.character)
macau.features[["macau.ASV"]] <- macau.features[["macau.ASV"]] %>% mutate_at("unknown1", as.character)
macau.features[["macau.gene2kb"]]$feature <- "gene2kb"
macau.features[["macau.flank-up"]]$feature <- "flank-up"
macau.features[["macau.flank-down"]]$feature <- "flank-down"
#macau.features[["macau.unknown"]]$feature <- "unknown"
macau.features.df <- bind_rows(macau.features[-8])
print(macau.summary <- table(macau.features.df[c("feature")])) #NOTE this does not contain alternative splice variants. There are zero transposable elements and unknown loci.  

save(macau.features.df, file="../analyses/macau/R-objects/macau.features.df")
save(macau.summary, file="../analyses/macau/R-objects/macau.summary")

#macau.features[["macau.gene"]] %>% inner_join(uniprot, by=c("contig.feat","start.feat", "end.feat"))  #not sure if this is working?
```

### Barplots of mean % methylation by population, MACAU loci in known features  

```{r}
load(here::here("analyses", "methylation", "R-objects", "meth_filter_reshaped"))

meth_filter_calcs <- meth_filter_reshaped %>% 
  group_by(population, chr, start) %>% 
  dplyr::summarise(
    mean_percMeth = mean(percMeth, na.rm=TRUE),
    sd_percMeth=sd(percMeth, na.rm=TRUE),
    n()) 

meth_filter_calcs %>% 
      filter(chr %in% macau.features[["macau.gene2kb"]]$contig.macau &
           start %in% macau.features[["macau.gene2kb"]]$start.macau) %>%
  ggplot(aes(x = population, y = mean_percMeth, fill = population, 
                         label=paste0(round(mean_percMeth, digits = 2), "%"))) + 
      geom_bar(stat="identity", width = 0.5) + ylim(0,110) +
      geom_pointrange(aes(ymin=mean_percMeth, 
                        ymax=mean_percMeth+sd_percMeth, width=0.15)) + 
      geom_text(size=3, vjust=-0.5, hjust=1.25) +
      theme_light() + ggtitle("") + facet_wrap(~chr) + 
    scale_fill_manual(values=c("firebrick3","dodgerblue3"))

## 


```

## 3. All background loci used for DMLs & MACAU 

### Where are DML background loci located? 

```{r, message=FALSE, warning=FALSE}
allLociDMLfiles <- c("AllLociDMLs-CDS.bed", "AllLociDMLs-exon.bed", "AllLociDMLs-gene.bed",  "AllLociDMLs-gene2kb.bed", "AllLociDMLs-2kbflank-up.bed", "AllLociDMLs-2kbflank-down.bed", "AllLociDMLs-mRNA.bed", "AllLociDMLs-TE.bed", "AllLociDMLs-ASV.bed", "AllLociDMLs-unknown.bed")
allLociDML.features <- list()
for (i in c(1:10)) {
  allLociDML.features[[i]] <- read_delim(here::here("analyses","DMLs", allLociDMLfiles[i]), delim = '\t', col_names = FALSE) %>% as_tibble()}
for (i in 1:9) {
  allLociDML.features[[i]] <- allLociDML.features[[i]] %>%
    setNames(c("contig.allLoci","start.allLoci","end.allLoci", "unknown", "contig.feat", "source.feat","feature","start.feat","end.feat","unknown1","strand","unknown2","attribute")) %>%
mutate(ID=str_extract(attribute, "ID=(.*?);"),
       Parent=str_extract(attribute, "Parent=(.*?);"),
       Name=str_extract(attribute, "Name=(.*?);"),
       Alias=str_extract(attribute, "Alias=(.*?);"),
       AED=str_extract(attribute, "AED=(.*?);"),
       eAED=str_extract(attribute, "eAED=(.*?);"),
       Note=str_extract(attribute, "Note=(.*?);"),
       Ontology_term=str_extract(attribute, "Ontology_term=(.*?);"),
       Dbxref=str_extract(attribute, "Dbxref=(.*?);"),
       SPID=str_extract(attribute, "SPID=(.*?);")
       ) %>%
mutate_at("feature", as.factor)
}
#allLociDML.features[[8]] <- allLociDML.features[[8]] %>%
#    setNames(c("contig.feat", "start.feat","end.feat","unknown"))
names(allLociDML.features) <- c("allLociDML.CDS","allLociDML.exon","allLociDML.gene","allLociDML.gene2kb","allLociDML.flank-up","allLociDML.flank-down","allLociDML.mRNA","allLociDML.TE","allLociDML.ASV","allLociDML.unknown")
allLociDML.features[["allLociDML.unknown"]] <- 
  allLociDML.features[["allLociDML.unknown"]] %>% 
  setNames(c("contig.allLoci", "start.allLoci", "end.allLoci", "unknown"))
allLociDML.features[["allLociDML.CDS"]] <- allLociDML.features[["allLociDML.CDS"]] %>% mutate_at("unknown2", as.character)
allLociDML.features[["allLociDML.ASV"]] <- allLociDML.features[["allLociDML.ASV"]] %>% mutate_at("unknown1", as.character)
allLociDML.features[["allLociDML.gene2kb"]]$feature <- "gene2kb"
allLociDML.features[["allLociDML.flank-up"]]$feature <- "flank-up"
allLociDML.features[["allLociDML.flank-down"]]$feature <- "flank-down"
allLociDML.features[["allLociDML.unknown"]]$feature <- "unknown"

allLociDML.features.df <- bind_rows(allLociDML.features[-9]) #don't include alternative splice variants  
print(allLociDML.summary <- table(allLociDML.features.df[c("feature")])) #Note, "similarity"= transposable elements 

save(allLociDML.summary, file="../analyses/DMLs/R-objects/allLociDML.summary") 
```

### Where are MACAU background loci located? 

```{r, message=FALSE, warning=FALSE}
allLociMACAUfiles <- c("AllLociMACAU-CDS.bed", "AllLociMACAU-exon.bed", "AllLociMACAU-gene.bed",  "AllLociMACAU-gene2kb.bed", "AllLociMACAU-2kbflank-up.bed", "AllLociMACAU-2kbflank-down.bed", "AllLociMACAU-mRNA.bed", "AllLociMACAU-TE.bed", "AllLociMACAU-ASV.bed", "AllLociMACAU-unknown.bed")
allLociMACAU.features <- list()
for (i in c(1:10)) {
  allLociMACAU.features[[i]] <- read_delim(here::here("analyses","macau", allLociMACAUfiles[i]), delim = '\t', col_names = FALSE) %>% as_tibble()}
for (i in 1:9) {
  allLociMACAU.features[[i]] <- allLociMACAU.features[[i]] %>%
    setNames(c("contig.allLoci","start.allLoci","end.allLoci", "contig.feat", "source.feat","feature","start.feat","end.feat","unknown1","strand","unknown2","attribute")) %>%
mutate(ID=str_extract(attribute, "ID=(.*?);"),
       Parent=str_extract(attribute, "Parent=(.*?);"),
       Name=str_extract(attribute, "Name=(.*?);"),
       Alias=str_extract(attribute, "Alias=(.*?);"),
       AED=str_extract(attribute, "AED=(.*?);"),
       eAED=str_extract(attribute, "eAED=(.*?);"),
       Note=str_extract(attribute, "Note=(.*?);"),
       Ontology_term=str_extract(attribute, "Ontology_term=(.*?);"),
       Dbxref=str_extract(attribute, "Dbxref=(.*?);"),
       SPID=str_extract(attribute, "SPID=(.*?);")
       ) %>%
mutate_at("feature", as.factor)
}
names(allLociMACAU.features) <- c("allLociMACAU.CDS","allLociMACAU.exon","allLociMACAU.gene","allLociMACAU.gene2kb","allLociMACAU.flank-up","allLociMACAU.flank-down","allLociMACAU.mRNA","allLociMACAU.TE","allLociMACAU.ASV","allLociMACAU.unknown")
allLociMACAU.features[["allLociMACAU.unknown"]] <- 
  allLociMACAU.features[["allLociMACAU.unknown"]] %>% 
  setNames(c("contig.allLoci", "start.allLoci", "end.allLoci"))
allLociMACAU.features[["allLociMACAU.CDS"]] <- allLociMACAU.features[["allLociMACAU.CDS"]] %>% mutate_at("unknown2", as.character)
allLociMACAU.features[["allLociMACAU.ASV"]] <- allLociMACAU.features[["allLociMACAU.ASV"]] %>% mutate_at("unknown1", as.character)
allLociMACAU.features[["allLociMACAU.gene2kb"]]$feature <- "gene2kb"
allLociMACAU.features[["allLociMACAU.flank-up"]]$feature <- "flank-up"
allLociMACAU.features[["allLociMACAU.flank-down"]]$feature <- "flank-down"
allLociMACAU.features[["allLociMACAU.unknown"]]$feature <- "unknown"
allLociMACAU.features.df <- bind_rows(allLociMACAU.features[-9]) #don't include alternative splice variants 
print(allLociMACAU.summary <- table(allLociMACAU.features.df[c("feature")])) #Note, "similarity"= transposable elements 

save(allLociMACAU.summary, file="../analyses/macau/R-objects/allLociMACAU.summary")
```

## Summary barplot showing where DMLs, MACAU and All Loci are located (relative to the total number assessed for each group)

```{r, message=FALSE, warning=FALSE}
load("../analyses/DMLs/R-objects/dml25") # load df for no. of DMLs (including those not annotated to known features)
load("../analyses/DMLs/R-objects/mydiff.all") # load df for total no. of loci (which were analyzed for DML) 
load("../analyses/macau/R-objects/macau.FDR.length") #load df for total no. of MACAU loci analyzed & no. of sign. loci 

# Add row to summary dataframe with total number of loci in DMLs, all loci 

# Create single summary df showing where DML loci and all loci analyzed for DML are located 
loci.locations <-  merge(x=merge(x=merge(x=melt(DML.summary, varnames = "feature", value.name = "DML"), 
      y=melt(macau.summary, varnames = "feature", value.name = "macau"),
      by="feature", all=TRUE), 
      y=melt(allLociDML.summary, varnames = "feature", value.name = "allLociDML"),
      by="feature", all=TRUE),
      y=melt(allLociMACAU.summary, varnames = "feature", value.name = "allLociMACAU"),
      by="feature", all=TRUE) %>% 
  mutate(feature = as.character(feature)) %>%
  mutate(DML=as.numeric(DML), allLociDML=as.numeric(allLociDML))
#    DML=as.numeric(DML), allLociDML=as.numeric(allLociDML)) %>%

# First add a row with the # loci that flank genes (upstream or downstream), then add a row with the total # methylated loci, then calculate percentages 
loci.locations <- loci.locations  %>%
  rbind(c("geneflank2kb",
          loci.locations[loci.locations$feature=="gene2kb","DML"]-loci.locations[loci.locations$feature=="gene","DML"],
          loci.locations[loci.locations$feature=="gene2kb","macau"]-loci.locations[loci.locations$feature=="gene","macau"],
          loci.locations[loci.locations$feature=="gene2kb","allLociDML"]-loci.locations[loci.locations$feature=="gene","allLociDML"],
          loci.locations[loci.locations$feature=="gene2kb","allLociMACAU"]-loci.locations[loci.locations$feature=="gene","allLociMACAU"])) %>%
  rbind(c("all", as.numeric(nrow(dml25)), as.numeric(nrow(subset(macau.FDR.length, significant=="TRUE"))), as.numeric(nrow(mydiff.all)), as.numeric(nrow(macau.FDR.length)))) %>%
  mutate_at(vars(-feature), as.numeric) %>%
  mutate(DMLperc=DML/nrow(dml25), 
         MACAUperc=macau/nrow(subset(macau.FDR.length, significant=="TRUE")),
         AllDMLperc=allLociDML/nrow(mydiff.all),
         AllMACAUperc=allLociMACAU/nrow(macau.FDR.length))
save(loci.locations, file="../analyses/methylation/R-objects/loci.locations")

loci.locations.long <- cbind(melt(loci.locations[,c("feature", "DML", "macau", "allLociDML", "allLociMACAU")], 
                                  variable.name = "analysis", value.name = "count"),
                             melt(loci.locations[,c("feature", "DMLperc", "MACAUperc", "AllDMLperc", "AllMACAUperc")], 
                                  variable.name = "analysis", value.name = "percent"))[,c(1,2,3,6)] %>%
   mutate(analysis=fct_relevel(analysis, c("DML", "allLociDML", "macau", "allLociMACAU")),
         feature=as.factor(str_replace(feature, "similarity", "TE")))
save(loci.locations.long, file="../analyses/methylation/R-objects/loci.locations.long")
```


```{r}
ggplot(data=subset(loci.locations.long, feature=="gene" | feature=="flank-up"  | feature=="flank-down" | feature=="exon" | feature=="TE" | feature=="unknown"), aes(x=analysis, y=count, fill=feature, label=prettyNum(count, big.mark = ","))) +  #percent(percent, accuracy = 0.1)
#  geom_bar(stat="identity", width = .5) +
  geom_bar(position="fill", stat="identity", width=0.5) +
  labs(y="Proportion of Loci", x=NULL) + 
  scale_fill_manual(name = "Loci Location", labels = c("Exon", "Downtream Gene Flank (+2kb at 3')", "Upstream Gene Flank (-2kb at 5')", 
                                                       "Gene Body", "Transposable Elements", "Unknown Regions"),
                    values=c("#a6cee3", "#1f78b4", "#b2df8a","#33a02c", "#fb9a99", "gray")) +
  ggtitle("Locations of DML and size-associated (MACAU) loci in genome\n with background loci locations") +
  theme_minimal() + geom_text(size = 2, position = position_fill(vjust = 0.5)) + 
  scale_x_discrete(labels=c("DML" = "DMLs",
                            "allLociDML" = "DML Background\nLoci",
                            "macau" = "Size-Associated\nLoci (MACAU)",
                            "allLociMACAU" = "MACAU\nBackground\nLoci"))

ggsave(filename = "../analyses/barplot-DMLs-SALs.png", device = "png")
```

## DAVID Enrichment Analysis 

### DMLs 

```{r}
# Read in O. lurida gene file that connects OLUR gene ID to uniprot accession number 
# Olurida_gene_uniprot <- read_delim(file = here::here("genome-features", "Olur_gene_UPacc.gff"), delim = "\t", col_names = c("contig", "source", "feature", "start", "end", "unknown1", "strand", "unknown2", "geneID_uniprotID")) %>%
#   separate(geneID_uniprotID, into=c("geneID","uniprotID"), sep = ";") %>% 
#   select(geneID, uniprotID)


### Copy Uniprot accession numbers for DMLs that are annotated 
DML.features[["DML.gene2kb"]] %>%
  mutate(SPID=str_remove(SPID, "SPID=")) %>% mutate(SPID=str_remove(SPID, ";")) %>%  #remove extraneous info from Olur gene ID
  select(ID, SPID) %>% unique() %>% #keep only one instance of each Oly gene 
  #left_join(Olurida_gene_uniprot, by=c("Name" = "geneID")) %>%  #add uniprot IDs to gene dataframe
  pull(SPID) %>% na.omit() %>% as.vector() %>%  # pull the uniprot ID column as a character vector and remove NA values 
  write_clip() #copy to clipboard 

### Copy Uniprot accession numbers for all loci assessed for DMLs that are annotated 
allLociDML.features[["allLociDML.gene2kb"]] %>%
  mutate(SPID=str_remove(SPID, "SPID=")) %>% mutate(SPID=str_remove(SPID, ";")) %>%  #remove extraneous info from Olur gene ID
  select(ID, SPID) %>% unique() %>%  #keep only one instance of each Oly gene 
#left_join(Olurida_gene_uniprot, by=c("Name" = "geneID")) %>%  #add uniprot IDs to gene dataframe
  pull(SPID) %>% na.omit() %>% as.vector() %>%  # pull the uniprot ID column as a character vector and remove NA values 
  write_clip() #copy to clipboard       

```
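
For a quick check outside R, the accession list can also be pulled straight from the attribute column of an intersect output. This is a minimal sketch on a toy file: the rows and gene IDs are made up, and the helper name `extract_spids` is hypothetical. Note one difference from the R chunk: this version de-duplicates by accession, whereas the R code keeps one row per unique `ID`/`SPID` pair.

```bash
#!/usr/bin/env bash
# Extract de-duplicated SPID values from the GFF attribute column.
extract_spids() {
  grep -o 'SPID=[^;]*' "$1" | sed 's/^SPID=//' | sort -u
}

# Toy stand-in for DML-gene2kb.bed (made-up rows); geneB has no SPID and is skipped.
toy=$(mktemp)
printf 'Contig1\t100\t102\t30\tContig1\tmaker\tgene\t50\t150\t.\t+\t.\tID=geneA;Name=geneA;SPID=Q5DTJ9;\n'   > "$toy"
printf 'Contig1\t120\t122\t31\tContig1\tmaker\tgene\t50\t150\t.\t+\t.\tID=geneA;Name=geneA;SPID=Q5DTJ9;\n'  >> "$toy"
printf 'Contig2\t500\t502\t-28\tContig2\tmaker\tgene\t400\t600\t.\t-\t.\tID=geneB;Name=geneB;\n'            >> "$toy"
printf 'Contig3\t10\t12\t40\tContig3\tmaker\tgene\t1\t90\t.\t+\t.\tID=geneC;Name=geneC;SPID=P13395;\n'      >> "$toy"

extract_spids "$toy"
```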

## Enriched biological function, DMLs 

| Category         | Term                                                | Count | %    | PValue | Genes                                                                                          | List Total | Pop Hits | Pop Total | Fold Enrichment | Bonferroni | Benjamini | FDR |
|------------------|-----------------------------------------------------|-------|------|--------|------------------------------------------------------------------------------------------------|------------|----------|-----------|-----------------|------------|-----------|-----|
| GOTERM_BP_DIRECT | GO:0045214~sarcomere organization                   | 4     | 2.82 | 0.0099 | Q5DTJ9, G4SLH0, Q2V2M9, Q8WZ42                                                                 | 122        | 8        | 1984      | 8.13            | 0.999      | 1.0       | 1.0 |
| GOTERM_BP_DIRECT | GO:0051693~actin filament capping                   | 3     | 2.11 | 0.011  | P13395, Q13813, P16546                                                                         | 122        | 3        | 1984      | 16.26           | 0.999      | 1.0       | 1.0 |
| GOTERM_BP_DIRECT | GO:0006513~protein monoubiquitination               | 4     | 2.82 | 0.014  | Q80Z37, Q7Z6Z7, Q7TMY8, E1B932                                                                 | 122        | 9        | 1984      | 7.23            | 0.999      | 1.0       | 1.0 |
| GOTERM_BP_DIRECT | GO:0006284~base-excision repair                     | 4     | 2.82 | 0.014  | Q13569, Q7Z6Z7, Q9U221, Q7TMY8                                                                 | 122        | 9        | 1984      | 7.23            | 0.999      | 1.0       | 1.0 |
| GOTERM_BP_DIRECT | GO:0007275~multicellular organism development       | 12    | 8.45 | 0.017  | Q8R508, Q62513, Q8BNA6, A0JMF8, Q58EK5, Q60636, Q0V989, Q13535, Q6ZQ88, Q2QCI8, Q9P203, Q80TZ9 | 122        | 89       | 1984      | 2.19            | 0.999      | 1.0       | 1.0 |
| GOTERM_BP_DIRECT | GO:0010842~retina layer formation                   | 3     | 2.11 | 0.033  | Q8R508, Q8BNA6, P28828                                                                         | 122        | 5        | 1984      | 9.76            | 0.999      | 1.0       | 1.0 |
| GOTERM_BP_DIRECT | GO:0008104~protein localization                     | 4     | 2.82 | 0.069  | Q9W4E2, P0C5Y8, Q8CJ40, P29503                                                                 | 122        | 16       | 1984      | 4.066           | 1.0        | 1.0       | 1.0 |
| GOTERM_BP_DIRECT | GO:0042493~response to drug                         | 5     | 3.52 | 0.069  | O95477, P28828, P08183, Q13535, Q99758                                                         | 122        | 26       | 1984      | 3.13            | 1.0        | 1.0       | 1.0 |
| GOTERM_BP_DIRECT | GO:0008152~metabolic process                        | 5     | 3.52 | 0.087  | Q66PG2, P13086, Q9U221, P58058, O17732                                                         | 122        | 28       | 1984      | 2.90            | 1.0        | 1.0       | 1.0 |
| GOTERM_BP_DIRECT | GO:0006974~cellular response to DNA damage stimulus | 8     | 5.63 | 0.091  | Q9CPR8, Q80Z37, Q13535, Q9U221, Q7TMY8, F1MRW8, Q2NL57, Q8IW19                                 | 122        | 64       | 1984      | 2.033           | 1.0        | 1.0       | 1.0 |
| GOTERM_BP_DIRECT | GO:0034220~ion transmembrane transport              | 4     | 2.82 | 0.092  | Q6Q760, Q9HCE9, Q18297, Q9P2D8                                                                 | 122        | 18       | 1984      | 3.61            | 1.0        | 1.0       | 1.0 |
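
As a check on how to read the table, DAVID's Fold Enrichment is the proportion of the gene list annotated with a term divided by the proportion of the background annotated with it. The sketch below recomputes it for the first two rows from the Count, List Total, Pop Hits and Pop Total columns. The data frame `enr.check` is typed in by hand for illustration and is not read from the DAVID output file.

```r
# Recompute DAVID fold enrichment for two rows of the table above
# (values copied by hand; `enr.check` is an illustrative object, not part of the analysis)
enr.check <- data.frame(
  Term       = c("GO:0045214", "GO:0051693"),
  Count      = c(4, 3),
  List.Total = c(122, 122),
  Pop.Hits   = c(8, 3),
  Pop.Total  = c(1984, 1984)
)

# Fold enrichment = (Count / List Total) / (Pop Hits / Pop Total)
enr.check$Fold.Enrichment <- with(enr.check, (Count / List.Total) / (Pop.Hits / Pop.Total))
round(enr.check$Fold.Enrichment, 2)
```

The rounded values should agree with the Fold Enrichment column (8.13 and 16.26).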

## For each GO term, which population has a higher methylation rate? 

```{r}
EnrBP.DMLs <- read_delim(here::here("analyses", "DMLs", "Enriched-BioProc-DMLs.txt"), delim = '\t', col_names = TRUE) %>% 
  separate(Term, into = c("Term", "Function"), sep="~") %>%
  mutate(GeneSet="DML") %>%
  mutate_at(vars(Term, Function), as.factor) #%>%
  #separate_rows(Genes)


# Bar plots for each DML by enriched GO term - looking for a pattern by population 
for (i in 1:nrow(EnrBP.DMLs)) {
  print(DML.features[["DML.gene2kb"]] %>% 
  left_join(DML.calcs %>% mutate(start=as.numeric(start-1)), by = c("contig.dml"="chr", "start.dml"="start")) %>%
  mutate(SPID=str_remove(SPID, "SPID="))  %>% mutate(SPID=str_remove(SPID, ";")) %>% #filter(!is.na(SPID)) %>% 
  filter(SPID %in% unlist(strsplit(EnrBP.DMLs$Genes[i], split=", "))) %>%
  ggplot(aes(x = population, y = mean_percMeth, fill = population, 
                         label=paste0(round(mean_percMeth, digits = 2), "%"))) + 
      geom_bar(stat="identity", width = 0.5) + ylim(0,110) +
      geom_pointrange(aes(ymin=mean_percMeth, 
                        ymax=mean_percMeth+sd_percMeth, width=0.15)) + 
      geom_text(size=3, vjust=-0.5, hjust=1.25) +
      theme_light() + ggtitle(as.character(EnrBP.DMLs$Term[i])) +
    scale_fill_manual(values=c("firebrick3","dodgerblue3")) + 
    theme(strip.background = element_blank(), strip.text.x = element_blank(), 
          strip.text.y = element_blank(), title = element_blank()) +
  facet_wrap(~SPID+contig.dml+start.dml)) #can include "+contig.dml+start.dml" to separate each DML  
}


plot_list = list()

# Scatter plots for each enriched biological process showing the ratio of % meth. between HC and SS for each DML  
for (i in 1:nrow(EnrBP.DMLs)) {
  p <- DML.features[["DML.gene2kb"]] %>% 
  left_join(DML.calcs %>% mutate(start=as.numeric(start-1)), by = c("contig.dml"="chr", "start.dml"="start")) %>%
  dplyr::select(-sd_percMeth) %>%
  pivot_wider(names_from = population, values_from = mean_percMeth) %>%
  mutate(SPID=str_remove(SPID, "SPID="))  %>% mutate(SPID=str_remove(SPID, ";")) %>% #filter(!is.na(SPID)) %>% 
  filter(SPID %in% unlist(strsplit(EnrBP.DMLs$Genes[i], split=", "))) %>%
  ggplot(aes(x = HC, y = SS, color=SPID, label=paste0(round(HC/SS, digits = 2)))) + 
      geom_point(size=3) + 
      #geom_text(size=3, vjust=-0.5, hjust=0.75) +
      geom_smooth(method=lm, color = "black") +
    ggtitle(paste("Enriched Biological Process: ", as.character(EnrBP.DMLs$Function[i]))) +
    xlab("Hood Canal % Methylation") + ylab("South Sound % Methylation") +
    theme_light() + theme(plot.title = element_text(size=10))
  plot_list[[i]] <- p
}

pdf("../analyses/DMLs/DML-Enriched-BP-meth-ratio-scatter.pdf")
for (i in 1:length(plot_list)) {
    print(plot_list[[i]])
}
dev.off()

```
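
The plots above answer the "which population is higher" question visually. A tabular version of the same comparison might look like the sketch below. It assumes a long table with one mean per DML per population, similar in shape to `DML.calcs`; the object `toy.calcs` and all of its values are made up for illustration.

```r
library(dplyr)
library(tidyr)

# Hypothetical per-population means for three DMLs (made-up values, not study data)
toy.calcs <- tibble(
  chr           = c("Contig1", "Contig1", "Contig2", "Contig2", "Contig3", "Contig3"),
  start         = c(100, 100, 250, 250, 75, 75),
  population    = rep(c("HC", "SS"), times = 3),
  mean_percMeth = c(80, 40, 10, 55, 62, 60)
)

# One row per DML with HC and SS side by side, then flag the population with higher methylation
toy.wide <- toy.calcs %>%
  pivot_wider(names_from = population, values_from = mean_percMeth) %>%
  mutate(diff  = HC - SS,
         hyper = if_else(diff > 0, "HC", "SS"))

toy.wide
count(toy.wide, hyper)  # number of DMLs hypermethylated in each population
```

Joining `hyper` back to the genes listed for each enriched term would give a per-term tally, if a numeric summary is preferred over the bar plots.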


### Size Associated Loci (SALs, identified via MACAU) 

```{r}
### Copy Uniprot accession numbers for SALs that are annotated 
macau.features[["macau.gene2kb"]] %>%
  mutate(SPID=str_remove(SPID, "SPID=")) %>% mutate(SPID=str_remove(SPID, ";")) %>%  #remove extraneous info from Olur gene ID
  #left_join(Olurida_gene_uniprot, by=c("Name" = "geneID")) %>%  #add uniprot IDs to gene dataframe
  select(SPID) %>% na.omit() %>% as.vector() %>%  # select only uniprot ID column, remove NA values and convert to vector 
  write_clip() #copy to clipboard 


### Copy Uniprot accession numbers for all annotated loci assessed by MACAU (background list) 
allLociMACAU.features[["allLociMACAU.gene2kb"]] %>%
  mutate(SPID=str_remove(SPID, "SPID=")) %>% mutate(SPID=str_remove(SPID, ";")) %>%  #remove extraneous info from Olur gene ID
  #left_join(Olurida_gene_uniprot, by=c("Name" = "geneID")) %>%  #add uniprot IDs to gene dataframe
  select(SPID) %>% na.omit() %>% as.vector() %>%  # select only uniprot ID column, remove NA values and convert to vector 
  write_clip() #copy to clipboard       

```

### Enriched biological functions, SALs

| Category         | Term                                               | Count | %      | PValue | Genes          | List Total | Pop Hits | Pop Total | Fold Enrichment | Bonferroni | Benjamini | FDR   |
|------------------|----------------------------------------------------|-------|--------|--------|----------------|------------|----------|-----------|-----------------|------------|-----------|-------|
| GOTERM_BP_DIRECT | GO:0006607~NLS-bearing protein import into nucleus | 2     | 18.182 | 0.0182 | Q8BFY9, H2QII6 | 10         | 4        | 1967      | 98.35           | 0.608      | 0.928     | 0.928 |

### Convert GO terms to GO Slim

Use this spreadsheet that assigns each GO Term to a GO Slim: http://owl.fish.washington.edu/halfshell/bu-alanine-wd/17-07-20/GO-GOslim.sorted

```{r}
# Read in the GO Slim table  
GOSlim <- read_delim(here::here("resources", "GO-GOslim.sorted"), delim = '\t', col_names = FALSE) %>% 
  setNames(c("GO", "term", "slim", "category")) %>% 
  mutate_at(vars(category, slim), as.factor)

GO.enriched.DML <- read_delim(here::here("analyses", "DMLs", "Enriched-BioFunc-DMLs.txt"), delim = '\t', col_names = TRUE) %>% 
   separate(Term, into=c("GO", "term"), remove=TRUE,sep = "~") %>% 
   left_join(GOSlim, by=c("GO", "term")) #add slim 

GO.enriched.DML %>% select(GO, term, slim, FDR) %>% View()

GO.enriched.SAL <- read_delim(here::here("analyses", "macau", "EnrichBioFunc-DAVID-macau.txt"), delim = '\t', col_names = TRUE) %>% 
   separate(Term, into=c("GO", "term"), remove=TRUE,sep = "~") %>% 
   left_join(GOSlim, by=c("GO", "term")) #add slim 

GO.enriched.SAL %>% select(GO, slim, FDR)
```

Extract GO terms for SAL genes and merge with GO Slim 

```{r}
macau.features[["macau.gene2kb"]] %>%
  mutate(SPID=str_remove(SPID, "SPID=")) %>% mutate(SPID=str_remove(SPID, ";")) %>% 
  mutate(Ontology_term=str_remove(Ontology_term, "Ontology_term=")) %>% mutate(Ontology_term=str_remove(Ontology_term, ";")) %>%  
  select(SPID, Ontology_term) %>% na.omit() %>% mutate(GO = strsplit(as.character(Ontology_term), ",")) %>% 
    unnest(GO) %>% select(SPID, GO) %>% left_join(GOSlim, by=c("GO")) %>% arrange(slim)
```
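
To make the split-and-join step above concrete, here is a minimal sketch on toy data. The gene names, GO IDs and slim names are placeholders (not real annotations), and `toy.genes`, `toy.slim` and `toy.long` are hypothetical objects used only here.

```r
library(dplyr)
library(tidyr)

# Toy gene table with comma-separated GO IDs (placeholder IDs, not real annotations)
toy.genes <- tibble(
  SPID          = c("geneA", "geneB"),
  Ontology_term = c("GO:toy1,GO:toy2", "GO:toy2,GO:toy3")
)

# Toy GO -> GO Slim lookup (made-up slim names)
toy.slim <- tibble(
  GO   = c("GO:toy1", "GO:toy2", "GO:toy3"),
  slim = c("slimX", "slimY", "slimY")
)

# One row per gene x GO term, then attach the slim
toy.long <- toy.genes %>%
  mutate(GO = strsplit(Ontology_term, ",")) %>%
  unnest(GO) %>%
  select(SPID, GO) %>%
  left_join(toy.slim, by = "GO")

# Count each gene once per slim, even if several of its GO terms map to the same slim
toy.slim.counts <- toy.long %>%
  distinct(SPID, slim) %>%
  count(slim, name = "n_genes")
toy.slim.counts
```

The `distinct()` call is a design choice: without it a gene with two GO terms in the same slim would be counted twice.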

### Plot SALs by oyster size 

```{r, warning=FALSE, message=FALSE}
load(file="../analyses/methylation/R-objects/meth_filter_reshaped")
load(file="../analyses/macau/size.macau")

# Plot all SAL % meth against oyster size in one plot 
meth_filter_reshaped %>% 
      filter(chr %in% macau.features[["macau.gene2kb"]]$contig.macau &
           start %in% macau.features[["macau.gene2kb"]]$start.macau) %>%
  mutate(sample=as.integer(sample)) %>%
  left_join(size.macau,  by = c("sample" = "y.MBD.FILENAME")) %>%
ggplot(aes(x=y.Length, y=percMeth)) +
  geom_point() + theme_minimal() + geom_smooth(method="lm")  #+
  #facet_wrap(~chr+start)  #uncomment (and restore the "+" above) to facet by locus

# Separate scatter plots for SAL % meth ~ oyster size 
temp <-  meth_filter_reshaped %>% 
      filter(chr %in% macau.features[["macau.gene2kb"]]$contig.macau &
           start %in% macau.features[["macau.gene2kb"]]$start.macau) %>%
  mutate(sample=as.integer(sample)) %>%
  left_join(size.macau,  by = c("sample" = "y.MBD.FILENAME")) %>%
  left_join(macau.features[["macau.gene2kb"]][c("contig.macau", "start.macau", "SPID", "Note")], 
            c("chr" = "contig.macau", "start" = "start.macau")) %>%
mutate(SPID=str_remove(SPID, "SPID=")) %>% mutate(SPID=str_remove(SPID, ";")) %>%
  mutate(Note=str_remove(Note, "Note=")) %>% mutate(Note=gsub("(.*)\\(.*", "\\1", Note)) %>% 
   mutate(Note=str_remove(Note, ";"))

SAL.loci <- temp %>% dplyr::select(chr, start) %>% distinct()

for (i in 1:nrow(SAL.loci)) {
  print(ggplot(data= temp %>% filter(chr==SAL.loci[i,1]), aes(x=`y.Length`, y=percMeth), color=population) +
  theme_minimal(base_size = 15) + geom_smooth(method="gam", col="gray70") + geom_point(size=2.8, aes(col=population)) + 
    scale_color_manual(values=c("firebrick3", "dodgerblue3")) +
  ggtitle(label = paste("SAL: ", SAL.loci[i,1], "-", SAL.loci[i,2], sep=""),
              subtitle = paste("Gene: ", unique(temp[temp$chr==SAL.loci[i,1], "SPID"]), " - ",
            unique(temp[temp$chr==SAL.loci[i,1], "Note"]), sep="")) +
    xlab("Oyster length (mm)") + ylab("% Methylated") + theme(legend.position = c(0.9, 0.15)) + 
    labs(color = "Population") +
    guides(colour = guide_legend(override.aes = list(size=4)))
  )
}  

```
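
One caveat on the filter used above: `chr %in% ... & start %in% ...` tests the two columns independently, so a locus can pass if its contig matches one SAL and its start coordinate matches a different SAL. Whether that happens in this data set has not been checked here. The sketch below, on made-up coordinates, shows the difference from matching the (contig, start) pair jointly with `semi_join()`; `toy.meth` and `toy.sal` are hypothetical objects.

```r
library(dplyr)

# Toy methylation table and toy SAL list (made-up coordinates)
toy.meth <- tibble(
  chr      = c("Contig1", "Contig1", "Contig2", "Contig2"),
  start    = c(100, 200, 100, 200),
  percMeth = c(10, 20, 30, 40)
)
toy.sal <- tibble(contig.macau = c("Contig1", "Contig2"),
                  start.macau  = c(100, 200))

# Independent %in% tests keep any row whose chr AND start each appear somewhere in the SAL list
toy.indep <- toy.meth %>%
  filter(chr %in% toy.sal$contig.macau & start %in% toy.sal$start.macau)

# semi_join keeps only rows whose (chr, start) pair appears in the SAL list
toy.joint <- toy.meth %>%
  semi_join(toy.sal, by = c("chr" = "contig.macau", "start" = "start.macau"))

nrow(toy.indep)  # all four toy rows pass
nrow(toy.joint)  # only the two true SAL rows pass
```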

### For SALs, run GLMs to see if % meth differs by population 

```{r, warning=FALSE, message=FALSE}
library(car)

SAL.glm.pop <- vector("list", nrow(SAL.loci))
for (i in 1:nrow(SAL.loci)) {
  SAL.glm.pop[[i]] <- Anova(glm(percMeth ~ population, data=temp %>% filter(chr==SAL.loci[i,1])))
  names(SAL.glm.pop)[i] <- paste(SAL.loci[i,1], "-", SAL.loci[i,2], sep="") #label each result by locus
}

SAL.glm.size <- vector("list", nrow(SAL.loci))
for (i in 1:nrow(SAL.loci)) {
  SAL.glm.size[[i]] <- Anova(glm(percMeth ~ y.Length, data=temp %>% filter(chr==SAL.loci[i,1])))
  names(SAL.glm.size)[i] <- paste(SAL.loci[i,1], "-", SAL.loci[i,2], sep="") #label each result by locus
}

# These are the SALs in which GLMs also indicate % meth ~ size 
print(SAL.glm.size[c(1,2,5,6,11)])
SAL.loci[c(1,2,5,6,11),] 

# plot those 
for (i in c(1,2,5,6,11)) {
  print(ggplot(data= temp %>% filter(chr==SAL.loci[i,1]), aes(x=`y.Length`, y=percMeth), color=population) +
  theme_minimal(base_size = 15) + geom_smooth(method="gam", col="gray70") + geom_point(size=2.8, aes(col=population)) + 
    scale_color_manual(values=c("firebrick3", "dodgerblue3")) +
  ggtitle(label = paste("SAL: ", SAL.loci[i,1], "-", SAL.loci[i,2], sep=""),
              subtitle = paste("Gene: ", unique(temp[temp$chr==SAL.loci[i,1], "SPID"]), " - ",
            unique(temp[temp$chr==SAL.loci[i,1], "Note"]), sep="")) +
    xlab("Oyster length (mm)") + ylab("% Methylated") + theme(legend.position = c(0.9, 0.15)) + 
    labs(color = "Population") +
    guides(colour = guide_legend(override.aes = list(size=4)))
  )
}  

# 4 of the 5 are similar to annotated genes. Let's plot those in a grid 
SAL.plots <- vector("list", nrow(SAL.loci)) #indexed by SAL number, so SAL.plots[[11]] exists
for (i in c(2,5,6,11)) {
p <-  ggplot(data= temp %>% filter(chr==SAL.loci[i,1]), 
             aes(x=`y.Length`, y=percMeth), color=population) +
  theme_minimal(base_size = 12) + geom_smooth(method="glm", col="gray70") + 
  geom_point(size=2.8, aes(col=population)) + 
    scale_color_manual(values=c("firebrick3", "dodgerblue3")) +
  ggtitle(label = paste("SAL: ", SAL.loci[i,1], "-", SAL.loci[i,2], sep=""),
              subtitle = paste("Gene: ", unique(temp[temp$chr==SAL.loci[i,1], "SPID"]), " - ",
            unique(temp[temp$chr==SAL.loci[i,1], "Note"]), sep="")) +
    xlab("Oyster length (mm)") + ylab("% Methylated") + 
  theme(legend.position = "bottom") + #c(0.9, 0.15)
    labs(color = "Population") +
    guides(colour = guide_legend(override.aes = list(size=4)))
SAL.plots[[i]] <- p
}  

library(gridExtra)
grid.arrange(SAL.plots[[2]], SAL.plots[[5]], SAL.plots[[6]], SAL.plots[[11]], nrow = 2)

```
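
The SALs carried forward above are picked by hard-coded index after reading the printed tests. A sketch of doing that selection programmatically is shown below, under stated assumptions: the data are made up, and the p-value is taken from base R's `summary()` of a Gaussian GLM rather than from `car::Anova()`, so the numbers would not necessarily match the tests above. `toy.temp`, `toy.fits` and `toy.p` are hypothetical names.

```r
# Toy data: two hypothetical loci with made-up lengths and % methylation
toy.temp <- data.frame(
  locus    = rep(c("Contig1-100", "Contig2-200"), each = 6),
  y.Length = rep(c(20, 24, 28, 32, 36, 40), times = 2),
  percMeth = c(10, 22, 29, 41, 52, 58,    # rises with length
               35, 31, 36, 30, 34, 33)    # roughly flat
)

# Fit one Gaussian GLM per locus; split() names the list elements by locus
toy.fits <- lapply(split(toy.temp, toy.temp$locus),
                   function(d) glm(percMeth ~ y.Length, data = d))

# Pull the p-value for the length term from each model summary
toy.p <- sapply(toy.fits, function(m) summary(m)$coefficients["y.Length", "Pr(>|t|)"])

# Loci where % methylation is associated with length at alpha = 0.05
names(toy.p)[toy.p < 0.05]
```

Selecting by name rather than by position keeps the choice valid if the order of `SAL.loci` changes after a re-run.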

# Merge Enriched Biological Processes in DMGs and genes containing DMLs - see which overlap 

```{r}
EnrBP.DMGs <- read_delim(here::here("analyses", "DMGs", "Enriched-BioProc-DMGs.txt"), delim = '\t', col_names = TRUE) %>% 
  separate(Term, into = c("Term", "Function"), sep="~") %>%
  mutate(GeneSet="DMG") %>%
  mutate_at(vars(Term, Function), as.factor)

EnrBP.DMLs <- read_delim(here::here("analyses", "DMLs", "Enriched-BioProc-DMLs.txt"), delim = '\t', col_names = TRUE) %>% 
  separate(Term, into = c("Term", "Function"), sep="~") %>%
  mutate(GeneSet="DML") %>%
  mutate_at(vars(Term, Function), as.factor)

# Read in the GO Slim table (the GO ID column is named "Term" here so it joins to the enrichment tables)  
GOSlim <- read_delim(here::here("resources", "GO-GOslim.sorted"), delim = '\t', col_names = FALSE) %>% 
  setNames(c("Term", "term", "slim", "category")) %>% 
  mutate_at(vars(category, slim), as.factor)

# Identify biological processes enriched in both DMGs and genes containing DMLs
inner_join(by = c("Term", "Function"), 
          EnrBP.DMGs %>% dplyr::select(Term, Function, PValue, FDR, GeneSet) %>%
  mutate(DMG=paste0(signif(PValue, digits = 2), " (", FDR, ")")), 
          EnrBP.DMLs %>% dplyr::select(Term, Function, PValue, FDR, GeneSet) %>%
  mutate(DML=paste0(signif(PValue, digits = 2), " (", FDR, ")"))) %>%
    left_join(GOSlim) %>% dplyr::select(slim, Term, Function, PValue.x, PValue.y) %>% 
  arrange(slim) %>% 
  write_csv(file = "../analyses/methylation/DMG-DML-EnrichedBP-common.csv", col_names = TRUE)

nrow(EnrBP.DMGs)
nrow(EnrBP.DMLs)
31-7

# Write out table for supplemental materials - all enriched biological processes found in DMGs, in genes containing DMLs, or in both 
full_join(by = c("Term", "Function"), 
          EnrBP.DMGs %>% dplyr::select(Term, Function, PValue, FDR, GeneSet) %>%
  mutate(DMG=paste0(signif(PValue, digits = 2), " (", FDR, ")")), 
          EnrBP.DMLs %>% dplyr::select(Term, Function, PValue, FDR, GeneSet) %>%
  mutate(DML=paste0(signif(PValue, digits = 2), " (", FDR, ")"))) %>%
  dplyr::select(Term, Function, DMG, DML) %>%
  left_join(GOSlim) %>% arrange(slim, Function) %>% 
  dplyr::select(slim, Term, Function, DMG, DML) %>% dplyr::rename(., `Go Slim` =slim) %>% 
  write_csv(file = "../analyses/methylation/DMG-DML-EnrichedBP.csv", col_names = TRUE) 
```
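
The two tables written above differ only in the join type: `inner_join()` keeps processes enriched in both gene sets, while `full_join()` keeps processes enriched in either. A minimal sketch on placeholder terms and p-values (`toy.DMG` and `toy.DML` are hypothetical objects, not study results):

```r
library(dplyr)

# Toy enrichment results for two gene sets (placeholder terms and p-values)
toy.DMG <- tibble(Term = c("GO:toy1", "GO:toy2", "GO:toy3"), PValue = c(0.01, 0.03, 0.04))
toy.DML <- tibble(Term = c("GO:toy2", "GO:toy3", "GO:toy4"), PValue = c(0.02, 0.05, 0.01))

# Shared terms only
toy.common <- inner_join(toy.DMG, toy.DML, by = "Term", suffix = c(".DMG", ".DML"))

# All terms from either set; NA marks the set in which a term was not enriched
toy.all <- full_join(toy.DMG, toy.DML, by = "Term", suffix = c(".DMG", ".DML"))

toy.common
toy.all
```

Naming the suffixes after the gene sets avoids the default `.x` / `.y` columns, which are easy to mix up when the table is read later.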

## Extract GO terms

```{r}
###### ====== DML loci GO terms  

# Count string lengths in the Ontology_term column and check out - to figure out max # GO terms 
#data.frame(ontology=DML.features.df$Ontology_term,chr=apply(DML.features.df[,"Ontology_term"],2,nchar)) # %>% View()

DML.GO <- DML.features.df[c("contig.dml", "start.dml", "end.dml", "feature", "start.feat", "end.feat", "Note", "Ontology_term")] %>%
  distinct(contig.dml, start.dml, Note, Ontology_term, .keep_all = TRUE) %>%
  mutate(Ontology_term = str_replace(Ontology_term, pattern="Ontology_term=",replacement = "")) %>%
  mutate(Ontology_term = str_replace(Ontology_term, pattern=";",replacement = "")) %>%
  separate(Ontology_term, sep=",", into=paste("GO", 1:11, sep="_")) %>%
  pivot_longer(cols=c("GO_1","GO_2","GO_3","GO_4","GO_5","GO_6","GO_7","GO_8","GO_9","GO_10","GO_11"), names_to = "GO_number", values_to = "GO_term") %>%
  dplyr::select(-GO_number) %>%
  filter(!is.na(Note) & !is.na(GO_term)) 

write_csv(DML.features.df, file = here::here("analyses/", "DMLs", "DML.features.txt")) #write out df with all DML features 
write(DML.GO$GO_term, file = here::here("analyses/", "DMLs", "DML.GO.txt"))
write.table(subset(DML.features.df, feature=="gene"), here::here("analyses/", "DMLs", "DML.geneinfo.txt"), sep = "\t") #write out gene info for DMLs to associate with UniprotIDs

###### ====== MACAU loci GO terms  

# Count string lengths in the Ontology_term column and check out - to figure out max # GO terms 
#data.frame(ontology=macau.features.df$Ontology_term,chr=apply(macau.features.df[,"Ontology_term"],2,nchar)) #%>% View()

macau.GO <- macau.features.df[c("contig.macau", "start.macau", "end.macau", "feature", "start.feat", "end.feat", "Note", "Ontology_term")] %>%
  distinct(contig.macau, start.macau, Note, Ontology_term, .keep_all = TRUE) %>%
  mutate(Ontology_term = str_replace(Ontology_term, pattern="Ontology_term=",replacement = "")) %>%
  mutate(Ontology_term = str_replace(Ontology_term, pattern=";",replacement = "")) %>%
  separate(Ontology_term, sep=",", into=paste("GO", 1:6, sep="_")) %>%
  pivot_longer(cols=c("GO_1","GO_2","GO_3","GO_4","GO_5","GO_6"), names_to = "GO_number", values_to = "GO_term") %>%
  dplyr::select(-GO_number) %>%
  filter(!is.na(Note) & !is.na(GO_term))

write_csv(macau.features.df, file = here::here("analyses/", "macau/", "macau.features.csv")) #write out df with all MACAU features 
write(macau.GO$GO_term, file = here::here("analyses/", "macau/", "macau.GO.txt")) #write out df with just GO terms 
write.table(subset(macau.features.df, feature=="gene"), here::here("analyses/", "macau/", "MACAU.geneinfo.txt"), sep = "\t") #write out gene info for MACAU loci to associate with UniprotIDs

###### ====== DML Background loci GO terms  

# Count string lengths in the Ontology_term column and check out - to figure out max # GO terms 
#data.frame(ontology=allLociDML.features.df$Ontology_term,chr=apply(allLociDML.features.df[,"Ontology_term"],2,nchar)) # %>% View()

allLociDML.GO <- allLociDML.features.df[c("contig.allLoci", "start.allLoci", "end.allLoci", "feature", "start.feat", "end.feat", "Note", "Ontology_term")] %>%
  distinct(contig.allLoci, start.allLoci, Note, Ontology_term, .keep_all = TRUE) %>%
  mutate(Ontology_term = str_replace(Ontology_term, pattern="Ontology_term=",replacement = "")) %>%
  mutate(Ontology_term = str_replace(Ontology_term, pattern=";",replacement = "")) %>%
  separate(Ontology_term, sep=",", into=paste("GO", 1:11, sep="_")) %>%
  pivot_longer(cols=c("GO_1","GO_2","GO_3","GO_4","GO_5","GO_6","GO_7","GO_8","GO_9","GO_10","GO_11"), names_to = "GO_number", values_to = "GO_term") %>%
  dplyr::select(-GO_number) %>%
  filter(!is.na(Note) & !is.na(GO_term)) 
write_csv(allLociDML.features.df, file = here::here("analyses/", "DMLs", "allLociDML.features.txt")) #write out df with all loci features 
write(allLociDML.GO$GO_term, file = here::here("analyses/", "DMLs", "allLociDML.GO.txt"))
write.table(subset(allLociDML.features.df, feature=="gene"), here::here("analyses/", "DMLs", "allLociDML.geneinfo.txt"), sep = "\t") #write out gene info for All loci fed into DMLs to associate with UniprotIDs

#save all loci feature df object to file, to use in notebook #12 
save(allLociDML.features.df, file=here::here("analyses", "DMLs", "R-objects", "allLociDML.features.df")) 

###### ====== MACAU Background loci GO terms  

# Count string lengths in the Ontology_term column and check out - to figure out max # GO terms 
#data.frame(ontology=allLociMACAU.features.df$Ontology_term,chr=apply(allLociMACAU.features.df[,"Ontology_term"],2,nchar))  #%>% View()

allLociMACAU.GO <- allLociMACAU.features.df[c("contig.allLoci", "start.allLoci", "end.allLoci", "feature", "start.feat", "end.feat", "Note", "Ontology_term")] %>%
  distinct(contig.allLoci, start.allLoci, Note, Ontology_term, .keep_all = TRUE) %>%
  mutate(Ontology_term = str_replace(Ontology_term, pattern="Ontology_term=",replacement = "")) %>%
  mutate(Ontology_term = str_replace(Ontology_term, pattern=";",replacement = "")) %>%
  separate(Ontology_term, sep=",", into=paste("GO", 1:11, sep="_")) %>%
  pivot_longer(cols=c("GO_1","GO_2","GO_3","GO_4","GO_5","GO_6","GO_7","GO_8","GO_9","GO_10","GO_11"), names_to = "GO_number", values_to = "GO_term") %>%
  dplyr::select(-GO_number) %>%
  filter(!is.na(Note) & !is.na(GO_term)) 
write_csv(allLociMACAU.features.df, file = here::here("analyses/", "macau", "allLociMACAU.features.txt")) #write out df with all loci features 
write(allLociMACAU.GO$GO_term, file = here::here("analyses/", "macau", "allLociMACAU.GO.txt"))
write.table(subset(allLociMACAU.features.df, feature=="gene"), here::here("analyses/", "macau", "allLociMACAU.geneinfo.txt"), sep = "\t") #write out gene info for All loci fed into MACAU to associate with UniprotIDs

#save all loci feature df object to file, to use in notebook #12 
save(allLociMACAU.features.df, file=here::here("analyses", "macau", "R-objects", "allLociMACAU.features.df")) 

# Save GO terms only from enriched methylated gene lists (HC and SS)

read_delim(here::here("analyses", "methylation", "DAVID_enriched-methylated-genes-HC.txt"), delim = "\t") %>%
  separate(Term, into = c("GO", "function"), sep = "~") %>% select(GO) %>% as.vector() %>%
  write.table(here::here("analyses", "methylation", "GO-terms_enriched-methylated-genes-HC.txt"), col.names = FALSE, row.names = FALSE, quote = FALSE)

read_delim(here::here("analyses", "methylation", "DAVID_enriched-methylated-genes-SS.txt"), delim = "\t") %>%
  separate(Term, into = c("GO", "function"), sep = "~") %>% select(GO) %>% as.vector() %>%
  write.table(here::here("analyses", "methylation", "GO-terms_enriched-methylated-genes-SS.txt"), col.names = FALSE, row.names = FALSE, quote = FALSE)

```
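
The GO extraction above splits `Ontology_term` into a fixed number of columns (11 or 6), which relies on knowing the maximum number of GO terms per feature in advance. As a possible alternative, `tidyr::separate_rows()` splits straight into rows without that number. The sketch below uses placeholder GO IDs and deliberately sets the fixed width too low to show what would be lost; `toy.feat`, `toy.fixed` and `toy.rows` are hypothetical objects.

```r
library(dplyr)
library(tidyr)

# Toy feature table (placeholder gene notes and GO IDs)
toy.feat <- tibble(
  Note          = c("geneA", "geneB"),
  Ontology_term = c("GO:toy1,GO:toy2,GO:toy3", "GO:toy4")
)

# Fixed-width split: with only two target columns the third term of geneA is discarded
# (tidyr warns about this; the warning is silenced here to keep the example quiet)
toy.fixed <- suppressWarnings(
  toy.feat %>%
    separate(Ontology_term, sep = ",", into = paste("GO", 1:2, sep = "_")) %>%
    pivot_longer(cols = starts_with("GO_"), names_to = "GO_number", values_to = "GO_term") %>%
    dplyr::select(-GO_number) %>%
    filter(!is.na(GO_term))
)

# Row-wise split: no need to know the maximum number of terms in advance
toy.rows <- toy.feat %>%
  separate_rows(Ontology_term, sep = ",") %>%
  rename(GO_term = Ontology_term)

nrow(toy.fixed)  # 3 terms survive the fixed-width split
nrow(toy.rows)   # all 4 terms are kept
```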

# ===================================================
## BONEYARD CODE

### Annotate DMLs separately by those that are hypermethylated in each population 

```{bash}
bedtools intersect -wb -a "../analyses/DMLs/dml25-hypermethSS.bed" -b "../genome-features/Olurida_v081-20190709.gene.2kbslop.gff" >  ../analyses/DMLs/DML-hypermethSS.gene2kb.bed
bedtools intersect -wb -a "../analyses/DMLs/dml25-hypermethHC.bed" -b "../genome-features/Olurida_v081-20190709.gene.2kbslop.gff" >  ../analyses/DMLs/DML-hypermethHC.gene2kb.bed
```

### DAVID GO enrichment with DMLs that are hypermeth in SS and HC, respectively 

```{r}
# Create DF with DMLs that are hypermethylated in SS that are located in annotated gene regions  
DML.hyperSS <- read_delim(here::here("analyses","DMLs", "DML-hypermethSS.gene2kb.bed"), delim = '\t', col_names = FALSE) %>% as_tibble() %>%
mutate(Name=str_extract(X13, "Name=(.*?);"),
       SPID=str_extract(X13, "SPID=(.*?);")) %>%
  #remove extraneous info from Olur gene ID and Uniprot species ID ("SPID")
  mutate(Name=str_remove(Name, "Name=")) %>% mutate(Name=str_remove(Name, ";")) %>%
  mutate(SPID=str_remove(SPID, "SPID=")) %>% mutate(SPID=str_remove(SPID, ";"))

# Copy Uniprot IDs for Hypermethylated DMLs in SS to clipboard 
DML.hyperSS %>% 
  dplyr::select(SPID) %>% na.omit() %>% as.vector() %>%  # select only uniprot ID column, remove NA values and convert to vector 
  write_clip() #copy to clipboard    
```
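
The `str_extract()` plus two `str_remove()` calls used above to pull `Name` and `SPID` out of the GFF attribute column can also be written with a capture group. A minimal sketch with `stringr::str_match()` on made-up attribute strings (the IDs are placeholders, and `toy.attr` is a hypothetical object):

```r
library(stringr)

# Hypothetical GFF attribute strings (placeholder IDs); the second has no SPID
toy.attr <- c("ID=gene1;Name=OLUR_toy1;SPID=Q00000;Note=example;",
              "ID=gene2;Name=OLUR_toy2;")

# str_match() returns the full match in column 1 and the capture group in column 2
toy.SPID <- str_match(toy.attr, "SPID=(.*?);")[, 2]
toy.Name <- str_match(toy.attr, "Name=(.*?);")[, 2]

toy.SPID  # the second element is NA because that string has no SPID
toy.Name
```

Features without a `SPID` come back as `NA`, so the later `na.omit()` step would still apply unchanged.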

### Enriched Biological Functions in the SS hypermethylated DMLs

| Term | Count | % | PValue | Genes | List Total | Pop Hits | Pop Total | Fold Enrichment | Benjamini | FDR |
|-|-|-|-|-|-|-|-|-|-|-|
| GO:0045214~sarcomere organization | 3 | 4.0 | 0.025 | Q5DTJ9, Q2V2M9, Q8WZ42 | 64 | 8 | 1984 | 11.63 | 1.0 | 1.0 |
| GO:0006284~base-excision repair | 3 | 4.0 | 0.031 | Q7Z6Z7, Q9U221, Q7TMY8 | 64 | 9 | 1984 | 10.33 | 1.0 | 1.0 |
| GO:0006513~protein monoubiquitination | 3 | 4.0 | 0.031 | Q7Z6Z7, Q7TMY8, E1B932 | 64 | 9 | 1984 | 10.33 | 1.0 | 1.0 |
| GO:0030837~negative regulation of actin filament polymerization | 2 | 2.67 | 0.063 | Q6NN85, Q2V2M9 | 64 | 2 | 1984 | 31.0 | 1.0 | 1.0 |
| GO:0051693~actin filament capping | 2 | 2.67 | 0.09 | P13395, Q13813 | 64 | 3 | 1984 | 20.67 | 1.0 | 1.0 |
| GO:0055003~cardiac myofibril assembly | 2 | 2.67 | 0.09 | Q2V2M9, Q8WZ42 | 64 | 3 | 1984 | 20.67 | 1.0 | 1.0 |

```{r}
# Now do the same for the HC hypermethylated DMLs 
DML.hyperHC <- read_delim(here::here("analyses","DMLs", "DML-hypermethHC.gene2kb.bed"), delim = '\t', col_names = FALSE) %>% as_tibble() %>%
mutate(Name=str_extract(X13, "Name=(.*?);"),
       SPID=str_extract(X13, "SPID=(.*?);")) %>%
  #remove extraneous info from Olur gene ID and Uniprot species ID ("SPID")
  mutate(Name=str_remove(Name, "Name=")) %>% mutate(Name=str_remove(Name, ";")) %>%
  mutate(SPID=str_remove(SPID, "SPID=")) %>% mutate(SPID=str_remove(SPID, ";"))

  # Copy Uniprot IDs for Hypermethylated DMLs in HC to clipboard 
DML.hyperHC %>% 
  dplyr::select(SPID) %>% na.omit() %>% as.vector() %>%  # select only uniprot ID column, remove NA values and convert to vector 
  write_clip() #copy to clipboard    
```

### Enriched Biological Functions in the HC hypermethylated DMLs 

| Term | Count | % | PValue | Genes | List Total | Pop Hits | Pop Total | Fold Enrichment | FDR |
|-|-|-|-|-|-|-|-|-|-|
| GO:0007275~multicellular organism development | 8 | 10.53 | 0.024 | Q8R508, Q62513, Q8BNA6, A0JMF8, Q60636, Q13535, Q2QCI8, Q80TZ9 | 66 | 89 | 1984 | 2.70 | 1.0 |
| GO:0006974~cellular response to DNA damage stimulus | 6 | 7.89 | 0.056 | Q80Z37, Q13535, Q7TMY8, F1MRW8, Q2NL57, Q8IW19 | 66 | 64 | 1984 | 2.82 | 1.0 |
| GO:2000171~negative regulation of dendrite development | 2 | 2.63 | 0.064 | Q8R508, Q8BNA6 | 66 | 2 | 1984 | 30.06 | 1.0 |
| GO:0010629~negative regulation of gene expression | 3 | 3.95 | 0.065 | Q62655, Q60636, P08941 | 66 | 13 | 1984 | 6.94 | 1.0 |
| GO:0006816~calcium ion transport | 3 | 3.95 | 0.065 | Q6Q760, O08852, Q24270 | 66 | 13 | 1984 | 6.94 | 1.0 |
| GO:0008104~protein localization | 3 | 3.95 | 0.094 | Q9W4E2, P0C5Y8, Q8CJ40 | 66 | 16 | 1984 | 5.64 | 1.0 |
| GO:0051693~actin filament capping | 2 | 2.63 | 0.095 | P13395, P16546 | 66 | 3 | 1984 | 20.04 | 1.0 |
| GO:0001892~embryonic placenta development | 2 | 2.63 | 0.095 | O08852, Q60636 | 66 | 3 | 1984 | 20.04 | 1.0 |

```{r}
### Copy Uniprot accession numbers for all loci assessed for DMLs that are annotated 
allLociDML.features[["allLociDML.gene2kb"]] %>%
  mutate(SPID=str_remove(SPID, "SPID=")) %>% mutate(SPID=str_remove(SPID, ";")) %>%  #remove extraneous info from Olur gene ID
  #left_join(Olurida_gene_uniprot, by=c("Name" = "geneID")) %>%  #add uniprot IDs to gene dataframe
  dplyr::select(SPID) %>% na.omit() %>% as.vector() %>%  # select only uniprot ID column, remove NA values and convert to vector 
  write_clip() #copy to clipboard      
```

## Population-specific loci that are highly- and lowly-methylated overlapping with genes+2kbslop

```{bash}
bedtools intersect -wb -a "../analyses/methylation/HC-highmeth-loci.bed" -b "../genome-features/Olurida_v081-20190709.gene.2kbslop.gff" >  ../analyses/methylation/HC-highmeth-gene.bed

bedtools intersect -wb -a "../analyses/methylation/HC-lowmeth-loci.bed" -b "../genome-features/Olurida_v081-20190709.gene.2kbslop.gff" >  ../analyses/methylation/HC-lowmeth-gene.bed

bedtools intersect -wb -a "../analyses/methylation/SS-highmeth-loci.bed" -b "../genome-features/Olurida_v081-20190709.gene.2kbslop.gff" >  ../analyses/methylation/SS-highmeth-gene.bed

bedtools intersect -wb -a "../analyses/methylation/SS-lowmeth-loci.bed" -b "../genome-features/Olurida_v081-20190709.gene.2kbslop.gff" >  ../analyses/methylation/SS-lowmeth-gene.bed
```

### Compare all highly- and lowly-methylated genes in HC samples to all genes in the genome, then all such genes in SS samples to all genes in the genome. 

These are genes which contain methylated loci in HC & SS samples: 

- ../analyses/methylation/HC-highmeth-gene.bed
- ../analyses/methylation/HC-lowmeth-gene.bed
- ../analyses/methylation/SS-highmeth-gene.bed
- ../analyses/methylation/SS-lowmeth-gene.bed

```{r}
# List of background genes for DAVID  
# Copy UniprotIDs for all genes in O. lurida genome- this is our background list for DAVID Enrichment Analyses 
read_delim(here::here("genome-features", "Olurida_v081-20190709.gene.gff"), delim = '\t', col_names = FALSE, skip = 1) %>%
  mutate(SPID=str_extract(X9, "SPID=(.*?);")) %>% mutate(SPID=str_remove(SPID, "SPID=")) %>% mutate(SPID=str_remove(SPID, ";")) %>%  #extract UniprotID
  select(SPID) %>% na.omit() %>% as.vector() %>% write_clip() #copy Uniprot IDs to clipboard to paste into DAVID 

# -------------- 

# Highly methylated loci (mean meth >=75%), HC 
# Copy UniprotIDs for all genes that contain 5 or more highly methylated loci in the Hood Canal 
read_delim(here::here("analyses", "methylation", "HC-highmeth-gene.bed"), delim = '\t', col_names = FALSE) %>% #read in list of methylated loci in HC population that fall within genes
  mutate(SPID=str_extract(X12, "SPID=(.*?);")) %>% mutate(SPID=str_remove(SPID, "SPID=")) %>% mutate(SPID=str_remove(SPID, ";")) %>%  #remove extraneous info from Olur gene ID
  select(SPID) %>% count(SPID) %>%  #Select only gene ID column, count number of methylated loci in each gene 
  filter(n>4) %>% #filter for only genes that contain 5 or more methylated loci 
  select(SPID) %>% na.omit() %>% as.vector() %>% write_clip() #copy Uniprot IDs to clipboard to paste into DAVID 


# -------------- 

# Lowly methylated loci (mean meth <= 25%), HC 
# Copy UniprotIDs for all genes that contain 1 or more lowly methylated loci in the Hood Canal 
read_delim(here::here("analyses", "methylation", "HC-lowmeth-gene.bed"), delim = '\t', col_names = FALSE) %>% #read in list of methylated loci in HC population that fall within genes
  mutate(SPID=str_extract(X12, "SPID=(.*?);")) %>% mutate(SPID=str_remove(SPID, "SPID=")) %>% mutate(SPID=str_remove(SPID, ";")) %>%  #remove extraneous info from Olur gene ID
  select(SPID) %>% count(SPID) %>%  #Select only gene ID column, count number of methylated loci in each gene 
  select(SPID) %>% na.omit() %>% as.vector() %>% write_clip() #copy Uniprot IDs to clipboard to paste into DAVID 

# -------------- 

# Highly methylated loci (mean meth >=75%), SS 
# Copy UniprotIDs for all genes that contain 5 or more highly methylated loci in the South Sound 
read_delim(here::here("analyses", "methylation", "SS-highmeth-gene.bed"), delim = '\t', col_names = FALSE) %>% #read in list of methylated loci in SS population that fall within genes
  mutate(SPID=str_extract(X12, "SPID=(.*?);")) %>% mutate(SPID=str_remove(SPID, "SPID=")) %>% mutate(SPID=str_remove(SPID, ";")) %>%  #remove extraneous info from Olur gene ID
  select(SPID) %>% count(SPID) %>%  #Select only gene ID column, count number of methylated loci in each gene 
  filter(n>4) %>% #filter for only genes that contain 5 or more methylated loci 
  select(SPID) %>% na.omit() %>% as.vector() %>% write_clip() #copy Uniprot IDs to clipboard to paste into DAVID 

# -------------- 

# Lowly methylated loci (mean meth <=25%), SS 
# Copy UniprotIDs for all genes that contain 1 or more lowly methylated loci in the South Sound 
read_delim(here::here("analyses", "methylation", "SS-lowmeth-gene.bed"), delim = '\t', col_names = FALSE) %>% #read in list of methylated loci in SS population that fall within genes
  mutate(SPID=str_extract(X12, "SPID=(.*?);")) %>% mutate(SPID=str_remove(SPID, "SPID=")) %>% mutate(SPID=str_remove(SPID, ";")) %>%  #remove extraneous info from Olur gene ID
  select(SPID) %>% count(SPID) %>%  #Select only gene ID column, count number of methylated loci in each gene 
  select(SPID) %>% na.omit() %>% as.vector() %>% write_clip() #copy Uniprot IDs to clipboard to paste into DAVID 

```
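
The "5 or more loci per gene" rule used above for highly methylated genes comes down to a `count()` followed by a `filter()`. A minimal sketch on made-up data is below; `toy.loci` is a hypothetical table with one row per locus, and unlike the chunk above it drops unannotated loci before counting and returns a plain character vector via `pull()`.

```r
library(dplyr)

# Toy table: one row per methylated locus, labelled with the gene it falls in (made-up genes)
toy.loci <- tibble(SPID = c(rep("geneA", 6), rep("geneB", 2), rep(NA, 3)))

toy.high <- toy.loci %>%
  filter(!is.na(SPID)) %>%   # drop loci without a Uniprot ID before counting
  count(SPID) %>%            # number of loci per gene
  filter(n > 4) %>%          # keep genes with 5 or more loci
  pull(SPID)                 # character vector of gene IDs

toy.high
```

Only `geneA` passes here, since `geneB` has two loci and the unannotated loci are excluded.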

DAVID results saved here (Enriched GO terms and the uploaded gene sets): 

- Highly methylated genes, enriched biological functions in Hood Canal: analyses/methylation/GO-terms_enriched-highlymethylated-genes-HC  
- Highly methylated genes, enriched biological functions in South Sound: analyses/methylation/GO-terms_enriched-highlymethylated-genes-SS  
- Lowly methylated genes, enriched biological functions in Hood Canal: analyses/methylation/GO-terms_enriched-lowlymethylated-genes-HC  
- Lowly methylated genes, enriched biological functions in South Sound: analyses/methylation/GO-terms_enriched-lowlymethylated-genes-SS  

### Convert GO terms to GO Slim, highly and lowly methylated genes per population, and compare among populations 

Use this spreadsheet that assigns each GO Term to a GO Slim: http://owl.fish.washington.edu/halfshell/bu-alanine-wd/17-07-20/GO-GOslim.sorted

```{r}
# Read in the GO Slim table  
GOSlim <- read_delim(here::here("resources", "GO-GOslim.sorted"), delim = '\t', col_names = FALSE) %>% 
  setNames(c("GO", "term", "slim", "category")) %>% 
  mutate_at(vars(category, slim), as.factor)

GO.highmeth.HC <- read_delim(here::here("analyses", "methylation", "GO-terms_enriched-highlymethylated-genes-HC.txt"), delim = '\t', col_names = TRUE) %>% 
  separate(Term, into=c("GO", "term"), remove=TRUE,sep = "~") 

GO.highmeth.SS <- read_delim(here::here("analyses", "methylation", "GO-terms_enriched-highlymethylated-genes-SS.txt"), delim = '\t', col_names = TRUE) %>% 
  separate(Term, into=c("GO", "term"), remove=TRUE,sep = "~") 

GO.lowmeth.HC <- read_delim(here::here("analyses", "methylation", "GO-terms_enriched-lowlymethylated-genes-HC.txt"), delim = '\t', col_names = TRUE) %>% 
  separate(Term, into=c("GO", "term"), remove=TRUE,sep = "~") 

GO.lowmeth.SS <- read_delim(here::here("analyses", "methylation", "GO-terms_enriched-lowlymethylated-genes-SS.txt"), delim = '\t', col_names = TRUE) %>% 
  separate(Term, into=c("GO", "term"), remove=TRUE,sep = "~") 
 
# Identify unique GO terms in the highly methylated loci (i.e. those not enriched in both pops)
GO.highmeth.HC %>% anti_join(GO.highmeth.SS, by = "GO") %>% select(GO, term, FDR, `Fold Enrichment`) %>% distinct() %>% left_join(GOSlim, by=c("GO", "term")) %>% arrange(slim, GO) #add slim  
GO.highmeth.SS %>% anti_join(GO.highmeth.HC, by = "GO") %>% select(GO, term, FDR, `Fold Enrichment`) %>% distinct() %>% left_join(GOSlim, by=c("GO", "term")) %>% arrange(slim, GO) #add slim  

# Identify unique GO terms in the lowly methylated loci
GO.lowmeth.HC %>% anti_join(GO.lowmeth.SS, by = "GO") %>% select(GO, term, FDR, `Fold Enrichment`) %>% distinct()  %>% left_join(GOSlim, by=c("GO", "term")) %>% arrange(slim, GO) #add slim  
GO.lowmeth.SS %>% anti_join(GO.lowmeth.HC, by = "GO") %>% select(GO, term, FDR, `Fold Enrichment`) %>% distinct()  %>% left_join(GOSlim, by=c("GO", "term")) %>% arrange(slim, GO) #add slim  
```
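
The population-specific comparison above relies on `anti_join()` to keep terms enriched in only one population, followed by `left_join()` to attach the GO Slim. The sketch below shows that pattern on toy tables. All GO IDs, term labels, and slim names are hypothetical placeholders, not values from the DAVID output.

```r
library(dplyr)

# Hypothetical enriched-term tables for the two populations
go_hc <- tibble(GO = c("GO:0000001", "GO:0000002"), term = c("term A", "term B"))
go_ss <- tibble(GO = c("GO:0000002", "GO:0000003"), term = c("term B", "term C"))

# Hypothetical GO-to-slim lookup with the same key columns as the GOSlim table
slim_lookup <- tibble(
  GO   = c("GO:0000001", "GO:0000002", "GO:0000003"),
  term = c("term A", "term B", "term C"),
  slim = c("slim X", "slim Y", "slim X")
)

# Terms enriched only in the first table, annotated with their slim
hc_only <- go_hc %>%
  anti_join(go_ss, by = "GO") %>%
  left_join(slim_lookup, by = c("GO", "term"))

# Terms enriched only in the second table
ss_only <- go_ss %>%
  anti_join(go_hc, by = "GO") %>%
  left_join(slim_lookup, by = c("GO", "term"))
```

Because the slim is joined on both `GO` and `term`, a term whose label is spelled differently in the two sources would come back with `NA` in `slim`. Joining on `GO` alone is one way to avoid that if it turns out to be a problem.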

## Create separate dataframes with annotation information for all hypomethylated loci in Hood Canal and in South Sound 

```{r, eval=FALSE}
hypo_HC_genes <-
  DML.features.df %>%
  filter(contig.dml %in% hypo_HC$chr &
         start.dml %in% (1+hypo_HC$start) & # parentheses needed: %in% binds tighter than +
         feature == "gene" &
         Note != "Note=Protein of unknown function;") %>%
  distinct(contig.dml, start.dml, Note, Ontology_term, .keep_all = TRUE)

hypo_HC_GO <- hypo_HC_genes %>%
mutate(Ontology_term = str_replace(Ontology_term, pattern="Ontology_term=",replacement = "")) %>%
  mutate(Ontology_term = str_replace(Ontology_term, pattern=";",replacement = "")) %>%
  separate(Ontology_term, sep=",", into=paste("GO", 1:3, sep="_")) %>%
  pivot_longer(cols=c("GO_1","GO_2","GO_3"), names_to = "GO_number", values_to = "GO_term") %>%
  select(-GO_number) %>%
  filter(!is.na(Note) & !is.na(GO_term))
write(hypo_HC_GO$GO_term, file = here::here("analyses/", "hypo_HC_GO.txt"))


hypo_SS_genes <- DML.features.df %>%
  filter(contig.dml %in% hypo_SS$chr &
         start.dml %in% (1+hypo_SS$start) & # parentheses needed: %in% binds tighter than +
           feature == "gene" &
           Note != "Note=Protein of unknown function;") %>%
  distinct(contig.dml, start.dml, Note, Ontology_term, .keep_all = TRUE)

hypo_SS_GO <- hypo_SS_genes %>%
  mutate(Ontology_term = str_replace(Ontology_term, pattern="Ontology_term=",replacement = "")) %>%
  mutate(Ontology_term = str_replace(Ontology_term, pattern=";",replacement = "")) %>%
  separate(Ontology_term, sep=",", into=paste("GO", 1:5, sep="_")) %>%
  pivot_longer(cols=c("GO_1","GO_2","GO_3", "GO_4", "GO_5"), names_to = "GO_number", values_to = "GO_term") %>%
  select(-GO_number) %>%
  filter(!is.na(Note) & !is.na(GO_term))
write(hypo_SS_GO$GO_term, file = here::here("analyses/", "hypo_SS_GO.txt"))


# Barplots of differentially methylated loci that are located in known genes

hypo_HC %>%
   filter(chr %in% hypo_HC_genes$contig.dml &
         start %in% (hypo_HC_genes$start.dml-1)) %>% # parentheses needed: %in% binds tighter than -
ggplot(aes(x = population, y = mean_percMeth, fill = population,
                         label=paste0(round(mean_percMeth, digits = 2), "%"))) +
      geom_bar(stat="identity", width = 0.5) +
      geom_pointrange(aes(ymin=mean_percMeth,
                        ymax=mean_percMeth+sd_percMeth, width=0.15)) +
      geom_text(size=3, vjust=-0.5, hjust=1.1) +
      theme_minimal() + xlab("Population") + ylab("% Methylation") +
  scale_y_continuous(limits=c(0,110), breaks=c(0,25,50,75,100)) + # one y scale; a separate ylim() would be replaced by this call
    scale_fill_manual(values=c("firebrick3","dodgerblue3")) +
   ggforce::facet_wrap_paginate(~chr + start, nrow=2, ncol=3, page=2) # facet_wrap_paginate() is provided by ggforce


hypo_SS %>%
   filter(chr %in% hypo_SS_genes$contig.dml &
         start %in% (hypo_SS_genes$start.dml-1)) %>% # parentheses needed: %in% binds tighter than -
ggplot(aes(x = population, y = mean_percMeth, fill = population,
                         label=paste0(round(mean_percMeth, digits = 2), "%"))) +
      geom_bar(stat="identity", width = 0.5) +
      geom_pointrange(aes(ymin=mean_percMeth,
                        ymax=mean_percMeth+sd_percMeth, width=0.15)) +
      geom_text(size=3, vjust=-0.5, hjust=1.1) +
      theme_minimal() + xlab("Population") + ylab("% Methylation") +
  scale_y_continuous(limits=c(0,110), breaks=c(0,25,50,75,100)) + # one y scale; a separate ylim() would be replaced by this call
    scale_fill_manual(values=c("firebrick3","dodgerblue3")) +
   ggforce::facet_wrap_paginate(~chr + start, nrow=2, ncol=3, page=2) # facet_wrap_paginate() is provided by ggforce
```
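
In R, special operators such as `%in%` bind more tightly than `+` and `-`, so offset comparisons like the BED-start-plus-one matching in the chunk above need parentheses around the arithmetic. The base-R sketch below shows the difference on hypothetical toy values. It also shows that testing contig and position independently can keep a row whose contig/position pair does not occur in the reference table. The pasted-key comparison is one possible alternative, not the method used in this notebook.

```r
# --- Operator precedence ---
x <- c(1, 11)
loose <- x %in% 1 + 10     # parsed as (x %in% 1) + 10, a numeric vector
tight <- x %in% (1 + 10)   # the intended membership test, a logical vector

# --- Independent vs. pairwise matching (hypothetical toy tables) ---
hypo <- data.frame(chr = c("Contig1", "Contig2"), start = c(10, 20))  # 0-based starts
feat <- data.frame(contig.dml = c("Contig1", "Contig1", "Contig2"),
                   start.dml  = c(11, 21, 21))                        # 1-based positions

# Contig and position tested separately: every row passes, including
# (Contig1, 21), which is not a contig/position pair present in `hypo`
independent <- feat$contig.dml %in% hypo$chr &
               feat$start.dml %in% (1 + hypo$start)

# Contig and position tested together through a combined key
pairwise <- paste(feat$contig.dml, feat$start.dml) %in%
            paste(hypo$chr, hypo$start + 1)
```

Whether the independent test matters in practice depends on how often the same position value occurs on different contigs in the real tables.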


```{bash, eval=FALSE}
echo "Loci differentially methylated between SS and HC populations:"
cat "../analyses/dml25.bed" | wc -l 

echo "Loci that overlap with genes:"
bedtools intersect -u -a "../analyses/dml25.bed" -b "../genome-features/Olurida_v081-20190709.gene.gff" | wc -l

echo "Loci that overlap with gene regions (genes +/- 2kb):"
bedtools intersect -u -a "../analyses/dml25.bed" -b "../genome-features/Olurida_v081-20190709.gene.2kbslop.gff" | wc -l

echo "Loci that overlap with exons:"
bedtools intersect -u -a "../analyses/dml25.bed" -b "../genome-features/Olurida_v081-20190709.exon.gff" | wc -l

echo "Loci that overlap with coding sequences:"
bedtools intersect -u -a "../analyses/dml25.bed" -b "../genome-features/Olurida_v081-20190709.CDS.gff" | wc -l

echo "Loci that overlap with mRNA:"
bedtools intersect -u -a "../analyses/dml25.bed" -b "../genome-features/Olurida_v081-20190709.mRNA.gff" | wc -l

echo "Loci that overlap with transposable elements:"
bedtools intersect -u -a "../analyses/dml25.bed" -b "../genome-features/Olurida_v081_TE-Cg.gff" | wc -l

echo "Loci that overlap with alternative splice variants:"
bedtools intersect -u -a "../analyses/dml25.bed" -b "../genome-features/20190709-Olurida_v081.stringtie.gtf" | wc -l

echo "Loci that do not overlap with known features:"
bedtools intersect -v -a "../analyses/dml25.bed" -b "../genome-features/Olurida_v081-20190709.gene.gff" "../genome-features/Olurida_v081-20190709.gene.2kbslop.gff" "../genome-features/Olurida_v081-20190709.exon.gff" "../genome-features/Olurida_v081-20190709.CDS.gff" "../genome-features/Olurida_v081-20190709.mRNA.gff" "../genome-features/Olurida_v081_TE-Cg.gff" "../genome-features/20190709-Olurida_v081.stringtie.gtf" | wc -l
```


```{bash, eval=FALSE}
echo "genes" 
bedtools intersect -u -a "../analyses/mydiff-all.bed" -b "../genome-features/Olurida_v081-20190709.gene.gff" | wc -l
echo "gene regions (+/- 2kb)" 
bedtools intersect -u -a "../analyses/mydiff-all.bed" -b "../genome-features/Olurida_v081-20190709.gene.2kbslop.gff" | wc -l
echo "exon" 
bedtools intersect -u -a "../analyses/mydiff-all.bed" -b "../genome-features/Olurida_v081-20190709.exon.gff" | wc -l
echo "CDS" 
bedtools intersect -u -a "../analyses/mydiff-all.bed" -b "../genome-features/Olurida_v081-20190709.CDS.gff" | wc -l
echo "mRNA"
bedtools intersect -u -a "../analyses/mydiff-all.bed" -b "../genome-features/Olurida_v081-20190709.mRNA.gff" | wc -l
echo "TE" 
bedtools intersect -u -a "../analyses/mydiff-all.bed" -b "../genome-features/Olurida_v081_TE-Cg.gff" | wc -l
echo "ASV" 
bedtools intersect -u -a "../analyses/mydiff-all.bed" -b "../genome-features/20190709-Olurida_v081.stringtie.gtf" | wc -l
echo "unknown" 
bedtools intersect -v -a "../analyses/mydiff-all.bed" -b "../genome-features/Olurida_v081-20190709.gene.gff" "../genome-features/Olurida_v081-20190709.exon.gff" "../genome-features/Olurida_v081-20190709.CDS.gff" "../genome-features/Olurida_v081-20190709.mRNA.gff" "../genome-features/Olurida_v081_TE-Cg.gff" "../genome-features/20190709-Olurida_v081.stringtie.gtf" | wc -l
```


```{bash, eval=FALSE}
echo "Total methylated loci:" 
cat ../analyses/macau/macau-all-loci.bed | wc -l

echo "Loci associated with shell length (MACAU):"
cat "../analyses/macau/macau.sign.length.perc.meth.bed" | wc -l 

echo "Loci that overlap with genes:"
bedtools intersect -u -a "../analyses/macau/macau.sign.length.perc.meth.bed" -b "../genome-features/Olurida_v081-20190709.gene.gff" | wc -l

echo "Loci that overlap with gene regions (+/- 2kb):"
bedtools intersect -u -a "../analyses/macau/macau.sign.length.perc.meth.bed" -b "../genome-features/Olurida_v081-20190709.gene.2kbslop.gff" | wc -l

echo "Loci that overlap with exons:"
bedtools intersect -u -a "../analyses/macau/macau.sign.length.perc.meth.bed" -b "../genome-features/Olurida_v081-20190709.exon.gff" | wc -l

echo "Loci that overlap with coding sequences:"
bedtools intersect -u -a "../analyses/macau/macau.sign.length.perc.meth.bed" -b "../genome-features/Olurida_v081-20190709.CDS.gff" | wc -l

echo "Loci that overlap with mRNA:"
bedtools intersect -u -a "../analyses/macau/macau.sign.length.perc.meth.bed" -b "../genome-features/Olurida_v081-20190709.mRNA.gff" | wc -l

echo "Loci that overlap with transposable elements:"
bedtools intersect -u -a "../analyses/macau/macau.sign.length.perc.meth.bed" -b "../genome-features/Olurida_v081_TE-Cg.gff" | wc -l

echo "Loci that overlap with alternative splice variants:"
bedtools intersect -u -a "../analyses/macau/macau.sign.length.perc.meth.bed" -b "../genome-features/20190709-Olurida_v081.stringtie.gtf" | wc -l

echo "Loci that do not overlap with known features:"
bedtools intersect -v -a "../analyses/macau/macau.sign.length.perc.meth.bed" -b "../genome-features/Olurida_v081-20190709.gene.gff" "../genome-features/Olurida_v081-20190709.exon.gff" "../genome-features/Olurida_v081-20190709.CDS.gff" "../genome-features/Olurida_v081-20190709.mRNA.gff" "../genome-features/Olurida_v081_TE-Cg.gff" "../genome-features/20190709-Olurida_v081.stringtie.gtf" | wc -l
```
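
The counts above depend on two `bedtools intersect` flags: `-u` reports each `-a` entry once if it overlaps anything in `-b`, and `-v` reports only the `-a` entries with no overlap. The base-R sketch below is a conceptual re-implementation of that logic on toy intervals, not a substitute for bedtools. The data frames and the `overlaps_any()` helper are hypothetical, and all coordinates are treated as BED-style (0-based start, exclusive end).

```r
# Toy loci (the "-a" file) and features (the "-b" file); all values are made up
loci <- data.frame(chr   = c("c1", "c1", "c2"),
                   start = c(5, 50, 10),
                   end   = c(6, 51, 11))
features <- data.frame(chr   = c("c1", "c2"),
                       start = c(0, 100),
                       end   = c(10, 200))

# TRUE for each row of `a` that overlaps at least one row of `b`
overlaps_any <- function(a, b) {
  vapply(seq_len(nrow(a)), function(i) {
    any(b$chr == a$chr[i] & b$start < a$end[i] & b$end > a$start[i])
  }, logical(1))
}

hit <- overlaps_any(loci, features)
n_u <- sum(hit)    # analogous to `bedtools intersect -u ... | wc -l`
n_v <- sum(!hit)   # analogous to `bedtools intersect -v ... | wc -l`
```

When the same set of `-b` files is used for both calls, the `-u` and `-v` counts should add up to the number of lines in the `-a` file, which is a quick sanity check on the "unknown" totals.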


## Prepare blastx annotation files to merge with DML and MACAU results 

#### Download oly blastx file

NOTE: to execute this code chunk, remove "eval = FALSE" and uncomment the `curl` line. 

```{bash, eval = FALSE}
#curl https://raw.githubusercontent.com/sr320/paper-oly-mbdbs-gen/master/analyses/Olgene_blastx_uniprot.05.tab > ../data/Olgene_blastx_uniprot.05.tab
```

#### Convert pipes to tabs

```{bash}
tr '|' '\t' < ../data/Olgene_blastx_uniprot.05.tab \
> ../data/Olgene_blastx_uniprot.05-20191122.tab
wc -l ../data/Olgene_blastx_uniprot.05-20191122.tab
```

#### Reduce the number of columns using awk. Sort, and save as a new file.

```{bash}
awk -v OFS='\t' '{print $1, $3, $13}' \
< ../data/Olgene_blastx_uniprot.05-20191122.tab | sort \
> ../data/Olgene_blastx_uniprot.05-20191122-sort.tab
wc -l ../data/Olgene_blastx_uniprot.05-20191122-sort.tab
```

#### Preview the blastx annotation file that will be joined with feature lists to annotate 

```{bash}
head ../data/Olgene_blastx_uniprot.05-20191122-sort.tab
```

#### Read blastx annotation file into an R object 

```{r}
uniprot <- read_delim(here::here("data", "Olgene_blastx_uniprot.05-20191122-sort.tab"), delim="\t", col_names = FALSE) %>%
  setNames(c("gene", "UniprotID", "unknown")) %>% 
  separate(gene, into = c("contig.feat", "start.feat", "end.feat"), sep = '[:-]') %>%
  mutate_at(c("start.feat", "end.feat"),as.numeric)
```
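
The `separate()` call above splits the first column on `:` and `-` into contig, start, and end. A minimal sketch of that step on toy rows follows. The contig names, coordinates, and Uniprot IDs are hypothetical, and the `contig:start-end` layout is assumed from the separator pattern used above.

```r
library(dplyr)
library(tidyr)

# Hypothetical rows in the assumed "contig:start-end" layout
toy <- tibble(gene      = c("Contig10:100-250", "Contig22:5-90"),
              UniprotID = c("Q12345", "P67890"))

toy_sep <- toy %>%
  separate(gene, into = c("contig.feat", "start.feat", "end.feat"), sep = "[:-]") %>%
  mutate(across(c(start.feat, end.feat), as.numeric))  # coordinates become numeric
```

If a contig name itself contained `:` or `-`, the split would produce more than three pieces and `separate()` would warn and drop the extras, so it is worth checking for warnings when reading the real file.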