code/02-Generating-meth-feature-files.Rmd

---
title: "02-Generating-gene-region-feature-files"
author: "Laura H Spencer"
date: "11/10/2020"
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```


## In this notebook I generate feature files of O. lurida gene regions, which include genes +/- 2kb flanking regions 

### NOTE

Katherine Silliman edited the "Olurida_v081-20190709.gene.gff" and "Olurida_v081-20190709.gene.2kbslop.gff" files so the Uniprot SPID is in there. It is the last entry in the "Notes" section, marked `SPID=`. A notebook of how she did it can be found here: [https://github.com/sr320/paper-oly-mbdbs-gen/blob/master/genome-annot/Add_SPID.ipynb](https://github.com/sr320/paper-oly-mbdbs-gen/blob/master/genome-annot/Add_SPID.ipynb)

_Example entry_

`Contig1111	maker	gene	24968	28696	.	-	.	ID=OLUR_00006628;Name=OLUR_00006628;Alias=maker-Contig1111-snap-gene-0.1;Note=Similar to Spag6: Sperm-associated antigen 6 (Mus musculus OX%3D10090);Dbxref=Gene3D:G3DSA:1.25.10.10,InterPro:IPR000225,InterPro:IPR000357,InterPro:IPR011989,InterPro:IPR016024,Pfam:PF02985,SMART:SM00185,SUPERFAMILY:SSF48371;Ontology_term=GO:0005515;SPID=Q9JLI7;`

### Use bedtools `slop` to include 2kb regions on either side of gene regions 

In many analyses want to identify methylated loci that are within genes but also 2kb upstream and downstream of gene regions. Therefore, before intersecting methylated loci with genes, I need to create a gene region +/- 2kb file. I will do this using `slopBed`.

First, generate a "genome file", which defines size of each chromosome. This is so the `slop` function restricts results to within a contig. I can use the indexed FASTA file that I created for a blast. 

Extract column 1 (contig name) and column 2 (# bases in the contig)

```{bash}
head -n 2 "../resources/Olurida_v081.fa.fai"
cut -f 1,2 "../resources/Olurida_v081.fa.fai" > "../resources/Olurida_chrom.sizes"
head -n 2 "../resources/Olurida_chrom.sizes"
```

### Run slopBed with gene feature file 

```{bash}
head -n 4 "../genome-features/Olurida_v081-20190709.gene.gff"
```

```{bash}
slopBed \
-i "../genome-features/Olurida_v081-20190709.gene.gff" \
-g "../resources/Olurida_chrom.sizes" \
-b 2000 \
> "../genome-features/Olurida_v081-20190709.gene.2kbslop.gff"
head -n 3 "../genome-features/Olurida_v081-20190709.gene.2kbslop.gff"
```

### Use intersectBed to find where loci with methylation data and genes intersect, allowing loci to be mapped to annotated genes
  
  wb: Print all lines in the second file
  a: input data file, which was created in previous code chunk
  b: save annotated gene list

```{bash}
intersectBed \
  -wb \
  -a "../analyses/methylation/meth_filter_long.tab" \
  -b "../genome-features/Olurida_v081-20190709.gene.2kbslop.gff" \
  > "../analyses/methylation/meth_filter_gene.2kbslop.tab"
```

### Here is the number of loci associated with gene regions +/- 2kb: 

```{bash}
wc -l "../analyses/methylation/meth_filter_gene.2kbslop.tab"
```

### Check out format of gene regions with methylated loci 

```{bash}
head -n 2  "../analyses/methylation/meth_filter_gene.2kbslop.tab"
```

### Now run intersectBed to NOT include the 2kb flanks  

Return the number of loci associated with gene regions. Not sure if this will be used, but it's good to save it just in case. 

```{bash}
intersectBed \
  -wb \
  -a "../analyses/methylation/meth_filter_long.tab" \
  -b "../genome-features/Olurida_v081-20190709.gene.gff" \
  > "../analyses/methylation/meth_filter_gene.tab"
wc -l "../analyses/methylation/meth_filter_gene.tab"
```

### Check out resulting files which contain methylated loci that are within gene regions (+/- 2kb), and gene bodies: 
 
```{bash}
head -n 3 "../analyses/methylation/meth_filter_gene.2kbslop.tab"
```

```{bash}
head -n 1 "../analyses/methylation/meth_filter_gene.tab"
```

### For the fun of it, re-run bedtools flank with gene feature file to create 1) downtream flanking region (+2kb downstream, 3'), and 2) upstream flanking region (-2kb upstream, 5')

Downstream / 3' (using the `-r` option to indicate "add") 

```{bash}
bedtools flank \
-i "../genome-features/Olurida_v081-20190709.gene.gff" \
-g "../resources/Olurida_chrom.sizes" \
-r 2000 \
-l 0 \
> "../genome-features/Olurida_v081-20190709.2kbflank-down.gff"
head -n 3 "../genome-features/Olurida_v081-20190709.2kbflank-down.gff"
```

Upstream / 5' (using the `-l` option to indicate "subract") 
```{bash}
bedtools flank \
-i "../genome-features/Olurida_v081-20190709.gene.gff" \
-g "../resources/Olurida_chrom.sizes" \
-l 2000 \
-r 0 \
> "../genome-features/Olurida_v081-20190709.2kbflank-up.gff"
head -n 3 "../genome-features/Olurida_v081-20190709.2kbflank-up.gff"
```

# ============================================================================

The below code was not needed for the paper 

### Createnew files that contains genes locations of various # of isoforms 

These files will be used to assess association between methylation and the # of isoforms in a gene 

These are the files I will create:  
- "genome-features/Olurida_v081_genes.1isoform.bed"  
- "genome-features/Olurida_v081_genes.2-5isoforms.bed"  
- "genome-features/Olurida_v081_genes.6-10isoforms.bed"  
- "genome-features/Olurida_v081_genes.11-20isoforms.bed"  
- "genome-features/Olurida_v081_genes.21-70isoforms.bed"  

```{r}
ASV.data <- read_delim(here::here("genome-features", "20190709-Olurida_v081.stringtie.gtf"), delim="\t", col_names=FALSE, skip = 2) %>% 
  setNames(c("Contig", "source", "type", "start", "stop", "unknown1",  "strand", "unknown2", "ID")) %>%
    mutate_at(vars(type), as.factor) %>%
  
  # Extract info for only the transcripts (aka, not exons)
  filter(type=="transcript") %>%

  # Split gene and transcript ID info into new columns 
    
  mutate(gene_id=str_extract(ID, "gene_id (.*?);"),
       transcript_id=str_extract(ID, "transcript_id (.*?);"),
       ref_gene_id=str_extract(ID, "ref_gene_id (.*?);")) %>%
  mutate(gene_id=str_remove(gene_id, "gene_id "),
           transcript_id=str_remove(transcript_id, "transcript_id "),
           ref_gene_id=str_remove(ref_gene_id, "ref_gene_id ")) %>%
    
  mutate(gene_id=str_remove(gene_id, ";"),
           transcript_id=str_remove(transcript_id, ";"),
           ref_gene_id=str_remove(ref_gene_id, ";")) %>%
  mutate(gene_id=str_remove_all(gene_id, "\""),
           transcript_id=str_remove_all(transcript_id, "\""),
           ref_gene_id=str_remove_all(ref_gene_id, "\"")) %>%
  
  mutate_at(vars(gene_id), as.factor) %>%

  # Find the start and stop loci for each gene 
    group_by(Contig, gene_id) %>% 
  mutate(
    start_gene = min(start, na.rm = T),
    stop_gene = max(stop, na.rm = T)) %>%
  ungroup()

# Create new dataframe with the # of isoforms per gene (including start and stop loci for each gene)
ASV.per.gene <- ASV.data %>%
  group_by(Contig, gene_id, start_gene, stop_gene) %>%
  tally(name = "isoform_count") %>% ungroup()

table(ASV.per.gene$isoform_count)

# this gene has 70 isoforms! Double check 
ASV.data %>% filter(gene_id == "MSTRG.55267")

# Create .bed files for genes that have various # of isoforms 

#genes with only 1 transcript 
write_delim(ASV.per.gene %>% 
              filter(isoform_count==1) %>% 
              select(Contig, start_gene, stop_gene) %>% 
              mutate_if(is.numeric, as.integer),
  "../genome-features/Olurida_v081_genes.1isoform.bed",  delim = '\t', col_names = FALSE)

#genes with 2-5 isoforms 
write_delim(ASV.per.gene %>% 
              filter(isoform_count>1 & isoform_count<=5) %>% 
              select(Contig, start_gene, stop_gene) %>% 
              mutate_if(is.numeric, as.integer),
  "../genome-features/Olurida_v081_genes.2-5isoforms.bed",  delim = '\t', col_names = FALSE)

#genes with 6-10 isoforms 
write_delim(ASV.per.gene %>% 
              filter(isoform_count>5 & isoform_count<=10) %>% 
              select(Contig, start_gene, stop_gene) %>% 
              mutate_if(is.numeric, as.integer),
  "../genome-features/Olurida_v081_genes.6-10isoforms.bed",  delim = '\t', col_names = FALSE)

#genes with 11-20 isoforms 
write_delim(ASV.per.gene %>% 
              filter(isoform_count>10 & isoform_count<=20) %>% 
              select(Contig, start_gene, stop_gene) %>% 
              mutate_if(is.numeric, as.integer),
  "../genome-features/Olurida_v081_genes.11-20isoforms.bed",  delim = '\t', col_names = FALSE)

#genes with >20 isoforms  
write_delim(ASV.per.gene %>% 
              filter(isoform_count>20) %>% 
              select(Contig, start_gene, stop_gene) %>% 
              mutate_if(is.numeric, as.integer),
  "../genome-features/Olurida_v081_genes.21-70isoforms.bed",  delim = '\t', col_names = FALSE)
```

#### Question: Does gene length ~ # isoforms? 
If so, a CpG could be more likely since it's longer. Could therfore be a weird artifacts of transcriptome assembly could lead to genes having multiple isoforms - e.g. long.

```{r}
#ggplotly(
  ASV.per.gene %>% mutate(length_gene=stop_gene-start_gene) %>%
  ggplot() + geom_boxplot(aes(x=as.factor(isoform_count), y=length_gene))
  #)

```