---
title: "02-Generating-gene-region-feature-files"
author: "Laura H Spencer"
date: "11/10/2020"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
## In this notebook I generate feature files of O. lurida gene regions, which include genes +/- 2kb flanking regions
### NOTE
Katherine Silliman edited the "Olurida_v081-20190709.gene.gff" and "Olurida_v081-20190709.gene.2kbslop.gff" files so that each entry includes the UniProt SPID. It is the last entry in the attributes ("Notes") column, marked `SPID=`. A notebook showing how she did it can be found here: [https://github.com/sr320/paper-oly-mbdbs-gen/blob/master/genome-annot/Add_SPID.ipynb](https://github.com/sr320/paper-oly-mbdbs-gen/blob/master/genome-annot/Add_SPID.ipynb)
_Example entry_
`Contig1111 maker gene 24968 28696 . - . ID=OLUR_00006628;Name=OLUR_00006628;Alias=maker-Contig1111-snap-gene-0.1;Note=Similar to Spag6: Sperm-associated antigen 6 (Mus musculus OX%3D10090);Dbxref=Gene3D:G3DSA:1.25.10.10,InterPro:IPR000225,InterPro:IPR000357,InterPro:IPR011989,InterPro:IPR016024,Pfam:PF02985,SMART:SM00185,SUPERFAMILY:SSF48371;Ontology_term=GO:0005515;SPID=Q9JLI7;`
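
As a quick illustration of how the `SPID=` tag can be pulled back out of an entry, here is a minimal sketch using `sed`. The attribute string is abbreviated from the example entry above, and the variable names (`attributes`, `spid`) are my own, not part of the pipeline.

```bash
# Abbreviated attribute string from the example entry (assumption: SPID= is present and ends with ";")
attributes='ID=OLUR_00006628;Name=OLUR_00006628;Ontology_term=GO:0005515;SPID=Q9JLI7;'

# Capture everything between "SPID=" and the next ";"
spid=$(printf '%s\n' "$attributes" | sed -n 's/.*SPID=\([^;]*\);.*/\1/p')
echo "$spid"
```

Entries without a `SPID=` tag produce no output, because `sed -n ... p` only prints lines where the substitution matched.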
### Use bedtools `slop` to include 2kb regions on either side of gene regions
In many analyses I want to identify methylated loci that are within genes, but also within 2kb upstream and downstream of gene regions. Therefore, before intersecting methylated loci with genes, I need to create a gene region +/- 2kb file. I will do this using `slopBed`.
First, generate a "genome file", which defines the size of each chromosome (here, each contig). This is so the `slop` function restricts results to within a contig. I can use the FASTA index file (`.fai`) that I created for a BLAST.
Extract column 1 (contig name) and column 2 (# bases in the contig):
```{bash}
head -n 2 "../resources/Olurida_v081.fa.fai"
cut -f 1,2 "../resources/Olurida_v081.fa.fai" > "../resources/Olurida_chrom.sizes"
head -n 2 "../resources/Olurida_chrom.sizes"
```
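
To show what the `cut` step does without needing the real index, here is a small sketch on a made-up `.fai` file. The contig names and numbers are toy values (assumptions); only the five-column tab-delimited layout (name, length, offset, bases per line, bytes per line) mirrors a real FASTA index.

```bash
# Build a toy .fai file (hypothetical contigs, not from the O. lurida genome)
fai=$(mktemp)
printf 'ContigA\t30000\t9\t60\t61\n' > "$fai"
printf 'ContigB\t1200\t30520\t60\t61\n' >> "$fai"

# Keep only contig name and contig length, as in the chunk above
chrom_sizes=$(cut -f 1,2 "$fai")
echo "$chrom_sizes"

rm -f "$fai"
```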
### Run slopBed with gene feature file
```{bash}
head -n 4 "../genome-features/Olurida_v081-20190709.gene.gff"
```
```{bash}
slopBed \
-i "../genome-features/Olurida_v081-20190709.gene.gff" \
-g "../resources/Olurida_chrom.sizes" \
-b 2000 \
> "../genome-features/Olurida_v081-20190709.gene.2kbslop.gff"
head -n 3 "../genome-features/Olurida_v081-20190709.gene.2kbslop.gff"
```
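
To make the role of the genome file concrete, the sketch below approximates the slop arithmetic in `awk` on toy data: each gene is extended by 2000 bases on both sides, then clipped to the contig. This is an approximation for illustration, not bedtools itself, and the contig, genes and variable names are hypothetical.

```bash
# Toy genome file and toy GFF (assumptions, not real O. lurida data)
chrom_sizes=$(mktemp)
genes=$(mktemp)
printf 'ContigA\t30000\n' > "$chrom_sizes"
printf 'ContigA\tmaker\tgene\t24968\t28696\t.\t-\t.\tID=toy1\n' > "$genes"
printf 'ContigA\tmaker\tgene\t500\t1500\t.\t+\t.\tID=toy2\n' >> "$genes"

slopped=$(awk -v flank=2000 'BEGIN { FS = OFS = "\t" }
  NR == FNR { size[$1] = $2; next }                        # first file: remember contig lengths
  {
    $4 = $4 - flank; if ($4 < 1) $4 = 1                    # extend start, clip at contig start
    $5 = $5 + flank; if ($5 > size[$1]) $5 = size[$1]      # extend end, clip at contig end
    print
  }' "$chrom_sizes" "$genes")
echo "$slopped"

rm -f "$chrom_sizes" "$genes"
```

In this toy case the first gene's end is clipped to 30000 (the contig length) and the second gene's start is clipped to 1, which is why a genome file is required.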
### Use intersectBed to find where loci with methylation data and genes intersect, allowing loci to be mapped to annotated genes
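- `-wb`: write the original entry from file B (the gene annotation) for each overlap
- `-a`: input data file of loci with methylation data
- `-b`: annotated gene feature file (genes +/- 2kb), generated in the previous code chunk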
```{bash}
intersectBed \
-wb \
-a "../analyses/methylation/meth_filter_long.tab" \
-b "../genome-features/Olurida_v081-20190709.gene.2kbslop.gff" \
> "../analyses/methylation/meth_filter_gene.2kbslop.tab"
```
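
For intuition about what ends up in the output, here is a rough `awk` sketch of the overlap test on toy data. It assumes the loci file has BED-like columns (contig, 0-based start, end) and the gene file is GFF (1-based, inclusive); the column layout of `meth_filter_long.tab` is an assumption here, and all names and values are hypothetical.

```bash
# Toy loci (BED-like) and one toy gene (GFF); none of this is real data
loci=$(mktemp)
genes=$(mktemp)
printf 'ContigA\t99\t100\nContigA\t5000\t5001\nContigB\t99\t100\n' > "$loci"
printf 'ContigA\tmaker\tgene\t1\t2500\t.\t+\t.\tID=toy1\n' > "$genes"

hits=$(awk 'BEGIN { FS = OFS = "\t" }
  # First file (genes): store each gene, converting the GFF start to 0-based
  NR == FNR { n++; chrom[n] = $1; gstart[n] = $4 - 1; gend[n] = $5; line[n] = $0; next }
  # Second file (loci): print the locus followed by every gene entry it overlaps
  {
    for (i = 1; i <= n; i++)
      if ($1 == chrom[i] && $2 < gend[i] && $3 > gstart[i])
        print $0, line[i]
  }' "$genes" "$loci")
echo "$hits"

rm -f "$loci" "$genes"
```

Only the first locus falls inside the toy gene, so a single row is printed: the locus columns followed by the full gene entry, which is the shape of row that `-wb` produces.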
### Here is the number of loci associated with gene regions +/- 2kb:
```{bash}
wc -l "../analyses/methylation/meth_filter_gene.2kbslop.tab"
```
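
One caveat: `wc -l` counts overlap rows, so a locus that falls within two overlapping gene regions is counted twice. A hedged sketch of counting unique loci instead, assuming the first three columns identify a locus (toy data, hypothetical names):

```bash
# Toy overlap table: the first locus overlaps two gene regions
overlaps=$(mktemp)
printf 'ContigA\t99\t100\tgeneX\n' > "$overlaps"
printf 'ContigA\t99\t100\tgeneY\n' >> "$overlaps"
printf 'ContigA\t5000\t5001\tgeneY\n' >> "$overlaps"

n_rows=$(wc -l < "$overlaps" | tr -d ' ')                       # locus-gene pairs
n_loci=$(cut -f 1-3 "$overlaps" | sort -u | wc -l | tr -d ' ')  # distinct loci
echo "overlap rows: $n_rows; unique loci: $n_loci"

rm -f "$overlaps"
```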
### Check out format of gene regions with methylated loci
```{bash}
head -n 2 "../analyses/methylation/meth_filter_gene.2kbslop.tab"
```
### Now run intersectBed to NOT include the 2kb flanks
Return the number of loci associated with gene regions. Not sure if this will be used, but it's good to save it just in case.
```{bash}
intersectBed \
-wb \
-a "../analyses/methylation/meth_filter_long.tab" \
-b "../genome-features/Olurida_v081-20190709.gene.gff" \
> "../analyses/methylation/meth_filter_gene.tab"
wc -l "../analyses/methylation/meth_filter_gene.tab"
```
### Check out resulting files which contain methylated loci that are within gene regions (+/- 2kb), and gene bodies:
```{bash}
head -n 3 "../analyses/methylation/meth_filter_gene.2kbslop.tab"
```
```{bash}
head -n 1 "../analyses/methylation/meth_filter_gene.tab"
```
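### For the fun of it, run bedtools `flank` with the gene feature file to create 1) a downstream flanking region (+2kb downstream, 3'), and 2) an upstream flanking region (-2kb upstream, 5')
Downstream / 3' (using the `-r` option to add bases after each gene's end coordinate). Note: without `-s`, bedtools defines `-l` and `-r` by genomic coordinates rather than by strand, so for genes on the - strand this region is actually upstream / 5'.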
```{bash}
bedtools flank \
-i "../genome-features/Olurida_v081-20190709.gene.gff" \
-g "../resources/Olurida_chrom.sizes" \
-r 2000 \
-l 0 \
> "../genome-features/Olurida_v081-20190709.2kbflank-down.gff"
head -n 3 "../genome-features/Olurida_v081-20190709.2kbflank-down.gff"
```
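Upstream / 5' (using the `-l` option to subtract bases from each gene's start coordinate). Without `-s` this is defined by genomic coordinates, so for genes on the - strand this region is actually downstream / 3'.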
```{bash}
bedtools flank \
-i "../genome-features/Olurida_v081-20190709.gene.gff" \
-g "../resources/Olurida_chrom.sizes" \
-l 2000 \
-r 0 \
> "../genome-features/Olurida_v081-20190709.2kbflank-up.gff"
head -n 3 "../genome-features/Olurida_v081-20190709.2kbflank-up.gff"
```
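
Unlike `slop`, `flank` returns only the flanking intervals and excludes the gene itself. The `awk` sketch below approximates that arithmetic for one toy gene on a 30000-base contig; it is an illustration under those assumptions, and exact boundary handling in bedtools may differ.

```bash
# Toy input columns: contig, gene start, gene end, contig length (all hypothetical)
flanks=$(printf 'ContigA\t24968\t28696\t30000\n' | awk -v flank=2000 'BEGIN { FS = OFS = "\t" }
  {
    left_start = $2 - flank; if (left_start < 1) left_start = 1   # clip at contig start
    right_end  = $3 + flank; if (right_end > $4) right_end = $4   # clip at contig end
    print "left", left_start, $2 - 1      # bases before the gene start
    print "right", $3 + 1, right_end      # bases after the gene end
  }')
echo "$flanks"
```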
# ============================================================================
The below code was not needed for the paper
### Create new files that contain gene locations, binned by # of isoforms
These files will be used to assess the association between methylation and the # of isoforms in a gene.
These are the files I will create:
- "genome-features/Olurida_v081_genes.1isoform.bed"
- "genome-features/Olurida_v081_genes.2-5isoforms.bed"
- "genome-features/Olurida_v081_genes.6-10isoforms.bed"
- "genome-features/Olurida_v081_genes.11-20isoforms.bed"
- "genome-features/Olurida_v081_genes.21-70isoforms.bed"
```{r}
library(tidyverse) # read_delim(), stringr helpers, dplyr verbs, ggplot2

ASV.data <- read_delim(here::here("genome-features", "20190709-Olurida_v081.stringtie.gtf"), delim="\t", col_names=FALSE, skip = 2) %>%
  setNames(c("Contig", "source", "type", "start", "stop", "unknown1", "strand", "unknown2", "ID")) %>%
  mutate_at(vars(type), as.factor) %>%
  # Extract info for only the transcripts (aka, not exons)
  filter(type=="transcript") %>%
  # Split gene and transcript ID info into new columns
  mutate(gene_id=str_extract(ID, "gene_id (.*?);"),
         transcript_id=str_extract(ID, "transcript_id (.*?);"),
         ref_gene_id=str_extract(ID, "ref_gene_id (.*?);")) %>%
  # Strip the attribute names, trailing semicolons and quotes, leaving only the IDs
  mutate(gene_id=str_remove(gene_id, "gene_id "),
         transcript_id=str_remove(transcript_id, "transcript_id "),
         ref_gene_id=str_remove(ref_gene_id, "ref_gene_id ")) %>%
  mutate(gene_id=str_remove(gene_id, ";"),
         transcript_id=str_remove(transcript_id, ";"),
         ref_gene_id=str_remove(ref_gene_id, ";")) %>%
  mutate(gene_id=str_remove_all(gene_id, "\""),
         transcript_id=str_remove_all(transcript_id, "\""),
         ref_gene_id=str_remove_all(ref_gene_id, "\"")) %>%
  mutate_at(vars(gene_id), as.factor) %>%
  # Find the start and stop loci for each gene
  group_by(Contig, gene_id) %>%
  mutate(
    start_gene = min(start, na.rm = TRUE),
    stop_gene = max(stop, na.rm = TRUE)) %>%
  ungroup()
# Create new dataframe with the # of isoforms per gene (including start and stop loci for each gene)
ASV.per.gene <- ASV.data %>%
group_by(Contig, gene_id, start_gene, stop_gene) %>%
tally(name = "isoform_count") %>% ungroup()
table(ASV.per.gene$isoform_count)
# this gene has 70 isoforms! Double check
ASV.data %>% filter(gene_id == "MSTRG.55267")
# Create .bed files for genes that have various # of isoforms
#genes with only 1 transcript
write_delim(ASV.per.gene %>%
filter(isoform_count==1) %>%
select(Contig, start_gene, stop_gene) %>%
mutate_if(is.numeric, as.integer),
"../genome-features/Olurida_v081_genes.1isoform.bed", delim = '\t', col_names = FALSE)
#genes with 2-5 isoforms
write_delim(ASV.per.gene %>%
filter(isoform_count>1 & isoform_count<=5) %>%
select(Contig, start_gene, stop_gene) %>%
mutate_if(is.numeric, as.integer),
"../genome-features/Olurida_v081_genes.2-5isoforms.bed", delim = '\t', col_names = FALSE)
#genes with 6-10 isoforms
write_delim(ASV.per.gene %>%
filter(isoform_count>5 & isoform_count<=10) %>%
select(Contig, start_gene, stop_gene) %>%
mutate_if(is.numeric, as.integer),
"../genome-features/Olurida_v081_genes.6-10isoforms.bed", delim = '\t', col_names = FALSE)
#genes with 11-20 isoforms
write_delim(ASV.per.gene %>%
filter(isoform_count>10 & isoform_count<=20) %>%
select(Contig, start_gene, stop_gene) %>%
mutate_if(is.numeric, as.integer),
"../genome-features/Olurida_v081_genes.11-20isoforms.bed", delim = '\t', col_names = FALSE)
#genes with >20 isoforms
write_delim(ASV.per.gene %>%
filter(isoform_count>20) %>%
select(Contig, start_gene, stop_gene) %>%
mutate_if(is.numeric, as.integer),
"../genome-features/Olurida_v081_genes.21-70isoforms.bed", delim = '\t', col_names = FALSE)
```
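
The isoform tally can be sanity-checked from the command line. The sketch below counts `transcript` rows per `gene_id` in a toy GTF with `awk`; the gene and transcript IDs are made up, and it assumes the `gene_id "..."` attribute format seen in StringTie output.

```bash
# Toy GTF: two transcripts for toyG1 (plus one exon row that must be ignored), one for toyG2
gtf=$(mktemp)
printf 'ContigA\tStringTie\ttranscript\t100\t900\t.\t+\t.\tgene_id "toyG1"; transcript_id "toyG1.1";\n' > "$gtf"
printf 'ContigA\tStringTie\texon\t100\t400\t.\t+\t.\tgene_id "toyG1"; transcript_id "toyG1.1";\n' >> "$gtf"
printf 'ContigA\tStringTie\ttranscript\t150\t1200\t.\t+\t.\tgene_id "toyG1"; transcript_id "toyG1.2";\n' >> "$gtf"
printf 'ContigB\tStringTie\ttranscript\t10\t500\t.\t-\t.\tgene_id "toyG2"; transcript_id "toyG2.1";\n' >> "$gtf"

counts=$(awk -F '\t' '$3 == "transcript" {
    # Locate gene_id "..." in the attributes column, then strip the name and quotes
    if (match($9, /gene_id "[^"]+"/)) {
      id = substr($9, RSTART + 9, RLENGTH - 10)
      n[id]++
    }
  }
  END { for (id in n) print id "\t" n[id] }' "$gtf" | sort)
echo "$counts"

rm -f "$gtf"
```

Comparing a few genes from this kind of count against `ASV.per.gene` is a cheap way to confirm that exon rows are not inflating the isoform counts.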
#### Question: Does gene length ~ # isoforms?
If so, a longer gene is more likely to contain a CpG simply because of its length. It could also be that artifacts of transcriptome assembly lead to some genes (e.g. long ones) having multiple isoforms.
```{r}
#ggplotly(
ASV.per.gene %>% mutate(length_gene=stop_gene-start_gene) %>%
ggplot() + geom_boxplot(aes(x=as.factor(isoform_count), y=length_gene))
#)
```