-
Notifications
You must be signed in to change notification settings - Fork 14
/
Copy pathlab_geoquery.Rmd
145 lines (110 loc) · 4.31 KB
/
lab_geoquery.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
---
title: "Online Repositories"
subtitle: "Workshop on RNA-Seq"
editor_options:
chunk_output_type: console
output:
bookdown::html_document2:
highlight: textmate
toc: true
toc_float:
collapsed: true
smooth_scroll: true
print: false
toc_depth: 4
number_sections: true
df_print: default
code_folding: none
self_contained: false
keep_md: false
encoding: 'UTF-8'
css: "assets/lab.css"
include:
after_body: assets/footer-lab.html
---
```{r,child="assets/header-lab.Rmd"}
```
<div class="boxy boxy-exclamation boxy-yellow">
Set `~/RNAseq/labs/` as your working directory (i.e. your working directory should be the directory where you saved this .Rmd file).
</div>
Load R packages.
```{r, eval=F}
# BiocManager::install("GEOquery")
library(GEOquery)
library(Biobase)
library(rafalib)
rafalib::mypar(mar=c(6,2.5,2.5,1)) #sets nice arrangement for the whole document
# source download function
source("https://raw.githubusercontent.com/NBISweden/workshop-RNAseq/master/assets/scripts.R")
```
## Loading a dataset from GEO using GEOquery
For this exercise, let's try getting the full count matrix from the original [GSE131032](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE131032) dataset.
In some cases the simple `getGEO` will download the data, metadata and gene annotation into a ExpressionSet format, where you can then extract the information about the gene counts, metadata and gene annotations.
```{r, eval=F}
#Get the GSE object from it
gset <- GEOquery::getGEO("GSE131032")[[1]]
gset
#Get the count matrix from the GSE objetc
data <- gset@assayData$exprs
dim(data)
#Get the sample metadata / phenotypic from the GSE object
metadata <- gset@phenoData@data
dim(metadata)
#Get the gene annotation data from the GSE object
annot <- gset@featureData@data
dim(annot)
```
However, in most cases (in our experience) that is not the case since every dataset is a bit unique. So we recommend checking the files deposited individually and download what is most relevant for you. We can first list the files deposited:
```{r, eval=F}
file_list <- GEOquery::getGEOSuppFiles("GSE131032",fetch_files = F)
print(file_list)
```
You can see the `RAW.tar` file, which contains the original count file deposited individually for every sample (under the GSM). The command below will dowlaod the file into a folder with the GSE name. This is a `.tar` compressed file that can be extracted.
```{r, eval=F}
GEOquery::getGEOSuppFiles("GSE131032",fetch_files = T,filter_regex = "_RAW.tar")
list.files("GSE131032")
#extract tar using bash (within R, but can be done outside too)
system("cd GSE131032; tar -xvf GSE131032_RAW.tar")
system("cd GSE131032; rm GSE131032_RAW.tar")
list.files("GSE131032")
system("cd GSE131032; gunzip *.gz")
```
Having the individual files, we can combine them using a for loop or apply function:
```{r, eval=F}
#list the files
list.files("GSE131032")
#Check the column names from the 1st file
head( read.delim(paste0("GSE131032/",list.files("GSE131032")[1])) )
#Do an apply function to get all the estimated counts from each file and compile into a single matrix
data <- sapply(list.files("GSE131032",pattern = ".tsv"), function(x){
x <- read.delim(paste0("GSE131032/",x))[,"est_counts"]
return(x)
})
rownames(data)
colnames(data)
#Add the gene names and sample names
rownames(data) <- read.delim(paste0("GSE131032/",list.files("GSE131032")[1]))[,"target_id"]
colnames(data) <- sub("_.*","",colnames(data))
#Check the column names from the 1st file
rownames(metadata) == colnames(data)
#Rename data
colnames(data) <- metadata$title
head(data)
```
Alternativelly, in the `file_list` above we could also simply load the count matrix.
```{r, eval=F}
#list files available
print(file_list)
#Download and extract the full matrix
GEOquery::getGEOSuppFiles("GSE131032",fetch_files = T,filter_regex = "GSE131032_kallisto_counts.csv.gz")
system("cd GSE131032; gunzip GSE131032_kallisto_counts.csv.gz")
#Read the file (note that the sample names here are not the GSM accesccion, but the investigator's IDs)
data <- read.delim(paste0("GSE131032/GSE131032_kallisto_counts.csv"),sep = ",",row.names = 1)
colnames(data)
#Those sample IDs are stored in one of the metadata columns
metadata$description
metadata$description == colnames(data)
#Rename data
colnames(data) <- metadata$title
head(data)
```