---
output:
  html_document:
    theme: spacelab
    toc: true
    toc_depth: 2
    toc_float: true
---
```{r 'setup', echo = FALSE, warning = FALSE, message = FALSE}
## Bib setup
library('knitcitations')
library('BiocStyle')
## Load knitcitations with a clean bibliography
cleanbib()
cite_options(hyperlink = 'to.doc', citation_format = 'text', style = 'html')
## Write bibliography information
bib <- c(
    R = citation(),
    BiocStyle = citation('BiocStyle'),
    bioportal = bib_metadata('10.1093/nar/gkr469'),
    devtools = citation('devtools'),
    downloader = citation('downloader'),
    DT = citation('DT'),
    knitcitations = citation('knitcitations'),
    knitr = citation('knitr')[3],
    metasra = bib_metadata('10.1093/bioinformatics/btx334'),
    phenopredict = citation('recount')[3],
    recount = citation('recount')[1],
    rmarkdown = citation('rmarkdown'),
    sessioninfo = citation('sessioninfo'),
    shinycsv = citation('shinycsv'),
    srp = bib_metadata('10.1101/gr.165126.113'),
    tidyverse = citation('tidyverse'),
    gtex = bib_metadata('10.1038/s41467-017-02772-x')
)
write.bibtex(bib, file = 'index.bib')
```
<a href="https://jhubiostatistics.shinyapps.io/recount/"><img src="https://raw.githubusercontent.com/LieberInstitute/recount-brain/master/recount_brain.png" align="center"></a>
Code and results for the [recount-brain](https://github.com/LieberInstitute/recount-brain) project, which enhances the [recount2 project](https://jhubiostatistics.shinyapps.io/recount/). The `recount_brain` table can be accessed via the `r Biocpkg('recount')` `r citep(bib[['recount']])` Bioconductor package using `recount::add_metadata(source = 'recount_brain_v2')`.
# Contents
* [select_studies](select_studies.html) uses the predicted phenotype information by Shannon Ellis _et al._ `r citep(bib[['phenopredict']])` version 0.0.03 to determine candidate studies for `recount_brain` from the Sequence Read Archive (SRA): studies that have at least 4 samples, of which over 70% are from the brain. It creates the list of candidate projects saved in [projects_list.txt](projects_list.txt).
* [SRA_run_selector_info](https://github.com/LieberInstitute/recount-brain/tree/master/SRA_run_selector_info) contains one table per study in [projects_list.txt](projects_list.txt) with the data downloaded from the SRA Run Selector website https://www.ncbi.nlm.nih.gov/Traces/study/.
* [SRA_metadata](https://github.com/LieberInstitute/recount-brain/tree/master/SRA_metadata) contains a CSV table with the curated metadata for each study. This is the data that is then used to create `recount_brain`. Note that not all candidate studies were brain studies, so the final number of projects considered is 62.
* [merged_metadata](https://github.com/LieberInstitute/recount-brain/tree/master/merged_metadata) contains the `recount_brain` table that can be easily accessed via `r Biocpkg('recount')` `r citep(bib[['recount']])` using the `add_metadata()` function. The document [merging_data](merged_metadata/merging_data.html) describes how the `recount_brain` table was created using the files from `SRA_metadata` and includes some brief examples of how to explore the `recount_brain` table. You can access this initial version of `recount_brain` using `recount::add_metadata(source = 'recount_brain_v1')`.
* [metadata_reproducibility](https://github.com/LieberInstitute/recount-brain/tree/master/metadata_reproducibility) contains a document describing how the metadata was processed for each SRA study. It is intended to be useful for reproducibility purposes.
* The [cross_studies_metadata](https://github.com/LieberInstitute/recount-brain/tree/master/cross_studies_metadata) directory contains the [cross_studies_metadata](cross_studies_metadata/cross_studies_metadata.html) document describing how the recount-brain version 1 table was merged with GTEx and TCGA brain sample metadata to create the recount-brain version 2 table that facilitates cross-study comparisons. [SupplementaryTable2.csv](SupplementaryTable2.csv) describes which fields from the GTEx and TCGA data were used to merge them with `recount_brain` and any manipulations required to do so. The [cross_studies_metadata](https://github.com/LieberInstitute/recount-brain/tree/master/cross_studies_metadata) directory also contains a second document, [recount_brain_ontologies](cross_studies_metadata/recount_brain_ontologies.html), with the code used for adding Brodmann area, disease and tissue ontology information to `recount_brain`. This final table is the one you can access using `recount::add_metadata(source = 'recount_brain_v2')`.
* [metasra_comp](https://github.com/LieberInstitute/recount-brain/tree/master/metasra_comp) contains a comparison of `recount_brain_v2` and `MetaSRA` `r citep(bib[['metasra']])` as described in the [metasra_comp](metasra_comp/metasra_comp.html) HTML document.
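The selection rule described for `select_studies` (at least 4 samples, over 70% of them from the brain) can be sketched on a toy table. This is only an illustration under stated assumptions and not the code used in [select_studies](select_studies.html): the data frame `pred` and its columns `project` and `predicted_tissue` are hypothetical stand-ins for the predicted phenotype table.

```{r, eval = FALSE}
## Toy stand-in for the predicted phenotype table (hypothetical columns and values)
pred <- data.frame(
    project = c(rep('SRP_A', 5), rep('SRP_B', 4), rep('SRP_C', 3)),
    predicted_tissue = c(
        'Brain', 'Brain', 'Brain', 'Brain', 'Blood', # SRP_A: 4 of 5 are brain
        'Brain', 'Blood', 'Blood', 'Blood',          # SRP_B: 1 of 4 is brain
        'Brain', 'Brain', 'Brain'                    # SRP_C: only 3 samples
    ),
    stringsAsFactors = FALSE
)

## Number of samples and proportion of brain samples per project
n_samples <- tapply(pred$predicted_tissue, pred$project, length)
prop_brain <- tapply(pred$predicted_tissue == 'Brain', pred$project, mean)

## Keep projects with at least 4 samples and over 70% brain samples
candidates <- names(n_samples)[n_samples >= 4 & prop_brain > 0.7]
candidates
```

In this toy input only `SRP_A` passes both thresholds: `SRP_B` has too few brain samples and `SRP_C` has too few samples overall.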
# Example analyses
* We used the data from [SRP027383](https://trace.ncbi.nlm.nih.gov/Traces/sra/?study=SRP027383) `r citep(bib[['srp']])` to show how `recount_brain` can be used for a differential gene expression analysis. See the full example for more information: [example_SRP027383](example_SRP027383/example_SRP027383.html). You can also access the [pdf version](example_SRP027383/example_SRP027383.pdf) if you prefer it over the HTML version.
* We used the data from ten studies to replicate some of the analyses by Ferreira _et al._ `r citep(bib[['gtex']])` that explore the relationship between post-mortem interval and gene expression. See the full example for more information: [example_PMI](example_PMI/example_PMI.html). You can also access the [pdf version](example_PMI/example_PMI.pdf) if you prefer it over the HTML version.
* We also illustrate how to perform an analysis across multiple studies present in `recount_brain` and combine them with specific tissue data from The Cancer Genome Atlas (TCGA). See the full example for more information: [example_multistudy](example_multistudy/recount_brain_multistudy.html). You can also access the [pdf version](example_multistudy/recount_brain_multistudy.pdf) if you prefer it over the HTML version.
# List of variables
This information is also available as a CSV file at [SupplementaryTable1.csv](SupplementaryTable1.csv).
1. `age`: Age of donor
1. `age_units`: Units of age - (Years / Months / Post Conception Weeks)
1. `assay_type_s`: Sequencing technique - (RNA-Seq)
1. `avgspotlen_l`: Average length of sequenced read
1. `bioproject_s`: NCBI BioProject ID
1. `biosample_s`: NCBI BioSample ID
1. `brain_bank`: Brain tissue repository source
1. `brodmann_area`: Brodmann area for tissue from cerebral cortex - (1-52)
1. `cell_line`: Cell line description
1. `center_name_s`: Project center
1. `clinical_stage_1`: Clinically relevant tissue sample information
1. `clinical_stage_2`: Clinically relevant tissue sample information
1. `consent_s`: Data availability - (Public)
1. `development`: Stage of human development - (Fetus / Infant / Child / Adolescent / Adult)
1. `disease`: Disease description
1. `disease_status`: Nature of tissue - (Disease / Control)
1. `experiment_s`: NCBI Experiment ID
1. `hemisphere`: Cerebral hemisphere - (Left / Right)
1. `insertsize_l`: Length of sequence between adaptors
1. `instrument_s`: High throughput sequencing system
1. `library_name_s`: Internal sample ID used by original study
1. `librarylayout_s`: Sequencing layout - (Single / Paired)
1. `libraryselection_s`: Sequencing library - (cDNA)
1. `librarysource_s`: Sequencing source - (Transcriptomic)
1. `loaddate_s`: Sequencing load date
1. `mbases_l`: Megabases
1. `mbytes_l`: Megabytes
1. `organism_s`: Organism - (Homo sapiens)
1. `pathology`: Tissue pathology
1. `platform_s`: Sequencing platform - (Illumina)
1. `pmi`: Postmortem interval
1. `pmi_units`: Units of postmortem interval - (Hours)
1. `preparation`: Specimen preparation - (Frozen)
1. `present_in_recount`: Expression data present in recount2
1. `race`: Race of donor - (Asian / Black / Hispanic / White)
1. `releasedate_s`: Sequencing release date
1. `rin`: RNA integrity number
1. `run_s`: NCBI Run ID
1. `sample_name_s`: GEO Accession ID
1. `sample_origin`: Tissue origin - (Brain / iPSC)
1. `sex`: Sex of donor - (Female / Male)
1. `sra_sample_s`: NCBI SRA Sample ID
1. `sra_study_s`: NCBI SRA Study ID
1. `tissue_site_1`: Anatomic site of tissue
1. `tissue_site_2`: Anatomic site of tissue, further specified
1. `tissue_site_3`: Anatomic site of tissue, further specified
1. `tumor_type`: Type of tumor - (Glioblastoma / Astrocytoma / Ependymoma / Oligodendroglioma)
1. `viability`: Tissue viability - (Postmortem / Biopsy)
You can access this initial version with `recount::add_metadata(source = 'recount_brain_v1')`.
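As a rough sketch of how these variables can be used to subset samples, the example below builds a small toy data frame that borrows a few of the variable names listed above. The values and run identifiers are made up for illustration; with the real table you would apply the same steps to the object returned by `recount::add_metadata()`.

```{r, eval = FALSE}
## Toy table using a few of the variable names listed above (values are made up)
toy_brain <- data.frame(
    run_s = c('RUN1', 'RUN2', 'RUN3', 'RUN4'),
    disease_status = c('Control', 'Disease', 'Control', 'Control'),
    sex = c('Female', 'Male', 'Male', NA),
    pmi = c(12.5, 20, NA, 30),
    pmi_units = c('Hours', 'Hours', NA, 'Hours'),
    stringsAsFactors = FALSE
)

## Control samples with a recorded postmortem interval
controls_pmi <- subset(toy_brain, disease_status == 'Control' & !is.na(pmi))
controls_pmi$run_s

## Tabulate sex by disease status, keeping missing values visible
table(toy_brain$sex, toy_brain$disease_status, useNA = 'ifany')
```

Checking for missing values explicitly matters here because not every study reports every variable.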
List of variables present in `recount_brain_v2`.
49. `Study_full`: either the SRA study accession, GTEX or TCGA.
50. `drugName_full`: the drug name for TCGA samples.
51. `drug_info_full`: logical, whether the sample has drug information; only for TCGA.
52. `drug_type_full`: the drug classification (chemotherapy, immunotherapy, ...); only for TCGA.
53. `full_260_280`: the 260 to 280 ratio; only for TCGA.
54. `count_file_identifier`: the SRA run accession or the TCGA run (sample) identifier. Useful for merging with the rest of recount2 metadata.
55. `Dataset`: either SRA, GTEX or TCGA.
56. `brodmann_ontology`: URL for the Brodmann region ontology. See the [`recount_brain_ontologies`](cross_studies_metadata/recount_brain_ontologies.html) file for how this information was added.
57. `brodmann_synonyms`: synonyms used for the Brodmann regions. These facilitate text based searches. Separated by `|`.
58. `brodmann_parents`: URLs for the Brodmann ontology parents. Separated by `|`.
59. `brodmann_parents_label`: Brodmann ontology parent text preferred labels. Separated by `|`.
60. `disease_ontology`: URL for the disease ontology.
61. `tissue`: tissue as prioritized by `tissue_site_3` over `tissue_site_2` over `tissue_site_1`.
62. `tissue_ontology`: URL for the tissue ontology.
63. `tissue_synonyms`: tissue synonyms which facilitate text based searches. Separated by `|`.
64. `tissue_parents`: URLs for the tissue ontology parents. Separated by `|`.
65. `tissue_parents_label`: tissue ontology parent text preferred labels. Separated by `|`.
You can access this version with `recount::add_metadata(source = 'recount_brain_v2')`.
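The `tissue` prioritization and the `|`-separated fields described above can be sketched as follows. This is a minimal illustration on a toy data frame with made-up values, not the code from [`recount_brain_ontologies`](cross_studies_metadata/recount_brain_ontologies.html); the object `toy` and the search term are assumptions for the example.

```{r, eval = FALSE}
## Toy table with the three tissue site columns (values are made up)
toy <- data.frame(
    tissue_site_1 = c('Cerebral cortex', 'Cerebellum', NA),
    tissue_site_2 = c('Frontal lobe', NA, NA),
    tissue_site_3 = c('Dorsolateral prefrontal cortex', NA, NA),
    tissue_synonyms = c('DLPFC|dorsolateral prefrontal cortex', 'cerebellum', NA),
    stringsAsFactors = FALSE
)

## Most specific tissue available: site 3, else site 2, else site 1
tissue <- ifelse(
    !is.na(toy$tissue_site_3), toy$tissue_site_3,
    ifelse(!is.na(toy$tissue_site_2), toy$tissue_site_2, toy$tissue_site_1)
)

## Split the '|'-separated synonyms; fixed = TRUE treats '|' as a literal
## character instead of the regular expression "or" operator
synonyms <- strsplit(toy$tissue_synonyms, '|', fixed = TRUE)

## Rows where any synonym matches a search term, ignoring case;
## %in% returns FALSE (not NA) for rows without synonyms
has_dlpfc <- vapply(synonyms, function(x) 'dlpfc' %in% tolower(x), logical(1))
```

Using `fixed = TRUE` is the important design choice: without it, `strsplit()` interprets `'|'` as a regular expression and splits the text into single characters.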
# List of SRA projects present in `recount_brain`
```{r 'list of projects', echo = FALSE, results = 'asis'}
load('merged_metadata/recount_brain_v1.Rdata')
cat(paste0('1. [', unique(recount_brain$sra_study_s), '](https://www.ncbi.nlm.nih.gov/Traces/study/?acc=', unique(recount_brain$sra_study_s), ') \n'))
```
# Explore interactively
We recommend opening the [interactive `recount_brain` exploration](https://jhubiostatistics.shinyapps.io/recount-brain/) in another window.
<iframe id="example1" src="https://jhubiostatistics.shinyapps.io/recount-brain/"
style="border: none; width: 1400px; height: 1500px"
frameborder="0">
</iframe>
This application is a custom version of `shinycsv` `r citep(bib[['shinycsv']])`. The code for making this application is available in the [shinytable](https://github.com/LieberInstitute/recount-brain/tree/master/shinytable/) directory.
# Questions
If you have any questions about `recount_brain` please post them as an issue at [LieberInstitute/recount-brain](https://github.com/LieberInstitute/recount-brain/issues) and include the relevant session information using the following code. Thank you!
```{r, eval = FALSE}
library('sessioninfo')
options(width = 120)
session_info()
```
# References
The analyses were made possible thanks to `BioPortal` `r citep(bib[['bioportal']])`, `MetaSRA` `r citep(bib[['metasra']])`, and:
* R `r citep(bib[['R']])`
* `r Biocpkg('BiocStyle')` `r citep(bib[['BiocStyle']])`
* `r CRANpkg('devtools')` `r citep(bib[['devtools']])`
* `r CRANpkg('downloader')` `r citep(bib[['downloader']])`
* `r CRANpkg('DT')` `r citep(bib[['DT']])`
* `r CRANpkg('knitcitations')` `r citep(bib[['knitcitations']])`
* `r CRANpkg('knitr')` `r citep(bib[['knitr']])`
* `r Githubpkg('leekgroup/phenopredict')` `r citep(bib[['phenopredict']])`
* `r Biocpkg('recount')` `r citep(bib[['recount']])`
* `r CRANpkg('rmarkdown')` `r citep(bib[['rmarkdown']])`
* `r CRANpkg('sessioninfo')` `r citep(bib[['sessioninfo']])`
* `r Githubpkg('LieberInstitute/shinycsv')` `r citep(bib[['shinycsv']])`
* `r CRANpkg('tidyverse')` `r citep(bib[['tidyverse']])`
[Bibliography file](index.bib)
```{r bibliography, results='asis', echo=FALSE, warning = FALSE, message = FALSE}
## Print bibliography
bibliography(style = 'html')
```