index.Rmd

---
output:
  html_document:
    theme: spacelab
    toc: true
    toc_depth: 2
    toc_float: true
---

```{r 'setup', echo = FALSE, warning = FALSE, message = FALSE}
## Bib setup
library('knitcitations')
library('BiocStyle')

## Load knitcitations with a clean bibliography
cleanbib()
cite_options(hyperlink = 'to.doc', citation_format = 'text', style = 'html')

## Write bibliography information
bib <- c(
    R = citation(),
    BiocStyle = citation('BiocStyle'),
    bioportal = bib_metadata('10.1093/nar/gkr469'),
    devtools = citation('devtools'),
    downloader = citation('downloader'),
    DT = citation('DT'),
    knitcitations = citation('knitcitations'),
    knitr = citation('knitr')[3],
    metasra = bib_metadata('10.1093/bioinformatics/btx334'),
    phenopredict = citation('recount')[3],
	recount = citation('recount')[1],
    rmarkdown = citation('rmarkdown'),
    sessioninfo = citation('sessioninfo'),
    shinycsv = citation('shinycsv'),
    srp = bib_metadata('10.1101/gr.165126.113'),
    tidyverse = citation('tidyverse'),
    gtex = bib_metadata('10.1038/s41467-017-02772-x')
)

write.bibtex(bib, file = 'index.bib')
```

<a href="https://jhubiostatistics.shinyapps.io/recount/"><img src="https://raw.githubusercontent.com/LieberInstitute/recount-brain/master/recount_brain.png" align="center"></a>

Code and results for the [recount-brain](https://github.com/LieberInstitute/recount-brain) project that enhances the [recount2 project](https://jhubiostatistics.shinyapps.io/recount/) project. The `recount_brain` table can be accessed via the `r Biocpkg('recount')` `r citep(bib[['recount']])` Bioconductor package using `recount::add_metadata(source = 'recount_brain_v2')`.


# Contents

* [select_studies](select_studies.html) uses the predicted phenotype information by Shannon Ellis _et al._ `r citep(bib[['phenopredict']])` version 0.0.03 to determine candidate studies for `recount_brain` from the Sequence Read Archive (SRA) that have at least 4 samples and over 70% of the samples are from the brain. It creates the list of candidate projects saved in [projects_lists.txt](projects_list.txt).
* [SRA_run_selector_info](https://github.com/LieberInstitute/recount-brain/tree/master/SRA_run_selector_info) contains a table per study in [projects_lists.txt](projects_list.txt) with the data downloaded from the SRA Run Selector website https://www.ncbi.nlm.nih.gov/Traces/study/. 
* [SRA_metadata](https://github.com/LieberInstitute/recount-brain/tree/master/SRA_metadata) contains a CSV table with the curated metadata for each study. This is the data that is then used to create `recount_brain`. Note that not all candidate studies were brain studies so the final number of projects considered is 62.
* [merged_metadata](https://github.com/LieberInstitute/recount-brain/tree/master/merged_metadata) contains the `recount_brain` table that can be easily accessed via `r Biocpkg('recount')` `r citep(bib[['recount']])` using the `add_metadata()` function. The document [merging_data](merged_metadata/merging_data.html) describes how the `recount_brain` was created using the files from `SRA_metadata` and includes some brief examples on how to explore the `recount_brain` table. You can access this initial version of `recount_brain` using `recount::add_metadata(source = 'recount_brain_v1')`.
* [metadata_reproducibility](https://github.com/LieberInstitute/recount-brain/tree/master/metadata_reproducibility) contains a document describing how the metadata was processed for each SRA study. It is intended to be useful for reproducibility purposes.
* The [cross_studies_metadata](https://github.com/LieberInstitute/recount-brain/tree/master/cross_studies_metadata) directory contains the [cross_studies_metadata](cross_studies_metadata/cross_studies_metadata.html) document describing how the recount-brain version 1 table was merged with GTEx and TCGA brain samples metadata to create the recount-brain version 2 table that facilitates cross-study comparisons. [SupplementaryTable2.csv](SupplementaryTable2.csv) describes which fields from the GTEx and TCGA data were used to merge them with `recount_brain` and any manipulations required to do so. The [cross_studies_metadata](https://github.com/LieberInstitute/recount-brain/tree/master/cross_studies_metadata) directory also contains a second document, [recount_brain_ontologies](cross_studies_metadata/recount_brain_ontologies.html), with the code used for adding Broadmann area, disease and tissue ontology information to `recount_brain`. This final table is the one you can access using `recount::add_metadata(source = 'recount_brain_v2')`.
* [metasra_comp](https://github.com/LieberInstitute/recount-brain/tree/master/metasra_comp) contains a comparison of `recount_brain_v2`  and `MetaSRA` `r citep(bib[['metasra']])` as described in the [metasra_comp](metasra_comp/metasra_comp.html) html document. 

# Example analyses

* We used the data from [SRP027383](https://trace.ncbi.nlm.nih.gov/Traces/sra/?study=SRP027383) `r citep(bib[['srp']])` to show how `recount_brain` can be used for a gene differential expression analysis. See the full example for more information: [example_SRP027383](example_SRP027383/example_SRP027383.html). You can also access the [pdf version](example_SRP027383/example_SRP027383.pdf) if you prefer over the HTML version.
* We used the data from ten studies to replicate some of the analyses by Ferreira _et al._ `r citep(bib[['gtex']])` that explore the relationship between post-mortem interval and gene expression. See the full example for more information: [example_PMI](example_PMI/example_PMI.html). You can also access the [pdf version](example_PMI/example_PMI.pdf) if you prefer it over the HTML version.
* We also illustrate how to perform an analysis across multiple studies present in `recount_brain` and combining them with specific tissue data from The Cancer Genome Atlas (TCGA). See the full example for more information: [example_multistudy](example_multistudy/recount_brain_multistudy.html). You can also access the [pdf version](example_multistudy/recount_brain_multistudy.pdf) if you prefer it over the HTML version.

# List of variables

This information is also available as a csv file at [SupplementaryTable1.csv](SupplementaryTable1.csv).

1. `age`: Age of donor
1. `age_units`: Units of age - (Years / Months / Post Conception Weeks)
1. `assay_type_s`: Sequencing technique - (RNA-Seq)
1. ` avgspotlen_l`: Average length of sequenced read
1. `bioproject_s`: NCBI BioProject ID
1. `biosample_s`: NCBI BioSample ID
1. `brain_bank`: Brain tissue repository source
1. `brodmann_area`: Brodmann area for tissue from cerebral cortex - (1-52)
1. `cell_line`: Cell line description
1. `center_name_s`: Project center
1. `clinical_stage_1`: Clinically relevant tissue sample information
1. `clinical_stage_2`: Clinically relevant tissue sample information
1. `consent_s`: Data availability - (Public)
1. `development`: Stage of human development - (Fetus / Infant / Child / Adolescent / Adult)
1. `disease`: Disease description
1. `disease_status`: Nature of tissue - (Disease / Control)
1. `experiment_s`: NCBI Experiment ID
1. `hemisphere`: Cerebral hemisphere - (Left / Right)
1. `insertsize_l`: Length of sequence between adaptors 
1. `instrument_s`: High throughput sequencing system
1. `library_name_s`: Internal sample ID used by original study
1. `librarylayout_s`: Sequencing layout - (Single / Paired)
1. `libraryselection_s`: Sequencing library - (cDNA)
1. `librarysource_s`: Sequencing source - (Transcriptomic)
1. `loaddate_s`: Sequencing load date
1. `mbases_l`: Megabases
1. `mbytes_l`: Megabytes
1. `organism_s`: Organism - (Homo sapiens)
1. `pathology`: Tissue pathology
1. `platform_s`: Sequencing platform - (Illumina)
1. `pmi`: Postmortem interval
1. `pmi_units`: Units of postmortem interval - (Hours)
1. `preparation`: Specimen preparation - (Frozen)
1. `present_in_recount`: Expression data present in recount2
1. `race`: Race of donor - (Asian / Black / Hispanic / White)
1. `releasedate_s`: Sequencing release date
1. `rin`: RNA integrity number
1. `run_s`: NCBI Run ID
1. `sample_name_s`: GEO Accession ID
1. `sample_origin`: Tissue origin - (Brain / iPSC)
1. `sex`: Sex of donor - (Female / Male)
1. `sra_sample_s`: NCBI SRA Sample ID
1. `sra_study_s`: NCBI SRA Study ID
1. `tissue_site_1`: Anatomic site of tissue
1. `tissue_site_2`: Anatomic site of tissue, further specified
1. `tissue_site_3`: Anatomic site of tissue, further specified
1. `tumor_type`: Type of tumor - (Glioblastoma / Astrocytoma / Ependymoma / Oligodendroglioma)
1. `viability`: Tissue viability - (Postmortem / Biopsy)

You can access this initial version with `recount::add_metadata(source = 'recount_brain_v1')`.


List of variables present in `recount_brain_v2`.

49. `Study_full`: either the SRA study accession, GTEX or TCGA.
50. `drugName_full`: the drug name for TCGA samples.
51. `drug_info_full`: logical, whether the sample has drug information; only for TCGA.
52. `drug_type_full`: the drug classification (chemotherapy, immunotherapy, ...); only for TCGA.
53. `full_260_280`: the 260 to 280 ratio; only for TCGA.
54. `count_file_identifier`: the SRA run accession or the TCGA run (sample) identifier. Useful for merging with the rest of recount2 metadata.
55. `Dataset`: either SRA, GTEX or TCGA.
56. `brodmann_ontology`: URL for the Brodmann region ontology. See the [`recount_brain_ontologies`](cross_studies_metadata/recount_brain_ontologies.html) file for how this information was added.
57. `brodmann_synonyms`: synonyms used for the Brodmann regions. These facilitate text based searches. Separated by `|`.
58. `brodmann_parents`: URLs for the Brodmann ontology parents. Separated by `|`.
59. `brodmann_parents_label`: Brodmann ontology parent text preferred labels. Separated by `|`.
60. `disease_ontology`: URL for the disease ontology.
61. `tissue`: tissue as prioritized by `tissue_site_3` over `tissue_site_2` over `tissue_site_1`.
62. `tissue_ontology`: URL for the tissue ontology.
63. `tissue_synonyms`: tissue synonyms which facilitate text based searches. Separated by `|`.
64. `tissue_parents`: URLs for the tissue ontology parents. Separated by `|`.
65. `tissue_parents_label`: tissue ontology parent text preferred labels. Separated by `|`.


You can access this version with `recount::add_metadata(source = 'recount_brain_v2')`.


# List of SRA projects present in `recount_brain`

```{r 'list of projects', echo = FALSE, results = 'asis'}
load('merged_metadata/recount_brain_v1.Rdata')
cat(paste0('1. [', unique(recount_brain$sra_study_s), '](https://www.ncbi.nlm.nih.gov/Traces/study/?acc=', unique(recount_brain$sra_study_s), ') \n'))
```


# Explore interactively

We recommend opening the [interactive `recount_brain` exploration](https://jhubiostatistics.shinyapps.io/recount-brain/) in another window.

<iframe id="example1" src="https://jhubiostatistics.shinyapps.io/recount-brain/"
style="border: non; width: 1400px; height: 1500px"
frameborder="0">
</iframe>

This application is a custom version of `shinycsv` `r citep(bib[['shinycsv']])`. The code for making this application is available in the [shinytable](https://github.com/LieberInstitute/recount-brain/tree/master/shinytable/) directory.


# Questions

If you have any questions about `recount_brain` please post them as an issue at [LieberInstitute/recount-brain](https://github.com/LieberInstitute/recount-brain/issues) and include the relevant session information using the following code. Thank you!

```{r, eval = FALSE}
library('sessioninfo')
options(width = 120)
session_info()
```


# References

The analyses were made possible thanks to `BioPortal` `r citep(bib[['bioportal']])`, `MetaSRA` `r citep(bib[['metasra']])`, and:

* R `r citep(bib[['R']])`
* `r Biocpkg('BiocStyle')` `r citep(bib[['BiocStyle']])`
* `r CRANpkg('devtools')` `r citep(bib[['devtools']])`
* `r CRANpkg('downloader')` `r citep(bib[['downloader']])`
* `r CRANpkg('DT')` `r citep(bib[['DT']])`
* `r CRANpkg('knitcitations')` `r citep(bib[['knitcitations']])`
* `r CRANpkg('knitr')` `r citep(bib[['knitr']])`
* `r Githubpkg('leekgroup/phenopredict')` `r citep(bib[['phenopredict']])`
* `r Biocpkg('recount')` `r citep(bib[['recount']])`
* `r CRANpkg('rmarkdown')` `r citep(bib[['rmarkdown']])`
* `r CRANpkg('sessioninfo')` `r citep(bib[['sessioninfo']])`
* `r Githubpkg('LieberInstitute/shinycsv')` `r citep(bib[['shinycsv']])`
* `r CRANpkg('tidyverse')` `r citep(bib[['tidyverse']])`

[Bibliography file](index.bib)

```{r bibliography, results='asis', echo=FALSE, warning = FALSE, message = FALSE}
## Print bibliography
bibliography(style = 'html')
```