
v2 Data Sources

Tiffany J. Callahan edited this page Dec 21, 2021 · 134 revisions

Release V2.0.0 Knowledge Graph Data Sources


Release: v2.0.0

Data Access: https://console.cloud.google.com/storage/browser/pheknowlator/archived_builds/release_v2.0.0

Dependencies:

Rationale: The goal of this build was to create a knowledge graph that represented human disease mechanisms and included the central dogma. The data sources utilized in this release include many of the sources used in the initial release, as well as some new data made available by the Comparative Toxicogenomics Database and experimental data from the Human Protein Atlas.



ONTOLOGIES




Cell Ontology

Homepage: GitHub
Citation:

Bard J, Rhee SY, Ashburner M. An ontology for cell types. Genome Biology. 2005;6(2):R21

Usage: Utilized to connect transcripts and proteins to cells. Additionally, the edges between this ontology and its dependencies are utilized:



Cell Line Ontology

Homepage: http://www.clo-ontology.org/
Citation:

Sarntivijai S, Lin Y, Xiang Z, Meehan TF, Diehl AD, Vempati UD, Schürer SC, Pang C, Malone J, Parkinson H, Liu Y. CLO: the cell line ontology. Journal of Biomedical Semantics. 2014;5(1):37

Usage: Utilized this ontology to map cell lines to transcripts and proteins. Additionally, the edges between this ontology and its dependencies are utilized:


ChEBI Ontology - Lite

Homepage: https://www.ebi.ac.uk/chebi/
Citation:

Hastings J, Owen G, Dekker A, Ennis M, Kale N, Muthukrishnan V, Turner S, Swainston N, Mendes P, Steinbeck C. ChEBI in 2016: Improved services and an expanding collection of metabolites. Nucleic Acids Research. 2015;44(D1):D1214-9

Usage: Utilized to connect chemicals to complexes, diseases, genes, GO biological processes, GO cellular components, GO molecular functions, pathways, phenotypes, reactions, and transcripts.


Gene Ontology

Homepage: http://geneontology.org/
Citations:

Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, Harris MA. Gene ontology: tool for the unification of biology. Nature Genetics. 2000;25(1):25

The Gene Ontology Consortium. The Gene Ontology Resource: 20 years and still GOing strong. Nucleic Acids Research. 2018;47(D1):D330-8

Usage: Utilized to connect biological processes, cellular components, and molecular functions to chemicals, pathways, and proteins. Additionally, the edges between this ontology and its dependencies are utilized:

Other Gene Ontology Data Used: goa_human.gaf.gz

Usage: Utilized to create protein-gobp, protein-gocc, and protein-gomf edges. The original data is filtered such that only records meeting the following criteria were included:
- Protein-GO Biological Process: column[3] not in ["NOT"] and column[8] == "P" and column[11] == "protein" and column[12] == "taxon:9606"
- Protein-GO Cellular Component: column[3] not in ["NOT"] and column[8] == "C" and column[11] == "protein" and column[12] == "taxon:9606"
- Protein-GO Molecular Function: column[3] not in ["NOT"] and column[8] == "F" and column[11] == "protein" and column[12] == "taxon:9606"
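
The three criteria above differ only in the aspect code, so they can be sketched as one function. This is a minimal illustration rather than the project's actual code: the function name `filter_goa` and the constant names are hypothetical, and the column positions are the zero-based indices quoted in the criteria.

```python
import csv

# Zero-based column positions referenced by the criteria above.
QUALIFIER, ASPECT, OBJECT_TYPE, TAXON = 3, 8, 11, 12


def filter_goa(lines, aspect):
    """Yield GAF rows for one GO aspect ("P", "C", or "F").

    `lines` is any iterable of tab-delimited GAF lines. Lines starting
    with "!" are header comments and are skipped.
    """
    data_lines = (line for line in lines if not line.startswith("!"))
    reader = csv.reader(data_lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) <= TAXON:
            continue  # skip truncated or malformed rows
        if (
            row[QUALIFIER] not in ["NOT"]
            and row[ASPECT] == aspect
            and row[OBJECT_TYPE] == "protein"
            and row[TAXON] == "taxon:9606"
        ):
            yield row
```

Under these assumptions, opening goa_human.gaf.gz with `gzip.open(path, "rt")` and calling `filter_goa(handle, "P")` would yield the rows used for protein-gobp edges; `"C"` and `"F"` select the other two edge types.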

Metadata:
GO-BP, GO-CC, and GO-MF Relations

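| Type | Source Column | Metadata Variable Name |
| --- | --- | --- |
| Node Metadata (protein) | DB_Object_Symbol | dbxref |
| Node Metadata (protein) | With_Or_From | dbxref |
| Node Metadata (protein) | DB_Object_Name | synonym |
| Node Metadata (protein) | DB_Object_Synonym | synonym |
| Node Metadata (go-bp, go-cc, go-mf) | Aspect | GOA_Aspect |
| Edge Metadata (protein-gobp, protein-go-cc, protein-go-mf) | Qualifier | GOA_Qualifier |
| Edge Metadata (protein-gobp, protein-go-cc, protein-go-mf) | DB_Reference | GOA_DB_Reference |
| Edge Metadata (protein-gobp, protein-go-cc, protein-go-mf) | EvidenceCode | GOA_EvidenceCode |
| Edge Metadata (protein-gobp, protein-go-cc, protein-go-mf) | Taxon | GOA_Taxon |
| Edge Metadata (protein-gobp, protein-go-cc, protein-go-mf) | AssignedBy | GOA_AssignedBy |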


Human Phenotype Ontology

Homepage: https://hpo.jax.org/
Citation:

Köhler S, Carmody L, Vasilevsky N, Jacobsen JO, Danis D, Gourdine JP, Gargano M, Harris NL, Matentzoglu N, McMurry JA, Osumi-Sutherland D. Expansion of the Human Phenotype Ontology (HPO) knowledge base and resources. Nucleic Acids Research. 2018;47(D1):D1018-27

Usage: Utilized to connect phenotypes to chemicals, diseases, genes, and variants. Additionally, the edges between this ontology and its dependencies are utilized:

Other Human Phenotype Ontology Data Used: phenotype.hpoa

Usage: Utilized to create disease-phenotype edges. The original data is filtered such that only records meeting the following criteria were included:
- Qualifier != "NOT"

Metadata:
Disease-Phenotype Relations

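| Type | Source Column | Metadata Variable Name |
| --- | --- | --- |
| Node Metadata (disease) | DiseaseName | synonym |
| Node Metadata (disease) | DiseaseID | dbxref |
| Edge Metadata (disease-phenotype) | Reference | HPO_Reference |
| Edge Metadata (disease-phenotype) | Evidence | HPO_Evidence |
| Edge Metadata (disease-phenotype) | Frequency | HPO_Frequency |
| Edge Metadata (disease-phenotype) | Sex | HPO_Sex |
| Edge Metadata (disease-phenotype) | Modifier | HPO_Modifier |
| Edge Metadata (disease-phenotype) | Aspect | HPO_Aspect |
| Edge Metadata (disease-phenotype) | Biocuration | HPO_Biocuration |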


Mondo Disease Ontology

Homepage: https://mondo.monarchinitiative.org/
Citation:

Mungall CJ, McMurry JA, Köhler S, Balhoff JP, Borromeo C, Brush M, Carbon S, Conlin T, Dunn N, Engelstad M, Foster E. The Monarch Initiative: an integrative data and analytic platform connecting phenotypes to genotypes across species. Nucleic Acids Research. 2017;45(D1):D712-22

Usage: Utilized to connect diseases to chemicals, phenotypes, genes, and variants. Additionally, the edges between this ontology and its dependencies are utilized:

Other Mondo Ontology Data Used: DISEASE_MONDO_MAP.txt


Pathway Ontology

Homepage: rgd.mcw.edu
Citation:

Petri V, Jayaraman P, Tutaj M, Hayman GT, Smith JR, De Pons J, Laulederkind SJ, Lowry TF, Nigam R, Wang SJ, Shimoyama M. The pathway ontology–updates and applications. Journal of Biomedical Semantics. 2014;5(1):7.

Usage: Utilized to connect pathways to GO biological processes, GO cellular components, GO molecular functions, and Reactome pathways. Several steps are taken to connect Pathway Ontology identifiers to Reactome pathways and GO biological processes. To connect Pathway Ontology identifiers to Reactome pathways, we use the ComPath Pathway Database Mappings developed by Daniel Domingo-Fernández (PMID:30564458).

Files:


Protein Ontology

Homepage: https://proconsortium.org/
Citation:

Natale DA, Arighi CN, Barker WC, Blake JA, Bult CJ, Caudy M, Drabkin HJ, D’Eustachio P, Evsikov AV, Huang H, Nchoutmboube J. The Protein Ontology: a structured representation of protein forms and complexes. Nucleic Acids Research. 2010;39(suppl_1):D539-45

Usage: Utilized to connect proteins to chemicals, genes, anatomy, catalysts, cell lines, cofactors, complexes, GO biological processes, GO cellular components, GO molecular functions, pathways, proteins, reactions, and transcripts. Additionally, the edges between this ontology and its dependencies are utilized:

Notes: A partial, human-only version of this ontology was used. Details on how this version of the ontology was generated can be found under the Protein Ontology section of the Data_Preparation.ipynb Jupyter Notebook.

Files:

  • Generated Human Versions of Protein Ontology (PRO):

  • Other PRO Data Used: promapping.txt

  • Generated Mapping Data:

    • Merged RNA, Gene, and PRO Identifiers: Merged_Human_Ensembl_Entrez_HGNC_Uniprot_Identifiers.txt
    • Merged Gene, RNA, Protein Map: Merged_gene_rna_protein_identifiers.pkl
    • Ensembl Transcript-PRO Identifier Mapping: ENSEMBL_TRANSCRIPT_PROTEIN_ONTOLOGY_MAP.txt
    • Entrez Gene-PRO Identifier Mapping: ENTREZ_GENE_PRO_ONTOLOGY_MAP.txt
    • UniProt Accession-PRO Identifier Mapping: UNIPROT_ACCESSION_PRO_ONTOLOGY_MAP.txt
    • STRING-PRO Identifier Mapping: STRING_PRO_ONTOLOGY_MAP.txt


Relations Ontology

Homepage: GitHub
Citation:

Smith B, Ceusters W, Klagges B, Köhler J, Kumar A, Lomax J, Mungall C, Neuhaus F, Rector AL, Rosse C. Relations in biomedical ontologies. Genome Biology. 2005;6(5):R46.

Usage: Utilized to connect all data sources in the knowledge graph. Additionally, the ontology is queried prior to building the knowledge graph to identify all relations, their inverse properties, and their labels.

Generated RO Data:


Sequence Ontology

Homepage: GitHub
Citation:

Eilbeck K, Lewis SE, Mungall CJ, Yandell M, Stein L, Durbin R, Ashburner M. The Sequence Ontology: a tool for the unification of genome annotations. Genome Biology. 2005;6(5):R44

Usage: Utilized to connect transcripts and other genomic material like genes and variants.

Files:


Uber-Anatomy Ontology

Homepage: GitHub
Citation:

Mungall CJ, Torniai C, Gkoutos GV, Lewis SE, Haendel MA. Uberon, an integrative multi-species anatomy ontology. Genome Biology. 2012;13(1):R5

Usage: Utilized to connect tissues, fluids, and cells to proteins and transcripts. Additionally, the edges between this ontology and its dependencies are utilized:


Vaccine Ontology

Homepage: http://www.violinet.org/vaccineontology/
Citations:

He Y, Racz R, Sayers S, Lin Y, Todd T, Hur J, Li X, Patel M, Zhao B, Chung M, Ostrow J. Updates on the web-based VIOLIN vaccine database and analysis system. Nucleic Acids Research. 2013;42(D1):D1124-32

Xiang Z, Todd T, Ku KP, Kovacic BL, Larson CB, Chen F, Hodges AP, Tian Y, Olenzek EA, Zhao B, Colby LA. VIOLIN: vaccine investigation and online information network. Nucleic Acids Research. 2007;36(suppl_1):D923-8

Usage: Utilized the edges between this ontology and its dependencies:




DATA SOURCES




BioPortal

Homepage: BioPortal
Citation:

BioPortal. Lexical OWL Ontology Matcher (LOOM)

Ghazvinian A, Noy NF, Musen MA. Creating mappings for ontologies in biomedicine: simple methods work. In AMIA Annual Symposium Proceedings 2009 (Vol. 2009, p. 198). American Medical Informatics Association

Usage: Utilized to obtain mappings between MeSH identifiers and ChEBI identifiers for chemical-disease, chemical-gene, chemical-GO biological process, chemical-GO cellular component, chemical-GO molecular function, chemical-phenotype, chemical-protein, and chemical-transcript edges. Additional information on how this data was processed can be obtained from the NCBO_rest_api.py GitHub Gist script.

ALTERNATIVE METHOD⭐ Since the above approach can take over two days to process, we have developed an alternative solution which downloads the mesh2021.nt data file directly from MeSH and the Flat_file_tab_delimited/names.tsv.gz file directly from ChEBI. Using these files, we have recapitulated the LOOM algorithm implemented by BioPortal when creating mappings between these resources. The procedure is relatively straightforward and utilizes the following information from each resource:

  • For all MeSH SCR Chemicals, obtain the following information:
    • Identifiers: MeSH identifiers
    • Labels: string labels using the rdfs:label property
    • Synonyms: track down all synonyms using the vocab:concept and vocab:preferredConcept object properties
  • For all ChEBI classes, obtain the following information:
    • Labels: string labels using the rdfs:label property
    • Synonyms: track down all synonyms using all synonym properties
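
Once labels and synonyms are collected for both resources, the matching step can be sketched roughly as follows. This is a simplified illustration, not the project's implementation and not BioPortal's LOOM code: it assumes exact matches on lowercased strings with all non-alphanumeric characters removed, and the function names, the `min_length` parameter, and the identifiers in the usage note are hypothetical.

```python
import re
from collections import defaultdict


def normalize(text):
    """Lowercase a string and drop every non-alphanumeric character."""
    return re.sub(r"[^a-z0-9]", "", text.lower())


def lexical_map(mesh_terms, chebi_terms, min_length=3):
    """Map MeSH identifiers to ChEBI identifiers sharing a normalized name.

    Both arguments are dicts of identifier -> iterable of strings
    (the label plus any synonyms). `min_length` is an assumed guard
    against matching on very short strings.
    """
    # Index every sufficiently long normalized ChEBI name.
    chebi_index = defaultdict(set)
    for chebi_id, names in chebi_terms.items():
        for name in names:
            key = normalize(name)
            if len(key) >= min_length:
                chebi_index[key].add(chebi_id)

    # Look up each normalized MeSH name in the ChEBI index.
    mapping = defaultdict(set)
    for mesh_id, names in mesh_terms.items():
        for name in names:
            hits = chebi_index.get(normalize(name))
            if hits:
                mapping[mesh_id].update(hits)
    return dict(mapping)
```

For example, a MeSH entry with the synonym "Acetyl-salicylic acid" and a ChEBI entry labeled "acetylsalicylic acid" both normalize to the same key and would be mapped to each other.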

Generated Data: MESH_CHEBI_MAP.txt


ClinVar

Homepage: https://www.ncbi.nlm.nih.gov/clinvar/
Citation:

Landrum MJ, Lee JM, Benson M, Brown GR, Chao C, Chitipiralla S, Gu B, Hart J, Hoffman D, Jang W, Karapetyan K. ClinVar: improving access to variant interpretations and supporting evidence. Nucleic Acids Research. 2017;46(D1):D1062-7

Usage: Utilized to create variant-gene, variant-disease, and variant-phenotype edges. The original data files (listed under the Files heading below) are combined and filtered to create the most robust file of variants. Detailed explanations of the steps performed can be found in the Clinvar Variant-Diseases and Phenotypes section of the Data_Preparation.ipynb notebook. The original data is filtered such that only records meeting the following criteria were included:

  • Variant-Gene Relations
  • Variant-Disease Relations
  • Variant-Phenotype Relations

Files:

Metadata:
Variant-Gene Relations


Variant-Disease Relations


Variant-Phenotype Relations



Comparative Toxicogenomics Database

Homepage: http://ctdbase.org/
Citations:

Curated [chemical–gene interactions|chemical-go interactions|chemical–disease interactions|gene–pathway interactions] data were retrieved from the Comparative Toxicogenomics Database (CTD), MDI Biological Laboratory, Salisbury Cove, Maine, and NC State University, Raleigh, North Carolina. World Wide Web (URL: http://ctdbase.org/)

Davis AP, Grondin CJ, Johnson RJ, Sciaky D, McMorran R, Wiegers J, Wiegers TC, Mattingly CJ. The comparative toxicogenomics database: update 2019. Nucleic Acids Research. 2018;47(D1):D948-54

Usage: Utilized to create chemical-disease, chemical-gene, chemical-GO biological process, chemical-GO cellular components, chemical-GO molecular functions, chemical-phenotype, chemical-protein, chemical-rna, and gene-pathway edges. The original data is filtered such that only records meeting the following criteria were included:

  • Chemical-Disease/Phenotype Relations
    • chemical-disease: PubMedIDs != ""
    • chemical-phenotype: PubMedIDs != ""
  • Chemical-Gene Relations
    • chemical-gene: Organism == "Homo sapiens", GeneForms == "gene", and PubMedIDs != ""
    • chemical-protein: Organism == "Homo sapiens", GeneForms == "protein", and PubMedIDs != ""
    • chemical-rna: Organism == "Homo sapiens", GeneForms == "mRNA", and affects and PubMedIDs != ""
  • Chemical-GO Relations
    • chemical-GO biological process: PhenotypeName == "Biological Process"
    • chemical-GO cellular components: PhenotypeName == "Cellular Component"
    • chemical-GO molecular functions: PhenotypeName == "Molecular Function"
  • Gene-Pathway Relations
    • gene-pathway: column[5] == "Homo sapiens"
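
As a rough sketch of how the Chemical-Gene criteria above could be applied, the function below assigns a record to an edge type. It is illustrative only: `ctd_edge_type` is a hypothetical name, records are assumed to be dicts keyed by the column names used in the criteria, and the extra "affects" condition listed for chemical-rna is not reproduced because the text above does not spell it out.

```python
# Edge type assigned to each GeneForms value named in the criteria above.
EDGE_BY_GENE_FORM = {
    "gene": "chemical-gene",
    "protein": "chemical-protein",
    "mRNA": "chemical-rna",
}


def ctd_edge_type(record):
    """Return the edge type for a CTD chemical-gene record, or None.

    A record is kept only if it is human, has at least one PubMed ID,
    and its GeneForms value is exactly one of the three listed forms.
    """
    if record.get("Organism") != "Homo sapiens":
        return None
    if record.get("PubMedIDs", "") == "":
        return None
    return EDGE_BY_GENE_FORM.get(record.get("GeneForms"))
```

Rows read with `csv.DictReader` fit this shape, provided the file's header line supplies those column names.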

Downloaded Data:

Metadata:
Chemical-Gene/Protein/RNA Relations

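| Type | Source Column | Metadata Variable Name |
| --- | --- | --- |
| Node Metadata (chemical) | ChemicalName | synonym |
| Node Metadata (chemical) | ChemicalID | dbxref |
| Node Metadata (chemical) | CasRN | dbxref |
| Node Metadata (gene, rna, protein) | GeneSymbol | synonym |
| Node Metadata (gene, rna, protein) | GeneSymbol | dbxref |
| Node Metadata (gene, rna, protein) | OrganismID | CTD_OrganismID |
| Edge Metadata (chemical-gene, chemical-rna, chemical-protein) | Interaction | CTD_Interaction |
| Edge Metadata (chemical-gene, chemical-rna, chemical-protein) | InferenceActions | CTD_InferenceActions |
| Edge Metadata (chemical-gene, chemical-rna, chemical-protein) | PubMedIDs | CTD_PubMedIDs |
| Edge Metadata (chemical-gene, chemical-rna, chemical-protein) | OrganismID | CTD_OrganismID |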

Chemical-Disease/Phenotype Relations

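| Type | Source Column | Metadata Variable Name |
| --- | --- | --- |
| Node Metadata (chemical) | ChemicalName | synonym |
| Node Metadata (chemical) | ChemicalID | dbxref |
| Node Metadata (chemical) | CasRN | dbxref |
| Node Metadata (disease, phenotype) | DiseaseName | synonym |
| Node Metadata (disease, phenotype) | DiseaseID | dbxref |
| Node Metadata (disease, phenotype) | OmimIDs | dbxref |
| Edge Metadata (chemical-disease, chemical-phenotype) | DirectEvidence | CTD_DirectEvidence |
| Edge Metadata (chemical-disease, chemical-phenotype) | InferenceScore | CTD_InferenceScore |
| Edge Metadata (chemical-disease, chemical-phenotype) | PubMedIDs | CTD_PubMedIDs |
| Edge Metadata (chemical-disease, chemical-phenotype) | InferenceGeneSymbol | CTD_InferenceGeneSymbol |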

Chemical-GO Relations

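| Type | Source Column | Metadata Variable Name |
| --- | --- | --- |
| Node Metadata (gobp, gocc, gomf) | GOTermName | synonym |
| Node Metadata (gobp, gocc, gomf) | Ontology | CTD_Ontology |
| Edge Metadata (chemical-gobp, chemical-gocc, chemical-gomf) | HighestGOLevel | CTD_HighestGOLevel |
| Edge Metadata (chemical-gobp, chemical-gocc, chemical-gomf) | Pvalue | CTD_Pvalue |
| Edge Metadata (chemical-gobp, chemical-gocc, chemical-gomf) | CorrectedPValue | CTD_CorrectedPValue |
| Edge Metadata (chemical-gobp, chemical-gocc, chemical-gomf) | TargetMatchQty | CTD_TargetMatchQty |
| Edge Metadata (chemical-gobp, chemical-gocc, chemical-gomf) | TargetTotalQty | CTD_TargetTotalQty |
| Edge Metadata (chemical-gobp, chemical-gocc, chemical-gomf) | BackgroundMatchQty | CTD_BackgroundMatchQty |
| Edge Metadata (chemical-gobp, chemical-gocc, chemical-gomf) | BackgroundTotalQty | CTD_BackgroundTotalQty |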

Gene-Pathway Relations

| Type | Source Column | Metadata Variable Name |
| --- | --- | --- |
| Node Metadata (pathway) | PathwayName | synonym |
| Node Metadata (pathway) | PathwayID | dbxref |


DisGeNET

Homepage: https://www.disgenet.org/
Citation:

Gene-disease association data retrieved from DisGeNET v6.0 (http://www.disgenet.org/), Integrative Biomedical Informatics Group GRIB/IMIM/UPF. [December, 2019].

Piñero J, Ramírez-Anguita JM, Saüch-Pitarch J, Ronzano F, Centeno E, Sanz F, Furlong LI. The DisGeNET knowledge platform for disease genomics: 2019 update. Nucleic Acids Research. 2019.

Usage: Utilized to create gene-disease and gene-phenotype edges. The original data is filtered such that only records meeting the following criteria were included: diseaseType != "group". Additionally, data from this source was used to create mappings between different types of disease and phenotype identifiers, including:

  • OMIM, ORPHA, UMLS, ICD ➞ DOID
  • OMIM, ORPHA, UMLS, ICD ➞ HPO

Files:

Metadata:
Gene-Disease/Phenotype Relations

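| Type | Source Column | Metadata Variable Name |
| --- | --- | --- |
| Node Metadata (disease) | diseaseName | synonym |
| Node Metadata (disease) | diseaseId | dbxref |
| Node Metadata (disease) | diseaseSemanticType | DisGeNET_diseaseSematnicType |
| Node Metadata (disease) | diseaseClass | DisGeNET_diseaseClass |
| Node Metadata (phenotype) | diseaseName | synonym |
| Node Metadata (phenotype) | diseaseId | dbxref |
| Node Metadata (phenotype) | diseaseSemanticType | DisGeNET_diseaseSematnicType |
| Node Metadata (phenotype) | diseaseClass | DisGeNET_diseaseClass |
| Node Metadata (gene) | geneSymbol | dbxref |
| Edge Metadata (gene-disease) | DSI | DisGeNET_DSI |
| Edge Metadata (gene-disease) | DPI | DisGeNET_DPIe |
| Edge Metadata (gene-disease) | score | DisGeNET_score |
| Edge Metadata (gene-disease) | EI | DisGeNET_EI |
| Edge Metadata (gene-disease) | YearInitial | DisGeNET_YearInitial |
| Edge Metadata (gene-disease) | YearFinal | DisGeNET_YearFinal |
| Edge Metadata (gene-disease) | NofPmids | DisGeNET_NofPmids |
| Edge Metadata (gene-disease) | NofSnps | DisGeNET_NofSnps |
| Edge Metadata (gene-disease) | source | DisGeNET_source |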


Ensembl

Homepage: https://uswest.ensembl.org/index.html
Citation:

Zerbino DR, Achuthan P, Akanni W, Amode MR, Barrell D, Bhai J, Billis K, Cummins C, Gall A, Girón CG, Gil L. Ensembl 2018. Nucleic Acids Research. 2017;46(D1):D754-61

Usage: This data was used to create mappings from Ensembl genes, transcripts, and proteins to NCBI Gene identifiers, HUGO gene symbols, UniProt Accession identifiers, and Protein Ontology identifiers in the knowledge graph (for additional details on the processing of these data, see Data_Preparation.ipynb):

  • Ensembl Transcript IDs ➞ PRO IDs
  • Gene Ensembl IDs ➞ Entrez Gene IDs
  • Gene Ensembl IDs ➞ PRO IDs
  • Gene Symbols ➞ Transcript Ensembl IDs
  • Entrez Gene IDs ➞ Transcript Ensembl IDs
  • Entrez Gene IDs ➞ PRO IDs
  • Protein Ensembl IDs ➞ UniProt Protein Accession
  • STRING IDs ➞ PRO IDs

Files:


GeneMANIA

Homepage: https://genemania.org/
Citation:

Warde-Farley D, Donaldson SL, Comes O, Zuberi K, Badrawi R, Chao P, Franz M, Grouios C, Kazi F, Lopes CT, Maitland A. The GeneMANIA prediction server: biological network integration for gene prioritization and predicting gene function. Nucleic Acids Research. 2010;38(suppl_2):W214-20

Usage: Utilized to create gene-gene edges.

Downloaded Data: COMBINED.DEFAULT_NETWORKS.BP_COMBINING.txt

Metadata:
Gene-Gene Relations

| Type | Source Column | Metadata Variable Name |
| --- | --- | --- |
| Edge Metadata (gene-gene) | Weight | GeneMania_Weight |


The Genotype-Tissue Expression (GTEx) Project

Homepage: https://gtexportal.org/home/
Citation:

Lonsdale J, Thomas J, Salvatore M, Phillips R, Lo E, Shad S, Hasz R, Walters G, Garcia F, Young N, Foster B. The genotype-tissue expression (GTEx) project. Nature Genetics. 2013;45(6):580

Usage: Utilized to create edges between protein-cell, protein-anatomy, rna-cell and rna-anatomy entities.

  • We chose to use the RNA-SeQC file over the RSEM file, as advised by the GTEx website:

The RSEM estimates are based on combining isoform-level estimates, which adds uncertainty to the resulting gene-level values (the isoform-level estimates are highly inaccurate in some cases). For analyses, we recommend filtering the metadata to only keep ExpressionValues >= 1.0.

  • Zooma was utilized to automatically annotate the 54 unique tissues and cell types. GTEx provides mappings from tissue types to UBERON and EFO. These provided mappings were verified and extended, such that all samples which referenced a cell type were also mapped to the Cell and the Cell Line ontologies. This resulted in a total of 56 mappings (1.04 mappings/concept).
  • The original data is filtered such that only records meeting the following criteria were included:
    • Protein-Anatomy/Cell Relations
      • protein-anatomy: column[3] == "Evidence at protein level" and column[4] == "anatomy"
      • protein-cell: column[3] == "Evidence at protein level" and column[4] == "cell line"
    • RNA-Anatomy/Cell Relations
      • rna-anatomy: column[3] == "Evidence at protein level" and column[4] == "anatomy"
      • rna-cell: column[3] == "Evidence at protein level" and column[4] == "cell line"

Files:

Metadata:
protein-anatomy/cell and rna-anatomy/cell Relations

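| Type | Source Column | Metadata Variable Name |
| --- | --- | --- |
| Node Metadata (protein) | UniProtIDs | dbxref |
| Node Metadata (rna) | Ensembl_IDs | dbxref |
| Node Metadata (anatomy/cell) | Anatomy | GTEx_Anatomy |
| Edge Metadata (protein-anatomy/cell, rna-anatomy/cell) | Expression_Value | GTEx_Expression_Value |
| Edge Metadata (protein-anatomy/cell, rna-anatomy/cell) | Subcellular_Location | GTEx_Subcellular_Location |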


HUGO Gene Nomenclature Committee

Homepage: https://www.genenames.org/
Citations:

HGNC Database, HUGO Gene Nomenclature Committee (HGNC), European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, United Kingdom www.genenames.org

Yates B, Braschi B, Gray K, Seal R, Tweedie S, Bruford E. Genenames.org: the HGNC and VGNC Resources in 2017. Nucleic Acids Research. 2017;45(D1):D619-625

Usage: Parsed HUGO data to obtain mappings between NCBI Gene identifiers, HUGO gene symbols, UniProt Accession identifiers, and Protein Ontology identifiers. For additional details on the processing of these data, see Data_Preparation.ipynb:

  • Ensembl Transcript IDs ➞ PRO IDs
  • Gene Ensembl IDs ➞ Entrez Gene IDs
  • Gene Ensembl IDs ➞ PRO IDs
  • Gene Symbols ➞ Transcript Ensembl IDs
  • Entrez Gene IDs ➞ Transcript Ensembl IDs
  • Entrez Gene IDs ➞ PRO IDs
  • Protein Ensembl IDs ➞ UniProt Protein Accession
  • STRING IDs ➞ PRO IDs

Files:


Human Protein Atlas

Homepage: https://www.proteinatlas.org/
Citation:

Uhlén M, Fagerberg L, Hallström BM, Lindskog C, Oksvold P, Mardinoglu A, Sivertsson Å, Kampf C, Sjöstedt E, Asplund A, Olsson I. Tissue-based map of the human proteome. Science. 2015;347(6220):1260419

Usage: Utilized to create rna-cell, rna-anatomy, protein-cell, and protein-anatomy edges.

  • Evidence between gene and RNA expression in specific tissue types was derived by HPA. For analyses, we recommend filtering the metadata to only keep ExpressionValues >= 1.0.
  • Zooma was utilized to automatically annotate the 153 unique tissues and cell types from Human Protein Atlas for all human protein-coding genes in the Human Proteome to the Cell Ontology, Cell Line Ontology, and the Uber-Anatomy Ontology. To best represent each concept, the automatic mappings from Zooma were extended through manual mapping efforts to ensure each cell type concept was matched to a Cell Ontology, Cell Line Ontology, and UBERON ontology term. This resulted in a total of 281 mappings (1.84 mappings/concept).
  • The original data is filtered such that only records meeting the following criteria were included:
    • Protein-Anatomy/Cell Relations
      • protein-anatomy: column[3] == "Evidence at protein level" and column[4] == "anatomy"
      • protein-cell: column[3] == "Evidence at protein level" and column[4] == "cell line"
    • RNA-Anatomy/Cell Relations
      • rna-anatomy: column[3] == "Evidence at protein level" and column[4] == "anatomy"
      • rna-cell: column[3] == "Evidence at protein level" and column[4] == "cell line"
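
The recommended expression threshold from the first bullet above can be applied at analysis time with a small helper. The sketch below assumes each edge's metadata is available as a dict keyed by the metadata variable name (HPA_Expression_Value, from the table in this section). The function name `keep_expressed` and the dict layout are hypothetical, not part of the released build.

```python
def keep_expressed(edges, key="HPA_Expression_Value", threshold=1.0):
    """Return the edges whose expression value is at least `threshold`.

    Edges with a missing or non-numeric value are dropped rather than
    guessed at.
    """
    kept = []
    for edge in edges:
        try:
            value = float(edge[key])
        except (KeyError, TypeError, ValueError):
            continue  # no usable expression value on this edge
        if value >= threshold:
            kept.append(edge)
    return kept
```

Passing `key="GTEx_Expression_Value"` would apply the same filter to edges built from the GTEx data.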

Files:

Metadata:
protein-anatomy/cell and rna-anatomy/cell Relations

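| Type | Source Column | Metadata Variable Name |
| --- | --- | --- |
| Node Metadata (protein) | UniProtIDs | dbxref |
| Node Metadata (rna) | Ensembl_IDs | dbxref |
| Node Metadata (anatomy/cell) | Anatomy | HPA_Anatomy |
| Edge Metadata (protein-anatomy/cell, rna-anatomy/cell) | Expression_Value | HPA_Expression_Value |
| Edge Metadata (protein-anatomy/cell, rna-anatomy/cell) | Subcellular_Location | HPA_Subcellular_Location |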


NCBI Gene

Homepage: https://www.ncbi.nlm.nih.gov/gene/
Citation:

Maglott D, Ostell J, Pruitt KD, Tatusova T. Entrez Gene: gene-centered information at NCBI. Nucleic Acids Research. 2005;33(suppl_1):D54-8.

Usage: Parsed NCBI Gene data to obtain mappings between NCBI Gene identifiers, HUGO gene symbols, UniProt Accession identifiers, and Protein Ontology identifiers. For additional details on the processing of these data, see Data_Preparation.ipynb:

  • Ensembl Transcript IDs ➞ PRO IDs
  • Gene Ensembl IDs ➞ Entrez Gene IDs
  • Gene Ensembl IDs ➞ PRO IDs
  • Gene Symbols ➞ Transcript Ensembl IDs
  • Entrez Gene IDs ➞ Transcript Ensembl IDs
  • Entrez Gene IDs ➞ PRO IDs
  • Protein Ensembl IDs ➞ UniProt Protein Accession
  • STRING IDs ➞ PRO IDs

Files:


Reactome Pathway Database

Homepage: https://reactome.org/
Citation:

Fabregat A, Jupe S, Matthews L, Sidiropoulos K, Gillespie M, Garapati P, Haw R, Jassal B, Korninger F, May B, Milacic M. The reactome pathway knowledgebase. Nucleic Acids Research. 2017;46(D1):D649-55

Usage: Utilized to create chemical-pathway, GO biological process-pathway, pathway-GO cellular component, GO molecular function-pathway, and protein-pathway edges. The original data is filtered such that only records meeting the following criteria were included:

  • chemical-pathway: column[5] == "Homo sapiens"
  • GO Biological Process-pathway: column[5] startswith "REACTOME", column[8] == "P", column[12] == "taxon:9606", and column[3] not in ["NOT"]
  • pathway-GO Cellular Component: column[5] startswith "REACTOME", column[8] == "C", column[12] == "taxon:9606", and column[3] not in ["NOT"]
  • GO Molecular Function-Pathway: column[5] startswith "REACTOME", column[8] == "F", column[12] == "taxon:9606", and column[3] not in ["NOT"]
  • protein-pathway: column[5] == "Homo sapiens"

Files:

Metadata:
Chemical-Pathway Relations

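| Type | Source Column | Metadata Variable Name |
| --- | --- | --- |
| Node Metadata (pathway) | ReactomeID | dbxref |
| Node Metadata (pathway) | Species | Reactome_Species |
| Edge Metadata (chemical-pathway) | EvidenceID | Reactome_EvidenceID |
| Edge Metadata (chemical-pathway) | Species | Reactome_Species |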

Pathway-GO Relations

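| Type | Source Column | Metadata Variable Name |
| --- | --- | --- |
| Node Metadata (pathway) | DBReference | dbxref |
| Node Metadata (pathway) | TaxonID | Reactome:TaxonID |
| Node Metadata (gobp, gocc, gomf) | GOID | dbxref |
| Node Metadata (gobp, gocc, gomf) | Aspect | Reactome:Aspect |
| Node Metadata (gobp, gocc, gomf) | TaxonID | Reactome:TaxonID |
| Edge Metadata (gobp-pathway, pathway-gocc, pathway-gomf) | Qualifier | Reactome_Qualifier |
| Edge Metadata (gobp-pathway, pathway-gocc, pathway-gomf) | EvidenceCode | Reactome_EvidenceCode |
| Edge Metadata (gobp-pathway, pathway-gocc, pathway-gomf) | TaxonID | Reactome_TaxonID |
| Edge Metadata (gobp-pathway, pathway-gocc, pathway-gomf) | AssignedBy | Reactome_AssignedBy |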

Protein-Pathway Relations

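| Type | Source Column | Metadata Variable Name |
| --- | --- | --- |
| Node Metadata (pathway) | ReactomeID | dbxref |
| Node Metadata (pathway) | Species | Reactome_Species |
| Edge Metadata (protein-pathway) | EvidenceID | Reactome_EvidenceID |
| Edge Metadata (protein-pathway) | Species | Reactome_Species |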


STRING Database

Homepage: string-db.org
Citation:

Szklarczyk D, Gable AL, Lyon D, Junge A, Wyder S, Huerta-Cepas J, Simonovic M, Doncheva NT, Morris JH, Bork P, Jensen LJ. STRING v11: protein–protein association networks with increased coverage, supporting functional discovery in genome-wide experimental datasets. Nucleic Acids Research. 2018;47(D1):D607-13

Usage: This data was used to create protein-protein edges. For analyses, we recommend filtering these edges such that only records with a combined_score >= 700 (>90th percentile) are included.
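
The recommended score filter could be sketched as below. This is an illustration under stated assumptions, not the project's code: the input is assumed to be whitespace-delimited with a header line naming the protein1, protein2, and combined_score columns (the source columns listed in the metadata for this section), and the function name `high_confidence_links` is hypothetical.

```python
def high_confidence_links(lines, min_score=700):
    """Yield (protein1, protein2, score) for links scoring >= min_score.

    `lines` is an iterable of whitespace-delimited text lines whose
    first line is a header containing the three column names.
    """
    rows = iter(lines)
    header_line = next(rows, None)
    if header_line is None:
        return  # empty input, nothing to yield
    header = header_line.split()
    p1 = header.index("protein1")
    p2 = header.index("protein2")
    score_col = header.index("combined_score")
    for line in rows:
        fields = line.split()
        if len(fields) != len(header):
            continue  # skip blank or malformed lines
        score = int(fields[score_col])
        if score >= min_score:
            yield fields[p1], fields[p2], score
```

Because the function is a generator, a large links file can be streamed line by line from an open file handle without loading it fully into memory.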

Files:

Metadata:
Protein-Protein Relations

| Type | Source Column | Metadata Variable Name |
| --- | --- | --- |
| Node Metadata (protein) | protein1 | dbxref |
| Node Metadata (protein) | protein2 | dbxref |
| Edge Metadata (protein-protein) | combined_score | STRING_combined_score |


UniProt Knowledgebase

Homepage: https://www.uniprot.org/
Citation:

UniProt Consortium. UniProt: a worldwide hub of protein knowledge. Nucleic Acids Research. 2018;47(D1):D506-15

Usage: Utilized to obtain cofactor/catalyst-protein and protein-coding gene-protein edges as well as mappings between NCBI Gene identifiers, HUGO gene symbols, UniProt Accession identifiers, and Protein Ontology identifiers. For additional details on the processing of these data, see Data_Preparation.ipynb:

  • Ensembl Transcript IDs ➞ PRO IDs
  • Gene Ensembl IDs ➞ Entrez Gene IDs
  • Gene Ensembl IDs ➞ PRO IDs
  • Gene Symbols ➞ Transcript Ensembl IDs
  • Entrez Gene IDs ➞ Transcript Ensembl IDs
  • Entrez Gene IDs ➞ PRO IDs
  • Protein Ensembl IDs ➞ UniProt Protein Accession
  • STRING IDs ➞ PRO IDs

Files:

Metadata:
Protein-Catalyst/Cofactor Relations

| Type | Source Column | Metadata Variable Name |
| --- | --- | --- |
| Node Metadata (protein) | UniProt_ID | dbxref |
| Node Metadata (protein) | UniProt_Entry_Name | dbxref |
| Edge Metadata (protein-catalyst, protein-cofactor) | Status | Uniprot_Status |




This project is licensed under Apache License 2.0 - see the LICENSE.md file for details. If you intend to use any of the information on this Wiki, please provide the appropriate attribution by citing this repository:

@misc{callahan_tj_2019_3401437,
  author       = {Callahan, TJ},
  title        = {PheKnowLator},
  month        = mar,
  year         = 2019,
  doi          = {10.5281/zenodo.3401437},
  url          = {https://doi.org/10.5281/zenodo.3401437}
}