Skip to content

Latest commit

 

History

History
135 lines (110 loc) · 13.7 KB

README.md

File metadata and controls

135 lines (110 loc) · 13.7 KB

GMT files for mulea

GitHub issues GitHub pulls

This repository provides ready-to-use gene sets formatted in the standardized Gene Matrix Transposed (GMT) format, compatible with the mulea R package, a comprehensive tool for overrepresentation and functional enrichment analysis.

The GMT format is a tab-delimited text file used to represent collections of genes or proteins associated with specific ontology entries. Each row in a GMT file corresponds to a single ontology element and comprises three main columns:

  1. Ontology identifier: This column uniquely identifies the element within the referenced ontology.

  2. Ontology name or description: This column provides a user-friendly label or textual description for the ontology element.

  3. List of associated genes/proteins: This column lists the gene or protein identifiers belonging to the corresponding ontology element, separated by spaces.

Within the mulea package, these entities are referred to as ontology_id, ontology_name, and list_of_values, respectively. Additionally, rows starting with a “#” symbol in the GMT file are considered comment lines and may contain supplementary information about the referenced ontology, such as its type, source, species, version, and identifier.

This repository offers pre-processed gene sets for 27 model organisms (from Escherichia coli to human) with various identifiers including UniProt, Entrez, Gene Symbol, and Ensembl IDs.

The GMT files can be found in the GMT_files folder, and the scripts we applied to create them are available in the scripts_to_create_GMT_files folder. Also, there is a script for mapping between different ID types in the scripts_to_create_GMT_files/ID_mapping_scripts folder.

The GMT files can be downloaded and read with the mulea::read_gmt() function. i.e.

mulea::read_gmt(file = "Transcription_factor_TFLink_Drosophila_melanogaster_LS_GeneSymbol.gmt")

Or can be loaded directly from this github repository. i.e.

mulea::read_gmt(file = "https://raw.githubusercontent.com/ELTEbioinformatics/GMT_files_for_mulea/main/GMT_files/Drosophila_melanogaster_7227/Transcription_factor_TFLink_Drosophila_melanogaster_LS_GeneSymbol.gmt")

Besides, we also created the muleaData ExperimentHubData Bioconductor package to ease browsing and reading the ontologies.

List of species we cover:

  • Arabidopsis thaliana
  • Bacillus subtilis
  • Bacteroides thetaiotaomicron VPI-5482
  • Bifidobacterium longum
  • Bos taurus
  • Caenorhabditis elegans
  • Chlamydomonas reinhardtii
  • Danio rerio
  • Daphnia pulex
  • Dictyostelium discoideum
  • Drosophila melanogaster
  • Drosophila simulans
  • Escherichia coli
  • Gallus gallus
  • Homo sapiens
  • Macaca mulatta
  • Mus musculus
  • Mycobacterium tuberculosis
  • Neurospora crassa
  • Pan troglodytes
  • Rattus norvegicus
  • Saccharomyces cerevisiae
  • Salmonella enterica subsp. enterica serovar Typhimurium str. LT2
  • Schizosaccharomyces pombe
  • Tetrahymena thermophila
  • Xenopus tropicalis
  • Zea mays

Type, name, link and citation of the databases we cover:

Ontology category Ontology name Short description of content Reference
Gene expression FlyAtlas Tissue-specific expression data for Drosophila melanogaster. Chintapalli,V.R. et al. (2007) Using FlyAtlas to identify better Drosophila melanogaster models of human disease. Nat Genet, 39, 715–720.
ModEncode Functional characterization (cell line, temporal expression, tissue expression, treatment) of elements for Caenorhabditis elegans and Drosophila melanogaster. The Modencode Consortium et al. (2010) Identification of functional elements and regulatory circuits by Drosophila modENCODE. Science, 330, 1787–1797.
Genomic location Chromosomal Bands Location of genes on the chromosome. Martin,F.J. et al. (2023) Ensembl 2023. Nucleic Acids Res, 51, D933–D941.
Consecutive genes n consecutive genes on the chromosome.
miRNA regulation miRTarBase Experimentally validated miRNA - target interactions. Huang,H.-Y. et al. (2022) miRTarBase update 2022: an informative resource for experimentally validated miRNA–target interactions. Nucleic Acids Res, 50, D222–D230.
Gene Ontology GO Gene Ontology (GO) categorizes genes into unified categories and attributes. The Gene Ontology Consortium et al. (2023) The Gene Ontology knowledgebase in 2023. Genetics, 224, iyad031.
Pathway Pathway Commons Collection of biological pathway and interaction data. Rodchenkov,I. et al. (2020) Pathway Commons 2019 Update: integration, analysis and exploration of pathway data. Nucleic Acids Res, 48, D489–D497.
Reactome Collection of biological pathway and interaction data. Jassal,B. et al. (2020) The reactome pathway knowledgebase. Nucleic Acids Res, 48, D498–D503.
Signalink Interaction database focussing on pathways and interactions of pathways. Csabai,L. et al. (2022) SignaLink3: a multi-layered resource to uncover tissue-specific signaling networks. Nucleic Acids Res, 50, D701–D709.
Wikipathways Collection of biological pathway and interaction data. Martens,M. et al. (2021) WikiPathways: connecting communities. Nucleic Acids Res, 49, D613–D621.
Protein domain PFAM Protein domain structure database. Mistry,J. et al. (2021) Pfam: The protein families database in 2021. Nucleic Acids Res, 49, D412–D419.
Transcription factor regulation ATRM Transcription factor - target gene interactions for Arabidopsis thaliana. Jin,J. et al. (2015) An Arabidopsis transcriptional regulatory map reveals distinct functional and evolutionary features of novel transcription factors. Mol Biol Evol, 32, 1767–1773.
dorothEA Transcription factor - target gene interactions for human and mouse. Garcia-Alonso,L. et al. (2019) Benchmark and integration of resources for the estimation of human transcription factor activities. Genome Res, 29, 1363–1375.
RegulonDB Transcription factor - target gene interactions for Escherichia coli bacteria. Tierrafría,V.H. et al. (2022) RegulonDB 11.0: Comprehensive high-throughput datasets on transcriptional regulation in Escherichia coli K-12. Microb Genom, 8, 000833.
TFLink Small- and large-scale transcription factor - target gene interactions for human and 6 model organisms. Liska,O. et al. (2022) TFLink: an integrated gateway to access transcription factor–target gene interactions for multiple species. Database, 2022, baac083.
TRRUST Transcription factor - target gene interactions for human. Han,H. et al. (2018) TRRUST v2: an expanded reference database of human and mouse transcriptional regulatory interactions. Nucleic Acids Res, 46, D380–D386.
Yeastract Transcription factor - target gene interactions for Saccharomyces cerevisiae. Teixeira,M.C. et al. (2018) YEASTRACT: an upgraded database for the analysis of transcription regulatory networks in Saccharomyces cerevisiae. Nucleic Acids Res, 46, D348–D353.

Citation

To cite the GMT files in publications use:

Ari E, Ölbei M, Gul L, Bohár B, Stirling T (2024). muleaData: ExperimentalData Bioconductor Package for the mulea R Package, Contains Genes Sets for Functional Enrichment Analysis in GMT File Format. R package version 0.99.0, https://github.com/ELTEbioinformatics/muleaData.