This repository provides ready-to-use gene sets in the standardized Gene Matrix Transposed (GMT) format, compatible with the mulea R package, a comprehensive tool for overrepresentation and functional enrichment analysis.
The GMT format is a tab-delimited text file used to represent collections of genes or proteins associated with specific ontology entries. Each row in a GMT file corresponds to a single ontology element and comprises three main columns:
- Ontology identifier: This column uniquely identifies the element within the referenced ontology.
- Ontology name or description: This column provides a user-friendly label or textual description for the ontology element.
- List of associated genes/proteins: This column lists the gene or protein identifiers belonging to the corresponding ontology element, separated by spaces.
Within the mulea package, these entities are referred to as ontology_id, ontology_name, and list_of_values, respectively.
Additionally, rows starting with a “#” symbol in the GMT file are
considered comment lines and may contain supplementary information about
the referenced ontology, such as its type, source, species, version, and
identifier.
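The row layout described above can be illustrated with a minimal base R sketch. It is not the parser used by mulea: parse_gmt_lines is a hypothetical helper, and the identifiers, names and genes in example_lines are made up for illustration. The sketch assumes that the first two fields are tab-delimited and accepts either tabs or spaces between the gene identifiers.

```r
# Hypothetical helper (not part of mulea): parse GMT-formatted lines into a
# data frame that uses the column names ontology_id, ontology_name and
# list_of_values.
parse_gmt_lines <- function(lines) {
  # Drop comment lines (starting with "#") and blank lines.
  lines <- lines[!grepl("^#", lines) & nzchar(trimws(lines))]
  # Split each row on tabs: field 1 = identifier, field 2 = name,
  # remaining fields = associated genes/proteins.
  fields <- strsplit(lines, "\t", fixed = TRUE)
  ids <- vapply(fields, function(x) x[1], character(1))
  term_names <- vapply(fields, function(x) x[2], character(1))
  members <- lapply(fields, function(x) {
    # Assumption: members may be separated by tabs or by spaces,
    # so the remaining fields are split again on any whitespace.
    m <- as.character(unlist(strsplit(x[-(1:2)], "[[:space:]]+")))
    m[nzchar(m)]
  })
  gmt <- data.frame(
    ontology_id = ids,
    ontology_name = term_names,
    stringsAsFactors = FALSE
  )
  # Store the members as a list column: one character vector per row.
  gmt$list_of_values <- members
  gmt
}

# Made-up example content: one comment line and two ontology rows.
example_lines <- c(
  "# species: Drosophila melanogaster",
  "ID:0001\tExample term A\tgeneA\tgeneB\tgeneC",
  "ID:0002\tExample term B\tgeneB geneD"
)

gmt <- parse_gmt_lines(example_lines)
```

Keeping list_of_values as a list column means each ontology element carries its own character vector of members, so rows with different numbers of genes fit in one data frame.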
This repository offers pre-processed gene sets for 27 model organisms (from Escherichia coli to human) with various identifiers including UniProt, Entrez, Gene Symbol, and Ensembl IDs.
The GMT files can be found in the GMT_files folder, and the scripts we applied to create them are available in the scripts_to_create_GMT_files folder. Also, there is a script for mapping between different ID types in the scripts_to_create_GMT_files/ID_mapping_scripts folder.
The GMT files can be downloaded and read with the mulea::read_gmt() function, for example:
mulea::read_gmt(file = "Transcription_factor_TFLink_Drosophila_melanogaster_LS_GeneSymbol.gmt")
They can also be loaded directly from this GitHub repository, for example:
mulea::read_gmt(file = "https://raw.githubusercontent.com/ELTEbioinformatics/GMT_files_for_mulea/main/GMT_files/Drosophila_melanogaster_7227/Transcription_factor_TFLink_Drosophila_melanogaster_LS_GeneSymbol.gmt")
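Because the raw URL above follows the repository layout (the GMT_files folder, then a species folder, then the file name), it can be assembled programmatically. The sketch below is only an illustration: gmt_raw_url and read_gmt_from_repo are hypothetical helpers, not functions of mulea, and the folder and file names must match those that actually exist in the GMT_files folder.

```r
# Hypothetical helper: assemble the raw GitHub URL of a GMT file in this
# repository from the species folder and the file name.
gmt_raw_url <- function(species_dir, file_name) {
  base_url <- paste0(
    "https://raw.githubusercontent.com/ELTEbioinformatics/",
    "GMT_files_for_mulea/main/GMT_files"
  )
  paste(base_url, species_dir, file_name, sep = "/")
}

# Hypothetical wrapper: read a GMT file from the repository with
# mulea::read_gmt(). Requires the mulea package and internet access.
read_gmt_from_repo <- function(species_dir, file_name) {
  if (!requireNamespace("mulea", quietly = TRUE)) {
    stop("The mulea package is required to read GMT files.")
  }
  mulea::read_gmt(file = gmt_raw_url(species_dir, file_name))
}

# Folder and file name taken from the example above.
gmt_url <- gmt_raw_url(
  species_dir = "Drosophila_melanogaster_7227",
  file_name = "Transcription_factor_TFLink_Drosophila_melanogaster_LS_GeneSymbol.gmt"
)

# Usage (not run here, since it needs mulea and a network connection):
# tf_ontology <- read_gmt_from_repo(
#   "Drosophila_melanogaster_7227",
#   "Transcription_factor_TFLink_Drosophila_melanogaster_LS_GeneSymbol.gmt"
# )
```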
We also created the muleaData ExperimentHubData Bioconductor package to ease browsing and reading the ontologies.
List of species we cover:
- Arabidopsis thaliana
- Bacillus subtilis
- Bacteroides thetaiotaomicron VPI-5482
- Bifidobacterium longum
- Bos taurus
- Caenorhabditis elegans
- Chlamydomonas reinhardtii
- Danio rerio
- Daphnia pulex
- Dictyostelium discoideum
- Drosophila melanogaster
- Drosophila simulans
- Escherichia coli
- Gallus gallus
- Homo sapiens
- Macaca mulatta
- Mus musculus
- Mycobacterium tuberculosis
- Neurospora crassa
- Pan troglodytes
- Rattus norvegicus
- Saccharomyces cerevisiae
- Salmonella enterica subsp. enterica serovar Typhimurium str. LT2
- Schizosaccharomyces pombe
- Tetrahymena thermophila
- Xenopus tropicalis
- Zea mays
Type, name, link and citation of the databases we cover:
| Ontology category | Ontology name | Short description of content | Reference |
|---|---|---|---|
| Gene expression | FlyAtlas | Tissue-specific expression data for Drosophila melanogaster. | Chintapalli,V.R. et al. (2007) Using FlyAtlas to identify better Drosophila melanogaster models of human disease. Nat Genet, 39, 715–720. |
| Gene expression | ModEncode | Functional characterization (cell line, temporal expression, tissue expression, treatment) of elements for Caenorhabditis elegans and Drosophila melanogaster. | The Modencode Consortium et al. (2010) Identification of functional elements and regulatory circuits by Drosophila modENCODE. Science, 330, 1787–1797. |
| Genomic location | Chromosomal Bands | Location of genes on the chromosome. | Martin,F.J. et al. (2023) Ensembl 2023. Nucleic Acids Res, 51, D933–D941. |
| Genomic location | Consecutive genes | n consecutive genes on the chromosome. | |
| miRNA regulation | miRTarBase | Experimentally validated miRNA - target interactions. | Huang,H.-Y. et al. (2022) miRTarBase update 2022: an informative resource for experimentally validated miRNA–target interactions. Nucleic Acids Res, 50, D222–D230. |
| Gene Ontology | GO | Gene Ontology (GO) categorizes genes into unified categories and attributes. | The Gene Ontology Consortium et al. (2023) The Gene Ontology knowledgebase in 2023. Genetics, 224, iyad031. |
| Pathway | Pathway Commons | Collection of biological pathway and interaction data. | Rodchenkov,I. et al. (2020) Pathway Commons 2019 Update: integration, analysis and exploration of pathway data. Nucleic Acids Res, 48, D489–D497. |
| Pathway | Reactome | Collection of biological pathway and interaction data. | Jassal,B. et al. (2020) The reactome pathway knowledgebase. Nucleic Acids Res, 48, D498–D503. |
| Pathway | Signalink | Interaction database focussing on pathways and interactions of pathways. | Csabai,L. et al. (2022) SignaLink3: a multi-layered resource to uncover tissue-specific signaling networks. Nucleic Acids Res, 50, D701–D709. |
| Pathway | Wikipathways | Collection of biological pathway and interaction data. | Martens,M. et al. (2021) WikiPathways: connecting communities. Nucleic Acids Res, 49, D613–D621. |
| Protein domain | PFAM | Protein domain structure database. | Mistry,J. et al. (2021) Pfam: The protein families database in 2021. Nucleic Acids Res, 49, D412–D419. |
| Transcription factor regulation | ATRM | Transcription factor - target gene interactions for Arabidopsis thaliana. | Jin,J. et al. (2015) An Arabidopsis transcriptional regulatory map reveals distinct functional and evolutionary features of novel transcription factors. Mol Biol Evol, 32, 1767–1773. |
| Transcription factor regulation | dorothEA | Transcription factor - target gene interactions for human and mouse. | Garcia-Alonso,L. et al. (2019) Benchmark and integration of resources for the estimation of human transcription factor activities. Genome Res, 29, 1363–1375. |
| Transcription factor regulation | RegulonDB | Transcription factor - target gene interactions for Escherichia coli bacteria. | Tierrafría,V.H. et al. (2022) RegulonDB 11.0: Comprehensive high-throughput datasets on transcriptional regulation in Escherichia coli K-12. Microb Genom, 8, 000833. |
| Transcription factor regulation | TFLink | Small- and large-scale transcription factor - target gene interactions for human and 6 model organisms. | Liska,O. et al. (2022) TFLink: an integrated gateway to access transcription factor–target gene interactions for multiple species. Database, 2022, baac083. |
| Transcription factor regulation | TRRUST | Transcription factor - target gene interactions for human. | Han,H. et al. (2018) TRRUST v2: an expanded reference database of human and mouse transcriptional regulatory interactions. Nucleic Acids Res, 46, D380–D386. |
| Transcription factor regulation | Yeastract | Transcription factor - target gene interactions for Saccharomyces cerevisiae. | Teixeira,M.C. et al. (2018) YEASTRACT: an upgraded database for the analysis of transcription regulatory networks in Saccharomyces cerevisiae. Nucleic Acids Res, 46, D348–D353. |
To cite the GMT files in publications use:
Ari E, Ölbei M, Gul L, Bohár B, Stirling T (2024). muleaData: ExperimentalData Bioconductor Package for the mulea R Package, Contains Genes Sets for Functional Enrichment Analysis in GMT File Format. R package version 0.99.0, https://github.com/ELTEbioinformatics/muleaData.