DreaMS (Deep Representations Empowering the Annotation of Mass Spectra)

DreaMS (Deep Representations Empowering the Annotation of Mass Spectra) is a transformer-based neural network designed to interpret tandem mass spectrometry (MS/MS) data. Pre-trained in a self-supervised way on millions of unannotated spectra from our new GeMS (GNPS Experimental Mass Spectra) dataset, DreaMS acquires rich molecular representations by predicting masked spectral peaks and chromatographic retention orders. When fine-tuned for tasks such as spectral similarity, chemical property prediction, and fluorine detection, DreaMS achieves state-of-the-art performance across various mass spectrometry interpretation tasks. The DreaMS Atlas, a comprehensive molecular network comprising 201 million MS/MS spectra annotated with DreaMS representations, is publicly accessible for further research and development, along with the pre-trained models and training datasets. 🚀

This repository provides the code and tutorials to:

  • 🔥 Generate DreaMS representations of MS/MS spectra and utilize them for downstream tasks such as spectral similarity prediction or molecular networking.
  • 🤖 Fine-tune DreaMS for your specific tasks of interest.
  • 💎 Access and utilize the extensive GeMS dataset of unannotated MS/MS spectra.
  • 🌐 Explore the DreaMS Atlas, a molecular network of 201 million MS/MS spectra from diverse MS experiments annotated with DreaMS representations and metadata, such as studied species, experiment descriptions, etc.
  • ⭐ Efficiently cluster MS/MS spectra in linear time using locality-sensitive hashing (LSH).

Additionally, for further research and development:

  • 🔄 Convert conventional MS/MS data formats into our new, ML-friendly HDF5-based format.
  • 📊 Split MS/MS datasets into training and validation folds using Murcko histograms of molecular structures.

Please refer to our documentation and paper "Emergence of molecular structures from repository-scale self-supervised learning on tandem mass spectra" for more details.

Getting started

Installation

Run the following code from the command line.

# Download this repository
git clone https://github.com/pluskal-lab/DreaMS.git
cd DreaMS

# Create conda environment
conda create -n dreams python==3.11.0 --yes
conda activate dreams

# Install DreaMS
pip install -e .

If you are not familiar with conda or do not have it installed, please refer to the official documentation.

Compute DreaMS representations

To compute DreaMS representations for MS/MS spectra stored in an .mgf file, run the following Python code.

from dreams.api import dreams_embeddings
embs = dreams_embeddings('data/examples/example_5_spectra.mgf')

The resulting embs object is a matrix with 5 rows and 1024 columns: one 1024-dimensional DreaMS representation for each of the 5 input spectra stored in the .mgf file.
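
Once the representations are computed, one common way to compare spectra is the cosine similarity between rows of the matrix. The snippet below is a minimal sketch and is not taken from the DreaMS codebase: it assumes embs behaves like a NumPy array of shape (5, 1024), the helper name cosine_similarity_matrix is hypothetical, and a random stand-in matrix replaces the real embeddings so that the example runs without DreaMS installed.

```python
import numpy as np


def cosine_similarity_matrix(embs: np.ndarray) -> np.ndarray:
    """Return pairwise cosine similarities between the rows of `embs`.

    Hypothetical helper for illustration; not part of the DreaMS API.
    """
    embs = np.asarray(embs, dtype=np.float64)
    # Normalize each row to unit length, guarding against zero vectors.
    norms = np.linalg.norm(embs, axis=1, keepdims=True)
    unit = embs / np.clip(norms, 1e-12, None)
    # The dot product of unit vectors is their cosine similarity.
    return unit @ unit.T


# Stand-in for the output of dreams_embeddings (assumption: 5 x 1024 matrix).
rng = np.random.default_rng(0)
embs = rng.normal(size=(5, 1024))

sims = cosine_similarity_matrix(embs)
print(sims.shape)  # (5, 5)
```

With real embeddings, entry sims[i, j] would score how similar spectrum i is to spectrum j, and each spectrum has similarity 1 with itself.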

References

If you use DreaMS in your research, please cite the following paper:

@article{bushuiev2024emergence,
    author = {Bushuiev, Roman and Bushuiev, Anton and Samusevich, Raman and Brungs, Corinna and Sivic, Josef and Pluskal, Tomáš},
    title = {Emergence of molecular structures from repository-scale self-supervised learning on tandem mass spectra},
    journal = {ChemRxiv},
    doi = {10.26434/chemrxiv-2023-kss3r-v2},
    year = {2024}
}
