United Formula Annotation (UFA) by the Integrated Data Science Laboratory for Metabolomics and Exposomics (IDSL.ME) is a lightweight R package that annotates peaklists from the IDSL.IPA package with molecular formulas from a prioritized chemical space using an isotopic profile matching approach. The IDSL.UFA pipeline requires only MS1 data for molecular formula annotation.
Molecular formulas are a fundamental property of chemical compounds and represent their elemental compositions. Assigning molecular formulas to peaks in data generated by untargeted LC/HRMS can help in gaining biological insights from metabolomics and exposomics datasets. Molecular formula annotation can also complement peak annotation pipelines that need MS2 spectra to assign a structural identity to a peak. Formulas can be assigned using only MS1 spectral data, which are available for every sample analyzed on an LC/HRMS instrument in metabolomics and exposomics studies.
Because each element has naturally occurring isotopes, MS1 spectral data contain more than one mass-to-charge ratio (m/z) value for an ionized species. The isotopic pattern for a chemical structure can be accurately predicted using a set of combinatorial rules that use atomic mass tables provided by the International Union of Pure and Applied Chemistry (IUPAC). To assign a molecular formula, the theoretical isotopic profile of carbon-containing compounds can be queried against the MS1 spectral data using a set of matching criteria and a ranking system. Because molecular formula assignment is so broadly applicable, almost all commercial and academic software for processing untargeted LC/HRMS datasets can search a single molecular formula or a list of molecular formulas against the raw MS1 data. Community guidelines for peak annotation also recommend performing the molecular formula assignment step on untargeted LC/HRMS datasets.
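To make the idea of a theoretical isotopic profile concrete, the following is a minimal sketch in base R. It is not the IDSL.UFA implementation: it considers carbon atoms only, ignores every other element and the adduct, and models the number of 13C atoms with a binomial distribution. The helper names `carbon_isotopic_profile` and `ppm_error` are hypothetical and exist only for this illustration.

```r
# Approximate natural abundance and exact masses of the stable carbon isotopes.
abundance_13C <- 0.0107
mass_12C <- 12.0              # exact by definition
mass_13C <- 13.00335483507

# Hypothetical helper: carbon-only isotopic profile for n_carbon atoms.
# Returns the mass and relative intensity of the species with 0..max_heavy 13C atoms.
carbon_isotopic_profile <- function(n_carbon, max_heavy = 3) {
  k <- 0:min(max_heavy, n_carbon)
  probability <- dbinom(k, size = n_carbon, prob = abundance_13C)
  data.frame(
    n_13C = k,
    mass = (n_carbon - k) * mass_12C + k * mass_13C,
    relative_intensity = 100 * probability / max(probability)
  )
}

# Hypothetical helper: signed mass error in parts per million.
ppm_error <- function(observed, theoretical) {
  1e6 * (observed - theoretical) / theoretical
}

# Theoretical carbon-only profile for a compound with 10 carbon atoms.
profile <- carbon_isotopic_profile(10)
print(profile)
```

An observed peak could then be compared with a theoretical mass through `ppm_error(observed, profile$mass[1])`, and the observed intensity ratios compared with `profile$relative_intensity`. The actual matching criteria and scoring used by IDSL.UFA are defined by its parameter spreadsheet and are not reproduced here.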
While existing tools offer a straightforward way to match theoretical isotopic patterns against MS1 spectral data, there are still unmet needs in handling larger studies and varied sources of molecular formulas. This matters for exposomics studies, where we expect to see many more compounds from formula sources other than common metabolite databases.
IDSL.UFA addresses these needs with the following features:
- Parameter selection through a user-friendly and well-described parameter spreadsheet
- Analyzing population size untargeted studies (n > 500)
- Generating comprehensive in-silico theoretical libraries (known as IPDB) using natural isotopic distribution profiles
- Aggregating annotated molecular formulas on the aligned peak table. This feature is unique to IDSL.UFA. To familiarize yourself with this statistical mass spectrometry feature, try PARAM0006 in the parameters tab of the UFA parameter spreadsheet
- Ranking candidate molecular formulas for a peak
- Generating batch untargeted isotopic profile match figures
- Parallel processing in Windows and Linux environments
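As a small aid for the parallel processing feature, the sketch below suggests a thread count that could be entered for PARAM0008 in the UFA parameter spreadsheet. It relies only on `parallel::detectCores()` from base R; the helper name `suggest_threads` and the choice to leave one core free are assumptions made for illustration, not recommendations from the package authors.

```r
library(parallel)

# Hypothetical helper: suggest a number of processing threads,
# leaving `reserve` cores free for the operating system.
suggest_threads <- function(reserve = 1) {
  available <- parallel::detectCores()
  # detectCores() can return NA when the core count cannot be determined.
  if (is.na(available)) {
    return(1L)
  }
  max(1L, as.integer(available) - as.integer(reserve))
}

n_threads <- suggest_threads()
cat("Suggested value for PARAM0008:", n_threads, "\n")
```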
install.packages("IDSL.UFA")
To annotate your mass spectrometry data (mzXML, mzML, netCDF), first process the data with the IDSL.IPA workflow to acquire the chromatographic information of the peaks (m/z-RT). Once the individual and aggregated aligned peaklists have been generated by the IDSL.IPA workflow, download the UFA parameter spreadsheet, select the parameters accordingly, and then use this spreadsheet as the input for the IDSL.UFA workflow:
library(IDSL.UFA)
UFA_workflow("Address of the UFA parameter spreadsheet")
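Because the workflow is driven entirely by the spreadsheet path, it can be convenient to validate that path before starting a long run. The wrapper below is a minimal sketch under that assumption; `run_ufa` is a hypothetical name, and the only package function it calls is `UFA_workflow`, as shown above.

```r
# Hypothetical wrapper: check the inputs before launching the workflow.
run_ufa <- function(spreadsheet_path) {
  # Fail early if the parameter spreadsheet cannot be found.
  if (!file.exists(spreadsheet_path)) {
    stop("UFA parameter spreadsheet not found: ", spreadsheet_path)
  }
  # Fail early if the package has not been installed.
  if (!requireNamespace("IDSL.UFA", quietly = TRUE)) {
    stop("Package 'IDSL.UFA' is not installed; run install.packages(\"IDSL.UFA\")")
  }
  IDSL.UFA::UFA_workflow(spreadsheet_path)
}

# Usage (the path is a placeholder, not a real file):
# run_ufa("Address of the UFA parameter spreadsheet")
```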
Follow these steps for a quick case study (n = 33) using study ST002263, which contains Thermo Q Exactive HF hybrid Orbitrap data collected in the HILIC-ESI-POS/NEG modes.
1. Process the raw mass spectrometry data to generate chromatographic information using the method described by IDSL.IPA for this study.
2. Download these pre-calculated IPDBs and use the positive or negative mode IPDB from the RefMetDB folder according to the IDSL.IPA folder results. RefMet represents a Reference list of Metabolite names.
3. IDSL.UFA requires 30 parameters distributed into 4 separate sections for a full-scale analysis. For this study, use the default parameter values presented in the UFA parameter spreadsheet. Then provide the following:
   3.1. PARAM0004 for the address of the IPDB (.Rdata)
   3.2. PARAM0005 and PARAM0006 should be YES
   3.3. PARAM0009 for the HRMS data location address (MS1 level HRMS data)
   3.4. PARAM0011 for the address of the peaklists directory generated by the IDSL.IPA workflow
   3.5. PARAM0012 for the address of the peak_alignment directory generated by the IDSL.IPA workflow
   3.6. PARAM0014 for the output location address (MS1 processed data)
   3.7. You may also increase the number of processing threads using PARAM0008 according to your computational power
4. Run this command in the R/RStudio console or terminal:
   library(IDSL.UFA)
   UFA_workflow("Address of the UFA parameter spreadsheet")
5. The results appear in the address you provided for PARAM0014 and include:
   5.1. Individual annotated peaklists with molecular formulas for each HRMS file in the annotated_mf_tables directory in the .Rdata and .csv formats
   5.2. The aligned molecular formula table in the aligned_molecular_formula_table directory in the .Rdata and .csv formats. We strongly recommend familiarizing yourself with the structure of this table to find the most probable candidate molecular formulas.
   5.3. If you selected a number greater than 0 for PARAM0024, the matched spectra figures are saved in the UFA_spectra folder.
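The output directories listed above can be inspected from R once the workflow has finished. The following is a minimal sketch that only lists the files found under the three directory names mentioned in this README; `list_ufa_outputs` is a hypothetical helper, the output path is a placeholder, and the file names inside each directory are not assumed.

```r
# Hypothetical helper: list the files in each UFA output directory.
# `output_dir` is the address provided for PARAM0014.
list_ufa_outputs <- function(output_dir) {
  subdirs <- c("annotated_mf_tables",
               "aligned_molecular_formula_table",
               "UFA_spectra")
  # list.files() returns an empty character vector for a missing directory,
  # so absent folders (e.g. UFA_spectra when no figures were requested) are handled.
  lapply(setNames(subdirs, subdirs), function(d) {
    list.files(file.path(output_dir, d), full.names = TRUE)
  })
}

# Usage (the path is a placeholder):
# outputs <- list_ufa_outputs("Output location address")
# csv_files <- grep("\\.csv$", outputs$aligned_molecular_formula_table, value = TRUE)
# aligned_table <- read.csv(csv_files[1])
# str(aligned_table)
```

Calling `str()` on the aligned table is a quick way to get familiar with its columns before ranking candidate molecular formulas.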
- A population size study with 499 individual mass spectrometry files
- List of consistent labeled isotopes
- Standard Adduct Type
- Definitions of Peak Spacing and Intensity Cutoff
- Isotopic Profile DataBase (IPDB)
- PubChem molecular formula database for IDSL.UFA
- Definitions of NEME and PCS
- Definitions of NDCS and RCS
- Score Coefficients Optimization
- Molecular formula class detection
[1] Fakouri Baygi, S., Banerjee, S.K., Chakraborty, P., Kumar, Y., Barupal, D.K. IDSL.UFA assigns high confidence molecular formula annotations for untargeted LC/HRMS datasets in metabolomics and exposomics. Analytical Chemistry, 2022, 94(39), 13315-13322.
[2] Fakouri Baygi, S., Kumar, Y., Barupal, D.K. IDSL.IPA characterizes the organic chemical space in untargeted LC/HRMS datasets. Journal of Proteome Research, 2022, 21(6), 1485-1494.