Publication: In preparation
Citation: Flores et al. 2025
Metadata
- compound_metadata.csv: Contains compound type information for a few compounds (e.g. whether they are amino acids)
- score_metadata.csv: Contains the metadata for each spectral similarity score
- sample_metadata.csv: Contains the metadata for each sample
Models
- model.RDS: The trained ensemble model with all scores
- reduced_model.RDS: The trained ensemble model with the top 6 performing scores
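
As a minimal sketch of how the saved models might be loaded, the snippet below uses base R's `readRDS`. The `Models` directory path is an assumption based on the section name above, and `load_if_present` is a hypothetical helper, not part of this repository. The class of the stored objects is not documented here, so the sketch only inspects it.

```r
# Hypothetical helper: read an .RDS file if it exists, otherwise return NULL.
load_if_present <- function(path) {
  if (!file.exists(path)) {
    message("File not found: ", path)
    return(NULL)
  }
  readRDS(path)
}

# Assumed paths, relative to the repository root.
full_model    <- load_if_present(file.path("Models", "model.RDS"))
reduced_model <- load_if_present(file.path("Models", "reduced_model.RDS"))

# Inspect the object class before relying on any model-specific methods.
if (!is.null(full_model)) print(class(full_model))
```

Checking `class()` first shows which package needs to be attached before calling `predict()` or similar methods on the loaded object.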
Result_Data
- BinSizes.csv: The number of candidate molecules per sample and retention index bin
- FP_FN_Ranks.txt: Full model and reduced model predictions on the testing dataset
- reduced_test_pred.RDS: An R object with the reduced model predictions on the testing dataset
- test_pred.RDS: An R object with the full model predictions on the testing dataset
- TP_Ranks.txt: Rankings of the true positive per sample and retention index bin for the top 6 scores, the full model, and the reduced model
Note: All other data used in this study are too large for a GitHub repository and can be found here: https://data.pnnl.gov/group/nodes/dataset/33302
Scripts
- build_dataset.R: Extracts the molecule information needed for this study; run after downloading https://data.pnnl.gov/group/nodes/dataset/33302
- ensemble_model.R: Builds the ensemble model; run after build_dataset.R
- false_positive_&_false_negative: Extracts the information needed about false positives and false negatives; run after the ensemble model
- top_N.R: Compares the true positive rankings of the built models and the top 6 scores
Visualization
- plots.R: Generates all visualizations of results for this study