Skip to content

Latest commit

 

History

History
63 lines (61 loc) · 4.02 KB

README.md

File metadata and controls

63 lines (61 loc) · 4.02 KB

Analysis of Uncertainty of Neural Fingerprint-based Models

This repository contains the code to reproduce the results of the paper "Analysis of Uncertainty of Neural Fingerprint-based Models" (under review).

Abstract

Estimating the uncertainty of model predictions is crucial in a wide range of cheminformatics applications, not only to better understand machine learning models but also to establish trust in deployed models. Uncertainty estimates for many standard machine learning models, like Random Forest, are well studied. However, their predictive performances can be inferior to deep learning models, like graph neural networks (GNNs). We investigated whether the neural fingerprint extracted from a GNN can be combined with classical machine learning models to achieve good prediction performance and reliable uncertainty estimates.

Reproducing the results

DVC

The experiments are managed using DVC, where each step is specified in the dvc.yaml file. Running the pipeline will create a dvc.lock file, which contains the hashes of the scripts, input files, and output files, ensuring that the results originate from the provided code and data. The cache is available from this repository as a tarball. The following sections describe how to set up the project and reproduce the results.

Commands to reproduce the results

  1. Clone the repository
git clone https://github.com/basf/neural-fingerprint-uncertainty.git
cd neural-fingerprint-uncertainty
  1. Install the requirements
pip install -r requirements.txt
  1. Unzip the dvc cache
tar -xf dvc_cache.tar.gz .dvc/
  1. Pull the data
dvc pull
  1. Reproduce the results
dvc repro

The generated results, figures, etc are saved in the data folder.

Workflow of the experiments

Molecular standardization

The molecular standardization is performed using molpipeline. Details of the standardization are provided in the 01_preprocess_data.py script.

Creating the folds

The data is split into 5 folds using the StratifiedKFold method and the GroupKFold method, where the group is determined by Agglomerative Clustering. The details of the fold creation are provided in the 02_assign_groups.py script.

ML experiments with Morgan fingerprints

The Morgan fingerprints are used to train the classical machine learning models. The details of the experiments are provided in the 03_ml_experiments.py script.

ML experiments with neural fingerprints

The neural fingerprints are extracted from a pre-trained Chemprop model. In addition to the neural fingerprints, the GNN is also used to predict the target values. The details of the experiments are provided in the 04_neural_fingerprint_predictions.py script.

Create plots for each endpoint

The results of the experiments are visualized using matplotlib, where the plots are saved in the figures folder. Figures have to be loaded using the commands provided [above](#Commands to reproduce the results). Code for the plots is provided in the 05_create_plots.py script.

Plots used in the paper

The plots used in the paper are saved in the final_figures folder. The code for the plots is provided in the 06_create_final_figures.py script.

Tables used in the paper

The tables used in the paper were logged and directly extracted from the console. A copy of the console output is provided in the file 07_create_final_tables.log (again, only available after executing the commands [above](#Commands to reproduce the results)).

License

This software is licensed under the MIT license. See the LICENSE file for details.