RL-GraphINVENT is a platform for graph-based targeted molecular generation using graph neural networks and reinforcement learning. RL-GraphINVENT uses a Gated Graph Neural Network -based model fine-tuned using reinforcement learning to probabilistically generate new molecules with desired property profiles.
- Anaconda or Miniconda with Python 3.6 or 3.8.
- CUDA-enabled GPU.
To generate targeted molecules using the RL-GraphINVENT:
- Pre-train GraphINVENT in a supervised learning manner using a general molecular dataset.
- Define a scoring function based on a set of desired molecular properties. This score should assign high scores to the set of actions taken to build molecules with the desired attributes.
- Use the RL framework to fine-tune the pre-trained model towards generation of molecules with the desired properties.
- Analyse the properties of the generated molecules.
To generate molecules with other properties or use a different scoring function, one can go back to step 2 and repeat the process for the same pre-trained model.
There are 6 different kinds of jobs for the pre-training phase:
- "preprocess": for generating the target Action-Probability Distributions (APDs) of each sub-graph of the molecules in the dataset.
- "train" : for pre-training the generative model so that it learns to generate molecules which properties resemble those of the molecules in the training data.
- "generate": for sampling molecules from a chosen trained model.
- "test": for computing the evaluation metrics on the test set.
- "multiple_valid": for reevaluating the model at different training epochs.
- "valid_loss": for computing the validation loss of the model at different epochs.
You can run for instance a training job by setting the job type to `train' and selecting the appropiate hyperparameter values in submitPT.py:
submitPT.py >
# define what you want to do for the specified job(s)
dataset = "chembl25_500k"
job_type = "train" # "preprocess", "train", "generate", "test", "multiple_valid" or "valid_loss"
jobdir_start_idx = 0 # where to start indexing job dirs
n_jobs = 1 # number of jobs to run per model
restart = False
force_overwrite = False # overwrite job directories which already exist
jobname = "train" # used to create a sub directory
# if running using SLURM, specify the parameters below
use_slurm = False # use SLURM or not
run_time = "5-00:00:00" # hh:mm:ss
mem_GB = 30
# set paths here
python_path = f"{home}/.conda/envs/RL-GraphINVENT-env/bin/python"
graphinvent_path = f"pre-training/"
data_path = f"data/pre-training/"
# define dataset-specific parameters
params = {
"atom_types": ["C", "N", "O", "Cl", "F", "S", "Br"],
"formal_charge": [-1, 0, +1],
"max_n_nodes": 72,
"job_type": job_type,
"dataset_dir": f"{data_path}{dataset}/",
"restart": restart,
"model": "GGNN",
"sample_every": 5,
"init_lr": 1e-4,
"min_rel_lr": 1e-3,
"max_rel_lr": 1,
"epochs": 30,
"batch_size": 1000,
"block_size": 100000,
"n_samples": 1000, # <-- how many structures to generate
"n_workers": 0
}
Then simply run
(RL-GraphINVENT-env)$ python submitPT.py
There are 2 different kinds of jobs for the fine-tuning phase:
- "learn": for fine-tuning the model via reinforcement learning with the defined scoring function
- "generate": for sampling molecules from a chosen model.
The scoring functions are defined in scores.py.
You can run for instance a learning job by setting the job type to `learn' and selecting the appropiate hyperparameter values in submitFT.py:
submitFT.py >
# define what you want to do for the specified job(s)
dataset = "chembl25_500k"
job_type = "learn" # "learn" or "generate"
jobdir_start_idx = 0 # where to start indexing job dirs
n_jobs = 1 # number of jobs to run per model
restart = False
force_overwrite = False # overwrite job directories which already exist
jobname = "learn" # used to create a sub directory
# if running using SLURM, specify the parameters below
use_slurm = False # use SLURM or not
run_time = "0-02:00:00" # hh:mm:ss
mem_GB = 30
# set paths here
python_path = f"{home}/.conda/envs/RL-GraphINVENT-env/bin/python"
graphinvent_path = f"fine-tuning/"
data_path = f"data/fine-tuning/"
# define dataset-specific parameters
params = {
"atom_types": ['C', 'N', 'O', 'Cl', 'F', 'S', 'Br'],
"formal_charge": [-1, 0, +1],
"max_n_nodes": 72,
"job_type": job_type,
"dataset_dir": f"{data_path}{dataset}/",
"data_path": f"{data_path}",
"restart": restart,
"model": "GGNN",
"sample_every": 5,
"init_lr": 1e-4,
"min_rel_lr": 1e-2,
"max_rel_lr": 1,
"epochs": 200,
"batch_size": 64,
"gen_batch_size": 1000,
"block_size": 100000,
"n_samples": 1000, # <-- how many structures to generate
"n_workers": 0,
"sigma": 20,
"alpha": 0.5,
"score_type": "activity", #"reduce", "augment", "qed", "activity"
}
Then simply run
(RL-GraphINVENT-env)$ python submitFT.py
Contributions are welcome in the form of issues or pull requests. To report a bug, please submit an issue.
The implementation of GraphINVENT adapted for this work was pulled from the MolecularAI repo in January 2021. It is available at
https://github.com/MolecularAI/GraphINVENT
This work is explained in this publication.
The data used for pre-training the generative model and for checking the overlap between generated molecules and known 'true' actives is a subset of the data found in https://github.com/pcko1/Deep-Drug-Coder/tree/master/datasets.
The model used for the activity scoring function can be found at https://github.com/pcko1/Deep-Drug-Coder/blob/master/models/qsar_model.pickle.
RL-GraphINVENT is licensed under the MIT license and is free and provided as-is.
Before doing anything with RL-GraphINVENT, you will need to configure the RL-GraphINVENT virtual environment, as the code is dependent on very specific versions of packages. You can use conda for this.
The ../environment.yml file lists all the packages required for GraphINVENT to run. From within the RL-GraphINVENT/ directory, a virtual environment can be easily created using the YAML file and conda by typing into the terminal:
conda env create -f environment.yml
Then, to activate the environment:
conda activate RL-GraphINVENT-env
To install additional packages to the virtual environment, should the need arise, use:
conda install -n RL-GraphINVENT-env {package_name}
To save an updated environment as a YAML file using conda, use:
conda env export > path/to/environment.yml
And that's it!