This changelog follows the Keep a Changelog guidelines (https://keepachangelog.com/).
- CAZP scoring component
- Convert float NaNs and infs to a valid JSON format before remote reporting
- Optional tautomer canonicalisation in the data pipeline
- Read configuration file from stdin
- Refactor of top-level code
- Check if the diversity filter (DF) is set
- YAML configuration file reader
- Logging of configuration file absolute path
- Automatic configuration file format detection
- Exponential decay transform
- Ambiguity in parsing optional parameters with multiple endpoints and multiple optional parameters
- Component-level parameters for scoring components
- Executable module: can run `python -m reinvent`
- SIGUSR1 for controlled termination
- PepInvent in sampling and staged learning modes, with an example TOML config provided
- PepInvent prior
- Atom map number removal for Libinvent sampling dropped SMILES
- Stage number for JSON to remote monitor
- Relaxed dependencies
- Terminate staged learning on SIGTERM and check if running in multiprocessing environment
- Raise ValueError for all scoring components so that the staged learning handler can handle failing components
- SMILES in DF memory were wrongly computed
- run-qsartuna.py: convert ndarray to list to make it JSON serializable
- PMI component: check for embedding failure in RDKit's conformer generator
- Dockstream component wrongly quoted the SMILES string
- Diversity filter setup in config was ignored
- Fixed config reading bug for DF
- Changed Molformer sampling valid and unique metrics from percentage to fraction on TensorBoard
- Fixed incorrect Tanimoto similarity log in Mol2Mol sampling mode
- Corrected typo in Libinvent report
- Report for sampling returned np.array which is incompatible with JSON serialization
- Allowed responder as an optional input in scoring input validation
- Fixed remote for Libinvent
- Batch size defaults to 1 for TL
- Added temperature parameter in Sampling and RL config validation
- Scalar return value from make_grid_image()
- Removed labels from Libinvent-generated molecules before passing to scoring components: Maize, Icolos, DockStream and ROCS
- Write out clean SMILES without labels to CSV and tensorboard in staged learning and sampling run modes
- Simplified update code for Libinvent RL
- Added missing layer_normalization when getting Reinvent RNN model parameters
- Log likelihood calculation is handled more efficiently
- Added missing tag to legacy TanimotoDistance component
- Custom RDKit normalization transforms hard-coded or from file
- Safeguard against invalid SMILES in SMILES file/CSV reading
- Allow dot SMILES fragment separator for Lib/Linkinvent input
- Renamed pipeline option `keep_isotopes` to `keep_isotope_molecules`
- New data pipeline options: uncharge, kekulize, randomize_smiles
- Parallel logging in data pipeline preprocessor (partial only)
- Libinvent based on unified Transformer
- Error handling for unsupported tokens in RNN-based Libinvent
- Parallel processing of regex and rdkit filters
- RDKit filter uses simpler functions not compound functions
- RL max_score is optional and has a default of 1.0
- Parallel implementation of chemistry filter in data pipeline
- Added multinomial as the default sampling strategy for the Transformer
- Chemistry filter: customizable normalization, do not use RDKit logger
- Reworked remote responder
- Reporting for sampling runmode
- Minor fix to DF validation
- Chemistry filter for data pipeline
- Check if agent is in RL state info to avoid unnecessary exception
- Top-level validation of config file to detect extra sections
- Initial support for data pipeline
- Example scoring component script for RAScore
- Reimplementation of the Conversion class for backward compatibility
- Various fixes in TLRL notebook
- Scoring runmode now also allows import of scoring components
- More consistent JSON config writing (includes imported scoring functions)
- Better handling of value mapping
- All scoring components need to compute the number of endpoints, added where sensible
- Filter out invalid fragments for Lib/Linkinvent
- Lifted static methods to module level functions
- More chemistry code
- Unused chemistry code and associated tests
- Chained reporters for RL
- Compatibility support for model file metadata: dict vs dataclass
- Various cosmetic fixes to TB output
- TL responder: validation loss was reported as sampled loss
- Add metadata when the "metadata" field is empty
- Code clean-up in Reinvent model and RNN code
- Global pydantic configuration
- Affected test cases after code rearrangement
- Various code improvements
- Metadata writing for all created RL and TL models
- Chained reporters
- Prior registry
- Config validation
- Write additional information to RL CSV
  - For Mol2Mol, add Input_SMILES
  - For Linkinvent, add Warheads and Linker
  - For Libinvent, add Input_Scaffold and R-groups
- Notebook: plot of reverse sigmoid transform
- Stages can now define their own diversity filters. A global filter always overwrites stage settings. There is currently no mechanism to carry over the DF from a previous stage; use single-stage runs for that.
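  As a hedged illustration of the entry above, a staged learning config might declare the filters as follows; the section layout and key names ([diversity_filter], type, bucket_size, minscore, [[stage]]) are taken from typical REINVENT4 example configs and should be checked against the current documentation:

  ```toml
  # Sketch only: a global diversity filter plus a stage-local one.
  # Per the entry above, a global [diversity_filter] always overwrites
  # stage-level settings, so in practice define only one of the two.
  run_type = "staged_learning"

  [diversity_filter]               # global DF (takes precedence if present)
  type = "IdenticalMurckoScaffold"
  bucket_size = 25
  minscore = 0.4

  [[stage]]
  max_score = 0.6
  max_steps = 300

  [stage.diversity_filter]         # stage-local DF, used when no global DF is set
  type = "IdenticalTopologicalScaffold"
  bucket_size = 10
  minscore = 0.4
  ```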
- TanimotoSimilarity replaces TanimotoDistance
- TanimotoDistance: actually computes a similarity and is superseded by TanimotoSimilarity
- ChemProp scoring component now supports multitask models
- Optional [scheduler] section for TL
- LibInvent: fixed issue with multiple R-groups on one atom
- ReactionFilter: selective filter will now function correctly as a filter
- Notebook: a more complete RL/TL demo
- Fixed crash when all molecules in batch are filtered out
- Code clean-up in create_adapter()
- Import mol2mol vocabulary rather than copying the file
- Write invalid SMILES unchanged to RL CSV
- Notebook: demo on how to analyse RL CSV
- Dataclass validation for scoring component parameters
- Datatype in MatchingSubstructure's Parameters: only a single SMARTS is allowed
- Notebook to demo simple RL run, TensorBoard visualisation and TensorBoard data extraction.
- Linkinvent based on unified Transformer model supported by RL and sampling. Both beam search and multinomial sampling are implemented.
- Downgraded ChemProp to 1.5.2 and scikit-learn to 1.2.2 to retain backward compatibility
- New default torch device setup from PyTorch 2.x
- Config parameter "device" to explicitly set the torch device, e.g. "cuda:0"
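  For illustration, the device setting sits at the top level of the config; run_type and the [parameters] block below are placeholders from typical example files, not a definitive schema:

  ```toml
  # Sketch: explicitly pin the torch device for the whole run.
  run_type = "sampling"
  device = "cuda:0"    # named in the entry above; use "cpu" to stay off the GPU

  [parameters]
  # run-mode specific settings go here
  ```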
- Fixed unknown token handling for Mol2mol TL
- Fixed dataloader for TL to use incomplete batch
- Skip hash check in metadata if no metadata in model file
- Mol2Mol supports unknown tokens for all the priors
- Optional randomization in all TL epochs for Reinvent
- Return from make_grid_image()
- Log network parameters
- Reworked TL code: clean-up, image layout, graph for Reinvent
- Added options and better statistics for TL: valid SMILES, duplicates
- Standardization can now be disabled
- KL divergence in TB output
- Batch size calculation
- Vocabulary for mol2mol and reinvent is saved as dictionary
- Sigmoid functions in scoring have now a stable implementation
- Long chain (SMARTS for 5 aliphatic carbons) check
- Unified transformer code to facilitate new model designs
- Filters now apply to transformed scores
- Minor change inception filter: cleaner way of handling internal SMILES store
- Updated script for creating an empty classical Reinvent model
- Memory bug in TL related to similarity calculation: made this optional
- Allowed runs with only filter/penalty components
- Better logging for Reinvent standardizer
- Inception filters out SMILES containing tokens that are not compatible with the prior
- Numerically stable double sigmoid implementation
- Number of CPUs for TL (Mol2Mol) is now 1
- Save models with no metadata
- Tensorboard histogram bug fixed again
- TL now runs for the expected `num_epochs`
- Get `model_type` correctly from the prior model `save_dict`
- Staged learning does not allocate GPU memory if device is set to CPU
- Prior model files have been tagged with meta data
- Model files read in are checked for integrity
- Tab reader unit tests now use mocks for open
- Write the CSV scoring file correctly when reading from a one-column SMILES file
- Scoring filter components work as filters again
- CSV and SMILES file reader for the scoring run mode will retain all columns from the input and write them to the output CSV
- Tobias Ploetz' (Merck) REINFORCE implementations of the DAP, MAULI and MASCOF RL reward strategies
- Check if RDKit descriptor names are valid
- Filename issue on Windows which led to termination
- General scoring component for all 210 RDKit descriptors
- Optional cwd for run_command()
- Collect all names for remote monitoring
- Pass data to request as dict rather than a JSON string
- Pair generator multiprocessing in TL is supported on Linux, Windows, and MacOS
- The number of CPUs is optional and can be specified in the TOML/JSON configuration file through the parameter `number_of_cpus`
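  A hedged sketch of a TL config using the `number_of_cpus` parameter named above; the remaining keys (input_model_file, smiles_file, output_model_file, num_epochs, batch_size) mirror common example files and may differ between versions:

  ```toml
  # Sketch: transfer learning with a multiprocessing pair generator.
  run_type = "transfer_learning"
  device = "cuda:0"

  [parameters]
  num_epochs = 10
  batch_size = 64
  number_of_cpus = 4                  # optional; sizes the pair-generator pool
  input_model_file = "mol2mol.prior"  # hypothetical file names
  smiles_file = "train.smi"
  output_model_file = "tl_model.chkpt"
  ```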
- Added missing second mask in inception call to scoring function
- Fixed CUDA out-of-memory error for Reinvent batch sampling
- Handle the case when there are no non-cached SMILES and thus the scoring function does not need to run
- Improved type safety in `value_mapping`
- Number of CPUs can be specified in TOML/JSON config files for TL jobs
- Check for CUDA before checking GPU memory otherwise will fail on CPU
- Removed obsolete code which broke TL with Reinvent
- Windows support: correct signal handling
- Scoring component MolVolume to compute molecular volume via RDKit
- Minimal SMILES pre-processing for scoring to allow keeping stereochemistry, choosing only the largest fragment, and using the general RDKit cleanup/sanitization/hydrogen handling. Heavy filtering on molecule size, allowed atoms, tokens, vocabulary, etc. is skipped. This facilitates situations where only scoring is desired.
- Allow zero weights to only display a component score. This has no effect on aggregation, but the component score is still computed, so be careful with computationally expensive components.
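  A sketch of what a display-only component could look like: the zero-weight component is excluded from aggregation but still evaluated (and therefore still costs compute). The nested [[scoring.component]]/endpoint layout and the component names follow REINVENT4 example configs and should be verified:

  ```toml
  # Sketch: MolecularWeight is reported but ignored by the geometric mean
  # because its weight is zero; QED alone drives the aggregated score.
  [scoring]
  type = "geometric_mean"

  [[scoring.component]]
  [scoring.component.MolecularWeight]
  [[scoring.component.MolecularWeight.endpoint]]
  name = "MW (display only)"
  weight = 0.0

  [[scoring.component]]
  [scoring.component.QED]
  [[scoring.component.QED.endpoint]]
  name = "QED"
  weight = 1.0
  ```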
- Flag to purge diversity filter memories after each staged learning stage. This is useful in multi-stage runs and is equivalent to `use_checkpoint` for single-stage reruns.
- The CSV file from RL has controlled output precision: 7 for total score and transformed scores, 4 for all other floating point values
- Critical: all scores of duplicate SMILES, including the first occurrence, were set to zero rather than to the computed value
- Scores of duplicates are now correctly copied over from the first occurrence
- All tests support both CPU and GPU
- Conditional import of `resource` to allow running on Windows
- Some rudimentary information on GPU memory usage in staged learning
- Bug in sampling related to how the `sampled.nlls` object was treated; it is now always a PyTorch tensor on the CPU without gradients
- Moved TPSA to a separate component to enable TPSA calculation for polar S and P; the original RDKit implementation does not consider those atoms, and the default is still to leave them out of the TPSA calculation
- Issue with NaNs in the raw and transformed scores that prevented computing the mean
- Explicit serialization of JSON string because the internal one from requests may fail
- Added a patch to fix a bug in the native PyTorch implementation related to the histogram functionality of the TensorBoard report
- Added a check which raises an exception if the user enters scaffolds with multiple attachment points connected to the same atom (Libinvent, Linkinvent). This will be lifted in a future update
- Fixed report format (TL)
- Normalize SMILES before passing them to string-based scoring components because the SMILES may still contain labels (Libinvent)
- Fixed fragment effective length ratio error when the fragment has a single atom (e.g. "[*]N[*]") with max graph length = 0
- For Compound Sampling, removed the explicit dependency on matplotlib: report histogram and scatter plot to TensorBoard only if matplotlib is available
- For Compound Sampling, introduced a new parameter `unique_molecules` to perform canonical SMILES deduplication instead of sequence deduplication, which can be confusing
- Warn if sequence deduplication is requested
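  A sketch of a sampling config that uses `unique_molecules` for canonical-SMILES deduplication; the model and file paths as well as num_smiles are illustrative placeholders:

  ```toml
  # Sketch: deduplicate sampled molecules by canonical SMILES rather than
  # by token sequence (see the unique_molecules entry above).
  run_type = "sampling"
  device = "cuda:0"

  [parameters]
  model_file = "priors/mol2mol.prior"   # hypothetical path
  smiles_file = "input.smi"             # input molecules for Mol2Mol-style sampling
  output_file = "samples.csv"
  num_smiles = 100
  unique_molecules = true               # canonical SMILES deduplication
  ```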
- TL integration test fixed (no impact on GUI or core)
- TL reporting of epochs
- Multiple scoring components are reported as means again rather than as lists
- Fixed graph length fragment component
- Fixes for fragment scoring components
- Scores reported for filters and penalties as in REINVENT3
- Initial release of REINVENT4
- RL rewards MASCOF, MAULI, SDAP (inefficient in practice)
- Sequence deduplication (interferes with diversity filter and SMILES deduplication)