Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update README.md #33

Merged
merged 1 commit into from
Jul 25, 2024
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
22 changes: 13 additions & 9 deletions README.md
Original file line number Diff line number Diff line change
@@ -1,18 +1,22 @@
# Chemlactica / Chemma: Large Language Models for Small Molecules

TL;DR
* A family of models that understand small organic molecules written in SMILES, their basic properties, and similarities between molecules.
* [**Chemlactica-125M** 🤗](https://huggingface.co/yerevann/chemlactica-125m) and [**-1.3B** 🤗](https://huggingface.co/yerevann/chemlactica-1.3b) trained on top of Meta's [Galactica models](https://huggingface.co/facebook/galactica-1.3b).
* A family of models that "understand" small organic molecules (SMILES), their basic properties (molecular weight, QED, SAS, TPSA, CLogP, ...), and similarities between molecules (Tanimoto over ECFC4).
* [**Chemlactica-125M** 🤗](https://huggingface.co/yerevann/chemlactica-125m) and [**-1.3B** 🤗](https://huggingface.co/yerevann/chemlactica-1.3b) are trained on top of Meta's [Galactica models](https://huggingface.co/facebook/galactica-1.3b).
* [**Chemma-2B** 🤗](https://huggingface.co/yerevann/chemma-2b) is built on top of Google's [Gemma-2B](https://huggingface.co/google/gemma-2b).
* All models are trained on **40B** tokens covering 100M+ molecules from PubChem. [The dataset is also available at 🤗](https://huggingface.co/datasets/yerevann/PubChemForLM).
* All models are trained on **40B** tokens covering 100M+ molecules from PubChem. [Check the corpus at 🤗](https://huggingface.co/datasets/yerevann/PubChemForLM).
* A prompt like `</s>[SAS]2.25[/SAS][SIMILAR]0.62 CC(=O)OC1=CC=CC=C1C(=O)O[/SIMILAR][START_SMILES]` will generate a molecule that has ~2.25 SAS score and has ~0.62 similarity score to the given molecule.
* The models can be easily tuned to perform property prediction (~0.3 RMSE on FreeSolv from MoleculeNet).
* The models can be easily tuned to perform property prediction (~0.3 RMSE on [FreeSolv](https://paperswithcode.com/sota/molecular-property-prediction-on-freesolv) from MoleculeNet).
* The models wrapped into a **genetic-like optimization algorithm** beat all **molecular optimization** benchmarks we tried.
* [**Practical Molecular Optimization**](https://arxiv.org/abs/2206.12411): **17.5** vs 16.2 (previous SOTA: [Genetic-guided GFlowNets](https://arxiv.org/abs/2402.05961)).
* Optimization for **docking** with AutoDock Vina: 3-4x less oracle calls for generating 100 _good_ molecules than previous SOTA.
* QED optimization from the [RetMol paper](https://arxiv.org/abs/2208.11126): **99%** success rate with 10K oracle calls with Chemlactica-125M (vs. 96% with 50K calls).
* All details in the paper [Small Molecule Optimization with Large Language Models](https://yerevann.com/papers/small-molecule-optimization-with-large-language-models).

* [**Practical Molecular Optimization**](https://arxiv.org/abs/2206.12411)
* **17.5** vs 16.2 (previous SOTA: [Genetic-guided GFlowNets](https://arxiv.org/abs/2402.05961)).
* Optimization for **docking** with AutoDock Vina
* 3-4x fewer oracle calls for generating 100 _good_ molecules than previous SOTA ([Beam Enumeration](https://arxiv.org/abs/2309.13957)).
* QED optimization from the [RetMol paper](https://arxiv.org/abs/2208.11126)
* **99%** success rate with 10K oracle calls with Chemlactica-125M (vs. 96% with 50K calls of the original paper).
* Read the details in the paper [_Small Molecule Optimization with Large Language Models_](https://yerevann.com/papers/small-molecule-optimization-with-large-language-models.pdf).

We are looking forward to the community utilizing these models for solving various problems in molecular design.

## Table of contents
- [Description](#Description)
Expand Down
Loading