- [2024/12/31] Open-sourced the full datasets, some model checkpoints, and the code.
Recent advances in Chemical Language Models (CLMs) have shown great promise in bridging molecular structures and natural language for drug discovery and molecular comprehension. However, current approaches face significant challenges due to inherent biases in different molecular representations, limiting their effectiveness in capturing comprehensive molecular information and achieving reliable molecular design. Building upon recent developments in Large Language Models (LLMs), we propose a Heterogeneous Molecular Encoding (HME) framework that aims to improve the bidirectional mapping within the chemical-linguistic sharing space.
We propose Heterogeneous Molecular Encoding (HME), a streamlined framework suitable for LLMs that integrates sequential and geometric molecular features to achieve unbiased encoding.
We introduce MCMoD, a comprehensive dataset containing more than 1 million molecules with their corresponding textual descriptions, molecular fragments, and chemical property control signals.
Navigating linguistic space: HME achieves strong performance in molecular captioning and question answering. Navigating chemical space: HME demonstrates **reliable molecular design capabilities under various control signals**; notably, in zero-shot scenarios, our framework achieves a 79.4% success rate.
- Navigating chemical space with linguistic guidance: description-based molecular generation (with Chain-of-Thought reasoning) and multi-objective molecular reverse design, where molecular fragments serve as one of the conditions.
- Navigating linguistic space with molecular guidance: molecular captioning, molecular general QA, molecular property QA, etc.

More results can be found in the paper.
- datasets/: the datasets used to train and test HME
- fragment_vocabs/: fragment vocabulary files
- metrics/: Python scripts to evaluate the experimental results
- molecular_towers/: frozen molecular 2D and 3D encoders
- psvae/: the principal subgraph mining algorithm that transforms SMILES into fragments
- scripts/: bash scripts to run the models
- configuration_llava.py: the configuration of the HME model
- data.py: defines the dataset classes
- frg.py: fragments molecules (an illustrative sketch follows this list)
- infer.py: model inference
- modeling_llava.py: the architecture of the HME model
- preprocess.py: preprocesses the JSON files of our datasets
- utils.py: utilities for loading models, initializing models, setting random seeds, etc.
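For intuition about what molecule fragmentation produces, here is a minimal stand-in sketch using RDKit's BRICS decomposition. This is only an illustration: frg.py and psvae/ use principal subgraph mining, not BRICS.

```python
from rdkit import Chem
from rdkit.Chem import BRICS

# Illustrative stand-in only: BRICS decomposition, NOT the principal subgraph
# mining implemented by psvae/ and frg.py in this repository.
mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")  # aspirin
fragments = sorted(BRICS.BRICSDecompose(mol))
print(fragments)  # fragment SMILES containing dummy-atom attachment points
```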
conda create -n hme python=3.10
pip install -r requirements.txt
# Only some key packages are listed in requirements.txt.
# If you hit bugs while running our code, please let us know and we will address them promptly.
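After activating the environment (`conda activate hme`), an optional sanity check can confirm that the core dependencies import correctly. The snippet below assumes PyTorch, Transformers, and RDKit are among the key packages in requirements.txt; adjust it to your actual dependency list.

```python
# Optional sanity check -- package names are assumptions based on the stack
# described in this README (LLaVA-style LLM + RDKit-based chemistry tooling).
import torch
import transformers
from rdkit import rdBase

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("transformers:", transformers.__version__)
print("rdkit:", rdBase.rdkitVersion)
```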
All the data used in our paper can be found on HuggingFace, where a detailed description is also provided. To reproduce HME:
- Download the dataset from HuggingFace and put it under src/datasets/
- Preprocess the dataset:
This step uses the frozen 2D and 3D encoders to extract 2D and 3D features for each molecule. You need to modify the file path in preprocess.py to point to the actual JSON file (a rough sketch of this step follows the command below).
python preprocess.py
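For orientation, the sketch below shows the general shape of this step: iterate over the dataset JSON, build a 3D conformer per molecule, and cache features for later use. File paths, field names, and the RDKit conformer routine are illustrative assumptions; the actual feature extraction uses the frozen encoders under molecular_towers/ and is implemented in preprocess.py.

```python
import json
from rdkit import Chem
from rdkit.Chem import AllChem

DATA_JSON = "src/datasets/train.json"  # assumption: point this at your actual json file

with open(DATA_JSON) as f:
    records = json.load(f)

for record in records:
    smiles = record["smiles"]          # hypothetical field name
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        continue                       # skip unparsable entries
    mol = Chem.AddHs(mol)
    AllChem.EmbedMolecule(mol, randomSeed=42)  # generate a 3D conformer
    # In the real pipeline, the frozen 2D/3D encoders in molecular_towers/
    # convert the molecular graph and conformer into feature tensors that are
    # cached alongside each record for training and inference.
```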
- Download our HME checkpoint from HuggingFace.
- Model inference:
sh ./scripts/eval.sh
- Use ./metrics/*.py to evaluate the generated results:
- mol2text_metrics.py: for captioning and general QA
- number_metrics.py: for property QA
- text2mol_metrics.py: for description-based molecular generation (see the sketch below)
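As a reference for what description-based generation evaluation typically measures, the sketch below scores a predicted SMILES against a reference using RDKit: validity, canonical exact match, and Morgan-fingerprint Tanimoto similarity. It is an illustrative stand-in, not the implementation in text2mol_metrics.py.

```python
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def score_pair(pred_smiles: str, ref_smiles: str) -> dict:
    """Validity, canonical exact match, and fingerprint similarity for one prediction."""
    pred, ref = Chem.MolFromSmiles(pred_smiles), Chem.MolFromSmiles(ref_smiles)
    if pred is None or ref is None:
        return {"valid": pred is not None, "exact": False, "tanimoto": 0.0}
    exact = Chem.MolToSmiles(pred) == Chem.MolToSmiles(ref)  # canonical SMILES match
    fp_pred = AllChem.GetMorganFingerprintAsBitVect(pred, 2, nBits=2048)
    fp_ref = AllChem.GetMorganFingerprintAsBitVect(ref, 2, nBits=2048)
    return {"valid": True, "exact": exact,
            "tanimoto": DataStructs.TanimotoSimilarity(fp_pred, fp_ref)}

print(score_pair("c1ccccc1O", "Oc1ccccc1"))  # same molecule, different SMILES strings
```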
We would like to express our gratitude to the following related projects and their developers:
Code: Hugging Face LLaVA, PS-VAE
Data: 3D-MoIT, Tartarus, PubChem, PubChemQC, ChEBI, DTP, ZINC
Others: Meta-llama, GraphFP, Uni-Mol, MoleculeSTM
If you find our repo helpful, please consider citing us.
@article{lv2024navigating,
title={Navigating Chemical-Linguistic Sharing Space with Heterogeneous Molecular Encoding},
author={Lv, Liuzhenghao and Li, Hao and Wang, Yu and Yan, Zhiyuan and Chen, Zijun and Lin, Zongying and Yuan, Li and Tian, Yonghong},
journal={arXiv preprint arXiv:2412.20888},
year={2024}
}