Understanding Protein-DNA Interactions by Paying Attention to Protein and Genomics Foundation Models -- Code

This is the code corresponding to the experiments conducted for the work "Understanding Protein-DNA Interactions by Paying Attention to Genomic Foundation Models" where we use the ESM2 and GPN foundation models coupled with a Cross-Attention module to learn Protein-DNA contacts at single amino acid and single nucleotide resolution.

Requirements

Create an environment (users will be prompted to select Conda or Virtualenv )and install all dependencies by running the create_env.sh script. To do so, follow the below steps:

chmod +x ./create_env.sh
./create_env.sh
You will be prompted to choose Conda or Virtualenv, and an environment name, and the script should install everything else.

Datasets

We extracted Protein-Nucleic acid complex data from the NAKB database. Our processed dataset (pickle) can be downloaded here. We provide a helper Jupyter notebook (notebooks/read_and_save_data.ipynb) to create your own dataset and process it to be compatible with our pipeline.

Files

├── notebooks/

af3_read.ipynb: Notebook for reading and analyzing AlphaFold 3 data.
read_and_save_data.ipynb: Notebook to download 3D complex pdbs from NAKB, process to obtain sequences and contact_maps and save them into a pickled list.
inference.ipynb: Notebook to run inference on trained models and analyze predictions.

├── utils/

loss.py file: Implements the BCE and Weighted BCE loss with support for attention masking
data.py file: Code for all data handling operations including creating custom dataloaders for handling variable length (sequence,contact_map) pairs.
model.py file: Implements the Cross-Attention module for contact-map prediction
finetune.py file: Custom wrappers for the ESM2 and GPN models allowing a user to select the number of layers to unfreeze and the model size (for the ESM2 family)
util.py file: Helper functions to set a seed for reproducibility, and to compute distance and contact maps from 3D co-ordinates
train.py file: Code to train the whole pipeline, and to evaluate it. Saves 10 randomly sampled attention map overlays (GT vs pred), all model gradients, loss values and metrics for training and evaluation sets.
plots.py file: Code for plotting metrics as well as contact maps.

└── main.py

main_train.py file: Wrapper file for an end to end run.

└── run_training.sh

run_training.sh: Script to run the main.py file in background using nohup

Misc:

checkpoints/ directory (user needs to create this): Model checkpoints will be saved here by default.
runs/ directory (user needs to create this): Tensorboard logs saved will be saved here by default.
logs/ directory (user needs to create this): stdout and stderr files for any run will be merged and saved here.
data/dna_protein_dataset.pkl: Our processed DNA-Protein complex dataset, available at this link.
data/train_test_clusters.pkl: Train test split indices based on DNA sequence-similarity based clustering (using mmseq2) .
data/AF3_pdbs.txt: Subset of AlphaFold3 test set which we used for our baseline comparison.

Usage

Training using default hyperparemeters (Save your dataset as dna_xxx.pkl in the Data directory)

To train a model using default hyperparameters and save checkpoints, tensorboard logs and evaluate, use: (Note: The CUBLAS part is for reproducibility)

CUBLAS_WORKSPACE_CONFIG=:4096:8 python3 -u main_train.py --log_suffix 'my_suffix' --lr 0.00001 --num_epochs 1000 --train_batch_size 8 --eval_batch_size 8 --finetune_protein_model --finetune_na_model --dna_num_layers_to_unfreeze 1 --protein_num_layers_to_unfreeze 1

To train a model in the background using nohup, set your environment path in the run_training.sh file, verify the argparse agruments you want for the run and use:

nohup ./run_training.sh > logs/my_suffix.out 2>&1 &

Further Documentation

See the code documentation for more details. main_train.py can be called with the -h option for additional help.

Hyperparameters

Relevant Hyperparameters for setting up training:

d_k: Dimension of the key vector in our cross-attention module
protein_num_layers_to_unfreeze : These many last layers of the Protein model will be kept trainable
dna_num_layers_to_unfreeze : These many last layers of the DNA model will be kept trainable
log_suffix: This will be appended to the path of your model's checkpoints and runs

Citation

Please consider citing our work using the following bibtex entry,

@inproceedings{seq2contact,
  title     = {Understanding Protein-DNA Interactions by Paying Attention to Protein and Genomics Foundation Models},
  author    = {Rajwade, Dhruva and Wang, Erica, and Satpathy, Aryan and Brace, Alexander and Guo, Hongyu and Ramanathan, Arvind and Liu, Shengchao and Anandkumar, Anima},
  booktitle = {Foundation Models for Science Workshop, NeurIPS},
  year      = {2024},
  url       = {https://openreview.net/pdf?id=NifnqNdpaf}
}

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Understanding Protein-DNA Interactions by Paying Attention to Protein and Genomics Foundation Models -- Code

Requirements

Datasets

Files

├── notebooks/

├── utils/

└── main.py

└── run_training.sh

Misc:

Usage

Training using default hyperparemeters (Save your dataset as dna_xxx.pkl in the Data directory)

Further Documentation

Hyperparameters

Citation

About

Releases

Packages

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 11 Commits
data		data
notebooks		notebooks
utils		utils
.gitignore		.gitignore
README.md		README.md
create_env.sh		create_env.sh
main_train.py		main_train.py
run_training.sh		run_training.sh

DhruvaRajwade/Seq2Contact

Folders and files

Latest commit

History

Repository files navigation

Understanding Protein-DNA Interactions by Paying Attention to Protein and Genomics Foundation Models -- Code

Requirements

Datasets

Files

├── notebooks/

├── utils/

└── main.py

└── run_training.sh

Misc:

Usage

Training using default hyperparemeters (Save your dataset as dna_xxx.pkl in the Data directory)

Further Documentation

Hyperparameters

Citation

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages