This repo lists resources for doing Deep Learning in Python for the Life Sciences. Since the end of 2021 I have observed a steadily growing volume of academic work and Open Source initiatives on topics such as biochemistry, genetics, molecular biology and bioinformatics. With a study background in Biomedical Engineering and Deep Learning, past experience as a Software Engineer, and current work applying ML/DL to real-life use cases in the pharma industry, I found that these new Open Source efforts caught my interest. That is why I decided to start this repository: to give researchers, developers and practitioners a single place to keep track of the latest developments in this space, with a particular focus on biotech and pharma.
Contributions, suggestions and stars are welcome!
Image generated through DALL-E mini by prompting "A fancy protein folding".
👉 Papers and models published before 2023 have been moved to the Archived section of this repo. 👈
Molecules
Proteins
Cheminformatics
Drug Discovery
Datasets
Explainable AI
Other
- pysmiles - A lightweight Python library for reading and writing SMILES strings.
- SmilesDrawer - A Colab notebook to draw molecules from SMILES strings.
- PySMILESUtils - Utilities for working with SMILES based encodings of molecules for Deep Learning (PyTorch oriented).
- SELFIES - Robust representation of semantically constrained graphs, in particular for molecules in chemistry.
- ChemProp - Message Passing Neural Networks for molecule property prediction.
- Evidential Deep Learning for Guided Molecular Property Prediction and Discovery - Fast and scalable uncertainty quantification for neural molecular property prediction, accelerated optimization, and guided virtual screening. [Paper]
- mols2grid - Interactive molecule viewer for 2D structures.
- Image to SMILES Generator - Code to generate datasets of "image - sequence" pairs for chemical molecules. [Article]
- Auto3D - Automatic generation of low-energy 3D structures with ANI neural network potentials.
- Specklit - A Streamlit Component for creating Speck molecular structures within a Streamlit Web app.
- molcloud - A package to draw molecules in a big canvas packed together.
- Img2Mol - Inferring molecules from pictures.
- MOSES - Molecular Sets: a benchmarking platform for Molecular Generation Models. [Paper]
- Tartarus - A benchmarking platform for realistic and practical inverse molecular design. [Paper]
- GraphINVENT - A platform for graph-based molecular generation using graph neural networks.
- Chem Faiss - Combines vector similarity search from Faiss with chemical fingerprinting to build a scalable similarity search architecture for compounds/molecules.
- DECIMER - Deep lEarning for Chemical ImagE Recognition (DECIMER): translates a bitmap image of a molecule into a SMILES string. [Paper]
- DECIMER Image Transformer - The DECIMER (Deep lEarning for Chemical ImagE Recognition) 2.1 project.
- STOUT - Transformer based SMILES to IUPAC Translator.
- CLAMP - Contrastive Language-Assay Molecule Pre-Training: uses natural language to predict the most relevant molecule, given a textual description of a bioassay, without training samples. [Paper]
- molplotly - An add-on to Plotly built on RDKit which allows 2D images of molecules to be shown in Plotly figures when hovering over the data points.
- MolForge - Neural-machine-translation based models that translate a set of various structural fingerprints to conventional text-based molecular representations, such as SMILES and SELFIES. [Paper]
- SELFormer - Molecular Representation Learning via SELFIES Language Models. [Paper]
- Regression Transformer - Concurrent sequence regression and generation for molecular language modelling. [Paper]
- Bio-Diffusion - A PyTorch hub of denoising diffusion probabilistic models designed to generate novel biological data. [Paper]
- InstructMol - Multi-Modal integration for building a versatile and reliable molecular assistant in Drug Discovery. [Paper]
- MHNfs - A few-shot drug discovery model which consists of a context module, a cross-attention module, and a similarity module. [Paper]
- Small Molecule Autoencoders - Architecture Engineering to optimize latent space utility and sustainability. [Paper]
- AlphaFold Protein Structure Database - An online database which provides open access to 992,316 protein structure predictions for the human proteome and other key proteins of interest, to accelerate scientific research.
- AlphaFold - Open source code for DeepMind's AlphaFold.
- OpenFold - Trainable, memory-efficient, and GPU-friendly PyTorch reproduction of AlphaFold 2.
- AlphaFold - single sequence input - A Colab notebook to predict the protein structure from a single sequence (for educational purposes only).
- ColabFold - Making Protein folding accessible to all via Google Colab. [Article]
- LocalColabFold - Running ColabFold on your local PC.
- FastFold - Optimizing Protein Structure Prediction Model Training and Inference on GPU Clusters.
- ESM - Pretrained language models that enable zero-shot prediction of the effects of mutations on protein function. [Paper]
- Graphein - A Python package for producing several types of graph-based representations of proteins, compatible with standard geometric Deep Learning library formats, along with graph objects designed for ease of use with popular Deep Learning libraries.
- alphafold_finetune - Python code for fine-tuning AlphaFold to perform protein-peptide binding predictions.
- Uni-Fold - An Open Source platform for developing protein models beyond AlphaFold. [Paper]
- AF2Rank - State-of-the-art estimation of protein model accuracy using AlphaFold. [Paper]
- DPAM - A Domain Parser for AlphaFold Models. [Paper]
- ModelAngelo - Automatic atomic model building program for Electron cryo-microscopy (cryo-EM) maps. [Paper]
- DiffDock - Diffusion steps, twists, and turns for Molecular Docking. [Paper]
- PDBench - A dataset and software package for evaluating fixed-backbone sequence design algorithms. [Paper]
- vcMSA - A Python library to run vector clustering Multiple Sequence Alignment. [Paper]
- Ankh - An optimized Protein Language Model. [Paper]
- TorchProtein - A Machine Learning library for protein science, built on top of TorchDrug.
- TargetGAN - A deep generative model for drug design from protein target sequence. [Paper]
- Iterative_masking - An iterative method that directly employs the masked language modeling objective to generate sequences using an MSA Transformer. [Paper]
- protpardelle - An all-atom protein generative model. [Paper]
- MassiveFold - A tool that massively expands the sampling of structure predictions by improving the computation of AlphaFold-based predictions. [Paper]
- PLAID - Protein Latent Induced Diffusion. [Paper]
- Cfold - Structure prediction of alternative protein conformations. [Paper]
- PLMSearch - A protein language model that powers accurate and fast sequence search for remote homology. [Paper]
- Raygun - Template-based protein design. [Paper]
- DRFP - An NLP-inspired chemical reaction fingerprint based on basic set arithmetic. [Article]
- DeepChem - A high quality Open Source toolchain that democratizes the use of Deep Learning in drug discovery, materials science, quantum chemistry, and biology.
- CompAugCycleGAN - Augmented CycleGAN used for generating chemical compositions. [Article]
- Chemformer - A pre-trained transformer for computational chemistry.
- RDKit - Open Source toolkit for cheminformatics and Machine Learning.
- Streamlit-app - A Streamlit web app for cheminformatics which also includes an RDKit cheatsheet.
- datamol - A lightweight Python library to work with molecules, built on top of RDKit.
- rxn_yields - Prediction of chemical reaction yields using Deep Learning and data augmentation strategies. [Article]
- gptchem - Using GPT-3 to solve Chemistry problems. [Paper]
- protein_scoring - Computational Scoring and experimental evaluation of enzymes generated by Neural Networks. [Paper]
- Jazzy - A Python library for calculating a set of atomic/molecular descriptors, including the Gibbs free energy of hydration (kJ/mol), its polar/apolar components, and the hydrogen-bond strength of donor and acceptor atoms, using either SMILES or MOL/SDF inputs.
- CRISPRi - Improved prediction of bacterial CRISPRi guide efficiency from depletion screens through mixed-effect Machine Learning and data integration. [Paper]
- TorchDrug - A powerful and flexible PyTorch-based Deep Learning platform for drug discovery.
- COVID-19 Multi-Targeted Drug Repurposing Using Few-Shot - PyTorch implementation of MolGNN Few-shot. [Article] [Paper]
- PaddleHelix - A Bio-Computing Platform featuring Large-Scale Representation Learning and Multi-Task Deep Learning.
- liGAN - Deep generative models of 3D grids for structure-based drug discovery. [Article] [Paper]
- LIMO - Latent Inceptionism for targeted MOlecule Generation: a generative model for drug discovery. [Paper]
- DelFTa - Δ-Quantum Machine Learning for medicinal chemistry. [Paper]
- Fréchet ChemNet Distance - A quality measure for generative models for molecules. [Paper]
- DrugOOD - A systematic OOD (Out-Of-Distribution) dataset curator and benchmark for AI-aided drug discovery. [Paper]
- PIGNet - A Physics-Informed Deep Learning model toward generalized drug-target interaction predictions. [Paper]
- REINVENT - An AI tool for de novo drug design. [Paper]
- MolScore - An automated scoring function to facilitate and standardize evaluation of goal-directed generative models for de novo molecular design. [Article]
- SMILES-RNN - A SMILES-based recurrent neural network used for de novo molecule generation with several reinforcement learning algorithms available for molecule optimization. [Article]
- DiffLinker - Equivariant 3D-Conditional Diffusion Model for molecular linker design. [Paper]
- SQUID - Equivariant shape-conditioned generation of 3D molecules for Ligand-Based Drug Design. [Paper]
- DiffSBDD - A Euclidean diffusion model for structure-based drug design. [Paper]
- MF-PCBA - Multi-fidelity high-throughput screening benchmarks for drug discovery and Machine Learning. [Paper]
- Deep Surrogate Docking - Accelerating automated Drug Discovery with Graph Neural Networks. [Paper]
- HGAN-DTI - Heterogeneous Graph Attention Network for Drug-Target Interaction Prediction. [Paper]
- MolSkill - Learning chemical intuition from humans in the loop. [Paper]
- AI-Bind - Interpretable AI pipeline improving binding predictions for novel protein targets and ligands. [Paper]
- ESP - A general model to predict small molecule substrates of enzymes based on Machine and Deep Learning. [Paper]
- HyperPCM - Robust task-conditioned modeling of drug-target interactions. [Paper]
- Umol - Structure prediction of protein-ligand complexes from sequence information. [Paper]
- sChemNET - A Deep Learning framework for predicting small molecules targeting microRNA function. [Paper]
- DECIMER - Hand-drawn molecule images dataset - A standardised, openly available benchmark dataset of 5088 hand-drawn depictions of diversely picked chemical structures. [Article]
- UniProt - The world’s leading high-quality, comprehensive and freely accessible resource of protein sequence and functional information.
- UniLanguage - Homology reduced UniProt, train-/valid-/testsets for language modeling.
- ChEMBL - A manually curated database of bioactive molecules with drug-like properties. It brings together chemical, bioactivity and genomic data to aid the translation of genomic information into effective new drugs.
- Molecule OCR Real images Dataset - Test dataset from the paper "Image2SMILES: Transformer-based Molecular Optical Recognition Engine". It contains 296 structures: images and Functional Groups SMILES (FG-SMILES). [Paper]
- FS-Mol - A Few-Shot Learning Dataset of Molecules, containing molecular compounds with measurements of activity against a variety of protein targets.
- ProteinNet - A standardized data set for Machine Learning of protein structure.
- SidechainNet - An all-atom protein structure dataset for Deep Learning. It is an extension of the ProteinNet dataset. [Paper]
- DIPS - Database of Interacting Protein Structures. [Paper]
- Aggregated Views of Proteins - Protein data bank in Europe knowledge base.
- ProtCAD - Protein Common Assembly Database. A comprehensive structural resource of protein complexes. [Paper]
- gget - A free, Open Source command-line tool and Python package that enables efficient querying of genomic databases.
- ESM Atlas - An open atlas of 617 million metagenomic protein structures.
- Progres - A Python package to quickly search structures against pre-embedded structural databases and to pre-embed datasets. [Paper]
- ZINC - A free public resource for ligand discovery. The database contains over twenty million commercially available molecules in biologically relevant representations that may be downloaded in popular ready-to-dock formats and subsets. [Paper]
- Papyrus - A large-scale curated dataset aimed at bioactivity predictions. [Paper]
- MISATO - Machine Learning dataset of protein-ligand complexes for structure-based drug discovery. [Paper]
- Interpretable and Explainable Machine Learning for Materials Science and Chemistry - Interpretable and Explainable Machine Learning applied to materials science and chemistry.
- exmol - Explainer for black box models that predict molecule properties. [Article]
- BERTology - Interpreting Attention in Protein Language Models. [Paper]
- DRPreter - Interpretable anticancer drug response prediction using Knowledge-Guided Graph Neural Networks and Transformer. [Paper]
- nglview - A Jupyter widget to interactively view molecular structures and trajectories.
- Panel-Chemistry - Makes exploratory data analysis easy and helps build powerful data and viz tools within the domain of Chemistry using Python and HoloViz Panel.
- libmolgrid - A comprehensive library for fast, GPU accelerated molecular gridding for Deep Learning workflows. [Paper]
- stmol - A component for building interactive molecular 3D visualizations within Streamlit web applications.
- MolecularNodes - An add-on and set of pre-made nodes for Blender & Blender’s Geometry Nodes, to import, animate and manipulate molecular data.
- Jupyter Dock - Molecular Docking integrated in Jupyter notebooks.
- AugLiChem - A data augmentation library for chemical structures. [Paper]
- Chemiscope - An interactive structure/property explorer for materials and molecules. While the core project is implemented in a different programming language, it has been added to this list because it provides Python extensions that allow using it within a Jupyter or Colab notebook.
- ReMODE - A Deep Learning-based web server for target-specific drug design. [Paper]