An evolutionary probability-based hybrid model to support protein engineering campaigns

Protein-Engineering-Framework/MERGE

This repository contains supplementary information to

Alexander-Maurice Illig¹,§, Niklas E. Siedhoff¹,§, Ulrich Schwaneberg¹,², Mehdi D. Davari³,*,
A hybrid model combining evolutionary probability and machine learning leverages data-driven protein engineering (Working title; To be published)
Preprint available at bioRxiv: https://www.biorxiv.org/content/10.1101/2022.06.07.495081v1

now published as

Evolutionary Probability and Stacked Regressions Enable Data-Driven Protein Engineering with Minimized Experimental Effort,
J. Chem. Inf. Model. 2024, 64, 16, 6350–6360
https://doi.org/10.1021/acs.jcim.4c00704

¹Institute of Biotechnology, RWTH Aachen University, Worringer Weg 3, 52074 Aachen, Germany
²DWI-Leibniz Institute for Interactive Materials, Forckenbeckstraße 50, 52074 Aachen, Germany
³Department of Bioorganic Chemistry, Leibniz Institute of Plant Biochemistry, Weinberg 3, 06120 Halle, Germany
*Corresponding author
§Equal contribution

MERGE

MERGE is a hybrid method that combines evolutionary probability with machine learning to support data-driven protein engineering. It provides trustworthy predictions of a variant's fitness from its sequence, even when only a few screened variants are available.

This repository contains the source files needed to reproduce the results of our manuscript. The approach combines the presented form of protein sequence encoding with a hybrid prediction that draws on two sources: the statistical energy of a DCA model and the fitness predicted by a trained supervised regression model. Two sequence-fitness datasets serve as examples, one for a "low-N" and one for a "substitutional extrapolation" protein engineering task. To reproduce the example results, run the provided Jupyter notebooks. The "substitutional extrapolation" notebook also contains the commands for the preprocessing tasks required to create a hybrid model. For all datasets studied, the already encoded datasets (variant identifiers, corresponding fitness values, and encoded sequences) are provided as CSV files, and the wild-type sequences used are provided as FASTA files (see Data).
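To illustrate the idea of a hybrid prediction, the following is a minimal sketch, not the code used in the notebooks. It assumes that each variant already has a DCA-based statistical energy and a regression-predicted fitness, that both are on comparable scales, and that the two are blended with a single weight chosen by grid search to maximize the Spearman rank correlation with measured fitness. The function names (`hybrid_prediction`, `fit_weight`), the weighting scheme, and the toy numbers are all hypothetical.

```python
def ranks(values):
    """Return 1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(a, b):
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / (var_a * var_b) ** 0.5


def spearman(a, b):
    """Spearman rank correlation, computed as Pearson on the ranks."""
    return pearson(ranks(a), ranks(b))


def hybrid_prediction(energies, ml_predictions, weight):
    """Blend both sources (assumed scheme): weight 0 = DCA only, 1 = regression only."""
    return [(1 - weight) * e + weight * p
            for e, p in zip(energies, ml_predictions)]


def fit_weight(energies, ml_predictions, fitness, steps=20):
    """Grid-search the blending weight that maximizes Spearman correlation."""
    best_weight, best_rho = 0.0, -2.0
    for s in range(steps + 1):
        weight = s / steps
        rho = spearman(hybrid_prediction(energies, ml_predictions, weight), fitness)
        if rho > best_rho:
            best_weight, best_rho = weight, rho
    return best_weight, best_rho


# Toy values for five variants (hypothetical, for illustration only).
energies = [0.1, 0.4, 0.3, 0.9, 0.7]        # DCA statistical energies
ml_predictions = [0.2, 0.1, 0.5, 0.6, 0.9]  # regression model outputs
fitness = [1.0, 2.0, 3.0, 4.0, 5.0]         # measured fitness

best_weight, best_rho = fit_weight(energies, ml_predictions, fitness)
print(f"weight={best_weight:.2f}, Spearman rho={best_rho:.3f}")
```

Because the grid includes the weights 0 and 1, the blended prediction in this sketch can never rank the training variants worse than either source alone. In practice the weight should be selected on data held out from the final evaluation.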

The new repository, which contains the source files of the published manuscript version of MERGE, is available at https://github.com/amillig/MERGE.

Framework Implementation

Our protein engineering framework PyPEF offers a simplified way to apply the MERGE hybrid method alongside other encoding and machine learning-based modeling methods. Variant-fitness datasets that can be used for encoding and modeling with PyPEF are provided at Data/_variant_fitness_wtseq.
