MERGE

This repository contains supplementary information to

Alexander-Maurice Illig¹,§, Niklas E. Siedhoff¹,§, Ulrich Schwaneberg¹,², Mehdi D. Davari³,*,
A hybrid model combining evolutionary probability and machine learning leverages data-driven protein engineering (Working title; To be published)
Preprint available at bioRxiv: https://www.biorxiv.org/content/10.1101/2022.06.07.495081v1

now published as

Evolutionary Probability and Stacked Regressions Enable Data-Driven Protein Engineering with Minimized Experimental Effort,
J. Chem. Inf. Model. 2024, 64, 16, 6350–6360
https://doi.org/10.1021/acs.jcim.4c00704

¹ Institute of Biotechnology, RWTH Aachen University, Worringer Weg 3, 52074 Aachen, Germany
² DWI-Leibniz Institute for Interactive Materials, Forckenbeckstraße 50, 52074 Aachen, Germany
³ Department of Bioorganic Chemistry, Leibniz Institute of Plant Biochemistry, Weinberg 3, 06120 Halle, Germany
* Corresponding author
§ Equal contribution

MERGE

MERGE is a hybrid method that combines evolutionary probability with machine learning to support data-driven protein engineering. It predicts the fitness of a variant from its sequence and is designed to give trustworthy predictions even when only a few screened variants are available.

This repository contains the source files needed to reproduce the results of our manuscript. The presented approach combines a form of protein sequence encoding with a hybrid prediction, which draws on both the statistical energy of a DCA model and the fitness predicted by a trained supervised regression model. It is demonstrated on two sequence-fitness datasets that serve as examples of a "low-N" and a "substitutional extrapolation" protein engineering task. To reproduce the results of these examples, run the provided Jupyter notebooks. The "substitutional extrapolation" notebook also contains the commands for the preprocessing tasks required to create a hybrid model. For all datasets studied, pre-encoded datasets containing the variant identifiers, the corresponding fitness values, and the encoded sequences are provided as CSV files, and the wild-type sequences used are provided as FASTA files (see Data).
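
To make the idea of a hybrid prediction more concrete, the following is a minimal sketch and not the implementation used in the notebooks. It assumes, purely for illustration, that the hybrid score is a weighted sum of a statistical energy and a ridge regression prediction, that the weight `beta_1` is chosen by a simple grid search maximizing the Spearman rank correlation on the training variants, and that the energy of the toy data is the sum of the encoded features. The data are synthetic, and all function and variable names are hypothetical.

```python
import numpy as np


def ranks(values):
    """Return 1-based ranks (no tie handling; adequate for continuous values)."""
    order = np.argsort(values)
    out = np.empty(len(values))
    out[order] = np.arange(1, len(values) + 1)
    return out


def spearman(a, b):
    """Spearman rank correlation, computed as the Pearson correlation of the ranks."""
    return float(np.corrcoef(ranks(a), ranks(b))[0, 1])


def fit_ridge(X, y, alpha=1.0):
    """Closed-form ridge regression; the intercept is handled by centering."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc = X - x_mean
    w = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ (y - y_mean))
    b = y_mean - x_mean @ w
    return w, b


def hybrid_score(energy, regression_pred, beta_1):
    """Assumed weighting: beta_1 * energy + (1 - beta_1) * regression prediction."""
    return beta_1 * energy + (1.0 - beta_1) * regression_pred


# Synthetic stand-in for an encoded sequence-fitness dataset (not real data).
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 8))                      # encoded sequences
y = X @ rng.normal(size=8) + 0.1 * rng.normal(size=60)  # fitness values
energy = X.sum(axis=1)                            # toy "statistical energy" (assumption)

# "Low-N"-style split: few training variants, many test variants.
train, test = slice(0, 20), slice(20, 60)
w, b = fit_ridge(X[train], y[train])
ridge_train = X[train] @ w + b
ridge_test = X[test] @ w + b

# Grid search over the weight, scored on the training variants only.
grid = np.linspace(0.0, 1.0, 21)
train_rhos = [
    spearman(y[train], hybrid_score(energy[train], ridge_train, b1)) for b1 in grid
]
best_beta_1 = float(grid[int(np.argmax(train_rhos))])
best_train_rho = max(train_rhos)

test_rho = spearman(y[test], hybrid_score(energy[test], ridge_test, best_beta_1))
print(f"beta_1 = {best_beta_1:.2f}, test Spearman = {test_rho:.3f}")
```

Because `beta_1 = 0` is part of the grid, the selected hybrid never ranks the training variants worse than the regression model alone. How well it generalizes depends on the data and on how informative the energy term is.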

The new repository, which contains the source files of the published manuscript version of MERGE, is available at https://github.com/amillig/MERGE.

Framework Implementation

Our protein engineering framework PyPEF offers a simplified way to apply the MERGE hybrid method alongside other encoding and machine-learning-based modeling methods. Variant-fitness datasets that can be used for encoding and modeling with PyPEF are provided at Data/_variant_fitness_wtseq.

