This repository contains supplementary information to
Alexander-Maurice Illig1,§, Niklas E. Siedhoff1,§, Ulrich Schwaneberg1,2, Mehdi D. Davari3,*,
A hybrid model combining evolutionary probability and machine learning leverages data-driven protein engineering (Working title; To be published)
Preprint available at bioRxiv: https://www.biorxiv.org/content/10.1101/2022.06.07.495081v1
This work is now published as
Evolutionary Probability and Stacked Regressions Enable Data-Driven Protein Engineering with Minimized Experimental Effort,
J. Chem. Inf. Model. 2024, 64, 16, 6350–6360
https://doi.org/10.1021/acs.jcim.4c00704
1Institute of Biotechnology, RWTH Aachen University, Worringer Weg 3, 52074 Aachen, Germany
2DWI-Leibniz Institute for Interactive Materials, Forckenbeckstraße 50, 52074 Aachen, Germany
3Department of Bioorganic Chemistry, Leibniz Institute of Plant Biochemistry, Weinberg 3, 06120 Halle, Germany
*Corresponding author
§Equal contribution
MERGE is a hybrid method that combines evolutionary probability with machine learning to support data-driven protein engineering. It provides trustworthy predictions of a variant's fitness from its sequence, even when only a few screened variants are available.
This repository contains the source files needed to reproduce the results of our manuscript. The notebooks combine the protein sequence encoding with the hybrid prediction presented there (the statistical energy of a DCA model and the predicted fitness of a trained supervised regression model), using two sequence-fitness datasets as examples of a "low-N" and a "substitutional extrapolation" protein engineering task. To reproduce the results of these examples, run the provided Jupyter notebooks. The "substitutional extrapolation" notebook also contains the commands for the preprocessing steps required to create a hybrid model. For all datasets studied, pre-encoded datasets containing the variant identifiers, the corresponding fitness values, and the encoded sequences are provided as CSV files, and the wild-type sequences used are provided as FASTA files (see Data).
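To illustrate the idea of a hybrid prediction, the following is a minimal sketch and not the implementation used in the notebooks. It assumes that a statistical energy (from a DCA model) and a supervised regression prediction are already available for each variant, and it blends the two standardized scores with a single weight chosen by grid search to maximize the Spearman rank correlation with measured fitness. The function names (`fit_blend_weight`, `hybrid_score`), the blending formula, and the grid search are assumptions made for this example; the manuscript describes the actual formulation.

```python
from statistics import mean, pstdev


def ranks(values):
    """Return 1-based ranks; tied values receive their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either input is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation, computed as Pearson correlation of ranks."""
    return pearson(ranks(x), ranks(y))


def zscore(values):
    """Standardize values so both score types are on a comparable scale."""
    mu, sd = mean(values), pstdev(values)
    return [(v - mu) / sd if sd > 0 else 0.0 for v in values]


def hybrid_score(energies, ml_predictions, alpha):
    """Blend standardized scores: alpha * energy + (1 - alpha) * prediction."""
    e, p = zscore(energies), zscore(ml_predictions)
    return [alpha * a + (1 - alpha) * b for a, b in zip(e, p)]


def fit_blend_weight(energies, ml_predictions, fitness, steps=20):
    """Grid-search alpha in [0, 1] for the best Spearman correlation."""
    best_alpha, best_rho = 0.0, float("-inf")
    for step in range(steps + 1):
        alpha = step / steps
        rho = spearman(hybrid_score(energies, ml_predictions, alpha), fitness)
        if rho > best_rho:
            best_alpha, best_rho = alpha, rho
    return best_alpha, best_rho
```

Because the grid includes `alpha = 0` and `alpha = 1`, the blended score ranks the training variants at least as well as either input alone. In this sketch, `hybrid_score` standardizes each input using the statistics of the data it is given, which is adequate for ranking variants but not for predicting absolute fitness values.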
The new repository, which contains the source files of the published manuscript version of MERGE, is available at https://github.com/amillig/MERGE.
With our protein engineering framework PyPEF, the MERGE hybrid method can be applied in a simplified way, alongside other encoding and machine learning-based modeling methods. Variant-fitness datasets that can be used for encoding and modeling with PyPEF are provided at Data/_variant_fitness_wtseq.
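Working with variant-fitness data and a wild-type sequence usually starts by turning each variant identifier into a full sequence. The sketch below shows one way to do this in plain Python and does not use PyPEF. It assumes identifiers of the form `A12G` (wild-type residue, 1-based position, new residue), with multiple substitutions joined by `/`. This format and the helper names `read_fasta_sequence` and `apply_variant` are assumptions for illustration, so check the provided files for the actual conventions.

```python
import re

# Assumed identifier format: wild-type residue, 1-based position, new residue.
SUBSTITUTION = re.compile(r"^([A-Z])(\d+)([A-Z])$")


def read_fasta_sequence(text):
    """Return the first sequence in FASTA-formatted text as one string."""
    parts = []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if parts:
                break  # a second header starts the next record
            continue
        if line:
            parts.append(line)
    return "".join(parts)


def apply_variant(wild_type, variant, separator="/"):
    """Apply substitutions such as 'K2R/L6V' to the wild-type sequence."""
    residues = list(wild_type)
    for token in variant.split(separator):
        match = SUBSTITUTION.match(token.strip())
        if match is None:
            raise ValueError(f"Unrecognized substitution: {token!r}")
        old, position, new = match.group(1), int(match.group(2)), match.group(3)
        if not 1 <= position <= len(residues):
            raise ValueError(f"Position out of range in {token!r}")
        # Guard against numbering offsets between identifiers and sequence.
        if wild_type[position - 1] != old:
            raise ValueError(
                f"{token!r} expects {old} at position {position}, "
                f"found {wild_type[position - 1]}"
            )
        residues[position - 1] = new
    return "".join(residues)
```

The check against the wild-type residue catches a common source of silent errors: identifiers numbered relative to a different reference than the FASTA sequence.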