This repository contains code for reproducing the work in the paper "Enzyme Activity Prediction of Sequence Variants on Novel Substrates using Improved Substrate Encodings and Convolutional Pooling" (https://proceedings.mlr.press/v165/xu22a.html). The paper proposes a new compound-protein interaction prediction pipeline and tests its performance on datasets obtained from "Machine learning modeling of family wide enzyme-substrate specificity screens" (arXiv:2109.03900v1, by S. Goldman and C. W. Coley). The pipeline is based on sequence embeddings generated by protein language models and count encodings of molecule fingerprints.
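To illustrate what a count encoding of a molecule fingerprint means, here is a minimal sketch in plain Python. It is not the repository's implementation: the function names, the integer substructure identifiers, and the 16-slot vector length are all hypothetical, chosen only to contrast a count encoding with the more common binary (presence/absence) encoding.

```python
def count_fingerprint(substructure_ids, n_slots=16):
    """Fold integer substructure identifiers into a fixed-length count vector.

    Hypothetical helper: each identifier stands in for one substructure
    occurrence found in a molecule. Repeated occurrences increase the count.
    """
    vec = [0] * n_slots
    for sid in substructure_ids:
        vec[sid % n_slots] += 1
    return vec


def binary_fingerprint(substructure_ids, n_slots=16):
    """Presence/absence version of the same vector, shown for comparison."""
    return [1 if c > 0 else 0 for c in count_fingerprint(substructure_ids, n_slots)]


# Made-up identifiers: substructure 3 occurs three times in the molecule.
example_ids = [3, 3, 3, 19, 40]
counts = count_fingerprint(example_ids)
bits = binary_fingerprint(example_ids)
```

In this toy example, the count vector records how often each slot was hit, while the binary vector only records that it was hit at all, so the count encoding keeps information about repeated substructures that the binary encoding discards.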
The figure below shows the prediction model's architecture.
Tested on multiple enzyme-substrate-activity datasets (e.g., aminotransferase, kinase, halogenase, and phosphatase), the new pipeline showed substantial improvements in prediction performance, as shown in the table below.