Ontology Prediction 🧬 🎱

This is a base app that predicts Molecular Function Gene Ontology (GO) terms for a given protein sequence. It uses bio-transformers with a ProtBert (prot_bert) model as the backend to compute sequence embeddings. The work is based on the paper "Unsupervised protein embeddings outperform hand-crafted sequence and structure features at predicting molecular function". The training data comes from the paper's source code; I rebundled a portion of it and made it available in biodatasets, DeepChain's open source dataset library. Because of compute limitations, the model has not been trained on the full dataset, and only a very limited number of embeddings are pre-computed. Hopefully this will change in the near future. A full blog post on how I built this app is available here.

The default model is a multi-layer perceptron (MLP) with an input size of 1024, a single fully connected hidden layer of 512 nodes, and an output dimension of 256, the number of GO classes in the PDB dataset. The diagram below captures this model, with an arbitrary-length protein sequence as the example input.
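To make the layer sizes concrete, here is a minimal sketch of a forward pass through an MLP of that shape. It is written in plain Python so it runs without dependencies and is not the app's actual implementation. The ReLU activation, the random weight initialisation, and the helper names (`init_layer`, `linear`, `mlp_forward`) are assumptions for illustration; only the three dimensions come from the description above.

```python
import random

INPUT_DIM, HIDDEN_DIM, OUTPUT_DIM = 1024, 512, 256  # sizes quoted in the text


def init_layer(n_in, n_out, rng):
    """Return (weights, biases) for a fully connected layer with small random weights."""
    scale = 1.0 / n_in ** 0.5
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def linear(x, layer):
    """Compute W @ x + b for a single input vector."""
    weights, biases = layer
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]


def relu(x):
    # Assumption: the README does not name the hidden activation.
    return [max(0.0, v) for v in x]


def mlp_forward(embedding, hidden_layer, output_layer):
    """Map one 1024-d sequence embedding to 256 raw class scores."""
    hidden = relu(linear(embedding, hidden_layer))
    return linear(hidden, output_layer)


rng = random.Random(0)
hidden_layer = init_layer(INPUT_DIM, HIDDEN_DIM, rng)
output_layer = init_layer(HIDDEN_DIM, OUTPUT_DIM, rng)

# Stand-in for a real fixed-length embedding of one protein sequence.
embedding = [rng.gauss(0.0, 1.0) for _ in range(INPUT_DIM)]
scores = mlp_forward(embedding, hidden_layer, output_layer)

# Weights plus biases of both layers.
n_params = INPUT_DIM * HIDDEN_DIM + HIDDEN_DIM + HIDDEN_DIM * OUTPUT_DIM + OUTPUT_DIM
print(len(scores), n_params)  # 256 656128
```

Whatever the sequence length, the model only ever sees the fixed-size embedding, which is why a plain MLP with roughly 656k parameters is enough for the classification head.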

The paper this app is based on argues that sequence information captured by unsupervised embeddings is not improved by adding structural information about the protein. Regardless, I think it could be an interesting extension to include structural information generated by DeepChain as additional predictive input for the model. This could then be tested against approaches based purely on sequence embeddings.

If you'd like to play around and create your own app, check out the DeepChain Apps GitHub repo for guidance. I am also writing a blog post about how this app was created. I have some ideas for other apps that could be interesting and am open to collaboration. Feel free to reach out to me!

Example

The app is designed to be easy to use: pass it a list of sequences, and it computes their embeddings automatically with the bio-transformers ProtBert model. It can be run as follows:

    from src.app import App  # assumed import path; adjust to where App is defined

    # Adjacent string literals are concatenated, so each sequence stays whitespace-free.
    seq = [
        (
            "PKIVILPHQDLCPDGAVLEANSGETILDAALRNGIEIEHACEKSCACTTCHCIVREGF"
            "DSLPESSEQEDDMLDKAWGLEPESRLSCQARVTDEDLVVEIPRYTINHARE"
        ),
        (
            "PMILGYWNVRGLTHPIRLLLEYTDSSYEEKRYAMGDAPDYDRSQWLNEKFKLGLDFPN"
            "LPYLIDGSRKITQSNAIMRYLARKHHLCGETEEERIRVDVLENQAMDTRLQLAMVCYS"
            "PDFERKKPEYLEGLPEKMKLYSEFLGKQPWFAGNKITYVDFLVYDVLDQHRIFEPKCL"
            "DAFPNLKDFVARFEGLKKISDYMKSGRFLSKPIFAKMAFWNPK"
        ),
    ]
    app = App(device="cpu")  # use App() instead to run on GPU
    score_dict = app.compute_scores(seq)
    print(score_dict)

The app then outputs a dictionary of PyTorch tensors. 'Class' holds one row per input sequence, with a score for each class, and thus each function, that the model believes the protein sequence codes for. 'Embedding' holds the corresponding embedding for each sequence. Since protein sequences can code for multiple functions, and multiple proteins can code for similar functions, there will likely be a range of scores! An example of the output for the above sequences is:

    {
        'Class': 
            tensor(
                [[ 0.0256, -0.0338, -0.0292, -0.0446, -0.0345,  0.0050,  0.0508, -0.0102,
                ... 0.0110,  0.0077,  0.0633,  0.0896, -0.0260,  0.0021, -0.1576, -0.0036],
                [ 0.1083, -0.1110, -0.0437, -0.0807,  0.0331, -0.0005,  0.0704, -0.0314,
                ... -0.0331,  0.0576, -0.0238,  0.0333, -0.0444, -0.0256, -0.1072, -0.0553]],
                grad_fn=<AddmmBackward>),

        'Embedding': 
            tensor(
                [[0.0000, 0.0000, 0.0000,  ..., 0.0000, 0.1248, 0.0276],
                [0.0000, 0.1116, 0.0000,  ..., 0.0000, 0.0000, 0.0356]],
                grad_fn=<MulBackward0>)
    }
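One way to read the 'Class' scores is sketched below. It assumes they are raw logits from a multi-label classifier, so each score is passed through a sigmoid independently and the highest-ranked class indices are reported. That interpretation, and the helper name `top_classes`, are assumptions; the mapping from class index to GO term is not shown here.

```python
import math


def sigmoid(x):
    """Squash a raw score into the (0, 1) range."""
    return 1.0 / (1.0 + math.exp(-x))


def top_classes(scores, k=3):
    """Return the k (class_index, probability) pairs with the highest probability."""
    probs = [sigmoid(s) for s in scores]
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# First eight 'Class' scores of the first sequence in the example output above.
row = [0.0256, -0.0338, -0.0292, -0.0446, -0.0345, 0.0050, 0.0508, -0.0102]
for index, prob in top_classes(row):
    print(index, round(prob, 3))
```

Under this reading, scores this close to zero all map to probabilities near 0.5, which would be consistent with a model that has seen very little training.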

If you look more closely at the code, you'll notice that you can also supply your own dataset by setting the dataset parameter when initialising the app. The dataset needs to follow the format of the ontologyprediction dataset that I have made available in biodatasets. Of course, you can also modify this app by forking my GitHub repo and deploying your own DeepChain app!

The Dataset

The dataset used during development of this app is largely based on the subset of Protein Data Bank (PDB) data provided by the paper "Unsupervised protein embeddings outperform hand-crafted sequence and structure features at predicting molecular function". The authors describe in detail how they filtered sequences for use with their models. The main point to consider is that they "considered proteins with sequence length in the range [40, 1000] that had GO annotations in the Molecular Function Ontology (MFO) with evidence codes 'EXP', 'IDA', 'IPI', 'IMP', 'IGI', 'IEP', 'HTP', 'HDA', 'HMP', 'HGI', 'HEP', 'IBA', 'IBD', 'IKR', 'IRD', 'IC' and 'TAS'."
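The filtering rule quoted above can be sketched as a simple predicate. This is an illustration of the stated criteria, not the authors' code: the record layout (a dict with `sequence` and `annotations` keys), the example records, and the function name `keep_record` are all hypothetical, and the length range is assumed to be inclusive.

```python
ALLOWED_EVIDENCE = {
    "EXP", "IDA", "IPI", "IMP", "IGI", "IEP", "HTP", "HDA", "HMP",
    "HGI", "HEP", "IBA", "IBD", "IKR", "IRD", "IC", "TAS",
}
MIN_LEN, MAX_LEN = 40, 1000  # assumed inclusive, following the [40, 1000] notation


def keep_record(record):
    """Keep a protein if its length is in range and it has an accepted MFO annotation."""
    if not MIN_LEN <= len(record["sequence"]) <= MAX_LEN:
        return False
    return any(code in ALLOWED_EVIDENCE for _, code in record["annotations"])


# Hypothetical records; the (GO term, evidence code) pairs are illustrative only.
records = [
    {"id": "p1", "sequence": "M" * 120, "annotations": [("GO:0005515", "IPI")]},
    {"id": "p2", "sequence": "M" * 20, "annotations": [("GO:0005515", "IDA")]},   # too short
    {"id": "p3", "sequence": "M" * 300, "annotations": [("GO:0003824", "IEA")]},  # code not accepted
]
kept = [r["id"] for r in records if keep_record(r)]
print(kept)  # ['p1']
```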

Future Work & Contributing

This app has plenty of room for improvement. For one, the original paper used ELMo embeddings to train and evaluate its models. The authors also compared a wide variety of models and tested them with fairly rigorous empirical procedures, for example 5-fold cross-validation. They were also able to train on the full SwissProt dataset. That dataset can be added in the future, but computing the embeddings on a local machine will take a fair amount of compute. Feel free to do this and add the embeddings to the open access biodatasets portal 😄 🧬.
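For anyone wanting to add a similar evaluation here, the sketch below shows a plain shuffled k-fold split over sample indices. It is a generic illustration and makes no claim about how the paper built its folds; the function name `k_fold_indices` and the sample count are hypothetical.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_indices, test_indices) for each of k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


for fold, (train, test) in enumerate(k_fold_indices(n_samples=23)):
    print(fold, len(train), len(test))
```

Each sample lands in exactly one test fold, so every protein is used for evaluation once and for training in the remaining four rounds.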

This app welcomes open source contributions. Please connect with me on the public GitHub repo to discuss ideas for making it more generally useful. One idea for expansion is to provide more models for the multi-label classification task. The original paper compared multiple models for GO classification with unsupervised embeddings, including the MLP presented here. Other interesting candidates include GNNs and models that combine sequence information with 3D structural information.

Author

St John Grimbly
Research Engineer Intern | InstaDeep
MSc Applied Mathematics | University of Cape Town
