This repository is outdated. The new and extended repository for the current version of the paper can be found here.
This repository contains the code and dataset for the article "Proteasomal cleavage prediction: state-of-the-art and future directions".
Epitope vaccines are a promising approach for precision treatment of pathogens, cancer, autoimmune diseases, and allergies. Effectively designing such vaccines requires accurate proteasomal cleavage prediction to ensure that the epitopes included in the vaccine trigger an immune response. The performance of proteasomal cleavage predictors has been steadily improving over the past decades owing to increasing data availability and methodological advances. In this review, we summarize the current proteasomal cleavage prediction landscape and, in light of recent progress in the field of deep learning, develop and compare a wide range of recent architectures and techniques, including long short-term memory (LSTM), transformers, and convolutional neural networks (CNN), as well as four different denoising techniques. All open-source cleavage predictors re-trained on our dataset performed within two AUC percentage points of each other. Our comprehensive deep learning architecture benchmark improved performance by 0.4 AUC percentage points, while closed-source predictors performed considerably worse. We found that a wide range of architectures and training regimes all result in very similar performance, suggesting that the specific modeling approach employed has a limited impact on predictive performance compared to the specifics of the dataset employed. We speculate that the noise and implicit nature of data acquisition techniques used for training proteasomal cleavage prediction models and the complexity of biological processes of the antigen processing pathway are the major limiting factors. While biological complexity can be tackled by more data and, to a lesser extent, better models, noise and randomness inherently limit the maximum achievable predictive performance.
The repository is structured as follows:

- `preprocessing` includes the notebooks that shuffle and split the data into C- and N-terminal train, evaluation, and test splits, and perform 3-mer and other basic preprocessing steps, such as tokenization
- `data` holds the `.csv` and `.tsv` train, evaluation, and test files, as well as a vocabulary file
- `denoise/divide_mix` holds our adjusted implementation of the DivideMix algorithm
  - try-out runs (e.g., testing the impact of varying epochs, the weight of the unlabeled loss, and the sample distributions after Gaussian Mixture Model separation) can be found under `denoise/dividemix_tryout_debug`
- all other tested denoising methods are implemented directly in the notebooks
- `models/hyperparam_search` holds the training and hyperparameter search implementation of the asynchronous hyperband algorithm using Ray Tune (a minimal sketch follows this list)
- `models/final` holds the final training and evaluation structure for the model architectures paired with all denoising approaches; the ablation experiments are also included there
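For illustration, below is a minimal sketch of how such a search can be wired up with Ray Tune's classic `tune.run`/`tune.report` API (exact calls vary across Ray versions). The `train_model` trainable and the search-space ranges are illustrative assumptions, not the configuration used in the notebooks.

```python
# Minimal Ray Tune + asynchronous hyperband (ASHA) sketch; trainable and
# search space are hypothetical stand-ins, not the notebooks' exact setup.
import random

from ray import tune
from ray.tune.schedulers import ASHAScheduler

def train_model(config):
    # Stand-in trainable: the real notebooks train a model here and report
    # the validation AUC after every epoch so ASHA can stop weak trials early.
    val_auc = 0.5
    for _ in range(config["max_epochs"]):
        val_auc = min(val_auc + config["lr"] * random.random(), 1.0)
        tune.report(val_auc=val_auc)

search_space = {
    "lr": tune.loguniform(1e-4, 1e-2),       # hypothetical ranges
    "hidden_size": tune.choice([64, 128, 256]),
    "max_epochs": 30,
}

analysis = tune.run(
    train_model,
    config=search_space,
    metric="val_auc",
    mode="max",
    # ASHA terminates poorly performing trials early instead of running them to completion.
    scheduler=ASHAScheduler(max_t=30, grace_period=3, reduction_factor=3),
    num_samples=64,                           # number of sampled configurations
)
print(analysis.best_config)
```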
All notebooks are named as follows: the applicable terminal, i.e. `c` or `n`, followed by the model architecture, e.g. `bilstm`, followed by the denoising method, e.g. `dividemix`.

- Example: `c_bilstm_dividemix.ipynb`
The following model architectures are implemented:

- BiLSTM, called `bilstm` (a minimal sketch appears after this list)
- BiLSTM with Attention, called `bilstm_att`
- BiLSTM with pre-trained Prot2Vec embeddings, called `bilstm_prot2vec`
- Attention-enhanced CNN, called `cnn`
- BiLSTM with ESM2 representations as embeddings, called `bilstm_esm2`
- Fine-tuning of ESM2, called `esm2`
- BiLSTM with T5 representations as embeddings, called `bilstm_t5`
- Base BiLSTM with various trained tokenizers
  - Byte-level byte-pair encoders with vocabulary sizes 1000 and 50000, called `bilstm_bbpe1` and `bilstm_bbpe50`
  - WordPiece tokenizer with vocabulary size 50000, called `bilstm_wp50`
- BiLSTM with forward-backward representations as embeddings, called `bilstm_fwbw`
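For illustration, below is a minimal PyTorch sketch of the kind of BiLSTM window classifier the `bilstm` notebooks build; the vocabulary size, layer dimensions, and dropout rate are illustrative assumptions rather than the tuned values from the hyperparameter search.

```python
import torch
import torch.nn as nn

class BiLSTMClassifier(nn.Module):
    """Minimal BiLSTM over a tokenized residue window, predicting cleavage vs. no cleavage.

    All sizes are illustrative choices, not the tuned values from the notebooks.
    """

    def __init__(self, vocab_size=25, embed_dim=32, hidden_dim=128, dropout=0.3):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.dropout = nn.Dropout(dropout)
        self.classifier = nn.Linear(2 * hidden_dim, 1)  # 2x: forward + backward direction

    def forward(self, tokens):
        # tokens: (batch, window_len) integer-encoded amino acids
        x = self.embedding(tokens)
        _, (h_n, _) = self.lstm(x)               # h_n: (2, batch, hidden_dim)
        h = torch.cat([h_n[0], h_n[1]], dim=-1)  # final states of both directions
        return self.classifier(self.dropout(h)).squeeze(-1)  # one cleavage logit per window

model = BiLSTMClassifier()
logits = model(torch.randint(1, 25, (8, 10)))  # batch of 8 windows of length 10
probabilities = torch.sigmoid(logits)
```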
The following denoising methods are implemented:

- Co-Teaching, called `coteaching`
- Co-Teaching+, called `coteaching_plus`
- JoCoR, called `jocor`
- Noise Adaptation Layer, called `nad` (a minimal sketch appears after this list)
- DivideMix, called `dividemix`
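The noise adaptation layer (Goldberger and Ben-Reuven, 2017) stacks a learned label-transition matrix on top of the base classifier, so that training against noisy labels marginalizes over the unknown true label; at test time the extra layer is dropped. Below is a minimal sketch of the idea; the class count, initialization, and shapes are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NoiseAdaptationLayer(nn.Module):
    """Learned transition p(noisy label | true label) stacked on the clean class posterior.

    Initialized close to the identity, i.e., training starts from the assumption
    that most labels are correct. Shapes and defaults are illustrative.
    """

    def __init__(self, num_classes=2, init_identity_weight=5.0):
        super().__init__()
        # Unnormalized transition logits; softmax over dim=1 yields rows of p(noisy | true).
        self.transition = nn.Parameter(init_identity_weight * torch.eye(num_classes))

    def forward(self, clean_log_probs):
        # clean_log_probs: (batch, num_classes) log-probabilities from the base model
        trans = F.softmax(self.transition, dim=1)  # (true, noisy)
        noisy_probs = clean_log_probs.exp() @ trans  # marginalize over the true label
        return torch.log(noisy_probs + 1e-8)  # NLL is taken against the noisy labels

# During training, the loss is computed on the noisy-label distribution;
# at test time the noise layer is dropped and clean_log_probs are used directly.
```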
Benchmark results (AUC in %):

| Method | C-terminal | N-terminal |
|---|---|---|
| PCPS | 51.3 | 50.0 |
| PCM | 64.5 | 52.4 |
| NetChop 3.1 (20S) | 66.1 | 52.7 |
| NetChop 3.1 (C-term) | 81.5 | 51.0 |
| SVM* | 84.8 | 73.2 |
| PCM* | 85.3 | 75.5 |
| Logistic Regression* | 86.2 | 76.2 |
| NetCleave* | 86.9 | 76.4 |
| PUUPL* | 87.2 | 78.0 |
| Pepsickle* | 88.1 | 78.9 |
| Our BiLSTM (6+4) | 88.5 | 79.5 |
| Our BiLSTM (28+28) | 92.3 | 89.3 |
\* Method has been re-trained from scratch on our dataset. For the other methods, we used published pre-trained models (NetChop, PCM) or web-server functionality (PCPS).
The implementations are based on the following work:

- BiLSTM model architecture based on Ozols et al., 2021
- Model architecture based on Liu and Gong, 2019 (GitHub)
- Model architecture based on Li et al., 2020 (repository available via the download section on the homepage)
- Prot2Vec embeddings based on Asgari and Mofrad, 2015 (available on GitHub)
- Sequence encoder model architecture based on Heigold et al., 2016
- Model architecture based on DeepCalpain (Liu et al., 2019) and Terminitor (Yang et al., 2020)
- T5 encoder taken from Elnaggar et al., 2020 (GitHub; model on the Hugging Face Hub)
- ESM2 taken from Lin et al., 2022 (GitHub)
- Noise adaptation layer implementation based on Goldberger and Ben-Reuven, 2017, and an unofficial implementation on GitHub
- Co-teaching loss function and training process adaptations based on Han et al., 2018, and the official implementation on GitHub
- Co-teaching+ loss function and training process adaptations based on Yu et al., 2019, and the official implementation on GitHub
- JoCoR loss function and training process adaptations based on Wei et al., 2020, and the official implementation on GitHub
- DivideMix structure based on Li et al., 2020 (GitHub)
  - As DivideMix was originally implemented for image data, we adjusted the MixMatch and Mixup parts for sequential data, based on Guo et al., 2019 (a minimal sketch of this embedding-level Mixup appears after this list)
  - This part is implemented directly in the respective forward pass in the notebooks and thus cannot be found in the DivideMix section
- Based around Pcleavage, Bhasin and Raghava, 2005
- Based around Tenzer et al., 2005
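Because raw amino-acid tokens cannot be linearly interpolated the way image pixels can, the Mixup step is applied to embedded sequences, following Guo et al., 2019. Below is a minimal sketch of that idea; the function name, shapes, and `alpha` value are illustrative assumptions, not the exact code from the notebooks' forward passes.

```python
import torch

def mixup_embeddings(emb_a, emb_b, labels_a, labels_b, alpha=4.0):
    """Mixup on sequence embeddings rather than raw tokens (cf. Guo et al., 2019).

    emb_*: (batch, seq_len, embed_dim) embedded sequences of equal length;
    labels_*: (batch, num_classes) label (or guessed-label) distributions.
    alpha parameterizes the Beta distribution; the value here is illustrative.
    """
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    lam = max(lam, 1.0 - lam)  # keep the mix closer to the first input (MixMatch convention)
    mixed_emb = lam * emb_a + (1.0 - lam) * emb_b
    mixed_labels = lam * labels_a + (1.0 - lam) * labels_b
    return mixed_emb, mixed_labels
```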