This repository contains the code and dataset for the paper *Proteasomal cleavage prediction: state-of-the-art and future directions*.
Epitope vaccines are a promising approach for precision treatment of pathogens, cancer, autoimmune diseases, and allergies. Effectively designing such vaccines requires accurate proteasomal cleavage prediction to ensure that the epitopes included in the vaccine trigger an immune response. The performance of proteasomal cleavage predictors has been steadily improving over the past decades owing to increasing data availability and methodological advances. In this review, we summarize the current proteasomal cleavage prediction landscape and, in light of recent progress in the field of deep learning, develop and compare a wide range of recent architectures and techniques, including long short-term memory (LSTM), transformers, and convolutional neural networks (CNN), as well as four different denoising techniques. All open-source cleavage predictors re-trained on our dataset performed within two AUC percentage points. Our comprehensive deep learning architecture benchmark improved performance by 1.7 AUC percentage points, while closed-source predictors performed considerably worse. We found that a wide range of architectures and training regimes all result in very similar performance, suggesting that the specific modeling approach employed has a limited impact on predictive performance compared to the specifics of the dataset employed. We speculate that the noise and implicit nature of data acquisition techniques used for training proteasomal cleavage prediction models and the complexity of biological processes of the antigen processing pathway are the major limiting factors. While biological complexity can be tackled by more data and, to a lesser extent, better models, noise and randomness inherently limit the maximum achievable predictive performance.
- `data/` holds the `.csv` and `.tsv` train, evaluation, and test files, as well as a vocabulary file
- `params/` holds vocabulary and tokenization merges files
- `code/` contains all of the preparation, training, evaluation, and runtime configuration files
  - `run_configs/` contains all possible argparse configs to execute training
  - `args.py` holds all argparse options
  - `denoise.py` implements the tested denoising methods
  - `loaders.py` defines the dataloaders for all subsequent training architectures
  - `models.py` implements all tested model architectures
  - `prep_dataset.py` shows how we split and prepared the raw data
  - `processors.py` implements the training loops for all architecture and denoising variants
  - `run_train.py` is the overall training script that takes the argparse options and executes training and evaluation
  - `train_tokenizers.py` is used to create the vocab and merges files under `params/` (see the tokenizer sketch after this list)
  - `utils.py` features utility functions such as masking
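As a rough illustration of how vocab and merges files like those under `params/` can be produced, here is a minimal sketch using the Hugging Face `tokenizers` library. It is illustrative only and not necessarily the exact procedure in `train_tokenizers.py`; the example corpus is a placeholder, and only the vocabulary size 1000 is taken from the tokenizer variants listed below.

```python
# Illustrative sketch only (not the exact procedure in train_tokenizers.py):
# train a byte-level BPE tokenizer and save vocab/merges files like those under params/.
# The example corpus is a placeholder.
import os
from tokenizers import ByteLevelBPETokenizer

corpus = ["MKTAYIAKQRQISFVK", "GLSDGEWQLVLNVWGK"]  # stand-in peptide sequences

tokenizer = ByteLevelBPETokenizer()
tokenizer.train_from_iterator(corpus, vocab_size=1000, min_frequency=1)

os.makedirs("params", exist_ok=True)
tokenizer.save_model("params")  # writes params/vocab.json and params/merges.txt
```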
- All config files are named as follows: the applicable terminal, i.e. `c` or `n`, followed by the model architecture, e.g. `bilstm`, followed by the denoising method, e.g. `coteaching`
- Example: `c_bilstm_coteaching.cfg`
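As a hypothetical illustration of how such a config maps onto argparse options (the actual option names and parsing logic live in `args.py` and `run_train.py` and may differ), a setup could look like this:

```python
# Hypothetical sketch of argparse options matching the config naming scheme.
# Option names are illustrative; the real definitions are in args.py.
import argparse

parser = argparse.ArgumentParser(fromfile_prefix_chars="@")  # allows "@run_configs/<name>.cfg"
parser.add_argument("--terminal", choices=["c", "n"])   # C- or N-terminal cleavage site
parser.add_argument("--model", default="bilstm")        # e.g. bilstm, cnn, esm2
parser.add_argument("--denoise", default=None)          # e.g. coteaching, jocor, dividemix

# The same options could also be supplied from a file, e.g.:
#   python run_train.py @run_configs/c_bilstm_coteaching.cfg
args = parser.parse_args(["--terminal", "c", "--model", "bilstm", "--denoise", "coteaching"])
print(args)
```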
- BiLSTM, called `bilstm` (see the minimal sketch after this list)
- BiLSTM with Attention, called `bilstm_att`
- BiLSTM with pre-trained Prot2Vec embeddings, called `bilstm_prot2vec`
- Attention-enhanced CNN, called `cnn`
- BiLSTM with ESM2 representations as embeddings, called `bilstm_esm2`
- Fine-tuning of ESM2, called `esm2`
- BiLSTM with T5 representations as embeddings, called `bilstm_t5`
- Base BiLSTM with various trained tokenizers
  - Byte-level byte-pair encoder with vocabulary sizes 1000 and 50000, called `bilstm_bppe1` and `bilstm_bbpe50`
  - WordPiece tokenizer with vocabulary size 50000, called `bilstm_wp50`
- BiLSTM with forward-backward representations as embeddings, called `bilstm_fwbw`
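For readers unfamiliar with the base setup, below is a minimal, self-contained sketch of a bidirectional LSTM classifier of the kind the `bilstm` variant refers to. The vocabulary size, embedding and hidden dimensions, and the single linear head are illustrative assumptions, not the hyperparameters used in the paper.

```python
# Minimal sketch of a BiLSTM cleavage-site classifier, assuming PyTorch.
# Dimensions and vocabulary size are illustrative, not the paper's exact settings.
import torch
import torch.nn as nn

class BiLSTMClassifier(nn.Module):
    def __init__(self, vocab_size=25, embed_dim=64, hidden_dim=128):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden_dim, 1)  # binary: cleaved vs. not cleaved

    def forward(self, tokens):
        embedded = self.embedding(tokens)                  # (batch, seq_len, embed_dim)
        _, (hidden, _) = self.lstm(embedded)               # hidden: (2, batch, hidden_dim)
        final = torch.cat([hidden[0], hidden[1]], dim=-1)  # concat forward/backward states
        return self.classifier(final).squeeze(-1)          # raw logits

logits = BiLSTMClassifier()(torch.randint(1, 25, (8, 10)))  # 8 peptides of length 10
```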
- Co-Teaching, called `coteaching` (see the small-loss selection sketch after this list)
- Co-Teaching+, called `coteaching_plus`
- JoCoR, called `jocor`
- Noise Adaptation Layer, called `nad`
- DivideMix, called `dividemix`
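To make the denoising idea concrete, here is a simplified sketch of the small-loss sample exchange at the core of Co-Teaching (Han et al., 2018). The function name, the fixed forget rate, and the binary cross-entropy loss are assumptions for illustration and do not mirror the exact implementation in `processors.py` or `denoise.py`.

```python
# Simplified sketch of the Co-Teaching small-loss exchange (after Han et al., 2018).
# The actual training loop in the repository may differ; this only shows the core idea.
import torch
import torch.nn.functional as F

def coteaching_step(logits_a, logits_b, labels, forget_rate):
    """Each network selects its small-loss samples to train the *other* network."""
    loss_a = F.binary_cross_entropy_with_logits(logits_a, labels, reduction="none")
    loss_b = F.binary_cross_entropy_with_logits(logits_b, labels, reduction="none")
    num_keep = int((1.0 - forget_rate) * labels.numel())
    keep_a = torch.argsort(loss_a)[:num_keep]   # samples network A finds easy
    keep_b = torch.argsort(loss_b)[:num_keep]   # samples network B finds easy
    # Cross-update: A learns from B's selection and vice versa.
    update_a = F.binary_cross_entropy_with_logits(logits_a[keep_b], labels[keep_b])
    update_b = F.binary_cross_entropy_with_logits(logits_b[keep_a], labels[keep_a])
    return update_a, update_b

# Example with random logits/labels for a batch of 16 samples:
labels = torch.randint(0, 2, (16,)).float()
la, lb = coteaching_step(torch.randn(16), torch.randn(16), labels, forget_rate=0.2)
```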
| Method | C-Terminal AUC (%) | N-Terminal AUC (%) |
|---|---|---|
| PCPS | 51.3 | 50.0 |
| PCM | 64.5 | 52.4 |
| NetChop 3.1 (20S) | 66.1 | 52.7 |
| NetChop 3.1 (C-term) | 81.5 | 51.0 |
| SVM* | 84.8 | 73.2 |
| PCM* | 85.3 | 75.5 |
| Logistic Regression* | 86.2 | 76.2 |
| NetCleave* | 86.9 | 76.4 |
| PUUPL* | 87.2 | 78.0 |
| Pepsickle* | 88.1 | 78.9 |
| Our BiLSTM (6+4) | 89.8 | 80.6 |
| Our BiLSTM (28+28) | 92.8 | 89.4 |
* Method has been re-trained from scratch on our dataset.
For the other methods, we used published pre-trained models (NetChop, PCM) or web-server functionality (PCPS).
- BiLSTM model architecture based on Ozols et al., 2021
- Model architecture based on Liu and Gong, 2019, Github
- Model architecture based on Li et al., 2020, Repository available via download section on Homepage
- Prot2Vec embeddings based on Asgari and Mofrad, 2015, available on Github
- Sequence Encoder model architecture based on Heigold et al., 2016
- Model architecture based on DeepCalpain, Liu et al., 2019 and Terminitor, Yang et al., 2020
- T5 Encoder taken from Elnaggar et al., 2020, Github, Model on Huggingface Hub
- ESM2 taken from Lin et al., 2022, Github
- Noise adaptation layer implementation is based on Goldberger and Ben-Reuven, 2017, and unofficial implementation on Github
- Co-teaching loss function and training process adaptations are based on Han et al., 2018, and official implementation on Github
- Co-teaching+ loss function and training process adaptations are based on Yu et al., 2019, and official implementation on Github
- JoCoR loss function and training process adaptations are based on Wei et al., 2020, and official implementation on Github
- DivideMix structure is based on Li et al., 2020, Github
- As DivideMix was originally implemented for image data, we adjusted the MixMatch and Mixup part for sequential data, based on Guo et al., 2019 (see the sketch below)
- This part is directly implemented in the respective forward pass in the notebooks, and thus cannot be found in the DivideMix section
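The mixup adaptation referenced above can be pictured roughly as interpolating embedded sequences and their soft labels. The snippet below is a hedged sketch in that spirit; the function name, tensor shapes, and the Beta parameter are illustrative assumptions, not the repository's actual forward-pass code.

```python
# Hedged sketch of mixup on embedded sequences (in the spirit of Guo et al., 2019),
# as used to adapt DivideMix's MixMatch/mixup step to sequential inputs; simplified.
import torch

def embedding_mixup(embedded_a, embedded_b, labels_a, labels_b, alpha=0.75):
    """Linearly interpolate two batches of embedded sequences and their labels."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    lam = max(lam, 1.0 - lam)  # keep the mix closer to the first batch (MixMatch style)
    mixed_x = lam * embedded_a + (1.0 - lam) * embedded_b
    mixed_y = lam * labels_a + (1.0 - lam) * labels_b
    return mixed_x, mixed_y

x1, x2 = torch.randn(8, 10, 64), torch.randn(8, 10, 64)  # (batch, seq_len, embed_dim)
y1, y2 = torch.rand(8), torch.rand(8)                    # soft labels in [0, 1]
mx, my = embedding_mixup(x1, x2, y1, y2)
```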
- Based around Pcleavage, Bhasin and Raghava, 2005
- Based around Tenzer et al., 2005